diff --git a/.claude/skills/add-jit-kernel/SKILL.md b/.claude/skills/add-jit-kernel/SKILL.md new file mode 100644 index 000000000000..e63a7d77b549 --- /dev/null +++ b/.claude/skills/add-jit-kernel/SKILL.md @@ -0,0 +1,630 @@ +--- +name: add-jit-kernel +description: Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module +--- + +# Tutorial: Adding a New JIT Kernel to SGLang + +This tutorial walks through adding a simple element-wise scale operation as a JIT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow. + +## Goal + +Add a new operation that scales each element of a tensor by a scalar factor: + +- Input: tensor `x` (CUDA) and scalar `factor` (float, passed at runtime) +- Output: `x * factor` (element-wise), allocated internally +- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)** + +## When to use JIT vs AOT (`sgl-kernel`) + +- **JIT (`jit_kernel`)**: prefer this first for kernels that do **not** depend on CUTLASS or another large C++ project. It is the default choice for lightweight kernels that benefit from rapid iteration and first-use compilation. +- **AOT (`sgl-kernel`)**: prefer this when the kernel **does** depend on CUTLASS or another large C++ project, or when it should live in `sgl-kernel/` and participate in the wheel build / torch op registration flow. +- **Exception**: kernels that depend on `flashinfer`, or on CUTLASS that is already provided through `flashinfer`, can still be implemented as `jit_kernel`. + +--- + +## Common Abstractions in `python/sglang/jit_kernel/include/sgl_kernel/` + +**Always prefer these abstractions over raw CUDA primitives.** They provide safety, readability, and consistency with the rest of the codebase. + +**Important include rule:** for every `#include ` line, add a short trailing comment explaining why that header is included (for example `// For TensorMatcher, SymbolicSize, SymbolicDevice`). 
This matches the current JIT kernel style and keeps include usage self-documenting.
+
+### `utils.h` — Host-side utilities
+
+```cpp
+#include <sgl_kernel/utils.h>
+```
+
+- **`host::RuntimeCheck(cond, args...)`** — Assert a condition at runtime; throws `PanicError` with file/line info on failure. Prefer this over bare `assert`.
+- **`host::Panic(args...)`** — Unconditionally throw a `PanicError` with a descriptive message.
+- **`host::div_ceil(a, b)`** — Integer ceiling division `(a + b - 1) / b`.
+- **`host::irange(n)`** / **`host::irange(start, end)`** — Range views for cleaner loops.
+- **`host::pointer::offset(ptr, offsets...)`** — Byte-safe pointer arithmetic on `void*`. Use this instead of raw casts.
+
+### `utils.cuh` — Device-side utilities + `LaunchKernel`
+
+```cpp
+#include <sgl_kernel/utils.cuh>
+```
+
+- **Type aliases**: `fp16_t`, `bf16_t`, `fp32_t`, `fp8_e4m3_t`, `fp8_e5m2_t` and their packed variants `fp16x2_t`, `bf16x2_t`, `fp32x2_t`, etc.
+- **`SGL_DEVICE`** — Expands to `__forceinline__ __device__`. Use on all device functions.
+- **`device::kWarpThreads`** — Constant `32`.
+- **`device::load_as<T>(ptr, offset)`** / **`device::store_as<T>(ptr, val, offset)`** — Type-safe loads/stores from `void*`.
+- **`device::pointer::offset(ptr, offsets...)`** — Pointer arithmetic on device.
+- **`host::LaunchKernel(grid, block, device_or_stream [, smem])`** — RAII kernel launcher that:
+  - Resolves the CUDA stream from a `DLDevice` via TVM-FFI automatically.
+  - Checks the CUDA error with file/line info after launch via `operator()(kernel, args...)`.
+  - Supports `.enable_pdl(bool)` for PDL (Programmatic Dependent Launch, SM90+).
+- **`host::RuntimeDeviceCheck(cudaError_t)`** — Check a CUDA error; throw on failure.
+
+### `tensor.h` — Tensor validation (`TensorMatcher`, Symbolic types)
+
+```cpp
+#include <sgl_kernel/tensor.h>
+```
+
+This is the **primary validation API** for all kernel launchers. Use it to validate every `tvm::ffi::TensorView` argument.
+
+- **`host::SymbolicSize{"name"}`** — A named symbolic dimension. 
Call `.set_value(n)` to pin it, `.unwrap()` to extract after verification.
+- **`host::SymbolicDType`** — Symbolic dtype. Use `.set_options<Ts...>()` to restrict allowed types.
+- **`host::SymbolicDevice`** — Symbolic device. Use `.set_options<kDLCUDA>()` to restrict to CUDA.
+- **`host::TensorMatcher({dims...})`** — Fluent builder for tensor validation:
+  - `.with_dtype<T>()` — require a specific C++ type (e.g. `fp16_t`)
+  - `.with_dtype<T1, T2, ...>()` — allow a set of types
+  - `.with_device(device_sym)` — require CUDA and bind the checked device to a `SymbolicDevice`
+  - `.with_strides({strides...})` — validate strides (omit to require contiguous)
+  - `.verify(tensor_view)` — execute the check; throws `PanicError` with full context on failure; **chainable** (`verify(a).verify(b)` to check multiple tensors with the same shape)
+
+**Typical pattern:**
+```cpp
+auto N = SymbolicSize{"num_elements"};
+auto device = SymbolicDevice{};
+device.set_options<kDLCUDA>();
+TensorMatcher({N})  //
+    .with_dtype<fp16_t>()
+    .with_device(device)
+    .verify(dst)
+    .verify(src);  // same shape, dtype, device as dst
+const size_t n = N.unwrap();
+const DLDevice dev = device.unwrap();
+```
+
+### `type.cuh` — `dtype_trait` and `packed_t`
+
+```cpp
+#include <sgl_kernel/type.cuh>
+```
+
+- **`dtype_trait<T>`** — Static trait struct for each scalar type. Provides:
+  - `dtype_trait<T>::from(value)` — convert from another type (e.g. `fp32_t` → `fp16_t`)
+  - `dtype_trait<T>::abs/sqrt/rsqrt/exp/sin/cos(x)` — type-dispatched unary math (primarily for `fp32_t`)
+  - `dtype_trait<T>::max/min(x, y)` — type-dispatched binary math (primarily for `fp32_t`)
+- **`packed_t<T>`** — Two-element packed alias: `packed_t<fp16_t>` = `fp16x2_t`, `packed_t<bf16_t>` = `bf16x2_t`, `packed_t<fp32_t>` = `fp32x2_t`. Use for vectorized loads/stores.
+- **`device::cast<To>(value)`** — Type-safe cast using `dtype_trait`, e.g. `cast<fp16_t>(v)`.
+
+### `vec.cuh` — Vectorized memory access (`AlignedVector`)
+
+```cpp
+#include <sgl_kernel/vec.cuh>
+```
+
+- **`device::AlignedVector<T, N>`** — Aligned storage for N elements of type T. 
N must be a power of two, `sizeof(T)*N <= 32`. Enables vectorized loads/stores for bandwidth efficiency. In terms of API/codegen constraints, the upper bound is 256-bit; in practice, 128-bit is the portable default, while 256-bit vectorization is typically only viable on `SM100+` and should be gated by an architecture check when needed.
+  - `.load(ptr, offset)` — vectorized load from `ptr[offset]`
+  - `.store(ptr, offset)` — vectorized store to `ptr[offset]`
+  - `.fill(value)` — fill all lanes
+  - `operator[](i)` — element access
+
+### `tile.cuh` — `tile::Memory` (strided memory access pattern)
+
+```cpp
+#include <sgl_kernel/tile.cuh>
+```
+
+- `tile::Memory` is fundamentally a **1D cooperative accessor** over a contiguous region.
+- **`device::tile::Memory<T>::cta(blockDim.x)`** — Creates a tile accessor where each thread handles `tid = threadIdx.x` with stride `tsize` (for `cta(blockDim.x)`, this is `blockDim.x`). Common for loops over a 1D array.
+- **`.load(ptr, offset)`** — loads `ptr[tid + offset * tsize]`
+- **`.store(ptr, val, offset)`** — stores to `ptr[tid + offset * tsize]`
+- **`.in_bound(n, offset)`** — boundary check
+
+For a **2D tile**, either flatten `(row, col)` into a linear tile index first, or compute the address manually with `ptr[row * stride + col]` using your thread/block coordinates.
+
+### `math.cuh` — Device math (`device::math::`)
+
+```cpp
+#include <sgl_kernel/math.cuh>
+```
+
+- `device::math::max/min(a, b)` — type-dispatched binary math via `dtype_trait`
+- `device::math::abs/sqrt/rsqrt/exp/sin/cos(x)` — type-dispatched unary math via `dtype_trait`
+
+### `warp.cuh` — Warp-level primitives
+
+```cpp
+#include <sgl_kernel/warp.cuh>
+```
+
+- `device::warp::reduce_sum(value)` — warp-level sum reduction via `__shfl_xor_sync`
+- `device::warp::reduce_max(value)` — warp-level max reduction
+
+### `cta.cuh` — CTA-level primitives
+
+```cpp
+#include <sgl_kernel/cta.cuh>
+```
+
+- `device::cta::reduce_max(value, smem, min_value)` — CTA-wide max using shared memory + warp reduction. 
The caller is responsible for issuing a `__syncthreads()` afterwards if the result in `smem[0]` is needed.
+
+### `atomic.cuh` — Atomic operations
+
+```cpp
+#include <sgl_kernel/atomic.cuh>
+```
+
+- `device::atomic::max(float* addr, float value)` — float atomic max (handles negative values correctly via bit tricks).
+
+### `runtime.cuh` — Occupancy and device info
+
+```cpp
+#include <sgl_kernel/runtime.cuh>
+```
+
+- `host::runtime::get_blocks_per_sm(kernel, block_dim)` — max active blocks per SM (occupancy)
+- `host::runtime::get_sm_count(device_id)` — number of SMs on the device
+- `host::runtime::get_cc_major(device_id)` — compute capability major version
+
+**Persistent kernel pattern** (cap blocks to SM count × occupancy):
+```cpp
+static const uint32_t max_occ = runtime::get_blocks_per_sm(kernel, kBlockSize);
+static const uint32_t num_sm = runtime::get_sm_count(device.unwrap().device_id);
+const auto num_blocks = std::min(num_sm * max_occ, div_ceil(n, kBlockSize));
+LaunchKernel(num_blocks, kBlockSize, device.unwrap())(kernel, params);
+```
+
+---
+
+## Step 0 (optional): Generate a `.clangd` config for better IDE support
+
+```bash
+python -m sglang.jit_kernel -h  # for verbose help info about clangd configuration
+python -m sglang.jit_kernel
+python -m sglang.jit_kernel --dep cutlass flashinfer  # with cutlass/flashinfer dependency
+```
+
+---
+
+## Step 1: Implement the CUDA kernel in `jit_kernel/csrc/`
+
+Create `python/sglang/jit_kernel/csrc/elementwise/scale.cuh`.
+
+The implementation fully uses the project abstractions described above:
+
+```cpp
+// NOTE: Comments for headers are not common in practice.
+// They are only shown here for tutorial purposes to highlight the key abstractions.
+#include <sgl_kernel/tensor.h>   // For TensorMatcher, SymbolicSize, SymbolicDevice
+#include <sgl_kernel/type.cuh>   // For dtype_trait, fp16_t, bf16_t, fp32_t
+#include <sgl_kernel/utils.h>    // For RuntimeCheck, div_ceil
+#include <sgl_kernel/utils.cuh>  // For LaunchKernel, SGL_DEVICE
+#include <sgl_kernel/vec.cuh>    // For AlignedVector
+
+#include <dlpack/dlpack.h>
+#include <tvm/ffi/container/tensor.h>
+
+namespace {
+
+// ----------------------------------------------------------------
+// Kernel: element-wise scale using vectorized 128-bit loads/stores
+// T      = fp16_t | bf16_t | fp32_t
+// kVecN  = number of elements per vector load (e.g. 8 for fp16)
+// factor = runtime scale factor
+// ----------------------------------------------------------------
+template <typename T, int kVecN, bool kUsePDL>
+__global__ void scale_kernel(T* __restrict__ dst,
+                             const T* __restrict__ src,
+                             float factor,
+                             uint32_t n_total) {
+  using vec_t = device::AlignedVector<T, kVecN>;
+  const uint32_t n_vecs = n_total / kVecN;
+
+  // If using PDL, wait for primary kernel before any global memory load.
+  // This is NOT a synchronization point, which means some threads can early exit before this.
+  device::PDLWaitPrimary<kUsePDL>();
+
+  // --- vectorised body ---
+  const uint32_t vec_stride = blockDim.x * gridDim.x;
+  for (uint32_t vi = blockIdx.x * blockDim.x + threadIdx.x;
+       vi < n_vecs;
+       vi += vec_stride) {
+    vec_t v;
+    v.load(src, vi);
+#pragma unroll
+    for (int i = 0; i < kVecN; ++i) {
+      v[i] = static_cast<T>(static_cast<float>(v[i]) * factor);
+    }
+    v.store(dst, vi);
+  }
+
+  // --- scalar tail ---
+  const uint32_t base = n_vecs * kVecN;
+  const uint32_t scalar_stride = blockDim.x * gridDim.x;
+  for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
+       base + i < n_total;
+       i += scalar_stride) {
+    dst[base + i] = static_cast<T>(static_cast<float>(src[base + i]) * factor);
+  }
+
+  // If using PDL, signal for the secondary kernel to start after all threads have finished
+  // This is NOT a synchronization point, which means some threads can early exit before this.
+  device::PDLTriggerSecondary<kUsePDL>();
+}
+
+// ----------------------------------------------------------------
+// Launcher: validates tensors, selects vector width, launches kernel
+// ----------------------------------------------------------------
+template <typename T, bool kUsePDL>
+void scale(tvm::ffi::TensorView dst, tvm::ffi::TensorView src, float factor) {
+  using namespace host;
+
+  // 1. Validate input tensors with TensorMatcher
+  SymbolicSize N = {"num_elements"};
+  SymbolicDevice device_;
+  device_.set_options<kDLCUDA>();
+
+  TensorMatcher({N})  //
+      .with_dtype<T>()
+      .with_device(device_)
+      .verify(dst)
+      .verify(src);  // same shape / dtype / device as dst
+
+  const uint32_t n = static_cast<uint32_t>(N.unwrap());
+  const DLDevice device = device_.unwrap();
+
+  RuntimeCheck(n > 0, "scale: num_elements must be > 0, got ", n);
+
+  // 2. Choose vector width for 128-bit loads (16 bytes)
+  //    fp16/bf16: 8 elements x 2 bytes = 16 bytes
+  //    fp32:      4 elements x 4 bytes = 16 bytes
+  // We encourage using `device::kMaxVecBytes`, which will change according to
+  // the target architecture and can enable 256-bit vectorization on SM100+ if desired.
+  // But 128-bit is more commonly adopted for better compatibility,
+  // so it's still ok to hardcode 16 here just for simplicity.
+  constexpr int kVecN = 16 / sizeof(T);
+  const uint32_t n_work_items = div_ceil(n, static_cast<uint32_t>(kVecN));
+
+  // 3. Launch
+  constexpr uint32_t kBlockSize = 256;
+  const uint32_t grid = div_ceil(n_work_items, kBlockSize);
+
+  // PDL feature is 100% optional. Without `enable_pdl`, the code should still be correct.
+  // Try to enable it if profiling shows that it can benefit the performance of this kernel.
+  LaunchKernel(grid, kBlockSize, device).enable_pdl(kUsePDL)(
+      scale_kernel<T, kVecN, kUsePDL>,
+      static_cast<T*>(dst.data_ptr()),
+      static_cast<const T*>(src.data_ptr()),
+      factor,
+      n);
+}
+
+}  // namespace
+```
+
+**Key points:**
+
+- Include headers from `sgl_kernel/` — **not** raw CUDA headers for anything already covered
+- Add a short trailing `// For ...` explanation to every `#include <sgl_kernel/...>` line
+- Use `TensorMatcher` for all tensor validation; never manually check shape/dtype/device
+- Use `AlignedVector` for vectorised 128-bit loads/stores — significant bandwidth win
+- Use `LaunchKernel` — it resolves the stream and checks errors automatically
+- Use `RuntimeCheck` for runtime assertions with useful error messages
+- Prefer passing runtime scalars like `factor` directly unless compile-time specialisation is genuinely required
+- `fp16_t` / `bf16_t` / `fp32_t` are the project's type aliases (from `utils.cuh`)
+- Use `device::cast<To>(value)` or `dtype_trait<T>::from(val)` for cross-type conversions
+- Prefer `device::math::` functions over bare `__` intrinsics where possible
+- Consider enabling PDL; in some cases it improves performance
+
+---
+
+## Step 2: Add the Python wrapper in `jit_kernel/`
+
+Create `python/sglang/jit_kernel/scale.py`:
+
+```python
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+import torch
+
+from sglang.jit_kernel.utils import (
+    cache_once,
+    is_arch_support_pdl,
+    load_jit,
+    make_cpp_args,
+)
+
+if TYPE_CHECKING:
+    from tvm_ffi.module import Module
+
+
+@cache_once
+def _jit_scale_module(dtype: torch.dtype) -> Module:
+    """Compile and cache the JIT scale module for a given dtype."""
+    args = make_cpp_args(dtype, is_arch_support_pdl())
+    return load_jit(
+        "scale",
+        *args,
+        cuda_files=["elementwise/scale.cuh"],
+        cuda_wrappers=[("scale", f"scale<{args}>")],
+    )
+
+
+def scale(src: torch.Tensor, factor: float, out: torch.Tensor | None = None) -> torch.Tensor:
+    """
+    Element-wise scale: dst = src * factor.
+ + Supported dtypes: torch.float16, torch.bfloat16, torch.float32. + + Parameters + ---------- + src : CUDA tensor (FP16 / BF16 / FP32) + factor : scale factor + out : optional pre-allocated output tensor (same shape/dtype as src) + + Returns + ------- + Scaled tensor (dst = src * factor). + """ + # DO NOT add too much proactive validation here. + # Keep the Python wrapper thin, only enforce the preconditions + # that the current JIT/FFI path (C++ side) does not reject on its own. + if src.dtype not in (torch.float16, torch.bfloat16, torch.float32): + raise RuntimeError( + f"Unsupported dtype {src.dtype}. Supported: float16, bfloat16, float32" + ) + if out is None: + out = torch.empty_like(src) + + module = _jit_scale_module(src.dtype) + module.scale(out, src, factor) + return out +``` + +**Key points:** + +- Use `cache_once` — **not** `functools.lru_cache` (incompatible with `torch.compile`) +- `load_jit` first arg(s) form the unique build marker; same marker = same cached binary +- Only include compile-time specialisation knobs in the build marker; runtime values like `factor` should stay runtime unless the kernel truly needs templating +- `cuda_wrappers`: `(export_name, kernel_symbol)` — `export_name` is called from Python +- `make_cpp_args(dtype, ...)` converts `torch.dtype` to C++ type alias: +- `is_arch_support_pdl()` checks if the current architecture supports PDL, which is typically passed as a template argument to the kernel. +- Keep Python launchers thin, but still validate the basic invariants (`is_cuda`, supported dtype, `out` metadata). 
In the current JIT/FFI path, invalid tensors are not always rejected safely before launch + +| `torch.dtype` | C++ type | +|--------------------|------------| +| `torch.float16` | `fp16_t` | +| `torch.bfloat16` | `bf16_t` | +| `torch.float32` | `fp32_t` | + +--- + +## Step 3 (optional): Tune JIT build flags + +If your kernel uses some math functions like `expf` or `sinf`, consider enabling `--use_fast_math` for better performance (with a potential precision tradeoff): + +```python +return load_jit( + "scale", + *args, + cuda_files=["elementwise/scale.cuh"], + cuda_wrappers=[("scale", f"scale<{args}>")], + extra_cuda_cflags=["-O3", "--use_fast_math"], +) +``` + +If your kernel requires SM90+, raise a clear Python error before calling `load_jit`: + +```python +if torch.cuda.get_device_capability()[0] < 9: + raise RuntimeError("This kernel requires SM90 (Hopper) or later") +``` + +--- + +## Step 4: Write tests (required) + +JIT kernel tests live under `python/sglang/jit_kernel/tests/`. **CI does not run `pytest` in that directory directly.** The unified runner `test/run_suite.py` discovers every `test_*.py` there (and every `bench_*.py` under `benchmark/`), collects `register_*_ci(...)` calls by **statically parsing each file's AST**, and executes the selected suite. Every test file must register at least one CUDA entry or the collector fails its sanity check. + +- **PR / per-commit CUDA suites** (see `test/run_suite.py` → `PER_COMMIT_SUITES`): JIT unit tests use `stage-b-kernel-unit-1-gpu-large` on H100 and `stage-b-kernel-unit-1-gpu-b200` on B200/SM100 paths (see `.github/workflows/pr-test-jit-kernel.yml`). Multi-GPU JIT tests use `stage-b-kernel-unit-8-gpu-h200`. +- **Nightly kernel suite**: `nightly-kernel-1-gpu` with `--nightly` — typically used with `SGLANG_JIT_KERNEL_RUN_FULL_TESTS=1` in CI for expanded parameter grids (see `python/sglang/jit_kernel/utils.py` → `should_run_full_tests` / `get_ci_test_range`). 
Wired in `.github/workflows/nightly-test-nvidia.yml` (e.g. `python3 run_suite.py --hw cuda --suite nightly-kernel-1-gpu --nightly --continue-on-error`). + +Registration pattern (module level, **literal** `est_time` and `suite` strings — required for AST parsing): + +```python +from sglang.test.ci.ci_register import register_cuda_ci + +register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large") +# Optional B200/SM100 registration for tests that cover Blackwell-specific code paths +# register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-b200") +# Optional second registration: same file also listed under the nightly kernel suite +# register_cuda_ci(est_time=120, suite="nightly-kernel-1-gpu", nightly=True) +``` + +Keep `est_time` and `suite` as literal values. `run_suite.py` collects them from the file AST, so computed values and helper wrappers can break CI discovery. + +Use `register_cuda_ci(..., disabled="reason")` if the file must stay in-tree but should be skipped in CI (e.g. multi-GPU only). + +**Run like CI** (from repo root): + +```bash +(cd test && python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-large) +# For B200/SM100-specific coverage: +(cd test && python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-b200) +``` + +For fast iteration you can still run `pytest` on a single file locally; CI coverage is via `run_suite.py`. 
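To make the "literal values only" rule concrete, the following is a minimal sketch of AST-based collection using only the Python standard library. It is not the actual `run_suite.py` or `ci_register.py` implementation; the helper name `collect_registrations` and the sample strings `GOOD` and `BAD` are hypothetical and exist only for illustration. It shows why a computed value such as a module-level constant cannot be resolved without importing the file.

```python
import ast


def collect_registrations(source: str, func_name: str = "register_cuda_ci") -> list[dict]:
    """Collect keyword arguments of module-level `func_name(...)` calls.

    Hypothetical illustration: the file is parsed, never imported or executed.
    """
    found = []
    for node in ast.parse(source).body:
        # Only bare module-level expression statements such as `register_cuda_ci(...)`.
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            continue
        call = node.value
        if not (isinstance(call.func, ast.Name) and call.func.id == func_name):
            continue
        kwargs = {}
        for kw in call.keywords:
            if kw.arg is None:
                raise ValueError("**kwargs cannot be resolved statically")
            try:
                # literal_eval accepts literals only: numbers, strings, True/False, ...
                kwargs[kw.arg] = ast.literal_eval(kw.value)
            except ValueError as exc:
                raise ValueError(f"{kw.arg!r} must be a literal value") from exc
        found.append(kwargs)
    return found


GOOD = 'register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")\n'
BAD = 'SUITE = "stage-b-kernel-unit-1-gpu-large"\nregister_cuda_ci(est_time=30, suite=SUITE)\n'

print(collect_registrations(GOOD))
```

Calling `collect_registrations(BAD)` raises `ValueError` in this sketch, because `suite=SUITE` is a name lookup rather than a literal. A static collector of this kind has no way to evaluate it, which is the likely reason the tutorial asks for literal `est_time` and `suite` values.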
+ +Create `python/sglang/jit_kernel/tests/test_scale.py`: + +```python +import pytest +import torch +from sglang.jit_kernel.scale import scale +from sglang.test.ci.ci_register import register_cuda_ci + +register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large") + + +@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32]) +@pytest.mark.parametrize("size", [1, 127, 128, 1024, 4097]) # cover tail remainder +@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0, 3.0]) +def test_scale_correctness(dtype, size, factor): + src = torch.randn(size, dtype=dtype, device="cuda") + out = scale(src, factor) + expected = src * factor + + rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2) + torch.testing.assert_close(out, expected, rtol=rtol, atol=atol) + + +@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32]) +def test_scale_out_param(dtype): + src = torch.randn(1024, dtype=dtype, device="cuda") + out = torch.empty_like(src) + result = scale(src, 2.0, out=out) + assert result is out + torch.testing.assert_close(out, src * 2.0, rtol=1e-2, atol=1e-2) + + +def test_scale_cpu_error(): + src = torch.randn(128, dtype=torch.float16) # CPU tensor + with pytest.raises(RuntimeError, match="CUDA"): + scale(src, 2.0) + + +def test_scale_unsupported_dtype(): + src = torch.randint(0, 10, (128,), dtype=torch.int32, device="cuda") + with pytest.raises(RuntimeError, match="dtype"): + scale(src, 2.0) + + +if __name__ == "__main__": + import sys + sys.exit(pytest.main([__file__, "-v", "-s"])) +``` + +--- + +## Step 5: Add a benchmark (required) + +Benchmarks are `bench_*.py` files under `python/sglang/jit_kernel/benchmark/`. They are picked up by the same `run_suite.py` machinery as unit tests. Register them for **`stage-b-kernel-benchmark-1-gpu-large`** (PR JIT benchmark job: `python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large`). 
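An element-wise scale is likely limited by memory traffic rather than arithmetic, so latency numbers are often easier to compare across sizes and dtypes after converting them to achieved bandwidth. The helper below is a small sketch of that arithmetic; `effective_bandwidth_gbps` is a hypothetical name and is not part of `sglang.jit_kernel.benchmark.utils`. It assumes an out-of-place op that reads each input element once and writes each output element once.

```python
def effective_bandwidth_gbps(num_elements: int, bytes_per_element: int, time_us: float) -> float:
    """Approximate achieved bandwidth (GB/s) for an out-of-place element-wise op."""
    # Assumption: one read of `src` plus one write of `out` per element.
    bytes_moved = 2 * num_elements * bytes_per_element
    seconds = time_us * 1e-6
    return bytes_moved / seconds / 1e9


# Example: 65536 fp16 elements (2 bytes each) processed in 4.0 microseconds.
print(f"{effective_bandwidth_gbps(65536, 2, 4.0):.3f} GB/s")
```

For very small tensors the measured time is usually dominated by launch overhead, so the derived bandwidth says little about the kernel itself; the conversion is most informative at the larger end of the size range.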
+ +Create `python/sglang/jit_kernel/benchmark/bench_scale.py`: + +```python +import itertools + +import torch +import triton +import triton.testing + +from sglang.jit_kernel.benchmark.utils import ( + DEFAULT_DEVICE, + DEFAULT_DTYPE, + get_benchmark_range, + run_benchmark, +) +from sglang.jit_kernel.scale import scale as jit_scale +from sglang.test.ci.ci_register import register_cuda_ci + +register_cuda_ci(est_time=6, suite="stage-b-kernel-benchmark-1-gpu-large") + +SIZE_LIST = get_benchmark_range( + full_range=[2**n for n in range(10, 20)], # 1K … 512K elements + ci_range=[4096, 65536], +) + +configs = list(itertools.product(SIZE_LIST)) + + +@triton.testing.perf_report( + triton.testing.Benchmark( + x_names=["size"], + x_vals=configs, + line_arg="provider", + line_vals=["jit", "torch"], + line_names=["SGL JIT Kernel", "PyTorch"], + styles=[("blue", "-"), ("red", "--")], + ylabel="us", + plot_name="scale-performance", + args={}, + ) +) +def benchmark(size: int, provider: str): + src = torch.randn(size, dtype=DEFAULT_DTYPE, device=DEFAULT_DEVICE) + factor = 2.0 + + if provider == "jit": + fn = lambda: jit_scale(src, factor) + else: + fn = lambda: src * factor + + return run_benchmark(fn) + + +if __name__ == "__main__": + benchmark.run(print_data=True) +``` + +Run locally: + +```bash +python python/sglang/jit_kernel/benchmark/bench_scale.py +``` + +Run the benchmark suite the way CI does: + +```bash +cd test && python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large +``` + +--- + +## Troubleshooting + +- **`No CI registry found in ...` from `run_suite.py`**: add a module-level `register_cuda_ci(...)` with literal `est_time` and `suite` (and optional `nightly=True`); starred args and non-literal values break AST collection +- **JIT compilation fails**: ensure the `.cuh` file is under `python/sglang/jit_kernel/csrc/`; reduce template argument combinations +- **CUDA crash / illegal memory access**: `CUDA_LAUNCH_BLOCKING=1`; `compute-sanitizer --tool 
memcheck python ...` +- **Unstable benchmark results**: `run_benchmark` uses CUDA-graph-based timing by default + +--- + +## References + +- `docs/developer_guide/development_jit_kernel_guide.md` +- `test/run_suite.py` — suite names, discovery of `jit_kernel/tests/` and `jit_kernel/benchmark/`, execution entrypoint for CI +- `python/sglang/test/ci/ci_register.py` — `register_cuda_ci` and AST registration rules +- `python/sglang/jit_kernel/utils.py` — `cache_once`, `load_jit`, `make_cpp_args`, `should_run_full_tests`, `get_ci_test_range` +- `python/sglang/jit_kernel/include/sgl_kernel/tensor.h` — `TensorMatcher`, `SymbolicSize/DType/Device` +- `python/sglang/jit_kernel/include/sgl_kernel/utils.cuh` — type aliases, `LaunchKernel`, `SGL_DEVICE` +- `python/sglang/jit_kernel/include/sgl_kernel/vec.cuh` — `AlignedVector` +- `python/sglang/jit_kernel/include/sgl_kernel/tile.cuh` — `tile::Memory` +- `python/sglang/jit_kernel/include/sgl_kernel/type.cuh` — `dtype_trait`, `packed_t`, `device::cast` +- `python/sglang/jit_kernel/include/sgl_kernel/math.cuh` — `device::math::` +- `python/sglang/jit_kernel/include/sgl_kernel/warp.cuh` — `warp::reduce_sum/max` +- `python/sglang/jit_kernel/include/sgl_kernel/cta.cuh` — `cta::reduce_max` +- `python/sglang/jit_kernel/include/sgl_kernel/atomic.cuh` — `atomic::max` +- `python/sglang/jit_kernel/include/sgl_kernel/runtime.cuh` — occupancy / SM count helpers +- `python/sglang/jit_kernel/csrc/add_constant.cuh` — minimal runnable reference +- `python/sglang/jit_kernel/csrc/elementwise/rmsnorm.cuh` — real example using `TensorMatcher` + `LaunchKernel` + `tile::Memory` +- `python/sglang/jit_kernel/csrc/elementwise/qknorm.cuh` — real example using `runtime::get_blocks_per_sm` + persistent kernel pattern +- `python/sglang/jit_kernel/benchmark/utils.py` — benchmark helpers + +## Summary of Files Created + +``` +python/sglang/jit_kernel/csrc/elementwise/scale.cuh # NEW: CUDA kernel +python/sglang/jit_kernel/scale.py # NEW: Python wrapper 
+python/sglang/jit_kernel/tests/test_scale.py # NEW: Tests +python/sglang/jit_kernel/benchmark/bench_scale.py # NEW: Benchmark +``` diff --git a/.claude/skills/add-sgl-kernel/SKILL.md b/.claude/skills/add-sgl-kernel/SKILL.md new file mode 100644 index 000000000000..559b8751fb8e --- /dev/null +++ b/.claude/skills/add-sgl-kernel/SKILL.md @@ -0,0 +1,367 @@ +--- +name: add-sgl-kernel +description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks) +--- + +# Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight) + +This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow. + +## Goal + +Add a new operation that scales each element of a tensor by a scalar factor: + +- Input: tensor `x` (CUDA) and scalar `factor` (float) +- Output: `x * factor` (element-wise, in-place or into pre-allocated `out`) +- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)** + - Dispatched via `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`) + +## Two rules of thumb (must follow) + +1. **Prefer `python/sglang/jit_kernel` first** when the kernel does **not** depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration. +2. **Prefer `sgl-kernel`** when the kernel **does** depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow. +3. **Exception**: if the dependency is `flashinfer`, or CUTLASS that is already provided through `flashinfer`, the kernel can still be implemented as `jit_kernel`. 
+
+In addition, every new kernel must ship with:
+
+- **Tests** (pytest)
+- **A benchmark script** (triton.testing)
+
+---
+
+## Repository integration map
+
+You will typically touch these files/areas:
+
+- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
+- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
+- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
+- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
+- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
+- Tests: `sgl-kernel/tests/test_scale.py`
+- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`
+
+---
+
+## Step 1: Implement the kernel in `csrc/`
+
+Pick the right subdirectory:
+
+- `csrc/elementwise/` — for element-wise ops (our example)
+- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories
+
+Create `sgl-kernel/csrc/elementwise/scale.cu`:
+
+```cpp
+#include <ATen/cuda/CUDAContext.h>
+#include <c10/cuda/CUDAGuard.h>
+#include <torch/all.h>
+
+#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16
+
+// scale_kernel: out[i] = input[i] * factor
+// Supports float, half (__half), __nv_bfloat16 via template T
+template <typename T>
+__global__ void scale_kernel(T* __restrict__ out,
+                             const T* __restrict__ input,
+                             float factor,
+                             int64_t n) {
+  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
+  if (idx < n) {
+    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
+  }
+}
+
+void scale(at::Tensor& out, const at::Tensor& input, double factor) {
+  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
+  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
+  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
+  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
+  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
+  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
+              "out and input must have the same dtype");
+
+  const int64_t n = input.numel();
+  const int threads = 256;
+  const int blocks = (n + threads - 1) / threads;
+
+  // Make the input's device current before querying its stream.
+  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
+  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+
+  // Dispatches over float, float16, bfloat16
+  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
+    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
+        static_cast<c_type*>(out.data_ptr()),
+        static_cast<const c_type*>(input.data_ptr()),
+        static_cast<float>(factor),
+        n);
+    cudaError_t status = cudaGetLastError();
+    TORCH_CHECK(status == cudaSuccess,
+                "scale_kernel launch failed: ", cudaGetErrorString(status));
+    return true;
+  });
+}
+```
+
+**Key points:**
+
+- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for stream
+- Keep Python wrappers thin; do shape/dtype/device validation in C++ right around the launch path
+- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), `__nv_bfloat16` (BF16)
+- Add device error checking after every kernel launch
+- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and skip logic in tests
+
+---
+
+## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`
+
+Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:
+
+```cpp
+void scale(at::Tensor& out, const at::Tensor& input, double factor);
+```
+
+---
+
+## Step 3: Register the op in `csrc/common_extension.cc`
+
+Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:
+
+```cpp
+// From csrc/elementwise
+m.def("scale(Tensor! 
out, Tensor input, float factor) -> ()"); +m.impl("scale", torch::kCUDA, &scale); +``` + +**Key points:** + +- `Tensor!` means in-place / mutable output argument +- The schema is important for `torch.compile` and for consistent call signatures +- Keep the torch schema in PyTorch scalar types (`float` here), but note that the C++ launcher signature still needs `double` for scalar arguments accepted by `torch::Library` + +--- + +## Step 4: Add the new source file to `CMakeLists.txt` + +Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`: + +```cmake +csrc/elementwise/scale.cu +``` + +**Key points:** + +- Keep the list **alphabetically sorted** (the file explicitly requires this) +- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic + +--- + +## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/` + +Prefer following the existing module organization first. For elementwise kernels, the usual pattern is: + +- implement the Python wrapper in `sgl-kernel/python/sgl_kernel/elementwise.py` +- then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py` + +For example, in `sgl-kernel/python/sgl_kernel/elementwise.py`, add: + +```python +import torch + +def scale( + input: torch.Tensor, + factor: float, + out: torch.Tensor | None = None, +) -> torch.Tensor: + """ + Element-wise scale: out = input * factor. + + Supported dtypes: torch.float16, torch.bfloat16, torch.float32. + + Parameters + ---------- + input : CUDA input tensor + factor : scale factor (float) + out : optional pre-allocated CUDA output tensor (same shape/dtype as input) + """ + if out is None: + out = torch.empty_like(input) + torch.ops.sgl_kernel.scale.default(out, input, factor) + return out +``` + +Then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py` following the existing import style used by other kernels. 
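The `factor` passed through this wrapper is a Python `float`, which is a C `double`, while the kernel parameter is a single-precision `float factor`. The sketch below uses only the standard library to show what that narrowing does to a value; `as_float32` is a hypothetical helper for illustration and is not part of `sgl_kernel`.

```python
import struct


def as_float32(value: float) -> float:
    """Round a Python float (a C double) to the nearest float32 value."""
    # Pack as a 4-byte IEEE-754 single, then unpack back into a Python float.
    return struct.unpack("<f", struct.pack("<f", value))[0]


for factor in (0.5, 2.0, 0.1):
    narrowed = as_float32(factor)
    print(f"{factor!r} -> {narrowed!r} (exact in float32: {narrowed == factor})")
```

Factors such as `0.5` and `2.0` are exactly representable in single precision, whereas a factor like `0.1` picks up a small rounding error when narrowed. This is one reason to compare against a reference with tolerances, as the tests in the next step do, rather than expecting bit-exact equality for arbitrary factors.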
+ +--- + +## Step 6: Write tests (required) + +Create `sgl-kernel/tests/test_scale.py`: +```python +import pytest + +import torch +import sgl_kernel + +@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32]) +@pytest.mark.parametrize("size", [128, 1024, 4096, 65536]) +@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0]) +def test_scale_correctness(dtype, size, factor): + input = torch.randn(size, dtype=dtype, device="cuda") + out = torch.empty_like(input) + + result = sgl_kernel.scale(input, factor, out=out) + assert result is out + + expected = input * factor + rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2) + torch.testing.assert_close(out, expected, rtol=rtol, atol=atol) + + +def test_scale_shape_mismatch(): + input = torch.randn(128, dtype=torch.float16, device="cuda") + out = torch.empty(256, dtype=torch.float16, device="cuda") + with pytest.raises(RuntimeError, match="same shape"): + sgl_kernel.scale(input, 2.0, out=out) + + +def test_scale_cpu_input(): + input = torch.randn(128, dtype=torch.float16) # CPU + out = torch.empty_like(input) + with pytest.raises(RuntimeError, match="CUDA"): + sgl_kernel.scale(input, 2.0, out=out) + + +if __name__ == "__main__": + import sys + sys.exit(pytest.main([__file__, "-q"])) +``` + +--- + +## Step 7: Add a benchmark (required) + +Create `sgl-kernel/benchmark/bench_scale.py`: + +```python +import itertools + +import torch +import triton +import triton.testing + +import sgl_kernel +from sglang.utils import is_in_ci + +IS_CI = is_in_ci() + +dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32] +sizes = [4096] if IS_CI else [2**n for n in range(10, 20)] # 1K … 512K +factors = [2.0] + +configs = list(itertools.product(dtypes, sizes)) + + +def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor: + return input * factor + + +@triton.testing.perf_report( + triton.testing.Benchmark( + x_names=["dtype", "size"], + x_vals=configs, + 
line_arg="provider", + line_vals=["sglang", "torch"], + line_names=["SGL Kernel", "PyTorch"], + styles=[("green", "-"), ("red", "--")], + ylabel="µs (median)", + plot_name="scale-performance", + args={}, + ) +) +def benchmark(dtype, size, provider): + input = torch.randn(size, dtype=dtype, device="cuda") + out = torch.empty_like(input) + factor = 2.0 + + if provider == "sglang": + fn = lambda: sgl_kernel.scale(input, factor, out=out) + else: + fn = lambda: torch_scale(input, factor) + + ms, min_ms, max_ms = triton.testing.do_bench_cudagraph( + fn, quantiles=[0.5, 0.2, 0.8] + ) + return 1000 * ms, 1000 * max_ms, 1000 * min_ms + + +if __name__ == "__main__": + benchmark.run(print_data=True) +``` + +--- + +## Step 8: Build + +Build: + +```bash +cd sgl-kernel +make build -j16 +``` + +If you need to limit host resource usage: + +```bash +cd sgl-kernel +make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1" +``` + +--- + +## Step 9: Validate + +After building successfully, run the test and benchmark: + +```bash +pytest sgl-kernel/tests/test_scale.py -q +python sgl-kernel/benchmark/bench_scale.py +``` + +PR CI also runs `pr-test-sgl-kernel.yml`, including the B200 job +`sgl-kernel-b200-test` when kernel changes are detected. Use that job as the +Blackwell coverage signal for AOT `sgl-kernel` changes. 
+ +--- + +## Troubleshooting + +- **Async CUDA errors**: `CUDA_LAUNCH_BLOCKING=1` +- **Memory errors**: `compute-sanitizer --tool memcheck python ...` +- **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS` +- **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py` +- **CMake sources list**: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time + +--- + +## References + +- `sgl-kernel/README.md` +- `sgl-kernel/include/sgl_kernel_ops.h` +- `sgl-kernel/csrc/common_extension.cc` +- `sgl-kernel/CMakeLists.txt` +- `sgl-kernel/include/utils.h` — `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends +- `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern + +## Summary of Files Created/Modified + +``` +sgl-kernel/csrc/elementwise/scale.cu # NEW: CUDA kernel + launcher +sgl-kernel/include/sgl_kernel_ops.h # MODIFIED: C++ declaration +sgl-kernel/csrc/common_extension.cc # MODIFIED: schema + dispatch registration +sgl-kernel/CMakeLists.txt # MODIFIED: add source file (alphabetical) +sgl-kernel/python/sgl_kernel/elementwise.py # MODIFIED: Python wrapper +sgl-kernel/python/sgl_kernel/__init__.py # MODIFIED: re-export Python API +sgl-kernel/tests/test_scale.py # NEW: tests +sgl-kernel/benchmark/bench_scale.py # NEW: benchmark +``` diff --git a/.claude/skills/ci-workflow-guide/SKILL.md b/.claude/skills/ci-workflow-guide/SKILL.md new file mode 100644 index 000000000000..99885d3ef0d9 --- /dev/null +++ b/.claude/skills/ci-workflow-guide/SKILL.md @@ -0,0 +1,391 @@ +--- +name: ci-workflow-guide +description: Guide to SGLang CI workflow orchestration — stage ordering, fast-fail, gating, partitioning, execution modes, and debugging CI failures. Use when modifying CI workflows, adding stages, debugging CI pipeline issues, or understanding how tests are dispatched and gated across stages. 
+--- + +# SGLang CI Workflow Orchestration Guide + +This skill covers the CI **infrastructure** layer — how tests are dispatched, gated, and fast-failed across stages. For test authoring (templates, fixtures, registration, model selection), see the [write-sglang-test skill](../write-sglang-test/SKILL.md). + +--- + +## Naming Conventions + +- **Suite**: `stage-{a,b,c}-test-{gpu_count}-gpu-{hardware}` (e.g., `stage-b-test-1-gpu-small`) +- **Test group**: Directory-level registered test group under `test/registered/` (e.g., `hicache` maps to `test/registered/hicache/test_*.py`) +- **CI runner**: `{gpu_count}-gpu-{hardware}` (e.g., `1-gpu-5090`, `4-gpu-h100`, `8-gpu-h200`) + +--- + +## Key Files + +| File | Role | +|------|------| +| `.github/workflows/pr-test.yml` | Main workflow — all stages, jobs, conditions, matrix definitions | +| `.github/workflows/pr-gate.yml` | PR gating: draft check, `run-ci` label, per-user rate limiting | +| `.github/actions/check-stage-health/action.yml` | Cross-job fast-fail: queries API for any failed job | +| `.github/actions/wait-for-jobs/action.yml` | Stage gating: polls API until stage jobs complete | +| `.github/actions/check-maintenance/action.yml` | Maintenance mode check | +| `test/run_suite.py` | Suite runner: collects, filters, partitions, executes tests | +| `python/sglang/test/ci/ci_register.py` | Test registration (AST-parsed markers), LPT auto-partition | +| `python/sglang/test/ci/ci_utils.py` | `run_unittest_files()`: execution, retry, continue-on-error | +| `scripts/ci/utils/slash_command_handler.py` | Handles slash commands from PR comments | + +--- + +## Architecture Overview + +``` + ┌──────────────┐ + │ build kernel │ + └──────┬───────┘ + │ + ├─ check-changes ──── detects which packages changed + │ (main_package, sgl_kernel, jit_kernel, multimodal_gen) + │ + ├─ call-gate ──────── pr-gate.yml (draft? label? rate limit?) 
+ │ + ├─────────────────────────────────────────────────────┐ + │ │ + ▼ │ + ┌─────────────────────────────────────┐ │ + │ Stage A (~3 min) │ │ + │ pre-flight check │ │ + │ │ │ + │ ┌─────────────────────────────┐ │ │ + │ │ stage-a-test-1-gpu-small │ │ │ + │ │ (small GPUs) │ │ │ + │ └─────────────────────────────┘ │ │ + │ ┌─────────────────────────────┐ │ │ + │ │ stage-a-test-cpu │ │ │ + │ │ (CPU) │ │ │ + │ └─────────────────────────────┘ │ │ + └──────┬──────────────────────────────┘ │ + │ │ + ▼ ▼ + ┌─────────────────────────────────────┐ ┌──────────────────────────┐ + │ Stage B (~30 min) │ │ kernel test │ + │ basic tests │ └──────────────────────────┘ + │ │ ┌──────────────────────────┐ + │ ┌─────────────────────────────┐ │ │ multimodal gen test │ + │ │ stage-b-test-1-gpu-small │ │ └──────────────────────────┘ + │ │ (small GPUs, e.g. 5090) │ │ + │ └─────────────────────────────┘ │ + │ ┌─────────────────────────────┐ │ + │ │ stage-b-test-1-gpu-large │ │ + │ │ (large GPUs, e.g. H100) │ │ + │ └─────────────────────────────┘ │ + │ ┌─────────────────────────────┐ │ + │ │ stage-b-test-2-gpu-large │ │ + │ │ (large GPUs, e.g. 
H100) │ │ + │ └─────────────────────────────┘ │ + └──────┬──────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────┐ + │ Stage C (~30 min) │ + │ advanced tests │ + │ │ + │ ┌─────────────────────────────┐ │ + │ │ stage-c-test-4-gpu-h100 │ │ + │ │ (H100 GPUs) │ │ + │ └─────────────────────────────┘ │ + │ ┌─────────────────────────────┐ │ + │ │ stage-c-test-8-gpu-h200 │ │ + │ │ (8 x H200 GPUs) │ │ + │ └─────────────────────────────┘ │ + │ ┌─────────────────────────────┐ │ + │ │ stage-c-test-4-gpu-b200 │ │ + │ │ (4 x B200 GPUs) │ │ + │ └─────────────────────────────┘ │ + │ ┌─────────────────────────────┐ │ + │ │ Other advanced tests │ │ + │ │ (DeepEP, PD Disagg, GB300) │ │ + │ └─────────────────────────────┘ │ + └──────┬──────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────┐ + │ pr-test-finish │ + │ aggregates all results, fails if │ + │ any job failed/cancelled │ + └─────────────────────────────────────┘ +``` + +**Every stage test job** includes a `check-stage-health` step after checkout — if any job in the run has already failed, the job fast-fails (red X) with a root cause annotation. + +**Scheduled runs** skip `wait-for-stage-*` jobs, running all stages in parallel. Fast-fail is also disabled. + +--- + +## Fast-Fail Layers + +4 layers of fast-fail, from fine to coarse: + +| Layer | Mechanism | Granularity | Disabled on schedule? | +|-------|-----------|-------------|----------------------| +| **1. Test method → file** | `unittest -f` (failfast) | One test method fails → entire test file stops immediately | Yes | +| **2. File → suite** | `run_unittest_files()` default | One test file fails → entire suite stops (`--continue-on-error` off) | Yes | +| **3. Job → job (same stage)** | `check-stage-health` action | One job fails → other waiting jobs in same stage fast-fail (red X) | Yes | +| **4. 
Stage → stage (cross-stage)** | `wait-for-stage` + `needs` | Stage A fails → stage B/C jobs skip entirely (never get a runner) | Yes (wait jobs skipped) | + +- **Layer 1**: `-f` flag appended to all `python3 -m pytest` / `unittest` invocations in `ci_utils.py` +- **Layer 2**: `--continue-on-error` flag in `run_suite.py` — off for PRs, on for scheduled runs +- **Layer 3**: `check-stage-health` auto-detects `schedule` event and skips; filters out cascade failures to show only root cause jobs +- **Layer 4**: `wait-for-stage-*` jobs are conditioned on `github.event_name == 'pull_request'` — skipped for scheduled runs + +--- + +## Execution Modes + +| Aspect | PR (`pull_request`) | Scheduled (`cron`, every 6h) | `/rerun-stage` (`workflow_dispatch`) | +|--------|---------------------|------------------------------|--------------------------------------| +| **Stage ordering** | Sequential: A → B → C via `wait-for-stage-*` | Parallel (all at once) | Single target stage only | +| **Cross-job fast-fail** | Yes (`check-stage-health`) | No (disabled on `schedule`) | Yes | +| **continue-on-error** | No (stop at first failure within suite) | Yes (run all tests) | No | +| **Retry** | Enabled | Enabled | Enabled | +| **max_parallel** | 3 (default), 14 if `high priority` label | 14 | 3 (default), 14 if `high priority` | +| **PR gate** | Yes (draft, label, rate limit) | Skipped | Skipped | +| **Concurrency** | `cancel-in-progress: true` per branch | Queue (no cancel) | Isolated per stage+SHA | + +--- + +## Stage Gating (`wait-for-jobs` action) + +`wait-for-stage-a` and `wait-for-stage-b` are lightweight `ubuntu-latest` jobs that poll the GitHub Actions API. + +**How it works:** +1. Calls `listJobsForWorkflowRun` to list all jobs in the current run +2. Matches jobs by exact name or prefix (for matrix jobs, e.g., `stage-b-test-1-gpu-small (3)`) +3. If any matched job has `conclusion === 'failure'` → fail immediately (fast-fail) +4. 
If all matched jobs are completed and count matches `expected_count` → success +5. Otherwise → sleep `poll-interval-seconds` (default: 60s) and retry +6. Timeout after `max-wait-minutes` (240 min for stage-a, 480 min for stage-b) + +**Job specs example** (stage-b): +```json +[ + {"prefix": "stage-b-test-1-gpu-small", "expected_count": 8}, + {"prefix": "stage-b-test-1-gpu-large", "expected_count": 14}, + {"prefix": "stage-b-test-2-gpu-large", "expected_count": 4}, + {"prefix": "stage-b-test-4-gpu-b200", "expected_count": 1} +] +``` + +> **Critical**: `expected_count` must match the matrix size. If you add/remove matrix entries, update the wait job's spec accordingly. + +**PR only**: Condition `github.event_name == 'pull_request' && !inputs.target_stage` — scheduled runs and `/rerun-stage` skip these entirely, allowing parallel execution. + +--- + +## Cross-Job Fast-Fail (`check-stage-health` action) + +Composite action called after checkout in every stage test job (21 jobs total across `pr-test.yml`, `pr-test-multimodal-gen.yml`, `pr-test-sgl-kernel.yml`, `pr-test-jit-kernel.yml`). + +**How it works:** +1. Queries `listJobsForWorkflowRun` for the current workflow run +2. Filters for **root cause failures only** — jobs with `conclusion === 'failure'` whose failing step is NOT `check-stage-health` (excludes cascade failures) +3. If root cause failures found → calls `core.setFailed()` with the list of root cause job names +4. If none → does nothing (step succeeds) + +**Cascade filtering**: When job A fast-fails due to health check, it also has `conclusion: failure`. Without filtering, job B would list both the original failure AND job A's fast-fail. The filter checks each failed job's `steps` array — if the failing step name contains `check-stage-health` or `Check stage health`, it's excluded from the root cause list. + +**Usage pattern:** +```yaml +steps: + - name: Checkout code + uses: actions/checkout@v4 + ... 
+ + - uses: ./.github/actions/check-stage-health + id: stage-health + + - name: Install dependencies # skipped automatically if health check failed + ... # (default if: success() is false) + + - name: Run test # also skipped + ... +``` + +**Visual effect**: Job shows **red X** (failure) with error annotation showing root cause job names. Subsequent steps are naturally skipped (default `if: success()` is false after a failed step). No per-step `if` guards needed. + +**No stage filtering**: Checks ALL jobs in the run, not just the current stage. Any failure anywhere triggers fast-fail. + +**Error message example:** +``` +Fast-fail: skipping — root cause job(s): stage-b-test-1-gpu-small (0), stage-b-test-1-gpu-small (1) +``` + +--- + +## Within-Suite Failure Handling + +Controlled by `run_unittest_files()` in `python/sglang/test/ci/ci_utils.py`. + +### Flags + +| Flag | PR default | Scheduled default | Effect | +|------|------------|-------------------|--------| +| `--continue-on-error` | Off | On | Off: stop at first failure. On: run all files, report all failures at end | +| `--enable-retry` | On | On | Retry retriable failures (accuracy/perf assertions) | +| `--max-attempts` | 2 | 2 | Max attempts per file including initial run | + +### Retry Classification + +When a test fails and retry is enabled, the output is classified: + +**Non-retriable** (checked first — real code errors): +`SyntaxError`, `ImportError`, `ModuleNotFoundError`, `NameError`, `TypeError`, `AttributeError`, `RuntimeError`, `CUDA out of memory`, `OOM`, `Segmentation fault`, `core dumped`, `ConnectionRefusedError`, `FileNotFoundError` + +**Retriable** (accuracy/performance): +`AssertionError` with comparison patterns (`not greater than`, `not less than`, `not equal to`), `accuracy`, `score`, `latency`, `throughput`, `timeout` + +**Default**: Unknown `AssertionError` → retriable. Other unknown failures → not retriable. 
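
The classification order described above can be sketched in a few lines of Python. This is an illustrative approximation, not the code in `ci_utils.py`: the function name `is_retriable` and the plain substring matching are assumptions, and the pattern lists are simply copied from the description above.

```python
# Hypothetical sketch of the retry classification order; not the real ci_utils.py code.
NON_RETRIABLE = [
    "SyntaxError", "ImportError", "ModuleNotFoundError", "NameError",
    "TypeError", "AttributeError", "RuntimeError", "CUDA out of memory",
    "OOM", "Segmentation fault", "core dumped", "ConnectionRefusedError",
    "FileNotFoundError",
]
RETRIABLE = [
    "not greater than", "not less than", "not equal to",
    "accuracy", "score", "latency", "throughput", "timeout",
]


def is_retriable(output: str) -> bool:
    """Return True if a failed test's output looks worth retrying."""
    # 1. Real code errors are checked first and always win.
    if any(pattern in output for pattern in NON_RETRIABLE):
        return False
    # 2. Accuracy/performance style failures are retried.
    if any(pattern in output for pattern in RETRIABLE):
        return True
    # 3. Default: an unknown AssertionError is retriable, anything else is not.
    return "AssertionError" in output
```

Checking the non-retriable list first means an output that contains both an `AssertionError` and, say, a `RuntimeError` is treated as a real code error and is not retried.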
+ +### How `continue_on_error` is set + +In `pr-test.yml`'s `check-changes` job: +- `schedule` runs or `run_all_tests` flag → `continue_on_error = 'true'` +- PR runs → `continue_on_error = 'false'` + +Each test job propagates via: +```yaml +env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} +run: | + python3 run_suite.py --hw cuda --suite $CONTINUE_ON_ERROR_FLAG +``` + +--- + +## Test Partitioning + +Large suites are split across matrix jobs using the **LPT (Longest Processing Time) heuristic** in `ci_register.py:auto_partition()`: + +1. Sort tests by `est_time` descending, filename as tie-breaker (deterministic) +2. Greedily assign each test to the partition with smallest cumulative time +3. Result: roughly equal total time per partition + +**Partition table** (CUDA per-commit suites): + +| Suite | Partitions | Runner | max_parallel | +|-------|-----------|--------|-------------| +| `stage-a-test-1-gpu-small` | 1 (no matrix) | `1-gpu-5090` | — | +| `stage-a-test-cpu` | 4 | `ubuntu-latest` | — | +| `stage-b-test-1-gpu-small` | 8 | `1-gpu-5090` | 8 | +| `stage-b-test-1-gpu-large` | 14 | `1-gpu-h100` | dynamic (3 or 14) | +| `stage-b-test-2-gpu-large` | 4 | `2-gpu-h100` | — | +| `stage-b-test-4-gpu-b200` | 1 (no matrix) | `4-gpu-b200` | — | +| `stage-b-kernel-unit-1-gpu-large` | 1 (no matrix) | `1-gpu-h100` | — | +| `stage-b-kernel-unit-1-gpu-b200` | 1 (no matrix) | `4-gpu-b200` | — | +| `stage-b-kernel-unit-8-gpu-h200` | 1 (no matrix) | `8-gpu-h200` | — | +| `stage-b-kernel-benchmark-1-gpu-large` | 1 (no matrix) | `1-gpu-h100` | — | +| `stage-c-test-4-gpu-h100` | 3 | `4-gpu-h100` | — | +| `stage-c-test-8-gpu-h200` | 4 | `8-gpu-h200` | — | +| `stage-c-test-8-gpu-h20` | 2 | `8-gpu-h20` | — | +| `stage-c-test-deepep-4-gpu-h100` | 1 (no matrix) | `4-gpu-h100` | — | +| `stage-c-test-deepep-8-gpu-h200` | 1 (no matrix) | `8-gpu-h200` | — | +| `stage-c-test-4-gpu-b200` | 3 | `4-gpu-b200` | — | +| 
`stage-c-test-4-gpu-b200-small` | 3 | `4-gpu-b200-low-disk` | — | +| `stage-c-test-8-gpu-b200` | registered only | `8-gpu-b200` | — | +| `stage-c-test-4-gpu-gb200` | registered only | `4-gpu-gb200` | — | + +> **Note**: Kernel suites (`stage-b-kernel-*`) run via `pr-test-jit-kernel.yml` and `pr-test-sgl-kernel.yml`, not the main `pr-test.yml`. `stage-c-test-8-gpu-b200` is registered in `test/run_suite.py` but not wired to PR CI. The GB200 job is currently commented out in `pr-test.yml` until a company-owned runner is provisioned. Multimodal diffusion uses `python/sglang/multimodal_gen/test/run_suite.py`, not `test/run_suite.py`. + +**Workflow usage:** +```yaml +strategy: + matrix: + partition: [0, 1, 2, 3, 4, 5, 6, 7] +steps: + - run: python3 run_suite.py --hw cuda --suite stage-b-test-1-gpu-small \ + --auto-partition-id ${{ matrix.partition }} --auto-partition-size 8 +``` + +--- + +## check-changes Job + +Determines which test suites to run based on file changes. + +### Detection Methods + +| Trigger | Method | Details | +|---------|--------|---------| +| `pull_request` | `dorny/paths-filter` | Detects changes via GitHub diff | +| `workflow_dispatch` (with `pr_head_sha`) | GitHub API | `repos/{repo}/compare/main...{sha}` | +| `schedule` / `run_all_tests` | Force all true | Runs everything | + +### Output Flags + +| Output | Triggers | +|--------|----------| +| `main_package` | Stage A/B/C test suites | +| `sgl_kernel` | Kernel wheel builds + kernel test suites; also switches B200 jobs to kernel-build runner labels outside `target_stage` mode | +| `jit_kernel` | JIT kernel test workflow | +| `multimodal_gen` | Multimodal-gen test workflow | + +> **Note**: In `target_stage` mode, `sgl_kernel` is only active when `include_wheel_build=true`. Without that opt-in, kernel-change reruns fail validation instead of running a target stage without freshly built wheels. 
Outside `target_stage`, `sgl_kernel=true` switches B200 jobs from `4-gpu-b200` / `4-gpu-b200-low-disk` to `4-gpu-b200-kernel` / `4-gpu-b200-kernel-low-disk`. + +--- + +## Concurrency Control + +``` +group: pr-test-{event_name}-{branch}-{pr_sha}-{stage} +``` + +| Segment | Source | Purpose | +|---------|--------|---------| +| `event_name` | `github.event_name` | Prevents scheduled runs colliding with fork PRs named `main` | +| `branch` | `github.head_ref \|\| github.ref_name` | Per-branch isolation | +| `pr_sha` | `inputs.pr_head_sha \|\| 'current'` | Isolates `/rerun-stage` from main runs | +| `stage` | `inputs.target_stage \|\| 'all'` | Allows parallel stage dispatches | + +`cancel-in-progress: true` for `pull_request` events (new push cancels old run), `false` for `workflow_call`. + +--- + +## How To: Add a New Stage Job + +1. Define the job in `pr-test.yml` with `needs: [check-changes, call-gate, wait-for-stage-X, ...]` +2. Copy the `if:` condition pattern from an existing same-stage job (handles `target_stage`, `schedule`, `main_package`) +3. Add `checkout` step +4. Add `check-stage-health` step (after checkout) — if any prior job failed, `core.setFailed()` fires and all subsequent steps auto-skip via default `if: success()` +5. Add `check-maintenance` step +6. Add `download-artifact` step if `sgl_kernel` changed +7. Add `install dependencies` step +8. Add `run test` step with `$CONTINUE_ON_ERROR_FLAG` +9. Add `upload-cuda-coredumps` step with `if: always()` +10. Register the suite name in `PER_COMMIT_SUITES` in `test/run_suite.py` +11. If using matrix, add `--auto-partition-id` and `--auto-partition-size` to the run command +12. **Update `wait-for-stage-X`** job spec with the new job name and `expected_count` (if matrix) +13. 
**Add the job to `pr-test-finish.needs`** list + +--- + +## How To: Debug CI Failures + +| Symptom | Likely cause | What to check | +|---------|-------------|---------------| +| All stage-B/C jobs green but steps skipped | Earlier job failed, `check-stage-health` triggered | Find the actual failed job (red X) | +| `wait-for-stage-b` timeout | `expected_count` doesn't match matrix size | Verify job spec counts match `matrix:` array length | +| `pr-test-finish` fails but all jobs green | A job was `cancelled` (counts as failure in finish) | Check concurrency cancellation | +| Tests pass locally but fail in CI | Partition assignment, runner GPU type, or `est_time` inaccuracy | Check which partition the test lands in; verify runner label | +| Flaky test retried and passed | Retriable failure (accuracy/perf) | Check `[CI Retry]` markers in job logs | +| Flaky test NOT retried | Matched non-retriable pattern | Check if error matches `NON_RETRIABLE_PATTERNS` in `ci_utils.py` | + +--- + +## Slash Commands + +| Command | Effect | +|---------|--------| +| `/tag-run-ci-label` | Adds `run-ci` label to PR | +| `/rerun-failed-ci` | Reruns failed jobs in the latest workflow run | +| `/tag-and-rerun-ci` | Adds label + reruns | +| `/rerun-stage ` | Dispatches `pr-test.yml` with `target_stage=` | +| `/rerun-test ` | Reruns a specific test file via `rerun-test.yml` | +| `/rerun-group [ ...]` | Expands registered test groups, then reuses `/rerun-test` | + +Handled by `scripts/ci/utils/slash_command_handler.py` → `.github/workflows/slash-command-handler.yml`. diff --git a/.claude/skills/clean-startup-log/SKILL.md b/.claude/skills/clean-startup-log/SKILL.md new file mode 100644 index 000000000000..8f7c254115ca --- /dev/null +++ b/.claude/skills/clean-startup-log/SKILL.md @@ -0,0 +1,179 @@ +--- +name: clean-startup-log +description: Clean up noisy startup warnings and spurious prints in SGLang server logs. 
Use when users ask to clean up unwanted warnings, deprecation messages, or third-party noise in the server startup output. +disable-model-invocation: true +--- + +# Clean Up SGLang Server Startup Logs + +Goal: ensure the server startup log is clean and minimal, with no spurious warnings, deprecation messages, or unformatted prints from third-party libraries. + +## Workflow + +### 1. Launch a server and capture the log + +```bash +uv run sglang serve --model-path Qwen/Qwen3-8B 2>&1 | tee /tmp/startup_log.txt +``` + +Wait until the server prints `The server is fired up and ready to roll!`, then Ctrl-C. + +For TP>1 testing: +```bash +uv run sglang serve --model-path Qwen/Qwen3-8B --tp 2 2>&1 | tee /tmp/startup_log.txt +``` + +### 2. Compare against the clean reference log + +Read `/tmp/startup_log.txt` and compare it against the reference log at the bottom of this file. Identify lines that: + +- Do NOT have the `[timestamp]` or `[timestamp TPx]` logger prefix +- Contain `WARNING`, `deprecated`, `is deprecated`, or similar noise +- Are printed by third-party libraries (transformers, torchao, NCCL, Gloo, tqdm, etc.) +- Are duplicate/redundant with information already logged by SGLang + +### 3. Classify each noisy line + +For each noisy line, determine: + +| Category | Action | +|----------|--------| +| **SGLang code using wrong API** | Fix the SGLang code (e.g., replace deprecated API with new one) | +| **SGLang code logging at wrong level** | Change log level (e.g., warning -> debug for non-actionable messages) | +| **Third-party lib prints at import time** | Suppress the logger or redirect stdout during that import | +| **C-level print from .so library** | Redirect fd 1 during the specific C call, or accept it if too invasive | +| **Real warning the user should see** | Keep it | + +### 4. Present findings before fixing + +List all noisy lines with their source and proposed fix. Ask the user to review before making changes. + +### 5. 
Apply fixes and verify + +After approval, apply fixes one at a time, re-launch the server, and verify each fix works. + +## Known Noise Sources and Fixes (from past sessions) + +### 1. torchao "Skipping import of cpp extensions due to incompatible torch version" + +- **Source:** `torchao/__init__.py` — printed via `logger.warning()` when torch version < 2.11.0 +- **Trigger:** `sglang/__init__.py` -> `_apply_hf_patches()` -> `_patch_removed_symbols()` -> `from transformers.models.llama import modeling_llama` -> deep import chain -> `transformers/quantizers/auto.py` -> `from .quantizer_torchao import TorchAoHfQuantizer` -> imports torchao +- **Fix:** In `hf_transformers_patches.py::_patch_removed_symbols()`, temporarily set the `torchao` logger level to `ERROR` around the `modeling_llama` import: + ```python + _torchao_logger = logging.getLogger("torchao") + _prev_level = _torchao_logger.level + _torchao_logger.setLevel(logging.ERROR) + try: + from transformers.models.llama import modeling_llama + finally: + _torchao_logger.setLevel(_prev_level) + ``` + +### 2. "`torch_dtype` is deprecated! Use `dtype` instead!" + +- **Source:** `transformers/configuration_utils.py` — the `torch_dtype` property warns via `logger.warning_once()` +- **Trigger:** `get_hf_text_config()` in `sglang/srt/utils/hf_transformers/common.py` accesses `config.torch_dtype` +- **Fix:** Replace all `getattr(config, "torch_dtype", ...)` with `getattr(config, "dtype", ...)` and `config.torch_dtype = X` with `config.dtype = X` in `common.py` + +### 3. "`BaseImageProcessorFast` is deprecated" + +- **Source:** `transformers/utils/import_utils.py` — the lazy module `__getattr__` warns when `BaseImageProcessorFast` is accessed +- **Trigger:** `base_processor.py` and `ernie45_vl.py` have `from transformers import BaseImageProcessorFast` at top level. These are imported eagerly via `tokenizer_manager.py` -> `multimodal_processor.py` -> `base_processor.py`, even for non-multimodal models. 
+- **Fix:** Replace `from transformers import BaseImageProcessorFast` with `from transformers import BaseImageProcessor` and update all `isinstance(..., BaseImageProcessorFast)` checks to `isinstance(..., BaseImageProcessor)` + +### 4. "No platform detected. Using base SRTPlatform with defaults." + +- **Source:** `sglang/srt/platforms/__init__.py` — `logger.warning()` +- **Fix:** Change to `logger.debug()` — this is expected on machines without a platform plugin and not actionable. + +### 5. `NCCL version 2.27.7+cuda13.0` + +- **Source:** C-level print from `libnccl.so` during `ncclCommInitRank()` call +- **Status:** Accepted as-is. SGLang already logs the version via `sglang is using nccl==X.Y.Z`. The C-level print cannot be suppressed without redirecting stdout fd, which is too invasive. `NCCL_DEBUG=WARN` does not suppress it in NCCL 2.27+. + +### 6. `[Gloo] Rank X is connected to Y peer ranks` + +- **Source:** C++ Gloo library print during process group init +- **Status:** Accepted as-is. From C++ code inside PyTorch's Gloo backend. + +### 7. `torchao SyntaxWarning: invalid escape sequence` + +- **Source:** `torchao/quantization/quant_api.py` — a non-raw string literal containing `\.`, which Python flags as an invalid escape sequence +- **Status:** Upstream torchao bug. Cannot fix from SGLang side. + +### 8. tqdm progress bars (e.g., `Multi-thread loading shards`, `Capturing batches`) + +- **Status:** These are expected and useful. They show progress during weight loading and CUDA graph capture. Keep them. 
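
The save/raise/restore pattern used for the `torchao` logger in fix 1 can be wrapped in a reusable context manager. The following is a minimal sketch; `quiet_logger` is a hypothetical helper name, not an existing SGLang utility.

```python
import contextlib
import logging


@contextlib.contextmanager
def quiet_logger(name: str, level: int = logging.ERROR):
    """Temporarily raise a logger's level, restoring the old level on exit."""
    logger = logging.getLogger(name)
    previous_level = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Restore even if the wrapped import or call raises.
        logger.setLevel(previous_level)
```

With such a helper, the noisy import would sit inside `with quiet_logger("torchao"):`, which keeps the suppression scoped to that one statement instead of silencing the logger for the whole process.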
+ +## Investigation Techniques + +### Trace what triggers an import +```python +import builtins +_real_import = builtins.__import__ +def _tracing_import(name, *args, **kwargs): + if 'TARGET_MODULE' in name: + import traceback + print(f'=== Importing {name} ===') + traceback.print_stack() + return _real_import(name, *args, **kwargs) +builtins.__import__ = _tracing_import +``` + +### Trace what triggers a logger warning +```python +import logging, traceback +class TraceHandler(logging.Handler): + def emit(self, record): + if 'SEARCH_STRING' in record.getMessage(): + traceback.print_stack() +h = TraceHandler() +h.setLevel(logging.WARNING) +logging.getLogger('TARGET_LOGGER_NAME').addHandler(h) +``` + +### Find C-level prints in .so files +```bash +strings /path/to/library.so | grep "SEARCH_STRING" +``` + +## Reference: Clean Startup Log (TP=1, Qwen3-8B) + +``` +[2026-04-27 02:35:53] Attention backend not specified. Use trtllm_mha backend by default. +[2026-04-27 02:35:53] TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. +[2026-04-27 02:35:54] server_args=ServerArgs(model_path='Qwen/Qwen3-8B', ...) +[2026-04-27 02:35:56] Using default HuggingFace chat template with detected content format: string +[2026-04-27 02:36:03] Init torch distributed begin. +[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 +[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 +[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 +[2026-04-27 02:36:03] Init torch distributed ends. elapsed=0.27 s, mem usage=0.09 GB +[2026-04-27 02:36:04] Load weight begin. avail mem=177.57 GB +[2026-04-27 02:36:04] Found local HF snapshot for Qwen/Qwen3-8B at ...; skipping download. +Multi-thread loading shards: 100% Completed | 5/5 [00:01<00:00, 3.08it/s] +[2026-04-27 02:36:06] Load weight end. 
elapsed=1.97 s, type=Qwen3ForCausalLM, avail mem=162.30 GB, mem usage=15.28 GB. +[2026-04-27 02:36:06] Using KV cache dtype: torch.bfloat16 +[2026-04-27 02:36:06] KV Cache is allocated. #tokens: 992896, K size: 68.18 GB, V size: 68.18 GB +[2026-04-27 02:36:06] Memory pool end. avail mem=25.26 GB +[2026-04-27 02:36:06] Capture cuda graph begin. This can take up to several minutes. avail mem=24.14 GB +[2026-04-27 02:36:06] Capture cuda graph bs [1, 2, 4, ...] +Capturing batches (bs=1 avail_mem=23.54 GB): 100% | 52/52 [00:03<00:00, 16.76it/s] +[2026-04-27 02:36:09] Capture cuda graph end. Time elapsed: 3.74 s. mem usage=0.60 GB. avail mem=23.54 GB. +[2026-04-27 02:36:09] Capture piecewise CUDA graph begin. avail mem=23.54 GB +[2026-04-27 02:36:09] Capture cuda graph num tokens [4, 8, 12, ...] +Compiling num tokens (num_tokens=4): 100% | 74/74 [00:09<00:00, 8.16it/s] +Capturing num tokens (num_tokens=4 avail_mem=21.23 GB): 100% | 74/74 [00:08<00:00, 9.11it/s] +[2026-04-27 02:36:27] Capture piecewise CUDA graph end. Time elapsed: 17.62 s. mem usage=2.32 GB. avail mem=21.22 GB. +[2026-04-27 02:36:28] max_total_num_tokens=992896, chunked_prefill_size=16384, ... +[2026-04-27 02:36:29] INFO: Started server process [399368] +[2026-04-27 02:36:29] INFO: Waiting for application startup. +[2026-04-27 02:36:29] Using default chat sampling params from model generation config: ... +[2026-04-27 02:36:29] INFO: Application startup complete. +[2026-04-27 02:36:29] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit) +[2026-04-27 02:36:30] Prefill batch, #new-seq: 1, #new-token: 64, ... +[2026-04-27 02:36:30] INFO: 127.0.0.1:34916 - "POST /generate HTTP/1.1" 200 OK +[2026-04-27 02:36:30] The server is fired up and ready to roll! +``` + +Note: `[Gloo]` messages and tqdm progress bars are acceptable. The key is no warnings or deprecation messages from transformers, torchao, or other third-party libraries. 
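
The comparison in step 2 of the workflow can be partly automated. The script below is a rough sketch under stated assumptions: the function name `find_noisy_lines`, the timestamp regex (including the ` TPx` suffix format), and the allowlist for `[Gloo]` and tqdm lines are all illustrative choices, not part of SGLang.

```python
import re

# Assumed prefix shapes: "[2026-04-27 02:35:53]" or "[2026-04-27 02:35:53 TP0]".
TIMESTAMP_PREFIX = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}( TP\d+)?\]")
NOISE_WORDS = re.compile(r"WARNING|deprecated", re.IGNORECASE)
# Heuristic for tqdm bars such as "100% Completed | 5/5 [...]".
TQDM_BAR = re.compile(r"\d+%.*\|\s*\d+/\d+")


def find_noisy_lines(log_text: str) -> list[str]:
    """Return log lines that deserve a manual look."""
    noisy = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        if NOISE_WORDS.search(line):
            noisy.append(line)  # warnings are flagged even with a valid prefix
        elif TIMESTAMP_PREFIX.match(line):
            continue  # normal SGLang logger output
        elif line.startswith("[Gloo]") or TQDM_BAR.search(line):
            continue  # accepted noise per the note above
        else:
            noisy.append(line)  # unprefixed output, likely third-party
    return noisy
```

Lines flagged this way still need the manual classification from step 3; for example, the accepted NCCL version print would be reported here because it has no logger prefix.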
diff --git a/.claude/skills/debug-cuda-crash/SKILL.md b/.claude/skills/debug-cuda-crash/SKILL.md new file mode 100644 index 000000000000..d32126c7a744 --- /dev/null +++ b/.claude/skills/debug-cuda-crash/SKILL.md @@ -0,0 +1,657 @@ +--- +name: debug-cuda-crash +description: Call this skill when you need to debug CUDA crashes in SGLang using kernel API logging +--- + +# Tutorial: Debugging CUDA Crashes with Kernel API Logging + +This tutorial shows you how to debug CUDA crashes and errors in SGLang using the `@debug_kernel_api` logging decorator. + +## Goal + +When your code crashes with CUDA errors such as illegal memory access, device-side assert, out-of-bounds, or NaN/Inf, use kernel API logging to: +- Capture input tensors BEFORE the crash occurs +- Understand what data caused the problem +- Track tensor shapes, dtypes, and values through the call boundary that triggered the crash +- Detect numerical issues such as NaN, Inf, or obviously wrong shapes + +## Why Use Kernel API Logging? + +**Problem**: CUDA errors often crash the program before normal debugging output is flushed. + +**Solution**: SGLang's `@debug_kernel_api` decorator logs inputs before execution, so you can still see what caused the crash even after the program aborts. + +## What Is Covered? + +The current logging coverage focuses on the highest-value kernel boundaries in SGLang: +- Custom ops registered through `register_custom_op(...)` +- External custom ops registered through `register_custom_op_from_extern(...)` +- LLM attention, linear, quantization, and multi-platform wrapper entry points +- Diffusion attention impl, linear, rotary, and custom-op wrapper entry points +- Selected direct `torch.ops.sglang.*` hotspots and model-specific bypasses + +This means the logging is useful for both LLM and diffusion kernel debugging, but it does not automatically cover every pure PyTorch call in the repository. 
+ +## Step 1: Enable Kernel API Logging + +### Basic Logging (Function Names Only) + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=1 +export SGLANG_KERNEL_API_LOGDEST=stdout + +python my_script.py +``` + +Output: +``` +================================================================================ +[2026-03-19 00:47:06] SGLang Kernel API Call: RMSNorm.forward +================================================================================ +[2026-03-19 00:47:06] SGLang Kernel API Call: sglang.quant_method.UnquantizedLinearMethod.apply +================================================================================ +[2026-03-19 00:47:06] SGLang Kernel API Call: sglang.custom_op.fused_inplace_qknorm +``` + +This is a real level-1 excerpt captured from `Qwen/Qwen3-0.6B`. + +### Detailed Logging (Inputs with Metadata) + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=3 +export SGLANG_KERNEL_API_LOGDEST=debug.log + +python my_script.py +``` + +Output in `debug.log`: +``` +================================================================================ +[2026-03-19 00:47:30] SGLang Kernel API Call: sglang.quant_method.UnquantizedLinearMethod.apply +Positional input arguments: + arg[0]=QKVParallelLinear( + repr=QKVParallelLinear(in_features=1024, output_features=4096, bias=False, tp_size=1, gather_output=False) + ) + arg[1]=Tensor( + shape=(1, 1024) + dtype=torch.bfloat16 + device=cuda:0 + requires_grad=False + is_contiguous=True + ) + arg[2]=None +Output: + return=Tensor( + shape=(1, 4096) + dtype=torch.bfloat16 + device=cuda:0 + requires_grad=False + is_contiguous=True + ) +``` + +This is a real level-3 excerpt captured from `Qwen/Qwen3-0.6B`. 
+ +### Full Logging (With Tensor Statistics) + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=5 +export SGLANG_KERNEL_API_LOGDEST=debug.log + +python my_script.py +``` + +Additional output: +``` +================================================================================ +[2026-03-19 01:00:42] SGLang Kernel API Call: diffusion.quant_method.UnquantizedLinearMethod.apply +Positional input arguments: + arg[1]=Tensor( + shape=(1, 77, 768) + dtype=torch.bfloat16 + device=cuda:0 + requires_grad=False + is_contiguous=True + min=-27.250000 + max=28.500000 + mean=0.011723 + nan_count=0 + inf_count=0 + ) +Output: + return=Tensor( + shape=(1, 77, 2304) + dtype=torch.bfloat16 + device=cuda:0 + requires_grad=False + is_contiguous=True + min=-8.937500 + max=9.375000 + mean=0.009460 + nan_count=0 + inf_count=0 + ) +``` + +This is a real level-5 excerpt captured from `black-forest-labs/FLUX.1-dev`. + +### Crash-Safe Dumps (Inputs Saved Before Execution) + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=10 +export SGLANG_KERNEL_API_LOGDEST=debug.log +export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps + +python my_script.py +``` + +At level 10, SGLang saves the inputs before execution. If the kernel crashes, the dump directory still contains the inputs and exception metadata. + +If CUDA graph capture is active, tensor dumps are skipped automatically to avoid capture-time CUDA errors. In that case, you still get the kernel API call log, but not `inputs.pt` / `outputs.pt`. + +Level-10 dumps are best understood as crash-safe call snapshots. They always preserve the observed call boundary. They do not guarantee one-click replay for every method, because some methods depend on module state that is not serialized into the dump. 
+ +Real level-10 dump layout from `Qwen/Qwen3-0.6B`: + +```text +/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps +/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001 +/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/inputs.pt +/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/metadata.json +/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/outputs.pt +``` + +Real `metadata.json` excerpt: + +```json +{ + "function_name": "RotaryEmbedding.forward", + "timestamp": "20260319_004821_182", + "process_id": 919286, + "execution_status": "completed", + "input_tensor_keys": ["arg_0", "arg_1", "arg_2"], + "output_tensor_keys": ["result_0", "result_1"] +} +``` + +## Step 2: Reproduce an LLM CUDA Crash + +Create a temporary reproducer: + +```bash +python3 - <<'PY' +from pathlib import Path +Path("/tmp/sglang_llm_crash.py").write_text( + "import torch\\n" + "import torch.nn.functional as F\\n" + "from sglang.srt.utils.custom_op import register_custom_op\\n\\n" + "def _fake_embedding(indices, table):\\n" + " return torch.empty((*indices.shape, table.shape[-1]), device=table.device, dtype=table.dtype)\\n\\n" + "@register_custom_op(op_name='mock_llm_cuda_crash', fake_impl=_fake_embedding)\\n" + "def mock_llm_cuda_crash(indices, table):\\n" + " out = F.embedding(indices, table)\\n" + " torch.cuda.synchronize()\\n" + " return out\\n\\n" + "table = torch.randn(4, 8, device='cuda', dtype=torch.float16)\\n" + "indices = torch.tensor([0, 7], device='cuda', dtype=torch.long)\\n" + "mock_llm_cuda_crash(indices, table)\\n" +) +PY + +SGLANG_KERNEL_API_LOGLEVEL=1 \ +SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level1.log \ +python3 /tmp/sglang_llm_crash.py +``` + +What to 
expect: +- The script exits with a CUDA `device-side assert` +- The log still contains the last API boundary before the crash + +Try the same example at level 3: + +```bash +SGLANG_KERNEL_API_LOGLEVEL=3 \ +SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level3.log \ +python3 /tmp/sglang_llm_crash.py +``` + +Now the log shows tensor metadata before the crash. + +Try level 10: + +```bash +SGLANG_KERNEL_API_LOGLEVEL=10 \ +SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level10.log \ +SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_llm_level10_dumps \ +python3 /tmp/sglang_llm_crash.py +``` + +Now you should see: +- A log entry for `sglang.custom_op.mock_llm_cuda_crash` +- A dump directory with `inputs.pt` +- `metadata.json` showing `execution_status: "exception"` +- No `outputs.pt`, because the kernel crashed before producing output + +For real-model success-path level-10 dumps, it is often easier to temporarily disable CUDA graph and piecewise CUDA graph for the debug run. + +## Step 3: Reproduce a Diffusion CUDA Crash + +Create a temporary diffusion-side reproducer: + +```bash +python3 - <<'PY' +from pathlib import Path +Path("/tmp/sglang_diffusion_crash.py").write_text( + "import torch\\n" + "import torch.nn.functional as F\\n" + "from sglang.multimodal_gen.runtime.layers.utils import register_custom_op\\n\\n" + "def _fake_embedding(positions, cache):\\n" + " return torch.empty((*positions.shape, cache.shape[-1]), device=cache.device, dtype=cache.dtype)\\n\\n" + "@register_custom_op(op_name='mock_diffusion_cuda_crash', fake_impl=_fake_embedding)\\n" + "def mock_diffusion_cuda_crash(positions, cache):\\n" + " out = F.embedding(positions, cache)\\n" + " torch.cuda.synchronize()\\n" + " return out\\n\\n" + "cache = torch.randn(4, 64, device='cuda', dtype=torch.float16)\\n" + "positions = torch.tensor([0, 9], device='cuda', dtype=torch.long)\\n" + "mock_diffusion_cuda_crash(positions, cache)\\n" +) +PY + +SGLANG_KERNEL_API_LOGLEVEL=1 \ 
+SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level1.log \ +python3 /tmp/sglang_diffusion_crash.py +``` + +Try level 3: + +```bash +SGLANG_KERNEL_API_LOGLEVEL=3 \ +SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level3.log \ +python3 /tmp/sglang_diffusion_crash.py +``` + +Try level 10: + +```bash +SGLANG_KERNEL_API_LOGLEVEL=10 \ +SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level10.log \ +SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_diffusion_level10_dumps \ +python3 /tmp/sglang_diffusion_crash.py +``` + +If your local environment has unrelated FlashInfer import issues, resolve them in the shell before running the example. The example itself does not set any `FLASHINFER_*` environment variable. + +## Step 4: Multi-Process Debugging + +When running with multiple GPUs or worker processes, use `%i` in the log path: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=3 +export SGLANG_KERNEL_API_LOGDEST=debug_rank_%i.log + +torchrun --nproc_per_node=4 my_script.py +``` + +This creates separate logs such as: +- `debug_rank_12345.log` +- `debug_rank_12346.log` +- `debug_rank_12347.log` +- `debug_rank_12348.log` + +Real multi-process example from a 2-GPU `Qwen/Qwen2.5-0.5B-Instruct` run: + +```text +/tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950201.log +/tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950349.log +/tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950350.log +/tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950351.log +``` + +You should usually do the same for level-10 dump directories: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=10 +export SGLANG_KERNEL_API_LOGDEST=debug_rank_%i.log +export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps_%i +``` + +This avoids multiple ranks writing into the same dump directory tree. 
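With one dump tree per rank, it can help to scan all of them for calls that did not finish. The sketch below relies only on the `metadata.json` fields shown earlier (`function_name`, `execution_status`); the helper name `find_crashed_calls` is an assumption for illustration, and it assumes every dump directory contains a readable `metadata.json`.

```python
import json
from pathlib import Path


def find_crashed_calls(dump_root):
    """Return (function_name, dump_dir) for dumps whose call did not complete."""
    crashed = []
    # Each dump directory is expected to hold one metadata.json next to inputs.pt.
    for meta_path in sorted(Path(dump_root).glob("**/metadata.json")):
        meta = json.loads(meta_path.read_text())
        # Anything other than "completed" (for example "exception") is reported.
        if meta.get("execution_status") != "completed":
            crashed.append((meta.get("function_name"), str(meta_path.parent)))
    return crashed
```

Run it once per rank directory (for example, each path matching `/tmp/sglang_kernel_api_dumps_*`) and start with the dump directories it reports, since those hold the `inputs.pt` for the failing call.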
+ +## Step 5: Filter Level-10 Dumps + +If level 10 is too noisy, restrict dumps to specific APIs: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=10 +export SGLANG_KERNEL_API_LOGDEST=debug.log +export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps +export SGLANG_KERNEL_API_DUMP_INCLUDE='sglang.custom_op.*' +export SGLANG_KERNEL_API_DUMP_EXCLUDE='*.fake_impl' +``` + +`SGLANG_KERNEL_API_DUMP_INCLUDE` and `SGLANG_KERNEL_API_DUMP_EXCLUDE` use shell-style wildcard matching. + +## Step 6: Common CUDA Errors and What to Check + +### Illegal Memory Access or Device-Side Assert + +**Typical errors**: +``` +RuntimeError: CUDA error: an illegal memory access was encountered +torch.AcceleratorError: CUDA error: device-side assert triggered +``` + +Use: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=3 +``` + +Check in the logs: +- ✅ Tensor shapes +- ✅ Tensor dtypes +- ✅ CUDA vs CPU device placement +- ✅ Tensor stride / contiguity +- ✅ Whether the failing call has inputs logged but no outputs logged + +Typical shape-mismatch pattern: + +```text +SGLang Kernel API Call: ... +arg[0]=Tensor(shape=(..., 128), ...) # ✅ expected dimension +arg[1]=Tensor(shape=(..., 64), ...) # ❌ mismatch +``` + +This often points to head-dim, hidden-dim, or cache-layout mismatch rather than a random CUDA failure. + +### NaN or Inf + +Use: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=5 +``` + +Check: +- `min` +- `max` +- `mean` +- `nan_count` +- `inf_count` + +Typical bad pattern: + +```text +Tensor( + ... + min=-1234567.000000 # ❌ suspiciously large + max=9876543.000000 # ❌ suspiciously large + mean=nan # ❌ bad + nan_count=128 # ❌ found NaNs + inf_count=0 # ✅ no Infs here +) +``` + +This usually means the bad values were already present before the crashing kernel. 
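Because bad values are often present before the kernel that finally crashes, it is useful to find the first logged call with a nonzero `nan_count` or `inf_count`. The following is a minimal sketch that assumes the level-5 log layout shown in this tutorial (a `SGLang Kernel API Call: <name>` header followed by `nan_count=` / `inf_count=` lines); `first_bad_call` is a hypothetical helper, not part of SGLang.

```python
import re

# Patterns based on the level-5 log excerpts shown in this tutorial.
CALL_RE = re.compile(r"SGLang Kernel API Call: (\S+)")
COUNT_RE = re.compile(r"\b(nan_count|inf_count)=(\d+)")


def first_bad_call(log_lines):
    """Return (api_name, field, count) for the first nonzero NaN/Inf count, or None."""
    current_call = None
    for line in log_lines:
        call = CALL_RE.search(line)
        if call:
            # Remember which API boundary the following tensor stats belong to.
            current_call = call.group(1)
            continue
        for field, count in COUNT_RE.findall(line):
            if int(count) > 0:
                return current_call, field, int(count)
    return None
```

Typical use is `first_bad_call(open("debug.log"))`. Calls whose statistics were skipped during CUDA graph capture carry no counts, so they are passed over rather than flagged.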
+ +### Out of Memory + +Use: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=3 +``` + +Check: +- Unexpectedly large tensor shapes +- Batch size +- Sequence length +- Frame count or image resolution in diffusion workloads + +Also check whether a supposedly per-token or per-frame tensor accidentally became full-sequence or full-image sized. + +Typical bad pattern: + +```text +Tensor( + shape=(1024, 8192, 128, 128) # ❌ way too large + ... +) +``` + +### Example: Spot a Shape Bug from the Log + +Suppose the failing API log looks like this: + +```text +[2026-03-19 00:47:30] SGLang Kernel API Call: RotaryEmbedding.forward +Positional input arguments: + arg[0]=Tensor(shape=(1, 8), dtype=torch.int64, ...) + arg[1]=Tensor(shape=(1, 8, 8, 256), dtype=torch.bfloat16, ...) # ✅ query + arg[2]=Tensor(shape=(1, 8, 4, 64), dtype=torch.bfloat16, ...) # ❌ key head_dim mismatch +``` + +What this tells you: +- ✅ positions look reasonable +- ✅ query looks plausible +- ❌ key last dimension is inconsistent with the expected rotary/head dimension + +That usually means the bug is in projection layout, head packing, or cache format rather than in the rotary kernel itself. + +## Step 7: Combine with compute-sanitizer + +For harder bugs, combine kernel API logging with CUDA memory checking: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=3 +export SGLANG_KERNEL_API_LOGDEST=debug.log + +compute-sanitizer --tool memcheck python3 /tmp/sglang_llm_crash.py +``` + +Use `debug.log` to see the exact inputs that reached the crashing API boundary. + +Typical `compute-sanitizer` output: + +```text +========= COMPUTE-SANITIZER +========= Invalid __global__ write of size 4 bytes +========= at 0x1234 in SomeKernel +========= by thread (256,0,0) in block (10,0,0) +========= Address 0x... is out of bounds +``` + +Use the sanitizer output to identify the failing kernel and use `debug.log` to identify the exact tensors that reached the API boundary right before it. 
+ +If you need more synchronous host-side error reporting, you can try `CUDA_LAUNCH_BLOCKING=1` as a separate follow-up experiment. It is not part of the default workflow because it changes execution timing and can hide concurrency-related behavior. + +## Step 8: Combine with cuda-gdb + +For crashes that need a stack trace instead of only memory diagnostics: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=3 +export SGLANG_KERNEL_API_LOGDEST=debug.log + +cuda-gdb --args python3 /tmp/sglang_llm_crash.py +``` + +Inside `cuda-gdb`: + +```text +(cuda-gdb) run +(cuda-gdb) where +``` + +Then correlate the backtrace with `debug.log`. + +## Step 9: Kernel-Level Debugging with printf() + +When you own the CUDA kernel, `printf()` is still useful for narrowing down bad indices, bad launch geometry, or broken state propagation. + +Basic pattern: + +```cpp +__global__ void MyKernel(const float* input, float* output, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + + if (threadIdx.x == 0 && blockIdx.x == 0) { + printf("n=%d input0=%f\n", n, input[0]); + } + + if (idx < n) { + output[idx] = input[idx] * 2.0f; + } +} +``` + +After launch, force the output to flush: + +```python +my_kernel(...) +torch.cuda.synchronize() +``` + +For warp-specialized kernels, do not blindly print only on `threadIdx.x == 0`. Pick one representative thread per warp or per specialization group instead. + +### Warp-Specialized Kernels: Choosing the Right Print Thread + +Problem: +- `threadIdx.x == 0` only prints from the first warp in the block +- for warp-specialized kernels, that often misses the warp or group that is actually wrong + +Better pattern: + +```cpp +__global__ void WarpSpecializedKernel(...) { + // Example: first lane of each warp + if ((threadIdx.x % 32) == 0) { + printf("warp=%d\n", threadIdx.x / 32); + } +} +``` + +Or, if the kernel is organized in larger specialization groups, print once per group instead of once per block. 
+ +Common mistake: + +```cpp +// Only warp 0 prints +if (threadIdx.x == 0) { + printf("warp=%d\n", threadIdx.x / 32); +} +``` + +### Quick Reference + +| Kernel Type | Print Condition | Notes | +|----------|----------|-------------| +| Simple kernel | `threadIdx.x == 0` | One thread per block is usually enough | +| Warp-specialized kernel | one representative lane per warp | e.g. `threadIdx.x % 32 == 0` | +| Group-specialized kernel | one representative lane per group | choose based on the kernel's scheduling layout | + +### Other Kernel Debugging Tools + +```cpp +assert(value >= 0.0f && "value must be non-negative"); +static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE must be warp aligned"); +``` + +## Environment Variables Reference + +| Variable | Values | Description | +|----------|--------|-------------| +| `SGLANG_KERNEL_API_LOGLEVEL` | `0` | No logging (default) | +| | `1` | Function names only | +| | `3` | Inputs and outputs with metadata | +| | `5` | Level 3 plus tensor statistics | +| | `10` | Level 5 plus crash-safe tensor dumps | +| `SGLANG_KERNEL_API_LOGDEST` | `stdout` | Log to stdout | +| | `stderr` | Log to stderr | +| | `` | Log to file | +| | `log_%i.txt` | `%i` expands to process ID | +| `SGLANG_KERNEL_API_DUMP_DIR` | `` | Directory for level-10 dumps | +| `SGLANG_KERNEL_API_DUMP_INCLUDE` | wildcard list | Only dump matching API names | +| `SGLANG_KERNEL_API_DUMP_EXCLUDE` | wildcard list | Skip matching API names | + +## Best Practices + +### 1. Start with Level 3 + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=3 +``` + +Level 3 is usually enough to catch wrong shapes, wrong dtypes, and wrong devices. + +### 2. Use Level 5 for Numerical Issues + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=5 +``` + +Use it when you suspect NaN or Inf values. + +### 3. Use Level 10 for Crash Reproduction + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=10 +``` + +This is the most useful mode when the process crashes before you can inspect live tensors. 
+ +If you need successful input/output dumps from a real model run, temporarily disable CUDA graph for that debug session. + +When level 10 is too noisy, pair it with `SGLANG_KERNEL_API_DUMP_INCLUDE` / `SGLANG_KERNEL_API_DUMP_EXCLUDE` instead of dumping every covered API. + +### 4. Log to File for Crashes + +```bash +export SGLANG_KERNEL_API_LOGDEST=crash.log +``` + +File logs are safer than stdout when the process aborts. + +### 5. Disable Logging in Production + +```bash +unset SGLANG_KERNEL_API_LOGLEVEL +``` + +When disabled, the decorator returns the original callable and adds no runtime logging overhead. + +## Troubleshooting + +### No Logs Appear + +Check: +1. `echo $SGLANG_KERNEL_API_LOGLEVEL` +2. `echo $SGLANG_KERNEL_API_LOGDEST` +3. Whether the failing path goes through a covered API boundary + +### Too Much Output + +Reduce the level: + +```bash +export SGLANG_KERNEL_API_LOGLEVEL=3 +``` + +### Statistics Are Skipped During CUDA Graph Capture + +If you see: +```text +statistics=[skipped: CUDA graph capture in progress] +``` + +That is expected. Level-5 statistics are intentionally skipped during CUDA graph capture to avoid synchronization side effects. + +### Tensor Dumps Are Skipped During CUDA Graph Capture + +If you see: +```text +Tensor dump skipped: CUDA graph capture in progress +``` + +That is also expected. Level-10 dumps require copying tensors to CPU, which is not allowed during CUDA graph capture. diff --git a/.claude/skills/debug-distributed-hang/SKILL.md b/.claude/skills/debug-distributed-hang/SKILL.md new file mode 100644 index 000000000000..4db4086b4155 --- /dev/null +++ b/.claude/skills/debug-distributed-hang/SKILL.md @@ -0,0 +1,248 @@ +--- +name: debug-distributed-hang +description: Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. 
Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations. +--- + +# Debugging Distributed Hangs in SGLang + +## Overview + +Hangs in distributed inference happen when ranks diverge in state, causing collective operations (AllGather, AllReduce, Broadcast, Barrier) to deadlock. Common causes: + +- **Size mismatch**: ranks pass different tensor sizes to a collective +- **Branch divergence**: one rank enters a collective, another skips it +- **Cascading state drift**: a small non-determinism (e.g., floating-point) propagates into different batch structures +- **Resource exhaustion**: one rank OOMs or crashes, others wait forever + +## Prerequisites + +- **py-spy**: `pip install py-spy` or system package. Requires root or `CAP_SYS_PTRACE` to attach to running processes. +- **cuda-gdb**: Ships with the CUDA toolkit. Ensure it's on your `PATH`. + +## Step 1: Confirm and Locate the Hang + +### 1a. Watchdog / py-spy + +SGLang's watchdog automatically dumps py-spy traces on timeout. Look for: + +``` +Scheduler watchdog timeout (self.watchdog_timeout=300, self.soft=False) +``` + +The py-spy dump shows the stack trace of each thread. The hanging thread is typically blocked in a CUDA synchronize or NCCL collective: + +``` +Thread (active): "MainThread" + cuStreamSynchronize (libcuda.so) + ... + forward_extend (model_runner.py) +``` + +SGLang has two watchdog modes (see `python/sglang/srt/utils/watchdog.py`): +- **Hard watchdog** (`soft=False`, default): dumps py-spy traces then sends `SIGQUIT` to kill the parent process. +- **Soft watchdog** (`soft=True`): only logs the timeout without killing the process, giving you more time to manually attach debuggers or collect coredumps. + +If the watchdog doesn't trigger, manually dump: + +```bash +py-spy dump --pid +``` + +### 1b. NCCL Debug Logging + +```bash +export NCCL_DEBUG=INFO +export NCCL_DEBUG_SUBSYS=COLL +``` + +Look for the last collective logged before the hang. 
Mismatched sizes show up as one rank waiting and another never entering. + +### 1c. CUDA Coredump + +When a process hangs, you can trigger a GPU coredump on demand to see which kernel is stuck. Set these env vars before launching: + +```bash +export CUDA_ENABLE_USER_TRIGGERED_COREDUMP=1 +export CUDA_COREDUMP_PIPE="/tmp/cuda_pipe_%h_%p" +export CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h_%p" +export CUDA_COREDUMP_SHOW_PROGRESS=1 +export CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory' +``` + +While the process is hanging, find the pipe via `/proc//fd/` and write to it to trigger the dump: + +```bash +ls /proc//fd/ -la 2>/dev/null | grep cuda_pipe +dd if=/dev/zero bs=1M count=1 > /tmp/cuda_pipe__ +``` + +Alternatively, if you don't need to keep the process alive, `kill -SIGABRT ` also triggers a CUDA coredump (but terminates the process). + +Then open with `cuda-gdb --batch -ex "target cudacore "`. On load, it immediately shows which kernel is stuck. For example: + +``` +Opening GPU coredump: +[Current focus set to CUDA kernel 0, grid 622721, cluster (4,0,0), block (16,0,0), thread (64,0,0), device 0, sm 0, warp 0, lane 0] +#0 0x00007f8029b2b040 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(512,1,1)>>> () +``` + +This told us the hang was in an NCCL AllGather — not a compute kernel. Combined with the py-spy stack pointing to `LogitsProcessor.forward` → `tensor_model_parallel_all_gather`, we knew it was an AllGather size mismatch between TP ranks. + + +### 1d. Identify the Collective + +From the stack traces and logs, identify: +- Which collective hangs (AllGather, AllReduce, Broadcast) +- Which code path invokes it (e.g., `LogitsProcessor`, `tensor_model_parallel_all_gather`) +- Whether it's a size mismatch or a missing participant + +## Step 2: Per-Rank Logging + +The key technique: each rank writes its own log file so you can diff them. 
+ +### Setup Pattern + +```python +import os + +_debug_files = {} + +def get_debug_file(rank): + key = f"rank{rank}" + if key not in _debug_files: + _debug_files[key] = open(f"/tmp/debug_rank{rank}.log", "w") + return _debug_files[key] +``` + +Gate logging behind an env var to avoid overhead in production. `SGLANG_DEBUG_HANG` is not a built-in SGLang env var — you need to add this check yourself in the code you're instrumenting: + +```python +if os.environ.get("SGLANG_DEBUG_HANG"): + f = get_debug_file(rank) + f.write(f"EVENT_NAME key1={val1} key2={val2}\n") + f.flush() +``` + +### What to Log + +Log structured events at key state-mutation points: + +```python +f.write(f"SCHED_BATCH step={step} num_reqs={n} extend_lens={lens}\n") +f.write(f"VERIFY predict_hash={hash} accept_len={alen}\n") +f.write(f"CACHE_INSERT rid={rid} num_tokens={n}\n") +``` + +Use consistent event names (uppercase prefix) for easy grep/diff. + +### Hash Large Tensors + +For tensor values, compute a hash instead of dumping raw data: + +```python +import hashlib +h = hashlib.md5(tensor.cpu().numpy().tobytes()).hexdigest()[:8] +f.write(f"LOGITS logits_hash={h}\n") +``` + +For token ID lists, `str(list).encode()` works: + +```python +h = hashlib.md5(str(tensor.tolist()).encode()).hexdigest()[:8] +``` + +### Avoid Implicit Synchronization + +`tensor.cpu()`, `tensor.tolist()`, and `tensor.numpy()` all trigger CUDA synchronization. This can: +- Change timing and mask or move the hang +- Deadlock if the log point is between two collectives that must run back-to-back + +Prefer logging values that are already on CPU (e.g., Python ints, list lengths, request IDs). When you must hash a GPU tensor, do it at a point where the GPU is already idle (e.g., between scheduler steps, not inside a model forward pass). 
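The advice to log values that already live on the CPU can be sketched with two small helpers. Both names (`short_hash`, `format_event`) are hypothetical; they reuse the `hashlib.md5(str(list).encode())` approach and the uppercase event-name convention from the snippets above, and they never touch a GPU tensor.

```python
import hashlib


def short_hash(values):
    """8-hex-char MD5 digest of a CPU-side sequence of Python ints."""
    # str(list(...)) gives a stable text form, so equal sequences hash equally.
    return hashlib.md5(str(list(values)).encode()).hexdigest()[:8]


def format_event(name, **fields):
    """Render one structured event line: uppercase name, then key=value pairs."""
    body = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"{name.upper()} {body}\n"
```

For example, `f.write(format_event("cache_insert", rid=rid, ids_hash=short_hash(token_ids)))` keeps each line short and diff-friendly, provided `token_ids` is already a Python list rather than a CUDA tensor.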
+ +## Step 3: Diff to Find the Diverge Point + +### Basic Diff + +```bash +# Extract specific event type +grep "^VERIFY" /tmp/debug_rank0.log > /tmp/v_r0.txt +grep "^VERIFY" /tmp/debug_rank1.log > /tmp/v_r1.txt +diff /tmp/v_r0.txt /tmp/v_r1.txt | head -20 +``` + +### Count Events + +```bash +grep -c "^VERIFY" /tmp/debug_rank*.log +``` + +If counts differ, one rank executed more iterations — that's already a diverge signal. + +### Find First Diverge + +The first diff line tells you the exact step where ranks diverge. All lines before it are identical — the root cause is at or before this step. + +## Step 4: Binary-Search the Root Cause + +Once you find the diverging event, trace backwards: + +### 4a. Identify Inputs + +For the diverging operation, list all its inputs. Add hash logging for each: + +```python +f.write( + f"OP_INPUTS input_a_hash={h_a} input_b_hash={h_b} " + f"input_c_hash={h_c} input_d_hash={h_d}\n" +) +``` + +### 4b. Diff Inputs Across Ranks + +Compare the hashes. Some inputs will match, some won't. The non-matching input is where divergence entered. + +### 4c. Recurse + +For the non-matching input, trace where it was produced and repeat: hash its inputs, diff across ranks, find the divergent one. Continue until you reach the root cause. + +## Step 5: Common Root Causes and Fixes + +### Floating-Point Non-Determinism + +**Symptom**: All "logical" inputs are identical (same logits after all-gather), but derived floating-point values (softmax, probabilities) differ across GPUs. + +**Example**: EAGLE speculative decoding — `F.softmax` → `top_k_renorm_prob` → `top_p_renorm_prob` produces slightly different `target_probs` on each GPU. The sampling kernel then picks different tokens. These flow into `output_ids` → radix cache → different prefix match depths → different `extend_seq_lens` → AllGather size mismatch → hang. + +### Random Number Divergence + +**Symptom**: Operations using `torch.rand` produce different values on each rank. 
+ +**Fix**: Generate on rank 0 and broadcast, or use a shared seed. + +### Conditional Code Paths + +**Symptom**: A condition (e.g., memory check, queue length) evaluates differently on different ranks, causing one rank to enter a collective while another skips it. + +**Fix**: Synchronize the condition value before branching, or restructure to ensure all ranks take the same path. + +### Pipeline Parallel (PP) Send/Recv Mismatch + +**Symptom**: In PP setups, one stage issues a `send` that the next stage never `recv`s (or vice versa), causing both to block indefinitely. Unlike TP hangs (collective mismatches), PP hangs typically involve point-to-point operations. + +**Fix**: Ensure all stages agree on the number of microbatches and the sequence of send/recv calls for each microbatch. + +## Step 6: Verify the Fix + +Run the failing test multiple times to confirm the fix is stable. Intermittent hangs require many runs. A test that hung ~30% of the time needs at least 10 clean passes to be confident. + +## Quick Reference + +| Technique | When to Use | +|-----------|-------------| +| py-spy dump | First step — see where each rank is stuck | +| `NCCL_DEBUG=INFO` | Identify which collective and sizes | +| CUDA coredump + `cuda-gdb` | See which GPU kernel is blocked | +| Per-rank log files | Compare rank states over time | +| Hash of tensors | Efficiently compare large tensors across ranks | +| `diff` on extracted events | Find the exact step of divergence | +| `broadcast(result, src=0)` | Fix floating-point or sampling non-determinism | diff --git a/.claude/skills/generate-profile/SKILL.md b/.claude/skills/generate-profile/SKILL.md new file mode 100644 index 000000000000..dae475cfafce --- /dev/null +++ b/.claude/skills/generate-profile/SKILL.md @@ -0,0 +1,143 @@ +--- +name: generate-profile +description: Generate an e2e profiling trace of an SGLang server run. Launches a server, validates accuracy, captures a Chrome-compatible trace, and returns the profile path. 
+--- + +# Generate an E2E Profile of an SGLang Server Run + +This skill launches an SGLang server, validates it with a quick accuracy test, generates a profiling trace, and returns the profile file path. + +## Prerequisites + +- A working SGLang installation (`pip install -e .` or equivalent) +- At least one available CUDA GPU + +## Step-by-step Workflow + +### Step 1: Launch the server + +```bash +CUDA_VISIBLE_DEVICES= sglang serve --model-path --port & +``` + +- Default model: `Qwen/Qwen3-8B` (good balance of speed and quality) +- Default port: `30000` +- The server runs in the background. Save the PID for cleanup. +- Use the GPU specified by the user's preferences (check memory files for GPU preferences). + +### Step 2: Wait for server readiness + +Poll the health endpoint until the server is ready: + +```bash +for i in $(seq 1 120); do + if curl -s http://127.0.0.1:/health 2>/dev/null | grep -q "ok\|healthy"; then + echo "Server ready" + break + fi + sleep 5 +done +``` + +The server prints **"The server is fired up and ready to roll!"** to stdout when ready. The health endpoint returns 200 once the server can accept requests. + +Typical startup time: 30-90 seconds depending on model size and whether CUDA graphs are being compiled. + +### Step 3: Validate accuracy (sanity check) + +```bash +python3 -m sglang.test.run_eval --host 127.0.0.1 --port --eval-name gsm8k --num-examples 20 +``` + +- Expected accuracy: **> 0.8** for capable models (Qwen3-8B, Llama-3.1-8B-Instruct, etc.) +- This is a quick sanity check, not a rigorous benchmark. +- `sglang.test.few_shot_gsm8k` is deprecated; use the unified `run_eval` entrypoint. +- If you intentionally need the old completion-style GSM8K path, add `--api completion`. +- If accuracy is unexpectedly low, something is wrong — do not proceed to profiling. + +### Step 4: Generate the profile + +```bash +python3 -m sglang.test.send_one --profile +``` + +This command: +1. Sends a request to the server +2. 
Triggers the profiler for 5 steps (default) +3. Generates a trace file under `/tmp//` +4. The trace directory contains: + - `-TP-0.trace.json.gz` — Chrome trace format (open in `chrome://tracing` or Perfetto) + - `server_args.json` — the server configuration used + +**Output format:** +``` +Dump profiling traces to /tmp/ +``` + +The profile path is printed to stdout. Parse it from the output. + +**Optional flags:** +- `--profile-steps N` — number of profiling steps (default: 5) +- `--profile-by-stage` — profile by stage (prefill/decode separately) +- `--profile-prefix ` — custom output prefix + +### Step 5: Kill the server + +`pkill -f` matches an extended regular expression, so alternation is written with a bare `|`: + +```bash +pkill -9 -f "sglang.launch_server|sglang serve|sglang.srt" +``` + +Wait a moment and verify no sglang processes remain: +```bash +sleep 2 && pgrep -af "sglang serve" || echo "Server killed" +``` + +### Step 6: Report the profile path + +Return the profile directory path (e.g., `/tmp/1773999986.4769795`) and list its contents so the user knows what files were generated. + +## Example Full Run + +```bash +# 1. Launch server +source cleanup/bin/activate +CUDA_VISIBLE_DEVICES=1 sglang serve --model-path Qwen/Qwen3-8B --port 30000 & + +# 2. Wait for ready +for i in $(seq 1 120); do + curl -s http://127.0.0.1:30000/health | grep -q "ok" && break + sleep 5 +done + +# 3. Accuracy check +python3 -m sglang.test.run_eval --host 127.0.0.1 --port 30000 --eval-name gsm8k --num-examples 20 +# Expected: Accuracy > 0.8 + +# 4. Profile +python3 -m sglang.test.send_one --profile +# Output: "Dump profiling traces to /tmp/1773999986.4769795" + +# 5. Cleanup +pkill -9 -f "sglang.launch_server|sglang serve|sglang.srt" +sleep 2 + +# 6. 
Check output +ls -la /tmp/1773999986.4769795/ +# 1773999986.4851577-TP-0.trace.json.gz (Chrome trace) +# server_args.json (server config) +``` + +## Customization + +- **Different port**: Pass `--port ` and use `--host 127.0.0.1 --port ` for test commands +- **Multi-GPU**: Use `--tp ` for tensor parallelism; trace files will be generated per TP rank +- **Longer profile**: Use `--profile-steps 10` for more steps in the trace +- **Stage profiling**: Use `--profile-by-stage` to separate prefill and decode phases + +## Viewing the Profile + +Open the `.trace.json.gz` file in: +- **Perfetto UI**: https://ui.perfetto.dev/ (drag and drop the file) +- **Chrome tracing**: `chrome://tracing` (load the file) + +Both support the gzipped Chrome trace format natively. diff --git a/.claude/skills/llm-serving-auto-benchmark/SKILL.md b/.claude/skills/llm-serving-auto-benchmark/SKILL.md new file mode 100644 index 000000000000..8c720e495291 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/SKILL.md @@ -0,0 +1,527 @@ +--- +name: llm-serving-auto-benchmark +description: Framework-independent LLM serving benchmark skill for comparing SGLang, vLLM, TensorRT-LLM, or another serving framework. Use when a user wants to find the best deployment command for one model across multiple serving frameworks under the same workload, GPU budget, and latency SLA. +--- + +# LLM Serving Auto Benchmark + +## Overview + +Use this skill to compare LLM serving frameworks such as SGLang, vLLM, and +TensorRT-LLM for the same model and workload. 
+ +Use a config-driven workflow: + +- keep launch-only capacity choices in each framework's `base_server_flags` +- put the search knobs in `search_space` +- run the same dataset scenarios for every framework +- generate a bounded candidate list from `search_space`, with the baseline + candidate included first +- keep failed candidates in the result file +- pick the best SLA-passing candidate after normalizing the results + +For model-specific starting points, prefer the shipped configs in +`configs/cookbook-llm/`. They define a framework-neutral LLM serving cookbook +model set and translate each entry into framework-native SGLang, vLLM, and +TensorRT-LLM server flags. Validate those configs before a real run: + +```bash +python .claude/skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py \ + .claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm +``` + +If you have captured target-environment `--help` files, add +`--help-dir `. That check only loads configs, verifies the +server flag names, and renders candidate commands; it does not launch model +servers. + +Prefer native tooling when it gives better coverage: + +- SGLang: `python -m sglang.auto_benchmark` when available, otherwise + `python -m sglang.bench_serving` +- vLLM: `vllm bench sweep serve` for server-parameter sweeps, otherwise + `vllm serve` plus `vllm bench serve` +- TensorRT-LLM: `trtllm-serve` for the OpenAI-compatible server plus the + TensorRT-LLM serving benchmark client or a common OpenAI-compatible benchmark + client + +TensorRT-LLM has one hard scope rule in this skill: the server backend is fixed +to `trtllm-serve serve --backend pytorch`. Do not search TensorRT-LLM backend +choice. If a request, config, or candidate asks for `trt`, an engine backend, or +any other non-PyTorch TensorRT-LLM server backend, reject that candidate as +unsupported for this skill and record the reason. 
This does not change the +benchmark client backend; the TensorRT-LLM benchmark client still uses +OpenAI-compatible modes such as `--backend openai` or `--backend openai-chat`. + +Only pick a winner after each requested framework has had its main serving knobs +tuned. + +The parameter lists in this skill are not a compatibility contract. They are +version-sensitive candidate knob families. Before every real run, record the +exact framework version or git commit and verify the concrete CLI flag names +with `--help` in the target environment. + +The default search style is framework-neutral: start from a mostly pure-TP +baseline, sweep a small set of high-impact runtime knobs, and cap the first +pass around 10 candidates per framework. Do not search memory fractions by +default. + +## Validation Environment + +This skill is target-agnostic. It assumes any one of the following is +available, and nothing more: + +- a local GPU host with Docker/Podman and the target framework images pulled; +- a remote GPU host reached via `ssh ` with the framework images already + running in a container there; +- a CI runner that can exec into a pre-built image for each framework. + +Do not assume a specific operator host name (`h100_sglang`, `b200_*`, +`radixark*`, `rtx5090_*`, etc.) inside this skill's own workflow. The concrete +SSH wiring, container names, workspace paths, and HF token plumbing for a given +box live in the operator-side per-host skills (for example `h100`, +`h100-sglang-diffusion`, `b200`, `rtx5090`, `radixark02`, `radixark03`); this +skill only requires that the caller can reach a shell inside a container with +`sglang`, `vllm`, or `tensorrt_llm` installed. + +Reference files are optional and version-sensitive. Treat historical flag notes +as evidence from one image, not as a compatibility guarantee for the next run. 
+ +Additional H100 validation on `2026-05-01` used two 2-card models with a +bounded search of two SGLang memory-fraction candidates and two vLLM +memory-utilization candidates. The workload was random input `512`, output +`64`, 8 prompts, and 2 warmup requests, only to prove the search and summary +path can finish quickly. + +| Model | GPUs | Best SGLang | Best vLLM | Artifact root | +| --- | --- | --- | --- | --- | +| `Qwen/Qwen3-8B` | 2x H100, TP=2 | `sglang_mem086`, 21.64 req/s, 1385.05 output tok/s, mean TTFT 70.54 ms | `vllm_mem080`, 22.88 req/s, 1464.25 output tok/s, mean TTFT 60.56 ms | `/data/bbuf/validate/core_skill_validation_20260501/qwen3_8b/auto_benchmark` | +| `mistralai/Mistral-7B-Instruct-v0.3` | 2x H100, TP=2 | `sglang_mem080`, 24.09 req/s, 1541.92 output tok/s, mean TTFT 61.47 ms | `vllm_mem090`, 24.76 req/s, 1584.54 output tok/s, mean TTFT 58.63 ms | `/data/bbuf/validate/core_skill_validation_20260501/mistral_7b_instruct_v03/auto_benchmark` | + +## Skill Scope + +This skill is a playbook plus a config+validator toolchain, not a turn-key +orchestrator. The operator still launches servers, drives workloads, and writes +one normalized JSONL row per candidate. + +The `scripts/` directory contains exactly two tools: + +- `validate_cookbook_configs.py`: load cookbook YAML, render bounded candidate + server commands, and check flag names against captured `--help` snapshots + without launching servers. +- `compare_benchmark_results.py`: turn normalized per-candidate JSONL into the + markdown and optional CSV tables described in the Output Contract. + +Cookbook configs under `configs/cookbook-llm/` must pass the validator. The +shorter [references/example-plan.yaml](references/example-plan.yaml) is a +one-off runtime-plan skeleton and is not expected to pass as-is. Use +[references/result-schema.md](references/result-schema.md) as the single source +of truth for SLA key names. 
+ +## Required Inputs + +Collect these before a long run: + +- model and tokenizer path, target frameworks, GPU model/count, multi-node + allowance, precision, and quantization constraints +- endpoint shape, workload source, dataset scenarios, SLA target, search budget, + and artifact output directory +- version manifest: framework package version or git commit, container/Python + environment, `--help` snapshots, and whether each search parameter was + accepted by that exact CLI + +If real production traffic is the goal, use the real request distribution. A +synthetic workload is fine for bring-up and first-pass comparison, but it is not +enough for a production choice. + +Record each scenario's input/output length distribution in the normalized +result rows. This is now part of the profiler handoff contract: if SGLang is +slower and `sglang-sota-performance` invokes `llm-torch-profiler-analysis`, +the profiler workload must reuse the slow SGLang benchmark scenario lengths +instead of falling back to its generic prefill `4090->1` and decode `1->2048` +defaults. + +## Known Gotchas + +Short list of failure modes that have bitten past validation runs. Check these +before starting a long sweep. + +- SGLang `fa3` attention backends need Hopper or newer. On A100, L40S, RTX + 5090, and older GPUs, drop `fa3` from the SGLang `search_space` and keep + `flashinfer` (or `triton` when FlashInfer is unavailable). +- SGLang `bench_serving` has two SGLang-facing backends: `--backend sglang` for + the native `/generate` endpoint and `--backend sglang-oai` for the + OpenAI-compatible endpoint. For cross-framework comparisons, prefer + `sglang-oai` so every framework is measured on the same request path. +- vLLM `--enable-dbo` only works when the target vLLM image is built with a + supported all2all backend. Keep DBO out of the default candidate list unless + the operator has verified the image. +- vLLM `--max-num-partial-prefills > 1` is model- and runtime-gated. 
Keep `1` + in the default pass; raise only after a preflight with the actual model. +- The historical TensorRT-LLM 1.0.0 validation image accepted + `--kv_cache_free_gpu_memory_fraction`; the older `--free_gpu_memory_fraction` + exited with a CLI error. TensorRT-LLM was refreshed to 1.2.1 stable and + 1.3.0 release candidates by 2026-04-28, so re-check the accepted flag name + via `--help` on the target image before a real run. +- The historical TensorRT-LLM 1.0.0 multi-GPU PyTorch-backend validation used + `--ipc=host`, `--ulimit memlock=-1`, `--ulimit stack=67108864`, + `--shm-size=16g`, and `NCCL_IB_DISABLE=1` (for single-node) or an equivalent + NCCL setup. Keep these as a starting point, not as a version-independent + requirement. +- The historical TensorRT-LLM 1.0.0 benchmark client took `--backend openai` or + `--backend openai-chat`; `--backend trtllm` was rejected. This is separate + from the server backend, which is pinned to `pytorch` by this skill. +- `trtllm` `benchmark_serving --dataset-name random` silently falls back to + ShareGPT sampling without `--random-ids` (or `--download-path`). +- `max_seq_len` / `max_model_len` / `context_length` candidates must cover + `max(input_len + output_len)` across every scenario, including values inside + `search_space`, not just the baseline. The validator checks this; do not + bypass it. + +## Secrets Hygiene + +- Never print `HF_TOKEN`, `HUGGINGFACE_HUB_TOKEN`, or any upstream API key into + a saved artifact. Pass them through container `-e VAR` (unquoted on the right + side so the host value is inherited) and keep them out of `server_command` + and `benchmark_command` fields written to the result JSONL. +- When a framework echoes the full argv at startup, scrub the log or redact + token-shaped substrings before uploading the artifact. 
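The redaction step in Secrets Hygiene can be sketched as a small filter run over a log or command string before it is saved. This is a minimal sketch, not part of the shipped scripts: the function name `redact_secrets` and both patterns are assumptions made for illustration, and the `hf_` prefix rule only covers Hugging Face-style tokens, so extend the patterns for any other key format in use.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# 1) VAR=value assignments for the token variables named in this section.
_ENV_ASSIGNMENT = re.compile(r"\b(HF_TOKEN|HUGGINGFACE_HUB_TOKEN)=\S+")
# 2) Bare token-shaped substrings (assumes Hugging Face-style "hf_" tokens).
_HF_TOKEN_SHAPE = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")


def redact_secrets(text: str) -> str:
    """Return `text` with token assignments and token-shaped substrings masked."""
    text = _ENV_ASSIGNMENT.sub(r"\1=<redacted>", text)
    return _HF_TOKEN_SHAPE.sub("<redacted>", text)
```

Applying the filter to `server_command`, `benchmark_command`, and any server log that echoes argv keeps the variable name visible for debugging while dropping the value.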
+ +## Fairness Rules + +Use these rules throughout the benchmark: + +- Run every framework on the same GPU type, GPU count, model weights, tokenizer, + precision, quantization policy, prompt distribution, output length target, and + sampling settings. +- Record framework version, git commit, container image, CUDA/NCCL versions, GPU + driver, visible GPU ids, launch command, and benchmark command. +- Warm the server before measuring. Restart or clear state between candidate + configurations when cache effects would bias the comparison. +- Compare steady-state fixed-QPS runs separately from burst throughput runs. +- Keep failed candidates in the final results with their failure reason. +- Report both raw throughput and SLA-passing throughput. The fastest failing + candidate is not the best deployment command. + +## Workflow + +### 1. Preflight + +Verify all requested frameworks before starting a search: + +```bash +python -m sglang.launch_server --help +python -m sglang.bench_serving --help +vllm serve --help +vllm serve --help=all +vllm bench serve --help +vllm bench serve --help=all +vllm bench sweep serve --help=all +trtllm-serve serve --help +python -m tensorrt_llm.serve.scripts.benchmark_serving --help +``` + +Use the framework-specific `--help` output in the target environment as the +source of truth. Do not keep a stale launch flag just because it appears in an +old note. + +vLLM 0.19 and newer use grouped help. Plain `vllm serve --help` only shows the +groups, so capture `--help=all` before deciding whether a search knob exists. + +Save these `--help` outputs into the run artifact directory. If a listed search +knob is missing from the current CLI, remove or translate that knob before +running the benchmark. Do not silently pass unknown flags. + +For TensorRT-LLM, also confirm that `trtllm-serve serve --help` accepts +`--backend pytorch`. If it does not, mark TensorRT-LLM unsupported in that +environment rather than falling back to a different server backend. 
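The rule about not passing unknown flags can be checked mechanically against the saved `--help` snapshots. The following is a rough sketch under stated assumptions: `missing_flags` is a hypothetical helper, not the shipped validator, and it only compares literal `--flag` spellings, so translating config keys such as `max_num_seqs` into CLI spellings is left to the caller.

```python
import re


def missing_flags(help_text: str, flags: list[str]) -> list[str]:
    """Return the entries of `flags` that never appear in captured --help text."""
    # Collect every long option spelled in the help output, e.g. "--max-num-seqs".
    known = set(re.findall(r"--[A-Za-z0-9][A-Za-z0-9_-]*", help_text))
    return [flag for flag in flags if flag not in known]


# Synthetic help text for illustration; not real output from any framework.
sample_help = """
  --max-num-seqs MAX_NUM_SEQS
  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
"""
unknown = missing_flags(sample_help, ["--max-num-seqs", "--enable-dbo"])
```

Any flag returned here should be removed or translated before the run, and the decision recorded in the artifact directory next to the help snapshot.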
+ +For each framework, launch a minimal server, confirm `/v1/models` or the native +model-info endpoint, send one streaming request, run one tiny benchmark with at +least 5 requests, then save the launch command, benchmark command, server log, +and benchmark output. + +Before any GPU-backed smoke run, check the requested GPU ids directly with +`nvidia-smi`. If a requested GPU is already in use, stop and record that fact. +Do not silently borrow a different GPU count for a performance comparison. It is +fine to run a smaller one-GPU smoke only when the result is clearly labeled as a +flow check rather than a fair throughput comparison. + +If the target environment runs through containers, follow +[references/container-runbook.md](references/container-runbook.md) and save image +tags, pull commands, launch/benchmark logs, and cleanup commands. + +### 2. Normalize The Workload + +Use one canonical workload for all frameworks. Recommended JSONL row shape: + +```json +{"prompt": [{"role": "user", "content": "Summarize this text."}], "output_len": 256} +{"prompt": "Write a short explanation of CUDA graphs.", "output_len": 128} +``` + +Optional fields: + +```json +{ + "prompt": [{"role": "user", "content": "Use low temperature."}], + "output_len": 256, + "extra_request_body": {"temperature": 0.0, "top_p": 0.95}, + "metadata": {"source": "prod-sample"} +} +``` + +When converting user data: + +- inspect at least 3 rows before conversion +- preserve request-level sampling options in `extra_request_body` +- do not include the final assistant answer in the prompt when that answer is + the target completion +- keep multimodal or tool-call payloads only if all requested frameworks support + the chosen endpoint shape + +For synthetic bring-up, use the shipped two-scenario shape: + +```yaml +dataset: + kind: random + num_prompts: 80 + scenario_names: [chat, summarization] + input_len: [1000, 8000] + output_len: [1000, 1000] +``` + +Each aligned `input_len` / `output_len` pair is 
one scenario. Do not take the +cartesian product unless the user asks for that. +Name each scenario and keep the aligned pair in the artifacts. For custom +datasets, compute or record representative `input_len` and `output_len` +buckets, at least p50 and p95 when possible, so later profiler runs can match +the slow bucket rather than profiling an unrelated synthetic shape. + +Before searching any sequence-length limit, compute the largest +`input_len + output_len` in the dataset. SGLang `context_length`, vLLM +`max_model_len`, and TensorRT-LLM `max_seq_len` must be at least that value for +every candidate that is expected to run all scenarios. + +### 3. Pick A Search Tier + +Use the smallest tier that can answer the user's question: + +- Tier 1: smoke and sanity. One baseline plus a few high-impact knobs. +- Tier 2: default. A bounded sweep over the most likely server settings. +- Tier 3: exhaustive. Only when the search space is already tight and the user + accepts a long run. + +Default budget: + +- `num_prompts: 80` for the default cross-framework comparison; `num_prompts: + 20` per scenario is acceptable for a smoke/flow check and must be labeled as + such in the artifact (not as a performance result). +- `search.max_candidates_per_framework: 10` for the first useful pass +- candidate generation: baseline first, then a bounded product or ordered + candidate list from `search_space` +- at most 5 QPS search rounds unless the user asks for more +- stop early when every candidate in one framework is clearly OOM or fails the + basic health check + +Keep these in `base_server_flags` unless the user specifically wants a capacity +or memory study: + +- SGLang `mem_fraction_static` +- SGLang `schedule_policy` +- vLLM `gpu_memory_utilization` +- TensorRT-LLM `kv_cache_free_gpu_memory_fraction` + +These are real knobs, but they widen the search quickly and often turn a serving +comparison into a memory-limit study. + +### 4. 
Tune SGLang + +Prefer the SGLang auto-benchmark runner when the target checkout supports it: + +```bash +python -m sglang.auto_benchmark run --config /path/to/sglang.yaml +``` + +Otherwise launch the server manually and benchmark with: + +```bash +python -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 256 \ + --num-prompts 80 \ + --request-rate 8 \ + --output-file /path/to/sglang/results.json \ + --output-details +``` + +Version-sensitive SGLang knob families to verify: + +- `tp_size`, `pp_size`, `dp_size`, `ep_size` +- `attention_backend`, `prefill_attention_backend`, `decode_attention_backend` +- `sampling_backend` +- `max_running_requests`, `max_queued_requests` +- `chunked_prefill_size`, `prefill_max_requests`, `max_prefill_tokens` +- `max_total_tokens`, `page_size` +- CUDA graph and piecewise CUDA graph settings +- speculative or EAGLE settings only after the non-speculative baseline is tuned + +Keep `mem_fraction_static` and `schedule_policy` pinned in the default pass, +matching the shared cookbook config style. + +For quick smoke tests, it is reasonable to disable CUDA graph and piecewise CUDA +graph startup work if the goal is only to prove the framework flow. Record those +flags in the artifact. Do not carry that smoke setting into a performance winner +unless the user asked to tune eager-mode serving. + +### 5. Tune vLLM + +Use vLLM's sweep runner when available: + +```bash +vllm bench sweep serve \ + --serve-cmd 'vllm serve --port 8000' \ + --bench-cmd 'vllm bench serve --backend vllm --model --port 8000 --dataset-name random --num-prompts 80' \ + --serve-params /path/to/vllm_serve_params.json \ + --bench-params /path/to/vllm_bench_params.json \ + --output-dir /path/to/vllm_results +``` + +If sweep support is unavailable, run `vllm serve` for each candidate and measure +with `vllm bench serve`. 
+ +Version-sensitive vLLM knob families to verify: + +- tensor, pipeline, data, decode-context, and expert parallelism +- `gpu_memory_utilization` +- `max_num_seqs` +- `max_num_batched_tokens` +- `max_model_len` +- `enable_chunked_prefill`, partial prefill limits, and DBO thresholds +- KV cache dtype and block size +- dtype and quantization settings +- CUDA graph capture sizes or eager-mode toggles when relevant +- prefix cache and speculative decoding settings only when the workload needs + those features + +vLLM should get a normal sweep, not one baseline command. See +[references/framework-reference.md](references/framework-reference.md) for +native command templates and cross-framework knob families. Confirm each flag on +the target image's `--help` before a run. + +Keep `gpu_memory_utilization` in the baseline for the default pass. Search it +only when the question is explicitly about fitting the model or trading capacity +against throughput. + +Keep DBO and all2all backend settings out of the default pass unless the target +vLLM environment is already set up for them. They are real tuning knobs, but a +candidate can fail at startup if the required all2all backend is not available. +Also preflight concurrent partial prefill before raising +`max_num_partial_prefills` above 1; some model/runtime combinations reject it at +startup. + +### 6. Tune TensorRT-LLM + +Use `trtllm-serve serve` as the server entrypoint when the target environment +supports it: + +```bash +trtllm-serve serve \ + --backend pytorch \ + --tp_size \ + --pp_size \ + --kv_cache_free_gpu_memory_fraction 0.75 \ + --host 0.0.0.0 \ + --port 8000 +``` + +Then benchmark the OpenAI-compatible endpoint with the TensorRT-LLM serving +benchmark client or with the same OpenAI-compatible client used for the other +frameworks. 
+ +In the historical TensorRT-LLM 1.0.0 validation image, +`benchmark_serving --dataset-name random` sampled from ShareGPT unless either +`--download-path` or `--random-ids` was passed. For a fast synthetic smoke test, +pass `--random-ids`, then confirm the behavior on the target TensorRT-LLM image. + +TensorRT-LLM flag names are especially version-sensitive. In the validated +TensorRT-LLM 1.0.0 image, the KV-cache memory flag accepted by +`trtllm-serve serve` was `--kv_cache_free_gpu_memory_fraction`, not +`--free_gpu_memory_fraction`. TensorRT-LLM 1.2.1 is the latest stable GitHub +release as of 2026-04-28, with 1.3.0 release candidates also published; verify +the current flag with `trtllm-serve serve --help` before running a search on any +GPU target. + +TensorRT-LLM backend policy for this skill: + +- launch the server with `--backend pytorch` +- keep `backend: pytorch` in `base_server_flags` +- do not add `backend` to `search_space` +- reject `trt`, engine-backed serving, or any other non-PyTorch TensorRT-LLM + server backend as unsupported for this skill + +Version-sensitive TensorRT-LLM knob families to verify: + +- `tp_size`, `pp_size`, and `ep_size` +- max batch size, max sequence length, max number of tokens, and KV-cache budget +- inflight batching and scheduler options +- extra LLM API options YAML used by `trtllm-serve` with the PyTorch backend + +The `trtllm-serve serve` CLI exposes fewer direct runtime knobs than SGLang or +vLLM. Use direct flags when they exist, then use `--extra_llm_api_options` for +PyTorch-backend settings that are not top-level CLI flags. Keep unsupported +backend or engine requests in the failure table instead of translating them. + +Keep `kv_cache_free_gpu_memory_fraction` in the baseline for the default pass. +Search `max_batch_size`, `max_num_tokens`, `max_seq_len`, and validated +PyTorch-backend config options first. The server backend remains fixed to +`pytorch`. + +### 7. 
Normalize Results + +Write one JSONL row per candidate using the schema in +[references/result-schema.md](references/result-schema.md). Then run: + +```bash +python .claude/skills/llm-serving-auto-benchmark/scripts/compare_benchmark_results.py \ + --input /path/to/candidates.jsonl \ + --output /path/to/summary.md +``` + +Rank candidates in this order: + +1. SLA passed +2. highest request throughput or goodput +3. highest output token throughput +4. lower mean TTFT +5. lower mean TPOT/ITL +6. lower GPU count or simpler deployment if performance is close + +Keep the SLA gate itself unchanged. In the cookbook configs and normalized +result schema, TTFT SLA still uses `max_p99_ttft_ms` and TPOT SLA still uses +`max_p99_tpot_ms`; only the default cross-candidate comparison order switches +to mean TTFT and mean TPOT. + +## Output Contract + +Return a compact report with workload/SLA, hardware and framework versions, best +deployment-command tables per framework/scenario, one cross-framework comparison +table, exact launch and benchmark commands for winners, and artifact paths for +workload, raw/normalized results, CSV or markdown summary, and server logs. + +When SGLang is not the winner, include a profiler handoff note with the slow +SGLang scenario name and the exact input/output lengths or percentile bucket to +pass to `llm-torch-profiler-analysis`. + +Include failed or excluded candidates with reasons. Explain that this table is a +record of tried configs that were not selected: candidates that failed, were +skipped by policy, or completed but missed the SLA. Add caveats for synthetic +workloads, incomplete fair searches, or framework-specific parameter +substitutions. + +Use [references/framework-reference.md](references/framework-reference.md) when +you need command templates, source links, or knob-family mappings. Use +[references/example-plan.yaml](references/example-plan.yaml) as the starting +point for a full cross-framework run plan. 
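The ranking order listed under Normalize Results can be expressed as a sort key. This is only a sketch: the field names (`sla_passed`, `request_throughput`, `output_token_throughput`, `mean_ttft_ms`, `mean_tpot_ms`, `gpu_count`) are assumptions for illustration, and the real key names are whatever `references/result-schema.md` defines. The rows are made-up numbers, not benchmark results.

```python
def rank_key(row: dict) -> tuple:
    """Sort key: SLA-passing first, then throughput (high), then latency (low)."""
    return (
        not row["sla_passed"],            # False sorts before True, so passes come first
        -row["request_throughput"],       # higher request throughput is better
        -row["output_token_throughput"],  # higher output token throughput is better
        row["mean_ttft_ms"],              # lower mean TTFT is better
        row["mean_tpot_ms"],              # lower mean TPOT/ITL is better
        row["gpu_count"],                 # fewer GPUs as the last tie-break
    )


# Illustrative rows only.
rows = [
    {"name": "a", "sla_passed": True, "request_throughput": 20.0,
     "output_token_throughput": 1300.0, "mean_ttft_ms": 70.0,
     "mean_tpot_ms": 12.0, "gpu_count": 2},
    {"name": "b", "sla_passed": False, "request_throughput": 30.0,
     "output_token_throughput": 1900.0, "mean_ttft_ms": 50.0,
     "mean_tpot_ms": 10.0, "gpu_count": 2},
    {"name": "c", "sla_passed": True, "request_throughput": 22.0,
     "output_token_throughput": 1400.0, "mean_ttft_ms": 65.0,
     "mean_tpot_ms": 11.0, "gpu_count": 2},
]
ranked = [row["name"] for row in sorted(rows, key=rank_key)]
```

A strict lexicographic sort treats any throughput difference as decisive, so the "if performance is close" condition on the last criterion still needs a human judgment or an explicit tolerance.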
diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/README.md b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/README.md new file mode 100644 index 000000000000..cad9f7b444bd --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/README.md @@ -0,0 +1,17 @@ +# Cookbook LLM Configs + +These configs define a framework-neutral LLM serving cookbook model set and translate each model into a three-framework run plan for SGLang, vLLM, and TensorRT-LLM. + +Scope: +- SGLang can preserve source-recipe `base_flags` and `search_space` where applicable; if a sequence limit is smaller than the default synthetic scenario, the config raises that limit so the shipped workload can run. +- vLLM uses framework-native `vllm serve` flags. The translation keeps the same model, tokenizer, dataset shape, GPU count, and high-impact batching/prefix-cache knobs; it does not copy SGLang-only parser or scheduler flags. +- TensorRT-LLM uses `trtllm-serve serve` with `backend: pytorch` fixed in `base_server_flags`. Backend choice is never searched. +- The two default random scenarios remain aligned pairs: `chat` uses `1000 -> 1000`, and `summarization` uses `8000 -> 1000`. + +Before a real run, capture the target framework `--help` output and validate the configs: + +```bash +python .claude/skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py .claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm +``` + +With captured help files, add `--help-dir ` to check the concrete flag names against that environment. This check only loads configs and renders candidate commands; it does not launch model servers. 
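The sequence-limit rule for the aligned scenario pairs comes down to simple arithmetic, sketched here. The helper name `required_seq_len` is hypothetical and the shipped validator may implement the check differently; the lengths are the two default scenarios, and the candidate limits are the values `12288` and `16384` used in the shipped configs.

```python
def required_seq_len(input_lens: list[int], output_lens: list[int]) -> int:
    """Largest input_len + output_len over aligned scenario pairs (no cartesian product)."""
    if len(input_lens) != len(output_lens):
        raise ValueError("input_len and output_len must be aligned pairs")
    return max(i + o for i, o in zip(input_lens, output_lens))


# Default scenarios: chat is 1000 -> 1000, summarization is 8000 -> 1000.
needed = required_seq_len([1000, 8000], [1000, 1000])

# Every candidate limit, including values inside search_space, must cover `needed`.
candidate_limits = [12288, 16384]
all_cover = all(limit >= needed for limit in candidate_limits)
```

With the default scenarios the summarization pair sets the requirement at 9000 tokens, which both candidate limits cover.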
diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-math-v2.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-math-v2.yaml new file mode 100644 index 000000000000..1649101f322c --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-math-v2.yaml @@ -0,0 +1,130 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: deepseek-math-v2.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: deepseek-ai/DeepSeek-Math-V2 + tokenizer: deepseek-ai/DeepSeek-Math-V2 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: deepseek-ai/DeepSeek-Math-V2 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/deepseek-math-v2 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + model_path: deepseek-ai/DeepSeek-Math-V2 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - flashinfer + decode_attention_backend: + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true 
+ server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-r1-0528.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-r1-0528.yaml new file mode 100644 index 000000000000..a7cbea9f741b --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-r1-0528.yaml @@ -0,0 +1,133 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: deepseek-r1-0528.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: deepseek-ai/DeepSeek-R1-0528 + tokenizer: deepseek-ai/DeepSeek-R1-0528 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: deepseek-ai/DeepSeek-R1-0528 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/deepseek-r1-0528 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + enable_symm_mem: true + model_path: deepseek-ai/DeepSeek-R1-0528 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: 
framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.1.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.1.yaml new file mode 100644 index 000000000000..2402e4600e39 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.1.yaml @@ -0,0 +1,132 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: deepseek-v3.1.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: deepseek-ai/DeepSeek-V3.1 + tokenizer: deepseek-ai/DeepSeek-V3.1 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: deepseek-ai/DeepSeek-V3.1 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/deepseek-v3.1 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + model_path: deepseek-ai/DeepSeek-V3.1 + 
mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.2.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.2.yaml new file mode 100644 index 000000000000..51f1ef07b791 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.2.yaml @@ -0,0 +1,132 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: deepseek-v3.2.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: deepseek-ai/DeepSeek-V3.2 + tokenizer: deepseek-ai/DeepSeek-V3.2 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: deepseek-ai/DeepSeek-V3.2 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/deepseek-v3.2 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + model_path: deepseek-ai/DeepSeek-V3.2 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: 
pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.yaml new file mode 100644 index 000000000000..a32a92ce544d --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/deepseek-v3.yaml @@ -0,0 +1,133 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: deepseek-v3.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: deepseek-ai/DeepSeek-V3 + tokenizer: deepseek-ai/DeepSeek-V3 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: deepseek-ai/DeepSeek-V3 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/deepseek-v3 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + enable_symm_mem: true + model_path: deepseek-ai/DeepSeek-V3 + mem_fraction_static: 0.82 + schedule_policy: lpm + 
search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/devstral-small-2-24b-instruct-2512.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/devstral-small-2-24b-instruct-2512.yaml new file mode 100644 index 000000000000..4d6d6782e70f --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/devstral-small-2-24b-instruct-2512.yaml @@ -0,0 +1,123 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: devstral-small-2-24b-instruct-2512.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: mistralai/Devstral-Small-2-24B-Instruct-2512 + tokenizer: mistralai/Devstral-Small-2-24B-Instruct-2512 + precision: auto + quantization: model default +hardware: + gpu_count: 1 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: mistralai/Devstral-Small-2-24B-Instruct-2512 + max_concurrency: + - null + - 16 + - 32 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 16.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/devstral-small-2-24b-instruct-2512 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + model_path: mistralai/Devstral-Small-2-24B-Instruct-2512 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: 
framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 1 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ernie-4.5-21b-a3b-pt.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ernie-4.5-21b-a3b-pt.yaml new file mode 100644 index 000000000000..ab6ddbfa31a0 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ernie-4.5-21b-a3b-pt.yaml @@ -0,0 +1,117 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: ernie-4.5-21b-a3b-pt.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: baidu/ERNIE-4.5-21B-A3B-PT + tokenizer: baidu/ERNIE-4.5-21B-A3B-PT + precision: auto + quantization: model default +hardware: + gpu_count: 1 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: baidu/ERNIE-4.5-21B-A3B-PT + max_concurrency: + - null + - 16 + - 32 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 16.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/ernie-4.5-21b-a3b-pt +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + model_path: baidu/ERNIE-4.5-21B-A3B-PT + 
mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 1 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.5.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.5.yaml new file mode 100644 index 000000000000..e5010eae5a1d --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.5.yaml @@ -0,0 +1,122 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: glm-4.5.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: zai-org/GLM-4.5 + tokenizer: zai-org/GLM-4.5 + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: zai-org/GLM-4.5 + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 8.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/glm-4.5 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + context_length: 9000 + model_path: zai-org/GLM-4.5 + trust_remote_code: true + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + trust_remote_code: true + gpu_memory_utilization: 0.9 + max_model_len: 9000 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + trust_remote_code: true + kv_cache_free_gpu_memory_fraction: 0.75 
+ max_seq_len: 9000 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 9000 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.6.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.6.yaml new file mode 100644 index 000000000000..b32f61f97896 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.6.yaml @@ -0,0 +1,135 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: glm-4.6.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: zai-org/GLM-4.6 + tokenizer: zai-org/GLM-4.6 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: zai-org/GLM-4.6 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/glm-4.6 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + model_path: zai-org/GLM-4.6 + trust_remote_code: true + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + 
max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + trust_remote_code: true + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + trust_remote_code: true + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.7-flash.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.7-flash.yaml new file mode 100644 index 000000000000..4e59645eed5f --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.7-flash.yaml @@ -0,0 +1,126 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: glm-4.7-flash.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: zai-org/GLM-4.7-Flash + tokenizer: zai-org/GLM-4.7-Flash + precision: auto + quantization: model default +hardware: + gpu_count: 1 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: zai-org/GLM-4.7-Flash + max_concurrency: + - null + - 16 + - 32 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 16.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/glm-4.7-flash +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + model_path: zai-org/GLM-4.7-Flash + trust_remote_code: true + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 1 + trust_remote_code: true + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: 
pytorch + tp_size: 1 + pp_size: 1 + trust_remote_code: true + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.7.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.7.yaml new file mode 100644 index 000000000000..0cdd86f86202 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-4.7.yaml @@ -0,0 +1,130 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: glm-4.7.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: zai-org/GLM-4.7 + tokenizer: zai-org/GLM-4.7 + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: zai-org/GLM-4.7 + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 8.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/glm-4.7 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + context_length: 9000 + model_path: zai-org/GLM-4.7 + trust_remote_code: true + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + 
chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + ep_size: + - 1 + - 2 + - 4 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + trust_remote_code: true + gpu_memory_utilization: 0.9 + max_model_len: 9000 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + trust_remote_code: true + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 9000 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 9000 + - 16384 + ep_size: + - 1 + - 2 + - 4 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-5-fp8.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-5-fp8.yaml new file mode 100644 index 000000000000..63df942fc3f7 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glm-5-fp8.yaml @@ -0,0 +1,132 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: glm-5-fp8.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: zai-org/GLM-5-FP8 + tokenizer: zai-org/GLM-5-FP8 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: zai-org/GLM-5-FP8 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/glm-5-fp8 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + model_path: zai-org/GLM-5-FP8 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + 
kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glyph.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glyph.yaml new file mode 100644 index 000000000000..f9b793b9ab01 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/glyph.yaml @@ -0,0 +1,126 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: glyph.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: zai-org/Glyph + tokenizer: zai-org/Glyph + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: zai-org/Glyph + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 8.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/glyph +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + reasoning_parser: glm45 + tool_call_parser: glm45 + model_path: zai-org/Glyph + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + 
decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/gpt-oss-120b.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/gpt-oss-120b.yaml new file mode 100644 index 000000000000..0ca920c07281 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/gpt-oss-120b.yaml @@ -0,0 +1,132 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: gpt-oss-120b.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: openai/gpt-oss-120b + tokenizer: openai/gpt-oss-120b + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: openai/gpt-oss-120b + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/gpt-oss-120b +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + model_path: openai/gpt-oss-120b + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + 
pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/intern-s1.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/intern-s1.yaml new file mode 100644 index 000000000000..ad30b4dd2b36 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/intern-s1.yaml @@ -0,0 +1,135 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: intern-s1.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: internlm/Intern-S1 + tokenizer: internlm/Intern-S1 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: internlm/Intern-S1 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/intern-s1 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + trust_remote_code: true + model_path: internlm/Intern-S1 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - 
flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-k2-instruct.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-k2-instruct.yaml new file mode 100644 index 000000000000..de03222c877f --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-k2-instruct.yaml @@ -0,0 +1,133 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: kimi-k2-instruct.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: moonshotai/Kimi-K2-Instruct + tokenizer: moonshotai/Kimi-K2-Instruct + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: moonshotai/Kimi-K2-Instruct + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/kimi-k2-instruct +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + trust_remote_code: true + model_path: moonshotai/Kimi-K2-Instruct + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: 
framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-k2.5.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-k2.5.yaml new file mode 100644 index 000000000000..51303495de29 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-k2.5.yaml @@ -0,0 +1,127 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: kimi-k2.5.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: moonshotai/Kimi-K2.5 + tokenizer: moonshotai/Kimi-K2.5 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: moonshotai/Kimi-K2.5 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/kimi-k2.5 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + trust_remote_code: true + model_path: moonshotai/Kimi-K2.5 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: 
pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-linear-48b-a3b-instruct.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-linear-48b-a3b-instruct.yaml new file mode 100644 index 000000000000..dc2b3a1e530b --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/kimi-linear-48b-a3b-instruct.yaml @@ -0,0 +1,121 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: kimi-linear-48b-a3b-instruct.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: moonshotai/Kimi-Linear-48B-A3B-Instruct + tokenizer: moonshotai/Kimi-Linear-48B-A3B-Instruct + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: moonshotai/Kimi-Linear-48B-A3B-Instruct + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 8.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/kimi-linear-48b-a3b-instruct +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + 
tp_size: 4 + trust_remote_code: true + model_path: moonshotai/Kimi-Linear-48B-A3B-Instruct + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ling-2.5-1t.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ling-2.5-1t.yaml new file mode 100644 index 000000000000..d4d076d1b36f --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ling-2.5-1t.yaml @@ -0,0 +1,134 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: ling-2.5-1t.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: inclusionAI/Ling-2.5-1T + tokenizer: inclusionAI/Ling-2.5-1T + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: true +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: inclusionAI/Ling-2.5-1T + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 2.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/ling-2.5-1t +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + pp_size: 2 + nnodes: 2 + trust_remote_code: true + tool_call_parser: qwen + model_path: inclusionAI/Ling-2.5-1T + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + pp_size: + - 1 + - 2 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + pipeline_parallel_size: 2 + trust_remote_code: true + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve 
+ backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 2 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llada2-1-mini.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llada2-1-mini.yaml new file mode 100644 index 000000000000..ab6795c33e57 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llada2-1-mini.yaml @@ -0,0 +1,130 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: llada2-1-mini.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: inclusionAI/LLaDA2.1-mini + tokenizer: inclusionAI/LLaDA2.1-mini + precision: auto + quantization: model default +hardware: + gpu_count: 1 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: inclusionAI/LLaDA2.1-mini + max_concurrency: + - 1 + - 2 + - 4 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/llada2-1-mini +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 1 + dllm_algorithm: JointThreshold + trust_remote_code: true + max_running_requests: 1 + attention_backend: flashinfer + model_path: inclusionAI/LLaDA2.1-mini + mem_fraction_static: 0.77 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 1 + - 2 + - 4 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 1 + - 2 + - 4 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + 
backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 1 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 1 + - 2 + - 4 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-3.1-70b-instruct.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-3.1-70b-instruct.yaml new file mode 100644 index 000000000000..cf7f88ed3667 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-3.1-70b-instruct.yaml @@ -0,0 +1,124 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: llama-3.1-70b-instruct.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: meta-llama/Llama-3.1-70B-Instruct + tokenizer: meta-llama/Llama-3.1-70B-Instruct + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: meta-llama/Llama-3.1-70B-Instruct + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 12.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/llama-3.1-70b-instruct +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + model_path: meta-llama/Llama-3.1-70B-Instruct + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + 
base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-3.3-70b-instruct.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-3.3-70b-instruct.yaml new file mode 100644 index 000000000000..a7fdc34c2e87 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-3.3-70b-instruct.yaml @@ -0,0 +1,118 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: llama-3.3-70b-instruct.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: meta-llama/Llama-3.3-70B-Instruct + tokenizer: meta-llama/Llama-3.3-70B-Instruct + precision: auto + quantization: model default +hardware: + gpu_count: 1 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: meta-llama/Llama-3.3-70B-Instruct + max_concurrency: + - null + - 16 + - 32 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 16.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/llama-3.3-70b-instruct +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tool_call_parser: llama3 + model_path: 
meta-llama/Llama-3.3-70B-Instruct + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 1 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-4-maverick-17b-128e-instruct-fp8.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-4-maverick-17b-128e-instruct-fp8.yaml new file mode 100644 index 000000000000..c97dfd6ab18e --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-4-maverick-17b-128e-instruct-fp8.yaml @@ -0,0 +1,122 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: llama-4-maverick-17b-128e-instruct-fp8.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 + tokenizer: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 + max_concurrency: + - null + - 2 + - 4 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 2.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/llama-4-maverick-17b-128e-instruct-fp8 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + context_length: 1000000 + trust_remote_code: true + enable_multimodal: true + model_path: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 4 + - 8 + - 12 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 1000000 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 4 + - 8 + - 12 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: 
fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 1000000 + trust_remote_code: true + search_space: + max_batch_size: + - 4 + - 8 + - 12 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 1000000 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-4-scout-17b-16e-instruct.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-4-scout-17b-16e-instruct.yaml new file mode 100644 index 000000000000..0b95311b2d8d --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/llama-4-scout-17b-16e-instruct.yaml @@ -0,0 +1,129 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: llama-4-scout-17b-16e-instruct.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: meta-llama/Llama-4-Scout-17B-16E-Instruct + tokenizer: meta-llama/Llama-4-Scout-17B-16E-Instruct + precision: bfloat16 + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: meta-llama/Llama-4-Scout-17B-16E-Instruct + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/llama-4-scout-17b-16e-instruct +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + enable_multimodal: true + context_length: 65536 + dtype: bfloat16 + trust_remote_code: true + model_path: meta-llama/Llama-4-Scout-17B-16E-Instruct + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 8 + - 16 + - 24 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 65536 + dtype: bfloat16 + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 8 + - 16 + - 24 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + 
tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 65536 + trust_remote_code: true + search_space: + max_batch_size: + - 8 + - 16 + - 24 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 65536 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/mimo-v2-flash.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/mimo-v2-flash.yaml new file mode 100644 index 000000000000..f98917f09aec --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/mimo-v2-flash.yaml @@ -0,0 +1,133 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: mimo-v2-flash.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: XiaomiMiMo/MiMo-V2-Flash + tokenizer: XiaomiMiMo/MiMo-V2-Flash + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: XiaomiMiMo/MiMo-V2-Flash + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/mimo-v2-flash +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + trust_remote_code: true + max_running_requests: 128 + chunked_prefill_size: 16384 + model_loader_extra_config: '{"enable_multithread_load": "true","num_threads": 64}' + attention_backend: fa3 + reasoning_parser: qwen3 + tool_call_parser: mimo + model_path: XiaomiMiMo/MiMo-V2-Flash + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 
4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/minimax-m2.1.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/minimax-m2.1.yaml new file mode 100644 index 000000000000..bbd62ea698cd --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/minimax-m2.1.yaml @@ -0,0 +1,121 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: minimax-m2.1.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: MiniMaxAI/MiniMax-M2.1 + tokenizer: MiniMaxAI/MiniMax-M2.1 + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: MiniMaxAI/MiniMax-M2.1 + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 8.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/minimax-m2.1 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + trust_remote_code: true + model_path: MiniMaxAI/MiniMax-M2.1 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 
12288 + trust_remote_code: true + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/minimax-m2.5.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/minimax-m2.5.yaml new file mode 100644 index 000000000000..70c4d2b2d128 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/minimax-m2.5.yaml @@ -0,0 +1,133 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: minimax-m2.5.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: MiniMaxAI/MiniMax-M2.5 + tokenizer: MiniMaxAI/MiniMax-M2.5 + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: MiniMaxAI/MiniMax-M2.5 + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/minimax-m2.5 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + trust_remote_code: true + model_path: MiniMaxAI/MiniMax-M2.5 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - 
fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + ep_size: + - 1 + - 4 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ministral-3-8b-instruct-2512.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ministral-3-8b-instruct-2512.yaml new file mode 100644 index 000000000000..5c80f7f0aa98 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ministral-3-8b-instruct-2512.yaml @@ -0,0 +1,121 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: ministral-3-8b-instruct-2512.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: mistralai/Ministral-3-8B-Instruct-2512 + tokenizer: mistralai/Ministral-3-8B-Instruct-2512 + precision: auto + quantization: model default +hardware: + gpu_count: 1 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: mistralai/Ministral-3-8B-Instruct-2512 + max_concurrency: + - null + - 16 + - 32 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 16.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/ministral-3-8b-instruct-2512 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + trust_remote_code: true + tool_call_parser: mistral + model_path: mistralai/Ministral-3-8B-Instruct-2512 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + 
backend: pytorch + tp_size: 1 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/mistral-small-4-119b-2603.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/mistral-small-4-119b-2603.yaml new file mode 100644 index 000000000000..2970dca5ccb7 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/mistral-small-4-119b-2603.yaml @@ -0,0 +1,124 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: mistral-small-4-119b-2603.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: mistralai/Mistral-Small-4-119B-2603 + tokenizer: mistralai/Mistral-Small-4-119B-2603 + precision: auto + quantization: model default +hardware: + gpu_count: 2 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: mistralai/Mistral-Small-4-119B-2603 + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 6.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/mistral-small-4-119b-2603 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 2 + 
model_path: mistralai/Mistral-Small-4-119B-2603 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 2 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 2 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/nemotron-3-nano-30b-a3b-bf16.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/nemotron-3-nano-30b-a3b-bf16.yaml new file mode 100644 index 000000000000..bb031103e8c0 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/nemotron-3-nano-30b-a3b-bf16.yaml @@ -0,0 +1,128 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: nemotron-3-nano-30b-a3b-bf16.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 + tokenizer: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 + precision: auto + quantization: model default +hardware: + gpu_count: 1 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 + max_concurrency: + - null + - 16 + - 32 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 16.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/nemotron-3-nano-30b-a3b-bf16 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 1 + trust_remote_code: true + kv_cache_dtype: fp8_e4m3 + model_path: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 1 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: fp8_e4m3 + trust_remote_code: true + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + 
server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 1 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/nemotron-3-super-120b-a12b-bf16.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/nemotron-3-super-120b-a12b-bf16.yaml new file mode 100644 index 000000000000..f9f5a03a67c8 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/nemotron-3-super-120b-a12b-bf16.yaml @@ -0,0 +1,128 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: nemotron-3-super-120b-a12b-bf16.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 + tokenizer: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 6.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/nemotron-3-super-120b-a12b-bf16 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + trust_remote_code: true + kv_cache_dtype: fp8_e4m3 + model_path: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: fp8_e4m3 + trust_remote_code: true + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: 
true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-235b-a22b.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-235b-a22b.yaml new file mode 100644 index 000000000000..4bc39b66edaa --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-235b-a22b.yaml @@ -0,0 +1,132 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: qwen3-235b-a22b.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: Qwen/Qwen3-235B-A22B + tokenizer: Qwen/Qwen3-235B-A22B + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: Qwen/Qwen3-235B-A22B + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/qwen3-235b-a22b +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + model_path: Qwen/Qwen3-235B-A22B + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 
8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-coder-480b-a35b-instruct.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-coder-480b-a35b-instruct.yaml new file mode 100644 index 000000000000..7583867f2124 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-coder-480b-a35b-instruct.yaml @@ -0,0 +1,131 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: qwen3-coder-480b-a35b-instruct.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: Qwen/Qwen3-Coder-480B-A35B-Instruct + tokenizer: Qwen/Qwen3-Coder-480B-A35B-Instruct + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: Qwen/Qwen3-Coder-480B-A35B-Instruct + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/qwen3-coder-480b-a35b-instruct +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + ep_size: 
2 + moe_runner_backend: triton + model_path: Qwen/Qwen3-Coder-480B-A35B-Instruct + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - flashinfer + decode_attention_backend: + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 2 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + ep_size: 2 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 2 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-coder-next.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-coder-next.yaml new file mode 100644 index 000000000000..8e8be77d3224 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-coder-next.yaml @@ -0,0 +1,124 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: qwen3-coder-next.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model 
and dataset shape. +model: + name: Qwen/Qwen3-Coder-Next + tokenizer: Qwen/Qwen3-Coder-Next + precision: auto + quantization: model default +hardware: + gpu_count: 2 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: Qwen/Qwen3-Coder-Next + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 12.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/qwen3-coder-next +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 2 + model_path: Qwen/Qwen3-Coder-Next + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 2 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 2 
+ pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-next-80b-a3b-instruct.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-next-80b-a3b-instruct.yaml new file mode 100644 index 000000000000..81c8f7d42d00 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen3-next-80b-a3b-instruct.yaml @@ -0,0 +1,130 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: qwen3-next-80b-a3b-instruct.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: Qwen/Qwen3-Next-80B-A3B-Instruct + tokenizer: Qwen/Qwen3-Next-80B-A3B-Instruct + precision: auto + quantization: model default +hardware: + gpu_count: 2 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: Qwen/Qwen3-Next-80B-A3B-Instruct + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 12.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/qwen3-next-80b-a3b-instruct +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 2 + model_path: Qwen/Qwen3-Next-80B-A3B-Instruct + 
mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + ep_size: + - 1 + - 2 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 2 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 2 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 2 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen35-397b-a17b-fp8.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen35-397b-a17b-fp8.yaml new file mode 100644 index 000000000000..e2d2cf369329 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/qwen35-397b-a17b-fp8.yaml @@ -0,0 +1,132 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: qwen35-397b-a17b-fp8.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: Qwen/Qwen3.5-397B-A17B-FP8 + tokenizer: Qwen/Qwen3.5-397B-A17B-FP8 + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: Qwen/Qwen3.5-397B-A17B-FP8 + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 4.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/qwen35-397b-a17b-fp8 +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + model_path: Qwen/Qwen3.5-397B-A17B-FP8 + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + ep_size: + - 1 + - 4 + - 8 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + 
backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 + - 8 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ring-2.5-1t.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ring-2.5-1t.yaml new file mode 100644 index 000000000000..5c8aa8330c52 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/ring-2.5-1t.yaml @@ -0,0 +1,124 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: ring-2.5-1t.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. +model: + name: inclusionAI/Ring-2.5-1T + tokenizer: inclusionAI/Ring-2.5-1T + precision: auto + quantization: model default +hardware: + gpu_count: 8 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: inclusionAI/Ring-2.5-1T + max_concurrency: + - null + - 4 + - 8 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 2.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/ring-2.5-1t +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 8 + model_path: inclusionAI/Ring-2.5-1T + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + 
prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 32 + - 48 + - 64 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 8 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + search_space: + max_num_seqs: + - 32 + - 48 + - 64 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 8 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + search_space: + max_batch_size: + - 32 + - 48 + - 64 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 diff --git a/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/step-3.5-flash.yaml b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/step-3.5-flash.yaml new file mode 100644 index 000000000000..77ef1b45f52a --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/configs/cookbook-llm/step-3.5-flash.yaml @@ -0,0 +1,133 @@ +schema_version: 1 +source: + kind: llm_serving_cookbook + source_recipe_file: step-3.5-flash.yaml + translation: SGLang flags preserve the source recipe where applicable, with sequence limits raised when needed for the default dataset; vLLM and TensorRT-LLM use framework-native serving flags for the same model and dataset shape. 
+model: + name: stepfun-ai/Step-3.5-Flash + tokenizer: stepfun-ai/Step-3.5-Flash + precision: auto + quantization: model default +hardware: + gpu_count: 4 + multi_node: false +dataset: + kind: random + num_prompts: 80 + scenario_names: + - chat + - summarization + input_len: + - 1000 + - 8000 + output_len: + - 1000 + - 1000 +benchmark: + endpoint: /v1/completions + backend: openai-compatible + tokenizer: stepfun-ai/Step-3.5-Flash + max_concurrency: + - null + - 8 + - 16 + extra_request_body: + temperature: 0.0 + qps: + lower: 0.25 + upper: 8.0 + tolerance: 0.1 + sla: + max_p99_ttft_ms: 1500 + max_p99_tpot_ms: 30 + min_success_rate: 0.99 + output_dir: ./auto_benchmark_results/cookbook-llm/step-3.5-flash +search: + tier: 2 + max_candidates_per_framework: 8 + candidate_generation: baseline_first_bounded_product + resume: true +frameworks: + sglang: + enabled: true + server_command: python -m sglang.launch_server + base_server_flags: + tp_size: 4 + trust_remote_code: true + model_path: stepfun-ai/Step-3.5-Flash + mem_fraction_static: 0.82 + schedule_policy: lpm + search_space: + prefill_attention_backend: + - fa3 + - flashinfer + decode_attention_backend: + - fa3 + - flashinfer + chunked_prefill_size: + - 4096 + - 8192 + max_running_requests: + - 64 + - 96 + - 128 + ep_size: + - 1 + - 4 + vllm: + enabled: true + server_command: vllm serve + config_source: framework_generic_translation + base_server_flags: + tensor_parallel_size: 4 + gpu_memory_utilization: 0.9 + max_model_len: 12288 + dtype: auto + enable_chunked_prefill: true + kv_cache_dtype: auto + trust_remote_code: true + search_space: + max_num_seqs: + - 64 + - 96 + - 128 + max_num_batched_tokens: + - 8192 + - 16384 + max_num_partial_prefills: + - 1 + max_long_partial_prefills: + - 1 + long_prefill_token_threshold: + - 0 + - 4096 + enable_prefix_caching: + - true + block_size: + - 16 + tensorrt_llm: + enabled: true + server_command: trtllm-serve serve + backend_policy: fixed_pytorch + config_source: 
framework_generic_translation + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + max_seq_len: 12288 + trust_remote_code: true + search_space: + max_batch_size: + - 64 + - 96 + - 128 + max_num_tokens: + - 8192 + - 16384 + max_seq_len: + - 12288 + - 16384 + ep_size: + - 1 + - 4 diff --git a/.claude/skills/llm-serving-auto-benchmark/references/container-runbook.md b/.claude/skills/llm-serving-auto-benchmark/references/container-runbook.md new file mode 100644 index 000000000000..59da36c37d0a --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/references/container-runbook.md @@ -0,0 +1,321 @@ +# Container Runbook + +Use this runbook when the benchmark environment is container-based. It records +the exact image, command, help output, server log, benchmark log, and cleanup +step for each framework. + +This runbook is target-agnostic. Every `docker run` / `docker exec` command +works on a local box, an SSH-reachable remote GPU host, or a CI runner; the +per-host skills (for example `h100`, `b200`, `rtx5090`, `radixark02`, +`radixark03`) only add the SSH wrapper, container name, and workspace path +for a specific operator box. Substitute those values where you see +`$SGLANG_CONTAINER`, `$SGLANG_WORKSPACE`, and similar; nothing below assumes +an H100. + +## Common Setup + +Pull the images that will be used: + +```bash +docker pull lmsysorg/sglang:dev +docker pull vllm/vllm-openai:latest +docker pull nvcr.io/nvidia/tensorrt-llm/release:latest +``` + +Use quoted Docker GPU device lists: + +```bash +GPU_ARG='"device=6,7"' +docker run --gpus "$GPU_ARG" ... +``` + +The unquoted form `--gpus device=6,7` can be parsed incorrectly by Docker. + +Mount the shared Hugging Face cache and pass tokens through environment variables +when gated models are used: + +```bash +-v /data/.cache:/root/.cache \ +-e HF_TOKEN \ +-e HUGGINGFACE_HUB_TOKEN +``` + +Do not print token values into logs. 
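If a harness records server output or the commands it ran, one way to honor the no-token-logging rule is to mask token values before any line is written. The sketch below is illustrative only: `redact_tokens`, `SECRET_ENV_VARS`, and the `***` mask are hypothetical names that are not part of this skill's scripts; the variable names `HF_TOKEN` and `HUGGINGFACE_HUB_TOKEN` come from the mount snippet above.

```python
import os

# Environment variables whose values must never reach a log file.
SECRET_ENV_VARS = ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN")


def redact_tokens(text, env=None):
    """Return `text` with the value of every known token variable masked."""
    env = os.environ if env is None else env
    for name in SECRET_ENV_VARS:
        value = env.get(name)
        if value:  # skip unset or empty variables
            text = text.replace(value, "***")
    return text
```

Passing `-e HF_TOKEN` without a value, as in the snippet above, already keeps the token out of the command line; a mask like this is only a second guard for captured output.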
+ +Set the run variables once and pass them into containers that need them: + +```bash +export MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0 +export TP=1 +export PP=1 +export PORT=8000 +export RUN_DIR=/tmp/llm-serving-auto-benchmark +mkdir -p "$RUN_DIR" +``` + +For synthetic validation, use two aligned scenarios rather than one tiny request +shape: + +```bash +# chat-like +RANDOM_INPUT_LEN=1000 +RANDOM_OUTPUT_LEN=1000 + +# summarization-like +RANDOM_INPUT_LEN=8000 +RANDOM_OUTPUT_LEN=1000 +``` + +For a fast smoke on larger models, 20 prompts per scenario is a reasonable +minimum. Do not treat that as a performance result. + +Set each framework's sequence-length limit to cover the largest scenario. For +the example above, use at least 9000 tokens for SGLang `--context-length`, vLLM +`--max-model-len`, and TensorRT-LLM `--max_seq_len`. + +Before launching a server, save the help output: + +```bash +python -m sglang.launch_server --help > artifacts/help/sglang_launch_server.txt +python -m sglang.bench_serving --help > artifacts/help/sglang_bench_serving.txt +vllm serve --help=all > artifacts/help/vllm_serve_all.txt +vllm bench serve --help=all > artifacts/help/vllm_bench_serve_all.txt +vllm bench sweep serve --help=all > artifacts/help/vllm_bench_sweep_serve_all.txt +trtllm-serve serve --help > artifacts/help/trtllm_serve.txt +python -m tensorrt_llm.serve.scripts.benchmark_serving --help \ + > artifacts/help/trtllm_benchmark_serving.txt +``` + +## SGLang + +If a prepared GPU host already has a long-running SGLang container (local or +reached via ssh; name is operator-specific), reuse it via `docker exec` +instead of creating a new container. 
The per-host skills — `h100`, +`h100-sglang-diffusion`, `b200`, `rtx5090`, `radixark02`, `radixark03`, +and similar — provide the concrete container name and workspace path for +that box; this runbook assumes the operator substitutes them: + +```bash +docker exec \ + -e MODEL \ + -e TP \ + -e PORT \ + "$SGLANG_CONTAINER" bash -lc " +cd \"\$SGLANG_WORKSPACE\" +python -m sglang.launch_server \\ + --model-path \"\$MODEL\" \\ + --tp-size \"\$TP\" \\ + --host 0.0.0.0 \\ + --port \"\$PORT\" +" +``` + +For a fresh container: + +```bash +docker run -d --name llmbench-sglang \ + --gpus "$GPU_ARG" \ + --network host \ + --ipc=host \ + -v /data/.cache:/root/.cache \ + -e MODEL \ + -e TP \ + -e PORT \ + -e HF_TOKEN \ + -e HUGGINGFACE_HUB_TOKEN \ + --entrypoint bash \ + lmsysorg/sglang:dev -lc ' +python -m sglang.launch_server \ + --model-path "$MODEL" \ + --tp-size "$TP" \ + --host 0.0.0.0 \ + --port "$PORT" +' +``` + +Then run either SGLang auto benchmark: + +```bash +python -m sglang.auto_benchmark run --config /path/to/sglang.yaml +``` + +or a tiny OpenAI-compatible smoke benchmark: + +```bash +python -m sglang.bench_serving \ + --backend sglang-oai \ + --host 127.0.0.1 \ + --port "$PORT" \ + --dataset-name random \ + --random-input-len 32 \ + --random-output-len 8 \ + --num-prompts 4 \ + --request-rate 1 \ + --max-concurrency 2 \ + --output-file "$RUN_DIR/sglang/results.json" \ + --output-details +``` + +## vLLM + +Server template: + +```bash +docker run -d --name llmbench-vllm \ + --gpus "$GPU_ARG" \ + --network host \ + --ipc=host \ + -v /data/.cache:/root/.cache \ + -e MODEL \ + -e TP \ + -e PORT \ + -e HF_TOKEN \ + -e HUGGINGFACE_HUB_TOKEN \ + --entrypoint bash \ + vllm/vllm-openai:latest -lc ' +vllm serve "$MODEL" \ + --host 0.0.0.0 \ + --port "$PORT" \ + --tensor-parallel-size "$TP" \ + --dtype auto \ + --gpu-memory-utilization 0.90 \ + --max-model-len 4096 \ + --max-num-seqs 64 \ + --max-num-batched-tokens 8192 \ + --enable-chunked-prefill \ + --kv-cache-dtype auto 
\ + --enable-prefix-caching \ + --trust-remote-code +' +``` + +Benchmark template: + +```bash +docker run --rm \ + --network host \ + -v /data/.cache:/root/.cache \ + -v "$RUN_DIR:/artifacts" \ + -e MODEL \ + -e PORT \ + --entrypoint bash \ + vllm/vllm-openai:latest -lc ' +vllm bench serve \ + --backend vllm \ + --base-url "http://127.0.0.1:$PORT" \ + --model "$MODEL" \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 256 \ + --num-prompts 80 \ + --request-rate 8 \ + --max-concurrency 64 \ + --save-result \ + --result-dir /artifacts/vllm \ + --result-filename results.json +' +``` + +Use `vllm bench sweep serve` when the target image supports it and the search +can be described with serve/bench parameter JSON files. + +## TensorRT-LLM + +This skill only supports the TensorRT-LLM PyTorch server backend. Keep +`--backend pytorch` in every `trtllm-serve serve` command. Do not switch the +server to `--backend trt`, an engine path, or any other backend; mark that +candidate unsupported instead. + +For single-node multi-GPU TensorRT-LLM containers, keep the IPC, ulimit, shared +memory, and NCCL settings below. In a multi-GPU PyTorch-backend validation +run (captured on an H100 host; the rule is not H100-specific), the server +entered `PyTorchConfig` but failed NCCL allreduce without these container +options; the same model and candidate list passed after adding them. Expect +the same requirement on any single-node multi-GPU target. 
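A preflight check can catch a missing container option before the server ever reaches NCCL allreduce. The sketch below is a hypothetical helper, not part of this skill: `missing_trtllm_options` is an invented name, and the required set simply mirrors the IPC, ulimit, shared-memory, and NCCL options used by the server template in this section. It only checks that each option is present, not that its value is appropriate for the host.

```python
def _pairs(argv):
    """Yield (flag, value) pairs for `--flag=value`, `--flag value`, and `-e VALUE` forms."""
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("--") and "=" in token:
            flag, value = token.split("=", 1)
            yield flag, value
        elif token.startswith("-") and i + 1 < len(argv) and not argv[i + 1].startswith("-"):
            yield token, argv[i + 1]
            i += 1  # the next token was consumed as this flag's value
        else:
            yield token, ""
        i += 1


def missing_trtllm_options(argv):
    """Return the multi-GPU container options that a docker-run argv lacks."""
    pairs = list(_pairs(argv))
    missing = []

    if ("--ipc", "host") not in pairs:
        missing.append("--ipc=host")

    # `--ulimit memlock=-1` and `--ulimit stack=67108864` in the template.
    ulimits = {value.split("=", 1)[0] for flag, value in pairs if flag == "--ulimit"}
    for name in ("memlock", "stack"):
        if name not in ulimits:
            missing.append(f"--ulimit {name}")

    if not any(flag == "--shm-size" and value for flag, value in pairs):
        missing.append("--shm-size")

    # `-e NCCL_IB_DISABLE=1` in the template.
    env_names = {value.split("=", 1)[0] for flag, value in pairs if flag in ("-e", "--env")}
    if "NCCL_IB_DISABLE" not in env_names:
        missing.append("-e NCCL_IB_DISABLE")

    return missing
```

An empty return value means every option from the template was found; anything else names the options to add before launching.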
+ +Server template: + +```bash +docker run -d --name llmbench-trtllm \ + --gpus "$GPU_ARG" \ + --ipc=host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + --shm-size=16g \ + --network host \ + -v /data/.cache:/root/.cache \ + -e MODEL \ + -e TP \ + -e PP \ + -e PORT \ + -e HF_TOKEN \ + -e HUGGINGFACE_HUB_TOKEN \ + -e NCCL_IB_DISABLE=1 \ + --entrypoint bash \ + nvcr.io/nvidia/tensorrt-llm/release:latest -lc ' +trtllm-serve serve "$MODEL" \ + --host 0.0.0.0 \ + --port "$PORT" \ + --backend pytorch \ + --tp_size "$TP" \ + --pp_size "$PP" \ + --max_batch_size 64 \ + --max_num_tokens 8192 \ + --max_seq_len 4096 \ + --kv_cache_free_gpu_memory_fraction 0.75 \ + --trust_remote_code +' +``` + +Benchmark template: + +```bash +docker run --rm \ + --network host \ + -v /data/.cache:/root/.cache \ + -v "$RUN_DIR:/artifacts" \ + -e MODEL \ + -e PORT \ + --entrypoint bash \ + nvcr.io/nvidia/tensorrt-llm/release:latest -lc ' +python -m tensorrt_llm.serve.scripts.benchmark_serving \ + --backend openai \ + --host 127.0.0.1 \ + --port "$PORT" \ + --endpoint /v1/completions \ + --model "$MODEL" \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 256 \ + --random-ids \ + --num-prompts 80 \ + --request-rate 8 \ + --max-concurrency 64 \ + --save-result \ + --result-dir /artifacts/trtllm \ + --result-filename results.json +' +``` + +For TensorRT-LLM 1.0.0, the serving benchmark client `--backend` choices are +`openai` and `openai-chat`. Do not pass `--backend trtllm`. This client flag is +separate from the server backend pinned above. + +## Cleanup + +Use unique container names per run and clean up by name: + +```bash +docker rm -f llmbench-sglang llmbench-vllm llmbench-trtllm +``` + +If a port remains bound after container cleanup, inspect it before killing +anything: + +```bash +ss -ltnp | grep ':8000' +ps -eo pid,ppid,user,etime,cmd | grep '' +``` + +Only kill raw PIDs when the command line proves they belong to the current +validation run. 
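
The ownership check in the cleanup step can be applied mechanically to the `ps` output. The following sketch is hypothetical: `owned_pids` is not part of this skill, it assumes the five-column `pid,ppid,user,etime,cmd` layout requested by the `ps -eo` command above, and it relies on a caller-supplied marker string (for example the run directory) that appears only in command lines of the current run.

```python
def owned_pids(ps_output, marker):
    """Return PIDs whose full command line contains `marker`."""
    pids = []
    for line in ps_output.splitlines():
        # Split into pid, ppid, user, etime, cmd; cmd keeps its internal spaces.
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header row or malformed line
        if marker in fields[4]:
            pids.append(int(fields[0]))
    return pids
```

Review the returned list by hand before killing anything; a marker that is too short can still match unrelated processes.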
diff --git a/.claude/skills/llm-serving-auto-benchmark/references/example-plan.yaml b/.claude/skills/llm-serving-auto-benchmark/references/example-plan.yaml new file mode 100644 index 000000000000..bc6e6c431791 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/references/example-plan.yaml @@ -0,0 +1,133 @@ +# Example run plan for the llm-serving-auto-benchmark skill. Baseline flags stay +# in base_server_flags, search knobs stay in search_space, and aligned dataset +# length pairs define scenarios. +# +# Note: this is the runtime plan shape (top-level `sla`, no `schema_version` or +# `server_command`). Cookbook configs in configs/cookbook-llm/ use the extended +# schema enforced by scripts/validate_cookbook_configs.py; do not run the +# validator against this file as-is. + +model: + name: Qwen/Qwen3-32B + tokenizer: Qwen/Qwen3-32B + precision: bf16 + quantization: none + +version_manifest: + sglang: + container_image: lmsysorg/sglang:dev + package_version: null + git_commit: null + server_help: artifacts/help/sglang_launch_server.txt + benchmark_help: artifacts/help/sglang_bench_serving.txt + vllm: + container_image: vllm/vllm-openai:latest + package_version: null + git_commit: null + server_help: artifacts/help/vllm_serve_all.txt + benchmark_help: artifacts/help/vllm_bench_serve_all.txt + sweep_help: artifacts/help/vllm_bench_sweep_serve_all.txt + tensorrt_llm: + container_image: nvcr.io/nvidia/tensorrt-llm/release:latest + package_version: null + git_commit: null + server_help: artifacts/help/trtllm_serve.txt + benchmark_help: artifacts/help/trtllm_benchmark_serving.txt + +hardware: + # Example values; replace with the actual target GPU (A100, H100, H200, + # B200, MI300, RTX 5090, etc.). gpu_model is recorded for fairness audit, + # not used as a scheduling hint. 
+ gpu_model: NVIDIA H100 80GB HBM3 + gpu_count: 4 + multi_node: false + +dataset: + kind: random + num_prompts: 80 + scenario_names: [chat, summarization] + input_len: [1000, 8000] + output_len: [1000, 1000] + canonical_jsonl: null + +benchmark: + endpoint: /v1/chat/completions + backend: auto + request_rates: null + max_concurrency: [null, 16, 32] + qps: + lower: 1.0 + upper: 12.0 + tolerance: 0.1 + max_rounds: 5 + extra_request_body: + temperature: 0.0 + +sla: + max_p99_ttft_ms: 2000 + max_p99_tpot_ms: 80 + min_success_rate: 0.99 + +search: + tier: 2 + max_candidates_per_framework: 10 + candidate_generation: baseline_first_bounded_product + resume: true + output_dir: /bench/results/llm-serving-auto-benchmark + +frameworks: + sglang: + enabled: true + base_server_flags: + tp_size: 4 + trust_remote_code: true + mem_fraction_static: 0.82 + schedule_policy: lpm + context_length: 12288 + search_space: + # Verify these names against `python -m sglang.launch_server --help`. + prefill_attention_backend: [fa3, flashinfer] + decode_attention_backend: [fa3, flashinfer] + chunked_prefill_size: [8192, 16384] + max_running_requests: [64, 128] + + vllm: + enabled: true + base_server_flags: + tensor_parallel_size: 4 + trust_remote_code: true + gpu_memory_utilization: 0.90 + max_model_len: 12288 + dtype: auto + search_space: + # Verify these names against `vllm serve --help=all`. + max_num_seqs: [64, 128] + max_num_batched_tokens: [8192, 16384] + enable_chunked_prefill: [true] + # Raise above 1 only after the target model/runtime supports concurrent partial prefill. 
+ max_num_partial_prefills: [1] + max_long_partial_prefills: [1] + long_prefill_token_threshold: [0, 4096] + enable_prefix_caching: [true] + kv_cache_dtype: [auto] + block_size: [16] + + tensorrt_llm: + enabled: true + backend_policy: fixed_pytorch + base_server_flags: + backend: pytorch + tp_size: 4 + pp_size: 1 + kv_cache_free_gpu_memory_fraction: 0.75 + trust_remote_code: true + search_space: + # Verify these names against `trtllm-serve serve --help`. + # Do not add backend choices here; TensorRT-LLM is fixed to the PyTorch backend. + max_batch_size: [64, 128] + max_num_tokens: [8192, 16384] + max_seq_len: [12288, 16384] + # Uncomment and point at concrete config files to sweep PyTorch-backend + # options via --extra_llm_api_options. A single [null] value contributes + # no dimension to the search. + # extra_llm_api_options: [null, /path/to/trt_llm_config_A.yaml] diff --git a/.claude/skills/llm-serving-auto-benchmark/references/framework-reference.md b/.claude/skills/llm-serving-auto-benchmark/references/framework-reference.md new file mode 100644 index 000000000000..35d094a17591 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/references/framework-reference.md @@ -0,0 +1,113 @@ +# Framework Reference + +Use this file when choosing native framework commands or translating tuning +knobs across SGLang, vLLM, and TensorRT-LLM. Always verify the concrete CLI in +the target container with `--help` before a long run. + +## Native Entry Points + +| Framework | Server | Benchmark | Notes | +| --- | --- | --- | --- | +| SGLang | `python -m sglang.launch_server` | `python -m sglang.auto_benchmark` or `python -m sglang.bench_serving` | Use `auto_benchmark` when available for server-flag search. Use `bench_serving` for direct native or OpenAI-compatible endpoint checks. | +| vLLM | `vllm serve` | `vllm bench sweep serve` or `vllm bench serve` | Prefer `bench sweep serve` when sweeping server and benchmark parameter JSON files. 
| +| TensorRT-LLM | `trtllm-serve serve --backend pytorch` | TensorRT-LLM serving benchmark client or a common OpenAI-compatible client | This skill does not cover engine-backed serving or non-PyTorch server backends. | + +Common source docs: + +- SGLang bench serving: +- vLLM benchmark sweeps: +- vLLM `bench sweep serve`: +- TensorRT-LLM `trtllm-serve`: +- TensorRT-LLM deployment guide: + +## Command Templates + +### SGLang + +```bash +python -m sglang.launch_server \ + --model-path \ + --tp-size \ + --port 30000 + +python -m sglang.bench_serving \ + --backend sglang-oai \ + --host 127.0.0.1 \ + --port 30000 \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 256 \ + --num-prompts 80 \ + --request-rate 8 +``` + +Use `--backend sglang` for SGLang-native `/generate` checks. Use +`--backend sglang-oai` when comparing against vLLM or TensorRT-LLM through an +OpenAI-compatible path. + +### vLLM + +```bash +vllm serve \ + --host 0.0.0.0 \ + --port 8000 \ + --tensor-parallel-size \ + --gpu-memory-utilization 0.90 \ + --max-model-len 4096 \ + --max-num-seqs 64 \ + --max-num-batched-tokens 8192 \ + --enable-chunked-prefill + +vllm bench serve \ + --backend vllm \ + --base-url http://127.0.0.1:8000 \ + --model \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 256 \ + --num-prompts 80 +``` + +### TensorRT-LLM + +```bash +trtllm-serve serve \ + --backend pytorch \ + --tp_size \ + --kv_cache_free_gpu_memory_fraction 0.75 \ + --host 0.0.0.0 \ + --port 8000 +``` + +Benchmark the OpenAI-compatible endpoint with the TensorRT-LLM serving benchmark +client or the same OpenAI-compatible client used for the other frameworks. Keep +server backend choice fixed to `pytorch`. + +## Knob Family Mapping + +Do not copy flag names across frameworks. Compare knob families, then translate +to the target CLI. 
+ +| Family | SGLang | vLLM | TensorRT-LLM | +| --- | --- | --- | --- | +| Parallelism | `--tp-size`, `--pp-size`, `--dp-size`, `--ep-size`, `--expert-parallel-size` | `--tensor-parallel-size`, `--pipeline-parallel-size`, `--data-parallel-size`, `--enable-expert-parallel` | `--tp_size`, `--pp_size`, `--ep_size`, `--gpus_per_node`, `--cluster_size` | +| Memory and KV cache | `--mem-fraction-static`, `--max-total-tokens`, `--kv-cache-dtype`, `--page-size`, `--cpu-offload-gb` | `--gpu-memory-utilization`, `--kv-cache-memory-bytes`, `--kv-cache-dtype`, `--block-size`, `--cpu-offload-gb` | `--kv_cache_free_gpu_memory_fraction`, plus `--max_num_tokens`, `--max_seq_len`, `--max_batch_size` | +| Batching and scheduler | `--max-running-requests`, `--schedule-policy`, `--chunked-prefill-size`, `--max-prefill-tokens`, `--prefill-max-requests` | `--max-num-seqs`, `--max-num-batched-tokens`, `--enable-chunked-prefill`, partial-prefill and DBO flags | `--max_batch_size`, `--max_num_tokens`, `--max_seq_len`; extra scheduler knobs may require `--extra_llm_api_options` | +| Attention/backend | `--attention-backend`, `--prefill-attention-backend`, `--decode-attention-backend`, `--sampling-backend` | `--attention-backend`, `--gdn-prefill-backend`, `--mm-encoder-attn-backend` | `--backend pytorch` is fixed; do not search backend choice | +| CUDA graph and compile | `--disable-cuda-graph`, `--cuda-graph-bs`, `--cuda-graph-max-bs`, `--disable-piecewise-cuda-graph`, `--enable-torch-compile` | `--enforce-eager`, `--compilation-config`, `--cudagraph-capture-sizes`, `--max-cudagraph-capture-size` | use direct flags or `--extra_llm_api_options`; record resolved PyTorch config from logs | +| Prefix/speculative | `--disable-radix-cache`, `--disable-chunked-prefix-cache`, speculative decoding flags | `--enable-prefix-caching`, `--speculative-config` | only use PyTorch-backend options accepted by the target image | +| Dtype, quantization, loading | `--dtype`, `--quantization`, `--load-format`, 
`--model-loader-extra-config`, `--trust-remote-code` | `--dtype`, `--quantization`, `--load-format`, `--model-loader-extra-config`, `--trust-remote-code`, `--hf-token` | `--trust_remote_code`, `--tokenizer`; engine build and non-PyTorch quantization flows are out of scope | + +## Version Rules + +Framework CLIs move quickly. For every real run: + +1. Record the framework package version, git commit, image tag, and help files. +2. Validate concrete flags with + `scripts/validate_cookbook_configs.py --help-dir `. +3. Move renamed or removed flags out of the run plan before benchmarking. +4. Record which frameworks were model-smoked and which only passed preflight. + +Historical validation from April 2026 used SGLang `0.5.10rc0`, vLLM `0.19.1`, +and TensorRT-LLM `1.0.0`. Treat those notes as old evidence, not as current +compatibility guarantees. diff --git a/.claude/skills/llm-serving-auto-benchmark/references/result-schema.md b/.claude/skills/llm-serving-auto-benchmark/references/result-schema.md new file mode 100644 index 000000000000..8193c3ececbb --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/references/result-schema.md @@ -0,0 +1,161 @@ +# Result Schema + +Write one JSON object per candidate. Keep failed candidates in the same file so +the final summary explains what was tried. + +## SLA Key Convention + +One canonical naming across this skill. Config files and normalized result rows +must agree. + +| Key | Where | Type | +| --- | --- | --- | +| `max_p99_ttft_ms` | both | float, milliseconds, p99 | +| `max_p99_tpot_ms` | both | float, milliseconds, p99 | +| `min_success_rate` | both | float in [0, 1] | +| `passed` | result only | bool; recomputed after the run | + +Do not use `max_ttft_ms` or `max_tpot_ms` without the `p99_` prefix; those names +hide whether the target is a mean or a tail. Older cookbook configs used mean +latency targets by accident and have been migrated to the p99 names above. 
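
As an illustration of how the `passed` flag could be recomputed from the keys in the table above, the sketch below compares measured p99 values and the success rate against the SLA block and rejects the unprefixed legacy names. It is not the skill's implementation: `recompute_sla_passed` is an invented name, the metric keys `p99_ttft_ms`, `p99_tpot_ms`, and `success_rate` are assumed to match the normalized result row, and treating a missing measurement as a failure is an assumption.

```python
LEGACY_KEYS = {"max_ttft_ms", "max_tpot_ms"}  # ambiguous: mean or tail?


def recompute_sla_passed(sla, metrics):
    """Return True only if every configured SLA limit is met by the measured metrics."""
    bad = LEGACY_KEYS & set(sla)
    if bad:
        raise ValueError(f"use the p99_ key names instead of {sorted(bad)}")

    checks = (
        # (SLA key, metric key, comparison that must hold)
        ("max_p99_ttft_ms", "p99_ttft_ms", lambda measured, limit: measured <= limit),
        ("max_p99_tpot_ms", "p99_tpot_ms", lambda measured, limit: measured <= limit),
        ("min_success_rate", "success_rate", lambda measured, limit: measured >= limit),
    )
    for sla_key, metric_key, ok in checks:
        limit = sla.get(sla_key)
        if limit is None:
            continue  # this limit is not configured
        measured = metrics.get(metric_key)
        # Assumption: a missing measurement fails the gate.
        if measured is None or not ok(float(measured), float(limit)):
            return False
    return True
```

With limits of 2000 ms TTFT, 80 ms TPOT, and a 0.99 success rate, a row measuring 1550 ms, 72 ms, and 0.995 passes, while the same row with a 2500 ms p99 TTFT does not.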
+ +The config-level SLA block lives under `benchmark.sla` (cookbook configs) or at +the top level (example plan). Either location is acceptable, but the key names +must match this table. + +## JSONL Row + +The values below (`gpu_model`, `gpu_count`, file paths, numeric metrics, etc.) +are illustrative. Replace them with the actual target hardware and measured +values; this schema is not tied to H100. + +```json +{ + "framework": "sglang", + "framework_version": "0.5.0", + "framework_commit": "abcdef0", + "candidate_id": "sglang-tp8-flashinfer", + "model": "meta-llama/Llama-3.1-70B-Instruct", + "status": "ok", + "failure_reason": "", + "hardware": { + "gpu_model": "NVIDIA H100 80GB HBM3", + "gpu_count": 8, + "visible_devices": "0,1,2,3,4,5,6,7" + }, + "workload": { + "kind": "custom", + "scenario": "chat", + "dataset_path": "/bench/workload.autobench.jsonl", + "input_len": 2048, + "output_len": 512, + "input_len_p50": 1800, + "input_len_p95": 4096, + "output_len_p50": 384, + "output_len_p95": 1024, + "num_prompts": 1000, + "request_rate": 16, + "max_concurrency": 256, + "endpoint": "/v1/chat/completions" + }, + "sla": { + "max_p99_ttft_ms": 2000, + "max_p99_tpot_ms": 80, + "min_success_rate": 0.99, + "passed": true + }, + "metrics": { + "request_throughput": 15.8, + "output_token_throughput": 12500.0, + "total_token_throughput": 42000.0, + "mean_ttft_ms": 430.0, + "p99_ttft_ms": 1550.0, + "mean_tpot_ms": 26.0, + "p99_tpot_ms": 72.0, + "mean_e2e_ms": 8200.0, + "p99_e2e_ms": 19000.0, + "success_rate": 0.995 + }, + "server_command": "python -m sglang.launch_server ...", + "benchmark_command": "python -m sglang.bench_serving ...", + "validated_cli_flags": { + "server": ["tp_size", "attention_backend"], + "benchmark": ["dataset_name", "request_rate", "max_concurrency"] + }, + "artifacts": { + "server_log": "/bench/sglang/server.log", + "raw_result": "/bench/sglang/results.jsonl", + "server_help": "/bench/sglang/help_launch_server.txt", + "benchmark_help":
"/bench/sglang/help_bench_serving.txt" + } +} +``` + +`input_len` and `output_len` are the representative scenario lengths used for +synthetic workloads or a named bucket. For custom production-like datasets, +also include p50/p95 buckets when available. These fields let +`sglang-sota-performance` pass the slow benchmark shape directly into +`llm-torch-profiler-analysis`: + +- prefill profile: `--prefill-input-len <input_len>` and + `--prefill-output-len 1` +- decode profile: `--decode-input-len 1` and + `--decode-output-len <output_len>` + +## Status Values + +- `ok`: benchmark finished and metrics are trustworthy +- `failed`: command failed for a known non-OOM reason +- `oom`: model or candidate exhausted GPU/host memory +- `timeout`: server or benchmark timed out +- `skipped`: intentionally not run, with a reason in `failure_reason` + +## Ranking Rule + +The default ranking is: + +1. `status == "ok"` +2. `sla.passed == true` +3. higher `metrics.request_throughput` +4. higher `metrics.output_token_throughput` +5. lower `metrics.mean_ttft_ms` +6. lower `metrics.mean_tpot_ms` +7. lower `hardware.gpu_count` + +If the user cares more about token throughput than request throughput, swap +steps 3 and 4 and state that in the final report. + +This ranking rule does not change the SLA gate. Keep `sla.max_p99_ttft_ms` and +`sla.max_p99_tpot_ms` as the tail-latency constraints; use mean TTFT and mean +TPOT only for default winner selection among rows that have already passed SLA. + +Missing metric semantics: + +- If `metrics.mean_ttft_ms` or `metrics.mean_tpot_ms` is absent from a row, the + ranking script treats that metric as the worst possible value, so the row + falls below any candidate with a real measurement for it.
Do not write `0` as a placeholder for "no + measurement"; leave the field out or set it to `null`. +- If `metrics.request_throughput` or `metrics.output_token_throughput` is + missing, the ranking script treats it as `0`, so the row ranks below any + candidate with a real, positive measurement for that key. A failed candidate + that still produced partial metrics should keep the metrics it did produce. + +## Final Report Tables + +The markdown summary must include these sections: + +1. `Best Commands By Framework`: one table per framework. Each table has one row + per workload scenario and includes the best candidate, SLA result, throughput, + latency metrics, GPU count, exact server command, and artifacts. +2. `Cross-Framework Best Comparison`: one table that compares the best SGLang, + vLLM, and TensorRT-LLM command for each scenario. Sort each scenario by the + ranking rule above so the best deployment choice is first. +3. `Failed Or SLA-Failing Candidates`: include this table when any candidate + failed, was skipped, or completed without passing SLA. This table records + tried configs that were not selected. Keep each reason concrete enough to + tell whether the candidate needs a retry, lower concurrency, a parameter fix, + or no further action.
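The default ranking and the missing-metric semantics described in this file can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the shipped script: `metric`, `rank_key`, `WORST`, and the sample rows are hypothetical, and `float("inf")` merely stands in for "worst possible value".

```python
from __future__ import annotations

from typing import Any

# Assumption: infinity stands in for "worst possible value" of a missing latency.
WORST = float("inf")


def metric(row: dict[str, Any], group: str, key: str, missing: float) -> float:
    """Return row[group][key] as a float, or `missing` if absent, null, or non-numeric."""
    section = row.get(group)
    value = section.get(key) if isinstance(section, dict) else None
    try:
        return float(value)
    except (TypeError, ValueError):
        return missing


def rank_key(row: dict[str, Any]) -> tuple[Any, ...]:
    """Sort key for the default ranking; sort descending so the best row comes first."""
    sla = row.get("sla")
    return (
        row.get("status") == "ok",                                # 1. status is ok
        isinstance(sla, dict) and sla.get("passed") is True,      # 2. SLA passed
        metric(row, "metrics", "request_throughput", 0.0),        # 3. higher is better
        metric(row, "metrics", "output_token_throughput", 0.0),   # 4. higher is better
        -metric(row, "metrics", "mean_ttft_ms", WORST),           # 5. lower is better
        -metric(row, "metrics", "mean_tpot_ms", WORST),           # 6. lower is better
        -metric(row, "hardware", "gpu_count", WORST),             # 7. lower is better
    )


# Hypothetical rows: "a" and "b" differ only in whether mean TTFT was measured.
rows = [
    {
        "candidate_id": "a",
        "status": "ok",
        "sla": {"passed": True},
        "metrics": {
            "request_throughput": 15.8,
            "output_token_throughput": 12500.0,
            "mean_ttft_ms": None,  # null means "no measurement", never 0
            "mean_tpot_ms": 26.0,
        },
        "hardware": {"gpu_count": 8},
    },
    {
        "candidate_id": "b",
        "status": "ok",
        "sla": {"passed": True},
        "metrics": {
            "request_throughput": 15.8,
            "output_token_throughput": 12500.0,
            "mean_ttft_ms": 430.0,
            "mean_tpot_ms": 26.0,
        },
        "hardware": {"gpu_count": 8},
    },
    {
        "candidate_id": "c",
        "status": "oom",
        "sla": {"passed": False},
        "metrics": {},
        "hardware": {"gpu_count": 8},
    },
]

ranked = sorted(rows, key=rank_key, reverse=True)
print([row["candidate_id"] for row in ranked])  # ['b', 'a', 'c']
```

Row `b` wins because its real mean TTFT beats the missing value in row `a`, and the `oom` row sorts last. To prefer token throughput over request throughput, swap the third and fourth tuple entries, as the ranking rule allows.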
diff --git a/.claude/skills/llm-serving-auto-benchmark/scripts/compare_benchmark_results.py b/.claude/skills/llm-serving-auto-benchmark/scripts/compare_benchmark_results.py new file mode 100755 index 000000000000..c7c4c21f66db --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/scripts/compare_benchmark_results.py @@ -0,0 +1,308 @@ +#!/usr/bin/env python3 +"""Summarize normalized cross-framework benchmark JSONL results.""" + +from __future__ import annotations + +import argparse +import csv +import json +from pathlib import Path +from typing import Any + + +def _get(row: dict[str, Any], path: str, default: Any = None) -> Any: + current: Any = row + for part in path.split("."): + if not isinstance(current, dict) or part not in current: + return default + current = current[part] + return current + + +def _float(row: dict[str, Any], path: str, default: float = 0.0) -> float: + value = _get(row, path, default) + try: + return float(value) + except (TypeError, ValueError): + return default + + +def _bool(row: dict[str, Any], path: str, default: bool = False) -> bool: + value = _get(row, path, default) + if isinstance(value, bool): + return value + if isinstance(value, str): + return value.lower() in {"1", "true", "yes", "y"} + return bool(value) + + +def _mean_ttft_ms(row: dict[str, Any]) -> float: + return _float(row, "metrics.mean_ttft_ms", 1e30) + + +def _mean_tpot_ms(row: dict[str, Any]) -> float: + return _float(row, "metrics.mean_tpot_ms", 1e30) + + +def _rank_key(row: dict[str, Any]) -> tuple[Any, ...]: + return ( + _get(row, "status") == "ok", + _bool(row, "sla.passed"), + _float(row, "metrics.request_throughput"), + _float(row, "metrics.output_token_throughput"), + -_mean_ttft_ms(row), + -_mean_tpot_ms(row), + -_float(row, "hardware.gpu_count", 1e30), + ) + + +def _is_winner_candidate(row: dict[str, Any]) -> bool: + return _get(row, "status") == "ok" and _bool(row, "sla.passed") + + +def _fmt(value: Any, digits: int = 2) -> str: + if value is None: + 
return "" + if isinstance(value, float): + return f"{value:.{digits}f}" + return str(value) + + +def _cell(value: Any, digits: int = 2) -> str: + text = _fmt(value, digits) + return text.replace("\n", "<br>").replace("|", "\\|") + + +def _scenario(row: dict[str, Any]) -> str: + for path in ( + "workload.scenario", + "workload.scenario_name", + "workload.dataset_scenario", + "workload.dataset_name", + "workload.kind", + "scenario", + ): + value = _get(row, path) + if value: + return str(value) + return "default" + + +def _server_command(row: dict[str, Any]) -> str: + return str(_get(row, "server_command") or _get(row, "launch_command") or "") + + +def _artifact_summary(row: dict[str, Any]) -> str: + artifacts = _get(row, "artifacts", {}) + if not isinstance(artifacts, dict): + return "" + parts = [] + for key in ("raw_result", "server_log", "benchmark_log", "summary"): + value = artifacts.get(key) + if value: + parts.append(f"{key}: {value}") + return "<br>".join(parts) + + +def load_rows(path: Path) -> list[dict[str, Any]]: + rows: list[dict[str, Any]] = [] + with path.open(encoding="utf-8") as f: + for line_no, line in enumerate(f, 1): + stripped = line.strip() + if not stripped: + continue + try: + row = json.loads(stripped) + except json.JSONDecodeError as exc: + raise SystemExit(f"{path}:{line_no}: invalid JSON: {exc}") from exc + if not isinstance(row, dict): + raise SystemExit(f"{path}:{line_no}: expected a JSON object") + rows.append(row) + return rows + + +def best_by_framework_and_scenario(rows: list[dict[str, Any]]) -> list[dict[str, Any]]: + best: dict[tuple[str, str], dict[str, Any]] = {} + for row in rows: + if not _is_winner_candidate(row): + continue + key = (str(_get(row, "framework", "unknown")), _scenario(row)) + if key not in best or _rank_key(row) > _rank_key(best[key]): + best[key] = row + return sorted( + best.values(), key=lambda row: (_scenario(row), _rank_key(row)), reverse=True + ) + + +def write_csv(path: Path, rows: list[dict[str, Any]]) -> None: + fields = [ + "framework", + "scenario", + "candidate_id", + "status", + "sla_passed", + "request_throughput", + "output_token_throughput", + "mean_ttft_ms", + "mean_tpot_ms", + "p99_ttft_ms", + "p99_tpot_ms", + "gpu_count", + "server_command", + "failure_reason", + ] + with path.open("w", encoding="utf-8", newline="") as f: + writer = csv.DictWriter(f, fieldnames=fields) + writer.writeheader() + for row in rows: + writer.writerow( + { + "framework": _get(row, "framework", ""), + "scenario": _scenario(row), + "candidate_id": _get(row, "candidate_id", ""), + "status": _get(row, "status", ""), + "sla_passed": _bool(row, "sla.passed"), + "request_throughput": _get(row, "metrics.request_throughput", ""), + "output_token_throughput": _get( + row, "metrics.output_token_throughput", "" + ), + "mean_ttft_ms": _get(row, "metrics.mean_ttft_ms", ""), + "mean_tpot_ms": _get(row, "metrics.mean_tpot_ms", ""), + "p99_ttft_ms": _get(row, "metrics.p99_ttft_ms",
""), + "p99_tpot_ms": _get(row, "metrics.p99_tpot_ms", ""), + "gpu_count": _get(row, "hardware.gpu_count", ""), + "server_command": _server_command(row), + "failure_reason": _get(row, "failure_reason", ""), + } + ) + + +def _append_best_commands_by_framework( + lines: list[str], scenario_winners: list[dict[str, Any]] +) -> None: + frameworks = sorted( + {str(_get(row, "framework", "unknown")) for row in scenario_winners} + ) + lines.extend(["## Best Commands By Framework", ""]) + for framework in frameworks: + lines.extend( + [ + f"### `{framework}`", + "", + "| Scenario | Candidate | Status | SLA | Req/s | Output tok/s | Total tok/s | Mean TTFT ms | Mean TPOT ms | Success rate | GPUs | Server command | Artifacts |", + "| --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | --- |", + ] + ) + rows = [row for row in scenario_winners if _get(row, "framework") == framework] + for row in sorted(rows, key=_scenario): + lines.append( + "| {scenario} | {candidate} | {status} | {sla} | {rps} | {otps} | {ttps} | {ttft} | {tpot} | {success} | {gpus} | {command} | {artifacts} |".format( + scenario=_cell(_scenario(row)), + candidate=_cell(_get(row, "candidate_id", "")), + status=_cell(_get(row, "status", "")), + sla=_cell(_bool(row, "sla.passed")), + rps=_cell(_get(row, "metrics.request_throughput")), + otps=_cell(_get(row, "metrics.output_token_throughput")), + ttps=_cell(_get(row, "metrics.total_token_throughput")), + ttft=_cell(_get(row, "metrics.mean_ttft_ms")), + tpot=_cell(_get(row, "metrics.mean_tpot_ms")), + success=_cell(_get(row, "metrics.success_rate")), + gpus=_cell(_get(row, "hardware.gpu_count")), + command=_cell(_server_command(row)), + artifacts=_cell(_artifact_summary(row)), + ) + ) + lines.append("") + + +def _append_cross_framework_table( + lines: list[str], scenario_winners: list[dict[str, Any]] +) -> None: + lines.extend( + [ + "## Cross-Framework Best Comparison", + "", + "| Scenario | Rank | Framework | Candidate | SLA | Req/s | 
Output tok/s | Mean TTFT ms | Mean TPOT ms | GPUs | Server command |", + "| --- | ---: | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | --- |", + ] + ) + scenario_names = sorted({_scenario(row) for row in scenario_winners}) + for scenario_name in scenario_names: + rows = [row for row in scenario_winners if _scenario(row) == scenario_name] + for rank, row in enumerate(sorted(rows, key=_rank_key, reverse=True), 1): + lines.append( + "| {scenario} | {rank} | {framework} | {candidate} | {sla} | {rps} | {otps} | {ttft} | {tpot} | {gpus} | {command} |".format( + scenario=_cell(scenario_name), + rank=rank, + framework=_cell(_get(row, "framework", "")), + candidate=_cell(_get(row, "candidate_id", "")), + sla=_cell(_bool(row, "sla.passed")), + rps=_cell(_get(row, "metrics.request_throughput")), + otps=_cell(_get(row, "metrics.output_token_throughput")), + ttft=_cell(_get(row, "metrics.mean_ttft_ms")), + tpot=_cell(_get(row, "metrics.mean_tpot_ms")), + gpus=_cell(_get(row, "hardware.gpu_count")), + command=_cell(_server_command(row)), + ) + ) + lines.append("") + + +def render_markdown(rows: list[dict[str, Any]]) -> str: + scenario_winners = best_by_framework_and_scenario(rows) + + lines = ["# Benchmark Summary", ""] + if not rows: + lines.append("No rows found.") + return "\n".join(lines) + "\n" + + _append_best_commands_by_framework(lines, scenario_winners) + _append_cross_framework_table(lines, scenario_winners) + + failed = [ + row + for row in rows + if _get(row, "status") != "ok" or not _bool(row, "sla.passed") + ] + if failed: + lines.extend( + [ + "", + "## Failed Or SLA-Failing Candidates", + "", + "This table records tried configs that were not selected. 
They either failed, were skipped by policy, or completed without passing the SLA.", + "", + "| Framework | Candidate | Status | SLA | Reason |", + "| --- | --- | --- | --- | --- |", + ] + ) + for row in failed: + lines.append( + "| {framework} | {candidate} | {status} | {sla} | {reason} |".format( + framework=_cell(_get(row, "framework", "")), + candidate=_cell(_get(row, "candidate_id", "")), + status=_cell(_get(row, "status", "")), + sla=_cell(_bool(row, "sla.passed")), + reason=_cell(_get(row, "failure_reason", "")), + ) + ) + return "\n".join(lines) + "\n" + + +def main() -> None: + parser = argparse.ArgumentParser() + parser.add_argument("--input", required=True, type=Path, help="Normalized JSONL") + parser.add_argument("--output", required=True, type=Path, help="Markdown summary") + parser.add_argument("--csv", type=Path, help="Optional CSV table") + args = parser.parse_args() + + rows = load_rows(args.input) + args.output.parent.mkdir(parents=True, exist_ok=True) + args.output.write_text(render_markdown(rows), encoding="utf-8") + if args.csv: + args.csv.parent.mkdir(parents=True, exist_ok=True) + write_csv(args.csv, sorted(rows, key=_rank_key, reverse=True)) + + +if __name__ == "__main__": + main() diff --git a/.claude/skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py b/.claude/skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py new file mode 100755 index 000000000000..549c446c1e22 --- /dev/null +++ b/.claude/skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py @@ -0,0 +1,434 @@ +#!/usr/bin/env python3 +"""Validate cross-framework cookbook benchmark configs. + +The validator is intentionally shallow: it proves that every config can be +loaded, translated into bounded candidate commands, and checked against the +known server flag surface. It does not launch model servers. 
+""" + +from __future__ import annotations + +import argparse +import itertools +import re +import shlex +from pathlib import Path +from typing import Any + +import yaml + +FRAMEWORKS = ("sglang", "vllm", "tensorrt_llm") +ALLOWED_SOURCE_KINDS = {"llm_serving_cookbook"} + +SEQUENCE_LIMIT_KEY = { + "sglang": "context_length", + "vllm": "max_model_len", + "tensorrt_llm": "max_seq_len", +} + +ALLOWED_SLA_KEYS = { + "max_p99_ttft_ms", + "max_p99_tpot_ms", + "min_success_rate", + "max_p99_e2e_ms", +} + +DEPRECATED_SLA_KEYS = { + "max_ttft_ms": "max_p99_ttft_ms", + "max_tpot_ms": "max_p99_tpot_ms", + "max_e2e_ms": "max_p99_e2e_ms", +} + +STATIC_SERVER_FLAGS = { + "sglang": { + "attention_backend", + "chunked_prefill_size", + "context_length", + "decode_attention_backend", + "dllm_algorithm", + "dtype", + "enable_multimodal", + "enable_symm_mem", + "ep_size", + "host", + "kv_cache_dtype", + "max_running_requests", + "mem_fraction_static", + "model_loader_extra_config", + "model_path", + "moe_runner_backend", + "nnodes", + "port", + "pp_size", + "prefill_attention_backend", + "reasoning_parser", + "schedule_policy", + "tool_call_parser", + "tp_size", + "trust_remote_code", + }, + "vllm": { + "block_size", + "dtype", + "enable_chunked_prefill", + "enable_prefix_caching", + "gpu_memory_utilization", + "host", + "kv_cache_dtype", + "long_prefill_token_threshold", + "max_long_partial_prefills", + "max_model_len", + "max_num_batched_tokens", + "max_num_partial_prefills", + "max_num_seqs", + "pipeline_parallel_size", + "port", + "tensor_parallel_size", + "trust_remote_code", + }, + "tensorrt_llm": { + "backend", + "ep_size", + "extra_llm_api_options", + "host", + "kv_cache_free_gpu_memory_fraction", + "max_batch_size", + "max_num_tokens", + "max_seq_len", + "port", + "pp_size", + "tp_size", + "trust_remote_code", + }, +} + +HELP_FILE_HINTS = { + "sglang": ("sglang", "launch"), + "vllm": ("vllm", "serve"), + "tensorrt_llm": ("trtllm", "serve"), +} + + +def flag_name(framework: 
str, key: str) -> str: + if framework in {"sglang", "vllm"}: + return "--" + key.replace("_", "-") + return "--" + key + + +def load_yaml(path: Path) -> dict[str, Any]: + with path.open(encoding="utf-8") as f: + data = yaml.safe_load(f) + if not isinstance(data, dict): + raise ValueError(f"{path}: expected a YAML mapping") + return data + + +def _as_list(value: Any) -> list[Any]: + if isinstance(value, list): + return value + return [value] + + +def _enabled(config: dict[str, Any], framework: str) -> bool: + return bool(config.get("frameworks", {}).get(framework, {}).get("enabled", False)) + + +def _max_required_sequence(dataset: dict[str, Any]) -> int: + input_len = dataset.get("input_len") + output_len = dataset.get("output_len") + if not isinstance(input_len, list) or not isinstance(output_len, list): + raise ValueError("dataset.input_len and dataset.output_len must be lists") + if len(input_len) != len(output_len): + raise ValueError("dataset.input_len and dataset.output_len must be aligned") + if not input_len: + raise ValueError("dataset.input_len and dataset.output_len must not be empty") + return max(int(i) + int(o) for i, o in zip(input_len, output_len, strict=True)) + + +def _candidate_dicts( + base_flags: dict[str, Any], + search_space: dict[str, Any], + limit: int, +) -> list[dict[str, Any]]: + # The base flags are always the first candidate and count toward the limit. + candidates = [dict(base_flags)] + keys = list(search_space) + values = [_as_list(search_space[key]) for key in keys] + for combo in itertools.product(*values): + # Check the limit before appending so limit=1 yields only the base candidate. + if len(candidates) >= limit: + break + candidate = dict(base_flags) + candidate.update(dict(zip(keys, combo, strict=True))) + if candidate not in candidates: + candidates.append(candidate) + return candidates + + +def _command_tokens( + framework: str, + config: dict[str, Any], + flags: dict[str, Any], +) -> list[str]: + server = config["frameworks"][framework] + command = shlex.split(server["server_command"]) + model = config["model"]["name"] + + if framework in {"vllm", "tensorrt_llm"}: +
command.append(model) + + for key, value in flags.items(): + if value is None or value is False: + continue + command.append(flag_name(framework, key)) + if value is not True: + command.append(str(value)) + + return command + + +def render_command( + framework: str, config: dict[str, Any], flags: dict[str, Any] +) -> str: + return shlex.join(_command_tokens(framework, config, flags)) + + +def _extract_help_flags(text: str) -> set[str]: + return { + item.lstrip("-") for item in re.findall(r"--[A-Za-z0-9][A-Za-z0-9_-]*", text) + } + + +def load_help_flags(help_dir: Path) -> dict[str, set[str]]: + help_flags: dict[str, set[str]] = {} + for framework, hints in HELP_FILE_HINTS.items(): + matches = [] + for path in help_dir.rglob("*.txt"): + name = path.name.lower() + if all(hint in name for hint in hints): + matches.append(path) + if matches: + text = "\n".join( + path.read_text(encoding="utf-8", errors="replace") for path in matches + ) + help_flags[framework] = _extract_help_flags(text) + return help_flags + + +def _known_flag( + framework: str, + key: str, + help_flags: dict[str, set[str]] | None, +) -> bool: + static_keys = STATIC_SERVER_FLAGS[framework] + if key not in static_keys: + return False + if not help_flags or framework not in help_flags: + return True + + concrete = flag_name(framework, key).lstrip("-") + aliases = {concrete, concrete.replace("-", "_"), concrete.replace("_", "-")} + return bool(aliases & help_flags[framework]) + + +def _validate_framework( + config: dict[str, Any], + framework: str, + help_flags: dict[str, set[str]] | None, + max_candidates: int, +) -> list[str]: + errors: list[str] = [] + server = config["frameworks"].get(framework) + if not isinstance(server, dict): + return [f"missing frameworks.{framework}"] + if not server.get("enabled", False): + return [] + + base_flags = server.get("base_server_flags") + search_space = server.get("search_space") + if not isinstance(base_flags, dict): + errors.append(f"{framework}: 
base_server_flags must be a mapping") + base_flags = {} + if not isinstance(search_space, dict): + errors.append(f"{framework}: search_space must be a mapping") + search_space = {} + server_command_is_valid = isinstance(server.get("server_command"), str) + if not server_command_is_valid: + errors.append(f"{framework}: server_command must be a string") + + for key in set(base_flags) | set(search_space): + if not _known_flag(framework, key, help_flags): + errors.append(f"{framework}: unknown or unsupported server flag {key!r}") + + if framework == "tensorrt_llm": + if server.get("backend_policy") != "fixed_pytorch": + errors.append("tensorrt_llm: backend_policy must be fixed_pytorch") + if base_flags.get("backend") != "pytorch": + errors.append("tensorrt_llm: base backend must be pytorch") + if "backend" in search_space: + errors.append("tensorrt_llm: backend must not appear in search_space") + + candidates = _candidate_dicts(base_flags, search_space, max_candidates) + if not candidates: + errors.append(f"{framework}: no candidates generated") + can_render = server_command_is_valid and isinstance( + config.get("model", {}).get("name"), str + ) + if can_render: + for candidate in candidates: + command = render_command(framework, config, candidate) + if not command: + errors.append(f"{framework}: rendered an empty command") + + return errors + + +def validate_config( + path: Path, + help_flags: dict[str, set[str]] | None = None, +) -> list[str]: + errors: list[str] = [] + try: + config = load_yaml(path) + except Exception as exc: # noqa: BLE001 + return [str(exc)] + + if config.get("schema_version") != 1: + errors.append("schema_version must be 1") + if not isinstance(config.get("model", {}).get("name"), str): + errors.append("model.name must be set") + if config.get("source", {}).get("kind") not in ALLOWED_SOURCE_KINDS: + errors.append(f"source.kind must be one of {sorted(ALLOWED_SOURCE_KINDS)}") + + try: + required_sequence = _max_required_sequence(config["dataset"]) 
+ except Exception as exc: # noqa: BLE001 + errors.append(str(exc)) + required_sequence = 0 + + search = config.get("search") + if not isinstance(search, dict): + errors.append("search must be a mapping") + max_candidates = 1 + else: + try: + max_candidates = int(search.get("max_candidates_per_framework", 0)) + except (TypeError, ValueError): + errors.append("search.max_candidates_per_framework must be an integer") + max_candidates = 1 + if max_candidates < 1: + errors.append("search.max_candidates_per_framework must be positive") + max_candidates = 1 + + frameworks = config.get("frameworks") + if not isinstance(frameworks, dict): + return errors + ["frameworks must be a mapping"] + + for framework in FRAMEWORKS: + errors.extend( + _validate_framework(config, framework, help_flags, max_candidates) + ) + + for framework in FRAMEWORKS: + if not _enabled(config, framework): + continue + key = SEQUENCE_LIMIT_KEY[framework] + fw = frameworks[framework] + base_flags = fw.get("base_server_flags", {}) or {} + search_space = fw.get("search_space", {}) or {} + if not isinstance(base_flags, dict) or not isinstance(search_space, dict): + continue + + try: + if framework == "sglang": + base_value = int(base_flags.get(key, required_sequence)) + else: + base_value = int(base_flags.get(key, 0)) + except (TypeError, ValueError): + errors.append(f"{framework}: base {key} is not an integer") + continue + if base_value < required_sequence: + errors.append( + f"{framework}: base {key} ({base_value}) is smaller than the largest dataset scenario ({required_sequence})" + ) + + if key in search_space: + for value in _as_list(search_space[key]): + try: + if int(value) < required_sequence: + errors.append( + f"{framework}: search_space {key} candidate {value} is smaller than the largest dataset scenario ({required_sequence})" + ) + except (TypeError, ValueError): + errors.append( + f"{framework}: search_space {key} candidate {value!r} is not an integer" + ) + + sla_block = ( + 
config.get("benchmark", {}).get("sla") + if isinstance(config.get("benchmark"), dict) + else None + ) + if sla_block is None: + sla_block = config.get("sla") + if isinstance(sla_block, dict): + for key in sla_block: + if key in DEPRECATED_SLA_KEYS: + errors.append( + f"sla: {key!r} is deprecated; use {DEPRECATED_SLA_KEYS[key]!r} (see references/result-schema.md)" + ) + elif key not in ALLOWED_SLA_KEYS: + errors.append( + f"sla: unknown key {key!r}; allowed keys are {sorted(ALLOWED_SLA_KEYS)}" + ) + + return errors + + +def iter_config_files(paths: list[Path]) -> list[Path]: + files: list[Path] = [] + for path in paths: + if path.is_dir(): + files.extend(sorted(path.rglob("*.yaml"))) + files.extend(sorted(path.rglob("*.yml"))) + else: + files.append(path) + return sorted(dict.fromkeys(files)) + + +def main() -> None: + parser = argparse.ArgumentParser() + parser.add_argument("paths", nargs="+", type=Path) + parser.add_argument("--help-dir", type=Path) + parser.add_argument("--print-commands", action="store_true") + args = parser.parse_args() + + help_flags = load_help_flags(args.help_dir) if args.help_dir else None + failed = False + for path in iter_config_files(args.paths): + errors = validate_config(path, help_flags) + if errors: + failed = True + for error in errors: + print(f"{path}: {error}") + continue + + if args.print_commands: + config = load_yaml(path) + limit = int(config["search"].get("max_candidates_per_framework", 1)) + for framework in FRAMEWORKS: + if not _enabled(config, framework): + continue + server = config["frameworks"][framework] + candidates = _candidate_dicts( + server["base_server_flags"], + server["search_space"], + limit, + ) + print(f"# {path.name} {framework}") + print(render_command(framework, config, candidates[0])) + + if failed: + raise SystemExit(1) + + +if __name__ == "__main__": + main() diff --git a/.claude/skills/llm-torch-profiler-analysis/SKILL.md b/.claude/skills/llm-torch-profiler-analysis/SKILL.md new file mode 100644 
index 000000000000..24fac5dddc1d --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/SKILL.md @@ -0,0 +1,453 @@ +--- +name: llm-torch-profiler-analysis +description: "Unified LLM torch-profiler triage skill for `sglang`, `vllm`, and `TensorRT-LLM`. Use it to inspect an existing `trace.json(.gz)` or profile directory, or to drive live profiling against a running server and return one three-table report with kernel, overlap-opportunity, and fuse-pattern tables." +--- + +# Unified LLM Torch Profiler Analysis + +## Overview + +Use this skill for `torch.profiler` analysis across: + +- `sglang` +- `vllm` +- `TensorRT-LLM` + +There is only one public workflow: + +- `triage` + +Preferred unified entrypoint: + +- [scripts/analyze_llm_torch_profile.py](scripts/analyze_llm_torch_profile.py) + +Backwards-compatibility shim (kept so older `docker exec ... analyze_sglang_torch_profile.py ...` calls keep working; it just forwards to the unified entrypoint): + +- [scripts/analyze_sglang_torch_profile.py](scripts/analyze_sglang_torch_profile.py) + +Markdown bundling helper: + +- [scripts/render_triage_markdown_bundle.py](scripts/render_triage_markdown_bundle.py) + +`triage` always prints the same three tables: + +- kernel table +- overlap-opportunity table +- fuse-pattern table + +By default, all three tables only render rows at or above `1.0%` cumulative GPU-time share. +Rows below that are hidden by default unless the user asks for a lower cutoff. + +Keep the fuse-pattern table source-backed and deterministic. +Do not turn it into a fuzzy matcher. 
+ +If exact source-backed matching is weak but a kernel cluster is still close to a known family, +add one short note after the tables with exactly one of: + +- `high` +- `medium` +- `low` + +## Capability Matrix + +| Capability | SGLang | vLLM | TensorRT-LLM | +| --- | --- | --- | --- | +| Existing trace triage | yes | yes | yes | +| Single-trace live capture | yes | yes, if torch profiler is enabled on server | requires profiler control endpoints | +| Two-trace mapping+formal triage | yes | yes | yes | +| Stage-separated live workload | yes | yes | yes, with a writable shared trace dir or per-stage host runner | +| `--profile-by-stage` capture | yes | no | no | +| `--profile-prefix` control | yes | usually ignored on HTTP profiler route | usually ignored on HTTP profiler route | + +For TensorRT-LLM, live capture only works when the server exposes `/start_profile` and +`/stop_profile`, and when the deployment already provides a shared trace path plus the +required env vars. + +## Real H100 Validation + +The current reference run is the `4x H100` matrix captured on `2026-04-23` on +`h100_sglang` under: + +- `/data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3` + +Rendered markdown bundle: + +- `/data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md` + +Validated model directories: + +- `mixtral_8x7b_instruct` +- `qwen2_5_32b_instruct` +- `qwen3_32b` + +Each model directory contains: + +- `analysis_sglang.txt` +- `analysis_vllm.txt` +- `analysis_trtllm.txt` +- framework-specific trace roots and probe artifacts + +Validated matrix: + +| Model | SGLang | vLLM | TensorRT-LLM | Result | +| --- | --- | --- | --- | --- | +| `mistralai/Mixtral-8x7B-Instruct-v0.1` | `4x H100` | `4x H100` | `4x H100` | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text | +| `Qwen/Qwen2.5-32B-Instruct` | `4x H100` | `4x H100` | `4x 
H100` | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text | +| `Qwen/Qwen3-32B` | `4x H100` | `4x H100` | `4x H100` | three tables rendered correctly on all three frameworks; vLLM and TensorRT-LLM chat probes often emitted `<think>` prefixes | + +Use this run as the main H100 reference. +The older `2026-04-22` single-card Qwen3 matrix is still useful for bring-up, but it is +not the default reference anymore. + +Stage-separated workload validation captured on `2026-05-01` on `h100_sglang`: + +- `/data/bbuf/validate/unified_llm_profiler_skill/runs/20260501_stage_split_validation` +- `/data/bbuf/validate/unified_llm_profiler_skill/runs/20260501_stage_split_validation_large` + +Validated models: + +| Model | GPU | Workloads | Result | +| --- | --- | --- | --- | +| `Qwen/Qwen2.5-0.5B-Instruct` | `1x H100` | prefill `4090->1`, decode `1->2048` | generated separate `prefill/*.trace.json.gz` and `decode/*.trace.json.gz`; kernel, overlap, and fuse tables rendered with separate `extend/prefill` and `decode` sections | +| `Qwen/Qwen2.5-1.5B-Instruct` | `1x H100` | prefill `4090->1`, decode `1->2048` | generated separate `prefill/*.trace.json.gz` and `decode/*.trace.json.gz`; kernel, overlap, and fuse tables rendered with separate `extend/prefill` and `decode` sections | +| `Qwen/Qwen2.5-7B-Instruct` | `1x H100` | prefill `4090->1`, decode `1->2048` | generated separate traces; prefill kernel table captured 28-layer GEMM/FA3/RMSNorm work, decode captured 5-step graph launches, and fuse rows were split by stage | +| `Qwen/Qwen2.5-14B-Instruct` | `1x H100` | prefill `4090->1`, decode `1->2048` | generated separate traces; prefill kernel table captured 48-layer GEMM/FA3/RMSNorm work, decode captured 5-step graph launches, and fuse rows were split by stage | +| `Qwen/Qwen3-8B` | `2x H100`, TP=2 | prefill `4090->1`, decode `1->2048`, warmup 10/capture 5 | generated separate prefill/decode traces and all three tables; unique probe
prompts avoided prefix-cache pollution in the prefill table | +| `mistralai/Mistral-7B-Instruct-v0.3` | `2x H100`, TP=2 | prefill `4090->1`, decode `1->2048`, warmup 10/capture 5 | generated separate prefill/decode traces and all three tables; server logs showed no repeated-prompt prefix-cache shortcut during the active prefill window | + +This validation also covers the compatibility fix for older SGLang profiler +state machines: workload-separated live capture labels stages by output +directory and avoids nesting SGLang's internal `profile_by_stage` state machine +inside each workload. The helper +adds one internal scheduler guard step because SGLang increments `forward_ct` +before checking whether the profiler should stop; without that guard, a +`num_steps=1` prefill capture can stop just before the actual prefill forward. +The 2026-05-01 two-card validation artifacts for the additional models are: + +- `/data/bbuf/validate/core_skill_validation_20260501/qwen3_8b/profiler` +- `/data/bbuf/validate/core_skill_validation_20260501/mistral_7b_instruct_v03/profiler` + +To render a validated run into one markdown document: + +```bash +python3 scripts/render_triage_markdown_bundle.py \ + --analysis-root /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3 \ + --output /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md +``` + +The bundle groups by model and keeps the three tables for each framework. 
+ +H100 notes: + +- all three frameworks now render kernel, overlap, and fuse tables with separate `extend/prefill` and `decode` sections when the trace contains a clean stage split +- SGLang live capture is validated and calls the server profiler API directly instead of shelling out to `sglang.profiler` +- SGLang trace flush can lag well beyond a few seconds, so the runner waits longer for artifacts than the earlier implementation +- SGLang kernel-site reconstruction keeps sampling disabled in the mapping path so the optimized parser does not perturb SGLang table output; equality rechecks matched for `Mixtral-8x7B-Instruct-v0.1`, `Qwen3-32B`, and `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8` +- vLLM live capture requires `--output-dir` to match the server `torch_profiler_dir`; the validated H100 flow uses `--profiler-config {"profiler":"torch","torch_profiler_dir":"..."}` and then drives `/start_profile` and `/stop_profile` +- TensorRT-LLM validation stays on `--backend pytorch`; the H100 flow writes the trace with `TLLM_TORCH_PROFILE_TRACE` and then analyzes the saved trace +- the 2026-04-22 TensorRT-LLM 1.0.0 `py_executor.py` profiler setup still needed a `with_stack=True` override for table-quality Python locations, and the matrix runner generated that override under `/data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm`; re-check this on TensorRT-LLM 1.2.1 or any 1.3.x release-candidate image before assuming the override is still required +- on this host, keep all trace roots under `/data/...`, not `/home/...` + +## When To Use It + +- inspect a `torch.profiler` trace or profile directory from `sglang`, `vllm`, or `TensorRT-LLM` +- profile a live serving endpoint and analyze the result +- summarize which kernel families dominate prefill or decode +- map kernels back to Python code paths +- judge whether a code path still leaves overlap opportunity +- check whether an already-known fusion or overlap path should have applied + +## Diffusion Backend Gate 
+ +For diffusion benchmark or profiling work, only analyze traces produced by the native +SGLang diffusion backend. + +If the run that generated the trace logs any of: + +- `Falling back to diffusers backend` +- `Using diffusers backend` +- `Loaded diffusers pipeline` + +stop the workflow instead of analyzing the trace. +Handle it as a backend-selection issue, not as native-kernel profiler evidence. + +## Main Flows + +## Stage-Separated Live Capture Contract + +Live capture must not use one mixed prompt as the default. +By default, `analyze_llm_torch_profile.py --url ...` captures two labeled +workloads and then renders the same three tables with separate stage sections: + +- prefill: synthetic input length `4090`, output length `1` +- decode: synthetic input length `1`, output length `2048` + +Every live profiler path warms up `10` steps before arming the profiler and then +captures `5` active steps by default. Keep this warmup/active split aligned +across SGLang, vLLM, and TensorRT-LLM before comparing kernel tables. + +Use these options to override the contract when the benchmark workload is known: + +```bash +--profile-workload both \ +--warmup-steps 10 --num-steps 5 \ +--prefill-input-len 4090 --prefill-output-len 1 \ +--decode-input-len 1 --decode-output-len 2048 +``` + +Allowed `--profile-workload` values: + +- `both`: default; capture prefill and decode separately +- `prefill`: capture only the long-input / one-token workload +- `decode`: capture only the one-input / long-output workload +- `legacy`: keep the old `--probe-prompt` / `--probe-max-new-tokens` behavior + +For `sglang-sota-performance`, do not use the defaults if the slow SGLang +benchmark scenario has a known input/output distribution. +Set the profiler lengths from that slow scenario instead: prefill uses the slow +input length with output `1`, and decode uses input `1` with the slow output +length. 
For a mixed dataset, profile the slowest representative bucket such as +the p50 or p95 input/output pair used in the benchmark report, and record the +bucket in the artifact notes. + +### 1. Single-trace triage from an existing profile dir or trace + +```bash +python3 scripts/analyze_llm_torch_profile.py \ + --input /path/to/profile_dir_or_trace.json.gz +``` + +Use this when one trace is enough. +The overlap table stays conservative in single-trace mode and will tell you when a +mapping/formal pair is needed. + +### 2. Single-trace live capture from SGLang + +```bash +python3 scripts/analyze_llm_torch_profile.py \ + --framework sglang \ + --url http://127.0.0.1:30000 \ + --output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/sglang_profile_live \ + --num-steps 5 \ + --warmup-steps 10 \ + --profile-by-stage \ + --profile-workload both +``` + +The script sends `POST /start_profile` to the SGLang server directly. +Keep `--output-dir` under `/data/...` so later analysis and docs can see the trace. +The script writes `server_args.json`, warms up with the same workload shape, +sends the active probe requests after profiling is armed, captures separate +`prefill/` and `decode/` profile roots by default, and waits longer for trace +flush than the earlier implementation. +For the default workload-separated capture, the directory name labels the stage +and the SGLang internal `profile_by_stage` mode is not used inside each +workload. This avoids mixing a one-token prefill probe with a separate decode +profile. The helper still adds one internal guard step because older SGLang +profilers check the target counter before running the next forward. + +### 3. 
Single-trace live capture from vLLM + +Launch vLLM with torch profiler enabled, for example: + +```bash +vllm serve meta-llama/Llama-3.1-8B-Instruct \ + --profiler-config '{"profiler":"torch","torch_profiler_dir":"/data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile"}' +``` + +Then run: + +```bash +python3 scripts/analyze_llm_torch_profile.py \ + --framework vllm \ + --url http://127.0.0.1:8000 \ + --output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile \ + --num-steps 5 \ + --warmup-steps 10 \ + --no-profile-by-stage \ + --profile-workload both +``` + +For vLLM, `--output-dir` must point to the same `torch_profiler_dir` the server uses. +The current vLLM profiler config already defaults `torch_profiler_with_stack=true`, +so the runner only needs to set `torch_profiler_dir`. +On `h100_sglang`, external vLLM containers should mount both: + +- `/data/.cache/huggingface:/root/.cache/huggingface` +- `/data/bbuf/validate/unified_llm_profiler_skill:/data/bbuf/validate/unified_llm_profiler_skill` + +### 4. Single-trace live capture from TensorRT-LLM + +Use this only when the server exposes `POST /start_profile` and `POST /stop_profile`, +and the trace path is shared with the current machine. + +Typical env expectations are: + +- `TLLM_PROFILE_START_STOP=1` +- `TLLM_TORCH_PROFILE_TRACE=/shared/path/trace.json` or `.json.gz` + +Then run: + +```bash +python3 scripts/analyze_llm_torch_profile.py \ + --framework trtllm \ + --url http://127.0.0.1:8000 \ + --output-dir /shared/path \ + --num-steps 5 \ + --no-profile-by-stage \ + --profile-workload both +``` + +If the deployment does not expose the profiler control endpoints, fall back to analyzing +an existing trace instead of trying live capture. 
+If the TensorRT-LLM trace output is configured as one fixed file path, use +`scripts/run_trtllm_pytorch_profile_host.sh --stage prefill` and `--stage decode` +instead of direct `--profile-workload both`, so each stage gets its own trace file. + +On the current TensorRT-LLM mainline path, `py_executor.py` creates the torch profiler +with `record_shapes=True` and `with_modules=True` but not `with_stack=True`. +For table-quality validation, use the override generator: + +```bash +python3 scripts/make_trtllm_py_executor_override.py \ + --source /path/to/original/py_executor.py \ + --output /data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm/py_executor_with_stack.py +``` + +The matrix runner does this automatically on H100 before TensorRT-LLM capture starts. + +This is the validated TensorRT-LLM flow on `h100_sglang`: + +1. launch `trtllm-serve` with `TLLM_TORCH_PROFILE_TRACE=/data/.../trace.json` +2. run a few benchmark requests +3. analyze the emitted trace with `--input /data/.../trace.json` + +### 5. Two-trace triage from existing profile dirs or traces + +```bash +python3 scripts/analyze_llm_torch_profile.py \ + --mapping-input /path/to/graph_off_profile_dir \ + --formal-input /path/to/graph_on_profile_dir +``` + +Use this when you need stronger overlap attribution and kernel-to-source mapping. + +### 6. Two-trace triage from running servers + +```bash +python3 scripts/analyze_llm_torch_profile.py \ + --framework sglang \ + --mapping-url http://127.0.0.1:31025 \ + --formal-url http://127.0.0.1:31026 \ + --num-steps 5 \ + --profile-by-stage +``` + +For `vllm` or `TensorRT-LLM`, use the same shape but pass: + +- `--framework vllm` or `--framework trtllm` +- `--mapping-output-dir ...` +- `--formal-output-dir ...` +- `--no-profile-by-stage` + +## `profile_by_stage` + +`--profile-by-stage` is only meaningful on the SGLang live-capture path. 
+ +- With `--profile-workload both` / `prefill` / `decode`, workload directories + are the stage labels; the live-capture helper disables SGLang's internal + stage profiler per workload, warms up first, and captures the requested + active step count for the selected workload. +- On legacy or hand-captured SGLang serving, internal `profile_by_stage` is + still useful because prefill and decode usually have very different + bottlenecks. +- On the current profile-v2 path inside SGLang, stage-based profiling is effectively the normal path. +- PD-disaggregated serving adds one extra rule: prefill workers and decode workers must be profiled separately. That is stricter than ordinary `profile_by_stage`. +- For `vllm` and `TensorRT-LLM`, disable it with `--no-profile-by-stage`. + +## How To Choose The Triage Shape + +### Single-trace triage + +Use when you want the lowest-friction report: + +- one trace is already available +- you mainly want kernel share and fusion clues +- you are comparing two runs side by side by running triage once per trace + +Prefer this by default. + +### Two-trace triage + +Use when you need: + +- a stronger overlap answer +- graph-off source mapping plus graph-on final behavior +- more trustworthy overlap recommendations in the middle table + +The two traces are: + +1. mapping trace with graph disabled or with the lower-fusion / more-readable config +2. formal trace with the real serving optimizations enabled + +Do not call the mapping pass a "fast profile". +It exists to recover `kernel -> cpu_op -> python scope`. + +## Workflow + +### Single-trace workflow + +1. If the user only wants a diagnosis, one trace is enough. +2. Prefer one-rank traces over merged traces whenever the profiler emitted both. +3. For a live server, let the script drive the profiler only when the framework-specific prerequisites are already met. +4. Prefer `--profile-workload both`; use `legacy` only when reproducing an old trace contract. +5.
Prefer workload-separated SGLang capture; use internal `--profile-by-stage` + mainly for `legacy` or manually collected traces. +6. When on `h100_sglang`, create or clean the target trace directory through `docker exec sglang_bbuf ...` so the path is definitely writable under `/data`. + +### Two-trace workflow + +1. Produce a mapping trace first with graph disabled or the lower-fusion configuration. +2. Produce a formal trace second with the real serving optimizations enabled. +3. Run `triage` for the three-table report. +4. Read the results in this order: + - kernel table + - overlap-opportunity table + - fuse-pattern table +5. Before calling something a "new" optimization idea, compare the top rows against both [references/fuse-overlap-catalog.md](references/fuse-overlap-catalog.md) and [references/overlap-catalog.md](references/overlap-catalog.md). Check mainline rows first, then the `PR-backed / in-flight` sections. Prefer reporting: + - an existing fused or overlap path that should already apply here + - an existing path that appears disabled, unsupported, or regressed in this trace + - an upstream pattern that is mainline elsewhere but missing locally, or still open upstream + - a truly new opportunity only when no catalog entry fits +6. If no exact pattern fully matches but the trace is still close to a known family, add one flat similarity note after the tables. + Use `high`, `medium`, or `low` only. + Base that note on the full pattern shape, not on one kernel name alone. + Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure. + Do not rewrite the script table itself to include these heuristic judgments. 
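
The stage-separated capture contract described earlier maps a slow benchmark scenario onto profiler lengths: prefill uses the slow input length with output `1`, and decode uses input `1` with the slow output length. The following is a minimal sketch of that mapping. It assumes a hypothetical helper `build_profile_args`, which is not part of the skill's scripts and only assembles the documented flags into an argument list; the example lengths `8192` and `1024` are made up for illustration.

```python
from typing import List

# Values documented for --profile-workload in the capture contract.
ALLOWED_WORKLOADS = {"both", "prefill", "decode", "legacy"}


def build_profile_args(
    slow_input_len: int,
    slow_output_len: int,
    warmup_steps: int = 10,
    num_steps: int = 5,
    workload: str = "both",
) -> List[str]:
    """Hypothetical helper: turn a slow-scenario shape into the documented flags."""
    if workload not in ALLOWED_WORKLOADS:
        raise ValueError(f"unknown --profile-workload value: {workload!r}")
    if slow_input_len < 1 or slow_output_len < 1:
        raise ValueError("input and output lengths must be at least 1")
    return [
        "--profile-workload", workload,
        "--warmup-steps", str(warmup_steps),
        "--num-steps", str(num_steps),
        # Prefill probe: the slow input length, one generated token.
        "--prefill-input-len", str(slow_input_len),
        "--prefill-output-len", "1",
        # Decode probe: one input token, the slow output length.
        "--decode-input-len", "1",
        "--decode-output-len", str(slow_output_len),
    ]


# Example: a slow scenario with 8192 input tokens and 1024 output tokens.
print(" ".join(build_profile_args(slow_input_len=8192, slow_output_len=1024)))
```

The returned list can be appended to an `analyze_llm_torch_profile.py` invocation that already carries `--framework`, `--url`, and `--output-dir`. How the script treats the length flags under `legacy` is not confirmed here, so the sketch is best used with `both`, `prefill`, or `decode`.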
+ +## References + +Load these only when needed: + +- [references/source-map.md](references/source-map.md) + - upstream SGLang profiler entrypoints and trace-writing paths; still most useful for SGLang-specific source follow-up +- [references/heuristics.md](references/heuristics.md) + - overlap labels, dependency-risk interpretation, and limits +- [references/fuse-overlap-catalog.md](references/fuse-overlap-catalog.md) + - mixed source-backed catalog of existing fuse and overlap patterns, including mainline rows plus PR-backed / in-flight rows +- [references/vllm-torch-compile-fusions.md](references/vllm-torch-compile-fusions.md) + - current vLLM torch.compile fusion passes and the source patterns they target +- [references/overlap-catalog.md](references/overlap-catalog.md) + - overlap-only lookup table across LLM, VLM, diffusion, disaggregation, HiSparse, and speculative scheduling + +## Output Contract + +Return: + +- trace path or generated profile path +- framework +- model/server args when available +- kernel table +- overlap-opportunity table +- fuse-pattern table +- optional similarity note with `high` / `medium` / `low` when exact matching is inconclusive +- one short summary of what dominates the run +- whether the overlap read came from single-trace triage or mapping/formal two-trace triage diff --git a/.claude/skills/llm-torch-profiler-analysis/references/fuse-overlap-catalog.md b/.claude/skills/llm-torch-profiler-analysis/references/fuse-overlap-catalog.md new file mode 100644 index 000000000000..917de36b64f7 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/references/fuse-overlap-catalog.md @@ -0,0 +1,367 @@ +# Fuse And Overlap Catalog + +This catalog is the source-backed lookup table that the profiler skill should +consult before labeling a fuse or overlap opportunity as novel. + +For overlap-only triage, also load `references/overlap-catalog.md`. + +This revision is intentionally kernel-scoped. 
Keep rows here only when they map +to one fused GPU/NPU kernel family, one fused collective-plus-kernel family, or +one profiler-visible stream overlap among GPU kernels / collective kernels. +Host-only scheduler, event-loop, executor, offload, and load-path patterns are +intentionally excluded. + +Use it like this: + +1. Start from the three `triage` tables. +2. Match top rows against the `Trace keywords` and `Primary code` columns below. +3. If a finding matches an existing row, report it as: + - an existing optimization path that is missing, disabled, regressed, or unsupported for the current backend, or + - an already-known family that should be re-applied to the current model shape. +4. Check the mainline comparison sections and the `PR-backed / in-flight` sections too. If a match exists there, do not call it novel; call it an upstream or in-flight pattern instead. +5. Only call a finding "new" when it does not fit any mainline or PR-backed row in this catalog. + +The `vLLM-origin` sections below are comparative references. They are not +necessarily present in the checked-out `sglang` tree, but they should still be +treated as upstream or analogous kernel families before labeling a fuse or +overlap opportunity as novel. + +The catalog is grouped by reusable optimization family, not by one specific model. + +Refresh note `2026-05-01`: rescanned current `sglang` and vLLM mainline, then +rechecked recent merged and open optimization PRs through the GitHub CLI/API. +The vLLM torch.compile pass inventory is now split out in +[`vllm-torch-compile-fusions.md`](vllm-torch-compile-fusions.md). Stable +current-code families remain folded into the mainline rows below. New +status-sensitive rows were added for DeepSeek-V4, GLM5 NSA / PDL, NVFP4 MoE, +torch.compile decode, vLLM DSV4, vLLM ROCm WMMA, and vLLM GPU/CPU sync-removal +work. Recheck PR state before treating an in-flight row as shipped. + +## 1. 
LLM / SRT fused-kernel families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| Fused residual add + RMSNorm | `fused_add_rmsnorm*`<br>`npu_add_rms_norm`<br>`add_rmsnorm_bias`<br>`gemma_fused_add_rmsnorm`<br>`gemma_rmsnorm_residual_scalar`<br>`_gemma_rmsnorm_residual_kernel`<br>residual add right before norm | `python/sglang/srt/layers/layernorm.py`<br>`python/sglang/srt/layers/gemma4_fused_ops.py`<br>`python/sglang/srt/layers/quantization/modelslim/modelslim.py` | Shared CUDA / ROCm / CPU / NPU fused add-RMSNorm implementations, including Gemma, Gemma4 scalar-residual, and NPU-bias variants | Treat split residual add + RMSNorm as an existing cross-backend fusion first, not a new idea. |
+| FlashInfer unified `allreduce_fusion` | `cross_device_reduce_1stage*`<br>`all_reduce`<br>`FusedAddRMSNormKernel`<br>`rmsnorm*` | `python/sglang/srt/layers/flashinfer_comm_fusion.py`<br>`python/sglang/srt/layers/layernorm.py::forward_with_allreduce_fusion`<br>`python/sglang/srt/layers/communicator.py::apply_flashinfer_allreduce_fusion` | FlashInfer workspace creation plus `allreduce_fusion(..., pattern=AllReduceFusionPattern.kARResidualRMSNorm, ...)` | First suspect missing / disabled / unsupported FlashInfer allreduce fusion, not a brand new TP fusion idea. |
+| AITER allreduce fusion | ROCm all-reduce plus RMSNorm still split | `python/sglang/srt/layers/layernorm.py::forward_with_allreduce_fusion`<br>`python/sglang/srt/distributed/communication_op.py::tensor_model_parallel_fused_allreduce_rmsnorm`<br>`python/sglang/srt/layers/communicator.py::apply_aiter_all_reduce_fusion` | ROCm-side fused TP all-reduce + RMSNorm with fallback to plain all-reduce plus norm | On AMD, rule out existing AITER fusion before proposing a new communication fusion. |
+| Fused activation-and-mul (`SwiGLU` / `GeGLU`) | `silu_and_mul`<br>`gelu_and_mul`<br>`npu_swiglu` | `python/sglang/srt/layers/activation.py` | Single op covers activation plus elementwise multiply across CUDA / CPU / NPU / XPU backends | Treat separate activation + mul on packed MLP outputs as missing existing fusion. |
+| Fused dual residual RMSNorm | residual add plus two RMSNorm-like kernels around Grok blocks | `python/sglang/srt/layers/elementwise.py::fused_dual_residual_rmsnorm`<br>`python/sglang/srt/models/grok.py` | One Triton kernel computes intermediate residual update and next RMSNorm output together | On Grok-like residual layouts, treat split residual + norm as missing existing fusion. |
+| In-place QK RMSNorm | split `q_norm` / `k_norm` kernels | `python/sglang/srt/models/utils.py::apply_qk_norm`<br>`python/sglang/jit_kernel/norm.py::fused_inplace_qknorm` | In-place JIT QK norm plus optional `alt_stream` overlap for K | Check shape, dtype, deterministic mode, and in-place legality before proposing a new QK fuse. |
+| TorchInductor horizontal Q/K norm combo-kernels | `combo_kernels`<br>`benchmark_combo_kernel`<br>`q_norm`<br>`k_norm`<br>`split_with_sizes` | `torch._inductor.config.combo_kernels` | TorchInductor can horizontally fuse sibling Q-norm and K-norm kernels in compiled traces, often deleting `split_with_sizes` / `clone` ladders | Treat separate Q/K norm ladders in compile-heavy traces as an existing compiler-fusion family first. |
+| MiniMax TP fused QK RMSNorm | `MiniMaxM2RMSNormTP`<br>`rms_sumsq_serial`<br>`rms_apply_serial`<br>`forward_qk` | `python/sglang/srt/models/minimax_m2.py` | Triton kernels compute Q / K sumsq together, TP all-reduces shared stats, then apply both RMSNorms together | On MiniMax traces, separate Q norm and K norm are usually a missed model-specific Triton fusion. |
+| Fused QK RMSNorm + RoPE | `qknorm*` + `rope*` + `rotary*` as separate steps | `python/sglang/jit_kernel/fused_qknorm_rope.py`<br>`python/sglang/srt/models/qwen3_moe.py` | One JIT kernel applies QK RMSNorm and RoPE in-place on packed QKV | For compatible LLMs, classify split QK norm + RoPE as a missing existing fusion. |
+| Fused QK RoPE reshape + KV cache write | `fused_qk_rope_reshape_and_cache*`<br>RoPE followed by reshape / cache DtoD | `python/sglang/srt/layers/attention/utils.py::fused_qk_rope_reshape_and_cache` | One Triton kernel applies RoPE to Q / K, reshapes cache layout, and writes K / V directly to paged cache | Treat separate RoPE + reshape + cache-write ladders as an existing attention-prep fusion family. |
+| Fused RoPE + KV cache store | `fused_set_kv_buffer`<br>RoPE followed by KV-store, DtoD, or cache-write kernels | `python/sglang/jit_kernel/rope.py`<br>`python/sglang/srt/models/utils.py::enable_fused_set_kv_buffer` | Shared entrypoints can route to fused RoPE + KV-store or model-side `fused_set_kv_buffer` fast paths | Compare against the fused cache-store path before proposing a new KV rewrite. |
+| Fused decode metadata setup | `normal_decode_set_metadata`<br>`cache_seqlens_int32`<br>`cu_seqlens_k`<br>`page_table`<br>`swa_page_table` | `python/sglang/srt/layers/attention/flashattention_backend.py::normal_decode_set_metadata` | Triton decode path fuses seq-len cast/add, prefix-sum, req-to-token gather, page-table divide, and optional SWA metadata build into 1-2 kernels | If decode exposes multiple tiny metadata kernels before attention, first compare against this existing fused metadata-prep path. |
+| NSA fused metadata copy for graph replay | `fused_metadata_copy`<br>`fused_metadata_copy_multi`<br>`fused_nsa_cache_seqlens`<br>`fused_flashmla_metadata` | `python/sglang/jit_kernel/fused_metadata_copy.py` | CUDA graph replay path fuses multiple metadata copies into one kernel or one multi-destination kernel | Treat bursts of tiny metadata-copy kernels around NSA replay as a missed existing replay fusion. |
+| DeepSeek MLA fused projection + norm + RoPE | `qkv_proj_with_rope_fused_weight`<br>`fused_qkv_a_proj_with_mqa`<br>`forward_absorb_fused_mla_rope*` | `python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mla_fused_rope_cpu.py`<br>`python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mla_fused_rope_rocm.py`<br>`python/sglang/srt/models/deepseek_v2.py` | CPU / ROCm paths fuse DeepSeek MLA projection packing with q / k norm, RoPE, and cache-oriented MLA prep | For DeepSeek MLA, split proj / norm / rope prep is usually an existing backend-specific fuse that did not fire. |
+| Fused QK RoPE concat + MLA cache write | `fused_qk_rope_cat_and_cache_mla`<br>`set_mla_kv_buffer` | `python/sglang/srt/layers/rocm_linear_utils.py`<br>`python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mla.py` | ROCm MLA path can fuse Q / K RoPE packing, concat, and MLA cache write in one backend-specific op | On DeepSeek / MLA traces, separate RoPE-cat-cache steps are not automatically novel. |
+| Qwen3 decode fused QK norm + 3D mRoPE + KV cache write | `fused_qk_norm_mrope_3d_cache_pts_quant_shuffle`<br>`mrope`<br>decode cache write | `python/sglang/srt/models/qwen3.py` | ROCm / AITER decode path fuses QK norm, 3D mRoPE, and paged KV cache write | On Qwen3-style decode, separate norm + mRoPE + cache-store kernels are not a novel opportunity. |
+| NPU fused split-QKV + RMSNorm + RoPE | `split_qkv_rmsnorm_rope` | `python/sglang/srt/models/llama.py`<br>`python/sglang/srt/models/qwen3.py`<br>`python/sglang/srt/models/qwen3_moe.py`<br>`python/sglang/srt/models/glm4_moe.py` | Ascend path fuses QKV split, Q / K RMSNorm, and RoPE in one op | On NPU traces, separate split / norm / rope kernels usually mean the fused path is unavailable or bypassed. |
+| Fused FP8 quantize + paged KV cache write | `trtllm_fp8_kv_kernel`<br>`fp8 kv cache write`<br>`paged KV cache write` | `python/sglang/srt/layers/attention/triton_ops/trtllm_fp8_kv_kernel.py` | TRTLLM MHA path fuses FP8 quantization, scale computation, and paged K / V cache write | If FP8 KV cache traces show standalone quant plus write kernels, first compare against this existing Triton fuse. |
+| Fused MLA KV cache write + FP8 quant | `set_mla_kv_buffer_fp8_quant*`<br>`set_mla_kv_buffer_triton_fp8_quant` | `python/sglang/srt/mem_cache/utils.py`<br>`python/sglang/srt/mem_cache/memory_pool.py` | MLA / NSA KV pool path can quantize K and write directly into KV storage without a separate concat-and-quant chain | Treat standalone quant + KV-buffer write on MLA paths as missing existing fusion first. |
+| Fused MoE router / top-k / softcapping | `FusedMoeRouter`<br>`fused_moe_router*`<br>router GEMM + `topk` + `tanh` | `python/sglang/srt/layers/moe/router.py` | Single fused router kernel covers router matmul, softcapping, and top-k selection | Treat exposed router matmul + softcap + top-k chains as an existing MoE fusion family. |
+| Fused MoE grouped-topk / gate kernels | `fused_topk_deepseek`<br>`moe_fused_gate`<br>`aiter_fused_topk`<br>`kimi_k2_moe_fused_gate` | `python/sglang/srt/layers/moe/topk.py` | CUDA / ROCm / FlashInfer kernels fuse bias, grouped-topk, renorm, and routed scaling into one gate op | Check backend / model eligibility before proposing a novel router-gate fusion. |
+| Qwen-style shared-expert append into routed top-k output | `_append_shared_to_topk_output`<br>`fused_append_shared_experts_with_weights`<br>`num_fused_shared_experts` | `python/sglang/srt/models/qwen2_moe.py`<br>`python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe_triton_kernels.py` | Qwen-style MoE paths can append shared-expert ids and sigmoid gate weights to routed top-k output in one Triton kernel so the shared experts execute inside the fused MoE path | Treat routed top-k plus shared-expert pad / concat ladders as an existing MoE-prep fusion family first. |
+| Fused MoE dispatch / permute / combine | token permutation<br>dispatch / combine<br>grouped top-k<br>many small MoE support kernels | `python/sglang/srt/layers/moe/fused_moe_triton/layer.py`<br>`python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py` | `FusedMoE` plus DeepEP / FlashInfer / FuseEP / standard dispatch backends and `permute_fusion=True` | First ask whether the model is missing an existing `FusedMoE`-style path or backend-specific dispatcher path. |
+| Fused MoE sum + all-reduce | routed MoE followed by explicit sum-reduce kernels | `python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py`<br>`python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_kernels.py` | `fuse_sum_all_reduce=True` path in the second MoE GEMM | Before inventing a new MoE reduction fuse, check whether `enable_fused_moe_sum_all_reduce` is simply off or the quant path is incompatible. |
+| Fused MoE activation + quant / re-quant | `silu_and_mul_*quant*`<br>`npu_dequant_swiglu_quant`<br>`swiglu_quant` | `python/sglang/srt/layers/moe/ep_moe/kernels.py`<br>`python/sglang/jit_kernel/nvfp4.py`<br>`python/sglang/srt/layers/moe/cutlass_w4a8_moe.py`<br>`python/sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py` | Quantized MoE backends fuse SwiGLU / SiLU-and-mul with FP8 / FP4 / NPU re-quant before the second expert GEMM | If MoE traces show standalone activation then quant kernels, first check whether the quantized fused path is missing. |
+| DeepSeek comm-prep fused RMSNorm + quant / flatten-quant | `fused_rms_fp8_group_quant`<br>`fused_rms_mxfp4_quant`<br>`fused_flatten_fp8_group_quant`<br>`fused_flatten_mxfp4_quant` | `python/sglang/srt/layers/communicator.py`<br>`python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mla.py`<br>`python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mha.py` | DeepSeek MLA / MHA ROCm paths fuse RMSNorm or flatten with FP8 / MXFP4 quantization for comm / attention prep | On DeepSeek quant traces, split norm + quant or flatten + quant is an existing family, not a new idea. |
+| NSA fused top-k transform / page-table build | `fast_topk_transform_fused`<br>`fast_topk_transform_ragged_fused` | `python/sglang/srt/layers/attention/nsa_backend.py` | NSA can fuse top-k selection with paged / ragged index transform instead of separate top-k plus metadata scatter | If NSA top-k metadata work is split, check `SGLANG_NSA_FUSE_TOPK` and backend support first. |
+| NSA fused quantize + indexed K-cache store | `fused_store_index_k_cache`<br>`act_quant`<br>`index_k_with_scale_buffer` | `python/sglang/jit_kernel/fused_store_index_cache.py`<br>`python/sglang/srt/layers/attention/nsa/nsa_indexer.py` | Single JIT kernel quantizes bf16 K to fp8 + scale and writes directly into NSA index cache | Treat split `act_quant` + buffer-store on CUDA as missing an existing fused store path. |
+| Fused sampling temperature + softmax | `fused_temperature_softmax*` | `python/sglang/srt/layers/fused_sampling.py`<br>`python/sglang/srt/layers/sampler.py` | Triton single-pass / multi-pass kernels fuse temperature scaling and softmax during decode | Separate temp-divide + softmax at decode batch sizes is often a missed existing fusion. |
+| Fused logit softcap | `fused_softcap`<br>`final_logit_softcapping` | `python/sglang/srt/layers/elementwise.py`<br>`python/sglang/srt/layers/logits_processor.py` | Triton kernels fuse cast-to-float and softcap / tanh math for logits or generic elementwise softcapping | Treat exposed cast + softcap ladders as an existing Triton fuse family. |
+| Linear-attention packed projection reshuffle | `fused_qkvzba_split_reshape_cat*`<br>`qkvz_proj`<br>`ba_proj`<br>`qkvabz_proj`<br>`fused_qkvbfg_a_proj` | `python/sglang/jit_kernel/triton/gdn_fused_proj.py`<br>`python/sglang/srt/models/qwen3_next.py`<br>`python/sglang/srt/models/qwen3_5.py`<br>`python/sglang/srt/models/kimi_linear.py`<br>`python/sglang/srt/models/jet_nemotron.py` | GDN / Kimi / Jet-style linear-attn models pack multiple projections, then fuse split / reshape / cat into one kernel | Treat split reshape / transpose / cat ladders as an existing linear-attention fusion family. |
+| Fused GDN gating prep | `fused_gdn_gating`<br>`softplus`<br>`beta_output` | `python/sglang/srt/layers/attention/fla/fused_gdn_gating.py` | Triton kernel computes GDN gate preparation such as `-exp(A_log) * softplus(...)` and `sigmoid(b)` together | On GDN traces, treat split gate-prep elementwise kernels as missing existing fusion first. |
+| Fused RMSNorm-gated linear-attention output | `FusedRMSNormGated`<br>`layer_norm_gated_fwd` | `python/sglang/srt/layers/attention/fla/fused_norm_gate.py`<br>`python/sglang/srt/models/qwen3_next.py`<br>`python/sglang/srt/models/kimi_linear.py` | One Triton op covers residual-aware (RMS)Norm plus sigmoid / swish gating | If norm and output gate appear as separate kernels in GDN / Kimi-like blocks, first suspect a missing existing fusion. |
+| Fused gated RMSNorm / LayerNorm | `rms_norm_gated`<br>`layer_norm_gated` | `python/sglang/srt/layers/attention/mamba/ops/layernorm_gated.py` | Mamba-derived kernels can fuse normalization with the gating branch `z * sigmoid(z)` | Treat split norm and gate post-processing on Mamba-style blocks as an existing fusion family. |
+| Fused linear-attention chunk KKT + solve_tril | `chunk_gated_delta_rule_fwd_kkt_solve_kernel`<br>`scaled_dot_kkt`<br>`solve_tril`<br>`recompute_w_u` | `python/sglang/srt/layers/attention/fla/chunk_fwd.py`<br>`python/sglang/srt/layers/attention/fla/kda.py` | GDN / KDA chunk forward fuses `scaled_dot_kkt + solve_tril` in the prefill / intra-chunk path, then finishes `recompute_w_u` as the next step | Treat split KKT + triangular-solve ladders as an existing linear-attention fusion family first. |
+| Fused linear-attention recurrent / KDA update | `fused_sigmoid_gating_delta_rule_update`<br>`fused_recurrent_gated_delta_rule_update`<br>`fused_kda_gate` | `python/sglang/srt/layers/attention/fla/fused_sigmoid_gating_recurrent.py`<br>`python/sglang/srt/layers/attention/fla/fused_recurrent.py`<br>`python/sglang/srt/models/kimi_linear.py`<br>`python/sglang/srt/models/jet_nemotron.py` | Triton / CuTeDSL kernels fuse gating math, optional QK l2norm, recurrent state update, and output generation | Treat split gating + recurrent-update chains as existing linear-attention fusion, not a novel opportunity. |
+| Fused Mamba state gather/scatter with mask | `fused_mamba_state_scatter_with_mask`<br>`index_elementwise_kernel` | `python/sglang/srt/layers/attention/mamba/mamba_state_scatter_triton.py` | Triton kernel replaces multiple masked gather / scatter index kernels with one fused update | If Mamba verify/update shows many tiny index kernels, first compare against this existing fused path. |
+| Staging-buffer fused gather / scatter | `_fused_gather_to_staging_kernel`<br>`_fused_scatter_from_staging_kernel` | `python/sglang/srt/disaggregation/common/staging_buffer.py` | Triton kernels gather scattered KV slices into contiguous staging memory and scatter them back into KV cache on decode | Treat ladders of tiny gather/scatter/copy kernels in heterogeneous TP staging as missing an existing Triton fusion. |
+
+## 2. LLM / SRT kernel-overlap families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| Single-batch overlap (SBO) | MoE combine, down-gemm, shared-expert work in nearby two-stream windows | `python/sglang/srt/batch_overlap/single_batch_overlap.py` | combine vs down-gemm overlap, combine vs shared-expert overlap, one-stream dispatch+shared overlap, explicit SM partitioning and events | If exposed MoE combine sits near neighboring compute, classify it against SBO before calling it new overlap. |
+| Q and K normalization on different streams | Q-side norm and K-side norm on different streams | `python/sglang/srt/models/utils.py::apply_qk_norm`<br>`python/sglang/srt/models/qwen3.py`<br>`python/sglang/srt/models/qwen3_next.py`<br>`python/sglang/srt/models/qwen3_5.py` | Q stays on current stream, K can run on `alt_stream` in capture mode | Treat split Q / K norm as an existing overlap family when `alt_stream` is already wired. |
+| DeepSeek shared-expert / routed-expert overlap | shared-expert GEMMs near DeepEP dispatch / combine | `python/sglang/srt/models/deepseek_v2.py`<br>`python/sglang/srt/batch_overlap/single_batch_overlap.py` | shared experts on `alt_stream`, overlap with dispatch / combine and down-gemm, Blackwell-specific env gating | This is an established routed-vs-shared branch overlap pattern, not a novel idea. |
+| Llama4 shared branch vs routed branch overlap | shared expert branch plus routed MoE branch as adjacent windows | `python/sglang/srt/models/llama4.py` | shared expert on current stream, router + topk + routed experts on `alt_stream` | Use Llama4 as the first precedent for branch-level overlap in similar sparse models. |
+| ExaoneMoE shared experts vs router experts overlap | shared expert output and router-expert output form a two-branch window | `python/sglang/srt/models/exaone_moe.py::forward_normal_dual_stream` | shared experts on current stream, router + routed experts on `alt_stream`, explicit join before combine | This is an existing dual-stream MoE overlap family. |
+| Grok residual-MoE branch overlap | dense MLP and block-sparse MoE branches in parallel | `python/sglang/srt/models/grok.py::moe_with_rmoe` | dense MLP on current stream, MoE on `alt_stream`, fused dual residual RMSNorm around boundaries | Treat exposed Grok branch overlap as an existing pattern. |
+| NSA dual-stream overlap | Q-proj, K-proj, RoPE, cache-store, quantization in tight two-stream windows | `python/sglang/srt/layers/attention/nsa/nsa_indexer.py` | Q / K projection split, RoPE split, cache-store vs quantization overlap | NSA already contains several dual-stream overlap precedents. |
+| MoriEP async dispatch / combine comm stream | `MoriEP`
`_comm_stream`
`dispatch`
`combine`
`done_event` | `python/sglang/srt/layers/moe/token_dispatcher/moriep.py` | MoriEP can submit dispatch and combine onto a dedicated communication stream and synchronize only through events | Treat MoriEP comm / compute interleave as an existing MoE overlap family. | +| Heterogeneous-TP staging scatter overlap | `scatter_stream`
`_scatter_stream`
`staging` | `python/sglang/srt/disaggregation/common/staging_handler.py`
`python/sglang/srt/disaggregation/common/staging_buffer.py` | decode-side staging scatter kernels can run on a dedicated stream while forward continues on the main stream | If decode traces show staging scatter kernels adjacent to forward kernels, classify them against this existing overlap family first. | +| Generic `alt_stream` overlap families | `alt_stream` plus explicit `wait_stream` / `with torch.cuda.stream(...)` | `qwen2_moe.py`
`qwen3_moe.py`
`glm4_moe.py`
`bailing_moe.py`
`llada2.py`
`grok.py`
`olmo2.py`
`step3p5.py`
`longcat_flash.py`
`falcon_h1.py` | model-specific overlap on attention prep, MoE branches, or cache-store | Search these families before designing a new overlap scheme from scratch. |
+
+## 3. VLM-specific kernel families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| Vision QK norm with aux stream | vision-side QK norm or norm-like kernels before attention | `python/sglang/srt/layers/attention/vision.py` | vision QK normalization can call shared `apply_qk_norm(...)`, with K-side work on `aux_stream` | If vision QK prep is split, first check this existing aux-stream path. |
+| ViT CUDA graph disables vision aux stream | expected vision overlap is absent under ViT graph | `python/sglang/srt/models/internvl.py`
`python/sglang/srt/layers/attention/vision.py`
`python/sglang/srt/environ.py::SGLANG_VIT_ENABLE_CUDA_GRAPH` | vision `aux_stream` is intentionally disabled when ViT CUDA graph is on | Missing vision overlap may be intentional, not a regression. | +| Fused multimodal RoPE kernel | `triton_mrope_fused`
`multimodal_rotary_embedding_cpu`
`npu_mrope`
`MRotaryEmbedding` | `python/sglang/srt/layers/rotary_embedding/mrope.py`
`python/sglang/srt/layers/rotary_embedding/triton_kernels.py`
`python/sglang/srt/models/qwen3.py` | CUDA Triton, CPU `sgl_kernel`, and NPU paths already fuse multimodal t / h / w position lookup plus in-place Q / K rotary application | If VLM traces show separate mRoPE gather / shuffle / apply steps, first classify them as a missing existing mRoPE fusion. |
+
+## 4. Diffusion fused-kernel families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| Fused residual + norm + scale + shift | residual add, norm, scale, shift, gate around DiT blocks | `python/sglang/jit_kernel/diffusion/cutedsl/scale_residual_norm_scale_shift.py`
`python/sglang/multimodal_gen/runtime/layers/layernorm.py` | `fused_scale_residual_norm_scale_shift(...)` | Treat split residual + norm + modulation as a missing existing diffusion fusion first. | +| Fused norm + scale + shift | norm followed by scale / shift elementwise kernels | `python/sglang/jit_kernel/diffusion/cutedsl/scale_residual_norm_scale_shift.py`
`python/sglang/multimodal_gen/runtime/layers/layernorm.py` | `fused_norm_scale_shift(...)` | Existing modulation fusion already covers this family. | +| Triton scale / shift and gate-select kernels | tiny scale / shift or gate-select kernels dominate modulation blocks | `python/sglang/jit_kernel/diffusion/triton/scale_shift.py`
`python/sglang/multimodal_gen/runtime/layers/elementwise.py` | `fuse_scale_shift_kernel(...)` and `fuse_layernorm_scale_shift_gate_select01_kernel(...)` | Check whether the runtime is missing these existing Triton fusions. | +| Fused add-RMSNorm and one-pass RMSNorm | residual add plus RMSNorm still split on short hidden sizes | `python/sglang/multimodal_gen/runtime/layers/layernorm.py`
`python/sglang/jit_kernel/diffusion/triton/rmsnorm_onepass.py` | `fused_add_rmsnorm(...)` and `triton_one_pass_rms_norm(...)` | For short hidden-size diffusion blocks, this is already an established fusion family. | +| Fused diffusion QK norm + RoPE | split QK norm and RoPE in diffusion attention blocks | `python/sglang/jit_kernel/diffusion/qknorm_rope.py`
`python/sglang/multimodal_gen/runtime/layers/layernorm.py::apply_qk_norm_rope` | `fused_inplace_qknorm_rope(...)`, with fallback to QK norm plus `apply_flashinfer_rope_qk_inplace(...)` | Distinguish between missing fused qknorm + rope and the existing FlashInfer RoPE fallback. | +| Z-Image fused `norm(x) * tanh(scale) + shift` | `fused_norm_tanh_mul_add`
`tanh(gate) * rmsnorm(x)` | `python/sglang/jit_kernel/diffusion/cutedsl/norm_tanh_mul_add_norm_scale.py`
`python/sglang/multimodal_gen/runtime/layers/layernorm.py` | CuTeDSL kernel plus runtime helper for Z-Image residual-form modulation | Treat split Z-Image residual-form modulation as a missing existing diffusion fusion, not a novel idea. | +| Z-Image fused residual modulation + next norm-scale | `fused_norm_tanh_mul_add_norm_scale`
`residual + tanh(gate) * rmsnorm(x)`
`ffn_norm1(x) * scale_mlp` | `python/sglang/jit_kernel/diffusion/cutedsl/norm_tanh_mul_add_norm_scale.py`
`python/sglang/multimodal_gen/runtime/models/dits/zimage.py` | One CuTeDSL kernel fuses the first residual-form modulation and the next normalization / scale stage | If you see this chain split in Z-Image traces, report it as a missing existing mainline fusion family. | +| Nunchaku fused GELU MLP | `_fused_gelu_mlp`
`fused_gelu_mlp` | `python/sglang/multimodal_gen/runtime/models/dits/flux.py` | Nunchaku path fuses `fc1 GEMM + GELU + shift + re-quant + fc2.lora_down` before the second GEMM | Treat split GELU-MLP on Nunchaku checkpoints as an existing fused family, not a new discovery. |
+
+## 5. Diffusion kernel-overlap and async-communication families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| Ulysses sequence-parallel attention | exposed `all_to_all` around attention blocks | `python/sglang/multimodal_gen/runtime/layers/attention/layer.py`
`python/sglang/multimodal_gen/runtime/distributed/communication_op.py` | head / sequence redistribution before and after attention | Treat sequence-parallel all-to-all as an existing distributed attention family. | +| USP attention with all-to-all and ring attention | `all_to_all`, ring-attention comm, head / sequence reshards | `python/sglang/multimodal_gen/runtime/layers/attention/layer.py` | `_usp_input_all_to_all(...)`, `_usp_output_all_to_all(...)`, `ring_attn(...)` | This is the primary existing overlap / comm family for many diffusion models. | +| Turbo-layer async all-to-all pipelining | pipelined A2A windows with explicit waits on a comm stream | `python/sglang/multimodal_gen/runtime/layers/attention/turbo_layer.py` | looped `all_to_all_single(..., async_op=True)` plus staged postprocess on a comm stream | Treat exposed turbo A2A windows as an existing pipelined overlap pattern. | +| TorchInductor compute / communication reorder | compiled traces with compute and comm partially interleaved | `python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py`
`python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages/mova.py` | `torch._inductor.config.reorder_for_compute_comm_overlap = True` | Existing compile-time reordering may already explain partial overlap in diffusion traces. |
+| Dual-stream diffusion models | two nearby compute branches inside one DiT / UNet block | `python/sglang/multimodal_gen/runtime/models/dits/hunyuan3d.py` | `use_dual_stream = True` | Treat dual-branch diffusion execution as an existing overlap family. |
+
+## 6. PR-backed / in-flight fused-kernel families
+
+These rows track still-open upstream work or status-sensitive PR families.
+Stable entries should be folded into the mainline family rows above.
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| PR `#21877` fused grouped down-GEMM + combine | `grouped_gemm_nt_masked`
`combine`
`fused grouped gemm combine` | `PR #21877`
`python/sglang/srt/layers/moe/ep_moe/flashinfer_cutedsl_moe.py`
`python/sglang/srt/layers/moe/token_dispatcher/deepep.py` | FlashInfer CuTeDSL kernel fuses the second expert GEMM with DeepEP low-latency combine | Treat this as a concrete upstream MoE fuse / overlap family, not a new thought experiment. | +| PR `#21889` fused BF16 to FP4 quant + paged KV write | `set_mla_kv_buffer_fp4_quant_kernel`
`fp4 kv cache` | `PR #21889`
`python/sglang/srt/mem_cache/utils.py` | Triton kernel writes FP4 NSA KV pages directly while quantizing BF16 input | If NSA FP4 KV paths are split into quant plus store, classify them as an in-flight upstream fuse family. | +| PR `#21889` fused FP4 paged dequant to FP8 + page-table remap | `_dequant_fp4_to_fp8_paged_kernel`
`WRITE_PT`
`dequant_fp4_paged_decode` | `PR #21889`
`python/sglang/srt/layers/attention/nsa/dequant_fp4_to_fp8.py` | Triton kernel reads FP4 pages, writes FP8 directly, and can fuse decode-side page-table remap | Treat this as an upstream in-flight decode-prep fusion family. | +| PR `#21491` FlashInfer TRTLLM FP8 MoE with fused shared experts | `num_fused_shared_experts`
`trtllm_fp8_block_scale_moe` | `PR #21491`
`python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py`
`python/sglang/srt/models/deepseek_v2.py` | FlashInfer TRTLLM FP8 MoE path can fuse shared experts inside the routed MoE kernel | On FP8 TRTLLM MoE discussions, treat fused shared experts as an upstream pattern that already has a concrete PR. | +| PR `#22005` fused add + RMSNorm + per-token FP8 quant | `fused_add_rmsnorm_per_token_quant`
`per_token_quant_fp8` | `PR #22005`
`python/sglang/jit_kernel/csrc/elementwise/fused_add_rmsnorm_per_token_quant.cuh`
`python/sglang/jit_kernel/fused_add_rmsnorm_per_token_quant.py` | CUDA JIT kernel keeps normed values in registers and emits BF16 + FP8 outputs plus per-token scales | If FP8 online-quant traces show add+norm followed by per-token quant, treat this as an in-flight upstream CUDA fuse family. | +| PR `#20667` Qwen3.5 fused QK norm + RoPE + KV cache write | `fused_qk_norm_rope_cache_pts_quant_shuffle`
`fused_qk_norm_mrope_3d_cache_pts_quant_shuffle`
`rotary_dim` | `PR #20667`
`python/sglang/srt/models/qwen3_5.py`
`python/sglang/srt/models/utils.py` | ROCm / AITER path fuses Q / K RMSNorm, partial or 3D RoPE, and direct KV cache write for Qwen3.5 attention | Treat split QK-norm + RoPE + cache-store on Qwen3.5 as a concrete in-flight upstream family, not a novel idea. | +| PR `#22392` CUTLASS FP8 GEMM replacing nvjet | `cutlass_scaled_mm`
`fp8_scaled_mm`
`nvjet`
`cudaMemsetAsync` | `PR #22392`
`sgl-kernel/python/sgl_kernel/gemm.py`
`python/sglang/srt/layers/quantization/fp8_utils.py` | Runtime replacement swaps nvjet FP8 GEMMs for CUTLASS kernels, removing per-launch memset bubbles and extra output-copy kernels | Treat nvjet GEMM + memset bubble ladders as an in-flight SGLang linear-kernel family before calling them novel. | +| PR `#18612` NVFP4 CUTLASS MoE fused SiLU+Mul+quant | `silu_and_mul_scaled_nvfp4`
`nvfp4 expert quant`
`cutlass moe` | `PR #18612`
`python/sglang/srt/layers/moe/cutlass_w4a8_moe.py`
`python/sglang/jit_kernel/nvfp4.py` | Fuses MoE activation epilogue and NVFP4 expert quantization before the CUTLASS MoE second GEMM | Treat split SiLU+Mul then NVFP4 expert quant in CUTLASS MoE traces as an in-flight upstream SGLang family. | +| PR `#22918` FlashInfer per-token NVFP4 MoE | `per_token_nvfp4`
`trtllm_fp4_block_scale_moe`
`FlashInfer MoE` | `PR #22918`
`python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py` | Adds FlashInfer-backed per-token NVFP4 MoE execution so expert quant/dequant work can move into the fused MoE backend | Treat standalone per-token NVFP4 MoE support kernels as a candidate missing backend-selection path, not an automatically novel kernel idea. | +| PR `#22851` NSA top-k backend and FlashInfer / PyTorch top-k split | `nsa topk`
`flashinfer_topk`
`pytorch_topk`
`fast_topk_transform` | `PR #22851`
`python/sglang/srt/layers/attention/nsa_backend.py` | Makes NSA top-k backend selection explicit and aligns fused top-k transform with FlashInfer / PyTorch fallbacks | When NSA top-k dominates decode, first classify it as backend selection or fused-transform eligibility work. | +| PR `#24125` GLM5 NSA decode CatArrayBatchedCopy removal | `CatArrayBatchedCopy`
`GLM-5`
`NSA`
`TileLang decode` | `PR #24125`
`python/sglang/srt/layers/attention/nsa_backend.py` | Skips redundant cat/copy work in the GLM5 NSA TileLang decode path | Treat cat/copy bursts in GLM5 NSA decode as a concrete in-flight cleanup opportunity. | +| PR `#24007` MoE LoRA virtual experts for csgmv backend | `csgmv`
`virtual experts`
`MoE LoRA`
`fused_moe_lora` | `PR #24007`
`python/sglang/srt/layers/lora_backend.py`
`python/sglang/srt/layers/moe` | Routes MoE LoRA adapter work through virtual experts so csgmv-style kernels can batch it instead of launching fragmented adapter work | Treat MoE-LoRA tiny-kernel ladders as an in-flight batching/fusion family. | +| PR `#24150` torch.compile local decode support | `enable_torch_compile`
`local compile`
`decode compile`
`torchinductor` | `PR #24150`
`python/sglang/srt` | Extends SGLang torch.compile coverage to local decode regions, so Inductor-generated fusion may replace hand-authored tiny kernels | When decode traces show compiler-generated kernels or missing named fused kernels, check this in-flight compile path before calling the shape unsupported. |
+
+## 7. PR-backed / in-flight kernel-overlap families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| PR `#21877` fused down-GEMM + combine superseding SBO | `enable_fused_grouped_gemm_combine`
`combine`
`down_gemm` | `PR #21877`
`python/sglang/srt/server_args.py`
`python/sglang/srt/layers/moe/token_dispatcher/deepep.py` | Fused combine eliminates the standalone combine window, so SBO is intentionally disabled when this path is on | If the trace discussion is about combine overlap, first classify it as this upstream fused-overlap family. | +| PR `#23965` PDL for DSV32 / GLM5 kernels | `enable_pdl`
`TRTLLM_ENABLE_PDL`
`cudaGridDependencySynchronize`
`DSV32`
`GLM5` | `PR #23965`
`python/sglang/srt/layers`
`sgl-kernel` | Enables programmatic dependent launch on selected DeepSeek / GLM kernels so dependent decode kernels can overlap launch-to-start gaps | Treat tight same-stream decode windows around DSV32 / GLM5 as an in-flight PDL overlap family. | +| PR `#21878` TTFT / TPOT torch.compile optimization | `enable_torch_compile`
`decode graph`
`piecewise cudagraph` | `PR #21878`
`python/sglang/srt` | Uses compiler and graph capture changes to shave TTFT / TPOT rather than adding one handwritten kernel | If the trace shows many small compiler-visible decode ops, compare against this compile-overlap / graph-capture family first. | +| PR `#24168` batched GPU-to-CPU sync for logprobs / embeddings | `logprobs`
`embeddings`
`GPU->CPU sync`
`batch sync` | `PR #24168`
`python/sglang/srt` | Batches per-request synchronization work that can otherwise serialize decode progress around logprob or embedding outputs | Treat per-request CPU sync stalls in logprob / embedding traces as a concrete in-flight SGLang scheduler/data-movement family. |
+
+## 8. FlashInfer mainline fused-kernel families
+
+These rows are comparative references from `flashinfer`. Use them when a trace
+looks like an upstream FlashInfer family even if the current `sglang` checkout
+only consumes a subset of that implementation.
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| FlashInfer activation / gate epilogues | `silu_and_mul`
`gelu_tanh_and_mul`
`gelu_and_mul`
`silu_and_mul_scaled_nvfp4_experts_quantize` | `flashinfer/activation.py`
`flashinfer/quantization/fp4_quantization.py` | FlashInfer covers both the plain activation-plus-mul epilogues and the NVFP4 expert-quantized extension used on MoE expert paths | Treat standalone activation, multiply, and expert-side quant ladders as one existing FlashInfer epilogue family first. | +| FlashInfer norm / residual / quant epilogues | `rmsnorm_quant`
`fused_add_rmsnorm`
`fused_add_rmsnorm_quant`
`gemma_rmsnorm`
`gemma_fused_add_rmsnorm`
`fused_rmsnorm_silu`
`rmsnorm_fp4quant`
`add_rmsnorm_fp4quant` | `flashinfer/norm/__init__.py`
`flashinfer/cute_dsl/rmsnorm_fp4quant.py`
`flashinfer/cute_dsl/add_rmsnorm_fp4quant.py` | The norm family spans plain RMSNorm derivatives, residual-add epilogues, norm+activation, and direct FP8 / NVFP4 output variants instead of materializing each intermediate | Treat split residual add, norm, activation, and quant chains as one existing FlashInfer epilogue family first. | +| FlashInfer allreduce + post-op fusion family | `allreduce_fusion`
`AllReduceFusionPattern`
`kARResidualRMSNorm`
`kARResidualRMSNormFP8Quant`
`kARResidualRMSNormFP4Quant`
`trtllm_mnnvl_allreduce_fusion` | `flashinfer/comm/allreduce.py`
`flashinfer/comm/trtllm_ar.py`
`flashinfer/comm/trtllm_mnnvl_ar.py` | TRTLLM and MNNVL backends fuse all-reduce with residual add, RMSNorm, and backend-appropriate quant / norm-output variants | Treat TP collective + norm (+ quant) ladders as an existing FlashInfer fused-collective family first. | +| FlashInfer RoPE + FP8 quant / cache-update family | `rope_quantize_fp8`
`mla_rope_quantize_fp8`
`rope_quantize_fp8_append_paged_kv_cache`
`seqlen=0`
`batch_indices < 0` | `flashinfer/rope.py` | The RoPE family covers both RoPE+FP8 output and the larger decode / prefill-prep path that writes K / V directly into paged KV cache, including padding-token / zero-length sequence handling | Treat split RoPE, quant, cache-write, and padding-token ladders as one existing FlashInfer attention-prep family first. | +| FlashInfer fused DeepSeek grouped-topk routing | `fused_topk_deepseek`
`NoAuxTc` | `flashinfer/fused_moe/fused_routing_dsv3.py` | One kernel performs sigmoid+bias, grouped score reduction, group top-k, expert top-k, and routed renorm for DeepSeek-V3-style routing | Treat router score activation -> grouped top-k -> renorm ladders as an existing FlashInfer router family first. | +| FlashInfer fused MoE expert execution | `cutlass_fused_moe`
`trtllm_bf16_moe`
`trtllm_fp8_per_tensor_scale_moe`
`trtllm_fp8_block_scale_moe`
`trtllm_fp4_block_scale_moe`
`trtllm_mxint4_block_scale_moe`
`non-gated` | `flashinfer/fused_moe/core.py` | CUTLASS and TRTLLM backends collapse expert execution, routed combine, and quantized expert variants into fused MoE runners, including gated and non-gated FP8 per-tensor cases | Treat exposed expert-side tiny GEMM or non-gated FP8 ladders as matching an existing FlashInfer fused-MoE family. | +| FlashInfer CuTeDSL two-stage MoE fusion | `blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion_nvfp4`
`blockscaled_contiguous_grouped_gemm_finalize_fusion_nvfp4`
`moe_permute`
`moe_unpermute` | `flashinfer/fused_moe/cute_dsl/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py`
`flashinfer/fused_moe/cute_dsl/blockscaled_contiguous_grouped_gemm_finalize_fusion.py` | The CuTeDSL path fuses gather+GEMM1+SwiGLU in the first stage and finalize+unpermute+scatter-reduce in the second stage, removing standalone `moe_permute` and `moe_unpermute` kernels | Treat multi-kernel MoE ladders around permute / finalize as one existing FlashInfer CuTeDSL family first. | +| FlashInfer SM120 FP4 / groupwise GEMM heuristics | `cutlass_fp4_gemm_sm120`
`CutlassTileConfigSM120`
`group_gemm_nvfp4_nt_groupwise`
`group_gemm_mxfp4_nt_groupwise` | `flashinfer/gemm/gemm_base.py`
`include/flashinfer/gemm/fp4_gemm_cutlass_template_sm120.h`
`include/flashinfer/gemm/group_gemm_nvfp4_groupwise_sm120.cuh`
`csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp` | FlashInfer mainline adds SM120-oriented FP4 GEMM selection and b12x CuTeDSL fused-MoE kernels | Treat SM120 FP4 MoE/GEMM tile selection and Blackwell-lite shape restrictions as an upstream FlashInfer kernel family before inventing a local heuristic. | +| FlashInfer MoE `routing_replay_out` support | `routing_replay_out`
`mPtrRoutingReplayOut`
`trtllm_fp8_block_scale_moe` | `flashinfer/fused_moe/core.py`
`csrc/trtllm_fused_moe_kernel_launcher.cu`
`csrc/fused_moe/noAuxTcKernels.cu` | TRTLLM-gen MoE kernels can optionally emit compact routing replay metadata without a separate routing-side reconstruction pass | Treat routing-replay writes in MoE traces as part of the upstream FlashInfer TRTLLM MoE family, not a separate postprocess opportunity. |
+
+## 9. FlashInfer mainline kernel-overlap families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| FlashInfer PDL launch-overlap family | `enable_pdl`
`launch_with_pdl`
`cudaGridDependencySynchronize`
`cudaTriggerProgrammaticLaunchCompletion`
`trigger_completion_at_end=False`
`allreduce_fusion` | `flashinfer/norm/__init__.py`
`flashinfer/activation.py`
`flashinfer/rope.py`
`flashinfer/comm/allreduce.py`
`flashinfer/comm/trtllm_ar.py` | FlashInfer uses Programmatic Dependent Launch broadly, and the allreduce path can further advance completion so the next PDL-aware kernel overlaps on the same stream | Treat tight same-stream dependent windows and allreduce-followed-by-kernel windows as one existing FlashInfer launch-overlap family first. | +| FlashInfer CuTeDSL MoE aux-stream async-memset overlap | `aux_stream`
`main_event`
`memset_event`
`use_async_memset` | `flashinfer/fused_moe/cute_dsl/fused_moe.py` | Preallocated MoE output is zeroed on an auxiliary CUDA stream while GEMM1 runs on the main stream, then both streams join before finalize | Treat GEMM1 vs output-zero windows as an existing FlashInfer multi-stream overlap family. | +| FlashInfer green-context SM partition overlap | `split_device_green_ctx`
`split_device_green_ctx_by_sm_count`
`green_ctx` | `flashinfer/green_ctx.py` | CUDA green contexts partition SMs and create dedicated streams for concurrent kernel families on separate SM slices | Treat full-device two-stream traces and SM-partitioned traces as different manifestations of an existing FlashInfer overlap mechanism. |
+
+## 10. FlashInfer PR-backed / in-flight fused-kernel and kernel-overlap families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| PR `#2720` PDL runtime-API migration | `cudaGridDependencySynchronize`
`cudaTriggerProgrammaticLaunchCompletion`
`inline PTX` | `PR #2720`
`include/flashinfer/comm/trtllm_allreduce_fusion.cuh`
`include/flashinfer/pos_enc.cuh` | Repo-wide migration preserves the existing PDL overlap family while replacing inline PTX with CUDA runtime APIs across norm, RoPE, attention, and MoE codepaths | Treat PDL-looking launch groups as an upstream FlashInfer overlap family even when implementation details differ across revisions. |
+
+## 11. TensorRT-LLM-origin fused-kernel families
+
+These rows are comparative references from `TensorRT-LLM`. Use them when a
+trace looks like a TensorRT-LLM or TensorRT-LLM-plus-FlashInfer family even if
+the current `sglang` checkout only carries an analogous implementation.
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| TensorRT-LLM FlashInfer activation / gate epilogues | `flashinfer_silu_and_mul`
`flashinfer_gelu_tanh_and_mul`
`auto_deploy::silu_and_mul`
post-GEMM `silu` + `mul` | `tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py`
`tensorrt_llm/_torch/auto_deploy/transform/library/fuse_silu_mul.py`
`tensorrt_llm/_torch/models/modeling_gemma3.py` | Runtime custom ops and AutoDeploy rewrite `split/getitem + activation + mul` MLP epilogues into one FlashInfer op, including Gemma3 `gelu_tanh_and_mul` | Treat split gate activation + multiply as an existing TensorRT-LLM/FlashInfer epilogue family first. | +| TensorRT-LLM FlashInfer RMSNorm family | `flashinfer_rmsnorm`
`flashinfer_gemma_rmsnorm`
`auto_deploy::flashinfer_rms_norm` | `tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py`
`tensorrt_llm/_torch/modules/rms_norm.py`
`tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py` | Runtime modules and AutoDeploy can lower plain RMSNorm and Gemma RMSNorm directly to FlashInfer kernels | Treat split RMSNorm ladders as an existing TensorRT-LLM norm family before calling them novel. | +| TensorRT-LLM FlashInfer residual add + RMSNorm | `flashinfer_fused_add_rmsnorm`
`flashinfer_gemma_fused_add_rmsnorm`
`auto_deploy::flashinfer_fused_add_rms_norm_inplace` | `tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py`
`tensorrt_llm/_torch/modules/rms_norm.py`
`tensorrt_llm/_torch/auto_deploy/transform/library/fused_add_rms_norm.py` | Residual add immediately before RMSNorm can collapse to one in-place FlashInfer op, with Gemma variant support | Treat residual add + RMSNorm chains as an existing TensorRT-LLM fused epilogue family first. | +| TensorRT-LLM Triton fused residual add + RMSNorm + FP8 quant | `triton_fused_add_rms_norm_quant_fp8`
`fuse_rmsnorm_quant_fp8`
`fp8 static quant` | `tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/triton_fused_add_rms_norm_quant_fp8.py`
`tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rmsnorm_quant_fp8.py` | Mainline AutoDeploy can rewrite residual-add plus RMSNorm plus FP8 static quant into one Triton op that emits BF16 norm output, FP8 quant output, and residual-add output together | Treat split add + norm + FP8 quant ladders as an existing TensorRT-LLM mainline family first. | +| TensorRT-LLM FlashInfer RoPE with shared cos/sin cache | `flashinfer_apply_rope_with_cos_sin_cache_inplace`
`flashinfer_rope`
`cos_sin_cache` | `tensorrt_llm/_torch/modules/rotary_embedding.py`
`tensorrt_llm/_torch/auto_deploy/custom_ops/rope/flashinfer_rope.py`
`tensorrt_llm/_torch/auto_deploy/transform/library/rope.py` | Runtime path applies in-place RoPE from a shared cos/sin cache, while AutoDeploy can prebuild the full cache and lower diverse RoPE graphs to `flashinfer_rope` | Treat separate cos/sin gather + RoPE application ladders as an existing TensorRT-LLM attention-prep family. | +| TensorRT-LLM FlashInfer cached paged attention | `append_paged_kv_cache`
`BatchPrefillWithPagedKVCacheWrapper`
`BatchDecodeWithPagedKVCacheWrapper`
`auto_deploy::flashinfer_attention_mha_with_cache`
`read_cache_only` | `tensorrt_llm/_torch/attention_backend/flashinfer.py`
`tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py`
`docs/source/features/attention.md` | FlashInfer attention backend fuses metadata setup, optional paged-KV append, and prefill/decode wrapper execution, including shared-KV and read-cache-only variants in AutoDeploy | Treat metadata + KV-append + cached-attention ladders as one existing TensorRT-LLM cached-attention family first. | +| TensorRT-LLM FlashInfer MLA regular prefill | `append_paged_mla_kv_cache`
`BatchPrefillWithRaggedKVCacheWrapper`
`flashinfer_mla`
`rank 256`
`gpu append kernel` | `tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py` | Regular MLA prefill writes compressed KV pages and runs FlashInfer ragged prefill instead of a split append-plus-prefill ladder, with rank-256 paged-KV setups using the GPU append path | Treat MLA regular-prefill prep as an existing TensorRT-LLM FlashInfer family first. | +| TensorRT-LLM FlashInfer MLA chunked prefill with absorbed `W_kn` | `BatchMLAPagedAttentionWrapper`
`chunked prefill`
`W_kn`
`W_v` | `tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py` | Chunked prefill absorbs `W_kn` into the query-side projection, runs paged MLA attention in compressed space, then projects back with `W_v` | Treat split absorbed-proj + MLA + output-proj ladders as an existing TensorRT-LLM MLA family first. | +| TensorRT-LLM FlashInfer MLA decode with absorbed `W_kn` + `W_v` | `plan_decode`
`BatchMLAPagedAttentionWrapper`
`decode`
`W_kn`
`W_v` | `tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py` | Decode path reuses the absorbed-query MLA family and projects the compressed attention output back with `W_v` | Treat similar decode-time absorbed MLA ladders as an existing TensorRT-LLM family, not a new idea. | +| TensorRT-LLM FlashInfer fused MoE backend | `flashinfer.fused_moe`
`trtllm_bf16_moe`
`trtllm_fp8_block_scale_moe`
`trtllm_fp4_block_scale_moe`
`TRTLLM_GEN_FUSED_MOE_USE_FLASHINFER` | `tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py`
`tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` | TRTLLM-gen MoE can route expert execution and quant helpers through FlashInfer instead of exposing per-expert eager ladders | Treat expert-side tiny GEMM ladders as matching an existing TensorRT-LLM FlashInfer MoE family first. | +| TensorRT-LLM FlashInfer cached SSM / Mamba update | `flashinfer_cached_ssm`
`selective_state_update`
`flashinfer_ssm` | `tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/flashinfer_backend_mamba.py`
`tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py` | Mamba2 paths can lower cached SSM state updates to FlashInfer selective-state-update kernels instead of many smaller state ops | Treat split cached-SSM state update ladders as an existing TensorRT-LLM FlashInfer family first. |
+
+## 12. TensorRT-LLM-origin kernel-overlap families
+
+| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude |
+| --- | --- | --- | --- | --- |
+| TensorRT-LLM multi-stream MLA attention | `multi_stream_mla_attn`
`record_event_passthrough`
`_aux`
`wait_event` | `tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_attn.py`
`tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py` | AutoDeploy rewrites MLA Q/KV forks so the KV projection runs on an auxiliary stream while the Q path stays on the caller stream | Treat exposed Q-branch vs KV-branch overlap as an existing TensorRT-LLM multi-stream family first. | +| TensorRT-LLM multi-stream MoE shared-vs-routed overlap | `multi_stream_moe`
`begin_aux_stream_passthrough`
`end_aux_stream_passthrough`
`wait_aux_stream_passthrough`
`mlir_elementwise_fusion`
`piecewise cudagraph`
`caller_stream.synchronize()` | `tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_moe.py`
`tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py` | Shared-expert work is moved to an auxiliary stream while routed-expert MoE work remains on the main stream and rejoins at the merge node; the same family includes synchronization rules for MLIR-fused kernels and piecewise cudagraph replay | Treat shared-expert vs routed-expert windows, including altered `multi_stream_moe` behavior under MLIR / piecewise graph modes, as an existing TensorRT-LLM branch-overlap family. | +| TensorRT-LLM multi-stream FP8 GEMM fork parallelism | `multi_stream_gemm`
`trtllm_finegrained_fp8_linear`
`record_event_passthrough`
`_aux` | `tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_gemm.py`
`tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py` | Compiler pass identifies fork points with multiple FP8 linears and moves the largest GEMM to the auxiliary stream so sibling GEMMs overlap | Treat sibling FP8 linear branches as an existing TensorRT-LLM overlap family before designing a new stream split. | + +## 13. TensorRT-LLM-origin PR-backed / in-flight fused-kernel and kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| PR `#12525` FlashInfer TRTLLM-gen FMHA paged-index / buffer rework | `shared paged index`
`trtllm-gen attention`
`flashinfer`
`kv cache buffer` | `PR #12525`
`tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py` | Open PR refines the existing FlashInfer TRTLLM-gen cached-attention family by disabling shared paged index and unifying KV-buffer construction | Treat these attention-prep changes as an in-flight implementation evolution of an existing family first. | +| PR `#12544` NVFP4 KV cache support in TRTLLM-gen attention | `NVFP4 KV cache`
`trtllm-gen attention`
`flashinfer` | `PR #12544`
`tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py` | Open PR extends the cached-attention family so the FlashInfer-backed TRTLLM-gen path can build and consume NVFP4 KV buffers directly | Treat split KV-cache quant + buffer-build ladders as an in-flight TensorRT-LLM attention family first. | +| PR `#12738` / `#12557` BF16 TRTLLM-gen MoE through FlashInfer | `bf16 trtllm-gen moe`
`flashinfer`
`trtllm_bf16_moe` | `PR #12738`
`PR #12557`
`tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` | Open PRs extend the TRTLLM-gen MoE family so BF16 expert execution can route through FlashInfer instead of only CUTLASS-like paths | Treat BF16 expert ladders as an in-flight TensorRT-LLM FlashInfer MoE family. | + +## 14. vLLM-origin fused-kernel families + +These rows are comparative references from `vllm`. Use them when a trace looks +similar to an upstream family even if the current `sglang` checkout does not +contain the same implementation. + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| vLLM-origin fused residual add + RMSNorm | `fused_add_rms_norm*`
residual add right before RMSNorm | `vllm/model_executor/layers/layernorm.py`
`vllm/_custom_ops.py`
`csrc/layernorm_kernels.cu`
`csrc/cpu/layernorm.cpp` | Custom CUDA / CPU fused add-RMSNorm op reused directly and as a building block for later compile-time fusions | Treat split residual add + RMSNorm as a long-standing vLLM-origin precedent before calling the opportunity novel in sglang. | +| vLLM-origin AllReduce + RMSNorm (+ residual / quant) | `fuse_allreduce_rms`
`AllReduceFusionPass`
`allreduce + rmsnorm` | `vllm/compilation/passes/fusion/allreduce_rms_fusion.py`
`docs/design/fusions.md` | Compile-time patterns cover `AllReduce -> RMSNorm(+residual_add)` and optional FP8 / NVFP4 quant suffixes | Treat TP collective + norm (+ quant) ladders as a known vLLM-origin fusion family first. | +| vLLM-origin RMSNorm (+ residual add) + quant | `RMSNormQuantFusionPass`
`fused_add_rms_norm_static_fp8_quant`
`per_token_quant`
`per_group_quant` | `vllm/compilation/passes/fusion/rms_quant_fusion.py`
`vllm/compilation/passes/fusion/rocm_aiter_fusion.py` | Compile-time and ROCm AITER paths fuse RMSNorm or fused-add-RMSNorm with FP8 / FP4 quant output | Treat split norm/add + quant as an upstream fused family, not an unexplored direction. | +| vLLM-origin SiLU+Mul + quant | `ActivationQuantFusionPass`
`SiluMulFp8*`
`Nvfp4`
`rocm_aiter` | `vllm/compilation/passes/fusion/act_quant_fusion.py`
`vllm/compilation/passes/fusion/rocm_aiter_fusion.py` | Activation epilogues fuse `SiLU+Mul` with FP8 / NVFP4 / AITER group quant instead of materializing the BF16 activation first | Treat standalone activation then quant kernels as matching a vLLM-origin precedent. | +| vLLM-origin add + RMSNorm + pad | `fuse_act_padding`
`RocmAiterTritonAddRMSNormPadFusionPass`
`add_rmsnorm_pad` | `vllm/compilation/passes/fusion/rocm_aiter_fusion.py`
`docs/design/fusions.md` | ROCm / AITER path fuses residual add + RMSNorm directly into the padded layout expected by the next kernel | Treat norm-plus-padding ladders as an existing backend-specific fuse family first. | +| vLLM-origin attention + output quant | `fuse_attn_quant`
`AttnQuantFusionPass`
`merge_attn_states`
`output_scale`
`output_group_scale`
`output_block_scale` | `vllm/compilation/passes/fusion/attn_quant_fusion.py`
`vllm/v1/attention/ops/merge_attn_states.py`
`csrc/attention/merge_attn_states.cu`
`docs/design/fusions.md` | Compile-time fusion pushes FP8 / NVFP4 quantization into the attention epilogue on supported Triton / FlashInfer / ROCm / AITER backends, and mainline `merge_attn_states` kernels already support FP8 output when `output_scale` is provided | Treat attention-output quant and merged-attention quant epilogues as a known upstream family before calling them novel. | +| vLLM-origin fused QK RMSNorm + RoPE | `fused_qk_norm_rope`
`QKNormRoPEFusionPass`
`qk norm + rope` | `vllm/compilation/passes/fusion/qk_norm_rope_fusion.py`
`vllm/_custom_ops.py`
`csrc/fused_qknorm_rope_kernel.cu` | Compile-time and direct custom-op paths fuse per-head Q / K RMSNorm with RoPE | Treat split QK norm + RoPE as a clear vLLM-origin precedent. | +| vLLM-origin fused reshape + KV cache write | `reshape_and_cache`
`triton_reshape_and_cache_flash`
`kv cache write` | `vllm/v1/attention/ops/triton_reshape_and_cache_flash.py`
`vllm/v1/attention/backends/triton_attn.py` | Triton cache-update kernels reshape K / V into paged-cache layout and can include FP8 KV-cache scale/write logic | Treat reshape / transpose / cache-write ladders as an existing cache-store fusion family. | +| vLLM-origin fused RoPE + KV cache update | `fuse_rope_kvcache`
`RopeKVCacheFusionPass`
`triton_rope_and_cache` | `vllm/compilation/passes/fusion/rope_kvcache_fusion.py`
`vllm/_aiter_ops.py`
`docs/design/fusions.md` | ROCm / AITER compile-time fusion combines RoPE with paged KV cache update instead of launching them separately | Treat split RoPE + cache-store as a known upstream family, especially on ROCm-like paths. | +| vLLM-origin fused MLA RoPE + concat/cache write | `concat_and_cache_mla_rope_fused`
`mla rope cache` | `vllm/_custom_ops.py`
`csrc/cache_kernels_fused.cu` | CUDA kernel fuses MLA-oriented RoPE preparation, concat, and cache write into a direct paged-store path | Treat MLA concat + cache-write ladders as a vLLM-origin precedent before calling them novel. | +| vLLM-origin fused grouped top-k / biased grouped top-k router | `grouped_topk`
`biased_grouped_topk`
`grouped_topk_fused_kernel` | `vllm/_custom_ops.py`
`vllm/_aiter_ops.py`
`vllm/model_executor/layers/fused_moe/router/grouped_topk_router.py`
`csrc/moe/grouped_topk_kernels.cu` | CUDA / ROCm router kernels fuse grouped score processing, top-k selection, and routed renorm / bias handling | Treat MoE router ladders as matching an upstream grouped-topk family first. | +| vLLM-origin fused top-k softmax / sigmoid router | `topk_softmax`
`topk_sigmoid`
`topkGating`
`fused_topk` | `vllm/_custom_ops.py`
`vllm/_aiter_ops.py`
`vllm/model_executor/layers/fused_moe/router/fused_topk_router.py`
`vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py`
`csrc/moe/topk_softmax_kernels.cu` | CUDA and ROCm / AITER router kernels fuse score activation (`softmax` / `sigmoid`), top-k selection, optional bias correction, and routed renorm into one op instead of routing through grouped-topk or eager softmax-plus-topk ladders | Treat standalone score activation -> top-k -> bias / renorm chains as a known upstream fused router family first. | +| vLLM-origin DSV3 router GEMM | `dsv3_router_gemm`
`allow_dsv3_router_gemm`
`router logits` | `vllm/_custom_ops.py`
`vllm/model_executor/layers/fused_moe/router/gate_linear.py`
`csrc/moe/dsv3_router_gemm_entry.cu`
`csrc/moe/dsv3_router_gemm_float_out.cu` | Hopper-class CUDA kernel specializes the DeepSeek router linear for small decode batches and can emit FP32 logits directly without a generic GEMM chain | Treat DeepSeek-style router linear paths as an existing upstream specialized fuse, distinct from grouped-topk itself. | +| vLLM-origin GPT-OSS router GEMM | `gpt_oss_router_gemm`
`router gemm` | `vllm/_custom_ops.py`
`vllm/model_executor/layers/fused_moe/router/gate_linear.py`
`csrc/moe/gpt_oss_router_gemm.cu` | Model-specific CUDA kernel replaces the router linear plus bias path with one specialized GEMM op | Treat GPT-OSS-style router linear chains as an existing upstream specialized fuse. | +| vLLM-origin DeepSeek min-latency fused QKV-A projection | `dsv3_fused_a_gemm`
`fused_qkv_a_proj`
`q_a_proj` | `vllm/model_executor/models/deepseek_v2.py`
`vllm/_custom_ops.py`
`csrc/dsv3_fused_a_gemm.cu` | Hopper-class CUDA kernel replaces the tiny-batch DeepSeek QKV-A projection path with one specialized min-latency GEMM instead of a generic linear launch | Treat small-batch DeepSeek QKV-A projection ladders as a known upstream fused kernel family first. | +| vLLM-origin DSV3.2 fused indexer projections | `wk_weights_proj`
`MergedColumnParallelLinear`
`weights_proj` | `vllm/model_executor/models/deepseek_v2.py`
`vllm/model_executor/models/deepseek_mtp.py` | DSV3.2 indexer paths can fuse the `wk` and `weights_proj` projections into one GEMM and carry the matching MTP weight-loading path | Treat paired indexer projection chains as a known upstream fused linear family before calling the opportunity novel. | +| vLLM-origin MiniMax allreduce_rms kernels | `minimax_allreduce_rms`
`minimax_allreduce_rmsnorm`
`MiniMax-M2.5`
`allreduce_rms` | `vllm/model_executor/models/minimax_m2.py` | TensorRT-LLM-derived MiniMax allreduce-plus-RMSNorm kernels are a concrete upstream TP decode family | Treat MiniMax TP norm + collective ladders as an upstream specialized fusion family. | +| vLLM-origin CUTLASS scaled MM with scale / bias epilogue | `cutlass_scaled_mm`
`cutlass_scaled_mm_azp`
`scaled mm` | `vllm/_custom_ops.py`
`vllm/model_executor/kernels/linear/scaled_mm/cutlass.py`
`csrc/libtorch_stable/quantization/w8a8/cutlass/scaled_mm_entry.cu` | CUTLASS kernels fuse activation scales, weight scales, matmul, and optional bias / AZP epilogues | Treat separate scale-mul + GEMM + bias ladders as a vLLM-origin fused linear family first. | +| vLLM-origin fused MoE expert execution | `cpu_fused_moe`
`rocm_aiter_fused_moe`
`FusedMoE` | `vllm/model_executor/layers/fused_moe/layer.py`
`vllm/model_executor/layers/fused_moe/cpu_fused_moe.py`
`vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py`
`vllm/_aiter_ops.py` | MoE backends on CUDA / ROCm / CPU already collapse packed expert execution into fused expert kernels rather than per-expert eager GEMMs | Treat exposed expert-side tiny GEMM ladders as matching an upstream fused-MoE family. | +| vLLM-origin fused MoE LoRA | `fused_moe_lora`
`fused_moe_lora_fp8`
`w13_shrink`
`w2_expand` | `vllm/lora/ops/triton_ops/fused_moe_lora_op.py`
`vllm/lora/ops/triton_ops/fused_moe_lora_fp8_op.py`
`vllm/lora/layers/fused_moe.py` | Triton kernels fuse LoRA shrink / expand work into MoE expert execution, including FP8 variants | Treat MoE-LoRA adapter work as an upstream fused family before proposing a brand new kernel. | +| vLLM-origin ViT fused bilinear position-embedding interpolation | `triton_pos_embed_interpolate`
`bilinear_pos_embed`
`pos_embed_interpolate_native` | `vllm/model_executor/models/qwen3_vl.py` | Triton kernel fuses bilinear interpolation and spatial-merge reorder for Qwen3-VL ViT position embeddings, replacing many tiny eager kernels | Treat VLM position-embedding ladders as an existing vLLM-origin Triton fusion family. | + +## 15. vLLM-origin kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| vLLM-origin AsyncTP GEMM + collective overlap | `fuse_gemm_comms`
`fused_matmul_reduce_scatter`
`fused_all_gather_matmul` | `vllm/compilation/passes/fusion/collective_fusion.py`
`docs/design/fusions.md` | AsyncTP overlaps GEMM with reduce-scatter / all-gather via symmetric-memory collectives | Treat GEMM+comm windows as a clear vLLM-origin overlap precedent first. | +| vLLM-origin Sequence Parallelism staging | `enable_sp`
`ReduceScatter`
`AllGather`
`SequenceParallelismPass` | `vllm/compilation/passes/fusion/sequence_parallelism.py`
`docs/design/fusions.md` | Sequence-parallel rewrites all-reduce into RS -> local norm -> AG so later passes can overlap comm and compute | Treat RS / AG staging around norm blocks as an upstream overlap-enabling family. | +| vLLM-origin shared-expert aux-stream overlap | `aux_stream`
`shared_experts_stream`
shared expert near router | `vllm/model_executor/layers/fused_moe/runner/shared_experts.py`
`vllm/model_executor/layers/fused_moe/runner/moe_runner_base.py` | MoE shared experts can record the cloned input on `shared_experts_stream`, wait on the caller stream, run in parallel with router-side work, and rejoin before merge | Treat shared-expert vs router overlap as an existing upstream sparse-model family. | +| vLLM-origin DCP async all-to-all overlap | `dcp_alltoall`
`all_to_all_single`
`async_op=True` | `vllm/v1/attention/ops/dcp_alltoall.py` | Output / LSE exchange uses async all-to-all handles instead of serializing collective completion on the main path | Treat DCP all-to-all windows as an upstream async-collective family. | + +## 16. vLLM-origin PR-backed / in-flight fused-kernel and kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| PR `#35968` DSV3.2 multi-stream indexer overlap | `weights_proj`
`wk`
`k_norm`
`aux_stream` | `PR #35968`
`vllm/model_executor/models/deepseek_v2.py`
`vllm/utils/torch_utils.py` | Closed PR explored overlapping the small `weights_proj` GEMM with `wk + k_norm` on a secondary CUDA stream for decode batches instead of serializing both on the default stream | Treat this as a concrete upstream decode-time kernel-overlap family when traces show underutilized projection overlap opportunities. | +| PR `#37110` Triton attention + per-group FP8 dynamic quant | `group_size=128`
`group_size=64`
`output_group_scale`
`per-group FP8` | `PR #37110`
`vllm/compilation/passes/fusion/attn_quant_fusion.py`
`vllm/v1/attention/ops/triton_unified_attention.py` | In-flight Triton attention epilogue computes per-group FP8 scales and quantizes output directly instead of launching a separate group-quant kernel | Treat attention + per-group FP8 quant as a concrete upstream vLLM family, not a novel idea. | +| PR `#38445` MiniMax-M2 FP32 gate kernel | `fp32_router_gemm`
`MiniMax-M2`
`gate kernel` | `PR #38445`
`vllm/model_executor/layers/fused_moe/router/gate_linear.py`
`vllm/model_executor/models/minimax_m2.py` | Draft CUDA kernel fuses BF16->FP32 conversion and low-batch router GEMM for MiniMax-M2, replacing up to three kernels on the gate path | Treat MiniMax-M2 gate ladders as an in-flight upstream fused router family first. | +| PR `#38621` fused QK norm + RoPE + cache + quant | `fused_qk_norm_rope_cache_quant`
`QK Norm + RoPE + Cache + Quant` | `PR #38621`
`csrc/fused_qk_norm_rope_cache_quant.cu`
`vllm/compilation/passes/fusion/qk_norm_rope_cache_quant_fusion.py` | Draft CUDA kernel and compile-time pass try to fuse QK RMSNorm, RoPE, KV cache write, and optional FP8 quant for small-batch decode | Treat this as an in-flight upstream fusion family before calling a similar idea novel. | +| PR `#37646` ROCm AITER fused allreduce + RMSNorm | `rocm_aiter_fused_allreduce_rmsnorm`
`custom_fused_ar_rms`
`RocmAiterAllReduceFusionPass` | `PR #37646`
`vllm/_aiter_ops.py`
`vllm/compilation/passes/pass_manager.py` | ROCm-specific compile-time path swaps the generic all-reduce fusion pass for an AITER fused allreduce-plus-RMSNorm kernel family | Treat ROCm TP all-reduce + RMSNorm ladders as an in-flight upstream fused-collective family first. | +| PR `#36413` FlashInfer RMSNorm + FP4 quant fusion | `fuse_norm_quant`
`flashinfer`
`NVFP4`
`rmsnorm + fp4 quant` | `PR #36413`
`vllm/compilation/passes/fusion/rms_quant_fusion.py`
`docs/design/fusions.md` | FlashInfer-backed norm-plus-FP4 quant fusion extends the existing RMSNorm+quant family to NVFP4 flows | Treat split RMSNorm + FP4 quant ladders as an upstream in-flight family, not a fresh idea. | +| PR `#39301` GLM5 router GEMM with PDL overlap | `TRTLLM_ENABLE_PDL`
`router_gemm`
`GLM5`
`FI AR RMS fusion` | `PR #39301`
`vllm/model_executor/layers/fused_moe/router/gate_linear.py`
`csrc/moe/dsv3_router_gemm_utils.h` | Extends the specialized router GEMM family to GLM5 hidden size and uses PDL to overlap the router launch with the preceding fused allreduce-plus-RMS block | Treat this as an in-flight upstream router-kernel plus launch-overlap family before calling it novel. | +| PR `#41455` ROCm WMMA paged prefill and split-K decode | `wmma`
`paged prefill`
`split-K decode`
`ROCm attention` | `PR #41455`
`vllm/v1/attention`
`vllm/_aiter_ops.py` | Adds ROCm WMMA attention kernels for paged prefill and split-K decode shapes | Treat split attention support kernels on AMD as an in-flight vLLM attention-kernel family before calling them novel. | +| PR `#41263` DeepSeek-V4 fused norm / router low-latency path | `DSV4`
`fuse norm router`
`low latency`
`router` | `PR #41263`
`vllm/model_executor/models/deepseek_v2.py`
`vllm/model_executor/layers/fused_moe/router` | Targets DeepSeek-V4 decode latency by fusing norm / router-adjacent work and low-latency model paths | Treat DSV4 norm-router ladders as a concrete in-flight upstream family. | +| PR `#41428` DSV4 fused indexer Q quant kernel | `DSV4`
`fused Indexer Q quant`
`indexer q`
`fp4` | `PR #41428`
`vllm/model_executor/models/deepseek_v2.py`
`csrc` | Improves the fused DeepSeek-V4 indexer Q quant kernel instead of materializing Q then quantizing separately | Treat DSV4 indexer-Q quant ladders as an in-flight upstream fused quant family. | +| PR `#41255` DeepSeek-V4 Tile kernels / `head_compute_mix_kernel` | `head_compute_mix_kernel`
`Tile kernel`
`DSV4`
`MLA` | `PR #41255`
`vllm/model_executor/models/deepseek_v2.py`
`csrc` | Adds DeepSeek-V4 Tile kernels that mix head compute work in one specialized kernel | Treat DSV4 MLA head-compute ladders as a known in-flight specialized-kernel family. | +| PR `#41441` DSV4 all-reduce plus `mhc_post` fusion | `DSV4`
`AR+mhc_post`
`allreduce`
`mhc_post` | `PR #41441`
`vllm/model_executor/models/deepseek_v2.py`
`vllm/compilation/passes/fusion` | Fuses or overlaps DSV4 all-reduce with post-MLA head-compute work | Treat all-reduce followed by `mhc_post` in DSV4 traces as an in-flight vLLM overlap/fusion family. | +| PR `#41446` AMD GatedDeltaNet FLA prefill kernels | `GatedDeltaNet`
`FLA prefill`
`AMD`
`Qwen3-Next` | `PR #41446`
`vllm/model_executor/models/qwen3_next.py`
`vllm/v1/attention` | Optimizes GatedDeltaNet / FLA prefill kernels on AMD linear-attention models | Treat split GDN prefill kernels on ROCm as an in-flight upstream family. | +| PR `#39748` dual-stream GDN input projection | `dual-stream`
`input projection`
`GatedDeltaNet`
`Qwen3.5` | `PR #39748`
`vllm/model_executor/models/qwen3_next.py` | Overlaps sibling input-projection branches for Qwen3 / Qwen3.5 GDN-style blocks | Treat serial GDN input projections as a known in-flight overlap opportunity. | +| PRs `#41433` / `#41434` / `#41429` / `#40561` GPU/CPU sync removal | `GPU->CPU sync`
`cpu sync`
`item()`
`non_blocking` | `PR #41433`
`PR #41434`
`PR #41429`
`PR #40561` | Removes or gates accidental GPU-to-CPU synchronization points and adds sync-detection coverage | Treat CPU gaps next to small GPU kernels as an upstream vLLM sync-removal family before proposing a kernel-only fix. | +| PR `#36823` vLLM IR `fused_add_rms_norm` overload | `vllm_ir`
`fused_add_rms_norm`
`maybe_inplace` | `PR #36823`
`vllm/compilation/passes/ir`
`vllm/compilation/passes/fusion/rms_quant_fusion.py` | Extends vLLM IR lowering so fused-add-RMSNorm variants remain visible to later compile-time fusions | Treat missing norm/quant compile fusion as potentially an IR-lowering visibility issue. | + +## 17. Important toggles and caveats + +| Toggle / env | Location | Effect on trace interpretation | +| --- | --- | --- | +| `enable_flashinfer_allreduce_fusion` | `python/sglang/srt/server_args.py` | Enables the FlashInfer TP allreduce fusion family. | +| `enable_aiter_allreduce_fusion` | `python/sglang/srt/server_args.py` | Enables ROCm AITER TP allreduce fusion. | +| `enable_deterministic_inference` | `python/sglang/srt/server_args.py` | Can intentionally disable or change some fast fusion paths, especially AITER allreduce fusion and some sampling / router choices, so split kernels may be expected. | +| `enable_single_batch_overlap` | `python/sglang/srt/server_args.py` | Enables the SBO family. | +| `enable_fused_moe_sum_all_reduce` | `python/sglang/srt/server_args.py` | Enables fused MoE sum-reduce in the down path. | +| `SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO` | `python/sglang/srt/environ.py` | Alters how DeepSeek-style shared-expert overlap behaves on Blackwell. | +| `SGLANG_NSA_FUSE_TOPK` | `python/sglang/srt/environ.py` | Gates NSA fused top-k transform / page-table build. | +| `SGLANG_DISAGG_STAGING_BUFFER` | `python/sglang/srt/environ.py` | Enables the heterogeneous-TP staging-buffer family and its overlap windows. | +| `SGLANG_STAGING_USE_TORCH` | `python/sglang/srt/disaggregation/common/staging_buffer.py` | Forces torch fallback for staging gather / scatter, so Triton staging kernels may disappear by design. | +| `SGLANG_VIT_ENABLE_CUDA_GRAPH` | `python/sglang/srt/environ.py` | Can intentionally disable vision `aux_stream` overlap. | +| `SGLANG_ENABLE_FUSED_QKNORM_ROPE` | `python/sglang/multimodal_gen/runtime/layers/layernorm.py` | Gates the diffusion fused qknorm+rope path. 
| +| `enable_pdl` / `launch_with_pdl` | `flashinfer/norm/__init__.py`
`flashinfer/activation.py`
`flashinfer/rope.py`
`flashinfer/fused_moe/core.py`
`flashinfer/comm/allreduce.py` | Enables FlashInfer PDL across many kernels; launch grouping and same-stream overlap can change substantially when it is on. | +| `trigger_completion_at_end` | `flashinfer/comm/allreduce.py` | `False` enables downstream PDL-aware overlap after FlashInfer allreduce fusion; `True` delays completion to kernel end and removes that overlap window. | +| `use_cuda_graph` | `flashinfer/fused_moe/cute_dsl/fused_moe.py` | Enables the preallocated-buffer path and the safe aux-stream async-memset overlap in FlashInfer CuTeDSL MoE. | +| `split_device_green_ctx*` | `flashinfer/green_ctx.py` | Changes trace shape by partitioning SMs into separate green contexts instead of overlapping full-device streams on the default context. | +| `rmsnorm_backend` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Chooses whether AutoDeploy lowers RMSNorm to FlashInfer, so split norm ladders may reflect backend selection rather than a missing fuse. | +| `insert_cached_attention.backend` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Selects the cached-attention backend; `flashinfer` enables the paged-KV cached-attention family. | +| `insert_cached_mla_attention.backend` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Selects the cached MLA backend; `flashinfer_mla` enables the MLA prefill / decode family. | +| `TRTLLM_GEN_FUSED_MOE_USE_FLASHINFER` | `tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` | Forces or guards the FlashInfer-backed TRTLLM-gen MoE family, so expert-kernel shape can change substantially when it is set. | +| `multi_stream_moe` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Enables the TensorRT-LLM shared-expert vs routed-expert overlap family. | +| `multi_stream_mla_attn` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Enables the TensorRT-LLM MLA Q-vs-KV branch overlap family. 
| +| `multi_stream_gemm` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Enables generalized FP8 GEMM fork overlap in TensorRT-LLM AutoDeploy. | +| `mlir_elementwise_fusion` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Can absorb merge adds into larger fused kernels, so missing explicit merge nodes in multi-stream traces may be intentional. | +| `enable_torch_compile` | `python/sglang/srt/server_args.py`
`python/sglang/multimodal_gen/runtime/server_args.py` | Compiler-generated fusion / reordering can hide handwritten kernel names; absence of a custom kernel does not always mean absence of fusion. | +| `enable_fused_grouped_gemm_combine` | `PR #21877` | In-flight path that intentionally disables SBO because combine is folded into down-GEMM. | +| `PassConfig.fuse_allreduce_rms` | `vllm/config/compilation.py` | Enables vLLM's AllReduce -> RMSNorm (+ residual / quant) compile-time fusion family. | +| `PassConfig.fuse_norm_quant` | `vllm/config/compilation.py` | Enables vLLM's RMSNorm(+residual add) -> FP8 / FP4 quant compile-time fusion family. | +| `PassConfig.fuse_act_quant` | `vllm/config/compilation.py` | Enables vLLM's `SiLU+Mul -> quant` fusion family, plus ROCm AITER variants where applicable. | +| `PassConfig.fuse_attn_quant` | `vllm/config/compilation.py` | Enables attention-epilogue quant fusion; requires the right backend / graph visibility, so split kernels may still be expected. | +| `PassConfig.fuse_mla_dual_rms_norm` | `vllm/config/compilation.py` | Enables the AITER-backed MLA paired-Q/KV RMSNorm fusion family on ROCm. | +| `PassConfig.enable_qk_norm_rope_fusion` | `vllm/config/compilation.py` | Enables the compile-time QK RMSNorm + RoPE family on CUDA-like backends. | +| `PassConfig.fuse_rope_kvcache` | `vllm/config/compilation.py` | Enables ROCm / AITER RoPE + KV-cache update fusion and is range-limited by token count. | +| `PassConfig.fuse_minimax_qk_norm` | `vllm/config/compilation.py` | Enables the MiniMax decode Q/K allreduce-plus-RMSNorm compile-time fusion family. | +| `PassConfig.fuse_act_padding` | `vllm/config/compilation.py` | Enables the ROCm AITER add-RMSNorm-plus-pad fusion family when AITER is available. | +| `PassConfig.enable_sp` | `vllm/config/compilation.py` | Rewrites all-reduce into sequence-parallel staging; this is often a prerequisite for the overlap family, not just a pure fuse toggle. 
| +| `PassConfig.fuse_gemm_comms` | `vllm/config/compilation.py` | Enables AsyncTP GEMM + collective overlap and auto-enables `enable_sp` when valid. | +| `TRTLLM_ENABLE_PDL` | `vllm/csrc/dsv3_fused_a_gemm.cu`
`vllm/csrc/moe/dsv3_router_gemm_utils.h` | Enables programmatic dependent launch for the DSV3 specialized CUDA kernels, which can change launch grouping and trace shape for router / QKV-A paths. | + +## 18. Suggested refresh commands + +These commands are only for maintainers refreshing this catalog by rescanning +the local source trees. They are not used by the triage scripts at runtime. + +```bash +# Optional sibling checkouts used for comparative scanning: +FLASHINFER_REPO=${FLASHINFER_REPO:-../flashinfer} +TRTLLM_REPO=${TRTLLM_REPO:-../TensorRT-LLM} +VLLM_REPO=${VLLM_REPO:-../vllm} + +rg -n "fused_add_rmsnorm|gemma_fused_add_rmsnorm|silu_and_mul|gelu_and_mul|fused_qk_rope_reshape_and_cache|fused_set_kv_buffer|fused_metadata_copy|normal_decode_set_metadata|_append_shared_to_topk_output|fused_append_shared_experts_with_weights" python/sglang +rg -n "MiniMaxM2RMSNormTP|fused_qknorm_rope|fused_qk_rope_cat_and_cache_mla|fused_qk_norm_mrope_3d_cache_pts_quant_shuffle|split_qkv_rmsnorm_rope|trtllm_fp8_kv_kernel|set_mla_kv_buffer_fp8_quant" python/sglang +rg -n "FusedMoeRouter|fused_topk_deepseek|moe_fused_gate|aiter_fused_topk|fused_rms_fp8_group_quant|fast_topk_transform_fused|fused_store_index_k_cache|fused_temperature_softmax|fused_softcap" python/sglang +rg -n "fused_qkvzba_split_reshape_cat|fused_gdn_gating|rms_norm_gated|layer_norm_gated|chunk_gated_delta_rule_fwd_kkt_solve_kernel|fused_recurrent_gated_delta_rule_update|fused_mamba_state_scatter_with_mask|_fused_gather_to_staging_kernel|_fused_scatter_from_staging_kernel" python/sglang +rg -n "single_batch_overlap|alt_stream|shared_expert|_comm_stream|scatter_stream|triton_mrope_fused|ring_attn|all_to_all_single|reorder_for_compute_comm_overlap|use_dual_stream" python/sglang +git log --all --format='%h %s' | rg -i 'fused|fusion|overlap|cutedsl|triton|cuda|rope|topk|quant|combine|allreduce|all_to_all' +rg -n 
"silu_and_mul|gelu_tanh_and_mul|gelu_and_mul|silu_and_mul_scaled_nvfp4_experts_quantize|rmsnorm_quant|fused_add_rmsnorm|fused_add_rmsnorm_quant|fused_rmsnorm_silu" "$FLASHINFER_REPO/flashinfer" +rg -n "AllReduceFusionPattern|allreduce_fusion|trigger_completion_at_end|rope_quantize_fp8|rope_quantize_fp8_append_paged_kv_cache|fused_topk_deepseek|cutlass_fused_moe|trtllm_.*_moe" "$FLASHINFER_REPO/flashinfer" +rg -n "aux_stream|use_async_memset|split_device_green_ctx|split_device_green_ctx_by_sm_count|enable_pdl|launch_with_pdl" "$FLASHINFER_REPO/flashinfer" "$FLASHINFER_REPO/include" +git -C "$FLASHINFER_REPO" log --all --format='%h %s' | rg -i 'fused|fusion|overlap|pdl|stream|rope|kv|quant|topk|moe' +rg -n "flashinfer_silu_and_mul|flashinfer_gelu_tanh_and_mul|flashinfer_rmsnorm|flashinfer_gemma_rmsnorm|flashinfer_fused_add_rmsnorm|flashinfer_apply_rope_with_cos_sin_cache_inplace|triton_fused_add_rms_norm_quant_fp8|fuse_rmsnorm_quant_fp8" "$TRTLLM_REPO/tensorrt_llm/_torch" +rg -n "flashinfer_attention_mha_with_cache|append_paged_kv_cache|flashinfer_mla|append_paged_mla_kv_cache|flashinfer_cached_ssm|selective_state_update|flashinfer.fused_moe" "$TRTLLM_REPO/tensorrt_llm/_torch" "$TRTLLM_REPO/docs/source" +rg -n "multi_stream_moe|multi_stream_mla_attn|multi_stream_gemm|record_event_passthrough|begin_aux_stream_passthrough|end_aux_stream_passthrough|wait_aux_stream_passthrough" "$TRTLLM_REPO/tensorrt_llm/_torch" +git -C "$TRTLLM_REPO" log --all --format='%h %s' | rg -i 'fused|fusion|overlap|flashinfer|mla|kv cache|multi-stream|stream|rope|rmsnorm|moe' +rg -n "fused_add_rms_norm|merge_attn_states|fused_qk_norm_rope|grouped_topk|topk_softmax|topk_sigmoid|dsv3_router_gemm|dsv3_fused_a_gemm|concat_and_cache_mla_rope_fused|gpt_oss_router_gemm|cutlass_scaled_mm|cpu_fused_moe|fused_moe_lora|triton_pos_embed_interpolate" "$VLLM_REPO/vllm" "$VLLM_REPO/csrc" +rg -n 
"fuse_allreduce_rms|fuse_norm_quant|fuse_act_quant|fuse_attn_quant|enable_qk_norm_rope_fusion|fuse_rope_kvcache|enable_sp|fuse_gemm_comms|RocmAiter|dcp_alltoall|shared_experts_stream|TRTLLM_ENABLE_PDL|wk_weights_proj" "$VLLM_REPO/vllm" "$VLLM_REPO/docs/design/fusions.md" "$VLLM_REPO/csrc" +git -C "$VLLM_REPO" log --all --format='%h %s' | rg -i 'fused|fusion|overlap|triton|cuda|rope|kv cache|topk|router|allreduce|reduce-scatter|all-gather|all_to_all|quant' +# GitHub PR scan terms for the connector or web UI: +# "fused OR overlap repo:sgl-project/sglang" +# "triton OR cutedsl OR cuda fused repo:sgl-project/sglang" +# "fused OR overlap repo:flashinfer-ai/flashinfer" +# "pdl OR aux_stream OR green_ctx repo:flashinfer-ai/flashinfer" +# "fused OR overlap repo:NVIDIA/TensorRT-LLM" +# "flashinfer OR mla OR moe OR rmsnorm repo:NVIDIA/TensorRT-LLM" +# "multi-stream OR aux_stream OR cudagraph repo:NVIDIA/TensorRT-LLM" +# "fused OR overlap repo:vllm-project/vllm" +# "triton OR cuda fused repo:vllm-project/vllm" +``` diff --git a/.claude/skills/llm-torch-profiler-analysis/references/heuristics.md b/.claude/skills/llm-torch-profiler-analysis/references/heuristics.md new file mode 100644 index 000000000000..e0f55ab4a630 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/references/heuristics.md @@ -0,0 +1,119 @@ +# Overlap Heuristics + +This analyzer is intentionally conservative. + +## What Comes From Which Trace + +### Mapping trace + +Used for: + +- `kernel -> cpu_op -> python scope` +- launch-site call chains + +This trace should be easier to read, even if it is not the exact final serving schedule. + +### Formal trace + +Used for: + +- hidden ratio +- exclusive ratio +- overlap headroom +- ASCII timelines + +This trace should reflect the real serving shape. 
+ +## What It Treats As Hidden + +A kernel is treated as hidden for a segment if: + +- it is active during that segment +- at least one kernel on a different stream is also active + +If the overlapping kernel is compute-like, the analyzer separately records that it is hidden under compute. + +## Category Heuristics + +The analyzer classifies kernels by name: + +- `compute`: GEMM, attention, cutlass, cublas, Triton matmul-like kernels +- `communication`: NCCL, all-reduce, reduce-scatter, all-gather, DeepEP dispatch/combine +- `elementwise`: sigmoid, top-k, gate, rmsnorm, layernorm, rope, casts +- `memory`: memcpy, memset, fill, copy +- `other`: everything else + +These categories are for prioritization only. + +## How To Read The Action Table + +The overlap-opportunity table is intentionally not a full kernel dump. + +It only keeps rows that already have an action-oriented label: + +- `headroom` +- `low-roi-hidden` + +It also prunes very small `headroom` rows after prioritization. + +- if a `headroom` row would end up as `P5` because it is below the default `1%` share bar, it is omitted from the table +- `low-roi-hidden` rows can still remain even when they are small, because they are useful as "do not chase this first" signals + +### `headroom` + +Interpretation: + +- the kernel still spends meaningful time exposed in the formal trace +- the mapped Python scope is a good place to inspect scheduling or fusion opportunities +- the dependency signal should still be checked before treating it as a serious overlap candidate + +### `low-roi-hidden` + +Interpretation: + +- the kernel is already mostly hidden by another stream +- optimizing it in isolation is less likely to move end-to-end latency +- focus on fusion, launch reduction, or the surrounding schedule instead + +## Dependency Signal + +The table includes a dependency-oriented adjacency signal from the formal trace. 
+ +It is built from the nearest previous and next kernels on the same stream plus the mapping-trace source attribution. + +Communication kernels are treated more conservatively than before: + +- if a tight adjacent kernel looks like a likely producer or consumer, the table will raise the dependency risk even when the Python scope names differ +- this avoids over-claiming that an all-reduce-like kernel is a clean overlap candidate just because its neighbors map to different functions + +Typical labels: + +- `serial risk low`: adjacent kernels do not look like a tight same-code serial chain +- `prev-side serial risk`: the previous adjacent kernel looks tightly tied to the same code path +- `next-side serial risk`: the next adjacent kernel looks tightly tied to the same code path +- `both-side serial risk`: both sides look like a tight serial chain +- `adjacency unclear`: the timing is tight but source attribution is too weak to trust a stronger claim + +Treat this as a strong heuristic, not proof of dataflow. + +The readable table compresses those into shorter labels: + +- `low` +- `high` +- `unclear` + +The recommendation labels are also intentionally short: + +- `try overlap` +- `try fusion` +- `check deps` +- `skip overlap` +- `manual check` +- `observe later` + +## Important Limits + +- A trace shows what overlapped, not what could legally overlap. +- Two kernels on different streams do not prove they are dependency-free. +- A mapped Python scope is a launch-site clue, not the only relevant code location. +- A hidden kernel can still matter if it changes occupancy, launch count, or surrounding schedule. 
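
The hidden-segment rule from "What It Treats As Hidden" can be sketched as a small interval sweep. This is an illustrative sketch only, not the analyzer's actual code: the `Kernel` record, the `classify` helper, the `hidden_time` function, and the keyword lists are all hypothetical names chosen for this example, and the real name-matching rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical trace record; times are in microseconds."""
    name: str
    stream: int
    start: float
    end: float


# Assumed keyword lists, loosely based on the category descriptions above.
# Order matters: the first category with a matching keyword wins.
CATEGORY_KEYWORDS = (
    ("communication", ("nccl", "all_reduce", "allreduce", "reduce_scatter", "all_gather", "deepep")),
    ("compute", ("gemm", "attention", "cutlass", "cublas", "matmul")),
    ("memory", ("memcpy", "memset", "fill", "copy")),
    ("elementwise", ("sigmoid", "topk", "gate", "rmsnorm", "layernorm", "rope", "cast")),
)


def classify(name):
    """Classify a kernel by substring match on its lowercased name."""
    lowered = name.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


def hidden_time(target, kernels):
    """Return (hidden, hidden_under_compute) durations for `target`.

    A segment of `target` counts as hidden when at least one kernel on a
    different stream is active for that whole segment.
    """
    others = [
        k for k in kernels
        if k.stream != target.stream and k.end > target.start and k.start < target.end
    ]
    # Segment boundaries: the target's bounds plus every clipped start/end
    # of a kernel that overlaps it on another stream.
    cuts = {target.start, target.end}
    for k in others:
        cuts.add(max(k.start, target.start))
        cuts.add(min(k.end, target.end))
    points = sorted(cuts)

    hidden = 0.0
    under_compute = 0.0
    for lo, hi in zip(points, points[1:]):
        active = [k for k in others if k.start <= lo and k.end >= hi]
        if active:
            hidden += hi - lo
            if any(classify(k.name) == "compute" for k in active):
                under_compute += hi - lo
    return hidden, under_compute
```

For example, an all-reduce on stream 1 spanning 0 to 10, overlapped by a GEMM on stream 0 from 2 to 6 and an RMSNorm on stream 2 from 5 to 8, yields 6 time units hidden, of which 4 are hidden under compute. As the limits above note, such a number only describes what overlapped in the trace; it says nothing about whether the kernels are dependency-free.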
diff --git a/.claude/skills/llm-torch-profiler-analysis/references/overlap-catalog.md b/.claude/skills/llm-torch-profiler-analysis/references/overlap-catalog.md new file mode 100644 index 000000000000..5a38cb204e74 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/references/overlap-catalog.md @@ -0,0 +1,180 @@ +# Overlap Catalog + +This catalog is the overlap-only companion to +`references/fuse-overlap-catalog.md`. + +This revision is intentionally kernel-scoped. Keep rows here only when the +overlap is visible in a profiler as GPU kernels, collective kernels, or +streamed kernel families. Host-only scheduler, event-loop, executor, offload, +and load-path overlaps are intentionally excluded. + +Use it like this: + +1. Start from the `overlap-opportunity table`. +2. Match visible kernel windows, collective windows, or stream-level overlap + against the rows below. +3. If a match exists in the mainline sections, report it as an existing + overlap family that is missing, disabled, regressed, or unsupported on the + current backend. +4. If a match exists only in the `PR-backed / in-flight` + section, report it as an upstream overlap pattern, not a novel idea. +5. Only call an overlap opportunity "new" when no row in this file or + `fuse-overlap-catalog.md` fits. + +The `vLLM-origin` sections below are comparative references. They are not +necessarily present in the checked-out `sglang` tree, but they should still be +treated as upstream or analogous kernel-overlap families before labeling an +overlap opportunity as novel. + +Refresh note `2026-04-22`: rescanned current `sglang`, `flashinfer`, +`TensorRT-LLM`, and `vllm` mainline overlap paths plus rechecked referenced PR +state via the GitHub API on `2026-04-22`. Closed-unmerged SGLang +[#22410](https://github.com/sgl-project/sglang/pull/22410) and FlashInfer +[#2840](https://github.com/flashinfer-ai/flashinfer/pull/2840) were removed +from the PR-backed sections. 
SGLang +[#21877](https://github.com/sgl-project/sglang/pull/21877), FlashInfer +[#2720](https://github.com/flashinfer-ai/flashinfer/pull/2720), and vLLM +[#35968](https://github.com/vllm-project/vllm/pull/35968) / +[#39301](https://github.com/vllm-project/vllm/pull/39301) remain useful +upstream overlap references as of this refresh. + +## 1. LLM / SRT kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| Single-batch overlap (SBO) | MoE combine, down-gemm, shared-expert work in nearby two-stream windows | `python/sglang/srt/batch_overlap/single_batch_overlap.py` | combine vs down-gemm overlap, combine vs shared-expert overlap, one-stream dispatch+shared overlap, explicit SM partitioning and events | If exposed MoE combine sits near neighboring compute, classify it against SBO before calling it new overlap. | +| Q and K normalization on different streams | Q-side norm and K-side norm on different streams | `python/sglang/srt/models/utils.py::apply_qk_norm`
`python/sglang/srt/models/qwen3.py`
`python/sglang/srt/models/qwen3_next.py`
`python/sglang/srt/models/qwen3_5.py` | Q stays on current stream, K can run on `alt_stream` in capture mode | Treat split Q / K norm as an existing overlap family when `alt_stream` is already wired. | +| DeepSeek shared-expert / routed-expert overlap | shared-expert GEMMs near DeepEP dispatch / combine | `python/sglang/srt/models/deepseek_v2.py`
`python/sglang/srt/batch_overlap/single_batch_overlap.py` | shared experts on `alt_stream`, overlap with dispatch / combine and down-gemm, Blackwell-specific env gating | This is an established routed-vs-shared branch overlap pattern, not a novel idea. | +| Llama4 shared branch vs routed branch overlap | shared expert branch plus routed MoE branch as adjacent windows | `python/sglang/srt/models/llama4.py` | shared expert on current stream, router + topk + routed experts on `alt_stream` | Use Llama4 as the first precedent for branch-level overlap in similar sparse models. | +| ExaoneMoE shared experts vs router experts overlap | shared expert output and router-expert output form a two-branch window | `python/sglang/srt/models/exaone_moe.py::forward_normal_dual_stream` | shared experts on current stream, router + routed experts on `alt_stream`, explicit join before combine | This is an existing dual-stream MoE overlap family. | +| Grok residual-MoE branch overlap | dense MLP and block-sparse MoE branches in parallel | `python/sglang/srt/models/grok.py::moe_with_rmoe` | dense MLP on current stream, MoE on `alt_stream`, fused dual residual RMSNorm around boundaries | Treat exposed Grok branch overlap as an existing pattern. | +| NSA dual-stream overlap | Q-proj, K-proj, RoPE, cache-store, quantization in tight two-stream windows | `python/sglang/srt/layers/attention/nsa/nsa_indexer.py` | Q / K projection split, RoPE split, cache-store vs quantization overlap | NSA already contains several dual-stream overlap precedents. | +| MoriEP async dispatch / combine comm stream | `MoriEP`
`_comm_stream`
`dispatch`
`combine`
`done_event` | `python/sglang/srt/layers/moe/token_dispatcher/moriep.py` | MoriEP can submit dispatch and combine onto a dedicated communication stream and synchronize only through events | Treat MoriEP comm / compute interleave as an existing MoE overlap family. | +| Generic `alt_stream` overlap families | `alt_stream` plus explicit `wait_stream` / `with torch.cuda.stream(...)` | `qwen2_moe.py`
`qwen3_moe.py`
`glm4_moe.py`
`bailing_moe.py`
`llada2.py`
`grok.py`
`olmo2.py`
`step3p5.py`
`longcat_flash.py`
`falcon_h1.py` | model-specific overlap on attention prep, MoE branches, or cache-store | Search these families before designing a new overlap scheme from scratch. | + +## 2. Staging / communication kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| Decode scatter on dedicated `scatter_stream` | `scatter_stream`
`_scatter_stream` | `python/sglang/srt/disaggregation/common/staging_handler.py` | staging scatter kernels are submitted to a dedicated stream so the decode thread does not block on the main forward stream | Treat decode-side staging scatter windows as an existing overlap pattern. | +| Staging-buffer fused gather / scatter kernels | `_fused_gather_to_staging_kernel`
`_fused_scatter_from_staging_kernel` | `python/sglang/srt/disaggregation/common/staging_buffer.py` | Triton kernels gather KV slices into contiguous staging memory and scatter them back to KV cache | If heterogeneous-TP staging shows many small copy kernels, compare against this existing fused-plus-overlap family first. | + +## 3. VLM / diffusion kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| Vision QK norm with aux stream | vision-side QK norm or norm-like kernels before attention | `python/sglang/srt/layers/attention/vision.py` | vision QK normalization can call shared `apply_qk_norm(...)`, with K-side work on `aux_stream` | If vision QK prep is split, first check this existing aux-stream path. | +| ViT CUDA graph disables vision aux stream | expected vision overlap is absent under ViT graph | `python/sglang/srt/models/internvl.py`
`python/sglang/srt/layers/attention/vision.py`
`python/sglang/srt/environ.py::SGLANG_VIT_ENABLE_CUDA_GRAPH` | vision `aux_stream` is intentionally disabled when ViT CUDA graph is on | Missing vision overlap may be intentional, not a regression. | +| Ulysses sequence-parallel attention | exposed `all_to_all` around attention blocks | `python/sglang/multimodal_gen/runtime/layers/attention/layer.py`
`python/sglang/multimodal_gen/runtime/distributed/communication_op.py` | head / sequence redistribution before and after attention | Treat sequence-parallel all-to-all as an existing distributed attention family. | +| USP attention with all-to-all and ring attention | `all_to_all`, ring-attention comm, head / sequence reshards | `python/sglang/multimodal_gen/runtime/layers/attention/layer.py` | `_usp_input_all_to_all(...)`, `_usp_output_all_to_all(...)`, `ring_attn(...)` | This is the primary existing overlap / comm family for many diffusion models. | +| Turbo-layer async all-to-all pipelining | pipelined A2A windows with explicit waits on a comm stream | `python/sglang/multimodal_gen/runtime/layers/attention/turbo_layer.py` | looped `all_to_all_single(..., async_op=True)` plus staged postprocess on a comm stream | Treat exposed turbo A2A windows as an existing pipelined overlap pattern. | +| TorchInductor compute / communication reorder | compiled traces with compute and comm partially interleaved | `python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py`
`python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages/mova.py` | `torch._inductor.config.reorder_for_compute_comm_overlap = True` | Existing compile-time reordering may already explain partial overlap in diffusion traces. | +| Dual-stream diffusion models | two nearby compute branches inside one DiT / UNet block | `python/sglang/multimodal_gen/runtime/models/dits/hunyuan3d.py` | `use_dual_stream = True` | Treat dual-branch diffusion execution as an existing overlap family. | + +## 4. PR-backed / in-flight kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| PR `#21877` fused down-GEMM + combine superseding SBO | `enable_fused_grouped_gemm_combine`
`combine`
`down_gemm` | `PR #21877`
`python/sglang/srt/server_args.py`
`python/sglang/srt/layers/moe/token_dispatcher/deepep.py` | Fused combine eliminates the standalone combine window, so SBO is intentionally disabled when this path is on | If the trace discussion is about combine overlap, first classify it as this upstream fused-overlap family. | + +## 5. FlashInfer kernel-overlap families + +These rows are comparative references from `flashinfer`. Use them when a trace +looks like an upstream FlashInfer overlap family even if the current `sglang` +checkout only calls part of that implementation. + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| FlashInfer PDL launch-overlap family | `enable_pdl`
`launch_with_pdl`
`cudaGridDependencySynchronize`
`cudaTriggerProgrammaticLaunchCompletion`
`trigger_completion_at_end=False`
`allreduce_fusion` | `flashinfer/norm/__init__.py`
`flashinfer/activation.py`
`flashinfer/rope.py`
`flashinfer/comm/allreduce.py`
`flashinfer/comm/trtllm_ar.py` | FlashInfer uses Programmatic Dependent Launch broadly, and the allreduce path can further advance completion so the next PDL-aware kernel overlaps on the same stream | Treat tight same-stream dependent windows and allreduce-followed-by-kernel windows as one existing FlashInfer launch-overlap family first. | +| FlashInfer CuTeDSL MoE aux-stream async-memset overlap | `aux_stream`
`main_event`
`memset_event`
`use_async_memset` | `flashinfer/fused_moe/cute_dsl/fused_moe.py` | Preallocated MoE output is zeroed on an auxiliary CUDA stream while GEMM1 runs on the main stream, then both streams join before finalize | Treat GEMM1 vs output-zero windows as an existing FlashInfer multi-stream overlap family. | +| FlashInfer green-context SM partition overlap | `split_device_green_ctx`
`split_device_green_ctx_by_sm_count`
`green_ctx` | `flashinfer/green_ctx.py` | CUDA green contexts partition SMs and create dedicated streams for concurrent kernel families on separate SM slices | Treat SM-partitioned concurrency as an existing FlashInfer overlap mechanism, not a novel scheduler idea. | + +## 6. FlashInfer PR-backed / in-flight kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| PR `#2720` PDL runtime-API migration | `cudaGridDependencySynchronize`
`cudaTriggerProgrammaticLaunchCompletion`
`inline PTX` | `PR #2720`
`include/flashinfer/comm/trtllm_allreduce_fusion.cuh`
`include/flashinfer/pos_enc.cuh` | Repo-wide migration preserves the existing PDL overlap family while replacing inline PTX with CUDA runtime APIs across norm, RoPE, attention, and MoE codepaths | Treat PDL-looking launch groups as an upstream FlashInfer overlap family even when implementation details differ across revisions. | + +## 7. TensorRT-LLM-origin kernel-overlap families + +These rows are comparative references from `TensorRT-LLM`. Current mainline +TensorRT-LLM overlap rows are mostly explicit auxiliary-stream rewrites in +AutoDeploy rather than same-stream PDL windows. + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| TensorRT-LLM multi-stream MLA attention | `multi_stream_mla_attn`
`record_event_passthrough`
`_aux`
`wait_event` | `tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_attn.py`
`tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py` | AutoDeploy rewrites MLA Q/KV forks so the KV projection runs on an auxiliary stream while the Q path stays on the caller stream | Treat exposed Q-branch vs KV-branch overlap as an existing TensorRT-LLM multi-stream family first. | +| TensorRT-LLM multi-stream MoE shared-vs-routed overlap | `multi_stream_moe`
`begin_aux_stream_passthrough`
`end_aux_stream_passthrough`
`wait_aux_stream_passthrough`
`mlir_elementwise_fusion`
`piecewise cudagraph`
`caller_stream.synchronize()` | `tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_moe.py`
`tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py` | Shared-expert work is moved to an auxiliary stream while routed-expert MoE work remains on the main stream and rejoins at the merge node; the same family includes synchronization rules for MLIR-fused kernels and piecewise cudagraph replay | Treat shared-expert vs routed-expert windows, including altered behavior under MLIR / piecewise graph modes, as an existing TensorRT-LLM branch-overlap family. | +| TensorRT-LLM multi-stream FP8 GEMM fork parallelism | `multi_stream_gemm`
`trtllm_finegrained_fp8_linear`
`record_event_passthrough`
`_aux` | `tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_gemm.py`
`tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py` | Compiler pass identifies fork points with multiple FP8 linears and moves the largest GEMM to the auxiliary stream so sibling GEMMs overlap | Treat sibling FP8 linear branches as an existing TensorRT-LLM overlap family before designing a new stream split. | + +## 8. vLLM-origin kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| vLLM-origin AsyncTP GEMM + collective overlap | `fuse_gemm_comms`
`fused_matmul_reduce_scatter`
`fused_all_gather_matmul` | `vllm/compilation/passes/fusion/collective_fusion.py`
`docs/design/fusions.md` | AsyncTP overlaps GEMM with reduce-scatter / all-gather via symmetric-memory collectives | Treat GEMM+comm windows as a clear vLLM-origin overlap precedent first. | +| vLLM-origin Sequence Parallelism staging | `enable_sp`
`ReduceScatter`
`AllGather`
`SequenceParallelismPass` | `vllm/compilation/passes/fusion/sequence_parallelism.py`
`docs/design/fusions.md` | Sequence-parallel rewrites all-reduce into RS -> local norm -> AG so later passes can overlap comm and compute | Treat RS / AG staging around norm blocks as an upstream overlap-enabling family. | +| vLLM-origin shared-expert aux-stream overlap | `aux_stream`
`shared_experts_stream`
shared expert near router | `vllm/model_executor/layers/fused_moe/runner/shared_experts.py`
`vllm/model_executor/layers/fused_moe/runner/moe_runner_base.py` | MoE shared experts can record the cloned input on `shared_experts_stream`, wait on the caller stream, run in parallel with router-side work, and rejoin before merge | Treat shared-expert vs router overlap as an existing upstream sparse-model family. | +| vLLM-origin DCP async all-to-all overlap | `dcp_alltoall`
`all_to_all_single`
`async_op=True` | `vllm/v1/attention/ops/dcp_alltoall.py` | Output / LSE exchange uses async all-to-all handles instead of serializing collective completion on the main path | Treat DCP all-to-all windows as an upstream async-collective family. | + +## 9. vLLM-origin PR-backed / in-flight kernel-overlap families + +| Pattern | Trace keywords | Primary code | Existing path | Skill should conclude | +| --- | --- | --- | --- | --- | +| PR `#35968` DSV3.2 multi-stream indexer overlap | `weights_proj`
`wk`
`k_norm`
`aux_stream` | `PR #35968`
`vllm/model_executor/models/deepseek_v2.py`
`vllm/utils/torch_utils.py` | Closed PR explored overlapping the small `weights_proj` GEMM with `wk + k_norm` on a secondary CUDA stream for decode batches instead of serializing both on the default stream | Treat this as a concrete upstream decode-time kernel-overlap family when traces show underutilized projection overlap opportunities. | +| PR `#39301` GLM5 router GEMM with PDL overlap | `TRTLLM_ENABLE_PDL`
`router_gemm`
`GLM5`
`FI AR RMS fusion` | `PR #39301`
`vllm/model_executor/layers/fused_moe/router/gate_linear.py`
`vllm/csrc/moe/dsv3_router_gemm_utils.h` | The GLM5 router GEMM path explicitly uses PDL so the router kernel can overlap with the preceding fused allreduce-plus-RMS block on supported GPUs | Treat router-GEMM launch overlap on GLM5-like traces as an in-flight upstream family first. | + +## 10. Important toggles and caveats + +| Toggle / env | Location | Effect on trace interpretation | +| --- | --- | --- | +| `enable_single_batch_overlap` | `python/sglang/srt/server_args.py` | Enables the SBO family. | +| `SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO` | `python/sglang/srt/environ.py` | Alters how DeepSeek-style shared-expert overlap behaves on Blackwell. | +| `SGLANG_DISAGG_STAGING_BUFFER` | `python/sglang/srt/environ.py` | Enables the heterogeneous-TP staging-buffer family and its overlap windows. | +| `SGLANG_STAGING_USE_TORCH` | `python/sglang/srt/disaggregation/common/staging_buffer.py` | Forces torch fallback for staging gather / scatter, so Triton staging kernels may disappear by design. | +| `SGLANG_VIT_ENABLE_CUDA_GRAPH` | `python/sglang/srt/environ.py` | Can intentionally disable vision `aux_stream` overlap. | +| `enable_pdl` / `launch_with_pdl` | `flashinfer/norm/__init__.py`
`flashinfer/activation.py`
`flashinfer/rope.py`
`flashinfer/fused_moe/core.py`
`flashinfer/comm/allreduce.py` | Enables FlashInfer PDL across many kernels; launch grouping and same-stream overlap can change substantially when it is on. | +| `trigger_completion_at_end` | `flashinfer/comm/allreduce.py` | `False` enables downstream PDL-aware overlap after FlashInfer allreduce fusion; `True` delays completion to kernel end and removes that overlap window. | +| `use_cuda_graph` | `flashinfer/fused_moe/cute_dsl/fused_moe.py` | Enables the preallocated-buffer path and the safe aux-stream async-memset overlap in FlashInfer CuTeDSL MoE. | +| `split_device_green_ctx*` | `flashinfer/green_ctx.py` | Changes trace shape by partitioning SMs into separate green contexts instead of overlapping full-device streams on the default context. | +| `multi_stream_moe` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Enables the TensorRT-LLM shared-expert vs routed-expert overlap family. | +| `multi_stream_mla_attn` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Enables the TensorRT-LLM MLA Q-vs-KV branch overlap family. | +| `multi_stream_gemm` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Enables generalized FP8 GEMM fork overlap in TensorRT-LLM AutoDeploy. | +| `mlir_elementwise_fusion` | `tensorrt_llm/_torch/auto_deploy/config/default.yaml` | Can absorb merge adds into larger fused kernels, so missing explicit merge nodes in TensorRT-LLM multi-stream traces may be intentional. | +| `enable_torch_compile` | `python/sglang/srt/server_args.py`
`python/sglang/multimodal_gen/runtime/server_args.py` | Compiler-generated reordering can hide or rename overlap windows. | +| `enable_fused_grouped_gemm_combine` | `PR #21877` | In-flight path that intentionally disables SBO because combine is folded into down-GEMM. | +| `PassConfig.enable_sp` | `vllm/config/compilation.py` | Enables vLLM's sequence-parallel staging family that creates RS / AG overlap opportunities. | +| `PassConfig.fuse_gemm_comms` | `vllm/config/compilation.py` | Enables AsyncTP GEMM + collective overlap and auto-enables `enable_sp` when valid. | + +## 11. Suggested refresh commands + +These commands are only for maintainers refreshing this catalog by rescanning +the local source trees. They are not used by the triage scripts at runtime. + +```bash +# Optional sibling checkouts used for comparative scanning: +FLASHINFER_REPO=${FLASHINFER_REPO:-../flashinfer} +TRTLLM_REPO=${TRTLLM_REPO:-../TensorRT-LLM} +VLLM_REPO=${VLLM_REPO:-../vllm} + +rg -n "single_batch_overlap|alt_stream|shared_expert|scatter_stream|_fused_gather_to_staging_kernel|_fused_scatter_from_staging_kernel|async_op=True" python/sglang +rg -n "apply_qk_norm|vision.py|ring_attn|all_to_all_single|reorder_for_compute_comm_overlap|use_dual_stream" python/sglang/multimodal_gen python/sglang/srt +git log --all --format='%h %s' | rg -i 'fused|fusion|overlap|combine|all_to_all|ring attn|stream|triton|cutedsl|cuda' +rg -n "enable_pdl|launch_with_pdl|trigger_completion_at_end|aux_stream|use_async_memset|split_device_green_ctx|split_device_green_ctx_by_sm_count" "$FLASHINFER_REPO/flashinfer" "$FLASHINFER_REPO/include" +git -C "$FLASHINFER_REPO" log --all --format='%h %s' | rg -i 'fused|fusion|overlap|pdl|stream|rope|kv|quant|topk|moe' +rg -n "multi_stream_moe|multi_stream_mla_attn|multi_stream_gemm|record_event_passthrough|begin_aux_stream_passthrough|end_aux_stream_passthrough|wait_aux_stream_passthrough" "$TRTLLM_REPO/tensorrt_llm/_torch" +rg -n 
"mlir_elementwise_fusion|piecewise|cudagraph|caller_stream.synchronize" "$TRTLLM_REPO/tensorrt_llm/_torch" +git -C "$TRTLLM_REPO" log --all --format='%h %s' | rg -i 'overlap|multi-stream|aux stream|cudagraph|mlir|stream|flashinfer|moe|mla' +rg -n "fuse_gemm_comms|enable_sp|fused_matmul_reduce_scatter|fused_all_gather_matmul|shared_experts_stream|maybe_sync_shared_experts_stream|dcp_alltoall|async_op=True|aux_stream|maybe_execute_in_parallel" "$VLLM_REPO/vllm" "$VLLM_REPO/docs/design/fusions.md" +git -C "$VLLM_REPO" log --all --format='%h %s' | rg -i 'fused|fusion|overlap|allreduce|reduce-scatter|all-gather|all_to_all|stream|multi-stream|triton|cuda|router' +# GitHub PR scan terms for the connector or web UI: +# "fused OR overlap repo:sgl-project/sglang" +# "triton OR cutedsl OR cuda overlap repo:sgl-project/sglang" +# "fused OR overlap repo:flashinfer-ai/flashinfer" +# "pdl OR aux_stream OR green_ctx repo:flashinfer-ai/flashinfer" +# "fused OR overlap repo:NVIDIA/TensorRT-LLM" +# "multi-stream OR aux_stream OR cudagraph repo:NVIDIA/TensorRT-LLM" +# "mlir OR piecewise OR flashinfer repo:NVIDIA/TensorRT-LLM" +# "fused OR overlap repo:vllm-project/vllm" +# "triton OR cuda overlap repo:vllm-project/vllm" +# "multi-stream OR aux_stream overlap repo:vllm-project/vllm" +``` diff --git a/.claude/skills/llm-torch-profiler-analysis/references/source-map.md b/.claude/skills/llm-torch-profiler-analysis/references/source-map.md new file mode 100644 index 000000000000..f460a45faf4b --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/references/source-map.md @@ -0,0 +1,42 @@ +# Source Map + +Use these upstream files when the workflow or behavior needs to be justified from SGLang source. 
+ +## Profiler entrypoints + +- `python/sglang/profiler.py` + - live profiler CLI + - writes `server_args.json` + - forwards `num_steps`, `profile_by_stage`, `merge_profiles`, and `profile_prefix` + +- `python/sglang/test/send_one.py` + - minimal request path that can trigger profiling from a single command + +- `python/sglang/bench_serving.py` + - profile-capable serving benchmark path + - forwards `profile_activities`, `profile_by_stage`, `profile_stages`, and `profile_prefix` + +## Scheduler-side trace writing + +- `python/sglang/srt/managers/scheduler_profiler_mixin.py` + - actual trace start/stop behavior + - filename pattern for `TP/DP/PP/EP` and optional stage suffixes + - `CUDA_PROFILER` and torch profiler handling + +- `python/sglang/srt/utils/profile_merger.py` + - merged distributed trace behavior + - why merged traces should be treated differently from rank-local traces + +- `python/sglang/srt/utils/profile_utils.py` + - newer profile v2 manager path used for stage-based traces + +## Documentation and tests + +- `docs/developer_guide/benchmark_and_profiling.md` + - canonical profiling docs + +- `test/registered/profiling/test_start_profile.py` + - validates `/start_profile` behavior, including `CUDA_PROFILER` + +- `test/registered/profiling/test_profile_v2.py` + - validates stage-scoped trace outputs under `SGLANG_PROFILE_V2` diff --git a/.claude/skills/llm-torch-profiler-analysis/references/vllm-torch-compile-fusions.md b/.claude/skills/llm-torch-profiler-analysis/references/vllm-torch-compile-fusions.md new file mode 100644 index 000000000000..e9f3109f57a2 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/references/vllm-torch-compile-fusions.md @@ -0,0 +1,66 @@ +# vLLM Torch Compile Fusion Patterns + +Refresh: `2026-05-01`. +Source tree: vLLM `origin/main` at `7075df79b`. 
+ +Use this file when the fuse-pattern table reports split kernels in a trace and +you need to decide whether the shape is already covered by vLLM's +`torch.compile` pattern matcher. Treat every row here as an upstream precedent +before calling a similar SGLang opportunity novel. + +## Pass Registration + +vLLM registers these passes from +`vllm/compilation/passes/pass_manager.py` through `PassConfig`. + +| Toggle | Pass | Target shape | +| --- | --- | --- | +| `enable_sp` | `SequenceParallelismPass` | all-reduce around residual/norm blocks becomes reduce-scatter, local work, and all-gather | +| `fuse_gemm_comms` | `AsyncTPPass` | GEMM plus reduce-scatter / all-gather overlap through symmetric-memory collectives | +| `fuse_allreduce_rms` | `AllReduceFusionPass` | all-reduce followed by RMSNorm, optional residual add, optional FP8 / NVFP4 quant | +| `fuse_minimax_qk_norm` | `MiniMaxQKNormPass` | MiniMax Q/K all-reduce plus RMSNorm decode path | +| `fuse_norm_quant` | `RMSNormQuantFusionPass` | RMSNorm or fused-add-RMSNorm followed by FP8 / FP4 quant | +| `fuse_norm_quant` + AITER | `RocmAiterRMSNormQuantFusionPass` | ROCm AITER RMSNorm / fused-add-RMSNorm followed by AITER or vLLM quant | +| `fuse_act_quant` | `ActivationQuantFusionPass` | SiLU-and-mul followed by FP8 / NVFP4 / block quant | +| `fuse_act_quant` + AITER | `RocmAiterSiluMulFp8GroupQuantFusionPass` | AITER SiLU-and-mul followed by FP8 group quant | +| `fuse_act_padding` + AITER | `RocmAiterTritonAddRMSNormPadFusionPass` | AITER fused-add-RMSNorm followed by padding into the next layout | +| `fuse_mla_dual_rms_norm` + AITER | `MLADualRMSNormFusionPass` | MLA paired Q and KV RMSNorms become `fused_mla_dual_rms_norm` | +| `fuse_rope_kvcache` | `RopeKVCacheFusionPass` | RoPE plus paged KV-cache update, after split cleanup passes | +| `fuse_attn_quant` | `AttnQuantFusionPass` | attention output followed by FP8 / NVFP4 quant | +| `fuse_attn_quant` | `MLAAttnQuantFusionPass` | MLA attention output followed by 
FP8 / NVFP4 / FP8 group quant | +| `enable_qk_norm_rope_fusion` | `QKNormRoPEFusionPass` | Q/K RMSNorm plus RoPE on packed QKV tensors | + +## Pattern Inventory + +| Source file | Pattern classes | Trace clue | Replacement | +| --- | --- | --- | --- | +| `fusion/allreduce_rms_fusion.py` | `AllReduceRMSNormPattern`, `AllReduceFusedAddRMSNormPattern`, `AllReduceFusedRMSNormStaticQuantFP8Pattern`, `AllReduceFusedAddRMSNormStaticQuantFP8Pattern`, `AllReduceFusedRMSNormStaticQuantNVFP4Pattern`, `AllReduceFusedAddRMSNormStaticQuantNVFP4Pattern` | TP all-reduce directly before RMSNorm, residual-add RMSNorm, or quant | `flashinfer_trtllm_fused_allreduce_norm` with FlashInfer allreduce fusion pattern codes | +| `fusion/rms_quant_fusion.py` | `RMSNormStaticQuantPattern`, `FusedAddRMSNormStaticQuantPattern`, `RMSNormDynamicQuantPattern`, `FusedAddRMSNormDynamicQuantPattern`, `RMSNormGroupQuantPattern`, `FusedAddRMSNormGroupQuantPattern` | RMSNorm or fused-add-RMSNorm followed by static FP8, dynamic per-token FP8, FP8 group quant, or NVFP4 quant | `_C.rms_norm_*_quant`, `_C.fused_add_rms_norm_*_quant`, or per-block quant custom op | +| `fusion/rocm_aiter_fusion.py` | `AiterRMSNormDynamicQuantPattern`, `AiterFusedAddRMSNormDynamicQuantPattern`, `AiterRMSFp8GroupQuantPattern`, `AiterFusedAddRMSFp8GroupQuantPattern` | AITER RMSNorm/fused-add-RMSNorm followed by AITER or vLLM FP8 quant | AITER fused RMSNorm-quant custom ops | +| `fusion/act_quant_fusion.py` | `SiluMulFp8StaticQuantPattern`, `SiluMulNvfp4QuantPattern`, `SiluMulBlockQuantPattern` | SiLU-and-mul activation output immediately quantized | fused activation-plus-quant custom op | +| `fusion/rocm_aiter_fusion.py` | `AiterSiluMulFp8GroupQuantPattern` | AITER SiLU-and-mul followed by FP8 group quant | AITER `act_mul_fused_fp8_group_quant` | +| `fusion/rocm_aiter_fusion.py` | `AddAiterRMSNormPadPattern` | AITER fused-add-RMSNorm output padded before the next op | AITER add-RMSNorm-pad op | +| `fusion/rocm_aiter_fusion.py` | 
`MLADualRMSNormPattern` | MLA Q branch and KV branch each run RMSNorm | `torch.ops.vllm.fused_mla_dual_rms_norm` backed by AITER fused QK RMSNorm | +| `fusion/qk_norm_rope_fusion.py` | `QkNormRopePattern` | Q/K RMSNorm, split/getitem reshapes, then RoPE | `_C.fused_qk_norm_rope` | +| `fusion/rope_kvcache_fusion.py` | `RopeReshapeKVCachePattern` | RoPE output followed by reshape/cache update | `vllm.fused_rope_and_unified_kv_cache_update` | +| `fusion/attn_quant_fusion.py` | `AttnFp8StaticQuantPattern`, `AttnNvfp4QuantPattern` | attention output followed by FP8 static quant or NVFP4 quant | backend attention op with fused output quant when supported | +| `fusion/mla_attn_quant_fusion.py` | `MLAAttnFp8StaticQuantPattern`, `MLAAttnNvfp4QuantPattern`, `MLAAttnFp8GroupQuantPattern` | MLA attention output followed by static FP8, NVFP4, or FP8 group quant | MLA attention op with fused output quant when supported | +| `fusion/minimax_qk_norm_fusion.py` | `MiniMaxQKNormPattern` | MiniMax `forward_qk`: Q/K variance all-reduce divided by TP world size, then RMS apply | `vllm.minimax_qk_norm_fused` / Lamport fused kernel | +| `fusion/sequence_parallelism.py` | `FirstAllReduceRMSNormPattern`, `MiddleAllReduceRMSNormPattern`, `FirstAllReduceRMSNormStaticFP8Pattern`, `MiddleAllReduceRMSNormStaticFP8Pattern` | all-reduce plus norm block in a full-graph TP model | sequence-parallel reduce-scatter, local norm, all-gather staging | +| `fusion/collective_fusion.py` | `GEMMReduceScatterPattern`, `AllGatherGEMMPattern`, `ScaledMMReduceScatterPattern`, `AllGatherScaledMMPattern`, `CutlassScaledMMReduceScatterPattern`, `AllGatherCutlassScaledMMPattern`, `FlashInferBMMFP8ReduceScatterPattern`, `FlashInferAllGatherBMMFP8Pattern` | matmul / scaled-mm / FlashInfer BMM adjacent to TP collectives | symmetric-memory fused matmul+reduce-scatter or all-gather+matmul | + +## Triage Rules + +- If the trace shows split norm/add/quant, compare first against + `RMSNormQuantFusionPass`, AITER variants, 
and `AllReduceFusionPass`. +- If the trace shows attention output followed by quant kernels, compare against + `AttnQuantFusionPass` or `MLAAttnQuantFusionPass`, not only handwritten + attention kernels. +- If the trace shows Q/K norm followed by RoPE or cache update, compare both + `QKNormRoPEFusionPass` and `RopeKVCacheFusionPass`; they are separate passes. +- If the trace is a TP decode trace with visible collectives, check whether + `enable_sp` and `fuse_gemm_comms` would transform the same region into + sequence-parallel or AsyncTP overlap. +- A missing vLLM compile fusion may be intentional when the graph range, backend + support check, dtype, token count, or AITER / FlashInfer availability does not + satisfy the pass-specific guard. diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/analyze_llm_torch_profile.py b/.claude/skills/llm-torch-profiler-analysis/scripts/analyze_llm_torch_profile.py new file mode 100644 index 000000000000..c432e4d308f7 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/analyze_llm_torch_profile.py @@ -0,0 +1,858 @@ +"""Compact triage entrypoint for unified LLM torch-profiler analysis.""" + +from __future__ import annotations + +import argparse +import sys +from collections import defaultdict +from pathlib import Path +from typing import Dict, List, Optional, Sequence, Tuple + +import triage_kernel_helpers as kernel_helpers +import triage_overlap_helpers as overlap_helpers +from profile_common import ( + DEFAULT_DECODE_INPUT_LEN, + DEFAULT_DECODE_OUTPUT_LEN, + DEFAULT_PREFILL_INPUT_LEN, + DEFAULT_PREFILL_OUTPUT_LEN, + DEFAULT_WARMUP_STEPS, + PROFILE_WORKLOAD_CHOICES, + discover_trace_targets, + framework_display_name, + load_server_args, + load_trace_json, + parse_stage, + resolve_framework, + run_profiler, +) + +MIN_RENDER_SHARE_PCT = 1.0 +MAPPING_KERNEL_SAMPLE_LIMIT_PER_NAME = 16 + + +def build_triage_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser( + 
prog="analyze_llm_torch_profile.py", + description=( + "Compact LLM torch-profiler triage entrypoint for SGLang, vLLM, and " + "TensorRT-LLM. " + "This prints three tables: kernel mapping, overlap opportunities, " + "and fuse opportunities. " + "Use either a single trace/profile input or a mapping+formal two-trace pair." + ), + ) + parser.add_argument( + "--framework", + type=str, + default="auto", + choices=["auto", "sglang", "vllm", "trtllm", "tllm", "tensorrt-llm"], + help=( + "Serving framework. Use auto to detect from trace contents, path hints, " + "or URL features." + ), + ) + parser.add_argument( + "--input", + type=str, + default=None, + help="Single trace file or profile directory to triage.", + ) + parser.add_argument( + "--url", + type=str, + default=None, + help=( + "Running server URL for single-trace triage. SGLang supports direct " + "capture via sglang.profiler. vLLM and TensorRT-LLM require a server-side " + "torch-profiler output path exposed via --output-dir." + ), + ) + parser.add_argument( + "--output-dir", + type=str, + default=None, + help=( + "Trace output dir when using --url. For vLLM this should match the " + "server's torch_profiler_dir. For TensorRT-LLM it should match the " + "directory or file path configured by TLLM_TORCH_PROFILE_TRACE." + ), + ) + parser.add_argument( + "--profile-prefix", + type=str, + default="triage-trace", + help=( + "Profile prefix when generating a trace from --url. SGLang uses it " + "directly; vLLM and TensorRT-LLM may ignore it on the HTTP profiler path." 
+ ), + ) + parser.add_argument( + "--mapping-input", + type=str, + default=None, + help="Graph-off mapping trace file or directory.", + ) + parser.add_argument( + "--mapping-url", + type=str, + default=None, + help="Running graph-off server URL for the mapping trace.", + ) + parser.add_argument( + "--formal-input", + type=str, + default=None, + help="Formal graph-on trace file or directory.", + ) + parser.add_argument( + "--formal-url", + type=str, + default=None, + help="Running graph-on server URL for the formal trace.", + ) + parser.add_argument( + "--mapping-output-dir", + type=str, + default=None, + help="Trace output dir when using --mapping-url.", + ) + parser.add_argument( + "--formal-output-dir", + type=str, + default=None, + help="Trace output dir when using --formal-url.", + ) + parser.add_argument( + "--mapping-profile-prefix", + type=str, + default="mapping-trace", + help="Profile prefix for the mapping trace.", + ) + parser.add_argument( + "--formal-profile-prefix", + type=str, + default="formal-trace", + help="Profile prefix for the formal trace.", + ) + parser.add_argument( + "--num-steps", + type=int, + default=5, + help="Active profiler steps when generating traces from URLs.", + ) + parser.add_argument( + "--warmup-steps", + type=int, + default=DEFAULT_WARMUP_STEPS, + help="Warmup steps to run before arming the profiler for URL capture.", + ) + parser.add_argument( + "--profile-by-stage", action=argparse.BooleanOptionalAction, default=True + ) + parser.add_argument( + "--merge-profiles", action=argparse.BooleanOptionalAction, default=False + ) + parser.add_argument("--probe-requests", type=int, default=1) + parser.add_argument( + "--probe-prompt", + type=str, + default=( + "Repeat the word profiler many times with spaces so the server performs several decode steps. " + "Do not add explanations." 
+ ), + ) + parser.add_argument("--probe-max-new-tokens", type=int, default=None) + parser.add_argument("--probe-delay", type=float, default=0.5) + parser.add_argument( + "--profile-workload", + choices=PROFILE_WORKLOAD_CHOICES, + default="both", + help=( + "Live-capture workload shape. Default 'both' captures separate " + "prefill and decode profiles instead of one mixed request. Use " + "'legacy' to keep the old --probe-prompt behavior." + ), + ) + parser.add_argument( + "--prefill-input-len", + type=int, + default=DEFAULT_PREFILL_INPUT_LEN, + help="Synthetic input length for the prefill profile workload.", + ) + parser.add_argument( + "--prefill-output-len", + type=int, + default=DEFAULT_PREFILL_OUTPUT_LEN, + help="Output length for the prefill profile workload.", + ) + parser.add_argument( + "--decode-input-len", + type=int, + default=DEFAULT_DECODE_INPUT_LEN, + help="Synthetic input length for the decode profile workload.", + ) + parser.add_argument( + "--decode-output-len", + type=int, + default=DEFAULT_DECODE_OUTPUT_LEN, + help="Output length for the decode profile workload.", + ) + parser.add_argument( + "--start-step", + type=int, + default=None, + help="Pass through to sglang.profiler when generating traces from URLs.", + ) + parser.add_argument( + "--pid-substring", + type=str, + default=None, + help="Restrict overlap analysis to PIDs containing this substring.", + ) + parser.add_argument( + "--kernel-table-limit", + type=int, + default=0, + help="How many kernel rows to print per stage. Use 0 for all kernels.", + ) + parser.add_argument( + "--overlap-table-limit", + type=int, + default=0, + help="How many overlap rows to print per stage. 
Use 0 for all rows.", + ) + return parser + + +def parse_triage_args(argv: Sequence[str]) -> argparse.Namespace: + parser = build_triage_parser() + args = parser.parse_args(argv) + + single_trace_mode = bool(args.input) or bool(args.url) + dual_trace_mode = any( + [ + args.mapping_input, + args.mapping_url, + args.formal_input, + args.formal_url, + ] + ) + + if single_trace_mode and dual_trace_mode: + parser.error( + "Use either single-trace mode (--input/--url) or two-trace mode " + "(--mapping-* plus --formal-*), not both." + ) + + if single_trace_mode: + if bool(args.input) == bool(args.url): + parser.error("Provide exactly one of --input or --url.") + return args + + if bool(args.mapping_input) == bool(args.mapping_url): + parser.error("Provide exactly one of --mapping-input or --mapping-url.") + if bool(args.formal_input) == bool(args.formal_url): + parser.error("Provide exactly one of --formal-input or --formal-url.") + return args + + +def resolve_profile_targets( + *, + label: str, + input_path: Optional[str], + url: Optional[str], + output_dir: Optional[str], + profile_prefix: Optional[str], + args: argparse.Namespace, +) -> Tuple[List[Path], Optional[dict], str]: + if bool(input_path) == bool(url): + raise ValueError(f"{label} trace requires exactly one of input path or URL.") + + if url: + framework = resolve_framework( + args.framework, + input_path=Path(output_dir).resolve() if output_dir else None, + url=url, + ) + target_dir = run_profiler( + url=url, + output_dir=output_dir, + num_steps=args.num_steps, + profile_by_stage=args.profile_by_stage, + merge_profiles=args.merge_profiles, + profile_prefix=profile_prefix, + probe_requests=max(0, args.probe_requests), + probe_prompt=args.probe_prompt, + probe_max_new_tokens=args.probe_max_new_tokens, + probe_delay=args.probe_delay, + warmup_steps=args.warmup_steps, + start_step=args.start_step, + framework=framework, + framework_hint_path=output_dir, + profile_workload=args.profile_workload, + 
prefill_input_len=args.prefill_input_len, + prefill_output_len=args.prefill_output_len, + decode_input_len=args.decode_input_len, + decode_output_len=args.decode_output_len, + ) + traces, server_args = discover_trace_targets(target_dir, all_traces=False) + resolved_framework = resolve_framework( + args.framework, + input_path=target_dir, + url=url, + server_args=server_args, + ) + return traces, server_args, resolved_framework + + resolved = Path(input_path).resolve() + traces, server_args = discover_trace_targets(resolved, all_traces=False) + if server_args is None: + server_args = load_server_args(resolved) + framework = resolve_framework( + args.framework, input_path=resolved, server_args=server_args + ) + return traces, server_args, framework + + +def build_mapping_kernel_map(trace_paths: Sequence[Path], framework: str) -> dict: + stage_site_stats = defaultdict( + lambda: defaultdict(lambda: defaultdict(kernel_helpers.MappingSiteAggregate)) + ) + stage_kernel_categories: Dict[str, Dict[str, str]] = defaultdict(dict) + global_site_stats = defaultdict( + lambda: defaultdict(kernel_helpers.MappingSiteAggregate) + ) + global_kernel_categories: Dict[str, str] = {} + + for trace_path in trace_paths: + trace = load_trace_json(trace_path) + kernels, cpu_ops, python_frames, launch_events, _, _ = ( + kernel_helpers.extract_trace_data(trace) + ) + if not kernels: + continue + cpu_ops_by_external_id = kernel_helpers.build_cpu_op_index(cpu_ops) + launches_by_correlation = kernel_helpers.build_launch_index(launch_events) + site_context_cache = {} + default_stage = parse_stage(trace_path) + for stage, stage_kernels in kernel_helpers.group_kernels_by_stage( + kernels, default_stage + ).items(): + sampled_stage_kernels = ( + stage_kernels + if framework == "sglang" + else sample_kernels_for_mapping(stage_kernels) + ) + local_site_stats = kernel_helpers.aggregate_kernel_sites( + sampled_stage_kernels, + cpu_ops_by_external_id, + python_frames, + 
launches_by_correlation=launches_by_correlation, + site_context_cache=site_context_cache, + ) + kernel_categories = { + kernel.canonical_name: kernel.category for kernel in stage_kernels + } + kernel_helpers.merge_site_stats(stage_site_stats[stage], local_site_stats) + kernel_helpers.merge_site_stats(global_site_stats, local_site_stats) + stage_kernel_categories[stage].update(kernel_categories) + global_kernel_categories.update(kernel_categories) + + stage_payloads = { + stage: kernel_helpers.build_stage_payload( + dict(site_stats), stage_kernel_categories.get(stage, {}) + ) + for stage, site_stats in stage_site_stats.items() + } + global_payload = kernel_helpers.build_stage_payload( + dict(global_site_stats), global_kernel_categories + ) + return {"stages": stage_payloads, "global": global_payload} + + +def stage_index(stage: str) -> int: + return {"extend": 0, "prefill": 0, "decode": 1, "all": 2}.get(stage, 99) + + +def sample_kernels_for_mapping( + kernels: Sequence[kernel_helpers.KernelEvent], + per_name_limit: int = MAPPING_KERNEL_SAMPLE_LIMIT_PER_NAME, +) -> List[kernel_helpers.KernelEvent]: + if per_name_limit <= 0: + return list(kernels) + + grouped: Dict[str, List[kernel_helpers.KernelEvent]] = defaultdict(list) + for kernel in kernels: + grouped[kernel.canonical_name].append(kernel) + + sampled: List[kernel_helpers.KernelEvent] = [] + for kernel_name in sorted(grouped): + items = grouped[kernel_name] + if len(items) <= per_name_limit: + sampled.extend(items) + continue + for sample_idx in range(per_name_limit): + pos = round(sample_idx * (len(items) - 1) / max(1, per_name_limit - 1)) + sampled.append(items[pos]) + sampled.sort(key=lambda kernel: (kernel.ts, kernel.name)) + return sampled + + +def stage_display(stage: str) -> str: + return kernel_helpers.stage_label(stage) + + +def pick_stage_value(stage_to_value: Dict[str, object], stage: str) -> Optional[object]: + if stage in stage_to_value: + return stage_to_value[stage] + if "all" in stage_to_value: + 
return stage_to_value["all"] + if len(stage_to_value) == 1: + return next(iter(stage_to_value.values())) + return None + + +def render_stages(stage_to_value: Dict[str, object]) -> List[str]: + stages = set(stage_to_value) + if any(stage != "all" for stage in stages): + stages.discard("all") + return sorted(stages, key=stage_index) + + +def build_overlap_stage_bundle_map( + trace_paths: Sequence[Path], + *, + label_prefix: str, + server_args: Optional[dict], + pid_substring: Optional[str], +) -> Dict[str, overlap_helpers.TraceBundle]: + stage_bundles: Dict[str, overlap_helpers.TraceBundle] = {} + for trace_path in sorted( + trace_paths, key=lambda item: (stage_index(parse_stage(item)), item.name) + ): + trace_json = load_trace_json(trace_path) + raw_events = trace_json.get( + "traceEvents", + trace_json if isinstance(trace_json, list) else [], + ) + events, pid = overlap_helpers.extract_kernel_events(trace_json, pid_substring) + if not events: + continue + default_stage = parse_stage(trace_path) + stage_groups = overlap_helpers.group_events_by_stage(events, default_stage) + for stage in render_stages(stage_groups): + if stage in stage_bundles: + continue + stage_bundles[stage] = overlap_helpers.TraceBundle( + label=f"{label_prefix}-{stage}", + trace_path=trace_path, + server_args=server_args, + raw_events=raw_events, + events=stage_groups[stage], + pid=pid, + ) + if "all" in stage_groups and not stage_bundles: + stage_bundles["all"] = overlap_helpers.TraceBundle( + label=f"{label_prefix}-all", + trace_path=trace_path, + server_args=server_args, + raw_events=raw_events, + events=stage_groups["all"], + pid=pid, + ) + return stage_bundles + + +def group_rows_by_stage(rows: Sequence[dict]) -> List[Tuple[str, List[dict]]]: + grouped: Dict[str, List[dict]] = defaultdict(list) + for row in rows: + grouped[str(row.get("stage") or "all")].append(row) + return [ + (stage, grouped[stage]) for stage in sorted(grouped.keys(), key=stage_index) + ] + + +def 
render_kernel_table_for_stage(rows: Sequence[dict]) -> List[str]: + lines = [ + "| Kernel | Category | GPU time | Share | Launches | Python location (site share) | CPU op |", + "| --- | --- | ---: | ---: | ---: | --- | --- |", + ] + if not rows: + lines.append( + "| No kernel rows at or above 1.0% share. | - | - | - | - | - | - |" + ) + return lines + for row in rows: + lines.append( + "| {kernel} | {category} | {gpu_time} | {share:.1f}% | {launches} | {location} | {cpu_op} |".format( + kernel=kernel_helpers.escape_md_cell(row["kernel"]), + category=kernel_helpers.escape_md_cell(row["category"]), + gpu_time=kernel_helpers.format_ms(row["total_us"]), + share=row["share_pct"], + launches=row["launches"], + location=kernel_helpers.escape_md_cell(row["location"]), + cpu_op=kernel_helpers.escape_md_cell(row["cpu_op"]), + ) + ) + return lines + + +def render_stage_section_tables( + rows: Sequence[dict], + *, + render_stage_fn, + stage_label_prefix: str = "#####", +) -> List[str]: + if not rows: + return render_stage_fn([]) + stage_groups = group_rows_by_stage(rows) + if len(stage_groups) == 1 and stage_groups[0][0] == "all": + return render_stage_fn(stage_groups[0][1]) + + lines: List[str] = [] + for index, (stage, stage_rows) in enumerate(stage_groups): + lines.append(f"{stage_label_prefix} {stage_display(stage)}") + lines.extend(render_stage_fn(stage_rows)) + if index != len(stage_groups) - 1: + lines.append("") + return lines + + +def render_kernel_tables(rows: Sequence[dict]) -> List[str]: + return render_stage_section_tables( + rows, render_stage_fn=render_kernel_table_for_stage + ) + + +def render_overlap_table_for_stage(rows: Sequence[dict]) -> List[str]: + lines = [ + "| Priority | Verdict | Kernel | Python scope | Formal signal | Dep risk | Recommendation |", + "| --- | --- | --- | --- | --- | --- | --- |", + ] + if not rows: + lines.append( + "| - | - | No rows cleared the 1.0% reporting bar. Use mapping/formal mode for overlap attribution. 
| - | - | - | - |" + ) + return lines + for row in rows: + formal_signal = ( + f"{row['total_us']:.1f} us, share {row['share_pct']:.1f}%, " + f"excl {row['exclusive_ratio'] * 100:.1f}% / hid {row['hidden_ratio'] * 100:.1f}%" + ) + lines.append( + "| " + + " | ".join( + [ + row["priority"], + row["verdict"], + kernel_helpers.escape_md_cell(row["kernel"]), + kernel_helpers.escape_md_cell(row["python_scope"]), + kernel_helpers.escape_md_cell(formal_signal), + overlap_helpers.dependency_risk_label(row["dependency_signal"]), + row["recommendation"], + ] + ) + + " |" + ) + return lines + + +def render_overlap_tables(rows: Sequence[dict]) -> List[str]: + return render_stage_section_tables( + rows, + render_stage_fn=render_overlap_table_for_stage, + ) + + +def render_fuse_table_for_stage(rows: Sequence[dict]) -> List[str]: + lines = [ + "| Pattern | Confidence | Related GPU time | Share | Evidence kernels | Current kernel Python location | Candidate fused Python path | Rationale |", + "| --- | --- | ---: | ---: | --- | --- | --- | --- |", + ] + if not rows: + lines.append( + "| No medium-confidence source-backed fusion opportunity matched this trace. 
| - | - | - | - | - | - | - |" + ) + return lines + for row in rows: + lines.append( + "| {pattern} | {confidence} | {gpu_time} | {share:.1f}% | {evidence} | {current_locations} | {candidate_path} | {rationale} |".format( + pattern=kernel_helpers.escape_md_cell(row["pattern"]), + confidence=kernel_helpers.escape_md_cell(row["confidence"]), + gpu_time=kernel_helpers.format_ms(row["related_us"]), + share=row["share_pct"], + evidence=kernel_helpers.escape_md_cell(row["evidence"]), + current_locations=kernel_helpers.escape_md_cell( + row["current_locations"] + ), + candidate_path=kernel_helpers.escape_md_cell(row["candidate_path"]), + rationale=kernel_helpers.escape_md_cell(row["rationale"]), + ) + ) + return lines + + +def render_fuse_tables(rows: Sequence[dict]) -> List[str]: + return render_stage_section_tables( + rows, + render_stage_fn=render_fuse_table_for_stage, + ) + + +def run_triage(args: argparse.Namespace) -> int: + single_trace_mode = bool(args.input) or bool(args.url) + if single_trace_mode: + formal_traces, formal_server_args, formal_framework = resolve_profile_targets( + label="input", + input_path=args.input, + url=args.url, + output_dir=args.output_dir, + profile_prefix=args.profile_prefix, + args=args, + ) + mapping_traces = formal_traces + mapping_server_args = formal_server_args + mapping_framework = formal_framework + else: + mapping_traces, mapping_server_args, mapping_framework = ( + resolve_profile_targets( + label="mapping", + input_path=args.mapping_input, + url=args.mapping_url, + output_dir=args.mapping_output_dir, + profile_prefix=args.mapping_profile_prefix, + args=args, + ) + ) + formal_traces, formal_server_args, formal_framework = resolve_profile_targets( + label="formal", + input_path=args.formal_input, + url=args.formal_url, + output_dir=args.formal_output_dir, + profile_prefix=args.formal_profile_prefix, + args=args, + ) + + mapping_kernel_map = build_mapping_kernel_map(mapping_traces, mapping_framework) + + kernel_rows_rendered: 
List[dict] = [] + fuse_rows_rendered: List[dict] = [] + formal_stage_payloads: Dict[str, dict] = {} + + for formal_trace in formal_traces: + trace = load_trace_json(formal_trace) + kernels, cpu_ops, python_frames, launch_events, _, _ = ( + kernel_helpers.extract_trace_data(trace) + ) + if not kernels: + continue + default_stage = parse_stage(formal_trace) + stage_groups = kernel_helpers.group_kernels_by_stage(kernels, default_stage) + formal_cpu_ops_by_external_id = kernel_helpers.build_cpu_op_index(cpu_ops) + formal_launches_by_correlation = kernel_helpers.build_launch_index( + launch_events + ) + formal_site_context_cache = {} + for stage_name, stage_kernels in stage_groups.items(): + local_site_stats = kernel_helpers.aggregate_kernel_sites( + stage_kernels, + formal_cpu_ops_by_external_id, + python_frames, + launches_by_correlation=formal_launches_by_correlation, + site_context_cache=formal_site_context_cache, + ) + formal_stage_payloads[stage_name] = kernel_helpers.build_stage_payload( + local_site_stats, + {kernel.canonical_name: kernel.category for kernel in stage_kernels}, + ) + trace_total_us = sum(kernel.dur for kernel in kernels) + for stage in sorted(stage_groups, key=stage_index): + stage_kernels = stage_groups[stage] + if not stage_kernels: + continue + total_us = sum(kernel.dur for kernel in stage_kernels) + if ( + stage == "all" + and default_stage == "all" + and kernel_helpers.pct(total_us, trace_total_us) < MIN_RENDER_SHARE_PCT + ): + continue + kernel_stats = kernel_helpers.aggregate( + stage_kernels, key_fn=lambda item: item.canonical_name + ) + kernel_categories = { + kernel.canonical_name: kernel.category for kernel in stage_kernels + } + full_kernel_rows = kernel_helpers.build_kernel_rows( + stage=stage, + kernel_stats=kernel_stats, + kernel_categories=kernel_categories, + local_stage_payload=formal_stage_payloads.get(stage, {"kernels": {}}), + external_kernel_map=mapping_kernel_map, + ) + visible_kernel_rows = 
kernel_helpers.limit_kernel_rows( + full_kernel_rows, args.kernel_table_limit + ) + for row in visible_kernel_rows: + share_pct = kernel_helpers.pct(row.total_us, total_us) + if share_pct < MIN_RENDER_SHARE_PCT: + continue + kernel_rows_rendered.append( + { + "stage": stage, + "kernel": row.name, + "category": row.category, + "total_us": row.total_us, + "share_pct": share_pct, + "launches": row.aggregate.count, + "location": row.location, + "cpu_op": row.cpu_op, + } + ) + for item in kernel_helpers.detect_fusion_opportunities( + kernel_rows=full_kernel_rows, + total_us=total_us, + server_args=formal_server_args or mapping_server_args, + framework=formal_framework, + ): + share_pct = kernel_helpers.pct(item.related_us, total_us) + if share_pct < MIN_RENDER_SHARE_PCT: + continue + fuse_rows_rendered.append( + { + "stage": stage, + "pattern": item.pattern, + "confidence": item.confidence, + "related_us": item.related_us, + "share_pct": share_pct, + "evidence": item.evidence, + "current_locations": item.current_locations, + "candidate_path": item.candidate_path, + "rationale": item.rationale, + } + ) + + overlap_rows_rendered: List[dict] = [] + if not single_trace_mode: + mapping_overlap_bundles = build_overlap_stage_bundle_map( + mapping_traces, + label_prefix="mapping", + server_args=mapping_server_args, + pid_substring=args.pid_substring, + ) + formal_overlap_bundles = build_overlap_stage_bundle_map( + formal_traces, + label_prefix="formal", + server_args=formal_server_args, + pid_substring=args.pid_substring, + ) + for stage in render_stages(formal_overlap_bundles): + formal_bundle = pick_stage_value(formal_overlap_bundles, stage) + mapping_bundle = pick_stage_value(mapping_overlap_bundles, stage) + if formal_bundle is None or mapping_bundle is None: + continue + formal_bundle.overlap_stats = overlap_helpers.analyze_overlap( + formal_bundle.events + ) + aggregates = overlap_helpers.aggregate_events(formal_bundle.events) + source_map = 
overlap_helpers.build_kernel_source_map( + mapping_bundle, + kernel_map_entry_lookup=lambda stage_name, kernel_name: ( + kernel_helpers.lookup_kernel_map_entry( + mapping_kernel_map, stage_name, kernel_name + ) + if mapping_kernel_map + else None + ), + stage=stage, + ) + source_map = overlap_helpers.merge_source_map_from_kernel_payload( + source_map, + pick_stage_value(formal_stage_payloads, stage), + ) + stage_rows = overlap_helpers.build_action_rows( + aggregates, + source_map, + formal_bundle.events, + formal_bundle.overlap_stats["total_busy_us"], + table_limit=max(0, args.overlap_table_limit), + ) + for row in stage_rows: + if row.share_pct < MIN_RENDER_SHARE_PCT: + continue + overlap_rows_rendered.append( + { + "stage": stage, + "priority": row.priority, + "verdict": row.verdict, + "kernel": row.kernel, + "python_scope": row.python_scope, + "total_us": row.total_us, + "share_pct": row.share_pct, + "exclusive_ratio": row.exclusive_ratio, + "hidden_ratio": row.hidden_ratio, + "dependency_signal": row.dependency_signal, + "recommendation": row.recommendation, + } + ) + + lines: List[str] = [] + lines.append("Triage View") + lines.append(f"Mode: {'single-trace' if single_trace_mode else 'mapping-formal'}") + if single_trace_mode: + lines.append(f"Framework: {framework_display_name(formal_framework)}") + lines.append(f"Input traces: {', '.join(str(path) for path in formal_traces)}") + else: + if mapping_framework == formal_framework: + lines.append(f"Framework: {framework_display_name(formal_framework)}") + else: + lines.append( + f"Mapping framework: {framework_display_name(mapping_framework)}" + ) + lines.append( + f"Formal framework: {framework_display_name(formal_framework)}" + ) + lines.append( + f"Mapping traces: {', '.join(str(path) for path in mapping_traces)}" + ) + lines.append(f"Formal traces: {', '.join(str(path) for path in formal_traces)}") + if formal_server_args or mapping_server_args: + server_args = formal_server_args or mapping_server_args + 
model = server_args.get("model_path") or server_args.get("model") + if model: + lines.append(f"Model: {model}") + lines.append("") + lines.append("Kernel Table") + lines.extend(render_kernel_tables(kernel_rows_rendered)) + lines.append("") + lines.append("Overlap Opportunity Table") + lines.extend(render_overlap_tables(overlap_rows_rendered)) + lines.append("") + lines.append("Fuse Opportunity Table") + lines.extend(render_fuse_tables(fuse_rows_rendered)) + print("\n".join(lines).rstrip()) + return 0 + + +def main(argv: Optional[Sequence[str]] = None) -> int: + argv = list(argv or sys.argv[1:]) + triage_parser = build_triage_parser() + + if not argv or argv[0] in {"-h", "--help"}: + triage_parser.print_help() + return 0 + + if argv[0] == "triage": + argv = argv[1:] + elif not argv[0].startswith("-"): + triage_parser.error( + "This skill exposes only the triage workflow. " + "Use single-trace mode (--input/--url) or mapping+formal two-trace mode." + ) + return 2 + + return run_triage(parse_triage_args(argv)) + + +if __name__ == "__main__": + raise SystemExit(main(sys.argv[1:])) diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/analyze_sglang_torch_profile.py b/.claude/skills/llm-torch-profiler-analysis/scripts/analyze_sglang_torch_profile.py new file mode 100644 index 000000000000..35aabc4c5693 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/analyze_sglang_torch_profile.py @@ -0,0 +1,16 @@ +"""Backwards-compatibility shim for the unified LLM torch-profiler entrypoint. + +The real implementation now lives in ``analyze_llm_torch_profile`` because this +skill covers SGLang, vLLM, and TensorRT-LLM. Older scripts and runbooks that +still invoke ``analyze_sglang_torch_profile.py`` keep working by forwarding to +that module. 
+""" + +from __future__ import annotations + +import sys + +from analyze_llm_torch_profile import main + +if __name__ == "__main__": + raise SystemExit(main(sys.argv[1:])) diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/make_trtllm_py_executor_override.py b/.claude/skills/llm-torch-profiler-analysis/scripts/make_trtllm_py_executor_override.py new file mode 100644 index 000000000000..597665d67bc1 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/make_trtllm_py_executor_override.py @@ -0,0 +1,132 @@ +"""Generate a TensorRT-LLM py_executor override for stable torch-profiler capture.""" + +from __future__ import annotations + +import argparse +from dataclasses import dataclass +from pathlib import Path + +START_MARKER = "torch_profiler = torch.profiler.profile(" + + +@dataclass +class ProfileCallSpan: + start: int + end: int + block: str + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description=( + "Create a py_executor.py override that enables with_stack=True for " + "TensorRT-LLM torch-profiler traces." 
+ ) + ) + parser.add_argument("--source", required=True, help="Original py_executor.py path.") + parser.add_argument("--output", required=True, help="Override file path to write.") + return parser.parse_args() + + +def find_profile_call_span(text: str) -> ProfileCallSpan: + start = text.find(START_MARKER) + if start == -1: + raise SystemExit("Could not find torch profiler setup in source file.") + + open_paren = text.find("(", start) + if open_paren == -1: + raise SystemExit("Malformed torch profiler setup in source file.") + + depth = 0 + for index in range(open_paren, len(text)): + char = text[index] + if char == "(": + depth += 1 + elif char == ")": + depth -= 1 + if depth == 0: + return ProfileCallSpan( + start=start, + end=index + 1, + block=text[start : index + 1], + ) + raise SystemExit("Could not find the end of the torch profiler call.") + + +def inject_with_stack(block: str) -> str: + if "with_stack=" in block: + return block + + lines = block.splitlines() + if not lines: + raise SystemExit("Unexpected torch profiler block format.") + + last_line = lines[-1] + if not last_line.strip(): + raise SystemExit("Unexpected torch profiler block terminator.") + + if last_line.strip() == ")": + if len(lines) < 2: + raise SystemExit("Could not find the last torch profiler argument line.") + last_arg_index = len(lines) - 2 + last_arg_line = lines[last_arg_index] + indent = last_arg_line[: len(last_arg_line) - len(last_arg_line.lstrip())] + if not last_arg_line.rstrip().endswith(","): + lines[last_arg_index] = last_arg_line.rstrip() + "," + lines.insert(len(lines) - 1, f"{indent}with_stack=True") + return "\n".join(lines) + + if not last_line.rstrip().endswith(")"): + raise SystemExit("Unexpected torch profiler block terminator.") + + indent = last_line[: len(last_line) - len(last_line.lstrip())] + last_arg_text = last_line.rstrip()[:-1].rstrip() + if not last_arg_text.endswith(","): + last_arg_text += "," + lines[-1] = last_arg_text + 
lines.append(f"{indent}with_stack=True)") + return "\n".join(lines) + + +def inject_rank0_trace_guard(text: str) -> str: + needle = ( + " enable_torch_trace = bool(torch_trace_path and profile_start_stop)\n" + ) + replacement = ( + " # Multi-rank PyTorch backend workers race on the same chrome-trace " + "path.\n" + " # Keep the full torch-profiler trace on rank 0 and let the other " + "ranks\n" + " # continue with CUDA-profiler gating only.\n" + " enable_torch_trace = bool(\n" + " torch_trace_path and profile_start_stop and self.dist.rank == 0\n" + " )\n" + ) + if replacement in text: + return text + if needle not in text: + raise SystemExit("Could not find enable_torch_trace assignment in source file.") + return text.replace(needle, replacement, 1) + + +def main() -> int: + args = parse_args() + source = Path(args.source).expanduser().resolve() + output = Path(args.output).expanduser().resolve() + text = source.read_text(encoding="utf-8") + span = find_profile_call_span(text) + patched_block = inject_with_stack(span.block) + patched = ( + text + if patched_block == span.block + else (text[: span.start] + patched_block + text[span.end :]) + ) + patched = inject_rank0_trace_guard(patched) + output.parent.mkdir(parents=True, exist_ok=True) + output.write_text(patched, encoding="utf-8") + print(output) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/probe_llm_server.py b/.claude/skills/llm-torch-profiler-analysis/scripts/probe_llm_server.py new file mode 100755 index 000000000000..ba7b65d00c2a --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/probe_llm_server.py @@ -0,0 +1,230 @@ +#!/usr/bin/env python3 +"""Run a small correctness and latency probe against an LLM server.""" + +from __future__ import annotations + +import argparse +import json +import math +import statistics +import time +from pathlib import Path +from typing import Any, Dict, List, Optional +from 
urllib import request + +from profile_common import extract_openai_chat_text + +DEFAULT_PROMPTS = [ + "Introduce Shanghai in one short sentence.", + "What is 2+2? Answer briefly.", + "Write one short haiku about GPUs.", +] + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description=( + "Send a few short requests to an LLM server and record latency plus " + "sample outputs." + ) + ) + parser.add_argument( + "--framework", + required=True, + choices=("sglang", "vllm", "trtllm"), + help="Serving framework.", + ) + parser.add_argument( + "--url", + required=True, + help="Server base URL, for example http://127.0.0.1:30000.", + ) + parser.add_argument( + "--model", + default=None, + help="OpenAI model id. Auto-discovered for vLLM and TensorRT-LLM when omitted.", + ) + parser.add_argument( + "--requests", + type=int, + default=6, + help="How many probe requests to send.", + ) + parser.add_argument( + "--max-tokens", + type=int, + default=48, + help="Generation length for each request.", + ) + parser.add_argument( + "--timeout", + type=float, + default=180.0, + help="Per-request timeout in seconds.", + ) + parser.add_argument( + "--prompt", + action="append", + default=[], + help="Optional prompt override. 
Repeat to add more prompts.", + ) + parser.add_argument( + "--output", + default=None, + help="Optional JSON output path.", + ) + return parser.parse_args() + + +def post_json(url: str, payload: Dict[str, Any], timeout: float) -> Dict[str, Any]: + req = request.Request( + url=url, + data=json.dumps(payload).encode("utf-8"), + headers={"Content-Type": "application/json"}, + method="POST", + ) + with request.urlopen(req, timeout=timeout) as resp: + raw = resp.read() + return json.loads(raw.decode("utf-8")) if raw else {} + + +def get_json(url: str, timeout: float) -> Dict[str, Any]: + req = request.Request(url=url, method="GET") + with request.urlopen(req, timeout=timeout) as resp: + raw = resp.read() + return json.loads(raw.decode("utf-8")) if raw else {} + + +def discover_openai_model(base_url: str, timeout: float) -> str: + payload = get_json(base_url.rstrip("/") + "/v1/models", timeout=timeout) + data = payload.get("data") + if not isinstance(data, list) or not data: + raise RuntimeError(f"No models returned by {base_url.rstrip('/')}/v1/models") + first = data[0] + if isinstance(first, dict) and first.get("id"): + return str(first["id"]) + raise RuntimeError(f"Malformed /v1/models payload from {base_url.rstrip('/')}") + + +def p95(values: List[float]) -> Optional[float]: + if not values: + return None + ordered = sorted(values) + index = max(0, math.ceil(len(ordered) * 0.95) - 1) + return ordered[index] + + +def sglang_request(base_url: str, prompt: str, max_tokens: int, timeout: float) -> str: + payload = { + "text": prompt, + "sampling_params": { + "temperature": 0.0, + "max_new_tokens": max_tokens, + }, + "stream": False, + } + body = post_json(base_url.rstrip("/") + "/generate", payload, timeout=timeout) + return str(body.get("text", "")) + + +def openai_request( + base_url: str, + model: str, + prompt: str, + max_tokens: int, + timeout: float, +) -> Dict[str, str]: + payload = { + "model": model, + "messages": [{"role": "user", "content": prompt}], + 
"temperature": 0.0, + "max_tokens": max_tokens, + "stream": False, + } + body = post_json( + base_url.rstrip("/") + "/v1/chat/completions", + payload, + timeout=timeout, + ) + text, source = extract_openai_chat_text(body) + return {"text": text, "source": source} + + +def run_probe(args: argparse.Namespace) -> Dict[str, Any]: + prompts = args.prompt or list(DEFAULT_PROMPTS) + model = args.model + if args.framework in {"vllm", "trtllm"} and not model: + model = discover_openai_model(args.url, timeout=args.timeout) + + latencies: List[float] = [] + samples: List[Dict[str, Any]] = [] + errors: List[Dict[str, str]] = [] + + for request_idx in range(args.requests): + prompt = prompts[request_idx % len(prompts)] + start = time.time() + try: + if args.framework == "sglang": + text = sglang_request( + args.url, + prompt, + max_tokens=args.max_tokens, + timeout=args.timeout, + ) + source = "generate.text" + else: + assert model is not None + result = openai_request( + args.url, + model, + prompt, + max_tokens=args.max_tokens, + timeout=args.timeout, + ) + text = result["text"] + source = result["source"] + elapsed = time.time() - start + latencies.append(elapsed) + samples.append( + { + "prompt": prompt, + "latency_s": round(elapsed, 3), + "content": text[:240], + "source": source, + "non_empty": bool(text.strip()), + } + ) + except Exception as exc: # pragma: no cover - runtime probe path + errors.append({"prompt": prompt, "error": repr(exc)}) + + return { + "framework": args.framework, + "url": args.url, + "model": model, + "requests": args.requests, + "success": len(samples), + "errors": len(errors), + "all_non_empty": ( + all(sample["non_empty"] for sample in samples) if samples else False + ), + "avg_latency_s": round(statistics.mean(latencies), 3) if latencies else None, + "p95_latency_s": round(p95(latencies), 3) if latencies else None, + "samples": samples[:3], + "error_samples": errors[:3], + } + + +def main() -> int: + args = parse_args() + summary = 
run_probe(args) + rendered = json.dumps(summary, ensure_ascii=False, indent=2) + print(rendered) + if args.output: + output_path = Path(args.output).expanduser().resolve() + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(rendered + "\n", encoding="utf-8") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/profile_common.py b/.claude/skills/llm-torch-profiler-analysis/scripts/profile_common.py new file mode 100644 index 000000000000..8e4d7514af62 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/profile_common.py @@ -0,0 +1,1145 @@ +"""Shared helpers for unified LLM torch-profiler skill scripts.""" + +from __future__ import annotations + +import gzip +import json +import re +import shutil +import sys +import tempfile +import time +from collections import Counter, defaultdict +from dataclasses import dataclass +from functools import lru_cache +from pathlib import Path +from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple +from urllib import request + +STAGE_ORDER = {"extend": 0, "prefill": 0, "decode": 1, "all": 2} +FRAMEWORK_LABELS = { + "auto": "auto", + "sglang": "SGLang", + "vllm": "vLLM", + "trtllm": "TensorRT-LLM", +} +TRACE_FILE_PATTERNS = ( + "*.trace.json", + "*.trace.json.gz", + "*.pt.trace.json", + "*.pt.trace.json.gz", + "*.json", + "*.json.gz", +) +TRACE_FILE_IGNORE_NAMES = { + "server_args.json", + "metadata.json", + "config.json", +} +TRACE_METADATA_NAMES = { + "process_name", + "thread_name", + "process_sort_index", + "thread_sort_index", +} +NON_KERNEL_TRACE_CATEGORIES = ("python_function", "cpu_op", "trace") +PYTHON_SCOPE_NAME_PREFIXES = ("python/", "nn.module:") +PROFILE_WORKLOAD_CHOICES = ("legacy", "prefill", "decode", "both") +DEFAULT_PREFILL_INPUT_LEN = 4090 +DEFAULT_PREFILL_OUTPUT_LEN = 1 +DEFAULT_DECODE_INPUT_LEN = 1 +DEFAULT_DECODE_OUTPUT_LEN = 2048 +DEFAULT_WARMUP_STEPS = 10 + + 
+@dataclass(frozen=True) +class ProbePlan: + prompt: str + capture_max_new_tokens: int + capture_requests: int + warmup_max_new_tokens: int + warmup_requests: int + + +@lru_cache(maxsize=65536) +def _normalize_text_cached(text: str) -> str: + text = text.strip() + if not text: + return "" + for token in (" ", "\t", "\n", "\r", "\v", "\f"): + if token in text: + return " ".join(text.split()) + return text + + +def normalize_text(value: object) -> str: + return _normalize_text_cached(value if isinstance(value, str) else str(value)) + + +def canonicalize_framework(value: object) -> str: + lowered = normalize_text(value).lower().replace("_", "-") + aliases = { + "": "auto", + "auto": "auto", + "sglang": "sglang", + "sgl": "sglang", + "vllm": "vllm", + "trt": "trtllm", + "tllm": "trtllm", + "trtllm": "trtllm", + "tensorrt-llm": "trtllm", + "tensorrtllm": "trtllm", + } + return aliases.get(lowered, "auto") + + +def framework_display_name(value: object) -> str: + return FRAMEWORK_LABELS.get(canonicalize_framework(value), str(value)) + + +@lru_cache(maxsize=65536) +def _normalize_repo_relative_path_cached(text: str) -> str: + text = text.replace("\\", "/") + lowered = text.lower() + for marker, normalized_marker in ( + ("python/sglang/", "python/sglang/"), + ("sgl_kernel/", "sgl_kernel/"), + ("vllm/", "vllm/"), + ("tensorrt_llm/", "tensorrt_llm/"), + ("tensorrt-llm/", "tensorrt_llm/"), + ): + idx = lowered.find(marker) + if idx != -1: + suffix = text[idx + len(marker) :].lstrip("/") + return f"{normalized_marker}{suffix}".lstrip("/") + idx = lowered.find("sglang/") + if idx != -1: + return ("python/" + text[idx:]).lstrip("/") + return text.lstrip("/") + + +def normalize_repo_relative_path(path: object) -> str: + return _normalize_repo_relative_path_cached(normalize_text(path)) + + +def contains_any_keyword(text: str, keywords: Iterable[str]) -> bool: + return any(keyword in text for keyword in keywords) + + +def coerce_optional_int(value: object) -> Optional[int]: + if 
value in (None, "", "None"): + return None + if isinstance(value, int): + return value + if isinstance(value, float): + return int(value) if value.is_integer() else None + try: + return int(str(value)) + except (TypeError, ValueError): + return None + + +def extract_trace_events(trace: object) -> Sequence[dict]: + if isinstance(trace, dict): + events = trace.get("traceEvents", []) + return events if isinstance(events, list) else [] + if isinstance(trace, list): + return trace + return [] + + +def is_trace_metadata_name(name: object) -> bool: + return str(name) in TRACE_METADATA_NAMES + + +def is_complete_duration_event(event: dict) -> bool: + if event.get("ph") != "X": + return False + dur = event.get("dur") + ts = event.get("ts") + if dur is None or ts is None: + return False + try: + return float(dur) > 0 + except (TypeError, ValueError): + return False + + +def is_annotation_event(name: object, category: object) -> bool: + lowered_name = normalize_text(name).lower() + lowered_category = normalize_text(category).lower() + return "annotation" in lowered_category or lowered_name.startswith("## call ") + + +def is_non_kernel_trace_category(category: object) -> bool: + lowered_category = normalize_text(category).lower() + return any(token in lowered_category for token in NON_KERNEL_TRACE_CATEGORIES) + + +def looks_like_python_scope_name(name: object) -> bool: + lowered_name = normalize_text(name).lower() + return ".py(" in lowered_name or lowered_name.startswith(PYTHON_SCOPE_NAME_PREFIXES) + + +def has_stream_marker(args: Optional[dict]) -> bool: + trace_args = args or {} + return "stream" in trace_args or "cuda_stream" in trace_args + + +def load_trace_json(path: Path) -> dict: + if path.suffix == ".gz": + with gzip.open(path, "rt", encoding="utf-8") as handle: + return json.load(handle) + with open(path, "r", encoding="utf-8") as handle: + return json.load(handle) + + +def load_server_args(path: Path) -> Optional[dict]: + resolved = path.resolve() + candidate_dirs: 
List[Path] = [] + if resolved.is_file(): + candidate_dirs.extend([resolved.parent, resolved.parent.parent]) + else: + candidate_dirs.extend([resolved, resolved.parent]) + + seen: set[Path] = set() + for candidate_dir in candidate_dirs: + if candidate_dir in seen: + continue + seen.add(candidate_dir) + candidate = candidate_dir / "server_args.json" + if candidate.exists(): + with open(candidate, "r", encoding="utf-8") as handle: + return json.load(handle) + return None + + +def try_get_json(url: str, timeout: float = 60.0) -> Optional[object]: + try: + with request.urlopen(url, timeout=timeout) as response: + raw = response.read() + except Exception: + return None + if not raw: + return None + try: + return json.loads(raw.decode("utf-8")) + except json.JSONDecodeError: + return None + + +def _flatten_chat_text_parts(value: object) -> List[str]: + if value is None: + return [] + if isinstance(value, str): + text = value.strip() + return [text] if text else [] + if isinstance(value, list): + parts: List[str] = [] + for item in value: + parts.extend(_flatten_chat_text_parts(item)) + return parts + if isinstance(value, dict): + parts: List[str] = [] + text_keys = ( + "text", + "content", + "reasoning_content", + "reasoning", + "output_text", + ) + if any(key in value for key in text_keys): + for key in text_keys: + parts.extend(_flatten_chat_text_parts(value.get(key))) + if parts: + return parts + item_type = normalize_text(value.get("type")).lower() + if item_type in {"text", "output_text", "input_text"}: + for key in ("text", "content", "value"): + parts.extend(_flatten_chat_text_parts(value.get(key))) + elif item_type in {"reasoning", "thinking"}: + for key in ("text", "content", "reasoning_content", "reasoning"): + parts.extend(_flatten_chat_text_parts(value.get(key))) + return parts + return [] + + +def flatten_chat_text(value: object) -> str: + return "\n".join(_flatten_chat_text_parts(value)).strip() + + +def extract_openai_chat_text(body: object) -> Tuple[str, 
str]: + if not isinstance(body, dict): + return "", "invalid_body" + + choices = body.get("choices") + if not isinstance(choices, list) or not choices: + fallback = flatten_chat_text(body.get("output_text")) + if fallback: + return fallback, "body.output_text" + return "", "missing_choices" + + first_choice = choices[0] + if not isinstance(first_choice, dict): + return "", "invalid_choice" + + message = first_choice.get("message") + if isinstance(message, dict): + for key in ("content", "reasoning_content", "reasoning"): + text = flatten_chat_text(message.get(key)) + if text: + return text, f"message.{key}" + + for key in ("text", "content", "reasoning_content", "reasoning"): + text = flatten_chat_text(first_choice.get(key)) + if text: + return text, f"choice.{key}" + + delta = first_choice.get("delta") + if isinstance(delta, dict): + for key in ("content", "reasoning_content", "reasoning"): + text = flatten_chat_text(delta.get(key)) + if text: + return text, f"delta.{key}" + + fallback = flatten_chat_text(body.get("output_text")) + if fallback: + return fallback, "body.output_text" + return "", "empty" + + +def detect_framework_from_text(text: object) -> Optional[str]: + lowered = normalize_text(text).lower() + if not lowered: + return None + if any( + token in lowered + for token in ( + "tensorrt_llm", + "tensorrt-llm", + "trtllm", + "pyexecutor", + ) + ): + return "trtllm" + if "vllm" in lowered: + return "vllm" + if any(token in lowered for token in ("python/sglang/", "sgl_kernel/", "sglang/")): + return "sglang" + return None + + +def detect_framework_from_server_args(server_args: Optional[dict]) -> Optional[str]: + if not isinstance(server_args, dict) or not server_args: + return None + lowered_keys = {normalize_text(key).lower() for key in server_args} + if lowered_keys & { + "attention_backend", + "sampling_backend", + "disable_cuda_graph", + "disable_piecewise_cuda_graph", + "chunked_prefill_size", + "schedule_policy", + }: + return "sglang" + return 
detect_framework_from_text(json.dumps(server_args, sort_keys=True)) + + +def detect_framework_from_trace(trace: object) -> Optional[str]: + text_samples: List[str] = [] + for event in extract_trace_events(trace)[:256]: + text_samples.extend( + [ + str(event.get("name", "")), + str(event.get("cat", "")), + str(event.get("pid", "")), + ] + ) + trace_args = event.get("args") + if isinstance(trace_args, dict): + for key, value in list(trace_args.items())[:8]: + text_samples.append(str(key)) + if isinstance(value, str): + text_samples.append(value) + return detect_framework_from_text(" ".join(text_samples)) + + +def detect_framework_from_path(path: Path) -> Optional[str]: + hint = detect_framework_from_text(str(path)) + if hint: + return hint + server_args = load_server_args(path) + hint = detect_framework_from_server_args(server_args) + if hint: + return hint + if path.is_file(): + try: + return detect_framework_from_trace(load_trace_json(path)) + except Exception: + return None + trace_files = discover_trace_files(path, recursive=True, limit=3) + for trace_file in trace_files: + try: + hint = detect_framework_from_trace(load_trace_json(trace_file)) + except Exception: + hint = None + if hint: + return hint + return None + + +def detect_framework_from_url( + url: str, output_dir: Optional[str] = None +) -> Optional[str]: + hint = detect_framework_from_text(output_dir or "") + if hint: + return hint + server_info = try_get_json(url.rstrip("/") + "/server_info") + if isinstance(server_info, dict) and ( + "internal_states" in server_info + or "tokenizer_path" in server_info + or "prefill" in server_info + or "decode" in server_info + ): + return "sglang" + models = try_get_json(url.rstrip("/") + "/v1/models") + if isinstance(models, dict) and isinstance(models.get("data"), list): + return "vllm" + return None + + +def resolve_framework( + requested: object, + *, + input_path: Optional[Path] = None, + url: Optional[str] = None, + server_args: Optional[dict] = None, +) -> 
str: + explicit = canonicalize_framework(requested) + if explicit != "auto": + return explicit + for hint in ( + detect_framework_from_server_args(server_args), + detect_framework_from_path(input_path) if input_path else None, + ( + detect_framework_from_url(url, str(input_path) if input_path else None) + if url + else None + ), + ): + if hint: + return hint + return "sglang" + + +def parse_stage(path: Path) -> str: + parts = [part.lower() for part in path.parts[-6:]] + name = " ".join(parts) + segment_path = "/" + "/".join(parts) + "/" + if any(marker in name for marker in ("-extend", "-prefill", "_extend", "_prefill")): + return "extend" + if any(f"/{segment}/" in segment_path for segment in ("extend", "prefill")): + return "extend" + if any(marker in name for marker in ("-decode", "_decode")): + return "decode" + if "/decode/" in segment_path: + return "decode" + return "all" + + +def parse_tp_rank(path: Path) -> Optional[int]: + for pattern in ( + r"(?:^|[_-])tp(\d+)(?:[_.-]|$)", + r"TP-(\d+)", + r"(?:^|[_-])rank(\d+)(?:[_.-]|$)", + r"(?:^|[_-])worker(\d+)(?:[_.-]|$)", + ): + match = re.search(pattern, path.name, re.IGNORECASE) + if match: + return int(match.group(1)) + return None + + +def file_looks_like_trace(path: Path) -> bool: + name = path.name.lower() + if name in TRACE_FILE_IGNORE_NAMES: + return False + if path.is_dir(): + return False + if any(name.endswith(suffix) for suffix in (".trace.json", ".trace.json.gz")): + return True + if ".pt.trace.json" in name: + return True + if not any(name.endswith(suffix) for suffix in (".json", ".json.gz")): + return False + try: + trace = load_trace_json(path) + except Exception: + return False + if isinstance(trace, dict): + return isinstance(trace.get("traceEvents"), list) + if isinstance(trace, list): + return bool(trace) and all(isinstance(item, dict) for item in trace[:8]) + return False + + +def discover_trace_files( + path: Path, + *, + recursive: bool, + limit: Optional[int] = None, +) -> List[Path]: + if 
path.is_file(): + return [path] if file_looks_like_trace(path) else [] + + candidates: List[Path] = [] + seen: set[Path] = set() + for pattern in TRACE_FILE_PATTERNS: + iterator = path.rglob(pattern) if recursive else path.glob(pattern) + for candidate in iterator: + resolved = candidate.resolve() + if resolved in seen: + continue + seen.add(resolved) + candidates.append(resolved) + candidates = [ + candidate + for candidate in candidates + if candidate.exists() and file_looks_like_trace(candidate) + ] + candidates.sort(key=lambda item: item.stat().st_mtime) + if limit is not None and limit >= 0: + return candidates[-limit:] if limit else [] + return candidates + + +def newest_trace_dir(path: Path) -> Path: + if path.is_file(): + return path.parent + direct = discover_trace_files(path, recursive=False) + if direct: + return path + traces = discover_trace_files(path, recursive=True) + trace_dirs = list({trace.parent for trace in traces}) + if not trace_dirs: + raise FileNotFoundError(f"No trace files found under {path}") + trace_dirs.sort( + key=lambda item: max( + trace.stat().st_mtime for trace in traces if trace.parent == item + ) + ) + return trace_dirs[-1] + + +def discover_trace_targets( + path: Path, all_traces: bool +) -> Tuple[List[Path], Optional[dict]]: + if path.is_file(): + return [path], load_server_args(path) + + direct_traces = discover_trace_files(path, recursive=False) + recursive_traces = discover_trace_files(path, recursive=True) + recursive_stages = {parse_stage(trace) for trace in recursive_traces} + if ( + not direct_traces + and recursive_traces + and any(stage != "all" for stage in recursive_stages) + ): + traces = recursive_traces + trace_dir = path + else: + trace_dir = newest_trace_dir(path) + traces = discover_trace_files(trace_dir, recursive=False) + if not traces: + raise FileNotFoundError(f"No trace files found under {trace_dir}") + + non_merged = [trace for trace in traces if not trace.name.startswith("merged-")] + selected = 
non_merged or traces + if not all_traces: + ranks = sorted( + { + rank + for rank in (parse_tp_rank(trace) for trace in selected) + if rank is not None + } + ) + if ranks: + rank = 0 if 0 in ranks else ranks[0] + selected = [trace for trace in selected if parse_tp_rank(trace) == rank] + grouped: Dict[str, List[Path]] = defaultdict(list) + for trace in selected: + grouped[parse_stage(trace)].append(trace) + selected = [ + sorted(group, key=lambda item: item.stat().st_mtime)[-1] + for group in grouped.values() + ] + + selected.sort(key=lambda item: (STAGE_ORDER.get(parse_stage(item), 99), item.name)) + return selected, load_server_args(trace_dir) + + +def post_json( + url: str, payload: Optional[dict] = None, timeout: float = 60.0 +) -> Optional[dict]: + req = request.Request( + url=url, + data=(None if payload is None else json.dumps(payload).encode("utf-8")), + headers={"Content-Type": "application/json"}, + method="POST", + ) + with request.urlopen(req, timeout=timeout) as response: + raw = response.read() + return json.loads(raw.decode("utf-8")) if raw else None + + +def send_probe_request( + url: str, + prompt: str, + max_new_tokens: int, + sampling_seed: int, + framework: str, + model: Optional[str] = None, +) -> None: + framework = canonicalize_framework(framework) + if framework == "sglang": + payload = { + "text": prompt, + "sampling_params": { + "sampling_seed": sampling_seed, + "temperature": 0.0, + "max_new_tokens": max_new_tokens, + }, + "stream": False, + } + post_json(url.rstrip("/") + "/generate", payload, timeout=300.0) + return + + resolved_model = model or discover_openai_model(url) + chat_payload = { + "model": resolved_model, + "messages": [{"role": "user", "content": prompt}], + "temperature": 0.0, + "max_tokens": max_new_tokens, + "stream": False, + } + try: + post_json(url.rstrip("/") + "/v1/chat/completions", chat_payload, timeout=300.0) + return + except Exception: + completion_payload = { + "model": resolved_model, + "prompt": prompt, + 
"temperature": 0.0, + "max_tokens": max_new_tokens, + "stream": False, + } + post_json( + url.rstrip("/") + "/v1/completions", + completion_payload, + timeout=300.0, + ) + + +def unique_probe_prompt(prompt: str, probe_index: int) -> str: + marker = f"profile_probe_{max(0, int(probe_index))}" + parts = prompt.split(maxsplit=1) + suffix = parts[1] if len(parts) == 2 else prompt + return f"{marker} {suffix}".strip() + + +def send_probe_requests( + *, + url: str, + prompt: str, + max_new_tokens: int, + request_count: int, + framework: str, + model: Optional[str] = None, + sampling_seed_offset: int = 0, +) -> None: + request_count = max(0, int(request_count)) + seed_offset = max(0, int(sampling_seed_offset)) + for request_idx in range(request_count): + probe_index = seed_offset + request_idx + send_probe_request( + url=url, + prompt=unique_probe_prompt(prompt, probe_index), + max_new_tokens=max_new_tokens, + sampling_seed=probe_index, + framework=framework, + model=model, + ) + + +def synthetic_prompt(input_len: int) -> str: + token_count = max(1, int(input_len)) + return " ".join(["profile"] * token_count) + + +def workload_probe( + stage: str, + *, + prefill_input_len: int, + prefill_output_len: int, + decode_input_len: int, + decode_output_len: int, +) -> Tuple[str, int]: + if stage == "prefill": + return synthetic_prompt(prefill_input_len), max(1, int(prefill_output_len)) + if stage == "decode": + return synthetic_prompt(decode_input_len), max(1, int(decode_output_len)) + raise ValueError(f"unknown profile workload stage: {stage}") + + +def build_probe_plan( + stage: str, + *, + prompt: str, + max_new_tokens: int, + num_steps: int, + probe_requests: int, + warmup_steps: int, +) -> ProbePlan: + active_steps = max(1, int(num_steps)) + requested_probes = max(1, int(probe_requests)) + warmup_steps = max(0, int(warmup_steps)) + max_new_tokens = max(1, int(max_new_tokens)) + + if stage == "prefill": + return ProbePlan( + prompt=prompt, + 
capture_max_new_tokens=max_new_tokens, + capture_requests=max(requested_probes, active_steps), + warmup_max_new_tokens=max_new_tokens, + warmup_requests=warmup_steps, + ) + if stage == "decode": + return ProbePlan( + prompt=prompt, + capture_max_new_tokens=max_new_tokens, + capture_requests=requested_probes, + warmup_max_new_tokens=max(1, warmup_steps), + warmup_requests=1 if warmup_steps else 0, + ) + return ProbePlan( + prompt=prompt, + capture_max_new_tokens=max_new_tokens, + capture_requests=requested_probes, + warmup_max_new_tokens=max_new_tokens, + warmup_requests=warmup_steps, + ) + + +def expand_profile_workload(profile_workload: str) -> List[str]: + workload = normalize_text(profile_workload).lower() + if workload not in PROFILE_WORKLOAD_CHOICES: + raise ValueError( + f"--profile-workload must be one of {', '.join(PROFILE_WORKLOAD_CHOICES)}" + ) + if workload == "both": + return ["prefill", "decode"] + if workload == "legacy": + return ["legacy"] + return [workload] + + +def discover_openai_model(url: str) -> str: + payload = try_get_json(url.rstrip("/") + "/v1/models", timeout=60.0) + if not isinstance(payload, dict): + raise RuntimeError(f"Could not read {url.rstrip('/')}/v1/models") + data = payload.get("data") + if not isinstance(data, list) or not data: + raise RuntimeError(f"No models returned by {url.rstrip('/')}/v1/models") + first = data[0] + if isinstance(first, dict) and first.get("id"): + return str(first["id"]) + raise RuntimeError(f"Malformed /v1/models payload from {url.rstrip('/')}") + + +def ensure_remote_profiler_output_path( + output_dir: Optional[str], framework: str +) -> Path: + if not output_dir: + raise ValueError( + f"{framework_display_name(framework)} live capture requires --output-dir " + "to point at the server-side torch profiler trace path that is visible " + "from this machine." 
+ ) + output_path = Path(output_dir).expanduser().resolve() + if output_path.suffix in {".json", ".gz"}: + output_path.parent.mkdir(parents=True, exist_ok=True) + else: + output_path.mkdir(parents=True, exist_ok=True) + return output_path + + +def wait_for_profiler_artifact(path: Path, timeout_s: float = 60.0) -> Path: + deadline = time.monotonic() + timeout_s + while time.monotonic() < deadline: + if path.is_file() and file_looks_like_trace(path): + return path + if path.exists(): + trace_files = discover_trace_files(path, recursive=True) + if trace_files: + return newest_trace_dir(path) + if path.is_dir(): + child_dirs = [item for item in path.iterdir() if item.is_dir()] + if child_dirs: + child_dirs.sort(key=lambda item: item.stat().st_mtime) + newest_child = child_dirs[-1] + child_traces = discover_trace_files(newest_child, recursive=True) + if child_traces: + return newest_child + time.sleep(0.5) + return path + + +def start_remote_profiler(url: str, framework: str) -> None: + try: + post_json(url.rstrip("/") + "/start_profile", timeout=60.0) + except Exception as exc: + if framework == "vllm": + raise RuntimeError( + "vLLM live torch profiling requires the server to be launched with " + '--profiler-config \'{"profiler":"torch","torch_profiler_dir":"..."}\' ' + "and to expose POST /start_profile." + ) from exc + if framework == "trtllm": + raise RuntimeError( + "TensorRT-LLM live torch profiling requires " + "a server build that exposes POST /start_profile plus the env vars " + "TLLM_PROFILE_START_STOP=1 and TLLM_TORCH_PROFILE_TRACE=/shared/path."
+ ) from exc + raise + + +def stop_remote_profiler(url: str, framework: str) -> None: + try: + post_json(url.rstrip("/") + "/stop_profile", timeout=300.0) + except Exception as exc: + raise RuntimeError( + f"Failed to stop {framework_display_name(framework)} profiler via " + f"{url.rstrip('/')}/stop_profile" + ) from exc + + +def run_remote_profiler( + url: str, + output_dir: Optional[str], + framework: str, + probe_plan: ProbePlan, + probe_delay: float, + stage: Optional[str] = None, +) -> Path: + framework = canonicalize_framework(framework) + output_path = ensure_remote_profiler_output_path(output_dir, framework) + if stage and output_path.is_file(): + raise ValueError( + "--profile-workload both requires a directory output path for " + f"{framework_display_name(framework)} so each stage trace can be labeled." + ) + before_traces = ( + set(discover_trace_files(output_path, recursive=True)) + if output_path.exists() + else set() + ) + model = discover_openai_model(url) if framework in {"vllm", "trtllm"} else None + if probe_plan.warmup_requests > 0: + send_probe_requests( + url=url, + prompt=probe_plan.prompt, + max_new_tokens=probe_plan.warmup_max_new_tokens, + request_count=probe_plan.warmup_requests, + framework=framework, + model=model, + ) + + start_remote_profiler(url, framework) + stop_error: Optional[BaseException] = None + try: + if probe_plan.capture_requests > 0: + # `sglang.profiler` performs its own startup work before it reaches + # POST /start_profile. A very short delay can send probes too early + # and miss the profiling window entirely. 
+ time.sleep(max(5.0, probe_delay)) + send_probe_requests( + url=url, + prompt=probe_plan.prompt, + max_new_tokens=probe_plan.capture_max_new_tokens, + request_count=probe_plan.capture_requests, + framework=framework, + model=model, + sampling_seed_offset=probe_plan.warmup_requests, + ) + finally: + try: + stop_remote_profiler(url, framework) + except BaseException as exc: # pragma: no cover - preserve original failure + stop_error = exc + if stop_error is not None: + raise stop_error + artifact = wait_for_profiler_artifact(output_path) + if stage and output_path.is_dir(): + after_traces = set(discover_trace_files(output_path, recursive=True)) + new_traces = sorted(after_traces - before_traces, key=lambda item: item.name) + if new_traces: + stage_dir = output_path / stage + stage_dir.mkdir(parents=True, exist_ok=True) + for trace in new_traces: + if stage_dir in trace.parents: + continue + target = stage_dir / trace.name + if target.exists(): + target = stage_dir / f"{time.time_ns()}-{trace.name}" + shutil.move(str(trace), str(target)) + return stage_dir + return artifact + + +def run_sglang_profiler( + url: str, + output_dir: Optional[str], + num_steps: int, + profile_by_stage: bool, + merge_profiles: bool, + profile_prefix: Optional[str], + probe_plan: ProbePlan, + probe_delay: float, + start_step: Optional[int] = None, +) -> Path: + if output_dir is None: + output_dir = tempfile.mkdtemp(prefix="sglang-torch-profile-") + output_root = Path(output_dir).resolve() + output_root.mkdir(parents=True, exist_ok=True) + output_path = output_root / str(time.time()) + output_path.mkdir(parents=True, exist_ok=True) + + server_args = try_get_json(url.rstrip("/") + "/server_info", timeout=60.0) + if server_args is not None: + with open(output_path / "server_args.json", "w", encoding="utf-8") as handle: + json.dump(server_args, handle) + + payload = { + "output_dir": str(output_path), + "num_steps": str(num_steps), + "activities": ["CPU", "GPU"], + "profile_by_stage": 
profile_by_stage, + "merge_profiles": merge_profiles, + "profile_prefix": profile_prefix, + } + if start_step is not None: + payload["start_step"] = str(start_step) + + if probe_plan.warmup_requests > 0: + send_probe_requests( + url=url, + prompt=probe_plan.prompt, + max_new_tokens=probe_plan.warmup_max_new_tokens, + request_count=probe_plan.warmup_requests, + framework="sglang", + ) + + req = request.Request( + url.rstrip("/") + "/start_profile", + data=json.dumps(payload).encode("utf-8"), + headers={"Content-Type": "application/json"}, + ) + with request.urlopen(req, timeout=300.0): + pass + + if probe_plan.capture_requests > 0: + time.sleep(max(0.0, probe_delay)) + send_probe_requests( + url=url, + prompt=probe_plan.prompt, + max_new_tokens=probe_plan.capture_max_new_tokens, + request_count=probe_plan.capture_requests, + framework="sglang", + sampling_seed_offset=probe_plan.warmup_requests, + ) + try: + stop_remote_profiler(url, "sglang") + except RuntimeError: + pass + + return wait_for_profiler_artifact(output_path, timeout_s=180.0) + + +def run_profiler( + url: str, + output_dir: Optional[str], + num_steps: int, + profile_by_stage: bool, + merge_profiles: bool, + profile_prefix: Optional[str], + probe_requests: int, + probe_prompt: str, + probe_max_new_tokens: Optional[int], + probe_delay: float, + warmup_steps: int = DEFAULT_WARMUP_STEPS, + start_step: Optional[int] = None, + framework: str = "auto", + framework_hint_path: Optional[str] = None, + profile_workload: str = "both", + prefill_input_len: int = DEFAULT_PREFILL_INPUT_LEN, + prefill_output_len: int = DEFAULT_PREFILL_OUTPUT_LEN, + decode_input_len: int = DEFAULT_DECODE_INPUT_LEN, + decode_output_len: int = DEFAULT_DECODE_OUTPUT_LEN, +) -> Path: + resolved_framework = resolve_framework( + framework, + url=url, + input_path=( + Path(framework_hint_path).expanduser().resolve() + if framework_hint_path + else None + ), + ) + if resolved_framework == "sglang": + stages = 
expand_profile_workload(profile_workload) + if stages != ["legacy"]: + output_root = ( + Path(output_dir).expanduser().resolve() + if output_dir + else Path(tempfile.mkdtemp(prefix="sglang-torch-profile-")) + ) + output_root.mkdir(parents=True, exist_ok=True) + for stage in stages: + prompt, max_new_tokens = workload_probe( + stage, + prefill_input_len=prefill_input_len, + prefill_output_len=prefill_output_len, + decode_input_len=decode_input_len, + decode_output_len=decode_output_len, + ) + probe_plan = build_probe_plan( + stage, + prompt=prompt, + max_new_tokens=max_new_tokens, + num_steps=num_steps, + probe_requests=probe_requests, + warmup_steps=warmup_steps, + ) + # SGLang increments `forward_ct` before checking whether the + # profiler reached its target. Ask for one extra step so the + # requested stage forward is captured instead of stopping just + # before it runs. + stage_num_steps = max(1, int(num_steps)) + 1 + run_sglang_profiler( + url=url, + output_dir=str(output_root / stage), + num_steps=stage_num_steps, + profile_by_stage=False, + merge_profiles=merge_profiles, + profile_prefix=( + f"{profile_prefix}-{stage}" if profile_prefix else stage + ), + probe_plan=probe_plan, + probe_delay=probe_delay, + start_step=start_step, + ) + return output_root + legacy_max_new_tokens = probe_max_new_tokens or max(64, num_steps * 8) + legacy_plan = build_probe_plan( + "legacy", + prompt=probe_prompt, + max_new_tokens=legacy_max_new_tokens, + num_steps=num_steps, + probe_requests=probe_requests, + warmup_steps=warmup_steps, + ) + return run_sglang_profiler( + url=url, + output_dir=output_dir, + num_steps=num_steps, + profile_by_stage=profile_by_stage, + merge_profiles=merge_profiles, + profile_prefix=profile_prefix, + probe_plan=legacy_plan, + probe_delay=probe_delay, + start_step=start_step, + ) + if start_step is not None: + raise ValueError("--start-step is only supported for SGLang live capture.") + if profile_by_stage: + raise ValueError( + "--profile-by-stage is 
only supported for SGLang live capture. " + "Disable it when profiling vLLM or TensorRT-LLM." + ) + if merge_profiles: + raise ValueError( + "--merge-profiles is only supported for SGLang live capture. " + "Disable it when profiling vLLM or TensorRT-LLM." + ) + if profile_prefix: + print( + f"Note: {framework_display_name(resolved_framework)} ignores " + "--profile-prefix on the HTTP profiler control path.", + file=sys.stderr, + ) + stages = expand_profile_workload(profile_workload) + if stages == ["legacy"]: + legacy_max_new_tokens = probe_max_new_tokens or max(64, num_steps * 8) + return run_remote_profiler( + url=url, + output_dir=output_dir, + framework=resolved_framework, + probe_plan=build_probe_plan( + "legacy", + prompt=probe_prompt, + max_new_tokens=legacy_max_new_tokens, + num_steps=num_steps, + probe_requests=probe_requests, + warmup_steps=warmup_steps, + ), + probe_delay=probe_delay, + ) + output_root = ensure_remote_profiler_output_path(output_dir, resolved_framework) + for stage in stages: + prompt, max_new_tokens = workload_probe( + stage, + prefill_input_len=prefill_input_len, + prefill_output_len=prefill_output_len, + decode_input_len=decode_input_len, + decode_output_len=decode_output_len, + ) + run_remote_profiler( + url=url, + output_dir=str(output_root), + framework=resolved_framework, + probe_plan=build_probe_plan( + stage, + prompt=prompt, + max_new_tokens=max_new_tokens, + num_steps=num_steps, + probe_requests=probe_requests, + warmup_steps=warmup_steps, + ), + probe_delay=probe_delay, + stage=stage, + ) + return output_root + + +def select_heaviest_pid( + events: Sequence[dict], + event_filter: Callable[[dict], bool], + pid_substring: Optional[str] = None, + preferred_substrings: Iterable[str] = (), +) -> Optional[str]: + durations: Counter = Counter() + for event in events: + if not event_filter(event): + continue + pid = str(event.get("pid")) + if pid_substring and pid_substring not in pid: + continue + durations[pid] += 
float(event["dur"]) + if not durations: + return None + + for substring in preferred_substrings: + preferred = [pid for pid in durations if substring in pid] + if preferred: + return max(preferred, key=lambda pid: durations[pid]) + return max(durations, key=lambda pid: durations[pid]) diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/render_triage_markdown_bundle.py b/.claude/skills/llm-torch-profiler-analysis/scripts/render_triage_markdown_bundle.py new file mode 100644 index 000000000000..cd12429c876c --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/render_triage_markdown_bundle.py @@ -0,0 +1,259 @@ +"""Bundle one or more triage text reports into a single markdown document.""" + +from __future__ import annotations + +import argparse +from collections import defaultdict +from datetime import datetime, timezone +from pathlib import Path +from typing import Dict, List, Optional, Sequence, Tuple + +FRAMEWORK_LABELS = { + "sglang": "SGLang", + "vllm": "vLLM", + "trtllm": "TensorRT-LLM", +} + +FRAMEWORK_ORDER = {"sglang": 0, "vllm": 1, "trtllm": 2} + + +def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace: + parser = argparse.ArgumentParser( + description=( + "Render multiple profiler triage text outputs into one markdown file. " + "Input files are expected to be the existing analysis_*.txt outputs " + "already emitted by analyze_llm_torch_profile.py." + ) + ) + parser.add_argument( + "--analysis-root", + type=str, + default=None, + help=( + "Root directory to scan recursively for analysis_*.txt files. " + "Parent directory names are used as model section ids." + ), + ) + parser.add_argument( + "--analysis-file", + action="append", + default=[], + help=( + "Explicit analysis file entry. Use either PATH or LABEL=PATH. " + "When LABEL is omitted, the parent directory name is used." 
+ ), + ) + parser.add_argument( + "--title", + type=str, + default="Unified LLM Torch Profiler Triage Bundle", + help="Top-level markdown title.", + ) + parser.add_argument( + "--output", + type=str, + default=None, + help="Write the bundled markdown to this file. Prints to stdout when omitted.", + ) + parser.add_argument( + "--include-toc", + action=argparse.BooleanOptionalAction, + default=True, + help="Include a simple table of contents.", + ) + args = parser.parse_args(argv) + if not args.analysis_root and not args.analysis_file: + parser.error("Provide at least one of --analysis-root or --analysis-file.") + return args + + +def framework_key_from_path(path: Path) -> str: + lowered = path.name.lower() + if "sglang" in lowered: + return "sglang" + if "vllm" in lowered: + return "vllm" + if "trtllm" in lowered or "tensorrt" in lowered: + return "trtllm" + return "other" + + +def framework_label(framework_key: str) -> str: + return FRAMEWORK_LABELS.get(framework_key, framework_key) + + +def discover_analysis_files(root: Path) -> List[Tuple[str, Path]]: + entries: List[Tuple[str, Path]] = [] + for path in sorted(root.rglob("analysis*.txt")): + entries.append((path.parent.name, path)) + return entries + + +def parse_explicit_entry(raw: str) -> Tuple[str, Path]: + if "=" in raw: + label, path_text = raw.split("=", 1) + path = Path(path_text).expanduser().resolve() + return label.strip(), path + path = Path(raw).expanduser().resolve() + return path.parent.name, path + + +def slugify(text: str) -> str: + chars = [] + last_dash = False + for char in text.lower(): + if char.isalnum(): + chars.append(char) + last_dash = False + elif not last_dash: + chars.append("-") + last_dash = True + return "".join(chars).strip("-") + + +def extract_model_name(report_text: str) -> Optional[str]: + for line in report_text.splitlines(): + if line.startswith("Model: "): + return line.split("Model: ", 1)[1].strip() + return None + + +def choose_model_display_name( + current: 
Optional[str], + candidate: Optional[str], + *, + label: str, +) -> str: + if candidate and candidate != label: + if not current or current == label: + return candidate + if len(candidate) > len(current): + return candidate + return current + if current: + return current + return label + + +def normalize_report_text(report_text: str) -> str: + text = report_text.replace("\r\n", "\n").strip() + if not text: + return "_Empty analysis output._" + heading_map = { + "Triage View": "#### Triage View", + "Kernel Table": "#### Kernel Table", + "Overlap Opportunity Table": "#### Overlap Opportunity Table", + "Fuse Opportunity Table": "#### Fuse Opportunity Table", + } + normalized_lines = [] + for line in text.splitlines(): + normalized_lines.append(heading_map.get(line, line)) + return "\n".join(normalized_lines) + + +def build_bundle_markdown( + *, + title: str, + labeled_paths: Sequence[Tuple[str, Path]], + include_toc: bool, +) -> str: + grouped: Dict[str, List[Tuple[str, Path, str]]] = defaultdict(list) + model_display: Dict[str, str] = {} + + for label, path in labeled_paths: + raw_text = path.read_text(encoding="utf-8") + report_text = normalize_report_text(raw_text) + model_name = extract_model_name(report_text) + grouped[label].append((framework_key_from_path(path), path, report_text)) + model_display[label] = choose_model_display_name( + model_display.get(label), + model_name, + label=label, + ) + + ordered_labels = sorted( + grouped, + key=lambda item: (model_display[item].lower(), item.lower()), + ) + + lines: List[str] = [f"# {title}", ""] + lines.append( + f"_Generated on {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC')}_" + ) + lines.append("") + + if include_toc: + lines.append("## Contents") + lines.append("") + for label in ordered_labels: + lines.append( + f"- [{model_display[label]}](#{slugify(model_display[label])})" + ) + lines.append("") + + for label in ordered_labels: + display_name = model_display[label] + lines.append(f"## 
{display_name}") + lines.append("") + lines.append(f"Model id: `{label}`") + lines.append("") + + records = sorted( + grouped[label], + key=lambda item: ( + FRAMEWORK_ORDER.get(item[0], 99), + item[1].name.lower(), + ), + ) + + for framework_key, path, report_text in records: + lines.append(f"### {framework_label(framework_key)}") + lines.append("") + lines.append(f"Source: `{path}`") + lines.append("") + lines.append(report_text) + lines.append("") + + return "\n".join(lines).rstrip() + "\n" + + +def main(argv: Optional[Sequence[str]] = None) -> int: + args = parse_args(argv) + + labeled_paths: List[Tuple[str, Path]] = [] + if args.analysis_root: + labeled_paths.extend( + discover_analysis_files(Path(args.analysis_root).expanduser().resolve()) + ) + for raw_entry in args.analysis_file: + labeled_paths.append(parse_explicit_entry(raw_entry)) + + existing = [] + missing = [] + for label, path in labeled_paths: + if path.is_file(): + existing.append((label, path)) + else: + missing.append(str(path)) + if missing: + raise SystemExit("Missing analysis files:\n" + "\n".join(missing)) + if not existing: + raise SystemExit("No analysis files found.") + + markdown = build_bundle_markdown( + title=args.title, + labeled_paths=existing, + include_toc=args.include_toc, + ) + + if args.output: + output_path = Path(args.output).expanduser().resolve() + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(markdown, encoding="utf-8") + else: + print(markdown, end="") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/run_llm_single_model_matrix_host.sh b/.claude/skills/llm-torch-profiler-analysis/scripts/run_llm_single_model_matrix_host.sh new file mode 100755 index 000000000000..b80f3fe43147 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/run_llm_single_model_matrix_host.sh @@ -0,0 +1,274 @@ +#!/usr/bin/env bash +set -euo pipefail + +usage() { + cat 
<<'EOF' +Usage: + run_llm_single_model_matrix_host.sh \ + --model-id gpt_oss_20b \ + --model openai/gpt-oss-20b \ + --root /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix \ + --gpus 2,3,4,5 \ + --sglang-port 30098 \ + --vllm-formal-port 31098 \ + --vllm-mapping-port 31099 \ + --trt-formal-prefill-port 32098 \ + --trt-formal-decode-port 32099 \ + --trt-mapping-prefill-port 32198 \ + --trt-mapping-decode-port 32199 + +This script is intended to run on the H100 host. It: +1. captures SGLang live profiling and writes `analysis_sglang.txt` +2. captures vLLM formal + eager mapping traces and writes `analysis_vllm.txt` +3. captures TensorRT-LLM formal + graph-off mapping traces and writes `analysis_trtllm.txt` +4. stores one benchmark JSON per framework under the model run directory + +Default profiler workloads are stage-separated: + prefill: input 4090, output 1 + decode: input 1, output 2048 + +Environment: + Export `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` before running; whichever is set is copied to the other. 
+EOF +} + +MODEL_ID="" +MODEL="" +ROOT="" +GPUS="" +TP_SIZE="" +SGLANG_PORT="" +VLLM_FORMAL_PORT="" +VLLM_MAPPING_PORT="" +TRT_FORMAL_PREFILL_PORT="" +TRT_FORMAL_DECODE_PORT="" +TRT_MAPPING_PREFILL_PORT="" +TRT_MAPPING_DECODE_PORT="" +SGLANG_MEM_FRACTION="0.85" +MAX_MODEL_LEN="4096" +KV_FRACTION="0.85" +SGLANG_SERVER_EXTRA="" +PROFILE_WORKLOAD="both" +PREFILL_INPUT_LEN=4090 +PREFILL_OUTPUT_LEN=1 +DECODE_INPUT_LEN=1 +DECODE_OUTPUT_LEN=2048 +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +TRT_IMAGE="nvcr.io/nvidia/tensorrt-llm/release:latest" +TRT_OVERRIDE_ROOT="/data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm" +TRT_OVERRIDE_SOURCE="$TRT_OVERRIDE_ROOT/py_executor.original.py" +TRT_OVERRIDE_PATH="$TRT_OVERRIDE_ROOT/py_executor_with_stack.py" + +while [[ $# -gt 0 ]]; do + case "$1" in + --model-id) MODEL_ID="$2"; shift 2 ;; + --model) MODEL="$2"; shift 2 ;; + --root) ROOT="$2"; shift 2 ;; + --gpus) GPUS="$2"; shift 2 ;; + --tp-size) TP_SIZE="$2"; shift 2 ;; + --sglang-port) SGLANG_PORT="$2"; shift 2 ;; + --vllm-formal-port) VLLM_FORMAL_PORT="$2"; shift 2 ;; + --vllm-mapping-port) VLLM_MAPPING_PORT="$2"; shift 2 ;; + --trt-formal-prefill-port) TRT_FORMAL_PREFILL_PORT="$2"; shift 2 ;; + --trt-formal-decode-port) TRT_FORMAL_DECODE_PORT="$2"; shift 2 ;; + --trt-mapping-prefill-port) TRT_MAPPING_PREFILL_PORT="$2"; shift 2 ;; + --trt-mapping-decode-port) TRT_MAPPING_DECODE_PORT="$2"; shift 2 ;; + --sglang-mem-fraction) SGLANG_MEM_FRACTION="$2"; shift 2 ;; + --sglang-server-extra) SGLANG_SERVER_EXTRA="$2"; shift 2 ;; + --max-model-len) MAX_MODEL_LEN="$2"; shift 2 ;; + --kv-fraction) KV_FRACTION="$2"; shift 2 ;; + --profile-workload) PROFILE_WORKLOAD="$2"; shift 2 ;; + --prefill-input-len) PREFILL_INPUT_LEN="$2"; shift 2 ;; + --prefill-output-len) PREFILL_OUTPUT_LEN="$2"; shift 2 ;; + --decode-input-len) DECODE_INPUT_LEN="$2"; shift 2 ;; + --decode-output-len) DECODE_OUTPUT_LEN="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) + echo "Unknown 
argument: $1" >&2 + usage >&2 + exit 2 + ;; + esac +done + +if [[ -z "${HF_TOKEN:-}" && -z "${HUGGINGFACE_HUB_TOKEN:-}" ]]; then + echo "Set HF_TOKEN or HUGGINGFACE_HUB_TOKEN before running." >&2 + exit 2 +fi +if [[ -z "${HF_TOKEN:-}" ]]; then + HF_TOKEN="$HUGGINGFACE_HUB_TOKEN" +fi +if [[ -z "${HUGGINGFACE_HUB_TOKEN:-}" ]]; then + HUGGINGFACE_HUB_TOKEN="$HF_TOKEN" +fi + +for value in \ + MODEL_ID MODEL ROOT GPUS \ + SGLANG_PORT VLLM_FORMAL_PORT VLLM_MAPPING_PORT \ + TRT_FORMAL_PREFILL_PORT TRT_FORMAL_DECODE_PORT \ + TRT_MAPPING_PREFILL_PORT TRT_MAPPING_DECODE_PORT; do + if [[ -z "${!value}" ]]; then + echo "Missing required argument: $value" >&2 + usage >&2 + exit 2 + fi +done + +IFS=',' read -r -a GPU_LIST <<< "$GPUS" +GPU_COUNT="${#GPU_LIST[@]}" +if [[ "$GPU_COUNT" -lt 1 ]]; then + echo "Could not parse --gpus: $GPUS" >&2 + exit 2 +fi +if [[ -z "$TP_SIZE" ]]; then + TP_SIZE="$GPU_COUNT" +fi +if (( TP_SIZE < 1 || TP_SIZE > GPU_COUNT )); then + echo "--tp-size must be between 1 and the visible GPU count ($GPU_COUNT)." >&2 + exit 2 +fi + +MODEL_ROOT="$ROOT/$MODEL_ID" +SGLANG_ANALYSIS="$MODEL_ROOT/analysis_sglang.txt" +VLLM_FORMAL_DIR="$MODEL_ROOT/vllm_formal" +VLLM_MAPPING_DIR="$MODEL_ROOT/vllm_mapping" +VLLM_ANALYSIS="$MODEL_ROOT/analysis_vllm.txt" +TRT_FORMAL_DIR="$MODEL_ROOT/trtllm_formal" +TRT_MAPPING_DIR="$MODEL_ROOT/trtllm_mapping" +TRT_ANALYSIS="$MODEL_ROOT/analysis_trtllm.txt" + +docker exec sglang_bbuf bash -lc "mkdir -p '$MODEL_ROOT'" + +if [[ ! 
-s "$TRT_OVERRIDE_SOURCE" ]]; then + echo "[bootstrap] TensorRT-LLM py_executor source snapshot" + docker exec sglang_bbuf bash -lc "mkdir -p '$TRT_OVERRIDE_ROOT'" + docker run --rm --entrypoint cat "$TRT_IMAGE" \ + /usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py \ + | docker exec -i sglang_bbuf bash -lc "cat > '$TRT_OVERRIDE_SOURCE'" +fi +echo "[bootstrap] TensorRT-LLM py_executor override with with_stack=True and rank0-only trace export" +docker exec sglang_bbuf bash -lc "cd '$SCRIPT_DIR' && python3 make_trtllm_py_executor_override.py --source '$TRT_OVERRIDE_SOURCE' --output '$TRT_OVERRIDE_PATH'" + +sglang_args=( + --model "$MODEL" + --run-dir "$MODEL_ROOT" + --port "$SGLANG_PORT" + --gpus "$GPUS" + --tp-size "$TP_SIZE" + --mem-fraction "$SGLANG_MEM_FRACTION" + --profile-workload "$PROFILE_WORKLOAD" + --prefill-input-len "$PREFILL_INPUT_LEN" + --prefill-output-len "$PREFILL_OUTPUT_LEN" + --decode-input-len "$DECODE_INPUT_LEN" + --decode-output-len "$DECODE_OUTPUT_LEN" + --trust-remote-code +) +if [[ -n "$SGLANG_SERVER_EXTRA" ]]; then + sglang_args+=(--server-extra "$SGLANG_SERVER_EXTRA") +fi + +echo "[1/6] SGLang server + live triage" +HF_TOKEN="$HF_TOKEN" HUGGINGFACE_HUB_TOKEN="$HUGGINGFACE_HUB_TOKEN" \ + "$SCRIPT_DIR/run_sglang_torch_profile_host.sh" \ + "${sglang_args[@]}" + +echo "[2/6] vLLM formal" +HF_TOKEN="$HF_TOKEN" HUGGINGFACE_HUB_TOKEN="$HUGGINGFACE_HUB_TOKEN" \ + "$SCRIPT_DIR/run_vllm_torch_profile_host.sh" \ + --model "$MODEL" \ + --run-dir "$VLLM_FORMAL_DIR" \ + --port "$VLLM_FORMAL_PORT" \ + --gpus "$GPUS" \ + --tensor-parallel-size "$TP_SIZE" \ + --max-model-len "$MAX_MODEL_LEN" \ + --profile-workload "$PROFILE_WORKLOAD" \ + --prefill-input-len "$PREFILL_INPUT_LEN" \ + --prefill-output-len "$PREFILL_OUTPUT_LEN" \ + --decode-input-len "$DECODE_INPUT_LEN" \ + --decode-output-len "$DECODE_OUTPUT_LEN" \ + --trust-remote-code + +echo "[3/6] vLLM mapping" +HF_TOKEN="$HF_TOKEN" 
HUGGINGFACE_HUB_TOKEN="$HUGGINGFACE_HUB_TOKEN" \ + "$SCRIPT_DIR/run_vllm_torch_profile_host.sh" \ + --model "$MODEL" \ + --run-dir "$VLLM_MAPPING_DIR" \ + --port "$VLLM_MAPPING_PORT" \ + --gpus "$GPUS" \ + --tensor-parallel-size "$TP_SIZE" \ + --profiler-active-iterations 2 \ + --max-model-len "$MAX_MODEL_LEN" \ + --profile-workload "$PROFILE_WORKLOAD" \ + --prefill-input-len "$PREFILL_INPUT_LEN" \ + --prefill-output-len "$PREFILL_OUTPUT_LEN" \ + --decode-input-len "$DECODE_INPUT_LEN" \ + --decode-output-len "$DECODE_OUTPUT_LEN" \ + --trust-remote-code \ + --enforce-eager + +echo "[4/6] vLLM mapping-formal analysis" +docker exec sglang_bbuf bash -lc "cd '$SCRIPT_DIR' && python3 analyze_llm_torch_profile.py --framework vllm --mapping-input '$VLLM_MAPPING_DIR' --formal-input '$VLLM_FORMAL_DIR' > '$VLLM_ANALYSIS'" + +echo "[5/6] TensorRT-LLM formal + mapping captures" +HF_TOKEN="$HF_TOKEN" HUGGINGFACE_HUB_TOKEN="$HUGGINGFACE_HUB_TOKEN" \ + "$SCRIPT_DIR/run_trtllm_pytorch_profile_host.sh" \ + --model "$MODEL" \ + --run-dir "$TRT_FORMAL_DIR" \ + --stage prefill \ + --port "$TRT_FORMAL_PREFILL_PORT" \ + --gpus "$GPUS" \ + --tp-size "$TP_SIZE" \ + --kv-fraction "$KV_FRACTION" \ + --input-len "$PREFILL_INPUT_LEN" \ + --output-len "$PREFILL_OUTPUT_LEN" \ + --override-py-executor "$TRT_OVERRIDE_PATH" \ + --trust-remote-code +HF_TOKEN="$HF_TOKEN" HUGGINGFACE_HUB_TOKEN="$HUGGINGFACE_HUB_TOKEN" \ + "$SCRIPT_DIR/run_trtllm_pytorch_profile_host.sh" \ + --model "$MODEL" \ + --run-dir "$TRT_FORMAL_DIR" \ + --stage decode \ + --port "$TRT_FORMAL_DECODE_PORT" \ + --gpus "$GPUS" \ + --tp-size "$TP_SIZE" \ + --kv-fraction "$KV_FRACTION" \ + --input-len "$DECODE_INPUT_LEN" \ + --output-len "$DECODE_OUTPUT_LEN" \ + --override-py-executor "$TRT_OVERRIDE_PATH" \ + --trust-remote-code +HF_TOKEN="$HF_TOKEN" HUGGINGFACE_HUB_TOKEN="$HUGGINGFACE_HUB_TOKEN" \ + "$SCRIPT_DIR/run_trtllm_pytorch_profile_host.sh" \ + --model "$MODEL" \ + --run-dir "$TRT_MAPPING_DIR" \ + --stage prefill \ + --port 
"$TRT_MAPPING_PREFILL_PORT" \ + --gpus "$GPUS" \ + --tp-size "$TP_SIZE" \ + --kv-fraction "$KV_FRACTION" \ + --input-len "$PREFILL_INPUT_LEN" \ + --output-len "$PREFILL_OUTPUT_LEN" \ + --override-py-executor "$TRT_OVERRIDE_PATH" \ + --disable-cudagraph \ + --trust-remote-code +HF_TOKEN="$HF_TOKEN" HUGGINGFACE_HUB_TOKEN="$HUGGINGFACE_HUB_TOKEN" \ + "$SCRIPT_DIR/run_trtllm_pytorch_profile_host.sh" \ + --model "$MODEL" \ + --run-dir "$TRT_MAPPING_DIR" \ + --stage decode \ + --port "$TRT_MAPPING_DECODE_PORT" \ + --gpus "$GPUS" \ + --tp-size "$TP_SIZE" \ + --kv-fraction "$KV_FRACTION" \ + --input-len "$DECODE_INPUT_LEN" \ + --output-len "$DECODE_OUTPUT_LEN" \ + --override-py-executor "$TRT_OVERRIDE_PATH" \ + --disable-cudagraph \ + --trust-remote-code + +echo "[6/6] TensorRT-LLM mapping-formal analysis" +docker exec sglang_bbuf bash -lc "cd '$SCRIPT_DIR' && python3 analyze_llm_torch_profile.py --framework trtllm --mapping-input '$TRT_MAPPING_DIR' --formal-input '$TRT_FORMAL_DIR' > '$TRT_ANALYSIS'" + +echo "MODEL_ROOT=$MODEL_ROOT" +echo "ANALYSIS_SGLANG=$SGLANG_ANALYSIS" +echo "ANALYSIS_VLLM=$VLLM_ANALYSIS" +echo "ANALYSIS_TRTLLM=$TRT_ANALYSIS" diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/run_sglang_torch_profile_host.sh b/.claude/skills/llm-torch-profiler-analysis/scripts/run_sglang_torch_profile_host.sh new file mode 100755 index 000000000000..c22af0e57f2e --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/run_sglang_torch_profile_host.sh @@ -0,0 +1,241 @@ +#!/usr/bin/env bash +set -euo pipefail + +usage() { + cat <<'EOF' +Usage: + run_sglang_torch_profile_host.sh \ + --model Qwen/Qwen3-8B \ + --run-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example_sglang \ + --port 30088 \ + --gpus 0 + + run_sglang_torch_profile_host.sh \ + --model openai/gpt-oss-20b \ + --run-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example_sglang_4gpu \ + --port 30088 \ + --gpus 2,3,4,5 \ + --tp-size 4 + +Options: + --model TEXT 
Model id or local path for SGLang. + --run-dir PATH Shared /data directory for logs and traces. + --port INT Server port. + --gpus TEXT CUDA_VISIBLE_DEVICES value, for example 0 or 2,3,4,5. + --gpu TEXT Alias for --gpus. + --tp-size INT Tensor parallel size. Defaults to the visible GPU count. + --trust-remote-code Pass --trust-remote-code. + --mem-fraction FLOAT SGLang static memory fraction. + --request-max-tokens INT Generation length for the probe request. + --prompt TEXT Probe prompt. + --warmup-steps INT Warmup steps before profiling. Defaults to 10. + --profile-workload TEXT legacy|prefill|decode|both. Defaults to both. + --prefill-input-len INT Synthetic prefill prompt length. Defaults to 4090. + --prefill-output-len INT Synthetic prefill output length. Defaults to 1. + --decode-input-len INT Synthetic decode prompt length. Defaults to 1. + --decode-output-len INT Synthetic decode output length. Defaults to 2048. + --repo-dir PATH SGLang repo path inside `sglang_bbuf`. + --server-extra TEXT Extra args appended to launch_server. + --help Show this message. + +Notes: + - Run this on the H100 host. It uses `docker exec sglang_bbuf`. + - The server is launched first, then the profiler capture runs with + stage-separated prefill/decode workloads and `--profile-by-stage`. + - A small benchmark summary is written after profiling. +EOF +} + +MODEL="" +RUN_DIR="" +PORT="" +GPUS="" +TP_SIZE="" +TRUST_REMOTE_CODE=0 +MEM_FRACTION=0.85 +REQUEST_MAX_TOKENS=12 +PROMPT="Explain the difference between CUDA graph mode and eager mode in two sentences." 
+WARMUP_STEPS=10 +PROFILE_WORKLOAD="both" +PREFILL_INPUT_LEN=4090 +PREFILL_OUTPUT_LEN=1 +DECODE_INPUT_LEN=1 +DECODE_OUTPUT_LEN=2048 +SGLANG_REPO_DIR="${SGLANG_REPO_DIR:-/data/bbuf/repos/sglang}" +SERVER_EXTRA="" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +while [[ $# -gt 0 ]]; do + case "$1" in + --model) + MODEL="$2" + shift 2 + ;; + --run-dir) + RUN_DIR="$2" + shift 2 + ;; + --port) + PORT="$2" + shift 2 + ;; + --gpu) + GPUS="$2" + shift 2 + ;; + --gpus) + GPUS="$2" + shift 2 + ;; + --tp-size) + TP_SIZE="$2" + shift 2 + ;; + --trust-remote-code) + TRUST_REMOTE_CODE=1 + shift + ;; + --mem-fraction) + MEM_FRACTION="$2" + shift 2 + ;; + --request-max-tokens) + REQUEST_MAX_TOKENS="$2" + shift 2 + ;; + --prompt) + PROMPT="$2" + shift 2 + ;; + --warmup-steps) + WARMUP_STEPS="$2" + shift 2 + ;; + --profile-workload) + PROFILE_WORKLOAD="$2" + shift 2 + ;; + --prefill-input-len) + PREFILL_INPUT_LEN="$2" + shift 2 + ;; + --prefill-output-len) + PREFILL_OUTPUT_LEN="$2" + shift 2 + ;; + --decode-input-len) + DECODE_INPUT_LEN="$2" + shift 2 + ;; + --decode-output-len) + DECODE_OUTPUT_LEN="$2" + shift 2 + ;; + --repo-dir) + SGLANG_REPO_DIR="$2" + shift 2 + ;; + --server-extra) + SERVER_EXTRA="$2" + shift 2 + ;; + --help|-h) + usage + exit 0 + ;; + *) + echo "Unknown argument: $1" >&2 + usage >&2 + exit 2 + ;; + esac +done + +if [[ -z "$MODEL" || -z "$RUN_DIR" || -z "$PORT" || -z "$GPUS" ]]; then + usage >&2 + exit 2 +fi + +IFS=',' read -r -a GPU_LIST <<< "$GPUS" +GPU_COUNT="${#GPU_LIST[@]}" +if [[ "$GPU_COUNT" -lt 1 ]]; then + echo "Could not parse --gpus: $GPUS" >&2 + exit 2 +fi +if [[ -z "$TP_SIZE" ]]; then + TP_SIZE="$GPU_COUNT" +fi +if (( TP_SIZE < 1 || TP_SIZE > GPU_COUNT )); then + echo "--tp-size must be between 1 and the visible GPU count ($GPU_COUNT)." 
>&2 + exit 2 +fi + +LOG_PATH="$RUN_DIR/sglang_server.log" +ANALYSIS_PATH="$RUN_DIR/analysis_sglang.txt" +PROFILE_ROOT="$RUN_DIR/sglang_profile_live" +BENCHMARK_PATH="$RUN_DIR/benchmark_sglang.json" +PID_PATH="$RUN_DIR/sglang_server.pid" +LAUNCH_PATTERN="[s]glang.launch_server.*--port $PORT" +SERVER_ARGS="python3 -m sglang.launch_server --model-path \"$MODEL\" --port \"$PORT\" --tp-size \"$TP_SIZE\" --mem-fraction-static \"$MEM_FRACTION\"" + +if [[ "$TRUST_REMOTE_CODE" -eq 1 ]]; then + SERVER_ARGS="$SERVER_ARGS --trust-remote-code" +fi +if [[ -n "$SERVER_EXTRA" ]]; then + SERVER_ARGS="$SERVER_ARGS $SERVER_EXTRA" +fi + +docker exec sglang_bbuf bash -lc "mkdir -p '$RUN_DIR' '$PROFILE_ROOT'" +docker exec sglang_bbuf bash -lc "pkill -f '$LAUNCH_PATTERN' >/dev/null 2>&1 || true" +docker exec sglang_bbuf bash -lc "mkdir -p '$RUN_DIR' '$PROFILE_ROOT' && cd '$SGLANG_REPO_DIR' && rm -f '$PID_PATH' && (CUDA_VISIBLE_DEVICES=$GPUS PYTHONPATH=python nohup $SERVER_ARGS > '$LOG_PATH' 2>&1 < /dev/null & echo \$! > '$PID_PATH')" + +cleanup() { + docker exec sglang_bbuf bash -lc "pkill -f '$LAUNCH_PATTERN' >/dev/null 2>&1 || true" >/dev/null 2>&1 || true +} +trap cleanup EXIT + +ready=0 +for _ in $(seq 1 180); do + if curl -sf "http://127.0.0.1:${PORT}/v1/models" >/dev/null; then + ready=1 + break + fi + sleep 2 +done +if [[ "$ready" -ne 1 ]]; then + echo "SGLang server did not become ready on port ${PORT}. 
Recent logs:" >&2 + ssh_log=$(docker exec sglang_bbuf bash -lc "tail -n 120 '$LOG_PATH'" 2>/dev/null || true) + printf '%s\n' "$ssh_log" >&2 + exit 1 +fi + +python3 - < '$ANALYSIS_PATH'" +python3 "$SCRIPT_DIR/probe_llm_server.py" \ + --framework sglang \ + --url "http://127.0.0.1:${PORT}" \ + | docker exec -i sglang_bbuf bash -lc "cat > '$BENCHMARK_PATH'" >/dev/null +docker exec sglang_bbuf bash -lc "sed -n '1,240p' '$ANALYSIS_PATH'" +echo "BENCHMARK_PATH=$BENCHMARK_PATH" diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/run_trtllm_pytorch_profile_host.sh b/.claude/skills/llm-torch-profiler-analysis/scripts/run_trtllm_pytorch_profile_host.sh new file mode 100755 index 000000000000..1e3b51e1dd77 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/run_trtllm_pytorch_profile_host.sh @@ -0,0 +1,407 @@ +#!/usr/bin/env bash +set -euo pipefail + +usage() { + cat <<'EOF' +Usage: + run_trtllm_pytorch_profile_host.sh \ + --model Qwen/Qwen3-8B \ + --run-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example \ + --stage prefill \ + --port 32188 \ + --gpus 0 + + run_trtllm_pytorch_profile_host.sh \ + --model openai/gpt-oss-20b \ + --run-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example_4gpu \ + --stage prefill \ + --port 32188 \ + --gpus 2,3,4,5 \ + --tp-size 4 + +Options: + --model TEXT Hugging Face model id. + --run-dir PATH Shared /data run directory for logs and traces. + --stage prefill|decode Capture window. Prefill profiles 4090->1 by + default; decode profiles 1->2048 by default. + --port INT Host port for trtllm-serve. + --gpus TEXT CUDA_VISIBLE_DEVICES value, for example 0 or 2,3,4,5. + --gpu TEXT Alias for --gpus. + --tp-size INT Tensor parallel size. Defaults to the visible GPU count. + --image TEXT Container image. + --shared-root PATH Shared validation root mounted into the container. + --hf-cache PATH Host Hugging Face cache path. + --override-py-executor PATH Optional py_executor.py override path. 
+ --disable-cudagraph Generate/use a YAML override with cuda_graph_config: null. + --input-len INT Synthetic prompt length for this stage. + Defaults: prefill 4090, decode 1. + --request-max-tokens INT Generation length for this stage. + Defaults: prefill 1, decode 2048. + --output-len INT Alias for --request-max-tokens. + --prompt TEXT Probe prompt. Defaults to a synthetic prompt + sized by --input-len. + --warmup-steps INT Warmup steps before the profiler window. Defaults to 10. + --active-steps INT Active profiler steps to capture. Defaults to 5. + --max-seq-len INT Serve max sequence length. + --kv-fraction FLOAT KV cache free GPU memory fraction. + --container-name TEXT Override container name. + --trust-remote-code Pass --trust_remote_code to trtllm-serve. + --help Show this message. + +Environment: + HF_TOKEN or HUGGINGFACE_HUB_TOKEN must be set. + +Notes: + - Run this on the H100 host, not inside `sglang_bbuf`. + - It always pins TensorRT-LLM to `--backend pytorch`. + - The default image tag is floating; record the resolved TensorRT-LLM version + in the run manifest and pass --image for reproducible validation. + - Profiling uses `TLLM_PROFILE_START_STOP` and `TLLM_TORCH_PROFILE_TRACE`. + - For Python-location recovery, prefer a `py_executor.py` override with `with_stack=True`. + - A small benchmark summary is written after the trace is emitted. 
+EOF +} + +IMAGE="nvcr.io/nvidia/tensorrt-llm/release:latest" +SHARED_ROOT="/data/bbuf/validate/unified_llm_profiler_skill" +HF_CACHE="/data/.cache/huggingface" +OVERRIDE_PY_EXECUTOR="" +DISABLE_CUDAGRAPH=0 +REQUEST_MAX_TOKENS="" +INPUT_LEN="" +PROMPT="" +WARMUP_STEPS=10 +ACTIVE_STEPS=5 +MAX_SEQ_LEN=4096 +KV_FRACTION=0.85 +CONTAINER_NAME="" +TRUST_REMOTE_CODE=0 +TP_SIZE="" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +MODEL="" +RUN_DIR="" +STAGE="" +PORT="" +GPUS="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --model) + MODEL="$2" + shift 2 + ;; + --run-dir) + RUN_DIR="$2" + shift 2 + ;; + --stage) + STAGE="$2" + shift 2 + ;; + --port) + PORT="$2" + shift 2 + ;; + --gpu) + GPUS="$2" + shift 2 + ;; + --gpus) + GPUS="$2" + shift 2 + ;; + --tp-size) + TP_SIZE="$2" + shift 2 + ;; + --image) + IMAGE="$2" + shift 2 + ;; + --shared-root) + SHARED_ROOT="$2" + shift 2 + ;; + --hf-cache) + HF_CACHE="$2" + shift 2 + ;; + --override-py-executor) + OVERRIDE_PY_EXECUTOR="$2" + shift 2 + ;; + --disable-cudagraph) + DISABLE_CUDAGRAPH=1 + shift + ;; + --input-len) + INPUT_LEN="$2" + shift 2 + ;; + --request-max-tokens) + REQUEST_MAX_TOKENS="$2" + shift 2 + ;; + --output-len) + REQUEST_MAX_TOKENS="$2" + shift 2 + ;; + --prompt) + PROMPT="$2" + shift 2 + ;; + --warmup-steps) + WARMUP_STEPS="$2" + shift 2 + ;; + --active-steps) + ACTIVE_STEPS="$2" + shift 2 + ;; + --max-seq-len) + MAX_SEQ_LEN="$2" + shift 2 + ;; + --kv-fraction) + KV_FRACTION="$2" + shift 2 + ;; + --container-name) + CONTAINER_NAME="$2" + shift 2 + ;; + --trust-remote-code) + TRUST_REMOTE_CODE=1 + shift + ;; + --help|-h) + usage + exit 0 + ;; + *) + echo "Unknown argument: $1" >&2 + usage >&2 + exit 2 + ;; + esac +done + +if [[ -z "${HF_TOKEN:-}" && -z "${HUGGINGFACE_HUB_TOKEN:-}" ]]; then + echo "Set HF_TOKEN or HUGGINGFACE_HUB_TOKEN before running." 
>&2 + exit 2 +fi +if [[ -z "${HF_TOKEN:-}" ]]; then + HF_TOKEN="$HUGGINGFACE_HUB_TOKEN" +fi +if [[ -z "${HUGGINGFACE_HUB_TOKEN:-}" ]]; then + HUGGINGFACE_HUB_TOKEN="$HF_TOKEN" +fi + +if [[ -z "$MODEL" || -z "$RUN_DIR" || -z "$STAGE" || -z "$PORT" || -z "$GPUS" ]]; then + usage >&2 + exit 2 +fi + +IFS=',' read -r -a GPU_LIST <<< "$GPUS" +GPU_COUNT="${#GPU_LIST[@]}" +if [[ "$GPU_COUNT" -lt 1 ]]; then + echo "Could not parse --gpus: $GPUS" >&2 + exit 2 +fi +if [[ -z "$TP_SIZE" ]]; then + TP_SIZE="$GPU_COUNT" +fi +if (( TP_SIZE < 1 || TP_SIZE > GPU_COUNT )); then + echo "--tp-size must be between 1 and the visible GPU count ($GPU_COUNT)." >&2 + exit 2 +fi + +case "$STAGE" in + prefill) + TRACE_PATH="$RUN_DIR/trace-prefill.json" + LOG_PATH="$RUN_DIR/server-prefill.log" + BENCHMARK_PATH="$RUN_DIR/benchmark-prefill.json" + if [[ -z "$INPUT_LEN" ]]; then + INPUT_LEN=4090 + fi + if [[ -z "$REQUEST_MAX_TOKENS" ]]; then + REQUEST_MAX_TOKENS=1 + fi + ;; + decode) + TRACE_PATH="$RUN_DIR/trace-decode.json" + LOG_PATH="$RUN_DIR/server-decode.log" + BENCHMARK_PATH="$RUN_DIR/benchmark-decode.json" + if [[ -z "$INPUT_LEN" ]]; then + INPUT_LEN=1 + fi + if [[ -z "$REQUEST_MAX_TOKENS" ]]; then + REQUEST_MAX_TOKENS=2048 + fi + ;; + *) + echo "--stage must be prefill or decode." >&2 + exit 2 + ;; +esac + +if (( WARMUP_STEPS < 0 || ACTIVE_STEPS < 1 )); then + echo "--warmup-steps must be >= 0 and --active-steps must be >= 1." 
>&2 + exit 2 +fi + +case "$STAGE" in + prefill) + profile_start=$((WARMUP_STEPS + 1)) + ;; + decode) + profile_start=$((WARMUP_STEPS + 2)) + ;; +esac +profile_stop=$((profile_start + ACTIVE_STEPS - 1)) +PROFILE_START_STOP="${profile_start}-${profile_stop}" + +if [[ -z "$CONTAINER_NAME" ]]; then + model_slug="${MODEL##*/}" + model_slug="${model_slug//\//-}" + model_slug="${model_slug//./-}" + model_slug="${model_slug//_/-}" + model_slug="${model_slug// /-}" + gpu_slug="${GPUS//,/-}" + CONTAINER_NAME="trtllm-${model_slug}-${STAGE}-g${gpu_slug}-p${PORT}" +fi + +EXTRA_LLM_OPTIONS="" +if [[ "$DISABLE_CUDAGRAPH" -eq 1 ]]; then + EXTRA_CFG_PATH="$SHARED_ROOT/tmp/trt_no_cudagraph.yaml" + docker exec sglang_bbuf bash -lc "mkdir -p '$(dirname "$EXTRA_CFG_PATH")' && printf 'cuda_graph_config: null\n' > '$EXTRA_CFG_PATH'" + EXTRA_LLM_OPTIONS="--extra_llm_api_options $EXTRA_CFG_PATH" +fi + +docker exec sglang_bbuf bash -lc "mkdir -p '$RUN_DIR'" +docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true + +docker_args=( + run -d --rm + --name "$CONTAINER_NAME" + --gpus all + --ipc=host + --network host + --entrypoint bash + -e "CUDA_VISIBLE_DEVICES=$GPUS" + -e "HF_TOKEN=$HF_TOKEN" + -e "HUGGINGFACE_HUB_TOKEN=$HUGGINGFACE_HUB_TOKEN" + -e "TLLM_PROFILE_START_STOP=$PROFILE_START_STOP" + -e "TLLM_LLMAPI_ENABLE_NVTX=1" + -e "TLLM_TORCH_PROFILE_TRACE=$TRACE_PATH" + -e "RUN_DIR=$RUN_DIR" + -e "LOG_PATH=$LOG_PATH" + -e "MODEL_ID=$MODEL" + -e "SERVE_PORT=$PORT" + -v "$HF_CACHE:/root/.cache/huggingface" + -v "$SHARED_ROOT:$SHARED_ROOT" +) + +if [[ -n "$OVERRIDE_PY_EXECUTOR" ]]; then + docker_args+=( + -v "$OVERRIDE_PY_EXECUTOR:/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py:ro" + ) +fi + +trust_remote_code_arg="" +if [[ "$TRUST_REMOTE_CODE" -eq 1 ]]; then + trust_remote_code_arg="--trust_remote_code" +fi + +container_cmd=$( + cat < "$LOG_PATH" 2>&1 +EOF +) + +docker_args+=("$IMAGE" -lc "$container_cmd") +docker "${docker_args[@]}" >/dev/null + +cleanup() 
{ + docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true +} +trap cleanup EXIT + +ready=0 +for _ in $(seq 1 180); do + if curl -sf "http://127.0.0.1:${PORT}/v1/models" >/dev/null; then + ready=1 + break + fi + sleep 2 +done +if [[ "$ready" -ne 1 ]]; then + echo "Server did not become ready on port ${PORT}. Recent logs:" >&2 + docker logs "$CONTAINER_NAME" 2>&1 | tail -n 120 >&2 || true + exit 1 +fi + +python3 - <&2 + exit 1 +fi + +python3 "$SCRIPT_DIR/probe_llm_server.py" \ + --framework trtllm \ + --url "http://127.0.0.1:${PORT}" \ + --model "$MODEL" \ + | docker exec -i sglang_bbuf bash -lc "cat > '$BENCHMARK_PATH'" >/dev/null + +echo "TRACE_PATH=$TRACE_PATH" +echo "LOG_PATH=$LOG_PATH" +echo "BENCHMARK_PATH=$BENCHMARK_PATH" diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/run_vllm_torch_profile_host.sh b/.claude/skills/llm-torch-profiler-analysis/scripts/run_vllm_torch_profile_host.sh new file mode 100755 index 000000000000..c21d014ddd47 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/run_vllm_torch_profile_host.sh @@ -0,0 +1,343 @@ +#!/usr/bin/env bash +set -euo pipefail + +usage() { + cat <<'EOF' +Usage: + run_vllm_torch_profile_host.sh \ + --model Qwen/Qwen3-8B \ + --run-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example_vllm_formal \ + --port 31088 \ + --gpus 1 + + run_vllm_torch_profile_host.sh \ + --model openai/gpt-oss-20b \ + --run-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example_vllm_4gpu \ + --port 31088 \ + --gpus 2,3,4,5 \ + --tensor-parallel-size 4 + +Options: + --model TEXT Hugging Face model id. + --run-dir PATH Shared /data directory for logs and traces. + --port INT Host port for vllm serve. + --gpus TEXT CUDA_VISIBLE_DEVICES value, for example 1 or 2,3,4,5. + --gpu TEXT Alias for --gpus. + --image TEXT Container image. + --hf-cache PATH Host Hugging Face cache path. + --gpu-memory-util FLOAT vLLM --gpu-memory-utilization. + --max-model-len INT vLLM --max-model-len. 
+ --tensor-parallel-size INT vLLM --tensor-parallel-size. Defaults to the visible GPU count. + --profiler-active-iterations INT + Torch-profiler active iterations. + --enforce-eager Launch vLLM with --enforce-eager for mapping traces. + --trust-remote-code Pass --trust-remote-code. + --request-max-tokens INT Generation length for the probe request. + --prompt TEXT Probe prompt. + --warmup-steps INT Warmup steps before profiling. Defaults to 10. + --profile-workload TEXT legacy|prefill|decode|both. Defaults to both. + --prefill-input-len INT Synthetic prefill prompt length. Defaults to 4090. + --prefill-output-len INT Synthetic prefill output length. Defaults to 1. + --decode-input-len INT Synthetic decode prompt length. Defaults to 1. + --decode-output-len INT Synthetic decode output length. Defaults to 2048. + --container-name TEXT Override container name. + --help Show this message. + +Environment: + HF_TOKEN or HUGGINGFACE_HUB_TOKEN must be set. + +Notes: + - Run this on the H100 host, not inside `sglang_bbuf`. + - This uses the vLLM torch-profiler flow: `--profiler-config`, then POST + `/start_profile` and `/stop_profile`. + - Default capture is two labeled profiles: prefill 4090->1 and decode 1->2048. + - Current vLLM profiler config already defaults `torch_profiler_with_stack=true`. + - A small benchmark summary is written after profiling. +EOF +} + +IMAGE="vllm/vllm-openai:latest" +HF_CACHE="/data/.cache/huggingface" +GPU_MEMORY_UTIL=0.90 +MAX_MODEL_LEN=4096 +TP_SIZE="" +ENFORCE_EAGER=0 +TRUST_REMOTE_CODE=0 +REQUEST_MAX_TOKENS=12 +PROFILER_ACTIVE_ITERATIONS=5 +PROMPT="Explain the difference between CUDA graph mode and eager mode in two sentences." 
+WARMUP_STEPS=10 +PROFILE_WORKLOAD="both" +PREFILL_INPUT_LEN=4090 +PREFILL_OUTPUT_LEN=1 +DECODE_INPUT_LEN=1 +DECODE_OUTPUT_LEN=2048 +CONTAINER_NAME="" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +MODEL="" +RUN_DIR="" +PORT="" +GPUS="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --model) + MODEL="$2" + shift 2 + ;; + --run-dir) + RUN_DIR="$2" + shift 2 + ;; + --port) + PORT="$2" + shift 2 + ;; + --gpu) + GPUS="$2" + shift 2 + ;; + --gpus) + GPUS="$2" + shift 2 + ;; + --image) + IMAGE="$2" + shift 2 + ;; + --hf-cache) + HF_CACHE="$2" + shift 2 + ;; + --gpu-memory-util) + GPU_MEMORY_UTIL="$2" + shift 2 + ;; + --max-model-len) + MAX_MODEL_LEN="$2" + shift 2 + ;; + --tensor-parallel-size) + TP_SIZE="$2" + shift 2 + ;; + --profiler-active-iterations) + PROFILER_ACTIVE_ITERATIONS="$2" + shift 2 + ;; + --enforce-eager) + ENFORCE_EAGER=1 + shift + ;; + --trust-remote-code) + TRUST_REMOTE_CODE=1 + shift + ;; + --request-max-tokens) + REQUEST_MAX_TOKENS="$2" + shift 2 + ;; + --prompt) + PROMPT="$2" + shift 2 + ;; + --warmup-steps) + WARMUP_STEPS="$2" + shift 2 + ;; + --profile-workload) + PROFILE_WORKLOAD="$2" + shift 2 + ;; + --prefill-input-len) + PREFILL_INPUT_LEN="$2" + shift 2 + ;; + --prefill-output-len) + PREFILL_OUTPUT_LEN="$2" + shift 2 + ;; + --decode-input-len) + DECODE_INPUT_LEN="$2" + shift 2 + ;; + --decode-output-len) + DECODE_OUTPUT_LEN="$2" + shift 2 + ;; + --container-name) + CONTAINER_NAME="$2" + shift 2 + ;; + --help|-h) + usage + exit 0 + ;; + *) + echo "Unknown argument: $1" >&2 + usage >&2 + exit 2 + ;; + esac +done + +if [[ -z "${HF_TOKEN:-}" && -z "${HUGGINGFACE_HUB_TOKEN:-}" ]]; then + echo "Set HF_TOKEN or HUGGINGFACE_HUB_TOKEN before running." 
>&2 + exit 2 +fi +if [[ -z "${HF_TOKEN:-}" ]]; then + HF_TOKEN="$HUGGINGFACE_HUB_TOKEN" +fi +if [[ -z "${HUGGINGFACE_HUB_TOKEN:-}" ]]; then + HUGGINGFACE_HUB_TOKEN="$HF_TOKEN" +fi + +if [[ -z "$MODEL" || -z "$RUN_DIR" || -z "$PORT" || -z "$GPUS" ]]; then + usage >&2 + exit 2 +fi + +IFS=',' read -r -a GPU_LIST <<< "$GPUS" +GPU_COUNT="${#GPU_LIST[@]}" +if [[ "$GPU_COUNT" -lt 1 ]]; then + echo "Could not parse --gpus: $GPUS" >&2 + exit 2 +fi +if [[ -z "$TP_SIZE" ]]; then + TP_SIZE="$GPU_COUNT" +fi +if (( TP_SIZE < 1 || TP_SIZE > GPU_COUNT )); then + echo "--tensor-parallel-size must be between 1 and the visible GPU count ($GPU_COUNT)." >&2 + exit 2 +fi +if (( PROFILER_ACTIVE_ITERATIONS < 1 )); then + echo "--profiler-active-iterations must be >= 1." >&2 + exit 2 +fi + +PROFILE_DIR="$RUN_DIR/vllm_profile" +LOG_PATH="$RUN_DIR/server.log" +ANALYSIS_PATH="$RUN_DIR/analysis_vllm_live.txt" +BENCHMARK_PATH="$RUN_DIR/benchmark_vllm.json" + +if [[ -z "$CONTAINER_NAME" ]]; then + model_slug="${MODEL##*/}" + model_slug="${model_slug//\//-}" + model_slug="${model_slug//./-}" + model_slug="${model_slug//_/-}" + gpu_slug="${GPUS//,/-}" + CONTAINER_NAME="vllm-${model_slug}-g${gpu_slug}-p${PORT}" + if [[ "$ENFORCE_EAGER" -eq 1 ]]; then + CONTAINER_NAME="${CONTAINER_NAME}-eager" + fi +fi + +docker exec sglang_bbuf bash -lc "mkdir -p '$PROFILE_DIR'" +docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true + +profiler_config=$(python3 - </dev/null +cleanup() { + docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true +} +trap cleanup EXIT + +ready=0 +for _ in $(seq 1 180); do + if curl -sf "http://127.0.0.1:${PORT}/v1/models" >/dev/null; then + ready=1 + break + fi + sleep 2 +done +if [[ "$ready" -ne 1 ]]; then + echo "Server did not become ready on port ${PORT}. 
Recent logs:" >&2 + docker logs "$CONTAINER_NAME" 2>&1 | tail -n 120 >&2 || true + exit 1 +fi + +python3 "$SCRIPT_DIR/analyze_llm_torch_profile.py" \ + --framework vllm \ + --url "http://127.0.0.1:${PORT}" \ + --output-dir "$PROFILE_DIR" \ + --num-steps "$PROFILER_ACTIVE_ITERATIONS" \ + --warmup-steps "$WARMUP_STEPS" \ + --probe-requests 1 \ + --no-profile-by-stage \ + --profile-workload "$PROFILE_WORKLOAD" \ + --probe-prompt "$PROMPT" \ + --probe-max-new-tokens "$REQUEST_MAX_TOKENS" \ + --prefill-input-len "$PREFILL_INPUT_LEN" \ + --prefill-output-len "$PREFILL_OUTPUT_LEN" \ + --decode-input-len "$DECODE_INPUT_LEN" \ + --decode-output-len "$DECODE_OUTPUT_LEN" \ + > "$ANALYSIS_PATH" + +profile_found=0 +for _ in $(seq 1 240); do + if find "$PROFILE_DIR" -type f \( -name '*.pt.trace.json' -o -name '*.pt.trace.json.gz' -o -name '*.trace.json' -o -name '*.trace.json.gz' \) | grep -q .; then + profile_found=1 + break + fi + sleep 2 +done +if [[ "$profile_found" -ne 1 ]]; then + echo "No vLLM profiler traces appeared under $PROFILE_DIR" >&2 + docker logs "$CONTAINER_NAME" 2>&1 | tail -n 120 >&2 || true + exit 1 +fi + +python3 "$SCRIPT_DIR/probe_llm_server.py" \ + --framework vllm \ + --url "http://127.0.0.1:${PORT}" \ + --model "$MODEL" \ + | docker exec -i sglang_bbuf bash -lc "cat > '$BENCHMARK_PATH'" >/dev/null + +docker logs "$CONTAINER_NAME" 2>&1 | docker exec -i sglang_bbuf bash -lc "cat > '$LOG_PATH'" || true +sed -n '1,240p' "$ANALYSIS_PATH" +echo "PROFILE_DIR=$PROFILE_DIR" +echo "LOG_PATH=$LOG_PATH" +echo "ANALYSIS_PATH=$ANALYSIS_PATH" +echo "BENCHMARK_PATH=$BENCHMARK_PATH" diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/triage_kernel_helpers.py b/.claude/skills/llm-torch-profiler-analysis/scripts/triage_kernel_helpers.py new file mode 100644 index 000000000000..ca1d47d18b89 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/triage_kernel_helpers.py @@ -0,0 +1,2648 @@ +"""Internal kernel attribution helpers for triage-only 
torch-profiler analysis.""" + +from __future__ import annotations + +import json +import re +from bisect import bisect_right +from collections import Counter, defaultdict +from dataclasses import dataclass, field +from functools import lru_cache +from pathlib import Path +from typing import DefaultDict, Dict, Iterable, List, Optional, Sequence, Tuple + +from profile_common import ( + coerce_optional_int, + contains_any_keyword, + extract_trace_events, + has_stream_marker, + is_annotation_event, + is_complete_duration_event, + is_non_kernel_trace_category, + is_trace_metadata_name, + looks_like_python_scope_name, + normalize_repo_relative_path, + normalize_text, + select_heaviest_pid, +) + +CATEGORY_PATTERNS: List[Tuple[str, Tuple[str, ...]]] = [ + ( + "hybrid_linear", + ( + "gdn", + "gated_delta", + "mamba", + "selective_scan", + "ssd", + "causal_conv", + "ssm", + ), + ), + ( + "attention", + ( + "flash_attn", + "flashattention", + "flash_attention", + "fmha", + "attention", + "mla", + "paged_attention", + "decode_attention", + ), + ), + ( + "moe", + ( + "fused_moe", + "grouped_mm", + "groupgemm", + "group_gemm", + "moe", + "expert", + "groupproblemshape", + ), + ), + ( + "gemm", + ( + "gemm", + "gemv", + "matmul", + "cublas", + "cutlass", + "wgmma", + "mma", + "bmm", + "nvjet", + ), + ), + ( + "norm", + ( + "rmsnorm", + "layernorm", + "_norm_", + " norm", + "normkernel", + ), + ), + ("rope", ("rotary", "rope", "mrope")), + ("softmax", ("softmax",)), + ("activation", ("silu", "gelu", "relu", "act_and_mul", "sigmoid")), + ("quantize", ("quant", "fp8", "mxfp", "nvfp4", "dequant", "cvt")), + ( + "reduce_topk", + ("topk", "reduce", "argmax", "argtopk", "sampling", "multinomial"), + ), + ( + "sampling_io", + ( + "prepare_inputs", + "write_req_to", + "catarraybatched", + "prepare_next", + "copy_next", + ), + ), + ( + "elementwise", + ( + "elementwise", + "vectorized_elementwise_kernel", + "unrolled_elementwise_kernel", + "gpu_kernel_impl", + "binary_internal", + 
"unaryfunctor", + "add_kernel", + "sub_kernel", + "mul_kernel", + "div_", + "floor_kernel", + "log_kernel", + "neg_kernel", + ), + ), +] + +COMMUNICATION_STRONG_KEYWORDS = ( + "nccl", + "allreduce", + "all_reduce", + "reduce_scatter", + "allgather", + "all_gather", + "alltoall", + "all_to_all", + "cross_device_reduce", + "deepep", + "mooncake", +) + +COMMUNICATION_WEAK_KEYWORDS = ( + "broadcast", + "dispatch", + "combine", +) + +MEMORY_STRONG_KEYWORDS = ( + "memcpy", + "memset", + "dma", + "prefetch", +) + +MEMORY_WEAK_KEYWORDS = ( + "copy", + "fill", +) + +COMPUTE_HINT_KEYWORDS = ( + "gemm", + "gemv", + "matmul", + "cublas", + "cutlass", + "wgmma", + "mma", + "bmm", + "nvjet", + "fmha", + "attention", + "flash_attn", + "flashattention", + "flash_attention", + "grouped_mm", + "groupgemm", + "moe", + "expert", +) + +NOISE_FRAME_PREFIXES = ( + "threading.py(", + "multiprocessing/", + "contextlib.py(", + "torch/utils/_contextlib.py(", + "runpy.py(", + "asyncio/", + "selectors.py(", + "queue.py(", + "socket.py(", + "tqdm/_monitor.py(", + "(", + " float: + return self.total_us / self.count if self.count else 0.0 + + +@dataclass +class MappingSiteAggregate: + total_us: float = 0.0 + count: int = 0 + cpu_ops: Counter = field(default_factory=Counter) + stacks: Counter = field(default_factory=Counter) + + +@dataclass +class KernelRow: + name: str + category: str + aggregate: Aggregate + location: str + cpu_op: str + entry: Optional[dict] + + @property + def total_us(self) -> float: + return self.aggregate.total_us + + +@dataclass +class FusionOpportunity: + pattern: str + status: str + confidence: str + related_us: float + evidence: str + current_locations: str + candidate_path: str + rationale: str + covered_row_keys: Tuple[Tuple[str, str, str], ...] 
= field( + default_factory=tuple, repr=False + ) + pattern_span: int = field(default=1, repr=False) + has_active_match: bool = field(default=False, repr=False) + priority: int = field(default=0, repr=False) + subsumes: Tuple[str, ...] = field(default_factory=tuple, repr=False) + + +@dataclass(frozen=True) +class FusionPatternSpec: + pattern: str + candidate_path: str + active_keywords: Tuple[str, ...] = () + split_groups: Tuple[Tuple[str, ...], ...] = () + rationale_hint: str = "" + origin: str = "mainline" + model_include: Tuple[str, ...] = () + model_exclude: Tuple[str, ...] = () + min_tp_size: int = 1 + require_tp: bool = False + min_share: float = 0.25 + likely_share: float = 3.0 + priority: int = 0 + subsumes: Tuple[str, ...] = () + + +FUSION_PATTERN_REGISTRY: Tuple[FusionPatternSpec, ...] = ( + FusionPatternSpec( + pattern="Fused residual add + RMSNorm", + candidate_path=( + "python/sglang/srt/layers/layernorm.py" + "
python/sglang/srt/layers/quantization/modelslim/modelslim.py" + ), + active_keywords=( + "fused_add_rmsnorm", + "gemma_fused_add_rmsnorm", + "npu_add_rms_norm", + "add_rmsnorm_bias", + ), + rationale_hint=( + "Residual add plus RMSNorm already has fused implementations across" + " several backends." + ), + min_share=0.1, + likely_share=1.0, + ), + FusionPatternSpec( + pattern="FlashInfer unified allreduce_fusion", + candidate_path=( + "python/sglang/srt/layers/flashinfer_comm_fusion.py" + "
python/sglang/srt/layers/layernorm.py" + "
python/sglang/srt/layers/communicator.py" + ), + active_keywords=( + "allreduce_fusion", + "fusedaddrmsnormkernel", + "flashinfer_comm_fusion.py", + ), + split_groups=( + ( + "cross_device_reduce", + "allreduce", + "all_reduce", + "custom_all_reduce_ops.py", + ), + ("rmsnorm", "layernorm", "fused_add_rmsnorm", "layernorm.py"), + ), + rationale_hint=( + "FlashInfer has a TP all-reduce plus residual/RMSNorm fusion path." + ), + require_tp=True, + min_tp_size=2, + min_share=0.5, + likely_share=4.0, + ), + FusionPatternSpec( + pattern="AITER allreduce fusion", + candidate_path=( + "python/sglang/srt/distributed/communication_op.py" + "
python/sglang/srt/layers/communicator.py" + "
python/sglang/srt/layers/layernorm.py" + ), + active_keywords=( + "tensor_model_parallel_fused_allreduce_rmsnorm", + "apply_aiter_all_reduce_fusion", + "custom_fused_ar_rms", + ), + split_groups=( + ("allreduce", "all_reduce", "cross_device_reduce"), + ("rmsnorm", "layernorm"), + ), + rationale_hint=( + "ROCm already has an AITER fused all-reduce plus RMSNorm family." + ), + require_tp=True, + min_tp_size=2, + min_share=0.5, + likely_share=4.0, + ), + FusionPatternSpec( + pattern="Fused activation-and-mul (SwiGLU / GeGLU)", + candidate_path="python/sglang/srt/layers/activation.py", + active_keywords=("silu_and_mul", "gelu_and_mul", "npu_swiglu"), + rationale_hint=( + "Packed MLP activation and multiply already has dedicated fused ops." + ), + min_share=0.1, + likely_share=1.0, + ), + FusionPatternSpec( + pattern="In-place QK RMSNorm", + candidate_path=( + "python/sglang/srt/models/utils.py" "
python/sglang/jit_kernel/norm.py" + ), + active_keywords=("fused_inplace_qknorm", "minimaxm2rmsnormtp"), + split_groups=(("apply_qk_norm", "q_norm", "k_norm", "qknorm"),), + rationale_hint=( + "Q/K normalization already has in-place or model-specific fused" + " implementations." + ), + min_share=0.3, + likely_share=2.0, + ), + FusionPatternSpec( + pattern="Fused QK RMSNorm + RoPE", + candidate_path=( + "python/sglang/jit_kernel/fused_qknorm_rope.py" + "
python/sglang/srt/models/qwen3_moe.py" + ), + active_keywords=("fused_qknorm_rope", "fused_qk_norm_rope"), + split_groups=( + ("apply_qk_norm", "q_norm", "k_norm", "qknorm"), + ("apply_rope", "rotary", "rope", "mrope"), + ), + rationale_hint=("SGLang has a fused QK-norm plus RoPE kernel family."), + min_share=0.3, + likely_share=2.0, + priority=30, + ), + FusionPatternSpec( + pattern="Fused QK RoPE reshape + KV cache write", + candidate_path="python/sglang/srt/layers/attention/utils.py", + active_keywords=("fused_qk_rope_reshape_and_cache",), + split_groups=( + ("rotary", "rope", "mrope"), + ("reshape", "set_kv", "kv_cache", "cache write", "paged kv"), + ), + rationale_hint=( + "Attention prep already has a fused RoPE plus reshape plus cache" + " write path." + ), + min_share=0.4, + likely_share=2.0, + priority=40, + subsumes=("Fused RoPE + KV cache store",), + ), + FusionPatternSpec( + pattern="Fused RoPE + KV cache store", + candidate_path=( + "python/sglang/jit_kernel/rope.py" "
python/sglang/srt/models/utils.py" + ), + active_keywords=("fused_set_kv_buffer",), + split_groups=( + ("rotary", "rope", "mrope"), + ("set_kv_buffer", "kv cache write", "paged kv", "cache write"), + ), + rationale_hint=( + "RoPE application and KV cache storage already have fused fast" + " paths in several models." + ), + min_share=0.3, + likely_share=1.5, + priority=20, + ), + FusionPatternSpec( + pattern="Fused decode metadata setup", + candidate_path=("python/sglang/srt/layers/attention/flashattention_backend.py"), + active_keywords=( + "normal_decode_set_metadata", + "cache_seqlens_int32", + "cu_seqlens_k", + "swa_page_table", + ), + rationale_hint=( + "Decode metadata setup already has a fused Triton preparation path." + ), + min_share=0.05, + likely_share=0.5, + ), + FusionPatternSpec( + pattern="NSA fused metadata copy for graph replay", + candidate_path="python/sglang/jit_kernel/fused_metadata_copy.py", + active_keywords=( + "fused_metadata_copy", + "fused_metadata_copy_multi", + "fused_nsa_cache_seqlens", + "fused_flashmla_metadata", + ), + rationale_hint=( + "NSA replay metadata copies are already fused into one-kernel" " families." + ), + min_share=0.02, + likely_share=0.2, + ), + FusionPatternSpec( + pattern="DeepSeek MLA fused projection + norm + RoPE", + candidate_path=( + "python/sglang/srt/models/deepseek_common/attention_forward_methods/" + "forward_mla_fused_rope_cpu.py" + "
python/sglang/srt/models/deepseek_common/attention_forward_methods/" + "forward_mla_fused_rope_rocm.py" + ), + active_keywords=( + "qkv_proj_with_rope_fused_weight", + "fused_qkv_a_proj_with_mqa", + "forward_absorb_fused_mla_rope", + ), + split_groups=( + ("mla", "qkv_a_proj", "q_a_proj"), + ("qknorm", "rmsnorm", "apply_qk_norm"), + ("rope", "rotary"), + ), + rationale_hint=( + "DeepSeek MLA has backend-specific fused projection, norm, and" + " RoPE prep paths." + ), + model_include=("deepseek", "glm"), + min_share=0.4, + likely_share=2.0, + priority=80, + subsumes=("Fused QK RMSNorm + RoPE",), + ), + FusionPatternSpec( + pattern="Fused QK RoPE concat + MLA cache write", + candidate_path=( + "python/sglang/srt/layers/rocm_linear_utils.py" + "
python/sglang/srt/models/deepseek_common/attention_forward_methods/" + "forward_mla.py" + ), + active_keywords=("fused_qk_rope_cat_and_cache_mla", "set_mla_kv_buffer"), + split_groups=( + ("mla", "rope", "rotary"), + ("cache", "kv_buffer", "concat"), + ), + rationale_hint=( + "MLA RoPE packing and cache write already have fused backend paths." + ), + model_include=("deepseek", "glm"), + min_share=0.3, + likely_share=1.5, + priority=85, + subsumes=("Fused RoPE + KV cache store",), + ), + FusionPatternSpec( + pattern="Qwen3 decode fused QK norm + 3D mRoPE + KV cache write", + candidate_path="python/sglang/srt/models/qwen3.py", + active_keywords=("fused_qk_norm_mrope_3d_cache_pts_quant_shuffle",), + split_groups=( + ("apply_qk_norm", "q_norm", "k_norm", "qknorm"), + ("mrope", "3d rope", "rotary"), + ("cache", "kv_buffer", "paged kv", "cache write"), + ), + rationale_hint=( + "Qwen3-style decode already has a fused QK-norm plus 3D mRoPE plus" + " cache-write path." + ), + model_include=("qwen3",), + model_exclude=("qwen3.5", "qwen3_5"), + min_share=0.4, + likely_share=2.0, + priority=90, + subsumes=( + "Fused QK RMSNorm + RoPE", + "Fused QK RoPE reshape + KV cache write", + "Fused RoPE + KV cache store", + ), + ), + FusionPatternSpec( + pattern="Fused MoE router / top-k / softcapping", + candidate_path="python/sglang/srt/layers/moe/router.py", + active_keywords=("fusedmoerouter", "fused_moe_router"), + split_groups=( + ("router", "gate", "router logits"), + ("topk", "softmax", "softcap", "tanh"), + ), + rationale_hint=( + "MoE routing already has fused router, softcap, and top-k kernels." 
+ ), + min_share=0.3, + likely_share=1.5, + priority=30, + ), + FusionPatternSpec( + pattern="Fused MoE grouped-topk / gate kernels", + candidate_path="python/sglang/srt/layers/moe/topk.py", + active_keywords=( + "fused_topk_deepseek", + "moe_fused_gate", + "aiter_fused_topk", + "kimi_k2_moe_fused_gate", + ), + split_groups=( + ("grouped_topk", "topk", "biased_grouped_topk"), + ("gate", "router", "renorm", "routed scaling"), + ), + rationale_hint=( + "Grouped-topk, bias handling, and routed scaling already have fused" + " gate kernels." + ), + min_share=0.3, + likely_share=1.5, + priority=50, + subsumes=("Fused MoE router / top-k / softcapping",), + ), + FusionPatternSpec( + pattern="Qwen-style shared-expert append into routed top-k output", + candidate_path=( + "python/sglang/srt/models/qwen2_moe.py" + "
python/sglang/srt/layers/moe/moe_runner/triton_utils/" + "fused_moe_triton_kernels.py" + ), + active_keywords=( + "_append_shared_to_topk_output", + "fused_append_shared_experts_with_weights", + "_fused_append_shared_experts_with_weights_kernel", + ), + split_groups=( + ("_append_shared_to_topk_output", "topk", "grouped_topk"), + ("shared_expert", "shared_expert_gate", "sigmoid"), + ), + rationale_hint=( + "Qwen-style shared experts can already be appended into routed top-k" + " output in one Triton prep kernel before fused MoE execution." + ), + min_share=0.05, + likely_share=0.5, + priority=55, + ), + FusionPatternSpec( + pattern="Fused MoE sum + all-reduce", + candidate_path=("python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py"), + active_keywords=("fuse_sum_all_reduce", "enable_fused_moe_sum_all_reduce"), + split_groups=( + ("fused_moe", "expert", "moe"), + ("allreduce", "all_reduce", "cross_device_reduce"), + ), + rationale_hint=( + "The second MoE GEMM already has a fused sum-plus-all-reduce path." + ), + require_tp=True, + min_tp_size=2, + min_share=0.4, + likely_share=2.0, + ), + FusionPatternSpec( + pattern="Fused MoE activation + quant / re-quant", + candidate_path=( + "python/sglang/srt/layers/moe/ep_moe/kernels.py" + "
python/sglang/jit_kernel/nvfp4.py" + "
python/sglang/srt/layers/moe/cutlass_w4a8_moe.py" + ), + active_keywords=( + "silu_and_mul_scaled_fp4", + "npu_dequant_swiglu_quant", + "swiglu_quant", + ), + split_groups=( + ("silu", "gelu", "act_and_mul"), + ("quant", "fp8", "mxfp", "nvfp4", "dequant"), + ), + rationale_hint=( + "Quantized MoE backends already fuse activation with re-quantization." + ), + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="DeepSeek comm-prep fused RMSNorm + quant / flatten-quant", + candidate_path=( + "python/sglang/srt/layers/communicator.py" + "<br>python/sglang/srt/models/deepseek_common/attention_forward_methods/" + "forward_mla.py" + "<br>python/sglang/srt/models/deepseek_common/attention_forward_methods/" + "forward_mha.py" + ), + active_keywords=( + "fused_rms_fp8_group_quant", + "fused_rms_mxfp4_quant", + "fused_flatten_fp8_group_quant", + "fused_flatten_mxfp4_quant", + ), + split_groups=( + ("rmsnorm", "layernorm", "flatten"), + ("fp8", "mxfp4", "quant"), + ), + rationale_hint=( + "DeepSeek comm preparation already fuses norm or flatten work with" + " quantization." + ), + model_include=("deepseek", "glm"), + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="NSA fused top-k transform / page-table build", + candidate_path="python/sglang/srt/layers/attention/nsa_backend.py", + active_keywords=( + "fast_topk_transform_fused", + "fast_topk_transform_ragged_fused", + ), + rationale_hint=( + "NSA top-k metadata preparation already has fused transform kernels." + ), + min_share=0.05, + likely_share=0.3, + ),
+ FusionPatternSpec( + pattern="NSA fused quantize + indexed K-cache store", + candidate_path=( + "python/sglang/jit_kernel/fused_store_index_cache.py" + "<br>python/sglang/srt/layers/attention/nsa/nsa_indexer.py" + ), + active_keywords=("fused_store_index_k_cache",), + split_groups=( + ("act_quant", "quant", "scale_buffer"), + ("index_k", "cache", "store"), + ), + rationale_hint=( + "NSA already has a fused quantize-and-indexed-store kernel family." + ), + min_share=0.2, + likely_share=1.0, + ),
+ FusionPatternSpec( + pattern="Fused sampling temperature + softmax", + candidate_path=( + "python/sglang/srt/layers/fused_sampling.py" + "<br>python/sglang/srt/layers/sampler.py" + ), + active_keywords=("fused_temperature_softmax",), + split_groups=( + ("temperature", "temp_scale"), + ("softmax", "sampling"), + ), + rationale_hint=( + "Decode-time sampling already has fused temperature and softmax" " kernels." + ), + min_share=0.05, + likely_share=0.5, + ),
+ FusionPatternSpec( + pattern="Fused logit softcap", + candidate_path=( + "python/sglang/srt/layers/elementwise.py" + "<br>python/sglang/srt/layers/logits_processor.py" + ), + active_keywords=("fused_softcap", "final_logit_softcapping"), + rationale_hint=( + "Logit softcap math already has dedicated fused elementwise kernels." + ), + min_share=0.02, + likely_share=0.2, + ),
+ FusionPatternSpec( + pattern="PR #20667 Qwen3.5 fused QK norm + RoPE + KV cache write", + candidate_path=( + "PR #20667" + "<br>python/sglang/srt/models/qwen3_5.py" + "<br>python/sglang/srt/models/utils.py" + ), + active_keywords=( + "fused_qk_norm_rope_cache_pts_quant_shuffle", + "fused_qk_norm_mrope_3d_cache_pts_quant_shuffle", + ), + split_groups=( + ("apply_qk_norm", "qknorm", "q_norm", "k_norm"), + ("rotary", "rope", "mrope"), + ("cache", "kv_buffer", "cache write"), + ), + rationale_hint=( + "Open SGLang ROCm PR wires a fused QK-norm plus RoPE plus KV-cache" + " family for Qwen3.5." + ), + origin="inflight", + model_include=("qwen3.5", "qwen3_5"), + min_share=0.4, + likely_share=2.0, + priority=100, + subsumes=( + "Fused QK RMSNorm + RoPE", + "Fused QK RoPE reshape + KV cache write", + "Fused RoPE + KV cache store", + ), + ),
+ FusionPatternSpec( + pattern="PR #22392 CUTLASS FP8 scaled MM replacing nvjet", + candidate_path=( + "PR #22392" + "<br>sgl-kernel/python/sgl_kernel/gemm.py" + "<br>python/sglang/srt/layers/quantization/fp8_utils.py" + ), + active_keywords=("cutlass_scaled_mm", "fp8_scaled_mm"), + split_groups=( + ("nvjet", "_scaled_mm"), + ("memset", "memcpy128"), + ), + rationale_hint=( + "Open SGLang PR replaces nvjet FP8 GEMM with CUTLASS to remove" + " memset bubbles and extra copies." + ), + origin="inflight", + min_share=0.2, + likely_share=1.0, + priority=90, + ),
+ FusionPatternSpec( + pattern="vLLM-origin Attention + Quantization", + candidate_path=( + "vllm/compilation/passes/fusion/attn_quant_fusion.py" + "<br>vllm/v1/attention/ops/merge_attn_states.py" + "<br>vllm/csrc/attention/merge_attn_states.cu" + "<br>vllm/docs/design/fusions.md" + ), + active_keywords=( + "merge_attn_states", + "attn_quant_fusion", + "output_scale", + "output_group_scale", + ), + split_groups=( + ("attention", "flash_attn", "flashattention", "mla"), + ("quant", "fp8", "nvfp4", "group_scale"), + ), + rationale_hint=( + "vLLM combines attention merge with attention-epilogue quantization." + ), + origin="upstream", + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="vLLM-origin DSV3.2 fused indexer projections", + candidate_path=( + "vllm/model_executor/models/deepseek_v2.py" + "<br>vllm/model_executor/models/deepseek_mtp.py" + ), + active_keywords=("wk_weights_proj",), + split_groups=( + ("wk_weights_proj", "wk", "weights_proj"), + ("mergedcolumnparallellinear", "gemm", "matmul"), + ), + rationale_hint=( + "vLLM already fuses the paired `wk` and `weights_proj` indexer" + " projections into one DSV3.2 linear family." + ), + origin="upstream", + min_share=0.2, + likely_share=1.0, + ),
+ FusionPatternSpec( + pattern="vLLM-origin RMSNorm + Quantization", + candidate_path=( + "vllm/compilation/passes/fusion/rms_quant_fusion.py" + "<br>vllm/docs/design/fusions.md" + ), + active_keywords=( + "fused_add_rms_norm_static_fp8_quant", + "rms_quant_fusion", + "norm_quant", + ), + split_groups=( + ("rmsnorm", "layernorm", "fused_add_rms_norm"), + ("quant", "fp8", "fp4", "per-group"), + ), + rationale_hint=( + "vLLM already has a compile-time norm-plus-quant fusion family." + ), + origin="upstream", + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="vLLM-origin SiLU+Mul + Quantization", + candidate_path=( + "vllm/compilation/passes/fusion/act_quant_fusion.py" + "<br>vllm/docs/design/fusions.md" + ), + active_keywords=( + "silu_mul_quant_fp4", + "fused_silu_mul_block_quant", + "act_quant_fusion", + ), + split_groups=( + ("silu", "gelu", "act_and_mul"), + ("quant", "fp8", "fp4", "block_quant"), + ), + rationale_hint=("vLLM has an activation-plus-quant fusion family."), + origin="upstream", + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="vLLM-origin DSV3 router GEMM", + candidate_path=( + "vllm/model_executor/layers/fused_moe/router/gate_linear.py" + "<br>vllm/csrc/moe/dsv3_router_gemm_entry.cu" + ), + active_keywords=("dsv3_router_gemm", "fp32_router_gemm"), + split_groups=( + ("router", "gate", "router logits"), + ("gemm", "matmul", "cublas", "cutlass"), + ), + rationale_hint=( + "vLLM has a specialized DeepSeek router GEMM family for small" + " decode batches." + ), + origin="upstream", + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="vLLM-origin GPT-OSS router GEMM", + candidate_path=( + "vllm/_custom_ops.py" + "<br>vllm/model_executor/layers/fused_moe/router/gate_linear.py" + "<br>vllm/csrc/moe/gpt_oss_router_gemm.cu" + ), + active_keywords=("gpt_oss_router_gemm",), + split_groups=( + ("router", "gate", "router logits", "gpt_oss"), + ("gemm", "matmul", "cublas", "cutlass"), + ), + rationale_hint=("vLLM has a GPT-OSS-specific router GEMM path."), + origin="upstream", + model_include=("gpt-oss", "gpt_oss"), + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="vLLM-origin DeepSeek min-latency fused QKV-A projection", + candidate_path=( + "vllm/model_executor/models/deepseek_v2.py" + "<br>vllm/csrc/dsv3_fused_a_gemm.cu" + ), + active_keywords=("dsv3_fused_a_gemm", "fused_qkv_a_proj"), + split_groups=( + ("q_a_proj", "kv_a_proj", "weights_proj"), + ("gemm", "matmul", "cutlass", "cublas"), + ), + rationale_hint=( + "vLLM has a fused DeepSeek QKV-A projection family for decode" + " latency reduction." + ), + origin="upstream", + model_include=("deepseek", "glm"), + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="PR #38621 fused QK norm + RoPE + cache + quant", + candidate_path=( + "PR #38621" + "<br>vllm/csrc/fused_qk_norm_rope_cache_quant.cu" + "<br>vllm/compilation/passes/fusion/qk_norm_rope_cache_quant_fusion.py" + ), + active_keywords=("fused_qk_norm_rope_cache_quant",), + split_groups=( + ("qknorm", "q_norm", "k_norm"), + ("rope", "rotary", "mrope"), + ("cache", "kv_buffer", "cache write"), + ("quant", "fp8", "nvfp4"), + ), + rationale_hint=( + "Open vLLM PR covers QK-norm plus RoPE plus cache plus quant as" + " one fusion family." + ), + origin="inflight", + min_share=0.4, + likely_share=2.0, + priority=100, + subsumes=("vLLM-origin Attention + Quantization",), + ),
+ FusionPatternSpec( + pattern="vLLM-origin MiniMax allreduce_rms kernels", + candidate_path="vllm/model_executor/models/minimax_m2.py", + active_keywords=("minimax_allreduce_rms", "minimax_allreduce_rmsnorm"), + split_groups=( + ("q_norm", "k_norm", "rmsnorm", "minimax"), + ("allreduce", "all_reduce", "cross_device_reduce"), + ), + rationale_hint=( + "vLLM includes the TRTLLM-derived MiniMax allreduce-plus-RMSNorm" + " kernel family." + ), + origin="upstream", + model_include=("minimax",), + min_share=0.3, + likely_share=1.5, + ),
+ FusionPatternSpec( + pattern="vLLM fused residual add + RMSNorm", + candidate_path=( + "vllm/_custom_ops.py" + "<br>vllm/compilation/passes/fusion/rms_quant_fusion.py" + ), + active_keywords=( + "fused_add_rms_norm", + "fused_add_rms_norm_static_fp8_quant", + ), + rationale_hint=( + "vLLM exposes fused residual-add-plus-RMSNorm kernels and matching" + " compile-time hooks." + ), + origin="upstream", + min_share=0.1, + likely_share=1.0, + ),
+ FusionPatternSpec( + pattern="vLLM fused activation-and-mul", + candidate_path=( + "vllm/_custom_ops.py" + "<br>vllm/compilation/passes/fusion/act_quant_fusion.py" + ), + active_keywords=( + "silu_and_mul", + "silu_and_mul_quant", + "silu_and_mul_per_block_quant", + "act_and_mul", + ), + rationale_hint=( + "vLLM ships fused activation-and-multiply kernels plus quantized" + " variants for the MLP epilogue." + ), + origin="upstream", + min_share=0.1, + likely_share=1.0, + ),
+ FusionPatternSpec( + pattern="TensorRT-LLM FlashInfer residual add + RMSNorm", + candidate_path=( + "tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py" + "<br>tensorrt_llm/_torch/modules/rms_norm.py" + "<br>tensorrt_llm/_torch/auto_deploy/transform/library/fused_add_rms_norm.py" + ), + active_keywords=( + "flashinfer_fused_add_rmsnorm", + "flashinfer_gemma_fused_add_rmsnorm", + "flashinfer::norm::FusedAddRMSNormKernel", + "FusedAddRMSNormKernel", + "auto_deploy::flashinfer_fused_add_rms_norm_inplace", + ), + rationale_hint=( + "TensorRT-LLM exposes a FlashInfer fused residual-add plus RMSNorm" + " family, including AutoDeploy rewrites." + ), + origin="upstream", + min_share=0.1, + likely_share=1.0, + ),
+ FusionPatternSpec( + pattern="TensorRT-LLM Triton fused residual add + RMSNorm + FP8 quant", + candidate_path=( + "tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/" + "triton_fused_add_rms_norm_quant_fp8.py" + "<br>tensorrt_llm/_torch/auto_deploy/transform/library/" + "fuse_rmsnorm_quant_fp8.py" + ), + active_keywords=( + "triton_fused_add_rms_norm_quant_fp8", + "fuse_rmsnorm_quant_fp8", + ), + rationale_hint=( + "TensorRT-LLM mainline has a Triton residual-add plus RMSNorm plus" + " FP8-quant family in AutoDeploy." + ), + origin="upstream", + min_share=0.2, + likely_share=1.0, + priority=20, + ),
+ FusionPatternSpec( + pattern="TensorRT-LLM FlashInfer RMSNorm family", + candidate_path=( + "tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py" + "<br>tensorrt_llm/_torch/modules/rms_norm.py" + "<br>tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py" + ), + active_keywords=( + "flashinfer_rmsnorm", + "flashinfer_gemma_rmsnorm", + "auto_deploy::flashinfer_rms_norm", + ), + rationale_hint=( + "TensorRT-LLM lowers RMSNorm-style ladders to FlashInfer kernels" + " and AutoDeploy custom ops." + ), + origin="upstream", + min_share=0.1, + likely_share=1.0, + ),
+ FusionPatternSpec( + pattern="TensorRT-LLM FlashInfer activation / gate epilogues", + candidate_path=( + "tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py" + "<br>tensorrt_llm/_torch/auto_deploy/transform/library/fuse_silu_mul.py" + "<br>tensorrt_llm/_torch/models/modeling_gemma3.py" + ), + active_keywords=( + "flashinfer_silu_and_mul", + "flashinfer_gelu_tanh_and_mul", + "auto_deploy::silu_and_mul", + ), + rationale_hint=( + "TensorRT-LLM already rewrites gate activation plus multiply" + " ladders into FlashInfer epilogue kernels." + ), + origin="upstream", + min_share=0.1, + likely_share=1.0, + ),
+) + + +def short_name(name: str, max_len: int = 96) -> str: + text = normalize_text(name) + if len(text) <= max_len: + return text + return text[: max_len - 3] + "..." + + +@lru_cache(maxsize=65536) +def canonicalize_name(name: str) -> str: + text = normalize_text(name) + text = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", text) + if text.startswith("void ") and text.endswith(")"): + depth = 0 + split_idx: Optional[int] = None + for idx in range(len(text) - 1, -1, -1): + char = text[idx] + if char == ")": + depth += 1 + elif char == "(": + depth -= 1 + if depth == 0: + split_idx = idx + break + if split_idx is not None: + text = text[:split_idx] + return text + + +@lru_cache(maxsize=65536) +def classify_kernel(name: str) -> str: + # Keep the matching order explicit: strong communication/memory signals win + # first, then we fall back to weaker category hints.
+ lowered = name.lower() + if contains_any_keyword(lowered, COMMUNICATION_STRONG_KEYWORDS): + return "communication" + if contains_any_keyword(lowered, MEMORY_STRONG_KEYWORDS): + return "memory" + looks_compute_like = contains_any_keyword(lowered, COMPUTE_HINT_KEYWORDS) + if contains_any_keyword(lowered, MEMORY_WEAK_KEYWORDS) and not looks_compute_like: + return "memory" + for category, keywords in CATEGORY_PATTERNS: + if contains_any_keyword(lowered, keywords): + return category + if ( + contains_any_keyword(lowered, COMMUNICATION_WEAK_KEYWORDS) + and not looks_compute_like + ): + return "communication" + return "other" + + +@lru_cache(maxsize=65536) +def normalize_source_location(name: str) -> str: + text = normalize_text(name) + match = re.match(r"(?P<path>.+?)\((?P<line>\d+)\): (?P<func>.+)$", text) + if not match: + return text + path = normalize_repo_relative_path(match.group("path")) + return f"{path}:{match.group('line')} {match.group('func')}" + + +def source_location_priority(location: str) -> int: + text = str(location).strip() + if not text or text == "unresolved": + return -100 + penalty = 80 if is_low_signal_source_location(text) else 0 + if text.startswith("python/sglang/"): + return 300 - penalty + if text.startswith("sglang/"): + return 290 - penalty + if text.startswith("vllm/"): + return 285 - penalty + if text.startswith("tensorrt_llm/"): + return 280 - penalty + if text.startswith("sgl_kernel/"): + return 260 - penalty + if text.startswith("python/"): + return 180 - penalty + if text.startswith("torch/") or "/torch/" in text: + return 20 + if ".py:" in text: + return 120 - penalty + return 0 + + +def is_preferred_source_location(location: str) -> bool: + text = str(location).strip() + return ( + text.startswith("python/sglang/") + or text.startswith("sglang/") + or text.startswith("vllm/") + or text.startswith("tensorrt_llm/") + or text.startswith("sgl_kernel/") + ) + + +def extract_preferred_stack_location(stack: Optional[str]) -> Optional[str]: + if not stack: 
+ return None + parts = [str(part).strip() for part in str(stack).split("->")] + ranked: List[Tuple[int, int, str]] = [] + for index, part in enumerate(parts): + normalized = normalize_source_location(part) + priority = source_location_priority(normalized) + if priority <= 0: + continue + ranked.append((priority, index, normalized)) + if not ranked: + return None + ranked.sort(key=lambda item: (item[0], item[1]), reverse=True) + return ranked[0][2] + + +def site_display_location(site: dict) -> str: + location = str(site.get("location") or "unresolved").strip() + if is_preferred_source_location(location) and not is_low_signal_source_location( + location + ): + return location + stack_location = extract_preferred_stack_location(site.get("stack")) + if stack_location: + return stack_location + return location + + +def choose_best_location(locations: Dict[str, MappingSiteAggregate]) -> str: + if not locations: + return "unresolved" + ranked = sorted( + locations.items(), + key=lambda pair: ( + source_location_priority(pair[0]), + pair[1].total_us, + pair[1].count, + ), + reverse=True, + ) + return ranked[0][0] + + +@lru_cache(maxsize=65536) +def frame_priority(frame_name: str) -> int: + raw_text = str(frame_name).strip() + normalized_text = normalize_source_location(raw_text) + penalty = 80 if is_low_signal_source_location(normalized_text) else 0 + if raw_text.startswith(NOISE_FRAME_PREFIXES): + return -20 + if normalized_text.startswith("python/sglang/"): + return 300 - penalty + if normalized_text.startswith("sglang/"): + return 290 - penalty + if normalized_text.startswith("vllm/"): + return 285 - penalty + if normalized_text.startswith("tensorrt_llm/"): + return 280 - penalty + if normalized_text.startswith("sgl_kernel/"): + return 260 - penalty + if normalized_text.startswith("triton_kernels/"): + return 220 - penalty + if normalized_text.startswith(LOW_LEVEL_FRAME_PREFIXES): + return 0 + if raw_text.startswith("/data/") or raw_text.startswith("/Users/"): + if 
"/sglang/" in raw_text: + return 120 + if "/vllm/" in raw_text: + return 118 + if "/TensorRT-LLM/" in raw_text or "/tensorrt_llm/" in raw_text: + return 116 + return 100 + if ".py(" in raw_text and "/sglang/" in raw_text: + return 110 + if ".py(" in raw_text and "/vllm/" in raw_text: + return 108 + if ".py(" in raw_text and ( + "/TensorRT-LLM/" in raw_text or "/tensorrt_llm/" in raw_text + ): + return 106 + if ".py:" in normalized_text and ( + "site-packages" in raw_text or normalized_text.startswith("torch/") + ): + return 45 + if ".py:" in normalized_text: + return 35 + if raw_text.startswith(" bool: + lowered = str(location).strip().lower() + if not lowered: + return False + return any(token in lowered for token in LOW_SIGNAL_FUNCTION_TOKENS) or any( + token in lowered for token in LOW_SIGNAL_PATH_TOKENS + ) + + +def stage_label(stage: str) -> str: + if stage == "extend": + return "extend/prefill" + return stage + + +def stage_aliases(stage: str) -> List[str]: + if stage == "extend": + return ["extend", "prefill", "all"] + if stage == "prefill": + return ["prefill", "extend", "all"] + if stage == "decode": + return ["decode", "all"] + return [stage, "all"] + + +def escape_md_cell(text: str) -> str: + return str(text).replace("|", "\\|").replace("\n", "<br>")
+ + +def pct(part: float, whole: float) -> float: + return 100.0 * part / whole if whole else 0.0 + + +def format_ms(value_us: float) -> str: + return f"{value_us / 1000.0:.2f} ms" + + +@lru_cache(maxsize=16384) +def _is_cuda_launch_event_cached(name: str, cat: str) -> bool: + lowered_name = normalize_text(name).lower() + lowered_cat = normalize_text(cat).lower() + if lowered_cat not in {"cuda_runtime", "cuda_driver"}: + return False + return "launch" in lowered_name + + +def is_cuda_launch_event(name: str, cat: str) -> bool: + return _is_cuda_launch_event_cached(str(name), str(cat)) + + +def is_gpu_kernel_event(event: dict) -> bool: + # Be conservative here: first drop trace metadata / Python scopes / + # annotations, then only accept entries with clear GPU-kernel markers. + if not is_complete_duration_event(event): + return False + name = normalize_text(event.get("name", "")) + if is_trace_metadata_name(name): + return False + cat = normalize_text(event.get("cat", "")).lower() + args = event.get("args") or {} + if is_non_kernel_trace_category(cat): + return False + if is_annotation_event(name, cat): + return False + if "kernel" in cat or cat.startswith("gpu_"): + return True + if looks_like_python_scope_name(name): + return False + return has_stream_marker(args) + + +def infer_stage_from_annotation_name(name: str) -> Optional[str]: + lowered = normalize_text(name).lower() + if not lowered: + return None + if "generation_1" in lowered or "decode" in lowered: + return "decode" + if "generation_0" in lowered or "prefill" in lowered: + return "extend" + return None + + +def build_stage_annotations( + raw_events: Sequence[dict], +) -> Tuple[ + Dict[int, StageAnnotation], + List[StageWindow], + List[StageWindow], +]: + by_external_id: Dict[int, StageAnnotation] = {} + gpu_annotations: List[StageAnnotation] = [] + cpu_annotations: List[StageAnnotation] = [] + + def should_replace(current: StageAnnotation, candidate: StageAnnotation) -> bool: + if candidate.is_gpu != 
current.is_gpu: + return candidate.is_gpu + return (candidate.end_ts - candidate.ts) > (current.end_ts - current.ts) + + for event in raw_events: + if not is_complete_duration_event(event): + continue + category = normalize_text(event.get("cat", "")).lower() + if category not in {"user_annotation", "gpu_user_annotation"}: + continue + stage = infer_stage_from_annotation_name(str(event.get("name", ""))) + if not stage: + continue + annotation = StageAnnotation( + stage=stage, + ts=float(event.get("ts", 0.0)), + end_ts=float(event.get("ts", 0.0)) + float(event.get("dur", 0.0)), + external_id=coerce_optional_int( + (event.get("args") or {}).get("External id") + ), + is_gpu=(category == "gpu_user_annotation"), + ) + if annotation.external_id is not None: + existing = by_external_id.get(annotation.external_id) + if existing is None or should_replace(existing, annotation): + by_external_id[annotation.external_id] = annotation + if annotation.is_gpu: + gpu_annotations.append(annotation) + else: + cpu_annotations.append(annotation) + + gpu_annotations.sort(key=lambda item: (item.ts, item.end_ts)) + cpu_annotations.sort(key=lambda item: (item.ts, item.end_ts)) + return ( + by_external_id, + merge_stage_windows(gpu_annotations), + merge_stage_windows(cpu_annotations), + ) + + +def merge_stage_windows(annotations: Sequence[StageAnnotation]) -> List[StageWindow]: + merged: List[StageWindow] = [] + for annotation in annotations: + if ( + merged + and merged[-1].stage == annotation.stage + and annotation.ts <= merged[-1].end_ts + 1e-3 + ): + merged[-1] = StageWindow( + stage=merged[-1].stage, + ts=merged[-1].ts, + end_ts=max(merged[-1].end_ts, annotation.end_ts), + ) + continue + merged.append( + StageWindow( + stage=annotation.stage, + ts=annotation.ts, + end_ts=annotation.end_ts, + ) + ) + return merged + + +def resolve_stage_from_windows( + probe_ts: float, + windows: Sequence[StageWindow], +) -> Tuple[Optional[str], Optional[float]]: + nearest_stage: Optional[str] = None + 
nearest_gap: Optional[float] = None + for window in windows: + if window.ts <= probe_ts <= window.end_ts + 1e-3: + return window.stage, 0.0 + gap = min(abs(probe_ts - window.ts), abs(probe_ts - window.end_ts)) + if nearest_gap is None or gap < nearest_gap: + nearest_gap = gap + nearest_stage = window.stage + return nearest_stage, nearest_gap + + +def resolve_kernel_stage( + *, + kernel_ts: float, + external_id: Optional[int], + annotations_by_external_id: Dict[int, StageAnnotation], + gpu_annotations: Sequence[StageWindow], + cpu_annotations: Sequence[StageWindow], +) -> str: + if external_id is not None: + annotation = annotations_by_external_id.get(external_id) + if annotation is not None: + return annotation.stage + probe_ts = kernel_ts + 1e-3 + nearest_stage: Optional[str] = None + nearest_gap: Optional[float] = None + for windows in (gpu_annotations, cpu_annotations): + stage, gap = resolve_stage_from_windows(probe_ts, windows) + if gap == 0.0 and stage is not None: + return stage + if stage is not None and ( + nearest_gap is None or (gap is not None and gap < nearest_gap) + ): + nearest_stage = stage + nearest_gap = gap + if ( + nearest_stage is not None + and nearest_gap is not None + and nearest_gap <= 20_000.0 + ): + return nearest_stage + return "all" + + +def extract_trace_data( + trace: dict, +) -> Tuple[ + List[KernelEvent], + List[CpuOpEvent], + Dict[Tuple[str, str], List[PythonFrame]], + List[LaunchEvent], + Optional[str], + float, +]: + # Build the basic trace views in one pass so later stages can stay simple: + # GPU kernels for ranking, CPU ops for External-id mapping, Python frames for + # source attribution, and CUDA launch calls for correlation-based fallback. 
+ raw_events = extract_trace_events(trace) + correlation_external = build_correlation_external_lookup(raw_events) + ( + annotations_by_external_id, + gpu_stage_annotations, + cpu_stage_annotations, + ) = build_stage_annotations(raw_events) + chosen_pid = select_heaviest_pid( + raw_events, + is_gpu_kernel_event, + preferred_substrings=("TP00", "TP-0"), + ) + + kernels: List[KernelEvent] = [] + cpu_ops: List[CpuOpEvent] = [] + launches: List[LaunchEvent] = [] + python_frames: DefaultDict[Tuple[str, str], List[PythonFrame]] = defaultdict(list) + min_ts = None + max_end = None + + for event in raw_events: + if event.get("ph") != "X": + continue + + pid = str(event.get("pid")) + tid = str(event.get("tid")) + ts = float(event.get("ts", 0.0)) + dur = float(event.get("dur", 0.0)) + cat = str(event.get("cat", "")) + args = event.get("args") or {} + name = str(event.get("name", "")) + + if cat == "python_function": + python_frames[(pid, tid)].append( + PythonFrame( + name=name, + normalized_name=normalize_source_location(name), + pid=pid, + tid=tid, + ts=ts, + dur=dur, + python_id=coerce_optional_int(args.get("Python id")), + parent_id=coerce_optional_int(args.get("Python parent id")), + end_ts=ts + dur, + priority=frame_priority(name), + ) + ) + + correlation = coerce_optional_int(args.get("correlation")) + external_id = coerce_optional_int(args.get("External id")) + if external_id is None and correlation is not None: + external_id = correlation_external.get(correlation) + if cat == "cpu_op" and external_id is not None: + cpu_ops.append( + CpuOpEvent( + name=name, + pid=pid, + tid=tid, + ts=ts, + dur=dur, + external_id=external_id, + ) + ) + if is_cuda_launch_event(name, cat) and correlation is not None: + launches.append( + LaunchEvent( + name=name, + pid=pid, + tid=tid, + ts=ts, + dur=dur, + correlation=correlation, + ) + ) + + if chosen_pid is None or not is_gpu_kernel_event(event) or pid != chosen_pid: + continue + + min_ts = ts if min_ts is None else min(min_ts, ts) + 
max_end = ts + dur if max_end is None else max(max_end, ts + dur) + kernels.append( + KernelEvent( + name=name, + canonical_name=canonicalize_name(name), + category=classify_kernel(name), + stage=resolve_kernel_stage( + kernel_ts=ts, + external_id=external_id, + annotations_by_external_id=annotations_by_external_id, + gpu_annotations=gpu_stage_annotations, + cpu_annotations=cpu_stage_annotations, + ), + pid=pid, + tid=tid, + ts=ts, + dur=dur, + external_id=external_id, + correlation=correlation, + ) + ) + + for frames in python_frames.values(): + frames.sort(key=lambda item: (item.ts, item.end_ts)) + + window_us = 0.0 if min_ts is None or max_end is None else max_end - min_ts + return kernels, cpu_ops, dict(python_frames), launches, chosen_pid, window_us + + +def build_correlation_external_lookup(raw_events: Sequence[dict]) -> Dict[int, int]: + lookup: Dict[int, int] = {} + for event in raw_events: + args = event.get("args", {}) or {} + correlation = coerce_optional_int(args.get("correlation")) + external_id = coerce_optional_int(args.get("External id")) + if correlation is not None and external_id is not None: + lookup[correlation] = external_id + return lookup + + +def build_timed_event_index(events: Sequence[object]) -> TimedEventIndex: + ordered = list(events) + ordered.sort(key=lambda item: item.ts) + return TimedEventIndex( + events=ordered, + start_ts=[float(item.ts) for item in ordered], + ) + + +def build_cpu_op_index(cpu_ops: Sequence[CpuOpEvent]) -> Dict[int, TimedEventIndex]: + output: DefaultDict[int, List[CpuOpEvent]] = defaultdict(list) + for cpu_op in cpu_ops: + output[cpu_op.external_id].append(cpu_op) + return { + external_id: build_timed_event_index(items) + for external_id, items in output.items() + } + + +def match_cpu_op( + kernel: KernelEvent, cpu_ops_by_external_id: Dict[int, TimedEventIndex] +) -> Optional[CpuOpEvent]: + if kernel.external_id is None: + return None + return match_timed_event( + cpu_ops_by_external_id.get(kernel.external_id, 
[]), kernel.ts + ) + + +def build_launch_index( + launch_events: Sequence[LaunchEvent], +) -> Dict[int, TimedEventIndex]: + output: DefaultDict[int, List[LaunchEvent]] = defaultdict(list) + for launch in launch_events: + output[launch.correlation].append(launch) + return { + correlation: build_timed_event_index(items) + for correlation, items in output.items() + } + + +def match_launch_event( + kernel: KernelEvent, launches_by_correlation: Dict[int, TimedEventIndex] +) -> Optional[LaunchEvent]: + if kernel.correlation is None: + return None + return match_timed_event( + launches_by_correlation.get(kernel.correlation, []), kernel.ts + ) + + +def match_timed_event(index: object, probe_ts: float): + if not index: + return None + if isinstance(index, TimedEventIndex): + events = index.events + if not events: + return None + right = bisect_right(index.start_ts, probe_ts + 1e-3) + candidates: List[object] = [] + if right > 0: + candidates.extend(events[max(0, right - 4) : right]) + if right < len(events): + candidates.extend(events[right : min(len(events), right + 2)]) + if not candidates: + return None + earlier = [item for item in candidates if item.ts <= probe_ts + 1e-3] + if earlier: + return min(earlier, key=lambda item: abs((item.ts + item.dur) - probe_ts)) + return min(candidates, key=lambda item: abs(item.ts - probe_ts)) + events = list(index) + if not events: + return None + earlier = [item for item in events if item.ts <= probe_ts + 1e-3] + if earlier: + return min(earlier, key=lambda item: abs((item.ts + item.dur) - probe_ts)) + return min(events, key=lambda item: abs(item.ts - probe_ts)) + + +def resolve_active_frames_linear( + frames: Sequence[PythonFrame], probe_ts: float +) -> List[PythonFrame]: + active = [item for item in frames if item.ts <= probe_ts <= item.end_ts] + active.sort(key=lambda item: (item.ts, item.end_ts)) + return active + + +def thread_has_crossing_frames(frames: Sequence[PythonFrame]) -> bool: + ordered_frames = sorted(frames, 
key=lambda item: (item.ts, -item.end_ts)) + stack: List[PythonFrame] = [] + for frame in ordered_frames: + while stack and stack[-1].end_ts < frame.ts: + stack.pop() + if stack and frame.end_ts > stack[-1].end_ts + 1e-3: + return True + stack.append(frame) + return False + + +def render_frame_resolution( + active_frames: Sequence[PythonFrame], +) -> Optional[FrameResolution]: + if not active_frames: + return None + chosen_frame = choose_mapping_frame(active_frames) + if chosen_frame is None: + return None + return FrameResolution( + location=chosen_frame.normalized_name, + stack=build_stack_display(active_frames), + ) + + +def resolve_thread_query_times( + frames: Sequence[PythonFrame], query_times: Sequence[float] +) -> Dict[float, Optional[FrameResolution]]: + if not frames or not query_times: + return {} + ordered_frames = sorted(frames, key=lambda item: (item.ts, -item.end_ts)) + ordered_queries = sorted(set(float(ts) for ts in query_times)) + results: Dict[float, Optional[FrameResolution]] = {} + active_frames: List[PythonFrame] = [] + frame_idx = 0 + total_frames = len(ordered_frames) + + for ts in ordered_queries: + while frame_idx < total_frames and ordered_frames[frame_idx].ts <= ts: + active_frames.append(ordered_frames[frame_idx]) + frame_idx += 1 + if active_frames: + active_frames = [ + frame for frame in active_frames if frame.end_ts >= ts - 1e-3 + ] + results[ts] = render_frame_resolution(active_frames) + return results + + +def build_frame_resolution_index( + python_frames: Dict[Tuple[str, str], List[PythonFrame]], + query_times_by_thread: Dict[Tuple[str, str], Sequence[float]], +) -> Dict[Tuple[str, str], Dict[float, Optional[FrameResolution]]]: + output: Dict[Tuple[str, str], Dict[float, Optional[FrameResolution]]] = {} + for thread_key, query_times in query_times_by_thread.items(): + frames = python_frames.get(thread_key, []) + output[thread_key] = resolve_thread_query_times(frames, query_times) + return output + + +def find_active_python_frames( 
+ cpu_op: CpuOpEvent, + python_frames: Dict[Tuple[str, str], List[PythonFrame]], +) -> List[PythonFrame]: + frames = python_frames.get((cpu_op.pid, cpu_op.tid), []) + if not frames: + return [] + probe_ts = cpu_op.ts + min(cpu_op.dur * 0.5, 1.0) + return resolve_active_frames_linear(frames, probe_ts) + + +def find_active_python_frames_at_ts( + *, + pid: str, + tid: str, + ts: float, + python_frames: Dict[Tuple[str, str], List[PythonFrame]], +) -> List[PythonFrame]: + frames = python_frames.get((pid, tid), []) + if not frames: + return [] + return resolve_active_frames_linear(frames, ts) + + +def render_kernel_site( + active_frames: Sequence[PythonFrame], cpu_op_name: str +) -> Tuple[str, str, str]: + chosen_frame = choose_mapping_frame(active_frames) + if chosen_frame is None: + return "unresolved", "", cpu_op_name + return chosen_frame.normalized_name, build_stack_display(active_frames), cpu_op_name + + +def resolve_kernel_site_context( + kernel: KernelEvent, + cpu_ops_by_external_id: Dict[int, TimedEventIndex], + python_frames: Dict[Tuple[str, str], List[PythonFrame]], + launches_by_correlation: Dict[int, TimedEventIndex], + frame_resolution_index: Optional[ + Dict[Tuple[str, str], Dict[float, Optional[FrameResolution]]] + ] = None, +) -> Tuple[str, str, str]: + # Prefer the normal External-id path first. If the kernel dropped that link, + # fall back to the correlated CUDA launch and reuse the Python frames that + # were active when the launch happened. 
+ cpu_op = match_cpu_op(kernel, cpu_ops_by_external_id) + if cpu_op is not None: + probe_ts = cpu_op.ts + min(cpu_op.dur * 0.5, 1.0) + if frame_resolution_index is not None: + resolved = frame_resolution_index.get((cpu_op.pid, cpu_op.tid), {}).get( + probe_ts + ) + if resolved is not None: + return resolved.location, resolved.stack, cpu_op.name + active_frames = find_active_python_frames(cpu_op, python_frames) + if active_frames: + return render_kernel_site(active_frames, cpu_op.name) + + launch_event = match_launch_event(kernel, launches_by_correlation) + if launch_event is not None: + if frame_resolution_index is not None: + resolved = frame_resolution_index.get( + (launch_event.pid, launch_event.tid), {} + ).get(launch_event.ts) + if resolved is not None: + cpu_op_name = cpu_op.name if cpu_op is not None else launch_event.name + return resolved.location, resolved.stack, cpu_op_name + active_frames = find_active_python_frames_at_ts( + pid=launch_event.pid, + tid=launch_event.tid, + ts=launch_event.ts, + python_frames=python_frames, + ) + if active_frames: + cpu_op_name = cpu_op.name if cpu_op is not None else launch_event.name + return render_kernel_site(active_frames, cpu_op_name) + return "unresolved", "", launch_event.name + + cpu_op_name = cpu_op.name if cpu_op is not None else "" + return "unresolved", "", cpu_op_name + + +def choose_mapping_frame(active_frames: Sequence[PythonFrame]) -> Optional[PythonFrame]: + if not active_frames: + return None + best = active_frames[0] + best_key = (best.priority, best.ts, -best.dur) + for item in active_frames[1:]: + key = (item.priority, item.ts, -item.dur) + if key > best_key: + best = item + best_key = key + return best + + +def build_stack_display(active_frames: Sequence[PythonFrame]) -> str: + if not active_frames: + return "" + filtered = [item.normalized_name for item in active_frames if item.priority > 0] + if not filtered: + filtered = [active_frames[-1].normalized_name] + return " -> ".join(filtered[-4:]) + + 
+def aggregate(events: Iterable[KernelEvent], key_fn) -> Dict[str, Aggregate]: + output: Dict[str, Aggregate] = defaultdict(Aggregate) + for event in events: + key = key_fn(event) + item = output[key] + item.total_us += event.dur + item.count += 1 + item.max_us = max(item.max_us, event.dur) + return output + + +def group_kernels_by_stage( + kernels: Sequence[KernelEvent], default_stage: str +) -> Dict[str, List[KernelEvent]]: + grouped: DefaultDict[str, List[KernelEvent]] = defaultdict(list) + for kernel in kernels: + stage = default_stage if default_stage != "all" else (kernel.stage or "all") + grouped[stage].append(kernel) + return dict(grouped) + + +def aggregate_kernel_sites( + kernels: Sequence[KernelEvent], + cpu_ops_by_external_id: Dict[int, TimedEventIndex], + python_frames: Dict[Tuple[str, str], List[PythonFrame]], + launches_by_correlation: Optional[Dict[int, TimedEventIndex]] = None, + site_context_cache: Optional[ + Dict[Tuple[str, str, float, Optional[int], Optional[int]], Tuple[str, str, str]] + ] = None, +) -> Dict[str, Dict[str, MappingSiteAggregate]]: + # Each kernel is mapped independently so the fallback behavior stays easy to + # reason about and easy to regression-test. 
+ output: DefaultDict[str, DefaultDict[str, MappingSiteAggregate]] = defaultdict( + lambda: defaultdict(MappingSiteAggregate) + ) + launch_index = launches_by_correlation or {} + query_times_by_thread: DefaultDict[Tuple[str, str], List[float]] = defaultdict(list) + for kernel in kernels: + cpu_op = match_cpu_op(kernel, cpu_ops_by_external_id) + if cpu_op is not None: + query_times_by_thread[(cpu_op.pid, cpu_op.tid)].append( + cpu_op.ts + min(cpu_op.dur * 0.5, 1.0) + ) + launch_event = match_launch_event(kernel, launch_index) + if launch_event is not None: + query_times_by_thread[(launch_event.pid, launch_event.tid)].append( + launch_event.ts + ) + frame_resolution_index = build_frame_resolution_index( + python_frames, query_times_by_thread + ) + resolved_cache = site_context_cache if site_context_cache is not None else {} + for kernel in kernels: + cache_key = ( + kernel.pid, + kernel.tid, + kernel.ts, + kernel.external_id, + kernel.correlation, + ) + cached = resolved_cache.get(cache_key) + if cached is None: + cached = resolve_kernel_site_context( + kernel, + cpu_ops_by_external_id, + python_frames, + launch_index, + frame_resolution_index=frame_resolution_index, + ) + resolved_cache[cache_key] = cached + location, stack, cpu_op_name = cached + + item = output[kernel.canonical_name][location] + item.total_us += kernel.dur + item.count += 1 + if cpu_op_name: + item.cpu_ops[cpu_op_name] += 1 + if stack: + item.stacks[stack] += 1 + return {kernel_name: dict(locations) for kernel_name, locations in output.items()} + + +def merge_site_stats( + destination: DefaultDict[str, DefaultDict[str, MappingSiteAggregate]], + source: Dict[str, Dict[str, MappingSiteAggregate]], +) -> None: + for kernel_name, locations in source.items(): + for location, aggregate_item in locations.items(): + target = destination[kernel_name][location] + target.total_us += aggregate_item.total_us + target.count += aggregate_item.count + target.cpu_ops.update(aggregate_item.cpu_ops) + 
target.stacks.update(aggregate_item.stacks) + + +def build_stage_payload( + site_stats: Dict[str, Dict[str, MappingSiteAggregate]], + kernel_categories: Dict[str, str], +) -> Dict[str, dict]: + kernels_payload: Dict[str, dict] = {} + for kernel_name, locations in sorted(site_stats.items()): + total_us = sum(item.total_us for item in locations.values()) + sites = [] + for location, aggregate_item in sorted( + locations.items(), + key=lambda pair: pair[1].total_us, + reverse=True, + ): + sites.append( + { + "location": location, + "display_location": extract_preferred_stack_location( + aggregate_item.stacks.most_common(1)[0][0] + if aggregate_item.stacks + else None + ) + or location, + "launches": aggregate_item.count, + "total_us": round(aggregate_item.total_us, 3), + "share_pct_within_kernel": round( + pct(aggregate_item.total_us, total_us), 3 + ), + "top_cpu_op": ( + aggregate_item.cpu_ops.most_common(1)[0][0] + if aggregate_item.cpu_ops + else None + ), + "stack": ( + aggregate_item.stacks.most_common(1)[0][0] + if aggregate_item.stacks + else None + ), + } + ) + sites.sort( + key=lambda site: ( + source_location_priority(site_display_location(site)), + float(site.get("total_us", 0.0)), + int(site.get("launches", 0)), + ), + reverse=True, + ) + kernels_payload[kernel_name] = { + "category": kernel_categories.get(kernel_name, "other"), + "sites": sites, + "best_location": ( + site_display_location(sites[0]) + if sites + else choose_best_location(locations) + ), + } + return {"kernels": kernels_payload} + + +def load_kernel_map(path: Path) -> dict: + with open(path, "r", encoding="utf-8") as handle: + return json.load(handle) + + +def relaxed_kernel_entry_lookup( + kernels: Dict[str, dict], kernel_name: str +) -> Optional[dict]: + if kernel_name in kernels: + return kernels[kernel_name] + lowered = kernel_name.lower() + best_key = None + best_score = -1 + for candidate_key in kernels: + candidate_lowered = candidate_key.lower() + if 
candidate_lowered.startswith(lowered) or lowered.startswith( + candidate_lowered + ): + score = min(len(candidate_lowered), len(lowered)) + elif candidate_lowered in lowered or lowered in candidate_lowered: + score = min(len(candidate_lowered), len(lowered)) // 2 + else: + continue + if score > best_score: + best_key = candidate_key + best_score = score + if best_key: + return kernels.get(best_key) + + # Long auto-generated kernels such as CUTLASS / FlashAttention templates can + # differ in the middle of the symbol while still sharing the same high-level + # family. Fall back to a conservative common-prefix match so we can still + # recover the higher-level Python callsite from the mapping trace. + lowered_compact = normalize_match_text(kernel_name) + if len(lowered_compact) < 96: + return alias_kernel_entry_lookup(kernels, kernel_name) + + def common_prefix_len(left: str, right: str) -> int: + count = 0 + for left_ch, right_ch in zip(left, right): + if left_ch != right_ch: + break + count += 1 + return count + + best_key = None + best_score = -1 + for candidate_key in kernels: + candidate_compact = normalize_match_text(candidate_key) + if len(candidate_compact) < 96: + continue + prefix_len = common_prefix_len(lowered_compact, candidate_compact) + shorter_len = min(len(lowered_compact), len(candidate_compact)) + if prefix_len < 64 or prefix_len < int(shorter_len * 0.4): + continue + score = prefix_len + if lowered_compact.startswith( + "voidcutlassdevicekernelflash" + ) and candidate_compact.startswith("voidcutlassdevicekernelflash"): + score += 32 + if score > best_score: + best_key = candidate_key + best_score = score + if best_key: + return kernels.get(best_key) + return alias_kernel_entry_lookup(kernels, kernel_name) + + +def lookup_kernel_map_entry( + kernel_map: dict, stage: str, kernel_name: str +) -> Optional[dict]: + stage_map = kernel_map.get("stages", {}) + for candidate_stage in stage_aliases(stage): + entry = relaxed_kernel_entry_lookup( + 
stage_map.get(candidate_stage, {}).get("kernels", {}), + kernel_name, + ) + if entry: + return entry + return relaxed_kernel_entry_lookup( + kernel_map.get("global", {}).get("kernels", {}), kernel_name + ) + + +def best_site_summary(kernel_entry: Optional[dict]) -> Tuple[str, str]: + if not kernel_entry: + return "unresolved", "-" + sites = kernel_entry.get("sites") or [] + if not sites: + return kernel_entry.get("best_location", "unresolved"), "-" + preferred_sites = [ + site + for site in sites + if is_preferred_source_location(site_display_location(site)) + ] + candidate_sites = preferred_sites or sites + rendered_locations = [] + rendered_cpu_ops = [] + for site in candidate_sites[:2]: + location = site_display_location(site) + share = site.get("share_pct_within_kernel") + if len(candidate_sites) > 1 and share is not None: + rendered_locations.append(f"{location} (site share {share:.0f}%)") + else: + rendered_locations.append(location) + cpu_op = site.get("top_cpu_op") + if cpu_op: + rendered_cpu_ops.append(cpu_op) + return "
".join(rendered_locations), ( + "
".join(rendered_cpu_ops) if rendered_cpu_ops else "-" + ) + + +def resolve_kernel_entry( + stage: str, + kernel_name: str, + local_stage_payload: dict, + external_kernel_map: Optional[dict], +) -> Optional[dict]: + if external_kernel_map: + kernel_entry = lookup_kernel_map_entry(external_kernel_map, stage, kernel_name) + if kernel_entry: + return kernel_entry + return relaxed_kernel_entry_lookup( + local_stage_payload.get("kernels", {}), kernel_name + ) + + +def build_kernel_rows( + stage: str, + kernel_stats: Dict[str, Aggregate], + kernel_categories: Dict[str, str], + local_stage_payload: dict, + external_kernel_map: Optional[dict], +) -> List[KernelRow]: + rows: List[KernelRow] = [] + for kernel_name, aggregate_item in sorted( + kernel_stats.items(), + key=lambda pair: pair[1].total_us, + reverse=True, + ): + kernel_entry = resolve_kernel_entry( + stage, kernel_name, local_stage_payload, external_kernel_map + ) + location, cpu_op = best_site_summary(kernel_entry) + rows.append( + KernelRow( + name=kernel_name, + category=kernel_categories.get(kernel_name, "other"), + aggregate=aggregate_item, + location=location, + cpu_op=cpu_op, + entry=kernel_entry, + ) + ) + return rows + + +def limit_kernel_rows(rows: Sequence[KernelRow], table_limit: int) -> List[KernelRow]: + if table_limit <= 0: + return list(rows) + return list(rows[:table_limit]) + + +def entry_sites(kernel_entry: Optional[dict]) -> List[dict]: + if not kernel_entry: + return [] + sites = kernel_entry.get("sites") or [] + return [site for site in sites if site.get("location")] + + +def ordered_unique(values: Iterable[str], limit: int = 4) -> List[str]: + output: List[str] = [] + seen = set() + for value in values: + item = str(value).strip() + if not item or item in seen: + continue + seen.add(item) + output.append(item) + if len(output) >= limit: + break + return output + + +def kernel_row_locations(row: KernelRow, limit: int = 4) -> List[str]: + values = [site_display_location(site) for site in 
entry_sites(row.entry)] + if not values and row.location and row.location != "unresolved": + values = [fragment.strip() for fragment in row.location.split("
")] + return ordered_unique(values, limit=limit) + + +def format_location_for_fusion_display(location: str) -> str: + text = normalize_text(location) + match = re.match(r"(?P<path>.+?):(?P<line>\d+)\s+(?P<func>.+)$", text) + if not match: + return text + return f"{match.group('func')} @ {match.group('path')}:{match.group('line')}" + + +def normalize_match_text(text: object) -> str: + return re.sub(r"[^0-9A-Za-z]+", "", normalize_text(text)).lower() + + +def kernel_entry_total_us(entry: Optional[dict]) -> float: + if not entry: + return 0.0 + return sum(float(site.get("total_us", 0.0)) for site in entry.get("sites", [])) + + +def kernel_entry_lookup_text(kernel_name: str, entry: Optional[dict]) -> str: + parts = [kernel_name] + if entry: + parts.append(str(entry.get("best_location") or "")) + for site in entry.get("sites", [])[:4]: + parts.append(str(site.get("location") or "")) + parts.append(str(site.get("display_location") or "")) + parts.append(str(site.get("top_cpu_op") or "")) + parts.append(str(site.get("stack") or "")) + return normalize_match_text(" ".join(parts)) + + +def kernel_alias_token_groups(kernel_name: str) -> List[Tuple[str, ...]]: + lowered = normalize_match_text(kernel_name) + groups: List[Tuple[str, ...]] = [] + if "flashattnfwdcombine" in lowered: + groups.append( + ( + "flashattnfwdsm90", + "flashattnvarlenfunc", + "vllmflashattnflashattninterface", + "vllmfa3cfwd", + ) + ) + if "kernelmha" in lowered: + groups.append( + ( + "maskedmultiheadattentionkernel", + "attentioninplace", + "attentionbackendtrtllm", + ) + ) + if "applybiasropeupdatekvcachev2" in lowered: + groups.append( + ( + "fusedqknormropekernel", + "applyqknormrope", + "modelingqwen3py98applyqknormrope", + ) + ) + if lowered.startswith("memset"): + groups.append(("memset",)) + return groups + + +def alias_kernel_entry_lookup( + kernels: Dict[str, dict], kernel_name: str +) -> Optional[dict]: + alias_groups = kernel_alias_token_groups(kernel_name) + if not alias_groups: + return None + + best_key = 
None + best_score = -1 + for candidate_key, entry in kernels.items(): + candidate_text = kernel_entry_lookup_text(candidate_key, entry) + score = 0 + for group_index, group in enumerate(alias_groups): + group_score = max( + (len(token) for token in group if token in candidate_text), + default=0, + ) + if group_score: + score += 1000 * (group_index + 1) + group_score + if score <= 0: + continue + score += max( + source_location_priority(str(entry.get("best_location") or "")), + source_location_priority(best_site_summary(entry)[0]), + ) + score += int(kernel_entry_total_us(entry) // 10) + if score > best_score: + best_key = candidate_key + best_score = score + return kernels.get(best_key) if best_key else None + + +def row_matches(row: KernelRow, *needles: str) -> bool: + lowered = " ".join([row.name, row.location, row.cpu_op]).lower() + lowered_compact = normalize_match_text(lowered) + for needle in needles: + needle_lowered = needle.lower() + if needle_lowered in lowered: + return True + needle_compact = normalize_match_text(needle) + if needle_compact and needle_compact in lowered_compact: + return True + return False + + +def summarize_text(values: Iterable[str], limit: int = 4) -> str: + items = ordered_unique(values, limit=limit) + return "
".join(items) if items else "-" + + +def summarize_locations(values: Iterable[str], limit: int = 4) -> str: + items = ordered_unique( + (format_location_for_fusion_display(value) for value in values), + limit=limit, + ) + return "
".join(items) if items else "-" + + +def summarize_evidence( + rows: Sequence[KernelRow], + total_us: float, + limit: int = 3, + min_share_pct: float = 1.0, +) -> str: + items = [] + for row in rows: + share = pct(row.total_us, total_us) + if share < min_share_pct: + continue + items.append(f"{row.name} ({share:.1f}%)") + if len(items) >= limit: + break + return "
".join(items) if items else "-" + + +def model_path_from_server_args(server_args: Optional[dict]) -> str: + if not isinstance(server_args, dict): + return "" + return str(server_args.get("model_path") or server_args.get("model") or "") + + +def fusion_framework_hints(spec: FusionPatternSpec) -> set[str]: + text = normalize_text(spec.candidate_path).lower() + hints: set[str] = set() + if "vllm/" in text: + hints.add("vllm") + if "tensorrt_llm/" in text: + hints.add("trtllm") + if any(token in text for token in ("python/sglang/", "sgl-kernel/", "sgl_kernel/")): + hints.add("sglang") + return hints + + +def pattern_supports_framework( + spec: FusionPatternSpec, framework: Optional[str] +) -> bool: + normalized = normalize_text(framework).lower() + if not normalized or normalized == "auto": + return True + hints = fusion_framework_hints(spec) + if not hints: + return True + return normalized in hints + + +def matching_rows_for_keywords( + kernel_rows: Sequence[KernelRow], + keywords: Sequence[str], +) -> List[KernelRow]: + if not keywords: + return [] + return [row for row in kernel_rows if row_matches(row, *keywords)] + + +def row_identity(row: KernelRow) -> Tuple[str, str, str]: + return (row.name, row.location, row.cpu_op) + + +def merge_kernel_rows(*groups: Sequence[KernelRow]) -> List[KernelRow]: + output: List[KernelRow] = [] + seen = set() + for group in groups: + for row in group: + row_key = row_identity(row) + if row_key in seen: + continue + seen.add(row_key) + output.append(row) + return output + + +def pattern_model_matches(spec: FusionPatternSpec, model_path: str) -> bool: + if spec.model_include and not any( + token in model_path for token in spec.model_include + ): + return False + if spec.model_exclude and any(token in model_path for token in spec.model_exclude): + return False + return True + + +def pattern_status(spec: FusionPatternSpec, has_active_match: bool) -> str: + if spec.origin == "mainline": + return "mainline direct" if has_active_match 
else "mainline split" + if spec.origin == "upstream": + return "upstream direct" if has_active_match else "upstream split" + return "pending direct" if has_active_match else "pending split" + + +def build_pattern_rationale( + spec: FusionPatternSpec, + has_active_match: bool, + related_us: float, + total_us: float, +) -> str: + share = pct(related_us, total_us) + if spec.origin == "mainline": + if has_active_match: + return ( + f"`{spec.pattern}` is present in this trace ({share:.1f}% related GPU time). " + f"{spec.rationale_hint}" + ) + return ( + f"Split kernels in this family take {share:.1f}% of GPU time. " + f"This tree already has a matching path. {spec.rationale_hint}" + ) + if spec.origin == "upstream": + return ( + f"Matches an upstream path ({share:.1f}% related GPU time). " + f"{spec.rationale_hint}" + ) + return ( + f"Matches an open upstream path ({share:.1f}% related GPU time). " + f"{spec.rationale_hint}" + ) + + +def pattern_span(spec: FusionPatternSpec) -> int: + return max(len(spec.split_groups), 1 if spec.active_keywords else 0) + + +def fusion_priority_key(item: FusionOpportunity) -> Tuple[int, int, int, float]: + return ( + item.priority, + item.pattern_span, + len(item.covered_row_keys), + item.related_us, + ) + + +def detect_pattern_match( + spec: FusionPatternSpec, + kernel_rows: Sequence[KernelRow], + total_us: float, + model_path: str, + tp_size: int, + framework: Optional[str], +) -> Optional[FusionOpportunity]: + if total_us <= 0: + return None + if not pattern_supports_framework(spec, framework): + return None + if spec.require_tp and tp_size < spec.min_tp_size: + return None + if not pattern_model_matches(spec, model_path): + return None + + active_rows = matching_rows_for_keywords(kernel_rows, spec.active_keywords) + split_groups = [ + matching_rows_for_keywords(kernel_rows, keywords) + for keywords in spec.split_groups + ] + has_active_match = bool(active_rows) + has_split_match = bool(split_groups) and all(split_groups) + if not 
has_active_match and not has_split_match: + return None + + related_rows = merge_kernel_rows(active_rows, *split_groups) + related_us = sum(row.total_us for row in related_rows) + if related_us <= 0: + return None + if not has_active_match and pct(related_us, total_us) < spec.min_share: + return None + + return FusionOpportunity( + pattern=spec.pattern, + status=pattern_status(spec, has_active_match), + confidence=( + "Confirmed" + if has_active_match or pct(related_us, total_us) >= spec.likely_share + else "Candidate" + ), + related_us=related_us, + evidence=summarize_evidence(related_rows, total_us), + current_locations=summarize_locations( + location for row in related_rows for location in kernel_row_locations(row) + ), + candidate_path=spec.candidate_path, + rationale=build_pattern_rationale( + spec=spec, + has_active_match=has_active_match, + related_us=related_us, + total_us=total_us, + ), + covered_row_keys=tuple(row_identity(row) for row in related_rows), + pattern_span=pattern_span(spec), + has_active_match=has_active_match, + priority=spec.priority, + subsumes=spec.subsumes, + ) + + +def detect_fusion_opportunities( + kernel_rows: Sequence[KernelRow], + total_us: float, + server_args: Optional[dict], + framework: Optional[str] = None, +) -> List[FusionOpportunity]: + opportunities: List[FusionOpportunity] = [] + if total_us <= 0: + return opportunities + + model_path = model_path_from_server_args(server_args).lower() + tp_size = 1 + if isinstance(server_args, dict): + tp_size = int(server_args.get("tp_size") or 1) + + raw_matches: List[FusionOpportunity] = [] + for spec in FUSION_PATTERN_REGISTRY: + opportunity = detect_pattern_match( + spec=spec, + kernel_rows=kernel_rows, + total_us=total_us, + model_path=model_path, + tp_size=tp_size, + framework=framework, + ) + if opportunity is not None: + raw_matches.append(opportunity) + + raw_matches.sort(key=fusion_priority_key, reverse=True) + consumed_row_keys = set() + blocked_patterns = set() + for 
opportunity in raw_matches: + if opportunity.pattern in blocked_patterns: + continue + if any( + row_key in consumed_row_keys for row_key in opportunity.covered_row_keys + ): + continue + opportunities.append(opportunity) + consumed_row_keys.update(opportunity.covered_row_keys) + blocked_patterns.update(opportunity.subsumes) + return opportunities diff --git a/.claude/skills/llm-torch-profiler-analysis/scripts/triage_overlap_helpers.py b/.claude/skills/llm-torch-profiler-analysis/scripts/triage_overlap_helpers.py new file mode 100644 index 000000000000..0b38e588cff5 --- /dev/null +++ b/.claude/skills/llm-torch-profiler-analysis/scripts/triage_overlap_helpers.py @@ -0,0 +1,1747 @@ +"""Internal overlap helpers for triage-only torch-profiler analysis.""" + +from __future__ import annotations + +import re +from bisect import bisect_left, bisect_right +from collections import Counter, defaultdict +from dataclasses import dataclass, field +from pathlib import Path +from typing import Dict, Iterable, List, Optional, Sequence, Tuple + +import triage_kernel_helpers as kernel_helpers +from profile_common import ( + coerce_optional_int, + contains_any_keyword, + extract_trace_events, + has_stream_marker, + is_annotation_event, + is_complete_duration_event, + is_non_kernel_trace_category, + is_trace_metadata_name, + looks_like_python_scope_name, + normalize_repo_relative_path, + normalize_text, + select_heaviest_pid, +) + +SOURCE_MAP_SAMPLE_LIMIT_PER_NAME = 16 + +COMMUNICATION_STRONG_KEYWORDS = ( + "allreduce", + "all_reduce", + "reduce_scatter", + "allgather", + "all_gather", + "nccl", + "cross_device_reduce", + "deepep", + "a2a", + "alltoall", + "allreduce_fusion", + "mooncake", +) + +COMMUNICATION_WEAK_KEYWORDS = ( + "broadcast", + "dispatch", + "combine", +) + +MEMORY_STRONG_KEYWORDS = ( + "memcpy", + "memset", + "dma", + "prefetch", +) + +MEMORY_WEAK_KEYWORDS = ( + "fill", + "copy", +) + +ELEMENTWISE_KEYWORDS = ( + "sigmoid", + "silu", + "gelu", + "relu", + "softmax", + 
"layernorm", + "rmsnorm", + "norm", + "rotary", + "rope", + "topk", + "gate", + "bias", + "_cast", + "index", + "gather", + "scatter", + "masked", + "elementwise", + "activation", +) + +COMPUTE_KEYWORDS = ( + "cublas", + "cudnn", + "cutlass", + "triton", + "gemm", + "gemv", + "matmul", + "grouped_mm", + "flash", + "attention", + "fmha", + "marlin", + "fused_moe", + "moe_kernel", + "groupgemm", + "mma", + "wgmma", + "conv", + "bmm", + "mm_kernel", +) + +LOW_SIGNAL_FUNCTION_TOKENS = ( + "__torch_function__", + "__torch_dispatch__", + "__call__", + "_call_impl", + "_wrapped_call_impl", +) + +LOW_SIGNAL_PATH_TOKENS = ( + "model_executor/parameter.py(", + "model_executor/parameter.py:", + "model_executor/cuda_graph_runner.py(", + "model_executor/cuda_graph_runner.py:", + "compilation/cuda_graph.py(", + "compilation/cuda_graph.py:", + "pyexecutor/cuda_graph_runner.py(", + "pyexecutor/cuda_graph_runner.py:", + "pyexecutor/py_executor.py(", + "pyexecutor/py_executor.py:", + "_torch/utils.py(", + "_torch/utils.py:", + "torch/fx/graph_module.py(", + "torch/fx/graph_module.py:", +) + +CATEGORY_PRIORITY = { + "compute": 4, + "communication": 3, + "memory": 2, + "elementwise": 1, + "other": 0, +} + +PYTHON_SCOPE_IGNORE_PREFIXES = ( + "threading.py(", + "selectors.py(", + "contextlib.py(", + "queue.py(", + "logging/", + "logging/__init__.py(", + "socket.py(", + "asyncio/", + "concurrent/futures/", + "tqdm/", + "uvicorn/", + "fastapi/", + "starlette/", + "http/", + "torch/_ops.py(", + "torch/nn/modules/module.py(", + "torch/utils/_contextlib.py(", + "torch/autograd/", + "torch/_tensor.py(", + "torch/distributed/", + "torch/_dynamo/", + "torch/_inductor/", +) +KERNEL_NAME_HINTS = ( + COMMUNICATION_STRONG_KEYWORDS + + COMMUNICATION_WEAK_KEYWORDS + + MEMORY_STRONG_KEYWORDS + + MEMORY_WEAK_KEYWORDS + + COMPUTE_KEYWORDS +) + + +@dataclass +class KernelEvent: + idx: int + name: str + canonical_name: str + category: str + pid: str + tid: str + stream: str + ts: float + dur: float + end: 
float + stage: str = "all" + external_id: Optional[int] = None + correlation: Optional[int] = None + hidden_us: float = 0.0 + exclusive_us: float = 0.0 + hidden_by_compute_us: float = 0.0 + overlap_with: Counter = field(default_factory=Counter) + + +@dataclass +class AggregateStats: + name: str + category: str + count: int = 0 + total_us: float = 0.0 + hidden_us: float = 0.0 + exclusive_us: float = 0.0 + hidden_by_compute_us: float = 0.0 + overlap_with: Counter = field(default_factory=Counter) + representative_idx: Optional[int] = None + representative_score: float = -1.0 + + @property + def hidden_ratio(self) -> float: + return self.hidden_us / self.total_us if self.total_us else 0.0 + + @property + def exclusive_ratio(self) -> float: + return self.exclusive_us / self.total_us if self.total_us else 0.0 + + +@dataclass +class PythonScope: + name: str + normalized_name: str + pid: str + tid: str + ts: float + dur: float + end: float + is_meaningful: bool = False + is_fallback: bool = False + + +@dataclass +class CPUOpContext: + external_id: int + cpu_op_name: str + pid: str + tid: str + ts: float + dur: float + end: float + scope_chain: Tuple[str, ...] 
+ + +@dataclass +class KernelSourceStats: + name: str + total_count: int = 0 + mapped_count: int = 0 + scope_counter: Counter = field(default_factory=Counter) + chain_counter: Counter = field(default_factory=Counter) + launch_op_counter: Counter = field(default_factory=Counter) + site_share_counter: Counter = field(default_factory=Counter) + + @property + def mapping_ratio(self) -> float: + return self.mapped_count / self.total_count if self.total_count else 0.0 + + @property + def best_scope(self) -> Optional[str]: + return self.scope_counter.most_common(1)[0][0] if self.scope_counter else None + + @property + def best_chain(self) -> Optional[str]: + return self.chain_counter.most_common(1)[0][0] if self.chain_counter else None + + @property + def best_launch_op(self) -> Optional[str]: + return ( + self.launch_op_counter.most_common(1)[0][0] + if self.launch_op_counter + else None + ) + + +@dataclass +class TraceBundle: + label: str + trace_path: Path + server_args: Optional[dict] + raw_events: Sequence[dict] + events: List[KernelEvent] + pid: Optional[str] + overlap_stats: Optional[Dict[str, float]] = None + + +@dataclass +class ActionRow: + priority: str + verdict: str + kernel: str + category: str + total_us: float + share_pct: float + exclusive_ratio: float + hidden_ratio: float + python_scope: str + launch_op: str + mapping_ratio: float + dependency_signal: str + prev_neighbor: str + next_neighbor: str + recommendation: str + suggestion: str + representative_idx: Optional[int] + + +def short_name(name: str, max_len: int = 80) -> str: + name = normalize_text(name) + if len(name) <= max_len: + return name + return name[: max_len - 3] + "..." 
+ + +def canonicalize_name(name: str) -> str: + name = normalize_text(name) + name = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", name) + if name.startswith("void ") and name.endswith(")"): + depth = 0 + split_idx: Optional[int] = None + for idx in range(len(name) - 1, -1, -1): + char = name[idx] + if char == ")": + depth += 1 + elif char == "(": + depth -= 1 + if depth == 0: + split_idx = idx + break + if split_idx is not None: + name = name[:split_idx] + return name + + +def canonicalize_python_scope_name(name: str) -> str: + name = normalize_text(name) + name = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", name) + match = re.match(r"(?P<path>.+?)\((?P<line>\d+)\): (?P<func>.+)$", name) + if match: + path = normalize_repo_relative_path(match.group("path")) + name = f"{path}({match.group('line')}): {match.group('func')}" + return name + + +def canonicalize_cpu_op_name(name: str) -> str: + return short_name(normalize_text(name), max_len=100) + + +def classify_kernel(name: str) -> str: + # This script only needs broad overlap buckets, so keep the precedence small + # and deterministic: memory/communication first, then compute/elementwise. 
+ lowered = name.lower() + looks_compute_like = contains_any_keyword(lowered, COMPUTE_KEYWORDS) + if contains_any_keyword(lowered, MEMORY_STRONG_KEYWORDS): + return "memory" + if contains_any_keyword(lowered, COMMUNICATION_STRONG_KEYWORDS): + return "communication" + if contains_any_keyword(lowered, COMPUTE_KEYWORDS): + return "compute" + if contains_any_keyword(lowered, ELEMENTWISE_KEYWORDS): + return "elementwise" + if contains_any_keyword(lowered, MEMORY_WEAK_KEYWORDS) and not looks_compute_like: + return "memory" + if ( + contains_any_keyword(lowered, COMMUNICATION_WEAK_KEYWORDS) + and not looks_compute_like + ): + return "communication" + if lowered.startswith("void "): + return "other" + return "other" + + +def is_kernel_event(event: dict) -> bool: + # The overlap helpers prefer a slightly broader kernel detector than the + # kernel-attribution helpers, but still reject annotations and Python + # frames up front so the later overlap math only sees real GPU work. + if not is_complete_duration_event(event): + return False + name = normalize_text(event.get("name", "")) + if is_trace_metadata_name(name): + return False + cat = normalize_text(event.get("cat", "")).lower() + args = event.get("args", {}) or {} + if is_non_kernel_trace_category(cat): + return False + if is_annotation_event(name, cat): + return False + if "kernel" in cat or cat.startswith("gpu_"): + return True + lowered = name.lower() + if looks_like_python_scope_name(name): + return False + if has_stream_marker(args) and ( + lowered.startswith("void ") + or lowered.startswith("ampere_") + or lowered.startswith("sm80_") + or lowered.startswith("sm90_") + or contains_any_keyword(lowered, KERNEL_NAME_HINTS) + ): + return True + return False + + +def is_meaningful_python_scope(name: str) -> bool: + normalized = canonicalize_python_scope_name(name) + if not normalized: + return False + if normalized.startswith(" bool: + normalized = canonicalize_python_scope_name(name) + if ( + not normalized + or 
normalized.startswith(" Dict[Tuple[str, str], str]: + mapping: Dict[Tuple[str, str], str] = {} + for event in events: + if event.get("ph") != "M" or event.get("name") != "thread_name": + continue + pid = str(event.get("pid")) + tid = str(event.get("tid")) + thread_name = str((event.get("args") or {}).get("name", "")) + if thread_name: + mapping[(pid, tid)] = thread_name + return mapping + + +def build_correlation_external_lookup(raw_events: Sequence[dict]) -> Dict[int, int]: + lookup: Dict[int, int] = {} + for event in raw_events: + args = event.get("args", {}) or {} + correlation = coerce_optional_int(args.get("correlation")) + external_id = coerce_optional_int(args.get("External id")) + if correlation is not None and external_id is not None: + lookup[correlation] = external_id + return lookup + + +def extract_kernel_events( + trace: dict, pid_substring: Optional[str] +) -> Tuple[List[KernelEvent], Optional[str]]: + # We first build a clean kernel list from the chosen TP rank, then later + # overlap analysis can stay focused on stream timing instead of trace noise. 
+ raw_events = extract_trace_events(trace) + thread_names = extract_thread_names(raw_events) + correlation_external = build_correlation_external_lookup(raw_events) + ( + annotations_by_external_id, + gpu_stage_annotations, + cpu_stage_annotations, + ) = kernel_helpers.build_stage_annotations(raw_events) + chosen_pid = select_heaviest_pid( + raw_events, + is_kernel_event, + pid_substring=pid_substring, + preferred_substrings=(() if pid_substring else ("TP00",)), + ) + kernel_events: List[KernelEvent] = [] + if chosen_pid is None: + return kernel_events, None + + idx = 0 + for event in raw_events: + if not is_kernel_event(event): + continue + pid = str(event.get("pid")) + if pid != chosen_pid: + continue + tid = str(event.get("tid")) + args = event.get("args", {}) or {} + stream = ( + args.get("stream") + or args.get("cuda_stream") + or thread_names.get((pid, tid)) + or f"tid={tid}" + ) + correlation = coerce_optional_int(args.get("correlation")) + external_id = coerce_optional_int(args.get("External id")) + if external_id is None and correlation is not None: + external_id = correlation_external.get(correlation) + name = str(event["name"]) + dur = float(event["dur"]) + ts = float(event["ts"]) + kernel_events.append( + KernelEvent( + idx=idx, + name=name, + canonical_name=canonicalize_name(name), + category=classify_kernel(name), + stage=kernel_helpers.resolve_kernel_stage( + kernel_ts=ts, + external_id=external_id, + annotations_by_external_id=annotations_by_external_id, + gpu_annotations=gpu_stage_annotations, + cpu_annotations=cpu_stage_annotations, + ), + pid=pid, + tid=tid, + stream=str(stream), + ts=ts, + dur=dur, + end=ts + dur, + external_id=external_id, + correlation=correlation, + ) + ) + idx += 1 + return kernel_events, chosen_pid + + +def group_events_by_stage( + events: Sequence[KernelEvent], default_stage: str +) -> Dict[str, List[KernelEvent]]: + grouped: Dict[str, List[KernelEvent]] = defaultdict(list) + for event in events: + stage = default_stage if 
default_stage != "all" else (event.stage or "all") + grouped[stage].append(event) + return dict(grouped) + + +def dominant_overlap_name( + event: KernelEvent, active_events: Iterable[KernelEvent] +) -> Optional[str]: + candidates = [ + other + for other in active_events + if other.idx != event.idx and other.stream != event.stream + ] + if not candidates: + return None + candidates.sort( + key=lambda other: (CATEGORY_PRIORITY.get(other.category, 0), other.dur), + reverse=True, + ) + return candidates[0].canonical_name + + +def analyze_overlap(events: Sequence[KernelEvent]) -> Dict[str, float]: + # Sweep line over kernel start/end points. For each active time slice we + # decide whether a kernel was exposed on the critical path or hidden by work + # on other streams. + points: List[Tuple[float, int, int]] = [] + event_map = {event.idx: event for event in events} + for event in events: + points.append((event.ts, 1, event.idx)) + points.append((event.end, 0, event.idx)) + points.sort(key=lambda item: (item[0], item[1])) + + total_busy = 0.0 + total_overlap = 0.0 + max_concurrent = 0 + active: Dict[int, KernelEvent] = {} + prev_time: Optional[float] = None + + for time_point, is_start, event_idx in points: + if prev_time is not None and time_point > prev_time and active: + segment = time_point - prev_time + active_events = list(active.values()) + distinct_streams = {event.stream for event in active_events} + total_busy += segment + max_concurrent = max(max_concurrent, len(distinct_streams)) + if len(distinct_streams) >= 2: + total_overlap += segment + for event in active_events: + overlapping_events = [ + other + for other in active_events + if other.idx != event.idx and other.stream != event.stream + ] + if overlapping_events: + event.hidden_us += segment + if any(other.category == "compute" for other in overlapping_events): + event.hidden_by_compute_us += segment + overlap_name = dominant_overlap_name(event, active_events) + if overlap_name: + 
event.overlap_with[overlap_name] += segment + else: + event.exclusive_us += segment + + if is_start == 0: + active.pop(event_idx, None) + else: + active[event_idx] = event_map[event_idx] + prev_time = time_point + + return { + "total_busy_us": total_busy, + "total_overlap_us": total_overlap, + "max_concurrent_streams": float(max_concurrent), + } + + +def aggregate_events( + events: Sequence[KernelEvent], +) -> Dict[Tuple[str, str], AggregateStats]: + aggregates: Dict[Tuple[str, str], AggregateStats] = {} + for event in events: + key = (event.canonical_name, event.category) + if key not in aggregates: + aggregates[key] = AggregateStats( + name=event.canonical_name, category=event.category + ) + stats = aggregates[key] + stats.count += 1 + stats.total_us += event.dur + stats.hidden_us += event.hidden_us + stats.exclusive_us += event.exclusive_us + stats.hidden_by_compute_us += event.hidden_by_compute_us + stats.overlap_with.update(event.overlap_with) + score = event.hidden_us + event.exclusive_us + if score > stats.representative_score: + stats.representative_score = score + stats.representative_idx = event.idx + return aggregates + + +def top_hidden_low_roi( + aggregates: Dict[Tuple[str, str], AggregateStats], +) -> List[AggregateStats]: + candidates = [ + stats + for stats in aggregates.values() + if stats.category in {"elementwise", "memory"} + and stats.total_us >= 5.0 + and stats.hidden_ratio >= 0.65 + ] + candidates.sort( + key=lambda stats: ( + stats.hidden_us + * (1.0 + stats.hidden_by_compute_us / max(stats.hidden_us, 1.0)), + stats.hidden_ratio, + ), + reverse=True, + ) + return candidates[:5] + + +def top_overlap_opportunities( + aggregates: Dict[Tuple[str, str], AggregateStats], +) -> List[AggregateStats]: + category_weight = { + "communication": 1.3, + "memory": 1.15, + "elementwise": 1.0, + "compute": 0.35, + "other": 0.8, + } + candidates = [ + stats + for stats in aggregates.values() + if stats.total_us >= 5.0 and stats.exclusive_ratio >= 0.45 + ] + 
primary = [stats for stats in candidates if stats.category != "compute"] + fallback = [stats for stats in candidates if stats.category == "compute"] + primary.sort( + key=lambda stats: stats.exclusive_us * category_weight.get(stats.category, 1.0), + reverse=True, + ) + fallback.sort( + key=lambda stats: stats.exclusive_us * category_weight.get(stats.category, 1.0), + reverse=True, + ) + return (primary + fallback)[:5] + + +def choose_best_scope(scope_chain: Sequence[str]) -> Optional[str]: + ranked: List[Tuple[float, str]] = [] + for index, scope in enumerate(scope_chain): + score = float(index) + if scope.startswith("python/sglang/"): + score += 50.0 + elif scope.startswith("sglang/"): + score += 48.0 + elif scope.startswith("vllm/"): + score += 46.0 + elif scope.startswith("tensorrt_llm/"): + score += 44.0 + elif scope.startswith("sgl_kernel/"): + score += 30.0 + elif ".py(" in scope: + score += 10.0 + if "utils.py" in scope and "__call__" in scope: + score -= 15.0 + if "scheduler_profiler_mixin.py" in scope: + score -= 20.0 + if is_low_signal_scope(scope): + score -= 25.0 + ranked.append((score, scope)) + return max(ranked, key=lambda item: item[0])[1] if ranked else None + + +def is_low_signal_scope(scope: str) -> bool: + lowered = canonicalize_python_scope_name(scope).lower() + if not lowered: + return False + return any(token in lowered for token in LOW_SIGNAL_FUNCTION_TOKENS) or any( + token in lowered for token in LOW_SIGNAL_PATH_TOKENS + ) + + +def scope_chain_key(scope_chain: Sequence[str]) -> Optional[str]: + if not scope_chain: + return None + trimmed = list(scope_chain[-4:]) + return " -> ".join(trimmed) + + +def normalize_match_text(text: object) -> str: + return re.sub(r"[^0-9A-Za-z]+", "", normalize_text(text)).lower() + + +def source_scope_priority(scope: Optional[str]) -> int: + normalized = canonicalize_python_scope_name(scope or "") + if not normalized or normalized == "unmapped": + return 0 + penalty = 80 if is_low_signal_scope(normalized) else 
0 + if normalized.startswith("python/sglang/"): + return 300 - penalty + if normalized.startswith("sglang/"): + return 290 - penalty + if normalized.startswith("vllm/"): + return 285 - penalty + if normalized.startswith("tensorrt_llm/"): + return 280 - penalty + if normalized.startswith("sgl_kernel/"): + return 260 - penalty + if ".py(" in normalized: + return 120 - penalty + return 0 + + +def kernel_alias_token_groups(kernel_name: str) -> List[Tuple[str, ...]]: + lowered = normalize_match_text(kernel_name) + groups: List[Tuple[str, ...]] = [] + if "flashattnfwdcombine" in lowered: + groups.append( + ( + "flashattnfwdsm90", + "flashattnvarlenfunc", + "vllmflashattnflashattninterface", + "vllmfa3cfwd", + ) + ) + if "kernelmha" in lowered: + groups.append( + ( + "maskedmultiheadattentionkernel", + "attentioninplace", + "attentionbackendtrtllm", + ) + ) + if "applybiasropeupdatekvcachev2" in lowered: + groups.append( + ( + "fusedqknormropekernel", + "applyqknormrope", + "modelingqwen3py98applyqknormrope", + ) + ) + if lowered.startswith("memset"): + groups.append(("memset",)) + return groups + + +def source_stats_lookup_text( + kernel_name: str, stats: Optional[KernelSourceStats] +) -> str: + parts = [kernel_name] + if stats: + parts.append(str(stats.best_scope or "")) + parts.append(str(stats.best_chain or "")) + parts.append(str(stats.best_launch_op or "")) + return normalize_match_text(" ".join(parts)) + + +def relaxed_source_stats_lookup( + source_map: Dict[str, KernelSourceStats], kernel_name: str +) -> Optional[KernelSourceStats]: + if kernel_name in source_map: + return source_map[kernel_name] + + lowered = kernel_name.lower() + best_key = None + best_score = -1 + for candidate_key in source_map: + candidate_lowered = candidate_key.lower() + if candidate_lowered.startswith(lowered) or lowered.startswith( + candidate_lowered + ): + score = min(len(candidate_lowered), len(lowered)) + elif candidate_lowered in lowered or lowered in candidate_lowered: + score = 
min(len(candidate_lowered), len(lowered)) // 2 + else: + continue + if score > best_score: + best_key = candidate_key + best_score = score + if best_key: + return source_map.get(best_key) + + lowered_compact = normalize_match_text(kernel_name) + if len(lowered_compact) >= 96: + + def common_prefix_len(left: str, right: str) -> int: + count = 0 + for left_ch, right_ch in zip(left, right): + if left_ch != right_ch: + break + count += 1 + return count + + best_key = None + best_score = -1 + for candidate_key in source_map: + candidate_compact = normalize_match_text(candidate_key) + if len(candidate_compact) < 96: + continue + prefix_len = common_prefix_len(lowered_compact, candidate_compact) + shorter_len = min(len(lowered_compact), len(candidate_compact)) + if prefix_len < 64 or prefix_len < int(shorter_len * 0.4): + continue + score = prefix_len + if lowered_compact.startswith( + "voidcutlassdevicekernelflash" + ) and candidate_compact.startswith("voidcutlassdevicekernelflash"): + score += 32 + if score > best_score: + best_key = candidate_key + best_score = score + if best_key: + return source_map.get(best_key) + + alias_groups = kernel_alias_token_groups(kernel_name) + if not alias_groups: + return None + best_key = None + best_score = -1 + for candidate_key, stats in source_map.items(): + candidate_text = source_stats_lookup_text(candidate_key, stats) + score = 0 + for group_index, group in enumerate(alias_groups): + group_score = max( + (len(token) for token in group if token in candidate_text), + default=0, + ) + if group_score: + score += 1000 * (group_index + 1) + group_score + if score <= 0: + continue + score += source_scope_priority(stats.best_scope) + score += int(stats.mapping_ratio * 100) + if score > best_score: + best_key = candidate_key + best_score = score + return source_map.get(best_key) if best_key else None + + +def extract_cpu_launch_contexts( + raw_events: Sequence[dict], + target_external_ids: Optional[set[int]] = None, +) -> Dict[int, 
List[CPUOpContext]]: + # Rebuild `External id -> CPU op -> active Python scopes` only for the + # small set of launch ids that the source-map step will actually consume. + # vLLM eager traces can have millions of Python frames on one thread, so + # avoid global timeline reconstruction across unrelated threads and ids. + cpu_ops_by_thread: Dict[Tuple[str, str], List[CPUOpContext]] = defaultdict(list) + + for event in raw_events: + if not is_complete_duration_event(event): + continue + if str(event.get("cat", "")) != "cpu_op": + continue + args = event.get("args", {}) or {} + external_id = coerce_optional_int(args.get("External id")) + if external_id is None: + continue + if target_external_ids is not None and external_id not in target_external_ids: + continue + pid = str(event.get("pid")) + tid = str(event.get("tid")) + ts = float(event.get("ts", 0.0)) + dur = float(event.get("dur", 0.0)) + cpu_ops_by_thread[(pid, tid)].append( + CPUOpContext( + external_id=external_id, + cpu_op_name=str(event.get("name", "")), + pid=pid, + tid=tid, + ts=ts, + dur=dur, + end=ts + dur, + scope_chain=(), + ) + ) + + if not cpu_ops_by_thread: + return {} + + scopes_by_thread: Dict[Tuple[str, str], List[PythonScope]] = defaultdict(list) + relevant_threads = set(cpu_ops_by_thread) + for event in raw_events: + if not is_complete_duration_event(event): + continue + if str(event.get("cat", "")) != "python_function": + continue + pid = str(event.get("pid")) + tid = str(event.get("tid")) + thread_key = (pid, tid) + if thread_key not in relevant_threads: + continue + normalized_name = canonicalize_python_scope_name(event.get("name", "")) + is_meaningful = is_meaningful_python_scope(normalized_name) + is_fallback = is_fallback_python_scope(normalized_name) + if not is_meaningful and not is_fallback: + continue + ts = float(event.get("ts", 0.0)) + dur = float(event.get("dur", 0.0)) + scopes_by_thread[thread_key].append( + PythonScope( + name=str(event.get("name", "")), + 
normalized_name=normalized_name, + pid=pid, + tid=tid, + ts=ts, + dur=dur, + end=ts + dur, + is_meaningful=is_meaningful, + is_fallback=is_fallback, + ) + ) + + contexts_by_external_id: Dict[int, List[CPUOpContext]] = defaultdict(list) + for thread_key in relevant_threads: + scopes = scopes_by_thread.get(thread_key, []) + cpu_ops = cpu_ops_by_thread.get(thread_key, []) + timeline = [] + for scope_idx, scope in enumerate(scopes): + timeline.append((scope.ts, 0, scope_idx)) + timeline.append((scope.end, 2, scope_idx)) + for cpu_op_idx, cpu_op in enumerate(cpu_ops): + timeline.append((cpu_op.ts, 1, cpu_op_idx)) + timeline.sort(key=lambda item: (item[0], item[1])) + + active_scopes: Dict[int, PythonScope] = {} + for _, kind, payload in timeline: + if kind == 0: + active_scopes[payload] = scopes[payload] + elif kind == 1: + meaningful = [ + scope.normalized_name + for scope in active_scopes.values() + if scope.is_meaningful + ] + fallback = ( + [] + if meaningful + else [ + scope.normalized_name + for scope in active_scopes.values() + if scope.is_fallback + ] + ) + chosen_chain = tuple((meaningful or fallback)[-6:]) + cpu_op = cpu_ops[payload] + contexts_by_external_id[cpu_op.external_id].append( + CPUOpContext( + external_id=cpu_op.external_id, + cpu_op_name=cpu_op.cpu_op_name, + pid=cpu_op.pid, + tid=cpu_op.tid, + ts=cpu_op.ts, + dur=cpu_op.dur, + end=cpu_op.end, + scope_chain=chosen_chain, + ) + ) + else: + active_scopes.pop(payload, None) + return contexts_by_external_id + + +def is_cuda_launch_event(name: str, cat: str) -> bool: + # The name is lowercased before matching, so the launch API names below + # must be spelled in lowercase as well. + lowered_name = normalize_text(name).lower() + lowered_cat = normalize_text(cat).lower() + if lowered_cat == "cuda_runtime": + return lowered_name in { + "cudalaunchkernel", + "cudalaunchkernelexc", + } + return lowered_name in { + "culaunchkernel", + "culaunchkernelex", + "cudalaunchkernel", + "cudalaunchkernelexc", + } + + +@dataclass +class LaunchContext: + correlation: int + pid: str + tid: str + ts: float + dur: float + end: float + 
launch_name: str + + +def build_launch_contexts( + raw_events: Sequence[dict], +) -> Dict[int, List[LaunchContext]]: + output: Dict[int, List[LaunchContext]] = defaultdict(list) + for event in raw_events: + if not is_complete_duration_event(event): + continue + cat = str(event.get("cat", "")) + name = str(event.get("name", "")) + args = event.get("args", {}) or {} + correlation = coerce_optional_int(args.get("correlation")) + if correlation is None or not is_cuda_launch_event(name, cat): + continue + ts = float(event.get("ts", 0.0)) + dur = float(event.get("dur", 0.0)) + output[correlation].append( + LaunchContext( + correlation=correlation, + pid=str(event.get("pid")), + tid=str(event.get("tid")), + ts=ts, + dur=dur, + end=ts + dur, + launch_name=name, + ) + ) + for items in output.values(): + items.sort(key=lambda item: item.ts) + return output + + +def choose_launch_context( + contexts: Sequence[LaunchContext], kernel_ts: float +) -> Optional[LaunchContext]: + if not contexts: + return None + return min(contexts, key=lambda context: (abs(context.ts - kernel_ts), context.dur)) + + +def choose_cpu_context( + contexts: Sequence[CPUOpContext], kernel_ts: float +) -> Optional[CPUOpContext]: + if not contexts: + return None + return min(contexts, key=lambda context: (abs(context.ts - kernel_ts), context.dur)) + + +def extract_meaningful_python_scopes(raw_events: Sequence[dict]) -> List[PythonScope]: + scopes: List[PythonScope] = [] + for event in raw_events: + if not is_complete_duration_event(event): + continue + if str(event.get("cat", "")) != "python_function": + continue + ts = float(event.get("ts", 0.0)) + dur = float(event.get("dur", 0.0)) + normalized_name = canonicalize_python_scope_name(event.get("name", "")) + if not is_meaningful_python_scope(normalized_name): + continue + scopes.append( + PythonScope( + name=str(event.get("name", "")), + normalized_name=normalized_name, + pid=str(event.get("pid")), + tid=str(event.get("tid")), + ts=ts, + dur=dur, + end=ts 
+ dur, + ) + ) + return scopes + + +def choose_temporal_scope_chain( + scopes: Sequence[PythonScope], kernel_ts: float +) -> Tuple[str, ...]: + matches = [scope for scope in scopes if scope.ts <= kernel_ts <= scope.end] + if not matches: + return () + matches.sort(key=lambda scope: (scope.ts, -scope.dur, scope.normalized_name)) + chain = [] + seen = set() + for scope in matches: + if scope.normalized_name in seen: + continue + seen.add(scope.normalized_name) + chain.append(scope.normalized_name) + return tuple(chain[-6:]) + + +def build_temporal_scope_lookup( + scopes: Sequence[PythonScope], + query_points: Sequence[Tuple[int, float]], +) -> Dict[int, Tuple[str, ...]]: + if not scopes or not query_points: + return {} + + timeline: List[Tuple[float, int, object]] = [] + for scope in scopes: + timeline.append((scope.ts, 0, scope)) + timeline.append((scope.end, 2, scope)) + for event_idx, probe_ts in query_points: + timeline.append((probe_ts, 1, event_idx)) + timeline.sort(key=lambda item: (item[0], item[1])) + + active_scopes: List[PythonScope] = [] + resolved: Dict[int, Tuple[str, ...]] = {} + for _, kind, payload in timeline: + if kind == 0: + active_scopes.append(payload) + continue + if kind == 2: + if payload in active_scopes: + active_scopes.remove(payload) + continue + + chain: List[str] = [] + seen: set[str] = set() + for scope in sorted( + active_scopes, + key=lambda scope: (scope.ts, -scope.dur, scope.normalized_name), + ): + name = scope.normalized_name + if name in seen: + continue + seen.add(name) + chain.append(name) + resolved[payload] = tuple(chain[-6:]) + return resolved + + +def build_temporal_scope_lookup_from_raw_events( + raw_events: Sequence[dict], + query_points: Sequence[Tuple[int, float]], +) -> Dict[int, Tuple[str, ...]]: + if not query_points: + return {} + + ordered_queries = sorted( + ((float(query_ts), int(query_id)) for query_id, query_ts in query_points), + key=lambda item: item[0], + ) + query_times = [query_ts for query_ts, _ in 
ordered_queries] + query_ids = [query_id for _, query_id in ordered_queries] + first_query_ts = query_times[0] + last_query_ts = query_times[-1] + + matches_by_query: Dict[int, List[PythonScope]] = defaultdict(list) + for event in raw_events: + if not is_complete_duration_event(event): + continue + if str(event.get("cat", "")) != "python_function": + continue + ts = float(event.get("ts", 0.0)) + dur = float(event.get("dur", 0.0)) + end = ts + dur + if end < first_query_ts or ts > last_query_ts: + continue + + normalized_name = canonicalize_python_scope_name(event.get("name", "")) + if not is_meaningful_python_scope(normalized_name): + continue + + left = bisect_left(query_times, ts - 1e-3) + right = bisect_right(query_times, end + 1e-3) + if left >= right: + continue + + scope = PythonScope( + name=str(event.get("name", "")), + normalized_name=normalized_name, + pid=str(event.get("pid")), + tid=str(event.get("tid")), + ts=ts, + dur=dur, + end=end, + is_meaningful=True, + is_fallback=False, + ) + for pos in range(left, right): + matches_by_query[query_ids[pos]].append(scope) + + resolved: Dict[int, Tuple[str, ...]] = {} + for query_id, scopes in matches_by_query.items(): + chain: List[str] = [] + seen: set[str] = set() + for scope in sorted( + scopes, + key=lambda scope: (scope.ts, -scope.dur, scope.normalized_name), + ): + name = scope.normalized_name + if name in seen: + continue + seen.add(name) + chain.append(name) + resolved[query_id] = tuple(chain[-6:]) + return resolved + + +def build_kernel_source_map( + mapping_bundle: TraceBundle, + kernel_map_entry_lookup=None, + stage: str = "all", +) -> Dict[str, KernelSourceStats]: + sampled_events = sample_source_map_events(mapping_bundle.events) + target_external_ids = { + event.external_id for event in sampled_events if event.external_id is not None + } + contexts_by_external_id = extract_cpu_launch_contexts( + mapping_bundle.raw_events, + target_external_ids=target_external_ids or None, + ) + correlation_external = 
build_correlation_external_lookup(mapping_bundle.raw_events) + launch_contexts_by_correlation = build_launch_contexts(mapping_bundle.raw_events) + fallback_queries = [ + (event.idx, event.ts) + for event in sampled_events + if event.external_id is None + or not contexts_by_external_id.get(event.external_id) + ] + temporal_scope_lookup = build_temporal_scope_lookup_from_raw_events( + mapping_bundle.raw_events, + fallback_queries, + ) + source_map: Dict[str, KernelSourceStats] = {} + for event in sampled_events: + stats = source_map.setdefault( + event.canonical_name, KernelSourceStats(name=event.canonical_name) + ) + stats.total_count += 1 + kernel_entry = ( + kernel_map_entry_lookup(stage, event.canonical_name) + if kernel_map_entry_lookup is not None + else None + ) + cpu_context = None + effective_external_id = event.external_id + if effective_external_id is None and event.correlation is not None: + effective_external_id = correlation_external.get(event.correlation) + if effective_external_id is not None: + cpu_context = choose_cpu_context( + contexts_by_external_id.get(effective_external_id, []), event.ts + ) + + launch_op = None + scope_chain: Tuple[str, ...] 
= () + if cpu_context is not None: + launch_op = canonicalize_cpu_op_name(cpu_context.cpu_op_name) + scope_chain = cpu_context.scope_chain + else: + launch_context = ( + choose_launch_context( + launch_contexts_by_correlation.get(event.correlation, []), event.ts + ) + if event.correlation is not None + else None + ) + if launch_context is not None: + scope_chain = build_temporal_scope_lookup_from_raw_events( + mapping_bundle.raw_events, + [(event.idx, launch_context.ts)], + ).get(event.idx, ()) + if scope_chain: + launch_op = canonicalize_cpu_op_name(launch_context.launch_name) + if not scope_chain: + scope_chain = temporal_scope_lookup.get(event.idx, ()) + if scope_chain: + launch_op = "time-window fallback" + + if not scope_chain: + if kernel_entry: + best_location = str(kernel_entry.get("best_location") or "").strip() + if best_location and best_location != "unresolved": + stats.mapped_count += 1 + stats.scope_counter[best_location] += 1 + stats.site_share_counter[best_location] += 1 + for site in kernel_entry.get("sites") or []: + display_location = str( + site.get("display_location") or site.get("location") or "" + ).strip() + if display_location and display_location != "unresolved": + launches = int(site.get("launches") or 0) + stats.site_share_counter[display_location] += max( + 1, launches + ) + if launches > 0: + stats.scope_counter[display_location] += launches + top_cpu_op = site.get("top_cpu_op") + if top_cpu_op: + launches = int(site.get("launches") or 0) + stats.launch_op_counter[str(top_cpu_op)] += max(1, launches) + continue + + stats.mapped_count += 1 + best_scope = choose_best_scope(scope_chain) + if best_scope: + stats.scope_counter[best_scope] += 1 + stats.site_share_counter[best_scope] += 1 + chain = scope_chain_key(scope_chain) + if chain: + stats.chain_counter[chain] += 1 + if launch_op: + stats.launch_op_counter[launch_op] += 1 + return source_map + + +def merge_source_map_from_kernel_payload( + source_map: Dict[str, KernelSourceStats], + 
stage_payload: Optional[dict], +) -> Dict[str, KernelSourceStats]: + if not stage_payload: + return source_map + + for kernel_name, entry in (stage_payload.get("kernels") or {}).items(): + sites = entry.get("sites") or [] + best_location = str(entry.get("best_location") or "").strip() + if not sites and (not best_location or best_location == "unresolved"): + continue + + stats = source_map.setdefault(kernel_name, KernelSourceStats(name=kernel_name)) + if sites: + for site in sites: + location = str(site.get("location") or best_location or "").strip() + launches = max(1, int(site.get("launches") or 0)) + stats.total_count += launches + if location and location != "unresolved": + stats.mapped_count += launches + stats.scope_counter[location] += launches + stats.site_share_counter[location] += launches + top_cpu_op = str(site.get("top_cpu_op") or "").strip() + if top_cpu_op: + stats.launch_op_counter[top_cpu_op] += launches + stack = str(site.get("stack") or "").strip() + if stack: + stats.chain_counter[stack] += launches + continue + + stats.total_count += 1 + stats.mapped_count += 1 + stats.scope_counter[best_location] += 1 + stats.site_share_counter[best_location] += 1 + return source_map + + +def sample_source_map_events( + events: Sequence[KernelEvent], + per_name_limit: int = SOURCE_MAP_SAMPLE_LIMIT_PER_NAME, +) -> List[KernelEvent]: + if per_name_limit <= 0: + return list(events) + + grouped: Dict[str, List[KernelEvent]] = defaultdict(list) + for event in events: + grouped[event.canonical_name].append(event) + + sampled: List[KernelEvent] = [] + for kernel_name in sorted(grouped): + items = grouped[kernel_name] + if len(items) <= per_name_limit: + sampled.extend(items) + continue + # Spread samples evenly over the group; guard the single-sample case + # (per_name_limit == 1) against division by zero. + step_count = max(per_name_limit - 1, 1) + for sample_idx in range(per_name_limit): + pos = round(sample_idx * (len(items) - 1) / step_count) + sampled.append(items[pos]) + sampled.sort(key=lambda event: (event.ts, event.idx)) + return sampled + + +def format_overlap_counter(counter: Counter, limit: int = 2) -> str: 
+ if not counter: + return "n/a" + parts = [] + for name, duration in counter.most_common(limit): + parts.append(f"{short_name(name, 48)} ({duration:.1f} us)") + return ", ".join(parts) + + +def build_headroom_suggestion(stats: AggregateStats) -> str: + if stats.category == "communication": + return "Communication is still exposed. Check overlap with nearby compute." + if stats.category in {"elementwise", "memory"}: + return "This work is still exposed. Check fusion or nearby compute coverage." + return ( + "This work is still exposed. Check stream placement and immediate dependencies." + ) + + +def build_hidden_suggestion(stats: AggregateStats) -> str: + overlap = format_overlap_counter(stats.overlap_with, limit=1) + if overlap != "n/a": + return f"Mostly hidden under {overlap}. Revisit only if schedule or fusion changes." + return "Mostly hidden already. Revisit only if schedule or fusion changes." + + +def build_other_suggestion(stats: AggregateStats) -> str: + if stats.exclusive_ratio >= 0.6: + return "Still exposed, but not one of the leading overlap targets." + if stats.hidden_ratio >= 0.6: + return "Often hidden already. Revisit it if launch count or schedule changes." + return "Mixed exposure and overlap. Inspect it after the higher-share rows above." 
+ + +def parse_scope_signature(scope: str) -> Tuple[str, str]: + if not scope or scope in {"unmapped", "n/a"}: + return "", "" + match = re.match(r"(.+?)\(\d+\):\s*(.+)$", scope) + if match: + return match.group(1), match.group(2) + return scope, "" + + +def same_scope_family(left: str, right: str) -> bool: + left_path, left_func = parse_scope_signature(left) + right_path, right_func = parse_scope_signature(right) + if not left_path or not right_path: + return False + if left_path == right_path: + return True + return bool(left_func and right_func and left_func == right_func) + + +def is_neighbor_dependency_like( + current: KernelEvent, neighbor: Optional[KernelEvent] +) -> bool: + if neighbor is None: + return False + if current.category == "communication": + return neighbor.category in {"compute", "elementwise", "memory", "other"} + if current.category in {"elementwise", "memory"}: + return neighbor.category in { + "compute", + "communication", + "elementwise", + "memory", + } + return False + + +def build_stream_neighbor_index( + events: Sequence[KernelEvent], +) -> Dict[int, Tuple[Optional[KernelEvent], Optional[KernelEvent]]]: + by_stream: Dict[str, List[KernelEvent]] = defaultdict(list) + for event in events: + by_stream[event.stream].append(event) + + index: Dict[int, Tuple[Optional[KernelEvent], Optional[KernelEvent]]] = {} + for stream_events in by_stream.values(): + stream_events.sort(key=lambda event: (event.ts, event.end, event.idx)) + for pos, event in enumerate(stream_events): + prev_event = stream_events[pos - 1] if pos > 0 else None + next_event = ( + stream_events[pos + 1] if pos + 1 < len(stream_events) else None + ) + index[event.idx] = (prev_event, next_event) + return index + + +def describe_neighbor( + neighbor: Optional[KernelEvent], + gap_us: Optional[float], + source_map: Dict[str, KernelSourceStats], +) -> str: + if neighbor is None: + return "none" + source = relaxed_source_stats_lookup(source_map, neighbor.canonical_name) + scope = 
source.best_scope if source and source.best_scope else "unmapped" + if gap_us is not None: + gap_us = max(gap_us, 0.0) + gap_text = f"{gap_us:.1f} us" + else: + gap_text = "n/a" + return ( + f"{short_name(neighbor.canonical_name, 28)} " + f"@ {short_name(scope, 28)} " + f"(gap {gap_text})" + ) + + +def classify_dependency_signal( + current: KernelEvent, + source: Optional[KernelSourceStats], + prev_event: Optional[KernelEvent], + next_event: Optional[KernelEvent], + source_map: Dict[str, KernelSourceStats], +) -> Tuple[str, str, str]: + current_scope = source.best_scope if source and source.best_scope else "unmapped" + current_launch = ( + source.best_launch_op if source and source.best_launch_op else "n/a" + ) + + prev_gap = current.ts - prev_event.end if prev_event is not None else None + next_gap = next_event.ts - current.end if next_event is not None else None + prev_source = ( + relaxed_source_stats_lookup(source_map, prev_event.canonical_name) + if prev_event is not None + else None + ) + next_source = ( + relaxed_source_stats_lookup(source_map, next_event.canonical_name) + if next_event is not None + else None + ) + prev_scope = ( + prev_source.best_scope if prev_source and prev_source.best_scope else "unmapped" + ) + next_scope = ( + next_source.best_scope if next_source and next_source.best_scope else "unmapped" + ) + prev_launch = ( + prev_source.best_launch_op + if prev_source and prev_source.best_launch_op + else "n/a" + ) + next_launch = ( + next_source.best_launch_op + if next_source and next_source.best_launch_op + else "n/a" + ) + + if prev_gap is not None: + prev_gap = max(prev_gap, 0.0) + if next_gap is not None: + next_gap = max(next_gap, 0.0) + + tight_gap_threshold = max(2.0, min(20.0, current.dur * 0.15)) + prev_tight = prev_gap is not None and prev_gap <= tight_gap_threshold + next_tight = next_gap is not None and next_gap <= tight_gap_threshold + + prev_risk = prev_tight and ( + same_scope_family(current_scope, prev_scope) + or 
(current_launch != "n/a" and current_launch == prev_launch) + or is_neighbor_dependency_like(current, prev_event) + ) + next_risk = next_tight and ( + same_scope_family(current_scope, next_scope) + or (current_launch != "n/a" and current_launch == next_launch) + or is_neighbor_dependency_like(current, next_event) + ) + + prev_unclear = ( + prev_tight + and not prev_risk + and (current_scope == "unmapped" or prev_scope == "unmapped") + ) + next_unclear = ( + next_tight + and not next_risk + and (current_scope == "unmapped" or next_scope == "unmapped") + ) + + if prev_risk and next_risk: + signal = "both-side serial risk" + elif prev_risk: + signal = "prev-side serial risk" + elif next_risk: + signal = "next-side serial risk" + elif prev_unclear or next_unclear: + signal = "adjacency unclear" + else: + signal = "serial risk low" + + prev_desc = describe_neighbor(prev_event, prev_gap, source_map) + next_desc = describe_neighbor(next_event, next_gap, source_map) + return signal, prev_desc, next_desc + + +def dependency_risk_label(signal: str) -> str: + mapping = { + "serial risk low": "low", + "prev-side serial risk": "high", + "next-side serial risk": "high", + "both-side serial risk": "high", + "adjacency unclear": "unclear", + } + return mapping.get(signal, signal) + + +def build_priority_and_recommendation( + verdict: str, + category: str, + dependency_signal: str, + stats: AggregateStats, + share_pct: float, +) -> Tuple[str, str]: + dep_label = dependency_risk_label(dependency_signal) + if share_pct < 1.0: + return "P5", "skip" + + if verdict == "headroom": + if dep_label == "low": + if category == "communication": + return "P1", "try overlap" + return "P1", "try fusion" + return "P2", "check deps" + + if verdict == "low-roi-hidden": + return "P4", "skip" + + if stats.exclusive_ratio >= 0.85 and dep_label == "low": + return "P3", "defer" + if stats.hidden_ratio >= 0.7: + return "P5", "skip" + if dep_label == "high": + return "P4", "check deps" + if dep_label == 
"unclear": + return "P4", "inspect" + return "P4", "defer" + + +def make_action_row( + stats: AggregateStats, + verdict: str, + suggestion: str, + source_map: Dict[str, KernelSourceStats], + formal_events: Sequence[KernelEvent], + neighbor_index: Dict[int, Tuple[Optional[KernelEvent], Optional[KernelEvent]]], + total_busy_us: float, +) -> ActionRow: + source = relaxed_source_stats_lookup(source_map, stats.name) + representative_idx = stats.representative_idx + dependency_signal = "adjacency unclear" + prev_neighbor = "none" + next_neighbor = "none" + share_pct = (stats.total_us / total_busy_us * 100.0) if total_busy_us > 0 else 0.0 + if representative_idx is not None: + current_event = next( + (event for event in formal_events if event.idx == representative_idx), None + ) + if current_event is not None: + prev_event, next_event = neighbor_index.get( + representative_idx, (None, None) + ) + dependency_signal, prev_neighbor, next_neighbor = ( + classify_dependency_signal( + current=current_event, + source=source, + prev_event=prev_event, + next_event=next_event, + source_map=source_map, + ) + ) + priority, recommendation = build_priority_and_recommendation( + verdict=verdict, + category=stats.category, + dependency_signal=dependency_signal, + stats=stats, + share_pct=share_pct, + ) + + return ActionRow( + priority=priority, + verdict=verdict, + kernel=stats.name, + category=stats.category, + total_us=stats.total_us, + share_pct=share_pct, + exclusive_ratio=stats.exclusive_ratio, + hidden_ratio=stats.hidden_ratio, + python_scope=source.best_scope if source and source.best_scope else "unmapped", + launch_op=source.best_launch_op if source and source.best_launch_op else "n/a", + mapping_ratio=source.mapping_ratio if source else 0.0, + dependency_signal=dependency_signal, + prev_neighbor=prev_neighbor, + next_neighbor=next_neighbor, + recommendation=recommendation, + suggestion=suggestion, + representative_idx=representative_idx, + ) + + +def build_action_rows( + 
aggregates: Dict[Tuple[str, str], AggregateStats], + source_map: Dict[str, KernelSourceStats], + formal_events: Sequence[KernelEvent], + total_busy_us: float, + table_limit: int, +) -> List[ActionRow]: + rows: List[ActionRow] = [] + seen: set[str] = set() + neighbor_index = build_stream_neighbor_index(formal_events) + + for stats in top_overlap_opportunities(aggregates): + row = make_action_row( + stats=stats, + verdict="headroom", + suggestion=build_headroom_suggestion(stats), + source_map=source_map, + formal_events=formal_events, + neighbor_index=neighbor_index, + total_busy_us=total_busy_us, + ) + if row.priority == "P5": + continue + rows.append(row) + seen.add(stats.name) + + for stats in top_hidden_low_roi(aggregates): + if stats.name in seen: + continue + rows.append( + make_action_row( + stats=stats, + verdict="low-roi-hidden", + suggestion=build_hidden_suggestion(stats), + source_map=source_map, + formal_events=formal_events, + neighbor_index=neighbor_index, + total_busy_us=total_busy_us, + ) + ) + seen.add(stats.name) + + if table_limit > 0: + return rows[:table_limit] + return rows diff --git a/.claude/skills/sglang-bisect-ci-regression/SKILL.md b/.claude/skills/sglang-bisect-ci-regression/SKILL.md new file mode 100644 index 000000000000..0afd159721a0 --- /dev/null +++ b/.claude/skills/sglang-bisect-ci-regression/SKILL.md @@ -0,0 +1,224 @@ +--- +name: sglang-bisect-ci-regression +description: Investigate consistently failing SGLang CI tests by extracting the failure signature from scheduled or rerun workflows, bisecting the passing/failing commit window, checking runner or hardware specificity, and optionally reproducing on a remote GPU host. +--- + +# SGLang Bisect CI Regression + +Investigate a consistently failing CI test to find the root cause - whether it's a code regression from a specific PR, a hardware/runner-specific issue, or an environment change. Optionally reproduce the failure on a remote GPU server. 
+ +## Slash Command + +`/sglang-bisect-ci-regression [ssh_target] [docker_container]` + +## When to Use This Skill + +- A CI test is failing consistently on main (scheduled runs) +- You need to find which PR introduced a regression +- You suspect a runner-specific or GPU-specific issue +- You want to reproduce a CI failure on a remote server + +## Arguments + +- **First argument (required)**: Test file name (e.g. `test_lora_tp.py`) or a GitHub Actions job URL +- **Second argument (optional)**: SSH target for remote reproduction (e.g. `user@host`) +- **Third argument (optional)**: Docker container name on the SSH target (e.g. `sglang_dev`) + +If SSH target and docker container are not provided, the skill will only perform the CI log analysis and bisection, without remote reproduction. **Ask the user** for these if reproduction is needed and they weren't provided. + +## Background: Scheduled CI Runs + +SGLang uses the `pr-test.yml` workflow with **scheduled runs** (cron-triggered) to periodically test the `main` branch. These runs are the primary data source for detecting regressions: + +- **Workflow**: `pr-test.yml` with `event: schedule` +- **Branch**: `main` +- **Dashboard**: https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule +- **Frequency**: Runs multiple times daily, each pinned to the HEAD of `main` at trigger time +- **Purpose**: Catches regressions that slip through PR-level CI (e.g., interaction bugs between merged PRs, hardware-specific issues) + +Always use these scheduled runs (not PR-triggered runs) when bisecting regressions on `main`. The `--event schedule` filter in `gh run list` ensures you only see these periodic main-branch runs. + +## Workflow + +### Phase 1: Extract the Failure Signature + +1. **Get the failing test details from CI logs.** If given a URL, fetch logs directly. 
If given a test name, find recent scheduled runs of `pr-test.yml` on `main` that failed: + +```bash +# List recent scheduled runs targeting main (the primary source of truth for regressions) +# These are cron-triggered runs visible at: +# https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule +gh run list --repo sgl-project/sglang --workflow="pr-test.yml" --event schedule --branch main --limit 20 --json databaseId,conclusion,createdAt,headSha + +# Find the job containing the test +gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name, conclusion, databaseId}' + +# Get the failure details +gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E -B 5 -A 30 "AssertionError|FAIL|Error|{TEST_NAME}" +``` + +2. **Record the failure signature:** + - Exact error message and assertion + - Affected test method name + - Model/config involved + - Numeric values (e.g., tolerance diffs, scores) + - Whether the failure is deterministic (same values across runs) + +### Phase 2: Temporal Bisection + +3. **Find the boundary between passing and failing runs.** Walk through the scheduled run history (from the `pr-test.yml` schedule runs on `main`) to identify: + - Last known PASSING run (sha + date) + - First known FAILING run (sha + date) + +```bash +# For each scheduled run, check the specific partition/job status +gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.name == "{JOB_NAME}") | {conclusion, databaseId}' + +# Verify a specific test passed or failed in a run +gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "{TEST_NAME}|PASSED|FAILED|logprobs mismatch" | head -10 +``` + +4. **List commits between the boundary:** + +```bash +git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA} +``` + +5. 
**Filter for relevant commits** that touch files related to the failing test (model layers, kernels, test utilities, etc.): + +```bash +git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA} -- {relevant_paths} +``` + +### Phase 3: Runner/Hardware Analysis + +6. **Check if the failure is runner-specific.** Extract the runner identity from each failing and passing run: + +```bash +# Get runner name and machine +gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "Runner name|Machine name" | head -5 + +# Get GPU/driver info +gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -i -E "NVIDIA-SMI|Driver Version|CUDA Version" | head -5 + +# Get package versions +gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "sgl.kernel.*==|flashinfer.*==" | head -5 +``` + +7. **Correlate runners with pass/fail outcomes.** Build a table: + +| Run ID | Date | Runner | GPU Type | Driver | Result | +|--------|------|--------|----------|--------|--------| + +If all failures map to a specific runner type/GPU and all passes map to another, the issue is **hardware-specific**, not a code regression. + +### Phase 4: Code Analysis + +8. **If a code regression is suspected** (failures not runner-specific), examine the candidate commits: + - Read the changed files + - Understand how the changes could affect the failing test + - Look for prefill-vs-decode differences, TP-specific paths, kernel changes + +9. **If a hardware issue is suspected**, analyze: + - Kernel compatibility (CUDA compute capability) + - Driver version differences + - All-reduce / NCCL behavior differences + - CUDA graph capture differences across GPU architectures + +### Phase 5: Remote Reproduction (Optional) + +Only if SSH target and docker container were provided. + +10. 
**Verify the remote environment:** + +```bash +ssh {SSH_TARGET} "docker exec {CONTAINER} nvidia-smi --query-gpu=name,driver_version --format=csv" +ssh {SSH_TARGET} "docker exec {CONTAINER} pip show sgl-kernel sglang flashinfer-python 2>&1 | grep -E 'Name:|Version:'" +``` + +11. **Ensure latest code is installed.** If the container is stale, update: + +```bash +# Try fetching latest main +ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && git fetch origin main && git checkout origin/main'" +# Or download and install from tarball if git auth fails +ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /tmp && curl -L https://github.com/sgl-project/sglang/archive/refs/heads/main.tar.gz | tar xz && cd sglang-main && pip install -e \"python[all]\"'" +# Reinstall (after git fetch) +ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && pip install -e \"python[all]\"'" +# Install test dependencies if needed +ssh {SSH_TARGET} "docker exec {CONTAINER} pip install peft rouge-score" +``` + +12. **Create a minimal reproduction script** that: + - Uses `if __name__ == '__main__'` with `mp.set_start_method("spawn")` + - Runs the specific failing test configuration + - Prints key metrics (diffs, scores, outputs) + - Exits with code 1 on failure + +13. **Copy and run the reproduction script:** + +```bash +scp /tmp/repro_script.py {SSH_TARGET}:/tmp/ +ssh {SSH_TARGET} "docker cp /tmp/repro_script.py {CONTAINER}:/tmp/" +ssh {SSH_TARGET} "docker exec -e CUDA_VISIBLE_DEVICES=0,1 {CONTAINER} python3 /tmp/repro_script.py" +``` + +14. **Run control experiments** to isolate the variable: + - If suspecting TP issue: run with TP=1 as control + - If suspecting GPU issue: compare same code on different GPU + - If suspecting a specific commit: test before/after that commit + +### Phase 6: Report + +15. 
**Produce a structured report:** + +```markdown +## CI Regression Bisection Report + +### Failure Signature +- **Test**: {test_file}::{test_method} +- **Error**: {exact error message} +- **Key metrics**: {numeric values} +- **Deterministic**: Yes/No + +### Root Cause Classification +One of: +- **Code Regression**: PR #{number} introduced the bug +- **Hardware-Specific**: Fails on {GPU_TYPE}, passes on others +- **Environment Change**: New runner/driver/package version +- **Pre-existing Flakiness**: Intermittent, not a new regression + +### Evidence +| Condition | Result | +|-----------|--------| +| {condition1} | PASS/FAIL | +| {condition2} | PASS/FAIL | + +### Timeline +- {date}: Last known pass ({sha}, {runner}) +- {date}: First known fail ({sha}, {runner}) +- {date}: Confirmed reproduction on {server} + +### Recommended Fix +- **Short-term**: {workaround} +- **Long-term**: {proper fix} +``` + +## Key Patterns to Recognize + +| Pattern | Diagnosis | +|---------|-----------| +| Same SHA passes on runner A, fails on runner B | Hardware/runner-specific | +| All runners fail after commit X | Code regression from commit X | +| Intermittent - same runner sometimes passes/fails | Flaky test or race condition | +| Prefill OK but decode fails | TP/all-reduce issue in decode path | +| Works with TP=1, fails with TP>1 | Tensor parallelism bug | +| Exact same numeric diff every time | Deterministic bug, not flakiness | + +## Important Notes + +- **Always check runner identity** before concluding it's a code regression. Many "consistent" failures are actually runner-specific. +- **Test partition assignments change over time** as tests are added/removed. A test may move between partitions, landing on different runner types. +- **H200 runners** use `/root/actions-runner/` path and machine names like `gpu-h200-worker-*`. Non-H200 runners use `/public_sglang_ci/runner-*` paths. 
+- When running remote reproduction, use `run_in_background` for long-running tests and check output with `TaskOutput`. +- Container environments may be stale - always verify package versions match CI before drawing conclusions. diff --git a/.claude/skills/sglang-prod-incident-triage/SKILL.md b/.claude/skills/sglang-prod-incident-triage/SKILL.md new file mode 100644 index 000000000000..708e399f1c99 --- /dev/null +++ b/.claude/skills/sglang-prod-incident-triage/SKILL.md @@ -0,0 +1,291 @@ +--- +name: sglang-prod-incident-triage +description: Replay-first debug flow for SGLang serving problems. Use when a live or recent server shows health-check failures, latency or throughput regressions, queue growth, timeouts, distributed stalls, crash dumps, wrong outputs after deploys, or PD/EP/HiCache issues, and the job is to turn the problem into a replay plus the right next debug tool. +--- + +# SGLang Serving Debug + +## Overview + +Use this skill to turn a live serving problem into a debug path you can replay. + +Use one loop: + +- collect a baseline bundle +- save the failing request or crash dump +- replay on a clean target +- only then switch tools + +Do not start with profiling. 
+ +This skill should work with more focused skills instead of re-implementing them: + +- `debug-cuda-crash` when replay plus coredump points to a CUDA crash path +- `debug-distributed-hang` when the problem is clearly a TP/PP/DP/EP hang +- `llm-torch-profiler-analysis` when the issue is already narrowed to a + compute-side path + +Three examples are included: + +- TTFT spike with low queue time +- replay-first CUDA crash flow +- request-shaped distributed hang flow + +## Output Contract + +Return: + +- problem class +- what was checked +- strongest signal so far +- current best guess +- what was ruled out +- next step +- production risk + +## When To Use It + +- `/health` or `/health_generate` is unhealthy +- latency or throughput regressed under serving load +- queue size grows while health still looks green +- one request class times out or hangs +- the server crashes only after some requests +- outputs changed after a deploy, topology change, or weight switch +- one older commit is known-good and a newer commit is known-bad + +## Workflow + +### 1. Collect a baseline bundle + +If a live server is reachable, collect a read-only bundle before anything more +intrusive: + +```bash +python3 scripts/incident_artifact_tool.py collect-bundle \ + --base-url http://127.0.0.1:30000 \ + --outdir /tmp/incident_bundle + +python3 scripts/incident_artifact_tool.py summarize-bundle \ + /tmp/incident_bundle +``` + +If the server is protected: + +```bash +python3 scripts/incident_artifact_tool.py collect-bundle \ + --base-url http://127.0.0.1:30000 \ + --token "$SGLANG_BEARER_TOKEN" \ + --outdir /tmp/incident_bundle +``` + +The bundle script collects: + +- `/health` +- `/health_generate` +- `/model_info` +- `/server_info` +- `/v1/loads?include=all` +- `/v1/loads?include=core,queues,disagg,spec` +- `/metrics` +- `/hicache/storage-backend` on a best-effort basis + +Use the summary for a quick read on: + +- health vs. 
active health state +- topology and runtime flags +- point-in-time queue and token usage +- TTFT / E2E / queue-time heuristics from Prometheus metrics + +If the summary says the bundle was captured while the server was idle, recollect +it during traffic or move quickly to dump plus replay. + +If no live server is reachable, start from the best dump or log already available: + +- crash dump +- request dump +- logs +- CUDA coredump +- OTel trace +- torch profile + +### 2. Save the failing request + +Read [references/decision-tree.md](references/decision-tree.md) only if the +problem class is still unclear: + +- server down or unhealthy +- latency or throughput regression +- wrong output or behavior regression +- intermittent timeout or hang + +Then preserve the request payload that actually triggers the problem: + +- crash path: use `--crash-dump-folder` +- non-crash path: enable request dump or save the exact trigger request + +Do not jump straight from a live symptom to low-level debugging without first +saving something you can replay. + +### 3. Replay on a clean target + +Read [references/endpoints-and-signals.md](references/endpoints-and-signals.md) +when you need help reading the baseline bundle or the replay target. + +Read [references/replay-trace-profile.md](references/replay-trace-profile.md) +when you need the replay, trace, profile, or bisect paths. + +Standard order: + +1. collect baseline bundle +2. capture request dump or crash dump +3. restart a clean debug target if needed +4. replay the same issue +5. collect replay-time logs and dumps + +### 4. 
Only go deeper after replay + +#### Replay + +Use replay when: + +- a crash dump exists +- a request dump exists +- the problem depends on request shape or workload mix + +If a crash dump exists, summarize it first: + +```bash +python3 scripts/incident_artifact_tool.py summarize-dump \ + --input-file /path/to/crash_dump.pkl +``` + +Then replay: + +```bash +python3 /path/to/sglang/scripts/playground/replay_request_dump.py \ + --input-file /path/to/crash_dump.pkl \ + --host 127.0.0.1 \ + --port 30000 \ + --parallel 128 +``` + +If `safe_pickle_load` blocks a locally captured trusted dump, use: + +```bash +python3 scripts/replay_trusted_request_dump.py \ + --input-file /path/to/request_dump.pkl \ + --host 127.0.0.1 \ + --port 30000 \ + --parallel 1 +``` + +If replay indicates a CUDA crash path, restart the same build with coredumps +enabled before reproducing again: + +```bash +SGLANG_CUDA_COREDUMP=1 \ +SGLANG_CUDA_COREDUMP_DIR=/tmp/sglang_cuda_coredumps \ +python -m sglang.launch_server \ + --model-path ... \ + --crash-dump-folder /tmp/sglang_crash_dump \ + ... +``` + +Then inspect the generated coredump: + +```bash +cuda-gdb "$(which python3)" \ + -ex "target cudacore /tmp/sglang_cuda_coredumps/cuda_coredump_.." +``` + +For a replay-first crash example, read +[references/case-studies.md](references/case-studies.md). + +#### OTel trace + +Use tracing when: + +- request-stage timing is unclear +- router vs. worker attribution is unclear +- PD prefill/decode transfer may be implicated + +If tracing was enabled at startup, you can change the level without restart: + +```bash +curl "http://127.0.0.1:30000/set_trace_level?level=1" +curl "http://127.0.0.1:30000/set_trace_level?level=2" +``` + +#### Torch profile + +Use profiling when: + +- the issue is already narrowed to compute-side ownership +- replay already reproduces the problem +- metrics and loads do not explain the regression + +At that point, switch to `llm-torch-profiler-analysis`. 
Do not duplicate +its profiling workflow here. + +For a low-noise latency example, read +[references/case-studies.md](references/case-studies.md). + +#### Distributed hang + +If this looks like a collective stall, save the failing request, replay it on a +clean target, collect the replay-time bundle and stacks, then switch to +`debug-distributed-hang`. + +For an example of that flow, read +[references/case-studies.md](references/case-studies.md). + +#### Regression between two commits + +If one commit is known-good and another is known-bad, build a deterministic +harness before doing deeper manual debugging: + +1. choose a stable reproducer: request replay, benchmark command, or correctness check +2. make the harness return `0` on good behavior and non-zero on bad behavior +3. run `git bisect start ` +4. run `git bisect run ` +5. return here only after a candidate commit is isolated + +Prefer replay-backed bisect when the regression depends on request shape or +long-running serving state. + +### 5. Switch tools when the boundary is clear + +Switch tools once the fault class is clear: + +- `llm-torch-profiler-analysis` for kernel and overlap attribution +- `debug-distributed-hang` for collective or rank-divergence hangs +- `debug-cuda-crash` for CUDA crash reproduction and kernel API logging + +Do not switch tools before collecting the first bundle unless the user already has +decisive logs or dumps. 
+ +## References + +Load only what the current step needs: + +- [references/decision-tree.md](references/decision-tree.md) + - problem classes, tool switch points, return shape +- [references/endpoints-and-signals.md](references/endpoints-and-signals.md) + - endpoint behavior, auth notes, field reading +- [references/replay-trace-profile.md](references/replay-trace-profile.md) + - request dump, crash dump, replay, trace, profiler step, bisect +- [references/case-studies.md](references/case-studies.md) + - compact examples for replay-first CUDA crash, latency, and distributed-hang triage + +## Scripts + +- [scripts/incident_artifact_tool.py](scripts/incident_artifact_tool.py) + - collect a read-only live bundle + - summarize a collected bundle into a compact debug note + - summarize a trusted request dump or crash dump before replay +- [scripts/replay_trusted_request_dump.py](scripts/replay_trusted_request_dump.py) + - replay a trusted request dump when `safe_pickle_load` blocks stock replay + +If a live bundle was collected, include its path. + +If replay, trace, or profiling was chosen, say why bundle plus dump were not enough. diff --git a/.claude/skills/sglang-prod-incident-triage/references/case-studies.md b/.claude/skills/sglang-prod-incident-triage/references/case-studies.md new file mode 100644 index 000000000000..70de41d51f5a --- /dev/null +++ b/.claude/skills/sglang-prod-incident-triage/references/case-studies.md @@ -0,0 +1,81 @@ +# Case Studies + +Use these examples only after the live bundle and request dump point toward the +same class of failure. They are patterns for how to reason from replayable +evidence, not recipes to copy blindly. + +## CUDA Crash: Upstream Top-K Corruption, Downstream MoE OOB + +Use when a replayed CUDA crash lands in a MoE align or shared-memory kernel but +the suspicious data was produced by an earlier routing kernel. 
+ +Shape that made the original case useful: + +- model family: Qwen3 MoE +- visible crash: `moe_align_block_size_kernel` +- likely producer: `topkGatingSoftmax` / MoE top-k routing +- evidence path: crash dump -> replay -> CUDA coredump -> walk one kernel + upstream from the visible fault + +Triage loop: + +```text +summarize crash dump + -> replay the exact request + -> enable CUDA coredump on the replay target + -> identify the failing kernel + -> inspect the immediately preceding producer kernel and tensors +``` + +Key lesson: a consumer kernel can be the first one to fault even when the bad +index was produced earlier. Preserve the request shape before changing prompts. + +## Latency: TTFT Spike With Low Queue Time + +Use when `/health` and `/health_generate` are green, queue depth is low, but TTFT +is still high. + +Signals from the original case: + +- `waiting=0` +- average queue time was tiny +- TTFT was high +- scheduler stage timing pointed to prefill forward time + +Triage loop: + +```text +collect live bundle + -> save the slow request + -> replay the same request on a clean target + -> profile only after replay reproduces compute-side ownership +``` + +Key lesson: rule out queue pressure with `/v1/loads`, `/metrics`, and stage +timing before opening a profiler trace. + +## Distributed Hang: Request-Shaped TP Collective Mismatch + +Use when one request hangs, ranks stop making progress differently, and the +failure looks like a generic serving stall until replay isolates it. 
+ +Shape that made the original case useful: + +- a prompt tokenized to a specific extend length +- one TP rank skipped a logits `all_gather` +- the peer rank still entered the real collective +- the request never returned + +Triage loop: + +```text +collect healthy bundle + -> save the trigger request + -> replay on a clean target + -> collect rank stacks and replay-time bundle + -> switch to debug-distributed-hang +``` + +Key lesson: once the symptom looks like rank divergence or a collective mismatch, +do not keep profiling kernels. Preserve the replay and move to distributed-hang +debugging. diff --git a/.claude/skills/sglang-prod-incident-triage/references/decision-tree.md b/.claude/skills/sglang-prod-incident-triage/references/decision-tree.md new file mode 100644 index 000000000000..9aa03b567c67 --- /dev/null +++ b/.claude/skills/sglang-prod-incident-triage/references/decision-tree.md @@ -0,0 +1,197 @@ +# SGLang First Checks + +Use this reference when the problem class is still unclear and you need a fast +starting point. + +## Default Order + +1. classify the symptom +2. collect the fastest useful signal +3. save the failing request or dump +4. replay before you profile + +Do not start with `torch.profiler` unless the issue is already clearly +compute-side. + +If one commit is known-good and another is known-bad, turn the problem into a +stable `git bisect run ` first. 
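The known-good/known-bad advice above depends on a harness whose exit code `git bisect run` can interpret: `0` marks the commit good, `125` asks bisect to skip it, and any other code from `1` to `127` marks it bad. The following is a minimal sketch of that mapping, not part of the skill itself. The `check` callable and the names `GOOD`, `BAD`, `SKIP`, and `bisect_exit_code` are hypothetical; a real harness would wrap a request replay, benchmark command, or correctness check.

```python
from typing import Callable, Optional

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(check: Callable[[], Optional[bool]]) -> int:
    """Map a reproducer result to a `git bisect run` exit code.

    `check` is a hypothetical callable: True means the problem reproduced,
    False means the commit behaves correctly, and None means the commit
    could not be tested at all (for example, the build failed).
    """
    try:
        reproduced = check()
    except Exception:
        # A harness error says nothing about the commit itself, so skip it
        # rather than mislabel it as bad.
        return SKIP
    if reproduced is None:
        return SKIP
    return BAD if reproduced else GOOD
```

In a harness script, the result would typically be passed to `sys.exit(...)` so that bisect sees it as the process exit status. Treating harness errors as "skip" is a design choice here; it keeps an unrelated build break from steering the bisection toward the wrong commit.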
+ +## Problem Classes + +### Server down or unhealthy + +Check: + +- `/health` +- `/health_generate` +- `/server_info` +- recent stderr/stdout +- crash dump status if `--crash-dump-folder` is enabled + +Likely directions: + +- startup or weight-load failure +- deadlock or blocked scheduler +- CUDA crash or OOM +- auth or routing mismatch + +### High latency or low throughput + +Check: + +- `/v1/loads?include=all` +- `/metrics` +- `/server_info` +- the exact request shape or benchmark command + +Likely directions: + +- queueing or capacity pressure +- cache hit rate collapse +- PD or EP topology mismatch +- speculative decoding disabled or ineffective +- kernel or backend regression + +### Wrong output or behavior regression + +Check: + +- exact request and expected output +- `/model_info` +- `/server_info` +- current weights or recent config change + +Likely directions: + +- wrong weights or wrong revision +- chat template, parser, or tool config drift +- multimodal preprocessing drift +- quantization or kernel correctness bug + +### Timeout or hang + +Check: + +- `/health` +- `/health_generate` +- `/v1/loads?include=all` +- request dumps if enabled +- per-rank logs +- OTel trace if already enabled + +Likely directions: + +- distributed divergence or collective hang +- queue starvation or retraction storm +- PD transfer stall +- storage or HiCache backend stall + +## Quick Paths + +### TTFT spike + +Start with: + +- `/v1/loads?include=all` +- `/metrics` +- `/server_info` + +Watch for: + +- `num_waiting_reqs` growth +- `token_usage` saturation +- `cache_hit_rate` drop +- PD queue buildup + +If queue pressure does not explain the slowdown, save the slow request and +replay it. 
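The "Watch for" list in the TTFT path can be sketched as a comparison of two load snapshots. This is only an illustration under stated assumptions: it treats each snapshot as a flat dictionary keyed by the field names listed above (`num_waiting_reqs`, `token_usage`, `cache_hit_rate`), which may not match the actual `/v1/loads` response layout, and the `saturation` threshold and the function name `queue_pressure_signals` are hypothetical.

```python
from typing import Dict, List


def queue_pressure_signals(
    earlier: Dict[str, float],
    later: Dict[str, float],
    saturation: float = 0.95,  # assumed threshold; tune per deployment
) -> List[str]:
    """Compare two load snapshots and list the queue-pressure signals seen."""
    signals: List[str] = []
    # More requests waiting now than before.
    if later.get("num_waiting_reqs", 0) > earlier.get("num_waiting_reqs", 0):
        signals.append("num_waiting_reqs growth")
    # Token capacity is close to full in the later snapshot.
    if later.get("token_usage", 0.0) >= saturation:
        signals.append("token_usage saturation")
    # Cache reuse got worse between the two snapshots.
    if later.get("cache_hit_rate", 0.0) < earlier.get("cache_hit_rate", 0.0):
        signals.append("cache_hit_rate drop")
    return signals
```

An empty result corresponds to the case where queue pressure does not explain the slowdown, which is the point at which the slow request should be saved and replayed. PD queue buildup is left out of the sketch because its field layout is not described here.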
+ +### Throughput collapse + +Start with: + +- `/v1/loads?include=all` +- `/metrics` +- benchmark reproduction if available + +Watch for: + +- low `gen_throughput` +- queue growth +- low cache hit rate +- speculative metrics collapse +- PD transfer or decode prealloc queues backing up + +### Crash after some requests + +Start with: + +- crash dump folder +- stderr/stdout +- request dump folder if available + +Then replay the crash dump or recent request dump. + +### Regression between two commits + +Start with: + +- known-good commit +- known-bad commit +- one stable pass/fail harness + +Best move: + +- `git bisect run ` + +### One request class fails + +Start with: + +- exact request payload +- request dump if available +- smallest reproduction request + +Typical categories: + +- multimodal edge case +- parser or structured output bug +- model-specific kernel path +- tool-call formatting issue + +## When To Switch Tools + +### Use replay when + +- a crash dump or request dump already exists +- the issue depends on request shape or workload mix +- you need one stable reproducer before going deeper + +### Use OTel trace when + +- request-stage timing is unclear +- router vs. worker ownership is unclear +- PD boundaries may be involved + +### Use torch profiler when + +- replay already reproduces the issue +- queueing and routing are mostly ruled out +- you need kernel-level attribution + +At that point, switch to `llm-torch-profiler-analysis`. 
+ +### Use lower-level debug paths when + +- replay plus trace still leave ambiguity +- the problem looks like a specific crash, hang, or correctness bug + +## What To Return + +- problem class +- what was checked +- strongest signal so far +- current best guess +- what was ruled out +- next step +- production risk diff --git a/.claude/skills/sglang-prod-incident-triage/references/endpoints-and-signals.md b/.claude/skills/sglang-prod-incident-triage/references/endpoints-and-signals.md new file mode 100644 index 000000000000..c99e95d7ac38 --- /dev/null +++ b/.claude/skills/sglang-prod-incident-triage/references/endpoints-and-signals.md @@ -0,0 +1,218 @@ +# SGLang Endpoints and Signals + +Use this reference when checking a live server. + +## Auth + +Most read endpoints are public unless the server is protected by `api_key` or +`admin_api_key`. + +Use: + +```bash +curl -H "Authorization: Bearer " ... +``` + +Rules: + +- normal protected endpoints require `api_key` +- admin endpoints require `admin_api_key` +- some HiCache endpoints fail if `admin_api_key` is not configured at all +- `/health` and metrics-style health checks are usually still exposed + +## Core Endpoints + +### `/health` + +Cheap liveness check. + +- `200`: process is alive enough to answer health +- `503`: starting, shutting down, or unhealthy + +`/health` alone is not enough for latency or hang diagnosis. + +### `/health_generate` + +Active health check. + +- exercises a real generate or embedding path +- catches stuck schedulers or broken worker paths that `/health` can miss + +Use this when requests time out but `/health` is still green. + +### `/model_info` + +Use for model identity: + +- `model_path` +- `tokenizer_path` +- `is_generation` +- `weight_version` +- multimodal flags +- model type or architectures + +This is the first check for wrong-output or wrong-weight problems. 
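For wrong-output triage, one way to use `/model_info` is to compare a snapshot from a known-good deployment against the live one. The sketch below assumes both snapshots are already parsed into dictionaries and that the identity fields appear as top-level keys with the names listed above; the helper name `model_identity_drift` and the `IDENTITY_FIELDS` tuple are hypothetical.

```python
from typing import Any, Dict, Tuple

# Identity fields taken from the list above; treating them as top-level
# keys of the parsed response is an assumption.
IDENTITY_FIELDS = ("model_path", "tokenizer_path", "is_generation", "weight_version")


def model_identity_drift(
    expected: Dict[str, Any], live: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (expected_value, live_value)} for fields that differ."""
    return {
        field: (expected.get(field), live.get(field))
        for field in IDENTITY_FIELDS
        if expected.get(field) != live.get(field)
    }
```

An empty result suggests the model identity did not change between the two snapshots, so attention can move to `/server_info`, the request payload, or template and parser configuration.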
+ +### `/server_info` + +Use for runtime shape: + +- serialized `server_args` +- scheduler info +- per-DP `internal_states` +- SGLang version + +This is usually the single best live snapshot. + +## Load And Capacity + +### `/v1/loads?include=all` + +Best structured load endpoint for a first pass. + +Useful fields: + +- `num_running_reqs` +- `num_waiting_reqs` +- `num_total_tokens` +- `num_used_tokens` +- `token_usage` +- `gen_throughput` +- `cache_hit_rate` +- `memory` +- `speculative` +- `disaggregation` +- `queues` + +Useful queries: + +```bash +curl -s http://127.0.0.1:30000/v1/loads +curl -s "http://127.0.0.1:30000/v1/loads?include=all" +curl -s "http://127.0.0.1:30000/v1/loads?include=core,queues,disagg" +curl -s "http://127.0.0.1:30000/v1/loads?format=prometheus" +``` + +What to look for: + +- high `num_waiting_reqs` with low compute throughput usually means queueing or capacity pressure +- `token_usage` near `1.0` usually means KV or token-capacity pressure +- low `cache_hit_rate` after a deploy can explain TTFT regressions +- PD queue fields often explain transfer or prealloc bottlenecks hidden by plain queue size + +### `/metrics` + +Prometheus endpoint. Use it when you need trends rather than one live snapshot. + +High-value metrics: + +- `sglang:time_to_first_token_seconds` +- `sglang:time_per_output_token_seconds` +- `sglang:e2e_request_latency_seconds` +- `sglang:num_running_reqs` +- `sglang:num_queue_reqs` +- `sglang:num_used_tokens` +- `sglang:cache_hit_rate` +- `sglang:gen_throughput` +- `sglang:token_usage` + +## Request Capture + +### `/configure_logging` + +Used by `python -m sglang.srt.managers.configure_logging`. 
+ +Main use: + +- enable request logging +- set request logging level +- enable request dump folder +- set request dump threshold + +Typical payload: + +```json +{ + "log_requests": true, + "log_requests_level": 3, + "dump_requests_folder": "/tmp/sglang_request_dump", + "dump_requests_threshold": 100 +} +``` + +Use this when the problem is ongoing and you need the next failing request +without restarting the service. + +## HiCache + +### `GET /hicache/storage-backend` + +Returns tokenizer-side HiCache storage status: + +- `hicache_storage_backend` +- `hicache_storage_backend_extra_config` +- `hicache_storage_prefetch_policy` +- `hicache_write_policy` + +Use this when long-context or PD problems may involve storage-backed KV reuse. + +### `PUT /hicache/storage-backend` +### `DELETE /hicache/storage-backend` + +Runtime attach or detach. These are operational actions, not passive checks. + +## Profiling And Tracing Controls + +### `/start_profile` +### `/stop_profile` + +Use only after the problem is already narrowed down. + +### `/set_trace_level?level=N` + +Changes trace verbosity when tracing was enabled at startup. + +Levels: + +- `0`: disabled +- `1`: important slices +- `2`: all slices except nested ones +- `3`: all slices + +## Quick Reads By Problem Type + +### TTFT spike + +Read: + +- `/server_info` +- `/v1/loads?include=all` +- `/metrics` + +Compare: + +- queue size +- token usage +- cache hit rate +- PD disaggregation queues + +### Hang or timeout + +Read: + +- `/health` +- `/health_generate` +- `/server_info` +- `/v1/loads?include=all` + +If tracing is already enabled, look at trace data before heavier profiling. + +### Wrong model behavior + +Read: + +- `/model_info` +- `/server_info` +- exact request payload and parser or template config + +Do not jump to kernel profiling until config drift is ruled out. 
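Ruling out config drift usually means comparing two `/server_info` snapshots, for example one from a healthy replica and one from the failing one. The following is a minimal sketch, assuming both snapshots are already parsed JSON. `flatten` and `diff_snapshots` are hypothetical helpers, and the exact nesting of the response is not assumed.

```python
from typing import Any, Dict, Iterator, Tuple

MISSING = "<missing>"  # placeholder shown when a key exists in only one snapshot


def flatten(obj: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_key, leaf_value) pairs from nested dicts and lists."""
    if isinstance(obj, dict):
        for key in sorted(obj):
            yield from flatten(obj[key], f"{prefix}{key}.")
    elif isinstance(obj, list):
        for idx, item in enumerate(obj):
            yield from flatten(item, f"{prefix}{idx}.")
    else:
        yield prefix.rstrip("."), obj


def diff_snapshots(old: Any, new: Any) -> Dict[str, Tuple[Any, Any]]:
    """Map each dotted key whose leaf value changed to a (before, after) pair."""
    old_flat = dict(flatten(old))
    new_flat = dict(flatten(new))
    changed = {}
    for key in sorted(set(old_flat) | set(new_flat)):
        before = old_flat.get(key, MISSING)
        after = new_flat.get(key, MISSING)
        if before != after:
            changed[key] = (before, after)
    return changed
```

Flattening to dotted keys keeps the diff readable even when the drift is buried several levels deep in the serialized server arguments.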
diff --git a/.claude/skills/sglang-prod-incident-triage/references/replay-trace-profile.md b/.claude/skills/sglang-prod-incident-triage/references/replay-trace-profile.md new file mode 100644 index 000000000000..6dde7a2542b8 --- /dev/null +++ b/.claude/skills/sglang-prod-incident-triage/references/replay-trace-profile.md @@ -0,0 +1,236 @@ +# Replay, Trace, Profile, and Bisect + +Use this reference after the first live checks. The goal is to turn the problem +into something repeatable. + +## Save Requests + +### Request dump + +```bash +python3 -m sglang.srt.managers.configure_logging \ + --url http://127.0.0.1:30000 \ + --dump-requests-folder /tmp/sglang_request_dump \ + --dump-requests-threshold 100 +``` + +Use this when: + +- the problem is intermittent +- you need the real request shape +- you do not want to restart the server + +### Crash dump + +If the server already runs with: + +```bash +--crash-dump-folder /tmp/crash_dump +``` + +SGLang saves recent requests before a crash. Treat that dump as the best +starting point. + +Summarize it first: + +```bash +python3 scripts/incident_artifact_tool.py summarize-dump \ + --input-file /path/to/crash_dump.pkl +``` + +Current crash-dump tests show at least: + +- `server_args` +- `requests` +- `launch_command` + +## Replay + +Use the stock replay tool: + +```bash +python3 scripts/playground/replay_request_dump.py \ + --input-file /path/to/crash_dump.pkl \ + --host 127.0.0.1 \ + --port 30000 \ + --parallel 128 +``` + +Or replay a folder: + +```bash +python3 scripts/playground/replay_request_dump.py \ + --input-folder /path/to/request_dump_dir \ + --file-number 10 \ + --parallel 128 +``` + +If `safe_pickle_load` blocks a locally captured trusted dump, use: + +```bash +python3 scripts/replay_trusted_request_dump.py \ + --input-file /path/to/request_dump.pkl \ + --host 127.0.0.1 \ + --port 30000 \ + --parallel 1 +``` + +If that happens, the allowlist is the problem, not the dump. 
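Before replaying, it can help to confirm that the dump itself loads outside the allowlist path. The following is a minimal sketch, assuming a locally captured, trusted file. `load_trusted_dump` and `describe_dump` are hypothetical helper names, and treating a non-dict payload as a bare request list is an assumption.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


def load_trusted_dump(path: Path) -> Dict[str, Any]:
    """Load a dump with plain pickle. Only use this on files you captured yourself."""
    with path.open("rb") as fh:
        payload = pickle.load(fh)  # plain pickle can execute code from untrusted files
    # Assumption: a non-dict payload is a bare list of request records.
    return payload if isinstance(payload, dict) else {"requests": payload}


def describe_dump(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Report the top-level keys and the number of captured requests."""
    requests = payload.get("requests") or []
    return {"num_requests": len(requests), "keys": sorted(payload)}
```

If this loads cleanly and reports the expected request count, the dump is intact and the replay failure is on the loader side.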
+ +Use replay before profiling when: + +- the issue depends on workload mix +- it only appears after some number of requests +- you need to compare two builds on the same traffic + +## CUDA Restart-And-Replay + +If replay points to a CUDA crash path, restart the same build with coredumps: + +```bash +SGLANG_CUDA_COREDUMP=1 \ +SGLANG_CUDA_COREDUMP_DIR=/tmp/sglang_cuda_coredumps \ +python -m sglang.launch_server \ + --model-path ... \ + --crash-dump-folder /tmp/sglang_crash_dump \ + ... +``` + +Then inspect the coredump: + +```bash +cuda-gdb "$(which python3)" \ + -ex "target cudacore /tmp/sglang_cuda_coredumps/cuda_coredump_.." +``` + +Good first commands: + +- `where` +- `info cuda kernels` +- `x/10i ` + +Use the coredump to find the failing kernel, not automatically the root-cause +kernel. + +See: + +- [case-studies.md](case-studies.md) + +## Trace + +Tracing must be enabled at startup: + +```bash +python -m sglang.launch_server \ + --enable-trace \ + --otlp-traces-endpoint localhost:4317 \ + ... +``` + +Optional router command: + +```bash +python -m sglang_router.launch_router \ + --enable-trace \ + --otlp-traces-endpoint localhost:4317 \ + ... +``` + +Useful environment variables: + +```bash +export SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS=500 +export SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE=64 +``` + +If tracing is already enabled, change the level without restart: + +```bash +curl "http://127.0.0.1:30000/set_trace_level?level=1" +curl "http://127.0.0.1:30000/set_trace_level?level=2" +curl "http://127.0.0.1:30000/set_trace_level?level=3" +``` + +Use tracing for: + +- router vs. 
worker delay +- tokenizer / scheduler / detokenizer timing +- PD transfer timing +- request timing across processes + +If you already have OTEL JSON or JSONL, convert it for timeline inspection: + +```bash +python3 scripts/convert_otel_2_perfetto.py \ + --input /tmp/otel_trace.json \ + --output /tmp/sglang_trace_perfetto.json +``` + +## Torch Profiler + +Switch to `llm-torch-profiler-analysis` when: + +- replay already reproduces the issue +- metrics and loads do not explain it +- the problem now looks compute-side + +This skill should decide when to profile, not duplicate the profiler workflow. + +## Bisect + +If one commit is known-good and a newer commit is known-bad: + +1. build a deterministic harness from the problem +2. prefer replay-based harnesses when the failure depends on request mix +3. use `git bisect run ` +4. only then go back to trace or profile if needed + +Example: + +```bash +git bisect start +git bisect run bash ./repro_or_check.sh +``` + +## Common Paths + +### Crash + +1. crash dump +2. summarize dump +3. replay +4. CUDA coredump plus `cuda-gdb` +5. `debug-cuda-crash` or narrower instrumentation + +### TTFT regression + +1. baseline metrics and loads +2. request dump +3. replay the slow request +4. trace if stage ownership is unclear +5. `llm-torch-profiler-analysis` if it still looks compute-side + +See: + +- [case-studies.md](case-studies.md) + +### Distributed hang + +1. healthy baseline bundle +2. save the trigger request +3. replay on a clean target +4. collect replay-time bundle and stacks +5. identify the NCCL or collective path +6. switch to `debug-distributed-hang` + +See: + +- [case-studies.md](case-studies.md) + +### Throughput regression after deploy + +1. compare `server_info` +2. compare `/metrics` and `/v1/loads` +3. replay stable workload +4. bisect if one older commit is known-good +5. 
profile only if compute still looks suspicious diff --git a/.claude/skills/sglang-prod-incident-triage/scripts/incident_artifact_tool.py b/.claude/skills/sglang-prod-incident-triage/scripts/incident_artifact_tool.py new file mode 100755 index 000000000000..2a9dacc69a63 --- /dev/null +++ b/.claude/skills/sglang-prod-incident-triage/scripts/incident_artifact_tool.py @@ -0,0 +1,735 @@ +#!/usr/bin/env python3 +"""Collect or inspect serving bundles and dumps for SGLang debug.""" + +from __future__ import annotations + +import argparse +import glob +import json +import math +import os +import pickle +import re +import time +from collections import defaultdict +from datetime import datetime +from pathlib import Path +from typing import Any, Dict, Optional, Sequence +from urllib import error, parse, request + +METRIC_RE = re.compile( + r"^(?P<name>[^{\s]+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$" +) +LABEL_RE = re.compile(r'([a-zA-Z_:][a-zA-Z0-9_:]*)="((?:[^"\\]|\\.)*)"') +ENDPOINT_SPECS = ( + ("text", "health.txt", "/health"), + ("text", "health_generate.txt", "/health_generate"), + ("text", "metrics.txt", "/metrics"), + ("json", "model_info.json", "/model_info"), + ("json", "server_info.json", "/server_info"), + ("json", "loads_all.json", "/v1/loads?include=all"), + ( + "json", + "loads_core_queues_disagg.json", + "/v1/loads?include=core,queues,disagg,spec", + ), + ("json", "hicache_storage_backend.json", "/hicache/storage-backend"), +) +BUNDLE_NOTES = [ + "This bundle is read-only.
It does not start profiling or change trace level.", + "HiCache status may fail if admin_api_key is not configured or the wrong bearer token was used.", + "loads_all.json is the best point-in-time load snapshot in this bundle.", + "metrics.txt is raw Prometheus text intended for follow-up parsing.", +] + + +def request_text( + base_url: str, + path: str, + token: Optional[str], + timeout: float = 10.0, +) -> tuple[bool, int, str]: + url = parse.urljoin(base_url.rstrip("/") + "/", path.lstrip("/")) + req = request.Request(url) + if token: + req.add_header("Authorization", f"Bearer {token}") + try: + with request.urlopen(req, timeout=timeout) as resp: + body = resp.read().decode("utf-8", errors="replace") + return True, resp.status, body + except error.HTTPError as e: + body = e.read().decode("utf-8", errors="replace") + return False, e.code, body + except Exception as e: # noqa: BLE001 + return False, -1, f"{type(e).__name__}: {e}" + + +def request_endpoint( + base_url: str, + path: str, + token: Optional[str], + parse_json: bool, + timeout: float = 10.0, +) -> Dict[str, Any]: + ok, status, body = request_text(base_url, path, token, timeout=timeout) + result: Dict[str, Any] = {"ok": ok, "status": status, "path": path} + if not ok: + result["error"] = body + return result + if not parse_json: + result["text"] = body + return result + try: + result["json"] = json.loads(body) + except json.JSONDecodeError: + result["text"] = body + result["decode_error"] = "response was not valid JSON" + return result + + +def write_json(path: Path, obj: Dict[str, Any]) -> None: + path.write_text( + json.dumps(obj, indent=2, ensure_ascii=False) + "\n", encoding="utf-8" + ) + + +def write_text(path: Path, text: str) -> None: + path.write_text(text, encoding="utf-8") + + +def format_summary_line(filename: str, result: Dict[str, Any]) -> str: + if result.get("ok"): + return f"{filename}: ok" + return ( + f"{filename}: failed status={result.get('status')} " + f"error={result.get('error')}" 
+ ) + + +def collect_bundle( + base_url: str, + token: Optional[str], + outdir: Optional[str], + timeout: float, +) -> Path: + timestamp = time.strftime("%Y%m%d_%H%M%S") + bundle_dir = Path(outdir or f"./incident_bundle_{timestamp}").resolve() + bundle_dir.mkdir(parents=True, exist_ok=True) + + metadata = { + "artifact_type": "incident_bundle", + "base_url": base_url, + "collected_at": timestamp, + "token_provided": bool(token), + "timeout_seconds": timeout, + } + write_json(bundle_dir / "metadata.json", metadata) + + summary_lines = [] + for kind, filename, path in ENDPOINT_SPECS: + result = request_endpoint( + base_url, path, token, parse_json=(kind == "json"), timeout=timeout + ) + output_path = bundle_dir / filename + if kind == "text" and result.get("ok"): + write_text(output_path, str(result.get("text", ""))) + else: + write_json( + ( + output_path + if kind == "json" + else bundle_dir / f"{filename}.error.json" + ), + result, + ) + summary_lines.append(format_summary_line(filename, result)) + + write_text( + bundle_dir / "SUMMARY.txt", + "\n".join(summary_lines + [""] + BUNDLE_NOTES) + "\n", + ) + return bundle_dir + + +def load_json(path: Path) -> Optional[Dict[str, Any]]: + if not path.exists(): + return None + return json.loads(path.read_text(encoding="utf-8")) + + +def unwrap_result(path: Path) -> Optional[Dict[str, Any]]: + obj = load_json(path) + if obj is None: + return None + if isinstance(obj, dict) and "json" in obj: + return obj.get("json") + return obj + + +def read_text(path: Path) -> Optional[str]: + if not path.exists(): + return None + return path.read_text(encoding="utf-8") + + +def endpoint_ok(bundle_dir: Path, stem: str) -> bool: + return (bundle_dir / f"{stem}.txt").exists() and not ( + bundle_dir / f"{stem}.txt.error.json" + ).exists() + + +def parse_labels(raw: Optional[str]) -> Dict[str, str]: + if not raw: + return {} + labels = {} + for key, value in LABEL_RE.findall(raw): + labels[key] = bytes(value, 
"utf-8").decode("unicode_escape") + return labels + + +def parse_metrics(metrics_text: str) -> Dict[str, list[dict[str, Any]]]: + series: Dict[str, list[dict[str, Any]]] = defaultdict(list) + for line in metrics_text.splitlines(): + line = line.strip() + if not line or line.startswith("#"): + continue + match = METRIC_RE.match(line) + if not match: + continue + series[match.group("name")].append( + { + "labels": parse_labels(match.group("labels")), + "value": float(match.group("value")), + } + ) + return series + + +def metric_sum(metrics: Dict[str, list[dict[str, Any]]], name: str) -> float: + return sum(item["value"] for item in metrics.get(name, [])) + + +def safe_div( + numerator: Optional[float], denominator: Optional[float] +) -> Optional[float]: + if numerator is None or denominator in (None, 0): + return None + return numerator / denominator + + +def coalesce(*values: Any) -> Any: + for value in values: + if value is not None: + return value + return None + + +def fmt_float(value: Optional[float], digits: int = 3) -> str: + if value is None or ( + isinstance(value, float) and (math.isnan(value) or math.isinf(value)) + ): + return "n/a" + return f"{value:.{digits}f}" + + +def is_positive_number(value: Any, threshold: float = 0.0) -> bool: + return ( + isinstance(value, (int, float)) + and not math.isnan(value) + and not math.isinf(value) + and value > threshold + ) + + +def compute_stage_averages( + metrics: Dict[str, list[dict[str, Any]]], sum_name: str, count_name: str +) -> Dict[str, float]: + grouped_sum: Dict[str, float] = defaultdict(float) + grouped_count: Dict[str, float] = defaultdict(float) + for item in metrics.get(sum_name, []): + stage = item["labels"].get("stage", "") + rank = item["labels"].get("tp_rank", "") + grouped_sum[f"{stage}|{rank}"] += item["value"] + for item in metrics.get(count_name, []): + stage = item["labels"].get("stage", "") + rank = item["labels"].get("tp_rank", "") + grouped_count[f"{stage}|{rank}"] += item["value"] + + 
result: Dict[str, float] = {} + for key, total_sum in grouped_sum.items(): + stage, _rank = key.split("|", 1) + avg = safe_div(total_sum, grouped_count.get(key)) + if avg is None: + continue + result[stage] = max(result.get(stage, 0.0), avg) + return result + + +def add_signal(signals: list[str], text: str) -> None: + if text not in signals: + signals.append(text) + + +def build_bundle_summary(bundle_dir: Path) -> Dict[str, Any]: + metadata = load_json(bundle_dir / "metadata.json") or {} + model_info = unwrap_result(bundle_dir / "model_info.json") or {} + server_info = unwrap_result(bundle_dir / "server_info.json") or {} + loads_info = unwrap_result(bundle_dir / "loads_all.json") or {} + metrics_text = read_text(bundle_dir / "metrics.txt") or "" + metrics = parse_metrics(metrics_text) + + aggregate = loads_info.get("aggregate") or {} + loads = loads_info.get("loads") or [] + load0 = loads[0] if loads else {} + internal_states = server_info.get("internal_states") or [] + runtime_state = internal_states[0] if internal_states else {} + memory_usage = runtime_state.get("memory_usage") or load0.get("memory") or {} + + ttft_avg = safe_div( + metric_sum(metrics, "sglang:time_to_first_token_seconds_sum"), + metric_sum(metrics, "sglang:time_to_first_token_seconds_count"), + ) + e2e_avg = safe_div( + metric_sum(metrics, "sglang:e2e_request_latency_seconds_sum"), + metric_sum(metrics, "sglang:e2e_request_latency_seconds_count"), + ) + queue_avg = safe_div( + metric_sum(metrics, "sglang:queue_time_seconds_sum"), + metric_sum(metrics, "sglang:queue_time_seconds_count"), + ) + per_stage_avg = compute_stage_averages( + metrics, + "sglang:per_stage_req_latency_seconds_sum", + "sglang:per_stage_req_latency_seconds_count", + ) + + summary: Dict[str, Any] = { + "artifact_type": "incident_bundle", + "bundle_dir": str(bundle_dir), + "base_url": metadata.get("base_url"), + "collected_at": metadata.get("collected_at"), + "health": { + "health_ok": endpoint_ok(bundle_dir, "health"), + 
"health_generate_ok": endpoint_ok(bundle_dir, "health_generate"), + }, + "model": { + "model_path": model_info.get("model_path") or server_info.get("model_path"), + "served_model_name": server_info.get("served_model_name"), + "weight_version": model_info.get("weight_version") + or server_info.get("weight_version"), + "model_type": model_info.get("model_type"), + "is_generation": model_info.get("is_generation"), + }, + "topology": { + "tp_size": server_info.get("tp_size"), + "dp_size": server_info.get("dp_size"), + "pp_size": server_info.get("pp_size"), + "ep_size": server_info.get("ep_size"), + "disaggregation_mode": server_info.get("disaggregation_mode"), + "attention_backend": server_info.get("attention_backend"), + "sampling_backend": server_info.get("sampling_backend"), + "schedule_policy": server_info.get("schedule_policy"), + "enable_trace": server_info.get("enable_trace"), + "enable_metrics": server_info.get("enable_metrics"), + }, + "capacity": { + "max_total_num_tokens": server_info.get("max_total_num_tokens"), + "max_req_input_len": server_info.get("max_req_input_len"), + "effective_max_running_requests_per_dp": coalesce( + runtime_state.get("effective_max_running_requests_per_dp"), + load0.get("max_running_requests"), + ), + "weight_gb": coalesce( + memory_usage.get("weight"), memory_usage.get("weight_gb") + ), + "kv_cache_gb": coalesce( + memory_usage.get("kvcache"), memory_usage.get("kv_cache_gb") + ), + "graph_gb": coalesce( + memory_usage.get("graph"), memory_usage.get("graph_gb") + ), + "token_capacity": memory_usage.get("token_capacity"), + }, + "point_in_time_load": { + "running_reqs": coalesce( + aggregate.get("total_running_reqs"), load0.get("num_running_reqs") + ), + "waiting_reqs": coalesce( + aggregate.get("total_waiting_reqs"), load0.get("num_waiting_reqs") + ), + "total_reqs": coalesce( + aggregate.get("total_reqs"), load0.get("num_total_reqs") + ), + "token_usage": coalesce( + aggregate.get("avg_token_usage"), load0.get("token_usage") + ), 
+ "avg_throughput": coalesce( + aggregate.get("avg_throughput"), load0.get("gen_throughput") + ), + "avg_utilization": coalesce( + aggregate.get("avg_utilization"), load0.get("utilization") + ), + "cache_hit_rate": load0.get("cache_hit_rate"), + "queues": load0.get("queues"), + "disaggregation": load0.get("disaggregation"), + }, + "metrics": { + "request_count": metric_sum(metrics, "sglang:num_requests_total"), + "prompt_tokens_total": metric_sum(metrics, "sglang:prompt_tokens_total"), + "generation_tokens_total": metric_sum( + metrics, "sglang:generation_tokens_total" + ), + "avg_ttft_seconds": ttft_avg, + "avg_e2e_seconds": e2e_avg, + "avg_queue_time_seconds": queue_avg, + "stage_avg_seconds_max_tp_rank": per_stage_avg, + }, + "signals": [], + } + + signals = summary["signals"] + health = summary["health"] + point_in_time_load = summary["point_in_time_load"] + running_reqs = point_in_time_load.get("running_reqs") + waiting_reqs = point_in_time_load.get("waiting_reqs") + + if health["health_ok"] and not health["health_generate_ok"]: + add_signal( + signals, + "/health is green but /health_generate failed. Suspect runtime or scheduler path, not just HTTP liveness.", + ) + if not health["health_ok"]: + add_signal( + signals, + "/health failed. Start with startup, crash, or global unhealthy paths.", + ) + if is_positive_number(waiting_reqs): + add_signal( + signals, + f"Point-in-time load shows queue buildup: waiting_reqs={waiting_reqs}.", + ) + if ( + point_in_time_load.get("token_usage") is not None + and point_in_time_load["token_usage"] >= 0.9 + ): + add_signal( + signals, + "Token usage is near saturation. KV or token-capacity pressure may explain latency.", + ) + if ( + ttft_avg is not None + and queue_avg is not None + and ttft_avg > 2.0 + and queue_avg < 0.2 + ): + add_signal( + signals, + f"Average TTFT is high ({fmt_float(ttft_avg)}s) while average queue time is low ({fmt_float(queue_avg)}s). 
This looks more like prefill or request-path work than queue pressure.", + ) + prefill_forward = per_stage_avg.get("prefill_forward") + request_process = per_stage_avg.get("request_process") + if ( + prefill_forward is not None + and request_process is not None + and prefill_forward > max(0.5, request_process * 10) + ): + add_signal( + signals, + f"Prefill forward dominates quick stage timing: prefill_forward~{fmt_float(prefill_forward)}s vs request_process~{fmt_float(request_process)}s.", + ) + if running_reqs == 0 and waiting_reqs == 0: + add_signal( + signals, + "Bundle snapshot was captured while the server was effectively idle. Reproduce under live traffic or replayed workload if the problem is intermittent.", + ) + + return summary + + +def render_bundle_text(summary: Dict[str, Any]) -> str: + health = summary["health"] + model = summary["model"] + topology = summary["topology"] + capacity = summary["capacity"] + load = summary["point_in_time_load"] + metrics = summary["metrics"] + stage_avgs = metrics["stage_avg_seconds_max_tp_rank"] + + lines = [ + f"Bundle: {summary['bundle_dir']}", + f"Base URL: {summary.get('base_url') or 'n/a'}", + f"Collected At: {summary.get('collected_at') or 'n/a'}", + "", + f"Health: /health={'ok' if health['health_ok'] else 'failed'} /health_generate={'ok' if health['health_generate_ok'] else 'failed'}", + f"Model: {model.get('model_path') or 'n/a'} weight_version={model.get('weight_version') or 'n/a'} type={model.get('model_type') or 'n/a'}", + "Topology: " + f"tp={topology.get('tp_size')} dp={topology.get('dp_size')} pp={topology.get('pp_size')} ep={topology.get('ep_size')} " + f"disagg={topology.get('disaggregation_mode')} trace={topology.get('enable_trace')} metrics={topology.get('enable_metrics')}", + "Capacity: " + f"max_total_tokens={capacity.get('max_total_num_tokens')} " + f"max_running_reqs={capacity.get('effective_max_running_requests_per_dp')} " + f"weight_gb={fmt_float(capacity.get('weight_gb'))} " + 
f"kv_cache_gb={fmt_float(capacity.get('kv_cache_gb'))} " + f"graph_gb={fmt_float(capacity.get('graph_gb'))}", + "Point-in-time load: " + f"running={load.get('running_reqs')} waiting={load.get('waiting_reqs')} total={load.get('total_reqs')} " + f"token_usage={fmt_float(load.get('token_usage'))} throughput={fmt_float(load.get('avg_throughput'))} " + f"cache_hit_rate={fmt_float(load.get('cache_hit_rate'))}", + "Metrics: " + f"requests={fmt_float(metrics.get('request_count'), 0)} " + f"prompt_tokens={fmt_float(metrics.get('prompt_tokens_total'), 0)} " + f"generation_tokens={fmt_float(metrics.get('generation_tokens_total'), 0)} " + f"avg_ttft_s={fmt_float(metrics.get('avg_ttft_seconds'))} " + f"avg_e2e_s={fmt_float(metrics.get('avg_e2e_seconds'))} " + f"avg_queue_s={fmt_float(metrics.get('avg_queue_time_seconds'))}", + ] + + if stage_avgs: + stage_parts = [ + f"{name}={fmt_float(value)}s" for name, value in sorted(stage_avgs.items()) + ] + lines.append("Stage Averages (max across TP ranks): " + ", ".join(stage_parts)) + + queues = load.get("queues") or {} + if queues: + lines.append( + "Queues: " + + ", ".join(f"{key}={value}" for key, value in sorted(queues.items())) + ) + + disagg = load.get("disaggregation") or {} + if disagg: + lines.append( + "Disaggregation: " + + ", ".join(f"{key}={value}" for key, value in sorted(disagg.items())) + ) + + lines.append("") + lines.append("What stands out:") + if summary["signals"]: + lines.extend(f"- {signal}" for signal in summary["signals"]) + else: + lines.append("- No strong signal from this bundle.") + + return "\n".join(lines) + "\n" + + +def get_field(obj: Any, name: str, default: Any = None) -> Any: + if obj is None: + return default + if isinstance(obj, dict): + return obj.get(name, default) + return getattr(obj, name, default) + + +def iter_dump_files( + input_file: Optional[str], input_folder: Optional[str] +) -> Sequence[Path]: + if input_file: + return [Path(input_file)] + if input_folder: + return [Path(p) for p in 
sorted(glob.glob(f"{input_folder}/*.pkl"))] + raise SystemExit("Either --input-file or --input-folder must be provided.") + + +def load_dump_payload(path: Path) -> dict[str, Any]: + with path.open("rb") as fh: + payload = pickle.load(fh) + if isinstance(payload, dict): + return payload + return {"requests": payload} + + +def pick_text_preview(req: Any) -> str: + candidates = [ + get_field(req, "origin_input_text"), + get_field(req, "text"), + get_field(req, "prompt"), + ] + for value in candidates: + if isinstance(value, str) and value: + return value + if isinstance(value, list) and value: + first = value[0] + if isinstance(first, str) and first: + return first + return "" + + +def format_timestamp(ts: Any) -> str: + if not isinstance(ts, (int, float)): + return "n/a" + return datetime.fromtimestamp(ts).strftime("%Y-%m-%d %H:%M:%S") + + +def summarize_request( + record: tuple[Any, dict[str, Any], Any, Any], idx: int, preview_chars: int +) -> list[str]: + req, output, start_time, end_time = record + preview = pick_text_preview(req).replace("\n", " ").strip() + if len(preview) > preview_chars: + preview = preview[: preview_chars - 3] + "..." 
+ + output_dict = output if isinstance(output, dict) else {} + meta_info = get_field(output_dict, "meta_info", {}) or {} + rid = get_field(req, "rid") or get_field(meta_info, "id") + stream = bool(get_field(req, "stream", False)) + prompt_tokens = get_field(meta_info, "prompt_tokens") + completion_tokens = get_field(meta_info, "completion_tokens") + duration = ( + end_time - start_time + if isinstance(start_time, (int, float)) and isinstance(end_time, (int, float)) + else None + ) + + elapsed_str = f"{duration:.3f}" if duration is not None else "n/a" + lines = [ + f"[{idx}] rid={rid or 'n/a'} stream={stream} " + f"prompt_tokens={prompt_tokens if prompt_tokens is not None else 'n/a'} " + f"completion_tokens={completion_tokens if completion_tokens is not None else 'n/a'} " + f"start={format_timestamp(start_time)} elapsed_s={elapsed_str}" + ] + if preview: + lines.append(f" text={preview}") + return lines + + +def summarize_dump_file(path: Path, max_requests: int, preview_chars: int) -> str: + payload = load_dump_payload(path) + requests = payload.get("requests") or [] + server_args = payload.get("server_args") + launch_command = payload.get("launch_command") + + model_path = get_field(server_args, "model_path") + tp_size = get_field(server_args, "tp_size") + dp_size = get_field(server_args, "dp_size") + pp_size = get_field(server_args, "pp_size") + host = get_field(server_args, "host") + port = get_field(server_args, "port") + + timestamps = [ + record[2] + for record in requests + if isinstance(record, tuple) + and len(record) >= 4 + and isinstance(record[2], (int, float)) + ] + time_span = ( + max(timestamps) - min(timestamps) + if len(timestamps) >= 2 + else 0.0 if len(timestamps) == 1 else None + ) + + lines = [ + f"File: {path}", + "Dump Type: request_or_crash_dump", + f"Requests: {len(requests)}", + f"Model: {model_path or 'n/a'}", + f"Topology: tp={tp_size if tp_size is not None else 'n/a'} " + f"dp={dp_size if dp_size is not None else 'n/a'} " + f"pp={pp_size 
if pp_size is not None else 'n/a'}", + f"Endpoint: {host or 'n/a'}:{port if port is not None else 'n/a'}", + ( + f"Time span seconds: {time_span:.3f}" + if time_span is not None + else "Time span seconds: n/a" + ), + ] + if launch_command: + lines.append(f"Launch command: {launch_command}") + + for idx, record in enumerate(requests[:max_requests]): + if not isinstance(record, tuple) or len(record) < 4: + lines.append(f"[{idx}] Unsupported record shape: {type(record)!r}") + continue + lines.extend(summarize_request(record, idx, preview_chars)) + + if len(requests) > max_requests: + lines.append(f"... truncated {len(requests) - max_requests} more requests") + return "\n".join(lines) + + +def main() -> int: + parser = argparse.ArgumentParser( + description="Collect or inspect serving bundles and dumps for SGLang debug." + ) + subparsers = parser.add_subparsers(dest="command", required=True) + + collect_parser = subparsers.add_parser( + "collect-bundle", help="Collect a read-only live bundle from a running server" + ) + collect_parser.add_argument("--base-url", required=True) + collect_parser.add_argument( + "--token", + default=os.environ.get("SGLANG_BEARER_TOKEN"), + help="Bearer token for protected endpoints. 
Defaults to $SGLANG_BEARER_TOKEN.", + ) + collect_parser.add_argument("--outdir", default=None) + collect_parser.add_argument("--timeout", type=float, default=10.0) + + bundle_parser = subparsers.add_parser( + "summarize-bundle", help="Summarize a bundle directory" + ) + bundle_parser.add_argument("bundle_dir") + bundle_parser.add_argument("--out", default=None) + bundle_parser.add_argument("--json-out", default=None) + bundle_parser.add_argument("--stdout-json", action="store_true") + + dump_parser = subparsers.add_parser( + "summarize-dump", help="Summarize a trusted request dump or crash dump" + ) + dump_parser.add_argument("--input-file", default=None) + dump_parser.add_argument("--input-folder", default=None) + dump_parser.add_argument("--max-requests", type=int, default=20) + dump_parser.add_argument("--preview-chars", type=int, default=160) + + args = parser.parse_args() + + if args.command == "collect-bundle": + bundle_dir = collect_bundle( + args.base_url, args.token, args.outdir, args.timeout + ) + print(bundle_dir) + return 0 + + if args.command == "summarize-bundle": + bundle_dir = Path(args.bundle_dir).resolve() + if not bundle_dir.is_dir(): + raise SystemExit( + f"bundle_dir does not exist or is not a directory: {bundle_dir}" + ) + summary = build_bundle_summary(bundle_dir) + out_text = render_bundle_text(summary) + text_path = Path(args.out) if args.out else bundle_dir / "SUMMARY_REPORT.txt" + json_path = ( + Path(args.json_out) if args.json_out else bundle_dir / "SUMMARY_REPORT.json" + ) + text_path.write_text(out_text, encoding="utf-8") + json_path.write_text( + json.dumps(summary, indent=2, ensure_ascii=False) + "\n", + encoding="utf-8", + ) + if args.stdout_json: + print(json.dumps(summary, indent=2, ensure_ascii=False)) + else: + print(out_text, end="") + return 0 + + files = iter_dump_files(args.input_file, args.input_folder) + if not files: + raise SystemExit("No .pkl files matched the provided input.") + for idx, path in enumerate(files): + 
if idx: + print() + print( + summarize_dump_file( + path=path, + max_requests=args.max_requests, + preview_chars=args.preview_chars, + ) + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.claude/skills/sglang-prod-incident-triage/scripts/replay_trusted_request_dump.py b/.claude/skills/sglang-prod-incident-triage/scripts/replay_trusted_request_dump.py new file mode 100755 index 000000000000..768b3a33a2f9 --- /dev/null +++ b/.claude/skills/sglang-prod-incident-triage/scripts/replay_trusted_request_dump.py @@ -0,0 +1,219 @@ +#!/usr/bin/env python3 +"""Replay a trusted SGLang request dump directly over HTTP. + +Use this only for locally captured or otherwise trusted dump files. +It uses plain pickle loading to bypass SafeUnpickler restrictions that may block +the stock replay helper on newer SGLang builds. +""" + +from __future__ import annotations + +import argparse +import glob +import json +import pickle +import time +from concurrent.futures import ThreadPoolExecutor +from dataclasses import asdict, is_dataclass +from datetime import datetime +from pathlib import Path +from typing import Any, Sequence + +import requests + +Record = tuple[object, dict[str, Any], float, float] + + +def normalize_mm_data_item(item: Any) -> Any: + if isinstance(item, dict) and "url" in item: + return item["url"] + return item + + +def normalize_mm_data(data: Any) -> Any: + if data is None: + return None + if isinstance(data, list): + return [ + ( + [normalize_mm_data_item(item) for item in sublist] + if isinstance(sublist, list) + else normalize_mm_data_item(sublist) + ) + for sublist in data + ] + return normalize_mm_data_item(data) + + +def normalize_request_data(json_data: dict[str, Any]) -> dict[str, Any]: + for field in ["image_data", "video_data", "audio_data"]: + if field in json_data and json_data[field] is not None: + json_data[field] = normalize_mm_data(json_data[field]) + return json_data + + +def to_plain_dict(obj: Any) -> dict[str, Any]: 
+ if obj is None: + return {} + if isinstance(obj, dict): + return dict(obj) + if is_dataclass(obj): + return asdict(obj) + + model_dump = getattr(obj, "model_dump", None) + if callable(model_dump): + dumped = model_dump() + if isinstance(dumped, dict): + return dumped + + dict_method = getattr(obj, "dict", None) + if callable(dict_method): + dumped = dict_method() + if isinstance(dumped, dict): + return dumped + + obj_dict = getattr(obj, "__dict__", None) + if isinstance(obj_dict, dict): + return { + key: value for key, value in obj_dict.items() if not key.startswith("_") + } + + raise TypeError(f"Unsupported request object type: {type(obj)!r}") + + +def request_to_json_data(req: Any) -> dict[str, Any]: + json_data = normalize_request_data(to_plain_dict(req)) + sampling_params = json_data.get("sampling_params") + if sampling_params is not None and not isinstance(sampling_params, dict): + json_data["sampling_params"] = to_plain_dict(sampling_params) + return json_data + + +def load_records(path: Path) -> list[Record]: + with path.open("rb") as fh: + payload = pickle.load(fh) + if isinstance(payload, dict) and "requests" in payload: + return payload["requests"] + return payload + + +def iter_files(args: argparse.Namespace) -> Sequence[Path]: + if args.input_file: + return [Path(args.input_file)] + if args.input_folder: + return [ + Path(p) + for p in sorted(glob.glob(f"{args.input_folder}/*.pkl"))[: args.file_number] + ] + raise SystemExit("Either --input-file or --input-folder must be provided.") + + +def run_one_request( + record: Record, + args: argparse.Namespace, + replay_init_time: float, + base_time: float, + idx: int, +) -> None: + req, output, start_time, end_time = record + relative_start = start_time - base_time + delay = max(0.0, (relative_start - (time.time() - replay_init_time)) / args.speed) + if delay: + time.sleep(delay) + + json_data = request_to_json_data(req) + if args.ignore_eos: + json_data.setdefault("sampling_params", {})["ignore_eos"] = True 
+ completion_tokens = output.get("meta_info", {}).get("completion_tokens") + if completion_tokens: + json_data["sampling_params"]["max_new_tokens"] = completion_tokens + + t0 = time.time() + response = requests.post( + f"http://{args.host}:{args.port}/generate", + json=json_data, + timeout=args.timeout, + stream=bool(json_data.get("stream")), + ) + elapsed = time.time() - t0 + + if json_data.get("stream"): + last = None + for chunk in response.iter_lines(decode_unicode=False): + decoded = chunk.decode("utf-8") + if decoded and decoded.startswith("data:"): + if decoded == "data: [DONE]": + break + last = json.loads(decoded[5:].strip()) + result = last or {} + else: + result = response.json() + + meta = result.get("meta_info", {}) + print( + json.dumps( + { + "idx": idx, + "status_code": response.status_code, + "elapsed_seconds": round(elapsed, 3), + "prompt_tokens": meta.get("prompt_tokens"), + "completion_tokens": meta.get("completion_tokens"), + "rid": meta.get("id"), + }, + ensure_ascii=False, + ) + ) + + +def main() -> int: + parser = argparse.ArgumentParser( + description="Replay a trusted SGLang request dump or crash dump directly over HTTP." 
+ ) + parser.add_argument("--host", default="127.0.0.1") + parser.add_argument("--port", type=int, default=30000) + parser.add_argument("--input-folder", default=None) + parser.add_argument("--input-file", default=None) + parser.add_argument("--file-number", type=int, default=1) + parser.add_argument("--req-number", type=int, default=1_000_000) + parser.add_argument("--req-start", type=int, default=0) + parser.add_argument("--parallel", type=int, default=1) + parser.add_argument("--ignore-eos", action="store_true") + parser.add_argument("--speed", type=float, default=1.0) + parser.add_argument("--timeout", type=float, default=120.0) + args = parser.parse_args() + + files = iter_files(args) + print(f"Replay files: {[str(p) for p in files]}") + + records: list[Record] = [] + for path in files: + records.extend(load_records(path)) + + if not records: + print("No requests found.") + return 0 + + records.sort(key=lambda x: x[-2]) + records = records[args.req_start : args.req_start + args.req_number] + print(f"Replay requests: {len(records)}") + base_time = records[0][-2] + print( + "Base time: " + datetime.fromtimestamp(base_time).strftime("%Y-%m-%d %H:%M:%S") + ) + + replay_init_time = time.time() + with ThreadPoolExecutor(max_workers=args.parallel) as executor: + futures = [] + for idx, record in enumerate(records): + futures.append( + executor.submit( + run_one_request, record, args, replay_init_time, base_time, idx + ) + ) + for future in futures: + future.result() + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.claude/skills/sglang-sota-performance/SKILL.md b/.claude/skills/sglang-sota-performance/SKILL.md new file mode 100644 index 000000000000..1432ed3f3d40 --- /dev/null +++ b/.claude/skills/sglang-sota-performance/SKILL.md @@ -0,0 +1,254 @@ +--- +name: sglang-sota-performance +description: End-to-end SGLang SOTA performance workflow. 
Use when a user names an LLM model and wants SGLang to match or beat the best observed vLLM and TensorRT-LLM serving performance by searching each framework's best deployment command, benchmarking them fairly, profiling SGLang if it is slower, identifying kernel/overlap/fusion bottlenecks, patching SGLang code, and revalidating with real model runs. +--- + +# SGLang SOTA Performance + +## Overview + +Use this skill as the top-level optimization loop for one model at a time. +It composes two lower-level skills: + +- `llm-serving-auto-benchmark`: search and compare best deployment commands across SGLang, vLLM, and TensorRT-LLM. +- `llm-torch-profiler-analysis`: capture or analyze torch-profiler traces and produce kernel, overlap-opportunity, and fuse-pattern tables. + +This skill's goal is not "run one benchmark." Its goal is a reproducible +SGLang improvement loop: tune every framework fairly, prove whether SGLang is +behind, explain the gap with profiler evidence, patch SGLang, and re-run the +same model workload until the result is SOTA for the target environment. + +Treat "SOTA" as "best observed, reproducible performance under the recorded +model, workload, hardware, framework commits, precision, and SLA." Do not claim +global SOTA without enough external evidence. + +## Required Companion Reads + +Before a real run, read only the needed sections from: + +- `../llm-serving-auto-benchmark/SKILL.md` +- `../llm-torch-profiler-analysis/SKILL.md` + +If the run uses a remote GPU host, also read the matching host skill such as +`h100`, `b200`, `rtx5090`, or another operator-side skill that gives SSH, +container, workspace, and artifact-path conventions. 
+ +## Required Inputs + +Collect or infer these before starting a long search: + +- model id or local checkpoint path, tokenizer path, precision, quantization, + trust-remote-code policy, and max context length +- target GPU type/count, single-node or multi-node allowance, and VRAM budget +- workload distribution: dataset, input/output lengths, request rate or + concurrency mode, sampling settings, endpoint style, and SLA target +- frameworks to compare: default to SGLang, vLLM, and TensorRT-LLM when all are + available in the target environment +- artifact root for commands, logs, benchmark JSONL, profiles, analysis reports, + patches, and final comparison tables + +If the user only provides a model, choose a reasonable first workload and state +it explicitly. Prefer the closest cookbook config from +`llm-serving-auto-benchmark/configs/cookbook-llm/` when available. + +## Artifact Layout + +Use one run directory per model and date, for example: + +```text +runs/YYYYMMDD__sota_loop/ + manifest.txt + help/ + benchmark/ + profiles/ + analysis/ + patches/ + final_report.md +``` + +Record exact framework versions, git commits, container names/images, CUDA/NCCL +versions, GPU ids, launch commands, benchmark commands, and environment knobs. +Never write Hugging Face tokens or other secrets into artifacts. + +## Workflow + +### 1. Preflight The Model And Environment + +Verify the model can be loaded by each framework before launching a sweep. +Capture each framework's current `--help` output and version. Remove candidate +flags that are not accepted by that exact environment. + +For TensorRT-LLM, keep the server backend within the scope of +`llm-serving-auto-benchmark`: `trtllm-serve serve --backend pytorch`. +If that backend is unavailable, mark TensorRT-LLM unsupported for the run +instead of silently switching to a different serving stack. + +### 2. 
Search Each Framework's Best Command + +Use `llm-serving-auto-benchmark` as the source of truth for benchmark fairness, +candidate generation, result schema, and comparison tables. + +Run a bounded search for every available framework. Do not compare SGLang's +tuned command against competitor defaults. Each framework must get a real chance +to find its best deployment command under the same: + +- model weights and tokenizer +- precision and quantization policy +- GPU type/count and memory budget +- dataset and request distribution +- endpoint path and sampling settings +- SLA target and measurement window + +Keep failed candidates and their failure reasons. The fastest SLA-failing +candidate is not the winner. + +### 3. Compare The Best Commands + +Normalize the benchmark output with +`llm-serving-auto-benchmark/scripts/compare_benchmark_results.py`. + +The comparison must include: + +- best server command per framework +- benchmark command and workload settings +- SLA pass/fail status +- throughput and goodput +- TTFT, ITL, end-to-end latency, and p95/p99 where available +- peak memory or allocator evidence when available +- failed candidate summary + +If SGLang is within benchmark noise of the best framework, rerun enough samples +to decide whether the difference is real. Use a default regression threshold of +3-5% unless the user specifies a tighter target. + +### 4. Profile SGLang When It Is Behind + +If SGLang is meaningfully slower, fails SLA while another framework passes, or +uses much more memory for the same workload, run profiler triage before patching. 
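The "is SGLang behind?" decision described in steps 2 to 4 can be sketched as a small helper. This is only an illustration under stated assumptions: `FrameworkResult`, `relative_gap`, and `needs_profiling` are hypothetical names, not part of `llm-serving-auto-benchmark` or its result schema, and the 5% default is the upper end of the 3-5% regression threshold mentioned above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FrameworkResult:
    """Hypothetical summary of one framework's best command (not the skill's real schema)."""

    name: str
    sla_pass: bool
    throughput_samples: list  # req/s, one entry per repeated benchmark run


def relative_gap(sglang: FrameworkResult, other: FrameworkResult) -> float:
    """Fraction by which SGLang's mean throughput trails `other` (negative if SGLang leads)."""
    ours = mean(sglang.throughput_samples)
    theirs = mean(other.throughput_samples)
    return (theirs - ours) / theirs


def needs_profiling(sglang: FrameworkResult, others: list, threshold: float = 0.05) -> bool:
    """Return True when the gap is large enough to justify profiler triage."""
    # SLA-failing competitors are never the winner, so they are ignored here.
    passing = [r for r in others if r.sla_pass]
    if not passing:
        return False
    # SGLang failing SLA while another framework passes always warrants triage.
    if not sglang.sla_pass:
        return True
    best = max(passing, key=lambda r: mean(r.throughput_samples))
    return relative_gap(sglang, best) > threshold


# Throughput figures are illustrative single samples; sla_pass=True is assumed.
sglang = FrameworkResult("sglang", True, [18.47])
vllm = FrameworkResult("vllm", True, [18.78])
trtllm = FrameworkResult("tensorrt-llm", True, [15.48])
behind = needs_profiling(sglang, [vllm, trtllm])
```

With a single sample per framework the gap is only a point estimate; rerunning and appending to `throughput_samples` is what makes the comparison against the threshold meaningful.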
+ +Use `llm-torch-profiler-analysis` against the SGLang best command first: + +- capture live SGLang profiles with `--profile-workload both`; the profiler + skill labels `prefill/` and `decode/` by workload directory for this mode +- keep separate `extend/prefill` and `decode` traces; do not use one mixed + request as the default profiler workload +- set profiler lengths from the slow benchmark scenario instead of the profiler + defaults: prefill uses the slow input length with output `1`, and decode uses + input `1` with the slow output length +- for mixed benchmark datasets, choose the slowest representative bucket + already reported by the benchmark, usually p50 or p95 input/output lengths, + and record that bucket beside the profiler artifact path +- run mapping+formal triage if single-trace output cannot map kernels to useful + Python source locations +- save the kernel, overlap-opportunity, and fuse-pattern tables in artifacts + +Profile the winning competitor too when the SGLang table alone cannot explain +why the other framework is faster. Compare stage by stage, not just total QPS. + +### 5. Turn Tables Into A Root Cause + +Use the profiler tables to identify the narrowest plausible bottleneck. + +Typical signals: + +- kernel table: attention, MoE routing, quantization, sampling, GEMM shape, + cache update, communication, or framework overhead dominates GPU time +- overlap-opportunity table: CPU scheduling, host-to-device work, collectives, + or decode bookkeeping leaves GPU idle time +- fuse-pattern table: a known fusion or overlap path should have applied but did + not, or competitor traces show a fused path SGLang lacks +- source map: hot kernels map to a concrete SGLang Python/CUDA/Triton path that + can be patched + +Do not patch from vibes. State the table row, stage, source location, and +benchmark symptom that justify the code change. + +### 6. Patch SGLang Conservatively + +Patch SGLang only after the benchmark gap and profiler evidence agree. 
+ +Good patch candidates: + +- enable or select a better existing kernel for the model/hardware shape +- fix a missed fast path, fusion, overlap, or batching condition +- reduce unnecessary synchronization, CPU scheduling overhead, or tensor copies +- improve model-specific routing, quantization, attention, or cache handling +- add a guarded heuristic that is backed by benchmark and profiler evidence + +Avoid changes that merely make the benchmark easier: + +- weakening correctness, output quality, safety checks, or tokenizer handling +- changing only the workload or SLA after seeing results +- disabling features for SGLang but not competitors +- claiming SOTA from synthetic data when the user asked for production traffic + +Keep patches minimal and local. Add focused tests when behavior changes, and add +microbenchmarks or profiler evidence when performance is the only intended +change. + +### 7. Revalidate The Patch + +After patching, rerun: + +- the relevant unit or integration tests +- the SGLang candidate that exposed the gap +- the same cross-framework benchmark comparison +- the profiler triage if the original gap was diagnosed from profiler tables + +If the patch changes SGLang's available knobs, re-search SGLang's best command. +If competitor versions or commands changed during the work, rerun their best +commands too. Preserve before/after artifacts. + +## H100 Validation Snapshot + +On 2026-05-01, this workflow was smoke-validated on `h100_sglang` with two +real model runs and two competitor checks per run. Artifacts were saved +under +`/data/bbuf/validate/sglang_sota_performance_skill/runs/20260501_two_model_validation`. 
+ +| Model | GPUs | Workload | SGLang result | vLLM check | TensorRT-LLM check | +| --- | --- | --- | --- | --- | --- | +| `Qwen/Qwen2.5-7B-Instruct` | 2x H100, TP=2 | random, input 512/output 64, 24 prompts, 10 warmup requests | 52.09 req/s, mean TTFT 144.85 ms, mean ITL 4.91 ms | 51.06 req/s, mean TTFT 159.19 ms, mean ITL 4.85 ms | 49.71 req/s, mean TTFT 177.54 ms, mean ITL 4.77 ms | +| `Qwen/Qwen2.5-32B-Instruct` | 4x H100, TP=4 | random, input 512/output 64, 16 prompts, 10 warmup requests | 18.47 req/s, mean TTFT 247.06 ms, mean ITL 9.66 ms | 18.78 req/s, mean TTFT 218.68 ms, mean ITL 9.98 ms | 15.48 req/s, mean TTFT 445.62 ms, mean ITL 9.27 ms | + +Use this only as a workflow health check, not as a universal performance +claim. The TensorRT-LLM checks used `trtllm-serve serve --backend pytorch` and +the same OpenAI-compatible random workload. + +Additional 2-card validation on 2026-05-01 exercised the full handoff from +bounded cross-framework search into SGLang stage-separated profiling. The +benchmark workload was random input `512`, output `64`, 8 prompts, and the +profiler used the same slow-workload lengths: prefill `512->1` and decode +`1->64`, with warmup 10 and capture 5. 
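The stage-separated lengths used here (prefill `512->1`, decode `1->64`) follow the rule from step 4: prefill keeps the slow input length with output `1`, and decode uses input `1` with the slow output length. A minimal sketch of that mapping, assuming hypothetical names `StageLengths` and `profiler_lengths` that do not exist in the profiler skill:

```python
from typing import Dict, NamedTuple


class StageLengths(NamedTuple):
    """Input/output token lengths for one profiler stage (hypothetical helper type)."""

    input_len: int
    output_len: int


def profiler_lengths(slow_input_len: int, slow_output_len: int) -> Dict[str, StageLengths]:
    """Derive separate prefill and decode profiler workloads from the slow benchmark bucket."""
    return {
        # Prefill: full slow input length, a single generated token.
        "prefill": StageLengths(slow_input_len, 1),
        # Decode: a single input token, full slow output length.
        "decode": StageLengths(1, slow_output_len),
    }


lengths = profiler_lengths(512, 64)
```

How these values are passed to the profiler depends on the flags of `llm-torch-profiler-analysis`, which this sketch does not assume.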
+ +| Model | GPUs | Best SGLang | Best vLLM | Profiler result | Artifact root | +| --- | --- | --- | --- | --- | --- | +| `Qwen/Qwen3-8B` | 2x H100, TP=2 | `sglang_mem086`, 21.64 req/s | `vllm_mem080`, 22.88 req/s | kernel, overlap, and fuse tables rendered with separate `extend/prefill` and `decode` sections | `/data/bbuf/validate/core_skill_validation_20260501/qwen3_8b/sota` | +| `mistralai/Mistral-7B-Instruct-v0.3` | 2x H100, TP=2 | `sglang_mem080`, 24.09 req/s | `vllm_mem090`, 24.76 req/s | kernel, overlap, and fuse tables rendered with separate `extend/prefill` and `decode` sections | `/data/bbuf/validate/core_skill_validation_20260501/mistral_7b_instruct_v03/sota` | + +## Stop Conditions + +Stop with a clear report when any of these is true: + +- SGLang is the best SLA-passing framework for the target workload +- SGLang is within noise of the best framework and the remaining gap is not + statistically stable +- SGLang remains behind but the root cause is external to SGLang, such as missing + model weights, unavailable backend dependencies, or an unsupported hardware + feature +- a patch improves SGLang but still does not reach SOTA; report the next table + row or source path to investigate + +## Final Report Contract + +Return a compact report with: + +- model, hardware, framework versions, workload, and artifact root +- best deployment command per framework +- benchmark comparison table before patch and after patch +- SGLang gap analysis, including exact profiler table rows and source paths +- patch summary with changed files and correctness tests +- real-model validation result and whether SGLang reached target-environment SOTA + +If no code patch was needed, say why and include the benchmark evidence. +If a patch was attempted but not enough, be explicit about the remaining gap. 
diff --git a/.claude/skills/write-sglang-test/SKILL.md b/.claude/skills/write-sglang-test/SKILL.md new file mode 100644 index 000000000000..8bd49a8b60bd --- /dev/null +++ b/.claude/skills/write-sglang-test/SKILL.md @@ -0,0 +1,448 @@ +--- +name: write-sglang-test +description: Guide for writing SGLang CI/UT tests. Covers CustomTestCase, CI registration, server fixtures, model selection, mock testing, and test placement. Always read test/README.md for the full CI layout, how to run tests, and extra tips. Use when creating new tests, adding CI test cases, writing unit tests, or when the user asks to add tests for SGLang features. +--- + +# Writing SGLang CI / UT Tests + +This skill covers **how to write and register tests**. For CI pipeline internals (stage ordering, fast-fail, gating, partitioning, debugging CI failures), see the [CI workflow guide](../ci-workflow-guide/SKILL.md). + +## Core Rules + +1. **Always use `CustomTestCase`** — never raw `unittest.TestCase`. It ensures `tearDownClass` runs even when `setUpClass` fails, preventing resource leaks in CI. +2. **`tearDownClass` must be defensive** — use `hasattr`/null checks before accessing resources (e.g. `cls.process`) that `setUpClass` may not have finished allocating. +3. **Place tests in `test/registered//`** — except JIT kernel tests and benchmarks, which live in `python/sglang/jit_kernel/tests/` and `python/sglang/jit_kernel/benchmark/` (nested subfolders are allowed) +4. **Reuse server fixtures** — inherit from `DefaultServerBase` or write `setUpClass`/`tearDownClass` with `popen_launch_server` +5. **Prefer mock over real server** — when testing logic that doesn't need a server / engine launch (middleware, request routing, config validation, argument parsing), use `unittest.mock.patch` / `MagicMock` and place tests in `test/registered/unit/`. Only launch a real server when the test genuinely needs inference results or server lifecycle behavior. 
+ +JIT kernel exception: +- If the task is adding or updating code under `python/sglang/jit_kernel/`, prefer the `add-jit-kernel` skill first. +- JIT kernel correctness tests use `python/sglang/jit_kernel/tests/**/test_*.py`. +- JIT kernel benchmarks use `python/sglang/jit_kernel/benchmark/**/bench_*.py`. +- Those files are still executed by `test/run_suite.py`, but through dedicated kernel suites rather than `test/registered/`. + +--- + +## Model & Backend Selection + +| Scenario | Model | CI Registration | Suite | +|----------|-------|-----------------|-------| +| **Unit tests** (no server / engine launch) | None | `register_cpu_ci` (prefer) or `register_cuda_ci` | `stage-a-test-cpu` or `stage-b-test-1-gpu-small` | +| **Common / backend-independent** (middleware, abort, routing, config, arg parsing) | `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (1B) | `register_cuda_ci` only | `stage-b-test-1-gpu-small` | +| **Model-agnostic functionality** (sampling, session, OpenAI API features) | `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (1B) | `register_cuda_ci` (+ AMD if relevant) | `stage-b-test-1-gpu-small` | +| **General performance** (single node, no spec/DP/parallelism) | `DEFAULT_MODEL_NAME_FOR_TEST` (8B) | `register_cuda_ci` | `stage-b-test-1-gpu-large` | +| **Bigger features** (spec, DP, TP, disaggregation) | Case by case | Case by case | See suite table below | + +**Key principle for E2E tests**: Do NOT add `register_amd_ci` unless the test specifically exercises AMD/ROCm code paths. Common E2E tests just need any GPU to run — duplicating across backends wastes CI time with no extra coverage. 
+ +### All model constants + +Defined in `python/sglang/test/test_utils.py`: + +| Constant | Model | When to use | +|----------|-------|-------------| +| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` | Llama-3.2-1B-Instruct | Common features, model-agnostic tests | +| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST_BASE` | Llama-3.2-1B | Base (non-instruct) model tests | +| `DEFAULT_MODEL_NAME_FOR_TEST` | Llama-3.1-8B-Instruct | General performance (single node) | +| `DEFAULT_MOE_MODEL_NAME_FOR_TEST` | Mixtral-8x7B-Instruct | MoE-specific tests | +| `DEFAULT_SMALL_EMBEDDING_MODEL_NAME_FOR_TEST` | — | Embedding tests | +| `DEFAULT_SMALL_VLM_MODEL_NAME_FOR_TEST` | — | Vision-language tests | + +### Naming Conventions + +- **Suite**: `stage-{a,b,c}-test-{gpu_count}-gpu-{hardware}` (e.g., `stage-b-test-1-gpu-small`) +- **CI runner**: `{gpu_count}-gpu-{hardware}` (e.g., `1-gpu-5090`, `4-gpu-h100`, `8-gpu-h200`) + +### All CI Suites + +#### Per-commit (CUDA) + +| Suite | Runner (label) | Description | +|-------|----------------|-------------| +| `stage-a-test-1-gpu-small` | `1-gpu-5090` | Quick checks on a small NVIDIA GPU before heavier stages | +| `stage-a-test-cpu` | `ubuntu-latest` | CPU-only unit tests | +| `stage-b-test-1-gpu-small` | `1-gpu-5090` | Core engine tests that fit a 5090-class card | +| `stage-b-test-1-gpu-large` | `1-gpu-h100` | Tests that need H100-class memory or kernels (e.g. 
FA3) | +| `stage-b-test-2-gpu-large` | `2-gpu-h100` | Two-GPU correctness and parallelism (TP/PP) on H100 | +| `stage-b-test-4-gpu-b200` | `4-gpu-b200` | Early Blackwell coverage (SM100+ paths) on four GPUs | +| `stage-b-kernel-unit-1-gpu-large` | `1-gpu-h100` | JIT kernel correctness tests under `python/sglang/jit_kernel/tests/` | +| `stage-b-kernel-unit-1-gpu-b200` | `4-gpu-b200` | JIT kernel correctness tests for Blackwell / SM100-specific paths | +| `stage-b-kernel-unit-8-gpu-h200` | `8-gpu-h200` | Multi-GPU JIT kernel correctness tests under `python/sglang/jit_kernel/tests/` | +| `stage-b-kernel-benchmark-1-gpu-large` | `1-gpu-h100` | JIT kernel benchmark files under `python/sglang/jit_kernel/benchmark/` | +| `stage-c-test-4-gpu-h100` | `4-gpu-h100` | Large 4-GPU H100 integration and scaling tests | +| `stage-c-test-8-gpu-h200` | `8-gpu-h200` | Large 8-GPU H200 runs for big models and parallelism | +| `stage-c-test-8-gpu-h20` | `8-gpu-h20` | Large 8-GPU H20 runs for big models | +| `stage-c-test-deepep-4-gpu-h100` | `4-gpu-h100` | DeepEP expert-parallel and networking on four H100s | +| `stage-c-test-deepep-8-gpu-h200` | `8-gpu-h200` | DeepEP at 8-GPU H200 scale | +| `stage-c-test-8-gpu-b200` | `8-gpu-b200` | 8-GPU B200 suite (registered but not yet wired to a workflow) | +| `stage-c-test-4-gpu-b200` | `4-gpu-b200` | 4-GPU B200 suite for large models on Blackwell | +| `stage-c-test-4-gpu-b200-small` | `4-gpu-b200` | Smaller 4-GPU B200 suite split onto low-disk B200 runners | +| `stage-c-test-4-gpu-gb200` | `4-gpu-gb200` | 4-GPU GB200 suite for Grace Blackwell; registered in `run_suite.py`, but the PR workflow is currently disabled until a runner is provisioned | + +#### Per-commit (AMD) + +| Suite | Runner (label) | Description | +|-------|----------------|-------------| +| `stage-a-test-1-gpu-small-amd` | `linux-mi325-1gpu-sglang` | Quick checks on one MI325-class GPU | +| `stage-b-test-1-gpu-small-amd` | `linux-mi325-1gpu-sglang` | Core 1-GPU AMD tests (14 
partitions) | +| `stage-b-test-1-gpu-small-amd-nondeterministic` | `linux-mi325-1gpu-sglang` | Non-deterministic 1-GPU AMD tests | +| `stage-b-test-1-gpu-small-amd-mi35x` | `linux-mi35x-gpu-1` | 1-GPU tests on MI35x hardware | +| `stage-b-test-1-gpu-large-amd` | `linux-mi325-1gpu-sglang` | Large 1-GPU AMD tests (2 partitions) | +| `stage-b-test-2-gpu-large-amd` | `linux-mi325-2gpu-sglang` | 2-GPU ROCm correctness and parallel setups | +| `stage-b-test-large-8-gpu-35x-disaggregation-amd` | `linux-mi35x-gpu-8.fabric` | PD disaggregation and RDMA on 8×MI35x fabric | +| `stage-c-test-4-gpu-amd` | `linux-mi325-4gpu-sglang` | 4-GPU AMD integration (2 partitions) | +| `stage-c-test-large-8-gpu-amd` | `linux-mi325-8gpu-sglang` | 8-GPU MI325 scaling and integration | +| `stage-c-test-large-8-gpu-amd-mi35x` | `linux-mi35x-gpu-8` | 8-GPU MI35x scaling (2 partitions) | + + +### Per-commit (Ascend NPU) + +| Suite | Runner (label) | Description | +| --- | --- | --- | +| `per-commit-1-npu-a2` | `linux-aarch64-a2-1` | 1-NPU LLM CI machine | +| `per-commit-2-npu-a2` | `linux-aarch64-a2-2` | 2-NPU LLM CI machine | +| `per-commit-4-npu-a3` | `linux-aarch64-a3-4` | 4-NPU LLM CI machine | +| `per-commit-16-npu-a3` | `linux-aarch64-a3-16` | 16-NPU LLM CI machine | +| `multimodal-gen-test-1-npu-a3` | `linux-aarch64-a3-2` | 1-NPU multimodal CI machine | +| `multimodal-gen-test-2-npu-a3` | `linux-aarch64-a3-16` | 2-NPU multimodal CI machine | +| `multimodal-gen-test-8-npu-a3` | `linux-aarch64-a3-16` | 8-NPU multimodal CI machine | + +#### Nightly + +Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../test/run_suite.py). They run via `nightly-test-nvidia.yml`, `nightly-test-amd.yml`, and `nightly-test-npu.yml`, not `pr-test.yml`. 
Examples: + +- `nightly-1-gpu` (CUDA) +- `nightly-kernel-1-gpu` (CUDA, JIT kernel full grids) +- `nightly-kernel-8-gpu-h200` (CUDA, multi-GPU JIT kernel nightly) +- `nightly-8-gpu-h200` (CUDA) +- `nightly-eval-vlm-2-gpu` (CUDA) +- `nightly-amd` (AMD) +- `nightly-amd-8-gpu-mi35x` (AMD) +- `nightly-1-npu-a3` (NPU) +- `nightly-2-npu-a3` (NPU) +- `nightly-4-npu-a3` (NPU) +- `nightly-8-npu-a3` (NPU) +- `nightly-16-npu-a3` (NPU) + +> **Note**: Multimodal diffusion uses `python/sglang/multimodal_gen/test/run_suite.py`, not `test/run_suite.py`. + +### Choosing a Suite + +Use the lightest suite that meets your test's needs: + +- **No GPU required** → `stage-a-test-cpu` +- **Most small GPU tests** → `stage-b-test-1-gpu-small` (default choice) +- **Need H100 memory or Hopper features** → `stage-b-test-1-gpu-large` +- **JIT kernel correctness** → `stage-b-kernel-unit-1-gpu-large` +- **JIT kernel correctness for B200 / SM100 paths** → `stage-b-kernel-unit-1-gpu-b200` +- **JIT kernel benchmarks** → `stage-b-kernel-benchmark-1-gpu-large` +- **Multi-GPU** → only when the test actually needs multiple GPUs + +--- + +## Test File Templates + +### Unit Tests (no server / engine launch) + +See `test/registered/unit/README.md` for quick-start and rules. Unit tests live in `test/registered/unit/`, mirroring `python/sglang/srt/`: + +```python +"""Unit tests for srt/""" + +import unittest +from unittest.mock import MagicMock, patch + +from sglang.srt. import TargetClass +from sglang.test.ci.ci_register import register_cpu_ci +from sglang.test.test_utils import CustomTestCase + +register_cpu_ci(est_time=5, suite="stage-a-test-cpu") +# Prefer CPU. Only use register_cuda_ci when the test truly needs a GPU. + +class TestTargetClass(CustomTestCase): + def test_basic_behavior(self): + obj = TargetClass(...) 
+ self.assertEqual(obj.method(), expected) + + @patch("sglang.srt..some_dependency") + def test_with_mock(self, mock_dep): + mock_dep.return_value = MagicMock() + # test logic with dependency mocked + ... + + +if __name__ == "__main__": + unittest.main() +``` + +Use `unittest.mock.patch` / `MagicMock` to mock dependencies and isolate the logic under test. If the module transitively imports GPU-only packages (e.g. `sgl_kernel`), they can be stubbed so the test runs on CPU CI. Do not modify `sys.modules` at module level — use `patch.dict` (as a class decorator or with `start`/`stop`) to ensure cleanup and avoid cross-test pollution. See `test/registered/unit/README.md` for details and examples. + +**Quality bar** — test real logic (validation boundaries, state transitions, error paths, branching, etc.). Skip tests that just verify Python itself works (e.g., "does calling an abstract method raise `NotImplementedError`?", "does a dataclass store the field I assigned?"). Consolidate repetitive patterns into parameterized tests. No production code changes in test PRs. 
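The advice above about stubbing GPU-only imports with `patch.dict` instead of mutating `sys.modules` at module level can be sketched as follows. This is an illustration under assumptions: `hypothetical_gpu_only_pkg` and `fused_op` are made-up names standing in for a package such as `sgl_kernel`, and the sketch uses plain `unittest.TestCase` only so it runs without SGLang installed; a real SGLang test would inherit from `CustomTestCase` as the core rules require.

```python
import sys
import types
import unittest
from unittest.mock import MagicMock, patch

# Build a stand-in module object; the package name and attribute are hypothetical.
_stub = types.ModuleType("hypothetical_gpu_only_pkg")
_stub.fused_op = MagicMock(return_value=42)

_STUBS = {"hypothetical_gpu_only_pkg": _stub}


# As a class decorator, patch.dict wraps every test_* method: the stub is
# installed before each test and sys.modules is restored afterwards.
@patch.dict(sys.modules, _STUBS)
class TestWithStubbedGpuPackage(unittest.TestCase):
    def test_import_resolves_to_stub(self):
        # The import system finds the stub in sys.modules, so no GPU package is needed.
        import hypothetical_gpu_only_pkg

        self.assertIs(hypothetical_gpu_only_pkg, _stub)
        self.assertEqual(hypothetical_gpu_only_pkg.fused_op(), 42)
```

Because the patch is scoped to each test method, the stub does not leak into other test files collected in the same process, which is the cross-test pollution the paragraph above warns about.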
+ +### E2E test (small model, server needed) + +```python +import unittest + +import requests + +from sglang.srt.utils import kill_process_tree +from sglang.test.ci.ci_register import register_cuda_ci +from sglang.test.test_utils import ( + DEFAULT_SMALL_MODEL_NAME_FOR_TEST, + DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH, + DEFAULT_URL_FOR_TEST, + CustomTestCase, + popen_launch_server, +) + +register_cuda_ci(est_time=60, suite="stage-b-test-1-gpu-small") + + +class TestMyFeature(CustomTestCase): + @classmethod + def setUpClass(cls): + cls.model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST + cls.base_url = DEFAULT_URL_FOR_TEST + cls.process = popen_launch_server( + cls.model, + cls.base_url, + timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH, + other_args=["--arg1", "value1"], # feature-specific args + ) + + @classmethod + def tearDownClass(cls): + if hasattr(cls, "process") and cls.process: + kill_process_tree(cls.process.pid) + + def test_basic_functionality(self): + response = requests.post( + self.base_url + "/generate", + json={"text": "Hello", "sampling_params": {"max_new_tokens": 32}}, + ) + self.assertEqual(response.status_code, 200) + + +if __name__ == "__main__": + unittest.main(verbosity=3) +``` + +### E2E test (8B model, server needed, performance) + +```python +import time +import unittest + +import requests + +from sglang.srt.utils import kill_process_tree +from sglang.test.ci.ci_register import register_cuda_ci +from sglang.test.test_utils import ( + DEFAULT_MODEL_NAME_FOR_TEST, + DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH, + DEFAULT_URL_FOR_TEST, + CustomTestCase, + popen_launch_server, +) + +register_cuda_ci(est_time=300, suite="stage-b-test-1-gpu-large") + + +class TestMyFeaturePerf(CustomTestCase): + @classmethod + def setUpClass(cls): + cls.model = DEFAULT_MODEL_NAME_FOR_TEST + cls.base_url = DEFAULT_URL_FOR_TEST + cls.process = popen_launch_server( + cls.model, + cls.base_url, + timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH, + ) + + @classmethod + def tearDownClass(cls): + if 
hasattr(cls, "process") and cls.process: + kill_process_tree(cls.process.pid) + + def test_latency(self): + start = time.perf_counter() + response = requests.post( + self.base_url + "/generate", + json={"text": "Hello", "sampling_params": {"max_new_tokens": 128}}, + ) + elapsed = time.perf_counter() - start + self.assertEqual(response.status_code, 200) + self.assertLess(elapsed, 5.0, "Latency exceeded threshold") + + +if __name__ == "__main__": + unittest.main(verbosity=3) +``` + +--- + +## Server Fixture Reuse + +For tests that only need a standard server, inherit from `DefaultServerBase` and override class attributes: + +```python +from sglang.test.server_fixtures.default_fixture import DefaultServerBase + +class TestMyFeature(DefaultServerBase): + model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST + other_args = ["--enable-my-feature"] + + def test_something(self): + ... +``` + +Available fixtures in `python/sglang/test/server_fixtures/`: + +| Fixture | Use case | +|---------|----------| +| `DefaultServerBase` | Standard single-server tests | +| `EagleServerBase` | EAGLE speculative decoding | +| `PDDisaggregationServerBase` | Disaggregated prefill/decode | +| `MMMUServerBase` | Multimodal VLM tests | + +--- + +## CI Registration + +Every CI-discovered test file must call a registration function at module level: + +```python +from sglang.test.ci.ci_register import ( + register_cuda_ci, + register_amd_ci, + register_cpu_ci, + register_npu_ci, +) + +# Per-commit test (small 1-gpu, runs on 5090) +register_cuda_ci(est_time=80, suite="stage-b-test-1-gpu-small") + +# Per-commit test (large 1-gpu, runs on H100) +register_cuda_ci(est_time=120, suite="stage-b-test-1-gpu-large") + +# Nightly-only test +register_cuda_ci(est_time=200, suite="nightly-1-gpu", nightly=True) + +# Multi-backend test (only when testing backend-specific code paths) +register_cuda_ci(est_time=80, suite="stage-a-test-1-gpu-small") +register_amd_ci(est_time=120, suite="stage-a-test-1-gpu-small-amd") 
+register_npu_ci(est_time=400, suite="nightly-8-npu-a3", nightly=True) + +# Temporarily disabled test +register_cuda_ci(est_time=80, suite="stage-b-test-1-gpu-small", disabled="flaky - see #12345") +``` + +Parameters: +- `est_time`: estimated runtime in seconds (used for CI partitioning) +- `suite`: which CI suite to run in (see suite tables above) +- `nightly=True`: for nightly-only tests (default `False` = per-commit) +- `disabled="reason"`: temporarily disable with explanation + +**Key principle**: Only add `register_amd_ci` / `register_npu_ci` when the test exercises backend-specific code paths. Common E2E tests just need `register_cuda_ci` — duplicating across backends wastes CI time. + +### JIT Kernel Registration + +JIT kernel files live outside `test/registered/` but still use registration: + +```python +from sglang.test.ci.ci_register import register_cuda_ci + +# Correctness tests in python/sglang/jit_kernel/tests/ +register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large") +register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-b200") +register_cuda_ci(est_time=120, suite="stage-b-kernel-unit-8-gpu-h200") + +# Benchmarks in python/sglang/jit_kernel/benchmark/ +register_cuda_ci(est_time=6, suite="stage-b-kernel-benchmark-1-gpu-large") + +# Optional nightly registration +register_cuda_ci(est_time=120, suite="nightly-kernel-1-gpu", nightly=True) +register_cuda_ci(est_time=120, suite="nightly-kernel-8-gpu-h200", nightly=True) +``` + +Keep `est_time` and `suite` as **literal values** — `run_suite.py` collects them by AST parsing + +--- + +## Test Placement + +``` +test/ +├── registered/ # CI tests (auto-discovered by run_suite.py) +│ ├── unit/ # No server / engine launch (see test/registered/unit/README.md) +│ ├── kernels/ # CUDA kernel correctness (no server, GPU required) +│ ├── sampling/ # test_penalty.py, test_sampling_params.py ... +│ ├── sessions/ # test_session_control.py ... +│ ├── openai_server/ # basic/, features/, validation/ ... 
+│ ├── spec/ # eagle/, utils/ ... +│ ├── models/ # model-specific accuracy tests +│ ├── perf/ # performance benchmarks +│ └── / # create new category if needed +├── manual/ # Non-CI: debugging, one-off, manual verification +└── run_suite.py # CI runner (scans registered/ plus jit_kernel test/benchmark files) + +python/sglang/jit_kernel/ +├── tests/ # JIT kernel correctness tests (CI-discovered by test/run_suite.py) +└── benchmark/ # JIT kernel benchmarks (CI-discovered by test/run_suite.py) +``` + +**Decision rule** (see also `test/registered/README.md`): +- Component logic, no server → `registered/unit/` +- JIT kernel correctness / benchmarks → `python/sglang/jit_kernel/tests/` or `python/sglang/jit_kernel/benchmark/` +- Other kernel correctness → `registered/kernels/` +- Server needed → `registered//` +- Local debugging → `manual/` + +--- + +## Eval Accuracy Mixins + +**Design philosophy**: Most test files don't care about eval logic — they only need a "does this feature break model output quality?" sanity check. The mixin pattern separates **what to test** (threshold) from **how to test** (run_eval, assertions, CI summary). Test classes declare thresholds as class attributes; the mixin provides the `test_*` method. Override when you need extra assertions (e.g. EAGLE accept length). + +Available mixins in `python/sglang/test/kits/eval_accuracy_kit.py`: `MMLUMixin`, `HumanEvalMixin`, `MGSMEnMixin`, `GSM8KMixin`. Can be combined freely. Read the source for attrs and defaults. 
+ +```python +class TestMyFeature(CustomTestCase, MMLUMixin): + mmlu_score_threshold = 0.65 + mmlu_num_examples = 64 + mmlu_num_threads = 32 + # test_mmlu is inherited — no code needed +``` + +--- + +## Key Utilities + +```python +from sglang.test.test_utils import ( + CustomTestCase, # base class with retry logic + popen_launch_server, # launch server subprocess + DEFAULT_URL_FOR_TEST, # auto-configured base URL + DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH, # 600s default + run_bench_serving, # benchmark helper (launch + bench) +) +from sglang.srt.utils import kill_process_tree # cleanup server +``` + +--- + +## Checklist + +Before submitting a test: + +- [ ] Inherits from `CustomTestCase` (not `unittest.TestCase`) +- [ ] Has `register_*_ci(...)` call at module level +- [ ] Placed in `test/registered//`, unless this is a JIT kernel test/benchmark +- [ ] JIT kernel work: files live in `python/sglang/jit_kernel/tests/` or `python/sglang/jit_kernel/benchmark/` +- [ ] Backend-independent tests: `register_cuda_ci` only + smallest model +- [ ] Logic that doesn't need a server / engine launch → unit test in `registered/unit/` (see Unit Tests section) +- [ ] `setUpClass` launches server, `tearDownClass` kills it (if server-based) +- [ ] `tearDownClass` is defensive — uses `hasattr`/null checks before accessing resources that may not have been allocated +- [ ] Has `if __name__ == "__main__": unittest.main()` +- [ ] `est_time` is reasonable (measure locally) diff --git a/.codespellrc b/.codespellrc index 808a344b4e6f..b95d08495c91 100644 --- a/.codespellrc +++ b/.codespellrc @@ -1,3 +1,3 @@ [codespell] -ignore-words-list = ans, als, hel, boostrap, childs, te, vas, hsa, ment, cann, thi, makro, wil, rouge, PRIS -skip = *.json,*.jsonl,*.patch,*.txt +ignore-words-list = ans, als, hel, boostrap, childs, te, vas, hsa, ment, cann, thi, makro, wil, rouge, PRIS, ather, MIS, medias, allready, inout, nd, fo, visibles, nothink +skip = *.json, *.jsonl, *.patch, *.txt, *.lock diff --git 
a/.coveragerc b/.coveragerc new file mode 100644 index 000000000000..5a0a37805828 --- /dev/null +++ b/.coveragerc @@ -0,0 +1,16 @@ +[run] +source = python/sglang/srt +omit = + */test/* + */__pycache__/* + +[report] +show_missing = true +exclude_lines = + pragma: no cover + if __name__ == .__main__.: + raise NotImplementedError + if TYPE_CHECKING + +[html] +directory = htmlcov diff --git a/.devcontainer/Dockerfile b/.devcontainer/Dockerfile index 3c7b67cac8f5..3f7d93114878 100644 --- a/.devcontainer/Dockerfile +++ b/.devcontainer/Dockerfile @@ -16,8 +16,6 @@ RUN apt-get update && apt-get install -y sudo && \ # Set up oh-my-zsh for devuser RUN cp -r /root/.oh-my-zsh /home/devuser/.oh-my-zsh && \ cp /root/.zshrc /home/devuser/.zshrc && \ - cp /root/.vimrc /home/devuser/.vimrc && \ - cp /root/.tmux.conf /home/devuser/.tmux.conf && \ sed -i 's|/root/.oh-my-zsh|/home/devuser/.oh-my-zsh|g' /home/devuser/.zshrc && \ chown -R devuser:devuser /home/devuser/ diff --git a/.editorconfig b/.editorconfig deleted file mode 100644 index 030a7293dcb6..000000000000 --- a/.editorconfig +++ /dev/null @@ -1,25 +0,0 @@ -# https://editorconfig.org/ - -root = true - -[*] -charset = utf-8 -end_of_line = lf -indent_style = space -indent_size = 4 -trim_trailing_whitespace = true -insert_final_newline = true - -[*.{json,yaml,yml}] -indent_size = 2 - -[*.md] -indent_size = 2 -x-soft-wrap-text = true - -[*.rst] -indent_size = 4 -x-soft-wrap-text = true - -[Makefile] -indent_style = tab diff --git a/.github/CI_PERMISSIONS.json b/.github/CI_PERMISSIONS.json index c661d147f028..26e512db6722 100644 --- a/.github/CI_PERMISSIONS.json +++ b/.github/CI_PERMISSIONS.json @@ -2,1121 +2,1373 @@ "1pikachu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "842974287": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + 
"can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" + }, + "AgainstEntropy": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" }, "Alcanderian": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "AniZpZ": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "BBuf": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "BHZ-BER": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ByronHsu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "CaoE": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "CatherineSue": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" + }, + "Chen-0210": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" 
}, "ClawSeven": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ConnorLi96": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "DarkSharpness": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "top contributor" }, "Edwardf0t1": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "FlamingoPg": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "FrankLeeeee": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "Fridge003": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "HaiShaw": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "HanHan009527": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - 
"can_rerun_stage": true + "reason": "custom override" }, "HandH1998": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" }, "Hanrui-Wang": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "Hexq0210": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" }, "HydraQYH": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "JeremieMelo": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, - "Johnsonms": { + "Jiminator": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "custom override" + }, + "Johnsonms": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "JustinTong0323": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "Kangyan-Zhou": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": 
true + "reason": "top contributor" }, "LorrinWWW": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "Makcum888e": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" }, "MingxuZh": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "Oasis-Git": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" + }, + "Prozac614": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" }, "Qiaolin-Yu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "Qihang-Zhang": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "Ratish1": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" + }, + "RubiaCx": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "ShangmingCai": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": 
true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" + }, + "Shunkangz": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "SimonCqk": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "TianQiLin666666": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "Ubospica": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "Valentine233": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "Xia-Weiwen": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "XiaotongJiang": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "XucSh": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "top contributor" }, "YAMY1234": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, 
"cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "top contributor" }, "Ying1123": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ZailiWang": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ZhengWG": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ZhengdQin": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "acelyc111": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "adarshxs": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "airMeng": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" + }, + "alexnails": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "top contributor" }, "alisonshao": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, 
"cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "alphabetc1": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "top contributor" }, "amysaq2023": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "attack204": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ayrnb": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "azhurkevich": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "b8zhong": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" + }, + "bingxche": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "blzheng": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "byjiang1996": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - 
"can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "cctry": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ch-wan": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" + }, + "chenxu214": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "chunyuan-w": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "cicirori": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "cyb70289": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "dongjiyingdjy": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "dougyster": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "top contributor" }, "elfiegg": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom 
override", - "can_rerun_stage": true + "reason": "custom override" + }, + "fortunecookiee": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" }, "fy1214": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "fzyzcjy": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "gaopengff": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, - "gongwei-130": { + "glenliu21": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" + }, + "gongwei-130": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "gongy": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "guapisolo": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "guoyuhong": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom 
override" }, "hanming-lu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "harrisonlimh": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "harvenstar": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "top contributor" }, "hebiao064": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "hlu1": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "hnyls2002": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "huaiyuzh": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "huangtingwei9988": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "hubertlu-tw": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + 
"can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "hyhieu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "hzh0425": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "iforgetmyname": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "ishandhanani": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "ispobock": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "jason-fxz": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "jasperjiaguo": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" }, "jhinpan": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "jianan-gu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, 
"cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "jinleic": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "jinmingyi1998": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" + }, + "jybsuper": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "custom override" }, "kaixih": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "kevin85421": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "key4ng": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "kkHuang-amd": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" + }, + "kpham-sgl": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "custom override" }, "kssteven418": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, 
"cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "kushanam": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "lanking520": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "lawrence-harmonic": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" }, "lifuhuang": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" + }, + "liupeng374": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "liusy58": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "top contributor" }, "liz-badada": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" + }, + "luccafong": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "maocheng23": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" }, "merrymercy": { "can_tag_run_ci_label": 
true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" + }, + "michaelzhang-ai": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "mickqian": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "mingfeima": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "minleminzui": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "mmangkad": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" + }, + "narutolhy": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" }, "netanel-haber": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "nvcastet": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ocss884": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", 
- "can_rerun_stage": true + "reason": "custom override" }, "pansicheng": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "pavanimajety": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "pdasgup": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ping1jing2": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "ppraneth": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" }, "pranavm-nvidia": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "pyc96": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "qimcis": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "qingquansong": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": 
"custom override" }, "qywu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "rainj-me": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "ravi03071991": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "rkooo567": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "roikoren755": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" }, "saienduri": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" + }, + "samuellees": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" + }, + "satyamk7054": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "custom override" }, "scottjlee": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "sglang-bot": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": 
true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" + }, + "sglang-npu-bot": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "custom override" }, "shaharmor98": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "shanyu-sys": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "shuaills": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "sleepcoo": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "slin1237": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "stmatengss": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "strgrb": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "sufeng-buaa": { "can_tag_run_ci_label": true, 
"can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "sundar24295s": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "sunjiweiswift": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "sunxxuns": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "thecodingwizard": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "timmy-feng": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "trevor-m": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "vincentzed": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "wenscarl": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": 
"custom override" }, "whybeyoung": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "wisclmy0611": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "xiezhq-hermann": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "xutizhou": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "xyjixyjixyji": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "yanbing-j": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "yangsijia-serena": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "yctseng0211": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" }, "yeahdongcn": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - 
"can_rerun_stage": true + "reason": "top contributor" }, "yhyang201": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "yilian49": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "yinghai": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, - "yizhang2077": { + "yingluosanqian": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "custom override" + }, + "yizhang2077": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "ykcombat": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "ynwang007": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "yuan-luo": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "yundai424": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - 
"can_rerun_stage": true + "reason": "custom override" }, "yushengsu-thu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "yyihuang": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "yzh119": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "zRzRzRzRzRzRzR": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "top contributor" }, "zhaochenyang20": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" + }, + "zhendonghua": { + "can_tag_run_ci_label": true, + "can_rerun_failed_ci": true, + "can_rerun_stage": true, + "cooldown_interval_minutes": 0, + "reason": "custom override" }, "zhijian-liu": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "zhuzilin": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, - "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "can_rerun_stage": true, + "cooldown_interval_minutes": 60, + "reason": "custom override" }, "zhyncs": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - 
"can_rerun_stage": true + "reason": "top contributor" }, "zminglei": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "top contributor", - "can_rerun_stage": true + "reason": "top contributor" }, "zyksir": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 60, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" }, "zyzshishui": { "can_tag_run_ci_label": true, "can_rerun_failed_ci": true, + "can_rerun_stage": true, "cooldown_interval_minutes": 0, - "reason": "custom override", - "can_rerun_stage": true + "reason": "custom override" } } diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index b956c0ed94ac..4850534a5bd7 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1,44 +1,63 @@ -.github @merrymercy @Fridge003 @ispobock @Kangyan-Zhou -/docker @Fridge003 @ispobock @HaiShaw @ishandhanani +.github @merrymercy @Fridge003 @ispobock @Kangyan-Zhou @bingxche +/docker @Fridge003 @ispobock @HaiShaw @ishandhanani @yctseng0211 /docker/npu.Dockerfile @ping1jing2 @iforgetmyname +/docs @wisclmy0611 @zijiexia +/docs_new @wisclmy0611 @zijiexia @Richardczl98 @JustinTong0323 /python/pyproject.toml @merrymercy @Fridge003 @ispobock -/python/sglang/jit_kernel @DarkSharpness @BBuf -/python/sglang/multimodal_gen @mickqian @yhyang201 -/python/sglang/multimodal_gen/runtime/layers @mickqian @yhyang201 @BBuf -/python/sglang/multimodal_gen/runtime/models/dits @mickqian @yhyang201 @BBuf +/python/sglang/jit_kernel @DarkSharpness @BBuf @celve @HydraQYH @yuan-luo +/python/sglang/jit_kernel/diffusion @yingluosanqian @BBuf @mickqian +/python/sglang/multimodal_gen @mickqian @yhyang201 @ping1jing2 +/python/sglang/multimodal_gen/runtime/cache @DefTruth +/python/sglang/multimodal_gen/runtime/layers @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2 +/python/sglang/multimodal_gen/runtime/models/dits 
@mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2 /python/sglang/srt/batch_invariant_ops @Fridge003 @hebiao064 +/python/sglang/srt/compilation @hebiao064 @Oasis-Git /python/sglang/srt/constrained @hnyls2002 @DarkSharpness -/python/sglang/srt/compilation @hebiao064 /python/sglang/srt/disaggregation @ByronHsu @hnyls2002 @ShangmingCai /python/sglang/srt/disaggregation/ascend @ping1jing2 @iforgetmyname /python/sglang/srt/distributed @yizhang2077 @merrymercy @ch-wan +/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py @ShangmingCai @stmatengss +/python/sglang/srt/dllm @ClawSeven @btw616 /python/sglang/srt/entrypoints @ispobock @CatherineSue @slin1237 @merrymercy @JustinTong0323 +/python/sglang/srt/entrypoints/engine_score_mixin.py @sundar24295s @chanh @fortunecookiee /python/sglang/srt/entrypoints/grpc_server.py @CatherineSue @slin1237 +/python/sglang/srt/entrypoints/openai/serving_score.py @sundar24295s @chanh @fortunecookiee /python/sglang/srt/eplb @fzyzcjy @ch-wan /python/sglang/srt/function_call @CatherineSue @JustinTong0323 /python/sglang/srt/grpc @CatherineSue @slin1237 +/python/sglang/srt/hardware_backend/mlx @yeahdongcn +/python/sglang/srt/hardware_backend/musa @yeahdongcn /python/sglang/srt/hardware_backend/npu @ping1jing2 @iforgetmyname /python/sglang/srt/hardware_backend/npu/quantization @OrangeRedeng @TamirBaydasov @iforgetmyname /python/sglang/srt/layers @merrymercy @Ying1123 @Fridge003 @ispobock @HaiShaw @ch-wan @BBuf @Edwardf0t1 -/python/sglang/srt/layers/attention @merrymercy @Fridge003 @ispobock @Qiaolin-Yu @hebiao064 -/python/sglang/srt/layers/attention/fla @yizhang2077 @hebiao064 -/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py @yizhang2077 @hebiao064 @hanming-lu +/python/sglang/srt/layers/attention @merrymercy @Fridge003 @ispobock @Qiaolin-Yu @hebiao064 @HaiShaw +/python/sglang/srt/layers/attention/fla @yizhang2077 @hebiao064 @yuan-luo 
+/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py @yizhang2077 @hebiao064 @hanming-lu @yuan-luo /python/sglang/srt/layers/attention/mamba @yizhang2077 @hebiao064 -/python/sglang/srt/layers/quantization @ch-wan @BBuf @Edwardf0t1 @FlamingoPg @AniZpZ -/python/sglang/srt/lora @Ying1123 @Fridge003 @lifuhuang +/python/sglang/srt/layers/attention/nsa @1am9trash @hubertlu-tw @kkHuang-amd @HaiShaw @Fridge003 @hlu1 @rainj-me +/python/sglang/srt/layers/attention/vision.py @mickqian @yuan-luo @yhyang201 +/python/sglang/srt/layers/quantization @ch-wan @BBuf @Edwardf0t1 @FlamingoPg @AniZpZ @HaiShaw @b8zhong +/python/sglang/srt/layers/quantization/quark @kkHuang-amd @yichiche @hubertlu-tw @1am9trash @BowenBao +/python/sglang/srt/lora @Ying1123 @Fridge003 @lifuhuang @yushengsu-thu /python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann /python/sglang/srt/managers/scheduler_pp_mixin.py @ShangmingCai @XucSh -/python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann @hanming-lu @yizhang2077 +/python/sglang/srt/managers/tokenizer_manager_score_mixin.py @sundar24295s @chanh @fortunecookiee +/python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann @hanming-lu @yizhang2077 @hzh0425 @ispobock /python/sglang/srt/model_executor @merrymercy @Ying1123 @hnyls2002 @Fridge003 @ispobock /python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py @hebiao064 +/python/sglang/srt/models/deepseek_common @Fridge003 @ispobock @fzyzcjy @ch-wan /python/sglang/srt/models/deepseek_v2.py @fzyzcjy @zhyncs @ispobock @ch-wan @merrymercy @Fridge003 +/python/sglang/srt/models/transformers.py @adarshxs /python/sglang/srt/multimodal @mickqian @JustinTong0323 @yhyang201 @yuan-luo -/python/sglang/srt/speculative @Ying1123 @merrymercy @hnyls2002 -/sgl-kernel @zhyncs @ispobock @BBuf @yizhang2077 @merrymercy @FlamingoPg @HaiShaw +/python/sglang/srt/observability @merrymercy @fzyzcjy @sufeng-buaa +/python/sglang/srt/ray @Qiaolin-Yu @xyuzh 
+/python/sglang/srt/speculative @Ying1123 @merrymercy @hnyls2002 @Qiaolin-Yu +/sgl-kernel @ispobock @BBuf @yizhang2077 @merrymercy @FlamingoPg @HaiShaw /sgl-model-gateway @slin1237 @CatherineSue /sgl-model-gateway/benches @slin1237 /sgl-model-gateway/bindings/python @CatherineSue @key4ng @slin1237 /sgl-model-gateway/e2e_test @CatherineSue @key4ng +/sgl-model-gateway/examples/wasm @slin1237 /sgl-model-gateway/src/config @slin1237 /sgl-model-gateway/src/core @slin1237 /sgl-model-gateway/src/data_connector @key4ng @@ -53,5 +72,17 @@ /sgl-model-gateway/src/tool_parser @slin1237 @CatherineSue /sgl-model-gateway/src/wasm @slin1237 /sgl-model-gateway/examples/wasm @slin1237 +/test/registered/prefill_only @sundar24295s @chanh @fortunecookiee +/benchmark/prefill_only/bench_score.py @sundar24295s @chanh @fortunecookiee /test/srt/ascend @ping1jing2 @iforgetmyname /test/srt/test_modelopt* @Edwardf0t1 +/python/sglang/srt/layers/gemma4_fused_ops.py @merrymercy @Ying1123 @Fridge003 @ispobock @HaiShaw @ch-wan @BBuf @Edwardf0t1 @kpham-sgl +/python/sglang/srt/function_call/gemma4_detector.py @CatherineSue @JustinTong0323 @kpham-sgl +/python/sglang/srt/models/gemma4_*.py @kpham-sgl +/python/sglang/srt/multimodal/processors/gemma4.py @mickqian @JustinTong0323 @yhyang201 @yuan-luo @kpham-sgl +/docs_new/cookbook/autoregressive/Google/Gemma4.mdx @wisclmy0611 @zijiexia @Richardczl98 @kpham-sgl +/docs_new/src/snippets/autoregressive/gemma4-deployment.jsx @wisclmy0611 @zijiexia @Richardczl98 @kpham-sgl +/python/sglang/srt/speculative/ngram_*.py @hnyls2002 @Qiaolin-Yu @kpham-sgl +/python/sglang/srt/speculative/cpp_ngram @hnyls2002 @Qiaolin-Yu @kpham-sgl +/python/sglang/jit_kernel/ngram_*.py @hnyls2002 @Qiaolin-Yu @kpham-sgl +/python/sglang/jit_kernel/csrc/ngram_corpus @hnyls2002 @Qiaolin-Yu @kpham-sgl diff --git a/.github/MAINTAINER.md b/.github/MAINTAINER.md index cc569f1456a7..58b71196c948 100644 --- a/.github/MAINTAINER.md +++ b/.github/MAINTAINER.md @@ -37,31 +37,118 @@ __Note__: The 
permissions to trigger CI tests are defined separately according t - **Ideal case:** For each modified file, one Codeowner has approved the PR. The PR has also passed the required CI tests. Then, anyone with write permission can merge the PR. - **Exception:** In cases where it is difficult to meet all requirements (due to flaky CI or slow responses), a Merge Oncall can bypass branch protection to merge the PR. -If you meet any issues during the merge, you can discuss in [slack channels](https://slack.sglang.io/): #dev, #pull-request, and #ci-cd-build-release. +If you meet any issues during the merge, you can discuss in [slack channels](https://slack.sglang.io/): #pull-request, #ci-cd-build-release, #dev. ## The List of Merge Oncalls and Reviewers +This section lists the oncalls for each module or feature. The format is @github-username (Slack username). -TODO: fill in the list. +### Scheduler +[@merrymercy](https://github.com/merrymercy) (Lianmin Zheng), [@hnyls2002](https://github.com/hnyls2002) (Liangsheng Yin), [@cctry](https://github.com/cctry) (Shiyang Chen) + +related files +- python/sglang/srt/managers +- python/sglang/srt/model_executor + +### Diffusion +[@mickqian](https://github.com/mickqian) (Mick), [@BBuf](https://github.com/BBuf) (BBuf) + +related files +- python/sglang/multimodal_gen + +### PD disaggregation +[@ByronHsu](https://github.com/ByronHsu) (Byron Hsu), [@cctry](https://github.com/cctry) (Shiyang Chen), [@ShangmingCai](https://github.com/ShangmingCai) (Shangming Cai) + +related files +- python/sglang/srt/disaggregation + +### KV Cache +[@ispobock](https://github.com/ispobock) (Ke Bao), [@xiezhq-hermann](https://github.com/xiezhq-hermann) (Zhiqiang Xie) + +related files +- python/sglang/srt/mem_cache + +### Parallelism +[@ch-wan](https://github.com/ch-wan) (Cheng Wan), [@fzyzcjy](https://github.com/fzyzcjy) (Tom) + +related files +- python/sglang/srt/eplb +- python/sglang/srt/distributed +- python/sglang/srt/layers/dp_attention.py + +### 
Kernel +[@BBuf](https://github.com/BBuf) (BBuf) + +related files +- python/sglang/jit_kernel +- sgl-kernel + +### Speculative decoding +[@hnyls2002](https://github.com/hnyls2002) (Liangsheng Yin), [@Qiaolin-Yu](https://github.com/Qiaolin-Yu) (Qiaolin Yu) + +related files +- python/sglang/srt/speculative + +### NV and model-specific optimizations +[@Fridge003](https://github.com/Fridge003) (Baizhou Zhang), [@ishandhanani](https://github.com/ishandhanani) (Ishan Dhanani), [@Qiaolin-Yu](https://github.com/Qiaolin-Yu) (Qiaolin Yu) + +related files +- python/sglang/srt/models +- python/sglang/srt/layers/attention + +### AMD optimizations +[@HaiShaw](https://github.com/HaiShaw) (Henry HAI) + +### NPU optimizations +[@iforgetmyname](https://github.com/iforgetmyname) (Even Zhou) + +related files +- python/sglang/srt/hardware_backend/npu + +### CI, Release, Package +[@Kangyan-Zhou](https://github.com/Kangyan-Zhou) (Kangyan Zhou), [@Fridge003](https://github.com/Fridge003) (Baizhou Zhang) + +related files +- .github/workflows + +### Router, API +[@slin1237](https://github.com/slin1237) (Simo Lin) + +related files +- sgl-model-gateway +- python/sglang/srt/grpc +- python/sglang/srt/entrypoints + +### Other Notes Now we have many Merge Oncalls mainly because the CI is flaky and the CODEOWNERS is too coarse-grained. In the future, we hope the CI can be improved and we only need bypass rarely. After that, most Merge Oncalls can be converted back to Write and CODEOWNERS. -This list is based on the current situation. If you or someone you know would like to take on more responsibility and are qualified, please ping @Lianmin Zheng and @Ying Sheng in the Slack channel. They will start a nomination and internal review process. +This list is based on the current situation. If you or someone you know would like to take on more responsibility and are qualified, please ping [Lianmin Zheng](https://github.com/merrymercy) and [Ying Sheng](https://github.com/Ying1123) in the Slack channel. 
They will start a nomination and internal review process. ## The List of CI Oncalls -The format is @github-username (Slack username). +This section lists the oncalls for each hardware platform. The format is @github-username (Slack username). ### NVIDIA GPUs -@merrymercy (Lianmin Zheng), @Kangyan-Zhou (Kangyan Zhou), @ch-wan (Cheng Wan), @HanHan009527 (hanhan), @ishandhanani (Ishan Dhanani), @key4ng (Keyang Ru), @slin1237 (Simo Lin), @ShangmingCai (Shangming Cai) +[@Kangyan-Zhou](https://github.com/Kangyan-Zhou) (Kangyan Zhou), [@ch-wan](https://github.com/ch-wan) (Cheng Wan), [@HanHan009527](https://github.com/HanHan009527) (hanhan), [@ishandhanani](https://github.com/ishandhanani) (Ishan Dhanani), [@ShangmingCai](https://github.com/ShangmingCai) (Shangming Cai), [@alisonshao](https://github.com/alisonshao) (Alison Shao). ### AMD GPUs -@saienduri (Sai Enduri), @HaiShaw (Henry HAI) +[@saienduri](https://github.com/saienduri) (Sai Enduri), [@HaiShaw](https://github.com/HaiShaw) (Henry HAI) ### Intel CPU and XPU -@mingfeima (Mingfei Ma), @DiweiSun (Diwei Sun) +[@mingfeima](https://github.com/mingfeima) (Mingfei Ma), [@DiweiSun](https://github.com/DiweiSun) (Diwei Sun) ### Ascend NPUs -@iforgetmyname (Even Zhou) +[@iforgetmyname](https://github.com/iforgetmyname) (Even Zhou) + +This list is based on the current situation. If you or someone you know would like to donate machines for CI, they can serve as the CI oncalls for their machines. Please ping [Lianmin Zheng](https://github.com/merrymercy) and [Ying Sheng](https://github.com/Ying1123) in the Slack channel. They will start a nomination and internal review process. + +## CI Maintenance Mode +When the CI is unhealthy (e.g., the scheduled pr-test on `main` is broken for consecutive runs), the project enters **CI Maintenance Mode** by opening [issue #21065](https://github.com/sgl-project/sglang/issues/21065). While active: +- All PR CI runs are paused. Resources are allocated to PRs that fix the CI. 
+- **Merging non-CI-fix PRs is prohibited.** Only PRs that fix the CI may be merged. In severe cases, merge permissions may be revoked. + +Maintenance mode ends when `pr-test.yml` is all green on `main` and the issue is closed. -This list is based on the current situation. If you or someone you know would like to donate machines for CI, they can serve as the CI oncalls for their machines. Please ping @Lianmin Zheng and @Ying Sheng in the Slack channel. They will start a nomination and internal review process. +## Suspending Permissions +If a Merge Oncall bypasses checks to merge a PR that breaks the `main` branch, merges a non-CI-fix PR during CI Maintenance Mode, or repeatedly breaks the CI due to various reasons, their privileges will be suspended for at least two days, depending on the severity of the incident. diff --git a/.github/actions/check-maintenance/action.yml b/.github/actions/check-maintenance/action.yml new file mode 100644 index 000000000000..f064cad522d0 --- /dev/null +++ b/.github/actions/check-maintenance/action.yml @@ -0,0 +1,63 @@ +name: Check Maintenance Mode +description: Blocks CI when maintenance mode is active (issue #21065 is open), unless the PR has the bypass-maintenance label, or env PR_TEST_BYPASS_MAINTENANCE_ON_MAIN=true (PR Test workflow on main only). Merging non-CI-fix PRs is prohibited during maintenance mode; in severe cases, merge permissions may be revoked. 
+ +inputs: + github-token: + description: GitHub token for API access + required: false + default: ${{ github.token }} + +runs: + using: composite + steps: + - name: Check maintenance mode + shell: bash + env: + GH_TOKEN: ${{ inputs.github-token }} + run: | + MAINTENANCE_ISSUE=21065 + REPO="${{ github.repository }}" + PR_NUMBER="${{ github.event.pull_request.number }}" + + # PR Test workflow only: scheduled runs and runs on main (dispatch / workflow_call) set this env + if [[ "${PR_TEST_BYPASS_MAINTENANCE_ON_MAIN:-}" == "true" ]]; then + echo "✅ PR Test on main branch; bypassing maintenance gate." + exit 0 + fi + + # Check if maintenance issue is open (fail-open: if API errors, allow CI to proceed) + ISSUE_STATE=$(gh issue view "$MAINTENANCE_ISSUE" --repo "$REPO" --json state --jq '.state' 2>/dev/null || echo "UNKNOWN") + + if [[ "$ISSUE_STATE" != "OPEN" ]]; then + echo "✅ Maintenance mode is OFF. Proceeding with CI." + exit 0 + fi + + # For PRs, check if bypass-maintenance label is present + if [[ -n "$PR_NUMBER" ]]; then + HAS_BYPASS=$(gh pr view "$PR_NUMBER" --repo "$REPO" --json labels --jq '[.labels[].name] | map(select(. == "bypass-maintenance")) | length' 2>/dev/null || echo "0") + if [[ "$HAS_BYPASS" -gt 0 ]]; then + echo "✅ PR #$PR_NUMBER has 'bypass-maintenance' label. Bypassing maintenance mode." + exit 0 + fi + fi + + MSG=$(printf "%s\n" \ + "## ⚠️ CI Maintenance Mode is Active" \ + "The CI infrastructure is currently under maintenance." \ + "All PR CI runs are paused until maintenance is complete." \ + "**Merging non-CI-fix PRs is prohibited during maintenance mode.** In severe cases, merge permissions may be revoked." \ + "You might also experience unexpected failures during this period." \ + "The team is working on the issue and will update the status as soon as possible." \ + "" \ + "What should you do?" 
\ + "- **Do NOT merge non-CI-fix PRs** until maintenance mode is lifted" \ + "- Check back later (~12 hours)" \ + "- Follow CI Maintenance Mode issue: https://github.com/$REPO/issues/$MAINTENANCE_ISSUE for status updates") + + echo "$MSG" >> "$GITHUB_STEP_SUMMARY" + while IFS= read -r line; do + echo "::error::$line" + done <<< "$MSG" + + exit 1 diff --git a/.github/actions/check-stage-health/action.yml b/.github/actions/check-stage-health/action.yml new file mode 100644 index 000000000000..290d3c73e872 --- /dev/null +++ b/.github/actions/check-stage-health/action.yml @@ -0,0 +1,94 @@ +name: Check Stage Health +description: Fail fast if any job in the current workflow run has already failed, or if the lint check (from lint.yml) has failed. Auto-skips for scheduled runs. The jobs-failed check (but not the lint check) is bypassed when the PR carries the `bypass-fastfail` label. + +inputs: + github-token: + description: 'GitHub token for API calls' + required: false + default: ${{ github.token }} + +runs: + using: composite + steps: + - name: Check stage health + uses: actions/github-script@v7 + env: + SKIP_STAGE_HEALTH_CHECK: ${{ env.SKIP_STAGE_HEALTH_CHECK }} + with: + github-token: ${{ inputs.github-token }} + script: | + // Skip when explicitly requested via env var (e.g. release branch cut) + if (process.env.SKIP_STAGE_HEALTH_CHECK === 'true') { + core.info('Skipping health check (SKIP_STAGE_HEALTH_CHECK=true)'); + return; + } + + // Skip for scheduled runs — they should collect all failures, not fast-fail + if (context.eventName === 'schedule') { + core.info('Skipping health check for scheduled run'); + return; + } + + // Check lint status from the separate Lint workflow (lint.yml). + // listJobsForWorkflowRun only sees jobs within the SAME run, so we use + // checks.listForRef which queries by commit SHA across ALL workflows. 
+ const ref = context.payload.pull_request?.head?.sha || context.sha; + const { data } = await github.rest.checks.listForRef({ + owner: context.repo.owner, + repo: context.repo.repo, + ref: ref, + check_name: 'lint', + }); + const lintRun = data.check_runs.find( + cr => cr.app?.slug === 'github-actions' + ); + if (lintRun?.status === 'completed' && lintRun?.conclusion === 'failure') { + core.setFailed('Fast-fail: lint check failed'); + return; + } + + // Skip the jobs-failed check when the PR carries the bypass-fastfail label. + // Lint check above still runs. + let labels = []; + if (context.payload.pull_request?.labels) { + labels = context.payload.pull_request.labels.map(l => l.name); + } else { + const { data: prs } = await github.rest.repos.listPullRequestsAssociatedWithCommit({ + owner: context.repo.owner, + repo: context.repo.repo, + commit_sha: ref, + }); + if (prs.length > 0) { + labels = prs[0].labels.map(l => l.name); + } + } + if (labels.includes('bypass-fastfail')) { + core.info('Skipping jobs-failed check (bypass-fastfail label present)'); + return; + } + + const jobs = await github.paginate(github.rest.actions.listJobsForWorkflowRun, { + owner: context.repo.owner, + repo: context.repo.repo, + run_id: context.runId, + per_page: 100, + }); + // Find jobs that failed from a real error, not from fast-fail cascade + const rootCauseFailures = jobs.filter(j => { + if (j.status !== 'completed' || j.conclusion !== 'failure') return false; + // h20 runners are flaky (dirty GPU state from prior runs); their failures + // should not cascade fast-fail to other stages. Exact match avoids + // accidentally matching the h200 job names. 
+ if (j.name === 'stage-c-test-8-gpu-h20') { + return false; + } + // If the failing step is the health check, it's a cascade — skip it + const failedStep = (j.steps || []).find(s => s.conclusion === 'failure'); + if (failedStep && (failedStep.name.includes('check-stage-health') || failedStep.name.includes('Check stage health'))) { + return false; + } + return true; + }); + if (rootCauseFailures.length > 0) { + core.setFailed(`Fast-fail: skipping — root cause job(s): ${rootCauseFailures.map(j => j.name).join(', ')}`); + } diff --git a/.github/actions/upload-cuda-coredumps/action.yml b/.github/actions/upload-cuda-coredumps/action.yml new file mode 100644 index 000000000000..0e9fdde2799d --- /dev/null +++ b/.github/actions/upload-cuda-coredumps/action.yml @@ -0,0 +1,27 @@ +name: Upload CUDA Coredumps +description: Upload CUDA coredump files as artifacts and clean up the directory. + +inputs: + artifact-suffix: + description: Suffix appended to the artifact name (e.g. matrix partition id) + required: false + default: "" + retention-days: + description: Number of days to retain the artifact + required: false + default: "7" + +runs: + using: composite + steps: + - name: Upload CUDA coredumps + uses: actions/upload-artifact@v4 + with: + name: cuda-coredumps-${{ github.job }}${{ inputs.artifact-suffix && format('-{0}', inputs.artifact-suffix) }} + path: ${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}/ + retention-days: ${{ inputs.retention-days }} + if-no-files-found: ignore + + - name: Cleanup CUDA coredumps + shell: bash + run: rm -rf "${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}" diff --git a/.github/actions/wait-for-jobs/action.yml b/.github/actions/wait-for-jobs/action.yml new file mode 100644 index 000000000000..c3200853a2d1 --- /dev/null +++ b/.github/actions/wait-for-jobs/action.yml @@ -0,0 +1,222 @@ +name: Wait for Jobs +description: Poll and wait for specified jobs in the current workflow run to complete. 
Returns success immediately when the PR carries the `bypass-fastfail` label, letting downstream stages dispatch in parallel (same effect as scheduled runs). + +inputs: + stage-name: + description: 'Human-readable stage name for log messages (e.g. "stage-a")' + required: true + jobs: + description: | + JSON array of job specs to wait for. Each element is either: + - a string: exact job name (e.g. "stage-a-test-1-gpu-small") + - an object { "prefix": "...", "expected_count": N }: for matrix jobs + required: true + max-wait-minutes: + description: 'Maximum time to wait before timing out' + required: false + default: '240' + poll-interval-seconds: + description: 'Seconds between polling attempts' + required: false + default: '60' + github-token: + description: 'GitHub token for API calls' + required: false + default: ${{ github.token }} + +outputs: + result: + description: 'Overall result: success, failure, or timeout' + value: ${{ steps.wait.outputs.result }} + +runs: + using: composite + steps: + - name: Wait for jobs to complete + id: wait + uses: actions/github-script@v7 + env: + INPUT_STAGE_NAME: ${{ inputs.stage-name }} + INPUT_JOBS: ${{ inputs.jobs }} + INPUT_MAX_WAIT_MINUTES: ${{ inputs.max-wait-minutes }} + INPUT_POLL_INTERVAL_SECONDS: ${{ inputs.poll-interval-seconds }} + with: + github-token: ${{ inputs.github-token }} + script: | + const stageName = process.env.INPUT_STAGE_NAME; + const jobSpecs = JSON.parse(process.env.INPUT_JOBS); + const maxWaitMinutes = parseInt(process.env.INPUT_MAX_WAIT_MINUTES); + const pollIntervalSeconds = parseInt(process.env.INPUT_POLL_INTERVAL_SECONDS); + const maxAttempts = (maxWaitMinutes * 60) / pollIntervalSeconds; + + // bypass-fastfail label opts the PR out of stage-to-stage waiting, + // letting all stages dispatch in parallel like scheduled runs do. 
+ let labels = []; + if (context.payload.pull_request?.labels) { + labels = context.payload.pull_request.labels.map(l => l.name); + } else { + const ref = context.payload.pull_request?.head?.sha || context.sha; + try { + const { data: prs } = await github.rest.repos.listPullRequestsAssociatedWithCommit({ + owner: context.repo.owner, + repo: context.repo.repo, + commit_sha: ref, + }); + if (prs.length > 0) { + labels = prs[0].labels.map(l => l.name); + } + } catch (e) { + console.log(`Could not fetch PR labels for ${ref}: ${e.message}`); + } + } + if (labels.includes('bypass-fastfail')) { + console.log(`Skipping ${stageName} wait (bypass-fastfail label present)`); + core.setOutput('result', 'success'); + return; + } + + // Normalize job specs into a uniform format + const normalizedSpecs = jobSpecs.map(spec => { + if (typeof spec === 'string') { + return { prefix: spec, expected_count: 1, exact: true }; + } + return { ...spec, exact: false }; + }); + + const totalExpectedJobs = normalizedSpecs.reduce((sum, s) => sum + s.expected_count, 0); + + const matchesSpec = (jobName, spec) => { + if (spec.exact) { + return jobName === spec.prefix; + } + return jobName === spec.prefix || jobName.startsWith(spec.prefix + ' ('); + }; + + // Use ETag conditional requests to avoid consuming rate limit when nothing changed. + // GitHub returns 304 Not Modified for unchanged data, which is FREE (no rate limit cost). 
+ let lastEtag = ''; + let lastJobs = null; + let apiCalls = 0; + let cachedCalls = 0; + + async function fetchJobs() { + const url = `GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs`; + const params = { + owner: context.repo.owner, + repo: context.repo.repo, + run_id: context.runId, + per_page: 100, + headers: {}, + }; + if (lastEtag) { + params.headers['if-none-match'] = lastEtag; + } + + try { + const response = await github.request(url, params); + apiCalls++; + const rateRemaining = response.headers['x-ratelimit-remaining'] || '?'; + const rateLimit = response.headers['x-ratelimit-limit'] || '?'; + console.log(`[rate-limit] ${rateRemaining}/${rateLimit} remaining (ETag: ${lastEtag ? 'sent' : 'none'}) | this session: ${apiCalls} paid, ${cachedCalls} free`); + lastEtag = response.headers.etag || ''; + const jobs = response.data.jobs; + + // Handle pagination if >100 jobs + // ETag only covers page 1, so invalidate it to avoid stale cache + // when later pages change but page 1 doesn't. 
+ if (response.data.total_count > 100) { + lastEtag = ''; + for (let page = 2; page <= Math.ceil(response.data.total_count / 100); page++) { + const { data: pageData } = await github.request(url, { + ...params, + page, + headers: {}, + }); + jobs.push(...pageData.jobs); + } + } + + lastJobs = jobs; + return { jobs, cached: false }; + } catch (err) { + if (err.status === 304 && lastJobs) { + cachedCalls++; + console.log(`[rate-limit] 304 Not Modified | this session: ${apiCalls} paid, ${cachedCalls} free`); + return { jobs: lastJobs, cached: true }; + } + throw err; + } + } + + for (let attempt = 0; attempt < maxAttempts; attempt++) { + const { jobs, cached } = await fetchJobs(); + + let allCompleted = true; + let failedJobs = []; + let completedCount = 0; + let totalCount = 0; + + for (const spec of normalizedSpecs) { + const matchingJobs = jobs.filter(job => matchesSpec(job.name, spec)); + + for (const job of matchingJobs) { + totalCount++; + if (!cached) { + console.log(`${job.name}: status=${job.status}, conclusion=${job.conclusion}`); + } + + if (job.status === 'completed') { + completedCount++; + if (job.conclusion !== 'success' && job.conclusion !== 'skipped') { + failedJobs.push(job.name); + } + } else { + allCompleted = false; + } + } + + if (matchingJobs.length < spec.expected_count) { + // Job-level `if:` is evaluated before matrix expansion. When it + // evaluates false, GitHub emits exactly one "skipped" entry using + // the un-expanded job name (bare prefix, no " (shard)" suffix) + // instead of N matrix entries. Detect that precise shape so we + // don't poll forever — and so we don't mistake a partially + // materialized dynamic/reusable matrix for a skipped one. 
+ const unexpandedSkip = matchingJobs.length === 1 && + matchingJobs[0].name === spec.prefix && + matchingJobs[0].status === 'completed' && + matchingJobs[0].conclusion === 'skipped'; + if (unexpandedSkip) { + const missing = spec.expected_count - 1; + totalCount += missing; + completedCount += missing; + if (!cached) { + console.log(`${spec.prefix}: job-level skip (bare entry, conclusion=skipped); treating as all ${spec.expected_count} skipped`); + } + } else { + console.log(`${spec.prefix}: found ${matchingJobs.length}/${spec.expected_count} jobs (waiting for more)`); + allCompleted = false; + } + } + } + + console.log(`[${stageName}] Progress: ${completedCount}/${totalCount} jobs completed (expected ${totalExpectedJobs})${cached ? ' (cached, no rate limit cost)' : ''}`); + + // Fail fast if any jobs failed + if (failedJobs.length > 0) { + core.setOutput('result', 'failure'); + core.setFailed(`${stageName} jobs failed: ${failedJobs.join(', ')}`); + return; + } + + if (allCompleted && totalCount >= totalExpectedJobs) { + core.setOutput('result', 'success'); + return; + } + + console.log(`Waiting ${pollIntervalSeconds}s... (attempt ${attempt + 1}/${maxAttempts})`); + await new Promise(resolve => setTimeout(resolve, pollIntervalSeconds * 1000)); + } + + core.setFailed(`Timeout waiting for ${stageName} jobs`); + core.setOutput('result', 'timeout'); diff --git a/.github/audit_permission.py b/.github/audit_permission.py new file mode 100644 index 000000000000..35c19f9b56a1 --- /dev/null +++ b/.github/audit_permission.py @@ -0,0 +1,411 @@ +""" +Audit GitHub repository collaborators with elevated access. + +This script will: +1. Fetch all collaborators with write permission to this repo. +2. Show their github username, Nickname and the role (e.g., admin, maintain, + custom org role, write, triage). +3. Show their last activity related to this repo (last commit, last issue, + last pull request). Put the data in YYYY-MM-DD format. 
Add a column "last activity date" to the CSV, before the above three breakdown columns. +4. Show activity on other repos: repos touched via public events in the last 90 days (Push, PR, Issues, etc.). Sort the repos by the number of activities. +5. Write results to a CSV sorted by the roles (admin, maintain, custom org role, write, triage) and the last activity date (most recent first). + +Usage: + export GH_TOKEN="your_github_token" + python3 audit_permission.py [--output path] [--repo owner/name] + +Requires: requests, and a token with permission to list collaborators (push+ +access to the repo). +""" + +from __future__ import annotations + +import argparse +import csv +import os +import sys +import time +from collections import Counter +from datetime import datetime, timedelta, timezone +from typing import Any + +try: + import requests +except ImportError: + requests = None # type: ignore + +DEFAULT_OWNER = "sgl-project" +DEFAULT_NAME = "sglang" + +HEADERS: dict[str, str] = {} + + +def _request( + method: str, + url: str, + *, + params: dict[str, Any] | None = None, + max_retries: int = 3, +) -> requests.Response: + if requests is None: + raise RuntimeError("Install the requests package: pip install requests") + for attempt in range(max_retries): + r = requests.request(method, url, headers=HEADERS, params=params, timeout=60) + if r.status_code == 403 and "rate limit" in (r.text or "").lower(): + reset = r.headers.get("X-RateLimit-Reset") + wait = 60 + if reset: + try: + wait = max(1, int(reset) - int(time.time()) + 2) + except ValueError: + pass + print(f"Rate limited; sleeping {wait}s...", file=sys.stderr) + time.sleep(min(wait, 3600)) + continue + return r + return r + + +def paginate_list(url: str, params: dict[str, Any] | None = None) -> list[Any]: + out: list[Any] = [] + next_url: str | None = url + next_params = params + while next_url: + r = _request("GET", next_url, params=next_params) + next_params = None + if r.status_code != 200: + print( + f"Error 
{r.status_code} GET {next_url}: {r.text[:500]}", + file=sys.stderr, + ) + break + data = r.json() + if isinstance(data, list): + out.extend(data) + else: + break + next_url = None + link = r.headers.get("Link", "") + for part in link.split(", "): + if 'rel="next"' in part: + start = part.find("<") + 1 + end = part.find(">") + if start > 0 and end > start: + next_url = part[start:end] + break + return out + + +def collaborator_role(collab: dict[str, Any]) -> str: + role_name = collab.get("role_name") + if isinstance(role_name, str) and role_name.strip(): + return role_name.strip() + perms = collab.get("permissions") or {} + if perms.get("admin"): + return "admin" + if perms.get("maintain"): + return "maintain" + if perms.get("push"): + return "write" + if perms.get("triage"): + return "triage" + return "read" + + +def has_write_plus(collab: dict[str, Any]) -> bool: + perms = collab.get("permissions") or {} + return bool( + perms.get("admin") + or perms.get("maintain") + or perms.get("push") + or perms.get("triage") + ) + + +def role_sort_tier(collab: dict[str, Any]) -> int: + """Sort order: admin (0), maintain (1), custom org role (2), write (3), triage (4).""" + rn = collab.get("role_name") + if isinstance(rn, str) and rn.strip(): + k = rn.strip().lower() + if k == "admin": + return 0 + if k == "maintain": + return 1 + if k == "write": + return 3 + if k == "triage": + return 4 + if k == "read": + return 5 + return 2 + perms = collab.get("permissions") or {} + if perms.get("admin"): + return 0 + if perms.get("maintain"): + return 1 + if perms.get("push"): + return 3 + if perms.get("triage"): + return 4 + return 5 + + +def fetch_display_name(login: str) -> str: + url = f"https://api.github.com/users/{login}" + r = _request("GET", url) + if r.status_code != 200: + return "" + data = r.json() + if not isinstance(data, dict): + return "" + n = data.get("name") + return n.strip() if isinstance(n, str) else "" + + +def parse_github_ts(s: str) -> datetime | None: + if not 
s: + return None + s = s.replace("Z", "+00:00") + try: + return datetime.fromisoformat(s) + except ValueError: + return None + + +def iso_timestamp_to_ymd(iso: str | None) -> str: + if not iso: + return "" + p = parse_github_ts(iso) + if not p: + return "" + return p.date().isoformat() + + +def max_date_ymd(*iso_dates: str | None) -> str: + best: datetime | None = None + for d in iso_dates: + p = parse_github_ts(d or "") + if p and (best is None or p > best): + best = p + return best.date().isoformat() if best else "" + + +def parse_ymd(s: str) -> datetime | None: + if not s: + return None + try: + return datetime.strptime(s, "%Y-%m-%d").replace(tzinfo=timezone.utc) + except ValueError: + return None + + +def last_commit_date(owner: str, repo: str, login: str) -> str | None: + url = f"https://api.github.com/repos/{owner}/{repo}/commits" + r = _request("GET", url, params={"author": login, "per_page": 1}) + if r.status_code != 200: + return None + data = r.json() + if not isinstance(data, list) or not data: + return None + commit = data[0].get("commit") or {} + c = commit.get("committer") or commit.get("author") or {} + d = c.get("date") + return d if isinstance(d, str) else None + + +def search_repo_item( + owner: str, repo: str, login: str, kind: str +) -> dict[str, Any] | None: + q = f"repo:{owner}/{repo} is:{kind} author:{login}" + url = "https://api.github.com/search/issues" + r = _request( + "GET", + url, + params={"q": q, "sort": "updated", "order": "desc", "per_page": 1}, + ) + if r.status_code != 200: + return None + payload = r.json() + items = payload.get("items") + if not items: + return None + return items[0] if isinstance(items[0], dict) else None + + +def last_issue_pr_dates( + owner: str, repo: str, login: str +) -> tuple[str | None, str | None]: + issue = search_repo_item(owner, repo, login, "issue") + pr = search_repo_item(owner, repo, login, "pr") + issue_dt = None + pr_dt = None + if issue: + issue_dt = issue.get("updated_at") or 
issue.get("created_at") + if not isinstance(issue_dt, str): + issue_dt = None + if pr: + pr_dt = pr.get("updated_at") or pr.get("created_at") + if not isinstance(pr_dt, str): + pr_dt = None + return issue_dt, pr_dt + + +def other_repos_activity_column( + login: str, owner: str, repo: str, days: int = 90 +) -> str: + """Repos other than this one touched in the window, sorted by event count (desc).""" + cutoff = datetime.now(timezone.utc) - timedelta(days=days) + full = f"{owner}/{repo}" + counts: Counter[str] = Counter() + url: str | None = f"https://api.github.com/users/{login}/events/public" + params: dict[str, Any] = {"per_page": 100} + + while url: + r = _request("GET", url, params=params) + params = {} + if r.status_code != 200: + break + events = r.json() + if not isinstance(events, list): + break + oldest_in_page: datetime | None = None + for ev in events: + if not isinstance(ev, dict): + continue + created = parse_github_ts(ev.get("created_at") or "") + if created: + if oldest_in_page is None or created < oldest_in_page: + oldest_in_page = created + if created and created < cutoff: + continue + rinfo = ev.get("repo") + name = None + if isinstance(rinfo, dict): + name = rinfo.get("name") + if isinstance(name, str) and name and name != full: + counts[name] += 1 + next_url = None + link = r.headers.get("Link", "") + for part in link.split(", "): + if 'rel="next"' in part: + s, e = part.find("<") + 1, part.find(">") + if s > 0 and e > s: + next_url = part[s:e] + break + if oldest_in_page and oldest_in_page < cutoff: + break + url = next_url + if not events: + break + + ordered = sorted(counts.items(), key=lambda x: (-x[1], x[0])) + return ";".join(f"{n}:{c}" for n, c in ordered) + + +def main() -> None: + parser = argparse.ArgumentParser(description="Audit repo collaborator permissions.") + parser.add_argument( + "--repo", + default=f"{DEFAULT_OWNER}/{DEFAULT_NAME}", + help=f"owner/name (default: {DEFAULT_OWNER}/{DEFAULT_NAME})", + ) + parser.add_argument( + 
"--output", + "-o", + default=os.path.join(os.path.dirname(__file__), "permission_audit.csv"), + help="Output CSV path", + ) + parser.add_argument( + "--events-days", + type=int, + default=90, + help="Window for other-repo activity via public events", + ) + args = parser.parse_args() + + if "/" not in args.repo: + print("Error: --repo must be owner/name", file=sys.stderr) + sys.exit(1) + owner, name = args.repo.split("/", 1) + + gh_token = os.getenv("GH_TOKEN") + if not gh_token: + print("Error: GH_TOKEN environment variable is not set.", file=sys.stderr) + sys.exit(1) + + global HEADERS + HEADERS = { + "Authorization": f"Bearer {gh_token}", + "Accept": "application/vnd.github+json", + "X-GitHub-Api-Version": "2022-11-28", + } + + collab_url = f"https://api.github.com/repos/{owner}/{name}/collaborators" + print(f"Fetching collaborators for {owner}/{name}...", file=sys.stderr) + collaborators = paginate_list( + collab_url, params={"per_page": 100, "affiliation": "all"} + ) + + rows: list[dict[str, Any]] = [] + elevated = [c for c in collaborators if isinstance(c, dict) and has_write_plus(c)] + print( + f"Found {len(elevated)} collaborators with admin/maintain/write/triage.", + file=sys.stderr, + ) + + for i, col in enumerate(elevated, start=1): + login = col.get("login") + if not isinstance(login, str): + continue + print(f" [{i}/{len(elevated)}] {login}", file=sys.stderr) + + role = collaborator_role(col) + nickname = fetch_display_name(login) + cd = last_commit_date(owner, name, login) + issue_dt, pr_dt = last_issue_pr_dates(owner, name, login) + last_act_ymd = max_date_ymd(cd, issue_dt, pr_dt) + others = other_repos_activity_column(login, owner, name, days=args.events_days) + rows.append( + { + "_role_tier": role_sort_tier(col), + "github_username": login, + "nickname": nickname, + "role": role, + "last_activity_date": last_act_ymd, + "last_commit_date": iso_timestamp_to_ymd(cd), + "last_issue_date": iso_timestamp_to_ymd(issue_dt), + "last_pr_date": 
iso_timestamp_to_ymd(pr_dt), + "other_repos_90d": others, + } + ) + + def sort_key(r: dict[str, Any]) -> tuple[int, float]: + tier = r["_role_tier"] + act = parse_ymd(r.get("last_activity_date") or "") + ts = act.timestamp() if act else 0.0 + return (tier, -ts) + + rows.sort(key=sort_key) + + fieldnames = [ + "github_username", + "nickname", + "role", + "last_activity_date", + "last_commit_date", + "last_issue_date", + "last_pr_date", + "other_repos_90d", + ] + for r in rows: + del r["_role_tier"] + with open(args.output, "w", newline="", encoding="utf-8") as f: + w = csv.DictWriter(f, fieldnames=fieldnames) + w.writeheader() + w.writerows(rows) + + print(f"Wrote {len(rows)} rows to {args.output}", file=sys.stderr) + + +if __name__ == "__main__": + main() diff --git a/.github/labeler.yml b/.github/labeler.yml index 3994d1d12f5f..ad3b7f2a966e 100644 --- a/.github/labeler.yml +++ b/.github/labeler.yml @@ -11,13 +11,20 @@ sgl-kernel: - changed-files: - any-glob-to-any-file: 'sgl-kernel/**/*' +# JIT kernel specific +jit-kernel: + - changed-files: + - any-glob-to-any-file: 'python/sglang/jit_kernel/**/*' + # Documentation documentation: - changed-files: - any-glob-to-any-file: - '**/*.md' + - '**/*.mdx' - 'docs/**/*' - 'README*' + - 'docs_new/**/*' # Dependencies dependencies: @@ -108,3 +115,10 @@ deterministic: piecewise-cuda-graph: - changed-files: - any-glob-to-any-file: 'python/sglang/srt/compilation/**/*' + +# Moore Threads specific +mthreads: + - changed-files: + - any-glob-to-any-file: + - '**/*mthreads*' + - '**/*musa*' diff --git a/.github/linters/lychee-ci.toml b/.github/linters/lychee-ci.toml new file mode 100644 index 000000000000..50919dcd3421 --- /dev/null +++ b/.github/linters/lychee-ci.toml @@ -0,0 +1,42 @@ +no_progress = true +verbose = "warn" +timeout = 20 +max_concurrency = 8 +retry_wait_time = 2 +max_retries = 2 + +# CI should validate external links over the network. 
+offline = false +scheme = ["http", "https"] + +exclude_path = [ + # Exclude generated Sphinx build artifacts. + # - "(\\./)?" allows both "docs/..." and "./docs/..." + # - "[/\\\\]" supports both slash styles in CI environments + "^(\\./)?docs[/\\\\]_build[/\\\\]", +] + +exclude = [ + # Local-only endpoints referenced in docs/examples. + # These are expected to be unreachable in GitHub-hosted CI. + "^https?://localhost(:[0-9]+)?(/|$)", + "^http://127\\.0\\.0\\.1(:[0-9]+)?(/|$)", + # Vendor pages that frequently block/deny CI user-agents (transient 403/anti-bot). + "^https://www\\.intel\\.com/content/www/us/en/ark/products/series/240391/intel-arc-b-series-graphics\\.html$", + "^https://www\\.intel\\.com/content/www/us/en/ark/products/series/242616/intel-arc-pro-b-series-graphics\\.html$", + "^https://www\\.intel\\.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications\\.html$", + + # Non-routable bind address used in examples, never externally reachable. + "^http://0\\.0\\.0\\.0(/|$)", + + # Large doc portals with anti-bot/rate-limit behavior in CI. + # We keep API docs references in content but do not fail CI on access policy. + "^https://platform\\.openai\\.com/docs/", + "^https://gamma\\.app/docs/Optimizing-RL-with-SGLang-y0kqgj877k34779$", + "^https://aflah02\\.substack\\.com/p/multi-node-llm-inference-with-sglang/?$", + + # Known noisy image URLs used in notebook-rendered examples. 
+ "^https://github\\.com/sgl-project/sglang/blob/main/examples/assets/example_image\\.png\\?raw=true$", + "^https://raw\\.githubusercontent\\.com/sgl-project/sglang/main/examples/assets/example_image\\.png/?$", + "^https://raw\\.githubusercontent\\.com/sgl-project/sglang/main/assets/logo\\.png/?$", +] diff --git a/.github/linters/lychee.toml b/.github/linters/lychee.toml new file mode 100644 index 000000000000..cae63984da47 --- /dev/null +++ b/.github/linters/lychee.toml @@ -0,0 +1,18 @@ +# .github/linters/lychee.toml +no_progress = true +verbose = "warn" +timeout = 20 +max_concurrency = 8 + +offline = true + +# Ignore generated docs output; check source docs only. +exclude_path = [ + "^(\\./)?docs[/\\\\]_build[/\\\\]", +] + +exclude = [ + "^https?://localhost(:[0-9]+)?(/|$)", + "^http://127\\.0\\.0\\.1(:[0-9]+)?(/|$)", + "^http://0\\.0\\.0\\.0(/|$)", +] diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index 45db320d57df..a2338baf30d9 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -12,7 +12,7 @@ -## Benchmarking and Profiling +## Speed Tests and Profiling @@ -24,10 +24,10 @@ - [ ] Provide accuracy and speed benchmark results according to [Test the accuracy](https://docs.sglang.io/developer_guide/contribution_guide.html#test-the-accuracy) and [Benchmark the speed](https://docs.sglang.io/developer_guide/contribution_guide.html#benchmark-the-speed). - [ ] Follow the SGLang code style [guidance](https://docs.sglang.io/developer_guide/contribution_guide.html#code-style-guidance). -## Review Process +## Review and Merge Process -1. Ping Merge Oncalls to start the PR flow. See the [PR Merge Process](https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md#pull-request-merge-process). +1. Ping Merge Oncalls to start the process. See the [PR Merge Process](https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md#pull-request-merge-process). 2. 
Get approvals from [CODEOWNERS](https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS) and other reviewers. 3. Trigger CI tests with [comments](https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests) or contact authorized users to do so. - - `/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci` -4. After green CI and required approvals, ask Merge Oncalls to merge. + - Common commands include `/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci` +4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR. diff --git a/.github/update_ci_permission.py b/.github/update_ci_permission.py index bbf695149022..106532ede44d 100644 --- a/.github/update_ci_permission.py +++ b/.github/update_ci_permission.py @@ -22,7 +22,7 @@ Permissions are assigned according to the following rules: -1. Add the top 50 contributors from the last 90 days with full permissions, no cooldown, and the reason "top contributor". +1. Add the top 50 contributors from the last 120 days with full permissions, no cooldown, and the reason "top contributor". 2. Load all users from the existing `CI_PERMISSIONS.json` file and update their entries as follows: - If a user is already covered by rule 1, skip that user. 
- If the old reason of a user is "top contributor" but they are not in the current top contributors list, change their configuration to: @@ -117,7 +117,7 @@ def get_write_access_users(): return writers -def get_top_contributors(days=90, limit=50): +def get_top_contributors(days, limit): """Fetches top contributors based on commit count in the last N days.""" print(f"Fetching commits from the last {days} days...") since_date = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat() @@ -132,7 +132,7 @@ def get_top_contributors(days=90, limit=50): author_counts[commit["author"]["login"]] += 1 top_users = [user for user, _ in author_counts.most_common(limit)] - print(f"Found {len(top_users)} active contributors in the last {days} days.") + print(f"Found {len(top_users)} top contributors in the last {days} days.") return set(top_users) @@ -193,7 +193,7 @@ def main(): print(f"Warning: Could not fetch collaborators (check token scope). Error: {e}") write_access_users = set() - top_contributors = get_top_contributors(days=90, limit=50) + top_contributors = get_top_contributors(days=120, limit=50) old_permissions = load_existing_permissions() new_permissions = {} @@ -203,6 +203,7 @@ def main(): new_permissions[user] = { "can_tag_run_ci_label": True, "can_rerun_failed_ci": True, + "can_rerun_stage": True, "cooldown_interval_minutes": 0, "reason": "top contributor", } @@ -220,6 +221,7 @@ def main(): new_permissions[user] = { "can_tag_run_ci_label": True, "can_rerun_failed_ci": True, + "can_rerun_stage": True, "cooldown_interval_minutes": 60, "reason": "custom override", } diff --git a/.github/workflows/_docker-build-and-publish.yml b/.github/workflows/_docker-build-and-publish.yml new file mode 100644 index 000000000000..ba55dc939f94 --- /dev/null +++ b/.github/workflows/_docker-build-and-publish.yml @@ -0,0 +1,331 @@ +name: Build and Publish Multi-Arch Docker Images + +# Reusable workflow: builds CUDA 12 + CUDA 13 images for amd64 and arm64, +# then creates 
multi-arch manifests with caller-specified tags.
+
+on:
+  workflow_call:
+    inputs:
+      docker_target:
+        description: "Dockerfile target stage (framework or runtime)"
+        required: true
+        type: string
+      sgl_version:
+        description: "Version string passed as SGL_VERSION build arg (empty to skip)"
+        required: false
+        type: string
+        default: ""
+      extra_build_args:
+        description: "Additional --build-arg flags appended to docker buildx build"
+        required: false
+        type: string
+        default: ""
+      checkout_ref:
+        description: "Git ref to checkout (empty for default)"
+        required: false
+        type: string
+        default: ""
+      tag_config:
+        description: 'JSON array of {"cuda":"cu129|cu130","tags":["tag1","tag2"]}. Tags support {version} substitution.'
+        required: true
+        type: string
+      use_environment:
+        description: "GitHub environment name (e.g. prod) or empty for none"
+        required: false
+        type: string
+        default: ""
+      image_repo:
+        description: "Docker Hub repo to push to (e.g. lmsysorg/sglang-staging for testing)"
+        required: false
+        type: string
+        default: "lmsysorg/sglang"
+
+jobs:
+  build-x86:
+    if: github.repository == 'sgl-project/sglang'
+    environment: ${{ inputs.use_environment || null }}
+    runs-on: x64-docker-build-node
+    env:
+      TAG_CONFIG: ${{ inputs.tag_config }}
+      SGL_VERSION: ${{ inputs.sgl_version }}
+      IMAGE_REPO: ${{ inputs.image_repo }}
+    outputs:
+      digest-cu129: ${{ steps.build-cu129.outputs.digest }}
+      digest-cu130: ${{ steps.build-cu130.outputs.digest }}
+    steps:
+      - name: Delete huge unnecessary tools folder
+        run: rm -rf /opt/hostedtoolcache
+
+      - name: Cleanup workspace (remove root-owned files from prior runs)
+        run: sudo rm -rf "$GITHUB_WORKSPACE"/* || true
+
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.checkout_ref || github.ref }}
+
+      - name: Compute Docker build metadata args
+        run: |
+          set -euo pipefail
+          BUILD_COMMIT="$(git rev-parse HEAD)"
+          BUILD_URL="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}" +
for CUDA_VARIANT in cu129 cu130; do
+            python3 scripts/ci/utils/docker_build_metadata_args.py \
+              --cuda "${CUDA_VARIANT}" \
+              --tag-config "${TAG_CONFIG}" \
+              --image-repo "${IMAGE_REPO}" \
+              --sgl-version "${SGL_VERSION}" \
+              --build-commit "${BUILD_COMMIT}" \
+              --build-url "${BUILD_URL}" \
+              > "/tmp/docker-metadata-${CUDA_VARIANT}.args"
+          done
+
+      - name: Free disk space
+        uses: jlumbroso/free-disk-space@main
+        with:
+          tool-cache: true
+          docker-images: true
+          android: true
+          dotnet: true
+          haskell: true
+          large-packages: true
+          swap-storage: true
+
+      - name: Prune Docker to reclaim disk space
+        run: |
+          docker buildx prune --filter "until=72h" -f
+          docker system prune -af --filter "until=72h"
+          docker volume prune -af
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Login to Docker Hub
+        uses: docker/login-action@v2
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Build and push AMD64 image (CUDA 12)
+        id: build-cu129
+        run: |
+          VERSION_ARG=""
+          if [ -n "${SGL_VERSION}" ]; then
+            VERSION_ARG="--build-arg SGL_VERSION=${SGL_VERSION}"
+          fi
+          mapfile -t METADATA_ARGS < /tmp/docker-metadata-cu129.args
+
+          docker buildx build \
+            --target ${{ inputs.docker_target }} \
+            --platform linux/amd64 \
+            --output type=image,name=${{ inputs.image_repo }},push-by-digest=true,name-canonical=true,push=true \
+            -f docker/Dockerfile \
+            --build-arg CUDA_VERSION=12.9.1 \
+            --build-arg BUILD_TYPE=all \
+            --build-arg GRACE_BLACKWELL=0 \
+            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
+            "${METADATA_ARGS[@]}" \
+            ${VERSION_ARG} \
+            ${{ inputs.extra_build_args }} \
+            --metadata-file /tmp/metadata-cu129.json \
+            --no-cache \
+            .
+
+          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129.json'))['containerimage.digest'])")
+          echo "Pushed digest: ${DIGEST}"
+          echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
+
+      - name: Build and push AMD64 image (CUDA 13)
+        id: build-cu130
+        run: |
+          VERSION_ARG=""
+          if [ -n "${SGL_VERSION}" ]; then
+            VERSION_ARG="--build-arg SGL_VERSION=${SGL_VERSION}"
+          fi
+          mapfile -t METADATA_ARGS < /tmp/docker-metadata-cu130.args
+
+          docker buildx build \
+            --target ${{ inputs.docker_target }} \
+            --platform linux/amd64 \
+            --output type=image,name=${{ inputs.image_repo }},push-by-digest=true,name-canonical=true,push=true \
+            -f docker/Dockerfile \
+            --build-arg CUDA_VERSION=13.0.1 \
+            --build-arg BUILD_TYPE=all \
+            --build-arg GRACE_BLACKWELL=0 \
+            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
+            "${METADATA_ARGS[@]}" \
+            ${VERSION_ARG} \
+            ${{ inputs.extra_build_args }} \
+            --metadata-file /tmp/metadata-cu130.json \
+            --no-cache \
+            .
+
+          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130.json'))['containerimage.digest'])")
+          echo "Pushed digest: ${DIGEST}"
+          echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
+
+  build-arm64:
+    if: github.repository == 'sgl-project/sglang'
+    environment: ${{ inputs.use_environment || null }}
+    runs-on: arm-docker-build-node
+    env:
+      TAG_CONFIG: ${{ inputs.tag_config }}
+      SGL_VERSION: ${{ inputs.sgl_version }}
+      IMAGE_REPO: ${{ inputs.image_repo }}
+    outputs:
+      digest-cu129: ${{ steps.build-cu129.outputs.digest }}
+      digest-cu130: ${{ steps.build-cu130.outputs.digest }}
+    steps:
+      - name: Delete huge unnecessary tools folder
+        run: rm -rf /opt/hostedtoolcache
+
+      - name: Cleanup workspace (remove root-owned files from prior runs)
+        run: sudo rm -rf "$GITHUB_WORKSPACE"/* || true
+
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.checkout_ref || github.ref }}
+
+      - name: Compute Docker build metadata args
+        run: |
+          set -euo pipefail
+          BUILD_COMMIT="$(git rev-parse HEAD)" +
BUILD_URL="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}"
+          for CUDA_VARIANT in cu129 cu130; do
+            python3 scripts/ci/utils/docker_build_metadata_args.py \
+              --cuda "${CUDA_VARIANT}" \
+              --tag-config "${TAG_CONFIG}" \
+              --image-repo "${IMAGE_REPO}" \
+              --sgl-version "${SGL_VERSION}" \
+              --build-commit "${BUILD_COMMIT}" \
+              --build-url "${BUILD_URL}" \
+              > "/tmp/docker-metadata-${CUDA_VARIANT}.args"
+          done
+
+      - name: Prune Docker to reclaim disk space
+        run: |
+          docker buildx prune --filter "until=72h" -f
+          docker system prune -af --filter "until=72h"
+          docker volume prune -af
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Login to Docker Hub
+        uses: docker/login-action@v2
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Build and push ARM64 image (CUDA 12)
+        id: build-cu129
+        run: |
+          VERSION_ARG=""
+          if [ -n "${SGL_VERSION}" ]; then
+            VERSION_ARG="--build-arg SGL_VERSION=${SGL_VERSION}"
+          fi
+          mapfile -t METADATA_ARGS < /tmp/docker-metadata-cu129.args
+
+          docker buildx build \
+            --target ${{ inputs.docker_target }} \
+            --platform linux/arm64 \
+            --output type=image,name=${{ inputs.image_repo }},push-by-digest=true,name-canonical=true,push=true \
+            -f docker/Dockerfile \
+            --build-arg CUDA_VERSION=12.9.1 \
+            --build-arg BUILD_TYPE=all \
+            --build-arg GRACE_BLACKWELL=1 \
+            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
+            "${METADATA_ARGS[@]}" \
+            ${VERSION_ARG} \
+            ${{ inputs.extra_build_args }} \
+            --metadata-file /tmp/metadata-cu129.json \
+            --no-cache \
+            .
+
+          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129.json'))['containerimage.digest'])")
+          echo "Pushed digest: ${DIGEST}"
+          echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
+
+      - name: Build and push ARM64 image (CUDA 13)
+        id: build-cu130
+        run: |
+          VERSION_ARG=""
+          if [ -n "${SGL_VERSION}" ]; then
+            VERSION_ARG="--build-arg SGL_VERSION=${SGL_VERSION}"
+          fi
+          mapfile -t METADATA_ARGS < /tmp/docker-metadata-cu130.args
+
+          docker buildx build \
+            --target ${{ inputs.docker_target }} \
+            --platform linux/arm64 \
+            --output type=image,name=${{ inputs.image_repo }},push-by-digest=true,name-canonical=true,push=true \
+            -f docker/Dockerfile \
+            --build-arg CUDA_VERSION=13.0.1 \
+            --build-arg BUILD_TYPE=all \
+            --build-arg GRACE_BLACKWELL=1 \
+            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
+            "${METADATA_ARGS[@]}" \
+            ${VERSION_ARG} \
+            ${{ inputs.extra_build_args }} \
+            --metadata-file /tmp/metadata-cu130.json \
+            --no-cache \
+            .
+
+          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130.json'))['containerimage.digest'])")
+          echo "Pushed digest: ${DIGEST}"
+          echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
+
+  create-manifests:
+    runs-on: ubuntu-latest
+    needs: [build-x86, build-arm64]
+    if: github.repository == 'sgl-project/sglang'
+    environment: ${{ inputs.use_environment || null }}
+    steps:
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Login to Docker Hub
+        uses: docker/login-action@v2
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Create multi-arch manifests
+        env:
+          TAG_CONFIG: ${{ inputs.tag_config }}
+          SGL_VERSION: ${{ inputs.sgl_version }}
+          IMAGE_REPO: ${{ inputs.image_repo }}
+          X86_CU129: ${{ needs.build-x86.outputs.digest-cu129 }}
+          X86_CU130: ${{ needs.build-x86.outputs.digest-cu130 }}
+          ARM64_CU129: ${{ needs.build-arm64.outputs.digest-cu129 }}
+          ARM64_CU130: ${{ needs.build-arm64.outputs.digest-cu130 }}
+          SHORT_SHA: ${{
github.sha }}
+        run: |
+          echo "${TAG_CONFIG}" | jq -c '.[]' | while read -r entry; do
+            CUDA=$(echo "${entry}" | jq -r '.cuda')
+
+            if [ "${CUDA}" = "cu129" ]; then
+              X86_DIGEST="${X86_CU129}"
+              ARM64_DIGEST="${ARM64_CU129}"
+            else
+              X86_DIGEST="${X86_CU130}"
+              ARM64_DIGEST="${ARM64_CU130}"
+            fi
+
+            TAG_ARGS=""
+            for tag in $(echo "${entry}" | jq -r '.tags[]'); do
+              # Substitute template variables
+              tag=$(echo "${tag}" | sed "s/{version}/${SGL_VERSION}/g")
+              tag=$(echo "${tag}" | sed "s/{date}/$(date +%Y%m%d)/g")
+              tag=$(echo "${tag}" | sed "s/{short_sha}/${SHORT_SHA:0:8}/g")
+              TAG_ARGS="${TAG_ARGS} -t ${IMAGE_REPO}:${tag}"
+            done
+
+            docker buildx imagetools create \
+              ${TAG_ARGS} \
+              ${IMAGE_REPO}@${X86_DIGEST} \
+              ${IMAGE_REPO}@${ARM64_DIGEST}
+
+            echo "Published:${TAG_ARGS}"
+          done
diff --git a/.github/workflows/_docker-cleanup-nightly.yml b/.github/workflows/_docker-cleanup-nightly.yml
new file mode 100644
index 000000000000..fda12e4e07ec
--- /dev/null
+++ b/.github/workflows/_docker-cleanup-nightly.yml
@@ -0,0 +1,78 @@
+name: Cleanup Old Nightly Docker Tags
+
+# Reusable workflow: deletes old nightly Docker Hub tags, keeping the most recent ones.
+# Can also be triggered manually to clean up tags on demand.
+
+on:
+  workflow_call:
+    inputs:
+      tag_prefixes:
+        description: 'JSON array of tag prefixes to clean up (e.g. ["nightly-dev", "nightly-dev-cu13"])'
+        required: true
+        type: string
+      keep_count:
+        description: "Number of most recent tags to keep per prefix"
+        required: false
+        type: number
+        default: 14
+      image_repo:
+        description: "Docker Hub repo to clean up (e.g. lmsysorg/sglang-staging for testing)"
+        required: false
+        type: string
+        default: "lmsysorg/sglang"
+  workflow_dispatch:
+    inputs:
+      tag_prefixes:
+        description: 'JSON array of tag prefixes to clean up (e.g.
["nightly-dev", "nightly-dev-cu13"])'
+        required: true
+        type: string
+      keep_count:
+        description: "Number of most recent tags to keep per prefix"
+        required: false
+        type: number
+        default: 14
+      image_repo:
+        description: "Docker Hub repo to clean up (e.g. lmsysorg/sglang-staging for testing)"
+        required: false
+        type: string
+        default: "lmsysorg/sglang"
+
+jobs:
+  cleanup:
+    if: github.repository == 'sgl-project/sglang'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Cleanup old nightly tags
+        env:
+          TAG_PREFIXES: ${{ inputs.tag_prefixes }}
+          KEEP_COUNT: ${{ inputs.keep_count }}
+          IMAGE_REPO: ${{ inputs.image_repo }}
+        run: |
+          TOKEN=$(curl -s -H "Content-Type: application/json" \
+            -X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' \
+            https://hub.docker.com/v2/users/login/ | jq -r .token)
+
+          TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" \
+            "https://hub.docker.com/v2/repositories/${IMAGE_REPO}/tags/?page_size=100")
+
+          echo "${TAG_PREFIXES}" | jq -r '.[]' | while read -r PREFIX; do
+            echo "--- Checking prefix: ${PREFIX} ---"
+
+            TAGS=$(echo "$TAGS_RESPONSE" | jq -r \
+              --arg prefix "${PREFIX}" \
+              '.results[] | select(.name | test("^\($prefix)-[0-9]")) | "\(.last_updated)|\(.name)"' \
+              | sort -r | cut -d'|' -f2)
+
+            TAG_COUNT=$(echo "$TAGS" | grep -c .
|| true)
+            if [ "$TAG_COUNT" -gt "$KEEP_COUNT" ]; then
+              echo "Found $TAG_COUNT tags, keeping only the $KEEP_COUNT most recent"
+              TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +$((KEEP_COUNT + 1)))
+              for tag in $TAGS_TO_DELETE; do
+                echo "Deleting tag: $tag"
+                curl -X DELETE -H "Authorization: JWT $TOKEN" \
+                  "https://hub.docker.com/v2/repositories/${IMAGE_REPO}/tags/$tag/"
+              done
+            else
+              echo "Only $TAG_COUNT tags found, no cleanup needed"
+            fi
+          done
diff --git a/.github/workflows/amd-aiter-scout.yml b/.github/workflows/amd-aiter-scout.yml
new file mode 100644
index 000000000000..9e7b413bc57d
--- /dev/null
+++ b/.github/workflows/amd-aiter-scout.yml
@@ -0,0 +1,161 @@
+name: AMD AITER Scout
+
+on:
+  schedule:
+    - cron: '0 20 * * 1'  # Monday 20:00 UTC
+    - cron: '0 20 * * 4'  # Thursday 20:00 UTC
+  workflow_dispatch:
+    inputs:
+      aiter_ref:
+        description: 'AITER git ref (branch, tag, or SHA). Default: main (latest commit)'
+        required: false
+        type: string
+        default: 'main'
+      job_filter:
+        description: 'Comma-separated workflows to run: nightly-amd, nightly-amd-rocm720, pr-test-amd, pr-test-amd-rocm720.
Default: all'
+        required: false
+        type: string
+        default: 'all'
+      continue_on_error:
+        description: 'Continue running other workflows even if one fails'
+        required: false
+        type: boolean
+        default: true
+
+concurrency:
+  group: amd-aiter-scout-${{ github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  resolve-aiter:
+    runs-on: ubuntu-latest
+    outputs:
+      aiter_sha: ${{ steps.resolve.outputs.sha }}
+      run_nightly_amd: ${{ steps.parse.outputs.run_nightly_amd }}
+      run_nightly_amd_rocm720: ${{ steps.parse.outputs.run_nightly_amd_rocm720 }}
+      run_pr_test_amd: ${{ steps.parse.outputs.run_pr_test_amd }}
+      run_pr_test_amd_rocm720: ${{ steps.parse.outputs.run_pr_test_amd_rocm720 }}
+    steps:
+      - name: Resolve AITER commit
+        id: resolve
+        run: |
+          REF="${{ inputs.aiter_ref || 'main' }}"
+          echo "Resolving AITER ref: ${REF}"
+
+          SHA=$(git ls-remote https://github.com/ROCm/aiter.git "refs/heads/${REF}" | head -1 | cut -f1)
+          if [ -z "$SHA" ]; then
+            SHA=$(git ls-remote https://github.com/ROCm/aiter.git "refs/tags/${REF}" | head -1 | cut -f1)
+          fi
+          if [ -z "$SHA" ]; then
+            SHA=$(git ls-remote https://github.com/ROCm/aiter.git "${REF}" | head -1 | cut -f1)
+          fi
+          if [ -z "$SHA" ]; then
+            SHA="${REF}"
+          fi
+
+          echo "sha=${SHA}" >> $GITHUB_OUTPUT
+          echo "### AITER Ref Resolution" >> $GITHUB_STEP_SUMMARY
+          echo "- **Requested ref:** \`${REF}\`" >> $GITHUB_STEP_SUMMARY
+          echo "- **Resolved SHA:** \`${SHA}\`" >> $GITHUB_STEP_SUMMARY
+          echo "- **AITER commit:** https://github.com/ROCm/aiter/commit/${SHA}" >> $GITHUB_STEP_SUMMARY
+
+      - name: Parse job filter
+        id: parse
+        run: |
+          FILTER="${{ inputs.job_filter || 'all' }}"
+          echo "Job filter: ${FILTER}"
+
+          if [[ "$FILTER" == "all" ]]; then
+            echo "run_nightly_amd=true" >> $GITHUB_OUTPUT
+            echo "run_nightly_amd_rocm720=true" >> $GITHUB_OUTPUT
+            echo "run_pr_test_amd=true" >> $GITHUB_OUTPUT
+            echo "run_pr_test_amd_rocm720=true" >> $GITHUB_OUTPUT
+          else
+            # Wrap with commas for exact substring matching (avoids "nightly-amd" matching
"nightly-amd-rocm720")
+            PADDED=",${FILTER// /},"
+            echo "run_nightly_amd=$(echo "$PADDED" | grep -q ',nightly-amd,' && echo true || echo false)" >> $GITHUB_OUTPUT
+            echo "run_nightly_amd_rocm720=$(echo "$PADDED" | grep -q ',nightly-amd-rocm720,' && echo true || echo false)" >> $GITHUB_OUTPUT
+            echo "run_pr_test_amd=$(echo "$PADDED" | grep -q ',pr-test-amd,' && echo true || echo false)" >> $GITHUB_OUTPUT
+            echo "run_pr_test_amd_rocm720=$(echo "$PADDED" | grep -q ',pr-test-amd-rocm720,' && echo true || echo false)" >> $GITHUB_OUTPUT
+          fi
+
+          echo "### Job Filter" >> $GITHUB_STEP_SUMMARY
+          echo "- **Filter:** \`${FILTER}\`" >> $GITHUB_STEP_SUMMARY
+
+  call-nightly-amd:
+    if: needs.resolve-aiter.outputs.run_nightly_amd == 'true'
+    needs: resolve-aiter
+    uses: ./.github/workflows/nightly-test-amd.yml
+    secrets: inherit
+    with:
+      ref: ${{ github.sha }}
+      aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
+      job_filter: 'all'
+      continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
+
+  call-nightly-amd-rocm720:
+    if: needs.resolve-aiter.outputs.run_nightly_amd_rocm720 == 'true'
+    needs: resolve-aiter
+    uses: ./.github/workflows/nightly-test-amd-rocm720.yml
+    secrets: inherit
+    with:
+      ref: ${{ github.sha }}
+      aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
+      job_filter: 'all'
+      continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
+
+  call-pr-test-amd:
+    if: needs.resolve-aiter.outputs.run_pr_test_amd == 'true'
+    needs: resolve-aiter
+    uses: ./.github/workflows/pr-test-amd.yml
+    secrets: inherit
+    with:
+      run_all_tests: true
+      aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
+      continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
+
+  call-pr-test-amd-rocm720:
+    if: needs.resolve-aiter.outputs.run_pr_test_amd_rocm720 == 'true'
+    needs: resolve-aiter
+    uses: ./.github/workflows/pr-test-amd-rocm720.yml
+    secrets: inherit
+    with: +
run_all_tests: true
+      aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
+      continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
+
+  check-all-jobs:
+    if: always()
+    needs:
+      - resolve-aiter
+      - call-nightly-amd
+      - call-nightly-amd-rocm720
+      - call-pr-test-amd
+      - call-pr-test-amd-rocm720
+    runs-on: ubuntu-latest
+    steps:
+      - name: Summary
+        run: |
+          echo "## AMD AITER Scout Results" >> $GITHUB_STEP_SUMMARY
+          echo "" >> $GITHUB_STEP_SUMMARY
+          echo "- **AITER SHA:** \`${{ needs.resolve-aiter.outputs.aiter_sha }}\`" >> $GITHUB_STEP_SUMMARY
+          echo "- **AITER commit:** https://github.com/ROCm/aiter/commit/${{ needs.resolve-aiter.outputs.aiter_sha }}" >> $GITHUB_STEP_SUMMARY
+          echo "" >> $GITHUB_STEP_SUMMARY
+          echo "| Workflow | Result |" >> $GITHUB_STEP_SUMMARY
+          echo "|----------|--------|" >> $GITHUB_STEP_SUMMARY
+          echo "| Nightly AMD (AITER Latest) | \`${{ needs.call-nightly-amd.result }}\` |" >> $GITHUB_STEP_SUMMARY
+          echo "| Nightly AMD ROCm 7.2 | \`${{ needs.call-nightly-amd-rocm720.result }}\` |" >> $GITHUB_STEP_SUMMARY
+          echo "| PR Test AMD (AITER Latest) | \`${{ needs.call-pr-test-amd.result }}\` |" >> $GITHUB_STEP_SUMMARY
+          echo "| PR Test AMD ROCm 7.2 | \`${{ needs.call-pr-test-amd-rocm720.result }}\` |" >> $GITHUB_STEP_SUMMARY
+
+      - name: Check if any job failed
+        run: |
+          if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
+            echo "One or more workflows failed"
+            exit 1
+          fi
+          if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
+            echo "One or more workflows were cancelled"
+            exit 1
+          fi
+          echo "All workflows passed"
diff --git a/.github/workflows/amd-ci-job-monitor.yml b/.github/workflows/amd-ci-job-monitor.yml
new file mode 100644
index 000000000000..cbb8798b110a
--- /dev/null
+++ b/.github/workflows/amd-ci-job-monitor.yml
@@ -0,0 +1,338 @@
+name: AMD CI Job Monitor
+
+on:
+  schedule:
+    - cron: '0 0 * * *'  # Daily at midnight UTC
+  pull_request:
+    paths:
+      -
'.github/workflows/amd-ci-job-monitor.yml'
+      - 'scripts/ci/utils/query_job_status.py'
+  workflow_dispatch:
+    inputs:
+      hours:
+        description: 'Time window in hours'
+        required: false
+        default: '24'
+        type: string
+      job_filter:
+        description: 'Job name filter (leave empty for all AMD jobs)'
+        required: false
+        type: string
+
+jobs:
+  fetch-actions-data:
+    name: Fetch Actions Snapshot
+    runs-on: ubuntu-latest
+    env:
+      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: pip install tabulate
+
+      - name: Select workflows for snapshot
+        id: select-workflows
+        run: |
+          if [[ -n "${{ inputs.job_filter }}" ]]; then
+            echo "workflows=pr-test-amd.yml" >> "$GITHUB_OUTPUT"
+          else
+            echo "workflows=pr-test-amd.yml,nightly-test-amd.yml,pr-test-amd-rocm720.yml,nightly-test-amd-rocm720.yml" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Fetch Actions data snapshot
+        timeout-minutes: 30
+        run: |
+          python scripts/ci/utils/query_job_status.py \
+            --repo ${{ github.repository }} \
+            --workflow "${{ steps.select-workflows.outputs.workflows }}" \
+            --hours ${{ inputs.hours || '24' }} \
+            --dump-data-file actions-job-snapshot.json
+
+      - name: Upload Actions data snapshot
+        uses: actions/upload-artifact@v4
+        with:
+          name: actions-job-snapshot
+          path: actions-job-snapshot.json
+          if-no-files-found: error
+
+  # Single job filter mode
+  custom-report:
+    name: Custom Job Report
+    if: ${{ inputs.job_filter }}
+    needs: fetch-actions-data
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: pip install tabulate
+
+      - name: Download Actions data snapshot
+        uses: actions/download-artifact@v4
+        with:
+          name: actions-job-snapshot
+          path: ci-data
+
+      - name: Generate Custom
Job Report
+        timeout-minutes: 30
+        run: |
+          python scripts/ci/utils/query_job_status.py \
+            --repo ${{ github.repository }} \
+            --job "${{ inputs.job_filter }}" \
+            --workflow "pr-test-amd.yml" \
+            --hours ${{ inputs.hours || '24' }} \
+            --input-data-file ci-data/actions-job-snapshot.json \
+            --summary
+
+  # Parse workflow files to get job names dynamically
+  parse-workflows:
+    name: Parse Workflow Jobs
+    if: ${{ !inputs.job_filter }}
+    runs-on: ubuntu-latest
+    outputs:
+      pr_jobs: ${{ steps.parse.outputs.pr_jobs }}
+      nightly_jobs: ${{ steps.parse.outputs.nightly_jobs }}
+      pr_rocm720_jobs: ${{ steps.parse.outputs.pr_rocm720_jobs }}
+      nightly_rocm720_jobs: ${{ steps.parse.outputs.nightly_rocm720_jobs }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Parse workflow files
+        id: parse
+        run: |
+          # Parse pr-test-amd.yml and extract job names (exclude utility jobs)
+          # Excluded: call-gate, check-changes, pr-test-amd-finish, cancel, check-all-jobs
+          pr_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/pr-test-amd.yml | \
+            grep -v -E '^(call-gate|check-changes|pr-test-amd-finish|cancel|check-all-jobs)$' | \
+            jq -R -s -c 'split("\n") | map(select(length > 0))')
+          echo "pr_jobs=$pr_jobs" >> $GITHUB_OUTPUT
+          echo "PR jobs: $pr_jobs"
+
+          # Parse nightly-test-amd.yml and extract job names (exclude utility jobs)
+          # Excluded: check-all-jobs
+          nightly_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/nightly-test-amd.yml | \
+            grep -v -E '^(check-all-jobs)$' | \
+            jq -R -s -c 'split("\n") | map(select(length > 0))')
+          echo "nightly_jobs=$nightly_jobs" >> $GITHUB_OUTPUT
+          echo "Nightly jobs: $nightly_jobs"
+
+          # Parse pr-test-amd-rocm720.yml (exclude utility jobs)
+          # Excluded: call-gate, check-changes, pr-test-amd-finish, cancel, check-all-jobs
+          pr_rocm720_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/pr-test-amd-rocm720.yml | \
+            grep -v -E '^(call-gate|check-changes|pr-test-amd-finish|cancel|check-all-jobs)$' | \
+            jq -R -s -c 'split("\n")
| map(select(length > 0))')
+          echo "pr_rocm720_jobs=$pr_rocm720_jobs" >> $GITHUB_OUTPUT
+          echo "PR ROCm 7.2 jobs: $pr_rocm720_jobs"
+
+          # Parse nightly-test-amd-rocm720.yml (exclude utility jobs)
+          # Excluded: check-all-jobs
+          nightly_rocm720_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/nightly-test-amd-rocm720.yml | \
+            grep -v -E '^(check-all-jobs)$' | \
+            jq -R -s -c 'split("\n") | map(select(length > 0))')
+          echo "nightly_rocm720_jobs=$nightly_rocm720_jobs" >> $GITHUB_OUTPUT
+          echo "Nightly ROCm 7.2 jobs: $nightly_rocm720_jobs"
+
+  # PR CI reports using dynamic matrix
+  pr-ci-reports:
+    name: PR - ${{ matrix.job_name }}
+    needs: [parse-workflows, fetch-actions-data]
+    if: ${{ !inputs.job_filter }}
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        job_name: ${{ fromJson(needs.parse-workflows.outputs.pr_jobs) }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: pip install tabulate
+
+      - name: Download Actions data snapshot
+        uses: actions/download-artifact@v4
+        with:
+          name: actions-job-snapshot
+          path: ci-data
+
+      - name: Generate Report
+        timeout-minutes: 15
+        run: |
+          python scripts/ci/utils/query_job_status.py \
+            --repo ${{ github.repository }} \
+            --job "${{ matrix.job_name }}" \
+            --workflow "pr-test-amd.yml" \
+            --hours ${{ inputs.hours || '24' }} \
+            --input-data-file ci-data/actions-job-snapshot.json \
+            --summary
+
+  # Nightly AMD test reports using dynamic matrix
+  nightly-reports:
+    name: Nightly - ${{ matrix.job_name }}
+    needs: [parse-workflows, fetch-actions-data]
+    if: ${{ !inputs.job_filter }}
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        job_name: ${{ fromJson(needs.parse-workflows.outputs.nightly_jobs) }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: pip install tabulate
+
+      - name: Download Actions data snapshot
+        uses: actions/download-artifact@v4
+        with:
+          name: actions-job-snapshot
+          path: ci-data
+
+      - name: Generate Nightly Report
+        timeout-minutes: 15
+        run: |
+          python scripts/ci/utils/query_job_status.py \
+            --repo ${{ github.repository }} \
+            --job "${{ matrix.job_name }}" \
+            --workflow "nightly-test-amd.yml" \
+            --hours ${{ inputs.hours || '24' }} \
+            --input-data-file ci-data/actions-job-snapshot.json \
+            --summary
+
+  # PR ROCm 7.2 CI reports using dynamic matrix
+  pr-rocm720-ci-reports:
+    name: PR ROCm720 - ${{ matrix.job_name }}
+    needs: [parse-workflows, fetch-actions-data]
+    if: ${{ !inputs.job_filter }}
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        job_name: ${{ fromJson(needs.parse-workflows.outputs.pr_rocm720_jobs) }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: pip install tabulate
+
+      - name: Download Actions data snapshot
+        uses: actions/download-artifact@v4
+        with:
+          name: actions-job-snapshot
+          path: ci-data
+
+      - name: Generate PR ROCm 7.2 Report
+        timeout-minutes: 15
+        run: |
+          python scripts/ci/utils/query_job_status.py \
+            --repo ${{ github.repository }} \
+            --job "${{ matrix.job_name }}" \
+            --workflow "pr-test-amd-rocm720.yml" \
+            --hours ${{ inputs.hours || '24' }} \
+            --input-data-file ci-data/actions-job-snapshot.json \
+            --summary
+
+  # Nightly ROCm 7.2 reports using dynamic matrix
+  nightly-rocm720-reports:
+    name: Nightly ROCm720 - ${{ matrix.job_name }}
+    needs: [parse-workflows, fetch-actions-data]
+    if: ${{ !inputs.job_filter }}
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        job_name: ${{ fromJson(needs.parse-workflows.outputs.nightly_rocm720_jobs) }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses:
actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: pip install tabulate
+
+      - name: Download Actions data snapshot
+        uses: actions/download-artifact@v4
+        with:
+          name: actions-job-snapshot
+          path: ci-data
+
+      - name: Generate Nightly ROCm 7.2 Report
+        timeout-minutes: 15
+        run: |
+          python scripts/ci/utils/query_job_status.py \
+            --repo ${{ github.repository }} \
+            --job "${{ matrix.job_name }}" \
+            --workflow "nightly-test-amd-rocm720.yml" \
+            --hours ${{ inputs.hours || '24' }} \
+            --input-data-file ci-data/actions-job-snapshot.json \
+            --summary
+
+  # Runner fleet report - cross-workflow runner analytics in a single pass
+  runner-fleet-report:
+    name: Runner Fleet Report
+    if: ${{ !inputs.job_filter }}
+    needs: fetch-actions-data
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: pip install tabulate
+
+      - name: Download Actions data snapshot
+        uses: actions/download-artifact@v4
+        with:
+          name: actions-job-snapshot
+          path: ci-data
+
+      - name: Generate Runner Fleet Report
+        timeout-minutes: 30
+        run: |
+          python scripts/ci/utils/query_job_status.py \
+            --repo ${{ github.repository }} \
+            --runner-report \
+            --workflow "pr-test-amd.yml,nightly-test-amd.yml,pr-test-amd-rocm720.yml,nightly-test-amd-rocm720.yml" \
+            --hours ${{ inputs.hours || '24' }} \
+            --input-data-file ci-data/actions-job-snapshot.json \
+            --summary
diff --git a/.github/workflows/auto-format.yml b/.github/workflows/auto-format.yml
deleted file mode 100644
index 15b208db82ab..000000000000
--- a/.github/workflows/auto-format.yml
+++ /dev/null
@@ -1,71 +0,0 @@
-name: Auto Format Code
-
-on:
-  pull_request:
-    types: [labeled]
-
-permissions:
-  contents: write
-  pull-requests: write
-
-jobs:
-  auto-format:
-    if: github.event.label.name == 'format'
-    runs-on: ubuntu-latest
-    steps:
-      - name:
Checkout PR branch
-        uses: actions/checkout@v4
-        with:
-          ref: ${{ github.event.pull_request.head.ref }}
-          repository: ${{ github.event.pull_request.head.repo.full_name }}
-          token: ${{ secrets.GITHUB_TOKEN }}
-          fetch-depth: 0
-
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: "3.12"
-
-      - name: Install pre-commit hook
-        run: |
-          python -m pip install pre-commit
-          pre-commit install
-
-      - name: Run pre-commit to format code
-        run: SKIP=no-commit-to-branch pre-commit run --all-files
-        continue-on-error: true
-
-      - name: Check for changes
-        id: check_changes
-        run: |
-          if [[ -n $(git status -s) ]]; then
-            echo "has_changes=true" >> $GITHUB_OUTPUT
-          else
-            echo "has_changes=false" >> $GITHUB_OUTPUT
-          fi
-
-      - name: Commit and push changes
-        if: steps.check_changes.outputs.has_changes == 'true'
-        run: |
-          git config --local user.email "github-actions[bot]@users.noreply.github.com"
-          git config --local user.name "github-actions[bot]"
-          git add .
-          git commit -m "🤖 Auto-format code with isort, black, ruff, and clang-format"
-          git push
-
-      - name: Remove format label
-        if: always()
-        uses: actions/github-script@v7
-        with:
-          github-token: ${{ secrets.GITHUB_TOKEN }}
-          script: |
-            try {
-              await github.rest.issues.removeLabel({
-                owner: context.repo.owner,
-                repo: context.repo.repo,
-                issue_number: context.issue.number,
-                name: 'format'
-              });
-            } catch (error) {
-              console.log('Label may have already been removed');
-            }
diff --git a/.github/workflows/auto-tune.yml b/.github/workflows/auto-tune.yml
index 0afc79bb7c8c..16ad5d23b177 100644
--- a/.github/workflows/auto-tune.yml
+++ b/.github/workflows/auto-tune.yml
@@ -4,7 +4,7 @@ on:
   workflow_dispatch:
 
 jobs:
-  lint:
+  auto-tune-lint:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
diff --git a/.github/workflows/bot-bump-flashinfer-version.yml b/.github/workflows/bot-bump-flashinfer-version.yml
new file mode 100644
index 000000000000..cc1cba930ce2
--- /dev/null
+++
b/.github/workflows/bot-bump-flashinfer-version.yml
@@ -0,0 +1,50 @@
+name: Bot Bump Flashinfer Version
+
+on:
+  workflow_dispatch:
+    inputs:
+      new_version:
+        description: 'New flashinfer version (e.g., 0.6.4)'
+        required: true
+        type: string
+
+permissions:
+  contents: write
+  pull-requests: write
+
+jobs:
+  bump-flashinfer-version:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          token: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install Python dependencies
+        run: |
+          pip install tomli
+
+      - name: Configure Git and branch
+        run: |
+          git config user.name "sglang-bot"
+          git config user.email "sglang-bot@users.noreply.github.com"
+          RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
+          BRANCH_NAME="bot/bump-flashinfer-version-${{ github.event.inputs.new_version }}-${RANDOM_SUFFIX}"
+          git checkout -b "$BRANCH_NAME"
+          echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
+
+      - name: Run flashinfer version bump script
+        run: |
+          python scripts/release/bump_flashinfer_version.py "${{ github.event.inputs.new_version }}"
+
+      - name: Commit and create PR
+        env:
+          GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
+        run: |
+          bash scripts/release/commit_and_pr.sh "flashinfer" "${{ github.event.inputs.new_version }}" "$BRANCH_NAME"
diff --git a/.github/workflows/bot-bump-kernel-version-to-sglang.yml b/.github/workflows/bot-bump-kernel-version-to-sglang.yml
index 817889846a8d..b26192aba1ac 100644
--- a/.github/workflows/bot-bump-kernel-version-to-sglang.yml
+++ b/.github/workflows/bot-bump-kernel-version-to-sglang.yml
@@ -58,43 +58,3 @@ jobs:
           GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
         run: |
           bash scripts/release/commit_and_pr_kernel_to_sglang.sh "$KERNEL_VERSION" "$BRANCH_NAME"
-
-  run-nightly-tests-nvidia:
-    needs: bump-kernel-version-to-sglang
-    if: needs.bump-kernel-version-to-sglang.outputs.needs_sync == 'true'
-    uses:
./.github/workflows/nightly-test-nvidia.yml - with: - ref: ${{ needs.bump-kernel-version-to-sglang.outputs.branch_name }} - secrets: inherit - - run-nightly-tests-amd: - needs: bump-kernel-version-to-sglang - if: needs.bump-kernel-version-to-sglang.outputs.needs_sync == 'true' - uses: ./.github/workflows/nightly-test-amd.yml - with: - ref: ${{ needs.bump-kernel-version-to-sglang.outputs.branch_name }} - secrets: inherit - - run-nightly-tests-npu: - needs: bump-kernel-version-to-sglang - if: needs.bump-kernel-version-to-sglang.outputs.needs_sync == 'true' - uses: ./.github/workflows/nightly-test-npu.yml - with: - ref: ${{ needs.bump-kernel-version-to-sglang.outputs.branch_name }} - secrets: inherit - - run-pr-tests-xeon: - needs: bump-kernel-version-to-sglang - if: needs.bump-kernel-version-to-sglang.outputs.needs_sync == 'true' - uses: ./.github/workflows/pr-test-xeon.yml - with: - ref: ${{ needs.bump-kernel-version-to-sglang.outputs.branch_name }} - secrets: inherit - - run-pr-tests-xpu: - needs: bump-kernel-version-to-sglang - if: needs.bump-kernel-version-to-sglang.outputs.needs_sync == 'true' - uses: ./.github/workflows/pr-test-xpu.yml - with: - ref: ${{ needs.bump-kernel-version-to-sglang.outputs.branch_name }} - secrets: inherit diff --git a/.github/workflows/cancel-unfinished-pr-tests.yml b/.github/workflows/cancel-unfinished-pr-tests.yml index 2c0e9f63f0ba..486beec48bd4 100644 --- a/.github/workflows/cancel-unfinished-pr-tests.yml +++ b/.github/workflows/cancel-unfinished-pr-tests.yml @@ -8,6 +8,11 @@ on: required: true type: string default: 'pr-test.yml' + include_high_priority: + description: 'Also cancel runs from high-priority PRs' + required: false + type: boolean + default: false permissions: actions: write # Needed to cancel runs @@ -26,6 +31,7 @@ jobs: GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} REPO: ${{ github.repository }} WORKFLOWS: ${{ github.event.inputs.workflows || 'pr-test.yml' }} + INCLUDE_HIGH_PRIORITY: ${{ 
github.event.inputs.include_high_priority || 'false' }} shell: bash run: | set -euo pipefail @@ -123,11 +129,19 @@ jobs: labels=$(gh pr view "$pr_number" --repo "$REPO" --json labels \ | jq -r '.labels[].name' 2>/dev/null || true) - if echo "$labels" | grep -Fxq "high priority"; then - echo " 🛑 Skipping (high priority label)" + if echo "$labels" | grep -Fxq "bypass-maintenance"; then + echo " 🛑 Skipping (bypass-maintenance label, never cancelled)" continue fi + if echo "$labels" | grep -Fxq "high priority"; then + if [ "$INCLUDE_HIGH_PRIORITY" != "true" ]; then + echo " 🛑 Skipping (high priority label)" + continue + fi + echo " ⚠️ High priority PR, but include_high_priority is enabled" + fi + echo " 🚫 Cancelling..." gh run cancel "$run_id" --repo "$REPO" || echo " ⚠️ Cancellation failed" done diff --git a/.github/workflows/ci-auto-bisect.yml b/.github/workflows/ci-auto-bisect.yml new file mode 100644 index 000000000000..3c77b5f1e61c --- /dev/null +++ b/.github/workflows/ci-auto-bisect.yml @@ -0,0 +1,66 @@ +name: CI Auto Bisect + +on: + workflow_run: + workflows: ["PR Test"] + types: [completed] + branches: [main] + workflow_dispatch: {} + +concurrency: + group: ci-auto-bisect + cancel-in-progress: true + +permissions: + contents: read + actions: read + +jobs: + auto-bisect: + # Only run for scheduled pr-test completions (not PR-triggered), or manual dispatch + if: > + github.repository == 'sgl-project/sglang' && ( + github.event_name == 'workflow_dispatch' || + (github.event.workflow_run.event == 'schedule' && + github.event.workflow_run.conclusion != 'cancelled') + ) + runs-on: ubuntu-latest + timeout-minutes: 30 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + fetch-depth: 0 # Full history needed for git log between SHAs + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.14' + + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install requests anthropic + + - name: Run Auto 
Bisect + id: bisect + env: + GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + PYTHONUNBUFFERED: 1 + PYTHONIOENCODING: utf-8 + run: | + cd scripts/ci_monitor + python ci_auto_bisect.py \ + --github-token $GITHUB_TOKEN \ + --anthropic-api-key $ANTHROPIC_API_KEY \ + --output bisect_results.json \ + --max-failures 10 + + - name: Upload Bisect Results + if: always() && hashFiles('scripts/ci_monitor/bisect_results.json') != '' + uses: actions/upload-artifact@v4 + with: + name: ci-auto-bisect-${{ github.run_number }} + path: scripts/ci_monitor/bisect_results.json + retention-days: 14 diff --git a/.github/workflows/ci-coverage-overview.yml b/.github/workflows/ci-coverage-overview.yml index db9269d67e86..9a9f84fda3d8 100644 --- a/.github/workflows/ci-coverage-overview.yml +++ b/.github/workflows/ci-coverage-overview.yml @@ -68,6 +68,73 @@ jobs: run: | python scripts/ci/utils/ci_coverage_report.py --section by-suite + unit-test-coverage: + name: Unit Test Code Coverage + if: github.event_name != 'pull_request' + runs-on: 1-gpu-h100 + timeout-minutes: 30 + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Install dependencies + timeout-minutes: 10 + run: | + pip install -e "python/[test]" + + - name: Run unit tests with coverage + timeout-minutes: 10 + run: | + pytest test/registered/unit/ \ + --cov --cov-config=.coveragerc \ + --cov-report=term-missing:skip-covered \ + --continue-on-collection-errors \ + -v | tee coverage_output.txt + + - name: Write coverage to summary + if: always() + run: | + echo "## Unit Test Code Coverage" >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + echo "**Commit:** \`${GITHUB_SHA::8}\` | **Branch:** \`${GITHUB_REF_NAME}\`" >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + + # Test result line (e.g., "== 42 passed, 1 failed in 23.5s ==") + echo '```' >> $GITHUB_STEP_SUMMARY + grep -E '^=+.*passed' coverage_output.txt >> $GITHUB_STEP_SUMMARY || 
true + echo "" >> $GITHUB_STEP_SUMMARY + # Coverage total, reformatted + awk '/^TOTAL / { for(i=1;i<=NF;i++) if($i~/^[0-9]+$/ || $i~/^[0-9]+%$/) a[++n]=$i; if(n>=3) printf "TOTAL Stmts: %s Miss: %s Cover: %s\n", a[1], a[2], a[3] }' coverage_output.txt >> $GITHUB_STEP_SUMMARY || true + echo '```' >> $GITHUB_STEP_SUMMARY + + # Core modules with coverage < 50%, sorted by uncovered lines (descending) + LOW_COV=$(awk '/^python\/.*%/ { + for (i=1; i<=NF; i++) { + if ($i ~ /^[0-9]+%$/) { + pct = $i + 0 + if (pct >= 1 && pct < 50) { + stmts = $(i-2) + 0 + miss = $(i-1) + 0 + printf "%d|%s|%d|%d|%d%%\n", miss, $1, stmts, miss, pct + } + break + } + } + }' coverage_output.txt \ + | grep -E '/(mem_cache|managers|sampling|parser|observability|function_call|entrypoints|speculative|multimodal|utils)/' \ + | sort -t'|' -k1 -nr | cut -d'|' -f2- | head -40 || true) + if [ -n "$LOW_COV" ]; then + echo "" >> $GITHUB_STEP_SUMMARY + echo "
<details><summary>Top uncovered core modules (coverage < 50%, sorted by uncovered lines)</summary>" >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + echo "| File | Stmts | Miss | Cover |" >> $GITHUB_STEP_SUMMARY + echo "|------|-------|------|-------|" >> $GITHUB_STEP_SUMMARY + echo "$LOW_COV" | while IFS='|' read -r file stmts miss pct; do + echo "| \`$file\` | $stmts | $miss | $pct |" >> $GITHUB_STEP_SUMMARY + done + echo "</details>
" >> $GITHUB_STEP_SUMMARY + fi + json-export: name: JSON Export runs-on: ubuntu-latest diff --git a/.github/workflows/ci-failure-monitor.yml b/.github/workflows/ci-failure-monitor.yml index 07dcea7d6111..0b81fbfd36d6 100644 --- a/.github/workflows/ci-failure-monitor.yml +++ b/.github/workflows/ci-failure-monitor.yml @@ -29,7 +29,7 @@ jobs: - name: Install dependencies run: | python -m pip install --upgrade pip - pip install requests slack_sdk + pip install requests - name: Run Failure Analysis env: @@ -51,22 +51,3 @@ jobs: path: | scripts/ci_monitor/ci_failure_analysis_*.json retention-days: 7 - - - name: Send Slack Notification - if: always() - env: - SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }} - run: | - cd scripts/ci_monitor - LATEST_REPORT=$(ls -t ci_failure_analysis_*.json | head -1) - - if [ ! -f "$LATEST_REPORT" ]; then - echo "No report found, so skipping Slack notification" - exit 0 - fi - - if [ -n "$SGLANG_DIFFUSION_SLACK_TOKEN" ]; then - python3 post_ci_failures_to_slack.py --report-file "$LATEST_REPORT" - else - echo "SGLANG_DIFFUSION_SLACK_TOKEN not configured, skipping notification" - fi diff --git a/.github/workflows/ci-monitor.yml b/.github/workflows/ci-monitor.yml deleted file mode 100644 index 28a198a32a58..000000000000 --- a/.github/workflows/ci-monitor.yml +++ /dev/null @@ -1,111 +0,0 @@ -name: CI Monitor - -on: - schedule: - - cron: '0 */12 * * *' # Every 12 hours for main analysis - workflow_dispatch: - inputs: - limit: - description: 'Number of CI runs to analyze' - required: false - default: '1000' - type: string - -concurrency: - group: ci-monitor-${{ github.ref }} - cancel-in-progress: true - -permissions: - contents: write - actions: read - -jobs: - ci-monitor: - if: github.repository == 'sgl-project/sglang'|| github.event_name == 'pull_request' - runs-on: ubuntu-latest - steps: - - name: Checkout code - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v5 - with: - 
python-version: '3.9' - - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install requests matplotlib pandas - - - name: Run CI Analysis - env: - GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} - PYTHONUNBUFFERED: 1 - PYTHONIOENCODING: utf-8 - run: | - cd scripts/ci_monitor - python ci_analyzer.py --token $GITHUB_TOKEN --limit ${{ inputs.limit || '1000' }} --output ci_analysis_$(date +%Y%m%d_%H%M%S).json - - - name: Run Nightly Test Analysis - env: - GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} - PYTHONUNBUFFERED: 1 - PYTHONIOENCODING: utf-8 - run: | - cd scripts/ci_monitor - python ci_analyzer.py --token $GITHUB_TOKEN --mode nightly --days 2 --output nightly_analysis_$(date +%Y%m%d_%H%M%S).json - - - name: Run Performance Analysis - env: - GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} - PYTHONUNBUFFERED: 1 - PYTHONIOENCODING: utf-8 - run: | - cd scripts/ci_monitor - python ci_analyzer_perf.py --token $GITHUB_TOKEN --limit ${{ inputs.limit || '1000' }} --output-dir performance_tables_$(date +%Y%m%d_%H%M%S) --upload-to-github - - - name: Upload Analysis Results - uses: actions/upload-artifact@v4 - with: - name: ci-analysis-results-${{ github.run_number }} - path: | - scripts/ci_monitor/ci_analysis_*.json - scripts/ci_monitor/nightly_analysis_*.json - scripts/ci_monitor/performance_tables_* - retention-days: 30 - - ci-monitor-balance: - needs: ci-monitor - if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request' - runs-on: ubuntu-latest - steps: - - name: Checkout code - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v5 - with: - python-version: '3.9' - - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install requests - - - name: Run Test Balance Analysis - env: - GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} - PYTHONUNBUFFERED: 1 - PYTHONIOENCODING: utf-8 - run: | - cd scripts/ci_monitor - python 
ci_analyzer_balance.py --token $GITHUB_TOKEN --limit ${{ inputs.limit || '1000' }} --output test_balance_report_$(date +%Y%m%d_%H%M%S).json - - - name: Upload Balance Analysis Results - uses: actions/upload-artifact@v4 - with: - name: test-balance-results-${{ github.run_number }} - path: | - scripts/ci_monitor/test_balance_report_*.json - scripts/ci_monitor/test_balance_report_*.csv - retention-days: 30 diff --git a/.github/workflows/diffusion-ci-gt-gen.yml b/.github/workflows/diffusion-ci-gt-gen.yml new file mode 100644 index 000000000000..909cfff54053 --- /dev/null +++ b/.github/workflows/diffusion-ci-gt-gen.yml @@ -0,0 +1,533 @@ +name: Diffusion CI Ground Truth Generation + +on: + workflow_dispatch: + inputs: + ref: + description: 'Git ref to checkout' + required: false + default: '' + type: string + case_ids: + description: 'Specific case IDs to run (space-separated, optional)' + required: false + default: '' + type: string + output_name: + description: 'Custom local output/artifact folder name. Leave empty to use defaults.' + required: false + default: '' + type: string + publish_target_dir: + description: 'Remote target directory in sgl-project/ci-data. Leave empty to use sglang_generated, or official_generated when run_official_cases is true.' + required: false + default: '' + type: string + kernel_artifact_run_id: + description: "Optional sgl-kernel wheel source: GitHub Actions run ID of a pr-test.yml run on the SAME commit as 'ref' that uploaded wheel-python3.10-cuda13.0. Use only when this branch's torch/ABI change makes the PyPI sglang-kernel wheel incompatible. Find it under Actions > pr-test; artifacts expire after 90 days." + required: false + default: '' + type: string + run_official_cases: + description: 'Run official comparable GT cases instead of native SGLang GT cases.' + required: false + default: false + type: boolean + official_case_ids: + description: 'Specific official case IDs to run (space-separated). 
Leave empty to run all official comparable cases.' + required: false + default: '' + type: string + official_source_group: + description: 'Official GT source group filter: all, diffusers, wan21, ltx, or ltx23. Used only when run_official_cases is true.' + required: false + default: '' + type: string + ci_data_ref: + description: 'ci-data ref to use for repro scripts when running official GT cases.' + required: false + default: 'main' + type: string + +concurrency: + group: diffusion-ci-gt-gen-${{ github.ref }}-${{ inputs.output_name || inputs.case_ids || inputs.official_case_ids || inputs.official_source_group || inputs.run_official_cases || 'default' }} + cancel-in-progress: true + +permissions: + contents: write + actions: read + +env: + SGLANG_IS_IN_CI: true + SGLANG_CUDA_COREDUMP: "1" + OUTPUT_NAME: ${{ inputs.output_name || 'diffusion-ci-outputs' }} + PUBLISH_TARGET_DIR: ${{ inputs.publish_target_dir || (inputs.run_official_cases && 'diffusion-ci/consistency_gt/official_generated' || 'diffusion-ci/consistency_gt/sglang_generated') }} + +jobs: + compute-official-gt-matrix: + if: github.repository == 'sgl-project/sglang' && inputs.run_official_cases + runs-on: ubuntu-latest + outputs: + matrix: ${{ steps.compute.outputs.matrix }} + case-count: ${{ steps.compute.outputs.case-count }} + steps: + - name: Compute official case matrix + id: compute + env: + OFFICIAL_CASE_IDS: ${{ inputs.official_case_ids }} + OFFICIAL_SOURCE_GROUP: ${{ inputs.official_source_group || 'all' }} + run: | + python3 - <<'PY' + import json + import os + + groups = { + "diffusers": [ + "flux_2_image_t2i", + "flux_2_klein_image_t2i", + "flux_2_ti2i", + "flux_image_t2i", + "qwen_image_edit_2509_ti2i", + "qwen_image_edit_2511_ti2i", + "qwen_image_edit_ti2i", + "qwen_image_layered_i2i", + "qwen_image_t2i", + "zimage_image_t2i", + ], + "wan21": ["wan2_1_t2v_1.3b"], + "ltx": [ + "ltx_2_two_stage_t2v", + "ltx_2.3_two_stage_t2v_2gpus", + "ltx_2_3_two_stage_ti2v_2gpus", + "ltx_2.3_one_stage_ti2v", + 
"ltx_2_3_hq_pipeline", + ], + "ltx23": [ + "ltx_2.3_two_stage_t2v_2gpus", + "ltx_2.3_one_stage_ti2v", + ], + } + source_group = os.environ["OFFICIAL_SOURCE_GROUP"].strip() or "all" + if source_group == "all": + selected_groups = ["diffusers", "wan21", "ltx"] + elif source_group in groups: + selected_groups = [source_group] + else: + raise SystemExit(f"unknown official_source_group: {source_group}") + + requested = os.environ["OFFICIAL_CASE_IDS"].split() + requested_set = set(requested) + h200_cases = { + "flux_2_image_t2i", + "flux_2_klein_image_t2i", + "flux_2_ti2i", + } + include = [] + for group in selected_groups: + for case_id in groups[group]: + if requested_set and case_id not in requested_set: + continue + include.append( + { + "source_group": group, + "case_id": case_id, + "runner": "8-gpu-h200" if case_id in h200_cases else "1-gpu-h100", + } + ) + + known_cases = {case for cases in groups.values() for case in cases} + unknown = sorted(requested_set - known_cases) + if unknown: + raise SystemExit(f"unknown official case id(s): {' '.join(unknown)}") + if not include: + raise SystemExit("official case matrix is empty") + + with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as f: + f.write(f"matrix={json.dumps({'include': include}, separators=(',', ':'))}\n") + f.write(f"case-count={len(include)}\n") + PY + + official-gt-hopper: + needs: compute-official-gt-matrix + if: needs.compute-official-gt-matrix.result == 'success' + runs-on: ${{ matrix.runner }} + timeout-minutes: 240 + strategy: + fail-fast: false + matrix: ${{ fromJson(needs.compute-official-gt-matrix.outputs.matrix) }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - name: Checkout ci-data repro scripts + uses: actions/checkout@v4 + with: + repository: sgl-project/ci-data + ref: ${{ inputs.ci_data_ref || 'main' }} + path: ci-data + token: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + sparse-checkout: | + 
diffusion-ci/repro_scripts + diffusion-ci/consistency_gt/official_generated/case_map.json + sparse-checkout-cone-mode: false + + - name: Prepare sgl-kernel/dist for prebuilt wheel + if: inputs.kernel_artifact_run_id != '' + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download prebuilt sgl-kernel wheel + if: inputs.kernel_artifact_run_id != '' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + name: wheel-python3.10-cuda13.0 + run-id: ${{ inputs.kernel_artifact_run_id }} + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Install dependencies + run: | + CUSTOM_BUILD_SGL_KERNEL="${{ inputs.kernel_artifact_run_id != '' && 'true' || 'false' }}" \ + bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Install official LTX repro dependencies + if: matrix.source_group == 'ltx' || matrix.source_group == 'ltx23' + run: | + UV_SYSTEM_PYTHON=1 uv pip install \ + "transformers==4.52.4" \ + openimageio \ + --index-strategy unsafe-best-match \ + --prerelease allow + UV_SYSTEM_PYTHON=1 uv pip uninstall kernels + + - name: Checkout official Wan2.1 repo + if: matrix.source_group == 'wan21' + run: | + mkdir -p /tmp/mmgen-official-code + if [ ! -d /tmp/mmgen-official-code/Wan2.1/.git ]; then + git clone https://github.com/Wan-Video/Wan2.1.git /tmp/mmgen-official-code/Wan2.1 + fi + git -C /tmp/mmgen-official-code/Wan2.1 fetch --depth 1 origin 9737cba9c1c3c4d04b33fcad41c111989865d315 + git -C /tmp/mmgen-official-code/Wan2.1 checkout 9737cba9c1c3c4d04b33fcad41c111989865d315 + git -C /tmp/mmgen-official-code/Wan2.1 rev-parse HEAD + + - name: Checkout official LTX-2 repo + if: matrix.source_group == 'ltx' || matrix.source_group == 'ltx23' + run: | + mkdir -p /tmp/mmgen-official-code + if [ ! 
-d /tmp/mmgen-official-code/LTX-2/.git ]; then + git clone https://github.com/Lightricks/LTX-2.git /tmp/mmgen-official-code/LTX-2 + fi + git -C /tmp/mmgen-official-code/LTX-2 fetch --depth 1 origin 41d924371612b692c0fd1e4d9d94c3dfb3c02cb3 + git -C /tmp/mmgen-official-code/LTX-2 checkout 41d924371612b692c0fd1e4d9d94c3dfb3c02cb3 + git -C /tmp/mmgen-official-code/LTX-2 rev-parse HEAD + + - name: Generate official output + env: + HF_TOKEN: ${{ secrets.SGLANG_DIFFUSION_CI_HF_TOKEN || secrets.HF_TOKEN }} + HUGGING_FACE_HUB_TOKEN: ${{ secrets.SGLANG_DIFFUSION_CI_HF_TOKEN || secrets.HF_TOKEN }} + RUNAI_STREAMER_MEMORY_LIMIT: 0 + PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True + CUDA_VISIBLE_DEVICES: 0 + CASE_ID: ${{ matrix.case_id }} + SOURCE_GROUP: ${{ matrix.source_group }} + run: | + git -C ci-data rev-parse HEAD + mkdir -p "python/${{ env.OUTPUT_NAME }}" + if [ "$SOURCE_GROUP" = "diffusers" ]; then + cd python + python ../ci-data/diffusion-ci/repro_scripts/gen_official_diffusion_gt.py \ + --suite 1-gpu \ + --out-dir ./${{ env.OUTPUT_NAME }} \ + --case-ids "$CASE_ID" \ + --dtype bf16 \ + --device-map none \ + --generator-device cuda + elif [ "$SOURCE_GROUP" = "wan21" ]; then + cd python + python ../ci-data/diffusion-ci/repro_scripts/gen_official_diffusion_gt.py \ + --suite 1-gpu \ + --out-dir ./${{ env.OUTPUT_NAME }} \ + --case-ids "$CASE_ID" \ + --wan-official-repo-dir /tmp/mmgen-official-code/Wan2.1 \ + --dtype bf16 \ + --device-map none \ + --generator-device cuda + elif [ "$SOURCE_GROUP" = "ltx" ] || [ "$SOURCE_GROUP" = "ltx23" ]; then + cd python + extra_ltx_args=() + if [ "$CASE_ID" = "ltx_2_3_hq_pipeline" ]; then + extra_ltx_args+=(--num-frames 24) + fi + set +e + PYTHONPATH=/tmp/mmgen-official-code/LTX-2/packages/ltx-core/src:/tmp/mmgen-official-code/LTX-2/packages/ltx-pipelines/src:$PWD:$PYTHONPATH \ + python ../ci-data/diffusion-ci/repro_scripts/gen_official_ltx23.py \ + --out-dir ./${{ env.OUTPUT_NAME }} \ + --case-ids "$CASE_ID" \ + "${extra_ltx_args[@]}" 
+ status=$? + set -e + if [ "$status" -ne 0 ]; then + find ./${{ env.OUTPUT_NAME }} \( -name 'official_ltx_manifest.json' -o -name 'official_ltx23_manifest.json' \) -print -exec cat {} \; + exit "$status" + fi + else + echo "Unknown SOURCE_GROUP=$SOURCE_GROUP" + exit 1 + fi + + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: ${{ env.OUTPUT_NAME }}-${{ matrix.source_group }}-${{ matrix.case_id }} + path: | + python/${{ env.OUTPUT_NAME }}/*.jpg + python/${{ env.OUTPUT_NAME }}/*.png + python/${{ env.OUTPUT_NAME }}/official_gt_manifest_*.json + python/${{ env.OUTPUT_NAME }}/official_ltx_manifest.json + python/${{ env.OUTPUT_NAME }}/official_ltx23_manifest.json + retention-days: 7 + + - name: Publish official GT images to sgl-project/ci-data + env: + GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + run: | + python scripts/ci/utils/diffusion/publish_diffusion_gt.py \ + --source-dir python/${{ env.OUTPUT_NAME }} \ + --target-dir "${{ env.PUBLISH_TARGET_DIR }}" + + compute-diffusion-partitions: + if: github.repository == 'sgl-project/sglang' && !inputs.run_official_cases + runs-on: ubuntu-latest + outputs: + matrix-1gpu: ${{ steps.compute.outputs.matrix-1gpu }} + matrix-2gpu: ${{ steps.compute.outputs.matrix-2gpu }} + matrix-b200: ${{ steps.compute.outputs.matrix-b200 }} + partition-count-1gpu: ${{ steps.compute.outputs['partition-count-1gpu'] }} + partition-count-2gpu: ${{ steps.compute.outputs['partition-count-2gpu'] }} + partition-count-b200: ${{ steps.compute.outputs['partition-count-b200'] }} + plan-1gpu: ${{ steps.compute.outputs.plan-1gpu }} + plan-2gpu: ${{ steps.compute.outputs.plan-2gpu }} + plan-b200: ${{ steps.compute.outputs.plan-b200 }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.10' + + - name: Compute partitions + id: compute + run: | + python 
scripts/ci/utils/diffusion/compute_diffusion_partitions.py \ + --min-time 1200 \ + --target-time 1800 \ + --max-time 2400 \ + --max-partitions 10 \ + --parametrized-only + + multimodal-diffusion-gen-1gpu: + needs: compute-diffusion-partitions + if: | + needs.compute-diffusion-partitions.result == 'success' && + needs.compute-diffusion-partitions.outputs.matrix-1gpu != '{"include":[]}' + runs-on: 1-gpu-h100 + strategy: + fail-fast: false + matrix: ${{ fromJson(needs.compute-diffusion-partitions.outputs.matrix-1gpu) }} + timeout-minutes: 150 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - name: Prepare sgl-kernel/dist for prebuilt wheel + if: inputs.kernel_artifact_run_id != '' + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download prebuilt sgl-kernel wheel + if: inputs.kernel_artifact_run_id != '' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + name: wheel-python3.10-cuda13.0 + run-id: ${{ inputs.kernel_artifact_run_id }} + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Install dependencies + run: | + CUSTOM_BUILD_SGL_KERNEL="${{ inputs.kernel_artifact_run_id != '' && 'true' || 'false' }}" \ + bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Generate outputs + env: + RUNAI_STREAMER_MEMORY_LIMIT: 0 + PARTITION_PLAN_JSON: ${{ needs.compute-diffusion-partitions.outputs.plan-1gpu }} + run: | + cd python + python -m sglang.multimodal_gen.test.scripts.gen_diffusion_ci_outputs \ + --suite 1-gpu \ + --partition-id ${{ matrix.part }} \ + --total-partitions ${{ needs.compute-diffusion-partitions.outputs['partition-count-1gpu'] }} \ + --partition-plan-json "$PARTITION_PLAN_JSON" \ + --out-dir ./${{ env.OUTPUT_NAME }} \ + ${{ inputs.case_ids != '' && format('--case-ids {0}', inputs.case_ids) || '' }} + + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: ${{ env.OUTPUT_NAME 
}}-1gpu-part${{ matrix.part }} + path: python/${{ env.OUTPUT_NAME }} + retention-days: 7 + + - name: Publish GT images to sgl-project/ci-data + env: + GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + run: | + python scripts/ci/utils/diffusion/publish_diffusion_gt.py \ + --source-dir python/${{ env.OUTPUT_NAME }} \ + --target-dir "${{ env.PUBLISH_TARGET_DIR }}" + + multimodal-diffusion-gen-2gpu: + needs: compute-diffusion-partitions + if: | + needs.compute-diffusion-partitions.result == 'success' && + needs.compute-diffusion-partitions.outputs.matrix-2gpu != '{"include":[]}' + runs-on: 2-gpu-h100 + strategy: + fail-fast: false + matrix: ${{ fromJson(needs.compute-diffusion-partitions.outputs.matrix-2gpu) }} + timeout-minutes: 150 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - name: Prepare sgl-kernel/dist for prebuilt wheel + if: inputs.kernel_artifact_run_id != '' + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download prebuilt sgl-kernel wheel + if: inputs.kernel_artifact_run_id != '' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + name: wheel-python3.10-cuda13.0 + run-id: ${{ inputs.kernel_artifact_run_id }} + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Install dependencies + run: | + CUSTOM_BUILD_SGL_KERNEL="${{ inputs.kernel_artifact_run_id != '' && 'true' || 'false' }}" \ + bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Generate outputs + env: + RUNAI_STREAMER_MEMORY_LIMIT: 0 + PARTITION_PLAN_JSON: ${{ needs.compute-diffusion-partitions.outputs.plan-2gpu }} + run: | + cd python + python -m sglang.multimodal_gen.test.scripts.gen_diffusion_ci_outputs \ + --suite 2-gpu \ + --partition-id ${{ matrix.part }} \ + --total-partitions ${{ needs.compute-diffusion-partitions.outputs['partition-count-2gpu'] }} \ + --partition-plan-json "$PARTITION_PLAN_JSON" \ + --out-dir ./${{ 
env.OUTPUT_NAME }} \ + ${{ inputs.case_ids != '' && format('--case-ids {0}', inputs.case_ids) || '' }} + + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: ${{ env.OUTPUT_NAME }}-2gpu-part${{ matrix.part }} + path: python/${{ env.OUTPUT_NAME }} + retention-days: 7 + + - name: Publish GT images to sgl-project/ci-data + env: + GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + run: | + python scripts/ci/utils/diffusion/publish_diffusion_gt.py \ + --source-dir python/${{ env.OUTPUT_NAME }} \ + --target-dir "${{ env.PUBLISH_TARGET_DIR }}" + + multimodal-diffusion-gen-b200: + needs: compute-diffusion-partitions + if: | + needs.compute-diffusion-partitions.result == 'success' && + needs.compute-diffusion-partitions.outputs.matrix-b200 != '{"include":[]}' + runs-on: 4-gpu-b200 + strategy: + fail-fast: false + matrix: ${{ fromJson(needs.compute-diffusion-partitions.outputs.matrix-b200) }} + timeout-minutes: 240 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - name: Prepare sgl-kernel/dist for prebuilt wheel + if: inputs.kernel_artifact_run_id != '' + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download prebuilt sgl-kernel wheel + if: inputs.kernel_artifact_run_id != '' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + name: wheel-python3.10-cuda13.0 + run-id: ${{ inputs.kernel_artifact_run_id }} + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Install dependencies + run: | + CUSTOM_BUILD_SGL_KERNEL="${{ inputs.kernel_artifact_run_id != '' && 'true' || 'false' }}" \ + bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Generate outputs + env: + RUNAI_STREAMER_MEMORY_LIMIT: 0 + PARTITION_PLAN_JSON: ${{ needs.compute-diffusion-partitions.outputs.plan-b200 }} + run: | + cd python + python -m sglang.multimodal_gen.test.scripts.gen_diffusion_ci_outputs \ + --suite 
1-gpu-b200 \ + --partition-id ${{ matrix.part }} \ + --total-partitions ${{ needs.compute-diffusion-partitions.outputs['partition-count-b200'] }} \ + --partition-plan-json "$PARTITION_PLAN_JSON" \ + --out-dir ./${{ env.OUTPUT_NAME }} \ + ${{ inputs.case_ids != '' && format('--case-ids {0}', inputs.case_ids) || '' }} + + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: ${{ env.OUTPUT_NAME }}-b200-part${{ matrix.part }} + path: python/${{ env.OUTPUT_NAME }} + retention-days: 7 + + - name: Publish GT images to sgl-project/ci-data + env: + GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + run: | + python scripts/ci/utils/diffusion/publish_diffusion_gt.py \ + --source-dir python/${{ env.OUTPUT_NAME }} \ + --target-dir "${{ env.PUBLISH_TARGET_DIR }}" diff --git a/.github/workflows/execute-notebook.yml b/.github/workflows/execute-notebook.yml index 953f34b72cbc..e53c49a64be5 100644 --- a/.github/workflows/execute-notebook.yml +++ b/.github/workflows/execute-notebook.yml @@ -3,9 +3,12 @@ name: Execute Notebooks on: pull_request: branches: [ main ] + types: [opened, synchronize, reopened, labeled] paths: - "python/sglang/**" - "docs/**" + - "!python/sglang/**/*.md" + - "!docs/**/*.md" workflow_dispatch: @@ -13,11 +16,20 @@ concurrency: group: execute-notebook-${{ github.ref }} cancel-in-progress: true +env: + SGLANG_IS_IN_CI: true jobs: + call-gate: + # Align with PR Test: fail fast if PR doesn't have run-ci label. + # This makes /tag-and-rerun-ci work by rerunning this failed workflow. 
+ uses: ./.github/workflows/pr-gate.yml + secrets: inherit + run-all-notebooks: - runs-on: 1-gpu-runner - if: github.event_name != 'pull_request' || contains(github.event.pull_request.labels.*.name, 'run-ci') + needs: [call-gate] + runs-on: 1-gpu-h100 + if: github.event_name != 'pull_request' || needs.call-gate.result == 'success' steps: - name: Checkout code uses: actions/checkout@v4 @@ -43,9 +55,11 @@ jobs: notebook-finish: needs: [ + call-gate, run-all-notebooks ] runs-on: ubuntu-latest + if: always() && needs.run-all-notebooks.result != 'skipped' steps: - name: Check all dependent job statuses run: | diff --git a/.github/workflows/full-test-npu.yml b/.github/workflows/full-test-npu.yml new file mode 100644 index 000000000000..1feb3f504760 --- /dev/null +++ b/.github/workflows/full-test-npu.yml @@ -0,0 +1,359 @@ +name: Full Test (NPU) + +on: +# pull_request: +# branches: +# - main +# paths: +# - ".github/workflows/full-test-npu.yml" + workflow_dispatch: + inputs: + ref: + description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.' + required: false + type: string + default: '' + job_filter: + description: 'Select which job to run (leave empty or "all" to run all jobs)' + required: false + type: string + default: 'all' + image_a3: + description: 'The a3 running docker image of the test task.' + required: false + type: string + default: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11' + skip_install_flag: + description: 'Indicates whether to skip the installation of sglang, defaulting to false.' 
+ required: false + type: string + default: 'false' + +concurrency: + group: full-test-npu-${{ inputs.ref || github.ref }} + cancel-in-progress: ${{ github.event_name != 'workflow_call' }} + +jobs: + set-image-config: + runs-on: ubuntu-latest + outputs: + ref: ${{ steps.set-vars.outputs.ref }} + job_filter: ${{ steps.set-vars.outputs.job_filter }} + image_a3: ${{ steps.set-vars.outputs.image_a3 }} + skip_install_flag: ${{ steps.set-vars.outputs.skip_install_flag }} + steps: + # When triggered by a PR, no input parameters are used. The latest community code is tested by default. + - name: Set image config + id: set-vars + run: | + if [ -z "${{ inputs.ref }}" ]; then + echo "ref=" >> $GITHUB_OUTPUT + else + echo "ref=${{ inputs.ref }}" >> $GITHUB_OUTPUT + fi + + if [ -z "${{ inputs.job_filter }}" ]; then + echo "job_filter=all" >> $GITHUB_OUTPUT + else + echo "job_filter=${{ inputs.job_filter }}" >> $GITHUB_OUTPUT + fi + + if [ -z "${{ inputs.image_a3 }}" ]; then + echo "image_a3=swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11" >> $GITHUB_OUTPUT + else + echo "image_a3=${{ inputs.image_a3 }}" >> $GITHUB_OUTPUT + fi + + if [ -z "${{ inputs.skip_install_flag }}" ]; then + echo "skip_install_flag=false" >> $GITHUB_OUTPUT + else + echo "skip_install_flag=${{ inputs.skip_install_flag }}" >> $GITHUB_OUTPUT + fi + + nighly-test-npu: + needs: [set-image-config] + name: nightly-test-npu + if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} + uses: ./.github/workflows/nightly-test-npu.yml + with: + ref: ${{ needs.set-image-config.outputs.ref }} + job_filter: ${{ needs.set-image-config.outputs.job_filter }} + image_a3: ${{ needs.set-image-config.outputs.image_a3 }} + skip_install_flag: ${{ needs.set-image-config.outputs.skip_install_flag }} + secrets: inherit + + full-1-npu-a3: + needs: [set-image-config] + if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name ==
'pull_request') }} + runs-on: linux-aarch64-a3-2 + container: + image: ${{ needs.set-image-config.outputs.image_a3 }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ needs.set-image-config.outputs.ref || github.ref }} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi + + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Print Log Information + run: | + bash scripts/ci/npu/npu_log_print.sh + + - name: Run test + timeout-minutes: 240 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + run: | + pip install sglang_router + hf download lmms-lab/MMMU --repo-type dataset + pip install sentence_transformers torchaudio==2.8.0 + pip install protobuf==6.31.1 zss pre-commit wandb>=0.16.0 tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap + pip install 
yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 + pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv + git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git + cd ./lmms-eval + nohup pip install . > lmmslog.txt 2>&1 & + sleep 120 + export PYTHONPATH=$PYTHONPATH:$(pwd) + cd ../ + cd test + python3 run_suite.py --hw npu --suite full-1-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 + + full-2-npu-a3: + needs: [set-image-config] + if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} + runs-on: linux-aarch64-a3-2 + container: + image: ${{ needs.set-image-config.outputs.image_a3 }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ needs.set-image-config.outputs.ref || github.ref }} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi + + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp 
~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Print Log Information + run: | + bash scripts/ci/npu/npu_log_print.sh + + - name: Run test + timeout-minutes: 240 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + run: | + pip install sglang_router + hf download lmms-lab/MMMU --repo-type dataset + pip install sentence_transformers torchaudio==2.8.0 + pip install protobuf==6.31.1 zss pre-commit wandb>=0.16.0 tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap + pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 + pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv + git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git + cd ./lmms-eval + nohup pip install . 
> lmmslog.txt 2>&1 & + sleep 120 + export PYTHONPATH=$PYTHONPATH:$(pwd) + cd ../ + cd test + python3 run_suite.py --hw npu --suite full-2-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 + + full-4-npu-a3: + needs: [set-image-config] + if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} + runs-on: linux-aarch64-a3-4 + container: + image: ${{ needs.set-image-config.outputs.image_a3 }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ needs.set-image-config.outputs.ref || github.ref }} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi + + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Print Log Information + run: | + bash scripts/ci/npu/npu_log_print.sh + + - name: Run test + timeout-minutes: 240 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + run: | + 
pip install sglang_router + hf download lmms-lab/MMMU --repo-type dataset + pip install sentence_transformers torchaudio==2.8.0 + pip install protobuf==6.31.1 zss pre-commit wandb>=0.16.0 tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap + pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 + pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv + git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git + cd ./lmms-eval + nohup pip install . > lmmslog.txt 2>&1 & + sleep 120 + export PYTHONPATH=$PYTHONPATH:$(pwd) + cd ../ + cd test + python3 run_suite.py --hw npu --suite full-4-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 + + full-16-npu-a3: + needs: [set-image-config] + if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} + runs-on: linux-aarch64-a3-16 + container: + image: ${{ needs.set-image-config.outputs.image_a3 }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ needs.set-image-config.outputs.ref || github.ref }} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ 
needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi + + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Print Log Information + run: | + bash scripts/ci/npu/npu_log_print.sh + + - name: Run test + timeout-minutes: 240 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + run: | + pip install sglang_router + hf download lmms-lab/MMMU --repo-type dataset + pip install sentence_transformers torchaudio==2.8.0 + pip install protobuf==6.31.1 zss pre-commit wandb>=0.16.0 tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap + pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 + pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv + git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git + cd ./lmms-eval + nohup pip install . 
> lmmslog.txt 2>&1 & + sleep 120 + export PYTHONPATH=$PYTHONPATH:$(pwd) + cd ../ + cd test + python3 run_suite.py --hw npu --suite full-16-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 + + check-all-jobs: + if: github.repository == 'sgl-project/sglang' && always() + needs: + - nighly-test-npu + - full-1-npu-a3 + - full-2-npu-a3 + - full-4-npu-a3 + - full-16-npu-a3 + runs-on: ubuntu-latest + container: + image: docker.m.daocloud.io/ubuntu:22.04 + steps: + - name: Check if any job failed + run: | + if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then + echo "One or more nightly test jobs failed" + exit 1 + fi + if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then + echo "One or more nightly test jobs were cancelled" + exit 1 + fi + echo "All nightly test jobs passed" diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml index 018060fa42b2..72a67d25b0e4 100644 --- a/.github/workflows/lint.yml +++ b/.github/workflows/lint.yml @@ -25,26 +25,15 @@ jobs: - name: Run pre-commit checks run: SKIP=no-commit-to-branch pre-commit run --all-files --show-diff-on-failure + - name: Run lychee docs checks (offline references) + uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2 + with: + args: --config .github/linters/lychee.toml README.md "docs/**/*.md" "docs/**/*.rst" "docs/**/*.ipynb" + - name: Run sgl-kernel clang-format checks - uses: DoozyX/clang-format-lint-action@v0.18.1 + uses: DoozyX/clang-format-lint-action@v0.20 with: source: sgl-kernel extensions: h,c,cpp,hpp,cu,cuh,cc - clangFormatVersion: 18 + clangFormatVersion: 20 style: file - - - name: Check proto files are in sync - run: | - if ! diff -q python/sglang/srt/grpc/sglang_scheduler.proto sgl-model-gateway/src/proto/sglang_scheduler.proto; then - echo "❌ ERROR: Proto files are out of sync!" 
- echo "" - echo "The following files must be kept identical:" - echo " - python/sglang/srt/grpc/sglang_scheduler.proto" - echo " - sgl-model-gateway/src/proto/sglang_scheduler.proto" - echo "" - echo "Please ensure both files have the same content." - echo "" - echo "Differences:" - diff python/sglang/srt/grpc/sglang_scheduler.proto sgl-model-gateway/src/proto/sglang_scheduler.proto || true - exit 1 - fi diff --git a/.github/workflows/list-active-pr-runs.yml b/.github/workflows/list-active-pr-runs.yml new file mode 100644 index 000000000000..10deab8374cf --- /dev/null +++ b/.github/workflows/list-active-pr-runs.yml @@ -0,0 +1,317 @@ +name: List Active Runs + +on: + workflow_dispatch: + inputs: + workflows: + description: 'Space-separated list of workflow filenames to check' + required: false + type: string + default: 'pr-test.yml' + +permissions: + actions: read + contents: read + pull-requests: read + +jobs: + list-active-runs: + runs-on: ubuntu-latest + steps: + - name: Install GitHub CLI + run: sudo apt-get install -y gh jq + + - name: List active runs grouped by PR + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + REPO: ${{ github.repository }} + WORKFLOWS: ${{ github.event.inputs.workflows || 'pr-test.yml' }} + shell: bash + run: | + set -euo pipefail + + echo "=========================================" + echo "🔍 Active Workflow Runs Report" + echo "=========================================" + echo "" + + # Get all workflows or specific ones + read -r -a workflow_files <<< "${WORKFLOWS}" + echo "📋 Checking specified workflows: ${WORKFLOWS}" + + echo "" + + # Create a temporary file to store PR data + pr_data_file=$(mktemp) + + # Process each workflow + for workflow_file in ${workflow_files[@]}; do + echo "Scanning workflow: $workflow_file" + + # Get all active runs (queued, waiting, in_progress) + active_runs=$(gh run list \ + --repo "$REPO" \ + --workflow "$workflow_file" \ + --json databaseId,status,event,headBranch,createdAt,updatedAt,headSha,number,attempt 
\ + --limit 500 \ + | jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress")') + + if [ -z "$active_runs" ]; then + continue + fi + + # Process each run + echo "$active_runs" | while read -r run; do + run_id=$(echo "$run" | jq -r '.databaseId') + run_status=$(echo "$run" | jq -r '.status') + run_event=$(echo "$run" | jq -r '.event') + created_at=$(echo "$run" | jq -r '.createdAt') + head_sha=$(echo "$run" | jq -r '.headSha') + run_number=$(echo "$run" | jq -r '.number') + run_attempt=$(echo "$run" | jq -r '.attempt // 1') + + # Get detailed run information including jobs + run_details=$(gh api "repos/$REPO/actions/runs/$run_id" 2>/dev/null || true) + + if [ -z "$run_details" ]; then + continue + fi + + head_owner=$(echo "$run_details" | jq -r '.head_repository.owner.login // empty') + head_branch=$(echo "$run_details" | jq -r '.head_branch // empty') + + if [ -z "$head_owner" ] || [ -z "$head_branch" ]; then + continue + fi + + # Find PR number (may be empty for non-PR runs) + pr_number=$(gh api "repos/$REPO/pulls?state=open&head=${head_owner}:${head_branch}" \ + --jq '.[0].number // empty' 2>/dev/null || true) + + if [ -z "$pr_number" ]; then + pr_number="NO_PR" + fi + + # Get jobs for this run (with pagination to avoid missing jobs) + jobs=$(gh api "repos/$REPO/actions/runs/$run_id/jobs" --paginate --jq '.jobs[]' | jq -s '.') + + running_jobs=$(echo "$jobs" | jq '[.[] | select(.status=="in_progress")] | length') + queued_jobs=$(echo "$jobs" | jq '[.[] | select(.status=="queued" or .status=="waiting")] | length') + + # Get runner info for running jobs + runners=$(echo "$jobs" | jq -r '.[] | select(.status=="in_progress") | .runner_name // "N/A"' | paste -sd "," -) + + # Calculate queue time + current_time=$(date -u +%s) + created_time=$(date -u -d "$created_at" +%s 2>/dev/null || echo "$current_time") + queue_time=$((current_time - created_time)) + queue_minutes=$((queue_time / 60)) + + # Store data in temporary file (unified format 
with event and branch) + echo "$pr_number|$workflow_file|$run_id|$run_status|$running_jobs|$queued_jobs|$runners|$queue_minutes|$created_at|$head_sha|$run_attempt|$run_event|$head_branch" >> "$pr_data_file" + done + done + + echo "" + echo "=========================================" + echo "📊 Active Runs Summary" + echo "=========================================" + echo "" + + if [ ! -s "$pr_data_file" ]; then + echo "✅ No active runs found" + rm -f "$pr_data_file" + exit 0 + fi + + # Get unique PR numbers (exclude NO_PR entries) + pr_numbers=$(cut -d'|' -f1 < "$pr_data_file" | grep -v '^NO_PR$' | sort -u || true) + + # Separate high priority and normal PRs + high_priority_prs=() + normal_prs=() + + for pr_num in $pr_numbers; do + labels=$(gh pr view "$pr_num" --repo "$REPO" --json labels \ + | jq -r '.labels[].name' 2>/dev/null || true) + + if echo "$labels" | grep -Fxq "high priority"; then + high_priority_prs+=($pr_num) + else + normal_prs+=($pr_num) + fi + done + + # Combine: high priority first, then normal + sorted_pr_numbers=("${high_priority_prs[@]}" "${normal_prs[@]}") + + pr_count=0 + total_running=0 + total_queued=0 + + for pr_num in "${sorted_pr_numbers[@]}"; do + pr_count=$((pr_count + 1)) + + # Get PR details + pr_info=$(gh pr view "$pr_num" --repo "$REPO" --json title,author,labels,url 2>/dev/null || true) + + if [ -z "$pr_info" ]; then + continue + fi + + pr_title=$(echo "$pr_info" | jq -r '.title') + pr_author=$(echo "$pr_info" | jq -r '.author.login') + pr_url=$(echo "$pr_info" | jq -r '.url') + pr_labels=$(echo "$pr_info" | jq -r '.labels[].name' | paste -sd ", " -) + + if [ -z "$pr_labels" ]; then + pr_labels="(no labels)" + fi + + # Add priority indicator + priority_indicator="" + if echo "$pr_labels" | grep -q "high priority"; then + priority_indicator="🔴 [HIGH PRIORITY] " + fi + + echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" + echo "🔗 ${priority_indicator}PR #$pr_num: $pr_title" + echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" + echo "👤 
Author: $pr_author" + echo "🏷️ Labels: $pr_labels" + echo "🔗 URL: $pr_url" + echo "" + + # Get all runs for this PR + pr_runs=$(grep "^$pr_num|" "$pr_data_file") + + pr_running_total=0 + pr_queued_total=0 + + echo "$pr_runs" | while read -r line; do + workflow=$(echo "$line" | cut -d'|' -f2) + run_id=$(echo "$line" | cut -d'|' -f3) + status=$(echo "$line" | cut -d'|' -f4) + running=$(echo "$line" | cut -d'|' -f5) + queued=$(echo "$line" | cut -d'|' -f6) + runners=$(echo "$line" | cut -d'|' -f7) + queue_min=$(echo "$line" | cut -d'|' -f8) + created=$(echo "$line" | cut -d'|' -f9) + attempt=$(echo "$line" | cut -d'|' -f11) + + pr_running_total=$((pr_running_total + running)) + pr_queued_total=$((pr_queued_total + queued)) + + run_url="https://github.com/$REPO/actions/runs/$run_id" + + # Calculate retry count for this specific run + retry_count=$((attempt - 1)) + + # Show retry indicator + retry_indicator="" + if [ "$retry_count" -gt 0 ]; then + retry_indicator=" 🔄 Retry #$retry_count" + fi + + echo " 📦 Workflow: $workflow (Run #$run_id)$retry_indicator" + echo " Status: $status" + echo " 🟢 Running jobs: $running" + echo " 🟡 Queued jobs: $queued" + + if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then + echo " 🖥️ Runners: $runners" + fi + + if [ "$queue_min" -gt 0 ]; then + echo " ⏱️ Queue time: ${queue_min} minutes" + fi + + echo " 🔗 Run URL: $run_url" + echo "" + done + + # Summary for this PR + pr_running_total=$(grep "^$pr_num|" "$pr_data_file" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}') + pr_queued_total=$(grep "^$pr_num|" "$pr_data_file" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}') + + total_running=$((total_running + pr_running_total)) + total_queued=$((total_queued + pr_queued_total)) + + echo " 📊 PR Total: $pr_running_total running, $pr_queued_total queued" + echo "" + done + + # --- Non-PR Runs Section --- + non_pr_runs=$(grep '^NO_PR|' "$pr_data_file" 2>/dev/null || true) + non_pr_running=0 + non_pr_queued=0 + + if [ -n "$non_pr_runs" ]; 
then + echo "=========================================" + echo "📦 Non-PR Runs (manual / scheduled / other)" + echo "=========================================" + echo "" + + echo "$non_pr_runs" | while read -r line; do + workflow=$(echo "$line" | cut -d'|' -f2) + run_id=$(echo "$line" | cut -d'|' -f3) + status=$(echo "$line" | cut -d'|' -f4) + running=$(echo "$line" | cut -d'|' -f5) + queued=$(echo "$line" | cut -d'|' -f6) + runners=$(echo "$line" | cut -d'|' -f7) + queue_min=$(echo "$line" | cut -d'|' -f8) + created=$(echo "$line" | cut -d'|' -f9) + attempt=$(echo "$line" | cut -d'|' -f11) + event=$(echo "$line" | cut -d'|' -f12) + branch=$(echo "$line" | cut -d'|' -f13) + + run_url="https://github.com/$REPO/actions/runs/$run_id" + + retry_count=$((attempt - 1)) + retry_indicator="" + if [ "$retry_count" -gt 0 ]; then + retry_indicator=" 🔄 Retry #$retry_count" + fi + + echo " 📦 Workflow: $workflow (Run #$run_id)$retry_indicator" + echo " Event: $event" + echo " Branch: $branch" + echo " Status: $status" + echo " 🟢 Running jobs: $running" + echo " 🟡 Queued jobs: $queued" + + if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then + echo " 🖥️ Runners: $runners" + fi + + if [ "$queue_min" -gt 0 ]; then + echo " ⏱️ Queue time: ${queue_min} minutes" + fi + + echo " 🔗 Run URL: $run_url" + echo "" + done + + non_pr_running=$(echo "$non_pr_runs" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}') + non_pr_queued=$(echo "$non_pr_runs" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}') + non_pr_count=$(echo "$non_pr_runs" | wc -l | tr -d ' ') + + total_running=$((total_running + non_pr_running)) + total_queued=$((total_queued + non_pr_queued)) + + echo " 📊 Non-PR Total: $non_pr_running running, $non_pr_queued queued" + echo "" + fi + + # Overall summary + echo "=========================================" + echo "📈 Overall Summary" + echo "=========================================" + echo "Total PRs with active runs: $pr_count" + echo "Total non-PR active runs: 
${non_pr_count:-0}" + echo "Total running jobs: $total_running" + echo "Total queued jobs: $total_queued" + echo "=========================================" + + # Cleanup + rm -f "$pr_data_file" diff --git a/.github/workflows/list-active-pr-runs.yml.yml b/.github/workflows/list-active-pr-runs.yml.yml deleted file mode 100644 index e8f21297c489..000000000000 --- a/.github/workflows/list-active-pr-runs.yml.yml +++ /dev/null @@ -1,253 +0,0 @@ -name: List Active PR Runs - -on: - workflow_dispatch: - inputs: - workflows: - description: 'Space-separated list of workflow filenames to check' - required: false - type: string - default: 'pr-test.yml' - -permissions: - actions: read - contents: read - pull-requests: read - -jobs: - list-active-pr-runs: - runs-on: ubuntu-latest - steps: - - name: Install GitHub CLI - run: sudo apt-get install -y gh jq - - - name: List active PR runs grouped by PR - env: - GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} - REPO: ${{ github.repository }} - WORKFLOWS: ${{ github.event.inputs.workflows || 'pr-test.yml' }} - shell: bash - run: | - set -euo pipefail - - echo "=========================================" - echo "🔍 Active PR Workflow Runs Report" - echo "=========================================" - echo "" - - # Get all workflows or specific ones - read -r -a workflow_files <<< "${WORKFLOWS}" - echo "📋 Checking specified workflows: ${WORKFLOWS}" - - echo "" - - # Create a temporary file to store PR data - pr_data_file=$(mktemp) - - # Process each workflow - for workflow_file in ${workflow_files[@]}; do - echo "Scanning workflow: $workflow_file" - - # Get all active runs (queued, waiting, in_progress) - active_runs=$(gh run list \ - --repo "$REPO" \ - --workflow "$workflow_file" \ - --json databaseId,status,event,headBranch,createdAt,updatedAt,headSha,number,attempt \ - --limit 500 \ - | jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress") | select(.event=="pull_request")') - - if [ -z "$active_runs" ]; then - 
continue - fi - - # Process each run - echo "$active_runs" | while read -r run; do - run_id=$(echo "$run" | jq -r '.databaseId') - run_status=$(echo "$run" | jq -r '.status') - created_at=$(echo "$run" | jq -r '.createdAt') - head_sha=$(echo "$run" | jq -r '.headSha') - run_number=$(echo "$run" | jq -r '.number') - run_attempt=$(echo "$run" | jq -r '.attempt // 1') - - # Get detailed run information including jobs - run_details=$(gh api "repos/$REPO/actions/runs/$run_id" 2>/dev/null || true) - - if [ -z "$run_details" ]; then - continue - fi - - head_owner=$(echo "$run_details" | jq -r '.head_repository.owner.login // empty') - head_branch=$(echo "$run_details" | jq -r '.head_branch // empty') - - if [ -z "$head_owner" ] || [ -z "$head_branch" ]; then - continue - fi - - # Find PR number - pr_number=$(gh api "repos/$REPO/pulls?state=open&head=${head_owner}:${head_branch}" \ - --jq '.[0].number // empty' 2>/dev/null || true) - - if [ -z "$pr_number" ]; then - continue - fi - - # Get jobs for this run (with pagination to avoid missing jobs) - jobs=$(gh api "repos/$REPO/actions/runs/$run_id/jobs" --paginate --jq '.jobs[]' | jq -s '.') - - running_jobs=$(echo "$jobs" | jq '[.[] | select(.status=="in_progress")] | length') - queued_jobs=$(echo "$jobs" | jq '[.[] | select(.status=="queued" or .status=="waiting")] | length') - - # Get runner info for running jobs - runners=$(echo "$jobs" | jq -r '.[] | select(.status=="in_progress") | .runner_name // "N/A"' | paste -sd "," -) - - # Calculate queue time - current_time=$(date -u +%s) - created_time=$(date -u -d "$created_at" +%s 2>/dev/null || echo "$current_time") - queue_time=$((current_time - created_time)) - queue_minutes=$((queue_time / 60)) - - # Store data in temporary file - echo "$pr_number|$workflow_file|$run_id|$run_status|$running_jobs|$queued_jobs|$runners|$queue_minutes|$created_at|$head_sha|$run_attempt" >> "$pr_data_file" - done - done - - echo "" - echo "=========================================" - echo "📊 
Active PRs Summary" - echo "=========================================" - echo "" - - if [ ! -s "$pr_data_file" ]; then - echo "✅ No active PR runs found" - rm -f "$pr_data_file" - exit 0 - fi - - # Get unique PR numbers - pr_numbers=$(cat "$pr_data_file" | cut -d'|' -f1 | sort -u) - - # Separate high priority and normal PRs - high_priority_prs=() - normal_prs=() - - for pr_num in $pr_numbers; do - labels=$(gh pr view "$pr_num" --repo "$REPO" --json labels \ - | jq -r '.labels[].name' 2>/dev/null || true) - - if echo "$labels" | grep -Fxq "high priority"; then - high_priority_prs+=($pr_num) - else - normal_prs+=($pr_num) - fi - done - - # Combine: high priority first, then normal - sorted_pr_numbers=("${high_priority_prs[@]}" "${normal_prs[@]}") - - pr_count=0 - total_running=0 - total_queued=0 - - for pr_num in "${sorted_pr_numbers[@]}"; do - pr_count=$((pr_count + 1)) - - # Get PR details - pr_info=$(gh pr view "$pr_num" --repo "$REPO" --json title,author,labels,url 2>/dev/null || true) - - if [ -z "$pr_info" ]; then - continue - fi - - pr_title=$(echo "$pr_info" | jq -r '.title') - pr_author=$(echo "$pr_info" | jq -r '.author.login') - pr_url=$(echo "$pr_info" | jq -r '.url') - pr_labels=$(echo "$pr_info" | jq -r '.labels[].name' | paste -sd ", " -) - - if [ -z "$pr_labels" ]; then - pr_labels="(no labels)" - fi - - # Add priority indicator - priority_indicator="" - if echo "$pr_labels" | grep -q "high priority"; then - priority_indicator="🔴 [HIGH PRIORITY] " - fi - - echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" - echo "🔗 ${priority_indicator}PR #$pr_num: $pr_title" - echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━" - echo "👤 Author: $pr_author" - echo "🏷️ Labels: $pr_labels" - echo "🔗 URL: $pr_url" - echo "" - - # Get all runs for this PR - pr_runs=$(grep "^$pr_num|" "$pr_data_file") - - pr_running_total=0 - pr_queued_total=0 - - echo "$pr_runs" | while read -r line; do - workflow=$(echo "$line" | cut -d'|' -f2) - run_id=$(echo "$line" | cut -d'|' -f3) - 
status=$(echo "$line" | cut -d'|' -f4) - running=$(echo "$line" | cut -d'|' -f5) - queued=$(echo "$line" | cut -d'|' -f6) - runners=$(echo "$line" | cut -d'|' -f7) - queue_min=$(echo "$line" | cut -d'|' -f8) - created=$(echo "$line" | cut -d'|' -f9) - attempt=$(echo "$line" | cut -d'|' -f11) - - pr_running_total=$((pr_running_total + running)) - pr_queued_total=$((pr_queued_total + queued)) - - run_url="https://github.com/$REPO/actions/runs/$run_id" - - # Calculate retry count for this specific run - retry_count=$((attempt - 1)) - - # Show retry indicator - retry_indicator="" - if [ "$retry_count" -gt 0 ]; then - retry_indicator=" 🔄 Retry #$retry_count" - fi - - echo " 📦 Workflow: $workflow (Run #$run_id)$retry_indicator" - echo " Status: $status" - echo " 🟢 Running jobs: $running" - echo " 🟡 Queued jobs: $queued" - - if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then - echo " 🖥️ Runners: $runners" - fi - - if [ "$queue_min" -gt 0 ]; then - echo " ⏱️ Queue time: ${queue_min} minutes" - fi - - echo " 🔗 Run URL: $run_url" - echo "" - done - - # Summary for this PR - pr_running_total=$(grep "^$pr_num|" "$pr_data_file" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}') - pr_queued_total=$(grep "^$pr_num|" "$pr_data_file" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}') - - total_running=$((total_running + pr_running_total)) - total_queued=$((total_queued + pr_queued_total)) - - echo " 📊 PR Total: $pr_running_total running, $pr_queued_total queued" - echo "" - done - - # Overall summary - echo "=========================================" - echo "📈 Overall Summary" - echo "=========================================" - echo "Total PRs with active runs: $pr_count" - echo "Total running jobs: $total_running" - echo "Total queued jobs: $total_queued" - echo "=========================================" - - # Cleanup - rm -f "$pr_data_file" diff --git a/.github/workflows/nightly-72-gpu-gb200.yml b/.github/workflows/nightly-72-gpu-gb200.yml new file mode 100644 index 
000000000000..51ad81fc687f --- /dev/null +++ b/.github/workflows/nightly-72-gpu-gb200.yml @@ -0,0 +1,465 @@ +name: Nightly Test (GB200 72GPU) + +# NOTE: Nightly (schedule) runs require no approval. +# Manual (workflow_dispatch) runs are gated by the gb200-ci environment +# to prevent individuals from queuing arbitrary jobs on the shared GB200 cluster. +on: + schedule: + - cron: '0 2 * * *' # 2 AM UTC daily (offset from other nightly runs) + workflow_dispatch: # allow manual trigger; gated by gb200-ci environment + inputs: + image: + description: 'Optional. SGLang Docker image to benchmark. Leave empty for the default nightly image. Mutually exclusive with pr_number and sglang_branch.' + required: false + default: '' + pr_number: + description: 'Optional. PR number to build from (works for PRs from forks too, via refs/pull//head). Preferred over sglang_branch when a PR exists. Mutually exclusive with image and sglang_branch.' + required: false + default: '' + sglang_branch: + description: 'Optional. Branch name on sgl-project/sglang to build from (use when no PR is open yet). For fork branches, open a PR and use pr_number instead. Mutually exclusive with image and pr_number.' + required: false + default: '' + configs: + description: 'Optional. Comma-separated names to run only a subset. Format: {model-prefix}-{precision}-{isl}{osl}-{recipe}. E.g. "dsr1-fp8-1k1k-max-tpt" or "dsr1-fp8-1k1k-max-tpt,dsr1-fp4-1k1k-mid-curve". Leave empty to run all. Available names are listed in the setup job log.' + required: false + default: '' + +concurrency: + group: nightly-test-gb200 + cancel-in-progress: false + +env: + SGLANG_IS_IN_CI: true + SRT_SLURM_BRANCH: sglang-nightly-regression + SLURM_PARTITION: batch + SLURM_ACCOUNT: sglang + # Docker Hub repo for ephemeral branch/PR build images (kept separate from + # the released `lmsysorg/sglang` repo). Cleaned up by `cleanup-image`. 
+ CI_IMAGE_REPO: lmsysorg/sglang-staging + # How many most recent staging tags to retain after each run. + CI_IMAGE_KEEP_TAGS: 60 + +jobs: + # --------------------------------------------------------------------------- + # Reject conflicting inputs early. At most one of `image`, `pr_number`, + # `sglang_branch` may be set — they select different image sources. Only runs + # on manual dispatch; all downstream jobs chain through this so invalid + # inputs halt the pipeline before cluster resources are reserved. + # --------------------------------------------------------------------------- + validate-inputs: + if: github.repository == 'sgl-project/sglang' && github.event_name == 'workflow_dispatch' + runs-on: ubuntu-latest + steps: + - name: Reject conflicting inputs + run: | + IMAGE="${{ inputs.image }}" + PR="${{ inputs.pr_number }}" + BRANCH="${{ inputs.sglang_branch }}" + sources=0 + [ -n "$IMAGE" ] && sources=$((sources + 1)) + [ -n "$PR" ] && sources=$((sources + 1)) + [ -n "$BRANCH" ] && sources=$((sources + 1)) + if [ "$sources" -gt 1 ]; then + echo "::error::Specify at most one of 'image' ('$IMAGE'), 'pr_number' ('$PR'), or 'sglang_branch' ('$BRANCH')." + exit 1 + fi + if [ -n "$PR" ] && ! echo "$PR" | grep -Eq '^[0-9]+$'; then + echo "::error::pr_number must be a positive integer, got '$PR'." + exit 1 + fi + + # --------------------------------------------------------------------------- + # Reads scripts/ci/slurm/nightly-configs.yaml and generates one matrix entry + # per recipe YAML. Each job runs the full concurrency sweep defined in the + # recipe as a single Slurm job. + # To add/remove configs, edit nightly-configs.yaml only. + # --------------------------------------------------------------------------- + setup: + needs: validate-inputs + # Run if validate-inputs succeeded (dispatch) or was skipped (cron). 
+ if: | + always() && github.repository == 'sgl-project/sglang' + && (needs.validate-inputs.result == 'success' || needs.validate-inputs.result == 'skipped') + runs-on: ubuntu-latest + outputs: + matrix: ${{ steps.generate.outputs.matrix }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Generate benchmark matrix + id: generate + env: + CONFIGS_FILTER: ${{ inputs.configs }} + run: | + pip install pyyaml -q + + # List all available config names first so they're visible in logs + # even when a filter rejects an unknown name. + ALL_MATRIX=$(python3 scripts/ci/slurm/generate_matrix.py \ + scripts/ci/slurm/nightly-configs.yaml --runner gb200) + echo "Available config names for runner gb200:" + echo "$ALL_MATRIX" | python3 -c "import json,sys; [print(f' - {e[\"name\"]}') for e in json.load(sys.stdin)]" + + FILTER_ARG=() + if [ -n "$CONFIGS_FILTER" ]; then + echo "" + echo "Filtering to: $CONFIGS_FILTER" + FILTER_ARG=(--filter "$CONFIGS_FILTER") + fi + MATRIX=$(python3 scripts/ci/slurm/generate_matrix.py \ + scripts/ci/slurm/nightly-configs.yaml --runner gb200 "${FILTER_ARG[@]}") + echo "matrix=$MATRIX" >> $GITHUB_OUTPUT + + # --------------------------------------------------------------------------- + # When pr_number or sglang_branch is provided, build an ARM64 (GB200) image + # from that ref and push it to Docker Hub under lmsysorg/sglang-staging. + # Uses refs/pull//head for PRs so fork PRs work without cross-repo auth. + # Old staging tags are pruned by `cleanup-image` at the end of the run. + # Skipped on nightly (cron) runs and manual runs with neither pr_number nor + # sglang_branch. 
+ # --------------------------------------------------------------------------- + build-image: + needs: [validate-inputs, setup] + if: | + github.repository == 'sgl-project/sglang' && github.event_name == 'workflow_dispatch' + && (inputs.pr_number != '' || inputs.sglang_branch != '') + runs-on: arm-docker-build-node + outputs: + image_ref: ${{ steps.build.outputs.image_ref }} + image_tag: ${{ steps.build.outputs.image_tag }} + steps: + # Self-hosted runners retain the workspace across jobs. Prior `docker buildx` + # runs on this node leave root-owned build artifacts (e.g. sgl-kernel/build/) + # that actions/checkout cannot remove, causing EACCES on rmdir. Wipe them + # via a throwaway root container before checkout recreates the workspace. + - name: Clean workspace (remove root-owned files from prior runs) + run: | + docker run --rm -v "${{ github.workspace }}:/workspace" alpine:3 \ + sh -c 'rm -rf /workspace/..?* /workspace/.[!.]* /workspace/*' || true + + - name: Checkout code + uses: actions/checkout@v4 + with: + # PRs (including fork PRs) resolve via refs/pull//head on upstream. + # Otherwise fall back to the branch name on sgl-project/sglang. 
+ ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || inputs.sglang_branch }} + + - name: Verify checkout + env: + PR_NUMBER: ${{ inputs.pr_number }} + BRANCH: ${{ inputs.sglang_branch }} + run: | + SHA=$(git rev-parse HEAD) + echo "Commit SHA: $SHA" + echo "Author: $(git log -1 --format='%an <%ae>')" + echo "Date: $(git log -1 --format='%aI')" + echo "Subject: $(git log -1 --format='%s')" + echo "" + if [ -n "$PR_NUMBER" ]; then + echo "Cross-check: https://github.com/sgl-project/sglang/pull/${PR_NUMBER}/commits" + else + echo "Cross-check: https://github.com/sgl-project/sglang/commits/${BRANCH}" + fi + echo "Commit URL: https://github.com/sgl-project/sglang/commit/${SHA}" + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + + - name: Login to Docker Hub + uses: docker/login-action@v3 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + + - name: Build and push ARM64 image + id: build + run: | + if [ -n "${{ inputs.pr_number }}" ]; then + TAG_STUB="pr-${{ inputs.pr_number }}" + SOURCE_DESC="PR #${{ inputs.pr_number }}" + else + TAG_STUB=$(echo "${{ inputs.sglang_branch }}" | tr '/' '-' | tr -cd '[:alnum:]._-') + SOURCE_DESC="branch ${{ inputs.sglang_branch }}" + fi + # run_attempt disambiguates "Re-run jobs" so the squash filename + # (derived from the image URL) doesn't collide with a stale one. + TAG="${TAG_STUB}-${{ github.run_id }}-${{ github.run_attempt }}" + IMAGE_REF="${CI_IMAGE_REPO}:${TAG}" + echo "Building ${IMAGE_REF} from ${SOURCE_DESC}" + + docker buildx build \ + --platform linux/arm64 \ + --output type=image,name=${IMAGE_REF},push=true \ + --target framework_final \ + -f docker/Dockerfile \ + --build-arg CUDA_VERSION=13.0.1 \ + --build-arg BUILD_TYPE=all \ + --build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \ + --build-arg GRACE_BLACKWELL=1 \ + --build-arg BRANCH_TYPE=local \ + --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ + --no-cache \ + . 
+ + echo "image_ref=${IMAGE_REF}" >> $GITHUB_OUTPUT + echo "image_tag=${TAG}" >> $GITHUB_OUTPUT + + # --------------------------------------------------------------------------- + # Import Docker images to Lustre squash files once before all benchmark jobs. + # This avoids parallel jobs racing to enroot import the same image. + # When build-image ran, we import the freshly built Docker Hub staging image + # (lmsysorg/sglang-staging is public → no auth needed for enroot pull). + # Otherwise we use the `image` input (or its default public nightly image). + # --------------------------------------------------------------------------- + prepare-image: + needs: [setup, build-image] + if: | + always() && github.repository == 'sgl-project/sglang' + && needs.setup.result == 'success' + && (needs.build-image.result == 'success' || needs.build-image.result == 'skipped') + environment: ${{ github.event_name == 'workflow_dispatch' && 'gb200-ci' || '' }} + runs-on: 72-gpu-gb200 + outputs: + squash_file: ${{ steps.import.outputs.squash_file }} + nginx_squash_file: ${{ steps.import.outputs.nginx_squash_file }} + image: ${{ steps.resolve.outputs.image }} + env: + NGINX_IMAGE: nginx:1.27.4 + steps: + - name: Resolve image to import + id: resolve + run: | + BUILT_IMAGE="${{ needs.build-image.outputs.image_ref }}" + if [ -n "$BUILT_IMAGE" ]; then + echo "Using freshly built image: $BUILT_IMAGE" + echo "image=$BUILT_IMAGE" >> $GITHUB_OUTPUT + else + IMAGE="${{ inputs.image || 'lmsysorg/sglang:dev-cu13' }}" + echo "Using pre-existing image: $IMAGE" + echo "image=$IMAGE" >> $GITHUB_OUTPUT + fi + + - name: Import Docker images to Lustre + id: import + env: + IMAGE: ${{ steps.resolve.outputs.image }} + run: | + SQUASH_FILE="/mnt/lustre01/users-public/sglang-ci/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g')_$(date +%Y%m%d).sqsh" + NGINX_SQUASH_FILE="/mnt/lustre01/users-public/sglang-ci/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + + if [ -f "$SQUASH_FILE" ]; then + echo "Squash file already 
exists, skipping import: $SQUASH_FILE" + else + enroot import -o "$SQUASH_FILE" "docker://$IMAGE" + fi + + if [ -f "$NGINX_SQUASH_FILE" ]; then + echo "Nginx squash file already exists, skipping import: $NGINX_SQUASH_FILE" + else + enroot import -o "$NGINX_SQUASH_FILE" "docker://$NGINX_IMAGE" + fi + + echo "squash_file=$SQUASH_FILE" >> $GITHUB_OUTPUT + echo "nginx_squash_file=$NGINX_SQUASH_FILE" >> $GITHUB_OUTPUT + + nightly-gb200-benchmark: + needs: [setup, prepare-image] + # Use always() + explicit success checks so a skipped transitive upstream + # (e.g. build-image when neither pr_number nor sglang_branch is set) does + # not propagate a skip to this job. Direct deps must still have succeeded. + if: | + always() && github.repository == 'sgl-project/sglang' + && needs.setup.result == 'success' + && needs.prepare-image.result == 'success' + runs-on: 72-gpu-gb200 + strategy: + fail-fast: false + matrix: + config: ${{ fromJson(needs.setup.outputs.matrix) }} + env: + FRAMEWORK: dynamo-sglang + MODEL: ${{ matrix.config.model }} + MODEL_PREFIX: ${{ matrix.config.model_prefix }} + PRECISION: ${{ matrix.config.precision }} + ISL: ${{ matrix.config.isl }} + OSL: ${{ matrix.config.osl }} + CONFIG_FILE: ${{ matrix.config.config_file }} + RESULT_FILENAME: gb200-${{ matrix.config.name }} + MATRIX_CONFIG_NAME: ${{ matrix.config.name }} + SQUASH_FILE: ${{ needs.prepare-image.outputs.squash_file }} + NGINX_SQUASH_FILE: ${{ needs.prepare-image.outputs.nginx_squash_file }} + # S3 log-upload credentials — consumed by srt-slurm's postprocess stage + # to upload /logs after each Slurm job; prefix derived in launch_gb200.sh. 
+ AWS_ACCESS_KEY_ID: ${{ secrets.NV_S3_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.NV_S3_SECRET_ACCESS_KEY }} + S3_BUCKET: ${{ secrets.NV_S3_BUCKET }} + S3_ENDPOINT_URL: ${{ secrets.NV_S3_ENDPOINT_URL }} + + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Clean up prior Slurm jobs from this runner + continue-on-error: true + env: + RUNNER_NAME: ${{ runner.name }} + run: | + STALE_JOBS=$(squeue --noheader --format="%i %j" | grep "${RUNNER_NAME}" | awk '{print $1}') + if [ -n "$STALE_JOBS" ]; then + echo "Cancelling stale jobs: $STALE_JOBS" + scancel $STALE_JOBS + fi + + - name: Launch GB200 benchmark via srt-slurm + timeout-minutes: 360 + env: + RUNNER_NAME: ${{ runner.name }} + run: bash scripts/ci/slurm/launch_gb200.sh + + - name: Process results + if: always() + env: + RUNNER_NAME: ${{ runner.name }} + run: | + pip install tabulate pyyaml -q + SRT_REPO_DIR="/mnt/lustre01/users-public/sglang-ci/workspace/${RUNNER_NAME}/srt-slurm" + for result_file in ${{ github.workspace }}/${RESULT_FILENAME}_*.json; do + [ -f "$result_file" ] || continue + basename_file=$(basename "$result_file") + ctx=$(echo "$basename_file" | sed -n 's/.*_ctx_\([0-9]*\)_gen.*/\1/p') + gen=$(echo "$basename_file" | sed -n 's/.*_gen_\([0-9]*\)\.json/\1/p') + [ -n "$ctx" ] && [ -n "$gen" ] || continue + RESULT_FILENAME="${result_file%.json}" PREFILL_GPUS="$ctx" DECODE_GPUS="$gen" \ + RECIPE_FILE="$SRT_REPO_DIR/$CONFIG_FILE" \ + python3 scripts/ci/slurm/process_result.py + done + + - name: Upload results + if: always() + uses: actions/upload-artifact@v4 + with: + name: gb200-${{ matrix.config.name }}-${{ github.run_id }} + path: | + ${{ github.workspace }}/*.json + ${{ github.workspace }}/multinode_server_logs.tar.gz + retention-days: 30 + if-no-files-found: warn + + - name: Analyze logs with AI on failure + if: failure() + continue-on-error: true + env: + MODAL_TOKEN_ID: ${{ secrets.NV_MODAL_TOKEN_ID }} + MODAL_TOKEN_SECRET: ${{ secrets.NV_MODAL_TOKEN_SECRET }} + 
run: | + TARBALL="${{ github.workspace }}/multinode_server_logs.tar.gz" + if [ -f "$TARBALL" ]; then + uv run --with modal python scripts/ci/slurm/analyze_logs_with_modal.py \ + --tarball "$TARBALL" \ + --job-id "${{ matrix.config.name }}-${{ github.run_id }}" \ + --output "${{ github.workspace }}/ai_analysis.md" + if [ -f "${{ github.workspace }}/ai_analysis.md" ]; then + echo "## AI Log Analysis" >> $GITHUB_STEP_SUMMARY + cat "${{ github.workspace }}/ai_analysis.md" >> $GITHUB_STEP_SUMMARY + fi + else + echo "No log tarball found, skipping analysis" + fi + + - name: Upload AI analysis to S3 + if: failure() + continue-on-error: true + env: + ISL: ${{ matrix.config.isl }} + OSL: ${{ matrix.config.osl }} + run: | + ANALYSIS="${{ github.workspace }}/ai_analysis.md" + [ -f "$ANALYSIS" ] || { echo "no ai_analysis.md, skipping"; exit 0; } + case "${{ github.event_name }}" in + schedule) TRIGGER=cron ;; + workflow_dispatch) TRIGGER=manual ;; + *) TRIGGER="${{ github.event_name }}" ;; + esac + fmt() { if [ $(( $1 % 1024 )) -eq 0 ]; then echo "$(( $1 / 1024 ))k"; else echo "$1"; fi; } + SEQ_LEN="$(fmt "$ISL")$(fmt "$OSL")" + KEY="${TRIGGER}/${{ github.run_id }}-${{ github.run_attempt }}/${SEQ_LEN}/${{ matrix.config.name }}/ai_analysis.md" + aws --endpoint-url "$S3_ENDPOINT_URL" s3 cp "$ANALYSIS" "s3://${S3_BUCKET}/${KEY}" + echo "uploaded to s3://${S3_BUCKET}/${KEY}" + + - name: Clean up Slurm jobs on failure/cancel + if: failure() || cancelled() + continue-on-error: true + env: + RUNNER_NAME: ${{ runner.name }} + run: | + ACTIVE_JOBS=$(squeue --noheader --format="%i %j" | grep "${RUNNER_NAME}" | awk '{print $1}') + if [ -n "$ACTIVE_JOBS" ]; then + echo "Cancelling jobs: $ACTIVE_JOBS" + scancel $ACTIVE_JOBS + fi + + collect-results: + needs: nightly-gb200-benchmark + if: github.repository == 'sgl-project/sglang' && always() + runs-on: ubuntu-latest + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Download artifacts + uses: 
actions/download-artifact@v4 + with: + path: results/ + pattern: gb200-* + + - name: Print summary + run: | + pip install tabulate -q + python3 scripts/ci/slurm/summarize.py results/ >> $GITHUB_STEP_SUMMARY + + # --------------------------------------------------------------------------- + # Prune old tags in the staging repo, keeping only the most recent N. Mirrors + # the pattern used by release-docker-dev.yml. Runs after benchmarks so the + # freshly built image (whose sqsh is already on Lustre) becomes a regular + # aged-out tag over time. No-op when the repo has ≤ CI_IMAGE_KEEP_TAGS tags. + # --------------------------------------------------------------------------- + cleanup-image: + needs: [build-image, nightly-gb200-benchmark] + if: always() && needs.build-image.result == 'success' + runs-on: ubuntu-latest + steps: + - name: Prune old staging tags on Docker Hub + env: + DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }} + DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }} + run: | + TOKEN=$(curl -s -H "Content-Type: application/json" \ + -X POST -d "{\"username\": \"${DOCKERHUB_USERNAME}\", \"password\": \"${DOCKERHUB_TOKEN}\"}" \ + https://hub.docker.com/v2/users/login/ | jq -r .token) + if [ -z "$TOKEN" ] || [ "$TOKEN" = "null" ]; then + echo "::error::Docker Hub login failed" + exit 1 + fi + + TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" \ + "https://hub.docker.com/v2/repositories/${CI_IMAGE_REPO}/tags/?page_size=100") + + # Sort tags by last_updated (newest first), keep names only. + TAGS=$(echo "$TAGS_RESPONSE" | jq -r \ + '.results[] | "\(.last_updated)|\(.name)"' \ + | sort -r | cut -d'|' -f2) + + TAG_COUNT=$(echo "$TAGS" | grep -c . 
|| true) + if [ "$TAG_COUNT" -gt "$CI_IMAGE_KEEP_TAGS" ]; then + echo "Found $TAG_COUNT tags in ${CI_IMAGE_REPO}, keeping $CI_IMAGE_KEEP_TAGS most recent" + TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +$((CI_IMAGE_KEEP_TAGS + 1))) + for tag in $TAGS_TO_DELETE; do + echo "Deleting ${CI_IMAGE_REPO}:${tag}" + curl -s -X DELETE -H "Authorization: JWT $TOKEN" \ + "https://hub.docker.com/v2/repositories/${CI_IMAGE_REPO}/tags/${tag}/" + done + else + echo "Only $TAG_COUNT tags in ${CI_IMAGE_REPO}, no cleanup needed" + fi diff --git a/.github/workflows/nightly-link-check.yml b/.github/workflows/nightly-link-check.yml new file mode 100644 index 000000000000..63d905cdad8a --- /dev/null +++ b/.github/workflows/nightly-link-check.yml @@ -0,0 +1,32 @@ +name: Nightly Link Check + +on: + schedule: + - cron: "0 2 * * *" + workflow_dispatch: + +concurrency: + group: nightly-link-check-${{ github.ref }} + cancel-in-progress: true + +jobs: + lychee-online: + if: github.repository == 'sgl-project/sglang' + runs-on: ubuntu-latest + timeout-minutes: 20 + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Run lychee online link checks + uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2 + with: + fail: true + args: >- + --config .github/linters/lychee-ci.toml + README.md + docs/**/*.md + docs/**/*.rst + docs/**/*.ipynb + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/nightly-test-amd-rocm720.yml b/.github/workflows/nightly-test-amd-rocm720.yml new file mode 100644 index 000000000000..1adc2e618390 --- /dev/null +++ b/.github/workflows/nightly-test-amd-rocm720.yml @@ -0,0 +1,1613 @@ +name: Nightly Test (AMD ROCm 7.2) + +on: + schedule: + - cron: '30 17 * * *' + push: + branches: + - main + paths: + - "python/sglang/version.py" + workflow_dispatch: + inputs: + aiter_ref: + description: 'Override AITER commit (optional, leave empty to use Dockerfile default)' + required: false + type: string + default: '' + 
continue_on_error: + description: 'Continue on error (do not fail the workflow on test failures)' + required: false + type: boolean + default: true + job_select: + description: 'Select a job to run from dropdown (choose "all" to run all jobs)' + required: false + type: choice + default: 'all' + options: + - 'all' + - nightly-test-1-gpu-unit-rocm720 + - nightly-accuracy-2-gpu-rocm720 + - nightly-accuracy-2-gpu-vlm-rocm720 + - nightly-perf-2-gpu-text-rocm720 + - nightly-perf-2-gpu-vlm-rocm720 + - nightly-4-gpu-rocm720 + - nightly-accuracy-8-gpu-rocm720 + - nightly-8-gpu-grok1-int4-rocm720 + - nightly-8-gpu-grok2-rocm720 + - nightly-8-gpu-deepseek-v31-rocm720 + - nightly-8-gpu-deepseek-v32-rocm720 + - nightly-8-gpu-deepseek-v32-mtp-rocm720 + - nightly-8-gpu-deepseek-v3-kv-fp8-rocm720 + - nightly-8-gpu-kimi-k26-rocm720 + - nightly-8-gpu-qwen3-235b-rocm720 + - nightly-8-gpu-qwen35-rocm720 + - nightly-8-gpu-glm51-rocm720 + - nightly-8-gpu-minimax-m27-rocm720 + - nightly-1-gpu-zimage-turbo-rocm720 + - nightly-test-1-gpu-mi35x-rocm720 + - nightly-accuracy-8-gpu-mi35x-rocm720 + - nightly-8-gpu-mi35x-grok1-int4-rocm720 + - nightly-8-gpu-mi35x-grok2-rocm720 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-rocm720 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8-rocm720 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion-rocm720 + - nightly-accuracy-8-gpu-mi35x-deepseek-v32-rocm720 + - nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp-rocm720 + - nightly-perf-8-gpu-mi35x-deepseek-v32-basic-rocm720 + - nightly-perf-8-gpu-mi35x-deepseek-v32-mtp-rocm720 + - nightly-8-gpu-mi35x-deepseek-v4-flash-rocm720 + - nightly-8-gpu-mi35x-deepseek-v4-pro-rocm720 + - nightly-8-gpu-mi35x-kimi-k26-rocm720 + - nightly-8-gpu-mi35x-qwen3-235b-mxfp4-rocm720 + - nightly-8-gpu-mi35x-qwen35-rocm720 + - nightly-8-gpu-mi35x-glm51-rocm720 + - nightly-8-gpu-mi35x-glm5-mxfp4-rocm720 + job_filter: + description: 'Or type comma-separated job names (overrides dropdown if non-empty)' + required: false + type: string + 
default: '' + workflow_call: + inputs: + ref: + description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.' + required: false + type: string + default: '' + aiter_ref: + description: 'Override AITER commit (optional, leave empty to use Dockerfile default)' + required: false + type: string + default: '' + job_filter: + description: 'Select which job to run (leave empty or "all" to run all jobs)' + required: false + type: string + default: 'all' + continue_on_error: + description: 'Continue on error (do not fail the workflow on test failures)' + required: false + type: boolean + default: true + +env: + AITER_COMMIT_OVERRIDE: ${{ inputs.aiter_ref }} + DOCKERHUB_AMD_USERNAME: ${{ secrets.DOCKERHUB_AMD_USERNAME }} + DOCKERHUB_AMD_TOKEN: ${{ secrets.DOCKERHUB_AMD_TOKEN }} + +concurrency: + # When called via workflow_call with ref set, use a unique group per caller run to avoid + # collisions with direct schedule/push triggers. We use inputs.ref (not github.event_name) + # to detect this, because github.event_name inherits from the caller in workflow_call. + # Manual dispatch runs also get unique groups so they never cancel each other. + group: nightly-test-amd-rocm720-${{ github.event_name == 'workflow_dispatch' && format('manual-{0}', github.run_id) || inputs.ref && format('caller-{0}', github.run_id) || github.ref }} + cancel-in-progress: ${{ !inputs.ref && github.event_name != 'workflow_call' && github.event_name != 'workflow_dispatch' }} + +jobs: + # ============================================== MI30x ROCm 7.2 Unit Tests ============================================== + # 1-GPU Unit Tests - LoRA, debug utils, scheduler, etc. 
(MI30x ROCm 7.2) + nightly-test-1-gpu-unit-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-test-1-gpu-unit-rocm720,')) + runs-on: linux-mi325-1gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Nightly Unit Test ROCm 7.2 (1-GPU) + timeout-minutes: 90 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-1-gpu --nightly --timeout-per-file 900 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # ============================================== MI30x ROCm 7.2 Accuracy Tests ============================================== + # 2-GPU Accuracy Tests - GSM8K eval (MI30x ROCm 7.2) + nightly-accuracy-2-gpu-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-2-gpu-rocm720,')) + runs-on: linux-mi325-2gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Nightly Test ROCm 7.2 (2-GPU) + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 2-GPU VLM Accuracy Tests - Vision-Language Models MMMU evaluation (ROCm 7.2) + nightly-accuracy-2-gpu-vlm-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-2-gpu-vlm-rocm720,')) + runs-on: linux-mi325-2gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Nightly Accuracy Test ROCm 7.2 (2-GPU VLM MMMU) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-2-gpu-vlm --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 2-GPU Text Models Performance Tests (ROCm 7.2) + nightly-perf-2-gpu-text-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-perf-2-gpu-text-rocm720,')) + runs-on: linux-mi325-2gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Performance Test ROCm 7.2 (2-GPU Text Models) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-perf-text-2-gpu --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 2-GPU VLM Performance Tests (ROCm 7.2) + nightly-perf-2-gpu-vlm-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-perf-2-gpu-vlm-rocm720,')) + runs-on: linux-mi325-2gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Performance Test ROCm 7.2 (2-GPU VLM Models) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-perf-vlm-2-gpu --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # ============================================== MI30x ROCm 7.2 4-GPU Tests ============================================== + # 4-GPU Nightly Tests - Dumper/Comparator E2E, VLM Encoder DP (ROCm 7.2) + nightly-4-gpu-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-4-gpu-rocm720,')) + runs-on: linux-mi325-4gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Nightly Test ROCm 7.2 (4-GPU) + timeout-minutes: 120 + run: | + > github_summary.md + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-4-gpu --nightly --continue-on-error --timeout-per-file 3600 || TEST_EXIT_CODE=$?
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU Accuracy Tests - GPT-OSS, Grok1-FP8 (ROCm 7.2) + nightly-accuracy-8-gpu-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-8-gpu-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test ROCm 7.2 (8-GPU GPT-OSS) + timeout-minutes: 180 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-gpt-oss --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Accuracy Test ROCm 7.2 (8-GPU Grok1-FP8) + timeout-minutes: 60 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok1-fp8 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # ============================================== MI30x ROCm 7.2 Combined Accuracy + Performance Tests ============================================== + # 8-GPU Grok1-INT4 (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-grok1-int4-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-grok1-int4-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test ROCm 7.2 (8-GPU Grok1-INT4) + timeout-minutes: 60 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok1-int4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test ROCm 7.2 (8-GPU Grok1-INT4) + timeout-minutes: 60 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-grok1-int4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU Grok2 (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-grok2-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-grok2-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test ROCm 7.2 (8-GPU Grok2) + timeout-minutes: 60 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok2 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test ROCm 7.2 (8-GPU Grok2) + timeout-minutes: 60 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-grok2 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU DeepSeek-V3.1 (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-deepseek-v31-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-deepseek-v31-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test ROCm 7.2 (8-GPU DeepSeek-V3.1) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-deepseek-v31 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test ROCm 7.2 (8-GPU DeepSeek-V3.1) + timeout-minutes: 300 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_ROCM700A=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-deepseek-v31 --nightly --timeout-per-file 18000 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU DeepSeek-V3.2 (Basic Accuracy + Perf) ROCm 7.2 + nightly-8-gpu-deepseek-v32-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-deepseek-v32-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test ROCm 7.2 (8-GPU DeepSeek-V3.2 Basic) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-deepseek-v32 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test ROCm 7.2 (8-GPU DeepSeek-V3.2 Basic) + timeout-minutes: 150 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-deepseek-v32-basic --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU DeepSeek-V3.2 MTP (MTP Accuracy + Perf) ROCm 7.2 + nightly-8-gpu-deepseek-v32-mtp-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-deepseek-v32-mtp-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test ROCm 7.2 (8-GPU DeepSeek-V3.2 MTP) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-deepseek-v32-mtp --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test ROCm 7.2 (8-GPU DeepSeek-V3.2 MTP) + timeout-minutes: 180 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-deepseek-v32-mtp --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU DeepSeek-V3 KV FP8 (Basic + MTP with --kv-cache-dtype fp8_e4m3) ROCm 7.2 + nightly-8-gpu-deepseek-v3-kv-fp8-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-deepseek-v3-kv-fp8-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: DeepSeek-V3 KV FP8 Test ROCm 7.2 (8-GPU Basic + MTP) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-deepseek-v3-kv-fp8 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU Kimi-K2.6 (Accuracy) ROCm 7.2 + nightly-8-gpu-kimi-k26-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-kimi-k26-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test ROCm 7.2 (8-GPU Kimi-K2.6) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-kimi-k26 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU Qwen3-235B (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-qwen3-235b-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-qwen3-235b-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test + Performance Test ROCm 7.2 (8-GPU Qwen3) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-8-gpu-qwen3-235b --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU Qwen 3.5 (Accuracy + Performance combined) ROCm 7.2 + nightly-8-gpu-qwen35-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-qwen35-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build --skip-test-time-deps + bash scripts/ci/amd/amd_ci_exec.sh pip install mistral-common "lm-eval[api]" + + - name: Accuracy Test ROCm 7.2 (8-GPU Qwen 3.5) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-qwen35 --nightly --timeout-per-file 3600 --continue-on-error || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test ROCm 7.2 (8-GPU Qwen 3.5 FP8) + timeout-minutes: 120 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-qwen35-fp8 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU GLM-5.1 (Accuracy + Performance combined) ROCm 7.2 + nightly-8-gpu-glm51-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-glm51-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git@96f807a33b75 + + - name: Accuracy Test ROCm 7.2 (8-GPU GLM-5.1 NSA) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-glm51 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test ROCm 7.2 (8-GPU GLM-5.1) + timeout-minutes: 120 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-glm51 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU MiniMax-M2.7 (Accuracy + Performance combined, replaces M2.5) ROCm 7.2 + nightly-8-gpu-minimax-m27-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-minimax-m27-rocm720,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + + - name: Accuracy Test ROCm 7.2 (8-GPU MiniMax-M2.7) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-minimax-m27 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test ROCm 7.2 (8-GPU MiniMax-M2.7) + timeout-minutes: 120 + continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-minimax-m27 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # ============================================== MI30x ROCm 7.2 Diffusion Tests ============================================== + # 1-GPU Z-Image-Turbo (Diffusion T2I) ROCm 7.2 + nightly-1-gpu-zimage-turbo-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-1-gpu-zimage-turbo-rocm720,')) + runs-on: linux-mi325-1gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Z-Image-Turbo Diffusion Test ROCm 7.2 (1-GPU) + timeout-minutes: 45 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + -e SGLANG_DIFFUSION_ARTIFACT_DIR="/sglang-checkout/diffusion-artifacts" \ + pytest test/registered/amd/test_zimage_turbo.py -v -s ${{ 
inputs.continue_on_error && '|| true' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Upload generated images + if: always() + uses: actions/upload-artifact@v4 + with: + name: zimage-turbo-outputs-rocm720 + path: diffusion-artifacts/ + if-no-files-found: ignore + retention-days: 30 + + # ============================================== MI35x ROCm 7.2 Tests ============================================== + # MI35x 1-GPU ROCm 7.2 tests + nightly-test-1-gpu-mi35x-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-test-1-gpu-mi35x-rocm720,')) + runs-on: linux-mi35x-gpu-1 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Nightly Test MI35x ROCm 7.2 (1-GPU) + timeout-minutes: 90 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-1-gpu-mi35x --nightly --timeout-per-file 900 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU Accuracy Tests - GPT-OSS (ROCm 7.2) + nightly-accuracy-8-gpu-mi35x-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-8-gpu-mi35x-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x ROCm 7.2 (8-GPU GPT-OSS) + timeout-minutes: 180 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU Grok1-INT4 (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-mi35x-grok1-int4-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-grok1-int4-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x ROCm 7.2 (8-GPU Grok1-INT4) + timeout-minutes: 60 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-grok1-int4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x ROCm 7.2 (8-GPU Grok1-INT4) + timeout-minutes: 60 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-grok1-int4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU Grok2 (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-mi35x-grok2-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-grok2-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x ROCm 7.2 (8-GPU Grok2) + timeout-minutes: 60 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-grok2 --nightly 
--timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x ROCm 7.2 (8-GPU Grok2) + timeout-minutes: 60 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-grok2 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-R1-MXFP4 (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-mi35x-deepseek-r1-mxfp4-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-deepseek-r1-mxfp4-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x ROCm 7.2 (8-GPU DeepSeek-R1-MXFP4) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e 
GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-r1-mxfp4 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x ROCm 7.2 (8-GPU DeepSeek-R1-MXFP4) + timeout-minutes: 300 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 registered/amd/perf/mi35x/test_deepseek_r1_mxfp4_perf_mi35x.py || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-R1-MXFP4 KV FP8 (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x ROCm 7.2 (8-GPU DeepSeek-R1-MXFP4 KV FP8) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w 
/sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x ROCm 7.2 (8-GPU DeepSeek-R1-MXFP4 KV FP8) + timeout-minutes: 300 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 registered/amd/perf/mi35x/test_deepseek_r1_mxfp4_kv_fp8_perf_mi35x.py || TEST_EXIT_CODE=$? + echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-R1-MXFP4 AllReduce Fusion (Accuracy + Performance) ROCm 7.2 + nightly-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x ROCm 7.2 (8-GPU DeepSeek-R1-MXFP4 AllReduce Fusion) + timeout-minutes: 180 + run: | + > 
github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x ROCm 7.2 (8-GPU DeepSeek-R1-MXFP4 AllReduce Fusion) + timeout-minutes: 300 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 registered/amd/perf/mi35x/test_deepseek_r1_mxfp4_ar_fusion_perf_mi35x.py || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-V3.2 Accuracy Test (ROCm 7.2) + nightly-accuracy-8-gpu-mi35x-deepseek-v32-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-8-gpu-mi35x-deepseek-v32-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x ROCm 7.2 (8-GPU DeepSeek-V3.2) + 
timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-v32 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-V3.2 TP+MTP Accuracy Test (ROCm 7.2) + nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker (ROCm 7.2) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x ROCm 7.2 (8-GPU DeepSeek-V3.2 TP+MTP) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-deepseek-v32-mtp --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU DeepSeek-V3.2 Performance Test (Basic) ROCm 7.2
+  nightly-perf-8-gpu-mi35x-deepseek-v32-basic-rocm720:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-perf-8-gpu-mi35x-deepseek-v32-basic-rocm720,'))
+    runs-on: linux-mi35x-gpu-8
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.sha }}
+
+      - name: Setup docker (ROCm 7.2)
+        run: |
+          touch github_summary.md
+          bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies
+        run: |
+          bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps
+          # Install tabulate for run_suite.py (missing in MI35x container)
+          bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
+
+      - name: Performance Test MI35x ROCm 7.2 (8-GPU DeepSeek-V3.2 Basic)
+        timeout-minutes: 150
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-deepseek-v32-basic --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU Kimi-K2.6 (Accuracy) ROCm 7.2
+  nightly-8-gpu-mi35x-kimi-k26-rocm720:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-kimi-k26-rocm720,'))
+    runs-on: linux-mi35x-gpu-8
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.sha }}
+
+      - name: Setup docker (ROCm 7.2)
+        run: |
+          touch github_summary.md
+          bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies
+        run: |
+          bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps
+          # Install tabulate for run_suite.py (missing in MI35x container)
+          bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
+
+      - name: Accuracy Test MI35x ROCm 7.2 (8-GPU Kimi-K2.6)
+        timeout-minutes: 180
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-kimi-k26 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU Qwen3-235B-MXFP4 (Accuracy + Performance) ROCm 7.2
+  nightly-8-gpu-mi35x-qwen3-235b-mxfp4-rocm720:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-qwen3-235b-mxfp4-rocm720,'))
+    runs-on: linux-mi35x-gpu-8
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.sha }}
+
+      - name: Setup docker (ROCm 7.2)
+        run: |
+          touch github_summary.md
+          bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies
+        run: |
+          bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps
+          # Install tabulate for run_suite.py (missing in MI35x container)
+          bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
+
+      - name: Accuracy Test + Performance Test MI35x ROCm 7.2 (8-GPU Qwen3-235B-MXFP4)
+        timeout-minutes: 120
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-8-gpu-mi35x-qwen3-235b-mxfp4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU Qwen 3.5 (Accuracy + Performance combined) ROCm 7.2
+  nightly-8-gpu-mi35x-qwen35-rocm720:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-qwen35-rocm720,'))
+    runs-on: linux-mi35x-gpu-8
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.sha }}
+
+      - name: Setup docker (ROCm 7.2)
+        run: |
+          touch github_summary.md
+          bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies
+        run: |
+          bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build --skip-test-time-deps
+          bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
+          bash scripts/ci/amd/amd_ci_exec.sh pip install mistral-common "lm-eval[api]"
+
+      - name: Accuracy Test MI35x ROCm 7.2 (8-GPU Qwen 3.5)
+        timeout-minutes: 120
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-qwen35 --nightly --timeout-per-file 3600 --continue-on-error || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+      - name: Performance Test MI35x ROCm 7.2 (8-GPU Qwen 3.5 FP8)
+        timeout-minutes: 120
+        continue-on-error: true
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e SGLANG_USE_AITER=1 \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-qwen35-fp8 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU GLM-5.1 (Accuracy + Performance combined) ROCm 7.2
+  nightly-8-gpu-mi35x-glm51-rocm720:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-glm51-rocm720,'))
+    runs-on: linux-mi35x-gpu-8
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.sha }}
+
+      - name: Setup docker (ROCm 7.2)
+        run: |
+          touch github_summary.md
+          bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies
+        run: |
+          bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps
+          bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
+          bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git@96f807a33b75
+
+      - name: Accuracy Test MI35x ROCm 7.2 (8-GPU GLM-5.1 NSA)
+        timeout-minutes: 180
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-glm51 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+      - name: Performance Test MI35x ROCm 7.2 (8-GPU GLM-5.1)
+        timeout-minutes: 120
+        continue-on-error: true
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-glm51 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU GLM-5-MXFP4 (Accuracy + Performance combined) ROCm 7.2
+  nightly-8-gpu-mi35x-glm5-mxfp4-rocm720:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-glm5-mxfp4-rocm720,'))
+    runs-on: linux-mi35x-gpu-8
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.sha }}
+
+      - name: Setup docker (ROCm 7.2)
+        run: |
+          touch github_summary.md
+          bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies
+        run: |
+          bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps
+          bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
+          bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git@96f807a33b75
+
+      - name: Accuracy Test MI35x ROCm 7.2 (8-GPU GLM-5-MXFP4)
+        timeout-minutes: 180
+        run: |
+          > github_summary.md
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e SGLANG_USE_AITER=1 \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-glm5-mxfp4 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+      - name: Performance Test MI35x ROCm 7.2 (8-GPU GLM-5-MXFP4)
+        timeout-minutes: 300
+        continue-on-error: true
+        run: |
+          > github_summary.md
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e SGLANG_USE_AITER=1 \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU DeepSeek-V3.2 Performance Test (MTP) ROCm 7.2
+  nightly-perf-8-gpu-mi35x-deepseek-v32-mtp-rocm720:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-perf-8-gpu-mi35x-deepseek-v32-mtp-rocm720,'))
+    runs-on: linux-mi35x-gpu-8
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.sha }}
+
+      - name: Setup docker (ROCm 7.2)
+        run: |
+          touch github_summary.md
+          bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies
+        run: |
+          bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-test-time-deps
+          # Install tabulate for run_suite.py (missing in MI35x container)
+          bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
+
+      - name: Performance Test MI35x ROCm 7.2 (8-GPU DeepSeek-V3.2 MTP)
+        timeout-minutes: 180
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-deepseek-v32-mtp --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU DeepSeek-V4-Flash FP8 + FP4 (Accuracy + Performance combined) ROCm 7.2
+  # NOTE on runtime sourcing: the DSv4 docker image (tag suffix `-DSv4`) bakes
+  # in sglang built from a specific commit of the amd/deepseek_v4 branch (the
+  # 7-char sha in the image tag is that commit). To keep the runtime as exactly
+  # that image-frozen sglang/aiter, we pass `--skip-sglang-build` and
+  # `--skip-aiter-build` so install_dependency.sh does NOT `pip install -e
+  # /sglang-checkout/python` (which would override the image's sglang with
+  # whatever this checkout happens to be) and does NOT rebuild aiter from this
+  # checkout's docker/rocm.Dockerfile. The /sglang-checkout mount is still used
+  # for shell scripts and for run_suite.py discovering test files; it does not
+  # poison Python imports because the image's site-packages .pth points at
+  # /sgl-workspace/sglang/python (a different path).
+  nightly-8-gpu-mi35x-deepseek-v4-flash-rocm720:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-deepseek-v4-flash-rocm720,'))
+    runs-on: linux-mi35x-gpu-8
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.sha }}
+
+      - name: Resolve DSv4 image tag
+        id: dsv4_image
+        run: |
+          # Pick the latest Docker Hub tag matching rocm720-mi35x---DSv4.
+          # Docker Hub returns results sorted by last_updated DESC by default, so the
+          # first regex match is the most recent daily build.
+          AUTH_HEADER=()
+          if [[ -n "${DOCKERHUB_AMD_USERNAME:-}" && -n "${DOCKERHUB_AMD_TOKEN:-}" ]]; then
+            TOKEN=$(curl -s -H "Content-Type: application/json" \
+              -X POST -d "{\"username\":\"${DOCKERHUB_AMD_USERNAME}\",\"password\":\"${DOCKERHUB_AMD_TOKEN}\"}" \
+              https://hub.docker.com/v2/users/login/ | python3 -c "import json,sys; print(json.load(sys.stdin).get('token',''))")
+            if [[ -n "$TOKEN" ]]; then
+              AUTH_HEADER=(-H "Authorization: JWT $TOKEN")
+            fi
+          fi
+          TAG=$(curl -s "${AUTH_HEADER[@]}" \
+            "https://hub.docker.com/v2/repositories/rocm/sgl-dev/tags?page_size=100&name=DSv4" \
+            | grep -oE '"name":"rocm720-mi35x-[a-f0-9]{7}-[0-9]{8}-DSv4"' \
+            | head -n 1 | cut -d'"' -f4)
+          if [ -z "$TAG" ]; then
+            echo "::error::No DSv4 image found matching rocm720-mi35x---DSv4 on Docker Hub"
+            exit 1
+          fi
+          echo "image=rocm/sgl-dev:$TAG" >> "$GITHUB_OUTPUT"
+          echo "Resolved DSv4 image: rocm/sgl-dev:$TAG"
+
+      - name: Setup docker (ROCm 7.2 DSv4)
+        run: |
+          touch github_summary.md
+          bash scripts/ci/amd/amd_ci_start_container.sh --custom-image ${{ steps.dsv4_image.outputs.image }}
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+
+      - name: Install dependencies (preserve DSv4 sglang/aiter from image)
+        run: |
+          # --skip-sglang-build: keep the image's pre-installed DSv4 sglang
+          # (default would `pip install -e /sglang-checkout/python` and clobber it with main's source).
+          # --skip-aiter-build: keep the image's DSv4-tuned aiter
+          # (default reads /sglang-checkout/docker/rocm.Dockerfile from main and rebuilds aiter to that commit).
+          # --skip-test-time-deps: GSM8K + bench_one_batch_server don't need lmms-eval / human-eval.
+          bash scripts/ci/amd/amd_ci_install_dependency.sh \
+            --skip-sglang-build --skip-aiter-build --skip-test-time-deps
+          # tabulate is the only thing run_suite.py imports that may not be in the DSv4 image.
+          bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
+
+      - name: Accuracy + Performance Test MI35x ROCm 7.2 (8-GPU DeepSeek-V4-Flash FP8 + FP4)
+        timeout-minutes: 300
+        run: |
+          > github_summary.md # Clear summary file
+          bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
+            -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
+            python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-v4-flash --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
+          echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
+          exit ${TEST_EXIT_CODE:-0}
+
+  # MI35x 8-GPU DeepSeek-V4-Pro FP8 + FP4 (Accuracy + Performance combined) ROCm 7.2
+  # Pro is 1.6T (vs Flash 285B); load + warmup is much longer, so timeout-per-file
+  # and the job timeout are both larger than the Flash job.
+  # Same image / branch / install strategy as the Flash job above — see the comment
+  # block on `nightly-8-gpu-mi35x-deepseek-v4-flash-rocm720` for the rationale.
+ nightly-8-gpu-mi35x-deepseek-v4-pro-rocm720: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-deepseek-v4-pro-rocm720,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Resolve DSv4 image tag + id: dsv4_image + run: | + AUTH_HEADER=() + if [[ -n "${DOCKERHUB_AMD_USERNAME:-}" && -n "${DOCKERHUB_AMD_TOKEN:-}" ]]; then + TOKEN=$(curl -s -H "Content-Type: application/json" \ + -X POST -d "{\"username\":\"${DOCKERHUB_AMD_USERNAME}\",\"password\":\"${DOCKERHUB_AMD_TOKEN}\"}" \ + https://hub.docker.com/v2/users/login/ | python3 -c "import json,sys; print(json.load(sys.stdin).get('token',''))") + if [[ -n "$TOKEN" ]]; then + AUTH_HEADER=(-H "Authorization: JWT $TOKEN") + fi + fi + TAG=$(curl -s "${AUTH_HEADER[@]}" \ + "https://hub.docker.com/v2/repositories/rocm/sgl-dev/tags?page_size=100&name=DSv4" \ + | grep -oE '"name":"rocm720-mi35x-[a-f0-9]{7}-[0-9]{8}-DSv4"' \ + | head -n 1 | cut -d'"' -f4) + if [ -z "$TAG" ]; then + echo "::error::No DSv4 image found matching rocm720-mi35x---DSv4 on Docker Hub" + exit 1 + fi + echo "image=rocm/sgl-dev:$TAG" >> "$GITHUB_OUTPUT" + echo "Resolved DSv4 image: rocm/sgl-dev:$TAG" + + - name: Setup docker (ROCm 7.2 DSv4) + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh --custom-image ${{ steps.dsv4_image.outputs.image }} + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies (preserve DSv4 sglang/aiter from image) + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh \ + --skip-sglang-build --skip-aiter-build --skip-test-time-deps + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy + Performance Test MI35x ROCm 7.2 
(8-GPU DeepSeek-V4-Pro FP8 + FP4) + timeout-minutes: 480 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-v4-pro --nightly --timeout-per-file 14400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + check-all-jobs: + if: always() && (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request' || github.event_name == 'workflow_dispatch') + needs: + # MI30x ROCm 7.2 Unit Tests + - nightly-test-1-gpu-unit-rocm720 + # MI30x ROCm 7.2 Accuracy Tests + - nightly-accuracy-2-gpu-rocm720 + - nightly-accuracy-2-gpu-vlm-rocm720 + # MI30x ROCm 7.2 Performance Tests + - nightly-perf-2-gpu-text-rocm720 + - nightly-perf-2-gpu-vlm-rocm720 + # MI30x ROCm 7.2 4-GPU Tests + - nightly-4-gpu-rocm720 + - nightly-accuracy-8-gpu-rocm720 + # MI30x ROCm 7.2 Combined Accuracy + Performance Tests + - nightly-8-gpu-grok1-int4-rocm720 + - nightly-8-gpu-grok2-rocm720 + - nightly-8-gpu-deepseek-v31-rocm720 + - nightly-8-gpu-deepseek-v32-rocm720 + - nightly-8-gpu-deepseek-v32-mtp-rocm720 + - nightly-8-gpu-deepseek-v3-kv-fp8-rocm720 + - nightly-8-gpu-kimi-k26-rocm720 + - nightly-8-gpu-qwen3-235b-rocm720 + - nightly-8-gpu-qwen35-rocm720 + - nightly-8-gpu-glm51-rocm720 + - nightly-8-gpu-minimax-m27-rocm720 + # MI30x ROCm 7.2 Diffusion Tests + - nightly-1-gpu-zimage-turbo-rocm720 + # MI35x ROCm 7.2 jobs + - nightly-test-1-gpu-mi35x-rocm720 + - nightly-accuracy-8-gpu-mi35x-rocm720 + - nightly-8-gpu-mi35x-grok1-int4-rocm720 + - nightly-8-gpu-mi35x-grok2-rocm720 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-rocm720 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8-rocm720 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion-rocm720 + - nightly-accuracy-8-gpu-mi35x-deepseek-v32-rocm720 + 
- nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp-rocm720 + - nightly-perf-8-gpu-mi35x-deepseek-v32-basic-rocm720 + - nightly-perf-8-gpu-mi35x-deepseek-v32-mtp-rocm720 + - nightly-8-gpu-mi35x-deepseek-v4-flash-rocm720 + - nightly-8-gpu-mi35x-deepseek-v4-pro-rocm720 + - nightly-8-gpu-mi35x-kimi-k26-rocm720 + - nightly-8-gpu-mi35x-qwen3-235b-mxfp4-rocm720 + - nightly-8-gpu-mi35x-qwen35-rocm720 + - nightly-8-gpu-mi35x-glm51-rocm720 + - nightly-8-gpu-mi35x-glm5-mxfp4-rocm720 + runs-on: ubuntu-latest + steps: + - name: Check if any job failed + run: | + if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then + echo "One or more ROCm 7.2 nightly test jobs failed" + exit 1 + fi + if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then + echo "One or more ROCm 7.2 nightly test jobs were cancelled" + exit 1 + fi + echo "All ROCm 7.2 nightly test jobs passed" diff --git a/.github/workflows/nightly-test-amd.yml b/.github/workflows/nightly-test-amd.yml index 1e0a9bf11f1a..752568fa6dbf 100644 --- a/.github/workflows/nightly-test-amd.yml +++ b/.github/workflows/nightly-test-amd.yml @@ -2,7 +2,7 @@ name: Nightly Test (AMD) on: schedule: - - cron: '0 0 * * *' + - cron: '30 17 * * *' push: branches: - main @@ -10,35 +10,63 @@ on: - "python/sglang/version.py" workflow_dispatch: inputs: - job_filter: - description: 'Select which job to run (leave empty or "all" to run all jobs)' + aiter_ref: + description: 'Override AITER commit (optional, leave empty to use Dockerfile default)' + required: false + type: string + default: '' + continue_on_error: + description: 'Continue on error (do not fail the workflow on test failures)' + required: false + type: boolean + default: true + job_select: + description: 'Select a job to run from dropdown (choose "all" to run all jobs)' required: false type: choice default: 'all' options: - 'all' - # MI30x Unit Tests - - 'nightly-test-1-gpu-unit' - # MI30x Accuracy Tests (GSM8K / MMMU) - - 'nightly-accuracy-2-gpu' - - 
'nightly-accuracy-2-gpu-vlm' - - 'nightly-perf-2-gpu-text' - - 'nightly-perf-2-gpu-vlm' - - 'nightly-accuracy-8-gpu' - - 'nightly-accuracy-8-gpu-deepseek-r1' - # MI30x Accuracy + Performance Tests (combined) - - 'nightly-8-gpu-grok1-int4' - - 'nightly-8-gpu-grok2' - - 'nightly-8-gpu-deepseek-v31' - # MI35x jobs - - 'nightly-test-1-gpu-mi35x' - - 'nightly-accuracy-8-gpu-mi35x' - - 'nightly-8-gpu-mi35x-grok1-int4' - - 'nightly-8-gpu-mi35x-grok2' - - 'nightly-8-gpu-mi35x-deepseek-r1-mxfp4' - - 'nightly-accuracy-8-gpu-mi35x-deepseek-v32' - - 'nightly-perf-8-gpu-mi35x-deepseek-v32-basic' - - 'nightly-perf-8-gpu-mi35x-deepseek-v32-mtp' + - nightly-test-1-gpu-unit + - nightly-accuracy-2-gpu + - nightly-accuracy-2-gpu-vlm + - nightly-perf-2-gpu-text + - nightly-perf-2-gpu-vlm + - nightly-4-gpu + - nightly-accuracy-8-gpu + - nightly-8-gpu-grok1-int4 + - nightly-8-gpu-grok2 + - nightly-8-gpu-deepseek-v31 + - nightly-8-gpu-deepseek-v32 + - nightly-8-gpu-deepseek-v32-mtp + - nightly-8-gpu-deepseek-v3-kv-fp8 + - nightly-8-gpu-kimi-k26 + - nightly-8-gpu-qwen3-235b + - nightly-8-gpu-qwen35 + - nightly-8-gpu-glm51 + - nightly-8-gpu-minimax-m27 + - nightly-1-gpu-zimage-turbo + - nightly-test-1-gpu-mi35x + - nightly-accuracy-8-gpu-mi35x + - nightly-8-gpu-mi35x-grok1-int4 + - nightly-8-gpu-mi35x-grok2 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion + - nightly-accuracy-8-gpu-mi35x-deepseek-v32 + - nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp + - nightly-perf-8-gpu-mi35x-deepseek-v32-basic + - nightly-perf-8-gpu-mi35x-deepseek-v32-mtp + - nightly-8-gpu-mi35x-kimi-k26 + - nightly-8-gpu-mi35x-qwen3-235b-mxfp4 + - nightly-8-gpu-mi35x-qwen35 + - nightly-8-gpu-mi35x-glm51 + - nightly-8-gpu-mi35x-glm5-mxfp4 + job_filter: + description: 'Or type comma-separated job names (overrides dropdown if non-empty)' + required: false + type: string + default: '' workflow_call: inputs: ref: @@ -46,27 +74,46 
@@ on: required: false type: string default: '' + aiter_ref: + description: 'Override AITER commit (optional, leave empty to use Dockerfile default)' + required: false + type: string + default: '' job_filter: description: 'Select which job to run (leave empty or "all" to run all jobs)' required: false type: string default: 'all' + continue_on_error: + description: 'Continue on error (do not fail the workflow on test failures)' + required: false + type: boolean + default: true + +env: + AITER_COMMIT_OVERRIDE: ${{ inputs.aiter_ref }} + DOCKERHUB_AMD_USERNAME: ${{ secrets.DOCKERHUB_AMD_USERNAME }} + DOCKERHUB_AMD_TOKEN: ${{ secrets.DOCKERHUB_AMD_TOKEN }} concurrency: - group: nightly-test-amd-${{ inputs.ref || github.ref }} - cancel-in-progress: ${{ github.event_name != 'workflow_call' }} + # When called via workflow_call with ref set, use a unique group per caller run to avoid + # collisions with direct schedule/push triggers. We use inputs.ref (not github.event_name) + # to detect this, because github.event_name inherits from the caller in workflow_call. + # Manual dispatch runs also get unique groups so they never cancel each other. + group: nightly-test-amd-${{ github.event_name == 'workflow_dispatch' && format('manual-{0}', github.run_id) || inputs.ref && format('caller-{0}', github.run_id) || github.ref }} + cancel-in-progress: ${{ !inputs.ref && github.event_name != 'workflow_call' && github.event_name != 'workflow_dispatch' }} jobs: # ============================================== MI30x Unit Tests ============================================== # 1-GPU Unit Tests - LoRA, debug utils, scheduler, etc. 
(MI30x only) nightly-test-1-gpu-unit: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-1-gpu-unit') - runs-on: linux-mi325-gpu-1 + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-test-1-gpu-unit,')) + runs-on: linux-mi325-1gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -79,24 +126,24 @@ jobs: run: bash scripts/ci/amd/amd_ci_install_dependency.sh - name: Nightly Unit Test (1-GPU) - timeout-minutes: 60 + timeout-minutes: 90 run: | bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-1-gpu --nightly --timeout-per-file 600 --continue-on-error || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-1-gpu --nightly --timeout-per-file 900 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
           echo "$(<github_summary.md)" >> $GITHUB_STEP_SUMMARY || true
           exit ${TEST_EXIT_CODE:-0}
 
   # ============================================== MI30x Accuracy Tests ==============================================
   # 2-GPU Accuracy Tests - GSM8K eval (MI30x only)
   nightly-accuracy-2-gpu:
-    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-accuracy-2-gpu')
-    runs-on: linux-mi325-gpu-2
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-2-gpu,'))
+    runs-on: linux-mi325-2gpu-sglang
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
         with:
-          ref: ${{ inputs.ref || github.ref }}
+          ref: ${{ inputs.ref || github.sha }}
 
       - name: Setup docker
         run: |
@@ -113,19 +160,19 @@ jobs:
           > github_summary.md # Clear summary file
           bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
             -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
-            python3 run_suite.py --hw amd --suite nightly-amd --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$?
+            python3 run_suite.py --hw amd --suite nightly-amd --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$?
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} # 2-GPU VLM Accuracy Tests - Vision-Language Models MMMU evaluation nightly-accuracy-2-gpu-vlm: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-accuracy-2-gpu-vlm') - runs-on: linux-mi325-gpu-2 + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-2-gpu-vlm,')) + runs-on: linux-mi325-2gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -143,19 +190,19 @@ jobs: > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-2-gpu-vlm --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-2-gpu-vlm --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} # 2-GPU Text Models Performance Tests nightly-perf-2-gpu-text: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-perf-2-gpu-text') - runs-on: linux-mi325-gpu-2 + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-perf-2-gpu-text,')) + runs-on: linux-mi325-2gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -174,19 +221,19 @@ jobs: bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e SGLANG_USE_AITER=1 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-perf-text-2-gpu --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-perf-text-2-gpu --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} # 2-GPU VLM Performance Tests nightly-perf-2-gpu-vlm: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-perf-2-gpu-vlm') - runs-on: linux-mi325-gpu-2 + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-perf-2-gpu-vlm,')) + runs-on: linux-mi325-2gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -205,19 +252,20 @@ jobs: bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e SGLANG_USE_AITER=1 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-perf-vlm-2-gpu --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-perf-vlm-2-gpu --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # 8-GPU Accuracy Tests - GPT-OSS, Grok1-FP8 (accuracy only) - nightly-accuracy-8-gpu: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-accuracy-8-gpu') - runs-on: linux-mi325-gpu-8 + # ============================================== MI30x 4-GPU Tests ============================================== + # 4-GPU Nightly Tests - Dumper/Comparator E2E, VLM Encoder DP + nightly-4-gpu: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-4-gpu,')) + runs-on: linux-mi325-4gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -229,34 +277,25 @@ jobs: - name: Install dependencies run: bash scripts/ci/amd/amd_ci_install_dependency.sh - - name: Accuracy Test (8-GPU GPT-OSS) - timeout-minutes: 180 - run: | - bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ - -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-gpt-oss --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$? - echo "$(> $GITHUB_STEP_SUMMARY || true - exit ${TEST_EXIT_CODE:-0} - - - name: Accuracy Test (8-GPU Grok1-FP8) - timeout-minutes: 60 + - name: Nightly Test (4-GPU) + timeout-minutes: 120 run: | + > github_summary.md bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ - -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok1-fp8 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? 
+ python3 run_suite.py --hw amd --suite nightly-amd-4-gpu --nightly --continue-on-error --timeout-per-file 3600 || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # 8-GPU DeepSeek-R1 Accuracy Test (separate job due to long loading time) - nightly-accuracy-8-gpu-deepseek-r1: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-accuracy-8-gpu-deepseek-r1') - runs-on: linux-mi325-gpu-8 + # 8-GPU Accuracy Tests - GPT-OSS, Grok1-FP8 (accuracy only) + nightly-accuracy-8-gpu: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-8-gpu,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -268,25 +307,35 @@ jobs: - name: Install dependencies run: bash scripts/ci/amd/amd_ci_install_dependency.sh - - name: Accuracy Test (8-GPU DeepSeek-R1) - timeout-minutes: 240 + - name: Accuracy Test (8-GPU GPT-OSS) + timeout-minutes: 180 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-gpt-oss --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Accuracy Test (8-GPU Grok1-FP8) + timeout-minutes: 60 run: | bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-deepseek-r1 --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok1-fp8 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} # ============================================== MI30x Combined Accuracy + Performance Tests ============================================== # 8-GPU Grok1-INT4 (Accuracy + Performance combined) nightly-8-gpu-grok1-int4: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-grok1-int4') - runs-on: linux-mi325-gpu-8 + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-grok1-int4,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -305,7 +354,7 @@ jobs: bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok1-int4 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? 
+ python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok1-int4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} @@ -317,19 +366,19 @@ jobs: bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-grok1-int4 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-grok1-int4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} # 8-GPU Grok2 (Accuracy + Performance combined) nightly-8-gpu-grok2: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-grok2') - runs-on: linux-mi325-gpu-8 + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-grok2,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -348,7 +397,7 @@ jobs: bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok2 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? 
+ python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-grok2 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} @@ -360,19 +409,19 @@ jobs: bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-grok2 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-grok2 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} # 8-GPU DeepSeek-V3.1 (Accuracy + Performance combined) nightly-8-gpu-deepseek-v31: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-deepseek-v31') - runs-on: linux-mi325-gpu-8 + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-deepseek-v31,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -391,7 +440,7 @@ jobs: bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e SGLANG_USE_AITER=1 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-deepseek-v31 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? 
+ python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-deepseek-v31 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} @@ -403,20 +452,19 @@ jobs: bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e SGLANG_USE_ROCM700A=1 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-deepseek-v31 --nightly --timeout-per-file 18000 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-deepseek-v31 --nightly --timeout-per-file 18000 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # ============================================== MI35x Tests ============================================== - # MI35x 1-GPU tests - platform-agnostic tests that may work on CDNA4 (gfx950) - nightly-test-1-gpu-mi35x: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-1-gpu-mi35x') - runs-on: linux-mi35x-gpu-1 + # 8-GPU DeepSeek-V3.2 (Basic Accuracy + Perf) + nightly-8-gpu-deepseek-v32: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-deepseek-v32,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -426,29 +474,38 @@ jobs: GITHUB_WORKSPACE: ${{ github.workspace }} - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + 
+ - name: Accuracy Test (8-GPU DeepSeek-V3.2 Basic) + timeout-minutes: 120 run: | - bash scripts/ci/amd/amd_ci_install_dependency.sh - # Install tabulate for run_suite.py (missing in MI35x container) - bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-deepseek-v32 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} - - name: Nightly Test MI35x (1-GPU) - timeout-minutes: 60 + - name: Performance Test (8-GPU DeepSeek-V3.2 Basic) + timeout-minutes: 150 + continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed run: | + > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-1-gpu-mi35x --nightly --timeout-per-file 600 --continue-on-error || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-deepseek-v32-basic --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # MI35x 8-GPU Accuracy Tests - GPT-OSS (accuracy only) - nightly-accuracy-8-gpu-mi35x: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-accuracy-8-gpu-mi35x') - runs-on: linux-mi35x-gpu-8 + # 8-GPU DeepSeek-V3.2 MTP (MTP Accuracy + Perf) + nightly-8-gpu-deepseek-v32-mtp: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-deepseek-v32-mtp,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -458,29 +515,38 @@ jobs: GITHUB_WORKSPACE: ${{ github.workspace }} - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Accuracy Test (8-GPU DeepSeek-V3.2 MTP) + timeout-minutes: 120 run: | - bash scripts/ci/amd/amd_ci_install_dependency.sh - # Install tabulate for run_suite.py (missing in MI35x container) - bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-deepseek-v32-mtp --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} - - name: Accuracy Test MI35x (8-GPU GPT-OSS) + - name: Performance Test (8-GPU DeepSeek-V3.2 MTP) timeout-minutes: 180 + continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed run: | + > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-deepseek-v32-mtp --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # MI35x 8-GPU Grok1-INT4 (Accuracy + Performance combined) - nightly-8-gpu-mi35x-grok1-int4: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-mi35x-grok1-int4') - runs-on: linux-mi35x-gpu-8 + # 8-GPU DeepSeek-V3 KV FP8 (Basic + MTP with --kv-cache-dtype fp8_e4m3) + nightly-8-gpu-deepseek-v3-kv-fp8: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-deepseek-v3-kv-fp8,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -490,43 +556,86 @@ jobs: GITHUB_WORKSPACE: ${{ github.workspace }} - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: DeepSeek-V3 KV FP8 Test (8-GPU Basic + MTP) + timeout-minutes: 120 
run: | - bash scripts/ci/amd/amd_ci_install_dependency.sh - # Install tabulate for run_suite.py (missing in MI35x container) - bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-deepseek-v3-kv-fp8 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} - - name: Accuracy Test MI35x (8-GPU Grok1-INT4) - timeout-minutes: 60 + # 8-GPU Kimi-K2.6 (Accuracy) + nightly-8-gpu-kimi-k26: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-kimi-k26,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Accuracy Test (8-GPU Kimi-K2.6) + timeout-minutes: 120 run: | > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ - -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-grok1-int4 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? 
+ python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-kimi-k26 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - - name: Performance Test MI35x (8-GPU Grok1-INT4) - timeout-minutes: 60 - continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed + nightly-8-gpu-qwen3-235b: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-qwen3-235b,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Accuracy Test + Performance Test (8-GPU Qwen3) + timeout-minutes: 120 run: | > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ - -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-grok1-int4 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-8-gpu-qwen3-235b --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # MI35x 8-GPU Grok2 (Accuracy + Performance combined) - nightly-8-gpu-mi35x-grok2: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-mi35x-grok2') - runs-on: linux-mi35x-gpu-8 + # 8-GPU Qwen 3.5 (Accuracy + Performance combined) + nightly-8-gpu-qwen35: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-qwen35,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -538,41 +647,39 @@ jobs: - name: Install dependencies run: | bash scripts/ci/amd/amd_ci_install_dependency.sh - # Install tabulate for run_suite.py (missing in MI35x container) - bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + bash scripts/ci/amd/amd_ci_exec.sh pip install mistral-common "lm-eval[api]" - - name: Accuracy Test MI35x (8-GPU Grok2) - timeout-minutes: 60 + - name: Accuracy Test (8-GPU Qwen 3.5) + timeout-minutes: 120 run: | > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ - -e RCCL_MSCCL_ENABLE=0 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-grok2 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-qwen35 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - - name: Performance Test MI35x (8-GPU Grok2) - timeout-minutes: 60 - continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed + - name: Performance Test (8-GPU Qwen 3.5 FP8) + timeout-minutes: 120 + continue-on-error: true run: | > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ - -e RCCL_MSCCL_ENABLE=0 \ + -e SGLANG_USE_AITER=1 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-grok2 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-qwen35-fp8 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # MI35x 8-GPU DeepSeek-R1-MXFP4 (Accuracy + Performance combined) - nightly-8-gpu-mi35x-deepseek-r1-mxfp4: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-mi35x-deepseek-r1-mxfp4') - runs-on: linux-mi35x-gpu-8 + # 8-GPU GLM-5.1 (Accuracy + Performance combined) + nightly-8-gpu-glm51: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-glm51,')) + runs-on: linux-mi325-8gpu-sglang steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -584,39 +691,123 @@ jobs: - name: Install dependencies run: | bash scripts/ci/amd/amd_ci_install_dependency.sh - # Install tabulate for run_suite.py (missing in 
MI35x container) - bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git@96f807a33b75 - - name: Accuracy Test MI35x (8-GPU DeepSeek-R1-MXFP4) - timeout-minutes: 180 + - name: Accuracy Test (8-GPU GLM-5.1 NSA) + timeout-minutes: 120 run: | > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-r1-mxfp4 --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-glm51 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - - name: Performance Test MI35x (8-GPU DeepSeek-R1-MXFP4) - timeout-minutes: 300 + - name: Performance Test (8-GPU GLM-5.1) + timeout-minutes: 120 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-glm51 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # 8-GPU MiniMax-M2.7 (Accuracy + Performance combined, replaces M2.5) + nightly-8-gpu-minimax-m27: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-minimax-m27,')) + runs-on: linux-mi325-8gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Accuracy Test (8-GPU MiniMax-M2.7) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-minimax-m27 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test (8-GPU MiniMax-M2.7) + timeout-minutes: 120 continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed run: | > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 registered/amd/perf/mi35x/test_deepseek_r1_mxfp4_perf_mi35x.py || TEST_EXIT_CODE=$? 
+ python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-minimax-m27 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # MI35x 8-GPU DeepSeek-V3.2 Accuracy Test - nightly-accuracy-8-gpu-mi35x-deepseek-v32: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-accuracy-8-gpu-mi35x-deepseek-v32') - runs-on: linux-mi35x-gpu-8 + # ============================================== MI30x Diffusion Tests ============================================== + # 1-GPU Z-Image-Turbo (Diffusion T2I) + nightly-1-gpu-zimage-turbo: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-1-gpu-zimage-turbo,')) + runs-on: linux-mi325-1gpu-sglang + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Z-Image-Turbo Diffusion Test (1-GPU) + timeout-minutes: 45 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + -e SGLANG_DIFFUSION_ARTIFACT_DIR="/sglang-checkout/diffusion-artifacts" \ + pytest test/registered/amd/test_zimage_turbo.py -v -s ${{ inputs.continue_on_error && '|| true' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Upload generated images + if: always() + uses: actions/upload-artifact@v4 + with: + name: zimage-turbo-outputs + path: diffusion-artifacts/ + if-no-files-found: ignore + retention-days: 30 + + # ============================================== MI35x Tests ============================================== + # MI35x 1-GPU tests - platform-agnostic tests that may work on CDNA4 (gfx950) + nightly-test-1-gpu-mi35x: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-test-1-gpu-mi35x,')) + runs-on: linux-mi35x-gpu-1 steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -631,25 +822,24 @@ jobs: # Install tabulate for run_suite.py (missing in MI35x container) bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate - - name: Accuracy Test MI35x (8-GPU DeepSeek-V3.2) - timeout-minutes: 120 + - name: Nightly Test MI35x (1-GPU) + timeout-minutes: 90 run: | - > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-v32 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-1-gpu-mi35x --nightly --timeout-per-file 900 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # MI35x 8-GPU DeepSeek-V3.2 Performance Test (Basic) - nightly-perf-8-gpu-mi35x-deepseek-v32-basic: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-perf-8-gpu-mi35x-deepseek-v32-basic') + # MI35x 8-GPU Accuracy Tests - GPT-OSS (accuracy only) + nightly-accuracy-8-gpu-mi35x: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-8-gpu-mi35x,')) runs-on: linux-mi35x-gpu-8 steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -664,25 +854,24 @@ jobs: # Install tabulate for run_suite.py (missing in MI35x container) bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate - - name: Performance Test MI35x (8-GPU DeepSeek-V3.2 Basic) - timeout-minutes: 150 + - name: Accuracy Test MI35x (8-GPU GPT-OSS) + timeout-minutes: 180 run: | - > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-deepseek-v32-basic --nightly --timeout-per-file 5400 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} - # MI35x 8-GPU DeepSeek-V3.2 Performance Test (MTP) - nightly-perf-8-gpu-mi35x-deepseek-v32-mtp: - if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-perf-8-gpu-mi35x-deepseek-v32-mtp') + # MI35x 8-GPU Grok1-INT4 (Accuracy + Performance combined) + nightly-8-gpu-mi35x-grok1-int4: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-grok1-int4,')) runs-on: linux-mi35x-gpu-8 steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ inputs.ref || github.sha }} - name: Setup docker run: | @@ -697,13 +886,537 @@ jobs: # Install tabulate for run_suite.py (missing in MI35x container) bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate - - name: Performance Test MI35x (8-GPU DeepSeek-V3.2 MTP) - timeout-minutes: 150 + - name: Accuracy Test MI35x (8-GPU Grok1-INT4) + timeout-minutes: 90 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-grok1-int4 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x (8-GPU Grok1-INT4) + timeout-minutes: 60 + continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-grok1-int4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU Grok2 (Accuracy + Performance combined) + nightly-8-gpu-mi35x-grok2: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-grok2,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x (8-GPU Grok2) + timeout-minutes: 60 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-grok2 --nightly --timeout-per-file 3600 ${{ 
inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x (8-GPU Grok2) + timeout-minutes: 60 + continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e RCCL_MSCCL_ENABLE=0 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-grok2 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-R1-MXFP4 (Accuracy + Performance combined) + nightly-8-gpu-mi35x-deepseek-r1-mxfp4: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-deepseek-r1-mxfp4,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x (8-GPU DeepSeek-R1-MXFP4) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite 
nightly-amd-8-gpu-mi35x-deepseek-r1-mxfp4 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x (8-GPU DeepSeek-R1-MXFP4) + timeout-minutes: 300 + continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 registered/amd/perf/mi35x/test_deepseek_r1_mxfp4_perf_mi35x.py || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-R1-MXFP4 KV FP8 (Accuracy + Performance combined) + nightly-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x (8-GPU DeepSeek-R1-MXFP4 KV FP8) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite 
nightly-amd-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x (8-GPU DeepSeek-R1-MXFP4 KV FP8) + timeout-minutes: 300 + continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 registered/amd/perf/mi35x/test_deepseek_r1_mxfp4_kv_fp8_perf_mi35x.py || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-R1-MXFP4 AllReduce Fusion (Accuracy + Performance combined) + nightly-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x (8-GPU DeepSeek-R1-MXFP4 AllReduce Fusion) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e 
GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x (8-GPU DeepSeek-R1-MXFP4 AllReduce Fusion) + timeout-minutes: 300 + continue-on-error: true # Perf test failure doesn't fail the job if accuracy passed + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 registered/amd/perf/mi35x/test_deepseek_r1_mxfp4_ar_fusion_perf_mi35x.py || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-V3.2 Accuracy Test + nightly-accuracy-8-gpu-mi35x-deepseek-v32: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-8-gpu-mi35x-deepseek-v32,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x (8-GPU DeepSeek-V3.2) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e 
GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-deepseek-v32 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-V3.2 TP+MTP Accuracy Test + nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x (8-GPU DeepSeek-V3.2 TP+MTP) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-deepseek-v32-mtp --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-V3.2 Performance Test (Basic) + nightly-perf-8-gpu-mi35x-deepseek-v32-basic: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-perf-8-gpu-mi35x-deepseek-v32-basic,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Performance Test MI35x (8-GPU DeepSeek-V3.2 Basic) + timeout-minutes: 150 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-deepseek-v32-basic --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU Kimi-K2.6 (Accuracy) + nightly-8-gpu-mi35x-kimi-k26: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-kimi-k26,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test MI35x (8-GPU Kimi-K2.6) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-kimi-k26 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU Qwen3-235B-MXFP4 (Accuracy + Performance) + nightly-8-gpu-mi35x-qwen3-235b-mxfp4: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-qwen3-235b-mxfp4,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Accuracy Test + Performance Test MI35x (8-GPU Qwen3-235B-MXFP4) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-8-gpu-mi35x-qwen3-235b-mxfp4 --nightly --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU Qwen 3.5 (Accuracy + Performance combined) + nightly-8-gpu-mi35x-qwen35: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-qwen35,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + bash scripts/ci/amd/amd_ci_exec.sh pip install mistral-common "lm-eval[api]" + + - name: Accuracy Test MI35x (8-GPU Qwen 3.5) + timeout-minutes: 120 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-qwen35 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x (8-GPU Qwen 3.5 FP8) + timeout-minutes: 120 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-qwen35-fp8 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU GLM-5.1 (Accuracy + Performance combined) + nightly-8-gpu-mi35x-glm51: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-glm51,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git@96f807a33b75 + + - name: Accuracy Test MI35x (8-GPU GLM-5.1 NSA) + timeout-minutes: 180 + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-glm51 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x (8-GPU GLM-5.1) + timeout-minutes: 120 + continue-on-error: true + run: | + > github_summary.md # Clear summary file + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-glm51 --nightly --timeout-per-file 5400 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU GLM-5-MXFP4 (Accuracy + Performance combined) + nightly-8-gpu-mi35x-glm5-mxfp4: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-8-gpu-mi35x-glm5-mxfp4,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git@96f807a33b75 + + - name: Accuracy Test MI35x (8-GPU GLM-5-MXFP4) + timeout-minutes: 180 + run: | + > github_summary.md + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-glm5-mxfp4 --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? + echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + - name: Performance Test MI35x (8-GPU GLM-5-MXFP4) + timeout-minutes: 300 + continue-on-error: true + run: | + > github_summary.md + bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ + -e SGLANG_USE_AITER=1 \ + -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ + python3 registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py || TEST_EXIT_CODE=$? 
+ echo "$(> $GITHUB_STEP_SUMMARY || true + exit ${TEST_EXIT_CODE:-0} + + # MI35x 8-GPU DeepSeek-V3.2 Performance Test (MTP) + nightly-perf-8-gpu-mi35x-deepseek-v32-mtp: + if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (!(inputs.job_filter || inputs.job_select) || (inputs.job_filter || inputs.job_select) == 'all' || contains(format(',{0},', inputs.job_filter || inputs.job_select), ',nightly-perf-8-gpu-mi35x-deepseek-v32-mtp,')) + runs-on: linux-mi35x-gpu-8 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.sha }} + + - name: Setup docker + run: | + touch github_summary.md + bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + # Install tabulate for run_suite.py (missing in MI35x container) + bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate + + - name: Performance Test MI35x (8-GPU DeepSeek-V3.2 MTP) + timeout-minutes: 180 run: | > github_summary.md # Clear summary file bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \ -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \ - python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-deepseek-v32-mtp --nightly --timeout-per-file 5400 || TEST_EXIT_CODE=$? + python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-deepseek-v32-mtp --nightly --timeout-per-file 7200 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} || TEST_EXIT_CODE=$? 
echo "$(> $GITHUB_STEP_SUMMARY || true exit ${TEST_EXIT_CODE:-0} @@ -715,24 +1428,44 @@ jobs: # MI30x Accuracy Tests - nightly-accuracy-2-gpu - nightly-accuracy-2-gpu-vlm - # MI30x Performance Tests - - nightly-perf-2-gpu-text - - nightly-perf-2-gpu-vlm + # MI30x 4-GPU Tests + - nightly-4-gpu - nightly-accuracy-8-gpu - - nightly-accuracy-8-gpu-deepseek-r1 + # MI30x Performance Tests - excluded from check (perf failures don't block CI) + # - nightly-perf-2-gpu-text + # - nightly-perf-2-gpu-vlm # MI30x Combined Accuracy + Performance Tests - nightly-8-gpu-grok1-int4 - nightly-8-gpu-grok2 - nightly-8-gpu-deepseek-v31 + - nightly-8-gpu-deepseek-v32 + - nightly-8-gpu-deepseek-v32-mtp + - nightly-8-gpu-deepseek-v3-kv-fp8 + - nightly-8-gpu-kimi-k26 + - nightly-8-gpu-qwen3-235b + - nightly-8-gpu-qwen35 + - nightly-8-gpu-glm51 + - nightly-8-gpu-minimax-m27 + # MI30x Diffusion Tests + - nightly-1-gpu-zimage-turbo # MI35x jobs - nightly-test-1-gpu-mi35x - nightly-accuracy-8-gpu-mi35x - nightly-8-gpu-mi35x-grok1-int4 - nightly-8-gpu-mi35x-grok2 - nightly-8-gpu-mi35x-deepseek-r1-mxfp4 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-kv-fp8 + - nightly-8-gpu-mi35x-deepseek-r1-mxfp4-ar-fusion - nightly-accuracy-8-gpu-mi35x-deepseek-v32 - - nightly-perf-8-gpu-mi35x-deepseek-v32-basic - - nightly-perf-8-gpu-mi35x-deepseek-v32-mtp + - nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp + - nightly-8-gpu-mi35x-kimi-k26 + - nightly-8-gpu-mi35x-qwen3-235b-mxfp4 + - nightly-8-gpu-mi35x-qwen35 + - nightly-8-gpu-mi35x-glm51 + - nightly-8-gpu-mi35x-glm5-mxfp4 + # MI35x perf jobs excluded from check - perf failures don't block CI + # - nightly-perf-8-gpu-mi35x-deepseek-v32-basic + # - nightly-perf-8-gpu-mi35x-deepseek-v32-mtp runs-on: ubuntu-latest steps: - name: Check if any job failed diff --git a/.github/workflows/nightly-test-npu.yml b/.github/workflows/nightly-test-npu.yml index 1ab6b673314c..44071afc7e1d 100644 --- a/.github/workflows/nightly-test-npu.yml +++ 
b/.github/workflows/nightly-test-npu.yml @@ -2,7 +2,7 @@ name: Nightly Test (NPU) on: schedule: - - cron: '0 17 * * *' # Execute at 1:00 a.m. Beijing Time every day + - cron: '0 18 * * *' # Execute at 2:00 a.m. Beijing Time every day pull_request: branches: - main @@ -21,13 +21,61 @@ on: required: false type: string default: 'all' + image_a3: + description: 'The a3 running docker image of the test task.' + required: false + type: string + default: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11' + skip_install_flag: + description: 'Indicates whether to skip the installation of sglang, defaulting to false.' + required: false + type: string + default: 'false' + concurrency: group: nightly-test-npu-${{ inputs.ref || github.ref }} cancel-in-progress: ${{ github.event_name != 'workflow_call' }} jobs: + set-image-config: + runs-on: ubuntu-latest + outputs: + ref: ${{ steps.set-vars.outputs.ref }} + job_filter: ${{ steps.set-vars.outputs.job_filter }} + image_a3: ${{ steps.set-vars.outputs.image_a3 }} + skip_install_flag: ${{ steps.set-vars.outputs.skip_install_flag }} + steps: + # When triggered by PR, no inputs parameters are used. The latest community code is tested by default. 
+ - name: Set image config + id: set-vars + run: | + if [ -z "${{ inputs.ref }}" ]; then + echo "ref=" >> $GITHUB_OUTPUT + else + echo "ref=${{ inputs.ref }}" >> $GITHUB_OUTPUT + fi + + if [ -z "${{ inputs.job_filter }}" ]; then + echo "job_filter=all" >> $GITHUB_OUTPUT + else + echo "job_filter=${{ inputs.job_filter }}" >> $GITHUB_OUTPUT + fi + + if [ -z "${{ inputs.image_a3 }}" ]; then + echo "image_a3=swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11" >> $GITHUB_OUTPUT + else + echo "image_a3=${{ inputs.image_a3 }}" >> $GITHUB_OUTPUT + fi + + if [ -z "${{ inputs.skip_install_flag }}" ]; then + echo "skip_install_flag=false" >> $GITHUB_OUTPUT + else + echo "skip_install_flag=${{ inputs.skip_install_flag }}" >> $GITHUB_OUTPUT + fi + nightly-1-npu-a3: + needs: [set-image-config] if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} runs-on: linux-aarch64-a3-2 strategy: @@ -35,31 +83,39 @@ jobs: matrix: part: [0, 1] container: - image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11 + image: ${{ needs.set-image-config.outputs.image_a3 }} steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ needs.set-image-config.outputs.ref || github.ref }} - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" run: | # speed up by using infra cache services CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list pip config set global.index-url http://${CACHING_URL}/pypi/simple - pip config set global.extra-index-url 
"https://pypi.tuna.tsinghua.edu.cn/simple" - pip config set global.trusted-host "${CACHING_URL} pypi.tuna.tsinghua.edu.cn" + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi - bash scripts/ci/npu/npu_ci_install_dependency.sh a3 # copy required file from our daily cache cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp - # copy download through proxy - curl -o /tmp/test.jsonl -L https://gh-proxy.test.osinfra.cn/https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp - name: Print Log Information run: | bash scripts/ci/npu/npu_log_print.sh + - name: Run test timeout-minutes: 240 env: @@ -70,12 +126,23 @@ jobs: PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" STREAMS_PER_DEVICE: 32 run: | - export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" - pip install sentence_transformers accelerate + pip install sglang_router + hf download lmms-lab/MMMU --repo-type dataset + pip install sentence_transformers torchaudio==2.8.0 + pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap + pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1" + pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.2.0 numpy==1.26.4 dotenv + git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git + cd ./lmms-eval + nohup pip install . 
> lmmslog.txt 2>&1 & + sleep 120 + export PYTHONPATH=$PYTHONPATH:$(pwd) + cd ../ cd test python3 run_suite.py --hw npu --suite nightly-1-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 nightly-2-npu-a3: + needs: [set-image-config] if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} runs-on: linux-aarch64-a3-2 strategy: @@ -83,27 +150,34 @@ jobs: matrix: part: [0] container: - image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11 + image: ${{ needs.set-image-config.outputs.image_a3 }} steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ needs.set-image-config.outputs.ref || github.ref }} - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" run: | # speed up by using infra cache services CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list pip config set global.index-url http://${CACHING_URL}/pypi/simple - pip config set global.extra-index-url "https://pypi.tuna.tsinghua.edu.cn/simple" - pip config set global.trusted-host "${CACHING_URL} pypi.tuna.tsinghua.edu.cn" + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi - bash scripts/ci/npu/npu_ci_install_dependency.sh a3 # copy required file from our daily cache cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp - 
# copy download through proxy - curl -o /tmp/test.jsonl -L https://gh-proxy.test.osinfra.cn/https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp - name: Print Log Information run: | @@ -118,12 +192,23 @@ jobs: PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" STREAMS_PER_DEVICE: 32 run: | - export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" - pip install sentence_transformers accelerate + pip install sglang_router + hf download lmms-lab/MMMU --repo-type dataset + pip install sentence_transformers torchaudio==2.8.0 + pip install protobuf==6.31.1 zss pre-commit wandb>=0.16.0 tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap + pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 + pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.2.0 numpy==1.26.4 dotenv + git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git + cd ./lmms-eval + nohup pip install . 
> lmmslog.txt 2>&1 & + sleep 120 + export PYTHONPATH=$PYTHONPATH:$(pwd) + cd ../ cd test python3 run_suite.py --hw npu --suite nightly-2-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1 nightly-4-npu-a3: + needs: [set-image-config] if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} runs-on: linux-aarch64-a3-4 strategy: @@ -131,27 +216,34 @@ jobs: matrix: part: [0] container: - image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11 + image: ${{ needs.set-image-config.outputs.image_a3 }} steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.ref || github.ref }} + ref: ${{ needs.set-image-config.outputs.ref|| github.ref }} - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" run: | # speed up by using infra cache services CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list pip config set global.index-url http://${CACHING_URL}/pypi/simple - pip config set global.extra-index-url "https://pypi.tuna.tsinghua.edu.cn/simple" - pip config set global.trusted-host "${CACHING_URL} pypi.tuna.tsinghua.edu.cn" + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi - bash scripts/ci/npu/npu_ci_install_dependency.sh a3 # copy required file from our daily cache cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp - 
# copy download through proxy - curl -o /tmp/test.jsonl -L https://gh-proxy.test.osinfra.cn/https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp - name: Print Log Information run: | @@ -167,12 +259,12 @@ jobs: PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" STREAMS_PER_DEVICE: 32 run: | - export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" + pip install sglang_router hf download lmms-lab/MMMU --repo-type dataset - pip install sentence_transformers torchaudio==2.8.0 torch_npu==2.8.0 + pip install sentence_transformers torchaudio==2.8.0 pip install protobuf==6.31.1 zss pre-commit wandb>=0.16.0 tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap - pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter peft==0.2.0 black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 - pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv + pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 + pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.2.0 numpy==1.26.4 dotenv git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git cd ./lmms-eval nohup pip install . 
> lmmslog.txt 2>&1 & @@ -182,11 +274,148 @@ jobs: cd test python3 run_suite.py --hw npu --suite nightly-4-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1 + nightly-8-npu-a3: + needs: [set-image-config] + if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} + runs-on: linux-aarch64-a3-8 + strategy: + fail-fast: false + matrix: + part: [0] + container: + image: ${{ needs.set-image-config.outputs.image_a3 }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ needs.set-image-config.outputs.ref || github.ref }} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi + + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Print Log Information + run: | + bash scripts/ci/npu/npu_log_print.sh + + - name: Run test + timeout-minutes: 240 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + 
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + run: | + pip install sglang_router + hf download lmms-lab/MMMU --repo-type dataset + pip install sentence_transformers torchaudio==2.8.0 + pip install protobuf==6.31.1 zss pre-commit wandb>=0.16.0 tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap + pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 + pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.2.0 numpy==1.26.4 dotenv + git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git + cd ./lmms-eval + nohup pip install . > lmmslog.txt 2>&1 & + sleep 120 + export PYTHONPATH=$PYTHONPATH:$(pwd) + cd ../ + cd test + python3 run_suite.py --hw npu --suite nightly-8-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1 + + nightly-16-npu-a3: + needs: [set-image-config] + if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }} + runs-on: linux-aarch64-a3-16 + strategy: + fail-fast: false + matrix: + part: [0, 1] + container: + image: ${{ needs.set-image-config.outputs.image_a3 }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ needs.set-image-config.outputs.ref || github.ref }} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei 
"s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + if [ ${{ needs.set-image-config.outputs.skip_install_flag }} != "true" ];then + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 + fi + + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Print Log Information + run: | + bash scripts/ci/npu/npu_log_print.sh + + - name: Run test + timeout-minutes: 240 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + run: | + pip install sglang_router + hf download lmms-lab/MMMU --repo-type dataset + pip install sentence_transformers torchaudio==2.8.0 + pip install protobuf==6.31.1 zss pre-commit wandb>=0.16.0 tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap + pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 sacrebleu>=1.5.0 pytablewriter black==24.1.0 isort==5.13.2 peft>=0.2.0 accelerate>=0.29.1 + pip install jsonlines httpx==0.25.0 evaluate>=0.4.0 datasets==2.16.1 numexpr xgrammar==0.2.0 numpy==1.26.4 dotenv + git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git + cd ./lmms-eval + nohup pip install . 
> lmmslog.txt 2>&1 & + sleep 120 + export PYTHONPATH=$PYTHONPATH:$(pwd) + cd ../ + cd test + python3 run_suite.py --hw npu --suite nightly-16-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 + check-all-jobs: if: github.repository == 'sgl-project/sglang' && always() needs: - nightly-1-npu-a3 + - nightly-2-npu-a3 - nightly-4-npu-a3 + - nightly-8-npu-a3 + - nightly-16-npu-a3 runs-on: ubuntu-latest container: image: docker.m.daocloud.io/ubuntu:22.04 diff --git a/.github/workflows/nightly-test-nvidia.yml b/.github/workflows/nightly-test-nvidia.yml index 731757d6ebde..f77dfba5c3b4 100644 --- a/.github/workflows/nightly-test-nvidia.yml +++ b/.github/workflows/nightly-test-nvidia.yml @@ -12,19 +12,23 @@ on: default: 'all' options: - 'all' - - 'nightly-test-general-1-gpu-runner' + - 'nightly-test-general-1-gpu-h100' - 'nightly-test-general-4-gpu-h100' - 'nightly-test-general-8-gpu-h200' - 'nightly-test-general-8-gpu-h20' - 'nightly-test-general-8-gpu-b200' - - 'nightly-test-text-accuracy-2-gpu-runner' - - 'nightly-test-text-perf-2-gpu-runner' - - 'nightly-test-vlm-accuracy-2-gpu-runner' - - 'nightly-test-vlm-perf-2-gpu-runner' + - 'nightly-test-text-accuracy-2-gpu-h100' + - 'nightly-test-text-perf-2-gpu-h100' + - 'nightly-test-vlm-accuracy-2-gpu-h100' + - 'nightly-test-vlm-perf-2-gpu-h100' - 'nightly-test-multimodal-server-1-gpu' - 'nightly-test-multimodal-server-2-gpu' - 'nightly-test-perf-4-gpu-b200' - 'nightly-test-perf-8-gpu-b200' + - 'nightly-test-specialized-8-gpu-b200' + - 'nightly-test-kernel-1-gpu-h100' + - 'nightly-test-diffusion-comparison' + - 'nightly-test-kernel-8-gpu-h200' workflow_call: inputs: ref: @@ -44,30 +48,102 @@ concurrency: env: SGLANG_IS_IN_CI: true + SGLANG_CUDA_COREDUMP: "1" HF_HUB_DOWNLOAD_TIMEOUT: 300 HF_HUB_ETAG_TIMEOUT: 300 jobs: # General tests - 1 GPU - nightly-test-general-1-gpu-runner: - if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || 
inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-1-gpu-runner') - runs-on: 1-gpu-runner + nightly-test-general-1-gpu-h100: + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-1-gpu-h100') + runs-on: 1-gpu-h100 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh - name: Run test timeout-minutes: 60 + env: + RUNAI_STREAMER_MEMORY_LIMIT: 0 run: | cd test python3 run_suite.py --hw cuda --suite nightly-1-gpu --nightly --continue-on-error + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + # JIT kernel full unit tests (expanded parameter ranges via SGLANG_JIT_KERNEL_RUN_FULL_TESTS) + nightly-test-kernel-1-gpu-h100: + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-kernel-1-gpu-h100') + runs-on: 1-gpu-h100 + timeout-minutes: 60 + env: + # Full jit_kernel test grids (see sglang.jit_kernel.utils.should_run_full_tests) + SGLANG_JIT_KERNEL_RUN_FULL_TESTS: "1" + # Match pr-test-jit-kernel workflow for consistent JIT warmup behavior + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true + # Allow maintenance bypass on default branch (same semantics as PR JIT workflow) + PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - uses: ./.github/actions/check-maintenance + + - name: Install dependencies + timeout-minutes: 20 + run: | + bash scripts/ci/cuda/ci_install_dependency.sh + + - name: Run jit kernel nightly suite + timeout-minutes: 60 + run: | + cd test + python3 run_suite.py --hw cuda --suite nightly-kernel-1-gpu --nightly 
--continue-on-error + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + nightly-test-kernel-8-gpu-h200: + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-kernel-8-gpu-h200') + runs-on: 8-gpu-h200 + timeout-minutes: 240 + env: + SGLANG_JIT_KERNEL_RUN_FULL_TESTS: "1" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true + PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - uses: ./.github/actions/check-maintenance + + - name: Install dependencies + timeout-minutes: 20 + run: | + bash scripts/ci/cuda/ci_install_dependency.sh + + - name: Run multi-GPU jit kernel nightly suite + timeout-minutes: 90 + run: | + cd test + python3 run_suite.py --hw cuda --suite nightly-kernel-8-gpu-h200 --nightly --continue-on-error + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + # General tests - 4 GPU H100 nightly-test-general-4-gpu-h100: if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-4-gpu-h100') @@ -78,24 +154,30 @@ jobs: with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh - name: Run test - timeout-minutes: 30 + timeout-minutes: 60 run: | cd test python3 run_suite.py --hw cuda --suite nightly-4-gpu --nightly --continue-on-error + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + # General tests - 8 GPU H200 nightly-test-general-8-gpu-h200: if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h200') runs-on: 8-gpu-h200 strategy: fail-fast: false + max-parallel: 2 matrix: - 
partition: [0, 1, 2] + partition: [0, 1, 2, 3] env: RUNNER_LABELS: 8-gpu-h200 steps: @@ -104,6 +186,8 @@ jobs: with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh @@ -118,7 +202,26 @@ jobs: IS_H200: "1" run: | cd test - python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=18000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=3 + python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=18000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4 + + - name: Publish traces to storage repo + if: always() + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + GITHUB_RUN_ID: ${{ github.run_id }} + GITHUB_RUN_NUMBER: ${{ github.run_number }} + run: | + TRACE_ARGS="" + for dir in test/performance_profiles_*/; do + [ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir" + done + if [ -n "$TRACE_ARGS" ]; then + python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS + find test/performance_profiles_*/ -name '*.json.gz' -delete + else + echo "No trace directories found, skipping publish" + fi - name: Run test timeout-minutes: 30 @@ -131,7 +234,7 @@ jobs: - name: Collect performance metrics if: always() run: | - python3 scripts/ci/save_metrics.py \ + python3 scripts/ci/utils/save_metrics.py \ --gpu-config 8-gpu-h200 \ --partition ${{ matrix.partition }} \ --run-id ${{ github.run_id }} \ @@ -148,6 +251,11 @@ jobs: retention-days: 5 if-no-files-found: ignore + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.partition }} + # General tests - 8 GPU H20 nightly-test-general-8-gpu-h20: if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 
'nightly-test-general-8-gpu-h20') @@ -160,6 +268,8 @@ jobs: with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh @@ -172,39 +282,64 @@ jobs: cd test python3 run_suite.py --hw cuda --suite nightly-8-gpu-h20 --nightly --continue-on-error + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + # General tests - 8 GPU B200 nightly-test-general-8-gpu-b200: if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-b200') runs-on: 8-gpu-b200 strategy: fail-fast: false + max-parallel: 2 matrix: - partition: [0, 1, 2] + partition: [0, 1, 2, 3] steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | - IS_BLACKWELL=1 bash scripts/ci/cuda/ci_install_dependency.sh + bash scripts/ci/cuda/ci_install_dependency.sh - name: Run common 8-GPU model tests if: always() - timeout-minutes: 300 + timeout-minutes: 200 env: TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }} PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }} GPU_CONFIG: "8-gpu-b200" run: | cd test - IS_BLACKWELL=1 python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=12000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=3 + python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=12000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4 + + - name: Publish traces to storage repo + if: always() + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + GITHUB_RUN_ID: ${{ github.run_id }} + GITHUB_RUN_NUMBER: ${{ github.run_number }} + run: | + 
TRACE_ARGS="" + for dir in test/performance_profiles_*/; do + [ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir" + done + if [ -n "$TRACE_ARGS" ]; then + python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS + find test/performance_profiles_*/ -name '*.json.gz' -delete + else + echo "No trace directories found, skipping publish" + fi - name: Collect performance metrics if: always() run: | - python3 scripts/ci/save_metrics.py \ + python3 scripts/ci/utils/save_metrics.py \ --gpu-config 8-gpu-b200 \ --partition ${{ matrix.partition }} \ --run-id ${{ github.run_id }} \ @@ -221,16 +356,23 @@ jobs: retention-days: 5 if-no-files-found: ignore + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.partition }} + # Text model accuracy tests - nightly-test-text-accuracy-2-gpu-runner: - if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-accuracy-2-gpu-runner') - runs-on: 2-gpu-runner + nightly-test-text-accuracy-2-gpu-h100: + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-accuracy-2-gpu-h100') + runs-on: 2-gpu-h100 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh @@ -241,30 +383,35 @@ jobs: cd test python3 run_suite.py --hw cuda --suite nightly-eval-text-2-gpu --nightly --continue-on-error --timeout-per-file 4500 + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + # Text model performance tests - nightly-test-text-perf-2-gpu-runner: - if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-perf-2-gpu-runner') - runs-on: 2-gpu-runner + 
nightly-test-text-perf-2-gpu-h100: + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-perf-2-gpu-h100') + runs-on: 2-gpu-h100 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh - name: Run performance test for text models - timeout-minutes: 180 + timeout-minutes: 30 env: TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }} PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }} - GPU_CONFIG: "2-gpu-runner" + GPU_CONFIG: "2-gpu-h100" run: | cd test rm -rf performance_profiles_text_models/ - python3 run_suite.py --hw cuda --suite nightly-perf-text-2-gpu --nightly --continue-on-error + python3 run_suite.py --hw cuda --suite nightly-perf-text-2-gpu --nightly --continue-on-error --timeout-per-file 3600 - name: Publish traces to storage repo env: @@ -274,50 +421,60 @@ jobs: run: | python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_text_models + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + # VLM accuracy tests - nightly-test-vlm-accuracy-2-gpu-runner: - if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-accuracy-2-gpu-runner') - runs-on: 2-gpu-runner + nightly-test-vlm-accuracy-2-gpu-h100: + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-accuracy-2-gpu-h100') + runs-on: 2-gpu-h100 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh - 
name: Run eval test for VLM models (fixed MMMU-100) - timeout-minutes: 240 + timeout-minutes: 120 run: | cd test python3 run_suite.py --hw cuda --suite nightly-eval-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 9000 + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + # VLM performance tests - nightly-test-vlm-perf-2-gpu-runner: - if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-perf-2-gpu-runner') - runs-on: 2-gpu-runner + nightly-test-vlm-perf-2-gpu-h100: + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-perf-2-gpu-h100') + runs-on: 2-gpu-h100 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh - name: Run perf test for VLM models (MMMU) - timeout-minutes: 240 + timeout-minutes: 30 env: TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }} PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }} - GPU_CONFIG: "2-gpu-runner" + GPU_CONFIG: "2-gpu-h100" run: | cd test rm -rf performance_profiles_vlms/ - python3 run_suite.py --hw cuda --suite nightly-perf-vlm-2-gpu --nightly --continue-on-error + python3 run_suite.py --hw cuda --suite nightly-perf-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 3600 - name: Publish traces to storage repo env: @@ -327,13 +484,16 @@ jobs: run: | python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_vlms + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + # diffusion performance tests nightly-test-multimodal-server-1-gpu: if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || 
inputs.job_filter == 'nightly-test-multimodal-server-1-gpu') - runs-on: 1-gpu-runner + runs-on: 1-gpu-h100 strategy: fail-fast: false - max-parallel: 5 + max-parallel: 2 matrix: part: [0, 1] steps: @@ -342,6 +502,8 @@ jobs: with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh diffusion @@ -351,6 +513,7 @@ jobs: env: SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }} GITHUB_RUN_ID: ${{ github.run_id }} + GPU_CONFIG: "1-gpu-h100" timeout-minutes: 60 run: | @@ -360,13 +523,35 @@ jobs: --partition-id ${{ matrix.part }} \ --total-partitions 2 + - name: Collect diffusion performance metrics + if: always() + run: | + python3 scripts/ci/utils/diffusion/save_diffusion_metrics.py \ + --gpu-config 1-gpu-h100 \ + --run-id ${{ github.run_id }} \ + --output python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json \ + --results-json python/diffusion-results.json + + - name: Upload diffusion metrics + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-metrics-1gpu-partition-${{ matrix.part }} + path: python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json + retention-days: 90 + if-no-files-found: ignore + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.part }} nightly-test-multimodal-server-2-gpu: if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-2-gpu') - runs-on: 2-gpu-runner + runs-on: 2-gpu-h100 strategy: fail-fast: false - max-parallel: 5 + max-parallel: 2 matrix: part: [0, 1] steps: @@ -375,6 +560,8 @@ jobs: with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh diffusion @@ -384,8 +571,9 @@ jobs: env: 
SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }} GITHUB_RUN_ID: ${{ github.run_id }} + GPU_CONFIG: "2-gpu-h100" - timeout-minutes: 60 + timeout-minutes: 210 run: | cd python python3 sglang/multimodal_gen/test/run_suite.py \ @@ -393,6 +581,29 @@ jobs: --partition-id ${{ matrix.part }} \ --total-partitions 2 + - name: Collect diffusion performance metrics + if: always() + run: | + python3 scripts/ci/utils/diffusion/save_diffusion_metrics.py \ + --gpu-config 2-gpu-h100 \ + --run-id ${{ github.run_id }} \ + --output python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json \ + --results-json python/diffusion-results.json + + - name: Upload diffusion metrics + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-metrics-2gpu-partition-${{ matrix.part }} + path: python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json + retention-days: 90 + if-no-files-found: ignore + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.part }} + # B200 Performance tests - 4 GPU nightly-test-perf-4-gpu-b200: if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-4-gpu-b200') @@ -403,19 +614,24 @@ jobs: with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | - IS_BLACKWELL=1 bash scripts/ci/cuda/ci_install_dependency.sh + bash scripts/ci/cuda/ci_install_dependency.sh - name: Run test - timeout-minutes: 300 + timeout-minutes: 200 run: | cd test python3 run_suite.py --hw cuda --suite nightly-4-gpu-b200 --nightly --continue-on-error --timeout-per-file 12000 + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + # Specialized B200 tests - 8 GPU, for specific backends and configs nightly-test-specialized-8-gpu-b200: - if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter 
== 'all' || inputs.job_filter == 'nightly-test-perf-8-gpu-b200') + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-8-gpu-b200' || inputs.job_filter == 'nightly-test-specialized-8-gpu-b200') runs-on: 8-gpu-b200 env: RUNNER_LABELS: 8-gpu-b200 @@ -425,24 +641,95 @@ jobs: with: ref: ${{ inputs.ref || github.ref }} + - uses: ./.github/actions/check-maintenance + - name: Install dependencies run: | - IS_BLACKWELL=1 bash scripts/ci/cuda/ci_install_dependency.sh + bash scripts/ci/cuda/ci_install_dependency.sh - name: Run test - timeout-minutes: 120 + timeout-minutes: 60 env: GPU_CONFIG: "8-gpu-b200" run: | cd test python3 run_suite.py --hw cuda --suite nightly-8-gpu-b200 --nightly --continue-on-error --timeout-per-file 2400 - # Consolidate performance metrics from all 8-GPU jobs + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + # Diffusion cross-framework comparison + nightly-test-diffusion-comparison: + if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-diffusion-comparison') + runs-on: 4-gpu-h100 + timeout-minutes: 300 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - name: Install dependencies + run: | + bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run cross-framework comparison + env: + GITHUB_SHA: ${{ github.sha }} + GITHUB_RUN_ID: ${{ github.run_id }} + PYTHONUNBUFFERED: "1" + timeout-minutes: 210 + run: | + python3 -u scripts/ci/utils/diffusion/run_comparison.py \ + --output comparison-results.json + + - name: Generate dashboard + if: always() + env: + GH_PAT_FOR_NIGHTLY_CI_DATA: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + GH_TOKEN: ${{ github.token }} + run: | + python3 scripts/ci/utils/diffusion/generate_diffusion_dashboard.py \ + --results comparison-results.json \ + 
--output dashboard.md \ + --charts-dir comparison-charts \ + --fetch-history \ + --step-summary + + - name: Publish to sglang-ci-data + if: always() + env: + GH_PAT_FOR_NIGHTLY_CI_DATA: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }} + run: | + python3 scripts/ci/utils/diffusion/publish_comparison_results.py \ + --results comparison-results.json \ + --dashboard dashboard.md \ + --charts-dir comparison-charts + + - name: Upload comparison artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-comparison-${{ github.run_id }} + path: | + comparison-results.json + dashboard.md + comparison-charts/ + comparison-logs/ + retention-days: 90 + if-no-files-found: ignore + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + # Consolidate performance metrics from all jobs consolidate-metrics: if: github.repository == 'sgl-project/sglang' && always() needs: - nightly-test-general-8-gpu-h200 - nightly-test-general-8-gpu-b200 + - nightly-test-multimodal-server-1-gpu + - nightly-test-multimodal-server-2-gpu runs-on: ubuntu-latest steps: - name: Checkout code @@ -453,7 +740,7 @@ jobs: - name: Download all partition metrics uses: actions/download-artifact@v4 with: - pattern: metrics-* + pattern: "*metrics-*" path: metrics/ merge-multiple: true @@ -464,7 +751,7 @@ jobs: - name: Merge metrics run: | - python3 scripts/ci/merge_metrics.py \ + python3 scripts/ci/utils/merge_metrics.py \ --input-dir metrics/ \ --output consolidated-metrics-${{ github.run_id }}.json \ --run-id ${{ github.run_id }} \ @@ -483,19 +770,20 @@ jobs: check-all-jobs: if: github.repository == 'sgl-project/sglang' && always() needs: - - nightly-test-general-1-gpu-runner + - nightly-test-general-1-gpu-h100 - nightly-test-general-4-gpu-h100 - nightly-test-general-8-gpu-h200 - nightly-test-general-8-gpu-h20 - nightly-test-general-8-gpu-b200 - - nightly-test-text-accuracy-2-gpu-runner - - nightly-test-text-perf-2-gpu-runner - - nightly-test-vlm-accuracy-2-gpu-runner - - 
nightly-test-vlm-perf-2-gpu-runner + - nightly-test-text-accuracy-2-gpu-h100 + - nightly-test-text-perf-2-gpu-h100 + - nightly-test-vlm-accuracy-2-gpu-h100 + - nightly-test-vlm-perf-2-gpu-h100 - nightly-test-multimodal-server-1-gpu - nightly-test-multimodal-server-2-gpu - nightly-test-perf-4-gpu-b200 - nightly-test-specialized-8-gpu-b200 + - nightly-test-diffusion-comparison - consolidate-metrics runs-on: ubuntu-latest steps: diff --git a/.github/workflows/patch-docker-dev.yml b/.github/workflows/patch-docker-dev.yml new file mode 100644 index 000000000000..d81e10b6cd71 --- /dev/null +++ b/.github/workflows/patch-docker-dev.yml @@ -0,0 +1,118 @@ +name: Patch Docker Image + +on: + workflow_dispatch: + inputs: + pr_numbers: + description: "Comma-separated PR numbers to apply (e.g. 18962,19010)" + required: false + default: "" + image_tag: + description: "Base image tag to patch (e.g. dev-x86, dev-x86-cu13)" + required: true + +concurrency: + group: patch-docker-${{ inputs.image_tag }} + cancel-in-progress: true + +jobs: + patch: + if: github.repository == 'sgl-project/sglang' + runs-on: x64-docker-build-node + steps: + - name: Cleanup workspace (remove root-owned files from prior runs) + run: sudo rm -rf "$GITHUB_WORKSPACE"/* || true + + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 0 + + - name: Login to Docker Hub + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + + - name: Pull base image and extract commit + run: | + IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}" + docker pull "${IMAGE}" + if BASE_SHA=$(docker run --rm "${IMAGE}" git -C /sgl-workspace/sglang rev-parse HEAD 2>/dev/null); then + echo "Image built from commit: ${BASE_SHA}" + else + BASE_SHA="" + echo "::warning::Image has no .git directory — cannot extract base commit" + fi + echo "BASE_SHA=${BASE_SHA}" >> "$GITHUB_ENV" + + - name: Generate patches + run: | + git config --global --add 
safe.directory "$GITHUB_WORKSPACE" + git fetch origin main + mkdir -p /tmp/patch-ctx + + if [ -n "${{ inputs.pr_numbers }}" ]; then + IFS=',' read -ra PRS <<< "${{ inputs.pr_numbers }}" + for pr in "${PRS[@]}"; do + pr=$(echo "${pr}" | xargs) + echo "Fetching PR #${pr}" + git fetch origin "pull/${pr}/head:pr-${pr}" + MERGE_BASE=$(git merge-base origin/main "pr-${pr}") + echo " PR #${pr}: merge-base=${MERGE_BASE}" + git diff "${MERGE_BASE}..pr-${pr}" > "/tmp/patch-ctx/${pr}.patch" + echo " PR #${pr}: $(wc -l < /tmp/patch-ctx/${pr}.patch) lines" + done + elif [ -n "${BASE_SHA}" ]; then + echo "Generating diff: image ${BASE_SHA} → latest main" + git fetch origin "${BASE_SHA}" + git diff "${BASE_SHA}..origin/main" > /tmp/patch-ctx/main.patch + echo " main: $(wc -l < /tmp/patch-ctx/main.patch) lines" + else + echo "::error::No PR numbers specified and image has no .git — cannot generate diff against main" + exit 1 + fi + + TOTAL=$(cat /tmp/patch-ctx/*.patch | wc -l) + if [ "${TOTAL}" -eq 0 ]; then + echo "::warning::All patches are empty — image is already up to date" + echo "SKIP_BUILD=true" >> "$GITHUB_ENV" + fi + + - name: Build patched image + if: env.SKIP_BUILD != 'true' + run: | + IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}" + + cat <<'DOCKERFILE' > /tmp/patch-ctx/Dockerfile + ARG BASE_IMAGE + FROM ${BASE_IMAGE} + COPY *.patch /tmp/patches/ + RUN cd /sgl-workspace/sglang \ + && for p in /tmp/patches/*.patch; do \ + if [ ! -s "${p}" ]; then \ + echo "Skipping ${p} (empty)"; \ + else \ + echo "Applying ${p}..." 
\ + && patch -p1 --fuzz=2 --no-backup-if-mismatch -f < "${p}" \ + || { echo "ERROR: Failed to apply ${p}"; exit 1; }; \ + fi; \ + done \ + && rm -rf /tmp/patches + DOCKERFILE + + docker build \ + --no-cache \ + --build-arg BASE_IMAGE="${IMAGE}" \ + -t "${IMAGE}" \ + /tmp/patch-ctx/ + + - name: Push patched image + if: env.SKIP_BUILD != 'true' + run: | + IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}" + docker push "${IMAGE}" + + echo "### Patched \`${IMAGE}\`" >> "$GITHUB_STEP_SUMMARY" + echo "- **Base commit:** \`${BASE_SHA:-unknown (no .git)}\`" >> "$GITHUB_STEP_SUMMARY" + echo "- **Source:** ${{ inputs.pr_numbers && format('PRs: {0}', inputs.pr_numbers) || 'latest main' }}" >> "$GITHUB_STEP_SUMMARY" diff --git a/.github/workflows/pr-test-amd-rocm720.yml b/.github/workflows/pr-test-amd-rocm720.yml new file mode 100644 index 000000000000..16edcb0c1766 --- /dev/null +++ b/.github/workflows/pr-test-amd-rocm720.yml @@ -0,0 +1,1110 @@ +name: PR Test ROCm 7.2 (AMD) +# Dynamic run-name for /rerun-stage commands to enable URL lookup +# Format: "[stage-name] sha" for fork PRs, "[stage-name]" for non-fork, default for normal runs +run-name: ${{ (inputs.target_stage || inputs.target_stage_select) && (inputs.pr_head_sha && format('[{0}] {1}', inputs.target_stage || inputs.target_stage_select, inputs.pr_head_sha) || format('[{0}]', inputs.target_stage || inputs.target_stage_select)) || '' }} + +on: + schedule: + - cron: '30 17 * * *' + # push: + # branches: [ main ] + # paths: + # - "python/**" + # - "scripts/ci/**" + # - "test/**" + # - "sgl-kernel/**" + # - ".github/workflows/pr-test-amd-rocm720.yml" + # - "docker/rocm.Dockerfile" + # pull_request: + # branches: [ main ] + # paths: + # - "python/**" + # - "scripts/ci/**" + # - "test/**" + # - "sgl-kernel/**" + # - ".github/workflows/pr-test-amd-rocm720.yml" + # - "docker/rocm.Dockerfile" + workflow_dispatch: + inputs: + target_stage_select: + description: "Select a stage to run from dropdown (leave empty for auto-detect)" + 
required: false + type: choice + default: '' + options: + - '' + - sgl-kernel-unit-test-amd-rocm720 + - sgl-kernel-unit-test-2-gpu-amd-rocm720 + - stage-a-test-1-gpu-small-amd-rocm720 + - jit-kernel-unit-test-amd-rocm720 + - stage-b-test-1-gpu-small-amd-rocm720 + - stage-b-test-1-gpu-small-amd-nondeterministic-rocm720 + - stage-b-test-1-gpu-small-amd-mi35x-rocm720 + - stage-b-test-1-gpu-large-amd-rocm720 + - stage-b-test-2-gpu-large-amd-rocm720 + - multimodal-gen-test-1-gpu-amd-rocm720 + - multimodal-gen-test-2-gpu-amd-rocm720 + - stage-c-test-large-8-gpu-amd-rocm720 + - stage-c-test-large-8-gpu-amd-mi35x-rocm720 + - stage-b-test-large-8-gpu-disaggregation-amd-rocm720 + - stage-c-test-4-gpu-amd-rocm720 + target_stage: + description: "Or type comma-separated stage names (overrides dropdown if non-empty)" + required: false + type: string + default: "" + pr_head_sha: + description: "PR head SHA to checkout (for /rerun-stage on fork PRs)" + required: false + type: string + default: "" + aiter_ref: + description: 'Override AITER commit (optional, leave empty to use Dockerfile default)' + required: false + type: string + default: '' + continue_on_error: + description: 'Continue on error (do not fail the workflow on test failures)' + required: false + type: boolean + default: true + workflow_call: + inputs: + ref: + description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.' 
+ required: false + type: string + default: '' + run_all_tests: + description: "Run all tests (for releasing or testing purpose)" + required: false + type: boolean + default: false + aiter_ref: + description: 'Override AITER commit (optional, leave empty to use Dockerfile default)' + required: false + type: string + default: '' + continue_on_error: + description: 'Continue on error (do not fail the workflow on test failures)' + required: false + type: boolean + default: true + +env: + AITER_COMMIT_OVERRIDE: ${{ inputs.aiter_ref }} + DOCKERHUB_AMD_USERNAME: ${{ secrets.DOCKERHUB_AMD_USERNAME }} + DOCKERHUB_AMD_TOKEN: ${{ secrets.DOCKERHUB_AMD_TOKEN }} + +concurrency: + # When called via workflow_call with run_all_tests=true, use a unique group per run to + # avoid collisions with direct schedule/workflow_dispatch triggers. We use run_all_tests + # (not github.event_name) to detect this, because github.event_name inherits from the caller. + # Manual dispatch runs also get unique groups so they never cancel each other. 
+ group: pr-test-amd-rocm720-${{ (inputs.run_all_tests || github.event_name == 'workflow_dispatch') && format('full-{0}', github.run_id) || inputs.pr_head_sha || inputs.ref || github.ref }} + cancel-in-progress: ${{ !inputs.run_all_tests && github.event_name != 'workflow_call' && github.event_name != 'workflow_dispatch' }} + +jobs: + call-gate: + uses: ./.github/workflows/pr-gate.yml + secrets: inherit + check-changes: + needs: [call-gate] + runs-on: ubuntu-latest + outputs: + main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }} + sgl_kernel: ${{ steps.filter.outputs.sgl_kernel || steps.run-mode.outputs.run_all_tests }} + jit_kernel: ${{ steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }} + multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Determine run mode + id: run-mode + run: | + # Run all tests when run_all_tests is set (e.g. by a workflow_call caller) + # Note: github.event_name is inherited from the caller, so we check inputs.run_all_tests instead of the event name + if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then + echo "run_all_tests=true" >> $GITHUB_OUTPUT + echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})" + else + echo "run_all_tests=false" >> $GITHUB_OUTPUT + echo "Run mode: FILTERED (triggered by ${{ github.event_name }})" + fi + + - name: Detect file changes + id: filter + uses: dorny/paths-filter@v3 + if: steps.run-mode.outputs.run_all_tests != 'true' + with: + filters: | + main_package: + - "python/sglang/!(multimodal_gen)/**/!(*.md)" + - "python/pyproject_rocm.toml" + - "python/pyproject_other.toml" + - "scripts/ci/amd/*" + - "scripts/ci/utils/*" + - "test/**/!(*.md)" + - ".github/workflows/pr-test-amd-rocm720.yml" + sgl_kernel: + - "sgl-kernel/**/!(*.md|THIRDPARTYNOTICES.txt|LICENSE)" + - 
".github/workflows/pr-test-amd-rocm720.yml" + jit_kernel: + - "python/sglang/jit_kernel/**" + - ".github/workflows/pr-test-amd-rocm720.yml" + multimodal_gen: + - "python/sglang/multimodal_gen/**/!(*.md|*.ipynb)" + - "python/sglang/cli/**" + - "python/sglang/jit_kernel/diffusion/**" + - "python/sglang/jit_kernel/tests/diffusion/**" + - "python/sglang/jit_kernel/benchmark/diffusion/**" + - "python/pyproject_rocm.toml" + - "python/pyproject_other.toml" + + # =============================================== sgl-kernel ==================================================== + sgl-kernel-unit-test-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',sgl-kernel-unit-test-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + needs.check-changes.outputs.sgl_kernel == 'true' + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-1gpu-sglang] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 14 + run: | + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_align.py + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_softmax.py + docker exec -w /sglang-checkout/sgl-kernel/tests/speculative ci_sglang python3 -m pytest test_eagle_utils.py + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_apply_token_bitmask_inplace.py + docker exec -w 
/sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_activation.py + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_topk.py + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_kvcacheio.py + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_sigmoid.py + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_torch_defaults_reset.py + + sgl-kernel-unit-test-2-gpu-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',sgl-kernel-unit-test-2-gpu-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + needs.check-changes.outputs.sgl_kernel == 'true' + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-2gpu-sglang] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 20 + run: | + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_deterministic_custom_allreduce.py + docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_nccl_allreduce_determinism.py + + # =============================================== primary ==================================================== + + stage-a-test-1-gpu-small-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), 
',stage-a-test-1-gpu-small-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-1gpu-sglang] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 10 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-a-test-1-gpu-small-amd ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + jit-kernel-unit-test-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',jit-kernel-unit-test-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + needs.check-changes.outputs.jit_kernel == 'true' + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-1gpu-sglang] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run 
JIT kernel unit tests + timeout-minutes: 10 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout" python3 -m pytest -q python/sglang/jit_kernel/tests/test_store_cache.py + + stage-b-test-1-gpu-small-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-1-gpu-small-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-1gpu-sglang] + part: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 30 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-small-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 14 --timeout-per-file 1800 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + stage-b-test-1-gpu-small-amd-nondeterministic-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-1-gpu-small-amd-nondeterministic-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || 
(needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-1gpu-sglang] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 30 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-small-amd-nondeterministic --timeout-per-file 1800 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + stage-b-test-1-gpu-small-amd-mi35x-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-1-gpu-small-amd-mi35x-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi35x-gpu-1] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 30 + run: | + bash 
scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-small-amd-mi35x ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + stage-b-test-1-gpu-large-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-1-gpu-large-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-1gpu-sglang] + part: [0, 1] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 30 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-large-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 1800 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + stage-b-test-2-gpu-large-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-2-gpu-large-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + 
fail-fast: false + matrix: + runner: [linux-mi325-2gpu-sglang] + part: [0, 1] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 30 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-2-gpu-large-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 1800 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + multimodal-gen-test-1-gpu-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',multimodal-gen-test-1-gpu-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + max-parallel: 1 # Run one at a time to avoid eviction from resource exhaustion during AITER kernel JIT + matrix: + runner: [linux-mi325-1gpu-sglang] + part: [0, 1, 2, 3] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Download artifacts + if: needs.check-changes.outputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: 
wheel-python3.10-cuda12.9 + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh diffusion + docker exec ci_sglang pip install amdsmi + + - name: Setup kernel caches + run: | + # Use the persistent /sgl-data directory (mounted from /home/runner/sgl-data) + # This directory persists across container restarts on the self-hosted runner + docker exec ci_sglang mkdir -p /sgl-data/aiter-kernels /sgl-data/miopen-cache /sgl-data/hf-cache/hub + + # Clear pre-built AITER kernels from Docker image to avoid segfaults + # The image may have stale/incompatible kernels at /sgl-workspace/aiter/aiter/jit/ + echo "Clearing pre-built AITER kernels from Docker image..." + docker exec ci_sglang rm -rf /sgl-workspace/aiter/aiter/jit/*.so 2>/dev/null || true + docker exec ci_sglang rm -rf /sgl-data/aiter-kernels/*.so 2>/dev/null || true + echo "AITER kernels cleared - will be rebuilt on first use" + + # Create persistent cache marker if /sgl-data is a real mount (not ephemeral) + # This tells the test cleanup code to NOT delete downloaded models + if docker exec ci_sglang test -d /sgl-data && docker exec ci_sglang mountpoint -q /sgl-data 2>/dev/null; then + docker exec ci_sglang touch /sgl-data/hf-cache/.persistent_cache + echo "Created .persistent_cache marker - HF cache will persist" + else + echo "WARNING: /sgl-data is not a mount point - models will be cleaned up after each test" + fi + + # Check MIOpen cache (VAE convolution kernels) + miopen_files=$(docker exec ci_sglang find /sgl-data/miopen-cache -name "*.udb" 2>/dev/null | wc -l || echo "0") + echo "Found ${miopen_files} MIOpen cache files" + + - name: Diagnose HF cache and system resources + run: | + echo "=== System Memory Status ===" + free -h + echo "" + echo "=== Disk Space ===" + df -h /home/runner/sgl-data 2>/dev/null || df -h + 
echo "" + echo "=== HF Cache Directory Structure ===" + docker exec ci_sglang ls -la /sgl-data/hf-cache/ 2>/dev/null || echo "HF cache dir not found" + docker exec ci_sglang ls -la /sgl-data/hf-cache/hub/ 2>/dev/null || echo "HF hub cache not found" + echo "" + echo "=== Checking for cached diffusion models (1-GPU tests) ===" + # Models used in 1-GPU tests: Wan2.1-T2V-1.3B, HunyuanVideo, Qwen-Image, FLUX.1, FLUX.2 + for model in "Wan-AI--Wan2.1-T2V-1.3B-Diffusers" "tencent--HunyuanVideo" "Qwen--Qwen-Image" "black-forest-labs--FLUX.1-dev" "black-forest-labs--FLUX.2-dev"; do + cache_path="/sgl-data/hf-cache/hub/models--${model}" + if docker exec ci_sglang test -d "$cache_path"; then + size=$(docker exec ci_sglang du -sh "$cache_path" 2>/dev/null | cut -f1) + echo "✓ CACHED: $model ($size)" + else + echo "✗ NOT CACHED: $model" + fi + done + echo "" + echo "=== GPU Memory Status ===" + docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available" + + - name: Run diffusion server tests (1-GPU) + timeout-minutes: 60 + run: | + # AMD CI: All 1-GPU tests except FLUX.2 (FLUX.1 covers same code path) + # Tests: T2V, T2I, I2V, LoRA + # + # HF download env vars: + # - HF_HUB_ENABLE_HF_TRANSFER=1: Use faster hf_transfer for downloads (if available) + # - HF_HUB_DISABLE_SYMLINKS_WARNING=1: Suppress symlink warnings + docker exec \ + -e SGLANG_E2E_TOLERANCE=0.3 \ + -e SGLANG_STAGE_TIME_TOLERANCE=0.2 \ + -e SGLANG_NON_DENOISE_STAGE_TIME_TOLERANCE=0.6 \ + -e SGLANG_DENOISE_STEP_TOLERANCE=0.6 \ + -e SGLANG_DENOISE_AGG_TOLERANCE=0.3 \ + -e SGLANG_SKIP_CONSISTENCY=1 \ + -e SGLANG_TEST_NUM_INFERENCE_STEPS=5 \ + -e SGLANG_DIFFUSION_ARTIFACT_DIR=/sglang-checkout/diffusion-failures \ + -e AITER_JIT_DIR=/sgl-data/aiter-kernels \ + -e MIOPEN_USER_DB_PATH=/sgl-data/miopen-cache \ + -e HF_HUB_ENABLE_HF_TRANSFER=1 \ + -e HF_HUB_DISABLE_SYMLINKS_WARNING=1 \ + -w /sglang-checkout/python \ + ci_sglang python3 sglang/multimodal_gen/test/run_suite.py \ + --suite 
1-gpu \ + --partition-id ${{ matrix.part }} \ + --total-partitions 4 \ + -k "not flux_2" + + # Post-test diagnostics + echo "=== Post-test System Memory Status ===" + free -h + + - name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-amd-rocm720-1gpu-${{ matrix.part }}-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 + + multimodal-gen-test-2-gpu-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',multimodal-gen-test-2-gpu-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + max-parallel: 1 # Run one at a time to avoid eviction from resource exhaustion during AITER kernel JIT + matrix: + runner: [linux-mi325-2gpu-sglang] + part: [0, 1, 2] # 3 partitions: 2 parametrized + 1 standalone (test_disagg_server.py) + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Download artifacts + if: needs.check-changes.outputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda12.9 + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh diffusion + docker exec ci_sglang pip install amdsmi + + - name: Setup kernel caches + run: | + # Use the persistent /sgl-data 
directory (mounted from /home/runner/sgl-data) + docker exec ci_sglang mkdir -p /sgl-data/aiter-kernels /sgl-data/miopen-cache /sgl-data/hf-cache/hub + + # Clear pre-built AITER kernels from Docker image to avoid segfaults + # The image may have stale/incompatible kernels at /sgl-workspace/aiter/aiter/jit/ + echo "Clearing pre-built AITER kernels from Docker image..." + docker exec ci_sglang rm -rf /sgl-workspace/aiter/aiter/jit/*.so 2>/dev/null || true + docker exec ci_sglang rm -rf /sgl-data/aiter-kernels/*.so 2>/dev/null || true + echo "AITER kernels cleared - will be rebuilt on first use" + + # Create persistent cache marker if /sgl-data is a real mount (not ephemeral) + # This tells the test cleanup code to NOT delete downloaded models + if docker exec ci_sglang test -d /sgl-data && docker exec ci_sglang mountpoint -q /sgl-data 2>/dev/null; then + docker exec ci_sglang touch /sgl-data/hf-cache/.persistent_cache + echo "Created .persistent_cache marker - HF cache will persist" + else + echo "WARNING: /sgl-data is not a mount point - models will be cleaned up after each test" + fi + + # Check MIOpen cache (VAE convolution kernels) + miopen_files=$(docker exec ci_sglang find /sgl-data/miopen-cache -name "*.udb" 2>/dev/null | wc -l || echo "0") + echo "Found ${miopen_files} MIOpen cache files" + + - name: Diagnose HF cache and system resources + run: | + echo "=== System Memory Status ===" + free -h + echo "" + echo "=== Disk Space ===" + df -h /home/runner/sgl-data 2>/dev/null || df -h + echo "" + echo "=== HF Cache Directory Structure ===" + docker exec ci_sglang ls -la /sgl-data/hf-cache/ 2>/dev/null || echo "HF cache dir not found" + docker exec ci_sglang ls -la /sgl-data/hf-cache/hub/ 2>/dev/null || echo "HF hub cache not found" + echo "" + echo "=== Checking for cached diffusion models (2-GPU tests) ===" + # Models used in 2-GPU tests: Wan2.2-T2V-A14B, Wan2.1-T2V-14B, Qwen-Image, FLUX.1 + for model in "Wan-AI--Wan2.2-T2V-A14B-Diffusers" 
"Wan-AI--Wan2.1-T2V-14B-Diffusers" "Qwen--Qwen-Image" "black-forest-labs--FLUX.1-dev"; do + cache_path="/sgl-data/hf-cache/hub/models--${model}" + if docker exec ci_sglang test -d "$cache_path"; then + size=$(docker exec ci_sglang du -sh "$cache_path" 2>/dev/null | cut -f1) + echo "✓ CACHED: $model ($size)" + else + echo "✗ NOT CACHED: $model" + fi + done + echo "" + echo "=== GPU Memory Status ===" + docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available" + + - name: Run diffusion server tests (2-GPU) + timeout-minutes: 150 + run: | + # AMD CI: All 2-GPU tests including LoRA + # Tests: T2V, T2I, I2V, LoRA + # + # HF download env vars: + # - HF_HUB_ENABLE_HF_TRANSFER=1: Use faster hf_transfer for downloads (if available) + # - HF_HUB_DISABLE_SYMLINKS_WARNING=1: Suppress symlink warnings + docker exec \ + -e SGLANG_E2E_TOLERANCE=0.3 \ + -e SGLANG_STAGE_TIME_TOLERANCE=0.2 \ + -e SGLANG_NON_DENOISE_STAGE_TIME_TOLERANCE=0.6 \ + -e SGLANG_DENOISE_STEP_TOLERANCE=0.6 \ + -e SGLANG_DENOISE_AGG_TOLERANCE=0.3 \ + -e SGLANG_SKIP_CONSISTENCY=1 \ + -e SGLANG_TEST_NUM_INFERENCE_STEPS=5 \ + -e SGLANG_DIFFUSION_ARTIFACT_DIR=/sglang-checkout/diffusion-failures \ + -e AITER_JIT_DIR=/sgl-data/aiter-kernels \ + -e MIOPEN_USER_DB_PATH=/sgl-data/miopen-cache \ + -e HF_HUB_ENABLE_HF_TRANSFER=1 \ + -e HF_HUB_DISABLE_SYMLINKS_WARNING=1 \ + -w /sglang-checkout/python \ + ci_sglang python3 sglang/multimodal_gen/test/run_suite.py \ + --suite 2-gpu \ + --partition-id ${{ matrix.part }} \ + --total-partitions 3 + + # Post-test diagnostics + echo "=== Post-test System Memory Status ===" + free -h + + - name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-amd-rocm720-2gpu-${{ matrix.part }}-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 + + + stage-c-test-4-gpu-amd-rocm720: + needs: [check-changes, 
stage-b-test-1-gpu-small-amd-rocm720, stage-b-test-2-gpu-large-amd-rocm720] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-c-test-4-gpu-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-4gpu-sglang] + part: [0] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Run test + timeout-minutes: 60 + run: | + bash scripts/ci/amd/amd_ci_exec.sh \ + -e NCCL_CUMEM_ENABLE=0 \ + -e NCCL_NVLS_ENABLE=0 \ + -e RCCL_MSCCL_ENABLE=0 \ + -e SGLANG_USE_ROCM700A=1 \ + -w "/sglang-checkout/test" \ + python3 run_suite.py \ + --hw amd \ + --suite stage-c-test-4-gpu-amd \ + --auto-partition-id ${{ matrix.part }} \ + --auto-partition-size 1 \ + --timeout-per-file 1800 \ + --enable-retry \ + --max-attempts 2 \ + --retry-wait-seconds 120 \ + --retry-timeout-increase 0 \ + ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + stage-c-test-large-8-gpu-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-c-test-large-8-gpu-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel ==
'true')) + ) + ) + env: + RUNNER_LABELS: linux-mi325-8gpu-sglang + strategy: + fail-fast: false + matrix: + runner: [linux-mi325-8gpu-sglang] + part: [0, 1, 2] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Test RCCL multi-GPU communication + timeout-minutes: 5 + run: | + echo "Testing RCCL multi-GPU communication with debug info..." + docker exec ci_sglang bash -c "cd /sglang-checkout && NCCL_DEBUG=INFO RCCL_DEBUG=INFO torchrun --nproc_per_node=8 scripts/ci/amd/test_rccl_multi_gpu.py" + + - name: Run test + timeout-minutes: 60 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + stage-c-test-large-8-gpu-amd-mi35x-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-c-test-large-8-gpu-amd-mi35x-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi35x-gpu-8] + part: [0, 1] + runs-on: ${{matrix.runner}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is 
clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Run test + timeout-minutes: 60 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd-mi35x --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 3600 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + # =============================================== Disaggregation ==================================================== + stage-b-test-large-8-gpu-35x-disaggregation-amd-rocm720: + needs: [check-changes] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-large-8-gpu-35x-disaggregation-amd-rocm720,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (!failure() && !cancelled()) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + strategy: + fail-fast: false + matrix: + runner: [linux-mi35x-gpu-8.fabric] + + runs-on: ${{matrix.runner}} + + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Check Host RDMA Environment + id: rdma_detect + run: | + set +e + echo "=== Checking Host RDMA Environment ===" + + echo "" + echo "=== 1. Ionic driver library check ===" + ls -l /usr/lib/x86_64-linux-gnu/libibverbs/libionic* 2>/dev/null || echo "libionic not found in standard path" + + echo "" + echo "=== 2.
Infiniband devices ===" + ls -la /dev/infiniband/ 2>/dev/null || echo "/dev/infiniband not found" + ls -la /sys/class/infiniband/ 2>/dev/null || echo "/sys/class/infiniband not found" + + echo "" + echo "=== 3. ibv_devinfo ===" + which ibv_devinfo 2>/dev/null && ibv_devinfo 2>&1 || echo "ibv_devinfo not available" + + echo "" + echo "=== 4. Kernel modules ===" + lsmod 2>/dev/null | grep -E "ib_|rdma|ionic" || echo "No RDMA kernel modules loaded" + + echo "" + echo "=== 5. Detect RDMA Devices for test environment ===" + if [ -d "/sys/class/infiniband" ]; then + RDMA_DEVS=$(ls /sys/class/infiniband | paste -sd "," -) + echo "Detected RDMA Devices: $RDMA_DEVS" + echo "SGLANG_TEST_RDMA_DEVICE=$RDMA_DEVS" >> $GITHUB_ENV + else + echo "No RDMA devices found in /sys/class/infiniband" + echo "SGLANG_TEST_RDMA_DEVICE=" >> $GITHUB_ENV + fi + + echo "" + echo "=== Host RDMA Check Complete ===" + + - name: Start Special Container + run: bash scripts/ci/amd/amd_ci_start_container_disagg.sh --rocm-version rocm720 + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Verify RDMA in Container + run: | + docker exec -u root ci_sglang bash -c ' + echo "=== Container RDMA Verification ===" + echo "Device nodes:" + ls -la /dev/infiniband/ + echo "" + echo "Provider libraries:" + ls /usr/lib/x86_64-linux-gnu/libibverbs/ | grep -E "ionic|mlx" || echo "No Ionic/Mellanox providers" + echo "" + echo "HCA devices:" + HCA_COUNT=$(ibv_devinfo -list 2>&1 | grep -oE "^[0-9]+ HCAs? found" | grep -oE "^[0-9]+" || echo "0") + ibv_devinfo -list + if [ "$HCA_COUNT" -gt 0 ]; then + echo "" + echo "=== SUCCESS: RDMA setup complete. Found $HCA_COUNT HCA(s) ===" + else + echo "" + echo "=== WARNING: No HCAs detected. 
RDMA tests may fail ===" + fi + ' + + - name: Run Aiter Op Test (RMSNorm) + timeout-minutes: 10 + run: | + echo "Running pre-check: test_rmsnorm2d.py" + docker exec \ + -e MAX_JOBS=192 \ + ci_sglang \ + python /sgl-workspace/aiter/op_tests/test_rmsnorm2d.py + + - name: Run test_disaggregation + timeout-minutes: 60 + run: | + bash scripts/ci/amd/amd_ci_exec.sh \ + -e SGLANG_TEST_RDMA_DEVICE="${{ env.SGLANG_TEST_RDMA_DEVICE }}" \ + -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-8-gpu-35x-disaggregation-amd --timeout-per-file 1800 ${{ inputs.continue_on_error && '--continue-on-error' || '' }} + + pr-test-amd-rocm720-finish: + needs: + [ + call-gate, + check-changes, + + sgl-kernel-unit-test-amd-rocm720, + sgl-kernel-unit-test-2-gpu-amd-rocm720, + multimodal-gen-test-1-gpu-amd-rocm720, + multimodal-gen-test-2-gpu-amd-rocm720, + + stage-a-test-1-gpu-small-amd-rocm720, + jit-kernel-unit-test-amd-rocm720, + stage-b-test-1-gpu-small-amd-rocm720, + stage-b-test-1-gpu-small-amd-nondeterministic-rocm720, + stage-b-test-1-gpu-small-amd-mi35x-rocm720, + stage-b-test-1-gpu-large-amd-rocm720, + stage-b-test-2-gpu-large-amd-rocm720, + stage-b-test-large-8-gpu-35x-disaggregation-amd-rocm720, + stage-c-test-4-gpu-amd-rocm720, + stage-c-test-large-8-gpu-amd-rocm720, + stage-c-test-large-8-gpu-amd-mi35x-rocm720, + ] + if: always() + runs-on: ubuntu-latest + steps: + - name: Check all dependent job statuses + run: | + # Convert the 'needs' context to a JSON string + json_needs='${{ toJson(needs) }}' + + # Get a list of all job names from the JSON keys + job_names=$(echo "$json_needs" | jq -r 'keys_unsorted[]') + + for job in $job_names; do + # For each job, extract its result + result=$(echo "$json_needs" | jq -r --arg j "$job" '.[$j].result') + + # Print the job name and its result + echo "$job: $result" + + # Check for failure or cancellation and exit if found + if [[ "$result" == "failure" || "$result" == "cancelled" ]]; then + echo "The above 
jobs failed." + exit 1 + fi + done + + # If the loop completes, all jobs were successful + echo "All jobs completed successfully" + exit 0 diff --git a/.github/workflows/pr-test-amd.yml b/.github/workflows/pr-test-amd.yml index 432a1f1e9921..acb9281e3403 100644 --- a/.github/workflows/pr-test-amd.yml +++ b/.github/workflows/pr-test-amd.yml @@ -1,18 +1,11 @@ name: PR Test (AMD) # Dynamic run-name for /rerun-stage commands to enable URL lookup # Format: "[stage-name] sha" for fork PRs, "[stage-name]" for non-fork, default for normal runs -run-name: ${{ inputs.target_stage && (inputs.pr_head_sha && format('[{0}] {1}', inputs.target_stage, inputs.pr_head_sha) || format('[{0}]', inputs.target_stage)) || '' }} +run-name: ${{ (inputs.target_stage || inputs.target_stage_select) && (inputs.pr_head_sha && format('[{0}] {1}', inputs.target_stage || inputs.target_stage_select, inputs.pr_head_sha) || format('[{0}]', inputs.target_stage || inputs.target_stage_select)) || '' }} on: - push: - branches: [ main ] - paths: - - "python/**" - - "scripts/ci/**" - - "test/**" - - "sgl-kernel/**" - - ".github/workflows/pr-test-amd.yml" - - "docker/rocm.Dockerfile" + schedule: + - cron: '0 */6 * * *' # Run every 6 hours (UTC) pull_request: branches: [ main ] paths: @@ -24,8 +17,30 @@ on: - "docker/rocm.Dockerfile" workflow_dispatch: inputs: + target_stage_select: + description: "Select a stage to run from dropdown (leave empty for auto-detect)" + required: false + type: choice + default: '' + options: + - '' + - sgl-kernel-unit-test-amd + - sgl-kernel-unit-test-2-gpu-amd + - stage-a-test-1-gpu-small-amd + - jit-kernel-unit-test-amd + - stage-b-test-1-gpu-small-amd + - stage-b-test-1-gpu-small-amd-nondeterministic + - stage-b-test-1-gpu-small-amd-mi35x + - stage-b-test-1-gpu-large-amd + - stage-b-test-2-gpu-large-amd + - multimodal-gen-test-1-gpu-amd + - multimodal-gen-test-2-gpu-amd + - stage-c-test-4-gpu-amd + - stage-c-test-large-8-gpu-amd + - stage-c-test-large-8-gpu-amd-mi35x + - 
stage-b-test-large-8-gpu-35x-disaggregation-amd target_stage: - description: "Specific stage to run (optional, for quick testing)" + description: "Or type comma-separated stage names (overrides dropdown if non-empty)" required: false type: string default: "" @@ -34,6 +49,24 @@ on: required: false type: string default: "" + aiter_ref: + description: 'Override AITER commit (optional, leave empty to use Dockerfile default)' + required: false + type: string + default: '' + continue_on_error: + description: 'Continue on error (do not fail the workflow on test failures)' + required: false + type: boolean + default: false + runner_arch: + description: 'AMD runner pool to dispatch GPU jobs to' + required: false + type: choice + default: mi300 + options: + - mi300 + - mi325 workflow_call: inputs: ref: @@ -46,23 +79,43 @@ on: required: false type: boolean default: false + aiter_ref: + description: 'Override AITER commit (optional, leave empty to use Dockerfile default)' + required: false + type: string + default: '' + continue_on_error: + description: 'Continue on error (do not fail the workflow on test failures)' + required: false + type: boolean + default: false + +env: + AITER_COMMIT_OVERRIDE: ${{ inputs.aiter_ref }} + DOCKERHUB_AMD_USERNAME: ${{ secrets.DOCKERHUB_AMD_USERNAME }} + DOCKERHUB_AMD_TOKEN: ${{ secrets.DOCKERHUB_AMD_TOKEN }} concurrency: - # Include pr_head_sha in group for /rerun-stage dispatches to avoid collisions with main branch runs - group: pr-test-amd-${{ inputs.pr_head_sha || inputs.ref || github.ref }} - cancel-in-progress: ${{ github.event_name != 'workflow_call' }} + # Scheduled, run_all_tests, and manual dispatch runs get unique groups (never cancel each other). + # PR runs share a group per branch so new pushes cancel stale runs. 
+ group: pr-test-amd-${{ (inputs.run_all_tests || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && format('full-{0}', github.run_id) || inputs.pr_head_sha || inputs.ref || github.ref }} + cancel-in-progress: ${{ !inputs.run_all_tests && github.event_name != 'workflow_call' && github.event_name != 'schedule' && github.event_name != 'workflow_dispatch' }} jobs: call-gate: + if: github.event_name != 'schedule' uses: ./.github/workflows/pr-gate.yml secrets: inherit check-changes: needs: [call-gate] + if: always() runs-on: ubuntu-latest outputs: main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }} sgl_kernel: ${{ steps.filter.outputs.sgl_kernel || steps.run-mode.outputs.run_all_tests }} + jit_kernel: ${{ steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }} multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }} + continue_on_error: ${{ steps.set-continue-on-error.outputs.continue_on_error }} steps: - name: Checkout code uses: actions/checkout@v4 @@ -72,16 +125,25 @@ jobs: - name: Determine run mode id: run-mode run: | - # Run all tests for workflow_call (when ref input is provided) - # Note: github.event_name is inherited from caller, so we detect workflow_call by checking inputs.ref - if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then + if [[ "${{ inputs.run_all_tests }}" == "true" || "${{ github.event_name }}" == "schedule" ]]; then echo "run_all_tests=true" >> $GITHUB_OUTPUT - echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})" + echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }}, event=${{ github.event_name }})" else echo "run_all_tests=false" >> $GITHUB_OUTPUT echo "Run mode: FILTERED (triggered by ${{ github.event_name }})" fi + - name: Set continue-on-error for schedule/full runs + id: set-continue-on-error + run: | + if [[ "${{ steps.run-mode.outputs.run_all_tests }}" == "true" || "${{ 
inputs.continue_on_error }}" == "true" ]]; then + echo "continue_on_error=true" >> $GITHUB_OUTPUT + echo "Continue-on-error: ENABLED (run_all_tests=${{ steps.run-mode.outputs.run_all_tests }}, input=${{ inputs.continue_on_error }})" + else + echo "continue_on_error=false" >> $GITHUB_OUTPUT + echo "Continue-on-error: DISABLED" + fi + - name: Detect file changes id: filter uses: dorny/paths-filter@v3 @@ -89,39 +151,43 @@ jobs: with: filters: | main_package: - - "python/sglang/!(multimodal_gen)/**" + - "python/sglang/!(multimodal_gen)/**/!(*.md)" - "python/pyproject_rocm.toml" - "python/pyproject_other.toml" - "scripts/ci/amd/*" - "scripts/ci/utils/*" - - "test/**" + - "test/**/!(*.md)" - ".github/workflows/pr-test-amd.yml" sgl_kernel: - - "sgl-kernel/**" + - "sgl-kernel/**/!(*.md|THIRDPARTYNOTICES.txt|LICENSE)" + - ".github/workflows/pr-test-amd.yml" + jit_kernel: + - "python/sglang/jit_kernel/**" - ".github/workflows/pr-test-amd.yml" multimodal_gen: - - "python/sglang/multimodal_gen/**" + - "python/sglang/multimodal_gen/**/!(*.md|*.ipynb)" - "python/sglang/cli/**" + - "python/sglang/jit_kernel/diffusion/**" + - "python/sglang/jit_kernel/tests/diffusion/**" + - "python/sglang/jit_kernel/benchmark/diffusion/**" - "python/pyproject_rocm.toml" - "python/pyproject_other.toml" # =============================================== sgl-kernel ==================================================== sgl-kernel-unit-test-amd: - needs: [check-changes] + name: ${{ format('sgl-kernel-unit-test-amd (linux-{0}-1gpu-sglang)', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} + needs: [check-changes, call-gate] if: | - always() && + always() && !cancelled() && ( - (inputs.target_stage == 'sgl-kernel-unit-test-amd') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',sgl-kernel-unit-test-amd,')) || ( - !inputs.target_stage && + !(inputs.target_stage || inputs.target_stage_select) && + (needs.call-gate.result == 'success' || 
needs.call-gate.result == 'skipped') && needs.check-changes.outputs.sgl_kernel == 'true' ) ) - strategy: - fail-fast: false - matrix: - runner: [linux-mi325-gpu-1] - runs-on: ${{matrix.runner}} + runs-on: ${{ format('linux-{0}-1gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} steps: - name: Checkout code uses: actions/checkout@v4 @@ -129,7 +195,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -142,34 +208,44 @@ jobs: - name: Run test timeout-minutes: 14 + env: + CONTINUE_ON_ERROR: ${{ needs.check-changes.outputs.continue_on_error }} run: | - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_align.py - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_softmax.py - docker exec -w /sglang-checkout/sgl-kernel/tests/speculative ci_sglang python3 -m pytest test_eagle_utils.py - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_apply_token_bitmask_inplace.py - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_activation.py - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_topk.py - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_kvcacheio.py - docker exec -w /sglang-checkout/sgl-kernel/tests/sgl_diffusion ci_sglang python3 -m pytest test_timestep_embedding.py - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_sigmoid.py - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_torch_defaults_reset.py + # In continue-on-error mode (schedule/full runs), keep running all pytest + # files and aggregate the exit code. 
In PR mode, preserve fail-fast. + failures=0 + run_pytest() { + if [[ "$CONTINUE_ON_ERROR" == "true" ]]; then + "$@" || failures=$((failures + 1)) + else + "$@" + fi + } + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_align.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_softmax.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests/speculative ci_sglang python3 -m pytest test_eagle_utils.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_apply_token_bitmask_inplace.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_activation.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_topk.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_kvcacheio.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_sigmoid.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_torch_defaults_reset.py + exit $failures sgl-kernel-unit-test-2-gpu-amd: - needs: [check-changes] + name: ${{ format('sgl-kernel-unit-test-2-gpu-amd (linux-{0}-2gpu-sglang)', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} + needs: [check-changes, call-gate] if: | - always() && + always() && !cancelled() && ( - (inputs.target_stage == 'sgl-kernel-unit-test-2-gpu-amd') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',sgl-kernel-unit-test-2-gpu-amd,')) || ( - !inputs.target_stage && + !(inputs.target_stage || inputs.target_stage_select) && + (needs.call-gate.result == 'success' || needs.call-gate.result == 'skipped') && needs.check-changes.outputs.sgl_kernel == 'true' ) ) - strategy: - fail-fast: false - matrix: - runner: [linux-mi325-gpu-2] - 
runs-on: ${{matrix.runner}} + runs-on: ${{ format('linux-{0}-2gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} steps: - name: Checkout code uses: actions/checkout@v4 @@ -177,7 +253,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -190,29 +266,141 @@ jobs: - name: Run test timeout-minutes: 20 + env: + CONTINUE_ON_ERROR: ${{ needs.check-changes.outputs.continue_on_error }} run: | - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_deterministic_custom_allreduce.py - docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_nccl_allreduce_determinism.py + failures=0 + run_pytest() { + if [[ "$CONTINUE_ON_ERROR" == "true" ]]; then + "$@" || failures=$((failures + 1)) + else + "$@" + fi + } + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_deterministic_custom_allreduce.py + run_pytest docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_nccl_allreduce_determinism.py + exit $failures # =============================================== primary ==================================================== - stage-a-test-1-amd: - needs: [check-changes] + stage-a-test-1-gpu-small-amd: + name: ${{ format('stage-a-test-1-gpu-small-amd (linux-{0}-1gpu-sglang)', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} + needs: [check-changes, call-gate] + if: | + always() && !cancelled() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-a-test-1-gpu-small-amd,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (needs.call-gate.result == 'success' || needs.call-gate.result == 
'skipped') && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + runs-on: ${{ format('linux-{0}-1gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Run test + timeout-minutes: 10 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-a-test-1-gpu-small-amd ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} + + jit-kernel-unit-test-amd: + name: ${{ format('jit-kernel-unit-test-amd (linux-{0}-1gpu-sglang)', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} + needs: [check-changes, call-gate] + if: | + always() && !cancelled() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',jit-kernel-unit-test-amd,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (needs.call-gate.result == 'success' || needs.call-gate.result == 'skipped') && + needs.check-changes.outputs.jit_kernel == 'true' + ) + ) + runs-on: ${{ format('linux-{0}-1gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash 
scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: | + bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Run JIT kernel unit tests + timeout-minutes: 10 + run: | + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout" python3 -m pytest -q python/sglang/jit_kernel/tests/test_store_cache.py + + # =============================================== Wait Jobs for Sequential PR Execution ==================================================== + # These jobs poll GitHub API to wait for previous stages to complete. + # For PR runs: wait jobs run and enforce sequential execution via polling. + # For scheduled runs: wait jobs are skipped, enabling parallel execution of all stages. + + wait-for-stage-a-amd: + needs: [check-changes, call-gate] + if: | + always() && + !cancelled() && + github.event_name == 'pull_request' && + !(inputs.target_stage || inputs.target_stage_select) && + (needs.check-changes.outputs.main_package == 'true' || needs.check-changes.outputs.sgl_kernel == 'true') && + (needs.call-gate.result == 'success' || needs.call-gate.result == 'skipped') + runs-on: ubuntu-latest + outputs: + stage_a_result: ${{ steps.wait.outputs.result }} + steps: + - uses: actions/checkout@v4 + - uses: ./.github/actions/wait-for-jobs + id: wait + with: + stage-name: stage-a-amd + jobs: '[{"prefix": "stage-a-test-1-gpu-small-amd", "expected_count": 1}]' + max-wait-minutes: '240' + + stage-b-test-1-gpu-small-amd: + name: ${{ format('stage-b-test-1-gpu-small-amd (linux-{0}-1gpu-sglang, {1})', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325'), matrix.part) }} + needs: [check-changes, wait-for-stage-a-amd] if: | always() && ( - (inputs.target_stage == 'stage-a-test-1-amd') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-1-gpu-small-amd,')) || ( - !inputs.target_stage && - (!failure() && !cancelled()) && + 
!(inputs.target_stage || inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) strategy: fail-fast: false matrix: - runner: [linux-mi325-gpu-1] - runs-on: ${{matrix.runner}} + part: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] + runs-on: ${{ format('linux-{0}-1gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} steps: - name: Checkout code uses: actions/checkout@v4 @@ -220,7 +408,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -228,31 +416,65 @@ jobs: GITHUB_WORKSPACE: ${{ github.workspace }} - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh + + - name: Run test + timeout-minutes: 45 run: | - bash scripts/ci/amd/amd_ci_install_dependency.sh + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-small-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 14 --timeout-per-file 1800 ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} + + stage-b-test-1-gpu-small-amd-nondeterministic: + name: ${{ format('stage-b-test-1-gpu-small-amd-nondeterministic (linux-{0}-1gpu-sglang)', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} + needs: [check-changes, wait-for-stage-a-amd] + if: | + always() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-1-gpu-small-amd-nondeterministic,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && 
!cancelled())) && + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + ) + ) + runs-on: ${{ format('linux-{0}-1gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + + - name: Ensure VRAM is clear + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm + + - name: Start CI container + run: bash scripts/ci/amd/amd_ci_start_container.sh + env: + GITHUB_WORKSPACE: ${{ github.workspace }} + + - name: Install dependencies + run: bash scripts/ci/amd/amd_ci_install_dependency.sh - name: Run test - timeout-minutes: 10 + timeout-minutes: 45 run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-a-test-1-amd + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-small-amd-nondeterministic --timeout-per-file 1800 ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} - stage-b-test-small-1-gpu-amd: - needs: [check-changes, stage-a-test-1-amd] + stage-b-test-1-gpu-small-amd-mi35x: + needs: [check-changes, wait-for-stage-a-amd] if: | always() && ( - (inputs.target_stage == 'stage-b-test-small-1-gpu-amd') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-1-gpu-small-amd-mi35x,')) || ( - !inputs.target_stage && - (!failure() && !cancelled()) && + !(inputs.target_stage || inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) strategy: fail-fast: false matrix: - runner: [linux-mi325-gpu-1] - part: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] + runner: [linux-mi35x-gpu-1] runs-on: 
${{matrix.runner}} steps: - name: Checkout code @@ -261,7 +483,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -274,25 +496,26 @@ jobs: - name: Run test timeout-minutes: 30 run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-small-1-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 13 --timeout-per-file 1800 + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-small-amd-mi35x ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} - stage-b-test-small-1-gpu-amd-mi35x: - needs: [check-changes, stage-a-test-1-amd] + stage-b-test-1-gpu-large-amd: + name: ${{ format('stage-b-test-1-gpu-large-amd (linux-{0}-1gpu-sglang, {1})', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325'), matrix.part) }} + needs: [check-changes, wait-for-stage-a-amd] if: | always() && ( - (inputs.target_stage == 'stage-b-test-small-1-gpu-amd-mi35x') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-1-gpu-large-amd,')) || ( - !inputs.target_stage && - (!failure() && !cancelled()) && + !(inputs.target_stage || inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) strategy: fail-fast: false matrix: - runner: [linux-mi35x-gpu-1] - runs-on: ${{matrix.runner}} + part: [0, 1, 2] + runs-on: ${{ format('linux-{0}-1gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} steps: - name: Checkout code uses: 
actions/checkout@v4 @@ -300,7 +523,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -311,27 +534,28 @@ jobs: run: bash scripts/ci/amd/amd_ci_install_dependency.sh - name: Run test - timeout-minutes: 30 + timeout-minutes: 45 run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-small-1-gpu-amd-mi35x + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-large-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 2700 ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} - stage-b-test-large-2-gpu-amd: - needs: [check-changes, stage-a-test-1-amd] + stage-b-test-2-gpu-large-amd: + name: ${{ format('stage-b-test-2-gpu-large-amd (linux-{0}-2gpu-sglang, {1})', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325'), matrix.part) }} + needs: [check-changes, wait-for-stage-a-amd] if: | always() && ( - (inputs.target_stage == 'stage-b-test-large-2-gpu-amd') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-2-gpu-large-amd,')) || ( - !inputs.target_stage && - (!failure() && !cancelled()) && + !(inputs.target_stage || inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) strategy: fail-fast: false matrix: - runner: [linux-mi325-gpu-2] - runs-on: ${{matrix.runner}} + part: [0, 1] + runs-on: ${{ format('linux-{0}-2gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} steps: 
- name: Checkout code uses: actions/checkout@v4 @@ -339,7 +563,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -350,20 +574,28 @@ jobs: run: bash scripts/ci/amd/amd_ci_install_dependency.sh - name: Run test - timeout-minutes: 30 + timeout-minutes: 45 run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-2-gpu-amd + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-2-gpu-large-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 2700 ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} multimodal-gen-test-1-gpu-amd: - needs: [check-changes] - if: needs.check-changes.outputs.multimodal_gen == 'true' + name: ${{ format('multimodal-gen-test-1-gpu-amd (linux-{0}-1gpu-sglang, {1})', inputs.runner_arch || 'mi325', matrix.part) }} + needs: [check-changes, call-gate] + if: | + always() && !cancelled() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',multimodal-gen-test-1-gpu-amd,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (needs.call-gate.result == 'success' || needs.call-gate.result == 'skipped') && + needs.check-changes.outputs.multimodal_gen == 'true' + ) + ) strategy: fail-fast: false - max-parallel: 1 # Run one at a time to avoid eviction from resource exhaustion during AITER kernel JIT matrix: - runner: [linux-mi325-gpu-1] - part: [0, 1] # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each - runs-on: ${{matrix.runner}} + part: [0, 1, 2, 3] + runs-on: ${{ format('linux-{0}-1gpu-sglang', inputs.runner_arch || 'mi325') }} steps: - name: Checkout code uses: actions/checkout@v4 
@@ -371,7 +603,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -444,7 +676,7 @@ jobs: docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available" - name: Run diffusion server tests (1-GPU) - timeout-minutes: 45 + timeout-minutes: 90 run: | # AMD CI: All 1-GPU tests except FLUX.2 (FLUX.1 covers same code path) # Tests: T2V, T2I, I2V, LoRA @@ -458,7 +690,9 @@ jobs: -e SGLANG_NON_DENOISE_STAGE_TIME_TOLERANCE=0.6 \ -e SGLANG_DENOISE_STEP_TOLERANCE=0.6 \ -e SGLANG_DENOISE_AGG_TOLERANCE=0.3 \ + -e SGLANG_SKIP_CONSISTENCY=1 \ -e SGLANG_TEST_NUM_INFERENCE_STEPS=5 \ + -e SGLANG_DIFFUSION_ARTIFACT_DIR=/sglang-checkout/diffusion-failures \ -e AITER_JIT_DIR=/sgl-data/aiter-kernels \ -e MIOPEN_USER_DB_PATH=/sgl-data/miopen-cache \ -e HF_HUB_ENABLE_HF_TRANSFER=1 \ @@ -467,23 +701,41 @@ jobs: ci_sglang python3 sglang/multimodal_gen/test/run_suite.py \ --suite 1-gpu \ --partition-id ${{ matrix.part }} \ - --total-partitions 2 \ - -k "not flux_2" + --total-partitions 4 \ + -k "not flux_2" \ + ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} # Post-test diagnostics echo "=== Post-test System Memory Status ===" free -h + - name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-amd-1gpu-${{ matrix.part }}-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 + multimodal-gen-test-2-gpu-amd: - needs: [check-changes] - if: needs.check-changes.outputs.multimodal_gen == 'true' + name: ${{ format('multimodal-gen-test-2-gpu-amd (linux-{0}-2gpu-sglang, {1})', inputs.runner_arch || 'mi325', matrix.part) }} + needs: [check-changes, call-gate] + if: | + always() && 
!cancelled() && + ( + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',multimodal-gen-test-2-gpu-amd,')) || + ( + !(inputs.target_stage || inputs.target_stage_select) && + (needs.call-gate.result == 'success' || needs.call-gate.result == 'skipped') && + needs.check-changes.outputs.multimodal_gen == 'true' + ) + ) strategy: fail-fast: false - max-parallel: 1 # Run one at a time to avoid eviction from resource exhaustion during AITER kernel JIT matrix: - runner: [linux-mi325-gpu-2] - part: [0, 1] # 2 partitions: 9 tests ÷ 2 = ~4-5 tests each - runs-on: ${{matrix.runner}} + part: [0, 1, 2] # 3 partitions: 2 parametrized + 1 standalone (test_disagg_server.py) + runs-on: ${{ format('linux-{0}-2gpu-sglang', inputs.runner_arch || 'mi325') }} steps: - name: Checkout code uses: actions/checkout@v4 @@ -491,7 +743,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -563,7 +815,7 @@ jobs: docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available" - name: Run diffusion server tests (2-GPU) - timeout-minutes: 80 + timeout-minutes: 150 run: | # AMD CI: All 2-GPU tests including LoRA # Tests: T2V, T2I, I2V, LoRA @@ -577,7 +829,9 @@ jobs: -e SGLANG_NON_DENOISE_STAGE_TIME_TOLERANCE=0.6 \ -e SGLANG_DENOISE_STEP_TOLERANCE=0.6 \ -e SGLANG_DENOISE_AGG_TOLERANCE=0.3 \ + -e SGLANG_SKIP_CONSISTENCY=1 \ -e SGLANG_TEST_NUM_INFERENCE_STEPS=5 \ + -e SGLANG_DIFFUSION_ARTIFACT_DIR=/sglang-checkout/diffusion-failures \ -e AITER_JIT_DIR=/sgl-data/aiter-kernels \ -e MIOPEN_USER_DB_PATH=/sgl-data/miopen-cache \ -e HF_HUB_ENABLE_HF_TRANSFER=1 \ @@ -586,79 +840,67 @@ jobs: ci_sglang python3 sglang/multimodal_gen/test/run_suite.py \ --suite 2-gpu \ --partition-id ${{ matrix.part }} \ - --total-partitions 2 + 
--total-partitions 3 \ + ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} # Post-test diagnostics echo "=== Post-test System Memory Status ===" free -h + - name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-amd-2gpu-${{ matrix.part }}-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 - stage-c-test-large-8-gpu-amd: - needs: [check-changes, call-gate, stage-b-test-small-1-gpu-amd, stage-b-test-large-2-gpu-amd] + + wait-for-stage-b-amd: + needs: [check-changes, call-gate, wait-for-stage-a-amd] if: | always() && - ( - (inputs.target_stage == 'stage-c-test-large-8-gpu-amd') || - ( - !inputs.target_stage && - (!failure() && !cancelled()) && - ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) - ) - ) - env: - RUNNER_LABELS: linux-mi325-gpu-8 - strategy: - fail-fast: false - matrix: - runner: [linux-mi325-gpu-8] - part: [0, 1] - runs-on: ${{matrix.runner}} + !cancelled() && + github.event_name == 'pull_request' && + !(inputs.target_stage || inputs.target_stage_select) && + (needs.check-changes.outputs.main_package == 'true' || needs.check-changes.outputs.sgl_kernel == 'true') && + (needs.wait-for-stage-a-amd.result == 'success' || needs.wait-for-stage-a-amd.result == 'skipped') && + (needs.call-gate.result == 'success' || needs.call-gate.result == 'skipped') + runs-on: ubuntu-latest + outputs: + stage_b_result: ${{ steps.wait.outputs.result }} steps: - - name: Checkout code - uses: actions/checkout@v4 + - uses: actions/checkout@v4 + - uses: ./.github/actions/wait-for-jobs + id: wait with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm - - - name: Start CI container - run: bash scripts/ci/amd/amd_ci_start_container.sh - env: - GITHUB_WORKSPACE: ${{ 
github.workspace }} - - - name: Install dependencies - run: bash scripts/ci/amd/amd_ci_install_dependency.sh - - - name: Test RCCL multi-GPU communication - timeout-minutes: 5 - run: | - echo "Testing RCCL multi-GPU communication with debug info..." - docker exec ci_sglang bash -c "cd /sglang-checkout && NCCL_DEBUG=INFO RCCL_DEBUG=INFO torchrun --nproc_per_node=8 scripts/ci/amd/test_rccl_multi_gpu.py" - - - name: Run test - timeout-minutes: 60 - run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 3600 - - stage-c-test-large-8-gpu-amd-mi35x: - needs: [check-changes, call-gate, stage-b-test-small-1-gpu-amd, stage-b-test-large-2-gpu-amd] + stage-name: stage-b-amd + jobs: | + [ + {"prefix": "stage-b-test-1-gpu-small-amd", "expected_count": 14}, + {"prefix": "stage-b-test-2-gpu-large-amd", "expected_count": 2} + ] + max-wait-minutes: '480' + + stage-c-test-4-gpu-amd: + name: ${{ format('stage-c-test-4-gpu-amd (linux-{0}-4gpu-sglang, {1})', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325'), matrix.part) }} + needs: [check-changes, call-gate, wait-for-stage-b-amd] if: | always() && ( - (inputs.target_stage == 'stage-c-test-large-8-gpu-amd-mi35x') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-c-test-4-gpu-amd,')) || ( - !inputs.target_stage && - (!failure() && !cancelled()) && + !(inputs.target_stage || inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) strategy: fail-fast: false matrix: - runner: [linux-mi35x-gpu-8] - part: [0, 1, 2] - runs-on: ${{matrix.runner}} + part: [0] + runs-on: ${{ format('linux-{0}-4gpu-sglang', inputs.runner_arch || (github.event_name == 
'pull_request' && 'mi300' || 'mi325')) }} steps: - name: Checkout code uses: actions/checkout@v4 @@ -666,7 +908,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -677,27 +919,46 @@ jobs: run: bash scripts/ci/amd/amd_ci_install_dependency.sh - name: Run test - timeout-minutes: 60 + timeout-minutes: 90 run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd-mi35x --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 3600 + bash scripts/ci/amd/amd_ci_exec.sh \ + -e NCCL_CUMEM_ENABLE=0 \ + -e NCCL_NVLS_ENABLE=0 \ + -e RCCL_MSCCL_ENABLE=0 \ + -e SGLANG_USE_ROCM700A=1 \ + -w "/sglang-checkout/test" \ + python3 run_suite.py \ + --hw amd \ + --suite stage-c-test-4-gpu-amd \ + --auto-partition-id ${{ matrix.part }} \ + --auto-partition-size 1 \ + --timeout-per-file 5400 \ + --enable-retry \ + --max-attempts 2 \ + --retry-wait-seconds 120 \ + --retry-timeout-increase 0 \ + ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} - stage-b-test-small-1-gpu-performance-amd: - needs: [check-changes, call-gate, stage-a-test-1-amd] + stage-c-test-large-8-gpu-amd: + name: ${{ format('stage-c-test-large-8-gpu-amd (linux-{0}-8gpu-sglang, {1})', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325'), matrix.part) }} + needs: [check-changes, call-gate, wait-for-stage-b-amd] if: | always() && ( - (inputs.target_stage == 'stage-b-test-small-1-gpu-performance-amd') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-c-test-large-8-gpu-amd,')) || ( - !inputs.target_stage && - (!failure() && !cancelled()) && + !(inputs.target_stage || 
inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) + env: + RUNNER_LABELS: ${{ format('linux-{0}-8gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} strategy: fail-fast: false matrix: - runner: [linux-mi325-gpu-1] - runs-on: ${{matrix.runner}} + part: [0, 1, 2] + runs-on: ${{ format('linux-{0}-8gpu-sglang', inputs.runner_arch || (github.event_name == 'pull_request' && 'mi300' || 'mi325')) }} steps: - name: Checkout code uses: actions/checkout@v4 @@ -705,7 +966,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -715,27 +976,33 @@ jobs: - name: Install dependencies run: bash scripts/ci/amd/amd_ci_install_dependency.sh + - name: Test RCCL multi-GPU communication + timeout-minutes: 5 + run: | + echo "Testing RCCL multi-GPU communication with debug info..." 
+ docker exec ci_sglang bash -c "cd /sglang-checkout && NCCL_DEBUG=INFO RCCL_DEBUG=INFO torchrun --nproc_per_node=8 scripts/ci/amd/test_rccl_multi_gpu.py" + - name: Run test - timeout-minutes: 30 + timeout-minutes: 120 run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-small-1-gpu-performance-amd --timeout-per-file 1200 + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 3600 ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} - stage-b-test-large-1-gpu-performance-amd: - needs: [check-changes, call-gate, stage-a-test-1-amd] + stage-c-test-large-8-gpu-amd-mi35x: + needs: [check-changes, call-gate, wait-for-stage-b-amd] if: | always() && ( - (inputs.target_stage == 'stage-b-test-large-1-gpu-performance-amd') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-c-test-large-8-gpu-amd-mi35x,')) || ( - !inputs.target_stage && - (!failure() && !cancelled()) && + !(inputs.target_stage || inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) strategy: fail-fast: false matrix: - runner: [linux-mi325-gpu-1] + runner: [linux-mi35x-gpu-8] part: [0, 1] runs-on: ${{matrix.runner}} steps: @@ -745,7 +1012,7 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - name: Start CI container run: bash scripts/ci/amd/amd_ci_start_container.sh @@ -756,27 +1023,30 @@ jobs: run: bash scripts/ci/amd/amd_ci_install_dependency.sh - name: Run test - timeout-minutes: 30 + timeout-minutes: 
60 run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-1-gpu-performance-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 1200 + bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd-mi35x --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 3600 ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} - stage-b-test-large-2-gpu-performance-amd: - needs: [check-changes, call-gate, stage-a-test-1-amd] + # =============================================== Disaggregation ==================================================== + stage-b-test-large-8-gpu-35x-disaggregation-amd: + needs: [check-changes, wait-for-stage-a-amd] if: | always() && ( - (inputs.target_stage == 'stage-b-test-large-2-gpu-performance-amd') || + (contains(format(',{0},', inputs.target_stage || inputs.target_stage_select), ',stage-b-test-large-8-gpu-35x-disaggregation-amd,')) || ( - !inputs.target_stage && - (!failure() && !cancelled()) && + !(inputs.target_stage || inputs.target_stage_select) && + ((github.event_name == 'schedule') || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) strategy: fail-fast: false matrix: - runner: [linux-mi325-gpu-2] + runner: [linux-mi35x-gpu-8.fabric] + runs-on: ${{matrix.runner}} + steps: - name: Checkout code uses: actions/checkout@v4 @@ -784,98 +1054,90 @@ jobs: ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + run: bash scripts/ci/amd/ensure_vram_clear.sh rocm - - name: Start CI container - run: bash scripts/ci/amd/amd_ci_start_container.sh - env: - GITHUB_WORKSPACE: ${{ github.workspace }} - - - name: Install dependencies - run: bash 
scripts/ci/amd/amd_ci_install_dependency.sh - - - name: Run test - timeout-minutes: 30 + - name: Check Host RDMA Environment + id: rdma_detect run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-2-gpu-performance-amd --timeout-per-file 1200 + set +e + echo "=== Checking Host RDMA Environment ===" - stage-b-test-small-1-gpu-accuracy-amd: - needs: [check-changes, stage-a-test-1-amd] - if: | - always() && - ( - (inputs.target_stage == 'stage-b-test-small-1-gpu-accuracy-amd') || - ( - !inputs.target_stage && - (!failure() && !cancelled()) && - ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) - ) - ) - strategy: - fail-fast: false - matrix: - runner: [linux-mi325-gpu-1] - runs-on: ${{matrix.runner}} - steps: - - name: Checkout code - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + echo "" + echo "=== 1. Ionic driver library check ===" + ls -l /usr/lib/x86_64-linux-gnu/libibverbs/libionic* 2>/dev/null || echo "libionic not found in standard path" - - name: Start CI container - run: bash scripts/ci/amd/amd_ci_start_container.sh - env: - GITHUB_WORKSPACE: ${{ github.workspace }} + echo "" + echo "=== 2. Infiniband devices ===" + ls -la /dev/infiniband/ 2>/dev/null || echo "/dev/infiniband not found" + ls -la /sys/class/infiniband/ 2>/dev/null || echo "/sys/class/infiniband not found" - - name: Install dependencies - run: bash scripts/ci/amd/amd_ci_install_dependency.sh + echo "" + echo "=== 3. 
ibv_devinfo ===" + which ibv_devinfo 2>/dev/null && ibv_devinfo 2>&1 || echo "ibv_devinfo not available" - - name: Run test - timeout-minutes: 30 - run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" -e SGLANG_USE_AITER=0 python3 run_suite.py --hw amd --suite stage-b-test-small-1-gpu-accuracy-amd --timeout-per-file 1800 + echo "" + echo "=== 4. Kernel modules ===" + lsmod 2>/dev/null | grep -E "ib_|rdma|ionic" || echo "No RDMA kernel modules loaded" - stage-b-test-large-2-gpu-accuracy-amd: - needs: [check-changes, stage-a-test-1-amd] - if: | - always() && - ( - (inputs.target_stage == 'stage-b-test-large-2-gpu-accuracy-amd') || - ( - !inputs.target_stage && - (!failure() && !cancelled()) && - ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) - ) - ) - strategy: - fail-fast: false - matrix: - runner: [linux-mi325-gpu-2] - runs-on: ${{matrix.runner}} - steps: - - name: Checkout code - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + echo "" + echo "=== 5. 
Detect RDMA Devices for test environment ===" + if [ -d "/sys/class/infiniband" ]; then + RDMA_DEVS=$(ls /sys/class/infiniband | paste -sd "," -) + echo "Detected RDMA Devices: $RDMA_DEVS" + echo "SGLANG_TEST_RDMA_DEVICE=$RDMA_DEVS" >> $GITHUB_ENV + else + echo "No RDMA devices found in /sys/class/infiniband" + echo "SGLANG_TEST_RDMA_DEVICE=" >> $GITHUB_ENV + fi - - name: Ensure VRAM is clear - run: bash scripts/ensure_vram_clear.sh rocm + echo "" + echo "=== Host RDMA Check Complete ===" - - name: Start CI container - run: bash scripts/ci/amd/amd_ci_start_container.sh + - name: Start Special Container + run: bash scripts/ci/amd/amd_ci_start_container_disagg.sh env: GITHUB_WORKSPACE: ${{ github.workspace }} - name: Install dependencies run: bash scripts/ci/amd/amd_ci_install_dependency.sh - - name: Run test - timeout-minutes: 30 + - name: Verify RDMA in Container + run: | + docker exec -u root ci_sglang bash -c ' + echo "=== Container RDMA Verification ===" + echo "Device nodes:" + ls -la /dev/infiniband/ + echo "" + echo "Provider libraries:" + ls /usr/lib/x86_64-linux-gnu/libibverbs/ | grep -E "ionic|mlx" || echo "No Ionic/Mellanox providers" + echo "" + echo "HCA devices:" + HCA_COUNT=$(ibv_devinfo -list 2>&1 | grep -oE "^[0-9]+ HCAs? found" | grep -oE "^[0-9]+" || echo "0") + ibv_devinfo -list + if [ "$HCA_COUNT" -gt 0 ]; then + echo "" + echo "=== SUCCESS: RDMA setup complete. Found $HCA_COUNT HCA(s) ===" + else + echo "" + echo "=== WARNING: No HCAs detected. 
RDMA tests may fail ===" + fi + ' + + - name: Run Aiter Op Test (RMSNorm) + timeout-minutes: 10 + run: | + echo "Running pre-check: test_rmsnorm2d.py" + docker exec \ + -e MAX_JOBS=192 \ + ci_sglang \ + python /sgl-workspace/aiter/op_tests/test_rmsnorm2d.py + + - name: Run test_disaggregation + timeout-minutes: 60 run: | - bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" -e SGLANG_USE_AITER_AR=0 -e SGLANG_USE_AITER=0 -e HF_HUB_ENABLE_HF_TRANSFER=0 python3 run_suite.py --hw amd --suite stage-b-test-large-2-gpu-accuracy-amd --timeout-per-file 1800 + bash scripts/ci/amd/amd_ci_exec.sh \ + -e SGLANG_TEST_RDMA_DEVICE="${{ env.SGLANG_TEST_RDMA_DEVICE }}" \ + -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-8-gpu-35x-disaggregation-amd --timeout-per-file 1800 ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} pr-test-amd-finish: needs: @@ -888,15 +1150,17 @@ jobs: multimodal-gen-test-1-gpu-amd, multimodal-gen-test-2-gpu-amd, - stage-a-test-1-amd, - stage-b-test-small-1-gpu-amd, - stage-b-test-small-1-gpu-amd-mi35x, - stage-b-test-large-2-gpu-amd, - stage-b-test-small-1-gpu-performance-amd, - stage-b-test-large-1-gpu-performance-amd, - stage-b-test-large-2-gpu-performance-amd, - stage-b-test-small-1-gpu-accuracy-amd, - stage-b-test-large-2-gpu-accuracy-amd, + wait-for-stage-a-amd, + stage-a-test-1-gpu-small-amd, + jit-kernel-unit-test-amd, + wait-for-stage-b-amd, + stage-b-test-1-gpu-small-amd, + stage-b-test-1-gpu-small-amd-nondeterministic, + stage-b-test-1-gpu-small-amd-mi35x, + stage-b-test-1-gpu-large-amd, + stage-b-test-2-gpu-large-amd, + stage-b-test-large-8-gpu-35x-disaggregation-amd, + stage-c-test-4-gpu-amd, stage-c-test-large-8-gpu-amd, stage-c-test-large-8-gpu-amd-mi35x, ] diff --git a/.github/workflows/pr-test-arm64.yml b/.github/workflows/pr-test-arm64.yml new file mode 100644 index 000000000000..4525ed388465 --- /dev/null +++ b/.github/workflows/pr-test-arm64.yml @@ 
-0,0 +1,118 @@ +name: PR Test (Arm64) + +on: + push: + branches: [ main ] + pull_request: + branches: [ main ] + workflow_dispatch: + workflow_call: + inputs: + ref: + description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.' + required: false + type: string + default: '' + run_all_tests: + description: "Run all tests (for releasing or testing purpose)" + required: false + type: boolean + default: false + +concurrency: + group: pr-test-arm64-${{ inputs.ref || github.ref }} + cancel-in-progress: false + +jobs: + check-changes: + runs-on: ubuntu-latest + outputs: + main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests}} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - name: Determine run mode + id: run-mode + run: | + if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then + echo "run_all_tests=true" >> $GITHUB_OUTPUT + echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})" + else + echo "run_all_tests=false" >> $GITHUB_OUTPUT + echo "Run mode: FILTERED (triggered by ${{ github.event_name }})" + fi + + - name: Detect file changes + id: filter + uses: dorny/paths-filter@v3 + if: steps.run-mode.outputs.run_all_tests != 'true' + with: + filters: | + main_package: + - "python/sglang/!(multimodal_gen)/**/!(*.md)" + - "python/pyproject_cpu.toml" + - "test/**/!(*.md)" + - "sgl-kernel/**/*.!(md|txt)" + - ".github/workflows/pr-test-arm64.yml" + - "docker/arm64.Dockerfile" + + pr-gate: + needs: check-changes + if: needs.check-changes.outputs.main_package == 'true' + uses: ./.github/workflows/pr-gate.yml + secrets: inherit + + build-test: + needs: [check-changes, pr-gate] + if: needs.check-changes.outputs.main_package == 'true' + runs-on: ubuntu-24.04-arm + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + ref: ${{ inputs.ref || github.ref }} + + - name: Build container + run: | + PR_REPO=${{ 
github.event.pull_request.head.repo.clone_url }} + PR_HEAD_REF=${{ github.head_ref }} + + docker build \ + ${PR_REPO:+--build-arg SGLANG_REPO=$PR_REPO} \ + ${PR_HEAD_REF:+--build-arg VER_SGLANG=$PR_HEAD_REF} \ + . -f docker/arm64.Dockerfile -t sglang_arm64 --no-cache + + - name: Run container + run: | + docker run -dt \ + -v ${{ github.workspace }}:/sglang-checkout/ --ipc=host \ + --name ci_sglang_arm64 \ + sglang_arm64 + + - name: Arm sanity check + timeout-minutes: 5 + run: | + docker exec -w /sglang-checkout/ ci_sglang_arm64 \ + bash -c "source /opt/.venv/bin/activate && python3 -c 'import platform; import torch; import sgl_kernel; from sglang.srt.utils.common import is_host_cpu_arm64; assert platform.machine() in (\"aarch64\", \"arm64\"); assert is_host_cpu_arm64(); assert hasattr(torch.ops.sgl_kernel, \"decode_attention_cpu\"); assert hasattr(torch.ops.sgl_kernel, \"initialize\");'" + + - name: Run unit tests + timeout-minutes: 36 + run: | + docker exec -w /sglang-checkout/ ci_sglang_arm64 \ + bash -c "source /opt/.venv/bin/activate && cd ./test/srt && python3 run_suite.py --suite per-commit-cpu-arm64 --timeout-per-file 1500" + + - name: Change permission + timeout-minutes: 2 + run: | + docker exec -u root ci_sglang_arm64 bash -c " + rm -rf /tmp/ci-home && + chown -R $(id -u):$(id -g) /sglang-checkout/ 2>/dev/null || true + " + + - name: Cleanup container + if: always() + run: | + docker rm -f ci_sglang_arm64 || true diff --git a/.github/workflows/pr-test-jit-kernel.yml b/.github/workflows/pr-test-jit-kernel.yml new file mode 100644 index 000000000000..83a4ca0ea847 --- /dev/null +++ b/.github/workflows/pr-test-jit-kernel.yml @@ -0,0 +1,206 @@ +name: PR Test - JIT Kernel + +on: + workflow_call: + inputs: + jit_kernel: + required: true + type: string + sgl_kernel: + required: true + type: string + b200_runner: + required: true + type: string + pr_head_sha: + required: false + type: string + default: '' + git_ref: + required: false + type: string + default: '' + 
target_stage: + required: false + type: string + default: '' + test_parallel_dispatch: + required: false + type: string + default: 'false' + skip_stage_health_check: + required: false + type: boolean + default: false + +# Workflow-level env is NOT inherited from the caller in reusable workflows (verified by CI test). +# The github context (including github.event_name) IS inherited from the caller. +env: + SGLANG_IS_IN_CI: true + SGLANG_CUDA_COREDUMP: "1" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true + PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }} + SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == true && 'true' || 'false' }} + +jobs: + jit-kernel-unit-test: + if: | + github.event_name != 'schedule' && + inputs.test_parallel_dispatch != 'true' && + !inputs.target_stage + runs-on: 1-gpu-h100 + timeout-minutes: 240 + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Cleanup + if: inputs.sgl_kernel == 'true' + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda13.0 + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{ inputs.sgl_kernel }} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run test + timeout-minutes: 30 + run: | + cd test/ + python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-large + + jit-kernel-multigpu-unit-test: + if: | + github.event_name != 'schedule' && + inputs.test_parallel_dispatch != 'true' && + !inputs.target_stage + runs-on: 8-gpu-h200 + timeout-minutes: 240 + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || 
inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-maintenance + + - name: Cleanup + if: inputs.sgl_kernel == 'true' + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda13.0 + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{ inputs.sgl_kernel }} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run multi-GPU test + timeout-minutes: 45 + run: | + cd test/ + python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-8-gpu-h200 + + jit-kernel-benchmark-test: + if: | + github.event_name != 'schedule' && + inputs.test_parallel_dispatch != 'true' && + !inputs.target_stage + runs-on: 1-gpu-h100 + timeout-minutes: 240 + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Cleanup + if: inputs.sgl_kernel == 'true' + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda13.0 + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{ inputs.sgl_kernel }} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run benchmark tests + timeout-minutes: 45 + run: | + cd test/ + python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large + + jit-kernel-b200-test: + if: | + github.event_name != 'schedule' && + inputs.test_parallel_dispatch != 'true' && + !inputs.target_stage + runs-on: ${{ inputs.b200_runner }} + timeout-minutes: 240 + steps: + - uses: 
actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Cleanup + if: inputs.sgl_kernel == 'true' + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda13.0 + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{ inputs.sgl_kernel }} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run B200 diffusion test + timeout-minutes: 30 + run: | + cd test/ + python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-b200 diff --git a/.github/workflows/pr-test-multimodal-gen.yml b/.github/workflows/pr-test-multimodal-gen.yml new file mode 100644 index 000000000000..0a94eb985c1d --- /dev/null +++ b/.github/workflows/pr-test-multimodal-gen.yml @@ -0,0 +1,424 @@ +name: PR Test - Multimodal Gen + +on: + workflow_call: + inputs: + multimodal_gen: + required: true + type: string + sgl_kernel: + required: true + type: string + b200_runner: + required: true + type: string + continue_on_error: + required: false + type: string + default: 'false' + pr_head_sha: + required: false + type: string + default: '' + git_ref: + required: false + type: string + default: '' + target_stage: + required: false + type: string + default: '' + test_parallel_dispatch: + required: false + type: string + default: 'false' + caller_needs_failure: + required: false + type: string + default: 'false' + skip_stage_health_check: + required: false + type: string + default: 'false' + +# Workflow-level env is NOT inherited from the caller in reusable workflows. +# The github context (including github.event_name) IS inherited from the caller. 
+env: + SGLANG_IS_IN_CI: true + SGLANG_CUDA_COREDUMP: "1" + PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }} + SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == 'true' }} + +jobs: + compute-diffusion-partitions: + if: | + (inputs.target_stage == 'multimodal-gen-test-1-gpu') || + (inputs.target_stage == 'multimodal-gen-test-2-gpu') || + ( + !inputs.target_stage && + inputs.multimodal_gen == 'true' + ) + runs-on: ubuntu-latest + outputs: + matrix-1gpu: ${{ steps.compute.outputs.matrix-1gpu }} + matrix-2gpu: ${{ steps.compute.outputs.matrix-2gpu }} + partition-count-1gpu: ${{ steps.compute.outputs['partition-count-1gpu'] }} + partition-count-2gpu: ${{ steps.compute.outputs['partition-count-2gpu'] }} + plan-1gpu: ${{ steps.compute.outputs.plan-1gpu }} + plan-2gpu: ${{ steps.compute.outputs.plan-2gpu }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.10' + + - name: Compute partitions + id: compute + run: | + python scripts/ci/utils/diffusion/compute_diffusion_partitions.py --min-time 1200 --target-time 1800 --max-time 2400 --max-partitions 10 + + multimodal-gen-test-1-gpu: + needs: compute-diffusion-partitions + if: | + always() && + needs.compute-diffusion-partitions.result == 'success' && + needs.compute-diffusion-partitions.outputs.matrix-1gpu != '{"include":[]}' && + ( + (inputs.target_stage == 'multimodal-gen-test-1-gpu') || + ( + !inputs.target_stage && + ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) && + inputs.multimodal_gen == 'true' + ) + ) + runs-on: 1-gpu-h100 + timeout-minutes: 240 + strategy: + fail-fast: false + matrix: ${{ fromJson(needs.compute-diffusion-partitions.outputs.matrix-1gpu) }} + steps: + - name: Checkout code + uses: 
actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + - name: Run diffusion server tests + timeout-minutes: 240 + env: + RUNAI_STREAMER_MEMORY_LIMIT: 0 + CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }} + PARTITION_PLAN_JSON: ${{ needs.compute-diffusion-partitions.outputs.plan-1gpu }} + SGLANG_DIFFUSION_ARTIFACT_DIR: ${{ github.workspace }}/diffusion-failures + run: | + cd python + python3 sglang/multimodal_gen/test/run_suite.py \ + --suite 1-gpu \ + --partition-id ${{ matrix.part }} \ + --total-partitions ${{ needs.compute-diffusion-partitions.outputs['partition-count-1gpu'] }} \ + --partition-plan-json "$PARTITION_PLAN_JSON" \ + $CONTINUE_ON_ERROR_FLAG + + - name: Upload execution report + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-report-1gpu-${{ matrix.part }} + path: python/sglang/multimodal_gen/test/execution_report_*.json + retention-days: 1 + + - name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-1gpu-${{ matrix.part }}-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.part }} + + multimodal-gen-test-2-gpu: + needs: compute-diffusion-partitions + if: | + always() && + needs.compute-diffusion-partitions.result == 'success' && + 
needs.compute-diffusion-partitions.outputs.matrix-2gpu != '{"include":[]}' && + ( + (inputs.target_stage == 'multimodal-gen-test-2-gpu') || + ( + !inputs.target_stage && + ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) && + inputs.multimodal_gen == 'true' + ) + ) + runs-on: 2-gpu-h100 + timeout-minutes: 240 + strategy: + fail-fast: false + matrix: ${{ fromJson(needs.compute-diffusion-partitions.outputs.matrix-2gpu) }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run diffusion server tests + timeout-minutes: 240 + env: + RUNAI_STREAMER_MEMORY_LIMIT: 0 + CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }} + PARTITION_PLAN_JSON: ${{ needs.compute-diffusion-partitions.outputs.plan-2gpu }} + SGLANG_DIFFUSION_ARTIFACT_DIR: ${{ github.workspace }}/diffusion-failures + run: | + cd python + python3 sglang/multimodal_gen/test/run_suite.py \ + --suite 2-gpu \ + --partition-id ${{ matrix.part }} \ + --total-partitions ${{ needs.compute-diffusion-partitions.outputs['partition-count-2gpu'] }} \ + --partition-plan-json "$PARTITION_PLAN_JSON" \ + $CONTINUE_ON_ERROR_FLAG + + - name: Upload execution report + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-report-2gpu-${{ matrix.part }} + path: python/sglang/multimodal_gen/test/execution_report_*.json + retention-days: 1 + + 
- name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-2gpu-${{ matrix.part }}-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.part }} + + multimodal-gen-component-accuracy: + if: | + ( + inputs.target_stage == 'multimodal-gen-component-accuracy' || + inputs.target_stage == 'multimodal-gen-component-accuracy-1-gpu' || + inputs.target_stage == 'multimodal-gen-component-accuracy-2-gpu' + ) || + ( + !inputs.target_stage && + ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) && + inputs.multimodal_gen == 'true' + ) + runs-on: 2-gpu-h100 + timeout-minutes: 240 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run diffusion component accuracy tests + timeout-minutes: 240 + env: + RUNAI_STREAMER_MEMORY_LIMIT: 0 + CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }} + run: | + cd python + python3 sglang/multimodal_gen/test/run_suite.py \ + --suite component-accuracy \ + $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: always() + with: + artifact-suffix: component-accuracy + + multimodal-gen-test-1-b200: + if: | + (inputs.target_stage == 
'multimodal-gen-test-1-b200') || + ( + !inputs.target_stage && + ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) && + inputs.multimodal_gen == 'true' + ) + runs-on: ${{ inputs.b200_runner }} + timeout-minutes: 240 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run diffusion server tests + timeout-minutes: 240 + env: + RUNAI_STREAMER_MEMORY_LIMIT: 0 + CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }} + SGLANG_DIFFUSION_ARTIFACT_DIR: ${{ github.workspace }}/diffusion-failures + run: | + cd python + python3 sglang/multimodal_gen/test/run_suite.py \ + --suite 1-gpu-b200 \ + $CONTINUE_ON_ERROR_FLAG + + - name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-${{ github.job }}-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + multimodal-gen-unit-test: + if: | + (inputs.target_stage == 'multimodal-gen-unit-test') || + ( + !inputs.target_stage && + ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) && + inputs.multimodal_gen == 'true' + ) + runs-on: 1-gpu-h100 + timeout-minutes: 120 + steps: + - name: Checkout code + uses: 
actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Download artifacts + if: inputs.sgl_kernel == 'true' + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run diffusion unit tests + timeout-minutes: 60 + run: | + cd python + python3 sglang/multimodal_gen/test/run_suite.py --suite unit + + diffusion-coverage-check: + needs: [multimodal-gen-test-1-gpu, multimodal-gen-test-2-gpu] + if: | + always() && + inputs.multimodal_gen == 'true' && + ( + needs.multimodal-gen-test-1-gpu.result == 'success' || + needs.multimodal-gen-test-1-gpu.result == 'failure' || + needs.multimodal-gen-test-2-gpu.result == 'success' || + needs.multimodal-gen-test-2-gpu.result == 'failure' + ) + runs-on: ubuntu-latest + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.10' + + - name: Download all execution reports + uses: actions/download-artifact@v4 + with: + path: reports/ + pattern: diffusion-report-* + merge-multiple: true + + - name: Verify coverage + run: | + python scripts/ci/utils/diffusion/verify_diffusion_coverage.py --reports-dir reports/ diff --git a/.github/workflows/pr-test-npu.yml b/.github/workflows/pr-test-npu.yml index de7833571203..7bf719acd51b 100644 --- a/.github/workflows/pr-test-npu.yml +++ b/.github/workflows/pr-test-npu.yml @@ -28,8 +28,9 @@ jobs: check-changes: runs-on: ubuntu-latest outputs: - main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }} - 
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }} + changes_exist: ${{ steps.filter.outputs.main_package == 'true' || steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true'}} + main_package: ${{ steps.filter.outputs.main_package == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }} + multimodal_gen: ${{ steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }} steps: - name: Checkout code uses: actions/checkout@v4 @@ -56,50 +57,66 @@ jobs: with: filters: | main_package: - - "python/sglang/!(multimodal_gen)/**" + - "python/sglang/!(multimodal_gen)/**/!(*.md)" - "python/pyproject_npu.toml" - "scripts/ci/npu/npu_ci_install_dependency.sh" - - "test/srt/ascend/**" + - "test/registered/ascend/**" - ".github/workflows/pr-test-npu.yml" multimodal_gen: - - "python/sglang/multimodal_gen/**" + - "python/sglang/multimodal_gen/**/!(*.md|*.ipynb)" + - "python/sglang/jit_kernel/diffusion/triton/npu_fallback.py" + - "python/sglang/srt/**" - "python/pyproject_npu.toml" - - "scripts/ci/npu_ci_install_dependency.sh" + - "scripts/ci/npu/npu_ci_install_dependency.sh" - ".github/workflows/pr-test-npu.yml" # ==================== PR Gate ==================== # pr-gate: needs: check-changes - if: needs.check-changes.outputs.main_package == 'true' + if: needs.check-changes.outputs.changes_exist == 'true' uses: ./.github/workflows/pr-gate.yml secrets: inherit - per-commit-1-npu-a2: + stage-b-test-1-npu-a2: needs: [check-changes, pr-gate] if: needs.check-changes.outputs.main_package == 'true' runs-on: linux-aarch64-a2-1 + strategy: + fail-fast: false + matrix: + part: [ 0, 1 ] container: - image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11 + image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-910b-ubuntu22.04-py3.11 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: 
${{ inputs.ref || github.ref }} + - name: Mark repository safe + run: | + git config --system --add safe.directory ${GITHUB_WORKSPACE} + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + RUSTUP_DIST_SERVER: "https://rsproxy.cn" + RUSTUP_UPDATE_ROOT: "https://rsproxy.cn/rustup" run: | # speed up by using infra cache services CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list pip config set global.index-url http://${CACHING_URL}/pypi/simple - pip config set global.extra-index-url "https://pypi.tuna.tsinghua.edu.cn/simple" - pip config set global.trusted-host "${CACHING_URL} pypi.tuna.tsinghua.edu.cn" + pip config set global.trusted-host "${CACHING_URL}" bash scripts/ci/npu/npu_ci_install_dependency.sh 910b # copy required file from our daily cache cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp - # copy download through proxy - curl -o /tmp/test.jsonl -L https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp - name: Run test timeout-minutes: 60 @@ -111,40 +128,49 @@ jobs: PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" STREAMS_PER_DEVICE: 32 run: | - export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" - cd test/srt - python3 run_suite.py --suite per-commit-1-npu-a2 + cd test + python3 run_suite.py --hw npu --suite stage-b-test-1-npu-a2 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 - per-commit-2-npu-a2: + stage-b-test-2-npu-a2: needs: 
[check-changes, pr-gate] if: needs.check-changes.outputs.main_package == 'true' runs-on: linux-aarch64-a2-2 strategy: fail-fast: true matrix: - part: [0, 1, 2] + part: [0, 1] container: - image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11 + image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-910b-ubuntu22.04-py3.11 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - name: Mark repository safe + run: | + git config --system --add safe.directory ${GITHUB_WORKSPACE} + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + RUSTUP_DIST_SERVER: "https://rsproxy.cn" + RUSTUP_UPDATE_ROOT: "https://rsproxy.cn/rustup" run: | # speed up by using infra cache services CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list pip config set global.index-url http://${CACHING_URL}/pypi/simple - pip config set global.extra-index-url "https://pypi.tuna.tsinghua.edu.cn/simple" - pip config set global.trusted-host "${CACHING_URL} pypi.tuna.tsinghua.edu.cn" + pip config set global.trusted-host "${CACHING_URL}" bash scripts/ci/npu/npu_ci_install_dependency.sh 910b # copy required file from our daily cache cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp - # copy download through proxy - curl -o /tmp/test.jsonl -L https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp - name: Run test 
timeout-minutes: 60 @@ -156,36 +182,45 @@ jobs: PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" STREAMS_PER_DEVICE: 32 run: | - export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" - cd test/srt - python3 run_suite.py --suite per-commit-2-npu-a2 --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 + cd test + python3 run_suite.py --hw npu --suite stage-b-test-2-npu-a2 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 - per-commit-4-npu-a2: + stage-b-test-4-npu-a3: needs: [check-changes, pr-gate] if: needs.check-changes.outputs.main_package == 'true' - runs-on: linux-aarch64-a2-4 + runs-on: linux-aarch64-a3-4 container: - image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11 + image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - name: Mark repository safe + run: | + git config --system --add safe.directory ${GITHUB_WORKSPACE} + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + RUSTUP_DIST_SERVER: "https://rsproxy.cn" + RUSTUP_UPDATE_ROOT: "https://rsproxy.cn/rustup" run: | # speed up by using infra cache services CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list pip config set global.index-url http://${CACHING_URL}/pypi/simple - pip config set global.extra-index-url "https://pypi.tuna.tsinghua.edu.cn/simple" - pip config set global.trusted-host "${CACHING_URL} pypi.tuna.tsinghua.edu.cn" + pip config set global.trusted-host "${CACHING_URL}" 
- bash scripts/ci/npu/npu_ci_install_dependency.sh 910b + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 # copy required file from our daily cache cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp - # copy download through proxy - curl -o /tmp/test.jsonl -L https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp - name: Run test timeout-minutes: 60 @@ -197,40 +232,46 @@ jobs: PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" STREAMS_PER_DEVICE: 32 run: | - export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" - cd test/srt - python3 run_suite.py --suite per-commit-4-npu-a2 --timeout-per-file 3600 + cd test + python3 run_suite.py --hw npu --suite stage-b-test-4-npu-a3 --timeout-per-file 3600 + - per-commit-16-npu-a3: + stage-b-test-16-npu-a3: needs: [check-changes, pr-gate] if: needs.check-changes.outputs.main_package == 'true' runs-on: linux-aarch64-a3-16 - strategy: - fail-fast: true - matrix: - part: [0, 1] container: - image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11 + image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11 steps: - name: Checkout code uses: actions/checkout@v4 with: ref: ${{ inputs.ref || github.ref }} + - name: Mark repository safe + run: | + git config --system --add safe.directory ${GITHUB_WORKSPACE} + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + RUSTUP_DIST_SERVER: "https://rsproxy.cn" + RUSTUP_UPDATE_ROOT: "https://rsproxy.cn/rustup" 
run: | # speed up by using infra cache services CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list pip config set global.index-url http://${CACHING_URL}/pypi/simple - pip config set global.extra-index-url "https://pypi.tuna.tsinghua.edu.cn/simple" - pip config set global.trusted-host "${CACHING_URL} pypi.tuna.tsinghua.edu.cn" + pip config set global.trusted-host "${CACHING_URL}" bash scripts/ci/npu/npu_ci_install_dependency.sh a3 # copy required file from our daily cache cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp - # copy download through proxy - curl -o /tmp/test.jsonl -L https://gh-proxy.test.osinfra.cn/https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp - name: Run test timeout-minutes: 60 @@ -242,6 +283,223 @@ jobs: PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" STREAMS_PER_DEVICE: 32 run: | - export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" - cd test/srt - python3 run_suite.py --suite per-commit-16-npu-a3 --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 + cd test + python3 run_suite.py --hw npu --suite stage-b-test-16-npu-a3 --timeout-per-file 3600 + + multimodal-gen-test-1-npu-a3: + needs: [check-changes, pr-gate] + if: needs.check-changes.outputs.multimodal_gen == 'true' + runs-on: linux-aarch64-a3-2 + container: + image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11 + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Mark repository safe + run: | + git config --system --add safe.directory ${GITHUB_WORKSPACE} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + 
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + RUSTUP_DIST_SERVER: "https://rsproxy.cn" + RUSTUP_UPDATE_ROOT: "https://rsproxy.cn/rustup" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Run test + timeout-minutes: 60 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + SGLANG_DIFFUSION_ARTIFACT_DIR: ${{ github.workspace }}/diffusion-failures + run: | + export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" + cd python + python3 sglang/multimodal_gen/test/run_suite_npu.py --suite 1-npu + + - name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-npu-1-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 + + multimodal-gen-test-2-npu-a3: + needs: [check-changes, pr-gate] + if: needs.check-changes.outputs.multimodal_gen == 'true' + runs-on: linux-aarch64-a3-16 + container: + image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11 + steps: + - name: Checkout 
code + uses: actions/checkout@v4 + + - name: Mark repository safe + run: | + git config --system --add safe.directory ${GITHUB_WORKSPACE} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + RUSTUP_DIST_SERVER: "https://rsproxy.cn" + RUSTUP_UPDATE_ROOT: "https://rsproxy.cn/rustup" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Run test + timeout-minutes: 60 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + SGLANG_DIFFUSION_ARTIFACT_DIR: ${{ github.workspace }}/diffusion-failures + run: | + export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}" + cd python + python3 sglang/multimodal_gen/test/run_suite_npu.py --suite 2-npu + + - name: Upload diffusion failure artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: diffusion-failures-npu-2-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 + + multimodal-gen-test-8-npu-a3: + 
needs: [check-changes, pr-gate] + if: needs.check-changes.outputs.multimodal_gen == 'true' + runs-on: linux-aarch64-a3-8 + container: + image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11 + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Mark repository safe + run: | + git config --system --add safe.directory ${GITHUB_WORKSPACE} + + - name: Install dependencies + env: + TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu" + PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + UV_INDEX_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple" + GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/" + RUSTUP_DIST_SERVER: "https://rsproxy.cn" + RUSTUP_UPDATE_ROOT: "https://rsproxy.cn/rustup" + run: | + # speed up by using infra cache services + CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local" + sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list + pip config set global.index-url http://${CACHING_URL}/pypi/simple + pip config set global.trusted-host "${CACHING_URL}" + + bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion + # copy required file from our daily cache + cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp + # copy gsm8k dataset + cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp + + - name: Run test + timeout-minutes: 60 + env: + SGLANG_USE_MODELSCOPE: true + SGLANG_IS_IN_CI: true + HF_ENDPOINT: https://hf-mirror.com + TORCH_EXTENSIONS_DIR: /tmp/torch_extensions + PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True" + STREAMS_PER_DEVICE: 32 + SGLANG_DIFFUSION_ARTIFACT_DIR: ${{ github.workspace }}/diffusion-failures + run: | + cd python + python3 sglang/multimodal_gen/test/run_suite_npu.py --suite 8-npu + + - name: Upload diffusion failure artifacts + if: always() + uses: 
actions/upload-artifact@v4 + with: + name: diffusion-failures-npu-8-${{ github.run_attempt }} + path: diffusion-failures/ + if-no-files-found: ignore + retention-days: 7 + + pr-test-npu-finish: + needs: + [ + check-changes, + + stage-b-test-1-npu-a2, + stage-b-test-2-npu-a2, + stage-b-test-4-npu-a3, + stage-b-test-16-npu-a3, + + multimodal-gen-test-1-npu-a3, + multimodal-gen-test-2-npu-a3, + multimodal-gen-test-8-npu-a3, + ] + if: always() + runs-on: ubuntu-latest + steps: + - name: Check all dependent job statuses + run: | + # Convert the 'needs' context to a JSON string + json_needs='${{ toJson(needs) }}' + + # Get a list of all job names from the JSON keys + job_names=$(echo "$json_needs" | jq -r 'keys_unsorted[]') + + for job in $job_names; do + # For each job, extract its result + result=$(echo "$json_needs" | jq -r --arg j "$job" '.[$j].result') + + # Print the job name and its result + echo "$job: $result" + + # Check for failure or cancellation and exit if found + if [[ "$result" == "failure" || "$result" == "cancelled" ]]; then + echo "The above jobs failed." 
+ exit 1 + fi + done + # If the loop completes, all jobs were successful + echo "All jobs completed successfully" + exit 0 diff --git a/.github/workflows/pr-test-rust.yml b/.github/workflows/pr-test-rust.yml index 4202f6c2e0ec..b2be1d29bcb6 100644 --- a/.github/workflows/pr-test-rust.yml +++ b/.github/workflows/pr-test-rust.yml @@ -5,11 +5,17 @@ on: branches: [ main ] paths: - "sgl-model-gateway/**" + - ".github/workflows/pr-test-rust.yml" + - "scripts/ci/cuda/ci_install_dependency.sh" + - "scripts/ci/cuda/ci_install_gateway_dependencies.sh" pull_request: branches: [ main ] types: [opened, synchronize, reopened, labeled] paths: - "sgl-model-gateway/**" + - ".github/workflows/pr-test-rust.yml" + - "scripts/ci/cuda/ci_install_dependency.sh" + - "scripts/ci/cuda/ci_install_gateway_dependencies.sh" workflow_dispatch: concurrency: @@ -17,8 +23,7 @@ concurrency: cancel-in-progress: true env: - RUSTC_WRAPPER: sccache - SCCACHE_GHA_ENABLED: "true" + SGLANG_IS_IN_CI: true jobs: build-wheel: @@ -26,7 +31,18 @@ jobs: github.event_name != 'pull_request' || (github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) || (github.event.action == 'labeled' && github.event.label.name == 'run-ci') - runs-on: 4-gpu-a10 + # Pin to 22.04 so the wheel auditwheel-tags as manylinux_2_35; the + # self-hosted GPU runners are Ubuntu 22.04 (glibc 2.35) and reject + # manylinux_2_39 wheels produced on ubuntu-latest (Ubuntu 24.04). + runs-on: ubuntu-22.04 + # sccache is only installed on the GitHub-hosted runners that run this + # job and `unit-tests`; setting RUSTC_WRAPPER workflow-wide leaks it to + # gateway-e2e on the self-hosted GPU runners (which don't have sccache), + # so any pip-install that compiles a Rust extension would fail with + # `could not execute process \`sccache rustc\``. 
+ env: + RUSTC_WRAPPER: sccache + SCCACHE_GHA_ENABLED: "true" steps: - name: Checkout code uses: actions/checkout@v4 @@ -114,6 +130,9 @@ jobs: (github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) || (github.event.action == 'labeled' && github.event.label.name == 'run-ci') runs-on: ubuntu-latest + env: + RUSTC_WRAPPER: sccache + SCCACHE_GHA_ENABLED: "true" steps: - name: Checkout code uses: actions/checkout@v4 @@ -179,11 +198,32 @@ jobs: github.event_name != 'pull_request' || (github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) || (github.event.action == 'labeled' && github.event.label.name == 'run-ci') + # The `responses` matrix entry is intentionally omitted. It needs + # `docker run gvenzl/oracle-xe` + `docker run shoofio/brave-search-mcp-sse` + # on the runner host, but the 2-/4-gpu-h100 runners are themselves + # containers without a Docker daemon. Re-enable by adding back: + # - name: responses + # runner: 2-gpu-h100 + # timeout: 45 + # test_dirs: "e2e_test/responses" + # extra_deps: "" + # env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1" + # reruns: "--reruns 2 --reruns-delay 5" + # setup_oracle: true + # setup_brave: true + # parallel_opts: "" + # plus the Oracle Instant Client / `gvenzl/oracle-xe` / + # `shoofio/brave-search-mcp-sse` setup + cleanup steps (see commit + # cf346bb15 for the exact step bodies) once a runner with + # `docker.sock` (or binary-installed deps) is available. strategy: fail-fast: false matrix: include: - name: benchmarks + # 4 GPUs: test_pd_perf.py uses workers(prefill=2, decode=2) and + # test_regular_perf.py uses workers(count=4) — both need tp*workers=4. 
+ runner: 4-gpu-h100 timeout: 32 test_dirs: "e2e_test/benchmarks" extra_deps: "genai-bench==0.0.3" @@ -191,81 +231,56 @@ jobs: reruns: "" upload_benchmarks: true parallel_opts: "" # No parallel for benchmarks (performance measurement) - - name: responses - timeout: 45 - test_dirs: "e2e_test/responses" - extra_deps: "" - env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1" - reruns: "--reruns 2 --reruns-delay 5" - setup_oracle: true - setup_brave: true - parallel_opts: "" # Cloud backend tests not compatible with parallel execution - name: e2e + runner: 2-gpu-h100 timeout: 45 test_dirs: "e2e_test/router e2e_test/embeddings" - extra_deps: "pytest-parallel py" # py is required for pytest-parallel with newer pytest + extra_deps: "" env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1" reruns: "--reruns 2 --reruns-delay 5" - parallel_opts: "--workers 1 --tests-per-worker 4" # Thread-based parallelism + # Run tests serially. pytest-parallel (unmaintained since 2019) + # has buggy fixture-finalize handling under thread dispatch: + # both class- and function-scoped fixture references leaked + # between tests, leaving model_pool instances pinned at + # _ref_count > 0 and deadlocking later tests that needed + # eviction (50+ min hangs). On a 2-GPU runner with 5 distinct + # model:mode combos in router+embeddings, the suite is + # eviction-bound anyway, so the parallel speedup was illusory. + parallel_opts: "" - name: chat-completions + runner: 2-gpu-h100 timeout: 45 test_dirs: "e2e_test/chat_completions" extra_deps: "" env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1" reruns: "--reruns 2 --reruns-delay 5" parallel_opts: "" - runs-on: 4-gpu-a10 + - name: chat-completions-4gpu + runner: 4-gpu-h100 + timeout: 45 + # qwen-30b (tp=4) tests can't fit on the 2-gpu-h100 matrix entries — + # they get skipped there by hooks.py. Run them here so coverage holds. 
+ test_dirs: "e2e_test/chat_completions/test_enable_thinking.py" + extra_deps: "" + env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1" + reruns: "--reruns 2 --reruns-delay 5" + parallel_opts: "" + runs-on: ${{ matrix.runner }} timeout-minutes: ${{ matrix.timeout }} + # Self-hosted GPU runners are scarce; serialize per hardware type so + # 2-gpu-h100 and 4-gpu-h100 each run one job at a time across all + # in-flight PRs. Queue rather than cancel — different refs shouldn't + # interrupt each other. + concurrency: + group: pr-test-rust-${{ matrix.runner }} + cancel-in-progress: false steps: - name: Checkout code uses: actions/checkout@v4 - name: Install SGLang dependencies run: | - sudo --preserve-env=PATH bash scripts/ci/cuda/ci_install_dependency.sh - - - name: Setup Oracle Instant Client - if: matrix.setup_oracle - run: | - sudo apt-get install -y unzip - INSTANT_CLIENT_DIR="/home/ubuntu/instant-client" - INSTANT_CLIENT_ZIP="instantclient-basic-linux.x64-23.9.0.25.07.zip" - - if [ ! -d "$INSTANT_CLIENT_DIR/instantclient_23_9" ]; then - echo "Downloading Oracle Instant Client..." - mkdir -p "$INSTANT_CLIENT_DIR" - cd "$INSTANT_CLIENT_DIR" - wget https://download.oracle.com/otn_software/linux/instantclient/2390000/$INSTANT_CLIENT_ZIP - unzip $INSTANT_CLIENT_ZIP - rm $INSTANT_CLIENT_ZIP - else - echo "Oracle Instant Client already exists, skipping download" - fi - - echo "LD_LIBRARY_PATH=/home/ubuntu/instant-client/instantclient_23_9:\$LD_LIBRARY_PATH" >> $GITHUB_ENV - - - name: Start Oracle Database - if: matrix.setup_oracle - run: | - docker run -d -p 1521:1521 -e ORACLE_PASSWORD=oracle --name oracle-db gvenzl/oracle-xe:21-slim - echo "Starting Oracle DB..." 
- - # Export Oracle connection environment variables - echo "ATP_USER=system" >> $GITHUB_ENV - echo "ATP_PASSWORD=oracle" >> $GITHUB_ENV - echo "ATP_DSN=localhost:1521/XEPDB1" >> $GITHUB_ENV - - - name: Start Brave MCP Server - if: matrix.setup_brave - run: | - docker run -d --rm \ - -p 8001:8080 \ - -e BRAVE_API_KEY \ - --name brave-search-server \ - shoofio/brave-search-mcp-sse:1.0.10 - echo "Starting Brave MCP Server..." - sleep 2 - curl -f --max-time 1 http://localhost:8001/sse > /dev/null 2>&1 && echo "Brave MCP Server is healthy!" || echo "Brave MCP Server responded" + bash scripts/ci/cuda/ci_install_dependency.sh - name: Download wheel artifact uses: actions/download-artifact@v4 @@ -282,14 +297,27 @@ jobs: run: | python3 -m pip install pytest pytest-rerunfailures httpx openai grpcio grpcio-health-checking numpy if [ -n "${{ matrix.extra_deps }}" ]; then - python3 -m pip --no-cache-dir install --upgrade ${{ matrix.extra_deps }} + if echo "${{ matrix.extra_deps }}" | grep -q "genai-bench"; then + # genai-bench's transitive deps (oci/locust) pull + # transformers<5 which requires huggingface_hub<1.0 — that + # downgrades the 1.11.0 ci_install_dependency.sh settled on + # and breaks `kernels` (requires huggingface_hub>=1.3.0,<2.0). + # Install --no-deps and supply the runtime deps explicitly. + python3 -m pip --no-cache-dir install --no-deps ${{ matrix.extra_deps }} + python3 -m pip --no-cache-dir install \ + locust click rich tenacity oci openpyxl gevent matplotlib + else + # Other extras with well-behaved transitive deps — + # normal --upgrade install path. 
+ python3 -m pip --no-cache-dir install --upgrade ${{ matrix.extra_deps }} + fi fi - name: Run E2E tests run: | - bash scripts/killall_sglang.sh "nuk_gpus" + python3 python/sglang/cli/killall.py cd sgl-model-gateway - ${{ matrix.env_vars }} ROUTER_LOCAL_MODEL_PATH="/home/ubuntu/models" pytest ${{ matrix.reruns }} ${{ matrix.parallel_opts }} ${{ matrix.test_dirs }} -s -vv -o log_cli=true --log-cli-level=INFO + ${{ matrix.env_vars }} pytest ${{ matrix.reruns }} ${{ matrix.parallel_opts }} ${{ matrix.test_dirs }} -s -vv -o log_cli=true --log-cli-level=INFO - name: Upload benchmark results if: matrix.upload_benchmarks && success() @@ -298,18 +326,6 @@ jobs: name: genai-bench-results-all-policies path: sgl-model-gateway/benchmark_**/ - - name: Cleanup Brave MCP Server - if: always() && matrix.setup_brave - run: | - docker stop brave-search-server || true - docker rm brave-search-server || true - - - name: Cleanup Oracle Database - if: always() && matrix.setup_oracle - run: | - docker stop oracle-db || true - docker rm oracle-db || true - docker-build-test: if: | github.event_name != 'pull_request' || @@ -333,8 +349,86 @@ jobs: cache-from: type=gha cache-to: type=gha,mode=max + k8s-integration: + # Runs SMG against a kind cluster with fake worker pods to exercise the + # K8s service discovery / reconciliation path. No GPU required (workers + # are python:3.12-slim mocks); the h100 matrix runners are unsuitable + # because they're containers without a Docker daemon. 
+ if: | + github.event_name != 'pull_request' || + (github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) || + (github.event.action == 'labeled' && github.event.label.name == 'run-ci') + runs-on: ubuntu-22.04 + timeout-minutes: 30 + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.12" + + - name: Install kind and kubectl + run: | + curl -fsSLo /tmp/kind https://kind.sigs.k8s.io/dl/v0.24.0/kind-linux-amd64 + chmod +x /tmp/kind && sudo mv /tmp/kind /usr/local/bin/kind + KUBECTL_VERSION=$(curl -fsSL https://dl.k8s.io/release/stable.txt) + curl -fsSLo /tmp/kubectl "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl" + chmod +x /tmp/kubectl && sudo mv /tmp/kubectl /usr/local/bin/kubectl + kind --version + kubectl version --client + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + + - name: Build smg-gateway:test image + uses: docker/build-push-action@v5 + with: + context: sgl-model-gateway + file: sgl-model-gateway/e2e_test/k8s_integration/Dockerfile.gateway + tags: smg-gateway:test + load: true + cache-from: type=gha,scope=k8s-integration + cache-to: type=gha,scope=k8s-integration,mode=max + + - name: Install Python test dependencies + run: | + python3 -m pip install --upgrade pip + python3 -m pip install pytest httpx + + - name: Set up kind cluster and deploy + env: + SKIP_DOCKER_BUILD: "1" + run: | + cd sgl-model-gateway + bash e2e_test/k8s_integration/setup.sh + + - name: Run K8s integration tests + run: | + cd sgl-model-gateway + # confcutdir avoids loading the parent e2e_test/conftest.py, which + # pulls in heavy infra deps (requests, sglang_router) that this job + # intentionally doesn't install. 
+ pytest e2e_test/k8s_integration/ \ + --confcutdir=e2e_test/k8s_integration \ + -v -s -o log_cli=true --log-cli-level=INFO + + - name: Dump cluster state on failure + if: failure() + run: | + kubectl --context kind-smg-test get all -A || true + kubectl --context kind-smg-test -n smg-test describe pods || true + kubectl --context kind-smg-test -n smg-test logs deploy/smg-gateway --tail=200 || true + + - name: Tear down kind cluster + if: always() + run: | + cd sgl-model-gateway + bash e2e_test/k8s_integration/setup.sh teardown || true + finish: - needs: [build-wheel, python-unit-tests, unit-tests, gateway-e2e, docker-build-test] + needs: [build-wheel, python-unit-tests, unit-tests, gateway-e2e, docker-build-test, k8s-integration] runs-on: ubuntu-latest steps: - name: Finish diff --git a/.github/workflows/pr-test-sgl-kernel.yml b/.github/workflows/pr-test-sgl-kernel.yml new file mode 100644 index 000000000000..bb69b89acec9 --- /dev/null +++ b/.github/workflows/pr-test-sgl-kernel.yml @@ -0,0 +1,214 @@ +name: PR Test - SGL Kernel + +on: + workflow_call: + inputs: + sgl_kernel: + required: true + type: string + b200_runner: + required: true + type: string + pr_head_sha: + required: false + type: string + default: '' + git_ref: + required: false + type: string + default: '' + skip_stage_health_check: + required: false + type: boolean + default: false + +# Workflow-level env is NOT inherited from the caller in reusable workflows. +# The github context (including github.event_name) IS inherited from the caller. 
+env: + SGLANG_IS_IN_CI: true + SGLANG_CUDA_COREDUMP: "1" + PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }} + SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == true && 'true' || 'false' }} + +jobs: + sgl-kernel-unit-test: + runs-on: 1-gpu-h100 + timeout-minutes: 240 + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Cleanup + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run test + timeout-minutes: 30 + run: | + cd sgl-kernel + pytest tests/ + + sgl-kernel-mla-test: + runs-on: 1-gpu-h100 + timeout-minutes: 240 + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Cleanup + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh + + - name: Run test + timeout-minutes: 30 + run: | + cd test/registered/mla + python3 test_mla_deepseek_v3.py + + sgl-kernel-benchmark-test: + runs-on: 1-gpu-h100 + timeout-minutes: 240 + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ 
inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Cleanup + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh + + - name: Run benchmark tests + timeout-minutes: 45 + run: | + cd sgl-kernel/benchmark + echo "Running sgl-kernel benchmark tests in CI mode..." + + echo "CI environment variable: $CI" + echo "GITHUB_ACTIONS environment variable: $GITHUB_ACTIONS" + + for bench_file in bench_*.py; do + echo "Testing $bench_file..." + timeout 60 python3 "$bench_file" || echo "Warning: $bench_file timed out or failed, continuing..." + echo "Completed $bench_file" + echo "---" + done + + echo "All benchmark tests completed!" 
+ + sgl-kernel-b200-test: + runs-on: ${{ inputs.b200_runner }} + timeout-minutes: 240 + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance + + - name: Cleanup + run: | + ls -alh sgl-kernel/dist || true + rm -rf sgl-kernel/dist/* || true + + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-python3.10-cuda* + + - name: Install dependencies + timeout-minutes: 20 + run: | + CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + + - name: Run sgl-kernel unit tests on B200 + timeout-minutes: 30 + run: | + cd sgl-kernel + pytest tests/ + + # Adding a single CUDA13 build-and-run check for the kernel + # TODO: Add back this test when it can pass on CI + # cuda13-kernel-build-check: + # if: inputs.sgl_kernel == 'true' + # runs-on: x64-cu13-kernel-tests + # steps: + # - uses: actions/checkout@v4 + + # - name: Cleanup + # run: | + # ls -alh sgl-kernel/dist || true + # rm -rf sgl-kernel/dist/* || true + + # - name: Download CUDA 13.0 artifacts + # uses: actions/download-artifact@v4 + # with: + # path: sgl-kernel/dist/ + # merge-multiple: true + # pattern: wheel-python3.10-cuda* + + # - name: Install dependencies + # run: | + # CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh + + # - name: Run kernel unit tests + # timeout-minutes: 30 + # run: | + # cd sgl-kernel + # pytest tests/ diff --git a/.github/workflows/pr-test-xeon.yml b/.github/workflows/pr-test-xeon.yml index 021a1308593c..0fb4721ba173 100644 --- a/.github/workflows/pr-test-xeon.yml +++ b/.github/workflows/pr-test-xeon.yml @@ -21,7 +21,7 @@ on: concurrency: group: pr-test-xeon-${{ inputs.ref || github.ref }} - cancel-in-progress: false + cancel-in-progress: ${{ github.event_name != 
'workflow_call' }} jobs: # ==================== Check Changes ==================== # @@ -55,10 +55,10 @@ jobs: with: filters: | main_package: - - "python/sglang/!(multimodal_gen)/**" + - "python/sglang/!(multimodal_gen)/**/!(*.md)" - "python/pyproject_cpu.toml" - - "test/**" - - "sgl-kernel/**" + - "test/**/!(*.md)" + - "sgl-kernel/**/!(*.md|THIRDPARTYNOTICES.txt|LICENSE)" - ".github/workflows/pr-test-xeon.yml" - "docker/xeon.Dockerfile" diff --git a/.github/workflows/pr-test-xpu.yml b/.github/workflows/pr-test-xpu.yml index 38b89762a75a..aa0554c72532 100644 --- a/.github/workflows/pr-test-xpu.yml +++ b/.github/workflows/pr-test-xpu.yml @@ -54,10 +54,10 @@ jobs: with: filters: | main_package: - - "python/sglang/!(multimodal_gen)/**" + - "python/sglang/!(multimodal_gen)/**/!(*.md)" - "python/pyproject_xpu.toml" - - "test/**" - - "sgl-kernel/**" + - "test/**/!(*.md)" + - "sgl-kernel/**/!(*.md|THIRDPARTYNOTICES.txt|LICENSE)" - ".github/workflows/pr-test-xpu.yml" - "docker/xpu.Dockerfile" @@ -72,8 +72,6 @@ jobs: needs: [check-changes, pr-gate] if: needs.check-changes.outputs.main_package == 'true' runs-on: intel-bmg - env: - HF_HOME: /home/sdp/.cache/huggingface steps: - name: Checkout code uses: actions/checkout@v4 @@ -81,9 +79,6 @@ jobs: fetch-depth: 0 ref: ${{ inputs.ref || github.ref }} - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - name: Build Docker image run: | PR_REPO=${{ github.event.pull_request.head.repo.clone_url }} @@ -99,8 +94,10 @@ jobs: container_id=$(docker run -dt \ --group-add 992 \ --group-add $(getent group video | cut -d: -f3) \ - -v ${HF_HOME}:/root/.cache/huggingface \ + --group-add $(getent group render | cut -d: -f3) \ + -v $HOME/.cache/huggingface:/root/.cache/huggingface \ --device /dev/dri \ + -v /dev/dri/by-path:/dev/dri/by-path \ -e HF_TOKEN="$(cat ~/huggingface_token.txt)" \ xpu_sglang_main:bmg) echo "Started container: $container_id" @@ -110,19 +107,17 @@ jobs: timeout-minutes: 20 run: | cid="${{ 
steps.start_container.outputs.container_id }}" - docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip install --upgrade pip - docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip install pytest expecttest ray huggingface_hub - docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip uninstall -y flashinfer-python - docker exec "$cid" /bin/bash -c '/home/sdp/miniforge3/envs/py3.10/bin/hf auth login --token ${HF_TOKEN} ' - docker exec -u root "$cid" /bin/bash -c "ln -sf /home/sdp/miniforge3/envs/py3.10/bin/python3 /usr/bin/python3" + docker exec "$cid" /home/sdp/miniforge3/envs/py3.12/bin/python3 -m pip install --upgrade pip + docker exec "$cid" /home/sdp/miniforge3/envs/py3.12/bin/python3 -m pip install pytest expecttest ray huggingface_hub + docker exec "$cid" /home/sdp/miniforge3/envs/py3.12/bin/python3 -m pip uninstall -y flashinfer-python + docker exec "$cid" /bin/bash -c '/home/sdp/miniforge3/envs/py3.12/bin/hf auth login --token ${HF_TOKEN} ' + - name: Run E2E Bfloat16 tests timeout-minutes: 20 run: | cid="${{ steps.start_container.outputs.container_id }}" - docker exec -w /home/sdp/sglang/ "$cid" \ - bash -c "LD_LIBRARY_PATH=/home/sdp/miniforge3/envs/py3.10/lib:$LD_LIBRARY_PATH && cd ./test/srt && python3 run_suite.py --suite per-commit-xpu" - + docker exec "$cid" bash -c "source /home/sdp/miniforge3/bin/activate && conda activate py3.12 && cd /home/sdp/sglang/test/srt && python3 run_suite.py --suite per-commit-xpu" - name: Cleanup container if: always() run: | diff --git a/.github/workflows/pr-test.yml b/.github/workflows/pr-test.yml index ef81c713ec65..9bcf60cd26bf 100644 --- a/.github/workflows/pr-test.yml +++ b/.github/workflows/pr-test.yml @@ -5,19 +5,11 @@ run-name: ${{ inputs.target_stage && (inputs.pr_head_sha && format('[{0}] {1}', on: schedule: - - cron: '0 */6 * * *' # Run every 6 hours + - cron: '0 1,9,17 * * *' # Run 3x daily: 2am / 10am / 6pm Pacific (PDT) pull_request: branches: [main] 
workflow_dispatch: inputs: - version: - description: "FlashInfer version" - required: true - type: choice - default: "release" - options: - - "release" - - "nightly" target_stage: description: "Specific stage to run (optional, for quick testing)" required: false @@ -33,6 +25,11 @@ on: required: false type: string default: "" + include_wheel_build: + description: "When set with target_stage, also run sgl-kernel-build-wheels so the target stage uses the freshly-built kernel (for /rerun-stage on PRs that modify sgl-kernel/)" + required: false + type: boolean + default: false test_parallel_dispatch: description: "Test parallel dispatch behavior (simulates scheduled run)" required: false @@ -40,7 +37,7 @@ on: default: false workflow_call: inputs: - ref: + git_ref: description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.' required: false type: string @@ -50,33 +47,61 @@ on: required: false type: boolean default: false + skip_stage_health_check: + description: "Skip stage health check fast-fail (e.g. 
for release branch cuts)" + required: false + type: boolean + default: false concurrency: - # Concurrency group structure: pr-test-{branch}-{pr_sha}-{stage} + # Concurrency group structure: pr-test-{event}-{branch}-{pr_sha}-{stage} + # - event_name prevents scheduled runs from colliding with fork PRs whose branch is named 'main' + # (without it, both resolve the branch segment to 'main' and block each other) # - github.head_ref (pull_request) or github.ref_name (workflow_dispatch) normalizes to branch name # - pr_head_sha isolates /rerun-stage from main branch runs # - target_stage allows parallel stage dispatches to run independently - # This ensures pull_request and workflow_dispatch on same branch cancel each other - group: pr-test-${{ github.head_ref || github.ref_name || 'default' }}-${{ inputs.pr_head_sha || 'current' }}-${{ inputs.target_stage || inputs.ref || 'all' }} + group: pr-test-${{ github.event_name }}-${{ github.head_ref || github.ref_name || 'default' }}-${{ inputs.pr_head_sha || 'current' }}-${{ inputs.target_stage || inputs.git_ref || 'all' }} cancel-in-progress: ${{ github.event_name != 'workflow_call' }} env: SGLANG_IS_IN_CI: true + SGLANG_CUDA_COREDUMP: "1" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true + SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == true && 'true' || 'false' }} + # TEMP: rebuild deepep against the new torch for torch-211-merge PR only — revert before merging to main. 
+ FORCE_REBUILD_DEEPEP: '1' + # Schedule / main-branch dispatch / workflow_call from main use refs/heads/main; PR events use refs/pull/*/merge + PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }} + USE_VENV: false permissions: actions: write contents: read + issues: read + pull-requests: read jobs: # =============================================== check changes ==================================================== check-changes: runs-on: ubuntu-latest outputs: - main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }} - sgl_kernel: ${{ steps.filter.outputs.sgl_kernel }} # sgl-kernel tests only run when kernels are rebuilt - jit_kernel: ${{ steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }} - multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }} + # Use API-based detection for target_stage mode (filter-api), otherwise use dorny/paths-filter (filter) + main_package: ${{ steps.filter-api.outputs.main_package || steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }} + # sgl_kernel is forced to false when target_stage is set AND include_wheel_build is NOT set, + # since sgl-kernel-build-wheels normally skips in target_stage mode. When include_wheel_build + # is true, keep the real value so the wheel build runs and the target stage downloads its + # artifact (used by /rerun-stage on PRs that modify sgl-kernel/). + # This prevents CUSTOM_BUILD_SGL_KERNEL=true when the wheel artifacts aren't available. + # Note: If PR has kernel changes AND target_stage is set AND include_wheel_build is NOT set, + # the validate-target-stage step will fail. 
+ sgl_kernel: ${{ (!inputs.target_stage || inputs.include_wheel_build) && (steps.filter-api.outputs.sgl_kernel || steps.filter.outputs.sgl_kernel) }} + # Raw sgl_kernel value before target_stage override (used for validation) + sgl_kernel_raw: ${{ steps.filter-api.outputs.sgl_kernel || steps.filter.outputs.sgl_kernel }} + jit_kernel: ${{ steps.filter-api.outputs.jit_kernel || steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }} + multimodal_gen: ${{ steps.filter-api.outputs.multimodal_gen || steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }} max_parallel: ${{ steps.set-parallel.outputs.max_parallel }} + max_parallel_small: ${{ steps.set-parallel.outputs.max_parallel_small }} + max_parallel_2gpu: ${{ steps.set-parallel.outputs.max_parallel_2gpu }} b200_runner: ${{ steps.set-runner.outputs.b200_runner }} enable_retry: ${{ steps.set-retry.outputs.enable_retry }} continue_on_error: ${{ steps.set-continue-on-error.outputs.continue_on_error }} @@ -84,13 +109,15 @@ jobs: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-maintenance - name: Determine run mode id: run-mode run: | # Run all tests for scheduled runs and workflow_call (when ref input is provided) - # Note: github.event_name is inherited from caller, so we detect workflow_call by checking inputs.ref + # Note: github.event_name is inherited from caller, so we detect workflow_call by checking inputs.git_ref if [[ "${{ github.event_name }}" == "schedule" || "${{ inputs.run_all_tests }}" == "true" ]]; then echo "run_all_tests=true" >> $GITHUB_OUTPUT echo "Run mode: ALL TESTS (schedule=${{ github.event_name == 'schedule' }}, run_all_tests=${{ inputs.run_all_tests }})" @@ -102,48 +129,160 @@ jobs: - name: Detect file changes id: filter uses: dorny/paths-filter@v3 - if: steps.run-mode.outputs.run_all_tests 
!= 'true' + # Only use paths-filter for pull_request events (where it works correctly) + # For workflow_dispatch with target_stage, we use GitHub API in the next step + if: steps.run-mode.outputs.run_all_tests != 'true' && !inputs.target_stage with: filters: | main_package: - - "python/sglang/!(multimodal_gen)/**" + - ".github/workflows/pr-test.yml" + - ".github/workflows/pr-gate.yml" + - ".github/actions/**" - "python/pyproject.toml" + - "python/sglang/!(multimodal_gen)/**/!(*.md)" - "scripts/ci/cuda/*" - "scripts/ci/utils/*" - - "test/**" + - "test/**/!(*.md)" + multimodal_gen: - ".github/workflows/pr-test.yml" - sgl_kernel: - - "sgl-kernel/**" - jit_kernel: - - "python/sglang/jit_kernel/**" + - ".github/workflows/pr-test-multimodal-gen.yml" - "python/pyproject.toml" - - ".github/workflows/pr-test.yml" - multimodal_gen: - - "python/sglang/multimodal_gen/**" + - "python/sglang/multimodal_gen/**/!(*.md|*.ipynb)" + - "python/sglang/jit_kernel/diffusion/**" + - "python/sglang/jit_kernel/tests/diffusion/**" + - "python/sglang/jit_kernel/benchmark/diffusion/**" - "python/sglang/cli/**" - - "python/pyproject.toml" + jit_kernel: - ".github/workflows/pr-test.yml" + - ".github/workflows/pr-test-jit-kernel.yml" + - "python/pyproject.toml" + - "python/sglang/jit_kernel/**" + sgl_kernel: + - ".github/workflows/pr-test-sgl-kernel.yml" + - "sgl-kernel/**/!(*.md|THIRDPARTYNOTICES.txt|LICENSE)" + + # For /rerun-stage (workflow_dispatch with target_stage), dorny/paths-filter doesn't work + # correctly because it falls back to "last commit" detection which breaks for merge commits. + # Instead, we use the GitHub API to compare the PR commit against main. + - name: Detect file changes via API (for target_stage) + id: filter-api + if: inputs.target_stage && inputs.pr_head_sha + env: + GH_TOKEN: ${{ github.token }} + run: | + echo "Detecting file changes via GitHub API for target_stage mode..." 
+ echo "PR head SHA: ${{ inputs.pr_head_sha }}" + + # Get the list of changed files by comparing PR commit against main + # This correctly handles merge commits by looking at the actual PR diff + CHANGED_FILES=$(gh api "repos/${{ github.repository }}/compare/main...${{ inputs.pr_head_sha }}" \ + --jq '[.files[].filename] | .[]' 2>/dev/null || echo "") + + if [ -z "$CHANGED_FILES" ]; then + echo "Warning: Could not fetch changed files from API, assuming no changes" + echo "sgl_kernel=false" >> $GITHUB_OUTPUT + echo "main_package=false" >> $GITHUB_OUTPUT + echo "jit_kernel=false" >> $GITHUB_OUTPUT + echo "multimodal_gen=false" >> $GITHUB_OUTPUT + exit 0 + fi + + echo "Changed files:" + echo "$CHANGED_FILES" | head -20 + echo "..." + + # Check for sgl-kernel changes + if echo "$CHANGED_FILES" | grep -qE "^(sgl-kernel/|\.github/workflows/pr-test-sgl-kernel\.yml)"; then + echo "sgl_kernel=true" >> $GITHUB_OUTPUT + echo "Detected sgl-kernel changes" + else + echo "sgl_kernel=false" >> $GITHUB_OUTPUT + fi + + # Check for main_package changes (excluding multimodal_gen, jit_kernel/diffusion, jit_kernel/tests/diffusion, jit_kernel/benchmark/diffusion, cli) + # Note: Need to filter out multimodal_gen and diffusion-related paths before checking, not pipe grep -q output + MAIN_PKG_FILES=$(echo "$CHANGED_FILES" | grep -E "^(python/sglang/|python/pyproject\.toml|scripts/ci/cuda/|scripts/ci/utils/|test/|\.github/workflows/pr-test\.yml|\.github/workflows/pr-gate\.yml|\.github/actions/)" | grep -v -E "^(python/sglang/multimodal_gen/|python/sglang/jit_kernel/diffusion/|python/sglang/jit_kernel/tests/diffusion/|python/sglang/jit_kernel/benchmark/diffusion/|python/sglang/cli/)" || true) + if [ -n "$MAIN_PKG_FILES" ]; then + echo "main_package=true" >> $GITHUB_OUTPUT + echo "Detected main_package changes" + else + echo "main_package=false" >> $GITHUB_OUTPUT + fi + + # Check for jit_kernel changes + if echo "$CHANGED_FILES" | grep -qE 
"^(python/sglang/jit_kernel/|python/pyproject\.toml|\.github/workflows/pr-test\.yml|\.github/workflows/pr-test-jit-kernel\.yml)"; then + echo "jit_kernel=true" >> $GITHUB_OUTPUT + echo "Detected jit_kernel changes" + else + echo "jit_kernel=false" >> $GITHUB_OUTPUT + fi + + # Check for multimodal_gen changes, including diffusion-specific jit_kernel coverage + if echo "$CHANGED_FILES" | grep -qE "^(python/sglang/multimodal_gen/|python/sglang/cli/|python/sglang/jit_kernel/diffusion/|python/sglang/jit_kernel/tests/diffusion/|python/sglang/jit_kernel/benchmark/diffusion/|python/pyproject\.toml|\.github/workflows/pr-test\.yml|\.github/workflows/pr-test-multimodal-gen\.yml)"; then + echo "multimodal_gen=true" >> $GITHUB_OUTPUT + echo "Detected multimodal_gen changes" + else + echo "multimodal_gen=false" >> $GITHUB_OUTPUT + fi - name: Set max-parallel based on run type id: set-parallel + env: + GH_TOKEN: ${{ github.token }} run: | - # Scheduled runs and high-priority PRs get full parallelism + # Determine if this run gets full parallelism (scheduled / high priority) + FULL=false if [[ "${{ github.event_name }}" == "schedule" ]]; then - echo "max_parallel=14" >> $GITHUB_OUTPUT - echo "Scheduled run detected, setting max_parallel to 14" + FULL=true + echo "Scheduled run detected, using full parallelism" elif [[ "${{ github.event_name }}" == "pull_request" && "${{ contains(github.event.pull_request.labels.*.name, 'high priority') }}" == "true" ]]; then + FULL=true + echo "High priority PR detected, using full parallelism" + elif [[ -n "${{ inputs.target_stage }}" ]]; then + # /rerun-stage (workflow_dispatch): query PR labels via GitHub API + # Try SHA lookup first (fork PRs), fallback to branch name (non-fork PRs) + LABELS="" + PR_HEAD_SHA="${{ inputs.pr_head_sha }}" + if [[ -n "$PR_HEAD_SHA" ]]; then + LABELS=$(gh api "repos/${{ github.repository }}/commits/${PR_HEAD_SHA}/pulls" \ + --jq '.[0].labels[].name' 2>/dev/null || true) + fi + if [[ -z "$LABELS" ]]; then + 
LABELS=$(gh pr list --head "${{ github.ref_name }}" --repo "${{ github.repository }}" \ + --json labels --jq '.[0].labels[].name' 2>/dev/null || true) + fi + echo "PR labels: ${LABELS:-"(none)"}" + if echo "$LABELS" | grep -Fxq "high priority"; then + FULL=true + echo "High priority PR detected via API (/rerun-stage), using full parallelism" + fi + fi + + # Set max-parallel for each runner type + # 1-gpu-h100: 14 partitions, 1-gpu-5090: 8 partitions, 2-gpu-h100: 4 partitions + if [[ "$FULL" == "true" ]]; then + LEVEL=full echo "max_parallel=14" >> $GITHUB_OUTPUT - echo "High priority PR detected, setting max_parallel to 14" + echo "max_parallel_small=8" >> $GITHUB_OUTPUT + echo "max_parallel_2gpu=4" >> $GITHUB_OUTPUT else + LEVEL=low echo "max_parallel=3" >> $GITHUB_OUTPUT - echo "Using default max_parallel of 3" + echo "max_parallel_small=3" >> $GITHUB_OUTPUT + echo "max_parallel_2gpu=2" >> $GITHUB_OUTPUT fi + echo "parallel_level=$LEVEL" >> $GITHUB_OUTPUT + echo "Parallelism level: $LEVEL" - name: Set B200 runner tag id: set-runner run: | - sgl_kernel="${{ steps.filter.outputs.sgl_kernel || steps.run-mode.outputs.run_all_tests }}" - if [[ "$sgl_kernel" == "true" ]]; then + # Use kernel-build runner only when sgl_kernel changes are detected AND we're not in target_stage mode + # (target_stage skips wheel builds, so we can't use custom kernels) + # Use API-based detection (filter-api) for target_stage mode, otherwise use dorny/paths-filter (filter) + sgl_kernel="${{ steps.filter-api.outputs.sgl_kernel || steps.filter.outputs.sgl_kernel }}" + target_stage="${{ inputs.target_stage }}" + if [[ "$sgl_kernel" == "true" && -z "$target_stage" ]]; then echo "b200_runner=4-gpu-b200-kernel" >> $GITHUB_OUTPUT else echo "b200_runner=4-gpu-b200" >> $GITHUB_OUTPUT @@ -166,6 +305,32 @@ jobs: echo "Filtered run, continue-on-error disabled" fi + - name: Validate target_stage with kernel changes + # Fail only when PR has sgl-kernel changes AND the caller didn't opt into 
include_wheel_build. + # include_wheel_build=true means sgl-kernel-build-wheels will run alongside the target stage + # (see the sgl_kernel output and sgl-kernel-build-wheels if-conditions above/below), so it's + # safe to proceed. + if: inputs.target_stage && !inputs.include_wheel_build && (steps.filter-api.outputs.sgl_kernel == 'true' || steps.filter.outputs.sgl_kernel == 'true') + run: | + echo "::error::Cannot use /rerun-stage when PR has sgl-kernel changes without include_wheel_build." + echo "::error::The sgl-kernel-build-wheels job is skipped in target_stage mode by default, but this PR modifies sgl-kernel/ files." + echo "::error::The slash-command handler should have set include_wheel_build=true automatically; falling back to /tag-and-rerun-ci." + echo "" + echo "ERROR: Cannot use /rerun-stage when PR has sgl-kernel changes without include_wheel_build." + echo "" + echo "This PR modifies files in sgl-kernel/, which requires building custom kernel wheels." + echo "Running the target stage without rebuilding the kernel would use the wrong (PyPI)" + echo "version of sgl-kernel instead of your changes." + echo "" + echo "The /rerun-stage handler sets include_wheel_build=true automatically when it detects" + echo "sgl-kernel/ changes on the PR. If you see this error, the handler may be outdated." 
+ echo "" + echo "Alternatives:" + echo " /tag-and-rerun-ci - Re-run the full workflow including kernel builds" + echo " /rerun-ci - Re-run the full workflow" + echo "" + exit 1 + - name: Show filter results in summary (table) run: | { @@ -173,11 +338,14 @@ jobs: echo "" echo "| Component | Changed |" echo "|-------------------|---------|" - echo "| main_package | ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }} |" - echo "| sgl_kernel | ${{ steps.filter.outputs.sgl_kernel || steps.run-mode.outputs.run_all_tests }} |" - echo "| jit_kernel | ${{ steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }} |" - echo "| multimodal_gen | ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }} |" - echo "| max_parallel | ${{ steps.set-parallel.outputs.max_parallel }} |" + echo "| main_package | ${{ steps.filter-api.outputs.main_package || steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }} |" + echo "| sgl_kernel (raw) | ${{ steps.filter-api.outputs.sgl_kernel || steps.filter.outputs.sgl_kernel }} |" + echo "| sgl_kernel (used) | ${{ (!inputs.target_stage || inputs.include_wheel_build) && (steps.filter-api.outputs.sgl_kernel || steps.filter.outputs.sgl_kernel) }} |" + echo "| jit_kernel | ${{ steps.filter-api.outputs.jit_kernel || steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }} |" + echo "| multimodal_gen | ${{ steps.filter-api.outputs.multimodal_gen || steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }} |" + echo "| target_stage | ${{ inputs.target_stage || '(none)' }} |" + echo "| detection_method | ${{ inputs.target_stage && 'GitHub API' || 'dorny/paths-filter' }} |" + echo "| max_parallel | ${{ steps.set-parallel.outputs.parallel_level }} (h100=${{ steps.set-parallel.outputs.max_parallel }}, 5090=${{ steps.set-parallel.outputs.max_parallel_small }}, 2gpu=${{ steps.set-parallel.outputs.max_parallel_2gpu }}) |" echo "| 
b200_runner | ${{ steps.set-runner.outputs.b200_runner }} |" echo "| enable_retry | ${{ steps.set-retry.outputs.enable_retry }} |" echo "| continue_on_error | ${{ steps.set-continue-on-error.outputs.continue_on_error }} |" @@ -187,12 +355,11 @@ jobs: # These jobs poll GitHub API to wait for previous stages to complete. # For PR runs: wait jobs run and enforce sequential execution via polling. # For scheduled runs: wait jobs are skipped, enabling parallel execution for easier retry. + # For PRs with the `bypass-fastfail` label: wait jobs run but return success immediately + # (handled inside the wait-for-jobs action), so downstream stages dispatch in parallel. wait-for-stage-a: needs: [check-changes, call-gate] - # Only run for PRs (not scheduled) and when not targeting a specific stage - # Skip if call-gate failed (stage-a jobs will be skipped, nothing to wait for) - # !cancelled() ensures this job respects workflow cancellation from concurrency group if: | always() && !cancelled() && @@ -205,53 +372,19 @@ jobs: outputs: stage_a_result: ${{ steps.wait.outputs.result }} steps: - - name: Wait for stage-a-test-1 to complete + - uses: actions/checkout@v4 + + - uses: ./.github/actions/check-maintenance + + - uses: ./.github/actions/wait-for-jobs id: wait - uses: actions/github-script@v7 with: - script: | - const maxWaitMinutes = 240; - const pollIntervalSeconds = 120; // 2 minutes to reduce GH API calls - const maxAttempts = (maxWaitMinutes * 60) / pollIntervalSeconds; - - for (let attempt = 0; attempt < maxAttempts; attempt++) { - const jobs = await github.paginate(github.rest.actions.listJobsForWorkflowRun, { - owner: context.repo.owner, - repo: context.repo.repo, - run_id: context.runId, - per_page: 100, - }); - - const stageAJob = jobs.find(job => job.name === 'stage-a-test-1'); - - if (stageAJob) { - console.log(`stage-a-test-1 status: ${stageAJob.status}, conclusion: ${stageAJob.conclusion}`); - - if (stageAJob.status === 'completed') { - if (stageAJob.conclusion 
=== 'success' || stageAJob.conclusion === 'skipped') { - core.setOutput('result', stageAJob.conclusion === 'success' ? 'success' : 'skipped'); - return; - } else { - core.setOutput('result', 'failure'); - core.setFailed(`stage-a-test-1 ${stageAJob.conclusion}`); - return; - } - } - } else { - console.log('stage-a-test-1 job not found yet'); - } - - console.log(`Waiting ${pollIntervalSeconds}s... (attempt ${attempt + 1}/${maxAttempts})`); - await new Promise(resolve => setTimeout(resolve, pollIntervalSeconds * 1000)); - } - - core.setFailed('Timeout waiting for stage-a-test-1'); - core.setOutput('result', 'timeout'); + stage-name: stage-a + jobs: '["stage-a-test-1-gpu-small", {"prefix": "stage-a-test-cpu", "expected_count": 4}]' + max-wait-minutes: '240' wait-for-stage-b: needs: [check-changes, call-gate, wait-for-stage-a] - # Only run for PRs (not scheduled) and when not targeting a specific stage - # Skip if call-gate failed (stage-b jobs will be skipped, nothing to wait for) if: | always() && !cancelled() && @@ -265,88 +398,22 @@ jobs: outputs: stage_b_result: ${{ steps.wait.outputs.result }} steps: - - name: Wait for stage-b jobs to complete + - uses: actions/checkout@v4 + + - uses: ./.github/actions/check-maintenance + + - uses: ./.github/actions/wait-for-jobs id: wait - uses: actions/github-script@v7 with: - script: | - const maxWaitMinutes = 480; - const pollIntervalSeconds = 120; // 2 minutes to reduce GH API calls - const maxAttempts = (maxWaitMinutes * 60) / pollIntervalSeconds; - - // Stage-b jobs to wait for - const stageBJobs = [ - { prefix: 'stage-b-test-small-1-gpu', expectedCount: 8 }, // partitions 0-7 - { prefix: 'stage-b-test-large-1-gpu', expectedCount: 14 }, // partitions 0-13 - { prefix: 'stage-b-test-large-2-gpu', expectedCount: 4 }, // partitions 0-3 - { prefix: 'stage-b-test-4-gpu-b200', expectedCount: 1 }, - ]; - const totalExpectedJobs = stageBJobs.reduce((sum, j) => sum + j.expectedCount, 0); // 27 total - - // Helper to match job names 
exactly (prefix alone or prefix + " (N)" for matrix jobs) - const matchesPrefix = (jobName, prefix) => { - return jobName === prefix || jobName.startsWith(prefix + ' ('); - }; - - for (let attempt = 0; attempt < maxAttempts; attempt++) { - const jobs = await github.paginate(github.rest.actions.listJobsForWorkflowRun, { - owner: context.repo.owner, - repo: context.repo.repo, - run_id: context.runId, - per_page: 100, - }); - - let allCompleted = true; - let anyFailed = false; - let failedJobs = []; - let completedCount = 0; - let totalCount = 0; - - for (const { prefix, expectedCount } of stageBJobs) { - const matchingJobs = jobs.filter(job => matchesPrefix(job.name, prefix)); - - // Check existing jobs for failures first (fail fast) - for (const job of matchingJobs) { - totalCount++; - console.log(`${job.name}: status=${job.status}, conclusion=${job.conclusion}`); - - if (job.status !== 'completed') { - allCompleted = false; - } else { - completedCount++; - if (job.conclusion !== 'success' && job.conclusion !== 'skipped') { - anyFailed = true; - failedJobs.push(job.name); - } - } - } - - if (matchingJobs.length < expectedCount) { - console.log(`${prefix}: found ${matchingJobs.length}/${expectedCount} jobs (waiting for more)`); - allCompleted = false; - } - } - - console.log(`Progress: ${completedCount}/${totalCount} jobs completed (expected ${totalExpectedJobs})`); - - // Fail fast if any jobs failed (don't wait for all jobs to be created) - if (anyFailed) { - core.setOutput('result', 'failure'); - core.setFailed(`Stage-b jobs failed: ${failedJobs.join(', ')}`); - return; - } - - if (allCompleted && totalCount >= totalExpectedJobs) { - core.setOutput('result', 'success'); - return; - } - - console.log(`Waiting ${pollIntervalSeconds}s... 
(attempt ${attempt + 1}/${maxAttempts})`); - await new Promise(resolve => setTimeout(resolve, pollIntervalSeconds * 1000)); - } - - core.setFailed('Timeout waiting for stage-b jobs'); - core.setOutput('result', 'timeout'); + stage-name: stage-b + jobs: | + [ + {"prefix": "stage-b-test-1-gpu-small", "expected_count": 8}, + {"prefix": "stage-b-test-1-gpu-large", "expected_count": 14}, + {"prefix": "stage-b-test-2-gpu-large", "expected_count": 4}, + {"prefix": "stage-b-test-4-gpu-b200", "expected_count": 1} + ] + max-wait-minutes: '480' # =============================================== PR Gate ==================================================== call-gate: @@ -369,18 +436,31 @@ jobs: sgl-kernel-build-wheels: needs: [check-changes, call-gate] - # Skip for scheduled runs (they run stages independently) and when target_stage is set - if: github.event_name != 'schedule' && inputs.test_parallel_dispatch != true && !inputs.target_stage && needs.check-changes.outputs.sgl_kernel == 'true' + # Skip for scheduled runs (they run stages independently). Runs in target_stage mode only when + # include_wheel_build is true (i.e. /rerun-stage on a PR with sgl-kernel changes), so the + # target stage can download the freshly-built wheel. + # + # `always()` lets us run when call-gate is skipped (which it always is in target_stage mode by + # design). The explicit needs..result checks preserve old gating for the normal PR path. 
+ if: | + always() && + github.event_name != 'schedule' && + inputs.test_parallel_dispatch != true && + needs.check-changes.result == 'success' && + needs.check-changes.outputs.sgl_kernel == 'true' && + ( + (!inputs.target_stage && needs.call-gate.result == 'success') || + (inputs.target_stage && inputs.include_wheel_build) + ) runs-on: x64-kernel-build-node timeout-minutes: 240 strategy: matrix: include: + - python-version: "3.10" + cuda-version: "13.0" - python-version: "3.10" cuda-version: "12.9" - # Add back when CUDA 13.0 is supported on CI - # - python-version: "3.10" - # cuda-version: "13.0" name: Build Wheel steps: - name: Cleanup @@ -390,13 +470,35 @@ jobs: - uses: actions/checkout@v4 with: submodules: "recursive" - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-maintenance - name: Set up Python ${{ matrix.python-version }} uses: actions/setup-python@v5 with: python-version: ${{ matrix.python-version }} + - name: Free Docker disk space + run: | + set -x + # build.sh retags sgl-kernel-deps:cuda${CUDA_VERSION}-${PY_TAG}-${ARCH} + # on every run, leaving the previous image as a dangling <none>:<none> + # entry (~16-23 GB each). Prune them before building so the runner + # doesn't fill up. The local buildx cache at ~/.cache/sgl-kernel/buildx + # and the tagged sgl-kernel-deps image are not affected. + # `until=12h` avoids racing with a sibling matrix cell (cuda 12.9 vs + # 13.0) that may have just orphaned an image seconds ago. + docker image prune -f --filter "until=12h" + # Drop orphaned buildx builder volumes from past `docker buildx create` + # invocations. The active `sgl-kernel-builder` volume is held open and + # would fail to remove anyway, but skip it explicitly for clarity.
+ for v in $(docker volume ls -q | grep '^buildx_buildkit_' | grep -v '^buildx_buildkit_sgl-kernel-builder' || true); do + echo "Removing orphan buildx volume: $v" + docker volume rm "$v" || true + done + df -h / + - name: Build wheel for Python ${{ matrix.python-version }} and CUDA ${{ matrix.cuda-version }} run: | cd sgl-kernel @@ -418,13 +520,27 @@ jobs: sgl-kernel-build-wheels-arm: needs: [check-changes, call-gate] - # Skip for scheduled runs (they run stages independently) and when target_stage is set - if: github.event_name != 'schedule' && inputs.test_parallel_dispatch != true && !inputs.target_stage && needs.check-changes.outputs.sgl_kernel == 'true' + # Skip for scheduled runs (they run stages independently). Runs in target_stage mode only when + # include_wheel_build is true (i.e. /rerun-stage on a PR with sgl-kernel changes). + # + # See sgl-kernel-build-wheels above for the always() + result-check rationale. + if: | + always() && + github.event_name != 'schedule' && + inputs.test_parallel_dispatch != true && + needs.check-changes.result == 'success' && + needs.check-changes.outputs.sgl_kernel == 'true' && + ( + (!inputs.target_stage && needs.call-gate.result == 'success') || + (inputs.target_stage && inputs.include_wheel_build) + ) runs-on: arm-kernel-build-node timeout-minutes: 240 strategy: matrix: include: + - python-version: "3.10" + cuda-version: "13.0" - python-version: "3.10" cuda-version: "12.9" name: Build Wheel Arm @@ -440,13 +556,26 @@ jobs: - uses: actions/checkout@v4 with: submodules: "recursive" - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-maintenance - name: Set up Python ${{ matrix.python-version }} uses: actions/setup-python@v5 with: python-version: ${{ matrix.python-version }} + - name: Free Docker disk space + run: | + set -x + # See sgl-kernel-build-wheels above for the rationale. 
+ docker image prune -f --filter "until=12h" + for v in $(docker volume ls -q | grep '^buildx_buildkit_' | grep -v '^buildx_buildkit_sgl-kernel-builder' || true); do + echo "Removing orphan buildx volume: $v" + docker volume rm "$v" || true + done + df -h / + - name: Build wheel for Python ${{ matrix.python-version }} and CUDA ${{ matrix.cuda-version }} run: | cd sgl-kernel @@ -466,263 +595,71 @@ jobs: path: sgl-kernel/dist/* if-no-files-found: error - sgl-kernel-unit-test: - needs: [check-changes, call-gate, sgl-kernel-build-wheels] - # Skip for scheduled runs and when target_stage is set - if: | - github.event_name != 'schedule' && - inputs.test_parallel_dispatch != true && - !inputs.target_stage && - needs.check-changes.outputs.sgl_kernel == 'true' - runs-on: 1-gpu-runner - timeout-minutes: 240 - env: - RUNNER_LABELS: 1-gpu-runner - steps: - - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: Cleanup - run: | - ls -alh sgl-kernel/dist || true - rm -rf sgl-kernel/dist/* || true - - - name: Download artifacts - uses: actions/download-artifact@v4 - with: - path: sgl-kernel/dist/ - merge-multiple: true - pattern: wheel-python3.10-cuda12.9 - - - name: Install dependencies - timeout-minutes: 10 - run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion - - - name: Run test - timeout-minutes: 30 - run: | - cd sgl-kernel - pytest tests/ - - sgl-kernel-mla-test: - needs: [check-changes, call-gate, sgl-kernel-build-wheels] - # Skip for scheduled runs and when target_stage is set - if: | - github.event_name != 'schedule' && - inputs.test_parallel_dispatch != true && - !inputs.target_stage && - needs.check-changes.outputs.sgl_kernel == 'true' - runs-on: 1-gpu-runner - timeout-minutes: 240 - env: - RUNNER_LABELS: 1-gpu-runner - steps: - - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: 
Cleanup - run: | - ls -alh sgl-kernel/dist || true - rm -rf sgl-kernel/dist/* || true - - - name: Download artifacts - uses: actions/download-artifact@v4 - with: - path: sgl-kernel/dist/ - merge-multiple: true - pattern: wheel-python3.10-cuda12.9 - - - name: Install dependencies - timeout-minutes: 10 - run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - - - name: Run test - timeout-minutes: 30 - run: | - cd test/registered/mla - python3 test_mla_deepseek_v3.py - - sgl-kernel-benchmark-test: + call-sgl-kernel-tests: needs: [check-changes, call-gate, sgl-kernel-build-wheels] - # Skip for scheduled runs and when target_stage is set - if: | - github.event_name != 'schedule' && - inputs.test_parallel_dispatch != true && - !inputs.target_stage && - needs.check-changes.outputs.sgl_kernel == 'true' - runs-on: 1-gpu-runner - timeout-minutes: 240 - env: - CI: true - RUNNER_LABELS: 1-gpu-runner - steps: - - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: Cleanup - run: | - ls -alh sgl-kernel/dist || true - rm -rf sgl-kernel/dist/* || true - - - name: Download artifacts - uses: actions/download-artifact@v4 - with: - path: sgl-kernel/dist/ - merge-multiple: true - pattern: wheel-python3.10-cuda12.9 - - - name: Install dependencies - timeout-minutes: 10 - run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - - - name: Run benchmark tests - timeout-minutes: 45 - run: | - cd sgl-kernel/benchmark - echo "Running sgl-kernel benchmark tests in CI mode..." - - echo "CI environment variable: $CI" - echo "GITHUB_ACTIONS environment variable: $GITHUB_ACTIONS" - - for bench_file in bench_*.py; do - echo "Testing $bench_file..." - timeout 60 python3 "$bench_file" || echo "Warning: $bench_file timed out or failed, continuing..." 
- echo "Completed $bench_file" - echo "---" - done - - echo "All benchmark tests completed!" - - sgl-kernel-b200-test: - needs: [check-changes, sgl-kernel-build-wheels] - # Skip for scheduled runs and when target_stage is set if: | github.event_name != 'schedule' && inputs.test_parallel_dispatch != true && !inputs.target_stage && needs.check-changes.outputs.sgl_kernel == 'true' - runs-on: ${{ needs.check-changes.outputs.b200_runner }} - timeout-minutes: 240 - env: - RUNNER_LABELS: ${{ needs.check-changes.outputs.b200_runner }} - steps: - - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: Cleanup - run: | - ls -alh sgl-kernel/dist || true - rm -rf sgl-kernel/dist/* || true - - - name: Download artifacts - uses: actions/download-artifact@v4 - with: - path: sgl-kernel/dist/ - merge-multiple: true - pattern: wheel-python3.10-cuda12.9 - - - name: Install dependencies - timeout-minutes: 10 - run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 bash scripts/ci/cuda/ci_install_dependency.sh diffusion - - - name: Run sgl-kernel unit tests on B200 - timeout-minutes: 30 - run: | - cd sgl-kernel - pytest tests/ - - # Adding a single CUDA13 smoke test to verify that the kernel builds and runs - # TODO: Add back this test when it can pass on CI - # cuda13-kernel-smoke-test: - # needs: [check-changes, sgl-kernel-build-wheels] - # if: needs.check-changes.outputs.sgl_kernel == 'true' - # runs-on: x64-cu13-kernel-tests - # steps: - # - uses: actions/checkout@v4 - - # - name: Cleanup - # run: | - # ls -alh sgl-kernel/dist || true - # rm -rf sgl-kernel/dist/* || true - - # - name: Download CUDA 13.0 artifacts - # uses: actions/download-artifact@v4 - # with: - # path: sgl-kernel/dist/ - # merge-multiple: true - # pattern: wheel-python3.10-cuda13.0 - - # - name: Install dependencies - # run: | - # CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash 
scripts/ci/cuda/ci_install_dependency.sh - - # - name: Run kernel unit tests - # timeout-minutes: 30 - # run: | - # cd sgl-kernel - # pytest tests/ + uses: ./.github/workflows/pr-test-sgl-kernel.yml + with: + sgl_kernel: ${{ needs.check-changes.outputs.sgl_kernel }} + b200_runner: ${{ needs.check-changes.outputs.b200_runner }} + pr_head_sha: ${{ inputs.pr_head_sha || '' }} + git_ref: ${{ inputs.git_ref || '' }} + skip_stage_health_check: ${{ inputs.skip_stage_health_check == true }} + secrets: inherit # =============================================== jit-kernel ==================================================== - jit-kernel-unit-test: - needs: [check-changes, call-gate] - # Skip for scheduled runs and when target_stage is set + call-jit-kernel-tests: + needs: [check-changes, call-gate, sgl-kernel-build-wheels] if: | + always() && + !failure() && !cancelled() && github.event_name != 'schedule' && inputs.test_parallel_dispatch != true && !inputs.target_stage && needs.check-changes.outputs.jit_kernel == 'true' - runs-on: 1-gpu-runner - timeout-minutes: 240 - env: - RUNNER_LABELS: 1-gpu-runner - steps: - - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: Install dependencies - timeout-minutes: 10 - run: | - bash scripts/ci/cuda/ci_install_dependency.sh - - - name: Run test - timeout-minutes: 30 - run: | - cd python/sglang/jit_kernel - pytest tests/ + uses: ./.github/workflows/pr-test-jit-kernel.yml + with: + jit_kernel: ${{ needs.check-changes.outputs.jit_kernel }} + sgl_kernel: ${{ needs.check-changes.outputs.sgl_kernel }} + b200_runner: ${{ needs.check-changes.outputs.b200_runner }} + pr_head_sha: ${{ inputs.pr_head_sha || '' }} + git_ref: ${{ inputs.git_ref || '' }} + target_stage: ${{ inputs.target_stage || '' }} + test_parallel_dispatch: ${{ inputs.test_parallel_dispatch == true && 'true' || 'false' }} + skip_stage_health_check: ${{ inputs.skip_stage_health_check == true }} + secrets: inherit # 
=============================================== primary ==================================================== - stage-a-test-1: + # Runs on 5090 (32GB, SM120) + stage-a-test-1-gpu-small: needs: [check-changes, call-gate, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'stage-a-test-1') || + (inputs.target_stage == 'stage-a-test-1-gpu-small') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 1-gpu-runner + runs-on: 1-gpu-5090 timeout-minutes: 240 - env: - RUNNER_LABELS: 1-gpu-runner steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -730,31 +667,34 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - name: Run test timeout-minutes: 10 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | cd test/ - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --hw cuda --suite stage-a-test-1 $CONTINUE_ON_ERROR_FLAG - # temporarily put backend-independent cpu tests here - python3 run_suite.py --hw cpu --suite default $CONTINUE_ON_ERROR_FLAG + python3 
run_suite.py --hw cuda --suite stage-a-test-1-gpu-small $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() - stage-a-cpu-only: + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh + + stage-a-test-cpu: needs: [check-changes, call-gate] if: | always() && ( - (inputs.target_stage == 'stage-a-cpu-only') || + (inputs.target_stage == 'stage-a-test-cpu') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && @@ -763,6 +703,10 @@ jobs: ) runs-on: ubuntu-latest timeout-minutes: 240 + strategy: + fail-fast: false + matrix: + partition: [0, 1, 2, 3] steps: - name: Free disk space run: | @@ -772,35 +716,56 @@ jobs: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Set up Python uses: actions/setup-python@v5 with: python-version: '3.10' + - name: Install uv + uses: astral-sh/setup-uv@v5 + + # Needed by setuptools-rust to build the bundled native gRPC extension + # (rust/sglang-grpc) when installing the main `sglang` wheel from source. 
+ - name: Install protoc + Rust toolchain + timeout-minutes: 10 + run: bash scripts/ci/utils/install_rust_protoc.sh + + - name: Rust cache (sglang-grpc) + uses: Swatinem/rust-cache@v2 + with: + workspaces: rust/sglang-grpc + shared-key: "sglang-grpc-cpu" + save-if: ${{ matrix.partition == 0 }} + + # uv pip targets a venv by default; setup-python has no venv — install into that interpreter (see UV_SYSTEM_PYTHON in https://docs.astral.sh/uv/guides/integration/github/) - name: Install dependencies timeout-minutes: 20 + env: + UV_SYSTEM_PYTHON: "1" run: | - pip install -e "python/[dev]" + uv pip install -e "python[dev]" --index-strategy unsafe-best-match --prerelease allow - name: Run test timeout-minutes: 10 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | cd test/ - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --hw cpu --suite stage-a-cpu-only $CONTINUE_ON_ERROR_FLAG + python3 run_suite.py --hw cpu --suite stage-a-test-cpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 4 $CONTINUE_ON_ERROR_FLAG # Runs on 5090 (32GB, SM120) - stage-b-test-small-1-gpu: + stage-b-test-1-gpu-small: needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'stage-b-test-small-1-gpu') || + (inputs.target_stage == 'stage-b-test-1-gpu-small') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && @@ -809,19 +774,20 @@ jobs: ) runs-on: 1-gpu-5090 timeout-minutes: 240 - env: - RUNNER_LABELS: 1-gpu-5090 - IS_BLACKWELL: "1" strategy: fail-fast: false - max-parallel: 8 + max-parallel: ${{ fromJson(needs.check-changes.outputs.max_parallel_small) }} matrix: partition: [0, 1, 2, 3, 4, 5, 6, 7] steps: - name: 
Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -829,45 +795,45 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | - source /etc/profile.d/sglang-ci.sh CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - git clone https://github.com/merrymercy/human-eval.git - cd human-eval - pip install -e . - name: Run test timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | - source /etc/profile.d/sglang-ci.sh cd test/ - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --hw cuda --suite stage-b-test-small-1-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 8 $CONTINUE_ON_ERROR_FLAG + python3 run_suite.py --hw cuda --suite stage-b-test-1-gpu-small --auto-partition-id ${{ matrix.partition }} --auto-partition-size 8 $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.partition }} + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh # Runs on H100 (80GB, SM90) - tests that don't pass on 5090 (FA3, FP8, high VRAM, etc.) 
- stage-b-test-large-1-gpu: + stage-b-test-1-gpu-large: needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'stage-b-test-large-1-gpu') || + (inputs.target_stage == 'stage-b-test-1-gpu-large') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 1-gpu-runner + runs-on: 1-gpu-h100 timeout-minutes: 240 - env: - RUNNER_LABELS: 1-gpu-runner strategy: fail-fast: false max-parallel: ${{ fromJson(needs.check-changes.outputs.max_parallel) }} @@ -877,7 +843,11 @@ jobs: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -885,48 +855,58 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - name: Run test timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | cd test/ - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --hw cuda --suite stage-b-test-large-1-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 14 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG + python3 run_suite.py 
--hw cuda --suite stage-b-test-1-gpu-large --auto-partition-id ${{ matrix.partition }} --auto-partition-size 14 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.partition }} + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh - stage-b-test-large-2-gpu: + stage-b-test-2-gpu-large: needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'stage-b-test-large-2-gpu') || + (inputs.target_stage == 'stage-b-test-2-gpu-large') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 2-gpu-runner + runs-on: 2-gpu-h100 timeout-minutes: 240 - env: - RUNNER_LABELS: 2-gpu-runner strategy: fail-fast: false + max-parallel: ${{ fromJson(needs.check-changes.outputs.max_parallel_2gpu) }} matrix: partition: [0, 1, 2, 3] steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -934,25 +914,29 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - git clone https://github.com/merrymercy/human-eval.git - cd human-eval - pip install -e . 
- name: Run test timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | cd test/ - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --hw cuda --suite stage-b-test-large-2-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 4 $CONTINUE_ON_ERROR_FLAG + python3 run_suite.py --hw cuda --suite stage-b-test-2-gpu-large --auto-partition-id ${{ matrix.partition }} --auto-partition-size 4 $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.partition }} + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh stage-b-test-4-gpu-b200: needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels] @@ -968,8 +952,6 @@ jobs: ) runs-on: ${{ needs.check-changes.outputs.b200_runner }} timeout-minutes: 240 - env: - RUNNER_LABELS: ${{ needs.check-changes.outputs.b200_runner }} strategy: fail-fast: false @@ -977,7 +959,11 @@ jobs: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -985,137 +971,93 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 bash scripts/ci/cuda/ci_install_dependency.sh + CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash 
scripts/ci/cuda/ci_install_dependency.sh - name: Run test - timeout-minutes: 30 + timeout-minutes: 40 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | cd test - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - IS_BLACKWELL=1 python3 run_suite.py --hw cuda --suite stage-b-test-4-gpu-b200 $CONTINUE_ON_ERROR_FLAG + python3 run_suite.py --hw cuda --suite stage-b-test-4-gpu-b200 $CONTINUE_ON_ERROR_FLAG - name: Run FA4 jit_kernel tests (SM100+) timeout-minutes: 10 run: | - IS_BLACKWELL=1 python3 -m pytest -q python/sglang/jit_kernel/tests/test_flash_attention_4.py + python3 -m pytest -q python/sglang/jit_kernel/tests/test_flash_attention_4.py - stage-c-test-large-4-gpu: - needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] - if: | - always() && - ( - (inputs.target_stage == 'stage-c-test-large-4-gpu') || - ( - !inputs.target_stage && - ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && - ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) - ) - ) - runs-on: 4-gpu-h100 - timeout-minutes: 240 - env: - RUNNER_LABELS: 4-gpu-h100 - steps: - - name: Checkout code - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: Download artifacts - if: needs.check-changes.outputs.sgl_kernel == 'true' - uses: actions/download-artifact@v4 - with: - path: sgl-kernel/dist/ - merge-multiple: true - pattern: wheel-python3.10-cuda12.9 - - - name: Install dependencies - timeout-minutes: 10 - run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() - - name: Run test - 
timeout-minutes: 30 - run: | - cd test/ - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --hw cuda --suite stage-c-test-large-4-gpu $CONTINUE_ON_ERROR_FLAG + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh - stage-c-test-large-4-gpu-b200: - needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] + call-multimodal-gen-tests: + needs: [check-changes, call-gate, sgl-kernel-build-wheels] if: | always() && + !cancelled() && ( - (inputs.target_stage == 'stage-c-test-large-4-gpu-b200') || + inputs.target_stage == 'multimodal-gen-test-1-gpu' || + inputs.target_stage == 'multimodal-gen-test-2-gpu' || + inputs.target_stage == 'multimodal-gen-component-accuracy' || + inputs.target_stage == 'multimodal-gen-component-accuracy-1-gpu' || + inputs.target_stage == 'multimodal-gen-component-accuracy-2-gpu' || + inputs.target_stage == 'multimodal-gen-test-1-b200' || + inputs.target_stage == 'multimodal-gen-unit-test' || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && - ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + needs.check-changes.outputs.multimodal_gen == 'true' ) ) - runs-on: ${{ needs.check-changes.outputs.b200_runner }} - timeout-minutes: 240 - env: - RUNNER_LABELS: ${{ needs.check-changes.outputs.b200_runner }} - steps: - - name: Checkout code - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} - - - name: Download artifacts - if: needs.check-changes.outputs.sgl_kernel == 'true' - uses: actions/download-artifact@v6 - with: - path: sgl-kernel/dist/ - merge-multiple: true - pattern: wheel-python3.10-cuda12.9 - - - name: Install dependencies - timeout-minutes: 10 - run: | - 
CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 bash scripts/ci/cuda/ci_install_dependency.sh - - - name: Run test - timeout-minutes: 30 - run: | - cd test/ - IS_BLACKWELL=1 python3 run_suite.py --hw cuda --suite stage-c-test-large-4-gpu-b200 + uses: ./.github/workflows/pr-test-multimodal-gen.yml + with: + multimodal_gen: ${{ needs.check-changes.outputs.multimodal_gen }} + sgl_kernel: ${{ needs.check-changes.outputs.sgl_kernel }} + b200_runner: ${{ needs.check-changes.outputs.b200_runner }} + continue_on_error: ${{ needs.check-changes.outputs.continue_on_error }} + pr_head_sha: ${{ inputs.pr_head_sha || '' }} + git_ref: ${{ inputs.git_ref || '' }} + target_stage: ${{ inputs.target_stage || '' }} + test_parallel_dispatch: ${{ inputs.test_parallel_dispatch == true && 'true' || 'false' }} + caller_needs_failure: ${{ (needs.call-gate.result == 'failure' || needs.sgl-kernel-build-wheels.result == 'failure' || needs.check-changes.result == 'failure') && 'true' || 'false' }} + skip_stage_health_check: ${{ inputs.skip_stage_health_check == true && 'true' || 'false' }} + secrets: inherit - multimodal-gen-test-1-gpu: - needs: [check-changes, call-gate, sgl-kernel-build-wheels] + stage-c-test-4-gpu-h100: + needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'multimodal-gen-test-1-gpu') || + (inputs.target_stage == 'stage-c-test-4-gpu-h100') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && - needs.check-changes.outputs.multimodal_gen == 'true' + ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 1-gpu-runner + runs-on: 4-gpu-h100 timeout-minutes: 240 strategy: fail-fast: false matrix: - part: [0, 1] + part: [0, 1, 2] steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || 
inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -1123,103 +1065,57 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 - run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion - - name: Run diffusion server tests - timeout-minutes: 240 + timeout-minutes: 20 run: | - cd python - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 sglang/multimodal_gen/test/run_suite.py \ - --suite 1-gpu \ - --partition-id ${{ matrix.part }} \ - --total-partitions 2 \ - $CONTINUE_ON_ERROR_FLAG - + CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - multimodal-gen-test-2-gpu: - needs: [check-changes, call-gate, sgl-kernel-build-wheels] - if: | - always() && - ( - (inputs.target_stage == 'multimodal-gen-test-2-gpu') || - ( - !inputs.target_stage && - ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && - needs.check-changes.outputs.multimodal_gen == 'true' - ) - ) - runs-on: 2-gpu-runner - timeout-minutes: 240 - strategy: - fail-fast: false - matrix: - part: [0, 1] - steps: - - name: Checkout code - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + - name: Run test + timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} + run: | + cd test + python3 run_suite.py --hw cuda --suite 
stage-c-test-4-gpu-h100 --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 $CONTINUE_ON_ERROR_FLAG - - name: Download artifacts - if: needs.check-changes.outputs.sgl_kernel == 'true' - uses: actions/download-artifact@v4 + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() with: - path: sgl-kernel/dist/ - merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + artifact-suffix: ${{ matrix.part }} - - name: Install dependencies - timeout-minutes: 10 - run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh - - name: Run diffusion server tests - timeout-minutes: 240 - run: | - cd python - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 sglang/multimodal_gen/test/run_suite.py \ - --suite 2-gpu \ - --partition-id ${{ matrix.part }} \ - --total-partitions 2 \ - $CONTINUE_ON_ERROR_FLAG - - unit-test-backend-4-gpu: - needs: [check-changes, call-gate, wait-for-stage-b] + stage-c-test-8-gpu-h200: + needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'unit-test-backend-4-gpu') || + (inputs.target_stage == 'stage-c-test-8-gpu-h200') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 4-gpu-h100 + runs-on: 8-gpu-h200 timeout-minutes: 240 - env: - RUNNER_LABELS: 4-gpu-h100 strategy: fail-fast: false matrix: - part: [0, 1, 2] + part: [0, 1, 2, 3] steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha 
|| inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -1227,52 +1123,75 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh + - name: Warmup DeepGEMM JIT Compilation + timeout-minutes: 25 + run: | + # Activate venv if available (GITHUB_ENV may have failed to propagate) + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/bin/activate" ] && source "${SGLANG_CI_VENV_PATH}/bin/activate" + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/env.sh" ] && source "${SGLANG_CI_VENV_PATH}/env.sh" + python3 scripts/ci/cuda/warmup_deep_gemm.py \ + deepseek-ai/DeepSeek-V3-0324:8 \ + deepseek-ai/DeepSeek-V3.2-Exp:8 + + - name: Warmup Server CUDA Graphs + timeout-minutes: 25 + run: | + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/bin/activate" ] && source "${SGLANG_CI_VENV_PATH}/bin/activate" + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/env.sh" ] && source "${SGLANG_CI_VENV_PATH}/env.sh" + python3 scripts/ci/cuda/warmup_server.py \ + deepseek-ai/DeepSeek-V3-0324:8 \ + inclusionAI/Ring-2.5-1T:8 + - name: Run test - timeout-minutes: 20 + timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | - cd test/srt - RETRY_FLAG="" - if [[ "${{ needs.check-changes.outputs.enable_retry }}" == "true" ]]; then - RETRY_FLAG="--enable-retry" - fi - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --suite per-commit-4-gpu --auto-partition-id ${{ matrix.part }} 
--auto-partition-size 3 $RETRY_FLAG $CONTINUE_ON_ERROR_FLAG + cd test + python3 run_suite.py --hw cuda --suite stage-c-test-8-gpu-h200 --auto-partition-id ${{ matrix.part }} --auto-partition-size 4 $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.part }} - unit-test-backend-8-gpu-h200: - needs: [check-changes, call-gate, wait-for-stage-b] + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh + + stage-c-test-8-gpu-h20: + needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'unit-test-backend-8-gpu-h200') || + (inputs.target_stage == 'stage-c-test-8-gpu-h20') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 8-gpu-h200 + runs-on: 8-gpu-h20 timeout-minutes: 240 env: - RUNNER_LABELS: 8-gpu-h200 - strategy: - fail-fast: false - matrix: - part: [0, 1, 2, 3] + SGLANG_CI_RDMA_ALL_DEVICES: "mlx5_1,mlx5_2,mlx5_3,mlx5_4" + CU_VERSION: cu129 steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -1280,59 +1199,51 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - - # - name: Warmup Weights and JIT Compilation 
- # timeout-minutes: 20 - # run: | - # # An example command for testing the warmup. TODO: make this more general and move them to python scripts. - # python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3-0324 --tp 8 --trust-remote-code + CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_deepep.sh - name: Run test - timeout-minutes: 20 + timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | - cd test/srt - RETRY_FLAG="" - if [[ "${{ needs.check-changes.outputs.enable_retry }}" == "true" ]]; then - RETRY_FLAG="--enable-retry" - fi - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --suite per-commit-8-gpu-h200 --auto-partition-id ${{ matrix.part }} --auto-partition-size 4 $RETRY_FLAG $CONTINUE_ON_ERROR_FLAG + cd test + python3 run_suite.py --hw cuda --suite stage-c-test-8-gpu-h20 $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh - unit-test-backend-8-gpu-h20: - needs: [check-changes, call-gate, wait-for-stage-b] + stage-c-test-deepep-4-gpu-h100: + needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'unit-test-backend-8-gpu-h20') || + (inputs.target_stage == 'stage-c-test-deepep-4-gpu-h100') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 8-gpu-h20 + runs-on: 4-gpu-h100 timeout-minutes: 240 - env: - SGLANG_CI_RDMA_ALL_DEVICES: "mlx5_1,mlx5_2,mlx5_3,mlx5_4" - 
RUNNER_LABELS: 8-gpu-h20 - strategy: - fail-fast: false - matrix: - part: [0, 1] steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -1340,48 +1251,68 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_deepep.sh + - name: Warmup DeepGEMM JIT Compilation + timeout-minutes: 25 + run: | + # Activate venv if available (GITHUB_ENV may have failed to propagate) + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/bin/activate" ] && source "${SGLANG_CI_VENV_PATH}/bin/activate" + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/env.sh" ] && source "${SGLANG_CI_VENV_PATH}/env.sh" + python3 scripts/ci/cuda/warmup_deep_gemm.py \ + lmsys/sglang-ci-dsv3-test:4 + + - name: Warmup Server CUDA Graphs + timeout-minutes: 25 + run: | + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/bin/activate" ] && source "${SGLANG_CI_VENV_PATH}/bin/activate" + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/env.sh" ] && source "${SGLANG_CI_VENV_PATH}/env.sh" + python3 scripts/ci/cuda/warmup_server.py \ + lmsys/sglang-ci-dsv3-test:4 + - name: Run test - timeout-minutes: 20 + timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | - cd test/srt - RETRY_FLAG="" - if [[ "${{ needs.check-changes.outputs.enable_retry }}" == "true" ]]; then - RETRY_FLAG="--enable-retry" - fi - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; 
then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --suite per-commit-8-gpu-h20 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 $RETRY_FLAG $CONTINUE_ON_ERROR_FLAG + cd test + python3 run_suite.py --hw cuda --suite stage-c-test-deepep-4-gpu-h100 $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh - unit-test-deepep-4-gpu: - needs: [check-changes, call-gate, wait-for-stage-b] + stage-c-test-deepep-8-gpu-h200: + needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'unit-test-deepep-4-gpu') || + (inputs.target_stage == 'stage-c-test-deepep-8-gpu-h200') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 4-gpu-h100 + runs-on: 8-gpu-h200-deepep timeout-minutes: 240 - env: - RUNNER_LABELS: 4-gpu-h100 steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -1389,82 +1320,111 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_deepep.sh + - name: Warmup DeepGEMM JIT Compilation + timeout-minutes: 25 + run: | + # Activate venv if available (GITHUB_ENV may have failed to 
propagate) + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/bin/activate" ] && source "${SGLANG_CI_VENV_PATH}/bin/activate" + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/env.sh" ] && source "${SGLANG_CI_VENV_PATH}/env.sh" + python3 scripts/ci/cuda/warmup_deep_gemm.py \ + deepseek-ai/DeepSeek-V3-0324:8 \ + deepseek-ai/DeepSeek-V3.2-Exp:8 + + - name: Warmup Server CUDA Graphs + timeout-minutes: 25 + run: | + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/bin/activate" ] && source "${SGLANG_CI_VENV_PATH}/bin/activate" + [ -f "${SGLANG_CI_VENV_PATH:-/dev/null}/env.sh" ] && source "${SGLANG_CI_VENV_PATH}/env.sh" + python3 scripts/ci/cuda/warmup_server.py \ + deepseek-ai/DeepSeek-V3-0324:8 + - name: Run test - timeout-minutes: 20 + timeout-minutes: 45 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | - cd test/srt - RETRY_FLAG="" - if [[ "${{ needs.check-changes.outputs.enable_retry }}" == "true" ]]; then - RETRY_FLAG="--enable-retry" - fi - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --suite per-commit-4-gpu-deepep $RETRY_FLAG $CONTINUE_ON_ERROR_FLAG + cd test + python3 run_suite.py --hw cuda --suite stage-c-test-deepep-8-gpu-h200 $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh - unit-test-deepep-8-gpu: - needs: [check-changes, call-gate, wait-for-stage-b] + stage-c-test-4-gpu-b200: + needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'unit-test-deepep-8-gpu') || + (inputs.target_stage == 'stage-c-test-4-gpu-b200') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && 
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 8-gpu-h200 + runs-on: ${{ needs.check-changes.outputs.b200_runner }} timeout-minutes: 240 - env: - RUNNER_LABELS: 8-gpu-h200 + strategy: + fail-fast: false + matrix: + part: [0, 1, 2, 3, 4, 5] + steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' - uses: actions/download-artifact@v4 + uses: actions/download-artifact@v6 with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 20 run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_deepep.sh + CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh - name: Run test - timeout-minutes: 45 + timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | - cd test/srt - RETRY_FLAG="" - if [[ "${{ needs.check-changes.outputs.enable_retry }}" == "true" ]]; then - RETRY_FLAG="--enable-retry" - fi - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --suite per-commit-8-gpu-h200-deepep $RETRY_FLAG $CONTINUE_ON_ERROR_FLAG + cd test + python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-b200 --auto-partition-id ${{ matrix.part }} --auto-partition-size 6 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG - unit-test-backend-4-gpu-b200: 
- needs: [check-changes, call-gate, wait-for-stage-b] + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + with: + artifact-suffix: ${{ matrix.part }} + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh + + stage-c-test-dsv4-4-gpu-b200: + needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'unit-test-backend-4-gpu-b200') || + (inputs.target_stage == 'stage-c-test-dsv4-4-gpu-b200') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && @@ -1473,18 +1433,15 @@ jobs: ) runs-on: ${{ needs.check-changes.outputs.b200_runner }} timeout-minutes: 240 - env: - RUNNER_LABELS: ${{ needs.check-changes.outputs.b200_runner }} - strategy: - fail-fast: false - matrix: - part: [0, 1, 2] - steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -1492,50 +1449,51 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9 + pattern: wheel-python3.10-cuda* - name: Install dependencies - timeout-minutes: 10 + timeout-minutes: 30 run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 bash scripts/ci/cuda/ci_install_dependency.sh + CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_flash_mla.sh - name: Run test timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | - cd test/srt - RETRY_FLAG="" - if [[ "${{ needs.check-changes.outputs.enable_retry }}" == "true" 
]]; then - RETRY_FLAG="--enable-retry" - fi - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - IS_BLACKWELL=1 python3 run_suite.py --suite per-commit-4-gpu-b200 --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 1800 $RETRY_FLAG $CONTINUE_ON_ERROR_FLAG + cd test + python3 run_suite.py --hw cuda --suite stage-c-test-dsv4-4-gpu-b200 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG - unit-test-backend-4-gpu-gb200: - needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels-arm] + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh + + stage-c-test-dsv4-8-gpu-h200: + needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels] if: | always() && ( - (inputs.target_stage == 'unit-test-backend-4-gpu-gb200') || + (inputs.target_stage == 'stage-c-test-dsv4-8-gpu-h200') || ( !inputs.target_stage && ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) ) ) - runs-on: 4-gpu-gb200 + runs-on: 8-gpu-h200 timeout-minutes: 240 - env: - RUNNER_LABELS: 4-gpu-gb200 - strategy: - fail-fast: false steps: - name: Checkout code uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }} + ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + + - uses: ./.github/actions/check-stage-health + + - uses: ./.github/actions/check-maintenance - name: Download artifacts if: needs.check-changes.outputs.sgl_kernel == 'true' @@ -1543,26 +1501,79 @@ jobs: with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-python3.10-cuda12.9-aarch64 + pattern: wheel-python3.10-cuda* - name: Install dependencies - 
timeout-minutes: 10 + timeout-minutes: 30 run: | - CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 GRACE_BLACKWELL=1 bash scripts/ci/cuda/ci_install_deepep.sh + CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_flash_mla.sh - name: Run test - timeout-minutes: 45 + timeout-minutes: 30 + env: + CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} run: | - cd test/srt - RETRY_FLAG="" - if [[ "${{ needs.check-changes.outputs.enable_retry }}" == "true" ]]; then - RETRY_FLAG="--enable-retry" - fi - CONTINUE_ON_ERROR_FLAG="" - if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then - CONTINUE_ON_ERROR_FLAG="--continue-on-error" - fi - python3 run_suite.py --suite per-commit-4-gpu-gb200 --auto-partition-id 0 --auto-partition-size 1 --timeout-per-file 3600 $RETRY_FLAG $CONTINUE_ON_ERROR_FLAG + cd test + python3 run_suite.py --hw cuda --suite stage-c-test-dsv4-8-gpu-h200 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + - name: Cleanup venv + if: always() + run: bash scripts/ci/cuda/ci_cleanup_venv.sh + + # NOTE: GB200 stage temporarily disabled — no company-owned GB200 runner available yet. + # Re-enable when a 4-gpu-gb200 runner is provisioned. 
+ # stage-c-test-4-gpu-gb200: + # needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels-arm] + # if: | + # always() && + # ( + # (inputs.target_stage == 'stage-c-test-4-gpu-gb200') || + # ( + # !inputs.target_stage && + # ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) && + # ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true')) + # ) + # ) + # runs-on: 4-gpu-gb200 + # timeout-minutes: 240 + # strategy: + # fail-fast: false + # steps: + # - uses: ./.github/actions/check-maintenance + # with: + # github-token: ${{ github.token }} + # + # - name: Checkout code + # uses: actions/checkout@v4 + # with: + # ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }} + # + # - name: Download artifacts + # if: needs.check-changes.outputs.sgl_kernel == 'true' + # uses: actions/download-artifact@v4 + # with: + # path: sgl-kernel/dist/ + # merge-multiple: true + # pattern: wheel-python3.10-cuda13.0-aarch64 + # + # - name: Install dependencies + # timeout-minutes: 20 + # run: | + # CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} GRACE_BLACKWELL=1 bash scripts/ci/cuda/ci_install_deepep.sh + # + # - name: Run test + # timeout-minutes: 45 + # env: + # CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }} + # run: | + # cd test + # python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-gb200 --timeout-per-file 3600 $CONTINUE_ON_ERROR_FLAG + # + # - uses: ./.github/actions/upload-cuda-coredumps + # if: failure() pr-test-finish: needs: @@ -1571,33 +1582,31 @@ jobs: check-changes, sgl-kernel-build-wheels, - sgl-kernel-unit-test, - sgl-kernel-mla-test, - sgl-kernel-benchmark-test, - sgl-kernel-b200-test, + sgl-kernel-build-wheels-arm, + call-sgl-kernel-tests, wait-for-stage-a, wait-for-stage-b, - jit-kernel-unit-test, + call-jit-kernel-tests, - 
multimodal-gen-test-1-gpu, - multimodal-gen-test-2-gpu, + call-multimodal-gen-tests, - stage-a-test-1, - stage-a-cpu-only, - stage-b-test-small-1-gpu, - stage-b-test-large-1-gpu, - stage-b-test-large-2-gpu, - stage-c-test-large-4-gpu, + stage-a-test-1-gpu-small, + stage-a-test-cpu, + stage-b-test-1-gpu-small, + stage-b-test-1-gpu-large, + stage-b-test-2-gpu-large, stage-b-test-4-gpu-b200, - unit-test-backend-4-gpu, - unit-test-backend-8-gpu-h20, - unit-test-backend-8-gpu-h200, - unit-test-deepep-4-gpu, - unit-test-deepep-8-gpu, - unit-test-backend-4-gpu-b200, - unit-test-backend-4-gpu-gb200, + stage-c-test-4-gpu-h100, + stage-c-test-8-gpu-h20, + stage-c-test-8-gpu-h200, + stage-c-test-deepep-4-gpu-h100, + stage-c-test-deepep-8-gpu-h200, + stage-c-test-4-gpu-b200, + stage-c-test-dsv4-4-gpu-b200, + stage-c-test-dsv4-8-gpu-h200, + # stage-c-test-4-gpu-gb200, # Temporarily disabled — no GB200 runner ] if: always() runs-on: ubuntu-latest diff --git a/.github/workflows/release-branch-cut.yml b/.github/workflows/release-branch-cut.yml index f39a8c5c688a..a4ed645d5131 100644 --- a/.github/workflows/release-branch-cut.yml +++ b/.github/workflows/release-branch-cut.yml @@ -16,6 +16,8 @@ on: permissions: actions: write contents: write + issues: read + pull-requests: read jobs: cut-release-branch: @@ -85,7 +87,7 @@ jobs: echo "Branch '$BRANCH_NAME' does not exist, proceeding with creation" - - name: Create and push release branch + - name: Create release branch id: set_output run: | COMMIT_SHA="${{ steps.validate.outputs.COMMIT_SHA }}" @@ -97,11 +99,33 @@ jobs: # Create branch from the specified commit git checkout -b "$BRANCH_NAME" "$COMMIT_SHA" - # Push the new branch - git push origin "$BRANCH_NAME" - echo "branch_name=$BRANCH_NAME" >> $GITHUB_OUTPUT - echo "Successfully created and pushed branch '$BRANCH_NAME' from commit '$COMMIT_SHA'" + echo "Successfully created branch '$BRANCH_NAME' from commit '$COMMIT_SHA'" + + - name: Update version references in documentation + 
run: | + BRANCH_NAME="${{ github.event.inputs.branch_name }}" + # Extract version from branch name (e.g., release/v0.5.8 -> v0.5.8) + VERSION=$(echo "$BRANCH_NAME" | sed 's/release\///') + + # Update git clone version references in docs + sed -i "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b $VERSION/" docs/get_started/install.md + sed -i "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b $VERSION/" docs/platforms/amd_gpu.md + + # Check if any changes were made + if git diff --quiet; then + echo "No version references needed updating" + else + git add docs/get_started/install.md docs/platforms/amd_gpu.md + git commit -m "docs: update version references to $VERSION" + echo "Updated version references to $VERSION" + fi + + - name: Push release branch + run: | + BRANCH_NAME="${{ steps.set_output.outputs.branch_name }}" + git push origin "$BRANCH_NAME" + echo "Successfully pushed branch '$BRANCH_NAME'" - name: Summary run: | @@ -125,8 +149,9 @@ jobs: needs: cut-release-branch uses: ./.github/workflows/pr-test.yml with: - ref: ${{ needs.cut-release-branch.outputs.branch_name }} + git_ref: ${{ needs.cut-release-branch.outputs.branch_name }} run_all_tests: true + skip_stage_health_check: true secrets: inherit run-pr-tests-amd: diff --git a/.github/workflows/release-docker-amd-nightly.yml b/.github/workflows/release-docker-amd-nightly.yml index f188fd03d911..5cd04909e4ec 100644 --- a/.github/workflows/release-docker-amd-nightly.yml +++ b/.github/workflows/release-docker-amd-nightly.yml @@ -1,8 +1,8 @@ -name: Release Docker Images Nightly (AMD) +name: Release Docker Images Nightly ROCm7.0 (AMD) on: workflow_dispatch: schedule: - - cron: '0 13 * * *' + - cron: '0 12 * * *' concurrency: # A PR number if a pull request and otherwise the commit hash. 
This cancels @@ -20,7 +20,7 @@ jobs: strategy: fail-fast: false matrix: - gpu_arch: ['gfx942', 'gfx942-rocm700', 'gfx950'] + gpu_arch: ['gfx942', 'gfx950'] build_type: ['all'] steps: - name: Checkout repository @@ -28,6 +28,11 @@ jobs: with: fetch-depth: 0 # Required for git describe to find tags + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.10" + - name: "Set Date" run: | echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV @@ -35,31 +40,39 @@ jobs: - name: Get version from latest tag id: version run: | - # Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7) - VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//') + # Use the shared helper so stable/post releases sort above rc tags. + VERSION=$(python3 python/tools/get_version_tag.py --tag-only | sed 's/^v//') if [ -z "$VERSION" ]; then echo "::error::Could not determine version from git tags" exit 1 fi + # Get short commit hash of current HEAD + COMMIT_HASH=$(git rev-parse --short HEAD) + + # Compose pretend version for setuptools_scm: e.g., 0.5.8.dev20260129+g1a2b3c4 + PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}" + echo "version=${VERSION}" >> $GITHUB_OUTPUT + echo "pretend_version=${PRETEND_VERSION}" >> $GITHUB_OUTPUT echo "Detected version: ${VERSION}" + echo "Pretend version for pip: ${PRETEND_VERSION}" - - name: Login to Docker Hub + - name: Login to Docker Hub (AMD) uses: docker/login-action@v2 with: username: ${{ secrets.DOCKERHUB_AMD_USERNAME }} password: ${{ secrets.DOCKERHUB_AMD_TOKEN }} - - name: Build and Push + - name: Build and Push to rocm/sgl-dev run: | version=${{ steps.version.outputs.version }} + pretend_version=${{ steps.version.outputs.pretend_version }} echo "Version: ${version}" + echo "Pretend version: ${pretend_version}" if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then - rocm_tag="rocm630-mi30x" - elif [ "${{ matrix.gpu_arch }}" = "gfx942-rocm700" ]; then rocm_tag="rocm700-mi30x" elif [ "${{ 
matrix.gpu_arch }}" = "gfx950" ]; then rocm_tag="rocm700-mi35x" @@ -69,98 +82,171 @@ jobs: fi tag=v${version}-${rocm_tag} + echo "IMAGE_TAG=${tag}-${{ env.DATE }}" >> $GITHUB_ENV - docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache + # remove --build-arg NIC_BACKEND=ainic for auto detection nic support in mori + # UBUNTU_MIRROR forces apt over HTTPS to dodge port-80 reachability flakes + # to Canonical's archive.ubuntu.com mirror IPs from the amd-docker-scale runner. + docker build . -f docker/rocm.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} --build-arg UBUNTU_MIRROR=https://archive.ubuntu.com -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache docker push rocm/sgl-dev:${tag}-${{ env.DATE }} - # Temporarily disable docker cache seeding until performant storage is in place - cache: - if: false - # if: always() && github.repository == 'sgl-project/sglang' - runs-on: linux-mi300-gpu-1 + # Persist the tag right after rocm/sgl-dev push succeeds so the local + # registry mirror can run even if a later step in this job (lmsys push) + # fails. By default this step only runs when the previous step succeeded, + # so the artifact only exists when an image actually landed on Docker Hub. 
+ - name: Save published image tag + run: | + mkdir -p image-tag + echo "${{ env.IMAGE_TAG }}" > "image-tag/${{ matrix.gpu_arch }}.txt" + + - name: Upload image tag artifact + uses: actions/upload-artifact@v4 + with: + name: image-tag-${{ matrix.gpu_arch }} + path: image-tag/${{ matrix.gpu_arch }}.txt + retention-days: 1 + + - name: Login to Docker Hub (lmsys) + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + + - name: Push to lmsysorg/sglang-rocm + run: | + docker tag rocm/sgl-dev:${{ env.IMAGE_TAG }} lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }} + docker push lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }} + + # Mirror the freshly published rocm/sgl-dev image to the in-network Docker + # registry so AMD CI runners can pull without hitting Docker Hub rate limits. + # The tag is read verbatim from the publish job's artifact so this job uses + # exactly the same tag that publish pushed (only the registry prefix differs). + # `!cancelled()` lets us still mirror successful matrix legs when other legs + # of publish failed; legs without an artifact will fail at download and be + # the only ones marked red. 
+ push_local_registry: + if: ${{ !cancelled() && github.repository == 'sgl-project/sglang' }} + runs-on: linux-mi300-1gpu-sglang environment: 'prod' needs: publish strategy: fail-fast: false matrix: - gpu_arch: ['gfx942', 'gfx942-rocm700'] - build_type: ['all'] + gpu_arch: ['gfx942', 'gfx950'] steps: - - name: Checkout repository - uses: actions/checkout@v4 + - name: Download image tag artifact + uses: actions/download-artifact@v4 with: - fetch-depth: 0 # Required for git describe to find tags - - - name: "Set Date" - run: | - echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV + name: image-tag-${{ matrix.gpu_arch }} - - name: Get version from latest tag - id: version + - name: Read image tag run: | - # Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7) - VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//') - - if [ -z "$VERSION" ]; then - echo "::error::Could not determine version from git tags" + image_tag=$(tr -d '[:space:]' < "${{ matrix.gpu_arch }}.txt") + if [ -z "${image_tag}" ]; then + echo "::error::Image tag artifact is empty" exit 1 fi + echo "IMAGE_TAG=${image_tag}" >> $GITHUB_ENV + echo "Resolved IMAGE_TAG=${image_tag}" - echo "version=${VERSION}" >> $GITHUB_OUTPUT - echo "Detected version: ${VERSION}" - - - name: Login to Docker Hub + - name: Login to Docker Hub (AMD) uses: docker/login-action@v2 with: username: ${{ secrets.DOCKERHUB_AMD_USERNAME }} password: ${{ secrets.DOCKERHUB_AMD_TOKEN }} - - name: Pull and Save Docker Image to Cache + - name: Mirror rocm/sgl-dev to local registry run: | - set -euxo pipefail - - version=${{ steps.version.outputs.version }} - echo "Version: ${version}" - - if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then - rocm_tag="rocm630-mi30x" - elif [ "${{ matrix.gpu_arch }}" = "gfx942-rocm700" ]; then - rocm_tag="rocm700-mi30x" - else - echo "Unsupported gfx arch" - exit 1 - fi + src="rocm/sgl-dev:${{ env.IMAGE_TAG }}" + dst="10.245.143.50:5000/rocm/sgl-dev:${{ env.IMAGE_TAG }}" + 
docker pull "${src}" + docker tag "${src}" "${dst}" + docker push "${dst}" - tag=v${version}-${rocm_tag} - - if [ "${{ matrix.build_type }}" = "all" ]; then - tag_suffix="" - else - echo "Unsupported build type" - exit 1 - fi - - image="rocm/sgl-dev:${tag}-${{ env.DATE }}${tag_suffix}" - - # Determine target cache file name based on ROCm variant - if [[ "${rocm_tag}" == rocm630* ]]; then - final_path="/home/runner/sgl-data/docker/image.tar" - elif [[ "${rocm_tag}" == rocm700* ]]; then - final_path="/home/runner/sgl-data/docker/image-700.tar" - else - echo "Unexpected ROCm tag: ${rocm_tag}" - exit 1 - fi - - tmp_path="${final_path}.tmp" - - echo "Pulling image: ${image}" - docker pull "${image}" - - echo "Saving to temp file: ${tmp_path}" - docker save "${image}" -o "${tmp_path}" - - echo "Moving to final path: ${final_path}" - mv -f "${tmp_path}" "${final_path}" - - echo "Cache populated successfully at ${final_path}" + # Temporarily disable docker cache seeding until performant storage is in place + # cache: + # if: false + # # if: always() && github.repository == 'sgl-project/sglang' + # runs-on: linux-mi300-gpu-1 + # environment: 'prod' + # needs: publish + # strategy: + # fail-fast: false + # matrix: + # gpu_arch: ['gfx942'] + # build_type: ['all'] + # steps: + # - name: Checkout repository + # uses: actions/checkout@v4 + # with: + # fetch-depth: 0 # Required for git describe to find tags + + # - name: "Set Date" + # run: | + # echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV + + # - name: Get version from latest tag + # id: version + # run: | + # # Use the shared helper so stable/post releases sort above rc tags. 
+ # VERSION=$(python3 python/tools/get_version_tag.py --tag-only | sed 's/^v//') + + # if [ -z "$VERSION" ]; then + # echo "::error::Could not determine version from git tags" + # exit 1 + # fi + + # echo "version=${VERSION}" >> $GITHUB_OUTPUT + # echo "Detected version: ${VERSION}" + + # - name: Login to Docker Hub + # uses: docker/login-action@v2 + # with: + # username: ${{ secrets.DOCKERHUB_AMD_USERNAME }} + # password: ${{ secrets.DOCKERHUB_AMD_TOKEN }} + + # - name: Pull and Save Docker Image to Cache + # run: | + # set -euxo pipefail + + # version=${{ steps.version.outputs.version }} + # echo "Version: ${version}" + + # if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then + # rocm_tag="rocm700-mi30x" + # else + # echo "Unsupported gfx arch" + # exit 1 + # fi + + # tag=v${version}-${rocm_tag} + + # if [ "${{ matrix.build_type }}" = "all" ]; then + # tag_suffix="" + # else + # echo "Unsupported build type" + # exit 1 + # fi + + # image="rocm/sgl-dev:${tag}-${{ env.DATE }}${tag_suffix}" + + # # Determine target cache file name based on ROCm variant + # if [[ "${rocm_tag}" == rocm700* ]]; then + # final_path="/home/runner/sgl-data/docker/image-700.tar" + # else + # echo "Unexpected ROCm tag: ${rocm_tag}" + # exit 1 + # fi + + # tmp_path="${final_path}.tmp" + + # echo "Pulling image: ${image}" + # docker pull "${image}" + + # echo "Saving to temp file: ${tmp_path}" + # docker save "${image}" -o "${tmp_path}" + + # echo "Moving to final path: ${final_path}" + # mv -f "${tmp_path}" "${final_path}" + + # echo "Cache populated successfully at ${final_path}" diff --git a/.github/workflows/release-docker-amd-rocm720-nightly.yml b/.github/workflows/release-docker-amd-rocm720-nightly.yml new file mode 100644 index 000000000000..db3498e65a8c --- /dev/null +++ b/.github/workflows/release-docker-amd-rocm720-nightly.yml @@ -0,0 +1,264 @@ +name: Release Docker Images Nightly ROCm7.2 (AMD) +on: + workflow_dispatch: + inputs: + job_select: + description: 'Select which release job to 
run' + required: false + type: choice + default: 'all' + options: + - 'all' + - publish + - publish_dsv4 + schedule: + - cron: '0 12 * * *' + +concurrency: + # A PR number if a pull request and otherwise the commit hash. This cancels + # queued and in-progress runs for the same PR (presubmit) or commit + # (postsubmit). The workflow name is prepended to avoid conflicts between + # different workflows. + group: ${{ github.workflow }}-${{ github.event.number || github.sha }} + cancel-in-progress: True + +jobs: + publish: + if: github.repository == 'sgl-project/sglang' && (github.event_name != 'workflow_dispatch' || inputs.job_select == 'all' || inputs.job_select == 'publish') + runs-on: amd-docker-scale + environment: 'prod' + strategy: + fail-fast: false + matrix: + gpu_arch: ['gfx942-rocm720', 'gfx950-rocm720'] + build_type: ['all'] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 0 # Required for git describe to find tags + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.10" + + - name: "Set Date" + run: | + echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV + + - name: Get version from latest tag + id: version + run: | + # Use the shared helper so stable/post releases sort above rc tags. 
+ VERSION=$(python3 python/tools/get_version_tag.py --tag-only | sed 's/^v//') + + if [ -z "$VERSION" ]; then + echo "::error::Could not determine version from git tags" + exit 1 + fi + + # Get short commit hash of current HEAD + COMMIT_HASH=$(git rev-parse --short HEAD) + + # Compose pretend version for setuptools_scm: e.g., 0.5.8.post1.dev20260211+g1a2b3c4 + PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}" + + echo "version=${VERSION}" >> $GITHUB_OUTPUT + echo "pretend_version=${PRETEND_VERSION}" >> $GITHUB_OUTPUT + echo "Detected version: ${VERSION}" + echo "Pretend version for pip: ${PRETEND_VERSION}" + + - name: Login to Docker Hub + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_AMD_USERNAME }} + password: ${{ secrets.DOCKERHUB_AMD_TOKEN }} + + - name: Build and Push to rocm/sgl-dev + run: | + version=${{ steps.version.outputs.version }} + pretend_version=${{ steps.version.outputs.pretend_version }} + echo "Version: ${version}" + echo "Pretend version: ${pretend_version}" + + if [ "${{ matrix.gpu_arch }}" = "gfx942-rocm720" ]; then + rocm_tag="rocm720-mi30x" + elif [ "${{ matrix.gpu_arch }}" = "gfx950-rocm720" ]; then + rocm_tag="rocm720-mi35x" + else + echo "Unsupported gfx arch" + exit 1 + fi + + tag=v${version}-${rocm_tag} + echo "IMAGE_TAG=${tag}-${{ env.DATE }}" >> $GITHUB_ENV + + # remove --build-arg NIC_BACKEND=ainic for auto detection nic support in mori + # UBUNTU_MIRROR forces apt over HTTPS to dodge port-80 reachability flakes + # to Canonical's archive.ubuntu.com mirror IPs from the amd-docker-scale runner. + docker build . 
-f docker/rocm.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} --build-arg UBUNTU_MIRROR=https://archive.ubuntu.com -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache + docker push rocm/sgl-dev:${tag}-${{ env.DATE }} + + # Persist the tag right after rocm/sgl-dev push succeeds so the local + # registry mirror can run even if a later step in this job (lmsys push) + # fails. By default this step only runs when the previous step succeeded, + # so the artifact only exists when an image actually landed on Docker Hub. + - name: Save published image tag + run: | + mkdir -p image-tag + echo "${{ env.IMAGE_TAG }}" > "image-tag/${{ matrix.gpu_arch }}.txt" + + - name: Upload image tag artifact + uses: actions/upload-artifact@v4 + with: + name: image-tag-${{ matrix.gpu_arch }} + path: image-tag/${{ matrix.gpu_arch }}.txt + retention-days: 1 + + - name: Login to Docker Hub (lmsys) + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + + - name: Push to lmsysorg/sglang-rocm + run: | + docker tag rocm/sgl-dev:${{ env.IMAGE_TAG }} lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }} + docker push lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }} + + # Mirror the freshly published rocm/sgl-dev image to the in-network Docker + # registry so AMD CI runners can pull without hitting Docker Hub rate limits. + # The tag is read verbatim from the publish job's artifact so this job uses + # exactly the same tag that publish pushed (only the registry prefix differs). + # `!cancelled()` lets us still mirror successful matrix legs when other legs + # of publish failed; legs without an artifact will fail at download and be + # the only ones marked red. 
+ push_local_registry: + if: ${{ !cancelled() && github.repository == 'sgl-project/sglang' && (github.event_name != 'workflow_dispatch' || inputs.job_select == 'all' || inputs.job_select == 'publish') }} + runs-on: linux-mi300-1gpu-sglang + environment: 'prod' + needs: publish + strategy: + fail-fast: false + matrix: + gpu_arch: ['gfx942-rocm720', 'gfx950-rocm720'] + steps: + - name: Download image tag artifact + uses: actions/download-artifact@v4 + with: + name: image-tag-${{ matrix.gpu_arch }} + + - name: Read image tag + run: | + image_tag=$(tr -d '[:space:]' < "${{ matrix.gpu_arch }}.txt") + if [ -z "${image_tag}" ]; then + echo "::error::Image tag artifact is empty" + exit 1 + fi + echo "IMAGE_TAG=${image_tag}" >> $GITHUB_ENV + echo "Resolved IMAGE_TAG=${image_tag}" + + - name: Login to Docker Hub (AMD) + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_AMD_USERNAME }} + password: ${{ secrets.DOCKERHUB_AMD_TOKEN }} + + - name: Mirror rocm/sgl-dev to local registry + run: | + src="rocm/sgl-dev:${{ env.IMAGE_TAG }}" + dst="10.245.143.50:5000/rocm/sgl-dev:${{ env.IMAGE_TAG }}" + docker pull "${src}" + docker tag "${src}" "${dst}" + docker push "${dst}" + + publish_dsv4: + if: github.repository == 'sgl-project/sglang' && (github.event_name != 'workflow_dispatch' || inputs.job_select == 'all' || inputs.job_select == 'publish_dsv4') + runs-on: amd-docker-scale + environment: 'prod' + strategy: + fail-fast: false + matrix: + gpu_arch: ['gfx942-rocm720', 'gfx950-rocm720'] + build_type: ['all'] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + ref: amd/deepseek_v4 + fetch-depth: 0 # Required for git describe to find tags + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.10" + + - name: "Set Date" + run: | + echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV + + - name: Get version from latest tag + id: version + run: | + # Use the shared helper so stable/post releases sort above rc tags. 
+ VERSION=$(python3 python/tools/get_version_tag.py --tag-only | sed 's/^v//') + + if [ -z "$VERSION" ]; then + echo "::error::Could not determine version from git tags" + exit 1 + fi + + # Get short commit hash of current HEAD + COMMIT_SHA=$(git rev-parse HEAD) + COMMIT_HASH=${COMMIT_SHA:0:7} + + # Compose pretend version for setuptools_scm: e.g., 0.5.8.post1.dev20260211+g1a2b3c4 + PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}" + + echo "commit_sha=${COMMIT_SHA}" >> "$GITHUB_OUTPUT" + echo "commit_hash=${COMMIT_HASH}" >> "$GITHUB_OUTPUT" + echo "version=${VERSION}" >> "$GITHUB_OUTPUT" + echo "pretend_version=${PRETEND_VERSION}" >> "$GITHUB_OUTPUT" + echo "DeepSeek V4 commit: ${COMMIT_SHA}" + echo "Detected version: ${VERSION}" + echo "Pretend version for pip: ${PRETEND_VERSION}" + + - name: Login to Docker Hub + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_AMD_USERNAME }} + password: ${{ secrets.DOCKERHUB_AMD_TOKEN }} + + - name: Build and Push DSv4 image to rocm/sgl-dev + run: | + version=${{ steps.version.outputs.version }} + pretend_version=${{ steps.version.outputs.pretend_version }} + echo "Version: ${version}" + echo "Pretend version: ${pretend_version}" + + if [ "${{ matrix.gpu_arch }}" = "gfx942-rocm720" ]; then + rocm_tag="rocm720-mi30x" + elif [ "${{ matrix.gpu_arch }}" = "gfx950-rocm720" ]; then + rocm_tag="rocm720-mi35x" + else + echo "Unsupported gfx arch" + exit 1 + fi + + image_tag="${rocm_tag}-${{ steps.version.outputs.commit_hash }}-${{ env.DATE }}-DSv4" + echo "IMAGE_TAG=${image_tag}" >> "$GITHUB_ENV" + echo "Building rocm/sgl-dev:${image_tag} from amd/deepseek_v4 @ ${{ steps.version.outputs.commit_sha }}" + + # UBUNTU_MIRROR forces apt over HTTPS to dodge port-80 reachability flakes + # to Canonical's archive.ubuntu.com mirror IPs from the amd-docker-scale runner. + docker build . 
-f docker/rocm.Dockerfile \ + --build-arg SGL_BRANCH=${{ steps.version.outputs.commit_sha }} \ + --build-arg BUILD_TYPE=${{ matrix.build_type }} \ + --build-arg GPU_ARCH=${{ matrix.gpu_arch }} \ + --build-arg ENABLE_MORI=1 \ + --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} \ + --build-arg UBUNTU_MIRROR=https://archive.ubuntu.com \ + -t rocm/sgl-dev:${image_tag} \ + --no-cache + docker push rocm/sgl-dev:${image_tag} diff --git a/.github/workflows/release-docker-amd.yml b/.github/workflows/release-docker-amd.yml index a47104452606..6920338014f6 100644 --- a/.github/workflows/release-docker-amd.yml +++ b/.github/workflows/release-docker-amd.yml @@ -16,7 +16,8 @@ jobs: environment: 'prod' strategy: matrix: - gpu_arch: ['gfx942', 'gfx942-rocm700', 'gfx950'] + rocm_version: ['rocm700', 'rocm720'] + gpu_arch: ['gfx942', 'gfx950'] build_type: ['all'] steps: - name: Checkout repository @@ -55,19 +56,36 @@ jobs: version=${{ steps.version.outputs.version }} echo "Version: ${version}" - if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then - rocm_tag="rocm630-mi30x" - elif [ "${{ matrix.gpu_arch }}" = "gfx942-rocm700" ]; then - rocm_tag="rocm700-mi30x" - elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then - rocm_tag="rocm700-mi35x" + gpu_arch_suffix="" + if [ "${{ matrix.rocm_version }}" = "rocm700" ]; then + if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then + rocm_tag="rocm700-mi30x" + elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then + rocm_tag="rocm700-mi35x" + else + echo "Unsupported gfx arch" + exit 1 + fi + elif [ "${{ matrix.rocm_version }}" = "rocm720" ]; then + gpu_arch_suffix="-${{ matrix.rocm_version }}" + if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then + rocm_tag="rocm720-mi30x" + elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then + rocm_tag="rocm720-mi35x" + else + echo "Unsupported gfx arch" + exit 1 + fi else - echo "Unsupported gfx arch" + echo "Unsupported rocm version" exit 1 fi tag=v${version}-${rocm_tag} # rocm.Dockerfile expects SGL_BRANCH with 
'v' prefix for git tag checkout - docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg SGL_BRANCH=v${version} -t lmsysorg/sglang:${tag} --no-cache + # remove --build-arg NIC_BACKEND=ainic for auto detection nic support in mori + # UBUNTU_MIRROR forces apt over HTTPS to dodge port-80 reachability flakes + # to Canonical's archive.ubuntu.com mirror IPs from the amd-docker-scale runner. + docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }}${gpu_arch_suffix} --build-arg SGL_BRANCH=v${version} --build-arg ENABLE_MORI=1 --build-arg UBUNTU_MIRROR=https://archive.ubuntu.com -t lmsysorg/sglang:${tag} --no-cache docker push lmsysorg/sglang:${tag} diff --git a/.github/workflows/release-docker-cu13.yml b/.github/workflows/release-docker-cu13.yml deleted file mode 100644 index aa23483331ec..000000000000 --- a/.github/workflows/release-docker-cu13.yml +++ /dev/null @@ -1,122 +0,0 @@ -name: Build and Push CUDA 13 Docker Images - -# release this manually via workflow_dispatch for now -on: - workflow_dispatch: - schedule: - - cron: "0 0 * * *" -jobs: - build-dev: - if: ${{ github.repository == 'sgl-project/sglang' }} - runs-on: ${{ matrix.runner }} - strategy: - matrix: - include: - - runner: x64-docker-build-node - platform: linux/amd64 - build_type: all - grace_blackwell: 0 - tag: dev-x86-cu13 - version: 13.0.1 - - runner: arm-docker-build-node - platform: linux/arm64 - build_type: all - grace_blackwell: 1 - tag: dev-arm64-cu13 - version: 13.0.1 - steps: - - name: Delete huge unnecessary tools folder - run: rm -rf /opt/hostedtoolcache - - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Free disk space - uses: jlumbroso/free-disk-space@main - with: - tool-cache: true - docker-images: true - android: true - dotnet: true - haskell: true - large-packages: true - swap-storage: true - - - name: 
Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Build and Push Dev Image - run: | - docker buildx build \ - --platform ${{ matrix.platform }} \ - --push \ - --target framework \ - -f docker/Dockerfile \ - --build-arg CUDA_VERSION=${{ matrix.version }} \ - --build-arg BUILD_TYPE=${{ matrix.build_type }} \ - --build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \ - --build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \ - --build-arg USE_LATEST_SGLANG=1 \ - --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ - -t lmsysorg/sglang:${{ matrix.tag }} \ - --no-cache \ - . - - create-manifests: - runs-on: ubuntu-22.04 - needs: [build-dev] - if: ${{ github.repository == 'sgl-project/sglang' }} - strategy: - matrix: - variant: - - tag: dev-cu13 - x86_tag: dev-x86-cu13 - arm64_tag: dev-arm64-cu13 - steps: - - uses: docker/setup-buildx-action@v3 - - - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - run: | - docker buildx imagetools create \ - -t lmsysorg/sglang:${{ matrix.variant.tag }} \ - -t lmsysorg/sglang:nightly-${{ matrix.variant.tag }}-$(date +%Y%m%d)-${GITHUB_SHA:0:8} \ - lmsysorg/sglang:${{ matrix.variant.x86_tag }} \ - lmsysorg/sglang:${{ matrix.variant.arm64_tag }} - - - name: Cleanup Old Nightly Builds - run: | - # Get JWT token for Docker Hub API - TOKEN=$(curl -s -H "Content-Type: application/json" -X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' https://hub.docker.com/v2/users/login/ | jq -r .token) - - # Get all tags for the repository - TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100") - - # Extract tags that match our pattern and sort by last_updated timestamp (most 
recent first) - TAGS=$(echo "$TAGS_RESPONSE" | jq -r '.results[] | select(.name | startswith("nightly-${{ matrix.variant.tag }}-")) | "\(.last_updated)|\(.name)"' | sort -r | cut -d'|' -f2) - - # Count total tags and keep only the 14 most recent - TAG_COUNT=$(echo "$TAGS" | wc -l) - if [ "$TAG_COUNT" -gt 14 ]; then - echo "Found $TAG_COUNT nightly builds, keeping only the 14 most recent" - TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +15) - echo "Tags to delete: $TAGS_TO_DELETE" - - # Delete old tags - for tag in $TAGS_TO_DELETE; do - echo "Deleting tag: $tag" - curl -X DELETE \ - -H "Authorization: JWT $TOKEN" \ - "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/$tag/" - done - else - echo "Only $TAG_COUNT nightly builds found, no cleanup needed" - fi diff --git a/.github/workflows/release-docker-deepseek-v4.yml b/.github/workflows/release-docker-deepseek-v4.yml new file mode 100644 index 000000000000..e4da0f004b2a --- /dev/null +++ b/.github/workflows/release-docker-deepseek-v4.yml @@ -0,0 +1,149 @@ +name: Build and Push DeepSeek-V4 Docker Images + +# Builds the 4 Dockerfiles added in #23600 from the deepseek_v4 branch and +# pushes them to Docker Hub. Each Dockerfile is single-arch and does its own +# `git clone -b deepseek_v4` inside, so no build context source is required +# beyond the Dockerfiles themselves and `--no-cache` is mandatory. + +on: + workflow_dispatch: + inputs: + repository: + description: "Docker Hub destination repository. Default: lmsysorg/sglang-staging (set to lmsysorg/sglang for production release)." + required: false + default: "lmsysorg/sglang-staging" + build_hopper: + description: "Build and push the Hopper (H200) image." + required: false + type: boolean + default: true + build_blackwell: + description: "Build and push the Blackwell (B200) image." + required: false + type: boolean + default: true + build_b300: + description: "Build and push the B300 image." 
+ required: false + type: boolean + default: true + build_grace_blackwell: + description: "Build and push the Grace Blackwell (ARM) image." + required: false + type: boolean + default: true + build_b300_dev: + description: "Build and push the B300 image from the deepseek_v4_dev branch." + required: false + type: boolean + default: true + build_grace_blackwell_dev: + description: "Build and push the Grace Blackwell (ARM) image from the deepseek_v4_dev branch." + required: false + type: boolean + default: true + +concurrency: + group: release-docker-deepseek-v4-${{ inputs.repository }} + cancel-in-progress: true + +jobs: + build-matrix: + if: ${{ github.repository == 'sgl-project/sglang' }} + runs-on: ubuntu-latest + outputs: + include: ${{ steps.set.outputs.include }} + steps: + - id: set + env: + BUILD_HOPPER: ${{ inputs.build_hopper }} + BUILD_BLACKWELL: ${{ inputs.build_blackwell }} + BUILD_B300: ${{ inputs.build_b300 }} + BUILD_GRACE_BLACKWELL: ${{ inputs.build_grace_blackwell }} + BUILD_B300_DEV: ${{ inputs.build_b300_dev }} + BUILD_GRACE_BLACKWELL_DEV: ${{ inputs.build_grace_blackwell_dev }} + run: | + entries=() + if [ "$BUILD_HOPPER" = "true" ]; then + entries+=('{"runner":"x64-docker-build-node","platform":"linux/amd64","dockerfile":"docker/deepseek_v4_h200.Dockerfile","tag":"deepseek-v4-hopper","branch":"deepseek_v4"}') + fi + if [ "$BUILD_BLACKWELL" = "true" ]; then + entries+=('{"runner":"x64-docker-build-node","platform":"linux/amd64","dockerfile":"docker/deepseek_v4_b200.Dockerfile","tag":"deepseek-v4-blackwell","branch":"deepseek_v4"}') + fi + if [ "$BUILD_B300" = "true" ]; then + entries+=('{"runner":"x64-docker-build-node","platform":"linux/amd64","dockerfile":"docker/deepseek_v4_b300.Dockerfile","tag":"deepseek-v4-b300","branch":"deepseek_v4"}') + fi + if [ "$BUILD_GRACE_BLACKWELL" = "true" ]; then + 
entries+=('{"runner":"arm-docker-build-node","platform":"linux/arm64","dockerfile":"docker/deepseek_v4_grace_blackwell.Dockerfile","tag":"deepseek-v4-grace-blackwell","branch":"deepseek_v4"}') + fi + if [ "$BUILD_B300_DEV" = "true" ]; then + entries+=('{"runner":"x64-docker-build-node","platform":"linux/amd64","dockerfile":"docker/deepseek_v4_b300.Dockerfile","tag":"deepseek-v4-b300-dev","branch":"deepseek_v4_dev"}') + fi + if [ "$BUILD_GRACE_BLACKWELL_DEV" = "true" ]; then + entries+=('{"runner":"arm-docker-build-node","platform":"linux/arm64","dockerfile":"docker/deepseek_v4_grace_blackwell.Dockerfile","tag":"deepseek-v4-grace-blackwell-dev","branch":"deepseek_v4_dev"}') + fi + if [ ${#entries[@]} -eq 0 ]; then + echo "::error::At least one build_* input must be true." + exit 1 + fi + joined=$(IFS=,; echo "${entries[*]}") + echo "include=[${joined}]" >> "$GITHUB_OUTPUT" + echo "Selected matrix: [${joined}]" + + build-deepseek-v4: + needs: build-matrix + runs-on: ${{ matrix.runner }} + strategy: + fail-fast: false + matrix: + include: ${{ fromJson(needs.build-matrix.outputs.include) }} + steps: + - name: Delete huge unnecessary tools folder + run: rm -rf /opt/hostedtoolcache + + - name: Cleanup workspace (remove root-owned files from prior runs) + run: sudo rm -rf "$GITHUB_WORKSPACE"/* || true + + - name: Checkout deepseek_v4 sources + uses: actions/checkout@v4 + with: + ref: ${{ matrix.branch }} + + - name: Free disk space + uses: jlumbroso/free-disk-space@main + with: + tool-cache: true + docker-images: true + android: true + dotnet: true + haskell: true + large-packages: true + swap-storage: true + + - name: Prune Docker to reclaim disk space + run: | + docker buildx prune --filter "until=72h" -f + docker system prune -af --filter "until=72h" + docker volume prune -af + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + + - name: Login to Docker Hub + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + 
password: ${{ secrets.DOCKERHUB_TOKEN }} + + - name: Build and Push DeepSeek-V4 image + run: | + IMAGE="${{ inputs.repository }}:${{ matrix.tag }}" + echo "Will push: ${IMAGE}" + docker buildx build \ + --platform ${{ matrix.platform }} \ + -f ${{ matrix.dockerfile }} \ + -t "${IMAGE}" \ + --push \ + --no-cache \ + . + echo "Published ${IMAGE}" diff --git a/.github/workflows/release-docker-dev-pr.yml b/.github/workflows/release-docker-dev-pr.yml deleted file mode 100644 index 08323008cc3b..000000000000 --- a/.github/workflows/release-docker-dev-pr.yml +++ /dev/null @@ -1,116 +0,0 @@ -name: Build PR Development Docker Images - -on: - workflow_dispatch: - inputs: - pr_number: - description: 'PR number to build from' - required: true - type: string - pr_branch: - description: 'PR branch name to build from (e.g., my-feature-branch or refs/pull/123/head)' - required: true - type: string - -concurrency: - group: release-docker-dev-pr-${{ github.event.inputs.pr_number }} - cancel-in-progress: true - -jobs: - build-dev: - if: ${{ github.repository == 'sgl-project/sglang' }} - environment: "prod" - runs-on: ${{ matrix.runner }} - strategy: - matrix: - include: - - runner: x64-docker-build-node - platform: linux/amd64 - build_type: all - grace_blackwell: 0 - arch_tag: x86 - version: 12.9.1 - - runner: arm-docker-build-node - platform: linux/arm64 - build_type: all - grace_blackwell: 1 - arch_tag: arm64 - version: 12.9.1 - steps: - - name: Delete huge unnecessary tools folder - run: rm -rf /opt/hostedtoolcache - - - name: Checkout repository - uses: actions/checkout@v4 - with: - ref: ${{ inputs.pr_branch }} - - - name: Free disk space - uses: jlumbroso/free-disk-space@main - with: - tool-cache: true - docker-images: true - android: true - dotnet: true - haskell: true - large-packages: true - swap-storage: true - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v2 - with: - username: ${{ 
secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Build and Push Dev Image - run: | - tag=dev-${{ matrix.arch_tag }}-pr-${{ inputs.pr_number }} - - docker buildx build \ - --platform ${{ matrix.platform }} \ - --push \ - -f docker/Dockerfile \ - --target framework \ - --build-arg CUDA_VERSION=${{ matrix.version }} \ - --build-arg BUILD_TYPE=${{ matrix.build_type }} \ - --build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \ - --build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \ - --build-arg BRANCH_TYPE=local \ - --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ - -t lmsysorg/sglang:${tag} \ - --no-cache \ - . - - create-manifests: - runs-on: ubuntu-22.04 - needs: [build-dev] - if: ${{ github.repository == 'sgl-project/sglang' }} - environment: "prod" - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Create multi-arch manifest - run: | - # Create PR dev manifest - docker buildx imagetools create \ - -t lmsysorg/sglang:dev-pr-${{ inputs.pr_number }} \ - lmsysorg/sglang:dev-x86-pr-${{ inputs.pr_number }} \ - lmsysorg/sglang:dev-arm64-pr-${{ inputs.pr_number }} - - echo "✓ Built Docker image: lmsysorg/sglang:dev-pr-${{ inputs.pr_number }}" - echo "" - echo "Usage:" - echo " docker pull lmsysorg/sglang:dev-pr-${{ inputs.pr_number }}" diff --git a/.github/workflows/release-docker-dev.yml b/.github/workflows/release-docker-dev.yml index 19a17e21ece8..4a82281063a6 100644 --- a/.github/workflows/release-docker-dev.yml +++ b/.github/workflows/release-docker-dev.yml @@ -2,122 +2,87 @@ name: Build and Push Development Docker Images on: workflow_dispatch: + inputs: + pr_number: + description: "PR number to build from (leave empty to use current branch)" + required: false + default: "" + tag: 
+ description: "Custom tag suffix (overrides pr_number in tag). E.g. 'my-test' → dev-my-test, dev-cu13-my-test, etc." + required: false + default: "" + image_repo: + description: "Docker Hub repo to push to. Use lmsysorg/sglang-staging for testing." + required: false + default: "lmsysorg/sglang" schedule: - cron: "0 0 * * *" +concurrency: + group: release-docker-dev-${{ inputs.tag || inputs.pr_number || 'nightly' }} + cancel-in-progress: true + jobs: - build-dev: - if: ${{ github.repository == 'sgl-project/sglang' }} - runs-on: ${{ matrix.runner }} - strategy: - matrix: - include: - - runner: x64-docker-build-node - platform: linux/amd64 - build_type: all - grace_blackwell: 0 - tag: dev-x86 - version: 12.9.1 - - runner: arm-docker-build-node - platform: linux/arm64 - build_type: all - grace_blackwell: 1 - tag: dev-arm64 - version: 12.9.1 + prepare: + if: github.repository == 'sgl-project/sglang' + runs-on: ubuntu-latest + outputs: + checkout_ref: ${{ steps.config.outputs.checkout_ref }} + extra_build_args: ${{ steps.config.outputs.extra_build_args }} + tag_config: ${{ steps.config.outputs.tag_config }} steps: - - name: Delete huge unnecessary tools folder - run: rm -rf /opt/hostedtoolcache - - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Free disk space - uses: jlumbroso/free-disk-space@main - with: - tool-cache: true - docker-images: true - android: true - dotnet: true - haskell: true - large-packages: true - swap-storage: true - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Build and Push Dev Image + - name: Compute build configuration + id: config run: | - docker buildx build \ - --platform ${{ matrix.platform }} \ - --push \ - -f docker/Dockerfile \ - --target framework \ - --build-arg CUDA_VERSION=${{ matrix.version }} \ - --build-arg 
BUILD_TYPE=${{ matrix.build_type }} \ - --build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \ - --build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \ - --build-arg USE_LATEST_SGLANG=1 \ - --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ - -t lmsysorg/sglang:${{ matrix.tag }} \ - --no-cache \ - . - - create-manifests: - runs-on: ubuntu-22.04 - needs: [build-dev] - if: ${{ github.repository == 'sgl-project/sglang' }} - strategy: - matrix: - variant: - - tag: dev - x86_tag: dev-x86 - arm64_tag: dev-arm64 - steps: - - uses: docker/setup-buildx-action@v3 - - - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - run: | - SHORT_SHA="${{ github.sha }}" - docker buildx imagetools create \ - -t lmsysorg/sglang:${{ matrix.variant.tag }} \ - -t lmsysorg/sglang:nightly-${{ matrix.variant.tag }}-$(date +%Y%m%d)-${SHORT_SHA:0:8} \ - lmsysorg/sglang:${{ matrix.variant.x86_tag }} \ - lmsysorg/sglang:${{ matrix.variant.arm64_tag }} - - - name: Cleanup Old Nightly Builds - run: | - # Get JWT token for Docker Hub API - TOKEN=$(curl -s -H "Content-Type: application/json" -X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' https://hub.docker.com/v2/users/login/ | jq -r .token) - - # Get all tags for the repository - TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100") + # Determine checkout ref + if [ -n "${{ inputs.pr_number }}" ]; then + echo "checkout_ref=refs/pull/${{ inputs.pr_number }}/head" >> $GITHUB_OUTPUT + else + echo "checkout_ref=" >> $GITHUB_OUTPUT + fi - # Extract tags that match our pattern and sort by last_updated timestamp (most recent first) - TAGS=$(echo "$TAGS_RESPONSE" | jq -r '.results[] | select(.name | startswith("nightly-${{ matrix.variant.tag }}-")) | "\(.last_updated)|\(.name)"' | sort -r | cut -d'|' -f2) + # Determine extra build args + if [ "${{ 
github.event_name }}" = "schedule" ]; then + echo "extra_build_args=--build-arg USE_LATEST_SGLANG=1 --build-arg CMAKE_BUILD_PARALLEL_LEVEL=\$(nproc)" >> $GITHUB_OUTPUT + else + echo "extra_build_args=--build-arg BRANCH_TYPE=local --build-arg CMAKE_BUILD_PARALLEL_LEVEL=\$(nproc)" >> $GITHUB_OUTPUT + fi - # Count total tags and keep only the 14 most recent - TAG_COUNT=$(echo "$TAGS" | wc -l) - if [ "$TAG_COUNT" -gt 14 ]; then - echo "Found $TAG_COUNT nightly builds, keeping only the 14 most recent" - TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +15) - echo "Tags to delete: $TAGS_TO_DELETE" + # Determine tag suffix + SUFFIX="" + if [ -n "${{ inputs.tag }}" ]; then + SUFFIX="-${{ inputs.tag }}" + elif [ -n "${{ inputs.pr_number }}" ]; then + SUFFIX="-pr-${{ inputs.pr_number }}" + fi - # Delete old tags - for tag in $TAGS_TO_DELETE; do - echo "Deleting tag: $tag" - curl -X DELETE \ - -H "Authorization: JWT $TOKEN" \ - "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/$tag/" - done + # Build tag config. dev-cu13 / nightly-dev-cu13 are published as + # aliases on the cu130 image for backwards compatibility with + # consumers pinned to the pre-flip names. 
+ if [ -z "${SUFFIX}" ]; then + # Nightly: include dated tags + TAG_CONFIG='[{"cuda":"cu129","tags":["dev-cu12","nightly-dev-cu12-{date}-{short_sha}"]},{"cuda":"cu130","tags":["dev","dev-cu13","nightly-dev-{date}-{short_sha}","nightly-dev-cu13-{date}-{short_sha}"]}]' else - echo "Only $TAG_COUNT nightly builds found, no cleanup needed" + TAG_CONFIG="[{\"cuda\":\"cu129\",\"tags\":[\"dev-cu12${SUFFIX}\"]},{\"cuda\":\"cu130\",\"tags\":[\"dev${SUFFIX}\",\"dev-cu13${SUFFIX}\"]}]" fi + echo "tag_config=${TAG_CONFIG}" >> $GITHUB_OUTPUT + + build-and-publish: + needs: prepare + uses: ./.github/workflows/_docker-build-and-publish.yml + with: + docker_target: framework_final + checkout_ref: ${{ needs.prepare.outputs.checkout_ref }} + extra_build_args: ${{ needs.prepare.outputs.extra_build_args }} + tag_config: ${{ needs.prepare.outputs.tag_config }} + image_repo: ${{ inputs.image_repo || 'lmsysorg/sglang' }} + secrets: inherit + + cleanup-nightly: + needs: build-and-publish + if: ${{ !inputs.tag && !inputs.pr_number }} + uses: ./.github/workflows/_docker-cleanup-nightly.yml + with: + tag_prefixes: '["nightly-dev", "nightly-dev-cu12", "nightly-dev-cu13"]' + image_repo: ${{ inputs.image_repo || 'lmsysorg/sglang' }} + secrets: inherit diff --git a/.github/workflows/release-docker-npu-nightly.yml b/.github/workflows/release-docker-npu-nightly.yml index 7b66eba246d8..8866ae2a2776 100644 --- a/.github/workflows/release-docker-npu-nightly.yml +++ b/.github/workflows/release-docker-npu-nightly.yml @@ -1,8 +1,14 @@ name: Release Docker Images Nightly (NPU) on: + pull_request: + branches: + - 'main' + paths: + - '.github/workflows/release-docker-npu-nightly.yml' + - 'docker/npu.Dockerfile' workflow_dispatch: schedule: - - cron: "0 0 * * *" + - cron: "0 16 * * *" # Execute at 0:00 a.m. 
Beijing Time every day concurrency: group: ${{ github.workflow }}-${{ github.sha }} @@ -13,7 +19,7 @@ jobs: runs-on: ubuntu-22.04-arm strategy: matrix: - cann_version: ["8.3.rc2"] + cann_version: ["8.5.0"] device_type: ["910b", "a3"] steps: - name: Checkout repository @@ -52,6 +58,14 @@ jobs: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_TOKEN }} + # Enable Docker multi-architecture build environment + # Emulate non-native architectures + - name: Set up QEMU + uses: docker/setup-qemu-action@v3 + # Required for building and pushing multi-arch Docker images + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + # Build and push Docker image with Buildx (don't push on PR) # https://github.com/docker/build-push-action - name: Build and push Docker image @@ -60,13 +74,12 @@ jobs: with: context: docker file: docker/npu.Dockerfile - # TODO: need add x86 platforms support when memfabric is ready - platforms: linux/arm64 + platforms: linux/arm64,linux/amd64 labels: ${{ steps.meta.outputs.labels }} tags: ${{ steps.meta.outputs.tags }} push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }} provenance: false build-args: | - SGLANG_KERNEL_NPU_TAG=2025.12.31 + SGLANG_KERNEL_NPU_TAG=2026.03.10.rc1 CANN_VERSION=${{ matrix.cann_version }} DEVICE_TYPE=${{ matrix.device_type }} diff --git a/.github/workflows/release-docker-npu.yml b/.github/workflows/release-docker-npu.yml index 850efbae018f..f5788c2a77c0 100644 --- a/.github/workflows/release-docker-npu.yml +++ b/.github/workflows/release-docker-npu.yml @@ -14,7 +14,7 @@ jobs: runs-on: ubuntu-22.04-arm strategy: matrix: - cann_version: ["8.3.rc2"] + cann_version: ["8.5.0"] device_type: ["910b", "a3"] steps: - name: Checkout repository @@ -67,6 +67,13 @@ jobs: fi echo "version=v${VERSION}" >> $GITHUB_OUTPUT echo "TAG=lmsysorg/sglang:v${VERSION}-cann${{ matrix.cann_version }}-${{ matrix.device_type }}" >> $GITHUB_OUTPUT + # Enable Docker 
multi-architecture build environment + # Emulate non-native architectures + - name: Set up QEMU + uses: docker/setup-qemu-action@v3 + # Required for building and pushing multi-arch Docker images + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 - name: Build and push Docker image id: build-and-push @@ -74,14 +81,13 @@ jobs: with: context: docker file: docker/npu.Dockerfile - # TODO: need add x86 platforms support when memfabric is ready - platforms: linux/arm64 + platforms: linux/arm64,linux/amd64 labels: ${{ steps.meta.outputs.labels }} tags: ${{ steps.meta.outputs.tags || steps.version.outputs.TAG }} push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }} provenance: false build-args: | - SGLANG_KERNEL_NPU_TAG=2025.12.31 + SGLANG_KERNEL_NPU_TAG=2026.03.10.rc1 CANN_VERSION=${{ matrix.cann_version }} DEVICE_TYPE=${{ matrix.device_type }} SGLANG_TAG=${{ steps.version.outputs.version }} diff --git a/.github/workflows/release-docker-runtime.yml b/.github/workflows/release-docker-runtime.yml new file mode 100644 index 000000000000..0e224bf91e33 --- /dev/null +++ b/.github/workflows/release-docker-runtime.yml @@ -0,0 +1,55 @@ +name: Release Docker Runtime Images +# +# Builds and publishes runtime Docker images (production-optimized, ~50% smaller): +# - lmsysorg/sglang:v{version}-runtime, lmsysorg/sglang:latest-runtime +# - lmsysorg/sglang:v{version}-cu129-runtime, lmsysorg/sglang:latest-cu129-runtime +# +on: + push: + tags: + - "v[0-9]+.*" + workflow_dispatch: + inputs: + version: + description: "Version to build (without v prefix, e.g., 0.5.7)" + required: true + image_repo: + description: "Docker Hub repo to push to. Use lmsysorg/sglang-staging for testing." 
+ required: false + default: "lmsysorg/sglang" + +jobs: + resolve-version: + if: github.repository == 'sgl-project/sglang' + runs-on: ubuntu-latest + outputs: + version: ${{ steps.version.outputs.version }} + steps: + - name: Get version + id: version + run: | + if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then + VERSION="${{ github.event.inputs.version }}" + else + VERSION="${GITHUB_REF_NAME#v}" + fi + if [ -z "$VERSION" ] || ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then + echo "::error::Invalid version: $VERSION (expected: X.Y.Z)" + exit 1 + fi + echo "version=${VERSION}" >> $GITHUB_OUTPUT + + build-and-publish: + needs: resolve-version + uses: ./.github/workflows/_docker-build-and-publish.yml + with: + docker_target: runtime + sgl_version: ${{ needs.resolve-version.outputs.version }} + use_environment: prod + image_repo: ${{ inputs.image_repo || 'lmsysorg/sglang' }} + tag_config: | + [ + {"cuda": "cu130", "tags": ["v${{ needs.resolve-version.outputs.version }}-runtime", "latest-runtime", "v${{ needs.resolve-version.outputs.version }}-cu130-runtime", "latest-cu130-runtime"]}, + {"cuda": "cu129", "tags": ["v${{ needs.resolve-version.outputs.version }}-cu129-runtime", "latest-cu129-runtime"]} + ] + secrets: inherit diff --git a/.github/workflows/release-docker.yml b/.github/workflows/release-docker.yml index d10f9261ee55..edf21469e134 100644 --- a/.github/workflows/release-docker.yml +++ b/.github/workflows/release-docker.yml @@ -1,322 +1,55 @@ name: Release Docker Images # -# This workflow builds and publishes both framework and runtime Docker images: -# -# Framework images (full development environment): -# - lmsysorg/sglang:v{version}, lmsysorg/sglang:latest -# - lmsysorg/sglang:v{version}-cu129-{amd64,arm64} -# -# Runtime images (production-optimized, ~50% smaller): -# - lmsysorg/sglang:v{version}-runtime, lmsysorg/sglang:latest-runtime -# - lmsysorg/sglang:v{version}-cu129-{amd64,arm64}-runtime +# Builds and publishes framework 
Docker images (full development environment): +# - lmsysorg/sglang:v{version}, lmsysorg/sglang:latest (cuda 13) +# - lmsysorg/sglang:v{version}-cu129, lmsysorg/sglang:latest-cu129 # on: push: tags: - - 'v[0-9]+.*' + - "v[0-9]+.*" workflow_dispatch: inputs: version: - description: 'Version to build (without v prefix, e.g., 0.5.7)' + description: "Version to build (without v prefix, e.g., 0.5.7)" required: true + image_repo: + description: "Docker Hub repo to push to. Use lmsysorg/sglang-staging for testing." + required: false + default: "lmsysorg/sglang" jobs: - publish-x86: + resolve-version: if: github.repository == 'sgl-project/sglang' - environment: "prod" - strategy: - matrix: - variant: - - cuda_version: "12.9.1" - build_type: "all" - grace_blackwell: 0 - runs-on: x64-docker-build-node + runs-on: ubuntu-latest + outputs: + version: ${{ steps.version.outputs.version }} steps: - - name: Delete huge unnecessary tools folder - run: rm -rf /opt/hostedtoolcache - - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Free disk space - uses: jlumbroso/free-disk-space@main - with: - tool-cache: false - docker-images: false - android: true - dotnet: true - haskell: true - large-packages: true - swap-storage: false - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Get version from tag + - name: Get version id: version run: | if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then VERSION="${{ github.event.inputs.version }}" else - # Extract version from tag (e.g., v0.5.7 -> 0.5.7) VERSION="${GITHUB_REF_NAME#v}" fi - - # Validate version format - if [ -z "$VERSION" ]; then - echo "::error::Version is empty" - exit 1 - fi - if ! 
echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then - echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)" + if [ -z "$VERSION" ] || ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then + echo "::error::Invalid version: $VERSION (expected: X.Y.Z)" exit 1 fi - echo "version=${VERSION}" >> $GITHUB_OUTPUT - - name: Build AMD64 Framework - run: | - version=${{ steps.version.outputs.version }} - tag=v${version}-cu129-amd64 - - docker buildx build \ - --target framework \ - --platform linux/amd64 \ - --push \ - -f docker/Dockerfile \ - --build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \ - --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \ - --build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \ - --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ - --build-arg SGL_VERSION=${version} \ - -t lmsysorg/sglang:${tag} \ - --no-cache \ - . - - - name: Build and Push AMD64 Runtime - run: | - version=${{ steps.version.outputs.version }} - tag=v${version}-cu129-amd64-runtime - - docker buildx build \ - --target runtime \ - --platform linux/amd64 \ - --push \ - -f docker/Dockerfile \ - --build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \ - --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \ - --build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \ - --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ - --build-arg SGL_VERSION=${version} \ - -t lmsysorg/sglang:${tag} \ - --no-cache \ - . - - - name: Build and Push AMD64 Runtime (CUDA 13) - run: | - version=${{ steps.version.outputs.version }} - tag=v${version}-cu130-amd64-runtime - - docker buildx build \ - --target runtime \ - --platform linux/amd64 \ - --push \ - -f docker/Dockerfile \ - --build-arg CUDA_VERSION=13.0.1 \ - --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \ - --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ - --build-arg GRACE_BLACKWELL=0 \ - --build-arg SGL_VERSION=${version} \ - -t lmsysorg/sglang:${tag} \ - --no-cache \ - . 
- - publish-arm64: - if: github.repository == 'sgl-project/sglang' - environment: "prod" - strategy: - matrix: - variant: - - cuda_version: "12.9.1" - build_type: "all" - grace_blackwell: 1 - runs-on: arm-docker-build-node - steps: - - name: Delete huge unnecessary tools folder - run: rm -rf /opt/hostedtoolcache - - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Get version from tag - id: version - run: | - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - VERSION="${{ github.event.inputs.version }}" - else - # Extract version from tag (e.g., v0.5.7 -> 0.5.7) - VERSION="${GITHUB_REF_NAME#v}" - fi - - # Validate version format - if [ -z "$VERSION" ]; then - echo "::error::Version is empty" - exit 1 - fi - if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then - echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)" - exit 1 - fi - - echo "version=${VERSION}" >> $GITHUB_OUTPUT - - - name: Build ARM64 Framework - run: | - version=${{ steps.version.outputs.version }} - tag=v${version}-cu129-arm64 - - docker buildx build \ - --target framework \ - --platform linux/arm64 \ - --push \ - -f docker/Dockerfile \ - --build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \ - --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \ - --build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \ - --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ - --build-arg SGL_VERSION=${version} \ - -t lmsysorg/sglang:${tag} \ - --no-cache \ - . 
- - - name: Build and Push ARM64 Runtime - run: | - version=${{ steps.version.outputs.version }} - tag=v${version}-cu129-arm64-runtime - - docker buildx build \ - --target runtime \ - --platform linux/arm64 \ - --push \ - -f docker/Dockerfile \ - --build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \ - --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \ - --build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \ - --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \ - --build-arg SGL_VERSION=${version} \ - -t lmsysorg/sglang:${tag} \ - --no-cache \ - . - - - name: Build and Push ARM64 Runtime (CUDA 13) - run: | - version=${{ steps.version.outputs.version }} - tag=v${version}-cu130-arm64-runtime - - docker buildx build \ - --target runtime \ - --platform linux/arm64 \ - --push \ - -f docker/Dockerfile \ - --build-arg CUDA_VERSION=13.0.1 \ - --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \ - --build-arg GRACE_BLACKWELL=1 \ - --build-arg SGL_VERSION=${version} \ - -t lmsysorg/sglang:${tag} \ - --no-cache \ - . - - create-manifests: - runs-on: ubuntu-22.04 - needs: [publish-x86, publish-arm64] - if: github.repository == 'sgl-project/sglang' - environment: "prod" - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Get version from tag - id: version - run: | - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - VERSION="${{ github.event.inputs.version }}" - else - # Extract version from tag (e.g., v0.5.7 -> 0.5.7) - VERSION="${GITHUB_REF_NAME#v}" - fi - - # Validate version format - if [ -z "$VERSION" ]; then - echo "::error::Version is empty" - exit 1 - fi - if ! 
echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then - echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)" - exit 1 - fi - - echo "version=${VERSION}" >> $GITHUB_OUTPUT - - - name: Create multi-arch manifests - run: | - version=${{ steps.version.outputs.version }} - - # Create versioned framework manifest (default) - docker buildx imagetools create \ - -t lmsysorg/sglang:v${version} \ - lmsysorg/sglang:v${version}-cu129-amd64 \ - lmsysorg/sglang:v${version}-cu129-arm64 - - # Create latest framework manifest (default) - docker buildx imagetools create \ - -t lmsysorg/sglang:latest \ - lmsysorg/sglang:v${version}-cu129-amd64 \ - lmsysorg/sglang:v${version}-cu129-arm64 - - # Create versioned runtime manifest - docker buildx imagetools create \ - -t lmsysorg/sglang:v${version}-runtime \ - lmsysorg/sglang:v${version}-cu129-amd64-runtime \ - lmsysorg/sglang:v${version}-cu129-arm64-runtime - - # Create latest runtime manifest - docker buildx imagetools create \ - -t lmsysorg/sglang:latest-runtime \ - lmsysorg/sglang:v${version}-cu129-amd64-runtime \ - lmsysorg/sglang:v${version}-cu129-arm64-runtime - - # Create versioned CUDA 13 runtime manifest - docker buildx imagetools create \ - -t lmsysorg/sglang:v${version}-cu130-runtime \ - lmsysorg/sglang:v${version}-cu130-amd64-runtime \ - lmsysorg/sglang:v${version}-cu130-arm64-runtime - - # Create latest CUDA 13 runtime manifest - docker buildx imagetools create \ - -t lmsysorg/sglang:latest-cu130-runtime \ - lmsysorg/sglang:v${version}-cu130-amd64-runtime \ - lmsysorg/sglang:v${version}-cu130-arm64-runtime + build-and-publish: + needs: resolve-version + uses: ./.github/workflows/_docker-build-and-publish.yml + with: + docker_target: framework_final + sgl_version: ${{ needs.resolve-version.outputs.version }} + use_environment: prod + image_repo: ${{ inputs.image_repo || 'lmsysorg/sglang' }} + tag_config: | + [ + {"cuda": "cu130", "tags": ["v${{ needs.resolve-version.outputs.version }}", "latest", "v${{ 
needs.resolve-version.outputs.version }}-cu130", "latest-cu130"]}, + {"cuda": "cu129", "tags": ["v${{ needs.resolve-version.outputs.version }}-cu129", "latest-cu129"]} + ] + secrets: inherit diff --git a/.github/workflows/release-docs.yml b/.github/workflows/release-docs.yml index 07e6871a2960..eefd15cd025d 100644 --- a/.github/workflows/release-docs.yml +++ b/.github/workflows/release-docs.yml @@ -1,6 +1,8 @@ name: Release Documentation on: + release: + types: [published] push: branches: - main @@ -14,14 +16,22 @@ concurrency: group: release-docs-${{ github.ref }} cancel-in-progress: true +env: + SGLANG_IS_IN_CI: true + jobs: execute-and-deploy: - runs-on: 1-gpu-runner + runs-on: 1-gpu-h100 if: github.repository == 'sgl-project/sglang' steps: - name: Checkout code uses: actions/checkout@v4 + - name: Fetch full git history for release index + if: github.event_name == 'release' + run: | + git fetch --prune --unshallow || git fetch --prune + - name: Install dependencies run: | bash scripts/ci/cuda/ci_install_dependency.sh @@ -47,12 +57,26 @@ jobs: run: | cd docs make html + make markdown python3 wrap_run_llm.py + if [[ "${{ github.event_name }}" == "release" ]]; then + python3 release_lookup/generate_index.py --output release_lookup/release_index.json + + # Copy release lookup tool for official docs on published releases.
+ mkdir -p _build/html/release_lookup + cp release_lookup/index.html _build/html/release_lookup/ + cp release_lookup/release_index.json _build/html/release_lookup/ + fi + cd _build/html git clone https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git ../sgl-project.github.io --depth 1 - find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete + if [[ "${{ github.event_name }}" == "release" ]]; then + find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete + else + find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -path "../sgl-project.github.io/release_lookup*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete + fi cp -r * ../sgl-project.github.io cp ../../README.md ../sgl-project.github.io/README.md cd ../sgl-project.github.io diff --git a/.github/workflows/release-pypi-nightly.yml b/.github/workflows/release-pypi-nightly.yml index 93971b9ef7ff..f94a316bfc58 100644 --- a/.github/workflows/release-pypi-nightly.yml +++ b/.github/workflows/release-pypi-nightly.yml @@ -14,11 +14,6 @@ on: description: 'Specific commit SHA to build (leave empty for latest)' required: false type: string - cuda_version: - description: 'CUDA version (e.g., 129 or 130)' - required: false - default: '129' - type: string concurrency: group: release-pypi-nightly-${{ github.ref }} @@ -48,25 +43,38 @@ jobs: run: | pip install build wheel setuptools setuptools-scm + # Needed by setuptools-rust to build the bundled native gRPC extension + # (rust/sglang-grpc) when `python -m build` builds the sglang wheel. + - name: Install protoc + run: sudo bash scripts/ci/utils/install_protoc.sh + - name: Build wheel id: build run: | cd python cp ../README.md ../LICENSE . 
- # Parse git describe output to detect exact tag builds (distance=0) - # Use same command as pyproject.toml to ensure version consistency - DESC=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1 | xargs git describe --tags --long 2>/dev/null || echo 'v0.0.0-0-g0000000') - DIST=$(echo "$DESC" | cut -d- -f2) - - # If building at exact tag (distance=0), force dev0 version for unique wheel names - if [ "$DIST" = "0" ]; then - TAG=$(echo "$DESC" | cut -d- -f1) - HASH=$(echo "$DESC" | cut -d- -f3-) - FORCE_VERSION="${TAG#v}.dev0+${HASH}" - echo "Building at exact tag, forcing version to: $FORCE_VERSION" - export SETUPTOOLS_SCM_PRETEND_VERSION="$FORCE_VERSION" - fi + TAG=$(python3 ../python/tools/get_version_tag.py) + HASH="g$(git rev-parse --short HEAD)" + BUILD_DATE=$(date -u +%Y%m%d) + + # Increment patch version for nightlies (e.g., v0.5.9 -> 0.5.10) + # Must always increment so nightly > latest tag per PEP 440 ordering: + # X.Y.Z.devN < X.Y.Z.rcN < X.Y.Z < X.Y.(Z+1).devN + VERSION=${TAG#v} # Remove 'v' prefix + MAJOR=$(echo "$VERSION" | cut -d. -f1) + MINOR=$(echo "$VERSION" | cut -d. -f2) + PATCH_RAW=$(echo "$VERSION" | cut -d. -f3) + # Strip pre-release suffixes (rc0, post1, etc.) 
to get numeric patch + PATCH=$(echo "$PATCH_RAW" | sed 's/[^0-9].*//') + NEXT_PATCH=$((PATCH + 1)) + NEXT_VERSION="${MAJOR}.${MINOR}.${NEXT_PATCH}" + + # Use date-based dev number for correct chronological sorting + # e.g., 0.5.9.dev20260215+g4cf4f0859 > 0.5.9.dev20260214+g45a4697d4 + FORCE_VERSION="${NEXT_VERSION}.dev${BUILD_DATE}+${HASH}" + echo "Forcing nightly version to: $FORCE_VERSION" + export SETUPTOOLS_SCM_PRETEND_VERSION="$FORCE_VERSION" # Build wheel python3 -m build --wheel @@ -99,6 +107,15 @@ jobs: needs: build-nightly-wheel runs-on: ubuntu-latest environment: 'prod' + strategy: + fail-fast: false + # The wheel is CUDA-agnostic and built once — we just register the same + # artifact under cu129/sglang/ and cu130/sglang/ wheel indexes so users + # can install via either --extra-index-url. Serialize because both matrix + # runs clone and push to the same sgl-whl branch. + max-parallel: 1 + matrix: + cuda_version: ['129', '130'] steps: - uses: actions/checkout@v4 @@ -147,12 +164,12 @@ jobs: python3 scripts/update_nightly_whl_index.py \ --commit-hash ${{ needs.build-nightly-wheel.outputs.commit_hash }} \ --nightly-version ${{ needs.build-nightly-wheel.outputs.nightly_version }} \ - --cuda-version ${{ inputs.cuda_version || '129' }} \ + --cuda-version ${{ matrix.cuda_version }} \ --build-date ${{ needs.build-nightly-wheel.outputs.build_date }} - name: Push wheel index run: | cd sgl-whl git add -A - git diff --staged --quiet || git commit -m "Update nightly wheel index for commit ${{ needs.build-nightly-wheel.outputs.commit_hash }}" + git diff --staged --quiet || git commit -m "Update cu${{ matrix.cuda_version }} nightly wheel index for commit ${{ needs.build-nightly-wheel.outputs.commit_hash }}" git push diff --git a/.github/workflows/release-pypi-pr.yml b/.github/workflows/release-pypi-pr.yml index deff4665c574..46632b8f3719 100644 --- a/.github/workflows/release-pypi-pr.yml +++ b/.github/workflows/release-pypi-pr.yml @@ -4,11 +4,7 @@ on: 
workflow_dispatch: inputs: pr_number: - description: 'PR number to build wheel for' - required: true - type: string - pr_branch: - description: 'PR branch name to build from (e.g., my-feature-branch or refs/pull/123/head)' + description: 'PR number to build wheel for (works with both internal and fork PRs)' required: true type: string @@ -27,7 +23,7 @@ jobs: steps: - uses: actions/checkout@v4 with: - ref: ${{ inputs.pr_branch }} + ref: refs/pull/${{ inputs.pr_number }}/head fetch-depth: 0 # Need full history for version generation - name: Set up Python @@ -38,13 +34,9 @@ jobs: - name: Generate PR wheel version id: gen_version run: | - # Get base version from setuptools_scm - cd python - pip install setuptools-scm - FULL_VERSION=$(python -c "from setuptools_scm import get_version; print(get_version(root='..'))") - # Strip any existing .dev or + suffix to get clean base version - BASE_VERSION=$(echo "$FULL_VERSION" | sed 's/\.dev.*//;s/+.*//') - cd .. + LATEST_TAG=$(python3 python/tools/get_version_tag.py) + BASE_VERSION=${LATEST_TAG#v} + echo "Latest release tag: ${LATEST_TAG}" # Get commit info COMMIT_HASH=$(git rev-parse --short HEAD) diff --git a/.github/workflows/release-pypi.yml b/.github/workflows/release-pypi.yml index e6190c03254e..050f1c0445c0 100644 --- a/.github/workflows/release-pypi.yml +++ b/.github/workflows/release-pypi.yml @@ -4,28 +4,113 @@ on: tags: - 'v[0-9]+.*' workflow_dispatch: + inputs: + version: + description: 'Release version to publish (e.g. v0.5.11). Overrides setuptools-scm.' 
+ required: true + type: string jobs: - publish: + build: if: github.repository == 'sgl-project/sglang' - runs-on: ubuntu-latest - environment: "prod" + strategy: + fail-fast: false + matrix: + python-version: ["3.10", "3.11", "3.12", "3.13"] + arch: [x86_64, aarch64] + include: + - arch: x86_64 + runner: x64-docker-build-node + - arch: aarch64 + runner: arm-docker-build-node + runs-on: ${{ matrix.runner }} steps: - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.10" + # Self-hosted build nodes retain the workspace across jobs. Prior builds + # leave root-owned artifacts behind that actions/checkout cannot remove, + # causing EACCES on rmdir. Wipe them via a throwaway root container. + - name: Clean workspace (remove root-owned files from prior runs) + run: | + docker run --rm -v "${{ github.workspace }}:/workspace" alpine:3 \ + sh -c 'rm -rf /workspace/..?* /workspace/.[!.]* /workspace/*' || true - name: Checkout repository uses: actions/checkout@v4 with: fetch-depth: 0 # Required for setuptools-scm to determine version from tags + submodules: "recursive" - - name: Upload to pypi + - name: Set up Python ${{ matrix.python-version }} + uses: actions/setup-python@v5 + with: + python-version: ${{ matrix.python-version }} + + # Needed by setuptools-rust to build the bundled native gRPC extension + # (rust/sglang-grpc) when `python -m build` builds the sglang wheel. + - name: Install protoc + run: sudo bash scripts/ci/utils/install_protoc.sh + + - name: Install Rust toolchain + uses: dtolnay/rust-toolchain@stable + + - name: Build wheel + env: + # When triggered manually, pin the wheel version to the supplied + # input (strip leading `v`) so setuptools-scm doesn't derive a + # `.devN+gHASH` version from the branch HEAD. + RELEASE_VERSION: ${{ inputs.version }} run: | cd python cp ../README.md ../LICENSE . 
- pip install build wheel setuptools setuptools-scm - python3 -m build + pip install build wheel setuptools setuptools-scm setuptools-rust + if [ -n "$RELEASE_VERSION" ]; then + export SETUPTOOLS_SCM_PRETEND_VERSION="${RELEASE_VERSION#v}" + echo "Pinning wheel version to $SETUPTOOLS_SCM_PRETEND_VERSION" + fi + python3 -m build --wheel + + # PyPI rejects plain `linux_x86_64` / `linux_aarch64` platform tags; + # auditwheel rewrites the wheel's platform tag to a `manylinux_*` tag + # and bundles any external native deps. The runner's glibc determines + # the lowest acceptable manylinux policy. + - name: Repair wheel for manylinux + run: | + cd python + pip install auditwheel patchelf + mkdir -p dist-repaired + python3 -m auditwheel repair dist/*.whl -w dist-repaired/ + rm dist/*.whl + mv dist-repaired/*.whl dist/ + + - name: Upload artifacts + uses: actions/upload-artifact@v4 + with: + name: wheel-py${{ matrix.python-version }}-${{ matrix.arch }} + path: python/dist/*.whl + + publish: + needs: build + if: github.repository == 'sgl-project/sglang' + runs-on: ubuntu-latest + environment: "prod" + steps: + - name: Download all wheels + uses: actions/download-artifact@v4 + with: + path: dist/ + merge-multiple: true + pattern: wheel-* + + - name: List wheels + run: ls -lh dist/ + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.10" + + - name: Upload to pypi + run: | pip install twine - python3 -m twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }} + # --skip-existing makes partial reruns safe (e.g. one arch failed, + # rerun only the failed shard without re-uploading already-published wheels). 
+ python3 -m twine upload --skip-existing dist/*.whl -u __token__ -p ${{ secrets.PYPI_TOKEN }} diff --git a/.github/workflows/release-whl-deepgemm.yml b/.github/workflows/release-whl-deepgemm.yml new file mode 100644 index 000000000000..404dbb20ed5d --- /dev/null +++ b/.github/workflows/release-whl-deepgemm.yml @@ -0,0 +1,211 @@ +name: Release sgl-deep-gemm + +on: + workflow_dispatch: + inputs: + version: + description: "Wheel version (e.g. 0.1.0, 0.1.1rc0)" + type: string + required: true + target: + type: choice + description: "Build target (default: all)" + required: false + default: 'all' + options: + - 'all' + - 'cu129' + - 'cu130' + branch: + description: "DeepGEMM branch to build from (default: release-0426)" + type: string + required: false + default: 'release-0426' + +concurrency: + group: release-sgl-deepgemm-${{ github.ref }} + cancel-in-progress: true + +jobs: + build-cu129-matrix: + if: | + github.repository == 'sgl-project/sglang' && + (github.event.inputs.target == 'all' || github.event.inputs.target == 'cu129') + strategy: + matrix: + python-version: ["3.12"] + cuda-version: ["12.9"] + arch: [x86_64, aarch64] + include: + - arch: x86_64 + runner: x64-kernel-build-node + - arch: aarch64 + runner: arm-kernel-build-node + runs-on: ${{ matrix.runner }} + steps: + - name: Clean workspace (remove root-owned files from prior runs) + run: | + docker run --rm -v "${{ github.workspace }}:/workspace" alpine:3 \ + sh -c 'rm -rf /workspace/..?* /workspace/.[!.]* /workspace/*' || true + + - uses: actions/checkout@v4 + + - name: Checkout DeepGEMM + uses: actions/checkout@v4 + with: + repository: sgl-project/DeepGEMM + ref: ${{ inputs.branch }} + path: DeepGEMM + submodules: recursive + + - name: Set wheel version + run: | + echo -n "${{ inputs.version }}" > DeepGEMM/sgl_deep_gemm/VERSION + cat DeepGEMM/sgl_deep_gemm/VERSION + + - name: Build wheel + run: | + chmod +x ./scripts/build_sgl_deep_gemm.sh ./scripts/rename_sgl_deep_gemm_whl.sh + 
./scripts/build_sgl_deep_gemm.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" "${{ github.workspace }}/DeepGEMM" "${{ matrix.arch }}" + + - name: Upload artifacts + uses: actions/upload-artifact@v4 + with: + name: deepgemm-wheel-cuda${{ matrix.cuda-version }}-${{ matrix.arch }} + path: DeepGEMM/dist/*.whl + + release-cu129: + needs: build-cu129-matrix + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + path: dist/ + merge-multiple: true + pattern: deepgemm-wheel-cuda12.9-* + + - name: Release + uses: softprops/action-gh-release@v2 + with: + tag_name: v${{ inputs.version }} + repository: sgl-project/whl + token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }} + files: | + dist/* + + - name: Clone wheel index + run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl + env: + WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }} + + - name: Update wheel index + run: python3 scripts/update_deepgemm_whl_index.py --cuda 129 + + - name: Push wheel index + run: | + cd sgl-whl + git config --local user.name "sglang-bot" + git config --local user.email "sglangbot@gmail.com" + git add -A + git commit -m "update sgl-deep-gemm whl index for v${{ inputs.version }}" + git push + + build-cu130-matrix: + if: | + github.repository == 'sgl-project/sglang' && + (github.event.inputs.target == 'all' || github.event.inputs.target == 'cu130') + strategy: + matrix: + python-version: ["3.12"] + cuda-version: ["13.0"] + arch: [x86_64, aarch64] + include: + - arch: x86_64 + runner: x64-kernel-build-node + - arch: aarch64 + runner: arm-kernel-build-node + runs-on: ${{ matrix.runner }} + steps: + - name: Clean workspace (remove root-owned files from prior runs) + run: | + docker run --rm -v "${{ github.workspace }}:/workspace" alpine:3 \ + sh -c 'rm -rf /workspace/..?* /workspace/.[!.]* /workspace/*' || true + + - uses: actions/checkout@v4 + + - name: Checkout DeepGEMM + uses: 
actions/checkout@v4 + with: + repository: sgl-project/DeepGEMM + ref: ${{ inputs.branch }} + path: DeepGEMM + submodules: recursive + + - name: Set up Python ${{ matrix.python-version }} + uses: actions/setup-python@v5 + with: + python-version: ${{ matrix.python-version }} + + - name: Set wheel version + run: | + echo -n "${{ inputs.version }}" > DeepGEMM/sgl_deep_gemm/VERSION + cat DeepGEMM/sgl_deep_gemm/VERSION + + - name: Build wheel + run: | + chmod +x ./scripts/build_sgl_deep_gemm.sh ./scripts/rename_sgl_deep_gemm_whl.sh + ./scripts/build_sgl_deep_gemm.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" "${{ github.workspace }}/DeepGEMM" "${{ matrix.arch }}" + + - name: Upload to PyPI + working-directory: DeepGEMM + run: | + pip install twine + python3 -m twine upload --skip-existing dist-pypi/* -u __token__ -p ${{ secrets.SGL_DEEP_GEMM_PYPI_TOKEN }} + + - name: Upload artifacts + uses: actions/upload-artifact@v4 + with: + name: deepgemm-wheel-cuda${{ matrix.cuda-version }}-${{ matrix.arch }} + path: DeepGEMM/dist/*.whl + + release-cu130: + needs: build-cu130-matrix + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + path: dist/ + merge-multiple: true + pattern: deepgemm-wheel-cuda13.0-* + + - name: Release + uses: softprops/action-gh-release@v2 + with: + tag_name: v${{ inputs.version }} + repository: sgl-project/whl + token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }} + files: | + dist/* + + - name: Clone wheel index + run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl + env: + WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }} + + - name: Update wheel index + run: python3 scripts/update_deepgemm_whl_index.py --cuda 130 + + - name: Push wheel index + run: | + cd sgl-whl + git config --local user.name "sglang-bot" + git config --local user.email "sglangbot@gmail.com" + git add -A + git commit -m "update sgl-deep-gemm whl index for v${{ 
inputs.version }}" + git push diff --git a/.github/workflows/release-whl-kernel.yml b/.github/workflows/release-whl-kernel.yml index 2fe1e8aefa50..775fafacc11f 100644 --- a/.github/workflows/release-whl-kernel.yml +++ b/.github/workflows/release-whl-kernel.yml @@ -8,7 +8,24 @@ on: - sgl-kernel/python/sgl_kernel/version.py workflow_dispatch: inputs: + target: + type: choice + description: 'Build target' + required: false + default: 'all' + options: + - 'all' + - 'cu129' + - 'cu130' + - 'rocm700' + - 'rocm720' + - 'musa43' tag_name: + description: "Version number, must be in the form of vX.Y.Z (e.g. v0.4.0)" + type: string + required: false + pr_number: + description: "PR number to build from (e.g. 12345)" type: string required: false @@ -17,8 +34,13 @@ concurrency: cancel-in-progress: true jobs: + # cu130 is the PyPI-released variant; cu129 wheels are published only to the + # sgl-project/whl index (consumed via `pip install ...+cu129` for the legacy + # cuda 12.9 path), not to PyPI. build-cu129-matrix: - if: github.repository == 'sgl-project/sglang' + if: | + github.repository == 'sgl-project/sglang' && + (github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'cu129') strategy: matrix: python-version: ["3.10"] @@ -31,9 +53,19 @@ jobs: runner: arm-kernel-build-node runs-on: ${{ matrix.runner }} steps: + # Self-hosted build nodes retain the workspace across jobs. Prior builds + # leave root-owned artifacts under sgl-kernel/build/ that actions/checkout + # cannot remove, causing EACCES on rmdir. Wipe them via a throwaway root + # container before checkout recreates the workspace. 
+ - name: Clean workspace (remove root-owned files from prior runs) + run: | + docker run --rm -v "${{ github.workspace }}:/workspace" alpine:3 \ + sh -c 'rm -rf /workspace/..?* /workspace/.[!.]* /workspace/*' || true + - uses: actions/checkout@v4 with: submodules: "recursive" + ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }} - name: Set up Python ${{ matrix.python-version }} uses: actions/setup-python@v5 @@ -46,14 +78,8 @@ jobs: chmod +x ./build.sh ./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }} env: - USE_CCACHE: 0 - CMAKE_EXTRA_ARGS: ${{ matrix.arch == 'aarch64' && '-DENABLE_BELOW_SM90=ON' || '' }} - - - name: Upload to PyPI - working-directory: sgl-kernel - run: | - pip install twine - python3 -m twine upload --skip-existing dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }} + BUILD_JOBS: 64 + NVCC_THREADS: 8 - name: Upload artifacts uses: actions/upload-artifact@v4 @@ -66,6 +92,8 @@ jobs: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }} - name: Download artifacts uses: actions/download-artifact@v4 @@ -110,9 +138,10 @@ jobs: git commit -m "update whl index" git push - # for now we do not release CUDA 13.0 wheels to pypi build-cu130-matrix: - if: github.repository == 'sgl-project/sglang' + if: | + github.repository == 'sgl-project/sglang' && + (github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'cu130') strategy: matrix: python-version: ["3.10"] @@ -125,9 +154,19 @@ jobs: runner: arm-kernel-build-node runs-on: ${{ matrix.runner }} steps: + # Self-hosted build nodes retain the workspace across jobs. Prior builds + # leave root-owned artifacts under sgl-kernel/build/ that actions/checkout + # cannot remove, causing EACCES on rmdir. 
Wipe them via a throwaway root + # container before checkout recreates the workspace. + - name: Clean workspace (remove root-owned files from prior runs) + run: | + docker run --rm -v "${{ github.workspace }}:/workspace" alpine:3 \ + sh -c 'rm -rf /workspace/..?* /workspace/.[!.]* /workspace/*' || true + - uses: actions/checkout@v4 with: submodules: "recursive" + ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }} - name: Set up Python ${{ matrix.python-version }} uses: actions/setup-python@v5 @@ -140,7 +179,39 @@ jobs: chmod +x ./build.sh ./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }} env: - USE_CCACHE: 0 + BUILD_JOBS: 64 + NVCC_THREADS: 8 + + - name: Strip +cu130 local version for PyPI upload + working-directory: sgl-kernel + run: | + set -eux + pip install wheel + mkdir -p dist-pypi + for w in dist/*.whl; do + tmp=$(mktemp -d) + python3 -m wheel unpack "$w" --dest "$tmp" + unpacked=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -1) + info=$(find "$unpacked" -maxdepth 1 -type d -name "*.dist-info" | head -1) + meta="$info/METADATA" + orig=$(grep '^Version:' "$meta" | head -1 | sed 's/^Version:[[:space:]]*//') + new=$(echo "$orig" | sed 's/+cu[0-9]\+$//') + if [ "$orig" != "$new" ]; then + sed -i "s/^Version:.*/Version: ${new}/" "$meta" + old_base=$(basename "$info") + new_base="${old_base/${orig}/${new}}" + mv "$info" "$(dirname "$info")/${new_base}" + fi + python3 -m wheel pack "$unpacked" --dest-dir dist-pypi + rm -rf "$tmp" + done + ls -lh dist-pypi/ + + - name: Upload to PyPI + working-directory: sgl-kernel + run: | + pip install twine + python3 -m twine upload --skip-existing dist-pypi/* -u __token__ -p ${{ secrets.PYPI_TOKEN_SGLANG_KERNEL }} - name: Upload artifacts uses: actions/upload-artifact@v4 @@ -153,6 +224,8 @@ jobs: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_number && 
format('refs/pull/{0}/head', inputs.pr_number) || '' }} - name: Download artifacts uses: actions/download-artifact@v4 @@ -197,17 +270,29 @@ jobs: git commit -m "update whl index" git push - build-rocm700: - if: github.repository == 'sgl-project/sglang' + build-rocm-matrix: + if: | + github.repository == 'sgl-project/sglang' && + (github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'rocm700' || github.event.inputs.target == 'rocm720') runs-on: amd-docker-scale strategy: matrix: python-version: ["3.10"] - rocm-version: ["700"] + rocm-version: ["700", "720"] steps: + # Self-hosted build nodes retain the workspace across jobs. Prior builds + # leave root-owned artifacts under sgl-kernel/build/ that actions/checkout + # cannot remove, causing EACCES on rmdir. Wipe them via a throwaway root + # container before checkout recreates the workspace. + - name: Clean workspace (remove root-owned files from prior runs) + run: | + docker run --rm -v "${{ github.workspace }}:/workspace" alpine:3 \ + sh -c 'rm -rf /workspace/..?* /workspace/.[!.]* /workspace/*' || true + - uses: actions/checkout@v4 with: submodules: "recursive" + ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }} - name: Set up Python ${{ matrix.python-version }} uses: actions/setup-python@v5 @@ -216,7 +301,7 @@ jobs: - name: Build wheels run: | - cp 3rdparty/amd/sgl-kernel/* sgl-kernel/ + cp 3rdparty/amd/wheel/sgl-kernel/* sgl-kernel/ cd sgl-kernel chmod +x ./build_rocm.sh ./build_rocm.sh "${{ matrix.rocm-version }}" @@ -228,17 +313,19 @@ jobs: path: sgl-kernel/dist/* release-rocm700: - needs: build-rocm700 + needs: build-rocm-matrix runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }} - name: Download artifacts uses: actions/download-artifact@v4 with: path: sgl-kernel/dist/ merge-multiple: true - pattern: wheel-* + pattern: 
wheel-*-rocm700 - name: Set tag name id: set_tag_name @@ -275,3 +362,142 @@ jobs: git add -A git commit -m "update whl index" git push + + release-rocm720: + needs: build-rocm-matrix + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }} + + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-*-rocm720 + + - name: Set tag name + id: set_tag_name + run: | + if [ -z "${{ inputs.tag_name }}" ]; then + TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)" + echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT + else + echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT + fi + + - name: Release + uses: softprops/action-gh-release@v2 + with: + tag_name: ${{ steps.set_tag_name.outputs.tag_name }} + repository: sgl-project/whl + token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }} + files: | + sgl-kernel/dist/* + + - name: Clone wheel index + run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl + env: + WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }} + + - name: Update wheel index + run: python3 scripts/update_kernel_whl_index.py --rocm 720 + + - name: Push wheel index + run: | + cd sgl-whl + git config --local user.name "sglang-bot" + git config --local user.email "sglangbot@gmail.com" + git add -A + git commit -m "update whl index" + git push + + build-musa43: + if: | + github.repository == 'sgl-project/sglang' && + (github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'musa43') + runs-on: kernel-build-node-musa + strategy: + matrix: + python-version: ["3.10"] + musa-version: ["43"] + steps: + # Self-hosted build nodes retain the workspace across jobs. Prior builds + # leave root-owned artifacts under sgl-kernel/build/ that actions/checkout + # cannot remove, causing EACCES on rmdir. 
Wipe them via a throwaway root + # container before checkout recreates the workspace. + - name: Clean workspace (remove root-owned files from prior runs) + run: | + docker run --rm -v "${{ github.workspace }}:/workspace" alpine:3 \ + sh -c 'rm -rf /workspace/..?* /workspace/.[!.]* /workspace/*' || true + + - uses: actions/checkout@v4 + with: + submodules: "recursive" + + - name: Build wheels + run: | + cd sgl-kernel + mv pyproject_musa.toml pyproject.toml + python setup_musa.py sdist bdist_wheel + + - name: Rename MUSA wheels + run: | + bash scripts/ci/musa/rename_wheels_musa.sh ${{ matrix.musa-version }} sgl-kernel/dist + + - name: Upload artifacts + uses: actions/upload-artifact@v4 + with: + name: wheel-python${{ matrix.python-version }}-musa${{ matrix.musa-version }} + path: sgl-kernel/dist/* + + release-musa43: + needs: build-musa43 + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + path: sgl-kernel/dist/ + merge-multiple: true + pattern: wheel-* + + - name: Set tag name + id: set_tag_name + run: | + if [ -z "${{ inputs.tag_name }}" ]; then + TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)" + echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT + else + echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT + fi + + - name: Release + uses: softprops/action-gh-release@v2 + with: + tag_name: ${{ steps.set_tag_name.outputs.tag_name }} + repository: sgl-project/whl + token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }} + files: | + sgl-kernel/dist/* + + - name: Clone wheel index + run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl + env: + WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }} + + - name: Update wheel index + run: python3 scripts/update_kernel_whl_index.py --musa 43 + + - name: Push wheel index + run: | + cd sgl-whl + git config --local user.name "sglang-bot" + git config --local user.email "sglangbot@gmail.com" + git add 
-A + git commit -m "update whl index" + git push diff --git a/.github/workflows/rerun-test.yml b/.github/workflows/rerun-test.yml new file mode 100644 index 000000000000..1241763f38b4 --- /dev/null +++ b/.github/workflows/rerun-test.yml @@ -0,0 +1,198 @@ +name: Rerun Test +run-name: ${{ inputs.pr_head_sha && format('[rerun-test] {0} {1}', inputs.test_command, inputs.pr_head_sha) || format('[rerun-test] {0}', inputs.test_command) }} + +on: + workflow_dispatch: + inputs: + test_command: + description: "Test command(s) to run, one per line (e.g. 'registered/core/test_srt_endpoint.py TestSRTEndpoint.test_simple_decode')" + required: true + type: string + runner_label: + description: "Runner label" + required: true + type: choice + options: + - 1-gpu-h100 + - 1-gpu-5090 + - 2-gpu-h100 + - 4-gpu-h100 + - 4-gpu-a10 + - 4-gpu-b200 + - 8-gpu-h200 + - 8-gpu-h200-deepep + - 8-gpu-h20 + - 8-gpu-b200 + - ubuntu-latest + pr_head_sha: + description: "PR head SHA to checkout (for /rerun-test on fork PRs)" + required: false + type: string + default: "" + use_deepep: + description: "Use ci_install_deepep.sh instead of ci_install_dependency.sh" + required: false + type: string + default: "false" + is_cpu: + description: "Run as CPU-only test (uses ubuntu-latest with uv pip install)" + required: false + type: string + default: "false" + install_diffusion: + description: "Install diffusion dependencies (for multimodal gen tests)" + required: false + type: string + default: "false" + +env: + SGLANG_IS_IN_CI: true + SGLANG_CUDA_COREDUMP: "1" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true + # TEMP: rebuild deepep against the new torch for torch-211-merge PR only — revert before merging to main. 
+ FORCE_REBUILD_DEEPEP: '1' + +permissions: + actions: write + contents: read + issues: read + +jobs: + rerun-test-cuda: + if: inputs.is_cpu != 'true' + runs-on: ${{ inputs.runner_label }} + timeout-minutes: 120 + env: + RUNNER_LABELS: ${{ inputs.runner_label }} + SGLANG_CI_RDMA_ALL_DEVICES: ${{ inputs.runner_label == '8-gpu-h20' && 'mlx5_1,mlx5_2,mlx5_3,mlx5_4' || '' }} + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || github.sha }} + + - uses: ./.github/actions/check-maintenance + + - name: Install dependencies + timeout-minutes: 20 + run: | + if [[ "${{ inputs.runner_label }}" == "1-gpu-5090" ]]; then + source /etc/profile.d/sglang-ci.sh + fi + if [[ "${{ inputs.use_deepep }}" == "true" ]]; then + bash scripts/ci/cuda/ci_install_deepep.sh + elif [[ "${{ inputs.install_diffusion }}" == "true" ]]; then + bash scripts/ci/cuda/ci_install_dependency.sh diffusion + else + bash scripts/ci/cuda/ci_install_dependency.sh + fi + + - name: Run test + timeout-minutes: 60 + run: | + if [[ "${{ inputs.runner_label }}" == "1-gpu-5090" ]]; then + source /etc/profile.d/sglang-ci.sh + fi + # Collect non-empty commands into an array for counting. + cmds=() + while IFS= read -r cmd; do + [ -z "$cmd" ] && continue + cmds+=("$cmd") + done <<< "${{ inputs.test_command }}" + total=${#cmds[@]} + suite_start=$SECONDS + for idx in "${!cmds[@]}"; do + i=$((idx + 1)) + cmd="${cmds[$idx]}" + echo "" + echo "." + if [[ "${{ inputs.install_diffusion }}" == "true" ]]; then + echo "Begin ($i/$total): python3 -m pytest $cmd -x" + echo "." + file_start=$SECONDS + python3 -m pytest $cmd -x || exit 1 + else + echo "Begin ($i/$total): python3 $cmd" + echo "." + file_start=$SECONDS + (cd test/ && python3 $cmd -f) || exit 1 + fi + elapsed=$(( SECONDS - file_start )) + echo "." + echo "End ($i/$total): elapsed=${elapsed}s" + echo "." 
+ echo "" + done + total_elapsed=$(( SECONDS - suite_start )) + echo "All $total test(s) passed in ${total_elapsed}s" + + - uses: ./.github/actions/upload-cuda-coredumps + if: failure() + + rerun-test-cpu: + if: inputs.is_cpu == 'true' + runs-on: ubuntu-latest + timeout-minutes: 120 + steps: + - name: Free disk space + run: | + sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc + df -h + + - name: Checkout code + uses: actions/checkout@v4 + with: + ref: ${{ inputs.pr_head_sha || github.sha }} + + - uses: ./.github/actions/check-maintenance + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.10' + + - name: Install uv + uses: astral-sh/setup-uv@v5 + + # Needed by setuptools-rust to build the bundled native gRPC extension + # (rust/sglang-grpc) when installing the main `sglang` wheel from source. + - name: Install protoc + Rust toolchain + timeout-minutes: 10 + run: bash scripts/ci/utils/install_rust_protoc.sh + + - name: Install dependencies + timeout-minutes: 20 + env: + UV_SYSTEM_PYTHON: "1" + run: | + uv pip install -e "python[dev]" --index-strategy unsafe-best-match --prerelease allow + + - name: Run test + timeout-minutes: 60 + run: | + cd test/ + # Collect non-empty commands into an array for counting. + cmds=() + while IFS= read -r cmd; do + [ -z "$cmd" ] && continue + cmds+=("$cmd") + done <<< "${{ inputs.test_command }}" + total=${#cmds[@]} + suite_start=$SECONDS + for idx in "${!cmds[@]}"; do + i=$((idx + 1)) + cmd="${cmds[$idx]}" + echo "" + echo "." + echo "Begin ($i/$total): python3 $cmd" + echo "." + file_start=$SECONDS + python3 $cmd -f || exit 1 + elapsed=$(( SECONDS - file_start )) + echo "." + echo "End ($i/$total): elapsed=${elapsed}s" + echo "." 
+ echo "" + done + total_elapsed=$(( SECONDS - suite_start )) + echo "All $total test(s) passed in ${total_elapsed}s" diff --git a/.github/workflows/retag-docker.yml b/.github/workflows/retag-docker.yml new file mode 100644 index 000000000000..633a275ed033 --- /dev/null +++ b/.github/workflows/retag-docker.yml @@ -0,0 +1,30 @@ +name: Retag Docker Image + +on: + workflow_dispatch: + inputs: + source_tag: + description: "Existing image tag (e.g., v0.4.7-cu129-amd64)" + required: true + target_tag: + description: "New tag to apply (e.g., latest)" + required: true + +jobs: + retag: + if: github.repository == 'sgl-project/sglang' + runs-on: ubuntu-22.04 + environment: "prod" + steps: + - name: Login to Docker Hub + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + + - name: Retag image + run: | + echo "Retagging lmsysorg/sglang:${{ inputs.source_tag }} -> lmsysorg/sglang:${{ inputs.target_tag }}" + docker buildx imagetools create \ + -t lmsysorg/sglang:${{ inputs.target_tag }} \ + lmsysorg/sglang:${{ inputs.source_tag }} diff --git a/.github/workflows/slash-command-handler.yml b/.github/workflows/slash-command-handler.yml index 012208f9f271..0702506aea74 100644 --- a/.github/workflows/slash-command-handler.yml +++ b/.github/workflows/slash-command-handler.yml @@ -19,7 +19,9 @@ jobs: (contains(github.event.comment.body, '/tag-run-ci-label') || contains(github.event.comment.body, '/rerun-failed-ci') || contains(github.event.comment.body, '/tag-and-rerun-ci') || - contains(github.event.comment.body, '/rerun-stage')) + contains(github.event.comment.body, '/rerun-stage') || + contains(github.event.comment.body, '/rerun-group') || + contains(github.event.comment.body, '/rerun-test')) runs-on: ubuntu-latest steps: @@ -48,14 +50,34 @@ jobs: fi echo "is_fork=$IS_FORK" >> $GITHUB_OUTPUT echo "ref=$(echo "$PR_DATA" | jq -r '.headRefName')" >> $GITHUB_OUTPUT + echo "pr_ref=refs/pull/${{ 
github.event.issue.number }}/head" >> $GITHUB_OUTPUT echo "PR owner: $HEAD_OWNER, Repo owner: $REPO_OWNER, Is fork: $IS_FORK" + - name: Check commenter permission for fork PRs + id: perm + if: steps.pr.outputs.is_fork == 'true' + shell: bash + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + PERM=$(gh api repos/${{ github.repository }}/collaborators/${{ github.event.comment.user.login }}/permission --jq '.permission') || { + PERM="none" + echo "::warning::Failed to check commenter permission, defaulting to none" + } + if [[ "$PERM" == "admin" || "$PERM" == "maintain" || "$PERM" == "write" ]]; then + echo "safe_to_checkout_pr=true" >> $GITHUB_OUTPUT + else + echo "safe_to_checkout_pr=false" >> $GITHUB_OUTPUT + fi + echo "Commenter ${{ github.event.comment.user.login }} permission: $PERM" + - name: Checkout code uses: actions/checkout@v4 with: - # For non-fork PRs, checkout PR branch to allow testing handler changes - # For fork PRs, stay on main for security (don't run untrusted code with elevated permissions) - ref: ${{ steps.pr.outputs.is_fork == 'false' && steps.pr.outputs.ref || '' }} + # For non-fork PRs: checkout PR branch by name + # For fork PRs with trusted commenter: checkout via refs/pull/N/head + # For fork PRs with untrusted commenter: stay on main for security + ref: ${{ steps.pr.outputs.is_fork == 'false' && steps.pr.outputs.ref || (steps.perm.outputs.safe_to_checkout_pr == 'true' && steps.pr.outputs.pr_ref || '') }} - name: Set up Python uses: actions/setup-python@v5 diff --git a/.github/workflows/sync-lmsys-sglang-blogs.yml b/.github/workflows/sync-lmsys-sglang-blogs.yml new file mode 100644 index 000000000000..e68acdb156d2 --- /dev/null +++ b/.github/workflows/sync-lmsys-sglang-blogs.yml @@ -0,0 +1,40 @@ +name: Sync LMSYS SGLang blogs + +on: + workflow_dispatch: + schedule: + - cron: "0 */12 * * *" + +permissions: + contents: write + +jobs: + sync: + runs-on: ubuntu-latest + steps: + - name: Check out repository + uses: actions/checkout@v4 + 
+ - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.11" + + - name: Sync blog cards + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: python docs_new/scripts/update_lmsys_sglang_blogs.py + + - name: Commit and push changes + run: | + if git diff --quiet -- docs_new/index.mdx; then + echo "No blog card changes to commit." + exit 0 + fi + + git config user.name "github-actions[bot]" + git config user.email "github-actions[bot]@users.noreply.github.com" + git add docs_new/index.mdx + git diff --cached --quiet && exit 0 + git commit -m "docs: sync LMSYS SGLang blog cards" + git push diff --git a/.github/workflows/trivy-scan-dev.yml b/.github/workflows/trivy-scan-dev.yml new file mode 100644 index 000000000000..f354765978f1 --- /dev/null +++ b/.github/workflows/trivy-scan-dev.yml @@ -0,0 +1,88 @@ +name: Trivy Scan Dev Docker Images + +on: + # Run daily after nightly dev builds (which run at midnight UTC) + schedule: + - cron: "0 6 * * *" + workflow_dispatch: + inputs: + tag: + description: "Image tag to scan (e.g., dev, dev-cu13, latest)" + required: false + default: "" + +jobs: + scan: + if: github.repository == 'sgl-project/sglang' + runs-on: x64-docker-build-node + timeout-minutes: 45 + permissions: + contents: read + security-events: write + strategy: + fail-fast: false + matrix: + tag: ${{ inputs.tag && fromJSON(format('["{0}"]', inputs.tag)) || fromJSON('["dev", "dev-cu12"]') }} + steps: + - name: Cleanup workspace (remove root-owned files from prior runs) + run: sudo rm -rf "$GITHUB_WORKSPACE"/* || true + + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Run Trivy vulnerability scanner + uses: aquasecurity/trivy-action@v0.35.0 + with: + image-ref: 'docker.io/lmsysorg/sglang:${{ matrix.tag }}' + scanners: 'vuln' + format: 'sarif' + output: 'trivy-results-${{ matrix.tag }}.sarif' + severity: 'CRITICAL,HIGH' + ignore-unfixed: true + skip-dirs: 'usr/local/go,opt/nvidia' + + - name: Upload Trivy scan results 
to GitHub Security + uses: github/codeql-action/upload-sarif@v4 + if: always() && hashFiles(format('trivy-results-{0}.sarif', matrix.tag)) != '' + with: + sarif_file: 'trivy-results-${{ matrix.tag }}.sarif' + category: 'trivy-${{ matrix.tag }}' + + - name: Run Trivy (table output for logs) + if: success() + uses: aquasecurity/trivy-action@v0.35.0 + with: + image-ref: 'docker.io/lmsysorg/sglang:${{ matrix.tag }}' + scanners: 'vuln' + format: 'table' + severity: 'CRITICAL,HIGH' + ignore-unfixed: true + skip-dirs: 'usr/local/go,opt/nvidia' + + - name: Scan summary + if: always() + run: | + IMAGE="docker.io/lmsysorg/sglang:${{ matrix.tag }}" + SARIF="trivy-results-${{ matrix.tag }}.sarif" + + echo "## Trivy Scan: \`${{ matrix.tag }}\`" >> "$GITHUB_STEP_SUMMARY" + + if [ ! -f "${SARIF}" ]; then + echo "**Status:** Scan failed — no SARIF output produced" >> "$GITHUB_STEP_SUMMARY" + exit 0 + fi + + VULN_COUNT=$(python3 -c " + import json + data = json.load(open('${SARIF}')) + print(sum(len(run.get('results', [])) for run in data.get('runs', []))) + ") + + echo "- **Image**: \`${IMAGE}\`" >> "$GITHUB_STEP_SUMMARY" + echo "- **Findings**: ${VULN_COUNT}" >> "$GITHUB_STEP_SUMMARY" + + if [ "${VULN_COUNT}" = "0" ]; then + echo "- **Result**: No CRITICAL/HIGH unfixed vulnerabilities found" >> "$GITHUB_STEP_SUMMARY" + else + echo "- **Result**: Found ${VULN_COUNT} finding(s) — check the Security tab for details" >> "$GITHUB_STEP_SUMMARY" + fi diff --git a/.github/workflows/weekly-update-est-time.yml b/.github/workflows/weekly-update-est-time.yml new file mode 100644 index 000000000000..301b5af2fe87 --- /dev/null +++ b/.github/workflows/weekly-update-est-time.yml @@ -0,0 +1,79 @@ +name: Weekly Update Est Time + +on: + schedule: + - cron: '0 0 * * 1' # Monday 00:00 UTC + workflow_dispatch: {} + +permissions: + contents: write + pull-requests: write + +jobs: + update-est-time: + runs-on: ubuntu-latest + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + token: ${{ 
secrets.GITHUB_TOKEN }} + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.10' + + - name: Update est_time values + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + python scripts/ci/update_est_time.py \ + --summary-file /tmp/est_time_summary.md + + - name: Check for changes + id: changes + run: | + if git diff --quiet; then + echo "has_changes=false" >> "$GITHUB_OUTPUT" + echo "No est_time changes detected" + else + echo "has_changes=true" >> "$GITHUB_OUTPUT" + echo "Est_time changes detected:" + git diff --stat + fi + + - name: Create PR + if: steps.changes.outputs.has_changes == 'true' + env: + GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }} + run: | + git config user.name "sglang-bot" + git config user.email "sglang-bot@users.noreply.github.com" + + BRANCH_NAME="bot/update-est-time-$(date +%Y%m%d)" + git checkout -b "$BRANCH_NAME" + + git add -A + git commit -m "chore: update CI test est_time from recent run data" + + git push origin "$BRANCH_NAME" + + { + echo "## Summary" + echo + echo "Updates \`est_time\` values in CI test registration calls based on the 90th percentile of the last 15 successful executions from scheduled PR Test runs on main." + echo + echo "This keeps the LPT load-balancing algorithm accurate for partitioning tests across parallel CI jobs." 
+ echo + if [ -f /tmp/est_time_summary.md ]; then + cat /tmp/est_time_summary.md + echo + fi + echo "🤖 Generated with GitHub Actions" + } > /tmp/pr_body.md + + gh pr create \ + --title "chore: update CI test est_time values" \ + --body-file /tmp/pr_body.md \ + --base main \ + --head "$BRANCH_NAME" diff --git a/.gitignore b/.gitignore index 3ecff6e63f7d..29de5aee37fd 100644 --- a/.gitignore +++ b/.gitignore @@ -178,6 +178,7 @@ benchmark/llava_bench/images benchmark/llava_bench/mme_pack *.jsonl tmp*.txt +/tmp/ # Torch Compile logs tl_out/ @@ -191,6 +192,7 @@ work_dirs/ *.csv !logo.png +!docs_new/images/*.png # Prerequisites *.d @@ -224,16 +226,11 @@ work_dirs/ *.exe *.out *.app - -compile_commands.json - *.iml # VSCode .vscode -1 - # Autoenv .env.leave @@ -243,12 +240,13 @@ Cargo.lock # Generated vision test fixtures (regenerate with: python scripts/generate_vision_golden.py) sgl-model-gateway/tests/fixtures/golden/ +# Other repos lmms-eval -**/.claude/ **/.serena/ ctags/ outputs/ +inputs/ # Eval Cache .longbench_cache/ @@ -261,22 +259,21 @@ outputs/ # setuptools-scm generated version file python/sglang/_version.py - -# Generated protobuf files (regenerate during wheel build or with compile_proto.py) -python/sglang/srt/grpc/*_pb2.py -python/sglang/srt/grpc/*_pb2_grpc.py -python/sglang/srt/grpc/*_pb2.pyi +python/kernel.lock # MUSA section # Generated source files by torchada sgl-kernel/csrc_musa/ sgl-kernel/include_musa/ sgl-kernel/csrc/**/*_musa/ -sgl-kernel/third_party/*/csrc_musa/ -sgl-kernel/third_party/*/include_musa/ - -# Third-party libraries source code -sgl-kernel/third_party/ # MUSA core dump files -core_* +*.mudmp + +# Others +# diffusion 3D outputs +*.glb +*.ply +*.npz +artifacts/ +.claude/scheduled_tasks.lock diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 7abe48029ea7..8118e91c26cf 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -3,7 +3,7 @@ exclude: 
^(python/sglang/multimodal_gen/csrc|python/sglang/jit_kernel/flash_atte repos: - repo: https://github.com/pre-commit/pre-commit-hooks - rev: v5.0.0 + rev: v6.0.0 hooks: - id: check-symlinks - id: destroyed-symlinks @@ -21,12 +21,12 @@ repos: - id: debug-statements - id: no-commit-to-branch - repo: https://github.com/PyCQA/isort - rev: 5.13.2 + rev: 7.0.0 hooks: - id: isort exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$' - repo: https://github.com/astral-sh/ruff-pre-commit - rev: v0.11.7 + rev: v0.15.1 hooks: - id: ruff args: @@ -43,7 +43,7 @@ repos: python/sglang/srt/grpc/.*_pb2_grpc\.pyi$| )$ - repo: https://github.com/psf/black - rev: 24.10.0 + rev: 26.1.0 hooks: - id: black-jupyter exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$' @@ -53,13 +53,13 @@ repos: - id: codespell args: ['--config', '.codespellrc'] - repo: https://github.com/pre-commit/mirrors-clang-format - rev: v18.1.8 + rev: v20.1.7 hooks: - id: clang-format types_or: [c++, cuda] args: [--style=file, --verbose] - repo: https://github.com/kynan/nbstripout - rev: 0.8.1 + rev: 0.9.0 hooks: - id: nbstripout args: @@ -67,9 +67,49 @@ repos: - '--extra-keys=metadata.kernelspec metadata.language_info.version' - repo: local hooks: + - id: check-chinese-characters + name: check chinese characters in multimodal_gen + entry: >- + python3 -c 'import sys, re; p=re.compile(r"[\u4e00-\u9fff]"); ec=0; [ ([(print(f"{f}:{i+1}: {l.strip()}") or (ec:=1)) for i,l in enumerate(open(f, "r", encoding="utf-8", errors="ignore")) if p.search(l)]) for f in sys.argv[1:] ]; sys.exit(ec)' + language: system + files: ^python/sglang/multimodal_gen/.* + exclude: 
^(python/sglang/multimodal_gen/configs/sample|python/sglang/multimodal_gen/apps/ComfyUI_SGLDiffusion/workflows|python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages)(/|$) + types_or: [python, markdown, json, text] - id: sort-ci-permissions name: sort CI_PERMISSIONS.json entry: python3 .github/update_ci_permission.py --sort-only language: system files: ^\.github/CI_PERMISSIONS\.json$ pass_filenames: false + - id: check-workflow-job-names + name: check for duplicate workflow job names + entry: python3 scripts/ci/check_workflow_job_names.py + language: system + files: ^\.github/workflows/.*\.yml$ + pass_filenames: false + - id: check-registered-tests + name: check registered tests have CI registry + entry: python3 scripts/ci/check_registered_tests.py + language: system + files: ^test/registered/.*\.py$ + pass_filenames: false + - id: check-no-docs-changes + name: reject changes under legacy docs/ + entry: python3 scripts/ci/check_no_docs_changes.py + language: system + pass_filenames: false + always_run: true + stages: [pre-commit] + - repo: https://github.com/lycheeverse/lychee.git + rev: lychee-v0.22.0 + hooks: + - id: lychee + name: check doc links (offline) + args: ["--config", ".github/linters/lychee.toml"] + stages: [manual] + exclude: ^docs/_build/ + files: | + (?x)^( + README\.md| + docs/.*\.(md|rst|ipynb) + )$ diff --git a/3rdparty/amd/sgl-kernel/build_rocm.sh b/3rdparty/amd/sgl-kernel/build_rocm.sh deleted file mode 100755 index 1022d8bb50f3..000000000000 --- a/3rdparty/amd/sgl-kernel/build_rocm.sh +++ /dev/null @@ -1,123 +0,0 @@ -#!/bin/bash -set -euo pipefail -ROCM_VERSION=$1 - -PYTHON_ROOT_PATH="/opt/venv/bin" -AMDGPU_TARGET="gfx942;gfx950" - -echo "Python root path is: $PYTHON_ROOT_PATH" - -# Get version from git tags -SGLANG_VERSION="v0.5.6" # Default version, will be overridden if git tags are found - -# Fetch tags from origin to ensure we have the latest -if git fetch --tags origin; then - # Get the latest version tag sorted 
by version number (e.g., v0.5.7) - VERSION_FROM_TAG=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1) - if [ -n "$VERSION_FROM_TAG" ]; then - SGLANG_VERSION="$VERSION_FROM_TAG" - echo "Using SGLang version from git tags: $SGLANG_VERSION" - else - echo "Warning: No version tags found; using default $SGLANG_VERSION" >&2 - fi -else - echo "Warning: Failed to fetch tags from origin; using default $SGLANG_VERSION" >&2 -fi - -# Default base tags (can be overridden by command line arguments) -DEFAULT_MI30X_BASE_TAG="${SGLANG_VERSION}-rocm700-mi30x" -DEFAULT_MI35X_BASE_TAG="${SGLANG_VERSION}-rocm700-mi35x" - -# Parse command line arguments -MI30X_BASE_TAG="${DEFAULT_MI30X_BASE_TAG}" -MI35X_BASE_TAG="${DEFAULT_MI35X_BASE_TAG}" - -# Detect GPU architecture from the Kubernetes runner hostname -HOSTNAME_VALUE=$(hostname) -GPU_ARCH="mi30x" # default - -# Host names look like: linux-mi35x-gpu-1-xxxxx-runner-zzzzz -if [[ "${HOSTNAME_VALUE}" =~ ^linux-(mi[0-9]+[a-z]*)-gpu-[0-9]+ ]]; then - GPU_ARCH="${BASH_REMATCH[1]}" - echo "Detected GPU architecture from hostname: ${GPU_ARCH}" -else - echo "Warning: could not parse GPU architecture from '${HOSTNAME_VALUE}', defaulting to ${GPU_ARCH}" -fi - -case "${GPU_ARCH}" in - mi35x) - echo "Runner uses ${GPU_ARCH}; will fetch mi35x image." - ;; - mi30x|mi300|mi325) - echo "Runner uses ${GPU_ARCH}; will fetch mi30x image." - GPU_ARCH="mi30x" - ;; - *) - echo "Runner architecture '${GPU_ARCH}' unrecognised; defaulting to mi30x image." 
>&2 - GPU_ARCH="mi30x" - ;; -esac - -if [[ -f /etc/podinfo/gha-render-devices ]]; then - DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices) -else - DEVICE_FLAG="--device /dev/dri" -fi - -# Find the latest image -find_latest_image() { - local gpu_arch=$1 - local base_tag days_back image_tag - - case "${gpu_arch}" in - mi30x) base_tag="${MI30X_BASE_TAG}" ;; - mi35x) base_tag="${MI35X_BASE_TAG}" ;; - *) echo "Error: unsupported GPU architecture '${gpu_arch}'" >&2; return 1 ;; - esac - - for days_back in {0..6}; do - image_tag="${base_tag}-$(date -d "${days_back} days ago" +%Y%m%d)" - echo "Checking for image: rocm/sgl-dev:${image_tag}" >&2 - if docker manifest inspect "rocm/sgl-dev:${image_tag}" >/dev/null 2>&1; then - echo "Found available image: rocm/sgl-dev:${image_tag}" >&2 - echo "rocm/sgl-dev:${image_tag}" - return 0 - fi - done - - echo "Error: no ${gpu_arch} image found in the last 7 days for base ${base_tag}" >&2 - echo "Using hard-coded fallback…" >&2 - if [[ "${gpu_arch}" == "mi35x" ]]; then - echo "rocm/sgl-dev:v0.5.3-rocm700-mi35x-20251009" - else - echo "rocm/sgl-dev:v0.5.3-rocm700-mi30x-20251009" - fi -} - -# Pull and run the latest image -IMAGE=$(find_latest_image "${GPU_ARCH}") -echo "Pulling Docker image: ${IMAGE}" -docker pull "${IMAGE}" - -docker run --rm \ - -v $(pwd):/sgl-kernel \ - -e AMDGPU_TARGET="${AMDGPU_TARGET}" \ - ${IMAGE} \ - bash -c " - # Install CMake (version >= 3.26) - Robust Installation - export CMAKE_VERSION_MAJOR=3.31 - export CMAKE_VERSION_MINOR=1 - echo \"Downloading CMake from: https://cmake.org/files/v\${CMAKE_VERSION_MAJOR}/cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz\" - wget https://cmake.org/files/v\${CMAKE_VERSION_MAJOR}/cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz - tar -xzf cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz - mv cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64 /opt/cmake - export 
PATH=/opt/cmake/bin:\$PATH - - ${PYTHON_ROOT_PATH}/pip install --no-cache-dir ninja setuptools wheel numpy uv scikit-build-core && \ - - cd /sgl-kernel && \ - rm -rf CMakeLists.txt && mv CMakeLists_rocm.txt CMakeLists.txt && \ - ${PYTHON_ROOT_PATH}/python rocm_hipify.py && \ - ${PYTHON_ROOT_PATH}/python -m uv build --wheel -Cbuild-dir=build . --color=always --no-build-isolation && \ - ./rename_wheels_rocm.sh -" diff --git a/3rdparty/amd/tuning/TUNING.md b/3rdparty/amd/tuning/TUNING.md index e7b9b2049d61..a903bba03eca 100644 --- a/3rdparty/amd/tuning/TUNING.md +++ b/3rdparty/amd/tuning/TUNING.md @@ -25,7 +25,7 @@ To maximize Triton kernel efficiency, several strategies can be employed: triton.Config({'waves_per_eu': 4}, num_warps=16, num_stages=1), ], key=['BLOCK_N', 'NUM_TOKEN_BLKS'], use_cuda_graph=True) @triton.jit -def _triton_kernel_funtion(): +def _triton_kernel_function(): ... ``` ## 2. Torch Tunable Operations diff --git a/3rdparty/amd/tuning/benchmark_moe_rocm.py b/3rdparty/amd/tuning/benchmark_moe_rocm.py index af596d218310..d7ea67c5fab1 100644 --- a/3rdparty/amd/tuning/benchmark_moe_rocm.py +++ b/3rdparty/amd/tuning/benchmark_moe_rocm.py @@ -10,7 +10,7 @@ from tqdm import tqdm from transformers import AutoConfig -from sglang.srt.layers.moe.fused_moe_triton.fused_moe import ( +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe import ( fused_moe, get_config_file_name, ) @@ -187,10 +187,8 @@ def run_grid(bs, model, method, tp_size, dtype: str): configs = union_of_list_of_dicts(prune_configs_1, prune_configs_2) - print( - f"{bs=} || {len(full_configs)=} | {len(prune_configs_1)=} | \ - {len(prune_configs_2)=} | {len(configs)=}" - ) + print(f"{bs=} || {len(full_configs)=} | {len(prune_configs_1)=} | \ + {len(prune_configs_2)=} | {len(configs)=}") best_config = None best_time_us = 1e20 diff --git a/3rdparty/amd/wheel/README.md b/3rdparty/amd/wheel/README.md new file mode 100644 index 000000000000..7d14c704fe0b --- /dev/null +++ 
b/3rdparty/amd/wheel/README.md @@ -0,0 +1,97 @@ +# sglang-kernel (formerly sgl-kernel) + +Building and releasing `sglang-kernel` as a wheel is part of the release workflow. Check [release-whl-kernel.yml](https://github.com/sgl-project/sglang/blob/main/.github/workflows/release-whl-kernel.yml) for details. + +# sglang + +`3rdparty/amd/wheel/sglang/pyproject.toml` is the AMD-specific pyproject for building the `amd-sglang` wheel. It extends `python/pyproject_other.toml` with two ROCm-version extras (`rocm700`, `rocm720`) that pin the matching torch/triton/torchaudio/torchvision/`sglang-kernel` wheels, and renames the package to `amd-sglang`. + +## Building the sglang wheel + +``` +$ git clone https://github.com/sgl-project/sglang.git && cd sglang +$ cp 3rdparty/amd/wheel/sglang/pyproject.toml python/pyproject.toml +$ cd python && python -m build +``` + +## Installation + +### v0.5.9 + +ROCm 7.0.0: +``` +pip uninstall sglang-kernel sglang amd-sglang +pip install "amd-sglang[all-hip,rocm700]" -i https://pypi.amd.com/rocm-7.0.0/simple --extra-index-url https://pypi.org/simple +``` + +ROCm 7.2.0: +``` +pip uninstall sglang-kernel sglang amd-sglang +pip install "amd-sglang[all-hip,rocm720]" -i https://pypi.amd.com/rocm-7.2.0/simple --extra-index-url https://pypi.org/simple +``` + +Note: You must resolve the two dependencies below, AITER and triton. The others are optional, depending on your application. + +## Manual Dependency Resolution + +### Resolving AITER + +[AITER](https://github.com/ROCm/aiter) is a fundamental dependency. Packaging it as a wheel is ongoing. +Until we can pin it reliably, install it manually (typically following the [ROCm docker recipe](https://github.com/sgl-project/sglang/blob/main/docker/rocm.Dockerfile#L106)). + +### Resolving triton + +To avoid known issues in triton 3.5.1, which is installed by default, we recommend upgrading triton after installation. 
In a ROCm 7.0.0 environment, +``` +pip install triton==3.6.0 +``` +or, for ROCm 7.2.0, +``` +pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/triton-3.6.0%2Brocm7.2.0.gitba5c1517-cp310-cp310-linux_x86_64.whl +``` + +#### `torch._inductor.exc.InductorError: AttributeError: 'KernelMetadata' object has no attribute 'cluster_dims'` + +After upgrading, you may hit this error during inference when PyTorch Inductor interacts with Triton metadata. + +A pragmatic workaround is to guard the metadata access in Inductor's Triton heuristics so it only reads `cluster_dims` when the attribute exists: + +```diff +--- a/opt/venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py ++++ b/opt/venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py +@@ -1759,6 +1759,8 @@ + else ( + (binary.metadata.num_ctas, *binary.metadata.cluster_dims) + if hasattr(binary, "metadata") ++ and hasattr(binary.metadata, "num_ctas") ++ and hasattr(binary.metadata, "cluster_dims") + else () + ) + ), +``` + +### Resolving Dependencies for Distributed Inference + +#### sgl-model-gateway + +Install sgl-model-gateway as follows: + +``` +$ apt install openssl libssl-dev protobuf +$ export PATH="/$HOME/.cargo/bin:${PATH}" \ + && curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y \ + && rustc --version && cargo --version # Set up the Rust toolchain +$ python3 -m pip install --no-cache-dir setuptools-rust \ + && cd /sgl-workspace/sglang/sgl-model-gateway/bindings/python \ + && cargo build --release \ + && python3 -m pip install --no-cache-dir . 
\ + && rm -rf /root/.cache # Build and install sgl-model-gateway +``` + +#### [Mori](https://github.com/sgl-project/sglang/blob/main/docker/rocm.Dockerfile#L381) + +### Resolving Dependencies for DeepSeek-V3.2 + +#### [TileLang](https://github.com/sgl-project/sglang/blob/main/docker/rocm.Dockerfile#L216) + +#### [FHT (fast-hadamard-transform)](https://github.com/sgl-project/sglang/blob/main/docker/rocm.Dockerfile#L300) diff --git a/3rdparty/amd/sgl-kernel/CMakeLists_rocm.txt b/3rdparty/amd/wheel/sgl-kernel/CMakeLists_rocm.txt similarity index 97% rename from 3rdparty/amd/sgl-kernel/CMakeLists_rocm.txt rename to 3rdparty/amd/wheel/sgl-kernel/CMakeLists_rocm.txt index e4d29ae73104..eb9f1e40d510 100644 --- a/3rdparty/amd/sgl-kernel/CMakeLists_rocm.txt +++ b/3rdparty/amd/wheel/sgl-kernel/CMakeLists_rocm.txt @@ -36,6 +36,7 @@ endif() # ROCm/HIP enable_language(HIP) +list(APPEND CMAKE_PREFIX_PATH "/opt/rocm/lib/cmake/hip-lang") find_package(hip REQUIRED CONFIG) # Determine AMDGPU target from environment variable or default to gfx942 @@ -106,10 +107,10 @@ ${PROJ_ROOT}/csrc/elementwise/pos_enc.hip ${PROJ_ROOT}/csrc/elementwise/topk.hip ${PROJ_ROOT}/csrc/grammar/apply_token_bitmask_inplace_hip.hip ${PROJ_ROOT}/csrc/kvcacheio/transfer.hip +${PROJ_ROOT}/csrc/memory/weak_ref_tensor.cpp ${PROJ_ROOT}/csrc/moe/moe_align_kernel.hip ${PROJ_ROOT}/csrc/moe/moe_topk_softmax_kernels.hip ${PROJ_ROOT}/csrc/moe/moe_topk_sigmoid_kernels.hip -${PROJ_ROOT}/csrc/sgl_diffusion/elementwise/timestep_embedding.hip ${PROJ_ROOT}/csrc/speculative/eagle_utils.hip ) set_source_files_properties( diff --git a/3rdparty/amd/wheel/sgl-kernel/build_rocm.sh b/3rdparty/amd/wheel/sgl-kernel/build_rocm.sh new file mode 100755 index 000000000000..4347737fa534 --- /dev/null +++ b/3rdparty/amd/wheel/sgl-kernel/build_rocm.sh @@ -0,0 +1,50 @@ +#!/bin/bash +set -euo pipefail + +ROCM_VERSION=${1:-} + +if [[ "${ROCM_VERSION}" == "700" ]]; then + IMAGE="lmsysorg/sglang:v0.5.8.post1-rocm700-mi35x" +elif [[ 
"${ROCM_VERSION}" == "720" ]]; then + IMAGE="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1" +else + echo "ERROR: Unsupported ROCM_VERSION='${ROCM_VERSION}'. Only '700' and '720' are supported." >&2 + exit 1 +fi + +PYTHON_ROOT_PATH="/opt/venv/bin" +AMDGPU_TARGET="gfx942;gfx950" + +# Pull and run the latest image +echo "Pulling Docker image: ${IMAGE}" +docker pull "${IMAGE}" + +docker run --rm \ + -v $(pwd):/sgl-kernel \ + -e AMDGPU_TARGET="${AMDGPU_TARGET}" \ + -e PYTORCH_ROCM_ARCH="${AMDGPU_TARGET}" \ + ${IMAGE} \ + bash -c " + # Install torch, triton, and friends, depending on the ROCm version + if [[ "${ROCM_VERSION}" == "700" ]]; then + ${PYTHON_ROOT_PATH}/pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0.2/torch-2.9.1.dev20251204%2Brocm7.0.2.lw.git351ff442-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0.2/triton-3.5.1%2Brocm7.0.2.gita272dfa8-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0.2/torchaudio-2.9.0%2Brocm7.0.2.gite3c6ee2b-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0.2/torchvision-0.24.0%2Brocm7.0.2.gitb919bd0c-cp310-cp310-linux_x86_64.whl + elif [[ "${ROCM_VERSION}" == "720" ]]; then + ${PYTHON_ROOT_PATH}/pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/torch-2.9.1%2Brocm7.2.0.lw.git7e1940d4-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/triton-3.5.1%2Brocm7.2.0.gita272dfa8-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/torchaudio-2.9.0%2Brocm7.2.0.gite3c6ee2b-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/torchvision-0.24.0%2Brocm7.2.0.gitb919bd0c-cp310-cp310-linux_x86_64.whl + fi + # Install CMake (version >= 3.26) - Robust Installation + export CMAKE_VERSION_MAJOR=3.31 + export CMAKE_VERSION_MINOR=1 + echo \"Downloading CMake from: 
https://cmake.org/files/v\${CMAKE_VERSION_MAJOR}/cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz\" + wget https://cmake.org/files/v\${CMAKE_VERSION_MAJOR}/cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz + tar -xzf cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz + mv cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64 /opt/cmake + export PATH=/opt/cmake/bin:\$PATH + + ${PYTHON_ROOT_PATH}/pip install --no-cache-dir ninja setuptools wheel numpy uv scikit-build-core && \ + + cd /sgl-kernel && \ + rm -rf CMakeLists.txt && mv CMakeLists_rocm.txt CMakeLists.txt && \ + ${PYTHON_ROOT_PATH}/python rocm_hipify.py && \ + ${PYTHON_ROOT_PATH}/python -m uv build --wheel -Cbuild-dir=build . --color=always --no-build-isolation && \ + ./rename_wheels_rocm.sh +" diff --git a/3rdparty/amd/sgl-kernel/rename_wheels_rocm.sh b/3rdparty/amd/wheel/sgl-kernel/rename_wheels_rocm.sh similarity index 75% rename from 3rdparty/amd/sgl-kernel/rename_wheels_rocm.sh rename to 3rdparty/amd/wheel/sgl-kernel/rename_wheels_rocm.sh index 691407a3e63f..3c8fb2a63b6e 100755 --- a/3rdparty/amd/sgl-kernel/rename_wheels_rocm.sh +++ b/3rdparty/amd/wheel/sgl-kernel/rename_wheels_rocm.sh @@ -6,6 +6,7 @@ WHEEL_DIR="dist" wheel_files=($WHEEL_DIR/*.whl) for wheel in "${wheel_files[@]}"; do intermediate_wheel="${wheel/linux/manylinux2014}" + [[ "$intermediate_wheel" == *"+rocm"* ]] && continue # Extract the current python version from the wheel name if [[ $intermediate_wheel =~ -cp([0-9]+)- ]]; then @@ -16,11 +17,8 @@ for wheel in "${wheel_files[@]}"; do fi # Detect ROCm version and add appropriate suffix - if ls /opt | grep -q "7.0"; then - new_wheel="${intermediate_wheel/-cp${cp_version}/+rocm700-cp${cp_version}}" - else - new_wheel="$intermediate_wheel" - fi + ver_abrv=$(realpath /opt/rocm-* | sed -e 's/.*-//' -e 's/\.//g') + new_wheel=${intermediate_wheel/-cp${cp_version}/+rocm${ver_abrv}-cp${cp_version}} if [[ "$wheel" 
!= "$new_wheel" ]]; then echo "Renaming $wheel to $new_wheel" diff --git a/3rdparty/amd/sgl-kernel/rocm_hipify.py b/3rdparty/amd/wheel/sgl-kernel/rocm_hipify.py similarity index 94% rename from 3rdparty/amd/sgl-kernel/rocm_hipify.py rename to 3rdparty/amd/wheel/sgl-kernel/rocm_hipify.py index c758fe0f7bbb..7383408bed16 100644 --- a/3rdparty/amd/sgl-kernel/rocm_hipify.py +++ b/3rdparty/amd/wheel/sgl-kernel/rocm_hipify.py @@ -21,10 +21,10 @@ "csrc/elementwise/topk.cu", "csrc/grammar/apply_token_bitmask_inplace_cuda.cu", "csrc/kvcacheio/transfer.cu", + "csrc/memory/weak_ref_tensor.cpp", "csrc/moe/moe_align_kernel.cu", "csrc/moe/moe_topk_softmax_kernels.cu", "csrc/moe/moe_topk_sigmoid_kernels.cu", - "csrc/sgl_diffusion/elementwise/timestep_embedding.cu", "csrc/speculative/eagle_utils.cu", ] diff --git a/3rdparty/amd/wheel/sglang/pyproject.toml b/3rdparty/amd/wheel/sglang/pyproject.toml new file mode 100644 index 000000000000..d04c3f3bb96c --- /dev/null +++ b/3rdparty/amd/wheel/sglang/pyproject.toml @@ -0,0 +1,218 @@ +[build-system] +requires = ["setuptools>=61.0", "setuptools-scm>=8.0", "wheel"] +build-backend = "setuptools.build_meta" + +[project] +name = "amd-sglang" +dynamic = ["version"] +description = "SGLang is a fast serving framework for large language models and vision language models." 
+readme = "README.md" +requires-python = ">=3.10" +license = { file = "LICENSE" } +classifiers = [ + "Programming Language :: Python :: 3", + "License :: OSI Approved :: Apache Software License", +] +dependencies = ["aiohttp", "requests", "tqdm", "numpy", "IPython", "setproctitle"] + +[project.optional-dependencies] +runtime_common = [ + "IPython", + "aiohttp", + "anthropic>=0.20.0", + "blobfile==3.0.0", + "build", + "compressed-tensors", + "decord2", + "datasets", + "einops", + "fastapi", + "gguf", + "hf_transfer", + "huggingface_hub", + "interegular", + "llguidance>=0.7.11,<0.8.0", + "modelscope", + "msgspec", + "ninja", + "numpy", + "openai-harmony==0.0.4", + "openai==2.6.1", + "orjson", + "outlines==0.1.11", + "packaging", + "partial_json_parser", + "pillow", + "prometheus-client>=0.20.0", + "psutil", + "py-spy", + "pybase64", + "pydantic", + "python-multipart", + "pyzmq>=25.1.2", + "requests", + "scipy", + "sentencepiece", + "setproctitle", + "soundfile==0.13.1", + "tiktoken", + "timm==1.0.16", + "torchao==0.9.0", + "tqdm", + "transformers==4.57.1", + "uvicorn", + "uvloop", + "xgrammar==0.2.0", + "smg-grpc-servicer>=0.5.0", +] + +# ROCm-specific packages (https://repo.radeon.com/rocm/manylinux/) +# Existing practice for daily rocm700 docker images relies on 700-rc +# versions of software that are not publicly available. Here we pin some +# from rocm702 as the closest match to the daily rocm700 images.
+rocm700 = [ + "torch @ https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0.2/torch-2.9.1.dev20251204%2Brocm7.0.2.lw.git351ff442-cp310-cp310-linux_x86_64.whl", + "triton @ https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0.2/triton-3.5.1%2Brocm7.0.2.gita272dfa8-cp310-cp310-linux_x86_64.whl", + "torchaudio @ https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0.2/torchaudio-2.9.0%2Brocm7.0.2.gite3c6ee2b-cp310-cp310-linux_x86_64.whl", + "torchvision @ https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0.2/torchvision-0.24.0%2Brocm7.0.2.gitb919bd0c-cp310-cp310-linux_x86_64.whl", + "mooncake-transfer-engine-non-cuda==0.3.8.post1", + "sglang-kernel @ https://github.com/sgl-project/whl/releases/download/v0.4.0/sglang_kernel-0.4.0+rocm700-cp310-abi3-manylinux2014_x86_64.whl", +] + +rocm720 = [ + "torch @ https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/torch-2.9.1%2Brocm7.2.0.lw.git7e1940d4-cp310-cp310-linux_x86_64.whl", + "triton @ https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/triton-3.5.1%2Brocm7.2.0.gita272dfa8-cp310-cp310-linux_x86_64.whl", + "torchaudio @ https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/torchaudio-2.9.0%2Brocm7.2.0.gite3c6ee2b-cp310-cp310-linux_x86_64.whl", + "torchvision @ https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/torchvision-0.24.0%2Brocm7.2.0.gitb919bd0c-cp310-cp310-linux_x86_64.whl", + "mooncake-transfer-engine-non-cuda==0.3.8.post1", + "sglang-kernel @ https://github.com/sgl-project/whl/releases/download/v0.4.0/sglang_kernel-0.4.0+rocm720-cp310-abi3-manylinux2014_x86_64.whl", +] + +# HIP (Heterogeneous-computing Interface for Portability) for AMD +# Install with one of: +# pip install "amd-sglang[srt_hip,rocm700]" +# pip install "amd-sglang[srt_hip,rocm720]" +srt_hip = [ + "amd-sglang[runtime_common]", + "petit_kernel==0.0.2", + "wave-lang==3.8.2", +] + +diffusion_hip = [ + "PyYAML==6.0.1", + "cloudpickle", + "diffusers==0.37.0", + "imageio==2.36.0", + "imageio-ffmpeg==0.5.1", + "moviepy>=2.0.0", + 
"opencv-python-headless==4.10.0.84", + "remote-pdb", + "st_attn==0.0.7", + "vsa==0.0.4", + "runai_model_streamer>=0.15.5", + "cache-dit==1.1.8", + "addict", +] + +# For Intel Gaudi(device : hpu) follow the installation guide +# https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html +srt_hpu = ["sglang[runtime_common]"] + +# https://docs.sglang.io/platforms/mthreads_gpu.md +srt_musa = [ + "sglang[runtime_common]", + "torch", + "torch_musa", + "torchada>=0.1.54", + "mthreads-ml-py", + "mate>=0.2.0", + "deep-gemm>=0.1.3", + "flash_attn_3>=0.1.4", + "numpy<2.0", +] + +diffusion_musa = [ + "PyYAML==6.0.1", + "cloudpickle", + "diffusers==0.37.0", + "imageio==2.36.0", + "imageio-ffmpeg==0.5.1", + "moviepy>=2.0.0", + "opencv-python-headless==4.10.0.84", + "remote-pdb", + "st_attn==0.0.7", + "vsa==0.0.4", + "runai_model_streamer>=0.15.5", + "cache-dit==1.1.8", + "addict", +] + +tracing = [ + "opentelemetry-sdk", + "opentelemetry-api", + "opentelemetry-exporter-otlp", + "opentelemetry-exporter-otlp-proto-grpc", +] + +test = [ + "accelerate", + "expecttest", + "gguf", + "jsonlines", + "matplotlib", + "pandas", + "peft", + "pytest", + "sentence_transformers", + "tabulate", +] + +all_hip = ["amd-sglang[srt_hip]", "amd-sglang[diffusion_hip]"] +all_hpu = ["sglang[srt_hpu]"] +all_musa = ["sglang[srt_musa]", "sglang[diffusion_musa]"] + +dev_hip = ["amd-sglang[all_hip]", "amd-sglang[test]"] +dev_hpu = ["sglang[all_hpu]", "sglang[test]"] +dev_musa = ["sglang[all_musa]", "sglang[test]"] + +[project.urls] +"Homepage" = "https://github.com/sgl-project/sglang" +"Bug Tracker" = "https://github.com/sgl-project/sglang/issues" + +[project.scripts] +sglang = "sglang.cli.main:main" + +[tool.setuptools.package-data] +"sglang" = [ + "srt/**/*", + "jit_kernel/**/*", +] + +[tool.setuptools.packages.find] +exclude = [ + "assets*", + "benchmark*", + "docs*", + "dist*", + "playground*", + "scripts*", + "tests*", +] + +[tool.wheel] +exclude = [ + "assets*", + "benchmark*", + "docs*", + 
"dist*", + "playground*", + "scripts*", + "tests*", +] + +[tool.setuptools_scm] +root = ".." +version_file = "sglang/_version.py" +git_describe_command = ["python3", "python/tools/get_version_tag.py"] +# Allow editable installs even when .git metadata is not available. +fallback_version = "0.0.0.dev0" diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md deleted file mode 100644 index 18c91471812c..000000000000 --- a/CODE_OF_CONDUCT.md +++ /dev/null @@ -1,128 +0,0 @@ -# Contributor Covenant Code of Conduct - -## Our Pledge - -We as members, contributors, and leaders pledge to make participation in our -community a harassment-free experience for everyone, regardless of age, body -size, visible or invisible disability, ethnicity, sex characteristics, gender -identity and expression, level of experience, education, socio-economic status, -nationality, personal appearance, race, religion, or sexual identity -and orientation. - -We pledge to act and interact in ways that contribute to an open, welcoming, -diverse, inclusive, and healthy community. 
- -## Our Standards - -Examples of behavior that contributes to a positive environment for our -community include: - -* Demonstrating empathy and kindness toward other people -* Being respectful of differing opinions, viewpoints, and experiences -* Giving and gracefully accepting constructive feedback -* Accepting responsibility and apologizing to those affected by our mistakes, - and learning from the experience -* Focusing on what is best not just for us as individuals, but for the - overall community - -Examples of unacceptable behavior include: - -* The use of sexualized language or imagery, and sexual attention or - advances of any kind -* Trolling, insulting or derogatory comments, and personal or political attacks -* Public or private harassment -* Publishing others' private information, such as a physical or email - address, without their explicit permission -* Other conduct which could reasonably be considered inappropriate in a - professional setting - -## Enforcement Responsibilities - -Community leaders are responsible for clarifying and enforcing our standards of -acceptable behavior and will take appropriate and fair corrective action in -response to any behavior that they deem inappropriate, threatening, offensive, -or harmful. - -Community leaders have the right and responsibility to remove, edit, or reject -comments, commits, code, wiki edits, issues, and other contributions that are -not aligned to this Code of Conduct, and will communicate reasons for moderation -decisions when appropriate. - -## Scope - -This Code of Conduct applies within all community spaces, and also applies when -an individual is officially representing the community in public spaces. -Examples of representing our community include using an official e-mail address, -posting via an official social media account, or acting as an appointed -representative at an online or offline event. 
- -## Enforcement - -Instances of abusive, harassing, or otherwise unacceptable behavior may be -reported to the community leaders responsible for enforcement at -. -All complaints will be reviewed and investigated promptly and fairly. - -All community leaders are obligated to respect the privacy and security of the -reporter of any incident. - -## Enforcement Guidelines - -Community leaders will follow these Community Impact Guidelines in determining -the consequences for any action they deem in violation of this Code of Conduct: - -### 1. Correction - -**Community Impact**: Use of inappropriate language or other behavior deemed -unprofessional or unwelcome in the community. - -**Consequence**: A private, written warning from community leaders, providing -clarity around the nature of the violation and an explanation of why the -behavior was inappropriate. A public apology may be requested. - -### 2. Warning - -**Community Impact**: A violation through a single incident or series -of actions. - -**Consequence**: A warning with consequences for continued behavior. No -interaction with the people involved, including unsolicited interaction with -those enforcing the Code of Conduct, for a specified period of time. This -includes avoiding interactions in community spaces as well as external channels -like social media. Violating these terms may lead to a temporary or -permanent ban. - -### 3. Temporary Ban - -**Community Impact**: A serious violation of community standards, including -sustained inappropriate behavior. - -**Consequence**: A temporary ban from any sort of interaction or public -communication with the community for a specified period of time. No public or -private interaction with the people involved, including unsolicited interaction -with those enforcing the Code of Conduct, is allowed during this period. -Violating these terms may lead to a permanent ban. - -### 4. 
Permanent Ban - -**Community Impact**: Demonstrating a pattern of violation of community -standards, including sustained inappropriate behavior, harassment of an -individual, or aggression toward or disparagement of classes of individuals. - -**Consequence**: A permanent ban from any sort of public interaction within -the community. - -## Attribution - -This Code of Conduct is adapted from the [Contributor Covenant][homepage], -version 2.0, available at -https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. - -Community Impact Guidelines were inspired by [Mozilla's code of conduct -enforcement ladder](https://github.com/mozilla/diversity). - -[homepage]: https://www.contributor-covenant.org - -For answers to common questions about this code of conduct, see the FAQ at -https://www.contributor-covenant.org/faq. Translations are available at -https://www.contributor-covenant.org/translations. diff --git a/README.md b/README.md index 93e6b6429f18..bdb9a5e047dc 100644 --- a/README.md +++ b/README.md @@ -22,21 +22,23 @@

## News +- [2026/02] 🔥 Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72 ([blog](https://lmsys.org/blog/2026-02-20-gb300-inferencex/)). +- [2026/01] 🔥 SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2026-01-16-sglang-diffusion/)). - [2025/12] SGLang provides day-0 support for latest open models ([MiMo-V2-Flash](https://lmsys.org/blog/2025-12-16-mimo-v2-flash/), [Nemotron 3 Nano](https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/), [Mistral Large 3](https://github.com/sgl-project/sglang/pull/14213), [LLaDA 2.0 Diffusion LLM](https://lmsys.org/blog/2025-12-19-diffusion-llm/), [MiniMax M2](https://lmsys.org/blog/2025-11-04-miminmax-m2/)). -- [2025/11] 🔥 SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2025-11-07-sglang-diffusion/)). - [2025/10] 🔥 SGLang now runs natively on TPU with the SGLang-Jax backend ([blog](https://lmsys.org/blog/2025-10-29-sglang-jax/)). - [2025/09] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput ([blog](https://lmsys.org/blog/2025-09-25-gb200-part-2/)). - [2025/09] SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention ([blog](https://lmsys.org/blog/2025-09-29-deepseek-V32/)). - [2025/08] SGLang x AMD SF Meetup on 8/22: Hands-on GPU workshop, tech talks by AMD/xAI/SGLang, and networking ([Roadmap](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_roadmap.pdf), [Large-scale EP](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_sglang_ep.pdf), [Highlights](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_highlights.pdf), [AITER/MoRI](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_aiter_mori.pdf), [Wave](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_meetup_wave.pdf)). 
-- [2025/08] SGLang provides day-0 support for OpenAI gpt-oss model ([instructions](https://github.com/sgl-project/sglang/issues/8833)) -- [2025/05] Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
More +- [2025/11] SGLang Diffusion accelerates video and image generation ([blog](https://lmsys.org/blog/2025-11-07-sglang-diffusion/)). - [2025/10] PyTorch Conference 2025 SGLang Talk ([slide](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_pytorch_2025.pdf)). - [2025/10] SGLang x Nvidia SF Meetup on 10/2 ([recap](https://x.com/lmsysorg/status/1975339501934510231)). +- [2025/08] SGLang provides day-0 support for OpenAI gpt-oss model ([instructions](https://github.com/sgl-project/sglang/issues/8833)) - [2025/06] SGLang, the high-performance serving infrastructure powering trillions of tokens daily, has been awarded the third batch of the Open Source AI Grant by a16z ([a16z blog](https://a16z.com/advancing-open-source-ai-through-benchmarks-and-bold-experimentation/)). +- [2025/05] Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)). - [2025/06] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput ([blog](https://lmsys.org/blog/2025-06-16-gb200-part-1/)). - [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)) - [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/)) @@ -59,9 +61,9 @@ Its core features include: - **Fast Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching. 
- **Broad Model Support**: Supports a wide range of language models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), reward models (Skywork), and diffusion models (WAN, Qwen-Image), with easy extensibility for adding new models. Compatible with most Hugging Face models and OpenAI APIs. -- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more. +- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark/5090), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more. - **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 400,000 GPUs worldwide. -- **RL & Post-Training Backbone**: SGLang is a proven rollout backend across the world, with native RL integrations and adoption by well-known post-training frameworks such as [**AReaL**](https://github.com/inclusionAI/AReaL), [**Miles**](https://github.com/radixark/miles), [**slime**](https://github.com/THUDM/slime), [**Tunix**](https://github.com/google/tunix), [**verl**](https://github.com/volcengine/verl) and more. +- **RL & Post-Training Backbone**: SGLang is a proven rollout backend used for training many frontier models, with native RL integrations and adoption by well-known post-training frameworks such as [**AReaL**](https://github.com/inclusionAI/AReaL), [**Miles**](https://github.com/radixark/miles), [**slime**](https://github.com/THUDM/slime), [**Tunix**](https://github.com/google/tunix), [**verl**](https://github.com/volcengine/verl) and more. 
## Getting Started - [Install SGLang](https://docs.sglang.io/get_started/install.html) @@ -71,17 +73,19 @@ Its core features include: - [Contribution Guide](https://docs.sglang.io/developer_guide/contribution_guide.html) ## Benchmark and Performance -Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/), [Large-scale expert parallelism](https://lmsys.org/blog/2025-05-05-large-scale-ep/), [GB200 rack-scale parallelism](https://lmsys.org/blog/2025-09-25-gb200-part-2/). +Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/), [Large-scale expert parallelism](https://lmsys.org/blog/2025-05-05-large-scale-ep/), [GB200 rack-scale parallelism](https://lmsys.org/blog/2025-09-25-gb200-part-2/), [GB300 long context](https://lmsys.org/blog/2026-02-19-gb300-longctx/). ## Adoption and Sponsorship -SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia. +SGLang has been deployed at large scale, generating trillions of tokens in production each day. 
It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations. As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 400,000 GPUs worldwide. SGLang is currently hosted under the non-profit open-source organization [LMSYS](https://lmsys.org/about/). logo ## Contact Us -For enterprises interested in adopting or deploying SGLang at scale, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at sglang@lmsys.org +For enterprises interested in adopting or deploying SGLang at scale, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at [sglang@lmsys.org](mailto:sglang@lmsys.org). + +Long-term active SGLang contributors are eligible for coding agent sponsorship, such as Cursor, Claude Code, or OpenAI Codex. Email [sglang@lmsys.org](mailto:sglang@lmsys.org) with your most important commits or pull requests. ## Acknowledgment We learned the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql). 
diff --git a/benchmark/asr/README.md b/benchmark/asr/README.md new file mode 100644 index 000000000000..5c16490e9262 --- /dev/null +++ b/benchmark/asr/README.md @@ -0,0 +1,168 @@ +# ASR Benchmark + +This benchmark evaluates the performance and accuracy (Word Error Rate - WER) of Automatic Speech Recognition (ASR) models served via SGLang. + +## Supported Models + +- `openai/whisper-large-v3` +- `openai/whisper-large-v3-turbo` +- `Qwen/Qwen3-ASR-1.7B` +- `Qwen/Qwen3-ASR-0.6B` + +## Setup + +Install the required dependencies: + +```bash +apt install ffmpeg +pip install librosa soundfile datasets evaluate jiwer transformers openai torchcodec torch +``` + +## Running the Benchmark + +### 1. Start SGLang Server + +Launch the SGLang server with a Whisper model: + +```bash +python -m sglang.launch_server --model-path openai/whisper-large-v3 --port 30000 +``` + +### 2. Run the Benchmark Script + +Basic usage (using chat completions API): + +```bash +python bench_sglang.py --base-url http://localhost:30000 --model openai/whisper-large-v3 --n-examples 10 +``` + +Using the OpenAI-compatible transcription API: + +```bash +python bench_sglang.py \ + --base-url http://localhost:30000 \ + --model openai/whisper-large-v3 \ + --api-type transcription \ + --language English \ + --n-examples 10 +``` + +Run with streaming and show real-time output: + +```bash +python bench_sglang.py \ + --base-url http://localhost:30000 \ + --model openai/whisper-large-v3 \ + --api-type transcription \ + --stream \ + --show-predictions \ + --concurrency 1 +``` + +Run with higher concurrency and save results: + +```bash +python bench_sglang.py \ + --base-url http://localhost:30000 \ + --model openai/whisper-large-v3 \ + --concurrency 8 \ + --n-examples 100 \ + --output results.json \ + --show-predictions +``` + +## Arguments + +| Argument | Description | Default | +|----------|-------------|---------| +| `--base-url` | SGLang server URL | `http://localhost:30000` | +| `--model` | Model name on the 
server | `openai/whisper-large-v3` | +| `--dataset` | HuggingFace dataset for evaluation | `D4nt3/esb-datasets-earnings22-validation-tiny-filtered` | +| `--split` | Dataset split to use | `validation` | +| `--concurrency` | Number of concurrent requests | `4` | +| `--n-examples` | Number of examples to process (`-1` for all) | `-1` | +| `--output` | Path to save results as JSON | `None` | +| `--show-predictions` | Display sample predictions | `False` | +| `--print-n` | Number of samples to display | `5` | +| `--api-type` | API to use: `chat` (chat completions) or `transcription` (audio transcriptions) | `chat` | +| `--language` | Language for transcription API (e.g., `English`, `en`) | `None` | +| `--stream` | Enable streaming mode for transcription API | `False` | + +## Metrics + +The benchmark outputs: + +| Metric | Description | +|--------|-------------| +| **Total Requests** | Number of successful ASR requests processed | +| **WER** | Word Error Rate (lower is better), computed using the `evaluate` library | +| **Average Latency** | Mean time per request (seconds) | +| **Median Latency** | 50th percentile latency (seconds) | +| **95th Latency** | 95th percentile latency (seconds) | +| **Throughput** | Requests processed per second | +| **Token Throughput** | Output tokens per second | + +## Example Output + +```bash +python bench_sglang.py --api-type transcription --concurrency 128 --model openai/whisper-large-v3 --show-predictions + +Loading dataset: D4nt3/esb-datasets-earnings22-validation-tiny-filtered... +Using API type: transcription +Repo card metadata block was not found. Setting CardData to empty. +WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty. +Performing warmup... +Processing 511 samples... 
+------------------------------ +Results for openai/whisper-large-v3: +Total Requests: 511 +WER: 12.7690 +Average Latency: 1.3602s +Median Latency: 1.2090s +95th Latency: 2.9986s +Throughput: 19.02 req/s +Token Throughput: 354.19 tok/s +Total Test Time: 26.8726s +------------------------------ + +==================== Sample Predictions ==================== +Sample 1: + REF: on the use of taxonomy i you know i think it is it is early days for us to to make any clear indications to the market about the proportion that would fall under that requirement + PRED: on the eu taxonomy i think it is early days for us to make any clear indications to the market about the proportion that would fall under that requirement +---------------------------------------- +Sample 2: + REF: so within fiscal year 2021 say 120 a 100 depending on what the micro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like + PRED: so within fiscal year 2021 say $120000 $100000 depending on what the macro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like +---------------------------------------- +Sample 3: + REF: we talked about 4.7 gigawatts + PRED: we talked about 4.7 gigawatts +---------------------------------------- +Sample 4: + REF: and you know depending on that working capital build we will we will see what that yields + PRED: and depending on that working capital build we will see what that yields what +---------------------------------------- +Sample 5: + REF: so on on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30 70% + PRED: so on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexes are distributed out 30% 70% +---------------------------------------- +============================================================ +``` + +## Notes 
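As a rough illustration of the WER and latency figures reported above, the sketch below computes a per-utterance word error rate with a plain word-level edit distance and summarizes a list of latencies. It is not the benchmark's implementation: the README states that WER is computed with the `evaluate` library, which may aggregate errors across the whole dataset rather than per sample. The helper names (`word_error_rate`, `percentile`, `summarize_latencies`) are hypothetical, and because the README does not say whether the reported WER is a fraction or a percentage, the sketch returns a fraction.

```python
from statistics import mean, median


def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j predicted words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from prediction
            insertion = curr[j - 1] + 1  # extra word in prediction
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def summarize_latencies(latencies):
    """Return mean, median, and 95th percentile latency in seconds."""
    return {
        "mean": mean(latencies),
        "median": median(latencies),
        "p95": percentile(latencies, 95),
    }
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words. Real pipelines usually normalize text (casing, punctuation, number formatting) before scoring, which this sketch leaves out.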
+ +- Audio samples longer than 30 seconds are automatically filtered out (Whisper limitation) +- The benchmark performs a warmup request before measuring performance +- Results are normalized using the model's tokenizer when available +- When using `--stream` with `--show-predictions`, use `--concurrency 1` for clean sequential output +- The `--language` option accepts both full names (e.g., `English`) and ISO 639-1 codes (e.g., `en`) + +## Troubleshooting + +**Server connection refused** +- Ensure the SGLang server is running and accessible at the specified `--base-url` +- Check that the port is not blocked by a firewall + +**Out of memory errors** +- Reduce `--concurrency` to lower GPU memory usage +- Use a smaller Whisper model variant diff --git a/benchmark/asr/bench_sglang.py b/benchmark/asr/bench_sglang.py new file mode 100644 index 000000000000..875ed952bf60 --- /dev/null +++ b/benchmark/asr/bench_sglang.py @@ -0,0 +1,404 @@ +import argparse +import asyncio +import base64 +import io +import json +import time +from statistics import mean, median + +import httpx +import librosa +import numpy as np +import soundfile +from datasets import load_dataset +from evaluate import load +from openai import AsyncOpenAI, OpenAI +from transformers import AutoTokenizer + + +def to_bytes(y, sr): + buffer = io.BytesIO() + soundfile.write(buffer, y, sr, format="WAV") + buffer.seek(0) + return buffer + + +async def run_asr_chat(client, model_name, y, sr): + """Use chat completions API with audio_url for ASR.""" + with to_bytes(y, sr) as f: + audio_bytes = f.read() + audio_base64 = base64.b64encode(audio_bytes).decode("utf-8") + + start_time = time.perf_counter() + response = await client.chat.completions.create( + model=model_name, + messages=[ + { + "role": "user", + "content": [ + { + "type": "audio_url", + "audio_url": {"url": f"data:audio/wav;base64,{audio_base64}"}, + } + ], + } + ], + temperature=0.0, + ) + end_time = time.perf_counter() + + asr_text = 
response.choices[0].message.content + latency = end_time - start_time + return latency, asr_text + + +def run_asr_transcription_sync(client, model_name, y, sr, language=None): + """Use audio transcriptions API for ASR (sync version).""" + audio_buffer = to_bytes(y, sr) + audio_buffer.name = "audio.wav" # OpenAI client needs a name attribute + + start_time = time.perf_counter() + kwargs = { + "model": model_name, + "file": audio_buffer, + } + if language: + kwargs["language"] = language + + transcription = client.audio.transcriptions.create(**kwargs) + end_time = time.perf_counter() + + latency = end_time - start_time + return latency, transcription.text + + +def run_asr_transcription_stream_sync( + base_url, model_name, y, sr, language=None, show_stream=False +): + """Use audio transcriptions API with streaming for ASR.""" + audio_buffer = to_bytes(y, sr) + audio_bytes = audio_buffer.read() + + data = { + "model": model_name, + "response_format": "json", + "stream": "true", + } + if language: + data["language"] = language + + start_time = time.perf_counter() + text_chunks = [] + + if show_stream: + print("[STREAM] ", end="", flush=True) + + with httpx.stream( + "POST", + f"{base_url}/v1/audio/transcriptions", + data=data, + files={"file": ("audio.wav", audio_bytes, "audio/wav")}, + timeout=60.0, + ) as response: + for line in response.iter_lines(): + if line.startswith("data: ") and not line.startswith("data: [DONE]"): + try: + chunk = json.loads(line[6:]) + if "choices" in chunk and chunk["choices"]: + delta = chunk["choices"][0].get("delta", {}) + content = delta.get("content", "") + if content: + text_chunks.append(content) + if show_stream: + print(content, end="", flush=True) + except json.JSONDecodeError: + pass + + if show_stream: + print() # newline after stream + + end_time = time.perf_counter() + latency = end_time - start_time + return latency, "".join(text_chunks) + + +async def run_asr_transcription( + client, + model_name, + y, + sr, + language=None, 
+ stream=False, + base_url=None, + show_stream=False, +): + """Async wrapper for transcription API (runs sync call in executor).""" + loop = asyncio.get_running_loop() + if stream: + return await loop.run_in_executor( + None, + run_asr_transcription_stream_sync, + base_url, + model_name, + y, + sr, + language, + show_stream, + ) + return await loop.run_in_executor( + None, run_asr_transcription_sync, client, model_name, y, sr, language + ) + + +async def bound_asr( + sem, + client, + model_name, + tokenizer, + audio, + reference, + api_type="chat", + language=None, + stream=False, + base_url=None, + show_stream=False, +): + async with sem: + try: + if api_type == "transcription": + latency, text = await run_asr_transcription( + client, + model_name, + *audio, + language=language, + stream=stream, + base_url=base_url, + show_stream=show_stream, + ) + else: + latency, text = await run_asr_chat(client, model_name, *audio) + + # Calculate tokens for throughput metrics + num_output_tokens = len(tokenizer(text, add_special_tokens=False).input_ids) + + # Normalize for WER evaluation + # Whisper tokenizer has a normalize method + if hasattr(tokenizer, "normalize"): + out = tokenizer.normalize(text) + ref = tokenizer.normalize(reference) + else: + out = text.lower().strip() + ref = reference.lower().strip() + + return latency, num_output_tokens, out, ref + except Exception as e: + print(f"Error during ASR: {e}") + return None + + +async def process_dataset( + model_name, + client, + data, + concurrent_request, + api_type="chat", + language=None, + stream=False, + base_url=None, + show_predictions=False, +): + sem = asyncio.Semaphore(concurrent_request) + tokenizer = AutoTokenizer.from_pretrained(model_name) + + # Warmup + print("Performing warmup...") + audio_warmup, sr_warmup = ( + data[0]["audio"]["array"], + data[0]["audio"]["sampling_rate"], + ) + await bound_asr( + sem, + client, + model_name, + tokenizer, + (audio_warmup, sr_warmup), + "", + api_type=api_type, + 
language=language, + stream=stream, + base_url=base_url, + show_stream=False, # Don't show stream during warmup + ) + + tasks = [] + print(f"Processing {len(data)} samples...") + for sample in data: + audio, sr = sample["audio"]["array"], sample["audio"]["sampling_rate"] + tasks.append( + asyncio.create_task( + bound_asr( + sem, + client, + model_name, + tokenizer, + (audio, sr), + sample["text"], + api_type=api_type, + language=language, + stream=stream, + base_url=base_url, + show_stream=show_predictions and stream, + ) + ) + ) + + results = await asyncio.gather(*tasks) + return [r for r in results if r is not None] + + +def run_evaluation(args): + # Use sync client for transcription API, async for chat API + if args.api_type == "transcription": + client = OpenAI(base_url=f"{args.base_url}/v1", api_key="None") + else: + client = AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="None") + + print(f"Loading dataset: {args.dataset}...") + print(f"Using API type: {args.api_type}" + (f" (streaming)" if args.stream else "")) + dataset = load_dataset(args.dataset, split=args.split) + + # Filter by duration if needed (Whisper max is 30s) + def add_duration(sample): + y, sr = sample["audio"]["array"], sample["audio"]["sampling_rate"] + sample["duration_ms"] = librosa.get_duration(y=y, sr=sr) * 1000 + return sample + + if "duration_ms" not in dataset.column_names: + dataset = dataset.map(add_duration) + + dataset = dataset.filter(lambda x: x["duration_ms"] < 30000) + + if args.n_examples > 0: + dataset = dataset.select(range(min(args.n_examples, len(dataset)))) + + start = time.perf_counter() + results = asyncio.run( + process_dataset( + args.model, + client, + dataset, + args.concurrency, + api_type=args.api_type, + language=args.language, + stream=args.stream, + base_url=args.base_url, + show_predictions=args.show_predictions, + ) + ) + total_test_time = time.perf_counter() - start + + if not results: + print("No successful results to evaluate.") + return + + # Metrics 
+ latencies = [res[0] for res in results] + total_tokens = sum([res[1] for res in results]) + predictions = [res[2] for res in results] + references = [res[3] for res in results] + + wer_metric = load("wer") + wer_score = 100 * wer_metric.compute(references=references, predictions=predictions) + + print("-" * 30) + print(f"Results for {args.model}:") + print(f"Total Requests: {len(results)}") + print(f"WER: {wer_score:.4f}") + print(f"Average Latency: {mean(latencies):.4f}s") + print(f"Median Latency: {median(latencies):.4f}s") + print(f"95th Latency: {np.percentile(latencies, 95):.4f}s") + print(f"Throughput: {len(results) / total_test_time:.2f} req/s") + print(f"Token Throughput: {total_tokens / total_test_time:.2f} tok/s") + print(f"Total Test Time: {total_test_time:.4f}s") + print("-" * 30) + + if args.output: + with open(args.output, "w") as f: + import json + + json.dump( + { + "model": args.model, + "dataset": args.dataset, + "wer": wer_score, + "avg_latency": mean(latencies), + "throughput": len(results) / total_test_time, + "token_throughput": total_tokens / total_test_time, + }, + f, + indent=2, + ) + + if args.show_predictions: + print("\n" + "=" * 20 + " Sample Predictions " + "=" * 20) + num_to_show = min(args.print_n, len(results)) + for i in range(num_to_show): + print(f"Sample {i+1}:") + print(f" REF: {references[i]}") + print(f" PRED: {predictions[i]}") + print("-" * 40) + print("=" * 60) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Benchmark SGLang ASR performance.") + parser.add_argument( + "--base-url", default="http://localhost:30000", help="SGLang server base URL" + ) + parser.add_argument( + "--model", default="openai/whisper-large-v3", help="Model name on the server" + ) + parser.add_argument( + "--dataset", + default="D4nt3/esb-datasets-earnings22-validation-tiny-filtered", + help="HF dataset repo", + ) + parser.add_argument("--split", default="validation", help="Dataset split") + parser.add_argument( + 
"--concurrency", type=int, default=4, help="Number of concurrent requests" + ) + parser.add_argument( + "--n-examples", + "-n", + type=int, + default=-1, + help="Number of examples to test (-1 for all)", + ) + parser.add_argument("--output", help="Path to save results in JSON") + parser.add_argument( + "--show-predictions", + action="store_true", + help="Print sample predictions and references", + ) + parser.add_argument( + "--print-n", type=int, default=5, help="Number of sample predictions to print" + ) + parser.add_argument( + "--api-type", + choices=["chat", "transcription"], + default="chat", + help="API type to use: 'chat' for chat completions with audio_url, 'transcription' for audio.transcriptions API", + ) + parser.add_argument( + "--language", + default=None, + help="Language code for transcription API (e.g., 'en')", + ) + parser.add_argument( + "--stream", + action="store_true", + help="Use streaming mode for transcription API", + ) + args = parser.parse_args() + + run_evaluation(args) diff --git a/benchmark/bench_adaptive_speculative.py b/benchmark/bench_adaptive_speculative.py new file mode 100644 index 000000000000..2a4ca0edc001 --- /dev/null +++ b/benchmark/bench_adaptive_speculative.py @@ -0,0 +1,263 @@ +"""Benchmark adaptive speculative decoding against static baselines. + +Run the same workload against one adaptive server and one or more static +servers, then compare throughput, latency, and acceptance length. + +Workloads: +- low: steady-state low-acceptance generation +- high: steady-state high-acceptance generation +- transition: alternating low/high acceptance shifts to stress runtime switching +""" + +import argparse +import time +from concurrent.futures import ThreadPoolExecutor + +import requests + +HIGH_PROMPTS = [ + "Output exactly 256 new lines. Every line must be 1. Do not add numbering, punctuation, or commentary.", + "Output exactly 256 new lines. Every line must be READY. 
Do not add numbering, punctuation, or commentary.", +] + +LOW_PROMPTS = [ + "Compose a poem in the style of Emily Dickinson about quantum entanglement. Make it emotionally resonant.", + "Write 100 two-sentence biographies of eccentric inventors with unique names, hometowns, and inventions.", + "Write a long travel diary from a botanist visiting a chain of floating islands. Every paragraph should introduce new flora, customs, weather, and political tensions.", + "Write 80 newspaper headlines and subheads from 80 different alternate-history worlds. Each headline must introduce a different place, conflict, and technology.", +] + +WORKLOADS = { + "low": [ + ("low", LOW_PROMPTS), + ], + "high": [ + ("high", HIGH_PROMPTS), + ], + "transition": [ + ("low_1", LOW_PROMPTS), + ("high_1", HIGH_PROMPTS), + ("low_2", LOW_PROMPTS), + ("high_2", HIGH_PROMPTS), + ], +} + + +def build_phase_plan(workload: str, num_requests: int): + return [ + (phase_name, prompts, num_requests) + for phase_name, prompts in WORKLOADS[workload] + ] + + +def send_request(base_url: str, prompt: str, max_tokens: int = 256): + start = time.perf_counter() + try: + resp = requests.post( + f"{base_url}/generate", + json={ + "text": prompt, + "sampling_params": { + "temperature": 0, + "max_new_tokens": max_tokens, + }, + "return_logprob": False, + }, + timeout=max(120, max_tokens), + ) + resp.raise_for_status() + data = resp.json() + except Exception as e: + return {"error": str(e), "latency": time.perf_counter() - start} + + latency = time.perf_counter() - start + meta = data.get("meta_info", {}) + completion_tokens = meta.get("completion_tokens", 0) + spec_verify_ct = meta.get("spec_verify_ct", 0) + accept_len = ( + completion_tokens / spec_verify_ct if spec_verify_ct > 0 else float("nan") + ) + + return { + "latency": latency, + "completion_tokens": completion_tokens, + "spec_verify_ct": spec_verify_ct, + "accept_length": accept_len, + } + + +def run_phase( + base_url: str, + prompts, + phase_name: str, + 
num_requests: int, + max_tokens: int, + concurrency: int, +): + expanded = (prompts * ((num_requests + len(prompts) - 1) // len(prompts)))[ + :num_requests + ] + + print( + f"\n--- Phase: {phase_name} ({num_requests} requests, concurrency={concurrency}) ---" + ) + start = time.perf_counter() + + with ThreadPoolExecutor(max_workers=concurrency) as pool: + futures = [pool.submit(send_request, base_url, p, max_tokens) for p in expanded] + results = [f.result() for f in futures] + + elapsed = time.perf_counter() - start + errors = [r for r in results if "error" in r] + ok = [r for r in results if "error" not in r] + + if not ok: + print(f" All {len(errors)} requests failed!") + return {"phase": phase_name, "error": True} + + total_tokens = sum(r["completion_tokens"] for r in ok) + total_verify = sum(r["spec_verify_ct"] for r in ok) + avg_latency = sum(r["latency"] for r in ok) / len(ok) + throughput = total_tokens / elapsed + avg_accept_len = total_tokens / total_verify if total_verify > 0 else float("nan") + + stats = { + "phase": phase_name, + "num_requests": len(ok), + "num_errors": len(errors), + "total_tokens": total_tokens, + "elapsed_s": round(elapsed, 2), + "throughput_tok_s": round(throughput, 2), + "avg_latency_s": round(avg_latency, 3), + "avg_accept_length": round(avg_accept_len, 3), + } + + print( + f" Throughput: {throughput:.1f} tok/s | " + f"Avg latency: {avg_latency:.3f}s | " + f"Avg accept_len: {avg_accept_len:.2f} | " + f"Errors: {len(errors)}" + ) + return stats + + +def summarize_phases(phase_stats): + ok_stats = [s for s in phase_stats if not s.get("error")] + if not ok_stats: + return {"error": True} + + total_tokens = sum(s["total_tokens"] for s in ok_stats) + total_elapsed = sum(s["elapsed_s"] for s in ok_stats) + total_requests = sum(s["num_requests"] for s in ok_stats) + + weighted_latency = sum(s["avg_latency_s"] * s["num_requests"] for s in ok_stats) + weighted_accept = sum(s["avg_accept_length"] * s["num_requests"] for s in ok_stats) + + 
return { + "num_requests": total_requests, + "total_tokens": total_tokens, + "elapsed_s": round(total_elapsed, 2), + "throughput_tok_s": round(total_tokens / total_elapsed, 2), + "avg_latency_s": round(weighted_latency / total_requests, 3), + "avg_accept_length": round(weighted_accept / total_requests, 3), + } + + +def main(): + parser = argparse.ArgumentParser( + description="Benchmark one workload for adaptive-vs-static speculative decoding" + ) + parser.add_argument("--host", type=str, default="127.0.0.1") + parser.add_argument("--port", type=int, default=30000) + parser.add_argument( + "--workload", + choices=sorted(WORKLOADS), + default="transition", + help="Workload preset to run.", + ) + parser.add_argument( + "--requests", + type=int, + default=8, + help="Requests per phase.", + ) + parser.add_argument("--max-tokens", type=int, default=256) + parser.add_argument( + "--concurrency", + type=int, + default=2, + help="Concurrent requests.", + ) + parser.add_argument( + "--warmup", type=int, default=2, help="Warmup requests before the benchmark." 
+ ) + args = parser.parse_args() + + if args.requests < 1: + parser.error("--requests must be >= 1") + if args.concurrency < 1: + parser.error("--concurrency must be >= 1") + if args.warmup < 0: + parser.error("--warmup must be >= 0") + + base_url = f"http://{args.host}:{args.port}" + + print(f"Server: {base_url}") + print(f"Workload: {args.workload}") + + phase_plan = build_phase_plan(args.workload, args.requests) + if args.warmup > 0: + print(f"\nWarming up with {args.warmup} requests...") + warmup_prompts = phase_plan[0][1] + run_phase( + base_url, + warmup_prompts, + "warmup", + args.warmup, + args.max_tokens, + args.concurrency, + ) + + phase_stats = [] + for phase_name, prompts, num_requests in phase_plan: + phase_stats.append( + run_phase( + base_url, + prompts, + phase_name, + num_requests, + args.max_tokens, + args.concurrency, + ) + ) + + overall = summarize_phases(phase_stats) + + print("\n" + "=" * 70) + print("SUMMARY") + print("=" * 70) + print(f"{'Phase':<10} {'Throughput':>12} {'Avg Latency':>12} {'Accept Len':>12}") + print("-" * 50) + for stats in phase_stats: + if stats.get("error"): + print(f"{stats['phase']:<10} {'ERROR':>12}") + continue + print( + f"{stats['phase']:<10} " + f"{stats['throughput_tok_s']:>10.1f}/s " + f"{stats['avg_latency_s']:>10.3f}s " + f"{stats['avg_accept_length']:>11.2f}" + ) + + if not overall.get("error"): + print("-" * 50) + print( + f"{'OVERALL':<10} " + f"{overall['throughput_tok_s']:>10.1f}/s " + f"{overall['avg_latency_s']:>10.3f}s " + f"{overall['avg_accept_length']:>11.2f}" + ) + + +if __name__ == "__main__": + main() diff --git a/benchmark/bench_linear_attention/bench_cutedsl_kda_decode.py b/benchmark/bench_linear_attention/bench_cutedsl_kda_decode.py new file mode 100644 index 000000000000..ea124c487bdd --- /dev/null +++ b/benchmark/bench_linear_attention/bench_cutedsl_kda_decode.py @@ -0,0 +1,481 @@ +"""Benchmark & Correctness: CuTe DSL KDA Decode vs Triton KDA Decode. 
+ +This benchmark assumes the production / Triton canonical state layout: + ssm_states.shape == (pool_size, HV, V, K) + +Both the Triton baseline and the CuTe DSL candidate operate directly on that VK +layout. No transpose is performed anywhere in the benchmark. +""" + +import argparse +import os +import sys +import time + +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "python")) + +import torch +import triton + +from sglang.jit_kernel.cutedsl_kda import cutedsl_fused_sigmoid_gating_kda_update +from sglang.srt.layers.attention.fla.fused_sigmoid_gating_recurrent import ( + fused_sigmoid_gating_delta_rule_update, +) +from sglang.srt.layers.attention.fla.kda import chunk_kda + + +def make_inputs( + B: int, + H: int, + HV: int, + K: int, + V: int, + pool_size: int, + device: str, + dtype: torch.dtype, + layout: str, + seed: int = 42, +): + torch.manual_seed(seed) + + assert K == 128 + assert V % 16 == 0 and V % 32 == 0 + + if layout == "varlen": + q = torch.randn(1, B, H, K, device=device, dtype=dtype) + k = torch.randn(1, B, H, K, device=device, dtype=dtype) + v = torch.randn(1, B, HV, V, device=device, dtype=dtype) + + # decode params + a = torch.randn(B, HV, K, device=device, dtype=dtype) + b = torch.randn(B, HV, device=device, dtype=dtype) + + # prefill params for chunk_kda must keep batch dim = 1 + # chunk_kda requires g, beta, v to have the same head count as k (H), + # matching the real KimiLinear model where num_heads == num_kv_heads. 
+ prefill_v = torch.randn(1, B, H, V, device=device, dtype=dtype) + prefill_g = torch.randn(1, B, H, K, device=device, dtype=dtype) + prefill_beta = torch.sigmoid(torch.randn(1, B, H, device=device, dtype=dtype)) + + cu_seqlens = torch.arange(B + 1, device=device, dtype=torch.int32) + + elif layout == "dense": + q = torch.randn(B, 1, H, K, device=device, dtype=dtype) + k = torch.randn(B, 1, H, K, device=device, dtype=dtype) + v = torch.randn(B, 1, HV, V, device=device, dtype=dtype) + + # decode params + a = torch.randn(B, 1, HV, K, device=device, dtype=dtype) + b = torch.randn(B, 1, HV, device=device, dtype=dtype) + + # prefill params for chunk_kda dense path + # chunk_kda requires g, beta, v to have the same head count as k (H), + # matching the real KimiLinear model where num_heads == num_kv_heads. + prefill_v = torch.randn(B, 1, H, V, device=device, dtype=dtype) + prefill_g = torch.randn(B, 1, H, K, device=device, dtype=dtype) + prefill_beta = torch.sigmoid(torch.randn(B, 1, H, device=device, dtype=dtype)) + + cu_seqlens = torch.arange(B + 1, device=device, dtype=torch.int32) + else: + raise ValueError(f"Unknown layout: {layout}") + + A_log = torch.randn(HV, device=device, dtype=torch.float32) + dt_bias = torch.randn(HV, K, device=device, dtype=dtype) + + ssm_states = ( + torch.randn(pool_size, HV, V, K, device=device, dtype=torch.float32) * 0.1 + ) + cache_indices = torch.arange(B, device=device, dtype=torch.int32) + + return dict( + B=B, + H=H, + HV=HV, + K=K, + V=V, + pool_size=pool_size, + layout=layout, + q=q, + k=k, + v=v, + a=a, + b=b, + prefill_v=prefill_v, + prefill_g=prefill_g, + prefill_beta=prefill_beta, + A_log=A_log, + dt_bias=dt_bias, + ssm_states=ssm_states, + cache_indices=cache_indices, + cu_seqlens=cu_seqlens, + ) + + +def run_baseline(inp): + state = inp["ssm_states"].clone() + o = fused_sigmoid_gating_delta_rule_update( + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + q=inp["q"], + k=inp["k"], + v=inp["v"], + a=inp["a"], + b=inp["b"], + 
initial_state_source=state, + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"] if inp["layout"] == "varlen" else None, + use_qk_l2norm_in_kernel=True, + softplus_beta=1.0, + softplus_threshold=20.0, + is_kda=True, + ) + return o, state + + +def run_cutedsl(inp): + state = inp["ssm_states"].clone() + o = cutedsl_fused_sigmoid_gating_kda_update( + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + q=inp["q"], + k=inp["k"], + v=inp["v"], + a=inp["a"], + b=inp["b"], + initial_state_source=state, + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"] if inp["layout"] == "varlen" else None, + use_qk_l2norm_in_kernel=True, + softplus_beta=1.0, + softplus_threshold=20.0, + ) + return o, state + + +def run_prefill_then_decode_baseline(inp): + ssm_states = inp["ssm_states"].clone() + prefill_v_clone = inp["prefill_v"].clone() + v_clone = inp["v"].clone() + + _ = chunk_kda( + q=inp["q"], + k=inp["k"], + v=prefill_v_clone, + g=inp["prefill_g"], + beta=inp["prefill_beta"], + initial_state=ssm_states, + initial_state_indices=inp["cache_indices"], + use_qk_l2norm_in_kernel=True, + cu_seqlens=inp["cu_seqlens"] if inp["layout"] == "varlen" else None, + ) + + o = fused_sigmoid_gating_delta_rule_update( + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + q=inp["q"], + k=inp["k"], + v=v_clone, + a=inp["a"], + b=inp["b"], + initial_state_source=ssm_states, + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"] if inp["layout"] == "varlen" else None, + use_qk_l2norm_in_kernel=True, + softplus_beta=1.0, + softplus_threshold=20.0, + is_kda=True, + ) + return o, ssm_states + + +def run_prefill_then_decode_cutedsl(inp): + ssm_states = inp["ssm_states"].clone() + prefill_v_clone = inp["prefill_v"].clone() + v_clone = inp["v"].clone() + + _ = chunk_kda( + q=inp["q"], + k=inp["k"], + v=prefill_v_clone, + g=inp["prefill_g"], + beta=inp["prefill_beta"], + initial_state=ssm_states, + initial_state_indices=inp["cache_indices"], + 
use_qk_l2norm_in_kernel=True, + cu_seqlens=inp["cu_seqlens"] if inp["layout"] == "varlen" else None, + ) + + o = cutedsl_fused_sigmoid_gating_kda_update( + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + q=inp["q"], + k=inp["k"], + v=v_clone, + a=inp["a"], + b=inp["b"], + initial_state_source=ssm_states, + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"] if inp["layout"] == "varlen" else None, + use_qk_l2norm_in_kernel=True, + softplus_beta=1.0, + softplus_threshold=20.0, + ) + return o, ssm_states + + +def _assert_close(name, x, y, atol=3e-2, rtol=2e-2): + try: + torch.testing.assert_close(x.float(), y.float(), atol=atol, rtol=rtol) + return True, 0.0 + except AssertionError: + max_diff = (x - y).abs().max().item() + return False, max_diff + + +def check_correctness(B, H, HV, K, V, pool_size, device, dtype, layout): + tag = ( + f"layout={layout:<6} B={B:>4} H={H:>2} HV={HV:>2} " + f"K={K:>3} V={V:>3} pool={pool_size:>4}" + ) + inp = make_inputs(B, H, HV, K, V, pool_size, device, dtype, layout) + + o_ref, st_ref = run_baseline(inp) + o_cute, st_cute = run_cutedsl(inp) + + ok_o, diff_o = _assert_close("output", o_cute, o_ref) + valid_mask = inp["cache_indices"] >= 0 + valid_idx = inp["cache_indices"][valid_mask] + ok_s, diff_s = _assert_close("state", st_cute[valid_idx], st_ref[valid_idx]) + + if ok_o and ok_s: + print(f" [PASS] {tag}") + return True + + details = [] + if not ok_o: + details.append(f"output max_diff={diff_o:.6f}") + if not ok_s: + details.append(f"state max_diff={diff_s:.6f}") + print(f" [FAIL] {tag} ({', '.join(details)})") + return False + + +def check_prefill_chain(B, H, HV, K, V, pool_size, device, dtype, layout): + tag = ( + f"[prefill->decode] layout={layout:<6} B={B:>4} H={H:>2} HV={HV:>2} " + f"K={K:>3} V={V:>3} pool={pool_size:>4}" + ) + inp = make_inputs(B, H, HV, K, V, pool_size, device, dtype, layout) + + o_ref, st_ref = run_prefill_then_decode_baseline(inp) + o_cute, st_cute = 
run_prefill_then_decode_cutedsl(inp) + + ok_o, diff_o = _assert_close("output", o_cute, o_ref) + valid_mask = inp["cache_indices"] >= 0 + valid_idx = inp["cache_indices"][valid_mask] + ok_s, diff_s = _assert_close("state", st_cute[valid_idx], st_ref[valid_idx]) + + if ok_o and ok_s: + print(f" [PASS] {tag}") + return True + + details = [] + if not ok_o: + details.append(f"output max_diff={diff_o:.6f}") + if not ok_s: + details.append(f"state max_diff={diff_s:.6f}") + print(f" [FAIL] {tag} ({', '.join(details)})") + return False + + +def bench_shape(B, H, HV, K, V, pool_size, device, dtype, layout): + inp = make_inputs(B, H, HV, K, V, pool_size, device, dtype, layout) + + def fn_triton(): + fused_sigmoid_gating_delta_rule_update( + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + q=inp["q"], + k=inp["k"], + v=inp["v"], + a=inp["a"], + b=inp["b"], + initial_state_source=inp["ssm_states"], + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"] if inp["layout"] == "varlen" else None, + use_qk_l2norm_in_kernel=True, + softplus_beta=1.0, + softplus_threshold=20.0, + is_kda=True, + ) + + def fn_cute(): + cutedsl_fused_sigmoid_gating_kda_update( + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + q=inp["q"], + k=inp["k"], + v=inp["v"], + a=inp["a"], + b=inp["b"], + initial_state_source=inp["ssm_states"], + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"] if inp["layout"] == "varlen" else None, + use_qk_l2norm_in_kernel=True, + softplus_beta=1.0, + softplus_threshold=20.0, + ) + + for _ in range(10): + fn_triton() + fn_cute() + torch.cuda.synchronize() + + try: + ms_triton, _, _ = triton.testing.do_bench( + fn_triton, quantiles=[0.5, 0.2, 0.8], warmup=50, rep=200 + ) + ms_cute, _, _ = triton.testing.do_bench( + fn_cute, quantiles=[0.5, 0.2, 0.8], warmup=50, rep=200 + ) + except Exception: + rep = 100 + st = time.perf_counter() + for _ in range(rep): + fn_triton() + torch.cuda.synchronize() + ms_triton = (time.perf_counter() - st) 
/ rep * 1000 + + st = time.perf_counter() + for _ in range(rep): + fn_cute() + torch.cuda.synchronize() + ms_cute = (time.perf_counter() - st) / rep * 1000 + + speedup = ms_triton / ms_cute if ms_cute > 0 else float("inf") + delta = (ms_cute - ms_triton) * 1000 + print( + f" {layout:>6} {B:>5} {H:>3} {HV:>3} {K:>3} {V:>3} | " + f"{ms_triton * 1000:>12.1f} | " + f"{ms_cute * 1000:>13.1f} | " + f"{speedup:>8.2f} | " + f"{delta:>11.1f}" + ) + + +def run_correctness(device, dtype): + print("=" * 78) + print("Correctness: Triton KDA Decode vs CuTe DSL KDA Decode") + print("=" * 78) + + shapes = [ + ("dense", 1, 8, 16, 128, 128, 32), + ("dense", 4, 8, 16, 128, 128, 32), + ("dense", 32, 8, 16, 128, 128, 128), + ("dense", 64, 8, 16, 128, 128, 128), + ("varlen", 4, 8, 16, 128, 128, 32), + ("varlen", 16, 8, 16, 128, 128, 64), + ("varlen", 32, 8, 16, 128, 128, 128), + ("varlen", 64, 8, 16, 128, 128, 128), + ("varlen", 1, 16, 32, 128, 128, 32), + ("varlen", 32, 16, 32, 128, 128, 128), + ("varlen", 64, 16, 16, 128, 128, 128), + ] + + all_pass = True + for layout, B, H, HV, K, V, pool_size in shapes: + if not check_correctness(B, H, HV, K, V, pool_size, device, dtype, layout): + all_pass = False + + print() + print("=" * 78) + print("Correctness: Triton prefill/extend -> CuTe decode chain") + print("=" * 78) + for layout, B, H, HV, K, V, pool_size in shapes[:8]: + if not check_prefill_chain(B, H, HV, K, V, pool_size, device, dtype, layout): + all_pass = False + + print() + print("ALL PASSED." 
if all_pass else "SOME FAILED.") + return all_pass + + +def run_benchmark(device, dtype): + print() + print("=" * 92) + print("Benchmark: Triton KDA Decode vs CuTe DSL KDA Decode") + print("=" * 92) + + bench_configs = [ + ("dense", 1, 8, 16), + ("dense", 4, 8, 16), + ("dense", 32, 8, 16), + ("dense", 64, 8, 16), + ("varlen", 1, 8, 16), + ("varlen", 4, 8, 16), + ("varlen", 8, 8, 16), + ("varlen", 16, 8, 16), + ("varlen", 32, 8, 16), + ("varlen", 64, 8, 16), + ("varlen", 128, 8, 16), + ("varlen", 32, 16, 32), + ("varlen", 64, 16, 16), + ] + + K = 128 + V = 128 + pool_size = 512 + + print(f" Config: K={K}, V={V}, pool_size={pool_size}, dtype={dtype}") + print( + f" {'layout':>6} {'B':>5} {'H':>3} {'HV':>3} {'K':>3} {'V':>3} | " + f"{'triton (μs)':>12} | " + f"{'cutedsl (μs)':>13} | " + f"{'speedup':>8} | " + f"{'delta (μs)':>11}" + ) + print(" " + "-" * 82) + + for layout, B, H, HV in bench_configs: + actual_pool = max(pool_size, B + 16) + bench_shape(B, H, HV, K, V, actual_pool, device, dtype, layout) + + +def main(): + parser = argparse.ArgumentParser( + description="Benchmark & Correctness: Triton KDA Decode vs CuTe DSL KDA Decode" + ) + parser.add_argument( + "--mode", + choices=["all", "correctness", "bench"], + default="all", + help="Run mode (default: all)", + ) + parser.add_argument( + "--dtype", + choices=["float16", "bfloat16", "float32"], + default="bfloat16", + ) + args = parser.parse_args() + + device = "cuda" + dtype = getattr(torch, args.dtype) + + cap = torch.cuda.get_device_capability() + dev_name = torch.cuda.get_device_name() + print(f"Device: {dev_name} (SM {cap[0]}{cap[1]})") + + if args.mode in ("all", "correctness"): + all_pass = run_correctness(device, dtype) + if not all_pass and args.mode == "all": + print("\nSkipping benchmark due to correctness failures.") + return 1 + + if args.mode in ("all", "bench"): + run_benchmark(device, dtype) + + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git 
a/benchmark/bench_linear_attention/bench_fused_gate_cumsum.py b/benchmark/bench_linear_attention/bench_fused_gate_cumsum.py new file mode 100644 index 000000000000..1b2c105c3b41 --- /dev/null +++ b/benchmark/bench_linear_attention/bench_fused_gate_cumsum.py @@ -0,0 +1,213 @@ +""" +Benchmark: Fused Gate+Cumsum vs Separate Gate + Cumsum. + +Compares two paths: + - Separate: torch gate activation -> chunk_local_cumsum (2 steps) + - Fused: kda_gate_chunk_cumsum (single kernel) + +Both produce the same output: cumsum of gate-activated g. + +Usage: + python bench_fused_gate_cumsum.py + python bench_fused_gate_cumsum.py --batch-sizes 4 16 64 128 + python bench_fused_gate_cumsum.py --seq-lens 64 128 256 512 1024 +""" + +import argparse +import os +import sys + +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "python")) + +import torch +import triton + +from sglang.srt.layers.attention.fla.cumsum import chunk_local_cumsum +from sglang.srt.layers.attention.fla.index import prepare_chunk_indices +from sglang.srt.layers.attention.fla.kda import kda_gate_chunk_cumsum + +CHUNK_SIZE = 64 + + +def make_inputs( + B: int, + T_per_seq: int, + H: int, + K: int, + device: str, + dtype: torch.dtype, + seed: int = 42, +): + T = B * T_per_seq + torch.manual_seed(seed) + + # Raw gate: [1, T_total, H, K] (varlen format, before activation) + raw_g = torch.randn(1, T, H, K, dtype=dtype, device=device) + + # A_log: [H] (per-head log-scale parameter) + A_log = torch.randn(H, dtype=torch.float32, device=device) * 0.5 + + # dt_bias: [H*K] (per-head bias, flat) + dt_bias = torch.randn(H * K, dtype=torch.float32, device=device) * 0.1 + + # cu_seqlens for varlen mode + cu_seqlens = torch.arange( + 0, (B + 1) * T_per_seq, T_per_seq, dtype=torch.long, device=device + ) + + return dict( + raw_g=raw_g, + A_log=A_log, + dt_bias=dt_bias, + cu_seqlens=cu_seqlens, + B=B, + T=T, + T_per_seq=T_per_seq, + H=H, + K=K, + ) + + +def run_ref(inp): + """Separate path: torch gate activation -> 
chunk_local_cumsum.""" + raw_g = inp["raw_g"] # [1, T, H, K] + A_log = inp["A_log"] # [H] + dt_bias = inp["dt_bias"] # [H*K] + cu_seqlens = inp["cu_seqlens"] + H, K = inp["H"], inp["K"] + + # Step 1: gate activation using torch ops + g_float = raw_g.float() + if dt_bias is not None: + g_float = g_float + dt_bias.float().view(1, 1, H, K) + g_activated = -torch.exp( + A_log.float().view(1, 1, H, 1) + ) * torch.nn.functional.softplus(g_float) + + # Step 2: chunk-local cumsum + chunk_indices = prepare_chunk_indices(cu_seqlens, CHUNK_SIZE) + g_cumsum = chunk_local_cumsum( + g_activated, + chunk_size=CHUNK_SIZE, + cu_seqlens=cu_seqlens, + chunk_indices=chunk_indices, + ) + return g_cumsum + + +def run_fused(inp): + """Fused path: kda_gate_chunk_cumsum (single kernel).""" + raw_g = inp["raw_g"] + A_log = inp["A_log"] + dt_bias = inp["dt_bias"] + cu_seqlens = inp["cu_seqlens"] + + chunk_indices = prepare_chunk_indices(cu_seqlens, CHUNK_SIZE) + g_cumsum = kda_gate_chunk_cumsum( + raw_g, + A_log=A_log, + chunk_size=CHUNK_SIZE, + dt_bias=dt_bias, + cu_seqlens=cu_seqlens, + chunk_indices=chunk_indices, + ) + return g_cumsum + + +def verify_correctness(inp): + """Verify fused and separate paths produce the same output.""" + out_separate = run_ref(inp) + out_fused = run_fused(inp) + + max_diff = (out_separate - out_fused).abs().max().item() + rel_diff = max_diff / (out_separate.abs().mean().item() + 1e-8) + return max_diff, rel_diff + + +def bench_shape(B, H, T_per_seq, K, device, dtype): + T = B * T_per_seq + inp = make_inputs(B, T_per_seq, H, K, device, dtype) + + # Warmup (includes triton compilation) + for _ in range(5): + run_ref(inp) + run_fused(inp) + torch.cuda.synchronize() + + ms_sep, ms_sep_lo, ms_sep_hi = triton.testing.do_bench( + lambda: run_ref(inp), quantiles=[0.5, 0.2, 0.8], warmup=50, rep=200 + ) + ms_fused, ms_fused_lo, ms_fused_hi = triton.testing.do_bench( + lambda: run_fused(inp), quantiles=[0.5, 0.2, 0.8], warmup=50, rep=200 + ) + + speedup = ms_sep / 
ms_fused if ms_fused > 0 else 0 + saved_us = (ms_sep - ms_fused) * 1000 # microseconds + + print( + f" {B:>5} {H:>3} {T_per_seq:>6} {T:>7} | " + f"{ms_sep:>8.3f} {ms_fused:>8.3f} | " + f"{speedup:>6.2f}x {saved_us:>+8.1f}us" + ) + + +def main(): + parser = argparse.ArgumentParser( + description="Benchmark: Fused vs Separate Gate+Cumsum" + ) + parser.add_argument("--dtype", choices=["bfloat16", "float16"], default="bfloat16") + parser.add_argument("--head-size-k", type=int, default=128) + parser.add_argument("--num-heads", type=int, nargs="+", default=[16]) + parser.add_argument( + "--batch-sizes", type=int, nargs="+", default=[4, 8, 16, 32, 64, 128] + ) + parser.add_argument( + "--seq-lens", type=int, nargs="+", default=[64, 128, 256, 512, 1024] + ) + args = parser.parse_args() + + device = "cuda" + dtype = getattr(torch, args.dtype) + K = args.head_size_k + + cap = torch.cuda.get_device_capability() + dev_name = torch.cuda.get_device_name() + print(f"Device: {dev_name} (SM {cap[0]}{cap[1]})") + print() + + # Correctness check + print("=" * 80) + print("Correctness verification") + print("=" * 80) + for H in args.num_heads: + inp = make_inputs(16, 256, H, K, device, dtype) + max_diff, rel_diff = verify_correctness(inp) + print( + f" H={H:>3}, B=16, T/seq=256: " + f"max_diff={max_diff:.2e}, rel_diff={rel_diff:.2e} " + f"{'PASS' if max_diff < 1e-3 else 'FAIL'}" + ) + print() + + # Performance benchmark + print("=" * 80) + print("Performance: Separate (gate+cumsum) vs Fused (single kernel)") + print("=" * 80) + print(f" Config: K={K}, chunk_size={CHUNK_SIZE}, dtype={dtype}") + print( + f" {'B':>5} {'H':>3} {'T/seq':>6} {'T_tot':>7} | " + f"{'sep(ms)':>8} {'fuse(ms)':>8} | " + f"{'speedup':>6} {'saved':>9}" + ) + print(" " + "-" * 73) + + for H in args.num_heads: + for B in args.batch_sizes: + for T_per_seq in args.seq_lens: + bench_shape(B, H, T_per_seq, K, device, dtype) + if len(args.num_heads) > 1: + print() + + +if __name__ == "__main__": + sys.exit(main()) diff 
--git a/benchmark/bench_linear_attention/bench_gdn_decode.py b/benchmark/bench_linear_attention/bench_gdn_decode.py new file mode 100644 index 000000000000..816c6d978aeb --- /dev/null +++ b/benchmark/bench_linear_attention/bench_gdn_decode.py @@ -0,0 +1,488 @@ +""" +Benchmark & Correctness: GDN Packed Decode vs Baseline Decode. + +Compares: + - Baseline: split(mixed_qkv) → view → fused_sigmoid_gating_delta_rule_update + - Packed: fused_recurrent_gated_delta_rule_packed_decode (single kernel) + +The packed path eliminates: + - torch.split() + .view() tensor materialization + - Separate gating kernel launches + - Intermediate tensor allocations + +Reports correctness (output & state matching) and performance (ms, speedup). + +Usage: + python bench_gdn_decode.py # default sweep + python bench_gdn_decode.py --mode bench # benchmark only + python bench_gdn_decode.py --mode correctness # correctness only + python bench_gdn_decode.py --preset qwen3.5-35b # Qwen3.5-35B-A3B config +""" + +import argparse +import os +import sys +import time + +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "python")) + +import torch +import triton + +from sglang.srt.layers.attention.fla.fused_recurrent import ( + fused_recurrent_gated_delta_rule_packed_decode, +) +from sglang.srt.layers.attention.fla.fused_sigmoid_gating_recurrent import ( + fused_sigmoid_gating_delta_rule_update, +) + +# --------------------------------------------------------------------------- +# Input factory +# --------------------------------------------------------------------------- + + +def make_inputs( + B: int, + H: int, + HV: int, + K: int, + V: int, + pool_size: int, + device: str, + dtype: torch.dtype, + seed: int = 42, +): + """Create all input tensors for a single benchmark / correctness run.""" + torch.manual_seed(seed) + + qkv_dim = 2 * H * K + HV * V + mixed_qkv = torch.randn(B, qkv_dim, device=device, dtype=dtype) + a = torch.randn(B, HV, device=device, dtype=dtype) + b = torch.randn(B, 
HV, device=device, dtype=dtype) + A_log = torch.randn(HV, device=device, dtype=dtype) + dt_bias = torch.randn(HV, device=device, dtype=dtype) + + ssm_states = torch.randn(pool_size, HV, V, K, device=device, dtype=dtype) * 0.1 + cache_indices = torch.arange(B, device=device, dtype=torch.int32) + + cu_seqlens = torch.arange(B + 1, device=device, dtype=torch.long) + + return dict( + B=B, + H=H, + HV=HV, + K=K, + V=V, + qkv_dim=qkv_dim, + pool_size=pool_size, + mixed_qkv=mixed_qkv, + a=a, + b=b, + A_log=A_log, + dt_bias=dt_bias, + ssm_states=ssm_states, + cache_indices=cache_indices, + cu_seqlens=cu_seqlens, + ) + + +# --------------------------------------------------------------------------- +# Runner wrappers +# --------------------------------------------------------------------------- + + +def run_baseline(inp): + """Baseline path: split → view → fused_sigmoid_gating_delta_rule_update. + + This mirrors the FULL original decode path in GDNAttnBackend.forward_decode, + including the split, view, and kernel call. 
+ """ + B, H, HV, K, V = inp["B"], inp["H"], inp["HV"], inp["K"], inp["V"] + mixed_qkv = inp["mixed_qkv"] + ssm_states = inp["ssm_states"].clone() + + # Step 1: split (same as forward_decode) + q_flat, k_flat, v_flat = torch.split(mixed_qkv, [H * K, H * K, HV * V], dim=-1) + + # Step 2: view + reshape (same as forward_decode) + q = q_flat.view(1, B, H, K) + k = k_flat.view(1, B, H, K) + v = v_flat.view(1, B, HV, V) + + # Step 3: fused gating + recurrent update + o = fused_sigmoid_gating_delta_rule_update( + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + q=q, + k=k, + v=v, + a=inp["a"], + b=inp["b"], + initial_state_source=ssm_states, + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"], + use_qk_l2norm_in_kernel=True, + softplus_beta=1.0, + softplus_threshold=20.0, + ) + + return o, ssm_states + + +def run_packed(inp): + """Packed path: single fused kernel directly on mixed_qkv.""" + B, HV, K, V = inp["B"], inp["HV"], inp["K"], inp["V"] + ssm_states = inp["ssm_states"].clone() + out = inp["mixed_qkv"].new_empty(B, 1, HV, V) + + fused_recurrent_gated_delta_rule_packed_decode( + mixed_qkv=inp["mixed_qkv"], + a=inp["a"], + b=inp["b"], + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + scale=inp["K"] ** -0.5, + initial_state=ssm_states, + out=out, + ssm_state_indices=inp["cache_indices"], + use_qk_l2norm_in_kernel=True, + ) + + # Convert [B, 1, HV, V] → [1, B, HV, V] to match baseline layout + return out.transpose(0, 1), ssm_states + + +# --------------------------------------------------------------------------- +# Correctness check +# --------------------------------------------------------------------------- + + +def check_correctness(B, H, HV, K, V, pool_size, device, dtype, seed=42): + """Run correctness check for a single config. 
Returns True if PASS.""" + tag = f"B={B:>4} H={H:>2} HV={HV:>2} K={K:>3} V={V:>3} pool={pool_size:>4}" + + inp = make_inputs(B, H, HV, K, V, pool_size, device, dtype, seed=seed) + + o_baseline, state_baseline = run_baseline(inp) + o_packed, state_packed = run_packed(inp) + + # Output comparison + atol = 2e-2 if dtype != torch.float32 else 1e-4 + rtol = 1e-2 if dtype != torch.float32 else 1e-4 + + try: + torch.testing.assert_close(o_packed, o_baseline, atol=atol, rtol=rtol) + output_ok = True + except AssertionError as e: + output_ok = False + out_diff = (o_packed - o_baseline).abs().max().item() + + # State comparison (only for slots that were updated) + indices = inp["cache_indices"] + try: + torch.testing.assert_close( + state_packed[indices], state_baseline[indices], atol=atol, rtol=rtol + ) + state_ok = True + except AssertionError: + state_ok = False + st_diff = (state_packed[indices] - state_baseline[indices]).abs().max().item() + + passed = output_ok and state_ok + + if passed: + print(f" [PASS] {tag}") + else: + details = [] + if not output_ok: + details.append(f"output max_diff={out_diff:.6f}") + if not state_ok: + details.append(f"state max_diff={st_diff:.6f}") + print(f" [FAIL] {tag} ({', '.join(details)})") + + return passed + + +# --------------------------------------------------------------------------- +# Benchmark +# --------------------------------------------------------------------------- + + +def bench_shape(B, H, HV, K, V, pool_size, device, dtype): + """Benchmark baseline vs packed for a single config.""" + inp = make_inputs(B, H, HV, K, V, pool_size, device, dtype) + + # ── Baseline: full path including split + view ── + def fn_baseline(): + q_flat, k_flat, v_flat = torch.split( + inp["mixed_qkv"], [H * K, H * K, HV * V], dim=-1 + ) + q = q_flat.view(1, B, H, K) + k = k_flat.view(1, B, H, K) + v = v_flat.view(1, B, HV, V) + fused_sigmoid_gating_delta_rule_update( + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + q=q, + k=k, + v=v, + 
a=inp["a"], + b=inp["b"], + initial_state_source=inp["ssm_states"], + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"], + use_qk_l2norm_in_kernel=True, + softplus_beta=1.0, + softplus_threshold=20.0, + ) + + # ── Packed: single kernel ── + out_buf = inp["mixed_qkv"].new_empty(B, 1, HV, V) + + def fn_packed(): + fused_recurrent_gated_delta_rule_packed_decode( + mixed_qkv=inp["mixed_qkv"], + a=inp["a"], + b=inp["b"], + A_log=inp["A_log"], + dt_bias=inp["dt_bias"], + scale=K**-0.5, + initial_state=inp["ssm_states"], + out=out_buf, + ssm_state_indices=inp["cache_indices"], + use_qk_l2norm_in_kernel=True, + ) + + # Warmup + for _ in range(10): + fn_baseline() + fn_packed() + torch.cuda.synchronize() + + quantiles = [0.5, 0.2, 0.8] + + try: + ms_baseline, ms_base_lo, ms_base_hi = triton.testing.do_bench( + fn_baseline, quantiles=quantiles, warmup=50, rep=200 + ) + ms_packed, ms_pack_lo, ms_pack_hi = triton.testing.do_bench( + fn_packed, quantiles=quantiles, warmup=50, rep=200 + ) + except Exception: + # Fallback to manual timing + torch.cuda.synchronize() + N = 200 + start = time.perf_counter() + for _ in range(N): + fn_baseline() + torch.cuda.synchronize() + ms_baseline = (time.perf_counter() - start) / N * 1000 + + start = time.perf_counter() + for _ in range(N): + fn_packed() + torch.cuda.synchronize() + ms_packed = (time.perf_counter() - start) / N * 1000 + + speedup = ms_baseline / ms_packed if ms_packed > 0 else float("inf") + saved_us = (ms_baseline - ms_packed) * 1000 + + print( + f" {B:>5} {H:>3} {HV:>3} {K:>3} {V:>3} | " + f"{ms_baseline * 1000:>10.1f} | " + f"{ms_packed * 1000:>10.1f} | " + f"{speedup:>7.2f}x | " + f"{saved_us:>+9.1f}" + ) + + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- + + +def run_correctness(device, dtype): + print("=" * 70) + print("Correctness: Baseline GDN Decode vs Packed GDN Decode") + 
print("=" * 70) + + shapes = [ + # (B, H, HV, K, V, pool_size) + # --- Qwen3.5-35B-A3B style (TP=2: H=8, HV=16) --- + (1, 8, 16, 128, 128, 32), + (4, 8, 16, 128, 128, 32), + (16, 8, 16, 128, 128, 64), + (32, 8, 16, 128, 128, 128), + (64, 8, 16, 128, 128, 128), + (128, 8, 16, 128, 128, 256), + (256, 8, 16, 128, 128, 512), + # --- Qwen3.5-35B-A3B style (TP=1: H=16, HV=32) --- + (1, 16, 32, 128, 128, 32), + (32, 16, 32, 128, 128, 128), + (64, 16, 32, 128, 128, 128), + # --- Qwen3-Next-80B-A3B style --- + (32, 16, 16, 128, 128, 128), + (64, 16, 16, 128, 128, 128), + # --- With PAD_SLOT_ID --- + (32, 8, 16, 128, 128, 128), # some indices may be padded + # --- Edge cases --- + (1, 8, 16, 128, 128, 32), + (2, 8, 16, 128, 128, 32), + ] + + all_pass = True + for B, H, HV, K, V, pool_size in shapes: + if not check_correctness(B, H, HV, K, V, pool_size, device, dtype): + all_pass = False + + # PAD_SLOT_ID test + print("\n PAD_SLOT_ID test (indices with -1):") + inp = make_inputs(32, 8, 16, 128, 128, 128, device, dtype) + o_baseline, st_baseline = run_baseline(inp) + o_packed, st_packed = run_packed(inp) + + try: + torch.testing.assert_close(o_packed, o_baseline, atol=2e-2, rtol=1e-2) + print(" [PASS] PAD_SLOT_ID=-1 handling") + except AssertionError: + print(" [FAIL] PAD_SLOT_ID=-1 handling") + all_pass = False + + print() + if all_pass: + print("ALL PASSED.") + else: + print("SOME FAILED.") + return all_pass + + +def run_benchmark(device, dtype, args): + print() + print("=" * 85) + print("Benchmark: Baseline GDN Decode vs Packed GDN Decode") + print("=" * 85) + + K = args.head_size_k + V = args.head_size_v + pool_size = args.pool_size + + if args.preset == "qwen3.5-35b": + # Qwen3.5-35B-A3B: H_qk=16, H_v=32, K=128, V=128 + # After TP=2: H=8, HV=16 + bench_configs = [ + # (B, H, HV) — TP=2 config + (1, 8, 16), + (2, 8, 16), + (4, 8, 16), + (8, 8, 16), + (16, 8, 16), + (32, 8, 16), + (64, 8, 16), + (128, 8, 16), + (256, 8, 16), + (512, 8, 16), + # TP=1 config (full heads) + 
(1, 16, 32), + (8, 16, 32), + (32, 16, 32), + (64, 16, 32), + (128, 16, 32), + (256, 16, 32), + ] + elif args.preset == "qwen3-next-80b": + bench_configs = [ + # Qwen3-Next-80B-A3B: all same H=HV=16 after TP + (1, 16, 16), + (8, 16, 16), + (32, 16, 16), + (64, 16, 16), + (128, 16, 16), + (256, 16, 16), + ] + else: + bench_configs = [] + for B in args.batch_sizes: + for H in args.num_q_heads: + for HV in args.num_v_heads: + bench_configs.append((B, H, HV)) + + print(f" Config: K={K}, V={V}, pool_size={pool_size}, dtype={dtype}") + print( + f" {'B':>5} {'H':>3} {'HV':>3} {'K':>3} {'V':>3} | " + f"{'base (μs)':>10} | " + f"{'packed (μs)':>10} | " + f"{'speedup':>8} | " + f"{'saved (μs)':>10}" + ) + print(" " + "-" * 75) + + for B, H, HV in bench_configs: + actual_pool = max(pool_size, B + 16) + bench_shape(B, H, HV, K, V, actual_pool, device, dtype) + + +def main(): + parser = argparse.ArgumentParser( + description="Benchmark & Correctness: GDN Packed Decode vs Baseline" + ) + parser.add_argument( + "--mode", + choices=["all", "correctness", "bench"], + default="all", + help="Run mode (default: all)", + ) + parser.add_argument( + "--preset", + choices=["qwen3.5-35b", "qwen3-next-80b", "custom"], + default="qwen3.5-35b", + help="Preset config (default: qwen3.5-35b)", + ) + parser.add_argument( + "--dtype", + choices=["float16", "bfloat16", "float32"], + default="bfloat16", + ) + parser.add_argument("--head-size-k", type=int, default=128) + parser.add_argument("--head-size-v", type=int, default=128) + parser.add_argument("--pool-size", type=int, default=512) + parser.add_argument( + "--batch-sizes", + type=int, + nargs="+", + default=[1, 4, 8, 16, 32, 64, 128, 256, 512], + ) + parser.add_argument( + "--num-q-heads", + type=int, + nargs="+", + default=[8, 16], + ) + parser.add_argument( + "--num-v-heads", + type=int, + nargs="+", + default=[16, 32], + ) + args = parser.parse_args() + + device = "cuda" + dtype = getattr(torch, args.dtype) + + cap = 
torch.cuda.get_device_capability() + dev_name = torch.cuda.get_device_name() + print(f"Device: {dev_name} (SM {cap[0]}{cap[1]})") + + if args.mode in ("all", "correctness"): + all_pass = run_correctness(device, dtype) + if not all_pass and args.mode == "all": + print("\nSkipping benchmark due to correctness failures.") + return 1 + + if args.mode in ("all", "bench"): + run_benchmark(device, dtype, args) + + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/benchmark/bench_linear_attention/bench_gdn_prefill.py b/benchmark/bench_linear_attention/bench_gdn_prefill.py new file mode 100644 index 000000000000..04fdb7c5089e --- /dev/null +++ b/benchmark/bench_linear_attention/bench_gdn_prefill.py @@ -0,0 +1,639 @@ +""" +Benchmark & Correctness: Triton GDN vs FlashInfer GDN (prefill). + +Compares: + - Triton: sglang's chunk_gated_delta_rule (K-contiguous pool, pool-indexed) + - FlashInfer: flashinfer's chunk_gated_delta_rule (gather/scatter, 3D tensors) + +The two kernels have different APIs: + - Triton: q/k/v=[1,T,H,D], g=logsigmoid, beta=sigmoid, has initial_state_indices + - FlashInfer: q/k/v=[T,H,D], g=alpha(float32), beta=float32, no indices (gathered state) + +Reports correctness (output & state matching) and performance (ms, TFLOPS, TB/s). 
+ +Usage: + python bench_gdn_prefill.py # default sweep + python bench_gdn_prefill.py --mode bench # benchmark only + python bench_gdn_prefill.py --mode correctness # correctness only + python bench_gdn_prefill.py --preset qwen3-next # Qwen3-Next config +""" + +import argparse +import os +import sys + +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "python")) + +import torch +from flashinfer.gdn_prefill import ( + chunk_gated_delta_rule as flashinfer_chunk_gated_delta_rule, +) + +from sglang.srt.layers.attention.fla.chunk import ( + chunk_gated_delta_rule as triton_chunk_gated_delta_rule, +) +from sglang.srt.layers.attention.fla.l2norm import l2norm_fwd + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def make_k_contiguous(t: torch.Tensor) -> torch.Tensor: + """ + Given a V-contiguous tensor [..., K, V], return a K-contiguous view of the + same logical shape [..., K, V] (physically [..., V, K], K-last). + """ + return t.transpose(-2, -1).contiguous().transpose(-2, -1) + + +def gdn_flops( + total_seq_len: int, + num_heads: int, + head_size_k: int, + head_size_v: int, +) -> int: + """ + FLOPs for GDN prefill (delta rule). + + Per token per head: + 1. k @ v^T (outer product): 2 * K * V + 2. 
q @ state (output): 2 * K * V + """ + outer_product_flops = 2 * total_seq_len * num_heads * head_size_k * head_size_v + output_flops = 2 * total_seq_len * num_heads * head_size_k * head_size_v + return outer_product_flops + output_flops + + +def gdn_bytes( + total_seq_len: int, + num_q_heads: int, + num_v_heads: int, + head_size_k: int, + head_size_v: int, + num_seqs: int, + dtype: torch.dtype, +) -> int: + """Memory bytes accessed (inputs + outputs + state).""" + num_o_heads = max(num_q_heads, num_v_heads) + elem = dtype.itemsize + + q_bytes = total_seq_len * num_q_heads * head_size_k * elem + k_bytes = total_seq_len * num_v_heads * head_size_k * elem + v_bytes = total_seq_len * num_v_heads * head_size_v * elem + o_bytes = total_seq_len * num_o_heads * head_size_v * elem + + # state (float32): read + write + state_bytes = 2 * num_seqs * num_o_heads * head_size_k * head_size_v * 4 + + # g, beta (float32) + g_bytes = total_seq_len * num_o_heads * 4 + beta_bytes = total_seq_len * num_o_heads * 4 + + return q_bytes + k_bytes + v_bytes + o_bytes + state_bytes + g_bytes + beta_bytes + + +# --------------------------------------------------------------------------- +# Input factory +# --------------------------------------------------------------------------- + + +def make_inputs( + B: int, + T_per_seq: int, + H: int, + K: int, + V: int, + pool_size: int, + device: str, + dtype: torch.dtype, + sequential_indices: bool = False, + seed: int = 42, +): + """Create all input tensors for a single benchmark / correctness run. + + Returns a dict with both Triton-format and FlashInfer-format tensors. 
+ """ + T = B * T_per_seq + torch.manual_seed(seed) + + if sequential_indices: + cache_indices = torch.arange(B, dtype=torch.int32, device=device) + else: + perm = torch.randperm(pool_size, device=device)[:B] + cache_indices = perm.to(torch.int32) + + pool_init = torch.randn(pool_size, H, K, V, dtype=dtype, device=device) * 0.1 + + cu_seqlens = torch.arange( + 0, (B + 1) * T_per_seq, T_per_seq, dtype=torch.long, device=device + ) + + # Triton format: [1, T, H, D] + q = torch.randn(1, T, H, K, dtype=dtype, device=device) + k = torch.randn(1, T, H, K, dtype=dtype, device=device) + v = torch.randn(1, T, H, V, dtype=dtype, device=device) + + # g (logsigmoid) and beta (sigmoid) in Triton format: [1, T, H] + g_raw = torch.randn(1, T, H, dtype=dtype, device=device) + g_triton = torch.nn.functional.logsigmoid(g_raw) # logsigmoid for Triton + beta_triton = torch.sigmoid(torch.randn(1, T, H, dtype=dtype, device=device)) + + return dict( + B=B, + T=T, + T_per_seq=T_per_seq, + H=H, + K=K, + V=V, + pool_size=pool_size, + cache_indices=cache_indices, + pool_init=pool_init, + cu_seqlens=cu_seqlens, + q=q, + k=k, + v=v, + g_triton=g_triton, + beta_triton=beta_triton, + ) + + +# --------------------------------------------------------------------------- +# Runner wrappers +# --------------------------------------------------------------------------- + + +def run_triton(inp): + """Triton path: K-contiguous pool, pool-indexed, [1,T,H,D] tensors.""" + pool = make_k_contiguous(inp["pool_init"].clone()) + + o, _, h = triton_chunk_gated_delta_rule( + q=inp["q"], + k=inp["k"], + v=inp["v"], + g=inp["g_triton"], + beta=inp["beta_triton"], + initial_state=pool, + initial_state_indices=inp["cache_indices"], + cu_seqlens=inp["cu_seqlens"], + head_first=False, + use_qk_l2norm_in_kernel=True, + ) + return o, pool, h + + +def run_flashinfer(inp): + """FlashInfer path: matches sglang FlashInferGDNKernel.extend() exactly. 
+ + Key differences from Triton path: + - q, k are L2-normalized BEFORE calling the kernel + - use_qk_l2norm_in_kernel=False (kernel skips internal normalization) + - Tensors are [T, H, D] (no batch dim) + - g is alpha = exp(logsigmoid(...)) = sigmoid(...), float32 + - beta is float32 + - initial_state is gathered from pool (no pool-index support) + - Uses keyword arguments (matching sglang production code) + + NOTE: FlashInfer GDN requires K == V (square head_size). + """ + K = inp["K"] + V = inp["V"] + assert K == V, f"FlashInfer GDN requires K == V, got K={K}, V={V}" + + pool = make_k_contiguous(inp["pool_init"].clone()) + cache_indices = inp["cache_indices"] + + # Gather states from K-contiguous pool -> K-contiguous float32 + # In production, ssm_states is already float32 so .float() is no-op. + # Here pool_init is bf16, so .float() loses K-contiguous layout. + gathered = pool[cache_indices] + initial_state = make_k_contiguous(gathered.float().contiguous()) + + q_fi = l2norm_fwd(inp["q"][0].contiguous()) + k_fi = l2norm_fwd(inp["k"][0].contiguous()) + v_fi = inp["v"][0].contiguous() + + # g -> alpha (exp of logsigmoid = sigmoid), float32 + alpha_fi = torch.exp(inp["g_triton"][0].to(torch.float32)) + # beta -> float32 + beta_fi = inp["beta_triton"][0].to(torch.float32) + + cu_seqlens_fi = inp["cu_seqlens"].to(torch.int64) + + # Call FlashInfer with keyword args (matching sglang production code) + # use_qk_l2norm_in_kernel=False because we pre-normalized above + o_fi, state_fi = flashinfer_chunk_gated_delta_rule( + q=q_fi, + k=k_fi, + v=v_fi, + g=alpha_fi, + beta=beta_fi, + scale=None, + initial_state=initial_state, + output_final_state=True, + cu_seqlens=cu_seqlens_fi, + use_qk_l2norm_in_kernel=False, + ) + + # Scatter updated states back to K-contiguous pool + pool[cache_indices] = state_fi.to(pool.dtype) + + # Reshape output: [T, H, D] -> [1, T, H, D] to match Triton + o_out = o_fi.unsqueeze(0) + + return o_out, pool, state_fi + + +# 
--------------------------------------------------------------------------- +# Correctness check +# --------------------------------------------------------------------------- + + +def check_shape( + B, + T_per_seq, + H, + K, + V, + pool_size, + device, + dtype, + sequential_indices=False, + seed=42, +): + """Run correctness check for a single shape config. Returns True if PASS. + + Pass/fail is based on OUTPUT comparison only (atol=5e-2). + Pool state diff is reported as informational — state divergence over many + tokens is expected due to different chunk sizes and accumulation order. + """ + tag = ( + f"B={B:>3} T/seq={T_per_seq:>4} H={H:>2} K={K:>3} V={V:>3} pool={pool_size:>4}" + ) + idx_tag = " (seq)" if sequential_indices else "" + + # FlashInfer GDN requires K == V (square head_size) + if K != V: + print(f" [SKIP] {tag}{idx_tag} (FlashInfer requires K==V)") + return True + + # FlashInfer GDN CUTLASS kernels are only compiled for head_size=128. + # Running with other sizes causes illegal memory access that poisons + # the CUDA context (unrecoverable), so we must skip upfront. + FLASHINFER_SUPPORTED_HEAD_SIZES = {128} + if K not in FLASHINFER_SUPPORTED_HEAD_SIZES: + print( + f" [SKIP] {tag}{idx_tag} (FlashInfer only supports head_size={FLASHINFER_SUPPORTED_HEAD_SIZES})" + ) + return True + + inp = make_inputs( + B, + T_per_seq, + H, + K, + V, + pool_size, + device, + dtype, + sequential_indices=sequential_indices, + seed=seed, + ) + + o_triton, pool_triton, h_triton = run_triton(inp) + + # FlashInfer may not support all head_size values (e.g., only 128). + # CUDA errors from unsupported configs are often asynchronous, so we + # must synchronize inside the try block to catch them here. + try: + o_fi, pool_fi, _ = run_flashinfer(inp) + torch.cuda.synchronize() + except Exception as e: + # Catch RuntimeError, torch.AcceleratorError, etc. 
+ # Reset CUDA error state so subsequent tests can proceed + try: + torch.cuda.synchronize() + except Exception: + pass + print(f" [SKIP] {tag}{idx_tag} (FlashInfer error: {e})") + return True + + cache_indices = inp["cache_indices"] + + # --- Output comparison --- + # bf16 prefill with L2norm + chunked accumulation + torch.testing.assert_close(o_triton, o_fi, atol=5e-2, rtol=1e-2) + + # --- Stride check --- + def strides_ok(pool): + s = pool.stride() + return s[-2] == 1 and s[-1] == K + + strides_triton = strides_ok(pool_triton) + strides_fi = strides_ok(pool_fi) + + passed = strides_triton and strides_fi + + # Build detail string + details = [] + if not strides_triton: + details.append("triton strides bad") + if not strides_fi: + details.append("flashinfer strides bad") + + status = "PASS" if passed else "FAIL" + detail_str = f" [{', '.join(details)}]" if details else "" + print(f" [{status}] {tag}{idx_tag}{detail_str}") + return passed + + +# --------------------------------------------------------------------------- +# Benchmark +# --------------------------------------------------------------------------- + + +def bench_shape(B, H, T_per_seq, K, V, pool_size, device, dtype): + """Benchmark Triton vs FlashInfer for a single config. 
Requires K == V.""" + import triton.testing + + assert K == V, f"FlashInfer GDN requires K == V, got K={K}, V={V}" + + T = B * T_per_seq + inp = make_inputs(B, T_per_seq, H, K, V, pool_size, device, dtype) + + # -- Shared read-only tensors -- + q, k_t, v = inp["q"], inp["k"], inp["v"] + g_triton, beta_triton = inp["g_triton"], inp["beta_triton"] + cu_seqlens = inp["cu_seqlens"] + cache_indices = inp["cache_indices"] + seq_indices = torch.arange(B, dtype=torch.int32, device=device) + pool_v = inp["pool_init"] + + def fn_triton(): + pool = make_k_contiguous(pool_v.clone()) + triton_chunk_gated_delta_rule( + q=q, + k=k_t, + v=v, + g=g_triton, + beta=beta_triton, + initial_state=pool, + initial_state_indices=cache_indices, + cu_seqlens=cu_seqlens, + head_first=False, + use_qk_l2norm_in_kernel=True, + ) + + def fn_flashinfer(): + # -- Convert to FlashInfer format tensors (NOTE: inside the timed region) -- + # Pre-normalize q and k (matching sglang production: l2norm_fwd) + # q_fi = torch.nn.functional.normalize(q[0].contiguous().float(), p=2.0, dim=-1).to( + # dtype + # ) + # k_fi = torch.nn.functional.normalize(k_t[0].contiguous().float(), p=2.0, dim=-1).to( + # dtype + # ) + q_fi = l2norm_fwd(q[0].contiguous()) + k_fi = l2norm_fwd(k_t[0].contiguous()) + v_fi = v[0].contiguous() + alpha_fi = torch.exp(g_triton[0].to(torch.float32)) + beta_fi = beta_triton[0].to(torch.float32) + cu_seqlens_fi = cu_seqlens.to(torch.int64) + pool = make_k_contiguous(pool_v.clone()) + gathered = pool[cache_indices] + initial_state = make_k_contiguous(gathered.float().contiguous()) + flashinfer_chunk_gated_delta_rule( + q=q_fi, + k=k_fi, + v=v_fi, + g=alpha_fi, + beta=beta_fi, + scale=None, + initial_state=initial_state, + output_final_state=True, + cu_seqlens=cu_seqlens_fi, + use_qk_l2norm_in_kernel=False, + ) + + quantiles = [0.5, 0.2, 0.8] + + # Warmup + fn_triton() + fn_flashinfer() + torch.cuda.synchronize() + + ms_triton, _, _ = triton.testing.do_bench_cudagraph(fn_triton, quantiles=quantiles) + 
ms_fi, _, _ = triton.testing.do_bench_cudagraph(fn_flashinfer, quantiles=quantiles) + + # Metrics + num_o_heads = H + flops = gdn_flops(T, num_o_heads, K, V) + mem_bytes = gdn_bytes(T, H, H, K, V, B, dtype) + + tflops_triton = flops / ms_triton / 1e9 + tflops_fi = flops / ms_fi / 1e9 + tb_s_triton = mem_bytes / ms_triton / 1e9 + tb_s_fi = mem_bytes / ms_fi / 1e9 + + speedup = ms_triton / ms_fi if ms_fi > 0 else float("inf") + + print( + f" {B:>5} {H:>3} {T_per_seq:>6} {T:>7} | " + f"{ms_triton:>8.3f} {tflops_triton:>7.2f} {tb_s_triton:>7.2f} | " + f"{ms_fi:>8.3f} {tflops_fi:>7.2f} {tb_s_fi:>7.2f} | " + f"{speedup:>7.2f}x" + ) + + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- + + +def run_correctness(device, dtype): + print("=" * 78) + print("Correctness sweep: Triton vs FlashInfer") + print("=" * 78) + + shapes = [ + # (B, T_per_seq, H, K, V, pool_size) + # --- baseline (Qwen3-Next style) --- + (4, 64, 16, 128, 128, 32), + (4, 256, 16, 128, 128, 32), + # --- different batch sizes --- + (1, 128, 16, 128, 128, 32), + (8, 128, 16, 128, 128, 64), + (16, 64, 16, 128, 128, 128), + (32, 32, 16, 128, 128, 256), + # --- different head counts --- + (4, 128, 4, 128, 128, 32), + (4, 128, 8, 128, 128, 32), + (4, 128, 16, 64, 64, 32), + (4, 128, 32, 128, 128, 32), + (4, 128, 64, 128, 128, 32), + # --- short sequences --- + (4, 1, 16, 128, 128, 32), + (4, 7, 16, 128, 128, 32), + (4, 16, 16, 128, 128, 32), + # --- large pool (sparse access) --- + (4, 128, 16, 128, 128, 512), + # --- combined stress --- + (32, 128, 32, 128, 128, 256), + ] + + shapes_seq = [ + (8, 128, 16, 128, 128, 8), + (4, 128, 32, 128, 128, 4), + (4, 128, 64, 128, 128, 4), + (32, 128, 32, 128, 128, 32), + ] + + all_pass = True + for B, T_per_seq, H, K, V, pool_size in shapes: + if not check_shape(B, T_per_seq, H, K, V, pool_size, device, dtype): + all_pass = False + + print() + 
print("Sequential-index variants:") + for B, T_per_seq, H, K, V, pool_size in shapes_seq: + if not check_shape( + B, + T_per_seq, + H, + K, + V, + pool_size, + device, + dtype, + sequential_indices=True, + ): + all_pass = False + + print() + if all_pass: + print("ALL PASSED.") + else: + print("SOME FAILED.") + return all_pass + + +def run_benchmark(device, dtype, args): + print() + print("=" * 105) + print("Benchmark: Triton GDN vs FlashInfer GDN (do_bench_cudagraph)") + print("=" * 105) + + K = args.head_size_k + V = args.head_size_v + pool_size = args.pool_size + + if args.preset == "qwen3-next": + bench_configs = [ + # (B, H, T_per_seq) + (4, 16, 256), + (4, 32, 256), + (16, 16, 256), + (16, 32, 256), + (32, 16, 256), + (32, 32, 256), + (64, 16, 256), + (64, 32, 256), + (128, 16, 256), + (128, 32, 256), + # longer sequences + (4, 16, 1024), + (4, 32, 1024), + (32, 16, 1024), + (32, 32, 1024), + ] + else: + bench_configs = [] + for B in args.batch_sizes: + for H in args.num_heads: + for T_per_seq in args.seq_lens: + bench_configs.append((B, H, T_per_seq)) + + print(f" Config: K={K}, V={V}, pool_size={pool_size}, dtype={dtype}") + print( + f" {'B':>5} {'H':>3} {'T/seq':>6} {'T_tot':>7} | " + f"{'tri(ms)':>8} {'TFLOPS':>7} {'TB/s':>7} | " + f"{'fi(ms)':>8} {'TFLOPS':>7} {'TB/s':>7} | " + f"{'speedup':>8}" + ) + print(" " + "-" * 98) + + for B, H, T_per_seq in bench_configs: + actual_pool = max(pool_size, B) + bench_shape(B, H, T_per_seq, K, V, actual_pool, device, dtype) + + +def main(): + parser = argparse.ArgumentParser( + description="Benchmark & Correctness: Triton GDN vs FlashInfer GDN" + ) + parser.add_argument( + "--mode", + choices=["all", "correctness", "bench"], + default="all", + help="Run mode (default: all)", + ) + parser.add_argument( + "--preset", + choices=["qwen3-next", "custom"], + default="qwen3-next", + help="Preset config (default: qwen3-next)", + ) + parser.add_argument( + "--dtype", + choices=["float16", "bfloat16"], + default="bfloat16", + ) 
+ parser.add_argument("--head-size-k", type=int, default=128) + parser.add_argument("--head-size-v", type=int, default=128) + parser.add_argument("--pool-size", type=int, default=256) + parser.add_argument( + "--batch-sizes", + type=int, + nargs="+", + default=[4, 16, 32, 64, 128], + ) + parser.add_argument( + "--num-heads", + type=int, + nargs="+", + default=[16, 32], + ) + parser.add_argument( + "--seq-lens", + type=int, + nargs="+", + default=[128, 256, 512, 1024], + ) + args = parser.parse_args() + + if args.preset == "qwen3-next": + args.head_size_k = 128 + args.head_size_v = 128 + + device = "cuda" + dtype = getattr(torch, args.dtype) + + # Check SM version + cap = torch.cuda.get_device_capability() + dev_name = torch.cuda.get_device_name() + print(f"Device: {dev_name} (SM {cap[0]}{cap[1]})") + + if args.mode in ("all", "correctness"): + all_pass = run_correctness(device, dtype) + if not all_pass and args.mode == "all": + print("\nSkipping benchmark due to correctness failures.") + return 1 + + if args.mode in ("all", "bench"): + run_benchmark(device, dtype, args) + + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/benchmark/bench_pynccl_allocator/bench_segment_tracking.py b/benchmark/bench_pynccl_allocator/bench_segment_tracking.py new file mode 100644 index 000000000000..1900fc7f968c --- /dev/null +++ b/benchmark/bench_pynccl_allocator/bench_segment_tracking.py @@ -0,0 +1,210 @@ +""" +Benchmark for comparing CPU overhead of segment tracking methods: +1. nccl_allocator_register_segments_with_comm() - C++ registration with index tracking +2. 
torch.cuda.MemPool.snapshot() - PyTorch memory pool snapshot + +Usage: + python benchmark/bench_pynccl_allocator/bench_segment_tracking.py --num-segments 50 --num-iters 1000 +""" + +import argparse +import time +import warnings +from typing import List + +import torch + +warnings.filterwarnings("ignore") + + +def setup_segments(num_segments: int, segment_size: int = 1024 * 1024): + """ + Allocate a specified number of segments using the NCCL allocator. + """ + import os + + import torch.distributed as dist + + from sglang.srt.distributed.device_communicators.pynccl_allocator import ( + get_nccl_mem_pool, + ) + + # Initialize distributed if not already done + if not dist.is_initialized(): + os.environ.setdefault("MASTER_ADDR", "localhost") + os.environ.setdefault("MASTER_PORT", "29500") + dist.init_process_group( + backend="nccl", + rank=0, + world_size=1, + device_id=torch.device(f"cuda:{torch.cuda.current_device()}"), + ) + + mem_pool = get_nccl_mem_pool() + + # Allocate segments in the pool + tensors: List[torch.Tensor] = [] + with torch.cuda.use_mem_pool(mem_pool): + for _ in range(num_segments): + t = torch.empty(segment_size, dtype=torch.uint8, device="cuda") + tensors.append(t) + + # Keep tensors alive by returning them (caller should hold reference) + return tensors, mem_pool + + +def bench_register_segments_with_comm( + nccl_lib, comm_ptr: int, num_iters: int = 10000 +) -> float: + """ + Benchmark nccl_allocator_register_segments_with_comm() function. + + Args: + nccl_lib: The loaded NCCL allocator library + comm_ptr: The communicator pointer value + num_iters: Number of iterations + + Returns: + Average time per call in microseconds. 
+ """ + import ctypes + + # Setup the C function signature + register_func = nccl_lib.nccl_allocator_register_segments_with_comm + register_func.restype = ctypes.c_int + register_func.argtypes = [ctypes.c_uint64] + + # Warmup + for _ in range(100): + register_func(comm_ptr) + + # Benchmark + start = time.perf_counter() + for _ in range(num_iters): + register_func(comm_ptr) + end = time.perf_counter() + + avg_us = (end - start) / num_iters * 1e6 + return avg_us + + +def bench_mempool_snapshot( + mem_pool: torch.cuda.MemPool, num_iters: int = 10000 +) -> float: + """ + Benchmark torch.cuda.MemPool.snapshot() function. + + Returns: + Average time per call in microseconds. + """ + # Warmup + for _ in range(100): + mem_pool.snapshot() + + # Benchmark + start = time.perf_counter() + for _ in range(num_iters): + mem_pool.snapshot() + end = time.perf_counter() + + avg_us = (end - start) / num_iters * 1e6 + return avg_us + + +def bench_with_various_segment_counts( + segment_counts: List[int], + num_iters: int = 10000, + segment_size: int = 1024 * 1024, # 1MB per segment +): + """ + Run benchmarks with various numbers of tracked segments. 
+ """ + print("=" * 80) + print("Benchmark: Segment Registration CPU Overhead") + print("=" * 80) + print(f"Segment size: {segment_size / 1024 / 1024:.2f} MB") + print(f"Iterations per measurement: {num_iters}") + print() + print( + f"{'Segments':<12} {'register_segments (µs)':<30} {'snapshot (µs)':<20} {'Speedup':<10}" + ) + print("-" * 80) + + all_tensors = [] # Keep all tensors alive + comm_ptr = 0 # Use dummy comm_ptr for benchmarking (no actual NCCL registration) + + for num_segments in segment_counts: + # Clean up previous segments + all_tensors = [] + + # Allocate segments (this initializes _nccl_allocator_lib via get_nccl_mem_pool) + tensors, mem_pool = setup_segments(num_segments, segment_size) + all_tensors.extend(tensors) + + # Sync to ensure allocations are complete + torch.cuda.synchronize() + + # Import _nccl_allocator_lib after setup_segments (ensures library is loaded) + from sglang.srt.distributed.device_communicators.pynccl_allocator import ( + _nccl_allocator_lib, + ) + + # Run benchmarks + time_register = bench_register_segments_with_comm( + _nccl_allocator_lib, comm_ptr, num_iters + ) + time_snapshot = bench_mempool_snapshot(mem_pool, num_iters) + + speedup = time_snapshot / time_register if time_register > 0 else float("inf") + + print( + f"{num_segments:<12} {time_register:<30.3f} {time_snapshot:<20.3f} {speedup:<10.2f}x" + ) + + print("-" * 80) + print() + + +def main(): + parser = argparse.ArgumentParser( + description="Benchmark segment tracking methods in pynccl_allocator" + ) + parser.add_argument( + "--num-segments", + type=int, + nargs="+", + default=[10, 50, 100, 200, 500, 1000], + help="Number of segments to track (can specify multiple values)", + ) + parser.add_argument( + "--num-iters", + type=int, + default=10000, + help="Number of iterations for each measurement", + ) + parser.add_argument( + "--segment-size", + type=int, + default=1024 * 1024, # 1MB + help="Size of each segment in bytes", + ) + args = parser.parse_args() + + # 
Check CUDA availability + if not torch.cuda.is_available(): + print("Error: CUDA is not available. This benchmark requires a GPU.") + return + + # Initialize CUDA context by creating a small tensor + _ = torch.zeros(1, device="cuda") + + # Run benchmarks + bench_with_various_segment_counts( + segment_counts=args.num_segments, + num_iters=args.num_iters, + segment_size=args.segment_size, + ) + + +if __name__ == "__main__": + main() diff --git a/benchmark/bench_rope/benchmark_rope_index.py b/benchmark/bench_rope/benchmark_rope_index.py new file mode 100644 index 000000000000..d59a96e1a00b --- /dev/null +++ b/benchmark/bench_rope/benchmark_rope_index.py @@ -0,0 +1,425 @@ +# This script benchmarks MRotaryEmbedding.get_rope_index_glm4v (GLM4V mrope index builder). +# It generates synthetic multimodal input_ids + attention_mask (+ optional image/video grids), +# runs benchmarks. +# +# == Usage Examples == +# +# python3 benchmark_rope_index.py --device cuda --num-tokens 1024 2048 --benchmark-iter 200 + +import argparse +import math +import time +from dataclasses import dataclass, field +from typing import Any + +import numpy as np +import torch + +from sglang.srt.layers.rotary_embedding import MRotaryEmbedding + + +# ----------------------------- +# Minimal config objects +# ----------------------------- +@dataclass +class DummyVisionConfig: + spatial_merge_size: int = 2 + + +@dataclass +class DummyHFConfig: + image_token_id: int = 32000 + video_start_token_id: int = 32001 + video_end_token_id: int = 32002 + vision_config: DummyVisionConfig = field( + default_factory=lambda: DummyVisionConfig(spatial_merge_size=2) + ) + + +# ----------------------------- +# Helpers +# ----------------------------- +def calculate_stats(times: list[float]) -> dict[str, float]: + """Calculate statistics from a list of times.""" + times_array = np.array(times, dtype=np.float64) + return { + "mean": float(np.mean(times_array)), + "median": float(np.median(times_array)), + "p99": 
float(np.percentile(times_array, 99)), + "min": float(np.min(times_array)), + "max": float(np.max(times_array)), + } + + +def _sync(device: torch.device): + if device.type == "cuda": + torch.cuda.synchronize() + + +def _approx_hw(patches: int, merge: int) -> tuple[int, int]: + # want (h/merge)*(w/merge) ~= patches + gh = int(math.sqrt(max(1, patches))) + gw = max(1, patches // max(1, gh)) + return gh * merge, gw * merge + + +def generate_test_data( + num_tokens: int, + batch_size: int, + hf_config: DummyHFConfig, + dtype: torch.dtype, + device: torch.device, + pad_ratio: float, + num_images_per_sample: int, + image_patch_tokens: int, + num_videos_per_sample: int, + video_patch_tokens: int, + seed: int, +): + """ + Generate synthetic (input_ids, attention_mask, image_grid_thw, video_grid_thw). + + NOTE: + - image_grid_thw / video_grid_thw are global lists across the entire batch in encounter order, + matching the function's image_index/video_index behavior. + - image patches are represented by repeated image_token_id. + - video patches are represented by image_token_id wrapped with start/end tokens. 
+ """ + torch.manual_seed(seed) + + forbidden = { + 0, + hf_config.image_token_id, + hf_config.video_start_token_id, + hf_config.video_end_token_id, + } + vocab_size = 50000 + + def rand_text(n: int) -> torch.Tensor: + # generate random ids not in forbidden + out = torch.randint(1, vocab_size, (n,), device=device, dtype=torch.long) + # fix forbidden by +1 until ok (cheap, deterministic enough for benchmark data) + for bad in forbidden: + out = torch.where(out == bad, out + 1, out) + return out + + image_grids: list[list[int]] = [] + video_grids: list[list[int]] = [] + + input_ids = torch.zeros((batch_size, num_tokens), device=device, dtype=torch.long) + attention_mask = torch.zeros( + (batch_size, num_tokens), device=device, dtype=torch.long + ) + + eff_len = int(round(num_tokens * (1.0 - pad_ratio))) + eff_len = max(1, min(num_tokens, eff_len)) + + min_needed = 1 + min_needed += num_images_per_sample * image_patch_tokens + min_needed += num_videos_per_sample * (2 + video_patch_tokens) + if eff_len < min_needed: + num_images_per_sample = 0 + num_videos_per_sample = 0 + + for b in range(batch_size): + blocks: list[torch.Tensor] = [] + + reserved = ( + num_images_per_sample * image_patch_tokens + + num_videos_per_sample * (2 + video_patch_tokens) + ) + reserved = min(reserved, max(0, eff_len - 1)) + text_budget = max(1, eff_len - reserved) + + n_text_chunks = num_images_per_sample + num_videos_per_sample + 1 + base = text_budget // n_text_chunks + rem = text_budget % n_text_chunks + text_chunks = [base + (1 if i < rem else 0) for i in range(n_text_chunks)] + + tci = 0 + for _ in range(num_images_per_sample): + blocks.append(rand_text(text_chunks[tci])) + tci += 1 + blocks.append( + torch.full( + (image_patch_tokens,), + hf_config.image_token_id, + device=device, + dtype=torch.long, + ) + ) + + h, w = _approx_hw( + image_patch_tokens, hf_config.vision_config.spatial_merge_size + ) + image_grids.append([1, h, w]) + + for _ in range(num_videos_per_sample): + 
blocks.append(rand_text(text_chunks[tci])) + tci += 1 + blocks.append( + torch.tensor( + [hf_config.video_start_token_id], device=device, dtype=torch.long + ) + ) + blocks.append( + torch.full( + (video_patch_tokens,), + hf_config.image_token_id, + device=device, + dtype=torch.long, + ) + ) + blocks.append( + torch.tensor( + [hf_config.video_end_token_id], device=device, dtype=torch.long + ) + ) + + h, w = _approx_hw( + video_patch_tokens, hf_config.vision_config.spatial_merge_size + ) + # first field = group count used by code; set to 1 + video_grids.append([1, h, w]) + + blocks.append(rand_text(text_chunks[tci])) + + tokens = torch.cat(blocks, dim=0)[:eff_len] + pad = torch.zeros( + (num_tokens - tokens.numel(),), device=device, dtype=torch.long + ) + ids = torch.cat([tokens, pad], dim=0) + + mask = torch.cat( + [ + torch.ones((tokens.numel(),), device=device, dtype=torch.long), + torch.zeros( + (num_tokens - tokens.numel(),), device=device, dtype=torch.long + ), + ], + dim=0, + ) + + input_ids[b] = ids + attention_mask[b] = mask + + image_grid_thw = ( + torch.tensor(image_grids, device=device, dtype=torch.long) + if len(image_grids) + else None + ) + video_grid_thw = ( + torch.tensor(video_grids, device=device, dtype=torch.long) + if len(video_grids) + else None + ) + return ( + input_ids.to(dtype=torch.long), + attention_mask.to(dtype=torch.long), + image_grid_thw, + video_grid_thw, + ) + + +def benchmark_rope_index( + model_name: str, + tp_size: int, + num_tokens: int, + batch_size: int, + pad_ratio: float, + spatial_merge_size: int, + num_images: int, + image_patch_tokens: int, + num_videos: int, + video_patch_tokens: int, + dtype: torch.dtype, + seed: int, + warmup_iter: int, + benchmark_iter: int, + device: torch.device, +): + torch.manual_seed(seed) + hf_config = DummyHFConfig( + image_token_id=32000, + video_start_token_id=32001, + video_end_token_id=32002, + vision_config=DummyVisionConfig(spatial_merge_size=spatial_merge_size), + ) + + print(80 * "=") + 
print( + f"Evaluating: {model_name} tp_size={tp_size} " + f"num_tokens={num_tokens} batch={batch_size} pad_ratio={pad_ratio} " + f"images/sample={num_images} image_patch_tokens={image_patch_tokens} " + f"videos/sample={num_videos} video_patch_tokens={video_patch_tokens} " + f"dtype={dtype} device={device}" + ) + + input_ids, attention_mask, image_grid_thw, video_grid_thw = generate_test_data( + num_tokens=num_tokens, + batch_size=batch_size, + hf_config=hf_config, + dtype=dtype, + device=device, + pad_ratio=pad_ratio, + num_images_per_sample=num_images, + image_patch_tokens=image_patch_tokens, + num_videos_per_sample=num_videos, + video_patch_tokens=video_patch_tokens, + seed=seed, + ) + + # Validate output shapes before benchmarking. + has_mm = (image_grid_thw is not None) or (video_grid_thw is not None) + if has_mm: + pos, delta = MRotaryEmbedding.get_rope_index_glm4v( + input_ids=input_ids, + hf_config=hf_config, + image_grid_thw=image_grid_thw, + video_grid_thw=video_grid_thw, + attention_mask=attention_mask, + ) + assert pos.shape == (3, batch_size, num_tokens) + assert delta.shape == (batch_size, 1) + + # Warm up + for _ in range(warmup_iter): + if has_mm: + MRotaryEmbedding.get_rope_index_glm4v( + input_ids=input_ids, + hf_config=hf_config, + image_grid_thw=image_grid_thw, + video_grid_thw=video_grid_thw, + attention_mask=attention_mask, + ) + MRotaryEmbedding.get_rope_index_glm4v( + input_ids=input_ids, + hf_config=hf_config, + image_grid_thw=None, + video_grid_thw=None, + attention_mask=attention_mask, + ) + + _sync(device) + + # Time multimodal branch + multimodal_times = [] + for _ in range(benchmark_iter): + _sync(device) + start = time.time() + MRotaryEmbedding.get_rope_index_glm4v( + input_ids=input_ids, + hf_config=hf_config, + image_grid_thw=image_grid_thw, + video_grid_thw=video_grid_thw, + attention_mask=attention_mask, + ) + _sync(device) + multimodal_times.append(time.time() - start) + + # Time fallback branch + fallback_times = [] + for _ in 
range(benchmark_iter): + _sync(device) + start = time.time() + MRotaryEmbedding.get_rope_index_glm4v( + input_ids=input_ids, + hf_config=hf_config, + image_grid_thw=None, + video_grid_thw=None, + attention_mask=attention_mask, + ) + _sync(device) + fallback_times.append(time.time() - start) + + multimodal_stats = calculate_stats(multimodal_times) + fallback_stats = calculate_stats(fallback_times) + + print(f"\nPerformance for config (B={batch_size}, T={num_tokens}):") + print( + f"Multimodal: mean={multimodal_stats['mean']:.8f}s, " + f"median={multimodal_stats['median']:.8f}s, " + f"p99={multimodal_stats['p99']:.8f}s" + ) + print( + f"Fallback: mean={fallback_stats['mean']:.8f}s, " + f"median={fallback_stats['median']:.8f}s, " + f"p99={fallback_stats['p99']:.8f}s" + ) + + if has_mm: + speedup = ( + multimodal_stats["mean"] / fallback_stats["mean"] + if fallback_stats["mean"] > 0 + else float("inf") + ) + print(f"Fallback Speedup over Multimodal: {speedup:.8f}x") + else: + speedup = float("nan") + print( + "[INFO] num_tokens too small for multimodal segments; skip multimodal benchmark." + ) + + print(f"Fallback Speedup over Multimodal: {speedup:.8f}x") + + return multimodal_stats, fallback_stats, speedup + + +if __name__ == "__main__": + parser = argparse.ArgumentParser( + description="Benchmark GLM4V get_rope_index_glm4v." 
+ ) + parser.add_argument("--model-name", type=str, default="GLM4V") + parser.add_argument("--tp-size", type=int, default=1) + parser.add_argument( + "--device", type=str, default="cuda" if torch.cuda.is_available() else "cpu" + ) + parser.add_argument("--warmup-iter", type=int, default=10) + parser.add_argument("--benchmark-iter", type=int, default=100) + parser.add_argument("--dtype", type=str, choices=["int64"], default="int64") + parser.add_argument("--seed", type=int, default=0) + + # token length sweep + parser.add_argument("--num-tokens", type=int, nargs="+", required=False) + + # data shape knobs + parser.add_argument("--batch-size", type=int, default=1) + parser.add_argument("--pad-ratio", type=float, default=0.0) + parser.add_argument("--spatial-merge-size", type=int, default=2) + parser.add_argument("--num-images", type=int, default=1) + parser.add_argument("--image-patch-tokens", type=int, default=256) + parser.add_argument("--num-videos", type=int, default=1) + parser.add_argument("--video-patch-tokens", type=int, default=256) + + # output + parser.add_argument("--out-dir", type=str, default=".") + args = parser.parse_args() + print(args) + + device = torch.device(args.device) + + if args.num_tokens is None: + num_tokens_list = [2**i for i in range(0, 18)] + else: + num_tokens_list = args.num_tokens + + rows: list[dict[str, Any]] = [] + + for num_tokens in num_tokens_list: + multimodal_stats, fallback_stats, speedup = benchmark_rope_index( + model_name=args.model_name, + tp_size=args.tp_size, + num_tokens=num_tokens, + batch_size=args.batch_size, + pad_ratio=args.pad_ratio, + spatial_merge_size=args.spatial_merge_size, + num_images=args.num_images, + image_patch_tokens=args.image_patch_tokens, + num_videos=args.num_videos, + video_patch_tokens=args.video_patch_tokens, + dtype=getattr(torch, args.dtype), + seed=args.seed, + warmup_iter=args.warmup_iter, + benchmark_iter=args.benchmark_iter, + device=device, + ) diff --git a/benchmark/blog_v0_2/README.md 
b/benchmark/blog_v0_2/README.md index 7448554ee610..c8f0f123b744 100644 --- a/benchmark/blog_v0_2/README.md +++ b/benchmark/blog_v0_2/README.md @@ -73,7 +73,7 @@ cat online.jsonl | cut -d':' -f9 | cut -d',' -f1 We tried using vLLM 0.5.3.post1, but it often crashes under high loads, and it seems to have similar or worse performance compared to vLLM 0.5.2 from our partial benchmarking, so we are using the older version, vLLM 0.5.2. -Preparation for TensorRT LLM can refer to https://github.com/sgl-project/tensorrt-demo. Specifically, we used a batch size of 512, a max input length of 8192, and a max number of tokens of 8192. The instance count for preprocessing and postprocessing in Triton Server is 16. +For TensorRT LLM preparation, follow your internal TensorRT-LLM deployment guide. Specifically, we used a batch size of 512, a max input length of 8192, and a max number of tokens of 8192. The instance count for preprocessing and postprocessing in Triton Server is 16. ```bash # vLLM diff --git a/benchmark/boolq/bench_sglang.py b/benchmark/boolq/bench_sglang.py index b3ce3c9962a0..b38d7a01e579 100644 --- a/benchmark/boolq/bench_sglang.py +++ b/benchmark/boolq/bench_sglang.py @@ -4,7 +4,7 @@ import numpy as np -from sglang.api import set_default_backend +from sglang.lang.api import set_default_backend from sglang.test.test_utils import ( add_common_sglang_args_and_parse, select_sglang_backend, diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md index ff2769f1e042..cf6a569cbbab 100644 --- a/benchmark/deepseek_v3/README.md +++ b/benchmark/deepseek_v3/README.md @@ -4,7 +4,7 @@ The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVI Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources. 
-For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.io/basic_usage/deepseek.html). +For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek V3/V3.1/R1 Model Optimizations in SGLang](https://docs.sglang.io/basic_usage/deepseek_v3.html#optimizations). ## Installation & Launch @@ -33,7 +33,7 @@ Add [performance optimization options](#performance-optimization-options) as nee ```bash # Installation -pip install "sglang[all]>=0.5.6.post2" +pip install sglang # Launch python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code @@ -271,7 +271,7 @@ Then we can benchmark the accuracy and latency by accessing the first node's exp ```bash # bench accuracy -python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.0.1 --port 30000 +python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host 10.0.0.1 --port 30000 # bench latency python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128 diff --git a/benchmark/fla/benchmark_layernorm_gated.py b/benchmark/fla/benchmark_layernorm_gated.py index 82440582bc2d..e678d8c31966 100644 --- a/benchmark/fla/benchmark_layernorm_gated.py +++ b/benchmark/fla/benchmark_layernorm_gated.py @@ -7,7 +7,9 @@ from sglang.srt.layers.attention.fla.layernorm_gated import ( _layer_norm_fwd as layer_norm_fwd, ) -from sglang.srt.layers.attention.fla.layernorm_gated import rms_norm_ref +from sglang.srt.layers.attention.fla.layernorm_gated import ( + rms_norm_ref, +) def benchmark_layer_norm_fwd( diff --git a/benchmark/gsm8k/bench_sglang.py b/benchmark/gsm8k/bench_sglang.py index 98c28b39b373..be766cd9af5c 100644 --- a/benchmark/gsm8k/bench_sglang.py +++ b/benchmark/gsm8k/bench_sglang.py @@ -48,6 +48,18 @@ def main(args): # Select backend set_default_backend(select_sglang_backend(args)) + 
# Load tokenizer if enable_thinking is set + tokenizer = None + if args.enable_thinking: + from transformers import AutoTokenizer + + assert ( + args.tokenizer_path is not None + ), "--tokenizer-path is required when --enable-thinking is set" + tokenizer = AutoTokenizer.from_pretrained( + args.tokenizer_path, trust_remote_code=True + ) + # Read data if args.platinum: print("Loading GSM8K Platinum dataset from HuggingFace...") @@ -70,7 +82,16 @@ def main(args): questions = [] labels = [] for i in range(len(lines[:num_questions])): - questions.append(get_one_example(lines, i, False)) + raw_question = few_shot_examples + get_one_example(lines, i, False) + if tokenizer is not None: + messages = [{"role": "user", "content": raw_question}] + raw_question = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + enable_thinking=True, + ) + questions.append(raw_question) labels.append(get_answer_value(lines[i]["answer"])) assert all(l != INVALID for l in labels) arguments = [{"question": q} for q in questions] @@ -83,9 +104,11 @@ def main(args): @sgl.function def few_shot_gsm8k(s, question): - s += few_shot_examples + question + s += question s += sgl.gen( - "answer", max_tokens=512, stop=["Question", "Assistant:", "<|separator|>"] + "answer", + max_tokens=args.max_new_tokens, + stop=["Question", "Assistant:", "<|separator|>"], ) ##################################### @@ -96,7 +119,8 @@ def few_shot_gsm8k(s, question): tic = time.perf_counter() states = few_shot_gsm8k.run_batch( arguments, - temperature=0, + temperature=args.temperature, + top_p=args.top_p, num_threads=args.parallel, progress_bar=True, ) @@ -152,6 +176,20 @@ def few_shot_gsm8k(s, question): parser.add_argument("--num-shots", type=int, default=5) parser.add_argument("--data-path", type=str, default="test.jsonl") parser.add_argument("--num-questions", type=int, default=200) + parser.add_argument("--max-new-tokens", type=int, default=512) + 
parser.add_argument("--temperature", type=float, default=0.0) + parser.add_argument("--top-p", type=float, default=1.0) + parser.add_argument( + "--enable-thinking", + action="store_true", + help="Enable thinking mode by wrapping prompts with chat template", + ) + parser.add_argument( + "--tokenizer-path", + type=str, + default=None, + help="Path to tokenizer (required when --enable-thinking is set)", + ) parser.add_argument( "--platinum", action="store_true", diff --git a/benchmark/hicache/bench_long_context.py b/benchmark/hicache/bench_long_context.py index a3656cef9ea3..80c2c8a711e9 100644 --- a/benchmark/hicache/bench_long_context.py +++ b/benchmark/hicache/bench_long_context.py @@ -12,14 +12,14 @@ ) from tqdm.asyncio import tqdm -from sglang.bench_serving import get_tokenizer +from sglang.benchmark.utils import get_tokenizer +from sglang.test.kits.cache_hit_kit import async_request_sglang_generate class ContextWorkloadGenerator(WorkloadGenerator): def __init__(self, args): - # Construct the base URL for requests - self.baseurl = f"http://{args.host}:{args.port}/" - self.url = self.baseurl + "generate" + self.url = f"http://{args.host}:{args.port}/generate" + self.request_func = async_request_sglang_generate self.tokenizer = get_tokenizer(args.model_path) self.distribution = args.distribution @@ -36,20 +36,18 @@ def __init__(self, args): init_requests = [] for i in range(num_requests): context_id = self.dataset["queries"][i]["context"] - init_requests.append( - ( - i, - gen_payload( - self.dataset["contexts"][context_id] - + self.dataset["queries"][i]["question"], - len( - self.tokenizer( - self.dataset["queries"][i]["reference_answer"] - )["input_ids"] - ), - ), - ) + # Tokenize the context + question to get input_ids + prompt_text = ( + self.dataset["contexts"][context_id] + + self.dataset["queries"][i]["question"] ) + input_ids = self.tokenizer.encode(prompt_text) + output_len = len( + self.tokenizer(self.dataset["queries"][i]["reference_answer"])[ + 
"input_ids" + ] + ) + init_requests.append((i, gen_payload(input_ids, output_len))) self.ready_queue = ReadyQueue(init_requests=init_requests) self.response_queue = queue.Queue() diff --git a/benchmark/hicache/bench_mix.py b/benchmark/hicache/bench_mix.py index cfd25bc4003d..2a65574ea882 100644 --- a/benchmark/hicache/bench_mix.py +++ b/benchmark/hicache/bench_mix.py @@ -12,12 +12,9 @@ import aiohttp -from sglang.bench_serving import ( - RequestFuncOutput, - get_tokenizer, - remove_prefix, - sample_random_requests, -) +from sglang.bench_serving import RequestFuncOutput +from sglang.benchmark.datasets.random import sample_random_requests +from sglang.benchmark.utils import get_tokenizer, remove_prefix # Set up logger logger = logging.getLogger(__name__) @@ -429,11 +426,13 @@ async def handle_request(self, user_data): def request_sender(self): async def request_loop(): + tasks = [] while True: if self.sent_requests - self.completed_requests < self.max_parallel: new_request = self.user_generator.pop() if new_request: - asyncio.create_task(self.handle_request(new_request)) + task = asyncio.create_task(self.handle_request(new_request)) + tasks.append(task) self.sent_requests += 1 else: await asyncio.sleep(0.05) @@ -443,6 +442,11 @@ async def request_loop(): self.done = True break + # Cancel all pending tasks and wait for them to finish + for task in tasks: + task.cancel() + await asyncio.gather(*tasks, return_exceptions=True) + loop = asyncio.new_event_loop() asyncio.set_event_loop(loop) loop.run_until_complete(request_loop()) diff --git a/benchmark/hicache/bench_multiturn.py b/benchmark/hicache/bench_multiturn.py index 95e7c9f5c8d0..49483fe849db 100644 --- a/benchmark/hicache/bench_multiturn.py +++ b/benchmark/hicache/bench_multiturn.py @@ -6,22 +6,21 @@ import threading import time from datetime import datetime -from typing import Optional -import aiohttp import numpy as np import requests from tqdm.asyncio import tqdm -from sglang.bench_serving import ( - 
RequestFuncOutput, - get_tokenizer, - remove_prefix, - sample_random_requests, +from sglang.bench_serving import RequestFuncOutput +from sglang.benchmark.datasets.random import sample_random_requests +from sglang.benchmark.utils import get_tokenizer +from sglang.test.kits.cache_hit_kit import ( + async_request_openai_chat_completions, + async_request_sglang_generate, + gen_payload, + gen_payload_openai, ) -AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=20 * 60 * 60) - def parse_args(): parser = argparse.ArgumentParser( @@ -133,6 +132,24 @@ def parse_args(): default="", help="Tag of a certain run in the log file", ) + parser.add_argument( + "--min-rounds", + type=int, + default=0, + help="Min rounds per client (0 = use --num-rounds)", + ) + parser.add_argument( + "--max-rounds", + type=int, + default=0, + help="Max rounds per client (0 = use --num-rounds)", + ) + parser.add_argument( + "--range-ratio", + type=float, + default=1.0, + help="Length variation ratio for prompts and outputs (1.0 = no variation, 0.5 = 50%% variation)", + ) parser.add_argument("--seed", type=int, default=1, help="The random seed.") parser.add_argument( "--lora-path", @@ -140,98 +157,17 @@ def parse_args(): default="", help="String of LoRA path. Currently we only support benchmarking on a single LoRA adaptor.", ) + parser.add_argument( + "--api-format", + type=str, + default="sglang", + choices=["sglang", "openai"], + help="API format to use: 'sglang' for native /generate endpoint, " + "'openai' for OpenAI-compatible /v1/chat/completions endpoint.", + ) return parser.parse_args() -async def async_request_sglang_generate( - payload, - url, - pbar: Optional[tqdm] = None, -): - """ - Sends a streaming request to the server. Gathers text token-by-token. 
- """ - async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: - headers = {} - generated_text = "" - ttft = 0.0 - st = time.perf_counter() - most_recent_timestamp = st - output = RequestFuncOutput() - - try: - async with session.post(url=url, json=payload, headers=headers) as response: - if response.status == 200: - prompt_tokens = 0 - cached_tokens = 0 - async for chunk_bytes in response.content: - chunk_bytes = chunk_bytes.strip() - if not chunk_bytes: - continue - - chunk = remove_prefix(chunk_bytes.decode("utf-8"), "data: ") - latency = time.perf_counter() - st - if chunk == "[DONE]": - pass - else: - data = json.loads(chunk) - - if data["text"]: - timestamp = time.perf_counter() - # First token - if ttft == 0.0: - ttft = time.perf_counter() - st - output.ttft = ttft - prompt_tokens = (data.get("meta_info") or {}).get( - "prompt_tokens", 0 - ) - cached_tokens = (data.get("meta_info") or {}).get( - "cached_tokens", 0 - ) - - # Decoding phase - else: - output.itl.append(timestamp - most_recent_timestamp) - - most_recent_timestamp = timestamp - generated_text = data["text"] - - output.generated_text = generated_text - output.success = True - output.latency = latency - output.prompt_len = prompt_tokens - output.cached_tokens = cached_tokens - output.generated_len = len(output.itl) + 1 - else: - output.error = response.reason or "" - output.success = False - except Exception as e: - output.success = False - output.error = str(e) - print(f"Request failed: {e}") - - if pbar: - pbar.update(1) - return output - - -def gen_payload(prompt, output_len, lora_path=""): - payload = { - "text": prompt, - "sampling_params": { - "temperature": 0.0, - "max_new_tokens": output_len, - "ignore_eos": True, - }, - "stream": True, - "stream_options": {"include_usage": True}, - "lora_path": lora_path, - "return_logprob": False, - "logprob_start_len": -1, - } - return payload - - def log_to_jsonl_file(data, file_path="performance_metrics.jsonl", tag=""): """Append the 
data with a timestamp and tag to the specified JSONL file.""" timestamped_data = {"timestamp": datetime.now().isoformat(), "tag": tag, **data} @@ -274,66 +210,159 @@ def pop(self): class WorkloadGenerator: def __init__(self, args): - # Construct the base URL for requests - self.url = f"http://{args.host}:{args.port}/generate" + self.api_format = args.api_format + self.model_path = args.model_path + + # Construct the base URL and select request/payload functions + if self.api_format == "openai": + self.url = f"http://{args.host}:{args.port}/v1/chat/completions" + self.request_func = async_request_openai_chat_completions + else: + self.url = f"http://{args.host}:{args.port}/generate" + self.request_func = async_request_sglang_generate self.tokenizer = get_tokenizer(args.model_path) self.distribution = args.distribution self.request_rate = args.request_rate self.start_time = None self.finished_time = None + self.lora_path = args.lora_path self.sent_requests = 0 self.completed_requests = 0 - self.candidate_inputs = sample_random_requests( + # Resolve per-client round counts + min_rounds = args.min_rounds + max_rounds = args.max_rounds + if min_rounds == 0 and max_rounds == 0: + # Backward compat: all clients use --num-rounds + min_rounds = args.num_rounds + max_rounds = args.num_rounds + elif min_rounds == 0: + min_rounds = max_rounds + elif max_rounds == 0: + max_rounds = min_rounds + if min_rounds < 1: + raise ValueError(f"--min-rounds must be >= 1, got {min_rounds}") + if min_rounds > max_rounds: + raise ValueError( + f"--min-rounds ({min_rounds}) must be <= --max-rounds ({max_rounds})" + ) + + self.min_rounds = min_rounds + self.max_rounds = max_rounds + + if min_rounds == max_rounds: + # All clients have the same round count; skip randint to preserve random state + self.client_total_rounds = [min_rounds] * args.num_clients + else: + self.client_total_rounds = [ + random.randint(min_rounds, max_rounds) for _ in range(args.num_clients) + ] + + # clients_per_round[r] 
= number of clients participating in round r + self.clients_per_round = [ + sum(1 for t in self.client_total_rounds if t > r) for r in range(max_rounds) + ] + self.total_requests = sum(self.client_total_rounds) + + range_ratio = args.range_ratio + + # Use return_text=False to get token ids instead of text + first_round_samples = sample_random_requests( input_len=args.request_length, output_len=args.output_length, num_prompts=args.num_clients, - range_ratio=1.0, + range_ratio=range_ratio, tokenizer=self.tokenizer, dataset_path=args.dataset_path, random_sample=not args.disable_random_sample, + return_text=False, ) - self.candidate_inputs = [i.prompt for i in self.candidate_inputs] + # Store per-sample output_len for first round + first_round_output_lens = [row.output_len for row in first_round_samples] + # r.prompt is now List[int] when return_text=False + self.candidate_inputs = [list(i.prompt) for i in first_round_samples] if args.sub_question_input_length != 0: sub_question_input_length = args.sub_question_input_length else: sub_question_input_length = args.request_length + num_sub_questions = sum(max(t - 1, 0) for t in self.client_total_rounds) + self.sub_question_inputs = sample_random_requests( input_len=sub_question_input_length, output_len=args.output_length, - num_prompts=args.num_clients * max(args.num_rounds - 1, 1), - range_ratio=1.0, + num_prompts=max(num_sub_questions, 1), + range_ratio=range_ratio, tokenizer=self.tokenizer, dataset_path=args.dataset_path, random_sample=not args.disable_random_sample, + return_text=False, ) - init_requests = [ - ( - i, - gen_payload( - self.candidate_inputs[i], args.output_length, args.lora_path - ), - ) - for i in range(args.num_clients) - ] - self.client_records = { - i: {"round": 0, "history": init_requests[i][1]["text"]} - for i in range(args.num_clients) - } + if self.api_format == "openai": + # OpenAI mode: history is a messages list for /v1/chat/completions + initial_messages = { + i: [ + { + "role": "user", + 
"content": self.tokenizer.decode(self.candidate_inputs[i]), + } + ] + for i in range(args.num_clients) + } + init_requests = [ + ( + i, + gen_payload_openai( + initial_messages[i], + first_round_output_lens[i], + self.model_path, + ), + ) + for i in range(args.num_clients) + ] + self.client_records = { + i: { + "round": 0, + "history": initial_messages[i], + "total_rounds": self.client_total_rounds[i], + } + for i in range(args.num_clients) + } + else: + # SGLang mode: history is List[int] (token ids) + init_requests = [ + ( + i, + gen_payload( + self.candidate_inputs[i], + first_round_output_lens[i], + args.lora_path, + ), + ) + for i in range(args.num_clients) + ] + self.client_records = { + i: { + "round": 0, + "history": list(self.candidate_inputs[i]), + "total_rounds": self.client_total_rounds[i], + } + for i in range(args.num_clients) + } self.ready_queue = ReadyQueue( init_requests=init_requests, policy=args.ready_queue_policy ) self.candidate_inputs = self.candidate_inputs[args.num_clients :] self.response_queue = queue.Queue() - self.pbar = tqdm(total=args.num_clients * args.num_rounds) + self.pbar = tqdm(total=self.total_requests) self.performance_metrics = { "ttft": [], + "itl": [], "latency": [], "prompt_len": [], "cached_tokens": [], @@ -342,7 +371,7 @@ def __init__(self, args): self.enable_round_barrier = args.enable_round_barrier if self.enable_round_barrier: # Add round-specific metrics while preserving the original structure - for i in range(args.num_rounds): + for i in range(self.max_rounds): self.performance_metrics[f"round_{i}"] = { "ttft": [], "latency": [], @@ -352,19 +381,23 @@ def __init__(self, args): } self.num_clients = args.num_clients - self.num_rounds = args.num_rounds + self.num_rounds = self.max_rounds self.max_parallel = args.max_parallel self.output_length = args.output_length async def handle_request(self, item): + client_id, payload = item try: - client_id, payload = item - response = await async_request_sglang_generate(payload, 
self.url, self.pbar) + response = await self.request_func(payload, self.url, self.pbar) if self.pbar.n == self.pbar.total: self.finished_time = time.perf_counter() self.response_queue.put((client_id, response)) except Exception as e: - print(f"Request failed: {e}") + print(f"Request failed for client {client_id}: {e}") + failed_response = RequestFuncOutput() + failed_response.success = False + failed_response.error = str(e) + self.response_queue.put((client_id, failed_response)) def request_sender(self): async def request_loop(): @@ -401,17 +434,31 @@ async def request_loop(): def response_handler(self): next_round_reqs = [] + current_barrier_round = 0 + barrier_round_completed = 0 while True: try: client_id, response = self.response_queue.get( timeout=10 ) # Block until response is available if not response.success: - raise ValueError(f"Request failed with error: {response.error}") - self.client_records[client_id]["history"] += response.generated_text + print(f"Request failed for client {client_id}: {response.error}") + self.completed_requests += 1 + continue + # Extend history with response + if self.api_format == "openai": + if response.generated_text: + self.client_records[client_id]["history"].append( + {"role": "assistant", "content": response.generated_text} + ) + else: + self.client_records[client_id]["history"].extend( + response.output_ids + ) current_round = self.client_records[client_id]["round"] self.client_records[client_id]["round"] += 1 self.performance_metrics["ttft"].append(response.ttft) + self.performance_metrics["itl"].extend(response.itl) self.performance_metrics["latency"].append(response.latency) self.performance_metrics["prompt_len"].append(response.prompt_len) self.performance_metrics["cached_tokens"].append(response.cached_tokens) @@ -434,27 +481,61 @@ def response_handler(self): ].append(response.generated_len) self.completed_requests += 1 - if self.client_records[client_id]["round"] < self.num_rounds: - # append new request to client's 
history - self.client_records[client_id][ - "history" - ] += self.sub_question_inputs.pop().prompt - new_req = ( - client_id, - gen_payload( - self.client_records[client_id]["history"], - self.output_length, - args.lora_path, - ), - ) + client_total = self.client_records[client_id]["total_rounds"] + if self.client_records[client_id]["round"] < client_total: + sub_q = self.sub_question_inputs.pop() + if self.api_format == "openai": + # Append sub-question as a new user message + sub_q_text = self.tokenizer.decode(list(sub_q.prompt)) + self.client_records[client_id]["history"].append( + {"role": "user", "content": sub_q_text} + ) + new_req = ( + client_id, + gen_payload_openai( + self.client_records[client_id]["history"], + sub_q.output_len, + self.model_path, + ), + ) + else: + # Append sub-question token ids to client's history + sub_q_ids = list(sub_q.prompt) + self.client_records[client_id]["history"].extend(sub_q_ids) + new_req = ( + client_id, + gen_payload( + self.client_records[client_id]["history"], + sub_q.output_len, + self.lora_path, + ), + ) if self.enable_round_barrier: next_round_reqs.append(new_req) - if len(next_round_reqs) == self.num_clients: - for req in next_round_reqs: - self.ready_queue.append(req) - next_round_reqs = [] else: self.ready_queue.append(new_req) + + # Barrier logic: release next round when all clients for + # current barrier round have completed + if ( + self.enable_round_barrier + and current_barrier_round < self.max_rounds + ): + barrier_round_completed += 1 + expected = self.clients_per_round[current_barrier_round] + if barrier_round_completed == expected: + print( + f"\n Barrier: round {current_barrier_round} complete " + f"({expected} clients), releasing {len(next_round_reqs)} " + f"requests for round {current_barrier_round + 1}" + ) + self._send_heartbeat(input_len=100, output_len=100) + time.sleep(10) + for req in next_round_reqs: + self.ready_queue.append(req) + next_round_reqs = [] + current_barrier_round += 1 + 
barrier_round_completed = 0 except queue.Empty: if self.pbar.n == self.pbar.total: break @@ -462,6 +543,15 @@ def response_handler(self): print(f"Error processing response for client {client_id}: {e}") continue + def _send_heartbeat(self, input_len=100, output_len=20): + """Send a small heartbeat request to the server.""" + heartbeat_input = [1] * input_len + payload = gen_payload(heartbeat_input, output_len, self.lora_path) + try: + requests.post(self.url, json=payload, timeout=30) + except Exception as e: + print(f"Heartbeat request failed: {e}") + def run(self): request_thread = threading.Thread(target=self.request_sender, daemon=True) response_thread = threading.Thread(target=self.response_handler, daemon=True) @@ -477,6 +567,9 @@ def run(self): duration = self.finished_time - self.start_time sorted_ttft = sorted(self.performance_metrics["ttft"]) sorted_latency = sorted(self.performance_metrics["latency"]) + sorted_itl = sorted(self.performance_metrics["itl"]) + sorted_prompt_len = sorted(self.performance_metrics["prompt_len"]) + sorted_output_len = sorted(self.performance_metrics["generated_len"]) def percentile(sorted_vals, q): if not sorted_vals: @@ -505,12 +598,26 @@ def max_or_zero(sorted_vals): if self.performance_metrics["generated_len"] else 0.0 ), + "p90_prompt_len": percentile(sorted_prompt_len, 0.9), + "p99_prompt_len": percentile(sorted_prompt_len, 0.99), + "p90_output_len": percentile(sorted_output_len, 0.9), + "p99_output_len": percentile(sorted_output_len, 0.99), "average_ttft": sum(self.performance_metrics["ttft"]) / len(self.performance_metrics["ttft"]), "p90_ttft": percentile(sorted_ttft, 0.9), "p99_ttft": percentile(sorted_ttft, 0.99), "median_ttft": percentile(sorted_ttft, 0.5), "max_ttft": max_or_zero(sorted_ttft), + "average_itl": ( + sum(self.performance_metrics["itl"]) + / len(self.performance_metrics["itl"]) + if self.performance_metrics["itl"] + else 0.0 + ), + "p90_itl": percentile(sorted_itl, 0.9), + "p99_itl": percentile(sorted_itl, 
0.99), + "median_itl": percentile(sorted_itl, 0.5), + "max_itl": max_or_zero(sorted_itl), "average_latency": sum(self.performance_metrics["latency"]) / len(self.performance_metrics["latency"]), "p90_latency": percentile(sorted_latency, 0.9), @@ -534,7 +641,7 @@ def max_or_zero(sorted_vals): } if self.enable_round_barrier: performance_data["round"] = {} - for round_num in range(args.num_rounds): + for round_num in range(self.num_rounds): round_key = f"round_{round_num}" round_metrics = self.performance_metrics[round_key] performance_data["round"][round_key] = { @@ -562,11 +669,28 @@ def max_or_zero(sorted_vals): print( f" Average Output Length: {performance_data['summary']['average_output_len']:.2f} tokens" ) + print( + f" P90 Prompt Length: {performance_data['summary']['p90_prompt_len']:.0f} tokens" + ) + print( + f" P99 Prompt Length: {performance_data['summary']['p99_prompt_len']:.0f} tokens" + ) + print( + f" P90 Output Length: {performance_data['summary']['p90_output_len']:.0f} tokens" + ) + print( + f" P99 Output Length: {performance_data['summary']['p99_output_len']:.0f} tokens" + ) print(f" Average TTFT: {performance_data['summary']['average_ttft']:.2f}") print(f" P90 TTFT: {performance_data['summary']['p90_ttft']:.2f}") print(f" P99 TTFT: {performance_data['summary']['p99_ttft']:.2f}") print(f" Median TTFT: {performance_data['summary']['median_ttft']:.2f}") print(f" Max TTFT: {performance_data['summary']['max_ttft']:.2f}") + print(f" Average ITL: {performance_data['summary']['average_itl']:.4f}") + print(f" P90 ITL: {performance_data['summary']['p90_itl']:.4f}") + print(f" P99 ITL: {performance_data['summary']['p99_itl']:.4f}") + print(f" Median ITL: {performance_data['summary']['median_itl']:.4f}") + print(f" Max ITL: {performance_data['summary']['max_itl']:.4f}") print( f" Average latency: {performance_data['summary']['average_latency']:.2f}" ) @@ -596,10 +720,12 @@ def max_or_zero(sorted_vals): avg_ttft = round_data["average_ttft"] cache_hit_rate = 
round_data["cache_hit_rate"] request_count = round_data["request_count"] + clients_in_round = self.clients_per_round[round_num] print( f" Round {round_num}: Average TTFT = {avg_ttft:.2f}s, " f"Cache Hit Rate = {cache_hit_rate:.6f} " - f"({request_count} requests)" + f"({request_count} requests, " + f"{clients_in_round} clients)" ) else: print(f" Round {round_num}: No requests completed") diff --git a/benchmark/hicache/bench_serving.py b/benchmark/hicache/bench_serving.py index e38d0d0eaf21..2355e7721c14 100644 --- a/benchmark/hicache/bench_serving.py +++ b/benchmark/hicache/bench_serving.py @@ -32,7 +32,7 @@ from tqdm.asyncio import tqdm from transformers import PreTrainedTokenizerBase -from sglang.bench_serving import get_tokenizer, remove_prefix, set_ulimit +from sglang.benchmark.utils import get_tokenizer, remove_prefix, set_ulimit AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=20 * 60 * 60) diff --git a/benchmark/hicache/bench_warm_cache.py b/benchmark/hicache/bench_warm_cache.py new file mode 100644 index 000000000000..f0f2666331a7 --- /dev/null +++ b/benchmark/hicache/bench_warm_cache.py @@ -0,0 +1,673 @@ +# Adapted from benchmark/hicache/bench_serving.py and python/sglang/bench_serving.py + +""" +Benchmark warm-cache serving with exact shared-prefix control. + +This benchmark is designed for cache-focused studies where each request has a +fixed total input length and an exactly controlled shared-prefix ratio. For each +shared-prefix percentage, the benchmark: + +1. Flushes the server KV cache. +2. Builds prompts with an identical shared prefix and random unique suffixes. +3. Warms only the shared prefix once. +4. Benchmarks the full prompts through SGLang's native /generate endpoint. + +Compared with the existing hicache shared-prefix benchmarks, this benchmark +provides direct control over total length, shared-prefix length, and suffix +length at the token-id level. 
+""" + +import argparse +import asyncio +import json +import random +import time +import warnings +from dataclasses import dataclass, field +from typing import Any, Dict, List, Optional, Tuple + +import aiohttp +import numpy as np +import requests +from transformers import PreTrainedTokenizerBase + +from sglang.benchmark.utils import get_tokenizer, remove_prefix, set_ulimit + +AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=20 * 60 * 60) +AIOHTTP_READ_BUFSIZE = 10 * 1024**2 + + +global args + + +@dataclass +class RequestFuncOutput: + generated_text: str = "" + success: bool = False + latency: float = 0.0 + ttft: float = 0.0 + itl: List[float] = field(default_factory=list) + prompt_len: int = 0 + error: str = "" + output_len: int = 0 + start_time: float = 0.0 + + +@dataclass +class BenchmarkMetrics: + completed: int + total_input: int + total_output: int + total_output_retokenized: int + request_throughput: float + input_throughput: float + output_throughput: float + output_throughput_retokenized: float + total_throughput: float + total_throughput_retokenized: float + mean_ttft_ms: float + median_ttft_ms: float + std_ttft_ms: float + p90_ttft_ms: float + p99_ttft_ms: float + mean_tpot_ms: float + median_tpot_ms: float + std_tpot_ms: float + p90_tpot_ms: float + p99_tpot_ms: float + mean_itl_ms: float + median_itl_ms: float + std_itl_ms: float + p90_itl_ms: float + p99_itl_ms: float + mean_e2e_latency_ms: float + median_e2e_latency_ms: float + std_e2e_latency_ms: float + p99_e2e_latency_ms: float + concurrency: float + + +def _create_bench_client_session() -> aiohttp.ClientSession: + return aiohttp.ClientSession( + timeout=AIOHTTP_TIMEOUT, + read_bufsize=AIOHTTP_READ_BUFSIZE, + ) + + +async def async_request_sglang_generate( + api_url: str, + input_ids: List[int], + prompt_len: int, + output_len: int, + pbar: Optional[Any] = None, +) -> RequestFuncOutput: + async with _create_bench_client_session() as session: + payload = { + "input_ids": input_ids, + 
"sampling_params": { + "temperature": 0.0, + "max_new_tokens": output_len, + "ignore_eos": not args.disable_ignore_eos, + }, + "stream": True, + **args.extra_request_body, + } + + output = RequestFuncOutput(prompt_len=prompt_len) + + generated_text = "" + ttft = 0.0 + st = time.perf_counter() + output.start_time = st + most_recent_timestamp = st + last_output_len = 0 + latency = 0.0 + + try: + async with session.post(url=api_url, json=payload) as response: + if response.status == 200: + async for chunk_bytes in response.content: + chunk_bytes = chunk_bytes.strip() + if not chunk_bytes: + continue + + chunk = remove_prefix(chunk_bytes.decode("utf-8"), "data: ") + latency = time.perf_counter() - st + if chunk == "[DONE]": + continue + + data = json.loads(chunk) + + if "text" in data and data["text"]: + timestamp = time.perf_counter() + generated_text = data["text"] + current_output_len = data["meta_info"]["completion_tokens"] + + if ttft == 0.0: + ttft = timestamp - st + output.ttft = ttft + else: + num_new_tokens = current_output_len - last_output_len + if num_new_tokens == 0: + continue + chunk_gap = timestamp - most_recent_timestamp + adjust_itl = chunk_gap / num_new_tokens + output.itl.extend([adjust_itl] * num_new_tokens) + + most_recent_timestamp = timestamp + last_output_len = current_output_len + output.output_len = current_output_len + + output.generated_text = generated_text + output.success = True + output.latency = latency + else: + output.error = ( + (response.reason or "") + ": " + (await response.text()) + ) + output.success = False + except Exception as exc: + output.success = False + output.error = str(exc) + + if pbar: + pbar.update(1) + return output + + +async def run_batch( + api_url: str, + prompts: List[Dict[str, Any]], + output_len: int, + max_concurrency: Optional[int], + pbar: Optional[Any] = None, +) -> List[RequestFuncOutput]: + semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None + + async def 
limited_request(prompt: Dict[str, Any]) -> RequestFuncOutput: + if semaphore is None: + return await async_request_sglang_generate( + api_url=api_url, + input_ids=prompt["input_ids"], + prompt_len=prompt["prompt_len"], + output_len=output_len, + pbar=pbar, + ) + async with semaphore: + return await async_request_sglang_generate( + api_url=api_url, + input_ids=prompt["input_ids"], + prompt_len=prompt["prompt_len"], + output_len=output_len, + pbar=pbar, + ) + + tasks = [asyncio.create_task(limited_request(prompt)) for prompt in prompts] + return await asyncio.gather(*tasks) + + +def flush_cache(base_url: str) -> None: + response = requests.post(f"{base_url}/flush_cache", timeout=30) + response.raise_for_status() + + +def gen_token_ids( + vocab_ids: List[int], + token_num: int, + rng: random.Random, +) -> List[int]: + if token_num <= 0: + return [] + return rng.choices(vocab_ids, k=token_num) + + +def build_prompts( + vocab_ids: List[int], + total_tokens: int, + shared_pct: int, + num_prompts: int, + rng: random.Random, +) -> List[Dict[str, Any]]: + prefix_len = total_tokens * shared_pct // 100 + suffix_len = total_tokens - prefix_len + + shared_prefix = gen_token_ids(vocab_ids, prefix_len, rng) + + prompts: List[Dict[str, Any]] = [] + for _ in range(num_prompts): + suffix = gen_token_ids(vocab_ids, suffix_len, rng) + input_ids = shared_prefix + suffix + prompts.append({"input_ids": input_ids, "prompt_len": len(input_ids)}) + + return prompts + + +async def warm_shared_prefix(api_url: str, shared_prefix_ids: List[int]) -> None: + if not shared_prefix_ids: + return + + warmup = await async_request_sglang_generate( + api_url=api_url, + input_ids=shared_prefix_ids, + prompt_len=len(shared_prefix_ids), + output_len=1, + ) + if not warmup.success: + raise RuntimeError( + "Warmup failed - Please make sure benchmark arguments are correctly " + f"specified. 
Error: {warmup.error}" + ) + + +def calculate_metrics( + outputs: List[RequestFuncOutput], + dur_s: float, + tokenizer: PreTrainedTokenizerBase, +) -> Tuple[BenchmarkMetrics, List[int]]: + output_lens: List[int] = [] + retokenized_output_lens: List[int] = [] + total_input = 0 + completed = 0 + itls: List[float] = [] + tpots: List[float] = [] + ttfts: List[float] = [] + e2e_latencies: List[float] = [] + + for output in outputs: + if output.success: + output_len = output.output_len + output_lens.append(output_len) + retokenized_output_len = len( + tokenizer.encode(output.generated_text, add_special_tokens=False) + ) + retokenized_output_lens.append(retokenized_output_len) + total_input += output.prompt_len + if output_len > 1: + tpots.append((output.latency - output.ttft) / (output_len - 1)) + itls += output.itl + ttfts.append(output.ttft) + e2e_latencies.append(output.latency) + completed += 1 + else: + output_lens.append(0) + retokenized_output_lens.append(0) + + if completed == 0: + warnings.warn( + "All requests failed. 
This is likely due to a misconfiguration " + "on the benchmark arguments.", + stacklevel=2, + ) + + metrics = BenchmarkMetrics( + completed=completed, + total_input=total_input, + total_output=sum(output_lens), + total_output_retokenized=sum(retokenized_output_lens), + request_throughput=completed / dur_s, + input_throughput=total_input / dur_s, + output_throughput=sum(output_lens) / dur_s, + output_throughput_retokenized=sum(retokenized_output_lens) / dur_s, + total_throughput=(total_input + sum(output_lens)) / dur_s, + total_throughput_retokenized=(total_input + sum(retokenized_output_lens)) + / dur_s, + mean_ttft_ms=np.mean(ttfts or 0) * 1000, + median_ttft_ms=np.median(ttfts or 0) * 1000, + std_ttft_ms=np.std(ttfts or 0) * 1000, + p90_ttft_ms=np.percentile(ttfts or 0, 90) * 1000, + p99_ttft_ms=np.percentile(ttfts or 0, 99) * 1000, + mean_tpot_ms=np.mean(tpots or 0) * 1000, + median_tpot_ms=np.median(tpots or 0) * 1000, + std_tpot_ms=np.std(tpots or 0) * 1000, + p90_tpot_ms=np.percentile(tpots or 0, 90) * 1000, + p99_tpot_ms=np.percentile(tpots or 0, 99) * 1000, + mean_itl_ms=np.mean(itls or 0) * 1000, + median_itl_ms=np.median(itls or 0) * 1000, + std_itl_ms=np.std(itls or 0) * 1000, + p90_itl_ms=np.percentile(itls or 0, 90) * 1000, + p99_itl_ms=np.percentile(itls or 0, 99) * 1000, + mean_e2e_latency_ms=np.mean(e2e_latencies) * 1000, + median_e2e_latency_ms=np.median(e2e_latencies) * 1000, + std_e2e_latency_ms=np.std(e2e_latencies) * 1000, + p99_e2e_latency_ms=np.percentile(e2e_latencies, 99) * 1000, + concurrency=np.sum(e2e_latencies) / dur_s, + ) + return metrics, output_lens + + +def print_benchmark_result( + metrics: BenchmarkMetrics, + benchmark_duration: float, + backend: str, + request_rate: float, + max_concurrency: Optional[int], +) -> None: + print("\n{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=50, c="=")) + print("{:<40} {:<10}".format("Backend:", backend)) + print("{:<40} {:<10}".format("Traffic request rate:", request_rate)) + print( + 
"{:<40} {:<10}".format( + "Max request concurrency:", + max_concurrency if max_concurrency else "not set", + ) + ) + print("{:<40} {:<10}".format("Successful requests:", metrics.completed)) + print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration)) + print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input)) + print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output)) + print( + "{:<40} {:<10}".format( + "Total generated tokens (retokenized):", metrics.total_output_retokenized + ) + ) + print( + "{:<40} {:<10.2f}".format( + "Request throughput (req/s):", metrics.request_throughput + ) + ) + print( + "{:<40} {:<10.2f}".format( + "Input token throughput (tok/s):", metrics.input_throughput + ) + ) + print( + "{:<40} {:<10.2f}".format( + "Output token throughput (tok/s):", metrics.output_throughput + ) + ) + print( + "{:<40} {:<10.2f}".format( + "Total token throughput (tok/s):", metrics.total_throughput + ) + ) + print("{:<40} {:<10.2f}".format("Concurrency:", metrics.concurrency)) + print("{s:{c}^{n}}".format(s="End-to-End Latency", n=50, c="-")) + print( + "{:<40} {:<10.2f}".format("Mean E2E Latency (ms):", metrics.mean_e2e_latency_ms) + ) + print( + "{:<40} {:<10.2f}".format( + "Median E2E Latency (ms):", metrics.median_e2e_latency_ms + ) + ) + print("{s:{c}^{n}}".format(s="Time to First Token", n=50, c="-")) + print("{:<40} {:<10.2f}".format("Mean TTFT (ms):", metrics.mean_ttft_ms)) + print("{:<40} {:<10.2f}".format("Median TTFT (ms):", metrics.median_ttft_ms)) + print("{:<40} {:<10.2f}".format("P90 TTFT (ms):", metrics.p90_ttft_ms)) + print("{:<40} {:<10.2f}".format("P99 TTFT (ms):", metrics.p99_ttft_ms)) + print( + "{s:{c}^{n}}".format(s="Time per Output Token (excl. 
1st token)", n=50, c="-") + ) + print("{:<40} {:<10.2f}".format("Mean TPOT (ms):", metrics.mean_tpot_ms)) + print("{:<40} {:<10.2f}".format("Median TPOT (ms):", metrics.median_tpot_ms)) + print("{:<40} {:<10.2f}".format("P90 TPOT (ms):", metrics.p90_tpot_ms)) + print("{:<40} {:<10.2f}".format("P99 TPOT (ms):", metrics.p99_tpot_ms)) + print("{s:{c}^{n}}".format(s="Inter-token Latency", n=50, c="-")) + print("{:<40} {:<10.2f}".format("Mean ITL (ms):", metrics.mean_itl_ms)) + print("{:<40} {:<10.2f}".format("Median ITL (ms):", metrics.median_itl_ms)) + print("{:<40} {:<10.2f}".format("P90 ITL (ms):", metrics.p90_itl_ms)) + print("{:<40} {:<10.2f}".format("P99 ITL (ms):", metrics.p99_itl_ms)) + print("=" * 50) + + +def maybe_write_summary_jsonl( + pct: int, + prefix_len: int, + suffix_len: int, + metrics: BenchmarkMetrics, + output_file: Optional[str], + benchmark_duration: float, +) -> None: + if not output_file: + return + + result = { + "backend": args.backend, + "dataset_name": "warm-cache", + "request_rate": float("inf"), + "max_concurrency": args.max_concurrency, + "shared_prefix_pct": pct, + "prefix_len": prefix_len, + "suffix_len": suffix_len, + "total_tokens": args.total_tokens, + "num_prompts": args.num_prompts, + "output_len": args.output_len, + "completed": metrics.completed, + "benchmark_duration": benchmark_duration, + "total_input": metrics.total_input, + "total_output": metrics.total_output, + "total_output_retokenized": metrics.total_output_retokenized, + "request_throughput": metrics.request_throughput, + "input_throughput": metrics.input_throughput, + "output_throughput": metrics.output_throughput, + "output_throughput_retokenized": metrics.output_throughput_retokenized, + "total_throughput": metrics.total_throughput, + "total_throughput_retokenized": metrics.total_throughput_retokenized, + "mean_ttft_ms": metrics.mean_ttft_ms, + "median_ttft_ms": metrics.median_ttft_ms, + "std_ttft_ms": metrics.std_ttft_ms, + "p90_ttft_ms": metrics.p90_ttft_ms, + 
"p99_ttft_ms": metrics.p99_ttft_ms, + "mean_tpot_ms": metrics.mean_tpot_ms, + "median_tpot_ms": metrics.median_tpot_ms, + "std_tpot_ms": metrics.std_tpot_ms, + "p90_tpot_ms": metrics.p90_tpot_ms, + "p99_tpot_ms": metrics.p99_tpot_ms, + "mean_itl_ms": metrics.mean_itl_ms, + "median_itl_ms": metrics.median_itl_ms, + "std_itl_ms": metrics.std_itl_ms, + "p90_itl_ms": metrics.p90_itl_ms, + "p99_itl_ms": metrics.p99_itl_ms, + "mean_e2e_latency_ms": metrics.mean_e2e_latency_ms, + "median_e2e_latency_ms": metrics.median_e2e_latency_ms, + "std_e2e_latency_ms": metrics.std_e2e_latency_ms, + "p99_e2e_latency_ms": metrics.p99_e2e_latency_ms, + "concurrency": metrics.concurrency, + } + + with open(output_file, "a", encoding="utf-8") as fout: + fout.write(json.dumps(result) + "\n") + + +async def benchmark_shared_prefix_pct( + api_url: str, + base_url: str, + tokenizer: PreTrainedTokenizerBase, + vocab_ids: List[int], + rng: random.Random, + pct: int, +) -> Tuple[BenchmarkMetrics, float, int, int, int]: + prefix_len = args.total_tokens * pct // 100 + suffix_len = args.total_tokens - prefix_len + + print(f"\n{'=' * 70}") + print( + f"shared_prefix={pct}% prefix_len={prefix_len} " + f"suffix_len={suffix_len} total={prefix_len + suffix_len}" + ) + print(f"{'=' * 70}") + + print("Flushing KV cache ...") + flush_cache(base_url) + time.sleep(1) + + print(f"Building {args.num_prompts} prompts ...") + prompts = build_prompts( + vocab_ids=vocab_ids, + total_tokens=args.total_tokens, + shared_pct=pct, + num_prompts=args.num_prompts, + rng=rng, + ) + + if prefix_len > 0: + print(f"Warming shared prefix only ({prefix_len} tokens) ...") + await warm_shared_prefix( + api_url=api_url, shared_prefix_ids=prompts[0]["input_ids"][:prefix_len] + ) + + print(f"Sending requests (max_concurrency={args.max_concurrency}) ...") + benchmark_start_time = time.perf_counter() + outputs = await run_batch( + api_url=api_url, + prompts=prompts, + output_len=args.output_len, + 
max_concurrency=args.max_concurrency, + pbar=None, + ) + benchmark_duration = time.perf_counter() - benchmark_start_time + + failed_outputs = [output for output in outputs if not output.success] + if failed_outputs: + print(f"WARNING: {len(failed_outputs)}/{len(outputs)} requests failed") + for output in failed_outputs[:5]: + print(f" {output.error[:160]}") + + metrics, _ = calculate_metrics( + outputs=outputs, + dur_s=benchmark_duration, + tokenizer=tokenizer, + ) + + if metrics.completed == 0: + raise RuntimeError("All requests failed for this shared-prefix percentage.") + + print_benchmark_result( + metrics=metrics, + benchmark_duration=benchmark_duration, + backend=args.backend, + request_rate=float("inf"), + max_concurrency=args.max_concurrency, + ) + + maybe_write_summary_jsonl( + pct=pct, + prefix_len=prefix_len, + suffix_len=suffix_len, + metrics=metrics, + output_file=args.output_file, + benchmark_duration=benchmark_duration, + ) + + return metrics, benchmark_duration, prefix_len, suffix_len, len(outputs) + + +async def main() -> None: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--backend", + type=str, + default="sglang", + choices=["sglang"], + help="Warm-cache benchmark currently supports the native SGLang /generate endpoint.", + ) + parser.add_argument( + "--base-url", + type=str, + default=None, + help="Server base url if not using host and port.", + ) + parser.add_argument("--host", type=str, default="127.0.0.1") + parser.add_argument("--port", type=int, default=30000) + parser.add_argument( + "--model", + type=str, + required=True, + help="Name or path of the model. Used to load the tokenizer and vocab ids.", + ) + parser.add_argument( + "--tokenizer", + type=str, + default=None, + help="Name or path of the tokenizer. 
Defaults to --model.", + ) + parser.add_argument( + "--num-prompts", + type=int, + default=64, + help="Number of prompts to process per shared-prefix percentage.", + ) + parser.add_argument( + "--total-tokens", + type=int, + default=70000, + help="Total input tokens per request (shared prefix + unique suffix).", + ) + parser.add_argument( + "--output-len", + type=int, + default=200, + help="Output length for each request.", + ) + parser.add_argument( + "--max-concurrency", + type=int, + default=4, + help="Maximum number of concurrent requests.", + ) + parser.add_argument( + "--pcts", + type=str, + default="0,10,20,30,40,50,60,70,80,90,92,95,97,99", + help="Comma-separated shared-prefix percentages to sweep.", + ) + parser.add_argument( + "--seed", + type=int, + default=42, + help="Random seed for synthetic prompt generation.", + ) + parser.add_argument( + "--disable-ignore-eos", + action="store_true", + help="Disable ignoring EOS.", + ) + parser.add_argument( + "--output-file", + type=str, + default=None, + help="Optional JSONL file to append one result object per shared-prefix percentage.", + ) + parser.add_argument( + "--extra-request-body", + metavar='{"key1": "value1", "key2": "value2"}', + type=str, + help="Append given JSON object to the request payload. 
You can use this to specify additional generate params.", + ) + global args + args = parser.parse_args() + + args.extra_request_body = ( + json.loads(args.extra_request_body) if args.extra_request_body else {} + ) + + base_url = args.base_url or f"http://{args.host}:{args.port}" + api_url = f"{base_url}/generate" + pcts = [int(p.strip()) for p in args.pcts.split(",") if p.strip()] + rng = random.Random(args.seed) + + tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model + + print(f"{args}\n") + print(f"Loading tokenizer from {tokenizer_id} ...") + tokenizer = get_tokenizer(tokenizer_id) + vocab_ids = list(tokenizer.get_vocab().values()) + print(f"Tokenizer loaded (vocab_size={len(vocab_ids)})") + + for pct in pcts: + await benchmark_shared_prefix_pct( + api_url=api_url, + base_url=base_url, + tokenizer=tokenizer, + vocab_ids=vocab_ids, + rng=rng, + pct=pct, + ) + + if args.output_file: + print(f"JSONL results saved to {args.output_file}") + + +if __name__ == "__main__": + set_ulimit() + asyncio.run(main()) diff --git a/benchmark/hicache/data_processing.py b/benchmark/hicache/data_processing.py index dd0cbf669dc0..8c4b8cd1bfb5 100644 --- a/benchmark/hicache/data_processing.py +++ b/benchmark/hicache/data_processing.py @@ -11,13 +11,13 @@ from tqdm.asyncio import tqdm from transformers import PreTrainedTokenizerBase -from sglang.bench_serving import ( +from sglang.benchmark.datasets.common import ( SHAREGPT_FILENAME, SHAREGPT_REPO_ID, - download_and_cache_hf_file, gen_prompt, - get_gen_prefix_cache_path, ) +from sglang.benchmark.datasets.generated_shared_prefix import get_gen_prefix_cache_path +from sglang.benchmark.utils import download_and_cache_hf_file from sglang.lang.chat_template import get_chat_template, get_chat_template_by_model_path from sglang.srt.entrypoints.openai.protocol import ChatCompletionMessageContentPart from sglang.utils import encode_video_base64 @@ -442,7 +442,15 @@ def sample_generated_shared_prefix_requests( 
disable_shuffle: bool = False, ) -> SampleOutput: """Generate benchmark requests with shared system prompts using random tokens and caching.""" - cache_path = get_gen_prefix_cache_path(args, tokenizer) + cache_path = get_gen_prefix_cache_path( + args.seed, + num_groups, + prompts_per_group, + system_prompt_len, + question_len, + output_len, + tokenizer, + ) # Try to load from cache first if cache_path.exists(): diff --git a/benchmark/kernels/all_reduce/benchmark_fused_ar_rms_amd.py b/benchmark/kernels/all_reduce/benchmark_fused_ar_rms_amd.py new file mode 100644 index 000000000000..1fa3819cc290 --- /dev/null +++ b/benchmark/kernels/all_reduce/benchmark_fused_ar_rms_amd.py @@ -0,0 +1,536 @@ +""" +Benchmark fused allreduce+rmsnorm on AMD with correctness checks. + +This script targets the same fused op used by SGLang: +`tensor_model_parallel_fused_allreduce_rmsnorm`. + +It reports: +- eager mode latency (prefill-like) +- graph mode latency (decode-like) +- fused availability (whether fused path returns non-None) +- correctness (fused output matches split allreduce + rmsnorm reference) + +Usage example: + torchrun --nproc_per_node=8 \ + benchmark/kernels/all_reduce/benchmark_fused_ar_rms_amd.py \ + --dtype bfloat16 \ + --prefill-shapes 2048x8192,8192x8192 \ + --decode-shapes 1x8192,4x8192,16x8192 \ + --warmup 10 --iters 30 --repeats 5 +""" + +import argparse +import csv +import os +import statistics +from typing import Dict, List, Optional, Sequence, Tuple + +import torch +import torch.distributed as dist +import torch.nn.functional as F + +from sglang.srt.distributed.communication_op import ( + tensor_model_parallel_all_reduce, + tensor_model_parallel_fused_allreduce_rmsnorm, +) +from sglang.srt.distributed.parallel_state import ( + destroy_distributed_environment, + destroy_model_parallel, + graph_capture, + init_distributed_environment, + initialize_model_parallel, + set_custom_all_reduce, +) + +Shape = Tuple[int, int] + + +def parse_shapes(raw: str) -> 
List[Shape]: + shapes: List[Shape] = [] + for item in [x.strip() for x in raw.split(",") if x.strip()]: + if "x" not in item: + raise ValueError(f"Invalid shape '{item}', expected MxN format.") + m_str, n_str = item.split("x", 1) + m = int(m_str) + n = int(n_str) + if m <= 0 or n <= 0: + raise ValueError(f"Invalid shape '{item}', both dims must be positive.") + shapes.append((m, n)) + if not shapes: + raise ValueError("Empty shape list is not allowed.") + return shapes + + +def dtype_from_name(name: str) -> torch.dtype: + mapping = { + "float16": torch.float16, + "fp16": torch.float16, + "bfloat16": torch.bfloat16, + "bf16": torch.bfloat16, + } + if name not in mapping: + raise ValueError(f"Unsupported dtype: {name}") + return mapping[name] + + +def check_close( + a: torch.Tensor, b: torch.Tensor, dtype: torch.dtype +) -> Tuple[bool, str]: + if dtype == torch.bfloat16: + rtol, atol = 2e-2, 1.25e-1 + else: + rtol, atol = 1e-2, 2e-2 + try: + torch.testing.assert_close(a, b, rtol=rtol, atol=atol) + return True, "PASS" + except AssertionError: + max_diff = torch.max(torch.abs(a - b)).item() + mean_diff = torch.mean(torch.abs(a - b)).item() + return False, f"FAIL(max={max_diff:.6f},mean={mean_diff:.6f})" + + +def _measure_us( + fn, + warmup: int, + iters: int, + repeats: int, + device: torch.device, +) -> Tuple[float, Dict[str, float]]: + for _ in range(warmup): + fn() + torch.cuda.synchronize() + + start_event = torch.cuda.Event(enable_timing=True) + end_event = torch.cuda.Event(enable_timing=True) + samples_us: List[float] = [] + + for _ in range(max(1, repeats)): + _barrier(device) + torch.cuda.synchronize() + start_event.record() + for _ in range(iters): + fn() + end_event.record() + end_event.synchronize() + samples_us.append(start_event.elapsed_time(end_event) * 1000.0 / iters) + + sorted_samples = sorted(samples_us) + p50 = float(statistics.median(sorted_samples)) + p95 = float(sorted_samples[int((len(sorted_samples) - 1) * 0.95)]) + return p50, { + "p50_us": 
p50, + "p95_us": p95, + "min_us": float(sorted_samples[0]), + "max_us": float(sorted_samples[-1]), + } + + +def _barrier(device: torch.device): + try: + dist.barrier(device_ids=[device.index]) + except TypeError: + dist.barrier() + + +def _mean_across_ranks(value: float, device: torch.device) -> float: + t = torch.tensor([value], dtype=torch.float64, device=device) + dist.all_reduce(t, op=dist.ReduceOp.SUM) + t /= dist.get_world_size() + return float(t.item()) + + +def _all_true_across_ranks(value: bool, device: torch.device) -> bool: + t = torch.tensor([1 if value else 0], dtype=torch.int32, device=device) + dist.all_reduce(t, op=dist.ReduceOp.MIN) + return bool(int(t.item())) + + +def _make_inputs( + shape: Shape, + dtype: torch.dtype, + seed: int, + residual_mode: str, + rank: int, + device: torch.device, +) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: + m, n = shape + torch.manual_seed(seed + rank * 17) + x = torch.randn((m, n), dtype=torch.float32, device=device).to(dtype) + if residual_mode == "self": + residual = x.clone() + elif residual_mode == "random": + residual = torch.randn((m, n), dtype=torch.float32, device=device).to(dtype) + elif residual_mode == "zero": + residual = torch.zeros((m, n), dtype=dtype, device=device) + else: + raise ValueError(f"Unknown residual_mode: {residual_mode}") + weight = torch.randn((n,), dtype=torch.float32, device=device).to(dtype) + return x, residual, weight + + +def _split_reference( + x: torch.Tensor, residual: torch.Tensor, weight: torch.Tensor, eps: float +) -> Tuple[torch.Tensor, torch.Tensor]: + ar_out = tensor_model_parallel_all_reduce(x.clone()) + residual_out = ar_out + residual + out = F.rms_norm( + input=residual_out, + normalized_shape=(residual_out.shape[-1],), + weight=weight, + eps=eps, + ) + return out, residual_out + + +def bench_eager( + x: torch.Tensor, + residual: torch.Tensor, + weight: torch.Tensor, + eps: float, + warmup: int, + iters: int, + repeats: int, +) -> Dict[str, object]: + split_fn 
= lambda: _split_reference(x, residual, weight, eps) + split_us, split_stats = _measure_us(split_fn, warmup, iters, repeats, x.device) + + fused_probe = tensor_model_parallel_fused_allreduce_rmsnorm( + x.clone(), residual.clone(), weight, eps + ) + fused_available = fused_probe is not None + + fused_us: Optional[float] = None + fused_stats: Optional[Dict[str, float]] = None + if fused_available: + fused_fn = lambda: tensor_model_parallel_fused_allreduce_rmsnorm( + x, residual, weight, eps + ) + fused_us, fused_stats = _measure_us(fused_fn, warmup, iters, repeats, x.device) + + ref_out, ref_residual = _split_reference(x, residual, weight, eps) + if fused_available: + fused_out, fused_residual = tensor_model_parallel_fused_allreduce_rmsnorm( + x.clone(), residual.clone(), weight, eps + ) + out_ok, out_detail = check_close(fused_out, ref_out, x.dtype) + res_ok, res_detail = check_close(fused_residual, ref_residual, x.dtype) + correctness_ok = out_ok and res_ok + correctness_detail = f"out={out_detail}, residual={res_detail}" + else: + correctness_ok = True + correctness_detail = "SKIP(fused_unavailable)" + + return { + "split_us": split_us, + "split_stats": split_stats, + "fused_available": fused_available, + "fused_us": fused_us, + "fused_stats": fused_stats, + "correctness_ok": correctness_ok, + "correctness_detail": correctness_detail, + } + + +def bench_graph( + x: torch.Tensor, + residual: torch.Tensor, + weight: torch.Tensor, + eps: float, + warmup: int, + iters: int, + repeats: int, +) -> Dict[str, object]: + split_x = x.clone() + split_res = residual.clone() + split_graph_out: Optional[torch.Tensor] = None + + with graph_capture() as gc: + split_graph = torch.cuda.CUDAGraph() + with torch.cuda.graph(split_graph, stream=gc.stream): + split_graph_out, _ = _split_reference(split_x, split_res, weight, eps) + + def split_replay(): + split_graph.replay() + + split_us, split_stats = _measure_us(split_replay, warmup, iters, repeats, x.device) + + fused_probe = 
tensor_model_parallel_fused_allreduce_rmsnorm( + x.clone(), residual.clone(), weight, eps + ) + fused_available = fused_probe is not None + + fused_us: Optional[float] = None + fused_stats: Optional[Dict[str, float]] = None + fused_graph_out: Optional[torch.Tensor] = None + fused_graph_residual: Optional[torch.Tensor] = None + + if fused_available: + fused_x = x.clone() + fused_res = residual.clone() + with graph_capture() as gc: + fused_graph = torch.cuda.CUDAGraph() + with torch.cuda.graph(fused_graph, stream=gc.stream): + fused_graph_out, fused_graph_residual = ( + tensor_model_parallel_fused_allreduce_rmsnorm( + fused_x, fused_res, weight, eps + ) + ) + + def fused_replay(): + fused_graph.replay() + + fused_us, fused_stats = _measure_us( + fused_replay, warmup, iters, repeats, x.device + ) + + ref_out, ref_residual = _split_reference(x, residual, weight, eps) + if ( + fused_available + and fused_graph_out is not None + and fused_graph_residual is not None + ): + fused_graph.replay() + torch.cuda.synchronize() + out_ok, out_detail = check_close(fused_graph_out, ref_out, x.dtype) + res_ok, res_detail = check_close(fused_graph_residual, ref_residual, x.dtype) + correctness_ok = out_ok and res_ok + correctness_detail = f"out={out_detail}, residual={res_detail}" + else: + correctness_ok = True + correctness_detail = "SKIP(fused_unavailable)" + + return { + "split_us": split_us, + "split_stats": split_stats, + "fused_available": fused_available, + "fused_us": fused_us, + "fused_stats": fused_stats, + "correctness_ok": correctness_ok, + "correctness_detail": correctness_detail, + } + + +def _shape_bytes(shape: Shape, dtype: torch.dtype) -> int: + m, n = shape + return m * n * torch.tensor([], dtype=dtype).element_size() + + +def parse_args(): + parser = argparse.ArgumentParser( + description="Benchmark fused allreduce+rmsnorm (prefill eager + decode graph)." 
+ ) + parser.add_argument( + "--dtype", + type=str, + default="bf16", + choices=["fp16", "bf16", "float16", "bfloat16"], + ) + parser.add_argument("--eps", type=float, default=1e-6) + parser.add_argument("--seed", type=int, default=1234) + parser.add_argument( + "--residual-mode", + type=str, + default="self", + choices=["self", "random", "zero"], + help="Use residual=x (self) to match aiter test behavior by default.", + ) + parser.add_argument( + "--prefill-shapes", + type=str, + default="2048x2880,2048x8192,8192x8192,16384x8192", + help="Comma-separated MxN shapes for eager mode.", + ) + parser.add_argument( + "--decode-shapes", + type=str, + default="1x2880,4x2880,16x2880,1x8192,2x8192,4x8192,8x8192,16x8192", + help="Comma-separated MxN shapes for graph mode.", + ) + parser.add_argument("--warmup", type=int, default=10) + parser.add_argument("--iters", type=int, default=30) + parser.add_argument("--repeats", type=int, default=5) + parser.add_argument( + "--mode", + type=str, + default="both", + choices=["eager", "graph", "both"], + ) + parser.add_argument( + "--csv-out", + type=str, + default=None, + help="Optional output CSV path (written on rank 0 only).", + ) + return parser.parse_args() + + +def main(): + args = parse_args() + dtype = dtype_from_name(args.dtype) + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", str(rank))) + torch.cuda.set_device(local_rank % torch.cuda.device_count()) + device = torch.device(f"cuda:{local_rank % torch.cuda.device_count()}") + + set_custom_all_reduce(True) + init_distributed_environment( + world_size=world_size, + rank=rank, + local_rank=local_rank, + distributed_init_method="env://", + backend="nccl", + ) + initialize_model_parallel(tensor_model_parallel_size=world_size) + + prefill_shapes = parse_shapes(args.prefill_shapes) + decode_shapes = parse_shapes(args.decode_shapes) + + if rank == 0: + print( + "Config: " + 
f"world_size={world_size}, dtype={dtype}, residual_mode={args.residual_mode}, " + f"warmup={args.warmup}, iters={args.iters}, repeats={args.repeats}" + ) + + run_modes: Sequence[str] + if args.mode == "both": + run_modes = ("eager", "graph") + else: + run_modes = (args.mode,) + csv_rows: List[Dict[str, object]] = [] + + for mode in run_modes: + shapes = prefill_shapes if mode == "eager" else decode_shapes + if rank == 0: + phase_name = "prefill(eager)" if mode == "eager" else "decode(graph)" + print("\n" + "=" * 120) + print(f"Mode: {phase_name}") + print( + "| Shape | Input bytes/rank | Split p50 (us) | Fused p50 (us) | Speedup | Fused available | Correctness |" + ) + print( + "|:------|-----------------:|---------------:|---------------:|--------:|:----------------|:------------|" + ) + + for shape in shapes: + x, residual, weight = _make_inputs( + shape=shape, + dtype=dtype, + seed=args.seed, + residual_mode=args.residual_mode, + rank=rank, + device=device, + ) + + if mode == "eager": + metrics = bench_eager( + x=x, + residual=residual, + weight=weight, + eps=args.eps, + warmup=args.warmup, + iters=args.iters, + repeats=args.repeats, + ) + else: + metrics = bench_graph( + x=x, + residual=residual, + weight=weight, + eps=args.eps, + warmup=args.warmup, + iters=args.iters, + repeats=args.repeats, + ) + + split_us = _mean_across_ranks(float(metrics["split_us"]), device) + fused_available = _all_true_across_ranks( + bool(metrics["fused_available"]), device + ) + correctness_ok = _all_true_across_ranks( + bool(metrics["correctness_ok"]), device + ) + + fused_us: Optional[float] = None + if fused_available and metrics["fused_us"] is not None: + fused_us = _mean_across_ranks(float(metrics["fused_us"]), device) + + if rank == 0: + m, n = shape + shape_str = f"{m}x{n}" + bytes_per_rank = _shape_bytes(shape, dtype) + if fused_us is not None and fused_us > 0: + speedup = split_us / fused_us + speedup_str = f"{speedup:.3f}x" + fused_str = f"{fused_us:.1f}" + else: + 
speedup_str = "N/A" + fused_str = "N/A" + correctness_text = ( + "PASS" if correctness_ok else str(metrics["correctness_detail"]) + ) + print( + f"| {shape_str} | {bytes_per_rank} | {split_us:.1f} | {fused_str} | " + f"{speedup_str} | {str(fused_available)} | {correctness_text} |" + ) + csv_rows.append( + { + "mode": mode, + "shape": shape_str, + "m": m, + "n": n, + "bytes_per_rank": bytes_per_rank, + "split_p50_us": split_us, + "fused_p50_us": fused_us if fused_us is not None else "", + "speedup_split_over_fused": ( + split_us / fused_us + if fused_us is not None and fused_us > 0 + else "" + ), + "fused_available": fused_available, + "correctness_ok": correctness_ok, + "correctness_detail": correctness_text, + "dtype": str(dtype), + "world_size": world_size, + "residual_mode": args.residual_mode, + "warmup": args.warmup, + "iters": args.iters, + "repeats": args.repeats, + } + ) + + if rank == 0 and args.csv_out: + os.makedirs(os.path.dirname(args.csv_out) or ".", exist_ok=True) + fieldnames = [ + "mode", + "shape", + "m", + "n", + "bytes_per_rank", + "split_p50_us", + "fused_p50_us", + "speedup_split_over_fused", + "fused_available", + "correctness_ok", + "correctness_detail", + "dtype", + "world_size", + "residual_mode", + "warmup", + "iters", + "repeats", + ] + with open(args.csv_out, "w", newline="", encoding="utf-8") as f: + writer = csv.DictWriter(f, fieldnames=fieldnames) + writer.writeheader() + writer.writerows(csv_rows) + print(f"\nSaved CSV to: {args.csv_out}") + + _barrier(device) + destroy_model_parallel() + destroy_distributed_environment() + + +if __name__ == "__main__": + main() diff --git a/benchmark/kernels/all_reduce/benchmark_torch_symm_mem.py b/benchmark/kernels/all_reduce/benchmark_torch_symm_mem.py index 030fd5bb2366..5bdb7f5d687d 100644 --- a/benchmark/kernels/all_reduce/benchmark_torch_symm_mem.py +++ b/benchmark/kernels/all_reduce/benchmark_torch_symm_mem.py @@ -44,12 +44,9 @@ initialize_model_parallel, set_torch_symm_mem_all_reduce, ) 
+from sglang.utils import is_in_ci -# CI environment detection -IS_CI = ( - os.getenv("CI", "false").lower() == "true" - or os.getenv("GITHUB_ACTIONS", "false").lower() == "true" -) +IS_CI = is_in_ci() def torch_allreduce(torch_input: torch.Tensor, group: ProcessGroup) -> torch.Tensor: diff --git a/benchmark/kernels/deepep/tuning_deepep.py b/benchmark/kernels/deepep/tuning_deepep.py index db08a8f14d36..191819d2cf30 100644 --- a/benchmark/kernels/deepep/tuning_deepep.py +++ b/benchmark/kernels/deepep/tuning_deepep.py @@ -40,11 +40,11 @@ def test_main( ): # Settings num_tokens, hidden, num_topk_groups, num_topk, num_experts = ( - 4096, - 7168, + args.num_tokens, + args.hidden, min(num_nodes, 4), - 8, - (256 // num_ranks) * num_ranks, + args.num_topk, + (args.num_experts // num_ranks) * num_ranks, ) assert num_experts % num_ranks == 0 and num_local_ranks == 8 if local_rank == 0: @@ -462,6 +462,10 @@ def test_loop(local_rank: int, num_local_ranks: int, args): if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--num-sms", type=int, default=24) + parser.add_argument("--num-tokens", type=int, default=4096) + parser.add_argument("--hidden", type=int, default=7168) + parser.add_argument("--num-topk", type=int, default=8) + parser.add_argument("--num-experts", type=int, default=256) parser.add_argument("--output-path", type=str, default="deepep_tuned.json") parser.add_argument("--nnodes", type=int, default=1) parser.add_argument("--node-rank", type=int, default=0) diff --git a/benchmark/kernels/deepseek/benchmark_deepgemm_dsv3_router_gemm_blackwell.py b/benchmark/kernels/deepseek/benchmark_deepgemm_dsv3_router_gemm_blackwell.py new file mode 100644 index 000000000000..a44c8ffc10c5 --- /dev/null +++ b/benchmark/kernels/deepseek/benchmark_deepgemm_dsv3_router_gemm_blackwell.py @@ -0,0 +1,250 @@ +import argparse +import os +from typing import List + +import torch +import triton +from flashinfer.gemm import mm_M1_16_K7168_N256 +from sgl_kernel 
import dsv3_router_gemm + +N = 256 +K = 7168 + + +def create_benchmark_configs(tp_sizes: List[int]): + configs = [] + for tp_size in tp_sizes: + for m in range(1, 17): + configs.append((m, N, K, tp_size)) + return configs + + +def dsv3_router_gemm_flashinfer( + hidden_states: torch.Tensor, + router_weights: torch.Tensor, +): + """Flashinfer implementation of dsv3 router gemm""" + output = torch.empty( + hidden_states.shape[0], + router_weights.shape[0], + device="cuda", + dtype=torch.float32, + ) + mm_M1_16_K7168_N256( + hidden_states, router_weights.t(), output, launch_with_pdl=args.use_pdl + ) + return output + + +def dsv3_router_gemm_sgl( + hidden_states: torch.Tensor, + router_weights: torch.Tensor, +): + """SGLang implementation of dsv3 router gemm""" + output = dsv3_router_gemm( + hidden_states, + router_weights, + out_dtype=torch.float32, + ) + return output + + +def check_accuracy(a, b, atol, rtol, percent): + """Unified accuracy checking function with detailed error reporting.""" + if not torch.isfinite(a).all(): + print("Non-finite values in reference output") + return False + if not torch.isfinite(b).all(): + print("Non-finite values in actual output") + return False + assert a.shape == b.shape, f"Shape mismatch: {a.shape} vs {b.shape}" + + close = torch.isclose(a, b, atol=atol, rtol=rtol) + match_ratio = close.float().mean() + if match_ratio >= percent: + return True + + mismatch_percent = 1.0 - match_ratio.item() + if mismatch_percent > 1 - percent: + print( + f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} " + f"(threshold: {1 - percent:.4f})" + ) + return False + + +def calculate_diff(m: int, n: int, k: int): + hidden_states = torch.randn((m, k), device="cuda", dtype=torch.bfloat16) + router_weights = torch.randn((n, k), device="cuda", dtype=torch.bfloat16) + + out_flashinfer = dsv3_router_gemm_flashinfer( + hidden_states.clone(memory_format=torch.contiguous_format), + router_weights.clone(memory_format=torch.contiguous_format), + ) + 
+ out_sgl = dsv3_router_gemm_sgl( + hidden_states.clone(memory_format=torch.contiguous_format), + router_weights.clone(memory_format=torch.contiguous_format), + ) + + print(f"Shape m={m}, n={n}, k={k}:") + print(f"Using PDL={args.use_pdl}") + print(f"Flashinfer output: {out_flashinfer[0, 0:5]}") + print(f"SGLang output: {out_sgl[0, 0:5]}") + + flashinfer_sgl_match = check_accuracy(out_flashinfer, out_sgl, 0.1, 0.6, 0.95) + print("Correctness check:") + print(f" - Flashinfer vs SGLang: {'✅' if flashinfer_sgl_match else '❌'}") + + +def _benchmark(m, n, k, tp_size, provider): + print(f"Shape (m={m}, n={n}, k={k}, tp={tp_size}), Provider: {provider}") + hidden_states = torch.randn( + (m, k), device="cuda", dtype=torch.bfloat16 + ).contiguous() + router_weights = torch.randn( + (n, k), device="cuda", dtype=torch.bfloat16 + ).contiguous() + + quantiles = [0.5, 0.2, 0.8] + + if provider == "sglang": + ms, min_ms, max_ms = triton.testing.do_bench( + lambda: dsv3_router_gemm_sgl( + hidden_states.clone(memory_format=torch.contiguous_format), + router_weights.clone(memory_format=torch.contiguous_format), + ), + quantiles=quantiles, + ) + elif provider == "flashinfer": + ms, min_ms, max_ms = triton.testing.do_bench( + lambda: dsv3_router_gemm_flashinfer( + hidden_states.clone(memory_format=torch.contiguous_format), + router_weights.clone(memory_format=torch.contiguous_format), + ), + quantiles=quantiles, + ) + + # Calculate TFLOPS + flops = 2 * m * n * k # multiply-adds + tflops = flops / (ms * 1e-3) / 1e12 + + # Print shape-specific results with TFLOPS + print(f"Time: {ms*1000:.2f} us, TFLOPS: {tflops:.2f}") + return ms, max_ms, min_ms + + +def get_benchmark_plot_friendly(tp_sizes): + all_configs = create_benchmark_configs(tp_sizes) + x_vals = list(range(len(all_configs))) + + @triton.testing.perf_report( + triton.testing.Benchmark( + x_names=["cfg_id"], + x_vals=x_vals, + line_arg="provider", + line_vals=["sglang", "flashinfer"], + line_names=["SGLang", "Flashinfer"], + 
styles=[("blue", "-"), ("red", "-")], + ylabel="us", + plot_name=f"fp8-gemm-performance-comparison-tp-{'-'.join(str(tp) for tp in tp_sizes)}", + args={}, + ) + ) + def benchmark(cfg_id, provider): + m, n, k, tp_size = all_configs[cfg_id] + ms, min_ms, max_ms = _benchmark(m, n, k, tp_size, provider) + return ms * 1000, max_ms * 1000, min_ms * 1000 # convert ms to us + + return benchmark + + +def get_benchmark(tp_sizes): + all_configs = create_benchmark_configs(tp_sizes) + + @triton.testing.perf_report( + triton.testing.Benchmark( + x_names=[ + "m", + "n", + "k", + "tp_size", + ], + x_vals=[list(config) for config in all_configs], + line_arg="provider", + line_vals=["sglang", "flashinfer"], + line_names=["SGLang", "Flashinfer"], + styles=[("blue", "-"), ("red", "-")], + ylabel="us", + plot_name=f"fp8-gemm-performance-comparison-tp-{'-'.join(str(tp) for tp in tp_sizes)}", + args={}, + ) + ) + def benchmark(m, n, k, tp_size, provider): + ms, min_ms, max_ms = _benchmark(m, n, k, tp_size, provider) + return ms * 1000, max_ms * 1000, min_ms * 1000 # convert ms to us + + return benchmark + + +if __name__ == "__main__": + if not torch.cuda.is_available() or torch.cuda.get_device_capability()[0] != 10: + print("Skipping benchmark because the device is not supported") + exit(0) + + parser = argparse.ArgumentParser() + parser.add_argument( + "--save-path", + type=str, + default="./configs/benchmark_ops/dsv3_router_gemm/", + help="Path to save dsv3 router gemm benchmark results", + ) + parser.add_argument( + "--run-correctness", + action="store_true", + default=True, + help="Whether to run correctness test", + ) + parser.add_argument( + "--tp-sizes", + type=int, + nargs="+", + default=[1], + help="List of tensor parallelism sizes to benchmark", + ) + parser.add_argument( + "--plot-friendly", + action="store_true", + default=False, + help="Plot the x axis as the config index instead of m", + ) + parser.add_argument( + "--use-pdl", + action="store_true", + default=False, + help="Use 
PDL if true.", + ) + args = parser.parse_args() + + # Set random seed for reproducibility + torch.manual_seed(0) + torch.cuda.manual_seed(0) + + if args.use_pdl: + os.environ["TRTLLM_ENABLE_PDL"] = "1" + + # Run correctness tests on a few examples + if args.run_correctness: + print("Running correctness tests...") + for m, n, k, _ in create_benchmark_configs(args.tp_sizes): + calculate_diff(m, n, k) + + # Get the benchmark function with the specified tp_size + benchmark = ( + get_benchmark_plot_friendly(args.tp_sizes) + if args.plot_friendly + else get_benchmark(args.tp_sizes) + ) + + print(f"Running performance benchmark for TP sizes = {args.tp_sizes}...") + benchmark.run(print_data=True, save_path=args.save_path) diff --git a/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_gemm.py b/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_gemm.py index bd02e2aee4a2..0b18e3badf46 100644 --- a/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_gemm.py +++ b/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_gemm.py @@ -11,6 +11,7 @@ w8a8_block_fp8_matmul as vllm_w8a8_block_fp8_matmul, ) +from sglang.benchmark.bench_utils import run_bench from sglang.srt.layers.quantization.fp8_kernel import ( w8a8_block_fp8_matmul_deepgemm as w8a8_block_fp8_matmul, ) @@ -303,10 +304,10 @@ def benchmark(m, n, k, tp_size, provider): y_fp8, y_scale = per_block_cast_to_fp8(y) x_scale_col_major = get_mn_major_tma_aligned_tensor(x_scale.clone()) - quantiles = [0.5, 0.2, 0.8] + quantiles = (0.5, 0.2, 0.8) if provider == "deepgemm": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: fp8_gemm_deepgemm( x_fp8.clone(), x_scale_col_major.clone(), @@ -319,7 +320,7 @@ def benchmark(m, n, k, tp_size, provider): quantiles=quantiles, ) elif provider == "sglang": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: fp8_gemm_sglang( x_fp8.clone(), x_scale.clone(), @@ -334,7 +335,7 @@ def benchmark(m, n, k, tp_size, provider): else: # 
tilelang tilelang_func = tl_gemm(m, n, k, "e4m3_float8", "bfloat16", "float32") tilelang_kernel = tilelang.compile(tilelang_func, out_idx=[-1]) - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: tilelang_kernel( x_fp8.clone(), x_scale.clone(), diff --git a/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_gemm_blackwell.py b/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_gemm_blackwell.py index de14bd90ec2f..3257da7b3787 100644 --- a/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_gemm_blackwell.py +++ b/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_gemm_blackwell.py @@ -6,6 +6,7 @@ from deep_gemm import ceil_div from flashinfer.gemm import gemm_fp8_nt_groupwise +from sglang.benchmark.bench_utils import run_bench from sglang.srt.layers.quantization.fp8_kernel import ( sglang_per_token_group_quant_fp8, w8a8_block_fp8_matmul_deepgemm, @@ -195,10 +196,10 @@ def _benchmark(m, n, k, tp_size, provider): y_fp8, y_scale, [BLOCK_SIZE, BLOCK_SIZE] ) - quantiles = [0.5, 0.2, 0.8] + quantiles = (0.5, 0.2, 0.8) if provider == "deepgemm": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: fp8_gemm_deepgemm_blackwell( dg_x_fp8, dg_x_scale, @@ -208,7 +209,7 @@ def _benchmark(m, n, k, tp_size, provider): quantiles=quantiles, ) elif provider == "flashinfer": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: fp8_gemm_flashinfer( x_fp8, x_scale, diff --git a/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_group_gemm.py b/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_group_gemm.py index b2cea0705776..8b1be7b888f9 100644 --- a/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_group_gemm.py +++ b/benchmark/kernels/deepseek/benchmark_deepgemm_fp8_group_gemm.py @@ -8,6 +8,7 @@ from deep_gemm.utils.layout import get_mn_major_tma_aligned_tensor # Import shared functionality from the regular GEMM benchmark +from sglang.benchmark.bench_utils import run_bench from 
sglang.benchmark.kernels.deepseek.benchmark_deepgemm_fp8_gemm import ( per_block_cast_to_fp8, per_token_cast_to_fp8, @@ -397,10 +398,10 @@ def benchmark(m, n, k, num_groups, tp_size, provider): .view(-1) ) - quantiles = [0.5, 0.2, 0.8] + quantiles = (0.5, 0.2, 0.8) if provider == "deepgemm": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: fp8_gemm_group_deepgemm( x_fp8_grouped, y_fp8_grouped, @@ -420,7 +421,7 @@ def benchmark(m, n, k, num_groups, tp_size, provider): M, _ = a.shape _, N = b.shape c = torch.empty((M, N), device=a.device, dtype=torch.bfloat16) - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: fp8_gemm_group_triton( (a, a_scale), (b, b_scale), diff --git a/benchmark/kernels/elementwise/benchmark_concat_mla.py b/benchmark/kernels/elementwise/benchmark_concat_mla.py index c4d7bb1c8ff0..7bc51d3da4be 100644 --- a/benchmark/kernels/elementwise/benchmark_concat_mla.py +++ b/benchmark/kernels/elementwise/benchmark_concat_mla.py @@ -3,6 +3,8 @@ import triton.language as tl from sgl_kernel import concat_mla_k as concat_mla_k_cuda +from sglang.benchmark.bench_utils import run_bench + DEVICE = triton.runtime.driver.active.get_active_torch_device() num_local_heads = 128 @@ -179,7 +181,7 @@ def execute_and_get_output(f, data): ) def benchmark(num_tokens, provider): data = create_data(num_tokens=num_tokens) - quantiles = [0.5, 0.2, 0.8] + quantiles = (0.5, 0.2, 0.8) fn = { "torch": fn_torch, "torch_compiled": fn_torch_compiled, @@ -187,9 +189,7 @@ def benchmark(num_tokens, provider): "hack_non_strided": fn_hack_non_strided, "cuda": fn_cuda, }[provider] - ms, min_ms, max_ms = triton.testing.do_bench( - lambda: fn(**data), quantiles=quantiles - ) + ms, min_ms, max_ms = run_bench(lambda: fn(**data), quantiles=quantiles) return ms, min_ms, max_ms diff --git a/benchmark/kernels/flashinfer_allreduce_fusion/benchmark_fused_collective.py 
b/benchmark/kernels/flashinfer_allreduce_fusion/benchmark_fused_collective.py index 4aebf62b90e8..7050897c0cea 100644 --- a/benchmark/kernels/flashinfer_allreduce_fusion/benchmark_fused_collective.py +++ b/benchmark/kernels/flashinfer_allreduce_fusion/benchmark_fused_collective.py @@ -42,7 +42,8 @@ try: from sgl_kernel import fused_add_rmsnorm as SGL_FUSED_ADD_RMS_NORM from sgl_kernel import rmsnorm as SGL_RMS_NORM - from sgl_kernel import scaled_fp4_quant as SGL_SCALED_FP4_QUANT + + from sglang.jit_kernel.nvfp4 import scaled_fp4_quant as SGL_SCALED_FP4_QUANT except Exception: # pragma: no cover - fallback on non-supported platforms SGL_FUSED_ADD_RMS_NORM = None SGL_RMS_NORM = None diff --git a/benchmark/kernels/fused_moe_triton/README.md b/benchmark/kernels/fused_moe_triton/README.md index f11c6541a0ea..e2bfcc41dd1d 100644 --- a/benchmark/kernels/fused_moe_triton/README.md +++ b/benchmark/kernels/fused_moe_triton/README.md @@ -151,7 +151,7 @@ After tuning, configuration files will be generated: - **Standard tuning**: `E=64,N=640,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8.json` - **Separate kernel tuning**: Two files for up/down kernels with TMA optimization flags -Move these files to `sglang/srt/layers/moe/fused_moe_triton/configs/triton_version/` directory to use them in SGLang. +Move these files to `sglang/srt/layers/moe/moe_runner/triton_utils/configs/triton_version/` directory to use them in SGLang. 
### Supported Models diff --git a/benchmark/kernels/fused_moe_triton/benchmark_sglang_fused_moe_triton.py b/benchmark/kernels/fused_moe_triton/benchmark_sglang_fused_moe_triton.py index b418855a2188..4515ff53b34c 100644 --- a/benchmark/kernels/fused_moe_triton/benchmark_sglang_fused_moe_triton.py +++ b/benchmark/kernels/fused_moe_triton/benchmark_sglang_fused_moe_triton.py @@ -5,20 +5,27 @@ import triton from common_utils import get_model_config +from sglang.benchmark.bench_utils import run_bench from sglang.srt.distributed.parallel_state import ( destroy_distributed_environment, destroy_model_parallel, init_distributed_environment, initialize_model_parallel, ) -from sglang.srt.layers.moe.fused_moe_triton.fused_moe import ( - fused_moe as fused_moe_sglang, -) from sglang.srt.layers.moe.fused_moe_triton.triton_kernels_moe import ( triton_kernel_moe_forward, ) from sglang.srt.layers.moe.moe_runner import MoeRunnerConfig -from sglang.srt.layers.moe.topk import TopK, TopKConfig, select_experts +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe import ( + fused_moe as fused_moe_sglang, +) +from sglang.srt.layers.moe.topk import ( + TopK, + TopKConfig, + TopKOutputFormat, + select_experts, +) +from sglang.srt.server_args import ServerArgs, set_global_server_args_for_scheduler def fused_moe_triton_api( @@ -32,8 +39,8 @@ def fused_moe_triton_api( top_k=topk, renormalize=False, use_grouped_topk=False, + output_format=TopKOutputFormat.TRITON_KERNEL, ) - topk_op.use_triton_kernels = True triton_topk_output = topk_op.forward_cuda( hidden_states=x, router_logits=input_gating, @@ -175,8 +182,8 @@ def benchmark( else: bench_lambda = lambda: api_func(**api_kwargs) - quantiles = [0.5, 0.2, 0.8] - ms, min_ms, max_ms = triton.testing.do_bench(bench_lambda, quantiles=quantiles) + quantiles = (0.5, 0.2, 0.8) + ms, min_ms, max_ms = run_bench(bench_lambda, quantiles=quantiles) return ms, min_ms, max_ms @@ -199,6 +206,10 @@ def main(): parser.add_argument("--trust-remote-code", 
action="store_true") args = parser.parse_args() + # Initialize global server args (required by SGLang MoE kernels) + server_args = ServerArgs(model_path=args.model) + set_global_server_args_for_scheduler(server_args) + try: if not torch.distributed.is_initialized(): torch.distributed.init_process_group( @@ -217,8 +228,8 @@ def main(): ) initialize_model_parallel( - tensor_model_parallel_size=args.ep_size, - pipeline_model_parallel_size=args.tp_size, + tensor_model_parallel_size=1, + expert_model_parallel_size=1, ) model_config = get_model_config(args.model, args.tp_size, args.ep_size) diff --git a/benchmark/kernels/fused_moe_triton/benchmark_torch_compile_fused_moe.py b/benchmark/kernels/fused_moe_triton/benchmark_torch_compile_fused_moe.py index 2b4faa24b1db..e6fdfa8a7f5f 100644 --- a/benchmark/kernels/fused_moe_triton/benchmark_torch_compile_fused_moe.py +++ b/benchmark/kernels/fused_moe_triton/benchmark_torch_compile_fused_moe.py @@ -6,7 +6,8 @@ from torch.nn import functional as F from transformers import AutoConfig -from sglang.srt.layers.moe.fused_moe_triton.fused_moe import ( +from sglang.benchmark.bench_utils import run_bench +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe import ( fused_moe as fused_moe_triton, ) from sglang.srt.model_executor.cuda_graph_runner import set_torch_compile_config @@ -258,8 +259,8 @@ def benchmark(batch_size, provider, model_config, use_fp8_w8a8=False): ) torch.cuda.synchronize() - quantiles = [0.5, 0.2, 0.8] - ms, min_ms, max_ms = triton.testing.do_bench( + quantiles = (0.5, 0.2, 0.8) + ms, min_ms, max_ms = run_bench( lambda: api_func( x, w1, diff --git a/benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py b/benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py index 206ee2a86675..fc100ce50804 100644 --- a/benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py +++ 
b/benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py @@ -5,13 +5,14 @@ import triton from vllm.model_executor.layers.fused_moe.fused_moe import fused_moe as fused_moe_vllm +from sglang.benchmark.bench_utils import run_bench from sglang.srt.distributed.parallel_state import ( destroy_distributed_environment, destroy_model_parallel, init_distributed_environment, initialize_model_parallel, ) -from sglang.srt.layers.moe.fused_moe_triton.fused_moe import ( +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe import ( fused_moe as fused_moe_sglang, ) @@ -190,8 +191,8 @@ def benchmark(batch_size, provider, model_config, use_fp8_w8a8=False): ) torch.cuda.synchronize() - quantiles = [0.5, 0.2, 0.8] - ms, min_ms, max_ms = triton.testing.do_bench( + quantiles = (0.5, 0.2, 0.8) + ms, min_ms, max_ms = run_bench( lambda: api_func( x, w1, diff --git a/benchmark/kernels/fused_moe_triton/common_utils.py b/benchmark/kernels/fused_moe_triton/common_utils.py index 5f2d9aa8a244..64189aa2a871 100644 --- a/benchmark/kernels/fused_moe_triton/common_utils.py +++ b/benchmark/kernels/fused_moe_triton/common_utils.py @@ -3,8 +3,8 @@ import torch -from sglang.srt.layers.moe.fused_moe_triton.fused_moe import get_config_dtype_str -from sglang.srt.layers.moe.fused_moe_triton.fused_moe_triton_config import ( +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe import get_config_dtype_str +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe_triton_config import ( get_config_file_name, ) from sglang.srt.utils import is_hip @@ -37,7 +37,7 @@ def get_model_config( topk_ids_dir: str = None, ) -> Dict: config = get_config(model_name, trust_remote_code=True) - + architecture = config.architectures[0] block_shape = None if ( hasattr(config, "quantization_config") @@ -46,8 +46,17 @@ def get_model_config( block_shape = config.quantization_config["weight_block_size"] assert len(block_shape) == 2 - architecture = config.architectures[0] - + if ( + 
hasattr(config, "quantization_config") + and "config_groups" in config.quantization_config + ): + config_groups = config.quantization_config["config_groups"] + # Get group_size from the first group's weights config + first_group = next(iter(config_groups.values()), {}) + weights_config = first_group.get("weights", {}) + group_size = weights_config.get("group_size") + block_shape = [0, group_size] + assert len(block_shape) == 2 # Replace config with text_config for encoder-decoder models after getting block_shape and architecture if hasattr(config, "text_config"): config = config.get_text_config() @@ -66,6 +75,7 @@ def get_model_config( "Qwen3MoeForCausalLM", "Qwen3NextForCausalLM", "Qwen3VLMoeForConditionalGeneration", + "Qwen3_5MoeForConditionalGeneration", ]: E = config.num_experts // ep_size topk = config.num_experts_per_tok @@ -73,7 +83,9 @@ def get_model_config( elif architecture in [ "DeepseekV2ForCausalLM", "DeepseekV3ForCausalLM", + "DeepseekV32ForCausalLM", "Glm4MoeForCausalLM", + "GlmMoeDsaForCausalLM", "MistralLarge3ForCausalLM", ]: E = (config.n_routed_experts // ep_size) + ( @@ -82,7 +94,9 @@ def get_model_config( or architecture not in [ "DeepseekV3ForCausalLM", + "DeepseekV32ForCausalLM", "Glm4MoeForCausalLM", + "GlmMoeDsaForCausalLM", "MistralLarge3ForCausalLM", ] else 1 @@ -115,11 +129,23 @@ def get_model_config( E = config.num_experts // ep_size topk = config.num_experts_per_tok intermediate_size = config.moe_intermediate_size + elif architecture == "HYV3ForCausalLM": + E = config.num_experts // ep_size + topk = config.num_experts_per_tok + intermediate_size = config.expert_hidden_dim elif architecture == "NemotronHForCausalLM": E = config.n_routed_experts // ep_size topk = config.num_experts_per_tok intermediate_size = config.moe_intermediate_size hidden_size = getattr(config, "moe_latent_size", None) or hidden_size + elif architecture == "Gemma4ForConditionalGeneration": + E = config.num_experts // ep_size + topk = config.top_k_experts + 
intermediate_size = config.moe_intermediate_size + elif architecture == "Lfm2MoeForCausalLM": + E = config.num_experts // ep_size + topk = config.num_experts_per_tok + intermediate_size = config.moe_intermediate_size else: # Default: Mixtral E = config.num_local_experts // ep_size @@ -222,6 +248,7 @@ def get_config_filename( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, per_channel_quant: bool, block_shape: List[int], ) -> str: @@ -230,13 +257,18 @@ def get_config_filename( use_int8_w8a16=use_int8_w8a16, use_fp8_w8a8=use_fp8_w8a8, use_int8_w8a8=use_int8_w8a8, + use_int4_w4a16=use_int4_w4a16, ) # NOTE(woosuk): The current naming convention uses w2.shape[2], which # is the intermediate size after silu_and_mul. + N = shard_intermediate_size // 2 + if use_int4_w4a16: + N = N // 2 + filename = get_config_file_name( num_experts, - shard_intermediate_size // 2, + N, dtype_str, block_shape, per_channel_quant, diff --git a/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py b/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py index aef7ed8f6ca7..f134c56ef7bb 100644 --- a/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py +++ b/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py @@ -20,17 +20,22 @@ from ray.experimental.tqdm_ray import tqdm from sglang.srt.layers.moe.fused_moe_triton import override_config -from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_moe -from sglang.srt.layers.moe.fused_moe_triton.fused_moe_triton_config import ( +from sglang.srt.layers.moe.moe_runner import MoeRunnerConfig +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe import fused_moe +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe_triton_config import ( get_config_dtype_str, get_default_config, get_moe_configs, ) -from sglang.srt.layers.moe.moe_runner import MoeRunnerConfig from sglang.srt.layers.moe.topk import TopKConfig, select_experts -from sglang.srt.utils import 
is_hip +from sglang.srt.server_args import ( + ServerArgs, + set_global_server_args_for_scheduler, +) +from sglang.srt.utils import get_device, is_hip, is_xpu _is_hip = is_hip() +_is_xpu = is_xpu() def benchmark_config( @@ -44,6 +49,7 @@ def benchmark_config( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, per_channel_quant: bool, block_shape: List[int] = None, num_iters: int = 100, @@ -71,6 +77,27 @@ def benchmark_config( ), dtype=torch.int8, ) + elif use_int4_w4a16: + w1 = torch.randint( + 0, + 255, + ( + num_experts, + shard_intermediate_size, + hidden_size // 2, + ), + dtype=torch.uint8, + ) + w2 = torch.randint( + 0, + 255, + ( + num_experts, + hidden_size, + shard_intermediate_size // 4, + ), + dtype=torch.uint8, + ) else: w1 = torch.randn( num_experts, shard_intermediate_size, hidden_size, dtype=init_dtype @@ -89,6 +116,19 @@ def benchmark_config( (num_experts, 2 * shard_intermediate_size), dtype=torch.float32 ) w2_scale = torch.randn((hidden_size, num_experts), dtype=torch.float32) + if use_int4_w4a16: + block_n = 1 if (block_shape[0] == 0) else block_shape[0] + block_k = block_shape[1] + n_tiles_w1 = (shard_intermediate_size + block_n - 1) // block_n + n_tiles_w2 = (hidden_size + block_n - 1) // block_n + k_tiles_w1 = (hidden_size + block_k - 1) // block_k + k_tiles_w2 = (shard_intermediate_size // 2 + block_k - 1) // block_k + w1_scale = torch.randn( + (num_experts, n_tiles_w1, k_tiles_w1), dtype=torch.bfloat16 + ) + w2_scale = torch.randn( + (num_experts, n_tiles_w2, k_tiles_w2), dtype=torch.bfloat16 + ) if use_fp8_w8a8 or use_int8_w8a8: if use_int8_w8a8 and block_shape is None: w1_scale = torch.randn( @@ -146,6 +186,7 @@ def run(): use_fp8_w8a8=use_fp8_w8a8, use_int8_w8a8=use_int8_w8a8, use_int8_w8a16=use_int8_w8a16, + use_int4_w4a16=use_int4_w4a16, w1_scale=w1_scale, w2_scale=w2_scale, a1_scale=a1_scale, @@ -195,13 +236,14 @@ def run(): @ray.remote(num_gpus=1) class BenchmarkWorker: - def __init__(self, seed: int) 
-> None: - torch.set_default_device("cuda") - torch.cuda.manual_seed_all(0) + def __init__(self, seed: int, server_args: ServerArgs) -> None: + torch.set_default_device(get_device()) + torch.get_device_module().manual_seed_all(0) self.seed = seed # Get the device ID to allocate tensors and kernels # on the respective GPU. self.device_id = int(ray.get_gpu_ids()[0]) + set_global_server_args_for_scheduler(server_args) def benchmark( self, @@ -214,20 +256,27 @@ def benchmark( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, per_channel_quant: bool, block_shape: List[int], ) -> Tuple[Dict[str, int], float]: torch.cuda.manual_seed_all(0) dtype_str = get_config_dtype_str( - dtype, use_int8_w8a16=use_int8_w8a16, use_fp8_w8a8=use_fp8_w8a8 + dtype, + use_int8_w8a16=use_int8_w8a16, + use_fp8_w8a8=use_fp8_w8a8, + use_int4_w4a16=use_int4_w4a16, ) # NOTE(woosuk): The current naming convention uses w2.shape[2], which # is the intermediate size after silu_and_mul. block_n = block_shape[0] if block_shape else 0 block_k = block_shape[1] if block_shape else 0 + N = shard_intermediate_size // 2 + if use_int4_w4a16: + N = N // 2 op_config = get_moe_configs( num_experts, - shard_intermediate_size // 2, + N, dtype_str, block_n, block_k, @@ -258,6 +307,7 @@ def benchmark( use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, per_channel_quant, block_shape, ) @@ -274,13 +324,18 @@ def tune( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, per_channel_quant: bool, block_shape: List[int], search_space: List[Dict[str, int]], ) -> Dict[str, int]: best_config = None best_time = float("inf") - with torch.cuda.device(self.device_id) if is_hip() else nullcontext(): + with ( + torch.get_device_module().device(self.device_id) + if _is_xpu or _is_hip + else nullcontext() + ): for config in tqdm(search_space): try: kernel_time = benchmark_config( @@ -294,6 +349,7 @@ def tune( use_fp8_w8a8, use_int8_w8a8, 
use_int8_w8a16, + use_int4_w4a16, per_channel_quant, block_shape, num_iters=10, @@ -312,7 +368,9 @@ def tune( def main(args: argparse.Namespace): - print(args) + server_args = ServerArgs( + model_path=args.model, tp_size=args.tp_size, ep_size=args.ep_size + ) model_config = get_model_config( args.model, args.tp_size, args.ep_size, args.disable_shared_experts_fusion @@ -328,6 +386,7 @@ def main(args: argparse.Namespace): use_fp8_w8a8 = args.dtype == "fp8_w8a8" use_int8_w8a8 = args.dtype == "int8_w8a8" use_int8_w8a16 = args.dtype == "int8_w8a16" + use_int4_w4a16 = args.dtype == "int4_w4a16" per_channel_quant = args.per_channel_quant if args.batch_size is None: @@ -337,7 +396,7 @@ def main(args: argparse.Namespace): ray.init() num_gpus = int(ray.available_resources()["GPU"]) - workers = [BenchmarkWorker.remote(args.seed) for _ in range(num_gpus)] + workers = [BenchmarkWorker.remote(args.seed, server_args) for _ in range(num_gpus)] def _distribute(method: str, inputs: List[Any]) -> List[Any]: outputs = [] @@ -369,6 +428,7 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, per_channel_quant, block_shape, ) @@ -390,6 +450,7 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, per_channel_quant, block_shape, search_space, @@ -420,6 +481,7 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, per_channel_quant, block_shape, ) @@ -442,7 +504,7 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: parser.add_argument( "--dtype", type=str, - choices=["auto", "fp8_w8a8", "int8_w8a16", "int8_w8a8"], + choices=["auto", "fp8_w8a8", "int8_w8a16", "int8_w8a8", "int4_w4a16"], default="auto", ) parser.add_argument( diff --git a/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton_sep.py 
b/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton_sep.py index d0c922a4c7b3..cf9be7eb48ea 100644 --- a/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton_sep.py +++ b/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton_sep.py @@ -22,16 +22,20 @@ ) from ray.experimental.tqdm_ray import tqdm -from sglang.srt.layers.moe.fused_moe_triton.fused_moe import ( +from sglang.srt.layers.moe.moe_runner import MoeRunnerConfig +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe import ( get_config_dtype_str, invoke_fused_moe_kernel, moe_align_block_size, ) -from sglang.srt.layers.moe.fused_moe_triton.fused_moe_triton_config import ( +from sglang.srt.layers.moe.moe_runner.triton_utils.fused_moe_triton_config import ( get_config_file_name, ) -from sglang.srt.layers.moe.moe_runner import MoeRunnerConfig from sglang.srt.layers.moe.topk import TopKConfig, select_experts +from sglang.srt.server_args import ( + ServerArgs, + set_global_server_args_for_scheduler, +) from sglang.srt.utils import is_hip _is_hip = is_hip() @@ -132,8 +136,10 @@ def benchmark_config( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, topk_ids_list, block_shape: List[int] = None, + ep_size: int = 1, num_iters: int = 100, ) -> float: ncu_enable = os.getenv("NCU_ENABLE", "0") == "1" @@ -162,6 +168,27 @@ def benchmark_config( ), dtype=torch.int8, ) + elif use_int4_w4a16: + w1 = torch.randint( + 0, + 255, + ( + num_experts, + shard_intermediate_size, + hidden_size // 2, + ), + dtype=torch.uint8, + ) + w2 = torch.randint( + 0, + 255, + ( + num_experts, + hidden_size, + shard_intermediate_size // 4, + ), + dtype=torch.uint8, + ) else: w1 = torch.randn( num_experts, shard_intermediate_size, hidden_size, dtype=init_dtype @@ -179,6 +206,19 @@ def benchmark_config( (num_experts, 2 * shard_intermediate_size), dtype=torch.float32 ) w2_scale = torch.randn((hidden_size, num_experts), dtype=torch.float32) + if use_int4_w4a16: + block_n = 1 if 
(block_shape[0] == 0) else block_shape[0] + block_k = block_shape[1] + n_tiles_w1 = (shard_intermediate_size + block_n - 1) // block_n + n_tiles_w2 = (hidden_size + block_n - 1) // block_n + k_tiles_w1 = (hidden_size + block_k - 1) // block_k + k_tiles_w2 = (shard_intermediate_size // 2 + block_k - 1) // block_k + w1_scale = torch.randn( + (num_experts, n_tiles_w1, k_tiles_w1), dtype=torch.bfloat16 + ) + w2_scale = torch.randn( + (num_experts, n_tiles_w2, k_tiles_w2), dtype=torch.bfloat16 + ) if use_fp8_w8a8 or use_int8_w8a8: if use_int8_w8a8 and block_shape is None: w1_scale = torch.randn( @@ -253,6 +293,12 @@ def benchmark_config( def prepare(i: int, inner_iter): # update inputs according to topk_ids for k in range(inner_iter): topk_ids = topk_ids_list[i * inner_iter + k] + # With EP, saved topk_ids are global expert indices; remap to local. + if ep_size > 1: + topk_ids = (topk_ids // ep_size).to( + device=moe_inputs[k].topk_ids.device, + dtype=moe_inputs[k].topk_ids.dtype, + ) tokens, _topk = moe_inputs[k].topk_ids.shape moe_inputs[k].topk_ids.copy_(topk_ids[:tokens, :_topk]) sorted_token_ids_, expert_ids_, num_tokens_post_padded_ = ( @@ -277,7 +323,7 @@ def get_kernel_wrapper(moe_use_tma, inner_iter, use_cuda_graph): B=w1, bias=None, C=intermediate_cache1, - A_scale=None, + A_scale=a1_scale, B_scale=w1_scale, B_zp=None, topk_weights=topk_output_.topk_weights, @@ -287,9 +333,9 @@ def get_kernel_wrapper(moe_use_tma, inner_iter, use_cuda_graph): config=config, compute_type=compute_type, use_fp8_w8a8=use_fp8_w8a8, - use_int8_w8a8=False, - use_int8_w8a16=False, - use_int4_w4a16=False, + use_int8_w8a8=use_int8_w8a8, + use_int8_w8a16=use_int8_w8a16, + use_int4_w4a16=use_int4_w4a16, per_channel_quant=False, block_shape=block_shape, b_use_tma=moe_use_tma, @@ -313,9 +359,9 @@ def get_kernel_wrapper(moe_use_tma, inner_iter, use_cuda_graph): config=config, compute_type=compute_type, use_fp8_w8a8=use_fp8_w8a8, - use_int8_w8a8=False, - use_int8_w8a16=False, - 
use_int4_w4a16=False, + use_int8_w8a8=use_int8_w8a8, + use_int8_w8a16=use_int8_w8a16, + use_int4_w4a16=use_int4_w4a16, per_channel_quant=False, block_shape=block_shape, a_use_tma=moe_use_tma, @@ -398,13 +444,14 @@ def config_dict(self, block_m): class BenchmarkWorker: - def __init__(self, seed: int) -> None: + def __init__(self, seed: int, server_args: ServerArgs) -> None: torch.set_default_device("cuda") torch.cuda.manual_seed_all(0) self.seed = seed # Get the device ID to allocate tensors and kernels # on the respective GPU. self.device_id = 0 # int(ray.get_gpu_ids()[0]) + set_global_server_args_for_scheduler(server_args) def benchmark( self, @@ -417,9 +464,11 @@ def benchmark( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, block_shape: List[int], cfg: Dict[str, int], topk_ids_dir: str, + ep_size: int = 1, ) -> Tuple[Dict[str, int], float]: torch.cuda.manual_seed_all(0) topk_ids_list = [load_topk_ids(topk_ids_dir, i) for i in range(100)] @@ -435,8 +484,10 @@ def benchmark( use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, topk_ids_list, block_shape, + ep_size=ep_size, ) return cfg, kernel_time @@ -451,9 +502,11 @@ def tune( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, block_shape: List[int], search_space: List[Dict[str, int]], topk_ids_dir: str, + ep_size: int = 1, ) -> Dict[str, int]: trace0 = BestConfigTrace("kernel0", down_moe=False) trace1 = BestConfigTrace("kernel1", down_moe=True) @@ -473,8 +526,10 @@ def tune( use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, topk_ids_list, block_shape, + ep_size=ep_size, num_iters=100, ) except triton.runtime.autotuner.OutOfResources: @@ -516,9 +571,11 @@ def cmp_configs( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, block_shape: List[int], cmp_config_files: List[str], topk_ids_dir: str, + ep_size: int = 1, ): # compare performance of different configs cmp_configs = [] @@ 
-550,8 +607,10 @@ def cmp_configs( use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, topk_ids_list, block_shape, + ep_size=ep_size, ) kernel_times.append(kernel_time) print(f"batch_size={bs=}:") @@ -569,6 +628,7 @@ def save_configs_sep( use_fp8_w8a8: bool, use_int8_w8a8: bool, use_int8_w8a16: bool, + use_int4_w4a16: bool, block_shape: List[int], down_moe: bool = False, ) -> None: @@ -577,6 +637,7 @@ def save_configs_sep( use_int8_w8a16=use_int8_w8a16, use_fp8_w8a8=use_fp8_w8a8, use_int8_w8a8=use_int8_w8a8, + use_int4_w4a16=use_int4_w4a16, ) # NOTE(woosuk): The current naming convention uses w2.shape[2], which @@ -598,6 +659,10 @@ def save_configs_sep( def main(args: argparse.Namespace): print(args) + server_args = ServerArgs( + model_path=args.model, tp_size=args.tp_size, ep_size=args.ep_size + ) + model_config = get_model_config( args.model, args.tp_size, @@ -616,6 +681,7 @@ def main(args: argparse.Namespace): use_fp8_w8a8 = args.dtype == "fp8_w8a8" use_int8_w8a8 = args.dtype == "int8_w8a8" use_int8_w8a16 = args.dtype == "int8_w8a16" + use_int4_w4a16 = args.dtype == "int4_w4a16" topk_ids_dir = args.topk_ids_dir if args.batch_size is None: @@ -625,7 +691,7 @@ def main(args: argparse.Namespace): batch_sizes = [args.batch_size] if args.cmp_configs is not None: - worker = BenchmarkWorker(args.seed) + worker = BenchmarkWorker(args.seed, server_args) worker.cmp_configs( batch_sizes, E, @@ -636,14 +702,16 @@ def main(args: argparse.Namespace): use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, block_shape, args.cmp_configs, topk_ids_dir, + args.ep_size, ) return if len(batch_sizes) == 1: - worker = BenchmarkWorker(args.seed) + worker = BenchmarkWorker(args.seed, server_args) if args.tune: search_space = get_configs_compute_bound() worker.tune( @@ -656,9 +724,11 @@ def main(args: argparse.Namespace): use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, block_shape, search_space, topk_ids_dir, + args.ep_size, ) else: cfg = { @@ -680,9 
+750,11 @@ def main(args: argparse.Namespace): use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, block_shape, cfg, topk_ids_dir, + args.ep_size, ) print(f"{t0=}, {t0_tma=}, {t1=}, {t1_tma=}") return @@ -692,7 +764,7 @@ def main(args: argparse.Namespace): ray.init() num_gpus = int(ray.available_resources()["GPU"]) workers = [ - ray.remote(num_gpus=1)(BenchmarkWorker).remote(args.seed) + ray.remote(num_gpus=1)(BenchmarkWorker).remote(args.seed, server_args) for _ in range(num_gpus) ] @@ -722,6 +794,7 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, False, block_shape, ) @@ -743,9 +816,11 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, block_shape, search_space, topk_ids_dir, + args.ep_size, ) for batch_size in batch_sizes ], @@ -770,6 +845,7 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, block_shape, ) @@ -784,6 +860,7 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, + use_int4_w4a16, block_shape, down_moe=True, ) @@ -801,7 +878,7 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]: parser.add_argument( "--dtype", type=str, - choices=["auto", "fp8_w8a8", "int8_w8a16", "int8_w8a8"], + choices=["auto", "fp8_w8a8", "int8_w8a16", "int8_w8a8", "int4_w4a16"], default="auto", ) parser.add_argument("--seed", type=int, default=0) diff --git a/benchmark/kernels/lora_csgmv/tune_lora_csgmv.py b/benchmark/kernels/lora_csgmv/tune_lora_csgmv.py new file mode 100755 index 000000000000..1c162beca29b --- /dev/null +++ b/benchmark/kernels/lora_csgmv/tune_lora_csgmv.py @@ -0,0 +1,747 @@ +""" +Auto-tuning script for LoRA CSGMV (Chunked Segmented Matrix-Vector) kernels. + +LoRA adds low-rank adapters to linear layers.
The two kernels are: + - Shrink (lora_a): x @ A^T, projecting from input_dim down to rank + - Expand (lora_b): (x @ A^T) @ B^T, projecting from rank back up to output_dim + +Terminology / dimensions: + K For shrink: input_dim (the large dimension, e.g. hidden_size). + For expand: output_dim (e.g. hidden_size or qkv_output_dim). + R Max LoRA rank (e.g. 16, 32, 64). The small dimension. + S num_slices — how many weight slices a layer fuses together: + qkv_proj → 3 (q, k, v), gate_up_proj → 2, others → 1. + Affects the Triton grid (N = S * R for shrink, grid dim for expand). + chunk_size BLOCK_M — the max segment length in the chunked batch. Sequences + are split into fixed-size chunks for load-balanced GPU scheduling. + Typical values: 16, 32, 64, 128. + +Tuned parameters (per kernel, K, R, S, chunk_size): + BLOCK_N Tile size along the N (output) dimension. + BLOCK_K Tile size along the K (reduction) dimension. + num_warps Number of warps per Triton program instance. + num_stages Number of software pipelining stages. + maxnreg (expand only) Register cap to improve occupancy. + +Config files are saved as JSON keyed by chunk_size, e.g.: + lora_shrink,K=1024,R=64,S=3,device=NVIDIA_H100.json + +The server loads these at startup via lora_tuning_config.py. If no tuned +config exists, hardcoded defaults are used. 
+ +Usage: + # Tune from model name (auto-derives hidden_size, QKV dims) + python benchmark/kernels/lora_csgmv/tune_lora_csgmv.py \ + --model Qwen/Qwen3-0.6B --rank 64 + + # Tune with explicit dimensions + python benchmark/kernels/lora_csgmv/tune_lora_csgmv.py \ + --hidden-size 1024 --rank 64 + + # Tune for specific chunk sizes + python benchmark/kernels/lora_csgmv/tune_lora_csgmv.py \ + --model Qwen/Qwen3-0.6B --rank 64 --chunk-sizes 32 64 128 + + # Another model + python benchmark/kernels/lora_csgmv/tune_lora_csgmv.py \ + --model meta-llama/Llama-2-7b-hf --rank 32 +""" + +import argparse +import json +import math +import os +import statistics +from datetime import datetime +from typing import Any, Dict, List, Optional + +import torch +import triton + +from sglang.srt.lora.triton_ops.chunked_sgmv_expand import _chunked_lora_expand_kernel +from sglang.srt.lora.triton_ops.chunked_sgmv_shrink import _chunked_lora_shrink_kernel +from sglang.srt.lora.triton_ops.lora_tuning_config import ( + DEFAULT_EXPAND_CONFIG, + DEFAULT_SHRINK_CONFIG, + get_lora_config_file_name, +) +from sglang.srt.lora.utils import LoRABatchInfo + + +def _get_raw_kernel(cached_kernel): + """Get the underlying triton.jit function, bypassing cached_triton_kernel.""" + return getattr(cached_kernel, "fn", cached_kernel) + + +def build_batch_info( + total_tokens: int, + chunk_size: int, + rank: int, + device: torch.device, +) -> LoRABatchInfo: + """Build a LoRABatchInfo for benchmarking with a single LoRA adapter.""" + num_segments = math.ceil(total_tokens / chunk_size) + + seg_indptr = [] + for i in range(num_segments): + seg_indptr.append(i * chunk_size) + seg_indptr.append(total_tokens) + seg_indptr = torch.tensor(seg_indptr, dtype=torch.int32, device=device) + + weight_indices = torch.ones(num_segments, dtype=torch.int32, device=device) + lora_ranks = torch.tensor([0, rank], dtype=torch.int32, device=device) + scalings = torch.ones(2, dtype=torch.float32, device=device) + permutation = 
torch.arange(total_tokens, dtype=torch.int32, device=device) + + return LoRABatchInfo( + use_cuda_graph=False, + bs=1, + num_segments=num_segments, + max_len=chunk_size, + seg_indptr=seg_indptr, + weight_indices=weight_indices, + lora_ranks=lora_ranks, + scalings=scalings, + seg_lens=None, + permutation=permutation, + ) + + +def timed_cuda_ms(fn, warmup: int = 10, trials: int = 50) -> float: + """Time a GPU function using CUDA events. Returns median time in ms.""" + for _ in range(warmup): + fn() + torch.cuda.synchronize() + + times = [] + for _ in range(trials): + start = torch.cuda.Event(enable_timing=True) + end = torch.cuda.Event(enable_timing=True) + start.record() + fn() + end.record() + torch.cuda.synchronize() + times.append(start.elapsed_time(end)) + return statistics.median(times) + + +# --------------------------------------------------------------------------- +# Search spaces +# --------------------------------------------------------------------------- + + +def get_shrink_search_space() -> List[Dict[str, Any]]: + """Generate candidate configs for the shrink kernel.""" + configs = [] + for block_n in [16, 32, 64]: + for block_k in [64, 128, 256]: + for num_warps in [4, 8]: + for num_stages in [2, 3, 4]: + configs.append( + { + "BLOCK_N": block_n, + "BLOCK_K": block_k, + "num_warps": num_warps, + "num_stages": num_stages, + } + ) + return configs + + +def get_expand_search_space() -> List[Dict[str, Any]]: + """Generate candidate configs for the expand kernel.""" + configs = [] + for block_n in [32, 64]: + for block_k in [16, 32]: + for num_warps in [4, 8]: + for num_stages in [1, 2, 3]: + # Without maxnreg + configs.append( + { + "BLOCK_N": block_n, + "BLOCK_K": block_k, + "num_warps": num_warps, + "num_stages": num_stages, + } + ) + # With maxnreg (register capping for occupancy) + for maxnreg in [96, 112, 128, 160]: + configs.append( + { + "BLOCK_N": block_n, + "BLOCK_K": block_k, + "num_warps": num_warps, + "num_stages": num_stages, + "maxnreg": 
maxnreg, + } + ) + return configs + + +# --------------------------------------------------------------------------- +# Benchmark functions +# --------------------------------------------------------------------------- + + +def benchmark_shrink_config( + config: Dict[str, Any], + x: torch.Tensor, + weights: torch.Tensor, + batch_info: LoRABatchInfo, + num_slices: int, + N: int, + K: int, +) -> Optional[float]: + """Benchmark a single shrink config. Returns median ms or None on failure.""" + kernel = _get_raw_kernel(_chunked_lora_shrink_kernel) + S = x.shape[0] + num_segments = batch_info.num_segments + + grid = (triton.cdiv(N, config["BLOCK_N"]), num_segments) + output = torch.empty((S, N), device=x.device, dtype=x.dtype) + + extra_kwargs = {} + if "num_warps" in config: + extra_kwargs["num_warps"] = config["num_warps"] + if "num_stages" in config: + extra_kwargs["num_stages"] = config["num_stages"] + + try: + kernel[grid]( + x=x, + weights=weights, + output=output, + seg_indptr=batch_info.seg_indptr, + weight_indices=batch_info.weight_indices, + lora_ranks=batch_info.lora_ranks, + permutation=batch_info.permutation, + num_segs=num_segments, + N=N, + K=K, + NUM_SLICES=num_slices, + BLOCK_M=batch_info.max_len, + BLOCK_N=config["BLOCK_N"], + BLOCK_K=config["BLOCK_K"], + **extra_kwargs, + ) + torch.cuda.synchronize() + except Exception: + return None + + def run(): + kernel[grid]( + x=x, + weights=weights, + output=output, + seg_indptr=batch_info.seg_indptr, + weight_indices=batch_info.weight_indices, + lora_ranks=batch_info.lora_ranks, + permutation=batch_info.permutation, + num_segs=num_segments, + N=N, + K=K, + NUM_SLICES=num_slices, + BLOCK_M=batch_info.max_len, + BLOCK_N=config["BLOCK_N"], + BLOCK_K=config["BLOCK_K"], + **extra_kwargs, + ) + + return timed_cuda_ms(run, warmup=10, trials=50) + + +def benchmark_expand_config( + config: Dict[str, Any], + x: torch.Tensor, + weights: torch.Tensor, + batch_info: LoRABatchInfo, + slice_offsets: torch.Tensor, + 
max_slice_size: int, + output_dim: int, + num_slices: int, + max_rank: int, +) -> Optional[float]: + """Benchmark a single expand config. Returns median ms or None on failure.""" + kernel = _get_raw_kernel(_chunked_lora_expand_kernel) + M = x.shape[0] + num_segments = batch_info.num_segments + + grid = ( + triton.cdiv(max_slice_size, config["BLOCK_N"]), + num_slices, + num_segments, + ) + output = torch.zeros((M, output_dim), device=x.device, dtype=x.dtype) + + extra_kwargs = {} + if "num_warps" in config: + extra_kwargs["num_warps"] = config["num_warps"] + if "num_stages" in config: + extra_kwargs["num_stages"] = config["num_stages"] + if "maxnreg" in config: + extra_kwargs["maxnreg"] = config["maxnreg"] + + try: + kernel[grid]( + x=x, + weights=weights, + output=output, + seg_indptr=batch_info.seg_indptr, + weight_indices=batch_info.weight_indices, + lora_ranks=batch_info.lora_ranks, + permutation=batch_info.permutation, + num_segs=num_segments, + scalings=batch_info.scalings, + slice_offsets=slice_offsets, + NUM_SLICES=num_slices, + OUTPUT_DIM=output_dim, + MAX_RANK=max_rank, + BLOCK_M=batch_info.max_len, + BLOCK_N=config["BLOCK_N"], + BLOCK_K=config["BLOCK_K"], + **extra_kwargs, + ) + torch.cuda.synchronize() + except Exception: + return None + + def run(): + output.zero_() + kernel[grid]( + x=x, + weights=weights, + output=output, + seg_indptr=batch_info.seg_indptr, + weight_indices=batch_info.weight_indices, + lora_ranks=batch_info.lora_ranks, + permutation=batch_info.permutation, + num_segs=num_segments, + scalings=batch_info.scalings, + slice_offsets=slice_offsets, + NUM_SLICES=num_slices, + OUTPUT_DIM=output_dim, + MAX_RANK=max_rank, + BLOCK_M=batch_info.max_len, + BLOCK_N=config["BLOCK_N"], + BLOCK_K=config["BLOCK_K"], + **extra_kwargs, + ) + + return timed_cuda_ms(run, warmup=10, trials=50) + + +# --------------------------------------------------------------------------- +# Config saving +# 
--------------------------------------------------------------------------- + + +def save_config( + configs: Dict[int, Dict[str, Any]], + kernel: str, + major_dim: int, + max_rank: int, + num_slices: int, +) -> str: + """Save tuned configs to the standard config directory. Returns filepath. + + Args: + configs: Dict mapping chunk_size -> best block config. + kernel: "shrink" or "expand". + major_dim: The large dimension (input_dim for shrink, output_dim for expand). + max_rank: The max LoRA rank. + num_slices: Number of fused weight slices (qkv=3, gate_up=2, others=1). + """ + filename = get_lora_config_file_name(kernel, major_dim, max_rank, num_slices) + + triton_version = triton.__version__ + version_dir = f"triton_{triton_version.replace('.', '_')}" + config_dir = os.path.join( + os.path.dirname(os.path.realpath(__file__)), + "..", + "..", + "..", + "python", + "sglang", + "srt", + "lora", + "triton_ops", + "csgmv_configs", + version_dir, + ) + config_dir = os.path.normpath(config_dir) + os.makedirs(config_dir, exist_ok=True) + + filepath = os.path.join(config_dir, filename) + with open(filepath, "w") as f: + json.dump(configs, f, indent=4) + f.write("\n") + return filepath + + +def sort_config(config: Dict[str, Any]) -> Dict[str, Any]: + """Sort config keys for consistent JSON output.""" + ordered = {} + for key in ["BLOCK_N", "BLOCK_K", "num_warps", "num_stages", "maxnreg"]: + if key in config: + ordered[key] = config[key] + return ordered + + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- + + +def get_model_dims(args: argparse.Namespace): + """Extract all LoRA layer dimensions from model config or CLI args. + + Returns a list of (label, shrink_K, expand_output_dim, num_slices, + slice_offsets_list) tuples for each LoRA layer type. 
+ """ + if args.model: + from transformers import AutoConfig + + config = AutoConfig.from_pretrained(args.model, trust_remote_code=True) + hidden_size = config.hidden_size + + num_heads = config.num_attention_heads + num_kv_heads = getattr(config, "num_key_value_heads", num_heads) + head_dim = getattr(config, "head_dim", hidden_size // num_heads) + intermediate_size = config.intermediate_size + + q_dim = num_heads * head_dim + kv_dim = num_kv_heads * head_dim + qkv_output_dim = q_dim + 2 * kv_dim + + print(f"Model: {args.model}") + print( + f" hidden_size={hidden_size}, num_heads={num_heads}, " + f"num_kv_heads={num_kv_heads}, head_dim={head_dim}" + ) + print(f" intermediate_size={intermediate_size}") + else: + hidden_size = args.hidden_size + intermediate_size = getattr(args, "intermediate_size", None) or hidden_size * 3 + if args.qkv_output_dim: + qkv_output_dim = args.qkv_output_dim + q_dim = qkv_output_dim // 2 + kv_dim = (qkv_output_dim - q_dim) // 2 + else: + q_dim = hidden_size * 2 + kv_dim = hidden_size + qkv_output_dim = q_dim + 2 * kv_dim + + # All LoRA layer types with their dimensions: + # (label, shrink_K, expand_output_dim, num_slices, slice_offsets) + layers = [ + ( + "qkv", + hidden_size, + qkv_output_dim, + 3, + [0, q_dim, q_dim + kv_dim, qkv_output_dim], + ), + ("o_proj", q_dim, hidden_size, 1, [0, hidden_size]), + ( + "gate_up", + hidden_size, + 2 * intermediate_size, + 2, + [0, intermediate_size, 2 * intermediate_size], + ), + ("down_proj", intermediate_size, hidden_size, 1, [0, hidden_size]), + ] + + print(f"\nLoRA layer dimensions:") + for label, sk, eo, ns, so in layers: + print(f" {label:>10}: shrink K={sk}, expand output_dim={eo}, num_slices={ns}") + + return layers + + +def _tune_shrink( + label: str, + K: int, + N: int, + num_slices: int, + rank: int, + chunk_sizes: List[int], + total_tokens: int, + device: torch.device, +) -> tuple: + """Tune shrink kernel for one layer type. 
Returns (best_configs, results).""" + print(f"\n{'='*80}") + print(f"Tuning SHRINK — {label} (K={K}, N={N}, slices={num_slices})") + print(f"{'='*80}") + + search = get_shrink_search_space() + print(f"Search space: {len(search)} configs") + + best_configs = {} + results = {} + + for chunk_size in chunk_sizes: + batch_info = build_batch_info(total_tokens, chunk_size, rank, device) + x = torch.randn(total_tokens, K, device=device, dtype=torch.float16) + weights = torch.randn(2, N, K, device=device, dtype=torch.float16) + + baseline_time = benchmark_shrink_config( + DEFAULT_SHRINK_CONFIG, + x, + weights, + batch_info, + num_slices, + N, + K, + ) + print(f" chunk={chunk_size}: baseline={baseline_time:.3f}ms") + + best_config = None + best_time = float("inf") + + for i, config in enumerate(search): + t = benchmark_shrink_config( + config, x, weights, batch_info, num_slices, N, K + ) + if t is not None and t < best_time: + best_time = t + best_config = config + if (i + 1) % 20 == 0: + print( + f" chunk={chunk_size}: {i+1}/{len(search)} tested, best={best_time:.3f}ms" + ) + + best_configs[chunk_size] = sort_config(best_config or DEFAULT_SHRINK_CONFIG) + results[chunk_size] = (baseline_time, best_time, best_configs[chunk_size]) + speedup = baseline_time / best_time if best_time > 0 else 0 + print( + f" chunk={chunk_size}: best={best_time:.3f}ms ({speedup:.2f}x), config={best_configs[chunk_size]}" + ) + + return best_configs, results + + +def _tune_expand( + label: str, + output_dim: int, + num_slices: int, + slice_offsets_list: List[int], + max_slice_size: int, + rank: int, + chunk_sizes: List[int], + total_tokens: int, + device: torch.device, +) -> tuple: + """Tune expand kernel for one layer type. 
Returns (best_configs, results).""" + print(f"\n{'='*80}") + print(f"Tuning EXPAND — {label} (output_dim={output_dim}, slices={num_slices})") + print(f"{'='*80}") + + search = get_expand_search_space() + print(f"Search space: {len(search)} configs") + + slice_offsets = torch.tensor(slice_offsets_list, dtype=torch.int64, device=device) + best_configs = {} + results = {} + + for chunk_size in chunk_sizes: + batch_info = build_batch_info(total_tokens, chunk_size, rank, device) + x = torch.randn( + total_tokens, num_slices * rank, device=device, dtype=torch.float16 + ) + weights = torch.randn(2, output_dim, rank, device=device, dtype=torch.float16) + + baseline_time = benchmark_expand_config( + DEFAULT_EXPAND_CONFIG, + x, + weights, + batch_info, + slice_offsets, + max_slice_size, + output_dim, + num_slices, + rank, + ) + print(f" chunk={chunk_size}: baseline={baseline_time:.3f}ms") + + best_config = None + best_time = float("inf") + + for i, config in enumerate(search): + t = benchmark_expand_config( + config, + x, + weights, + batch_info, + slice_offsets, + max_slice_size, + output_dim, + num_slices, + rank, + ) + if t is not None and t < best_time: + best_time = t + best_config = config + if (i + 1) % 50 == 0: + print( + f" chunk={chunk_size}: {i+1}/{len(search)} tested, best={best_time:.3f}ms" + ) + + best_configs[chunk_size] = sort_config(best_config or DEFAULT_EXPAND_CONFIG) + results[chunk_size] = (baseline_time, best_time, best_configs[chunk_size]) + speedup = baseline_time / best_time if best_time > 0 else 0 + print( + f" chunk={chunk_size}: best={best_time:.3f}ms ({speedup:.2f}x), config={best_configs[chunk_size]}" + ) + + return best_configs, results + + +def main(args: argparse.Namespace): + device = torch.device("cuda:0") + rank = args.rank + chunk_sizes = args.chunk_sizes + total_tokens = args.total_tokens + + layers = get_model_dims(args) + + print(f"\nLoRA CSGMV Tuning") + print(f" rank={rank}, total_tokens={total_tokens}, chunk_sizes={chunk_sizes}") + + # Collect all results 
for summary + all_results = [] # (label, kernel, K_or_outdim, results_dict) + + # Deduplicate: multiple layers can share the same (shrink_K, num_slices) or + # (expand_output_dim, num_slices). No need to tune the same config twice. + tuned_shrink = {} # (shrink_K, num_slices) -> best_configs + tuned_expand = {} # (expand_output_dim, num_slices) -> best_configs + + for label, shrink_K, expand_output_dim, num_slices, slice_offsets_list in layers: + # --- Shrink --- + shrink_key = (shrink_K, num_slices) + if shrink_key not in tuned_shrink: + N_shrink = num_slices * rank + best_configs, results = _tune_shrink( + label, + shrink_K, + N_shrink, + num_slices, + rank, + chunk_sizes, + total_tokens, + device, + ) + filepath = save_config(best_configs, "shrink", shrink_K, rank, num_slices) + print(f" Saved to: {filepath}") + tuned_shrink[shrink_key] = best_configs + all_results.append((label, "shrink", shrink_K, results)) + else: + print( + f"\n Skipping shrink {label} (K={shrink_K}, S={num_slices}) — already tuned" + ) + + # --- Expand --- + expand_key = (expand_output_dim, num_slices) + if expand_key not in tuned_expand: + # max_slice_size = largest slice width + slice_widths = [ + slice_offsets_list[i + 1] - slice_offsets_list[i] + for i in range(num_slices) + ] + max_slice_size = max(slice_widths) + + best_configs, results = _tune_expand( + label, + expand_output_dim, + num_slices, + slice_offsets_list, + max_slice_size, + rank, + chunk_sizes, + total_tokens, + device, + ) + filepath = save_config( + best_configs, "expand", expand_output_dim, rank, num_slices + ) + print(f" Saved to: {filepath}") + tuned_expand[expand_key] = best_configs + all_results.append((label, "expand", expand_output_dim, results)) + else: + print( + f"\n Skipping expand {label} (output_dim={expand_output_dim}, S={num_slices}) — already tuned" + ) + + # --- Summary --- + print(f"\n{'='*80}") + print(f"SUMMARY") + print(f"{'='*80}") + print( + f"\n{'layer':<10} {'kernel':<8} {'K/dim':>6} 
{'chunk':>6}" + f" {'baseline':>10} {'tuned':>10} {'speedup':>8} config" + ) + print("-" * 100) + for label, kernel, dim, results in all_results: + for chunk_size in chunk_sizes: + if chunk_size in results: + base, best, cfg = results[chunk_size] + spd = base / best if best > 0 else 0 + print( + f"{label:<10} {kernel:<8} {dim:>6} {chunk_size:>6}" + f" {base:>9.3f}ms {best:>9.3f}ms {spd:>7.2f}x {cfg}" + ) + + now = datetime.now() + print(f"\nTuning completed at {now.ctime()}") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser( + description="Auto-tune LoRA CSGMV kernel block dimensions" + ) + parser.add_argument( + "--model", + type=str, + default=None, + help="HuggingFace model name to auto-derive dimensions " + "(e.g., Qwen/Qwen3-0.6B, meta-llama/Llama-2-7b-hf)", + ) + parser.add_argument( + "--hidden-size", + type=int, + default=None, + help="Model hidden size (e.g., 1024 for Qwen3-0.6B). " + "Required if --model is not specified.", + ) + parser.add_argument( + "--rank", + type=int, + required=True, + help="LoRA rank (e.g., 16, 32, 64)", + ) + parser.add_argument( + "--qkv-output-dim", + type=int, + default=None, + help="QKV output dimension. Only used with --hidden-size. 
" + "Default: 4 * hidden_size", + ) + parser.add_argument( + "--chunk-sizes", + type=int, + nargs="+", + default=[16, 32, 64, 128], + help="Chunk sizes to tune (default: 16 32 64 128)", + ) + parser.add_argument( + "--total-tokens", + type=int, + default=30720, + help="Total tokens for benchmarking (default: 30720 = 2 reqs x 15360)", + ) + args = parser.parse_args() + + if not args.model and not args.hidden_size: + parser.error("Either --model or --hidden-size is required") + + main(args) diff --git a/benchmark/kernels/quantization/bench_fp4_quant.py b/benchmark/kernels/quantization/bench_fp4_quant.py index afc12dd8d3f7..9baedf4077be 100644 --- a/benchmark/kernels/quantization/bench_fp4_quant.py +++ b/benchmark/kernels/quantization/bench_fp4_quant.py @@ -9,6 +9,7 @@ ) from sgl_kernel.elementwise import silu_and_mul +from sglang.benchmark.bench_utils import run_bench from sglang.srt.layers import deep_gemm_wrapper from sglang.srt.layers.moe.ep_moe.kernels import silu_and_mul_masked_post_quant_fwd @@ -75,9 +76,9 @@ def benchmark(M, K, provider): dtype=torch.float32, ) - quantiles = [0.5, 0.2, 0.8] + quantiles = (0.5, 0.2, 0.8) if provider == "triton_fp8": - ms, min_ms, max_ms = triton.testing.do_bench_cudagraph( + ms, min_ms, max_ms = run_bench( lambda: silu_and_mul_masked_post_quant_fwd( x, fp8_out, @@ -89,7 +90,7 @@ def benchmark(M, K, provider): quantiles=quantiles, ) if provider == "cuda_unfused_fp4": - ms, min_ms, max_ms = triton.testing.do_bench_cudagraph( + ms, min_ms, max_ms = run_bench( lambda: scaled_fp4_grouped_quantize( silu_and_mul(x), masks, @@ -98,7 +99,7 @@ def benchmark(M, K, provider): quantiles=quantiles, ) if provider == "cuda_fused_fp4": - ms, min_ms, max_ms = triton.testing.do_bench_cudagraph( + ms, min_ms, max_ms = run_bench( lambda: silu_and_mul_scaled_nvfp4_experts_quantize( x, masks, diff --git a/benchmark/kernels/quantization/bench_int8_quant.py b/benchmark/kernels/quantization/bench_int8_quant.py index 94b795690bfc..d40458ed9e34 100644 --- 
a/benchmark/kernels/quantization/bench_int8_quant.py +++ b/benchmark/kernels/quantization/bench_int8_quant.py @@ -4,6 +4,7 @@ import triton from vllm._custom_ops import scaled_int8_quant as vllm_scaled_int8_quant +from sglang.benchmark.bench_utils import run_bench from sglang.srt.layers.quantization.int8_kernel import per_token_quant_int8 @@ -59,19 +60,19 @@ def benchmark(batch_size, provider): M, K = batch_size, 16384 x = torch.randn(M, K, dtype=torch.float16, device="cuda") * 1000 - quantiles = [0.5, 0.2, 0.8] + quantiles = (0.5, 0.2, 0.8) if provider == "vllm op": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: vllm_scaled_int8_quant(x, symmetric=True), quantiles=quantiles, ) if provider == "triton": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: per_token_quant_int8(x), quantiles=quantiles, ) if provider == "torch.compile": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: torch_int8_quant(x), quantiles=quantiles, ) diff --git a/benchmark/kernels/quantization/tuning_block_wise_kernel.py b/benchmark/kernels/quantization/tuning_block_wise_kernel.py index 0a5e7fb534b9..edd91c3201b9 100644 --- a/benchmark/kernels/quantization/tuning_block_wise_kernel.py +++ b/benchmark/kernels/quantization/tuning_block_wise_kernel.py @@ -16,6 +16,7 @@ import json import multiprocessing as mp import os +import random import time from datetime import datetime from typing import Any, Dict, List @@ -31,7 +32,13 @@ _w8a8_block_fp8_matmul_unrolledx4, ) from sglang.srt.layers.quantization.int8_kernel import _w8a8_block_int8_matmul -from sglang.srt.utils import get_device_core_count, get_device_name, is_hip +from sglang.srt.utils import ( + get_device, + get_device_core_count, + get_device_count, + get_device_name, + is_hip, +) _is_hip = is_hip() @@ -98,12 +105,15 @@ def grid(META): N, config["BLOCK_SIZE_N"] ) + extra_kernel_args = {} if A.dtype == 
torch.float8_e4m3fnuz or A.dtype == torch.float8_e4m3fn: kernel = ( _w8a8_block_fp8_matmul_unrolledx4 if (_is_hip == True and num_workgroups <= get_device_core_count()) else _w8a8_block_fp8_matmul ) + # set masking flag required by kernel arguments + extra_kernel_args["needs_masking"] = needs_masking else: kernel = _w8a8_block_int8_matmul @@ -129,7 +139,7 @@ def grid(META): Bs.stride(1), Bs.stride(0), **config, - needs_masking=needs_masking, + **extra_kernel_args, ) return C @@ -221,18 +231,18 @@ def benchmark_config( def run(): w8a8_block_matmul(A, B, As, Bs, block_size, config, out_dtype) - torch.cuda.synchronize() + torch.get_device_module().synchronize() # JIT compilation & warmup for _ in range(5): run() - torch.cuda.synchronize() + torch.get_device_module().synchronize() - start_event = torch.cuda.Event(enable_timing=True) - end_event = torch.cuda.Event(enable_timing=True) + start_event = torch.get_device_module().Event(enable_timing=True) + end_event = torch.get_device_module().Event(enable_timing=True) latencies: List[float] = [] - for i in range(num_iters): - torch.cuda.synchronize() + for _ in range(num_iters): + torch.get_device_module().synchronize() start_event.record() run() end_event.record() @@ -244,6 +254,7 @@ def run(): def tune(M, N, K, block_size, out_dtype, search_space, input_type): factor_for_scale = 1e-2 + device = get_device() if input_type == "fp8": fp8_info = torch.finfo( @@ -252,14 +263,14 @@ def tune(M, N, K, block_size, out_dtype, search_space, input_type): fp8_max, fp8_min = fp8_info.max, fp8_info.min A_fp32 = ( - (torch.rand(M, K, dtype=torch.float32, device="cuda") - 0.5) * 2 * fp8_max + (torch.rand(M, K, dtype=torch.float32, device=device) - 0.5) * 2 * fp8_max ) A = A_fp32.clamp(min=fp8_min, max=fp8_max).to( torch.float8_e4m3fnuz if _is_hip else torch.float8_e4m3fn ) B_fp32 = ( - (torch.rand(N, K, dtype=torch.float32, device="cuda") - 0.5) * 2 * fp8_max + (torch.rand(N, K, dtype=torch.float32, device=device) - 0.5) * 2 * fp8_max ) 
B = B_fp32.clamp(min=fp8_min, max=fp8_max).to( torch.float8_e4m3fnuz if _is_hip else torch.float8_e4m3fn @@ -269,12 +280,12 @@ def tune(M, N, K, block_size, out_dtype, search_space, input_type): int8_max, int8_min = int8_info.max, int8_info.min A_fp32 = ( - (torch.rand(M, K, dtype=torch.float32, device="cuda") - 0.5) * 2 * int8_max + (torch.rand(M, K, dtype=torch.float32, device=device) - 0.5) * 2 * int8_max ) A = A_fp32.clamp(min=int8_min, max=int8_max).to(torch.int8) B_fp32 = ( - (torch.rand(N, K, dtype=torch.float32, device="cuda") - 0.5) * 2 * int8_max + (torch.rand(N, K, dtype=torch.float32, device=device) - 0.5) * 2 * int8_max ) B = B_fp32.clamp(min=int8_min, max=int8_max).to(torch.int8) @@ -282,9 +293,9 @@ def tune(M, N, K, block_size, out_dtype, search_space, input_type): n_tiles = (N + block_n - 1) // block_n k_tiles = (K + block_k - 1) // block_k - As = torch.rand(M, k_tiles, dtype=torch.float32, device="cuda") * factor_for_scale + As = torch.rand(M, k_tiles, dtype=torch.float32, device=device) * factor_for_scale Bs = ( - torch.rand(n_tiles, k_tiles, dtype=torch.float32, device="cuda") + torch.rand(n_tiles, k_tiles, dtype=torch.float32, device=device) * factor_for_scale ) @@ -323,6 +334,7 @@ def save_configs( configs, save_path, input_type="fp8", + lock=None, ) -> None: os.makedirs(save_path, exist_ok=True) device_name = get_device_name().replace(" ", "_") @@ -331,14 +343,24 @@ def save_configs( config_file_path = os.path.join(save_path, json_file_name) print(f"Writing best config to {config_file_path}...") - with open(config_file_path, "w") as f: - json.dump(configs, f, indent=4) - f.write("\n") + if lock is not None: + lock.acquire() + try: + existing_configs = {} + if os.path.exists(config_file_path): + with open(config_file_path, "r") as f: + existing_configs = json.load(f) + existing_configs = {int(k): v for k, v in existing_configs.items()} + existing_configs.update(configs) + existing_configs = dict(sorted(existing_configs.items())) -def 
get_available_gpu_count(): - """Get the number of available GPUs.""" - return torch.cuda.device_count() + with open(config_file_path, "w") as f: + json.dump(existing_configs, f, indent=4) + f.write("\n") + finally: + if lock is not None: + lock.release() def tune_on_gpu(args_dict): @@ -347,8 +369,9 @@ def tune_on_gpu(args_dict): batch_sizes = args_dict["batch_sizes"] weight_shapes = args_dict["weight_shapes"] args = args_dict["args"] + lock = args_dict["lock"] - torch.cuda.set_device(gpu_id) + torch.get_device_module().set_device(gpu_id) print(f"Starting tuning on GPU {gpu_id} with batch sizes {batch_sizes}") block_n = args.block_n @@ -363,7 +386,6 @@ def tune_on_gpu(args_dict): ] start = time.perf_counter() - results = {} for shape in tqdm(weight_shapes, desc=f"GPU {gpu_id} - Shapes"): N, K = shape[0], shape[1] print(f"[GPU {gpu_id}] Tune for weight shape of `N: {N}, K: {K}`") @@ -380,7 +402,7 @@ def tune_on_gpu(args_dict): for batch_size in tqdm(batch_sizes, desc=f"GPU {gpu_id} - Batch sizes") ] best_configs = {M: config for M, config in zip(batch_sizes, benchmark_results)} - save_configs(N, K, block_n, block_k, best_configs, save_path, input_type) + save_configs(N, K, block_n, block_k, best_configs, save_path, input_type, lock) end = time.perf_counter() print(f"Tuning on GPU {gpu_id} took {end - start:.2f} seconds") @@ -388,6 +410,8 @@ def tune_on_gpu(args_dict): def distribute_batch_sizes(batch_sizes, num_gpus): """Distribute batch sizes across available GPUs.""" + # shuffle to distribute workload more evenly and minimize bottleneck effects + random.shuffle(batch_sizes) batches_per_gpu = [] for i in range(num_gpus): start_idx = i * len(batch_sizes) // num_gpus @@ -399,14 +423,14 @@ def distribute_batch_sizes(batch_sizes, num_gpus): def main(args): print(args) - num_gpus = get_available_gpu_count() + num_gpus = get_device_count() if num_gpus == 0: raise RuntimeError("No GPU available for tuning") print(f"Found {num_gpus} GPUs for parallel tuning") - 
torch.cuda.init() + torch.get_device_module().init() - if args.batch_size is None: + if args.batch_sizes is None: batch_sizes = [ 1, 2, @@ -428,8 +452,7 @@ def main(args): 4096, ] else: - batch_sizes = [args.batch_size] - num_gpus = 1 # If only one batch size, use only one GPU + batch_sizes = args.batch_sizes # Support manual N and K specification if args.N is not None and args.K is not None: @@ -441,6 +464,10 @@ def main(args): batches_per_gpu = distribute_batch_sizes(batch_sizes, num_gpus) + ctx = mp.get_context("spawn") + manager = ctx.Manager() + lock = manager.Lock() + process_args = [] for gpu_id in range(num_gpus): process_args.append( @@ -449,10 +476,10 @@ def main(args): "batch_sizes": batches_per_gpu[gpu_id], "weight_shapes": weight_shapes, # Each GPU processes all weight shapes "args": args, + "lock": lock, } ) - ctx = mp.get_context("spawn") with ctx.Pool(num_gpus) as pool: pool.map(tune_on_gpu, process_args) @@ -492,7 +519,7 @@ def main(args): ) parser.add_argument("--block-n", type=int, default=128) parser.add_argument("--block-k", type=int, default=128) - parser.add_argument("--batch-size", type=int, required=False) + parser.add_argument("--batch-sizes", nargs="+", type=int, required=False) parser.add_argument( "--save-path", type=str, default="python/sglang/srt/layers/quantization/configs" ) diff --git a/benchmark/kernels/scheduler_batch/benchmark_get_last_loc_triton.py b/benchmark/kernels/scheduler_batch/benchmark_get_last_loc_triton.py index 3e17205e73a8..911cdb8278b1 100644 --- a/benchmark/kernels/scheduler_batch/benchmark_get_last_loc_triton.py +++ b/benchmark/kernels/scheduler_batch/benchmark_get_last_loc_triton.py @@ -4,6 +4,8 @@ import triton import triton.language as tl +from sglang.benchmark.bench_utils import run_bench + @torch.compile(dynamic=True) def get_last_loc_torch( @@ -124,14 +126,14 @@ def benchmark(batch_size, provider): quantiles = [0.5, 0.2, 0.8] if provider == "reference": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, 
min_ms, max_ms = run_bench( lambda: get_last_loc_torch(req_to_token, req_pool_indices, pre_lens), - quantiles=quantiles, + quantiles=tuple(quantiles), ) elif provider == "triton": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: get_last_loc_triton(req_to_token, req_pool_indices, pre_lens), - quantiles=quantiles, + quantiles=tuple(quantiles), ) return 1000 * ms, 1000 * max_ms, 1000 * min_ms diff --git a/benchmark/kernels/scheduler_batch/benchmark_write_req_to_token_pool_triton.py b/benchmark/kernels/scheduler_batch/benchmark_write_req_to_token_pool_triton.py index 1ce43c8bacfd..561ff88ee301 100644 --- a/benchmark/kernels/scheduler_batch/benchmark_write_req_to_token_pool_triton.py +++ b/benchmark/kernels/scheduler_batch/benchmark_write_req_to_token_pool_triton.py @@ -5,6 +5,8 @@ import triton import triton.language as tl +from sglang.benchmark.bench_utils import run_bench + @triton.jit def write_req_to_token_pool_triton( @@ -263,7 +265,7 @@ def benchmark(batch_size, extend_len, provider): quantiles = [0.5, 0.2, 0.8] if provider == "reference": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: write_req_to_token_pool_reference( req_to_token.clone(), req_pool_indices, @@ -272,10 +274,10 @@ def benchmark(batch_size, extend_len, provider): extend_lens, out_cache_loc, ), - quantiles=quantiles, + quantiles=tuple(quantiles), ) elif provider == "triton": - ms, min_ms, max_ms = triton.testing.do_bench( + ms, min_ms, max_ms = run_bench( lambda: write_req_to_token_pool_triton[(batch_size,)]( req_to_token.clone(), req_pool_indices, @@ -285,7 +287,7 @@ def benchmark(batch_size, extend_len, provider): out_cache_loc, max_context_len, ), - quantiles=quantiles, + quantiles=tuple(quantiles), ) else: @@ -303,9 +305,7 @@ def run_optimized(): BLOCK_SIZE=block_size, ) - ms, min_ms, max_ms = triton.testing.do_bench( - run_optimized, quantiles=quantiles - ) + ms, min_ms, max_ms = run_bench(run_optimized, 
quantiles=tuple(quantiles)) return 1000 * ms, 1000 * max_ms, 1000 * min_ms diff --git a/benchmark/kernels/sliding_window_attention_triton/bench_triton_swa_kernel.py b/benchmark/kernels/sliding_window_attention_triton/bench_triton_swa_kernel.py index 98144d47043a..9fd42fb12a80 100644 --- a/benchmark/kernels/sliding_window_attention_triton/bench_triton_swa_kernel.py +++ b/benchmark/kernels/sliding_window_attention_triton/bench_triton_swa_kernel.py @@ -4,6 +4,7 @@ import torch.nn.functional as F import triton.testing as tt +from sglang.benchmark.bench_utils import run_bench from sglang.srt.layers.attention.triton_ops.extend_attention import extend_attention_fwd @@ -270,9 +271,19 @@ def bench( raise AssertionError("Mismatch between triton and torch reference.") if provider == "triton": - ms = tt.do_bench(lambda: _run_triton(inputs), warmup=warmup, rep=rep) + ms = run_bench( + lambda: _run_triton(inputs), + quantiles=None, + warmup_ms=warmup, + rep_ms=rep, + )[0] elif provider == "torch": - ms = tt.do_bench(lambda: _run_torch_ref(inputs), warmup=warmup, rep=rep) + ms = run_bench( + lambda: _run_torch_ref(inputs), + quantiles=None, + warmup_ms=warmup, + rep_ms=rep, + )[0] else: raise ValueError(provider) diff --git a/benchmark/lora/lora_bench.py b/benchmark/lora/lora_bench.py index 4f380c705122..7d3397c0ef75 100644 --- a/benchmark/lora/lora_bench.py +++ b/benchmark/lora/lora_bench.py @@ -35,10 +35,9 @@ _create_bench_client_session, calculate_metrics, get_request, - get_tokenizer, - remove_prefix, - sample_random_requests, ) +from sglang.benchmark.datasets.random import sample_random_requests +from sglang.benchmark.utils import get_tokenizer, remove_prefix global args diff --git a/benchmark/mmlu/bench_hf.py b/benchmark/mmlu/bench_hf.py new file mode 100644 index 000000000000..c76a18db685b --- /dev/null +++ b/benchmark/mmlu/bench_hf.py @@ -0,0 +1,151 @@ +""" +Usage: +python3 bench_hf.py --model-path meta-llama/Llama-2-7b-hf --data-dir data --ntrain 5 +""" + +import 
argparse +import json +import os +import time + +import numpy as np +import pandas as pd +import torch +from tqdm import tqdm +from transformers import AutoModelForCausalLM, AutoTokenizer + +choices = ["A", "B", "C", "D"] + + +def format_subject(subject): + l = subject.split("_") + s = "" + for entry in l: + s += " " + entry + return s + + +def format_example(df, idx, include_answer=True): + prompt = df.iloc[idx, 0] + k = df.shape[1] - 2 + for j in range(k): + prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1]) + prompt += "\nAnswer:" + if include_answer: + prompt += " {}\n\n".format(df.iloc[idx, k + 1]) + return prompt + + +def gen_prompt(train_df, subject, k=-1): + prompt = "The following are multiple choice questions (with answers) about{}.\n\n".format( + format_subject(subject) + ) + if k == -1: + k = train_df.shape[0] + for i in range(k): + prompt += format_example(train_df, i) + return prompt + + +@torch.no_grad() +def main(args): + print(f"Loading model: {args.model_path}") + tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True) + model = AutoModelForCausalLM.from_pretrained( + args.model_path, + torch_dtype=torch.bfloat16, + trust_remote_code=True, + device_map="auto", + ).eval() + + subjects = sorted( + [ + f.split("_test.csv")[0] + for f in os.listdir(os.path.join(args.data_dir, "test")) + if "_test.csv" in f + ] + ) + + all_cors = [] + num_requests = 0 + total_latency = 0 + + for subject in tqdm(subjects[: args.nsub]): + dev_df = pd.read_csv( + os.path.join(args.data_dir, "dev", subject + "_dev.csv"), header=None + )[: args.ntrain] + test_df = pd.read_csv( + os.path.join(args.data_dir, "test", subject + "_test.csv"), header=None + ) + + k = args.ntrain + few_shot_examples = gen_prompt(dev_df, subject, k) + while len(tokenizer.encode(few_shot_examples)) > 1536: + k -= 1 + if k < 0: + break + few_shot_examples = gen_prompt(dev_df, subject, k) + + preds = [] + labels = [] + tic = time.perf_counter() + + for i in 
range(test_df.shape[0]): + prompt_end = format_example(test_df, i, include_answer=False) + prompt = few_shot_examples + prompt_end + + input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device) + output_ids = model.generate( + input_ids, + max_new_tokens=1, + do_sample=False, + pad_token_id=tokenizer.eos_token_id, + ) + + output_str = tokenizer.decode( + output_ids[0][input_ids.shape[-1] :], skip_special_tokens=True + ) + preds.append(output_str.strip()[0] if len(output_str.strip()) > 0 else "") + labels.append(test_df.iloc[i, test_df.shape[1] - 1]) + + latency = time.perf_counter() - tic + total_latency += latency + + cors = [pred == label for pred, label in zip(preds, labels)] + all_cors.append(cors) + num_requests += len(test_df) + + print( + f"Subject: {subject}, Accuracy: {np.mean(cors):.3f}, Latency: {latency:.3f}s" + ) + + weighted_acc = np.mean(np.concatenate(all_cors)) + print(f"Total Latency: {total_latency:.3f}s") + print(f"Average Accuracy: {weighted_acc:.3f}") + + if args.output: + with open(args.output, "a") as fout: + value = { + "task": "mmlu", + "backend": "hf", + "model": args.model_path, + "latency": round(total_latency, 3), + "accuracy": round(weighted_acc, 3), + "num_requests": num_requests, + "other": { + "nsub": args.nsub, + "ntrain": args.ntrain, + }, + } + fout.write(json.dumps(value) + "\n") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--model-path", type=str, required=True) + parser.add_argument("--ntrain", type=int, default=5) + parser.add_argument("--data-dir", type=str, default="data") + parser.add_argument("--nsub", type=int, default=60) + parser.add_argument("--output", type=str, help="Output file path") + args = parser.parse_args() + main(args) diff --git a/benchmark/mmlu/bench_sglang.py b/benchmark/mmlu/bench_sglang.py index 23057be4aed8..9a2006e3d2c1 100644 --- a/benchmark/mmlu/bench_sglang.py +++ b/benchmark/mmlu/bench_sglang.py @@ -1,6 +1,8 @@ import argparse import 
json import os +import subprocess +import tarfile import time import numpy as np @@ -13,6 +15,8 @@ select_sglang_backend, ) +SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__)) + choices = ["A", "B", "C", "D"] tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo") @@ -48,6 +52,28 @@ def gen_prompt(train_df, subject, k=-1): return prompt +def download_data(data_dir): + """Download and extract MMLU data if it doesn't exist.""" + if os.path.isdir(os.path.join(data_dir, "test")): + return + print(f"Data not found at {data_dir}. Downloading...") + os.makedirs(data_dir, exist_ok=True) + tar_path = os.path.join(data_dir, "data.tar") + subprocess.check_call( + ["wget", "-O", tar_path, "https://people.eecs.berkeley.edu/~hendrycks/data.tar"] + ) + with tarfile.open(tar_path) as tar: + tar.extractall(path=data_dir, filter="data") + # The tarball extracts into a "data/" subdirectory; move contents up if needed + nested = os.path.join(data_dir, "data") + if os.path.isdir(nested): + for item in os.listdir(nested): + os.rename(os.path.join(nested, item), os.path.join(data_dir, item)) + os.rmdir(nested) + os.remove(tar_path) + print("Download complete.") + + def main(args): subjects = sorted( [ @@ -174,8 +200,11 @@ def few_shot_mmlu(s, examples, question): if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--ntrain", "-k", type=int, default=5) - parser.add_argument("--data_dir", "-d", type=str, default="data") + parser.add_argument( + "--data_dir", "-d", type=str, default=os.path.join(SCRIPT_DIR, "data") + ) parser.add_argument("--save_dir", "-s", type=str, default="results") parser.add_argument("--nsub", type=int, default=60) args = add_common_sglang_args_and_parse(parser) + download_data(args.data_dir) main(args) diff --git a/benchmark/mmmu/bench_hf.py b/benchmark/mmmu/bench_hf.py index c841f44466d7..62418d6bb5a2 100644 --- a/benchmark/mmmu/bench_hf.py +++ b/benchmark/mmmu/bench_hf.py @@ -70,6 +70,10 @@ def eval_mmmu(args): ) samples = 
prepare_samples(eval_args) + if getattr(args, "limit", None): + total = len(samples) + samples = samples[: args.limit] + print(f"--limit {args.limit}: keeping {len(samples)} of {total} samples") out_samples = dict() answer_dict = {} @@ -95,7 +99,7 @@ def eval_mmmu(args): response = model.chat( tokenizer, pixel_values, contents, generation_config_internvl ) - print(f"response: {response}") + sample["original_response"] = response process_result(response, sample, answer_dict, out_samples) continue @@ -143,7 +147,7 @@ def eval_mmmu(args): generate_audio=False, temperature=0.0, ) - print(f"response: {response}") + sample["original_response"] = response process_result(response, sample, answer_dict, out_samples) args.output_path = f"{args.model_path}_answer_hf.json" @@ -163,6 +167,12 @@ def eval_mmmu(args): help="The path of the model weights. This can be a local folder or a Hugging Face repo ID.", required=True, ) + parser.add_argument( + "--limit", + type=int, + default=None, + help="If set, only evaluate this many samples (debug / smoke runs).", + ) EvalArgs.add_cli_args(parser) args = parser.parse_args() diff --git a/benchmark/mmmu/bench_sglang.py b/benchmark/mmmu/bench_sglang.py index d9426ae5a3ac..0a28c7fc270c 100644 --- a/benchmark/mmmu/bench_sglang.py +++ b/benchmark/mmmu/bench_sglang.py @@ -11,11 +11,14 @@ import argparse import asyncio +import base64 +import mimetypes import re import sys import time import traceback from dataclasses import dataclass, field +from pathlib import Path from typing import Any, List, Optional, Tuple import aiohttp @@ -74,7 +77,12 @@ def _get_prefix_suffix(prompt: str) -> Tuple[str, str]: async def process_sample( - client: Any, sample: dict, sampling_params: dict, lora_path: Optional[str] = None + client: Any, + sample: dict, + sampling_params: dict, + model: str, + reasoning_effort: Optional[str] = None, + lora_path: Optional[str] = None, ) -> Tuple[dict, str]: """Send a single sample to the LLM and return (sample, response).""" 
prompt = sample["final_input_prompt"] @@ -82,25 +90,38 @@ async def process_sample( image = sample["image"] assert image is not None image_path = sample["image_path"] - extra_body = None if lora_path is None else {"lora_path": lora_path} + if image_path and not image_path.startswith(("http://", "https://", "data:")): + p = Path(image_path) + mime = mimetypes.guess_type(str(p))[0] or "image/png" + with open(p, "rb") as f: + b64 = base64.b64encode(f.read()).decode() + image_url = f"data:{mime};base64,{b64}" + else: + image_url = image_path + extra_body = {"lora_path": lora_path} if lora_path else None payload = { - "model": "default", + "model": model, "messages": [ { "role": "user", "content": [ {"type": "text", "text": prefix}, - {"type": "image_url", "image_url": {"url": image_path}}, + {"type": "image_url", "image_url": {"url": image_url}}, {"type": "text", "text": suffix}, ], } ], "extra_body": extra_body, + **sampling_params, } - if sampling_params: - payload.update(sampling_params) + if reasoning_effort: + payload["reasoning_effort"] = reasoning_effort response = await client.chat.completions.create(**payload) - return sample, response.choices[0].message.content + msg = response.choices[0].message + content = msg.content + if content is None: + content = getattr(msg, "reasoning_content", None) + return sample, content async def process_sample_with_semaphore( @@ -108,11 +129,15 @@ async def process_sample_with_semaphore( client: Any, sample: dict, sampling_params: dict, + model: str, + reasoning_effort: Optional[str] = None, lora_path: Optional[str] = None, ) -> Tuple[dict, str]: """Wrap process_sample with a semaphore for concurrency control.""" async with semaphore: - return await process_sample(client, sample, sampling_params, lora_path) + return await process_sample( + client, sample, sampling_params, model, reasoning_effort, lora_path + ) async def eval_mmmu(args) -> None: @@ -120,6 +145,8 @@ async def eval_mmmu(args) -> None: eval_args = 
EvalArgs.from_cli_args(args) sampling_params = get_sampling_params(eval_args) samples = prepare_samples(eval_args) + model = args.model + reasoning_effort = eval_args.reasoning_effort lora_path = eval_args.lora_path answer_dict = {} out_samples = {} @@ -146,7 +173,7 @@ async def eval_mmmu(args) -> None: # this is mainly for profiling for sample in tqdm(samples): _, response = await process_sample( - client, sample, sampling_params, lora_path + client, sample, sampling_params, model, reasoning_effort, lora_path ) sample["original_response"] = response answer = ( @@ -164,7 +191,13 @@ async def eval_mmmu(args) -> None: semaphore = asyncio.Semaphore(args.concurrency) tasks = [ process_sample_with_semaphore( - semaphore, client, sample, sampling_params, lora_path + semaphore, + client, + sample, + sampling_params, + model, + reasoning_effort, + lora_path, ) for sample in samples ] @@ -202,6 +235,12 @@ async def eval_mmmu(args) -> None: def parse_args(): parser = argparse.ArgumentParser() + parser.add_argument( + "--model", + type=str, + default="default", + help="Model name to use in API requests.", + ) EvalArgs.add_cli_args(parser) args = add_common_sglang_args_and_parse(parser) return args diff --git a/benchmark/mmmu/eval_utils.py b/benchmark/mmmu/eval_utils.py index b3edd69fc1ce..33a5925511ba 100644 --- a/benchmark/mmmu/eval_utils.py +++ b/benchmark/mmmu/eval_utils.py @@ -38,8 +38,9 @@ class EvalArgs: concurrency: int = 1 max_new_tokens: Optional[int] = None temperature: Optional[float] = None - response_answer_regex: str = "(.*)" + response_answer_regex: str = "(?s)(.*)" lora_path: Optional[str] = None + reasoning_effort: Optional[str] = None @staticmethod def add_cli_args(parser: argparse.ArgumentParser): @@ -120,6 +121,13 @@ def add_cli_args(parser: argparse.ArgumentParser): default=EvalArgs.lora_path, help="Specify the LoRA path to use for evaluation. 
If specified, the value will be specified in the body of every request as `lora-path`.", ) + parser.add_argument( + "--reasoning-effort", + type=str, + default=EvalArgs.reasoning_effort, + choices=["none", "high"], + help="Reasoning effort for the model (none or high).", + ) @classmethod def from_cli_args(cls, args: argparse.Namespace): @@ -265,11 +273,42 @@ def get_sampling_params(eval_args): # ----------- Process Multi-choice ------------- +# Patterns that explicitly commit to a single letter as the final answer. +# Each captures the letter in group(1). Matching uses ``re.IGNORECASE`` and +# all matches are collected across patterns; the one with the latest offset +# wins. +_EXPLICIT_ANSWER_PATTERNS = ( + # "answer: X" / "Final answer: X" (with optional bold/parens) + r"\banswer\s*:\s*\*{0,2}\s*\(?([A-Z])\)?\s*\*{0,2}(?![A-Za-z])", + # bare "X" / "(X)" on its own line at the end of the response + r"(?:^|\n)\s*\*{0,2}\s*\(?([A-Z])\)?\s*\*{0,2}\s*\.?\s*$", + # "\boxed{X}" (LaTeX boxed answer, common in math/CoT outputs) + r"\\boxed\{\s*\*{0,2}\s*\(?([A-Z])\)?\s*\*{0,2}\s*\}", + # "(the) answer is X" / "(the) correct answer is X" + r"\b(?:the\s+)?answer\s+is\s*\*{0,2}\s*\(?([A-Z])\)?\s*\*{0,2}(?![A-Za-z])", +) + + +def _parse_explicit_multi_choice_answer(response, all_choices): + choice_map = {choice.upper(): choice for choice in all_choices} + matches = [] + for pattern in _EXPLICIT_ANSWER_PATTERNS: + for match in re.finditer(pattern, response, flags=re.IGNORECASE): + candidate = match.group(1).upper() + if candidate in choice_map: + matches.append((match.start(1), choice_map[candidate])) + return max(matches)[1] if matches else None + + def parse_multi_choice_response(response, all_choices, index2ans): """ Parse the prediction from the generated response. Return the predicted index e.g., A, B, C, D. 
""" + explicit_answer = _parse_explicit_multi_choice_answer(response, all_choices) + if explicit_answer is not None: + return explicit_answer + for char in [",", ".", "!", "?", ";", ":", "'"]: response = response.strip(char) response = " " + response + " " # add space to avoid partial match diff --git a/benchmark/tip_suggestion/bench_other.py b/benchmark/tip_suggestion/bench_other.py index 2630081bd620..6e3d098fe5e7 100644 --- a/benchmark/tip_suggestion/bench_other.py +++ b/benchmark/tip_suggestion/bench_other.py @@ -13,8 +13,7 @@ def expand_tip(topic, tip, generate): - s = ( - """Please expand a tip for a topic into a detailed paragraph. + s = """Please expand a tip for a topic into a detailed paragraph. Topic: staying healthy Tip: Regular Exercise @@ -28,12 +27,7 @@ def expand_tip(topic, tip, generate): Tip: structure your content effectively Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement. -Topic: """ - + topic - + "\nTip: " - + tip - + "\nParagraph:" - ) +Topic: """ + topic + "\nTip: " + tip + "\nParagraph:" return generate(s, max_tokens=128, stop=["\n\n"]) diff --git a/benchmark/tip_suggestion/bench_sglang.py b/benchmark/tip_suggestion/bench_sglang.py index 86c476f97fbf..ef78dce6985c 100644 --- a/benchmark/tip_suggestion/bench_sglang.py +++ b/benchmark/tip_suggestion/bench_sglang.py @@ -14,8 +14,7 @@ @sgl.function def expand_tip(s, topic, tip): - s += ( - """Please expand a tip for a topic into a detailed paragraph. + s += """Please expand a tip for a topic into a detailed paragraph. 
Topic: staying healthy Tip: Regular Exercise @@ -29,12 +28,7 @@ def expand_tip(s, topic, tip): Tip: structure your content effectively Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement. -Topic: """ - + topic - + "\nTip: " - + tip - + "\nParagraph:" - ) +Topic: """ + topic + "\nTip: " + tip + "\nParagraph:" s += sgl.gen("paragraph", max_tokens=128, stop=["\n\n"], temperature=0) diff --git a/benchmark/tip_suggestion/lmql_funcs.py b/benchmark/tip_suggestion/lmql_funcs.py index 7790bbe950d2..1d4c97e38c57 100644 --- a/benchmark/tip_suggestion/lmql_funcs.py +++ b/benchmark/tip_suggestion/lmql_funcs.py @@ -2,8 +2,7 @@ async def expand_tip_async(topic, tip, generate): - s = ( - """Please expand a tip for a topic into a detailed paragraph. + s = """Please expand a tip for a topic into a detailed paragraph. Topic: staying healthy Tip: Regular Exercise @@ -17,12 +16,7 @@ async def expand_tip_async(topic, tip, generate): Tip: structure your content effectively Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement. 
-Topic: """ - + topic - + "\nTip: " - + tip - + "\nParagraph:" - ) +Topic: """ + topic + "\nTip: " + tip + "\nParagraph:" return await generate(s, max_tokens=128, stop="\n\n") diff --git a/docker/Dockerfile b/docker/Dockerfile index 366efa327e45..2e57ed442e20 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -1,4 +1,4 @@ -ARG CUDA_VERSION=12.9.1 +ARG CUDA_VERSION=13.0.1 FROM nvidia/cuda:${CUDA_VERSION}-cudnn-devel-ubuntu24.04 AS base ARG TARGETARCH @@ -11,15 +11,19 @@ ARG GRACE_BLACKWELL_DEEPEP_BRANCH=gb200_blog_part_2 ARG HOPPER_SBO_DEEPEP_COMMIT=9f2fc4b3182a51044ae7ecb6610f7c9c3258c4d6 ARG DEEPEP_COMMIT=9af0e0d0e74f3577af1979c9b9e1ac2cad0104ee ARG BUILD_AND_DOWNLOAD_PARALLEL=8 -ARG SGL_KERNEL_VERSION=0.3.21 +ARG SGL_KERNEL_VERSION=0.4.2.post1 ARG SGL_VERSION +ARG SGL_DEEP_GEMM_VERSION=0.0.1 ARG USE_LATEST_SGLANG=0 ARG GDRCOPY_VERSION=2.5.1 ARG PIP_DEFAULT_INDEX ARG UBUNTU_MIRROR ARG GITHUB_ARTIFACTORY=github.com ARG INSTALL_FLASHINFER_JIT_CACHE=0 -ARG FLASHINFER_VERSION=0.6.2 +ARG FLASHINFER_VERSION=0.6.8.post1 +ARG MOONCAKE_VERSION=0.3.10.post2 +#if need other arg please add in MOONCAKE_COMPILE_ARG +ARG MOONCAKE_COMPILE_ARG="-DUSE_HTTP=ON -DUSE_MNNVL=ON -DUSE_CUDA=ON -DWITH_EP=ON" ENV DEBIAN_FRONTEND=noninteractive \ CUDA_HOME=/usr/local/cuda \ @@ -37,11 +41,11 @@ RUN if [ -n "$UBUNTU_MIRROR" ]; then \ fi # Python setup (combined with apt update to reduce layers) +# Ubuntu 24.04 ships Python 3.12 in main, so we no longer need the deadsnakes +# PPA. Dropping it avoids transient Launchpad 504s in `add-apt-repository`. 
RUN --mount=type=cache,target=/var/cache/apt,id=base-apt \ apt update && apt install -y --no-install-recommends wget software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa -y \ - && apt install -y --no-install-recommends python3.12-full python3.12-dev python3.10-venv \ - && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 \ + && apt install -y --no-install-recommends python3.12-full python3.12-dev \ && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 2 \ && update-alternatives --set python3 /usr/bin/python3.12 \ && wget -q https://bootstrap.pypa.io/get-pip.py \ @@ -51,15 +55,12 @@ RUN --mount=type=cache,target=/var/cache/apt,id=base-apt \ && python3 -m pip config set global.break-system-packages true \ # Fix for apt-add-repository && cd /usr/lib/python3/dist-packages/ \ - && ln -s apt_pkg.cpython-310-*-linux-gnu.so apt_pkg.so + && ln -s apt_pkg.cpython-312-*-linux-gnu.so apt_pkg.so # Install system dependencies (organized by category for better caching) RUN --mount=type=cache,target=/var/cache/apt,id=base-apt \ - echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ - && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \ - && apt-get update && apt-get install -y --no-install-recommends \ + apt-get update && apt-get install -y --no-install-recommends \ # Core system utilities - tzdata \ ca-certificates \ software-properties-common \ netcat-openbsd \ @@ -114,6 +115,7 @@ RUN --mount=type=cache,target=/var/cache/apt,id=base-apt \ libczmq4 \ libczmq-dev \ libfabric-dev \ + linux-libc-dev \ # Package building tools devscripts \ debhelper \ @@ -151,48 +153,45 @@ ENV LANG=en_US.UTF-8 \ LC_ALL=en_US.UTF-8 ######################################################## -########## Framework Development Image ################ +########## PARALLEL BUILDER STAGES #################### ######################################################## +# +# These stages run IN 
PARALLEL via BuildKit: +# +# base +# | +# +-- torch_deps ------> deepep_builder (needs torch) +# | \-> flashinfer_cache (needs flashinfer) +# | +# +-- devtools_builder (independent) +# +-- gateway_builder (independent, only needs gateway source) +# | +# v +# framework (combines all artifacts) +# -# Copy local source if building from local -FROM scratch AS local_src -COPY . /src - -FROM base AS framework +######################################################## +# PARALLEL STAGE 1: Torch/Deps Builder (starts from base) +######################################################## +FROM base AS torch_deps -ARG BRANCH_TYPE -ARG BUILD_TYPE ARG CUDA_VERSION -ARG BUILD_AND_DOWNLOAD_PARALLEL +ARG BUILD_TYPE ARG SGL_KERNEL_VERSION -ARG SGL_VERSION -ARG USE_LATEST_SGLANG -ARG INSTALL_FLASHINFER_JIT_CACHE -ARG FLASHINFER_VERSION -ARG GRACE_BLACKWELL -ARG GRACE_BLACKWELL_DEEPEP_BRANCH -ARG DEEPEP_COMMIT -ARG TRITON_LANG_COMMIT ARG GITHUB_ARTIFACTORY WORKDIR /sgl-workspace -# Install SGLang -COPY --from=local_src /src /tmp/local_src -RUN if [ "$BRANCH_TYPE" = "local" ]; then \ - cp -r /tmp/local_src /sgl-workspace/sglang; \ - elif [ "$USE_LATEST_SGLANG" = "1" ]; then \ - git clone --depth=1 https://github.com/sgl-project/sglang.git /sgl-workspace/sglang; \ - elif [ -z "$SGL_VERSION" ]; then \ - echo "ERROR: SGL_VERSION must be set when USE_LATEST_SGLANG=0 and BRANCH_TYPE!=local" && exit 1; \ - else \ - git clone --depth=1 --branch v${SGL_VERSION} https://github.com/sgl-project/sglang.git /sgl-workspace/sglang; \ - fi \ - && rm -rf /tmp/local_src +# Rust toolchain for setuptools-rust extensions (e.g. sglang-grpc). +# Requires >= 1.85 (edition 2024). Inherited by framework via FROM torch_deps. 
+ENV PATH="/root/.cargo/bin:${PATH}" +RUN curl --proto '=https' --tlsv1.2 --retry 3 --retry-delay 2 -sSf https://sh.rustup.rs \ + | sh -s -- -y --no-modify-path --profile minimal \ + && rustc --version && cargo --version +# Install sgl-kernel (from pre-built wheel) RUN --mount=type=cache,target=/root/.cache/pip \ python3 -m pip install --upgrade pip setuptools wheel html5lib six \ - && cd sglang \ && case "$CUDA_VERSION" in \ 12.6.1) CUINDEX=126 ;; \ 12.8.1) CUINDEX=128 ;; \ @@ -201,63 +200,109 @@ RUN --mount=type=cache,target=/root/.cache/pip \ *) echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1 ;; \ esac \ && if [ "$CUDA_VERSION" = "12.6.1" ]; then \ - python3 -m pip install https://${GITHUB_ARTIFACTORY}/sgl-project/whl/releases/download/v${SGL_KERNEL_VERSION}/sgl_kernel-${SGL_KERNEL_VERSION}+cu124-cp310-abi3-manylinux2014_$(uname -m).whl --force-reinstall --no-deps \ + python3 -m pip install https://${GITHUB_ARTIFACTORY}/sgl-project/whl/releases/download/v${SGL_KERNEL_VERSION}/sglang_kernel-${SGL_KERNEL_VERSION}+cu124-cp310-abi3-manylinux2014_$(uname -m).whl --force-reinstall --no-deps \ ; \ elif [ "$CUDA_VERSION" = "12.8.1" ] || [ "$CUDA_VERSION" = "12.9.1" ]; then \ - python3 -m pip install sgl-kernel==${SGL_KERNEL_VERSION} \ + python3 -m pip install https://github.com/sgl-project/whl/releases/download/v${SGL_KERNEL_VERSION}/sglang_kernel-${SGL_KERNEL_VERSION}+cu129-cp310-abi3-manylinux2014_$(uname -m).whl --force-reinstall --no-deps \ ; \ elif [ "$CUDA_VERSION" = "13.0.1" ]; then \ - python3 -m pip install https://github.com/sgl-project/whl/releases/download/v${SGL_KERNEL_VERSION}/sgl_kernel-${SGL_KERNEL_VERSION}+cu130-cp310-abi3-manylinux2014_$(uname -m).whl --force-reinstall --no-deps \ + # --no-deps prevents pip from pulling torch from default PyPI + python3 -m pip install sglang-kernel==${SGL_KERNEL_VERSION} --force-reinstall --no-deps \ ; \ else \ echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1 \ ; \ - fi \ - && python3 -m pip install -e 
"python[${BUILD_TYPE}]" --extra-index-url https://download.pytorch.org/whl/cu${CUINDEX} \ - && if [ "$INSTALL_FLASHINFER_JIT_CACHE" = "1" ]; then \ - python3 -m pip install flashinfer-jit-cache==${FLASHINFER_VERSION} --index-url https://flashinfer.ai/whl/cu${CUINDEX} ; \ - fi \ - && FLASHINFER_CUBIN_DOWNLOAD_THREADS=${BUILD_AND_DOWNLOAD_PARALLEL} FLASHINFER_LOGGING_LEVEL=warning python3 -m flashinfer --download-cubin + fi -# DeepEP -# We use Tom's DeepEP fork for GB200 for now; the 1fd57b0276311d035d16176bb0076426166e52f3 commit is https://github.com/fzyzcjy/DeepEP/tree/gb200_blog_part_2 -# TODO: move from Tom's branch to DeepEP hybrid-ep branch -# We use the nvshmem version that ships with torch 2.9.1 -# CU12 uses 3.3.20 and CU13 uses 3.3.24 +# Copy dep spec + Rust crate source + proto files. setuptools-rust compiles the +# Rust extension during the stub wheel build; the crate's build.rs references +# ../../proto for tonic_build. Split from the pip install so source changes to +# these paths invalidate the dep-install layer, but Python source changes don't. +COPY python/pyproject.toml /tmp/sglang_deps/python/pyproject.toml +COPY rust/sglang-grpc /tmp/sglang_deps/rust/sglang-grpc +COPY proto /tmp/sglang_deps/proto + +# Install sglang dependencies (torch, transformers, etc.) 
+# Generate constraints.txt to prevent reinstalling these deps in later stages +RUN --mount=type=cache,target=/root/.cache/pip \ + --mount=type=cache,target=/root/.cargo/registry \ + case "$CUDA_VERSION" in \ + 12.6.1) CUINDEX=126 ;; \ + 12.8.1) CUINDEX=128 ;; \ + 12.9.1) CUINDEX=129 ;; \ + 13.0.1) CUINDEX=130 ;; \ + *) echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1 ;; \ + esac \ + && cd /tmp/sglang_deps/python \ + && mkdir -p sglang \ + && touch sglang/__init__.py \ + && echo '__version__ = "0.0.0"' > sglang/version.py \ + && touch README.md \ + && touch LICENSE \ + && python3 -m pip install --extra-index-url https://download.pytorch.org/whl/cu${CUINDEX} ".[${BUILD_TYPE}]" \ + && if [ "${CUDA_VERSION%%.*}" = "12" ]; then \ + pip list --format=freeze | awk -F'==' '/-cu13(==|$)/ {print $1}' \ + | xargs -r python3 -m pip uninstall -y && \ + python3 -m pip install --index-url https://download.pytorch.org/whl/cu${CUINDEX} \ + torch torchvision torchaudio --force-reinstall; \ + python3 -m pip install https://github.com/sgl-project/whl/releases/download/v${SGL_DEEP_GEMM_VERSION}/sgl_deep_gemm-${SGL_DEEP_GEMM_VERSION}+cu129-py3-none-manylinux2014_$(uname -m).whl --force-reinstall; \ + fi \ + && cd /sgl-workspace \ + && rm -rf /tmp/sglang_deps \ + && pip freeze | grep -v "^sglang==" > /sgl-workspace/constraints.txt + +######################################################## +# PARALLEL STAGE 2: DeepEP Builder (needs torch_deps) +######################################################## +FROM torch_deps AS deepep_builder + +ARG CUDA_VERSION +ARG BUILD_AND_DOWNLOAD_PARALLEL +ARG GRACE_BLACKWELL +ARG GRACE_BLACKWELL_DEEPEP_BRANCH +ARG HOPPER_SBO +ARG HOPPER_SBO_DEEPEP_COMMIT +ARG DEEPEP_COMMIT +ARG GITHUB_ARTIFACTORY + +WORKDIR /build + +# Clone DeepEP RUN set -eux; \ if [ "$GRACE_BLACKWELL" = "1" ]; then \ git clone https://github.com/fzyzcjy/DeepEP.git && \ cd DeepEP && \ git checkout ${GRACE_BLACKWELL_DEEPEP_BRANCH} && \ sed -i 's/#define NUM_CPU_TIMEOUT_SECS 
100/#define NUM_CPU_TIMEOUT_SECS 1000/' csrc/kernels/configs.cuh && \ + sed -i 's/#define NUM_TIMEOUT_CYCLES 200000000000ull/#define NUM_TIMEOUT_CYCLES 2000000000000ull/' csrc/kernels/configs.cuh && \ cd .. ; \ elif [ "$HOPPER_SBO" = "1" ]; then \ git clone https://github.com/deepseek-ai/DeepEP.git -b antgroup-opt && \ cd DeepEP && \ git checkout ${HOPPER_SBO_DEEPEP_COMMIT} && \ sed -i 's/#define NUM_CPU_TIMEOUT_SECS 100/#define NUM_CPU_TIMEOUT_SECS 1000/' csrc/kernels/configs.cuh && \ + sed -i 's/#define NUM_TIMEOUT_CYCLES 200000000000ull/#define NUM_TIMEOUT_CYCLES 2000000000000ull/' csrc/kernels/configs.cuh && \ cd .. ; \ else \ curl --retry 3 --retry-delay 2 -fsSL -o ${DEEPEP_COMMIT}.zip \ https://${GITHUB_ARTIFACTORY}/deepseek-ai/DeepEP/archive/${DEEPEP_COMMIT}.zip && \ unzip -q ${DEEPEP_COMMIT}.zip && rm ${DEEPEP_COMMIT}.zip && mv DeepEP-${DEEPEP_COMMIT} DeepEP && cd DeepEP && \ sed -i 's/#define NUM_CPU_TIMEOUT_SECS 100/#define NUM_CPU_TIMEOUT_SECS 1000/' csrc/kernels/configs.cuh && \ + sed -i 's/#define NUM_TIMEOUT_CYCLES 200000000000ull/#define NUM_TIMEOUT_CYCLES 2000000000000ull/' csrc/kernels/configs.cuh && \ cd .. ; \ fi -# Install DeepEP +# Build DeepEP wheel RUN --mount=type=cache,target=/root/.cache/pip \ - cd /sgl-workspace/DeepEP && \ + cd /build/DeepEP && \ case "$CUDA_VERSION" in \ 12.6.1) \ CHOSEN_TORCH_CUDA_ARCH_LIST='9.0' \ ;; \ 12.8.1) \ - # FIXED: 12.8.1 does NOT support Blackwell 10.3 \ CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0' \ ;; \ 12.9.1|13.0.1) \ - # 12.9.1+ properly supports Blackwell 10.3 \ CHOSEN_TORCH_CUDA_ARCH_LIST='9.0;10.0;10.3' \ ;; \ *) \ @@ -267,55 +312,159 @@ RUN --mount=type=cache,target=/root/.cache/pip \ if [ "${CUDA_VERSION%%.*}" = "13" ]; then \ sed -i "/^ include_dirs = \['csrc\/'\]/a\ include_dirs.append('${CUDA_HOME}/include/cccl')" setup.py; \ fi && \ - TORCH_CUDA_ARCH_LIST="${CHOSEN_TORCH_CUDA_ARCH_LIST}" MAX_JOBS=${BUILD_AND_DOWNLOAD_PARALLEL} pip install --no-build-isolation . 
+ TORCH_CUDA_ARCH_LIST="${CHOSEN_TORCH_CUDA_ARCH_LIST}" MAX_JOBS=${BUILD_AND_DOWNLOAD_PARALLEL} \ + python3 setup.py bdist_wheel -d /wheels + +######################################################## +# PARALLEL STAGE 3: FlashInfer Cache (needs torch_deps) +######################################################## +FROM torch_deps AS flashinfer_cache + +ARG CUDA_VERSION +ARG INSTALL_FLASHINFER_JIT_CACHE +ARG FLASHINFER_VERSION -# Install essential Python packages +# Stage jit-cache artifacts into /flashinfer_jit_output for clean COPY later RUN --mount=type=cache,target=/root/.cache/pip \ - python3 -m pip install \ - datamodel_code_generator \ - mooncake-transfer-engine==0.3.8.post1 \ - pre-commit \ - pytest \ - black \ - isort \ - icdiff \ - uv \ - wheel \ - scikit-build-core \ - nixl \ - py-spy \ - cubloaty \ - google-cloud-storage + case "$CUDA_VERSION" in \ + 12.6.1) CUINDEX=126 ;; \ + 12.8.1) CUINDEX=128 ;; \ + 12.9.1) CUINDEX=129 ;; \ + 13.0.1) CUINDEX=130 ;; \ + *) echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1 ;; \ + esac \ + && mkdir -p /flashinfer_jit_output \ + && if [ "$INSTALL_FLASHINFER_JIT_CACHE" = "1" ]; then \ + python3 -m pip install flashinfer-jit-cache==${FLASHINFER_VERSION} --index-url https://flashinfer.ai/whl/cu${CUINDEX} \ + && cp -r /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache /flashinfer_jit_output/ \ + && cp -r /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache-*.dist-info /flashinfer_jit_output/ ; \ + fi + +######################################################## +# PARALLEL STAGE 4: Dev Tools Builder (starts from base) +######################################################## +FROM base AS devtools_builder + +ARG GITHUB_ARTIFACTORY + +WORKDIR /tools + +# Minimal apt deps needed for oh-my-zsh install in this stage +# Full dev apt packages (gdb, vim, tmux, nsight, etc.) 
are installed in the framework stage +RUN --mount=type=cache,target=/var/cache/apt,id=devtools-apt \ + apt-get update && apt-get install -y --no-install-recommends zsh git \ + && rm -rf /var/lib/apt/lists/* + +# Download CLI tools (each in its own layer for parallel downloads) +RUN curl --retry 3 --retry-delay 2 -LSso /tools/diff-so-fancy \ + https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v1.4.4/diff-so-fancy \ + && chmod +x /tools/diff-so-fancy -# Build and install sgl-model-gateway (install Rust, build, then remove to save space) +RUN curl --retry 3 --retry-delay 2 -LSso /tools/clang-format \ + https://${GITHUB_ARTIFACTORY}/muttleyxd/clang-tools-static-binaries/releases/download/master-32d3ac78/clang-format-16_linux-amd64 \ + && chmod +x /tools/clang-format + +RUN curl --retry 3 --retry-delay 2 -fsSL -o /tmp/clangd.zip \ + https://${GITHUB_ARTIFACTORY}/clangd/clangd/releases/download/18.1.3/clangd-linux-18.1.3.zip \ + && unzip -q /tmp/clangd.zip -d /tmp \ + && cp /tmp/clangd_18.1.3/bin/* /tools/ \ + && mkdir -p /tools/lib && cp -r /tmp/clangd_18.1.3/lib/* /tools/lib/ \ + && rm -rf /tmp/clangd.zip /tmp/clangd_18.1.3 + +RUN CMAKE_VERSION=3.31.1 \ + && ARCH=$(uname -m) \ + && CMAKE_INSTALLER="cmake-${CMAKE_VERSION}-linux-${ARCH}" \ + && curl --retry 3 --retry-delay 2 -fsSL -o "/tmp/${CMAKE_INSTALLER}.tar.gz" \ + "https://${GITHUB_ARTIFACTORY}/Kitware/CMake/releases/download/v${CMAKE_VERSION}/${CMAKE_INSTALLER}.tar.gz" \ + && tar -xzf "/tmp/${CMAKE_INSTALLER}.tar.gz" -C /tmp \ + && cp -r "/tmp/${CMAKE_INSTALLER}/bin/"* /tools/ \ + && mkdir -p /tools/share && cp -r "/tmp/${CMAKE_INSTALLER}/share/"* /tools/share/ \ + && rm -rf "/tmp/${CMAKE_INSTALLER}" "/tmp/${CMAKE_INSTALLER}.tar.gz" + +RUN curl --proto '=https' --tlsv1.2 --retry 3 --retry-delay 2 -sSf https://just.systems/install.sh | \ + sed "s|https://github.com|https://${GITHUB_ARTIFACTORY}|g" | \ + bash -s -- --tag 1.42.4 --to /tools + +# Install oh-my-zsh and plugins +RUN sh -c "$(curl 
--retry 3 --retry-delay 2 -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended \ + && git clone --depth 1 https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-/root/.oh-my-zsh/custom}/plugins/zsh-autosuggestions \ + && git clone --depth 1 https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-/root/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting + +######################################################## +# PARALLEL STAGE 5: Gateway Builder (starts from base) +######################################################## +# Builds sgl-model-gateway in isolation so Python-only changes +# don't trigger a full Rust recompilation. +FROM base AS gateway_builder + +ARG GITHUB_ARTIFACTORY +ARG BRANCH_TYPE +ARG SGL_VERSION +ARG USE_LATEST_SGLANG + +WORKDIR /build + +# Copy ONLY the gateway source (not the full repo) +COPY sgl-model-gateway /build/sgl-model-gateway + +# Install Rust, build gateway binary and Python bindings, then clean up Rust toolchain RUN --mount=type=cache,target=/root/.cache/pip \ curl --proto '=https' --tlsv1.2 --retry 3 --retry-delay 2 -sSf https://sh.rustup.rs | sh -s -- -y \ && export PATH="/root/.cargo/bin:${PATH}" \ - && rustc --version && cargo --version \ && python3 -m pip install maturin \ - && cd /sgl-workspace/sglang/sgl-model-gateway/bindings/python \ - && ulimit -n 65536 && maturin build --release --features vendored-openssl --out dist \ - && python3 -m pip install --force-reinstall dist/*.whl \ - && cd /sgl-workspace/sglang/sgl-model-gateway \ - && cargo build --release --bin sglang-router --features vendored-openssl \ - && cp target/release/sglang-router /usr/local/bin/sglang-router \ - && rm -rf /root/.cargo /root/.rustup target dist ~/.cargo \ - && sed -i '/\.cargo\/env/d' /root/.profile /root/.bashrc 2>/dev/null || true - -# Patching packages for CUDA 12/13 compatibility -# TODO: Remove when torch version covers these packages -RUN 
--mount=type=cache,target=/root/.cache/pip if [ "${CUDA_VERSION%%.*}" = "12" ]; then \ - python3 -m pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps ; \ - python3 -m pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps ; \ -elif [ "${CUDA_VERSION%%.*}" = "13" ]; then \ - python3 -m pip install nvidia-nccl-cu13==2.28.3 --force-reinstall --no-deps ; \ - python3 -m pip install nvidia-cudnn-cu13==9.16.0.29 --force-reinstall --no-deps ; \ - python3 -m pip install nvidia-cublas==13.1.0.3 --force-reinstall --no-deps ; \ - python3 -m pip install nixl-cu13 --no-deps ; \ - python3 -m pip install cuda-python==13.1.1 ; \ -fi + && cd /build/sgl-model-gateway/bindings/python \ + && ulimit -n 65536 && maturin build --release --features vendored-openssl --out /build/gateway_wheels \ + && cd /build/sgl-model-gateway \ + && cargo build --release --bin sgl-model-gateway --features vendored-openssl \ + && cp target/release/sgl-model-gateway /build/sgl-model-gateway-bin \ + && rm -rf /root/.cargo /root/.rustup /build/sgl-model-gateway/target /build/sgl-model-gateway/bindings/python/target -# Install development tools +######################################################## +########## Final Framework Image ###################### +######################################################## +# +# Combines all artifacts from parallel builder stages +# +FROM torch_deps AS framework + +ARG BRANCH_TYPE +ARG BUILD_TYPE +ARG CUDA_VERSION +ARG BUILD_AND_DOWNLOAD_PARALLEL +ARG SGL_VERSION +ARG USE_LATEST_SGLANG +ARG GITHUB_ARTIFACTORY +ARG MOONCAKE_VERSION +ARG MOONCAKE_COMPILE_ARG + +WORKDIR /sgl-workspace + +# ============================================================================= +# Copy artifacts from parallel builders +# ============================================================================= + +# Copy DeepEP wheel and install +COPY --from=deepep_builder /wheels /tmp/wheels/deepep +COPY --from=deepep_builder /build/DeepEP /sgl-workspace/DeepEP +RUN 
--mount=type=cache,target=/root/.cache/pip \ + pip install /tmp/wheels/deepep/*.whl && rm -rf /tmp/wheels/deepep + +# Copy flashinfer jit-cache package (if installed) +COPY --from=flashinfer_cache /flashinfer_jit_output/ /usr/local/lib/python3.12/dist-packages/ + +# Copy dev tools +COPY --from=devtools_builder /tools/diff-so-fancy /usr/local/bin/ +COPY --from=devtools_builder /tools/clang-format /usr/local/bin/ +COPY --from=devtools_builder /tools/clangd /usr/local/bin/ +COPY --from=devtools_builder /tools/lib /usr/local/lib/ +COPY --from=devtools_builder /tools/cmake /usr/local/bin/ +COPY --from=devtools_builder /tools/ctest /usr/local/bin/ +COPY --from=devtools_builder /tools/cpack /usr/local/bin/ +COPY --from=devtools_builder /tools/share/cmake-3.31 /usr/local/share/cmake-3.31 +COPY --from=devtools_builder /tools/just /usr/local/bin/ +COPY --from=devtools_builder /root/.oh-my-zsh /root/.oh-my-zsh + +# Install dev apt packages (need to re-run since we're in a different stage) RUN --mount=type=cache,target=/var/cache/apt,id=framework-apt \ apt-get update && apt-get install -y --no-install-recommends \ gdb \ @@ -354,63 +503,63 @@ RUN --mount=type=cache,target=/var/cache/apt,id=framework-apt \ && apt install -y --no-install-recommends nsight-systems-cli \ && rm -rf /var/lib/apt/lists/* -# Install minimal Python dev packages +# ============================================================================= +# Python packages and tools (before source copy for better caching) +# ============================================================================= + +# Install Mooncake RUN --mount=type=cache,target=/root/.cache/pip \ - python3 -m pip install --break-system-packages \ + CUDA_MAJOR="${CUDA_VERSION%%.*}" && \ + if [ "$CUDA_MAJOR" -ge 13 ]; then \ + python3 -m pip install mooncake-transfer-engine-cuda13==${MOONCAKE_VERSION}; \ + else \ + python3 -m pip install mooncake-transfer-engine==${MOONCAKE_VERSION}; \ + fi + +# Install essential Python packages (use 
constraints to prevent conflicts) +RUN --mount=type=cache,target=/root/.cache/pip \ + python3 -m pip install -c /sgl-workspace/constraints.txt \ + datamodel_code_generator \ + pre-commit \ pytest \ black \ isort \ icdiff \ - scikit-build-core \ uv \ - pre-commit \ + wheel \ + scikit-build-core \ + py-spy \ + cubloaty \ + google-cloud-storage \ pandas \ matplotlib \ tabulate \ - termplotlib - -# diff-so-fancy -RUN curl --retry 3 --retry-delay 2 -LSso /usr/local/bin/diff-so-fancy \ - https://${GITHUB_ARTIFACTORY}/so-fancy/diff-so-fancy/releases/download/v1.4.4/diff-so-fancy \ - && chmod +x /usr/local/bin/diff-so-fancy - -# clang-format -RUN curl --retry 3 --retry-delay 2 -LSso /usr/local/bin/clang-format \ - https://${GITHUB_ARTIFACTORY}/muttleyxd/clang-tools-static-binaries/releases/download/master-32d3ac78/clang-format-16_linux-amd64 \ - && chmod +x /usr/local/bin/clang-format - -# clangd -RUN curl --retry 3 --retry-delay 2 -fsSL -o clangd.zip \ - https://${GITHUB_ARTIFACTORY}/clangd/clangd/releases/download/18.1.3/clangd-linux-18.1.3.zip \ - && unzip -q clangd.zip \ - && cp -r clangd_18.1.3/bin/* /usr/local/bin/ \ - && cp -r clangd_18.1.3/lib/* /usr/local/lib/ \ - && rm -rf clangd_18.1.3 clangd.zip - -# CMake -RUN CMAKE_VERSION=3.31.1 \ - && ARCH=$(uname -m) \ - && CMAKE_INSTALLER="cmake-${CMAKE_VERSION}-linux-${ARCH}" \ - && curl --retry 3 --retry-delay 2 -fsSL -o "${CMAKE_INSTALLER}.tar.gz" \ - "https://${GITHUB_ARTIFACTORY}/Kitware/CMake/releases/download/v${CMAKE_VERSION}/${CMAKE_INSTALLER}.tar.gz" \ - && tar -xzf "${CMAKE_INSTALLER}.tar.gz" \ - && cp -r "${CMAKE_INSTALLER}/bin/"* /usr/local/bin/ \ - && cp -r "${CMAKE_INSTALLER}/share/"* /usr/local/share/ \ - && rm -rf "${CMAKE_INSTALLER}" "${CMAKE_INSTALLER}.tar.gz" - -# Install just -RUN curl --proto '=https' --tlsv1.2 --retry 3 --retry-delay 2 -sSf https://just.systems/install.sh | \ - sed "s|https://github.com|https://${GITHUB_ARTIFACTORY}|g" | \ - bash -s -- --tag 1.42.4 --to /usr/local/bin + termplotlib 
\ + "runai-model-streamer[s3,gcs,azure]>=0.15.7" + +# Per-CUDA-major package installs. The `nixl` stub package is needed (it owns +# the `nixl` import path) but unconditionally requires nixl-cu12, so we install +# it with --no-deps and pair it with the matching nixl-cu12 / nixl-cu13 binary +# to avoid shipping wrong-CUDA libs on cu13 images. +# The upstream flash-mla packages are required for running deepseek-v4 models +RUN --mount=type=cache,target=/root/.cache/pip if [ "${CUDA_VERSION%%.*}" = "12" ]; then \ + python3 -m pip install nixl nixl-cu12 --no-deps ; \ + python3 -m pip install cuda-python==12.9 ; \ + cd /sgl-workspace && git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla \ + && cd flash-mla && git submodule update --init --recursive \ + && pip install --no-build-isolation -v . ; \ +elif [ "${CUDA_VERSION%%.*}" = "13" ]; then \ + python3 -m pip install nixl nixl-cu13 --no-deps ; \ + python3 -m pip install cuda-python==13.2.0 ; \ + cd /sgl-workspace && git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla \ + && ln -s /usr/local/cuda/include/cccl/cuda /usr/local/cuda/include/cuda \ + && cd flash-mla && git submodule update --init --recursive \ + && pip install --no-build-isolation -v . 
; \ +fi # Add yank script COPY --chown=root:root --chmod=755 docker/configs/yank /usr/local/bin/yank -# Install oh-my-zsh and plugins -RUN sh -c "$(curl --retry 3 --retry-delay 2 -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended \ - && git clone --depth 1 https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions \ - && git clone --depth 1 https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting - # These configs are optional; users can override them by mounting their own files COPY docker/configs/opt/.vimrc /opt/sglang/.vimrc COPY docker/configs/opt/.tmux.conf /opt/sglang/.tmux.conf @@ -419,17 +568,106 @@ COPY docker/configs/opt/.gitconfig /opt/sglang/.gitconfig # Configure development environment COPY docker/configs/.zshrc /root/.zshrc -# Fix Triton to use system ptxas for Blackwell (sm_103a) support (CUDA 13+ only) -RUN if [ "${CUDA_VERSION%%.*}" = "13" ] && [ -d /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin ]; then \ - rm -f /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas && \ - ln -s /usr/local/cuda/bin/ptxas /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas; \ - fi +# Fix Trivy-reported CVEs +# pip: urllib3 (CVE-2025-43859), pillow (CVE-2026-25990) +# binutils family: CVE-2025-{1147,1148,3198,5244,5245,7545,7546,8225,11082,11083,11412,11413,11414,11494,11839,11840} +# libgnutls30t64: CVE-2025-{9820,14831} +# libpam: CVE-2024-10963 +# libsqlite3-0: CVE-2025-{6965,7709} +# libtasn1-6: CVE-2025-13151 +# dpkg: CVE-2025-6297 +RUN python3 -m pip install --upgrade "urllib3>=2.6.3" "pillow>=12.1.1" +RUN --mount=type=cache,target=/var/cache/apt,id=framework-apt \ + apt-get update && apt-get install -y --only-upgrade \ + binutils binutils-common binutils-x86-64-linux-gnu libbinutils \ + libctf0 libctf-nobfd0 libgprofng0 libsframe1 \ + 
libgnutls30t64 \ + libpam-modules libpam-modules-bin libpam-runtime libpam0g \ + libsqlite3-0 libtasn1-6 \ + dpkg dpkg-dev libdpkg-perl \ + && rm -rf /var/lib/apt/lists/* -RUN python3 -m pip install --upgrade "urllib3>=2.6.3" +# ============================================================================= +# Copy sglang source and do editable install (LAST for better caching) +# ============================================================================= + +# Copy local source if building from local +FROM scratch AS local_src +COPY . /src + +FROM framework AS framework_final + +ARG BRANCH_TYPE +ARG BUILD_TYPE +ARG CUDA_VERSION +ARG SGL_VERSION +ARG USE_LATEST_SGLANG + +WORKDIR /sgl-workspace + +COPY --from=local_src /src /tmp/local_src +RUN if [ "$BRANCH_TYPE" = "local" ]; then \ + cp -r /tmp/local_src /sgl-workspace/sglang; \ + elif [ "$USE_LATEST_SGLANG" = "1" ]; then \ + git clone --depth=1 https://github.com/sgl-project/sglang.git /sgl-workspace/sglang; \ + elif [ -z "$SGL_VERSION" ]; then \ + echo "ERROR: SGL_VERSION must be set when USE_LATEST_SGLANG=0 and BRANCH_TYPE!=local" && exit 1; \ + else \ + git clone --depth=1 --branch v${SGL_VERSION} https://github.com/sgl-project/sglang.git /sgl-workspace/sglang; \ + fi \ + && rm -rf /tmp/local_src + +# Editable install (fast - dependencies already installed via constraints) +# Clean up __pycache__/tests/pyc in same RUN to avoid writing ~28k files to layer +RUN --mount=type=cache,target=/root/.cache/pip \ + cd /sgl-workspace/sglang \ + && python3 -m pip install --no-deps -e "python[${BUILD_TYPE}]" \ + && kernels lock python \ + && ( success=0; \ + # aarch64: kernels-community/sgl-flash-attn3 ships no arm variants; JIT-compile at runtime. + # Remove this branch once arm cubins are published upstream. 
+ if [ "$(uname -m)" = "aarch64" ]; then \ + echo "Skipping kernels-community/sgl-flash-attn3 cubin download on aarch64 (no variants published upstream); kernels will be JIT-compiled at runtime"; \ + success=1; \ + else \ + for i in 1 2 3; do \ + echo "Attempt $i/3: downloading sgl-kernel cubins..." && \ + kernels download python && \ + success=1 && break; \ + echo "sgl-kernel cubin download failed, retrying in 30s..." && sleep 30; \ + done; \ + fi; \ + [ "$success" = "1" ] ) \ + && mkdir -p /root/.cache/huggingface /root/.cache/sglang \ + && ( if [ -f python/kernels.lock ]; then mv python/kernels.lock /root/.cache/sglang/; fi ) \ + && ( find /usr/local/lib/python3.12/dist-packages -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true ) + + +# Install pre-built gateway artifacts from parallel builder +COPY --from=gateway_builder /build/sgl-model-gateway-bin /usr/local/bin/sgl-model-gateway +COPY --from=gateway_builder /build/gateway_wheels /tmp/gateway_wheels +RUN --mount=type=cache,target=/root/.cache/pip \ + python3 -m pip install --force-reinstall /tmp/gateway_wheels/*.whl \ + && rm -rf /tmp/gateway_wheels # Set workspace directory WORKDIR /sgl-workspace/sglang +# Keep build provenance at the end so metadata changes do not invalidate build layers. 
+ARG SGLANG_BUILD_COMMIT=unknown +ARG SGLANG_BUILD_URL= +ARG SGLANG_IMAGE_TAG=local/sglang:dev +ENV SGLANG_BUILD_COMMIT=${SGLANG_BUILD_COMMIT:-unknown} \ + SGLANG_BUILD_URL=${SGLANG_BUILD_URL:-} \ + SGLANG_IMAGE_TAG=${SGLANG_IMAGE_TAG:-local/sglang:dev} +LABEL org.opencontainers.image.source="https://github.com/sgl-project/sglang" \ + org.opencontainers.image.revision="${SGLANG_BUILD_COMMIT}" \ + org.opencontainers.image.version="${SGLANG_IMAGE_TAG}" \ + org.opencontainers.image.url="${SGLANG_BUILD_URL}" \ + ai.sglang.build.commit="${SGLANG_BUILD_COMMIT}" \ + ai.sglang.build.url="${SGLANG_BUILD_URL}" \ + ai.sglang.image.tag="${SGLANG_IMAGE_TAG}" + ######################################################## ########## Runtime Image ############################## ######################################################## @@ -463,17 +701,14 @@ ENV PATH="${PATH}:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/cuda/nvvm LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/nvidia/lib:/usr/local/nvidia/lib64" # Install runtime dependencies (devel base provides gcc/g++/build tools) +# Python 3.12 ships in Ubuntu 24.04 main, so no deadsnakes PPA needed. 
RUN --mount=type=cache,target=/var/cache/apt,id=runtime-apt \ - apt-get update && apt-get install -y --no-install-recommends \ + apt-get update && apt-get install -y --no-install-recommends --allow-change-held-packages \ # Python runtime - software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa -y \ - && apt-get update && apt-get install -y --no-install-recommends --allow-change-held-packages \ python3.12-full \ python3.12-dev \ wget \ # Core system utilities - tzdata \ ca-certificates \ netcat-openbsd \ curl \ @@ -510,6 +745,7 @@ RUN --mount=type=cache,target=/var/cache/apt,id=runtime-apt \ libnccl-dev \ # GPG key verification gnupg2 \ + linux-libc-dev \ && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 2 \ && update-alternatives --set python3 /usr/bin/python3.12 \ && ln -sf /usr/bin/python3.12 /usr/bin/python \ @@ -530,27 +766,57 @@ ENV LANG=en_US.UTF-8 \ LANGUAGE=en_US:en \ LC_ALL=en_US.UTF-8 -# Copy Python site-packages from framework (contains all built packages) -COPY --from=framework /usr/local/lib/python3.12/dist-packages /usr/local/lib/python3.12/dist-packages +# Fix Trivy-reported CVEs (see framework stage for full CVE list) +RUN --mount=type=cache,target=/var/cache/apt,id=runtime-apt \ + apt-get update && apt-get install -y --only-upgrade \ + binutils binutils-common binutils-x86-64-linux-gnu libbinutils \ + libctf0 libctf-nobfd0 libgprofng0 libsframe1 \ + libgnutls30t64 \ + libpam-modules libpam-modules-bin libpam-runtime libpam0g \ + libsqlite3-0 libtasn1-6 \ + dpkg dpkg-dev libdpkg-perl \ + && rm -rf /var/lib/apt/lists/* + +# Copy Python site-packages from framework (already cleaned of __pycache__/tests/pyc files) +COPY --from=framework_final /usr/local/lib/python3.12/dist-packages /usr/local/lib/python3.12/dist-packages # Copy SGLang workspace -COPY --from=framework /sgl-workspace /sgl-workspace +COPY --from=framework_final /sgl-workspace /sgl-workspace -# Fix Triton to use system ptxas for Blackwell 
(sm_103a) support (CUDA 13+ only) -RUN if [ "${CUDA_VERSION%%.*}" = "13" ] && [ -d /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin ]; then \ - rm -f /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas && \ - ln -s /usr/local/cuda/bin/ptxas /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas; \ - fi +# Copy sgl-model-gateway binary +COPY --from=framework_final /usr/local/bin/sgl-model-gateway /usr/local/bin/sgl-model-gateway + +# Copy py-spy binary +COPY --from=framework_final /usr/local/bin/py-spy /usr/local/bin/py-spy + +# Copy cache for kernels from kernels community +COPY --from=framework_final /root/.cache/huggingface /root/.cache/huggingface +COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang # Copy GDRCopy runtime libraries (but not the build artifacts) -COPY --from=framework /usr/lib/libgdrapi.so* /usr/lib/ -COPY --from=framework /usr/bin/gdrcopy_* /usr/bin/ -COPY --from=framework /usr/src/gdrdrv-2.5.1 /usr/src/gdrdrv-2.5.1 +COPY --from=framework_final /usr/lib/libgdrapi.so* /usr/lib/ +COPY --from=framework_final /usr/bin/gdrcopy_* /usr/bin/ +COPY --from=framework_final /usr/src/gdrdrv-2.5.1 /usr/src/gdrdrv-2.5.1 # Fix DeepEP IBGDA symlink in runtime RUN ln -sf /usr/lib/$(uname -m)-linux-gnu/libmlx5.so.1 /usr/lib/$(uname -m)-linux-gnu/libmlx5.so WORKDIR /sgl-workspace/sglang +# Keep build provenance at the end so metadata changes do not invalidate build layers. 
+ARG SGLANG_BUILD_COMMIT=unknown +ARG SGLANG_BUILD_URL= +ARG SGLANG_IMAGE_TAG=local/sglang:dev +ENV SGLANG_BUILD_COMMIT=${SGLANG_BUILD_COMMIT:-unknown} \ + SGLANG_BUILD_URL=${SGLANG_BUILD_URL:-} \ + SGLANG_IMAGE_TAG=${SGLANG_IMAGE_TAG:-local/sglang:dev} +LABEL org.opencontainers.image.source="https://github.com/sgl-project/sglang" \ + org.opencontainers.image.revision="${SGLANG_BUILD_COMMIT}" \ + org.opencontainers.image.version="${SGLANG_IMAGE_TAG}" \ + org.opencontainers.image.url="${SGLANG_BUILD_URL}" \ + ai.sglang.build.commit="${SGLANG_BUILD_COMMIT}" \ + ai.sglang.build.url="${SGLANG_BUILD_URL}" \ + ai.sglang.image.tag="${SGLANG_IMAGE_TAG}" + # Default command CMD ["/bin/bash"] diff --git a/docker/arm64.Dockerfile b/docker/arm64.Dockerfile new file mode 100644 index 000000000000..5173e46bedfc --- /dev/null +++ b/docker/arm64.Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 +SHELL ["/bin/bash", "-c"] + +ARG SGLANG_REPO=https://github.com/sgl-project/sglang.git +ARG VER_SGLANG=main + +RUN apt-get update && \ + apt-get full-upgrade -y && \ + DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \ + ca-certificates \ + git \ + curl \ + wget \ + vim \ + gcc \ + g++ \ + make \ + cmake \ + libsqlite3-dev \ + google-perftools \ + libtbb-dev \ + libnuma-dev \ + numactl + +WORKDIR /opt + +RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \ + source $HOME/.local/bin/env && \ + uv venv --python 3.12 + +RUN echo -e '[[index]]\nname = "torch"\nurl = "https://download.pytorch.org/whl/cpu"\n\n[[index]]\nname = "torchvision"\nurl = "https://download.pytorch.org/whl/cpu"\n\n[[index]]\nname = "torchaudio"\nurl = "https://download.pytorch.org/whl/cpu"\n\n[[index]]\nname = "triton"\nurl = "https://download.pytorch.org/whl/cpu"' > .venv/uv.toml + +ENV UV_CONFIG_FILE=/opt/.venv/uv.toml +ENV CMAKE_BUILD_PARALLEL_LEVEL=1 + +WORKDIR /sgl-workspace +RUN source $HOME/.local/bin/env && \ + source /opt/.venv/bin/activate && \ + git clone ${SGLANG_REPO} sglang && \ + 
cd sglang && \ + git checkout ${VER_SGLANG} && \ + cd python && \ + cp pyproject_cpu.toml pyproject.toml && \ + uv pip install . && \ + cd ../sgl-kernel && \ + cp pyproject_cpu.toml pyproject.toml && \ + uv pip install . + +ENV SGLANG_USE_CPU_ENGINE=1 +RUN echo 'source /opt/.venv/bin/activate' >> /root/.bashrc + +WORKDIR /sgl-workspace/sglang diff --git a/docker/diffusion.Dockerfile b/docker/diffusion.Dockerfile deleted file mode 100644 index d8af45b7c013..000000000000 --- a/docker/diffusion.Dockerfile +++ /dev/null @@ -1,104 +0,0 @@ -FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04 - -ENV DEBIAN_FRONTEND=noninteractive - -SHELL ["/bin/bash", "-c"] - -WORKDIR /sgl-workspace/sglang - -RUN apt-get update && apt-get install -y --no-install-recommends \ - wget \ - git \ - ca-certificates \ - openssh-server \ - zsh \ - vim \ - curl \ - gcc-11 \ - g++-11 \ - clang-11 \ - libnuma1 libnuma-dev \ - && rm -rf /var/lib/apt/lists/* - -# Install oh-my-zsh and plugins -RUN sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended \ - && git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions \ - && git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting - - -# Set up C++20 compilers for ThunderKittens -RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 100 --slave /usr/bin/g++ g++ /usr/bin/g++-11 - -# Set CUDA environment variables -ENV CUDA_HOME=/usr/local/cuda-12.8 -ENV PATH=${CUDA_HOME}/bin:${PATH} -ENV LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH - -# Install uv and source its environment -RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \ - echo 'source $HOME/.local/bin/env' >> /root/.zshrc - -# Copy just the pyproject.toml first to leverage Docker cache -COPY python/pyproject.toml python/ - -# Create a dummy README to satisfy the installation -RUN 
mkdir -p python && echo "# Placeholder" > python/README.md - -# Create and activate virtual environment with specific Python version and seed -RUN source $HOME/.local/bin/env && \ - uv venv --python 3.12 --seed /opt/venv && \ - source /opt/venv/bin/activate && \ - uv pip install nvitop && \ - uv pip install --no-cache-dir --upgrade pip && \ - uv pip install --no-cache-dir --prerelease=allow ./python[diffusion] - -COPY . . - -# Install dependencies using uv and set up shell configuration -RUN source $HOME/.local/bin/env && \ - source /opt/venv/bin/activate && \ - git config --unset-all http.https://github.com/.extraheader || true && \ - echo 'source /opt/venv/bin/activate' >> /root/.zshrc && \ - echo 'if [ -n "$ZSH_VERSION" ] && [ -f ~/.zshrc ]; then . ~/.zshrc; elif [ -f ~/.bashrc ]; then . ~/.bashrc; fi' > /root/.profile - -# Set PATH to include venv bin -ENV PATH=/opt/venv/bin:$PATH - -# Configure zsh -COPY --chown=root:root <<-"EOF" /root/.zshrc -export ZSH="/root/.oh-my-zsh" - -source $HOME/.local/bin/env -source /opt/venv/bin/activate - -## Theme -ZSH_THEME="robbyrussell" - -## Plugins -plugins=( - git - z - zsh-autosuggestions - zsh-syntax-highlighting -) - -source $ZSH/oh-my-zsh.sh - -## Aliases -alias ll='ls -alF' -alias la='ls -A' -alias l='ls -CF' -alias vi='vim' - -## Enhanced history -HISTSIZE=10000 -SAVEHIST=10000 -setopt HIST_IGNORE_ALL_DUPS -setopt HIST_FIND_NO_DUPS -setopt INC_APPEND_HISTORY -EOF - - -EXPOSE 22 - -CMD ["/bin/zsh"] diff --git a/docker/gateway.Dockerfile b/docker/gateway.Dockerfile index 9084c930a460..f69e98da921c 100644 --- a/docker/gateway.Dockerfile +++ b/docker/gateway.Dockerfile @@ -16,9 +16,7 @@ ENV PATH="$VIRTUAL_ENV/bin:$PATH" # install dependencies -RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ - && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \ - && apt update -y \ +RUN apt update -y \ && apt install -y curl \ && rm -rf /var/lib/apt/lists/* \ && apt clean diff 
--git a/docker/npu.Dockerfile b/docker/npu.Dockerfile index e49551b19379..bf135b293e2f 100644 --- a/docker/npu.Dockerfile +++ b/docker/npu.Dockerfile @@ -1,4 +1,4 @@ -ARG CANN_VERSION=8.3.rc2 +ARG CANN_VERSION=8.5.0 ARG DEVICE_TYPE=a3 ARG OS=ubuntu22.04 ARG PYTHON_VERSION=py3.11 @@ -6,14 +6,15 @@ ARG PYTHON_VERSION=py3.11 FROM quay.io/ascend/cann:$CANN_VERSION-$DEVICE_TYPE-$OS-$PYTHON_VERSION # Update pip & apt sources +ARG TARGETARCH +ARG CANN_VERSION +ARG DEVICE_TYPE ARG PIP_INDEX_URL="https://pypi.org/simple/" ARG APTMIRROR="" ARG PYTORCH_VERSION="2.8.0" ARG TORCHVISION_VERSION="0.23.0" -ARG PTA_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/torch_npu/torch_npu-2.8.0.post2.dev20251113-cp311-cp311-manylinux_2_28_aarch64.whl" -ARG TRITON_ASCEND_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/triton_ascend/triton_ascend-3.2.0.dev2025112116-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl" -ARG BISHENG_NAME="Ascend-BiSheng-toolkit_aarch64_20251121.run" -ARG BISHENG_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/triton_ascend/${BISHENG_NAME}" +ARG PTA_URL_ARM64="https://gitcode.com/Ascend/pytorch/releases/download/v7.3.0-pytorch2.8.0/torch_npu-2.8.0.post2-cp311-cp311-manylinux_2_28_aarch64.whl" +ARG PTA_URL_AMD64="https://gitcode.com/Ascend/pytorch/releases/download/v7.3.0-pytorch2.8.0/torch_npu-2.8.0.post2-cp311-cp311-manylinux_2_28_x86_64.whl" ARG SGLANG_TAG=main ARG ASCEND_CANN_PATH=/usr/local/Ascend/ascend-toolkit ARG SGLANG_KERNEL_NPU_TAG=main @@ -21,6 +22,16 @@ ARG SGLANG_KERNEL_NPU_TAG=main ARG PIP_INSTALL="python3 -m pip install --no-cache-dir" ARG DEVICE_TYPE +RUN if [ "$TARGETARCH" = "amd64" ]; then \ + echo "Using x86_64 dependencies"; \ + echo "PTA_URL=$PTA_URL_AMD64" >> /etc/environment_new; \ + elif [ "$TARGETARCH" = "arm64" ]; then \ + echo "Using aarch64 dependencies"; \ + echo "PTA_URL=$PTA_URL_ARM64" >> /etc/environment_new; \ + else \ + echo "Unsupported TARGETARCH: $TARGETARCH"; 
exit 1; \ + fi + WORKDIR /workspace # Define environments @@ -31,6 +42,7 @@ RUN if [ -n "$APTMIRROR" ];then sed -i "s|.*.ubuntu.com|$APTMIRROR|g" /etc/apt/s # Install development tools and utilities RUN apt-get update -y && apt upgrade -y && apt-get install -y \ + unzip \ build-essential \ cmake \ vim \ @@ -45,6 +57,8 @@ RUN apt-get update -y && apt upgrade -y && apt-get install -y \ openssl \ libssl-dev \ pkg-config \ + libgl1-mesa-glx \ + libgl1-mesa-dri \ ca-certificates \ && rm -rf /var/cache/apt/* \ && rm -rf /var/lib/apt/lists/* \ @@ -57,44 +71,34 @@ ENV LC_ALL=en_US.UTF-8 ### Install MemFabric -RUN ${PIP_INSTALL} memfabric-hybrid==1.0.0 +RUN ${PIP_INSTALL} memfabric-hybrid==1.0.5 ### Install SGLang Model Gateway RUN ${PIP_INSTALL} sglang-router ### Install PyTorch and PTA -RUN (${PIP_INSTALL} torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION} --index-url https://download.pytorch.org/whl/cpu) \ +RUN . /etc/environment_new && \ + (${PIP_INSTALL} torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION} --index-url https://download.pytorch.org/whl/cpu) \ && (${PIP_INSTALL} ${PTA_URL}) -# TODO: install from pypi released triton-ascend -RUN (${PIP_INSTALL} pybind11) \ - && (${PIP_INSTALL} ${TRITON_ASCEND_URL}) +## Install triton-ascend +RUN (${PIP_INSTALL} pybind11 triton-ascend) -# Install SGLang -RUN git clone https://github.com/sgl-project/sglang --branch $SGLANG_TAG && \ - (cd sglang/python && rm -rf pyproject.toml && mv pyproject_npu.toml pyproject.toml && ${PIP_INSTALL} -v .[all_npu]) && \ - rm -rf sglang +# Install SGLang (editable mode to preserve source and git history) +RUN git clone https://github.com/sgl-project/sglang --branch $SGLANG_TAG /sgl-workspace/sglang && \ + cd /sgl-workspace/sglang/python && rm -rf pyproject.toml && mv pyproject_npu.toml pyproject.toml && \ + ${PIP_INSTALL} -v -e .[all_npu] # Install Deep-ep # pin wheel to 0.45.1 ref: https://github.com/pypa/wheel/issues/662 -RUN ${PIP_INSTALL} wheel==0.45.1 && git clone 
--branch $SGLANG_KERNEL_NPU_TAG https://github.com/sgl-project/sgl-kernel-npu.git \ - && export LD_LIBRARY_PATH=${ASCEND_CANN_PATH}/latest/runtime/lib64/stub:$LD_LIBRARY_PATH && \ - source ${ASCEND_CANN_PATH}/set_env.sh && \ - cd sgl-kernel-npu && \ - bash build.sh \ - && ${PIP_INSTALL} output/deep_ep*.whl output/sgl_kernel_npu*.whl \ +RUN ${PIP_INSTALL} wheel==0.45.1 pybind11 pyyaml decorator scipy attrs psutil \ + && mkdir sgl-kernel-npu \ + && cd sgl-kernel-npu \ + && wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/${SGLANG_KERNEL_NPU_TAG}/sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-torch2.8.0-py311-cann${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \ + && unzip sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-torch2.8.0-py311-cann${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \ + && ${PIP_INSTALL} deep_ep*.whl sgl_kernel_npu*.whl \ && cd .. && rm -rf sgl-kernel-npu \ - && cd "$(python3 -m pip show deep-ep | awk '/^Location:/ {print $2}')" && ln -s deep_ep/deep_ep_cpp*.so - -# Install CustomOps -RUN wget https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/ops/CANN-custom_ops-8.2.0.0-$DEVICE_TYPE-linux.aarch64.run && \ - chmod a+x ./CANN-custom_ops-8.2.0.0-$DEVICE_TYPE-linux.aarch64.run && \ - ./CANN-custom_ops-8.2.0.0-$DEVICE_TYPE-linux.aarch64.run --quiet --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp && \ - wget https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/ops/custom_ops-1.0.$DEVICE_TYPE-cp311-cp311-linux_aarch64.whl && \ - ${PIP_INSTALL} ./custom_ops-1.0.$DEVICE_TYPE-cp311-cp311-linux_aarch64.whl - -# Install Bisheng -RUN wget -O "${BISHENG_NAME}" "${BISHENG_URL}" && chmod a+x "${BISHENG_NAME}" && "./${BISHENG_NAME}" --install && rm "${BISHENG_NAME}" + && cd "$(python3 -m pip show deep-ep | awk '/^Location:/ {print $2}')" && ln -sf deep_ep/deep_ep_cpp*.so CMD ["/bin/bash"] diff --git a/docker/rocm.Dockerfile b/docker/rocm.Dockerfile index e364da905030..52451682e805 100644 --- a/docker/rocm.Dockerfile +++ b/docker/rocm.Dockerfile @@ 
-1,36 +1,47 @@ # Usage (to build SGLang ROCm docker image): -# docker build --build-arg SGL_BRANCH=v0.5.6.post2 --build-arg GPU_ARCH=gfx942 -t v0.5.6.post2-rocm630-mi30x -f rocm.Dockerfile . -# docker build --build-arg SGL_BRANCH=v0.5.6.post2 --build-arg GPU_ARCH=gfx942-rocm700 -t v0.5.6.post2-rocm700-mi30x -f rocm.Dockerfile . -# docker build --build-arg SGL_BRANCH=v0.5.6.post2 --build-arg GPU_ARCH=gfx950 -t v0.5.6.post2-rocm700-mi35x -f rocm.Dockerfile . - +# docker build --build-arg SGL_BRANCH=v0.5.10.post1 --build-arg GPU_ARCH=gfx942 -t v0.5.10.post1-rocm700-mi30x -f rocm.Dockerfile . +# docker build --build-arg SGL_BRANCH=v0.5.10.post1 --build-arg GPU_ARCH=gfx942-rocm720 -t v0.5.10.post1-rocm720-mi30x -f rocm.Dockerfile . +# docker build --build-arg SGL_BRANCH=v0.5.10.post1 --build-arg GPU_ARCH=gfx950 -t v0.5.10.post1-rocm700-mi35x -f rocm.Dockerfile . +# docker build --build-arg SGL_BRANCH=v0.5.10.post1 --build-arg GPU_ARCH=gfx950-rocm720 -t v0.5.10.post1-rocm720-mi35x -f rocm.Dockerfile . + +# Usage (to build SGLang ROCm + Mori docker image): +# --build-arg NIC_BACKEND=ainic is no longer needed, since the new MoRI JIT auto-detects the NIC on the target. +# The build-arg is kept so users can select the desired NIC support; current choices: [ainic, bnxt]. +# If this arg is not set, NIC auto-detection is used. On a target with more than one type of +# RDMA NIC installed (rare), override with the runtime env MORI_DEVICE_NIC = "bnxt"|"ionic"|"mlx5" +# docker build --build-arg SGL_BRANCH=v0.5.10.post1 --build-arg GPU_ARCH=gfx942 --build-arg ENABLE_MORI=1 -t v0.5.10.post1-rocm700-mi30x -f rocm.Dockerfile . +# docker build --build-arg SGL_BRANCH=v0.5.10.post1 --build-arg GPU_ARCH=gfx942-rocm720 --build-arg ENABLE_MORI=1 -t v0.5.10.post1-rocm720-mi30x -f rocm.Dockerfile . +# docker build --build-arg SGL_BRANCH=v0.5.10.post1 --build-arg GPU_ARCH=gfx950 --build-arg ENABLE_MORI=1 -t v0.5.10.post1-rocm700-mi35x -f rocm.Dockerfile .
+# docker build --build-arg SGL_BRANCH=v0.5.10.post1 --build-arg GPU_ARCH=gfx950-rocm720 --build-arg ENABLE_MORI=1 -t v0.5.10.post1-rocm720-mi35x -f rocm.Dockerfile . # Default base images -ARG BASE_IMAGE_942="rocm/sgl-dev:vllm20250114" -ARG BASE_IMAGE_942_ROCM700="rocm/sgl-dev:rocm7-vllm-20250904" +ARG BASE_IMAGE_942="rocm/sgl-dev:rocm7-vllm-20250904" +ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1" ARG BASE_IMAGE_950="rocm/sgl-dev:rocm7-vllm-20250904" +ARG BASE_IMAGE_950_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1" # This is necessary for scope purpose ARG GPU_ARCH=gfx950 # =============================== -# Base image 942 with rocm630 and args +# Base image 942 with rocm700 and args FROM $BASE_IMAGE_942 AS gfx942 ENV BUILD_VLLM="0" -ENV BUILD_TRITON="1" +ENV BUILD_TRITON="0" ENV BUILD_LLVM="0" ENV BUILD_AITER_ALL="1" ENV BUILD_MOONCAKE="1" -ENV AITER_COMMIT="v0.1.4" +ENV AITER_COMMIT_DEFAULT="a6bb499375849eec45d68c5ccaebc8865fd422c0" # =============================== -# Base image 942 and args -FROM $BASE_IMAGE_942_ROCM700 AS gfx942-rocm700 +# Base image 942 with rocm720 and args +FROM $BASE_IMAGE_942_ROCM720 AS gfx942-rocm720 ENV BUILD_VLLM="0" -ENV BUILD_TRITON="0" +ENV BUILD_TRITON="1" ENV BUILD_LLVM="0" ENV BUILD_AITER_ALL="1" ENV BUILD_MOONCAKE="1" -ENV AITER_COMMIT="v0.1.9.post1" +ENV AITER_COMMIT_DEFAULT="a6bb499375849eec45d68c5ccaebc8865fd422c0" # =============================== # Base image 950 and args @@ -38,9 +49,20 @@ FROM $BASE_IMAGE_950 AS gfx950 ENV BUILD_VLLM="0" ENV BUILD_TRITON="0" ENV BUILD_LLVM="0" -ENV BUILD_AITER_ALL="0" +ENV BUILD_AITER_ALL="1" +ENV BUILD_MOONCAKE="1" +ENV AITER_COMMIT_DEFAULT="a6bb499375849eec45d68c5ccaebc8865fd422c0" + +# =============================== +# Base image 950 with rocm720 and args +FROM $BASE_IMAGE_950_ROCM720 AS gfx950-rocm720 +ENV BUILD_VLLM="0" +ENV BUILD_TRITON="1" +ENV BUILD_LLVM="0" +ENV BUILD_AITER_ALL="1" ENV BUILD_MOONCAKE="1" 
-ENV AITER_COMMIT="v0.1.9.post1" +ENV AITER_COMMIT_DEFAULT="a6bb499375849eec45d68c5ccaebc8865fd422c0" + # =============================== # Chosen arch and args FROM ${GPU_ARCH} @@ -48,15 +70,21 @@ FROM ${GPU_ARCH} # This is necessary for scope purpose, again ARG GPU_ARCH=gfx950 ENV GPU_ARCH_LIST=${GPU_ARCH%-*} +ENV PYTORCH_ROCM_ARCH=gfx942;gfx950 ARG SGL_REPO="https://github.com/sgl-project/sglang.git" ARG SGL_DEFAULT="main" ARG SGL_BRANCH=${SGL_DEFAULT} -ARG TRITON_REPO="https://github.com/ROCm/triton.git" -ARG TRITON_COMMIT="improve_fa_decode_3.0.0" +# Version override for setuptools_scm (used in nightly builds) +ARG SETUPTOOLS_SCM_PRETEND_VERSION="" + +ARG TRITON_REPO="https://github.com/triton-lang/triton.git" +ARG TRITON_COMMIT="42270451990532c67e69d753fbd026f28fcc4840" ARG AITER_REPO="https://github.com/ROCm/aiter.git" +ARG AITER_COMMIT="" +ENV AITER_COMMIT="${AITER_COMMIT:-${AITER_COMMIT_DEFAULT}}" ARG LLVM_REPO="https://github.com/jrbyrnes/llvm-project.git" ARG LLVM_BRANCH="MainOpSelV2" @@ -65,23 +93,92 @@ ARG LLVM_COMMIT="6520ace8227ffe2728148d5f3b9872a870b0a560" ARG MOONCAKE_REPO="https://github.com/kvcache-ai/Mooncake.git" ARG MOONCAKE_COMMIT="b6a841dc78c707ec655a563453277d969fb8f38d" -ARG TILELANG_REPO="https://github.com/HaiShaw/tilelang.git" -ARG TILELANG_BRANCH="dsv32-mi35x" -ARG TILELANG_COMMIT="ae938cf885743f165a19656d1122ad42bb0e30b8" - -ARG TILELANG_GFX942_REPO="https://github.com/tile-ai/tilelang.git" -ARG TILELANG_GFX942_BRANCH="main" -ARG TILELANG_GFX942_COMMIT="2d8d3676eda18bd3d8e6fa783399ff96d3cd4ded" +ARG TILELANG_REPO="https://github.com/tile-ai/tilelang.git" +ARG TILELANG_COMMIT="a55a82302bf7f3c5af635b5c9146f728185cc900" ARG FHT_REPO="https://github.com/jeffdaily/fast-hadamard-transform.git" ARG FHT_BRANCH="rocm" ARG FHT_COMMIT="46efb7d776d38638fc39f3c803eaee3dd7016bd1" + +ARG ENABLE_MORI=0 +ARG NIC_BACKEND=none + +ARG MORI_REPO="https://github.com/ROCm/mori.git" +ARG MORI_COMMIT="v1.1.1" + +# AMD AINIC apt repo settings +ARG 
AINIC_VERSION=1.117.5-a-38 +ARG UBUNTU_CODENAME=jammy + +# Optional Ubuntu mirror override + apt hardening. +# - UBUNTU_MIRROR is empty by default (no behaviour change for local builds). +# When set (typically in CI), all http://*archive.ubuntu.com and +# http://*security.ubuntu.com entries in /etc/apt/sources.list are rewritten +# to point at the given base URL, e.g. +# --build-arg UBUNTU_MIRROR=https://archive.ubuntu.com +# --build-arg UBUNTU_MIRROR=https://tw.archive.ubuntu.com +# --build-arg UBUNTU_MIRROR=http://internal-cache.example.com +# This mirrors the pattern already used in docker/Dockerfile (NVIDIA) and +# docker/npu.Dockerfile, and lets CI runners that cannot reach Canonical's +# port-80 mirror IPs still complete `apt-get update`. +# - The 80-net-hardening apt config adds retries + per-request timeout so that +# transient mirror flakes don't immediately fail a build (apt's default is 0 +# retries). +ARG UBUNTU_MIRROR= USER root +RUN if [ -n "$UBUNTU_MIRROR" ]; then \ + sed -i "s|http://[^[:space:]/]*archive.ubuntu.com|$UBUNTU_MIRROR|g" /etc/apt/sources.list && \ + sed -i "s|http://[^[:space:]/]*security.ubuntu.com|$UBUNTU_MIRROR|g" /etc/apt/sources.list; \ + fi && \ + printf 'Acquire::Retries "5";\nAcquire::http::Timeout "30";\nAcquire::https::Timeout "30";\n' \ + > /etc/apt/apt.conf.d/80-net-hardening + +# Fix hipDeviceGetName returning empty string in ROCm 7.0 docker images. +# The ROCm 7.0 base image is missing libdrm-amdgpu-common which provides the +# amdgpu.ids device-ID-to-marketing-name mapping file. +# ROCm 7.2 base images already ship these packages, so this step is skipped. 
+# See https://github.com/ROCm/ROCm/issues/5992 +RUN set -eux; \ + case "${GPU_ARCH}" in \ + *rocm720*) \ + echo "ROCm 7.2 (GPU_ARCH=${GPU_ARCH}): libdrm-amdgpu packages already present, skipping"; \ + ;; \ + *) \ + echo "ROCm 7.0 (GPU_ARCH=${GPU_ARCH}): installing libdrm-amdgpu packages"; \ + curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key \ + | gpg --dearmor -o /etc/apt/keyrings/amdgpu-graphics.gpg \ + && echo 'deb [arch=amd64,i386 signed-by=/etc/apt/keyrings/amdgpu-graphics.gpg] https://repo.radeon.com/graphics/7.0/ubuntu jammy main' \ + > /etc/apt/sources.list.d/amdgpu-graphics.list \ + && apt-get update \ + && apt-get install -y --no-install-recommends \ + libdrm-amdgpu-common \ + libdrm-amdgpu-amdgpu1 \ + libdrm2-amdgpu \ + && rm -rf /var/lib/apt/lists/* \ + && cp /opt/amdgpu/share/libdrm/amdgpu.ids /usr/share/libdrm/amdgpu.ids; \ + ;; \ + esac + + # Install some basic utilities RUN python -m pip install --upgrade pip && pip install setuptools_scm RUN apt-get purge -y sccache; python -m pip uninstall -y sccache; rm -f "$(which sccache)" +# Install AMD SMI Python package from ROCm distribution. +# The ROCm 7.2 base image (rocm/pytorch) does not pre-install this package. +RUN set -eux; \ + case "${GPU_ARCH}" in \ + *rocm720*) \ + echo "ROCm 7.2 flavor detected from GPU_ARCH=${GPU_ARCH}"; \ + cd /opt/rocm/share/amd_smi \ + && python3 -m pip install --no-cache-dir . 
\ + ;; \ + *) \ + echo "Not rocm720 (GPU_ARCH=${GPU_ARCH}), skip amdsmi installation"; \ + ;; \ + esac + WORKDIR /sgl-workspace # ----------------------- @@ -99,44 +196,29 @@ RUN if [ "$BUILD_LLVM" = "1" ]; then \ # ----------------------- # AITER +# Unset setuptools_scm override so AITER gets its own version (AITER_COMMIT), not SGLang's +# (SETUPTOOLS_SCM_PRETEND_VERSION is set later for SGLang nightly builds and would otherwise +# leak into AITER's version when AITER uses setuptools_scm) +ENV SETUPTOOLS_SCM_PRETEND_VERSION= RUN pip uninstall -y aiter RUN git clone ${AITER_REPO} \ && cd aiter \ && git checkout ${AITER_COMMIT} \ - && git submodule update --init --recursive + && git submodule update --init --recursive \ + && pip install -r requirements.txt + RUN cd aiter \ && echo "[AITER] GPU_ARCH=${GPU_ARCH}" \ && if [ "$BUILD_AITER_ALL" = "1" ] && [ "$BUILD_LLVM" = "1" ]; then \ - sh -c "HIP_CLANG_PATH=/sgl-workspace/llvm-project/build/bin/ PREBUILD_KERNELS=1 GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \ + sh -c "HIP_CLANG_PATH=/sgl-workspace/llvm-project/build/bin/ PREBUILD_KERNELS=1 GPU_ARCHS=$GPU_ARCH_LIST python setup.py build_ext --inplace" \ + && sh -c "HIP_CLANG_PATH=/sgl-workspace/llvm-project/build/bin/ GPU_ARCHS=$GPU_ARCH_LIST pip install --config-settings editable_mode=compat -e ."; \ elif [ "$BUILD_AITER_ALL" = "1" ]; then \ - sh -c "PREBUILD_KERNELS=1 GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \ + sh -c "PREBUILD_KERNELS=1 GPU_ARCHS=$GPU_ARCH_LIST python setup.py build_ext --inplace" \ + && sh -c "GPU_ARCHS=$GPU_ARCH_LIST pip install --config-settings editable_mode=compat -e ."; \ else \ - sh -c "GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \ - fi - -# ----------------------- -# Triton -RUN if [ "$BUILD_TRITON" = "1" ]; then \ - pip uninstall -y triton \ - && git clone ${TRITON_REPO} \ - && cd triton \ - && git checkout ${TRITON_COMMIT} \ - && cd python \ - && python setup.py install; \ - fi - -# ----------------------- -# Build 
vLLM -ARG VLLM_REPO="https://github.com/ROCm/vllm.git" -ARG VLLM_BRANCH="9f6b92db47c3444b7a7d67451ba0c3a2d6af4c2c" -RUN if [ "$BUILD_VLLM" = "1" ]; then \ - git clone ${VLLM_REPO} \ - && cd vllm \ - && git checkout ${VLLM_BRANCH} \ - && python -m pip install -r requirements/rocm.txt \ - && python setup.py clean --all \ - && python setup.py develop; \ - fi + sh -c "GPU_ARCHS=$GPU_ARCH_LIST pip install --config-settings editable_mode=compat -e ."; \ + fi \ + && echo "export PYTHONPATH=/sgl-workspace/aiter:\${PYTHONPATH}" >> /etc/bash.bashrc # ----------------------- # Build Mooncake @@ -165,6 +247,10 @@ RUN if [ "$BUILD_MOONCAKE" = "1" ]; then \ # Build SGLang ARG BUILD_TYPE=all +# Set version for setuptools_scm if provided (for nightly builds). Only pass in the SGLang +# pip install RUN so it does not affect AITER, sgl-model-gateway, TileLang, FHT, MORI, etc. +ARG SETUPTOOLS_SCM_PRETEND_VERSION + RUN pip install IPython \ && pip install orjson \ && pip install python-multipart \ @@ -188,9 +274,9 @@ RUN git clone ${SGL_REPO} \ && cd .. 
\ && rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml \ && if [ "$BUILD_TYPE" = "srt" ]; then \ - python -m pip --no-cache-dir install -e "python[srt_hip,diffusion_hip]"; \ + export SETUPTOOLS_SCM_PRETEND_VERSION="${SETUPTOOLS_SCM_PRETEND_VERSION}" && python -m pip --no-cache-dir install -e "python[srt_hip,diffusion_hip]"; \ else \ - python -m pip --no-cache-dir install -e "python[all_hip,diffusion_hip]"; \ + export SETUPTOOLS_SCM_PRETEND_VERSION="${SETUPTOOLS_SCM_PRETEND_VERSION}" && python -m pip --no-cache-dir install -e "python[all_hip]"; \ fi RUN python -m pip cache purge @@ -204,12 +290,16 @@ RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \ ENV PATH="/root/.cargo/bin:${PATH}" RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y \ && rustc --version && cargo --version +ENV CARGO_BUILD_JOBS=4 # Build and install sgl-model-gateway -RUN python3 -m pip install --no-cache-dir setuptools-rust \ +RUN python3 -m pip install --no-cache-dir maturin \ + && sed -i -E 's|^(smg-[a-zA-Z-]+)\s*=\s*"~1\.0\.0"|\1 = "=1.0.0"|' \ + /sgl-workspace/sglang/sgl-model-gateway/Cargo.toml \ + && grep -E '^smg-' /sgl-workspace/sglang/sgl-model-gateway/Cargo.toml \ && cd /sgl-workspace/sglang/sgl-model-gateway/bindings/python \ - && cargo build --release \ - && python3 -m pip install --no-cache-dir . 
\ + && ulimit -n 65536 && maturin build --release --features vendored-openssl --out dist \ + && python3 -m pip install --force-reinstall dist/*.whl \ && rm -rf /root/.cache # ----------------------- @@ -219,82 +309,72 @@ ENV LIBGL_ALWAYS_INDIRECT=1 RUN echo "LC_ALL=en_US.UTF-8" >> /etc/environment RUN /bin/bash -lc 'set -euo pipefail; \ - # Build TileLang for gfx950 and gfx942-rocm700 - if [ "${GPU_ARCH:-}" != "gfx950" ] && [ "${GPU_ARCH:-}" != "gfx942-rocm700" ]; then \ - echo "[TileLang] Skipping (GPU_ARCH=${GPU_ARCH:-unset})"; \ - exit 0; \ - fi; \ echo "[TileLang] Building TileLang for ${GPU_ARCH}"; \ - if [ "$GPU_ARCH" = "gfx950" ]; then \ - \ - # System dependencies (NO llvm-dev to avoid llvm-config-16 shadowing) - apt-get update && apt-get install -y --no-install-recommends \ - build-essential git wget curl ca-certificates gnupg \ - libgtest-dev libgmock-dev \ - libprotobuf-dev protobuf-compiler libgflags-dev libsqlite3-dev \ - python3 python3-dev python3-setuptools python3-pip \ - gcc libtinfo-dev zlib1g-dev libedit-dev libxml2-dev \ - cmake ninja-build pkg-config libstdc++6 \ - && rm -rf /var/lib/apt/lists/*; \ - \ - # Build GoogleTest static libs (Ubuntu package ships sources only) - cmake -S /usr/src/googletest -B /tmp/build-gtest -DBUILD_GTEST=ON -DBUILD_GMOCK=ON -DCMAKE_BUILD_TYPE=Release && \ - cmake --build /tmp/build-gtest -j"$(nproc)" && \ - cp -v /tmp/build-gtest/lib/*.a /usr/lib/x86_64-linux-gnu/ && \ - rm -rf /tmp/build-gtest; \ - \ - # Keep setuptools < 80 (compat with base image) - python3 -m pip install --upgrade "setuptools>=77.0.3,<80" wheel cmake ninja && \ - python3 -m pip cache purge || true; \ - \ - # Locate ROCm llvm-config; fallback to installing LLVM 18 if missing - LLVM_CONFIG_PATH=""; \ - for p in /opt/rocm/llvm/bin/llvm-config /opt/rocm/llvm-*/bin/llvm-config /opt/rocm-*/llvm*/bin/llvm-config; do \ - if [ -x "$p" ]; then LLVM_CONFIG_PATH="$p"; break; fi; \ - done; \ - if [ -z "$LLVM_CONFIG_PATH" ]; then \ - echo "[TileLang] ROCm 
llvm-config not found; installing LLVM 18..."; \ - curl -fsSL https://apt.llvm.org/llvm.sh -o /tmp/llvm.sh; \ - chmod +x /tmp/llvm.sh; \ - /tmp/llvm.sh 18; \ - LLVM_CONFIG_PATH="$(command -v llvm-config-18)"; \ - if [ -z "$LLVM_CONFIG_PATH" ]; then echo "ERROR: llvm-config-18 not found after install"; exit 1; fi; \ - fi; \ - echo "[TileLang] Using LLVM_CONFIG at: $LLVM_CONFIG_PATH"; \ - export PATH="$(dirname "$LLVM_CONFIG_PATH"):/usr/local/bin:${PATH}"; \ - export LLVM_CONFIG="$LLVM_CONFIG_PATH"; \ - \ - # Optional shim for tools that expect llvm-config-16 - mkdir -p /usr/local/bin && \ - printf "#!/usr/bin/env bash\nexec \"%s\" \"\$@\"\n" "$LLVM_CONFIG_PATH" > /usr/local/bin/llvm-config-16 && \ - chmod +x /usr/local/bin/llvm-config-16; \ - \ - # TVM Python bits need Cython - python3 -m pip install --no-cache-dir "cython>=0.29.36,<3.0"; \ - \ - # Clone + pin TileLang (bundled TVM), then build - git clone --recursive --branch "${TILELANG_BRANCH}" "${TILELANG_REPO}" /opt/tilelang && \ - cd /opt/tilelang && \ - git fetch --depth=1 origin "${TILELANG_COMMIT}" || true && \ - git checkout -f "${TILELANG_COMMIT}" && \ - git submodule update --init --recursive && \ - export CMAKE_ARGS="-DLLVM_CONFIG=${LLVM_CONFIG} ${CMAKE_ARGS:-}" && \ - bash ./install_rocm.sh; \ - else \ - # Build GoogleTest static libs (Ubuntu package ships sources only) - apt-get install -y libgtest-dev libgmock-dev && \ - cmake -S /usr/src/googletest -B /tmp/build-gtest -DBUILD_GTEST=ON -DBUILD_GMOCK=ON -DCMAKE_BUILD_TYPE=Release && \ - cmake --build /tmp/build-gtest -j && \ - cp -v /tmp/build-gtest/lib/*.a /usr/lib/x86_64-linux-gnu/ && \ - rm -rf /tmp/build-gtest; \ - # Build TileLang for gfx942-rocm700 - git clone --branch "${TILELANG_GFX942_BRANCH}" "${TILELANG_GFX942_REPO}" /opt/tilelang && \ - cd /opt/tilelang && \ - git checkout -f "${TILELANG_GFX942_COMMIT}" && \ - git submodule update --init --recursive && \ - sed -i "/^[[:space:]]*\"torch/d" pyproject.toml && \ - USE_ROCM=1 USE_CUDA=0 pip 
install -e . -v ; \ - fi' + # System dependencies (NO llvm-dev to avoid llvm-config-16 shadowing) + apt-get update && apt-get install -y --no-install-recommends \ + build-essential git wget curl ca-certificates gnupg \ + libgtest-dev libgmock-dev \ + libprotobuf-dev protobuf-compiler libgflags-dev libsqlite3-dev \ + python3 python3-dev python3-setuptools python3-pip python3-apt \ + gcc libtinfo-dev zlib1g-dev libedit-dev libxml2-dev vim \ + cmake ninja-build pkg-config libstdc++6 software-properties-common \ + && rm -rf /var/lib/apt/lists/*; \ + \ + # Prefer the container venv + VENV_PY="/opt/venv/bin/python"; \ + VENV_PIP="/opt/venv/bin/pip"; \ + if [ ! -x "$VENV_PY" ]; then VENV_PY="python3"; fi; \ + if [ ! -x "$VENV_PIP" ]; then VENV_PIP="pip3"; fi; \ + \ + # Build GoogleTest static libs (Ubuntu package ships sources only) + cmake -S /usr/src/googletest -B /tmp/build-gtest -DBUILD_GTEST=ON -DBUILD_GMOCK=ON -DCMAKE_BUILD_TYPE=Release && \ + cmake --build /tmp/build-gtest -j"$(nproc)" && \ + cp -v /tmp/build-gtest/lib/*.a /usr/lib/x86_64-linux-gnu/ && \ + rm -rf /tmp/build-gtest; \ + \ + # Keep setuptools < 80 (compat with base image) + "$VENV_PIP" install --upgrade "setuptools>=77.0.3,<80" wheel cmake ninja scikit-build-core && \ + "$VENV_PIP" cache purge || true; \ + \ + # Locate ROCm llvm-config; fallback to installing LLVM 18 if missing + LLVM_CONFIG_PATH=""; \ + for p in /opt/rocm/llvm/bin/llvm-config /opt/rocm/llvm-*/bin/llvm-config /opt/rocm-*/llvm*/bin/llvm-config; do \ + if [ -x "$p" ]; then LLVM_CONFIG_PATH="$p"; break; fi; \ + done; \ + if [ -z "$LLVM_CONFIG_PATH" ]; then \ + echo "[TileLang] ROCm llvm-config not found; installing LLVM 18..."; \ + curl -fsSL https://apt.llvm.org/llvm-snapshot.gpg.key | gpg --dearmor -o /etc/apt/keyrings/llvm.gpg; \ + echo "deb [signed-by=/etc/apt/keyrings/llvm.gpg] http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main" > /etc/apt/sources.list.d/llvm.list; \ + apt-get update; \ + apt-get install -y 
--no-install-recommends llvm-18; \ + rm -rf /var/lib/apt/lists/*; \ + LLVM_CONFIG_PATH="$(command -v llvm-config-18)"; \ + if [ -z "$LLVM_CONFIG_PATH" ]; then echo "ERROR: llvm-config-18 not found after install"; exit 1; fi; \ + fi; \ + echo "[TileLang] Using LLVM_CONFIG at: $LLVM_CONFIG_PATH"; \ + export PATH="$(dirname "$LLVM_CONFIG_PATH"):/usr/local/bin:${PATH}"; \ + export LLVM_CONFIG="$LLVM_CONFIG_PATH"; \ + \ + # Optional shim for tools that expect llvm-config-16 + mkdir -p /usr/local/bin && \ + printf "#!/usr/bin/env bash\nexec \"%s\" \"\$@\"\n" "$LLVM_CONFIG_PATH" > /usr/local/bin/llvm-config-16 && \ + chmod +x /usr/local/bin/llvm-config-16; \ + \ + # TVM Python bits need Cython + z3 before configure. + # Pin z3-solver==4.15.4.0: 4.15.4.0 has a manylinux wheel; 4.15.5.0 has no wheel and builds from source (fails: C++20 needs GCC 14+, image has GCC 11). + "$VENV_PIP" install --no-cache-dir "cython>=0.29.36,<3.0" "apache-tvm-ffi @ git+https://github.com/apache/tvm-ffi.git@37d0485b2058885bf4e7a486f7d7b2174a8ac1ce" "z3-solver==4.15.4.0"; \ + \ + # Clone + pin TileLang (bundled TVM), then build + git clone --recursive "${TILELANG_REPO}" /opt/tilelang && \ + cd /opt/tilelang && \ + git fetch --depth=1 origin "${TILELANG_COMMIT}" || true && \ + git checkout -f "${TILELANG_COMMIT}" && \ + git submodule update --init --recursive && \ + export CMAKE_ARGS="-DUSE_CUDA=OFF -DUSE_ROCM=ON -DROCM_PATH=/opt/rocm -DLLVM_CONFIG=${LLVM_CONFIG} -DSKBUILD_SABI_VERSION= ${CMAKE_ARGS:-}" && \ + "$VENV_PIP" install -e . 
-v --no-build-isolation --no-deps; \ + if [ -f pyproject.toml ]; then sed -i "/^[[:space:]]*\"torch/d" pyproject.toml || true; fi; \ + "$VENV_PIP" cache purge || true; \ + "$VENV_PY" -c "import tilelang; print(tilelang.__version__)"' # ----------------------- # Hadamard-transform (HIP build) @@ -308,11 +388,185 @@ RUN /bin/bash -lc 'set -euo pipefail; \ # Python tools RUN python3 -m pip install --no-cache-dir \ py-spy \ - pre-commit + pre-commit \ + tabulate + +# ----------------------- +# MORI (optional) +RUN /bin/bash -lc 'set -euo pipefail; \ + if [ "${ENABLE_MORI}" != "1" ]; then \ + echo "[MORI] Skipping (ENABLE_MORI=${ENABLE_MORI})"; \ + exit 0; \ + fi; \ + echo "[MORI] Enabling MORI (NIC_BACKEND=${NIC_BACKEND})"; \ + \ + # Base deps for MORI build + apt-get update && apt-get install -y --no-install-recommends \ + build-essential \ + g++ \ + jq \ + libopenmpi-dev \ + libpci-dev \ + initramfs-tools \ + && rm -rf /var/lib/apt/lists/*; \ + \ + # NIC backend deps — mori auto-detects NIC at runtime (MORI_DEVICE_NIC env var override). + # Only vendor packages are installed here for dlopen (e.g. libionic.so); no compile-time flags needed. 
+ case "${NIC_BACKEND}" in \ + # default: install ainic and bnxt driver + none) \ + apt-get update && apt-get install -y --no-install-recommends ca-certificates curl gnupg apt-transport-https && \ + rm -rf /var/lib/apt/lists/* && mkdir -p /etc/apt/keyrings; \ + curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/amdainic.gpg; \ + echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/amdainic.gpg] https://repo.radeon.com/amdainic/pensando/ubuntu/${AINIC_VERSION} ${UBUNTU_CODENAME} main" \ + > /etc/apt/sources.list.d/amdainic.list; \ + apt-get update && apt-get install -y --no-install-recommends \ + libionic-dev \ + ionic-common \ + ; \ + rm -rf /var/lib/apt/lists/*; \ + install -m 0755 -d /etc/apt/keyrings \ + && curl -fsSL https://packages.broadcom.com/artifactory/api/security/keypair/PackagesKey/public -o /etc/apt/keyrings/broadcom-nic.asc \ + && chmod a+r /etc/apt/keyrings/broadcom-nic.asc \ + && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/broadcom-nic.asc] https://packages.broadcom.com/artifactory/ethernet-nic-debian-public jammy main" > /etc/apt/sources.list.d/broadcom-nic.list \ + && apt-get update \ + && apt-get install -y ibverbs-utils bnxt-rocelib=235.2.86.0 \ + && cp /usr/local/lib/x86_64-linux-gnu/libbnxt_re* /usr/local/lib/.
\ + ;; \ + # AMD NIC + ainic) \ + apt-get update && apt-get install -y --no-install-recommends ca-certificates curl gnupg apt-transport-https && \ + rm -rf /var/lib/apt/lists/* && mkdir -p /etc/apt/keyrings; \ + curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/amdainic.gpg; \ + echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/amdainic.gpg] https://repo.radeon.com/amdainic/pensando/ubuntu/${AINIC_VERSION} ${UBUNTU_CODENAME} main" \ + > /etc/apt/sources.list.d/amdainic.list; \ + apt-get update && apt-get install -y --no-install-recommends \ + libionic-dev \ + ionic-common \ + ; \ + rm -rf /var/lib/apt/lists/*; \ + ;; \ + bnxt) \ + echo "[MORI] Enabling Broadcom BNXT backend"; \ + apt-get update \ + && apt-get install -y --no-install-recommends ca-certificates curl \ + && install -m 0755 -d /etc/apt/keyrings \ + && curl -fsSL https://packages.broadcom.com/artifactory/api/security/keypair/PackagesKey/public -o /etc/apt/keyrings/broadcom-nic.asc \ + && chmod a+r /etc/apt/keyrings/broadcom-nic.asc \ + && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/broadcom-nic.asc] https://packages.broadcom.com/artifactory/ethernet-nic-debian-public jammy main" > /etc/apt/sources.list.d/broadcom-nic.list \ + && apt-get update \ + && apt-get install -y ibverbs-utils bnxt-rocelib=235.2.86.0 \ + && cp /usr/local/lib/x86_64-linux-gnu/libbnxt_re* /usr/local/lib/. \ + ;; \ + *) \ + echo "ERROR: unknown NIC_BACKEND=${NIC_BACKEND}. 
Use one of: none, ainic, bnxt"; \ + exit 2; \ + ;; \ + esac; \ + \ + # Build/install MORI + export MORI_GPU_ARCHS="${GPU_ARCH_LIST}"; \ + echo "[MORI] MORI_GPU_ARCHS=${MORI_GPU_ARCHS} NIC_BACKEND=${NIC_BACKEND}"; \ + rm -rf /sgl-workspace/mori; \ + git clone "${MORI_REPO}" /sgl-workspace/mori; \ + cd /sgl-workspace/mori; \ + git checkout "${MORI_COMMIT}"; \ + git submodule update --init --recursive; \ + python3 setup.py develop; \ + python3 -c "import os, torch; print(os.path.join(os.path.dirname(torch.__file__), \"lib\"))" > /etc/ld.so.conf.d/torch.conf; \ + ldconfig; \ + echo "export PYTHONPATH=/sgl-workspace/mori:\${PYTHONPATH}" >> /etc/bash.bashrc; \ + echo "[MORI] Done."' + +# ----------------------- +# Hot patch: torch-ROCm +# The artifact hardcoded the supported triton version to be 3.5.1. +# Rewrite the restriction directly. +ARG TORCH_ROCM_FILE="torch-2.9.1+rocm7.2.0.lw.git7e1940d4-cp310-cp310-linux_x86_64.whl" +RUN mkdir /tmp/whl && cd /tmp/whl \ + && export TORCH_ROCM_FILE="${TORCH_ROCM_FILE}" \ + && cat > hack.py <<"PY" +import zipfile, csv, os, re +from pathlib import Path + +fname = os.environ["TORCH_ROCM_FILE"] +in_whl = Path("/") / fname +out_whl = Path("/tmp")/ fname +work = Path("/tmp/whl") + +# 1) Extract +with zipfile.ZipFile(in_whl, "r") as z: + z.extractall(work) + +# 2) Locate dist-info and patch METADATA (edit this logic to match your exact line) +dist_info = next(work.glob("*.dist-info")) +meta = dist_info / "METADATA" +txt = meta.read_text(encoding="utf-8") + +# Example: replace one exact requirement form. +# Adjust the string to match what you actually see.
+pat = r"^Requires-Dist:\s*triton==3.5.1[^\s]*;" +txt2, n = re.subn(pat, r"triton>=3.5.1;", txt, flags=re.MULTILINE) +if txt2 == txt: + raise SystemExit("Did not find expected Requires-Dist line to replace in METADATA") +meta.write_text(txt2, encoding="utf-8") + +# 3) Hacky step: blank hash/size columns in RECORD +record = dist_info / "RECORD" +rows = [] +with record.open(newline="", encoding="utf-8") as f: + for r in csv.reader(f): + if not r: + continue + # keep filename, blank out hash and size + rows.append([r[0], "", ""]) +with record.open("w", newline="", encoding="utf-8") as f: + csv.writer(f).writerows(rows) + +# 4) Re-zip as a wheel +with zipfile.ZipFile(out_whl, "w", compression=zipfile.ZIP_DEFLATED) as z: + for p in work.rglob("*"): + if p.is_file(): + z.write(p, p.relative_to(work).as_posix()) + +print("Wrote", out_whl) +PY + +RUN cd /tmp/whl \ + && case "${GPU_ARCH}" in \ + *rocm720*) \ + echo "ROCm 7.2 flavor detected from GPU_ARCH=${GPU_ARCH}"; \ + python hack.py \ + && python3 -m pip install --force --no-deps /tmp/${TORCH_ROCM_FILE} \ + && rm -fr /tmp/whl /tmp/${TORCH_ROCM_FILE} \ + ;; \ + *) \ + echo "Not rocm720 (GPU_ARCH=${GPU_ARCH}), skip patch"; \ + ;; \ + esac + + +# ----------------------- +# Hot patch: Triton +# For ROCm 7.2, this custom build breaks pip dependency management, +# so future `pip install` will break the ROCm stack. +# A workaround for this is to reinstall the default triton +# wheel with the `rocm/pytorch` image in the root directory. +RUN if [ "$BUILD_TRITON" = "1" ]; then \ + pip uninstall -y triton \ + && apt install -y cmake \ + && git clone ${TRITON_REPO} triton-custom \ + && cd triton-custom \ + && git checkout ${TRITON_COMMIT} \ + && pip install -r python/requirements.txt \ + && pip install -e .; \ + fi # ----------------------- # Performance environment variable. 
+# Skip CuDNN compatibility check - not applicable for ROCm (uses MIOpen instead) +ENV SGLANG_DISABLE_CUDNN_CHECK=1 ENV HIP_FORCE_DEV_KERNARG=1 ENV HSA_NO_SCRATCH_RECLAIM=1 ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 @@ -325,10 +579,7 @@ ENV SGLANG_USE_AITER=1 ENV SGLANG_USE_ROCM700A=1 ENV NCCL_MIN_NCHANNELS=112 -ENV VLLM_FP8_PADDING=1 -ENV VLLM_FP8_ACT_PADDING=1 -ENV VLLM_FP8_WEIGHT_PADDING=1 -ENV VLLM_FP8_REDUCE_CONV=1 +ENV ROCM_QUICK_REDUCE_QUANTIZATION=INT8 ENV TORCHINDUCTOR_MAX_AUTOTUNE=1 ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 diff --git a/docker/sgl-deep-gemm.Dockerfile b/docker/sgl-deep-gemm.Dockerfile new file mode 100644 index 000000000000..0083d4ddc3da --- /dev/null +++ b/docker/sgl-deep-gemm.Dockerfile @@ -0,0 +1,35 @@ +ARG BASE_IMG=pytorch/manylinux2_28-builder +ARG CUDA_VERSION=13.0 + +FROM ${BASE_IMG}:cuda${CUDA_VERSION} + +ARG ARCH=x86_64 +ARG CUDA_VERSION=13.0 +ARG PYTHON_VERSION=3.12 +ARG PYTHON_TAG=cp312-cp312 +ARG TORCH_VER=2.11.0 +ARG TVM_FFI_VER=0.1.9 +ARG PIP_DEFAULT_INDEX=https://pypi.python.org/simple +ARG PYTORCH_MIRROR=download.pytorch.org + +ENV PYTHON_ROOT_PATH=/opt/python/${PYTHON_TAG} +ENV PATH=${PYTHON_ROOT_PATH}/bin:${PATH} + +RUN yum install -y --nogpgcheck git wget tar gcc gcc-c++ make \ + && yum clean all && rm -rf /var/cache/yum + +RUN set -eux; \ + if [ "${ARCH}" = "aarch64" ]; then _LIB=sbsa; else _LIB="${ARCH}"; fi; \ + mkdir -p /usr/lib/${ARCH}-linux-gnu/; \ + ln -sf /usr/local/cuda-${CUDA_VERSION}/targets/${_LIB}-linux/lib/stubs/libcuda.so /usr/lib/${ARCH}-linux-gnu/libcuda.so + +RUN --mount=type=cache,id=sgl-deep-gemm-pip,target=/root/.cache/pip \ + set -eux; \ + case "${CUDA_VERSION}" in \ + 13.0) CU_TAG=cu130 ;; \ + 12.9) CU_TAG=cu129 ;; \ + *) CU_TAG=cu130 ;; \ + esac; \ + ${PYTHON_ROOT_PATH}/bin/pip install torch==${TORCH_VER} --index-url https://${PYTORCH_MIRROR}/whl/${CU_TAG}; \ + ${PYTHON_ROOT_PATH}/bin/pip install --index-url ${PIP_DEFAULT_INDEX} \ + ninja setuptools wheel build numpy 
apache-tvm-ffi==${TVM_FFI_VER} diff --git a/docker/xeon.Dockerfile b/docker/xeon.Dockerfile index f793db49a9ef..98e443a1f023 100644 --- a/docker/xeon.Dockerfile +++ b/docker/xeon.Dockerfile @@ -4,12 +4,6 @@ SHELL ["/bin/bash", "-c"] ARG SGLANG_REPO=https://github.com/sgl-project/sglang.git ARG VER_SGLANG=main -ARG VER_TORCH=2.9.0 -ARG VER_TORCHVISION=0.24.0 -ARG VER_TORCHAUDIO=2.9.0 -ARG VER_TORCHAO=0.14.1 -ARG VER_TRITON=3.5.0 - RUN apt-get update && \ apt-get full-upgrade -y && \ DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \ @@ -46,8 +40,6 @@ RUN source $HOME/.local/bin/env && \ cd python && \ cp pyproject_cpu.toml pyproject.toml && \ uv pip install . && \ - uv pip install torch==${VER_TORCH} torchvision==${VER_TORCHVISION} torchaudio==${VER_TORCHAUDIO} torchao==${VER_TORCHAO} triton==${VER_TRITON} --force-reinstall && \ - uv pip install tabulate && \ cd ../sgl-kernel && \ cp pyproject_cpu.toml pyproject.toml && \ uv pip install . diff --git a/docker/xpu.Dockerfile b/docker/xpu.Dockerfile index 0fa726632fa7..feec566bb8ff 100644 --- a/docker/xpu.Dockerfile +++ b/docker/xpu.Dockerfile @@ -3,13 +3,13 @@ # Usage: docker build --build-arg UBUNTU_VERSION=24.04 --build-arg PYTHON_VERSION=3.10 -t sglang:xpu_kernel -f xpu.Dockerfile --no-cache . 
# Use Intel deep learning essentials base image with Ubuntu 24.04 -FROM intel/deep-learning-essentials:2025.2.2-0-devel-ubuntu24.04 +FROM intel/deep-learning-essentials:2025.3.2-0-devel-ubuntu24.04 # Avoid interactive prompts during package install ENV DEBIAN_FRONTEND=noninteractive # Define build arguments -ARG PYTHON_VERSION=3.10 +ARG PYTHON_VERSION=3.12 ARG SG_LANG_REPO=https://github.com/sgl-project/sglang.git ARG SG_LANG_BRANCH=main @@ -20,6 +20,18 @@ ARG SG_LANG_KERNEL_BRANCH=main RUN useradd -m -d /home/sdp -s /bin/bash sdp && \ chown -R sdp:sdp /home/sdp +USER root + +# Install the latest UMD driver for SYCL-TLA +RUN apt-get update && apt-get install -y software-properties-common && \ + add-apt-repository -y ppa:kobuk-team/intel-graphics && \ + apt-get update && \ + apt-get install -y \ + libze-intel-gpu1 libze1 intel-metrics-discovery intel-opencl-icd clinfo intel-gsc \ + intel-media-va-driver-non-free libmfx-gen1 libvpl2 libvpl-tools libva-glx2 va-driver-all vainfo \ + libze-dev intel-ocloc && \ + rm -rf /var/lib/apt/lists/* + # Switch to non-root user 'sdp' USER sdp @@ -38,28 +50,22 @@ RUN curl -fsSL -v -o miniforge.sh -O https://github.com/conda-forge/miniforge/re # Append environment activation to .bashrc for interactive shells echo ". /home/sdp/miniforge3/bin/activate; conda activate py${PYTHON_VERSION}; . /opt/intel/oneapi/setvars.sh; cd /home/sdp" >> /home/sdp/.bashrc -USER root -RUN apt-get update && apt install -y intel-ocloc - -# Switch back to user sdp -USER sdp - RUN --mount=type=secret,id=github_token \ cd /home/sdp && \ . /home/sdp/miniforge3/bin/activate && \ conda activate py${PYTHON_VERSION} && \ - pip3 install torch==2.9.0+xpu torchao torchvision torchaudio pytorch-triton-xpu==3.5.0 --index-url https://download.pytorch.org/whl/xpu + pip3 install torch==2.11.0+xpu torchao torchvision torchaudio==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu RUN --mount=type=secret,id=github_token \ cd /home/sdp && \ . 
/home/sdp/miniforge3/bin/activate && \ conda activate py${PYTHON_VERSION} && \ echo "Cloning ${SG_LANG_BRANCH} from ${SG_LANG_REPO}" && \ - git clone --branch ${SG_LANG_BRANCH} --single-branch ${SG_LANG_REPO} && \ + git clone --branch ${SG_LANG_BRANCH} --single-branch ${SG_LANG_REPO} sglang && \ cd sglang && cd python && \ cp pyproject_xpu.toml pyproject.toml && \ - pip install . && \ - pip install xgrammar --no-deps && \ + pip install . --extra-index-url https://download.pytorch.org/whl/xpu && \ + pip install --no-deps xgrammar==0.1.33 && \ pip install msgspec blake3 py-cpuinfo compressed_tensors gguf partial_json_parser einops tabulate --root-user-action=ignore && \ conda install libsqlite=3.48.0 -y && \ # Add environment setup commands to .bashrc again (in case it was overwritten) diff --git a/docs/Makefile b/docs/Makefile index 6b8792c42856..716160e56684 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -38,6 +38,46 @@ compile: echo "Total execution time: $${TOTAL_ELAPSED}s" >> logs/timing.log; \ echo "All Notebook execution timings:" && cat logs/timing.log +# Convert Notebook files to Markdown artifacts (no execution) +markdown: + @set -e; \ + echo "Exporting docs to Markdown..."; \ + mkdir -p "$(BUILDDIR)/html/markdown"; \ + \ + # 1) Copy .md and .rst files as-is; additionally convert .rst -> .md \ + find $(SOURCEDIR) -path "*/_build/*" -prune -o \( -name "*.md" -o -name "*.rst" \) -print0 | \ + parallel -0 -j3 --halt soon,fail=1 ' \ + SRC="{}"; \ + REL_DIR=$$(dirname "$$SRC"); \ + OUT_DIR="$(BUILDDIR)/html/markdown/$$REL_DIR"; \ + mkdir -p "$$OUT_DIR"; \ + cp -f "$$SRC" "$$OUT_DIR/"; \ + case "$$SRC" in \ + *.rst) \ + BASE=$$(basename "$$SRC" .rst); \ + pandoc -f rst -t gfm "$$SRC" -o "$$OUT_DIR/$$BASE.md" ;; \ + esac \ + ' || exit 1; \ + \ + # 2) Convert .ipynb -> .md \ + find $(SOURCEDIR) -path "*/_build/*" -prune -o -name "*.ipynb" -print0 | \ + parallel -0 -j3 --halt soon,fail=1 ' \ + NB_SRC="{}"; \ + REL_DIR=$$(dirname "$$NB_SRC"); \ + 
NB_NAME=$$(basename "$$NB_SRC"); \ + NB_BASE=$${NB_NAME%.ipynb}; \ + OUT_DIR="$(BUILDDIR)/html/markdown/$$REL_DIR"; \ + mkdir -p "$$OUT_DIR"; \ + jupyter nbconvert --to markdown "$$NB_SRC" \ + --output "$$NB_BASE.md" \ + --output-dir "$$OUT_DIR" \ + >/dev/null; \ + ' || exit 1; \ + \ + echo "Markdown artifacts written to: $(BUILDDIR)/html/markdown" + + + # Serve documentation with auto-build and live reload serve: @echo "Starting auto-build server at http://0.0.0.0:$(PORT)" diff --git a/docs/README.md b/docs/README.md index f4cb9ce46361..7764169b1c5e 100644 --- a/docs/README.md +++ b/docs/README.md @@ -9,11 +9,18 @@ Most documentation files are located under the `docs/` folder. ### Install Dependency +**Linux:** ```bash apt-get update && apt-get install -y pandoc parallel retry pip install -r requirements.txt ``` +**macOS:** +```bash +brew install pandoc parallel retry +pip install -r requirements.txt +``` + ### Update Documentation Update your Jupyter notebooks in the appropriate subdirectories under `docs/`. If you add new files, remember to update `index.rst` (or relevant `.rst` files) accordingly. @@ -45,7 +52,6 @@ find . -name '*.ipynb' -exec nbstripout {} \; # After these checks pass, push your changes and open a PR on your branch pre-commit run --all-files ``` ---- ## Documentation Style Guidelines @@ -55,3 +61,71 @@ pre-commit run --all-files - Reuse the launched server as much as possible to reduce server launch time. - Do not use absolute links (e.g., `https://docs.sglang.io/get_started/install.html`). Always prefer relative links (e.g., `../get_started/install.md`). - Follow the existing examples to learn how to launch a server, send a query and other common styles. + +## Documentation Build, Deployment, and CI + +The SGLang documentation pipeline is based on **Sphinx** and supports rendering Jupyter notebooks (`.ipynb`) into HTML/Markdown for web display. The detailed logic can be found in the [Makefile](./Makefile).
+ +### Notebook Execution (`make compile`) + +The `make compile` target is responsible for executing notebooks before rendering: + +* Finds all `.ipynb` files under `docs/` (excluding `_build/`) +* Executes notebooks in parallel using GNU Parallel, with a relatively small `--mem-fraction-static` +* Wraps execution with `retry` to reduce flaky failures +* Executes notebooks via `jupyter nbconvert --execute --inplace` +* Records execution timing in `logs/timing.log` + +This step ensures notebooks contain up-to-date outputs with each commit in the main branch before rendering. + +### Web Rendering (`make html`) + +After compilation, Sphinx builds the website: + +* Reads Markdown, reStructuredText, and Jupyter notebooks +* Renders them into HTML pages +* Outputs the website into: + +``` +docs/_build/html/ +``` + +This directory is the source for online documentation hosting. + +### Markdown Export (`make markdown`) + +To support downstream consumers, we add a **new Makefile target**: + +```bash +make markdown +``` + +This target: + +* Does **not modify** `make compile` +* Scans all `.ipynb` files (excluding `_build/`) +* Converts notebooks directly to Markdown using `jupyter nbconvert --to markdown` +* Writes Markdown artifacts into the existing build directory: + +``` +docs/_build/html/markdown/.md +``` + +Example: + +``` +docs/advanced_features/lora.ipynb +→ docs/_build/html/markdown/advanced_features/lora.md +``` + +### CI Execution + +In our [CI](https://github.com/sgl-project/sglang/blob/main/.github/workflows/release-docs.yml), the documentation pipeline first executes all notebooks and then renders HTML and Markdown by running: + +```bash +make compile # execute notebooks (ensure outputs are up to date) +make html # build website as usual +make markdown # export markdown artifacts into _build/html/markdown +``` + +Then, the compiled results are force-pushed to [sgl-project.io](https://github.com/sgl-project/sgl-project.github.io) for rendering.
In other words, sgl-project.io is push-only. All changes to the SGLang docs should be made directly in the SGLang main repo and then pushed to sgl-project.io. diff --git a/docs/_static/css/custom_log.css b/docs/_static/css/custom_log.css index 61f65d0199df..57d0cf6d1d8d 100644 --- a/docs/_static/css/custom_log.css +++ b/docs/_static/css/custom_log.css @@ -27,3 +27,27 @@ div.output_area.stderr { div.output_area.stdout { color: #d3d3d3 !important; } + +.sglang-docs-deprecation-banner { + background: #fff4cc; + border-bottom: 1px solid #d8a21f; + color: #2f2a1f; + font-size: 0.95rem; + line-height: 1.45; + overflow-wrap: anywhere; + padding: 0.75rem 1.25rem; + position: relative; + text-align: center; + z-index: 1030; +} + +.sglang-docs-deprecation-banner a { + color: #1f5fbf; + font-weight: 600; + text-decoration: underline; +} + +.sglang-docs-deprecation-banner a:focus, +.sglang-docs-deprecation-banner a:hover { + color: #143f80; +} diff --git a/docs/_static/image/dpa.png b/docs/_static/image/dpa.png new file mode 100644 index 000000000000..672e022186e4 Binary files /dev/null and b/docs/_static/image/dpa.png differ diff --git a/docs/_static/js/deprecation_banner.js b/docs/_static/js/deprecation_banner.js new file mode 100644 index 000000000000..87c8d73fad33 --- /dev/null +++ b/docs/_static/js/deprecation_banner.js @@ -0,0 +1,49 @@ +(function () { + "use strict"; + + var oldOrigin = "https://sgl-project.github.io"; + var newOrigin = "https://docs.sglang.io"; + + function buildNewDocsUrl() { + var href = window.location.href; + + if (href === oldOrigin || href.indexOf(oldOrigin + "/") === 0) { + return href.replace(oldOrigin, newOrigin); + } + + return newOrigin + window.location.pathname + window.location.search + window.location.hash; + } + + function addDeprecationBanner() { + if (document.getElementById("sglang-docs-deprecation-banner")) { + return; + } + + var link = document.createElement("a"); + link.href = buildNewDocsUrl(); + link.textContent = link.href; + + var
banner = document.createElement("div"); + banner.id = "sglang-docs-deprecation-banner"; + banner.className = "sglang-docs-deprecation-banner"; + banner.setAttribute("role", "status"); + banner.setAttribute("aria-live", "polite"); + + var prefix = document.createTextNode( + "This legacy documentation site will be deprecated soon. Please use the new SGLang documentation at " + ); + var suffix = document.createTextNode("."); + + banner.appendChild(prefix); + banner.appendChild(link); + banner.appendChild(suffix); + + document.body.insertBefore(banner, document.body.firstChild); + } + + if (document.readyState === "loading") { + document.addEventListener("DOMContentLoaded", addDeprecationBanner); + } else { + addDeprecationBanner(); + } +})(); diff --git a/docs/advanced_features/adaptive_speculative_decoding.md b/docs/advanced_features/adaptive_speculative_decoding.md new file mode 100644 index 000000000000..64a31f3d8de7 --- /dev/null +++ b/docs/advanced_features/adaptive_speculative_decoding.md @@ -0,0 +1,156 @@ +# Adaptive Speculative Decoding + +Adaptive speculative decoding lets SGLang adjust `speculative_num_steps/speculative_num_draft_tokens` at runtime instead of keeping a single fixed value for the whole server lifetime. +It is designed for workloads whose accept length changes over time, where one static step count is rarely optimal. + +## Current support + +- Only `--speculative-algorithm EAGLE` +- Only `--speculative-eagle-topk 1` +- If either condition is not met, SGLang falls back to static speculative settings + +## Why adaptive steps help + +`speculative_num_steps` controls how many draft-model autoregressive steps run in each speculative round. In practice, the best value depends on the current workload. + +- If `num_steps` is too small, the draft model could have produced more accepted tokens, but the round stops too early. 
+- If `num_steps` is too large, the draft model produces many candidate tokens that the target model rejects, so extra draft work is wasted. + +Real traffic often moves between high-acceptance and low-acceptance phases, so one fixed step count is usually a compromise. Adaptive mode tries to follow the workload instead of hard-coding a single global `num_steps`. + +## Design overview + +The adaptive mechanism has three pieces: + +- `AdaptiveSpeculativeParams`: the EMA-based policy +- `SpecRuntimeState`: the per-tier runtime state bundle +- `AdaptiveController`: the coordinator that chooses a tier and activates the matching runtime state + +At startup, SGLang pre-builds one runtime state per candidate tier. By default, the candidate tiers are `candidate_steps = [1, 3, 7]`. + +```text +┌──────────────────────────────────────────────────────────┐ +│ SpecRuntimeState │ +│ │ +│ speculative_num_steps / speculative_num_draft_tokens │ +│ │ +│ ┌────────────────┐ ┌────────────────┐ ┌──────────────┐ │ +│ │ Draft stage │ │ Verify stage │ │ Extend stage │ │ +│ │ │ │ │ │ │ │ +│ │ attn_backend │ │ attn_backend │ │ attn_backend │ │ +│ │ cuda_graph │ │ cuda_graph │ │ cuda_graph │ │ +│ └────────────────┘ └────────────────┘ └──────────────┘ │ +└──────────────────────────────────────────────────────────┘ +``` + +This matters because `CudaGraphRunner` is shape-dependent. Each candidate tier owns its own graph and backend state, so runtime switching is a reference swap, not an online graph recapture. 
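The "one pre-built state per tier, switch by reference" idea can be sketched in a few lines. This is only an illustration: `TierState` and `build_state_pool` are hypothetical names, not the real `SpecRuntimeState` or `AdaptiveController` API, and the `num_steps + 1` draft-token count is an assumption for `topk = 1`.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class TierState:
    """Hypothetical stand-in for SpecRuntimeState (not the real class)."""

    speculative_num_steps: int
    speculative_num_draft_tokens: int


def build_state_pool(candidate_steps: Sequence[int]) -> Dict[int, TierState]:
    # Assumption: with topk = 1, each tier drafts num_steps + 1 tokens.
    # In the real system each entry would also own its attention backends
    # and captured CUDA graphs for the draft, verify, and extend stages.
    return {steps: TierState(steps, steps + 1) for steps in candidate_steps}


pool = build_state_pool([1, 3, 7])  # built once at startup
active = pool[3]                    # initial tier
active = pool[7]                    # tier switch: only a reference is rebound
```

Because every tier's state already exists, the switch itself does no capture work; the cost is paid up front in startup time and graph memory.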
+ +## Runtime flow + +The adaptive update happens after verify and affects the next round, not the current one: + +```text +┌─────────────────────────────────────────────────────────────────────┐ +│ EAGLEWorker.forward_batch_generation() — decode path │ +│ │ +│ ① draft(batch) │ +│ │ draft model multi-step generation with current tier │ +│ v │ +│ ② verify(batch, spec_info) │ +│ │ target model tree verification │ +│ │ → produces accept_length_per_req │ +│ v │ +│ ③ forward_draft_extend_after_decode(batch) │ +│ │ draft model KV-cache catch-up │ +│ v │ +│ ④ adaptive_controller.on_verify_complete(accept_lengths) │ +│ │ │ +│ │ update EMA, apply warmup / interval / hysteresis gates │ +│ │ if tier changed, select a pre-built state from pool │ +│ v │ +│ worker.apply_runtime_state(state) │ +│ │ +│ Tier switch happens after the current round completes. │ +│ Backends and CUDA graphs are never swapped mid-round. │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +## How the policy decides + +After each verify pass, SGLang reads the accepted draft length per request, computes the batch average, smooths it with an exponential moving average (EMA), and switches among the pre-built candidate tiers `[1, 3, 7]` by default. + +The decision logic is intentionally conservative: + +- `warmup_batches` skips the first few batches +- `update_interval` avoids switching every batch +- `down_hysteresis` and `up_hysteresis` reduce oscillation + +Conceptually, the policy probes one step beyond the observed acceptance: + +```text +target_steps ≈ clamp(round(ema_accept_len) + 1, min(candidate_steps), max(candidate_steps)) +``` + +So if recent requests consistently accept more drafted tokens, the policy tends to move up. If they start rejecting earlier, it tends to move down. + +## Usage + +`--speculative-adaptive-config` is optional, but the speculative setup still needs to be valid for adaptive mode. 
+ +```bash +python3 -m sglang.launch_server \ + --model meta-llama/Llama-2-7b-chat-hf \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \ + --speculative-eagle-topk 1 \ + --speculative-num-steps 3 \ + --speculative-num-draft-tokens 4 \ + --speculative-adaptive +``` + +If you want to override the defaults, add `--speculative-adaptive-config /path/to/adaptive_spec.json`. + +Example config: + +```json +{ + "candidate_steps": [1, 3, 7], + "ema_alpha": 0.2, + "warmup_batches": 10, + "update_interval": 5 +} +``` + +## Config file reference + +The config file is optional. Any omitted keys use defaults. + +| Key | Default | Meaning | +|---|---|---| +| `candidate_steps` | `[1, 3, 7]` | Discrete `speculative_num_steps` tiers that adaptive mode can switch between | +| `ema_alpha` | `0.2` | EMA smoothing factor for accepted draft length | +| `update_interval` | `5` | Recompute interval, in verify batches, after warmup | +| `warmup_batches` | `10` | Number of verify batches to observe before switching | +| `down_hysteresis` | `-0.25` | Extra margin before moving to a smaller step | +| `up_hysteresis` | `0.0` | Extra margin before moving to a larger step | + +The initial `--speculative-num-steps` is snapped to the nearest value in `candidate_steps`. 
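The arithmetic behind the policy can be sketched as follows. This is a minimal illustration under stated assumptions, not SGLang's `AdaptiveSpeculativeParams` code: the helper names are hypothetical, the EMA is assumed to take the common form `alpha * new + (1 - alpha) * old`, ties in snapping are assumed to go to the smaller tier, and the warmup, interval, and hysteresis gates are left out.

```python
from typing import Sequence


def snap_to_candidate(value: float, candidate_steps: Sequence[int]) -> int:
    """Return the candidate tier closest to `value`.

    Assumption: ties resolve to the smaller tier (min() keeps the first hit).
    """
    return min(sorted(candidate_steps), key=lambda step: abs(step - value))


def update_ema(prev_ema: float, batch_avg_accept_len: float, ema_alpha: float = 0.2) -> float:
    """Blend the newest batch average into the running estimate."""
    return ema_alpha * batch_avg_accept_len + (1.0 - ema_alpha) * prev_ema


def target_steps(ema_accept_len: float, candidate_steps: Sequence[int]) -> int:
    """Probe one step beyond the observed acceptance, then pick a tier."""
    # Note: Python's round() uses banker's rounding at exact .5 values.
    probe = round(ema_accept_len) + 1
    clamped = max(min(candidate_steps), min(probe, max(candidate_steps)))
    # Assumption: the clamped value is mapped to the nearest pre-built tier.
    return snap_to_candidate(clamped, candidate_steps)


candidates = [1, 3, 7]
ema = update_ema(prev_ema=2.0, batch_avg_accept_len=4.0)  # 0.2 * 4.0 + 0.8 * 2.0
next_tier = target_steps(ema, candidates)
initial_tier = snap_to_candidate(4, candidates)  # e.g. --speculative-num-steps 4
```

With the default `ema_alpha` of `0.2`, a single unusually good or bad batch moves the estimate only a fifth of the way, which is why the tier changes slowly even before the hysteresis margins are applied.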
+ +## Monitoring + +You can inspect the active tier and acceptance metric via `/server_info`: + +```bash +curl -s http://127.0.0.1:30000/server_info | jq '.internal_states[0] | {speculative_num_steps, avg_spec_accept_length}' +``` + +- `speculative_num_steps` is the current active tier +- `avg_spec_accept_length` helps explain whether the server is likely to move up or down + +## Tuning tips + +- Start with the default candidate tiers `[1, 3, 7]` +- Use fewer tiers if you want lower startup and graph-memory overhead +- Increase `ema_alpha` to react faster, or lower it for more stability +- Increase `warmup_batches` or `update_interval` if tier switching is too noisy +- If your workload is already stable and one static setting is well tuned, adaptive mode may not help much diff --git a/docs/advanced_features/attention_backend.md b/docs/advanced_features/attention_backend.md index 12649c305c11..98d07d31a258 100644 --- a/docs/advanced_features/attention_backend.md +++ b/docs/advanced_features/attention_backend.md @@ -19,15 +19,15 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu |---------------------------------|-----------------------------|------------------|-----------------|-----------------|-----------------|--------------------|----------------| | **FlashInfer** | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | | **FA3 (FlashAttention 3)** | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | -| **FA4 (FlashAttention 4)** | 128 | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | -| **Triton** | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | +| **FA4 (FlashAttention 4)** | 128 | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | +| **Triton** | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | **Torch Native (SDPA)** | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | | **FlexAttention (PyTorch)** | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | | **TRTLLM MHA** | 16, 32 or 64 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | | **Dual Chunk FlashAttention** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | -| **AITER (ROCm)** | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | +| **AITER (ROCm)** | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | | **Wave (ROCm)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ 
| ❌ | -| **Ascend (NPU)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | +| **Ascend (NPU)** | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ | | **Intel XPU** | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | | **Intel AMX (CPU)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | @@ -41,7 +41,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu | **TRTLLM MLA (Blackwell)** | 32 or 64 | ✅ | ✅ | ✅ | ✅ | ❌ | | **FA3 (FlashAttention 3)** | n/a | ❌ | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) | | **Triton** | n/a | ❌ | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) | -| **FA4** | 1 | ❌ | ✅ | ❌ | ❌ | ❌ | +| **FA4** | 1 | ❌ | ✅ | ✅ | ❌ | ❌ | | **Ascend MLA (NPU)** | 128 | ❌ | ❌ | ❌ | ❌ | ❌ | ```{note} @@ -49,8 +49,12 @@ Multimodal attention is selected by `--mm-attention-backend`. The "MultiModal" c ``` ```{note} -- FlashAttention 4 is prefill-only for now. -- NSA is specifically designed for [DeepSeek V3.2 DSA](https://lmsys.org/blog/2025-09-29-deepseek-V32/). +- FlashAttention 4 supports both prefill and decode on SM90 (Hopper) and SM100 (Blackwell). FA4 MLA supports `page_size = 1`; FA4 MHA requires `page_size = 128`. On SM100, this is auto-enforced by the server; on SM90, users must set `--page-size 128` manually. +- NSA is specifically designed for [DeepSeek V3.2 DSA](https://lmsys.org/blog/2025-09-29-deepseek-V32/). See the [DSA Attention Backend (NSA)](#dsa-attention-backend-nsa) section and [DeepSeek V3.2 deployment guide](../basic_usage/deepseek_v32.md) for details. +``` + +```{warning} +**FA4 on Hopper (SM90):** FA4 decode speed decreases as sequence length grows due to lack of SplitKV support. At batch=1 compared to FA3 on H100: ~-10% at 2K tokens, ~-18% at 4K, ~-31% at 8K, ~-49% at 16K. Larger batch sizes reduce the gap (e.g., batch=8: ~-2% at 2K, ~-8% at 4K). Blackwell (SM100) is not affected. ``` ```{note} @@ -61,8 +65,16 @@ For the KV4 FA4 scenario, FA4 requires using a different --decode-attention-back Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. 
`topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths. ``` +```{note} +**Speculative Decoding V2 (Spec V2):** Spec V2 uses overlap scheduling (`SGLANG_ENABLE_SPEC_V2=True`) that benefits various attention backends. Requires `--speculative-eagle-topk 1` and currently applies to EAGLE and EAGLE3. + +**Verified backends:** TRTLLM MLA, TRTLLM MHA, FA3, Ascend (NPU), Triton. + +**Limited support:** FlashInfer can run under Spec V2, but its plan stream (used for split-KV optimization) introduces a synchronization point that limits overlap benefits. +``` + ```{tip} -Page size controls how many tokens are grouped into a KV cache block. For the prefix cache to take effect, the number of tokens must fill at least one complete page. For example, if your prompt is only 32 tokens and `page_size = 64`, it won't fill a complete page and cannot be matched in the prefix cache (pages cannot be padded). With 65 tokens and `page_size = 64`, only the first page of 64 tokens will be cached and matched; the remaining 1 token is discarded. Use `page_size = 1` for maximum prefix reuse (token-level matching). +Page size controls how many tokens are grouped into a KV cache block. For the prefix cache to take effect, the number of tokens must fill at least one complete page. For example, if your prompt is only 32 tokens and `page_size = 64`, it won't fill a complete page and cannot be matched in the prefix cache (pages cannot be padded). With 65 tokens and `page_size = 64`, only the first page of 64 tokens will be cached and matched; the remaining 1 token is discarded. Use `page_size = 1` for maximum prefix reuse (token-level matching). Note that higher page sizes generally improve attention kernel performance, so prefer `page_size > 1` when prefix cache reuse is not critical. 
``` Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128). @@ -73,6 +85,46 @@ MLA page-size constraints: - Cutlass MLA: page_size = 128. - TRTLLM MLA: page_size ∈ {32, 64}. +### GDN Attention Backends + +GDN (Gated Delta Network) is a linear attention mechanism with O(n) complexity, used in hybrid models that alternate GDN linear attention layers with standard full attention layers. GDN is **not** selected via `--attention-backend`; it is automatically activated when the model architecture requires it (e.g., Qwen 3.5, Qwen 3 Next, Jet Nemotron, Jet VLM). + +The GDN linear attention layers have their own kernel backends, selected via `--linear-attn-backend` (default: `triton`). You can override the kernel per phase with `--linear-attn-decode-backend` and `--linear-attn-prefill-backend`. + +| **Backend** | **Decode** | **Prefill / Extend** | **Spec Decoding (Target Verify)** | +|--------------------------|------------|----------------------|-----------------------------------| +| **Triton (CUDA)** | ✅ | ✅ | ✅ | +| **Triton (AMD/ROCm)** | ✅ | ✅ | ✅ | +| **Triton (NPU)** | ✅ | ✅ | ❌ | +| **Triton (CPU)** | ✅ | ✅ | ❌ | +| **CuTe DSL (CUDA only)**| ✅ | ❌ | ❌ | + +```{important} +GDN models are hybrid: the full-attention layers still require a standard `--attention-backend`. Platform constraints for the full-attention backend on hybrid GDN models: +- **Blackwell (e.g., B200)**: `triton`, `trtllm_mha`, or `fa4` only. +- **NPU (Ascend)**: `ascend` only. +- **AMD (ROCm)**: `triton` recommended. +- **Other CUDA (Hopper, Ampere, etc.)**: auto-selection works; no special constraints. 
+``` + +### DSA Attention Backend (NSA) + +DSA (Deepseek Sparse Attention) is a native sparse attention mechanism used by [DeepSeek V3.2](https://lmsys.org/blog/2025-09-29-deepseek-V32/). It is activated automatically when the model architecture requires it and is selected via `--attention-backend nsa`. + +Internally, the NSA backend dispatches to different sub-backends for prefill and decode phases. You can override these with `--nsa-prefill-backend` and `--nsa-decode-backend`: + +| **Sub-backend** | **Prefill** | **Decode** | **Notes** | +|-----------------------|-------------|------------|-----------------------------------------------| +| **flashmla_sparse** | ✅ | ✅ | Default prefill on Hopper and Blackwell (bf16) | +| **flashmla_kv** | ✅ | ✅ | Default decode for FP8 on Blackwell with DP | +| **flashmla_auto** | ✅ | ❌ | Auto-selects flashmla_sparse or flashmla_kv based on kv_cache_dtype | +| **fa3** | ✅ | ✅ | Default decode on Hopper (bf16) | +| **trtllm** | ✅ | ✅ | Default decode on Blackwell (bf16); default for both on Blackwell without DP | +| **tilelang** | ✅ | ✅ | Default on AMD (ROCm) | +| **aiter** | ✅ | ✅ | AMD-specific kernel library (requires aiter package) | + +For deployment examples, see the [DeepSeek V3.2 deployment guide](../basic_usage/deepseek_v32.md). + ### Hybrid attention (different backends for prefill vs decode) (Experimental) ```{warning} @@ -124,7 +176,7 @@ If the `--attention-backend` argument is not specified, SGLang automatically sel **2. MLA Models (e.g., DeepSeek V3)** - **Hopper**: Defaults to `fa3` (requires CUDA 12.3+). -- **Blackwell**: Defaults to `trtllm_mla`. +- **Blackwell**: Defaults to `flashinfer`; `trtllm_mla` is auto-selected for DeepSeek V3 models specifically. - **Other Architectures**: Defaults to `triton`. 
@@ -202,8 +254,34 @@ python3 -m sglang.launch_server \ --trust-remote-code ``` +- TRTLLM MHA (Optimized for Blackwell Architecture, e.g., B200) +```bash +python3 -m sglang.launch_server \ + --tp 4 \ + --model Qwen/Qwen3.5-35B-A3B-FP8 \ + --attention-backend trtllm_mha \ + --trust-remote-code +``` + +- TRTLLM MHA (XQA backend) (Optimized for SM90 and SM120, e.g., H20, H200, 5090) + Note that TRTLLM XQA backend only works well for pagesize 64. +```bash +python3 -m sglang.launch_server \ + --tp 4 \ + --model Qwen/Qwen3.5-35B-A3B-FP8 \ + --decode-attention-backend trtllm_mha \ + --trust-remote-code +``` + - FlashAttention 4 (MHA & MLA) ```bash +# FA4 for both prefill and decode on SM90/SM100 +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \ + --attention-backend fa4 \ + --page-size 128 \ + --trust-remote-code + python3 -m sglang.launch_server \ --tp 8 \ --model deepseek-ai/DeepSeek-R1 \ @@ -267,24 +345,28 @@ To add a new attention backend, you can learn from the existing backends (`python/sglang/srt/layers/attention/triton_backend.py`, `python/sglang/srt/layers/attention/flashattention_backend.py`) and follow the steps below. +```{note} +Linear attention kernel backends (GDN, KDA) follow a different pattern. They implement `LinearAttnKernelBase` in `python/sglang/srt/layers/attention/linear/kernels/` and are dispatched by `GDNKernelDispatcher` / `KDAKernelDispatcher` rather than registered via `@register_attention_backend`. +``` + 1. Run without cuda graph. 
Support the two forward functions - - forward_extend - - Will be used for prefill, prefill with KV cache, and target verification - - It will be called once per layer - - forward_decode - - Will be used for normal decode, and draft decode - - It will be called once per layer - - init_forward_metadata - - Initialize the class and common metadata shared by all layers - - Call the plan function for optimizations like split_kv - - It will be called once per forward +- forward_extend + - Will be used for prefill, prefill with KV cache, and target verification + - It will be called once per layer +- forward_decode + - Will be used for normal decode, and draft decode + - It will be called once per layer +- init_forward_metadata + - Initialize the class and common metadata shared by all layers + - Call the plan function for optimizations like split_kv + - It will be called once per forward 2. Run with cuda graph. It has two phases (capture and replay) and you need to implement three functions - - init_cuda_graph_state - - It will be called once during life time - - Create all common shared buffers - - init_forward_metadata_capture_cuda_graph - - It will be called before capturing a cuda graph - - It is similar to init_forward_metadata but write the medatada to some pre-defined buffers - - init_forward_metadata_replay_cuda_graph - - It will be called before replaying a cuda graph - - This function is in the critical path and needs to be fast +- init_cuda_graph_state + - It will be called once during its lifetime + - Create all common shared buffers +- init_forward_metadata_capture_cuda_graph + - It will be called before capturing a cuda graph + - It is similar to init_forward_metadata but writes the metadata to pre-defined buffers +- init_forward_metadata_replay_cuda_graph + - It will be called before replaying a cuda graph + - This function is in the critical path and needs to be fast diff --git a/docs/advanced_features/breakable_cuda_graph.md
b/docs/advanced_features/breakable_cuda_graph.md new file mode 100644 index 000000000000..4fb2c090c459 --- /dev/null +++ b/docs/advanced_features/breakable_cuda_graph.md @@ -0,0 +1,139 @@ +# Breakable CUDA Graph + +## Motivation + +Standard CUDA graphs capture an entire forward pass as a single, opaque graph. This is great for performance, but creates two problems: + +1. **Debugging is hard.** When something goes wrong inside a captured graph (wrong outputs, numerical mismatches, crashes), there is no way to step through the operations or insert print statements because the graph replays as a monolithic unit. + +2. **Some ops are incompatible.** Certain operations — dynamic control flow, host-device synchronization, JIT compilation, or ops that change behavior across iterations — cannot be captured into a CUDA graph at all. Today, the only workaround is to disable CUDA graphs entirely, which sacrifices the kernel launch overhead savings for the rest of the model. + +**Breakable CUDA Graph** solves both problems by allowing graph breaks to be inserted at specific points. The computation is split into multiple captured graph segments with eager (non-graph) execution in between. This preserves most of the CUDA graph performance benefit while allowing targeted operations to run outside the graph. + +## Usage + +### Debug Mode: Run Everything Eagerly + +The simplest use case is debugging. The `--debug-cuda-graph` flag wraps the entire decode forward pass in a graph break, so every operation runs eagerly while still going through the full CUDA graph capture/replay code path. This lets you debug CUDA graph issues without changing model code. + +```bash +python -m sglang.launch_server \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --debug-cuda-graph +``` + +This mode is intended for debugging only — it eliminates the performance benefit of CUDA graphs since every op runs eagerly. 
+ +### Selective Graph Breaks in Model Code + +For production use, you can mark specific functions as "non-graphable" using the `@eager_on_graph` decorator. During CUDA graph capture, these functions run eagerly between captured graph segments. Outside of capture, they behave normally. + +```python +from sglang.srt.model_executor.breakable_cuda_graph.breakable_cuda_graph import eager_on_graph + +@eager_on_graph(enable=True) +def my_dynamic_op(x): + # This op is incompatible with CUDA graph capture + return some_dynamic_operation(x) +``` + +You can also insert a bare graph break (no computation) using the `break_graph()` helper: + +```python +from sglang.srt.model_executor.breakable_cuda_graph.breakable_cuda_graph import break_graph + +def forward(self, x): + x = self.layer1(x) + break_graph() # force a segment split here + x = self.layer2(x) + return x +``` + +To enable breakable CUDA graph at the environment level (without debug mode), set the environment variable: + +```bash +export SGLANG_USE_BREAKABLE_CUDA_GRAPH=1 +python -m sglang.launch_server \ + --model meta-llama/Llama-3.1-8B-Instruct +``` + +### Server Args + +| Argument | Default | Description | +|---|---|---| +| `--debug-cuda-graph` | `False` | Enable debug/eager mode. Wraps the entire forward pass in a graph break so every op runs eagerly through the capture/replay path. | +| `SGLANG_USE_BREAKABLE_CUDA_GRAPH` | `0` | Environment variable. Enables breakable CUDA graph without debug mode. Required for `@eager_on_graph` decorators to take effect. | + +## How It Works + +### Capture + +Breakable CUDA graph extends PyTorch's `torch.cuda.CUDAGraph` by splitting a single capture into multiple segments separated by graph breaks. + +During capture, the flow is: + +``` +Begin capture (segment 1) + ... graphable ops ... + @eager_on_graph function encountered: + 1. End current capture segment + 2. Run the function eagerly (allocates output tensors) + 3. Record the function for later replay + 4. 
Begin new capture segment + ... more graphable ops ... +End capture (segment N) +``` + +Each segment is independently instantiated as a CUDA graph executable. The non-graph functions and their argument references are stored for replay. + +### Replay + +During replay: + +``` +For each segment i: + 1. Launch CUDA graph segment i + 2. Run the recorded non-graph function i eagerly +Launch final CUDA graph segment +``` + +The non-graph functions are re-invoked with the same tensor references as capture time. Since these references point to the CUDA graph's static input/output buffers, they see updated values on each replay. + +### Output Writeback + +When a non-graph function produces output during replay, the result must be written back into the same tensor buffers that downstream graph segments reference. The mechanism handles: + +- **Plain tensors**: In-place `copy_()` into the original buffer. +- **Structured outputs** (dataclasses, objects with tensor attributes): Tensor fields are copied in-place; non-tensor fields are replaced. +- **Dicts of tensors**: Tensor values are copied in-place; non-tensor values are replaced. + +### Stream Fork/Join Tracking + +Some models fork work onto secondary CUDA streams (e.g., for overlapped computation). Breakable CUDA graph hooks `torch.cuda.Stream.wait_stream` to track which streams are forked from the capture stream. When a graph break occurs, all forked streams are automatically joined back before ending the capture segment, and re-forked after beginning the next segment. + +## Compatibility + +- **NVIDIA CUDA only.** Breakable CUDA graph is not supported on ROCm/HIP or other non-CUDA platforms. On unsupported platforms, `--debug-cuda-graph` is automatically disabled with a warning. +- **Requires `cuda-python`.** The `cuda.bindings` package must be installed (`pip install cuda-python`). +- **Not compatible with memory saver mode.** Cannot be used together with `SGLANG_MEMORY_SAVER_CUDA_GRAPH`. 
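The replay ordering described under "Replay" can be simulated without a GPU. The sketch below is a toy model only: plain Python callables stand in for captured graph segments and recorded eager functions, and `replay` is a hypothetical helper rather than part of `BreakableCUDAGraph`.

```python
from typing import Callable, List


def replay(segments: List[Callable[[], None]], eager_fns: List[Callable[[], None]]) -> None:
    """Replay N graph segments with the N - 1 recorded eager functions between them."""
    # Each graph break separates two segments, so there is one more segment
    # than there are recorded eager functions.
    assert len(segments) == len(eager_fns) + 1
    for launch_segment, run_eager in zip(segments, eager_fns):
        launch_segment()  # stands in for launching graph segment i
        run_eager()       # stands in for re-invoking the non-graph function i
    segments[-1]()        # final segment after the last break


trace: List[str] = []
segments = [lambda i=i: trace.append(f"graph segment {i}") for i in range(3)]
eager_fns = [lambda i=i: trace.append(f"eager fn {i}") for i in range(2)]
replay(segments, eager_fns)
```

Two graph breaks produce three segments, so the trace alternates segment, eager function, segment, eager function, segment. This is also why each additional break costs one extra graph launch plus one eager call.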
+ +## Performance + +When no graph breaks are inserted, breakable CUDA graph has minimal overhead compared to standard CUDA graph — the capture/replay path is nearly identical. + +Each graph break adds: +- One `cudaGraphLaunch` call (to replay the segment before the break) +- One eager Python function call +- One `cudaStreamBeginCapture` / `cudaStreamEndCapture` pair during capture + +For typical use cases with a small number of graph breaks, the overhead is negligible compared to the saved kernel launch overhead from the captured segments. + +## Code Reference + +| File | Description | +|---|---| +| `python/sglang/srt/model_executor/breakable_cuda_graph/breakable_cuda_graph.py` | Core implementation: `eager_on_graph`, `BreakableCUDAGraph`, `BreakableCUDAGraphCapture` | +| `python/sglang/srt/model_executor/breakable_cuda_graph/cuda_utils.py` | CUDA runtime binding utilities | +| `python/sglang/srt/model_executor/cuda_graph_runner.py` | Integration with the main CUDA graph runner | +| `python/sglang/srt/server_args.py` | `--debug-cuda-graph` flag and environment variable handling | +| `python/sglang/srt/environ.py` | `SGLANG_USE_BREAKABLE_CUDA_GRAPH` environment variable definition | diff --git a/docs/advanced_features/dp_dpa_smg_guide.md b/docs/advanced_features/dp_dpa_smg_guide.md new file mode 100644 index 000000000000..9ec5df64856e --- /dev/null +++ b/docs/advanced_features/dp_dpa_smg_guide.md @@ -0,0 +1,373 @@ +# DP, DPA and SGLang DP Router + +This guide explains the difference between Data Parallelism (DP) and Data Parallelism Attention (DPA), how to enable each mode correctly, and how to use the SGLang Model Gateway (SMG) for production-grade DP deployments. + +## Data Parallelism (DP) + +**Data Parallelism (DP)** is the most common parallelism strategy that replicates the entire model across multiple GPU sets and processes different batches of requests in parallel. Each GPU set handles independent requests. 
With the dedicated routing strategies in SGLang Model Gateway, introduced later in this guide, the throughput of your serving system can scale nearly linearly with the number of replicas. + +### Key characteristics + +- Each replica has a full copy of the model +- Requests are distributed/scattered across replicas +- No inter-replica communication during one request's inference (for simple DP) + +## Data Parallelism Attention (DPA) + +**Data Parallelism Attention (DPA)**, also known as DP Attention, is an advanced parallelism strategy. While DPA provides the most significant benefits for **Multi-Head Latent Attention (MLA)** models (such as DeepSeek, MiniMax, Kimi-K2), it also supports **standard attention models** like Qwen. + +### The Problem with Tensor Parallelism for MLA Models + +The most common parallelism strategy for inference is **Tensor Parallelism (TP)**. However, TP might not be the most efficient strategy for certain models. For example, DeepSeek models use MLA and only have **one KV head**. If we use tensor parallelism on 8 GPUs, it will lead to: + +- **Duplicated KV cache** across all GPUs +- **Unwanted memory usage** that limits batch size +- **Reduced throughput** due to memory constraints + +### How DPA Works + +DPA addresses these limitations by applying **data parallelism specifically to the attention component**. + + + + + + +
+DPA + EP Architecture + + +**Each DP replica:** + +- Processes different batches independently (can be in different forward modes: prefill, decode, or idle) +- Maintains its own KV cache (no duplication) +- Enables significantly larger batch sizes due to memory savings + +**Communication patterns in DPA + EP:** +- **All2All (Dispatch)**: Routes tokens to expert sub-groups based on gating decisions +- **All2All (Combine)**: Gathers computed results from experts back to original token positions + +
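To make the "no duplication" point concrete, the following back-of-the-envelope sketch compares how many distinct tokens a group of GPUs can keep in KV cache when every GPU stores the same KV entries (the single-KV-head TP case described earlier) versus when each DP replica caches only its own batches. The function name and the per-GPU budget of 100,000 tokens are hypothetical values chosen for illustration, and the model deliberately ignores weights, activations, and every other memory consumer.

```python
def distinct_cached_tokens(num_gpus: int, tokens_per_gpu: int, duplicated: bool) -> int:
    """Count the distinct tokens whose KV entries fit across all GPUs.

    tokens_per_gpu is a hypothetical per-GPU KV budget, expressed in tokens.
    """
    if duplicated:
        # TP with a single KV head: every GPU stores the same KV entries,
        # so adding GPUs does not add KV capacity.
        return tokens_per_gpu
    # DPA: each replica stores KV only for its own batches.
    return num_gpus * tokens_per_gpu


tp_capacity = distinct_cached_tokens(8, 100_000, duplicated=True)
dpa_capacity = distinct_cached_tokens(8, 100_000, duplicated=False)
print(f"TP (duplicated KV): {tp_capacity} tokens")
print(f"DPA (per-replica KV): {dpa_capacity} tokens")
```

In this idealized model the 8-GPU DPA layout holds eight times as many distinct tokens, which is where the larger batch sizes come from. Real gains depend on the model and the workload.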
+ +### Key benefits of DPA + +1. **Significantly reduced KV cache memory**: Each DP replica only stores KV cache for its own batches +2. **Larger batch sizes**: Memory savings enable larger batch sizes +3. **Improved decoding throughput**: Significant throughput gains for MLA-based models +4. **Independent forward modes**: Each DP replica can be in different forward modes (prefill, decode, or idle) and handles its assigned batches independently during attention computation + +### DPA with Expert Parallelism for MoE + +For MoE models like DeepSeek, DPA is **often** paired with Expert Parallelism (EP) for best throughput at scale. However, **DPA does not require EP**: you can enable DPA without EP if your deployment does not need expert sharding. + +When paired with DPA, EP lets you: + +- Distribute 256+ expert weights across GPUs (cannot fit on a single GPU) +- Enable efficient all-to-all token routing via DeepEP +- Scale to large clusters (up to 5x throughput improvement over vanilla TP) + +### Recommended setup for DeepSeek + +```bash +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3 \ + --tp 8 \ + --dp-size 8 \ + --ep 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --moe-runner-backend deep_gemm +``` + +> **Note**: `--dp-size` must be explicitly set when using `--enable-dp-attention`. If `dp_size` is 1 (default), DPA will be disabled. + +For detailed EP configuration (DeepEP, Two-Batch Overlap, EPLB), see [Expert Parallelism](expert_parallelism.md).
+ +### Target Models + +DPA supports the following model architectures: + +- **MLA (Multi-Head Latent Attention) models** - where DPA provides the most significant benefits: + - DeepSeek family (DeepSeek-V2, DeepSeek-V3, DeepSeek-R1) + - MiniMax models + - Kimi-K2 + - Other models using MLA architecture + +- **Standard attention models** - also supported: + - Qwen models (see [PR #6121](https://github.com/sgl-project/sglang/pull/6121)) + +For models such as Llama that use standard GQA, standard DP or TP is typically recommended. + +To enable DPA, add `--enable-dp-attention` to your server launch command. + +### Activation Logic + +DPA is enabled explicitly via server arguments (CLI or config). You must set both `--dp-size` and `--enable-dp-attention`: + +```bash +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3 \ + --tp 8 \ + --dp-size 8 \ + --enable-dp-attention +``` + +**Important**: `--dp-size` must be greater than 1 for DPA to work. When `dp_size == 1` (default), `--enable-dp-attention` is automatically disabled. The constraint `tp_size % dp_size == 0` must also be satisfied. + +### Standard DP for MLA models + +MLA models also support standard DP. To use it, launch each replica of the MLA model independently, one by one; each replica may itself have DPA enabled. Once all replicas are running, launch an SMG and connect every replica to it. SMG is explained in the next section. + +## Modern Data Parallelism SGLang Model Gateway (SMG) + +### Native DP Mode + +Native DP (built-in Data Parallelism) in SGLang creates multiple worker processes within a single SGLang instance, controlled by `DataParallelController` and configured with the `--dp-size` launch parameter.
+ + +```bash +# Native DP mode +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 +``` + +**Limitations:** + +- Built-in in-process load balancing only (e.g., `round_robin`, `total_requests`, `total_tokens`) +- No cache-aware routing +- Limited observability and metrics +- No fault tolerance or circuit breakers +- Not suitable for production workloads + +⚠️ Native DP is **strongly discouraged at present**. It is only used in some outdated RL frameworks. You can use SGLang Model Gateway (SMG) to power your data parallelism in any use case. + +### SMG-Based DP (Recommended) + +Since September 2024, SGLang Model Gateway (SMG), formerly named SGLang DP Router, has been developed in Rust as a production-ready DP routing system. It started with DP routing, and its scope was later expanded to coordinate RL, PD Disaggregation, and other scenarios. This doc only discusses SMG's usage in DP routing. For other usage, please refer to [SGLang Model Gateway Documentation](sgl_model_gateway.md). + +> To achieve the best production-level routing performance and minimize overhead, SMG is built in Rust rather than Python, because Python is not fast enough for this purpose. + +**We strongly recommend using the SGLang Model Gateway (SMG) for production-grade Data Parallelism.** SMG provides significant advantages over native DP mode. + +```bash +# SMG-based DP mode (Recommended) +python -m sglang_router.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 +``` + +⚠️ Note that **SMG and Native DP share the same launch parameter, `--dp-size`**. However, the entrypoint of Native DP is `python -m sglang.launch_server`, while SMG's entrypoint is `python -m sglang_router.launch_server`.
+ +**Advantages of SMG-Based DP:** + +| Feature | Native DP | SMG-Based DP | +|---------|-----------|--------------| +| **Load Balancing** | Built-in in-process methods | Advanced policies (cache-aware, power-of-two, etc.) | +| **Cache Awareness** | ❌ No | ✅ Yes - significantly higher cache hit rate | +| **Throughput** | Baseline | Significant improvement | +| **Multi-Node Support** | Limited | ✅ Full support | +| **Worker Health Monitoring** | Basic | ✅ Circuit breakers, health checks | +| **Reliability** | Basic | ✅ Retries, rate limiting, queuing | +| **Observability** | Basic metrics | ✅ 40+ Prometheus metrics, OpenTelemetry | +| **Hot Worker Add/Remove** | ❌ No | ✅ Yes | + +### SMG's Performance + +The cache-aware routing policy in SMG significantly improves performance for workloads with shared prefixes: + +| Metric | Without Cache-Aware | With Cache-Aware SMG | +|--------|---------------------|----------------------| +| Throughput (token/s) | 82,665 | 158,596 (+92%) | +| Cache Hit Rate | 20% | 75% (+275%) | + +*Benchmark from [SGLang v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/), workload with multiple long prefix groups, 8x A100 80GB GPUs, dp-size=8* + +### When to Use Each + +**Use Native DP when:** + +- ~Never use Native/Naive DP~ +- You are studying DP routing and want simple learning material + +**Use SMG-Based DP when:** + +- Whenever you think DP is needed +- Production deployments +- Multi-node distributed setups +- Workloads with shared prefixes (high cache reuse potential) +- You need high availability and reliability features +- You require detailed observability and metrics +- You want highly efficient RL rollout systems + +Note that for RL rollout systems, **there are four crucial reasons why SMG-Based DP is far better than naive DP routing**. Details can be found at [Load Balancing Router in RL](./sglang_for_rl.md#load-balancing-router).
+ +### Quick Start For SMG + +**Installation** + +```bash +pip install sglang-router +# or +pip install "sglang[all]" +``` + +**Option A: Co-launch Workers and SMG (Simplest)** + +This is the easiest way to get started - SMG and workers are launched together: + +```bash +python -m sglang_router.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Option B: Separate Launch (Multi-Node)** + +For distributed deployments across multiple machines: + +1. Launch workers on each node + +```bash +# Node 1 +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --port 8000 + +# Node 2 +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --port 8000 +``` + +2. Launch SMG pointing to workers + +```bash +python -m sglang_router.launch_router \ + --worker-urls http://node1:8000 http://node2:8000 \ + --policy cache_aware \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Option C: Dynamic Worker Registration** + +For elastic deployments where workers can be added/removed dynamically: + +```bash +# Launch SMG first +python -m sglang_router.launch_router \ + --policy cache_aware \ + --host 0.0.0.0 \ + --port 30000 + +# Register workers dynamically +curl -X POST http://localhost:30000/workers \ + -H "Content-Type: application/json" \ + -d '{"url": "http://worker1:8000"}' + +curl -X POST http://localhost:30000/workers \ + -H "Content-Type: application/json" \ + -d '{"url": "http://worker2:8000"}' +``` + +### Load Balancing Policies + +SMG supports multiple load balancing policies: + +| Policy | Description | Best For | +|--------|-------------|----------| +| `cache_aware` | Combines cache locality with load balancing | **Recommended for most workloads** | +| `round_robin` | Cycles through workers in order | Simple, predictable distribution | +| `random` | Random worker selection | Baseline, testing | +| `power_of_two` | Samples two workers, 
picks lighter one | Low latency requirements | + +**Cache-Aware Policy (Default, Recommended)** + +The cache-aware policy provides the best performance for most workloads: + +```bash +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8000 \ + --policy cache_aware \ + --cache-threshold 0.5 \ + --balance-abs-threshold 32 \ + --balance-rel-threshold 1.5 \ + --eviction-interval-secs 120 \ + --max-tree-size 67108864 +``` + +**How it works:** + +1. Maintains an approximate radix tree for each worker based on request history +2. Routes requests to workers with the highest prefix match (cache hit) +3. Falls back to shortest-queue routing when load is imbalanced +4. Automatically evicts old entries to prevent memory overflow + +### Best Practices + +1. **Start with `cache_aware` policy** - It provides the best balance between cache locality and load distribution for most workloads +2. **Use SMG for production** - Prefer `sglang_router.launch_server` over `sglang.launch_server` for better reliability and observability +3. **Enable health checks** - Configure `--router-health-check-interval-secs` to detect and remove unhealthy workers automatically + +**Recommended command with best practices applied:** + +```bash +python -m sglang_router.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 \ + --router-policy cache_aware \ + --router-health-check-interval-secs 30 \ + --router-prometheus-port 10001 \ + --host 0.0.0.0 \ + --port 30000 +``` + +For advanced configuration (circuit breakers, retries, Prometheus metrics, K8s integration), see [SGLang Model Gateway Documentation](sgl_model_gateway.md). + +### Verifying Traffic Distribution + +After launching SMG, verify that traffic is being distributed correctly: + +**1. Check worker status:** + +```bash +curl http://localhost:30000/workers +``` + +**2. Check load distribution:** + +```bash +curl http://localhost:30000/get_loads +``` + +**3. 
Monitor metrics (if Prometheus enabled):** + +```bash +# Key metrics to check +smg_router_requests_total{model="..."} +smg_worker_requests_active{worker="..."} +sglang_cache_hit_rate{source="..."} +``` + +For detailed metrics and monitoring setup, see [SGLang Model Gateway Documentation](sgl_model_gateway.md). + +## Reference + +| Strategy | Use Case | Key Benefit | +|----------|----------|-------------| +| **Native DP** (`--dp-size`) | Never | Easy to understand, not rust based | +| **SMG-Based DP** | **Production (recommended)** | Cache-aware routing, high availability | +| **DPA** (`--dp-size N --enable-dp-attention`) | DeepSeek/MLA models | Eliminates KV cache duplication, improved throughput | +| **DPA + EP** | DeepSeek MoE models | Significant throughput improvement vs vanilla TP | + +**Recommended production setup for DeepSeek:** +1. Enable **DPA** for attention layers (`--dp-size 8 --enable-dp-attention`) +2. Enable **EP** for MoE layers (`--ep 8 --moe-a2a-backend deepep`) +3. Use **SMG** with **cache_aware** policy + +**Related documentation:** +- [Expert Parallelism](expert_parallelism.md) - DeepEP, Two-Batch Overlap, EPLB +- [SGLang Model Gateway Documentation](sgl_model_gateway.md) - SMG configuration & troubleshooting +- [Large-Scale EP Blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/) - 96 GPU deployment guide diff --git a/docs/advanced_features/dp_for_multi_modal_encoder.md b/docs/advanced_features/dp_for_multi_modal_encoder.md index 62057f9581a0..a100e0688439 100644 --- a/docs/advanced_features/dp_for_multi_modal_encoder.md +++ b/docs/advanced_features/dp_for_multi_modal_encoder.md @@ -4,7 +4,7 @@ A typical VLM architecture involves two main components: an multi-modal encoder Most VLMs utilize a Vision Transformer (ViT) as their multi-modal encoder, it is responsible for processing visual data, extracting features (objects, colors, textures, etc.), and transforming them into a format that can be understood by the model. 
-The text deocoder is based on LLM. It processes textual data and generates output based on the encoded visual features. +The text decoder is based on LLM. It processes textual data and generates output based on the encoded visual features. However, since the size of ViT is very small compared to language decoders, there is relatively little gain from TP. On the other hand, TP incurs significant communication diff --git a/docs/advanced_features/epd_disaggregation.md b/docs/advanced_features/epd_disaggregation.md index 550503dfc930..d07898361a27 100644 --- a/docs/advanced_features/epd_disaggregation.md +++ b/docs/advanced_features/epd_disaggregation.md @@ -16,6 +16,81 @@ When launching a language-only model, you must additionally specify the encoder We support multiple encoder transfer backends, including zmq_to_scheduler, zmq_to_tokenizer, and mooncake (the default is zmq_to_scheduler). The backend can be selected using `--encoder-transfer-backend`. +### Encoder transfer with Mooncake + +`--encoder-transfer-backend mooncake` controls **how encoder outputs are transferred** between encoder and language/prefill services. It is an encoder transfer option and can be used independently of the global multimodal embedding cache. + +Example: + +```bash +# encoder +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --encoder-transfer-backend mooncake \ + --port 30000 + +# language-only server +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --language-only \ + --encoder-urls http://127.0.0.1:30000 \ + --encoder-transfer-backend mooncake \ + --port 30002 +``` + +### Global multimodal embedding cache with Mooncake + +SGLang also supports a Mooncake-backed **global multimodal embedding cache** for EPD workloads. When enabled on encoder servers, repeated image inputs can reuse previously computed ViT embeddings across instances instead of running the vision encoder again. 
+ +This feature is useful when: + +- the deployment serves repeated or overlapping image inputs, +- encoder compute is the bottleneck, and +- Mooncake is already available in the cluster. + +At a high level, the encoder checks whether the image embedding already exists in Mooncake. Cache hits are prefetched from the global store, while misses are encoded normally and inserted into the cache in the background. + +To enable it: + +- install and configure Mooncake in the same way as other SGLang Mooncake integrations, +- add `--enable-mm-global-cache` on the encoder server. + +`--enable-mm-global-cache` controls **whether multimodal embeddings are looked up and stored in the global Mooncake cache**. It is separate from `--encoder-transfer-backend`, which only controls encoder output transport. + +For Mooncake deployment and configuration details, see [HiCache best practices](hicache_best_practices.md#deployment-with-mooncake) and the [Mooncake backend README](../../python/sglang/srt/mem_cache/storage/mooncake_store/README.md). + +Example: + +```bash +# Shared Mooncake configuration +export MOONCAKE_TE_META_DATA_SERVER="http://127.0.0.1:8080/metadata" +export MOONCAKE_MASTER="127.0.0.1:50051" +export MOONCAKE_PROTOCOL="rdma" +export MOONCAKE_GLOBAL_SEGMENT_SIZE="4gb" + +# encoder with global multimodal cache enabled +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --enable-mm-global-cache \ + --port 30000 + +# language-only server +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --language-only \ + --encoder-urls http://127.0.0.1:30000 \ + --port 30002 +``` + +Notes: + +- This cache is for **multimodal encoder embeddings**, not the language model KV cache. +- The feature currently uses Mooncake as the shared backing store. +- It can be enabled regardless of which `--encoder-transfer-backend` you use. 
+- It is most relevant for EPD or encoder-disaggregated VLM deployments where the same images are likely to appear across requests or instances. + #### Qwen VL - EP Disaggregation @@ -78,3 +153,42 @@ python -m sglang_router.launch_router \ --port 8000 ``` + +#### gRPC Encoder (EPD) + +You can run the encoder as a gRPC server while keeping prefill/decode as HTTP. +When using gRPC encoders, set `SGLANG_ENCODER_MM_RECEIVER_MODE=grpc` for the +prefill process so it uses the gRPC receiver. + +```bash +# gRPC encoder +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --grpc-mode \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30000 + +# prefill (HTTP) - tell it to use gRPC receiver +SGLANG_ENCODER_MM_RECEIVER_MODE=grpc \ +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --disaggregation-mode prefill \ + --language-only \ + --encoder-urls grpc://127.0.0.1:30000 \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30002 + +# decode (HTTP) +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --disaggregation-mode decode \ + --port 30003 + +# router +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --prefill http://$PREFILL_HOST:30002 \ + --decode http://$DECODE_HOST:30003 \ + --port 8000 +``` diff --git a/docs/advanced_features/expert_parallelism.md b/docs/advanced_features/expert_parallelism.md index fdde94f8caf7..5c052114b000 100644 --- a/docs/advanced_features/expert_parallelism.md +++ b/docs/advanced_features/expert_parallelism.md @@ -15,12 +15,14 @@ SGLang's EP integrates diverse, highly efficient backends for different use case | **`none` (default)** | Disables all-to-all for EP. Uses All-Reduce or All-Gather for token dispatch. | Hybrid EP and TP setups. | | `deepep` | DeepEP, a communication library for efficient token shuffling in MoE models. | Large-scale EP deployments. 
| | `mooncake` | An extension of DeepEP for elastic inference, leveraging RDMA for high-performance data transfers. | Elastic EP serving. | +| `nixl` | [NIXL-EP](https://github.com/ai-dynamo/nixl/tree/main/examples/device/ep), an elastic EP communication library built on NVIDIA's [NIXL](https://github.com/ai-dynamo/nixl) framework with native RDMA and NVLink support. | Elastic EP serving with fault tolerance and dynamic scaling. | +| `mori` | MORI-EP, AMD's native all-to-all communication implementation optimized for ROCm. | AMD GPU deployments. | | `flashinfer` | Flashinfer implementation of all-to-all. | Large-scale EP deployments. | | `ascend_fuseep` | Ascend NPU native fused all-to-all communication. | Ascend NPU deployments. | -DeepEP and Mooncake backends support two modes for token dispatch: `normal` mode (optimized for prefill workloads with high throughput) and `low_latency` mode (optimized for decode workloads with low latency and CUDA Graph compatibility). Users are recommended to set `--deepep-mode auto` to enable automatic dispatch mode switching during runtime. Setting `--deepep-mode normal` or `--deepep-mode low_latency` is useful for debugging or development purposes. +DeepEP and Mooncake backends support two modes for token dispatch: `normal` mode (optimized for prefill workloads with high throughput) and `low_latency` mode (optimized for decode workloads with low latency and CUDA Graph compatibility). MORI backend only supports `normal` mode now. NIXL-EP currently operates in low-latency mode with CUDA Graph support. Users are recommended to set `--deepep-mode auto` to enable automatic dispatch mode switching during runtime. Setting `--deepep-mode normal` or `--deepep-mode low_latency` is useful for debugging or development purposes. -Currently, DeepEP and Mooncake only support cases where `ep_size = tp_size`. For hybrid EP and TP (i.e., `ep_size < tp_size`), only the `none` backend (All-Reduce or All-Gather-based dispatching) is supported. 
+Currently, DeepEP, Mooncake, NIXL-EP, `ascend_fuseep` and MORI only support cases where `ep_size = tp_size`. For hybrid EP and TP (i.e., `ep_size < tp_size`), only the `none` backend (All-Reduce or All-Gather-based dispatching) is supported. ### Backends for MoE Computation @@ -31,6 +33,7 @@ Currently, DeepEP and Mooncake only support cases where `ep_size = tp_size`. For | `deep_gemm` | DeepGEMM backend optimized for MoE matrix multiplications, supporting contiguous layouts for prefill and masked layouts for decode; often JIT-compiled for performance. | Large-scale EP deployments with FP8 block-wise quantization. | | `cutlass` | CUTLASS-based backend for efficient GEMMs. | NVIDIA architectures with CUTLASS support. | | `flashinfer_trtllm` | FlashInfer integrated with TensorRT-LLM for accelerated MoE computations, supporting FP4 communication operators and high-performance GEMMs. | Blackwell with TRT-LLM. | +| `flashinfer_trtllm_routed` | FlashInfer integrated with TensorRT-LLM for accelerated routed MoE computations, consuming SGLang-computed top-k expert assignments and weights. | Blackwell with TRT-LLM. | | `flashinfer_cutlass` | FlashInfer combined with CUTLASS for high-performance grouped GEMMs in MoE layers, handling FP4/FP8 quantization efficiently. | Blackwell with FP4/FP8 models. | | `flashinfer_mxfp4` | FlashInfer variant optimized for MXFP4 (mixed FP4) quantization in MoE runners, focusing on memory-efficient low-precision inference. | Low-precision models with MXFP4. | | `flashinfer_cutedsl` | FlashInfer with a custom DSL for flexible and efficient MoE kernel generation, integrated with ModelOpt FP4 quantization. | Low-precision models with NVFP4. | @@ -155,3 +158,43 @@ For model like `nvidia/DeepSeek-R1-0528-NVFP4-v2`, the target model uses NVFP4 p --speculative-moe-runner-backend triton \ ... 
``` + + +## Ascend NPU Guidance + + +### Guidance on SGLang configuration in Ascend NPU +- `--moe-a2a-backend` only supports the `deepep` and `ascend_fuseep` backends: + - `deepep`: The mechanism is consistent with the above description. + - `ascend_fuseep`: Offers a large fused operator that integrates all operations between dispatch and combine to speed up MoE computation. Only used for the decode stage in PD Disaggregation Mode. +- The `--moe-runner-backend` parameter does not need to be configured. +- `--deepep-mode`: + - In PD mixed mode, please set `--deepep-mode auto`. + - In PD Disaggregation Mode, set `--deepep-mode normal` on the prefill instance and `--deepep-mode low_latency` on the decode instance. + + +### DeepEP Ascend Introduction + +DeepEP Ascend is the adapted version of the DeepEP communication library for Huawei Ascend NPUs, specifically designed for Mixture-of-Experts (MoE) model Expert Parallelism (EP). +It supports the Ant-moving Function (splitting the sequence into rounds for streaming batch transmission) to reduce the buffer size occupied during collective communication in the prefill stage, especially for long sequences. + +The Ant-moving Function can be enabled for both the dispatch and combine phases via the following environment variables: +- `DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS`: Enables the ant-moving function in the dispatch stage. Indicates the number of tokens transmitted per round on each rank (default 8192). +- `DEEPEP_NORMAL_LONG_SEQ_ROUND`: Enables the ant-moving function in the dispatch stage. Indicates the number of rounds transmitted on each rank (default 1). +- `DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ`: Enables the ant-moving function in the combine stage (default 0, meaning disabled). + +`DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS * DEEPEP_NORMAL_LONG_SEQ_ROUND` corresponds to the input sequence length. When the input sequence length exceeds 8192, it is recommended to enable the ant-moving function in both the dispatch and combine phases.
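As a worked example of how these variables relate, the sketch below picks a round count so that tokens-per-round times rounds covers a given input length. Rounding up with a ceiling is an assumption made here for illustration, as is using `1` to switch on the combine stage (the text above only states that `0` means disabled). `ant_moving_env` is a hypothetical helper, not part of DeepEP or SGLang.

```python
import math

DEFAULT_PER_ROUND_TOKENS = 8192  # documented default tokens per round
LONG_SEQ_THRESHOLD = 8192        # length above which ant-moving is recommended


def ant_moving_env(seq_len: int, per_round_tokens: int = DEFAULT_PER_ROUND_TOKENS) -> dict:
    """Suggest environment variable values for a given input sequence length.

    Assumption: per_round_tokens * rounds should be at least seq_len,
    so the round count is rounded up.
    """
    rounds = max(1, math.ceil(seq_len / per_round_tokens))
    return {
        "DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS": str(per_round_tokens),
        "DEEPEP_NORMAL_LONG_SEQ_ROUND": str(rounds),
        # Assumption: "1" enables the combine-stage switch.
        "DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ": "1" if seq_len > LONG_SEQ_THRESHOLD else "0",
    }


# Print export lines for a 32768-token input.
for name, value in ant_moving_env(32768).items():
    print(f"export {name}={value}")
```

Treat the output as a starting point only and confirm the values against your DeepEP Ascend version.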
+ +The environment variable `HCCL_BUFFSIZE` configures the buffer size (in MB) that is actually allocated. It is calculated as follows: +```text +# Enable Ant-moving Function +HCCL_BUFFSIZE >= 2 * (102MB + 4MB + DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS * (hidden_size + hidden_size + hidden_size) * topk) + PADDING_BUFFSIZE + +# Disable Ant-moving Function +HCCL_BUFFSIZE >= 2 * (102MB + 4MB + TOTAL_SEQ_LEN * (hidden_size + hidden_size) * topk) + PADDING_BUFFSIZE +``` +The parameters are: +- `hidden_size`: The hidden size in the model config. +- `topk`: The number of selected routing experts. +- `TOTAL_SEQ_LEN`: The input sequence length. +- `PADDING_BUFFSIZE`: A value of 20 or greater is recommended. diff --git a/docs/advanced_features/hicache.rst b/docs/advanced_features/hicache.rst index b2bd08b79e76..e7d83211dc9a 100644 --- a/docs/advanced_features/hicache.rst +++ b/docs/advanced_features/hicache.rst @@ -6,3 +6,4 @@ Hierarchical KV Caching (HiCache) hicache_best_practices.md hicache_design.md + hicache_storage_runtime_attach_detach.md diff --git a/docs/advanced_features/hicache_best_practices.md b/docs/advanced_features/hicache_best_practices.md index cb1baa01e1c8..104c2b0e2d54 100644 --- a/docs/advanced_features/hicache_best_practices.md +++ b/docs/advanced_features/hicache_best_practices.md @@ -19,6 +19,10 @@ SGLang HiCache extends the traditional RadixAttention with a three-tier hierarch --hicache-storage-backend # Optional storage backend (e.g., hf3fs, mooncake, etc.) ``` +Notes: + +- Besides configuring `--hicache-storage-backend` at startup, SGLang also supports **runtime attach/detach** of the HiCache storage backend (no restart required) via HTTP admin endpoints. See [Runtime Attach/Detach HiCache Storage Backend](hicache_storage_runtime_attach_detach.md).
+ ## Key Configurations with Storage Backends Enabled ### Memory Layout Optimization @@ -35,6 +39,23 @@ SGLang HiCache extends the traditional RadixAttention with a three-tier hierarch - `page_first`: Only compatible with `kernel` I/O backend, automatically switches to `layer_first` with `direct` backend - `page_first_direct`: Specifically designed for `direct` I/O backend with optimized memory organization +### Heterogeneous TP Support (GQA/MHA models) + +HiCache storage supports cross-cluster KV reuse when different deployments use different TP sizes (for example, `tp=4` and `tp=8`) and share the same storage backend namespace. + +Use `tp_lcm_size` in `--hicache-storage-backend-extra-config`: + +```bash +# Example: heterogeneous TP = {4, 8}, so lcm = 8 +--hicache-storage-backend-extra-config '{"tp_lcm_size": 8}' +``` + +Guidelines: + +- Set `tp_lcm_size` to the least common multiple (LCM) of all TP sizes that will share the same HiCache storage. +- For MHA models with Mooncake and `page_head` layout, HiCache will split head shards based on `tp_lcm_size` to make keys reusable across heterogeneous TP deployments. +- If all clusters use the same TP size, this option is not needed. + ### Prefetch Policies ```bash diff --git a/docs/advanced_features/hicache_storage_runtime_attach_detach.md b/docs/advanced_features/hicache_storage_runtime_attach_detach.md new file mode 100644 index 000000000000..555d799c2a53 --- /dev/null +++ b/docs/advanced_features/hicache_storage_runtime_attach_detach.md @@ -0,0 +1,132 @@ +# Runtime Attach/Detach HiCache Storage Backend (No Restart) + +This document explains how to **dynamically attach/detach the HiCache L3 storage backend at runtime** (e.g., `mooncake` / `hf3fs` / `nixl` / `file` / `aibrix` / `eic`) while **SGLang is already running and serving traffic**, without restarting the process. 
+ +For safety and consistency, the current implementation **strictly requires** these operations to happen only when the service is **idle**: + +- **No running requests** +- **No waiting/queued requests** + +If the idle condition is not met, the API will fail fast (HTTP 400) and **will not modify** the current service state. + +--- + +## 1. Background and implementation overview + +### 1.1 Architecture / control path + +The control path is: + +1. **HTTP Server** (`python/sglang/srt/entrypoints/http_server.py`) + - Exposes `PUT /hicache/storage-backend`, `DELETE /hicache/storage-backend`, `GET /hicache/storage-backend` +2. **TokenizerManager** (`python/sglang/srt/managers/tokenizer_control_mixin.py`) + - Sends the request to the Scheduler via `FanOutCommunicator` +3. **Scheduler** (`python/sglang/srt/managers/scheduler.py`) + - Performs a **strict idle check** + - Calls `tree_cache.attach_storage_backend(...)` / `detach_storage_backend(...)` +4. **HiRadixCache** (`python/sglang/srt/mem_cache/hiradix_cache.py`) + - Parses `hicache_storage_backend_extra_config_json` (supports both backend config and prefetch knobs) + - Calls `cache_controller.attach_storage_backend(...)` / `detach_storage_backend(...)` +5. **HiCacheController** (`python/sglang/srt/managers/cache_controller.py`) + - Creates/destroys the storage backend instance (via `StorageBackendFactory`) + - Starts/stops backend background threads at runtime (prefetch/backup) + +--- + +## 2. Idle-state requirement (strict) + +The Scheduler uses `is_fully_idle()` which checks: + +- No running batches (including chunked prefill, overlap, pipeline-parallel, and disaggregation paths) +- No waiting requests in any queue (waiting, grammar, disagg bootstrap/prealloc/transfer/inflight) +- No DLLM staging requests + +If the condition is not met, attach/detach returns an error like: + +- `Reject attach: scheduler is not idle. #queue-req=... 
#running-req=...` + +> Tip: before switching, drain upstream traffic and wait for the server to become idle, then call attach/detach. + +### 2.1 DP (data parallel) semantics + +When `dp_size > 1`, the tokenizer dispatches the request to **all DP scheduler instances** and aggregates their responses: + +- The final `success` is **true only if all DP ranks return success** +- The final `message` concatenates messages from all DP ranks + +This is intended to prevent “silent partial success”, but it also means you may see: + +- Overall **failure** even though **some ranks already succeeded** + +Currently there is **no automatic partial rollback** across DP ranks (see TODO in code). Operationally: + +- Prefer to keep backend config identical across ranks +- If attach fails, immediately call detach (best-effort/idempotent), fix config, then retry attach + +--- + +## 3. How to use (HTTP Admin API) + +The examples below assume your SGLang HTTP server is at `http://127.0.0.1:30000`. + +### 3.1 Query current storage backend status + +```bash +curl -s http://127.0.0.1:30000/hicache/storage-backend +``` + +Example response: + +```json +{ + "hicache_storage_backend": "mooncake", + "hicache_storage_backend_extra_config": "{\"master_server_address\":\"127.0.0.1:50051\", ...}" +} +``` + +### 3.2 Attach (enable) a storage backend +```bash +curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \ + -H 'Content-Type: application/json' \ + -d '{ + "hicache_storage_backend": "mooncake" + }' +``` + +```bash +curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \ + -H 'Content-Type: application/json' \ + -d '{ + "hicache_storage_backend": "mooncake", + "hicache_storage_backend_extra_config_json": "{\"master_server_address\":\"127.0.0.1:50051\",\"protocol\":\"tcp\",\"global_segment_size\":\"4gb\",\"prefetch_threshold\":256}", + "hicache_storage_prefetch_policy": "timeout" + }' +``` + +Notes: + +- `hicache_storage_backend_extra_config_json` can include both: + - **Backend 
configuration** (e.g., Mooncake master/metadata/protocol, etc.) + - **Prefetch configuration** (`prefetch_threshold`, `prefetch_timeout_base`, `prefetch_timeout_per_ki_token`, `hicache_storage_pass_prefix_keys`) + +### 3.3 Detach (disable) the storage backend + +```bash +curl -s -X DELETE http://127.0.0.1:30000/hicache/storage-backend +``` + +Notes: + +- Detach only makes SGLang **stop using** the L3 storage backend and stops prefetch/backup threads +- It **does not automatically delete** data stored in Mooncake/HF3FS (or other remote backends) + +--- + +## 4. Behavior and caveats + +- **No restart required**: attach/detach switches in-process at runtime +- **Must be idle**: otherwise the request is rejected to avoid consistency issues +- **Host KV layout constraints still apply**: for example, Mooncake still requires layouts like `page_first/page_first_direct/page_head`; if the server's HiCache host-memory layout does not satisfy the backend requirements, attach will fail with an error +- **Observability**: + - After attach, `server_args.hicache_storage_backend*` is updated on both the tokenizer and scheduler sides + - If metrics are enabled, attach will create a storage metrics collector in `HiRadixCache` on demand diff --git a/docs/advanced_features/hisparse_guide.md b/docs/advanced_features/hisparse_guide.md new file mode 100644 index 000000000000..57aa5e7c2481 --- /dev/null +++ b/docs/advanced_features/hisparse_guide.md @@ -0,0 +1,135 @@ +# HiSparse: Hierarchical Sparse Attention + +HiSparse reduces per-request GPU memory consumption during the decode phase by maintaining only a small "hot" KV buffer on GPU while keeping complete KV data in CPU pinned memory. Combined with PD disaggregation, it enables significantly higher decode concurrency. + +> **Prerequisites**: HiSparse only works with models that use **DeepSeek Sparse Attention (DSA)** architectures (e.g., DeepSeek-V3.2, GLM-5). 
These models natively select a subset of tokens for attention, making it possible to keep only the top-k KV on GPU while storing the full KV in host memory — without accuracy loss. Additionally, HiSparse currently requires **PD disaggregation mode** and is enabled on the **decode instance** only. + +## Why HiSparse? + +In long-context LLM inference, each decoding request holds a full-length KV cache on GPU, limiting the number of concurrent requests a decode instance can serve. HiSparse addresses this by: + +- **Reducing GPU memory per request**: Each request occupies only a fixed-size device buffer (e.g., 4KB tokens) instead of the full sequence length. +- **On-demand swap-in**: A CUDA kernel dynamically loads the top-k most relevant KV entries from host memory based on attention scores. +- **Transparent to prefill**: HiSparse is entirely a decode-side optimization; the prefill instance requires no changes. + +## Design Overview + +### Decode Workflow + +Each decode step follows this flow: + +1. **Forward decode** — generate the next token +2. **Top-k selection** — select the most relevant token positions via attention scores +3. **Swap-in** — the CUDA kernel loads top-k KV entries from host to device buffer: + - *Short sequences* (`seq_len ≤ device_buffer_size`): fast path, all KV already in buffer + - *Long sequences*: hit detection → LRU reordering → miss handling (host → device copy) +4. **Decode attention** — compute attention using the top-k device locations +5. **Eager backup** — asynchronously copy the previous token's KV from device to host + +### PD Disaggregation Integration (Direct-to-Host) + +In PD disaggregation mode, the prefill instance transfers KV cache directly into the decode instance's host pool via RDMA, bypassing the GPU entirely on the decode side. This eliminates the transient GPU memory spike during KV transfer and removes the staging DMA step. 
+ +``` +Prefill GPU ──RDMA──▶ Decode Host Pool (CPU pinned memory) + │ + ▼ + alloc device buffer (4KB) + │ + ▼ + swap-in kernel (on-demand top-k) +``` + +## Server Arguments + +| Argument | Type / Default | Description | +|----------|---------------|-------------| +| `--enable-hisparse` | flag; default: disabled | Enable HiSparse on the decode instance | +| `--hisparse-config` | JSON string | Configuration for HiSparse (see below) | + +### HiSparse Config Parameters + +Pass as a JSON string via `--hisparse-config`: + +| Parameter | Type / Default | Description | +|-----------|---------------|-------------| +| `top_k` | int | Number of topk entries | +| `device_buffer_size` | int | Number of token slots in the per-request GPU device buffer | +| `host_to_device_ratio` | int | Ratio of logical pool size to device pool size, determining host memory capacity | + +Example: `--hisparse-config='{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10}'` + +## Deployment + +HiSparse currently requires **PD disaggregation mode** and is enabled only on the **decode instance**. 
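Because `--hisparse-config` takes a JSON string, it can be convenient to generate the flag rather than hand-write the quoting. The following is a minimal sketch, assuming only the three parameters listed in the table above; the helper name `build_hisparse_config` and its positive-integer check are illustrative and are not part of SGLang.

```python
import json
import shlex


def build_hisparse_config(top_k: int, device_buffer_size: int, host_to_device_ratio: int) -> str:
    """Serialize the three documented HiSparse parameters to a JSON string."""
    config = {
        "top_k": top_k,
        "device_buffer_size": device_buffer_size,
        "host_to_device_ratio": host_to_device_ratio,
    }
    for name, value in config.items():
        # Illustrative check only: the table documents all three values as ints.
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return json.dumps(config)


config_json = build_hisparse_config(top_k=2048, device_buffer_size=6144, host_to_device_ratio=10)
# shlex.quote wraps the JSON in single quotes so a POSIX shell passes it through unchanged.
flag = f"--hisparse-config={shlex.quote(config_json)}"
print(flag)
```

The printed flag can be appended to the decode-instance launch command; the values used here are the ones from the example in this guide, not tuned recommendations.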
+ +### Prefill Instance + +```bash +python3 -m sglang.launch_server \ + --model-path /path/to/model \ + --trust-remote-code \ + --port 8000 --host 0.0.0.0 \ + --context-length 81920 \ + --chunked-prefill-size 65536 \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --mem-fraction-static 0.85 \ + --disaggregation-mode prefill \ + --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \ + --nnodes 1 --node-rank 0 +``` + +### Decode Instance (with HiSparse) + +```bash +python3 -m sglang.launch_server \ + --model-path /path/to/model \ + --trust-remote-code \ + --port 8000 --host 0.0.0.0 \ + --context-length 81920 \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --mem-fraction-static 0.85 \ + --kv-cache-dtype bfloat16 \ + --nsa-decode-backend flashmla_sparse \ + --disaggregation-mode decode \ + --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \ + --dist-init-addr 127.0.0.1:5757 \ + --nnodes 1 --node-rank 0 \ + --enable-hisparse \ + --hisparse-config='{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10}' +``` + +### Benchmark + +```bash +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \ + --dataset-name random \ + --random-input 40000 \ + --random-output 20000 \ + --num-prompts 200 \ + --max-concurrency 200 \ + --request-rate 40 \ + --random-range-ratio 1.0 \ + --host 127.0.0.1 \ + --port 20000 \ + --model /path/to/model \ + --flush-cache +``` + +### Key Notes + +- The prefill instance does not need `--enable-hisparse`; it is unaware of HiSparse. +- On the decode instance, the following flags are **required** for HiSparse: + - `--kv-cache-dtype bfloat16` — currently only the bfloat16 KV cache is supported (more dtypes planned). + - `--nsa-decode-backend flashmla_sparse` — currently only the `flashmla_sparse` backend is supported. + - `--enable-hisparse` — enables HiSparse. + - `--hisparse-config` — HiSparse configuration (top_k, device_buffer_size, host_to_device_ratio). 
+ - `host_to_device_ratio` should be configured based on the host machine's available memory. For example: + - **~1 TB** host memory → `host_to_device_ratio: 5` + - **~2 TB** host memory → `host_to_device_ratio: 10` + +## Acknowledgments + +We would like to thank the SGLang team and community for the implementation and generous support, especially Zhiqiang Xie, Zhangheng Huang, Tingwei Huang, Shangming Cai, Teng Ma, and many others. We also thank the Alibaba Cloud TairKVCache team and the AntGroup SCT Inference team for their valuable contributions. diff --git a/docs/advanced_features/lora.ipynb b/docs/advanced_features/lora.ipynb index a8245f1b280c..230bd700f03b 100644 --- a/docs/advanced_features/lora.ipynb +++ b/docs/advanced_features/lora.ipynb @@ -47,6 +47,8 @@ "\n", "* `--max-lora-chunk-size`: Maximum chunk size for the ChunkedSGMV LoRA backend. Only used when --lora-backend is 'csgmv'. Choosing a larger value might improve performance. Please tune this value based on your hardware and workload as needed. Defaults to 16.\n", "\n", + "* `lora_drain_wait_threshold`: When any LoRA adapter request waits longer than this threshold (in seconds), the scheduler will selectively drain one running adapter to make room. This mitigates extreme tail latency under high or skewed workloads by preventing a small set of adapters from monopolizing batch slots. Set to 0 to disable draining (default).\n", + "\n", "* `tp_size`: LoRA serving along with Tensor Parallelism is supported by SGLang. `tp_size` controls the number of GPUs for tensor parallelism. More details on the tensor sharding strategy can be found in [S-Lora](https://arxiv.org/pdf/2311.03285) paper.\n", "\n", "From client side, the user needs to provide a list of strings as input batch, and a list of adaptor names that each input sequence corresponds to." 
@@ -102,7 +104,7 @@ "\"\"\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" ] }, { @@ -151,18 +153,16 @@ "metadata": {}, "outputs": [], "source": [ - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", " --enable-lora \\\n", " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n", - " lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n", + " lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json \\\n", " --max-loras-per-batch 2 \\\n", " --log-level warning \\\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" ] }, { @@ -220,15 +220,14 @@ "metadata": {}, "outputs": [], "source": [ - "lora0 = \"Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16\" # rank - 4, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj\n", + "lora0 = \"Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json\" # rank - 4, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj\n", "lora1 = \"algoprog/fact-generation-llama-3.1-8b-instruct-lora\" # rank - 64, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj\n", "lora0_new = \"philschmid/code-llama-3-1-8b-text-to-sql-lora\" # rank - 256, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj\n", "\n", "\n", "# The `--target-lora-modules` param below is technically not needed, as the server will infer it from lora0 which already has all the target modules specified.\n", "# We are adding it here just to demonstrate usage.\n", - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", " 
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", " --enable-lora \\\n", " --cuda-graph-max-bs 2 \\\n", @@ -236,11 +235,10 @@ " --max-lora-rank 256\n", " --lora-target-modules all\n", " --log-level warning\n", - " \"\"\"\n", - ")\n", + " \"\"\")\n", "\n", "url = f\"http://127.0.0.1:{port}\"\n", - "wait_for_server(url)" + "wait_for_server(url, process=server_process)" ] }, { @@ -435,8 +433,7 @@ "metadata": {}, "outputs": [], "source": [ - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", " python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", " --enable-lora \\\n", " --cuda-graph-max-bs 8 \\\n", @@ -444,16 +441,15 @@ " --max-lora-rank 256 \\\n", " --lora-target-modules all \\\n", " --lora-paths \\\n", - " {\"lora_name\":\"lora0\",\"lora_path\":\"Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16\",\"pinned\":true} \\\n", + " {\"lora_name\":\"lora0\",\"lora_path\":\"Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json\",\"pinned\":true} \\\n", " {\"lora_name\":\"lora1\",\"lora_path\":\"algoprog/fact-generation-llama-3.1-8b-instruct-lora\"} \\\n", " lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora\n", " --log-level warning\n", - " \"\"\"\n", - ")\n", + " \"\"\")\n", "\n", "\n", "url = f\"http://127.0.0.1:{port}\"\n", - "wait_for_server(url)" + "wait_for_server(url, process=server_process)" ] }, { @@ -548,16 +544,14 @@ "metadata": {}, "outputs": [], "source": [ - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", " python3 -m sglang.launch_server \\\n", " --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", " --enable-lora \\\n", " --lora-backend csgmv \\\n", " --max-loras-per-batch 16 \\\n", " --lora-paths lora1=path/to/lora1 lora2=path/to/lora2\n", - " \"\"\"\n", - ")" + " \"\"\")" ] }, { @@ -589,28 +583,26 @@ "metadata": {}, 
"outputs": [], "source": [ - "lora0 = \"Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16\"\n", + "lora0 = \"Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json\"\n", "lora1 = \"algoprog/fact-generation-llama-3.1-8b-instruct-lora\"\n", "lora2 = \"philschmid/code-llama-3-1-8b-text-to-sql-lora\"\n", "\n", "\n", - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", " python3 -m sglang.launch_server \\\n", " --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", " --enable-lora \\\n", " --enable-lora-overlap-loading \\\n", - " --lora-paths lora0=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n", + " --lora-paths lora0=Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json \\\n", " lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n", " lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora \\\n", " --max-lora-rank 256 \\\n", " --max-loras-per-batch 2 \\\n", " --max-loaded-loras 4\n", - " \"\"\"\n", - ")\n", + " \"\"\")\n", "\n", "url = f\"http://127.0.0.1:{port}\"\n", - "wait_for_server(url)" + "wait_for_server(url, process=server_process)" ] }, { diff --git a/docs/advanced_features/object_storage.md b/docs/advanced_features/object_storage.md new file mode 100644 index 000000000000..957ecdbafe31 --- /dev/null +++ b/docs/advanced_features/object_storage.md @@ -0,0 +1,108 @@ +# Loading Models from Object Storage + +SGLang supports direct loading of models from object storage (S3 and Google Cloud Storage) without requiring a full local download. This feature uses the `runai_streamer` load format to stream model weights directly from cloud storage, significantly reducing startup time and local storage requirements. + +## Overview + +When loading models from object storage, SGLang uses a two-phase approach: + +1. **Metadata Download** (once, before process launch): Configuration files and tokenizer files are downloaded to a local cache +2. 
**Weight Streaming** (lazy, during model loading): Model weights are streamed directly from object storage as needed + +## Supported Storage Backends + +1. **Amazon S3**: `s3://bucket-name/path/to/model/` +2. **Google Cloud Storage**: `gs://bucket-name/path/to/model/` +3. **Azure Blob**: `az://some-azure-container/path/` +4. **S3 compatible**: `s3://bucket-name/path/to/model/` + +## Quick Start + +### Basic Usage + +Simply provide an object storage URI as the model path: + +```bash +# S3 +python -m sglang.launch_server \ + --model-path s3://my-bucket/models/llama-3-8b/ \ + --load-format runai_streamer + +# Google Cloud Storage +python -m sglang.launch_server \ + --model-path gs://my-bucket/models/llama-3-8b/ \ + --load-format runai_streamer +``` + +**Note**: `--load-format runai_streamer` is selected automatically when the model path is an object storage URI, so you can omit it: + +```bash +python -m sglang.launch_server \ + --model-path s3://my-bucket/models/llama-3-8b/ +``` + +### With Tensor Parallelism + +```bash +python -m sglang.launch_server \ + --model-path gs://my-bucket/models/llama-70b/ \ + --tp 4 \ + --model-loader-extra-config '{"distributed": true}' +``` + +## Configuration + +### Load Format + +The `runai_streamer` load format is designed for object storage, SSDs, and shared file systems: + +```bash +python -m sglang.launch_server \ + --model-path s3://bucket/model/ \ + --load-format runai_streamer +``` + +### Extended Configuration Parameters + +Use `--model-loader-extra-config` to pass additional configuration as a JSON string: + +```bash +python -m sglang.launch_server \ + --model-path s3://bucket/model/ \ + --model-loader-extra-config '{ + "distributed": true, + "concurrency": 8, + "memory_limit": 2147483648 + }' +``` + +#### Available Parameters + +| Parameter | Type | Description | Default | +|-----------|------|-------------|---------| +| `distributed` | bool | Enable distributed streaming for multi-GPU setups. 
Automatically set to `true` for object storage paths on CUDA-like devices. | Auto-detected | +| `concurrency` | int | Number of concurrent download streams. Higher values can improve throughput for large models. | 4 | +| `memory_limit` | int | Memory limit (in bytes) for the streaming buffer. | System-dependent | + + +## Performance Considerations + +### Distributed Streaming + +For multi-GPU setups, enable distributed streaming to parallelize weight loading across the processes: + +```bash +python -m sglang.launch_server \ + --model-path s3://bucket/model/ \ + --tp 8 \ + --model-loader-extra-config '{"distributed": true}' +``` + +## Limitations + +- **Supported Formats**: Only the `.safetensors` weight format (the recommended format) is currently supported. +- **Supported Device**: Distributed streaming is supported on CUDA-like devices. On other devices, loading falls back to non-distributed streaming. + +## See Also + +- [Run:ai Model Streamer documentation](https://github.com/run-ai/runai-model-streamer) diff --git a/docs/advanced_features/pd_disaggregation.md b/docs/advanced_features/pd_disaggregation.md index b40ab11b4d01..e1edc56b84e5 100644 --- a/docs/advanced_features/pd_disaggregation.md +++ b/docs/advanced_features/pd_disaggregation.md @@ -130,16 +130,19 @@ PD Disaggregation with Mooncake supports the following environment variables for To enable NVLink transport for KV cache transfers with the mooncake backend (recommended for NVL72 deployments), set the following environment variables. Note that auxiliary data transfer will still use TCP as a temporary workaround. ```bash -export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True +export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=NVLINK export MC_FORCE_MNNVL=True ``` +The `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` environment variable enables the custom memory pool. Supported values are `NVLINK` (or `True`), `BAREX`, and `INTRA_NODE_NVLINK`. 
+ #### Prefill Server Configuration | Variable | Description | Default | |:--------:|:-----------:|:--------: | **`SGLANG_DISAGGREGATION_THREAD_POOL_SIZE`** | Controls the total number of worker threads for KVCache transfer operations per TP rank | A dynamic value calculated by `int(0.75 * os.cpu_count()) // 8`, which is limited to be larger than 4 and less than 12 to ensure efficiency and prevent thread race conditions | | **`SGLANG_DISAGGREGATION_QUEUE_SIZE`** | Sets the number of parallel transfer queues. KVCache transfer requests from multiple decode instances will be sharded into these queues so that they can share the threads and the transfer bandwidth at the same time. If it is set to `1`, requests are transferred one by one in FCFS (first-come, first-served) order | `4` | | **`SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT`** | Timeout (seconds) for receiving destination KV indices during request initialization | `300` | +| **`SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL`** | Interval (seconds) between cleanups of bootstrap entries | `120` | If a greater mean TTFT is acceptable, you can `export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600` (10 minutes) to relax the timeout condition. Please be aware that this setting will cause prefill instances to take a longer time to clean up the affected memory resources when a running decode node loses connection. @@ -154,6 +157,58 @@ Please be aware that this setting will cause prefill instances to take a longer If a greater mean TTFT is acceptable, you can `export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600` (10 minutes) to relax the timeout condition. +## Heterogeneous TP with GPU Staging Buffer + +When prefill and decode use different tensor parallelism (TP) sizes (e.g., prefill TP=4, decode DP attention with TP=1), the KV cache memory layout differs between the two sides. 
The **GPU staging buffer** solves this by gathering KV head slices into a contiguous buffer on the prefill side, performing bulk RDMA transfer, then scattering into the correct KV cache pages on the decode side. This provides **2–5x throughput improvement** over the default per-token slice approach at high concurrency and matches homogeneous TP baselines within ~5%. + +Enable the staging buffer when prefill and decode use **different TP sizes** with the **Mooncake** transfer backend. When both sides use the same TP size, staging is automatically bypassed even if enabled. + +> **Note:** The staging buffer is designed for non-MLA models (e.g. GQA, MHA). MLA models (e.g. DeepSeek-V2/V3) should not enable this flag. + +### Environment Variables + +| Variable | Description | Default | +|:---------|:------------|:-------:| +| **`SGLANG_DISAGG_STAGING_BUFFER`** | Enable GPU staging buffer for heterogeneous TP KV transfer | `False` | +| **`SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB`** | Prefill-side per-worker staging buffer size in MB | `64` | +| **`SGLANG_DISAGG_STAGING_POOL_SIZE_MB`** | Decode-side ring buffer pool total size in MB | `4096` | + +### Usage Example + +```bash +# Set staging buffer environment variables on BOTH prefill and decode +export SGLANG_DISAGG_STAGING_BUFFER=1 +export SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB=64 +export SGLANG_DISAGG_STAGING_POOL_SIZE_MB=4096 + +# Prefill with TP=4 +python -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --disaggregation-mode prefill \ + --port 30000 \ + --tp 4 \ + --trust-remote-code \ + --disaggregation-ib-device mlx5_1,mlx5_2 + +# Decode with TP=1 (or DP attention with effective attention TP=1) +python -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --disaggregation-mode decode \ + --port 30001 \ + --tp 4 \ + --dp 4 \ + --enable-dp-attention \ + --trust-remote-code \ + --disaggregation-ib-device mlx5_3,mlx5_4 + +# Router +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --prefill 
http://127.0.0.1:30000 \ + --decode http://127.0.0.1:30001 \ + --host 0.0.0.0 --port 8000 +``` + ## NIXL ### Requirements diff --git a/docs/advanced_features/piecewise_cuda_graph.md b/docs/advanced_features/piecewise_cuda_graph.md new file mode 100644 index 000000000000..e0bb47af94eb --- /dev/null +++ b/docs/advanced_features/piecewise_cuda_graph.md @@ -0,0 +1,189 @@ +# Piecewise CUDA Graph + +## Motivation + +Standard CUDA graphs capture the entire model forward pass as a single graph. This works well for decode (fixed batch size), but not for extend/prefill where the number of tokens varies across iterations. + +Piecewise CUDA Graph (PCG) solves this by splitting the model's computation graph into pieces (roughly one per layer) at "split points" (e.g., MoE dispatch ops). Each piece is captured as a separate CUDA graph for a set of pre-defined token lengths. At runtime, the input is padded to the nearest captured size, and each piece is replayed. This eliminates kernel launch overhead for prefill/extend while still supporting dynamic shapes. + +Recently we **enabled PCG by default**, which means that the old `--enable-piecewise-cuda-graph` flag is deprecated. Use `--disable-piecewise-cuda-graph` to turn it off. + +## Usage + +PCG is enabled by default for supported configurations. No extra flags needed: + +```bash +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct +``` + +### Disable PCG + +```bash +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disable-piecewise-cuda-graph +``` + +### Custom capture sizes + +```bash +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --piecewise-cuda-graph-max-tokens 2048 +``` + +### Server Args + +| Argument | Default | Description | +|---|---|---| +| `--disable-piecewise-cuda-graph` | `False` | Disable PCG for extend/prefill. 
| +| `--enforce-piecewise-cuda-graph` | `False` | Force-enable PCG, skipping all auto-disable conditions. For testing only. | +| `--piecewise-cuda-graph-max-tokens` | `None` (auto) | Maximum token count to capture. Defaults to `chunked_prefill_size` (non-MLA) or `2048` (MLA). | +| `--piecewise-cuda-graph-tokens` | `None` (auto) | Explicit list of token lengths to capture. Auto-generated if not set. | +| `--piecewise-cuda-graph-compiler` | `"eager"` | Compiler backend for the captured subgraphs. Choices: `eager`, `inductor`. | +| ~~`--enable-piecewise-cuda-graph`~~ | — | **Deprecated.** PCG is now enabled by default. Use `--enforce-piecewise-cuda-graph` to skip auto-disable conditions. | + +## Bug Report + +PCG is enabled by default but is still in an experimental stage. Since PCG relies on `torch.compile` to trace the model's forward pass, most bugs are introduced by torch compile tracing failures (e.g., untraceable ops, dynamic control flow, or graph breaks). If you encounter any issues related to PCG, please disable it by adding `--disable-piecewise-cuda-graph` to your launch command and report the bug at [GitHub Issues](https://github.com/sgl-project/sglang/issues/new/choose). We greatly appreciate your help in improving this feature. + +### For Users + +If you see an error message like the following during server startup, it is a PCG bug: + +``` +Piecewise CUDA Graph is enabled by default as an experimental feature. +To work around this error, add --disable-piecewise-cuda-graph to your launch command. +Please report this issue at https://github.com/sgl-project/sglang/issues/new/choose +``` + +To work around it, add `--disable-piecewise-cuda-graph` to your launch command. When filing a bug report, please include: +1. The full error traceback +2. Model name and quantization method +3. Launch command with all arguments +4. 
GPU type and driver version + +### For Developers + +Since PCG relies on `torch.compile` to trace the model's forward pass, newly developed CUDA kernels (both JIT kernels and sgl-kernels) are typically not compatible with `torch.compile` out of the box. The tracing will fail on untraceable operations such as JIT compilation, file I/O, or dynamic module loading inside the kernel. + +To make a kernel compatible with PCG, you need to register it as a custom op using `register_custom_op` from `sglang.srt.utils.custom_op`. This wraps the kernel as an opaque node in the compiled graph so that `torch.compile` will not trace inside it. + +**Example usage (JIT kernel):** + +```python +from sglang.srt.utils.custom_op import register_custom_op + +# Inplace operator (no return value) +@register_custom_op(mutates_args=["output_q", "output_s"]) +def per_token_group_quant_8bit( + input: torch.Tensor, + output_q: torch.Tensor, + output_s: torch.Tensor, +) -> None: + # kernel implementation ... +``` + +**Example usage (operator with output):** + +```python +# out_shape indicates which argument has the same shape as the output +@register_custom_op(mutates_args=["x"], out_shape=0) +def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor: + return x.add_(y) +``` + +For wrapping external library functions (e.g., FlashInfer kernels), use `register_custom_op_from_extern` instead. See `python/sglang/srt/utils/custom_op.py` for full API documentation. + +## How it works +### Torch compile backend + +PCG uses `torch.compile` with a custom backend (`SGLangBackend`) to split and compile the model's forward pass. 
The flow is: + +``` +model.forward wrapper +→ torch.compile(..., backend=SGLangBackend) +→ FX graph +→ split_graph() at registered split ops +→ split_gm (top-level graph that chains the pieces) +→ replace capturable submodules with CUDAPiecewiseBackend +→ runtime dispatch: eager split ops + per-piece capture/replay +``` + +- **Install**: `install_torch_compiled()` replaces `model.forward` with a wrapper function. When `is_in_piecewise_cuda_graph()` returns True, the wrapper dispatches to the compiled callable; otherwise it falls back to the original forward. The first invocation through this path triggers Dynamo tracing and graph compilation — CUDA graph replay only happens after the capture phase completes. + +- **Split**: When `torch.compile` traces the model, `SGLangBackend` receives the FX graph and calls `split_graph()`. Ops listed in `CompilationConfig.split_ops` are treated as split points, so the graph is cut at each one. These split-op submodules are left to run eagerly at runtime, while the surrounding submodules are compiled and wrapped by `CUDAPiecewiseBackend`. The result is a top-level "stitching graph" (`split_gm`) with children such as `submod_0`, `submod_1`, … interleaving capturable subgraphs and eager split-op submodules. + +- **Replace**: `PiecewiseCompileInterpreter` iterates over each capturable submodule in `split_gm`, compiles it for general (dynamic) shapes, and replaces it in-place with a `CUDAPiecewiseBackend` instance. Split-op submodules (e.g., attention, all-reduce) are left as-is and run eagerly at runtime. + +- **Dispatch**: At runtime, calling `split_gm` executes the stitching graph, which calls each submodule in order. Split-op submodules run eagerly. Each `CUDAPiecewiseBackend` submodule goes through three phases: + - **Compile warmup** — runs the general-shape compiled path. + - **Capture** — for each capture size, runs one warmup pass then records a CUDA graph. 
+ - **Steady-state replay** — replays the captured CUDA graph for each forward pass. + +### Piecewise cuda graph runner + +`PiecewiseCudaGraphRunner` orchestrates the full lifecycle through three phases: + +- **Compile** — Warms up JIT kernels with a dummy forward pass, then wraps the model with `torch.compile`, triggering Dynamo tracing to split the FX graph and create `CUDAPiecewiseBackend` instances for each subgraph piece. + +- **Capture** — Iterates over capture sizes in reverse order (largest first). For each size, runs the forward pass twice (one warmup, one CUDA graph capture). + +- **Replay** — At runtime, finds the smallest captured size >= actual token count via binary search, copies inputs into static buffers with zero-padding, replays the captured CUDA graphs, and slices outputs back to the actual token count. + +### Memory optimization + +The memory cost of PCG comes from two parts: **torch memory allocator** and **non-torch memory**. + +The torch memory allocator overhead is trivial thanks to several optimizations: a global shared memory pool is reused across all CUDA graph runners and capture sizes, capture is done in reverse order (large to small) so smaller graphs reuse memory allocated by larger ones, and output tensors of the last subgraph are stored as weak references to maximize memory reuse. + +The main memory overhead comes from non-torch memory — the CUDA graph objects themselves require GPU memory to store the recorded kernel launch parameters and internal state. This overhead scales with the number of captured sizes, which is why `piecewise_cuda_graph_max_tokens` is capped conservatively by default. + +### Shape configuration +Piecewise CUDA graph pre-captures graphs for a set of token counts. At runtime, the actual token count is rounded up to the nearest captured size (via binary search), and the corresponding graph is replayed. 
If the token count exceeds the largest captured size, the runtime falls back to the normal (non-graph) forward path. + +The default capture schedule is auto-generated with increasing granularity: + +| Token range | Step size | +|-------------|-----------| +| 4 – 32 | 4 | +| 48 – 256 | 16 | +| 288 – 512 | 32 | +| 576 – 1024 | 64 | +| 1280 – 4096 | 256 | +| 4096+ | 512 | + +For the auto-generated schedule, sizes are capped at `--piecewise-cuda-graph-max-tokens`. The default cap is `chunked_prefill_size` for non-MLA models and `2048` for MLA backend models. If `--max-total-tokens` is set, the cap is further limited to not exceed it. Additionally, Llama-2 models are auto-capped at 4096 tokens as a temporary workaround. + +## Compatibility + +PCG is auto-disabled in the following scenarios. We are actively working on expanding compatibility — support for many of these will be coming soon. + +- Disabled model architectures (e.g., `DeepseekV32ForCausalLM`) +- Speculative decoding +- DP attention +- Pipeline parallelism (`pp_size > 1`) +- Non-CUDA hardware (AMD ROCm, Ascend NPU) +- MoE A2A backend +- LoRA +- Multimodal / VLM models +- DLLM (diffusion LLM) +- Deterministic inference +- PD disaggregation +- Expert distribution recorder / EPLB + +Use `--enforce-piecewise-cuda-graph` to skip all auto-disable checks (for testing/debugging only). 
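The size lookup described under "Shape configuration" (round the token count up to the nearest captured size, or fall back when it exceeds the largest one) can be sketched as follows. This is an illustrative sketch only: `pick_capture_size` is a hypothetical helper, not the function used by `PiecewiseCudaGraphRunner`, and the schedule covers just the first two rows of the default table.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def pick_capture_size(capture_sizes: Sequence[int], num_tokens: int) -> Optional[int]:
    """Return the smallest captured size >= num_tokens, or None to signal fallback."""
    # capture_sizes must be sorted in ascending order for the binary search to be valid.
    idx = bisect_left(capture_sizes, num_tokens)
    if idx == len(capture_sizes):
        # Larger than every captured graph: take the normal (non-graph) forward path.
        return None
    return capture_sizes[idx]


# Hypothetical schedule: 4..32 in steps of 4, then 48..256 in steps of 16.
sizes = list(range(4, 33, 4)) + list(range(48, 257, 16))

for num_tokens in (5, 32, 33, 300):
    chosen = pick_capture_size(sizes, num_tokens)
    if chosen is None:
        print(f"{num_tokens} tokens -> fallback to the non-graph path")
    else:
        print(f"{num_tokens} tokens -> graph size {chosen}, padding {chosen - num_tokens}")
```

The padding printed for each case is the number of extra token slots that would be zero-filled before replay, which is why finer steps at small sizes keep wasted work low.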
+ +## Code Reference + +| File | Description | +|---|---| +| `python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py` | Main runner: init, capture, replay | +| `python/sglang/srt/compilation/compile.py` | `install_torch_compiled` trampoline | +| `python/sglang/srt/compilation/backend.py` | `SGLangBackend`, graph splitting, piecewise compilation | +| `python/sglang/srt/compilation/cuda_piecewise_backend.py` | Per-subgraph CUDA graph capture/replay | +| `python/sglang/srt/compilation/piecewise_context_manager.py` | Global context flags and `ForwardContext` | +| `python/sglang/srt/compilation/compilation_config.py` | Capture sizes, split ops, compiler config | +| `python/sglang/srt/utils/custom_op.py` | `register_custom_op` for torch.compile compatibility | +| `python/sglang/srt/server_args.py` | Server arguments and auto-disable logic | diff --git a/docs/advanced_features/quantization.md b/docs/advanced_features/quantization.md index 90715a908ea7..8e68d5d10b93 100644 --- a/docs/advanced_features/quantization.md +++ b/docs/advanced_features/quantization.md @@ -17,11 +17,76 @@ or [NeuralMagic](https://huggingface.co/collections/neuralmagic) collections on popular quality validated quantized models. Quantized models must be validated via benchmarks post-quantization to guard against abnormal quantization loss regressions. +## Platform Compatibility + +The following table summarizes quantization method support across NVIDIA GPUs, AMD GPUs, and Ascend NPUs.
+ +| Method | NVIDIA GPUs | AMD GPUs (MI300X/MI325X/MI350X) | Ascend NPUs (A2/A3) | Notes | +|--------|:-----------:|:-------------------------------:|:-----------------------:|-------| +| `fp8` | Yes | Yes | WIP | Aiter or Triton backend on AMD | +| `mxfp4` | Yes | Yes | WIP | Requires CDNA3/CDNA4 with MXFP support; uses Aiter | +| `blockwise_int8` | Yes | Yes | No | Triton-based, works on both platforms | +| `w8a8_int8` | Yes | Yes | No | | +| `w8a8_fp8` | Yes | Yes | No | Aiter or Triton FP8 on AMD | +| `awq` | Yes | Yes | Yes | Uses Triton dequantize on AMD (vs. optimized CUDA kernels on NVIDIA). Uses CANN kernels on Ascend| +| `gptq` | Yes | Yes | Yes | Uses Triton or vLLM kernels on AMD. Uses CANN kernels on Ascend| +| `compressed-tensors` | Yes | Yes | Partial | Aiter paths for FP8/MoE on AMD. Uses CANN kernels on Ascend, `FP8` not supported yet| +| `quark` | Yes | Yes | No | AMD Quark quantization; Aiter GEMM paths on AMD | +| `auto-round` | Yes | Yes | Partial | Platform-agnostic (Intel auto-round). Uses CANN kernels on Ascend| +| `quark_int4fp8_moe` | No | Yes | No | AMD-only; online INT4-to-FP8 MoE quantization (CDNA3/CDNA4) | +| `awq_marlin` | Yes | No | No | Marlin kernels are CUDA-only | +| `gptq_marlin` | Yes | No | No | Marlin kernels are CUDA-only | +| `gguf` | Yes | No | Yes | CUDA-only kernels in sgl-kernel; Pre-dequantized on Ascend | +| `modelopt` / `modelopt_fp8` | Yes (Hopper/SM90+) | No | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); requires NVIDIA hardware | +| `modelopt_fp4` | Yes (Blackwell/SM100+) | No | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); native FP4 on Blackwell (B200, GB200) | +| `petit_nvfp4` | No | Yes (MI250/MI300X/MI325X) | No | Enables NVFP4 on ROCm via [Petit](https://github.com/causalflow-ai/petit-kernel); use `modelopt_fp4` on NVIDIA Blackwell. Auto-selected when loading NVFP4 models on AMD. 
See [LMSYS blog](https://lmsys.org/blog/2025-09-21-petit-amdgpu/) and [AMD ROCm blog](https://rocm.blogs.amd.com/artificial-intelligence/fp4-mixed-precision/README.html). | +| `bitsandbytes` | Yes | Experimental | No | Depends on bitsandbytes ROCm support | +| `torchao` (`int4wo`, etc.) | Yes | Partial | No | `int4wo` not supported on AMD; other methods may work | +| `modelslim` | No | No | Yes | Ascend quantization; uses CANN kernels | +| `mxfp8` (diffusion) | No | No | Yes (A2/A3) | Ascend NPU only; online MXFP8 quantization for diffusion models (e.g., Wan2.2); requires CANN ≥ 8.0.RC3 | + +On AMD, several of these methods use [Aiter](https://github.com/ROCm/aiter) for acceleration -- set `SGLANG_USE_AITER=1` where noted. See [AMD GPU setup](../platforms/amd_gpu.md) for installation and configuration details. + +On Ascend, quantization configurations for various layer types are supported; see [Ascend NPU quantization](../platforms/ascend/ascend_npu_quantization.md) for details. + +## GEMM Backends for FP4/FP8 Quantization + +:::{note} +Backend selection is supported only for **blockwise FP8** and **NVFP4** GEMM. When running FP8 or FP4 quantized models, you can select the GEMM backend via `--fp8-gemm-backend` and `--fp4-gemm-backend`.
+::: + +### `--fp8-gemm-backend` (Blockwise FP8 GEMM) + +| Backend | Hardware | Description | +|---------|----------|-------------| +| `auto` | All | Auto-selects based on hardware | +| `deep_gemm` | SM90, SM100 | JIT-compiled; enabled when DeepGEMM is installed | +| `flashinfer_trtllm` | SM100 | FlashInfer TensorRT-LLM backend; optimal for low-latency | +| `flashinfer_cutlass` | SM100/120 | FlashInfer CUTLASS groupwise FP8 GEMM | +| `flashinfer_deepgemm` | SM90 | Uses swapAB optimization for small M dimensions in decoding | +| `cutlass` | SM90, SM100/120 | sgl-kernel CUTLASS | +| `triton` | All | Fallback; widely compatible | +| `aiter` | ROCm | AMD AITER backend | + +**`auto` selection order:** 1) DeepGEMM (SM90/SM100, installed); 2) FlashInfer TRTLLM (SM100, FlashInfer available); 3) CUTLASS (SM90/SM100/120); 4) AITER (AMD); 5) Triton. **Exception:** SM120 always resolves to Triton. + +### `--fp4-gemm-backend` (NVFP4 GEMM) + +| Backend | Hardware | Description | +|---------|----------|-------------| +| `auto` | SM100/120 | Auto-selects: `flashinfer_cudnn` on SM120; `flashinfer_cutlass` on SM100 | +| `cutlass` | SM100/120 | SGLang CUTLASS kernel | +| `flashinfer_cutlass` | SM100/120 | FlashInfer CUTLASS backend | +| `flashinfer_cudnn` | SM100/120 (CUDA 13+, cuDNN 9.15+) | FlashInfer cuDNN backend; used on SM120 for performance | +| `flashinfer_trtllm` | SM100 | FlashInfer TensorRT-LLM backend | + +When FlashInfer is unavailable for NVFP4, the SGLang CUTLASS kernel is used as an automatic fallback. + ## Offline Quantization To load already quantized models, simply load the model weights and config. **Again, if the model has been quantized offline, there's no need to add `--quantization` argument when starting the engine. The quantization method will be parsed from the -downloaded Hugging Face config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.** +downloaded Hugging Face or msModelSlim config. 
For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.** ```bash python3 -m sglang.launch_server \ @@ -191,23 +256,85 @@ python3 -m sglang.launch_server \ #### Using [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer) -NVIDIA Model Optimizer (ModelOpt) provides advanced quantization techniques optimized for NVIDIA hardware. SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment. +NVIDIA Model Optimizer (ModelOpt) provides advanced quantization techniques optimized for NVIDIA hardware. + +**Offline vs. Online Quantization:** + +SGLang supports two modes for ModelOpt. + +* **Offline Quantization (pre-quantized):** + * **Usage:** Download a pre-quantized model from Hugging Face or run `hf_ptq.py` once to create a new quantized checkpoint. Then load this quantized checkpoint. + * **Pros:** Fast server startup, quantization can be validated before deployment, efficient resource usage. + * **Cons:** Requires an extra preparation step. + +* **Online Quantization (quant and serve):** + * **Usage:** Load a standard BF16/FP16 model and add a flag. The engine applies quantization *on startup*. + * **Pros:** Convenient (no new checkpoint needed). + * **Cons:** **High startup time**, increases VRAM usage during initialization (risk of OOM). + +The following sections guide you through using the Offline path: loading pre-quantized models or creating your own checkpoints. + +##### Using Pre-Quantized Checkpoints + +If a model is already quantized (e.g., from Hugging Face), you can load it directly. + +* **FP8 Models:** + Use `--quantization modelopt_fp8`. + ```bash + python3 -m sglang.launch_server \ + --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \ + --quantization modelopt_fp8 \ + --port 30000 + ``` + +* **FP4 Models:** + Use `--quantization modelopt_fp4`. 
+ ```bash + python3 -m sglang.launch_server \ + --model-path nvidia/Llama-3.3-70B-Instruct-NVFP4 \ + --quantization modelopt_fp4 \ + --port 30000 + ``` + +##### Creating Your Own Quantized Checkpoints + +If a pre-quantized checkpoint is not available for your model, you can create one using NVIDIA Model Optimizer's `hf_ptq.py` script. + +**Why quantize?** +- Reduce VRAM usage +- Higher throughput and lower latency +- More flexible deployment (on smaller GPUs) + +**What can be quantized?** +- The entire model +- MLP layers only +- KV cache + +**Key options in `hf_ptq.py`:** + +`--qformat`: Quantization formats `fp8`, `nvfp4`, `nvfp4_mlp_only` + +`--kv_cache_qformat`: KV cache quantization format (default: `fp8`) + +**Note:** The default `kv_cache_qformat` may not be optimal for all use cases. Consider setting this explicitly. + +**Hardware requirements:** Hopper and higher are recommended. Insufficient GPU memory may cause weight offloading, resulting in extremely long quantization time. + +For detailed usage and supported model architectures, see [NVIDIA Model Optimizer LLM PTQ](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq). + +SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment. ##### Installation -First, install ModelOpt. You can either install it directly or as an optional SGLang dependency: +First, install ModelOpt: ```bash -# Option 1: Install ModelOpt directly pip install nvidia-modelopt - -# Option 2: Install SGLang with ModelOpt support (recommended) -pip install sglang[modelopt] ``` ##### Quantization and Export Workflow -SGLang provides an example script that demonstrates the complete ModelOpt quantization and export workflow: +SGLang provides an example script that demonstrates the complete ModelOpt quantization and export workflow. 
Run from the SGLang repository root (see [modelopt_quantize_and_export.py](https://github.com/sgl-project/sglang/blob/main/examples/usage/modelopt_quantize_and_export.py)): ```bash # Quantize and export a model using ModelOpt FP8 quantization @@ -216,7 +343,7 @@ python examples/usage/modelopt_quantize_and_export.py quantize \ --export-dir ./quantized_tinyllama_fp8 \ --quantization-method modelopt_fp8 -# For FP4 quantization +# For FP4 quantization (requires Blackwell GPU) python examples/usage/modelopt_quantize_and_export.py quantize \ --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \ --export-dir ./quantized_tinyllama_fp4 \ @@ -272,25 +399,39 @@ python -m sglang.launch_server \ --port 30000 --host 0.0.0.0 ``` -Or using the Python API: +Or using the Python API (use the same path as `modelopt_export_path` from the quantize step): ```python import sglang as sgl -# Deploy exported ModelOpt quantized model -llm = sgl.Engine( - model_path="./quantized_tinyllama_fp8", - quantization="modelopt" -) - -# Run inference -prompts = ["Hello, how are you?", "What is the capital of France?"] -sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 100} -outputs = llm.generate(prompts, sampling_params) +def main(): + # Deploy exported ModelOpt quantized model + # Path must match modelopt_export_path from quantize step (e.g., ./exported_model) + llm = sgl.Engine( + model_path="./exported_model", + quantization="modelopt", + ) + + # Run inference + prompts = [ + "Hello, how are you?", + "What is the capital of France?", + ] + sampling_params = { + "temperature": 0.8, + "top_p": 0.95, + "max_new_tokens": 100, + } + + outputs = llm.generate(prompts, sampling_params) + + for i, output in enumerate(outputs): + print(f"Prompt: {prompts[i]}") + print(f"Output: {output['text']}") + +if __name__ == "__main__": + main() -for i, output in enumerate(outputs): - print(f"Prompt: {prompts[i]}") - print(f"Output: {output.outputs[0].text}") ``` ##### Advanced Features @@ -308,7 
+449,7 @@ python examples/usage/modelopt_quantize_and_export.py quantize \ # The checkpoint can be reused for future quantization runs and skip calibration ``` -**Export-only Workflow**: If you have a pre-existing fake quantized ModelOpt checkpoint, you can export it directly: +**Export-only Workflow**: If you have a pre-existing fake quantized ModelOpt checkpoint, you can export it directly. See [LoadConfig](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/load_config.py) for the full API: ```python from sglang.srt.configs.device_config import DeviceConfig @@ -327,7 +468,7 @@ load_config = LoadConfig( modelopt_export_path="./exported_model", ) -# Load and export the model +# Load and export the model (DeviceConfig defaults to device="cuda") model_loader = get_model_loader(load_config, model_config) model_loader.load_model(model_config=model_config, device_config=DeviceConfig()) ``` @@ -340,6 +481,74 @@ model_loader.load_model(model_config=model_config, device_config=DeviceConfig()) - **Calibration-based**: Uses calibration datasets for optimal quantization quality - **Production Ready**: Enterprise-grade quantization with NVIDIA support +#### Using [ModelSlim](https://gitcode.com/Ascend/msmodelslim) +MindStudio-ModelSlim (msModelSlim) is a model offline quantization compression tool launched by MindStudio and optimized for Ascend hardware. + +- **Installation** + + ```bash + # Clone repo and install msmodelslim: + git clone https://gitcode.com/Ascend/msmodelslim.git + cd msmodelslim + bash install.sh + ``` + +- **LLM quantization** + + Download the original floating-point weights of the large model. Taking Qwen3-32B as an example, you can go to [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) to obtain the original model weights. Then install other dependencies (related to the model, refer to the huggingface model card). + > Note: You can find pre-quantized validated models on [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech). 
+ + _Traditional quantization methods require calibration data files (in ```.jsonl``` format) for calibration during the quantization process._ + ```bash + Qwen3-32B/ # floating-point model downloaded from official HF (or modelscope) repo + msmodelslim/ # msmodelslim repo + |----- lab_calib # calibration data folder (put your dataset here in ```.jsonl``` format or use pre-prepared ones) + |----- some file (such as laos_calib.jsonl) + |----- lab_practice # best practice folder with configs for quantization + |----- model folder (such as qwen3_5_moe folder) # folder with quantization configs + |----- quant_config (such as qwen3_5_moe_w8a8.yaml) # quantization config + |----- other folders + output_folder/ # generated by the command below + |----- quant_model_weights-00001-of-0001.safetensors # quantized weights + |----- quant_model_description.json # file with description of the quantization methods for each layer (```W4A4_DYNAMIC```, etc.) + |----- other files (such as config.json, tokenizer.json, etc.)
+ ``` + Run quantization using one-click quantization (recommended): + ```bash + msmodelslim quant \ + --model_path ${MODEL_PATH} \ + --save_path ${SAVE_PATH} \ + --device npu:0,1 \ + --model_type Qwen3-32B \ + --quant_type w8a8 \ + --trust_remote_code True + ``` + +- **Usage Example** + ```bash + python3 -m sglang.launch_server \ + --model-path $PWD/Qwen3-32B-w8a8 \ + --port 30000 --host 0.0.0.0 + ``` + +- **Available Quantization Methods**: + - [x] ```W4A4_DYNAMIC``` linear with online quantization of activations + - [x] ```W8A8``` linear with offline quantization of activations + - [x] ```W8A8_DYNAMIC``` linear with online quantization of activations + - [x] ```W4A4_DYNAMIC``` MoE with online quantization of activations + - [x] ```W4A8_DYNAMIC``` MoE with online quantization of activations + - [x] ```W8A8_DYNAMIC``` MoE with online quantization of activations + - [ ] ```W4A8``` linear TBD + - [ ] ```W4A16``` linear TBD + - [ ] ```W8A16``` linear TBD + - [ ] ```W4A16``` MoE in progress + - [ ] ```W8A16``` MoE in progress + - [ ] ```KV Cache``` in progress + - [ ] ```Attention``` in progress + + +For more detailed model quantization examples, as well as information about model support, see the [examples](https://gitcode.com/Ascend/msmodelslim/blob/master/example/README.md) section in the ModelSlim repo. + ## Online Quantization To enable online quantization, you can simply specify `--quantization` in the command line. For example, you can launch the server with the following command to enable `FP8` quantization for model `meta-llama/Meta-Llama-3.1-8B-Instruct`: @@ -382,11 +591,44 @@ SGLang running on AMD GPUs (CDNA3 or CDNA4 architecture) supports the quantizati Other layers (e.g. projections in the attention layers) have their weights quantized online to float8 directly.
+## Diffusion Model Quantization on Ascend NPU + +SGLang-Diffusion supports MXFP8 quantization for diffusion models (such as Wan2.2) on Ascend A5 NPUs, in both online and offline (ModelSlim) modes. This is separate from the LLM serving path and uses the `sglang serve` / `sglang generate` CLI. + +**Requirements:** Ascend A5, CANN ≥ 8.0.RC3 + +### Online MXFP8 + +Pass `--quantization mxfp8` to dynamically quantize FP16/BF16 transformer weights to MXFP8 at load time: + +```bash +sglang serve \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --quantization mxfp8 \ + --num-gpus 4 +``` + +### Offline MXFP8 (ModelSlim) + +Pre-quantize with [msModelSlim](https://gitcode.com/Ascend/msmodelslim) and load the checkpoint directly — the quantization scheme is auto-detected from `quant_model_description.json`: + +```bash +sglang generate \ + --model-path /path/to/wan2_2_mxfp8_diffusers \ + --prompt "a beautiful sunset" \ + --save-output +``` + +For the full quantization + format conversion workflow and a complete list of supported schemes, see [Diffusion Quantization on Ascend NPU](../platforms/ascend/ascend_npu_quantization.md#diffusion-model-quantization-on-ascend-npu) and [SGLang-Diffusion Quantization](../diffusion/quantization.md#modelslim). 
+ ## Reference - [GPTQModel](https://github.com/ModelCloud/GPTQModel) - [LLM Compressor](https://github.com/vllm-project/llm-compressor/) - [NVIDIA Model Optimizer (ModelOpt)](https://github.com/NVIDIA/Model-Optimizer) +- [NVIDIA Model Optimizer LLM PTQ](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq) +- [Petit: NVFP4 on ROCm](https://github.com/causalflow-ai/petit-kernel) — [LMSYS blog](https://lmsys.org/blog/2025-09-21-petit-amdgpu/), [AMD ROCm blog](https://rocm.blogs.amd.com/artificial-intelligence/fp4-mixed-precision/README.html) - [Torchao: PyTorch Architecture Optimization](https://github.com/pytorch/ao) - [vLLM Quantization](https://docs.vllm.ai/en/latest/quantization/) - [auto-round](https://github.com/intel/auto-round) +- [ModelSlim](https://gitcode.com/Ascend/msmodelslim) diff --git a/docs/advanced_features/rfork.md b/docs/advanced_features/rfork.md index 5e01aa111216..e4b513328ecf 100644 --- a/docs/advanced_features/rfork.md +++ b/docs/advanced_features/rfork.md @@ -9,11 +9,12 @@ To learn more details about R-Fork, please check **`--tp-size` | The tensor parallelism size. | `1` | Type: int | | `--pipeline-parallel-size`
`--pp-size` | The pipeline parallelism size. | `1` | Type: int | +| `--attention-context-parallel-size`<br>`--attn-cp-size` | The attention context parallelism size. | `1` | Type: int | +| `--moe-data-parallel-size`
`--moe-dp-size`| The moe data parallelism size. | `1` | Type: int| | `--pp-max-micro-batch-size` | The maximum micro batch size in pipeline parallelism. | `None` | Type: int | | `--pp-async-batch-depth` | The async batch depth of pipeline parallelism. | `0` | Type: int | | `--stream-interval` | The interval (or buffer size) for streaming in terms of the token length. A smaller value makes streaming smoother, while a larger value makes the throughput higher | `1` | Type: int | -| `--stream-output` | Whether to output as a sequence of disjoint segments. | `False` | bool flag (set to enable) | +| `--incremental-streaming-output` | Whether to output as a sequence of disjoint segments. | `False` | bool flag (set to enable) | | `--random-seed` | The random seed. | `None` | Type: int | | `--constrained-json-whitespace-pattern` | (outlines and llguidance backends only) Regex pattern for syntactic whitespaces allowed in JSON constrained output. For example, to allow the model to generate consecutive whitespaces, set the pattern to [\n\t ]* | `None` | Type: str | | `--constrained-json-disable-any-whitespace` | (xgrammar and llguidance backends only) Enforce compact representation in JSON constrained output. | `False` | bool flag (set to enable) | @@ -186,6 +186,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--crash-dump-folder` | Folder path to dump requests from the last 5 min before a crash (if any). If not specified, crash dumping is disabled. | `None` | Type: str | | `--show-time-cost` | Show time cost of custom marks. | `False` | bool flag (set to enable) | | `--enable-metrics` | Enable log prometheus metrics. | `False` | bool flag (set to enable) | +| `--enable-mfu-metrics` | Enable estimated MFU-related prometheus metrics. 
| `False` | bool flag (set to enable) | | `--enable-metrics-for-all-schedulers` | Enable --enable-metrics-for-all-schedulers when you want schedulers on all TP ranks (not just TP 0) to record request metrics separately. This is especially useful when dp_attention is enabled, as otherwise all metrics appear to come from TP 0. | `False` | bool flag (set to enable) | | `--tokenizer-metrics-custom-labels-header` | Specify the HTTP header for passing custom labels for tokenizer metrics. | `x-custom-labels` | Type: str | | `--tokenizer-metrics-allowed-custom-labels` | The custom labels allowed for tokenizer metrics. The labels are specified via a dict in '--tokenizer-metrics-custom-labels-header' field in HTTP requests, e.g., {'label1': 'value1', 'label2': 'value2'} is allowed if '--tokenizer-metrics-allowed-custom-labels label1 label2' is set. | `None` | List[str] | @@ -212,16 +213,16 @@ Please consult the documentation below and [server_args.py](https://github.com/s | Argument | Description | Defaults | Options | | --- | --- | --- | --- | | `--api-key` | Set API key of the server. It is also used in the OpenAI API compatible server. | `None` | Type: str | -| `--admin-api-key` | Set **admin API key** for administrative/control endpoints (e.g., weights update, cache flush, `/get_server_info`). Endpoints marked as admin-only require `Authorization: Bearer ` when this is set. | `None` | Type: str | +| `--admin-api-key` | Set **admin API key** for administrative/control endpoints (e.g., weights update, cache flush, `/server_info`). Endpoints marked as admin-only require `Authorization: Bearer ` when this is set. | `None` | Type: str | | `--served-model-name` | Override the model name returned by the v1/models endpoint in OpenAI API server. | `None` | Type: str | | `--weight-version` | Version identifier for the model weights. Defaults to 'default' if not specified. 
| `default` | Type: str | -| `--chat-template` | The buliltin chat template name or the path of the chat template file. This is only used for OpenAI-compatible API server. | `None` | Type: str | +| `--chat-template` | The builtin chat template name or the path of the chat template file. This is only used for OpenAI-compatible API server. | `None` | Type: str | | `--hf-chat-template-name` | When the HuggingFace tokenizer has multiple chat templates (e.g., 'default', 'tool_use', 'rag'), specify which named template to use. If not set, the first available template is used. | `None` | Type: str | -| `--completion-template` | The buliltin completion template name or the path of the completion template file. This is only used for OpenAI-compatible API server. only for code completion currently. | `None` | Type: str | +| `--completion-template` | The builtin completion template name or the path of the completion template file. This is only used for OpenAI-compatible API server. only for code completion currently. | `None` | Type: str | | `--file-storage-path` | The path of the file storage in backend. | `sglang_storage` | Type: str | | `--enable-cache-report` | Return number of cached tokens in usage.prompt_tokens_details for each openai request. | `False` | bool flag (set to enable) | | `--reasoning-parser` | Specify the parser for reasoning models. Supported parsers: [deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3]. | `None` | `deepseek-r1`, `deepseek-v3`, `glm45`, `gpt-oss`, `kimi`, `qwen3`, `qwen3-thinking`, `step3` | -| `--tool-call-parser` | Specify the parser for handling tool-call interactions. Supported parsers: [deepseekv3, deepseekv31, glm, glm45, glm47, gpt-oss, kimi_k2, llama3, mistral, pythonic, qwen, qwen25, qwen3_coder, step3]. 
| `None` | `deepseekv3`, `deepseekv31`, `glm`, `glm45`, `glm47`, `gpt-oss`, `kimi_k2`, `llama3`, `mistral`, `pythonic`, `qwen`, `qwen25`, `qwen3_coder`, `step3` | +| `--tool-call-parser` | Specify the parser for handling tool-call interactions. Supported parsers: [deepseekv3, deepseekv31, glm, glm45, glm47, gpt-oss, kimi_k2, llama3, mistral, pythonic, qwen, qwen25, qwen3_coder, step3]. | `None` | `deepseekv3`, `deepseekv31`, `glm`, `glm45`, `glm47`, `gpt-oss`, `kimi_k2`, `llama3`, `mistral`, `pythonic`, `qwen`, `qwen25`, `qwen3_coder`, `step3`, `gigachat3` | | `--tool-server` | Either 'demo' or a comma-separated list of tool server urls to use for the model. If not specified, no tool server will be used. | `None` | Type: str | | `--sampling-defaults` | Where to get default sampling parameters. 'openai' uses SGLang/OpenAI defaults (temperature=1.0, top_p=1.0, etc.). 'model' uses the model's generation_config.json to get the recommended sampling parameters if available. Default is 'model'. | `model` | `openai`, `model` | @@ -257,6 +258,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--lora-eviction-policy` | LoRA adapter eviction policy when the GPU memory pool is full. | `lru` | `lru`, `fifo` | | `--lora-backend` | Choose the kernel backend for multi-LoRA serving. | `csgmv` | `triton`, `csgmv`, `ascend`, `torch_native` | | `--max-lora-chunk-size` | Maximum chunk size for the ChunkedSGMV LoRA backend. Only used when `--lora-backend` is `csgmv`. Larger values may improve performance. | `16` | `16`, `32`, `64`, `128` | +| `--lora-drain-wait-threshold` | When any LoRA adapter request waits longer than this threshold (in seconds), the scheduler will selectively drain one running adapter to make room. This mitigates extreme tail latency under high or skewed workloads by preventing a small set of adapters from monopolizing batch slots. Set to 0 to disable draining (default). 
| `0.0` | Type: float | ## Kernel Backends (Attention, Sampling, Grammar, GEMM) | Argument | Description | Defaults | Options | @@ -267,10 +269,10 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--sampling-backend` | Choose the kernels for sampling layers. | `None` | `flashinfer`, `pytorch`, `ascend` | | `--grammar-backend` | Choose the backend for grammar-guided decoding. | `None` | `xgrammar`, `outlines`, `llguidance`, `none` | | `--mm-attention-backend` | Set multimodal attention backend. | `None` | `sdpa`, `fa3`, `fa4`, `triton_attn`, `ascend_attn`, `aiter_attn` | -| `--nsa-prefill-backend` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_kv`, `flashmla_auto`, `fa3`, `tilelang`, `aiter` | -| `--nsa-decode-backend` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `fa3` | `flashmla_sparse`, `flashmla_kv`, `fa3`, `tilelang`, `aiter` | -| `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (optimal for Blackwell and low-latency), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only). **NOTE**: This replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. | `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `cutlass`, `triton`, `aiter` | -| `--fp4-gemm-backend` | Choose the runner backend for NVFP4 GEMM operations. 
Options: 'auto' (default, auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_cutlass' (FlashInfer CUTLASS backend, optimal on CUDA 12), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback. **NOTE**: This replaces the deprecated environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. | `auto` | `auto`, `flashinfer_cudnn`, `flashinfer_cutlass`, `flashinfer_trtllm` | +| `--nsa-prefill-backend` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_kv`, `flashmla_auto`, `fa3`, `tilelang`, `aiter`, `trtllm` | +| `--nsa-decode-backend` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `fa3` | `flashmla_sparse`, `flashmla_kv`, `fa3`, `tilelang`, `aiter`, `trtllm` | +| `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. 
Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (FlashInfer TRTLLM backend; SM100/SM103 only), 'flashinfer_cutlass' (FlashInfer CUTLASS backend, SM120 only), 'flashinfer_deepgemm' (Hopper SM90 only, uses swapAB optimization for small M dimensions in decoding), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only).| `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_deepgemm`, `cutlass`, `triton`, `aiter` | +| `--fp4-gemm-backend` | Choose the runner backend for NVFP4 GEMM operations. Options: 'flashinfer_cutlass' (default), 'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback.| `flashinfer_cutlass` | `auto`, `flashinfer_cudnn`, `flashinfer_cutlass`, `flashinfer_trtllm` | | `--disable-flashinfer-autotune` | Flashinfer autotune is enabled by default. Set this flag to disable the autotune. | `False` | bool flag (set to enable) | ## Speculative decoding @@ -295,12 +297,10 @@ Please consult the documentation below and [server_args.py](https://github.com/s ## Ngram speculative decoding | Argument | Description | Defaults | Options | | --- | --- | --- | --- | -| `--speculative-ngram-min-match-window-size` | The minimum window size for pattern matching in ngram speculative decoding. | `1` | Type: int | -| `--speculative-ngram-max-match-window-size` | The maximum window size for pattern matching in ngram speculative decoding. 
| `12` | Type: int | | `--speculative-ngram-min-bfs-breadth` | The minimum breadth for BFS (Breadth-First Search) in ngram speculative decoding. | `1` | Type: int | | `--speculative-ngram-max-bfs-breadth` | The maximum breadth for BFS (Breadth-First Search) in ngram speculative decoding. | `10` | Type: int | -| `--speculative-ngram-match-type` | The match type for cache tree. | `BFS` | `BFS`, `PROB` | -| `--speculative-ngram-branch-length` | The branch length for ngram speculative decoding. | `18` | Type: int | +| `--speculative-ngram-match-type` | Ngram tree-building mode. `BFS` selects recency-based expansion and `PROB` selects frequency-based expansion. This setting is forwarded to the ngram cache implementation. | `BFS` | `BFS`, `PROB` | +| `--speculative-ngram-max-trie-depth` | Maximum suffix length stored and matched by the ngram trie. | `18` | Type: int | | `--speculative-ngram-capacity` | The cache capacity for ngram speculative decoding. | `10000000` | Type: int | ## Multi-layer Eagle speculative decoding @@ -312,10 +312,11 @@ Please consult the documentation below and [server_args.py](https://github.com/s | Argument | Description | Defaults | Options | | --- | --- | --- | --- | | `--expert-parallel-size`
`--ep-size`
`--ep` | The expert parallelism size. | `1` | Type: int | -| `--moe-a2a-backend` | Select the backend for all-to-all communication for expert parallelism. | `none` | `none`, `deepep`, `mooncake`, `ascend_fuseep`| -| `--moe-runner-backend` | Choose the runner backend for MoE. | `auto` | `auto`, `deep_gemm`, `triton`, `triton_kernel`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_mxfp4`, `flashinfer_cutedsl`, `cutlass` | +| `--moe-a2a-backend` | Select the backend for all-to-all communication for expert parallelism. | `none` | `none`, `deepep`, `mooncake`, `mori`, `nixl`, `ascend_fuseep`| +| `--moe-runner-backend` | Choose the runner backend for MoE. | `auto` | `auto`, `deep_gemm`, `triton`, `triton_kernel`, `flashinfer_trtllm`, `flashinfer_trtllm_routed`, `flashinfer_cutlass`, `flashinfer_mxfp4`, `flashinfer_cutedsl`, `cutlass` | | `--flashinfer-mxfp4-moe-precision` | Choose the computation precision of flashinfer mxfp4 moe | `default` | `default`, `bf16` | | `--enable-flashinfer-allreduce-fusion` | Enable FlashInfer allreduce fusion with Residual RMSNorm. | `False` | bool flag (set to enable) | +| `--enable-aiter-allreduce-fusion` | Enable aiter allreduce fusion with Residual RMSNorm. | `False` | bool flag (set to enable) | | `--deepep-mode` | Select the mode when enable DeepEP MoE, could be `normal`, `low_latency` or `auto`. Default is `auto`, which means `low_latency` for decode batch and `normal` for prefill batch. | `auto` | `normal`, `low_latency`, `auto` | | `--ep-num-redundant-experts` | Allocate this number of redundant experts in expert parallel. | `0` | Type: int | | `--ep-dispatch-algorithm` | The algorithm to choose ranks for redundant experts in expert parallel. | `None` | Type: str | @@ -331,13 +332,15 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--deepep-config` | Tuned DeepEP config suitable for your own cluster. It can be either a string with JSON content or a file path. 
| `None` | Type: str | | `--moe-dense-tp-size` | TP size for MoE dense MLP layers. This flag is useful when, with large TP size, there are errors caused by weights in MLP layers having dimension smaller than the min dimension GEMM supports. | `None` | Type: int | | `--elastic-ep-backend` | Specify the collective communication backend for elastic EP. Currently supports 'mooncake'. | `none` | `none`, `mooncake` | +| `--enable-elastic-expert-backup` | Enable the elastic EP backend to back up expert weights in DRAM. Currently supports 'mooncake'. | `False` | bool flag (set to enable) | | `--mooncake-ib-device` | The InfiniBand devices for Mooncake Backend transfer, accepts multiple comma-separated devices (e.g., --mooncake-ib-device mlx5_0,mlx5_1). Default is None, which triggers automatic device detection when Mooncake Backend is enabled. | `None` | Type: str | +| `--elastic-ep-rejoin` | Indicates that this process is a relaunched elastic EP rank that should rejoin an existing process group during rank recovery. | `False` | bool flag (set to enable) | ## Mamba Cache | Argument | Description | Defaults | Options | | --- | --- | --- | --- | | `--max-mamba-cache-size` | The maximum size of the mamba cache. | `None` | Type: int | -| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | `float32`, `bfloat16` | +| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | `float32`, `bfloat16`, `float16` | | `--mamba-full-memory-ratio` | The ratio of mamba state memory to full kv cache memory. | `0.9` | Type: float | | `--mamba-scheduler-strategy` | The strategy to use for mamba scheduler. `auto` currently defaults to `no_buffer`. 1. `no_buffer` does not support overlap scheduler due to not allocating extra mamba state buffers. Branching point caching support is feasible but not implemented. 2.
`extra_buffer` supports overlap schedule by allocating extra mamba state buffers to track mamba state for caching (mamba state usage per running req becomes `2x` for non-spec; `1+(1/(2+speculative_num_draft_tokens))x` for spec dec (e.g. 1.16x if speculative_num_draft_tokens==4)). 2a. `extra_buffer` is strictly better for non-KV-cache-bound cases; for KV-cache-bound cases, the tradeoff depends on whether enabling overlap outweighs reduced max running requests. 2b. mamba caching at radix cache branching point is strictly better than non-branch but requires kernel support (currently only FLA backend), currently only extra_buffer supports branching. | `auto` | `auto`, `no_buffer`, `extra_buffer` | | `--mamba-track-interval` | The interval (in tokens) to track the mamba state during decode. Only used when `--mamba-scheduler-strategy` is `extra_buffer`. Must be divisible by page_size if set, and must be >= speculative_num_draft_tokens when using speculative decoding. | `256` | Type: int | @@ -376,21 +379,12 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--kt-max-deferred-experts-per-token` | [ktransformers parameter] Maximum number of experts deferred to CPU per token. All MoE layers except the final one use this value; the final layer always uses 0. | `None` | Type: int | ## Diffusion LLM + | Argument | Description | Defaults | Options | | --- | --- | --- | --- | | `--dllm-algorithm` | The diffusion LLM algorithm, such as LowConfidence. | `None` | Type: str | | `--dllm-algorithm-config` | The diffusion LLM algorithm configurations. Must be a YAML file. 
| `None` | Type: str | -## Double Sparsity -| Argument | Description | Defaults | Options | -| --- | --- | --- | --- | -| `--enable-double-sparsity` | Enable double sparsity attention | `False` | bool flag (set to enable) | -| `--ds-channel-config-path` | The path of the double sparsity channel config | `None` | Type: str | -| `--ds-heavy-channel-num` | The number of heavy channels in double sparsity attention | `32` | Type: int | -| `--ds-heavy-token-num` | The number of heavy tokens in double sparsity attention | `256` | Type: int | -| `--ds-heavy-channel-type` | The type of heavy channels in double sparsity attention | `qk` | Type: str | -| `--ds-sparse-decode-threshold` | The minimum decode sequence length required before the double-sparsity backend switches from the dense fallback to the sparse decode kernel. | `4096` | Type: int | - ## Offloading | Argument | Description | Defaults | Options | | --- | --- | --- | --- | @@ -434,7 +428,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--tbo-token-distribution-threshold` | The threshold of token distribution between two batches in micro-batch-overlap, determines whether to two-batch-overlap or two-chunk-overlap. Set to 0 denote disable two-chunk-overlap. | `0.48` | Type: float | | `--enable-torch-compile` | Optimize the model with torch.compile. Experimental feature. | `False` | bool flag (set to enable) | | `--enable-torch-compile-debug-mode` | Enable debug mode for torch compile. | `False` | bool flag (set to enable) | -| `--enable-piecewise-cuda-graph` | Optimize the model with piecewise cuda graph for extend/prefill only. Experimental feature. | `False` | bool flag (set to enable) | +| `--disable-piecewise-cuda-graph` | Disable piecewise cuda graph for extend/prefill. PCG is enabled by default. | `False` | bool flag (set to disable) | +| `--enforce-piecewise-cuda-graph` | Enforce piecewise cuda graph, skipping all auto-disable conditions. For testing only. 
| `False` | bool flag (set to enable) | | `--piecewise-cuda-graph-tokens` | Set the list of tokens when using piecewise cuda graph. | `None` | Type: JSON list | | `--piecewise-cuda-graph-compiler` | Set the compiler for piecewise cuda graph. Choices are: eager, inductor. | `eager` | `eager`, `inductor` | | `--torch-compile-max-bs` | Set the maximum batch size when using torch compile. | `32` | Type: int | @@ -465,7 +460,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--rl-on-policy-target` | The training system that SGLang needs to match for true on-policy. | `None` | `fsdp` | | `--enable-attn-tp-input-scattered` | Allow input of attention to be scattered when only using tensor parallelism, to reduce the computational load of operations such as qkv latent. | `False` | bool flag (set to enable) | | `--enable-nsa-prefill-context-parallel` | Enable context parallelism used in the long sequence prefill phase of DeepSeek v3.2. | `False` | bool flag (set to enable) | -| `--nsa-prefill-cp-mode` | Token splitting mode for the prefill phase of DeepSeek v3.2 under context parallelism. Optional values: `in-seq-split` (default), `round-robin-split`. `round-robin-split` distributes tokens across ranks based on `token_idx % cp_size`. It supports multi-batch prefill, fused MoE, and FP8 KV cache. | `in-seq-split` | `in-seq-split`, `round-robin-split` | +| `--nsa-prefill-cp-mode` | Token splitting mode for the prefill phase of DeepSeek v3.2 under context parallelism. Optional values: `round-robin-split`(default),`in-seq-split`. `round-robin-split` distributes tokens across ranks based on `token_idx % cp_size`. It supports multi-batch prefill, fused MoE, and FP8 KV cache. | `in-seq-split` | `in-seq-split`, `round-robin-split` | | `--enable-fused-qk-norm-rope` | Enable fused qk normalization and rope rotary embedding. 
| `False` | bool flag (set to enable) | | `--enable-precise-embedding-interpolation` | Enable corner alignment for resize of embeddings grid to ensure more accurate(but slower) evaluation of interpolated embedding values. | `False` | bool flag (set to enable) | @@ -490,12 +485,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--disaggregation-mode` | Only used for PD disaggregation. "prefill" for prefill-only server, and "decode" for decode-only server. If not specified, it is not PD disaggregated | `null` | `null`, `prefill`, `decode` | | `--disaggregation-transfer-backend` | The backend for disaggregation transfer. Default is mooncake. | `mooncake` | `mooncake`, `nixl`, `ascend`, `fake` | | `--disaggregation-bootstrap-port` | Bootstrap server port on the prefill server. Default is 8998. | `8998` | Type: int | -| `--disaggregation-decode-tp` | Decode tp size. If not set, it matches the tp size of the current engine. This is only set on the prefill server. | `None` | Type: int | -| `--disaggregation-decode-dp` | Decode dp size. If not set, it matches the dp size of the current engine. This is only set on the prefill server. | `None` | Type: int | -| `--disaggregation-prefill-pp` | Prefill pp size. If not set, it is default to 1. This is only set on the decode server. | `1` | Type: int | | `--disaggregation-ib-device` | The InfiniBand devices for disaggregation transfer, accepts single device (e.g., --disaggregation-ib-device mlx5_0) or multiple comma-separated devices (e.g., --disaggregation-ib-device mlx5_0,mlx5_1). Default is None, which triggers automatic device detection when mooncake backend is enabled. | `None` | Type: str | | `--disaggregation-decode-enable-offload-kvcache` | Enable async KV cache offloading on decode server (PD mode). 
| `False` | bool flag (set to enable) | -| `--disaggregation-decode-enable-fake-auto` | Auto enable FAKE mode for decode node testing, no need to pass bootstrap_host and bootstrap_room in request. | `False` | bool flag (set to enable) | | `--num-reserved-decode-tokens` | Number of decode tokens that will have memory reserved when adding new request to the running batch. | `512` | Type: int | | `--disaggregation-decode-polling-interval` | The interval to poll requests in decode server. Can be set to >1 to reduce the overhead of this. | `1` | Type: int | @@ -512,6 +503,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s | --- | --- | --- | --- | | `--custom-weight-loader` | The custom dataloader which used to update the model. Should be set with a valid import path, such as my_package.weight_load_func | `None` | List[str] | | `--weight-loader-disable-mmap` | Disable mmap while loading weight using safetensors. | `False` | bool flag (set to enable) | +| `--weight-loader-prefetch-checkpoints` | Prefetch checkpoint files into OS page cache before loading. Each rank prefetches a fraction of the shards in a background thread, reducing total network I/O on shared filesystems (NFS/Lustre) from N\*checkpoint to 1\*checkpoint. Recommended for models on network storage. | `False` | bool flag (set to enable) | +| `--weight-loader-prefetch-num-threads` | Number of threads per rank for checkpoint prefetching. | `4` | Type: int | | `--remote-instance-weight-loader-seed-instance-ip` | The ip of the seed instance for loading weights from remote instance. | `None` | Type: str | | `--remote-instance-weight-loader-seed-instance-service-port` | The service port of the seed instance for loading weights from remote instance. | `None` | Type: int | | `--remote-instance-weight-loader-send-weights-group-ports` | The communication group ports for loading weights from remote instance. 
| `None` | Type: JSON list | @@ -539,6 +532,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s | `--mm-process-config` | Multimodal preprocessing config, a json config contains keys: `image`, `video`, `audio`. | `{}` | Type: JSON / Dict | | `--mm-enable-dp-encoder` | Enabling data parallelism for mm encoder. The dp size will be set to the tp size automatically. | `False` | bool flag (set to enable) | | `--limit-mm-data-per-request` | Limit the number of multimodal inputs per request. e.g. '{"image": 1, "video": 1, "audio": 1}' | `None` | Type: JSON / Dict | +| `--enable-mm-global-cache` | Enable Mooncake-backed global multimodal embedding cache on encoder servers so repeated images can reuse cached ViT embeddings instead of recomputing them. | `False` | bool flag (set to enable) | ## For checkpoint decryption | Argument | Description | Defaults | Options | @@ -552,6 +546,11 @@ Please consult the documentation below and [server_args.py](https://github.com/s | --- | --- | --- | --- | | `--forward-hooks` | JSON-formatted list of forward hook specifications. Each element must include `target_modules` (list of glob patterns matched against `model.named_modules()` names) and `hook_factory` (Python import path to a factory, e.g. `my_package.hooks:make_hook`). An optional `name` field is used for logging, and an optional `config` object is passed as a `dict` to the factory. | `None` | Type: JSON list | +## For MindStudio-probe(msProbe) dump +| Argument | Description | Defaults | Options | +| --- | --- | --- | --- | +| `--msprobe-dump-config` | The path of the JSON configuration file for msProbe. If specified, enables msProbe dump. 
| `None` | Type: str | + ## Deprecated arguments | Argument | Description | Defaults | Options | | --- | --- | --- | --- | diff --git a/docs/advanced_features/sgl_model_gateway.md b/docs/advanced_features/sgl_model_gateway.md index 753743b0b0bb..0f2da5b4776d 100644 --- a/docs/advanced_features/sgl_model_gateway.md +++ b/docs/advanced_features/sgl_model_gateway.md @@ -77,7 +77,7 @@ SGLang Model Gateway is a high-performance model-routing gateway for large-scale ### Control Plane -- **Worker Manager** discovers capabilities (`/get_server_info`, `/get_model_info`), tracks load, and registers/removes workers in the shared registry. +- **Worker Manager** discovers capabilities (`/server_info`, `/get_model_info`), tracks load, and registers/removes workers in the shared registry. - **Job Queue** serializes add/remove requests and exposes status (`/workers/{worker_id}`) so clients can track onboarding progress. - **Load Monitor** feeds cache-aware and power-of-two policies with live worker load statistics. - **Health Checker** continuously probes workers and updates readiness, circuit breaker state, and router metrics. @@ -552,7 +552,7 @@ Response: | `GET` | `/engine_metrics` | Engine-level metrics from workers | | `GET` | `/v1/models` | List available models | | `GET` | `/get_model_info` | Get model information | -| `GET` | `/get_server_info` | Get server information | +| `GET` | `/server_info` | Get server information | | `POST` | `/flush_cache` | Clear all caches | | `GET` | `/get_loads` | Get all worker loads | | `POST` | `/wasm` | Upload WASM module | @@ -593,6 +593,17 @@ Response: ## Reliability and Flow Control +### HTTP Client + +Configure upstream HTTP client connection settings: + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `--pool-idle-timeout-secs` | 50 | Idle timeout in seconds for pooled upstream HTTP connections. Can also be set with `SMG_POOL_IDLE_TIMEOUT_SECS`. 
| +| `--connect-timeout-secs` | 10 | Timeout in seconds for new upstream HTTP connections. Can also be set with `SMG_CONNECT_TIMEOUT_SECS`. | +| `--pool-max-idle-per-host` | 500 | Maximum idle upstream HTTP connections to keep per host. Can also be set with `SMG_POOL_MAX_IDLE_PER_HOST`. | +| `--tcp-keepalive-secs` | 30 | TCP keepalive idle time in seconds for upstream HTTP connections. Can also be set with `SMG_TCP_KEEPALIVE_SECS`. | + ### Retries Configure exponential backoff retries: @@ -1645,7 +1656,7 @@ groups: | `--policy` | str | cache_aware | Routing policy | | `--max-concurrent-requests` | int | -1 | Concurrency limit (-1 disables) | | `--request-timeout-secs` | int | 600 | Request timeout | -| `--max-payload-size` | int | 256MB | Maximum request payload | +| `--max-payload-size` | int | 512MB | Maximum request payload | ### Prefill/Decode diff --git a/docs/advanced_features/sglang_for_rl.md b/docs/advanced_features/sglang_for_rl.md index 2fd84c90de69..12eb41540339 100644 --- a/docs/advanced_features/sglang_for_rl.md +++ b/docs/advanced_features/sglang_for_rl.md @@ -106,6 +106,29 @@ This path trades some I/O overhead for simplicity and flexibility. It integrates **Python Engine API:** `engine.update_weights_from_disk(model_path, load_format=None)` +**Diffusion engine (SGLang-Diffusion):** The diffusion engine exposes the same `POST /update_weights_from_disk` endpoint with the following behavior: + +- **All-or-nothing with rollback:** if any module fails to load, all previously updated modules are rolled back to the original weights by reloading from the original model path. No partial updates are left behind. If rollback itself fails, the exception propagates so the caller knows the model is in an inconsistent state. +- **Offload-aware:** when layerwise offload (`--dit-layerwise-offload`) is enabled, the diffusion offload manager replaces GPU parameters with small `torch.empty((1,))` placeholders while real weights live in consolidated pinned CPU buffers. 
A naive `param.data.copy_()` would fail with a shape mismatch. Instead, the updater dynamically detects active offload managers and writes new weights directly into their CPU buffers, bypassing the placeholders entirely. For any layer that happens to be prefetched on GPU at update time, the live GPU tensor is also updated so the change takes effect immediately. This requires no extra GPU memory and does not disturb the offload state. +- **DTensor-aware:** parameters distributed via `torch.distributed.tensor` (tensor parallelism) are updated through `distribute_tensor` so that each shard is correctly placed on the right device mesh. + +**Request body:** + +| Field | Description | Defaults | Options | +| --- | --- | --- | --- | +| `model_path` | The model path with the new weights. | Required | Type: str | +| `flush_cache` | Flush TeaCache state after update. | `True` | Type: bool | +| `target_modules` | List of module names to update (e.g. `["transformer"]`). If omitted, all `nn.Module` components are updated. | `None` | Type: list[str] | + +**Response body:** + +| Field | Description | Defaults | Options | +| --- | --- | --- | --- | +| `success` | Whether the update succeeded. | - | Type: bool | +| `message` | Status / error message. | - | Type: str | + +> **Note:** The diffusion engine (SGLang-Diffusion) does not currently support hot refit (updating weights while inference is in progress). The diffusion scheduler processes one request at a time and completes the entire inference before handling the next request, so weight updates and inference never run concurrently. 
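
As a minimal sketch of the request described above, the following Python snippet builds the JSON body from the documented fields (`model_path`, `flush_cache`, `target_modules`) and shows how it could be posted to `/update_weights_from_disk`. The helper names `build_update_request` and `post_update`, the server address, and the checkpoint path are assumptions made for illustration; they are not part of SGLang.

```python
import json
import urllib.request


def build_update_request(model_path, flush_cache=True, target_modules=None):
    """Build the JSON body for POST /update_weights_from_disk (hypothetical helper)."""
    if not model_path:
        raise ValueError("model_path is required")
    body = {"model_path": model_path, "flush_cache": flush_cache}
    # Leaving target_modules out asks the server to update all nn.Module components.
    if target_modules is not None:
        body["target_modules"] = list(target_modules)
    return body


def post_update(base_url, body, timeout=600):
    """Send the request and return the parsed JSON response (hypothetical helper)."""
    request = urllib.request.Request(
        url=f"{base_url}/update_weights_from_disk",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example body that updates only the "transformer" module.
# The checkpoint path is a placeholder.
example_body = build_update_request(
    "/path/to/new/checkpoint", target_modules=["transformer"]
)
print(json.dumps(example_body, indent=2))

# With a running server (the address below is an assumption), the call would be:
# result = post_update("http://127.0.0.1:30000", example_body)
# if not result["success"]:
#     raise RuntimeError(result["message"])
```

Because the update is all-or-nothing, a caller only needs to inspect `success` and `message` in the response; there is no partially updated state to reconcile unless the rollback itself raised an error.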
+ ### Update Weights from Tensor **When to use:** diff --git a/docs/advanced_features/speculative_decoding.md b/docs/advanced_features/speculative_decoding.md new file mode 100644 index 000000000000..8acaf4fcf166 --- /dev/null +++ b/docs/advanced_features/speculative_decoding.md @@ -0,0 +1,565 @@ +# Speculative Decoding + +SGLang provides several speculative decoding options, including EAGLE-2/EAGLE-3, MTP, classic draft-model decoding, and an NGRAM-based variant. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines. + +## Summary + +### Jump to sections + +- [EAGLE Decoding](#eagle-decoding) + - [EAGLE-2 Decoding](#eagle-2-decoding) + - [EAGLE-2 Decoding with torch.compile](#eagle-2-decoding-with-torchcompile) + - [EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling](#eagle-2-decoding-via-frequency-ranked-speculative-sampling) + - [EAGLE-3 Decoding](#eagle-3-decoding) +- [Multi Token Prediction](#multi-token-prediction) +- [Standalone Speculative Decoding (Small Draft Model)](#standalone-speculative-decoding-small-draft-model) +- [Speculative Decoding V2 (Overlap Scheduler)](#speculative-decoding-v2-overlap-scheduler) +- [Ngram Speculative Decoding](#ngram-speculative-decoding) +- [Full Parameter Reference](#full-parameter-reference) +- [OOM Troubleshooting](#oom-troubleshooting) +- [References](#references) + +### Quick guidance + +- **Best speed/quality (recommended)**: Use **EAGLE-3** with `--speculative-algorithm EAGLE3`. +- **Strong default / broad compatibility**: Use **EAGLE-2** with `--speculative-algorithm EAGLE`. +- **Workload acceptance changes over time**: Use [**Adaptive speculative decoding**](adaptive_speculative_decoding.md) on top of **EAGLE** with `--speculative-eagle-topk 1`. +- **Lower `lm_head` overhead for EAGLE-2**: Enable **FR-Spec** with `--speculative-token-map`. 
+- **Model is MTP-enabled**: Use **MTP via speculative decoding** (often with small `speculative_num_steps/topk/num_draft_tokens`, see the example section). +- **You have a smaller draft LLM**: Use **STANDALONE** (`--speculative-algorithm STANDALONE`). +- **No extra model available**: Use **NGRAM** (`--speculative-algorithm NGRAM`, CUDA-only). +- **Want overlap scheduler (experimental)**: Enable **SpecV2** with `SGLANG_ENABLE_SPEC_V2=True` (requires `--speculative-eagle-topk 1`). + +### Method comparison (mini table) + +| Method | Draft source | Separate draft model? | How to enable | Notes / constraints | +|---|---|---:|---|---| +| EAGLE-2 | EAGLE draft model (feature drafting + tree) | Typically yes | `--speculative-algorithm EAGLE` + `--speculative-draft-model-path ...` | Tune `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens` | +| EAGLE-2 + `torch.compile` | Same as EAGLE-2 | Typically yes | Add `--enable-torch-compile` (optionally `--torch-compile-max-bs`) | Benefit varies by hardware/model; benchmark to verify | +| EAGLE-2 + FR-Spec | Same as EAGLE-2 + token subset | Typically yes | Add `--speculative-token-map ...` | Reduces `lm_head` overhead with high-frequency token vocab | +| EAGLE-3 | EAGLE3 draft model | Yes | `--speculative-algorithm EAGLE3` + `--speculative-draft-model-path ...` | Best throughput in the benchmark below | +| MTP | Built-in multi-token heads (model-specific) | Often no | See **Multi Token Prediction** section | Uses speculative workflow; draft path may be auto-handled for some models | +| STANDALONE | Smaller draft LLM (token-level) | Yes | `--speculative-algorithm STANDALONE` + `--speculative-draft-model-path ...` | Does **not** support `--enable-dp-attention` | +| SpecV2 (experimental) | V2 workers + overlap scheduler | N/A | `SGLANG_ENABLE_SPEC_V2=True` | Only supports `--speculative-eagle-topk 1`; applies to `EAGLE`, `EAGLE3`, `STANDALONE` | +| NGRAM | Ngram cache from previous tokens | No | 
`--speculative-algorithm NGRAM` | CUDA-only; no `--enable-dp-attention`; disables overlap scheduler & mixed chunked prefill | + +### Performance Highlights + +The table below shows the throughput gains that EAGLE-3 decoding achieves for LLaMA-Instruct 3.1 8B on MT-Bench: roughly 1.5x with EAGLE-2 and 2.4x with EAGLE-3 over the non-speculative baseline. +For further details, see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840). + +| Method | Throughput (tokens/s) | +|--------|----------------| +| SGLang (w/o speculative, 1x H100) | 158.34 | +| SGLang + EAGLE-2 (1x H100) | 244.10 | +| SGLang + EAGLE-3 (1x H100) | 373.25 | + +--- + +## EAGLE Decoding + +The following parameters are relevant to EAGLE speculative decoding: + +| Parameter | Description | Default | +|---|---|---| +| `--speculative-draft-model-path` | Draft model path/weights. **Typically required** for EAGLE/EAGLE3 and STANDALONE. For some MTP-enabled models, this can be omitted. | `None` | +| `--speculative-num-steps` | Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. | Auto (`5` for Llama/Grok; `3` for many other models) | +| `--speculative-eagle-topk` | Branching factor per step. Improves candidate diversity and acceptance rate, but increases memory/compute consumption. | Auto (`4` for Llama/Grok; `1` for many other models) | +| `--speculative-num-draft-tokens` | Maximum parallel verification capacity. Allows deeper tree evaluation but increases GPU memory usage. | Auto (`8` for Llama/Grok; `4` for many other models). If `topk=1`, it is adjusted to `num_steps + 1`. | +| `--speculative-accept-threshold-single` | Acceptance threshold for single-token verification. Lower values accept more aggressively. | `1.0` | +| `--speculative-accept-threshold-acc` | Accumulated acceptance threshold across steps. | `1.0` | +| `--speculative-attention-mode` | Attention mode for speculative operations (`prefill` or `decode`), affecting both target verification and draft extension.
| `"prefill"` | +| `--speculative-draft-attention-backend` | Override attention backend for the draft model. | `None` (same as target) | +| `--speculative-draft-model-quantization` | Quantization method for the draft model. Use `"unquant"` to force no quantization even when the target model is quantized. | Same as target model | +| `--speculative-draft-model-revision` | Specific revision/commit of the draft model to load. | `None` (auto-set to `"main"` when `--speculative-draft-model-path` is set and revision is omitted) | +| `--speculative-draft-load-format` | Load format for the draft model weights. | `None` | + +These parameters are mostly the same for EAGLE-2 and EAGLE-3. `--speculative-token-map` is ignored for EAGLE-3 models. +For `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens`: leave all three unset to use auto-tuning, or set all three explicitly when tuning. +If you use EAGLE with `--speculative-eagle-topk 1` and your acceptance rate varies across requests, see [Adaptive Speculative Decoding](adaptive_speculative_decoding.md). + +You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py). + + +### EAGLE-2 Decoding + +You can enable EAGLE-2 Decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model. 
+ +**Launch the server:** + +```bash +python3 -m sglang.launch_server \ + --model meta-llama/Llama-2-7b-chat-hf \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --speculative-num-draft-tokens 16 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Llama-2-7b-chat-hf", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +### EAGLE-2 Decoding with `torch.compile` + +You can optionally enable `torch.compile` to apply kernel-level optimizations (operator fusion, autotune) to the draft model. The actual speedup depends on your hardware, model architecture, and batch size. In some configurations (e.g., small draft models on H100 where cuBLAS is already optimal and CUDA graphs are enabled), the benefit may be negligible. We recommend benchmarking with and without this flag on your specific setup to verify whether it helps. 
+ +To enable it, add `--enable-torch-compile` and optionally set `--torch-compile-max-bs`: + +```bash +python3 -m sglang.launch_server \ + --model meta-llama/Llama-2-7b-chat-hf \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --speculative-num-draft-tokens 16 \ + --mem-fraction-static 0.7 \ + --enable-torch-compile \ + --torch-compile-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Llama-2-7b-chat-hf", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling + +Restricting the draft model to a truncated vocabulary of high-frequency tokens reduces the `lm_head` computational overhead of EAGLE speculative decoding, which speeds up the pipeline without degrading output quality. For more details, check out [the paper](https://arxiv.org/pdf/2502.14856). + +In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens in FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec). Alternatively, download the tokens directly from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset). + +Thanks to [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx) for this contribution.
+ +```bash +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3-8B-Instruct \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --speculative-num-draft-tokens 16 \ + --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --dtype float16 \ + --log-level warning +``` + +**Send a request:** + +```python +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3-8B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +### EAGLE-3 Decoding + +You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model. + +```bash +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --speculative-num-draft-tokens 16 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --dtype float16 \ + --log-level warning +``` + +**Send a request:** + +```python +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3.1-8B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +## Multi Token Prediction + +We support [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang by using speculative decoding. 
We use `XiaomiMiMo/MiMo-7B-RL` as an example here (for DeepSeek MTP usage, refer to [deepseek_v32 doc](../basic_usage/deepseek_v32.md#multi-token-prediction)). + +```bash +python3 -m sglang.launch_server \ + --model XiaomiMiMo/MiMo-7B-RL \ + --host 0.0.0.0 \ + --trust-remote-code \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 1 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 2 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "XiaomiMiMo/MiMo-7B-RL", + "messages": [{"role": "user", "content": "What is the capital of France?"}], +} + +response = requests.post(url, json=data) +print(response.json()) +``` + +--- + +## Standalone Speculative Decoding (Small Draft Model) + +Besides EAGLE/MTP, SGLang also supports **token-level speculative decoding** using a smaller **draft model**. Enable it with `--speculative-algorithm STANDALONE` and provide a draft model via `--speculative-draft-model-path`. + +Relevant parameters: + +| Parameter | Description | Default | +|---|---|---| +| `--speculative-draft-model-path` | Draft model weights (smaller than the target model). | `None` | +| `--speculative-num-steps` | Draft depth (how many steps the draft model runs autoregressively). | `3` (auto default for STANDALONE) | +| `--speculative-eagle-topk` | Branching factor (token candidates per step). | `1` (auto default for STANDALONE) | +| `--speculative-num-draft-tokens` | Verification capacity. | `4` (auto default for STANDALONE) | +| `--speculative-draft-model-quantization` | Quantization for the draft model. Use `"unquant"` to disable quantization on the draft even when the target is quantized. | Same as target | + +> **Note:** Standalone speculative decoding currently **does not support** `--enable-dp-attention`. 
+ +```bash +python3 -m sglang.launch_server \ + --model Qwen/Qwen2.5-7B-Instruct \ + --speculative-algorithm STANDALONE \ + --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \ + --speculative-num-steps 4 \ + --speculative-eagle-topk 2 \ + --speculative-num-draft-tokens 7 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +## Speculative Decoding V2 (Overlap Scheduler) + +SGLang provides an **experimental Speculative Decoding V2** implementation that enables an overlap scheduler and uses V2 speculative workers (e.g. `StandaloneWorkerV2`, `EAGLEWorkerV2`). + +To enable it, set the environment variable: +- `SGLANG_ENABLE_SPEC_V2=True` + +Notes: +- SpecV2 currently only supports `--speculative-eagle-topk 1`. When SpecV2 is enabled, **set `--speculative-eagle-topk 1` explicitly**. +- If you explicitly set `--speculative-eagle-topk > 1`, the server will error. +- If you omit `--speculative-eagle-topk`, auto-tuning may pick `topk > 1` for some models (e.g. Llama). This is incompatible with SpecV2 and may not always trigger an immediate config error, so set `--speculative-eagle-topk 1` explicitly. +- This applies to `EAGLE`, `EAGLE3`, and `STANDALONE`. 
+ +```bash +SGLANG_ENABLE_SPEC_V2=True python3 -m sglang.launch_server \ + --model Qwen/Qwen2.5-7B-Instruct \ + --speculative-algorithm STANDALONE \ + --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \ + --speculative-num-steps 4 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 5 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +## Ngram Speculative Decoding + +SGLang also supports **ngram-based speculative decoding** (no separate draft model). It retrieves draft tokens from an ngram cache built from previously generated tokens, and then verifies them with the target model. + +Enable it with: +- `--speculative-algorithm NGRAM` + +### Ngram-specific parameters + +| Parameter | Description | Default | +|---|---|---| +| `--speculative-num-draft-tokens` | Number of draft tokens verified per step. If omitted, defaults to `min(--speculative-ngram-max-trie-depth, 12)`. | `12` (with default ngram settings) | +| `--speculative-ngram-min-bfs-breadth` | Minimum BFS breadth. | `1` | +| `--speculative-ngram-max-bfs-breadth` | Maximum BFS breadth. | `10` | +| `--speculative-ngram-match-type` | Ngram tree-building mode: `"BFS"` for recency-based expansion or `"PROB"` for frequency-based expansion. | `"BFS"` | +| `--speculative-ngram-max-trie-depth` | Maximum suffix length stored and matched by the ngram trie. | `18` | +| `--speculative-ngram-capacity` | Cache capacity (number of entries). | `10,000,000` | + +Notes: +- Ngram speculative decoding **only supports CUDA**. 
+- It currently **does not support** `--enable-dp-attention`. +- It disables the overlap scheduler and mixed chunked prefill. +- If `--speculative-ngram-max-bfs-breadth > 1` (thus `speculative_eagle_topk > 1`) and `page_size > 1`, use `--attention-backend flashinfer`; otherwise the server will error. +- Optional: set `SGLANG_NGRAM_FORCE_GREEDY_VERIFY=True` to force greedy verification. + +```bash +python3 -m sglang.launch_server \ + --model Qwen/Qwen2.5-7B-Instruct \ + --speculative-algorithm NGRAM \ + --speculative-num-draft-tokens 16 \ + --speculative-ngram-max-bfs-breadth 10 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +## Full Parameter Reference + +Below is a comprehensive list of all speculative decoding parameters available in SGLang: + +### Core parameters + +| Parameter | Type | Default | Description | +|---|---|---|---| +| `--speculative-algorithm` | `str` | `None` | Algorithm to use: `EAGLE`, `EAGLE3`, `STANDALONE`, `NGRAM`, `NEXTN` (alias of `EAGLE`) | +| `--speculative-draft-model-path` | `str` | `None` | Path to the draft model weights | +| `--speculative-draft-model-revision` | `str` | `None` | Specific revision/commit of the draft model (`"main"` is auto-used when draft path is set and revision is omitted) | +| `--speculative-draft-load-format` | `str` | `None` | Load format for draft model weights | +| `--speculative-num-steps` | `int` | `None` (auto-chosen when omitted) | Autoregressive drafting depth | +| `--speculative-eagle-topk` | `int` | `None` (auto-chosen when omitted) | Branching factor per drafting step | +| 
`--speculative-num-draft-tokens` | `int` | `None` (auto-chosen when omitted) | Maximum number of draft tokens for verification | +| `--speculative-accept-threshold-single` | `float` | `1.0` | Single-token acceptance threshold | +| `--speculative-accept-threshold-acc` | `float` | `1.0` | Accumulated acceptance threshold | +| `--speculative-token-map` | `str` | `None` | Path to FR-Spec high-frequency token map | +| `--speculative-attention-mode` | `str` | `"prefill"` | Attention mode for speculative operations (`"prefill"` or `"decode"`) | +| `--speculative-draft-attention-backend` | `str` | `None` | Override attention backend for the draft model | +| `--speculative-moe-runner-backend` | `str` | `None` | MoE runner backend for the draft model | +| `--speculative-moe-a2a-backend` | `str` | `None` | MoE all-to-all backend for the draft model | +| `--speculative-draft-model-quantization` | `str` | Same as target | Quantization for the draft model (`"unquant"` to disable) | + +### Ngram-specific parameters + +| Parameter | Type | Default | Description | +|---|---|---|---| +| `--speculative-ngram-min-bfs-breadth` | `int` | `1` | Minimum BFS breadth | +| `--speculative-ngram-max-bfs-breadth` | `int` | `10` | Maximum BFS breadth | +| `--speculative-ngram-match-type` | `str` | `"BFS"` | Ngram tree-building mode: `"BFS"` for recency-based expansion or `"PROB"` for frequency-based expansion | +| `--speculative-ngram-max-trie-depth` | `int` | `18` | Maximum suffix length stored and matched by the ngram trie | +| `--speculative-ngram-capacity` | `int` | `10,000,000` | Cache capacity | + +### Environment variables + +| Variable | Default | Description | +|---|---|---| +| `SGLANG_ENABLE_SPEC_V2` | `False` | Enable Speculative Decoding V2 (overlap scheduler) | +| `SGLANG_NGRAM_FORCE_GREEDY_VERIFY` | `False` | Force greedy verification for ngram decoding | + +### Other related flags + +| Parameter | Description | +|---|---| +| `--enable-multi-layer-eagle` | Enable multi-layer EAGLE 
(auto-enabled for MiMoV2 and Step3p5 models) | +| `--enable-torch-compile` | Enable `torch.compile` for kernel-level optimizations | +| `--torch-compile-max-bs` | Maximum batch size for `torch.compile` | + +--- + +## OOM Troubleshooting + +> [!WARNING] +> **Out of Memory (OOM)?** Speculative decoding may increase GPU memory usage because the draft tree, CUDA graphs, and verification-related buffers consume additional VRAM. If you encounter OOM errors, try the following adjustments. + +### Step 1: Lower static memory fraction (most effective) + +```bash +--mem-fraction-static 0.5 # when omitted, this value is auto-computed +``` + +- `--mem-fraction-static` controls the memory budget for model weights + KV cache pool. +- Lowering it directly increases dynamic headroom for activations and CUDA graph buffers. +- If omitted, SGLang auto-estimates this value from other settings, and those auto settings can still be too aggressive for some workloads. + +### Step 2: Reduce CUDA graph batch size + +```bash +# Fewer CUDA graph captures = less memory reserved +--cuda-graph-max-bs 4 # or even 2 for tight memory situations +``` + +- If omitted, `--cuda-graph-max-bs` is auto-selected based on GPU memory and TP size, and can be much larger on high-memory GPUs. 
+ +### Step 3: Reduce draft tree size + +These three parameters directly control how much memory the draft tree consumes: + +```bash +# Before (aggressive, high memory) +--speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 + +# After (conservative, lower memory) +--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 +``` + +### Step 4: Limit concurrent requests + +```bash +# Fewer concurrent requests lower the in-flight load and can reduce OOM risk +--max-running-requests 4 +``` + +### Quick OOM recovery recipe + +If you're hitting OOM and just want something that works, start with this minimal configuration and scale up: + +```bash +python3 -m sglang.launch_server \ + --model \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --cuda-graph-max-bs 2 \ + --mem-fraction-static 0.5 \ + --max-running-requests 4 \ + --log-level warning +``` + +Then gradually increase `--speculative-num-draft-tokens`, `--speculative-eagle-topk`, and `--cuda-graph-max-bs`. Increase `--mem-fraction-static` last, only after the run is stable. + +--- + +## References + +The EAGLE process is as follows: + +- Within EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, ..., f_k)$ and the token sequence $(t_2, ..., t_{k+1})$. +- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style (branching out into multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter) to ensure a more coherent connection of context, and are given as input again.
+- In SGLang's EAGLE-2 implementation, the draft tree is expanded for the configured steps and then reranked to select the top `speculative_num_draft_tokens` final nodes as draft tokens. +- EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained in an on-policy manner. + +This enhances drafting accuracy by operating on features instead of tokens for more regular inputs and by additionally passing tokens from the next timestep to reduce sampling randomness. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers. + +For guidance on how to train your own EAGLE model please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train). For EAGLE-3 training specifically, check out [SpecForge](https://github.com/sgl-project/SpecForge), the SGLang team's training framework designed for EAGLE-3 speculative decoding models with seamless porting to SGLang serving. See the [SpecForge documentation](https://docs.sglang.ai/SpecForge/) and [blog post](https://lmsys.org/blog/2025-07-25-spec-forge) for details. 
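The expand-then-rerank step described in the EAGLE process above can be illustrated with a small toy sketch. This is not SGLang's implementation: `toy_draft_logprobs` is a hypothetical stand-in for the draft model, `build_draft_tree` is an illustrative name, and the parameters only mirror the meaning of `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` under simplified assumptions (no root token, no batching, no verification).

```python
import heapq
import math

# Hypothetical stand-in for the draft model: a fixed next-token distribution.
TOY_PROBS = {0: 0.6, 1: 0.25, 2: 0.1, 3: 0.05}


def toy_draft_logprobs(path):
    """Return {token: log-probability} for the token that follows `path`."""
    return {tok: math.log(p) for tok, p in TOY_PROBS.items()}


def build_draft_tree(num_steps, topk, num_draft_tokens, draft_fn=toy_draft_logprobs):
    """Expand a draft tree for `num_steps`, then rerank all nodes by score.

    Each node is (cumulative log-probability, token path from the root).
    """
    frontier = [(0.0, ())]  # start from an empty path
    all_nodes = []
    for _ in range(num_steps):
        candidates = []
        for score, path in frontier:
            logprobs = draft_fn(path)
            # Branch into the `topk` most likely next tokens of this node.
            best = heapq.nlargest(topk, logprobs.items(), key=lambda kv: kv[1])
            candidates.extend((score + lp, path + (tok,)) for tok, lp in best)
        all_nodes.extend(candidates)
        # Only the `topk` best candidates are expanded in the next step.
        frontier = heapq.nlargest(topk, candidates, key=lambda node: node[0])
    # Rerank every node seen so far and keep the best `num_draft_tokens`.
    return heapq.nlargest(num_draft_tokens, all_nodes, key=lambda node: node[0])


if __name__ == "__main__":
    for score, path in build_draft_tree(num_steps=3, topk=2, num_draft_tokens=4):
        print(path, round(math.exp(score), 4))
```

Because log-probabilities are never positive, a child never scores higher than its parent, so the reranked selection in this sketch still forms a connected tree.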
diff --git a/docs/advanced_features/structured_outputs.ipynb b/docs/advanced_features/structured_outputs.ipynb index b0ec5e6c7d61..8902c949765e 100644 --- a/docs/advanced_features/structured_outputs.ipynb +++ b/docs/advanced_features/structured_outputs.ipynb @@ -54,7 +54,7 @@ " \"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --log-level warning\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")" ] }, @@ -356,8 +356,7 @@ "outputs": [], "source": [ "# Support for XGrammar latest structural tag format\n", - "# https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html\n", - "\n", + "# \n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", " messages=messages,\n", @@ -645,8 +644,7 @@ "outputs": [], "source": [ "# Support for XGrammar latest structural tag format\n", - "# https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html\n", - "\n", + "# \n", "payload = {\n", " \"text\": text,\n", " \"sampling_params\": {\n", @@ -740,7 +738,6 @@ "import json\n", "from pydantic import BaseModel, Field\n", "\n", - "\n", "prompts = [\n", " \"Give me the information of the capital of China in the JSON format.\",\n", " \"Give me the information of the capital of France in the JSON format.\",\n", @@ -926,8 +923,7 @@ "outputs": [], "source": [ "# Support for XGrammar latest structural tag format\n", - "# https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html\n", - "\n", + "# \n", "sampling_params = {\n", " \"temperature\": 0.8,\n", " \"top_p\": 0.95,\n", diff --git a/docs/advanced_features/structured_outputs_for_reasoning_models.ipynb b/docs/advanced_features/structured_outputs_for_reasoning_models.ipynb index 2b05a583775c..cfc07fd01629 100644 --- 
a/docs/advanced_features/structured_outputs_for_reasoning_models.ipynb +++ b/docs/advanced_features/structured_outputs_for_reasoning_models.ipynb @@ -50,7 +50,7 @@ " \"python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")" ] }, @@ -642,7 +642,6 @@ "import json\n", "from pydantic import BaseModel, Field\n", "\n", - "\n", "prompts = [\n", " \"Give me the information of the capital of China in the JSON format.\",\n", " \"Give me the information of the capital of France in the JSON format.\",\n", diff --git a/docs/advanced_features/tool_parser.ipynb b/docs/advanced_features/tool_parser.ipynb index df1bc4bc7ba0..9afc9663e64f 100644 --- a/docs/advanced_features/tool_parser.ipynb +++ b/docs/advanced_features/tool_parser.ipynb @@ -60,7 +60,7 @@ "server_process, port = launch_server_cmd(\n", " \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\" # qwen25\n", ")\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" ] }, { @@ -550,7 +550,9 @@ "server_process_tool_choice, port_tool_choice = launch_server_cmd(\n", " \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\"\n", ")\n", - "wait_for_server(f\"http://localhost:{port_tool_choice}\")\n", + "wait_for_server(\n", + " f\"http://localhost:{port_tool_choice}\", process=server_process_tool_choice\n", + ")\n", "\n", "# Initialize client for tool choice examples\n", "client_tool_choice = OpenAI(\n", @@ -695,7 +697,7 @@ "server_process, port = launch_server_cmd(\n", " \" 
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --tool-call-parser pythonic --tp 1 --log-level warning\" # llama-3.2-1b-instruct\n", ")\n", - "wait_for_server(f\"http://localhost:{port}\")\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", "\n", "tools = [\n", " {\n", diff --git a/docs/advanced_features/vlm_query.ipynb b/docs/advanced_features/vlm_query.ipynb index 45dd9a1efe01..24bd7a90bc9f 100644 --- a/docs/advanced_features/vlm_query.ipynb +++ b/docs/advanced_features/vlm_query.ipynb @@ -64,8 +64,11 @@ "\n", "nest_asyncio.apply()\n", "\n", + "import sglang.test.doc_patch # noqa: F401\n", + "\n", "model_path = \"Qwen/Qwen2.5-VL-3B-Instruct\"\n", - "chat_template = \"qwen2-vl\"" + "chat_template = \"qwen2-vl\"\n", + "example_image_url = \"https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png\"" ] }, { @@ -81,13 +84,7 @@ "\n", "from sglang.srt.parser.conversation import chat_templates\n", "\n", - "image = Image.open(\n", - " BytesIO(\n", - " requests.get(\n", - " \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n", - " ).content\n", - " )\n", - ")\n", + "image = Image.open(BytesIO(requests.get(example_image_url).content))\n", "\n", "conv = chat_templates[chat_template].copy()\n", "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n", @@ -117,7 +114,6 @@ "source": [ "from sglang import Engine\n", "\n", - "\n", "llm = Engine(model_path=model_path, chat_template=chat_template, log_level=\"warning\")" ] }, @@ -186,9 +182,8 @@ "from transformers import Qwen2_5_VLForConditionalGeneration\n", "\n", "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n", - "vision = (\n", - " Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval().visual.cuda()\n", - ")" + "model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval()\n", + "vision = 
model.model.visual.cuda()" ] }, { @@ -207,6 +202,7 @@ "precomputed_embeddings = vision(\n", " processor_output[\"pixel_values\"].cuda(), processor_output[\"image_grid_thw\"].cuda()\n", ")\n", + "precomputed_embeddings = precomputed_embeddings.pooler_output\n", "\n", "multi_modal_item = dict(\n", " processor_output,\n", @@ -239,13 +235,7 @@ "from sglang.srt.parser.conversation import chat_templates\n", "\n", "# Download the same example image\n", - "image = Image.open(\n", - " BytesIO(\n", - " requests.get(\n", - " \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n", - " ).content\n", - " )\n", - ")\n", + "image = Image.open(BytesIO(requests.get(example_image_url).content))\n", "\n", "conv = chat_templates[chat_template].copy()\n", "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n", diff --git a/docs/basic_usage/deepseek_ocr.md b/docs/basic_usage/deepseek_ocr.md new file mode 100644 index 000000000000..6f62713ebab4 --- /dev/null +++ b/docs/basic_usage/deepseek_ocr.md @@ -0,0 +1,54 @@ +# DeepSeek OCR (OCR-1 / OCR-2) + +DeepSeek OCR models are multimodal (image + text) models for OCR and document understanding. + +## Launch server + +```shell +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-OCR-2 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +> You can replace `deepseek-ai/DeepSeek-OCR-2` with `deepseek-ai/DeepSeek-OCR`. + +## Prompt examples + +Recommended prompts from the model card: + +``` + +<|grounding|>Convert the document to markdown. +``` + +``` + +Free OCR. 
+``` + +## OpenAI-compatible request example + +```python +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "deepseek-ai/DeepSeek-OCR-2", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "\n<|grounding|>Convert the document to markdown."}, + {"type": "image_url", "image_url": {"url": "https://example.com/your_image.jpg"}}, + ], + } + ], + "max_tokens": 512, +} + +response = requests.post(url, json=data) +print(response.text) +``` diff --git a/docs/basic_usage/deepseek_v3.md b/docs/basic_usage/deepseek_v3.md index a321eb09cbb7..9770c2882f13 100644 --- a/docs/basic_usage/deepseek_v3.md +++ b/docs/basic_usage/deepseek_v3.md @@ -68,13 +68,13 @@ Detailed commands for reference: - [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended) - [4 x B200, 8 x B200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-one-b200-node) - [8 x MI300X](../platforms/amd_gpu.md#running-deepseek-v3) -- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes) +- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker) - [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes) - [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization) - [16 x A100 (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization) - [32 x L40S (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization) - [Xeon 6980P CPU](../platforms/cpu_server.md#example-running-deepseek-r1) -- [4 x Atlas 800I A3 
(int8)](../platforms/ascend_npu_deepseek_example.md#running-deepseek-with-pd-disaggregation-on-4-x-atlas-800i-a3) +- [4 x Atlas 800I A3 (int8)](../platforms/ascend/ascend_npu_deepseek_example.md#running-deepseek-with-pd-disaggregation-on-4-x-atlas-800i-a3) ### Download Weights If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) official guide to download the weights. @@ -86,7 +86,7 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be - [Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP](https://lmsys.org/blog/2025-06-16-gb200-part-1/) ([Part I](https://lmsys.org/blog/2025-06-16-gb200-part-1/), [Part II](https://lmsys.org/blog/2025-09-25-gb200-part-2/)) - Comprehensive guide on GB200 optimizations. -- [Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs](https://lmsys.org/blog/2025-05-05-deepseek-pd-ep/) - Guide on PD disaggregation and large-scale EP. +- [Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs](https://lmsys.org/blog/2025-05-05-large-scale-ep/) - Guide on PD disaggregation and large-scale EP. - [Serving with two H20*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes). @@ -150,7 +150,7 @@ Data parallelism attention is not recommended for low-latency, small-batch use c **Description**: For users with limited memory on a single node, SGLang supports serving DeepSeek Series Models, including DeepSeek V3, across multiple nodes using tensor parallelism. This approach partitions the model parameters across multiple GPUs or nodes to handle models that are too large for one node's memory. 
-**Usage**: Check [here](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) for usage examples. +**Usage**: Check [here](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker) for usage examples. ### Block-wise FP8 @@ -223,7 +223,7 @@ Sample Request: ``` curl "http://127.0.0.1:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324", "tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of an city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "Hows the weather like in Qingdao today"}]}' +-d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324", "tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of a city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "How'\''s the weather like in Qingdao today"}]}' ``` Expected Response @@ -236,7 +236,7 @@ Sample Streaming Request: ``` curl "http://127.0.0.1:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324","stream":true,"tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of an city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. 
Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "Hows the weather like in Qingdao today"}]}' +-d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324","stream":true,"tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of a city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "How'\''s the weather like in Qingdao today"}]}' ``` Expected Streamed Chunks (simplified for clarity): ``` diff --git a/docs/basic_usage/deepseek_v32.md b/docs/basic_usage/deepseek_v32.md index 8533c4d7bcc8..095060a7f320 100644 --- a/docs/basic_usage/deepseek_v32.md +++ b/docs/basic_usage/deepseek_v32.md @@ -1,10 +1,9 @@ -# DeepSeek V3.2 Usage +# DeepSeek V3.2/GLM-5 Usage DeepSeek-V3.2 model family equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves efficiency improvements in long-context scenarios. -For reporting issues or tracking upcoming features, please refer to this [Roadmap](https://github.com/sgl-project/sglang/issues/11060). -Note: This document is originally written for the usage of [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model. The usage of [DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) or [DeepSeek-V3.2-Speciale](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale) is the same as DeepSeek-V3.2-Exp except for the tool call parser. +Note: This document is originally written for the usage of [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model. 
The usage of [DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) or [DeepSeek-V3.2-Speciale](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale) is the same as DeepSeek-V3.2-Exp except for the tool call parser. The [GLM-5](https://huggingface.co/zai-org/GLM-5) model also uses the DSA (DeepSeek Sparse Attention) structure, so most of the usage here applies to it as well, except for the reasoning parser and tool call parser. ## Installation @@ -16,7 +15,13 @@ Note: This document is originally written for the usage of [DeepSeek-V3.2-Exp](h docker pull lmsysorg/sglang:latest # MI350/MI355 -docker pull lmsysorg/sglang:dsv32-rocm +docker pull lmsysorg/sglang:v0.5.8-rocm700-mi35x + +# MI300 +# v0.5.8-rocm700-mi30x does not include PR #17504. Prefer the newest MI30x ROCm +# image tag from Docker Hub when available, or build from source (below). +docker pull lmsysorg/sglang:v0.5.8-rocm700-mi30x + # NPUs docker pull lmsysorg/sglang:dsv32-a2 @@ -32,7 +37,8 @@ cd sglang pip3 install pip --upgrade pip3 install -e "python" ``` -## Launch DeepSeek V3.2 with SGLang + +## Launch DeepSeek V3.2/GLM-5 with SGLang To serve [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) on 8xH200/B200 GPUs: @@ -45,21 +51,30 @@ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep # Launch with Pure TP python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 + +# Launch with TP on MI30x/MI35x +python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --nsa-prefill-backend tilelang --nsa-decode-backend tilelang ``` +To serve GLM-5, replace the `--model` argument with `zai-org/GLM-5-FP8`. + ### Configuration Tips -- **DP Attention (Recommended)**: For DeepSeek V3.2 model, the kernels are customized for the use case of `dp_size=8`, so DP attention (`--dp 8 --enable-dp-attention`) is the recommended configuration for better stability and performance. All test cases use this configuration by default.
-- **Pure TP Mode**: Launching with pure TP (without `--dp` and `--enable-dp-attention`) is also supported. Note that this mode has not been fully validated in PD disaggregation scenarios. -- **Short-sequence MHA prefill (adaptive)**: For short prefill sequences (default threshold: **2048 tokens**), the NSA backend uses standard MHA automatically (no extra flags). On H200 (SM90) this path uses the FlashAttention variable-length kernel; on B200 (SM100) it uses TRT-LLM ragged MHA. MHA uses `MHA_ONE_SHOT` for best performance. `MHA_ONE_SHOT` computes multi-head attention over all tokens (both cached prefix and newly extended tokens) in a single kernel invocation, avoiding the overhead of chunked KV cache processing. This achieves optimal throughput for short sequences where total sequence length fits within the chunk capacity limit. +- **DP Attention**: To enable [DP Attention](../advanced_features/dp_dpa_smg_guide.md), include `--enable-dp-attention --dp ` in the command. DP Attention is better suited to high-concurrency scenarios. +- **TP Attention**: Launching with TP attention is also supported. TP attention is better suited to low-latency scenarios. +- **Short-sequence MHA prefill (adaptive)**: For short prefill sequences (default threshold: **2048 tokens**), the NSA backend uses standard MHA automatically (no extra flags). On H200 (SM90) this path uses the FlashAttention variable-length kernel; on B200 (SM100) it uses TRT-LLM ragged MHA. MHA uses `MHA_ONE_SHOT` for best performance, which computes multi-head attention over all tokens (both cached prefix and newly extended tokens) in a single kernel invocation, avoiding the overhead of chunked KV cache processing. This achieves optimal throughput for short sequences where the total sequence length fits within the chunk capacity limit.
+- **MHA prefill threshold relaxation**: To apply MHA attention to requests longer than 2048 tokens, please set the flag `SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD` to a value larger than 2048. As threshold grows larger, the prefill performance can be improved, but at the cost of potential accuracy drop. - **Choices of Attention Kernels**: The attention backend is automatically set to `nsa` attention backend for DeepSeek V3.2 model. In this backend, different kernels for sparse prefilling/decoding are implemented, which can be specified by `--nsa-prefill-backend` and `--nsa-decode-backend` server arguments. The choices of nsa prefill/decode attention kernels include: - `flashmla_sparse`: `flash_mla_sparse_fwd` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs. It requires bf16 q, kv inputs. - `flashmla_kv`: `flash_mla_with_kvcache` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs. It requires bf16 q, fp8 k_cache inputs. + - `flashmla_auto`: enables automatic selection of either `flashmla_sparse` or `flashmla_kv` kernel for prefill based on KV cache dtype, hardware, and heuristics. With BF16 KV cache, `flashmla_sparse` is always used on both Hopper and Blackwell. With FP8 KV cache: On Hopper (SM90), it unconditionally uses `flashmla_kv`; On Blackwell (SM100), it uses `flashmla_sparse` when `total_kv_tokens < total_q_tokens * 512`, otherwise falls back to `flashmla_kv`. The heuristics may need to be tuned if the performance of either kernel changes significantly. - `fa3`: `flash_attn_with_kvcache` kernel from `flash_attn` library. Can only run on Hopper GPUs. It requires bf16 q, kv inputs. - `tilelang`: `tilelang` implementation that can run on GPU, HPU and NPU. - `aiter`: Aiter kernel on AMD HPUs. Can only be used as decode kernel. 
-- On the basis of performance benchmarks, the default configuration on H200 and B200 are set as follows : - - H200: `flashmla_sparse` prefill attention (short-seq prefill uses MHA via FlashAttention varlen), `fa3` decode attention, `bf16` kv cache dtype. - - B200: `flashmla_auto` prefill attention (short-seq prefill uses MHA via TRT-LLM ragged), `flashmla_kv` decode attention, `fp8_e4m3` kv cache dtype. `flashmla_auto` enables automatic selection of either `flashmla_sparse` or `flashmla_kv` kernel for prefill based on KV cache dtype, hardware, and heuristics. When FP8 KV cache is enabled and `total_kv_tokens < total_q_tokens * 512`, it uses the `flashmla_sparse` kernel; otherwise, it falls back to the `flashmla_kv` kernel. The heuristics may need to be tuned if the performance of either the `flashmla_sparse` or `flashmla_kv` kernel changes significantly. + - `trtllm`: `trtllm-mla` sparse kernel from the `flashinfer` library. Runs only on Blackwell GPUs. It requires q, k, and v to be uniformly in bf16 or fp8_e4m3 format. + - Based on performance benchmarks, the default configurations of DSA kernels on Hopper and Blackwell are set as follows: + - BF16 KV cache: on Hopper, `flashmla_sparse` prefill attention and `fa3` decode attention; on Blackwell, `flashmla_sparse` prefill attention and `trtllm` decode attention. + - FP8 (`float8_e4m3fn`) KV cache: on Hopper, `flashmla_kv` prefill attention and `flashmla_kv` decode attention; on Blackwell, `trtllm` prefill attention and `trtllm` decode attention. +- **Index Cache**: Introduced in [this paper](https://arxiv.org/abs/2603.12201), IndexCache improves speed by reusing the indexer results across layers, at the cost of a negligible accuracy loss. For the **GLM-5** model, we recommend appending `--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'` to the launch command for a better tradeoff between speedup and accuracy.
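The `flashmla_auto` selection rules described in the kernel list above can be sketched in Python. This is only an illustration of the documented heuristic, not SGLang's actual code: the function name `pick_flashmla_prefill_kernel` and its parameters are hypothetical, and the `sm_major` values 9 and 10 stand for SM90 (Hopper) and SM100 (Blackwell).

```python
def pick_flashmla_prefill_kernel(
    kv_cache_is_fp8: bool,
    sm_major: int,
    total_q_tokens: int,
    total_kv_tokens: int,
) -> str:
    """Hypothetical helper mirroring the documented flashmla_auto rules."""
    # BF16 KV cache: flashmla_sparse on both Hopper and Blackwell.
    if not kv_cache_is_fp8:
        return "flashmla_sparse"
    # FP8 KV cache on Hopper (SM90): always flashmla_kv.
    if sm_major == 9:
        return "flashmla_kv"
    # FP8 KV cache on Blackwell (SM100): compare KV tokens with query tokens.
    if sm_major == 10:
        if total_kv_tokens < total_q_tokens * 512:
            return "flashmla_sparse"
        return "flashmla_kv"
    # Assumption: other architectures are outside the documented rules.
    raise ValueError(f"no documented rule for SM major version {sm_major}")
```

The factor 512 is the threshold quoted in the text; as noted there, it may need retuning if the relative performance of the two kernels changes.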
## Multi-token Prediction SGLang implements Multi-Token Prediction (MTP) for DeepSeek V3.2 based on [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved significantly on small batch sizes. Please look at [this PR](https://github.com/sgl-project/sglang/pull/11652) for more information. @@ -78,7 +93,7 @@ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --sp - The default value of `--max-running-requests` is set to `48` for MTP. For larger batch sizes, this value should be increased beyond the default value. ```{tip} -To enable the experimental overlap scheduler for EAGLE speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages. +We recommend enabling the overlap scheduler for EAGLE speculative decoding by setting the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages.
``` @@ -107,7 +122,7 @@ python3 -m sglang.launch_server \ --reasoning-parser deepseek-v3 ``` -`DeepSeek-V3.2-Speciale` doesn't support tool calling, so can only be launched with reasoning parser: +`DeepSeek-V3.2-Speciale` does not support tool calling, so it can only be launched with the reasoning parser: ```bash python3 -m sglang.launch_server \ --model-path deepseek-ai/DeepSeek-V3.2-Speciale \ @@ -116,6 +131,23 @@ python3 -m sglang.launch_server \ --reasoning-parser deepseek-v3 ``` +To launch `GLM-5` with function calling and the reasoning parser: +```bash +python -m sglang.launch_server \ + --model zai-org/GLM-5-FP8 \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 +``` + +## NVFP4 Checkpoint + +To launch the DeepSeek V3.2 [NVFP4 checkpoint](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4) on Blackwell devices, set the quantization method to `modelopt_fp4` and the MoE runner backend to one of `flashinfer_trtllm` (recommended), `flashinfer_cutlass`, or `flashinfer_cutedsl`. All other usage (parallelism, reasoning parser, ...) is the same as for the FP8 checkpoint.
+ +An example launching command can be: +```bash +python -m sglang.launch_server --model nvidia/DeepSeek-V3.2-NVFP4 --tp 4 --quantization modelopt_fp4 --moe-runner-backend flashinfer_trtllm --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 +``` ## PD Disaggregation @@ -200,7 +232,7 @@ Repeat: 8, mean: 0.797 Scores: ['0.808', '0.798', '0.808', '0.798', '0.783', '0.788', '0.803', '0.793'] ``` -For Deepseek V3.2, Deepseek recommends setting the sampling parameters to temperature = 1.0, top_p = 0.95: +For DeepSeek V3.2, DeepSeek recommends setting the sampling parameters to temperature = 1.0, top_p = 0.95: ```bash python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3 @@ -208,7 +240,7 @@ python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 Repeat: 8, mean: 0.840 Scores: ['0.848', '0.808', '0.848', '0.838', '0.879', '0.813', '0.838', '0.848'] ``` -which matches the official score, 0.824, as reported in the [Deepseek-V3.2 technical report](https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf). +which matches the official score, 0.824, as reported in the [DeepSeek-V3.2 technical report](https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf). ### Accuracy Test with `aime 2025` @@ -257,7 +289,7 @@ ns eval \ Test results (8*B200): -DeepSeek-V3.2-Exp: +DeepSeek-V3.2-Exp: | evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer | |--------------------|-------------|------------|-------------|-----------------------|-----------| @@ -289,16 +321,13 @@ DeepSeek-V3.2-Speciale: For context parallel in DeepSeek V3.2 model, we provide two different modes of splitting tokens, which can be controlled with argument `--nsa-prefill-cp-mode`. 
-### In sequence splitting (default setting) +### In sequence splitting -The first mode can be enabled by `--nsa-prefill-cp-mode in-seq-split`. This mode implements context parallel for DSA by splitting the sequence uniformly between context parallel ranks. At attention stage, each cp rank computes the indexer results of sharded sequence, and collects the whole kv cache through all gather operator. +The first mode can be enabled by `--nsa-prefill-cp-mode in-seq-split`. This mode implements context parallel for DSA by splitting the sequence uniformly between context parallel ranks. At the attention stage, each CP rank computes the indexer results for its shard of the sequence and collects the whole KV cache through an all-gather operation. Add `--attn-cp-size` to set the size of the communication group for context parallel. -The communication group for context parallel reuses the one for attention tp, thus `cp_size` equals `atten_tp_size = tp_size / dp_size`. - -Note that in sequence splitting mode has the following restrictions: +Note that the in-sequence splitting mode has the following restrictions: - The batch size is restricted to 1 for prefill batches -- Multi-node/PD disaggregation is still not supported -- `moe_dense_tp_size=1`, `kv_cache_dtype = "bf16"`, `moe_a2a_backend = "deepep"` +- `moe_dense_tp_size=1`, `moe_a2a_backend = "deepep"` - To ensure `cp_size > 1`, the passed in `tp_size` must be larger than `dp_size` For more details, please refer to PR https://github.com/sgl-project/sglang/pull/12065.
@@ -306,21 +335,21 @@ For more details, please refer to PR https://github.com/sgl-project/sglang/pull/ Example: ```bash # In-seq splitting mode launched with EP + DP -python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode in-seq-split --max-running-requests 32 +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --attn-cp-size 4 --nsa-prefill-cp-mode in-seq-split --max-running-requests 32 ``` -### Round robin splitting +### Round robin splitting (default setting) This mode can be enabled by specifying the parameter `--nsa-prefill-cp-mode round-robin-split`, which distributes tokens across ranks based on `token_idx % cp_size`. -In this scenario, compared with the aforementioned method, it additionally supports the fused MoE backend (the fused MoE backend may deliver better performance than DeepEP in single-machine scenarios), FP8 KV-cache, and multi-batch prefill inference. But it cannot be enabled with dp attention together. +Compared to the in-sequence splitting method, this mode additionally supports the fused MoE backend (the fused MoE backend may deliver better performance than DeepEP in single-machine scenarios), FP8 KV cache, and multi-batch prefill inference. However, it cannot be enabled together with DP attention. For more details, please refer to PR https://github.com/sgl-project/sglang/pull/13959.
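To make the difference between the two splitting modes concrete, here is a small Python sketch. The helper names are hypothetical, and the in-sequence variant assumes plain contiguous chunks; the actual implementation may arrange shards differently (for example, for load balancing). Only the `token_idx % cp_size` rule for round-robin splitting is taken from the text above.

```python
def round_robin_split(num_tokens: int, cp_size: int) -> list[list[int]]:
    """Assign token i to rank i % cp_size (the rule quoted in the text)."""
    return [list(range(rank, num_tokens, cp_size)) for rank in range(cp_size)]


def contiguous_split(num_tokens: int, cp_size: int) -> list[list[int]]:
    """Assumed illustration of uniform in-sequence splitting:
    each rank receives one consecutive block of tokens."""
    chunk = -(-num_tokens // cp_size)  # ceiling division
    return [
        list(range(rank * chunk, min((rank + 1) * chunk, num_tokens)))
        for rank in range(cp_size)
    ]


if __name__ == "__main__":
    # Print the token indices owned by each rank under both schemes.
    for rank, (rr, cs) in enumerate(
        zip(round_robin_split(10, 4), contiguous_split(10, 4))
    ):
        print(f"rank {rank}: round-robin={rr} contiguous={cs}")
```

Round-robin interleaves tokens, so every rank holds tokens from the whole sequence, whereas the contiguous split hands each rank one consecutive block.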
Example usage: ```bash # Launch with FusedMoe + CP8 -python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode round-robin-split --max-running-requests 32 +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --enable-nsa-prefill-context-parallel --attn-cp-size 8 --nsa-prefill-cp-mode round-robin-split --max-running-requests 32 ``` ### Pipeline Parallel + Context Parallel (PP + CP) @@ -344,6 +373,7 @@ python3 -m sglang.launch_server \ --tp 8 --pp-size 2 \ --dp-size 1 --moe-dense-tp-size 1 \ --enable-nsa-prefill-context-parallel \ + --attn-cp-size 8 \ --nsa-prefill-cp-mode round-robin-split \ --trust-remote-code \ --disable-radix-cache \ @@ -367,6 +397,7 @@ python3 -m sglang.launch_server \ --tp 8 --pp-size 2 \ --dp-size 1 --moe-dense-tp-size 1 \ --enable-nsa-prefill-context-parallel \ + --attn-cp-size 8 \ --nsa-prefill-cp-mode round-robin-split \ --trust-remote-code \ --disable-radix-cache \ @@ -394,6 +425,7 @@ python -m sglang.launch_server \ --tp 8 --pp-size 2 \ --dp-size 1 --moe-dense-tp-size 1 \ --enable-nsa-prefill-context-parallel \ + --attn-cp-size 8 \ --nsa-prefill-cp-mode round-robin-split \ --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \ --trust-remote-code \ @@ -419,6 +451,7 @@ python -m sglang.launch_server \ --tp 8 --pp-size 2 \ --dp-size 1 --moe-dense-tp-size 1 \ --enable-nsa-prefill-context-parallel \ + --attn-cp-size 8 \ --nsa-prefill-cp-mode round-robin-split \ --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \ --trust-remote-code \ @@ -435,3 +468,9 @@ python -m sglang.launch_server \ ``` For the Decode nodes, it is recommended to use the **EP mode**. + +## HiSparse: Hierarchical Sparse Attention for DSA (experimental) + +HiSparse reduces per-request GPU memory during decode by keeping only a small "hot" KV buffer on GPU while storing complete KV data in CPU pinned memory. 
A CUDA kernel dynamically swaps in the top-k most relevant KV entries from host memory on each decode step. This enables significantly higher decode concurrency for long-context DSA models. + +HiSparse currently requires PD disaggregation mode and is enabled on the decode instance only. For detailed design, configuration, and deployment instructions, see the [HiSparse Guide](../advanced_features/hisparse_guide.md). diff --git a/docs/basic_usage/glmv.md b/docs/basic_usage/glmv.md index c56b6ecd54cb..ad36cea26ad2 100644 --- a/docs/basic_usage/glmv.md +++ b/docs/basic_usage/glmv.md @@ -133,4 +133,4 @@ python -m sglang.launch_server \ In SGLang, we can implement thinking budget with `CustomLogitProcessor`. -Launch a server with `--enable-custom-logit-processor` flag on. and using `Glm4MoeThinkingBudgetLogitProcessor` in the request likes `GLM-4.6` example in [glm45.md](./glm45.md). +Launch a server with the `--enable-custom-logit-processor` flag. Then, use `Glm4MoeThinkingBudgetLogitProcessor` in the request, similar to the `GLM-4.6` example in [glm45.md](./glm45.md). diff --git a/docs/basic_usage/gpt_oss.md b/docs/basic_usage/gpt_oss.md index f74ba40d90ae..da8e778b25f6 100644 --- a/docs/basic_usage/gpt_oss.md +++ b/docs/basic_usage/gpt_oss.md @@ -25,7 +25,7 @@ GPT‑OSS can call built‑in tools for web search and Python execution. You can ### Tool & Reasoning Parser -- We support OpenAI Reasoning and Tool Call parser, as well as our SGLang native api for tool call and reasoning. Refer to [reasoning parser](../advanced_features/separate_reasoning.ipynb) and [tool call parser](../advanced_features/function_calling.ipynb) for more details. +- We support OpenAI Reasoning and Tool Call parser, as well as our SGLang native api for tool call and reasoning. Refer to [reasoning parser](../advanced_features/separate_reasoning.ipynb) and [tool parser](../advanced_features/tool_parser.ipynb) for more details. 
## Notes @@ -105,7 +105,7 @@ print(response.output_text) # Test python tool response = client.responses.create( model="openai/gpt-oss-120b", - instructions="You are a helfpul assistant, you could use python tool to execute code.", + instructions="You are a helpful assistant, you could use python tool to execute code.", input="Use python tool to calculate the sum of 29138749187 and 29138749187", # 58,277,498,374 tools=tools ) @@ -115,7 +115,7 @@ print(response.output_text) # Test browser tool response = client.responses.create( model="openai/gpt-oss-120b", - instructions="You are a helfpul assistant, you could use browser to search the web", + instructions="You are a helpful assistant, you could use browser to search the web", input="Search the web for the latest news about Nvidia stock price", tools=tools ) diff --git a/docs/basic_usage/hy3_preview.md b/docs/basic_usage/hy3_preview.md new file mode 100644 index 000000000000..b7f23937ef72 --- /dev/null +++ b/docs/basic_usage/hy3_preview.md @@ -0,0 +1,191 @@ +# Hy3-preview Usage + +Hy3-preview is a large-scale language model (295B parameters, 21B active parameters) from the Tencent Hunyuan team. SGLang supports serving Hy3-preview. This guide describes how to run Hy3-preview with native BF16. + +## Installation + +### Docker + +```bash +docker pull lmsysorg/sglang:hy3-preview +``` + +### Build from Source + +```bash +# Install SGLang +git clone https://github.com/sgl-project/sglang +cd sglang +pip3 install pip --upgrade +pip3 install "transformers>=5.6.0" +pip3 install -e "python" +``` + +## Launch Hy3-preview with SGLang + +Use the command below to serve the [Hy3-preview](https://huggingface.co/tencent/Hy3-preview) model on 8 GPUs. On 8x96GB H20, SGLang can barely deploy the BF16 model and can only run small batch sizes or short requests. Use larger-memory GPUs such as H20-3e when possible.
+ +```bash +python3 -m sglang.launch_server \ + --model tencent/Hy3-preview \ + --tp 8 \ + --tool-call-parser hunyuan \ + --reasoning-parser hunyuan \ + --served-model-name hy3-preview +``` + +### EAGLE Speculative Decoding + +**Description**: SGLang supports Hy3-preview models with [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#eagle-decoding). + +**Usage**: +Add `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to enable this feature. For example: + +```bash +python3 -m sglang.launch_server \ + --model tencent/Hy3-preview \ + --tp 8 \ + --tool-call-parser hunyuan \ + --reasoning-parser hunyuan \ + --speculative-num-steps 1 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 2 \ + --speculative-algorithm EAGLE \ + --served-model-name hy3-preview +``` + +## OpenAI Client Example + +First, install the OpenAI Python client: + +```bash +uv pip install -U openai +``` + +You can use the OpenAI client as follows to verify thinking-mode responses. + +```python +from openai import OpenAI + +# If running SGLang locally with its default OpenAI-compatible port: +# http://localhost:30000/v1 +openai_api_key = "EMPTY" +openai_api_base = "http://localhost:30000/v1" + +client = OpenAI( + api_key=openai_api_key, + base_url=openai_api_base, +) +messages = [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Hello."}, +] + +# Thinking mode is disabled by default (no need to pass chat_template_kwargs). +resp = client.chat.completions.create( + model="hy3-preview", + messages=messages, + temperature=1, + max_tokens=4096, +) +print(resp.choices[0].message.content) + +# Thinking mode is enabled only if 'reasoning_effort' and 'interleaved_thinking' are set in 'chat_template_kwargs'. +# 'reasoning_effort' supports: 'high', 'low', 'no_think'. 
+resp_think = client.chat.completions.create( + model="hy3-preview", + messages=messages, + temperature=1, + max_tokens=4096, + extra_body={ + "chat_template_kwargs": { + "reasoning_effort": "high", + "interleaved_thinking": True + }, + }, +) +output_msg = resp_think.choices[0].message +# thinking content +print(output_msg.reasoning_content) +# response content +print(output_msg.content) +``` + +### cURL Usage + +```bash +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "hy3-preview", + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Hello."} + ], + "temperature": 1, + "max_tokens": 4096 + }' +``` + +## Benchmarking Results + +For benchmarking, disable prefix caching by adding `--disable-radix-cache` to the server command. + +The following example runs the benchmark on 8 H20 GPUs with 96 GB memory each. + +```bash +python3 -m sglang.bench_serving \ + --backend sglang \ + --flush-cache \ + --dataset-name random \ + --random-range-ratio 1.0 \ + --random-input-len 4096 \ + --random-output-len 4096 \ + --num-prompts 5 \ + --max-concurrency 1 \ + --output-file hy3_preview_h20.jsonl \ + --model tencent/Hy3-preview \ + --served-model-name hy3-preview +``` + +If successful, you will see the following output. 
+ +```shell +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 5 +Benchmark duration (s): 176.41 +Total input tokens: 20480 +Total input text tokens: 20480 +Total generated tokens: 20480 +Total generated tokens (retokenized): 20480 +Request throughput (req/s): 0.03 +Input token throughput (tok/s): 116.09 +Output token throughput (tok/s): 116.09 +Peak output token throughput (tok/s): 118.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 232.19 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 35279.06 +Median E2E Latency (ms): 35275.60 +P90 E2E Latency (ms): 35294.13 +P99 E2E Latency (ms): 35294.41 +---------------Time to First Token---------------- +Mean TTFT (ms): 355.93 +Median TTFT (ms): 309.28 +P99 TTFT (ms): 518.36 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 8.53 +Median TPOT (ms): 8.54 +P99 TPOT (ms): 8.54 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 8.53 +Median ITL (ms): 8.54 +P95 ITL (ms): 8.62 +P99 ITL (ms): 8.74 +Max ITL (ms): 31.70 +================================================== +``` diff --git a/docs/basic_usage/minimax_m2.md b/docs/basic_usage/minimax_m2.md index 33d445790a6f..7ca6ed809fcb 100644 --- a/docs/basic_usage/minimax_m2.md +++ b/docs/basic_usage/minimax_m2.md @@ -1,13 +1,14 @@ -# MiniMax M2.1/M2 Usage +# MiniMax M2.5/M2.1/M2 Usage -[MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) and [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) are advanced large language models created by [MiniMax](https://www.minimax.io/). +[MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), [MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1), and [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) are advanced large language models created by [MiniMax](https://www.minimax.io/). 
-MiniMax-M2 series redefines efficiency for agents. It's a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever. +The MiniMax-M2 series redefines efficiency for agents. These compact, fast, and cost-effective MoE models (230 billion total parameters with 10 billion active parameters) are built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, the MiniMax-M2 series provides sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever. ## Supported Models This guide applies to the following models. You only need to update the model name during deployment. 
The following examples use **MiniMax-M2**: +- [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) - [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) - [MiniMaxAI/MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) @@ -49,6 +50,24 @@ python -m sglang.launch_server \ --mem-fraction-static 0.85 ``` +### AMD GPUs (MI300X/MI325X/MI355X) + +8-GPU deployment command: + +```bash +SGLANG_USE_AITER=1 python -m sglang.launch_server \ + --model-path MiniMaxAI/MiniMax-M2.5 \ + --tp-size 8 \ + --ep-size 8 \ + --attention-backend aiter \ + --tool-call-parser minimax-m2 \ + --reasoning-parser minimax-append-think \ + --host 0.0.0.0 \ + --trust-remote-code \ + --port 8000 \ + --mem-fraction-static 0.85 +``` + ## Testing Deployment After startup, you can test the SGLang OpenAI-compatible API with the following command: diff --git a/docs/basic_usage/native_api.ipynb b/docs/basic_usage/native_api.ipynb index 52e4386af6dc..d3ead5e349d6 100644 --- a/docs/basic_usage/native_api.ipynb +++ b/docs/basic_usage/native_api.ipynb @@ -10,7 +10,7 @@ "\n", "- `/generate` (text generation model)\n", "- `/get_model_info`\n", - "- `/get_server_info`\n", + "- `/server_info`\n", "- `/health`\n", "- `/health_generate`\n", "- `/flush_cache`\n", @@ -49,7 +49,7 @@ " \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" ] }, { @@ -140,7 +140,7 @@ "metadata": {}, "outputs": [], "source": [ - "url = f\"http://localhost:{port}/get_server_info\"\n", + "url = f\"http://localhost:{port}/server_info\"\n", "\n", "response = requests.get(url)\n", "print_highlight(response.text)" @@ -185,7 +185,15 @@ "source": [ "## Flush Cache\n", "\n", - "Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API." 
+ "Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API.\n", + "\n", + "Parameters:\n", + "- `timeout` (query, float, default `0`, unit: seconds): Wait time for idle state before flushing. `0` means fail fast if not idle. When HiCache async operations are in-flight, a non-zero timeout allows the server to wait until idle before flushing, avoiding unnecessary 400 errors.\n", + "\n", + "```bash\n", + "# With timeout (wait up to 30s for idle state)\n", + "curl -s -X POST \"http://127.0.0.1:30000/flush_cache?timeout=30\"\n", + "```" ] }, { @@ -275,14 +283,12 @@ "metadata": {}, "outputs": [], "source": [ - "embedding_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "embedding_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n", " --host 0.0.0.0 --is-embedding --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=embedding_process)" ] }, { @@ -324,14 +330,12 @@ "metadata": {}, "outputs": [], "source": [ - "reranker_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "reranker_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \\\n", " --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=reranker_process)" ] }, { @@ -392,14 +396,12 @@ "metadata": {}, "outputs": [], "source": [ - "score_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "score_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n", " --host 0.0.0.0 --log-level warning\n", 
- "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=score_process)" ] }, { @@ -456,13 +458,11 @@ "# Note that SGLang now treats embedding models and reward models as the same type of models.\n", "# This will be updated in the future.\n", "\n", - "reward_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "reward_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=reward_process)" ] }, { @@ -526,7 +526,7 @@ " \"python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=expert_record_server_process)" ] }, { @@ -575,13 +575,11 @@ "metadata": {}, "outputs": [], "source": [ - "tokenizer_free_server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "tokenizer_free_server_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=tokenizer_free_server_process)" ] }, { diff --git a/docs/basic_usage/offline_engine_api.ipynb b/docs/basic_usage/offline_engine_api.ipynb index 9c03e90a7935..fe8a9e3045c0 100644 --- a/docs/basic_usage/offline_engine_api.ipynb +++ b/docs/basic_usage/offline_engine_api.ipynb @@ -66,7 +66,7 @@ "import asyncio\n", "\n", "import sglang as sgl\n", - "import sglang.test.doc_patch\n", + "import sglang.test.doc_patch # noqa: F401\n", "from sglang.utils 
import async_stream_and_merge, stream_and_merge\n", "\n", "llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")" diff --git a/docs/basic_usage/openai_api_completions.ipynb b/docs/basic_usage/openai_api_completions.ipynb index e89dfd57ff78..ffa576ae52c5 100644 --- a/docs/basic_usage/openai_api_completions.ipynb +++ b/docs/basic_usage/openai_api_completions.ipynb @@ -39,7 +39,7 @@ " \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", "print(f\"Server started on http://localhost:{port}\")" ] }, diff --git a/docs/basic_usage/openai_api_embeddings.ipynb b/docs/basic_usage/openai_api_embeddings.ipynb index 26e95a4e7c12..a6c90c06b5f0 100644 --- a/docs/basic_usage/openai_api_embeddings.ipynb +++ b/docs/basic_usage/openai_api_embeddings.ipynb @@ -9,7 +9,7 @@ "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings).\n", "\n", - "This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](../supported_models/embedding_models.md)\n" + "This tutorial covers the embedding APIs for embedding models. 
For a list of the supported models see the [corresponding overview page](../supported_models/retrieval_ranking/embedding_models.md)\n" ] }, { @@ -30,14 +30,12 @@ "from sglang.test.doc_patch import launch_server_cmd\n", "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", "\n", - "embedding_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "embedding_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n", " --host 0.0.0.0 --is-embedding --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=embedding_process)" ] }, { @@ -173,7 +171,7 @@ "metadata": {}, "source": [ "## Multi-Modal Embedding Model\n", - "Please refer to [Multi-Modal Embedding Model](../supported_models/embedding_models.md)" + "Please refer to [Multi-Modal Embedding Model](../supported_models/retrieval_ranking/embedding_models.md)" ] } ], diff --git a/docs/basic_usage/openai_api_vision.ipynb b/docs/basic_usage/openai_api_vision.ipynb index 1db599dcfa90..b6e6a1a24eb3 100644 --- a/docs/basic_usage/openai_api_vision.ipynb +++ b/docs/basic_usage/openai_api_vision.ipynb @@ -10,7 +10,7 @@ "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n", "This tutorial covers the vision APIs for vision language models.\n", "\n", - "SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported_models/multimodal_language_models.md).\n", + "SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported_models/text_generation/multimodal_language_models.md).\n", "\n", "As an alternative to the OpenAI API, you can also use the [SGLang offline 
engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py)." ] @@ -33,13 +33,16 @@ "from sglang.test.doc_patch import launch_server_cmd\n", "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", "\n", - "vision_process, port = launch_server_cmd(\n", - " \"\"\"\n", - "python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning\n", - "\"\"\"\n", + "example_image_url = \"https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png\"\n", + "logo_image_url = (\n", + " \"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "vision_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=vision_process)" ] }, { @@ -75,7 +78,7 @@ " {{\n", " \"type\": \"image_url\",\n", " \"image_url\": {{\n", - " \"url\": \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n", + " \"url\": \"{example_image_url}\"\n", " }}\n", " }}\n", " ]\n", @@ -119,9 +122,7 @@ " {\"type\": \"text\", \"text\": \"What’s in this image?\"},\n", " {\n", " \"type\": \"image_url\",\n", - " \"image_url\": {\n", - " \"url\": \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n", - " },\n", + " \"image_url\": {\"url\": example_image_url},\n", " },\n", " ],\n", " }\n", @@ -162,9 +163,7 @@ " },\n", " {\n", " \"type\": \"image_url\",\n", - " \"image_url\": {\n", - " \"url\": \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n", - " },\n", + " \"image_url\": {\"url\": example_image_url},\n", " },\n", " ],\n", " }\n", @@ -203,13 +202,13 @@ " {\n", " \"type\": \"image_url\",\n", " 
\"image_url\": {\n", - " \"url\": \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\",\n", + " \"url\": example_image_url,\n", " },\n", " },\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", - " \"url\": \"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png\",\n", + " \"url\": logo_image_url,\n", " },\n", " },\n", " {\n", diff --git a/docs/basic_usage/popular_model_usage.rst b/docs/basic_usage/popular_model_usage.rst index 0eef2ef33e4d..ec0268ed7cf2 100644 --- a/docs/basic_usage/popular_model_usage.rst +++ b/docs/basic_usage/popular_model_usage.rst @@ -1,6 +1,8 @@ Popular Model Usage (DeepSeek, GPT-OSS, GLM, Llama, MiniMax, Qwen, and more) =============================================================== +For more usage examples and recipes, visit the `SGLang Cookbook `_. + .. toctree:: :maxdepth: 1 @@ -11,5 +13,7 @@ Popular Model Usage (DeepSeek, GPT-OSS, GLM, Llama, MiniMax, Qwen, and more) gpt_oss.md minimax_m2.md qwen3.md + qwen3_5.md qwen3_vl.md + deepseek_ocr.md llama4.md diff --git a/docs/basic_usage/qwen3_5.md b/docs/basic_usage/qwen3_5.md new file mode 100644 index 000000000000..06f7b615eef5 --- /dev/null +++ b/docs/basic_usage/qwen3_5.md @@ -0,0 +1,76 @@ +# Qwen 3.5 Usage + +Qwen 3.5 is Alibaba's latest generation LLM featuring a hybrid attention architecture, advanced MoE with shared experts, and native multimodal capabilities. 
+ +Key architecture features: +- **Hybrid Attention**: Gated Delta Networks (linear, O(n) complexity) combined with full attention every 4th layer for high associative recall +- **MoE with Shared Experts**: Top-8 active out of 64 routed experts plus a dedicated shared expert for universal features +- **Multimodal**: DeepStack Vision Transformer with Conv3d for native image and video understanding + +## Launch Qwen 3.5 with SGLang + +### Dense Model + +To serve `Qwen/Qwen3.5-397B-A17B` on 8 GPUs: + +```bash +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --trust-remote-code +``` + +### AMD GPU (MI300X / MI325X / MI35X) + +On AMD Instinct GPUs, use the `triton` attention backend. Both the full attention layers and the Gated Delta Net (linear attention) layers use Triton-based kernels on ROCm: + +```bash +SGLANG_USE_AITER=1 python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --attention-backend triton \ + --trust-remote-code +``` + +```{tip} +Set `SGLANG_USE_AITER=1` to enable AMD's optimized aiter kernels for MoE and GEMM operations. +``` + +### Configuration Tips + +- `--attention-backend`: Use `triton` on AMD GPUs for Qwen 3.5. The hybrid attention architecture (Gated Delta Networks + full attention) works best with the Triton backend on ROCm. The linear attention (GDN) layers always use Triton kernels internally via the `GDNAttnBackend`. +- `--watchdog-timeout`: Increase to `1200` or higher for this large model, as weight loading takes significant time. +- `--model-loader-extra-config '{"enable_multithread_load": true}'`: Enables parallel weight loading for faster startup. 
+ +### Reasoning and Tool Calling + +Qwen 3.5 supports reasoning and tool calling via the Qwen3 parsers: + +```bash +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --trust-remote-code \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder +``` + +## Accuracy Evaluation + +You can evaluate the model accuracy using `lm-eval`: + +```bash +pip install lm-eval[api] + +lm_eval --model local-completions \ + --model_args '{"base_url": "http://localhost:8000/v1/completions", "model": "Qwen/Qwen3.5-397B-A17B", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \ + --tasks gsm8k \ + --batch_size auto \ + --num_fewshot 5 \ + --trust_remote_code +``` + +## Additional Resources + +- [AMD Day 0 Support for Qwen 3.5 on AMD Instinct GPUs](https://www.amd.com/en/developer/resources/technical-articles/2026/day-0-support-for-qwen-3-5-on-amd-instinct-gpus.html) +- [HuggingFace Model Card](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) diff --git a/docs/basic_usage/sampling_params.md b/docs/basic_usage/sampling_params.md index a1848d41ddad..23415f9af555 100644 --- a/docs/basic_usage/sampling_params.md +++ b/docs/basic_usage/sampling_params.md @@ -74,7 +74,7 @@ Please refer to our dedicated guide on [constrained decoding](../advanced_featur | json_schema | `Optional[str] = None` | JSON schema for structured outputs. | | regex | `Optional[str] = None` | Regex for structured outputs. | | ebnf | `Optional[str] = None` | EBNF for structured outputs. | -| structural_tag | `Optional[str] = None` | The structal tag for structured outputs. | +| structural_tag | `Optional[str] = None` | The structural tag for structured outputs. 
| ### Other options diff --git a/docs/basic_usage/send_request.ipynb b/docs/basic_usage/send_request.ipynb index aa4f745d2f2f..968a23b8d632 100644 --- a/docs/basic_usage/send_request.ipynb +++ b/docs/basic_usage/send_request.ipynb @@ -31,14 +31,12 @@ "# This is equivalent to running the following command in your terminal\n", "# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\n", "\n", - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n", " --host 0.0.0.0 --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")" + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" ] }, { diff --git a/docs/conf.py b/docs/conf.py index d6ca64d88a2d..6140b47f8362 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -131,6 +131,7 @@ html_static_path = ["_static"] html_css_files = ["css/custom_log.css"] +html_js_files = ["js/deprecation_banner.js"] def setup(app): diff --git a/docs/developer_guide/JIT_kernels.md b/docs/developer_guide/JIT_kernels.md deleted file mode 100644 index 44f298b9cf31..000000000000 --- a/docs/developer_guide/JIT_kernels.md +++ /dev/null @@ -1,258 +0,0 @@ -# Development Guide for JIT Kernels - -## Environment Setup - -We strongly recommend using `clangd` as the language server for JIT kernel development. -For Ubuntu/Debian, you can download clangd from [apt.llvm.org](https://apt.llvm.org/). -If you are using VS Code, we recommend installing the `clangd` extension for better IDE integration. - -All JIT-related files are located in `python/sglang/jit_kernel`. -Unlike `sgl-kernel`, which compiles CUDA/C++ binaries ahead of time (AOT), just-in-time (JIT) kernels are compiled at runtime. -Consequently, a static `compile_commands.json` cannot be generated. 
-To enable code completion with `clangd`, run `python -m sglang.jit_kernel` to generate a `.clangd` configuration file in your current directory. -After generating the file, restart the clangd language server. It should now recognize all JIT kernel files. - -## Code Structure - -### C++ Implementation - -C++ source code is located in `python/sglang/jit_kernel/csrc`. -Reusable functions should be placed in `python/sglang/jit_kernel/include`. - -We use [tvm-ffi](https://github.com/apache/tvm-ffi) for efficient foreign language bindings. -Refer to the [documentation](https://tvm.apache.org/ffi/) for advanced usage, such as exporting C++ objects. -Typically, `tvm::ffi::TensorView` is sufficient for passing PyTorch Tensors from Python. - -### Python Interface - -Python interfaces are defined in `python/sglang/jit_kernel`. -The `load_jit` utility function in `python/sglang/jit_kernel/utils.py` loads and returns the compiled module. -To export a C++ function (e.g., `cpp_func`), pass `cuda_wrappers=[("func", "cpp_func")]` to `load_jit`. -The function can then be called in Python as `module.func`. - -### C++ Utilities - -The following C++ utilities are available: - -#### Integer Range - -Similar to PyTorch, we provide an `irange` function to represent an integer range. - -```C++ -#include - -void test() { - for (auto i : host::irange(100)) { // [0, 100) - // do something - } - for (auto i : host::irange(0, 100)) { // [0, 100) - // do something - } -} - -``` - -#### Runtime Checking - -`RuntimeCheck` validates conditions at runtime. It accepts optional arguments for error reporting. -If the check fails, these arguments are output to aid debugging. -`RuntimeDeviceCheck` verifies the status of the last kernel launch. 
- -```C++ -#include -#include - -void test() { - host::RuntimeCheck(1 + 1 == 2, 1 + 1, " != ", 2); - host::RuntimeDeviceCheck(); - // check the provided `cudaError_t` - host::RuntimeDeviceCheck(cudaGetLastError()); -} - -``` - -#### Tensor Checking - -`TensorMatcher` provides a readable way to validate and extract tensor shape information. - -```cpp -#include - -void test(const tvm::ffi::TensorView k_cache, const tvm::ffi::TensorView v_cache) { - using namespace host; - - auto D = SymbolicSize{"D"}; // cache dimension - auto N = SymbolicSize{"N"}; // kvcache stride - auto dtype = SymbolicDType{}; - auto device = SymbolicDevice{}; - - TensorMatcher({-1, D}) // - .with_strides({N, 1}) - .with_dtype(dtype) - .with_device(device) - .verify(k_cache) - .verify(v_cache); -} -``` - -Configure the `TensorMatcher` with expected stride, dtype, and device properties before verification. -- If `with_strides` is omitted, the tensor is expected to be contiguous. -- Template arguments in `with_dtype` restrict the allowed data types. -- Template arguments in `with_device` restrict the allowed devices. -- Values passed to `with_xxx` methods enforce equality checks. -- Passing `-1` for size or stride allows matching any value. - -A `Symbolic` variable must resolve to the same value across all verifications. -Use `.unwrap()` to retrieve the matched value after verification. - -> Note: `TensorMatcher` is a temporary expression and should not be stored in a variable. - -> Tip: Add `//` at the end of the `TensorMatcher` chain to enforce proper indentation. - -#### Kernel Launching - -`LaunchKernel::resolve_device` retrieves the current `cudaStream` from PyTorch. -Kernels can also be launched directly using `LaunchKernel`. 
- -```cpp -#include - -#include - -__global__ void kernel() {} - -void test() { - const auto num_blocks = 1; - const auto num_threads = 32; - const auto dynamic_smem = 0; - - DLDevice dev; // suppose this is initialized properly - host::LaunchKernel(num_blocks, num_threads, dev)(kernel); - - cudaStream_t stream = host::LaunchKernel::resolve_device(dev); - host::LaunchKernel(num_blocks, num_threads, stream, dynamic_smem)(kernel); -} - -``` - -## Add new kernels - -This section walks through a complete, end-to-end example of adding a new JIT kernel to the system. -We use a simple add_constant kernel as a running example, which adds a constant integer value to every element of an input tensor. - -Conceptually, the Python interface looks like this: - -```python -def add_constant(src: torch.Tensor, c: int): - return src + c -``` - -### STEP 1: Write the C++ kernel - -Write your CUDA kernel in [jit_kernel/csrc/add_constant.cuh](../../python/sglang/jit_kernel/csrc/add_constant.cuh). For demonstration purposes, we pass the constant value as a template parameter. - -```cpp -#include // For TensorMatcher, SymbolicSize, SymbolicDevice -#include // For LaunchKernel -#include // For div_ceil, RuntimeCheck - -#include -#include - -#include -#include - -namespace { - -template -__global__ void add_constant_kernel(int32_t* dst, const int32_t* src, size_t length) { - size_t idx = blockIdx.x * blockDim.x + threadIdx.x; - if (idx < length) { - dst[idx] = src[idx] + kConstant; - } -} - -constexpr size_t kBlockSize = 256; - -// You can also use struct with static method as an alternative -template -void add_constant(tvm::ffi::TensorView dst, tvm::ffi::TensorView src) { - using namespace host; - - // 1. 
Validate input tensors - SymbolicSize N = {"num_elements"}; - SymbolicDevice device_; - TensorMatcher({N}) // 1D tensor, must be contiguous - .with_dtype() // must be int32 - .with_device(device_) // must be on CUDA device - .verify(dst) // check tensor dst - .verify(src); // check tensor src - - // 2. Extract required parameters, prepare for kernel launch - const size_t num_elements = N.unwrap(); - const size_t grid_size = div_ceil(num_elements, kBlockSize); - const DLDevice device = device_.unwrap(); - // some extra runtime checks using host::RuntimeCheck - RuntimeCheck(num_elements > 0, "We only support non-empty tensors, got num_elements = ", num_elements); - - // 3. Launch the kernel. Error code will be automatically checked. - LaunchKernel(grid_size, kBlockSize, device /*, dynamic_smem*/)( - // kernel function - add_constant_kernel, - // kernel arguments - static_cast(dst.data_ptr()), - static_cast(src.data_ptr()), - num_elements); -} - -} // namespace - -``` - -### STEP 2: Create Python Interfaces - -Next, expose the kernel through a Python wrapper. -Create a new file at [jit_kernel/add_constant.py](../../python/sglang/jit_kernel/add_constant.py) and expose the needed interfaces. 
- -```python -from __future__ import annotations - -import functools -from typing import TYPE_CHECKING - -import torch - -from sglang.jit_kernel.utils import load_jit, make_cpp_args - -if TYPE_CHECKING: - from tvm_ffi.module import Module - - -@functools.cache -def _jit_add_constant_module(constant: int) -> Module: - args = make_cpp_args(constant) # pass all the template argument - return load_jit( - "add_constant", - *args, - cuda_files=["add_constant.cuh"], - cuda_wrappers=[("add_constant", f"add_constant<{args}>")], - ) - - -def add_constant(src: torch.Tensor, constant: int) -> torch.Tensor: - dst = torch.empty_like(src) - module = _jit_add_constant_module(constant) - module.add_constant(dst, src) - return dst - -``` - -### STEP 3: Use your kernel - -Finally, import and use the kernel like a regular Python function: - -```python -from sglang.jit_kernel.add_constant import add_constant -``` - -For a complete, runnable example, refer to [test_add_constant.py](../../python/sglang/jit_kernel/test_add_constant.py). diff --git a/docs/developer_guide/bench_serving.md b/docs/developer_guide/bench_serving.md index b2f8568e260f..bc13765d0f10 100644 --- a/docs/developer_guide/bench_serving.md +++ b/docs/developer_guide/bench_serving.md @@ -21,7 +21,7 @@ If `--base-url` is provided, requests are sent to it. Otherwise, `--host` and `- ### Prerequisites -- Python 3.8+ +- Python 3.10+ - Dependencies typically used by this script: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install as needed. 
- An inference server running and reachable via the endpoints above - If your server requires authentication, set environment variable `OPENAI_API_KEY` (used as `Authorization: Bearer `) @@ -332,7 +332,7 @@ python3 -m sglang.bench_serving \ python3 -m sglang.bench_serving \ --backend sglang \ --host 127.0.0.1 --port 30000 \ - --model mode-name \ + --model model-name \ --dataset-name mooncake \ --mooncake-slowdown-factor 1.0 \ --mooncake-num-rounds 1000 \ @@ -341,6 +341,41 @@ python3 -m sglang.bench_serving \ --random-output-len 256 ``` +10) Fake decode stress testing (PD disaggregation, decode-only): + +When benchmarking pure decode performance in a PD disaggregation setup, you can bypass the prefill node entirely by using `--fake-prefill`. This requires the decode server to be started with `--disaggregation-transfer-backend fake`: + +```bash +# Step 1: Start a decode-only server with fake transfer backend +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend fake \ + --port 30001 + +# Step 2: Run bench_serving with --fake-prefill +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30001 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --dataset-name random \ + --num-prompts 500 \ + --random-input-len 1024 --random-output-len 256 \ + --fake-prefill +``` + +Similarly, `bench_one_batch_server` also supports `--fake-prefill`: + +```bash +python3 -m sglang.bench_one_batch_server \ + --base-url http://127.0.0.1:30001 \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --batch-size 32 --input-len 1024 --output-len 256 \ + --fake-prefill +``` + +The `--fake-prefill` flag automatically injects special sentinel values into each request, telling the decode server to skip real KV transfer and generate fake KV data locally. + ### Troubleshooting - All requests failed: verify `--backend`, server URL/port, `--model`, and authentication. 
Check warmup errors printed by the script. @@ -352,4 +387,4 @@ python3 -m sglang.bench_serving \ ### Notes - The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to help with many concurrent connections. -- For sglang, `/get_server_info` is queried post-run to report speculative decoding accept length when available. +- For sglang, `/server_info` is queried post-run to report speculative decoding accept length when available. diff --git a/docs/developer_guide/benchmark_and_profiling.md b/docs/developer_guide/benchmark_and_profiling.md index 728bcba3adb1..3a353944023f 100644 --- a/docs/developer_guide/benchmark_and_profiling.md +++ b/docs/developer_guide/benchmark_and_profiling.md @@ -2,28 +2,42 @@ ## Benchmark -- Benchmark the latency of running a single static batch without a server. The arguments are the same as for `launch_server.py`. - Note that this is a simplified test script without a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this simplified script does not. - - Without a server (do not need to launch a server) - ```bash - python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 - ``` - - With a server (please use `sglang.launch_server` to launch a server first and run the following command.) - ```bash - python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32 - ``` +SGLang provides four benchmark tools that operate at different levels of the stack. 
The table below summarizes their key differences: +| Tool | HTTP Server | Scheduler | Use Case | +| -------------------------- | --------------------------------------------- | --------------------------------------- | -------------------------------------------------------------------------- | +| `bench_serving` | Yes (async HTTP client to a running server) | Yes (indirectly, via server) | Realistic online serving benchmarks with latency metrics (TTFT, TPOT, ITL) | +| `bench_one_batch_server` | Yes (sends HTTP requests to a running server) | Yes (indirectly, via server) | End-to-end single-batch latency including HTTP and scheduler overhead | +| `bench_offline_throughput` | No | Yes (directly uses `Engine` in-process) | Maximum throughput measurement without HTTP overhead | +| `bench_one_batch` | No | No (directly calls `ModelRunner`) | Kernel-level latency profiling of a single static batch | -- Benchmark offline processing. This script will start an offline engine and run the benchmark. +Use `bench_serving` by default unless there are specific needs. + +**`bench_serving`** is an async HTTP load-testing client that sends requests at controlled rates with configurable concurrency to a running server. It measures realistic online serving metrics including time-to-first-token (TTFT), time-per-output-token (TPOT), inter-token latency (ITL), and throughput. Use `num-prompts >= 5 * max-concurrency` to measure steady-state performance. Launch a server with `sglang.launch_server` first. + + ```bash + python3 -m sglang.bench_serving --backend sglang --max-concurrency 16 --num-prompts 80 --random-input-len 256 --random-output-len 32 --dataset-name random + ``` + +**`bench_one_batch_server`** sends a single batch as one HTTP request to a running server. Due to only having a single batch, the server is never in a steady-state and metrics will be biased. Launch a server with `sglang.launch_server` first. 
+ + ```bash + python3 -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32 + ``` + + - Pass `--enable-multi-batch` and set `--batch-size` to a multiple of the server's `--max-running-requests` to stabilize throughput measurements. Surplus requests are queued by the scheduler and promoted batch-by-batch, amortizing per-request prefill and first-step transients into steady-state decode. Under this flag, only `overall_throughput` is authoritative; `input_throughput`, `output_throughput`, `last_ttft`, and ITL include cross-batch queueing in their denominators and should be treated as informational. + - Pass `--lora-name ` to route every prompt through a pre-loaded LoRA adapter. Requires the server to be launched with `--enable-lora --lora-paths =`. + +**`bench_offline_throughput`** directly instantiates the `Engine` object in-process (no HTTP server) and submits all requests at once via `engine.generate()`. The engine's scheduler handles batching and execution. This measures maximum achievable throughput without any network overhead. ```bash python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 ``` -- Benchmark online serving. Please use `sglang.launch_server` to launch a server first and run the following command. +**`bench_one_batch`** is the lowest-level tool. It directly instantiates a `ModelRunner` and calls `extend()` / `decode()` on a fixed static batch, bypassing the scheduler entirely. The prefill and decode phases are run separately, making profiling easier but rendering the metrics unrealistic. Because there is no dynamic batching, it may run out of memory for batch sizes that a real server can handle (a real server chunks prefill into smaller batches). This is best suited for profiling individual kernel performance. 
```bash - python3 -m sglang.bench_serving --backend sglang --num-prompt 10 + python3 -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32 ``` ## Profile with PyTorch Profiler @@ -43,7 +57,10 @@ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile ``` -Please make sure that the `SGLANG_TORCH_PROFILER_DIR` should be set at both server and client side, otherwise the trace file cannot be generated correctly . A secure way will be setting `SGLANG_TORCH_PROFILER_DIR` in the `.*rc` file of shell (e.g. `~/.bashrc` for bash shells). +For `bench_serving --profile`, the output directory is selected on the client side from `--profile-output-dir` or `SGLANG_TORCH_PROFILER_DIR` (fallback: `/tmp`), then sent in the `/start_profile` request. +If you call `/start_profile` directly and do not provide `output_dir`, the server uses its own `SGLANG_TORCH_PROFILER_DIR` (fallback: `/tmp`). + +Setting `SGLANG_TORCH_PROFILER_DIR` on both server and client is still recommended to avoid confusion about where traces are written. For more details, please refer to [Bench Serving Guide](./bench_serving.md). @@ -144,7 +161,7 @@ curl -X POST http://127.0.0.1:30000/start_profile \ **Parameters:** - `output_dir` (optional): Directory where profile traces will be saved. If not specified, uses `SGLANG_TORCH_PROFILER_DIR` environment variable, or `/tmp` as the default -- `num_steps` (optional): Number of steps to profile. If not specified, profiling continues until manually stopped with `/end_profile` +- `num_steps` (optional): Number of steps to profile. If not specified, profiling continues until manually stopped with `/stop_profile` - `start_step` (optional): Step number at which to start profiling (inclusive). 
Useful for skipping warmup iterations - `activities` (optional): List of activities to profile, e.g., `["CPU", "GPU"]`. Default is `["CPU", "GPU"]` - `merge_profiles` (optional): Whether to merge distributed traces. Default is `false` @@ -168,17 +185,17 @@ curl -X POST http://127.0.0.1:30000/start_profile \ **Continuous profiling (manual stop):** ```bash -# Start profiling without num_steps - must manually stop with /end_profile +# Start profiling without num_steps - must manually stop with /stop_profile curl -X POST http://127.0.0.1:30000/start_profile ``` -#### Using `/end_profile` endpoint +#### Using `/stop_profile` endpoint -The `/end_profile` endpoint stops an ongoing profiling session and saves the trace file. +The `/stop_profile` endpoint stops an ongoing profiling session and saves the trace file. ```bash # Stop profiling and save traces -curl -X POST http://127.0.0.1:30000/end_profile +curl -X POST http://127.0.0.1:30000/stop_profile ``` This is only needed when you start profiling without specifying `num_steps`. If `num_steps` is specified, profiling will automatically stop after that many steps. @@ -201,7 +218,7 @@ curl -X POST http://127.0.0.1:30000/start_profile \ python -m sglang.bench_serving --backend sglang --num-prompts 100 # Terminal 2: Stop profiling when done -curl -X POST http://127.0.0.1:30000/end_profile +curl -X POST http://127.0.0.1:30000/stop_profile ``` ### Profiler Trace Merger for Distributed Traces @@ -395,10 +412,10 @@ This method allows you to control exactly when profiling starts/stops via HTTP A ```bash # Terminal 2: Only needed if num_steps was not specified - curl -X POST http://127.0.0.1:30000/end_profile + curl -X POST http://127.0.0.1:30000/stop_profile ``` -The `--capture-range=cudaProfilerApi` option tells Nsight Systems to only capture data between `cudaProfilerStart()` and `cudaProfilerStop()` calls (triggered by `/start_profile` and `/end_profile`), reducing overhead and file size. 
The `start_step` parameter skips the first 3 steps to avoid capturing warmup overhead. +The `--capture-range=cudaProfilerApi` option tells Nsight Systems to only capture data between `cudaProfilerStart()` and `cudaProfilerStop()` calls (triggered by `/start_profile` and `/stop_profile`), reducing overhead and file size. The `start_step` parameter skips the first 3 steps to avoid capturing warmup overhead. **Method 2: Simpler approach without `/start_profile` API** diff --git a/docs/developer_guide/contribution_guide.md b/docs/developer_guide/contribution_guide.md index dde033771461..a15c03c75322 100644 --- a/docs/developer_guide/contribution_guide.md +++ b/docs/developer_guide/contribution_guide.md @@ -28,11 +28,44 @@ pre-commit run --all-files - **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request. - **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch. +- Link checking with lychee is **enforced in CI**. By default, it does not block local commits. +- To run local link checks manually, use: `pre-commit run --hook-stage manual lychee --all-files`. ## Run and add unit tests If you add a new feature or fix a bug, please add corresponding unit tests to ensure coverage and prevent regression. -SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework. + +### Unit tests (no server required) + +Unit tests live under [`test/registered/unit/`](https://github.com/sgl-project/sglang/tree/main/test/registered/unit), organized to mirror the `python/sglang/srt/` source tree. These tests validate component logic **without** launching a server or loading real model weights. 
+SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework with [pytest](https://docs.pytest.org/) as the test runner. + +**When to add a unit test:** If you modify a file under `python/sglang/srt/`, check whether a corresponding test exists in `test/registered/unit/` and add coverage for your changes. For example: + +``` +srt/mem_cache/radix_cache.py → unit/mem_cache/test_radix_cache.py +srt/sampling/sampling_params.py → unit/sampling/test_sampling_params.py +``` + +**Run unit tests locally:** + +```bash +pytest test/registered/unit/ -v # all unit tests +pytest test/registered/unit/mem_cache/ -v # one module +``` + +**Run with coverage:** + +```bash +pytest test/registered/unit/ --cov --cov-config=.coveragerc -v +``` + +For conventions on CI registration, test structure, and examples, see [`test/registered/unit/README.md`](https://github.com/sgl-project/sglang/tree/main/test/registered/unit/README.md). + +### E2E tests (server required) + +For tests that require launching a server, refer to [`test/registered/README.md`](https://github.com/sgl-project/sglang/tree/main/test/registered/README.md) for guidance on where to place your test. + For detailed instructions on running tests and integrating them into CI, refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md). ## Write documentations @@ -57,8 +90,8 @@ Also, do not rely on the "Latency/Output throughput" from this script, as it is GSM8K is too easy for state-of-the-art models nowadays. Please try your own more challenging accuracy tests. 
You can find additional accuracy eval examples in: -- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_eval_accuracy_large.py) -- [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_gpt_oss_1gpu.py) +- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/registered/eval/test_eval_accuracy_large.py) +- [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/registered/core/test_gpt_oss_1gpu.py) ## Benchmark the speed Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md). @@ -73,6 +106,8 @@ Then your PR can be merged. We have a lot of open PRs but limited CI machines, so only top and trusted contributors have permission to trigger CI tests. Users with permission are listed in the [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json) +**PR authors** can always use `/rerun-failed-ci` on their own PRs, even if they are not listed in `CI_PERMISSIONS.json`. + For CI to run on a pull request, it must have the "run-ci" label. Authorized users can add the label or rerun failed tests by commenting on the PR with one of these commands: - `/tag-run-ci-label`: Adds the "run-ci" label. Every future commit will trigger CI. @@ -86,12 +121,11 @@ To avoid spamming a PR with too many `/rerun-failed-ci` comments, you can also t Example of rerunning a single test stage: `/rerun-stage unit-test-backend-4-gpu`. -If you don’t have permission, please ask maintainers to trigger CI for you. +If you don’t have permission and you’re not the PR author, please ask maintainers to trigger CI for you. ### CI rate limits Due to CI scheduling and limited resources, higher-priority PRs may preempt running jobs. In such cases, you may need to rerun the tests. - We apply CI rate limits to prevent abuse and ensure fair usage of our CI resources. 
Each CI workflow has a default limit defined in its workflow configuration file. For example, in [pr-gate.yml](https://github.com/sgl-project/sglang/blob/main/.github/workflows/pr-gate.yml), the default cooldown period is 120 minutes, and each workflow can override it via the `cool-down-minutes` input parameter: @@ -105,40 +139,46 @@ cool-down-minutes: Users listed in [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json) may have a per-user cooldown interval. In practice, we use the minimum of the workflow’s default window and the user-specific interval. - ## Code style guidance - Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function. - Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code. - Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize all minor overheads as much as possible, especially in the model forward code. - - A common pattern is some runtime checks in the model forward pass (e.g., [this](https://github.com/sgl-project/sglang/blob/f1b0eda55c2c4838e8ab90a0fac7fb1e3d7064ab/python/sglang/srt/models/deepseek_v2.py#L486-L491)). These are very likely the same for every layer. Please cache the result as a single boolean value whenever possible. + - A common pattern is some runtime checks in the model forward pass (e.g., [this](https://github.com/sgl-project/sglang/blob/f1b0eda55c2c4838e8ab90a0fac7fb1e3d7064ab/python/sglang/srt/models/deepseek_v2.py#L486-L491)). These are very likely the same for every layer. Please cache the result as a single boolean value in `__init__` whenever possible. - Make functions as pure as possible. Avoid in-place modification of arguments. - Keep files concise. 
If a file exceeds 2,000 lines of code, split it into multiple smaller files. (e.g., `scheduler.py`, `scheduler_output_processor_mixin.py`) +- In a file, put core data structures at the top of the file. Put utility functions at the bottom of the file. - Keep tests run fast. - If a single test file run longer than 500 seconds, split it into multiple smaller files (e.g., `test_eagle_infer_a.py`, `test_eagle_infer_b.py`). - If a single job in a github workflow runs longer than 30 mins, split it into smaller jobs/steps. - Reuse server launches in your unit tests to make tests run faster. +- Never use `pickle.loads()`, `pickle.load()`, or `recv_pyobj()` to deserialize untrusted or network-received data. Python's [pickle module is not secure](https://docs.python.org/3/library/pickle.html) — it can execute arbitrary code during deserialization. Use safe serialization formats such as [msgpack](https://github.com/jcrist/msgspec) or JSON instead. - When supporting new hardware or features, follow these guidelines: - Do not drastically change existing code. - Always prefer new files to introduce specific components for your new hardware (e.g., `allocator_ascend.py`). - If you write multiple if/else blocks for new features, ensure the common path (e.g., NVIDIA hardware or the existing code path) is the first branch. ## How to update sgl-kernel -Since sglang and sgl-kernel are separate Python packages, our current GitHub CI infrastructure does not support updating a kernel and using it immediately within the same pull request (PR). -To add a new kernel or modify an existing one in the sgl-kernel package, you must use multiple PRs. +Since sglang and the `sglang-kernel` (prior `sgl-kernel`) distribution are separate Python packages, our current GitHub CI infrastructure does not support updating a kernel and using it immediately within the same pull request (PR). +To add a new kernel or modify an existing one in the `sgl-kernel/` source tree, you must use multiple PRs. 
Follow these steps: 1. Submit a PR to update the sgl-kernel source code without using it in sglang python package (e.g., [#8884](https://github.com/sgl-project/sglang/pull/8884/files)). -2. Bump the version of sgl-kernel (e.g., [#9220](https://github.com/sgl-project/sglang/pull/9220/files)). - - Once merged, this will trigger an automatic release of the sgl-kernel wheel to PyPI. +2. Bump the version of the kernel package (e.g., [#9220](https://github.com/sgl-project/sglang/pull/9220/files)). + - Once merged, this will trigger an automatic release of the `sglang-kernel` wheel to PyPI. - If not urgent, you can wait for other people to release the wheel. A new version will typically be released within one week. 3. Apply the changes: - - Update the sgl-kernel version in `sglang/python/pyproject.toml` to use the modified kernels. + - Update the `sglang-kernel` version in `sglang/python/pyproject.toml` to use the modified kernels. - Update the related caller code in the sglang to use the new kernel. ## Tips for newcomers -If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow. +If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. 
+ +Also check out the following materials as startup guide: +- [Mini-SGLang](https://github.com/sgl-project/mini-sglang) for a quick overview on the structure of sglang. +- [Code Walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow. +- [GTC-2026 Training Lab](https://drive.google.com/file/d/1mwOZEtipNLJzrflCTodj34KhuOZEoEw5/view?usp=drive_link) for hands-on practices of how to do optimization, benchmarking, or profiling on a launched SGLang instance. If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://slack.sglang.io). diff --git a/docs/developer_guide/development_guide_using_docker.md b/docs/developer_guide/development_guide_using_docker.md index e38947902458..a833011c62b1 100644 --- a/docs/developer_guide/development_guide_using_docker.md +++ b/docs/developer_guide/development_guide_using_docker.md @@ -55,7 +55,7 @@ Some useful volumes to mount are: 1. **Huggingface model cache**: mounting model cache can avoid re-download every time docker restarts. Default location on Linux is `~/.cache/huggingface/`. 2. **SGLang repository**: code changes in the SGLang local repository will be automatically synced to the .devcontainer. -Example 1: Monting local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the devcontainer. +Example 1: Mounting local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the devcontainer. 
```bash docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh docker exec -it sglang_zhyncs /bin/zsh diff --git a/docs/developer_guide/development_jit_kernel_guide.md b/docs/developer_guide/development_jit_kernel_guide.md new file mode 100644 index 000000000000..b09476e485d2 --- /dev/null +++ b/docs/developer_guide/development_jit_kernel_guide.md @@ -0,0 +1,315 @@ +# Development Guide for JIT Kernels + +## Environment Setup + +We strongly recommend using `clangd` as the language server for JIT kernel development. +For Ubuntu/Debian, you can download clangd from [apt.llvm.org](https://apt.llvm.org/). +If you are using VS Code, we recommend installing the `clangd` extension for better IDE integration. + +All JIT-related files are located in `python/sglang/jit_kernel`. +Unlike `sgl-kernel`, which compiles CUDA/C++ binaries ahead of time (AOT), just-in-time (JIT) kernels are compiled at runtime. +Consequently, a static `compile_commands.json` cannot be generated. +To enable code completion with `clangd`, run `python -m sglang.jit_kernel` to generate a `.clangd` configuration file in your current directory. +After generating the file, restart the clangd language server. It should now recognize all JIT kernel files. + +## Code Structure + +### C++ Implementation + +C++ source code is located in `python/sglang/jit_kernel/csrc`. +Reusable functions should be placed in `python/sglang/jit_kernel/include`. + +We use [tvm-ffi](https://github.com/apache/tvm-ffi) for efficient foreign language bindings. +Refer to the [documentation](https://tvm.apache.org/ffi/) for advanced usage, such as exporting C++ objects. +Typically, `tvm::ffi::TensorView` is sufficient for passing PyTorch Tensors from Python. + +### Python Interface + +Python interfaces are defined in `python/sglang/jit_kernel`. 
+The `load_jit` utility function in `python/sglang/jit_kernel/utils.py` loads and returns the compiled module. +To export a C++ function (e.g., `cpp_func`), pass `cuda_wrappers=[("func", "cpp_func")]` to `load_jit`. +The function can then be called in Python as `module.func`. + +For caching compiled modules, prefer `sglang.jit_kernel.utils.cache_once` over `functools.lru_cache`. +`functools.lru_cache` is not compatible with `torch.compile`. + +### C++ Utilities + +The following C++ utilities are available: + +#### Integer Range + +Similar to PyTorch, we provide an `irange` function to represent an integer range. + +```C++ +#include + +void test() { + for (auto i : host::irange(100)) { // [0, 100) + // do something + } + for (auto i : host::irange(0, 100)) { // [0, 100) + // do something + } +} + +``` + +#### Runtime Checking + +`RuntimeCheck` validates conditions at runtime. It accepts optional arguments for error reporting. +If the check fails, these arguments are output to aid debugging. +`RuntimeDeviceCheck` verifies the status of the last kernel launch. + +```C++ +#include +#include + +void test() { + host::RuntimeCheck(1 + 1 == 2, 1 + 1, " != ", 2); + host::RuntimeDeviceCheck(); + // check the provided `cudaError_t` + host::RuntimeDeviceCheck(cudaGetLastError()); +} + +``` + +#### Tensor Checking + +`TensorMatcher` provides a readable way to validate and extract tensor shape information. + +```cpp +#include + +void test(const tvm::ffi::TensorView k_cache, const tvm::ffi::TensorView v_cache) { + using namespace host; + + auto D = SymbolicSize{"D"}; // cache dimension + auto N = SymbolicSize{"N"}; // kvcache stride + auto dtype = SymbolicDType{}; + auto device = SymbolicDevice{}; + + TensorMatcher({-1, D}) // + .with_strides({N, 1}) + .with_dtype(dtype) + .with_device(device) + .verify(k_cache) + .verify(v_cache); +} +``` + +Configure the `TensorMatcher` with expected stride, dtype, and device properties before verification. 
+- If `with_strides` is omitted, the tensor is expected to be contiguous. +- Template arguments in `with_dtype` restrict the allowed data types. +- Template arguments in `with_device` restrict the allowed devices. +- Values passed to `with_xxx` methods enforce equality checks. +- Passing `-1` for size or stride allows matching any value. + +A `Symbolic` variable must resolve to the same value across all verifications. +Use `.unwrap()` to retrieve the matched value after verification. + +> Note: `TensorMatcher` is a temporary expression and should not be stored in a variable. + +> Tip: Add `//` at the end of the `TensorMatcher` chain to enforce proper indentation. + +#### Kernel Launching + +`LaunchKernel::resolve_device` retrieves the current `cudaStream` from PyTorch. +Kernels can also be launched directly using `LaunchKernel`. + +```cpp +#include + +#include + +__global__ void kernel() {} + +void test() { + const auto num_blocks = 1; + const auto num_threads = 32; + const auto dynamic_smem = 0; + + DLDevice dev; // suppose this is initialized properly + host::LaunchKernel(num_blocks, num_threads, dev)(kernel); + + cudaStream_t stream = host::LaunchKernel::resolve_device(dev); + host::LaunchKernel(num_blocks, num_threads, stream, dynamic_smem)(kernel); +} + +``` + +## Add new kernels + +This section walks through a complete, end-to-end example of adding a new JIT kernel to the system. +We use a simple add_constant kernel as a running example, which adds a constant integer value to every element of an input tensor. + +Conceptually, the Python interface looks like this: + +```python +def add_constant(src: torch.Tensor, c: int): + return src + c +``` + +### STEP 1: Write the C++ kernel + +Write your CUDA kernel in [jit_kernel/csrc/add_constant.cuh](../../python/sglang/jit_kernel/csrc/add_constant.cuh). For demonstration purposes, we pass the constant value as a template parameter. 
+ +```cpp +#include // For TensorMatcher, SymbolicSize, SymbolicDevice +#include // For LaunchKernel +#include // For div_ceil, RuntimeCheck + +#include +#include + +#include +#include + +namespace { + +template +__global__ void add_constant_kernel(int32_t* dst, const int32_t* src, size_t length) { + size_t idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < length) { + dst[idx] = src[idx] + kConstant; + } +} + +constexpr size_t kBlockSize = 256; + +// You can also use struct with static method as an alternative +template +void add_constant(tvm::ffi::TensorView dst, tvm::ffi::TensorView src) { + using namespace host; + + // 1. Validate input tensors + SymbolicSize N = {"num_elements"}; + SymbolicDevice device_; + TensorMatcher({N}) // 1D tensor, must be contiguous + .with_dtype() // must be int32 + .with_device(device_) // must be on CUDA device + .verify(dst) // check tensor dst + .verify(src); // check tensor src + + // 2. Extract required parameters, prepare for kernel launch + const size_t num_elements = N.unwrap(); + const size_t grid_size = div_ceil(num_elements, kBlockSize); + const DLDevice device = device_.unwrap(); + // some extra runtime checks using host::RuntimeCheck + RuntimeCheck(num_elements > 0, "We only support non-empty tensors, got num_elements = ", num_elements); + + // 3. Launch the kernel. Error code will be automatically checked. + LaunchKernel(grid_size, kBlockSize, device /*, dynamic_smem*/)( + // kernel function + add_constant_kernel, + // kernel arguments + static_cast(dst.data_ptr()), + static_cast(src.data_ptr()), + num_elements); +} + +} // namespace + +``` + +### STEP 2: Create Python Interfaces + +Next, expose the kernel through a Python wrapper. +Create a new file at [jit_kernel/add_constant.py](../../python/sglang/jit_kernel/add_constant.py) and expose the needed interfaces. 
+ +```python +from __future__ import annotations +from typing import TYPE_CHECKING + +import torch + +from sglang.jit_kernel.utils import cache_once, load_jit, make_cpp_args + +if TYPE_CHECKING: + from tvm_ffi.module import Module + + +@cache_once +def _jit_add_constant_module(constant: int) -> Module: + args = make_cpp_args(constant) # pass all the template argument + return load_jit( + "add_constant", + *args, + cuda_files=["add_constant.cuh"], + cuda_wrappers=[("add_constant", f"add_constant<{args}>")], + ) + + +def add_constant(src: torch.Tensor, constant: int) -> torch.Tensor: + if not src.is_cuda: + raise RuntimeError("src must be a CUDA tensor") + if src.dtype != torch.int32: + raise RuntimeError(f"Unsupported dtype {src.dtype}. Supported: int32") + dst = torch.empty_like(src) + module = _jit_add_constant_module(constant) + module.add_constant(dst, src) + return dst + +``` + +Keep the Python wrapper thin, but still validate the basic invariants such as device and dtype before dispatch. In the current JIT/FFI path, invalid tensors are not always rejected safely before launch. + +### STEP 3: Use your kernel + +Finally, import and use the kernel like a regular Python function: + +```python +from sglang.jit_kernel.add_constant import add_constant +``` + +For a complete, runnable example, refer to [test_add_constant.py](../../python/sglang/jit_kernel/tests/test_add_constant.py). + +## C++ Include Library Reference + +The JIT kernel framework provides a set of reusable C++ headers in +`python/sglang/jit_kernel/include/sgl_kernel/`. Each header is designed +to be lightweight and self-contained. Below is a summary of each header +and its key APIs. 
+ +### Core Utilities + +| Header | Namespace | Purpose | +|--------|-----------|---------| +| `utils.h` | `host` | Host-side essentials: `RuntimeCheck`, `Panic`, `div_ceil`, `irange` | +| `utils.cuh` | `device` / `host` | Type aliases (`fp16_t`, `bf16_t`, ...), `SGL_DEVICE` macro, PDL helpers, `LaunchKernel`, `RuntimeDeviceCheck` | +| `source_location.h` | (global) | Portable `std::source_location` wrapper for error reporting | +| `runtime.cuh` | `host::runtime` | CUDA runtime queries: `get_blocks_per_sm`, `get_sm_count`, `get_cc_major`, `get_runtime_version`, `get_available_dynamic_smem_per_block` | + +### Tensor Validation + +| Header | Namespace | Purpose | +|--------|-----------|---------| +| `tensor.h` | `host` | `TensorMatcher`, `SymbolicSize`, `SymbolicDType`, `SymbolicDevice` | + +### Math & Type System + +| Header | Namespace | Purpose | +|--------|-----------|---------| +| `math.cuh` | `device::math` | `max`, `min`, `abs`, `sqrt`, `rsqrt`, `exp`, `sin`, `cos`, constants | +| `type.cuh` | (global) / `device` | `dtype_trait`, `packed_t`, `device::cast(from)` | + +### Memory Access + +| Header | Namespace | Purpose | +|--------|-----------|---------| +| `vec.cuh` | `device` | `AlignedVector` - vectorized load/store (up to 128-bit; 256-bit requires Blackwell GPUs) | +| `tile.cuh` | `device::tile` | `Memory` - cooperative tiled memory I/O (thread/warp/CTA) | + +### Parallel Primitives + +| Header | Namespace | Purpose | +|--------|-----------|---------| +| `warp.cuh` | `device::warp` | `reduce_sum`, `reduce_max` via `__shfl_xor_sync` | +| `cta.cuh` | `device::cta` | `reduce_max` across warps via shared memory | +| `atomic.cuh` | `device::atomic` | `max` - atomic float max (CUDA + ROCm fallback) | + +### Reusable Kernel Templates + +| Header | Namespace | Purpose | +|--------|-----------|---------| +| `impl/norm.cuh` | `host::norm` / `device::norm` | RMSNorm building blocks (warp & CTA paths, `StorageType`) | diff --git 
a/docs/developer_guide/evaluating_new_models.md b/docs/developer_guide/evaluating_new_models.md index 19965ed781f9..f3126c9a0d88 100644 --- a/docs/developer_guide/evaluating_new_models.md +++ b/docs/developer_guide/evaluating_new_models.md @@ -26,7 +26,7 @@ python -m sglang.test.run_eval \ ```bash python -m sglang.test.few_shot_gsm8k \ - --host http://127.0.0.1 \ + --host 127.0.0.1 \ --port 30000 \ --num-questions 200 \ --num-shots 5 @@ -36,7 +36,7 @@ python -m sglang.test.few_shot_gsm8k \ ```bash python benchmark/hellaswag/bench_sglang.py \ - --host http://127.0.0.1 \ + --host 127.0.0.1 \ --port 30000 \ --num-questions 200 \ --num-shots 20 @@ -54,7 +54,7 @@ python -m sglang.test.run_eval \ ``` ```{tip} -For reasoning models, add `--thinking-mode ` (e.g., `qwen3`, `deepseek-r1`, `deepseek-v3`). You may skip it if the model has forced thinking enabled. +For reasoning models, add `--thinking-mode ` (e.g., `qwen3`, `deepseek-v3`). You may skip it if the model has forced thinking enabled. ``` **HumanEval** diff --git a/docs/developer_guide/msprobe_debugging_guide.md b/docs/developer_guide/msprobe_debugging_guide.md new file mode 100644 index 000000000000..ee0d8496e742 --- /dev/null +++ b/docs/developer_guide/msprobe_debugging_guide.md @@ -0,0 +1,598 @@ +# MSProbe Debugging Guide + +## Introduction to MSProbe + +MSProbe is a debugging tool for AI models that diagnoses accuracy anomalies and +numerical errors during model training and inference. It captures and monitors intermediate data (feature maps, weights, +activations, layer outputs) and contextual metadata (prompts, tensor dtypes, hardware configuration), and supports +visual analysis to systematically trace the root cause of accuracy degradation or numerical errors (e.g., NaN/Inf, +output drift, mismatched predictions). 
+ +## Basic Details + +### Background Concepts: MSProbe Dumping Levels + +MSProbe supports three accuracy levels for data dumping, each for different debugging needs: + +- **L0**: Dumps tensors/statistics at the **module level** and generates `construct.json` (for network structure + reconstruction in visualization). Requires passing a model/submodule handle. +- **L1**: Dumps tensors/statistics at the **torch API level**, suitable for fine-grained API-level numerical checking. +- **mix**: Combines L0 + L1, ideal for scenarios that require both **graph reconstruction** and **numerical comparison**. + +### Prerequisites: Install MSProbe + +Install MSProbe with pip: + +```shell +pip install mindstudio-probe --pre +``` + +### Key Configuration Parameters + +MSProbe uses a JSON configuration file for customized data dumping. All core parameters are listed in the table below, +with the default JSON configuration provided for reference. + +#### Configuration Parameter Table + +| Field | Description | Required | +|:------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| +| `task` | Type of dump task. Common PyTorch values include `"statistics"` and `"tensor"`. A statistics task collects tensor statistics (mean, variance, max, min, etc.) 
while a tensor task captures arbitrary tensors. | Yes | +| `dump_path` | Directory where dump results are stored. When omitted, `MSProbe` uses its default path. | No | +| `rank` | Ranks to sample. An empty list collects every rank. For single-card tasks you must set this field to `[]`. | No | +| `step` | Token iteration(s) to sample. An empty list means every iteration. | No | +| `level` | Dump level string (`"L0"`, `"L1"`, or `"mix"`). `L0` targets `nn.Module`, `L1` targets `torch.api`, and `mix` collects both. | Yes | +| `async_dump` | Whether to enable asynchronous dump (supported for PyTorch `statistics`/`tensor` tasks). Defaults to `false`. | No | +| `scope` | Customize the scope of dump. Provide two module or API names that follow the tool's naming convention to lock a range, only data between the two names will be dumped. An empty list dumps every module or torch API.

Examples:
`"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]`
`"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]`

The `level` setting determines what can be provided—modules when `level=L0`, APIs when `level=L1`, and either modules or APIs when `level=mix`. | No | +| `list` | Customize dump list, only dumps elements from the list. An empty list dumps every module or torch API. Options include:

- Supply the full names of specific APIs in PyTorch pynative scenarios to only dump those APIs. Example: `"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]`.
- When `level=mix`, you can provide module names so that the dump expands to everything produced while the module is running. Example: `"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]`.
- Provide a substring such as `"list": ["relu"]` to dump every API whose name contains the substring. When `level=mix`, modules whose names contain the substring are also expanded. | No | +#### Default configuration + +```json +{ + "task": "statistics", + "dump_path": "./dump_path", + "rank": [], + "step": [], + "level": "L1", + "async_dump": false, + "statistics": { + "scope": [], + "list": [], + "data_mode": [ + "all" + ], + "summary_mode": "statistics" + }, + "tensor": { + "scope": [], + "list": [], + "data_mode": [ + "all" + ], + "file_format": "npy" + }, + "acc_check": { + "white_list": [], + "black_list": [], + "error_data_path": "./" + } +} +``` + +#### Outputs + +Dump files are written into the `dump_path` you defined. They usually contain: + +- `dump.json`, which records metadata such as dtype, shape, min, max, mean, L2 norm, and `requires_grad`. +- `construct.json`, a hierarchical structure description (required for visualization); it is non-empty only when + `level` is `L0` or `mix`. +- `stack.json`, which records the call stack information of each API/module. +- `dump_tensor_data`, which is generated when `task` is `tensor` and stores the collected tensor data. + +See [dump directory description](#dump-directory-description) for details. + +> **Note**: When MSProbe is enabled, CUDA graph is disabled (`disable_cuda_graph=True`) because MSProbe only supports dumping +> in eager mode, and warmup is disabled (`skip_server_warmup=True`) because there is no need to dump data for that stage. + +## End-to-End Examples + +MSProbe’s full debugging workflow follows **Enable → Collect Data → Visualize → Analyze Root Cause**. Below is a common +E2E example for SGLang-based model inference debugging. + +### Example: Advanced Debugging with Custom Configuration + +Suitable for targeted debugging (e.g., only collect statistics data for specific ranks/steps, enable mix level for graph +reconstruction + numerical comparison) and root cause analysis via **problem vs. benchmark comparison**.
+ +#### Step 1: Enable +##### Prepare Custom Configuration JSON + +Create `msprobe-config.json` (dump statistics data for rank0/1, step0/1, mix level): + +```json +{ + "task": "statistics", + "dump_path": "./problem_dump", + "rank": [ + 0, + 1 + ], + "step": [ + 0, + 1 + ], + "level": "mix", + "async_dump": false, + "statistics": { + "scope": [], + "list": [], + "data_mode": [ + "all" + ], + "summary_mode": "statistics" + } +} +``` + +##### Enable MSProbe with Custom Configuration in SGLang + +Launch the SGLang server and specify the configuration file path with `--msprobe-dump-config`: + +```bash +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen2.5-0.5B-Instruct \ + --host 127.0.0.1 \ + --port 1027 \ + --msprobe-dump-config /home/msprobe-config.json +``` +#### Step 2: Collect Data +##### Collect Dump Data for Problem & Benchmark Sides + +Send normal inference requests to trigger model running (MSProbe automatically collects data during request processing): + +```bash +curl -H "Content-type: application/json" \ + -X POST \ + -d '{ + "model": "Qwen/Qwen2.5-0.5B-Instruct", + "messages": [ + { + "role": "user", + "content": "Hello, my name is" + } + ], + "max_tokens": 10 + }' \ + http://127.0.0.1:1027/v1/chat/completions +``` + +- **Problem side**: Run the above SGLang server (with the accuracy/numerical issue) and send inference request; dump + data is saved to `./problem_dump`. +- **Benchmark side**: Launch a normal SGLang server (without the issue, e.g., stable framework version/operator) with + the **same custom configuration** and send the **same inference request**; rename the dump directory + to `./bench_dump`. + +> **Key Requirement**: Problem and benchmark dumps must use the same inputs and sampling points (rank/step) +> for valid comparison. 
+ +##### Check Generated Dump Files + +Dump files are saved to the `./problem_dump` and `./bench_dump` directories you defined and include the core files for subsequent analysis: + +- `dump.json`: Records tensor metadata of APIs and modules (dtype, shape, min/max/mean, L2 norm, `requires_grad`, etc.). +- `stack.json`: Logs call stack information of APIs and modules. +- `construct.json`: Hierarchical structure description, required for visualization; it must not be empty. + +#### Step 3: Visualize +##### Visualize Problem vs. Benchmark Comparison (Multi-Rank) + +Generate a multi-rank comparison visualization file (mix level generates `construct.json` for graph reconstruction): + +```shell +msprobe graph_visualize -tp ./problem_dump/step0 -gp ./bench_dump/step0 -o ./graph_output +``` + +- `-tp`: Path to problem-side dump data +- `-gp`: Path to benchmark-side dump data +- `-o`: Output directory for visualization files + +To enable the overflow check (for NaN/Inf detection), add the `-oc` flag: + +```shell +msprobe graph_visualize -tp ./problem_dump/step0 -gp ./bench_dump/step0 -o ./graph_output -oc +``` + +After the comparison or build task finishes, a `compare_{timestamp}.vis.db` file is created under `graph_output`. + +##### Launch TensorBoard + +Start TensorBoard: +```bash +tensorboard --logdir ./graph_output --bind_all --port 6006 +``` +#### Step 4: Analyze Root Cause +##### Locate Root Cause + +Root Cause Analysis in TensorBoard: +- Divergent nodes (with accuracy/numerical differences) are highlighted in **red** (darker red = larger difference). +- Click on divergent nodes to view detailed tensor data (inputs/outputs, parameters) and API/module call stacks. +- Use the **search/filter** function to quickly locate key layers/APIs (e.g., "relu", "conv"). +- Switch between ranks/steps via the UI to check cross-rank/cross-step divergence. +- Check the **overflow check** tab for NaN/Inf values in specific nodes (the direct cause of numerical instability).
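As a complement to the TensorBoard view, the statistics in the two `dump.json` files can also be compared from a script. The following is a minimal sketch and is not part of MSProbe: it assumes only that each `dump.json` is valid JSON, makes no assumption about the exact schema, and simply reports numeric leaves whose values differ beyond a tolerance. The helper names (`iter_numeric_leaves`, `diff_dumps`, `load_dump`) and the default tolerance are hypothetical.

```python
import json
import math
from pathlib import Path


def iter_numeric_leaves(node, prefix=""):
    """Yield (path, value) for every int/float leaf in a nested JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_numeric_leaves(value, f"{prefix}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_numeric_leaves(value, f"{prefix}/{index}")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix, float(node)


def diff_dumps(problem, bench, rel_tol=1e-3):
    """Return (path, problem_value, bench_value) for numeric leaves that diverge.

    Leaves present on only one side are skipped in this sketch.
    NaN never compares as close, so NaN leaves are always reported.
    """
    bench_leaves = dict(iter_numeric_leaves(bench))
    diverged = []
    for path, value in iter_numeric_leaves(problem):
        other = bench_leaves.get(path)
        if other is None:
            continue
        if not math.isclose(value, other, rel_tol=rel_tol, abs_tol=1e-8):
            diverged.append((path, value, other))
    return diverged


def load_dump(path):
    """Load one dump.json file (hypothetical helper)."""
    return json.loads(Path(path).read_text())
```

For example, `diff_dumps(load_dump("./problem_dump/step0/rank0/dump.json"), load_dump("./bench_dump/step0/rank0/dump.json"))` would list the diverging statistics for rank 0 at step 0, assuming both dumps were produced with the configuration above. Because shapes are also numeric leaves, a shape mismatch shows up in the same report.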
+ +##### Verify the Root Cause + +After locating the divergent node (e.g., a specific Conv layer or torch API with abnormal tensor values), verify by: + +- Narrowing the dump scope to this node (via `scope`/`list` in the configuration file) for fine-grained data collection. +- Modifying the problematic layer/API (e.g., replacing the operator, adjusting the dtype) and re-running the debugging + workflow to confirm the issue is resolved. + +## Troubleshooting + +### No Dump Files Generated + +1. Confirm that MSProbe is installed by running `pip show mindstudio_probe`. If it is installed, the MSProbe + version information is printed. If it is not installed, install it with + `pip install mindstudio-probe --pre`; +2. Confirm the `--msprobe-dump-config` parameter points to the **correct JSON file path**. + +### Dump Files Are Too Large (Excessive Data) + +1. Start with `task: "statistics"` instead of `"tensor"` to collect only tensor statistics (avoids raw tensor dump); +2. Narrow the dump range with the `scope` field (specify start/end module/API); +3. Filter dump targets with the `list` field (only dump specific modules/APIs or substrings); +4. Sample specific `rank` and `step` (avoid dumping all ranks/iterations). + +### TensorBoard Visualization Fails + +1. Confirm `construct.json` is not empty (requires `level: L0` or `mix` – L1 does not generate graph files); +2. Check that the `-tp` (problem dump) and `-gp` (benchmark dump) paths point to **valid rank/step subdirectories** ( + e.g., `step0/rank0`); +3. Ensure the MSProbe version is up-to-date (reinstall with `pip install mindstudio-probe --pre --upgrade`); +4. Verify TensorBoard is installed and the `--logdir` parameter points to the directory containing `.vis.db` files (not + the file itself). + +### Numerical Comparison Shows No Divergence But Model Accuracy Is Low + +1.
Expand the dump `step` range (check more token iterations for late-stage divergence); +2. Switch to `task: "tensor"` (statistics may mask subtle numerical differences in raw tensor data); +3. Ensure the problem and benchmark dumps use **the same input data/hardware configuration** (different inputs lead to + invalid comparisons); +4. Use the `manual mapping` feature in TensorBoard (automatic mapping may miss some nodes for custom models). + +--- + +## Appendix + +### Dump directory description + +```text +├── problem_dump or bench_dump +│ ├── step0 +│ │ ├── rank0 +│ │ │ ├── dump_tensor_data +│ │ │ │ ├── Tensor.permute.1.forward.pt +│ │ │ │ ├── Functional.linear.5.backward.output.pt # Format: {api_type}.{api_name}.{call_count}.{forward/backward}.{input/output}.{arg_index}. +│ │ │ │ │ # arg_index is the nth input or output of the API. If an input is a list, keep numbering with decimals (e.g., 1.1 is the first element of the first argument). +│ │ │ │ ├── Module.conv1.Conv2d.forward.0.input.0.pt # Format: {Module}.{module_name}.{class_name}.{forward/backward}.{call_count}.{input/output}.{arg_index}. +│ │ │ │ ├── Module.conv1.Conv2d.forward.0.parameters.bias.pt # Module parameter data: {Module}.{module_name}.{class_name}.forward.{call_count}.parameters.{parameter_name}. +│ │ │ │ └── Module.conv1.Conv2d.parameters_grad.weight.pt # Module parameter gradients: {Module}.{module_name}.{class_name}.parameters_grad.{parameter_name}. Gradients do not include call_count because the same gradient updates all invocations. +│ │ │ │ # When the `model` argument passed to dump is a List[torch.nn.Module] or Tuple[torch.nn.Module], module-level data names also include the index inside the list ({Module}.{index}.*), e.g., Module.0.conv1.Conv2d.forward.0.input.0.pt. +│ │ │ ├── dump.json +│ │ │ ├── stack.json +│ │ │ ├── dump_error_info.log +│ │ │ └── construct.json +│ │ ├── rank1 +│ │ │ ├── dump_tensor_data +│ │ │ │ └── ... 
+│ │ │ ├── dump.json +│ │ │ ├── stack.json +│ │ │ ├── dump_error_info.log +│ │ │ └── construct.json +│ │ ├── ... +│ │ │ +│ │ └── rank7 +│ ├── step1 +│ │ ├── ... +│ ├── step2 +``` + +- `rank`: Device ID. Each card writes its data to the corresponding `rank{ID}` directory. In non-distributed scenarios + the directory is simply named `rank`. +- `dump_tensor_data`: Save the collected tensor data. +- `dump.json`: Statistics for the forward data of each API or module, including names, dtype, shape, max, min, mean, L2 + norm (square root of the L2 variance), and CRC-32 when `summary_mode="md5"`. + See [dump.json file description](#dumpjson-file-description) for details. +- `dump_error_info.log`: Present only when the dump tool encountered an error and records the failure log. +- `stack.json`: Call stacks for APIs/modules. +- `construct.json`: Hierarchical structure description. Empty when `level=L1`. + +### dump.json file description + +#### L0 level + +An L0 `dump.json` contains forward/backward I/O for modules together with parameters and parameter gradients. Using +PyTorch's `Conv2d` as an example, the network code looks like: + +`output = self.conv2(input) # self.conv2 = torch.nn.Conv2d(64, 128, 5, padding=2, bias=True)` + +`dump.json` contains the following entries: + +- `Module.conv2.Conv2d.forward.0`: Forward data of the module. `input_args` represents positional inputs, `input_kwargs` + represents keyword inputs, `output` stores forward outputs, and `parameters` stores weights/biases. +- `Module.conv2.Conv2d.parameters_grad`: Parameter gradients (weight and bias). +- `Module.conv2.Conv2d.backward.0`: Backward data of the module. `input` represents gradients that flow into the + module (gradients of the forward outputs) and `output` represents gradients that flow out (gradients of the module + inputs). 
+ +**Note**: When the `model` parameter passed to the dump API is `List[torch.nn.Module]` or `Tuple[torch.nn.Module]`, +module-level names include the index inside the list (`{Module}.{index}.*`). Example: `Module.0.conv1.Conv2d.forward.0`. + +
+ +L0 dump.json + +```json +{ + "task": "tensor", + "level": "L0", + "framework": "pytorch", + "dump_data_dir": "/dump/path", + "data": { + "Module.conv2.Conv2d.forward.0": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 16, + 14, + 14 + ], + "Max": 1.638758659362793, + "Min": 0.0, + "Mean": 0.2544615864753723, + "Norm": 70.50277709960938, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.input.0.pt" + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 32, + 10, + 10 + ], + "Max": 1.6815717220306396, + "Min": -1.5120246410369873, + "Mean": -0.025344856083393097, + "Norm": 149.65576171875, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.output.0.pt" + } + ], + "parameters": { + "weight": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 5, + 5 + ], + "Max": 0.05992485210299492, + "Min": -0.05999220535159111, + "Mean": -0.0006165213999338448, + "Norm": 3.421217441558838, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.parameters.weight.pt" + }, + "bias": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32 + ], + "Max": 0.05744686722755432, + "Min": -0.04894155263900757, + "Mean": 0.006410328671336174, + "Norm": 0.17263513803482056, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt" + } + } + }, + "Module.conv2.Conv2d.parameters_grad": { + "weight": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 5, + 5 + ], + "Max": 0.018550323322415352, + "Min": -0.008627401664853096, + "Mean": 0.0006675920449197292, + "Norm": 0.26084786653518677, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.parameters_grad.weight.pt" + } + ], + "bias": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32 + ], + "Max": 0.014914230443537235, + 
"Min": -0.006656786892563105, + "Mean": 0.002657240955159068, + "Norm": 0.029451673850417137, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.parameters_grad.bias.pt" + } + ] + }, + "Module.conv2.Conv2d.backward.0": { + "input": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 32, + 10, + 10 + ], + "Max": 0.0015069986693561077, + "Min": -0.001139344065450132, + "Mean": 3.3215508210560074e-06, + "Norm": 0.020567523315548897, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.backward.0.input.0.pt" + } + ], + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 16, + 14, + 14 + ], + "Max": 0.0007466732058674097, + "Min": -0.00044813455315306783, + "Mean": 6.814070275140693e-06, + "Norm": 0.01474067009985447, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.backward.0.output.0.pt" + } + ] + } + } +} +``` + +
+ +#### L1 level + +An L1 `dump.json` records forward/backward I/O for APIs. Using PyTorch's `relu` function as an +example (`output = torch.nn.functional.relu(input)`), the file contains: + +- `Functional.relu.0.forward`: Forward data of the API. `input_args` are positional inputs, `input_kwargs` are keyword + inputs, and `output` stores the forward outputs. +- `Functional.relu.0.backward`: Backward data of the API. `input` represents the gradients of the forward outputs, + and `output` represents the gradients that flow back to the forward inputs. + +
+ +L1 dump.json + +```json +{ + "task": "tensor", + "level": "L1", + "framework": "pytorch", + "dump_data_dir": "/dump/path", + "data": { + "Functional.relu.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 1.3864083290100098, + "Min": -1.3364859819412231, + "Mean": 0.03711778670549393, + "Norm": 236.20692443847656, + "requires_grad": true, + "data_name": "Functional.relu.0.forward.input.0.pt" + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 1.3864083290100098, + "Min": 0.0, + "Mean": 0.16849493980407715, + "Norm": 175.23345947265625, + "requires_grad": true, + "data_name": "Functional.relu.0.forward.output.0.pt" + } + ] + }, + "Functional.relu.0.backward": { + "input": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 0.0001815402356442064, + "Min": -0.00013352684618439525, + "Mean": 0.00011915402356442064, + "Norm": 0.007598237134516239, + "requires_grad": false, + "data_name": "Functional.relu.0.backward.input.0.pt" + } + ], + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 0.0001815402356442064, + "Min": -0.00012117840378778055, + "Mean": 2.0098118724831693e-08, + "Norm": 0.006532244384288788, + "requires_grad": false, + "data_name": "Functional.relu.0.backward.output.0.pt" + } + ] + } + } +} +``` + +
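Because every tensor record in `dump.json` carries the same `Max`, `Min`, `Mean`, and `Norm` fields, the statistics of a problem dump and a benchmark dump can be compared with a short script. The following is a minimal sketch and not part of MSProbe: the helper names (`load_dump`, `iter_tensors`, `find_divergence`) and the tolerance values are illustrative assumptions, and the code relies only on the JSON layout shown in the L0 and L1 examples.

```python
import json
import math

# Statistic fields present on every tensor record in the examples above.
STAT_KEYS = ("Max", "Min", "Mean", "Norm")


def load_dump(path):
    """Read a dump.json file into a Python dict."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def iter_tensors(node, path=""):
    """Yield (path, record) for every tensor record nested under `node`.

    A record is recognized by having both `dtype` and `shape` keys
    (an assumption based on the sample files).
    """
    if isinstance(node, dict):
        if "dtype" in node and "shape" in node:
            yield path, node
        else:
            for key, value in node.items():
                yield from iter_tensors(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_tensors(value, f"{path}.{index}")


def find_divergence(problem, bench, rel_tol=1e-3, abs_tol=1e-8):
    """Return (path, stat, problem_value, bench_value) for mismatched statistics."""
    bench_records = dict(iter_tensors(bench["data"]))
    mismatches = []
    for path, record in iter_tensors(problem["data"]):
        reference = bench_records.get(path)
        if reference is None:
            continue  # entry exists only in the problem dump
        for key in STAT_KEYS:
            got, expected = record.get(key), reference.get(key)
            if got is None or expected is None:
                continue
            if not math.isclose(got, expected, rel_tol=rel_tol, abs_tol=abs_tol):
                mismatches.append((path, key, got, expected))
    return mismatches
```

A call such as `find_divergence(load_dump(problem_path), load_dump(bench_path))` returns mismatches in file order, so if the file is written in execution order the first entry is a reasonable place to start narrowing the `scope`. Statistics can hide small element-wise differences, so an empty result does not prove the raw tensors match.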
+ +#### mix level + +A `mix` dump.json contains both L0 and L1 level data; the file format is the same as the examples above. diff --git a/docs/developer_guide/setup_github_runner.md b/docs/developer_guide/setup_github_runner.md index 3ca9627ff7ab..49221acc95cf 100644 --- a/docs/developer_guide/setup_github_runner.md +++ b/docs/developer_guide/setup_github_runner.md @@ -1,4 +1,4 @@ -# Set Up Self-Hosted Runners for GitHub Action +# Set Up Self-Hosted Runners for GitHub Actions ## Add a Runner @@ -12,9 +12,9 @@ docker pull nvidia/cuda:12.9.1-devel-ubuntu22.04 # Nvidia docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.9.1-devel-ubuntu22.04 /bin/bash # AMD -docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.0rc1-rocm630 /bin/bash +docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.8-rocm700-mi30x /bin/bash # AMD just the last 2 GPUs -docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.0rc1-rocm630 /bin/bash +docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.8-rocm700-mi30x /bin/bash ``` ### Step 2: Configure the runner by `config.sh` @@ -27,11 +27,11 @@ pip install --upgrade pip export RUNNER_ALLOW_RUNASROOT=1 ``` -Then follow https://github.com/sgl-project/sglang/settings/actions/runners/new?arch=x64&os=linux to run `config.sh` +Then follow https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners to run `config.sh` **Notes** - Do not need to specify the runner group -- Give it a name (e.g., `test-sgl-gpu-0`) and some labels (e.g., 
`1-gpu-runner`). The labels can be edited later in Github Settings. +- Give it a name (e.g., `test-sgl-gpu-0`) and some labels (e.g., `1-gpu-h100`). The labels can be edited later in Github Settings. - Do not need to change the work folder. ### Step 3: Run the runner by `run.sh` diff --git a/docs/diffusion/api/cli.md b/docs/diffusion/api/cli.md new file mode 100644 index 000000000000..587efeb46450 --- /dev/null +++ b/docs/diffusion/api/cli.md @@ -0,0 +1,254 @@ +# SGLang Diffusion CLI + +Use the CLI for one-off generation with `sglang generate` or to start a persistent HTTP server with `sglang serve`. + +### Overlay repos for non-diffusers models + +If `--model-path` points to a supported non-diffusers source repo, SGLang can resolve it +through a self-hosted overlay repo. + +SGLang first checks a built-in overlay registry. Concrete built-in mappings can be added over time without changing the CLI surface. + +Override example: + +```bash +export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY='{ + "Wan-AI/Wan2.2-S2V-14B": { + "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay", + "overlay_revision": "main" + } +}' + +sglang generate \ + --model-path Wan-AI/Wan2.2-S2V-14B \ + --config configs/wan_s2v.yaml +``` + +The overlay repo should be a complete diffusers-style/componentized repo + +You can also pass the overlay repo itself as `--model-path` if it contains `_overlay/overlay_manifest.json`. + +Notes: +1. `SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY` is only an optional override for +development and debugging. It accepts either a JSON object or a path to a JSON +file, and can extend or replace built-in entries for the current process. +2. On the first load, SGLang will: + - download overlay metadata from the overlay repo + - download the required files from the original source repo + - materialize a local standard component repo under `~/.cache/sgl_diffusion/materialized_models/` +3. Later loads reuse the materialized local repo. 
The materialized repo is what the runtime loads as a normal componentized model directory. + + +## Quick Start + +### Generate + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --prompt "A beautiful sunset over the mountains" \ + --save-output +``` + +### Serve + +```bash +sglang serve \ + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --num-gpus 4 \ + --ulysses-degree 2 \ + --ring-degree 2 \ + --port 30010 +``` + +For request and response examples, see [OpenAI-Compatible API](openai_api.md). + +```{tip} +Use `sglang generate --help` and `sglang serve --help` for the full argument list. The CLI help output is the source of truth for exhaustive flags. +``` + +## Common Options + +### Model and runtime + +- `--model-path {MODEL}`: model path or Hugging Face model ID +- `--lora-path {PATH}` and `--lora-nickname {NAME}`: load a LoRA adapter +- `--num-gpus {N}`: number of GPUs to use +- `--tp-size {N}`: tensor parallelism size, mainly for encoders +- `--sp-degree {N}`: sequence parallelism size +- `--ulysses-degree {N}` and `--ring-degree {N}`: USP parallelism controls +- `--attention-backend {BACKEND}`: attention backend for native SGLang pipelines +- `--component-attention-backends {MAP}`: per-component attention backend overrides, for example `text_encoder=torch_sdpa,transformer=fa` +- `--attention-backend-config {CONFIG}`: attention backend configuration + +### Sampling and output + +- `--prompt {PROMPT}` and `--negative-prompt {PROMPT}` +- `--image-path {PATH} [{PATH} ...]`: input image(s) for image-to-video or image-to-image generation +- `--num-inference-steps {STEPS}` and `--seed {SEED}` +- `--height {HEIGHT}`, `--width {WIDTH}`, `--num-frames {N}`, `--fps {FPS}` +- `--output-path {PATH}`, `--output-file-name {NAME}`, `--save-output`, `--return-frames` + +For frame interpolation and upscaling, see [Post-Processing](post_processing.md). 
+ +### Quantized transformers + +For quantized transformer checkpoints, prefer: + +- `--model-path` for the base pipeline +- `--transformer-path` for a quantized `transformers` transformer component folder +- `--transformer-weights-path` for a quantized safetensors file, directory, or repo + +See [Quantization](../quantization.md) for supported quantization families and examples. + +## Configuration Files + +Use `--config` to load JSON or YAML configuration. Command-line flags override values from the config file. + +```bash +sglang generate --config config.yaml +``` + +Example: + +```yaml +model_path: FastVideo/FastHunyuan-diffusers +prompt: A beautiful woman in a red dress walking down a street +output_path: outputs/ +num_gpus: 2 +sp_size: 2 +tp_size: 1 +num_frames: 45 +height: 720 +width: 1280 +num_inference_steps: 6 +seed: 1024 +fps: 24 +precision: bf16 +vae_precision: fp16 +vae_tiling: true +vae_sp: true +enable_torch_compile: false +``` + +## Generate + +`sglang generate` runs a single generation job and exits when the job finishes. + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --text-encoder-cpu-offload \ + --pin-cpu-memory \ + --num-gpus 4 \ + --ulysses-degree 2 \ + --ring-degree 2 \ + --prompt "A curious raccoon" \ + --save-output \ + --output-path outputs \ + --output-file-name "a-curious-raccoon.mp4" +``` + +```{note} +HTTP server-only arguments are ignored by `sglang generate`. +``` + +For diffusers pipelines, Cache-DiT can be enabled with `SGLANG_CACHE_DIT_ENABLED=true` or `--cache-dit-config`. See [Cache-DiT](../performance/cache/cache_dit.md). + +## Serve + +`sglang serve` starts the HTTP server and keeps the model loaded for repeated requests. 
+ +```bash +sglang serve \ + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --text-encoder-cpu-offload \ + --pin-cpu-memory \ + --num-gpus 4 \ + --ulysses-degree 2 \ + --ring-degree 2 \ + --port 30010 +``` + +### Cloud Storage + +SGLang Diffusion can upload generated images and videos to S3-compatible object storage after generation. + +```bash +export SGLANG_CLOUD_STORAGE_TYPE=s3 +export SGLANG_S3_BUCKET_NAME=my-bucket +export SGLANG_S3_ACCESS_KEY_ID=your-access-key +export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key +export SGLANG_S3_ENDPOINT_URL=https://minio.example.com +``` + +See [Environment Variables](../environment_variables.md) for the full set of storage options. + +## Component Path Overrides + +Override individual pipeline components such as `vae`, `transformer`, or `text_encoder` with `---path`. + +```bash +sglang serve \ + --model-path black-forest-labs/FLUX.2-dev \ + --vae-path fal/FLUX.2-Tiny-AutoEncoder +``` + +The component key must match the key in the model's `model_index.json`, and the path must be either a Hugging Face repo ID or a complete component directory. + +## Component Attention Backend Overrides + +Use `--component-attention-backends` when one pipeline component needs a different native attention backend from the global `--attention-backend`. + +```bash +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --attention-backend fa \ + --component-attention-backends text_encoder=torch_sdpa +``` + +The component key must match a pipeline module key such as `text_encoder`, `text_encoder_2`, `transformer`, `transformer_2`, or `connectors`. Component overrides take precedence over the global `--attention-backend` only while that component is being constructed. 
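The precedence rule can be pictured with a small Python sketch. This is an illustration only, not SGLang's implementation: `parse_component_backends` and `resolve_backend` are hypothetical helpers, and the only thing taken from the text is the `component=backend` comma-separated format and the rule that a component override wins over the global backend.

```python
def parse_component_backends(spec: str) -> dict[str, str]:
    """Parse 'text_encoder=torch_sdpa,transformer=fa' into a dict (hypothetical helper)."""
    overrides: dict[str, str] = {}
    for entry in (part.strip() for part in spec.split(",")):
        if not entry:
            continue  # tolerate empty segments such as a trailing comma
        component, sep, backend = entry.partition("=")
        component, backend = component.strip(), backend.strip()
        if not sep or not component or not backend:
            raise ValueError(f"expected component=backend, got {entry!r}")
        overrides[component] = backend
    return overrides


def resolve_backend(component: str, global_backend: str, overrides: dict[str, str]) -> str:
    """Return the override for `component` if present, else the global backend."""
    return overrides.get(component, global_backend)


overrides = parse_component_backends("text_encoder=torch_sdpa")
# The text encoder uses its override; other components fall back to the global value.
text_encoder_backend = resolve_backend("text_encoder", "fa", overrides)
transformer_backend = resolve_backend("transformer", "fa", overrides)
```

With the command above, this model resolves `text_encoder` to `torch_sdpa` and every other component to `fa`.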
+ +You can also pass dotted CLI entries: + +```bash +sglang generate \ + --model-path \ + --component-attention-backends.text_encoder torch_sdpa \ + --component-attention-backends.transformer fa +``` + +## Diffusers Backend + +Use `--backend diffusers` to force vanilla diffusers pipelines when no native SGLang implementation exists or when a model requires a custom pipeline class. + +### Key Options + +| Argument | Values | Description | +|----------|--------|-------------| +| `--backend` | `auto`, `sglang`, `diffusers` | Choose native SGLang, force native, or force diffusers | +| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines | +| `--trust-remote-code` | flag | Required for models with custom pipeline classes | +| `--vae-tiling` and `--vae-slicing` | flag | Lower memory usage for VAE decode | +| `--dit-precision` and `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision controls | +| `--enable-torch-compile` | flag | Enable `torch.compile` | +| `--cache-dit-config` | `{PATH}` | Cache-DiT config for diffusers pipelines | + +### Example + +```bash +sglang generate \ + --model-path AIDC-AI/Ovis-Image-7B \ + --backend diffusers \ + --trust-remote-code \ + --diffusers-attention-backend flash \ + --prompt "A serene Japanese garden with cherry blossoms" \ + --height 1024 \ + --width 1024 \ + --num-inference-steps 30 \ + --save-output \ + --output-path outputs \ + --output-file-name ovis_garden.png +``` + +For pipeline-specific arguments not exposed in the CLI, pass `diffusers_kwargs` in a config file. 
diff --git a/python/sglang/multimodal_gen/docs/openai_api.md b/docs/diffusion/api/openai_api.md similarity index 79% rename from python/sglang/multimodal_gen/docs/openai_api.md rename to docs/diffusion/api/openai_api.md index 88dabac4c69a..8d18c49599ba 100644 --- a/python/sglang/multimodal_gen/docs/openai_api.md +++ b/docs/diffusion/api/openai_api.md @@ -2,6 +2,10 @@ The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management. +## Prerequisites + +- Python 3.11+ if you plan to use the OpenAI Python SDK. + ## Serve Launch the server using the `sglang serve` command. @@ -25,7 +29,7 @@ sglang serve "${SERVER_ARGS[@]}" - **--model-path**: Path to the model or model ID. - **--port**: HTTP port to listen on (default: `30000`). -#### Get Model Information +**Get Model Information** **Endpoint:** `GET /models` @@ -59,7 +63,7 @@ curl -sS -X GET "http://localhost:30010/models" The server implements an OpenAI-compatible Images API under the `/v1/images` namespace. -#### Create an image +**Create an image** **Endpoint:** `POST /v1/images/generations` @@ -98,9 +102,10 @@ curl -sS -X POST "http://localhost:30010/v1/images/generations" \ ``` > **Note** -> The `response_format=url` option is not supported for `POST /v1/images/generations` and will return a `400` error. +> If `response_format=url` is used and cloud storage is not configured, the API returns +> a relative URL like `/v1/images//content`. -#### Edit an image +**Edit an image** **Endpoint:** `POST /v1/images/edits` @@ -130,9 +135,10 @@ curl -sS -X POST "http://localhost:30010/v1/images/edits" \ -F "response_format=url" ``` -#### Download image content +**Download image content** -When `response_format=url` is used with `POST /v1/images/edits`, the API returns a relative URL like `/v1/images//content`. 
+When `response_format=url` is used with `POST /v1/images/generations` or `POST /v1/images/edits`, +the API returns a relative URL like `/v1/images//content`. **Endpoint:** `GET /v1/images/{image_id}/content` @@ -148,7 +154,7 @@ curl -sS -L "http://localhost:30010/v1/images//content" \ The server implements a subset of the OpenAI Videos API under the `/v1/videos` namespace. -#### Create a video +**Create a video (text-to-video)** **Endpoint:** `POST /v1/videos` @@ -178,7 +184,34 @@ curl -sS -X POST "http://localhost:30010/v1/videos" \ }' ``` -#### List videos +**Create a video (image-to-video)** + +For I2V or TI2V models (e.g., Wan2.1 I2V, LTX-2.3 two-stage), pass an input image via multipart form upload or a reference URL. + +**Curl Example (multipart form upload):** + +```bash +curl -sS -X POST "http://localhost:30010/v1/videos" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -F "prompt=A cat playing a piano" \ + -F "input_reference=@input_image.png" \ + -F "size=1280x720" +``` + +**Curl Example (reference URL):** + +```bash +curl -sS -X POST "http://localhost:30010/v1/videos" \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -d '{ + "prompt": "A cat playing a piano", + "reference_url": "https://example.com/input_image.png", + "size": "1280x720" + }' +``` + +**List videos** **Endpoint:** `GET /v1/videos` @@ -197,7 +230,7 @@ curl -sS -X GET "http://localhost:30010/v1/videos" \ -H "Authorization: Bearer sk-proj-1234567890" ``` -#### Download video content +**Download video content** **Endpoint:** `GET /v1/videos/{video_id}/content` @@ -239,7 +272,7 @@ The server supports dynamic loading, merging, and unmerging of LoRA adapters. - Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one - Caching: The server caches loaded LoRA weights in memory. 
Switching back to a previously loaded LoRA (same path) has little cost -#### Set LoRA Adapter +**Set LoRA Adapter** Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters. @@ -301,7 +334,7 @@ curl -X POST http://localhost:30010/v1/set_lora \ > - Multiple LoRAs applied to the same target will be merged in order -#### Merge LoRA Weights +**Merge LoRA Weights** Manually merges the currently set LoRA weights into the base model. @@ -323,7 +356,7 @@ curl -X POST http://localhost:30010/v1/merge_lora_weights \ ``` -#### Unmerge LoRA Weights +**Unmerge LoRA Weights** Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This **must** be called before setting a different LoRA. @@ -336,7 +369,7 @@ curl -X POST http://localhost:30010/v1/unmerge_lora_weights \ -H "Content-Type: application/json" ``` -#### List LoRA Adapters +**List LoRA Adapters** Returns loaded LoRA adapters and current application status per module. @@ -389,3 +422,26 @@ Notes: curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}' ``` 5. Generate with LoRA B... + +### Adjust Output Quality + +The server supports adjusting output quality and compression levels for both image and video generation through the `output-quality` and `output-compression` parameters. + +#### Parameters + +- **`output-quality`** (string, optional): Preset quality level that automatically sets compression. **Default is `"default"`**. Valid values: + - `"maximum"`: Highest quality (100) + - `"high"`: High quality (90) + - `"medium"`: Medium quality (55) + - `"low"`: Lower quality (35) + - `"default"`: Auto-adjust based on media type (50 for video, 75 for image) + +- **`output-compression`** (integer, optional): Direct compression level override (0-100). **Default is `None`**. When provided (not `None`), takes precedence over `output-quality`. 
+  - `0`: Lowest quality, smallest file size
+  - `100`: Highest quality, largest file size
+
+#### Notes
+
+- **Precedence**: When both `output-quality` and `output-compression` are provided, `output-compression` takes precedence
+- **Format Support**: Quality settings apply to JPEG and video formats. PNG uses lossless compression and ignores these settings
+- **File Size vs Quality**: Lower compression values (or the `"low"` quality preset) produce smaller files but may show visible artifacts
diff --git a/docs/diffusion/api/post_processing.md b/docs/diffusion/api/post_processing.md
new file mode 100644
index 000000000000..d832f4af2959
--- /dev/null
+++ b/docs/diffusion/api/post_processing.md
@@ -0,0 +1,148 @@
+# Post-Processing
+
+SGLang diffusion supports optional post-processing steps that run after
+generation to improve temporal smoothness (frame interpolation) or spatial
+resolution (upscaling). These steps are independent of the diffusion model and
+can be combined in a single run.
+
+When both are enabled, **frame interpolation runs first** (increasing the frame
+count), then **upscaling runs on every frame** (increasing the spatial
+resolution).
+
+---
+
+## Frame Interpolation (video only)
+
+Frame interpolation synthesizes new frames between each pair of consecutive
+generated frames, producing smoother motion without re-running the diffusion
+model.
+
+The `--frame-interpolation-exp` flag controls how many rounds of interpolation
+to apply: each round inserts one new frame into every gap between adjacent
+frames, so for N generated frames the output frame count is:
+
+> **(N − 1) × 2^exp + 1**
+>
+> e.g. 5 original frames with `exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames;
+> with `exp=2` → **17** frames.
+
+### CLI Arguments
+
+| Argument | Description |
+|----------|-------------|
+| `--enable-frame-interpolation` | Enable frame interpolation. Model weights are downloaded automatically on first use.
| +| `--frame-interpolation-exp {EXP}` | Interpolation exponent — `1` = 2× temporal resolution, `2` = 4×, etc. (default: `1`) | +| `--frame-interpolation-scale {SCALE}` | RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`) | +| `--frame-interpolation-model-path {PATH}` | Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically) | + +### Supported Models + +Frame interpolation uses the [RIFE](https://github.com/hzwer/Practical-RIFE) +(Real-Time Intermediate Flow Estimation) architecture. Only **RIFE 4.22.lite** +(`IFNet` with 4-scale `IFBlock` backbone) is supported. The network topology is +hard-coded, so custom weights provided via `--frame-interpolation-model-path` +must be a `flownet.pkl` checkpoint that is compatible with this architecture. + +Other RIFE versions (e.g., older `v4.x` variants with different block counts) +or entirely different frame interpolation methods (FILM, AMT, etc.) are **not +supported**. + +| Weight | HuggingFace Repo | Description | +|--------|------------------|-------------| +| RIFE 4.22.lite *(default)* | [`elfgum/RIFE-4.22.lite`](https://huggingface.co/elfgum/RIFE-4.22.lite) | Lightweight model, downloaded automatically on first use | + +### Example + +Generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2¹ + 1 = 9): + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --prompt "A dog running through a park" \ + --num-frames 5 \ + --enable-frame-interpolation \ + --frame-interpolation-exp 1 \ + --save-output +``` + +--- + +## Upscaling (image and video) + +Upscaling increases the spatial resolution of generated images or video frames +using [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN). The model weights +are downloaded automatically on first use and cached for subsequent runs. 
+ +### CLI Arguments + +| Argument | Description | +|----------|-------------| +| `--enable-upscaling` | Enable post-generation upscaling using Real-ESRGAN. | +| `--upscaling-scale {SCALE}` | Desired upscaling factor (default: `4`). The 4× model is used internally; if a different scale is requested, a bicubic resize is applied after the network output. | +| `--upscaling-model-path {PATH}` | Local `.pth` file, HuggingFace repo ID, or `repo_id:filename` for Real-ESRGAN weights (default: `ai-forever/Real-ESRGAN` with `RealESRGAN_x4.pth`, downloaded automatically). Use the `repo_id:filename` format to specify a custom weight file from a HuggingFace repo (e.g. `my-org/my-esrgan:weights.pth`). | + +### Supported Models + +Upscaling supports two Real-ESRGAN network architectures. The correct +architecture is **auto-detected** from the checkpoint keys, so you only need to +point `--upscaling-model-path` at a valid `.pth` file: + +| Architecture | Example Weights | Description | +|--------------|-----------------|-------------| +| **RRDBNet** | `RealESRGAN_x4plus.pth` | Heavier model with higher quality; best for photos | +| **SRVGGNetCompact** | `RealESRGAN_x4.pth` *(default)*, `realesr-animevideov3.pth`, `realesr-general-x4v3.pth` | Lightweight model; faster inference, good for video | + +The default weight file is +[`ai-forever/Real-ESRGAN`](https://huggingface.co/ai-forever/Real-ESRGAN) with +`RealESRGAN_x4.pth` (SRVGGNetCompact, 4× native scale). + +Other super-resolution models (e.g., SwinIR, HAT, BSRGAN) are **not supported** +— only Real-ESRGAN checkpoints using the two architectures above are +compatible. 
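As a sanity check before launching a job, the output shape after post-processing can be estimated from the frame-count formula in the Frame Interpolation section and the upscaling factor. The sketch below is illustrative only: the function names are hypothetical, and rounding the size for non-integer scales is an assumption, since the text only states that a bicubic resize follows the 4× network when another scale is requested.

```python
from typing import Optional, Tuple


def interpolated_frame_count(num_frames: int, exp: int = 1) -> int:
    """Frame count after `exp` rounds of interpolation: (N - 1) * 2**exp + 1."""
    if num_frames < 1 or exp < 0:
        raise ValueError("num_frames must be >= 1 and exp must be >= 0")
    return (num_frames - 1) * 2 ** exp + 1


def upscaled_size(width: int, height: int, scale: float = 4.0) -> Tuple[int, int]:
    """Spatial size after upscaling; rounding is an assumption for non-integer scales."""
    return round(width * scale), round(height * scale)


def post_processed_shape(
    num_frames: int,
    width: int,
    height: int,
    exp: Optional[int] = None,
    scale: Optional[float] = None,
) -> Tuple[int, int, int]:
    """Return (frames, width, height); interpolation is applied before upscaling."""
    frames = interpolated_frame_count(num_frames, exp) if exp is not None else num_frames
    if scale is not None:
        width, height = upscaled_size(width, height, scale)
    return frames, width, height
```

For example, `post_processed_shape(5, 1280, 720, exp=1, scale=4)` models a 5-frame 1280×720 video that is interpolated once and then upscaled 4×.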
+ +### Examples + +Generate a 1024×1024 image and upscale to 4096×4096: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --prompt "A cat sitting on a windowsill" \ + --output-size 1024x1024 \ + --enable-upscaling \ + --save-output +``` + +Generate a video and upscale each frame by 4×: + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --prompt "A curious raccoon" \ + --enable-upscaling \ + --upscaling-scale 4 \ + --save-output +``` + +--- + +## Combining Frame Interpolation and Upscaling + +Frame interpolation and upscaling can be combined in a single run. +Interpolation is applied first (increasing the frame count), then upscaling is +applied to every frame (increasing the spatial resolution). + +Example — generate 5 frames, interpolate to 9 frames, and upscale each frame +by 4×: + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --prompt "A curious raccoon" \ + --num-frames 5 \ + --enable-frame-interpolation \ + --frame-interpolation-exp 1 \ + --enable-upscaling \ + --upscaling-scale 4 \ + --save-output +``` diff --git a/python/sglang/multimodal_gen/docs/ci_perf.md b/docs/diffusion/ci_perf.md similarity index 94% rename from python/sglang/multimodal_gen/docs/ci_perf.md rename to docs/diffusion/ci_perf.md index fcedbc39c0c2..f8bb2316bb7f 100644 --- a/python/sglang/multimodal_gen/docs/ci_perf.md +++ b/docs/diffusion/ci_perf.md @@ -1,5 +1,6 @@ +# CI Performance -## Perf baseline generation script +## Perf Baseline Generation Script `python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py` starts a local diffusion server, issues requests for selected test cases, aggregates stage/denoise-step/E2E timings from the perf log, and writes the results back to the `scenarios` section of `perf_baselines.json`. 
diff --git a/docs/diffusion/compatibility_matrix.md b/docs/diffusion/compatibility_matrix.md new file mode 100644 index 000000000000..37b95acfa004 --- /dev/null +++ b/docs/diffusion/compatibility_matrix.md @@ -0,0 +1,193 @@ +# Compatibility Matrix + +The table below shows every supported model and the optimizations supported for them. + +The symbols used have the following meanings: + +- ✅ = Full compatibility +- ❌ = No compatibility +- ⭕ = Does not apply to this model + +## Models x Optimization + +The `HuggingFace Model ID` can be passed directly to `from_pretrained()` methods, and sglang-diffusion will use the +optimal +default parameters when initializing and generating videos. + +### Video Generation Models + +| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear Attention (SLA) | Sage Sparse Linear Attention (SageSLA) | Sparse Video Gen 2 (SVG2) | +|:-----------------------------|:--------------------------------------------------|:---------------------|:--------:|:-----------------:|:---------:|:----------------------------:|:-----------------------------:|:--------------------------------------:|:-------------------------:| +| FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ | ❌ | +| FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ | ❌ | +| Wan2.2 TI2V 5B | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | 720p | ⭕ | ⭕ | ✅ | ⭕ | ❌ | ❌ | ❌ | +| Wan2.2 T2V A14B | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | 480p
720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ | ❌ | +| Wan2.2 I2V A14B | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | 480p
720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ | ❌ | +| HunyuanVideo | `hunyuanvideo-community/HunyuanVideo` | 720×1280
544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ | +| FastHunyuan | `FastVideo/FastHunyuan-diffusers` | 720×1280
544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ | +| Wan2.1 T2V 1.3B | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ | +| Wan2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | 480p, 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ | +| Wan2.1 I2V 480P | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ | +| Wan2.1 I2V 720P | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ | +| TurboWan2.1 T2V 1.3B | `IPostYellow/TurboWan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ | +| TurboWan2.1 T2V 14B | `IPostYellow/TurboWan2.1-T2V-14B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ | +| TurboWan2.1 T2V 14B 720P | `IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ | +| TurboWan2.2 I2V A14B | `IPostYellow/TurboWan2.2-I2V-A14B-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ | +| Wan2.1 Fun 1.3B InP | `weizhou03/Wan2.1-Fun-1.3B-InP-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ | +| Helios Base | `BestWishYsh/Helios-Base` | 720p | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| Helios Mid | `BestWishYsh/Helios-Mid` | 720p | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| Helios Distilled | `BestWishYsh/Helios-Distilled` | 720p | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| LTX-2 (one/two-stage/TI2V) | `Lightricks/LTX-2` | 768×512
1536×1024 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| LTX-2.3 (one/two-stage/TI2V/HQ) | `Lightricks/LTX-2.3` | 768×512
1536×1024
1920×1088 (HQ default) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | + +**Note**: + +1. Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue. +2. SageSLA is based on SpargeAttn. Install it first with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation` +3. LTX pipeline selection: + - One-stage: `--pipeline-class-name LTX2Pipeline` + - Two-stage: `--pipeline-class-name LTX2TwoStagePipeline` + - Two-stage HQ: `--pipeline-class-name LTX2TwoStageHQPipeline` (HQ defaults to 1920×1088; you can still override `--width/--height`) + - LTX-2 and LTX-2.3 support both T2V and TI2V (`--image-path`) on one-stage and two-stage pipelines (including HQ). + - The spatial upsampler and distilled LoRA are auto-resolved from the model snapshot by default, and can still be overridden with `--spatial-upsampler-path` and `--distilled-lora-path`. + - For LTX models, the `Resolutions` column uses output video `width×height` semantics, matching `sglang generate --width ... --height ...`. +4. LTX-2 / LTX-2.3 two-stage also supports `--ltx2-two-stage-device-mode {original,snapshot,resident}`: + - `snapshot` is the default and recommended mode. + - `resident` usually provides the best latency/throughput but uses much more VRAM. + - `original` keeps official two-stage semantics without the premerged stage-2 transformer path. + - Example (one prior run): `original` `154.67s`, `snapshot` `114.05s`, `resident` `75.71s`; peak VRAM trend is `original < snapshot < resident`. 
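As a rough sketch of how the LTX flags in the notes above fit together, the following Python helper assembles a `sglang generate` command line. The helper name `ltx2_generate_args` and its parameters are hypothetical and exist only for illustration; the flag names, pipeline class names, and device modes are the ones listed in the notes. It assumes that `sglang generate` accepts `--ltx2-two-stage-device-mode` and that the flag is only meaningful with a two-stage pipeline.

```python
import shlex

# Pipeline class names and device modes as listed in the notes above.
ONE_STAGE = "LTX2Pipeline"
TWO_STAGE = ("LTX2TwoStagePipeline", "LTX2TwoStageHQPipeline")
DEVICE_MODES = ("original", "snapshot", "resident")


def ltx2_generate_args(prompt, pipeline=ONE_STAGE, device_mode=None,
                       model="Lightricks/LTX-2"):
    """Hypothetical helper: build an argument list for `sglang generate`."""
    if pipeline != ONE_STAGE and pipeline not in TWO_STAGE:
        raise ValueError(f"unknown LTX pipeline class: {pipeline}")
    args = ["sglang", "generate", "--model-path", model,
            "--pipeline-class-name", pipeline, "--prompt", prompt]
    if device_mode is not None:
        # Assumption: the device mode only applies to two-stage pipelines.
        if pipeline not in TWO_STAGE:
            raise ValueError("device mode applies to two-stage pipelines only")
        if device_mode not in DEVICE_MODES:
            raise ValueError(f"unknown device mode: {device_mode}")
        args += ["--ltx2-two-stage-device-mode", device_mode]
    return args


cmd = ltx2_generate_args("A curious raccoon", "LTX2TwoStagePipeline", "snapshot")
print(shlex.join(cmd))  # shell-quoted command line, ready to paste
```

Building the command as a list and quoting it with `shlex.join` keeps prompts with spaces intact when the command is pasted into a shell.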
+ +### Image Generation Models + +| Model Name | HuggingFace Model ID | +|:--------------------------|:---------------------------------------------------------| +| FLUX.1-dev | `black-forest-labs/FLUX.1-dev` | +| FLUX.2-dev | `black-forest-labs/FLUX.2-dev` | +| FLUX.2-dev-NVFP4 | `black-forest-labs/FLUX.2-dev-NVFP4` | +| FLUX.2-Klein-4B | `black-forest-labs/FLUX.2-klein-4B` | +| FLUX.2-Klein-9B | `black-forest-labs/FLUX.2-klein-9B` | +| Z-Image | `Tongyi-MAI/Z-Image` | +| Z-Image-Turbo | `Tongyi-MAI/Z-Image-Turbo` | +| GLM-Image | `zai-org/GLM-Image` | +| Qwen Image | `Qwen/Qwen-Image` | +| Qwen Image 2512 | `Qwen/Qwen-Image-2512` | +| Qwen Image Edit | `Qwen/Qwen-Image-Edit` | +| Qwen Image Edit 2509 | `Qwen/Qwen-Image-Edit-2509` | +| Qwen Image Edit 2511 | `Qwen/Qwen-Image-Edit-2511` | +| Qwen Image Layered | `Qwen/Qwen-Image-Layered` | +| SD3 Medium | `stabilityai/stable-diffusion-3-medium-diffusers` | +| SD3.5 Medium | `stabilityai/stable-diffusion-3.5-medium-diffusers` | +| SD3.5 Large | `stabilityai/stable-diffusion-3.5-large-diffusers` | +| Hunyuan3D-2 | `tencent/Hunyuan3D-2` | +| SANA 1.5 1.6B | `Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers` | +| SANA 1.5 4.8B | `Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers` | +| SANA 1600M 1024px | `Efficient-Large-Model/Sana_1600M_1024px_diffusers` | +| SANA 600M 1024px | `Efficient-Large-Model/Sana_600M_1024px_diffusers` | +| SANA 1600M 512px | `Efficient-Large-Model/Sana_1600M_512px_diffusers` | +| SANA 600M 512px | `Efficient-Large-Model/Sana_600M_512px_diffusers` | +| FireRed-Image-Edit 1.0 | `FireRedTeam/FireRed-Image-Edit-1.0` | +| FireRed-Image-Edit 1.1 | `FireRedTeam/FireRed-Image-Edit-1.1` | +| ERNIE-Image | `baidu/ERNIE-Image` | +| ERNIE-Image-Turbo | `baidu/ERNIE-Image-Turbo` | + +## Supported Components + +SGLang Diffusion supports overriding individual pipeline components with +`---path`. The value can be either a Hugging Face repo ID or a local +component directory. 
+ +The same overrides can also be provided in config files through +`component_paths.`. + +### Common Syntax + +CLI: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --vae-path black-forest-labs/FLUX.2-small-decoder \ + --transformer-path /models/flux2/transformer +``` + +Config file: + +```yaml +model_path: black-forest-labs/FLUX.2-dev +component_paths: + vae: black-forest-labs/FLUX.2-small-decoder + transformer: /models/flux2/transformer +``` + +Use the component name from the pipeline's `model_index.json` or the native pipeline's registered module name: + +| Component Type | Supported Keys | Notes | +|:------------------|:---------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------| +| VAE | `vae`, `video_vae`, `audio_vae` | `vae` is the common image-generation override | +| Transformer / DiT | `transformer`, `video_dit`, `audio_dit` | `transformer` is the standard override for the main denoiser | +| Text / Preprocess | `text_encoder`, `text_encoder_2`, `tokenizer`, `processor`, `image_processor` | Replacement encoders often need matching preprocessing assets | +| Auxiliary | `scheduler`, `spatial_upsampler`, `vocoder`, `connectors`, `dual_tower_bridge`, `image_encoder`, `vision_language_encoder` | Only valid for pipelines that expose these components | + +### Known Component Repos + +The table below lists concrete Hugging Face component repos that are already used in SGLang Diffusion docs or tests. It is not an exhaustive catalog of all compatible component repos. 
+ +| Base Model | Override Key | Example Repo | Notes | +|:-------------------------------|:--------------|:-----------------------------------------|:------------------------------------------| +| `black-forest-labs/FLUX.2-dev` | `vae` | `black-forest-labs/FLUX.2-small-decoder` | Decoder-only FLUX.2 VAE override | +| `black-forest-labs/FLUX.2-dev` | `vae` | `fal/FLUX.2-Tiny-AutoEncoder` | Existing tested custom VAE path | + +### VAE + +- `--vae-path` is the common image-generation override. +- `--video-vae-path` and `--audio-vae-path` are only relevant for pipelines with separate video or audio VAEs. + +### Transformer / DiT + +- `--transformer-path` is the standard override for the main denoising transformer. +- For quantized transformers, prefer `--transformer-path` or `--transformer-weights-path`; see `quantization.md`. +- `--video-dit-path` and `--audio-dit-path` are only for pipelines that split denoisers by modality. + +### Text Encoders and Preprocessors + +- `--text-encoder-path` and `--text-encoder-2-path` override primary and secondary text encoders. +- `--tokenizer-path`, `--processor-path`, and `--image-processor-path` are useful when the replacement encoder requires matching preprocessing assets. + +### Auxiliary Components + +- `--scheduler-path` is only relevant when the pipeline exposes a scheduler component. +- `--spatial-upsampler-path` is mainly for two-stage pipelines such as `LTX2TwoStagePipeline`. +- `--vocoder-path`, `--connectors-path`, `--dual-tower-bridge-path`, `--image-encoder-path`, and `--vision-language-encoder-path` are only valid for pipelines that expose those components. + +### Notes + +1. Component overrides are only valid when the target pipeline actually uses + that component. +2. The override key should match the component name in the pipeline's + `model_index.json` or the native pipeline's registered module name. 
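The relationship between config keys and CLI flags described in this section can be sketched in a few lines of Python. This is an illustrative sketch, not SGLang code: the function name `component_flags` is hypothetical, and the key-to-flag rule (underscores become hyphens, wrapped in `--` and `-path`) is inferred from the flag names listed above.

```python
# Component keys copied from the "Supported Keys" table above.
SUPPORTED_KEYS = {
    "vae", "video_vae", "audio_vae",
    "transformer", "video_dit", "audio_dit",
    "text_encoder", "text_encoder_2", "tokenizer", "processor",
    "image_processor",
    "scheduler", "spatial_upsampler", "vocoder", "connectors",
    "dual_tower_bridge", "image_encoder", "vision_language_encoder",
}


def component_flags(component_paths):
    """Hypothetical helper: turn a `component_paths` mapping into CLI arguments."""
    args = []
    for key, path in component_paths.items():
        if key not in SUPPORTED_KEYS:
            raise ValueError(f"unsupported component key: {key}")
        # Inferred rule: `text_encoder_2` becomes `--text-encoder-2-path`.
        args += ["--" + key.replace("_", "-") + "-path", path]
    return args


overrides = {
    "vae": "black-forest-labs/FLUX.2-small-decoder",
    "transformer": "/models/flux2/transformer",
}
print(component_flags(overrides))
```

The sketch only checks that a key is known; whether the target pipeline actually exposes that component still has to be verified against its `model_index.json`.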
+ +## Verified LoRA Examples + +This section lists example LoRAs that have been explicitly tested and verified with each base model in the **SGLang Diffusion** pipeline. + +> Important: +> LoRAs that are not listed here are not necessarily incompatible. +> In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions. +> The entries below simply reflect configurations that have been manually validated by the SGLang team. + +### Verified LoRAs by Base Model + +| Base Model | Supported LoRAs | +|:----------------|:---------------------------------------------------------------------------------------------------------------------------------------------------| +| Wan2.2 | `lightx2v/Wan2.2-Distill-Loras`
`Cseti/wan2.2-14B-Arcane_Jinx-lora-v1` | +| Wan2.1 | `lightx2v/Wan2.1-Distill-Loras` | +| Z-Image-Turbo | `tarn59/pixel_art_style_lora_z_image_turbo`
`wcde/Z-Image-Turbo-DeJPEG-Lora` | +| Qwen-Image | `lightx2v/Qwen-Image-Lightning`
`flymy-ai/qwen-image-realism-lora`
`prithivMLmods/Qwen-Image-HeadshotX`
`starsfriday/Qwen-Image-EVA-LoRA` | +| Qwen-Image-Edit | `ostris/qwen_image_edit_inpainting`
`lightx2v/Qwen-Image-Edit-2511-Lightning` | +| Flux | `dvyio/flux-lora-simple-illustration`
`XLabs-AI/flux-furry-lora`
`XLabs-AI/flux-RealismLora` | + +## Special Requirements + +### Sliding Tile Attention + +- Currently, only Hopper GPUs (H100s) are supported. diff --git a/docs/diffusion/contributing.md b/docs/diffusion/contributing.md new file mode 100644 index 000000000000..9b960aec9ea1 --- /dev/null +++ b/docs/diffusion/contributing.md @@ -0,0 +1,79 @@ +# Contributing to SGLang Diffusion + +This guide outlines the requirements for contributing to the SGLang Diffusion module (`sglang.multimodal_gen`). + +## Contributor Guides + +- [Support New Models](support_new_models.md): implementation guide for adding new diffusion pipelines +- [CI Performance](ci_perf.md): update and regenerate perf baselines + +```{toctree} +:maxdepth: 1 + +support_new_models +ci_perf +``` + +## On AI-Assisted ("Vibe Coding") PRs + +Vibe-coded PRs are welcome; we judge code quality, not how it was produced. The bar is the same for all PRs: + +- **No over-commenting.** If the name says it all, skip the docstring. +- **No over-catching.** Don't guard against errors that virtually never happen in practice. +- **Test before submitting.** AI-generated code can be subtly wrong, so verify correctness end-to-end. + +## Commit Message Convention + +We follow a structured commit message format to maintain a clean history. + +**Format:** +```text +[diffusion] : +``` + +**Examples:** +- `[diffusion] cli: add --perf-dump-path argument` +- `[diffusion] scheduler: fix deadlock in batch processing` +- `[diffusion] model: support Stable Diffusion 3.5` + +**Rules:** +- **Prefix**: Always start with `[diffusion]`. +- **Scope** (Optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc. +- **Subject**: Imperative mood, short and clear (e.g., "add feature" not "added feature"). + +## Performance Reporting + +For PRs that impact **latency**, **throughput**, or **memory usage**, you **should** provide a performance comparison report. + +### How to Generate a Report + +1. 
**Baseline**: run the benchmark (for a single generation task) + ```bash + $ sglang generate --model-path --prompt "A benchmark prompt" --perf-dump-path baseline.json + ``` + +2. **New**: run the same benchmark, without modifying any server_args or sampling_params + ```bash + $ sglang generate --model-path --prompt "A benchmark prompt" --perf-dump-path new.json + ``` + +3. **Compare**: run the compare script, which will print a Markdown table to the console + ```bash + $ python python/sglang/multimodal_gen/benchmarks/compare_perf.py baseline.json new.json [new2.json ...] + ### Performance Comparison Report + ... + ``` +4. **Paste**: paste the table into the PR description + +## CI-Based Change Protection + +Consider adding tests to the `pr-test` or `nightly-test` suites to safeguard your changes, especially for PRs that: + +- support a new model + - add a testcase for this new model to `testcase_configs.py` +- support or fix important features +- significantly improve performance + +Please run the corresponding testcase, then update or add the baseline in `perf_baselines.json` by following the instructions printed in the console, if applicable. + +See [test](https://github.com/sgl-project/sglang/tree/main/python/sglang/multimodal_gen/test) for examples. diff --git a/docs/diffusion/development.md b/docs/diffusion/development.md new file mode 100644 index 000000000000..afed2fb8d9b8 --- /dev/null +++ b/docs/diffusion/development.md @@ -0,0 +1,5 @@ +# Development + +This page collects lower-level development material for SGLang Diffusion. 
+ +- [Contributing](contributing.md): contribution workflow, adding new models, and CI perf baselines diff --git a/docs/diffusion/disaggregation.md b/docs/diffusion/disaggregation.md new file mode 100644 index 000000000000..57bc2c4f10a3 --- /dev/null +++ b/docs/diffusion/disaggregation.md @@ -0,0 +1,237 @@ +# Disaggregated Diffusion Pipeline + +Split a monolithic text-to-video/image pipeline into independent **Encoder**, **Denoiser**, and **Decoder** roles, each running on its own GPU(s). A central **DiffusionServer** routes requests through the pipeline. + +## Quick Start + +Disaggregation is controlled by a single flag: `--disagg-role`. Each component is launched independently, just like LLM PD disaggregation. + +| `--disagg-role` | What it runs | +|----------------|--------------| +| `monolithic` | (Default) Standard single-server mode | +| `encoder` | All stages with the default `RoleType.ENCODER` affinity: `InputValidationStage`, `TextEncodingStage` (plus `ImageEncodingStage` / `ImageVAEEncodingStage` for image-conditioned pipelines), `LatentPreparationStage`, `TimestepPreparationStage`, and any model-specific "before denoising" stage (e.g. `QwenImageLayeredBeforeDenoisingStage`, `GlmImageBeforeDenoisingStage`). | +| `denoiser` | `DenoisingStage` (and its subclasses: `CausalDMDDenoisingStage`, `DmdDenoisingStage`, `LTX2AVDenoisingStage`, `LTX2RefinementStage`, `Hunyuan3DShapeDenoisingStage`, ...) — the DiT forward loop plus the scheduler stepping it drives. | +| `decoder` | `DecodingStage` (VAE decode) and its subclasses (`LTX2AVDecodingStage`, `HeliosDecodingStage`, ...). | +| `server` | DiffusionServer head node + HTTP server (no GPU) | + +> Each stage declares its role via the `role_affinity` property on `PipelineStage` (default `ENCODER`). When `--disagg-role` is not `monolithic`, the pipeline only instantiates stages whose affinity matches, so the above table is the source of truth for what actually runs in each process. 
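The role-affinity filtering described in the note above can be illustrated with a toy model. The classes below are stand-ins that only reuse stage names from the table; they are not the real SGLang classes. `RoleType.DENOISER`, `RoleType.DECODER`, the `stages_for_role` function, and the use of a plain class attribute instead of a property are all assumptions made for this sketch, as is the choice to give the `server` role no stages.

```python
from enum import Enum


class RoleType(Enum):
    ENCODER = "encoder"
    DENOISER = "denoiser"   # assumed member name
    DECODER = "decoder"     # assumed member name


class PipelineStage:
    # Toy stand-in: stages default to the encoder role.
    role_affinity = RoleType.ENCODER


class InputValidationStage(PipelineStage):
    pass


class TextEncodingStage(PipelineStage):
    pass


class DenoisingStage(PipelineStage):
    role_affinity = RoleType.DENOISER


class DecodingStage(PipelineStage):
    role_affinity = RoleType.DECODER


ALL_STAGES = [InputValidationStage, TextEncodingStage, DenoisingStage, DecodingStage]


def stages_for_role(stage_classes, disagg_role):
    """Hypothetical filter: keep only the stages a given role would instantiate."""
    if disagg_role == "monolithic":
        return list(stage_classes)      # single-server mode runs everything
    if disagg_role == "server":
        return []                       # assumption: the head node runs no stages
    role = RoleType(disagg_role)
    return [cls for cls in stage_classes if cls.role_affinity is role]


for role in ("monolithic", "encoder", "denoiser", "decoder", "server"):
    print(role, [cls.__name__ for cls in stages_for_role(ALL_STAGES, role)])
```

Because the default affinity is the encoder role, a new stage that does not declare an affinity would run in the encoder process under this model.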
+ +### Single-Machine Example (Verified) + +The following commands have been tested end-to-end on an 8×H200 machine with +`Wan-AI/Wan2.1-T2V-1.3B-Diffusers`. Each role runs on a separate GPU via +`--base-gpu-id`; the `server` head node requires no GPU. + +```bash +# Terminal 1: Encoder (GPU 0) +sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --disagg-role encoder \ + --disagg-server-addr tcp://127.0.0.1:19655 \ + --scheduler-port 19000 \ + --num-gpus 1 --base-gpu-id 0 + +# Terminal 2: Denoiser (GPU 1) +sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --disagg-role denoiser \ + --disagg-server-addr tcp://127.0.0.1:19655 \ + --scheduler-port 19001 \ + --num-gpus 1 --base-gpu-id 1 + +# Terminal 3: Decoder (GPU 2) +sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --disagg-role decoder \ + --disagg-server-addr tcp://127.0.0.1:19655 \ + --scheduler-port 19002 \ + --num-gpus 1 --base-gpu-id 2 + +# Terminal 4: DiffusionServer head (no GPU, receives HTTP requests) +sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --disagg-role server \ + --encoder-urls "tcp://127.0.0.1:19000" \ + --denoiser-urls "tcp://127.0.0.1:19001" \ + --decoder-urls "tcp://127.0.0.1:19002" \ + --host 0.0.0.0 --port 22000 \ + --scheduler-port 19655 + +# Send request (video generation) +curl http://127.0.0.1:22000/v1/videos \ + -H "Content-Type: application/json" \ + -d '{"model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", "prompt": "A curious raccoon exploring a garden, cinematic", "size": "832x480"}' +``` + +> **Tested result (8×H200):** +> Encoder 2.3 s (TextEncoding) → Denoiser 312.8 s (50 steps, layerwise offload) → Decoder 7.1 s (VAE decode). +> Total ~322 s for 81-frame 1024×1024 video. + +> **Tip:** `--base-gpu-id` controls which physical GPU the role uses. +> Encoder and Decoder can share a GPU (e.g. both `--base-gpu-id 0`) to save resources, +> but make sure the combined GPU memory is sufficient. 
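For scripted testing, the `curl` request from the single-machine example can also be issued from Python. The sketch below uses only the standard library and mirrors the URL, headers, and JSON body of the `curl` command; the function names and the 600-second timeout are illustrative choices, and the response is returned as raw bytes because its schema is not described here.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:22000"  # --host/--port of the `server` role above


def build_video_request(prompt, size="832x480",
                        model="Wan-AI/Wan2.1-T2V-1.3B-Diffusers"):
    """Build the same POST request that the curl command sends."""
    payload = {"model": model, "prompt": prompt, "size": size}
    return urllib.request.Request(
        BASE_URL + "/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the raw response body (bytes)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


request = build_video_request("A curious raccoon exploring a garden, cinematic")
# body = send(request)  # uncomment once all four processes are running
```

Separating request construction from sending makes it possible to inspect the payload without a running server.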
+ +### Multi-Machine Example + +The multi-machine setup uses the same CLI pattern: replace `127.0.0.1` with the actual IPs and add +RDMA flags for direct transfer: + +```bash +# Machine A (10.0.0.1): Encoder +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --disagg-role encoder \ + --disagg-server-addr tcp://10.0.0.4:19655 \ + --scheduler-port 19000 \ + --num-gpus 1 \ + --disagg-p2p-hostname 10.0.0.1 --disagg-ib-device mlx5_0 + +# Machine B (10.0.0.2): Denoiser (4 GPUs with SP) +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --disagg-role denoiser \ + --disagg-server-addr tcp://10.0.0.4:19655 \ + --scheduler-port 19001 \ + --num-gpus 4 --denoiser-sp 4 --denoiser-ulysses 2 --denoiser-ring 2 \ + --disagg-p2p-hostname 10.0.0.2 --disagg-ib-device mlx5_0 + +# Machine C (10.0.0.3): Decoder +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --disagg-role decoder \ + --disagg-server-addr tcp://10.0.0.4:19655 \ + --scheduler-port 19002 \ + --num-gpus 1 \ + --disagg-p2p-hostname 10.0.0.3 --disagg-ib-device mlx5_0 + +# Machine D (10.0.0.4): DiffusionServer head +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --disagg-role server \ + --encoder-urls "tcp://10.0.0.1:19000" \ + --denoiser-urls "tcp://10.0.0.2:19001" \ + --decoder-urls "tcp://10.0.0.3:19002" \ + --host 0.0.0.0 --port 30000 \ + --scheduler-port 19655 \ + --disagg-dispatch-policy max_free_slots +``` + +> ZMQ handles startup order gracefully: the role instances and the head node can start in any order. + +## Multiple Instances per Role + +Use semicolons in `--*-urls` to register multiple instances: + +```bash +# 2 encoders + 2 denoisers (4-GPU SP each) + 1 decoder +sglang serve --model-path ... 
--disagg-role server \ + --encoder-urls "tcp://10.0.0.1:35000;tcp://10.0.0.2:35000" \ + --denoiser-urls "tcp://10.0.0.3:35000;tcp://10.0.0.4:35000" \ + --decoder-urls "tcp://10.0.0.5:35000" +``` + +## Port Convention + +Result endpoints are derived deterministically from the head node's `--scheduler-port` (default: 5555): + +| Socket | Port | +|--------|------| +| DS frontend (ROUTER) | `scheduler_port` | +| Encoder result (PULL) | `scheduler_port + 1` | +| Denoiser result (PULL) | `scheduler_port + 2` | +| Decoder result (PULL) | `scheduler_port + 3` | + +Role instances derive their result endpoint automatically from `--disagg-server-addr`. No manual endpoint configuration is needed. + +## Transfer Mechanism + +Tensor data between roles (encoder→denoiser, denoiser→decoder) is transferred via a P2P transfer engine. The DiffusionServer only routes lightweight control messages (alloc/push/ready); actual tensor data flows directly between instances. + +**mooncake-transfer-engine** is required for disaggregated diffusion. It provides RDMA for direct GPU-to-GPU data movement. + +```bash +pip install mooncake-transfer-engine +``` + +### Transfer Flow + +1. **Sender** (encoder/denoiser) stages tensors: an async copy to the transfer buffer (GPU or CPU pinned, depending on GPUDirect support), overlapped with metadata JSON serialization. +2. **Sender** sends a `transfer_staged` control message to the DiffusionServer (metadata only, no tensor data). +3. **DiffusionServer** sends `transfer_alloc` to the receiver → the receiver allocates a buffer slot → replies `transfer_allocated`. +4. **DiffusionServer** sends `transfer_push` to the receiver with the sender's address info. +5. **Receiver** pulls the data via the transfer engine (Mooncake RDMA or mock), then sends `transfer_ready`. +6. **Receiver** loads tensors asynchronously on a dedicated transfer stream, overlapped with the previous request's compute. 
+ +Decoder results (final output) flow back through the DiffusionServer as raw ZMQ frames to the HTTP client. 
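The port convention and the semicolon-separated `--*-urls` values described above can be sketched as follows. The function names `result_ports` and `split_urls` are hypothetical and only illustrate the arithmetic in the Port Convention table and the URL list format; they are not part of SGLang.

```python
def result_ports(scheduler_port=5555):
    """Derive the head node's sockets from its --scheduler-port (default 5555)."""
    return {
        "ds_frontend": scheduler_port,          # ROUTER
        "encoder_result": scheduler_port + 1,   # PULL
        "denoiser_result": scheduler_port + 2,  # PULL
        "decoder_result": scheduler_port + 3,   # PULL
    }


def split_urls(value):
    """Split a semicolon-separated --*-urls value into individual endpoints."""
    return [url.strip() for url in value.split(";") if url.strip()]


# Head node from the examples above: --scheduler-port 19655
print(result_ports(19655))
print(split_urls("tcp://10.0.0.1:35000;tcp://10.0.0.2:35000"))
```

With `--scheduler-port 19655` the head node occupies ports 19655 through 19658, so role instances on the same machine should pick scheduler ports outside that range, as the single-machine example does with 19000 to 19002.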
+ +### RDMA Flags + +| Flag | Default | Description | +|------|---------|-------------| +| `--disagg-p2p-hostname` | `127.0.0.1` | RDMA-reachable hostname/IP of this instance | +| `--disagg-ib-device` | `None` | InfiniBand device (e.g., `mlx5_0`, `mlx5_roce0`) | +| `--disagg-transfer-pool-size` | 256 MiB | Pinned memory pool per instance | + +Set `--disagg-p2p-hostname` to the actual IP on each machine. For multi-machine, `--disagg-ib-device` specifies the RDMA NIC. + +## Per-Role Parallelism + +| Flag | Description | +|------|-------------| +| `--encoder-tp` | Encoder tensor parallelism | +| `--denoiser-tp` / `--denoiser-sp` / `--denoiser-ulysses` / `--denoiser-ring` | Denoiser parallelism | +| `--decoder-tp` | Decoder tensor parallelism | + +If not specified, parallelism is auto-derived from `--num-gpus`. + +## Other Options + +| Flag | Default | Description | +|------|---------|-------------| +| `--disagg-timeout` | `600` | Timeout (seconds) for pending requests | +| `--disagg-dispatch-policy` | `round_robin` | `round_robin` or `max_free_slots` | + +## Python API + +For programmatic single-machine deployment, `launch_pool_disagg_server()` is available: + +```python +from sglang.multimodal_gen.runtime.server_args import ServerArgs +from sglang.multimodal_gen.runtime.launch_server import launch_pool_disagg_server + +server_args = ServerArgs.from_kwargs( + model_path="Wan-AI/Wan2.1-T2V-14B-Diffusers", + denoiser_sp=4, denoiser_ulysses=2, denoiser_ring=2, + disagg_ib_device="mlx5_0", +) + +launch_pool_disagg_server( + server_args, + encoder_gpus=[[0]], + denoiser_gpus=[[1, 2, 3, 4], [5, 6, 7, 8]], + decoder_gpus=[[0]], +) +``` + +## Architecture + +``` +Client ─── HTTP (port 30000) ──► FastAPI Server + │ + ▼ + DiffusionServer (ROUTER, scheduler_port) + ┌───────┼───────┐ + PUSH work │ │ │ PUSH work + ▼ │ ▼ + Encoder[0..N] │ Decoder[0..K] + │ │ ▲ + P2P tensor │ │ │ P2P tensor + transfer ▼ │ │ transfer + Denoiser[0..M] ─────┘ + │ + PULL results ◄────┘ (decoder → DS → 
client) +``` + +### Request State Machine + +``` +PENDING → ENCODER_WAITING → ENCODER_RUNNING → ENCODER_DONE + │ + DENOISING_WAITING → DENOISING_RUNNING → DENOISING_DONE + │ + DECODER_WAITING → DECODER_RUNNING → DONE +``` + +Any state can transition to `FAILED` or `TIMED_OUT`. diff --git a/docs/diffusion/environment_variables.md b/docs/diffusion/environment_variables.md new file mode 100644 index 000000000000..745c84af27f6 --- /dev/null +++ b/docs/diffusion/environment_variables.md @@ -0,0 +1,101 @@ +# Environment Variables + +## Runtime + +| Environment Variable | Default | Description | +|----------------------|---------|-------------| +| `SGLANG_DIFFUSION_TARGET_DEVICE` | `cuda` | Target device for inference (`cuda`, `rocm`, `xpu`, `npu`, `musa`, `mps`, `cpu`) | +| `SGLANG_DIFFUSION_ATTENTION_BACKEND` | not set | Override attention backend via env var (e.g. `fa`, `torch_sdpa`, `sage_attn`) | +| `SGLANG_DIFFUSION_ATTENTION_CONFIG` | not set | Path to attention backend configuration file (JSON/YAML) | +| `SGLANG_DIFFUSION_STAGE_LOGGING` | false | Enable per-stage timing logs | +| `SGLANG_DIFFUSION_SERVER_DEV_MODE` | false | Enable dev-only HTTP endpoints for debugging | +| `SGLANG_DIFFUSION_TORCH_PROFILER_DIR` | not set | Directory for torch profiler traces (absolute path). 
Enables profiling when set | +| `SGLANG_DIFFUSION_CACHE_ROOT` | `~/.cache/sgl_diffusion` | Root directory for cache files | +| `SGLANG_DIFFUSION_CONFIG_ROOT` | `~/.config/sgl_diffusion` | Root directory for configuration files | +| `SGLANG_DIFFUSION_LOGGING_LEVEL` | `INFO` | Default logging level | +| `SGLANG_DIFFUSION_WORKER_MULTIPROC_METHOD` | `fork` | Multiprocess context for workers (`fork` or `spawn`) | +| `SGLANG_USE_RUNAI_MODEL_STREAMER` | true | Use Run:AI model streamer for model loading | + +## Platform-Specific + +### Apple MPS + +| Environment Variable | Default | Description | +|----------------------|---------|--------------------------------------------------------------| +| `SGLANG_USE_MLX` | not set | Set to `1` to enable MLX fused Metal kernels for norm ops on MPS | + +### ROCm (AMD GPUs) + +| Environment Variable | Default | Description | +|----------------------|---------|-------------| +| `SGLANG_USE_ROCM_VAE` | false | Use AITer GroupNorm in VAE for improved performance on ROCm | +| `SGLANG_USE_ROCM_CUDNN_BENCHMARK` | false | Enable MIOpen auto-tuning for VAE conv layers on ROCm | + +### Quantization + +| Environment Variable | Default | Description | +|----------------------|---------|-------------| +| `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND` | not set | FlashInfer FP4 GEMM backend for generic NVFP4 fallback | + +## Caching Acceleration + +These variables configure caching acceleration for Diffusion Transformer (DiT) models. +SGLang supports multiple caching strategies - see [caching documentation](performance/cache/index.md) for an overview. + +### Cache-DiT Configuration + +See [cache-dit documentation](performance/cache/cache_dit.md) for detailed configuration. 
+ +| Environment Variable | Default | Description | +|-------------------------------------|---------|------------------------------------------| +| `SGLANG_CACHE_DIT_ENABLED` | false | Enable Cache-DiT acceleration | +| `SGLANG_CACHE_DIT_FN` | 1 | First N blocks to always compute | +| `SGLANG_CACHE_DIT_BN` | 0 | Last N blocks to always compute | +| `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching | +| `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold | +| `SGLANG_CACHE_DIT_MC` | 3 | Max continuous cached steps | +| `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator | +| `SGLANG_CACHE_DIT_TS_ORDER` | 1 | TaylorSeer order (1 or 2) | +| `SGLANG_CACHE_DIT_SCM_PRESET` | none | SCM preset (none/slow/medium/fast/ultra) | +| `SGLANG_CACHE_DIT_SCM_POLICY` | dynamic | SCM caching policy | +| `SGLANG_CACHE_DIT_SCM_COMPUTE_BINS` | not set | Custom SCM compute bins | +| `SGLANG_CACHE_DIT_SCM_CACHE_BINS` | not set | Custom SCM cache bins | + +### Cache-DiT Secondary Transformer + +For dual-transformer models (e.g., Wan2.2 with high/low-noise experts), these variables configure caching for the secondary transformer. Each falls back to its primary counterpart if not set. 
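The fallback rule for the secondary transformer can be sketched as a simple lookup chain. This is an illustration of the documented behavior rather than SGLang's actual implementation; the function name `secondary_cache_dit_env` is hypothetical, and values are returned as raw strings without type parsing.

```python
import os


def secondary_cache_dit_env(name, default, environ=os.environ):
    """Resolve a secondary Cache-DiT setting, e.g. name="RDT".

    Lookup order: SGLANG_CACHE_DIT_SECONDARY_<name>, then the primary
    SGLANG_CACHE_DIT_<name>, then the supplied default.
    """
    for key in (f"SGLANG_CACHE_DIT_SECONDARY_{name}", f"SGLANG_CACHE_DIT_{name}"):
        value = environ.get(key)
        if value is not None:
            return value
    return default


# Only the primary threshold is set, so the secondary transformer inherits it.
example_env = {"SGLANG_CACHE_DIT_RDT": "0.3"}
print(secondary_cache_dit_env("RDT", "0.24", example_env))
```

Passing the environment as a parameter keeps the sketch easy to test without modifying the real process environment.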
+ +| Environment Variable | Default | Description | +|-------------------------------------|---------|------------------------------------------| +| `SGLANG_CACHE_DIT_SECONDARY_FN` | (from primary) | First N blocks to always compute | +| `SGLANG_CACHE_DIT_SECONDARY_BN` | (from primary) | Last N blocks to always compute | +| `SGLANG_CACHE_DIT_SECONDARY_WARMUP` | (from primary) | Warmup steps before caching | +| `SGLANG_CACHE_DIT_SECONDARY_RDT` | (from primary) | Residual difference threshold | +| `SGLANG_CACHE_DIT_SECONDARY_MC` | (from primary) | Max continuous cached steps | +| `SGLANG_CACHE_DIT_SECONDARY_TAYLORSEER` | (from primary) | Enable TaylorSeer calibrator | +| `SGLANG_CACHE_DIT_SECONDARY_TS_ORDER` | (from primary) | TaylorSeer order (1 or 2) | + +## Cloud Storage + +These variables configure S3-compatible cloud storage for automatically uploading generated images and videos. + +| Environment Variable | Default | Description | +|---------------------------------|---------|--------------------------------------------------------| +| `SGLANG_CLOUD_STORAGE_TYPE` | not set | Set to `s3` to enable cloud storage | +| `SGLANG_S3_BUCKET_NAME` | not set | The name of the S3 bucket | +| `SGLANG_S3_ENDPOINT_URL` | not set | Custom endpoint URL (for MinIO, OSS, etc.) | +| `SGLANG_S3_REGION_NAME` | us-east-1 | AWS region name | +| `SGLANG_S3_ACCESS_KEY_ID` | not set | AWS Access Key ID | +| `SGLANG_S3_SECRET_ACCESS_KEY` | not set | AWS Secret Access Key | + +## CUDA Crash Debugging + +These variables enable kernel API logging and optional input/output dumps around diffusion CUDA kernel call boundaries. They are useful when tracking down CUDA crashes such as illegal memory access, device-side assert, or shape mismatches in custom kernels. + +| Environment Variable | Default | Description | +|----------------------|---------|-------------| +| `SGLANG_KERNEL_API_LOGLEVEL` | `0` | Controls crash-debug kernel API logging. 
`1` logs API names, `3` logs tensor metadata, `5` adds tensor statistics, and `10` also writes dump snapshots. | +| `SGLANG_KERNEL_API_LOGDEST` | `stdout` | Destination for crash-debug kernel API logs. Use `stdout`, `stderr`, or a file path. `%i` is replaced with the process PID. | +| `SGLANG_KERNEL_API_DUMP_DIR` | `sglang_kernel_api_dumps` | Output directory for level-10 kernel API dumps. `%i` is replaced with the process PID. | +| `SGLANG_KERNEL_API_DUMP_INCLUDE` | not set | Comma-separated wildcard patterns for kernel API names to include in level-10 dumps. | +| `SGLANG_KERNEL_API_DUMP_EXCLUDE` | not set | Comma-separated wildcard patterns for kernel API names to exclude from level-10 dumps. | diff --git a/docs/diffusion/index.md b/docs/diffusion/index.md new file mode 100644 index 000000000000..e0790d9e7f73 --- /dev/null +++ b/docs/diffusion/index.md @@ -0,0 +1,53 @@ +# SGLang Diffusion + +SGLang Diffusion is a high-performance inference framework for image and video generation. It provides native SGLang pipelines, diffusers backend support, an OpenAI-compatible server, and an optimized kernel stack built on both precompiled `sgl-kernel` operators and JIT kernels for key inference paths. 
+ +## Key Features + +- Broad model support across Wan, Hunyuan, Qwen-Image, FLUX, Z-Image, GLM-Image, and more +- Fast inference with `sgl-kernel`, JIT kernels, scheduler improvements, and caching acceleration +- Multiple interfaces: `sglang generate`, `sglang serve`, and an OpenAI-compatible API +- Multi-platform support for NVIDIA, AMD, Intel XPU, Ascend, Apple Silicon, and Moore Threads + +## Quick Start + +```bash +uv pip install "sglang[diffusion]" --prerelease=allow +``` + +```bash +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A beautiful sunset over the mountains" \ + --save-output +``` + +```bash +sglang serve --model-path Qwen/Qwen-Image --port 30010 +``` + +## Start Here + +- [Installation](installation.md): install SGLang Diffusion and platform dependencies +- [Compatibility Matrix](compatibility_matrix.md): check model, optimization, and component override support +- [CLI](api/cli.md): run one-off generation jobs or launch a persistent server +- [OpenAI-Compatible API](api/openai_api.md): send image and video requests to the HTTP server +- [Attention Backends](performance/attention_backends.md): choose the best backend for your model and hardware +- [Caching Acceleration](performance/cache/index.md): use Cache-DiT or TeaCache to reduce denoising cost +- [Quantization](quantization.md): load quantized transformer checkpoints +- [Contributing](contributing.md): contribution workflow, adding new models, and CI perf baselines + +## Additional Documentation + +- [Post-Processing](api/post_processing.md): frame interpolation and upscaling +- [Performance Overview](performance/index.md): overview of attention, caching, and profiling +- [Environment Variables](environment_variables.md): platform, caching, storage, and debugging configuration +- [Support New Models](support_new_models.md): implementation guide for new diffusion pipelines +- [CI Performance](ci_perf.md): performance baseline generation + +## References + +- [SGLang 
GitHub](https://github.com/sgl-project/sglang) +- [Cache-DiT](https://github.com/vipshop/cache-dit) +- [FastVideo](https://github.com/hao-ai-lab/FastVideo) +- [xDiT](https://github.com/xdit-project/xDiT) +- [Diffusers](https://github.com/huggingface/diffusers) diff --git a/docs/diffusion/installation.md b/docs/diffusion/installation.md new file mode 100644 index 000000000000..46fbab063058 --- /dev/null +++ b/docs/diffusion/installation.md @@ -0,0 +1,128 @@ +# Install SGLang-Diffusion + +You can install SGLang-Diffusion using one of the methods below. The standard installation already includes SGLang's optimized kernel stack, including both `sgl-kernel` and JIT kernels used by diffusion workloads. + +## Standard Installation (NVIDIA GPUs) + +### Method 1: With pip or uv + +It is recommended to use uv for a faster installation: + +```bash +pip install --upgrade pip +pip install uv +uv pip install "sglang[diffusion]" --prerelease=allow +``` + +### Method 2: From source + +```bash +# Use the latest release branch +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Install the Python packages +pip install --upgrade pip +pip install -e "python[diffusion]" + +# With uv +uv pip install -e "python[diffusion]" --prerelease=allow +``` + +### Method 3: Using Docker + +The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang), built from the [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile). +Replace `` below with your HuggingFace Hub [token](https://huggingface.co/docs/hub/en/security-tokens). + +```bash +docker run --gpus all \ + --shm-size 32g \ + -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=" \ + --ipc=host \ + lmsysorg/sglang:dev \ + zsh -c '\ + echo "Installing diffusion dependencies..." && \ + pip install -e "python[diffusion]" && \ + echo "Starting SGLang-Diffusion..." 
&& \ + sglang generate \ + --model-path black-forest-labs/FLUX.1-dev \ + --prompt "A logo With Bold Large text: SGL Diffusion" \ + --save-output \ + ' +``` + +## Platform-Specific: ROCm (AMD GPUs) + +For AMD Instinct GPUs (e.g., MI300X), you can use the ROCm-enabled Docker image: + +```bash +docker run --device=/dev/kfd --device=/dev/dri --ipc=host \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env HF_TOKEN= \ + lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \ + sglang generate --model-path black-forest-labs/FLUX.1-dev --prompt "A logo With Bold Large text: SGL Diffusion" --save-output +``` + +For detailed ROCm system configuration and installation from source, see [AMD GPUs](../platforms/amd_gpu.md). + +## Platform-Specific: MUSA (Moore Threads GPUs) + +For Moore Threads GPUs (MTGPU) with the MUSA software stack, please follow the instructions below to install from source: + +```bash +# Clone the repository +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Install the Python packages +pip install --upgrade pip +rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml +pip install -e "python[all_musa]" +``` + +## Platform-Specific: Intel XPU + +For Intel Data Center GPU Max or Arc GPUs, follow the [XPU installation guide](../platforms/xpu.md) to set up the base environment, then install diffusion dependencies: + +```bash +pip install -e "python[diffusion]" +``` + +## Platform-Specific: Ascend NPU + +For Ascend NPU, please follow the [NPU installation guide](../platforms/ascend/ascend_npu.md). 
+ +Quick test: + +```bash +sglang generate --model-path black-forest-labs/FLUX.1-dev \ + --prompt "A logo With Bold Large text: SGL Diffusion" \ + --save-output +``` + +## Platform-Specific: Apple MPS + +For Apple MPS, please follow the instructions below to install from source: + +```bash +# Install ffmpeg +brew install ffmpeg + +# Install uv +brew install uv + +# Clone the repository +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Create and activate a virtual environment +uv venv -p 3.11 sglang-diffusion +source sglang-diffusion/bin/activate + +# Install the Python packages +uv pip install --upgrade pip +rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml +uv pip install -e "python[all_mps]" +``` diff --git a/docs/diffusion/performance/attention_backends.md b/docs/diffusion/performance/attention_backends.md new file mode 100644 index 000000000000..1927185350fa --- /dev/null +++ b/docs/diffusion/performance/attention_backends.md @@ -0,0 +1,154 @@ +# Attention Backends + +This document describes the attention backends available in sglang diffusion (`sglang.multimodal_gen`) and how to select them. + +## Overview + +Attention backends are defined by `AttentionBackendEnum` (`sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum`) and selected via the CLI flag `--attention-backend`. + +Backend selection is performed by the shared attention layers (e.g. `LocalAttention` / `USPAttention` / `UlyssesAttention` in `sglang.multimodal_gen.runtime.layers.attention.layer`) and therefore applies to any model component using these layers (e.g. diffusion transformer / DiT and encoders). + +When using the diffusers backend, `--attention-backend` is passed through to diffusers' +`set_attention_backend` (e.g., `flash`, `_flash_3_hub`, `sage`, `xformers`, `native`). + +- **CUDA**: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA. 
+- **ROCm**: uses FlashAttention when available; otherwise falls back to PyTorch SDPA. +- **Intel XPU**: uses the XPU Flash Attention backend (fp16/bf16, head sizes 64/96/128/192/256); otherwise falls back to PyTorch SDPA. +- **MUSA**: uses FlashAttention when available; otherwise falls back to PyTorch SDPA. +- **MPS**: always uses PyTorch SDPA. +- **NPU**: uses FA for ring attention; otherwise uses PyTorch SDPA. + +## Backend options + +For SGLang-native pipelines, the CLI accepts the lowercase names of `AttentionBackendEnum`. The table below lists the backends implemented by the built-in platforms. `fa3`/`fa4` are accepted as aliases for `fa`. + +| CLI value | Enum value | Notes | +|---|---|---| +| `fa` / `fa3` / `fa4` | `FA` | FlashAttention. `fa3/fa4` are normalized to `fa` during argument parsing (`ServerArgs.__post_init__`). | +| `torch_sdpa` | `TORCH_SDPA` | PyTorch `scaled_dot_product_attention`. | +| `sliding_tile_attn` | `SLIDING_TILE_ATTN` | Sliding Tile Attention (STA). Requires `st_attn`. Configure via `--attention-backend-config`. | +| `sage_attn` | `SAGE_ATTN` | Requires `sageattention`. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream `setup.py`: https://github.com/thu-ml/SageAttention/blob/main/setup.py. | +| `sage_attn_3` | `SAGE_ATTN_3` | Requires SageAttention3 installed per upstream instructions. | +| `video_sparse_attn` | `VIDEO_SPARSE_ATTN` | Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. | +| `vmoba_attn` | `VMOBA_ATTN` | Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. | +| `aiter` | `AITER` | Requires `aiter`. | +| `aiter_sage` | `AITER_SAGE` | Requires `aiter`. | +| `sla_attn` | `SLA_ATTN` | Sparse Linear Attention. Requires `SpargeAttn`. Install with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation`. 
| +| `sage_sla_attn` | `SAGE_SLA_ATTN` | SageAttention + Sparse Linear Attention. Requires `SpargeAttn` (same install as SLA). | +| `sparse_video_gen_2_attn` | `SPARSE_VIDEO_GEN_2_ATTN` | Requires `svg`. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. | + +## Selection priority + +The selection order in `runtime/layers/attention/selector.py` is: + +1. `global_force_attn_backend(...)` / `global_force_attn_backend_context_manager(...)` +2. Component override from `--component-attention-backends` while that component is being constructed +3. CLI `--attention-backend` (`ServerArgs.attention_backend`) +4. Auto selection (platform capability, dtype, and installed packages) + +## Configuration + +Some backends require additional configuration. You can pass these parameters via `--attention-backend-config`. This argument accepts: +- A path to a JSON or YAML configuration file. +- A JSON string (e.g., `'{"sparsity": 0.5}'`). +- Key-value pairs (e.g., `"sparsity=0.5,enable_x=true"`). + +### Supported Configuration Parameters + +**Sliding Tile Attention (`sliding_tile_attn`)** + +| Parameter | Type | Description | Default | +| :--- | :--- | :--- | :--- | +| `mask_strategy_file_path` | `str` | **Required.** Path to the mask strategy JSON file. | - | +| `sta_mode` | `str` | Mode of STA. | `STA_inference` | +| `skip_time_steps` | `int` | Number of steps to use full attention before switching to sparse attention. | `15` | + +**Video Sparse Attention (`video_sparse_attn`)** + +| Parameter | Type | Description | Default | +| :--- | :--- | :--- | :--- | +| `sparsity` | `float` | Validation sparsity (0.0 - 1.0). | `0.0` | + +**V-MoBA (`vmoba_attn`)** + +| Parameter | Type | Description | Default | +| :--- | :--- | :--- | :--- | +| `temporal_chunk_size` | `int` | Chunk size for temporal dimension. | - | +| `temporal_topk` | `int` | Top-K tokens to select in temporal dimension. 
| - | +| `spatial_chunk_size` | `list[int]` | Chunk size for spatial dimension (H, W). | - | +| `spatial_topk` | `int` | Top-K tokens to select in spatial dimension. | - | +| `st_chunk_size` | `list[int]` | Chunk size for spatiotemporal dimension (T, H, W). | - | +| `st_topk` | `int` | Top-K tokens to select in spatiotemporal dimension. | - | +| `moba_select_mode` | `str` | Selection mode (e.g., `threshold`). | `threshold` | +| `moba_threshold` | `float` | Threshold value for selection. | `0.25` | +| `moba_threshold_type` | `str` | Type of thresholding (e.g., `query_head`). | `query_head` | +| `first_full_step` | `int` | Number of initial steps to use full attention. | `12` | +| `first_full_layer` | `int` | Number of initial layers to use full attention. | `0` | +| `temporal_layer` | `int` | Number of temporal layers. | `1` | +| `spatial_layer` | `int` | Number of spatial layers. | `1` | +| `st_layer` | `int` | Number of spatiotemporal layers. | `1` | + +## Platform support matrix + +| Backend | CUDA | ROCm | XPU | MUSA | MPS | NPU | Notes | +|---|---:|---:|---:|---:|---:|---:|---| +| `fa` | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | CUDA requires SM80+ and fp16/bf16. XPU uses its own flash attention backend. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. No extra installations are required for NPU | +| `torch_sdpa` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Most compatible option across platforms. | +| `sliding_tile_attn` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires `st_attn`. Configure via `--attention-backend-config`. | +| `sage_attn` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). | +| `sage_attn_3` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). | +| `video_sparse_attn` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. | +| `sla_attn` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires `SpargeAttn`. | +| `sage_sla_attn` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. 
Requires `SpargeAttn`. | +| `vmoba_attn` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. | +| `aiter` | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | Requires `aiter`. | +| `aiter_sage` | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | Requires `aiter`. | +| `sparse_video_gen_2_attn` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires `svg`. | + +## Usage + +### Select a backend via CLI + +```bash +sglang generate \ + --model-path \ + --prompt "..." \ + --attention-backend fa +``` + +```bash +sglang generate \ + --model-path \ + --prompt "..." \ + --attention-backend torch_sdpa +``` + +### Override one component + +Use component overrides when a specific module needs different attention semantics from the main transformer: + +```bash +sglang generate \ + --model-path \ + --prompt "..." \ + --attention-backend fa \ + --component-attention-backends text_encoder=torch_sdpa +``` + +Component keys match pipeline module names from `model_index.json`, such as `text_encoder`, `text_encoder_2`, `transformer`, `transformer_2`, or `connectors`. + +### Using Sliding Tile Attention (STA) + +```bash +# Pass the mask strategy file path via config +sglang generate \ + --model-path \ + --prompt "..." \ + --attention-backend sliding_tile_attn \ + --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json" +``` + +### Notes for ROCm / MPS + +- ROCm: use `--attention-backend torch_sdpa` or `fa` depending on what is available in your environment. +- MPS: the platform implementation always uses `torch_sdpa`. 
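As a rough illustration of the selection priority described on this page, the following sketch walks the four sources in order. It is a hedged sketch only: `resolve_backend`, `normalize`, and their parameters are hypothetical names and do not reflect the actual contents of `selector.py`. Only the ordering and the `fa3`/`fa4` aliasing are taken from this page.

```python
from typing import Optional

# Aliases that this page says are normalized to "fa" during argument parsing.
_ALIASES = {"fa3": "fa", "fa4": "fa"}


def normalize(name: Optional[str]) -> Optional[str]:
    """Lowercase a backend name and fold fa3/fa4 into fa."""
    if name is None:
        return None
    name = name.lower()
    return _ALIASES.get(name, name)


def resolve_backend(
    forced: Optional[str] = None,
    component_override: Optional[str] = None,
    cli_value: Optional[str] = None,
    auto_choice: str = "torch_sdpa",
) -> str:
    """Hypothetical helper: return the first backend that is set, in priority order.

    Order: global force > component override > CLI flag > auto selection.
    """
    for candidate in (forced, component_override, cli_value):
        if candidate is not None:
            return normalize(candidate)
    # Auto selection is platform-dependent in practice; a fixed default stands in here.
    return normalize(auto_choice)
```

For instance, a component override of `torch_sdpa` would win over a CLI value of `fa`, while a global force would win over both.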
diff --git a/docs/diffusion/performance/cache/cache_dit.md b/docs/diffusion/performance/cache/cache_dit.md new file mode 100644 index 000000000000..9f804ce543be --- /dev/null +++ b/docs/diffusion/performance/cache/cache_dit.md @@ -0,0 +1,418 @@ +# Cache-DiT + +SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to **1.69x inference speedup** with minimal quality loss. + +## Overview + +**Cache-DiT** uses intelligent caching strategies to skip redundant computation in the denoising loop: + +- **DBCache (Dual Block Cache)**: Dynamically decides when to cache transformer blocks based on residual differences +- **TaylorSeer**: Uses Taylor expansion for calibration to optimize caching decisions +- **SCM (Step Computation Masking)**: Step-level caching control for additional speedup + +## Basic Usage + +Enable Cache-DiT by exporting the environment variable and using `sglang generate` or `sglang serve` : + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A beautiful sunset over the mountains" +``` + +## Diffusers Backend + +Cache-DiT supports loading acceleration configs from a custom YAML file. For +diffusers pipelines (`diffusers` backend), pass the YAML/JSON path via `--cache-dit-config`. This +flow requires cache-dit >= 1.2.0 (`cache_dit.load_configs`). 
+ +### Single GPU inference + +Define a `cache.yaml` file that contains: + +- DBCache + TaylorSeer + +```yaml +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 +``` + +Then apply the config with: + +```bash +sglang generate \ + --backend diffusers \ + --model-path Qwen/Qwen-Image \ + --cache-dit-config cache.yaml \ + --prompt "A beautiful sunset over the mountains" +``` + +- DBCache + TaylorSeer + SCM (Step Computation Mask) + +```yaml +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 + # Must set the num_inference_steps for SCM. The SCM will automatically + # generate the steps computation mask based on the num_inference_steps. + # Reference: https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/#scm-steps-computation-masking + num_inference_steps: 28 + steps_computation_mask: fast +``` + +- DBCache + TaylorSeer + SCM (Step Computation Mask) + Cache CFG + +```yaml +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 + num_inference_steps: 28 + steps_computation_mask: fast + enable_sperate_cfg: true # e.g, Qwen-Image, Wan, Chroma, Ovis-Image, etc. +``` + +### Distributed inference + +- 1D Parallelism + +Define a parallelism only config yaml `parallel.yaml` file that contains: + +```yaml +parallelism_config: + ulysses_size: auto + attention_backend: native +``` + +Then, apply the distributed inference acceleration config from yaml. 
`ulysses_size: auto` means that cache-dit will auto-detect the `world_size` and use it as the ulysses_size. Otherwise, set it manually to a specific integer, e.g., 4. + +Then apply the distributed config with: (Note: please add `--num-gpus N` to specify the number of GPUs for distributed inference) + +```bash +sglang generate \ + --backend diffusers \ + --num-gpus 4 \ + --model-path Qwen/Qwen-Image \ + --cache-dit-config parallel.yaml \ + --prompt "A futuristic cityscape at sunset" +``` + +- 2D Parallelism + +You can also define a 2D parallelism config yaml `parallel_2d.yaml` file that contains: + +```yaml +parallelism_config: + ulysses_size: auto + tp_size: 2 + attention_backend: native +``` +Then, apply the 2D parallelism config from yaml. Here `tp_size: 2` means using tensor parallelism with size 2. The `ulysses_size: auto` means that cache-dit will auto-detect `world_size // tp_size` and use it as the ulysses_size. + +- 3D Parallelism + +You can also define a 3D parallelism config yaml `parallel_3d.yaml` file that contains: + +```yaml +parallelism_config: + ulysses_size: 2 + ring_size: 2 + tp_size: 2 + attention_backend: native +``` +Then, apply the 3D parallelism config from yaml. Here `ulysses_size: 2`, `ring_size: 2`, `tp_size: 2` mean using ulysses parallelism with size 2, ring parallelism with size 2, and tensor parallelism with size 2. + +- Ulysses Anything Attention + +To enable Ulysses Anything Attention, you can define a parallelism config yaml `parallel_uaa.yaml` file that contains: + +```yaml +parallelism_config: + ulysses_size: auto + attention_backend: native + ulysses_anything: true +``` + +- Ulysses FP8 Communication + +For devices without NVLink support, you can enable Ulysses FP8 Communication to further reduce the communication overhead. 
You can define a parallelism config yaml `parallel_fp8.yaml` file that contains: + +```yaml +parallelism_config: + ulysses_size: auto + attention_backend: native + ulysses_float8: true +``` + +- Async Ulysses CP + +You can also enable async ulysses CP to overlap the communication and computation. Define a parallelism config yaml `parallel_async.yaml` file that contains: + +```yaml +parallelism_config: + ulysses_size: auto + attention_backend: native + ulysses_async: true # Now, only support for FLUX.1, Qwen-Image, Ovis-Image and Z-Image. +``` +Then, apply the config from yaml. Here `ulysses_async: true` means enabling async ulysses CP. + +- TE-P and VAE-P + +You can also specify the extra parallel modules in the yaml config. For example, define a parallelism config yaml `parallel_extra.yaml` file that contains: + +```yaml +parallelism_config: + ulysses_size: auto + attention_backend: native + extra_parallel_modules: ["text_encoder", "vae"] +``` + + +### Hybrid Cache and Parallelism + +Define a hybrid cache and parallel acceleration config yaml `hybrid.yaml` file that contains: + +```yaml +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 +parallelism_config: + ulysses_size: auto + attention_backend: native + extra_parallel_modules: ["text_encoder", "vae"] +``` + +Then, apply the hybrid cache and parallel acceleration config from yaml. + +```bash +sglang generate \ + --backend diffusers \ + --num-gpus 4 \ + --model-path Qwen/Qwen-Image \ + --cache-dit-config hybrid.yaml \ + --prompt "A beautiful sunset over the mountains" +``` + +### Attention Backend + +In some cases, users may want to only specify the attention backend without any other optimization configs. 
In this case, you can define a yaml file `attention.yaml` that only contains: + +```yaml +attention_backend: "flash" # '_flash_3' for Hopper +``` + +### Quantization + +You can also specify the quantization config in the yaml file, which requires `torchao>=0.16.0`. For example, define a yaml file `quantize.yaml` that contains: + +```yaml +quantize_config: # quantization configuration for transformer modules + # float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc. + quant_type: "float8" + # Layers to exclude from quantization (transformer). Layers whose names contain any of the + # keywords in the exclude_layers list will be excluded from quantization. This is useful + # for some sensitive layers that are not robust to quantization, e.g., embedding layers. + exclude_layers: + - "embedder" + - "embed" + verbose: false # whether to print verbose logs during quantization +``` +Then, apply the quantization config from yaml. Please also enable torch.compile for better performance if you are using quantization. 
For example: + +```bash +sglang generate \ + --backend diffusers \ + --model-path Qwen/Qwen-Image \ + --warmup \ + --cache-dit-config quantize.yaml \ + --enable-torch-compile \ + --dit-cpu-offload false \ + --text-encoder-cpu-offload false \ + --prompt "A beautiful sunset over the mountains" +``` + +### Combined Configs: Cache + Parallelism + Quantization + +You can also combine all the above configs together in a single yaml file `combined.yaml` that contains: + +```yaml +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 +parallelism_config: + ulysses_size: auto + attention_backend: native + extra_parallel_modules: ["text_encoder", "vae"] +quantize_config: + quant_type: "float8" + exclude_layers: + - "embedder" + - "embed" + verbose: false +``` +Then, apply the combined cache, parallelism and quantization config from yaml. Please also enable torch.compile for better performance if you are using quantization. 
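Because `--cache-dit-config` is described above as taking a YAML/JSON path, a combined config could also be generated from a script. The sketch below assumes that a JSON file mirrors the YAML structure key for key, which this page does not confirm; `build_combined_config`, `write_config`, and the output filename are hypothetical names used only for illustration.

```python
import json
from pathlib import Path


def build_combined_config(quant_type: str = "float8") -> dict:
    """Return a dict mirroring the combined YAML example (assumed JSON-compatible)."""
    return {
        "cache_config": {
            "max_warmup_steps": 8,
            "warmup_interval": 2,
            "max_cached_steps": -1,
            "max_continuous_cached_steps": 2,
            "Fn_compute_blocks": 1,
            "Bn_compute_blocks": 0,
            "residual_diff_threshold": 0.12,
            "enable_taylorseer": True,
            "taylorseer_order": 1,
        },
        "parallelism_config": {
            "ulysses_size": "auto",
            "attention_backend": "native",
            "extra_parallel_modules": ["text_encoder", "vae"],
        },
        "quantize_config": {
            "quant_type": quant_type,
            "exclude_layers": ["embedder", "embed"],
            "verbose": False,
        },
    }


def write_config(path: str) -> Path:
    """Serialize the config to disk as JSON and return the written path."""
    out = Path(path)
    out.write_text(json.dumps(build_combined_config(), indent=2))
    return out
```

Under that assumption, the written file would be passed the same way as the YAML examples, e.g. `--cache-dit-config combined.json`.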
+ +## Advanced Configuration + +### DBCache Parameters + +DBCache controls block-level caching behavior: + +| Parameter | Env Variable | Default | Description | +|-----------|---------------------------|---------|------------------------------------------| +| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute | +| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute | +| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts | +| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold | +| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps | + +### TaylorSeer Configuration + +TaylorSeer improves caching accuracy using Taylor expansion: + +| Parameter | Env Variable | Default | Description | +|-----------|-------------------------------|---------|---------------------------------| +| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator | +| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) | + +### Combined Configuration Example + +DBCache and TaylorSeer are complementary strategies that work together, so you can configure both sets of parameters +simultaneously: + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang generate --model-path black-forest-labs/FLUX.1-dev \ + --prompt "A curious raccoon in a forest" +``` + +### SCM (Step Computation Masking) + +SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and +which to serve from cached results. 
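One way to picture step-level masking is as a list with one flag per denoising step, where `1` means "compute" and `0` means "reuse the cache". The sketch below is an assumption about the general idea, not Cache-DiT's implementation: `expand_bins` and `compute_ratio` are hypothetical helpers, and the interleaving of compute runs with cache runs is assumed rather than taken from this page.

```python
from typing import List


def expand_bins(compute_bins: List[int], cache_bins: List[int]) -> List[int]:
    """Interleave runs of computed steps (1) with runs of cached steps (0)."""
    mask: List[int] = []
    for i, n_compute in enumerate(compute_bins):
        mask.extend([1] * n_compute)
        # Follow each compute run with the matching cache run, if one exists.
        if i < len(cache_bins):
            mask.extend([0] * cache_bins[i])
    return mask


def compute_ratio(mask: List[int]) -> float:
    """Fraction of steps that are fully computed (0.0 for an empty mask)."""
    return sum(mask) / len(mask) if mask else 0.0
```

With compute bins `8,3,3,2,2` and cache bins `1,2,2,2,3`, this sketch yields a 28-step mask in which 18 steps are computed and 10 are served from the cache.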
+ +**SCM Presets** + +SCM is configured with presets: + +| Preset | Compute Ratio | Speed | Quality | +|----------|---------------|----------|------------| +| `none` | 100% | Baseline | Best | +| `slow` | ~75% | ~1.3x | High | +| `medium` | ~50% | ~2x | Good | +| `fast` | ~35% | ~3x | Acceptable | +| `ultra` | ~25% | ~4x | Lower | + +**Usage** + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_SCM_PRESET=medium \ +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A futuristic cityscape at sunset" +``` + +**Custom SCM Bins** + +For fine-grained control over which steps to compute vs cache: + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \ +SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \ +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A futuristic cityscape at sunset" +``` + +**SCM Policy** + +| Policy | Env Variable | Description | +|-----------|---------------------------------------|---------------------------------------------| +| `dynamic` | `SGLANG_CACHE_DIT_SCM_POLICY=dynamic` | Adaptive caching based on content (default) | +| `static` | `SGLANG_CACHE_DIT_SCM_POLICY=static` | Fixed caching pattern | + +## Environment Variables + +All Cache-DiT parameters can be configured via environment variables. +See [Environment Variables](../../environment_variables.md) for the complete list. + +## Supported Models + +SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion: + +| Model Family | Example Models | +|--------------|-----------------------------| +| Wan | Wan2.1, Wan2.2 | +| Flux | FLUX.1-dev, FLUX.2-dev | +| Z-Image | Z-Image-Turbo | +| Qwen | Qwen-Image, Qwen-Image-Edit | +| Hunyuan | HunyuanVideo | + +## Performance Tips + +1. **Start with defaults**: The default parameters work well for most models +2. **Use TaylorSeer**: It typically improves both speed and quality +3. 
**Tune R threshold**: Lower values = better quality, higher values = faster +4. **SCM for extra speed**: Use `medium` preset for good speed/quality balance +5. **Warmup matters**: Higher warmup = more stable caching decisions + +## Limitations + +- **SGLang-native pipelines**: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically + disabled when `world_size > 1`. +- **SCM minimum steps**: SCM requires >= 8 inference steps to be effective +- **Model support**: Only models registered in Cache-DiT's BlockAdapterRegister are supported + +## Troubleshooting + +### SCM disabled for low step count + +For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache +acceleration still works. + +## References + +- [Cache-DiT](https://github.com/vipshop/cache-dit) +- [SGLang Diffusion](../index.md) diff --git a/docs/diffusion/performance/cache/index.md b/docs/diffusion/performance/cache/index.md new file mode 100644 index 000000000000..c7f8f53efa15 --- /dev/null +++ b/docs/diffusion/performance/cache/index.md @@ -0,0 +1,65 @@ +# Caching Acceleration + +SGLang provides two complementary caching strategies for Diffusion Transformer (DiT) models. Both reduce denoising cost by skipping redundant computation, but they operate at different levels. + +## Overview + +SGLang supports two complementary caching approaches: + +| Strategy | Scope | Mechanism | Best For | +|----------|-------|-----------|----------| +| **Cache-DiT** | Block-level | Skip individual transformer blocks dynamically | Advanced, higher speedup | +| **TeaCache** | Timestep-level | Skip entire denoising steps based on L1 similarity | Simple, built-in | + +## Cache-DiT + +[Cache-DiT](https://github.com/vipshop/cache-dit) provides block-level caching with +advanced strategies like DBCache and TaylorSeer. It can achieve up to **1.69x speedup**. + +See [cache_dit.md](cache_dit.md) for detailed configuration. 
+ +### Quick Start + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A beautiful sunset over the mountains" +``` + +### Key Features + +- **DBCache**: Dynamic block-level caching based on residual differences +- **TaylorSeer**: Taylor expansion-based calibration for optimized caching +- **SCM**: Step-level computation masking for additional speedup + +## TeaCache + +TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely. + +See [teacache.md](teacache.md) for detailed documentation. + +### Quick Overview + +- Tracks L1 distance between modulated inputs across timesteps +- When accumulated distance is below threshold, reuses cached residual +- Supports CFG with separate positive/negative caches + +### Supported Models + +- Wan (wan2.1, wan2.2) +- Hunyuan (HunyuanVideo) +- Z-Image + +For Flux and Qwen models, TeaCache is automatically disabled when CFG is enabled. + +```{toctree} +:maxdepth: 1 + +cache_dit +teacache +``` + +## References + +- [Cache-DiT Repository](https://github.com/vipshop/cache-dit) +- [TeaCache Paper](https://arxiv.org/abs/2411.14324) diff --git a/python/sglang/multimodal_gen/docs/cache/teacache.md b/docs/diffusion/performance/cache/teacache.md similarity index 96% rename from python/sglang/multimodal_gen/docs/cache/teacache.md rename to docs/diffusion/performance/cache/teacache.md index 5eb0b6c19bdd..dd9691c43a4a 100644 --- a/python/sglang/multimodal_gen/docs/cache/teacache.md +++ b/docs/diffusion/performance/cache/teacache.md @@ -1,7 +1,7 @@ -# TeaCache Acceleration +# TeaCache > **Note**: This is one of two caching strategies available in SGLang. -> For an overview of all caching options, see [caching.md](caching.md). +> For an overview of all caching options, see [caching](../index.md). 
TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely. diff --git a/docs/diffusion/performance/index.md b/docs/diffusion/performance/index.md new file mode 100644 index 000000000000..2a2abe54a239 --- /dev/null +++ b/docs/diffusion/performance/index.md @@ -0,0 +1,42 @@ +# Performance + +This section covers the main performance levers for SGLang Diffusion: attention backends, caching acceleration, and profiling. + +## Overview + +| Optimization | Type | Description | +|--------------|------|-------------| +| **Cache-DiT** | Caching | Block-level caching with DBCache, TaylorSeer, and SCM | +| **TeaCache** | Caching | Timestep-level caching based on temporal similarity | +| **Attention Backends** | Kernel | Optimized attention implementations (FlashAttention, SageAttention, etc.) | +| **Profiling** | Diagnostics | PyTorch Profiler and Nsight Systems guidance | + +## Start Here + +- Use [Attention Backends](attention_backends.md) to choose the best backend for your model and hardware. +- Use [Caching Acceleration](cache/index.md) to reduce denoising cost with Cache-DiT or TeaCache. +- Use [Profiling](profiling.md) when you need to diagnose a bottleneck rather than guess. + +## Caching at a Glance + +- [Cache-DiT](cache/cache_dit.md) is block-level caching for diffusers pipelines and higher speedup-oriented tuning. +- [TeaCache](cache/teacache.md) is timestep-level caching built into SGLang model families. 
+ +```{toctree} +:maxdepth: 1 + +attention_backends +cache/index +profiling +``` + +## Current Baseline Snapshot + +For Ring SP benchmark details, see: + +- [Ring SP Performance](ring_sp_performance.md) + +## References + +- [Cache-DiT Repository](https://github.com/vipshop/cache-dit) +- [TeaCache Paper](https://arxiv.org/abs/2411.14324) diff --git a/python/sglang/multimodal_gen/docs/profiling.md b/docs/diffusion/performance/profiling.md similarity index 100% rename from python/sglang/multimodal_gen/docs/profiling.md rename to docs/diffusion/performance/profiling.md diff --git a/docs/diffusion/performance/ring_sp_performance.md b/docs/diffusion/performance/ring_sp_performance.md new file mode 100644 index 000000000000..138698bfc4f5 --- /dev/null +++ b/docs/diffusion/performance/ring_sp_performance.md @@ -0,0 +1,67 @@ +# Ring SP Benchmark: Wan2.2-TI2V-5B (u1r2 vs Baseline) + +This page reports Ring-SP performance for `Wan2.2-TI2V-5B-Diffusers` using: + +- Parallel config: `sp=2, ulysses=1, ring=2` (short: `u1r2`) +- Baseline config: `sp=1, ulysses=1, ring=1` (short: `u1r1`) + +## Benchmark Setup + +- Model: `Wan2.2-TI2V-5B-Diffusers` +- GPU: `48G RTX40 series * 2` + +## Online Serving + +### Ring SP (`u1r2`) + +```bash +sglang serve \ + --model-type diffusion \ + --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \ + --num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \ + --port 8898 +``` + +### Baseline (`u1r1`) + +```bash +sglang serve \ + --model-type diffusion \ + --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \ + --num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \ + --port 8898 +``` + +## Benchmarks + +### Benchmark Disclaimer + +These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns. 
+ +### Stage Time Breakdown + +| Stage / Metric | `u1r2` (s) | `u1r1` baseline (s) | Speedup | +|---|---:|---:|---:| +| InputValidation | 0.1060 | 0.1029 | 0.97x | +| TextEncoding | 1.3965 | 2.2261 | 1.59x | +| LatentPreparation | 0.0002 | 0.0002 | 1.00x | +| TimestepPreparation | 0.0003 | 0.0004 | 1.33x | +| Denoising | 52.6358 | 71.6785 | 1.36x | +| Decoding | 7.6708 | 13.4314 | 1.75x | +| **Total** | **63.74** | **90.63** | **1.42x** | + +### Memory Usage + +| Memory Metric | `u1r2` (GB) | `u1r1` baseline (GB) | Delta | +|---|---:|---:|---:| +| Peak GPU Memory | 20.07 | 27.40 | -7.33 | +| Peak Allocated | 13.35 | 20.40 | -7.05 | +| Memory Overhead | 6.72 | 7.00 | -0.28 | +| Overhead Ratio | 33.5% | 25.6% | +7.9pp | + +## Summary + +- End-to-end latency improves from `90.63s` to `63.74s` (`1.42x`). +- Main gains come from `Denoising` (`1.36x`) and `Decoding` (`1.75x`). +- Absolute memory usage drops noticeably on Ring-SP (`Peak GPU Memory -7.33GB`, `Peak Allocated -7.05GB`). +- Overhead ratio rises (`+7.9pp`), so future tuning can focus on reducing communication/runtime overhead while preserving the latency gain. diff --git a/docs/diffusion/quantization.md b/docs/diffusion/quantization.md new file mode 100644 index 000000000000..ccf3f8112d5c --- /dev/null +++ b/docs/diffusion/quantization.md @@ -0,0 +1,398 @@ +# Quantization + +SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep +the base model and the quantized transformer override separate. 
+ +## Quick Reference + +Use these paths: + +- `--model-path`: the base or original model +- `--transformer-path`: a quantized transformers-style transformer component directory that already contains its own `config.json` +- `--transformer-weights-path`: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID + +Recommended example: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \ + --prompt "a curious pikachu" +``` + +For quantized transformers-style transformer component folders: + +```bash +sglang generate \ + --model-path /path/to/base-model \ + --transformer-path /path/to/quantized-transformer \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" +``` + +NOTE: Some model-specific integrations also accept a quantized repo or local +directory directly as `--model-path`, but that is a compatibility path. If a +repo contains multiple candidate checkpoints, pass +`--transformer-weights-path` explicitly. + +## Quant Families + +Here, `quant_family` means a checkpoint and loading family with shared CLI +usage and loader behavior. It is not just the numeric precision or a kernel +backend. 
+ +| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes | +|-------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------------------------|-----------------------------------------|---------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------| +| `fp8` | Quantized transformer component folder, or safetensors with `quantization_config` metadata | `--transformer-path` or `--transformer-weights-path` | ALL | None | Component-folder and single-file flows are both supported | +| `modelopt-fp8` | Converted ModelOpt FP8 transformer directory or repo with `config.json` | `--transformer-path` | FLUX.1, FLUX.2, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit | None | Serialized config stays `quant_method=modelopt` with `quant_algo=FP8`; `dit_layerwise_offload` is supported and `dit_cpu_offload` stays disabled | +| `modelopt-nvfp4` | Mixed transformer directory/repo with `config.json`, or raw NVFP4 safetensors export/repo | `--transformer-path` for mixed overrides; `--transformer-weights-path` for raw exports | FLUX.1, FLUX.2, Wan2.2 | None | Mixed override repos keep the base model separate; raw exports such as `black-forest-labs/FLUX.2-dev-NVFP4` still use the weights-path flow | +| `nunchaku-svdq` | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` | +| `msmodelslim` | Pre-quantized msmodelslim transformer weights | `--model-path` | Wan2.2 family | None | Currently only compatible with the Ascend NPU family and supports both `w8a8` and `w4a4` | + +## Validated 
ModelOpt Checkpoints + +This section is the canonical support matrix for the nine diffusion ModelOpt +checkpoints currently wired up in SGLang docs and validation coverage. + +Published checkpoints keep the serialized quantization config as +`quant_method=modelopt`; the FP8 vs NVFP4 split below is a documentation label +derived from `quant_algo`. + +Eight of the nine repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the +official `black-forest-labs/FLUX.2-dev-NVFP4` repo. + +| Quant Algo | Base Model | Preferred CLI | HF Repo | Current Scope | Notes | +| --- | --- | --- | --- | --- | --- | +| `FP8` | `black-forest-labs/FLUX.1-dev` | `--transformer-path` | `lmsys/flux1-dev-modelopt-fp8-sglang-transformer` | single-transformer override, deterministic latent/image comparison, H100 benchmark, torch-profiler trace | SGLang converter keeps a validated BF16 fallback set for modulation and FF projection layers; use `--model-id FLUX.1-dev` for local mirrors | +| `FP8` | `black-forest-labs/FLUX.2-dev` | `--transformer-path` | `lmsys/flux2-dev-modelopt-fp8-sglang-transformer` | single-transformer override load and generation path | published SGLang-ready transformer override | +| `FP8` | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `--transformer-path` | `lmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformer` | primary `transformer` quantized, `transformer_2` kept BF16 | primary-transformer-only path; keep `transformer_2` on the base checkpoint, and do not describe this as dual-transformer full-model FP8 unless that path is validated separately | +| `FP8` | `hunyuanvideo-community/HunyuanVideo` | `--transformer-path` | `lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer` | single-transformer override, BF16-vs-FP8 video comparison, H100 benchmark, torch-profiler trace | HunyuanVideo uses different ModelOpt/diffusers and SGLang runtime module names; the converter maps those names before writing FP8 scale tensors and BF16 fallback ignores | +| `FP8` | `Qwen/Qwen-Image` | 
`--transformer-path` | `lmsys/qwen-image-modelopt-fp8-sglang-transformer` | single-transformer override, BF16-vs-FP8 image comparison, H100 benchmark, torch-profiler trace | shares the Qwen Image FP8 fallback preset; keep `img_in`, `txt_in`, timestep embedder, `norm_out.linear`, `proj_out`, `img_mod`/`txt_mod`, and `img_mlp.net.2` in BF16 | +| `FP8` | `Qwen/Qwen-Image-Edit-2511` | `--transformer-path` | `lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer` | TI2I edit path, BF16-vs-FP8 image comparison, H100 benchmark | shares `QwenImageTransformer2DModel` with Qwen Image and uses the same Qwen Image FP8 fallback preset | +| `NVFP4` | `black-forest-labs/FLUX.1-dev` | `--transformer-path` | `lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer` | mixed BF16+NVFP4 transformer override, correctness validation, 4x RTX 5090 benchmark, torch-profiler trace | use `build_modelopt_nvfp4_transformer.py`; validated builder keeps selected FLUX.1 modules in BF16 and sets `swap_weight_nibbles=false` | +| `NVFP4` | `black-forest-labs/FLUX.2-dev` | `--transformer-weights-path` | `black-forest-labs/FLUX.2-dev-NVFP4` | packed-QKV load path | official raw export repo; validated packed export detection and runtime layout handling | +| `NVFP4` | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `--transformer-path` | `lmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformer` | primary `transformer` quantized with ModelOpt NVFP4, `transformer_2` kept BF16 | primary-transformer-only path; keep `transformer_2` on the base checkpoint, and current B200/Blackwell bring-up uses `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn` | + +These nine checkpoints are also the intended case set for the B200 diffusion +CI job (`multimodal-gen-test-1-b200`). + +## ModelOpt FP8 + +### Usage Examples + +Converted ModelOpt FP8 checkpoints should be loaded as transformer component +overrides. If the repo or local directory already contains `config.json`, use +`--transformer-path`. 
+ +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --transformer-path lmsys/flux2-dev-modelopt-fp8-sglang-transformer \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --save-output +``` + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --transformer-path lmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformer \ + --prompt "a fox walking through neon rain" \ + --save-output +``` + +```bash +sglang generate \ + --model-path hunyuanvideo-community/HunyuanVideo \ + --transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer \ + --height 544 --width 960 --num-frames 17 \ + --prompt "A cinematic shot of a red sports car driving through rain at night" \ + --save-output +``` + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --transformer-path lmsys/qwen-image-modelopt-fp8-sglang-transformer \ + --prompt "A tiny astronaut reading a book under a glass greenhouse" \ + --save-output +``` + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image-Edit-2511 \ + --transformer-path lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer \ + --image-path /path/to/input.png \ + --prompt "Turn the scene into a warm watercolor illustration" \ + --save-output +``` + +### Notes + +- `--transformer-path` is the canonical flag for converted ModelOpt FP8 + transformer component repos or directories that already carry `config.json`. +- If the override repo or local directory contains its own `config.json`, + SGLang reads the quantization config from that override instead of relying on + the base model config. +- `--transformer-weights-path` still works when you intentionally point at raw + weight files or a directory that should be metadata-probed as weights first. +- `dit_layerwise_offload` is supported for ModelOpt FP8 checkpoints. +- `dit_cpu_offload` still stays disabled for ModelOpt FP8 checkpoints. 
+- The layerwise offload path now preserves the non-contiguous FP8 weight stride + expected by the runtime FP8 GEMM path. +- On disk, the quantization config stays `quant_method=modelopt` with + `quant_algo=FP8`; the `modelopt-fp8` label in this document is a support + family name, not a serialized config key. +- `hunyuanvideo-community/HunyuanVideo` uses the `hunyuan-video` converter + preset. Use `--model-type hunyuan-video` to force it, or rely on + auto-detection from `_class_name=HunyuanVideoTransformer3DModel`. +- The validated HunyuanVideo FP8 fallback preset keeps `context_embedder`, + `x_embedder.proj`, timestep/guidance/text embedder linear layers, + `norm_out.linear`, `proj_out`, double-block modulation linear layers, and + single-block modulation linear layers in BF16. +- HunyuanVideo ModelOpt exports use diffusers module names that do not match + SGLang runtime module names for fused QKV and fused QKV+MLP layers. The + converter maps the names before selecting scale tensors and before writing + the runtime ignore list. +- `Qwen/Qwen-Image` and `Qwen/Qwen-Image-Edit-2511` share the `qwen-image` + converter preset. Use `--model-type qwen-image` to force it, or rely on + auto-detection from `_class_name=QwenImageTransformer2DModel`. +- The validated Qwen Image FP8 fallback preset keeps `img_in`, `txt_in`, + timestep embedder linear layers, `norm_out.linear`, `proj_out`, + `transformer_blocks.*.(img_mod|txt_mod)`, and + `transformer_blocks.*.img_mlp.net.2` in BF16. +- For Qwen Image FP8 conversion, write explicit BF16 fallback tensors before + honoring ModelOpt ignored weights. Otherwise converter stats can report a + fallback while the output checkpoint still retains the source FP8 tensor. +- To build the converted checkpoint yourself from a ModelOpt diffusers export, + use `python -m sglang.multimodal_gen.tools.build_modelopt_fp8_transformer`. 
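
The flag-selection rule in the notes above (use `--transformer-path` when the override already carries its own `config.json`, otherwise treat it as weights) can be sketched as a small helper. This is only an illustration under stated assumptions: `pick_transformer_flag` is a hypothetical name, it is not part of SGLang, and it only inspects local directories (Hugging Face repo IDs are not resolved).

```python
from pathlib import Path


def pick_transformer_flag(override: str) -> str:
    """Hypothetical helper: suggest a CLI flag for a local transformer override.

    Mirrors the documented rule of thumb only; it does not reproduce
    SGLang's actual loader logic.
    """
    path = Path(override)
    # A component directory that ships its own config.json is loaded as a
    # transformer component override.
    if path.is_dir() and (path / "config.json").is_file():
        return "--transformer-path"
    # Raw safetensors files or weight-only directories use the weights flow.
    return "--transformer-weights-path"
```

For example, a converted checkpoint directory containing `config.json` maps to `--transformer-path`, while a bare `.safetensors` file maps to `--transformer-weights-path`.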
+ +## ModelOpt NVFP4 + +### Usage Examples + +For mixed ModelOpt NVFP4 transformer overrides that already contain +`config.json`, keep the base model and quantized transformer separate and use +`--transformer-path`: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.1-dev \ + --transformer-path lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --save-output +``` + +For raw NVFP4 exports such as the official FLUX.2 release, use +`--transformer-weights-path`: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --save-output +``` + +SGLang also supports passing the NVFP4 repo or local directory directly as +`--model-path`: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev-NVFP4 \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --save-output +``` + +For a dual-transformer Wan2.2 export where only the primary `transformer` +was quantized: + +```bash +SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn \ +sglang generate \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --transformer-path lmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformer \ + --prompt "a fox walking through neon rain" \ + --save-output +``` + +### Notes + +- Use `--transformer-path` for mixed ModelOpt NVFP4 transformer repos or local + directories that already include `config.json`. +- Use `--transformer-weights-path` for raw NVFP4 exports, individual + safetensors files, or repo layouts that should be treated as weights first. +- For dual-transformer pipelines such as `Wan2.2-T2V-A14B-Diffusers`, the + primary `--transformer-path` override targets only `transformer`. Use a + per-component override such as `--transformer-2-path` only when you + intentionally want a non-default `transformer_2`. 
+- On Blackwell, the validated Wan2.2 ModelOpt NVFP4 path currently prefers + FlashInfer FP4 GEMM via + `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn`. +- This environment-variable override is a current workaround for NVFP4 cases + where the default sglang JIT/CUTLASS `sm100` path rejects a large-M shape at + `can_implement()`. The intended long-term fix is to add a validated CUTLASS + fallback for those shapes rather than rely on the override. +- Direct `--model-path` loading is a compatibility path for FLUX.2 NVFP4-style + repos or local directories. +- If `--transformer-weights-path` is provided explicitly, it takes precedence + over the compatibility `--model-path` flow. +- For local directories, SGLang first looks for `*-mixed.safetensors`, then + falls back to loading from the directory. +- To force the generic diffusion ModelOpt FP4 path onto a specific FlashInfer + backend, set `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND`. Supported values + include `flashinfer_cudnn`, `flashinfer_cutlass`, and `flashinfer_trtllm`. +- On disk, the quantization config stays `quant_method=modelopt` with + `quant_algo=NVFP4`; the `modelopt-nvfp4` label here is again a documentation + family name rather than a serialized config key. + +## Nunchaku (SVDQuant) + +### Install + +Install the runtime dependency first: + +```bash +pip install nunchaku +``` + +For platform-specific installation methods and troubleshooting, see the +[Nunchaku installation guide](https://nunchaku.tech/docs/nunchaku/installation/installation.html). + +### File Naming and Auto-Detection + +For Nunchaku checkpoints, `--model-path` should still point to the original +base model, while `--transformer-weights-path` points to the quantized +transformer weights. 
+ +If the basename of `--transformer-weights-path` contains the pattern +`svdq-(int4|fp4)_r{rank}`, SGLang will automatically: +- enable SVDQuant +- infer `--quantization-precision` +- infer `--quantization-rank` + +Examples: + +| checkpoint name fragment | inferred precision | inferred rank | notes | +|--------------------------|--------------------|---------------|-------| +| `svdq-int4_r32` | `int4` | `32` | Standard INT4 checkpoint | +| `svdq-int4_r128` | `int4` | `128` | Higher-quality INT4 checkpoint | +| `svdq-fp4_r32` | `nvfp4` | `32` | `fp4` in the filename maps to CLI value `nvfp4` | +| `svdq-fp4_r128` | `nvfp4` | `128` | Higher-quality NVFP4 checkpoint | + +Common filenames: + +| filename | precision | rank | typical use | +|----------|-----------|------|-------------| +| `svdq-int4_r32-qwen-image.safetensors` | `int4` | `32` | Balanced default | +| `svdq-int4_r128-qwen-image.safetensors` | `int4` | `128` | Quality-focused | +| `svdq-fp4_r32-qwen-image.safetensors` | `nvfp4` | `32` | RTX 50-series / NVFP4 path | +| `svdq-fp4_r128-qwen-image.safetensors` | `nvfp4` | `128` | Quality-focused NVFP4 | +| `svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors` | `int4` | `32` | Lightning 4-step | +| `svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors` | `int4` | `128` | Lightning 8-step | + +If your checkpoint name does not follow this convention, pass +`--enable-svdquant`, `--quantization-precision`, and `--quantization-rank` +explicitly. 
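
The naming convention above can be illustrated with a short parser. This is a minimal sketch, assuming the pattern `svdq-(int4|fp4)_r{rank}` exactly as documented; `infer_svdq_settings` is a hypothetical name and is not SGLang's own detection code.

```python
import os
import re
from typing import Optional, Tuple

# Pattern taken from the documented convention: svdq-(int4|fp4)_r{rank}
_SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")

# Filename token -> CLI value for --quantization-precision
_PRECISION_CLI = {"int4": "int4", "fp4": "nvfp4"}


def infer_svdq_settings(weights_path: str) -> Optional[Tuple[str, int]]:
    """Return (precision, rank) inferred from the basename, or None.

    Hypothetical illustration of the documented auto-detection rule.
    """
    basename = os.path.basename(weights_path)
    match = _SVDQ_PATTERN.search(basename)
    if match is None:
        # Name does not follow the convention: settings must be passed explicitly.
        return None
    return _PRECISION_CLI[match.group(1)], int(match.group(2))
```

A `None` result corresponds to the case where `--enable-svdquant`, `--quantization-precision`, and `--quantization-rank` have to be given on the command line.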
+ +### Usage Examples + +Recommended auto-detected flow: + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \ + --prompt "a beautiful sunset" \ + --save-output +``` + +Manual override when the filename does not encode the quant settings: + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \ + --enable-svdquant \ + --quantization-precision int4 \ + --quantization-rank 128 \ + --prompt "a beautiful sunset" \ + --save-output +``` + +### Notes + +- `--transformer-weights-path` is the canonical flag for Nunchaku checkpoints. + Older config names such as `quantized_model_path` are treated as + compatibility aliases. +- Auto-detection only happens when the checkpoint basename matches + `svdq-(int4|fp4)_r{rank}`. +- The CLI values are `int4` and `nvfp4`. In filenames, the NVFP4 variant is + written as `fp4`. +- Lightning checkpoints usually expect matching `--num-inference-steps`, such + as `4` or `8`. +- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x) + or SM12x GPUs. Hopper (SM90) is currently rejected. + +## [ModelSlim](https://gitcode.com/Ascend/msmodelslim) +MindStudio-ModelSlim (msModelSlim) is an offline model quantization and compression tool from MindStudio, optimized for Ascend hardware. + +- **Installation** + + ```bash + # Clone repo and install msmodelslim: + git clone https://gitcode.com/Ascend/msmodelslim.git + cd msmodelslim + bash install.sh + ``` + +- **Multimodal_sd quantization** + + Download the original floating-point weights of the model. Taking Wan2.2-T2V-A14B as an example, you can obtain the original weights from [Wan2.2-T2V-A14B](https://modelscope.cn/models/Wan-AI/Wan2.2-T2V-A14B). Then install the model's other dependencies (see the ModelScope model card).
+ > Note: You can find pre-quantized validated models on [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech). + + Run one-click quantization (recommended): + + ```bash + msmodelslim quant \ + --model_path /path/to/wan2_2_float_weights \ + --save_path /path/to/wan2_2_quantized_weights \ + --device npu \ + --model_type Wan2_2 \ + --quant_type w8a8 \ + --trust_remote_code True + ``` + + For more detailed quantization examples and information about supported models, see the [examples](https://gitcode.com/Ascend/msmodelslim/blob/master/example/multimodal_sd/README.md) section in the msModelSlim repo. + + > Note: SGLang does not support quantized embeddings; disable this option when quantizing with msmodelslim. + +- **Auto-Detection and different formats** + + For msmodelslim checkpoints, specifying ```--model-path``` is enough: quantization is detected automatically for each layer by parsing the `quant_model_description.json` config.
+ + For `Wan2.2`, only the `Diffusers` weight storage format is supported, whereas msmodelslim saves the quantized model in the original `Wan2.2` format. + To convert it, use the `python/sglang/multimodal_gen/tools/wan_repack.py` script: + + ```bash + python wan_repack.py \ + --input-path {path_to_quantized_model} \ + --output-path {path_to_converted_model} + ``` + + After that, copy all files from the original `Diffusers` checkpoint (except the `transformer`/`transformer_2` folders). + +- **Usage Example** + + With auto-detected flow: + + ```bash + sglang generate \ + --model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-w8a8 \ + --prompt "a beautiful sunset" \ + --save-output + ``` + +- **Available Quantization Methods**: + - [x] ```W4A4_DYNAMIC``` linear with online quantization of activations + - [x] ```W8A8``` linear with offline quantization of activations + - [x] ```W8A8_DYNAMIC``` linear with online quantization of activations + - [x] ```mxfp8``` linear with online/offline MXFP8 quantization (Ascend A5, CANN ≥ 8.0.RC3; see [Ascend NPU quantization](../platforms/ascend/ascend_npu_quantization.md#diffusion-model-quantization-on-ascend-npu)) diff --git a/docs/diffusion/reference.md b/docs/diffusion/reference.md new file mode 100644 index 000000000000..2005a91c7787 --- /dev/null +++ b/docs/diffusion/reference.md @@ -0,0 +1,11 @@ +# Reference + +Reference material for environment-based configuration and runtime behavior. + +- [Environment Variables](environment_variables.md): platform, caching, cloud storage, and debugging variables + +```{toctree} +:maxdepth: 1 + +environment_variables +``` diff --git a/docs/diffusion/support_new_models.md b/docs/diffusion/support_new_models.md new file mode 100644 index 000000000000..42f33c72b42f --- /dev/null +++ b/docs/diffusion/support_new_models.md @@ -0,0 +1,388 @@ +# How to Support New Diffusion Models + +This document explains how to add support for new diffusion models in SGLang Diffusion.
+ +## Architecture Overview + +SGLang Diffusion is engineered for both performance and flexibility, built upon a pipeline architecture. This +design allows developers to construct pipelines for various diffusion models while keeping the core generation +loop standardized for optimization. + +At its core, the architecture revolves around two key concepts, as highlighted in our [blog post](https://lmsys.org/blog/2025-11-07-sglang-diffusion/#architecture): + +- **`ComposedPipeline`**: This class orchestrates a series of `PipelineStage`s to define the complete generation process for a specific model. It acts as the main entry point for a model and manages the data flow between the different stages of the diffusion process. +- **`PipelineStage`**: Each stage is a modular component that encapsulates a function within the diffusion process. Examples include prompt encoding, the denoising loop, or VAE decoding. + +### Two Pipeline Styles + +SGLang Diffusion supports two pipeline composition styles. Both are valid; choose the one that best fits your model. + +#### Style A: Hybrid Monolithic Pipeline (Recommended Default) + +The recommended default for most new models. Uses a three-stage structure: + +``` +BeforeDenoisingStage (model-specific) → DenoisingStage (standard) → DecodingStage (standard) +``` + +| Stage | Ownership | Responsibility | +|-------|-----------|----------------| +| `{Model}BeforeDenoisingStage` | Model-specific | All pre-processing: input validation, text/image encoding, latent preparation, timestep computation | +| `DenoisingStage` | Framework-standard | The denoising loop (DiT/UNet forward passes), shared across all models | +| `DecodingStage` | Framework-standard | VAE decoding from latent space to pixel space, shared across all models | + +**Why recommended?** Modern diffusion models often have highly heterogeneous pre-processing requirements — different text encoders, different latent formats, different conditioning mechanisms. 
The Hybrid approach keeps pre-processing isolated per model, avoids fragile shared stages with excessive conditional logic, and lets developers port Diffusers reference code quickly. + +#### Style B: Modular Composition Style + +Uses the framework's fine-grained standard stages (`TextEncodingStage`, `LatentPreparationStage`, `TimestepPreparationStage`, etc.) to build the pipeline by composition. Convenience methods like `add_standard_t2i_stages()` and `add_standard_ti2i_stages()` make this very concise. + +This style is appropriate when: +- **The new model's pre-processing can largely reuse existing stages** — e.g., a model that uses standard CLIP/T5 text encoding + standard latent preparation with minimal customization. +- **A model-specific optimization needs to be extracted as a standalone stage** — e.g., a specialized encoding or conditioning step that benefits from being a separate stage for profiling, parallelism control, or reuse across multiple pipeline variants. + +#### How to Choose + +| Situation | Recommended Style | +|-----------|-------------------| +| Model has unique/complex pre-processing (VLM captioning, AR token generation, custom latent packing, etc.) | **Hybrid** — consolidate into a BeforeDenoisingStage | +| Model fits neatly into standard text-to-image or text+image-to-image pattern | **Modular** — use `add_standard_t2i_stages()` / `add_standard_ti2i_stages()` | +| Porting a Diffusers pipeline with many custom steps | **Hybrid** — copy the `__call__` logic into a single stage | +| Adding a variant of an existing model that shares most logic | **Modular** — reuse existing stages, customize via PipelineConfig callbacks | +| A specific pre-processing step needs special parallelism or profiling isolation | **Modular** — extract that step as a dedicated stage | + +## Key Components for Implementation + +To add support for a new diffusion model, you will need to define or configure the following components: + +1. 
**`PipelineConfig`**: A dataclass holding static configurations for your model pipeline — precision settings, model architecture parameters, and callback methods used by the standard `DenoisingStage` and `DecodingStage`. Each model has its own subclass. + +2. **`SamplingParams`**: A dataclass defining runtime generation parameters — `prompt`, `negative_prompt`, `guidance_scale`, `num_inference_steps`, `seed`, `height`, `width`, etc. + +3. **Pre-processing stage(s)**: Either a single model-specific `{Model}BeforeDenoisingStage` (Hybrid style) or a combination of standard stages (Modular style). See [Two Pipeline Styles](#two-pipeline-styles) above. + +4. **`ComposedPipeline`**: A class that wires together your pre-processing stage(s) with the standard `DenoisingStage` and `DecodingStage`. See base definitions: + - [`ComposedPipelineBase`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py) + - [`PipelineStage`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py) + - [Central registry](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/registry.py) + +5. **Modules (model components)**: Each pipeline references modules loaded from the model repository (e.g., Diffusers `model_index.json`): + - `text_encoder`: Encodes text prompts into embeddings. + - `tokenizer`: Tokenizes raw text input for the text encoder(s). + - `processor`: Preprocesses images and extracts features; often used in image-to-image tasks. + - `image_encoder`: Specialized image feature extractor. + - `dit/transformer`: The core denoising network (DiT/UNet architecture) operating in latent space. + - `scheduler`: Controls the timestep schedule and denoising dynamics. + - `vae`: Variational Autoencoder for encoding/decoding between pixel space and latent space. 
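
The way a composed pipeline wires a model-specific pre-processing stage to the shared denoising and decoding stages (the Hybrid style) can be pictured with a toy sketch. Everything below is a hypothetical stand-in: `ToyBatch`, `ToyStage`, and the three stage functions are illustration-only names and do not reflect the real `ComposedPipelineBase` or `PipelineStage` interfaces; the "model forward" and "VAE decode" are placeholder arithmetic.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToyBatch:
    """Carries the request and intermediate values between stages."""
    prompt: str
    num_inference_steps: int
    state: Dict[str, Any] = field(default_factory=dict)


# A stage is modeled as a function that takes a batch and returns it.
ToyStage = Callable[[ToyBatch], ToyBatch]


def my_model_before_denoising(batch: ToyBatch) -> ToyBatch:
    # Model-specific: validation, encoding, latent and timestep preparation.
    if not batch.prompt:
        raise ValueError("prompt must not be empty")
    batch.state["prompt_embeds"] = [float(len(batch.prompt))]
    batch.state["latents"] = [0.0, 0.0]
    batch.state["timesteps"] = list(range(batch.num_inference_steps, 0, -1))
    return batch


def denoising(batch: ToyBatch) -> ToyBatch:
    # Shared stage: one placeholder "model forward" per timestep (adds 1.0).
    for _t in batch.state["timesteps"]:
        batch.state["latents"] = [x + 1.0 for x in batch.state["latents"]]
    return batch


def decoding(batch: ToyBatch) -> ToyBatch:
    # Shared stage: placeholder "VAE decode" (scales latents by 10).
    batch.state["image"] = [x * 10.0 for x in batch.state["latents"]]
    return batch


def run_pipeline(stages: List[ToyStage], batch: ToyBatch) -> ToyBatch:
    # Stages run in order, each consuming the previous stage's output.
    for stage in stages:
        batch = stage(batch)
    return batch


HYBRID_STAGES: List[ToyStage] = [my_model_before_denoising, denoising, decoding]
```

In this picture, supporting a new model in the Hybrid style amounts to swapping only the first entry of the stage list, while the Modular style would replace it with several finer-grained stages.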
+ +## Pipeline Stages Reference + +### Core Stages (used by all pipelines) + +| Stage Class | Description | +| -------------------------------- | ------------------------------------------------------------------------------------------------------- | +| `DenoisingStage` | Executes the main denoising loop, iteratively applying the model (DiT/UNet) to refine the latents. | +| `DecodingStage` | Decodes the final latent tensor back into pixel space using the VAE. | +| `DmdDenoisingStage` | A specialized denoising stage for DMD model architectures. | +| `CausalDMDDenoisingStage` | A specialized causal denoising stage for specific video models. | + +### Pre-processing Stages (for Modular Composition Style) + +The following fine-grained stages can be composed to build the pre-processing portion of a pipeline. They are best suited for models whose pre-processing largely fits the standard patterns. If your model requires significant customization, consider the Hybrid style with a single `BeforeDenoisingStage` instead. + +| Stage Class | Description | +| -------------------------------- | ------------------------------------------------------------------------------------------------------- | +| `InputValidationStage` | Validates user-provided `SamplingParams`. | +| `TextEncodingStage` | Encodes text prompts into embeddings using one or more text encoders. | +| `ImageEncodingStage` | Encodes input images into embeddings, often used in image-to-image tasks. | +| `ImageVAEEncodingStage` | Encodes an input image into latent space using the VAE. | +| `TimestepPreparationStage` | Prepares the scheduler's timesteps for the diffusion process. | +| `LatentPreparationStage` | Creates the initial noisy latent tensor that will be denoised. 
| + +## Implementation Guide + +### Step 1: Obtain and Study the Reference Implementation + +Before writing any code, obtain the model's original implementation or Diffusers pipeline code: +- The model's Diffusers pipeline source (e.g., the `pipeline_*.py` file from the `diffusers` library or HuggingFace repo) +- Or the model's official reference implementation (e.g., from the model author's GitHub repo) +- Or the HuggingFace model ID to look up `model_index.json` and the associated pipeline class + +Once you have the reference code, study it thoroughly: + +1. Find the model's `model_index.json` to identify required modules. +2. Read the Diffusers pipeline's `__call__` method to understand: + - How text prompts are encoded + - How latents are prepared (shape, dtype, scaling) + - How timesteps/sigmas are computed + - What conditioning kwargs the DiT expects + - How the denoising loop works + - How VAE decoding is done + +### Step 2: Evaluate Reuse of Existing Pipelines and Stages + +Before creating any new files, check whether an existing pipeline or stage can be reused or extended. Only create new pipelines/stages when the existing ones would need substantial structural changes or when no architecturally similar implementation exists. + +- **Compare against existing pipelines** (Flux, Wan, Qwen-Image, GLM-Image, HunyuanVideo, LTX, etc.). If the new model shares most of its structure with an existing one, prefer adding a new config variant or reusing existing stages. +- **Check existing stages** in `runtime/pipelines_core/stages/` and `stages/model_specific_stages/`. +- **Check existing model components** — many models share VAEs (e.g., `AutoencoderKL`), text encoders (CLIP, T5), and schedulers. Reuse these directly. 
+ +### Step 3: Implement Model Components + +Adapt the model's core components: + +- **DiT/Transformer**: Implement in [`runtime/models/dits/`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/models/dits/) +- **Encoders**: Implement in [`runtime/models/encoders/`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/models/encoders/) +- **VAEs**: Implement in [`runtime/models/vaes/`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/models/vaes/) +- **Schedulers**: Implement in [`runtime/models/schedulers/`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/models/schedulers/) if needed + +Use SGLang's fused kernels where possible (see `LayerNormScaleShift`, `RMSNormScaleShift`, `apply_qk_norm`, etc.). + +**Tensor Parallel (TP) and Sequence Parallel (SP)**: For multi-GPU deployment, it is recommended to add TP/SP support to the DiT model. This can be done incrementally after the single-GPU implementation is verified. 
Reference implementations: +- **Wan model** (`runtime/models/dits/wanvideo.py`) — Full TP + SP: `ColumnParallelLinear`/`RowParallelLinear` for attention, sequence dimension sharding via `get_sp_world_size()` +- **Qwen-Image model** (`runtime/models/dits/qwen_image.py`) — SP via `USPAttention` (Ulysses + Ring Attention) + +### Step 4: Create Configs + +- **DiT Config**: `configs/models/dits/{model_name}.py` +- **VAE Config**: `configs/models/vaes/{model_name}.py` +- **SamplingParams**: `configs/sample/{model_name}.py` + +### Step 5: Create PipelineConfig + +The `PipelineConfig` provides callbacks that the standard `DenoisingStage` and `DecodingStage` use: + +```python +# python/sglang/multimodal_gen/configs/pipeline_configs/my_model.py + +@dataclass +class MyModelPipelineConfig(ImagePipelineConfig): + task_type: ModelTaskType = ModelTaskType.T2I + vae_precision: str = "bf16" + should_use_guidance: bool = True + dit_config: DiTConfig = field(default_factory=MyModelDitConfig) + vae_config: VAEConfig = field(default_factory=MyModelVAEConfig) + + def get_freqs_cis(self, batch, device, rotary_emb, dtype): + """Prepare rotary position embeddings for the DiT.""" + ... + + def prepare_pos_cond_kwargs(self, batch, latent_model_input, t, **kwargs): + """Build positive conditioning kwargs for each denoising step.""" + return { + "hidden_states": latent_model_input, + "encoder_hidden_states": batch.prompt_embeds[0], + "timestep": t, + } + + def prepare_neg_cond_kwargs(self, batch, latent_model_input, t, **kwargs): + """Build negative conditioning kwargs for CFG.""" + return { + "hidden_states": latent_model_input, + "encoder_hidden_states": batch.negative_prompt_embeds[0], + "timestep": t, + } + + def get_decode_scale_and_shift(self): + """Return (scale, shift) for latent denormalization before VAE decode.""" + ... 
+``` + +### Step 6: Implement Pre-processing + +Choose based on your model's needs (see [How to Choose](#how-to-choose)): + +#### Option A: BeforeDenoisingStage (Hybrid Style) + +Create a single stage that handles all pre-processing. Best when the model has custom/complex pre-processing logic. + +```python +# python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages/my_model.py + +class MyModelBeforeDenoisingStage(PipelineStage): + """Monolithic pre-processing stage for MyModel. + + Consolidates: input validation, text/image encoding, latent + preparation, and timestep computation. + """ + + def __init__(self, vae, text_encoder, tokenizer, transformer, scheduler): + super().__init__() + self.vae = vae + self.text_encoder = text_encoder + self.tokenizer = tokenizer + self.transformer = transformer + self.scheduler = scheduler + + @torch.no_grad() + def forward(self, batch: Req, server_args: ServerArgs) -> Req: + device = get_local_torch_device() + + # 1. Encode prompt (model-specific logic) + prompt_embeds, negative_prompt_embeds = self._encode_prompt(...) + + # 2. Prepare latents + latents = self._prepare_latents(...) + + # 3. Prepare timesteps + timesteps, sigmas = self._prepare_timesteps(...) + + # 4. Populate batch for DenoisingStage + batch.prompt_embeds = [prompt_embeds] + batch.negative_prompt_embeds = [negative_prompt_embeds] + batch.latents = latents + batch.timesteps = timesteps + batch.num_inference_steps = len(timesteps) + batch.sigmas = sigmas.tolist() + batch.generator = generator + batch.raw_latent_shape = latents.shape + return batch +``` + +#### Option B: Standard Stages (Modular Style) + +Skip creating a custom stage entirely — configure via `PipelineConfig` callbacks and use framework helpers. Best when the model fits standard patterns. + +(This option has no separate stage file; the pipeline class in Step 7 calls `add_standard_t2i_stages()` directly.) 
+ +**Key batch fields that `DenoisingStage` expects** (regardless of which option you choose): + +| Field | Type | Description | +|-------|------|-------------| +| `batch.latents` | `torch.Tensor` | Initial noisy latent tensor | +| `batch.timesteps` | `torch.Tensor` | Timestep schedule | +| `batch.num_inference_steps` | `int` | Number of denoising steps | +| `batch.sigmas` | `list[float]` | Sigma schedule (must be a Python list, not numpy) | +| `batch.prompt_embeds` | `list[torch.Tensor]` | Positive prompt embeddings (wrapped in a list) | +| `batch.negative_prompt_embeds` | `list[torch.Tensor]` | Negative prompt embeddings (wrapped in a list) | +| `batch.generator` | `torch.Generator` | RNG generator for reproducibility | +| `batch.raw_latent_shape` | `tuple` | Original latent shape before any packing | + +### Step 7: Define the Pipeline Class + +#### Hybrid Style + +```python +# python/sglang/multimodal_gen/runtime/pipelines/my_model.py + +class MyModelPipeline(LoRAPipeline, ComposedPipelineBase): + pipeline_name = "MyModelPipeline" # Must match model_index.json _class_name + + _required_config_modules = [ + "text_encoder", "tokenizer", "vae", "transformer", "scheduler", + ] + + def create_pipeline_stages(self, server_args: ServerArgs): + # 1. Monolithic pre-processing (model-specific) + self.add_stage( + MyModelBeforeDenoisingStage( + vae=self.get_module("vae"), + text_encoder=self.get_module("text_encoder"), + tokenizer=self.get_module("tokenizer"), + transformer=self.get_module("transformer"), + scheduler=self.get_module("scheduler"), + ), + ) + + # 2. Standard denoising loop (framework-provided) + self.add_stage( + DenoisingStage( + transformer=self.get_module("transformer"), + scheduler=self.get_module("scheduler"), + ), + ) + + # 3. 
Standard VAE decoding (framework-provided) + self.add_standard_decoding_stage() + + +EntryClass = [MyModelPipeline] +``` + +#### Modular Style + +```python +# python/sglang/multimodal_gen/runtime/pipelines/my_model.py + +class MyModelPipeline(LoRAPipeline, ComposedPipelineBase): + pipeline_name = "MyModelPipeline" + + _required_config_modules = [ + "text_encoder", "tokenizer", "vae", "transformer", "scheduler", + ] + + def create_pipeline_stages(self, server_args: ServerArgs): + # All pre-processing + denoising + decoding in one call + self.add_standard_t2i_stages( + prepare_extra_timestep_kwargs=[prepare_mu], # model-specific hooks + ) + + +EntryClass = [MyModelPipeline] +``` + +### Step 8: Register the Model + +Register your configs in [`registry.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/registry.py): + +```python +register_configs( + model_family="my_model", + sampling_param_cls=MyModelSamplingParams, + pipeline_config_cls=MyModelPipelineConfig, + hf_model_paths=["org/my-model-name"], +) +``` + +The `EntryClass` in your pipeline file is automatically discovered by the registry — no additional registration needed for the pipeline class itself. + +### Step 9: Verify Output Quality + +After implementation, verify that the generated output is not noise. A noisy or garbled output is the most common sign of an incorrect implementation. Common causes include: + +- Incorrect latent scale/shift factors +- Wrong timestep/sigma schedule (order, dtype, or value range) +- Mismatched conditioning kwargs +- Rotary embedding style mismatch (`is_neox_style`) + +Debug by comparing intermediate tensor values against the Diffusers reference pipeline with the same seed. 
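One way to make the intermediate-tensor comparison concrete is to reduce each pair of tensors to a couple of summary numbers. The sketch below is a minimal, dependency-free illustration that works on flattened values (for example from `tensor.float().flatten().tolist()`); the helper name `compare_flat` is hypothetical, and any tolerance is left to the reader because acceptable drift depends on dtype and model.

```python
import math
from typing import Sequence


def compare_flat(ours: Sequence[float], ref: Sequence[float]) -> dict:
    """Summarize how far two flattened tensors are from each other.

    Returns the maximum absolute difference and the cosine similarity.
    A cosine far from 1.0 usually points at a structural bug (wrong scale,
    wrong ordering) rather than ordinary numerical noise.
    """
    if len(ours) != len(ref):
        raise ValueError(f"length mismatch: {len(ours)} vs {len(ref)}")

    max_abs_diff = max((abs(a - b) for a, b in zip(ours, ref)), default=0.0)
    dot = sum(a * b for a, b in zip(ours, ref))
    norm_ours = math.sqrt(sum(a * a for a in ours))
    norm_ref = math.sqrt(sum(b * b for b in ref))
    if norm_ours == 0.0 or norm_ref == 0.0:
        cosine = float("nan")  # undefined when either input is all zeros
    else:
        cosine = dot / (norm_ours * norm_ref)
    return {"max_abs_diff": max_abs_diff, "cosine": cosine}


if __name__ == "__main__":
    # Toy values standing in for latents captured at the same denoising step.
    print(compare_flat([0.10, -0.20, 0.30], [0.10, -0.21, 0.30]))
```

Running the same check after prompt encoding, after latent preparation, and after the first denoising step narrows down which stage first diverges from the reference.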
+ +## Reference Implementations + +### Hybrid Style + +| Model | Pipeline | BeforeDenoisingStage | PipelineConfig | +|-------|----------|---------------------|----------------| +| GLM-Image | `runtime/pipelines/glm_image.py` | `stages/model_specific_stages/glm_image.py` | `configs/pipeline_configs/glm_image.py` | +| Qwen-Image-Layered | `runtime/pipelines/qwen_image.py` | `stages/model_specific_stages/qwen_image_layered.py` | `configs/pipeline_configs/qwen_image.py` | + +### Modular Style + +| Model | Pipeline | Notes | +|-------|----------|-------| +| Qwen-Image (T2I) | `runtime/pipelines/qwen_image.py` | Uses `add_standard_t2i_stages()` | +| Qwen-Image-Edit | `runtime/pipelines/qwen_image.py` | Uses `add_standard_ti2i_stages()` | +| Flux | `runtime/pipelines/flux.py` | Uses `add_standard_t2i_stages()` with custom `prepare_mu` | +| Wan | `runtime/pipelines/wan_pipeline.py` | Uses `add_standard_ti2v_stages()` | + +## Checklist + +Before submitting your implementation, verify: + +**Common (both styles):** +- [ ] **Pipeline file** at `runtime/pipelines/{model_name}.py` with `EntryClass` +- [ ] **PipelineConfig** at `configs/pipeline_configs/{model_name}.py` +- [ ] **SamplingParams** at `configs/sample/{model_name}.py` +- [ ] **DiT model** at `runtime/models/dits/{model_name}.py` +- [ ] **Model configs** (DiT, VAE) at `configs/models/dits/` and `configs/models/vaes/` +- [ ] **Registry entry** in `registry.py` via `register_configs()` +- [ ] `pipeline_name` matches Diffusers `model_index.json` `_class_name` +- [ ] `_required_config_modules` lists all modules from `model_index.json` +- [ ] `PipelineConfig` callbacks (`prepare_pos_cond_kwargs`, etc.) 
match the DiT's `forward()` signature +- [ ] Uses framework-standard `DenoisingStage` and `DecodingStage` (not custom denoising loops) +- [ ] **TP/SP support** considered for DiT model (recommended; reference `wanvideo.py` for TP+SP, `qwen_image.py` for USPAttention) +- [ ] **Output quality verified** — generated images/videos are not noise; compared against Diffusers reference output + +**Hybrid style only:** +- [ ] **BeforeDenoisingStage** at `stages/model_specific_stages/{model_name}.py` +- [ ] `BeforeDenoisingStage.forward()` populates all batch fields required by `DenoisingStage` diff --git a/docs/diffusion/usage.md b/docs/diffusion/usage.md new file mode 100644 index 000000000000..78b0a545d4d6 --- /dev/null +++ b/docs/diffusion/usage.md @@ -0,0 +1,17 @@ +# Usage + +Use this section for day-to-day inference workflows with SGLang Diffusion. + +- [CLI](api/cli.md): run one-off jobs with `sglang generate` or start a server with `sglang serve` +- [OpenAI-Compatible API](api/openai_api.md): request format, endpoints, and SDK examples +- [Post-Processing](api/post_processing.md): frame interpolation and upscaling +- [Quantization](quantization.md): quantized transformer checkpoints and supported quantization families + +```{toctree} +:maxdepth: 1 + +api/cli +api/openai_api +api/post_processing +quantization +``` diff --git a/docs/get_started/install.md b/docs/get_started/install.md index 59aff71b311a..091bf4ae7b4f 100644 --- a/docs/get_started/install.md +++ b/docs/get_started/install.md @@ -2,7 +2,7 @@ You can install SGLang using one of the methods below. This page primarily applies to common NVIDIA GPU platforms. 
-For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [TPU](../platforms/tpu.md), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend_npu.md), and [Intel XPU](../platforms/xpu.md). +For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [TPU](../platforms/tpu.md), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend/ascend_npu.md), and [Intel XPU](../platforms/xpu.md). ## Method 1: With pip or uv @@ -11,23 +11,40 @@ It is recommended to use uv for faster installation: ```bash pip install --upgrade pip pip install uv -uv pip install "sglang" +uv pip install sglang ``` -**Quick fixes to common problems** -- In some cases (e.g., GB200), the above command might install a wrong torch version (e.g., the CPU version) due to dependency resolution. To fix this, you can first run the above command and then force-reinstall the correct [PyTorch](https://pytorch.org/get-started/locally/) with the following: - ``` - uv pip install "torch==2.9.1" "torchvision" --extra-index-url https://download.pytorch.org/whl/cu129 --force-reinstall - ``` -- For CUDA 13, Docker is recommended (see the Method 3 note on B300/GB300/CUDA 13). If you do not have Docker access, installing the matching `sgl_kernel` wheel from [the sgl-project whl releases](https://github.com/sgl-project/whl/releases) after installing SGLang also works. Replace `X.Y.Z` with the `sgl_kernel` version required by your SGLang (you can find this by running `uv pip show sgl_kernel`). 
Examples: - ```bash - # x86_64 - uv pip install "https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sgl_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_x86_64.whl" - - # aarch64 - uv pip install "https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sgl_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_aarch64.whl" - ``` -- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions: +### For CUDA 13 + +Docker is recommended (see Method 3 note on B300/GB300/CUDA 13). If you do not have Docker access, follow these steps: + +1. Install PyTorch with CUDA 13 support first: +```bash +# Replace X.Y.Z with the version required by your SGLang install +uv pip install torch==X.Y.Z torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130 +``` + +2. Install sglang: +```bash +uv pip install sglang +``` + +3. Install the `sglang-kernel` wheel for CUDA 13 from [the sgl-project whl releases](https://github.com/sgl-project/whl/blob/gh-pages/cu130/sglang-kernel/index.html). Replace `X.Y.Z` with the `sglang-kernel` version required by your SGLang install (you can find this by running `uv pip show sglang-kernel`). Examples: +```bash +# x86_64 +uv pip install "https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sglang_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_x86_64.whl" + +# aarch64 +uv pip install "https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sglang_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_aarch64.whl" +``` + +4. If you encounter `ptxas fatal : Value 'sm_103a' is not defined for option 'gpu-name'` on B300/GB300, fix it with: +```bash +export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas +``` + +### **Quick fixes to common problems** +- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions: 1.
Use `export CUDA_HOME=/usr/local/cuda-` to set the `CUDA_HOME` environment variable. 2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above. @@ -35,7 +52,7 @@ uv pip install "sglang" ```bash # Use the last release branch -git clone -b v0.5.6.post2 https://github.com/sgl-project/sglang.git +git clone -b v0.5.9 https://github.com/sgl-project/sglang.git cd sglang # Install the python packages @@ -211,4 +228,3 @@ echo "Build and push completed successfully!" - [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub. - To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`. -- When encountering `ptxas fatal : Value 'sm_103a' is not defined for option 'gpu-name'` on B300/GB300, fix it with `export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas`. diff --git a/docs/index.rst b/docs/index.rst index 1e0937d463e6..4a892226c688 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -14,9 +14,9 @@ Its core features include: - **Fast Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching. 
- **Broad Model Support**: Supports a wide range of language models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), reward models (Skywork), and diffusion models (WAN, Qwen-Image), with easy extensibility for adding new models. Compatible with most Hugging Face models and OpenAI APIs. -- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more. +- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark/5090), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more. - **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 400,000 GPUs worldwide. -- **RL & Post-Training Backbone**: SGLang is a proven rollout backend across the world, with native RL integrations and adoption by well-known post-training frameworks such as AReaL, Miles, slime, Tunix, verl and more. +- **RL & Post-Training Backbone**: SGLang is a proven rollout backend used for training many frontier models, with native RL integrations and adoption by well-known post-training frameworks such as AReaL, Miles, slime, Tunix, verl and more. .. 
toctree:: :maxdepth: 1 @@ -41,9 +41,11 @@ Its core features include: :caption: Advanced Features advanced_features/server_arguments.md + advanced_features/object_storage.md advanced_features/hyperparameter_tuning.md advanced_features/attention_backend.md advanced_features/speculative_decoding.ipynb + advanced_features/adaptive_speculative_decoding.md advanced_features/structured_outputs.ipynb advanced_features/structured_outputs_for_reasoning_models.ipynb advanced_features/tool_parser.ipynb @@ -51,6 +53,7 @@ Its core features include: advanced_features/quantization.md advanced_features/quantized_kv_cache.md advanced_features/expert_parallelism.md + advanced_features/dp_dpa_smg_guide.md advanced_features/lora.ipynb advanced_features/pd_disaggregation.md advanced_features/epd_disaggregation.md @@ -60,6 +63,8 @@ Its core features include: advanced_features/vlm_query.ipynb advanced_features/dp_for_multi_modal_encoder.md advanced_features/cuda_graph_for_multi_modal_encoder.md + advanced_features/piecewise_cuda_graph.md + advanced_features/breakable_cuda_graph.md advanced_features/sgl_model_gateway.md advanced_features/deterministic_inference.md advanced_features/observability.md @@ -67,21 +72,29 @@ Its core features include: advanced_features/sglang_for_rl.md .. toctree:: - :maxdepth: 1 + :maxdepth: 2 :caption: Supported Models - supported_models/generative_models.md - supported_models/multimodal_language_models.md - supported_models/diffusion_language_models.md - supported_models/diffusion_models.md - supported_models/embedding_models.md - supported_models/reward_models.md - supported_models/rerank_models.md - supported_models/classify_models.md - supported_models/support_new_models.md - supported_models/transformers_fallback.md - supported_models/modelscope.md - supported_models/mindspore_models.md + supported_models/text_generation/index + supported_models/retrieval_ranking/index + supported_models/specialized/index + supported_models/extending/index + +.. 
toctree:: + :maxdepth: 2 + :caption: SGLang Diffusion + + diffusion/index + diffusion/installation + diffusion/compatibility_matrix + diffusion/api/cli + diffusion/api/openai_api + diffusion/performance/index + diffusion/performance/ring_sp_performance + diffusion/performance/attention_backends + diffusion/performance/cache/index + diffusion/quantization + diffusion/contributing .. toctree:: :maxdepth: 1 @@ -91,7 +104,7 @@ Its core features include: platforms/cpu_server.md platforms/tpu.md platforms/nvidia_jetson.md - platforms/ascend_npu_support.rst + platforms/ascend/ascend_npu_support.rst platforms/xpu.md .. toctree:: @@ -100,6 +113,7 @@ Its core features include: developer_guide/contribution_guide.md developer_guide/development_guide_using_docker.md + developer_guide/development_jit_kernel_guide.md developer_guide/benchmark_and_profiling.md developer_guide/bench_serving.md developer_guide/evaluating_new_models.md @@ -116,6 +130,7 @@ Its core features include: references/custom_chat_template.md references/frontend/frontend_index.rst references/post_training_integration.md + references/release_lookup references/learn_more.md .. toctree:: diff --git a/docs/performance_dashboard/README.md b/docs/performance_dashboard/README.md new file mode 100644 index 000000000000..857dc26a8dfc --- /dev/null +++ b/docs/performance_dashboard/README.md @@ -0,0 +1,147 @@ +# SGLang Performance Dashboard + +A web-based dashboard for visualizing SGLang nightly test performance metrics. 
+ +## Features + +- **Performance Trends**: View throughput, latency, and TTFT trends over time +- **Model Comparison**: Compare performance across different models and configurations +- **Filtering**: Filter by GPU configuration, model, variant, and batch size +- **Interactive Charts**: Zoom, pan, and hover for detailed metrics +- **Run History**: View recent benchmark runs with links to GitHub Actions + +## Quick Start + +### Option 1: Run with Local Server (Recommended) + +For live data from GitHub Actions artifacts: + +```bash +# Install requirements +pip install requests + +# Run the server +python server.py --fetch-on-start + +# Visit http://localhost:8000 +``` + +The server provides: +- Automatic fetching of metrics from GitHub +- Caching to reduce API calls +- `/api/metrics` endpoint for the frontend + +### Option 2: Fetch Data Manually + +Use the fetch script to download metrics data: + +```bash +# Fetch last 30 days of metrics +python fetch_metrics.py --output metrics_data.json + +# Fetch a specific run +python fetch_metrics.py --run-id 21338741812 --output single_run.json + +# Fetch only scheduled (nightly) runs +python fetch_metrics.py --scheduled-only --days 7 +``` + +## GitHub Token + +To download artifacts from GitHub, you need authentication: + +1. **Using `gh` CLI** (recommended): + ```bash + gh auth login + ``` + +2. **Using environment variable**: + ```bash + export GITHUB_TOKEN=your_token_here + ``` + +Without a token, the dashboard will show run metadata but not detailed benchmark results. 
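For readers scripting their own requests, the environment-variable route can be sketched as follows. This is not taken from `server.py` or `fetch_metrics.py`; the helper name `github_headers` is hypothetical, and the only assumptions are that the token lives in `GITHUB_TOKEN` and that the GitHub REST API accepts it as a bearer token.

```python
import os
from typing import Mapping, Optional


def github_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build GitHub REST API headers, adding auth only when a token is set.

    `env` defaults to os.environ; passing a plain dict makes the helper easy
    to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


if __name__ == "__main__":
    # Report only whether auth is configured; never print the token itself.
    has_auth = "Authorization" in github_headers()
    print("authenticated requests:", has_auth)
```

Leaving the `Authorization` header out when no token is set keeps unauthenticated metadata requests working, which matches the fallback behavior described above.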
+ +## Data Structure + +The metrics JSON has this structure: + +```json +{ + "run_id": "21338741812", + "run_date": "2026-01-25T22:24:02.090218+00:00", + "commit_sha": "5cdb391...", + "branch": "main", + "results": [ + { + "gpu_config": "8-gpu-h200", + "partition": 0, + "model": "deepseek-ai/DeepSeek-V3.1", + "variant": "TP8+MTP", + "benchmarks": [ + { + "batch_size": 1, + "input_len": 4096, + "output_len": 512, + "latency_ms": 2400.72, + "input_throughput": 21408.64, + "output_throughput": 231.74, + "overall_throughput": 1919.43, + "ttft_ms": 191.32, + "acc_length": 3.19 + } + ] + } + ] +} +``` + +## Deployment + +### GitHub Pages + +The dashboard can be deployed to GitHub Pages for public access: + +1. Copy the dashboard files to `docs/performance_dashboard/` +2. Enable GitHub Pages in repository settings +3. Set up a GitHub Action to periodically update metrics data + +### Self-Hosted + +For a self-hosted deployment with live data: + +1. Set up a server running `server.py` +2. Configure a cron job or systemd timer to refresh data +3. Optionally put behind nginx/caddy for SSL + +## Metrics Explained + +- **Overall Throughput**: Total tokens (input + output) processed per second +- **Input Throughput**: Input tokens processed per second (prefill speed) +- **Output Throughput**: Output tokens generated per second (decode speed) +- **Latency**: End-to-end time to complete the request +- **TTFT**: Time to First Token - time until the first output token +- **Acc Length**: Acceptance length for speculative decoding (MTP variants) + +## Contributing + +To add support for new metrics or visualizations: + +1. Update `fetch_metrics.py` if data collection needs changes +2. Modify `app.js` to add new chart types or filters +3. 
Update `index.html` for UI changes + +## Troubleshooting + +**No data displayed** +- Check browser console for errors +- Verify GitHub API is accessible +- Try running with `server.py --fetch-on-start` + +**API rate limits** +- Use a GitHub token for higher limits +- The server caches data for 5 minutes + +**Charts not rendering** +- Ensure Chart.js is loading from CDN +- Check for JavaScript errors in console diff --git a/docs/performance_dashboard/app.js b/docs/performance_dashboard/app.js new file mode 100644 index 000000000000..8bfb12b2ed0c --- /dev/null +++ b/docs/performance_dashboard/app.js @@ -0,0 +1,1056 @@ +// SGLang Performance Dashboard Application + +const GITHUB_REPO = 'sgl-project/sglang'; +const WORKFLOW_NAME = 'nightly-test-nvidia.yml'; +const ARTIFACT_PREFIX = 'consolidated-metrics-'; + +// Chart instances (array for batch-separated charts) +let activeCharts = []; + +// Data storage +let allMetricsData = []; +let currentModel = null; +let currentMetricType = 'throughput'; // throughput, latency, ttft, inputThroughput + +// Metric type definitions +const metricTypes = { + // Text/VLM metrics + throughput: { label: 'Overall Throughput', unit: 'tokens/sec', field: 'throughput', type: 'text' }, + outputThroughput: { label: 'Output Throughput', unit: 'tokens/sec', field: 'outputThroughput', type: 'text' }, + inputThroughput: { label: 'Input Throughput', unit: 'tokens/sec', field: 'inputThroughput', type: 'text' }, + latency: { label: 'Latency', unit: 'ms', field: 'latency', type: 'text' }, + ttft: { label: 'Time to First Token', unit: 'ms', field: 'ttft', type: 'text' }, + accLength: { label: 'Accept Length', unit: 'tokens', field: 'accLength', filterInvalid: true, type: 'text' }, + // Diffusion metrics + e2eMs: { label: 'End-to-End Time', unit: 'ms', field: 'e2e_ms', type: 'diffusion' }, + avgDenoiseMs: { label: 'Avg Denoise Time', unit: 'ms', field: 'avg_denoise_ms', type: 'diffusion' }, + medianDenoiseMs: { label: 'Median Denoise Time', unit: 'ms', 
field: 'median_denoise_ms', type: 'diffusion' } +}; + +// Chart.js default configuration for dark theme +Chart.defaults.color = '#94a3b8'; +Chart.defaults.borderColor = '#1e293b'; + +const chartColors = [ + '#22d3ee', '#34d399', '#fbbf24', '#f87171', '#a78bfa', + '#67e8f9', '#6ee7b7', '#fcd34d', '#fca5a5', '#c4b5fd' +]; + +// Initialize the dashboard +async function init() { + try { + await loadData(); + document.getElementById('loading').style.display = 'none'; + document.getElementById('content').style.display = 'block'; + populateFilters(); + updateStats(); + updateCharts(); + updateRunsTable(); + } catch (error) { + console.error('Failed to initialize dashboard:', error); + document.getElementById('loading').style.display = 'none'; + document.getElementById('error').style.display = 'block'; + document.getElementById('error-message').textContent = error.message; + } +} + +// Load data from local server API or GitHub +async function loadData() { + // Try local server API first (if running server.py) + try { + const response = await fetch('/api/metrics', { headers: getAuthHeaders() }); + if (response.ok) { + const data = await response.json(); + if (data.length > 0 && data[0].results && data[0].results.length > 0) { + allMetricsData = data; + console.log(`Loaded ${data.length} records from local API`); + allMetricsData.sort((a, b) => new Date(b.run_date) - new Date(a.run_date)); + return; + } + } + } catch (error) { + console.log('Local API not available, trying GitHub API'); + } + + // Try to load from GitHub API + const runs = await fetchWorkflowRuns(); + const metricsPromises = runs.map(run => fetchMetricsForRun(run)); + const results = await Promise.allSettled(metricsPromises); + + allMetricsData = results + .filter(r => r.status === 'fulfilled' && r.value !== null) + .map(r => r.value); + + if (allMetricsData.length === 0) { + throw new Error('No metrics data available. 
Please run the server.py with --fetch-on-start to fetch data from GitHub.'); + } + + // Sort by date descending + allMetricsData.sort((a, b) => new Date(b.run_date) - new Date(a.run_date)); +} + +// Fetch workflow runs from GitHub API +async function fetchWorkflowRuns() { + const response = await fetch( + `https://api.github.com/repos/${GITHUB_REPO}/actions/workflows/${WORKFLOW_NAME}/runs?status=completed&per_page=30`, + { + headers: { + 'Accept': 'application/vnd.github.v3+json' + } + } + ); + + if (!response.ok) { + throw new Error(`GitHub API error: ${response.status}`); + } + + const data = await response.json(); + return data.workflow_runs || []; +} + +// Fetch metrics artifact for a specific run +async function fetchMetricsForRun(run) { + try { + // Get artifacts for this run + const artifactsResponse = await fetch( + `https://api.github.com/repos/${GITHUB_REPO}/actions/runs/${run.id}/artifacts`, + { + headers: { + 'Accept': 'application/vnd.github.v3+json' + } + } + ); + + if (!artifactsResponse.ok) return null; + + const artifactsData = await artifactsResponse.json(); + const metricsArtifact = artifactsData.artifacts.find( + a => a.name.startsWith(ARTIFACT_PREFIX) + ); + + if (!metricsArtifact) return null; + + // Note: GitHub API doesn't allow direct artifact download without authentication + // For public access, we would need to use a proxy or pre-process the data + // For now, return run metadata - in production, use a backend to fetch artifacts + return { + run_id: run.id.toString(), + run_date: run.created_at, + commit_sha: run.head_sha, + branch: run.head_branch, + artifact_id: metricsArtifact.id, + results: [] // Would be populated from artifact content + }; + } catch (error) { + console.warn(`Failed to fetch metrics for run ${run.id}:`, error); + return null; + } +} + +// Helper function to detect if result is diffusion type +function isDiffusionResult(result) { + return result.test_type === 'diffusion' || (result.tests && !result.benchmarks); +} + 
+// Populate filter dropdowns +function populateFilters() { + const gpuConfigs = new Set(); + const models = new Set(); + const testNames = new Set(); // For diffusion tests + const batchSizes = new Set(); + const ioLengths = new Set(); + + allMetricsData.forEach(run => { + run.results.forEach(result => { + gpuConfigs.add(result.gpu_config); + + // Handle diffusion results + if (isDiffusionResult(result)) { + models.add(result.test_suite || 'diffusion'); + if (result.tests) { + result.tests.forEach(test => { + testNames.add(test.test_name); + }); + } + } + // Handle text/VLM results + else { + models.add(result.model); + // Try new structure first (benchmarks_by_io_len), fall back to flat benchmarks + if (result.benchmarks_by_io_len) { + Object.entries(result.benchmarks_by_io_len).forEach(([ioKey, ioData]) => { + ioLengths.add(ioKey); + ioData.benchmarks.forEach(bench => { + batchSizes.add(bench.batch_size); + }); + }); + } else if (result.benchmarks) { + result.benchmarks.forEach(bench => { + batchSizes.add(bench.batch_size); + if (bench.input_len && bench.output_len) { + ioLengths.add(`${bench.input_len}_${bench.output_len}`); + } + }); + } + } + }); + }); + + // No "all" option for GPU and Model - populate with first value selected + const gpuArray = Array.from(gpuConfigs).sort(); + const modelArray = Array.from(models).sort(); + + populateSelectNoAll('gpu-filter', gpuArray); + populateSelectNoAll('model-filter', modelArray); + populateSelect('batch-filter', Array.from(batchSizes).sort((a, b) => a - b)); + populateSelectWithLabels('io-len-filter', sortIoLengths(Array.from(ioLengths)), formatIoLenLabel); + + // Set initial values (first option) + if (gpuArray.length > 0) { + document.getElementById('gpu-filter').value = gpuArray[0]; + } + if (modelArray.length > 0) { + document.getElementById('model-filter').value = modelArray[0]; + currentModel = modelArray[0]; + } + + // Update variants based on selected model + updateVariantFilter(); + // Update IO length 
filter based on selected GPU/model + updateIoLenFilter(); + + // Create metric type tabs + createMetricTabs(); +} + +// Format input/output length key for display +function formatIoLenLabel(ioKey) { + if (!ioKey) return 'Unknown'; + const parts = ioKey.split('_'); + if (parts.length === 2) { + return `In: ${parts[0]}, Out: ${parts[1]}`; + } + return ioKey; +} + +// Sort IO length keys numerically (by input length, then output length) +function sortIoLengths(ioLengths) { + return ioLengths.filter(key => key && key.includes('_')).sort((a, b) => { + const [aIn, aOut] = a.split('_').map(Number); + const [bIn, bOut] = b.split('_').map(Number); + if (isNaN(aIn) || isNaN(bIn)) return 0; + return (aIn - bIn) || (aOut - bOut); + }); +} + +// Populate select with custom label formatting +function populateSelectWithLabels(selectId, options, labelFormatter) { + const select = document.getElementById(selectId); + options.forEach(option => { + const opt = document.createElement('option'); + opt.value = option; + opt.textContent = labelFormatter ? 
labelFormatter(option) : option; + select.appendChild(opt); + }); +} + +// Update IO length filter based on selected GPU and model +function updateIoLenFilter() { + const gpuFilterEl = document.getElementById('gpu-filter'); + const modelFilterEl = document.getElementById('model-filter'); + const ioLenSelect = document.getElementById('io-len-filter'); + if (!gpuFilterEl || !modelFilterEl || !ioLenSelect) return; + + const gpuFilter = gpuFilterEl.value; + const modelFilter = modelFilterEl.value; + + const ioLengths = new Set(); + + allMetricsData.forEach(run => { + run.results.forEach(result => { + if (result.gpu_config === gpuFilter && result.model === modelFilter) { + if (result.benchmarks_by_io_len) { + Object.keys(result.benchmarks_by_io_len).forEach(ioKey => { + ioLengths.add(ioKey); + }); + } else if (result.benchmarks) { + result.benchmarks.forEach(bench => { + if (bench.input_len && bench.output_len) { + ioLengths.add(`${bench.input_len}_${bench.output_len}`); + } + }); + } + } + }); + }); + + const ioLenArray = sortIoLengths(Array.from(ioLengths)); + const currentIoLen = ioLenSelect.value; + + // Clear and repopulate + ioLenSelect.innerHTML = ''; + ioLenArray.forEach(ioLen => { + const opt = document.createElement('option'); + opt.value = ioLen; + opt.textContent = formatIoLenLabel(ioLen); + ioLenSelect.appendChild(opt); + }); + + // Try to restore previous selection if still valid + if (ioLenArray.includes(currentIoLen)) { + ioLenSelect.value = currentIoLen; + } else { + ioLenSelect.value = 'all'; + } +} + +// Update variant filter based on selected GPU and model +function updateVariantFilter() { + const gpuFilter = document.getElementById('gpu-filter').value; + const modelFilter = document.getElementById('model-filter').value; + + const variants = new Set(); + + allMetricsData.forEach(run => { + run.results.forEach(result => { + if (result.gpu_config === gpuFilter && result.model === modelFilter) { + // Use 'default' for null/undefined variants + 
variants.add(result.variant || 'default'); + } + }); + }); + + const variantArray = Array.from(variants).sort(); + const variantSelect = document.getElementById('variant-filter'); + const currentVariant = variantSelect.value; + + // Clear and repopulate + variantSelect.innerHTML = ''; + variantArray.forEach(variant => { + const opt = document.createElement('option'); + opt.value = variant; + opt.textContent = variant; + variantSelect.appendChild(opt); + }); + + // Try to restore previous selection if still valid + if (variantArray.includes(currentVariant)) { + variantSelect.value = currentVariant; + } else { + variantSelect.value = 'all'; + } +} + +function populateSelect(selectId, options) { + const select = document.getElementById(selectId); + options.forEach(option => { + const opt = document.createElement('option'); + opt.value = option; + opt.textContent = option; + select.appendChild(opt); + }); +} + +function populateSelectNoAll(selectId, options) { + const select = document.getElementById(selectId); + // Remove the "all" option if present + while (select.options.length > 0) { + select.remove(0); + } + options.forEach(option => { + const opt = document.createElement('option'); + opt.value = option; + opt.textContent = option; + select.appendChild(opt); + }); +} + +function createMetricTabs() { + const tabsContainer = document.getElementById('metric-tabs'); + tabsContainer.innerHTML = ''; + + // Detect if current data is diffusion or text + const isDiffusion = detectCurrentDataType() === 'diffusion'; + const dataType = isDiffusion ? 'diffusion' : 'text'; + + // Filter metrics based on data type + const relevantMetrics = Object.entries(metricTypes).filter(([key, metric]) => + metric.type === dataType + ); + + relevantMetrics.forEach(([key, metric], index) => { + const tab = document.createElement('div'); + tab.className = index === 0 ? 
'tab active' : 'tab'; + tab.textContent = metric.label; + tab.dataset.metric = key; + tab.onclick = () => selectMetricTab(key, tab); + tabsContainer.appendChild(tab); + }); + + // Set initial metric type + if (relevantMetrics.length > 0) { + currentMetricType = relevantMetrics[0][0]; + } +} + +function detectCurrentDataType() { + // Check if currently selected model/GPU config has diffusion data + const gpuFilter = document.getElementById('gpu-filter')?.value; + const modelFilter = currentModel; + + if (!gpuFilter || !modelFilter) return 'text'; + + for (const run of allMetricsData) { + for (const result of run.results) { + if (result.gpu_config === gpuFilter) { + const resultModel = result.test_suite || result.model; + if (resultModel === modelFilter && isDiffusionResult(result)) { + return 'diffusion'; + } + } + } + } + return 'text'; +} + +function selectMetricTab(metricKey, tabElement) { + document.querySelectorAll('.tab').forEach(t => t.classList.remove('active')); + tabElement.classList.add('active'); + currentMetricType = metricKey; + + // Update chart title + const metric = metricTypes[metricKey]; + document.getElementById('metric-title').textContent = `${metric.label} (${metric.unit})`; + + updateCharts(); +} + +// Handle model filter dropdown change +function handleModelFilterChange(model) { + currentModel = model; + // Update variant filter based on new model selection + updateVariantFilter(); + // Update IO length filter based on new model selection + updateIoLenFilter(); + // Recreate metric tabs in case data type changed (text vs diffusion) + createMetricTabs(); + updateCharts(); +} + +// Handle GPU filter change +function handleGpuFilterChange() { + // Update variant filter based on new GPU selection + updateVariantFilter(); + // Update IO length filter based on new GPU selection + updateIoLenFilter(); + // Recreate metric tabs in case data type changed (text vs diffusion) + createMetricTabs(); + updateCharts(); +} + +// Update summary stats 
+function updateStats() { + const statsRow = document.getElementById('stats-row'); + const latestRun = allMetricsData[0]; + + if (!latestRun) { + statsRow.innerHTML = ''; + const noDataDiv = document.createElement('div'); + noDataDiv.className = 'no-data'; + noDataDiv.textContent = 'No data available'; + statsRow.appendChild(noDataDiv); + return; + } + + const totalModels = new Set(latestRun.results.map(r => r.model)).size; + const totalBenchmarks = latestRun.results.reduce((sum, r) => { + // Count benchmarks from either structure + if (r.benchmarks_by_io_len) { + return sum + Object.values(r.benchmarks_by_io_len).reduce( + (ioSum, ioData) => ioSum + ioData.benchmarks.length, 0 + ); + } + return sum + (r.benchmarks ? r.benchmarks.length : 0); + }, 0); + + statsRow.innerHTML = ''; // Clear previous stats + + const addStat = (label, value) => { + const card = document.createElement('div'); + card.className = 'stat-card'; + const labelEl = document.createElement('div'); + labelEl.className = 'label'; + labelEl.textContent = label; + const valueEl = document.createElement('div'); + valueEl.className = 'value'; + valueEl.textContent = value; + card.appendChild(labelEl); + card.appendChild(valueEl); + statsRow.appendChild(card); + }; + + addStat('Total Runs', allMetricsData.length); + addStat('Models Tested', totalModels); + addStat('Benchmarks', totalBenchmarks); +} + +// Update charts based on current filters and selected metric type +function updateCharts() { + const gpuFilter = document.getElementById('gpu-filter').value; + const modelFilter = currentModel; + const variantFilter = document.getElementById('variant-filter').value; + const ioLenFilter = document.getElementById('io-len-filter').value; + const batchFilter = document.getElementById('batch-filter').value; + + // Prepare data for charts - grouped by batch size + const chartDataByBatch = prepareChartDataByBatch(gpuFilter, modelFilter, variantFilter, ioLenFilter, batchFilter); + + // Update chart for the 
selected metric type + updateMetricChart(chartDataByBatch, currentMetricType); +} + +function prepareChartData(gpuFilter, modelFilter, variantFilter, ioLenFilter, batchFilter) { + const seriesMap = new Map(); + + allMetricsData.forEach(run => { + const runDate = new Date(run.run_date); + + run.results.forEach(result => { + // Apply filters + if (result.gpu_config !== gpuFilter) return; + if (result.model !== modelFilter) return; + if (variantFilter !== 'all' && result.variant !== variantFilter) return; + + // Helper function to process a benchmark entry + const processBenchmark = (bench, ioKey) => { + if (batchFilter !== 'all' && bench.batch_size !== parseInt(batchFilter)) return; + + const ioLabel = ioKey ? `, ${formatIoLenLabel(ioKey)}` : ''; + const seriesKey = `${result.model.split('/').pop()} (${result.variant}, BS=${bench.batch_size}${ioLabel})`; + + if (!seriesMap.has(seriesKey)) { + seriesMap.set(seriesKey, { + label: seriesKey, + data: [], + model: result.model, + variant: result.variant, + batchSize: bench.batch_size, + ioKey: ioKey + }); + } + + seriesMap.get(seriesKey).data.push({ + x: runDate, + throughput: bench.overall_throughput, + outputThroughput: bench.output_throughput, + latency: bench.latency_ms, + ttft: bench.ttft_ms, + inputThroughput: bench.input_throughput, + accLength: bench.acc_length, + runId: run.run_id + }); + }; + + // Use benchmarks_by_io_len if available + if (result.benchmarks_by_io_len) { + Object.entries(result.benchmarks_by_io_len).forEach(([ioKey, ioData]) => { + if (ioLenFilter !== 'all' && ioKey !== ioLenFilter) return; + ioData.benchmarks.forEach(bench => processBenchmark(bench, ioKey)); + }); + } else if (result.benchmarks) { + result.benchmarks.forEach(bench => { + const benchIoKey = bench.input_len && bench.output_len + ? 
`${bench.input_len}_${bench.output_len}` + : null; + if (ioLenFilter !== 'all' && benchIoKey !== ioLenFilter) return; + processBenchmark(bench, benchIoKey); + }); + } + }); + }); + + // Sort data points by date + seriesMap.forEach(series => { + series.data.sort((a, b) => a.x - b.x); + }); + + return Array.from(seriesMap.values()); +} + +// Prepare chart data grouped by batch size - each batch size is a separate series +function prepareChartDataByBatch(gpuFilter, modelFilter, variantFilter, ioLenFilter, batchFilter) { + const batchDataMap = new Map(); // batch_size -> Map of variant -> data + const testDataMap = new Map(); // For diffusion: test_name -> data + + allMetricsData.forEach(run => { + const runDate = new Date(run.run_date); + + run.results.forEach(result => { + // Apply filters - GPU and Model are required (no "all" option) + if (result.gpu_config !== gpuFilter) return; + + // Handle diffusion results + if (isDiffusionResult(result)) { + const resultModel = result.test_suite || 'diffusion'; + if (resultModel !== modelFilter) return; + + if (result.tests) { + result.tests.forEach(test => { + const testName = test.test_name; + if (!testDataMap.has(testName)) { + testDataMap.set(testName, { + label: testName, + data: [], + model: resultModel, + testName: testName + }); + } + + testDataMap.get(testName).data.push({ + x: runDate, + e2e_ms: test.e2e_ms, + avg_denoise_ms: test.avg_denoise_ms, + median_denoise_ms: test.median_denoise_ms, + runId: run.run_id + }); + }); + } + return; + } + + // Handle text/VLM results + if (result.model !== modelFilter) return; + if (variantFilter !== 'all' && result.variant !== variantFilter) return; + + // Use benchmarks_by_io_len if available, otherwise fall back to flat benchmarks + if (result.benchmarks_by_io_len) { + Object.entries(result.benchmarks_by_io_len).forEach(([ioKey, ioData]) => { + // Apply IO length filter + if (ioLenFilter !== 'all' && ioKey !== ioLenFilter) return; + + ioData.benchmarks.forEach(bench => { + if 
(batchFilter !== 'all' && bench.batch_size !== parseInt(batchFilter)) return; + + const batchSize = bench.batch_size; + const variantLabel = result.variant || 'default'; + // Include IO length in series key when showing all lengths + const seriesKey = ioLenFilter === 'all' + ? `${variantLabel} (${formatIoLenLabel(ioKey)})` + : variantLabel; + + if (!batchDataMap.has(batchSize)) { + batchDataMap.set(batchSize, new Map()); + } + + const variantMap = batchDataMap.get(batchSize); + if (!variantMap.has(seriesKey)) { + variantMap.set(seriesKey, { + label: seriesKey, + data: [], + model: result.model, + variant: result.variant, + batchSize: batchSize, + ioKey: ioKey + }); + } + + variantMap.get(seriesKey).data.push({ + x: runDate, + throughput: bench.overall_throughput, + outputThroughput: bench.output_throughput, + latency: bench.latency_ms, + ttft: bench.ttft_ms, + inputThroughput: bench.input_throughput, + accLength: bench.acc_length, + runId: run.run_id + }); + }); + }); + } else if (result.benchmarks) { + // Fall back to flat benchmarks for backward compatibility + result.benchmarks.forEach(bench => { + // Apply IO length filter using flat structure + const benchIoKey = bench.input_len && bench.output_len + ? `${bench.input_len}_${bench.output_len}` + : null; + if (ioLenFilter !== 'all' && benchIoKey !== ioLenFilter) return; + if (batchFilter !== 'all' && bench.batch_size !== parseInt(batchFilter)) return; + + const batchSize = bench.batch_size; + const variantLabel = result.variant || 'default'; + // Include IO length in series key when showing all lengths + const seriesKey = ioLenFilter === 'all' && benchIoKey + ? 
`${variantLabel} (${formatIoLenLabel(benchIoKey)})` + : variantLabel; + + if (!batchDataMap.has(batchSize)) { + batchDataMap.set(batchSize, new Map()); + } + + const variantMap = batchDataMap.get(batchSize); + if (!variantMap.has(seriesKey)) { + variantMap.set(seriesKey, { + label: seriesKey, + data: [], + model: result.model, + variant: result.variant, + batchSize: batchSize, + ioKey: benchIoKey + }); + } + + variantMap.get(seriesKey).data.push({ + x: runDate, + throughput: bench.overall_throughput, + outputThroughput: bench.output_throughput, + latency: bench.latency_ms, + ttft: bench.ttft_ms, + inputThroughput: bench.input_throughput, + accLength: bench.acc_length, + runId: run.run_id + }); + }); + } + }); + }); + + // Sort data points by date and convert to array format + const result = {}; + + // For diffusion data, use test names as "batch sizes" + if (testDataMap.size > 0) { + testDataMap.forEach((series, testName) => { + series.data.sort((a, b) => a.x - b.x); + result[testName] = [series]; // Each test is its own series + }); + return result; + } + + // For text/VLM data, use batch sizes + batchDataMap.forEach((variantMap, batchSize) => { + variantMap.forEach(series => { + series.data.sort((a, b) => a.x - b.x); + }); + result[batchSize] = Array.from(variantMap.values()); + }); + + return result; +} + +// Unified chart update function for any metric type +function updateMetricChart(chartDataByBatch, metricType) { + const container = document.getElementById('charts-container'); + container.innerHTML = ''; + + // Destroy existing charts + activeCharts.forEach(chart => chart.destroy()); + activeCharts = []; + + const metric = metricTypes[metricType]; + const isDiffusion = metric.type === 'diffusion'; + + // For diffusion, keys are test names; for text, keys are batch sizes + const keys = Object.keys(chartDataByBatch); + if (!isDiffusion) { + keys.sort((a, b) => parseInt(a) - parseInt(b)); + } else { + keys.sort(); // Alphabetical sort for test names + } + const 
batchSizes = keys; // Keep variable name for compatibility + + if (batchSizes.length === 0) { + container.innerHTML = '
No data available for the selected filters
'; + return; + } + + let hasAnyData = false; + + batchSizes.forEach(batchSize => { + const chartData = chartDataByBatch[batchSize]; + + const ctx_datasets = chartData.map((series, index) => { + // Filter data points - for metrics like accLength, exclude invalid values (-1 or null) + let dataPoints = series.data.map(d => ({ x: d.x, y: d[metric.field] })); + if (metric.filterInvalid) { + dataPoints = dataPoints.filter(d => d.y != null && d.y !== -1 && d.y > 0); + } + return { + label: series.label, + data: dataPoints, + borderColor: chartColors[index % chartColors.length], + backgroundColor: chartColors[index % chartColors.length] + '20', + tension: 0.1, + fill: false + }; + }).filter(dataset => dataset.data.length > 0); // Remove empty datasets + + // Skip this batch size if no valid data + if (ctx_datasets.length === 0) { + return; + } + + hasAnyData = true; + + const chartWrapper = document.createElement('div'); + chartWrapper.className = 'batch-chart-wrapper'; + + const title = document.createElement('div'); + title.className = 'batch-chart-title'; + // For diffusion, show test name; for text, show batch size + title.textContent = isDiffusion ? `Test: ${batchSize}` : `Batch Size: ${batchSize}`; + chartWrapper.appendChild(title); + + const chartContainer = document.createElement('div'); + chartContainer.className = 'chart-container'; + const canvas = document.createElement('canvas'); + chartContainer.appendChild(canvas); + chartWrapper.appendChild(chartContainer); + container.appendChild(chartWrapper); + + const ctx = canvas.getContext('2d'); + + const chart = new Chart(ctx, { + type: 'line', + data: { datasets: ctx_datasets }, + options: getChartOptions(metric.unit) + }); + activeCharts.push(chart); + }); + + // Show message if no valid data for this metric + if (!hasAnyData) { + container.innerHTML = `
No valid ${metric.label.toLowerCase()} data available for the selected filters
`; + } +} + +function getChartOptions(yAxisLabel) { + return { + responsive: true, + maintainAspectRatio: false, + interaction: { + mode: 'index', + intersect: false + }, + plugins: { + legend: { + position: 'bottom', + labels: { + boxWidth: 12, + padding: 10, + font: { size: 11 } + } + }, + tooltip: { + backgroundColor: '#1a2332', + borderColor: 'rgba(148, 163, 184, 0.1)', + borderWidth: 1, + titleFont: { size: 13, family: "'DM Sans', sans-serif" }, + bodyFont: { size: 12, family: "'JetBrains Mono', monospace" }, + padding: 14, + cornerRadius: 8 + } + }, + scales: { + x: { + type: 'time', + time: { + unit: 'day', + displayFormats: { + day: 'MMM d' + } + }, + grid: { + color: 'rgba(148, 163, 184, 0.06)' + } + }, + y: { + title: { + display: true, + text: yAxisLabel + }, + grid: { + color: 'rgba(148, 163, 184, 0.06)' + } + } + } + }; +} + +// Escape HTML to prevent XSS +function escapeHtml(text) { + const div = document.createElement('div'); + div.textContent = text; + return div.innerHTML; +} + +// Update runs table +function updateRunsTable() { + const tbody = document.getElementById('runs-table-body'); + tbody.innerHTML = ''; + + allMetricsData.slice(0, 10).forEach(run => { + // Diffusion results may carry test_suite instead of model, so fall back as populateFilters does + const models = new Set(run.results.map(r => (r.model || r.test_suite || 'diffusion').split('/').pop())); + const date = new Date(run.run_date); + + const row = document.createElement('tr'); + + // Create cells safely to prevent XSS + const dateCell = document.createElement('td'); + dateCell.textContent = `${date.toLocaleDateString()} ${date.toLocaleTimeString()}`; + + const runIdCell = document.createElement('td'); + const runLink = document.createElement('a'); + runLink.href = `https://github.com/${GITHUB_REPO}/actions/runs/${encodeURIComponent(run.run_id)}`; + runLink.target = '_blank'; + runLink.className = 'run-link'; + runLink.textContent = run.run_id; + runIdCell.appendChild(runLink); + + const commitCell = document.createElement('td'); + const commitCode = document.createElement('code'); + commitCode.textContent = 
run.commit_sha.substring(0, 7); + commitCell.appendChild(commitCode); + + const branchCell = document.createElement('td'); + branchCell.textContent = run.branch; + + const modelsCell = document.createElement('td'); + Array.from(models).forEach((model, index) => { + if (index > 0) modelsCell.appendChild(document.createTextNode(' ')); + const badge = document.createElement('span'); + badge.className = 'model-badge'; + badge.textContent = model; + modelsCell.appendChild(badge); + }); + + row.appendChild(dateCell); + row.appendChild(runIdCell); + row.appendChild(commitCell); + row.appendChild(branchCell); + row.appendChild(modelsCell); + + tbody.appendChild(row); + }); +} + +// Refresh data +async function refreshData() { + document.getElementById('content').style.display = 'none'; + document.getElementById('loading').style.display = 'flex'; + await init(); +} + +// Format numbers for display +function formatNumber(num) { + if (num >= 1000) { + return (num / 1000).toFixed(1) + 'k'; + } + return num.toFixed(1); +} + +// Authentication state +let authToken = sessionStorage.getItem('dashboard_auth_token') || null; + +// Get auth headers for API requests +function getAuthHeaders() { + const headers = {}; + if (authToken) { + headers['Authorization'] = `Bearer ${authToken}`; + } + return headers; +} + +// Check if server requires authentication and show/hide login accordingly +async function checkAuthAndInit() { + const loginOverlay = document.getElementById('login-overlay'); + const dashboardContainer = document.getElementById('dashboard-container'); + + try { + const response = await fetch('/api/auth-check'); + if (response.ok) { + const data = await response.json(); + if (!data.auth_required) { + // No auth required - skip login, show dashboard directly + loginOverlay.style.display = 'none'; + dashboardContainer.style.display = 'block'; + init(); + return; + } + } + } catch (e) { + // Server not available (e.g. 
static hosting) - skip login + loginOverlay.style.display = 'none'; + dashboardContainer.style.display = 'block'; + init(); + return; + } + + // Auth is required - check if we have a valid token from a previous session + if (authToken) { + try { + const testResponse = await fetch('/api/metrics', { + headers: getAuthHeaders() + }); + if (testResponse.ok) { + loginOverlay.style.display = 'none'; + dashboardContainer.style.display = 'block'; + init(); + return; + } + } catch (e) { + // Token invalid or expired + } + // Clear invalid token + authToken = null; + sessionStorage.removeItem('dashboard_auth_token'); + } + + // Show login form + loginOverlay.style.display = 'flex'; + dashboardContainer.style.display = 'none'; +} + +// Handle login form submission +async function handleLogin(event) { + event.preventDefault(); + + const username = document.getElementById('login-username').value; + const password = document.getElementById('login-password').value; + const errorEl = document.getElementById('login-error'); + const loginBtn = document.getElementById('login-btn'); + + errorEl.textContent = ''; + loginBtn.disabled = true; + loginBtn.textContent = 'Signing in...'; + + try { + const response = await fetch('/api/login', { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ username, password }) + }); + + const data = await response.json(); + + if (response.ok && data.token) { + authToken = data.token; + sessionStorage.setItem('dashboard_auth_token', authToken); + + document.getElementById('login-overlay').style.display = 'none'; + document.getElementById('dashboard-container').style.display = 'block'; + init(); + } else { + errorEl.textContent = data.error || 'Invalid username or password'; + } + } catch (e) { + errorEl.textContent = 'Unable to connect to server'; + } finally { + loginBtn.disabled = false; + loginBtn.textContent = 'Sign In'; + } + + return false; +} + +// Initialize on page load 
+document.addEventListener('DOMContentLoaded', checkAuthAndInit); diff --git a/docs/performance_dashboard/fetch_metrics.py b/docs/performance_dashboard/fetch_metrics.py new file mode 100755 index 000000000000..264e7f334c0d --- /dev/null +++ b/docs/performance_dashboard/fetch_metrics.py @@ -0,0 +1,272 @@ +#!/usr/bin/env python3 +""" +Fetch and process SGLang nightly test metrics from GitHub Actions artifacts. + +This script fetches consolidated metrics from GitHub Actions workflow runs +and outputs them as JSON for the performance dashboard. + +Usage: + python fetch_metrics.py --output metrics_data.json + python fetch_metrics.py --output metrics_data.json --days 30 + python fetch_metrics.py --output metrics_data.json --run-id 21338741812 +""" + +import argparse +import io +import json +import os +import sys +import zipfile +from datetime import datetime, timedelta, timezone +from pathlib import Path +from typing import Optional + +import requests + +GITHUB_REPO = "sgl-project/sglang" +WORKFLOW_NAME = "nightly-test-nvidia.yml" +ARTIFACT_PREFIX = "consolidated-metrics-" + + +def get_github_token() -> Optional[str]: + """Get GitHub token from environment or gh CLI.""" + # Check environment variable first + token = os.environ.get("GITHUB_TOKEN") + if token: + return token + + # Try gh CLI + try: + import subprocess + + result = subprocess.run( + ["gh", "auth", "token"], + capture_output=True, + text=True, + check=True, + ) + return result.stdout.strip() + except (subprocess.CalledProcessError, FileNotFoundError): + pass + + return None + + +def get_headers(token: Optional[str]) -> dict: + """Get request headers with optional authentication.""" + headers = { + "Accept": "application/vnd.github.v3+json", + } + if token: + headers["Authorization"] = f"Bearer {token}" + return headers + + +def fetch_workflow_runs( + token: Optional[str], + days: int = 30, + event: Optional[str] = None, +) -> list: + """Fetch completed workflow runs from GitHub Actions.""" + url = 
f"https://api.github.com/repos/{GITHUB_REPO}/actions/workflows/{WORKFLOW_NAME}/runs" + + params = { + "status": "completed", + "per_page": 100, + } + + if event: + params["event"] = event + + response = requests.get(url, headers=get_headers(token), params=params, timeout=30) + response.raise_for_status() + + runs = response.json().get("workflow_runs", []) + + # Filter by date + cutoff = datetime.now(timezone.utc) - timedelta(days=days) + runs = [ + run + for run in runs + if datetime.fromisoformat(run["created_at"].replace("Z", "+00:00")) > cutoff + ] + + return runs + + +def fetch_run_artifacts(token: Optional[str], run_id: int) -> list: + """Fetch artifacts for a specific workflow run.""" + url = f"https://api.github.com/repos/{GITHUB_REPO}/actions/runs/{run_id}/artifacts" + + response = requests.get(url, headers=get_headers(token), timeout=30) + response.raise_for_status() + + return response.json().get("artifacts", []) + + +def download_artifact(token: Optional[str], artifact_id: int) -> Optional[bytes]: + """Download an artifact by ID.""" + if not token: + print(f"Warning: GitHub token required to download artifacts", file=sys.stderr) + return None + + url = f"https://api.github.com/repos/{GITHUB_REPO}/actions/artifacts/{artifact_id}/zip" + + headers = get_headers(token) + response = requests.get(url, headers=headers, allow_redirects=True, timeout=60) + + if response.status_code == 200: + return response.content + + print( + f"Failed to download artifact {artifact_id}: {response.status_code}", + file=sys.stderr, + ) + return None + + +def extract_metrics_from_zip(zip_content: bytes) -> Optional[dict]: + """Extract metrics JSON from a zip file.""" + try: + with zipfile.ZipFile(io.BytesIO(zip_content)) as zf: + # Find the JSON file in the archive + json_files = [f for f in zf.namelist() if f.endswith(".json")] + if not json_files: + return None + + with zf.open(json_files[0]) as f: + return json.load(f) + except (zipfile.BadZipFile, json.JSONDecodeError) as e: + 
print(f"Failed to extract metrics: {e}", file=sys.stderr) + return None + + +def fetch_metrics_for_run(token: Optional[str], run: dict) -> Optional[dict]: + """Fetch metrics for a single workflow run.""" + run_id = run["id"] + print(f"Fetching metrics for run {run_id}...", file=sys.stderr) + + artifacts = fetch_run_artifacts(token, run_id) + + # Find consolidated metrics artifact + metrics_artifact = None + for artifact in artifacts: + if artifact["name"].startswith(ARTIFACT_PREFIX): + metrics_artifact = artifact + break + + if not metrics_artifact: + print(f"No consolidated metrics found for run {run_id}", file=sys.stderr) + return None + + # Download and extract + zip_content = download_artifact(token, metrics_artifact["id"]) + if not zip_content: + return None + + metrics = extract_metrics_from_zip(zip_content) + if not metrics: + return None + + # Ensure required fields are present + if "run_id" not in metrics: + metrics["run_id"] = str(run_id) + if "run_date" not in metrics: + metrics["run_date"] = run["created_at"] + if "commit_sha" not in metrics: + metrics["commit_sha"] = run["head_sha"] + if "branch" not in metrics: + metrics["branch"] = run["head_branch"] + + return metrics + + +def fetch_single_run(token: Optional[str], run_id: int) -> Optional[dict]: + """Fetch metrics for a single run by ID.""" + url = f"https://api.github.com/repos/{GITHUB_REPO}/actions/runs/{run_id}" + + response = requests.get(url, headers=get_headers(token), timeout=30) + response.raise_for_status() + + run = response.json() + return fetch_metrics_for_run(token, run) + + +def main(): + parser = argparse.ArgumentParser( + description="Fetch SGLang nightly test metrics from GitHub Actions" + ) + parser.add_argument( + "--output", + "-o", + type=str, + default="metrics_data.json", + help="Output JSON file path", + ) + parser.add_argument( + "--days", + type=int, + default=30, + help="Number of days to fetch (default: 30)", + ) + parser.add_argument( + "--run-id", + type=int, + 
help="Fetch a specific run by ID", + ) + parser.add_argument( + "--event", + type=str, + choices=["schedule", "workflow_dispatch", "push"], + help="Filter by trigger event type", + ) + parser.add_argument( + "--scheduled-only", + action="store_true", + help="Only fetch scheduled (nightly) runs", + ) + + args = parser.parse_args() + + token = get_github_token() + if not token: + print( + "Warning: No GitHub token found. Some features may be limited.", + file=sys.stderr, + ) + print( + "Set GITHUB_TOKEN env var or login with 'gh auth login'", + file=sys.stderr, + ) + + all_metrics = [] + + if args.run_id: + # Fetch single run + metrics = fetch_single_run(token, args.run_id) + if metrics: + all_metrics.append(metrics) + else: + # Fetch multiple runs + event = "schedule" if args.scheduled_only else args.event + runs = fetch_workflow_runs(token, days=args.days, event=event) + print(f"Found {len(runs)} workflow runs", file=sys.stderr) + + for run in runs: + metrics = fetch_metrics_for_run(token, run) + if metrics: + all_metrics.append(metrics) + + # Sort by date descending + all_metrics.sort(key=lambda x: x.get("run_date", ""), reverse=True) + + # Write output + output_path = Path(args.output) + with open(output_path, "w") as f: + json.dump(all_metrics, f, indent=2) + + print(f"Wrote {len(all_metrics)} metrics records to {output_path}", file=sys.stderr) + + +if __name__ == "__main__": + main() diff --git a/docs/performance_dashboard/index.html b/docs/performance_dashboard/index.html new file mode 100644 index 000000000000..e680f981a108 --- /dev/null +++ b/docs/performance_dashboard/index.html @@ -0,0 +1,946 @@ + + + + + + SGLang Performance Dashboard + + + + + + + + + + + +
+ + + + diff --git a/docs/performance_dashboard/server.py b/docs/performance_dashboard/server.py new file mode 100755 index 000000000000..1e025ce856e3 --- /dev/null +++ b/docs/performance_dashboard/server.py @@ -0,0 +1,422 @@ +#!/usr/bin/env python3 +""" +Simple development server for the SGLang Performance Dashboard. + +This server: +1. Serves the static HTML/JS files +2. Provides an API endpoint to fetch metrics from GitHub +3. Caches metrics data to reduce API calls + +Usage: + python server.py + python server.py --port 8080 + python server.py --host 0.0.0.0 # Allow external access + python server.py --fetch-on-start + python server.py --username admin --password secret # Enable authentication + DASHBOARD_USERNAME=admin DASHBOARD_PASSWORD=secret python server.py # Via env vars + python server.py --refresh-interval 12 # Auto-refresh data every 12 hours +""" + +import argparse +import hashlib +import hmac +import http.server +import io +import json +import os +import secrets +import socketserver +import threading +import time +import zipfile +from datetime import datetime, timedelta, timezone +from pathlib import Path +from urllib.parse import urlparse + +import requests + +GITHUB_REPO = "sgl-project/sglang" +WORKFLOW_NAME = "nightly-test-nvidia.yml" +ARTIFACT_PREFIX = "consolidated-metrics-" + +# Cache for metrics data with thread-safe lock +cache_lock = threading.Lock() +metrics_cache = { + "data": [], + "last_updated": None, + "updating": False, +} + +CACHE_TTL = 300 # 5 minutes +REQUEST_TIMEOUT = 30 # seconds + +# Authentication configuration (set via CLI flags) +auth_config = { + "enabled": False, + "username": None, + "password_hash": None, # SHA-256 hash of the password + "active_tokens": {}, # token -> expiry timestamp +} +auth_lock = threading.Lock() +AUTH_TOKEN_TTL = 3600 # 1 hour + + +def hash_password(password): + """Hash a password using SHA-256 for constant-time comparison.""" + return hashlib.sha256(password.encode("utf-8")).hexdigest() + + +def 
create_auth_token(): + """Create a new session token.""" + token = secrets.token_hex(32) + with auth_lock: + # Clean up expired tokens + now = time.time() + auth_config["active_tokens"] = { + t: exp for t, exp in auth_config["active_tokens"].items() if exp > now + } + auth_config["active_tokens"][token] = now + AUTH_TOKEN_TTL + return token + + +def verify_auth_token(token): + """Verify a session token is valid and not expired.""" + if not token: + return False + with auth_lock: + expiry = auth_config["active_tokens"].get(token) + if expiry and expiry > time.time(): + return True + # Remove expired token + auth_config["active_tokens"].pop(token, None) + return False + + +def get_github_token(): + """Get GitHub token from environment or gh CLI.""" + token = os.environ.get("GITHUB_TOKEN") + if token: + return token + + try: + import subprocess + + result = subprocess.run( + ["gh", "auth", "token"], + capture_output=True, + text=True, + check=True, + ) + return result.stdout.strip() + except (subprocess.CalledProcessError, FileNotFoundError): + pass + + return None + + +def fetch_metrics_from_github(days=30): + """Fetch metrics from GitHub Actions artifacts.""" + token = get_github_token() + headers = {"Accept": "application/vnd.github.v3+json"} + if token: + headers["Authorization"] = f"Bearer {token}" + + # Get workflow runs - only scheduled (nightly) runs, not workflow_dispatch + url = f"https://api.github.com/repos/{GITHUB_REPO}/actions/workflows/{WORKFLOW_NAME}/runs" + params = {"status": "completed", "per_page": 50, "event": "schedule"} + + try: + response = requests.get( + url, headers=headers, params=params, timeout=REQUEST_TIMEOUT + ) + if not response.ok: + print(f"Failed to fetch workflow runs: {response.status_code}") + return [] + except requests.exceptions.RequestException as e: + print(f"Network error fetching workflow runs: {e}") + return [] + + runs = response.json().get("workflow_runs", []) + + # Filter by date + cutoff = datetime.now(timezone.utc) - 
timedelta(days=days) + runs = [ + run + for run in runs + if datetime.fromisoformat(run["created_at"].replace("Z", "+00:00")) > cutoff + ] + + all_metrics = [] + + for run in runs[:20]: # Limit to 20 most recent + run_id = run["id"] + + # Get artifacts + artifacts_url = f"https://api.github.com/repos/{GITHUB_REPO}/actions/runs/{run_id}/artifacts" + try: + artifacts_resp = requests.get( + artifacts_url, headers=headers, timeout=REQUEST_TIMEOUT + ) + if not artifacts_resp.ok: + continue + except requests.exceptions.RequestException as e: + print(f"Network error fetching artifacts for run {run_id}: {e}") + continue + + artifacts = artifacts_resp.json().get("artifacts", []) + + # Find consolidated metrics + for artifact in artifacts: + if artifact["name"].startswith(ARTIFACT_PREFIX): + if not token: + # Without token, we can't download - return metadata only + all_metrics.append( + { + "run_id": str(run_id), + "run_date": run["created_at"], + "commit_sha": run["head_sha"], + "branch": run["head_branch"], + "results": [], + } + ) + break + + # Download artifact + download_url = f"https://api.github.com/repos/{GITHUB_REPO}/actions/artifacts/{artifact['id']}/zip" + try: + download_resp = requests.get( + download_url, + headers=headers, + allow_redirects=True, + timeout=REQUEST_TIMEOUT, + ) + except requests.exceptions.RequestException as e: + print(f"Network error downloading artifact: {e}") + break + + if download_resp.ok: + try: + with zipfile.ZipFile(io.BytesIO(download_resp.content)) as zf: + json_files = [ + f for f in zf.namelist() if f.endswith(".json") + ] + if json_files: + with zf.open(json_files[0]) as f: + metrics = json.load(f) + # Ensure required fields + metrics.setdefault("run_id", str(run_id)) + metrics.setdefault("run_date", run["created_at"]) + metrics.setdefault("commit_sha", run["head_sha"]) + metrics.setdefault("branch", run["head_branch"]) + all_metrics.append(metrics) + except (zipfile.BadZipFile, json.JSONDecodeError) as e: + print(f"Failed to 
process artifact: {e}") + break + + return all_metrics + + +def update_cache_async(): + """Update the metrics cache in background with thread safety.""" + with cache_lock: + if metrics_cache["updating"]: + return + metrics_cache["updating"] = True + + try: + data = fetch_metrics_from_github() + with cache_lock: + metrics_cache["data"] = data + metrics_cache["last_updated"] = time.time() + print(f"Cache updated with {len(data)} metrics records") + finally: + with cache_lock: + metrics_cache["updating"] = False + + +def start_periodic_refresh(interval_hours): + """Start a background thread that refreshes the cache periodically.""" + interval_seconds = interval_hours * 3600 + + def refresh_loop(): + while True: + time.sleep(interval_seconds) + print(f"Periodic refresh triggered (every {interval_hours}h)") + update_cache_async() + + thread = threading.Thread(target=refresh_loop, daemon=True) + thread.start() + print(f"Periodic refresh enabled: every {interval_hours} hours") + + +class DashboardHandler(http.server.SimpleHTTPRequestHandler): + """HTTP request handler for the dashboard.""" + + def __init__(self, *args, directory=None, **kwargs): + super().__init__(*args, directory=directory, **kwargs) + + def _send_json(self, data, status=200): + """Send a JSON response.""" + self.send_response(status) + self.send_header("Content-Type", "application/json") + self.send_header("Access-Control-Allow-Origin", "*") + self.end_headers() + self.wfile.write(json.dumps(data).encode()) + + def _check_auth(self): + """Check if request is authenticated. 
Returns True if OK, sends 401 and returns False otherwise.""" + if not auth_config["enabled"]: + return True + auth_header = self.headers.get("Authorization", "") + if auth_header.startswith("Bearer "): + token = auth_header[7:] + if verify_auth_token(token): + return True + self._send_json({"error": "Unauthorized"}, status=401) + return False + + def do_GET(self): + parsed = urlparse(self.path) + + # Prevent directory traversal attacks + if ".." in parsed.path or parsed.path.startswith("//"): + self.send_error(400, "Invalid path") + return + + if parsed.path == "/api/auth-check": + self.handle_auth_check() + elif parsed.path == "/api/metrics": + if self._check_auth(): + self.handle_metrics_api(parsed) + elif parsed.path == "/api/refresh": + if self._check_auth(): + self.handle_refresh_api() + else: + super().do_GET() + + def do_POST(self): + parsed = urlparse(self.path) + + if parsed.path == "/api/login": + self.handle_login() + else: + self.send_error(404, "Not Found") + + def handle_auth_check(self): + """Tell the frontend whether authentication is required.""" + self._send_json({"auth_required": auth_config["enabled"]}) + + def handle_login(self): + """Validate username/password and return a session token.""" + content_length = int(self.headers.get("Content-Length", 0)) + if content_length == 0 or content_length > 4096: + self._send_json({"error": "Invalid request"}, status=400) + return + + try: + body = json.loads(self.rfile.read(content_length)) + except (json.JSONDecodeError, ValueError): + self._send_json({"error": "Invalid JSON"}, status=400) + return + + username = body.get("username", "") + password = body.get("password", "") + + if hmac.compare_digest( + username, auth_config["username"] + ) and hmac.compare_digest( + hash_password(password), auth_config["password_hash"] + ): + token = create_auth_token() + self._send_json({"token": token}) + else: + self._send_json({"error": "Invalid username or password"}, status=401) + + def handle_metrics_api(self, 
parsed): + """Handle /api/metrics endpoint.""" + # Check cache with thread safety + with cache_lock: + cache_valid = ( + metrics_cache["last_updated"] + and time.time() - metrics_cache["last_updated"] < CACHE_TTL + ) + data = metrics_cache["data"].copy() + + if not cache_valid: + # Trigger background update + threading.Thread(target=update_cache_async, daemon=True).start() + + self._send_json(data) + + def handle_refresh_api(self): + """Handle /api/refresh endpoint.""" + threading.Thread(target=update_cache_async, daemon=True).start() + self._send_json({"status": "refreshing"}) + + def log_message(self, format, *args): + """Custom log format.""" + print(f"[{self.log_date_time_string()}] {args[0]}") + + +def main(): + parser = argparse.ArgumentParser(description="SGLang Performance Dashboard Server") + parser.add_argument("--port", type=int, default=8000, help="Port to serve on") + parser.add_argument( + "--host", + default="127.0.0.1", + help="Host to bind to (use 0.0.0.0 for external access)", + ) + parser.add_argument( + "--fetch-on-start", action="store_true", help="Fetch metrics on startup" + ) + parser.add_argument( + "--refresh-interval", + type=float, + default=12, + help="Auto-refresh interval in hours (default: 12, set to 0 to disable)", + ) + parser.add_argument( + "--username", + default=os.environ.get("DASHBOARD_USERNAME"), + help="Username for dashboard authentication (or set DASHBOARD_USERNAME env var)", + ) + parser.add_argument( + "--password", + default=os.environ.get("DASHBOARD_PASSWORD"), + help="Password for dashboard authentication (or set DASHBOARD_PASSWORD env var)", + ) + args = parser.parse_args() + + # Configure authentication if both username and password are provided + if args.username and args.password: + auth_config["enabled"] = True + auth_config["username"] = args.username + auth_config["password_hash"] = hash_password(args.password) + print(f"Authentication enabled for user: {args.username}") + elif args.username or args.password: + 
parser.error("Both --username and --password must be provided together") + + # Change to dashboard directory + dashboard_dir = Path(__file__).parent + os.chdir(dashboard_dir) + + if args.fetch_on_start: + print("Fetching initial metrics data...") + update_cache_async() + + if args.refresh_interval > 0: + start_periodic_refresh(args.refresh_interval) + + handler = lambda *a, **kw: DashboardHandler(*a, directory=str(dashboard_dir), **kw) + + with socketserver.TCPServer((args.host, args.port), handler) as httpd: + print(f"Serving dashboard at http://{args.host}:{args.port}") + print("Press Ctrl+C to stop") + try: + httpd.serve_forever() + except KeyboardInterrupt: + print("\nShutting down...") + + +if __name__ == "__main__": + main() diff --git a/docs/platforms/amd_gpu.md b/docs/platforms/amd_gpu.md index 1759e9a7309a..ca427d38abf9 100644 --- a/docs/platforms/amd_gpu.md +++ b/docs/platforms/amd_gpu.md @@ -1,6 +1,6 @@ # AMD GPUs -This document describes how run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues). +This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues). ## System Configuration @@ -44,7 +44,7 @@ You can install SGLang using one of the methods below. ```bash # Use the last release branch -git clone -b v0.5.6.post2 https://github.com/sgl-project/sglang.git +git clone -b v0.5.9 https://github.com/sgl-project/sglang.git cd sglang # Compile sgl-kernel @@ -55,7 +55,7 @@ python setup_rocm.py install # Install sglang python package along with diffusion support cd .. rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml -pip install -e "python[all_hip,diffusion_hip]" +pip install -e "python[all_hip]" ``` ### Install Using Docker (Recommended) @@ -114,6 +114,42 @@ The steps below show how to build and use an image. 
With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities. +## Quantization on AMD GPUs + +The [Quantization documentation](../advanced_features/quantization.md#platform-compatibility) has a full compatibility matrix. The short version: FP8, AWQ, MXFP4, W8A8, GPTQ, compressed-tensors, Quark, and **petit_nvfp4** (NVFP4 on ROCm via [Petit](https://github.com/causalflow-ai/petit-kernel)) all work on AMD. Methods that depend on Marlin or NVIDIA-specific kernels (`awq_marlin`, `gptq_marlin`, `gguf`, `modelopt_fp8`, `modelopt_fp4`) do not. + +A few things to keep in mind: + +- FP8 works via Aiter or Triton. Pre-quantized FP8 models like DeepSeek-V3/R1 work out of the box. +- AWQ uses Triton dequantization kernels on AMD. The faster Marlin path is not available. +- MXFP4 requires CDNA3/CDNA4 and `SGLANG_USE_AITER=1`. +- `petit_nvfp4` enables NVFP4 models (e.g., [Llama 3.3 70B FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4)) on MI250/MI300X via [Petit](https://github.com/causalflow-ai/petit-kernel). Install with `pip install petit-kernel`; no `--quantization` flag needed when loading pre-quantized NVFP4 models. +- `quark_int4fp8_moe` is an AMD-only online quantization method for MoE models on CDNA3/CDNA4. + +Several of these backends are accelerated by [Aiter](https://github.com/ROCm/aiter). 
Enable it with: + +```bash +export SGLANG_USE_AITER=1 +``` + +Example -- serving an AWQ model: + +```bash +python3 -m sglang.launch_server \ + --model-path hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 \ + --trust-remote-code \ + --port 30000 --host 0.0.0.0 +``` + +Example -- FP8 online quantization: + +```bash +python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --quantization fp8 \ + --port 30000 --host 0.0.0.0 +``` + ## Examples ### Running DeepSeek-V3 diff --git a/docs/platforms/apple_metal.md b/docs/platforms/apple_metal.md new file mode 100644 index 000000000000..9f388d768677 --- /dev/null +++ b/docs/platforms/apple_metal.md @@ -0,0 +1,74 @@ +# Apple Silicon with Metal (MLX) + +This document describes how to run SGLang on Apple Silicon using [Metal (MLX)](https://opensource.apple.com/projects/mlx/). If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues). + +## Install SGLang + +You can install SGLang using one of the methods below. + +### Install from Source + +```bash +# Use the default branch +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Install sglang python package +pip install --upgrade pip +rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml +uv pip install -e "python[all_mps]" +``` + +## Launch the Serving Engine + +Launch the server with: + +```bash +SGLANG_USE_MLX=1 python -m sglang.launch_server \ + --model \ + --disable-cuda-graph \ + --host 0.0.0.0 +``` + +**Key Parameters Explained:** + +1. `SGLANG_USE_MLX=1` - Uses MLX as the SGLang runtime backend. If it is unset, SGLang falls back to `torch.mps`, which has more limited support. +2. `--disable-cuda-graph` - Disables CUDA graphs, which do not apply to Apple Metal. +3. 
`--disable-overlap-schedule` - Disables overlap scheduling, which is implemented with MLX's `async_eval()`. Overlap scheduling is enabled by default when this flag is absent. + + +## Benchmarking with Requests + +`sglang.bench_one_batch` calls the synchronous prefill/decode methods directly, without going through the scheduler or the overlap code path. + +`sglang.bench_offline_throughput` goes through the scheduler and the overlap code path, so overlap scheduling can be toggled with the `--disable-overlap-schedule` flag. + +### Throughput Testing + +Basic synchronous single-batch throughput: +```bash +SGLANG_USE_MLX=1 python -m sglang.bench_one_batch \ + --model-path \ + --disable-cuda-graph \ + --tp-size 1 \ + --batch-size 1 \ + --input-len 60 \ + --output-len 10 +``` + +Synchronous offline throughput: +```bash +SGLANG_USE_MLX=1 python -m sglang.bench_offline_throughput \ + --model-path \ + --disable-cuda-graph \ + --num-prompts 1 \ + --disable-overlap-schedule +``` + +Asynchronous offline throughput: +```bash +SGLANG_USE_MLX=1 python -m sglang.bench_offline_throughput \ + --model-path \ + --disable-cuda-graph \ + --num-prompts 1 +``` diff --git a/docs/platforms/ascend_contribution_guide.md b/docs/platforms/ascend/ascend_contribution_guide.md similarity index 83% rename from docs/platforms/ascend_contribution_guide.md rename to docs/platforms/ascend/ascend_contribution_guide.md index db343126083d..4d3ad0d3a2e6 100644 --- a/docs/platforms/ascend_contribution_guide.md +++ b/docs/platforms/ascend/ascend_contribution_guide.md @@ -6,7 +6,7 @@ Welcome to **SGLang**! We appreciate your interest in contributing. This guide p ### Prepare Environment -Before contributing, please ensure that your environment is set up correctly. Follow the steps in the [Installation Guide](../platforms/ascend_npu.md) to install the necessary dependencies. we recommend [using docker](../platforms/ascend_npu.md#method-2-using-docker-image) to build the environment. 
+Before contributing, please ensure that your environment is set up correctly. Follow the steps in the [Installation Guide](ascend_npu.md) to install the necessary dependencies. We recommend [using docker](ascend_npu.md#method-2-using-docker-image) to build the environment. ### Fork and clone the repository @@ -38,6 +38,18 @@ If you add a new feature or fix a bug, please add corresponding unit tests to en SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework. For detailed instructions on running tests and integrating them into CI, refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md). +If you need a model that is not listed in `python/sglang/test/ascend/test_ascend_utils.py`, follow these steps: +1. Register an account and upload your model to [ModelScope](https://modelscope.cn/models). +2. Make sure your model is pre-cached on the CI server at the path "/data/ascend-ci-share-pkking-sglang/modelscope/hub/models/{your_model_repo}/{your_model}". +If it is not, run the following command on the CI server: + ```bash + modelscope download \ + --model {your_model_repo}/{your_model} \ + --local_dir /data/ascend-ci-share-pkking-sglang/modelscope/hub/models/{your_model_repo}/{your_model} + ``` + > Note: If you don’t have access to the CI server, please ask the maintainers (zl19940307@163.com) to download your model. +3. Add the model to ```python/sglang/test/ascend/test_ascend_utils.py``` (use the Docker path ```"/root/.cache/modelscope/hub/models/{your_model_repo}/{your_model}"```). + ## Write documentations We recommend new contributors start from writing documentation, which helps you quickly understand SGLang codebase. @@ -60,11 +72,11 @@ Also, do not rely on the "Latency/Output throughput" from this script, as it is GSM8K is too easy for state-of-the-art models nowadays. Please try your own more challenging accuracy tests. 
You can find additional accuracy eval examples in: -- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_eval_accuracy_large.py) -- [test_moe_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_moe_eval_accuracy_large.py) +- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/registered/eval/test_eval_accuracy_large.py) +- [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/registered/core/test_gpt_oss_1gpu.py) ## Benchmark the speed -Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md). +Refer to [Benchmark and Profiling](../../developer_guide/benchmark_and_profiling.md). ## Requesting a review for merge You can follow the pull request merge process described in [MAINTAINER.md](https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md). @@ -101,14 +113,13 @@ Each CI workflow has a default limit defined in its workflow configuration file. ```yaml cool-down-minutes: - description: "Default cooldown period in minutes; 0 disables rate limiting" + description: "Cooldown period in minutes for low-permission users; 0 disables rate limiting" type: number default: 120 ``` Users listed in [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json) may have a per-user cooldown interval. In practice, we use the minimum of the workflow’s default window and the user-specific interval. - ## Code style guidance - Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function. - Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code. 
@@ -122,21 +133,21 @@ Users listed in [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob - Reuse server launches in your unit tests to make tests run faster. - When supporting new hardware or features, follow these guidelines: - Do not drastically change existing code. - - Always prefer new files to introduce specific components for your new hardware (e.g., `allocator_ascend.py`). + - Always prefer new files to introduce specific components for your new hardware (e.g., `allocator_npu.py`). - If you write multiple if/else blocks for new features, ensure the common path (e.g., NVIDIA hardware or the existing code path) is the first branch. ## How to update sgl-kernel Since sglang and sgl-kernel are separate Python packages, our current GitHub CI infrastructure does not support updating a kernel and using it immediately within the same pull request (PR). -To add a new kernel or modify an existing one in the sgl-kernel package, you must use multiple PRs. +To add a new kernel or modify an existing one in the `sgl-kernel/` source tree, you must use multiple PRs. Follow these steps: 1. Submit a PR to update the sgl-kernel source code without using it in sglang python package (e.g., [#8884](https://github.com/sgl-project/sglang/pull/8884/files)). -2. Bump the version of sgl-kernel (e.g., [#9220](https://github.com/sgl-project/sglang/pull/9220/files)). - - Once merged, this will trigger an automatic release of the sgl-kernel wheel to PyPI. +2. Bump the version of the kernel package (e.g., [#9220](https://github.com/sgl-project/sglang/pull/9220/files)). + - Once merged, this will trigger an automatic release of the `sglang-kernel` wheel to PyPI. - If not urgent, you can wait for other people to release the wheel. A new version will typically be released within one week. 3. Apply the changes: - - Update the sgl-kernel version in `sglang/python/pyproject.toml` to use the modified kernels. 
+ - Update the `sglang-kernel` version in `sglang/python/pyproject.toml` to use the modified kernels. - Update the related caller code in the sglang to use the new kernel. ## How to update sgl-kernel-npu diff --git a/docs/platforms/ascend_npu.md b/docs/platforms/ascend/ascend_npu.md similarity index 83% rename from docs/platforms/ascend_npu.md rename to docs/platforms/ascend/ascend_npu.md index d91382f657c1..b6f1fcf302b1 100644 --- a/docs/platforms/ascend_npu.md +++ b/docs/platforms/ascend/ascend_npu.md @@ -6,12 +6,11 @@ You can install SGLang using any of the methods below. Please go through `System ## Component Version Mapping For SGLang | Component | Version | Obtain Way | |-------------------|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| HDK | 25.3.RC1 | [link](https://hiascend.com/hardware/firmware-drivers/commercial?product=7&model=33) | +| HDK | 25.5.2 | [link](https://www.hiascend.com/hardware/firmware-drivers/commercial?product=7&model=33) | | CANN | 8.5.0 | [Obtain Images](#obtain-cann-image) | | Pytorch Adapter | 7.3.0 | [link](https://gitcode.com/Ascend/pytorch/releases) | | MemFabric | 1.0.5 | `pip install memfabric-hybrid==1.0.5` | | Triton | 3.2.0 | `pip install triton-ascend`| -| Bisheng | 20251121 | [link](https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/triton_ascend/Ascend-BiSheng-toolkit_aarch64_20251121.run) | | SGLang NPU Kernel | NA | [link](https://github.com/sgl-project/sgl-kernel-npu/releases) | @@ -39,7 +38,7 @@ conda activate sglang_npu #### CANN -Prior to start work with SGLang on Ascend you need to install CANN Toolkit, Kernels operator package and NNAL version 8.3.RC2 or higher, check the [installation 
guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/83RC1/softwareinst/instg/instg_0008.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit) +Before you start working with SGLang on Ascend, install the CANN Toolkit, the Kernels operator package, and NNAL (version 8.5.0); see the [installation guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850/softwareinst/instg/instg_0008.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit). #### MemFabric-Hybrid @@ -54,7 +53,7 @@ pip install memfabric-hybrid==1.0.5 ```shell PYTORCH_VERSION=2.8.0 TORCHVISION_VERSION=0.23.0 -TORCH_NPU_VERSION=2.8.0 +TORCH_NPU_VERSION=2.8.0.post2 pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu pip install torch_npu==$TORCH_NPU_VERSION ``` @@ -65,11 +64,6 @@ If you are using other versions of `torch` and install `torch_npu`, check [insta We provide our own implementation of Triton for Ascend. -```shell -BISHENG_NAME="Ascend-BiSheng-toolkit_aarch64_20251121.run" -BISHENG_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/triton_ascend/${BISHENG_NAME}" -wget -O "${BISHENG_NAME}" "${BISHENG_URL}" && chmod a+x "${BISHENG_NAME}" && "./${BISHENG_NAME}" --install && rm "${BISHENG_NAME}" -``` ```shell pip install triton-ascend ``` @@ -81,17 +75,15 @@ We provide SGL kernels for Ascend NPU, check [installation guide](https://github #### DeepEP-compatible Library We provide a DeepEP-compatible Library as a drop-in replacement of deepseek-ai's DeepEP library, check the [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md). -#### CustomOps -_TODO: to be removed once merged into sgl-kernel-npu._ -Additional package with custom operations. DEVICE_TYPE can be "a3" for Atlas A3 server or "910b" for Atlas A2 server. 
+#### Some other dependencies ```shell -DEVICE_TYPE="a3" -wget https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/ops/CANN-custom_ops-8.3.0.1-$DEVICE_TYPE-linux.aarch64.run -chmod a+x ./CANN-custom_ops-8.3.0.1-$DEVICE_TYPE-linux.aarch64.run -./CANN-custom_ops-8.3.0.1-$DEVICE_TYPE-linux.aarch64.run --quiet --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp -wget https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/ops/custom_ops-2.0.$DEVICE_TYPE-cp311-cp311-linux_aarch64.whl -pip install ./custom_ops-2.0.$DEVICE_TYPE-cp311-cp311-linux_aarch64.whl +# libGL +apt update +apt install libgl1 libglib2.0-0 + +# ensure setuptools contains pkg_resources module +pip install "setuptools<80" ``` #### Installing SGLang from source @@ -112,8 +104,8 @@ You can download the SGLang image or build an image based on Dockerfile to obtai dockerhub: docker.io/lmsysorg/sglang:$tag # Main-based tag, change main to specific version like v0.5.6, # you can get image for specific version -Atlas 800I A3 : {main}-cann8.3.rc2-a3 -Atlas 800I A2: {main}-cann8.3.rc2-910b +Atlas 800I A3 : {main}-cann8.5.0-a3 +Atlas 800I A2: {main}-cann8.5.0-910b ``` 2. Build an image based on Dockerfile ```shell @@ -123,7 +115,8 @@ cd sglang/docker # Build the docker image # If there are network errors, please modify the Dockerfile to use offline dependencies or use a proxy -docker build -t -f npu.Dockerfile . +# is the target architecture of the image, e.g. amd64, arm64 +docker build --build-arg TARGETARCH= -t -f npu.Dockerfile . ``` #### Create Docker @@ -189,7 +182,7 @@ export SGLANG_SET_CPU_AFFINITY=1 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend ``` -#### PD Separation Scene +#### PD Disaggregation Scene 1. 
Launch Prefill Server ```shell # Enabling CPU Affinity diff --git a/docs/platforms/ascend/ascend_npu_best_practice.md b/docs/platforms/ascend/ascend_npu_best_practice.md new file mode 100644 index 000000000000..91eb59a454b0 --- /dev/null +++ b/docs/platforms/ascend/ascend_npu_best_practice.md @@ -0,0 +1,3762 @@ +# Best Practice on Ascend NPU + +This section lists best-practice configurations, with the corresponding TPOT figures, for mainstream LLMs such as DeepSeek and Qwen on the Ascend NPU. If +you encounter issues or have any questions, please [open an issue](https://github.com/sgl-project/sglang/issues). + +## DeepSeek Series Models + +### Low Latency + +| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration | +|-------------------|---------------|-------|-------------------|-----------|------|--------------|-------------------------------------------------------------------------------------------| +| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 6K+1.6K | 20ms | W8A8 INT8 | [Optimal Configuration](#deepseek-r1-6k-1_6k-20ms-on-a3-32-cards-disaggregation-mode) | +| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.9K+1K | 19ms | W8A8 INT8 | [Optimal Configuration](#deepseek-r1-3_9k-1k-19ms-on-a3-32-cards-disaggregation-mode) | +| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 19ms | W8A8 INT8 | [Optimal Configuration](#deepseek-r1-3_5k-1_5k-19ms-on-a3-32-cards-disaggregation-mode) | +| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1K | 19ms | W8A8 INT8 | [Optimal Configuration](#deepseek-r1-3_5k-1k-19ms-on-a3-32-cards-disaggregation-mode) | +| DeepSeek-V3.2 | Atlas 800I A3 | 32 | PD Disaggregation | 128K+1K | 26ms | W8A8 INT8 | [Optimal Configuration](#deepseek-v32-128k-1k-26ms-on-a3-32-cards-disaggregation-mode) | + +### High Throughput + +| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration | 
+|-------------|---------------|-------|-------------------|-----------|------|--------------|-----------------------------------------------------------------------------------------| +| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#deepseek-r1-3_5k-1_5k-50ms-on-a3-32-cards-disaggregation-mode) | +| DeepSeek-R1 | Atlas 800I A3 | 24 | PD Disaggregation | 2K+2K | 50ms | W8A8 INT8 | [Optimal Configuration](#deepseek-r1-2k-2k-50ms-on-a3-24-cards-disaggregation-mode) | +| DeepSeek-R1 | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W4A8 INT8 | [Optimal Configuration](#deepseek-r1-2k-2k-50ms-on-a3-8-cards-mixed-mode) | +| DeepSeek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 2K+2K | 50ms | W4A8 INT8 | [Optimal Configuration](#deepseek-r1-2k-2k-50ms-on-a3-16-cards-disaggregation-mode) | +| DeepSeek-R1 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 INT8 | [Optimal Configuration](#deepseek-r1-3_5k-1_5k-50ms-on-a3-8-cards-mixed-mode) | +| DeepSeek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 3.5K+1.5K | 50ms | W4A8 INT8 | [Optimal Configuration](#deepseek-r1-3_5k-1_5k-50ms-on-a3-16-cards-disaggregation-mode) | + +## Qwen Series Models + +### Low Latency + +| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration | +|-----------------|---------------|-------|-------------|---------|------|--------------|--------------------------------------------------------------------------------| +| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 11K+1K | 10ms | BF16 | [Optimal Configuration](#qwen3-235b-a22b-11k-1k-10ms-on-a3-8-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 6K+1.5K | 18ms | BF16 | [Optimal Configuration](#qwen3-32b-6k-1_5k-18ms-on-a3-4-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 4K+1.5K | 11ms | BF16 | [Optimal Configuration](#qwen3-32b-4k-1_5k-11ms-on-a3-4-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A3 | 8 | PD Mixed | 
18K+4K | 6ms | BF16 | [Optimal Configuration](#qwen3-32b-18k-4k-6ms-on-a3-8-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 6K+1.5K | 18ms | W8A8 INT8 | [Optimal Configuration](#qwen3-32b-6k-1_5k-18ms-on-a2-8-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 4K+1.5K | 11ms | BF16 | [Optimal Configuration](#qwen3-32b-4k-1_5k-11ms-on-a2-8-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 1K+0.3K | 12ms | W8A8 INT8 | [Optimal Configuration](#qwen3-32b-1k-0_3k-12ms-on-a3-2-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 6K+1.5K | 17ms | W8A8 INT8 | [Optimal Configuration](#qwen3-32b-6k-1_5k-17ms-on-a3-2-cards-mixed-mode) | +| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 1K+0.3K | 7ms | W8A8 INT8 | [Optimal Configuration](#qwen3-8b-1k-0_3k-7ms-on-a3-1-cards-mixed-mode) | +| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 12ms | W8A8 INT8 | [Optimal Configuration](#qwen3-8b-6k-1_5k-12ms-on-a3-1-cards-mixed-mode) | +| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 5ms | W8A8 INT8 | [Optimal Configuration](#qwen3-8b-3_5k-1_5k-5ms-on-a3-1-cards-mixed-mode) | +| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 10ms | W8A8 INT8 | [Optimal Configuration](#qwen3-30b-a3b-6k-1_5k-10ms-on-a3-1-cards-mixed-mode) | +| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 1K+0.3K | 7ms | W8A8 INT8 | [Optimal Configuration](#qwen3-30b-a3b-1k-0_3k-7ms-on-a3-1-cards-mixed-mode) | +| Qwen3-Next-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 1K+0.3K | 14.21ms | W8A8 INT8 | [Optimal Configuration](#qwen3-next-1k-0_3k-14_21ms-on-a3-2-cards-mixed-mode) | +| Qwen3-Next-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 6K+1.5K | 15.62ms | W8A8 INT8 | [Optimal Configuration](#qwen3-next-6k-1_5k-15_62ms-on-a3-2-cards-mixed-mode) | +| Qwen3-Next-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | [Optimal Configuration](#qwen3-next-3_5k-1_5k-20ms-on-a3-2-cards-mixed-mode) | +| Qwen3-14B | Atlas 800I 
A3 | 1 | PD Mixed | 3.5K+1.5K | 9ms | W8A8 INT8 | [Optimal Configuration](#qwen3-14b-3_5k-1_5k-9ms-on-a3-1-cards-mixed-mode) | + +### High Throughput + +| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration | +|--------------------------------|---------------|-------|-------------------|-----------|-------|--------------|------------------------------------------------------------------------------------------------------------| +| Qwen3-235B-A22B | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-235b-a22b-3_5k-1_5k-50ms-on-a3-24-cards-disaggregation-mode) | +| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-235b-a22b-3_5k-1_5k-50ms-on-a3-8-cards-mixed-mode) | +| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 100ms | W8A8 INT8 | [Optimal Configuration](#qwen3-235b-a22b-2k-2k-100ms-on-a3-8-cards-mixed-mode) | +| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-235b-a22b-2k-2k-50ms-on-a3-8-cards-mixed-mode) | +| Qwen3-235B-A22B | Atlas 800I A3 | 16 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-235b-a22b-2k-2k-50ms-on-a3-16-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-32b-3_5k-1_5k-50ms-on-a3-2-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-32b-2k-2k-50ms-on-a3-2-cards-mixed-mode) | +| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-30b-a3b-3_5k-1_5k-50ms-on-a3-1-card-mixed-mode) | +| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-coder-480b-a35b-instruct-3_5k-1_5k-50ms-on-a3-24-cards-disaggregation-mode) | +| 
Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 16 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-coder-480b-a35b-instruct-3_5k-1_5k-50ms-on-a3-16-cards-mixed-mode) | +| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-coder-480b-a35b-instruct-3_5k-1_5k-50ms-on-a3-8-cards-mixed-mode) | +| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-next-80B-a3b-instruct-3_5k-1_5k-50ms-on-a3-2-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-32b-3_5k-1_5k-50ms-on-a2-8-cards-mixed-mode) | +| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-32b-2k-2k-50ms-on-a2-8-cards-mixed-mode) | +| Qwen3-14B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-14b-3_5k-1_5k-50ms-on-a3-1-cards-mixed-mode) | +| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | [Optimal Configuration](#qwen3-8b-3_5k-1_5k-50ms-on-a3-1-cards-mixed-mode) | + +## Optimal Configuration + +### DeepSeek-R1 3_5K-1_5K 50ms on A3 32 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +export SGLANG_SET_CPU_AFFINITY=1 +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export 
HCCL_OP_EXPANSION_MODE=AIV +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export SGLANG_NPU_USE_MULTI_STREAM=1 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1' 'your prefill ip2') + +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export SGLANG_USE_AG_AFTER_QLORA=1 + export HCCL_BUFFSIZE=800 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + export SGLANG_NPU_FUSED_MOE_MODE=2 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=131072 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.778 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 16 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 60000 --moe-a2a-backend ascend_fuseep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=600 + export 
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64 + export TASK_QUEUE_ENABLE=1 + export SGLANG_NPU_FUSED_MOE_MODE=1 + export SGLANG_LM_HEAD_TP=8 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \ + --mem-fraction-static 0.82 --max-running-requests 1024 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --moe-dense-tp 1 \ + --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell +export SGLANG_DP_ROUND_ROBIN=1 +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --prefill http://P_IP:8000 8999 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 768 --random-input-len 3500 --random-output-len 1500 --num-prompts 3072 --random-range-ratio 1 --request-rate 16 +``` + +### DeepSeek-R1 2K-2K 50ms on A3 24 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 24Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1') +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=1600 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + export SGLANG_USE_AG_AFTER_QLORA=1 + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.8 --attention-backend ascend 
--device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 20 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=800 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=102 + export TASK_QUEUE_ENABLE=1 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + export SGLANG_NPU_FUSED_MOE_MODE=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \ + --mem-fraction-static 0.81 --max-running-requests 1088 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ + --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell +python -m sglang_router.launch_router \ + --pd-disaggregation 
\ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --prefill http://P_IP:8000 8999 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang \ +--host 127.0.0.1 \ +--port 6688 \ +--max-concurrency 1088 \ +--random-input-len 2048 \ +--random-output-len 2048 \ +--num-prompts 12800 \ +--random-range-ratio 1 \ +--request-rate 24 +``` + +### DeepSeek-R1 6K-1_6K 20ms on A3 32 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 6K+1.6K + +TPOT: 20ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +export SGLANG_SET_CPU_AFFINITY=1 +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1' 'your prefill ip2') + +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=1536 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m 
sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 4 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=650 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16 + export TASK_QUEUE_ENABLE=1 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 8 \ + --mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ + --cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ + 
--load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell +export SGLANG_DP_ROUND_ROBIN=1 +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --prefill http://P_IP:8000 8999 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang \ + --host 127.0.0.1 \ + --port 6688 \ + --max-concurrency 32 \ + --random-input-len 6000 \ + --random-output-len 1600 \ + --num-prompts 32 \ + --random-range-ratio 1 \ + --request-rate 16 +``` + +### DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.9K+1K + +TPOT: 19ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1' 'your prefill ip2') +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + 
export HCCL_BUFFSIZE=1536 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 4 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=650 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=12 + export TASK_QUEUE_ENABLE=1 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 16 \ + --mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ + --cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ 
+ --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done +``` + +```shell +export SGLANG_DP_ROUND_ROBIN=1 +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --prefill http://P_IP:8000 8999 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang \ + --host 127.0.0.1 \ + --port 6688 \ + --max-concurrency 32 \ + --random-input-len 3900 \ + --random-output-len 1024 \ + --num-prompts 32 \ + --random-range-ratio 1 \ + --request-rate 16 +``` + +### DeepSeek-R1 3_5K-1_5K 19ms on A3 32 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 19ms + +#### Model Deployment + +Please refer to [DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode](#deepseek-r1-3_9k-1k-19ms-on-a3-32-cards-disaggregation-mode) + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang \ + --host 127.0.0.1 \ + --port 6688 \ + --max-concurrency 32 \ + --random-input-len 3500 \ + --random-output-len 1500 \ + --num-prompts 32 \ + --random-range-ratio 1 \ + --request-rate 16 +``` + +### DeepSeek-R1 3_5K-1K 19ms on A3 32 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1K + +TPOT: 19ms + +#### Model Deployment + +Please refer to [DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode](#deepseek-r1-3_9k-1k-19ms-on-a3-32-cards-disaggregation-mode) + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang \ + --host 127.0.0.1 \ + --port 6688 \ + --max-concurrency 32 \ + --random-input-len 3500 \ + --random-output-len 1024 \ + --num-prompts 32 \ + --random-range-ratio 1 \ + --request-rate 16 +``` + +### DeepSeek-R1 2K-2K 50ms on A3 8 Cards Mixed Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200 + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +export 
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=88 +export HCCL_BUFFSIZE=1600 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=10 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512 + +MODEL_PATH=xxx + +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_USE_FIA_NZ=1 + +python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ +--tp 16 \ +--trust-remote-code \ +--attention-backend ascend \ +--device npu \ +--quantization modelslim \ +--watchdog-timeout 9000 \ +--host 127.0.0.1 --port 6699 \ +--cuda-graph-bs 4 8 20 21 22 \ +--mem-fraction-static 0.78 \ +--max-running-requests 352 \ +--disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 1500 \ +--moe-a2a-backend deepep --deepep-mode auto \ +--enable-dp-attention --dp-size 16 --enable-dp-lm-head \ +--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ +--dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 352 --random-input-len 2048 --random-output-len 2048 --num-prompts 1408 --random-range-ratio 1 +``` + +### DeepSeek-R1 2K-2K 50ms on A3 16 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 16Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" + +P_IP=('your prefill ip1') + +D_IP=('your decode ip1') + +MODEL_PATH=xxx + +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export ENABLE_MOE_NZ=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=2600 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 
0.7 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 10240 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=900 + export SGLANG_DP_ROUND_ROBIN=1 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=112 + export TASK_QUEUE_ENABLE=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \ + --mem-fraction-static 0.8 --max-running-requests 448 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \ + --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + 
--port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 448 --random-input-len 2048 --random-output-len 2048 --num-prompts 1792 --random-range-ratio 1 --request-rate 32 +``` + +### DeepSeek-R1 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +export STREAMS_PER_DEVICE=32 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=56 +export HCCL_BUFFSIZE=1200 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=10 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_USE_FIA_NZ=1 + +MODEL_PATH=xxx + +python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ +--tp 16 \ +--trust-remote-code \ +--attention-backend ascend \ +--device npu \ +--quantization modelslim \ +--watchdog-timeout 9000 \ +--host 127.0.0.1 --port 6699 \ +--cuda-graph-bs 4 8 12 14 \ +--mem-fraction-static 0.77 \ +--max-running-requests 224 \ 
+--context-length 8188 --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 3000 \ +--moe-a2a-backend deepep --deepep-mode auto \ +--enable-dp-attention --dp-size 16 --enable-dp-lm-head \ +--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ +--dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 224 --random-input-len 3500 --random-output-len 1500 --num-prompts 896 --random-range-ratio 1 +``` + +### DeepSeek-R1 3_5K-1_5K 50ms on A3 16 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 16Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" + +P_IP=('your prefill ip1') + +D_IP=('your decode ip1') + +MODEL_PATH=xxx + +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export ENABLE_MOE_NZ=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" 
+ export HCCL_BUFFSIZE=3500 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.62 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 20480 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=800 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=78 + export TASK_QUEUE_ENABLE=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \ + --mem-fraction-static 0.805 --max-running-requests 416 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \ + --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 
--speculative-num-draft-tokens 3 \ + --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 416 --random-input-len 3500 --random-output-len 1500 --num-prompts 1664 --random-range-ratio 1 +``` + +### DeepSeek-V3.2 128K-1K 26ms on A3 32 Cards Disaggregation Mode + +Model: DeepSeek-V3.2-W8A8 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 128K+1K + +TPOT: 26ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH} +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24670" + +P_IP=('your prefill ip1' 'your prefill ip2') +D_IP=('your decode ip1' 'your decode ip2') +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +# 
prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=1200 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + + python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ + --tp 32 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --watchdog-timeout 9000 \ + --host ${P_IP[$i]} --port 8000 \ + --mem-fraction-static 0.73 \ + --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 68000 \ + --max-running-requests 1 \ + --moe-a2a-backend deepep --deepep-mode normal \ + --quantization modelslim \ + --disaggregation-transfer-backend ascend \ + --disaggregation-mode prefill \ + --disable-cuda-graph \ + --nnodes 2 --node-rank $i \ + --disaggregation-bootstrap-port 8995 \ + --moe-dense-tp-size 1 \ + --enable-nsa-prefill-context-parallel \ + --nsa-prefill-cp-mode in-seq-split \ + --attn-cp-size 32 \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dist-init-addr ${P_IP[0]}:10000 + break + fi +done + + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + + export TASK_QUEUE_ENABLE=0 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + + DP=8 + export HCCL_BUFFSIZE=400 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=8 + + python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ + --tp 32 \ + --dp ${DP} \ + --ep 32 \ + --moe-dense-tp-size 1 \ + --enable-dp-attention \ + --enable-dp-lm-head \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --watchdog-timeout 9000 \ + --host ${D_IP[$i]} --port 8001 \ + 
--mem-fraction-static 0.79 \ + --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 68000 \ + --max-running-requests 32 \ + --cuda-graph-max-bs 4 \ + --moe-a2a-backend deepep \ + --deepep-mode low_latency \ + --quantization modelslim \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disaggregation-transfer-backend ascend \ + --disaggregation-mode decode \ + --nnodes 2 --node-rank $i \ + --dist-init-addr ${D_IP[0]}:10000 + break + fi +done +``` + + +```shell +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP1:8000 8995 \ + --decode http://D_IP1:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 8 --random-input-len 131076 --random-output-len 1024 --num-prompts 8 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 24Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_DP_ROUND_ROBIN=1 +export 
SGLANG_NPU_FUSED_MOE_MODE=2 + +MODEL_PATH=xxx +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" +P_IP=('your prefill ip1') +D_IP=('your decode ip1' 'your decode ip2') + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + + +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + source /usr/local/Ascend/ascend-toolkit/set_env.sh + source /usr/local/Ascend/nnal/atb/set_env.sh + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416 + export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 + export DEEPEP_NORMAL_LONG_SEQ_ROUND=16 + export HCCL_BUFFSIZE=4300 + export TASK_QUEUE_ENABLE=2 + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + export STREAMS_PER_DEVICE=32 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + + # Prefill node + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \ + --host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \ + --nnodes 1 --node-rank $i --tp-size 16 --dp-size 16 --mem-fraction-static 0.6 \ + --disable-radix-cache \ + --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --speculative-draft-model-quantization unquant \ + --max-running-requests 128 --chunked-prefill-size 94208 --max-prefill-tokens 262144 \ + --enable-dp-attention \ + --moe-a2a-backend ascend_fuseep --dtype bfloat16 + NODE_RANK=$i + break + fi +done + + +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + source /usr/local/Ascend/ascend-toolkit/set_env.sh + source /usr/local/Ascend/nnal/atb/set_env.sh + export
DP_ROUND_ROBIN=1 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536 + export HCCL_BUFFSIZE=800 + export HCCL_SOCKET_IFNAME=data0.3001 + export GLOO_SOCKET_IFNAME=data0.3001 + export STREAMS_PER_DEVICE=32 + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \ + --host ${D_IP[$i]} --port 8001 --trust-remote-code \ + --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.83 --max-running-requests 768 \ + --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ + --moe-a2a-backend ascend_fuseep --cuda-graph-bs 6 8 12 15 18 20 22 24 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-draft-model-quantization unquant \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --dist-init-addr xxx:5000 \ + --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell +export SGLANG_DP_ROUND_ROBIN=1 +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://PIP:8000 8995 \ + --decode http://DIP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang-oai --host 127.0.0.1 --port 6688 --max-concurrency 860 --random-input-len 3500 --random-output-len 1500 --num-prompts 3440 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=570 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100 + +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416 +export SGLANG_NPU_FUSED_MOE_MODE=2 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 432
--context-length 8192 --dtype bfloat16 \ + --chunked-prefill-size 94208 --max-prefill-tokens 458880 --sampling-backend ascend \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disable-radix-cache --moe-a2a-backend ascend_fuseep --speculative-draft-model-quantization unquant \ + --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.8 --cuda-graph-bs 1 2 4 8 16 20 24 26 27 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 272 --random-input-len 3500 --random-output-len 1500 --num-prompts 1088 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 2K-2K 100ms on A3 8 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 100ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=1200 +export HCCL_SOCKET_IFNAME=lo +export 
GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=144 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 576 --context-length 8192 --dtype bfloat16 \ + --chunked-prefill-size 32768 --max-prefill-tokens 458880 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --speculative-draft-model-quantization unquant \ + --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.84 --cuda-graph-bs 8 16 20 24 32 36 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 576 --random-input-len 2000 --random-output-len 2000 --num-prompts 576 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 2K-2K 50ms on A3 8 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=450 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=147456 +export SGLANG_NPU_FUSED_MOE_MODE=2 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 624 --context-length 8192 
--dtype bfloat16 \ + --chunked-prefill-size 73728 --max-prefill-tokens 458880 --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disable-radix-cache --moe-a2a-backend ascend_fuseep \ + --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.83 --cuda-graph-bs 4 8 16 24 28 29 30 32 34 36 37 38 39 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 480 --random-input-len 2048 --random-output-len 2048 --num-prompts 480 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 2K-2K 50ms on A3 16 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 16Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=1600 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" + +MIX_IP=('IP1' 'IP2') + +for i in 
"${!MIX_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]]; + then + echo "${MIX_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + + python -m sglang.launch_server --model-path ${MODEL_PATH} \ + --host 127.0.0.1 --port 7439 --trust-remote-code \ + --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.8 --max-running-requests 768 \ + --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ + --moe-a2a-backend deepep --deepep-mode auto --cuda-graph-bs 6 8 10 12 18 24 \ + --dist-init-addr ${MIX_IP[0]}:5000 --chunked-prefill-size 131072 --max-prefill-tokens 458880 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx --speculative-draft-model-quantization= unquant \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --context-length 8192 --disable-radix-cache \ + --enable-dp-lm-head --dtype bfloat16 + NODE_RANK=$i + break + fi +done + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 768 --random-input-len 2000 --random-output-len 2000 --num-prompts 768 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 11K-1K 10ms on A3 8 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 11K+1K + +TPOT: 10ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=1600 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 1 --dtype bfloat16 \ + --chunked-prefill-size -1 --max-prefill-tokens 16384 --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 
--speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --disable-radix-cache --enable-dp-lm-head \ + --tp 16 --mem-fraction-static 0.78 --cuda-graph-bs 1 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 1 --random-input-len 11000 --random-output-len 1000 --num-prompts 1 --random-range-ratio 1 +``` + +### Qwen3-32B 6K-1_5K 18ms on A3 4 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 4Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 18ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu \ + --max-running-requests 32 \ + --disable-radix-cache \ + --chunked-prefill-size 24576 --max-prefill-tokens 65536 \ 
+ --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1 +``` + +### Qwen3-32B 4K-1_5K 11ms on A3 4 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 4Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 4K+1.5K + +TPOT: 11ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu \ + --max-running-requests 1 \ + 
--disable-radix-cache \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size 24576 --max-prefill-tokens 65536 \ + --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4 +``` + +### Qwen3-32B 18K-4K 6ms on A3 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 18K+4K + +TPOT: 6ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ +
--attention-backend ascend --device npu \ + --max-running-requests 1 \ + --disable-radix-cache --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 65536 \ + --tp-size 16 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 4000 --random-input-len 18000 --num-prompts 1 +``` + +### Qwen3-32B 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m
sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 78 \ + --disable-radix-cache --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-prefill-tokens 49152 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 16 32 64 68 72 78 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1 +``` + +### Qwen3-32B 2K-2K 50ms on A3 2 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo 
+export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 120 \ + --disable-radix-cache --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --chunked-prefill-size -1 --max-prefill-tokens 49152 \ + --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 480 --random-range-ratio 1 +``` + +### Qwen3-30B-A3B 3_5K-1_5K 50ms on A3 1 Card Mixed Mode + +Model: Qwen3-30B-A3B-Instruct-2507 + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export ASCEND_LAUNCH_BLOCKING=0 +export 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 162 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --chunked-prefill-size -1 --max-prefill-tokens 35000 \ + --tp-size 2 --mem-fraction-static 0.87 --cuda-graph-bs 1 5 15 40 70 100 120 130 140 146 150 154 156 158 160 162 \ + --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 156 --random-input-len 3500 --random-output-len 1500 --num-prompts 624 --random-range-ratio 1 +``` + +### Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode + +Model: Qwen3-Coder-480B-A35B-Instruct + +Hardware: Atlas 800I A3 24Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export SGLANG_NPU_FUSED_MOE_MODE=2 + +MODEL_PATH=xxx +export ASCEND_MF_STORE_URL="tcp://PIP:24667" +P_IP=('PIP') +D_IP=('DIP1' 'DIP2') +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + + +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + source /usr/local/Ascend/ascend-toolkit/set_env.sh + source /usr/local/Ascend/nnal/atb/set_env.sh + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=327680 + export HCCL_BUFFSIZE=1550 + export TASK_QUEUE_ENABLE=2 + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \ + --host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 
8995 --trust-remote-code \ + --nnodes 1 --node-rank $i --tp-size 16 --dp-size 2 --mem-fraction-static 0.7 \ + --disable-radix-cache \ + --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \ + --max-running-requests 16 --chunked-prefill-size 20480 --max-prefill-tokens 20480 \ + --enable-dp-attention \ + --moe-a2a-backend ascend_fuseep --dtype bfloat16 \ + --disable-overlap-schedule + NODE_RANK=$i + break + fi +done + +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + source /usr/local/Ascend/ascend-toolkit/set_env.sh + source /usr/local/Ascend/nnal/atb/set_env.sh + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536 + export HCCL_BUFFSIZE=600 + export SGLANG_NPU_FUSED_MOE_MODE=2 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \ + --host ${D_IP[$i]} --port 8001 --trust-remote-code \ + --nnodes 2 --node-rank $i --tp-size 32 --dp-size 4 --mem-fraction-static 0.75 --max-running-requests 544 \ + --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ + --moe-a2a-backend ascend_fuseep --cuda-graph-bs 16 32 56 72 80 88 96 104 112 120 128 136 \ + --dist-init-addr DIP1:5000 \ + --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://PIP:8000 8995 \ + --decode http://DIP1:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 410 --random-input-len 3500 --random-output-len 1500 --num-prompts 1640 --random-range-ratio 1 --request-rate 8 +``` + +### Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 16 Cards Mixed Mode + +Model: Qwen3-Coder-480B-A35B-Instruct + +Hardware: Atlas 800I A3 16Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=72 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=1800 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" + +MIX_IP=('IP1' 'IP2') + +for i in "${!MIX_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]]; + then + echo "${MIX_IP[$i]}" + + python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 2 --node-rank $i \ + --dist-init-addr IP1:5000 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 288 
--context-length 8192 --dtype bfloat16 \ + --chunked-prefill-size 114688 --max-prefill-tokens 458880 \ + --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto \ + --tp 32 --dp-size 4 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.7 --cuda-graph-bs 56 64 72 + NODE_RANK=$i + break + fi +done +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 288 --random-input-len 3500 --random-output-len 1500 --num-prompts 1152 --random-range-ratio 1 --request-rate 20 +``` + +### Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode + +Model: Qwen3-Coder-480B-A35B-Instruct + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=2100 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" + +python -m sglang.launch_server --model-path $MODEL_PATH \ +--host 127.0.0.1 --port 7439 
--trust-remote-code --nnodes 1 --node-rank 0 \ +--attention-backend ascend --device npu --quantization modelslim \ +--max-running-requests 80 --context-length 8192 --dtype bfloat16 \ +--chunked-prefill-size 28672 --max-prefill-tokens 458880 \ +--disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --enable-dp-attention --enable-dp-lm-head \ +--tp 16 --dp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 16 20 24 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 80 --random-input-len 3500 --random-output-len 1500 --num-prompts 320 --random-range-ratio 1 +``` + +### Qwen3-Next-80B-A3B-Instruct 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode + +Model: Qwen3-Next-80B-A3B-Instruct + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +export cann_path=/usr/local/Ascend/ascend-toolkit/latest +source /usr/local/Ascend/driver/bin/setenv.bash +source ${cann_path}/../set_env.sh +source ${cann_path}/../../nnal/atb/set_env.sh +source ${cann_path}/opp/vendors/customize/bin/set_env.bash +export ASCEND_HOME_PATH=${cann_path} +source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh + +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +export HCCL_OP_EXPANSION_MODE=AIV +export HCCL_ALGO="level0:NA;level1:ring" + +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=20 +export HCCL_BUFFSIZE=2000 + +python -m sglang.launch_server \ + --model-path /path/to/Qwen3-Next-80B-A3B-Instruct-W8A8-3 \ + --host 127.0.0.1 
\ + --port 6699 \ + --tp-size 4 \ + --device npu \ + --attention-backend ascend \ + --mem-fraction-static 0.685 \ + --max-running-requests 80 \ + --watchdog-timeout 3600 \ + --disable-radix-cache \ + --cuda-graph-bs 80 \ + --max-prefill-tokens 28672 --max-total-tokens 450560 \ + --moe-a2a-backend deepep --deepep-mode auto \ + --quantization modelslim \ + --chunked-prefill-size -1 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 80 --random-output-len 1536 --random-input-len 3584 --num-prompts 160 --random-range-ratio 1 +``` + +### Qwen3-32B 6K-1_5K 18ms on A2 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A2 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 18ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 
--port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 32 \ + --disable-radix-cache \ + --chunked-prefill-size 24576 --max-prefill-tokens 65536 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1 +``` + +### Qwen3-32B 4K-1_5K 11ms on A2 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A2 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 4K+1.5K + +TPOT: 11ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export 
HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu \ + --max-running-requests 32 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 65536 \ + --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 6 12 18 24 30 32 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4 +``` + +### Qwen3-32B 1K-0_3K 12ms on A3 2 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 1K+0.3K + +TPOT: 12ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` 
+LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --chunked-prefill-size -1 --max-prefill-tokens 16384 \ + --tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 8 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16 +``` + +### Qwen3-32B 6K-1_5K 17ms on A3 2 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 17ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export 
PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --chunked-prefill-size -1 --max-prefill-tokens 16384 \ + --tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 10 15 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16 +``` + +### Qwen3-8B 1K-0_3K 7ms on A3 1 Cards Mixed Mode + +Model: Qwen3-8B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 1K+0.3K + +TPOT: 7ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 
--max-prefill-tokens 16384 \ + --tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 2 4 6 9 10 15 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16 +``` + +### Qwen3-8B 6K-1_5K 12ms on A3 1 Cards Mixed Mode + +Model: Qwen3-8B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 12ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization 
unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 16384 \ + --tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 5 15 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16 +``` + +### Qwen3-32B 3_5K-1_5K 50ms on A2 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A2 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + 
--attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 78 \ + --disable-radix-cache --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-prefill-tokens 65536 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 8 16 32 64 68 72 78 --dtype bfloat16 --base-gpu-id 4 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1 +``` + +### Qwen3-32B 2K-2K 50ms on A2 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A2 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 
+export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 120 \ + --disable-radix-cache \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-prefill-tokens 49152 --base-gpu-id 4 \ + --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 120 --random-range-ratio 1 +``` + +### Qwen3-30B-A3B 6K-1_5K 10ms on A3 1 Cards Mixed Mode + +Model: Qwen3-30B-A3B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 10ms + +#### Model Deployment + +```shell +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + 
--attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 35000 \ + --tp-size 2 --mem-fraction-static 0.6 --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16 +``` + +### Qwen3-30B-A3B 1K-0_3K 7ms on A3 1 Cards Mixed Mode + +Model: Qwen3-30B-A3B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 1K+0.3K + +TPOT: 7ms + +#### Model Deployment + +```shell +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 8 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + 
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 35000 \ + --tp-size 2 --mem-fraction-static 0.7 --cuda-graph-bs 1 2 3 4 5 6 7 8 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 8 --random-output-len 300 --random-input-len 1024 --num-prompts 8 +``` + +### Qwen3-Next 1K-0_3K 14_21ms on A3 2 Cards Mixed Mode + +Model: Qwen3-Next-80B-A3B-Instruct + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 1K+0.3K + +TPOT: 14.21ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=5 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000 +export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 + +export ASCEND_USE_FIA=1 +export SGLANG_NPU_USE_MULTI_STREAM=1 + +export SGLANG_WARMUP_TIMEOUT=3600 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export FORCE_DRAFT_MODEL_NON_QUANT=1 + 
+MODEL_PATH=xxx
+
+export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
+
+LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
+LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
+
+echo "${LOCAL_HOST1}"
+echo "${LOCAL_HOST2}"
+
+export HCCL_BUFFSIZE=2000
+export HCCL_SOCKET_IFNAME=lo
+export GLOO_SOCKET_IFNAME=lo
+export HCCL_OP_EXPANSION_MODE="AIV"
+export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+export SGLANG_ENABLE_SPEC_V2=1
+
+python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
+    --page-size 128 \
+    --tp-size 4 \
+    --trust-remote-code \
+    --attention-backend ascend \
+    --device npu \
+    --watchdog-timeout 9000 \
+    --host 127.0.0.1 --port 6699 \
+    --mem-fraction-static 0.75 \
+    --disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
+    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
+    --chunked-prefill-size -1 --max-running-requests 312 \
+    --cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \
+    --mamba-ssm-dtype bfloat16 \
+    --base-gpu-id 0 \
+    --speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
+    --quantization modelslim \
+    --moe-a2a-backend deepep --deepep-mode auto
+```
+
+#### Benchmark
+
+We tested it based on the `RANDOM` dataset.
+ +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16 +``` + +### Qwen3-Next 6K-1_5K 15_62ms on A3 2 Cards Mixed Mode + +Model: Qwen3-Next-80B-A3B-Instruct + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 15.62ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=5 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000 +export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 + +export ASCEND_USE_FIA=1 +export SGLANG_NPU_USE_MULTI_STREAM=1 + +export SGLANG_WARMUP_TIMEOUT=3600 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export FORCE_DRAFT_MODEL_NON_QUANT=1 + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=2000 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export 
SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+export SGLANG_ENABLE_SPEC_V2=1
+
+python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
+    --page-size 128 \
+    --tp-size 4 \
+    --trust-remote-code \
+    --attention-backend ascend \
+    --device npu \
+    --watchdog-timeout 9000 \
+    --host 127.0.0.1 --port 6699 \
+    --mem-fraction-static 0.75 \
+    --disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
+    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
+    --chunked-prefill-size -1 --max-running-requests 312 \
+    --cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \
+    --mamba-ssm-dtype bfloat16 \
+    --base-gpu-id 0 \
+    --speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
+    --quantization modelslim \
+    --moe-a2a-backend deepep --deepep-mode auto
+```
+
+#### Benchmark
+
+We tested it based on the `RANDOM` dataset.
+
+```shell
+python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
+```
+
+### Qwen3-14B 3_5K-1_5K 9ms on A3 1 Cards Mixed Mode
+
+Model: Qwen3-14B
+
+Hardware: Atlas 800I A3 1Card
+
+DeployMode: PD Mixed
+
+Dataset: random
+
+Input Output Length: 3.5K+1.5K
+
+TPOT: 9ms
+
+#### Model Deployment
+
+```shell
+echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+sysctl -w vm.swappiness=0
+sysctl -w kernel.numa_balancing=0
+sysctl -w kernel.sched_migration_cost_ns=50000
+
+unset https_proxy
+unset http_proxy
+unset HTTPS_PROXY
+unset HTTP_PROXY
+unset ASCEND_LAUNCH_BLOCKING
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+source /usr/local/Ascend/nnal/atb/set_env.sh
+source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
+export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
+
+MODEL_PATH=xxx
+
+export
SGLANG_SET_CPU_AFFINITY=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_OP_EXPANSION_MODE="AIV" +export STREAMS_PER_DEVICE=32 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export ASCEND_USE_FIA=0 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --disable-radix-cache --mem-fraction-static 0.8 \ + --tp-size 1 --dp-size 1 \ + --sampling-backend ascend --max-running-requests 8 \ + --served-model-name Qwen3-14B \ + --chunked-prefill-size -1 \ + --cuda-graph-bs 8 \ + --dtype bfloat16 \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --schedule-conservativeness 0.01 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 8 --random-range-ratio 1 +``` + +### Qwen3-14B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode + +Model: Qwen3-14B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_OP_EXPANSION_MODE="AIV" +export STREAMS_PER_DEVICE=32 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export ASCEND_USE_FIA=0 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --disable-radix-cache --mem-fraction-static 0.89 \ + --tp-size 1 --dp-size 2 \ + --sampling-backend ascend --max-running-requests 144 \ + 
--max-prefill-tokens 12288 \ + --served-model-name Qwen3-14B \ + --chunked-prefill-size -1 \ + --cuda-graph-bs 8 16 32 44 48 50 52 \ + --dtype bfloat16 \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --schedule-conservativeness 0.01 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 144 --random-output-len 1500 --random-input-len 3500 --num-prompts 576 --random-range-ratio 1 +``` + +### Qwen3-8B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode + +Model: Qwen3-8B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=50 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk 
-F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --disable-radix-cache --mem-fraction-static 0.9 \ + --tp-size 1 \ + --max-running-requests 70 \ + --max-prefill-tokens 16384 \ + --served-model-name Qwen3-8B \ + --chunked-prefill-size 16384 \ + --cuda-graph-bs 8 12 24 36 48 51 55 60 63 64 66 68 70 \ + --dtype bfloat16 \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 64 --random-output-len 1500 --random-input-len 3500 --num-prompts 256 --random-range-ratio 1 +``` + +### Qwen3-8B 3_5K-1_5K 5ms on A3 1 Cards Mixed Mode + +Model: Qwen3-8B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 5ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True 
+export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --disable-radix-cache --mem-fraction-static 0.894 \ + --tp-size 2 \ + --max-running-requests 1 \ + --max-prefill-tokens 16384 \ + --served-model-name Qwen3-8B \ + --chunked-prefill-size -1 \ + --cuda-graph-bs 1 \ + --dtype bfloat16 \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 4 --random-range-ratio 1 +``` + +### Qwen3-Next 3_5K-1_5K 20ms on A3 2 Cards Mixed Mode + +Model: Qwen3-Next-80B-A3B-Instruct + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 20ms + +#### Model Deployment + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=400 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=10 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048 +export HCCL_OP_EXPANSION_MODE="AIV" +export TASK_QUEUE_ENABLE=1 +export ASCEND_USE_FIA=1 +export SGLANG_NPU_USE_MULTI_STREAM=0 +export SGLANG_WARMUP_TIMEOUT=3600 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export FORCE_DRAFT_MODEL_NON_QUANT=1 +export HCCL_BUFFSIZE=2000 +export ZBCCL_LOCAL_MEM_SIZE=60416 +export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0 + +export ZBCCL_BOOTSTRAP_URL=tcp://127.0.0.1:24669 +export ZBCCL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True +export ZBCCL_ENABLE_GRAPH=1 + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " 
'{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ + --page-size 128 \ + --tp-size 4 --dp-size 2 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --quantization modelslim \ + --watchdog-timeout 9000 \ + --host 127.0.0.1 --port 6699 \ + --mem-fraction-static 0.85 \ + --disable-radix-cache --max-prefill-tokens 28672 --context-length 26384 --max-total-tokens 122304 \ + --enable-dp-attention --enable-dp-lm-head \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-running-requests 16 \ + --cuda-graph-bs 2 4 8 \ + --mamba-ssm-dtype bfloat16 \ + --speculative-draft-model-path /path/to/Qwen3-Next-80B-A3B-Instruct +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 1 +``` diff --git a/docs/platforms/ascend_npu_deepseek_example.md b/docs/platforms/ascend/ascend_npu_deepseek_example.md similarity index 97% rename from docs/platforms/ascend_npu_deepseek_example.md rename to docs/platforms/ascend/ascend_npu_deepseek_example.md index d0b207f18586..abda404d5995 100644 --- a/docs/platforms/ascend_npu_deepseek_example.md +++ b/docs/platforms/ascend/ascend_npu_deepseek_example.md @@ -22,7 +22,6 @@ export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 #npu acceleration operator export SGLANG_NPU_USE_MLAPO=1 export SGLANG_USE_FIA_NZ=1 -export ENABLE_MOE_NZ=1 python3 -m sglang.launch_server \ --model-path ${MODEL_PATH} \ @@ -71,7 +70,6 @@ export HCCL_BUFFSIZE=1536 #npu acceleration operator export SGLANG_NPU_USE_MLAPO=1 export SGLANG_USE_FIA_NZ=1 -export 
ENABLE_MOE_NZ=1 export TASK_QUEUE_ENABLE=2 python -m sglang.launch_server \ @@ -128,7 +126,6 @@ export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 unset TASK_QUEUE_ENABLE export SGLANG_NPU_USE_MLAPO=1 export SGLANG_USE_FIA_NZ=1 -export ENABLE_MOE_NZ=1 # suggest max-running-requests <= max-cuda-graph-bs * dp_size, Because when this value is exceeded, performance will significantly degrade. python -m sglang.launch_server \ @@ -146,7 +143,6 @@ python -m sglang.launch_server \ --attention-backend ascend \ --device npu \ --quantization modelslim \ - --prefill-round-robin-balance \ --moe-a2a-backend deepep \ --enable-dp-attention \ --deepep-mode low_latency \ @@ -255,7 +251,7 @@ do --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ --cuda-graph-bs 12 14 16 18 20 22 24 26 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ - --tokenizer-worker-num 4 --prefill-round-robin-balance --disable-shared-experts-fusion --dtype bfloat16 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ --load-balance-method decode_round_robin NODE_RANK=$i break @@ -266,7 +262,6 @@ done 2. SGLang Model Gateway (former Router): ```shell -export SGLANG_DP_ROUND_ROBIN=1 python -m sglang_router.launch_router \ --pd-disaggregation \ --policy cache_aware \ diff --git a/docs/platforms/ascend/ascend_npu_environment_variables.md b/docs/platforms/ascend/ascend_npu_environment_variables.md new file mode 100644 index 000000000000..fce333ba2022 --- /dev/null +++ b/docs/platforms/ascend/ascend_npu_environment_variables.md @@ -0,0 +1,39 @@ +# Environment Variables + +SGLang supports various environment variables related to Ascend NPU that can be used to configure its runtime behavior. 
+This document lists commonly used environment variables and will be updated over time.
+
+## Directly Used in SGLang
+
+| Environment Variable | Description | Default Value |
+|---|---|---|
+| `SGLANG_NPU_USE_MLAPO` | Uses the `MLAPO` fusion operator in the attention preprocessing stage of MLA models. | `false` |
+| `SGLANG_USE_FIA_NZ` | Reshapes the KV cache into the FIA NZ format. Must be enabled together with `SGLANG_NPU_USE_MLAPO`. | `false` |
+| `SGLANG_NPU_USE_MULTI_STREAM` | Enables dual-stream computation of shared experts and routed experts in DeepSeek models, and in the DeepSeek NSA Indexer. | `false` |
+| `SGLANG_NPU_DISABLE_ACL_FORMAT_WEIGHT` | Disables casting model weight tensors to a specific NPU ACL format. | `false` |
+| `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | The maximum number of dispatched tokens on each rank. | `128` |
+
+## Used in DeepEP Ascend
+
+| Environment Variable | Description | Default Value |
+|---|---|---|
+| `DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS` | Enables the ant-moving function in the dispatch stage. Sets the number of tokens transmitted per round on each rank. | `8192` |
+| `DEEPEP_NORMAL_LONG_SEQ_ROUND` | Enables the ant-moving function in the dispatch stage. Sets the number of rounds transmitted on each rank. | `1` |
+| `DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ` | Enables the ant-moving function in the combine stage. The value `0` means disabled. | `0` |
+| `MOE_ENABLE_TOPK_NEG_ONE` | Must be enabled when the expert IDs processed by DeepEP contain -1. | `0` |
+| `DEEP_NORMAL_MODE_USE_INT8_QUANT` | Quantizes x to int8 and returns (tensor, scales) in the dispatch operator. | `0` |
+
+## Others
+
+| Environment Variable | Description | Default Value |
+|---|---|---|
+| `TASK_QUEUE_ENABLE` | Controls the optimization level of the dispatch queue for the task_queue operator. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/730/comref/Envvariables/docs/zh/environment_variable_reference/TASK_QUEUE_ENABLE.md) | `1` |
+| `INF_NAN_MODE_ENABLE` | Controls whether the chip uses saturation mode or INF_NAN mode. [Detail](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha001/apiref/envref/envref_07_0056.html) | `1` |
+| `STREAMS_PER_DEVICE` | Configures the maximum number of streams for the stream pool. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/720/comref/Envvariables/Envir_041.html) | `32` |
+| `PYTORCH_NPU_ALLOC_CONF` | Controls the behavior of the cache allocator. This variable changes memory usage and may cause performance fluctuations. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html) | |
+| `ASCEND_MF_STORE_URL` | The address of the config store in MemFabric during PD disaggregation. It is generally set to the IP address of the prefill (P) primary node with an arbitrary port number. | |
+| `ASCEND_LAUNCH_BLOCKING` | Controls whether synchronous mode is enabled during operator execution. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/710/comref/Envvariables/Envir_006.html) | `0` |
+| `HCCL_OP_EXPANSION_MODE` | Configures the expansion position for communication algorithm scheduling. [Detail](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha001/apiref/envref/envref_07_0094.html) | |
+| `HCCL_BUFFSIZE` | Controls the size of the buffer area for shared data between two NPUs. The unit is MB, and the value must be greater than or equal to 1. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/60RC3/ptmoddevg/trainingmigrguide/performance_tuning_0047.html) | `200` |
+| `HCCL_SOCKET_IFNAME` | Configures the name of the network interface used by the host during HCCL initialization. [Detail](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/apiref/envvar/envref_07_0075.html) | |
+| `GLOO_SOCKET_IFNAME` | Configures the network interface name for GLOO communication. | |
diff --git a/docs/platforms/ascend/ascend_npu_glm5_examples.md b/docs/platforms/ascend/ascend_npu_glm5_examples.md
new file mode 100644
index 000000000000..d83f670fc03e
--- /dev/null
+++ b/docs/platforms/ascend/ascend_npu_glm5_examples.md
@@ -0,0 +1,200 @@
+# GLM-5 examples
+
+## Introduction
+
+The GLM (General Language Model) series is an open-source bilingual large language model family jointly developed by the KEG Laboratory of Tsinghua University and Zhipu AI. The series performs strongly in Chinese NLP thanks to its unified pre-training framework and bilingual capabilities. [GLM-5](https://huggingface.co/zai-org/GLM-5) adopts the DeepSeek-V3/V3.2 architecture, including sparse attention (DSA) and multi-token prediction (MTP). Ascend provides day-0 support for GLM-5 through the SGLang inference framework, with low-code enablement and compatibility with the mainstream distributed parallel capabilities of the current SGLang framework. We welcome developers to download and try it.
+
+## Environment Preparation
+
+### Model Weight
+
+- `GLM-5.0` (BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
+- `GLM-5.0-w4a8` (quantized version without MTP): [Download model weight](https://modelers.cn/models/Eco-Tech/GLM-5-w4a8).
+- You can also use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantize the model yourself.
+
+
+### Installation
+
+The dependencies required for the NPU runtime environment are packaged in a Docker image published online. You can pull it directly.
+
+```{code-block} bash
+# Atlas 800 A3
+docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann8.5.0-a3-glm5
+# Atlas 800 A2
+docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann8.5.0-910b-glm5
+
+# Start the container
+docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
+--net=host \
+-v /var/queue_schedule:/var/queue_schedule \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /usr/local/sbin:/usr/local/sbin \
+-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
+--device=/dev/davinci0:/dev/davinci0 \
+--device=/dev/davinci1:/dev/davinci1 \
+--device=/dev/davinci2:/dev/davinci2 \
+--device=/dev/davinci3:/dev/davinci3 \
+--device=/dev/davinci4:/dev/davinci4 \
+--device=/dev/davinci5:/dev/davinci5 \
+--device=/dev/davinci6:/dev/davinci6 \
+--device=/dev/davinci7:/dev/davinci7 \
+--device=/dev/davinci8:/dev/davinci8 \
+--device=/dev/davinci9:/dev/davinci9 \
+--device=/dev/davinci10:/dev/davinci10 \
+--device=/dev/davinci11:/dev/davinci11 \
+--device=/dev/davinci12:/dev/davinci12 \
+--device=/dev/davinci13:/dev/davinci13 \
+--device=/dev/davinci14:/dev/davinci14 \
+--device=/dev/davinci15:/dev/davinci15 \
+--device=/dev/davinci_manager:/dev/davinci_manager \
+--device=/dev/hisi_hdc:/dev/hisi_hdc \
+--entrypoint=bash \
+swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:${TAG}
+```
+
+### Best Practices
+Note: To follow the **best practices** with this image, update transformers to version 5.3.0 using one of the commands below.
+```shell
+# Reinstall transformers (choose one option)
+
+# Option 1: install transformers 5.3.0 from PyPI
+pip install transformers==5.3.0
+
+# Option 2: install the v5.3.0 tag from GitHub
+pip install git+https://github.com/huggingface/transformers.git@v5.3.0
+```
+
+## Deployment
+
+### Single-node Deployment
+
+- The quantized model `glm5_w4a8` can be deployed on one Atlas 800 A3 (64G × 16).
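Before launching with `--tp-size 16`, it can help to confirm that the container actually sees 16 NPUs. The snippet below is a minimal sketch of such a pre-flight check, assuming the NPUs are exposed as `/dev/davinci<N>` device nodes as in the `docker run` mapping above. The `count_npu_devices` helper is hypothetical and not part of SGLang or CANN.

```shell
# Hypothetical pre-flight check (not part of SGLang or CANN): count NPU device
# nodes named davinci<N> under a directory (default /dev).
count_npu_devices() {
    dev_dir="${1:-/dev}"
    count=0
    for dev in "$dev_dir"/davinci[0-9]*; do
        # If nothing matches, the pattern stays unexpanded and is not an existing path.
        [ -e "$dev" ] && count=$((count + 1))
    done
    echo "$count"
}

expected=16   # assumption: matches --tp-size 16 in the single-node example
found="$(count_npu_devices /dev)"
if [ "$found" -lt "$expected" ]; then
    echo "warning: found $found NPU device nodes, expected $expected" >&2
fi
```

The pattern `davinci[0-9]*` deliberately skips `davinci_manager`, so only the numbered device nodes are counted.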
+
+Run the following script to start online inference. Set `MODEL_PATH` to the local weight directory first.
+
+```shell
+# Set the CPU governor to performance mode
+echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+sysctl -w vm.swappiness=0
+sysctl -w kernel.numa_balancing=0
+sysctl -w kernel.sched_migration_cost_ns=50000
+# Bind CPU cores
+export SGLANG_SET_CPU_AFFINITY=1
+
+unset https_proxy
+unset http_proxy
+unset HTTPS_PROXY
+unset HTTP_PROXY
+unset ASCEND_LAUNCH_BLOCKING
+# Load the CANN environment
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+source /usr/local/Ascend/nnal/atb/set_env.sh
+
+export STREAMS_PER_DEVICE=32
+export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
+export SGLANG_ENABLE_SPEC_V2=1
+export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+export SGLANG_NPU_USE_MULTI_STREAM=1
+export HCCL_BUFFSIZE=1000
+export HCCL_OP_EXPANSION_MODE=AIV
+export HCCL_SOCKET_IFNAME=lo
+export GLOO_SOCKET_IFNAME=lo
+
+python3 -m sglang.launch_server \
+    --model-path $MODEL_PATH \
+    --attention-backend ascend \
+    --device npu \
+    --tp-size 16 --nnodes 1 --node-rank 0 \
+    --chunked-prefill-size 16384 --max-prefill-tokens 280000 \
+    --trust-remote-code \
+    --host 127.0.0.1 \
+    --mem-fraction-static 0.7 \
+    --port 8000 \
+    --served-model-name glm-5 \
+    --cuda-graph-bs 16 \
+    --quantization modelslim \
+    --moe-a2a-backend deepep --deepep-mode auto
+```
+
+### Multi-node Deployment
+
+- `GLM-5-bf16`: requires at least 2 Atlas 800 A3 (64G × 16).
+
+**A3 series**
+
+Set the IPs of both nodes in `P_IP`, then run the same script on each node.
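The multi-node script in this section derives each node's rank by comparing the host's own addresses with the configured node list. The snippet below is a minimal sketch of that lookup on its own, so the logic can be checked before a real launch. The `find_node_rank` helper and the example addresses are hypothetical. Unlike the script, which compares only the first two addresses from `hostname -I`, this sketch checks every local address.

```shell
# Hypothetical helper: print the zero-based rank of the first node IP that
# matches one of this host's addresses; return 1 if none matches.
#   $1             space-separated local addresses (e.g. output of `hostname -I`)
#   remaining args node IPs in rank order
find_node_rank() {
    local_ips="$1"
    shift
    rank=0
    for node_ip in "$@"; do
        for local_ip in $local_ips; do
            if [ "$local_ip" = "$node_ip" ]; then
                echo "$rank"
                return 0
            fi
        done
        rank=$((rank + 1))
    done
    return 1
}

# Example with made-up documentation addresses; on a real host pass "$(hostname -I)".
find_node_rank "192.0.2.11 10.0.0.5" 192.0.2.10 192.0.2.11
```

A non-zero return status means the host is not in the node list, which is a useful signal to stop before starting the server with a wrong `--node-rank`.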
+ +**node 0/1** + +```shell +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_NPU_USE_MULTI_STREAM=1 +export HCCL_BUFFSIZE=1000 +export HCCL_OP_EXPANSION_MODE=AIV + +# Run ifconfig on both nodes and find the interface whose inet addr matches the node IP. That is the public interface; set it here. +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + + +P_IP=('your ip1' 'your ip2') +P_MASTER="${P_IP[0]}:your port" + + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 32 --nnodes 2 --node-rank $i --dist-init-addr $P_MASTER \ + --chunked-prefill-size 16384 --max-prefill-tokens 131072 \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.8 \ + --port 8000 \ + --served-model-name glm-5 \ + --cuda-graph-max-bs 32 \ + --moe-a2a-backend deepep \ + --deepep-mode auto \ + --disable-radix-cache + NODE_RANK=$i + break + fi +done + +``` + +### Prefill-Decode Disaggregation + +Not tested yet. + +### Using Benchmark + +Refer to [Benchmark and Profiling](../../developer_guide/benchmark_and_profiling.md) for details.
diff --git a/docs/platforms/ascend/ascend_npu_quantization.md b/docs/platforms/ascend/ascend_npu_quantization.md new file mode 100644 index 000000000000..e60173850d82 --- /dev/null +++ b/docs/platforms/ascend/ascend_npu_quantization.md @@ -0,0 +1,134 @@ +# Quantization on Ascend + +To load already quantized models, simply load the model weights and config. If the model has been quantized offline, there is no need to add the `--quantization` argument when starting the engine. The quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json` config. + +SGLang supports **mix-bits** quantization (each layer is defined and loaded independently, according to the quantization type specified in `quant_model_description.json`). [Advanced mix-bits for MoE](https://github.com/sgl-project/sglang/pull/17361) is in progress and will add independent quantization determination for the w13 (up-gate) and w2 (down) layers. + +[ModelSlim on Ascend support](https://github.com/sgl-project/sglang/pull/14504) +| Quantization scheme | `quant_type` in JSON | Scheme class | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models | +|-----------------------------------------------------------|----------------------|--------------------------|--------------------------|:----------------------------------------:|:----------------------------------------:|:------------------------------------------:|:------------------------------------------:| +| W4A4 dynamic | `W4A4_DYNAMIC` | `ModelSlimW4A4Int4` | Linear | **** | **** | **TBD** | **** | +| W8A8 static | `W8A8` | `ModelSlimW8A8Int8` | Linear | **** | **** | **TBD** | **** | +| W8A8 dynamic | `W8A8_DYNAMIC` | `ModelSlimW8A8Int8` | Linear | **** | **** | **TBD** | **** | +| [MXFP8](https://github.com/sgl-project/sglang/pull/20922) | `W8A8_MXFP8` | `ModelSlimMXFP8Scheme` | Linear | **x** | **x** | **WIP** | **** (A5) | +| W4A4 dynamic | `W4A4_DYNAMIC` | `ModelSlimW4A4Int4`
| MoE | **** | **** | **TBD** | **x** | +| W4A8 dynamic | `W4A8_DYNAMIC` | `ModelSlimW4A8Int8MoE` | MoE | **** | **** | **TBD** | **x** | +| W8A8 dynamic | `W8A8_DYNAMIC` | `ModelSlimW8A8Int8` | MoE | **** | **** | **TBD** | **x** | +| [MXFP8](https://github.com/sgl-project/sglang/pull/20922) | `W8A8_MXFP8` | `ModelSlimMXFP8Scheme` | MoE | **x** | **x** | **WIP** | **x** | + +[AWQ on Ascend support](https://github.com/sgl-project/sglang/pull/10158): +| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | +|--------------------------------|--------------------------|:----------------------------------------:|:----------------------------------------:|:------------------------------------------:| +| W4A16 | Linear | **** | **** | **TBD** | +| W8A16 | Linear | **** | **** | **TBD** | +| W4A16 | MoE | **** | **** | **TBD** | + +GPTQ on Ascend support +| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | +|----------------------------------------------------------------------------|--------------------------|:----------------------------------------:|:----------------------------------------:|:-----------------------------------------:| +| [W4A16](https://github.com/sgl-project/sglang/pull/15203) | Linear | **** | **** | **TBD** | +| [W8A16](https://github.com/sgl-project/sglang/pull/15203) | Linear | **** | **** | **TBD** | +| [W4A16 MOE](https://github.com/sgl-project/sglang/pull/16364) | MoE | **** | **** | **TBD** | +| [W8A16 MOE](https://github.com/sgl-project/sglang/pull/16364) | MoE | **** | **** | **TBD** | + +[Auto-round on Ascend support](https://github.com/sgl-project/sglang/pull/16699) +| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | +|--------------------------------|--------------------------|:----------------------------------------:|:----------------------------------------:|:-----------------------------------------:| +| W4A16 | Linear | **** | **** | **TBD** | +| 
W8A16 | Linear | **** | **** | **TBD** | +| W4A16 | MoE | **** | **** | **TBD** | +| W8A16 | MoE | **** | **** | **TBD** | + +Compressed-tensors (LLM Compressor) on Ascend support: +| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | +|-----------------------------------------------------------------------------------------------|--------------------------|:----------------------------------------:|:----------------------------------------:|:-----------------------------------------:| +| [W8A8 dynamic](https://github.com/sgl-project/sglang/pull/14504) | Linear | **** | **** | **TBD** | +| [W4A8 dynamic with/without activation clip](https://github.com/sgl-project/sglang/pull/14736) | MoE | **** | **** | **TBD** | +| [W4A16 MOE](https://github.com/sgl-project/sglang/pull/12759) | MoE | **** | **** | **TBD** | +| [W8A8 dynamic](https://github.com/sgl-project/sglang/pull/14504) | MoE | **** | **** | **TBD** | + +[GGUF on Ascend support](https://github.com/sgl-project/sglang/pull/17883) +| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | +|-----------------------------------------------------------|--------------------------|:----------------------------------------:|:----------------------------------------:|:-----------------------------------------:| +| [GGUF (all types)](https://github.com/sgl-project/sglang/pull/17883) | Linear | **** | **** | **TBD** | +| [GGUF (all types)](https://github.com/sgl-project/sglang/pull/17883) | MoE | **** | **** | **TBD** | + +> Note: On Ascend, GGUF weights are pre-dequantized to FP16/BF16 during model loading to ensure optimal inference performance. This enables support for all GGUF quantization types (Q2_K, Q4_K_M, IQ4_XS, etc.) while maintaining high inference speed. + +in progress + +## Diffusion Model Quantization on Ascend NPU + +SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. 
MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3. + +**Requirements for MXFP8:** CANN ≥ 8.0.RC3, Ascend A5 + +| Quantization method | `quant_type` in JSON | Scheme class | Mode | A2/A3 Supported | A5 Supported | Trigger | +|---------------------|-----------------------|-------------------------------|---------|:--------------------------------------------:|:----------------------------------------:|---------------------------------------------------| +| MXFP8 (W8A8) | — | `MXFP8Config` | Online | **x** | **** | `--quantization mxfp8` | +| MXFP8 (W8A8) | `W8A8_MXFP8` | `ModelSlimMXFP8Scheme` | Offline | **x** | **** | auto-detected from `quant_model_description.json` | +| W8A8 static | `W8A8` | `ModelSlimW8A8Int8` | Offline | **** | **TBD** | auto-detected from `quant_model_description.json` | +| W8A8 dynamic | `W8A8_DYNAMIC` | `ModelSlimW8A8Int8` | Offline | **** | **TBD** | auto-detected from `quant_model_description.json` | +| W4A4 dynamic | `W4A4_DYNAMIC` | `ModelSlimW4A4Int4` | Offline | **** | **TBD** | auto-detected from `quant_model_description.json` | + +### Online MXFP8 Quantization + +Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using `npu_dynamic_mx_quant` + `npu_quant_matmul` CANN kernels. Pass `--quantization mxfp8` to override auto-detection. + +```bash +# Start the diffusion server with online MXFP8 quantization +sglang serve \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --quantization mxfp8 \ + --num-gpus 4 +``` + +```bash +# One-shot generation +sglang generate \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --quantization mxfp8 \ + --prompt "a beautiful sunset over the mountains" \ + --save-output +``` + +### Offline MXFP8 Quantization (ModelSlim) + +For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from `quant_model_description.json`, so no extra `--quantization` flag is needed. 
+ +**Step 1: Quantize with msModelSlim** + +```bash +msmodelslim quant \ + --model_path /path/to/wan2_2_float_weights \ + --save_path /path/to/wan2_2_mxfp8_weights \ + --device npu \ + --model_type Wan2_2 \ + --quant_type mxfp8 \ + --trust_remote_code True +``` + +> Note: SGLang does not support quantized embeddings; disable embedding quantization when using msmodelslim. + +**Step 2: Convert to Diffusers format** + +msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert to Diffusers format using the provided repack script: + +```bash +python python/sglang/multimodal_gen/tools/wan_repack.py \ + --input-path /path/to/wan2_2_mxfp8_weights \ + --output-path /path/to/wan2_2_mxfp8_diffusers +``` + +Then copy all files from the original Diffusers checkpoint (except the `transformer`/`transformer_2` folders) into the output directory. + +**Step 3: Run inference** + +```bash +sglang generate \ + --model-path /path/to/wan2_2_mxfp8_diffusers \ + --prompt "a beautiful sunset over the mountains" \ + --save-output +``` + +For pre-quantized checkpoints available on ModelScope, see [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech). diff --git a/docs/platforms/ascend/ascend_npu_quick_start.md b/docs/platforms/ascend/ascend_npu_quick_start.md new file mode 100644 index 000000000000..7f0bef6e8aa3 --- /dev/null +++ b/docs/platforms/ascend/ascend_npu_quick_start.md @@ -0,0 +1,103 @@ +# Ascend NPU Quickstart + +## Prerequisites + +### Supported Devices + +- Atlas 800I A2 inference series (Atlas 800I A2) +- Atlas 800I A3 inference series (Atlas 800I A3) + +## Setup environment using container + +__Notice:__ The following commands are based on Atlas 800I A3 machines. If you are using Atlas 800I A2, some changes are needed. + +- The image tag needs to be `main-cann8.5.0-a3` for Atlas 800I A3 and `main-cann8.5.0-910b` for Atlas 800I A2. +- The device mapping in `docker run` command needs to be changed to `davinci[0-7]` for Atlas 800I A2. 
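The device mapping differs only in how many `davinci` devices are passed through, so the flags can be generated instead of typed out. A minimal sketch, assuming a hypothetical helper named `build_device_flags` (it is not part of SGLang or Docker):

```shell
# Hypothetical helper: print "--device=/dev/davinciN" flags for N = 0..count-1.
build_device_flags() {
  count="$1"
  flags=""
  i=0
  while [ "$i" -lt "$count" ]; do
    flags="$flags --device=/dev/davinci$i"
    i=$((i + 1))
  done
  # Drop the leading space added by the first iteration.
  echo "${flags# }"
}

# 8 devices (davinci0..davinci7) for Atlas 800I A2; use 16 for Atlas 800I A3.
DEVICE_FLAGS=$(build_device_flags 8)
echo "$DEVICE_FLAGS"
```

The result can be spliced into the command unquoted, for example `docker run ... $DEVICE_FLAGS --device=/dev/davinci_manager ...`, so that the shell splits it into separate arguments.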
+ +```shell +# For Atlas 800I A3 +export IMAGE=quay.io/ascend/sglang:main-cann8.5.0-a3 + +docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \ + --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \ + --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \ + --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \ + --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \ + --device=/dev/davinci_manager \ + --device=/dev/hisi_hdc \ + --volume /usr/local/sbin:/usr/local/sbin \ + --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \ + --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \ + --volume /etc/ascend_install.info:/etc/ascend_install.info \ + --volume /var/queue_schedule:/var/queue_schedule \ + --volume ~/.cache/:/root/.cache/ \ + --entrypoint=bash \ + $IMAGE +``` + +## Usage + +The SGLang server is installed in the container by default. You can use `pip show sglang` to check the version. + +### Start SGLang server + +SGLang will automatically download the model from Hugging Face. + +```shell +# Set HF_ENDPOINT to a mirror site if network is not available +export HF_ENDPOINT=https://hf-mirror.com + +# Set your own HF_TOKEN to download restricted models +export HF_TOKEN= + +# Start SGLang server +# It may take several minutes to download the model on the first run +sglang serve --model-path Qwen/Qwen2.5-7B-Instruct --attention-backend ascend & +``` + +If you see output like the following, the server is running. + +```log +INFO: Waiting for application startup. +INFO: Application startup complete. +INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit) +The server is fired up and ready to roll! 
+``` + +### Send a test request + +You can do inference using the server: + +```shell +curl -X POST http://localhost:30000/generate \ + -H "Content-Type: application/json" \ + -d '{ + "text": "The capital of France is", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 16 + } + }' +``` + +If the "text" field in the response contains "Paris", the server is working as expected. + +### Stop server and exit container + +The SGLang server is running as a background process. You can send a `SIGINT` signal to stop it. + +```shell +SGLANG_PID=$(pgrep -f "sglang serve") +kill -SIGINT $SGLANG_PID +``` + +The output should be like the following: + +```log +INFO: Shutting down +INFO: Waiting for application shutdown. +INFO: Application shutdown complete. +INFO: Finished server process [25310] +``` + +The server has now stopped. You can verify it with `ps -ef | grep sglang`, then exit the container by pressing `Ctrl+D`. diff --git a/docs/platforms/ascend/ascend_npu_qwen3_5_examples.md b/docs/platforms/ascend/ascend_npu_qwen3_5_examples.md new file mode 100644 index 000000000000..8660f17cc5ea --- /dev/null +++ b/docs/platforms/ascend/ascend_npu_qwen3_5_examples.md @@ -0,0 +1,231 @@ +# Qwen3.5 examples + +## Environment Preparation + +### Installation + +The dependencies required for the NPU runtime environment have been integrated into a Docker image and uploaded to the quay.io platform. You can directly pull it. 
+ +```{code-block} bash +#Atlas 800 A3 +docker pull quay.io/ascend/sglang:main-cann8.5.0-a3 +#Atlas 800 A2 +docker pull quay.io/ascend/sglang:main-cann8.5.0-910b + +#start container +docker run -itd --shm-size=16g --privileged=true --name ${NAME} \ +--privileged=true --net=host \ +-v /var/queue_schedule:/var/queue_schedule \ +-v /etc/ascend_install.info:/etc/ascend_install.info \ +-v /usr/local/sbin:/usr/local/sbin \ +-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ +-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \ +--device=/dev/davinci0:/dev/davinci0 \ +--device=/dev/davinci1:/dev/davinci1 \ +--device=/dev/davinci2:/dev/davinci2 \ +--device=/dev/davinci3:/dev/davinci3 \ +--device=/dev/davinci4:/dev/davinci4 \ +--device=/dev/davinci5:/dev/davinci5 \ +--device=/dev/davinci6:/dev/davinci6 \ +--device=/dev/davinci7:/dev/davinci7 \ +--device=/dev/davinci8:/dev/davinci8 \ +--device=/dev/davinci9:/dev/davinci9 \ +--device=/dev/davinci10:/dev/davinci10 \ +--device=/dev/davinci11:/dev/davinci11 \ +--device=/dev/davinci12:/dev/davinci12 \ +--device=/dev/davinci13:/dev/davinci13 \ +--device=/dev/davinci14:/dev/davinci14 \ +--device=/dev/davinci15:/dev/davinci15 \ +--device=/dev/davinci_manager:/dev/davinci_manager \ +--device=/dev/hisi_hdc:/dev/hisi_hdc \ +--entrypoint=bash \ +quay.io/ascend/sglang:${tag} +``` + +## Deployment + +### Single-node Deployment + +Run the following script to execute online inference. 
+ +#### Qwen3.5 397B + +```shell +# high performance cpu +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1000 +export HCCL_OP_EXPANSION_MODE=AIV +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 16 --nnodes 1 --node-rank 0 \ + --chunked-prefill-size 4096 --max-prefill-tokens 280000 \ + --disable-radix-cache \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.7 \ + --port 8000 \ + --cuda-graph-bs 16 \ + --quantization modelslim \ + --enable-multimodal \ + --mm-attention-backend ascend_attn \ + --dtype bfloat16 +``` + +#### Qwen3.5 122B + +```shell +# high performance cpu +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1000 +export HCCL_OP_EXPANSION_MODE=AIV +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 8 --nnodes 1 --node-rank 0 \ + --chunked-prefill-size 4096 --max-prefill-tokens 280000 
\ + --disable-radix-cache \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.7 \ + --port 8000 \ + --cuda-graph-bs 16 \ + --quantization modelslim \ + --enable-multimodal \ + --mm-attention-backend ascend_attn \ + --dtype bfloat16 +``` + +#### Qwen3.5 35B + +```shell +# high performance cpu +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1000 +export HCCL_OP_EXPANSION_MODE=AIV +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 2 --nnodes 1 --node-rank 0 \ + --chunked-prefill-size 4096 --max-prefill-tokens 280000 \ + --disable-radix-cache \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.7 \ + --port 8000 \ + --cuda-graph-bs 16 \ + --quantization modelslim \ + --enable-multimodal \ + --mm-attention-backend ascend_attn \ + --dtype bfloat16 +``` + +#### Qwen3.5 27B + +```shell +# high performance cpu +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1000 +export HCCL_OP_EXPANSION_MODE=AIV +export 
HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 2 \ + --chunked-prefill-size -1 --max-prefill-tokens 120000 \ + --disable-radix-cache \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.8 \ + --port 8000 \ + --cuda-graph-bs 32 \ + --enable-multimodal \ + --mm-attention-backend ascend_attn +``` + +### Prefill-Decode Disaggregation + +Not tested yet. + +### Using Benchmark + +Refer to [Benchmark and Profiling](../../developer_guide/benchmark_and_profiling.md) for details. diff --git a/docs/platforms/ascend/ascend_npu_qwen3_examples.md b/docs/platforms/ascend/ascend_npu_qwen3_examples.md new file mode 100644 index 000000000000..f17ed6b71ef5 --- /dev/null +++ b/docs/platforms/ascend/ascend_npu_qwen3_examples.md @@ -0,0 +1,287 @@ +## Qwen3 examples + +### Running Qwen3 + +#### Running Qwen3-32B on 1 x Atlas 800I A3. + +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-32B) + +```shell +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1536 +export HCCL_OP_EXPANSION_MODE=AIV + +python -m sglang.launch_server \ + --device npu \ + --attention-backend ascend \ + --trust-remote-code \ + --tp-size 4 \ + --model-path Qwen/Qwen3-32B \ + --mem-fraction-static 0.8 +``` + +#### Running Qwen3-32B on 1 x Atlas 800I A3 with Qwen3-32B-Eagle3.
+ +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-32B) + +Speculative model weights could be found [here](https://huggingface.co/Zhihu-ai/Zhi-Create-Qwen3-32B-Eagle3) + +```shell +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_OP_EXPANSION_MODE=AIV +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server \ + --device npu \ + --attention-backend ascend \ + --trust-remote-code \ + --tp-size 4 \ + --model-path Qwen/Qwen3-32B \ + --mem-fraction-static 0.8 \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path Qwen/Qwen3-32B-Eagle3 \ + --speculative-num-steps 1 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 2 +``` + +#### Running Qwen3-30B-A3B MOE on 1 x Atlas 800I A3. + +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-30B-A3B) + +```shell +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1536 +export HCCL_OP_EXPANSION_MODE=AIV +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32 +export SGLANG_DEEPEP_BF16_DISPATCH=1 + +python -m sglang.launch_server \ + --device npu \ + --attention-backend ascend \ + --trust-remote-code \ + --tp-size 4 \ + --model-path Qwen/Qwen3-30B-A3B \ + --mem-fraction-static 0.8 +``` + +#### Running Qwen3-235B-A22B-Instruct-2507 MOE on 1 x Atlas 800I A3. 
+ +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) + +```shell +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1536 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32 +export SGLANG_DEEPEP_BF16_DISPATCH=1 + +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --tp-size 16 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --watchdog-timeout 9000 \ + --mem-fraction-static 0.8 +``` + +#### Running Qwen3-235B-A22B-Instruct-2507 with 256K long sequence on 2 x Atlas 800I A3 without CP + +This example uses **PD disaggregation** for long-sequence inference and keeps **context parallel disabled**. + +Set the shared environment variables on both nodes first: + +```shell +export ASCEND_USE_FIA=1 +export SGLANG_SET_CPU_AFFINITY=1 +export ASCEND_MF_STORE_URL="tcp://:12345" +export HCCL_SOCKET_IFNAME= +export GLOO_SOCKET_IFNAME= + +MODEL_PATH=/root/.cache/modelscope/hub/models/zcgy26/Qwen3-235B-A22B-Instruct-2507-w8a8 +``` + +**Prefill node:** + +```shell +export ASCEND_LAUNCH_BLOCKING=1 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export HCCL_BUFFSIZE=1500 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=128 +export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 + +python3 -m sglang.launch_server \ + --model-path ${MODEL_PATH} \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend ascend \ + --disaggregation-bootstrap-port 8995 \ + --attention-backend ascend \ + --disable-radix-cache \ + --quantization modelslim \ + --chunked-prefill-size -1 \ + --skip-server-warmup \ + --device npu \ + --tp-size 16 \ + --mem-fraction-static 0.45 \ + --max-running-requests 1 \ + --host \ + --port 8000 \ + --dist-init-addr :5000 \ + --nnodes 1 \ + --node-rank 0 \ + --moe-a2a-backend deepep \ + --deepep-mode normal +``` + +**Decode node:** + 
+```shell +export SGLANG_DEEPEP_BF16_DISPATCH=0 +export HCCL_BUFFSIZE=4000 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=16 + +python3 -m sglang.launch_server \ + --model-path ${MODEL_PATH} \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend ascend \ + --attention-backend ascend \ + --mem-fraction-static 0.8 \ + --disable-cuda-graph \ + --device npu \ + --disable-radix-cache \ + --quantization modelslim \ + --chunked-prefill-size 8192 \ + --skip-server-warmup \ + --tp-size 16 \ + --max-running-requests 1 \ + --host \ + --port 8232 \ + --moe-a2a-backend deepep \ + --deepep-mode low_latency \ + --disable-overlap-schedule +``` + +**Router:** + +```shell +python3 -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://:8000 8995 \ + --decode http://:8232 \ + --host \ + --port 6689 \ + --prometheus-port 29010 +``` + +#### Running Qwen3-235B-A22B-Instruct-2507-W8A8 with Prefill Context Parallel (CP) on 2 x Atlas 800I A3 + +This example enables **Prefill Context Parallel** (`--enable-prefill-context-parallel`) to split the context across CP ranks during prefill, reducing per-device memory pressure and improving TTFT for long sequences. PD disaggregation is required. 
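The constraints that follow require `--attn-cp-size` to divide `--tp-size` evenly. A minimal sketch of that check, assuming a hypothetical helper named `npus_per_cp_rank` (it is not an SGLang command), which can be run before launching the prefill server:

```shell
# Hypothetical helper: print the number of NPUs per CP rank, or fail when
# cp_size does not evenly divide tp_size.
npus_per_cp_rank() {
  tp_size="$1"
  cp_size="$2"
  if [ "$cp_size" -le 0 ] || [ $((tp_size % cp_size)) -ne 0 ]; then
    echo "attn-cp-size ($cp_size) must evenly divide tp-size ($tp_size)" >&2
    return 1
  fi
  echo $((tp_size / cp_size))
}

# Values used by the prefill command in this example: tp-size 16, attn-cp-size 2.
npus_per_cp_rank 16 2
```

With the values from this example, each CP rank occupies 8 NPUs.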
+ +> **Constraints** +> - Prefill side must set `--max-running-requests 1` (PCP only supports batch_size=1) +> - `--attn-cp-size` must evenly divide `--tp-size`; each CP rank occupies `tp_size / cp_size` NPUs + +**Prefill node :** + +```shell +export SGLANG_SET_CPU_AFFINITY=1 +export ASCEND_MF_STORE_URL="tcp://:23456" +export ASCEND_USE_FIA=True + +python3 -m sglang.launch_server \ + --model-path /mnt/share/weights/Qwen3-235B-A22B-Instruct-2507-W8A8 \ + --trust-remote-code \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend ascend \ + --disaggregation-bootstrap-port 8995 \ + --quantization modelslim \ + --attention-backend ascend \ + --skip-server-warmup \ + --mem-fraction-static 0.7 \ + --chunked-prefill-size 32768 \ + --device npu \ + --base-gpu-id 0 \ + --tp-size 16 \ + --enable-prefill-context-parallel \ + --attn-cp-size 2 \ + --moe-dp-size 2 \ + --max-running-requests 1 \ + --host \ + --port 8000 \ + --nnodes 1 \ + --node-rank 0 \ + --dist-init-addr :6688 +``` + +Key parameters for PCP: + +| Parameter | Value | Description | +|-----------|-------|-------------| +| `--enable-prefill-context-parallel` | flag | Enable PCP feature | +| `--attn-cp-size` | 2 | Split context across 2 CP ranks (each rank handles half the sequence) | +| `--moe-dp-size` | 2 | MoE DP size, should match `--attn-cp-size` | +| `--max-running-requests` | 1 | Required by PCP (batch_size=1 constraint) | + +**Decode node ():** + +```shell +export ASCEND_MF_STORE_URL="tcp://141.61.39.231:23456" +export ASCEND_USE_FIA=True + +python3 -m sglang.launch_server \ + --model-path /mnt/share/weights/Qwen3-235B-A22B-Instruct-2507-W8A8 \ + --trust-remote-code \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend ascend \ + --quantization modelslim \ + --attention-backend ascend \ + --disable-radix-cache \ + --disable-cuda-graph \ + --mem-fraction-static 0.7 \ + --chunked-prefill-size 32768 \ + --skip-server-warmup \ + --device npu \ + --base-gpu-id 0 \ + --tp-size 8 \ 
+ --max-running-requests 32 \ + --host \ + --port 8001 \ + --nnodes 1 \ + --node-rank 0 \ + --dist-init-addr :6688 +``` + +> **Note:** `ASCEND_MF_STORE_URL` on both nodes must point to the same KV store (typically the Prefill node IP). `ASCEND_USE_FIA=True` enables fast interconnect aggregation for KV transfer. PCP is a Prefill-only feature; the Decode side needs no CP-related flags. + +#### Running Qwen3-VL-8B-Instruct on 1 x Atlas 800I A3. + +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) + +```shell +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1536 +export HCCL_OP_EXPANSION_MODE=AIV + +python -m sglang.launch_server \ + --enable-multimodal \ + --attention-backend ascend \ + --mm-attention-backend ascend_attn \ + --trust-remote-code \ + --tp-size 4 \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --mem-fraction-static 0.8 +``` diff --git a/docs/platforms/ascend/ascend_npu_support.rst b/docs/platforms/ascend/ascend_npu_support.rst new file mode 100644 index 000000000000..1c0bbc2760c6 --- /dev/null +++ b/docs/platforms/ascend/ascend_npu_support.rst @@ -0,0 +1,20 @@ +Ascend NPUs +=============================================================== + +.. 
toctree:: + :maxdepth: 1 + + ascend_npu_quick_start.md + ascend_npu.md + ascend_npu_support_features.md + ascend_npu_support_models.md + ascend_npu_quantization.md + ascend_npu_deepseek_example.md + ascend_npu_qwen3_examples.md + mindspore_backend.md + ascend_contribution_guide.md + ascend_npu_best_practice.md + ascend_npu_ring_sp_performance.md + ascend_npu_qwen3_5_examples.md + ascend_npu_glm5_examples.md + ascend_npu_environment_variables.md diff --git a/docs/platforms/ascend/ascend_npu_support_features.md b/docs/platforms/ascend/ascend_npu_support_features.md new file mode 100644 index 000000000000..729702ed64da --- /dev/null +++ b/docs/platforms/ascend/ascend_npu_support_features.md @@ -0,0 +1,483 @@ +# Support Features on Ascend NPU + +This section describes the basic functions and features supported by the Ascend NPU. If you encounter issues or have any +questions, please [open an issue](https://github.com/sgl-project/sglang/issues). + +For the meaning and usage of each parameter, +see [Server Arguments](https://docs.sglang.io/advanced_features/server_arguments.html). + +## Model and tokenizer + +| Argument | Defaults | Options | Server supported | +|----------------------------------------|----------|---------------------------------------|:----------------:| +| `--model-path`
`--model` | `None` | Type: str | A2, A3 | +| `--tokenizer-path` | `None` | Type: str | A2, A3 | +| `--tokenizer-mode` | `auto` | `auto`, `slow` | A2, A3 | +| `--tokenizer-worker-num` | `1` | Type: int | A2, A3 | +| `--skip-tokenizer-init` | `False` | bool flag (set to enable) | A2, A3 | +| `--load-format` | `auto` | `auto`, `safetensors` | A2, A3 | +| `--model-loader-`
`extra-config` | `{}` | Type: str | A2, A3 | +| `--trust-remote-code` | `False` | bool flag (set to enable) | A2, A3 | +| `--context-length` | `None` | Type: int | A2, A3 | +| `--is-embedding` | `False` | bool flag (set to enable) | A2, A3 | +| `--enable-multimodal` | `None` | bool flag (set to enable) | A2, A3 | +| `--revision` | `None` | Type: str | A2, A3 | +| `--model-impl` | `auto` | `auto`, `sglang`,
`transformers` | A2, A3 | + +## HTTP server + +| Argument | Defaults | Options | Server supported | +|------------------------|-------------|---------------------------|:----------------:| +| `--host` | `127.0.0.1` | Type: str | A2, A3 | +| `--port` | `30000` | Type: int | A2, A3 | +| `--skip-server-warmup` | `False` | bool flag (set to enable) | A2, A3 | +| `--warmups` | `None` | Type: str | A2, A3 | +| `--nccl-port` | `None` | Type: int | A2, A3 | +| `--fastapi-root-path` | `None` | Type: str | A2, A3 | +| `--grpc-mode` | `False` | `False` | Planned | + +## Quantization and data type + +| Argument | Defaults | Options | Server supported | +|---------------------------------------------|----------|-----------------------------------------|:----------------:| +| `--dtype` | `auto` | `auto`,
`float16`,
`bfloat16` | A2, A3 | +| `--quantization` | `None` | `modelslim` | A2, A3 | +| `--quantization-param-path` | `None` | Type: str | Special for GPU | +| `--kv-cache-dtype` | `auto` | `auto` | A2, A3 | +| `--enable-fp32-lm-head` | `False` | bool flag
(set to enable) | A2, A3 | +| `--modelopt-quant` | `None` | Type: str | Special for GPU | +| `--modelopt-checkpoint-`
`restore-path` | `None` | Type: str | Special for GPU | +| `--modelopt-checkpoint-`
`save-path` | `None` | Type: str | Special for GPU | +| `--modelopt-export-path` | `None` | Type: str | Special for GPU | +| `--quantize-and-serve` | `False` | bool flag
(set to enable) | Special for GPU | +| `--rl-quant-profile` | `None` | Type: str | Special for GPU | + +## Memory and scheduling + +| Argument | Defaults | Options | Server supported | +|-----------------------------------------------------|----------|--------------------------------|:----------------:| +| `--mem-fraction-static` | `None` | Type: float | A2, A3 | +| `--max-running-requests` | `None` | Type: int | A2, A3 | +| `--prefill-max-requests` | `None` | Type: int | A2, A3 | +| `--max-queued-requests` | `None` | Type: int | A2, A3 | +| `--max-total-tokens` | `None` | Type: int | A2, A3 | +| `--chunked-prefill-size` | `None` | Type: int | A2, A3 | +| `--max-prefill-tokens` | `16384` | Type: int | A2, A3 | +| `--schedule-policy` | `fcfs` | `lpm`, `fcfs` | A2, A3 | +| `--enable-priority-`
`scheduling` | `False` | bool flag
(set to enable) | A2, A3 | +| `--schedule-low-priority-`
`values-first` | `False` | bool flag
(set to enable) | A2, A3 | +| `--priority-scheduling-`
`preemption-threshold` | `10` | Type: int | A2, A3 | +| `--schedule-conservativeness` | `1.0` | Type: float | A2, A3 | +| `--page-size` | `128` | Type: int | A2, A3 | +| `--swa-full-tokens-ratio` | `0.8` | Type: float | Planned | +| `--disable-hybrid-swa-memory` | `False` | bool flag
(set to enable) | Planned | +| `--radix-eviction-policy` | `lru` | `lru`,
`lfu` | A2, A3 | +| `--enable-prefill-delayer` | `False` | bool flag
(set to enable) | A2, A3 | +| `--prefill-delayer-max-delay-passes` | `30` | Type: int | A2, A3 | +| `--prefill-delayer-token-usage-low-watermark` | `None` | Type: float | A2, A3 | +| `--prefill-delayer-forward-passes-buckets` | `None` | List[float] | A2, A3 | +| `--prefill-delayer-wait-seconds-buckets` | `None` | List[float] | A2, A3 | +| `--abort-on-priority-`
`when-disabled` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-dynamic-chunking` | `False` | bool flag
(set to enable) | Experimental | + +## Runtime options + +| Argument | Defaults | Options | Server supported | +|----------------------------------------------------------|----------|----------------------------------------|:----------------:| +| `--device` | `None` | Type: str | A2, A3 | +| `--tensor-parallel-size`
`--tp-size` | `1` | Type: int | A2, A3 | +| `--pipeline-parallel-size`
`--pp-size` | `1` | Type: int; `2` is currently not supported | Experimental | +| `--attention-context-parallel-size`
`--attn-cp-size` | `1` | Type: int; must be equal to `--tp-size` | A2, A3 | +| `--moe-data-parallel-size`
`--moe-dp-size` | `1` | Type: int | Planned | +| `--pp-max-micro-batch-size` | `None` | Type: int | Experimental | +| `--pp-async-batch-depth` | `None` | Type: int | Experimental | +| `--stream-interval` | `1` | Type: int | A2, A3 | +| `--incremental-streaming-output` | `False` | bool flag (set to enable) | A2, A3 | +| `--random-seed` | `None` | Type: int | A2, A3 | +| `--constrained-json-`
`whitespace-pattern` | `None` | Type: str | A2, A3 | +| `--constrained-json-`
`disable-any-whitespace` | `False` | bool flag (set to enable) | A2, A3 | +| `--watchdog-timeout` | `300` | Type: float | A2, A3 | +| `--soft-watchdog-timeout` | `300` | Type: float | A2, A3 | +| `--dist-timeout` | `None` | Type: int | A2, A3 | +| `--download-dir` | `None` | Type: str | A2, A3 | +| `--model-checksum` | `None` | Type: str | Planned | +| `--base-gpu-id` | `0` | Type: int | A2, A3 | +| `--gpu-id-step` | `1` | Type: int | A2, A3 | +| `--sleep-on-idle` | `False` | bool flag (set to enable) | A2, A3 | + +## Logging + +| Argument | Defaults | Options | Server supported | +|----------------------------------------------------|-------------------|--------------------------------|:----------------:| +| `--log-level` | `info` | Type: str | A2, A3 | +| `--log-level-http` | `None` | Type: str | A2, A3 | +| `--log-requests` | `False` | bool flag
(set to enable) | A2, A3 | +| `--log-requests-level` | `2` | `0`, `1`, `2`, `3` | A2, A3 | +| `--log-requests-format` | `text` | `text`, `json` | A2, A3 | +| `--crash-dump-folder` | `None` | Type: str | A2, A3 | +| `--enable-metrics` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-metrics-for-`
`all-schedulers` | `False` | bool flag
(set to enable) | A2, A3 | +| `--tokenizer-metrics-`
`custom-labels-header` | `x-custom-labels` | Type: str | A2, A3 | +| `--tokenizer-metrics-`
`allowed-custom-labels` | `None` | List[str] | A2, A3 | +| `--bucket-time-to-`
`first-token` | `None` | List[float] | A2, A3 | +| `--bucket-inter-token-`
`latency` | `None` | List[float] | A2, A3 | +| `--bucket-e2e-request-`
`latency` | `None` | List[float] | A2, A3 | +| `--collect-tokens-`
`histogram` | `False` | bool flag
(set to enable) | A2, A3 | +| `--prompt-tokens-buckets` | `None` | List[str] | A2, A3 | +| `--generation-tokens-buckets` | `None` | List[str] | A2, A3 | +| `--gc-warning-threshold-secs` | `0.0` | Type: float | A2, A3 | +| `--decode-log-interval` | `40` | Type: int | A2, A3 | +| `--enable-request-time-`
`stats-logging` | `False` | bool flag
(set to enable) | A2, A3 | +| `--kv-events-config` | `None` | Type: str | Special for GPU | +| `--enable-trace` | `False` | bool flag
(set to enable) | A2, A3 | +| `--oltp-traces-endpoint` | `localhost:4317` | Type: str | A2, A3 | +| `--log-requests-target` | `None` | Type: str | A2, A3 | +| `--uvicorn-access-log-exclude-prefixes` | `[]` | List[str] | A2, A3 | + +## RequestMetricsExporter configuration + +| Argument | Defaults | Options | Server supported | +|---------------------------------------|----------|--------------------------------|:----------------:| +| `--export-metrics-to-`
`file` | `False` | bool flag
(set to enable) | A2, A3 | +| `--export-metrics-to-`
`file-dir` | `None` | Type: str | A2, A3 | + +## API related + +| Argument | Defaults | Options | Server supported | +|---------------------------|-----------|-------------------------------------------------------------------------------------------------------------------|:----------------:| +| `--api-key` | `None` | Type: str | A2, A3 | +| `--admin-api-key` | `None` | Type: str | A2, A3 | +| `--served-model-name` | `None` | Type: str | A2, A3 | +| `--weight-version` | `default` | Type: str | A2, A3 | +| `--chat-template` | `None` | Type: str | A2, A3 | +| `--hf-chat-template-name` | `None` | Type: str | A2, A3 | +| `--completion-template` | `None` | Type: str | A2, A3 | +| `--enable-cache-report` | `False` | bool flag
(set to enable) | A2, A3 | +| `--reasoning-parser` | `None` | `deepseek-r1`
`deepseek-v3`
`glm45`
`gpt-oss`
`kimi`
`qwen3`
`qwen3-thinking`
`step3` | A2, A3 | +| `--tool-call-parser` | `None` | `llama3`
`pythonic`
`qwen`
`qwen3_coder` | A2, A3 | +| `--sampling-defaults` | `model` | `openai`, `model` | A2, A3 | + +## Data parallelism + +| Argument | Defaults | Options | Server supported | +|----------------------------------------|---------------|-----------------------------------------------------------|:----------------:| +| `--data-parallel-size`
`--dp-size` | `1` | Type: int | A2, A3 | +| `--load-balance-method` | `auto` | `auto`,
`round_robin`,
`follow_bootstrap_room`,
`total_requests`,
`total_tokens` | A2, A3 | + +## Multi-node distributed serving + +| Argument | Defaults | Options | Server supported | +|-------------------------------------------|----------|-----------|:----------------:| +| `--dist-init-addr`
`--nccl-init-addr` | `None` | Type: str | A2, A3 | +| `--nnodes` | `1` | Type: int | A2, A3 | +| `--node-rank` | `0` | Type: int | A2, A3 | + +## Model override args + +| Argument | Defaults | Options | Server supported | +|--------------------------------------|----------|-----------|:----------------:| +| `--json-model-override-`
`args` | `{}` | Type: str | A2, A3 | +| `--preferred-sampling-`
`params` | `None` | Type: str | A2, A3 | + +## LoRA + +| Argument | Defaults | Options | Server supported | +|--------------------------|----------|-------------------------------------|:----------------:| +| `--enable-lora` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-lora-overlap-loading` | `False` | bool flag
(set to enable) | A2, A3 | +| `--max-lora-rank` | `None` | Type: int | A2, A3 | +| `--lora-target-modules` | `None` | `all` | A2, A3 | +| `--lora-paths` | `None` | Type: List[str] /
JSON objects | A2, A3 | +| `--max-loras-per-batch` | `8` | Type: int | A2, A3 | +| `--max-loaded-loras` | `None` | Type: int | A2, A3 | +| `--lora-eviction-policy` | `lru` | `lru`,
`fifo` | A2, A3 | +| `--lora-backend` | `csgmv` | `triton`,
`csgmv`,
`ascend`,
`torch_native` | A2, A3 | +| `--max-lora-chunk-size` | `16` | `16`, `32`,
`64`, `128` | Special for GPU | + +## Kernel Backends (Attention, Sampling, Grammar, GEMM) + +| Argument | Defaults | Options | Server supported | +|----------------------------------------|-------------------|------------------------------------------------------------------------------------------------|:----------------:| +| `--attention-backend` | `None` | `ascend` | A2, A3 | +| `--prefill-attention-backend` | `None` | `ascend` | A2, A3 | +| `--decode-attention-backend` | `None` | `ascend` | A2, A3 | +| `--sampling-backend` | `None` | `pytorch`,
`ascend` | A2, A3 | +| `--grammar-backend` | `None` | `xgrammar` | A2, A3 | +| `--mm-attention-backend` | `None` | `ascend_attn` | A2, A3 | +| `--nsa-prefill-backend` | `flashmla_sparse` | `flashmla_sparse`,
`flashmla_decode`,
`fa3`,
`tilelang`,
`aiter` | Special for GPU | +| `--nsa-decode-backend` | `fa3` | `flashmla_prefill`,
`flashmla_kv`,
`fa3`,
`tilelang`,
`aiter` | Special for GPU | +| `--fp8-gemm-backend` | `auto` | `auto`,
`deep_gemm`,
`flashinfer_trtllm`,
`flashinfer_cutlass`,
`flashinfer_deepgemm`,
`cutlass`,
`triton`,
`aiter` | Special for GPU | +| `--disable-flashinfer-`
`autotune` | `False` | bool flag
(set to enable) | Special for GPU | + +## Speculative decoding + +| Argument | Defaults | Options | Server supported | +|------------------------------------------------------------------|-----------|------------------------------------------------------------------|:----------------:| +| `--speculative-algorithm` | `None` | `EAGLE3`,
`NEXTN` | A2, A3 | +| `--speculative-draft-model-path`
`--speculative-draft-model` | `None` | Type: str | A2, A3 | +| `--speculative-draft-model-`
`revision` | `None` | Type: str,
`branch name`,
`tag name`,
`commit id` | A2, A3 | +| `--speculative-draft-load-format` | `auto` | `auto`,
`dummy` | A2, A3 | +| `--speculative-num-steps` | `None` | Type: int | A2, A3 | +| `--speculative-eagle-topk` | `None` | Type: int | A2, A3 | +| `--speculative-num-draft-tokens` | `None` | Type: int | A2, A3 | +| `--speculative-accept-`
`threshold-single` | `1.0` | Type: float | Special for GPU | +| `--speculative-accept-`
`threshold-acc` | `1.0` | Type: float | Special for GPU | +| `--speculative-token-map` | `None` | Type: str | A2, A3 | +| `--speculative-attention-`
`mode` | `prefill` | `prefill`,
`decode` | A2, A3 | +| `--speculative-moe-runner-`
`backend` | `None` | `auto` | A2, A3 | +| `--speculative-moe-a2a-`
`backend` | `None` | `ascend_fuseep` | A2, A3 | +| `--speculative-draft-attention-backend` | `None` | `ascend` | A2, A3 | +| `--speculative-draft-model-quantization` | `None` | `unquant` | A2, A3 | + +## Ngram speculative decoding + +| Argument | Defaults | Options | Server supported | +|----------------------------------------------------|------------|--------------------|:----------------:| +| `--speculative-ngram-`
`min-match-window-size` | `1` | Type: int | Experimental | +| `--speculative-ngram-`
`max-match-window-size` | `12` | Type: int | Experimental | +| `--speculative-ngram-`
`min-bfs-breadth` | `1` | Type: int | Experimental | +| `--speculative-ngram-`
`max-bfs-breadth` | `10` | Type: int | Experimental | +| `--speculative-ngram-`
`match-type` | `BFS` | `BFS`,
`PROB` | Experimental. `BFS` uses recency-based expansion; `PROB` uses frequency-based expansion. | +| `--speculative-ngram-`
`max-trie-depth` | `18` | Type: int | Experimental | +| `--speculative-ngram-`
`capacity` | `10000000` | Type: int | Experimental | + +## Expert parallelism + +| Argument | Defaults | Options | Server supported | +|-------------------------------------------------------|-----------|---------------------------------------------------------------------------|:----------------:| +| `--expert-parallel-size`
`--ep-size`
`--ep` | `1` | Type: int | A2, A3 | +| `--moe-a2a-backend` | `none` | `none`,
`deepep`,
`ascend_fuseep` (incompatible with EPLB) | A2, A3 | +| `--moe-runner-backend` | `auto` | `auto`, `triton` | A2, A3 | +| `--flashinfer-mxfp4-`
`moe-precision` | `default` | `default`,
`bf16` | Special for GPU | +| `--enable-flashinfer-`
`allreduce-fusion` | `False` | bool flag
(set to enable) | Special for GPU | +| `--deepep-mode` | `auto` | `normal`,
`low_latency`,
`auto` | A2, A3 | +| `--deepep-config` | `None` | Type: str | Special for GPU | +| `--ep-num-redundant-experts` | `0` | Type: int | A2, A3 | +| `--ep-dispatch-algorithm` | `None` | `static`,
`dynamic`,
`fake` | A2, A3 | +| `--init-expert-location` | `trivial` | `trivial`,
``,
``,
`` | A2, A3 | +| `--enable-eplb` | `False` | bool flag
(set to enable) | A2, A3 | +| `--eplb-algorithm` | `deepseek`| `auto`,
`deepseek` | A2, A3 | +| `--eplb-rebalance-num-iterations` | `1000` | Type: int | A2, A3 | +| `--eplb-rebalance-layers-`
`per-chunk` | `None` | Type: int | A2, A3 | +| `--eplb-min-rebalancing-`
`utilization-threshold` | `1.0` | Type: float | A2, A3 | +| `--expert-distribution-`
`recorder-mode` | `None` | `stat`,
`stat_approx`,
`per_pass`,
`per_token` | A2, A3 | +| `--expert-distribution-`
`recorder-buffer-size` | `None` | Type: int | A2, A3 | +| `--enable-expert-distribution-`
`metrics` | `False` | bool flag (set to enable) | A2, A3 | +| `--moe-dense-tp-size` | `None` | `1` | A2, A3 | +| `--elastic-ep-backend` | `None` | `none`, `mooncake` | Special for GPU | +| `--mooncake-ib-device` | `None` | Type: str | Special for GPU | + +## Mamba Cache + +| Argument | Defaults | Options | Server supported | +|------------------------------|-----------|-----------------------------------------------|:----------------:| +| `--max-mamba-cache-size` | `None` | Type: int | A2, A3 | +| `--mamba-ssm-dtype` | `float32` | `float32`,
`bfloat16`,
`float16` | A2, A3 | +| `--mamba-full-memory-ratio` | `0.9` | Type: float | A2, A3 | +| `--mamba-scheduler-strategy` | `auto` | `auto`,
`no_buffer`,
`extra_buffer` | A2, A3 | +| `--mamba-track-interval` | `256` | Type: int | A2, A3 | + +## Hierarchical cache + +| Argument | Defaults | Options | Server supported | +|-------------------------------------------------|-----------------|-------------------------------------------------------------------------------|:----------------:| +| `--enable-hierarchical-`
`cache` | `False` | bool flag
(set to enable).
Currently, mamba cache is not supported. | A2, A3 | +| `--hicache-ratio` | `2.0` | Type: float | A2, A3 | +| `--hicache-size` | `0` | Type: int | A2, A3 | +| `--hicache-write-policy` | `write_through` | Currently only `write_back` supported | A2, A3 | +| `--hicache-io-backend` | `kernel` | `kernel_ascend`,
`direct` | A2, A3 | +| `--hicache-mem-layout` | `layer_first` | `page_first_direct`,
`page_first_kv_split` | A2, A3 | +| `--hicache-storage-`
`backend` | `None` | `file` | A2, A3 | +| `--hicache-storage-`
`prefetch-policy` | `best_effort` | `best_effort`,
`wait_complete`,
`timeout` | Special for GPU | +| `--hicache-storage-`
`backend-extra-config` | `None` | Type: str | Special for GPU | + +## LMCache + +| Argument | Defaults | Options | Server supported | +|--------------------|----------|--------------------------------|:----------------:| +| `--enable-lmcache` | `False` | bool flag
(set to enable) | Special for GPU | + +## Offloading (must be used with `--disable-cuda-graph`) + +| Argument | Defaults | Options | Server supported | +|---------------------------|----------|-----------|:----------------:| +| `--cpu-offload-gb` | `0` | Type: int | A2, A3 | +| `--offload-group-size` | `-1` | Type: int (DeepSeek only) | A2, A3 | +| `--offload-num-in-group` | `1` | Type: int (DeepSeek only) | A2, A3 | +| `--offload-prefetch-step` | `1` | Type: int (DeepSeek only) | A2, A3 | +| `--offload-mode` | `cpu` | `cpu` (DeepSeek only)
`meta` (DeepSeek only)
`sharded_gpu` (DeepSeek only; supports only tp=1 with dp>1) | A2, A3 | + +## Args for multi-item scoring + +| Argument | Defaults | Options | Server supported | +|----------------------------------|----------|-----------|:----------------:| +| `--multi-item-scoring-delimiter` | `None` | Type: int | A2, A3 | + +## Optimization/debug options + +| Argument | Defaults | Options | Server supported | +|---------------------------------------------------------|----------|----------------------------------------------------------------------------------------------------------------------|:----------------:| +| `--disable-radix-cache` | `False` | bool flag
(set to enable) | A2, A3 | +| `--cuda-graph-max-bs` | `None` | Type: int | A2, A3 | +| `--cuda-graph-bs` | `None` | List[int] | A2, A3 | +| `--disable-cuda-graph` | `False` | bool flag
(set to enable) | A2, A3 | +| `--disable-cuda-graph-`
`padding` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-profile-`
`cuda-graph` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-cudagraph-gc` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-nccl-nvls` | `False` | bool flag
(set to enable) | Special for GPU | +| `--enable-symm-mem` | `False` | bool flag
(set to enable) | Special for GPU | +| `--disable-flashinfer-`
`cutlass-moe-fp4-allgather` | `False` | bool flag
(set to enable) | Special for GPU | +| `--enable-tokenizer-`
`batch-encode` | `False` | bool flag
(set to enable) | A2, A3 | +| `--disable-tokenizer-`
`batch-decode` | `False` | bool flag
(set to enable) | A2, A3 | +| `--disable-custom-`
`all-reduce` | `False` | bool flag
(set to enable) | Special for GPU | +| `--enable-mscclpp` | `False` | bool flag
(set to enable) | Special for GPU | +| `--enable-torch-`
`symm-mem` | `False` | bool flag
(set to enable) | Special for GPU | +| `--disable-overlap`
`-schedule` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-mixed-`
`chunk` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-dp-attention` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-dp-lm-head` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-two-`
`batch-overlap` | `False` | bool flag
(set to enable) | Planned | +| `--enable-single-`
`batch-overlap` | `False` | bool flag
(set to enable) | A2, A3 | +| `--tbo-token-`
`distribution-threshold` | `0.48` | Type: float | Planned | +| `--enable-torch-`
`compile` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-torch-`
`compile-debug-mode` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enforce-piecewise-`
`cuda-graph` | `False` | bool flag
(set to enable);
Currently, Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct models are supported. | A2, A3 | +| `--piecewise-cuda-`
`graph-tokens` | `None` | Type: JSON
list | A2, A3 | +| `--piecewise-cuda-`
`graph-compiler` | `eager` | `eager` | A2, A3 | +| `--torch-compile-max-bs` | `32` | Type: int | A2, A3 | +| `--piecewise-cuda-`
`graph-max-tokens` | `None` | Type: int | A2, A3 | +| `--torchao-config` | `` | Type: str | Special for GPU | +| `--enable-nan-detection` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-p2p-check` | `False` | bool flag
(set to enable) | Special for GPU | +| `--triton-attention-`
`reduce-in-fp32` | `False` | bool flag
(set to enable) | Special for GPU | +| `--triton-attention-`
`num-kv-splits` | `8` | Type: int | Special for GPU | +| `--triton-attention-`
`split-tile-size` | `None` | Type: int | Special for GPU | +| `--delete-ckpt-`
`after-loading` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-memory-saver` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-weights-`
`cpu-backup` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-draft-weights-`
`cpu-backup` | `False` | bool flag
(set to enable) | A2, A3 | +| `--allow-auto-truncate` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-custom-`
`logit-processor` | `False` | bool flag
(set to enable) | A2, A3 | +| `--flashinfer-mla-`
`disable-ragged` | `False` | bool flag
(set to enable) | Special for GPU | +| `--disable-shared-`
`experts-fusion` | `True` | bool flag
(set to enable) | A2, A3 | +| `--disable-chunked-`
`prefix-cache` | `True` | bool flag
(set to enable) | A2, A3 | +| `--disable-fast-`
`image-processor` | `False` | bool flag
(set to enable) | A2, A3 | +| `--keep-mm-feature-`
`on-device` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-return-`
`hidden-states` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-return-`
`routed-experts` | `False` | bool flag
(set to enable) | A2, A3 | +| `--scheduler-recv-`
`interval` | `1` | Type: int | A2, A3 | +| `--numa-node` | `None` | List[int] | A2, A3 | +| `--enable-deterministic-`
`inference` | `False` | bool flag
(set to enable) | Planned | +| `--rl-on-policy-target` | `None` | `fsdp` | Planned | +| `--enable-layerwise-`
`nvtx-marker` | `False` | bool flag
(set to enable) | Special for GPU | +| `--enable-attn-tp-`
`input-scattered` | `False` | bool flag
(set to enable) | Experimental | +| `--enable-nsa-prefill-`
`context-parallel` | `False` | bool flag
(set to enable) | A2, A3 | +| `--enable-fused-qk-`
`norm-rope` | `False` | bool flag
(set to enable) | Special for GPU | + +## Dynamic batch tokenizer + +| Argument | Defaults | Options | Server supported | +|--------------------------------------------------|----------|--------------------------------|:----------------:| +| `--enable-dynamic-`
`batch-tokenizer` | `False` | bool flag
(set to enable) | A2, A3 | +| `--dynamic-batch-`
`tokenizer-batch-size` | `32` | Type: int | A2, A3 | +| `--dynamic-batch-`
`tokenizer-batch-timeout` | `0.002` | Type: float | A2, A3 | + +## Debug tensor dumps + +| Argument | Defaults | Options | Server supported | +|--------------------------------------------|----------|-----------|:----------------:| +| `--debug-tensor-dump-`
`output-folder` | `None` | Type: str | A2, A3 | +| `--debug-tensor-dump-`
`layers` | `None` | List[int] | A2, A3 | +| `--debug-tensor-dump-`
`input-file` | `None` | Type: str | A2, A3 | + +## PD disaggregation + +| Argument | Defaults | Options | Server supported | +|---------------------------------------------------------|------------|---------------------------------------|:----------------:| +| `--disaggregation-mode` | `null` | `null`,
`prefill`,
`decode` | A2, A3 | +| `--disaggregation-transfer-backend` | `mooncake` | `ascend` | A2, A3 | +| `--disaggregation-bootstrap-port` | `8998` | Type: int | A2, A3 | +| `--disaggregation-ib-device` | `None` | Type: str | Special for GPU | +| `--disaggregation-decode-`
`enable-offload-kvcache` | `False` | `False` | A2, A3 | +| `--num-reserved-decode-tokens` | `512` | Type: int | A2, A3 | +| `--disaggregation-decode-`
`polling-interval` | `1` | Type: int | A2, A3 | + +## Encode prefill disaggregation + +| Argument | Defaults | Options | Server supported | +| --------------------------------------- | ------------------ | ---------------------------------------------------------------- |:----------------:| +| `--enable-adaptive-dispatch-to-encoder` | `False` | bool flag
(set to enable adaptive dispatch) | A2, A3 | +| `--encoder-only` | `False` | bool flag
(set to launch an encoder-only server) | A2, A3 | +| `--language-only` | `False` | bool flag
(set to load weights for the language model only) | A2, A3 | +| `--encoder-transfer-backend` | `zmq_to_scheduler` | `zmq_to_scheduler`,
`zmq_to_tokenizer`,
`mooncake` | A2, A3 | +| `--encoder-urls` | `[]` | List[str]
(List of encoder server urls) | A2, A3 | + +## Custom weight loader + +| Argument | Defaults | Options | Server supported | +|-------------------------------------------------------------------------|----------|---------------------------------|:----------------:| +| `--custom-weight-loader` | `None` | List[str] | A2, A3 | +| `--weight-loader-disable-`
`mmap` | `False` | bool flag
(set to enable) | A2, A3 | +| `--remote-instance-weight-`
`loader-seed-instance-ip` | `None` | Type: str | A2, A3 | +| `--remote-instance-weight-`
`loader-seed-instance-service-port` | `None` | Type: int | A2, A3 | +| `--remote-instance-weight-`
`loader-send-weights-group-ports` | `None` | Type: JSON
list | A2, A3 | +| `--remote-instance-weight-`
`loader-backend` | `nccl` | `transfer_engine`,
`nccl` | A2, A3 | +| `--remote-instance-weight-`
`loader-start-seed-via-transfer-engine` | `False` | bool flag
(set to enable) | Special for GPU | + +## For PD-Multiplexing + +| Argument | Defaults | Options | Server supported | +|-----------------------|----------|--------------------------------|:----------------:| +| `--enable-pdmux` | `False` | bool flag
(set to enable) | Special for GPU | +| `--pdmux-config-path` | `None` | Type: str | Special for GPU | +| `--sm-group-num` | `8` | Type: int | Special for GPU | + +## For Multi-Modal + +| Argument | Defaults | Options | Server supported | +|-----------------------------------------------|----------|--------------------------------|:----------------:| +| `--enable-broadcast-mm-`
`inputs-process` | `False` | bool flag
(set to enable) | A2, A3 | +| `--mm-process-config` | `None` | Type: JSON / Dict | A2, A3 | +| `--mm-enable-dp-encoder` | `False` | bool flag
(set to enable) | A2, A3 | +| `--limit-mm-data-per-request` | `None` | Type: JSON / Dict | A2, A3 | + +## For checkpoint decryption + +| Argument | Defaults | Options | Server supported | +|---------------------------------|----------|--------------------------------|:----------------:| +| `--decrypted-config-file` | `None` | Type: str | A2, A3 | +| `--decrypted-draft-config-file` | `None` | Type: str | A2, A3 | +| `--enable-prefix-mm-cache` | `False` | bool flag
(set to enable) | A2, A3 | + +## Forward hooks + +| Argument | Defaults | Options | Server supported | +|-------------------|----------|-----------------|:----------------:| +| `--forward-hooks` | `None` | Type: JSON list | A2, A3 | + +## Configuration file support + +| Argument | Defaults | Options | Server supported | +|------------|----------|-----------|:----------------:| +| `--config` | `None` | Type: str | A2, A3 | + +## Other Params + +The following parameters are not supported because the third-party components they depend on +(such as KTransformers and checkpoint-engine) are not compatible with the NPU. + +| Argument | Defaults | Options | +|-------------------------------------------------------------------|-----------|---------------------------| +| `--checkpoint-engine-`
`wait-weights-`
`before-ready` | `False` | bool flag (set to enable) | +| `--kt-weight-path` | `None` | Type: str | +| `--kt-method` | `AMXINT4` | Type: str | +| `--kt-cpuinfer` | `None` | Type: int | +| `--kt-threadpool-count` | `2` | Type: int | +| `--kt-num-gpu-experts` | `None` | Type: int | +| `--kt-max-deferred-`
`experts-per-token` | `None` | Type: int | + +The following parameters have some functional deficiencies on community + +| Argument | Defaults | Options | +|---------------------------------------|----------|--------------------------------| +| `--tool-server` | `None` | Type: str | diff --git a/docs/platforms/ascend_npu_support_models.md b/docs/platforms/ascend/ascend_npu_support_models.md similarity index 81% rename from docs/platforms/ascend_npu_support_models.md rename to docs/platforms/ascend/ascend_npu_support_models.md index 11a7b77c181e..b1ee29fb4a28 100644 --- a/docs/platforms/ascend_npu_support_models.md +++ b/docs/platforms/ascend/ascend_npu_support_models.md @@ -9,17 +9,17 @@ You are welcome to enable various models based on your business requirements. | Models | Model Family | A2 Supported | A3 Supported | |--------------------------------------------|--------------------------------|:----------------------------------------:|:----------------------------------------:| | DeepSeek V3/V3.1 | DeepSeek | **** | **** | -| vllm-ascend/DeepSeek-V3.2-Exp-W8A8 | DeepSeek | **** | **** | -| vllm-ascend/DeepSeek-R1-0528-W8A8 | DeepSeek | **** | **** | -| vllm-ascend/DeepSeek-V2-Lite-W8A8 | DeepSeek | **** | **** | +| DeepSeek-V3.2-W8A8 | DeepSeek | **** | **** | +| DeepSeek-R1-0528-W8A8 | DeepSeek | **** | **** | +| DeepSeek-V2-Lite-W8A8 | DeepSeek | **** | **** | | Qwen/Qwen3-30B-A3B-Instruct-2507 | Qwen | **** | **** | | Qwen/Qwen3-32B | Qwen | **** | **** | | Qwen/Qwen3-0.6B | Qwen | **** | **** | -| vllm-ascend/Qwen3-235B-A22B-W8A8 | Qwen | **** | **** | +| Qwen3-235B-A22B-W8A8 | Qwen | **** | **** | | Qwen/Qwen3-Next-80B-A3B-Instruct | Qwen | **** | **** | | Qwen3-Coder-480B-A35B-Instruct-w8a8-QuaRot | Qwen | **** | **** | | Qwen/Qwen2.5-7B-Instruct | Qwen | **** | **** | -| vllm-ascend/QWQ-32B-W8A8 | Qwen | **** | **** | +| QWQ-32B-W8A8 | Qwen | **** | **** | | meta-llama/Llama-4-Scout-17B-16E-Instruct | Llama | **** | **** | | 
AI-ModelScope/Llama-3.1-8B-Instruct | Llama | **** | **** | | LLM-Research/llama-2-7b | Llama | **** | **** | @@ -46,9 +46,14 @@ You are welcome to enable various models based on your business requirements. | AI-ModelScope/dbrx-instruct | DBRX (Databricks) | **** | **** | | baichuan-inc/Baichuan2-13B-Chat | Baichuan 2 (7B, 13B) | **** | **** | | baidu/ERNIE-4.5-21B-A3B-PT | ERNIE-4.5 (4.5, 4.5MoE series) | **** | **** | -| openbmb/MiniCPM3-4B | MiniCPM (v3, 4B) | **** | **** | -| Kimi/Kimi-K2-Thinking | Kimi | **** | **** | -| openai/gpt-oss-120b | GPTOSS | **** | **** | +| OpenBMB/MiniCPM3-4B | MiniCPM (v3, 4B) | **** | **** | +| moonshotai/Kimi-K2-Thinking | Kimi | **** | **** | +| eigen-ai-labs/gpt-oss-120b-bf16 | GPTOSS | **** | **** | +| allenai/OLMo-2-1124-7B-Instruct | OLMo | **** | **** | +| cyankiwi/MiniMax-M2-BF16 | MiniMax-M2 | **** | **** | +| upstage/SOLAR-10.7B-Instruct-v1.0 | Solar | **** | **** | +| bigcode/starcoder2-7b | StarCoder2 | **** | **** | +| arcee-ai/Trinity-Mini | Trinity (Nano, Mini) | **** | **** | ## Multimodal Language Models @@ -72,9 +77,10 @@ You are welcome to enable various models based on your business requirements. | AI-ModelScope/llava-v1.6-34b | LLaVA (v1.5 & v1.6) | **** | **** | | lmms-lab/llava-next-72b | LLaVA-NeXT (8B, 72B) | **** | **** | | lmms-lab/llava-onevision-qwen2-7b-ov | LLaVA-OneVision | **** | **** | -| Kimi/Kimi-VL-A3B-Instruct | Kimi-VL (A3B) | **** | **** | +| moonshotai/Kimi-VL-A3B-Instruct | Kimi-VL (A3B) | **** | **** | | ZhipuAI/GLM-4.5V | GLM-4.5V (106B) | **** | **** | -| meta-llama/Llama-3.2-11B-Vision-Instruct | Llama 3.2 Vision (11B) | **** | **** | +| LLM-Research/Llama-3.2-11B-Vision-Instruct | Llama 3.2 Vision (11B) | **** | **** | +| rednote-hilab/dots.ocr | DotsVLM-OCR | **** | **** | ## Embedding Models @@ -89,13 +95,13 @@ You are welcome to enable various models based on your business requirements. 
## Reward Models -| Models | Model Family | A2 Supported | A3 Supported | -|---------------------------------------------|---------------------------|------------------------------------------|:----------------------------------------:| -| Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Llama3.1 Reward | **** | **** | -| Shanghai_AI_Laboratory/internlm2-7b-reward | InternLM 2 Reward | **** | **** | -| Qwen/Qwen2.5-Math-RM-72B | Qwen2.5 Reward - Math | **** | **** | -| Howeee/Qwen2.5-1.5B-apeach | Qwen2.5 Reward - Sequence | **** | **** | -| Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Gemma 2-27B Reward | **** | **** | +| Models | Model Family | A2 Supported | A3 Supported | +|------------------------------------------------|---------------------------|------------------------------------------|:----------------------------------------:| +| Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Llama3.1 Reward | **** | **** | +| Shanghai_AI_Laboratory/internlm2-7b-reward | InternLM 2 Reward | **** | **** | +| Qwen/Qwen2.5-Math-RM-72B | Qwen2.5 Reward - Math | **** | **** | +| Howeee/Qwen2.5-1.5B-apeach | Qwen2.5 Reward - Sequence | **** | **** | +| AI-ModelScope/Skywork-Reward-Gemma-2-27B-v0.2 | Gemma 2-27B Reward | **** | **** | ## Rerank Models diff --git a/docs/platforms/ascend/mindspore_backend.md b/docs/platforms/ascend/mindspore_backend.md new file mode 100644 index 000000000000..d0df08ea3fd7 --- /dev/null +++ b/docs/platforms/ascend/mindspore_backend.md @@ -0,0 +1,151 @@ +# MindSpore Models + +## Introduction + +MindSpore is a high-performance AI framework optimized for Ascend NPUs. This doc guides users to run MindSpore models in SGLang. + +## Requirements + +MindSpore currently only supports Ascend NPU devices. Users need to first install Ascend CANN software packages. +The CANN software packages can be downloaded from the [Ascend Official Website](https://www.hiascend.com). The recommended version is 8.3.RC2. 
+ +## Supported Models + +Currently, the following models are supported: + +- **Qwen3**: Dense and MoE models +- **DeepSeek V3/R1** +- *More models coming soon...* + +## Installation + +> **Note**: Currently, MindSpore models are provided by an independent package `sgl-mindspore`. MindSpore support builds on SGLang's existing support for the Ascend NPU platform. Please first [install SGLang for Ascend NPU](ascend_npu.md) and then install `sgl-mindspore`: + +```shell +git clone https://github.com/mindspore-lab/sgl-mindspore.git +cd sgl-mindspore +pip install -e . +``` + + +## Run Model + +SGLang-MindSpore currently supports Qwen3 and DeepSeek V3/R1 models. This doc uses Qwen3-8B as an example. + +### Offline inference + +Use the following script for offline inference: + +```python +import sglang as sgl + +# Initialize the engine with MindSpore backend +llm = sgl.Engine( + model_path="/path/to/your/model", # Local model path + device="npu", # Use NPU device + model_impl="mindspore", # MindSpore implementation + attention_backend="ascend", # Attention backend + tp_size=1, # Tensor parallelism size + dp_size=1 # Data parallelism size +) + +# Generate text +prompts = [ + "Hello, my name is", + "The capital of France is", + "The future of AI is" +] + +sampling_params = {"temperature": 0, "top_p": 0.9} +outputs = llm.generate(prompts, sampling_params) + +for prompt, output in zip(prompts, outputs): + print(f"Prompt: {prompt}") + print(f"Generated: {output['text']}") + print("---") +``` + +### Start server + +Launch a server with the MindSpore backend: + +```bash +# Basic server startup +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ + --model-impl mindspore \ + --attention-backend ascend \ + --tp-size 1 \ + --dp-size 1 +``` + +For a distributed server across multiple nodes: + +```bash +# Multi-node distributed server +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ +
--model-impl mindspore \ + --attention-backend ascend \ + --dist-init-addr 127.0.0.1:29500 \ + --nnodes 2 \ + --node-rank 0 \ + --tp-size 4 \ + --dp-size 2 +``` + +## Troubleshooting + +#### Debug Mode + +Enable SGLang debug logging with the `--log-level` argument. + +```bash +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ + --model-impl mindspore \ + --attention-backend ascend \ + --log-level DEBUG +``` + +Enable MindSpore INFO or DEBUG logging by setting the `GLOG_v` environment variable. + +```bash +export GLOG_v=1 # INFO +export GLOG_v=0 # DEBUG +``` + +#### Explicitly select devices + +Use the following environment variable to explicitly select the devices to use. + +```shell +export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7 # to set device +``` + +#### Some communication environment issues + +In environments with a special communication setup, users need to set the following environment variable. + +```shell +export MS_ENABLE_LCCL=off # LCCL communication mode is not currently supported in SGLang-MindSpore +``` + +#### Some dependencies of protobuf + +In environments with a special protobuf version, users need to set the following environment variable to avoid a binary version mismatch. + +```shell +export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python # to avoid protobuf binary version mismatch +``` + +## Support +For MindSpore-specific issues: + +- Refer to the [MindSpore documentation](https://www.mindspore.cn/) diff --git a/docs/platforms/ascend_npu_best_practice.md b/docs/platforms/ascend_npu_best_practice.md deleted file mode 100644 index 639b343f1520..000000000000 --- a/docs/platforms/ascend_npu_best_practice.md +++ /dev/null @@ -1,2440 +0,0 @@ -# Best Practice on Ascend NPU - -This section describes the best practice data of mainstream LLM models such as DeepSeek and Qwen on the Ascend Npu. If -you encounter issues or have any questions, please [open an issue](https://github.com/sgl-project/sglang/issues).
- -## DeepSeek Series Models - -### Low Latency - -| Model | Hardware | CardNum | Deploy Mode | Dataset | Quantization | Configuration | -|---------------|---------------|---------|---------------|-----------|--------------|----------------------------------------------------------| -| Deepseek-R1 | Atlas 800I A3 | 32 | PD Separation | 6K-1.6K | W8A8 | [Optimal Configuration](#deepseek-r1-low-latency-20ms-1) | -| Deepseek-R1 | Atlas 800I A3 | 32 | PD Separation | 3.9K-1K | W8A8 | [Optimal Configuration](#deepseek-r1-low-latency-20ms-2) | -| Deepseek-R1 | Atlas 800I A3 | 32 | PD Separation | 3.5K-1.5K | W8A8 | [Optimal Configuration](#deepseek-r1-low-latency-20ms-3) | -| Deepseek-R1 | Atlas 800I A3 | 32 | PD Separation | 3.5K-1K | W8A8 | [Optimal Configuration](#deepseek-r1-low-latency-20ms-4) | -| Deepseek-V3.2 | Atlas 800I A3 | 32 | PD Separation | 64K-3K | W8A8 | [Optimal Configuration](#deepseek-v32-low-latency-30ms) | - -### High Throughput - -| Model | Hardware | CardNum | Deploy Mode | Dataset | Quantization | Configuration | -|-------------|---------------|---------|---------------|-----------|--------------|---------------------------------------------------------------| -| Deepseek-R1 | Atlas 800I A3 | 32 | PD Separation | 3.5K-1.5K | W8A8 | [Optimal Configuration](#deepseek-r1-high-performance-50ms-1) | -| Deepseek-R1 | Atlas 800I A3 | 8 | PD Mixed | 2K-2K | W4A8 | [Optimal Configuration](#deepseek-r1-high-performance-50ms-2) | -| Deepseek-R1 | Atlas 800I A3 | 16 | PD Separation | 2K-2K | W4A8 | [Optimal Configuration](#deepseek-r1-high-performance-50ms-3) | -| Deepseek-R1 | Atlas 800I A3 | 8 | PD Mixed | 3.5K-1.5K | W4A8 | [Optimal Configuration](#deepseek-r1-high-performance-50ms-4) | -| Deepseek-R1 | Atlas 800I A3 | 16 | PD Separation | 3.5K-1.5K | W4A8 | [Optimal Configuration](#deepseek-r1-high-performance-50ms-5) | - -## Qwen Series Models - -### Low Latency - -| Model | Hardware | CardNum | Deploy Mode | Dataset | Quantization | Configuration | 
-|------------|---------------|---------|-------------|---------|--------------|---------------------------------------------------------| -| Qwen3-235B | Atlas 800I A3 | 8 | PD Mixed | 11K-1K | BF16 | [Optimal Configuration](#qwen3-235b-low-latency-10ms) | -| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 6K-1.5K | W8A8 | [Optimal Configuration](#qwen3-32b-low-latency-18ms) | -| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 4K-1.5K | BF16 | [Optimal Configuration](#qwen3-32b-low-latency-11ms) | -| Qwen3-32B | Atlas 800I A3 | 8 | PD Mixed | 18K-4K | BF16 | [Optimal Configuration](#qwen3-32b-low-latency-12ms) | -| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 6K-1.5K | W8A8 | [Optimal Configuration](#qwen3-32b-a2-low-latency-18ms) | -| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 4K-1.5K | BF16 | [Optimal Configuration](#qwen3-32b-a2-low-latency-11ms) | - -### High Throughput - -| Model | Hardware | CardNum | Deploy Mode | Dataset | Quantization | Configuration | -|------------|---------------|---------|---------------|-----------|--------------|---------------------------------------------------------------| -| Qwen3-235B | Atlas 800I A3 | 24 | PD Separation | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-235b-high-throughput-50ms-1) | -| Qwen3-235B | Atlas 800I A3 | 8 | PD Mixed | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-235b-high-throughput-50ms-2) | -| Qwen3-235B | Atlas 800I A3 | 8 | PD Mixed | 2K-2K | W8A8 | [Optimal Configuration](#qwen3-235b-high-throughput-50ms-3) | -| Qwen3-235B | Atlas 800I A3 | 16 | PD Mixed | 2K-2K | W8A8 | [Optimal Configuration](#qwen3-235b-high-throughput-50ms-4) | -| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-32b-high-throughput-50ms-1) | -| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 2K-2K | W8A8 | [Optimal Configuration](#qwen3-32b-high-throughput-50ms-2) | -| Qwen3-30B | Atlas 800I A3 | 1 | PD Mixed | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-32b-high-throughput-50ms-3) | 
-| Qwen3-480B | Atlas 800I A3 | 24 | PD Separation | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-480b-high-throughput-50ms-1) | -| Qwen3-480B | Atlas 800I A3 | 16 | PD Mixed | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-480b-high-throughput-50ms-2) | -| Qwen3-480B | Atlas 800I A3 | 8 | PD Mixed | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-480b-high-throughput-50ms-3) | -| Qwen3-Next | Atlas 800I A3 | 2 | PD Mixed | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-next-high-throughput-50ms) | -| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 3.5K-1.5K | W8A8 | [Optimal Configuration](#qwen3-32b-a2-high-throughput-50ms-1) | -| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 2K-2K | W8A8 | [Optimal Configuration](#qwen3-32b-a2-high-throughput-50ms-2) | - -## Optimal Configuration - -### DeepSeek R1 High Performance 50ms 1 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 32Card - -DeployMode: PD Separation - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 -export SGLANG_SET_CPU_AFFINITY=1 -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 - -export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" - -P_IP=('your prefill ip1' 'your prefill ip2') - -D_IP=('your decode ip1' 'your decode ip2') - -MODEL_PATH=xxx - -export SGLANG_NPU_USE_MLAPO=1 -export SGLANG_USE_FIA_NZ=1 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" -# prefill -for i in "${!P_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == 
"${P_IP[$i]}" ]]; - then - echo "${P_IP[$i]}" - export HCCL_BUFFSIZE=1536 - export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 - export TASK_QUEUE_ENABLE=2 - - export HCCL_SOCKET_IFNAME=lo - export GLOO_SOCKET_IFNAME=lo - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ - --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ - --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \ - --disaggregation-transfer-backend ascend --max-running-requests 8 --context-length 8192 --disable-radix-cache \ - --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ - --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ - --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered - NODE_RANK=$i - break - fi -done - -# decode -for i in "${!D_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; - then - echo "${D_IP[$i]}" - export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 - export SGLANG_ENABLE_SPEC_V2=1 - export HCCL_BUFFSIZE=650 - export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=78 - export TASK_QUEUE_ENABLE=1 - export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 - export HCCL_SOCKET_IFNAME=xxx - export GLOO_SOCKET_IFNAME=xxx - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ - --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \ - --mem-fraction-static 0.815 --max-running-requests 832 --attention-backend ascend --device npu --quantization modelslim \ - --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ - --cuda-graph-bs 12 14 16 18 20 22 24 26 
--disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ - --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ - --tokenizer-worker-num 4 --prefill-round-robin-balance --disable-shared-experts-fusion --dtype bfloat16 \ - --load-balance-method decode_round_robin - NODE_RANK=$i - break - fi -done - -``` - -```shell -export SGLANG_DP_ROUND_ROBIN=1 -python -m sglang_router.launch_router \ - --pd-disaggregation \ - --policy cache_aware \ - --prefill http://P_IP:8000 8998 \ - --prefill http://P_IP:8000 8999 \ - --decode http://D_IP:8001 \ - --host 127.0.0.1 \ - --port 6688 \ - --mini-lb -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 768 --random-input-len 3500 --random-output-len 1500 --num-prompts 3072 --random-range-ratio 1 --request-rate 16 -``` - -### DeepSeek R1 Low Latency 20ms 1 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 32Card - -DeployMode: PD Separation - -DataSets: 6K1.6K - -TPOT: 20ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 -export SGLANG_SET_CPU_AFFINITY=1 -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 -export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" - -P_IP=('your prefill ip1' 'your prefill ip2') - -D_IP=('your decode ip1' 'your decode ip2') - -MODEL_PATH=xxx - -export SGLANG_NPU_USE_MLAPO=1 -export SGLANG_USE_FIA_NZ=1 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` -echo 
"${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" -# prefill -for i in "${!P_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; - then - echo "${P_IP[$i]}" - export HCCL_BUFFSIZE=1536 - export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 - export TASK_QUEUE_ENABLE=2 - - export HCCL_SOCKET_IFNAME=lo - export GLOO_SOCKET_IFNAME=lo - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ - --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ - --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \ - --disaggregation-transfer-backend ascend --max-running-requests 4 --context-length 8192 --disable-radix-cache \ - --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ - --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ - --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered - NODE_RANK=$i - break - fi -done - -# decode -for i in "${!D_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; - then - echo "${D_IP[$i]}" - export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 - export SGLANG_ENABLE_SPEC_V2=1 - export HCCL_BUFFSIZE=650 - export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=12 - export TASK_QUEUE_ENABLE=1 - export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 - export HCCL_SOCKET_IFNAME=xxx - export GLOO_SOCKET_IFNAME=xxx - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ - --port 8001 --trust-remote-code --dist-init-addr DIP1:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 16 \ - --mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \ - --moe-a2a-backend deepep 
--enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ - --cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ - --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --tokenizer-worker-num 4 --prefill-round-robin-balance --disable-shared-experts-fusion --dtype bfloat16 \ - --load-balance-method decode_round_robin - NODE_RANK=$i - break - fi -done - -``` - -```shell -export SGLANG_DP_ROUND_ROBIN=1 -python -m sglang_router.launch_router \ - --pd-disaggregation \ - --policy cache_aware \ - --prefill http://P_IP:8000 8998 \ - --prefill http://P_IP:8000 8999 \ - --decode http://D_IP:8001 \ - --host 127.0.0.1 \ - --port 6688 \ - --mini-lb -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 32 --random-input-len 6000 --random-output-len 1600 --num-prompts 32 --random-range-ratio 1 -``` - -### DeepSeek R1 Low Latency 20ms 2 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 32Card - -DeployMode: PD Separation - -DataSets: 3.9K1K - -TPOT: 20ms - -#### Model Deployment - -Please Turn to [DeepSeek R1 Low Latency 20ms](#deepSeek-r1-low-latency-20ms-1) - -#### Benchmark - -```bash -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 768 --random-input-len 3900 --random-output-len 1000 --num-prompts 768 --random-range-ratio 1 --request-rate 16 -``` - -### DeepSeek R1 Low Latency 20ms 3 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 32Card - -DeployMode: PD Separation - -DataSets: 3.5K1.5K - -TPOT: 20ms - -#### Model Deployment - -Please Turn to [DeepSeek R1 Low Latency 20ms](#deepSeek-r1-low-latency-20ms-1) - -#### Benchmark - -```bash -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 768 
--random-input-len 3500 --random-output-len 1500 --num-prompts 768 --random-range-ratio 1 --request-rate 16 -``` - -### DeepSeek R1 Low Latency 20ms 4 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 32Card - -DeployMode: PD Separation - -DataSets: 3.5K1K - -TPOT: 20ms - -#### Model Deployment - -Please Turn to [DeepSeek R1 Low Latency 20ms](#deepSeek-r1-low-latency-20ms-1) - -#### Benchmark - -```bash -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 768 --random-input-len 3500 --random-output-len 1000 --num-prompts 768 --random-range-ratio 1 --request-rate 16 -``` - -### DeepSeek R1 High Performance 50ms 2 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 8Card - -DeployMode: PD Mixed - -DataSets: 2K2K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 -export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 - -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo - -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64 -export HCCL_BUFFSIZE=1600 -export DEEPEP_NORMAL_LONG_SEQ_ROUND=10 -export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512 - -MODEL_PATH=xxx - -export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 -export SGLANG_NPU_USE_MLAPO=1 -export SGLANG_ENABLE_SPEC_V2=1 -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_USE_FIA_NZ=1 -export ENABLE_MOE_NZ=1 - -python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ ---tp 16 \ 
---trust-remote-code \ ---attention-backend ascend \ ---device npu \ ---quantization modelslim \ ---watchdog-timeout 9000 \ ---host 127.0.0.1 --port 6699 \ ---cuda-graph-bs 4 8 16 \ ---mem-fraction-static 0.74 \ ---max-running-requests 256 \ ---disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 1500 \ ---moe-a2a-backend deepep --deepep-mode auto \ ---enable-dp-attention --dp-size 16 --enable-dp-lm-head \ ---speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ ---dtype bfloat16 - -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 256 --random-input-len 2048 --random-output-len 2048 --num-prompts 1024 --random-range-ratio 1 -``` - -### DeepSeek R1 High Performance 50ms 3 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 16Card - -DeployMode: PD Separation - -DataSets: 2K2K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 - -export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" - -P_IP=('your prefill ip1') - -D_IP=('your decode ip1') - -MODEL_PATH=xxx - -export SGLANG_NPU_USE_MLAPO=1 -export SGLANG_USE_FIA_NZ=1 -export ENABLE_MOE_NZ=1 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -# prefill -for i in 
"${!P_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; - then - echo "${P_IP[$i]}" - export HCCL_BUFFSIZE=1536 - export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 - export TASK_QUEUE_ENABLE=2 - - export HCCL_SOCKET_IFNAME=lo - export GLOO_SOCKET_IFNAME=lo - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ - --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ - --tp-size 16 --mem-fraction-static 0.6 --attention-backend ascend --device npu --quantization modelslim \ - --disaggregation-transfer-backend ascend --max-running-requests 8 --context-length 8192 --disable-radix-cache \ - --chunked-prefill-size 32768 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ - --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ - --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 - NODE_RANK=$i - break - fi -done - -# decode -for i in "${!D_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; - then - echo "${D_IP[$i]}" - export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 - export SGLANG_ENABLE_SPEC_V2=1 - export HCCL_BUFFSIZE=720 - export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96 - export TASK_QUEUE_ENABLE=1 - export HCCL_SOCKET_IFNAME=xxx - export GLOO_SOCKET_IFNAME=xxx - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ - --port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \ - --mem-fraction-static 0.8 --max-running-requests 384 --attention-backend ascend --device npu --quantization modelslim \ - --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \ - --cuda-graph-bs 8 10 12 14 16 18 20 22 24 --disaggregation-transfer-backend ascend --watchdog-timeout 
9000 --context-length 8192 \ - --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --prefill-round-robin-balance --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \ - --load-balance-method decode_round_robin - NODE_RANK=$i - break - fi -done - -``` - -```shell -export SGLANG_DP_ROUND_ROBIN=1 -python -m sglang_router.launch_router \ - --pd-disaggregation \ - --policy cache_aware \ - --prefill http://P_IP:8000 8998 \ - --decode http://D_IP:8001 \ - --host 127.0.0.1 \ - --port 6688 \ - --mini-lb -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 400 --random-input-len 2048 --random-output-len 2048 --num-prompts 3200 --random-range-ratio 1 --request-rate 8 -``` - -### DeepSeek R1 High Performance 50ms 4 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 8Card - -DeployMode: PD Mixed - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -export STREAMS_PER_DEVICE=32 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=36 -export HCCL_BUFFSIZE=1600 -export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 -export SGLANG_NPU_USE_MLAPO=1 -export SGLANG_ENABLE_SPEC_V2=1 -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export 
SGLANG_USE_FIA_NZ=1 -export ENABLE_MOE_NZ=1 - -MODEL_PATH=xxx - -python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ ---tp 16 \ ---trust-remote-code \ ---attention-backend ascend \ ---device npu \ ---quantization modelslim \ ---watchdog-timeout 9000 \ ---host 127.0.0.1 --port 6699 \ ---cuda-graph-bs 8 16 24 28 32 36 \ ---mem-fraction-static 0.71 \ ---max-running-requests 144 \ ---context-length 8188 --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 9000 \ ---moe-a2a-backend deepep --deepep-mode auto \ ---enable-dp-attention --dp-size 4 --enable-dp-lm-head \ ---speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ ---dtype bfloat16 - -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 144 --random-input-len 3500 --random-output-len 1500 --num-prompts 576 --random-range-ratio 1 -``` - -### DeepSeek R1 High Performance 50ms 5 - -Model: Deepseek R1 - -Hardware: Atlas 800I A3 16Card - -DeployMode: PD Separation - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 - -export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" - -P_IP=('your prefill ip1') - -D_IP=('your decode ip1') - -MODEL_PATH=xxx - -export SGLANG_NPU_USE_MLAPO=1 -export SGLANG_USE_FIA_NZ=1 -export ENABLE_MOE_NZ=1 - 
-LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -# prefill -for i in "${!P_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; - then - echo "${P_IP[$i]}" - export HCCL_BUFFSIZE=1536 - export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 - export TASK_QUEUE_ENABLE=2 - - export HCCL_SOCKET_IFNAME=lo - export GLOO_SOCKET_IFNAME=lo - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ - --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ - --tp-size 16 --mem-fraction-static 0.6 --attention-backend ascend --device npu --quantization modelslim \ - --disaggregation-transfer-backend ascend --max-running-requests 8 --context-length 8192 --disable-radix-cache \ - --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ - --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ - --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 - NODE_RANK=$i - break - fi -done - -# decode -for i in "${!D_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; - then - echo "${D_IP[$i]}" - export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 - export SGLANG_ENABLE_SPEC_V2=1 - export HCCL_BUFFSIZE=720 - export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96 - export TASK_QUEUE_ENABLE=1 - export HCCL_SOCKET_IFNAME=xxx - export GLOO_SOCKET_IFNAME=xxx - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ - --port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \ - --mem-fraction-static 0.8 --max-running-requests 384 --attention-backend ascend --device npu --quantization modelslim \ - --moe-a2a-backend deepep 
--enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \ - --cuda-graph-bs 8 10 12 14 16 18 20 22 24 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ - --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --prefill-round-robin-balance --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \ - --load-balance-method decode_round_robin - NODE_RANK=$i - break - fi -done - -``` - -```shell -export SGLANG_DP_ROUND_ROBIN=1 -python -m sglang_router.launch_router \ - --pd-disaggregation \ - --policy cache_aware \ - --prefill http://P_IP:8000 8998 \ - --decode http://D_IP:8001 \ - --host 127.0.0.1 \ - --port 6688 \ - --mini-lb -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 384 --random-input-len 3500 --random-output-len 1500 --num-prompts 1536 --random-range-ratio 1 -``` - -### Deepseek V32 Low Latency 30ms - -Model: Deepseek V3.2 - -Hardware: Atlas 800I A3 32Card - -DeployMode: PD Separation - -DataSets: 64K3K - -TPOT: 30ms - -#### Model Deployment - -Deploy Prefill Instance - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING - -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH} -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export 
STREAMS_PER_DEVICE=32 - -export HCCL_BUFFSIZE=1024 -export DEEPEP_NORMAL_LONG_SEQ_ROUND=5 -export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512 - -MODEL_PATH=xxx - -export SGLANG_NPU_USE_MLAPO=1 -export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 -export SGLANG_NPU_USE_MULTI_STREAM=1 -export HCCL_OP_EXPANSION_MODE=AIV - -IPs=('your prefill ip1' 'your prefill ip2') - -# get IP in current node -LOCAL_HOST=`hostname -I|awk -F " " '{print$1}'` -echo "LOCAL_HOST = " ${LOCAL_HOST} -# get node index -for i in "${!IPs[@]}"; -do - echo "LOCAL_HOST=${LOCAL_HOST}, IPs[${i}]=${IPs[$i]}" - if [ "$LOCAL_HOST" == "${IPs[$i]}" ]; then - echo "Node Rank : ${i}" - VC_TASK_INDEX=$i - break - fi -done - -IFNAMES=('xxx' 'xxx') - -export HCCL_SOCKET_IFNAME=${IFNAMES[$VC_TASK_INDEX]} -export GLOO_SOCKET_IFNAME=${HCCL_SOCKET_IFNAME} -echo "HCCL_SOCKET_IFNAME : ${HCCL_SOCKET_IFNAME}" -nnodes=${#IPs[@]} -tp_size=`expr 16 \* ${nnodes}` -export ASCEND_MF_STORE_URL=tcp://${IPs[0]}:24667 - -python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ ---tp $tp_size \ ---trust-remote-code \ ---attention-backend ascend \ ---device npu \ ---watchdog-timeout 9000 \ ---host ${IPs[$VC_TASK_INDEX]} --port 8000 \ ---mem-fraction-static 0.73 \ ---disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 68000 \ ---max-running-requests 1 \ ---moe-a2a-backend deepep --deepep-mode normal \ ---quantization modelslim \ ---disaggregation-transfer-backend ascend \ ---disaggregation-mode prefill \ ---disable-cuda-graph \ ---nnodes $nnodes --node-rank $VC_TASK_INDEX \ ---disaggregation-bootstrap-port 8995 \ ---enable-nsa-prefill-context-parallel --moe-dense-tp-size 1 \ ---speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ ---dist-init-addr ${IPs[0]}:10000 -``` - -Deploy Decode Instance - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w 
kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH} -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH -export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 - -MODEL_PATH=xxx - -export SGLANG_NPU_USE_MULTI_STREAM=1 -export SGLANG_NPU_USE_MLAPO=1 -export HCCL_OP_EXPANSION_MODE=AIV -export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 -export TASK_QUEUE_ENABLE=0 -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -IPs=('your decode ip1' 'your decode ip2') - -export prefill_ip=your prefill ip1 -# get IP in current node -LOCAL_HOST=`hostname -I|awk -F " " '{print$1}'` -echo "LOCAL_HOST = " ${LOCAL_HOST} -# get node index -for i in "${!IPs[@]}"; -do - echo "LOCAL_HOST=${LOCAL_HOST}, IPs[${i}]=${IPs[$i]}" - if [ "$LOCAL_HOST" == "${IPs[$i]}" ]; then - echo "Node Rank : ${i}" - VC_TASK_INDEX=$i - break - fi -done - -IFNAMES=('xxx' 'xxx') - -export HCCL_SOCKET_IFNAME=${IFNAMES[$VC_TASK_INDEX]} -export GLOO_SOCKET_IFNAME=${HCCL_SOCKET_IFNAME} -nnodes=${#IPs[@]} -tp_size=`expr 16 \* ${nnodes}` -export ASCEND_MF_STORE_URL=tcp://${prefill_ip}:24667 - -CHUNKED_SIZE=65536 -DP=8 -export HCCL_BUFFSIZE=400 -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=8 - -python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ ---tp $tp_size \ ---dp ${DP} \ ---ep $tp_size \ ---moe-dense-tp-size 1 \ ---enable-dp-attention \ ---enable-dp-lm-head \ ---trust-remote-code \ ---attention-backend ascend \ ---device npu \ ---watchdog-timeout 9000 \ ---host ${IPs[$VC_TASK_INDEX]} --port 8001 \ ---mem-fraction-static 0.79 \ 
---disable-radix-cache \ ---chunked-prefill-size -1 --max-prefill-tokens 68000 \ ---max-running-requests 32 \ ---cuda-graph-max-bs 4 \ ---moe-a2a-backend deepep \ ---deepep-mode low_latency \ ---quantization modelslim \ ---speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ ---disaggregation-transfer-backend ascend \ ---disaggregation-mode decode \ ---prefill-round-robin-balance \ ---load-balance-method round_robin \ ---nnodes $nnodes --node-rank $VC_TASK_INDEX \ ---dist-init-addr ${IPs[0]}:10000 --load-balance-method decode_round_robin -``` - -```shell -export SGLANG_DP_ROUND_ROBIN=1 -python -m sglang_router.launch_router \ - --pd-disaggregation \ - --policy cache_aware \ - --prefill http://PIP1:8000 8998 \ - --prefill http://PIP2:8000 8999 \ - --decode http://DIP1:8001 \ - --host 127.0.0.1 \ - --port 6688 \ - --mini-lb -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 32 --random-input-len 64000 --random-output-len 3000 --num-prompts 64 --random-range-ratio 1 -``` - -### Qwen3 235B High Throughput 50ms 1 - -Model: Qwen3 235B - -Hardware: Atlas 800I A3 24Card - -DeployMode: PD Separation - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING - -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16 - -MODEL_PATH=xxx -export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" -P_IP=('your prefill ip1') -D_IP=('your 
decode ip1' 'your decode ip2') -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 -export SGLANG_DP_ROUND_ROBIN=1 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - - -for i in "${!P_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; - then - echo "${P_IP[$i]}" - source /usr/local/Ascend/ascend-toolkit/set_env.sh - source /usr/local/Ascend/nnal/atb/set_env.sh - export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 - export DEEPEP_NORMAL_LONG_SEQ_ROUND=16 - export HCCL_BUFFSIZE=4300 - export TASK_QUEUE_ENABLE=2 - export HCCL_SOCKET_IFNAME=lo - export GLOO_SOCKET_IFNAME=lo - export STREAMS_PER_DEVICE=32 - export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 - - # Prefill node - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \ - --host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \ - --nnodes 1 --node-rank $i --tp-size 16 --dp-size 16 --mem-fraction-static 0.6 \ - --disable-radix-cache \ - --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --speculative-draft-model-quantization unquant \ - --max-running-requests 128 --chunked-prefill-size 262144 --max-prefill-tokens 262144 \ - --enable-dp-attention \ - --moe-a2a-backend deepep --deepep-mode normal --dtype bfloat16 - NODE_RANK=$i - break - fi -done - - -for i in "${!D_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; - then - echo "${D_IP[$i]}" - source /usr/local/Ascend/ascend-toolkit/set_env.sh - source /usr/local/Ascend/nnal/atb/set_env.sh - export
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24 - export HCCL_BUFFSIZE=512 - export HCCL_SOCKET_IFNAME=data0.3001 - export GLOO_SOCKET_IFNAME=data0.3001 - export STREAMS_PER_DEVICE=32 - - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \ - --host ${D_IP[$i]} --port 8001 --trust-remote-code \ - --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.83 --max-running-requests 768 \ - --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ - --moe-a2a-backend ascend_fuseep --cuda-graph-bs 6 8 12 15 18 20 22 24 \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-draft-model-quantization unquant \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --dist-init-addr xxx:5000 \ - --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ - --prefill-round-robin-balance --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 \ - --load-balance-method decode_round_robin - NODE_RANK=$i - break - fi -done - -``` - -```shell -export SGLANG_DP_ROUND_ROBIN=1 -python -m sglang_router.launch_router \ - --pd-disaggregation \ - --policy cache_aware \ - --prefill http://PIP:8000 8995 \ - --decode http://DIP:8001 \ - --host 127.0.0.1 \ - --port 6688 \ - --mini-lb -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang-oai --host 127.0.0.1 --port 7239 --max-concurrency 860 --random-input-len 3500 --random-output-len 1500 --num-prompts 3440 --random-range-ratio 1 -``` - -### Qwen3 235B High Throughput 50ms 2 - -Model: Qwen3 235B - -Hardware: Atlas 800I A3 8Card - -DeployMode: PD Mixed - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w 
kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=1600 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 -export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=2 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 272 --context-length 8192 --dtype bfloat16 \ - --chunked-prefill-size 32768 --max-prefill-tokens 32768 \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --speculative-draft-model-quantization unquant \ - --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.8 --cuda-graph-bs 3 4 6 8 10 12 13 14 15 16 17 - -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 272 --random-input-len 3500 --random-output-len 1500 --num-prompts 1088 --random-range-ratio 1 -``` - -### Qwen3-235B Atlas 800I 
A3-8Card PD Mixed 2K-2K 100ms - -Model: Qwen3 235B - -Hardware: Atlas 800I A3 8Card - -DeployMode: PD Mixed - -DataSets: 2K2K - -TPOT: 100ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=1200 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 576 --context-length 8192 --dtype bfloat16 \ - --chunked-prefill-size 32768 --max-prefill-tokens 458880 \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --speculative-draft-model-quantization unquant \ - --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.81 --cuda-graph-bs 8 16 20 24 32 36 - -``` - -#### 
Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 576 --random-input-len 2000 --random-output-len 2000 --num-prompts 576 --random-range-ratio 1 -``` - -### Qwen3 235B High Throughput 50ms 3 - -Model: Qwen3 235B - -Hardware: Atlas 800I A3 8Card - -DeployMode: PD Mixed - -DataSets: 2K2K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=2100 -export HCCL_SOCKET_IFNAME=xxx -export GLOO_SOCKET_IFNAME=xxx -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 480 --context-length 8192 --dtype bfloat16 \ - --chunked-prefill-size -1 --max-prefill-tokens 4096 --speculative-draft-model-quantization unquant \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 3 --speculative-eagle-topk 1 
--speculative-num-draft-tokens 4 \ - --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto \ - --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.75 --cuda-graph-bs 6 8 10 12 15 18 28 30 -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 480 --random-input-len 2048 --random-output-len 2048 --num-prompts 480 --random-range-ratio 1 -``` - -### Qwen3 235B High Throughput 50ms 4 - -Model: Qwen3 235B - -Hardware: Atlas 800I A3 16Card - -DeployMode: PD Mixed - -DataSets: 2K2K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=1600 -export HCCL_SOCKET_IFNAME=xxx -export GLOO_SOCKET_IFNAME=xxx -export HCCL_OP_EXPANSION_MODE="AIV" - -MIX_IP=('IP1' 'IP2') - -for i in "${!MIX_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]]; - then - echo "${MIX_IP[$i]}" - export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 - export SGLANG_ENABLE_SPEC_V2=1 - - python -m sglang.launch_server --model-path ${MODEL_PATH} \ - --host 127.0.0.1 --port 7439 --trust-remote-code \ - --nnodes 2 --node-rank $i 
--tp-size 32 --dp-size 32 --mem-fraction-static 0.8 --max-running-requests 768 \ - --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ - --moe-a2a-backend deepep --deepep-mode auto --cuda-graph-bs 6 8 10 12 18 24 \ - --dist-init-addr ${MIX_IP[0]}:5000 --chunked-prefill-size 131072 --max-prefill-tokens 458880 \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx --speculative-draft-model-quantization unquant \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --context-length 8192 --disable-radix-cache \ - --enable-dp-lm-head --dtype bfloat16 - NODE_RANK=$i - break - fi -done - -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 768 --random-input-len 2000 --random-output-len 2000 --num-prompts 768 --random-range-ratio 1 -``` - -### Qwen3 235B Low Latency 10ms - -Model: Qwen3 235B - -Hardware: Atlas 800I A3 8Card - -DeployMode: PD Mixed - -DataSets: 11K1K - -TPOT: 10ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export
HCCL_BUFFSIZE=1600 -export HCCL_SOCKET_IFNAME=xxx -export GLOO_SOCKET_IFNAME=xxx -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 1 --dtype bfloat16 \ - --chunked-prefill-size -1 --max-prefill-tokens 16384 --speculative-draft-model-quantization unquant \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ - --disable-radix-cache --enable-dp-lm-head \ - --tp 16 --mem-fraction-static 0.78 --cuda-graph-bs 1 - -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 1 --random-input-len 11000 --random-output-len 1000 --num-prompts 1 --random-range-ratio 1 -``` - -### Qwen3 32B Low Latency 18ms - -Model: Qwen3 32B - -Hardware: Atlas 800I A3 4Card - -DeployMode: PD Mixed - -DataSets: 6K1.5K - -TPOT: 18ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " 
'{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=xxx -export GLOO_SOCKET_IFNAME=xxx -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu \ - --max-running-requests 32 \ - --disable-radix-cache \ - --chunked-prefill-size 24576 --max-prefill-tokens 65536 \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ - --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16 - -``` - -#### Benchmark - -We tested it based on the GSM8K dataset. - -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1 -``` - -### Qwen3 32B Low Latency 11ms - -Model: Qwen3 32B - -Hardware: Atlas 800I A3 4Card - -DeployMode: PD Mixed - -DataSets: 4K1.5K - -TPOT: 11ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` 
-LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu \ - --max-running-requests 1 \ - --disable-radix-cache \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ - --chunked-prefill-size 24576 --max-prefill-tokens 65536 \ - --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16 - -``` - -#### Benchmark - -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4 -``` - -### Qwen3 32B Low Latency 12ms - -Model: Qwen3 32B - -Hardware: Atlas 800I A3 8Card - -DeployMode: PD Mixed - -DataSets: 18K4K - -TPOT: 12ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk 
-F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu \ - --max-running-requests 1 \ - --disable-radix-cache --speculative-draft-model-quantization unquant \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ - --chunked-prefill-size -1 --max-prefill-tokens 65536 \ - --tp-size 16 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16 -``` - -#### Benchmark - -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 18000 --random-input-len 4000 --num-prompts 1 -``` - -### Qwen3 32B High Throughput 50ms 1 - -Model: Qwen3 32B - -Hardware: Atlas 800I A3 2Card - -DeployMode: PD Mixed - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` 
-LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 78 \ - --disable-radix-cache --speculative-draft-model-quantization unquant \ - --chunked-prefill-size 65536 --max-prefill-tokens 65536 \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 16 32 64 68 72 78 --dtype bfloat16 -``` - -#### Benchmark - -We tested it based on the GSM8K dataset. - -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1 -``` - -### Qwen3 32B High Throughput 50ms 2 - -Model: Qwen3 32B - -Hardware: Atlas 800I A3 2Card - -DeployMode: PD Mixed - -DataSets: 2K2K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -MODEL_PATH=xxx 
- -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 120 \ - --disable-radix-cache --speculative-draft-model-quantization unquant \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --chunked-prefill-size -1 --max-prefill-tokens 49152 \ - --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16 - -``` - -#### Benchmark - -We tested it based on the GSM8K dataset. 
- -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 480 --random-range-ratio 1 -``` - -### Qwen3 32B High Throughput 50ms 3 - -Model: Qwen3 30B - -Hardware: Atlas 800I A3 1Card - -DeployMode: PD Mixed - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 -export DISABLE_EAGLE3_QUANT=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 192 \ - --disable-radix-cache \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --chunked-prefill-size -1 --max-prefill-tokens 32768 \ - --tp-size 2 --mem-fraction-static 0.86 --cuda-graph-bs 42 88 
96 132 144 156 172 178 192 --dtype bfloat16 -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 156 --random-input-len 3500 --random-output-len 1500 --num-prompts 624 --random-range-ratio 1 -``` - -### Qwen3 480B High Throughput 50ms 1 - -Model: Qwen3 480B - -Hardware: Atlas 800I A3 24Card - -DeployMode: PD Separation - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING - -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16 - -MODEL_PATH=xxx -export ASCEND_MF_STORE_URL="tcp://PIP:24667" -P_IP=('PIP') -D_IP=('DIP1' 'DIP2') -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - - -for i in "${!P_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; - then - echo "${P_IP[$i]}" - source /usr/local/Ascend/ascend-toolkit/set_env.sh - source /usr/local/Ascend/nnal/atb/set_env.sh - export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 - export DEEPEP_NORMAL_LONG_SEQ_ROUND=16 - export HCCL_BUFFSIZE=4300 - export TASK_QUEUE_ENABLE=2 - export HCCL_SOCKET_IFNAME=lo - export GLOO_SOCKET_IFNAME=lo - export STREAMS_PER_DEVICE=32 - export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 - - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \ - --host ${P_IP[$i]} --port 8000 
--disaggregation-bootstrap-port 8995 --trust-remote-code \ - --nnodes 1 --node-rank $i --tp-size 16 --dp-size 2 --mem-fraction-static 0.6 \ - --disable-radix-cache \ - --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \ - --max-running-requests 128 --chunked-prefill-size 65536 --max-prefill-tokens 262144 \ - --enable-dp-attention \ - --moe-a2a-backend deepep --deepep-mode normal --dtype bfloat16 - NODE_RANK=$i - break - fi -done - -for i in "${!D_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; - then - echo "${D_IP[$i]}" - source /usr/local/Ascend/ascend-toolkit/set_env.sh - source /usr/local/Ascend/nnal/atb/set_env.sh - export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=72 - export HCCL_BUFFSIZE=512 - export HCCL_SOCKET_IFNAME=xxx - export GLOO_SOCKET_IFNAME=xxx - export STREAMS_PER_DEVICE=32 - - python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \ - --host ${D_IP[$i]} --port 8001 --trust-remote-code \ - --nnodes 2 --node-rank $i --tp-size 32 --dp-size 4 --mem-fraction-static 0.73 --max-running-requests 384 \ - --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ - --moe-a2a-backend ascend_fuseep --cuda-graph-bs 16 32 48 56 64 72 80 88 96 \ - --dist-init-addr DIP1:5000 \ - --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ - --prefill-round-robin-balance --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 --load-balance-method decode_round_robin - NODE_RANK=$i - break - fi -done - -``` - -```shell -export SGLANG_DP_ROUND_ROBIN=1 -python -m sglang_router.launch_router \ - --pd-disaggregation \ - --policy cache_aware \ - --prefill http://PIP:8000 8995 \ - --decode http://DIP:8001 \ - --host 127.0.0.1 \ - --port 6688 \ - --mini-lb -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 
127.0.0.1 --port 7239 --max-concurrency 410 --random-input-len 3500 --random-output-len 1500 --num-prompts 1640 --random-range-ratio 1 --request-rate 8 -``` - -### Qwen3 480B High Throughput 50ms 2 - -Model: Qwen3 480B - -Hardware: Atlas 800I A3 16Card - -DeployMode: PD Mixed - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16 - -export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=1800 -export HCCL_SOCKET_IFNAME=xxx -export GLOO_SOCKET_IFNAME=xxx -export HCCL_OP_EXPANSION_MODE="AIV" - -MIX_IP=('IP1' 'IP2') - -for i in "${!MIX_IP[@]}"; -do - if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]]; - then - echo "${MIX_IP[$i]}" - - python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 2 --node-rank $i \ - --dist-init-addr 141.61.133.128:5000 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 288 --context-length 8192 --dtype bfloat16 \ - --chunked-prefill-size 114688 --max-prefill-tokens 458880 \ - --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto \ - 
--tp 32 --dp-size 4 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.7 --cuda-graph-bs 56 64 72 - NODE_RANK=$i - break - fi -done -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 288 --random-input-len 3500 --random-output-len 1500 --num-prompts 1152 --random-range-ratio 1 --request-rate 20 -``` - -### Qwen3 480B High Throughput 50ms 3 - -Model: Qwen3 480B - -Hardware: Atlas 800I A3 8Card - -DeployMode: PD Mixed - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=2100 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" - -python -m sglang.launch_server --model-path $MODEL_PATH \ ---host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ ---attention-backend ascend --device npu --quantization modelslim \ ---max-running-requests 80 --context-length 8192 --dtype bfloat16 \ ---chunked-prefill-size 28672 --max-prefill-tokens 458880 \ ---disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto 
--enable-dp-attention --enable-dp-lm-head \ ---tp 16 --dp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 16 20 24 -``` - -#### Benchmark - -```shell -python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 80 --random-input-len 3500 --random-output-len 1500 --num-prompts 320 --random-range-ratio 1 -``` - -### Qwen3 Next High Throughput 50ms - -Model: Qwen3 Next - -Hardware: Atlas 800I A3 2Card - -DeployMode: PD Mixed - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -export cann_path=/usr/local/Ascend/ascend-toolkit/latest -source /usr/local/Ascend/driver/bin/setenv.bash -source ${cann_path}/../set_env.sh -source ${cann_path}/../../nnal/atb/set_env.sh -source ${cann_path}/opp/vendors/customize/bin/set_env.bash -export ASCEND_HOME_PATH=${cann_path} -source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh - -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo - -export HCCL_OP_EXPANSION_MODE=AIV -export HCCL_ALGO="level0:NA;level1:ring" - -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=20 -export HCCL_BUFFSIZE=2000 - -python -m sglang.launch_server \ - --model-path /mnt/share/weight/Qwen3-Next-80B-A3B-Instruct-W8A8-3 \ - --host 127.0.0.1 \ - --port 6699 \ - --tp-size 4 \ - --device npu \ - --attention-backend ascend \ - --mem-fraction-static 0.685 \ - --max-running-requests 80 \ - --watchdog-timeout 3600 \ - --disable-radix-cache \ - --cuda-graph-bs 80 \ - --max-prefill-tokens 28672 --max-total-tokens 450560 \ - --moe-a2a-backend deepep --deepep-mode auto \ - --quantization modelslim \ - --chunked-prefill-size -1 -``` - -#### Benchmark - -```shell -python3 -m 
sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 80 --random-output-len 1536 --random-input-len 3584 --num-prompts 160 --random-range-ratio 1 -``` - -### Qwen3 32B A2 Low Latency 18ms - -Model: Qwen3 32B - -Hardware: Atlas 800I A2 8Card - -DeployMode: PD Mixed - -DataSets: 6K1.5K - -TPOT: 18ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 32 \ - --disable-radix-cache \ - --chunked-prefill-size 24576 --max-prefill-tokens 65536 \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ - --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16 -``` - -#### Benchmark - -We tested it 
based on the GSM8K dataset. - -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1 -``` - -### Qwen3 32B A2 Low Latency 11ms - -Model: Qwen3 32B - -Hardware: Atlas 800I A2 8Card - -DeployMode: PD Mixed - -DataSets: 4K1.5K - -TPOT: 11ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 -export DISABLE_EAGLE3_QUANT=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu \ - --max-running-requests 32 \ - --disable-radix-cache \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ - --chunked-prefill-size -1 --max-prefill-tokens 65536 \ - 
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 6 12 18 24 30 32 --dtype bfloat16 -``` - -#### Benchmark - -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4 -``` - -### Qwen3 32B A2 High Throughput 50ms 1 - -Model: Qwen3 32B - -Hardware: Atlas 800I A2 8Card - -DeployMode: PD Mixed - -DataSets: 3.5K1.5K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 - -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 78 \ - --disable-radix-cache --speculative-draft-model-quantization unquant \ - --chunked-prefill-size -1 --max-prefill-tokens 65536 \ - --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - 
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ - --tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 8 16 32 64 68 72 78 --dtype bfloat16 --base-gpu-id 4 -``` - -#### Benchmark - -We tested it based on the GSM8K dataset. - -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1 -``` - -### Qwen3 32B A2 High Throughput 50ms 2 - -Model: Qwen3 32B - -Hardware: Atlas 800I A2 8Card - -DeployMode: PD Mixed - -DataSets: 2K2K - -TPOT: 50ms - -#### Model Deployment - -```shell -echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -sysctl -w vm.swappiness=0 -sysctl -w kernel.numa_balancing=0 -sysctl -w kernel.sched_migration_cost_ns=50000 - -export SGLANG_SET_CPU_AFFINITY=1 -unset https_proxy -unset http_proxy -unset HTTPS_PROXY -unset HTTP_PROXY -unset ASCEND_LAUNCH_BLOCKING -source /usr/local/Ascend/ascend-toolkit/set_env.sh -source /usr/local/Ascend/nnal/atb/set_env.sh -source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash -export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH - -MODEL_PATH=xxx - -export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 - -LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` -LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` - -echo "${LOCAL_HOST1}" -echo "${LOCAL_HOST2}" - -export HCCL_BUFFSIZE=400 -export HCCL_SOCKET_IFNAME=lo -export GLOO_SOCKET_IFNAME=lo -export HCCL_OP_EXPANSION_MODE="AIV" -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 -export DISABLE_EAGLE3_QUANT=1 - -python -m sglang.launch_server --model-path $MODEL_PATH \ - --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ - --attention-backend ascend --device npu --quantization modelslim \ - --max-running-requests 120 \ - --disable-radix-cache \ - 
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ - --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \ - --chunked-prefill-size -1 --max-prefill-tokens 49152 --base-gpu-id 4 \ - --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16 -``` - -#### Benchmark - -We tested it based on the GSM8K dataset. - -```shell -python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 120 --random-range-ratio 1 -``` diff --git a/docs/platforms/ascend_npu_quantization.md b/docs/platforms/ascend_npu_quantization.md deleted file mode 100644 index 4c40fde6e170..000000000000 --- a/docs/platforms/ascend_npu_quantization.md +++ /dev/null @@ -1,21 +0,0 @@ -Quantization on Ascend. - -To load already quantized models, simply load the model weights and config. Again, if the model has been quantized offline, there's no need to add `--quantization` argument when starting the engine. The quantization method will be automatically parsed from the downloaded `quant_model_description.json` or `config.json` config. 
- -[ModelSlim on Ascend support](https://github.com/sgl-project/sglang/pull/14504): -- [x] W4A4 dynamic linear -- [x] W8A8 static linear -- [x] W8A8 dynamic linear -- [x] W4A8 dynamic MOE -- [x] W8A8 dynamic MOE - -[AWQ on Ascend support](https://github.com/sgl-project/sglang/pull/10158): -- [x] W4A16 linear -- [x] W8A16 linear # Need to test -- [x] W4A16 MOE # Need to test - -Compressed-tensors (LLM Compressor) on Ascend support: -- [x] [W4A8 dynamic MOE with/without activation clip](https://github.com/sgl-project/sglang/pull/14736) # Need to test -- [x] [W4A16 MOE](https://github.com/sgl-project/sglang/pull/12759) -- [x] [W8A8 dynamic linear](https://github.com/sgl-project/sglang/pull/14504) -- [x] [W8A8 dynamic MOE](https://github.com/sgl-project/sglang/pull/14504) diff --git a/docs/platforms/ascend_npu_qwen3_examples.md b/docs/platforms/ascend_npu_qwen3_examples.md deleted file mode 100644 index 958ad8c97398..000000000000 --- a/docs/platforms/ascend_npu_qwen3_examples.md +++ /dev/null @@ -1,118 +0,0 @@ -## Qwen3 examples - -### Running Qwen3 - -#### Running Qwen3-32B on 1 x Atlas 800I A3. - -Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-32B) - -```shell -export SGLANG_SET_CPU_AFFINITY=1 -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 -export HCCL_BUFFSIZE=1536 -export HCCL_OP_EXPANSION_MODE=AIV - -python -m sglang.launch_server \ - --device npu \ - --attention-backend ascend \ - --trust-remote-code \ - --tp-size 4 \ - --model-path Qwen/Qwen3-32B \ - --mem-fraction-static 0.8 -``` - -#### Running Qwen3-32B on 1 x Atlas 800I A3 with Qwen3-32B-Eagle3. 
- -Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-32B) - -Speculative model weights could be found [here](https://huggingface.co/Zhihu-ai/Zhi-Create-Qwen3-32B-Eagle3) - -```shell -export SGLANG_SET_CPU_AFFINITY=1 -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 -export HCCL_OP_EXPANSION_MODE=AIV -export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 -export SGLANG_ENABLE_SPEC_V2=1 - -python -m sglang.launch_server \ - --device npu \ - --attention-backend ascend \ - --trust-remote-code \ - --tp-size 4 \ - --model-path Qwen/Qwen3-32B \ - --mem-fraction-static 0.8 \ - --speculative-algorithm EAGLE3 \ - --speculative-draft-model-path Qwen/Qwen3-32B-Eagle3 \ - --speculative-num-steps 1 \ - --speculative-eagle-topk 1 \ - --speculative-num-draft-tokens 2 -``` - -#### Running Qwen3-30B-A3B MOE on 1 x Atlas 800I A3. - -Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-30B-A3B) - -```shell -export SGLANG_SET_CPU_AFFINITY=1 -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 -export HCCL_BUFFSIZE=1536 -export HCCL_OP_EXPANSION_MODE=AIV -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32 -export SGLANG_DEEPEP_BF16_DISPATCH=1 -export ENABLE_ASCEND_MOE_NZ=1 - -python -m sglang.launch_server \ - --device npu \ - --attention-backend ascend \ - --trust-remote-code \ - --tp-size 4 \ - --model-path Qwen/Qwen3-30B-A3B \ - --mem-fraction-static 0.8 -``` - -#### Running Qwen3-235B-A22B-Instruct-2507 MOE on 1 x Atlas 800I A3. 
- -Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) - -```shell -export SGLANG_SET_CPU_AFFINITY=1 -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 -export HCCL_BUFFSIZE=1536 -export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32 -export SGLANG_DEEPEP_BF16_DISPATCH=1 -export ENABLE_ASCEND_MOE_NZ=1 - -python -m sglang.launch_server \ - --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \ - --tp-size 16 \ - --trust-remote-code \ - --attention-backend ascend \ - --device npu \ - --watchdog-timeout 9000 \ - --mem-fraction-static 0.8 -``` - -#### Running Qwen3-VL-8B-Instruct on 1 x Atlas 800I A3. - -Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) - -```shell -export SGLANG_SET_CPU_AFFINITY=1 -export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True -export STREAMS_PER_DEVICE=32 -export HCCL_BUFFSIZE=1536 -export HCCL_OP_EXPANSION_MODE=AIV - -python -m sglang.launch_server \ - --enable-multimodal \ - --attention-backend ascend \ - --mm-attention-backend ascend_attn \ - --trust-remote-code \ - --tp-size 4 \ - --model-path Qwen/Qwen3-VL-8B-Instruct \ - --mem-fraction-static 0.8 -``` diff --git a/docs/platforms/ascend_npu_ring_sp_performance.md b/docs/platforms/ascend_npu_ring_sp_performance.md new file mode 100644 index 000000000000..014328aefa4f --- /dev/null +++ b/docs/platforms/ascend_npu_ring_sp_performance.md @@ -0,0 +1,55 @@ +# Ascend NPU Ring-SP Performance (Wan2.1-T2V-1.3B) + +This page reports Ring-SP performance on Ascend NPU with `torch_npu==2.10.0`. 
+ +- Baseline config: `ulysses=1, ring=1` (short: `u1r1`) +- Ring-SP config: `ulysses=1, ring=2` (short: `u1r2`) + +## Benchmark Setup + +- Model: `Wan2.1-T2V-1.3B-Diffusers` +- Prompt: `"a cat is playing piano"` +- Framework command: `sglang generate` +- Runtime: `torch_npu==2.10.0` + +## Generate Commands + +### Baseline (`u1r1`) + +```bash +sglang generate --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \ + --prompt "a cat is playing piano" --num-gpus 1 --ring-degree 1 \ + --save-output +``` + +### Ring-SP (`u1r2`) + +```bash +sglang generate --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \ + --prompt "a cat is playing piano" --num-gpus 2 --ring-degree 2 \ + --save-output +``` + +## Benchmarks + +Benchmark Disclaimer + +These numbers are from one fixed setup and one prompt case. Actual performance may vary by model settings, environment, and workload. + +### Stage Time Breakdown + +| Stage / Metric | `u1r2` (s) | `u1r1` baseline (s) | Speedup | +|---|---:|---:|---:| +| InputValidation | 0.0003 | 0.0002 | 0.67x | +| TextEncoding | 3.5936 | 3.5820 | 1.00x | +| LatentPreparation | 0.0007 | 0.0055 | 7.86x | +| TimestepPreparation | 0.0008 | 0.0007 | 0.88x | +| Denoising | 121.2788 | 239.2580 | 1.97x | +| Decoding | 13.8685 | 16.4969 | 1.19x | +| **Total (Pixel data generated)** | **141.86** | **266.50** | **1.88x** | + +## Summary + +- With `torch_npu==2.10.0`, Ring-SP (`u1r2`) runs successfully on NPU for this case. +- End-to-end generation time improves from `266.50s` to `141.86s` (`1.88x`). +- The main gain comes from `DenoisingStage` (`1.97x`), while decoding also improves (`1.19x`). diff --git a/docs/platforms/ascend_npu_support.rst b/docs/platforms/ascend_npu_support.rst deleted file mode 100644 index 7bf9726abe57..000000000000 --- a/docs/platforms/ascend_npu_support.rst +++ /dev/null @@ -1,12 +0,0 @@ -Ascend NPUs -=============================================================== - -.. 
toctree:: - :maxdepth: 1 - - ascend_npu.md - ascend_npu_support_models.md - ascend_npu_deepseek_example.md - ascend_npu_qwen3_examples.md - ascend_contribution_guide.md - ascend_npu_best_practice.md diff --git a/docs/platforms/cpu_server.md b/docs/platforms/cpu_server.md index 6d6cce83cd70..b954163e5643 100644 --- a/docs/platforms/cpu_server.md +++ b/docs/platforms/cpu_server.md @@ -123,7 +123,6 @@ cp pyproject_cpu.toml pyproject.toml # Install SGLang dependent libs, and build SGLang main package uv pip install --upgrade pip setuptools uv pip install . -uv pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 torchao==0.14.1 triton==3.5.0 --force-reinstall # Build the CPU backend kernels cd ../sgl-kernel @@ -187,20 +186,37 @@ Notes: 2. The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6). The number of TP specified is how many TP ranks will be used during the execution. On a CPU platform, a TP rank means a sub-NUMA cluster (SNC). - Usually we can get the SNC information (How many available) from the Operating System. - Users can specify TP to be no more than the total available SNCs in current system. + Usually we can get the SNC information (How many available) from the Operating System with e.g. `lscpu` command. If the specified TP rank number differs from the total SNC count, the system will automatically utilize the first `n` SNCs. Note that `n` cannot exceed the total SNC number, doing so will result in an error. - To specify the cores to be used, we need to explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`. - For example, if we want to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server, + `SGLANG_CPU_OMP_THREADS_BIND` allows explicit control of CPU cores for each tensor parallel (TP) rank. 
+ + **Example 1**: Run SGLang service with TP=6, using the first 40 cores of each SNC on a Xeon® 6980P server, which has 43-43-42 cores on the 3 SNCs of a socket, we should set: ```bash export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253" ``` + This configuration is equivalent to: + - rank 0: `numactl -C 0-39 -m 0` + - rank 1: `numactl -C 43-82 -m 1` + - rank 2: `numactl -C 86-125 -m 2` + - rank 3: `numactl -C 128-167 -m 3` + - rank 4: `numactl -C 171-210 -m 4` + - rank 5: `numactl -C 214-253 -m 5` + + + **Example 2**: Run SGLang service with TP=2, using 96 cores across 3 SNCs on a Xeon® 6972P server, + which has 32-32-32 cores on the 3 SNCs in a socket, we should set: + ```bash + export SGLANG_CPU_OMP_THREADS_BIND="0-95|96-191" + ``` + This configuration is equivalent to: + - rank 0: `numactl -C 0-95 -m 0-2` + - rank 1: `numactl -C 96-191 -m 3-5` Please beware that with SGLANG_CPU_OMP_THREADS_BIND set, the available memory amounts of the ranks may not be determined in prior. @@ -209,8 +225,7 @@ Notes: 3. For optimizing decoding with torch.compile, please add the flag `--enable-torch-compile`. To specify the maximum batch size when using `torch.compile`, set the flag `--torch-compile-max-bs`. For example, `--enable-torch-compile --torch-compile-max-bs 4` means using `torch.compile` - and setting the maximum batch size to 4. Currently the maximum applicable batch size - for optimizing with `torch.compile` is 16. + and setting the maximum batch size to 4. 4. A warmup step is automatically triggered when the service is started. The server is ready when you see the log `The server is fired up and ready to roll!`. diff --git a/docs/platforms/plugin.md b/docs/platforms/plugin.md new file mode 100644 index 000000000000..8a4c4ee1c64d --- /dev/null +++ b/docs/platforms/plugin.md @@ -0,0 +1,414 @@ +# SGLang Plugin System + +## Overview + +The plugin system allows hardware vendors and developers to extend SGLang **without modifying the main repository code**.
+ +The framework provides two plugin types, both discovered via Python's standard `setuptools` entry_points: + +| Plugin Type | Entry Point Group | Purpose | +|---|---|---| +| **Hardware Platform Plugin** | `sglang.srt.platforms` | Register a custom hardware platform (device operations, KV cache pools, attention backends, graph capture, compilation backends, etc.) | +| **General Plugin** | `sglang.srt.plugins` | Inject hooks (before/after/around/replace) into any function/method, or replace entire classes | + +### Principles + +- **Non-intrusive**: Existing CUDA/ROCm/NPU/XPU code remains unchanged. OOT code paths are added alongside existing hardware-specific logic. +- **Zero configuration**: Plugins are automatically discovered after `pip install`, no sglang code changes required. +- **Environment variable control**: `SGLANG_PLATFORM` selects or validates the active platform plugin; `SGLANG_PLUGINS` (comma-separated) controls which general plugins to load. + +### Current Scope & Future Direction + +The plugin system currently targets **out-of-tree (OOT) hardware platforms** — enabling new devices to integrate with SGLang without any changes to the main repository. The main-repo hardware paths (CUDA, ROCm, NPU, XPU, etc.) continue to use the existing `is_cuda()`/`is_npu()`/… utility functions. + +As the plugin interfaces mature and stabilize, in-tree hardware backends can be gradually migrated to the same plugin architecture. This would replace the scattered `if device == "cuda" … elif device == "npu" …` branches throughout the codebase with a single polymorphic dispatch through the platform interface, making each hardware backend self-contained and the core engine hardware-agnostic. 
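The contrast described in the paragraph above, between scattered `if device == "cuda" … elif device == "npu" …` branches and a single dispatch through a platform object, can be sketched roughly as follows. This is an illustration only: `Platform`, `CudaLikePlatform`, `NpuLikePlatform`, `MyOotPlatform`, and the returned backend strings are hypothetical names invented for the example. They are not SGLang's actual classes or backend identifiers.

```python
# Hypothetical sketch: every class name and backend string below is a
# placeholder chosen for illustration, not part of SGLang's real API.
from abc import ABC, abstractmethod


def default_attention_backend_branchy(device: str) -> str:
    """Scattered-branch style: each new device needs another branch here."""
    if device == "cuda":
        return "cuda_backend"
    elif device == "npu":
        return "npu_backend"
    raise ValueError(f"unsupported device: {device}")


class Platform(ABC):
    """Minimal stand-in for a platform interface."""

    @abstractmethod
    def default_attention_backend(self) -> str:
        """Return the name of the attention backend this platform prefers."""


class CudaLikePlatform(Platform):
    def default_attention_backend(self) -> str:
        return "cuda_backend"


class NpuLikePlatform(Platform):
    def default_attention_backend(self) -> str:
        return "npu_backend"


class MyOotPlatform(Platform):
    """An out-of-tree device: added without touching the caller below."""

    def default_attention_backend(self) -> str:
        return "my_backend"


def default_attention_backend(platform: Platform) -> str:
    """Polymorphic style: the caller never inspects the device type."""
    return platform.default_attention_backend()
```

In the branch-based version, supporting a new device means editing the function itself. In the polymorphic version, the new device only contributes a subclass, which is the property the plugin architecture appears to be aiming for.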
+ +## Architecture + +### Platform Hierarchy + +The platform hierarchy uses a DeviceMixin pattern to share device operations between SRT (LLM inference) and Multimodal subsystems: + +``` +DeviceMixin (shared device identity + operations) +├── SRTPlatform(DeviceMixin) # + graph runner, KV pool, … +│ └── MySRTPlatform(SRTPlatform, MyDeviceMixin) # OOT plugin +└── MMPlatform(DeviceMixin) # + attention backend, VAE, … (future) + └── MyMMPlatform(MMPlatform, MyDeviceMixin) # OOT plugin +``` + +Key design points: +- **DeviceMixin** provides platform identity queries (`is_cuda()`, `is_npu()`, etc.) and device operations (`set_device()`, `get_device_name()`, etc.) +- **SRTPlatform** adds SRT-specific factory methods, capability flags, and lifecycle hooks +- OOT plugins implement a **device mixin** (vendor-specific operations) and compose it with **SRTPlatform** via multiple inheritance +- All methods are **instance methods** (not classmethods), called through the `current_platform` singleton +- Device operations and factory methods raise `NotImplementedError` by default (fail-fast) +- Capability flags use safe conservative defaults (`False`/`pass`) +- Methods are annotated `[Active]` (called by SGLang core) or `[Planned]` (reserved for future migration) + +### Platform Discovery (`current_platform`) + +`current_platform` is a **lazy singleton** in `sglang.srt.platforms`. 
On first access it resolves the active platform through the following priority chain: + +``` +entry_points("sglang.srt.platforms") → Enumerate ALL plugins by name (metadata only) + │ + ├─ SGLANG_PLATFORM set (front-loading filter): + │ ├─ Name not found in discovered → RuntimeError + │ ├─ activate() returns non-None → load that platform + │ └─ activate() returns None → RuntimeError (hardware unavailable) + │ + └─ SGLANG_PLATFORM unset (auto-discover, activate all): + ├─ 0 activated → fallback base SRTPlatform + ├─ 1 activated → use it + └─ N activated → RuntimeError (must set SGLANG_PLATFORM) +``` + +### Plugin Loading Flow + +`load_plugins()` discovers and executes general plugins, then applies all registered hooks. It is called at four points: + +| Call Site | Process | Timing | +|---------|------|------| +| `cli/serve.py` serve() | Main | Before `prepare_server_args()` | +| `launch_server.py` `__main__` | Main | Before `prepare_server_args()` | +| `engine.py` `_launch_subprocesses()` | Main | Before `server_args.check_server_args()` | +| `scheduler.py` `run_scheduler_process()` | Subprocess | Before `Scheduler()` construction | + +> **Note**: `load_plugins()` is idempotent (guarded by `_plugins_loaded` flag). In spawn'd subprocesses the flag resets, so plugins are correctly re-loaded. + +``` +load_plugins() + ├── _get_excluded_dists() → compute dists to skip (via SGLANG_PLATFORM) + ├── load_plugins_by_group("sglang.srt.plugins", → discover entry_points, filter by SGLANG_PLUGINS + │ excluded_dists=...) skip plugins from unselected platform packages + ├── for each plugin: → set _current_plugin_source context var + │ func() side effects (register hooks with source tracking) + └── HookRegistry.apply_hooks() → monkey-patch targets +``` + +--- + +## Plugin Type 1: Hardware Platform Plugin + +### Description + +A hardware platform plugin registers an `SRTPlatform` subclass that tells SGLang how to interact with a specific hardware backend. + +### Quick Start + +**1. 
Create a minimal package:** + +``` +my_platform_plugin/ +├── pyproject.toml +└── my_platform_plugin/ + ├── __init__.py # activate() function + ├── device.py # MyDeviceMixin + └── platform.py # MySRTPlatform +``` + +**2. `pyproject.toml`:** + +```toml +[build-system] +requires = ["setuptools"] +build-backend = "setuptools.build_meta" + +[project] +name = "my-platform-plugin" +version = "0.1.0" + +[project.entry-points."sglang.srt.platforms"] +my_device = "my_platform_plugin:activate" +``` + +**3. `__init__.py`** — activation function: + +```python +def activate(): + """Return fully-qualified class name to activate, or None to skip.""" + if _my_device_is_available(): + return "my_platform_plugin.platform.MySRTPlatform" + return None +``` + +**4. `device.py`** — device mixin: + +```python +from sglang.srt.platforms.device_mixin import DeviceMixin, PlatformEnum + +class MyDeviceMixin(DeviceMixin): + _enum = PlatformEnum.OOT + device_name = "my_device" + device_type = "my_device" # torch device type + + def set_device(self, device) -> None: ... + def get_device_name(self, device_id=0) -> str: ... + def get_device_total_memory(self, device_id=0) -> int: ... + def get_current_memory_usage(self, device=None) -> float: ... + def get_device_capability(self, device_id=0): ... + def get_torch_distributed_backend_str(self) -> str: ... +``` + +**5. `platform.py`** — SRT platform: + +```python +from sglang.srt.platforms.interface import SRTPlatform +from my_platform_plugin.device import MyDeviceMixin + +class MySRTPlatform(SRTPlatform, MyDeviceMixin): + def get_default_attention_backend(self) -> str: ... + def support_cuda_graph(self) -> bool: ... + # ... override other methods as needed +``` + +**6. 
Install and verify:** + +```bash +pip install -e my_platform_plugin/ +python -c "from sglang.srt.platforms import current_platform; print(current_platform)" +``` + +### Platform Interface Reference + +#### Identity Queries (from DeviceMixin) + +| Method | Default | Description | +|---|---|---| +| `is_cuda()` | Based on `_enum` | Whether this is an NVIDIA CUDA platform | +| `is_rocm()` | Based on `_enum` | Whether this is an AMD ROCm platform | +| `is_npu()` | Based on `_enum` | Whether this is a Huawei NPU platform | +| `is_cpu()` | Based on `_enum` | Whether this is a CPU-only platform | +| `is_xpu()` | Based on `_enum` | Whether this is an Intel XPU platform | +| `is_musa()` | Based on `_enum` | Whether this is a Moore Threads MUSA platform | +| `is_cuda_alike()` | CUDA+ROCM+MUSA | True if the hardware supports CUDA-like APIs | +| `is_out_of_tree()` | `True` for OOT | Automatically detected based on `_enum = PlatformEnum.OOT` | + +#### Device Operations (from DeviceMixin) + +> Methods annotated **[Active]** are called by SGLang core through `current_platform` — OOT implementations take effect immediately. +> Methods annotated **[Planned]** are reserved interfaces — SGLang core still uses hardcoded calls (e.g. `torch.cuda.empty_cache()`). OOT implementations will NOT take effect until the core is migrated in a future PR. + +| Method | Default | Status | Description | +|---|---|---|---| +| `get_device(local_rank)` | `raise NotImplementedError` | Planned | Return `torch.device` for a given local rank | +| `set_device(device)` | `raise NotImplementedError` | Planned | Set the current device | +| `get_device_name(device_id)` | `raise NotImplementedError` | Planned | Get human-readable device name | +| `get_device_uuid(device_id)` | `raise NotImplementedError` | Planned | Get unique device identifier | +| `get_device_capability(device_id)` | `raise NotImplementedError` | Planned | Get `DeviceCapability(major, minor)`. 
None if N/A | +| `empty_cache()` | `pass` | Planned | Release cached device memory | +| `synchronize()` | `pass` | Planned | Synchronize device operations | +| `get_device_total_memory(device_id)` | `raise NotImplementedError` | **Active** | Get total device memory in bytes | +| `get_available_memory(device_id)` | `raise NotImplementedError` | Planned | Return `(free_bytes, total_bytes)` | +| `get_current_memory_usage(device)` | `raise NotImplementedError` | **Active** | Get current peak memory usage in bytes | +| `get_torch_distributed_backend_str()` | `raise NotImplementedError` | Planned | Distributed backend string (e.g. "nccl", "hccl") | +| `get_communicator_class()` | `None` | Planned | Platform-specific communicator class | +| `inference_mode()` | `torch.inference_mode(True)` | Planned | Return inference mode context manager | +| `seed_everything(seed)` | Set random/np/torch seeds | Planned | Set random seeds for reproducibility | +| `verify_quantization(quant)` | `pass` | Planned | Validate quantization method support | +| `get_cpu_architecture()` | Auto-detect x86/arm | Planned | Detect CPU architecture (`CpuArchEnum`) | + +#### Types (from DeviceMixin) + +| Type | Description | +|---|---| +| `PlatformEnum` | Enumeration of platform types: CUDA, ROCM, CPU, XPU, MUSA, NPU, TPU, MPS, OOT, UNSPECIFIED | +| `CpuArchEnum` | CPU architecture: X86, ARM, UNSPECIFIED | +| `DeviceCapability` | `NamedTuple(major, minor)` with comparison support. 
Methods: `as_version_str()`, `to_int()` | + +#### Capability Flags (from SRTPlatform) + +| Method | Default | Description | +|---|---|---| +| `support_cuda_graph()` | `False` | Whether device graph capture is supported (plain CUDA graph) | +| `support_piecewise_cuda_graph()` | `False` | Whether piecewise CUDA graph (torch.compile backend) is supported | +| `supports_fp8()` | `False` | Whether FP8 quantization is supported | +| `is_pin_memory_available()` | `True` | Whether pinned memory is available | + +#### Subsystem Factory Methods (from SRTPlatform) + +| Method | Default | Description | +|---|---|---| +| `get_default_attention_backend()` | `raise NotImplementedError` | Default attention backend name | +| `get_graph_runner_cls()` | `raise NotImplementedError` | Graph Runner class | +| `get_mha_kv_pool_cls()` | `raise NotImplementedError` | MHA KV cache pool class | +| `get_mla_kv_pool_cls()` | `raise NotImplementedError` | MLA KV cache pool class | +| `get_nsa_kv_pool_cls()` | `raise NotImplementedError` | NSA KV cache pool class (DeepSeek V3.2) | +| `get_paged_allocator_cls()` | `raise NotImplementedError` | Paged allocator class | +| `get_piecewise_backend_cls()` | `raise NotImplementedError` | Piecewise compilation backend class | +| `get_compile_backend(mode)` | `"inductor"` | Compilation backend string | +| `get_dispatch_key_name()` | `"native"` | MultiPlatformOp dispatch key name | + +#### Lifecycle Hooks (from SRTPlatform) + +| Method | Invocation Timing | Purpose | +|---|---|---| +| `apply_server_args_defaults(server_args)` | After ServerArgs parsing, in `__post_init__` | Set platform-specific defaults | +| `init_backend()` | In each worker, before model construction | One-time backend initialization | + +### Environment Variables + +| Variable | Description | +|---|---| +| `SGLANG_PLATFORM` | Select the platform plugin by entry_point name (e.g. `kunlun`, `demo_cuda`). 
When set, **only** the named plugin's `activate()` is called (front-loading filter) — other plugins are not touched. Additionally, general plugins (`sglang.srt.plugins`) from unselected platform packages are automatically skipped to avoid importing their dependencies. Required when multiple plugins would activate. Errors if the name is not found or if the plugin's hardware is unavailable. | +| `SGLANG_PLUGINS` | Comma-separated whitelist of general plugin names to load (group: `sglang.srt.plugins`). If unset, all discovered general plugins are loaded. | + +--- + +## Plugin Type 2: General Plugin + +### Description + +General function plugins inject behavior into sglang **without requiring a custom platform**. Use cases include: + +- **Observability**: Add logging, metrics, and tracing to any function +- **Behavior modification**: Modify function arguments or return values +- **Performance profiling**: Add timing to critical functions +- **A/B testing**: Replace implementations at runtime + +### Quick Start + +**1. Create a minimal package:** + +``` +my_general_plugin/ +├── pyproject.toml +└── my_general_plugin/ + └── __init__.py # register() function +``` + +**2. `pyproject.toml`:** + +```toml +[build-system] +requires = ["setuptools"] +build-backend = "setuptools.build_meta" + +[project] +name = "my-general-plugin" +version = "0.1.0" + +[project.entry-points."sglang.srt.plugins"] +my_plugin = "my_general_plugin:register" +``` + +**3. `__init__.py`** — register hooks: + +```python +from sglang.srt.plugins.hook_registry import HookRegistry, HookType + +def register(): + """Entry point called by load_plugins().""" + HookRegistry.register( + "sglang.srt.managers.scheduler.Scheduler.__init__", + my_hook, + HookType.AROUND, + ) + +def my_hook(original_fn, self, *args, **kwargs): + result = original_fn(self, *args, **kwargs) + print(f"Scheduler initialized! gpu_id={self.gpu_id}") + return result +``` + +**4. 
Install and run:** + +```bash +pip install -e my_general_plugin/ +sglang serve --model-path [options] +# Look for "Scheduler initialized!" in logs +``` + +### Hook Types + +`HookRegistry` supports four hook types: + +| Hook Type | Signature | Description | +|---|---|---| +| **BEFORE** | `fn(*args, **kwargs) -> (args, kwargs) \| None` | Runs before the original. Return `None` to keep args unchanged, or `(args, kwargs)` to modify. | +| **AFTER** | `fn(result, *args, **kwargs) -> new_result \| None` | Runs after the original. Return `None` to keep result, or a new value to replace. | +| **AROUND** | `fn(original_fn, *args, **kwargs) -> result` | Wraps the original. You must call `original_fn` yourself. Full control over execution. | +| **REPLACE** | `fn(*args, **kwargs) -> result` or `class` | Replace the original function or class entirely. For class targets, pass a replacement class directly — it is substituted via `setattr` preserving `isinstance()`/`issubclass()` semantics. | + +> **Note**: Only `REPLACE` accepts a class as the hook. Passing a class to `BEFORE`/`AFTER`/`AROUND` raises `TypeError` at registration time. 
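The calling conventions in the Hook Types table can be illustrated with a small standalone wrapper. This is a sketch of the documented semantics only, not the `HookRegistry` implementation: `HookKind` and `wrap` are hypothetical names, and class replacement via `setattr` is not modelled.

```python
import functools
from enum import Enum


class HookKind(Enum):
    """Hypothetical stand-in for HookType, used only in this sketch."""
    BEFORE = "before"
    AFTER = "after"
    AROUND = "around"
    REPLACE = "replace"


def wrap(original, hook, kind):
    """Return a callable that combines `original` and `hook` per the table."""
    if kind is HookKind.REPLACE:
        # The hook takes the place of the original entirely.
        return hook

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        if kind is HookKind.BEFORE:
            # None keeps the arguments; (args, kwargs) replaces them.
            modified = hook(*args, **kwargs)
            if modified is not None:
                args, kwargs = modified
            return original(*args, **kwargs)
        if kind is HookKind.AFTER:
            # None keeps the result; any other value replaces it.
            result = original(*args, **kwargs)
            new_result = hook(result, *args, **kwargs)
            return result if new_result is None else new_result
        # AROUND: the hook decides whether and how to call the original.
        return hook(original, *args, **kwargs)

    return wrapper


def add(a, b):
    return a + b


def double_args(a, b):
    return (a * 2, b * 2), {}


def negate(result, a, b):
    return -result


def call_twice(original_fn, a, b):
    return original_fn(a, b) * 2
```

One consequence of the `None` convention is that an AFTER hook cannot replace a result with `None` itself; an AROUND hook would be needed for that case.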
+ +### Registration API + +Hooks can be registered using the **imperative API** or the **decorator API**: + +```python +# --- Imperative API --- +import time + +from sglang.srt.plugins.hook_registry import HookRegistry, HookType + +def my_timer(original_fn, *args, **kwargs): + start = time.perf_counter() + result = original_fn(*args, **kwargs) + print(f"Elapsed: {time.perf_counter() - start:.3f}s") + return result + +HookRegistry.register( + "sglang.srt.managers.scheduler.Scheduler.get_next_batch_to_run", + my_timer, + HookType.AROUND, +) + +# --- Decorator API --- +import time + +from sglang.srt.plugins.hook_registry import plugin_hook, HookType + +@plugin_hook( + "sglang.srt.managers.scheduler.Scheduler.get_next_batch_to_run", + type=HookType.AROUND, +) +def my_timer(original_fn, *args, **kwargs): + start = time.perf_counter() + result = original_fn(*args, **kwargs) + print(f"Elapsed: {time.perf_counter() - start:.3f}s") + return result + +# --- Class replacement (REPLACE) --- +from sglang.srt.plugins.hook_registry import plugin_hook, HookType +from sglang.srt.managers.scheduler import Scheduler + +@plugin_hook( + "sglang.srt.managers.scheduler.Scheduler", + type=HookType.REPLACE, +) +class MyScheduler(Scheduler): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + print("Enhanced scheduler initialized!") +``` + +### Hook Target Resolution + +Target paths use fully-qualified dotted notation.
Both formats are supported: + +- **Dotted**: `sglang.srt.managers.scheduler.Scheduler.__init__` +- **Entry-points style**: `sglang.srt.managers.scheduler:Scheduler.__init__` (colon treated as dot) + +### Common Hook Targets + +| Target | Description | +|---|---| +| `sglang.srt.server_args.ServerArgs.add_cli_args` | Add custom CLI arguments | +| `sglang.srt.server_args.ServerArgs.__post_init__` | Modify ServerArgs after parsing | +| `sglang.srt.server_args.ServerArgs.check_server_args` | Add/relax validation | +| `sglang.srt.managers.scheduler.Scheduler.__init__` | Custom scheduler state | +| `sglang.srt.managers.scheduler.Scheduler.get_next_batch_to_run` | Custom scheduling policy | +| `sglang.srt.managers.scheduler.Scheduler.run_batch` | Profiling / inspection | +| `sglang.srt.managers.scheduler.Scheduler.process_batch_result` | Custom metrics | +| `sglang.srt.managers.tp_worker.TpModelWorker.__init__` | Custom worker state | +| `sglang.srt.managers.tp_worker.TpModelWorker.forward_batch_generation` | Forward pass wrapping | + +--- + +## File Reference + +| File | Description | +|---|---| +| `sglang/srt/platforms/device_mixin.py` | `PlatformEnum` + `DeviceMixin` base class | +| `sglang/srt/platforms/interface.py` | `SRTPlatform` base class (extends DeviceMixin) | +| `sglang/srt/platforms/__init__.py` | `current_platform` lazy singleton + discovery logic | +| `sglang/srt/plugins/__init__.py` | `load_plugins()` + `load_plugins_by_group()` | +| `sglang/srt/plugins/hook_registry.py` | `HookRegistry`, `HookType`, `plugin_hook` decorator | diff --git a/docs/platforms/xpu.md b/docs/platforms/xpu.md index 88fa1552c790..1ba56b192402 100644 --- a/docs/platforms/xpu.md +++ b/docs/platforms/xpu.md @@ -30,7 +30,7 @@ conda create -n sgl-xpu python=3.12 -y conda activate sgl-xpu # Set PyTorch XPU as primary pip install channel to avoid installing the larger CUDA-enabled version and prevent potential runtime issues. 
-pip3 install torch==2.9.0+xpu torchao torchvision torchaudio pytorch-triton-xpu==3.5.0 --index-url https://download.pytorch.org/whl/xpu +pip3 install torch==2.11.0+xpu torchao torchvision torchaudio==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu pip3 install xgrammar --no-deps # xgrammar will introduce CUDA-enabled triton which might conflict with XPU # Clone the SGLang code @@ -43,7 +43,7 @@ cd python cp pyproject_xpu.toml pyproject.toml # Install SGLang dependent libs, and build SGLang main package pip install --upgrade pip setuptools -pip install -v . +pip install -v . --extra-index-url https://download.pytorch.org/whl/xpu ``` ### Install Using Docker @@ -90,3 +90,54 @@ python -m sglang.bench_serving -h Additionally, the requests can be formed with [OpenAI Completions API](https://docs.sglang.io/basic_usage/openai_api_completions.html) and sent via the command line (e.g. using `curl`) or via your own script. + +## Prefill-Decode (P/D) Disaggregation on Intel XPU [Experimental] + +SGLang supports prefill-decode disaggregation on Intel XPU using the [NIXL](https://github.com/ai-dynamo/nixl) KV-transfer backend. 
+ +**Tested models:** + +| Model | Notes | +|:---:|:---:| +| [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | Used in integration tests; verified on Intel XPU with homogeneous P/D (XPU prefill + XPU decode) | +| [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | Verified on Intel XPU with homogeneous P/D (XPU prefill + XPU decode) | + +**Prerequisites:** `pip install nixl sglang-router` + +**Start the prefill server (GPU 0):** + +```bash +ZE_AFFINITY_MASK=0 UCX_POSIX_USE_PROC_LINK=n python -m sglang.launch_server \ + --model-path Qwen/Qwen3-0.6B --trust-remote-code --device xpu \ + --disaggregation-mode prefill --disaggregation-transfer-backend nixl \ + --disaggregation-bootstrap-port 12335 --host 0.0.0.0 --port 30000 +``` + +**Start the decode server (GPU 1):** + +```bash +ZE_AFFINITY_MASK=1 UCX_POSIX_USE_PROC_LINK=n python -m sglang.launch_server \ + --model-path Qwen/Qwen3-0.6B --trust-remote-code --device xpu \ + --disaggregation-mode decode --disaggregation-transfer-backend nixl \ + --disaggregation-bootstrap-port 12335 --host 0.0.0.0 --port 30001 +``` + +**Start the router:** + +```bash +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --prefill http://127.0.0.1:30000 \ + --decode http://127.0.0.1:30001 \ + --host 0.0.0.0 --port 8000 +``` + +**Send a request:** + +```bash +curl http://127.0.0.1:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "The capital of France is", "max_tokens": 32}' +``` + +> **Note:** `UCX_POSIX_USE_PROC_LINK=n` is required on Intel XPU to avoid UCX shared-memory transport issues. diff --git a/docs/references/custom_chat_template.md b/docs/references/custom_chat_template.md index f22ee8bec30c..870d09c1ccf0 100644 --- a/docs/references/custom_chat_template.md +++ b/docs/references/custom_chat_template.md @@ -1,6 +1,6 @@ # Custom Chat Template -**NOTE**: There are two chat template systems in SGLang project. 
This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)). +**NOTE**: There are two chat template systems in SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/parser/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)). By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3. diff --git a/docs/references/environment_variables.md b/docs/references/environment_variables.md index 91772cb3d5ba..45e51b9abf3b 100644 --- a/docs/references/environment_variables.md +++ b/docs/references/environment_variables.md @@ -12,20 +12,23 @@ SGLang supports various environment variables that can be used to configure its | `SGLANG_HOST_IP` | Host IP address for the server | `0.0.0.0` | | `SGLANG_PORT` | Port for the server | auto-detected | | `SGLANG_LOGGING_CONFIG_PATH` | Custom logging configuration path | Not set | -| `SGLANG_DISABLE_REQUEST_LOGGING` | Disable request logging | `false` | +| `SGLANG_LOG_REQUEST_HEADERS` | Comma-separated list of additional HTTP headers to log when `--log-requests` is enabled. Appends to the default `x-smg-routing-key`. 
| Not set | | `SGLANG_HEALTH_CHECK_TIMEOUT` | Timeout for health check in seconds | `20` | | `SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL` | The interval of passes to collect the metric of selected count of physical experts on each layer and GPU rank. 0 means disabled. | `0` | | `SGLANG_FORWARD_UNKNOWN_TOOLS` | Forward unknown tool calls to clients instead of dropping them | `false` (drop unknown tools) | -| `SGLANG_QUEUED_TIMEOUT_MS` | Timeout (in ms) for requests in the waiting queue | `-1` | +| `SGLANG_REQ_WAITING_TIMEOUT` | Timeout (in seconds) for requests waiting in the queue before being scheduled | `-1` | +| `SGLANG_REQ_RUNNING_TIMEOUT` | Timeout (in seconds) for requests running in the decode batch | `-1` | +| `SGLANG_CACHE_DIR` | Cache directory for model weights and other data | `~/.cache/sglang` | +| `SGLANG_PREFETCH_BLOCK_SIZE_MB` | Block size (in MB) for sequential checkpoint prefetch reads that warm the OS page cache before workers load weights via mmap | `16` | ## Performance Tuning | Environment Variable | Description | Default Value | | --- | --- | --- | | `SGLANG_ENABLE_TORCH_INFERENCE_MODE` | Control whether to use torch.inference_mode | `false` | -| `SGLANG_ENABLE_TORCH_COMPILE` | Enable torch.compile | `true` | -| `SGLANG_SET_CPU_AFFINITY` | Enable CPU affinity setting (often set to `1` in Docker builds) | `0` | -| `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN` | Allows the scheduler to overwrite longer context length requests (often set to `1` in Docker builds) | `0` | +| `SGLANG_ENABLE_TORCH_COMPILE` | Enable torch.compile | `false` | +| `SGLANG_SET_CPU_AFFINITY` | Enable CPU affinity setting (often set to `1` in Docker builds) | `false` | +| `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN` | Allows the scheduler to overwrite longer context length requests (often set to `1` in Docker builds) | `false` | | `SGLANG_IS_FLASHINFER_AVAILABLE` | Control FlashInfer availability check | `true` | | `SGLANG_SKIP_P2P_CHECK` | Skip P2P (peer-to-peer) access check | 
`false` | | `SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD` | Sets the threshold for enabling chunked prefix caching | `8192` | @@ -36,12 +39,16 @@ SGLang supports various environment variables that can be used to configure its | `SGLANG_DATA_PARALLEL_BUDGET_INTERVAL` | Interval for DPBudget updates | `1` | | `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT` | Default weight value for scheduler recv skipper counter (used when forward mode doesn't match specific modes). Only active when `--scheduler-recv-interval > 1`. The counter accumulates weights and triggers request polling when reaching the interval threshold. | `1000` | | `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE` | Weight increment for decode forward mode in scheduler recv skipper. Works with `--scheduler-recv-interval` to control polling frequency during decode phase. | `1` | -| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_VERIFY` | Weight increment for target verify forward mode in scheduler recv skipper. Works with `--scheduler-recv-interval` to control polling frequency during verification phase. | `1` | +| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_TARGET_VERIFY` | Weight increment for target verify forward mode in scheduler recv skipper. Works with `--scheduler-recv-interval` to control polling frequency during verification phase. | `1` | | `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE` | Weight increment when forward mode is None in scheduler recv skipper. Works with `--scheduler-recv-interval` to control polling frequency when no specific forward mode is active. | `1` | | `SGLANG_MM_BUFFER_SIZE_MB` | Size of preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. When set to a positive value, temporarily moves features to GPU for faster hash computation, then moves them back to CPU to save GPU memory. Larger features benefit more from GPU hashing. Set to `0` to disable. 
| `0` | | `SGLANG_MM_PRECOMPUTE_HASH` | Enable precomputing of hash values for MultimodalDataItem | `false` | | `SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH` | Enable NCCL for gathering when preparing mlp sync batch under overlap scheduler (without this flag gloo is used for gathering) | `false` | -| `SGLANG_SYMM_MEM_PREALLOC_GB_SIZE` | Size of preallocated GPU buffer (in GB) for NCCL symmetric memory pool to limit memory fragmentation. Only have an effect when server arg `--enable-symm-mem` is set. | `4` | +| `SGLANG_SYMM_MEM_PREALLOC_GB_SIZE` | Size of preallocated GPU buffer (in GB) for NCCL symmetric memory pool to limit memory fragmentation. Only have an effect when server arg `--enable-symm-mem` is set. | `-1` | +| `SGLANG_CUSTOM_ALLREDUCE_ALGO` | The algorithm of custom all-reduce. Set to `oneshot` or `1stage` to force use one-shot. Set to `twoshot` or `2stage` to force use two-shot. | `` | +| `SGLANG_SKIP_SOFTMAX_PREFILL_THRESHOLD_SCALE_FACTOR` | Skip-softmax threshold scale factor for TRT-LLM prefill attention in flashinfer. `None` means standard attention. See https://arxiv.org/abs/2512.12087 | `None` | +| `SGLANG_SKIP_SOFTMAX_DECODE_THRESHOLD_SCALE_FACTOR` | Skip-softmax threshold scale factor for TRT-LLM decode attention in flashinfer. `None` means standard attention. 
See https://arxiv.org/abs/2512.12087 | `None` | +| `SGLANG_USE_SGL_FA3_KERNEL` | Use sgl-kernel implementation for FlashAttention v3 | `true` | ## DeepGEMM Configuration (Advanced Optimization) @@ -53,8 +60,9 @@ SGLang supports various environment variables that can be used to configure its | `SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS` | Number of workers for parallel DeepGEMM kernel compilation | `4` | | `SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE` | Indicator flag used during the DeepGEMM precompile script | `"false"` | | `SGLANG_DG_CACHE_DIR` | Directory for caching compiled DeepGEMM kernels | `~/.cache/deep_gemm` | -| `SGL_DG_USE_NVRTC` | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | `"0"` | -| `SGL_USE_DEEPGEMM_BMM` | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | `"false"` | +| `SGLANG_DG_USE_NVRTC` | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | `"false"` | +| `SGLANG_USE_DEEPGEMM_BMM` | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | `"false"` | +| `SGLANG_JIT_DEEPGEMM_FAST_WARMUP` | Precompile fewer kernels during warmup, which reduces the warmup time from 30min to less than 3min. Might cause performance degradation during runtime. | `"false"` | ## DeepEP Configuration @@ -66,6 +74,20 @@ SGLang supports various environment variables that can be used to configure its | `SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS` | Number of SMs used for DeepEP combine when single batch overlap is enabled | `"32"` | | `SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO` | Run shared experts on an alternate stream when single batch overlap is enabled on GB200. When not setting this flag, shared experts and down gemm will be overlapped with DeepEP combine together. | `"false"` | +## MORI Configuration + +| Environment Variable | Description | Default Value | +| --- | --- | --- | +| `SGLANG_MORI_DISPATCH_DTYPE` | Override MoRI-EP dispatch quantization type.
`auto` uses auto-detection from weight dtype; `bf16`/`fp8`/`fp4` forces the specified type for all layers | `"auto"` | +| `SGLANG_MORI_FP8_COMB` | Use FP8 for combine | `"false"` | +| `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | `4096` | +| `SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD` | Threshold for switching between `InterNodeV1` and `InterNodeV1LL` kernel types. `InterNodeV1LL` is used if `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` is less than or equal to this threshold; otherwise, `InterNodeV1` is used. | `256` | +| `SGLANG_MORI_PREALLOC_MAX_RECV_TOKENS` | Customizes the number of tokens preallocated for a rank (related to `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK`). Valid range is 1 to world_size*SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK; the default `0` means the maximum. A smaller value reduces the memory footprint, but too small a value can cause buffer overflow. | `0` | +| `SGLANG_MORI_MOE_MAX_INPUT_TOKENS` | Truncate the dispatch buffer to this many rows before MoE computation, reducing kernel overhead on padding tokens. The value must be >= the actual number of received tokens (`totalRecvTokenNum`); setting it too small causes incorrect results. `0` disables truncation (use full buffer).
| `0` | +| `SGLANG_MORI_QP_PER_TRANSFER` | Number of RDMA Queue Pairs (QPs) used per transfer operation | `1` | +| `SGLANG_MORI_POST_BATCH_SIZE` | Number of RDMA work requests posted in a single batch to each QP | `-1` | +| `SGLANG_MORI_NUM_WORKERS` | Number of worker threads in the RDMA executor thread pool | `1` | + ## NSA Backend Configuration (For DeepSeek V3.2) @@ -74,6 +96,8 @@ SGLang supports various environment variables that can be used to configure its | --- | --- | --- | | `SGLANG_NSA_FUSE_TOPK` | Fuse the operation of picking topk logits and picking topk indices from page table | `true` | | `SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA` | Precompute metadata that can be shared among different draft steps when MTP is enabled | `true` | +| `SGLANG_USE_FUSED_METADATA_COPY` | Control whether to use fused metadata copy kernel for cuda graph replay | `true` | +| `SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD` | When the maximum kv len in current prefill batch exceeds this value, the sparse mla kernel will be applied, else it falls back to dense MHA implementation. Default to the index topk of model (2048 for DeepSeek V3.2) | `2048` | ## Memory Management @@ -84,13 +108,15 @@ SGLang supports various environment variables that can be used to configure its | `SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION` | Clip max new tokens estimation for memory planning | `4096` | | `SGLANG_DETOKENIZER_MAX_STATES` | Maximum states for detokenizer | Default value based on system | | `SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK` | Enable checks for memory imbalance across Tensor Parallel ranks | `true` | +| `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` | Configure the custom memory pool type for Mooncake. Supports `NVLINK`, `BAREX`, `INTRA_NODE_NVLINK`. If set to `true`, it defaults to `NVLINK`. 
| `None` | ## Model-Specific Options | Environment Variable | Description | Default Value | | --- | --- | --- | | `SGLANG_USE_AITER` | Use AITER optimize implementation | `false` | -| `SGLANG_MOE_PADDING` | Enable MoE padding (sets padding size to 128 if value is `1`, often set to `1` in Docker builds) | `0` | +| `SGLANG_ROCM_USE_MULTI_STREAM` | Allocate alt CUDA/HIP stream on ROCm/AITER to overlap shared and routed experts in DeepseekV2 MoE. Requires the HIP env `GPU_MAX_HW_QUEUES>=5` (default `4`, the cap on HSA/ROCr HW queues HIP creates) so the alt stream gets its own queue instead of serializing with the main stream. Best paired with `--deepep-mode low_latency` so Mori's AsyncLL kernel offloads dispatch/combine to copy engines and frees CUs. | `false` | +| `SGLANG_MOE_PADDING` | Enable MoE padding (sets padding size to 128 if value is `1`, often set to `1` in Docker builds) | `false` | | `SGLANG_CUTLASS_MOE` (deprecated) | Use Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated, use --moe-runner-backend=cutlass) | `false` | ## Quantization @@ -98,14 +124,12 @@ SGLang supports various environment variables that can be used to configure its | Environment Variable | Description | Default Value | | --- | --- | --- | | `SGLANG_INT4_WEIGHT` | Enable INT4 weight quantization | `false` | -| `SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2` | Apply per token group quantization kernel with fused silu and mul and masked m | `false` | | `SGLANG_FORCE_FP8_MARLIN` | Force using FP8 MARLIN kernels even if other FP8 kernels are available | `false` | -| `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` (deprecated) | Select backend for `mm_fp4` on Blackwell GPUs. **DEPRECATED**: Please use `--fp4-gemm-backend` instead. 
| `` | | `SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN` | Quantize q_b_proj from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | `false` | | `SGLANG_MOE_NVFP4_DISPATCH` | Use nvfp4 for moe dispatch (on flashinfer_cutlass or flashinfer_cutedsl moe runner backend) | `"false"` | | `SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE` | Quantize moe of nextn layer from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | `false` | -| `SGLANG_ENABLE_FLASHINFER_FP8_GEMM` (deprecated) | Use flashinfer kernels when running blockwise fp8 GEMM on Blackwell GPUs. **DEPRECATED**: Please use `--fp8-gemm-backend=flashinfer_trtllm` instead. | `false` | -| `SGLANG_SUPPORT_CUTLASS_BLOCK_FP8` (deprecated) | Use Cutlass kernels when running blockwise fp8 GEMM on Hopper or Blackwell GPUs. **DEPRECATED**: Please use `--fp8-gemm-backend=cutlass` instead. | `false` | +| `SGLANG_QUANT_ALLOW_DOWNCASTING` | Allow weight dtype downcasting during loading (e.g., fp32 → fp16). By default, SGLang rejects this kind of downcasting when using quantization. | `false` | +| `SGLANG_FP8_IGNORED_LAYERS` | A comma-separated list of layer names to ignore during FP8 quantization. For example: `model.layers.0,model.layers.1.,qkv_proj`. | `""` | ## Distributed Computing @@ -117,6 +141,15 @@ SGLang supports various environment variables that can be used to configure its | `SGLANG_PP_LAYER_PARTITION` | Pipeline parallel layer partition specification | Not set | | `SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS` | Set one visible device per process for distributed computing | `false` | +## PD Disaggregation — Staging Buffer (Heterogeneous TP) + +| Environment Variable | Description | Default Value | +| --- | --- | --- | +| `SGLANG_DISAGG_STAGING_BUFFER` | Enable GPU staging buffer for heterogeneous TP KV transfer. Required when prefill and decode use different TP/attention-TP sizes. Only for non-MLA models (e.g. GQA, MHA). | `false` | +| `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB` | Prefill-side per-worker staging buffer size in MB. 
Used for gathering KV head slices before bulk RDMA transfer. | `64` | +| `SGLANG_DISAGG_STAGING_POOL_SIZE_MB` | Decode-side ring buffer pool total size in MB. Shared buffer receiving RDMA data from all prefill ranks. Larger values support higher concurrency. | `4096` | +| `SGLANG_STAGING_USE_TORCH` | Force using PyTorch gather/scatter fallback instead of Triton fused kernels for staging operations. Useful for debugging. | `false` | + ## Testing & Debugging (Internal/CI) *These variables are primarily used for internal testing, continuous integration, or debugging.* @@ -124,11 +157,17 @@ SGLang supports various environment variables that can be used to configure its | Environment Variable | Description | Default Value | | --- | --- | --- | | `SGLANG_IS_IN_CI` | Indicates if running in CI environment | `false` | -| `SGLANG_IS_IN_CI_AMD` | Indicates running in AMD CI environment | `0` | +| `SGLANG_IS_IN_CI_AMD` | Indicates running in AMD CI environment | `false` | | `SGLANG_TEST_RETRACT` | Enable retract decode testing | `false` | | `SGLANG_TEST_RETRACT_NO_PREFILL_BS` | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds SGLANG_TEST_RETRACT_NO_PREFILL_BS. | `2 ** 31` | | `SGLANG_RECORD_STEP_TIME` | Record step time for profiling | `false` | | `SGLANG_TEST_REQUEST_TIME_STATS` | Test request time statistics | `false` | +| `SGLANG_DEBUG_SYMM_MEM` | Enable debug checks that verify tensors passed to NCCL communication ops are allocated in the symmetric memory pool. Logs warnings (rank 0 only) with stack traces for any tensor not in the pool. | `false` | +| `SGLANG_KERNEL_API_LOGLEVEL` | Controls crash-debug kernel API logging. `0` disables logging, `1` logs API names, `3` logs tensor metadata, `5` adds tensor statistics, and `10` also writes pre-call dump snapshots. | `0` | +| `SGLANG_KERNEL_API_LOGDEST` | Destination for crash-debug kernel API logs. Use `stdout`, `stderr`, or a file path. `%i` is replaced with the process PID. 
| `stdout` | +| `SGLANG_KERNEL_API_DUMP_DIR` | Output directory for level-10 kernel API input/output dumps. `%i` is replaced with the process PID. | `sglang_kernel_api_dumps` | +| `SGLANG_KERNEL_API_DUMP_INCLUDE` | Comma-separated wildcard patterns for kernel API names to include in level-10 dumps. | Not set | +| `SGLANG_KERNEL_API_DUMP_EXCLUDE` | Comma-separated wildcard patterns for kernel API names to exclude from level-10 dumps. | Not set | ## Profiling & Benchmarking @@ -145,8 +184,10 @@ SGLang supports various environment variables that can be used to configure its | Environment Variable | Description | Default Value | | --- | --- | --- | | `SGLANG_WAIT_WEIGHTS_READY_TIMEOUT` | Timeout period for waiting on weights | `120` | -| `SGLANG_DISABLE_OUTLINES_DISK_CACHE` | Disable Outlines disk cache | `true` | +| `SGLANG_DISABLE_OUTLINES_DISK_CACHE` | Disable Outlines disk cache | `false` | | `SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE` | Use SGLang's custom Triton kernel cache implementation for lower overheads (automatically enabled on CUDA) | `false` | +| `SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE` | Decode-side incremental KV cache offload stride. Rounded down to a multiple of `--page-size` (min is `--page-size`). If unset/invalid/<=0, it falls back to `--page-size`. 
| Not set (uses `--page-size`) | + ## Function Calling / Tool Use diff --git a/docs/references/frontend/frontend_tutorial.ipynb b/docs/references/frontend/frontend_tutorial.ipynb index 166f8caccb36..9c4da052c397 100644 --- a/docs/references/frontend/frontend_tutorial.ipynb +++ b/docs/references/frontend/frontend_tutorial.ipynb @@ -42,7 +42,7 @@ " \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --log-level warning\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", "print(f\"Server started on http://localhost:{port}\")" ] }, @@ -385,7 +385,7 @@ "## Multi-modal Generation\n", "\n", "You may use SGLang frontend language to define multi-modal prompts.\n", - "See [here](https://docs.sglang.io/supported_models/generative_models.html) for supported models." + "See [here](https://docs.sglang.io/supported_models/text_generation/multimodal_language_models.html) for supported models." 
] }, { @@ -398,7 +398,7 @@ " \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --log-level warning\"\n", ")\n", "\n", - "wait_for_server(f\"http://localhost:{port}\")\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", "print(f\"Server started on http://localhost:{port}\")" ] }, @@ -430,7 +430,7 @@ " s += assistant(gen(\"answer\", max_tokens=256))\n", "\n", "\n", - "image_url = \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n", + "image_url = \"https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png\"\n", "image_bytes, _ = load_image(image_url)\n", "state = image_qa(image_bytes, \"What is in the image?\")\n", "print_highlight(state[\"answer\"])" diff --git a/docs/references/multi_node_deployment/lws_pd/lws_pd_deploy.md b/docs/references/multi_node_deployment/lws_pd/lws_pd_deploy.md index 368aee34b9a5..419474a4e55e 100644 --- a/docs/references/multi_node_deployment/lws_pd/lws_pd_deploy.md +++ b/docs/references/multi_node_deployment/lws_pd/lws_pd_deploy.md @@ -638,7 +638,7 @@ kubectl apply -f p.yaml kubectl apply -f d.yaml ``` -At this point, we have completed the deployment of the 1P1D SGlang engine part. +At this point, we have completed the deployment of the 1P1D SGLang engine part. To allow our users to directly experience the model API, we still need a load balancer to handle sequential calls between prefill and decode. Different companies implement LBs differently, and the community will also officially release a new LB component written in Rust in the near future. 
diff --git a/docs/references/multi_node_deployment/multi_node.md b/docs/references/multi_node_deployment/multi_node.md index e6e5b53444fe..bdd0ca23dd46 100644 --- a/docs/references/multi_node_deployment/multi_node.md +++ b/docs/references/multi_node_deployment/multi_node.md @@ -30,7 +30,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instr ## DeepSeek V3/R1 -Please refer to [DeepSeek documents for reference](https://docs.sglang.io/basic_usage/deepseek.html#running-examples-on-multi-node). +Please refer to [DeepSeek documents for reference](https://docs.sglang.io/basic_usage/deepseek_v3.html#running-examples-on-multi-node). ## Multi-Node Inference on SLURM diff --git a/docs/references/post_training_integration.md b/docs/references/post_training_integration.md index 5e82f837455e..4dddf5905a86 100644 --- a/docs/references/post_training_integration.md +++ b/docs/references/post_training_integration.md @@ -6,7 +6,7 @@ What makes SGLang essential for post-training? - Open-To-Use Refit Functionality: diverse method for colocate or disaggregate - Easy To Postpone Generation: enable partial rollout and dedicated rollout control -- Fine-Grained Engine Sleep And Wake Up: facilitate maxium-powered rollout and training +- Fine-Grained Engine Sleep And Wake Up: facilitate maximum-powered rollout and training - Training Serving Alignment: ensure the performance consistency in training and serving - Load Balancing Router: cache-aware load-balancing for high-throughput rollout - Deterministic Inference: ensure zero kl divergence between rollout and training @@ -28,4 +28,4 @@ These capabilities, combined with native integration support across major framew ## Collaboration -Due to the privacy of the design parternes, we cannot list the companies that adopt SGLang for post-training. However, we are happy to share the details with you if you are interested and trust the choice among 10+ top companies and frontier labs across US and China. 
If you are interested in integrating SGLang with your training framework or need technical support, we're here to help! Reach out to us at **rl_team@lmsys.org** for partnerships, integration guidance, and custom feature development. +Due to the privacy of the design partners, we cannot list the companies that adopt SGLang for post-training. However, we are happy to share the details with you if you are interested and trust the choice among 10+ top companies and frontier labs across US and China. If you are interested in integrating SGLang with your training framework or need technical support, we're here to help! Reach out to us at **rl_team@lmsys.org** for partnerships, integration guidance, and custom feature development. diff --git a/docs/references/production_metrics.md b/docs/references/production_metrics.md index 85a6ff8a64a6..d104584ee4bc 100644 --- a/docs/references/production_metrics.md +++ b/docs/references/production_metrics.md @@ -142,7 +142,8 @@ This section describes how to set up the monitoring stack (Prometheus + Grafana) python -m sglang.launch_server \ --model-path \ --port 30000 \ - --enable-metrics + --enable-metrics \ + --enable-mfu-metrics ``` Replace `` with the actual path to your model (e.g., `meta-llama/Meta-Llama-3.1-8B-Instruct`). Ensure the server is accessible from the monitoring stack (you might need `--host 0.0.0.0` if running in Docker). By default, the metrics endpoint will be available at `http://:30000/metrics`. @@ -229,3 +230,38 @@ python3 -m sglang.bench_serving \ to generate some requests. Then you should be able to see the metrics in the Grafana dashboard. + +## Estimated Performance Metrics (MFU-related) + +SGLang exports the following estimated per-GPU counters that can be used to derive +Model FLOPs Utilization (MFU)-related signals: + +- `sglang:estimated_flops_per_gpu_total`: Estimated floating-point operations. +- `sglang:estimated_read_bytes_per_gpu_total`: Estimated bytes read from memory. 
+- `sglang:estimated_write_bytes_per_gpu_total`: Estimated bytes written to memory. + +These metrics are available when both `--enable-metrics` and +`--enable-mfu-metrics` are enabled. + +These are cumulative counters. Use Prometheus `rate(...)` to get per-second values. + +### PromQL examples + +Average TFLOPS per GPU: + +```promql +rate(sglang:estimated_flops_per_gpu_total[1m]) / 1e12 +``` + +Average estimated memory bandwidth in GB/s: + +```promql +(rate(sglang:estimated_read_bytes_per_gpu_total[1m]) + + rate(sglang:estimated_write_bytes_per_gpu_total[1m])) / 1e9 +``` + +### Notes + +- These metrics are estimates intended for observability and trend analysis. +- Estimated memory bytes reflect modeled traffic and are not a direct hardware + counter from GPU profilers. diff --git a/docs/references/production_request_trace.md b/docs/references/production_request_trace.md index 2d19570c2158..d1dfdd2f067d 100644 --- a/docs/references/production_request_trace.md +++ b/docs/references/production_request_trace.md @@ -1,6 +1,6 @@ # Production Request Tracing -SGlang exports request trace data based on the OpenTelemetry Collector. You can enable tracing by adding the `--enable-trace` and configure the OpenTelemetry Collector endpoint using `--otlp-traces-endpoint` when launching the server. +SGLang exports request trace data based on the OpenTelemetry Collector. You can enable tracing by adding the `--enable-trace` and configure the OpenTelemetry Collector endpoint using `--otlp-traces-endpoint` when launching the server. You can find example screenshots of the visualization in https://github.com/sgl-project/sglang/issues/8965. @@ -17,23 +17,23 @@ This section explains how to configure the request tracing and export the trace pip install opentelemetry-sdk opentelemetry-api opentelemetry-exporter-otlp opentelemetry-exporter-otlp-proto-grpc ``` -2. launch opentelemetry collector and jaeger +2. 
Launch OpenTelemetry collector and Jaeger ```bash docker compose -f examples/monitoring/tracing_compose.yaml up -d ``` -3. start your SGLang server with tracing enabled +3. Start your SGLang server with tracing enabled ```bash # set env variables export SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS=500 export SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE=64 # start the prefill and decode server python -m sglang.launch_server --enable-trace --otlp-traces-endpoint 0.0.0.0:4317 - # start the mini lb + # start the model gateway python -m sglang_router.launch_router --enable-trace --otlp-traces-endpoint 0.0.0.0:4317 ``` - Replace `0.0.0.0:4317` with the actual endpoint of the opentelemetry collector. If you launched the openTelemetry collector with tracing_compose.yaml, the default receiving port is 4317. + Replace `0.0.0.0:4317` with the actual endpoint of the OpenTelemetry collector. If you launched the OpenTelemetry collector with `tracing_compose.yaml`, the default receiving port is 4317. To use the HTTP/protobuf span exporter, set the following environment variable and point to an HTTP endpoint, for example, `http://0.0.0.0:4318/v1/traces`. ```bash @@ -41,15 +41,33 @@ This section explains how to configure the request tracing and export the trace ``` -4. raise some requests +4. Send some requests 5. Observe whether trace data is being exported * Access port 16686 of Jaeger using a web browser to visualize the request traces. * The OpenTelemetry Collector also exports trace data in JSON format to /tmp/otel_trace.json. In a follow-up patch, we will provide a tool to convert this data into a Perfetto-compatible format, enabling visualization of requests in the Perfetto UI. -## How to add Tracing for slices you're interested in? +6. Dynamically adjust trace level + The trace level accepts configurable values from `0` to `3`.
The meanings of different trace level values are as follows: + ``` + 0: Disable tracing + 1: Trace important slices + 2: Trace all slices except nested ones + 3: Trace all slices + ``` + The trace level can be dynamically set via HTTP API, for example: + ```bash + curl "http://0.0.0.0:30000/set_trace_level?level=2" + ``` + Replace `0.0.0.0:30000` with your actual server address, and replace `level=2` with the level you want to set. + + **Note**: You must set the parameter `--enable-trace`; otherwise, the trace capability will not be enabled regardless of any dynamic adjustments to the trace level. + +## How to add Tracing for slices you're interested in? (API introduction) We have already inserted instrumentation points in the tokenizer and scheduler main threads. If you wish to trace additional request execution segments or perform finer-grained tracing, please use the APIs from the tracing package as described below. -1. initialization +**All of the following implementations are done in `python/sglang/srt/observability/req_time_stats.py`. If you want to add another slice, please do it here.** + +1. Initialization Every process involved in tracing during the initialization phase should execute: ```python @@ -63,98 +81,53 @@ We have already inserted instrumentation points in the tokenizer and scheduler m ``` The "thread label" can be regarded as the name of the thread, used to distinguish different threads in the visualization view. -2. Mark the beginning and end of a request +2. Create a trace context for a request + Each request needs to call `TraceReqContext()` to initialize a request context, which is used to generate slice spans and record request stage info. You can either store it within the request object or maintain it as a global variable. + +3. Mark the beginning and end of a request ``` - trace_req_start(rid, bootstrap_room) - trace_req_finish(rid) + trace_ctx.trace_req_start()
+ trace_ctx.trace_req_finish() ``` - These two APIs must be called within the same process, for example, in the tokenizer. + trace_req_start() and trace_req_finish() must be called within the same process, for example, in the tokenizer. -3. Add tracing for slice +4. Add tracing for a slice * Add slice tracing normally: ```python - trace_slice_start("slice A", rid) - trace_slice_end("slice A", rid) - ``` + trace_ctx.trace_slice_start(RequestStage.TOKENIZER.stage_name) + trace_ctx.trace_slice_end(RequestStage.TOKENIZER.stage_name) - - Use the "anonymous" flag to not specify a slice name at the start of the slice, allowing the slice name to be determined by trace_slice_end. -
Note: Anonymous slices must not be nested. - ```python - trace_slice_start("", rid, anonymous = True) - trace_slice_end("slice A", rid) + or + trace_ctx.trace_slice(slice: TraceSliceContext) ``` - - In trace_slice_end, use auto_next_anon to automatically create the next anonymous slice, which can reduce the number of instrumentation points needed. + - The end of the last slice in a thread must be marked with thread_finish_flag=True, or explicitly call trace_ctx.abort(); otherwise, the thread's span will not be properly generated. ```python - trace_slice_start("", rid, anonymous = True) - trace_slice_end("slice A", rid, auto_next_anon = True) - trace_slice_end("slice B", rid, auto_next_anon = True) - trace_slice_end("slice C", rid, auto_next_anon = True) - trace_slice_end("slice D", rid) - ``` - - The end of the last slice in a thread must be marked with thread_finish_flag=True; otherwise, the thread's span will not be properly generated. - ```python - trace_slice_end("slice D", rid, thread_finish_flag = True) + trace_ctx.slice_end(RequestStage.D.stage_name, thread_finish_flag = True) + trace_ctx.abort() ``` -4. When the request execution flow transfers to another thread, the trace context needs to be explicitly propagated. - - sender: Execute the following code before sending the request to another thread via ZMQ - ```python - trace_context = trace_get_proc_propagate_context(rid) - req.trace_context = trace_context - ``` +5. When the request execution flow transfers to another thread, the thread context needs to be explicitly rebuilt. - receiver: Execute the following code after receiving the request via ZMQ ```python - trace_set_proc_propagate_context(rid, req.trace_context) - ``` - -5. When the request execution flow transfers to another node(PD disaggregation), the trace context needs to be explicitly propagated. 
- - sender: Execute the following code before sending the request to node thread via http - ```python - trace_context = trace_get_remote_propagate_context(bootstrap_room_list) - headers = {"trace_context": trace_context} - session.post(url, headers=headers) - ``` - - receiver: Execute the following code after receiving the request via http - ```python - trace_set_remote_propagate_context(request.headers['trace_context']) + trace_ctx.rebuild_thread_context() ``` ## How to Extend the Tracing Framework to Support Complex Tracing Scenarios The currently provided tracing package still has potential for further development. If you wish to build more advanced features upon it, you must first understand its existing design principles. -The core of the tracing framework's implementation lies in the design of the span structure and the trace context. To aggregate scattered slices and enable concurrent tracking of multiple requests, we have designed a two-level trace context structure and a four-level span structure: `SglangTraceReqContext`, `SglangTraceThreadContext`. Their relationship is as follows: +The core of the tracing framework's implementation lies in the design of the span structure and the trace context. To aggregate scattered slices and enable concurrent tracking of multiple requests, we have designed a three-level trace context structure or span structure: `TraceReqContext`, `TraceThreadContext` and `TraceSliceContext`. Their relationship is as follows: ``` -SglangTraceReqContext (req_id="req-123") -├── SglangTraceThreadContext(thread_label="scheduler", tp_rank=0) +TraceReqContext (req_id="req-123") +├── TraceThreadContext(thread_label="scheduler", tp_rank=0) +| └── TraceSliceContext(slice_name="prefill") | -└── SglangTraceThreadContext(thread_label="scheduler", tp_rank=1) +└── TraceThreadContext(thread_label="scheduler", tp_rank=1) + └── TraceSliceContext(slice_name="prefill") ``` -Each traced request maintains a global `SglangTraceReqContext`. 
For every thread processing the request, a corresponding `SglangTraceThreadContext` is recorded and composed within the `SglangTraceReqContext`. Within each thread, every currently traced slice (possibly nested) is stored in a list. +Each traced request maintains a global `TraceReqContext` and creates a corresponding request span. For every thread that processes the request, a `TraceThreadContext` is recorded and a thread span is created. The `TraceThreadContext` is nested within the `TraceReqContext`, and each currently traced code slice—potentially nested—is stored in its associated `TraceThreadContext`. In addition to the above hierarchy, each slice also records its previous slice via Span.add_link(), which can be used to trace the execution flow. - -When the request execution flow transfers to a new thread, the trace context needs to be explicitly propagated. In the framework, this is represented by `SglangTracePropagateContext`, which contains the context of the request span and the previous slice span. - - -We designed a four-level span structure, consisting of `bootstrap_room_span`, `req_root_span`, `thread_span`, and `slice_span`. Among them, `req_root_span` and `thread_span` correspond to `SglangTraceReqContext` and `SglangTraceThreadContext`, respectively, and `slice_span` is stored within the `SglangTraceThreadContext`. The `bootstrap_room_span` is designed to accommodate the separation of PD-disaggregation. On different nodes, we may want to add certain attributes to the `req_root_span`. However, if the `req_root_span` is shared across all nodes, the Prefill and Decode nodes would not be allowed to add attributes due to the constraints imposed by OpenTelemetry's design. 
- -``` -bootstrap room span -├── router req root span -| └── router thread span -| └── slice span -├── prefill req root span -| ├── tokenizer thread span -| | └── slice span -| └── scheduler thread span -| └── slice span -└── decode req root span - ├── tokenizer thread span - | └── slice span - └── scheduler thread span - └── slice span -``` diff --git a/docs/references/release_lookup.rst b/docs/references/release_lookup.rst new file mode 100644 index 000000000000..2e8833f6c78d --- /dev/null +++ b/docs/references/release_lookup.rst @@ -0,0 +1,325 @@ +Release Lookup +============== + +Find which SGLang release first included a specific PR or commit. + +.. raw:: html + + + +
+
+ + +
+ +
+
Initializing…
+
+ + diff --git a/docs/release_lookup/README.md b/docs/release_lookup/README.md new file mode 100644 index 000000000000..3472ded2f21f --- /dev/null +++ b/docs/release_lookup/README.md @@ -0,0 +1,39 @@ +# SGLang Release Lookup Tool + +This tool allows users to find the earliest release that contains a specific PR or commit. +It runs entirely in the browser using a static JSON index generated from the git history. + +## Usage + +1. **Generate the Index**: + Run the Python script to generate the `release_index.json` file from your local git repository. + + ```bash + python3 generate_index.py --output release_index.json + ``` + + This script: + - Finds all tags matching `v*` and `gateway-v*`. + - Sorts them by creation date. + - Traverses the history to find which release first introduced each commit and PR. + - Extracts PR numbers from commit messages. + +2. **Open the Tool**: + Open `index.html` in your browser. + + ```bash + # You can open it directly if your browser supports local file fetch (Firefox usually does), + # or serve it locally: + python3 -m http.server + # Then go to http://localhost:8000/index.html + ``` + +## Files + +- `index.html`: The UI for the lookup tool. +- `generate_index.py`: Script to build the index. +- `release_index.json`: The index file used by the UI. + +## Logic + +The tool determines the "earliest release" based on the tag creation date. It traverses tags from oldest to newest. Any commit reachable from a tag (that wasn't reachable from a previous tag) is assigned to that release. 
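The "earliest release" rule described above can be sketched in a few lines. This is a minimal illustration only: it assumes the tags are already sorted oldest first and that the set of commits reachable from each tag is given as plain in-memory data. The function name `assign_earliest_release` and the sample tags and commit IDs are hypothetical and are not taken from `generate_index.py`, which obtains the same information from git.

```python
def assign_earliest_release(tags_oldest_first, reachable):
    """Map each commit to the first tag (oldest first) that can reach it."""
    earliest = {}
    for tag in tags_oldest_first:
        for commit in reachable[tag]:
            # setdefault keeps the first (oldest) tag that reached the commit,
            # so later tags never overwrite an earlier assignment.
            earliest.setdefault(commit, tag)
    return earliest


# Hypothetical history: v0.2.0 contains everything in v0.1.0 plus two new commits.
reachable = {
    "v0.1.0": ["a1", "b2"],
    "v0.2.0": ["a1", "b2", "c3", "d4"],
}
earliest = assign_earliest_release(["v0.1.0", "v0.2.0"], reachable)
print(earliest)
```

Because the tags are visited from oldest to newest, a commit that is reachable from several tags keeps the oldest one, which matches the behaviour the Logic section describes.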
diff --git a/docs/release_lookup/generate_index.py b/docs/release_lookup/generate_index.py new file mode 100644 index 000000000000..d8415e41deda --- /dev/null +++ b/docs/release_lookup/generate_index.py @@ -0,0 +1,222 @@ +import argparse +import json +import os +import re +import subprocess +import sys +from datetime import datetime + +# Short hash length for commits (7 is git's default short hash) +SHORT_HASH_LEN = 8 +COMMIT_CHUNK_SIZE = 1000 + + +def run_git(cmd): + try: + output = subprocess.check_output(cmd, stderr=subprocess.STDOUT) + return output.decode("utf-8", errors="replace").strip() + except subprocess.CalledProcessError as e: + print(f"Error running cmd: {cmd}\n{e.output.decode('utf-8', errors='replace')}") + sys.exit(1) + + +def is_stable_release(tag_name): + """Check if tag is a stable release (not rc/alpha/beta).""" + # Skip release candidates, alpha, beta versions + if re.search(r"(rc|alpha|beta)\d*$", tag_name, re.IGNORECASE): + return False + return True + + +def get_tags(): + # Get tags sorted by creator date + cmd = [ + "git", + "tag", + "--list", + "v*", + "gateway-v*", + "--sort=creatordate", + "--format=%(refname:short)|%(creatordate:iso8601)|%(objectname)", + ] + raw = run_git(cmd) + tags = [] + if not raw: + return [] + for line in raw.split("\n"): + parts = line.split("|") + if len(parts) >= 3: + name, date, commit = parts[0], parts[1], parts[2] + # Skip non-stable releases (rc, alpha, beta) + if not is_stable_release(name): + continue + tag_type = "gateway" if name.startswith("gateway-") else "main" + tags.append( + {"name": name, "date": date, "commit": commit, "type": tag_type} + ) + return tags + + +def extract_pr_num(message): + lines = message.strip().split("\n") + first_line = lines[0] + + m = re.search(r"\(#(\d+)\)$", first_line) + if m: + return m.group(1) + + m = re.search(r"Merge pull request #(\d+)", message) + if m: + return m.group(1) + + return None + + +def process_tag_line(tags, commit_map, pr_map, tag_type, tag_to_idx): 
+ """Process a single release line (main or gateway) independently.""" + seen_commits = set() + + for tag in tags: + tag_name = tag["name"] + print(f"Processing {tag_name}...") + + commits = run_git(["git", "rev-list", tag_name]).split("\n") + + new_commits = [] + for c in commits: + c = c.strip() + if not c: + continue + if c in seen_commits: + continue + new_commits.append(c) + seen_commits.add(c) + + if not new_commits: + continue + + for i in range(0, len(new_commits), COMMIT_CHUNK_SIZE): + chunk = new_commits[i : i + COMMIT_CHUNK_SIZE] + + cmd = ["git", "show", "-s", "--format=%H|%B%n--END-COMMIT--"] + chunk + raw_logs = run_git(cmd) + + entries = raw_logs.split("--END-COMMIT--\n") + for log_entry in entries: + if not log_entry.strip(): + continue + parts = log_entry.split("|", 1) + if len(parts) < 2: + continue + sha = parts[0].strip() + msg = parts[1].strip() + + tag_idx = tag_to_idx[tag_name] + + # Store release index using full SHA as key + if sha not in commit_map: + commit_map[sha] = {} + commit_map[sha][tag_type] = tag_idx + + pr = extract_pr_num(msg) + if pr: + if pr not in pr_map: + pr_map[pr] = {} + if tag_type not in pr_map[pr]: + pr_map[pr][tag_type] = tag_idx + + +def main(): + parser = argparse.ArgumentParser( + description="Generate lookup index for sglang releases" + ) + parser.add_argument( + "--output", default="release_index.json", help="Output JSON file" + ) + args = parser.parse_args() + + tags = get_tags() + print(f"Found {len(tags)} tags.") + + main_tags = [t for t in tags if t["type"] == "main"] + gateway_tags = [t for t in tags if t["type"] == "gateway"] + + print(f" - {len(main_tags)} main tags") + print(f" - {len(gateway_tags)} gateway tags") + + # Build tag list and index mapping + # Tags array: [name, date, type] for each tag + tag_list = [] + tag_to_idx = {} + + for tag in tags: + tag_to_idx[tag["name"]] = len(tag_list) + # Compact format: [name, date, type (0=main, 1=gateway)] + tag_list.append( + [tag["name"], tag["date"], 1 if 
tag["type"] == "gateway" else 0] + ) + + pr_map = {} + commit_map_full = {} + + process_tag_line(main_tags, commit_map_full, pr_map, "m", tag_to_idx) + process_tag_line(gateway_tags, commit_map_full, pr_map, "g", tag_to_idx) + + # Convert full SHAs to short SHAs, checking for collisions + commit_map = {} + short_to_full_map = {} + for full_sha, data in commit_map_full.items(): + short_sha = full_sha[:SHORT_HASH_LEN] + if short_sha in short_to_full_map and short_to_full_map[short_sha] != full_sha: + print( + f"CRITICAL: Short SHA collision detected for '{short_sha}'\n" + f" Commit 1: {short_to_full_map[short_sha]}\n" + f" Commit 2: {full_sha}\n" + "Please increase SHORT_HASH_LEN and re-run.", + file=sys.stderr, + ) + sys.exit(1) + commit_map[short_sha] = data + short_to_full_map[short_sha] = full_sha + + # Compact output format: + # - tags: array of [name, date, type] + # - prs: {pr_num: tag_idx} or {pr_num: {m: idx, g: idx}} + # - commits: {short_hash: tag_idx} or {short_hash: {m: idx, g: idx}} + + # Simplify single-entry dicts to just the value + def simplify_map(m): + result = {} + for k, v in m.items(): + if len(v) == 1: + # Single entry: just store the index directly with type prefix + key_type, idx = list(v.items())[0] + result[k] = f"{key_type}{idx}" + else: + # Multiple entries: keep as dict + result[k] = v + return result + + output_data = { + "t": tag_list, # tags + "p": simplify_map(pr_map), # prs + "c": simplify_map(commit_map), # commits + "g": datetime.now().isoformat(), # generated_at + } + + # Write minified JSON with a trailing newline for formatter compatibility. 
+ json_str = json.dumps(output_data, separators=(",", ":")) + + with open(args.output, "w", encoding="utf-8") as f: + f.write(json_str) + f.write("\n") + + json_size = os.path.getsize(args.output) + + print(f"Index generated at {args.output}") + print(f"Stats: {len(tag_list)} tags, {len(pr_map)} PRs, {len(commit_map)} commits.") + print(f"Size: {json_size/1024:.1f} KB") + + +if __name__ == "__main__": + main() diff --git a/docs/release_lookup/index.html b/docs/release_lookup/index.html new file mode 100644 index 000000000000..dc8219de5590 --- /dev/null +++ b/docs/release_lookup/index.html @@ -0,0 +1,515 @@ + + + + + + SGLang Release Lookup + + + + +
+

Release Lookup

+

Find which SGLang release first included your PR or commit.

+ +
+ + +
+ + + +
+ +
Initializing...
+
+ + + + + diff --git a/docs/release_lookup/release_index.json b/docs/release_lookup/release_index.json new file mode 100644 index 000000000000..4a8606e2499a --- /dev/null +++ b/docs/release_lookup/release_index.json @@ -0,0 +1 @@ +{"t":[["v0.1.3","2024-01-16 05:55:25 +0000",0],["v0.1.5","2024-01-17 18:37:02 -0800",0],["v0.1.6","2024-01-21 01:45:02 -0800",0],["v0.1.7","2024-01-21 10:31:02 +0000",0],["v0.1.8","2024-01-24 03:33:34 -0800",0],["v0.1.9","2024-01-24 11:37:25 +0000",0],["v0.1.10","2024-01-30 15:37:52 +0000",0],["v0.1.11","2024-02-03 02:50:13 -0800",0],["v0.1.12","2024-02-11 06:43:45 -0800",0],["v0.1.13","2024-03-11 05:49:27 -0700",0],["v0.1.14","2024-03-22 13:42:22 -0700",0],["v0.1.15","2024-05-12 14:22:33 -0700",0],["v0.1.16","2024-05-13 17:29:17 -0700",0],["v0.1.17","2024-06-07 19:49:18 -0700",0],["v0.1.18","2024-07-04 06:27:29 +0000",0],["v0.1.19","2024-07-09 02:23:14 -0700",0],["v0.1.20","2024-07-13 17:27:55 -0700",0],["v0.1.21","2024-07-15 13:10:53 -0700",0],["v0.1.22","2024-07-20 03:39:50 -0700",0],["v0.1.23","2024-07-23 13:49:34 -0700",0],["v0.1.24","2024-07-24 15:55:01 -0700",0],["v0.2.0","2024-07-25 08:03:36 -0700",0],["v0.2.5","2024-07-27 05:56:30 +1000",0],["v0.2.6","2024-07-27 20:29:33 -0700",0],["v0.2.7","2024-07-30 20:41:10 +1000",0],["v0.2.8","2024-08-01 14:18:26 -0700",0],["v0.2.9","2024-08-02 01:45:48 -0700",0],["v0.2.9.post1","2024-08-02 12:08:00 -0700",0],["v0.2.10","2024-08-04 16:52:51 -0700",0],["v0.2.11","2024-08-07 20:47:53 +0800",0],["v0.2.12","2024-08-12 20:59:38 +1000",0],["v0.2.13","2024-08-16 03:50:43 +1000",0],["v0.2.14","2024-08-27 00:28:24 +1000",0],["v0.2.14.post1","2024-08-28 21:16:47 +1000",0],["v0.2.14.post2","2024-08-28 18:46:33 +0000",0],["v0.2.15","2024-09-01 22:22:38 -0700",0],["v0.3.0","2024-09-04 04:21:21 -0700",0],["v0.3.1.post1","2024-09-17 01:47:31 -0700",0],["v0.3.1.post2","2024-09-19 02:03:38 -0700",0],["v0.3.1.post3","2024-09-21 11:17:45 +0800",0],["v0.3.2","2024-09-25 14:17:09 +0800",0],["v0.3.3","2024-10-08 
12:58:41 -0700",0],["v0.3.3.post1","2024-10-11 07:56:16 -0700",0],["v0.3.4","2024-10-19 08:17:41 -0700",0],["v0.3.4.post1","2024-10-21 21:16:43 -0700",0],["v0.3.4.post2","2024-10-25 11:07:19 -0700",0],["v0.3.5","2024-11-03 13:48:11 -0800",0],["v0.3.5.post1","2024-11-13 10:27:12 -0800",0],["v0.3.5.post2","2024-11-15 06:54:00 -0800",0],["v0.3.6","2024-11-22 19:27:30 +0800",0],["v0.3.6.post1","2024-11-25 17:31:37 -0800",0],["v0.3.6.post2","2024-11-27 03:35:30 -0800",0],["v0.3.6.post3","2024-11-30 01:41:16 +0800",0],["v0.4.0","2024-12-03 11:55:41 -0800",0],["v0.4.0.post1","2024-12-06 06:08:19 -0800",0],["v0.4.0.post2","2024-12-21 21:16:34 +0800",0],["v0.4.1","2024-12-26 07:14:51 +0800",0],["v0.4.1.post1","2024-12-28 00:11:06 +0800",0],["v0.4.1.post2","2024-12-30 00:11:46 +0800",0],["v0.4.1.post3","2024-12-29 14:25:53 -0800",0],["v0.4.1.post4","2025-01-06 01:29:54 +0800",0],["v0.4.1.post5","2025-01-11 23:10:02 +0800",0],["v0.4.1.post6","2025-01-15 16:23:42 +0800",0],["v0.4.1.post7","2025-01-20 21:50:55 +0800",0],["v0.4.2","2025-01-27 21:42:05 +0800",0],["v0.4.2.post1","2025-01-31 20:35:55 +0800",0],["v0.4.2.post2","2025-02-05 17:35:02 +0800",0],["v0.4.2.post3","2025-02-07 08:20:03 -0800",0],["v0.4.2.post4","2025-02-10 14:12:16 +0800",0],["v0.4.3","2025-02-14 09:43:14 +0800",0],["v0.4.3.post1","2025-02-17 21:58:19 +0800",0],["v0.4.3.post2","2025-02-18 02:48:30 +0800",0],["v0.4.3.post3","2025-03-05 17:26:10 -0800",0],["v0.4.3.post4","2025-03-06 12:50:28 -0800",0],["v0.4.4","2025-03-13 02:49:58 -0700",0],["v0.4.4.post1","2025-03-13 17:53:46 -0700",0],["v0.4.4.post2","2025-03-26 19:58:00 -0700",0],["v0.4.4.post3","2025-03-28 23:21:24 -0700",0],["v0.4.4.post4","2025-04-05 15:36:17 -0700",0],["v0.4.5","2025-04-07 00:35:00 -0700",0],["v0.4.5.post1","2025-04-15 23:00:07 -0700",0],["v0.4.5.post2","2025-04-20 14:12:37 -0700",0],["v0.4.5.post3","2025-04-21 18:16:20 -0700",0],["v0.4.6","2025-04-27 14:07:05 -0700",0],["v0.4.6.post1","2025-04-28 12:57:08 
-0700",0],["v0.4.6.post2","2025-04-30 22:04:40 -0700",0],["v0.4.6.post3","2025-05-09 15:38:47 -0700",0],["v0.4.6.post4","2025-05-13 01:57:51 -0700",0],["v0.4.6.post5","2025-05-24 00:48:05 -0700",0],["v0.4.7","2025-06-10 01:56:20 -0700",0],["v0.4.7.post1","2025-06-16 15:20:29 -0700",0],["v0.4.8","2025-06-23 23:14:22 -0700",0],["v0.4.8.post1","2025-06-26 02:21:12 -0700",0],["v0.4.9","2025-07-05 17:40:29 -0700",0],["gateway-v0.1.5","2025-07-06 22:54:17 -0700",1],["v0.4.9.post1","2025-07-09 00:28:17 -0700",0],["v0.4.9.post2","2025-07-11 21:11:20 -0700",0],["gateway-v0.1.6","2025-07-20 23:13:20 -0700",1],["v0.4.9.post3","2025-07-22 15:55:48 -0700",0],["v0.4.9.post4","2025-07-25 17:12:47 -0700",0],["v0.4.9.post5","2025-07-28 02:11:06 -0700",0],["v0.4.9.post6","2025-07-29 02:30:07 -0700",0],["v0.4.10","2025-07-31 20:50:17 +0800",0],["gateway-v0.1.7","2025-07-31 11:24:12 -0700",1],["gateway-v0.1.8","2025-07-31 19:00:23 -0700",1],["v0.4.10.post1","2025-08-01 12:07:30 +0800",0],["v0.4.10.post2","2025-08-03 03:43:29 -0700",0],["gateway-v0.1.9","2025-08-07 09:29:12 -0700",1],["v0.5.1","2025-08-23 07:09:26 -0700",0],["v0.5.1.post1","2025-08-24 01:14:17 -0700",0],["v0.5.1.post2","2025-08-25 03:45:09 -0700",0],["v0.5.1.post3","2025-08-27 15:42:42 -0700",0],["v0.5.2","2025-09-11 16:09:20 -0700",0],["v0.5.3","2025-10-06 20:07:02 +0800",0],["v0.5.3.post1","2025-10-09 15:19:59 -0700",0],["gateway-v0.2.0","2025-10-14 22:10:30 -0400",1],["v0.5.3.post2","2025-10-15 16:49:14 -0700",0],["v0.5.3.post3","2025-10-16 13:14:55 -0700",0],["gateway-v0.2.1","2025-10-20 21:08:45 -0700",1],["v0.5.4","2025-10-23 18:01:40 -0700",0],["v0.5.4.post1","2025-10-27 09:35:20 +0800",0],["gateway-v0.2.2","2025-10-30 14:40:13 -0700",1],["v0.5.4.post2","2025-10-31 17:38:50 -0700",0],["v0.5.4.post3","2025-11-04 18:32:11 -0800",0],["v0.5.5","2025-11-07 00:46:19 +0800",0],["v0.5.5.post1","2025-11-10 11:53:43 -0800",0],["v0.5.5.post2","2025-11-12 20:35:20 +0800",0],["gateway-v0.2.3","2025-11-14 19:04:20 
-0800",1],["v0.5.5.post3","2025-11-16 17:55:38 -0800",0],["v0.5.6","2025-12-02 17:17:13 -0800",0],["v0.5.6.post1","2025-12-08 13:41:01 -0800",0],["gateway-v0.2.4","2025-12-09 16:36:17 -0800",1],["v0.5.6.post2","2025-12-11 12:29:52 -0800",0],["gateway-v0.3.0","2025-12-24 16:25:05 -0500",1],["v0.5.7","2026-01-01 10:59:48 +0800",0],["gateway-v0.3.1","2026-01-08 21:50:34 -0800",1],["v0.5.8","2026-01-23 09:58:11 -0800",0],["v0.5.8.post1","2026-02-05 20:56:52 +0800",0]],"p":{"10":{"m":0,"g":94},"8":{"m":0,"g":94},"7":{"m":0,"g":94},"6":{"m":0,"g":94},"4":{"m":0,"g":94},"3":{"m":0,"g":94},"2":{"m":0,"g":94},"1":{"m":0,"g":94},"32":{"m":1,"g":94},"18":{"m":1,"g":94},"30":{"m":1,"g":94},"20":{"m":1,"g":94},"19":{"m":1,"g":94},"9":{"m":1,"g":94},"17":{"m":1,"g":94},"16":{"m":1,"g":94},"15":{"m":1,"g":94},"12":{"m":1,"g":94},"11":{"m":1,"g":94},"68":{"m":2,"g":94},"67":{"m":2,"g":94},"64":{"m":2,"g":94},"63":{"m":2,"g":94},"58":{"m":2,"g":94},"57":{"m":2,"g":94},"36":{"m":2,"g":94},"52":{"m":2,"g":94},"50":{"m":2,"g":94},"49":{"m":2,"g":94},"47":{"m":2,"g":94},"46":{"m":2,"g":94},"45":{"m":2,"g":94},"42":{"m":2,"g":94},"34":{"m":2,"g":94},"33":{"m":2,"g":94},"93":{"m":4,"g":94},"92":{"m":4,"g":94},"90":{"m":4,"g":94},"87":{"m":4,"g":94},"84":{"m":4,"g":94},"83":{"m":4,"g":94},"82":{"m":4,"g":94},"75":{"m":4,"g":94},"80":{"m":4,"g":94},"72":{"m":4,"g":94},"37":{"m":4,"g":94},"71":{"m":4,"g":94},"113":{"m":6,"g":94},"121":{"m":6,"g":94},"120":{"m":6,"g":94},"118":{"m":6,"g":94},"114":{"m":6,"g":94},"117":{"m":6,"g":94},"108":{"m":6,"g":94},"103":{"m":6,"g":94},"101":{"m":6,"g":94},"98":{"m":6,"g":94},"48":{"m":6,"g":94},"97":{"m":6,"g":94},"95":{"m":6,"g":94},"134":{"m":7,"g":94},"133":{"m":7,"g":94},"132":{"m":7,"g":94},"112":{"m":7,"g":94},"129":{"m":7,"g":94},"125":{"m":7,"g":94},"119":{"m":7,"g":94},"116":{"m":7,"g":94},"178":{"m":8,"g":94},"177":{"m":8,"g":94},"172":{"m":8,"g":94},"174":{"m":8,"g":94},"168":{"m":8,"g":94},"170":{"m":8,"g":94},"162":{"m":8,"g":94},"160":{"m"
:8,"g":94},"156":{"m":8,"g":94},"155":{"m":8,"g":94},"130":{"m":8,"g":94},"141":{"m":8,"g":94},"153":{"m":8,"g":94},"148":{"m":8,"g":94},"146":{"m":8,"g":94},"144":{"m":8,"g":94},"142":{"m":8,"g":94},"137":{"m":8,"g":94},"136":{"m":8,"g":94},"280":{"m":9,"g":94},"279":{"m":9,"g":94},"230":{"m":9,"g":94},"277":{"m":9,"g":94},"278":{"m":9,"g":94},"256":{"m":9,"g":94},"222":{"m":9,"g":94},"261":{"m":9,"g":94},"275":{"m":9,"g":94},"263":{"m":9,"g":94},"201":{"m":9,"g":94},"224":{"m":9,"g":94},"253":{"m":9,"g":94},"226":{"m":9,"g":94},"195":{"m":9,"g":94},"198":{"m":9,"g":94},"225":{"m":9,"g":94},"219":{"m":9,"g":94},"193":{"m":9,"g":94},"210":{"m":9,"g":94},"207":{"m":9,"g":94},"200":{"m":9,"g":94},"196":{"m":9,"g":94},"189":{"m":9,"g":94},"186":{"m":9,"g":94},"184":{"m":9,"g":94},"181":{"m":9,"g":94},"182":{"m":9,"g":94},"324":{"m":10,"g":94},"323":{"m":10,"g":94},"301":{"m":10,"g":94},"304":{"m":10,"g":94},"311":{"m":10,"g":94},"291":{"m":10,"g":94},"290":{"m":10,"g":94},"288":{"m":10,"g":94},"286":{"m":10,"g":94},"287":{"m":10,"g":94},"282":{"m":10,"g":94},"242":{"m":10,"g":94},"281":{"m":10,"g":94},"431":{"m":11,"g":94},"430":{"m":11,"g":94},"429":{"m":11,"g":94},"428":{"m":11,"g":94},"427":{"m":11,"g":94},"422":{"m":11,"g":94},"420":{"m":11,"g":94},"380":{"m":11,"g":94},"416":{"m":11,"g":94},"415":{"m":11,"g":94},"412":{"m":11,"g":94},"411":{"m":11,"g":94},"381":{"m":11,"g":94},"392":{"m":11,"g":94},"390":{"m":11,"g":94},"406":{"m":11,"g":94},"399":{"m":11,"g":94},"395":{"m":11,"g":94},"394":{"m":11,"g":94},"382":{"m":11,"g":94},"385":{"m":11,"g":94},"378":{"m":11,"g":94},"372":{"m":11,"g":94},"375":{"m":11,"g":94},"364":{"m":11,"g":94},"370":{"m":11,"g":94},"368":{"m":11,"g":94},"369":{"m":11,"g":94},"358":{"m":11,"g":94},"355":{"m":11,"g":94},"354":{"m":11,"g":94},"346":{"m":11,"g":94},"338":{"m":11,"g":94},"345":{"m":11,"g":94},"343":{"m":11,"g":94},"315":{"m":11,"g":94},"332":{"m":11,"g":94},"337":{"m":11,"g":94},"331":{"m":11,"g":94},"329":{"m":11,"g":94},"293
":{"m":11,"g":94},"327":{"m":11,"g":94},"326":{"m":11,"g":94},"298":{"m":11,"g":94},"438":{"m":12,"g":94},"437":{"m":12,"g":94},"426":{"m":12,"g":94},"436":{"m":12,"g":94},"434":{"m":12,"g":94},"418":{"m":12,"g":94},"433":{"m":12,"g":94},"363":{"m":12,"g":94},"432":{"m":12,"g":94},"515":{"m":13,"g":94},"514":{"m":13,"g":94},"505":{"m":13,"g":94},"512":{"m":13,"g":94},"502":{"m":13,"g":94},"500":{"m":13,"g":94},"511":{"m":13,"g":94},"493":{"m":13,"g":94},"491":{"m":13,"g":94},"492":{"m":13,"g":94},"488":{"m":13,"g":94},"486":{"m":13,"g":94},"480":{"m":13,"g":94},"484":{"m":13,"g":94},"477":{"m":13,"g":94},"475":{"m":13,"g":94},"476":{"m":13,"g":94},"440":{"m":13,"g":94},"471":{"m":13,"g":94},"463":{"m":13,"g":94},"470":{"m":13,"g":94},"460":{"m":13,"g":94},"459":{"m":13,"g":94},"458":{"m":13,"g":94},"457":{"m":13,"g":94},"456":{"m":13,"g":94},"250":{"m":13,"g":94},"451":{"m":13,"g":94},"449":{"m":13,"g":94},"448":{"m":13,"g":94},"447":{"m":13,"g":94},"446":{"m":13,"g":94},"441":{"m":13,"g":94},"419":{"m":13,"g":94},"579":{"m":14,"g":94},"585":{"m":14,"g":94},"583":{"m":14,"g":94},"578":{"m":14,"g":94},"577":{"m":14,"g":94},"576":{"m":14,"g":94},"574":{"m":14,"g":94},"545":{"m":14,"g":94},"571":{"m":14,"g":94},"569":{"m":14,"g":94},"568":{"m":14,"g":94},"567":{"m":14,"g":94},"566":{"m":14,"g":94},"564":{"m":14,"g":94},"563":{"m":14,"g":94},"561":{"m":14,"g":94},"560":{"m":14,"g":94},"559":{"m":14,"g":94},"558":{"m":14,"g":94},"557":{"m":14,"g":94},"556":{"m":14,"g":94},"554":{"m":14,"g":94},"550":{"m":14,"g":94},"553":{"m":14,"g":94},"551":{"m":14,"g":94},"546":{"m":14,"g":94},"542":{"m":14,"g":94},"540":{"m":14,"g":94},"539":{"m":14,"g":94},"538":{"m":14,"g":94},"517":{"m":14,"g":94},"531":{"m":14,"g":94},"516":{"m":14,"g":94},"526":{"m":14,"g":94},"525":{"m":14,"g":94},"524":{"m":14,"g":94},"518":{"m":14,"g":94},"605":{"m":15,"g":94},"503":{"m":15,"g":94},"530":{"m":15,"g":94},"603":{"m":15,"g":94},"598":{"m":15,"g":94},"602":{"m":15,"g":94},"604":{"m":15,"g":94},"6
01":{"m":15,"g":94},"600":{"m":15,"g":94},"599":{"m":15,"g":94},"586":{"m":15,"g":94},"594":{"m":15,"g":94},"593":{"m":15,"g":94},"592":{"m":15,"g":94},"588":{"m":15,"g":94},"618":{"m":16,"g":94},"616":{"m":16,"g":94},"615":{"m":16,"g":94},"614":{"m":16,"g":94},"613":{"m":16,"g":94},"612":{"m":16,"g":94},"611":{"m":16,"g":94},"610":{"m":16,"g":94},"609":{"m":16,"g":94},"607":{"m":16,"g":94},"626":{"m":17,"g":94},"625":{"m":17,"g":94},"623":{"m":17,"g":94},"621":{"m":17,"g":94},"620":{"m":17,"g":94},"619":{"m":17,"g":94},"677":{"m":18,"g":94},"676":{"m":18,"g":94},"675":{"m":18,"g":94},"664":{"m":18,"g":94},"673":{"m":18,"g":94},"671":{"m":18,"g":94},"669":{"m":18,"g":94},"668":{"m":18,"g":94},"667":{"m":18,"g":94},"640":{"m":18,"g":94},"666":{"m":18,"g":94},"665":{"m":18,"g":94},"663":{"m":18,"g":94},"662":{"m":18,"g":94},"661":{"m":18,"g":94},"660":{"m":18,"g":94},"659":{"m":18,"g":94},"655":{"m":18,"g":94},"657":{"m":18,"g":94},"658":{"m":18,"g":94},"656":{"m":18,"g":94},"654":{"m":18,"g":94},"653":{"m":18,"g":94},"651":{"m":18,"g":94},"650":{"m":18,"g":94},"648":{"m":18,"g":94},"647":{"m":18,"g":94},"649":{"m":18,"g":94},"646":{"m":18,"g":94},"645":{"m":18,"g":94},"643":{"m":18,"g":94},"642":{"m":18,"g":94},"617":{"m":18,"g":94},"638":{"m":18,"g":94},"637":{"m":18,"g":94},"636":{"m":18,"g":94},"635":{"m":18,"g":94},"632":{"m":18,"g":94},"633":{"m":18,"g":94},"624":{"m":18,"g":94},"630":{"m":18,"g":94},"631":{"m":18,"g":94},"629":{"m":18,"g":94},"628":{"m":18,"g":94},"627":{"m":18,"g":94},"705":{"m":19,"g":94},"704":{"m":19,"g":94},"701":{"m":19,"g":94},"702":{"m":19,"g":94},"700":{"m":19,"g":94},"698":{"m":19,"g":94},"697":{"m":19,"g":94},"696":{"m":19,"g":94},"695":{"m":19,"g":94},"694":{"m":19,"g":94},"692":{"m":19,"g":94},"691":{"m":19,"g":94},"690":{"m":19,"g":94},"689":{"m":19,"g":94},"688":{"m":19,"g":94},"687":{"m":19,"g":94},"686":{"m":19,"g":94},"685":{"m":19,"g":94},"684":{"m":19,"g":94},"682":{"m":19,"g":94},"681":{"m":19,"g":94},"679":{"m":19,"g":94},
"670":{"m":19,"g":94},"678":{"m":19,"g":94},"718":{"m":20,"g":94},"717":{"m":20,"g":94},"716":{"m":20,"g":94},"715":{"m":20,"g":94},"714":{"m":20,"g":94},"713":{"m":20,"g":94},"712":{"m":20,"g":94},"711":{"m":20,"g":94},"708":{"m":20,"g":94},"709":{"m":20,"g":94},"707":{"m":20,"g":94},"706":{"m":20,"g":94},"730":{"m":21,"g":94},"729":{"m":21,"g":94},"728":{"m":21,"g":94},"727":{"m":21,"g":94},"726":{"m":21,"g":94},"725":{"m":21,"g":94},"720":{"m":21,"g":94},"723":{"m":21,"g":94},"724":{"m":21,"g":94},"722":{"m":21,"g":94},"721":{"m":21,"g":94},"719":{"m":21,"g":94},"755":{"m":22,"g":94},"754":{"m":22,"g":94},"753":{"m":22,"g":94},"752":{"m":22,"g":94},"751":{"m":22,"g":94},"740":{"m":22,"g":94},"743":{"m":22,"g":94},"742":{"m":22,"g":94},"739":{"m":22,"g":94},"741":{"m":22,"g":94},"736":{"m":22,"g":94},"734":{"m":22,"g":94},"733":{"m":22,"g":94},"731":{"m":22,"g":94},"779":{"m":23,"g":94},"778":{"m":23,"g":94},"776":{"m":23,"g":94},"775":{"m":23,"g":94},"774":{"m":23,"g":94},"773":{"m":23,"g":94},"772":{"m":23,"g":94},"766":{"m":23,"g":94},"770":{"m":23,"g":94},"769":{"m":23,"g":94},"767":{"m":23,"g":94},"761":{"m":23,"g":94},"763":{"m":23,"g":94},"762":{"m":23,"g":94},"760":{"m":23,"g":94},"757":{"m":23,"g":94},"693":{"m":23,"g":94},"830":{"m":24,"g":94},"829":{"m":24,"g":94},"828":{"m":24,"g":94},"825":{"m":24,"g":94},"826":{"m":24,"g":94},"823":{"m":24,"g":94},"824":{"m":24,"g":94},"822":{"m":24,"g":94},"821":{"m":24,"g":94},"820":{"m":24,"g":94},"819":{"m":24,"g":94},"807":{"m":24,"g":94},"817":{"m":24,"g":94},"814":{"m":24,"g":94},"815":{"m":24,"g":94},"812":{"m":24,"g":94},"809":{"m":24,"g":94},"699":{"m":24,"g":94},"806":{"m":24,"g":94},"802":{"m":24,"g":94},"805":{"m":24,"g":94},"803":{"m":24,"g":94},"800":{"m":24,"g":94},"799":{"m":24,"g":94},"797":{"m":24,"g":94},"793":{"m":24,"g":94},"796":{"m":24,"g":94},"795":{"m":24,"g":94},"794":{"m":24,"g":94},"792":{"m":24,"g":94},"791":{"m":24,"g":94},"790":{"m":24,"g":94},"789":{"m":24,"g":94},"788":{"m":24,"g":94
},"787":{"m":24,"g":94},"786":{"m":24,"g":94},"785":{"m":24,"g":94},"784":{"m":24,"g":94},"783":{"m":24,"g":94},"781":{"m":24,"g":94},"877":{"m":25,"g":94},"872":{"m":25,"g":94},"873":{"m":25,"g":94},"871":{"m":25,"g":94},"870":{"m":25,"g":94},"869":{"m":25,"g":94},"864":{"m":25,"g":94},"862":{"m":25,"g":94},"861":{"m":25,"g":94},"860":{"m":25,"g":94},"811":{"m":25,"g":94},"852":{"m":25,"g":94},"858":{"m":25,"g":94},"856":{"m":25,"g":94},"855":{"m":25,"g":94},"850":{"m":25,"g":94},"848":{"m":25,"g":94},"843":{"m":25,"g":94},"842":{"m":25,"g":94},"838":{"m":25,"g":94},"840":{"m":25,"g":94},"890":{"m":26,"g":94},"886":{"m":26,"g":94},"889":{"m":26,"g":94},"888":{"m":26,"g":94},"883":{"m":26,"g":94},"882":{"m":26,"g":94},"880":{"m":26,"g":94},"879":{"m":26,"g":94},"749":{"m":26,"g":94},"878":{"m":26,"g":94},"876":{"m":26,"g":94},"875":{"m":26,"g":94},"899":{"m":27,"g":94},"896":{"m":27,"g":94},"895":{"m":27,"g":94},"894":{"m":27,"g":94},"884":{"m":27,"g":94},"891":{"m":27,"g":94},"923":{"m":28,"g":94},"916":{"m":28,"g":94},"920":{"m":28,"g":94},"918":{"m":28,"g":94},"917":{"m":28,"g":94},"915":{"m":28,"g":94},"905":{"m":28,"g":94},"914":{"m":28,"g":94},"912":{"m":28,"g":94},"911":{"m":28,"g":94},"909":{"m":28,"g":94},"866":{"m":28,"g":94},"904":{"m":28,"g":94},"908":{"m":28,"g":94},"900":{"m":28,"g":94},"970":{"m":29,"g":94},"966":{"m":29,"g":94},"967":{"m":29,"g":94},"960":{"m":29,"g":94},"965":{"m":29,"g":94},"964":{"m":29,"g":94},"963":{"m":29,"g":94},"932":{"m":29,"g":94},"936":{"m":29,"g":94},"957":{"m":29,"g":94},"953":{"m":29,"g":94},"948":{"m":29,"g":94},"941":{"m":29,"g":94},"940":{"m":29,"g":94},"835":{"m":29,"g":94},"934":{"m":29,"g":94},"935":{"m":29,"g":94},"928":{"m":29,"g":94},"927":{"m":29,"g":94},"926":{"m":29,"g":94},"925":{"m":29,"g":94},"921":{"m":29,"g":94},"1048":{"m":30,"g":94},"1052":{"m":30,"g":94},"1051":{"m":30,"g":94},"1049":{"m":30,"g":94},"1050":{"m":30,"g":94},"1033":{"m":30,"g":94},"1046":{"m":30,"g":94},"1047":{"m":30,"g":94},"1044":{"m
":30,"g":94},"1045":{"m":30,"g":94},"1039":{"m":30,"g":94},"1037":{"m":30,"g":94},"1038":{"m":30,"g":94},"1025":{"m":30,"g":94},"1034":{"m":30,"g":94},"1031":{"m":30,"g":94},"1027":{"m":30,"g":94},"1029":{"m":30,"g":94},"1028":{"m":30,"g":94},"1022":{"m":30,"g":94},"1024":{"m":30,"g":94},"907":{"m":30,"g":94},"1021":{"m":30,"g":94},"1020":{"m":30,"g":94},"1019":{"m":30,"g":94},"990":{"m":30,"g":94},"1014":{"m":30,"g":94},"1010":{"m":30,"g":94},"1009":{"m":30,"g":94},"1007":{"m":30,"g":94},"959":{"m":30,"g":94},"997":{"m":30,"g":94},"1005":{"m":30,"g":94},"1002":{"m":30,"g":94},"1001":{"m":30,"g":94},"994":{"m":30,"g":94},"995":{"m":30,"g":94},"988":{"m":30,"g":94},"993":{"m":30,"g":94},"973":{"m":30,"g":94},"992":{"m":30,"g":94},"985":{"m":30,"g":94},"981":{"m":30,"g":94},"987":{"m":30,"g":94},"983":{"m":30,"g":94},"984":{"m":30,"g":94},"982":{"m":30,"g":94},"971":{"m":30,"g":94},"980":{"m":30,"g":94},"977":{"m":30,"g":94},"968":{"m":30,"g":94},"969":{"m":30,"g":94},"976":{"m":30,"g":94},"975":{"m":30,"g":94},"1111":{"m":31,"g":94},"1113":{"m":31,"g":94},"1112":{"m":31,"g":94},"1110":{"m":31,"g":94},"1107":{"m":31,"g":94},"1040":{"m":31,"g":94},"1106":{"m":31,"g":94},"1077":{"m":31,"g":94},"1104":{"m":31,"g":94},"1092":{"m":31,"g":94},"1103":{"m":31,"g":94},"1090":{"m":31,"g":94},"1082":{"m":31,"g":94},"1099":{"m":31,"g":94},"1095":{"m":31,"g":94},"1098":{"m":31,"g":94},"1096":{"m":31,"g":94},"1094":{"m":31,"g":94},"1088":{"m":31,"g":94},"1086":{"m":31,"g":94},"1084":{"m":31,"g":94},"1056":{"m":31,"g":94},"1081":{"m":31,"g":94},"1006":{"m":31,"g":94},"1079":{"m":31,"g":94},"1078":{"m":31,"g":94},"1074":{"m":31,"g":94},"1053":{"m":31,"g":94},"1070":{"m":31,"g":94},"1060":{"m":31,"g":94},"1066":{"m":31,"g":94},"1068":{"m":31,"g":94},"1057":{"m":31,"g":94},"1155":{"m":32,"g":94},"1201":{"m":32,"g":94},"1219":{"m":32,"g":94},"1212":{"m":32,"g":94},"1218":{"m":32,"g":94},"1217":{"m":32,"g":94},"1215":{"m":32,"g":94},"1214":{"m":32,"g":94},"1204":{"m":32,"g":94},"1213":{"
m":32,"g":94},"1210":{"m":32,"g":94},"1211":{"m":32,"g":94},"1209":{"m":32,"g":94},"1208":{"m":32,"g":94},"1186":{"m":32,"g":94},"1205":{"m":32,"g":94},"1207":{"m":32,"g":94},"1199":{"m":32,"g":94},"1202":{"m":32,"g":94},"1198":{"m":32,"g":94},"1194":{"m":32,"g":94},"1193":{"m":32,"g":94},"1123":{"m":32,"g":94},"1185":{"m":32,"g":94},"1184":{"m":32,"g":94},"1180":{"m":32,"g":94},"1168":{"m":32,"g":94},"1179":{"m":32,"g":94},"1167":{"m":32,"g":94},"1170":{"m":32,"g":94},"1177":{"m":32,"g":94},"1171":{"m":32,"g":94},"1157":{"m":32,"g":94},"1166":{"m":32,"g":94},"1154":{"m":32,"g":94},"1165":{"m":32,"g":94},"1148":{"m":32,"g":94},"1164":{"m":32,"g":94},"1134":{"m":32,"g":94},"1138":{"m":32,"g":94},"1035":{"m":32,"g":94},"1144":{"m":32,"g":94},"1143":{"m":32,"g":94},"1141":{"m":32,"g":94},"1140":{"m":32,"g":94},"1139":{"m":32,"g":94},"1136":{"m":32,"g":94},"1133":{"m":32,"g":94},"1131":{"m":32,"g":94},"1013":{"m":32,"g":94},"1122":{"m":32,"g":94},"1119":{"m":32,"g":94},"1115":{"m":32,"g":94},"1114":{"m":32,"g":94},"1242":{"m":33,"g":94},"1239":{"m":33,"g":94},"1233":{"m":33,"g":94},"1237":{"m":33,"g":94},"1225":{"m":33,"g":94},"1236":{"m":33,"g":94},"1231":{"m":33,"g":94},"1230":{"m":33,"g":94},"1227":{"m":33,"g":94},"1222":{"m":33,"g":94},"1223":{"m":33,"g":94},"1125":{"m":33,"g":94},"1250":{"m":34,"g":94},"1252":{"m":34,"g":94},"1249":{"m":34,"g":94},"1234":{"m":34,"g":94},"1247":{"m":34,"g":94},"1232":{"m":34,"g":94},"1244":{"m":34,"g":94},"1243":{"m":34,"g":94},"1295":{"m":35,"g":94},"1297":{"m":35,"g":94},"1296":{"m":35,"g":94},"1294":{"m":35,"g":94},"1293":{"m":35,"g":94},"1291":{"m":35,"g":94},"1290":{"m":35,"g":94},"1277":{"m":35,"g":94},"1284":{"m":35,"g":94},"1288":{"m":35,"g":94},"1286":{"m":35,"g":94},"1289":{"m":35,"g":94},"1285":{"m":35,"g":94},"1280":{"m":35,"g":94},"1262":{"m":35,"g":94},"1282":{"m":35,"g":94},"1276":{"m":35,"g":94},"1256":{"m":35,"g":94},"1269":{"m":35,"g":94},"1267":{"m":35,"g":94},"1258":{"m":35,"g":94},"1261":{"m":35,"g":94},"1260":{
"m":35,"g":94},"1253":{"m":35,"g":94},"1255":{"m":35,"g":94},"1254":{"m":35,"g":94},"1327":{"m":36,"g":94},"1326":{"m":36,"g":94},"1320":{"m":36,"g":94},"1319":{"m":36,"g":94},"1318":{"m":36,"g":94},"1317":{"m":36,"g":94},"1313":{"m":36,"g":94},"1299":{"m":36,"g":94},"1308":{"m":36,"g":94},"1306":{"m":36,"g":94},"1304":{"m":36,"g":94},"1445":{"m":37,"g":94},"1444":{"m":37,"g":94},"1442":{"m":37,"g":94},"1420":{"m":37,"g":94},"1441":{"m":37,"g":94},"1440":{"m":37,"g":94},"1438":{"m":37,"g":94},"1432":{"m":37,"g":94},"1433":{"m":37,"g":94},"1431":{"m":37,"g":94},"1430":{"m":37,"g":94},"1428":{"m":37,"g":94},"1429":{"m":37,"g":94},"1427":{"m":37,"g":94},"1422":{"m":37,"g":94},"1426":{"m":37,"g":94},"1425":{"m":37,"g":94},"1418":{"m":37,"g":94},"1392":{"m":37,"g":94},"1414":{"m":37,"g":94},"1412":{"m":37,"g":94},"1411":{"m":37,"g":94},"1409":{"m":37,"g":94},"1408":{"m":37,"g":94},"1407":{"m":37,"g":94},"1406":{"m":37,"g":94},"1307":{"m":37,"g":94},"1397":{"m":37,"g":94},"1402":{"m":37,"g":94},"1403":{"m":37,"g":94},"1401":{"m":37,"g":94},"1399":{"m":37,"g":94},"1393":{"m":37,"g":94},"1381":{"m":37,"g":94},"1390":{"m":37,"g":94},"1389":{"m":37,"g":94},"1367":{"m":37,"g":94},"1385":{"m":37,"g":94},"1378":{"m":37,"g":94},"1380":{"m":37,"g":94},"1379":{"m":37,"g":94},"1376":{"m":37,"g":94},"1375":{"m":37,"g":94},"1373":{"m":37,"g":94},"1371":{"m":37,"g":94},"1370":{"m":37,"g":94},"1368":{"m":37,"g":94},"1300":{"m":37,"g":94},"1363":{"m":37,"g":94},"1360":{"m":37,"g":94},"1361":{"m":37,"g":94},"1341":{"m":37,"g":94},"1357":{"m":37,"g":94},"1298":{"m":37,"g":94},"1346":{"m":37,"g":94},"1281":{"m":37,"g":94},"1345":{"m":37,"g":94},"1339":{"m":37,"g":94},"1340":{"m":37,"g":94},"1337":{"m":37,"g":94},"1336":{"m":37,"g":94},"1328":{"m":37,"g":94},"1470":{"m":38,"g":94},"1469":{"m":38,"g":94},"1464":{"m":38,"g":94},"1458":{"m":38,"g":94},"1457":{"m":38,"g":94},"1454":{"m":38,"g":94},"1453":{"m":38,"g":94},"1452":{"m":38,"g":94},"1449":{"m":38,"g":94},"1451":{"m":38,"g":94},"1450":
{"m":38,"g":94},"1448":{"m":38,"g":94},"1447":{"m":38,"g":94},"1483":{"m":39,"g":94},"1484":{"m":39,"g":94},"1482":{"m":39,"g":94},"1476":{"m":39,"g":94},"1475":{"m":39,"g":94},"1305":{"m":39,"g":94},"1472":{"m":39,"g":94},"1512":{"m":40,"g":94},"1511":{"m":40,"g":94},"1508":{"m":40,"g":94},"1510":{"m":40,"g":94},"1503":{"m":40,"g":94},"1499":{"m":40,"g":94},"1502":{"m":40,"g":94},"1500":{"m":40,"g":94},"1497":{"m":40,"g":94},"1496":{"m":40,"g":94},"1494":{"m":40,"g":94},"1490":{"m":40,"g":94},"1492":{"m":40,"g":94},"1491":{"m":40,"g":94},"1489":{"m":40,"g":94},"1456":{"m":40,"g":94},"1488":{"m":40,"g":94},"1486":{"m":40,"g":94},"1481":{"m":40,"g":94},"1605":{"m":41,"g":94},"1606":{"m":41,"g":94},"1604":{"m":41,"g":94},"1598":{"m":41,"g":94},"1603":{"m":41,"g":94},"1597":{"m":41,"g":94},"1594":{"m":41,"g":94},"1596":{"m":41,"g":94},"1595":{"m":41,"g":94},"1593":{"m":41,"g":94},"1567":{"m":41,"g":94},"1592":{"m":41,"g":94},"1591":{"m":41,"g":94},"1590":{"m":41,"g":94},"1589":{"m":41,"g":94},"1587":{"m":41,"g":94},"1586":{"m":41,"g":94},"1585":{"m":41,"g":94},"1584":{"m":41,"g":94},"1573":{"m":41,"g":94},"1582":{"m":41,"g":94},"1583":{"m":41,"g":94},"1581":{"m":41,"g":94},"1576":{"m":41,"g":94},"1561":{"m":41,"g":94},"1572":{"m":41,"g":94},"1577":{"m":41,"g":94},"1580":{"m":41,"g":94},"1574":{"m":41,"g":94},"1563":{"m":41,"g":94},"1569":{"m":41,"g":94},"1568":{"m":41,"g":94},"1566":{"m":41,"g":94},"1562":{"m":41,"g":94},"1559":{"m":41,"g":94},"1536":{"m":41,"g":94},"1557":{"m":41,"g":94},"1556":{"m":41,"g":94},"1555":{"m":41,"g":94},"1553":{"m":41,"g":94},"1554":{"m":41,"g":94},"1549":{"m":41,"g":94},"1552":{"m":41,"g":94},"1550":{"m":41,"g":94},"1548":{"m":41,"g":94},"1547":{"m":41,"g":94},"1545":{"m":41,"g":94},"1544":{"m":41,"g":94},"1543":{"m":41,"g":94},"1541":{"m":41,"g":94},"1539":{"m":41,"g":94},"1538":{"m":41,"g":94},"1537":{"m":41,"g":94},"1534":{"m":41,"g":94},"1531":{"m":41,"g":94},"1532":{"m":41,"g":94},"1530":{"m":41,"g":94},"1495":{"m":41,"g":94},"1520"
:{"m":41,"g":94},"1521":{"m":41,"g":94},"1528":{"m":41,"g":94},"1525":{"m":41,"g":94},"1529":{"m":41,"g":94},"1524":{"m":41,"g":94},"1513":{"m":41,"g":94},"1636":{"m":42,"g":94},"1635":{"m":42,"g":94},"1634":{"m":42,"g":94},"1633":{"m":42,"g":94},"1632":{"m":42,"g":94},"1631":{"m":42,"g":94},"1626":{"m":42,"g":94},"1579":{"m":42,"g":94},"1611":{"m":42,"g":94},"1607":{"m":42,"g":94},"1629":{"m":42,"g":94},"1625":{"m":42,"g":94},"1619":{"m":42,"g":94},"1620":{"m":42,"g":94},"1615":{"m":42,"g":94},"1714":{"m":43,"g":94},"1713":{"m":43,"g":94},"1712":{"m":43,"g":94},"1710":{"m":43,"g":94},"1709":{"m":43,"g":94},"1707":{"m":43,"g":94},"1706":{"m":43,"g":94},"1705":{"m":43,"g":94},"1704":{"m":43,"g":94},"1703":{"m":43,"g":94},"1684":{"m":43,"g":94},"1702":{"m":43,"g":94},"1701":{"m":43,"g":94},"1700":{"m":43,"g":94},"1699":{"m":43,"g":94},"1694":{"m":43,"g":94},"1697":{"m":43,"g":94},"1696":{"m":43,"g":94},"1690":{"m":43,"g":94},"1679":{"m":43,"g":94},"1689":{"m":43,"g":94},"1688":{"m":43,"g":94},"1599":{"m":43,"g":94},"1687":{"m":43,"g":94},"1686":{"m":43,"g":94},"1685":{"m":43,"g":94},"1677":{"m":43,"g":94},"1676":{"m":43,"g":94},"1681":{"m":43,"g":94},"1674":{"m":43,"g":94},"1672":{"m":43,"g":94},"1671":{"m":43,"g":94},"1670":{"m":43,"g":94},"1658":{"m":43,"g":94},"1667":{"m":43,"g":94},"1666":{"m":43,"g":94},"1665":{"m":43,"g":94},"1459":{"m":43,"g":94},"1663":{"m":43,"g":94},"1662":{"m":43,"g":94},"1661":{"m":43,"g":94},"1659":{"m":43,"g":94},"1650":{"m":43,"g":94},"1656":{"m":43,"g":94},"1652":{"m":43,"g":94},"1654":{"m":43,"g":94},"1653":{"m":43,"g":94},"1651":{"m":43,"g":94},"1648":{"m":43,"g":94},"1480":{"m":43,"g":94},"1645":{"m":43,"g":94},"1642":{"m":43,"g":94},"1638":{"m":43,"g":94},"1614":{"m":43,"g":94},"1749":{"m":44,"g":94},"1748":{"m":44,"g":94},"1551":{"m":44,"g":94},"1746":{"m":44,"g":94},"1737":{"m":44,"g":94},"1738":{"m":44,"g":94},"1743":{"m":44,"g":94},"1741":{"m":44,"g":94},"1740":{"m":44,"g":94},"1736":{"m":44,"g":94},"1735":{"m":44,"g":94},"1734
":{"m":44,"g":94},"1727":{"m":44,"g":94},"1726":{"m":44,"g":94},"1725":{"m":44,"g":94},"1724":{"m":44,"g":94},"1722":{"m":44,"g":94},"1721":{"m":44,"g":94},"1720":{"m":44,"g":94},"1718":{"m":44,"g":94},"1716":{"m":44,"g":94},"1796":{"m":45,"g":94},"1795":{"m":45,"g":94},"1797":{"m":45,"g":94},"1794":{"m":45,"g":94},"1787":{"m":45,"g":94},"1793":{"m":45,"g":94},"1780":{"m":45,"g":94},"1789":{"m":45,"g":94},"1785":{"m":45,"g":94},"1783":{"m":45,"g":94},"1778":{"m":45,"g":94},"1782":{"m":45,"g":94},"1779":{"m":45,"g":94},"1776":{"m":45,"g":94},"1774":{"m":45,"g":94},"1773":{"m":45,"g":94},"1772":{"m":45,"g":94},"1771":{"m":45,"g":94},"1769":{"m":45,"g":94},"1768":{"m":45,"g":94},"1766":{"m":45,"g":94},"1767":{"m":45,"g":94},"1765":{"m":45,"g":94},"1760":{"m":45,"g":94},"1758":{"m":45,"g":94},"1747":{"m":45,"g":94},"1908":{"m":46,"g":94},"1907":{"m":46,"g":94},"1906":{"m":46,"g":94},"1902":{"m":46,"g":94},"1905":{"m":46,"g":94},"1904":{"m":46,"g":94},"1903":{"m":46,"g":94},"1899":{"m":46,"g":94},"1896":{"m":46,"g":94},"1895":{"m":46,"g":94},"1894":{"m":46,"g":94},"1892":{"m":46,"g":94},"1890":{"m":46,"g":94},"1888":{"m":46,"g":94},"1889":{"m":46,"g":94},"1886":{"m":46,"g":94},"1885":{"m":46,"g":94},"1883":{"m":46,"g":94},"1873":{"m":46,"g":94},"1882":{"m":46,"g":94},"1881":{"m":46,"g":94},"1879":{"m":46,"g":94},"1878":{"m":46,"g":94},"1877":{"m":46,"g":94},"1875":{"m":46,"g":94},"1871":{"m":46,"g":94},"1867":{"m":46,"g":94},"1866":{"m":46,"g":94},"1754":{"m":46,"g":94},"1856":{"m":46,"g":94},"1859":{"m":46,"g":94},"1860":{"m":46,"g":94},"1861":{"m":46,"g":94},"1858":{"m":46,"g":94},"1855":{"m":46,"g":94},"1852":{"m":46,"g":94},"1851":{"m":46,"g":94},"1846":{"m":46,"g":94},"1850":{"m":46,"g":94},"1847":{"m":46,"g":94},"1845":{"m":46,"g":94},"1842":{"m":46,"g":94},"1836":{"m":46,"g":94},"1838":{"m":46,"g":94},"1840":{"m":46,"g":94},"1839":{"m":46,"g":94},"1827":{"m":46,"g":94},"1833":{"m":46,"g":94},"1835":{"m":46,"g":94},"1834":{"m":46,"g":94},"1822":{"m":46,"g":94},"182
3":{"m":46,"g":94},"1830":{"m":46,"g":94},"1825":{"m":46,"g":94},"1790":{"m":46,"g":94},"1821":{"m":46,"g":94},"1820":{"m":46,"g":94},"1819":{"m":46,"g":94},"1810":{"m":46,"g":94},"1817":{"m":46,"g":94},"1816":{"m":46,"g":94},"1813":{"m":46,"g":94},"1811":{"m":46,"g":94},"1809":{"m":46,"g":94},"1808":{"m":46,"g":94},"1807":{"m":46,"g":94},"1805":{"m":46,"g":94},"1804":{"m":46,"g":94},"1803":{"m":46,"g":94},"1802":{"m":46,"g":94},"1801":{"m":46,"g":94},"1800":{"m":46,"g":94},"1786":{"m":46,"g":94},"1799":{"m":46,"g":94},"1798":{"m":46,"g":94},"1752":{"m":46,"g":94},"2022":{"m":47,"g":94},"2020":{"m":47,"g":94},"2018":{"m":47,"g":94},"1996":{"m":47,"g":94},"2015":{"m":47,"g":94},"2014":{"m":47,"g":94},"2013":{"m":47,"g":94},"1998":{"m":47,"g":94},"2011":{"m":47,"g":94},"2010":{"m":47,"g":94},"2009":{"m":47,"g":94},"2008":{"m":47,"g":94},"2006":{"m":47,"g":94},"2005":{"m":47,"g":94},"1994":{"m":47,"g":94},"2003":{"m":47,"g":94},"2004":{"m":47,"g":94},"2002":{"m":47,"g":94},"2001":{"m":47,"g":94},"2000":{"m":47,"g":94},"1999":{"m":47,"g":94},"1995":{"m":47,"g":94},"1997":{"m":47,"g":94},"1934":{"m":47,"g":94},"1980":{"m":47,"g":94},"1990":{"m":47,"g":94},"1988":{"m":47,"g":94},"1984":{"m":47,"g":94},"1986":{"m":47,"g":94},"1983":{"m":47,"g":94},"1981":{"m":47,"g":94},"1982":{"m":47,"g":94},"1977":{"m":47,"g":94},"1972":{"m":47,"g":94},"1976":{"m":47,"g":94},"1975":{"m":47,"g":94},"1974":{"m":47,"g":94},"1745":{"m":47,"g":94},"1973":{"m":47,"g":94},"1963":{"m":47,"g":94},"1966":{"m":47,"g":94},"1962":{"m":47,"g":94},"1961":{"m":47,"g":94},"1958":{"m":47,"g":94},"1957":{"m":47,"g":94},"1956":{"m":47,"g":94},"1955":{"m":47,"g":94},"1954":{"m":47,"g":94},"1933":{"m":47,"g":94},"1952":{"m":47,"g":94},"1951":{"m":47,"g":94},"1939":{"m":47,"g":94},"1941":{"m":47,"g":94},"1949":{"m":47,"g":94},"1940":{"m":47,"g":94},"1942":{"m":47,"g":94},"1891":{"m":47,"g":94},"1926":{"m":47,"g":94},"1922":{"m":47,"g":94},"1853":{"m":47,"g":94},"1924":{"m":47,"g":94},"1920":{"m":47,"g":94},"19
16":{"m":47,"g":94},"1893":{"m":47,"g":94},"1915":{"m":47,"g":94},"1910":{"m":47,"g":94},"1909":{"m":47,"g":94},"2046":{"m":48,"g":94},"2044":{"m":48,"g":94},"2043":{"m":48,"g":94},"2030":{"m":48,"g":94},"2042":{"m":48,"g":94},"2038":{"m":48,"g":94},"1968":{"m":48,"g":94},"2039":{"m":48,"g":94},"2036":{"m":48,"g":94},"2034":{"m":48,"g":94},"2033":{"m":48,"g":94},"2031":{"m":48,"g":94},"2027":{"m":48,"g":94},"2028":{"m":48,"g":94},"2026":{"m":48,"g":94},"2024":{"m":48,"g":94},"2023":{"m":48,"g":94},"2120":{"m":49,"g":94},"2125":{"m":49,"g":94},"2122":{"m":49,"g":94},"2110":{"m":49,"g":94},"2118":{"m":49,"g":94},"2106":{"m":49,"g":94},"2115":{"m":49,"g":94},"2055":{"m":49,"g":94},"2111":{"m":49,"g":94},"2116":{"m":49,"g":94},"2107":{"m":49,"g":94},"2105":{"m":49,"g":94},"2104":{"m":49,"g":94},"2103":{"m":49,"g":94},"2073":{"m":49,"g":94},"2100":{"m":49,"g":94},"2067":{"m":49,"g":94},"2096":{"m":49,"g":94},"2095":{"m":49,"g":94},"2093":{"m":49,"g":94},"2094":{"m":49,"g":94},"2091":{"m":49,"g":94},"2088":{"m":49,"g":94},"2089":{"m":49,"g":94},"2086":{"m":49,"g":94},"2085":{"m":49,"g":94},"2083":{"m":49,"g":94},"2078":{"m":49,"g":94},"2069":{"m":49,"g":94},"2075":{"m":49,"g":94},"2074":{"m":49,"g":94},"2072":{"m":49,"g":94},"2071":{"m":49,"g":94},"2070":{"m":49,"g":94},"2068":{"m":49,"g":94},"2062":{"m":49,"g":94},"2056":{"m":49,"g":94},"2066":{"m":49,"g":94},"2061":{"m":49,"g":94},"2065":{"m":49,"g":94},"2064":{"m":49,"g":94},"2063":{"m":49,"g":94},"1849":{"m":49,"g":94},"2053":{"m":49,"g":94},"2051":{"m":49,"g":94},"1970":{"m":49,"g":94},"2050":{"m":49,"g":94},"2049":{"m":49,"g":94},"1876":{"m":49,"g":94},"2048":{"m":49,"g":94},"2047":{"m":49,"g":94},"2189":{"m":50,"g":94},"2188":{"m":50,"g":94},"2187":{"m":50,"g":94},"2052":{"m":50,"g":94},"2184":{"m":50,"g":94},"2176":{"m":50,"g":94},"2186":{"m":50,"g":94},"2171":{"m":50,"g":94},"2185":{"m":50,"g":94},"2183":{"m":50,"g":94},"2173":{"m":50,"g":94},"2182":{"m":50,"g":94},"2180":{"m":50,"g":94},"2175":{"m":50,"g":94},"2
174":{"m":50,"g":94},"2170":{"m":50,"g":94},"2169":{"m":50,"g":94},"2167":{"m":50,"g":94},"2164":{"m":50,"g":94},"2163":{"m":50,"g":94},"2162":{"m":50,"g":94},"2158":{"m":50,"g":94},"2161":{"m":50,"g":94},"2159":{"m":50,"g":94},"2156":{"m":50,"g":94},"2154":{"m":50,"g":94},"2157":{"m":50,"g":94},"2155":{"m":50,"g":94},"2153":{"m":50,"g":94},"2152":{"m":50,"g":94},"2148":{"m":50,"g":94},"2147":{"m":50,"g":94},"2146":{"m":50,"g":94},"2144":{"m":50,"g":94},"2143":{"m":50,"g":94},"2142":{"m":50,"g":94},"2114":{"m":50,"g":94},"2139":{"m":50,"g":94},"2138":{"m":50,"g":94},"2136":{"m":50,"g":94},"2137":{"m":50,"g":94},"2134":{"m":50,"g":94},"2081":{"m":50,"g":94},"2121":{"m":50,"g":94},"2130":{"m":50,"g":94},"2124":{"m":50,"g":94},"2092":{"m":50,"g":94},"2127":{"m":50,"g":94},"2077":{"m":50,"g":94},"2126":{"m":50,"g":94},"2214":{"m":51,"g":94},"2222":{"m":51,"g":94},"2221":{"m":51,"g":94},"2217":{"m":51,"g":94},"2210":{"m":51,"g":94},"2212":{"m":51,"g":94},"2208":{"m":51,"g":94},"2207":{"m":51,"g":94},"2204":{"m":51,"g":94},"2206":{"m":51,"g":94},"2201":{"m":51,"g":94},"2199":{"m":51,"g":94},"2196":{"m":51,"g":94},"2198":{"m":51,"g":94},"2195":{"m":51,"g":94},"2197":{"m":51,"g":94},"2259":{"m":52,"g":94},"2257":{"m":52,"g":94},"2256":{"m":52,"g":94},"2254":{"m":52,"g":94},"2253":{"m":52,"g":94},"2252":{"m":52,"g":94},"2251":{"m":52,"g":94},"2250":{"m":52,"g":94},"2239":{"m":52,"g":94},"2242":{"m":52,"g":94},"2238":{"m":52,"g":94},"2231":{"m":52,"g":94},"2233":{"m":52,"g":94},"2235":{"m":52,"g":94},"2234":{"m":52,"g":94},"2232":{"m":52,"g":94},"2228":{"m":52,"g":94},"2123":{"m":52,"g":94},"2223":{"m":52,"g":94},"2224":{"m":52,"g":94},"2226":{"m":52,"g":94},"2191":{"m":52,"g":94},"2218":{"m":52,"g":94},"2338":{"m":53,"g":94},"2339":{"m":53,"g":94},"2300":{"m":53,"g":94},"2335":{"m":53,"g":94},"2327":{"m":53,"g":94},"2324":{"m":53,"g":94},"2328":{"m":53,"g":94},"2329":{"m":53,"g":94},"2325":{"m":53,"g":94},"2281":{"m":53,"g":94},"2319":{"m":53,"g":94},"2318":{"m":53,"g":94},"
2314":{"m":53,"g":94},"2311":{"m":53,"g":94},"2310":{"m":53,"g":94},"2309":{"m":53,"g":94},"2279":{"m":53,"g":94},"2306":{"m":53,"g":94},"2305":{"m":53,"g":94},"2304":{"m":53,"g":94},"2301":{"m":53,"g":94},"2302":{"m":53,"g":94},"2299":{"m":53,"g":94},"2298":{"m":53,"g":94},"2241":{"m":53,"g":94},"2295":{"m":53,"g":94},"2292":{"m":53,"g":94},"2179":{"m":53,"g":94},"2293":{"m":53,"g":94},"2290":{"m":53,"g":94},"2244":{"m":53,"g":94},"2288":{"m":53,"g":94},"2289":{"m":53,"g":94},"2287":{"m":53,"g":94},"2284":{"m":53,"g":94},"2286":{"m":53,"g":94},"2285":{"m":53,"g":94},"2282":{"m":53,"g":94},"2215":{"m":53,"g":94},"2280":{"m":53,"g":94},"2274":{"m":53,"g":94},"2266":{"m":53,"g":94},"2265":{"m":53,"g":94},"2269":{"m":53,"g":94},"2243":{"m":53,"g":94},"2268":{"m":53,"g":94},"2225":{"m":53,"g":94},"2261":{"m":53,"g":94},"2375":{"m":54,"g":94},"2377":{"m":54,"g":94},"2363":{"m":54,"g":94},"2374":{"m":54,"g":94},"2373":{"m":54,"g":94},"2369":{"m":54,"g":94},"2357":{"m":54,"g":94},"2360":{"m":54,"g":94},"2370":{"m":54,"g":94},"2371":{"m":54,"g":94},"2368":{"m":54,"g":94},"2359":{"m":54,"g":94},"2364":{"m":54,"g":94},"2355":{"m":54,"g":94},"2342":{"m":54,"g":94},"2352":{"m":54,"g":94},"2308":{"m":54,"g":94},"2323":{"m":54,"g":94},"2350":{"m":54,"g":94},"2340":{"m":54,"g":94},"2349":{"m":54,"g":94},"2341":{"m":54,"g":94},"2348":{"m":54,"g":94},"2525":{"m":55,"g":94},"2528":{"m":55,"g":94},"2524":{"m":55,"g":94},"2517":{"m":55,"g":94},"2516":{"m":55,"g":94},"2515":{"m":55,"g":94},"2502":{"m":55,"g":94},"2500":{"m":55,"g":94},"2499":{"m":55,"g":94},"2438":{"m":55,"g":94},"2426":{"m":55,"g":94},"2457":{"m":55,"g":94},"2467":{"m":55,"g":94},"2495":{"m":55,"g":94},"2494":{"m":55,"g":94},"2493":{"m":55,"g":94},"2492":{"m":55,"g":94},"2491":{"m":55,"g":94},"2476":{"m":55,"g":94},"2490":{"m":55,"g":94},"2489":{"m":55,"g":94},"2486":{"m":55,"g":94},"2487":{"m":55,"g":94},"2481":{"m":55,"g":94},"2485":{"m":55,"g":94},"2484":{"m":55,"g":94},"2483":{"m":55,"g":94},"2479":{"m":55,"g":94},
"2473":{"m":55,"g":94},"2469":{"m":55,"g":94},"2464":{"m":55,"g":94},"2466":{"m":55,"g":94},"2463":{"m":55,"g":94},"2462":{"m":55,"g":94},"2459":{"m":55,"g":94},"2456":{"m":55,"g":94},"2444":{"m":55,"g":94},"2455":{"m":55,"g":94},"2454":{"m":55,"g":94},"2442":{"m":55,"g":94},"2453":{"m":55,"g":94},"2452":{"m":55,"g":94},"2437":{"m":55,"g":94},"2449":{"m":55,"g":94},"2448":{"m":55,"g":94},"2425":{"m":55,"g":94},"2447":{"m":55,"g":94},"2436":{"m":55,"g":94},"2441":{"m":55,"g":94},"2440":{"m":55,"g":94},"2435":{"m":55,"g":94},"2434":{"m":55,"g":94},"2433":{"m":55,"g":94},"2424":{"m":55,"g":94},"2412":{"m":55,"g":94},"2422":{"m":55,"g":94},"2419":{"m":55,"g":94},"2417":{"m":55,"g":94},"2416":{"m":55,"g":94},"2410":{"m":55,"g":94},"2393":{"m":55,"g":94},"2413":{"m":55,"g":94},"2411":{"m":55,"g":94},"2398":{"m":55,"g":94},"2409":{"m":55,"g":94},"2408":{"m":55,"g":94},"2407":{"m":55,"g":94},"2406":{"m":55,"g":94},"2405":{"m":55,"g":94},"2404":{"m":55,"g":94},"2403":{"m":55,"g":94},"2401":{"m":55,"g":94},"2397":{"m":55,"g":94},"2394":{"m":55,"g":94},"2382":{"m":55,"g":94},"2330":{"m":55,"g":94},"2392":{"m":55,"g":94},"2391":{"m":55,"g":94},"2388":{"m":55,"g":94},"2390":{"m":55,"g":94},"2387":{"m":55,"g":94},"2380":{"m":55,"g":94},"2379":{"m":55,"g":94},"2378":{"m":55,"g":94},"2582":{"m":56,"g":94},"2581":{"m":56,"g":94},"2580":{"m":56,"g":94},"2579":{"m":56,"g":94},"2575":{"m":56,"g":94},"2566":{"m":56,"g":94},"2563":{"m":56,"g":94},"2553":{"m":56,"g":94},"2545":{"m":56,"g":94},"2547":{"m":56,"g":94},"2543":{"m":56,"g":94},"2509":{"m":56,"g":94},"2523":{"m":56,"g":94},"2529":{"m":56,"g":94},"2541":{"m":56,"g":94},"2616":{"m":57,"g":94},"2617":{"m":57,"g":94},"2615":{"m":57,"g":94},"2610":{"m":57,"g":94},"2611":{"m":57,"g":94},"2612":{"m":57,"g":94},"2608":{"m":57,"g":94},"2606":{"m":57,"g":94},"2586":{"m":57,"g":94},"2605":{"m":57,"g":94},"2603":{"m":57,"g":94},"2564":{"m":57,"g":94},"2570":{"m":57,"g":94},"2521":{"m":57,"g":94},"2598":{"m":57,"g":94},"2574":{"m":57,"g":94}
,"2565":{"m":57,"g":94},"2597":{"m":57,"g":94},"2596":{"m":57,"g":94},"2526":{"m":57,"g":94},"2557":{"m":57,"g":94},"2594":{"m":57,"g":94},"2555":{"m":57,"g":94},"2560":{"m":57,"g":94},"2592":{"m":57,"g":94},"2590":{"m":57,"g":94},"2589":{"m":57,"g":94},"2643":{"m":58,"g":94},"2641":{"m":58,"g":94},"2637":{"m":58,"g":94},"2635":{"m":58,"g":94},"2640":{"m":58,"g":94},"2639":{"m":58,"g":94},"2638":{"m":58,"g":94},"2609":{"m":58,"g":94},"2544":{"m":58,"g":94},"2631":{"m":58,"g":94},"2628":{"m":58,"g":94},"2633":{"m":58,"g":94},"2614":{"m":58,"g":94},"2626":{"m":58,"g":94},"2624":{"m":58,"g":94},"2625":{"m":58,"g":94},"2623":{"m":58,"g":94},"2622":{"m":58,"g":94},"2475":{"m":58,"g":94},"2618":{"m":58,"g":94},"2647":{"m":59,"g":94},"2648":{"m":59,"g":94},"2646":{"m":59,"g":94},"2636":{"m":59,"g":94},"2645":{"m":59,"g":94},"2644":{"m":59,"g":94},"2713":{"m":60,"g":94},"2688":{"m":60,"g":94},"2735":{"m":60,"g":94},"2733":{"m":60,"g":94},"2731":{"m":60,"g":94},"2571":{"m":60,"g":94},"2726":{"m":60,"g":94},"2727":{"m":60,"g":94},"2722":{"m":60,"g":94},"2717":{"m":60,"g":94},"2601":{"m":60,"g":94},"2716":{"m":60,"g":94},"2711":{"m":60,"g":94},"2714":{"m":60,"g":94},"2704":{"m":60,"g":94},"2707":{"m":60,"g":94},"2712":{"m":60,"g":94},"2150":{"m":60,"g":94},"2709":{"m":60,"g":94},"2695":{"m":60,"g":94},"2705":{"m":60,"g":94},"2663":{"m":60,"g":94},"2697":{"m":60,"g":94},"2692":{"m":60,"g":94},"2691":{"m":60,"g":94},"2690":{"m":60,"g":94},"2689":{"m":60,"g":94},"2685":{"m":60,"g":94},"2684":{"m":60,"g":94},"2683":{"m":60,"g":94},"2682":{"m":60,"g":94},"2680":{"m":60,"g":94},"2678":{"m":60,"g":94},"2679":{"m":60,"g":94},"2676":{"m":60,"g":94},"2674":{"m":60,"g":94},"2672":{"m":60,"g":94},"2670":{"m":60,"g":94},"2667":{"m":60,"g":94},"2669":{"m":60,"g":94},"2664":{"m":60,"g":94},"2642":{"m":60,"g":94},"2666":{"m":60,"g":94},"2654":{"m":60,"g":94},"2655":{"m":60,"g":94},"2656":{"m":60,"g":94},"2652":{"m":60,"g":94},"2651":{"m":60,"g":94},"2650":{"m":60,"g":94},"2649":{"m":60,"g":94
},"2840":{"m":61,"g":94},"2837":{"m":61,"g":94},"2836":{"m":61,"g":94},"2804":{"m":61,"g":94},"2730":{"m":61,"g":94},"2826":{"m":61,"g":94},"2835":{"m":61,"g":94},"2822":{"m":61,"g":94},"2819":{"m":61,"g":94},"2833":{"m":61,"g":94},"2830":{"m":61,"g":94},"2787":{"m":61,"g":94},"2816":{"m":61,"g":94},"2813":{"m":61,"g":94},"2809":{"m":61,"g":94},"2792":{"m":61,"g":94},"2789":{"m":61,"g":94},"2773":{"m":61,"g":94},"2784":{"m":61,"g":94},"2723":{"m":61,"g":94},"2780":{"m":61,"g":94},"2779":{"m":61,"g":94},"2771":{"m":61,"g":94},"2774":{"m":61,"g":94},"2761":{"m":61,"g":94},"2770":{"m":61,"g":94},"2767":{"m":61,"g":94},"2758":{"m":61,"g":94},"2757":{"m":61,"g":94},"2513":{"m":61,"g":94},"2535":{"m":61,"g":94},"2756":{"m":61,"g":94},"2745":{"m":61,"g":94},"2748":{"m":61,"g":94},"2752":{"m":61,"g":94},"2751":{"m":61,"g":94},"2750":{"m":61,"g":94},"2899":{"m":62,"g":94},"2887":{"m":62,"g":94},"2888":{"m":62,"g":94},"2881":{"m":62,"g":94},"2885":{"m":62,"g":94},"2879":{"m":62,"g":94},"2878":{"m":62,"g":94},"2875":{"m":62,"g":94},"2630":{"m":62,"g":94},"2870":{"m":62,"g":94},"2869":{"m":62,"g":94},"2863":{"m":62,"g":94},"2868":{"m":62,"g":94},"2867":{"m":62,"g":94},"2862":{"m":62,"g":94},"2866":{"m":62,"g":94},"2865":{"m":62,"g":94},"2828":{"m":62,"g":94},"2859":{"m":62,"g":94},"2858":{"m":62,"g":94},"2861":{"m":62,"g":94},"2857":{"m":62,"g":94},"2860":{"m":62,"g":94},"2856":{"m":62,"g":94},"2851":{"m":62,"g":94},"2853":{"m":62,"g":94},"2854":{"m":62,"g":94},"2852":{"m":62,"g":94},"2786":{"m":62,"g":94},"2848":{"m":62,"g":94},"2850":{"m":62,"g":94},"2846":{"m":62,"g":94},"2843":{"m":62,"g":94},"2841":{"m":62,"g":94},"3009":{"m":63,"g":94},"2993":{"m":63,"g":94},"3010":{"m":63,"g":94},"3006":{"m":63,"g":94},"3008":{"m":63,"g":94},"3003":{"m":63,"g":94},"2998":{"m":63,"g":94},"3005":{"m":63,"g":94},"3004":{"m":63,"g":94},"3001":{"m":63,"g":94},"2996":{"m":63,"g":94},"2997":{"m":63,"g":94},"2995":{"m":63,"g":94},"2991":{"m":63,"g":94},"2992":{"m":63,"g":94},"2990":{"m":63,"g":9
4},"2396":{"m":63,"g":94},"2983":{"m":63,"g":94},"2988":{"m":63,"g":94},"2839":{"m":63,"g":94},"2982":{"m":63,"g":94},"2986":{"m":63,"g":94},"2987":{"m":63,"g":94},"2985":{"m":63,"g":94},"2984":{"m":63,"g":94},"2981":{"m":63,"g":94},"2975":{"m":63,"g":94},"2980":{"m":63,"g":94},"2979":{"m":63,"g":94},"2978":{"m":63,"g":94},"2976":{"m":63,"g":94},"2974":{"m":63,"g":94},"2973":{"m":63,"g":94},"2972":{"m":63,"g":94},"2958":{"m":63,"g":94},"2956":{"m":63,"g":94},"2901":{"m":63,"g":94},"2971":{"m":63,"g":94},"2785":{"m":63,"g":94},"2966":{"m":63,"g":94},"2967":{"m":63,"g":94},"2964":{"m":63,"g":94},"2963":{"m":63,"g":94},"2960":{"m":63,"g":94},"2941":{"m":63,"g":94},"2894":{"m":63,"g":94},"2942":{"m":63,"g":94},"2944":{"m":63,"g":94},"2954":{"m":63,"g":94},"2952":{"m":63,"g":94},"2951":{"m":63,"g":94},"2947":{"m":63,"g":94},"2948":{"m":63,"g":94},"2949":{"m":63,"g":94},"2950":{"m":63,"g":94},"2945":{"m":63,"g":94},"2907":{"m":63,"g":94},"2938":{"m":63,"g":94},"2937":{"m":63,"g":94},"2806":{"m":63,"g":94},"2876":{"m":63,"g":94},"2930":{"m":63,"g":94},"2928":{"m":63,"g":94},"2920":{"m":63,"g":94},"2926":{"m":63,"g":94},"2927":{"m":63,"g":94},"2925":{"m":63,"g":94},"2924":{"m":63,"g":94},"2923":{"m":63,"g":94},"2821":{"m":63,"g":94},"2910":{"m":63,"g":94},"2922":{"m":63,"g":94},"2919":{"m":63,"g":94},"2917":{"m":63,"g":94},"2915":{"m":63,"g":94},"2911":{"m":63,"g":94},"2909":{"m":63,"g":94},"2908":{"m":63,"g":94},"2511":{"m":63,"g":94},"2906":{"m":63,"g":94},"2904":{"m":63,"g":94},"2902":{"m":63,"g":94},"2872":{"m":63,"g":94},"2897":{"m":63,"g":94},"3180":{"m":64,"g":94},"3179":{"m":64,"g":94},"3178":{"m":64,"g":94},"3175":{"m":64,"g":94},"3176":{"m":64,"g":94},"3174":{"m":64,"g":94},"3146":{"m":64,"g":94},"3173":{"m":64,"g":94},"3170":{"m":64,"g":94},"3167":{"m":64,"g":94},"3156":{"m":64,"g":94},"3134":{"m":64,"g":94},"3162":{"m":64,"g":94},"3144":{"m":64,"g":94},"3155":{"m":64,"g":94},"2700":{"m":64,"g":94},"3154":{"m":64,"g":94},"3153":{"m":64,"g":94},"3152":{"m":64,"g":
94},"3150":{"m":64,"g":94},"3151":{"m":64,"g":94},"3149":{"m":64,"g":94},"3147":{"m":64,"g":94},"3145":{"m":64,"g":94},"3085":{"m":64,"g":94},"3047":{"m":64,"g":94},"3143":{"m":64,"g":94},"3139":{"m":64,"g":94},"3135":{"m":64,"g":94},"3138":{"m":64,"g":94},"3113":{"m":64,"g":94},"3133":{"m":64,"g":94},"3132":{"m":64,"g":94},"3130":{"m":64,"g":94},"3129":{"m":64,"g":94},"3128":{"m":64,"g":94},"3127":{"m":64,"g":94},"3126":{"m":64,"g":94},"3125":{"m":64,"g":94},"3124":{"m":64,"g":94},"3121":{"m":64,"g":94},"3110":{"m":64,"g":94},"3109":{"m":64,"g":94},"3107":{"m":64,"g":94},"3096":{"m":64,"g":94},"3037":{"m":64,"g":94},"3105":{"m":64,"g":94},"3097":{"m":64,"g":94},"3095":{"m":64,"g":94},"3094":{"m":64,"g":94},"3070":{"m":64,"g":94},"3093":{"m":64,"g":94},"2742":{"m":64,"g":94},"3087":{"m":64,"g":94},"3086":{"m":64,"g":94},"3084":{"m":64,"g":94},"3083":{"m":64,"g":94},"3081":{"m":64,"g":94},"3080":{"m":64,"g":94},"3079":{"m":64,"g":94},"3078":{"m":64,"g":94},"3074":{"m":64,"g":94},"3030":{"m":64,"g":94},"3071":{"m":64,"g":94},"3069":{"m":64,"g":94},"3068":{"m":64,"g":94},"3067":{"m":64,"g":94},"3061":{"m":64,"g":94},"3062":{"m":64,"g":94},"3063":{"m":64,"g":94},"2989":{"m":64,"g":94},"3060":{"m":64,"g":94},"3045":{"m":64,"g":94},"3038":{"m":64,"g":94},"3058":{"m":64,"g":94},"3057":{"m":64,"g":94},"3055":{"m":64,"g":94},"3056":{"m":64,"g":94},"3054":{"m":64,"g":94},"3053":{"m":64,"g":94},"3052":{"m":64,"g":94},"3051":{"m":64,"g":94},"3048":{"m":64,"g":94},"3046":{"m":64,"g":94},"3039":{"m":64,"g":94},"3036":{"m":64,"g":94},"3035":{"m":64,"g":94},"3033":{"m":64,"g":94},"3027":{"m":64,"g":94},"3026":{"m":64,"g":94},"3025":{"m":64,"g":94},"3014":{"m":64,"g":94},"3018":{"m":64,"g":94},"2939":{"m":64,"g":94},"3022":{"m":64,"g":94},"3021":{"m":64,"g":94},"3017":{"m":64,"g":94},"3015":{"m":64,"g":94},"3020":{"m":64,"g":94},"3019":{"m":64,"g":94},"3016":{"m":64,"g":94},"3013":{"m":64,"g":94},"3012":{"m":64,"g":94},"3233":{"m":65,"g":94},"3231":{"m":65,"g":94},"3232":{"m":65,"g"
:94},"3224":{"m":65,"g":94},"3230":{"m":65,"g":94},"3229":{"m":65,"g":94},"3227":{"m":65,"g":94},"3218":{"m":65,"g":94},"3217":{"m":65,"g":94},"3216":{"m":65,"g":94},"3214":{"m":65,"g":94},"3213":{"m":65,"g":94},"3212":{"m":65,"g":94},"2977":{"m":65,"g":94},"3190":{"m":65,"g":94},"3169":{"m":65,"g":94},"3192":{"m":65,"g":94},"3183":{"m":65,"g":94},"3166":{"m":65,"g":94},"3171":{"m":65,"g":94},"3181":{"m":65,"g":94},"3313":{"m":66,"g":94},"3312":{"m":66,"g":94},"3306":{"m":66,"g":94},"3305":{"m":66,"g":94},"3299":{"m":66,"g":94},"3294":{"m":66,"g":94},"3293":{"m":66,"g":94},"3292":{"m":66,"g":94},"3287":{"m":66,"g":94},"3288":{"m":66,"g":94},"3161":{"m":66,"g":94},"3273":{"m":66,"g":94},"3276":{"m":66,"g":94},"3274":{"m":66,"g":94},"3272":{"m":66,"g":94},"3270":{"m":66,"g":94},"3269":{"m":66,"g":94},"3268":{"m":66,"g":94},"3205":{"m":66,"g":94},"3207":{"m":66,"g":94},"3259":{"m":66,"g":94},"3261":{"m":66,"g":94},"3114":{"m":66,"g":94},"3255":{"m":66,"g":94},"3252":{"m":66,"g":94},"3251":{"m":66,"g":94},"3250":{"m":66,"g":94},"3249":{"m":66,"g":94},"3248":{"m":66,"g":94},"3246":{"m":66,"g":94},"3221":{"m":66,"g":94},"3242":{"m":66,"g":94},"3240":{"m":66,"g":94},"3238":{"m":66,"g":94},"3236":{"m":66,"g":94},"3235":{"m":66,"g":94},"3228":{"m":66,"g":94},"3369":{"m":67,"g":94},"3378":{"m":67,"g":94},"3376":{"m":67,"g":94},"3374":{"m":67,"g":94},"3373":{"m":67,"g":94},"3372":{"m":67,"g":94},"3366":{"m":67,"g":94},"3356":{"m":67,"g":94},"3300":{"m":67,"g":94},"3355":{"m":67,"g":94},"3314":{"m":67,"g":94},"3350":{"m":67,"g":94},"3347":{"m":67,"g":94},"3352":{"m":67,"g":94},"3349":{"m":67,"g":94},"3338":{"m":67,"g":94},"3337":{"m":67,"g":94},"3335":{"m":67,"g":94},"3332":{"m":67,"g":94},"3325":{"m":67,"g":94},"3327":{"m":67,"g":94},"3324":{"m":67,"g":94},"3168":{"m":67,"g":94},"3317":{"m":67,"g":94},"3309":{"m":67,"g":94},"3459":{"m":68,"g":94},"3457":{"m":68,"g":94},"3413":{"m":68,"g":94},"3453":{"m":68,"g":94},"3452":{"m":68,"g":94},"3442":{"m":68,"g":94},"3441":{"m":68,"g
":94},"3440":{"m":68,"g":94},"3439":{"m":68,"g":94},"3437":{"m":68,"g":94},"3410":{"m":68,"g":94},"3435":{"m":68,"g":94},"3433":{"m":68,"g":94},"3431":{"m":68,"g":94},"3430":{"m":68,"g":94},"3425":{"m":68,"g":94},"3422":{"m":68,"g":94},"3421":{"m":68,"g":94},"3408":{"m":68,"g":94},"3415":{"m":68,"g":94},"3411":{"m":68,"g":94},"3412":{"m":68,"g":94},"3407":{"m":68,"g":94},"3404":{"m":68,"g":94},"3346":{"m":68,"g":94},"3382":{"m":68,"g":94},"3386":{"m":68,"g":94},"3275":{"m":68,"g":94},"3556":{"m":69,"g":94},"3555":{"m":69,"g":94},"3550":{"m":69,"g":94},"3553":{"m":69,"g":94},"3543":{"m":69,"g":94},"3534":{"m":69,"g":94},"3541":{"m":69,"g":94},"3536":{"m":69,"g":94},"3529":{"m":69,"g":94},"3530":{"m":69,"g":94},"3267":{"m":69,"g":94},"3450":{"m":69,"g":94},"3503":{"m":69,"g":94},"3493":{"m":69,"g":94},"3523":{"m":69,"g":94},"3522":{"m":69,"g":94},"3405":{"m":69,"g":94},"3502":{"m":69,"g":94},"3420":{"m":69,"g":94},"3495":{"m":69,"g":94},"3496":{"m":69,"g":94},"3499":{"m":69,"g":94},"3498":{"m":69,"g":94},"3497":{"m":69,"g":94},"3500":{"m":69,"g":94},"3492":{"m":69,"g":94},"3490":{"m":69,"g":94},"3418":{"m":69,"g":94},"3364":{"m":69,"g":94},"3473":{"m":69,"g":94},"3469":{"m":69,"g":94},"3468":{"m":69,"g":94},"3466":{"m":69,"g":94},"3638":{"m":70,"g":94},"3636":{"m":70,"g":94},"3634":{"m":70,"g":94},"3632":{"m":70,"g":94},"3619":{"m":70,"g":94},"3617":{"m":70,"g":94},"3260":{"m":70,"g":94},"3532":{"m":70,"g":94},"3597":{"m":70,"g":94},"3535":{"m":70,"g":94},"3258":{"m":70,"g":94},"3598":{"m":70,"g":94},"3564":{"m":70,"g":94},"3594":{"m":70,"g":94},"3592":{"m":70,"g":94},"3591":{"m":70,"g":94},"3589":{"m":70,"g":94},"3587":{"m":70,"g":94},"3582":{"m":70,"g":94},"3548":{"m":70,"g":94},"3584":{"m":70,"g":94},"3505":{"m":70,"g":94},"3581":{"m":70,"g":94},"3563":{"m":70,"g":94},"3363":{"m":70,"g":94},"3558":{"m":70,"g":94},"3557":{"m":70,"g":94},"3645":{"m":71,"g":94},"3644":{"m":71,"g":94},"3643":{"m":71,"g":94},"3639":{"m":71,"g":94},"4114":{"m":72,"g":94},"4099":{"m":72,"
g":94},"4101":{"m":72,"g":94},"3211":{"m":72,"g":94},"3029":{"m":72,"g":94},"4110":{"m":72,"g":94},"4105":{"m":72,"g":94},"4109":{"m":72,"g":94},"4103":{"m":72,"g":94},"4108":{"m":72,"g":94},"4107":{"m":72,"g":94},"4102":{"m":72,"g":94},"4100":{"m":72,"g":94},"4012":{"m":72,"g":94},"4081":{"m":72,"g":94},"3986":{"m":72,"g":94},"4075":{"m":72,"g":94},"3990":{"m":72,"g":94},"3790":{"m":72,"g":94},"3941":{"m":72,"g":94},"4077":{"m":72,"g":94},"4074":{"m":72,"g":94},"4066":{"m":72,"g":94},"4071":{"m":72,"g":94},"4065":{"m":72,"g":94},"4016":{"m":72,"g":94},"3607":{"m":72,"g":94},"3712":{"m":72,"g":94},"3948":{"m":72,"g":94},"3954":{"m":72,"g":94},"4030":{"m":72,"g":94},"4051":{"m":72,"g":94},"4046":{"m":72,"g":94},"4053":{"m":72,"g":94},"4052":{"m":72,"g":94},"4023":{"m":72,"g":94},"4000":{"m":72,"g":94},"4049":{"m":72,"g":94},"4044":{"m":72,"g":94},"4043":{"m":72,"g":94},"4033":{"m":72,"g":94},"4039":{"m":72,"g":94},"3999":{"m":72,"g":94},"4034":{"m":72,"g":94},"4032":{"m":72,"g":94},"4027":{"m":72,"g":94},"4025":{"m":72,"g":94},"3264":{"m":72,"g":94},"4031":{"m":72,"g":94},"4029":{"m":72,"g":94},"4021":{"m":72,"g":94},"3988":{"m":72,"g":94},"4014":{"m":72,"g":94},"3826":{"m":72,"g":94},"4010":{"m":72,"g":94},"4008":{"m":72,"g":94},"3987":{"m":72,"g":94},"3822":{"m":72,"g":94},"3993":{"m":72,"g":94},"3406":{"m":72,"g":94},"3994":{"m":72,"g":94},"3992":{"m":72,"g":94},"3991":{"m":72,"g":94},"3893":{"m":72,"g":94},"3989":{"m":72,"g":94},"3985":{"m":72,"g":94},"3982":{"m":72,"g":94},"3979":{"m":72,"g":94},"3976":{"m":72,"g":94},"3977":{"m":72,"g":94},"3975":{"m":72,"g":94},"3967":{"m":72,"g":94},"3966":{"m":72,"g":94},"3963":{"m":72,"g":94},"3852":{"m":72,"g":94},"3566":{"m":72,"g":94},"3678":{"m":72,"g":94},"3950":{"m":72,"g":94},"3870":{"m":72,"g":94},"3613":{"m":72,"g":94},"3866":{"m":72,"g":94},"3861":{"m":72,"g":94},"3934":{"m":72,"g":94},"3933":{"m":72,"g":94},"3593":{"m":72,"g":94},"3922":{"m":72,"g":94},"3925":{"m":72,"g":94},"3914":{"m":72,"g":94},"3905":{"m":72,
"g":94},"3897":{"m":72,"g":94},"3907":{"m":72,"g":94},"3898":{"m":72,"g":94},"3903":{"m":72,"g":94},"3791":{"m":72,"g":94},"3900":{"m":72,"g":94},"3894":{"m":72,"g":94},"3298":{"m":72,"g":94},"3860":{"m":72,"g":94},"3845":{"m":72,"g":94},"3602":{"m":72,"g":94},"3865":{"m":72,"g":94},"3843":{"m":72,"g":94},"3519":{"m":72,"g":94},"3841":{"m":72,"g":94},"3857":{"m":72,"g":94},"3709":{"m":72,"g":94},"3803":{"m":72,"g":94},"3787":{"m":72,"g":94},"3237":{"m":72,"g":94},"3641":{"m":72,"g":94},"3741":{"m":72,"g":94},"3799":{"m":72,"g":94},"3801":{"m":72,"g":94},"3821":{"m":72,"g":94},"3829":{"m":72,"g":94},"3828":{"m":72,"g":94},"3818":{"m":72,"g":94},"3730":{"m":72,"g":94},"3785":{"m":72,"g":94},"3813":{"m":72,"g":94},"3809":{"m":72,"g":94},"2693":{"m":72,"g":94},"3795":{"m":72,"g":94},"3116":{"m":72,"g":94},"3115":{"m":72,"g":94},"3562":{"m":72,"g":94},"3117":{"m":72,"g":94},"3766":{"m":72,"g":94},"3777":{"m":72,"g":94},"3348":{"m":72,"g":94},"3223":{"m":72,"g":94},"3772":{"m":72,"g":94},"3773":{"m":72,"g":94},"3771":{"m":72,"g":94},"3733":{"m":72,"g":94},"3754":{"m":72,"g":94},"3432":{"m":72,"g":94},"3761":{"m":72,"g":94},"3740":{"m":72,"g":94},"3680":{"m":72,"g":94},"3747":{"m":72,"g":94},"3737":{"m":72,"g":94},"3588":{"m":72,"g":94},"3652":{"m":72,"g":94},"3732":{"m":72,"g":94},"3705":{"m":72,"g":94},"3731":{"m":72,"g":94},"3722":{"m":72,"g":94},"3727":{"m":72,"g":94},"3710":{"m":72,"g":94},"3677":{"m":72,"g":94},"3601":{"m":72,"g":94},"3692":{"m":72,"g":94},"3700":{"m":72,"g":94},"3706":{"m":72,"g":94},"3628":{"m":72,"g":94},"3698":{"m":72,"g":94},"3657":{"m":72,"g":94},"3665":{"m":72,"g":94},"3635":{"m":72,"g":94},"3676":{"m":72,"g":94},"3663":{"m":72,"g":94},"3629":{"m":72,"g":94},"3654":{"m":72,"g":94},"3624":{"m":72,"g":94},"3567":{"m":72,"g":94},"3616":{"m":72,"g":94},"3650":{"m":72,"g":94},"4140":{"m":73,"g":94},"4142":{"m":73,"g":94},"4147":{"m":73,"g":94},"4134":{"m":73,"g":94},"4135":{"m":73,"g":94},"4138":{"m":73,"g":94},"4137":{"m":73,"g":94},"4132":{"m":73
,"g":94},"4128":{"m":73,"g":94},"4126":{"m":73,"g":94},"4129":{"m":73,"g":94},"4131":{"m":73,"g":94},"4038":{"m":73,"g":94},"4111":{"m":73,"g":94},"4113":{"m":73,"g":94},"4117":{"m":73,"g":94},"4121":{"m":73,"g":94},"4041":{"m":74,"g":94},"4381":{"m":74,"g":94},"4377":{"m":74,"g":94},"4376":{"m":74,"g":94},"4374":{"m":74,"g":94},"4367":{"m":74,"g":94},"4086":{"m":74,"g":94},"4356":{"m":74,"g":94},"3679":{"m":74,"g":94},"3814":{"m":74,"g":94},"3835":{"m":74,"g":94},"3844":{"m":74,"g":94},"3896":{"m":74,"g":94},"3959":{"m":74,"g":94},"3980":{"m":74,"g":94},"3962":{"m":74,"g":94},"3961":{"m":74,"g":94},"4026":{"m":74,"g":94},"4079":{"m":74,"g":94},"4359":{"m":74,"g":94},"4212":{"m":74,"g":94},"4295":{"m":74,"g":94},"4342":{"m":74,"g":94},"4335":{"m":74,"g":94},"4329":{"m":74,"g":94},"4362":{"m":74,"g":94},"4348":{"m":74,"g":94},"4326":{"m":74,"g":94},"4354":{"m":74,"g":94},"4355":{"m":74,"g":94},"4352":{"m":74,"g":94},"4350":{"m":74,"g":94},"4278":{"m":74,"g":94},"4082":{"m":74,"g":94},"3203":{"m":74,"g":94},"4340":{"m":74,"g":94},"4337":{"m":74,"g":94},"3911":{"m":74,"g":94},"4334":{"m":74,"g":94},"4331":{"m":74,"g":94},"4333":{"m":74,"g":94},"4104":{"m":74,"g":94},"4215":{"m":74,"g":94},"4327":{"m":74,"g":94},"4297":{"m":74,"g":94},"4323":{"m":74,"g":94},"4321":{"m":74,"g":94},"4317":{"m":74,"g":94},"4229":{"m":74,"g":94},"4220":{"m":74,"g":94},"4311":{"m":74,"g":94},"4299":{"m":74,"g":94},"4287":{"m":74,"g":94},"4290":{"m":74,"g":94},"4199":{"m":74,"g":94},"4291":{"m":74,"g":94},"4136":{"m":74,"g":94},"4288":{"m":74,"g":94},"4284":{"m":74,"g":94},"4277":{"m":74,"g":94},"4279":{"m":74,"g":94},"4275":{"m":74,"g":94},"4272":{"m":74,"g":94},"4261":{"m":74,"g":94},"4256":{"m":74,"g":94},"4267":{"m":74,"g":94},"4262":{"m":74,"g":94},"4258":{"m":74,"g":94},"4255":{"m":74,"g":94},"4231":{"m":74,"g":94},"4144":{"m":74,"g":94},"4252":{"m":74,"g":94},"4206":{"m":74,"g":94},"3958":{"m":74,"g":94},"4165":{"m":74,"g":94},"4250":{"m":74,"g":94},"4230":{"m":74,"g":94},"4238":{"m":7
4,"g":94},"4243":{"m":74,"g":94},"4242":{"m":74,"g":94},"4241":{"m":74,"g":94},"4237":{"m":74,"g":94},"4235":{"m":74,"g":94},"4228":{"m":74,"g":94},"4217":{"m":74,"g":94},"3148":{"m":74,"g":94},"4218":{"m":74,"g":94},"4224":{"m":74,"g":94},"4193":{"m":74,"g":94},"3631":{"m":74,"g":94},"4225":{"m":74,"g":94},"4223":{"m":74,"g":94},"4213":{"m":74,"g":94},"4222":{"m":74,"g":94},"4219":{"m":74,"g":94},"4124":{"m":74,"g":94},"4216":{"m":74,"g":94},"4211":{"m":74,"g":94},"4210":{"m":74,"g":94},"3749":{"m":74,"g":94},"4203":{"m":74,"g":94},"4181":{"m":74,"g":94},"4200":{"m":74,"g":94},"4198":{"m":74,"g":94},"4197":{"m":74,"g":94},"4164":{"m":74,"g":94},"4195":{"m":74,"g":94},"4194":{"m":74,"g":94},"4189":{"m":74,"g":94},"4185":{"m":74,"g":94},"4187":{"m":74,"g":94},"4186":{"m":74,"g":94},"4178":{"m":74,"g":94},"4174":{"m":74,"g":94},"4179":{"m":74,"g":94},"4177":{"m":74,"g":94},"4176":{"m":74,"g":94},"4170":{"m":74,"g":94},"4168":{"m":74,"g":94},"4166":{"m":74,"g":94},"4162":{"m":74,"g":94},"4163":{"m":74,"g":94},"3888":{"m":74,"g":94},"4089":{"m":74,"g":94},"3786":{"m":74,"g":94},"3694":{"m":74,"g":94},"4152":{"m":74,"g":94},"4154":{"m":74,"g":94},"4151":{"m":74,"g":94},"4148":{"m":74,"g":94},"4402":{"m":75,"g":94},"4397":{"m":75,"g":94},"4399":{"m":75,"g":94},"4269":{"m":75,"g":94},"4320":{"m":75,"g":94},"4398":{"m":75,"g":94},"4393":{"m":75,"g":94},"4390":{"m":75,"g":94},"4392":{"m":75,"g":94},"4375":{"m":75,"g":94},"4669":{"m":76,"g":94},"4743":{"m":76,"g":94},"4797":{"m":76,"g":94},"4782":{"m":76,"g":94},"4777":{"m":76,"g":94},"4775":{"m":76,"g":94},"4784":{"m":76,"g":94},"4566":{"m":76,"g":94},"4728":{"m":76,"g":94},"4755":{"m":76,"g":94},"4753":{"m":76,"g":94},"4752":{"m":76,"g":94},"4751":{"m":76,"g":94},"4310":{"m":76,"g":94},"4705":{"m":76,"g":94},"4435":{"m":76,"g":94},"4695":{"m":76,"g":94},"4691":{"m":76,"g":94},"4738":{"m":76,"g":94},"4735":{"m":76,"g":94},"4744":{"m":76,"g":94},"3023":{"m":76,"g":94},"4737":{"m":76,"g":94},"3899":{"m":76,"g":94},"4609":{"m":
76,"g":94},"4736":{"m":76,"g":94},"4716":{"m":76,"g":94},"4721":{"m":76,"g":94},"4731":{"m":76,"g":94},"4720":{"m":76,"g":94},"4396":{"m":76,"g":94},"4605":{"m":76,"g":94},"4680":{"m":76,"g":94},"4631":{"m":76,"g":94},"4698":{"m":76,"g":94},"4064":{"m":76,"g":94},"4525":{"m":76,"g":94},"4661":{"m":76,"g":94},"4610":{"m":76,"g":94},"4608":{"m":76,"g":94},"4685":{"m":76,"g":94},"4679":{"m":76,"g":94},"4643":{"m":76,"g":94},"3984":{"m":76,"g":94},"4660":{"m":76,"g":94},"4676":{"m":76,"g":94},"4670":{"m":76,"g":94},"4677":{"m":76,"g":94},"4674":{"m":76,"g":94},"4556":{"m":76,"g":94},"4665":{"m":76,"g":94},"4596":{"m":76,"g":94},"4639":{"m":76,"g":94},"4664":{"m":76,"g":94},"4582":{"m":76,"g":94},"4654":{"m":76,"g":94},"4637":{"m":76,"g":94},"4641":{"m":76,"g":94},"4613":{"m":76,"g":94},"4640":{"m":76,"g":94},"4558":{"m":76,"g":94},"4622":{"m":76,"g":94},"4592":{"m":76,"g":94},"4571":{"m":76,"g":94},"3446":{"m":76,"g":94},"4577":{"m":76,"g":94},"4549":{"m":76,"g":94},"4583":{"m":76,"g":94},"4514":{"m":76,"g":94},"4232":{"m":76,"g":94},"4515":{"m":76,"g":94},"4557":{"m":76,"g":94},"4553":{"m":76,"g":94},"4274":{"m":76,"g":94},"4521":{"m":76,"g":94},"4247":{"m":76,"g":94},"4532":{"m":76,"g":94},"4441":{"m":76,"g":94},"4538":{"m":76,"g":94},"4541":{"m":76,"g":94},"4542":{"m":76,"g":94},"4500":{"m":76,"g":94},"4531":{"m":76,"g":94},"3682":{"m":76,"g":94},"4458":{"m":76,"g":94},"4486":{"m":76,"g":94},"4507":{"m":76,"g":94},"4522":{"m":76,"g":94},"4520":{"m":76,"g":94},"4505":{"m":76,"g":94},"4482":{"m":76,"g":94},"4517":{"m":76,"g":94},"4513":{"m":76,"g":94},"4510":{"m":76,"g":94},"4499":{"m":76,"g":94},"4495":{"m":76,"g":94},"4446":{"m":76,"g":94},"4480":{"m":76,"g":94},"4067":{"m":76,"g":94},"4485":{"m":76,"g":94},"4418":{"m":76,"g":94},"4493":{"m":76,"g":94},"2798":{"m":76,"g":94},"4372":{"m":76,"g":94},"4483":{"m":76,"g":94},"4386":{"m":76,"g":94},"2797":{"m":76,"g":94},"3612":{"m":76,"g":94},"4448":{"m":76,"g":94},"4465":{"m":76,"g":94},"4474":{"m":76,"g":94},"4479":{"m"
:76,"g":94},"4424":{"m":76,"g":94},"4202":{"m":76,"g":94},"4484":{"m":76,"g":94},"4481":{"m":76,"g":94},"4477":{"m":76,"g":94},"4472":{"m":76,"g":94},"4363":{"m":76,"g":94},"4470":{"m":76,"g":94},"4469":{"m":76,"g":94},"4383":{"m":76,"g":94},"4468":{"m":76,"g":94},"4467":{"m":76,"g":94},"4466":{"m":76,"g":94},"4449":{"m":76,"g":94},"4464":{"m":76,"g":94},"4460":{"m":76,"g":94},"4368":{"m":76,"g":94},"4459":{"m":76,"g":94},"4447":{"m":76,"g":94},"4423":{"m":76,"g":94},"4391":{"m":76,"g":94},"4413":{"m":76,"g":94},"4454":{"m":76,"g":94},"4453":{"m":76,"g":94},"4455":{"m":76,"g":94},"4452":{"m":76,"g":94},"4451":{"m":76,"g":94},"4442":{"m":76,"g":94},"4439":{"m":76,"g":94},"4438":{"m":76,"g":94},"4437":{"m":76,"g":94},"4302":{"m":76,"g":94},"4427":{"m":76,"g":94},"4419":{"m":76,"g":94},"3964":{"m":76,"g":94},"4400":{"m":76,"g":94},"4009":{"m":76,"g":94},"4403":{"m":76,"g":94},"4878":{"m":77,"g":94},"4874":{"m":77,"g":94},"4873":{"m":77,"g":94},"4831":{"m":77,"g":94},"4872":{"m":77,"g":94},"4768":{"m":77,"g":94},"4871":{"m":77,"g":94},"4866":{"m":77,"g":94},"4749":{"m":77,"g":94},"4864":{"m":77,"g":94},"4772":{"m":77,"g":94},"4834":{"m":77,"g":94},"4492":{"m":77,"g":94},"4863":{"m":77,"g":94},"4855":{"m":77,"g":94},"4853":{"m":77,"g":94},"4840":{"m":77,"g":94},"4740":{"m":77,"g":94},"4750":{"m":77,"g":94},"4687":{"m":77,"g":94},"4729":{"m":77,"g":94},"4712":{"m":77,"g":94},"4704":{"m":77,"g":94},"4688":{"m":77,"g":94},"4681":{"m":77,"g":94},"4648":{"m":77,"g":94},"4832":{"m":77,"g":94},"4528":{"m":77,"g":94},"4597":{"m":77,"g":94},"4487":{"m":77,"g":94},"3949":{"m":77,"g":94},"4844":{"m":77,"g":94},"4846":{"m":77,"g":94},"4843":{"m":77,"g":94},"4799":{"m":77,"g":94},"4788":{"m":77,"g":94},"4809":{"m":77,"g":94},"4837":{"m":77,"g":94},"3969":{"m":77,"g":94},"4835":{"m":77,"g":94},"4815":{"m":77,"g":94},"4819":{"m":77,"g":94},"4770":{"m":77,"g":94},"4833":{"m":77,"g":94},"4830":{"m":77,"g":94},"4694":{"m":77,"g":94},"4826":{"m":77,"g":94},"4825":{"m":77,"g":94},"4828":{"m
":77,"g":94},"4827":{"m":77,"g":94},"4823":{"m":77,"g":94},"4341":{"m":77,"g":94},"4638":{"m":77,"g":94},"4813":{"m":77,"g":94},"4764":{"m":77,"g":94},"4706":{"m":77,"g":94},"4745":{"m":77,"g":94},"4565":{"m":77,"g":94},"4804":{"m":77,"g":94},"4506":{"m":77,"g":94},"4388":{"m":77,"g":94},"4628":{"m":77,"g":94},"4719":{"m":77,"g":94},"5091":{"m":78,"g":94},"5080":{"m":78,"g":94},"5089":{"m":78,"g":94},"5079":{"m":78,"g":94},"5088":{"m":78,"g":94},"5052":{"m":78,"g":94},"5050":{"m":78,"g":94},"5074":{"m":78,"g":94},"4535":{"m":78,"g":94},"5072":{"m":78,"g":94},"4918":{"m":78,"g":94},"4996":{"m":78,"g":94},"4995":{"m":78,"g":94},"4994":{"m":78,"g":94},"5060":{"m":78,"g":94},"5057":{"m":78,"g":94},"5056":{"m":78,"g":94},"4625":{"m":78,"g":94},"5049":{"m":78,"g":94},"5051":{"m":78,"g":94},"5039":{"m":78,"g":94},"4796":{"m":78,"g":94},"5046":{"m":78,"g":94},"5048":{"m":78,"g":94},"5036":{"m":78,"g":94},"4992":{"m":78,"g":94},"5005":{"m":78,"g":94},"5024":{"m":78,"g":94},"5030":{"m":78,"g":94},"5020":{"m":78,"g":94},"4727":{"m":78,"g":94},"5009":{"m":78,"g":94},"5011":{"m":78,"g":94},"4817":{"m":78,"g":94},"5008":{"m":78,"g":94},"4951":{"m":78,"g":94},"4861":{"m":78,"g":94},"4989":{"m":78,"g":94},"4581":{"m":78,"g":94},"4915":{"m":78,"g":94},"4977":{"m":78,"g":94},"4958":{"m":78,"g":94},"4767":{"m":78,"g":94},"4959":{"m":78,"g":94},"4954":{"m":78,"g":94},"4953":{"m":78,"g":94},"4950":{"m":78,"g":94},"4754":{"m":78,"g":94},"4944":{"m":78,"g":94},"4928":{"m":78,"g":94},"4913":{"m":78,"g":94},"4936":{"m":78,"g":94},"4883":{"m":78,"g":94},"4925":{"m":78,"g":94},"4933":{"m":78,"g":94},"4932":{"m":78,"g":94},"4931":{"m":78,"g":94},"4902":{"m":78,"g":94},"4930":{"m":78,"g":94},"4927":{"m":78,"g":94},"4926":{"m":78,"g":94},"4896":{"m":78,"g":94},"4914":{"m":78,"g":94},"4908":{"m":78,"g":94},"4909":{"m":78,"g":94},"4890":{"m":78,"g":94},"4899":{"m":78,"g":94},"4898":{"m":78,"g":94},"4530":{"m":78,"g":94},"4891":{"m":78,"g":94},"4889":{"m":78,"g":94},"4886":{"m":78,"g":94},"4795":{"
m":78,"g":94},"4845":{"m":78,"g":94},"4882":{"m":78,"g":94},"5117":{"m":79,"g":94},"5106":{"m":79,"g":94},"5092":{"m":79,"g":94},"5097":{"m":79,"g":94},"5445":{"m":80,"g":94},"5113":{"m":80,"g":94},"5425":{"m":80,"g":94},"5397":{"m":80,"g":94},"5398":{"m":80,"g":94},"5038":{"m":80,"g":94},"5211":{"m":80,"g":94},"5264":{"m":80,"g":94},"5436":{"m":80,"g":94},"5344":{"m":80,"g":94},"5431":{"m":80,"g":94},"5434":{"m":80,"g":94},"5430":{"m":80,"g":94},"5419":{"m":80,"g":94},"5420":{"m":80,"g":94},"5423":{"m":80,"g":94},"5422":{"m":80,"g":94},"5415":{"m":80,"g":94},"5416":{"m":80,"g":94},"5351":{"m":80,"g":94},"5412":{"m":80,"g":94},"5352":{"m":80,"g":94},"5214":{"m":80,"g":94},"5406":{"m":80,"g":94},"5401":{"m":80,"g":94},"5400":{"m":80,"g":94},"5381":{"m":80,"g":94},"5399":{"m":80,"g":94},"5395":{"m":80,"g":94},"5279":{"m":80,"g":94},"5291":{"m":80,"g":94},"5368":{"m":80,"g":94},"5393":{"m":80,"g":94},"5263":{"m":80,"g":94},"5392":{"m":80,"g":94},"5371":{"m":80,"g":94},"5385":{"m":80,"g":94},"5370":{"m":80,"g":94},"5384":{"m":80,"g":94},"5326":{"m":80,"g":94},"5003":{"m":80,"g":94},"5367":{"m":80,"g":94},"5364":{"m":80,"g":94},"5277":{"m":80,"g":94},"5360":{"m":80,"g":94},"5359":{"m":80,"g":94},"5161":{"m":80,"g":94},"5328":{"m":80,"g":94},"5357":{"m":80,"g":94},"5342":{"m":80,"g":94},"5343":{"m":80,"g":94},"5322":{"m":80,"g":94},"5341":{"m":80,"g":94},"5337":{"m":80,"g":94},"5336":{"m":80,"g":94},"5333":{"m":80,"g":94},"5332":{"m":80,"g":94},"5331":{"m":80,"g":94},"5327":{"m":80,"g":94},"4848":{"m":80,"g":94},"5294":{"m":80,"g":94},"5321":{"m":80,"g":94},"4884":{"m":80,"g":94},"5210":{"m":80,"g":94},"5317":{"m":80,"g":94},"5316":{"m":80,"g":94},"5315":{"m":80,"g":94},"5120":{"m":80,"g":94},"5299":{"m":80,"g":94},"5311":{"m":80,"g":94},"5142":{"m":80,"g":94},"5271":{"m":80,"g":94},"5065":{"m":80,"g":94},"5310":{"m":80,"g":94},"5308":{"m":80,"g":94},"5307":{"m":80,"g":94},"5306":{"m":80,"g":94},"5304":{"m":80,"g":94},"5303":{"m":80,"g":94},"5302":{"m":80,"g":94},"5298":{
"m":80,"g":94},"5301":{"m":80,"g":94},"5300":{"m":80,"g":94},"5290":{"m":80,"g":94},"5292":{"m":80,"g":94},"5289":{"m":80,"g":94},"5288":{"m":80,"g":94},"5287":{"m":80,"g":94},"5286":{"m":80,"g":94},"5280":{"m":80,"g":94},"5193":{"m":80,"g":94},"5244":{"m":80,"g":94},"5265":{"m":80,"g":94},"5254":{"m":80,"g":94},"5262":{"m":80,"g":94},"5127":{"m":80,"g":94},"5228":{"m":80,"g":94},"5259":{"m":80,"g":94},"5245":{"m":80,"g":94},"5216":{"m":80,"g":94},"5167":{"m":80,"g":94},"5213":{"m":80,"g":94},"5215":{"m":80,"g":94},"5204":{"m":80,"g":94},"4880":{"m":80,"g":94},"5086":{"m":80,"g":94},"5190":{"m":80,"g":94},"4444":{"m":80,"g":94},"5209":{"m":80,"g":94},"5196":{"m":80,"g":94},"5207":{"m":80,"g":94},"5171":{"m":80,"g":94},"5144":{"m":80,"g":94},"5102":{"m":80,"g":94},"5194":{"m":80,"g":94},"5128":{"m":80,"g":94},"5185":{"m":80,"g":94},"5189":{"m":80,"g":94},"4058":{"m":80,"g":94},"5179":{"m":80,"g":94},"5180":{"m":80,"g":94},"5176":{"m":80,"g":94},"5110":{"m":80,"g":94},"5068":{"m":80,"g":94},"5173":{"m":80,"g":94},"5175":{"m":80,"g":94},"5174":{"m":80,"g":94},"3972":{"m":80,"g":94},"5159":{"m":80,"g":94},"4938":{"m":80,"g":94},"5139":{"m":80,"g":94},"4911":{"m":80,"g":94},"5150":{"m":80,"g":94},"5155":{"m":80,"g":94},"4686":{"m":80,"g":94},"5158":{"m":80,"g":94},"5151":{"m":80,"g":94},"5115":{"m":80,"g":94},"5152":{"m":80,"g":94},"4760":{"m":80,"g":94},"5103":{"m":80,"g":94},"5147":{"m":80,"g":94},"5083":{"m":80,"g":94},"5140":{"m":80,"g":94},"5145":{"m":80,"g":94},"4984":{"m":80,"g":94},"5137":{"m":80,"g":94},"5133":{"m":80,"g":94},"5090":{"m":80,"g":94},"5126":{"m":80,"g":94},"5582":{"m":81,"g":94},"5581":{"m":81,"g":94},"4947":{"m":81,"g":94},"5571":{"m":81,"g":94},"5564":{"m":81,"g":94},"5568":{"m":81,"g":94},"5432":{"m":81,"g":94},"5149":{"m":81,"g":94},"5562":{"m":81,"g":94},"5543":{"m":81,"g":94},"5561":{"m":81,"g":94},"5546":{"m":81,"g":94},"5549":{"m":81,"g":94},"5504":{"m":81,"g":94},"5475":{"m":81,"g":94},"5460":{"m":81,"g":94},"5547":{"m":81,"g":94},"5534":
{"m":81,"g":94},"5545":{"m":81,"g":94},"5548":{"m":81,"g":94},"5340":{"m":81,"g":94},"5540":{"m":81,"g":94},"5544":{"m":81,"g":94},"5542":{"m":81,"g":94},"4693":{"m":81,"g":94},"5476":{"m":81,"g":94},"5497":{"m":81,"g":94},"5461":{"m":81,"g":94},"5473":{"m":81,"g":94},"5518":{"m":81,"g":94},"5440":{"m":81,"g":94},"4836":{"m":81,"g":94},"5512":{"m":81,"g":94},"5373":{"m":81,"g":94},"5426":{"m":81,"g":94},"5511":{"m":81,"g":94},"5500":{"m":81,"g":94},"5503":{"m":81,"g":94},"5496":{"m":81,"g":94},"5493":{"m":81,"g":94},"5205":{"m":81,"g":94},"5479":{"m":81,"g":94},"4887":{"m":81,"g":94},"5480":{"m":81,"g":94},"5481":{"m":81,"g":94},"5484":{"m":81,"g":94},"5489":{"m":81,"g":94},"4982":{"m":81,"g":94},"5345":{"m":81,"g":94},"5467":{"m":81,"g":94},"5463":{"m":81,"g":94},"5447":{"m":81,"g":94},"5449":{"m":81,"g":94},"5444":{"m":81,"g":94},"5611":{"m":82,"g":94},"5510":{"m":82,"g":94},"5610":{"m":82,"g":94},"5580":{"m":82,"g":94},"5604":{"m":82,"g":94},"5609":{"m":82,"g":94},"5608":{"m":82,"g":94},"5477":{"m":82,"g":94},"5589":{"m":82,"g":94},"5598":{"m":82,"g":94},"5488":{"m":82,"g":94},"5037":{"m":82,"g":94},"5021":{"m":82,"g":94},"4980":{"m":82,"g":94},"4937":{"m":82,"g":94},"4718":{"m":82,"g":94},"4590":{"m":82,"g":94},"3443":{"m":82,"g":94},"5348":{"m":82,"g":94},"5570":{"m":82,"g":94},"5318":{"m":82,"g":94},"5590":{"m":82,"g":94},"5319":{"m":82,"g":94},"5241":{"m":82,"g":94},"5188":{"m":82,"g":94},"5141":{"m":82,"g":94},"5019":{"m":82,"g":94},"5016":{"m":82,"g":94},"4859":{"m":82,"g":94},"4852":{"m":82,"g":94},"4675":{"m":82,"g":94},"4733":{"m":82,"g":94},"5378":{"m":82,"g":94},"5224":{"m":82,"g":94},"4226":{"m":82,"g":94},"5433":{"m":82,"g":94},"5588":{"m":82,"g":94},"5452":{"m":82,"g":94},"5586":{"m":82,"g":94},"5575":{"m":82,"g":94},"5417":{"m":82,"g":94},"5531":{"m":82,"g":94},"5526":{"m":82,"g":94},"5521":{"m":82,"g":94},"5559":{"m":82,"g":94},"5560":{"m":82,"g":94},"5567":{"m":82,"g":94},"5574":{"m":82,"g":94},"5795":{"m":83,"g":94},"5691":{"m":83,"g":94},"5790"
:{"m":83,"g":94},"5791":{"m":83,"g":94},"5769":{"m":83,"g":94},"5789":{"m":83,"g":94},"5787":{"m":83,"g":94},"5786":{"m":83,"g":94},"5785":{"m":83,"g":94},"5779":{"m":83,"g":94},"5777":{"m":83,"g":94},"5774":{"m":83,"g":94},"5776":{"m":83,"g":94},"5772":{"m":83,"g":94},"3744":{"m":83,"g":94},"4986":{"m":83,"g":94},"4971":{"m":83,"g":94},"4870":{"m":83,"g":94},"5633":{"m":83,"g":94},"5599":{"m":83,"g":94},"5565":{"m":83,"g":94},"5509":{"m":83,"g":94},"5687":{"m":83,"g":94},"5592":{"m":83,"g":94},"5607":{"m":83,"g":94},"5730":{"m":83,"g":94},"5697":{"m":83,"g":94},"5748":{"m":83,"g":94},"5682":{"m":83,"g":94},"5685":{"m":83,"g":94},"5716":{"m":83,"g":94},"5720":{"m":83,"g":94},"5722":{"m":83,"g":94},"5728":{"m":83,"g":94},"5733":{"m":83,"g":94},"5736":{"m":83,"g":94},"5756":{"m":83,"g":94},"5760":{"m":83,"g":94},"5552":{"m":83,"g":94},"5718":{"m":83,"g":94},"5719":{"m":83,"g":94},"5737":{"m":83,"g":94},"5740":{"m":83,"g":94},"5753":{"m":83,"g":94},"5754":{"m":83,"g":94},"5750":{"m":83,"g":94},"5723":{"m":83,"g":94},"5704":{"m":83,"g":94},"5738":{"m":83,"g":94},"5715":{"m":83,"g":94},"5078":{"m":83,"g":94},"4491":{"m":83,"g":94},"5706":{"m":83,"g":94},"5648":{"m":83,"g":94},"5684":{"m":83,"g":94},"5707":{"m":83,"g":94},"5349":{"m":83,"g":94},"5688":{"m":83,"g":94},"5686":{"m":83,"g":94},"5683":{"m":83,"g":94},"5667":{"m":83,"g":94},"5677":{"m":83,"g":94},"5530":{"m":83,"g":94},"5671":{"m":83,"g":94},"5670":{"m":83,"g":94},"5435":{"m":83,"g":94},"5669":{"m":83,"g":94},"5666":{"m":83,"g":94},"5281":{"m":83,"g":94},"5601":{"m":83,"g":94},"5619":{"m":83,"g":94},"5628":{"m":83,"g":94},"5649":{"m":83,"g":94},"5646":{"m":83,"g":94},"5638":{"m":83,"g":94},"5634":{"m":83,"g":94},"5272":{"m":83,"g":94},"5641":{"m":83,"g":94},"5632":{"m":83,"g":94},"5640":{"m":83,"g":94},"5624":{"m":83,"g":94},"5622":{"m":83,"g":94},"5620":{"m":83,"g":94},"5618":{"m":83,"g":94},"5615":{"m":83,"g":94},"5578":{"m":83,"g":94},"5845":{"m":84,"g":94},"5849":{"m":84,"g":94},"5854":{"m":84,"g":94},"5823
":{"m":84,"g":94},"5847":{"m":84,"g":94},"5816":{"m":84,"g":94},"5798":{"m":84,"g":94},"5850":{"m":84,"g":94},"5851":{"m":84,"g":94},"5846":{"m":84,"g":94},"5726":{"m":84,"g":94},"5839":{"m":84,"g":94},"5842":{"m":84,"g":94},"5833":{"m":84,"g":94},"5838":{"m":84,"g":94},"5551":{"m":84,"g":94},"5825":{"m":84,"g":94},"5276":{"m":84,"g":94},"5482":{"m":84,"g":94},"5771":{"m":84,"g":94},"5809":{"m":84,"g":94},"5807":{"m":84,"g":94},"5690":{"m":84,"g":94},"5390":{"m":84,"g":94},"5788":{"m":84,"g":94},"5643":{"m":84,"g":94},"5796":{"m":84,"g":94},"5797":{"m":84,"g":94},"5939":{"m":85,"g":94},"5934":{"m":85,"g":94},"5881":{"m":85,"g":94},"5915":{"m":85,"g":94},"5930":{"m":85,"g":94},"5724":{"m":85,"g":94},"5783":{"m":85,"g":94},"5933":{"m":85,"g":94},"5932":{"m":85,"g":94},"5912":{"m":85,"g":94},"5909":{"m":85,"g":94},"5917":{"m":85,"g":94},"5919":{"m":85,"g":94},"5910":{"m":85,"g":94},"5905":{"m":85,"g":94},"5383":{"m":85,"g":94},"5903":{"m":85,"g":94},"5900":{"m":85,"g":94},"5899":{"m":85,"g":94},"5870":{"m":85,"g":94},"5830":{"m":85,"g":94},"5901":{"m":85,"g":94},"5898":{"m":85,"g":94},"5861":{"m":85,"g":94},"5696":{"m":85,"g":94},"5725":{"m":85,"g":94},"5893":{"m":85,"g":94},"5793":{"m":85,"g":94},"5896":{"m":85,"g":94},"5746":{"m":85,"g":94},"5895":{"m":85,"g":94},"5841":{"m":85,"g":94},"5836":{"m":85,"g":94},"5894":{"m":85,"g":94},"5880":{"m":85,"g":94},"5875":{"m":85,"g":94},"5859":{"m":85,"g":94},"4115":{"m":85,"g":94},"4949":{"m":85,"g":94},"5820":{"m":85,"g":94},"5868":{"m":85,"g":94},"5860":{"m":85,"g":94},"5801":{"m":85,"g":94},"5857":{"m":85,"g":94},"6165":{"m":86,"g":94},"5778":{"m":86,"g":94},"6162":{"m":86,"g":94},"6089":{"m":86,"g":94},"6141":{"m":86,"g":94},"5822":{"m":86,"g":94},"6101":{"m":86,"g":94},"5745":{"m":86,"g":94},"6132":{"m":86,"g":94},"6131":{"m":86,"g":94},"6112":{"m":86,"g":94},"6129":{"m":86,"g":94},"6123":{"m":86,"g":94},"6097":{"m":86,"g":94},"5662":{"m":86,"g":94},"5764":{"m":86,"g":94},"6119":{"m":86,"g":94},"5626":{"m":86,"g":94},"609
1":{"m":86,"g":94},"5572":{"m":86,"g":94},"5232":{"m":86,"g":94},"5219":{"m":86,"g":94},"5121":{"m":86,"g":94},"6077":{"m":86,"g":94},"6111":{"m":86,"g":94},"6038":{"m":86,"g":94},"6079":{"m":86,"g":94},"6034":{"m":86,"g":94},"6102":{"m":86,"g":94},"6105":{"m":86,"g":94},"5993":{"m":86,"g":94},"6075":{"m":86,"g":94},"6063":{"m":86,"g":94},"5233":{"m":86,"g":94},"5014":{"m":86,"g":94},"6084":{"m":86,"g":94},"3853":{"m":86,"g":94},"6010":{"m":86,"g":94},"6039":{"m":86,"g":94},"5655":{"m":86,"g":94},"6062":{"m":86,"g":94},"6004":{"m":86,"g":94},"6045":{"m":86,"g":94},"5885":{"m":86,"g":94},"5751":{"m":86,"g":94},"6057":{"m":86,"g":94},"6048":{"m":86,"g":94},"6047":{"m":86,"g":94},"6046":{"m":86,"g":94},"5081":{"m":86,"g":94},"5752":{"m":86,"g":94},"5996":{"m":86,"g":94},"5428":{"m":86,"g":94},"5555":{"m":86,"g":94},"5587":{"m":86,"g":94},"5781":{"m":86,"g":94},"6018":{"m":86,"g":94},"6002":{"m":86,"g":94},"5997":{"m":86,"g":94},"5679":{"m":86,"g":94},"5957":{"m":86,"g":94},"6012":{"m":86,"g":94},"5992":{"m":86,"g":94},"5998":{"m":86,"g":94},"5991":{"m":86,"g":94},"5986":{"m":86,"g":94},"5977":{"m":86,"g":94},"5975":{"m":86,"g":94},"5681":{"m":86,"g":94},"5969":{"m":86,"g":94},"5968":{"m":86,"g":94},"5350":{"m":86,"g":94},"5967":{"m":86,"g":94},"5908":{"m":86,"g":94},"5960":{"m":86,"g":94},"5782":{"m":86,"g":94},"5956":{"m":86,"g":94},"5945":{"m":86,"g":94},"5944":{"m":86,"g":94},"5952":{"m":86,"g":94},"5953":{"m":86,"g":94},"5834":{"m":86,"g":94},"5921":{"m":86,"g":94},"6245":{"m":87,"g":94},"6259":{"m":87,"g":94},"6252":{"m":87,"g":94},"5084":{"m":87,"g":94},"5657":{"m":87,"g":94},"6247":{"m":87,"g":94},"6235":{"m":87,"g":94},"6251":{"m":87,"g":94},"6042":{"m":87,"g":94},"6248":{"m":87,"g":94},"6225":{"m":87,"g":94},"5922":{"m":87,"g":94},"6241":{"m":87,"g":94},"6243":{"m":87,"g":94},"6244":{"m":87,"g":94},"6223":{"m":87,"g":94},"6206":{"m":87,"g":94},"6209":{"m":87,"g":94},"6231":{"m":87,"g":94},"6212":{"m":87,"g":94},"6201":{"m":87,"g":94},"6213":{"m":87,"g":94},"55
58":{"m":87,"g":94},"6154":{"m":87,"g":94},"6043":{"m":87,"g":94},"6204":{"m":87,"g":94},"6202":{"m":87,"g":94},"6178":{"m":87,"g":94},"6198":{"m":87,"g":94},"6192":{"m":87,"g":94},"6188":{"m":87,"g":94},"6032":{"m":87,"g":94},"6199":{"m":87,"g":94},"5621":{"m":87,"g":94},"6196":{"m":87,"g":94},"6195":{"m":87,"g":94},"6073":{"m":87,"g":94},"6169":{"m":87,"g":94},"6191":{"m":87,"g":94},"6190":{"m":87,"g":94},"6186":{"m":87,"g":94},"5654":{"m":87,"g":94},"6180":{"m":87,"g":94},"6179":{"m":87,"g":94},"6184":{"m":87,"g":94},"6183":{"m":87,"g":94},"4701":{"m":87,"g":94},"6181":{"m":87,"g":94},"6146":{"m":87,"g":94},"6114":{"m":87,"g":94},"6118":{"m":87,"g":94},"6016":{"m":87,"g":94},"6566":{"m":88,"g":94},"6567":{"m":88,"g":94},"6485":{"m":88,"g":94},"6560":{"m":88,"g":94},"6550":{"m":88,"g":94},"6533":{"m":88,"g":94},"6562":{"m":88,"g":94},"6524":{"m":88,"g":94},"6347":{"m":88,"g":94},"6558":{"m":88,"g":94},"6521":{"m":88,"g":94},"6507":{"m":88,"g":94},"6474":{"m":88,"g":94},"6452":{"m":88,"g":94},"6404":{"m":88,"g":94},"6493":{"m":88,"g":94},"6355":{"m":88,"g":94},"6535":{"m":88,"g":94},"6536":{"m":88,"g":94},"6059":{"m":88,"g":94},"6532":{"m":88,"g":94},"6120":{"m":88,"g":94},"6522":{"m":88,"g":94},"6520":{"m":88,"g":94},"6469":{"m":88,"g":94},"6482":{"m":88,"g":94},"6308":{"m":88,"g":94},"6388":{"m":88,"g":94},"6492":{"m":88,"g":94},"6504":{"m":88,"g":94},"6019":{"m":88,"g":94},"6457":{"m":88,"g":94},"6499":{"m":88,"g":94},"6510":{"m":88,"g":94},"6508":{"m":88,"g":94},"5759":{"m":88,"g":94},"6419":{"m":88,"g":94},"6275":{"m":88,"g":94},"6503":{"m":88,"g":94},"5573":{"m":88,"g":94},"6445":{"m":88,"g":94},"6461":{"m":88,"g":94},"6467":{"m":88,"g":94},"6468":{"m":88,"g":94},"6476":{"m":88,"g":94},"6311":{"m":88,"g":94},"6487":{"m":88,"g":94},"5339":{"m":88,"g":94},"6381":{"m":88,"g":94},"6214":{"m":88,"g":94},"6475":{"m":88,"g":94},"6472":{"m":88,"g":94},"6447":{"m":88,"g":94},"6385":{"m":88,"g":94},"6444":{"m":88,"g":94},"6405":{"m":88,"g":94},"6429":{"m":88,"g":94},"6
412":{"m":88,"g":94},"6387":{"m":88,"g":94},"6386":{"m":88,"g":94},"6326":{"m":88,"g":94},"6438":{"m":88,"g":94},"6321":{"m":88,"g":94},"4957":{"m":88,"g":94},"6440":{"m":88,"g":94},"6431":{"m":88,"g":94},"6098":{"m":88,"g":94},"6414":{"m":88,"g":94},"6430":{"m":88,"g":94},"6417":{"m":88,"g":94},"6137":{"m":88,"g":94},"6401":{"m":88,"g":94},"6400":{"m":88,"g":94},"5974":{"m":88,"g":94},"6325":{"m":88,"g":94},"6323":{"m":88,"g":94},"6333":{"m":88,"g":94},"6397":{"m":88,"g":94},"6396":{"m":88,"g":94},"6395":{"m":88,"g":94},"6339":{"m":88,"g":94},"6383":{"m":88,"g":94},"6392":{"m":88,"g":94},"6331":{"m":88,"g":94},"6391":{"m":88,"g":94},"6365":{"m":88,"g":94},"6250":{"m":88,"g":94},"6187":{"m":88,"g":94},"6379":{"m":88,"g":94},"6377":{"m":88,"g":94},"6362":{"m":88,"g":94},"6364":{"m":88,"g":94},"6330":{"m":88,"g":94},"6290":{"m":88,"g":94},"6041":{"m":88,"g":94},"6284":{"m":88,"g":94},"6257":{"m":88,"g":94},"6108":{"m":88,"g":94},"6134":{"m":88,"g":94},"6107":{"m":88,"g":94},"4741":{"m":88,"g":94},"6348":{"m":88,"g":94},"6211":{"m":88,"g":94},"6175":{"m":88,"g":94},"6366":{"m":88,"g":94},"6373":{"m":88,"g":94},"6356":{"m":88,"g":94},"6368":{"m":88,"g":94},"6324":{"m":88,"g":94},"6316":{"m":88,"g":94},"5099":{"m":88,"g":94},"6361":{"m":88,"g":94},"6360":{"m":88,"g":94},"6358":{"m":88,"g":94},"6359":{"m":88,"g":94},"6121":{"m":88,"g":94},"5694":{"m":88,"g":94},"6334":{"m":88,"g":94},"6136":{"m":88,"g":94},"6336":{"m":88,"g":94},"6327":{"m":88,"g":94},"6302":{"m":88,"g":94},"6147":{"m":88,"g":94},"6317":{"m":88,"g":94},"6216":{"m":88,"g":94},"6298":{"m":88,"g":94},"6109":{"m":88,"g":94},"5914":{"m":88,"g":94},"6009":{"m":88,"g":94},"6274":{"m":88,"g":94},"6283":{"m":88,"g":94},"6300":{"m":88,"g":94},"6138":{"m":88,"g":94},"6282":{"m":88,"g":94},"6276":{"m":88,"g":94},"6273":{"m":88,"g":94},"6115":{"m":88,"g":94},"7038":{"m":89,"g":94},"6833":{"m":89,"g":94},"7029":{"m":89,"g":94},"7027":{"m":89,"g":94},"6980":{"m":89,"g":94},"6964":{"m":89,"g":94},"7023":{"m":89,"g":94},"
7018":{"m":89,"g":94},"7017":{"m":89,"g":94},"6741":{"m":89,"g":94},"6987":{"m":89,"g":94},"7015":{"m":89,"g":94},"7013":{"m":89,"g":94},"6884":{"m":89,"g":94},"6992":{"m":89,"g":94},"6998":{"m":89,"g":94},"7008":{"m":89,"g":94},"7007":{"m":89,"g":94},"6958":{"m":89,"g":94},"6557":{"m":89,"g":94},"6990":{"m":89,"g":94},"6960":{"m":89,"g":94},"6973":{"m":89,"g":94},"6983":{"m":89,"g":94},"6929":{"m":89,"g":94},"6977":{"m":89,"g":94},"6967":{"m":89,"g":94},"6981":{"m":89,"g":94},"6979":{"m":89,"g":94},"6976":{"m":89,"g":94},"6970":{"m":89,"g":94},"6937":{"m":89,"g":94},"6965":{"m":89,"g":94},"6974":{"m":89,"g":94},"6966":{"m":89,"g":94},"6956":{"m":89,"g":94},"6963":{"m":89,"g":94},"6926":{"m":89,"g":94},"6968":{"m":89,"g":94},"6853":{"m":89,"g":94},"6957":{"m":89,"g":94},"6955":{"m":89,"g":94},"6916":{"m":89,"g":94},"6885":{"m":89,"g":94},"6950":{"m":89,"g":94},"6953":{"m":89,"g":94},"6220":{"m":89,"g":94},"6866":{"m":89,"g":94},"6895":{"m":89,"g":94},"6874":{"m":89,"g":94},"6915":{"m":89,"g":94},"6924":{"m":89,"g":94},"6945":{"m":89,"g":94},"6369":{"m":89,"g":94},"5955":{"m":89,"g":94},"6912":{"m":89,"g":94},"6910":{"m":89,"g":94},"6944":{"m":89,"g":94},"6943":{"m":89,"g":94},"6942":{"m":89,"g":94},"6939":{"m":89,"g":94},"6767":{"m":89,"g":94},"6879":{"m":89,"g":94},"6922":{"m":89,"g":94},"6932":{"m":89,"g":94},"6934":{"m":89,"g":94},"6931":{"m":89,"g":94},"6930":{"m":89,"g":94},"6838":{"m":89,"g":94},"6458":{"m":89,"g":94},"6877":{"m":89,"g":94},"6887":{"m":89,"g":94},"6890":{"m":89,"g":94},"6764":{"m":89,"g":94},"6170":{"m":89,"g":94},"6837":{"m":89,"g":94},"6868":{"m":89,"g":94},"6865":{"m":89,"g":94},"6851":{"m":89,"g":94},"6846":{"m":89,"g":94},"6820":{"m":89,"g":94},"6277":{"m":89,"g":94},"6861":{"m":89,"g":94},"6878":{"m":89,"g":94},"6736":{"m":89,"g":94},"6852":{"m":89,"g":94},"6460":{"m":89,"g":94},"6659":{"m":89,"g":94},"6858":{"m":89,"g":94},"6745":{"m":89,"g":94},"5929":{"m":89,"g":94},"6735":{"m":89,"g":94},"6816":{"m":89,"g":94},"6818":{"m":89,"g":94},
"6671":{"m":89,"g":94},"6456":{"m":89,"g":94},"6766":{"m":89,"g":94},"6093":{"m":89,"g":94},"6812":{"m":89,"g":94},"6811":{"m":89,"g":94},"6815":{"m":89,"g":94},"6813":{"m":89,"g":94},"6780":{"m":89,"g":94},"5382":{"m":89,"g":94},"6805":{"m":89,"g":94},"6699":{"m":89,"g":94},"6803":{"m":89,"g":94},"6804":{"m":89,"g":94},"6799":{"m":89,"g":94},"6800":{"m":89,"g":94},"5981":{"m":89,"g":94},"6421":{"m":89,"g":94},"6797":{"m":89,"g":94},"6795":{"m":89,"g":94},"6787":{"m":89,"g":94},"6794":{"m":89,"g":94},"6788":{"m":89,"g":94},"6791":{"m":89,"g":94},"6792":{"m":89,"g":94},"6734":{"m":89,"g":94},"6408":{"m":89,"g":94},"6786":{"m":89,"g":94},"6785":{"m":89,"g":94},"6782":{"m":89,"g":94},"6784":{"m":89,"g":94},"6679":{"m":89,"g":94},"6772":{"m":89,"g":94},"6289":{"m":89,"g":94},"6509":{"m":89,"g":94},"6737":{"m":89,"g":94},"6761":{"m":89,"g":94},"6765":{"m":89,"g":94},"6727":{"m":89,"g":94},"6265":{"m":89,"g":94},"6748":{"m":89,"g":94},"6746":{"m":89,"g":94},"6742":{"m":89,"g":94},"6728":{"m":89,"g":94},"6680":{"m":89,"g":94},"6437":{"m":89,"g":94},"6729":{"m":89,"g":94},"6545":{"m":89,"g":94},"6705":{"m":89,"g":94},"6715":{"m":89,"g":94},"6725":{"m":89,"g":94},"6718":{"m":89,"g":94},"6720":{"m":89,"g":94},"6726":{"m":89,"g":94},"6676":{"m":89,"g":94},"6709":{"m":89,"g":94},"6719":{"m":89,"g":94},"6479":{"m":89,"g":94},"6668":{"m":89,"g":94},"6682":{"m":89,"g":94},"6711":{"m":89,"g":94},"6710":{"m":89,"g":94},"6712":{"m":89,"g":94},"6706":{"m":89,"g":94},"6703":{"m":89,"g":94},"6697":{"m":89,"g":94},"6649":{"m":89,"g":94},"6685":{"m":89,"g":94},"6689":{"m":89,"g":94},"6693":{"m":89,"g":94},"6655":{"m":89,"g":94},"6260":{"m":89,"g":94},"6627":{"m":89,"g":94},"6687":{"m":89,"g":94},"6582":{"m":89,"g":94},"6672":{"m":89,"g":94},"6678":{"m":89,"g":94},"6380":{"m":89,"g":94},"6665":{"m":89,"g":94},"6673":{"m":89,"g":94},"6007":{"m":89,"g":94},"6473":{"m":89,"g":94},"6677":{"m":89,"g":94},"6661":{"m":89,"g":94},"6660":{"m":89,"g":94},"6638":{"m":89,"g":94},"6606":{"m":89,"g":94}
,"6662":{"m":89,"g":94},"6652":{"m":89,"g":94},"6658":{"m":89,"g":94},"6403":{"m":89,"g":94},"6601":{"m":89,"g":94},"6640":{"m":89,"g":94},"6450":{"m":89,"g":94},"6650":{"m":89,"g":94},"6585":{"m":89,"g":94},"6648":{"m":89,"g":94},"6634":{"m":89,"g":94},"6646":{"m":89,"g":94},"6643":{"m":89,"g":94},"6547":{"m":89,"g":94},"6631":{"m":89,"g":94},"6603":{"m":89,"g":94},"6263":{"m":89,"g":94},"6639":{"m":89,"g":94},"6635":{"m":89,"g":94},"6620":{"m":89,"g":94},"6629":{"m":89,"g":94},"6628":{"m":89,"g":94},"6617":{"m":89,"g":94},"6306":{"m":89,"g":94},"6599":{"m":89,"g":94},"6611":{"m":89,"g":94},"6598":{"m":89,"g":94},"6597":{"m":89,"g":94},"6581":{"m":89,"g":94},"6610":{"m":89,"g":94},"6575":{"m":89,"g":94},"6609":{"m":89,"g":94},"6594":{"m":89,"g":94},"6587":{"m":89,"g":94},"6596":{"m":89,"g":94},"6595":{"m":89,"g":94},"6593":{"m":89,"g":94},"6586":{"m":89,"g":94},"6571":{"m":89,"g":94},"6439":{"m":89,"g":94},"6527":{"m":89,"g":94},"6588":{"m":89,"g":94},"6570":{"m":89,"g":94},"6546":{"m":89,"g":94},"6537":{"m":89,"g":94},"6494":{"m":89,"g":94},"5961":{"m":89,"g":94},"6543":{"m":89,"g":94},"4068":{"m":89,"g":94},"6577":{"m":89,"g":94},"6578":{"m":89,"g":94},"6477":{"m":89,"g":94},"6564":{"m":89,"g":94},"6576":{"m":89,"g":94},"7248":{"m":90,"g":94},"7244":{"m":90,"g":94},"7247":{"m":90,"g":94},"6058":{"m":90,"g":94},"7234":{"m":90,"g":94},"7245":{"m":90,"g":94},"7231":{"m":90,"g":94},"7239":{"m":90,"g":94},"7232":{"m":90,"g":94},"7207":{"m":90,"g":94},"7213":{"m":90,"g":94},"7228":{"m":90,"g":94},"7221":{"m":90,"g":94},"7218":{"m":90,"g":94},"6378":{"m":90,"g":94},"7217":{"m":90,"g":94},"7215":{"m":90,"g":94},"7214":{"m":90,"g":94},"7210":{"m":90,"g":94},"7163":{"m":90,"g":94},"7205":{"m":90,"g":94},"7204":{"m":90,"g":94},"7202":{"m":90,"g":94},"7200":{"m":90,"g":94},"7196":{"m":90,"g":94},"7198":{"m":90,"g":94},"7195":{"m":90,"g":94},"7186":{"m":90,"g":94},"7180":{"m":90,"g":94},"7189":{"m":90,"g":94},"7191":{"m":90,"g":94},"7190":{"m":90,"g":94},"7184":{"m":90,"g":94
},"7181":{"m":90,"g":94},"7177":{"m":90,"g":94},"7178":{"m":90,"g":94},"7175":{"m":90,"g":94},"7172":{"m":90,"g":94},"7173":{"m":90,"g":94},"7161":{"m":90,"g":94},"7150":{"m":90,"g":94},"7153":{"m":90,"g":94},"7170":{"m":90,"g":94},"7165":{"m":90,"g":94},"7156":{"m":90,"g":94},"7020":{"m":90,"g":94},"7157":{"m":90,"g":94},"6814":{"m":90,"g":94},"7154":{"m":90,"g":94},"7155":{"m":90,"g":94},"7152":{"m":90,"g":94},"7146":{"m":90,"g":94},"7145":{"m":90,"g":94},"7056":{"m":90,"g":94},"7058":{"m":90,"g":94},"7140":{"m":90,"g":94},"7134":{"m":90,"g":94},"6026":{"m":90,"g":94},"7092":{"m":90,"g":94},"7126":{"m":90,"g":94},"7119":{"m":90,"g":94},"7115":{"m":90,"g":94},"6919":{"m":90,"g":94},"7093":{"m":90,"g":94},"6994":{"m":90,"g":94},"6824":{"m":90,"g":94},"7079":{"m":90,"g":94},"6106":{"m":90,"g":94},"7091":{"m":90,"g":94},"7097":{"m":90,"g":94},"6870":{"m":90,"g":94},"7067":{"m":90,"g":94},"7076":{"m":90,"g":94},"7046":{"m":90,"g":94},"7054":{"m":90,"g":94},"7073":{"m":90,"g":94},"6031":{"m":90,"g":94},"7071":{"m":90,"g":94},"6579":{"m":90,"g":94},"7066":{"m":90,"g":94},"6716":{"m":90,"g":94},"7064":{"m":90,"g":94},"7049":{"m":90,"g":94},"7063":{"m":90,"g":94},"6947":{"m":90,"g":94},"7057":{"m":90,"g":94},"7061":{"m":90,"g":94},"7060":{"m":90,"g":94},"7021":{"m":90,"g":94},"7053":{"m":90,"g":94},"7037":{"m":90,"g":94},"7051":{"m":90,"g":94},"6999":{"m":90,"g":94},"7045":{"m":90,"g":94},"7043":{"m":90,"g":94},"7040":{"m":90,"g":94},"7493":{"m":91,"g":94},"7490":{"m":91,"g":94},"7376":{"m":91,"g":94},"7487":{"m":91,"g":94},"7347":{"m":91,"g":94},"7269":{"m":91,"g":94},"7449":{"m":91,"g":94},"7469":{"m":91,"g":94},"7378":{"m":91,"g":94},"7481":{"m":91,"g":94},"7382":{"m":91,"g":94},"7480":{"m":91,"g":94},"7456":{"m":91,"g":94},"7479":{"m":91,"g":94},"7397":{"m":91,"g":94},"7472":{"m":91,"g":94},"7457":{"m":91,"g":94},"7290":{"m":91,"g":94},"6821":{"m":91,"g":94},"7454":{"m":91,"g":94},"7451":{"m":91,"g":94},"7445":{"m":91,"g":94},"7361":{"m":91,"g":94},"7391":{"m":91,"g":9
4},"7441":{"m":91,"g":94},"7327":{"m":91,"g":94},"7406":{"m":91,"g":94},"7414":{"m":91,"g":94},"7408":{"m":91,"g":94},"7412":{"m":91,"g":94},"7351":{"m":91,"g":94},"7420":{"m":91,"g":94},"7409":{"m":91,"g":94},"7425":{"m":91,"g":94},"6984":{"m":91,"g":94},"7394":{"m":91,"g":94},"7396":{"m":91,"g":94},"7400":{"m":91,"g":94},"7329":{"m":91,"g":94},"7401":{"m":91,"g":94},"7403":{"m":91,"g":94},"7402":{"m":91,"g":94},"7285":{"m":91,"g":94},"7219":{"m":91,"g":94},"7360":{"m":91,"g":94},"7399":{"m":91,"g":94},"7398":{"m":91,"g":94},"6389":{"m":91,"g":94},"7372":{"m":91,"g":94},"5485":{"m":91,"g":94},"7393":{"m":91,"g":94},"7356":{"m":91,"g":94},"7326":{"m":91,"g":94},"7322":{"m":91,"g":94},"7371":{"m":91,"g":94},"7364":{"m":91,"g":94},"7159":{"m":91,"g":94},"7370":{"m":91,"g":94},"7343":{"m":91,"g":94},"7366":{"m":91,"g":94},"7362":{"m":91,"g":94},"7242":{"m":91,"g":94},"7363":{"m":91,"g":94},"7303":{"m":91,"g":94},"7354":{"m":91,"g":94},"7099":{"m":91,"g":94},"7333":{"m":91,"g":94},"7331":{"m":91,"g":94},"7284":{"m":91,"g":94},"7096":{"m":91,"g":94},"7319":{"m":91,"g":94},"7003":{"m":91,"g":94},"7301":{"m":91,"g":94},"7297":{"m":91,"g":94},"7251":{"m":91,"g":94},"7300":{"m":91,"g":94},"6614":{"m":91,"g":94},"7267":{"m":91,"g":94},"7264":{"m":91,"g":94},"7237":{"m":91,"g":94},"7289":{"m":91,"g":94},"7288":{"m":91,"g":94},"7286":{"m":91,"g":94},"6842":{"m":91,"g":94},"7283":{"m":91,"g":94},"7164":{"m":91,"g":94},"7022":{"m":91,"g":94},"6081":{"m":91,"g":94},"7265":{"m":91,"g":94},"7160":{"m":91,"g":94},"7179":{"m":91,"g":94},"7167":{"m":91,"g":94},"7252":{"m":91,"g":94},"7122":{"m":91,"g":94},"7233":{"m":91,"g":94},"7125":{"m":91,"g":94},"7559":{"m":92,"g":94},"7541":{"m":92,"g":94},"7542":{"m":92,"g":94},"7544":{"m":92,"g":94},"7549":{"m":92,"g":94},"7543":{"m":92,"g":94},"7527":{"m":92,"g":94},"7531":{"m":92,"g":94},"7148":{"m":92,"g":94},"7499":{"m":92,"g":94},"7507":{"m":92,"g":94},"7513":{"m":92,"g":94},"7521":{"m":92,"g":94},"7520":{"m":92,"g":94},"6793":{"m":92,"g":
94},"6721":{"m":92,"g":94},"7386":{"m":92,"g":94},"7522":{"m":92,"g":94},"6641":{"m":92,"g":94},"6626":{"m":92,"g":94},"7498":{"m":92,"g":94},"7512":{"m":92,"g":94},"7510":{"m":92,"g":94},"7516":{"m":92,"g":94},"7508":{"m":92,"g":94},"7437":{"m":92,"g":94},"7489":{"m":92,"g":94},"7505":{"m":92,"g":94},"6717":{"m":92,"g":94},"7277":{"m":92,"g":94},"7422":{"m":92,"g":94},"7439":{"m":92,"g":94},"7236":{"m":92,"g":94},"7423":{"m":92,"g":94},"7268":{"m":92,"g":94},"7802":{"m":93,"g":94},"7801":{"m":93,"g":94},"7799":{"m":93,"g":94},"7792":{"m":93,"g":94},"7800":{"m":93,"g":94},"7756":{"m":93,"g":94},"7790":{"m":93,"g":94},"7757":{"m":93,"g":94},"7786":{"m":93,"g":94},"7222":{"m":93,"g":94},"7787":{"m":93,"g":94},"7784":{"m":93,"g":94},"7444":{"m":93,"g":94},"7623":{"m":93,"g":94},"7782":{"m":93,"g":94},"7596":{"m":93,"g":94},"7772":{"m":93,"g":94},"7705":{"m":93,"g":94},"7745":{"m":93,"g":94},"7778":{"m":93,"g":94},"7419":{"m":93,"g":94},"7418":{"m":93,"g":94},"7748":{"m":93,"g":94},"7764":{"m":93,"g":94},"7741":{"m":93,"g":94},"7729":{"m":93,"g":94},"7390":{"m":93,"g":94},"7759":{"m":93,"g":94},"7751":{"m":93,"g":94},"7754":{"m":93,"g":94},"7755":{"m":93,"g":94},"7744":{"m":93,"g":94},"7752":{"m":93,"g":94},"7723":{"m":93,"g":94},"7673":{"m":93,"g":94},"7740":{"m":93,"g":94},"7681":{"m":93,"g":94},"6771":{"m":93,"g":94},"7647":{"m":93,"g":94},"7750":{"m":93,"g":94},"7722":{"m":93,"g":94},"7738":{"m":93,"g":94},"7278":{"m":93,"g":94},"7731":{"m":93,"g":94},"7735":{"m":93,"g":94},"6770":{"m":93,"g":94},"7734":{"m":93,"g":94},"7714":{"m":93,"g":94},"6698":{"m":93,"g":94},"7462":{"m":93,"g":94},"6549":{"m":93,"g":94},"7621":{"m":93,"g":94},"6512":{"m":93,"g":94},"7292":{"m":93,"g":94},"7416":{"m":93,"g":94},"7717":{"m":93,"g":94},"7677":{"m":93,"g":94},"7683":{"m":93,"g":94},"7697":{"m":93,"g":94},"7635":{"m":93,"g":94},"7642":{"m":93,"g":94},"7698":{"m":93,"g":94},"7684":{"m":93,"g":94},"7688":{"m":93,"g":94},"7676":{"m":93,"g":94},"7629":{"m":93,"g":94},"6985":{"m":93,"g"
:94},"7675":{"m":93,"g":94},"7486":{"m":93,"g":94},"7671":{"m":93,"g":94},"7648":{"m":93,"g":94},"7318":{"m":93,"g":94},"7663":{"m":93,"g":94},"7627":{"m":93,"g":94},"7632":{"m":93,"g":94},"7524":{"m":93,"g":94},"7643":{"m":93,"g":94},"7640":{"m":93,"g":94},"7539":{"m":93,"g":94},"7176":{"m":93,"g":94},"7580":{"m":93,"g":94},"7628":{"m":93,"g":94},"7619":{"m":93,"g":94},"7432":{"m":93,"g":94},"7636":{"m":93,"g":94},"7630":{"m":93,"g":94},"7624":{"m":93,"g":94},"7625":{"m":93,"g":94},"7036":{"m":93,"g":94},"7620":{"m":93,"g":94},"7310":{"m":93,"g":94},"7309":{"m":93,"g":94},"7618":{"m":93,"g":94},"7598":{"m":93,"g":94},"7446":{"m":93,"g":94},"7584":{"m":93,"g":94},"7552":{"m":93,"g":94},"7308":{"m":93,"g":94},"6769":{"m":93,"g":94},"6563":{"m":93,"g":94},"7612":{"m":93,"g":94},"7588":{"m":93,"g":94},"7610":{"m":93,"g":94},"7581":{"m":93,"g":94},"7225":{"m":93,"g":94},"7577":{"m":93,"g":94},"7540":{"m":93,"g":94},"7208":{"m":93,"g":94},"7569":{"m":93,"g":94},"7573":{"m":93,"g":94},"7575":{"m":93,"g":94},"7330":{"m":93,"g":94},"7882":{"m":95,"g":97},"7880":{"m":95,"g":97},"7660":{"m":95,"g":97},"7846":{"m":95,"g":97},"7818":{"m":95,"g":97},"7866":{"m":95,"g":97},"7724":{"m":95,"g":97},"7830":{"m":95,"g":97},"7579":{"m":95,"g":97},"7840":{"m":95,"g":97},"7864":{"m":95,"g":97},"7860":{"m":95,"g":97},"7832":{"m":95,"g":97},"7853":{"m":95,"g":97},"7850":{"m":95,"g":97},"7129":{"m":95,"g":97},"7762":{"m":95,"g":97},"7821":{"m":95,"g":97},"7187":{"m":95,"g":97},"7816":{"m":95,"g":97},"7798":{"m":95,"g":94},"7797":{"m":95,"g":94},"7313":{"m":95,"g":94},"7689":{"m":95,"g":94},"7813":{"m":95,"g":94},"6094":{"m":95,"g":94},"7794":{"m":95,"g":94},"7812":{"m":95,"g":94},"7733":{"m":95,"g":94},"5246":{"m":95,"g":94},"7785":{"m":95,"g":94},"7709":{"m":95,"g":94},"7793":{"m":95,"g":94},"7803":{"m":95,"g":94},"7796":{"m":95,"g":94},"7963":{"m":96,"g":97},"7971":{"m":96,"g":97},"7960":{"m":96,"g":97},"7969":{"m":96,"g":97},"7970":{"m":96,"g":97},"7962":{"m":96,"g":97},"7968":{"m":96,"g
":97},"7964":{"m":96,"g":97},"7932":{"m":96,"g":97},"7961":{"m":96,"g":97},"7953":{"m":96,"g":97},"7795":{"m":96,"g":97},"7940":{"m":96,"g":97},"7775":{"m":96,"g":97},"7791":{"m":96,"g":97},"6449":{"m":96,"g":97},"7922":{"m":96,"g":97},"7907":{"m":96,"g":97},"5888":{"m":96,"g":97},"7904":{"m":96,"g":97},"7608":{"m":96,"g":97},"7899":{"m":96,"g":97},"7898":{"m":96,"g":97},"7838":{"m":96,"g":97},"7885":{"m":96,"g":97},"7872":{"m":96,"g":97},"7895":{"m":96,"g":97},"8265":{"m":98,"g":103},"8260":{"m":98,"g":103},"7822":{"m":98,"g":103},"8059":{"m":98,"g":103},"8257":{"m":98,"g":103},"7484":{"m":98,"g":103},"8221":{"m":98,"g":103},"8237":{"m":98,"g":103},"8231":{"m":98,"g":103},"8202":{"m":98,"g":103},"8204":{"m":98,"g":103},"8209":{"m":98,"g":97},"8208":{"m":98,"g":97},"8107":{"m":98,"g":97},"8193":{"m":98,"g":97},"8200":{"m":98,"g":97},"8195":{"m":98,"g":97},"7935":{"m":98,"g":97},"8184":{"m":98,"g":97},"8197":{"m":98,"g":97},"8067":{"m":98,"g":97},"8163":{"m":98,"g":97},"8183":{"m":98,"g":97},"7983":{"m":98,"g":97},"8182":{"m":98,"g":97},"8181":{"m":98,"g":97},"7825":{"m":98,"g":97},"8178":{"m":98,"g":97},"8176":{"m":98,"g":97},"7312":{"m":98,"g":97},"6230":{"m":98,"g":97},"7999":{"m":98,"g":97},"8115":{"m":98,"g":97},"8175":{"m":98,"g":97},"8172":{"m":98,"g":97},"8019":{"m":98,"g":97},"8167":{"m":98,"g":97},"8170":{"m":98,"g":97},"8169":{"m":98,"g":97},"8103":{"m":98,"g":97},"8161":{"m":98,"g":97},"8171":{"m":98,"g":97},"8168":{"m":98,"g":97},"8166":{"m":98,"g":97},"8165":{"m":98,"g":97},"7966":{"m":98,"g":97},"8157":{"m":98,"g":97},"8160":{"m":98,"g":97},"8158":{"m":98,"g":97},"8028":{"m":98,"g":97},"7931":{"m":98,"g":97},"8048":{"m":98,"g":97},"7302":{"m":98,"g":97},"6881":{"m":98,"g":97},"8155":{"m":98,"g":97},"7661":{"m":98,"g":97},"7987":{"m":98,"g":97},"8113":{"m":98,"g":97},"8147":{"m":98,"g":97},"8142":{"m":98,"g":97},"8136":{"m":98,"g":97},"8141":{"m":98,"g":97},"7704":{"m":98,"g":97},"7820":{"m":98,"g":97},"7889":{"m":98,"g":97},"7506":{"m":98,"g":97},"7959
":{"m":98,"g":97},"8127":{"m":98,"g":97},"7924":{"m":98,"g":97},"7030":{"m":98,"g":97},"8117":{"m":98,"g":97},"8102":{"m":98,"g":97},"8046":{"m":98,"g":97},"7884":{"m":98,"g":97},"7989":{"m":98,"g":97},"8105":{"m":98,"g":97},"8110":{"m":98,"g":97},"8108":{"m":98,"g":97},"8100":{"m":98,"g":97},"7597":{"m":98,"g":97},"8075":{"m":98,"g":97},"7992":{"m":98,"g":97},"7634":{"m":98,"g":97},"8098":{"m":98,"g":97},"8090":{"m":98,"g":97},"8086":{"m":98,"g":97},"8077":{"m":98,"g":97},"8001":{"m":98,"g":97},"7760":{"m":98,"g":97},"8029":{"m":98,"g":97},"8058":{"m":98,"g":97},"5163":{"m":98,"g":97},"8045":{"m":98,"g":97},"8047":{"m":98,"g":97},"7943":{"m":98,"g":97},"8052":{"m":98,"g":97},"6556":{"m":98,"g":97},"8022":{"m":98,"g":97},"8002":{"m":98,"g":97},"8023":{"m":98,"g":97},"8044":{"m":98,"g":97},"7887":{"m":98,"g":97},"8035":{"m":98,"g":97},"7897":{"m":98,"g":97},"7982":{"m":98,"g":97},"8006":{"m":98,"g":97},"7649":{"m":98,"g":97},"7653":{"m":98,"g":97},"8005":{"m":98,"g":97},"7874":{"m":98,"g":97},"8021":{"m":98,"g":97},"8010":{"m":98,"g":97},"7902":{"m":98,"g":97},"7862":{"m":98,"g":97},"7844":{"m":98,"g":97},"7997":{"m":98,"g":97},"7367":{"m":98,"g":97},"7749":{"m":98,"g":97},"7952":{"m":98,"g":97},"7988":{"m":98,"g":97},"7814":{"m":98,"g":97},"7978":{"m":98,"g":97},"7985":{"m":98,"g":97},"7975":{"m":98,"g":97},"7972":{"m":98,"g":97},"7950":{"m":98,"g":97},"8305":{"m":99,"g":103},"8370":{"m":99,"g":103},"8333":{"m":99,"g":103},"8367":{"m":99,"g":103},"8363":{"m":99,"g":103},"8359":{"m":99,"g":103},"8332":{"m":99,"g":103},"8357":{"m":99,"g":103},"8344":{"m":99,"g":103},"7858":{"m":99,"g":103},"8353":{"m":99,"g":103},"8341":{"m":99,"g":103},"8000":{"m":99,"g":103},"7135":{"m":99,"g":103},"8266":{"m":99,"g":103},"8280":{"m":99,"g":103},"6619":{"m":99,"g":103},"8233":{"m":99,"g":103},"8334":{"m":99,"g":103},"8307":{"m":99,"g":103},"8300":{"m":99,"g":103},"8299":{"m":99,"g":103},"8301":{"m":99,"g":103},"8310":{"m":99,"g":103},"8298":{"m":99,"g":103},"8315":{"m":99,"g":103},"
8303":{"m":99,"g":103},"8317":{"m":99,"g":103},"8235":{"m":99,"g":103},"8070":{"m":99,"g":103},"7562":{"m":99,"g":103},"8043":{"m":99,"g":103},"7685":{"m":99,"g":103},"8304":{"m":99,"g":103},"8240":{"m":99,"g":103},"8262":{"m":99,"g":103},"7708":{"m":99,"g":103},"8133":{"m":99,"g":103},"8302":{"m":99,"g":103},"8295":{"m":99,"g":103},"8130":{"m":99,"g":103},"8288":{"m":99,"g":103},"8282":{"m":99,"g":103},"8264":{"m":99,"g":103},"8261":{"m":99,"g":103},"8284":{"m":99,"g":103},"8272":{"m":99,"g":103},"8458":{"m":100,"g":103},"8457":{"m":100,"g":103},"8449":{"m":100,"g":103},"8456":{"m":100,"g":103},"8445":{"m":100,"g":103},"8441":{"m":100,"g":103},"8442":{"m":100,"g":103},"8224":{"m":100,"g":103},"8352":{"m":100,"g":103},"8416":{"m":100,"g":103},"6338":{"m":100,"g":103},"8415":{"m":100,"g":103},"8422":{"m":100,"g":103},"8425":{"m":100,"g":103},"8419":{"m":100,"g":103},"8417":{"m":100,"g":103},"8316":{"m":100,"g":103},"7603":{"m":100,"g":103},"8213":{"m":100,"g":103},"8414":{"m":100,"g":103},"8062":{"m":100,"g":103},"8258":{"m":100,"g":103},"8406":{"m":100,"g":103},"8407":{"m":100,"g":103},"8405":{"m":100,"g":103},"8156":{"m":100,"g":103},"8241":{"m":100,"g":103},"8397":{"m":100,"g":103},"7720":{"m":100,"g":103},"8351":{"m":100,"g":103},"8395":{"m":100,"g":103},"8382":{"m":100,"g":103},"8392":{"m":100,"g":103},"8036":{"m":100,"g":103},"7739":{"m":100,"g":103},"7974":{"m":100,"g":103},"7976":{"m":100,"g":103},"8403":{"m":100,"g":103},"8401":{"m":100,"g":103},"8372":{"m":100,"g":103},"8394":{"m":100,"g":103},"8396":{"m":100,"g":103},"8314":{"m":100,"g":103},"8350":{"m":100,"g":103},"8381":{"m":100,"g":103},"6003":{"m":100,"g":103},"7737":{"m":100,"g":103},"8267":{"m":100,"g":103},"8343":{"m":100,"g":103},"7000":{"m":100,"g":103},"8356":{"m":100,"g":103},"8374":{"m":100,"g":103},"8517":{"m":101,"g":103},"8489":{"m":101,"g":103},"8482":{"m":101,"g":103},"7973":{"m":101,"g":103},"8426":{"m":101,"g":103},"8413":{"m":101,"g":103},"8486":{"m":101,"g":103},"8485":{"m":101,"g":10
3},"8477":{"m":101,"g":103},"8480":{"m":101,"g":103},"8478":{"m":101,"g":103},"8476":{"m":101,"g":103},"8469":{"m":101,"g":103},"8473":{"m":101,"g":103},"8421":{"m":101,"g":103},"7273":{"m":101,"g":103},"8453":{"m":101,"g":103},"8467":{"m":101,"g":103},"7565":{"m":101,"g":103},"8465":{"m":101,"g":103},"8125":{"m":101,"g":103},"8608":{"m":102,"g":103},"8590":{"m":102,"g":103},"8583":{"m":102,"g":103},"8604":{"m":102,"g":103},"8515":{"m":102,"g":103},"8603":{"m":102,"g":103},"8550":{"m":102,"g":103},"7211":{"m":102,"g":103},"8533":{"m":102,"g":103},"8404":{"m":102,"g":103},"8599":{"m":102,"g":103},"8514":{"m":102,"g":103},"8365":{"m":102,"g":103},"8544":{"m":102,"g":103},"8541":{"m":102,"g":103},"8564":{"m":102,"g":103},"8479":{"m":102,"g":103},"7280":{"m":102,"g":103},"8584":{"m":102,"g":103},"8154":{"m":102,"g":103},"8461":{"m":102,"g":103},"6869":{"m":102,"g":103},"8562":{"m":102,"g":103},"8545":{"m":102,"g":103},"8560":{"m":102,"g":103},"8516":{"m":102,"g":103},"8498":{"m":102,"g":103},"8448":{"m":102,"g":103},"8431":{"m":102,"g":103},"8537":{"m":102,"g":103},"8483":{"m":102,"g":103},"8531":{"m":102,"g":103},"8535":{"m":102,"g":103},"8499":{"m":102,"g":103},"8528":{"m":102,"g":103},"8527":{"m":102,"g":103},"8652":{"m":105,"g":107},"8051":{"m":105,"g":107},"8318":{"m":105,"g":107},"8636":{"m":105,"g":107},"8450":{"m":105,"g":107},"8645":{"m":105,"g":104},"8640":{"m":105,"g":104},"8644":{"m":105,"g":104},"8308":{"m":105,"g":104},"8270":{"m":105,"g":104},"8083":{"m":105,"g":104},"8642":{"m":105,"g":104},"8532":{"m":105,"g":104},"8632":{"m":105,"g":104},"8598":{"m":105,"g":104},"8630":{"m":105,"g":104},"8634":{"m":105,"g":104},"8488":{"m":105,"g":104},"8633":{"m":105,"g":104},"6227":{"m":105,"g":104},"8577":{"m":105,"g":104},"8628":{"m":105,"g":103},"8629":{"m":105,"g":103},"8626":{"m":105,"g":103},"8623":{"m":105,"g":103},"8611":{"m":105,"g":103},"8595":{"m":105,"g":103},"8727":{"m":106,"g":107},"8723":{"m":106,"g":107},"8579":{"m":106,"g":107},"8567":{"m":106,"g":10
7},"8718":{"m":106,"g":107},"8444":{"m":106,"g":107},"8547":{"m":106,"g":107},"8683":{"m":106,"g":107},"8631":{"m":106,"g":107},"7379":{"m":106,"g":107},"8719":{"m":106,"g":107},"8306":{"m":106,"g":107},"8650":{"m":106,"g":107},"8721":{"m":106,"g":107},"8524":{"m":106,"g":107},"8709":{"m":106,"g":107},"8722":{"m":106,"g":107},"7369":{"m":106,"g":107},"8714":{"m":106,"g":107},"8705":{"m":106,"g":107},"8693":{"m":106,"g":107},"8717":{"m":106,"g":107},"8713":{"m":106,"g":107},"8701":{"m":106,"g":107},"8711":{"m":106,"g":107},"8706":{"m":106,"g":107},"8704":{"m":106,"g":107},"8691":{"m":106,"g":107},"8512":{"m":106,"g":107},"7434":{"m":106,"g":107},"8688":{"m":106,"g":107},"8694":{"m":106,"g":107},"8364":{"m":106,"g":107},"8238":{"m":106,"g":107},"8618":{"m":106,"g":107},"8522":{"m":106,"g":107},"8668":{"m":106,"g":107},"8648":{"m":106,"g":107},"8679":{"m":106,"g":107},"8686":{"m":106,"g":107},"8684":{"m":106,"g":107},"8685":{"m":106,"g":107},"8647":{"m":106,"g":107},"8094":{"m":106,"g":107},"8664":{"m":106,"g":107},"8543":{"m":106,"g":107},"8665":{"m":106,"g":107},"8013":{"m":106,"g":107},"8643":{"m":106,"g":107},"8658":{"m":106,"g":107},"8511":{"m":106,"g":107},"8635":{"m":106,"g":107},"8653":{"m":106,"g":107},"9533":{"m":108,"g":115},"9532":{"m":108,"g":115},"9372":{"m":108,"g":115},"9485":{"m":108,"g":115},"9478":{"m":108,"g":115},"8034":{"m":108,"g":115},"9473":{"m":108,"g":115},"9525":{"m":108,"g":115},"9530":{"m":108,"g":115},"8946":{"m":108,"g":115},"9004":{"m":108,"g":115},"9241":{"m":108,"g":115},"9211":{"m":108,"g":115},"9503":{"m":108,"g":115},"9519":{"m":108,"g":115},"9456":{"m":108,"g":115},"7699":{"m":108,"g":115},"9200":{"m":108,"g":115},"9516":{"m":108,"g":115},"9127":{"m":108,"g":115},"9513":{"m":108,"g":115},"8624":{"m":108,"g":115},"8865":{"m":108,"g":115},"9109":{"m":108,"g":115},"9452":{"m":108,"g":115},"9507":{"m":108,"g":115},"9303":{"m":108,"g":115},"9331":{"m":108,"g":115},"9497":{"m":108,"g":115},"9494":{"m":108,"g":115},"9475":{"m":108,"g":11
5},"9480":{"m":108,"g":115},"9487":{"m":108,"g":115},"9491":{"m":108,"g":115},"9482":{"m":108,"g":115},"9492":{"m":108,"g":115},"9483":{"m":108,"g":115},"9333":{"m":108,"g":115},"9474":{"m":108,"g":115},"8616":{"m":108,"g":115},"9356":{"m":108,"g":115},"8593":{"m":108,"g":115},"9468":{"m":108,"g":115},"9467":{"m":108,"g":115},"9470":{"m":108,"g":115},"9469":{"m":108,"g":115},"9455":{"m":108,"g":115},"9463":{"m":108,"g":115},"9464":{"m":108,"g":115},"9462":{"m":108,"g":115},"9461":{"m":108,"g":115},"9458":{"m":108,"g":115},"9454":{"m":108,"g":115},"9427":{"m":108,"g":115},"9433":{"m":108,"g":115},"7604":{"m":108,"g":115},"8521":{"m":108,"g":115},"9392":{"m":108,"g":115},"9395":{"m":108,"g":115},"9238":{"m":108,"g":115},"9384":{"m":108,"g":115},"9430":{"m":108,"g":115},"9346":{"m":108,"g":115},"9399":{"m":108,"g":115},"9251":{"m":108,"g":115},"9388":{"m":108,"g":115},"9261":{"m":108,"g":115},"9420":{"m":108,"g":115},"9416":{"m":108,"g":115},"9415":{"m":108,"g":115},"9413":{"m":108,"g":115},"9371":{"m":108,"g":115},"9339":{"m":108,"g":115},"9357":{"m":108,"g":115},"9377":{"m":108,"g":115},"9381":{"m":108,"g":115},"9404":{"m":108,"g":115},"9359":{"m":108,"g":115},"9336":{"m":108,"g":115},"9249":{"m":108,"g":115},"9409":{"m":108,"g":115},"8690":{"m":108,"g":115},"9278":{"m":108,"g":115},"9391":{"m":108,"g":115},"9106":{"m":108,"g":115},"7375":{"m":108,"g":115},"9385":{"m":108,"g":115},"9383":{"m":108,"g":115},"9378":{"m":108,"g":115},"9380":{"m":108,"g":115},"9376":{"m":108,"g":115},"9350":{"m":108,"g":115},"9368":{"m":108,"g":115},"9344":{"m":108,"g":115},"9369":{"m":108,"g":115},"9367":{"m":108,"g":115},"9370":{"m":108,"g":115},"9364":{"m":108,"g":115},"9360":{"m":108,"g":115},"9335":{"m":108,"g":115},"9361":{"m":108,"g":115},"9354":{"m":108,"g":115},"6295":{"m":108,"g":115},"9353":{"m":108,"g":115},"9348":{"m":108,"g":115},"9327":{"m":108,"g":115},"9332":{"m":108,"g":115},"9326":{"m":108,"g":115},"8990":{"m":108,"g":115},"7019":{"m":108,"g":115},"9321":{"m":108,"g":11
5},"9317":{"m":108,"g":115},"9299":{"m":108,"g":115},"9322":{"m":108,"g":115},"9306":{"m":108,"g":115},"9320":{"m":108,"g":115},"9059":{"m":108,"g":115},"8936":{"m":108,"g":115},"9284":{"m":108,"g":115},"9313":{"m":108,"g":115},"9316":{"m":108,"g":115},"9315":{"m":108,"g":115},"8829":{"m":108,"g":115},"9011":{"m":108,"g":115},"9289":{"m":108,"g":115},"9310":{"m":108,"g":115},"9307":{"m":108,"g":115},"9298":{"m":108,"g":115},"9276":{"m":108,"g":115},"9293":{"m":108,"g":115},"6307":{"m":108,"g":115},"9245":{"m":108,"g":115},"9287":{"m":108,"g":115},"9286":{"m":108,"g":115},"8289":{"m":108,"g":115},"9281":{"m":108,"g":115},"8520":{"m":108,"g":115},"9272":{"m":108,"g":115},"9271":{"m":108,"g":115},"9279":{"m":108,"g":115},"9131":{"m":108,"g":115},"9242":{"m":108,"g":115},"9260":{"m":108,"g":115},"9268":{"m":108,"g":115},"9067":{"m":108,"g":115},"9264":{"m":108,"g":115},"9237":{"m":108,"g":115},"9232":{"m":108,"g":115},"9006":{"m":108,"g":115},"8893":{"m":108,"g":115},"9049":{"m":108,"g":115},"8846":{"m":108,"g":115},"9252":{"m":108,"g":115},"8027":{"m":108,"g":115},"9258":{"m":108,"g":115},"7758":{"m":108,"g":115},"9165":{"m":108,"g":115},"7667":{"m":108,"g":115},"9247":{"m":108,"g":115},"9246":{"m":108,"g":115},"8663":{"m":108,"g":115},"8268":{"m":108,"g":115},"9243":{"m":108,"g":115},"9236":{"m":108,"g":115},"9201":{"m":108,"g":115},"9198":{"m":108,"g":115},"9231":{"m":108,"g":115},"9220":{"m":108,"g":115},"9223":{"m":108,"g":115},"9222":{"m":108,"g":115},"8777":{"m":108,"g":115},"8790":{"m":108,"g":115},"9215":{"m":108,"g":115},"9218":{"m":108,"g":115},"9214":{"m":108,"g":115},"9208":{"m":108,"g":115},"9213":{"m":108,"g":115},"9207":{"m":108,"g":115},"9177":{"m":108,"g":115},"8849":{"m":108,"g":115},"9206":{"m":108,"g":115},"9205":{"m":108,"g":115},"9183":{"m":108,"g":115},"9204":{"m":108,"g":115},"9203":{"m":108,"g":115},"9202":{"m":108,"g":115},"9197":{"m":108,"g":115},"9008":{"m":108,"g":115},"8795":{"m":108,"g":115},"9191":{"m":108,"g":115},"9194":{"m":108,"g":11
5},"9060":{"m":108,"g":115},"8913":{"m":108,"g":115},"9185":{"m":108,"g":115},"8112":{"m":108,"g":115},"9065":{"m":108,"g":115},"8018":{"m":108,"g":115},"7687":{"m":108,"g":115},"7631":{"m":108,"g":115},"7004":{"m":108,"g":115},"8852":{"m":108,"g":115},"8808":{"m":108,"g":115},"8818":{"m":108,"g":115},"9154":{"m":108,"g":115},"9101":{"m":108,"g":115},"9162":{"m":108,"g":115},"9136":{"m":108,"g":115},"9171":{"m":108,"g":115},"9169":{"m":108,"g":115},"9159":{"m":108,"g":115},"8951":{"m":108,"g":115},"9161":{"m":108,"g":115},"8840":{"m":108,"g":115},"9134":{"m":108,"g":115},"9042":{"m":108,"g":115},"7957":{"m":108,"g":115},"9069":{"m":108,"g":115},"9028":{"m":108,"g":115},"8910":{"m":108,"g":115},"9149":{"m":108,"g":115},"9133":{"m":108,"g":115},"9126":{"m":108,"g":115},"9150":{"m":108,"g":115},"8484":{"m":108,"g":115},"9111":{"m":108,"g":115},"9146":{"m":108,"g":115},"9093":{"m":108,"g":115},"9088":{"m":108,"g":115},"8588":{"m":108,"g":115},"9137":{"m":108,"g":115},"8884":{"m":108,"g":115},"8651":{"m":108,"g":115},"9130":{"m":108,"g":115},"9129":{"m":108,"g":115},"8660":{"m":108,"g":115},"9119":{"m":108,"g":115},"8619":{"m":108,"g":115},"8610":{"m":108,"g":115},"8700":{"m":108,"g":115},"9125":{"m":108,"g":115},"9121":{"m":108,"g":115},"9014":{"m":108,"g":115},"9118":{"m":108,"g":115},"9122":{"m":108,"g":115},"9107":{"m":108,"g":115},"9113":{"m":108,"g":115},"9114":{"m":108,"g":115},"9021":{"m":108,"g":115},"9103":{"m":108,"g":115},"9077":{"m":108,"g":115},"9096":{"m":108,"g":115},"9005":{"m":108,"g":115},"9075":{"m":108,"g":115},"9097":{"m":108,"g":115},"9032":{"m":108,"g":115},"8766":{"m":108,"g":115},"9087":{"m":108,"g":115},"8293":{"m":108,"g":115},"9095":{"m":108,"g":115},"9084":{"m":108,"g":115},"8992":{"m":108,"g":115},"9089":{"m":108,"g":115},"9043":{"m":108,"g":115},"9086":{"m":108,"g":115},"9030":{"m":108,"g":115},"9053":{"m":108,"g":115},"8638":{"m":108,"g":115},"8731":{"m":108,"g":115},"9083":{"m":108,"g":115},"8752":{"m":108,"g":115},"9081":{"m":108,"g":11
5},"9082":{"m":108,"g":115},"8866":{"m":108,"g":115},"9080":{"m":108,"g":115},"8973":{"m":108,"g":115},"9063":{"m":108,"g":115},"9079":{"m":108,"g":115},"9066":{"m":108,"g":115},"7216":{"m":108,"g":115},"9051":{"m":108,"g":115},"9047":{"m":108,"g":115},"9057":{"m":108,"g":115},"9050":{"m":108,"g":115},"9054":{"m":108,"g":115},"9048":{"m":108,"g":115},"8997":{"m":108,"g":115},"9046":{"m":108,"g":115},"9044":{"m":108,"g":115},"9031":{"m":108,"g":115},"9037":{"m":108,"g":115},"9036":{"m":108,"g":115},"9034":{"m":108,"g":115},"9035":{"m":108,"g":115},"9033":{"m":108,"g":115},"8079":{"m":108,"g":115},"8794":{"m":108,"g":115},"9024":{"m":108,"g":115},"9027":{"m":108,"g":115},"9029":{"m":108,"g":115},"7626":{"m":108,"g":115},"9022":{"m":108,"g":115},"8940":{"m":108,"g":115},"8996":{"m":108,"g":115},"9018":{"m":108,"g":115},"9017":{"m":108,"g":115},"9019":{"m":108,"g":115},"8340":{"m":108,"g":115},"8991":{"m":108,"g":115},"8915":{"m":108,"g":115},"8245":{"m":108,"g":115},"9013":{"m":108,"g":115},"9007":{"m":108,"g":115},"9012":{"m":108,"g":115},"9003":{"m":108,"g":115},"9010":{"m":108,"g":115},"9001":{"m":108,"g":115},"8329":{"m":108,"g":115},"8355":{"m":108,"g":115},"8877":{"m":108,"g":115},"8995":{"m":108,"g":115},"8998":{"m":108,"g":115},"8878":{"m":108,"g":115},"6752":{"m":108,"g":115},"8673":{"m":108,"g":115},"8798":{"m":108,"g":115},"8687":{"m":108,"g":115},"8600":{"m":108,"g":115},"8966":{"m":108,"g":115},"8851":{"m":108,"g":115},"8984":{"m":108,"g":115},"8962":{"m":108,"g":115},"8987":{"m":108,"g":115},"8994":{"m":108,"g":115},"8993":{"m":108,"g":115},"8989":{"m":108,"g":115},"8667":{"m":108,"g":115},"8983":{"m":108,"g":115},"8980":{"m":108,"g":115},"8330":{"m":108,"g":115},"8770":{"m":108,"g":115},"8724":{"m":108,"g":115},"8988":{"m":108,"g":115},"8986":{"m":108,"g":115},"8785":{"m":108,"g":115},"8371":{"m":108,"g":115},"8982":{"m":108,"g":115},"8978":{"m":108,"g":115},"8981":{"m":108,"g":115},"8772":{"m":108,"g":115},"8971":{"m":108,"g":115},"8968":{"m":108,"g":11
5},"8972":{"m":108,"g":115},"7279":{"m":108,"g":115},"8941":{"m":108,"g":115},"8959":{"m":108,"g":115},"8757":{"m":108,"g":115},"8960":{"m":108,"g":115},"8958":{"m":108,"g":115},"8692":{"m":108,"g":115},"6555":{"m":108,"g":115},"8894":{"m":108,"g":115},"8957":{"m":108,"g":115},"8955":{"m":108,"g":115},"7657":{"m":108,"g":115},"8944":{"m":108,"g":115},"8799":{"m":108,"g":115},"8932":{"m":108,"g":115},"8952":{"m":108,"g":115},"8953":{"m":108,"g":115},"8947":{"m":108,"g":115},"8950":{"m":108,"g":115},"8720":{"m":108,"g":115},"8703":{"m":108,"g":115},"8923":{"m":108,"g":115},"8850":{"m":108,"g":115},"8933":{"m":108,"g":115},"8937":{"m":108,"g":115},"8929":{"m":108,"g":115},"8928":{"m":108,"g":115},"8925":{"m":108,"g":115},"8927":{"m":108,"g":115},"8908":{"m":108,"g":115},"8844":{"m":108,"g":107},"8916":{"m":108,"g":107},"8912":{"m":108,"g":107},"8698":{"m":108,"g":107},"8869":{"m":108,"g":107},"8898":{"m":108,"g":107},"8895":{"m":108,"g":107},"8041":{"m":108,"g":107},"5949":{"m":108,"g":107},"8888":{"m":108,"g":107},"8292":{"m":108,"g":107},"8787":{"m":108,"g":107},"8369":{"m":108,"g":107},"8697":{"m":108,"g":107},"8834":{"m":108,"g":107},"8883":{"m":108,"g":107},"8847":{"m":108,"g":107},"8539":{"m":108,"g":107},"8837":{"m":108,"g":107},"8880":{"m":108,"g":107},"8881":{"m":108,"g":107},"8811":{"m":108,"g":107},"8872":{"m":108,"g":107},"8861":{"m":108,"g":107},"8860":{"m":108,"g":107},"8868":{"m":108,"g":107},"8815":{"m":108,"g":107},"8859":{"m":108,"g":107},"8853":{"m":108,"g":107},"8843":{"m":108,"g":107},"8753":{"m":108,"g":107},"8751":{"m":108,"g":107},"8838":{"m":108,"g":107},"8144":{"m":108,"g":107},"8680":{"m":108,"g":107},"8839":{"m":108,"g":107},"8828":{"m":108,"g":107},"8836":{"m":108,"g":107},"8824":{"m":108,"g":107},"8809":{"m":108,"g":107},"8832":{"m":108,"g":107},"8681":{"m":108,"g":107},"8827":{"m":108,"g":107},"8823":{"m":108,"g":107},"8817":{"m":108,"g":107},"8804":{"m":108,"g":107},"8782":{"m":108,"g":107},"8802":{"m":108,"g":107},"8800":{"m":108,"g":10
7},"8797":{"m":108,"g":107},"8596":{"m":108,"g":107},"8779":{"m":108,"g":107},"8780":{"m":108,"g":107},"8744":{"m":108,"g":107},"8571":{"m":108,"g":107},"8212":{"m":108,"g":107},"8255":{"m":108,"g":107},"8776":{"m":108,"g":107},"8773":{"m":108,"g":107},"8771":{"m":108,"g":107},"8762":{"m":108,"g":107},"8768":{"m":108,"g":107},"8639":{"m":108,"g":107},"8552":{"m":108,"g":107},"8749":{"m":108,"g":107},"8294":{"m":108,"g":107},"8738":{"m":108,"g":107},"8437":{"m":108,"g":107},"8745":{"m":108,"g":107},"8733":{"m":108,"g":107},"8735":{"m":108,"g":107},"8737":{"m":108,"g":107},"8662":{"m":108,"g":107},"7114":{"m":108,"g":107},"8678":{"m":108,"g":107},"8732":{"m":108,"g":107},"8729":{"m":108,"g":107},"8676":{"m":108,"g":107},"8699":{"m":108,"g":107},"9558":{"m":109,"g":115},"9557":{"m":109,"g":115},"9549":{"m":109,"g":115},"9544":{"m":109,"g":115},"9547":{"m":109,"g":115},"9546":{"m":109,"g":115},"9592":{"m":110,"g":115},"9591":{"m":110,"g":115},"9589":{"m":110,"g":115},"9587":{"m":110,"g":115},"9581":{"m":110,"g":115},"9578":{"m":110,"g":115},"9229":{"m":110,"g":115},"9536":{"m":110,"g":115},"9535":{"m":110,"g":115},"9559":{"m":110,"g":115},"9560":{"m":110,"g":115},"7317":{"m":110,"g":115},"9576":{"m":110,"g":115},"9429":{"m":110,"g":115},"9565":{"m":110,"g":115},"9498":{"m":110,"g":115},"9716":{"m":111,"g":115},"9708":{"m":111,"g":115},"9340":{"m":111,"g":115},"9703":{"m":111,"g":115},"9702":{"m":111,"g":115},"9695":{"m":111,"g":115},"9683":{"m":111,"g":115},"9700":{"m":111,"g":115},"9676":{"m":111,"g":115},"9694":{"m":111,"g":115},"9693":{"m":111,"g":115},"9679":{"m":111,"g":115},"9678":{"m":111,"g":115},"9397":{"m":111,"g":115},"9495":{"m":111,"g":115},"9677":{"m":111,"g":115},"9446":{"m":111,"g":115},"9071":{"m":111,"g":115},"9597":{"m":111,"g":115},"9555":{"m":111,"g":115},"9583":{"m":111,"g":115},"9564":{"m":111,"g":115},"9658":{"m":111,"g":115},"9665":{"m":111,"g":115},"9648":{"m":111,"g":115},"9637":{"m":111,"g":115},"9649":{"m":111,"g":115},"9647":{"m":111,"g":11
5},"9656":{"m":111,"g":115},"9523":{"m":111,"g":115},"9606":{"m":111,"g":115},"9635":{"m":111,"g":115},"9630":{"m":111,"g":115},"9640":{"m":111,"g":115},"9636":{"m":111,"g":115},"9301":{"m":111,"g":115},"8328":{"m":111,"g":115},"9632":{"m":111,"g":115},"9629":{"m":111,"g":115},"9628":{"m":111,"g":115},"9623":{"m":111,"g":115},"9622":{"m":111,"g":115},"9608":{"m":111,"g":115},"8901":{"m":111,"g":115},"9613":{"m":111,"g":115},"9190":{"m":111,"g":115},"9554":{"m":111,"g":115},"9500":{"m":111,"g":115},"9436":{"m":111,"g":115},"9568":{"m":111,"g":115},"10221":{"m":112,"g":115},"10340":{"m":112,"g":115},"10303":{"m":112,"g":115},"10331":{"m":112,"g":115},"10339":{"m":112,"g":115},"10330":{"m":112,"g":115},"10327":{"m":112,"g":115},"10338":{"m":112,"g":115},"10254":{"m":112,"g":115},"10264":{"m":112,"g":115},"10280":{"m":112,"g":115},"10335":{"m":112,"g":115},"10322":{"m":112,"g":115},"10328":{"m":112,"g":115},"10326":{"m":112,"g":115},"10233":{"m":112,"g":115},"10297":{"m":112,"g":115},"10314":{"m":112,"g":115},"10311":{"m":112,"g":115},"10310":{"m":112,"g":115},"10299":{"m":112,"g":115},"9090":{"m":112,"g":115},"10229":{"m":112,"g":115},"10239":{"m":112,"g":115},"9881":{"m":112,"g":115},"10294":{"m":112,"g":115},"10292":{"m":112,"g":115},"10282":{"m":112,"g":115},"10184":{"m":112,"g":115},"10241":{"m":112,"g":115},"9662":{"m":112,"g":115},"10252":{"m":112,"g":115},"9940":{"m":112,"g":115},"10251":{"m":112,"g":115},"10256":{"m":112,"g":115},"10250":{"m":112,"g":115},"10262":{"m":112,"g":115},"9954":{"m":112,"g":115},"10173":{"m":112,"g":115},"10060":{"m":112,"g":115},"10253":{"m":112,"g":115},"10093":{"m":112,"g":115},"10240":{"m":112,"g":115},"8803":{"m":112,"g":115},"10246":{"m":112,"g":115},"9795":{"m":112,"g":115},"10245":{"m":112,"g":115},"10242":{"m":112,"g":115},"10236":{"m":112,"g":115},"10234":{"m":112,"g":115},"10213":{"m":112,"g":115},"10238":{"m":112,"g":115},"10210":{"m":112,"g":115},"10220":{"m":112,"g":115},"10214":{"m":112,"g":115},"9960":{"m":112,"g":115}
,"10208":{"m":112,"g":115},"10212":{"m":112,"g":115},"10209":{"m":112,"g":115},"10207":{"m":112,"g":115},"10205":{"m":112,"g":115},"9300":{"m":112,"g":115},"10193":{"m":112,"g":115},"10127":{"m":112,"g":115},"10188":{"m":112,"g":115},"10165":{"m":112,"g":115},"10169":{"m":112,"g":115},"4422":{"m":112,"g":115},"9900":{"m":112,"g":115},"10191":{"m":112,"g":115},"7995":{"m":112,"g":115},"10185":{"m":112,"g":115},"10149":{"m":112,"g":115},"9522":{"m":112,"g":115},"10182":{"m":112,"g":115},"10181":{"m":112,"g":115},"9839":{"m":112,"g":115},"10176":{"m":112,"g":115},"9595":{"m":112,"g":115},"10166":{"m":112,"g":115},"9925":{"m":112,"g":115},"10156":{"m":112,"g":115},"10161":{"m":112,"g":115},"10159":{"m":112,"g":115},"10131":{"m":112,"g":115},"10028":{"m":112,"g":115},"10155":{"m":112,"g":115},"9871":{"m":112,"g":115},"9434":{"m":112,"g":115},"10148":{"m":112,"g":115},"6226":{"m":112,"g":115},"10013":{"m":112,"g":115},"10147":{"m":112,"g":115},"9981":{"m":112,"g":115},"9989":{"m":112,"g":115},"10108":{"m":112,"g":115},"10123":{"m":112,"g":115},"7843":{"m":112,"g":115},"10090":{"m":112,"g":115},"10104":{"m":112,"g":115},"8801":{"m":112,"g":115},"10040":{"m":112,"g":115},"10141":{"m":112,"g":115},"10095":{"m":112,"g":115},"10144":{"m":112,"g":115},"9971":{"m":112,"g":115},"10134":{"m":112,"g":115},"10135":{"m":112,"g":115},"10074":{"m":112,"g":115},"10128":{"m":112,"g":115},"10126":{"m":112,"g":115},"10113":{"m":112,"g":115},"10096":{"m":112,"g":115},"10056":{"m":112,"g":115},"10101":{"m":112,"g":115},"9969":{"m":112,"g":115},"9741":{"m":112,"g":115},"9477":{"m":112,"g":115},"10117":{"m":112,"g":115},"10102":{"m":112,"g":115},"10116":{"m":112,"g":115},"10068":{"m":112,"g":115},"9956":{"m":112,"g":115},"10041":{"m":112,"g":115},"10058":{"m":112,"g":115},"10107":{"m":112,"g":115},"10100":{"m":112,"g":115},"9834":{"m":112,"g":115},"9861":{"m":112,"g":115},"9764":{"m":112,"g":115},"9269":{"m":112,"g":115},"9620":{"m":112,"g":115},"6905":{"m":112,"g":115},"10032":{"m":112,"g":11
5},"10097":{"m":112,"g":115},"10029":{"m":112,"g":115},"10039":{"m":112,"g":115},"10092":{"m":112,"g":115},"10086":{"m":112,"g":115},"10057":{"m":112,"g":115},"10047":{"m":112,"g":115},"9842":{"m":112,"g":115},"9965":{"m":112,"g":115},"10069":{"m":112,"g":115},"9884":{"m":112,"g":115},"10087":{"m":112,"g":115},"8622":{"m":112,"g":115},"8555":{"m":112,"g":115},"10080":{"m":112,"g":115},"10079":{"m":112,"g":115},"10043":{"m":112,"g":115},"10007":{"m":112,"g":115},"7182":{"m":112,"g":115},"8725":{"m":112,"g":115},"8867":{"m":112,"g":115},"9567":{"m":112,"g":115},"5255":{"m":112,"g":115},"10006":{"m":112,"g":115},"9534":{"m":112,"g":115},"9934":{"m":112,"g":115},"9931":{"m":112,"g":115},"9801":{"m":112,"g":115},"10049":{"m":112,"g":115},"10055":{"m":112,"g":115},"9964":{"m":112,"g":115},"10052":{"m":112,"g":115},"10050":{"m":112,"g":115},"8677":{"m":112,"g":115},"10008":{"m":112,"g":115},"9951":{"m":112,"g":115},"9957":{"m":112,"g":115},"9938":{"m":112,"g":115},"9634":{"m":112,"g":115},"9886":{"m":112,"g":115},"9973":{"m":112,"g":115},"9846":{"m":112,"g":115},"10003":{"m":112,"g":115},"10004":{"m":112,"g":115},"9997":{"m":112,"g":115},"10016":{"m":112,"g":115},"9993":{"m":112,"g":115},"9994":{"m":112,"g":115},"9999":{"m":112,"g":115},"10000":{"m":112,"g":115},"9996":{"m":112,"g":115},"9988":{"m":112,"g":115},"9986":{"m":112,"g":115},"9733":{"m":112,"g":115},"9314":{"m":112,"g":115},"9978":{"m":112,"g":115},"9914":{"m":112,"g":115},"9460":{"m":112,"g":115},"9958":{"m":112,"g":115},"9906":{"m":112,"g":115},"9953":{"m":112,"g":115},"9937":{"m":112,"g":115},"9959":{"m":112,"g":115},"9955":{"m":112,"g":115},"9755":{"m":112,"g":115},"9952":{"m":112,"g":115},"9671":{"m":112,"g":115},"7912":{"m":112,"g":115},"9895":{"m":112,"g":115},"9905":{"m":112,"g":115},"9927":{"m":112,"g":115},"9946":{"m":112,"g":115},"9912":{"m":112,"g":115},"9869":{"m":112,"g":115},"8747":{"m":112,"g":115},"9939":{"m":112,"g":115},"9929":{"m":112,"g":115},"9932":{"m":112,"g":115},"9705":{"m":112,"g":115}
,"9909":{"m":112,"g":115},"9920":{"m":112,"g":115},"9921":{"m":112,"g":115},"9879":{"m":112,"g":115},"9919":{"m":112,"g":115},"9916":{"m":112,"g":115},"9907":{"m":112,"g":115},"9844":{"m":112,"g":115},"9913":{"m":112,"g":115},"9902":{"m":112,"g":115},"8118":{"m":112,"g":115},"9875":{"m":112,"g":115},"9893":{"m":112,"g":115},"9878":{"m":112,"g":115},"9803":{"m":112,"g":115},"9876":{"m":112,"g":115},"9874":{"m":112,"g":115},"9783":{"m":112,"g":115},"9882":{"m":112,"g":115},"9857":{"m":112,"g":115},"9862":{"m":112,"g":115},"9864":{"m":112,"g":115},"8964":{"m":112,"g":115},"9794":{"m":112,"g":115},"9858":{"m":112,"g":115},"9852":{"m":112,"g":115},"9847":{"m":112,"g":115},"9073":{"m":112,"g":115},"9850":{"m":112,"g":115},"9797":{"m":112,"g":115},"9661":{"m":112,"g":115},"9841":{"m":112,"g":115},"9750":{"m":112,"g":115},"9709":{"m":112,"g":115},"8909":{"m":112,"g":115},"9840":{"m":112,"g":115},"9824":{"m":112,"g":115},"9835":{"m":112,"g":115},"9837":{"m":112,"g":115},"9836":{"m":112,"g":115},"9831":{"m":112,"g":115},"9830":{"m":112,"g":115},"9761":{"m":112,"g":115},"9828":{"m":112,"g":115},"9827":{"m":112,"g":115},"9826":{"m":112,"g":115},"9822":{"m":112,"g":115},"9802":{"m":112,"g":115},"9820":{"m":112,"g":115},"9746":{"m":112,"g":115},"9817":{"m":112,"g":115},"9815":{"m":112,"g":115},"9807":{"m":112,"g":115},"8345":{"m":112,"g":115},"9809":{"m":112,"g":115},"9670":{"m":112,"g":115},"9556":{"m":112,"g":115},"9675":{"m":112,"g":115},"9712":{"m":112,"g":115},"9793":{"m":112,"g":115},"9216":{"m":112,"g":115},"8375":{"m":112,"g":115},"9663":{"m":112,"g":115},"9715":{"m":112,"g":115},"9692":{"m":112,"g":115},"9776":{"m":112,"g":115},"9792":{"m":112,"g":115},"9786":{"m":112,"g":115},"9789":{"m":112,"g":115},"9788":{"m":112,"g":115},"9757":{"m":112,"g":115},"9784":{"m":112,"g":115},"9777":{"m":112,"g":115},"9749":{"m":112,"g":115},"9772":{"m":112,"g":115},"8750":{"m":112,"g":115},"6287":{"m":112,"g":115},"6407":{"m":112,"g":115},"9355":{"m":112,"g":115},"9770":{"m":112,"g":115}
,"8236":{"m":112,"g":115},"9759":{"m":112,"g":115},"9745":{"m":112,"g":115},"9721":{"m":112,"g":115},"9735":{"m":112,"g":115},"9573":{"m":112,"g":115},"9673":{"m":112,"g":115},"9740":{"m":112,"g":115},"9739":{"m":112,"g":115},"9684":{"m":112,"g":115},"9732":{"m":112,"g":115},"9730":{"m":112,"g":115},"9728":{"m":112,"g":115},"9615":{"m":112,"g":115},"9505":{"m":112,"g":115},"9724":{"m":112,"g":115},"9720":{"m":112,"g":115},"11263":{"m":113,"g":115},"11259":{"m":113,"g":115},"11061":{"m":113,"g":115},"11235":{"m":113,"g":115},"11240":{"m":113,"g":115},"11209":{"m":113,"g":115},"11254":{"m":113,"g":115},"11242":{"m":113,"g":115},"11252":{"m":113,"g":115},"11251":{"m":113,"g":115},"10048":{"m":113,"g":115},"10042":{"m":113,"g":115},"11248":{"m":113,"g":115},"11247":{"m":113,"g":115},"10996":{"m":113,"g":115},"11206":{"m":113,"g":115},"11228":{"m":113,"g":115},"11237":{"m":113,"g":115},"11222":{"m":113,"g":115},"11174":{"m":113,"g":115},"11162":{"m":113,"g":115},"10571":{"m":113,"g":115},"11229":{"m":113,"g":115},"11137":{"m":113,"g":115},"9624":{"m":113,"g":115},"11194":{"m":113,"g":115},"11225":{"m":113,"g":115},"11063":{"m":113,"g":115},"11217":{"m":113,"g":115},"11215":{"m":113,"g":115},"11140":{"m":113,"g":115},"11213":{"m":113,"g":115},"11012":{"m":113,"g":115},"11096":{"m":113,"g":115},"11011":{"m":113,"g":115},"11178":{"m":113,"g":115},"11196":{"m":113,"g":115},"10741":{"m":113,"g":115},"11198":{"m":113,"g":115},"10517":{"m":113,"g":115},"10838":{"m":113,"g":115},"10859":{"m":113,"g":115},"10609":{"m":113,"g":115},"10855":{"m":113,"g":115},"11090":{"m":113,"g":115},"10780":{"m":113,"g":115},"10892":{"m":113,"g":115},"11166":{"m":113,"g":115},"11192":{"m":113,"g":115},"11189":{"m":113,"g":115},"10873":{"m":113,"g":115},"11173":{"m":113,"g":115},"11167":{"m":113,"g":115},"10637":{"m":113,"g":115},"11185":{"m":113,"g":115},"11161":{"m":113,"g":115},"9537":{"m":113,"g":115},"10830":{"m":113,"g":115},"10133":{"m":113,"g":115},"11138":{"m":113,"g":115},"11179":{"m":113
,"g":115},"11176":{"m":113,"g":115},"11159":{"m":113,"g":115},"11170":{"m":113,"g":115},"11175":{"m":113,"g":115},"11124":{"m":113,"g":115},"11171":{"m":113,"g":115},"11164":{"m":113,"g":115},"11130":{"m":113,"g":115},"11163":{"m":113,"g":115},"10837":{"m":113,"g":115},"11152":{"m":113,"g":115},"10988":{"m":113,"g":115},"11160":{"m":113,"g":115},"10422":{"m":113,"g":115},"10263":{"m":113,"g":115},"11156":{"m":113,"g":115},"10508":{"m":113,"g":115},"10779":{"m":113,"g":115},"10768":{"m":113,"g":115},"11132":{"m":113,"g":115},"11148":{"m":113,"g":115},"11149":{"m":113,"g":115},"10559":{"m":113,"g":115},"11135":{"m":113,"g":115},"10720":{"m":113,"g":115},"11145":{"m":113,"g":115},"11123":{"m":113,"g":115},"11143":{"m":113,"g":115},"11120":{"m":113,"g":115},"10512":{"m":113,"g":115},"10271":{"m":113,"g":115},"11005":{"m":113,"g":115},"10760":{"m":113,"g":115},"11128":{"m":113,"g":115},"11075":{"m":113,"g":115},"10985":{"m":113,"g":115},"11111":{"m":113,"g":115},"11115":{"m":113,"g":115},"11114":{"m":113,"g":115},"10735":{"m":113,"g":115},"11112":{"m":113,"g":115},"10972":{"m":113,"g":115},"11080":{"m":113,"g":115},"11113":{"m":113,"g":115},"11071":{"m":113,"g":115},"11081":{"m":113,"g":115},"11102":{"m":113,"g":115},"11101":{"m":113,"g":115},"10846":{"m":113,"g":115},"11094":{"m":113,"g":115},"11067":{"m":113,"g":115},"11099":{"m":113,"g":115},"11087":{"m":113,"g":115},"10991":{"m":113,"g":115},"11085":{"m":113,"g":115},"11070":{"m":113,"g":115},"11092":{"m":113,"g":115},"10729":{"m":113,"g":115},"9642":{"m":113,"g":115},"10816":{"m":113,"g":115},"11083":{"m":113,"g":115},"10875":{"m":113,"g":115},"11082":{"m":113,"g":115},"11079":{"m":113,"g":115},"10611":{"m":113,"g":115},"11076":{"m":113,"g":115},"11073":{"m":113,"g":115},"10975":{"m":113,"g":115},"11069":{"m":113,"g":115},"11056":{"m":113,"g":115},"10976":{"m":113,"g":115},"11054":{"m":113,"g":115},"11050":{"m":113,"g":115},"11022":{"m":113,"g":115},"9614":{"m":113,"g":115},"11010":{"m":113,"g":115},"10591":{"m":113
,"g":115},"11015":{"m":113,"g":115},"10940":{"m":113,"g":115},"10701":{"m":113,"g":115},"11036":{"m":113,"g":115},"11038":{"m":113,"g":115},"10986":{"m":113,"g":115},"11033":{"m":113,"g":115},"11017":{"m":113,"g":115},"11003":{"m":113,"g":115},"11013":{"m":113,"g":115},"10543":{"m":113,"g":115},"10555":{"m":113,"g":115},"10964":{"m":113,"g":115},"11009":{"m":113,"g":115},"10999":{"m":113,"g":115},"10978":{"m":113,"g":115},"10995":{"m":113,"g":115},"10997":{"m":113,"g":115},"10550":{"m":113,"g":115},"10565":{"m":113,"g":115},"10616":{"m":113,"g":115},"10751":{"m":113,"g":115},"10930":{"m":113,"g":115},"10981":{"m":113,"g":115},"10112":{"m":113,"g":115},"10980":{"m":113,"g":115},"10982":{"m":113,"g":115},"10551":{"m":113,"g":115},"10965":{"m":113,"g":115},"10944":{"m":113,"g":115},"10971":{"m":113,"g":115},"10941":{"m":113,"g":115},"10495":{"m":113,"g":115},"10372":{"m":113,"g":115},"10970":{"m":113,"g":115},"10947":{"m":113,"g":115},"10968":{"m":113,"g":115},"10967":{"m":113,"g":115},"10963":{"m":113,"g":115},"10960":{"m":113,"g":115},"10958":{"m":113,"g":115},"10956":{"m":113,"g":115},"10955":{"m":113,"g":115},"10749":{"m":113,"g":115},"10927":{"m":113,"g":115},"10936":{"m":113,"g":115},"10192":{"m":113,"g":115},"10939":{"m":113,"g":115},"10935":{"m":113,"g":115},"10929":{"m":113,"g":115},"10932":{"m":113,"g":115},"10898":{"m":113,"g":115},"10612":{"m":113,"g":115},"10923":{"m":113,"g":115},"10926":{"m":113,"g":115},"10899":{"m":113,"g":115},"10883":{"m":113,"g":115},"10910":{"m":113,"g":115},"10924":{"m":113,"g":115},"10132":{"m":113,"g":115},"10881":{"m":113,"g":115},"10894":{"m":113,"g":115},"10915":{"m":113,"g":115},"10376":{"m":113,"g":115},"10778":{"m":113,"g":115},"10872":{"m":113,"g":115},"10885":{"m":113,"g":115},"10895":{"m":113,"g":115},"10845":{"m":113,"g":115},"10880":{"m":113,"g":115},"10572":{"m":113,"g":115},"10861":{"m":113,"g":115},"10876":{"m":113,"g":115},"10877":{"m":113,"g":115},"10534":{"m":113,"g":115},"10827":{"m":113,"g":115},"10832":{"m":1
13,"g":115},"10786":{"m":113,"g":115},"10860":{"m":113,"g":115},"10718":{"m":113,"g":115},"10829":{"m":113,"g":115},"10828":{"m":113,"g":115},"10825":{"m":113,"g":115},"10826":{"m":113,"g":115},"10822":{"m":113,"g":115},"10824":{"m":113,"g":115},"10823":{"m":113,"g":115},"10787":{"m":113,"g":115},"10820":{"m":113,"g":115},"10794":{"m":113,"g":115},"10799":{"m":113,"g":115},"10818":{"m":113,"g":115},"10540":{"m":113,"g":115},"10504":{"m":113,"g":115},"10761":{"m":113,"g":115},"10814":{"m":113,"g":115},"10812":{"m":113,"g":115},"10259":{"m":113,"g":115},"10323":{"m":113,"g":115},"10581":{"m":113,"g":115},"10773":{"m":113,"g":115},"10792":{"m":113,"g":115},"10791":{"m":113,"g":115},"10770":{"m":113,"g":115},"10783":{"m":113,"g":115},"10782":{"m":113,"g":115},"10715":{"m":113,"g":115},"10705":{"m":113,"g":115},"10776":{"m":113,"g":115},"10777":{"m":113,"g":115},"10771":{"m":113,"g":115},"10774":{"m":113,"g":115},"10767":{"m":113,"g":115},"10756":{"m":113,"g":115},"10765":{"m":113,"g":115},"10574":{"m":113,"g":115},"10556":{"m":113,"g":115},"10130":{"m":113,"g":115},"10762":{"m":113,"g":115},"10541":{"m":113,"g":115},"10281":{"m":113,"g":115},"10759":{"m":113,"g":115},"10300":{"m":113,"g":115},"10755":{"m":113,"g":115},"10732":{"m":113,"g":115},"10758":{"m":113,"g":115},"10757":{"m":113,"g":115},"10727":{"m":113,"g":115},"10754":{"m":113,"g":115},"10724":{"m":113,"g":115},"10753":{"m":113,"g":115},"10737":{"m":113,"g":115},"10728":{"m":113,"g":115},"10730":{"m":113,"g":115},"9849":{"m":113,"g":115},"10731":{"m":113,"g":115},"10709":{"m":113,"g":115},"10699":{"m":113,"g":115},"10678":{"m":113,"g":115},"10695":{"m":113,"g":115},"10694":{"m":113,"g":115},"10717":{"m":113,"g":115},"10714":{"m":113,"g":115},"10716":{"m":113,"g":115},"10385":{"m":113,"g":115},"10317":{"m":113,"g":115},"10706":{"m":113,"g":115},"10592":{"m":113,"g":115},"10697":{"m":113,"g":115},"10696":{"m":113,"g":115},"10651":{"m":113,"g":115},"10688":{"m":113,"g":115},"10686":{"m":113,"g":115},"10685":{"m":
113,"g":115},"10673":{"m":113,"g":115},"10684":{"m":113,"g":115},"10680":{"m":113,"g":115},"10681":{"m":113,"g":115},"10683":{"m":113,"g":115},"10645":{"m":113,"g":115},"10677":{"m":113,"g":115},"10679":{"m":113,"g":115},"10648":{"m":113,"g":115},"10671":{"m":113,"g":115},"10675":{"m":113,"g":115},"10666":{"m":113,"g":115},"10670":{"m":113,"g":115},"10668":{"m":113,"g":115},"10664":{"m":113,"g":115},"10661":{"m":113,"g":115},"10522":{"m":113,"g":115},"10653":{"m":113,"g":115},"10634":{"m":113,"g":115},"10650":{"m":113,"g":115},"10321":{"m":113,"g":115},"10647":{"m":113,"g":115},"10081":{"m":113,"g":115},"10633":{"m":113,"g":115},"10319":{"m":113,"g":115},"10630":{"m":113,"g":115},"10631":{"m":113,"g":115},"10632":{"m":113,"g":115},"10586":{"m":113,"g":115},"10553":{"m":113,"g":115},"9873":{"m":113,"g":115},"10629":{"m":113,"g":115},"10628":{"m":113,"g":115},"10621":{"m":113,"g":115},"9947":{"m":113,"g":115},"10579":{"m":113,"g":115},"10595":{"m":113,"g":115},"10610":{"m":113,"g":115},"10622":{"m":113,"g":115},"10624":{"m":113,"g":115},"10222":{"m":113,"g":115},"9979":{"m":113,"g":115},"8274":{"m":113,"g":115},"10604":{"m":113,"g":115},"10525":{"m":113,"g":115},"10596":{"m":113,"g":115},"10273":{"m":113,"g":115},"10563":{"m":113,"g":115},"10190":{"m":113,"g":115},"10558":{"m":113,"g":115},"10526":{"m":113,"g":115},"9987":{"m":113,"g":115},"10584":{"m":113,"g":115},"9976":{"m":113,"g":115},"10171":{"m":113,"g":115},"10548":{"m":113,"g":115},"8813":{"m":113,"g":115},"10545":{"m":113,"g":115},"10523":{"m":113,"g":115},"10459":{"m":113,"g":115},"10529":{"m":113,"g":115},"8746":{"m":113,"g":115},"10538":{"m":113,"g":115},"10494":{"m":113,"g":115},"9928":{"m":113,"g":115},"10474":{"m":113,"g":115},"10530":{"m":113,"g":115},"10528":{"m":113,"g":115},"10524":{"m":113,"g":115},"10511":{"m":113,"g":115},"10506":{"m":113,"g":115},"10515":{"m":113,"g":115},"10491":{"m":113,"g":115},"10500":{"m":113,"g":115},"10507":{"m":113,"g":115},"10466":{"m":113,"g":115},"10498":{"m":113,"g"
:115},"10499":{"m":113,"g":115},"10493":{"m":113,"g":115},"10487":{"m":113,"g":115},"10230":{"m":113,"g":115},"10336":{"m":113,"g":115},"10203":{"m":113,"g":115},"10434":{"m":113,"g":115},"8863":{"m":113,"g":115},"10486":{"m":113,"g":115},"10484":{"m":113,"g":115},"10286":{"m":113,"g":115},"10481":{"m":113,"g":115},"10478":{"m":113,"g":115},"10479":{"m":113,"g":115},"10475":{"m":113,"g":115},"10473":{"m":113,"g":115},"10476":{"m":113,"g":115},"8189":{"m":113,"g":115},"9657":{"m":113,"g":115},"10375":{"m":113,"g":115},"9887":{"m":113,"g":115},"8710":{"m":113,"g":115},"10471":{"m":113,"g":115},"10470":{"m":113,"g":115},"10468":{"m":113,"g":115},"10465":{"m":113,"g":115},"10440":{"m":113,"g":115},"10456":{"m":113,"g":115},"10439":{"m":113,"g":115},"10463":{"m":113,"g":115},"10458":{"m":113,"g":115},"10457":{"m":113,"g":115},"10401":{"m":113,"g":115},"10358":{"m":113,"g":115},"10449":{"m":113,"g":115},"9343":{"m":113,"g":115},"10452":{"m":113,"g":115},"10450":{"m":113,"g":115},"10445":{"m":113,"g":115},"9626":{"m":113,"g":115},"9768":{"m":113,"g":115},"10143":{"m":113,"g":115},"10441":{"m":113,"g":115},"10437":{"m":113,"g":115},"9338":{"m":113,"g":115},"10435":{"m":113,"g":115},"10201":{"m":113,"g":115},"10432":{"m":113,"g":115},"10129":{"m":113,"g":115},"10433":{"m":113,"g":115},"10426":{"m":113,"g":115},"10425":{"m":113,"g":115},"10429":{"m":113,"g":115},"10431":{"m":113,"g":115},"10428":{"m":113,"g":115},"9962":{"m":113,"g":115},"10076":{"m":113,"g":115},"8627":{"m":113,"g":115},"10419":{"m":113,"g":115},"6539":{"m":113,"g":115},"10270":{"m":113,"g":115},"9948":{"m":113,"g":115},"10157":{"m":113,"g":115},"10313":{"m":113,"g":115},"10369":{"m":113,"g":115},"10318":{"m":113,"g":115},"10404":{"m":113,"g":115},"10228":{"m":113,"g":115},"10414":{"m":113,"g":115},"10410":{"m":113,"g":115},"10411":{"m":113,"g":115},"10412":{"m":113,"g":115},"9748":{"m":113,"g":115},"9382":{"m":113,"g":115},"10406":{"m":113,"g":115},"10392":{"m":113,"g":115},"10403":{"m":113,"g":115},"10400"
:{"m":113,"g":115},"10397":{"m":113,"g":115},"10398":{"m":113,"g":115},"9984":{"m":113,"g":115},"10395":{"m":113,"g":115},"10394":{"m":113,"g":115},"10332":{"m":113,"g":115},"10377":{"m":113,"g":115},"10379":{"m":113,"g":115},"10380":{"m":113,"g":115},"10387":{"m":113,"g":115},"10244":{"m":113,"g":115},"10386":{"m":113,"g":115},"10391":{"m":113,"g":115},"10361":{"m":113,"g":115},"10390":{"m":113,"g":115},"10388":{"m":113,"g":115},"10343":{"m":113,"g":115},"10333":{"m":113,"g":115},"9023":{"m":113,"g":115},"10099":{"m":113,"g":115},"10219":{"m":113,"g":115},"10370":{"m":113,"g":115},"10368":{"m":113,"g":115},"10180":{"m":113,"g":115},"10355":{"m":113,"g":115},"8215":{"m":113,"g":115},"8778":{"m":113,"g":115},"10351":{"m":113,"g":115},"10362":{"m":113,"g":115},"10359":{"m":113,"g":115},"10360":{"m":113,"g":115},"10356":{"m":113,"g":115},"10031":{"m":113,"g":115},"10283":{"m":113,"g":115},"10346":{"m":113,"g":115},"10352":{"m":113,"g":115},"9774":{"m":113,"g":115},"10296":{"m":113,"g":115},"9199":{"m":113,"g":115},"10345":{"m":113,"g":115},"10349":{"m":113,"g":115},"10347":{"m":113,"g":115},"11324":{"m":114,"g":115},"11369":{"m":114,"g":115},"11364":{"m":114,"g":115},"11394":{"m":114,"g":115},"11387":{"m":114,"g":115},"11376":{"m":114,"g":115},"11375":{"m":114,"g":115},"11373":{"m":114,"g":115},"11309":{"m":114,"g":115},"11366":{"m":114,"g":115},"11359":{"m":114,"g":115},"11353":{"m":114,"g":115},"10979":{"m":114,"g":115},"11350":{"m":114,"g":115},"11327":{"m":114,"g":115},"11342":{"m":114,"g":115},"11339":{"m":114,"g":115},"11341":{"m":114,"g":115},"11340":{"m":114,"g":115},"11336":{"m":114,"g":115},"11323":{"m":114,"g":115},"10909":{"m":114,"g":115},"11264":{"m":114,"g":115},"9812":{"m":114,"g":115},"11318":{"m":114,"g":115},"11007":{"m":114,"g":115},"11321":{"m":114,"g":115},"10937":{"m":114,"g":115},"11312":{"m":114,"g":115},"11211":{"m":114,"g":115},"9545":{"m":114,"g":115},"11314":{"m":114,"g":115},"11316":{"m":114,"g":115},"11126":{"m":114,"g":115},"11315":{"m":
114,"g":115},"10710":{"m":114,"g":115},"11230":{"m":114,"g":115},"11200":{"m":114,"g":115},"11304":{"m":114,"g":115},"11310":{"m":114,"g":115},"11311":{"m":114,"g":115},"11297":{"m":114,"g":115},"11205":{"m":114,"g":115},"11307":{"m":114,"g":115},"11223":{"m":114,"g":115},"11306":{"m":114,"g":115},"11305":{"m":114,"g":115},"11001":{"m":114,"g":115},"11027":{"m":114,"g":115},"11288":{"m":114,"g":115},"11303":{"m":114,"g":115},"11302":{"m":114,"g":115},"11068":{"m":114,"g":115},"11301":{"m":114,"g":115},"11300":{"m":114,"g":115},"11290":{"m":114,"g":115},"11210":{"m":114,"g":115},"11231":{"m":114,"g":115},"11294":{"m":114,"g":115},"10949":{"m":114,"g":115},"11095":{"m":114,"g":115},"11283":{"m":114,"g":115},"11281":{"m":114,"g":115},"11286":{"m":114,"g":115},"11282":{"m":114,"g":115},"11261":{"m":114,"g":115},"11238":{"m":114,"g":115},"11279":{"m":114,"g":115},"11280":{"m":114,"g":115},"11276":{"m":114,"g":115},"11277":{"m":114,"g":115},"11182":{"m":114,"g":115},"11268":{"m":114,"g":115},"11274":{"m":114,"g":115},"11270":{"m":114,"g":115},"7149":{"m":114,"g":115},"11262":{"m":114,"g":115},"11219":{"m":114,"g":115},"11680":{"m":116,"g":118},"11676":{"m":116,"g":118},"11684":{"m":116,"g":118},"11667":{"m":116,"g":118},"11674":{"m":116,"g":118},"11681":{"m":116,"g":118},"11621":{"m":116,"g":118},"11367":{"m":116,"g":118},"11653":{"m":116,"g":118},"11660":{"m":116,"g":118},"11659":{"m":116,"g":118},"11585":{"m":116,"g":118},"11293":{"m":116,"g":118},"11458":{"m":116,"g":118},"11590":{"m":116,"g":118},"11636":{"m":116,"g":118},"11579":{"m":116,"g":118},"8247":{"m":116,"g":118},"10423":{"m":116,"g":118},"11642":{"m":116,"g":115},"11638":{"m":116,"g":115},"11628":{"m":116,"g":115},"11639":{"m":116,"g":115},"11351":{"m":116,"g":115},"11627":{"m":116,"g":115},"11633":{"m":116,"g":115},"11631":{"m":116,"g":115},"11622":{"m":116,"g":115},"11625":{"m":116,"g":115},"11623":{"m":116,"g":115},"11624":{"m":116,"g":115},"11619":{"m":116,"g":115},"11605":{"m":116,"g":115},"11620":{"m":
116,"g":115},"11617":{"m":116,"g":115},"11453":{"m":116,"g":115},"11561":{"m":116,"g":115},"11434":{"m":116,"g":115},"10721":{"m":116,"g":115},"11586":{"m":116,"g":115},"11556":{"m":116,"g":115},"11603":{"m":116,"g":115},"11593":{"m":116,"g":115},"11601":{"m":116,"g":115},"11449":{"m":116,"g":115},"11600":{"m":116,"g":115},"11597":{"m":116,"g":115},"11598":{"m":116,"g":115},"11566":{"m":116,"g":115},"11591":{"m":116,"g":115},"11588":{"m":116,"g":115},"11587":{"m":116,"g":115},"11580":{"m":116,"g":115},"11583":{"m":116,"g":115},"11582":{"m":116,"g":115},"11041":{"m":116,"g":115},"11542":{"m":116,"g":115},"11535":{"m":116,"g":115},"11565":{"m":116,"g":115},"11413":{"m":116,"g":115},"11539":{"m":116,"g":115},"11572":{"m":116,"g":115},"11573":{"m":116,"g":115},"11534":{"m":116,"g":115},"11571":{"m":116,"g":115},"11564":{"m":116,"g":115},"11537":{"m":116,"g":115},"11538":{"m":116,"g":115},"11521":{"m":116,"g":115},"11562":{"m":116,"g":115},"11308":{"m":116,"g":115},"11557":{"m":116,"g":115},"11441":{"m":116,"g":115},"11483":{"m":116,"g":115},"11531":{"m":116,"g":115},"11549":{"m":116,"g":115},"11553":{"m":116,"g":115},"11547":{"m":116,"g":115},"11507":{"m":116,"g":115},"11419":{"m":116,"g":115},"11444":{"m":116,"g":115},"11442":{"m":116,"g":115},"11548":{"m":116,"g":115},"11530":{"m":116,"g":115},"11527":{"m":116,"g":115},"11201":{"m":116,"g":115},"11528":{"m":116,"g":115},"11457":{"m":116,"g":115},"11544":{"m":116,"g":115},"11460":{"m":116,"g":115},"11505":{"m":116,"g":115},"11385":{"m":116,"g":115},"11214":{"m":116,"g":115},"11493":{"m":116,"g":115},"11512":{"m":116,"g":115},"11432":{"m":116,"g":115},"11485":{"m":116,"g":115},"11511":{"m":116,"g":115},"5889":{"m":116,"g":115},"11520":{"m":116,"g":115},"11516":{"m":116,"g":115},"11498":{"m":116,"g":115},"11474":{"m":116,"g":115},"11515":{"m":116,"g":115},"11514":{"m":116,"g":115},"11331":{"m":116,"g":115},"11509":{"m":116,"g":115},"11443":{"m":116,"g":115},"11452":{"m":116,"g":115},"11503":{"m":116,"g":115},"11502":{"m"
:116,"g":115},"11497":{"m":116,"g":115},"11501":{"m":116,"g":115},"11500":{"m":116,"g":115},"11332":{"m":116,"g":115},"11499":{"m":116,"g":115},"11465":{"m":116,"g":115},"10577":{"m":116,"g":115},"11479":{"m":116,"g":115},"11221":{"m":116,"g":115},"10172":{"m":116,"g":115},"11478":{"m":116,"g":115},"11481":{"m":116,"g":115},"11476":{"m":116,"g":115},"11489":{"m":116,"g":115},"10062":{"m":116,"g":115},"11398":{"m":116,"g":115},"10635":{"m":116,"g":115},"9804":{"m":116,"g":115},"11454":{"m":116,"g":115},"11019":{"m":116,"g":115},"8919":{"m":116,"g":115},"11462":{"m":116,"g":115},"11428":{"m":116,"g":115},"11470":{"m":116,"g":115},"11427":{"m":116,"g":115},"11467":{"m":116,"g":115},"11448":{"m":116,"g":115},"10312":{"m":116,"g":115},"9991":{"m":116,"g":115},"11455":{"m":116,"g":115},"11450":{"m":116,"g":115},"11360":{"m":116,"g":115},"11368":{"m":116,"g":115},"11445":{"m":116,"g":115},"11438":{"m":116,"g":115},"11439":{"m":116,"g":115},"11399":{"m":116,"g":115},"11435":{"m":116,"g":115},"11313":{"m":116,"g":115},"10745":{"m":116,"g":115},"11411":{"m":116,"g":115},"11433":{"m":116,"g":115},"11437":{"m":116,"g":115},"11436":{"m":116,"g":115},"11345":{"m":116,"g":115},"9256":{"m":116,"g":115},"11381":{"m":116,"g":115},"11361":{"m":116,"g":115},"11420":{"m":116,"g":115},"11144":{"m":116,"g":115},"10734":{"m":116,"g":115},"10969":{"m":116,"g":115},"9045":{"m":116,"g":115},"11414":{"m":116,"g":115},"11388":{"m":116,"g":115},"11363":{"m":116,"g":115},"11365":{"m":116,"g":115},"11389":{"m":116,"g":115},"11401":{"m":116,"g":115},"11285":{"m":116,"g":115},"11693":{"m":117,"g":118},"11543":{"m":117,"g":118},"11687":{"m":117,"g":118},"11706":{"m":117,"g":118},"11510":{"m":117,"g":118},"11488":{"m":117,"g":118},"11370":{"m":117,"g":118},"11692":{"m":117,"g":118},"11663":{"m":117,"g":118},"10912":{"m":117,"g":118},"11679":{"m":117,"g":118},"11689":{"m":117,"g":118},"11686":{"m":117,"g":118},"10248":{"m":117,"g":118},"9493":{"m":117,"g":118},"12027":{"m":119,"g":121},"12009":{"m":119
,"g":121},"12030":{"m":119,"g":121},"12029":{"m":119,"g":121},"11616":{"m":119,"g":121},"12028":{"m":119,"g":121},"9366":{"m":119,"g":121},"11891":{"m":119,"g":121},"10158":{"m":119,"g":121},"12024":{"m":119,"g":121},"11765":{"m":119,"g":121},"12022":{"m":119,"g":121},"12018":{"m":119,"g":121},"12021":{"m":119,"g":121},"11755":{"m":119,"g":121},"11981":{"m":119,"g":121},"12015":{"m":119,"g":121},"12014":{"m":119,"g":121},"11937":{"m":119,"g":121},"11988":{"m":119,"g":121},"11866":{"m":119,"g":121},"11821":{"m":119,"g":121},"12004":{"m":119,"g":121},"11985":{"m":119,"g":121},"11944":{"m":119,"g":121},"10652":{"m":119,"g":121},"11965":{"m":119,"g":121},"11906":{"m":119,"g":121},"11990":{"m":119,"g":121},"11299":{"m":119,"g":121},"11322":{"m":119,"g":121},"11811":{"m":119,"g":121},"11955":{"m":119,"g":121},"11978":{"m":119,"g":121},"10869":{"m":119,"g":121},"11921":{"m":119,"g":121},"11563":{"m":119,"g":121},"11980":{"m":119,"g":121},"11977":{"m":119,"g":121},"10750":{"m":119,"g":121},"11956":{"m":119,"g":121},"11723":{"m":119,"g":121},"11967":{"m":119,"g":121},"11953":{"m":119,"g":121},"9651":{"m":119,"g":121},"10606":{"m":119,"g":121},"11908":{"m":119,"g":121},"10154":{"m":119,"g":121},"11929":{"m":119,"g":121},"11717":{"m":119,"g":121},"11922":{"m":119,"g":121},"11945":{"m":119,"g":121},"11790":{"m":119,"g":121},"11926":{"m":119,"g":121},"11940":{"m":119,"g":121},"11935":{"m":119,"g":121},"11934":{"m":119,"g":121},"11377":{"m":119,"g":121},"11933":{"m":119,"g":121},"11876":{"m":119,"g":121},"11844":{"m":119,"g":121},"11287":{"m":119,"g":121},"11918":{"m":119,"g":121},"11915":{"m":119,"g":121},"11702":{"m":119,"g":121},"11482":{"m":119,"g":121},"10700":{"m":119,"g":121},"11902":{"m":119,"g":121},"11295":{"m":119,"g":121},"11416":{"m":119,"g":121},"11895":{"m":119,"g":121},"11570":{"m":119,"g":121},"11487":{"m":119,"g":121},"11878":{"m":119,"g":121},"11885":{"m":119,"g":118},"11664":{"m":119,"g":118},"10656":{"m":119,"g":118},"11843":{"m":119,"g":118},"11845":{"m":119
,"g":118},"11859":{"m":119,"g":118},"11887":{"m":119,"g":118},"11838":{"m":119,"g":118},"11886":{"m":119,"g":118},"11826":{"m":119,"g":118},"11882":{"m":119,"g":118},"11875":{"m":119,"g":118},"11881":{"m":119,"g":118},"11868":{"m":119,"g":118},"11807":{"m":119,"g":118},"11867":{"m":119,"g":118},"11823":{"m":119,"g":118},"11847":{"m":119,"g":118},"11862":{"m":119,"g":118},"10691":{"m":119,"g":118},"11776":{"m":119,"g":118},"11822":{"m":119,"g":118},"11849":{"m":119,"g":118},"11396":{"m":119,"g":118},"11846":{"m":119,"g":118},"11747":{"m":119,"g":118},"10801":{"m":119,"g":118},"11722":{"m":119,"g":118},"11594":{"m":119,"g":118},"11780":{"m":119,"g":118},"11733":{"m":119,"g":118},"10510":{"m":119,"g":118},"11508":{"m":119,"g":118},"11787":{"m":119,"g":118},"11832":{"m":119,"g":118},"11778":{"m":119,"g":118},"11612":{"m":119,"g":118},"11810":{"m":119,"g":118},"11831":{"m":119,"g":118},"11815":{"m":119,"g":118},"11606":{"m":119,"g":118},"11835":{"m":119,"g":118},"10994":{"m":119,"g":118},"11147":{"m":119,"g":118},"11833":{"m":119,"g":118},"11834":{"m":119,"g":118},"11652":{"m":119,"g":118},"11819":{"m":119,"g":118},"11827":{"m":119,"g":118},"11805":{"m":119,"g":118},"5162":{"m":119,"g":118},"11786":{"m":119,"g":118},"11804":{"m":119,"g":118},"11808":{"m":119,"g":118},"11091":{"m":119,"g":118},"11818":{"m":119,"g":118},"10788":{"m":119,"g":118},"11817":{"m":119,"g":118},"11328":{"m":119,"g":118},"11555":{"m":119,"g":118},"11618":{"m":119,"g":118},"11670":{"m":119,"g":118},"11688":{"m":119,"g":118},"11773":{"m":119,"g":118},"11813":{"m":119,"g":118},"11000":{"m":119,"g":118},"11710":{"m":119,"g":118},"11749":{"m":119,"g":118},"11772":{"m":119,"g":118},"11506":{"m":119,"g":118},"11797":{"m":119,"g":118},"11781":{"m":119,"g":118},"11803":{"m":119,"g":118},"11801":{"m":119,"g":118},"10152":{"m":119,"g":118},"11669":{"m":119,"g":118},"11793":{"m":119,"g":118},"11665":{"m":119,"g":118},"11799":{"m":119,"g":118},"11798":{"m":119,"g":118},"11794":{"m":119,"g":118},"11784":{"m":11
9,"g":118},"11783":{"m":119,"g":118},"11614":{"m":119,"g":118},"11788":{"m":119,"g":118},"9170":{"m":119,"g":118},"11666":{"m":119,"g":118},"11613":{"m":119,"g":118},"11611":{"m":119,"g":118},"11607":{"m":119,"g":118},"11685":{"m":119,"g":118},"11519":{"m":119,"g":118},"11782":{"m":119,"g":118},"11777":{"m":119,"g":118},"11682":{"m":119,"g":118},"11775":{"m":119,"g":118},"11738":{"m":119,"g":118},"10725":{"m":119,"g":118},"11540":{"m":119,"g":118},"11767":{"m":119,"g":118},"11768":{"m":119,"g":118},"11766":{"m":119,"g":118},"11735":{"m":119,"g":118},"11730":{"m":119,"g":118},"11643":{"m":119,"g":118},"11062":{"m":119,"g":118},"11724":{"m":119,"g":118},"11746":{"m":119,"g":118},"11739":{"m":119,"g":118},"11740":{"m":119,"g":118},"11732":{"m":119,"g":118},"11734":{"m":119,"g":118},"11541":{"m":119,"g":118},"11728":{"m":119,"g":118},"11731":{"m":119,"g":118},"11727":{"m":119,"g":118},"11729":{"m":119,"g":118},"11677":{"m":119,"g":118},"10911":{"m":119,"g":118},"12169":{"m":120,"g":121},"12177":{"m":120,"g":121},"12170":{"m":120,"g":121},"12167":{"m":120,"g":121},"12171":{"m":120,"g":121},"12164":{"m":120,"g":121},"12168":{"m":120,"g":121},"12129":{"m":120,"g":121},"12166":{"m":120,"g":121},"11047":{"m":120,"g":121},"11632":{"m":120,"g":121},"10399":{"m":120,"g":121},"12142":{"m":120,"g":121},"12106":{"m":120,"g":121},"12156":{"m":120,"g":121},"12155":{"m":120,"g":121},"12152":{"m":120,"g":121},"12154":{"m":120,"g":121},"11615":{"m":120,"g":121},"11494":{"m":120,"g":121},"12113":{"m":120,"g":121},"12116":{"m":120,"g":121},"12136":{"m":120,"g":121},"12147":{"m":120,"g":121},"12141":{"m":120,"g":121},"11991":{"m":120,"g":121},"12097":{"m":120,"g":121},"12139":{"m":120,"g":121},"12138":{"m":120,"g":121},"12118":{"m":120,"g":121},"12133":{"m":120,"g":121},"11936":{"m":120,"g":121},"12132":{"m":120,"g":121},"11993":{"m":120,"g":121},"12130":{"m":120,"g":121},"12125":{"m":120,"g":121},"12101":{"m":120,"g":121},"11814":{"m":120,"g":121},"12127":{"m":120,"g":121},"12115":{"m":1
20,"g":121},"12126":{"m":120,"g":121},"12098":{"m":120,"g":121},"11962":{"m":120,"g":121},"12119":{"m":120,"g":121},"12124":{"m":120,"g":121},"11869":{"m":120,"g":121},"12110":{"m":120,"g":121},"12096":{"m":120,"g":121},"12083":{"m":120,"g":121},"12103":{"m":120,"g":121},"12105":{"m":120,"g":121},"12087":{"m":120,"g":121},"9501":{"m":120,"g":121},"11379":{"m":120,"g":121},"12058":{"m":120,"g":121},"12054":{"m":120,"g":121},"11877":{"m":120,"g":121},"12070":{"m":120,"g":121},"8464":{"m":120,"g":121},"12093":{"m":120,"g":121},"12071":{"m":120,"g":121},"12000":{"m":120,"g":121},"12091":{"m":120,"g":121},"12089":{"m":120,"g":121},"12086":{"m":120,"g":121},"11560":{"m":120,"g":121},"12084":{"m":120,"g":121},"12034":{"m":120,"g":121},"11924":{"m":120,"g":121},"11884":{"m":120,"g":121},"12025":{"m":120,"g":121},"11958":{"m":120,"g":121},"12067":{"m":120,"g":121},"11999":{"m":120,"g":121},"12046":{"m":120,"g":121},"12049":{"m":120,"g":121},"12063":{"m":120,"g":121},"12064":{"m":120,"g":121},"12053":{"m":120,"g":121},"11800":{"m":120,"g":121},"12056":{"m":120,"g":121},"12019":{"m":120,"g":121},"11853":{"m":120,"g":121},"12031":{"m":120,"g":121},"12041":{"m":120,"g":121},"10953":{"m":120,"g":121},"11759":{"m":120,"g":121},"12042":{"m":120,"g":121},"12037":{"m":120,"g":121},"11745":{"m":120,"g":121},"12038":{"m":120,"g":121},"11909":{"m":120,"g":121},"11816":{"m":120,"g":121},"11795":{"m":120,"g":121},"12003":{"m":120,"g":121},"9936":{"m":120,"g":121},"11964":{"m":120,"g":121},"12439":{"m":122,"g":127},"11874":{"m":122,"g":127},"12475":{"m":122,"g":127},"12469":{"m":122,"g":127},"11987":{"m":122,"g":127},"12430":{"m":122,"g":127},"12473":{"m":122,"g":127},"12297":{"m":122,"g":127},"12341":{"m":122,"g":127},"12428":{"m":122,"g":127},"12429":{"m":122,"g":127},"12066":{"m":122,"g":127},"12275":{"m":122,"g":127},"11757":{"m":122,"g":127},"11931":{"m":122,"g":127},"12470":{"m":122,"g":127},"12415":{"m":122,"g":127},"12266":{"m":122,"g":127},"12463":{"m":122,"g":127},"12328":{"m":12
2,"g":127},"12369":{"m":122,"g":127},"12256":{"m":122,"g":127},"12449":{"m":122,"g":127},"12452":{"m":122,"g":127},"12413":{"m":122,"g":127},"10889":{"m":122,"g":127},"12422":{"m":122,"g":127},"12436":{"m":122,"g":127},"12437":{"m":122,"g":127},"12410":{"m":122,"g":127},"12384":{"m":122,"g":127},"12307":{"m":122,"g":127},"10566":{"m":122,"g":127},"12405":{"m":122,"g":127},"12425":{"m":122,"g":127},"12401":{"m":122,"g":127},"12300":{"m":122,"g":127},"11224":{"m":122,"g":127},"11116":{"m":122,"g":127},"12399":{"m":122,"g":121},"12290":{"m":122,"g":121},"12242":{"m":122,"g":121},"12368":{"m":122,"g":121},"12403":{"m":122,"g":121},"12386":{"m":122,"g":121},"12409":{"m":122,"g":121},"12375":{"m":122,"g":121},"12404":{"m":122,"g":121},"12281":{"m":122,"g":121},"12185":{"m":122,"g":121},"12012":{"m":122,"g":121},"12364":{"m":122,"g":121},"12395":{"m":122,"g":121},"11960":{"m":122,"g":121},"12377":{"m":122,"g":121},"11897":{"m":122,"g":121},"11969":{"m":122,"g":121},"12394":{"m":122,"g":121},"11806":{"m":122,"g":121},"12319":{"m":122,"g":121},"12123":{"m":122,"g":121},"12358":{"m":122,"g":121},"12362":{"m":122,"g":121},"12135":{"m":122,"g":121},"12378":{"m":122,"g":121},"12174":{"m":122,"g":121},"12340":{"m":122,"g":121},"12195":{"m":122,"g":121},"11910":{"m":122,"g":121},"12216":{"m":122,"g":121},"12050":{"m":122,"g":121},"12153":{"m":122,"g":121},"12094":{"m":122,"g":121},"12182":{"m":122,"g":121},"12354":{"m":122,"g":121},"12350":{"m":122,"g":121},"12348":{"m":122,"g":121},"11737":{"m":122,"g":121},"12346":{"m":122,"g":121},"12325":{"m":122,"g":121},"12095":{"m":122,"g":121},"12347":{"m":122,"g":121},"12345":{"m":122,"g":121},"11709":{"m":122,"g":121},"12343":{"m":122,"g":121},"12315":{"m":122,"g":121},"12338":{"m":122,"g":121},"11673":{"m":122,"g":121},"12002":{"m":122,"g":121},"12336":{"m":122,"g":121},"12312":{"m":122,"g":121},"12317":{"m":122,"g":121},"12276":{"m":122,"g":121},"12294":{"m":122,"g":121},"12314":{"m":122,"g":121},"12144":{"m":122,"g":121},"10874":{"m":
122,"g":121},"12269":{"m":122,"g":121},"9825":{"m":122,"g":121},"12313":{"m":122,"g":121},"12311":{"m":122,"g":121},"12259":{"m":122,"g":121},"12241":{"m":122,"g":121},"12308":{"m":122,"g":121},"12299":{"m":122,"g":121},"12271":{"m":122,"g":121},"12296":{"m":122,"g":121},"12295":{"m":122,"g":121},"12285":{"m":122,"g":121},"12233":{"m":122,"g":121},"11928":{"m":122,"g":121},"12188":{"m":122,"g":121},"12283":{"m":122,"g":121},"12231":{"m":122,"g":121},"12284":{"m":122,"g":121},"12274":{"m":122,"g":121},"12257":{"m":122,"g":121},"12268":{"m":122,"g":121},"12267":{"m":122,"g":121},"12206":{"m":122,"g":121},"12247":{"m":122,"g":121},"10804":{"m":122,"g":121},"12230":{"m":122,"g":121},"12229":{"m":122,"g":121},"12252":{"m":122,"g":121},"12249":{"m":122,"g":121},"7873":{"m":122,"g":121},"10567":{"m":122,"g":121},"11177":{"m":122,"g":121},"11655":{"m":122,"g":121},"12245":{"m":122,"g":121},"11517":{"m":122,"g":121},"10654":{"m":122,"g":121},"12222":{"m":122,"g":121},"11994":{"m":122,"g":121},"12176":{"m":122,"g":121},"11708":{"m":122,"g":121},"12235":{"m":122,"g":121},"12161":{"m":122,"g":121},"12234":{"m":122,"g":121},"11142":{"m":122,"g":121},"12006":{"m":122,"g":121},"11592":{"m":122,"g":121},"11656":{"m":122,"g":121},"12186":{"m":122,"g":121},"12209":{"m":122,"g":121},"12205":{"m":122,"g":121},"12107":{"m":122,"g":121},"12112":{"m":122,"g":121},"10153":{"m":122,"g":121},"12117":{"m":122,"g":121},"12080":{"m":122,"g":121},"9403":{"m":122,"g":121},"12192":{"m":122,"g":121},"12173":{"m":122,"g":121},"12159":{"m":122,"g":121},"12057":{"m":122,"g":121},"12639":{"m":123,"g":127},"12572":{"m":123,"g":127},"12656":{"m":123,"g":127},"12456":{"m":123,"g":127},"12585":{"m":123,"g":127},"12648":{"m":123,"g":127},"12650":{"m":123,"g":127},"12640":{"m":123,"g":127},"12645":{"m":123,"g":127},"12647":{"m":123,"g":127},"12642":{"m":123,"g":127},"12641":{"m":123,"g":127},"12628":{"m":123,"g":127},"12634":{"m":123,"g":127},"12633":{"m":123,"g":127},"12632":{"m":123,"g":127},"12593":{"m":1
23,"g":127},"12616":{"m":123,"g":127},"12594":{"m":123,"g":127},"12592":{"m":123,"g":127},"11456":{"m":123,"g":127},"12615":{"m":123,"g":127},"12599":{"m":123,"g":127},"12522":{"m":123,"g":127},"10183":{"m":123,"g":127},"12598":{"m":123,"g":127},"6318":{"m":123,"g":127},"11131":{"m":123,"g":127},"11974":{"m":123,"g":127},"12580":{"m":123,"g":127},"12597":{"m":123,"g":127},"11760":{"m":123,"g":127},"12462":{"m":123,"g":127},"12165":{"m":123,"g":127},"12111":{"m":123,"g":127},"12547":{"m":123,"g":127},"12270":{"m":123,"g":127},"12044":{"m":123,"g":127},"12519":{"m":123,"g":127},"12571":{"m":123,"g":127},"12301":{"m":123,"g":127},"12569":{"m":123,"g":127},"12550":{"m":123,"g":127},"12549":{"m":123,"g":127},"12553":{"m":123,"g":127},"12524":{"m":123,"g":127},"12560":{"m":123,"g":127},"12227":{"m":123,"g":127},"12548":{"m":123,"g":127},"11330":{"m":123,"g":127},"12564":{"m":123,"g":127},"12561":{"m":123,"g":127},"12060":{"m":123,"g":127},"12536":{"m":123,"g":127},"12541":{"m":123,"g":127},"12532":{"m":123,"g":127},"12530":{"m":123,"g":127},"12523":{"m":123,"g":127},"12367":{"m":123,"g":127},"12502":{"m":123,"g":127},"12521":{"m":123,"g":127},"12515":{"m":123,"g":127},"12453":{"m":123,"g":127},"12481":{"m":123,"g":127},"12506":{"m":123,"g":127},"11917":{"m":123,"g":127},"12511":{"m":123,"g":127},"10078":{"m":123,"g":127},"12505":{"m":123,"g":127},"12507":{"m":123,"g":127},"11133":{"m":123,"g":127},"11052":{"m":123,"g":127},"12499":{"m":123,"g":127},"12391":{"m":123,"g":127},"12488":{"m":123,"g":127},"12412":{"m":123,"g":127},"11966":{"m":123,"g":127},"12238":{"m":123,"g":127},"12423":{"m":123,"g":127},"12500":{"m":123,"g":127},"12480":{"m":123,"g":127},"12485":{"m":123,"g":127},"12435":{"m":123,"g":127},"12483":{"m":123,"g":127},"12482":{"m":123,"g":127},"12226":{"m":123,"g":127},"12334":{"m":123,"g":127},"12739":{"m":124,"g":127},"12778":{"m":124,"g":127},"12440":{"m":124,"g":127},"12760":{"m":124,"g":127},"12565":{"m":124,"g":127},"12674":{"m":124,"g":127},"12646":{"m":
124,"g":127},"12240":{"m":124,"g":127},"12508":{"m":124,"g":127},"12737":{"m":124,"g":127},"12693":{"m":124,"g":127},"12752":{"m":124,"g":127},"12744":{"m":124,"g":127},"12736":{"m":124,"g":127},"12748":{"m":124,"g":127},"12741":{"m":124,"g":127},"12742":{"m":124,"g":127},"12738":{"m":124,"g":127},"11892":{"m":124,"g":127},"12721":{"m":124,"g":127},"12734":{"m":124,"g":127},"12723":{"m":124,"g":127},"12716":{"m":124,"g":127},"12732":{"m":124,"g":127},"12611":{"m":124,"g":127},"12728":{"m":124,"g":127},"12729":{"m":124,"g":127},"12718":{"m":124,"g":127},"12651":{"m":124,"g":127},"12711":{"m":124,"g":127},"12713":{"m":124,"g":127},"12714":{"m":124,"g":127},"12712":{"m":124,"g":127},"12658":{"m":124,"g":127},"12673":{"m":124,"g":127},"12710":{"m":124,"g":127},"12709":{"m":124,"g":127},"12406":{"m":124,"g":127},"12586":{"m":124,"g":127},"12631":{"m":124,"g":127},"12699":{"m":124,"g":127},"12484":{"m":124,"g":127},"12677":{"m":124,"g":127},"12696":{"m":124,"g":127},"12708":{"m":124,"g":127},"12609":{"m":124,"g":127},"12702":{"m":124,"g":127},"12455":{"m":124,"g":127},"12691":{"m":124,"g":127},"12687":{"m":124,"g":127},"11641":{"m":124,"g":127},"12680":{"m":124,"g":127},"10044":{"m":124,"g":127},"12668":{"m":124,"g":127},"12175":{"m":124,"g":127},"12670":{"m":124,"g":127},"8784":{"m":124,"g":127},"12486":{"m":124,"g":127},"12638":{"m":124,"g":127},"12353":{"m":124,"g":127},"13000":{"m":125,"g":127},"12908":{"m":125,"g":127},"12952":{"m":125,"g":127},"13010":{"m":125,"g":127},"11850":{"m":125,"g":127},"12781":{"m":125,"g":127},"12224":{"m":125,"g":127},"13013":{"m":125,"g":127},"13009":{"m":125,"g":127},"13001":{"m":125,"g":127},"13005":{"m":125,"g":127},"12996":{"m":125,"g":127},"12999":{"m":125,"g":127},"12966":{"m":125,"g":127},"12984":{"m":125,"g":127},"12982":{"m":125,"g":127},"10225":{"m":125,"g":127},"10702":{"m":125,"g":127},"11719":{"m":125,"g":127},"12916":{"m":125,"g":127},"12239":{"m":125,"g":127},"12883":{"m":125,"g":127},"12912":{"m":125,"g":127},"12604":{"m"
:125,"g":127},"12934":{"m":125,"g":127},"9528":{"m":125,"g":127},"12931":{"m":125,"g":127},"12959":{"m":125,"g":127},"12926":{"m":125,"g":127},"12803":{"m":125,"g":127},"12957":{"m":125,"g":127},"12943":{"m":125,"g":127},"12834":{"m":125,"g":127},"12956":{"m":125,"g":127},"12554":{"m":125,"g":127},"12946":{"m":125,"g":127},"12948":{"m":125,"g":127},"12940":{"m":125,"g":127},"11812":{"m":125,"g":127},"12928":{"m":125,"g":127},"12927":{"m":125,"g":127},"12839":{"m":125,"g":127},"10775":{"m":125,"g":127},"12917":{"m":125,"g":127},"12920":{"m":125,"g":127},"12332":{"m":125,"g":127},"12919":{"m":125,"g":127},"12448":{"m":125,"g":127},"12906":{"m":125,"g":127},"12907":{"m":125,"g":127},"12911":{"m":125,"g":127},"12895":{"m":125,"g":127},"12905":{"m":125,"g":127},"12904":{"m":125,"g":127},"12865":{"m":125,"g":127},"12900":{"m":125,"g":127},"12896":{"m":125,"g":127},"12891":{"m":125,"g":127},"12897":{"m":125,"g":127},"12889":{"m":125,"g":127},"12870":{"m":125,"g":127},"12361":{"m":125,"g":127},"12811":{"m":125,"g":127},"12888":{"m":125,"g":127},"12832":{"m":125,"g":127},"12843":{"m":125,"g":127},"12886":{"m":125,"g":127},"12798":{"m":125,"g":127},"12868":{"m":125,"g":127},"12853":{"m":125,"g":127},"12846":{"m":125,"g":127},"12582":{"m":125,"g":127},"12805":{"m":125,"g":127},"12849":{"m":125,"g":127},"12859":{"m":125,"g":127},"12852":{"m":125,"g":127},"12851":{"m":125,"g":127},"12856":{"m":125,"g":127},"12801":{"m":125,"g":127},"12822":{"m":125,"g":127},"12431":{"m":125,"g":127},"12825":{"m":125,"g":127},"12836":{"m":125,"g":127},"12374":{"m":125,"g":127},"12812":{"m":125,"g":127},"12816":{"m":125,"g":127},"12090":{"m":125,"g":127},"12758":{"m":125,"g":127},"12520":{"m":125,"g":127},"12765":{"m":125,"g":127},"12761":{"m":125,"g":127},"12763":{"m":125,"g":127},"12776":{"m":125,"g":127},"12772":{"m":125,"g":127},"12788":{"m":125,"g":127},"12576":{"m":125,"g":127},"12782":{"m":125,"g":127},"12794":{"m":125,"g":127},"12715":{"m":125,"g":127},"12795":{"m":125,"g":127},"12724":{"m
":125,"g":127},"12279":{"m":125,"g":127},"12717":{"m":125,"g":127},"8243":{"m":125,"g":127},"11904":{"m":125,"g":127},"12684":{"m":125,"g":127},"12764":{"m":125,"g":127},"12363":{"m":125,"g":127},"11051":{"m":125,"g":127},"12749":{"m":125,"g":127},"13129":{"m":126,"g":127},"10808":{"m":126,"g":127},"12617":{"m":126,"g":127},"7906":{"m":126,"g":127},"13149":{"m":126,"g":127},"7886":{"m":126,"g":127},"9790":{"m":126,"g":127},"13120":{"m":126,"g":127},"12458":{"m":126,"g":127},"11961":{"m":126,"g":127},"13137":{"m":126,"g":127},"12860":{"m":126,"g":127},"13136":{"m":126,"g":127},"13135":{"m":126,"g":127},"13077":{"m":126,"g":127},"13132":{"m":126,"g":127},"13131":{"m":126,"g":127},"13133":{"m":126,"g":127},"12942":{"m":126,"g":127},"12817":{"m":126,"g":127},"13095":{"m":126,"g":127},"12666":{"m":126,"g":127},"12396":{"m":126,"g":127},"13118":{"m":126,"g":127},"12872":{"m":126,"g":127},"12997":{"m":126,"g":127},"12863":{"m":126,"g":127},"13114":{"m":126,"g":127},"13039":{"m":126,"g":127},"11856":{"m":126,"g":127},"13105":{"m":126,"g":127},"13056":{"m":126,"g":127},"12583":{"m":126,"g":127},"13093":{"m":126,"g":127},"12660":{"m":126,"g":127},"13090":{"m":126,"g":127},"12866":{"m":126,"g":127},"12915":{"m":126,"g":127},"13018":{"m":126,"g":127},"13092":{"m":126,"g":127},"11645":{"m":126,"g":127},"13041":{"m":126,"g":127},"10862":{"m":126,"g":127},"13088":{"m":126,"g":127},"12814":{"m":126,"g":127},"12941":{"m":126,"g":127},"12994":{"m":126,"g":127},"13076":{"m":126,"g":127},"13015":{"m":126,"g":127},"13063":{"m":126,"g":127},"12199":{"m":126,"g":127},"12983":{"m":126,"g":127},"13037":{"m":126,"g":127},"12689":{"m":126,"g":127},"11609":{"m":126,"g":127},"12976":{"m":126,"g":127},"13050":{"m":126,"g":127},"12980":{"m":126,"g":127},"11938":{"m":126,"g":127},"13057":{"m":126,"g":127},"13053":{"m":126,"g":127},"13036":{"m":126,"g":127},"13043":{"m":126,"g":127},"13029":{"m":126,"g":127},"12885":{"m":126,"g":127},"12518":{"m":126,"g":127},"13035":{"m":126,"g":127},"13027":{"m":
126,"g":127},"13028":{"m":126,"g":127},"12218":{"m":126,"g":127},"12753":{"m":126,"g":127},"12869":{"m":126,"g":127},"12993":{"m":126,"g":127},"13012":{"m":126,"g":127},"13366":{"m":128,"g":131},"13389":{"m":128,"g":131},"13387":{"m":128,"g":131},"12903":{"m":128,"g":131},"13388":{"m":128,"g":131},"12874":{"m":128,"g":131},"13386":{"m":128,"g":131},"13385":{"m":128,"g":131},"13384":{"m":128,"g":131},"13381":{"m":128,"g":131},"13335":{"m":128,"g":131},"13228":{"m":128,"g":131},"13371":{"m":128,"g":131},"13339":{"m":128,"g":131},"12978":{"m":128,"g":131},"13332":{"m":128,"g":131},"13263":{"m":128,"g":131},"13375":{"m":128,"g":131},"13344":{"m":128,"g":131},"13373":{"m":128,"g":131},"13372":{"m":128,"g":131},"13336":{"m":128,"g":131},"13369":{"m":128,"g":131},"13348":{"m":128,"g":131},"13199":{"m":128,"g":131},"13179":{"m":128,"g":131},"12310":{"m":128,"g":131},"11870":{"m":128,"g":131},"13358":{"m":128,"g":131},"13101":{"m":128,"g":131},"13321":{"m":128,"g":131},"13355":{"m":128,"g":131},"13351":{"m":128,"g":131},"13181":{"m":128,"g":131},"13341":{"m":128,"g":131},"13325":{"m":128,"g":131},"12329":{"m":128,"g":131},"13337":{"m":128,"g":131},"12001":{"m":128,"g":131},"12692":{"m":128,"g":131},"12443":{"m":128,"g":131},"13331":{"m":128,"g":131},"13330":{"m":128,"g":131},"13329":{"m":128,"g":131},"10568":{"m":128,"g":131},"13306":{"m":128,"g":131},"13297":{"m":128,"g":131},"13326":{"m":128,"g":131},"13323":{"m":128,"g":131},"13287":{"m":128,"g":131},"13322":{"m":128,"g":131},"7415":{"m":128,"g":131},"13286":{"m":128,"g":131},"13285":{"m":128,"g":131},"13259":{"m":128,"g":131},"13320":{"m":128,"g":131},"13295":{"m":128,"g":131},"13318":{"m":128,"g":131},"13314":{"m":128,"g":131},"13226":{"m":128,"g":131},"13278":{"m":128,"g":131},"13279":{"m":128,"g":131},"12612":{"m":128,"g":131},"12871":{"m":128,"g":131},"13317":{"m":128,"g":131},"13294":{"m":128,"g":131},"13316":{"m":128,"g":131},"13315":{"m":128,"g":131},"13312":{"m":128,"g":127},"13091":{"m":128,"g":127},"13100":{"m"
:128,"g":127},"13311":{"m":128,"g":127},"13310":{"m":128,"g":127},"13045":{"m":128,"g":127},"13293":{"m":128,"g":127},"13305":{"m":128,"g":127},"13170":{"m":128,"g":127},"13298":{"m":128,"g":127},"13274":{"m":128,"g":127},"13235":{"m":128,"g":127},"13272":{"m":128,"g":127},"13260":{"m":128,"g":127},"10573":{"m":128,"g":127},"10665":{"m":128,"g":127},"13236":{"m":128,"g":127},"12777":{"m":128,"g":127},"13277":{"m":128,"g":127},"13288":{"m":128,"g":127},"13254":{"m":128,"g":127},"13284":{"m":128,"g":127},"13283":{"m":128,"g":127},"13242":{"m":128,"g":127},"12191":{"m":128,"g":127},"12605":{"m":128,"g":127},"12623":{"m":128,"g":127},"12622":{"m":128,"g":127},"12620":{"m":128,"g":127},"13113":{"m":128,"g":127},"13247":{"m":128,"g":127},"13237":{"m":128,"g":127},"13265":{"m":128,"g":127},"11589":{"m":128,"g":127},"13261":{"m":128,"g":127},"13256":{"m":128,"g":127},"13257":{"m":128,"g":127},"13255":{"m":128,"g":127},"13096":{"m":128,"g":127},"13221":{"m":128,"g":127},"13213":{"m":128,"g":127},"13246":{"m":128,"g":127},"13243":{"m":128,"g":127},"13186":{"m":128,"g":127},"13239":{"m":128,"g":127},"13097":{"m":128,"g":127},"12392":{"m":128,"g":127},"13222":{"m":128,"g":127},"13188":{"m":128,"g":127},"13218":{"m":128,"g":127},"13087":{"m":128,"g":127},"13142":{"m":128,"g":127},"11595":{"m":128,"g":127},"13210":{"m":128,"g":127},"12774":{"m":128,"g":127},"13211":{"m":128,"g":127},"13220":{"m":128,"g":127},"13171":{"m":128,"g":127},"13215":{"m":128,"g":127},"13212":{"m":128,"g":127},"10485":{"m":128,"g":127},"12543":{"m":128,"g":127},"12201":{"m":128,"g":127},"12376":{"m":128,"g":127},"13155":{"m":128,"g":127},"13148":{"m":128,"g":127},"13190":{"m":128,"g":127},"13178":{"m":128,"g":127},"13102":{"m":128,"g":127},"13154":{"m":128,"g":127},"12975":{"m":128,"g":127},"13163":{"m":128,"g":127},"13172":{"m":128,"g":127},"13150":{"m":128,"g":127},"10973":{"m":128,"g":127},"12288":{"m":128,"g":127},"13162":{"m":128,"g":127},"12215":{"m":128,"g":127},"13104":{"m":128,"g":127},"13164":{"
m":128,"g":127},"13127":{"m":128,"g":127},"12979":{"m":128,"g":127},"13128":{"m":128,"g":127},"12998":{"m":128,"g":127},"13075":{"m":128,"g":127},"10907":{"m":128,"g":127},"13153":{"m":128,"g":127},"12214":{"m":128,"g":127},"14316":{"m":129,"g":131},"14324":{"m":129,"g":131},"14323":{"m":129,"g":131},"14317":{"m":129,"g":131},"14309":{"m":129,"g":131},"14319":{"m":129,"g":131},"14262":{"m":129,"g":131},"14315":{"m":129,"g":131},"14249":{"m":129,"g":131},"11423":{"m":129,"g":131},"14278":{"m":129,"g":131},"14299":{"m":129,"g":131},"13089":{"m":129,"g":131},"14133":{"m":129,"g":131},"14269":{"m":129,"g":131},"14281":{"m":129,"g":131},"14287":{"m":129,"g":131},"14286":{"m":129,"g":131},"14283":{"m":129,"g":131},"14252":{"m":129,"g":131},"14244":{"m":129,"g":131},"14279":{"m":129,"g":131},"14276":{"m":129,"g":131},"13738":{"m":129,"g":131},"14047":{"m":129,"g":131},"14274":{"m":129,"g":131},"14257":{"m":129,"g":131},"14254":{"m":129,"g":131},"13700":{"m":129,"g":131},"14267":{"m":129,"g":131},"14261":{"m":129,"g":131},"14172":{"m":129,"g":131},"14263":{"m":129,"g":131},"14259":{"m":129,"g":131},"14260":{"m":129,"g":131},"14222":{"m":129,"g":131},"14232":{"m":129,"g":131},"13968":{"m":129,"g":131},"14256":{"m":129,"g":131},"14255":{"m":129,"g":131},"13880":{"m":129,"g":131},"13794":{"m":129,"g":131},"14247":{"m":129,"g":131},"13843":{"m":129,"g":131},"14250":{"m":129,"g":131},"14245":{"m":129,"g":131},"14243":{"m":129,"g":131},"14241":{"m":129,"g":131},"14240":{"m":129,"g":131},"14237":{"m":129,"g":131},"14152":{"m":129,"g":131},"14179":{"m":129,"g":131},"14229":{"m":129,"g":131},"14230":{"m":129,"g":131},"13693":{"m":129,"g":131},"14228":{"m":129,"g":131},"14122":{"m":129,"g":131},"14088":{"m":129,"g":131},"13887":{"m":129,"g":131},"14219":{"m":129,"g":131},"14218":{"m":129,"g":131},"14214":{"m":129,"g":131},"14212":{"m":129,"g":131},"14211":{"m":129,"g":131},"14173":{"m":129,"g":131},"14165":{"m":129,"g":131},"14123":{"m":129,"g":131},"14186":{"m":129,"g":131},"14180":
{"m":129,"g":131},"14167":{"m":129,"g":131},"14044":{"m":129,"g":131},"14182":{"m":129,"g":131},"14183":{"m":129,"g":131},"14181":{"m":129,"g":131},"14003":{"m":129,"g":131},"14034":{"m":129,"g":131},"14153":{"m":129,"g":131},"14187":{"m":129,"g":131},"12181":{"m":129,"g":131},"14155":{"m":129,"g":131},"14104":{"m":129,"g":131},"13873":{"m":129,"g":131},"14059":{"m":129,"g":131},"14140":{"m":129,"g":131},"13646":{"m":129,"g":131},"13841":{"m":129,"g":131},"14171":{"m":129,"g":131},"13907":{"m":129,"g":131},"14065":{"m":129,"g":131},"14166":{"m":129,"g":131},"14005":{"m":129,"g":131},"12494":{"m":129,"g":131},"14163":{"m":129,"g":131},"14156":{"m":129,"g":131},"14148":{"m":129,"g":131},"14161":{"m":129,"g":131},"14052":{"m":129,"g":131},"14150":{"m":129,"g":131},"14157":{"m":129,"g":131},"14154":{"m":129,"g":131},"14147":{"m":129,"g":131},"14146":{"m":129,"g":131},"14145":{"m":129,"g":131},"14136":{"m":129,"g":131},"13956":{"m":129,"g":131},"13759":{"m":129,"g":131},"14151":{"m":129,"g":131},"14135":{"m":129,"g":131},"14130":{"m":129,"g":131},"14119":{"m":129,"g":131},"14131":{"m":129,"g":131},"14129":{"m":129,"g":131},"14121":{"m":129,"g":131},"12306":{"m":129,"g":131},"14124":{"m":129,"g":131},"14113":{"m":129,"g":131},"14117":{"m":129,"g":131},"13377":{"m":129,"g":131},"13488":{"m":129,"g":131},"14111":{"m":129,"g":131},"12558":{"m":129,"g":131},"10712":{"m":129,"g":131},"14106":{"m":129,"g":131},"14067":{"m":129,"g":131},"14096":{"m":129,"g":131},"14094":{"m":129,"g":131},"13724":{"m":129,"g":131},"13904":{"m":129,"g":131},"14076":{"m":129,"g":131},"13936":{"m":129,"g":131},"14006":{"m":129,"g":131},"14082":{"m":129,"g":131},"13205":{"m":129,"g":131},"14079":{"m":129,"g":131},"14036":{"m":129,"g":131},"13944":{"m":129,"g":131},"13749":{"m":129,"g":131},"14069":{"m":129,"g":131},"13946":{"m":129,"g":131},"14048":{"m":129,"g":131},"13960":{"m":129,"g":131},"14057":{"m":129,"g":131},"13425":{"m":129,"g":131},"13854":{"m":129,"g":131},"14002":{"m":129,"g":131},"13855
":{"m":129,"g":131},"14040":{"m":129,"g":131},"13895":{"m":129,"g":131},"13976":{"m":129,"g":131},"13814":{"m":129,"g":131},"13965":{"m":129,"g":131},"14026":{"m":129,"g":131},"14017":{"m":129,"g":131},"14033":{"m":129,"g":131},"14030":{"m":129,"g":131},"14028":{"m":129,"g":131},"13761":{"m":129,"g":131},"14027":{"m":129,"g":131},"14025":{"m":129,"g":131},"13824":{"m":129,"g":131},"12277":{"m":129,"g":131},"13941":{"m":129,"g":131},"13983":{"m":129,"g":131},"14018":{"m":129,"g":131},"14007":{"m":129,"g":131},"14022":{"m":129,"g":131},"14019":{"m":129,"g":131},"14021":{"m":129,"g":131},"13937":{"m":129,"g":131},"13990":{"m":129,"g":131},"14020":{"m":129,"g":131},"13966":{"m":129,"g":131},"13872":{"m":129,"g":131},"14016":{"m":129,"g":131},"14013":{"m":129,"g":131},"14015":{"m":129,"g":131},"14014":{"m":129,"g":131},"14012":{"m":129,"g":131},"13151":{"m":129,"g":131},"13892":{"m":129,"g":131},"14000":{"m":129,"g":131},"14009":{"m":129,"g":131},"13766":{"m":129,"g":131},"13754":{"m":129,"g":131},"12491":{"m":129,"g":131},"13994":{"m":129,"g":131},"13991":{"m":129,"g":131},"13977":{"m":129,"g":131},"13203":{"m":129,"g":131},"12588":{"m":129,"g":131},"13922":{"m":129,"g":131},"13852":{"m":129,"g":131},"13963":{"m":129,"g":131},"10071":{"m":129,"g":131},"13961":{"m":129,"g":131},"13962":{"m":129,"g":131},"13958":{"m":129,"g":131},"7725":{"m":129,"g":131},"13925":{"m":129,"g":131},"12786":{"m":129,"g":131},"13954":{"m":129,"g":131},"13951":{"m":129,"g":131},"13950":{"m":129,"g":131},"13945":{"m":129,"g":131},"13942":{"m":129,"g":131},"12969":{"m":129,"g":131},"13910":{"m":129,"g":131},"13866":{"m":129,"g":131},"13903":{"m":129,"g":131},"13421":{"m":129,"g":131},"13851":{"m":129,"g":131},"13544":{"m":129,"g":131},"13938":{"m":129,"g":131},"13935":{"m":129,"g":131},"13933":{"m":129,"g":131},"13859":{"m":129,"g":131},"13931":{"m":129,"g":131},"13928":{"m":129,"g":131},"13927":{"m":129,"g":131},"13657":{"m":129,"g":131},"13081":{"m":129,"g":131},"12078":{"m":129,"g":131},"1392
1":{"m":129,"g":131},"13905":{"m":129,"g":131},"13916":{"m":129,"g":131},"13642":{"m":129,"g":131},"13908":{"m":129,"g":131},"13848":{"m":129,"g":131},"13793":{"m":129,"g":131},"13901":{"m":129,"g":131},"13827":{"m":129,"g":131},"13889":{"m":129,"g":131},"13890":{"m":129,"g":131},"13888":{"m":129,"g":131},"11893":{"m":129,"g":131},"13891":{"m":129,"g":131},"13870":{"m":129,"g":131},"13874":{"m":129,"g":131},"13860":{"m":129,"g":131},"13871":{"m":129,"g":131},"13572":{"m":129,"g":131},"10275":{"m":129,"g":131},"13487":{"m":129,"g":131},"13786":{"m":129,"g":131},"13834":{"m":129,"g":131},"13822":{"m":129,"g":131},"13783":{"m":129,"g":131},"13763":{"m":129,"g":131},"13752":{"m":129,"g":131},"13864":{"m":129,"g":131},"13865":{"m":129,"g":131},"13853":{"m":129,"g":131},"10027":{"m":129,"g":131},"13745":{"m":129,"g":131},"13858":{"m":129,"g":131},"13612":{"m":129,"g":131},"11871":{"m":129,"g":131},"13713":{"m":129,"g":131},"13508":{"m":129,"g":131},"13846":{"m":129,"g":131},"13819":{"m":129,"g":131},"13245":{"m":129,"g":131},"13751":{"m":129,"g":131},"13833":{"m":129,"g":131},"13831":{"m":129,"g":131},"13792":{"m":129,"g":131},"13829":{"m":129,"g":131},"13800":{"m":129,"g":131},"13656":{"m":129,"g":131},"13820":{"m":129,"g":131},"13201":{"m":129,"g":131},"13650":{"m":129,"g":131},"13816":{"m":129,"g":131},"13601":{"m":129,"g":131},"13810":{"m":129,"g":131},"13802":{"m":129,"g":131},"13815":{"m":129,"g":131},"13813":{"m":129,"g":131},"13687":{"m":129,"g":131},"13806":{"m":129,"g":131},"13781":{"m":129,"g":131},"13791":{"m":129,"g":131},"13180":{"m":129,"g":131},"13718":{"m":129,"g":131},"13787":{"m":129,"g":131},"13764":{"m":129,"g":131},"13709":{"m":129,"g":131},"13776":{"m":129,"g":131},"13777":{"m":129,"g":131},"13727":{"m":129,"g":131},"13676":{"m":129,"g":131},"13690":{"m":129,"g":131},"13547":{"m":129,"g":131},"13533":{"m":129,"g":131},"13769":{"m":129,"g":131},"13478":{"m":129,"g":131},"13736":{"m":129,"g":131},"13706":{"m":129,"g":131},"13768":{"m":129,"g":131},"13
704":{"m":129,"g":131},"13720":{"m":129,"g":131},"13714":{"m":129,"g":131},"12759":{"m":129,"g":131},"9405":{"m":129,"g":131},"13506":{"m":129,"g":131},"13756":{"m":129,"g":131},"13669":{"m":129,"g":131},"13729":{"m":129,"g":131},"13702":{"m":129,"g":131},"13746":{"m":129,"g":131},"13484":{"m":129,"g":131},"12690":{"m":129,"g":131},"13701":{"m":129,"g":131},"12949":{"m":129,"g":131},"13707":{"m":129,"g":131},"13694":{"m":129,"g":131},"13466":{"m":129,"g":131},"13739":{"m":129,"g":131},"13737":{"m":129,"g":131},"13735":{"m":129,"g":131},"13734":{"m":129,"g":131},"13733":{"m":129,"g":131},"13407":{"m":129,"g":131},"13649":{"m":129,"g":131},"13705":{"m":129,"g":131},"13630":{"m":129,"g":131},"13719":{"m":129,"g":131},"13327":{"m":129,"g":131},"13647":{"m":129,"g":131},"13708":{"m":129,"g":131},"13498":{"m":129,"g":131},"13590":{"m":129,"g":131},"13679":{"m":129,"g":131},"13564":{"m":129,"g":131},"13686":{"m":129,"g":131},"13177":{"m":129,"g":131},"13665":{"m":129,"g":131},"13675":{"m":129,"g":131},"13640":{"m":129,"g":131},"13587":{"m":129,"g":131},"13596":{"m":129,"g":131},"13555":{"m":129,"g":131},"13619":{"m":129,"g":131},"13301":{"m":129,"g":131},"12672":{"m":129,"g":131},"13683":{"m":129,"g":131},"13685":{"m":129,"g":131},"13627":{"m":129,"g":131},"13678":{"m":129,"g":131},"13677":{"m":129,"g":131},"13659":{"m":129,"g":131},"13667":{"m":129,"g":131},"13610":{"m":129,"g":131},"11526":{"m":129,"g":131},"13038":{"m":129,"g":131},"13600":{"m":129,"g":131},"13459":{"m":129,"g":131},"13666":{"m":129,"g":131},"12964":{"m":129,"g":131},"13655":{"m":129,"g":131},"13524":{"m":129,"g":131},"11577":{"m":129,"g":131},"13663":{"m":129,"g":131},"13637":{"m":129,"g":131},"13634":{"m":129,"g":131},"13644":{"m":129,"g":131},"13197":{"m":129,"g":131},"13617":{"m":129,"g":131},"13633":{"m":129,"g":131},"13453":{"m":129,"g":131},"13614":{"m":129,"g":131},"13554":{"m":129,"g":131},"13583":{"m":129,"g":131},"13328":{"m":129,"g":131},"13253":{"m":129,"g":131},"13248":{"m":129,"g":131},"1
2379":{"m":129,"g":131},"13528":{"m":129,"g":131},"13613":{"m":129,"g":131},"13429":{"m":129,"g":131},"13562":{"m":129,"g":131},"13055":{"m":129,"g":131},"13603":{"m":129,"g":131},"13604":{"m":129,"g":131},"13357":{"m":129,"g":131},"13570":{"m":129,"g":131},"13577":{"m":129,"g":131},"13465":{"m":129,"g":131},"13049":{"m":129,"g":131},"13448":{"m":129,"g":131},"13589":{"m":129,"g":131},"13567":{"m":129,"g":131},"13568":{"m":129,"g":131},"13413":{"m":129,"g":131},"13558":{"m":129,"g":131},"13557":{"m":129,"g":131},"13542":{"m":129,"g":131},"13481":{"m":129,"g":131},"12740":{"m":129,"g":131},"13551":{"m":129,"g":131},"13452":{"m":129,"g":131},"9234":{"m":129,"g":131},"13047":{"m":129,"g":131},"13548":{"m":129,"g":131},"13543":{"m":129,"g":131},"13541":{"m":129,"g":131},"13489":{"m":129,"g":131},"13495":{"m":129,"g":131},"13540":{"m":129,"g":131},"13537":{"m":129,"g":131},"13534":{"m":129,"g":131},"13536":{"m":129,"g":131},"13532":{"m":129,"g":131},"13527":{"m":129,"g":131},"13474":{"m":129,"g":131},"13525":{"m":129,"g":131},"13522":{"m":129,"g":131},"13521":{"m":129,"g":131},"13519":{"m":129,"g":131},"13516":{"m":129,"g":131},"12962":{"m":129,"g":131},"13513":{"m":129,"g":131},"13512":{"m":129,"g":131},"13510":{"m":129,"g":131},"13509":{"m":129,"g":131},"13168":{"m":129,"g":131},"13157":{"m":129,"g":131},"13501":{"m":129,"g":131},"13126":{"m":129,"g":131},"13496":{"m":129,"g":131},"13374":{"m":129,"g":131},"13491":{"m":129,"g":131},"13482":{"m":129,"g":131},"13486":{"m":129,"g":131},"13393":{"m":129,"g":131},"13460":{"m":129,"g":131},"13094":{"m":129,"g":131},"13479":{"m":129,"g":131},"13476":{"m":129,"g":131},"12149":{"m":129,"g":131},"13289":{"m":129,"g":131},"13473":{"m":129,"g":131},"13258":{"m":129,"g":131},"13229":{"m":129,"g":131},"13462":{"m":129,"g":131},"13444":{"m":129,"g":131},"13458":{"m":129,"g":131},"13449":{"m":129,"g":131},"13455":{"m":129,"g":131},"13140":{"m":129,"g":131},"13463":{"m":129,"g":131},"13264":{"m":129,"g":131},"12359":{"m":129,"g":131},"
13461":{"m":129,"g":131},"13457":{"m":129,"g":131},"13173":{"m":129,"g":131},"13456":{"m":129,"g":131},"13273":{"m":129,"g":131},"13022":{"m":129,"g":131},"13450":{"m":129,"g":131},"13447":{"m":129,"g":131},"13445":{"m":129,"g":131},"13443":{"m":129,"g":131},"13418":{"m":129,"g":131},"13217":{"m":129,"g":131},"5879":{"m":129,"g":131},"11900":{"m":129,"g":131},"13144":{"m":129,"g":131},"13379":{"m":129,"g":131},"13282":{"m":129,"g":131},"13420":{"m":129,"g":131},"13416":{"m":129,"g":131},"13415":{"m":129,"g":131},"13399":{"m":129,"g":131},"13004":{"m":129,"g":131},"13324":{"m":129,"g":131},"13112":{"m":129,"g":131},"11644":{"m":129,"g":131},"13398":{"m":129,"g":131},"13396":{"m":129,"g":131},"13345":{"m":129,"g":131},"13338":{"m":129,"g":131},"12065":{"m":129,"g":131},"13391":{"m":129,"g":131},"13383":{"m":129,"g":131},"14670":{"m":130,"g":131},"14650":{"m":130,"g":131},"14457":{"m":130,"g":131},"14657":{"m":130,"g":131},"14667":{"m":130,"g":131},"14634":{"m":130,"g":131},"14664":{"m":130,"g":131},"14663":{"m":130,"g":131},"14658":{"m":130,"g":131},"14497":{"m":130,"g":131},"14651":{"m":130,"g":131},"14649":{"m":130,"g":131},"14356":{"m":130,"g":131},"14629":{"m":130,"g":131},"14558":{"m":130,"g":131},"12527":{"m":130,"g":131},"14632":{"m":130,"g":131},"14606":{"m":130,"g":131},"12551":{"m":130,"g":131},"14625":{"m":130,"g":131},"14556":{"m":130,"g":131},"14585":{"m":130,"g":131},"14618":{"m":130,"g":131},"14203":{"m":130,"g":131},"14452":{"m":130,"g":131},"14612":{"m":130,"g":131},"14609":{"m":130,"g":131},"14604":{"m":130,"g":131},"14608":{"m":130,"g":131},"14605":{"m":130,"g":131},"14600":{"m":130,"g":131},"14591":{"m":130,"g":131},"14386":{"m":130,"g":131},"14573":{"m":130,"g":131},"14551":{"m":130,"g":131},"14141":{"m":130,"g":131},"14590":{"m":130,"g":131},"14588":{"m":130,"g":131},"14587":{"m":130,"g":131},"14586":{"m":130,"g":131},"14517":{"m":130,"g":131},"14455":{"m":130,"g":131},"14553":{"m":130,"g":131},"13573":{"m":130,"g":131},"14132":{"m":130,"g":131},
"14185":{"m":130,"g":131},"14576":{"m":130,"g":131},"14577":{"m":130,"g":131},"13725":{"m":130,"g":131},"13998":{"m":130,"g":131},"14569":{"m":130,"g":131},"14412":{"m":130,"g":131},"14544":{"m":130,"g":131},"14561":{"m":130,"g":131},"14560":{"m":130,"g":131},"14559":{"m":130,"g":131},"14337":{"m":130,"g":131},"14555":{"m":130,"g":131},"14494":{"m":130,"g":131},"14476":{"m":130,"g":131},"14205":{"m":130,"g":131},"14557":{"m":130,"g":131},"14447":{"m":130,"g":131},"14552":{"m":130,"g":131},"14518":{"m":130,"g":131},"14538":{"m":130,"g":131},"14520":{"m":130,"g":131},"14535":{"m":130,"g":131},"14493":{"m":130,"g":131},"14464":{"m":130,"g":131},"14543":{"m":130,"g":131},"13897":{"m":130,"g":131},"14505":{"m":130,"g":131},"14539":{"m":130,"g":131},"13115":{"m":130,"g":131},"14533":{"m":130,"g":131},"12324":{"m":130,"g":131},"14290":{"m":130,"g":131},"14528":{"m":130,"g":131},"14530":{"m":130,"g":131},"11791":{"m":130,"g":131},"14522":{"m":130,"g":131},"14465":{"m":130,"g":131},"14521":{"m":130,"g":131},"14291":{"m":130,"g":131},"14427":{"m":130,"g":131},"14516":{"m":130,"g":131},"14514":{"m":130,"g":131},"14513":{"m":130,"g":131},"14512":{"m":130,"g":131},"14312":{"m":130,"g":131},"14405":{"m":130,"g":131},"14420":{"m":130,"g":131},"14460":{"m":130,"g":131},"14508":{"m":130,"g":131},"13607":{"m":130,"g":131},"14507":{"m":130,"g":131},"12471":{"m":130,"g":131},"14093":{"m":130,"g":131},"14234":{"m":130,"g":131},"13434":{"m":130,"g":131},"14506":{"m":130,"g":131},"14471":{"m":130,"g":131},"14459":{"m":130,"g":131},"14466":{"m":130,"g":131},"14456":{"m":130,"g":131},"14097":{"m":130,"g":131},"14499":{"m":130,"g":131},"13584":{"m":130,"g":131},"14364":{"m":130,"g":131},"13861":{"m":130,"g":131},"13996":{"m":130,"g":131},"14472":{"m":130,"g":131},"14484":{"m":130,"g":131},"14444":{"m":130,"g":131},"14475":{"m":130,"g":131},"14463":{"m":130,"g":131},"14473":{"m":130,"g":131},"14468":{"m":130,"g":131},"13836":{"m":130,"g":131},"14421":{"m":130,"g":131},"14432":{"m":130,"g":131
},"14251":{"m":130,"g":131},"14450":{"m":130,"g":131},"14445":{"m":130,"g":131},"14446":{"m":130,"g":131},"8287":{"m":130,"g":131},"14348":{"m":130,"g":131},"14350":{"m":130,"g":131},"14441":{"m":130,"g":131},"14325":{"m":130,"g":131},"14440":{"m":130,"g":131},"14438":{"m":130,"g":131},"14143":{"m":130,"g":131},"14434":{"m":130,"g":131},"14366":{"m":130,"g":131},"14430":{"m":130,"g":131},"14429":{"m":130,"g":131},"14225":{"m":130,"g":131},"14409":{"m":130,"g":131},"14213":{"m":130,"g":131},"14224":{"m":130,"g":131},"14334":{"m":130,"g":131},"14399":{"m":130,"g":131},"12446":{"m":130,"g":131},"13359":{"m":130,"g":131},"14383":{"m":130,"g":131},"14394":{"m":130,"g":131},"14381":{"m":130,"g":131},"12309":{"m":130,"g":131},"14393":{"m":130,"g":131},"12316":{"m":130,"g":131},"14292":{"m":130,"g":131},"14392":{"m":130,"g":131},"14272":{"m":130,"g":131},"13731":{"m":130,"g":131},"14359":{"m":130,"g":131},"14377":{"m":130,"g":131},"14330":{"m":130,"g":131},"14277":{"m":130,"g":131},"14375":{"m":130,"g":131},"14374":{"m":130,"g":131},"14253":{"m":130,"g":131},"14372":{"m":130,"g":131},"14226":{"m":130,"g":131},"14371":{"m":130,"g":131},"14326":{"m":130,"g":131},"9660":{"m":130,"g":131},"12330":{"m":130,"g":131},"14355":{"m":130,"g":131},"13585":{"m":130,"g":131},"14362":{"m":130,"g":131},"14271":{"m":130,"g":131},"14295":{"m":130,"g":131},"13980":{"m":130,"g":131},"14347":{"m":130,"g":131},"14333":{"m":130,"g":131},"12441":{"m":130,"g":131},"14344":{"m":130,"g":131},"14265":{"m":130,"g":131},"14335":{"m":130,"g":131},"14336":{"m":130,"g":131},"13350":{"m":130,"g":131},"14266":{"m":130,"g":131},"14329":{"m":130,"g":131},"13812":{"m":130,"g":131},"14195":{"m":130,"g":131},"14321":{"m":130,"g":131},"13710":{"m":130,"g":131},"14858":{"m":132,"g":133},"14620":{"m":132,"g":133},"14304":{"m":132,"g":133},"14917":{"m":132,"g":133},"14307":{"m":132,"g":133},"14887":{"m":132,"g":133},"14911":{"m":132,"g":133},"14910":{"m":132,"g":133},"14852":{"m":132,"g":133},"14889":{"m":132,"g":133
},"14890":{"m":132,"g":133},"14900":{"m":132,"g":133},"14899":{"m":132,"g":133},"12287":{"m":132,"g":133},"14878":{"m":132,"g":133},"14541":{"m":132,"g":133},"13641":{"m":132,"g":133},"14828":{"m":132,"g":133},"14827":{"m":132,"g":133},"14853":{"m":132,"g":133},"14876":{"m":132,"g":133},"14880":{"m":132,"g":133},"14877":{"m":132,"g":133},"14856":{"m":132,"g":133},"14875":{"m":132,"g":133},"14861":{"m":132,"g":133},"14845":{"m":132,"g":133},"14871":{"m":132,"g":133},"14313":{"m":132,"g":133},"14554":{"m":132,"g":133},"14865":{"m":132,"g":133},"14811":{"m":132,"g":133},"14836":{"m":132,"g":133},"14848":{"m":132,"g":133},"14854":{"m":132,"g":133},"14849":{"m":132,"g":133},"14844":{"m":132,"g":133},"14851":{"m":132,"g":133},"14850":{"m":132,"g":133},"14847":{"m":132,"g":133},"14442":{"m":132,"g":133},"14841":{"m":132,"g":133},"14796":{"m":132,"g":133},"14638":{"m":132,"g":133},"14823":{"m":132,"g":133},"14801":{"m":132,"g":133},"14837":{"m":132,"g":133},"14045":{"m":132,"g":133},"14833":{"m":132,"g":133},"14829":{"m":132,"g":133},"14769":{"m":132,"g":133},"14712":{"m":132,"g":133},"14716":{"m":132,"g":133},"14830":{"m":132,"g":133},"14834":{"m":132,"g":133},"14812":{"m":132,"g":133},"14831":{"m":132,"g":133},"14806":{"m":132,"g":133},"14822":{"m":132,"g":133},"14819":{"m":132,"g":133},"14710":{"m":132,"g":133},"14807":{"m":132,"g":133},"14770":{"m":132,"g":133},"14793":{"m":132,"g":133},"14808":{"m":132,"g":133},"14697":{"m":132,"g":133},"14720":{"m":132,"g":133},"14803":{"m":132,"g":133},"14788":{"m":132,"g":133},"14794":{"m":132,"g":133},"9650":{"m":132,"g":133},"14786":{"m":132,"g":133},"14784":{"m":132,"g":133},"14725":{"m":132,"g":133},"14777":{"m":132,"g":133},"14761":{"m":132,"g":133},"14759":{"m":132,"g":133},"14064":{"m":132,"g":133},"14768":{"m":132,"g":133},"14756":{"m":132,"g":133},"14744":{"m":132,"g":133},"14687":{"m":132,"g":133},"14763":{"m":132,"g":131},"14177":{"m":132,"g":131},"14758":{"m":132,"g":131},"14669":{"m":132,"g":131},"14740":{"m":132,"g":13
1},"14753":{"m":132,"g":131},"14698":{"m":132,"g":131},"14379":{"m":132,"g":131},"14752":{"m":132,"g":131},"14751":{"m":132,"g":131},"14745":{"m":132,"g":131},"12953":{"m":132,"g":131},"14743":{"m":132,"g":131},"14738":{"m":132,"g":131},"14733":{"m":132,"g":131},"12039":{"m":132,"g":131},"13432":{"m":132,"g":131},"14461":{"m":132,"g":131},"14686":{"m":132,"g":131},"14601":{"m":132,"g":131},"14622":{"m":132,"g":131},"14714":{"m":132,"g":131},"14707":{"m":132,"g":131},"14699":{"m":132,"g":131},"14647":{"m":132,"g":131},"14648":{"m":132,"g":131},"14683":{"m":132,"g":131},"14678":{"m":132,"g":131},"14676":{"m":132,"g":131},"14529":{"m":132,"g":131},"14689":{"m":132,"g":131},"14627":{"m":132,"g":131},"14679":{"m":132,"g":131},"14469":{"m":132,"g":131},"14614":{"m":132,"g":131},"14653":{"m":132,"g":131},"13147":{"m":132,"g":131},"14652":{"m":132,"g":131},"13334":{"m":132,"g":131},"14489":{"m":132,"g":131},"14675":{"m":132,"g":131},"14671":{"m":132,"g":131},"16253":{"m":134,"g":135},"16107":{"m":134,"g":135},"16244":"m134","16241":{"m":134,"g":135},"16211":{"m":134,"g":135},"16153":{"m":134,"g":135},"15942":{"m":134,"g":135},"16140":{"m":134,"g":135},"16142":{"m":134,"g":135},"16141":{"m":134,"g":135},"16129":{"m":134,"g":135},"10959":{"m":134,"g":135},"15888":{"m":134,"g":135},"16114":{"m":134,"g":135},"16131":{"m":134,"g":135},"16133":{"m":134,"g":135},"16053":{"m":134,"g":135},"16130":{"m":134,"g":135},"16105":{"m":134,"g":135},"15187":{"m":134,"g":135},"16123":{"m":134,"g":135},"15813":{"m":134,"g":135},"16081":{"m":134,"g":135},"15896":{"m":134,"g":135},"15877":{"m":134,"g":135},"15800":{"m":134,"g":135},"15985":{"m":134,"g":135},"16103":{"m":134,"g":135},"16099":{"m":134,"g":135},"15921":{"m":134,"g":135},"16101":{"m":134,"g":135},"16100":{"m":134,"g":135},"16098":{"m":134,"g":135},"16097":{"m":134,"g":135},"16094":{"m":134,"g":135},"16096":{"m":134,"g":135},"16093":{"m":134,"g":135},"15057":{"m":134,"g":135},"14838":{"m":134,"g":135},"16061":{"m":134,"g":135},"16066
":{"m":134,"g":135},"16087":{"m":134,"g":135},"16069":{"m":134,"g":135},"16085":{"m":134,"g":135},"16062":{"m":134,"g":135},"14414":{"m":134,"g":135},"16047":{"m":134,"g":135},"15805":{"m":134,"g":135},"16054":{"m":134,"g":135},"16003":{"m":134,"g":135},"16046":{"m":134,"g":135},"16051":{"m":134,"g":135},"16038":{"m":134,"g":135},"16041":{"m":134,"g":135},"16039":{"m":134,"g":135},"16037":{"m":134,"g":135},"16035":{"m":134,"g":135},"16036":{"m":134,"g":135},"16010":{"m":134,"g":135},"14280":{"m":134,"g":135},"16028":{"m":134,"g":135},"16017":{"m":134,"g":135},"14873":{"m":134,"g":135},"15922":{"m":134,"g":135},"16016":{"m":134,"g":135},"15939":{"m":134,"g":135},"15998":{"m":134,"g":135},"15928":{"m":134,"g":135},"16008":{"m":134,"g":135},"16002":{"m":134,"g":135},"16013":{"m":134,"g":135},"16004":{"m":134,"g":135},"15615":{"m":134,"g":135},"15992":{"m":134,"g":135},"15991":{"m":134,"g":135},"16001":{"m":134,"g":135},"15945":{"m":134,"g":135},"15990":{"m":134,"g":135},"15988":{"m":134,"g":135},"15987":{"m":134,"g":135},"15891":{"m":134,"g":135},"15216":{"m":134,"g":135},"15693":{"m":134,"g":135},"15353":{"m":134,"g":135},"15835":{"m":134,"g":135},"15806":{"m":134,"g":135},"15937":{"m":134,"g":135},"15986":{"m":134,"g":135},"12596":{"m":134,"g":135},"15919":{"m":134,"g":135},"15947":{"m":134,"g":135},"15925":{"m":134,"g":135},"15936":{"m":134,"g":135},"15886":{"m":134,"g":135},"15943":{"m":134,"g":135},"15935":{"m":134,"g":135},"15934":{"m":134,"g":135},"15933":{"m":134,"g":135},"14736":{"m":134,"g":135},"15923":{"m":134,"g":135},"15907":{"m":134,"g":135},"15887":{"m":134,"g":135},"15920":{"m":134,"g":135},"15918":{"m":134,"g":135},"15905":{"m":134,"g":135},"15850":{"m":134,"g":135},"15915":{"m":134,"g":135},"15398":{"m":134,"g":135},"15914":{"m":134,"g":135},"15916":{"m":134,"g":135},"15913":{"m":134,"g":135},"15911":{"m":134,"g":135},"15910":{"m":134,"g":135},"14750":{"m":134,"g":135},"15906":{"m":134,"g":135},"15778":{"m":134,"g":135},"15844":{"m":134,"g":135},"158
12":{"m":134,"g":135},"15842":{"m":134,"g":135},"15881":{"m":134,"g":135},"15874":{"m":134,"g":135},"15889":{"m":134,"g":135},"15846":{"m":134,"g":135},"14209":{"m":134,"g":135},"15849":{"m":134,"g":135},"15817":{"m":134,"g":135},"15870":{"m":134,"g":135},"15867":{"m":134,"g":135},"14644":{"m":134,"g":135},"15518":{"m":134,"g":135},"15821":{"m":134,"g":135},"15369":{"m":134,"g":135},"15858":{"m":134,"g":135},"15857":{"m":134,"g":135},"15820":{"m":134,"g":135},"15851":{"m":134,"g":135},"15701":{"m":134,"g":135},"15791":{"m":134,"g":135},"15847":{"m":134,"g":135},"15522":{"m":134,"g":135},"15796":{"m":134,"g":135},"15826":{"m":134,"g":135},"15822":{"m":134,"g":135},"15772":{"m":134,"g":135},"15356":{"m":134,"g":135},"15759":{"m":134,"g":135},"15827":{"m":134,"g":135},"15488":{"m":134,"g":135},"15815":{"m":134,"g":135},"15818":{"m":134,"g":135},"15802":{"m":134,"g":135},"15666":{"m":134,"g":135},"15709":{"m":134,"g":135},"14741":{"m":134,"g":135},"15803":{"m":134,"g":135},"15409":{"m":134,"g":135},"15798":{"m":134,"g":135},"15811":{"m":134,"g":135},"15720":{"m":134,"g":135},"15736":{"m":134,"g":135},"15801":{"m":134,"g":135},"15555":{"m":134,"g":135},"15770":{"m":134,"g":135},"11469":{"m":134,"g":135},"15586":{"m":134,"g":135},"15596":{"m":134,"g":135},"14032":{"m":134,"g":135},"15787":{"m":134,"g":135},"15700":{"m":134,"g":135},"15781":{"m":134,"g":133},"15149":{"m":134,"g":133},"15775":{"m":134,"g":133},"15782":{"m":134,"g":133},"15758":{"m":134,"g":133},"15745":{"m":134,"g":133},"15750":{"m":134,"g":133},"15780":{"m":134,"g":133},"15741":{"m":134,"g":133},"14137":{"m":134,"g":133},"15390":{"m":134,"g":133},"15718":{"m":134,"g":133},"15769":{"m":134,"g":133},"15768":{"m":134,"g":133},"15653":{"m":134,"g":133},"15652":{"m":134,"g":133},"15752":{"m":134,"g":133},"15747":{"m":134,"g":133},"15748":{"m":134,"g":133},"15743":{"m":134,"g":133},"15740":{"m":134,"g":133},"15655":{"m":134,"g":133},"15706":{"m":134,"g":133},"15459":{"m":134,"g":133},"15689":{"m":134,"g":133},"1
5593":{"m":134,"g":133},"15704":{"m":134,"g":133},"15691":{"m":134,"g":133},"15656":{"m":134,"g":133},"15717":{"m":134,"g":133},"15715":{"m":134,"g":133},"15722":{"m":134,"g":133},"15716":{"m":134,"g":133},"15719":{"m":134,"g":133},"11828":{"m":134,"g":133},"15500":{"m":134,"g":133},"15622":{"m":134,"g":133},"15624":{"m":134,"g":133},"15705":{"m":134,"g":133},"15177":{"m":134,"g":133},"15702":{"m":134,"g":133},"15538":{"m":134,"g":133},"15667":{"m":134,"g":133},"15695":{"m":134,"g":133},"15696":{"m":134,"g":133},"15606":{"m":134,"g":133},"15273":{"m":134,"g":133},"15672":{"m":134,"g":133},"15694":{"m":134,"g":133},"15469":{"m":134,"g":133},"12968":{"m":134,"g":133},"15644":{"m":134,"g":133},"15692":{"m":134,"g":133},"14570":{"m":134,"g":133},"15091":{"m":134,"g":133},"15646":{"m":134,"g":133},"15633":{"m":134,"g":133},"15688":{"m":134,"g":133},"15684":{"m":134,"g":133},"15312":{"m":134,"g":133},"15600":{"m":134,"g":133},"15463":{"m":134,"g":133},"15460":{"m":134,"g":133},"14983":{"m":134,"g":133},"15563":{"m":134,"g":133},"15582":{"m":134,"g":133},"14628":{"m":134,"g":133},"15539":{"m":134,"g":133},"15570":{"m":134,"g":133},"15635":{"m":134,"g":133},"15590":{"m":134,"g":133},"13576":{"m":134,"g":133},"15632":{"m":134,"g":133},"15621":{"m":134,"g":133},"15368":{"m":134,"g":133},"15611":{"m":134,"g":133},"15610":{"m":134,"g":133},"15572":{"m":134,"g":133},"15573":{"m":134,"g":133},"15616":{"m":134,"g":133},"15613":{"m":134,"g":133},"15607":{"m":134,"g":133},"15612":{"m":134,"g":133},"15164":{"m":134,"g":133},"15566":{"m":134,"g":133},"15599":{"m":134,"g":133},"15589":{"m":134,"g":133},"15588":{"m":134,"g":133},"15587":{"m":134,"g":133},"15565":{"m":134,"g":133},"15581":{"m":134,"g":133},"15585":{"m":134,"g":133},"15230":{"m":134,"g":133},"15427":{"m":134,"g":133},"15553":{"m":134,"g":133},"15583":{"m":134,"g":133},"15580":{"m":134,"g":133},"15569":{"m":134,"g":133},"14901":{"m":134,"g":133},"15564":{"m":134,"g":133},"15578":{"m":134,"g":133},"15579":{"m":134,"g":133},
"15509":{"m":134,"g":133},"15540":{"m":134,"g":133},"15568":{"m":134,"g":133},"15558":{"m":134,"g":133},"12162":{"m":134,"g":133},"15531":{"m":134,"g":133},"15111":{"m":134,"g":133},"15432":{"m":134,"g":133},"15554":{"m":134,"g":133},"15556":{"m":134,"g":133},"15537":{"m":134,"g":133},"15520":{"m":134,"g":133},"15447":{"m":134,"g":133},"15324":{"m":134,"g":133},"15436":{"m":134,"g":133},"14134":{"m":134,"g":133},"15552":{"m":134,"g":133},"15526":{"m":134,"g":133},"15547":{"m":134,"g":133},"15413":{"m":134,"g":133},"15515":{"m":134,"g":133},"15544":{"m":134,"g":133},"15542":{"m":134,"g":133},"15418":{"m":134,"g":133},"15479":{"m":134,"g":133},"15533":{"m":134,"g":133},"15534":{"m":134,"g":133},"15530":{"m":134,"g":133},"15511":{"m":134,"g":133},"15536":{"m":134,"g":133},"15320":{"m":134,"g":133},"14091":{"m":134,"g":133},"15464":{"m":134,"g":133},"15484":{"m":134,"g":133},"15172":{"m":134,"g":133},"15178":{"m":134,"g":133},"15498":{"m":134,"g":133},"15333":{"m":134,"g":133},"15267":{"m":134,"g":133},"15507":{"m":134,"g":133},"15510":{"m":134,"g":133},"15485":{"m":134,"g":133},"15473":{"m":134,"g":133},"15505":{"m":134,"g":133},"15504":{"m":134,"g":133},"15503":{"m":134,"g":133},"15497":{"m":134,"g":133},"15040":{"m":134,"g":133},"15296":{"m":134,"g":133},"15496":{"m":134,"g":133},"15495":{"m":134,"g":133},"15494":{"m":134,"g":133},"15491":{"m":134,"g":133},"15022":{"m":134,"g":133},"14164":{"m":134,"g":133},"15348":{"m":134,"g":133},"15406":{"m":134,"g":133},"15483":{"m":134,"g":133},"15166":{"m":134,"g":133},"14138":{"m":134,"g":133},"15408":{"m":134,"g":133},"15416":{"m":134,"g":133},"15219":{"m":134,"g":133},"15478":{"m":134,"g":133},"15382":{"m":134,"g":133},"12995":{"m":134,"g":133},"15394":{"m":134,"g":133},"15458":{"m":134,"g":133},"15262":{"m":134,"g":133},"15437":{"m":134,"g":133},"14395":{"m":134,"g":133},"15253":{"m":134,"g":133},"15415":{"m":134,"g":133},"15433":{"m":134,"g":133},"13402":{"m":134,"g":133},"13760":{"m":134,"g":133},"15340":{"m":134,"g":133
},"13782":{"m":134,"g":133},"15423":{"m":134,"g":133},"15425":{"m":134,"g":133},"15287":{"m":134,"g":133},"15207":{"m":134,"g":133},"15410":{"m":134,"g":133},"15371":{"m":134,"g":133},"15429":{"m":134,"g":133},"15431":{"m":134,"g":133},"14723":{"m":134,"g":133},"14843":{"m":134,"g":133},"14353":{"m":134,"g":133},"15424":{"m":134,"g":133},"14354":{"m":134,"g":133},"15318":{"m":134,"g":133},"14781":{"m":134,"g":133},"15421":{"m":134,"g":133},"15352":{"m":134,"g":133},"15395":{"m":134,"g":133},"15401":{"m":134,"g":133},"15407":{"m":134,"g":133},"15400":{"m":134,"g":133},"15396":{"m":134,"g":133},"15404":{"m":134,"g":133},"15397":{"m":134,"g":133},"12921":{"m":134,"g":133},"15298":{"m":134,"g":133},"15141":{"m":134,"g":133},"15379":{"m":134,"g":133},"15306":{"m":134,"g":133},"14270":{"m":134,"g":133},"15384":{"m":134,"g":133},"15361":{"m":134,"g":133},"15372":{"m":134,"g":133},"12967":{"m":134,"g":133},"15290":{"m":134,"g":133},"14501":{"m":134,"g":133},"15049":{"m":134,"g":133},"15354":{"m":134,"g":133},"15337":{"m":134,"g":133},"15278":{"m":134,"g":133},"15131":{"m":134,"g":133},"15205":{"m":134,"g":133},"15307":{"m":134,"g":133},"14860":{"m":134,"g":133},"15176":{"m":134,"g":133},"15277":{"m":134,"g":133},"15328":{"m":134,"g":133},"15120":{"m":134,"g":133},"15241":{"m":134,"g":133},"15308":{"m":134,"g":133},"15336":{"m":134,"g":133},"15316":{"m":134,"g":133},"15335":{"m":134,"g":133},"15329":{"m":134,"g":133},"15186":{"m":134,"g":133},"15222":{"m":134,"g":133},"15198":{"m":134,"g":133},"15189":{"m":134,"g":133},"15326":{"m":134,"g":133},"15237":{"m":134,"g":133},"15138":{"m":134,"g":133},"14918":{"m":134,"g":133},"11914":{"m":134,"g":133},"15304":{"m":134,"g":133},"15223":{"m":134,"g":133},"15233":{"m":134,"g":133},"15314":{"m":134,"g":133},"12333":{"m":134,"g":133},"15071":{"m":134,"g":133},"15284":{"m":134,"g":133},"14449":{"m":134,"g":133},"14357":{"m":134,"g":133},"15088":{"m":134,"g":133},"14376":{"m":134,"g":133},"15155":{"m":134,"g":133},"13571":{"m":134,"g":1
33},"15283":{"m":134,"g":133},"15218":{"m":134,"g":133},"15297":{"m":134,"g":133},"15258":{"m":134,"g":133},"15280":{"m":134,"g":133},"15293":{"m":134,"g":133},"15292":{"m":134,"g":133},"15291":{"m":134,"g":133},"15282":{"m":134,"g":133},"15242":{"m":134,"g":133},"14997":{"m":134,"g":133},"14934":{"m":134,"g":133},"15281":{"m":134,"g":133},"14857":{"m":134,"g":133},"14975":{"m":134,"g":133},"14936":{"m":134,"g":133},"15234":{"m":134,"g":133},"15270":{"m":134,"g":133},"15232":{"m":134,"g":133},"15192":{"m":134,"g":133},"15100":{"m":134,"g":133},"9337":{"m":134,"g":133},"15239":{"m":134,"g":133},"15220":{"m":134,"g":133},"14866":{"m":134,"g":133},"14415":{"m":134,"g":133},"15180":{"m":134,"g":133},"15225":{"m":134,"g":133},"15162":{"m":134,"g":133},"14990":{"m":134,"g":133},"15229":{"m":134,"g":133},"15231":{"m":134,"g":133},"15204":{"m":134,"g":133},"15092":{"m":134,"g":133},"15224":{"m":134,"g":133},"13410":{"m":134,"g":133},"15212":{"m":134,"g":133},"15185":{"m":134,"g":133},"15153":{"m":134,"g":133},"14820":{"m":134,"g":133},"15201":{"m":134,"g":133},"15127":{"m":134,"g":133},"15191":{"m":134,"g":133},"15190":{"m":134,"g":133},"15196":{"m":134,"g":133},"15193":{"m":134,"g":133},"15160":{"m":134,"g":133},"15047":{"m":134,"g":133},"15017":{"m":134,"g":133},"15163":{"m":134,"g":133},"15152":{"m":134,"g":133},"14906":{"m":134,"g":133},"15116":{"m":134,"g":133},"14862":{"m":134,"g":133},"15174":{"m":134,"g":133},"9324":{"m":134,"g":133},"15170":{"m":134,"g":133},"15005":{"m":134,"g":133},"13914":{"m":134,"g":133},"15158":{"m":134,"g":133},"15156":{"m":134,"g":133},"15154":{"m":134,"g":133},"15144":{"m":134,"g":133},"14778":{"m":134,"g":133},"15147":{"m":134,"g":133},"15146":{"m":134,"g":133},"15130":{"m":134,"g":133},"14907":{"m":134,"g":133},"14764":{"m":134,"g":133},"14792":{"m":134,"g":133},"15142":{"m":134,"g":133},"15139":{"m":134,"g":133},"15101":{"m":134,"g":133},"14938":{"m":134,"g":133},"15134":{"m":134,"g":133},"15058":{"m":134,"g":133},"15086":{"m":134,"g":1
33},"15099":{"m":134,"g":133},"15098":{"m":134,"g":133},"14953":{"m":134,"g":133},"15052":{"m":134,"g":133},"15044":{"m":134,"g":133},"15125":{"m":134,"g":133},"15110":{"m":134,"g":133},"14791":{"m":134,"g":133},"15113":{"m":134,"g":133},"15117":{"m":134,"g":133},"15124":{"m":134,"g":133},"15121":{"m":134,"g":133},"14874":{"m":134,"g":133},"15106":{"m":134,"g":133},"15090":{"m":134,"g":133},"15114":{"m":134,"g":133},"12263":{"m":134,"g":133},"13969":{"m":134,"g":133},"14961":{"m":134,"g":133},"15062":{"m":134,"g":133},"15053":{"m":134,"g":133},"14881":{"m":134,"g":133},"15108":{"m":134,"g":133},"15048":{"m":134,"g":133},"14294":{"m":134,"g":133},"15084":{"m":134,"g":133},"14969":{"m":134,"g":133},"15096":{"m":134,"g":133},"15061":{"m":134,"g":133},"15095":{"m":134,"g":133},"15094":{"m":134,"g":133},"15093":{"m":134,"g":133},"14943":{"m":134,"g":133},"15087":{"m":134,"g":133},"14993":{"m":134,"g":133},"14956":{"m":134,"g":133},"15056":{"m":134,"g":133},"15066":{"m":134,"g":133},"15085":{"m":134,"g":133},"15064":{"m":134,"g":133},"14935":{"m":134,"g":133},"15065":{"m":134,"g":133},"14998":{"m":134,"g":133},"14201":{"m":134,"g":133},"14940":{"m":134,"g":133},"15054":{"m":134,"g":133},"15055":{"m":134,"g":133},"15080":{"m":134,"g":133},"15079":{"m":134,"g":133},"14742":{"m":134,"g":133},"15074":{"m":134,"g":133},"15072":{"m":134,"g":133},"13740":{"m":134,"g":133},"15069":{"m":134,"g":133},"15060":{"m":134,"g":133},"15059":{"m":134,"g":133},"14422":{"m":134,"g":133},"14855":{"m":134,"g":133},"14423":{"m":134,"g":133},"15027":{"m":134,"g":133},"13989":{"m":134,"g":133},"14924":{"m":134,"g":133},"14659":{"m":134,"g":133},"15002":{"m":134,"g":133},"15034":{"m":134,"g":133},"14485":{"m":134,"g":133},"13876":{"m":134,"g":133},"15037":{"m":134,"g":133},"15036":{"m":134,"g":133},"15032":{"m":134,"g":133},"15031":{"m":134,"g":133},"15030":{"m":134,"g":133},"15028":{"m":134,"g":133},"15035":{"m":134,"g":133},"14939":{"m":134,"g":133},"15024":{"m":134,"g":133},"15033":{"m":134,"g"
:133},"15009":{"m":134,"g":133},"14957":{"m":134,"g":133},"14955":{"m":134,"g":133},"15015":{"m":134,"g":133},"15023":{"m":134,"g":133},"15021":{"m":134,"g":133},"14992":{"m":134,"g":133},"14976":{"m":134,"g":133},"14966":{"m":134,"g":133},"14933":{"m":134,"g":133},"14795":{"m":134,"g":133},"15020":{"m":134,"g":133},"15010":{"m":134,"g":133},"15014":{"m":134,"g":133},"14869":{"m":134,"g":133},"14989":{"m":134,"g":133},"14958":{"m":134,"g":133},"14951":{"m":134,"g":133},"15004":{"m":134,"g":133},"15001":{"m":134,"g":133},"15000":{"m":134,"g":133},"14999":{"m":134,"g":133},"14996":{"m":134,"g":133},"14995":{"m":134,"g":133},"14987":{"m":134,"g":133},"14945":{"m":134,"g":133},"14985":{"m":134,"g":133},"14534":{"m":134,"g":133},"14916":{"m":134,"g":133},"14959":{"m":134,"g":133},"14937":{"m":134,"g":133},"13798":{"m":134,"g":133},"14572":{"m":134,"g":133},"14949":{"m":134,"g":133},"14842":{"m":134,"g":133},"13699":{"m":134,"g":133},"14944":{"m":134,"g":133},"14888":{"m":134,"g":133},"14893":{"m":134,"g":133},"14892":{"m":134,"g":133},"14941":{"m":134,"g":133},"14894":{"m":134,"g":133},"14839":{"m":134,"g":133},"11852":{"m":134,"g":133},"14870":{"m":134,"g":133},"14358":{"m":134,"g":133},"14923":{"m":134,"g":133},"14891":{"m":134,"g":133},"14909":{"m":134,"g":133},"14467":{"m":134,"g":133},"14927":{"m":134,"g":133},"14931":{"m":134,"g":133},"13730":{"m":134,"g":133},"14932":{"m":134,"g":133},"14748":{"m":134,"g":133},"14525":{"m":134,"g":133},"14771":{"m":134,"g":133},"14074":{"m":134,"g":133},"14922":{"m":134,"g":133},"13125":{"m":134,"g":133},"14921":{"m":134,"g":133},"14928":{"m":134,"g":133},"14925":{"m":134,"g":133},"17458":"m136","17569":"m136","17591":"m136","17553":"m136","15859":"m136","17460":"m136","17474":"m136","17541":"m136","17518":"m136","17539":"m136","16366":"m136","17514":"m136","17536":"m136","17529":"m136","17534":"m136","17442":"m136","17493":"m136","17528":"m136","17524":"m136","16927":"m136","17486":"m136","17519":"m136","17416":"m136","17490":"m1
36","17517":"m136","17394":"m136","17498":"m136","16034":"m136","17510":"m136","17166":"m136","16919":"m136","17108":"m136","17417":"m136","16670":"m136","17355":"m136","17399":"m136","17400":"m136","17397":"m136","17372":"m136","16396":"m136","17457":"m136","17462":"m136","17466":"m136","17425":"m136","17465":"m136","17452":"m136","17386":"m136","17293":"m136","17455":"m136","17251":"m136","17444":"m136","17443":"m136","17439":"m136","17290":"m136","17291":"m136","17313":"m136","17436":"m136","17428":"m136","17289":"m136","17334":"m136","17403":"m136","17429":"m136","17247":"m136","17205":"m136","17288":"m136","17419":"m136","17414":"m136","17409":"m136","17358":"m136","11657":"m136","17382":"m136","17160":"m136","17179":"m136","17327":"m136","17385":"m136","17302":"m136","17339":"m136","15325":"m136","17364":"m136","16880":"m136","17378":"m136","17376":"m136","17370":"m136","17043":"m136","17088":"m136","17336":"m136","17367":"m136","17366":"m136","17363":"m136","17329":"m136","17049":"m136","17345":"m136","17116":"m136","17142":"m136","17220":"m136","16567":"m136","17305":"m136","17309":"m136","17158":"m136","17177":"m136","17332":"m136","17238":"m136","17245":"m136","17241":"m136","16412":"m136","17325":"m136","16744":"m136","17317":"m136","17319":"m136","16961":"m136","15347":"m136","17315":"m136","17306":"m136","15512":"m136","17308":"m136","17235":"m136","15631":"m136","14197":"m136","16649":"m136","17296":"m136","17212":"m136","16534":"m136","17295":"m136","17236":"m136","17281":"m136","16974":"m136","17287":"m136","17182":"m136","17264":"m136","16152":"m136","17261":"m136","17250":"m136","17234":"m136","17225":"m136","17256":"m136","17257":"m136","17048":"m136","14883":"m136","17191":"m136","17252":"m136","16561":"m136","17248":"m136","17249":"m136","16879":"m136","17045":"m136","17047":"m136","17044":"m136","16951":"m136","17242":"m136","16354":"m136","16817":"m136","16824":"m136","17232":"m136","17230":"m136","17233":"m136","17165":"m136","16925":"m136","
17187":"m136","17217":"m136","16842":"m136","13672":"m136","15551":"m136","17133":"m136","17016":"m136","17200":"m136","17038":"m136","13216":"m136","17143":"m136","17105":"m136","17184":"m136","17173":"m136","17051":"m136","16934":"m136","17186":"m136","16121":"m136","17180":"m136","15513":"m136","17041":"m136","14565":"m136","17100":"m136","14579":"m136","17178":"m136","16882":"m136","16826":"m136","16369":"m136","17176":"m136","17174":"m136","11028":"m136","15455":"m136","17170":"m136","10598":"m136","17091":"m136","17168":"m136","17167":"m136","16568":"m136","15789":"m136","17163":"m136","17099":"m136","17126":"m136","16965":"m136","17111":"m136","17113":"m136","17020":"m136","17064":"m136","17103":"m136","17092":"m136","16949":"m136","17056":"m136","17075":"m136","17061":"m136","17028":"m136","17101":"m136","17052":"m136","12497":"m136","14504":"m136","17087":"m136","16278":"m136","16971":"m136","16976":"m136","16403":"m136","17058":"m136","17077":"m136","16264":"m136","7392":"m136","16732":"m136","16899":"m136","16569":"m136","17013":"m136","16941":"m136","16962":"m136","16888":"m136","15908":"m136","16989":"m136","17046":"m136","17030":"m136","17027":"m136","17054":"m136","16898":"m136","16994":"m136","16841":"m136","17002":"m136","14108":"m136","17022":"m136","17005":"m136","16982":"m136","16894":"m136","16953":"m136","17019":"m136","16886":"m136","16259":"m136","16924":"m136","16986":"m136","16790":"m136","16935":"m136","16884":"m136","16967":"m136","16766":"m136","16564":"m136","16767":"m136","16648":"m136","16252":"m136","16940":"m136","15853":"m136","16765":"m136","16978":"m136","16981":"m136","16980":"m136","16979":"m136","16977":"m136","16973":"m136","16970":"m136","16192":"m136","16348":"m136","16963":"m136","15271":"m136","16933":"m136","16480":"m136","16922":"m136","16916":"m136","16912":"m136","16019":"m136","16559":"m136","16677":"m136","16930":"m136","16932":"m136","16536":"m136","16850":"m136","16915":"m136","16896":"m136","16820":"m136","15268"
:"m136","12909":"m136","16908":"m136","16906":"m136","16851":"m136","14867":"m136","16895":"m136","16300":"m136","16864":"m136","16889":"m136","15182":"m136","16876":"m136","16258":"m136","16835":"m136","16345":"m136","16878":"m136","16867":"m136","16273":"m136","16757":"m136","16865":"m136","16877":"m136","16863":"m136","16667":"m136","16788":"m136","16397":"m136","16737":"m136","16872":"m136","16871":"m136","16870":"m136","15790":"m136","15927":"m136","15227":"m136","16226":"m136","16844":"m136","16838":"m136","16849":"m136","16333":"m136","16779":"m136","16854":"m136","16862":"m136","16572":"m136","16805":"m136","16723":"m136","13715":"m136","16853":"m136","16637":"m136","16852":"m136","16847":"m136","16848":"m136","16845":"m136","16698":"m136","16825":"m136","16837":"m136","16840":"m136","16839":"m136","16831":"m136","16830":"m136","16679":"m136","16810":"m136","16783":"m136","16587":"m136","16821":"m136","16804":"m136","15753":"m136","13947":"m136","16275":"m136","16814":"m136","16813":"m136","16812":"m136","16811":"m136","14655":"m136","16325":"m136","16708":"m136","16760":"m136","16792":"m136","11349":"m136","16458":"m136","16743":"m136","16768":"m136","16721":"m136","16778":"m136","16774":"m136","16686":"m136","16378":"m136","16588":"m136","16446":"m136","16773":"m136","16380":"m136","16254":{"m":136,"g":135},"16720":{"m":136,"g":135},"16560":{"m":136,"g":135},"16764":{"m":136,"g":135},"16763":{"m":136,"g":135},"16759":{"m":136,"g":135},"16715":{"m":136,"g":135},"16756":{"m":136,"g":135},"16754":{"m":136,"g":135},"16752":{"m":136,"g":135},"16751":{"m":136,"g":135},"16749":{"m":136,"g":135},"16748":{"m":136,"g":135},"16675":{"m":136,"g":135},"16746":{"m":136,"g":135},"16409":{"m":136,"g":135},"13681":{"m":136,"g":135},"16745":{"m":136,"g":135},"16741":{"m":136,"g":135},"16014":{"m":136,"g":135},"16739":{"m":136,"g":135},"16719":{"m":136,"g":135},"16668":{"m":136,"g":135},"16738":{"m":136,"g":135},"16709":{"m":136,"g":135},"16735":{"m":136,"g":135},"16622":{"m
":136,"g":135},"16635":{"m":136,"g":135},"16733":{"m":136,"g":135},"16730":{"m":136,"g":135},"16729":{"m":136,"g":135},"16519":{"m":136,"g":135},"16706":{"m":136,"g":135},"16115":{"m":136,"g":135},"16634":{"m":136,"g":135},"16535":{"m":136,"g":135},"16452":{"m":136,"g":135},"16533":{"m":136,"g":135},"16531":{"m":136,"g":135},"16529":{"m":136,"g":135},"15627":{"m":136,"g":135},"16608":{"m":136,"g":135},"16693":{"m":136,"g":135},"16155":{"m":136,"g":135},"16505":{"m":136,"g":135},"16697":{"m":136,"g":135},"16555":{"m":136,"g":135},"16669":{"m":136,"g":135},"16625":{"m":136,"g":135},"16695":{"m":136,"g":135},"16678":{"m":136,"g":135},"16692":{"m":136,"g":135},"16629":{"m":136,"g":135},"16680":{"m":136,"g":135},"16445":{"m":136,"g":135},"16306":{"m":136,"g":135},"16652":{"m":136,"g":135},"15343":{"m":136,"g":135},"16095":{"m":136,"g":135},"15151":{"m":136,"g":135},"16681":{"m":136,"g":135},"16674":{"m":136,"g":135},"16457":{"m":136,"g":135},"16426":{"m":136,"g":135},"16676":{"m":136,"g":135},"16658":{"m":136,"g":135},"16672":{"m":136,"g":135},"16633":{"m":136,"g":135},"16671":{"m":136,"g":135},"15712":{"m":136,"g":135},"16660":{"m":136,"g":135},"16664":{"m":136,"g":135},"16657":{"m":136,"g":135},"16661":{"m":136,"g":135},"16618":{"m":136,"g":135},"16179":{"m":136,"g":135},"16599":{"m":136,"g":135},"15238":{"m":136,"g":135},"16532":{"m":136,"g":135},"16654":{"m":136,"g":135},"16651":{"m":136,"g":135},"16203":{"m":136,"g":135},"13602":{"m":136,"g":135},"16620":{"m":136,"g":135},"16514":{"m":136,"g":135},"16631":{"m":136,"g":135},"16630":{"m":136,"g":135},"16619":{"m":136,"g":135},"16527":{"m":136,"g":135},"15938":{"m":136,"g":135},"16582":{"m":136,"g":135},"16418":{"m":136,"g":135},"16459":{"m":136,"g":135},"16465":{"m":136,"g":135},"16468":{"m":136,"g":135},"16566":{"m":136,"g":135},"16617":{"m":136,"g":135},"16597":{"m":136,"g":135},"16504":{"m":136,"g":135},"16593":{"m":136,"g":135},"16118":{"m":136,"g":135},"15439":{"m":136,"g":135},"16609":{"m":136,"g":135},"16453":{
"m":136,"g":135},"16499":{"m":136,"g":135},"16606":{"m":136,"g":135},"16603":{"m":136,"g":135},"16576":{"m":136,"g":135},"16596":{"m":136,"g":135},"16598":{"m":136,"g":135},"16601":{"m":136,"g":135},"16594":{"m":136,"g":135},"16442":{"m":136,"g":135},"16422":{"m":136,"g":135},"16223":{"m":136,"g":135},"16425":{"m":136,"g":135},"16589":{"m":136,"g":135},"16592":{"m":136,"g":135},"16583":{"m":136,"g":135},"16585":{"m":136,"g":135},"16591":{"m":136,"g":135},"16326":{"m":136,"g":135},"16523":{"m":136,"g":135},"16420":{"m":136,"g":135},"16548":{"m":136,"g":135},"16549":{"m":136,"g":135},"16575":{"m":136,"g":135},"16507":{"m":136,"g":135},"16570":{"m":136,"g":135},"16463":{"m":136,"g":135},"14215":{"m":136,"g":135},"13518":{"m":136,"g":135},"16455":{"m":136,"g":135},"16496":{"m":136,"g":135},"16540":{"m":136,"g":135},"16201":{"m":136,"g":135},"15456":{"m":136,"g":135},"16421":{"m":136,"g":135},"16466":{"m":136,"g":135},"16539":{"m":136,"g":135},"16417":{"m":136,"g":135},"16502":{"m":136,"g":135},"16415":{"m":136,"g":135},"16467":{"m":136,"g":135},"16399":{"m":136,"g":135},"14112":{"m":136,"g":135},"15492":{"m":136,"g":135},"16538":{"m":136,"g":135},"6135":{"m":136,"g":135},"16528":{"m":136,"g":135},"16416":{"m":136,"g":135},"16419":{"m":136,"g":135},"16434":{"m":136,"g":135},"16525":{"m":136,"g":135},"16520":{"m":136,"g":135},"16086":{"m":136,"g":135},"16524":{"m":136,"g":135},"15677":{"m":136,"g":135},"16477":{"m":136,"g":135},"16356":{"m":136,"g":135},"16513":{"m":136,"g":135},"15663":{"m":136,"g":135},"16516":{"m":136,"g":135},"16482":{"m":136,"g":135},"16511":{"m":136,"g":135},"16280":{"m":136,"g":135},"16509":{"m":136,"g":135},"16508":{"m":136,"g":135},"16492":{"m":136,"g":135},"16469":{"m":136,"g":135},"16138":{"m":136,"g":135},"16481":{"m":136,"g":135},"14474":{"m":136,"g":135},"16367":{"m":136,"g":135},"16475":{"m":136,"g":135},"16478":{"m":136,"g":135},"16479":{"m":136,"g":135},"16474":{"m":136,"g":135},"16447":{"m":136,"g":135},"16471":{"m":136,"g":135},"16456":
{"m":136,"g":135},"16382":{"m":136,"g":135},"14051":{"m":136,"g":135},"16042":{"m":136,"g":135},"16424":{"m":136,"g":135},"16460":{"m":136,"g":135},"16464":{"m":136,"g":135},"16454":{"m":136,"g":135},"16088":{"m":136,"g":135},"16387":{"m":136,"g":135},"16386":{"m":136,"g":135},"16451":{"m":136,"g":135},"16450":{"m":136,"g":135},"16448":{"m":136,"g":135},"16389":{"m":136,"g":135},"16180":{"m":136,"g":135},"16441":{"m":136,"g":135},"16444":{"m":136,"g":135},"16437":{"m":136,"g":135},"16310":{"m":136,"g":135},"16435":{"m":136,"g":135},"16433":{"m":136,"g":135},"16432":{"m":136,"g":135},"16375":{"m":136,"g":135},"16330":{"m":136,"g":135},"15836":{"m":136,"g":135},"16430":{"m":136,"g":135},"16414":{"m":136,"g":135},"16429":{"m":136,"g":135},"16427":{"m":136,"g":135},"16413":{"m":136,"g":135},"16408":{"m":136,"g":135},"16411":{"m":136,"g":135},"16334":{"m":136,"g":135},"16127":{"m":136,"g":135},"16323":{"m":136,"g":135},"14066":{"m":136,"g":135},"16405":{"m":136,"g":135},"16406":{"m":136,"g":135},"16401":{"m":136,"g":135},"16400":{"m":136,"g":135},"16347":{"m":136,"g":135},"16349":{"m":136,"g":135},"16390":{"m":136,"g":135},"16392":{"m":136,"g":135},"16374":{"m":136,"g":135},"16391":{"m":136,"g":135},"16381":{"m":136,"g":135},"16376":{"m":136,"g":135},"16373":{"m":136,"g":135},"16359":{"m":136,"g":135},"16268":{"m":136,"g":135},"16365":{"m":136,"g":135},"16368":{"m":136,"g":135},"16363":{"m":136,"g":135},"16358":{"m":136,"g":135},"16343":{"m":136,"g":135},"16357":{"m":136,"g":135},"16355":{"m":136,"g":135},"16352":{"m":136,"g":135},"16064":{"m":136,"g":135},"15434":{"m":136,"g":135},"16353":{"m":136,"g":135},"16341":{"m":136,"g":135},"16351":{"m":136,"g":135},"16269":{"m":136,"g":135},"16350":{"m":136,"g":135},"16340":{"m":136,"g":135},"16339":{"m":136,"g":135},"16335":{"m":136,"g":135},"16344":{"m":136,"g":135},"15941":{"m":136,"g":135},"16342":{"m":136,"g":135},"16164":{"m":136,"g":135},"16304":{"m":136,"g":135},"16303":{"m":136,"g":135},"16338":{"m":136,"g":135},"16219
":{"m":136,"g":135},"16337":{"m":136,"g":135},"15640":{"m":136,"g":135},"16311":{"m":136,"g":135},"16332":{"m":136,"g":135},"15175":{"m":136,"g":135},"16328":{"m":136,"g":135},"16009":{"m":136,"g":135},"13592":{"m":136,"g":135},"16272":{"m":136,"g":135},"16283":{"m":136,"g":135},"16257":{"m":136,"g":135},"16177":{"m":136,"g":135},"15560":{"m":136,"g":135},"16324":{"m":136,"g":135},"16317":{"m":136,"g":135},"16296":{"m":136,"g":135},"16321":{"m":136,"g":135},"16316":{"m":136,"g":135},"16313":{"m":136,"g":135},"16318":{"m":136,"g":135},"16320":{"m":136,"g":135},"16319":{"m":136,"g":135},"16314":{"m":136,"g":135},"16315":{"m":136,"g":135},"16312":{"m":136,"g":135},"16287":{"m":136,"g":135},"16277":{"m":136,"g":135},"16308":{"m":136,"g":135},"16092":{"m":136,"g":135},"16128":{"m":136,"g":135},"16305":{"m":136,"g":135},"16301":{"m":136,"g":135},"13959":{"m":136,"g":135},"16248":{"m":136,"g":135},"16292":{"m":136,"g":135},"16298":{"m":136,"g":135},"16227":{"m":136,"g":135},"16213":{"m":136,"g":135},"16267":{"m":136,"g":135},"16144":{"m":136,"g":135},"16222":{"m":136,"g":135},"15345":{"m":136,"g":135},"16270":{"m":136,"g":135},"16285":{"m":136,"g":135},"14636":{"m":136,"g":135},"16263":{"m":136,"g":135},"16265":{"m":136,"g":135},"15995":{"m":136,"g":135},"16162":{"m":136,"g":135},"16262":{"m":136,"g":135},"16261":{"m":136,"g":135},"16260":{"m":136,"g":135},"16251":{"m":136,"g":135},"16247":{"m":136,"g":135},"16250":{"m":136,"g":135},"16243":{"m":136,"g":135},"14085":{"m":136,"g":135},"16171":{"m":136,"g":135},"16245":{"m":136,"g":135},"16238":{"m":136,"g":135},"15814":{"m":136,"g":135},"16239":{"m":136,"g":135},"16240":{"m":136,"g":135},"16236":{"m":136,"g":135},"16233":{"m":136,"g":135},"16202":{"m":136,"g":135},"16228":{"m":136,"g":135},"16018":{"m":136,"g":135},"16195":{"m":136,"g":135},"16111":{"m":136,"g":135},"13394":{"m":136,"g":135},"16204":{"m":136,"g":135},"16214":{"m":136,"g":135},"16221":{"m":136,"g":135},"16178":{"m":136,"g":135},"16187":{"m":136,"g":135},"162
09":{"m":136,"g":135},"15597":{"m":136,"g":135},"16117":{"m":136,"g":135},"15878":{"m":136,"g":135},"16200":{"m":136,"g":135},"15946":{"m":136,"g":135},"16198":{"m":136,"g":135},"16172":{"m":136,"g":135},"15575":{"m":136,"g":135},"16174":{"m":136,"g":135},"16156":{"m":136,"g":135},"15754":{"m":136,"g":135},"14920":{"m":136,"g":135},"16050":{"m":136,"g":135},"16181":{"m":136,"g":135},"16183":{"m":136,"g":135},"16182":{"m":136,"g":135},"14416":{"m":136,"g":135},"16168":{"m":136,"g":135},"16175":{"m":136,"g":135},"16166":{"m":136,"g":135},"16161":{"m":136,"g":135},"16163":{"m":136,"g":135},"16165":{"m":136,"g":135},"16110":{"m":136,"g":135},"16173":{"m":136,"g":135},"16159":{"m":136,"g":135},"16150":{"m":136,"g":135},"16160":{"m":136,"g":135},"16158":{"m":136,"g":135},"16149":{"m":136,"g":135},"16124":{"m":136,"g":135},"15730":{"m":136,"g":135},"13978":{"m":136,"g":135},"12625":{"m":136,"g":135},"16154":{"m":136,"g":135},"18298":"m137","18111":"m137"},"c":{"2ccd9fd8":{"m":0,"g":94},"46b7ea7c":{"m":0,"g":94},"70359bf3":{"m":0,"g":94},"01ca82d7":{"m":0,"g":94},"4bd8233f":{"m":0,"g":94},"08ab2a16":{"m":0,"g":94},"f652494d":{"m":0,"g":94},"30720e73":{"m":0,"g":94},"331848de":{"m":0,"g":94},"93eeb543":{"m":0,"g":94},"ead5b39f":{"m":0,"g":94},"22085081":{"m":0,"g":94},"f6d40df0":{"m":0,"g":94},"22ec7bc2":{"m":1,"g":94},"c0454b32":{"m":1,"g":94},"8024fc5e":{"m":1,"g":94},"70528762":{"m":1,"g":94},"71d30d6d":{"m":1,"g":94},"f9d72381":{"m":1,"g":94},"bf51ddc6":{"m":1,"g":94},"fd7c4792":{"m":1,"g":94},"c4707f1b":{"m":1,"g":94},"ffe4aaee":{"m":1,"g":94},"5b27a1dc":{"m":1,"g":94},"e71d4ab3":{"m":1,"g":94},"fbf42263":{"m":1,"g":94},"cc3ada98":{"m":2,"g":94},"a837166e":{"m":2,"g":94},"11f3cca6":{"m":2,"g":94},"ca13f3b8":{"m":2,"g":94},"0b2efc2a":{"m":2,"g":94},"f30abd09":{"m":2,"g":94},"40ab1f01":{"m":2,"g":94},"199e82a1":{"m":2,"g":94},"23471f9a":{"m":2,"g":94},"61d4c939":{"m":2,"g":94},"98a3e8ef":{"m":2,"g":94},"2b079f89":{"m":2,"g":94},"05b4c398":{"m":2,"g":94},"dafafe5b":{"m":2,
"g":94},"b240f751":{"m":2,"g":94},"501f9444":{"m":2,"g":94},"723f0421":{"m":3,"g":94},"585eabab":{"m":3,"g":94},"c70b3cfa":{"m":4,"g":94},"489796c7":{"m":4,"g":94},"fa7a696d":{"m":4,"g":94},"bef0b359":{"m":4,"g":94},"c6576e82":{"m":4,"g":94},"99258181":{"m":4,"g":94},"3de54a1b":{"m":4,"g":94},"7358fa64":{"m":4,"g":94},"9a16fea0":{"m":4,"g":94},"9e037c82":{"m":4,"g":94},"9076386d":{"m":4,"g":94},"959c4174":{"m":4,"g":94},"94e05770":{"m":4,"g":94},"63e97e5e":{"m":4,"g":94},"e08bca28":{"m":4,"g":94},"cd3ccb2e":{"m":4,"g":94},"3f5c2f4c":{"m":4,"g":94},"007eeb4e":{"m":4,"g":94},"e8f2b155":{"m":4,"g":94},"6dceab4d":{"m":5,"g":94},"a49dc52b":{"m":6,"g":94},"873d0e85":{"m":6,"g":94},"1d0fbe8e":{"m":6,"g":94},"97aa9b32":{"m":6,"g":94},"06175286":{"m":6,"g":94},"4ea92f83":{"m":6,"g":94},"6b0af285":{"m":6,"g":94},"6f560c76":{"m":6,"g":94},"cd687233":{"m":6,"g":94},"81561f8e":{"m":6,"g":94},"3a581e99":{"m":6,"g":94},"0147f940":{"m":6,"g":94},"23950056":{"m":6,"g":94},"93414c82":{"m":6,"g":94},"ed7c7eca":{"m":6,"g":94},"0c457bae":{"m":6,"g":94},"d3fc86a4":{"m":6,"g":94},"01ee0fbc":{"m":6,"g":94},"711d3435":{"m":6,"g":94},"f6bfe3aa":{"m":7,"g":94},"e095b162":{"m":7,"g":94},"67be11c7":{"m":7,"g":94},"cd8c3ccd":{"m":7,"g":94},"9c121f2a":{"m":7,"g":94},"03e04b23":{"m":7,"g":94},"86442530":{"m":7,"g":94},"79cb018e":{"m":7,"g":94},"c7af9f73":{"m":7,"g":94},"876db8dc":{"m":7,"g":94},"ad82bac6":{"m":7,"g":94},"71b54eea":{"m":7,"g":94},"74b3bfaa":{"m":7,"g":94},"4a634cf6":{"m":7,"g":94},"624b21e7":{"m":8,"g":94},"c51020cf":{"m":8,"g":94},"50afed4e":{"m":8,"g":94},"4d303c4f":{"m":8,"g":94},"37b42297":{"m":8,"g":94},"cba50273":{"m":8,"g":94},"a6aa46dd":{"m":8,"g":94},"405f26b0":{"m":8,"g":94},"b1a3a454":{"m":8,"g":94},"79e6b84b":{"m":8,"g":94},"26c34941":{"m":8,"g":94},"cb8e1982":{"m":8,"g":94},"23f05005":{"m":8,"g":94},"a7334aee":{"m":8,"g":94},"ee1df26a":{"m":8,"g":94},"3ae78a09":{"m":8,"g":94},"ccbe1e67":{"m":8,"g":94},"e2bf732b":{"m":8,"g":94},"322421fa":{"m":8,"g":94},"8ff870bf":{"m":
8,"g":94},"26f0bedc":{"m":8,"g":94},"82fa69b3":{"m":8,"g":94},"8fb7459e":{"m":8,"g":94},"bb3a3b66":{"m":8,"g":94},"45d6592d":{"m":8,"g":94},"4aa5dd2c":{"m":9,"g":94},"13662fd5":{"m":9,"g":94},"d5ae2eba":{"m":9,"g":94},"1b355479":{"m":9,"g":94},"faba293a":{"m":9,"g":94},"89885b31":{"m":9,"g":94},"64fe3115":{"m":9,"g":94},"a7ace9c8":{"m":9,"g":94},"a833de05":{"m":9,"g":94},"30d67b2b":{"m":9,"g":94},"b0b722ee":{"m":9,"g":94},"01b07ea3":{"m":9,"g":94},"dfb13ac4":{"m":9,"g":94},"ec90b9c0":{"m":9,"g":94},"9759d927":{"m":9,"g":94},"8d0a7fae":{"m":9,"g":94},"c4e9ebe3":{"m":9,"g":94},"3c2c5869":{"m":9,"g":94},"4cb9aaed":{"m":9,"g":94},"9de9a468":{"m":9,"g":94},"ce3b2610":{"m":9,"g":94},"91e03633":{"m":9,"g":94},"2a74748b":{"m":9,"g":94},"63ba630b":{"m":9,"g":94},"6493256b":{"m":9,"g":94},"06008bc2":{"m":9,"g":94},"bb824da4":{"m":9,"g":94},"93121324":{"m":9,"g":94},"c97fdae4":{"m":9,"g":94},"51104cd4":{"m":10,"g":94},"e2b2f0a2":{"m":10,"g":94},"b57abe16":{"m":10,"g":94},"e57f0792":{"m":10,"g":94},"08df63a6":{"m":10,"g":94},"77835756":{"m":10,"g":94},"ed315799":{"m":10,"g":94},"92e2d74f":{"m":10,"g":94},"d9b3b018":{"m":10,"g":94},"745ea007":{"m":10,"g":94},"ad1dd746":{"m":10,"g":94},"eb4308c4":{"m":10,"g":94},"b2eb0805":{"m":10,"g":94},"72bb3443":{"m":11,"g":94},"2d580e7a":{"m":11,"g":94},"3fc97f67":{"m":11,"g":94},"abc548c7":{"m":11,"g":94},"aee4f523":{"m":11,"g":94},"7023f413":{"m":11,"g":94},"09deb20d":{"m":11,"g":94},"33b242df":{"m":11,"g":94},"a511a2d0":{"m":11,"g":94},"6ec65f45":{"m":11,"g":94},"e2c31fca":{"m":11,"g":94},"d5de20a3":{"m":11,"g":94},"4a1c6ae2":{"m":11,"g":94},"14522e6a":{"m":11,"g":94},"183df472":{"m":11,"g":94},"5c5aba59":{"m":11,"g":94},"ba67101f":{"m":11,"g":94},"95c4e0df":{"m":11,"g":94},"19818b9c":{"m":11,"g":94},"9216b106":{"m":11,"g":94},"da19434c":{"m":11,"g":94},"150d7020":{"m":11,"g":94},"9acc6e35":{"m":11,"g":94},"cf9d8efd":{"m":11,"g":94},"1bf1cf19":{"m":11,"g":94},"e822e590":{"m":11,"g":94},"ca4f1ab8":{"m":11,"g":94},"2b6d9991":{"m":11,"g":94}
,"65501a9c":{"m":11,"g":94},"db611066":{"m":11,"g":94},"c93293c5":{"m":11,"g":94},"62b3812b":{"m":11,"g":94},"550a4f78":{"m":11,"g":94},"ff99c38a":{"m":11,"g":94},"c9de3e16":{"m":11,"g":94},"ed27a6b9":{"m":11,"g":94},"463c6632":{"m":11,"g":94},"b0890631":{"m":11,"g":94},"cb389c91":{"m":11,"g":94},"eddaa2b5":{"m":11,"g":94},"2af565b3":{"m":11,"g":94},"3842eba5":{"m":11,"g":94},"24e59f53":{"m":11,"g":94},"75235419":{"m":11,"g":94},"64ee9c03":{"m":11,"g":94},"30d17840":{"m":11,"g":94},"ce216c80":{"m":11,"g":94},"e0ae5d42":{"m":12,"g":94},"32de16ce":{"m":12,"g":94},"0992d85f":{"m":12,"g":94},"5dc55a5f":{"m":12,"g":94},"4231a42f":{"m":12,"g":94},"455c9ccc":{"m":12,"g":94},"39191c85":{"m":12,"g":94},"562b8857":{"m":12,"g":94},"04c0b214":{"m":12,"g":94},"6e09cf6a":{"m":12,"g":94},"e8a2327d":{"m":13,"g":94},"91f93f14":{"m":13,"g":94},"f70f7258":{"m":13,"g":94},"c0ae70c8":{"m":13,"g":94},"87260b7b":{"m":13,"g":94},"651a23ee":{"m":13,"g":94},"bf3e271f":{"m":13,"g":94},"3bc01ac1":{"m":13,"g":94},"9f009261":{"m":13,"g":94},"159cc741":{"m":13,"g":94},"7d1ebc2d":{"m":13,"g":94},"83525a1d":{"m":13,"g":94},"80a33ce8":{"m":13,"g":94},"1a57e416":{"m":13,"g":94},"adc97426":{"m":13,"g":94},"0463f7fb":{"m":13,"g":94},"565d7274":{"m":13,"g":94},"09de730d":{"m":13,"g":94},"55c16436":{"m":13,"g":94},"2b605ab1":{"m":13,"g":94},"947bda73":{"m":13,"g":94},"f06e90c2":{"m":13,"g":94},"2cea6146":{"m":13,"g":94},"44c998fc":{"m":13,"g":94},"3167d8da":{"m":13,"g":94},"0fafc560":{"m":13,"g":94},"19d2135c":{"m":13,"g":94},"ced77c66":{"m":13,"g":94},"8dbdc018":{"m":13,"g":94},"3e684be7":{"m":13,"g":94},"ec380dfd":{"m":13,"g":94},"5b647543":{"m":13,"g":94},"8210ec60":{"m":13,"g":94},"5be9eb8a":{"m":13,"g":94},"c05956e5":{"m":13,"g":94},"d75dc20f":{"m":13,"g":94},"690d162d":{"m":13,"g":94},"664287b2":{"m":13,"g":94},"2f11936f":{"m":14,"g":94},"63fbef98":{"m":14,"g":94},"2a754e57":{"m":14,"g":94},"96c503eb":{"m":14,"g":94},"441cca77":{"m":14,"g":94},"c7709d3a":{"m":14,"g":94},"9380f50f":{"m":14,"g":94},"
95dc093b":{"m":14,"g":94},"d9ac6392":{"m":14,"g":94},"26294b2f":{"m":14,"g":94},"75b31a2a":{"m":14,"g":94},"11616fc6":{"m":14,"g":94},"9ce89bc1":{"m":14,"g":94},"badf3fa0":{"m":14,"g":94},"945aa9be":{"m":14,"g":94},"2e6e62e1":{"m":14,"g":94},"a385ee27":{"m":14,"g":94},"eb1ae6ae":{"m":14,"g":94},"2187f362":{"m":14,"g":94},"9465b668":{"m":14,"g":94},"05471f21":{"m":14,"g":94},"1fa15099":{"m":14,"g":94},"303ef888":{"m":14,"g":94},"92cb93f3":{"m":14,"g":94},"e94e60d6":{"m":14,"g":94},"d2f8bfb2":{"m":14,"g":94},"b7e2f800":{"m":14,"g":94},"09593e9b":{"m":14,"g":94},"53a7ebd8":{"m":14,"g":94},"ad5f04d6":{"m":14,"g":94},"bbec01c9":{"m":14,"g":94},"40e53d65":{"m":14,"g":94},"fb9296f0":{"m":14,"g":94},"1374334d":{"m":14,"g":94},"94aead9e":{"m":14,"g":94},"9c902b19":{"m":14,"g":94},"111991fe":{"m":14,"g":94},"a8c787d2":{"m":14,"g":94},"5f283991":{"m":14,"g":94},"b6667a53":{"m":14,"g":94},"542bc733":{"m":14,"g":94},"f6dbd240":{"m":14,"g":94},"ad872feb":{"m":15,"g":94},"da2e5d65":{"m":15,"g":94},"ce62dc73":{"m":15,"g":94},"02b72586":{"m":15,"g":94},"d557e9f3":{"m":15,"g":94},"740c46a1":{"m":15,"g":94},"b3868722":{"m":15,"g":94},"710f614e":{"m":15,"g":94},"f25b76c0":{"m":15,"g":94},"f4e885b7":{"m":15,"g":94},"0877f1e7":{"m":15,"g":94},"5304b4ef":{"m":15,"g":94},"26908d95":{"m":15,"g":94},"c0982ac5":{"m":15,"g":94},"dc1b8bcf":{"m":15,"g":94},"5a57b8ad":{"m":15,"g":94},"d737da5f":{"m":15,"g":94},"ac113887":{"m":15,"g":94},"dc8cef1d":{"m":15,"g":94},"5d264a90":{"m":16,"g":94},"5949b1ca":{"m":16,"g":94},"0feca02d":{"m":16,"g":94},"10143e1a":{"m":16,"g":94},"65c65776":{"m":16,"g":94},"66581596":{"m":16,"g":94},"396a6924":{"m":16,"g":94},"af4e7910":{"m":16,"g":94},"519e20cf":{"m":16,"g":94},"d9a69029":{"m":16,"g":94},"56f5fc4a":{"m":17,"g":94},"6a2941f4":{"m":17,"g":94},"5ac8b806":{"m":17,"g":94},"bae9541e":{"m":17,"g":94},"a56858ba":{"m":17,"g":94},"564a898a":{"m":17,"g":94},"2b4c6462":{"m":18,"g":94},"f424e76d":{"m":18,"g":94},"490a1f39":{"m":18,"g":94},"06487f12":{"m":18,"g":94},"39
c57317":{"m":18,"g":94},"9592a1f3":{"m":18,"g":94},"35759efa":{"m":18,"g":94},"8f4b1559":{"m":18,"g":94},"e3046ea3":{"m":18,"g":94},"49c5e0ec":{"m":18,"g":94},"ec2150b2":{"m":18,"g":94},"7620cd37":{"m":18,"g":94},"50a53887":{"m":18,"g":94},"11c8efff":{"m":18,"g":94},"e87c7fd5":{"m":18,"g":94},"630479c3":{"m":18,"g":94},"51fda143":{"m":18,"g":94},"dc4e4a6a":{"m":18,"g":94},"2d96da81":{"m":18,"g":94},"c126a6cc":{"m":18,"g":94},"ac971ff6":{"m":18,"g":94},"e1792cca":{"m":18,"g":94},"1b7adbb5":{"m":18,"g":94},"a9ef49c1":{"m":18,"g":94},"21ba3a88":{"m":18,"g":94},"9c5cac24":{"m":18,"g":94},"5960a6e5":{"m":18,"g":94},"b050d928":{"m":18,"g":94},"6a4dc996":{"m":18,"g":94},"d774acad":{"m":18,"g":94},"d93388da":{"m":18,"g":94},"476584cb":{"m":18,"g":94},"abd5385a":{"m":18,"g":94},"3de2f30a":{"m":18,"g":94},"4efcc59d":{"m":18,"g":94},"2e341cd4":{"m":18,"g":94},"a8552cb1":{"m":18,"g":94},"a470e60c":{"m":18,"g":94},"5f90e076":{"m":18,"g":94},"8832ecb1":{"m":18,"g":94},"5ff60eda":{"m":18,"g":94},"c1930022":{"m":18,"g":94},"fe3be159":{"m":18,"g":94},"0aa189f1":{"m":18,"g":94},"f6b29f69":{"m":18,"g":94},"c9ee3d35":{"m":18,"g":94},"41d1f677":{"m":18,"g":94},"444a0244":{"m":19,"g":94},"fa7ccb33":{"m":19,"g":94},"26868443":{"m":19,"g":94},"824a77d0":{"m":19,"g":94},"cf99eab7":{"m":19,"g":94},"9fdea29d":{"m":19,"g":94},"df7c4c19":{"m":19,"g":94},"c3f1aac8":{"m":19,"g":94},"d198791f":{"m":19,"g":94},"c07526e4":{"m":19,"g":94},"7b597475":{"m":19,"g":94},"5303c1ed":{"m":19,"g":94},"65bd1338":{"m":19,"g":94},"eedc12e1":{"m":19,"g":94},"5a4ef2b5":{"m":19,"g":94},"9dab947d":{"m":19,"g":94},"33ee97b0":{"m":19,"g":94},"6a846bb1":{"m":19,"g":94},"0fdb3127":{"m":19,"g":94},"5ad033a0":{"m":19,"g":94},"77e592e8":{"m":19,"g":94},"caaad53b":{"m":19,"g":94},"69d19188":{"m":19,"g":94},"4b4a67f8":{"m":19,"g":94},"0ac94c36":{"m":19,"g":94},"459abad2":{"m":20,"g":94},"30d8e130":{"m":20,"g":94},"08a3bd19":{"m":20,"g":94},"321a963b":{"m":20,"g":94},"e17deb27":{"m":20,"g":94},"2d3ae4e1":{"m":20,"g":94},"75f4
ccb7":{"m":20,"g":94},"83d2b30d":{"m":20,"g":94},"4367f4bb":{"m":20,"g":94},"00e4baa7":{"m":20,"g":94},"4cd64b8e":{"m":20,"g":94},"01d66ae2":{"m":20,"g":94},"a523a3c1":{"m":20,"g":94},"9f94728f":{"m":20,"g":94},"1a491d00":{"m":21,"g":94},"8fbba3de":{"m":21,"g":94},"ae0f6130":{"m":21,"g":94},"60105897":{"m":21,"g":94},"926ac01b":{"m":21,"g":94},"25c881a0":{"m":21,"g":94},"04ec6ba2":{"m":21,"g":94},"d63f13c1":{"m":21,"g":94},"fded6744":{"m":21,"g":94},"6e453940":{"m":21,"g":94},"97e0f7d2":{"m":21,"g":94},"d5146bae":{"m":21,"g":94},"5bd06b45":{"m":22,"g":94},"9a611827":{"m":22,"g":94},"eeb24821":{"m":22,"g":94},"3e455b01":{"m":22,"g":94},"8628ab9c":{"m":22,"g":94},"1b77670f":{"m":22,"g":94},"768e05d0":{"m":22,"g":94},"01fbb11b":{"m":22,"g":94},"05d216da":{"m":22,"g":94},"6b32bb1c":{"m":22,"g":94},"40facad5":{"m":22,"g":94},"da504445":{"m":22,"g":94},"252e0f7b":{"m":22,"g":94},"7f6f2f0f":{"m":22,"g":94},"7802df1e":{"m":22,"g":94},"bc1154c3":{"m":23,"g":94},"752e6430":{"m":23,"g":94},"30db99b3":{"m":23,"g":94},"0a409bd4":{"m":23,"g":94},"e4db4e5b":{"m":23,"g":94},"bbc07c41":{"m":23,"g":94},"a036d419":{"m":23,"g":94},"f95e6617":{"m":23,"g":94},"de854fb5":{"m":23,"g":94},"f64b2a9b":{"m":23,"g":94},"9f95dcc6":{"m":23,"g":94},"0736b270":{"m":23,"g":94},"3fdab919":{"m":23,"g":94},"ba29504b":{"m":23,"g":94},"a72342f1":{"m":23,"g":94},"c3c74bf8":{"m":23,"g":94},"d9fccfef":{"m":23,"g":94},"679ebcbb":{"m":23,"g":94},"1edd4e07":{"m":24,"g":94},"62c673c4":{"m":24,"g":94},"377c5dc9":{"m":24,"g":94},"f52eda35":{"m":24,"g":94},"b579ecf0":{"m":24,"g":94},"e7487b08":{"m":24,"g":94},"ae5c0fc4":{"m":24,"g":94},"a30d5d75":{"m":24,"g":94},"17af39c5":{"m":24,"g":94},"daf593a3":{"m":24,"g":94},"bece265f":{"m":24,"g":94},"cdcbde5f":{"m":24,"g":94},"21e22b9e":{"m":24,"g":94},"a50c8a14":{"m":24,"g":94},"db6089e6":{"m":24,"g":94},"3520f75f":{"m":24,"g":94},"c8e9fed8":{"m":24,"g":94},"084fa54d":{"m":24,"g":94},"eba458bd":{"m":24,"g":94},"3d1cb0af":{"m":24,"g":94},"7d352b4f":{"m":24,"g":94},"870640
15":{"m":24,"g":94},"7cd4f244":{"m":24,"g":94},"98111fbe":{"m":24,"g":94},"2ec39ab7":{"m":24,"g":94},"8f6274c8":{"m":24,"g":94},"325a06c2":{"m":24,"g":94},"79f81629":{"m":24,"g":94},"b688fd85":{"m":24,"g":94},"5bd89924":{"m":24,"g":94},"8d908a93":{"m":24,"g":94},"dd7e8b94":{"m":24,"g":94},"1f013d64":{"m":24,"g":94},"628e1fa7":{"m":24,"g":94},"c71880f8":{"m":24,"g":94},"bcb6611a":{"m":24,"g":94},"fa2aa0db":{"m":24,"g":94},"6a387a69":{"m":24,"g":94},"27f5ce0a":{"m":24,"g":94},"94862579":{"m":24,"g":94},"68e52626":{"m":24,"g":94},"e4d3333c":{"m":25,"g":94},"6f221d4c":{"m":25,"g":94},"aba6f51f":{"m":25,"g":94},"7f6c690b":{"m":25,"g":94},"40e6f513":{"m":25,"g":94},"40756776":{"m":25,"g":94},"9e8d2c7f":{"m":25,"g":94},"c9bff5fc":{"m":25,"g":94},"b04444ac":{"m":25,"g":94},"3d617a21":{"m":25,"g":94},"c020f9ce":{"m":25,"g":94},"ca600e8c":{"m":25,"g":94},"0c0c8137":{"m":25,"g":94},"90286d85":{"m":25,"g":94},"5e7dd984":{"m":25,"g":94},"bc3eaac2":{"m":25,"g":94},"a78d98de":{"m":25,"g":94},"7d5ed7c6":{"m":25,"g":94},"a6c7ebbb":{"m":25,"g":94},"bb0501c0":{"m":25,"g":94},"6b0f2e90":{"m":25,"g":94},"30a9b2ef":{"m":26,"g":94},"3cadecf0":{"m":26,"g":94},"e90e3a50":{"m":26,"g":94},"fbd6b94d":{"m":26,"g":94},"4c8093c8":{"m":26,"g":94},"ae7ee01a":{"m":26,"g":94},"76e59088":{"m":26,"g":94},"12ce3bef":{"m":26,"g":94},"4013a4e1":{"m":26,"g":94},"60340a36":{"m":26,"g":94},"70c78cfb":{"m":26,"g":94},"72b6ea88":{"m":26,"g":94},"b906c015":{"m":27,"g":94},"9319cd13":{"m":27,"g":94},"046c2b33":{"m":27,"g":94},"6b8f66ef":{"m":27,"g":94},"7937a886":{"m":27,"g":94},"2e218b9e":{"m":27,"g":94},"141e8c71":{"m":28,"g":94},"d53dcf9c":{"m":28,"g":94},"bb66cc4c":{"m":28,"g":94},"975adb80":{"m":28,"g":94},"0d4f3a9f":{"m":28,"g":94},"afd411d0":{"m":28,"g":94},"e1eae1fd":{"m":28,"g":94},"f4d9953d":{"m":28,"g":94},"4f005250":{"m":28,"g":94},"995af5a5":{"m":28,"g":94},"53985645":{"m":28,"g":94},"70cc0749":{"m":28,"g":94},"7dd8a7e6":{"m":28,"g":94},"947402c8":{"m":28,"g":94},"8c5382e6":{"m":28,"g":94},"001b0bdd
":{"m":28,"g":94},"dc9d06d8":{"m":29,"g":94},"c31f084c":{"m":29,"g":94},"a01ddd96":{"m":29,"g":94},"7fa54a1a":{"m":29,"g":94},"05abd126":{"m":29,"g":94},"5f6fa04a":{"m":29,"g":94},"58a09708":{"m":29,"g":94},"ff68ae85":{"m":29,"g":94},"795eab6d":{"m":29,"g":94},"41bb1ab1":{"m":29,"g":94},"87e8c090":{"m":29,"g":94},"ad56e684":{"m":29,"g":94},"ffb15744":{"m":29,"g":94},"a9c833d5":{"m":29,"g":94},"94e01151":{"m":29,"g":94},"b216a545":{"m":29,"g":94},"fde83405":{"m":29,"g":94},"fd7926e4":{"m":29,"g":94},"399cad91":{"m":29,"g":94},"0a4f5f9b":{"m":29,"g":94},"3bc99e6f":{"m":29,"g":94},"ebf69964":{"m":29,"g":94},"b0ad0c1b":{"m":30,"g":94},"c877292c":{"m":30,"g":94},"0c1c72a0":{"m":30,"g":94},"41598e0d":{"m":30,"g":94},"89f23a51":{"m":30,"g":94},"cb99ba4f":{"m":30,"g":94},"32f61443":{"m":30,"g":94},"fb1f28cb":{"m":30,"g":94},"fb7421db":{"m":30,"g":94},"14b64930":{"m":30,"g":94},"82076370":{"m":30,"g":94},"7de60345":{"m":30,"g":94},"d84c5e70":{"m":30,"g":94},"d7854120":{"m":30,"g":94},"7b6a5332":{"m":30,"g":94},"4080e822":{"m":30,"g":94},"c245b789":{"m":30,"g":94},"9dae4078":{"m":30,"g":94},"fcc0f5ed":{"m":30,"g":94},"a97df791":{"m":30,"g":94},"33d61356":{"m":30,"g":94},"94752ac8":{"m":30,"g":94},"43fbb6d9":{"m":30,"g":94},"54fb1c80":{"m":30,"g":94},"b68c4c07":{"m":30,"g":94},"e712837d":{"m":30,"g":94},"7599bade":{"m":30,"g":94},"62757db6":{"m":30,"g":94},"73fa2d49":{"m":30,"g":94},"61728884":{"m":30,"g":94},"9cf0a5ba":{"m":30,"g":94},"b16e856f":{"m":30,"g":94},"05c50a82":{"m":30,"g":94},"b568df5d":{"m":30,"g":94},"10bca45b":{"m":30,"g":94},"b91a4cb1":{"m":30,"g":94},"95a28019":{"m":30,"g":94},"e040a245":{"m":30,"g":94},"9f662501":{"m":30,"g":94},"ab787594":{"m":30,"g":94},"228cf475":{"m":30,"g":94},"3a79613c":{"m":30,"g":94},"1ac304ee":{"m":30,"g":94},"20a4f927":{"m":30,"g":94},"0de7c2d0":{"m":30,"g":94},"6ed4e3b8":{"m":30,"g":94},"00023d62":{"m":30,"g":94},"c62d560c":{"m":30,"g":94},"2b8257f3":{"m":30,"g":94},"7623091d":{"m":30,"g":94},"f724f1f1":{"m":30,"g":94},"6db27f7b":
{"m":30,"g":94},"4d929107":{"m":30,"g":94},"fbe0c818":{"m":30,"g":94},"5bd95374":{"m":31,"g":94},"0cb099e2":{"m":31,"g":94},"93d4e354":{"m":31,"g":94},"9195d136":{"m":31,"g":94},"14cb544d":{"m":31,"g":94},"e86b1ccb":{"m":31,"g":94},"8d2d876f":{"m":31,"g":94},"326df4ba":{"m":31,"g":94},"6767e222":{"m":31,"g":94},"73cf6834":{"m":31,"g":94},"1c2b5f52":{"m":31,"g":94},"96a2093e":{"m":31,"g":94},"a34dd86a":{"m":31,"g":94},"67c0d832":{"m":31,"g":94},"a59636bb":{"m":31,"g":94},"fe502432":{"m":31,"g":94},"f14569f6":{"m":31,"g":94},"8f790ac1":{"m":31,"g":94},"616b59f3":{"m":31,"g":94},"c8423ca3":{"m":31,"g":94},"e205527c":{"m":31,"g":94},"0909bb0d":{"m":31,"g":94},"ad3e4f16":{"m":31,"g":94},"312e8492":{"m":31,"g":94},"95f5fbf1":{"m":31,"g":94},"cebd78d8":{"m":31,"g":94},"0076f115":{"m":31,"g":94},"f7fb68d2":{"m":31,"g":94},"396a13e6":{"m":31,"g":94},"65915f9f":{"m":31,"g":94},"162f3ccb":{"m":31,"g":94},"65e89bae":{"m":31,"g":94},"6a38efa8":{"m":31,"g":94},"c5fe11a8":{"m":32,"g":94},"75ce37f4":{"m":32,"g":94},"97589a60":{"m":32,"g":94},"632d506d":{"m":32,"g":94},"3579162a":{"m":32,"g":94},"7514b9f8":{"m":32,"g":94},"158e8f1e":{"m":32,"g":94},"d3efcb39":{"m":32,"g":94},"2c615d12":{"m":32,"g":94},"61bb223e":{"m":32,"g":94},"15f1a49d":{"m":32,"g":94},"308d0240":{"m":32,"g":94},"ab4990e4":{"m":32,"g":94},"90227800":{"m":32,"g":94},"30b4f771":{"m":32,"g":94},"66e7dcaf":{"m":32,"g":94},"bc4c7a35":{"m":32,"g":94},"1cb4da5c":{"m":32,"g":94},"e61d13ac":{"m":32,"g":94},"b20daf98":{"m":32,"g":94},"f6af3a65":{"m":32,"g":94},"c9064e6f":{"m":32,"g":94},"a5b14ad0":{"m":32,"g":94},"5fafcac0":{"m":32,"g":94},"364d3d72":{"m":32,"g":94},"5623826f":{"m":32,"g":94},"83e23c69":{"m":32,"g":94},"ac1b74fa":{"m":32,"g":94},"068e9eae":{"m":32,"g":94},"d6aeb9fa":{"m":32,"g":94},"1fb94599":{"m":32,"g":94},"bea2bb9e":{"m":32,"g":94},"cd10654e":{"m":32,"g":94},"350a8160":{"m":32,"g":94},"6242c399":{"m":32,"g":94},"04707b09":{"m":32,"g":94},"ff2cfdb1":{"m":32,"g":94},"a8ae6403":{"m":32,"g":94},"d8476818":{"
m":32,"g":94},"df191254":{"m":32,"g":94},"b997a18d":{"m":32,"g":94},"d8627ed1":{"m":32,"g":94},"fa13b95d":{"m":32,"g":94},"3c1f5a92":{"m":32,"g":94},"57d0bd91":{"m":32,"g":94},"cdc8d607":{"m":32,"g":94},"9208591f":{"m":32,"g":94},"5d0d40d0":{"m":32,"g":94},"f624f6a6":{"m":32,"g":94},"3694f8f9":{"m":32,"g":94},"5a261bd0":{"m":32,"g":94},"6aa8ad14":{"m":32,"g":94},"26e9c12c":{"m":32,"g":94},"87a0db82":{"m":32,"g":94},"f25f4dfd":{"m":33,"g":94},"184ae1c6":{"m":33,"g":94},"198974cd":{"m":33,"g":94},"6cc38b2b":{"m":33,"g":94},"1ece2cda":{"m":33,"g":94},"c8a9e791":{"m":33,"g":94},"3602692c":{"m":33,"g":94},"909f3436":{"m":33,"g":94},"5ff25cdf":{"m":33,"g":94},"2f1d9283":{"m":33,"g":94},"c61a1b6f":{"m":33,"g":94},"9935f97b":{"m":33,"g":94},"13ac95b8":{"m":34,"g":94},"492143bf":{"m":34,"g":94},"0a97d796":{"m":34,"g":94},"c411f32e":{"m":34,"g":94},"bf53bf51":{"m":34,"g":94},"b1a540ec":{"m":34,"g":94},"66975360":{"m":34,"g":94},"6c498313":{"m":34,"g":94},"99994427":{"m":35,"g":94},"6def9b01":{"m":35,"g":94},"47f20da2":{"m":35,"g":94},"4a9f8ea4":{"m":35,"g":94},"58fa6076":{"m":35,"g":94},"6487ef64":{"m":35,"g":94},"9b080524":{"m":35,"g":94},"32a4141d":{"m":35,"g":94},"08360553":{"m":35,"g":94},"00b19f19":{"m":35,"g":94},"6cb32ef9":{"m":35,"g":94},"761b2ceb":{"m":35,"g":94},"54772f78":{"m":35,"g":94},"1b5d56f7":{"m":35,"g":94},"d134c139":{"m":35,"g":94},"6cc9c525":{"m":35,"g":94},"52cefdbf":{"m":35,"g":94},"51c554d8":{"m":35,"g":94},"79ece2c5":{"m":35,"g":94},"55f5976b":{"m":35,"g":94},"b7f83410":{"m":35,"g":94},"f414352a":{"m":35,"g":94},"a362340b":{"m":35,"g":94},"381dd57b":{"m":35,"g":94},"8153168c":{"m":35,"g":94},"6c34d633":{"m":35,"g":94},"5ab9418f":{"m":36,"g":94},"843e63d8":{"m":36,"g":94},"a63c8275":{"m":36,"g":94},"dc67d976":{"m":36,"g":94},"1e495e08":{"m":36,"g":94},"12cb115d":{"m":36,"g":94},"c500f96b":{"m":36,"g":94},"474317f2":{"m":36,"g":94},"f64eae3a":{"m":36,"g":94},"a5a134f3":{"m":36,"g":94},"2561ed01":{"m":36,"g":94},"90a26be3":{"m":37,"g":94},"1f4b5f77":{"m"
:37,"g":94},"76524b70":{"m":37,"g":94},"3a6e0418":{"m":37,"g":94},"2fa5cec7":{"m":37,"g":94},"27b557ae":{"m":37,"g":94},"93dffd69":{"m":37,"g":94},"2abe4f1c":{"m":37,"g":94},"37963394":{"m":37,"g":94},"899cf5c4":{"m":37,"g":94},"e79f6cd7":{"m":37,"g":94},"9ba1f097":{"m":37,"g":94},"282681b8":{"m":37,"g":94},"58cafe23":{"m":37,"g":94},"9463bc13":{"m":37,"g":94},"e3fc4658":{"m":37,"g":94},"33b54e7c":{"m":37,"g":94},"30b404ce":{"m":37,"g":94},"70b68029":{"m":37,"g":94},"f3d32f88":{"m":37,"g":94},"8779da95":{"m":37,"g":94},"ad0ff62a":{"m":37,"g":94},"9a903a87":{"m":37,"g":94},"68be2f6d":{"m":37,"g":94},"b912de11":{"m":37,"g":94},"eb02c161":{"m":37,"g":94},"71221692":{"m":37,"g":94},"c33d82a2":{"m":37,"g":94},"8234e663":{"m":37,"g":94},"debbdb51":{"m":37,"g":94},"3efa7981":{"m":37,"g":94},"2a71be5e":{"m":37,"g":94},"44621377":{"m":37,"g":94},"fec185ce":{"m":37,"g":94},"c03cece4":{"m":37,"g":94},"15c75e41":{"m":37,"g":94},"224200e3":{"m":37,"g":94},"8c0efa51":{"m":37,"g":94},"144bc70f":{"m":37,"g":94},"46094e0c":{"m":37,"g":94},"3a6e8b6d":{"m":37,"g":94},"fbb4754c":{"m":37,"g":94},"6c7cb903":{"m":37,"g":94},"dff2860a":{"m":37,"g":94},"e72275cf":{"m":37,"g":94},"fec2d122":{"m":37,"g":94},"8d1095db":{"m":37,"g":94},"743007e1":{"m":37,"g":94},"9144ed10":{"m":37,"g":94},"69b3bb9a":{"m":37,"g":94},"689ff588":{"m":37,"g":94},"a7c47e0f":{"m":37,"g":94},"e4d68afc":{"m":37,"g":94},"c9b75917":{"m":37,"g":94},"662ecd93":{"m":37,"g":94},"8e6bdf85":{"m":37,"g":94},"05bea688":{"m":37,"g":94},"ab4a83b2":{"m":37,"g":94},"62f15eea":{"m":37,"g":94},"79794af5":{"m":37,"g":94},"3494b32c":{"m":37,"g":94},"eda7c090":{"m":37,"g":94},"5ce55aee":{"m":38,"g":94},"2d346a57":{"m":38,"g":94},"446ea332":{"m":38,"g":94},"8f527e29":{"m":38,"g":94},"7f24ea95":{"m":38,"g":94},"1acccb36":{"m":38,"g":94},"aa2750be":{"m":38,"g":94},"5e62a6b7":{"m":38,"g":94},"5752f25e":{"m":38,"g":94},"7c162fa9":{"m":38,"g":94},"36078fb2":{"m":38,"g":94},"b3710d2c":{"m":38,"g":94},"c6b6d2e7":{"m":38,"g":94},"82136eb0":{"m":3
9,"g":94},"b8ccaf4d":{"m":39,"g":94},"a68cb201":{"m":39,"g":94},"014982b5":{"m":39,"g":94},"a6db8862":{"m":39,"g":94},"b4408b0d":{"m":39,"g":94},"2cd7e181":{"m":39,"g":94},"37c5899f":{"m":40,"g":94},"f39a0197":{"m":40,"g":94},"3c93187c":{"m":40,"g":94},"fb2d0680":{"m":40,"g":94},"067d8e16":{"m":40,"g":94},"e6692bf4":{"m":40,"g":94},"28b4d8e1":{"m":40,"g":94},"bc068e96":{"m":40,"g":94},"8d4ed42a":{"m":40,"g":94},"2854a5ea":{"m":40,"g":94},"42a2d82b":{"m":40,"g":94},"e4780cf8":{"m":40,"g":94},"39bb49d1":{"m":40,"g":94},"6f3cf129":{"m":40,"g":94},"13f1357e":{"m":40,"g":94},"2a99993c":{"m":40,"g":94},"167591e8":{"m":40,"g":94},"441c22db":{"m":40,"g":94},"ce636ac4":{"m":40,"g":94},"7b69d91b":{"m":41,"g":94},"e8613df0":{"m":41,"g":94},"c5325aba":{"m":41,"g":94},"ebbc42d9":{"m":41,"g":94},"3ff64113":{"m":41,"g":94},"2b302b93":{"m":41,"g":94},"68f8b60d":{"m":41,"g":94},"6a5b352a":{"m":41,"g":94},"565b05f0":{"m":41,"g":94},"b6aad70a":{"m":41,"g":94},"551a3a9d":{"m":41,"g":94},"91877a9f":{"m":41,"g":94},"f7cce751":{"m":41,"g":94},"17e998f1":{"m":41,"g":94},"c98e84c2":{"m":41,"g":94},"9c064bf7":{"m":41,"g":94},"58d1082e":{"m":41,"g":94},"4d086719":{"m":41,"g":94},"9244f27f":{"m":41,"g":94},"2422de51":{"m":41,"g":94},"521f862d":{"m":41,"g":94},"34c32d28":{"m":41,"g":94},"dde8bb16":{"m":41,"g":94},"8ac3ccc0":{"m":41,"g":94},"9b0926ce":{"m":41,"g":94},"1c1bdc76":{"m":41,"g":94},"6bfdb403":{"m":41,"g":94},"f8fb4ce9":{"m":41,"g":94},"5d0ba403":{"m":41,"g":94},"04b262cd":{"m":41,"g":94},"2432ad40":{"m":41,"g":94},"45473d4b":{"m":41,"g":94},"114bbc86":{"m":41,"g":94},"32eb6e96":{"m":41,"g":94},"e0b5dbce":{"m":41,"g":94},"e6852b0d":{"m":41,"g":94},"4ae0969c":{"m":41,"g":94},"317631ca":{"m":41,"g":94},"b5648353":{"m":41,"g":94},"2c7d0a5b":{"m":41,"g":94},"8cdc76f6":{"m":41,"g":94},"f202ed97":{"m":41,"g":94},"100f5b8b":{"m":41,"g":94},"619bb6dd":{"m":41,"g":94},"b88ea90d":{"m":41,"g":94},"99ec439d":{"m":41,"g":94},"0f4fb19b":{"m":41,"g":94},"63ba2f8d":{"m":41,"g":94},"36d5acfc":{"m":41,
"g":94},"3f0fe08d":{"m":41,"g":94},"55b974f9":{"m":41,"g":94},"f86c1e61":{"m":41,"g":94},"acaffd23":{"m":41,"g":94},"04868543":{"m":41,"g":94},"fd9ad817":{"m":41,"g":94},"e165a9fc":{"m":41,"g":94},"4e4459b9":{"m":41,"g":94},"065bb947":{"m":41,"g":94},"f42e9bfb":{"m":41,"g":94},"840c5dbc":{"m":41,"g":94},"63e845d0":{"m":41,"g":94},"9aa6553d":{"m":41,"g":94},"b1e330bc":{"m":41,"g":94},"4353acb4":{"m":41,"g":94},"9ae1db0b":{"m":41,"g":94},"00c7e636":{"m":42,"g":94},"23cc66f7":{"m":42,"g":94},"5d09ca57":{"m":42,"g":94},"81c33274":{"m":42,"g":94},"f13d86f9":{"m":42,"g":94},"aba9eae4":{"m":42,"g":94},"bbd72bfc":{"m":42,"g":94},"b503881b":{"m":42,"g":94},"58093b86":{"m":42,"g":94},"8275049c":{"m":42,"g":94},"5476ccad":{"m":42,"g":94},"b040ed71":{"m":42,"g":94},"c9e66586":{"m":42,"g":94},"e11ab79e":{"m":42,"g":94},"01fdb2f3":{"m":42,"g":94},"c996e8cc":{"m":42,"g":94},"087257ea":{"m":43,"g":94},"736f0402":{"m":43,"g":94},"769bf11c":{"m":43,"g":94},"3db43d1b":{"m":43,"g":94},"f0f8a769":{"m":43,"g":94},"2bcfba1b":{"m":43,"g":94},"bc12d403":{"m":43,"g":94},"392f2863":{"m":43,"g":94},"6d0fa73e":{"m":43,"g":94},"9e0dac1a":{"m":43,"g":94},"a95d5589":{"m":43,"g":94},"d17d19e5":{"m":43,"g":94},"dd3809fa":{"m":43,"g":94},"7feba415":{"m":43,"g":94},"30ee3630":{"m":43,"g":94},"e5db40dc":{"m":43,"g":94},"b1709305":{"m":43,"g":94},"5ab20cce":{"m":43,"g":94},"02f7f3e4":{"m":43,"g":94},"2782132b":{"m":43,"g":94},"d19cc0b9":{"m":43,"g":94},"b0facb33":{"m":43,"g":94},"ecb8bad2":{"m":43,"g":94},"dbec2f18":{"m":43,"g":94},"e4b367ba":{"m":43,"g":94},"d10b933a":{"m":43,"g":94},"9116b289":{"m":43,"g":94},"a5114b6f":{"m":43,"g":94},"b6b40946":{"m":43,"g":94},"f1088e0f":{"m":43,"g":94},"175afed3":{"m":43,"g":94},"4a292f67":{"m":43,"g":94},"cd0be748":{"m":43,"g":94},"56503d9b":{"m":43,"g":94},"02bc9579":{"m":43,"g":94},"24f3e151":{"m":43,"g":94},"6790240c":{"m":43,"g":94},"061e5463":{"m":43,"g":94},"0c1e8796":{"m":43,"g":94},"869f1c02":{"m":43,"g":94},"2725f8da":{"m":43,"g":94},"da1ffed6":{"m":43,"g
":94},"48761171":{"m":43,"g":94},"c3f2fc5a":{"m":43,"g":94},"7ee6c259":{"m":43,"g":94},"9610fcd4":{"m":43,"g":94},"31fad29a":{"m":43,"g":94},"9da5a60b":{"m":43,"g":94},"69aa937a":{"m":43,"g":94},"5d638c92":{"m":43,"g":94},"e37cdab0":{"m":43,"g":94},"1d9deeac":{"m":43,"g":94},"dafb6a52":{"m":43,"g":94},"862cd265":{"m":43,"g":94},"1f26e8b8":{"m":44,"g":94},"5e1558f1":{"m":44,"g":94},"94cde109":{"m":44,"g":94},"00611286":{"m":44,"g":94},"e68b9e76":{"m":44,"g":94},"7ce36068":{"m":44,"g":94},"efb099cd":{"m":44,"g":94},"09603c6d":{"m":44,"g":94},"cf470fea":{"m":44,"g":94},"45d5af24":{"m":44,"g":94},"b121bc03":{"m":44,"g":94},"e12358dc":{"m":44,"g":94},"554fbf93":{"m":44,"g":94},"b48edff6":{"m":44,"g":94},"593b19f2":{"m":44,"g":94},"59cbf476":{"m":44,"g":94},"95946271":{"m":44,"g":94},"5c4ce656":{"m":44,"g":94},"cbbc82b7":{"m":44,"g":94},"8bee20f8":{"m":44,"g":94},"12cad0fe":{"m":44,"g":94},"b6cd9036":{"m":44,"g":94},"30643fed":{"m":45,"g":94},"e646c590":{"m":45,"g":94},"c555ce2c":{"m":45,"g":94},"40900bae":{"m":45,"g":94},"a2f5e755":{"m":45,"g":94},"2148914e":{"m":45,"g":94},"def55bc8":{"m":45,"g":94},"86a2c473":{"m":45,"g":94},"1701b0db":{"m":45,"g":94},"384d85ba":{"m":45,"g":94},"60597219":{"m":45,"g":94},"fc82f5a7":{"m":45,"g":94},"0089c4bc":{"m":45,"g":94},"72e7b57a":{"m":45,"g":94},"87a7cfa0":{"m":45,"g":94},"8f8f96a6":{"m":45,"g":94},"05b3bf5e":{"m":45,"g":94},"3f5ac88d":{"m":45,"g":94},"0d800090":{"m":45,"g":94},"b7d05594":{"m":45,"g":94},"80a90547":{"m":45,"g":94},"9af7b88e":{"m":45,"g":94},"fbcbb263":{"m":45,"g":94},"2fce449b":{"m":45,"g":94},"ad4125d1":{"m":45,"g":94},"17536e7e":{"m":45,"g":94},"65859754":{"m":46,"g":94},"2ce32db6":{"m":46,"g":94},"793b79db":{"m":46,"g":94},"1363b519":{"m":46,"g":94},"0abbf289":{"m":46,"g":94},"c17c5781":{"m":46,"g":94},"916b3cdd":{"m":46,"g":94},"838dcda1":{"m":46,"g":94},"efbc116a":{"m":46,"g":94},"6aed0445":{"m":46,"g":94},"908dd7f9":{"m":46,"g":94},"f4cd8040":{"m":46,"g":94},"be7986e0":{"m":46,"g":94},"5a5f1843":{"m":46,"g":
94},"7b394e5f":{"m":46,"g":94},"3b60558d":{"m":46,"g":94},"5a9a4f41":{"m":46,"g":94},"72e979bf":{"m":46,"g":94},"146f6134":{"m":46,"g":94},"660ecb73":{"m":46,"g":94},"2565cb0f":{"m":46,"g":94},"066e8a4e":{"m":46,"g":94},"2134f089":{"m":46,"g":94},"a54f278d":{"m":46,"g":94},"d1b31b06":{"m":46,"g":94},"d59a4782":{"m":46,"g":94},"104bf260":{"m":46,"g":94},"3bf3d011":{"m":46,"g":94},"d86a2d65":{"m":46,"g":94},"16eb33ff":{"m":46,"g":94},"61cf00e1":{"m":46,"g":94},"b9fd178f":{"m":46,"g":94},"d8e9d61f":{"m":46,"g":94},"a2e0424a":{"m":46,"g":94},"8ce202a4":{"m":46,"g":94},"d913d52c":{"m":46,"g":94},"0ab7bcaf":{"m":46,"g":94},"438526a8":{"m":46,"g":94},"f7102fbd":{"m":46,"g":94},"a7a0a688":{"m":46,"g":94},"2d4ce1b7":{"m":46,"g":94},"4ba815b8":{"m":46,"g":94},"5f65e2b8":{"m":46,"g":94},"4e2af03c":{"m":46,"g":94},"3184aa95":{"m":46,"g":94},"b548801d":{"m":46,"g":94},"539df95d":{"m":46,"g":94},"5e00ddeb":{"m":46,"g":94},"54dd3ea1":{"m":46,"g":94},"d04899d7":{"m":46,"g":94},"5010e0d2":{"m":46,"g":94},"5e6c3265":{"m":46,"g":94},"680cad20":{"m":46,"g":94},"0a24eb85":{"m":46,"g":94},"3839be29":{"m":46,"g":94},"6e13b650":{"m":46,"g":94},"6fcd6d7d":{"m":46,"g":94},"c77762d5":{"m":46,"g":94},"51c81e33":{"m":46,"g":94},"eaade87a":{"m":46,"g":94},"86fc0d79":{"m":46,"g":94},"1be853ee":{"m":46,"g":94},"86e0dde5":{"m":46,"g":94},"2b809788":{"m":46,"g":94},"9d6fb084":{"m":46,"g":94},"ced362f7":{"m":46,"g":94},"9084a864":{"m":46,"g":94},"6aa94b96":{"m":46,"g":94},"c2650748":{"m":46,"g":94},"07bf2e84":{"m":46,"g":94},"a628dd8e":{"m":46,"g":94},"1e890341":{"m":46,"g":94},"715b16c1":{"m":46,"g":94},"9ce8e1a9":{"m":46,"g":94},"fb99aaa5":{"m":46,"g":94},"b77a02cd":{"m":46,"g":94},"f407fcf9":{"m":47,"g":94},"54479d6f":{"m":47,"g":94},"ba069a24":{"m":47,"g":94},"125b1199":{"m":47,"g":94},"eff468dd":{"m":47,"g":94},"a1bd7190":{"m":47,"g":94},"78c1d644":{"m":47,"g":94},"027e6524":{"m":47,"g":94},"b808a383":{"m":47,"g":94},"602ebc66":{"m":47,"g":94},"530ae1bd":{"m":47,"g":94},"befc6beb":{"m":47,"g":94
},"59a5ba9b":{"m":47,"g":94},"86c37d01":{"m":47,"g":94},"f18b9c72":{"m":47,"g":94},"3e335743":{"m":47,"g":94},"0d94f1dd":{"m":47,"g":94},"e728258d":{"m":47,"g":94},"239eafbd":{"m":47,"g":94},"9d427265":{"m":47,"g":94},"00ffde20":{"m":47,"g":94},"ddeb9d42":{"m":47,"g":94},"aaf0a315":{"m":47,"g":94},"f9633fa9":{"m":47,"g":94},"087ab832":{"m":47,"g":94},"8169c6f4":{"m":47,"g":94},"3d043319":{"m":47,"g":94},"a8aad935":{"m":47,"g":94},"47ffe7af":{"m":47,"g":94},"b3523af8":{"m":47,"g":94},"1929c067":{"m":47,"g":94},"ed53ac84":{"m":47,"g":94},"520f0094":{"m":47,"g":94},"9c939a3d":{"m":47,"g":94},"549e8b83":{"m":47,"g":94},"a1f32867":{"m":47,"g":94},"760552e0":{"m":47,"g":94},"d9aada9d":{"m":47,"g":94},"f11eb90f":{"m":47,"g":94},"95a4ed12":{"m":47,"g":94},"d1150e9a":{"m":47,"g":94},"e3126e3c":{"m":47,"g":94},"a5095520":{"m":47,"g":94},"7ef0084b":{"m":47,"g":94},"f9a377f6":{"m":47,"g":94},"4ade15dd":{"m":47,"g":94},"8dc84da0":{"m":47,"g":94},"f16eb15d":{"m":47,"g":94},"5bc2508b":{"m":47,"g":94},"a71a44f2":{"m":47,"g":94},"691808d5":{"m":47,"g":94},"d32fba2a":{"m":47,"g":94},"67c424cc":{"m":47,"g":94},"1ae270c5":{"m":47,"g":94},"c77c1e05":{"m":47,"g":94},"dca87ec3":{"m":47,"g":94},"4b1d7a25":{"m":47,"g":94},"a5e0defb":{"m":47,"g":94},"96766101":{"m":47,"g":94},"a146d999":{"m":47,"g":94},"f5113e50":{"m":47,"g":94},"02755768":{"m":47,"g":94},"463d56bf":{"m":47,"g":94},"530ff541":{"m":47,"g":94},"3cd28092":{"m":47,"g":94},"704f8e8e":{"m":47,"g":94},"1853c352":{"m":47,"g":94},"32c9a7ec":{"m":48,"g":94},"b01df48c":{"m":48,"g":94},"c29b98e0":{"m":48,"g":94},"954f4e6b":{"m":48,"g":94},"2558d6a6":{"m":48,"g":94},"29ebe3df":{"m":48,"g":94},"f6dd6486":{"m":48,"g":94},"ea53c63b":{"m":48,"g":94},"a10d5309":{"m":48,"g":94},"aae5434b":{"m":48,"g":94},"c3eac1b0":{"m":48,"g":94},"b275ce00":{"m":48,"g":94},"13ce3e4b":{"m":48,"g":94},"df246e69":{"m":48,"g":94},"fb9fb351":{"m":48,"g":94},"c722d9bd":{"m":48,"g":94},"218ab361":{"m":48,"g":94},"9a00e6f4":{"m":49,"g":94},"4f8c3aea":{"m":49,"g":94},
"2369e882":{"m":49,"g":94},"ad30d5cf":{"m":49,"g":94},"dfec7fca":{"m":49,"g":94},"8048c28c":{"m":49,"g":94},"30af7dfb":{"m":49,"g":94},"f6f71379":{"m":49,"g":94},"f35cb46c":{"m":49,"g":94},"7f8fcd39":{"m":49,"g":94},"5c6a41fa":{"m":49,"g":94},"722530fa":{"m":49,"g":94},"56a347f7":{"m":49,"g":94},"3295cd8a":{"m":49,"g":94},"5942dfc0":{"m":49,"g":94},"63a395b9":{"m":49,"g":94},"7d671e4a":{"m":49,"g":94},"699384cb":{"m":49,"g":94},"ffd20fcd":{"m":49,"g":94},"55bd97f3":{"m":49,"g":94},"e57c3e12":{"m":49,"g":94},"f239268f":{"m":49,"g":94},"929c7621":{"m":49,"g":94},"b7a065ea":{"m":49,"g":94},"b1104538":{"m":49,"g":94},"3b44bbee":{"m":49,"g":94},"80e2c4a8":{"m":49,"g":94},"66318ffe":{"m":49,"g":94},"76619261":{"m":49,"g":94},"2a3992b6":{"m":49,"g":94},"4af3f889":{"m":49,"g":94},"df7fe452":{"m":49,"g":94},"a7164b62":{"m":49,"g":94},"11668533":{"m":49,"g":94},"a9e90b4b":{"m":49,"g":94},"8c280cee":{"m":49,"g":94},"9c745d07":{"m":49,"g":94},"ebaa2f31":{"m":49,"g":94},"62832bb2":{"m":49,"g":94},"11f881d1":{"m":49,"g":94},"38625e21":{"m":49,"g":94},"c1f401fc":{"m":49,"g":94},"3b878863":{"m":49,"g":94},"f719d9ae":{"m":49,"g":94},"edad3731":{"m":49,"g":94},"976bc302":{"m":49,"g":94},"2f2e0743":{"m":49,"g":94},"2ffe0a73":{"m":49,"g":94},"cf248976":{"m":49,"g":94},"e5c67150":{"m":49,"g":94},"023d0a73":{"m":49,"g":94},"ac5a0f04":{"m":50,"g":94},"ea34350d":{"m":50,"g":94},"1605ae12":{"m":50,"g":94},"1aea19f6":{"m":50,"g":94},"1f76fc6e":{"m":50,"g":94},"7f076c2c":{"m":50,"g":94},"3c5538f7":{"m":50,"g":94},"10189d08":{"m":50,"g":94},"c4336b2b":{"m":50,"g":94},"4d62bca5":{"m":50,"g":94},"e1e595d7":{"m":50,"g":94},"5ada33ff":{"m":50,"g":94},"254fd130":{"m":50,"g":94},"538fa0ae":{"m":50,"g":94},"55842eb8":{"m":50,"g":94},"a866b65e":{"m":50,"g":94},"4b0a1c93":{"m":50,"g":94},"8e1adb84":{"m":50,"g":94},"dd44173d":{"m":50,"g":94},"8912b763":{"m":50,"g":94},"be0124bd":{"m":50,"g":94},"fe5d3e81":{"m":50,"g":94},"731146f6":{"m":50,"g":94},"fa271613":{"m":50,"g":94},"5652c565":{"m":50,"g":94},"e
3938b2f":{"m":50,"g":94},"c211e7b6":{"m":50,"g":94},"d90c3d6b":{"m":50,"g":94},"9e8f8fbf":{"m":50,"g":94},"b509db58":{"m":50,"g":94},"dbe17293":{"m":50,"g":94},"84a1698d":{"m":50,"g":94},"32293a29":{"m":50,"g":94},"79216908":{"m":50,"g":94},"bbb81c24":{"m":50,"g":94},"52f58fc4":{"m":50,"g":94},"145c0ddc":{"m":50,"g":94},"505d7f71":{"m":50,"g":94},"cbedd1db":{"m":50,"g":94},"ad47749b":{"m":50,"g":94},"751c3a03":{"m":50,"g":94},"60769be1":{"m":50,"g":94},"a78d8f8d":{"m":50,"g":94},"c5f86501":{"m":50,"g":94},"d98fa1e9":{"m":50,"g":94},"865233e2":{"m":50,"g":94},"66d4859a":{"m":50,"g":94},"e1b63624":{"m":50,"g":94},"c35cd1f8":{"m":50,"g":94},"72f87b72":{"m":50,"g":94},"62a4a339":{"m":50,"g":94},"2797bc34":{"m":50,"g":94},"fed4c694":{"m":51,"g":94},"fb6e04a0":{"m":51,"g":94},"6997e28f":{"m":51,"g":94},"a0e58740":{"m":51,"g":94},"37c8a576":{"m":51,"g":94},"c754652f":{"m":51,"g":94},"0b46b951":{"m":51,"g":94},"2763c0a7":{"m":51,"g":94},"de3b67b7":{"m":51,"g":94},"19f33b32":{"m":51,"g":94},"30ce5b59":{"m":51,"g":94},"bc1f6fda":{"m":51,"g":94},"867e092f":{"m":51,"g":94},"88c7763f":{"m":51,"g":94},"e4118b15":{"m":51,"g":94},"ba4ee37f":{"m":51,"g":94},"fae4e5e9":{"m":52,"g":94},"afe1e465":{"m":52,"g":94},"f50a6cf4":{"m":52,"g":94},"fe97a2d4":{"m":52,"g":94},"8b48496a":{"m":52,"g":94},"4057ea82":{"m":52,"g":94},"4f2ee48e":{"m":52,"g":94},"71ff2728":{"m":52,"g":94},"b7038fec":{"m":52,"g":94},"65fdb289":{"m":52,"g":94},"b2ccf36d":{"m":52,"g":94},"d4fc1a70":{"m":52,"g":94},"db674e3d":{"m":52,"g":94},"fb915bd1":{"m":52,"g":94},"09798b36":{"m":52,"g":94},"b79fffdc":{"m":52,"g":94},"cd51758f":{"m":52,"g":94},"91e5dbf5":{"m":52,"g":94},"dd5eba4c":{"m":52,"g":94},"a4fd2f9b":{"m":52,"g":94},"92d1253e":{"m":52,"g":94},"a9ca297d":{"m":52,"g":94},"2a02185c":{"m":52,"g":94},"f8b03269":{"m":53,"g":94},"04957965":{"m":53,"g":94},"1228f7ca":{"m":53,"g":94},"fda628d8":{"m":53,"g":94},"07ec07ad":{"m":53,"g":94},"83b340e3":{"m":53,"g":94},"0639bf15":{"m":53,"g":94},"aa47f642":{"m":53,"g":94},"3dd
b1c46":{"m":53,"g":94},"480e38a7":{"m":53,"g":94},"69e2d4fb":{"m":53,"g":94},"85e1a6f3":{"m":53,"g":94},"33deca81":{"m":53,"g":94},"18108abe":{"m":53,"g":94},"c54bda30":{"m":53,"g":94},"3c79ad35":{"m":53,"g":94},"983bfcf3":{"m":53,"g":94},"28bc60dc":{"m":53,"g":94},"7301a39b":{"m":53,"g":94},"47eb139f":{"m":53,"g":94},"5c18a037":{"m":53,"g":94},"5c91a315":{"m":53,"g":94},"3dbd73d3":{"m":53,"g":94},"e9a6203d":{"m":53,"g":94},"62c516ac":{"m":53,"g":94},"fc78640e":{"m":53,"g":94},"906d795f":{"m":53,"g":94},"118b6af3":{"m":53,"g":94},"9449a954":{"m":53,"g":94},"5f12f0e7":{"m":53,"g":94},"d5b95cbb":{"m":53,"g":94},"0303ca91":{"m":53,"g":94},"00181098":{"m":53,"g":94},"4936be8a":{"m":53,"g":94},"1bfa511b":{"m":53,"g":94},"f5b5f2bf":{"m":53,"g":94},"7e4c6dd8":{"m":53,"g":94},"d622851d":{"m":53,"g":94},"883c9554":{"m":53,"g":94},"0d6a49bd":{"m":53,"g":94},"ccaf1f99":{"m":53,"g":94},"7d1485d3":{"m":53,"g":94},"7d5d1d3d":{"m":53,"g":94},"b53d6cbd":{"m":53,"g":94},"01017d4c":{"m":53,"g":94},"94e167ea":{"m":53,"g":94},"262e370f":{"m":53,"g":94},"419a57e7":{"m":53,"g":94},"e5f227c0":{"m":54,"g":94},"0e7409ad":{"m":54,"g":94},"3cde5eb6":{"m":54,"g":94},"f5b2a3aa":{"m":54,"g":94},"f6817596":{"m":54,"g":94},"67b65794":{"m":54,"g":94},"37ee906f":{"m":54,"g":94},"34b364e0":{"m":54,"g":94},"84d96b3a":{"m":54,"g":94},"3d32e4a3":{"m":54,"g":94},"64fceab8":{"m":54,"g":94},"71e2a277":{"m":54,"g":94},"4a63c181":{"m":54,"g":94},"2b0fc594":{"m":54,"g":94},"9cc733b3":{"m":54,"g":94},"d693ec04":{"m":54,"g":94},"18ea841f":{"m":54,"g":94},"786be44d":{"m":54,"g":94},"2db44698":{"m":54,"g":94},"ed45e509":{"m":54,"g":94},"ec52464d":{"m":54,"g":94},"eb0c1f53":{"m":54,"g":94},"b2986d7a":{"m":54,"g":94},"8f4d04e5":{"m":55,"g":94},"feb2b768":{"m":55,"g":94},"d95a5f5b":{"m":55,"g":94},"4b83db24":{"m":55,"g":94},"64456cf0":{"m":55,"g":94},"bb4a9220":{"m":55,"g":94},"21e9e63a":{"m":55,"g":94},"1fc84cf6":{"m":55,"g":94},"361ea8d9":{"m":55,"g":94},"33c5ff28":{"m":55,"g":94},"5ce9daea":{"m":55,"g":94},"ce094
a5d":{"m":55,"g":94},"e2102669":{"m":55,"g":94},"bd619616":{"m":55,"g":94},"56198b45":{"m":55,"g":94},"ba36b552":{"m":55,"g":94},"9cd9dc83":{"m":55,"g":94},"7a1aecb9":{"m":55,"g":94},"82699474":{"m":55,"g":94},"7154b4b1":{"m":55,"g":94},"b532a5fd":{"m":55,"g":94},"a0592c05":{"m":55,"g":94},"e8dbdf75":{"m":55,"g":94},"e04d3f28":{"m":55,"g":94},"5f2595be":{"m":55,"g":94},"0ba2c589":{"m":55,"g":94},"fccbfa37":{"m":55,"g":94},"2f9bd0fa":{"m":55,"g":94},"5282a473":{"m":55,"g":94},"f0ed9c35":{"m":55,"g":94},"e3b3acfa":{"m":55,"g":94},"2673fa29":{"m":55,"g":94},"dedaf8cd":{"m":55,"g":94},"32ed0160":{"m":55,"g":94},"6efa9e4a":{"m":55,"g":94},"7791fd99":{"m":55,"g":94},"2ac36b9a":{"m":55,"g":94},"2d60a5ee":{"m":55,"g":94},"2e4a5907":{"m":55,"g":94},"c0ee46fe":{"m":55,"g":94},"9208618b":{"m":55,"g":94},"864bf2ba":{"m":55,"g":94},"a4cca7fc":{"m":55,"g":94},"993956c6":{"m":55,"g":94},"f8548295":{"m":55,"g":94},"959735fc":{"m":55,"g":94},"f6772394":{"m":55,"g":94},"626a99ac":{"m":55,"g":94},"ece72491":{"m":55,"g":94},"0fb88aaa":{"m":55,"g":94},"d4de9a62":{"m":55,"g":94},"7310aede":{"m":55,"g":94},"5de9a58e":{"m":55,"g":94},"56fcd8e8":{"m":55,"g":94},"2b340adf":{"m":55,"g":94},"8586b72d":{"m":55,"g":94},"641b7d0a":{"m":55,"g":94},"0ce091a8":{"m":55,"g":94},"835f8afc":{"m":55,"g":94},"3844feb9":{"m":55,"g":94},"27f7bed7":{"m":55,"g":94},"6387098f":{"m":55,"g":94},"2a717c50":{"m":55,"g":94},"a1e697b2":{"m":55,"g":94},"a6ca736c":{"m":55,"g":94},"f62055b5":{"m":55,"g":94},"74bc9184":{"m":55,"g":94},"0f8eb153":{"m":55,"g":94},"67470bbb":{"m":55,"g":94},"cc858953":{"m":55,"g":94},"6128f7cf":{"m":55,"g":94},"a2486eb5":{"m":55,"g":94},"61dec545":{"m":55,"g":94},"96db0f66":{"m":55,"g":94},"7dc66fcb":{"m":55,"g":94},"1f09e84b":{"m":55,"g":94},"63dfab1b":{"m":55,"g":94},"ef995dae":{"m":55,"g":94},"75ae9689":{"m":55,"g":94},"95f93f49":{"m":55,"g":94},"aaac33fd":{"m":55,"g":94},"d332aa3b":{"m":55,"g":94},"c36736c8":{"m":55,"g":94},"1bf9e347":{"m":55,"g":94},"499c85f1":{"m":55,"g":94},"efc52f8
5":{"m":56,"g":94},"60e2fdcf":{"m":56,"g":94},"d7c0e872":{"m":56,"g":94},"31548116":{"m":56,"g":94},"53aed988":{"m":56,"g":94},"8a56b431":{"m":56,"g":94},"e835a500":{"m":56,"g":94},"23e5e50f":{"m":56,"g":94},"25e5d589":{"m":56,"g":94},"41b1db69":{"m":56,"g":94},"84967019":{"m":56,"g":94},"7d672d27":{"m":56,"g":94},"d4b17481":{"m":56,"g":94},"19ba2b0e":{"m":56,"g":94},"4e1e3cff":{"m":56,"g":94},"ef5b0ff9":{"m":57,"g":94},"6e530515":{"m":57,"g":94},"77d1210b":{"m":57,"g":94},"70dc2fbe":{"m":57,"g":94},"b438a2e5":{"m":57,"g":94},"7ca751ff":{"m":57,"g":94},"c75adfec":{"m":57,"g":94},"7722c11c":{"m":57,"g":94},"b2ed5c8e":{"m":57,"g":94},"f46f394f":{"m":57,"g":94},"2125898a":{"m":57,"g":94},"44f011d2":{"m":57,"g":94},"ed91e003":{"m":57,"g":94},"531d6ea9":{"m":57,"g":94},"dc3bee48":{"m":57,"g":94},"a74d1941":{"m":57,"g":94},"3169e66c":{"m":57,"g":94},"77395154":{"m":57,"g":94},"637de9e8":{"m":57,"g":94},"acb34072":{"m":57,"g":94},"08effbff":{"m":57,"g":94},"60bd3272":{"m":57,"g":94},"e7ebecf8":{"m":57,"g":94},"9a23c484":{"m":57,"g":94},"635a0426":{"m":57,"g":94},"2dccecf4":{"m":57,"g":94},"75ad0a14":{"m":57,"g":94},"3ccf566b":{"m":58,"g":94},"afa0341e":{"m":58,"g":94},"30828e71":{"m":58,"g":94},"e0e09fce":{"m":58,"g":94},"9c05c689":{"m":58,"g":94},"3464e57b":{"m":58,"g":94},"3815b23c":{"m":58,"g":94},"fd34f2da":{"m":58,"g":94},"8ee9a850":{"m":58,"g":94},"fd28640d":{"m":58,"g":94},"7863e436":{"m":58,"g":94},"333e3bfd":{"m":58,"g":94},"239c9d4d":{"m":58,"g":94},"855d0ba3":{"m":58,"g":94},"9254a33a":{"m":58,"g":94},"8a2681e2":{"m":58,"g":94},"5276a675":{"m":58,"g":94},"751e5ca2":{"m":58,"g":94},"7a7ac6be":{"m":58,"g":94},"d9e6ee38":{"m":58,"g":94},"03d5fbfd":{"m":59,"g":94},"1703d766":{"m":59,"g":94},"09e6e2aa":{"m":59,"g":94},"fad29f7f":{"m":59,"g":94},"35bdb485":{"m":59,"g":94},"b085e06b":{"m":59,"g":94},"763dd55d":{"m":59,"g":94},"2f0d3864":{"m":60,"g":94},"3900a94a":{"m":60,"g":94},"ded9fcd0":{"m":60,"g":94},"bc6ad367":{"m":60,"g":94},"3a22a303":{"m":60,"g":94},"bdb3929d"
:{"m":60,"g":94},"f5d0865b":{"m":60,"g":94},"afdee7b1":{"m":60,"g":94},"cb34d848":{"m":60,"g":94},"0f9cc6d8":{"m":60,"g":94},"c7ae474a":{"m":60,"g":94},"bdf946bf":{"m":60,"g":94},"8c8779cd":{"m":60,"g":94},"1775b963":{"m":60,"g":94},"dd2e2d27":{"m":60,"g":94},"a990daff":{"m":60,"g":94},"ba5112ff":{"m":60,"g":94},"815dce05":{"m":60,"g":94},"ad20b795":{"m":60,"g":94},"9183c23e":{"m":60,"g":94},"148254d4":{"m":60,"g":94},"a4d6d6f1":{"m":60,"g":94},"062c48d2":{"m":60,"g":94},"b6e0cfb5":{"m":60,"g":94},"0d8d97b8":{"m":60,"g":94},"0a765bbc":{"m":60,"g":94},"286cad3e":{"m":60,"g":94},"dc7eb01f":{"m":60,"g":94},"b0524c37":{"m":60,"g":94},"6c42fa22":{"m":60,"g":94},"d49b13c6":{"m":60,"g":94},"bedc4c7a":{"m":60,"g":94},"f44d1439":{"m":60,"g":94},"b6b57fc2":{"m":60,"g":94},"b4403985":{"m":60,"g":94},"339c69a2":{"m":60,"g":94},"f7074700":{"m":60,"g":94},"21ec66e5":{"m":60,"g":94},"c5210dfa":{"m":60,"g":94},"a29dd950":{"m":60,"g":94},"9c6ba248":{"m":60,"g":94},"b02da24a":{"m":60,"g":94},"bdd2827a":{"m":60,"g":94},"8c3b420e":{"m":60,"g":94},"e6f523b5":{"m":60,"g":94},"32318178":{"m":60,"g":94},"a11f8d5f":{"m":60,"g":94},"098d659c":{"m":60,"g":94},"76d14f8c":{"m":60,"g":94},"b08c308e":{"m":60,"g":94},"f624901c":{"m":61,"g":94},"f0e15dc6":{"m":61,"g":94},"f1769586":{"m":61,"g":94},"5d6e9467":{"m":61,"g":94},"a47bf391":{"m":61,"g":94},"b1706469":{"m":61,"g":94},"5413ec2b":{"m":61,"g":94},"f290bd43":{"m":61,"g":94},"8f157893":{"m":61,"g":94},"2db03a04":{"m":61,"g":94},"5cc11705":{"m":61,"g":94},"11fffbc9":{"m":61,"g":94},"4f077c01":{"m":61,"g":94},"679c3bca":{"m":61,"g":94},"656aed58":{"m":61,"g":94},"b5fb4ef5":{"m":61,"g":94},"2e6346fc":{"m":61,"g":94},"977f785d":{"m":61,"g":94},"8a690612":{"m":61,"g":94},"694e4192":{"m":61,"g":94},"b22f3f64":{"m":61,"g":94},"6fb57683":{"m":61,"g":94},"51caee74":{"m":61,"g":94},"58f9060e":{"m":61,"g":94},"bdc1acf6":{"m":61,"g":94},"6d08ce2a":{"m":61,"g":94},"380930a9":{"m":61,"g":94},"9dec582d":{"m":61,"g":94},"b01febdc":{"m":61,"g":94},"1acbaf1b":{
"m":61,"g":94},"287427e2":{"m":61,"g":94},"b8574f69":{"m":61,"g":94},"2855caa4":{"m":61,"g":94},"2329e1dd":{"m":61,"g":94},"0f3eb1d2":{"m":61,"g":94},"06dd2eab":{"m":61,"g":94},"439f6580":{"m":61,"g":94},"b3e99dfb":{"m":62,"g":94},"f005758f":{"m":62,"g":94},"f5c6c667":{"m":62,"g":94},"cc0485be":{"m":62,"g":94},"b8cd09f2":{"m":62,"g":94},"c19d8482":{"m":62,"g":94},"80002562":{"m":62,"g":94},"46d44318":{"m":62,"g":94},"923f5183":{"m":62,"g":94},"d08c77c4":{"m":62,"g":94},"c1e097ca":{"m":62,"g":94},"6ec75e62":{"m":62,"g":94},"d855653b":{"m":62,"g":94},"336ff5b9":{"m":62,"g":94},"3b141e15":{"m":62,"g":94},"6249e4a1":{"m":62,"g":94},"f3516c28":{"m":62,"g":94},"17de02f9":{"m":62,"g":94},"51ab3ccf":{"m":62,"g":94},"67008f4b":{"m":62,"g":94},"4536d724":{"m":62,"g":94},"41d7e5b7":{"m":62,"g":94},"20a9f5df":{"m":62,"g":94},"42f39099":{"m":62,"g":94},"72c77763":{"m":62,"g":94},"4093aa46":{"m":62,"g":94},"e808c1df":{"m":62,"g":94},"a18ab81d":{"m":62,"g":94},"0bb0f763":{"m":62,"g":94},"85b2e057":{"m":62,"g":94},"a879c2fb":{"m":62,"g":94},"e2b16c47":{"m":62,"g":94},"c4f9707e":{"m":62,"g":94},"197cbf9b":{"m":62,"g":94},"e94fb7cb":{"m":63,"g":94},"b5caa22d":{"m":63,"g":94},"73401fd0":{"m":63,"g":94},"89cd9235":{"m":63,"g":94},"dc188132":{"m":63,"g":94},"10bfce71":{"m":63,"g":94},"583697cd":{"m":63,"g":94},"2584f6d9":{"m":63,"g":94},"51e87f6f":{"m":63,"g":94},"09bcbe01":{"m":63,"g":94},"03464890":{"m":63,"g":94},"44a96697":{"m":63,"g":94},"1a820e38":{"m":63,"g":94},"0ffcfdf4":{"m":63,"g":94},"cd493b5a":{"m":63,"g":94},"61f42b57":{"m":63,"g":94},"e403d237":{"m":63,"g":94},"3bcf5ece":{"m":63,"g":94},"2c05f81f":{"m":63,"g":94},"d77caa2b":{"m":63,"g":94},"8b6a4486":{"m":63,"g":94},"a69cb5cf":{"m":63,"g":94},"def5c318":{"m":63,"g":94},"3fc2b625":{"m":63,"g":94},"6ada05d0":{"m":63,"g":94},"24cafe31":{"m":63,"g":94},"5a176c92":{"m":63,"g":94},"4719c1d0":{"m":63,"g":94},"ef18b0ed":{"m":63,"g":94},"53cc91e5":{"m":63,"g":94},"d33cbb7e":{"m":63,"g":94},"23196d52":{"m":63,"g":94},"93b77c8e":{"m
":63,"g":94},"7906d1d2":{"m":63,"g":94},"81d27c8e":{"m":63,"g":94},"4d4cdb3f":{"m":63,"g":94},"2bd18e2d":{"m":63,"g":94},"83452dbb":{"m":63,"g":94},"3d93f84a":{"m":63,"g":94},"c2f212d6":{"m":63,"g":94},"e2cdc8a5":{"m":63,"g":94},"2add697d":{"m":63,"g":94},"6f98c586":{"m":63,"g":94},"656dcc1a":{"m":63,"g":94},"8af7048d":{"m":63,"g":94},"d3024f4f":{"m":63,"g":94},"13387e6b":{"m":63,"g":94},"120c3634":{"m":63,"g":94},"78e5b22f":{"m":63,"g":94},"7a15e9ad":{"m":63,"g":94},"dc2ac0cb":{"m":63,"g":94},"d47c5101":{"m":63,"g":94},"033c715b":{"m":63,"g":94},"d06c1ab5":{"m":63,"g":94},"c5644cac":{"m":63,"g":94},"53e6552f":{"m":63,"g":94},"5dc54f1a":{"m":63,"g":94},"f3e9b489":{"m":63,"g":94},"6a7973ad":{"m":63,"g":94},"63051738":{"m":63,"g":94},"a8ccacc8":{"m":63,"g":94},"0427416b":{"m":63,"g":94},"bf3edc2c":{"m":63,"g":94},"78e974b2":{"m":63,"g":94},"bc6915e3":{"m":63,"g":94},"a883f079":{"m":63,"g":94},"8b6ce52e":{"m":63,"g":94},"58f3f2b8":{"m":63,"g":94},"93d69061":{"m":63,"g":94},"e00e5385":{"m":63,"g":94},"a2f602b5":{"m":63,"g":94},"8f2c522a":{"m":63,"g":94},"75964177":{"m":63,"g":94},"2dc957d4":{"m":63,"g":94},"bf8d07a6":{"m":63,"g":94},"ab317936":{"m":63,"g":94},"b7f3fec1":{"m":63,"g":94},"58f42b1d":{"m":63,"g":94},"767c9dec":{"m":63,"g":94},"a53454c5":{"m":63,"g":94},"6cb3974e":{"m":63,"g":94},"f65c13b5":{"m":63,"g":94},"b803b395":{"m":63,"g":94},"bfbda62c":{"m":63,"g":94},"4ab43cfb":{"m":64,"g":94},"2f79f588":{"m":64,"g":94},"8a96f749":{"m":64,"g":94},"827aa873":{"m":64,"g":94},"f8ca66fb":{"m":64,"g":94},"53cef815":{"m":64,"g":94},"351a72d4":{"m":64,"g":94},"514f37c3":{"m":64,"g":94},"52c03f16":{"m":64,"g":94},"741fccd7":{"m":64,"g":94},"1e3e5215":{"m":64,"g":94},"fb11a439":{"m":64,"g":94},"af02f99b":{"m":64,"g":94},"9472e699":{"m":64,"g":94},"1acc1f56":{"m":64,"g":94},"b045841b":{"m":64,"g":94},"f265d15b":{"m":64,"g":94},"02431b9a":{"m":64,"g":94},"1dda8c5e":{"m":64,"g":94},"7e097613":{"m":64,"g":94},"f4a92f4b":{"m":64,"g":94},"318260c0":{"m":64,"g":94},"4a612531":{"m":
64,"g":94},"d1a08632":{"m":64,"g":94},"f8b28e46":{"m":64,"g":94},"82392da8":{"m":64,"g":94},"95f789ad":{"m":64,"g":94},"4f118a39":{"m":64,"g":94},"66283dbc":{"m":64,"g":94},"822bae8c":{"m":64,"g":94},"8e48ca8c":{"m":64,"g":94},"27acf63b":{"m":64,"g":94},"da6f8081":{"m":64,"g":94},"9286740e":{"m":64,"g":94},"896c0744":{"m":64,"g":94},"c23d5706":{"m":64,"g":94},"67ad4338":{"m":64,"g":94},"3cab5f71":{"m":64,"g":94},"14e754a8":{"m":64,"g":94},"98522149":{"m":64,"g":94},"5d9d15e7":{"m":64,"g":94},"665e5e85":{"m":64,"g":94},"a22f60a3":{"m":64,"g":94},"04f0b4cb":{"m":64,"g":94},"4505a436":{"m":64,"g":94},"685a5738":{"m":64,"g":94},"153b414e":{"m":64,"g":94},"6619f48e":{"m":64,"g":94},"3ed0a547":{"m":64,"g":94},"8d8ef849":{"m":64,"g":94},"9a0cc2e9":{"m":64,"g":94},"7bad7e75":{"m":64,"g":94},"1c4e0d24":{"m":64,"g":94},"54bac8af":{"m":64,"g":94},"5de4051b":{"m":64,"g":94},"e0cd65c2":{"m":64,"g":94},"f1b68618":{"m":64,"g":94},"0da0989a":{"m":64,"g":94},"07a22cbb":{"m":64,"g":94},"3d0bfa3e":{"m":64,"g":94},"1f6cf0d4":{"m":64,"g":94},"553f5a3f":{"m":64,"g":94},"ac2dc35d":{"m":64,"g":94},"3e032c07":{"m":64,"g":94},"44e12ce4":{"m":64,"g":94},"a547aad6":{"m":64,"g":94},"ea535dc5":{"m":64,"g":94},"862bcff8":{"m":64,"g":94},"8b84e69f":{"m":64,"g":94},"5de50653":{"m":64,"g":94},"c0bf9bf1":{"m":64,"g":94},"022614d2":{"m":64,"g":94},"b8ab989f":{"m":64,"g":94},"b3393e94":{"m":64,"g":94},"ddc2001f":{"m":64,"g":94},"806a3002":{"m":64,"g":94},"0d2148ef":{"m":64,"g":94},"bf669606":{"m":64,"g":94},"b2bd8f44":{"m":64,"g":94},"9d9b482a":{"m":64,"g":94},"7353fb9b":{"m":64,"g":94},"bcda0c9e":{"m":64,"g":94},"9f8f2c7f":{"m":64,"g":94},"6fc37bd8":{"m":64,"g":94},"3d8f1c9b":{"m":64,"g":94},"a42213db":{"m":64,"g":94},"0ac019f1":{"m":64,"g":94},"5a0d680a":{"m":64,"g":94},"a4331cd2":{"m":64,"g":94},"ec1c21cd":{"m":64,"g":94},"6c856b4f":{"m":64,"g":94},"287d07a6":{"m":64,"g":94},"d2571dd5":{"m":64,"g":94},"b730aa6b":{"m":64,"g":94},"60b2a44a":{"m":64,"g":94},"949b3fbf":{"m":64,"g":94},"da4e8b38":{"m":64
,"g":94},"af6c5357":{"m":64,"g":94},"3ad4cd49":{"m":64,"g":94},"3a8428ec":{"m":64,"g":94},"0311ce8e":{"m":64,"g":94},"5dfcacfc":{"m":64,"g":94},"41a0ccd4":{"m":64,"g":94},"cf0f7eaf":{"m":65,"g":94},"b49d6d0f":{"m":65,"g":94},"c02e3139":{"m":65,"g":94},"734daedd":{"m":65,"g":94},"3ee62235":{"m":65,"g":94},"9829e77e":{"m":65,"g":94},"cde4bbd5":{"m":65,"g":94},"9602c2aa":{"m":65,"g":94},"e81d7f11":{"m":65,"g":94},"222ce6f1":{"m":65,"g":94},"468d23cf":{"m":65,"g":94},"c38b5fb4":{"m":65,"g":94},"20453cef":{"m":65,"g":94},"9f635ea5":{"m":65,"g":94},"76285fde":{"m":65,"g":94},"988d0a4b":{"m":65,"g":94},"81262c7b":{"m":65,"g":94},"27aeb4b7":{"m":65,"g":94},"7b9b4f44":{"m":65,"g":94},"08104b56":{"m":65,"g":94},"cf142b6e":{"m":65,"g":94},"7aad8d18":{"m":66,"g":94},"76fa2d15":{"m":66,"g":94},"7ab84948":{"m":66,"g":94},"4885b908":{"m":66,"g":94},"c2723a42":{"m":66,"g":94},"c7256ca8":{"m":66,"g":94},"6186a8f8":{"m":66,"g":94},"a07364cc":{"m":66,"g":94},"2c1a695f":{"m":66,"g":94},"d39899e8":{"m":66,"g":94},"70817a7e":{"m":66,"g":94},"7b5a3741":{"m":66,"g":94},"4b6f62e2":{"m":66,"g":94},"897e2e25":{"m":66,"g":94},"d54cee14":{"m":66,"g":94},"00fa7d04":{"m":66,"g":94},"013021b6":{"m":66,"g":94},"3c8ac78d":{"m":66,"g":94},"455bfe8d":{"m":66,"g":94},"28b0a62b":{"m":66,"g":94},"566d61d9":{"m":66,"g":94},"55f5fc68":{"m":66,"g":94},"c27c378a":{"m":66,"g":94},"d9eb9358":{"m":66,"g":94},"959dca4f":{"m":66,"g":94},"f2b3a318":{"m":66,"g":94},"ad674097":{"m":66,"g":94},"8db776f0":{"m":66,"g":94},"4eb4b401":{"m":66,"g":94},"17dbf976":{"m":66,"g":94},"53179026":{"m":66,"g":94},"d7c0b32f":{"m":66,"g":94},"7b020cca":{"m":66,"g":94},"7876279e":{"m":66,"g":94},"34e405e0":{"m":66,"g":94},"1ebe1d6d":{"m":66,"g":94},"7811bfda":{"m":66,"g":94},"656f7fc1":{"m":66,"g":94},"c1f5f99f":{"m":67,"g":94},"fa82dfcc":{"m":67,"g":94},"5da3d21c":{"m":67,"g":94},"f2870376":{"m":67,"g":94},"f9905d59":{"m":67,"g":94},"45c87e08":{"m":67,"g":94},"2b1808ce":{"m":67,"g":94},"e868d0b6":{"m":67,"g":94},"591e751e":{"m":67,"
g":94},"40022d07":{"m":67,"g":94},"823148e7":{"m":67,"g":94},"76ca91df":{"m":67,"g":94},"cdae77b0":{"m":67,"g":94},"adeee152":{"m":67,"g":94},"6792411e":{"m":67,"g":94},"7348d962":{"m":67,"g":94},"25ed22b6":{"m":67,"g":94},"200d3b16":{"m":67,"g":94},"ad349985":{"m":67,"g":94},"32de54ed":{"m":67,"g":94},"2d9c3195":{"m":67,"g":94},"07e58a2d":{"m":67,"g":94},"04d8cd20":{"m":67,"g":94},"a322051e":{"m":67,"g":94},"de553334":{"m":67,"g":94},"cddb1cdf":{"m":68,"g":94},"fa1b40e0":{"m":68,"g":94},"c45cab1c":{"m":68,"g":94},"27c4c9cf":{"m":68,"g":94},"52a492a1":{"m":68,"g":94},"36f6fc50":{"m":68,"g":94},"d8727275":{"m":68,"g":94},"6239d0b2":{"m":68,"g":94},"4cfd3add":{"m":68,"g":94},"20cf910d":{"m":68,"g":94},"0af1d239":{"m":68,"g":94},"85986bb9":{"m":68,"g":94},"64c87135":{"m":68,"g":94},"1646149a":{"m":68,"g":94},"bc72e5bd":{"m":68,"g":94},"014cab4d":{"m":68,"g":94},"4d2dbeac":{"m":68,"g":94},"29daf498":{"m":68,"g":94},"6702592d":{"m":68,"g":94},"60abdb3e":{"m":68,"g":94},"7b4e61ff":{"m":68,"g":94},"6222e1c2":{"m":68,"g":94},"fad315cb":{"m":68,"g":94},"f90db8bc":{"m":68,"g":94},"d8ad5970":{"m":68,"g":94},"849f58d6":{"m":68,"g":94},"64480df4":{"m":68,"g":94},"4530136e":{"m":68,"g":94},"0a6f18f0":{"m":68,"g":94},"e0b9a423":{"m":69,"g":94},"e0821425":{"m":69,"g":94},"70f894b8":{"m":69,"g":94},"368de366":{"m":69,"g":94},"20de05a7":{"m":69,"g":94},"f076328b":{"m":69,"g":94},"bf2a7087":{"m":69,"g":94},"871a4aa1":{"m":69,"g":94},"98eecbda":{"m":69,"g":94},"4430c0a5":{"m":69,"g":94},"640363ad":{"m":69,"g":94},"8616357a":{"m":69,"g":94},"8adbc78b":{"m":69,"g":94},"45e3a7bc":{"m":69,"g":94},"b96e92e6":{"m":69,"g":94},"693c2600":{"m":69,"g":94},"ced68066":{"m":69,"g":94},"b8318aec":{"m":69,"g":94},"2f482210":{"m":69,"g":94},"d81ac443":{"m":69,"g":94},"2491cc92":{"m":69,"g":94},"67c5de92":{"m":69,"g":94},"1e2cf2b5":{"m":69,"g":94},"9490d157":{"m":69,"g":94},"eefcbdd3":{"m":69,"g":94},"7e6d5fc6":{"m":69,"g":94},"cadd5dbe":{"m":69,"g":94},"bb418ced":{"m":69,"g":94},"fdf04a14":{"m":69,"g"
:94},"5f0e7de3":{"m":69,"g":94},"2f47d710":{"m":69,"g":94},"4fe92bfc":{"m":69,"g":94},"d23cb9a0":{"m":69,"g":94},"2d611323":{"m":69,"g":94},"e782eb7e":{"m":70,"g":94},"e319153b":{"m":70,"g":94},"32b44d2f":{"m":70,"g":94},"5f1a485d":{"m":70,"g":94},"c9565e49":{"m":70,"g":94},"d03c4c25":{"m":70,"g":94},"8f13377d":{"m":70,"g":94},"3d4a8f9b":{"m":70,"g":94},"7474bed8":{"m":70,"g":94},"03caefeb":{"m":70,"g":94},"bcc213df":{"m":70,"g":94},"39416e39":{"m":70,"g":94},"231c40d8":{"m":70,"g":94},"bbc47c34":{"m":70,"g":94},"dfce9269":{"m":70,"g":94},"6718b109":{"m":70,"g":94},"7711ac6e":{"m":70,"g":94},"7443197a":{"m":70,"g":94},"862dd76c":{"m":70,"g":94},"fb4c9c3a":{"m":70,"g":94},"d973c78e":{"m":70,"g":94},"6ce6eabb":{"m":70,"g":94},"4e23c961":{"m":70,"g":94},"3efbdf68":{"m":70,"g":94},"6cc30955":{"m":70,"g":94},"31eec35b":{"m":70,"g":94},"ac963be2":{"m":70,"g":94},"a5375adc":{"m":71,"g":94},"75d171a9":{"m":71,"g":94},"714f3e63":{"m":71,"g":94},"c38f3aed":{"m":71,"g":94},"2e6be53e":{"m":71,"g":94},"fc671f66":{"m":72,"g":94},"197751e9":{"m":72,"g":94},"d2d0d061":{"m":72,"g":94},"25482edb":{"m":72,"g":94},"62b362b1":{"m":72,"g":94},"44d76463":{"m":72,"g":94},"cd85b78f":{"m":72,"g":94},"0aaccbbf":{"m":72,"g":94},"357671e2":{"m":72,"g":94},"e70fa279":{"m":72,"g":94},"abe74b7b":{"m":72,"g":94},"70b3c6ee":{"m":72,"g":94},"ef9d3b3c":{"m":72,"g":94},"fc91d08a":{"m":72,"g":94},"71ab0dab":{"m":72,"g":94},"d3d4d767":{"m":72,"g":94},"5be8f1ed":{"m":72,"g":94},"e5760bc4":{"m":72,"g":94},"56a724eb":{"m":72,"g":94},"583d6af7":{"m":72,"g":94},"e074d84e":{"m":72,"g":94},"4725e3f6":{"m":72,"g":94},"77a3954b":{"m":72,"g":94},"03b0364f":{"m":72,"g":94},"2dd7d0c5":{"m":72,"g":94},"0d4e3228":{"m":72,"g":94},"926f8efc":{"m":72,"g":94},"9545bfb2":{"m":72,"g":94},"37373ef2":{"m":72,"g":94},"61261b39":{"m":72,"g":94},"19120f71":{"m":72,"g":94},"2415ec38":{"m":72,"g":94},"87f671ab":{"m":72,"g":94},"51d25405":{"m":72,"g":94},"e0a2c963":{"m":72,"g":94},"12f2e6c3":{"m":72,"g":94},"95575aa7":{"m":72,"g":9
4},"11eea69e":{"m":72,"g":94},"1baa9e6c":{"m":72,"g":94},"911fcd09":{"m":72,"g":94},"9fafa62d":{"m":72,"g":94},"146ac8df":{"m":72,"g":94},"57a404fd":{"m":72,"g":94},"2796fbb5":{"m":72,"g":94},"935cda94":{"m":72,"g":94},"110e0066":{"m":72,"g":94},"6b45a21d":{"m":72,"g":94},"a7000a76":{"m":72,"g":94},"1a8f995c":{"m":72,"g":94},"a3ab768a":{"m":72,"g":94},"66301e12":{"m":72,"g":94},"ac238727":{"m":72,"g":94},"0194948f":{"m":72,"g":94},"b4d34cd3":{"m":72,"g":94},"728e175f":{"m":72,"g":94},"9e1014cf":{"m":72,"g":94},"fa561067":{"m":72,"g":94},"7fbab730":{"m":72,"g":94},"b7e274f2":{"m":72,"g":94},"9cf40772":{"m":72,"g":94},"d3fe9bae":{"m":72,"g":94},"00ce7e31":{"m":72,"g":94},"50f28f65":{"m":72,"g":94},"90a55e25":{"m":72,"g":94},"407e2b92":{"m":72,"g":94},"40782f05":{"m":72,"g":94},"18bb216c":{"m":72,"g":94},"6b859e7d":{"m":72,"g":94},"930da877":{"m":72,"g":94},"3f8a4414":{"m":72,"g":94},"aceb4201":{"m":72,"g":94},"90a4b7d9":{"m":72,"g":94},"f3b99f73":{"m":72,"g":94},"9e74ee91":{"m":72,"g":94},"77a6c9d2":{"m":72,"g":94},"e3e0bc50":{"m":72,"g":94},"bac414ab":{"m":72,"g":94},"eec3f6d1":{"m":72,"g":94},"90bc26a8":{"m":72,"g":94},"ec0a72c2":{"m":72,"g":94},"1c96fa86":{"m":72,"g":94},"bc20e93f":{"m":72,"g":94},"d3887852":{"m":72,"g":94},"564bdf29":{"m":72,"g":94},"5d860168":{"m":72,"g":94},"d2815879":{"m":72,"g":94},"b0df5d24":{"m":72,"g":94},"3e02526b":{"m":72,"g":94},"d8a98a2c":{"m":72,"g":94},"0519269d":{"m":72,"g":94},"d6898dd2":{"m":72,"g":94},"71ed0183":{"m":72,"g":94},"8b681d77":{"m":72,"g":94},"194eea17":{"m":72,"g":94},"acd1a159":{"m":72,"g":94},"7c1692aa":{"m":72,"g":94},"8f019c7d":{"m":72,"g":94},"7551498a":{"m":72,"g":94},"44a2c4bd":{"m":72,"g":94},"c9fc4a9d":{"m":72,"g":94},"21463e32":{"m":72,"g":94},"3dc9ff3c":{"m":72,"g":94},"06427dfa":{"m":72,"g":94},"60524920":{"m":72,"g":94},"10771026":{"m":72,"g":94},"4606e2a3":{"m":72,"g":94},"127998cc":{"m":72,"g":94},"c0bb9eb3":{"m":72,"g":94},"7036d6fc":{"m":72,"g":94},"6ce9dbe8":{"m":72,"g":94},"3758d209":{"m":72,"g":94}
,"faf29e0b":{"m":72,"g":94},"b0743ea0":{"m":72,"g":94},"60b771c8":{"m":72,"g":94},"d7934cde":{"m":72,"g":94},"62bbd343":{"m":72,"g":94},"f2388f6b":{"m":72,"g":94},"c9745ee0":{"m":72,"g":94},"1a6e9757":{"m":72,"g":94},"b1100846":{"m":72,"g":94},"27a46317":{"m":72,"g":94},"c9795808":{"m":72,"g":94},"6c7a152c":{"m":72,"g":94},"4d2a88bd":{"m":72,"g":94},"45360b2f":{"m":72,"g":94},"3f41b184":{"m":72,"g":94},"45205d88":{"m":72,"g":94},"90876940":{"m":72,"g":94},"a3339d8c":{"m":72,"g":94},"14d90617":{"m":72,"g":94},"d37f9551":{"m":72,"g":94},"c66b2c9c":{"m":72,"g":94},"20b765a2":{"m":72,"g":94},"e3107222":{"m":72,"g":94},"e074e76b":{"m":72,"g":94},"4592afc2":{"m":72,"g":94},"9af0e21e":{"m":72,"g":94},"c7c79b16":{"m":72,"g":94},"d8d75d25":{"m":72,"g":94},"1df6eabd":{"m":72,"g":94},"0c227ee3":{"m":72,"g":94},"5c54ef03":{"m":72,"g":94},"c6a48521":{"m":72,"g":94},"4f678c87":{"m":72,"g":94},"e79f7420":{"m":72,"g":94},"ac053100":{"m":72,"g":94},"d5d80ab4":{"m":72,"g":94},"ddcf9fe3":{"m":72,"g":94},"6252ade9":{"m":72,"g":94},"1eb8eade":{"m":72,"g":94},"3c7bfd7e":{"m":72,"g":94},"bb121214":{"m":72,"g":94},"55de40f7":{"m":72,"g":94},"6b0aeb58":{"m":72,"g":94},"bb3e5268":{"m":72,"g":94},"f93e9158":{"m":72,"g":94},"55a7ec38":{"m":72,"g":94},"fe0673f1":{"m":72,"g":94},"99c1b9d2":{"m":72,"g":94},"634a3561":{"m":72,"g":94},"424848d2":{"m":72,"g":94},"e5ce395a":{"m":72,"g":94},"f983213a":{"m":72,"g":94},"67fc595b":{"m":72,"g":94},"07ab4d4a":{"m":72,"g":94},"522e18ea":{"m":72,"g":94},"c51dc2cc":{"m":72,"g":94},"ddf39d3f":{"m":72,"g":94},"2eab1132":{"m":72,"g":94},"058d199d":{"m":72,"g":94},"9c58e68b":{"m":73,"g":94},"d03b3467":{"m":73,"g":94},"ab7fba0e":{"m":73,"g":94},"bc1534ff":{"m":73,"g":94},"3a391812":{"m":73,"g":94},"800bf018":{"m":73,"g":94},"b16af90b":{"m":73,"g":94},"98c73d71":{"m":73,"g":94},"fcc2e37f":{"m":73,"g":94},"0804dd11":{"m":73,"g":94},"55dc8e4d":{"m":73,"g":94},"02e9e9f1":{"m":73,"g":94},"8f0b6313":{"m":73,"g":94},"b9b3b098":{"m":73,"g":94},"aee30630":{"m":73,"g":94},"
286e6540":{"m":73,"g":94},"718c391f":{"m":73,"g":94},"6aaeb848":{"m":74,"g":94},"3623b6a7":{"m":74,"g":94},"4ff12642":{"m":74,"g":94},"2a4cbad8":{"m":74,"g":94},"2937387a":{"m":74,"g":94},"cf721fde":{"m":74,"g":94},"45de8971":{"m":74,"g":94},"71046fcd":{"m":74,"g":94},"c76040e3":{"m":74,"g":94},"2f6bacee":{"m":74,"g":94},"40148041":{"m":74,"g":94},"ad46550d":{"m":74,"g":94},"14344caa":{"m":74,"g":94},"f7f88b70":{"m":74,"g":94},"18c27131":{"m":74,"g":94},"ccdd10c8":{"m":74,"g":94},"76f6c0eb":{"m":74,"g":94},"959a3143":{"m":74,"g":94},"6412c5e4":{"m":74,"g":94},"0c020860":{"m":74,"g":94},"85ef7f64":{"m":74,"g":94},"f1cf6eef":{"m":74,"g":94},"0a59a465":{"m":74,"g":94},"aff79f10":{"m":74,"g":94},"01603318":{"m":74,"g":94},"56c39a05":{"m":74,"g":94},"4068e012":{"m":74,"g":94},"817d4370":{"m":74,"g":94},"c550e52f":{"m":74,"g":94},"e35a93fa":{"m":74,"g":94},"2c3656f2":{"m":74,"g":94},"d40ee62b":{"m":74,"g":94},"91b19949":{"m":74,"g":94},"7c866711":{"m":74,"g":94},"10b544ae":{"m":74,"g":94},"01090e8a":{"m":74,"g":94},"6f43a9b9":{"m":74,"g":94},"0540fef7":{"m":74,"g":94},"481f608b":{"m":74,"g":94},"ed91561f":{"m":74,"g":94},"6e7239f9":{"m":74,"g":94},"0a3960f2":{"m":74,"g":94},"07f94463":{"m":74,"g":94},"e0917e6b":{"m":74,"g":94},"7130a7ce":{"m":74,"g":94},"8f1f614e":{"m":74,"g":94},"7140ba35":{"m":74,"g":94},"d1da58e2":{"m":74,"g":94},"1cf63485":{"m":74,"g":94},"ff2ce0b8":{"m":74,"g":94},"0f2a2e3c":{"m":74,"g":94},"690e1f23":{"m":74,"g":94},"00f42707":{"m":74,"g":94},"6a02b32d":{"m":74,"g":94},"3a08f546":{"m":74,"g":94},"dce303e2":{"m":74,"g":94},"4d27eb9a":{"m":74,"g":94},"d3ecd632":{"m":74,"g":94},"cd909455":{"m":74,"g":94},"bde24ab3":{"m":74,"g":94},"bf2eefc0":{"m":74,"g":94},"5524e7d0":{"m":74,"g":94},"e187a3d5":{"m":74,"g":94},"3dd4feae":{"m":74,"g":94},"2ac189ed":{"m":74,"g":94},"5a6400ee":{"m":74,"g":94},"cf0ccd40":{"m":74,"g":94},"3d56585a":{"m":74,"g":94},"00d25a7f":{"m":74,"g":94},"1a5023e0":{"m":74,"g":94},"23308a90":{"m":74,"g":94},"ac698850":{"m":74,"g":94},"aa
957102":{"m":74,"g":94},"007f8b3d":{"m":74,"g":94},"4455b26e":{"m":74,"g":94},"c553e160":{"m":74,"g":94},"7c0541b3":{"m":74,"g":94},"e8a69e4d":{"m":74,"g":94},"fbd56002":{"m":74,"g":94},"730d084f":{"m":74,"g":94},"4a05bdfa":{"m":74,"g":94},"eb06dbcb":{"m":74,"g":94},"9dfafa74":{"m":74,"g":94},"f1d09a65":{"m":74,"g":94},"df84ab2a":{"m":74,"g":94},"34c88987":{"m":74,"g":94},"0dd6cda2":{"m":74,"g":94},"9fb48f95":{"m":74,"g":94},"89ccb533":{"m":74,"g":94},"dceb256f":{"m":74,"g":94},"0e90ae62":{"m":74,"g":94},"1361ab9e":{"m":74,"g":94},"5c7dd14b":{"m":74,"g":94},"8abf74e3":{"m":74,"g":94},"ee132a45":{"m":74,"g":94},"79a321af":{"m":74,"g":94},"6eec3cdc":{"m":74,"g":94},"48473684":{"m":74,"g":94},"b3251e9f":{"m":74,"g":94},"2cadd51d":{"m":74,"g":94},"4a893d14":{"m":74,"g":94},"8d323e95":{"m":74,"g":94},"0fe7c13b":{"m":74,"g":94},"08c4d764":{"m":74,"g":94},"96d0e37f":{"m":74,"g":94},"90bb2be2":{"m":74,"g":94},"b93ef5e5":{"m":74,"g":94},"d4017a6b":{"m":74,"g":94},"d052f4c8":{"m":74,"g":94},"e1aaa79a":{"m":74,"g":94},"20c81199":{"m":74,"g":94},"70866b6f":{"m":74,"g":94},"eb61f5c9":{"m":74,"g":94},"0beea450":{"m":74,"g":94},"c827c671":{"m":74,"g":94},"b55a621f":{"m":74,"g":94},"ffa1b3e3":{"m":74,"g":94},"7e3bb527":{"m":74,"g":94},"96263f27":{"m":74,"g":94},"9376ac36":{"m":74,"g":94},"94a2b9d3":{"m":74,"g":94},"3c3eb374":{"m":74,"g":94},"d557319a":{"m":74,"g":94},"95085d65":{"m":74,"g":94},"c7f25446":{"m":74,"g":94},"63ee26d1":{"m":74,"g":94},"ad55f171":{"m":74,"g":94},"361971b8":{"m":74,"g":94},"13bc39c5":{"m":74,"g":94},"9854a18a":{"m":74,"g":94},"ebddb65a":{"m":74,"g":94},"19fd57bc":{"m":74,"g":94},"ba80c102":{"m":75,"g":94},"fbdb5050":{"m":75,"g":94},"f0afaf52":{"m":75,"g":94},"85d2365d":{"m":75,"g":94},"5fe79605":{"m":75,"g":94},"c6d7f8d3":{"m":75,"g":94},"a5a892ff":{"m":75,"g":94},"8e66fbec":{"m":75,"g":94},"f141298a":{"m":75,"g":94},"4fea040c":{"m":75,"g":94},"1099f6c9":{"m":76,"g":94},"04e3ff69":{"m":76,"g":94},"45fdf1f7":{"m":76,"g":94},"d89c0e4b":{"m":76,"g":94},"fa3c
9e06":{"m":76,"g":94},"0d658ac3":{"m":76,"g":94},"ced35a06":{"m":76,"g":94},"26f07294":{"m":76,"g":94},"34e07a65":{"m":76,"g":94},"15ddd843":{"m":76,"g":94},"52029bd1":{"m":76,"g":94},"eb934bdf":{"m":76,"g":94},"e45ae444":{"m":76,"g":94},"ac3fae84":{"m":76,"g":94},"2d1b83e5":{"m":76,"g":94},"199bb01d":{"m":76,"g":94},"6b7038ba":{"m":76,"g":94},"57eec0bf":{"m":76,"g":94},"f01b0925":{"m":76,"g":94},"14269198":{"m":76,"g":94},"9b7cf9ee":{"m":76,"g":94},"1e86457c":{"m":76,"g":94},"64129fa6":{"m":76,"g":94},"e9f8e423":{"m":76,"g":94},"22c3702e":{"m":76,"g":94},"4c584fc6":{"m":76,"g":94},"77cf771e":{"m":76,"g":94},"8154de5a":{"m":76,"g":94},"c11cfda0":{"m":76,"g":94},"64edeb79":{"m":76,"g":94},"65c24c28":{"m":76,"g":94},"3980ff1b":{"m":76,"g":94},"5d7edc8e":{"m":76,"g":94},"af6535e7":{"m":76,"g":94},"93cf7fc5":{"m":76,"g":94},"2a206b22":{"m":76,"g":94},"4d253057":{"m":76,"g":94},"11577ced":{"m":76,"g":94},"ca75741e":{"m":76,"g":94},"c6d549e7":{"m":76,"g":94},"3c09548d":{"m":76,"g":94},"8796cebb":{"m":76,"g":94},"c2bd094d":{"m":76,"g":94},"f8f9244a":{"m":76,"g":94},"ecbfe58b":{"m":76,"g":94},"8f163b16":{"m":76,"g":94},"e7a8610d":{"m":76,"g":94},"a2cc62a6":{"m":76,"g":94},"fb888603":{"m":76,"g":94},"321ab756":{"m":76,"g":94},"38f25e87":{"m":76,"g":94},"8cd42504":{"m":76,"g":94},"6a384d5c":{"m":76,"g":94},"f69e0696":{"m":76,"g":94},"f6ab4ca6":{"m":76,"g":94},"c7c7dbeb":{"m":76,"g":94},"417fc72f":{"m":76,"g":94},"c6ec7029":{"m":76,"g":94},"4c56e5db":{"m":76,"g":94},"7b5fc719":{"m":76,"g":94},"ad4e58bf":{"m":76,"g":94},"bfb03c61":{"m":76,"g":94},"b36ab493":{"m":76,"g":94},"9e93ef3f":{"m":76,"g":94},"fad86a68":{"m":76,"g":94},"df7014a8":{"m":76,"g":94},"49420741":{"m":76,"g":94},"ba52fd18":{"m":76,"g":94},"b6944f97":{"m":76,"g":94},"f44db16c":{"m":76,"g":94},"f9c53cbb":{"m":76,"g":94},"90532b76":{"m":76,"g":94},"c0e9a36c":{"m":76,"g":94},"588865f0":{"m":76,"g":94},"3196999f":{"m":76,"g":94},"9e0186f3":{"m":76,"g":94},"8baf9a0c":{"m":76,"g":94},"c7872985":{"m":76,"g":94},"45212c
e1":{"m":76,"g":94},"c16b33cc":{"m":76,"g":94},"2d004512":{"m":76,"g":94},"804d250a":{"m":76,"g":94},"dd865bef":{"m":76,"g":94},"d373a48c":{"m":76,"g":94},"98be3bd3":{"m":76,"g":94},"a98290ae":{"m":76,"g":94},"9b81f9bd":{"m":76,"g":94},"f81a27f6":{"m":76,"g":94},"988ab646":{"m":76,"g":94},"3ded4b21":{"m":76,"g":94},"f4d7ab7a":{"m":76,"g":94},"c38ca4fc":{"m":76,"g":94},"82dec1f7":{"m":76,"g":94},"5f9b2c62":{"m":76,"g":94},"5493c334":{"m":76,"g":94},"f2ab37e5":{"m":76,"g":94},"91ba98fe":{"m":76,"g":94},"c614dbdf":{"m":76,"g":94},"927ca935":{"m":76,"g":94},"ef3c2dd0":{"m":76,"g":94},"75b65648":{"m":76,"g":94},"0f52fb55":{"m":76,"g":94},"d6d21640":{"m":76,"g":94},"0212d2e2":{"m":76,"g":94},"8cc300f5":{"m":76,"g":94},"452db508":{"m":76,"g":94},"d1112d85":{"m":76,"g":94},"48efec7b":{"m":76,"g":94},"9b8333d9":{"m":76,"g":94},"f5bbf603":{"m":76,"g":94},"5cbd709e":{"m":76,"g":94},"2e4a1e2d":{"m":76,"g":94},"9d02bb3e":{"m":76,"g":94},"402db5c5":{"m":76,"g":94},"754a0e82":{"m":76,"g":94},"799fb5f4":{"m":76,"g":94},"25e1816e":{"m":76,"g":94},"a53fe428":{"m":76,"g":94},"1b859295":{"m":76,"g":94},"9971dc22":{"m":76,"g":94},"3db35c1a":{"m":76,"g":94},"52a34d74":{"m":76,"g":94},"06d12b39":{"m":76,"g":94},"c30976fb":{"m":76,"g":94},"1a3fa75f":{"m":76,"g":94},"81f431ed":{"m":76,"g":94},"65b7c9b7":{"m":76,"g":94},"2c4f5cca":{"m":76,"g":94},"15843047":{"m":76,"g":94},"8ec2ce07":{"m":76,"g":94},"1fd0cf8a":{"m":76,"g":94},"bf63ee54":{"m":76,"g":94},"22c96f78":{"m":76,"g":94},"2892b9bb":{"m":76,"g":94},"470b4740":{"m":76,"g":94},"26c372c1":{"m":76,"g":94},"86d9baed":{"m":76,"g":94},"21d485f8":{"m":76,"g":94},"035ac2ab":{"m":76,"g":94},"e1a5e7e4":{"m":76,"g":94},"ad1ae7f7":{"m":76,"g":94},"e73167ad":{"m":76,"g":94},"862fe522":{"m":76,"g":94},"61e4433c":{"m":76,"g":94},"660305c3":{"m":76,"g":94},"642ab418":{"m":76,"g":94},"1ce4878d":{"m":76,"g":94},"977d7cd2":{"m":76,"g":94},"0e0ec702":{"m":76,"g":94},"bb378556":{"m":76,"g":94},"19e96e59":{"m":77,"g":94},"aa08aeac":{"m":77,"g":94},"d8a136a1
":{"m":77,"g":94},"20c90be2":{"m":77,"g":94},"ec3ee028":{"m":77,"g":94},"92941ce7":{"m":77,"g":94},"2bb0e7cf":{"m":77,"g":94},"72549263":{"m":77,"g":94},"044c3159":{"m":77,"g":94},"4db29e82":{"m":77,"g":94},"c483377e":{"m":77,"g":94},"74e0ac1d":{"m":77,"g":94},"ef9a378a":{"m":77,"g":94},"6dea5c96":{"m":77,"g":94},"6ffb6bd4":{"m":77,"g":94},"47e6628a":{"m":77,"g":94},"7907f9eb":{"m":77,"g":94},"8c04f0f2":{"m":77,"g":94},"265e7564":{"m":77,"g":94},"d3f71f5e":{"m":77,"g":94},"5eae67cb":{"m":77,"g":94},"6dbf9998":{"m":77,"g":94},"e0166f8a":{"m":77,"g":94},"53a2c3b4":{"m":77,"g":94},"550586ef":{"m":77,"g":94},"cf29fe9e":{"m":77,"g":94},"26c0f131":{"m":77,"g":94},"f9970bd1":{"m":77,"g":94},"2e0f94ab":{"m":77,"g":94},"18317ddc":{"m":77,"g":94},"e2e2ab70":{"m":77,"g":94},"0d3e3072":{"m":77,"g":94},"62dd9587":{"m":77,"g":94},"72031173":{"m":77,"g":94},"9fdc6d6a":{"m":77,"g":94},"42a45df0":{"m":77,"g":94},"04eb6062":{"m":77,"g":94},"e84f4ba0":{"m":77,"g":94},"b149b393":{"m":77,"g":94},"31dfff7d":{"m":77,"g":94},"10a9ab7b":{"m":77,"g":94},"bb0fd749":{"m":77,"g":94},"7f19e083":{"m":77,"g":94},"98a2cfa9":{"m":77,"g":94},"2a882e8f":{"m":77,"g":94},"e6e4d022":{"m":77,"g":94},"188105a2":{"m":77,"g":94},"b3953258":{"m":77,"g":94},"5fa3058f":{"m":77,"g":94},"bbab97a6":{"m":77,"g":94},"0bc0bf57":{"m":77,"g":94},"f60f2931":{"m":77,"g":94},"17000d2b":{"m":77,"g":94},"668ecc6c":{"m":77,"g":94},"886fcbdd":{"m":77,"g":94},"8bf6d7f4":{"m":77,"g":94},"1b9175cb":{"m":77,"g":94},"92bb49a7":{"m":77,"g":94},"6f5cc5eb":{"m":77,"g":94},"c913ed40":{"m":77,"g":94},"1afe3d07":{"m":77,"g":94},"44f47d3e":{"m":77,"g":94},"ae25d36d":{"m":77,"g":94},"35e0856b":{"m":78,"g":94},"aba5ca15":{"m":78,"g":94},"496dde84":{"m":78,"g":94},"bcbbf519":{"m":78,"g":94},"0d99adb7":{"m":78,"g":94},"efbae697":{"m":78,"g":94},"ca8d02ab":{"m":78,"g":94},"3f287b85":{"m":78,"g":94},"7ed77d6b":{"m":78,"g":94},"4c54f442":{"m":78,"g":94},"924ca7c9":{"m":78,"g":94},"6ff9c6a5":{"m":78,"g":94},"77e929a1":{"m":78,"g":94},"febe21ce":
{"m":78,"g":94},"a995a773":{"m":78,"g":94},"31035dda":{"m":78,"g":94},"913e38df":{"m":78,"g":94},"d95269f9":{"m":78,"g":94},"e53bf190":{"m":78,"g":94},"3289c120":{"m":78,"g":94},"69df9761":{"m":78,"g":94},"98f768d1":{"m":78,"g":94},"d7954b76":{"m":78,"g":94},"74885a84":{"m":78,"g":94},"b8b6008f":{"m":78,"g":94},"8e10fec9":{"m":78,"g":94},"e8999b13":{"m":78,"g":94},"772d2a19":{"m":78,"g":94},"9d0b36c4":{"m":78,"g":94},"7d8c0ce7":{"m":78,"g":94},"e41549c3":{"m":78,"g":94},"cccfc10e":{"m":78,"g":94},"a2aea59b":{"m":78,"g":94},"2c8fd993":{"m":78,"g":94},"31da75ab":{"m":78,"g":94},"e983e432":{"m":78,"g":94},"e9c6ce46":{"m":78,"g":94},"3fadc647":{"m":78,"g":94},"e119f042":{"m":78,"g":94},"9eb49e87":{"m":78,"g":94},"12047f5e":{"m":78,"g":94},"fda6bb78":{"m":78,"g":94},"23c764b1":{"m":78,"g":94},"87fafa01":{"m":78,"g":94},"1c63e797":{"m":78,"g":94},"ee47a6c1":{"m":78,"g":94},"6384d317":{"m":78,"g":94},"5cb552b1":{"m":78,"g":94},"c7457191":{"m":78,"g":94},"51ac297a":{"m":78,"g":94},"a169b9f8":{"m":78,"g":94},"4a63bc32":{"m":78,"g":94},"a303325f":{"m":78,"g":94},"42873eac":{"m":78,"g":94},"4814ecaf":{"m":78,"g":94},"e62d60fe":{"m":78,"g":94},"032f8faa":{"m":78,"g":94},"37c66ec8":{"m":78,"g":94},"9adf178c":{"m":78,"g":94},"f842853a":{"m":78,"g":94},"195a09f5":{"m":78,"g":94},"9fccda31":{"m":78,"g":94},"4ede6770":{"m":78,"g":94},"b26bc86b":{"m":78,"g":94},"5ec5eaf7":{"m":78,"g":94},"0d7fe866":{"m":78,"g":94},"54b9a2de":{"m":78,"g":94},"8e7b3154":{"m":78,"g":94},"45dcfc2e":{"m":78,"g":94},"ddf8981d":{"m":78,"g":94},"400ad660":{"m":78,"g":94},"05625b97":{"m":78,"g":94},"736502d4":{"m":78,"g":94},"8690c40b":{"m":78,"g":94},"b1cfb4e9":{"m":78,"g":94},"57f99608":{"m":79,"g":94},"81992474":{"m":79,"g":94},"f04c80dc":{"m":79,"g":94},"d1bb1711":{"m":79,"g":94},"5b5c7237":{"m":80,"g":94},"a42736bb":{"m":80,"g":94},"dd83e7e9":{"m":80,"g":94},"0769b14b":{"m":80,"g":94},"b64b88e7":{"m":80,"g":94},"bc24205b":{"m":80,"g":94},"3efc8e2d":{"m":80,"g":94},"27a009bb":{"m":80,"g":94},"8ec0bb7d":{"
m":80,"g":94},"fa909dc3":{"m":80,"g":94},"e8f62b20":{"m":80,"g":94},"88defc4d":{"m":80,"g":94},"6f509d55":{"m":80,"g":94},"12ef7e3b":{"m":80,"g":94},"838fa0f2":{"m":80,"g":94},"f1b3b75f":{"m":80,"g":94},"33b16ad1":{"m":80,"g":94},"ffde65a0":{"m":80,"g":94},"471650de":{"m":80,"g":94},"d06a83fb":{"m":80,"g":94},"5d134401":{"m":80,"g":94},"f88f7e19":{"m":80,"g":94},"3dfc6023":{"m":80,"g":94},"15e91d72":{"m":80,"g":94},"8aab7fdb":{"m":80,"g":94},"e940dc4f":{"m":80,"g":94},"388e15c0":{"m":80,"g":94},"11421a3f":{"m":80,"g":94},"6c41fcf0":{"m":80,"g":94},"ee9d6ca6":{"m":80,"g":94},"2dd64894":{"m":80,"g":94},"61e7c4dd":{"m":80,"g":94},"dae79444":{"m":80,"g":94},"f6772f14":{"m":80,"g":94},"ac5b78ba":{"m":80,"g":94},"38076dea":{"m":80,"g":94},"5e0a9b09":{"m":80,"g":94},"bdde2375":{"m":80,"g":94},"e9fc2ac7":{"m":80,"g":94},"44afde82":{"m":80,"g":94},"072df753":{"m":80,"g":94},"defede50":{"m":80,"g":94},"fc728719":{"m":80,"g":94},"14e8bd88":{"m":80,"g":94},"adca585b":{"m":80,"g":94},"39d90449":{"m":80,"g":94},"39e41138":{"m":80,"g":94},"5fbafbb8":{"m":80,"g":94},"a9499885":{"m":80,"g":94},"f7655790":{"m":80,"g":94},"f58b929a":{"m":80,"g":94},"c1270aab":{"m":80,"g":94},"8311b07f":{"m":80,"g":94},"c1380257":{"m":80,"g":94},"b62e7e99":{"m":80,"g":94},"7d3b7c87":{"m":80,"g":94},"75015bb6":{"m":80,"g":94},"b371f7cd":{"m":80,"g":94},"812e82f3":{"m":80,"g":94},"4879e50c":{"m":80,"g":94},"bc92107b":{"m":80,"g":94},"3e4794aa":{"m":80,"g":94},"690ec205":{"m":80,"g":94},"2074a2e6":{"m":80,"g":94},"57de7c6b":{"m":80,"g":94},"115ae2e7":{"m":80,"g":94},"aea98512":{"m":80,"g":94},"e4155e96":{"m":80,"g":94},"1b1b47a9":{"m":80,"g":94},"3c9740d2":{"m":80,"g":94},"2eb55770":{"m":80,"g":94},"f65b8d5c":{"m":80,"g":94},"5ad05719":{"m":80,"g":94},"34ef6c81":{"m":80,"g":94},"61172091":{"m":80,"g":94},"4f288113":{"m":80,"g":94},"136b8e6a":{"m":80,"g":94},"034c5256":{"m":80,"g":94},"c1dd773c":{"m":80,"g":94},"6f859379":{"m":80,"g":94},"f774a0d2":{"m":80,"g":94},"60bcbf2a":{"m":80,"g":94},"a0a9f6d6":{"m"
:80,"g":94},"80aa8ca8":{"m":80,"g":94},"4aa6bab0":{"m":80,"g":94},"c35dcfdb":{"m":80,"g":94},"c163bf4f":{"m":80,"g":94},"55986343":{"m":80,"g":94},"b75275b6":{"m":80,"g":94},"7074e9ca":{"m":80,"g":94},"fc14cca0":{"m":80,"g":94},"e7beff8a":{"m":80,"g":94},"4d2e3051":{"m":80,"g":94},"e53a0b3d":{"m":80,"g":94},"038bc5d5":{"m":80,"g":94},"aee62d74":{"m":80,"g":94},"cd7e32e2":{"m":80,"g":94},"88799448":{"m":80,"g":94},"a879811c":{"m":80,"g":94},"a222945d":{"m":80,"g":94},"ed01b451":{"m":80,"g":94},"d050df36":{"m":80,"g":94},"76f44c2a":{"m":80,"g":94},"1078396f":{"m":80,"g":94},"7e4f72dd":{"m":80,"g":94},"4c31ae9f":{"m":80,"g":94},"f730362e":{"m":80,"g":94},"e3c4bd31":{"m":80,"g":94},"5db37c86":{"m":80,"g":94},"4cb53ecd":{"m":80,"g":94},"456b008b":{"m":80,"g":94},"ebf495f0":{"m":80,"g":94},"7f875f12":{"m":80,"g":94},"fbebcb7a":{"m":80,"g":94},"87eddedf":{"m":80,"g":94},"40652482":{"m":80,"g":94},"86a876d8":{"m":80,"g":94},"92823069":{"m":80,"g":94},"d2e507df":{"m":80,"g":94},"61970b08":{"m":80,"g":94},"76c48a09":{"m":80,"g":94},"90caf06c":{"m":80,"g":94},"6669d127":{"m":80,"g":94},"f2b70afd":{"m":80,"g":94},"bc3f6db2":{"m":80,"g":94},"aac531c5":{"m":80,"g":94},"39efad4f":{"m":80,"g":94},"466899e6":{"m":80,"g":94},"11d760d5":{"m":80,"g":94},"5039d547":{"m":80,"g":94},"d09a51f1":{"m":80,"g":94},"f8194b26":{"m":80,"g":94},"6d3b35fa":{"m":80,"g":94},"a73c4df4":{"m":80,"g":94},"89a55418":{"m":80,"g":94},"2695ab05":{"m":80,"g":94},"88d6fd9a":{"m":80,"g":94},"cc88d98a":{"m":80,"g":94},"3033c11a":{"m":80,"g":94},"fd5a55cf":{"m":80,"g":94},"804d9f2e":{"m":80,"g":94},"a7c3f74b":{"m":80,"g":94},"5a144a8a":{"m":80,"g":94},"27f8e6b9":{"m":80,"g":94},"afb752bc":{"m":80,"g":94},"9731eca7":{"m":80,"g":94},"7c5658c1":{"m":80,"g":94},"9798e72b":{"m":80,"g":94},"ade714a6":{"m":80,"g":94},"93470a14":{"m":80,"g":94},"db452760":{"m":80,"g":94},"fbdc94ba":{"m":81,"g":94},"b54b5a96":{"m":81,"g":94},"bca832c7":{"m":81,"g":94},"d9dd5298":{"m":81,"g":94},"0a0dd34e":{"m":81,"g":94},"80ac527d":{"m":8
1,"g":94},"99456bca":{"m":81,"g":94},"d07e797a":{"m":81,"g":94},"c555d794":{"m":81,"g":94},"e2574ee9":{"m":81,"g":94},"ab4b5606":{"m":81,"g":94},"20f1c8e3":{"m":81,"g":94},"613b197e":{"m":81,"g":94},"d58e3544":{"m":81,"g":94},"bf86c5e9":{"m":81,"g":94},"dca90f1d":{"m":81,"g":94},"0961feef":{"m":81,"g":94},"59dd090f":{"m":81,"g":94},"569b032c":{"m":81,"g":94},"f6a71139":{"m":81,"g":94},"1e0806f3":{"m":81,"g":94},"2c11f9c2":{"m":81,"g":94},"a6f892e5":{"m":81,"g":94},"08b518d5":{"m":81,"g":94},"4db463b1":{"m":81,"g":94},"bfa39224":{"m":81,"g":94},"e465b08d":{"m":81,"g":94},"bed05878":{"m":81,"g":94},"b2a189dd":{"m":81,"g":94},"f28d8299":{"m":81,"g":94},"8e09b370":{"m":81,"g":94},"53dcf388":{"m":81,"g":94},"1effba4c":{"m":81,"g":94},"a0fc5bc1":{"m":81,"g":94},"27e9538a":{"m":81,"g":94},"211c7b31":{"m":81,"g":94},"c08a717c":{"m":81,"g":94},"f13d65a7":{"m":81,"g":94},"06d0a3d9":{"m":81,"g":94},"22c2a79d":{"m":81,"g":94},"8beb356f":{"m":81,"g":94},"c776234b":{"m":81,"g":94},"3bface15":{"m":81,"g":94},"6fb29ffd":{"m":81,"g":94},"4fb05583":{"m":81,"g":94},"81c89111":{"m":81,"g":94},"92d1561b":{"m":81,"g":94},"8f783c19":{"m":81,"g":94},"90faf901":{"m":81,"g":94},"177320a5":{"m":81,"g":94},"d7bc19a4":{"m":81,"g":94},"85ec0440":{"m":81,"g":94},"06a1656e":{"m":81,"g":94},"6aca5834":{"m":81,"g":94},"b9c87e78":{"m":82,"g":94},"968ef515":{"m":82,"g":94},"13432002":{"m":82,"g":94},"c2942907":{"m":82,"g":94},"e69a2190":{"m":82,"g":94},"bf98d2e3":{"m":82,"g":94},"e65b9f21":{"m":82,"g":94},"4dce1cc6":{"m":82,"g":94},"deded17f":{"m":82,"g":94},"f29a718f":{"m":82,"g":94},"3f57b00a":{"m":82,"g":94},"453d412c":{"m":82,"g":94},"dc86f25a":{"m":82,"g":94},"08289eaa":{"m":82,"g":94},"3b6d539f":{"m":82,"g":94},"57131dd9":{"m":82,"g":94},"a7591ecf":{"m":82,"g":94},"c44f2869":{"m":82,"g":94},"685d8980":{"m":82,"g":94},"70645f4d":{"m":82,"g":94},"188f0955":{"m":82,"g":94},"eef9433b":{"m":82,"g":94},"97cb762b":{"m":82,"g":94},"11951820":{"m":82,"g":94},"5239d795":{"m":82,"g":94},"f0815419":{"m":82,
"g":94},"2b3bdc93":{"m":82,"g":94},"5fc4b600":{"m":82,"g":94},"b868526d":{"m":82,"g":94},"502524e2":{"m":82,"g":94},"4c764007":{"m":82,"g":94},"9f3bd2ad":{"m":82,"g":94},"8de53da9":{"m":82,"g":94},"fac17acf":{"m":82,"g":94},"8b39274e":{"m":82,"g":94},"5156d5a4":{"m":82,"g":94},"c951d312":{"m":82,"g":94},"dcb82325":{"m":82,"g":94},"66c0ff9e":{"m":82,"g":94},"9a7e83e8":{"m":82,"g":94},"417b44eb":{"m":82,"g":94},"475e2e37":{"m":82,"g":94},"fba86b6b":{"m":82,"g":94},"072b4d03":{"m":82,"g":94},"9c434777":{"m":82,"g":94},"fa2f677e":{"m":82,"g":94},"463d4b74":{"m":82,"g":94},"9924bbe1":{"m":82,"g":94},"84022c0e":{"m":83,"g":94},"f9fb33ef":{"m":83,"g":94},"a38f6932":{"m":83,"g":94},"beb65c74":{"m":83,"g":94},"621e96bf":{"m":83,"g":94},"35ca04d2":{"m":83,"g":94},"3c4e0ee6":{"m":83,"g":94},"9c088829":{"m":83,"g":94},"005aad32":{"m":83,"g":94},"4d23ba08":{"m":83,"g":94},"6e313c1b":{"m":83,"g":94},"a45a4b23":{"m":83,"g":94},"981a2619":{"m":83,"g":94},"8ba31330":{"m":83,"g":94},"02102063":{"m":83,"g":94},"7e944246":{"m":83,"g":94},"a086a113":{"m":83,"g":94},"bdbe5f81":{"m":83,"g":94},"9ad28f63":{"m":83,"g":94},"d7b1ce65":{"m":83,"g":94},"f55933e1":{"m":83,"g":94},"408ba022":{"m":83,"g":94},"094891c0":{"m":83,"g":94},"a21ef363":{"m":83,"g":94},"3c4dc38a":{"m":83,"g":94},"d8fbc7c0":{"m":83,"g":94},"c5e1026f":{"m":83,"g":94},"799c4bb5":{"m":83,"g":94},"02723e1b":{"m":83,"g":94},"df2cf583":{"m":83,"g":94},"133ded03":{"m":83,"g":94},"f87a6ab3":{"m":83,"g":94},"eebfdb94":{"m":83,"g":94},"dfb32264":{"m":83,"g":94},"63c13a2c":{"m":83,"g":94},"4d1e52ab":{"m":83,"g":94},"155890e4":{"m":83,"g":94},"1f963d7f":{"m":83,"g":94},"04d0123f":{"m":83,"g":94},"feda9b11":{"m":83,"g":94},"c3948ba6":{"m":83,"g":94},"269c457e":{"m":83,"g":94},"18ce468d":{"m":83,"g":94},"21514ff5":{"m":83,"g":94},"5641a094":{"m":83,"g":94},"3dd3538c":{"m":83,"g":94},"93c6fb12":{"m":83,"g":94},"11e27d09":{"m":83,"g":94},"50eda839":{"m":83,"g":94},"c55550cb":{"m":83,"g":94},"43fb95c2":{"m":83,"g":94},"7d9679b7":{"m":83,"g
":94},"b5be5694":{"m":83,"g":94},"d2b8d0b8":{"m":83,"g":94},"a14654dd":{"m":83,"g":94},"5d93a950":{"m":83,"g":94},"c998d04b":{"m":83,"g":94},"7d0edf3c":{"m":83,"g":94},"ce4ecba4":{"m":83,"g":94},"b1f6d89b":{"m":83,"g":94},"7c99103f":{"m":83,"g":94},"de071366":{"m":83,"g":94},"e0673969":{"m":83,"g":94},"127ff898":{"m":83,"g":94},"8777a1d2":{"m":83,"g":94},"711efe78":{"m":83,"g":94},"fbb5f229":{"m":83,"g":94},"15fabcc0":{"m":83,"g":94},"e62c4955":{"m":83,"g":94},"71d1785f":{"m":83,"g":94},"3f87f831":{"m":83,"g":94},"ce5412b6":{"m":83,"g":94},"7282ab74":{"m":83,"g":94},"b0feda09":{"m":83,"g":94},"6b6e7487":{"m":83,"g":94},"91732486":{"m":83,"g":94},"2ed96c7a":{"m":83,"g":94},"2aa3f5e2":{"m":83,"g":94},"76d17c7e":{"m":83,"g":94},"70d040f9":{"m":83,"g":94},"4418f599":{"m":83,"g":94},"04f2abcb":{"m":83,"g":94},"506be6b8":{"m":83,"g":94},"2343d8df":{"m":83,"g":94},"92bb64bc":{"m":83,"g":94},"11b23ae9":{"m":83,"g":94},"dcae1fb2":{"m":84,"g":94},"a0251a3f":{"m":84,"g":94},"663037a7":{"m":84,"g":94},"f4a9f60c":{"m":84,"g":94},"ee71ed8a":{"m":84,"g":94},"d364b9b0":{"m":84,"g":94},"849c83a0":{"m":84,"g":94},"d73ddeb1":{"m":84,"g":94},"f48b007c":{"m":84,"g":94},"74cb12a8":{"m":84,"g":94},"c6c62640":{"m":84,"g":94},"92ab0a20":{"m":84,"g":94},"e132cba2":{"m":84,"g":94},"0045f4b2":{"m":84,"g":94},"8601300b":{"m":84,"g":94},"6fa6f38e":{"m":84,"g":94},"693723d1":{"m":84,"g":94},"966eb908":{"m":84,"g":94},"644ed409":{"m":84,"g":94},"3029889c":{"m":84,"g":94},"ef15dcda":{"m":84,"g":94},"ad4df307":{"m":84,"g":94},"41ac0c6d":{"m":84,"g":94},"84810da4":{"m":84,"g":94},"40d9b8ac":{"m":84,"g":94},"f0365820":{"m":84,"g":94},"86317c09":{"m":84,"g":94},"daed453e":{"m":84,"g":94},"ded04b2e":{"m":84,"g":94},"9858113c":{"m":85,"g":94},"8441baad":{"m":85,"g":94},"256c4c25":{"m":85,"g":94},"9f21e754":{"m":85,"g":94},"7bcd8b1c":{"m":85,"g":94},"11383cec":{"m":85,"g":94},"e97e57e6":{"m":85,"g":94},"9a6ad891":{"m":85,"g":94},"d353d08b":{"m":85,"g":94},"08acdb5c":{"m":85,"g":94},"2afba1b1":{"m":85,"g":
94},"e330f2b8":{"m":85,"g":94},"3ddf5b9d":{"m":85,"g":94},"3cff9633":{"m":85,"g":94},"d50e36a7":{"m":85,"g":94},"8fefdd32":{"m":85,"g":94},"403b855a":{"m":85,"g":94},"1698e94e":{"m":85,"g":94},"58195dd5":{"m":85,"g":94},"799789af":{"m":85,"g":94},"cc4a80ca":{"m":85,"g":94},"3c8a5231":{"m":85,"g":94},"a043f7f2":{"m":85,"g":94},"e3a53044":{"m":85,"g":94},"28b26dbf":{"m":85,"g":94},"2b06484b":{"m":85,"g":94},"e4b6133b":{"m":85,"g":94},"dd408ee4":{"m":85,"g":94},"9419e75d":{"m":85,"g":94},"2c7dbb7c":{"m":85,"g":94},"9a62191b":{"m":85,"g":94},"ae523675":{"m":85,"g":94},"5c08aa49":{"m":85,"g":94},"f4c191a7":{"m":85,"g":94},"771669cb":{"m":85,"g":94},"1468769b":{"m":85,"g":94},"91dda4cd":{"m":85,"g":94},"8e5a6d34":{"m":85,"g":94},"8465f035":{"m":85,"g":94},"8c0cfca8":{"m":85,"g":94},"2c3ea294":{"m":85,"g":94},"5bb0accb":{"m":85,"g":94},"8d463fe3":{"m":85,"g":94},"26fc32d1":{"m":85,"g":94},"1cc32603":{"m":85,"g":94},"05ee2192":{"m":85,"g":94},"678d8cc9":{"m":86,"g":94},"d2cb3024":{"m":86,"g":94},"1940cdec":{"m":86,"g":94},"63484f9f":{"m":86,"g":94},"dff0ab92":{"m":86,"g":94},"e30c273b":{"m":86,"g":94},"0ab3f437":{"m":86,"g":94},"cec98f10":{"m":86,"g":94},"8dc4efd0":{"m":86,"g":94},"6578cf27":{"m":86,"g":94},"087751a8":{"m":86,"g":94},"911f3ba6":{"m":86,"g":94},"f6f96b05":{"m":86,"g":94},"2a936a84":{"m":86,"g":94},"5e023301":{"m":86,"g":94},"fa7d7fd9":{"m":86,"g":94},"f1ff736d":{"m":86,"g":94},"acc816d8":{"m":86,"g":94},"a05bd83a":{"m":86,"g":94},"cef91b1e":{"m":86,"g":94},"6450c122":{"m":86,"g":94},"b6cf3532":{"m":86,"g":94},"3b2680a4":{"m":86,"g":94},"79961afa":{"m":86,"g":94},"cfca4e0e":{"m":86,"g":94},"e88dd482":{"m":86,"g":94},"73600673":{"m":86,"g":94},"8f508cc7":{"m":86,"g":94},"9bddf1c8":{"m":86,"g":94},"24c13ca9":{"m":86,"g":94},"b70957fc":{"m":86,"g":94},"e444c13f":{"m":86,"g":94},"fee37d9e":{"m":86,"g":94},"c68de479":{"m":86,"g":94},"4c7b4242":{"m":86,"g":94},"38053c33":{"m":86,"g":94},"00c2c1f0":{"m":86,"g":94},"cb691945":{"m":86,"g":94},"d25398cb":{"m":86,"g":94
},"8a828666":{"m":86,"g":94},"aff584fa":{"m":86,"g":94},"6f566147":{"m":86,"g":94},"bdd17998":{"m":86,"g":94},"c9abd7be":{"m":86,"g":94},"a3e4e9bf":{"m":86,"g":94},"6d4d3bc8":{"m":86,"g":94},"5f300141":{"m":86,"g":94},"1c05425b":{"m":86,"g":94},"b26cb1c5":{"m":86,"g":94},"f8e46093":{"m":86,"g":94},"683707c3":{"m":86,"g":94},"a68ed766":{"m":86,"g":94},"82653f66":{"m":86,"g":94},"22da3d97":{"m":86,"g":94},"b8559764":{"m":86,"g":94},"56f6589e":{"m":86,"g":94},"1232f7e8":{"m":86,"g":94},"3008db9c":{"m":86,"g":94},"357fb2db":{"m":86,"g":94},"95c231e5":{"m":86,"g":94},"3042f1da":{"m":86,"g":94},"2b63798c":{"m":86,"g":94},"bf203cb7":{"m":86,"g":94},"8ebde73f":{"m":86,"g":94},"6b0fae79":{"m":86,"g":94},"141a4596":{"m":86,"g":94},"d8ab6011":{"m":86,"g":94},"6579cd7d":{"m":86,"g":94},"97ac42b6":{"m":86,"g":94},"1acca3a2":{"m":86,"g":94},"6ea1e6ac":{"m":86,"g":94},"3409aaab":{"m":86,"g":94},"73dcf2b3":{"m":86,"g":94},"170d1f21":{"m":86,"g":94},"73bc1d00":{"m":86,"g":94},"c5645e92":{"m":86,"g":94},"d33955d2":{"m":86,"g":94},"6fc17596":{"m":86,"g":94},"ad506a4e":{"m":86,"g":94},"ebaba856":{"m":86,"g":94},"de2faef9":{"m":86,"g":94},"67b7d5b1":{"m":86,"g":94},"4322c31e":{"m":86,"g":94},"16267d4f":{"m":87,"g":94},"0f5cb8ca":{"m":87,"g":94},"17299f08":{"m":87,"g":94},"5380cd7e":{"m":87,"g":94},"b2e95f62":{"m":87,"g":94},"1ab14c4c":{"m":87,"g":94},"3c32895c":{"m":87,"g":94},"ac2324c1":{"m":87,"g":94},"ef8ec07b":{"m":87,"g":94},"f24fc5b8":{"m":87,"g":94},"d18c6b33":{"m":87,"g":94},"f1c89600":{"m":87,"g":94},"983c663d":{"m":87,"g":94},"f94543d2":{"m":87,"g":94},"e8e18dcd":{"m":87,"g":94},"bad7c26f":{"m":87,"g":94},"12319a67":{"m":87,"g":94},"d738ab52":{"m":87,"g":94},"3ee40ff9":{"m":87,"g":94},"0f334945":{"m":87,"g":94},"fba8eccd":{"m":87,"g":94},"7d3a3d45":{"m":87,"g":94},"25c83fff":{"m":87,"g":94},"9f2c9568":{"m":87,"g":94},"3f2702ae":{"m":87,"g":94},"6ea05950":{"m":87,"g":94},"e7dd906c":{"m":87,"g":94},"6e2da515":{"m":87,"g":94},"e9a47f4c":{"m":87,"g":94},"03227c5f":{"m":87,"g":94},
"01bdbf7f":{"m":87,"g":94},"94d42b67":{"m":87,"g":94},"69276f61":{"m":87,"g":94},"41a645f5":{"m":87,"g":94},"23010630":{"m":87,"g":94},"45b4dcf0":{"m":87,"g":94},"213e8c7d":{"m":87,"g":94},"41273fd7":{"m":87,"g":94},"e9bebafb":{"m":87,"g":94},"4d1c9db6":{"m":87,"g":94},"17c36c55":{"m":87,"g":94},"31d1f6e7":{"m":87,"g":94},"a823c6e8":{"m":87,"g":94},"2ce87935":{"m":87,"g":94},"de167cf5":{"m":87,"g":94},"4319978c":{"m":87,"g":94},"03dd785c":{"m":87,"g":94},"66fc63d6":{"m":87,"g":94},"921e4a81":{"m":87,"g":94},"9d8ec2e6":{"m":87,"g":94},"c178abda":{"m":87,"g":94},"b29a026e":{"m":87,"g":94},"7e257cd6":{"m":88,"g":94},"c4831e2f":{"m":88,"g":94},"2e37fa07":{"m":88,"g":94},"2d831c6e":{"m":88,"g":94},"ed0c3035":{"m":88,"g":94},"e6f11356":{"m":88,"g":94},"7b02c326":{"m":88,"g":94},"fefa19fe":{"m":88,"g":94},"9c574585":{"m":88,"g":94},"8233cc10":{"m":88,"g":94},"1b2e8f76":{"m":88,"g":94},"d2e0881a":{"m":88,"g":94},"2f427491":{"m":88,"g":94},"d8189660":{"m":88,"g":94},"3ded6235":{"m":88,"g":94},"4ba1eea8":{"m":88,"g":94},"4685fbb8":{"m":88,"g":94},"0a4fc73b":{"m":88,"g":94},"a6970a17":{"m":88,"g":94},"a6ae3af1":{"m":88,"g":94},"0b07c4a9":{"m":88,"g":94},"fc0e3b91":{"m":88,"g":94},"d71f3f0a":{"m":88,"g":94},"58f10679":{"m":88,"g":94},"7a80f565":{"m":88,"g":94},"9484eba4":{"m":88,"g":94},"e9feb488":{"m":88,"g":94},"fc992a09":{"m":88,"g":94},"121f92c5":{"m":88,"g":94},"3bde1010":{"m":88,"g":94},"75135580":{"m":88,"g":94},"4d643f6c":{"m":88,"g":94},"6ce0ed07":{"m":88,"g":94},"969660c7":{"m":88,"g":94},"16d4f680":{"m":88,"g":94},"ada268fd":{"m":88,"g":94},"cfe48c59":{"m":88,"g":94},"d4c038da":{"m":88,"g":94},"55f6005f":{"m":88,"g":94},"7222e1da":{"m":88,"g":94},"505eec4d":{"m":88,"g":94},"ccfe5c00":{"m":88,"g":94},"a071dc40":{"m":88,"g":94},"a40aecc5":{"m":88,"g":94},"d6e1d28c":{"m":88,"g":94},"7c347259":{"m":88,"g":94},"669caa0a":{"m":88,"g":94},"4024e1d2":{"m":88,"g":94},"5c0b38f3":{"m":88,"g":94},"30ca18f4":{"m":88,"g":94},"03886917":{"m":88,"g":94},"66324895":{"m":88,"g":94},"1
3feffd0":{"m":88,"g":94},"e98afbe0":{"m":88,"g":94},"69af3ec3":{"m":88,"g":94},"32cc66ef":{"m":88,"g":94},"83f2d9d4":{"m":88,"g":94},"6317c5c6":{"m":88,"g":94},"cba1cdbc":{"m":88,"g":94},"c471d39e":{"m":88,"g":94},"d0443275":{"m":88,"g":94},"17d080b7":{"m":88,"g":94},"1b19df4b":{"m":88,"g":94},"f0653886":{"m":88,"g":94},"b1465557":{"m":88,"g":94},"b06215da":{"m":88,"g":94},"7adf245b":{"m":88,"g":94},"299fd22f":{"m":88,"g":94},"506e5de8":{"m":88,"g":94},"844e2f22":{"m":88,"g":94},"4f39bcf7":{"m":88,"g":94},"31c9569b":{"m":88,"g":94},"1be6956d":{"m":88,"g":94},"626ccb7d":{"m":88,"g":94},"72bfb0ba":{"m":88,"g":94},"15521495":{"m":88,"g":94},"ebe58d54":{"m":88,"g":94},"066cf445":{"m":88,"g":94},"6dc6b306":{"m":88,"g":94},"1f30c05d":{"m":88,"g":94},"5dd62c3a":{"m":88,"g":94},"f11481b9":{"m":88,"g":94},"9d24c3ff":{"m":88,"g":94},"24161c59":{"m":88,"g":94},"eabcf82a":{"m":88,"g":94},"c47a51db":{"m":88,"g":94},"11553c1a":{"m":88,"g":94},"01dd39ba":{"m":88,"g":94},"b3f3d610":{"m":88,"g":94},"f07c6a00":{"m":88,"g":94},"4bb816d4":{"m":88,"g":94},"c250939e":{"m":88,"g":94},"b6909aa2":{"m":88,"g":94},"f8728357":{"m":88,"g":94},"73187152":{"m":88,"g":94},"40865665":{"m":88,"g":94},"fd08c048":{"m":88,"g":94},"26ebb849":{"m":88,"g":94},"02973cd9":{"m":88,"g":94},"6d95a35a":{"m":88,"g":94},"01d2838c":{"m":88,"g":94},"e3b8a722":{"m":88,"g":94},"3cf1473a":{"m":88,"g":94},"27168308":{"m":88,"g":94},"e3bed74a":{"m":88,"g":94},"e9ef39d2":{"m":88,"g":94},"205d5cb4":{"m":88,"g":94},"3d7f7a43":{"m":88,"g":94},"2df9d40a":{"m":88,"g":94},"8dc191f2":{"m":88,"g":94},"64825b83":{"m":88,"g":94},"69748d08":{"m":88,"g":94},"dcc0a456":{"m":88,"g":94},"c2b7ddca":{"m":88,"g":94},"abebd939":{"m":88,"g":94},"4bd2952a":{"m":88,"g":94},"6fc93575":{"m":88,"g":94},"839fb31e":{"m":88,"g":94},"f19a9204":{"m":88,"g":94},"c23a7072":{"m":88,"g":94},"e07a6977":{"m":88,"g":94},"cd8d4b9d":{"m":88,"g":94},"f194e14f":{"m":88,"g":94},"cfc9f9ab":{"m":88,"g":94},"fb4959b2":{"m":88,"g":94},"9a405274":{"m":88,"g":94},"2e4
babdb":{"m":88,"g":94},"44a3783d":{"m":88,"g":94},"f3bf6110":{"m":88,"g":94},"198b9056":{"m":88,"g":94},"73eb67c0":{"m":88,"g":94},"9a91fa0e":{"m":88,"g":94},"cd7c8a8d":{"m":88,"g":94},"3e350a93":{"m":88,"g":94},"fb71725c":{"m":88,"g":94},"912788c0":{"m":88,"g":94},"0f75b907":{"m":88,"g":94},"4f723edd":{"m":89,"g":94},"fcde67b0":{"m":89,"g":94},"81372f3b":{"m":89,"g":94},"baa6624d":{"m":89,"g":94},"c2b16795":{"m":89,"g":94},"f6ebba53":{"m":89,"g":94},"6716b417":{"m":89,"g":94},"1c8b42c8":{"m":89,"g":94},"f20f7000":{"m":89,"g":94},"f40942ad":{"m":89,"g":94},"dc0705a5":{"m":89,"g":94},"a968c888":{"m":89,"g":94},"a979daac":{"m":89,"g":94},"f1569876":{"m":89,"g":94},"3465d7ae":{"m":89,"g":94},"e58423b2":{"m":89,"g":94},"7059ae16":{"m":89,"g":94},"51d9a597":{"m":89,"g":94},"56ccd3c2":{"m":89,"g":94},"98c00a2d":{"m":89,"g":94},"451ffe74":{"m":89,"g":94},"b1e5a33a":{"m":89,"g":94},"9d5fa68b":{"m":89,"g":94},"2c186425":{"m":89,"g":94},"18efb5e8":{"m":89,"g":94},"de1350ea":{"m":89,"g":94},"86fe943b":{"m":89,"g":94},"9ecb1856":{"m":89,"g":94},"cc74499d":{"m":89,"g":94},"0c1f03a2":{"m":89,"g":94},"3712abfa":{"m":89,"g":94},"971a0dfa":{"m":89,"g":94},"2fc12995":{"m":89,"g":94},"20d3ad3b":{"m":89,"g":94},"fa3592cf":{"m":89,"g":94},"608668e1":{"m":89,"g":94},"6c0a4828":{"m":89,"g":94},"47402883":{"m":89,"g":94},"1fb76ebb":{"m":89,"g":94},"c2c4f57f":{"m":89,"g":94},"23881fa6":{"m":89,"g":94},"8db3ac55":{"m":89,"g":94},"3e56f557":{"m":89,"g":94},"62fec60d":{"m":89,"g":94},"e7759778":{"m":89,"g":94},"77e928d0":{"m":89,"g":94},"515ef4fa":{"m":89,"g":94},"f5599ef1":{"m":89,"g":94},"c499591a":{"m":89,"g":94},"e1ce44cd":{"m":89,"g":94},"f1114e7f":{"m":89,"g":94},"bae4fdc7":{"m":89,"g":94},"6153f2ff":{"m":89,"g":94},"8b5f83ed":{"m":89,"g":94},"2a413829":{"m":89,"g":94},"d5c097a2":{"m":89,"g":94},"9736cd3b":{"m":89,"g":94},"2f715f51":{"m":89,"g":94},"d664ca18":{"m":89,"g":94},"22fe7878":{"m":89,"g":94},"c4ffbeca":{"m":89,"g":94},"f8eaaab8":{"m":89,"g":94},"697b0f71":{"m":89,"g":94},"132da
d87":{"m":89,"g":94},"60fdad7c":{"m":89,"g":94},"61ce91ed":{"m":89,"g":94},"e6b7053b":{"m":89,"g":94},"5f91c825":{"m":89,"g":94},"b819381f":{"m":89,"g":94},"562f279a":{"m":89,"g":94},"8b247489":{"m":89,"g":94},"0df6765c":{"m":89,"g":94},"35b65cf0":{"m":89,"g":94},"dd1012fc":{"m":89,"g":94},"44aab7f9":{"m":89,"g":94},"43baba64":{"m":89,"g":94},"0166403c":{"m":89,"g":94},"bcf66ef3":{"m":89,"g":94},"0de5e7d4":{"m":89,"g":94},"72a110f6":{"m":89,"g":94},"5aff1e93":{"m":89,"g":94},"8e3797be":{"m":89,"g":94},"4474eaf5":{"m":89,"g":94},"499f5e62":{"m":89,"g":94},"81964328":{"m":89,"g":94},"f0f84975":{"m":89,"g":94},"3f1e4339":{"m":89,"g":94},"cf9815ba":{"m":89,"g":94},"bd75690f":{"m":89,"g":94},"180ff5ee":{"m":89,"g":94},"37f15475":{"m":89,"g":94},"8a548052":{"m":89,"g":94},"b6d0ce9f":{"m":89,"g":94},"0ea330ca":{"m":89,"g":94},"27e327b4":{"m":89,"g":94},"ff00895c":{"m":89,"g":94},"ff914748":{"m":89,"g":94},"eb38c7d1":{"m":89,"g":94},"df7f61ee":{"m":89,"g":94},"ef21729c":{"m":89,"g":94},"f5159315":{"m":89,"g":94},"6d7b6696":{"m":89,"g":94},"6376b632":{"m":89,"g":94},"e05e29d1":{"m":89,"g":94},"a2cb5913":{"m":89,"g":94},"55444ed6":{"m":89,"g":94},"20fd53b8":{"m":89,"g":94},"6a47b730":{"m":89,"g":94},"c429919d":{"m":89,"g":94},"1da8d230":{"m":89,"g":94},"2f7420bc":{"m":89,"g":94},"c6a0cacc":{"m":89,"g":94},"0a9bfc20":{"m":89,"g":94},"34c63731":{"m":89,"g":94},"2d72fc47":{"m":89,"g":94},"b520d028":{"m":89,"g":94},"7dc0e394":{"m":89,"g":94},"fb507b7b":{"m":89,"g":94},"f90945c4":{"m":89,"g":94},"094fbdac":{"m":89,"g":94},"888cb175":{"m":89,"g":94},"e39bca07":{"m":89,"g":94},"a2bb8565":{"m":89,"g":94},"ced3c07a":{"m":89,"g":94},"f18b068f":{"m":89,"g":94},"4fac524b":{"m":89,"g":94},"b581b225":{"m":89,"g":94},"69dd878b":{"m":89,"g":94},"22630ca2":{"m":89,"g":94},"d279d499":{"m":89,"g":94},"6cb00c63":{"m":89,"g":94},"62cac2c4":{"m":89,"g":94},"2c3b71d6":{"m":89,"g":94},"51cdd81f":{"m":89,"g":94},"73def253":{"m":89,"g":94},"d9d35def":{"m":89,"g":94},"6df81e8a":{"m":89,"g":94},"3ab7d9b
5":{"m":89,"g":94},"7e5071c9":{"m":89,"g":94},"78689d33":{"m":89,"g":94},"1dc6864f":{"m":89,"g":94},"485a023b":{"m":89,"g":94},"7e412900":{"m":89,"g":94},"c673727e":{"m":89,"g":94},"f4d4f939":{"m":89,"g":94},"f2bd3515":{"m":89,"g":94},"c459536b":{"m":89,"g":94},"535c8386":{"m":89,"g":94},"2163586e":{"m":89,"g":94},"e06b0761":{"m":89,"g":94},"844a8f42":{"m":89,"g":94},"791b3bfa":{"m":89,"g":94},"31589e17":{"m":89,"g":94},"ae6a5b29":{"m":89,"g":94},"4839999b":{"m":89,"g":94},"541a985f":{"m":89,"g":94},"5170b010":{"m":89,"g":94},"d63e76f7":{"m":89,"g":94},"e9fd11c0":{"m":89,"g":94},"c7588d59":{"m":89,"g":94},"6b231325":{"m":89,"g":94},"b1c8d4e9":{"m":89,"g":94},"c25231c6":{"m":89,"g":94},"fba03b29":{"m":89,"g":94},"461a7302":{"m":89,"g":94},"07610353":{"m":89,"g":94},"c087ddd6":{"m":89,"g":94},"f4a8987f":{"m":89,"g":94},"41ba767f":{"m":89,"g":94},"f127355a":{"m":89,"g":94},"bdb962d7":{"m":89,"g":94},"0b9557fc":{"m":89,"g":94},"87068b5c":{"m":89,"g":94},"a564e001":{"m":89,"g":94},"2103b806":{"m":89,"g":94},"e806f708":{"m":89,"g":94},"fa6723f0":{"m":89,"g":94},"673ff668":{"m":89,"g":94},"447be242":{"m":89,"g":94},"183d9f96":{"m":89,"g":94},"63195028":{"m":89,"g":94},"a3d7f4b6":{"m":89,"g":94},"b18416fb":{"m":89,"g":94},"ce9d690e":{"m":89,"g":94},"bdaefbbf":{"m":89,"g":94},"45a31a82":{"m":89,"g":94},"1aa0fbf4":{"m":89,"g":94},"7a0bbe6a":{"m":89,"g":94},"ae335842":{"m":89,"g":94},"477a101c":{"m":89,"g":94},"1a8f5f68":{"m":89,"g":94},"32cd7070":{"m":89,"g":94},"ebd1ed49":{"m":89,"g":94},"f77da699":{"m":89,"g":94},"d6864ce6":{"m":89,"g":94},"755a3661":{"m":89,"g":94},"79a39ac0":{"m":89,"g":94},"3ce94f71":{"m":89,"g":94},"ca95556c":{"m":89,"g":94},"eb8f02dd":{"m":89,"g":94},"0ca3e568":{"m":89,"g":94},"5c7aa009":{"m":89,"g":94},"fe386aca":{"m":89,"g":94},"14d1075f":{"m":89,"g":94},"006ead9d":{"m":89,"g":94},"0d503090":{"m":89,"g":94},"501efc3d":{"m":89,"g":94},"f9bab3d5":{"m":89,"g":94},"16f69b1f":{"m":89,"g":94},"65f09131":{"m":89,"g":94},"fc419b62":{"m":89,"g":94},"7eb9d8e5"
:{"m":89,"g":94},"84147254":{"m":89,"g":94},"6bebef60":{"m":89,"g":94},"25be63d0":{"m":89,"g":94},"d502dae0":{"m":89,"g":94},"93e53f6e":{"m":89,"g":94},"a191a0e4":{"m":89,"g":94},"8c7279c2":{"m":89,"g":94},"0ca18117":{"m":89,"g":94},"2c3a6fe1":{"m":89,"g":94},"8b33d8df":{"m":89,"g":94},"e235be16":{"m":89,"g":94},"5ccf8fe1":{"m":89,"g":94},"3f23d8cd":{"m":89,"g":94},"1a399799":{"m":89,"g":94},"022012aa":{"m":89,"g":94},"681e7af3":{"m":89,"g":94},"681fdc26":{"m":89,"g":94},"0d477880":{"m":89,"g":94},"f4560373":{"m":89,"g":94},"b2388433":{"m":89,"g":94},"a38376fa":{"m":89,"g":94},"7a5e6ce1":{"m":89,"g":94},"24c035f2":{"m":89,"g":94},"f9dc9dd2":{"m":90,"g":94},"62a7aa2e":{"m":90,"g":94},"5ca07eed":{"m":90,"g":94},"e30ef368":{"m":90,"g":94},"91a066ec":{"m":90,"g":94},"c4943867":{"m":90,"g":94},"53a525bf":{"m":90,"g":94},"7ddf8e83":{"m":90,"g":94},"8321f8e4":{"m":90,"g":94},"cfceb83d":{"m":90,"g":94},"b1286a11":{"m":90,"g":94},"21615cc3":{"m":90,"g":94},"0ae1e9a7":{"m":90,"g":94},"e07d0647":{"m":90,"g":94},"3c2274fb":{"m":90,"g":94},"d2679f51":{"m":90,"g":94},"96be97bf":{"m":90,"g":94},"88f9c347":{"m":90,"g":94},"fff10809":{"m":90,"g":94},"5f1ab327":{"m":90,"g":94},"7df7c679":{"m":90,"g":94},"38af4f68":{"m":90,"g":94},"a6305c7d":{"m":90,"g":94},"a023856b":{"m":90,"g":94},"db0cc57e":{"m":90,"g":94},"349bb2c9":{"m":90,"g":94},"0b8939bc":{"m":90,"g":94},"ed89837c":{"m":90,"g":94},"55561e25":{"m":90,"g":94},"44733203":{"m":90,"g":94},"0bd67ba2":{"m":90,"g":94},"7d316991":{"m":90,"g":94},"ab1a4fa5":{"m":90,"g":94},"ed54bf9d":{"m":90,"g":94},"b57d87c2":{"m":90,"g":94},"98538822":{"m":90,"g":94},"f47a1b1d":{"m":90,"g":94},"93cec433":{"m":90,"g":94},"ba589b88":{"m":90,"g":94},"50876abc":{"m":90,"g":94},"b4c41f72":{"m":90,"g":94},"8b8f2e74":{"m":90,"g":94},"0fc3d992":{"m":90,"g":94},"be2d985d":{"m":90,"g":94},"5b1afa78":{"m":90,"g":94},"c49c1d92":{"m":90,"g":94},"0f1dfa1e":{"m":90,"g":94},"e3ec6bf4":{"m":90,"g":94},"b04df75a":{"m":90,"g":94},"bec3e484":{"m":90,"g":94},"8ab7d93c":{
"m":90,"g":94},"5c66c442":{"m":90,"g":94},"aa46ed34":{"m":90,"g":94},"2f4ec752":{"m":90,"g":94},"da47621c":{"m":90,"g":94},"22a6b9fc":{"m":90,"g":94},"b02df20a":{"m":90,"g":94},"bd7cfbd2":{"m":90,"g":94},"4b9971e4":{"m":90,"g":94},"dcc79d32":{"m":90,"g":94},"7046e0fa":{"m":90,"g":94},"930746d9":{"m":90,"g":94},"84727a51":{"m":90,"g":94},"ef326774":{"m":90,"g":94},"021f76e4":{"m":90,"g":94},"777688b8":{"m":90,"g":94},"0ca594ed":{"m":90,"g":94},"31d6dee5":{"m":90,"g":94},"02543b54":{"m":90,"g":94},"25a6a9aa":{"m":90,"g":94},"83d87685":{"m":90,"g":94},"2a5f0100":{"m":90,"g":94},"dbdf76ca":{"m":90,"g":94},"f2a75a66":{"m":90,"g":94},"6b12d6a8":{"m":90,"g":94},"0f218731":{"m":90,"g":94},"14c18d25":{"m":90,"g":94},"90bd3e32":{"m":90,"g":94},"ca929118":{"m":90,"g":94},"344adb00":{"m":90,"g":94},"b56de8f9":{"m":90,"g":94},"ce5ee3bd":{"m":90,"g":94},"a0e4d4eb":{"m":90,"g":94},"2f584455":{"m":90,"g":94},"fe55947a":{"m":90,"g":94},"19995dd7":{"m":90,"g":94},"3b014bc1":{"m":90,"g":94},"d7c3e8e9":{"m":90,"g":94},"8ea7df61":{"m":90,"g":94},"4a102a2b":{"m":90,"g":94},"6406408a":{"m":90,"g":94},"019851d0":{"m":90,"g":94},"2dae104d":{"m":90,"g":94},"cef6655b":{"m":90,"g":94},"27196d41":{"m":90,"g":94},"bb185b0e":{"m":90,"g":94},"7c3a12c0":{"m":91,"g":94},"e846d95e":{"m":91,"g":94},"15f34013":{"m":91,"g":94},"d04163b3":{"m":91,"g":94},"7732bbe4":{"m":91,"g":94},"ed0a0b69":{"m":91,"g":94},"fa42e419":{"m":91,"g":94},"e5afb88b":{"m":91,"g":94},"e5ddeb04":{"m":91,"g":94},"bdbb8d00":{"m":91,"g":94},"34c3f9b2":{"m":91,"g":94},"76139bfb":{"m":91,"g":94},"f8d48fd3":{"m":91,"g":94},"34b6b842":{"m":91,"g":94},"25549433":{"m":91,"g":94},"d6dddc19":{"m":91,"g":94},"55e03b10":{"m":91,"g":94},"8aa68ed5":{"m":91,"g":94},"506c4928":{"m":91,"g":94},"30ceccc7":{"m":91,"g":94},"ac5010e0":{"m":91,"g":94},"3cee035e":{"m":91,"g":94},"30f2a44a":{"m":91,"g":94},"bd4f5818":{"m":91,"g":94},"50f1b6d6":{"m":91,"g":94},"5962e70d":{"m":91,"g":94},"edc21cc8":{"m":91,"g":94},"05c9bc89":{"m":91,"g":94},"b7a2df0a":{"m
":91,"g":94},"1998ce40":{"m":91,"g":94},"72676cd6":{"m":91,"g":94},"02bf31ef":{"m":91,"g":94},"5ea5d221":{"m":91,"g":94},"fdfd5224":{"m":91,"g":94},"7f3ee861":{"m":91,"g":94},"bec58910":{"m":91,"g":94},"9edf6608":{"m":91,"g":94},"ab74f8f0":{"m":91,"g":94},"5e7fdc79":{"m":91,"g":94},"4d8d9b8e":{"m":91,"g":94},"5041df2d":{"m":91,"g":94},"256801e9":{"m":91,"g":94},"73b13e69":{"m":91,"g":94},"8609e637":{"m":91,"g":94},"dea2b84b":{"m":91,"g":94},"cfb2fb5a":{"m":91,"g":94},"22bfed75":{"m":91,"g":94},"e879d8b7":{"m":91,"g":94},"09988080":{"m":91,"g":94},"794be55a":{"m":91,"g":94},"187b85b7":{"m":91,"g":94},"ceba0ce4":{"m":91,"g":94},"1ab6be1b":{"m":91,"g":94},"4df5fc21":{"m":91,"g":94},"a06912ad":{"m":91,"g":94},"97011abc":{"m":91,"g":94},"1d6515ef":{"m":91,"g":94},"dea8aa7a":{"m":91,"g":94},"906dbc34":{"m":91,"g":94},"fadf18fd":{"m":91,"g":94},"f88e7085":{"m":91,"g":94},"4f838c09":{"m":91,"g":94},"d20a073b":{"m":91,"g":94},"47367b76":{"m":91,"g":94},"650127a1":{"m":91,"g":94},"3774f078":{"m":91,"g":94},"9179ea15":{"m":91,"g":94},"20a503c7":{"m":91,"g":94},"ffd1a26e":{"m":91,"g":94},"09ae5b20":{"m":91,"g":94},"712bf9ec":{"m":91,"g":94},"9c6a0656":{"m":91,"g":94},"2ae809c5":{"m":91,"g":94},"1de4db9b":{"m":91,"g":94},"31fccf5a":{"m":91,"g":94},"b783c1cb":{"m":91,"g":94},"094c116f":{"m":91,"g":94},"e56685ac":{"m":91,"g":94},"c26d7349":{"m":91,"g":94},"ceaa85c9":{"m":91,"g":94},"0650e517":{"m":91,"g":94},"fc554105":{"m":91,"g":94},"4f204db5":{"m":91,"g":94},"3eb4a800":{"m":91,"g":94},"e7261315":{"m":91,"g":94},"8c16da33":{"m":91,"g":94},"a39d9287":{"m":91,"g":94},"10d60cd4":{"m":91,"g":94},"8a10c4c3":{"m":91,"g":94},"405780bc":{"m":91,"g":94},"1dffee31":{"m":91,"g":94},"70c471a8":{"m":91,"g":94},"1a9c2c92":{"m":91,"g":94},"873ae12c":{"m":91,"g":94},"c64290dc":{"m":91,"g":94},"8e2363dc":{"m":91,"g":94},"69183f88":{"m":92,"g":94},"9b00990b":{"m":92,"g":94},"4d67025a":{"m":92,"g":94},"0e05fe8c":{"m":92,"g":94},"2390a2bc":{"m":92,"g":94},"16d76b9f":{"m":92,"g":94},"5c214257":{"m":
92,"g":94},"b8df43ab":{"m":92,"g":94},"a1c1ebe9":{"m":92,"g":94},"fe2a0f96":{"m":92,"g":94},"20beb370":{"m":92,"g":94},"00fbd8a4":{"m":92,"g":94},"802815e4":{"m":92,"g":94},"4c6675c4":{"m":92,"g":94},"e21aa1df":{"m":92,"g":94},"f3cbd245":{"m":92,"g":94},"506a2d59":{"m":92,"g":94},"a07f8ae4":{"m":92,"g":94},"7eb47b0f":{"m":92,"g":94},"bc2e5645":{"m":92,"g":94},"3abc3036":{"m":92,"g":94},"afeed465":{"m":92,"g":94},"587b4c6e":{"m":92,"g":94},"7b9a174a":{"m":92,"g":94},"03c039c4":{"m":92,"g":94},"57ab7769":{"m":92,"g":94},"112b496a":{"m":92,"g":94},"3562256b":{"m":92,"g":94},"5f527834":{"m":92,"g":94},"9f1787fa":{"m":92,"g":94},"8ecad0b1":{"m":92,"g":94},"7151194b":{"m":92,"g":94},"2ed68d7a":{"m":92,"g":94},"e984d507":{"m":92,"g":94},"755f3147":{"m":92,"g":94},"ec5f9c62":{"m":93,"g":94},"62f5522f":{"m":93,"g":94},"01f98730":{"m":93,"g":94},"199d6218":{"m":93,"g":94},"f200af0d":{"m":93,"g":94},"5589b750":{"m":93,"g":94},"c04a8a82":{"m":93,"g":94},"6c903611":{"m":93,"g":94},"77cfea68":{"m":93,"g":94},"8fc910db":{"m":93,"g":94},"75354d9a":{"m":93,"g":94},"4fece12b":{"m":93,"g":94},"c7973222":{"m":93,"g":94},"ef8a29c4":{"m":93,"g":94},"8e9fb43d":{"m":93,"g":94},"83646089":{"m":93,"g":94},"da3890e8":{"m":93,"g":94},"cb432f17":{"m":93,"g":94},"1964c325":{"m":93,"g":94},"af564774":{"m":93,"g":94},"af46f299":{"m":93,"g":94},"16a6b1d8":{"m":93,"g":94},"14229ccf":{"m":93,"g":94},"975a5ec6":{"m":93,"g":94},"1e3e3add":{"m":93,"g":94},"8c298031":{"m":93,"g":94},"4de03953":{"m":93,"g":94},"8b1942c6":{"m":93,"g":94},"489934be":{"m":93,"g":94},"43f93f63":{"m":93,"g":94},"aca1101a":{"m":93,"g":94},"2998c4bd":{"m":93,"g":94},"6840a7bb":{"m":93,"g":94},"c01a1df5":{"m":93,"g":94},"00991723":{"m":93,"g":94},"264dc6e7":{"m":93,"g":94},"646cef2e":{"m":93,"g":94},"1dce6c48":{"m":93,"g":94},"9fcc9a80":{"m":93,"g":94},"ac49dac0":{"m":93,"g":94},"1e0e5497":{"m":93,"g":94},"b5822651":{"m":93,"g":94},"2c4feaf3":{"m":93,"g":94},"2ff572e2":{"m":93,"g":94},"84f2e4a0":{"m":93,"g":94},"8f844db6":{"m":93
,"g":94},"36cc3ffd":{"m":93,"g":94},"1bebd315":{"m":93,"g":94},"d3c275b1":{"m":93,"g":94},"b044400d":{"m":93,"g":94},"40e5cb7a":{"m":93,"g":94},"8e64140e":{"m":93,"g":94},"82f021e2":{"m":93,"g":94},"0626f678":{"m":93,"g":94},"09e699bb":{"m":93,"g":94},"b116b21a":{"m":93,"g":94},"88f484ce":{"m":93,"g":94},"8e03b641":{"m":93,"g":94},"b3fa5dc3":{"m":93,"g":94},"00aec6ad":{"m":93,"g":94},"1a08358a":{"m":93,"g":94},"f18a8fdd":{"m":93,"g":94},"a7efbb27":{"m":93,"g":94},"93b6785d":{"m":93,"g":94},"f9eb04dd":{"m":93,"g":94},"3a911b85":{"m":93,"g":94},"886d3449":{"m":93,"g":94},"637bfee4":{"m":93,"g":94},"6005ecee":{"m":93,"g":94},"ff2e9c94":{"m":93,"g":94},"3e34e900":{"m":93,"g":94},"7349717e":{"m":93,"g":94},"392e441a":{"m":93,"g":94},"7248272c":{"m":93,"g":94},"22352d47":{"m":93,"g":94},"c5131f7a":{"m":93,"g":94},"78700893":{"m":93,"g":94},"663c04f7":{"m":93,"g":94},"3b3f1e3a":{"m":93,"g":94},"b691dcc4":{"m":93,"g":94},"0c9c6c75":{"m":93,"g":94},"e3f9b548":{"m":93,"g":94},"b3cff365":{"m":93,"g":94},"8f335b5b":{"m":93,"g":94},"b2264076":{"m":93,"g":94},"04b35190":{"m":93,"g":94},"071a1f51":{"m":93,"g":94},"7c0db3a6":{"m":93,"g":94},"c45e49d8":{"m":93,"g":94},"d8053929":{"m":93,"g":94},"00c7b1ad":{"m":93,"g":94},"82eccae4":{"m":93,"g":94},"a8c10aee":{"m":93,"g":94},"eb429b88":{"m":93,"g":94},"49538d11":{"m":93,"g":94},"cfe2edac":{"m":93,"g":94},"2373faa3":{"m":93,"g":94},"9efb2993":{"m":93,"g":94},"a5317b2f":{"m":93,"g":94},"eb6c2c16":{"m":93,"g":94},"357921aa":{"m":93,"g":94},"c071198c":{"m":93,"g":94},"d7374d74":{"m":93,"g":94},"ce3a3e87":{"m":93,"g":94},"41650b0d":{"m":93,"g":94},"1b951620":{"m":93,"g":94},"29bd4c81":{"m":93,"g":94},"031f64aa":{"m":93,"g":94},"3d7cdb2e":{"m":93,"g":94},"604efe07":{"m":93,"g":94},"1b8cf77b":{"m":93,"g":94},"bb9b608c":{"m":93,"g":94},"066f4ec9":{"m":95,"g":97},"b6b6268c":{"m":95,"g":97},"08702321":{"m":95,"g":97},"64c5907e":{"m":95,"g":97},"128f16a8":{"m":95,"g":97},"49861046":{"m":95,"g":97},"a37e1247":{"m":95,"g":97},"136c6e04":{"m":95,"
g":97},"43e20c06":{"m":95,"g":97},"4bab50a6":{"m":95,"g":97},"2e7ab862":{"m":95,"g":97},"51ae4030":{"m":95,"g":97},"653b873b":{"m":95,"g":97},"d379bda4":{"m":95,"g":97},"2b0e1d1c":{"m":95,"g":97},"659907e3":{"m":95,"g":97},"cb9d91ea":{"m":95,"g":97},"6a6e0bb7":{"m":95,"g":97},"076313bd":{"m":95,"g":97},"9abe1163":{"m":95,"g":97},"3646f6bb":{"m":95,"g":94},"35724aa1":{"m":95,"g":94},"2fc824b8":{"m":95,"g":94},"253454de":{"m":95,"g":94},"ea3e7ffe":{"m":95,"g":94},"8d4a01cb":{"m":95,"g":94},"a3398d84":{"m":95,"g":94},"ba69c153":{"m":95,"g":94},"3589aa79":{"m":95,"g":94},"e00715eb":{"m":95,"g":94},"ea4bf122":{"m":95,"g":94},"a291439a":{"m":95,"g":94},"54411f6a":{"m":95,"g":94},"625018d2":{"m":95,"g":94},"5732d904":{"m":95,"g":94},"eb118d88":{"m":96,"g":97},"732fc8e4":{"m":96,"g":97},"f2d5c492":{"m":96,"g":97},"2a2d3478":{"m":96,"g":97},"aa205609":{"m":96,"g":97},"61bb2858":{"m":96,"g":97},"880221bd":{"m":96,"g":97},"8f3173d0":{"m":96,"g":97},"26118a13":{"m":96,"g":97},"475a249b":{"m":96,"g":97},"191d836f":{"m":96,"g":97},"86044712":{"m":96,"g":97},"61555307":{"m":96,"g":97},"49a5915f":{"m":96,"g":97},"766392c6":{"m":96,"g":97},"4a0d1919":{"m":96,"g":97},"57482415":{"m":96,"g":97},"2d54d4bb":{"m":96,"g":97},"b5e3d603":{"m":96,"g":97},"4ed57807":{"m":96,"g":97},"dd445a41":{"m":96,"g":97},"7590f522":{"m":96,"g":97},"f9df11ae":{"m":96,"g":97},"d389bedf":{"m":96,"g":97},"ac80f4da":{"m":96,"g":97},"d487555f":{"m":96,"g":97},"e5888edd":{"m":96,"g":97},"01c00004":{"m":98,"g":103},"0dfe2491":{"m":98,"g":103},"ff45ab7a":{"m":98,"g":103},"0f8b5386":{"m":98,"g":103},"c33499a6":{"m":98,"g":103},"e50109f2":{"m":98,"g":103},"69adc4f8":{"m":98,"g":103},"11483785":{"m":98,"g":103},"7b68d271":{"m":98,"g":103},"74f59ae5":{"m":98,"g":103},"6936be32":{"m":98,"g":103},"9b5de6cb":{"m":98,"g":97},"5c8365a0":{"m":98,"g":97},"8430bfe3":{"m":98,"g":97},"c9e8613c":{"m":98,"g":97},"429bb0ef":{"m":98,"g":97},"7eebd440":{"m":98,"g":97},"93d124ef":{"m":98,"g":97},"1fc455e8":{"m":98,"g":97},"465968b2":
{"m":98,"g":97},"750838ad":{"m":98,"g":97},"99aefa03":{"m":98,"g":97},"bbcfbc1a":{"m":98,"g":97},"83c104b1":{"m":98,"g":97},"2db6719c":{"m":98,"g":97},"55381a46":{"m":98,"g":97},"a589a071":{"m":98,"g":97},"f62d75b6":{"m":98,"g":97},"0f9b11e3":{"m":98,"g":97},"877e35d7":{"m":98,"g":97},"cbdfb771":{"m":98,"g":97},"282eb59f":{"m":98,"g":97},"4540a466":{"m":98,"g":97},"abda2542":{"m":98,"g":97},"8cddfa56":{"m":98,"g":97},"4e3defe5":{"m":98,"g":97},"60468da4":{"m":98,"g":97},"41d33e47":{"m":98,"g":97},"bfdd226f":{"m":98,"g":97},"3de617a7":{"m":98,"g":97},"bb0e8a32":{"m":98,"g":97},"1b427dae":{"m":98,"g":97},"f3d97361":{"m":98,"g":97},"561dd7b2":{"m":98,"g":97},"f98e88b9":{"m":98,"g":97},"15ad6c90":{"m":98,"g":97},"cfab0ff6":{"m":98,"g":97},"b763cf7e":{"m":98,"g":97},"8fcc55cf":{"m":98,"g":97},"610381b7":{"m":98,"g":97},"1403ea56":{"m":98,"g":97},"b7e951a6":{"m":98,"g":97},"d918ab79":{"m":98,"g":97},"3964b352":{"m":98,"g":97},"9c7a4618":{"m":98,"g":97},"7750b91c":{"m":98,"g":97},"c8f31042":{"m":98,"g":97},"1f76fc87":{"m":98,"g":97},"6737671c":{"m":98,"g":97},"fd63b62e":{"m":98,"g":97},"719b29f2":{"m":98,"g":97},"d0510f08":{"m":98,"g":97},"9d33fcfb":{"m":98,"g":97},"7891bac1":{"m":98,"g":97},"48c1fa7b":{"m":98,"g":97},"8aa5ae6b":{"m":98,"g":97},"8a323557":{"m":98,"g":97},"6e92da8f":{"m":98,"g":97},"e1020dc5":{"m":98,"g":97},"3586b4ce":{"m":98,"g":97},"42960214":{"m":98,"g":97},"01857fab":{"m":98,"g":97},"519ff5c8":{"m":98,"g":97},"af1cc8fe":{"m":98,"g":97},"49b87774":{"m":98,"g":97},"02404a1e":{"m":98,"g":97},"5c08a36c":{"m":98,"g":97},"9069884b":{"m":98,"g":97},"8a7a7770":{"m":98,"g":97},"795668dc":{"m":98,"g":97},"4395c87a":{"m":98,"g":97},"c28ad199":{"m":98,"g":97},"570d3343":{"m":98,"g":97},"d9eb5efc":{"m":98,"g":97},"6dc4af49":{"m":98,"g":97},"b188a89a":{"m":98,"g":97},"497efe74":{"m":98,"g":97},"69f453e5":{"m":98,"g":97},"3bc43c68":{"m":98,"g":97},"7498522f":{"m":98,"g":97},"194841e3":{"m":98,"g":97},"ebff5fcb":{"m":98,"g":97},"f06bd210":{"m":98,"g":97},"14f1f151":{"
m":98,"g":97},"38216cf0":{"m":98,"g":97},"4a883795":{"m":98,"g":97},"f1f1d1d4":{"m":98,"g":97},"9120e83d":{"m":98,"g":97},"6e923dbd":{"m":98,"g":97},"c268c11c":{"m":98,"g":97},"e6d59884":{"m":98,"g":97},"9b560c3e":{"m":98,"g":97},"5dc5866e":{"m":98,"g":97},"64e78bb3":{"m":98,"g":97},"7c39e8a1":{"m":98,"g":97},"d969504d":{"m":98,"g":97},"1ebec1a8":{"m":98,"g":97},"d4d0c7c3":{"m":98,"g":97},"8d2cf38c":{"m":98,"g":97},"2117f82d":{"m":98,"g":97},"c07f647c":{"m":98,"g":97},"07452cbe":{"m":98,"g":97},"a562c8a3":{"m":98,"g":97},"cb736df8":{"m":98,"g":97},"e2ed9d04":{"m":98,"g":97},"b5dd5e87":{"m":98,"g":97},"9379da77":{"m":98,"g":97},"0c55cbcf":{"m":98,"g":97},"c46e069d":{"m":98,"g":97},"42fc4410":{"m":98,"g":97},"5f6756b0":{"m":98,"g":97},"98aa836b":{"m":98,"g":97},"22bd857c":{"m":98,"g":97},"ccfa0841":{"m":98,"g":97},"bcc5ba94":{"m":98,"g":97},"cee9f329":{"m":98,"g":97},"2272c2a5":{"m":99,"g":103},"3ec0b212":{"m":99,"g":103},"58c468f4":{"m":99,"g":103},"f8ca2368":{"m":99,"g":103},"d8ee1564":{"m":99,"g":103},"7181ec8c":{"m":99,"g":103},"ed2e313e":{"m":99,"g":103},"f8260f25":{"m":99,"g":103},"12cb760a":{"m":99,"g":103},"1b9cea5a":{"m":99,"g":103},"9045cc1e":{"m":99,"g":103},"70e37b97":{"m":99,"g":103},"15d27591":{"m":99,"g":103},"af4b9bae":{"m":99,"g":103},"7ad6b766":{"m":99,"g":103},"c0fb25e9":{"m":99,"g":103},"28d4d472":{"m":99,"g":103},"f4674df6":{"m":99,"g":103},"d40846d4":{"m":99,"g":103},"145482f4":{"m":99,"g":103},"39fe1e88":{"m":99,"g":103},"33c4b4d0":{"m":99,"g":103},"8d1c5b94":{"m":99,"g":103},"a167fd0b":{"m":99,"g":103},"2f86f3ad":{"m":99,"g":103},"bfb118c0":{"m":99,"g":103},"f6e07f27":{"m":99,"g":103},"5dd0f870":{"m":99,"g":103},"f7e102d5":{"m":99,"g":103},"0e5fa677":{"m":99,"g":103},"624a3b8d":{"m":99,"g":103},"01079e17":{"m":99,"g":103},"0e7a5b26":{"m":99,"g":103},"4953f4ca":{"m":99,"g":103},"38000a5f":{"m":99,"g":103},"70251e93":{"m":99,"g":103},"c87d4fec":{"m":99,"g":103},"a99801e0":{"m":99,"g":103},"4c605235":{"m":99,"g":103},"6f8f4aee":{"m":99,"g":103},"0
c8dab9e":{"m":99,"g":103},"f39037ff":{"m":99,"g":103},"ce86e201":{"m":99,"g":103},"b4326330":{"m":99,"g":103},"8abd3e77":{"m":99,"g":103},"e885bfdc":{"m":99,"g":103},"e2d66f60":{"m":99,"g":103},"45bc170b":{"m":100,"g":103},"22623699":{"m":100,"g":103},"fb4ce17d":{"m":100,"g":103},"25f73c6c":{"m":100,"g":103},"581e7dcb":{"m":100,"g":103},"484d0e02":{"m":100,"g":103},"5922c0cb":{"m":100,"g":103},"6d6a8bc2":{"m":100,"g":103},"2fd5c704":{"m":100,"g":103},"4ad97370":{"m":100,"g":103},"28103384":{"m":100,"g":103},"fe6a445d":{"m":100,"g":103},"dd487e55":{"m":100,"g":103},"bb81daef":{"m":100,"g":103},"58dd95fb":{"m":100,"g":103},"b47eda33":{"m":100,"g":103},"e983d666":{"m":100,"g":103},"b58c3c28":{"m":100,"g":103},"df906455":{"m":100,"g":103},"95217a9b":{"m":100,"g":103},"22e00eeb":{"m":100,"g":103},"b3eac168":{"m":100,"g":103},"10ee8955":{"m":100,"g":103},"bf3352c5":{"m":100,"g":103},"4d921f2b":{"m":100,"g":103},"44d600cd":{"m":100,"g":103},"5c9c275b":{"m":100,"g":103},"bf0f448f":{"m":100,"g":103},"36d6f0ba":{"m":100,"g":103},"2a1936de":{"m":100,"g":103},"2ab97023":{"m":100,"g":103},"0bcc195f":{"m":100,"g":103},"91e3d154":{"m":100,"g":103},"85486b6f":{"m":100,"g":103},"e34cf6ad":{"m":100,"g":103},"62222bd2":{"m":100,"g":103},"ed0fdbf3":{"m":100,"g":103},"b602f423":{"m":100,"g":103},"426b7493":{"m":100,"g":103},"528bd1ed":{"m":100,"g":103},"62a6b7c7":{"m":100,"g":103},"76154631":{"m":100,"g":103},"5c705b1d":{"m":100,"g":103},"b7094a5e":{"m":100,"g":103},"da0c0260":{"m":100,"g":103},"3212c2ad":{"m":100,"g":103},"53475674":{"m":100,"g":103},"ce32bc2b":{"m":100,"g":103},"e236d8fe":{"m":100,"g":103},"4fa44d63":{"m":100,"g":103},"e6312d27":{"m":100,"g":103},"8af145b7":{"m":100,"g":103},"6478831b":{"m":101,"g":103},"2e1d2d7e":{"m":101,"g":103},"fb16fbaf":{"m":101,"g":103},"0ce84c82":{"m":101,"g":103},"59d0bf01":{"m":101,"g":103},"7df2c0c2":{"m":101,"g":103},"69712e6f":{"m":101,"g":103},"001bffca":{"m":101,"g":103},"7c969717":{"m":101,"g":103},"8240a6b0":{"m":101,"g":103},"3a04aa4
b":{"m":101,"g":103},"bd516949":{"m":101,"g":103},"74e7e457":{"m":101,"g":103},"1466c1b8":{"m":101,"g":103},"9c138a04":{"m":101,"g":103},"c8f549d9":{"m":101,"g":103},"134fa43e":{"m":101,"g":103},"ccfe52a0":{"m":101,"g":103},"747dd450":{"m":101,"g":103},"b5821592":{"m":101,"g":103},"a9dd3ec3":{"m":101,"g":103},"02328864":{"m":102,"g":103},"7a1f7fc5":{"m":102,"g":103},"51c38163":{"m":102,"g":103},"09f1a247":{"m":102,"g":103},"32fa1e9c":{"m":102,"g":103},"e7dc163f":{"m":102,"g":103},"e179e0b7":{"m":102,"g":103},"d9049592":{"m":102,"g":103},"26c8a310":{"m":102,"g":103},"5963e505":{"m":102,"g":103},"43118f5f":{"m":102,"g":103},"a5f5ab40":{"m":102,"g":103},"59aab76f":{"m":102,"g":103},"659bfd10":{"m":102,"g":103},"67e53b16":{"m":102,"g":103},"9b9e8253":{"m":102,"g":103},"66a398f4":{"m":102,"g":103},"29980334":{"m":102,"g":103},"a79a5d70":{"m":102,"g":103},"ec5f9442":{"m":102,"g":103},"3bdcdd13":{"m":102,"g":103},"a730ce81":{"m":102,"g":103},"55ecdc0a":{"m":102,"g":103},"e3f08c77":{"m":102,"g":103},"a9fd8033":{"m":102,"g":103},"2fbb754e":{"m":102,"g":103},"a85ebf50":{"m":102,"g":103},"9effeb5b":{"m":102,"g":103},"1992ef9b":{"m":102,"g":103},"c0fd77e8":{"m":102,"g":103},"a4c3b121":{"m":102,"g":103},"5973675b":{"m":102,"g":103},"4d16c88b":{"m":102,"g":103},"7a4309cc":{"m":102,"g":103},"81367066":{"m":102,"g":103},"263c9236":{"m":102,"g":103},"33f0de33":{"m":105,"g":107},"e7e5a305":{"m":105,"g":107},"dd7ca006":{"m":105,"g":107},"9305ea6c":{"m":105,"g":107},"aa4c66b5":{"m":105,"g":107},"39decec1":{"m":105,"g":104},"f6f46f46":{"m":105,"g":104},"2886e23d":{"m":105,"g":104},"99795d61":{"m":105,"g":104},"fe5086fd":{"m":105,"g":104},"04913430":{"m":105,"g":104},"0ad098b4":{"m":105,"g":104},"4a6e7a66":{"m":105,"g":104},"4b04998d":{"m":105,"g":104},"3dde8619":{"m":105,"g":104},"b7170cc8":{"m":105,"g":104},"5c14515f":{"m":105,"g":104},"2cd2e27f":{"m":105,"g":104},"743638bc":{"m":105,"g":104},"061c8959":{"m":105,"g":104},"4acf6902":{"m":105,"g":104},"aee0ef52":{"m":105,"g":103},"ae8077
74":{"m":105,"g":103},"8fbcfd07":{"m":105,"g":103},"3c307dc0":{"m":105,"g":103},"5d15fb8c":{"m":105,"g":103},"016fd251":{"m":105,"g":103},"8cd34458":{"m":106,"g":107},"0e0eef00":{"m":106,"g":107},"cb099d20":{"m":106,"g":107},"7a913301":{"m":106,"g":107},"5ce5093b":{"m":106,"g":107},"6f9baf10":{"m":106,"g":107},"a31b7a70":{"m":106,"g":107},"7ed8e51b":{"m":106,"g":107},"32f28154":{"m":106,"g":107},"f7b2853f":{"m":106,"g":107},"b0add2da":{"m":106,"g":107},"0305c505":{"m":106,"g":107},"8675bdf2":{"m":106,"g":107},"a437aa99":{"m":106,"g":107},"0e612dbf":{"m":106,"g":107},"9f47d686":{"m":106,"g":107},"d9def43d":{"m":106,"g":107},"e273aa6d":{"m":106,"g":107},"828a4fe9":{"m":106,"g":107},"8ada1ab6":{"m":106,"g":107},"e314b084":{"m":106,"g":107},"403566bc":{"m":106,"g":107},"0a56b721":{"m":106,"g":107},"603f5ce0":{"m":106,"g":107},"6d4fd882":{"m":106,"g":107},"f9f0138f":{"m":106,"g":107},"ac6962cc":{"m":106,"g":107},"4ca43b06":{"m":106,"g":107},"ea93079b":{"m":106,"g":107},"4bec99ec":{"m":106,"g":107},"89caf7a3":{"m":106,"g":107},"b27b1191":{"m":106,"g":107},"f642524f":{"m":106,"g":107},"82e6c3a6":{"m":106,"g":107},"b89d37cb":{"m":106,"g":107},"5deab128":{"m":106,"g":107},"d1c4d51c":{"m":106,"g":107},"1fe691a4":{"m":106,"g":107},"e2521926":{"m":106,"g":107},"07e46eca":{"m":106,"g":107},"ab9b893e":{"m":106,"g":107},"6a7528e6":{"m":106,"g":107},"2ae95d17":{"m":106,"g":107},"2d401bd9":{"m":106,"g":107},"b17c5b01":{"m":106,"g":107},"db7343c9":{"m":106,"g":107},"533cb5b2":{"m":106,"g":107},"6bdd2786":{"m":106,"g":107},"46e9d1c7":{"m":106,"g":107},"6c88f6c8":{"m":106,"g":107},"c8d3a402":{"m":106,"g":107},"7e831efe":{"m":106,"g":107},"20b5563e":{"m":106,"g":107},"97a38ee8":{"m":108,"g":115},"86d10d22":{"m":108,"g":115},"83871aa1":{"m":108,"g":115},"b1b3f0b3":{"m":108,"g":115},"34e5e11f":{"m":108,"g":115},"2600fc0d":{"m":108,"g":115},"ccd3fb94":{"m":108,"g":115},"c9dd70fb":{"m":108,"g":115},"6b2b8bf0":{"m":108,"g":115},"4edbe0d5":{"m":108,"g":115},"0374304a":{"m":108,"g":115},"127d4
b0d":{"m":108,"g":115},"7e880286":{"m":108,"g":115},"446c8e4c":{"m":108,"g":115},"5ef545e6":{"m":108,"g":115},"c4500233":{"m":108,"g":115},"f445a1d9":{"m":108,"g":115},"e5638573":{"m":108,"g":115},"f556ac8b":{"m":108,"g":115},"110a6598":{"m":108,"g":115},"49f9d025":{"m":108,"g":115},"0f587e80":{"m":108,"g":115},"6078d5fc":{"m":108,"g":115},"70cf4abc":{"m":108,"g":115},"cebf4599":{"m":108,"g":115},"9c0c1e30":{"m":108,"g":115},"a1f011d0":{"m":108,"g":115},"9ec314c6":{"m":108,"g":115},"fedfe91c":{"m":108,"g":115},"988accbc":{"m":108,"g":115},"b6b2287e":{"m":108,"g":115},"243e745d":{"m":108,"g":115},"61a0e600":{"m":108,"g":115},"0f8cee8c":{"m":108,"g":115},"816c4c85":{"m":108,"g":115},"13ec8d42":{"m":108,"g":115},"05bd7897":{"m":108,"g":115},"5fd311d3":{"m":108,"g":115},"53e2cd46":{"m":108,"g":115},"9708d353":{"m":108,"g":115},"704ced1b":{"m":108,"g":115},"3cc3d9b9":{"m":108,"g":115},"0b3a5b11":{"m":108,"g":115},"6c855db8":{"m":108,"g":115},"0f9318f7":{"m":108,"g":115},"849957bc":{"m":108,"g":115},"cded039b":{"m":108,"g":115},"275f9df3":{"m":108,"g":115},"e8449ab5":{"m":108,"g":115},"4746aaea":{"m":108,"g":115},"10d34f74":{"m":108,"g":115},"9ba72530":{"m":108,"g":115},"9c8e4f69":{"m":108,"g":115},"78ae1758":{"m":108,"g":115},"dae9a80f":{"m":108,"g":115},"e85cb1ce":{"m":108,"g":115},"55d336cb":{"m":108,"g":115},"de4990a5":{"m":108,"g":115},"029e0af3":{"m":108,"g":115},"64574ef8":{"m":108,"g":115},"18da2c96":{"m":108,"g":115},"9b5f0f64":{"m":108,"g":115},"70bb066e":{"m":108,"g":115},"2c4b4b78":{"m":108,"g":115},"7cd2ee06":{"m":108,"g":115},"eb19ccad":{"m":108,"g":115},"25ef53f0":{"m":108,"g":115},"c674bf9c":{"m":108,"g":115},"af1973b8":{"m":108,"g":115},"5cfbb4c1":{"m":108,"g":115},"e6523102":{"m":108,"g":115},"3828db43":{"m":108,"g":115},"88fbc31b":{"m":108,"g":115},"8f5b9910":{"m":108,"g":115},"ef3004d9":{"m":108,"g":115},"84719b52":{"m":108,"g":115},"e99729c9":{"m":108,"g":115},"c10b8e6a":{"m":108,"g":115},"d4bce297":{"m":108,"g":115},"b0980af8":{"m":108,"g":115},"24ea
ebeb":{"m":108,"g":115},"a91e90d9":{"m":108,"g":115},"f96413c4":{"m":108,"g":115},"08ebdf79":{"m":108,"g":115},"42c87045":{"m":108,"g":115},"c9bf3877":{"m":108,"g":115},"de2dd738":{"m":108,"g":115},"1ec97697":{"m":108,"g":115},"d8ed60f2":{"m":108,"g":115},"f1b0eda5":{"m":108,"g":115},"f20b6a3f":{"m":108,"g":115},"3680d6f8":{"m":108,"g":115},"f5154495":{"m":108,"g":115},"e0ce171d":{"m":108,"g":115},"fe43e889":{"m":108,"g":115},"5ae5ecaa":{"m":108,"g":115},"5fbad308":{"m":108,"g":115},"7638f5e4":{"m":108,"g":115},"b45f753c":{"m":108,"g":115},"c5057262":{"m":108,"g":115},"46fe8b8c":{"m":108,"g":115},"0b95a01a":{"m":108,"g":115},"a3b810eb":{"m":108,"g":115},"94959237":{"m":108,"g":115},"f4fafacc":{"m":108,"g":115},"01d47a27":{"m":108,"g":115},"ecc9f3e4":{"m":108,"g":115},"7e8187e0":{"m":108,"g":115},"e483ab6d":{"m":108,"g":115},"720cd308":{"m":108,"g":115},"ce67b2d5":{"m":108,"g":115},"3c2c9f6c":{"m":108,"g":115},"a31ea448":{"m":108,"g":115},"439df454":{"m":108,"g":115},"5626e20b":{"m":108,"g":115},"c6c379ab":{"m":108,"g":115},"c2fbf60f":{"m":108,"g":115},"98b44e9e":{"m":108,"g":115},"6805f6da":{"m":108,"g":115},"ca533580":{"m":108,"g":115},"886454e8":{"m":108,"g":115},"0cf3fbeb":{"m":108,"g":115},"2256d62d":{"m":108,"g":115},"6cdcbcc6":{"m":108,"g":115},"c480a3f6":{"m":108,"g":115},"6e316588":{"m":108,"g":115},"24247b41":{"m":108,"g":115},"4c0bb411":{"m":108,"g":115},"968e1818":{"m":108,"g":115},"d08663ee":{"m":108,"g":115},"716e6827":{"m":108,"g":115},"84b30d9e":{"m":108,"g":115},"ff0cf51c":{"m":108,"g":115},"a1c7f742":{"m":108,"g":115},"ebbb75e9":{"m":108,"g":115},"b341b7db":{"m":108,"g":115},"b498cd21":{"m":108,"g":115},"0fc54b97":{"m":108,"g":115},"b3c1f2e4":{"m":108,"g":115},"be1a3cd9":{"m":108,"g":115},"4b74c3fc":{"m":108,"g":115},"ce3ca9b0":{"m":108,"g":115},"4d98e486":{"m":108,"g":115},"3d77a318":{"m":108,"g":115},"845d12a9":{"m":108,"g":115},"e47800e1":{"m":108,"g":115},"bb10e3a1":{"m":108,"g":115},"fda762a2":{"m":108,"g":115},"1df84ff4":{"m":108,"g":115},"66d
6be08":{"m":108,"g":115},"1c1f8a11":{"m":108,"g":115},"384f8ab5":{"m":108,"g":115},"6a9d6ca3":{"m":108,"g":115},"94371dbb":{"m":108,"g":115},"740f0630":{"m":108,"g":115},"81da16f6":{"m":108,"g":115},"bc938ea1":{"m":108,"g":115},"eff4eb3f":{"m":108,"g":115},"87dab548":{"m":108,"g":115},"5121af46":{"m":108,"g":115},"983aa496":{"m":108,"g":115},"9c3e95d9":{"m":108,"g":115},"e52c3866":{"m":108,"g":115},"da53e13c":{"m":108,"g":115},"d7e38b2f":{"m":108,"g":115},"21b88460":{"m":108,"g":115},"0c8594e6":{"m":108,"g":115},"c186feed":{"m":108,"g":115},"84b006b2":{"m":108,"g":115},"8ca07bd9":{"m":108,"g":115},"4fc09e0d":{"m":108,"g":115},"a3d99d6d":{"m":108,"g":115},"189af908":{"m":108,"g":115},"f8644a56":{"m":108,"g":115},"e3e75a78":{"m":108,"g":115},"d4db9b02":{"m":108,"g":115},"f7dd651d":{"m":108,"g":115},"9d54c6e6":{"m":108,"g":115},"1f9d65f5":{"m":108,"g":115},"29589512":{"m":108,"g":115},"584e1ab2":{"m":108,"g":115},"392de007":{"m":108,"g":115},"004f7f19":{"m":108,"g":115},"d2fbf2de":{"m":108,"g":115},"fab0f6e7":{"m":108,"g":115},"27985c27":{"m":108,"g":115},"ac474869":{"m":108,"g":115},"0b1e04f0":{"m":108,"g":115},"c1c7dc45":{"m":108,"g":115},"2cc9eeab":{"m":108,"g":115},"63d82a77":{"m":108,"g":115},"53dcc750":{"m":108,"g":115},"432f2053":{"m":108,"g":115},"1fea998a":{"m":108,"g":115},"5aa1ebd2":{"m":108,"g":115},"4dbf4360":{"m":108,"g":115},"3d6be1fb":{"m":108,"g":115},"4063234c":{"m":108,"g":115},"83feef5b":{"m":108,"g":115},"2871eacc":{"m":108,"g":115},"ac15bdc1":{"m":108,"g":115},"d6451c3f":{"m":108,"g":115},"841810f2":{"m":108,"g":115},"733446dd":{"m":108,"g":115},"4c22897a":{"m":108,"g":115},"1bc183c6":{"m":108,"g":115},"b87aacb5":{"m":108,"g":115},"b3363cc1":{"m":108,"g":115},"98457c04":{"m":108,"g":115},"0fc8bf2c":{"m":108,"g":115},"a669bc2f":{"m":108,"g":115},"6b7c2471":{"m":108,"g":115},"a027a9b4":{"m":108,"g":115},"9e426466":{"m":108,"g":115},"2f20f430":{"m":108,"g":115},"65736dc5":{"m":108,"g":115},"7b56e494":{"m":108,"g":115},"0ff6d1fc":{"m":108,"g":115},"4a
16a71c":{"m":108,"g":115},"a16923ef":{"m":108,"g":115},"6337d905":{"m":108,"g":115},"71fb8c95":{"m":108,"g":115},"94f44b88":{"m":108,"g":115},"3b3b3baf":{"m":108,"g":115},"35e6bc92":{"m":108,"g":115},"9394ed63":{"m":108,"g":115},"930fe467":{"m":108,"g":115},"13c48dcf":{"m":108,"g":115},"8723b4f1":{"m":108,"g":115},"62f99e08":{"m":108,"g":115},"86a0be65":{"m":108,"g":115},"0edda320":{"m":108,"g":115},"924827c3":{"m":108,"g":115},"c81daf83":{"m":108,"g":115},"25caa7a8":{"m":108,"g":115},"03d11449":{"m":108,"g":115},"83123f48":{"m":108,"g":115},"48afa8f1":{"m":108,"g":115},"2ecbd8b8":{"m":108,"g":115},"305b27c1":{"m":108,"g":115},"1ce30dd1":{"m":108,"g":115},"c9ee7385":{"m":108,"g":115},"1f9ec653":{"m":108,"g":115},"ad359d1c":{"m":108,"g":115},"5f5b3b24":{"m":108,"g":115},"4caca4f6":{"m":108,"g":115},"f2a5de28":{"m":108,"g":115},"445f9dca":{"m":108,"g":115},"3a9afe2a":{"m":108,"g":115},"9aea2555":{"m":108,"g":115},"fcc11e5e":{"m":108,"g":115},"5190ba7f":{"m":108,"g":115},"5438886c":{"m":108,"g":115},"9c83d74d":{"m":108,"g":115},"b4ac2b9c":{"m":108,"g":115},"83262dcb":{"m":108,"g":115},"c46c75f8":{"m":108,"g":115},"2aaf22c4":{"m":108,"g":115},"29a610b4":{"m":108,"g":115},"5ded39ca":{"m":108,"g":115},"4093d460":{"m":108,"g":115},"9d68bdb2":{"m":108,"g":115},"a2184901":{"m":108,"g":115},"0eec4cb6":{"m":108,"g":115},"ff1f6825":{"m":108,"g":115},"9f78f391":{"m":108,"g":115},"f508cd3c":{"m":108,"g":115},"44e86480":{"m":108,"g":115},"8c07fabd":{"m":108,"g":115},"90f44b74":{"m":108,"g":115},"38907fe6":{"m":108,"g":115},"f9afa7dc":{"m":108,"g":115},"0d9e89ec":{"m":108,"g":115},"3d64fda3":{"m":108,"g":115},"3bffe112":{"m":108,"g":115},"44426e54":{"m":108,"g":115},"9f24dfef":{"m":108,"g":115},"89f1d4f5":{"m":108,"g":115},"75e6a7cd":{"m":108,"g":115},"6f81a710":{"m":108,"g":115},"a6452b71":{"m":108,"g":115},"f4ae50e9":{"m":108,"g":115},"84cb449e":{"m":108,"g":115},"f003cd35":{"m":108,"g":115},"9d834fdc":{"m":108,"g":115},"b3279251":{"m":108,"g":115},"067068f2":{"m":108,"g":115},"6
beeff41":{"m":108,"g":115},"2e8e7e35":{"m":108,"g":115},"2449a0af":{"m":108,"g":115},"0f229c07":{"m":108,"g":115},"dd001a54":{"m":108,"g":115},"4ea9d74a":{"m":108,"g":115},"dd949ace":{"m":108,"g":115},"f2887498":{"m":108,"g":115},"8ecf6b9d":{"m":108,"g":115},"0418b9d4":{"m":108,"g":115},"e322a94d":{"m":108,"g":115},"2c7f01bc":{"m":108,"g":115},"b58ae7a2":{"m":108,"g":115},"6345069f":{"m":108,"g":115},"ce9cf353":{"m":108,"g":115},"f8a173bb":{"m":108,"g":115},"6b847a9a":{"m":108,"g":115},"473400e4":{"m":108,"g":115},"dd665f96":{"m":108,"g":115},"3817a37d":{"m":108,"g":115},"7ba5ad57":{"m":108,"g":115},"19bc77f0":{"m":108,"g":115},"86497d99":{"m":108,"g":115},"5c31b35d":{"m":108,"g":115},"ef48d554":{"m":108,"g":115},"a886564a":{"m":108,"g":115},"9a44b643":{"m":108,"g":115},"41d71ca4":{"m":108,"g":115},"20cfc5a2":{"m":108,"g":115},"48b8b4c1":{"m":108,"g":115},"323bc2f5":{"m":108,"g":115},"137e75da":{"m":108,"g":115},"52e1f52f":{"m":108,"g":115},"50188092":{"m":108,"g":115},"326a901d":{"m":108,"g":115},"6e0b6468":{"m":108,"g":115},"4a9f3eef":{"m":108,"g":115},"1b7afad0":{"m":108,"g":115},"f29aba8c":{"m":108,"g":115},"faa25df1":{"m":108,"g":115},"7b81f956":{"m":108,"g":115},"d3e67deb":{"m":108,"g":115},"442534aa":{"m":108,"g":115},"de8b8b6e":{"m":108,"g":115},"3f2e315f":{"m":108,"g":115},"6e215118":{"m":108,"g":115},"a47baff1":{"m":108,"g":115},"fd7e15b7":{"m":108,"g":115},"fc42ff7b":{"m":108,"g":115},"7c0db868":{"m":108,"g":115},"706bd69c":{"m":108,"g":115},"23f2afb2":{"m":108,"g":115},"a60f88b5":{"m":108,"g":115},"591c232f":{"m":108,"g":115},"f352b793":{"m":108,"g":115},"6642e3a2":{"m":108,"g":115},"67a7d1f6":{"m":108,"g":115},"92cbef59":{"m":108,"g":115},"b3359dc9":{"m":108,"g":115},"7b7e5615":{"m":108,"g":115},"1a8706c8":{"m":108,"g":115},"7d3af603":{"m":108,"g":115},"4e7f0252":{"m":108,"g":115},"36bfddec":{"m":108,"g":115},"91e2f902":{"m":108,"g":115},"a59cbea9":{"m":108,"g":115},"53f7874a":{"m":108,"g":115},"61a46804":{"m":108,"g":115},"9020f7fc":{"m":108,"g":115},"
dd650e0e":{"m":108,"g":115},"a9471542":{"m":108,"g":115},"41357e51":{"m":108,"g":115},"e2fd2b9c":{"m":108,"g":115},"7490e3f6":{"m":108,"g":115},"6ee6619b":{"m":108,"g":115},"54ea57f2":{"m":108,"g":115},"b4c9f38a":{"m":108,"g":115},"11325474":{"m":108,"g":115},"1d24db83":{"m":108,"g":115},"44401358":{"m":108,"g":115},"9c7e3924":{"m":108,"g":115},"08fab2b0":{"m":108,"g":115},"0d1e27a0":{"m":108,"g":115},"774b47f3":{"m":108,"g":115},"76915d68":{"m":108,"g":115},"39fd1788":{"m":108,"g":115},"ed0a3dd5":{"m":108,"g":115},"2e901e89":{"m":108,"g":115},"d3be9710":{"m":108,"g":115},"3e7ff1ab":{"m":108,"g":115},"aaf0ad8c":{"m":108,"g":115},"361379b5":{"m":108,"g":115},"1ac16add":{"m":108,"g":115},"c3a5fb3b":{"m":108,"g":115},"4bf6e5a6":{"m":108,"g":115},"3ae33fcd":{"m":108,"g":115},"500b15c9":{"m":108,"g":107},"16a4c66d":{"m":108,"g":107},"89e6521c":{"m":108,"g":107},"fd05b567":{"m":108,"g":107},"482c3db2":{"m":108,"g":107},"47824c14":{"m":108,"g":107},"c36a6693":{"m":108,"g":107},"62f8eb48":{"m":108,"g":107},"b7cd7430":{"m":108,"g":107},"a69b6370":{"m":108,"g":107},"2d120f8b":{"m":108,"g":107},"4f2e1490":{"m":108,"g":107},"3fa3c6cd":{"m":108,"g":107},"6210e2c4":{"m":108,"g":107},"6ad6c8c9":{"m":108,"g":107},"5b6acc14":{"m":108,"g":107},"4373df55":{"m":108,"g":107},"c0e84297":{"m":108,"g":107},"92cc32d9":{"m":108,"g":107},"cbbd685a":{"m":108,"g":107},"78aad910":{"m":108,"g":107},"288ae41f":{"m":108,"g":107},"01c99a99":{"m":108,"g":107},"b114a810":{"m":108,"g":107},"0475448e":{"m":108,"g":107},"399e7ec8":{"m":108,"g":107},"1bd53168":{"m":108,"g":107},"aeac900c":{"m":108,"g":107},"4fc5f2f9":{"m":108,"g":107},"168033d5":{"m":108,"g":107},"cbbb7383":{"m":108,"g":107},"89588179":{"m":108,"g":107},"8c7bb39d":{"m":108,"g":107},"ca47e24f":{"m":108,"g":107},"d26ca84f":{"m":108,"g":107},"8128e08d":{"m":108,"g":107},"5d62b56f":{"m":108,"g":107},"3ae8e3ea":{"m":108,"g":107},"c1d2061f":{"m":108,"g":107},"556e4143":{"m":108,"g":107},"4ef47839":{"m":108,"g":107},"32d9e39a":{"m":108,"g":107},
"4f4e0e41":{"m":108,"g":107},"901ab758":{"m":108,"g":107},"8e8545ca":{"m":108,"g":107},"a4b0d5c9":{"m":108,"g":107},"40e3b2be":{"m":108,"g":107},"75df31b6":{"m":108,"g":107},"194561f2":{"m":108,"g":107},"5e91fed1":{"m":108,"g":107},"873f384a":{"m":108,"g":107},"b01eeb80":{"m":108,"g":107},"1ea94d3b":{"m":108,"g":107},"354ac435":{"m":108,"g":107},"d98a4913":{"m":108,"g":107},"08f8f490":{"m":108,"g":107},"d4bf5a85":{"m":108,"g":107},"7cb20754":{"m":108,"g":107},"6d0646da":{"m":108,"g":107},"02bc1c7d":{"m":108,"g":107},"fc8c8e50":{"m":108,"g":107},"9bd4872a":{"m":108,"g":107},"2fa0462c":{"m":108,"g":107},"915140fd":{"m":108,"g":107},"36fc9260":{"m":108,"g":107},"fee0ab0f":{"m":108,"g":107},"f57d2dc1":{"m":108,"g":107},"f2d68ded":{"m":108,"g":107},"3b87a9e8":{"m":108,"g":107},"f024795e":{"m":108,"g":107},"b102353f":{"m":108,"g":107},"7a27e798":{"m":108,"g":107},"76ba5bbe":{"m":108,"g":107},"ed6f7597":{"m":108,"g":107},"e67276ec":{"m":108,"g":107},"0242bb9c":{"m":108,"g":107},"760286e3":{"m":108,"g":107},"3435a24e":{"m":108,"g":107},"00da9065":{"m":108,"g":107},"e0ab167d":{"m":109,"g":115},"c807cd7c":{"m":109,"g":115},"327f7b7c":{"m":109,"g":115},"80425e59":{"m":109,"g":115},"af9d4eb0":{"m":109,"g":115},"fb107cfd":{"m":109,"g":115},"e3e97a12":{"m":110,"g":115},"05106867":{"m":110,"g":115},"9dcdf5da":{"m":110,"g":115},"f8b757bc":{"m":110,"g":115},"ebd9dbe7":{"m":110,"g":115},"938e986e":{"m":110,"g":115},"17d5eda8":{"m":110,"g":115},"71a7f1d8":{"m":110,"g":115},"433266c1":{"m":110,"g":115},"fda47926":{"m":110,"g":115},"a0b22f2f":{"m":110,"g":115},"b5c6529e":{"m":110,"g":115},"ca4b86c5":{"m":110,"g":115},"dd6ec029":{"m":110,"g":115},"bf863e3b":{"m":110,"g":115},"9e169ea8":{"m":110,"g":115},"bc80dc4c":{"m":111,"g":115},"b962a296":{"m":111,"g":115},"aa3eba8e":{"m":111,"g":115},"07ee0ab7":{"m":111,"g":115},"5c06dcb7":{"m":111,"g":115},"6f6beca4":{"m":111,"g":115},"68a54e06":{"m":111,"g":115},"fd18995c":{"m":111,"g":115},"db0831e0":{"m":111,"g":115},"6e4e1c8c":{"m":111,"g":115}
,"9768c50d":{"m":111,"g":115},"fd71b11b":{"m":111,"g":115},"ae7428a8":{"m":111,"g":115},"a3aee7c3":{"m":111,"g":115},"79e6a8a6":{"m":111,"g":115},"8f7b1c31":{"m":111,"g":115},"b9683be6":{"m":111,"g":115},"a85363c1":{"m":111,"g":115},"b21fdd53":{"m":111,"g":115},"c04c17ed":{"m":111,"g":115},"16a6d21b":{"m":111,"g":115},"a530b3ff":{"m":111,"g":115},"603b3446":{"m":111,"g":115},"b6c14ec0":{"m":111,"g":115},"43de1d73":{"m":111,"g":115},"79ce3688":{"m":111,"g":115},"44ffe2cb":{"m":111,"g":115},"1a0896e9":{"m":111,"g":115},"90313fb0":{"m":111,"g":115},"3578eb1e":{"m":111,"g":115},"0936c766":{"m":111,"g":115},"0ef583b7":{"m":111,"g":115},"f7881a27":{"m":111,"g":115},"fdff3167":{"m":111,"g":115},"cbc0e4d7":{"m":111,"g":115},"4cd08dc5":{"m":111,"g":115},"f92b729d":{"m":111,"g":115},"e2e378ca":{"m":111,"g":115},"dc1decc6":{"m":111,"g":115},"03680f33":{"m":111,"g":115},"d4c5e534":{"m":111,"g":115},"817c62a0":{"m":111,"g":115},"0ff72419":{"m":111,"g":115},"80dc76e1":{"m":111,"g":115},"9b08d975":{"m":111,"g":115},"a0a77d93":{"m":111,"g":115},"24a8cee6":{"m":111,"g":115},"3affa9dc":{"m":111,"g":115},"ea0696b9":{"m":111,"g":115},"3aec3d4f":{"m":111,"g":115},"b0d25e72":{"m":112,"g":115},"a2424068":{"m":112,"g":115},"c5d2b01c":{"m":112,"g":115},"46ccbed2":{"m":112,"g":115},"fe68c148":{"m":112,"g":115},"70c0c1f9":{"m":112,"g":115},"760b788a":{"m":112,"g":115},"1ee11df8":{"m":112,"g":115},"dee197e1":{"m":112,"g":115},"ab795ae8":{"m":112,"g":115},"480d1b8b":{"m":112,"g":115},"6c18ab46":{"m":112,"g":115},"4a0e0be2":{"m":112,"g":115},"64f296f8":{"m":112,"g":115},"956d805d":{"m":112,"g":115},"30c6e1f5":{"m":112,"g":115},"bfe01a5e":{"m":112,"g":115},"3dd6420a":{"m":112,"g":115},"532f998b":{"m":112,"g":115},"de15d140":{"m":112,"g":115},"37367da6":{"m":112,"g":115},"ef959d7b":{"m":112,"g":115},"4aa1e69b":{"m":112,"g":115},"dc491b39":{"m":112,"g":115},"5b64f006":{"m":112,"g":115},"5b7448de":{"m":112,"g":115},"6d55f60e":{"m":112,"g":115},"033b75f5":{"m":112,"g":115},"f3b5db6e":{"m":112,"g":115
},"2286e85e":{"m":112,"g":115},"91b3555d":{"m":112,"g":115},"9e2f7252":{"m":112,"g":115},"21176b00":{"m":112,"g":115},"94100294":{"m":112,"g":115},"cda7e47c":{"m":112,"g":115},"e903f695":{"m":112,"g":115},"27760fc1":{"m":112,"g":115},"0ac809de":{"m":112,"g":115},"4efe2c57":{"m":112,"g":115},"5be8c2f7":{"m":112,"g":115},"737d73ed":{"m":112,"g":115},"ebd0e1c1":{"m":112,"g":115},"a1d03892":{"m":112,"g":115},"dccf52f9":{"m":112,"g":115},"676a7b51":{"m":112,"g":115},"15f99347":{"m":112,"g":115},"bcf1955f":{"m":112,"g":115},"a06bf664":{"m":112,"g":115},"bf72b801":{"m":112,"g":115},"8cbe1538":{"m":112,"g":115},"8471e5e6":{"m":112,"g":115},"4582931a":{"m":112,"g":115},"d352c29a":{"m":112,"g":115},"d3ee7098":{"m":112,"g":115},"71fc7b7f":{"m":112,"g":115},"9ab72f98":{"m":112,"g":115},"f3817cb0":{"m":112,"g":115},"71133a04":{"m":112,"g":115},"2cd94dd0":{"m":112,"g":115},"f5f6b3b4":{"m":112,"g":115},"94fb4e9e":{"m":112,"g":115},"d1d4074c":{"m":112,"g":115},"718f25ae":{"m":112,"g":115},"948b01a0":{"m":112,"g":115},"cdc56ef6":{"m":112,"g":115},"16ff3d4b":{"m":112,"g":115},"83d55ac5":{"m":112,"g":115},"2fe17735":{"m":112,"g":115},"97fff98c":{"m":112,"g":115},"ba066ca0":{"m":112,"g":115},"96784a65":{"m":112,"g":115},"df5407fb":{"m":112,"g":115},"8ad700f7":{"m":112,"g":115},"148022fc":{"m":112,"g":115},"7a40e4f4":{"m":112,"g":115},"19d64f2b":{"m":112,"g":115},"a02071a1":{"m":112,"g":115},"45b3a6a2":{"m":112,"g":115},"9a18aa54":{"m":112,"g":115},"91f0fd95":{"m":112,"g":115},"8085aca7":{"m":112,"g":115},"0096798e":{"m":112,"g":115},"2c2b19b1":{"m":112,"g":115},"72f9fc5f":{"m":112,"g":115},"ec99668a":{"m":112,"g":115},"78f13981":{"m":112,"g":115},"bfd7a18d":{"m":112,"g":115},"5dd8c644":{"m":112,"g":115},"ee21817c":{"m":112,"g":115},"b7d1f17b":{"m":112,"g":115},"c8295d23":{"m":112,"g":115},"b67c277f":{"m":112,"g":115},"8116804e":{"m":112,"g":115},"8c5930f0":{"m":112,"g":115},"3b99f23c":{"m":112,"g":115},"ee0b3c5b":{"m":112,"g":115},"6049ca20":{"m":112,"g":115},"7577f0e4":{"m":112,"g":11
5},"8cda5a62":{"m":112,"g":115},"400d3b97":{"m":112,"g":115},"37d83c6e":{"m":112,"g":115},"7802586c":{"m":112,"g":115},"bc5fc332":{"m":112,"g":115},"f3440adc":{"m":112,"g":115},"5a7e10fe":{"m":112,"g":115},"33467c05":{"m":112,"g":115},"b0fcbb74":{"m":112,"g":115},"76a2c86b":{"m":112,"g":115},"e719bb0e":{"m":112,"g":115},"06724683":{"m":112,"g":115},"617aa2b2":{"m":112,"g":115},"111b1379":{"m":112,"g":115},"41628dc1":{"m":112,"g":115},"a12061df":{"m":112,"g":115},"85ed8e0a":{"m":112,"g":115},"dd1e2689":{"m":112,"g":115},"9a7ced4e":{"m":112,"g":115},"cb3918a0":{"m":112,"g":115},"f3b67602":{"m":112,"g":115},"9eb50ecc":{"m":112,"g":115},"b3e7a2ce":{"m":112,"g":115},"00974e4f":{"m":112,"g":115},"5f1eb204":{"m":112,"g":115},"039cef76":{"m":112,"g":115},"4c22ebe2":{"m":112,"g":115},"a5a03209":{"m":112,"g":115},"21af5c04":{"m":112,"g":115},"012584ec":{"m":112,"g":115},"90dfe3de":{"m":112,"g":115},"9a719b7a":{"m":112,"g":115},"3fa62da7":{"m":112,"g":115},"dbb1235d":{"m":112,"g":115},"ad26f298":{"m":112,"g":115},"8d114f25":{"m":112,"g":115},"0e78c63c":{"m":112,"g":115},"1a3d6f31":{"m":112,"g":115},"0b8c5721":{"m":112,"g":115},"beac202b":{"m":112,"g":115},"21b9a4b4":{"m":112,"g":115},"db37422c":{"m":112,"g":115},"ab62b135":{"m":112,"g":115},"273b2834":{"m":112,"g":115},"f84db115":{"m":112,"g":115},"efb0de2c":{"m":112,"g":115},"0f6ac5e2":{"m":112,"g":115},"29850900":{"m":112,"g":115},"e678cc71":{"m":112,"g":115},"4efe844a":{"m":112,"g":115},"bde73ee4":{"m":112,"g":115},"4f0e28d7":{"m":112,"g":115},"045ab92d":{"m":112,"g":115},"bd7f8821":{"m":112,"g":115},"5e5c30d9":{"m":112,"g":115},"9f00ec44":{"m":112,"g":115},"8e85ee88":{"m":112,"g":115},"adf73175":{"m":112,"g":115},"13705dae":{"m":112,"g":115},"df97b31f":{"m":112,"g":115},"339f8eef":{"m":112,"g":115},"afd9f2f5":{"m":112,"g":115},"f40038fb":{"m":112,"g":115},"bebd0576":{"m":112,"g":115},"f9836660":{"m":112,"g":115},"8b3b995a":{"m":112,"g":115},"6e95f5e5":{"m":112,"g":115},"0e9387a9":{"m":112,"g":115},"fa9c82d3":{"m":112,"g":1
15},"918e3d4c":{"m":112,"g":115},"e9697374":{"m":112,"g":115},"93088b69":{"m":112,"g":115},"453511ac":{"m":112,"g":115},"d0730487":{"m":112,"g":115},"b32ab070":{"m":112,"g":115},"75ee0011":{"m":112,"g":115},"ec15c836":{"m":112,"g":115},"106c2b31":{"m":112,"g":115},"c6756949":{"m":112,"g":115},"27e8ffed":{"m":112,"g":115},"4dbb34fe":{"m":112,"g":115},"1e18a341":{"m":112,"g":115},"2c562fd2":{"m":112,"g":115},"b648d862":{"m":112,"g":115},"bbf261ae":{"m":112,"g":115},"4f8a982d":{"m":112,"g":115},"d966b902":{"m":112,"g":115},"de921733":{"m":112,"g":115},"397448eb":{"m":112,"g":115},"66d5d042":{"m":112,"g":115},"73179b76":{"m":112,"g":115},"8cbf71dc":{"m":112,"g":115},"56eb5d0a":{"m":112,"g":115},"4ed9053e":{"m":112,"g":115},"5e19b159":{"m":112,"g":115},"788b19a5":{"m":112,"g":115},"f78b7fd1":{"m":112,"g":115},"b1fb7e45":{"m":112,"g":115},"1b2ff4fb":{"m":112,"g":115},"2c7ca33a":{"m":112,"g":115},"df397a72":{"m":112,"g":115},"5dfcd6c2":{"m":112,"g":115},"0dfd54d1":{"m":112,"g":115},"bcbeed71":{"m":112,"g":115},"cc9a31c6":{"m":112,"g":115},"d631290e":{"m":112,"g":115},"37565b7f":{"m":112,"g":115},"6243c367":{"m":112,"g":115},"60e37f80":{"m":112,"g":115},"369b1433":{"m":112,"g":115},"03dbf1aa":{"m":112,"g":115},"11dcabc5":{"m":112,"g":115},"4d89389c":{"m":112,"g":115},"9491d6e5":{"m":112,"g":115},"f64b8e3e":{"m":112,"g":115},"53976fce":{"m":112,"g":115},"18f91eb6":{"m":112,"g":115},"8766b3ac":{"m":112,"g":115},"1db649ac":{"m":112,"g":115},"a1e5d781":{"m":112,"g":115},"b7361cc4":{"m":112,"g":115},"a96c5b5c":{"m":112,"g":115},"b9eb0d9c":{"m":112,"g":115},"1fbfdebe":{"m":112,"g":115},"a25e8e42":{"m":112,"g":115},"d4a93841":{"m":112,"g":115},"21e1bc47":{"m":112,"g":115},"9a0cac1b":{"m":112,"g":115},"b5245064":{"m":112,"g":115},"9d9fa9a5":{"m":112,"g":115},"58d06fdc":{"m":112,"g":115},"cb9e0e41":{"m":112,"g":115},"9db80253":{"m":112,"g":115},"598c0bc1":{"m":112,"g":115},"b361750a":{"m":112,"g":115},"16e56ea6":{"m":112,"g":115},"349b491c":{"m":112,"g":115},"5f77e129":{"m":112,"g":
115},"4750cddf":{"m":112,"g":115},"065e523d":{"m":112,"g":115},"7de2ce45":{"m":112,"g":115},"8c2ffaaf":{"m":112,"g":115},"20445327":{"m":112,"g":115},"6d3c20cf":{"m":112,"g":115},"8b6966d0":{"m":112,"g":115},"a391f73a":{"m":112,"g":115},"25c73959":{"m":112,"g":115},"f05c6873":{"m":112,"g":115},"9a0d0b75":{"m":112,"g":115},"ba861293":{"m":112,"g":115},"c112bcc4":{"m":112,"g":115},"5e194b21":{"m":112,"g":115},"fd5ce576":{"m":112,"g":115},"92d79646":{"m":112,"g":115},"f9076a5a":{"m":112,"g":115},"646076b7":{"m":112,"g":115},"0d040089":{"m":112,"g":115},"05e47872":{"m":112,"g":115},"1e61b496":{"m":112,"g":115},"300676af":{"m":112,"g":115},"7fe89f7c":{"m":112,"g":115},"9970e3bf":{"m":112,"g":115},"70eedb58":{"m":112,"g":115},"9c99949e":{"m":112,"g":115},"c5082f0f":{"m":112,"g":115},"836873b9":{"m":112,"g":115},"8abe8dea":{"m":112,"g":115},"1e85589d":{"m":112,"g":115},"c2a26e72":{"m":112,"g":115},"591e6c59":{"m":112,"g":115},"42f34437":{"m":112,"g":115},"5c34b4f1":{"m":112,"g":115},"ff9b5618":{"m":112,"g":115},"fcd72bd1":{"m":112,"g":115},"3d8fc434":{"m":112,"g":115},"87a0f7d2":{"m":112,"g":115},"839c93bd":{"m":112,"g":115},"f1e9bbaf":{"m":112,"g":115},"3fd1431d":{"m":112,"g":115},"161e9dc5":{"m":112,"g":115},"54e872d3":{"m":112,"g":115},"e5b29bf1":{"m":112,"g":115},"9a7c8842":{"m":112,"g":115},"7a16db9b":{"m":112,"g":115},"09a1df22":{"m":112,"g":115},"4b7034dd":{"m":112,"g":115},"a23c3020":{"m":112,"g":115},"a7d825fc":{"m":112,"g":115},"38cd5fb1":{"m":112,"g":115},"001f5194":{"m":112,"g":115},"5ad296bd":{"m":112,"g":115},"9f81d741":{"m":112,"g":115},"a38c1497":{"m":112,"g":115},"74dd4249":{"m":112,"g":115},"dc20c22f":{"m":112,"g":115},"711390a9":{"m":112,"g":115},"53430588":{"m":112,"g":115},"fce7ae33":{"m":112,"g":115},"6b39f9cf":{"m":112,"g":115},"07c9d8fb":{"m":112,"g":115},"4a4772ae":{"m":112,"g":115},"c3779233":{"m":112,"g":115},"f84b57c8":{"m":112,"g":115},"aee094e4":{"m":112,"g":115},"55349e36":{"m":112,"g":115},"e1f7cf57":{"m":112,"g":115},"2bb9d454":{"m":112,"g"
:115},"d0934a51":{"m":112,"g":115},"3f2d0cef":{"m":112,"g":115},"8b30bec2":{"m":112,"g":115},"4aeba40d":{"m":112,"g":115},"28684f90":{"m":112,"g":115},"a4a3d823":{"m":113,"g":115},"0b13cbb7":{"m":113,"g":115},"efbc687c":{"m":113,"g":115},"292a867a":{"m":113,"g":115},"8fd41eae":{"m":113,"g":115},"0cd1996e":{"m":113,"g":115},"b6b4b563":{"m":113,"g":115},"f8924ad7":{"m":113,"g":115},"2f80bd9f":{"m":113,"g":115},"366a603e":{"m":113,"g":115},"baee0860":{"m":113,"g":115},"c7a104c1":{"m":113,"g":115},"97d966a7":{"m":113,"g":115},"8e66d87f":{"m":113,"g":115},"a20fc7b7":{"m":113,"g":115},"6b30e097":{"m":113,"g":115},"d645ae90":{"m":113,"g":115},"41763ba0":{"m":113,"g":115},"652c24a6":{"m":113,"g":115},"5e142484":{"m":113,"g":115},"c560410d":{"m":113,"g":115},"590f2da0":{"m":113,"g":115},"148d8d48":{"m":113,"g":115},"1a599509":{"m":113,"g":115},"36a6b8db":{"m":113,"g":115},"e0b2d3ee":{"m":113,"g":115},"4cb5a523":{"m":113,"g":115},"85c1f793":{"m":113,"g":115},"48e9e719":{"m":113,"g":115},"31b49c0b":{"m":113,"g":115},"d736e0b6":{"m":113,"g":115},"ffd03a9b":{"m":113,"g":115},"666da3d5":{"m":113,"g":115},"d01b9214":{"m":113,"g":115},"c70e58e8":{"m":113,"g":115},"c61b9a1d":{"m":113,"g":115},"3c3d6255":{"m":113,"g":115},"546914fa":{"m":113,"g":115},"4726c919":{"m":113,"g":115},"a0010bf4":{"m":113,"g":115},"307fc060":{"m":113,"g":115},"586e81a2":{"m":113,"g":115},"fad7ca73":{"m":113,"g":115},"08af8ffb":{"m":113,"g":115},"2c7f4ca2":{"m":113,"g":115},"03def5e3":{"m":113,"g":115},"6ae3f05b":{"m":113,"g":115},"fdc4e1e5":{"m":113,"g":115},"04b86b3c":{"m":113,"g":115},"d6777a70":{"m":113,"g":115},"8c574902":{"m":113,"g":115},"34151f17":{"m":113,"g":115},"6794d210":{"m":113,"g":115},"1a31229c":{"m":113,"g":115},"de89ef49":{"m":113,"g":115},"b00a0c78":{"m":113,"g":115},"a2faf894":{"m":113,"g":115},"7e61737d":{"m":113,"g":115},"3c699772":{"m":113,"g":115},"e8100774":{"m":113,"g":115},"963175d5":{"m":113,"g":115},"0618ad6d":{"m":113,"g":115},"6a261aac":{"m":113,"g":115},"7ff740a6":{"m":113,"g
":115},"bfcd9b24":{"m":113,"g":115},"458611de":{"m":113,"g":115},"3511b370":{"m":113,"g":115},"afcd3e10":{"m":113,"g":115},"12d68183":{"m":113,"g":115},"b65db028":{"m":113,"g":115},"948278f1":{"m":113,"g":115},"7d004799":{"m":113,"g":115},"083629c2":{"m":113,"g":115},"b658be6f":{"m":113,"g":115},"5e786cca":{"m":113,"g":115},"0b9dfba7":{"m":113,"g":115},"6a290034":{"m":113,"g":115},"2ac453b0":{"m":113,"g":115},"f35def86":{"m":113,"g":115},"d61615fe":{"m":113,"g":115},"b1ccaf01":{"m":113,"g":115},"097725bb":{"m":113,"g":115},"44b1fbe2":{"m":113,"g":115},"c0dbbdd1":{"m":113,"g":115},"25e7dbe8":{"m":113,"g":115},"0b2aa8a7":{"m":113,"g":115},"609f65ba":{"m":113,"g":115},"2d62af6b":{"m":113,"g":115},"a28b394f":{"m":113,"g":115},"96fe2d0f":{"m":113,"g":115},"bfa27438":{"m":113,"g":115},"86cb4db0":{"m":113,"g":115},"2e130b76":{"m":113,"g":115},"ac1f2928":{"m":113,"g":115},"195a59fe":{"m":113,"g":115},"47488cc3":{"m":113,"g":115},"61305291":{"m":113,"g":115},"a9ce2bcb":{"m":113,"g":115},"5dddb331":{"m":113,"g":115},"01a26544":{"m":113,"g":115},"73d4a5f8":{"m":113,"g":115},"7fb551a7":{"m":113,"g":115},"1193f131":{"m":113,"g":115},"84a9f5d6":{"m":113,"g":115},"8ce830a8":{"m":113,"g":115},"fb367acf":{"m":113,"g":115},"a6cc86df":{"m":113,"g":115},"229d2b95":{"m":113,"g":115},"9710f718":{"m":113,"g":115},"91847e38":{"m":113,"g":115},"5a290a56":{"m":113,"g":115},"580051c5":{"m":113,"g":115},"1237aa19":{"m":113,"g":115},"59911195":{"m":113,"g":115},"424591d5":{"m":113,"g":115},"d1676cd4":{"m":113,"g":115},"33b3c0f8":{"m":113,"g":115},"e5281f84":{"m":113,"g":115},"d17986f8":{"m":113,"g":115},"8831c55c":{"m":113,"g":115},"2bc61dd1":{"m":113,"g":115},"6535fda1":{"m":113,"g":115},"3713eb61":{"m":113,"g":115},"5937a56d":{"m":113,"g":115},"f065e5be":{"m":113,"g":115},"9de1320b":{"m":113,"g":115},"dda34c2f":{"m":113,"g":115},"4eeaff74":{"m":113,"g":115},"a17e70f5":{"m":113,"g":115},"816b3a43":{"m":113,"g":115},"3a641d90":{"m":113,"g":115},"6f16bf9d":{"m":113,"g":115},"5942fdb4":{"m":113,"
g":115},"af4ab656":{"m":113,"g":115},"11965b0d":{"m":113,"g":115},"71959545":{"m":113,"g":115},"24f7cb1e":{"m":113,"g":115},"e05555fa":{"m":113,"g":115},"43fa9f22":{"m":113,"g":115},"e98d9346":{"m":113,"g":115},"0c917410":{"m":113,"g":115},"25728863":{"m":113,"g":115},"dba751a8":{"m":113,"g":115},"2e763398":{"m":113,"g":115},"336e9a60":{"m":113,"g":115},"abb67815":{"m":113,"g":115},"07440f5f":{"m":113,"g":115},"9816989b":{"m":113,"g":115},"42245551":{"m":113,"g":115},"2a9d995c":{"m":113,"g":115},"a9050b5c":{"m":113,"g":115},"66face35":{"m":113,"g":115},"5519766a":{"m":113,"g":115},"72392f29":{"m":113,"g":115},"c1c8dd1d":{"m":113,"g":115},"f6bc3f52":{"m":113,"g":115},"8cc27fdc":{"m":113,"g":115},"9c339d6b":{"m":113,"g":115},"e23e280e":{"m":113,"g":115},"51f7c6bd":{"m":113,"g":115},"62e2e99d":{"m":113,"g":115},"8ebf72fe":{"m":113,"g":115},"82605747":{"m":113,"g":115},"37f3325b":{"m":113,"g":115},"bd95944c":{"m":113,"g":115},"c8a5d12a":{"m":113,"g":115},"2387c22b":{"m":113,"g":115},"592ddf37":{"m":113,"g":115},"0c3db889":{"m":113,"g":115},"2bdaf482":{"m":113,"g":115},"777eb538":{"m":113,"g":115},"05a35266":{"m":113,"g":115},"e56c64bf":{"m":113,"g":115},"fff7fbab":{"m":113,"g":115},"aae7ead2":{"m":113,"g":115},"a7fe6e10":{"m":113,"g":115},"be059b83":{"m":113,"g":115},"5d4fe1ce":{"m":113,"g":115},"1b011e68":{"m":113,"g":115},"5c0efa56":{"m":113,"g":115},"1e57b947":{"m":113,"g":115},"a5095d62":{"m":113,"g":115},"6c2c467d":{"m":113,"g":115},"c3d2ad4e":{"m":113,"g":115},"7ec5b4e8":{"m":113,"g":115},"60885482":{"m":113,"g":115},"172bcf01":{"m":113,"g":115},"37158f20":{"m":113,"g":115},"3e95aa1a":{"m":113,"g":115},"c4197e99":{"m":113,"g":115},"0ac61146":{"m":113,"g":115},"7dcd689b":{"m":113,"g":115},"f7bab41a":{"m":113,"g":115},"f68dd998":{"m":113,"g":115},"35ec2a45":{"m":113,"g":115},"0035f1ce":{"m":113,"g":115},"5e21d6ae":{"m":113,"g":115},"cd4da1f1":{"m":113,"g":115},"91678474":{"m":113,"g":115},"d511b2d9":{"m":113,"g":115},"77830a26":{"m":113,"g":115},"fce17048":{"m":113,
"g":115},"3d40794f":{"m":113,"g":115},"c1f39013":{"m":113,"g":115},"3e43eb13":{"m":113,"g":115},"458c0219":{"m":113,"g":115},"a73eb8cd":{"m":113,"g":115},"e7387035":{"m":113,"g":115},"fe531d6f":{"m":113,"g":115},"c4e314f9":{"m":113,"g":115},"7a06ef98":{"m":113,"g":115},"4a87ba21":{"m":113,"g":115},"d7b20dd6":{"m":113,"g":115},"c3faf2d6":{"m":113,"g":115},"9209b209":{"m":113,"g":115},"adba172f":{"m":113,"g":115},"cd641a99":{"m":113,"g":115},"71f24ef8":{"m":113,"g":115},"b1f0fc1c":{"m":113,"g":115},"32d89373":{"m":113,"g":115},"f47a2c67":{"m":113,"g":115},"ee704e62":{"m":113,"g":115},"f4e3ebeb":{"m":113,"g":115},"312bfc4c":{"m":113,"g":115},"e290303e":{"m":113,"g":115},"aab35bcc":{"m":113,"g":115},"42aedb02":{"m":113,"g":115},"984730b7":{"m":113,"g":115},"23632d35":{"m":113,"g":115},"08b8c0c3":{"m":113,"g":115},"d42975c6":{"m":113,"g":115},"adc24a3a":{"m":113,"g":115},"7ff93e61":{"m":113,"g":115},"b24b2e7e":{"m":113,"g":115},"7135db5d":{"m":113,"g":115},"4b5ef300":{"m":113,"g":115},"4f564b9e":{"m":113,"g":115},"98c3b04f":{"m":113,"g":115},"ddab4fc7":{"m":113,"g":115},"d21c3522":{"m":113,"g":115},"4a762041":{"m":113,"g":115},"ea338676":{"m":113,"g":115},"b06db198":{"m":113,"g":115},"8c1ef0f9":{"m":113,"g":115},"f5a2faf2":{"m":113,"g":115},"1c82d9db":{"m":113,"g":115},"9241f4fd":{"m":113,"g":115},"063c3791":{"m":113,"g":115},"632b7d8c":{"m":113,"g":115},"16adf3dc":{"m":113,"g":115},"c3a1d775":{"m":113,"g":115},"89971c4c":{"m":113,"g":115},"113f8f65":{"m":113,"g":115},"e22f3a5e":{"m":113,"g":115},"095093ee":{"m":113,"g":115},"d27a6f70":{"m":113,"g":115},"0753ef83":{"m":113,"g":115},"662393f2":{"m":113,"g":115},"b1bb8e74":{"m":113,"g":115},"38c00ed7":{"m":113,"g":115},"d4041a5e":{"m":113,"g":115},"2f555c4c":{"m":113,"g":115},"e53df7c0":{"m":113,"g":115},"9c53dad8":{"m":113,"g":115},"7ca1bea6":{"m":113,"g":115},"97c38239":{"m":113,"g":115},"60dbbd08":{"m":113,"g":115},"aa1c5cf5":{"m":113,"g":115},"592caab6":{"m":113,"g":115},"2101d93b":{"m":113,"g":115},"70e4b218":{"m":113
,"g":115},"944f1ea0":{"m":113,"g":115},"9d7e82a0":{"m":113,"g":115},"f0580551":{"m":113,"g":115},"635ccda6":{"m":113,"g":115},"1c3dbad8":{"m":113,"g":115},"e2ac7888":{"m":113,"g":115},"86527a47":{"m":113,"g":115},"134b4f7e":{"m":113,"g":115},"f67d1f45":{"m":113,"g":115},"0f04a5f4":{"m":113,"g":115},"2f18602f":{"m":113,"g":115},"56321e9f":{"m":113,"g":115},"12d6cf18":{"m":113,"g":115},"fc3e5420":{"m":113,"g":115},"08ecd0aa":{"m":113,"g":115},"720c1c8c":{"m":113,"g":115},"d403c143":{"m":113,"g":115},"cba0d8c3":{"m":113,"g":115},"f1d78923":{"m":113,"g":115},"7c876de7":{"m":113,"g":115},"ba94b829":{"m":113,"g":115},"2b7417bf":{"m":113,"g":115},"f1116495":{"m":113,"g":115},"bd7eb020":{"m":113,"g":115},"74cd6e39":{"m":113,"g":115},"b17e67df":{"m":113,"g":115},"8ecef73f":{"m":113,"g":115},"1d1ce624":{"m":113,"g":115},"60e2a7ce":{"m":113,"g":115},"d88ef4a3":{"m":113,"g":115},"6f993e8b":{"m":113,"g":115},"03ce92e5":{"m":113,"g":115},"00eb5eb7":{"m":113,"g":115},"dab4663b":{"m":113,"g":115},"610a6d6e":{"m":113,"g":115},"36efd5be":{"m":113,"g":115},"68cdc189":{"m":113,"g":115},"7f399e4b":{"m":113,"g":115},"873d858b":{"m":113,"g":115},"3fa3c22a":{"m":113,"g":115},"4f2055ad":{"m":113,"g":115},"616a3e20":{"m":113,"g":115},"ac2a723b":{"m":113,"g":115},"56b991b1":{"m":113,"g":115},"780d6a22":{"m":113,"g":115},"8b713c72":{"m":113,"g":115},"5bfafdfc":{"m":113,"g":115},"8c52de6f":{"m":113,"g":115},"c1815a99":{"m":113,"g":115},"4e6c4923":{"m":113,"g":115},"b91cb67e":{"m":113,"g":115},"e7bc6003":{"m":113,"g":115},"2a2ff9a8":{"m":113,"g":115},"5291f32d":{"m":113,"g":115},"67073dde":{"m":113,"g":115},"9a5c42f9":{"m":113,"g":115},"388c05d5":{"m":113,"g":115},"fc809665":{"m":113,"g":115},"6fd4816d":{"m":113,"g":115},"1344ebc8":{"m":113,"g":115},"e07b21ce":{"m":113,"g":115},"52f248cd":{"m":113,"g":115},"93f75778":{"m":113,"g":115},"4039c626":{"m":113,"g":115},"db71c38f":{"m":113,"g":115},"7a68b422":{"m":113,"g":115},"60fc5b51":{"m":113,"g":115},"a13dd1e4":{"m":113,"g":115},"d500eb91":{"m":11
3,"g":115},"1ccd59c7":{"m":113,"g":115},"c32fb7a2":{"m":113,"g":115},"1ba137e9":{"m":113,"g":115},"de28f8e7":{"m":113,"g":115},"56405076":{"m":113,"g":115},"b73ac629":{"m":113,"g":115},"77098aea":{"m":113,"g":115},"5ccf0b03":{"m":113,"g":115},"a77564e0":{"m":113,"g":115},"4f9e71df":{"m":113,"g":115},"541551ce":{"m":113,"g":115},"124097fc":{"m":113,"g":115},"e1d45bc2":{"m":113,"g":115},"14fdd527":{"m":113,"g":115},"f949ad57":{"m":113,"g":115},"c49484a6":{"m":113,"g":115},"a2f7218a":{"m":113,"g":115},"311de47b":{"m":113,"g":115},"373080ea":{"m":113,"g":115},"7f028b07":{"m":113,"g":115},"0abb41c7":{"m":113,"g":115},"925dbb32":{"m":113,"g":115},"8df7353a":{"m":113,"g":115},"ae4be601":{"m":113,"g":115},"9b876889":{"m":113,"g":115},"c0c6f543":{"m":113,"g":115},"edd6a07b":{"m":113,"g":115},"b6dd4bcb":{"m":113,"g":115},"b2435be6":{"m":113,"g":115},"5fe39e85":{"m":113,"g":115},"fa5d0bf6":{"m":113,"g":115},"16e93359":{"m":113,"g":115},"f1c692f6":{"m":113,"g":115},"80572c83":{"m":113,"g":115},"4bb08f6e":{"m":113,"g":115},"ec272dda":{"m":113,"g":115},"a220c14f":{"m":113,"g":115},"35ef3f29":{"m":113,"g":115},"31fb19a0":{"m":113,"g":115},"3f41b48c":{"m":113,"g":115},"2689f0bf":{"m":113,"g":115},"52074240":{"m":113,"g":115},"c3c26f76":{"m":113,"g":115},"2cf811a9":{"m":113,"g":115},"3b25dc12":{"m":113,"g":115},"5c08d7d2":{"m":113,"g":115},"a45d9a4e":{"m":113,"g":115},"28c79dc8":{"m":113,"g":115},"1fcccda4":{"m":113,"g":115},"79acec4f":{"m":113,"g":115},"b1721edb":{"m":113,"g":115},"57234d0c":{"m":113,"g":115},"b93acd70":{"m":113,"g":115},"86a32bb5":{"m":113,"g":115},"5afd0365":{"m":113,"g":115},"059c13de":{"m":113,"g":115},"50dc0c1e":{"m":113,"g":115},"76becc1d":{"m":113,"g":115},"2a37b24d":{"m":113,"g":115},"f73aae0b":{"m":113,"g":115},"69b35793":{"m":113,"g":115},"957482c8":{"m":113,"g":115},"3795b6a4":{"m":113,"g":115},"7eccbe99":{"m":113,"g":115},"0549f21c":{"m":113,"g":115},"b354e3c9":{"m":113,"g":115},"65e6f48c":{"m":113,"g":115},"0ec580a8":{"m":113,"g":115},"0b14159f":{"m":1
13,"g":115},"1489cd6c":{"m":113,"g":115},"fc2c3a3d":{"m":113,"g":115},"8f6a1758":{"m":113,"g":115},"01018138":{"m":113,"g":115},"4844fac9":{"m":113,"g":115},"b7d385e8":{"m":113,"g":115},"305c9e8c":{"m":113,"g":115},"ca63f075":{"m":113,"g":115},"f9ee6ae1":{"m":113,"g":115},"dcee42c2":{"m":113,"g":115},"258d02c8":{"m":113,"g":115},"60d7beda":{"m":113,"g":115},"2f8ba6fe":{"m":113,"g":115},"7ce6c10e":{"m":113,"g":115},"55025b92":{"m":113,"g":115},"4c21b090":{"m":113,"g":115},"165abeeb":{"m":113,"g":115},"21ca4c3a":{"m":113,"g":115},"e3cf812f":{"m":113,"g":115},"4da55336":{"m":113,"g":115},"ac964d2e":{"m":113,"g":115},"fa46e2bd":{"m":113,"g":115},"b047b553":{"m":113,"g":115},"a0f844ed":{"m":113,"g":115},"2df532ef":{"m":113,"g":115},"abea9250":{"m":113,"g":115},"b3c97762":{"m":113,"g":115},"b8347b40":{"m":113,"g":115},"72dfa96a":{"m":113,"g":115},"05b01ef4":{"m":113,"g":115},"55a6e644":{"m":113,"g":115},"6897e06b":{"m":113,"g":115},"a360511d":{"m":113,"g":115},"94d0f656":{"m":113,"g":115},"eca59f96":{"m":113,"g":115},"97528610":{"m":113,"g":115},"297d3745":{"m":113,"g":115},"31e9d3a5":{"m":113,"g":115},"6f4676ef":{"m":113,"g":115},"c9ec4cae":{"m":113,"g":115},"99757cc3":{"m":113,"g":115},"cdddab05":{"m":113,"g":115},"7c5a0a1b":{"m":113,"g":115},"49f169d5":{"m":113,"g":115},"7fce2fd9":{"m":113,"g":115},"16cd550c":{"m":113,"g":115},"d5e2a374":{"m":113,"g":115},"366043db":{"m":113,"g":115},"2f173ea0":{"m":113,"g":115},"321fecab":{"m":113,"g":115},"9d775b1a":{"m":113,"g":115},"78b7465c":{"m":113,"g":115},"07bcad7f":{"m":113,"g":115},"98adac8e":{"m":113,"g":115},"cef11e9a":{"m":113,"g":115},"2269cf1e":{"m":113,"g":115},"151e287d":{"m":113,"g":115},"8c86595c":{"m":113,"g":115},"4634fd59":{"m":113,"g":115},"efedbe6c":{"m":113,"g":115},"3a77c80b":{"m":113,"g":115},"36acd2ff":{"m":113,"g":115},"fe6cdf89":{"m":113,"g":115},"30d20ce8":{"m":113,"g":115},"1b1701f1":{"m":113,"g":115},"6d403089":{"m":113,"g":115},"24dc2bee":{"m":113,"g":115},"fac07c9b":{"m":113,"g":115},"b3839a7f":{"m":
113,"g":115},"4aa39d72":{"m":113,"g":115},"b4c2c421":{"m":113,"g":115},"53ca1552":{"m":113,"g":115},"a23bdeaf":{"m":113,"g":115},"27778010":{"m":113,"g":115},"46d8fb1c":{"m":113,"g":115},"c7e85f53":{"m":113,"g":115},"3df05f4d":{"m":113,"g":115},"7b141f81":{"m":113,"g":115},"7bc5fb0d":{"m":113,"g":115},"144ee5f3":{"m":113,"g":115},"758b887a":{"m":114,"g":115},"eb7d9261":{"m":114,"g":115},"44cb0607":{"m":114,"g":115},"88bb627d":{"m":114,"g":115},"b520958e":{"m":114,"g":115},"fa7e2c30":{"m":114,"g":115},"8f2cd177":{"m":114,"g":115},"ab926dd6":{"m":114,"g":115},"a4b424c6":{"m":114,"g":115},"a0557642":{"m":114,"g":115},"84768d10":{"m":114,"g":115},"368fd206":{"m":114,"g":115},"53bd00d9":{"m":114,"g":115},"e22b13c5":{"m":114,"g":115},"a3c2ea44":{"m":114,"g":115},"fccac7d1":{"m":114,"g":115},"7ac6b900":{"m":114,"g":115},"a1080b72":{"m":114,"g":115},"a65ca739":{"m":114,"g":115},"677aa0e2":{"m":114,"g":115},"01c9ee1a":{"m":114,"g":115},"d6837aea":{"m":114,"g":115},"c882b5ae":{"m":114,"g":115},"e3bb7f5a":{"m":114,"g":115},"92473e2e":{"m":114,"g":115},"6c0bb327":{"m":114,"g":115},"0a7c4bde":{"m":114,"g":115},"edefab0c":{"m":114,"g":115},"97cd38e5":{"m":114,"g":115},"3c06b673":{"m":114,"g":115},"7c3f07db":{"m":114,"g":115},"edd86b88":{"m":114,"g":115},"4b4dc132":{"m":114,"g":115},"5a9170d9":{"m":114,"g":115},"c4d77774":{"m":114,"g":115},"832c84fb":{"m":114,"g":115},"64d1505c":{"m":114,"g":115},"f3764c26":{"m":114,"g":115},"7ba3de0e":{"m":114,"g":115},"fde9b963":{"m":114,"g":115},"f094e0a4":{"m":114,"g":115},"4ed67c27":{"m":114,"g":115},"cd4b39a9":{"m":114,"g":115},"420c99ac":{"m":114,"g":115},"e3c7f091":{"m":114,"g":115},"6f1e03a4":{"m":114,"g":115},"f4affd4d":{"m":114,"g":115},"df08bf9b":{"m":114,"g":115},"69efdd27":{"m":114,"g":115},"64582caa":{"m":114,"g":115},"2fcd56ea":{"m":114,"g":115},"0958a397":{"m":114,"g":115},"4f42c8cd":{"m":114,"g":115},"3ddd7dc9":{"m":114,"g":115},"501dfa6b":{"m":114,"g":115},"79d34951":{"m":114,"g":115},"1519a89c":{"m":114,"g":115},"24bc3fb0":{"m"
:114,"g":115},"8a8a608a":{"m":114,"g":115},"533e58a1":{"m":114,"g":115},"9b4c4497":{"m":114,"g":115},"a578d300":{"m":114,"g":115},"8c967037":{"m":114,"g":115},"fb27d383":{"m":114,"g":115},"fd8a0b29":{"m":114,"g":115},"afc35ccc":{"m":114,"g":115},"a57f0e3d":{"m":114,"g":115},"708f4ff4":{"m":114,"g":115},"e2daeb35":{"m":114,"g":115},"0e7b3530":{"m":114,"g":115},"b07c9c76":{"m":114,"g":115},"748f86f3":{"m":114,"g":115},"73ea484a":{"m":114,"g":115},"4aeb193f":{"m":114,"g":115},"466992b2":{"m":114,"g":115},"155cbb51":{"m":114,"g":115},"eb30b888":{"m":114,"g":115},"5ee777c9":{"m":114,"g":115},"baf277a9":{"m":116,"g":118},"f5d30dae":{"m":116,"g":118},"2479b894":{"m":116,"g":118},"54644572":{"m":116,"g":118},"6c01844f":{"m":116,"g":118},"f226d3da":{"m":116,"g":118},"d2478cd4":{"m":116,"g":118},"30ea4c46":{"m":116,"g":118},"6d036468":{"m":116,"g":118},"8221f9ae":{"m":116,"g":118},"ab9187a2":{"m":116,"g":118},"6b143d62":{"m":116,"g":118},"6bc503af":{"m":116,"g":118},"b2c85669":{"m":116,"g":118},"32803fb2":{"m":116,"g":118},"91fc5bb5":{"m":116,"g":118},"780fbf2f":{"m":116,"g":118},"825432fc":{"m":116,"g":118},"a40229f6":{"m":116,"g":118},"74737b28":{"m":116,"g":115},"40e0082d":{"m":116,"g":115},"e9e120ac":{"m":116,"g":115},"e0c2af2a":{"m":116,"g":115},"1d7f7835":{"m":116,"g":115},"32595146":{"m":116,"g":115},"86373b9e":{"m":116,"g":115},"d314bf60":{"m":116,"g":115},"e28c9e52":{"m":116,"g":115},"b98cf398":{"m":116,"g":115},"27d71045":{"m":116,"g":115},"c224a4c6":{"m":116,"g":115},"49345a68":{"m":116,"g":115},"94d26d85":{"m":116,"g":115},"9e8a15a7":{"m":116,"g":115},"3962e39d":{"m":116,"g":115},"eb8cac6f":{"m":116,"g":115},"5ea96ac7":{"m":116,"g":115},"56222658":{"m":116,"g":115},"dc965db0":{"m":116,"g":115},"817e46f4":{"m":116,"g":115},"5a33c3aa":{"m":116,"g":115},"9767a1e4":{"m":116,"g":115},"1d086539":{"m":116,"g":115},"a04efc49":{"m":116,"g":115},"642fa966":{"m":116,"g":115},"da7fac1b":{"m":116,"g":115},"28ad2297":{"m":116,"g":115},"f7f9f8ec":{"m":116,"g":115},"4b62af92":{"m
":116,"g":115},"0b9915c1":{"m":116,"g":115},"27ef1459":{"m":116,"g":115},"e4358a45":{"m":116,"g":115},"ba2ce28f":{"m":116,"g":115},"98923880":{"m":116,"g":115},"f792e3c5":{"m":116,"g":115},"28f80b12":{"m":116,"g":115},"88a6f9da":{"m":116,"g":115},"cb8ed2c0":{"m":116,"g":115},"38473363":{"m":116,"g":115},"aaf7af1b":{"m":116,"g":115},"932e2637":{"m":116,"g":115},"43f80884":{"m":116,"g":115},"60b05032":{"m":116,"g":115},"dc48c4c0":{"m":116,"g":115},"6dc9ca8c":{"m":116,"g":115},"887c2b45":{"m":116,"g":115},"065ce815":{"m":116,"g":115},"8e51049f":{"m":116,"g":115},"cb8f3d90":{"m":116,"g":115},"4b694e7d":{"m":116,"g":115},"9f1f699a":{"m":116,"g":115},"c9cff2b9":{"m":116,"g":115},"b6fb5d76":{"m":116,"g":115},"f4aa7880":{"m":116,"g":115},"5e3f7e7f":{"m":116,"g":115},"728af887":{"m":116,"g":115},"7b59b0b8":{"m":116,"g":115},"acc2327b":{"m":116,"g":115},"bfadb5ea":{"m":116,"g":115},"9cc1e065":{"m":116,"g":115},"b8c430f1":{"m":116,"g":115},"f35f120d":{"m":116,"g":115},"54a46a26":{"m":116,"g":115},"7c94eaee":{"m":116,"g":115},"13d596c9":{"m":116,"g":115},"c7867b67":{"m":116,"g":115},"516738b0":{"m":116,"g":115},"0b6f535f":{"m":116,"g":115},"c5fe3c0b":{"m":116,"g":115},"318424e2":{"m":116,"g":115},"6806c4e6":{"m":116,"g":115},"0c0779d6":{"m":116,"g":115},"a55cf530":{"m":116,"g":115},"19ba16aa":{"m":116,"g":115},"a2b3d9b9":{"m":116,"g":115},"9a30914e":{"m":116,"g":115},"8e776c78":{"m":116,"g":115},"63e84352":{"m":116,"g":115},"a20e7df8":{"m":116,"g":115},"1bdd0102":{"m":116,"g":115},"6cd29694":{"m":116,"g":115},"2ac46e94":{"m":116,"g":115},"0aa65f94":{"m":116,"g":115},"0ecb4261":{"m":116,"g":115},"05f015f6":{"m":116,"g":115},"1083e7e3":{"m":116,"g":115},"2157d12a":{"m":116,"g":115},"9f2b457c":{"m":116,"g":115},"f5b34a51":{"m":116,"g":115},"5a6ec8f9":{"m":116,"g":115},"6a653bb1":{"m":116,"g":115},"548a57b1":{"m":116,"g":115},"88e73ed0":{"m":116,"g":115},"4b15fa00":{"m":116,"g":115},"f4941906":{"m":116,"g":115},"01e59e82":{"m":116,"g":115},"99a0704a":{"m":116,"g":115},"ec1cd90a":{"
m":116,"g":115},"1103dc62":{"m":116,"g":115},"a220536f":{"m":116,"g":115},"7b064f04":{"m":116,"g":115},"43190bec":{"m":116,"g":115},"be740acd":{"m":116,"g":115},"2db2cddd":{"m":116,"g":115},"9b5efe34":{"m":116,"g":115},"4ac8e09d":{"m":116,"g":115},"20a6c0a6":{"m":116,"g":115},"47c606d3":{"m":116,"g":115},"9fcf7306":{"m":116,"g":115},"0a304870":{"m":116,"g":115},"8fdcd98e":{"m":116,"g":115},"b5dcfd41":{"m":116,"g":115},"5061b8fd":{"m":116,"g":115},"c8452551":{"m":116,"g":115},"bf3e7149":{"m":116,"g":115},"f5754d12":{"m":116,"g":115},"739daa63":{"m":116,"g":115},"d957177a":{"m":116,"g":115},"21337b22":{"m":116,"g":115},"129d2992":{"m":116,"g":115},"8b85926a":{"m":116,"g":115},"451d15c4":{"m":116,"g":115},"c80a96da":{"m":116,"g":115},"eae9a9fb":{"m":116,"g":115},"2674c1d2":{"m":116,"g":115},"61055cb3":{"m":116,"g":115},"92777135":{"m":116,"g":115},"c4958331":{"m":116,"g":115},"2eeb2751":{"m":116,"g":115},"b36afed4":{"m":116,"g":115},"9aa4502d":{"m":116,"g":115},"a0835c3a":{"m":116,"g":115},"55b14656":{"m":116,"g":115},"b4408e60":{"m":116,"g":115},"52fcbbb8":{"m":116,"g":115},"af96ca11":{"m":116,"g":115},"9082a7d3":{"m":116,"g":115},"3b9d97f3":{"m":116,"g":115},"a1a20b4c":{"m":116,"g":115},"4299aebd":{"m":116,"g":115},"0babd487":{"m":116,"g":115},"f19613e6":{"m":116,"g":115},"8df49455":{"m":116,"g":115},"ee3bd8a1":{"m":116,"g":115},"d8467db7":{"m":116,"g":115},"b5044fbf":{"m":116,"g":115},"70fbb3ad":{"m":116,"g":115},"9a7e7a65":{"m":116,"g":115},"0fe87213":{"m":116,"g":115},"1f106ee3":{"m":116,"g":115},"9b8ebb27":{"m":116,"g":115},"85ebeecf":{"m":117,"g":118},"0dd6cf16":{"m":117,"g":118},"0975ba99":{"m":117,"g":118},"1de3924b":{"m":117,"g":118},"3cceaa38":{"m":117,"g":118},"b0d20cde":{"m":117,"g":118},"cbac4997":{"m":117,"g":118},"476c67d7":{"m":117,"g":118},"3289da5b":{"m":117,"g":118},"868403f6":{"m":117,"g":118},"97d857c0":{"m":117,"g":118},"52a54a26":{"m":117,"g":118},"cd7e1bd5":{"m":117,"g":118},"729b7edf":{"m":117,"g":118},"4c03dbaa":{"m":117,"g":118},"1053e1be":{
"m":119,"g":121},"9a71500c":{"m":119,"g":121},"6d6e24bc":{"m":119,"g":121},"2c057fbf":{"m":119,"g":121},"dbd9435d":{"m":119,"g":121},"8ae9d4bb":{"m":119,"g":121},"1c304aa9":{"m":119,"g":121},"770529a7":{"m":119,"g":121},"39c237f0":{"m":119,"g":121},"28b8a406":{"m":119,"g":121},"8bd26dd4":{"m":119,"g":121},"ab07cd3e":{"m":119,"g":121},"a9849683":{"m":119,"g":121},"a4b637d8":{"m":119,"g":121},"96a5e4dd":{"m":119,"g":121},"b0b4f716":{"m":119,"g":121},"6c18addb":{"m":119,"g":121},"32852fe9":{"m":119,"g":121},"53c2934d":{"m":119,"g":121},"e321c971":{"m":119,"g":121},"d6fee73d":{"m":119,"g":121},"36a4cad7":{"m":119,"g":121},"65d376b4":{"m":119,"g":121},"c23eda85":{"m":119,"g":121},"138ff231":{"m":119,"g":121},"13fb8b54":{"m":119,"g":121},"81fd2b0e":{"m":119,"g":121},"007b849b":{"m":119,"g":121},"8612811d":{"m":119,"g":121},"e7aa4664":{"m":119,"g":121},"4d4feccb":{"m":119,"g":121},"99c92ff2":{"m":119,"g":121},"6ade6a02":{"m":119,"g":121},"983ef22c":{"m":119,"g":121},"164302c7":{"m":119,"g":121},"5dccf697":{"m":119,"g":121},"eec9e471":{"m":119,"g":121},"6d535b71":{"m":119,"g":121},"fdcb1d13":{"m":119,"g":121},"d7e834d6":{"m":119,"g":121},"200a3c0b":{"m":119,"g":121},"77258ce0":{"m":119,"g":121},"1d097aac":{"m":119,"g":121},"7fceeef5":{"m":119,"g":121},"88568c01":{"m":119,"g":121},"904655c5":{"m":119,"g":121},"e028af69":{"m":119,"g":121},"80b2b320":{"m":119,"g":121},"4b65ed42":{"m":119,"g":121},"23afdfd1":{"m":119,"g":121},"9d61205d":{"m":119,"g":121},"590bc4b7":{"m":119,"g":121},"63cfe1b0":{"m":119,"g":121},"70f6309c":{"m":119,"g":121},"70416001":{"m":119,"g":121},"87a92e45":{"m":119,"g":121},"c461e771":{"m":119,"g":121},"fde2decf":{"m":119,"g":121},"9792b9d7":{"m":119,"g":121},"ef4a8097":{"m":119,"g":121},"ebff4ee6":{"m":119,"g":121},"2b1da821":{"m":119,"g":121},"97710ccd":{"m":119,"g":121},"f3cd5d25":{"m":119,"g":121},"c61b0b29":{"m":119,"g":121},"e8640ee9":{"m":119,"g":121},"d0a64c7e":{"m":119,"g":121},"05d3667a":{"m":119,"g":121},"260fe755":{"m":119,"g":121},"dbb16bed":
{"m":119,"g":121},"c1e16003":{"m":119,"g":121},"852c0578":{"m":119,"g":121},"7e6191c0":{"m":119,"g":121},"6f9b66bd":{"m":119,"g":121},"8a801ee3":{"m":119,"g":118},"d9a20fd2":{"m":119,"g":118},"b113c72e":{"m":119,"g":118},"fb6cc7b0":{"m":119,"g":118},"8374a96e":{"m":119,"g":118},"74de76c6":{"m":119,"g":118},"9c0b1eb5":{"m":119,"g":118},"01f14a7a":{"m":119,"g":118},"11110303":{"m":119,"g":118},"28ddfb37":{"m":119,"g":118},"e69094df":{"m":119,"g":118},"43ad0590":{"m":119,"g":118},"b4948512":{"m":119,"g":118},"ddcba74b":{"m":119,"g":118},"0917c5da":{"m":119,"g":118},"184a4df6":{"m":119,"g":118},"f7b1d8c5":{"m":119,"g":118},"bfc3b3f7":{"m":119,"g":118},"da5bde4d":{"m":119,"g":118},"276e7b3e":{"m":119,"g":118},"296f6892":{"m":119,"g":118},"9edb7b51":{"m":119,"g":118},"e53bf442":{"m":119,"g":118},"d383e661":{"m":119,"g":118},"984fbeb1":{"m":119,"g":118},"a2ba0bc3":{"m":119,"g":118},"6d2d0ce2":{"m":119,"g":118},"271d3d0d":{"m":119,"g":118},"c4e81e64":{"m":119,"g":118},"c726d44c":{"m":119,"g":118},"283c8ba0":{"m":119,"g":118},"cae39565":{"m":119,"g":118},"27a223ab":{"m":119,"g":118},"53529f46":{"m":119,"g":118},"24ed3f32":{"m":119,"g":118},"44f0ece9":{"m":119,"g":118},"be0058bc":{"m":119,"g":118},"9e3be1fa":{"m":119,"g":118},"a8ba3279":{"m":119,"g":118},"3b80232d":{"m":119,"g":118},"252dc4e1":{"m":119,"g":118},"cbb5fc2e":{"m":119,"g":118},"53fb229f":{"m":119,"g":118},"4fff1ec1":{"m":119,"g":118},"7a020e0f":{"m":119,"g":118},"48738af7":{"m":119,"g":118},"efa47334":{"m":119,"g":118},"d658f049":{"m":119,"g":118},"57e25de7":{"m":119,"g":118},"12eb02e9":{"m":119,"g":118},"002d0373":{"m":119,"g":118},"a27825ae":{"m":119,"g":118},"ce399e15":{"m":119,"g":118},"ea6275df":{"m":119,"g":118},"eb7318f1":{"m":119,"g":118},"6058fb52":{"m":119,"g":118},"80407b04":{"m":119,"g":118},"b288f4f4":{"m":119,"g":118},"6d6ea5af":{"m":119,"g":118},"1dacedd2":{"m":119,"g":118},"b5e14b2b":{"m":119,"g":118},"d513ee93":{"m":119,"g":118},"a7ae61ed":{"m":119,"g":118},"fda0cb2a":{"m":119,"g":118},"ebda73dc"
:{"m":119,"g":118},"f4f8a1b4":{"m":119,"g":118},"c44e985d":{"m":119,"g":118},"f9a7d9b3":{"m":119,"g":118},"a93f10a7":{"m":119,"g":118},"585e1223":{"m":119,"g":118},"a7043c6f":{"m":119,"g":118},"67e34c56":{"m":119,"g":118},"1d726528":{"m":119,"g":118},"f4488e9d":{"m":119,"g":118},"e68a2b5b":{"m":119,"g":118},"31b9f19e":{"m":119,"g":118},"547003bd":{"m":119,"g":118},"f7ab9554":{"m":119,"g":118},"dbbd4e18":{"m":119,"g":118},"ca240eef":{"m":119,"g":118},"6c7c92eb":{"m":119,"g":118},"5b214b50":{"m":119,"g":118},"13219e1e":{"m":119,"g":118},"33e9bbec":{"m":119,"g":118},"dcb8f090":{"m":119,"g":118},"9eefe2c0":{"m":119,"g":118},"69fe3c97":{"m":119,"g":118},"8af84912":{"m":119,"g":118},"505329ca":{"m":119,"g":118},"8a382fd3":{"m":119,"g":118},"62797440":{"m":119,"g":118},"2614adf9":{"m":119,"g":118},"fdd7c69d":{"m":119,"g":118},"b9a54e09":{"m":119,"g":118},"20b8d230":{"m":119,"g":118},"d1984e21":{"m":119,"g":118},"b79f75fd":{"m":119,"g":118},"8fcc69e7":{"m":119,"g":118},"f440baa1":{"m":119,"g":118},"2bc3fcd4":{"m":119,"g":118},"a5978a20":{"m":119,"g":118},"e483c1ea":{"m":119,"g":118},"da681f35":{"m":119,"g":118},"9b0f725b":{"m":119,"g":118},"cde5a6e3":{"m":119,"g":118},"3e4c7da2":{"m":119,"g":118},"d88ac9bc":{"m":119,"g":118},"ce11dd82":{"m":119,"g":118},"9e87b60f":{"m":119,"g":118},"7780230a":{"m":119,"g":118},"dc01313d":{"m":119,"g":118},"7a7f99be":{"m":119,"g":118},"fd389df9":{"m":119,"g":118},"b0d1d717":{"m":119,"g":118},"c7962868":{"m":119,"g":118},"4f24ab17":{"m":119,"g":118},"64affab4":{"m":119,"g":118},"4c9bcb9d":{"m":119,"g":118},"86b04d25":{"m":119,"g":118},"55d75e11":{"m":120,"g":121},"3f4cc0af":{"m":120,"g":121},"cadfae66":{"m":120,"g":121},"da1766e4":{"m":120,"g":121},"a124b517":{"m":120,"g":121},"d05a968b":{"m":120,"g":121},"94aad0de":{"m":120,"g":121},"7ebc28f5":{"m":120,"g":121},"b89111d6":{"m":120,"g":121},"0b3b3e9a":{"m":120,"g":121},"a1d5bc4c":{"m":120,"g":121},"a8023891":{"m":120,"g":121},"0103f374":{"m":120,"g":121},"96a5a949":{"m":120,"g":121},"ea385ae8
":{"m":120,"g":121},"9e949e58":{"m":120,"g":121},"6dbb569b":{"m":120,"g":121},"5994e6c3":{"m":120,"g":121},"3e6281d0":{"m":120,"g":121},"6371f7af":{"m":120,"g":121},"8491c794":{"m":120,"g":121},"bda3758f":{"m":120,"g":121},"7b36c47b":{"m":120,"g":121},"773d89da":{"m":120,"g":121},"03e7d949":{"m":120,"g":121},"ff604064":{"m":120,"g":121},"212f5e48":{"m":120,"g":121},"fe527812":{"m":120,"g":121},"97828878":{"m":120,"g":121},"c001deba":{"m":120,"g":121},"b4d2da10":{"m":120,"g":121},"b72f9f08":{"m":120,"g":121},"8e70064c":{"m":120,"g":121},"d98b81e2":{"m":120,"g":121},"bcecf27e":{"m":120,"g":121},"4b0ac1d5":{"m":120,"g":121},"8e987fa2":{"m":120,"g":121},"9e656dd3":{"m":120,"g":121},"3862661c":{"m":120,"g":121},"c8492978":{"m":120,"g":121},"428710c2":{"m":120,"g":121},"d9b31011":{"m":120,"g":121},"d0cff78f":{"m":120,"g":121},"e8b71445":{"m":120,"g":121},"4caca1ba":{"m":120,"g":121},"ceb105a7":{"m":120,"g":121},"22cbc9c0":{"m":120,"g":121},"3865afc5":{"m":120,"g":121},"4ea42f7c":{"m":120,"g":121},"ea13cb14":{"m":120,"g":121},"a04212f1":{"m":120,"g":121},"ce869793":{"m":120,"g":121},"22f55e1b":{"m":120,"g":121},"89824189":{"m":120,"g":121},"433c622e":{"m":120,"g":121},"20bd2271":{"m":120,"g":121},"64994980":{"m":120,"g":121},"729b2429":{"m":120,"g":121},"d7056c52":{"m":120,"g":121},"13bf565d":{"m":120,"g":121},"e51046be":{"m":120,"g":121},"4eeeae1e":{"m":120,"g":121},"f4b78d13":{"m":120,"g":121},"4463e90d":{"m":120,"g":121},"229f236d":{"m":120,"g":121},"5983e5bd":{"m":120,"g":121},"4b046a72":{"m":120,"g":121},"770d6312":{"m":120,"g":121},"14203432":{"m":120,"g":121},"d7f0d88f":{"m":120,"g":121},"fc86b18b":{"m":120,"g":121},"0bfa394a":{"m":120,"g":121},"e04340bf":{"m":120,"g":121},"84701338":{"m":120,"g":121},"93ef9a09":{"m":120,"g":121},"b04cd3d4":{"m":120,"g":121},"7ef5d8af":{"m":120,"g":121},"71d41212":{"m":120,"g":121},"b9fb74f3":{"m":120,"g":121},"e15b63a1":{"m":120,"g":121},"4060ed37":{"m":120,"g":121},"2342605e":{"m":120,"g":121},"dbf17a83":{"m":120,"g":121},"0f0c430
e":{"m":120,"g":121},"8e797a47":{"m":120,"g":121},"aa3003f1":{"m":120,"g":121},"4793ec7d":{"m":120,"g":121},"92009bd2":{"m":120,"g":121},"4ef981e2":{"m":120,"g":121},"69ed8b67":{"m":120,"g":121},"1801cd19":{"m":120,"g":121},"ffc722a6":{"m":120,"g":121},"49afb3d9":{"m":120,"g":121},"f80371ff":{"m":120,"g":121},"62eff37b":{"m":120,"g":121},"47e12e08":{"m":120,"g":121},"823b4429":{"m":120,"g":121},"14a4d80e":{"m":120,"g":121},"41c10e67":{"m":122,"g":127},"0bfe1d14":{"m":122,"g":127},"5f98b7fe":{"m":122,"g":127},"a4bf5c6a":{"m":122,"g":127},"30ad1070":{"m":122,"g":127},"a80bcb5a":{"m":122,"g":127},"f7f9e41b":{"m":122,"g":127},"263eab9f":{"m":122,"g":127},"25257d8e":{"m":122,"g":127},"cf0c2415":{"m":122,"g":127},"5538e05c":{"m":122,"g":127},"c30ebb93":{"m":122,"g":127},"41efcaeb":{"m":122,"g":127},"70562969":{"m":122,"g":127},"b57dc169":{"m":122,"g":127},"0095e018":{"m":122,"g":127},"68486481":{"m":122,"g":127},"410225b7":{"m":122,"g":127},"2c9aebea":{"m":122,"g":127},"bc741073":{"m":122,"g":127},"2f6af1a3":{"m":122,"g":127},"50b6842b":{"m":122,"g":127},"2d5605e8":{"m":122,"g":127},"300b4c21":{"m":122,"g":127},"c0652d90":{"m":122,"g":127},"2e48584b":{"m":122,"g":127},"57cc5385":{"m":122,"g":127},"5cc0d25a":{"m":122,"g":127},"a076ec1a":{"m":122,"g":127},"72b5f3d0":{"m":122,"g":127},"2f766f38":{"m":122,"g":127},"069e490b":{"m":122,"g":127},"ab95d35f":{"m":122,"g":127},"34c286b8":{"m":122,"g":127},"9416ee60":{"m":122,"g":127},"d4a09ec9":{"m":122,"g":127},"662725b9":{"m":122,"g":127},"82cfcd3b":{"m":122,"g":127},"6c1a3f0c":{"m":122,"g":127},"62377548":{"m":122,"g":121},"96ac24c0":{"m":122,"g":121},"c0d02cf4":{"m":122,"g":121},"7d121448":{"m":122,"g":121},"6a63a985":{"m":122,"g":121},"4d2f17bd":{"m":122,"g":121},"7cd716f7":{"m":122,"g":121},"b7fdde4b":{"m":122,"g":121},"69bf8011":{"m":122,"g":121},"d8fcbaa3":{"m":122,"g":121},"2cf3d0f8":{"m":122,"g":121},"1ed1abfd":{"m":122,"g":121},"ecb9fa14":{"m":122,"g":121},"700daa34":{"m":122,"g":121},"39cee0fe":{"m":122,"g":121},"04e5b6
fa":{"m":122,"g":121},"ce6b17c0":{"m":122,"g":121},"cafebef1":{"m":122,"g":121},"73dfd2df":{"m":122,"g":121},"df5192cf":{"m":122,"g":121},"78c43d88":{"m":122,"g":121},"7e28c67d":{"m":122,"g":121},"3edba9bc":{"m":122,"g":121},"e5ec9764":{"m":122,"g":121},"621dfb88":{"m":122,"g":121},"fb52d35f":{"m":122,"g":121},"2b71531a":{"m":122,"g":121},"25c50498":{"m":122,"g":121},"8e2ac2e6":{"m":122,"g":121},"17a57fd8":{"m":122,"g":121},"32438eba":{"m":122,"g":121},"fed02a49":{"m":122,"g":121},"03b3e89a":{"m":122,"g":121},"9ff9fa7f":{"m":122,"g":121},"7ed8ba05":{"m":122,"g":121},"df08f346":{"m":122,"g":121},"db15148c":{"m":122,"g":121},"5259becd":{"m":122,"g":121},"ed1044ac":{"m":122,"g":121},"d717e73e":{"m":122,"g":121},"a1816187":{"m":122,"g":121},"e39628fd":{"m":122,"g":121},"bacb3825":{"m":122,"g":121},"b53d9e11":{"m":122,"g":121},"8a683821":{"m":122,"g":121},"52694b60":{"m":122,"g":121},"400bddf2":{"m":122,"g":121},"1e90fe2e":{"m":122,"g":121},"caa5d296":{"m":122,"g":121},"750940ae":{"m":122,"g":121},"42f8ea40":{"m":122,"g":121},"14cbe42f":{"m":122,"g":121},"685c0645":{"m":122,"g":121},"1357397a":{"m":122,"g":121},"42e1a72e":{"m":122,"g":121},"83a7c89c":{"m":122,"g":121},"0380ca82":{"m":122,"g":121},"ec92b0ce":{"m":122,"g":121},"e03b6bee":{"m":122,"g":121},"5e36a0b4":{"m":122,"g":121},"0297773a":{"m":122,"g":121},"587deb15":{"m":122,"g":121},"83087247":{"m":122,"g":121},"334543ff":{"m":122,"g":121},"c143f416":{"m":122,"g":121},"b48354c5":{"m":122,"g":121},"29195aaa":{"m":122,"g":121},"0ee831de":{"m":122,"g":121},"8d6ab1cb":{"m":122,"g":121},"84a9d0ea":{"m":122,"g":121},"737b58d6":{"m":122,"g":121},"77225d60":{"m":122,"g":121},"9c6e25d2":{"m":122,"g":121},"2a3763c3":{"m":122,"g":121},"fdd00295":{"m":122,"g":121},"25e73640":{"m":122,"g":121},"0da9845e":{"m":122,"g":121},"92885441":{"m":122,"g":121},"64cf868e":{"m":122,"g":121},"ea399527":{"m":122,"g":121},"41a11335":{"m":122,"g":121},"a1f2dc90":{"m":122,"g":121},"ea961060":{"m":122,"g":121},"b1e13e7c":{"m":122,"g":121},"cc7b0
4a2":{"m":122,"g":121},"d85d6dba":{"m":122,"g":121},"c5642a7a":{"m":122,"g":121},"691c8534":{"m":122,"g":121},"d2b8c412":{"m":122,"g":121},"bf8f7a94":{"m":122,"g":121},"81a632ac":{"m":122,"g":121},"83b22400":{"m":122,"g":121},"285a8e69":{"m":122,"g":121},"813bd6f8":{"m":122,"g":121},"729f612d":{"m":122,"g":121},"899453ac":{"m":122,"g":121},"ce832d70":{"m":122,"g":121},"88596739":{"m":122,"g":121},"a6ea3add":{"m":122,"g":121},"326c84c4":{"m":122,"g":121},"8da608cc":{"m":122,"g":121},"9fc3e8aa":{"m":122,"g":121},"c11b34d5":{"m":122,"g":121},"05ad28f2":{"m":122,"g":121},"0cae873f":{"m":122,"g":121},"a8b91f6b":{"m":122,"g":121},"959d1ab8":{"m":122,"g":121},"6c1c1933":{"m":122,"g":121},"3029d301":{"m":122,"g":121},"f389f017":{"m":122,"g":121},"caa4819b":{"m":122,"g":121},"a88b006e":{"m":122,"g":121},"ce112c07":{"m":122,"g":121},"f7dc2f33":{"m":122,"g":121},"cd784faf":{"m":122,"g":121},"75c09e1f":{"m":122,"g":121},"09af0a7b":{"m":122,"g":121},"c8d385ce":{"m":122,"g":121},"09938e1f":{"m":123,"g":127},"23407983":{"m":123,"g":127},"1357ab02":{"m":123,"g":127},"44da7377":{"m":123,"g":127},"fb9582c4":{"m":123,"g":127},"d22d0447":{"m":123,"g":127},"887742a1":{"m":123,"g":127},"34f7564d":{"m":123,"g":127},"1cfbbc42":{"m":123,"g":127},"55dfb539":{"m":123,"g":127},"42889acb":{"m":123,"g":127},"211f4070":{"m":123,"g":127},"befa41a1":{"m":123,"g":127},"30b26ee9":{"m":123,"g":127},"aa797d01":{"m":123,"g":127},"7cee07a0":{"m":123,"g":127},"bb517fe3":{"m":123,"g":127},"0e82fd3d":{"m":123,"g":127},"b7d70411":{"m":123,"g":127},"ff0b64e1":{"m":123,"g":127},"d84790db":{"m":123,"g":127},"0678beaa":{"m":123,"g":127},"c2d4716d":{"m":123,"g":127},"c14cc47e":{"m":123,"g":127},"dbcf85b7":{"m":123,"g":127},"83804bc6":{"m":123,"g":127},"d5fa019c":{"m":123,"g":127},"fef3a6b6":{"m":123,"g":127},"173e0f70":{"m":123,"g":127},"f600866a":{"m":123,"g":127},"93be7e86":{"m":123,"g":127},"60b0754c":{"m":123,"g":127},"0b24af4d":{"m":123,"g":127},"a209fb05":{"m":123,"g":127},"48d6bea1":{"m":123,"g":127},"1689
c0e3":{"m":123,"g":127},"193fbb0b":{"m":123,"g":127},"e607850f":{"m":123,"g":127},"15efbcb4":{"m":123,"g":127},"243c064d":{"m":123,"g":127},"0b41a293":{"m":123,"g":127},"d31d48b3":{"m":123,"g":127},"88342607":{"m":123,"g":127},"fd7a72d6":{"m":123,"g":127},"21a8fa16":{"m":123,"g":127},"7a21d8b2":{"m":123,"g":127},"d36639ee":{"m":123,"g":127},"6ef23b98":{"m":123,"g":127},"385599cb":{"m":123,"g":127},"952fbe47":{"m":123,"g":127},"edb25693":{"m":123,"g":127},"3529c061":{"m":123,"g":127},"ffb32a85":{"m":123,"g":127},"14d80648":{"m":123,"g":127},"ab8b83f7":{"m":123,"g":127},"de0b10cf":{"m":123,"g":127},"6e29446e":{"m":123,"g":127},"0c3543d7":{"m":123,"g":127},"6a3b9fd0":{"m":123,"g":127},"65f1d065":{"m":123,"g":127},"9434a0e5":{"m":123,"g":127},"20315697":{"m":123,"g":127},"c9db7911":{"m":123,"g":127},"15ed27d7":{"m":123,"g":127},"66fb9b13":{"m":123,"g":127},"819fc591":{"m":123,"g":127},"7efd8b3d":{"m":123,"g":127},"a920b9da":{"m":123,"g":127},"76196b3c":{"m":123,"g":127},"95191ebd":{"m":123,"g":127},"9a512cf9":{"m":123,"g":127},"3451fc32":{"m":123,"g":127},"c550ab91":{"m":123,"g":127},"086f0b79":{"m":123,"g":127},"0afd6832":{"m":123,"g":127},"6f858930":{"m":123,"g":127},"229256c5":{"m":123,"g":127},"6b634493":{"m":123,"g":127},"756ad9ce":{"m":123,"g":127},"d2a8f71c":{"m":123,"g":127},"2b7bf11b":{"m":123,"g":127},"566ade03":{"m":123,"g":127},"69193f71":{"m":123,"g":127},"d5b6e50f":{"m":123,"g":127},"9632e48f":{"m":123,"g":127},"59cce594":{"m":123,"g":127},"795e98f8":{"m":123,"g":127},"358ae356":{"m":123,"g":127},"0c006b88":{"m":124,"g":127},"cd135bfe":{"m":124,"g":127},"fc84b073":{"m":124,"g":127},"8be0e1bc":{"m":124,"g":127},"8e1d6756":{"m":124,"g":127},"0da30dbc":{"m":124,"g":127},"58095cb0":{"m":124,"g":127},"fd3034da":{"m":124,"g":127},"fbbe16fa":{"m":124,"g":127},"6a1a64fa":{"m":124,"g":127},"e2715cf8":{"m":124,"g":127},"74630ba3":{"m":124,"g":127},"73e9a2ef":{"m":124,"g":127},"837b08eb":{"m":124,"g":127},"bb6a21cd":{"m":124,"g":127},"32ec68fa":{"m":124,"g":127},"3c0
a6df8":{"m":124,"g":127},"2104d20e":{"m":124,"g":127},"f235498e":{"m":124,"g":127},"149dc9aa":{"m":124,"g":127},"9a954982":{"m":124,"g":127},"97be66c3":{"m":124,"g":127},"74243dff":{"m":124,"g":127},"4ea4c48b":{"m":124,"g":127},"cf5d27e3":{"m":124,"g":127},"9ec6031d":{"m":124,"g":127},"a5affb0c":{"m":124,"g":127},"5925d3d7":{"m":124,"g":127},"7ef1964a":{"m":124,"g":127},"b0476a06":{"m":124,"g":127},"ffba61a1":{"m":124,"g":127},"3c219eb0":{"m":124,"g":127},"1ffdcdc4":{"m":124,"g":127},"c7d57d5b":{"m":124,"g":127},"80802c4c":{"m":124,"g":127},"83b104ee":{"m":124,"g":127},"14127804":{"m":124,"g":127},"82f39dc1":{"m":124,"g":127},"627bac64":{"m":124,"g":127},"3651cfbf":{"m":124,"g":127},"c8547ecd":{"m":124,"g":127},"7bc1dae0":{"m":124,"g":127},"4fe53e58":{"m":124,"g":127},"fb2e816e":{"m":124,"g":127},"7c45b8b4":{"m":124,"g":127},"ba5b6823":{"m":124,"g":127},"508d2f7a":{"m":124,"g":127},"a889c854":{"m":124,"g":127},"4d84f886":{"m":124,"g":127},"dc4f5418":{"m":124,"g":127},"0648eb48":{"m":124,"g":127},"b88fab31":{"m":124,"g":127},"36942660":{"m":124,"g":127},"9f5e7018":{"m":124,"g":127},"cbf23dbb":{"m":124,"g":127},"6dade6c3":{"m":124,"g":127},"b419e20c":{"m":124,"g":127},"48641435":{"m":124,"g":127},"44b1b394":{"m":124,"g":127},"0711d150":{"m":124,"g":127},"303cc957":{"m":125,"g":127},"661c1c97":{"m":125,"g":127},"c022107f":{"m":125,"g":127},"56c83e0f":{"m":125,"g":127},"b51d46d0":{"m":125,"g":127},"10864731":{"m":125,"g":127},"665416f6":{"m":125,"g":127},"838bcb0d":{"m":125,"g":127},"f1f4c451":{"m":125,"g":127},"b0ee99dd":{"m":125,"g":127},"ddfcb7c8":{"m":125,"g":127},"58b12ccb":{"m":125,"g":127},"547de8c7":{"m":125,"g":127},"37c40a87":{"m":125,"g":127},"1240ac13":{"m":125,"g":127},"5639145f":{"m":125,"g":127},"afee2843":{"m":125,"g":127},"6f084880":{"m":125,"g":127},"611a4fd0":{"m":125,"g":127},"9ea2c686":{"m":125,"g":127},"e2a784ec":{"m":125,"g":127},"05559a4a":{"m":125,"g":127},"7bffc5dc":{"m":125,"g":127},"a5e5088d":{"m":125,"g":127},"ac19ce7e":{"m":125,"g":127},"95
876d75":{"m":125,"g":127},"a30f1907":{"m":125,"g":127},"dc8a5a1c":{"m":125,"g":127},"e123648b":{"m":125,"g":127},"90401cf7":{"m":125,"g":127},"9cfe78dd":{"m":125,"g":127},"307e7a61":{"m":125,"g":127},"ddd1440d":{"m":125,"g":127},"d1be60c3":{"m":125,"g":127},"583bb180":{"m":125,"g":127},"1f2a6c69":{"m":125,"g":127},"61c7fe7a":{"m":125,"g":127},"83f89cc6":{"m":125,"g":127},"db24d346":{"m":125,"g":127},"885cfca2":{"m":125,"g":127},"4e916f98":{"m":125,"g":127},"4f65a646":{"m":125,"g":127},"f5b3ccd9":{"m":125,"g":127},"bb00e24f":{"m":125,"g":127},"210a9cab":{"m":125,"g":127},"b8ac4fcb":{"m":125,"g":127},"877cb528":{"m":125,"g":127},"3633f8b0":{"m":125,"g":127},"93cf60fc":{"m":125,"g":127},"b5e04173":{"m":125,"g":127},"8a821af7":{"m":125,"g":127},"4b1d163b":{"m":125,"g":127},"c21a3ec2":{"m":125,"g":127},"b142831a":{"m":125,"g":127},"d1340963":{"m":125,"g":127},"f290e801":{"m":125,"g":127},"9299a62f":{"m":125,"g":127},"49543be9":{"m":125,"g":127},"52362903":{"m":125,"g":127},"b2b26d43":{"m":125,"g":127},"d3a03aee":{"m":125,"g":127},"5f02b918":{"m":125,"g":127},"49653c88":{"m":125,"g":127},"44f594d8":{"m":125,"g":127},"2b6c4257":{"m":125,"g":127},"243ea585":{"m":125,"g":127},"6fee2c53":{"m":125,"g":127},"f1a9c72d":{"m":125,"g":127},"190002c6":{"m":125,"g":127},"0296f1cd":{"m":125,"g":127},"e039ff38":{"m":125,"g":127},"b8ddc296":{"m":125,"g":127},"0b88d520":{"m":125,"g":127},"d3d7f960":{"m":125,"g":127},"e4341872":{"m":125,"g":127},"fe19a580":{"m":125,"g":127},"55e8e399":{"m":125,"g":127},"32f79828":{"m":125,"g":127},"0f76976c":{"m":125,"g":127},"ae622790":{"m":125,"g":127},"5c9273c0":{"m":125,"g":127},"e316bcac":{"m":125,"g":127},"0fe9c1f7":{"m":125,"g":127},"61bfd9fa":{"m":125,"g":127},"c67fce16":{"m":125,"g":127},"bef37d6d":{"m":125,"g":127},"bc25ea67":{"m":125,"g":127},"1fa788ec":{"m":125,"g":127},"125f76ea":{"m":125,"g":127},"d8736c75":{"m":125,"g":127},"3b1cc466":{"m":125,"g":127},"0ee5ab5a":{"m":125,"g":127},"ed5e905c":{"m":125,"g":127},"d9c812d8":{"m":125,"g":127},"7
257525c":{"m":125,"g":127},"34ba10ef":{"m":125,"g":127},"a119363f":{"m":125,"g":127},"fb314d7b":{"m":125,"g":127},"3a64844a":{"m":125,"g":127},"585c417f":{"m":125,"g":127},"b0d1c21d":{"m":125,"g":127},"b07c5e40":{"m":125,"g":127},"1772671b":{"m":125,"g":127},"6e6009fb":{"m":125,"g":127},"88a2a340":{"m":125,"g":127},"78c58621":{"m":125,"g":127},"c3bb348d":{"m":125,"g":127},"4e234b4c":{"m":125,"g":127},"4cc725ac":{"m":125,"g":127},"2cb42dc1":{"m":125,"g":127},"ebaf86d4":{"m":126,"g":127},"8359f185":{"m":126,"g":127},"5324f37a":{"m":126,"g":127},"2864c49f":{"m":126,"g":127},"d26ec39f":{"m":126,"g":127},"9c546bfd":{"m":126,"g":127},"c9b58164":{"m":126,"g":127},"e5e65e3d":{"m":126,"g":127},"ffeb28ba":{"m":126,"g":127},"b40f605f":{"m":126,"g":127},"3cdec20c":{"m":126,"g":127},"d28caaf6":{"m":126,"g":127},"ad8d24c3":{"m":126,"g":127},"018123b5":{"m":126,"g":127},"4983b7e7":{"m":126,"g":127},"f825137f":{"m":126,"g":127},"ae68158f":{"m":126,"g":127},"a7cc02e3":{"m":126,"g":127},"8ece99a9":{"m":126,"g":127},"44e391b6":{"m":126,"g":127},"1a5c313f":{"m":126,"g":127},"7ea5b42d":{"m":126,"g":127},"5ded5e27":{"m":126,"g":127},"33d1aeb0":{"m":126,"g":127},"dd909a51":{"m":126,"g":127},"60cb7167":{"m":126,"g":127},"151e1368":{"m":126,"g":127},"2f9952cd":{"m":126,"g":127},"3e7cc273":{"m":126,"g":127},"8f01a12d":{"m":126,"g":127},"28b8c579":{"m":126,"g":127},"7b877ab8":{"m":126,"g":127},"0d4a4184":{"m":126,"g":127},"2ca25a8a":{"m":126,"g":127},"cc2e36c3":{"m":126,"g":127},"d8f7816a":{"m":126,"g":127},"99e25805":{"m":126,"g":127},"4a2768a8":{"m":126,"g":127},"9b247f73":{"m":126,"g":127},"36d14712":{"m":126,"g":127},"e0e6a6ef":{"m":126,"g":127},"e8114102":{"m":126,"g":127},"14a339fc":{"m":126,"g":127},"5f662e78":{"m":126,"g":127},"7f5055ed":{"m":126,"g":127},"63728b11":{"m":126,"g":127},"5de25f78":{"m":126,"g":127},"d52800db":{"m":126,"g":127},"38a704bc":{"m":126,"g":127},"a06c44f9":{"m":126,"g":127},"527b7d3f":{"m":126,"g":127},"f09eee03":{"m":126,"g":127},"e38994dd":{"m":126,"g":127},"
4a78031a":{"m":126,"g":127},"ea10a9d1":{"m":126,"g":127},"71aea45c":{"m":126,"g":127},"e3b38d71":{"m":126,"g":127},"5c0cadd0":{"m":126,"g":127},"39b1d048":{"m":126,"g":127},"8db7fc41":{"m":126,"g":127},"fe92d4d8":{"m":126,"g":127},"fc8cda14":{"m":126,"g":127},"6a7322ff":{"m":126,"g":127},"c751cb38":{"m":126,"g":127},"3594815a":{"m":126,"g":127},"9caca6a4":{"m":126,"g":127},"08c805a8":{"m":126,"g":127},"f18ec927":{"m":126,"g":127},"aea88fa7":{"m":126,"g":127},"2fe4e69f":{"m":126,"g":127},"012bfc4f":{"m":126,"g":127},"0493775b":{"m":126,"g":127},"40b26b45":{"m":126,"g":127},"9840bf4f":{"m":126,"g":127},"7b2fb3d4":{"m":128,"g":131},"d64dd3e1":{"m":128,"g":131},"3ccd7fa6":{"m":128,"g":131},"254f62d8":{"m":128,"g":131},"2b8b9d84":{"m":128,"g":131},"e970892f":{"m":128,"g":131},"d5fa58c4":{"m":128,"g":131},"9f011f61":{"m":128,"g":131},"3b18fd4c":{"m":128,"g":131},"1869f25c":{"m":128,"g":131},"e019f233":{"m":128,"g":131},"b1c688fb":{"m":128,"g":131},"8e3663d4":{"m":128,"g":131},"9edb0e0d":{"m":128,"g":131},"95f43669":{"m":128,"g":131},"50691d7b":{"m":128,"g":131},"6afe3963":{"m":128,"g":131},"191f5c77":{"m":128,"g":131},"d7246708":{"m":128,"g":131},"efc5d8f5":{"m":128,"g":131},"ef32a252":{"m":128,"g":131},"9509c4cc":{"m":128,"g":131},"4e19c1d5":{"m":128,"g":131},"78a4b446":{"m":128,"g":131},"f9696641":{"m":128,"g":131},"f35f7f12":{"m":128,"g":131},"12c789eb":{"m":128,"g":131},"597d4160":{"m":128,"g":131},"1ca205f6":{"m":128,"g":131},"24a25ffa":{"m":128,"g":131},"db7299aa":{"m":128,"g":131},"13366843":{"m":128,"g":131},"be353ffd":{"m":128,"g":131},"20e59f95":{"m":128,"g":131},"0d116b9a":{"m":128,"g":131},"4a56fa5c":{"m":128,"g":131},"51f9b962":{"m":128,"g":131},"daf494b6":{"m":128,"g":131},"37e8724e":{"m":128,"g":131},"10592e9c":{"m":128,"g":131},"2aec8b6e":{"m":128,"g":131},"0d41ddfb":{"m":128,"g":131},"37c87615":{"m":128,"g":131},"3f400f25":{"m":128,"g":131},"d91b16eb":{"m":128,"g":131},"4a10e37b":{"m":128,"g":131},"b051d76d":{"m":128,"g":131},"d52d992a":{"m":128,"g":131},
"d971f228":{"m":128,"g":131},"1d3d42bd":{"m":128,"g":131},"8e9f05ec":{"m":128,"g":131},"bc083521":{"m":128,"g":131},"33f08a98":{"m":128,"g":131},"8e6083bf":{"m":128,"g":131},"2fbc78a0":{"m":128,"g":131},"f0b5ccf5":{"m":128,"g":131},"b732ffa4":{"m":128,"g":131},"56fc4830":{"m":128,"g":131},"2a96e302":{"m":128,"g":131},"8a437340":{"m":128,"g":131},"f0021c0d":{"m":128,"g":131},"7ee3e364":{"m":128,"g":131},"6d5e16fb":{"m":128,"g":131},"67e6f143":{"m":128,"g":131},"34851471":{"m":128,"g":131},"c2083116":{"m":128,"g":131},"10285ec2":{"m":128,"g":131},"9b3fc186":{"m":128,"g":131},"172c71a2":{"m":128,"g":127},"af373636":{"m":128,"g":127},"a5be6ef9":{"m":128,"g":127},"8f4e18a2":{"m":128,"g":127},"e9681444":{"m":128,"g":127},"eae59b33":{"m":128,"g":127},"14dc0523":{"m":128,"g":127},"b2236691":{"m":128,"g":127},"dcc47a56":{"m":128,"g":127},"6448b4cd":{"m":128,"g":127},"5ae0ac42":{"m":128,"g":127},"22f641ab":{"m":128,"g":127},"0997c78d":{"m":128,"g":127},"fd3be107":{"m":128,"g":127},"665f43bd":{"m":128,"g":127},"a53f2d6c":{"m":128,"g":127},"a7002e61":{"m":128,"g":127},"84e151ac":{"m":128,"g":127},"875a25dd":{"m":128,"g":127},"fc55b45e":{"m":128,"g":127},"0050ff25":{"m":128,"g":127},"af9f71f9":{"m":128,"g":127},"15264232":{"m":128,"g":127},"f8d3d80f":{"m":128,"g":127},"5027739f":{"m":128,"g":127},"3701f34d":{"m":128,"g":127},"821fb060":{"m":128,"g":127},"ace27c0c":{"m":128,"g":127},"ed1d18d4":{"m":128,"g":127},"e7b57b0d":{"m":128,"g":127},"49141df9":{"m":128,"g":127},"7b79cc4f":{"m":128,"g":127},"385ff0e5":{"m":128,"g":127},"fc5da1e8":{"m":128,"g":127},"04848ba7":{"m":128,"g":127},"9bc6a9ad":{"m":128,"g":127},"7cdaedb8":{"m":128,"g":127},"1f134f85":{"m":128,"g":127},"5f72d36d":{"m":128,"g":127},"7a2254b2":{"m":128,"g":127},"b5904999":{"m":128,"g":127},"922525ee":{"m":128,"g":127},"ee3e337c":{"m":128,"g":127},"e523e216":{"m":128,"g":127},"2ce23777":{"m":128,"g":127},"19f6a33c":{"m":128,"g":127},"e9c0c558":{"m":128,"g":127},"9b41f31a":{"m":128,"g":127},"9db3add3":{"m":128,"g":127}
,"c9e5799b":{"m":128,"g":127},"2966367a":{"m":128,"g":127},"87791007":{"m":128,"g":127},"dd192a55":{"m":128,"g":127},"bfe638f7":{"m":128,"g":127},"4ac65e3c":{"m":128,"g":127},"0779c3d1":{"m":128,"g":127},"5c2d72ba":{"m":128,"g":127},"9bd511a5":{"m":128,"g":127},"85b8c5c4":{"m":128,"g":127},"c8b7516f":{"m":128,"g":127},"67e9d287":{"m":128,"g":127},"e7e89349":{"m":128,"g":127},"aead0ef5":{"m":128,"g":127},"66640835":{"m":128,"g":127},"c2d69e8b":{"m":128,"g":127},"4c1e909a":{"m":128,"g":127},"2bb0317e":{"m":128,"g":127},"86255f27":{"m":128,"g":127},"e4b29370":{"m":128,"g":127},"7a8524b4":{"m":128,"g":127},"909d0d38":{"m":128,"g":127},"e42df37d":{"m":128,"g":127},"6d21392b":{"m":128,"g":127},"a1cb717d":{"m":128,"g":127},"c9456491":{"m":128,"g":127},"c4b74c1d":{"m":128,"g":127},"7aa44390":{"m":128,"g":127},"4eda9969":{"m":128,"g":127},"03a7e6f4":{"m":128,"g":127},"2cdde3d4":{"m":128,"g":127},"401ed0c5":{"m":128,"g":127},"4ef43905":{"m":128,"g":127},"d646cf63":{"m":128,"g":127},"4edb2401":{"m":128,"g":127},"706502ff":{"m":128,"g":127},"c2e56dad":{"m":128,"g":127},"9c1c5c6d":{"m":128,"g":127},"2d531946":{"m":128,"g":127},"7ae368ef":{"m":129,"g":131},"c4e20cad":{"m":129,"g":131},"5dad1ff1":{"m":129,"g":131},"ca52ed42":{"m":129,"g":131},"5c03aa3e":{"m":129,"g":131},"253be18e":{"m":129,"g":131},"084b06e7":{"m":129,"g":131},"fc6fb550":{"m":129,"g":131},"7c38eca1":{"m":129,"g":131},"427b08e2":{"m":129,"g":131},"0141ca37":{"m":129,"g":131},"25a6be49":{"m":129,"g":131},"df1f3124":{"m":129,"g":131},"c5947ecd":{"m":129,"g":131},"9530b766":{"m":129,"g":131},"3067b3f0":{"m":129,"g":131},"e0ec42c7":{"m":129,"g":131},"9c9d7091":{"m":129,"g":131},"51a86ce6":{"m":129,"g":131},"64092c8b":{"m":129,"g":131},"63b9300f":{"m":129,"g":131},"21ec99be":{"m":129,"g":131},"383689e3":{"m":129,"g":131},"236a7c23":{"m":129,"g":131},"3dabd609":{"m":129,"g":131},"73df5253":{"m":129,"g":131},"e6420100":{"m":129,"g":131},"c9e20901":{"m":129,"g":131},"106df4ea":{"m":129,"g":131},"427a19b6":{"m":129,"g":131
},"1f930cd2":{"m":129,"g":131},"11ce0516":{"m":129,"g":131},"cd4151ab":{"m":129,"g":131},"8fe8b635":{"m":129,"g":131},"8a7b1b83":{"m":129,"g":131},"26aebf83":{"m":129,"g":131},"3ab8ae68":{"m":129,"g":131},"03888b9d":{"m":129,"g":131},"796d82b1":{"m":129,"g":131},"1da59e83":{"m":129,"g":131},"1d66a14c":{"m":129,"g":131},"02af51e4":{"m":129,"g":131},"eb500884":{"m":129,"g":131},"079b1738":{"m":129,"g":131},"57f933fd":{"m":129,"g":131},"1f2b84d2":{"m":129,"g":131},"e7d6027e":{"m":129,"g":131},"07821352":{"m":129,"g":131},"edbeaf3b":{"m":129,"g":131},"d9dca282":{"m":129,"g":131},"9325f945":{"m":129,"g":131},"41b7aab8":{"m":129,"g":131},"491f4fe8":{"m":129,"g":131},"34035d8c":{"m":129,"g":131},"92ca6295":{"m":129,"g":131},"ec92d7f1":{"m":129,"g":131},"79b389da":{"m":129,"g":131},"3de09aad":{"m":129,"g":131},"2e8f54e6":{"m":129,"g":131},"e55731b6":{"m":129,"g":131},"45264554":{"m":129,"g":131},"c4293f59":{"m":129,"g":131},"a2423052":{"m":129,"g":131},"bc3d2a85":{"m":129,"g":131},"d815d002":{"m":129,"g":131},"fa9021b2":{"m":129,"g":131},"9c800728":{"m":129,"g":131},"630a6930":{"m":129,"g":131},"7ce8faae":{"m":129,"g":131},"de153cf7":{"m":129,"g":131},"f4a0c5c7":{"m":129,"g":131},"0f8e5394":{"m":129,"g":131},"e8ba5a66":{"m":129,"g":131},"a2960bdd":{"m":129,"g":131},"487c8d4d":{"m":129,"g":131},"f87b8eab":{"m":129,"g":131},"e8542db5":{"m":129,"g":131},"6df1e8d6":{"m":129,"g":131},"4addb602":{"m":129,"g":131},"bd0e6908":{"m":129,"g":131},"0825d7f4":{"m":129,"g":131},"0b9dbea5":{"m":129,"g":131},"982db4eb":{"m":129,"g":131},"f5f3a5d9":{"m":129,"g":131},"f138ae57":{"m":129,"g":131},"decb4896":{"m":129,"g":131},"c72f0756":{"m":129,"g":131},"f1115cf5":{"m":129,"g":131},"412160f4":{"m":129,"g":131},"7b03cc64":{"m":129,"g":131},"9872a677":{"m":129,"g":131},"c15c864b":{"m":129,"g":131},"dc7bdc73":{"m":129,"g":131},"0a9d6453":{"m":129,"g":131},"340c613a":{"m":129,"g":131},"36b729c2":{"m":129,"g":131},"67e6ef4b":{"m":129,"g":131},"65ba5ab8":{"m":129,"g":131},"990023e5":{"m":129,"g":13
1},"5ddd2f6b":{"m":129,"g":131},"0ae4b1ad":{"m":129,"g":131},"94cd64a7":{"m":129,"g":131},"b870271a":{"m":129,"g":131},"22ee9b01":{"m":129,"g":131},"9d0e5f1f":{"m":129,"g":131},"1d3d8b34":{"m":129,"g":131},"155a9e72":{"m":129,"g":131},"d7cb08c5":{"m":129,"g":131},"3339c810":{"m":129,"g":131},"d6c88d51":{"m":129,"g":131},"f03ea34a":{"m":129,"g":131},"4cafc835":{"m":129,"g":131},"c6a52f44":{"m":129,"g":131},"c6d34a06":{"m":129,"g":131},"848ee570":{"m":129,"g":131},"ce6b7dfc":{"m":129,"g":131},"0fe74af5":{"m":129,"g":131},"0a362d65":{"m":129,"g":131},"143b57b8":{"m":129,"g":131},"f446b51c":{"m":129,"g":131},"a102a050":{"m":129,"g":131},"6bad6a36":{"m":129,"g":131},"11b6217a":{"m":129,"g":131},"0b0b2607":{"m":129,"g":131},"841eb29d":{"m":129,"g":131},"45cf5758":{"m":129,"g":131},"0e8ce1e8":{"m":129,"g":131},"ea1e9f6b":{"m":129,"g":131},"f6e37d3e":{"m":129,"g":131},"ab9a46d4":{"m":129,"g":131},"621061f0":{"m":129,"g":131},"7daddcdb":{"m":129,"g":131},"91d249cd":{"m":129,"g":131},"95102896":{"m":129,"g":131},"3543a04a":{"m":129,"g":131},"e12c78aa":{"m":129,"g":131},"bce40fa2":{"m":129,"g":131},"051ad833":{"m":129,"g":131},"4c9f7c97":{"m":129,"g":131},"63b05621":{"m":129,"g":131},"21af8e73":{"m":129,"g":131},"7ab548ef":{"m":129,"g":131},"bab033b9":{"m":129,"g":131},"63500426":{"m":129,"g":131},"25758647":{"m":129,"g":131},"2bc8ee8b":{"m":129,"g":131},"ab843ced":{"m":129,"g":131},"6edffc63":{"m":129,"g":131},"077ca70e":{"m":129,"g":131},"7cb04dc0":{"m":129,"g":131},"5443db87":{"m":129,"g":131},"9f340ab1":{"m":129,"g":131},"70c6f951":{"m":129,"g":131},"d941a3be":{"m":129,"g":131},"b12c9e5c":{"m":129,"g":131},"e9e90460":{"m":129,"g":131},"6330d664":{"m":129,"g":131},"91e8dc37":{"m":129,"g":131},"231df4b0":{"m":129,"g":131},"b087ef8b":{"m":129,"g":131},"5155016b":{"m":129,"g":131},"082b54c6":{"m":129,"g":131},"a8ef4d18":{"m":129,"g":131},"15ff6982":{"m":129,"g":131},"9adef42c":{"m":129,"g":131},"685b9d82":{"m":129,"g":131},"a223402f":{"m":129,"g":131},"44d0a848":{"m":129,"g":1
31},"5b7da0f5":{"m":129,"g":131},"697a77bf":{"m":129,"g":131},"0a186924":{"m":129,"g":131},"b6312e62":{"m":129,"g":131},"779cbc6e":{"m":129,"g":131},"67c8c867":{"m":129,"g":131},"69a03bc3":{"m":129,"g":131},"66f242b9":{"m":129,"g":131},"8a9b8b84":{"m":129,"g":131},"7e964b51":{"m":129,"g":131},"a0d9f6cd":{"m":129,"g":131},"8308cd36":{"m":129,"g":131},"e0e8a996":{"m":129,"g":131},"5102d009":{"m":129,"g":131},"0dd759e0":{"m":129,"g":131},"5e70880e":{"m":129,"g":131},"9dab534b":{"m":129,"g":131},"262c3c1f":{"m":129,"g":131},"6c190cbd":{"m":129,"g":131},"eff6a07c":{"m":129,"g":131},"b704b0a9":{"m":129,"g":131},"15729dbc":{"m":129,"g":131},"21b0582d":{"m":129,"g":131},"5795da5e":{"m":129,"g":131},"540d6fee":{"m":129,"g":131},"5a8adca9":{"m":129,"g":131},"007c3e23":{"m":129,"g":131},"7130ad3a":{"m":129,"g":131},"35a4c21a":{"m":129,"g":131},"846ba3c6":{"m":129,"g":131},"18fb5158":{"m":129,"g":131},"ca5c8b16":{"m":129,"g":131},"f33e5d1e":{"m":129,"g":131},"13e5beea":{"m":129,"g":131},"c53e729d":{"m":129,"g":131},"03a26557":{"m":129,"g":131},"873382a9":{"m":129,"g":131},"391a863b":{"m":129,"g":131},"36b1bcd2":{"m":129,"g":131},"64a11303":{"m":129,"g":131},"1ab6ce0e":{"m":129,"g":131},"e99ca6ac":{"m":129,"g":131},"fcccaf90":{"m":129,"g":131},"215a97fa":{"m":129,"g":131},"5eed5fc0":{"m":129,"g":131},"808b6dfd":{"m":129,"g":131},"4852aa05":{"m":129,"g":131},"f922bfd5":{"m":129,"g":131},"dfd7ab96":{"m":129,"g":131},"46673b42":{"m":129,"g":131},"3421d049":{"m":129,"g":131},"d3d404d3":{"m":129,"g":131},"64225a8a":{"m":129,"g":131},"d64bf6c6":{"m":129,"g":131},"432ecf84":{"m":129,"g":131},"0b3f002d":{"m":129,"g":131},"6f094def":{"m":129,"g":131},"59464dbf":{"m":129,"g":131},"1f7fcc10":{"m":129,"g":131},"dbab5d50":{"m":129,"g":131},"760c20b3":{"m":129,"g":131},"407cb3ce":{"m":129,"g":131},"7cc43bd4":{"m":129,"g":131},"cce2d748":{"m":129,"g":131},"c1dd9a95":{"m":129,"g":131},"ed8786b0":{"m":129,"g":131},"f9fe0630":{"m":129,"g":131},"8ff3ef1f":{"m":129,"g":131},"da182e4b":{"m":129,"g":
131},"83e72077":{"m":129,"g":131},"a2c388ba":{"m":129,"g":131},"9384fa27":{"m":129,"g":131},"173e73fa":{"m":129,"g":131},"a164259e":{"m":129,"g":131},"b0a26ba6":{"m":129,"g":131},"de430b67":{"m":129,"g":131},"4b45d556":{"m":129,"g":131},"db0ffc09":{"m":129,"g":131},"eb1d8854":{"m":129,"g":131},"bf108692":{"m":129,"g":131},"e83bd1fa":{"m":129,"g":131},"9dc15d85":{"m":129,"g":131},"fafaa2cc":{"m":129,"g":131},"9b4b3441":{"m":129,"g":131},"94216a9c":{"m":129,"g":131},"9535015d":{"m":129,"g":131},"a3b578fc":{"m":129,"g":131},"b60e769d":{"m":129,"g":131},"a95a3807":{"m":129,"g":131},"98b38de3":{"m":129,"g":131},"a146f833":{"m":129,"g":131},"1dd9a6ae":{"m":129,"g":131},"8ef11569":{"m":129,"g":131},"ecefc790":{"m":129,"g":131},"aeac6220":{"m":129,"g":131},"04b52fa8":{"m":129,"g":131},"e5c0f591":{"m":129,"g":131},"981ca831":{"m":129,"g":131},"414248e0":{"m":129,"g":131},"f56b9b42":{"m":129,"g":131},"b2f7b08c":{"m":129,"g":131},"75222bfe":{"m":129,"g":131},"9ea19533":{"m":129,"g":131},"dbf22152":{"m":129,"g":131},"d5e03468":{"m":129,"g":131},"a22104a6":{"m":129,"g":131},"4683e244":{"m":129,"g":131},"9054e844":{"m":129,"g":131},"18403f6b":{"m":129,"g":131},"2892265d":{"m":129,"g":131},"c9bd1aca":{"m":129,"g":131},"618ca238":{"m":129,"g":131},"5c291549":{"m":129,"g":131},"aaa40a9b":{"m":129,"g":131},"dd70cf99":{"m":129,"g":131},"d4593964":{"m":129,"g":131},"53fffefd":{"m":129,"g":131},"b964ce61":{"m":129,"g":131},"ac5505b0":{"m":129,"g":131},"a90435c0":{"m":129,"g":131},"5354d7b7":{"m":129,"g":131},"e0148677":{"m":129,"g":131},"04793508":{"m":129,"g":131},"cad78789":{"m":129,"g":131},"b29769f3":{"m":129,"g":131},"3990b84b":{"m":129,"g":131},"5a4394a3":{"m":129,"g":131},"dd303614":{"m":129,"g":131},"86312468":{"m":129,"g":131},"5625e32c":{"m":129,"g":131},"ca548d83":{"m":129,"g":131},"3397bcee":{"m":129,"g":131},"3e804bb0":{"m":129,"g":131},"a22de641":{"m":129,"g":131},"ac438226":{"m":129,"g":131},"a92afb00":{"m":129,"g":131},"0eea17e3":{"m":129,"g":131},"38052432":{"m":129,"g"
:131},"8bfce9b0":{"m":129,"g":131},"b41afa37":{"m":129,"g":131},"94ae816f":{"m":129,"g":131},"53620a1b":{"m":129,"g":131},"59b4d7f8":{"m":129,"g":131},"a56f7702":{"m":129,"g":131},"1b48e1b9":{"m":129,"g":131},"a24aefe5":{"m":129,"g":131},"45c572c5":{"m":129,"g":131},"681b9e64":{"m":129,"g":131},"dab06b50":{"m":129,"g":131},"85ffce30":{"m":129,"g":131},"e94ef9fc":{"m":129,"g":131},"964cdedc":{"m":129,"g":131},"aa6e2c8a":{"m":129,"g":131},"1776dce5":{"m":129,"g":131},"99e13d18":{"m":129,"g":131},"dc836909":{"m":129,"g":131},"323fed5c":{"m":129,"g":131},"eff7df6d":{"m":129,"g":131},"5e7f91d4":{"m":129,"g":131},"a34d3abb":{"m":129,"g":131},"6d0e0b9b":{"m":129,"g":131},"589d9ad5":{"m":129,"g":131},"a244c030":{"m":129,"g":131},"1bb063aa":{"m":129,"g":131},"b30f63c4":{"m":129,"g":131},"90a01335":{"m":129,"g":131},"43602790":{"m":129,"g":131},"b537ac0d":{"m":129,"g":131},"eda2f700":{"m":129,"g":131},"8c212a20":{"m":129,"g":131},"d754ce97":{"m":129,"g":131},"475962a1":{"m":129,"g":131},"bfcf15a1":{"m":129,"g":131},"fb04d434":{"m":129,"g":131},"6be65ae4":{"m":129,"g":131},"750084ae":{"m":129,"g":131},"64480ec7":{"m":129,"g":131},"db2d362d":{"m":129,"g":131},"81e86992":{"m":129,"g":131},"c4db77f8":{"m":129,"g":131},"c0a2513b":{"m":129,"g":131},"3ae664d7":{"m":129,"g":131},"c56fc424":{"m":129,"g":131},"3f1cfd87":{"m":129,"g":131},"b5344b31":{"m":129,"g":131},"6b262ac8":{"m":129,"g":131},"ada8ce1f":{"m":129,"g":131},"5a2c7039":{"m":129,"g":131},"42028af6":{"m":129,"g":131},"7291c72e":{"m":129,"g":131},"6bc30628":{"m":129,"g":131},"fa924410":{"m":129,"g":131},"fc9efdcb":{"m":129,"g":131},"2dec555d":{"m":129,"g":131},"acde21d8":{"m":129,"g":131},"4528cb7d":{"m":129,"g":131},"a352e833":{"m":129,"g":131},"852eb6ce":{"m":129,"g":131},"7af9b88c":{"m":129,"g":131},"2847e5c4":{"m":129,"g":131},"c8ede0e9":{"m":129,"g":131},"19729f72":{"m":129,"g":131},"b51f9bbe":{"m":129,"g":131},"4a8442af":{"m":129,"g":131},"7dcf910d":{"m":129,"g":131},"c7b37b70":{"m":129,"g":131},"10e0b83a":{"m":129,"g
":131},"bc42c8c4":{"m":129,"g":131},"2e3a69ae":{"m":129,"g":131},"21370ef7":{"m":129,"g":131},"c3c4da71":{"m":129,"g":131},"dc694624":{"m":129,"g":131},"48ca9f75":{"m":129,"g":131},"127d59cd":{"m":129,"g":131},"af6bcadc":{"m":129,"g":131},"67fca6b2":{"m":129,"g":131},"f88b2aa6":{"m":129,"g":131},"17b24aca":{"m":129,"g":131},"bfaf0b86":{"m":129,"g":131},"83756a4b":{"m":129,"g":131},"a3557949":{"m":129,"g":131},"e72cf136":{"m":129,"g":131},"196b940a":{"m":129,"g":131},"a1e1e533":{"m":129,"g":131},"d4a4dcdf":{"m":129,"g":131},"97ba2c2d":{"m":129,"g":131},"e197bef5":{"m":129,"g":131},"8900f996":{"m":129,"g":131},"ba9102f9":{"m":129,"g":131},"b638abba":{"m":129,"g":131},"37980559":{"m":129,"g":131},"075ba74d":{"m":129,"g":131},"f7be98e1":{"m":129,"g":131},"9a1a9a42":{"m":129,"g":131},"6c2e5fcd":{"m":129,"g":131},"0d2d6878":{"m":129,"g":131},"9f59194f":{"m":129,"g":131},"cf1f0166":{"m":129,"g":131},"10969ae4":{"m":129,"g":131},"92ad2ff9":{"m":129,"g":131},"9b64f6f3":{"m":129,"g":131},"c0d1a338":{"m":129,"g":131},"a9d22b75":{"m":129,"g":131},"6b9459e8":{"m":129,"g":131},"3a6ec47b":{"m":129,"g":131},"b8e32e79":{"m":129,"g":131},"f5566acc":{"m":129,"g":131},"d79e1294":{"m":129,"g":131},"6e9b1549":{"m":129,"g":131},"109f27ba":{"m":129,"g":131},"c1a30aa7":{"m":129,"g":131},"2e1dbdb2":{"m":129,"g":131},"6d025fd3":{"m":129,"g":131},"518467be":{"m":129,"g":131},"63807079":{"m":129,"g":131},"f6cfe9f1":{"m":129,"g":131},"7bc99d41":{"m":129,"g":131},"e2d67468":{"m":129,"g":131},"6beb6e99":{"m":129,"g":131},"7e88b9c1":{"m":129,"g":131},"cfcf2758":{"m":129,"g":131},"ac81db66":{"m":129,"g":131},"33905005":{"m":129,"g":131},"820e13c9":{"m":129,"g":131},"595adf6d":{"m":129,"g":131},"a5ad0069":{"m":129,"g":131},"4e41edcb":{"m":129,"g":131},"67071f55":{"m":129,"g":131},"0c966779":{"m":129,"g":131},"f3386077":{"m":129,"g":131},"4ce8fb3c":{"m":129,"g":131},"4c3573e4":{"m":129,"g":131},"f1be8aa0":{"m":129,"g":131},"aa8ecbda":{"m":129,"g":131},"9188fecc":{"m":129,"g":131},"26ca0746":{"m":129,"
g":131},"90c18a16":{"m":129,"g":131},"d7984f31":{"m":129,"g":131},"7119d188":{"m":129,"g":131},"fe3bbfb4":{"m":129,"g":131},"9846f8ed":{"m":129,"g":131},"85ae508e":{"m":129,"g":131},"d879e37f":{"m":129,"g":131},"e2c9a590":{"m":129,"g":131},"a1e37b02":{"m":129,"g":131},"e389f91d":{"m":129,"g":131},"aac07bf7":{"m":129,"g":131},"ea89a3a0":{"m":129,"g":131},"2bc7c5eb":{"m":129,"g":131},"a63f433b":{"m":129,"g":131},"58f8f4e4":{"m":129,"g":131},"25acbbc6":{"m":129,"g":131},"a8fcbf6f":{"m":129,"g":131},"e486308c":{"m":129,"g":131},"60420109":{"m":129,"g":131},"ff00b6ad":{"m":129,"g":131},"7b445260":{"m":129,"g":131},"c236d05f":{"m":129,"g":131},"df561392":{"m":129,"g":131},"15db5497":{"m":129,"g":131},"9ba3597d":{"m":129,"g":131},"80797c2a":{"m":129,"g":131},"7afff8fd":{"m":129,"g":131},"ac406d43":{"m":129,"g":131},"b436113f":{"m":129,"g":131},"8b5e2c53":{"m":129,"g":131},"b24235b8":{"m":129,"g":131},"ae7698fb":{"m":129,"g":131},"a3e4fe4b":{"m":129,"g":131},"f3e9336d":{"m":129,"g":131},"290fcd89":{"m":129,"g":131},"1dcde539":{"m":129,"g":131},"15bc1f5c":{"m":129,"g":131},"4d597616":{"m":129,"g":131},"ab63f3c5":{"m":129,"g":131},"147b7823":{"m":129,"g":131},"d368c745":{"m":129,"g":131},"7e626d12":{"m":129,"g":131},"2a577344":{"m":129,"g":131},"6abb8051":{"m":130,"g":131},"2e3946d8":{"m":130,"g":131},"32f8b606":{"m":130,"g":131},"b9bef31a":{"m":130,"g":131},"8550822d":{"m":130,"g":131},"8810152e":{"m":130,"g":131},"7bf16c63":{"m":130,"g":131},"39f9a9c2":{"m":130,"g":131},"d69ecc19":{"m":130,"g":131},"763888b5":{"m":130,"g":131},"9a327bdf":{"m":130,"g":131},"2de98010":{"m":130,"g":131},"8200fb56":{"m":130,"g":131},"cb4cdb43":{"m":130,"g":131},"80cfca50":{"m":130,"g":131},"7871593c":{"m":130,"g":131},"4a62a0e3":{"m":130,"g":131},"12a08efc":{"m":130,"g":131},"06836ad0":{"m":130,"g":131},"f72a7703":{"m":130,"g":131},"aeff0d38":{"m":130,"g":131},"cf0478d6":{"m":130,"g":131},"a2ca9bd4":{"m":130,"g":131},"36361adc":{"m":130,"g":131},"661e9775":{"m":130,"g":131},"2970f229":{"m":130,
"g":131},"c08b780f":{"m":130,"g":131},"8fbf7dd5":{"m":130,"g":131},"1915a1f8":{"m":130,"g":131},"85d0ccfa":{"m":130,"g":131},"a4ffd665":{"m":130,"g":131},"559202b5":{"m":130,"g":131},"f57d4fe7":{"m":130,"g":131},"b7b7524e":{"m":130,"g":131},"6799847e":{"m":130,"g":131},"03b835e7":{"m":130,"g":131},"aff1238e":{"m":130,"g":131},"5e2cda61":{"m":130,"g":131},"b0bbc7f5":{"m":130,"g":131},"673c11ba":{"m":130,"g":131},"3b47973a":{"m":130,"g":131},"f6423b62":{"m":130,"g":131},"84efe54b":{"m":130,"g":131},"948b6ace":{"m":130,"g":131},"f124539a":{"m":130,"g":131},"c8683ae3":{"m":130,"g":131},"125e17ef":{"m":130,"g":131},"88c459c6":{"m":130,"g":131},"ae6a6630":{"m":130,"g":131},"26d95008":{"m":130,"g":131},"f2b5dcc9":{"m":130,"g":131},"9abcab3f":{"m":130,"g":131},"e5135b73":{"m":130,"g":131},"41d61faa":{"m":130,"g":131},"3c7886ec":{"m":130,"g":131},"0e4d8790":{"m":130,"g":131},"6d5d76ad":{"m":130,"g":131},"91c9c14c":{"m":130,"g":131},"32a32cf7":{"m":130,"g":131},"be4a3ec3":{"m":130,"g":131},"ff6e3ea9":{"m":130,"g":131},"dd91d38e":{"m":130,"g":131},"5f6f550a":{"m":130,"g":131},"5edbe351":{"m":130,"g":131},"d2b42477":{"m":130,"g":131},"9dfa01a4":{"m":130,"g":131},"e592ee65":{"m":130,"g":131},"cee93a6f":{"m":130,"g":131},"bc388471":{"m":130,"g":131},"3e40c636":{"m":130,"g":131},"80122e4f":{"m":130,"g":131},"e12c6b32":{"m":130,"g":131},"6d417918":{"m":130,"g":131},"35a9a073":{"m":130,"g":131},"ea177372":{"m":130,"g":131},"42fcf543":{"m":130,"g":131},"d257bf87":{"m":130,"g":131},"7b0c7ad1":{"m":130,"g":131},"d30d6b36":{"m":130,"g":131},"d881f314":{"m":130,"g":131},"2ac5b983":{"m":130,"g":131},"a0dde90a":{"m":130,"g":131},"b988c18e":{"m":130,"g":131},"e41664ba":{"m":130,"g":131},"3d1b591a":{"m":130,"g":131},"e11f795f":{"m":130,"g":131},"959a1746":{"m":130,"g":131},"b72f0268":{"m":130,"g":131},"09376fd7":{"m":130,"g":131},"aed835e3":{"m":130,"g":131},"49dfa1d8":{"m":130,"g":131},"1ea6b740":{"m":130,"g":131},"e73173b0":{"m":130,"g":131},"16e8463a":{"m":130,"g":131},"cf9a774c":{"m":130
,"g":131},"ec7b2c16":{"m":130,"g":131},"1569fc7f":{"m":130,"g":131},"5a46fb15":{"m":130,"g":131},"38daa294":{"m":130,"g":131},"66984a8b":{"m":130,"g":131},"889b46ea":{"m":130,"g":131},"05284378":{"m":130,"g":131},"a8904560":{"m":130,"g":131},"66280987":{"m":130,"g":131},"205f041e":{"m":130,"g":131},"7235a7fb":{"m":130,"g":131},"8fce9e7b":{"m":130,"g":131},"53477322":{"m":130,"g":131},"2ce121a1":{"m":130,"g":131},"35ba6fe1":{"m":130,"g":131},"498ea41c":{"m":130,"g":131},"7c744d13":{"m":130,"g":131},"46b05ef5":{"m":130,"g":131},"beec8eed":{"m":130,"g":131},"b76e303e":{"m":130,"g":131},"80a575e4":{"m":130,"g":131},"4c5074eb":{"m":130,"g":131},"41429a8c":{"m":130,"g":131},"532037df":{"m":130,"g":131},"fa0ca976":{"m":130,"g":131},"b5d39985":{"m":130,"g":131},"2ecee757":{"m":130,"g":131},"6d37e708":{"m":130,"g":131},"c1006fd8":{"m":130,"g":131},"29c6c2ea":{"m":130,"g":131},"eb85fa6d":{"m":130,"g":131},"0e6441b4":{"m":130,"g":131},"88d1bab5":{"m":130,"g":131},"922756aa":{"m":130,"g":131},"d8faf2f3":{"m":130,"g":131},"7dfcc781":{"m":130,"g":131},"7f3308bc":{"m":130,"g":131},"fdc2ef58":{"m":130,"g":131},"1808df48":{"m":130,"g":131},"b01fc161":{"m":130,"g":131},"788628b5":{"m":130,"g":131},"11d33c0e":{"m":130,"g":131},"441420e1":{"m":130,"g":131},"29a2d4b5":{"m":130,"g":131},"079ac237":{"m":130,"g":131},"84280784":{"m":130,"g":131},"af35023e":{"m":130,"g":131},"cb8df87f":{"m":130,"g":131},"e3ab23c1":{"m":130,"g":131},"70d25873":{"m":130,"g":131},"894c0dc5":{"m":130,"g":131},"d6c49019":{"m":130,"g":131},"fa78c44a":{"m":130,"g":131},"654a78f9":{"m":130,"g":131},"f90b4004":{"m":130,"g":131},"78647e08":{"m":130,"g":131},"46f21a59":{"m":130,"g":131},"b2b09f5f":{"m":130,"g":131},"04df80a9":{"m":130,"g":131},"4f73e53d":{"m":130,"g":131},"16ff892c":{"m":130,"g":131},"df026bb1":{"m":130,"g":131},"d42c167b":{"m":130,"g":131},"38815105":{"m":130,"g":131},"9d823402":{"m":130,"g":131},"8ab5d8b4":{"m":130,"g":131},"03575ce3":{"m":130,"g":131},"80518bea":{"m":130,"g":131},"7e78825d":{"m":13
0,"g":131},"5bbd83a2":{"m":130,"g":131},"abf6272b":{"m":130,"g":131},"46d7b35e":{"m":130,"g":131},"20aad5b5":{"m":130,"g":131},"974c562a":{"m":130,"g":131},"aca0d01d":{"m":130,"g":131},"dc163502":{"m":130,"g":131},"24903b88":{"m":130,"g":131},"443d7bcd":{"m":130,"g":131},"16d8de22":{"m":130,"g":131},"d122e324":{"m":130,"g":131},"96cc1083":{"m":130,"g":131},"77512ae0":{"m":130,"g":131},"c233e9d7":{"m":130,"g":131},"58ac3f31":{"m":130,"g":131},"93452a82":{"m":130,"g":131},"65c8568c":{"m":130,"g":131},"4bcc5879":{"m":130,"g":131},"d5ea8c71":{"m":130,"g":131},"42271376":{"m":130,"g":131},"84e0abb7":{"m":130,"g":131},"043f1317":{"m":130,"g":131},"f764c691":{"m":130,"g":131},"92205407":{"m":130,"g":131},"7d1a130c":{"m":130,"g":131},"5c8bd8b5":{"m":132,"g":133},"b05b346a":{"m":132,"g":133},"5c961756":{"m":132,"g":133},"c5f1e861":{"m":132,"g":133},"2c4d376d":{"m":132,"g":133},"d0f756ae":{"m":132,"g":133},"6f99dc97":{"m":132,"g":133},"cd1c1fa5":{"m":132,"g":133},"5b0872d2":{"m":132,"g":133},"ba88f1ca":{"m":132,"g":133},"60560c07":{"m":132,"g":133},"ca114421":{"m":132,"g":133},"543d62d1":{"m":132,"g":133},"27032cec":{"m":132,"g":133},"5d804a37":{"m":132,"g":133},"388018a5":{"m":132,"g":133},"fca8e88f":{"m":132,"g":133},"45eeeb9a":{"m":132,"g":133},"a368df28":{"m":132,"g":133},"f85460fb":{"m":132,"g":133},"e52cf30e":{"m":132,"g":133},"a076d75e":{"m":132,"g":133},"8348725d":{"m":132,"g":133},"28566241":{"m":132,"g":133},"b62fe850":{"m":132,"g":133},"624725cb":{"m":132,"g":133},"1a96e664":{"m":132,"g":133},"32829b16":{"m":132,"g":133},"e54307f2":{"m":132,"g":133},"8642dbe4":{"m":132,"g":133},"7dcad45c":{"m":132,"g":133},"7c985331":{"m":132,"g":133},"bd7824b2":{"m":132,"g":133},"312df1d6":{"m":132,"g":133},"25e97380":{"m":132,"g":133},"b6523a4f":{"m":132,"g":133},"c51efb8b":{"m":132,"g":133},"bcc5483e":{"m":132,"g":133},"ccf26027":{"m":132,"g":133},"a4992873":{"m":132,"g":133},"c97ce391":{"m":132,"g":133},"c032b559":{"m":132,"g":133},"da9b801e":{"m":132,"g":133},"0e54a695":{"m":1
32,"g":133},"e99ee0c6":{"m":132,"g":133},"c1bd5ee8":{"m":132,"g":133},"ef1ab230":{"m":132,"g":133},"d6598737":{"m":132,"g":133},"6c5ebc0e":{"m":132,"g":133},"5b5571a8":{"m":132,"g":133},"1698c234":{"m":132,"g":133},"3d82c0f1":{"m":132,"g":133},"617e9b3b":{"m":132,"g":133},"83e35a7c":{"m":132,"g":133},"2543666c":{"m":132,"g":133},"f732f8ea":{"m":132,"g":133},"503880db":{"m":132,"g":133},"b8cfa02c":{"m":132,"g":133},"d85fecb5":{"m":132,"g":133},"6634f67b":{"m":132,"g":133},"5eccaf77":{"m":132,"g":133},"d7f6320b":{"m":132,"g":133},"766476f5":{"m":132,"g":133},"12b7a4fa":{"m":132,"g":133},"02f1e81e":{"m":132,"g":133},"908c7186":{"m":132,"g":133},"03836d85":{"m":132,"g":133},"87dbdddc":{"m":132,"g":133},"56e5c074":{"m":132,"g":133},"6c9c8da6":{"m":132,"g":133},"21028b55":{"m":132,"g":133},"b0a25d09":{"m":132,"g":133},"793c98af":{"m":132,"g":133},"b1cbfce6":{"m":132,"g":133},"b0f531ad":{"m":132,"g":133},"01835998":{"m":132,"g":133},"4285e99d":{"m":132,"g":133},"f0774368":{"m":132,"g":133},"5e8f544d":{"m":132,"g":133},"c8d74feb":{"m":132,"g":133},"cbc7dcda":{"m":132,"g":133},"a6dc7d29":{"m":132,"g":133},"390406c4":{"m":132,"g":131},"0c63fb94":{"m":132,"g":131},"9ad02b79":{"m":132,"g":131},"18bd8e8d":{"m":132,"g":131},"036e64da":{"m":132,"g":131},"7c6fb3aa":{"m":132,"g":131},"8b0b6a45":{"m":132,"g":131},"55504df2":{"m":132,"g":131},"73df7a4e":{"m":132,"g":131},"8b98bb76":{"m":132,"g":131},"15bc8cbd":{"m":132,"g":131},"9496f12d":{"m":132,"g":131},"ab004879":{"m":132,"g":131},"6ec77680":{"m":132,"g":131},"13680e55":{"m":132,"g":131},"98c430e1":{"m":132,"g":131},"fe7f91ef":{"m":132,"g":131},"cef5ba65":{"m":132,"g":131},"53d17088":{"m":132,"g":131},"f0e948a0":{"m":132,"g":131},"9a426fc5":{"m":132,"g":131},"66772aa2":{"m":132,"g":131},"b6263344":{"m":132,"g":131},"0f8bd55f":{"m":132,"g":131},"da3dc497":{"m":132,"g":131},"817daba0":{"m":132,"g":131},"af60cad0":{"m":132,"g":131},"ce4e836b":{"m":132,"g":131},"0e0b0c05":{"m":132,"g":131},"e6f0ddda":{"m":132,"g":131},"af20657c":{"m":
132,"g":131},"08da4c26":{"m":132,"g":131},"ef3f8c97":{"m":132,"g":131},"e5201bda":{"m":132,"g":131},"60d36e7b":{"m":132,"g":131},"6f657070":{"m":132,"g":131},"c106b54b":{"m":132,"g":131},"119fd956":{"m":132,"g":131},"eac5b664":{"m":132,"g":131},"07404d76":{"m":132,"g":131},"93043f7b":{"m":132,"g":131},"edde5e5d":{"m":132,"g":131},"232982a0":"m134","9e88c0a2":"m134","b7e0d54e":"m134","5dde0a57":"m134","9e5ab903":"m134","98225be6":{"m":134,"g":135},"94bcc19b":{"m":134,"g":135},"db3821a9":{"m":134,"g":135},"8c5d91b8":{"m":134,"g":135},"6a3e7092":{"m":134,"g":135},"c2601f0d":{"m":134,"g":135},"1048803c":{"m":134,"g":135},"8a84b1e7":{"m":134,"g":135},"5e20e7a6":{"m":134,"g":135},"3946dad6":{"m":134,"g":135},"ac03ec08":{"m":134,"g":135},"94164646":{"m":134,"g":135},"5fb734f1":{"m":134,"g":135},"60f1ca69":{"m":134,"g":135},"9e263c21":{"m":134,"g":135},"0d003e34":{"m":134,"g":135},"f253f43c":{"m":134,"g":135},"1e453201":{"m":134,"g":135},"26e17f90":{"m":134,"g":135},"3de23274":{"m":134,"g":135},"f4ec6f8e":{"m":134,"g":135},"9c4eb460":{"m":134,"g":135},"c2e0913e":{"m":134,"g":135},"8c6f865a":{"m":134,"g":135},"269aa27b":{"m":134,"g":135},"f39382c6":{"m":134,"g":135},"2ff289e2":{"m":134,"g":135},"b2a3f055":{"m":134,"g":135},"684e148e":{"m":134,"g":135},"f44c4b37":{"m":134,"g":135},"7380ec9d":{"m":134,"g":135},"ac78f96e":{"m":134,"g":135},"278012ca":{"m":134,"g":135},"7f587998":{"m":134,"g":135},"162d1cf9":{"m":134,"g":135},"8e08207c":{"m":134,"g":135},"c31f6272":{"m":134,"g":135},"b5d9fc87":{"m":134,"g":135},"88f3de25":{"m":134,"g":135},"24616c52":{"m":134,"g":135},"d48723b7":{"m":134,"g":135},"f3d73b01":{"m":134,"g":135},"4ab66d95":{"m":134,"g":135},"de2799f3":{"m":134,"g":135},"f784cbfa":{"m":134,"g":135},"09733090":{"m":134,"g":135},"a44d0079":{"m":134,"g":135},"8305dc17":{"m":134,"g":135},"ec8c831d":{"m":134,"g":135},"f13949e5":{"m":134,"g":135},"c236a3fd":{"m":134,"g":135},"41a1d16b":{"m":134,"g":135},"9884c9fd":{"m":134,"g":135},"a435f55d":{"m":134,"g":135},"2ec6fa3c":{
"m":134,"g":135},"c58a573a":{"m":134,"g":135},"b840d6aa":{"m":134,"g":135},"ef4b3c0e":{"m":134,"g":135},"6f9d0a89":{"m":134,"g":135},"d7a3336e":{"m":134,"g":135},"7d02c8e5":{"m":134,"g":135},"e6d5a213":{"m":134,"g":135},"9f8e2307":{"m":134,"g":135},"d90f9bfc":{"m":134,"g":135},"3881bc8d":{"m":134,"g":135},"208e6a9d":{"m":134,"g":135},"be3828a1":{"m":134,"g":135},"5969be2f":{"m":134,"g":135},"8fab4895":{"m":134,"g":135},"7ccaec64":{"m":134,"g":135},"c457aad5":{"m":134,"g":135},"c7e7bfa3":{"m":134,"g":135},"bf90ea9c":{"m":134,"g":135},"26c50912":{"m":134,"g":135},"325a4c19":{"m":134,"g":135},"0294844f":{"m":134,"g":135},"0e536600":{"m":134,"g":135},"d70c2655":{"m":134,"g":135},"656f4d69":{"m":134,"g":135},"8e43980e":{"m":134,"g":135},"474a4699":{"m":134,"g":135},"b4a00ed2":{"m":134,"g":135},"f55d608c":{"m":134,"g":135},"349ce2dd":{"m":134,"g":135},"183b6519":{"m":134,"g":135},"2af955e1":{"m":134,"g":135},"0cd2b719":{"m":134,"g":135},"39d56196":{"m":134,"g":135},"41addd2e":{"m":134,"g":135},"5c393e81":{"m":134,"g":135},"3645ed0f":{"m":134,"g":135},"ca740a41":{"m":134,"g":135},"0e25aa43":{"m":134,"g":135},"60a230b1":{"m":134,"g":135},"aa89c6a7":{"m":134,"g":135},"171912a9":{"m":134,"g":135},"faecd37e":{"m":134,"g":135},"9ad546d7":{"m":134,"g":135},"29ce7b36":{"m":134,"g":135},"a8380ded":{"m":134,"g":135},"4edee695":{"m":134,"g":135},"67caea6f":{"m":134,"g":135},"cd3289c7":{"m":134,"g":135},"acddb8e0":{"m":134,"g":135},"2ec57cef":{"m":134,"g":135},"988b14ca":{"m":134,"g":135},"93495dca":{"m":134,"g":135},"886e0383":{"m":134,"g":135},"01bd0d3e":{"m":134,"g":135},"43e1bbc0":{"m":134,"g":135},"59b12996":{"m":134,"g":135},"b7091496":{"m":134,"g":135},"51dbdb22":{"m":134,"g":135},"8dc6f0fc":{"m":134,"g":135},"cf34d0ab":{"m":134,"g":135},"3778c2fc":{"m":134,"g":135},"5d421db8":{"m":134,"g":135},"fe3d47fc":{"m":134,"g":135},"ef92b4eb":{"m":134,"g":135},"73c0c66f":{"m":134,"g":135},"a1e9b4ed":{"m":134,"g":135},"e75657c8":{"m":134,"g":135},"c28c536c":{"m":134,"g":135},"086813ae":
{"m":134,"g":135},"3fd232ad":{"m":134,"g":135},"cb181295":{"m":134,"g":135},"a91e072f":{"m":134,"g":135},"0271fc34":{"m":134,"g":135},"68bece8c":{"m":134,"g":135},"7b7e357f":{"m":134,"g":135},"2f66b067":{"m":134,"g":135},"48051181":{"m":134,"g":135},"f2ccc442":{"m":134,"g":135},"caa95c7e":{"m":134,"g":135},"9d878c1f":{"m":134,"g":135},"8087ef12":{"m":134,"g":135},"6ef543f9":{"m":134,"g":135},"a3559119":{"m":134,"g":135},"f3ba7116":{"m":134,"g":135},"5c243ba5":{"m":134,"g":135},"b6702d72":{"m":134,"g":135},"c1256727":{"m":134,"g":135},"bb9e6cdf":{"m":134,"g":135},"de03b0cd":{"m":134,"g":135},"2a8a7856":{"m":134,"g":135},"f4e835af":{"m":134,"g":135},"e6ce16a4":{"m":134,"g":135},"de2f2880":{"m":134,"g":135},"a89e85e7":{"m":134,"g":135},"cbf9f134":{"m":134,"g":135},"eb3da9c1":{"m":134,"g":135},"72a980c6":{"m":134,"g":135},"ccf2330b":{"m":134,"g":135},"b9af8d2e":{"m":134,"g":135},"10a9573e":{"m":134,"g":135},"49ab72f8":{"m":134,"g":135},"8865424f":{"m":134,"g":135},"b311c43d":{"m":134,"g":135},"0c39730b":{"m":134,"g":135},"45adad37":{"m":134,"g":135},"1ba897f3":{"m":134,"g":135},"38dd4fbb":{"m":134,"g":135},"ecd2d09a":{"m":134,"g":135},"92ddc468":{"m":134,"g":135},"5454d2a7":{"m":134,"g":133},"17b38f88":{"m":134,"g":133},"ae434f78":{"m":134,"g":133},"643aeefe":{"m":134,"g":133},"186a56f6":{"m":134,"g":133},"17e65466":{"m":134,"g":133},"370bd27f":{"m":134,"g":133},"b27b5a83":{"m":134,"g":133},"2f7c6292":{"m":134,"g":133},"2fb31605":{"m":134,"g":133},"8bf7f240":{"m":134,"g":133},"2c5679f3":{"m":134,"g":133},"159b1283":{"m":134,"g":133},"9338f63f":{"m":134,"g":133},"8196998a":{"m":134,"g":133},"aa21c6e3":{"m":134,"g":133},"fd4a558e":{"m":134,"g":133},"b3b818fd":{"m":134,"g":133},"d5fbbfd9":{"m":134,"g":133},"e245cac0":{"m":134,"g":133},"c6a6ba43":{"m":134,"g":133},"d6108166":{"m":134,"g":133},"ddb3970e":{"m":134,"g":133},"f65fa047":{"m":134,"g":133},"7e027691":{"m":134,"g":133},"cb719c74":{"m":134,"g":133},"ff903a7e":{"m":134,"g":133},"eee3700d":{"m":134,"g":133},"e254cdf3"
:{"m":134,"g":133},"dfb53574":{"m":134,"g":133},"96655749":{"m":134,"g":133},"ac320a6f":{"m":134,"g":133},"6292c244":{"m":134,"g":133},"5f5a5677":{"m":134,"g":133},"3bf07c68":{"m":134,"g":133},"e7b09efc":{"m":134,"g":133},"99d3bcdf":{"m":134,"g":133},"4d64f150":{"m":134,"g":133},"aef7ca7c":{"m":134,"g":133},"fe712aa3":{"m":134,"g":133},"846953d9":{"m":134,"g":133},"5c64a20d":{"m":134,"g":133},"cf817376":{"m":134,"g":133},"aa6ac966":{"m":134,"g":133},"0d0367e9":{"m":134,"g":133},"bd572360":{"m":134,"g":133},"5f3a47d8":{"m":134,"g":133},"80ae2229":{"m":134,"g":133},"705287b2":{"m":134,"g":133},"76284653":{"m":134,"g":133},"dd620987":{"m":134,"g":133},"53f974b9":{"m":134,"g":133},"6a5764a7":{"m":134,"g":133},"291f11ae":{"m":134,"g":133},"d7301c89":{"m":134,"g":133},"758b9067":{"m":134,"g":133},"ffc23ef8":{"m":134,"g":133},"b3f83cc1":{"m":134,"g":133},"c15fa1c5":{"m":134,"g":133},"fa296698":{"m":134,"g":133},"f9dd90ac":{"m":134,"g":133},"66902e0f":{"m":134,"g":133},"e50f356f":{"m":134,"g":133},"ac42797c":{"m":134,"g":133},"883747ce":{"m":134,"g":133},"989d4b30":{"m":134,"g":133},"bc3ca300":{"m":134,"g":133},"061f41af":{"m":134,"g":133},"82f1d615":{"m":134,"g":133},"5e1a495c":{"m":134,"g":133},"34013d9d":{"m":134,"g":133},"77597167":{"m":134,"g":133},"3c882db3":{"m":134,"g":133},"b736a152":{"m":134,"g":133},"6984837d":{"m":134,"g":133},"2142881b":{"m":134,"g":133},"d77f3fcc":{"m":134,"g":133},"828dec1c":{"m":134,"g":133},"575a49dc":{"m":134,"g":133},"e62e1744":{"m":134,"g":133},"d5431ff8":{"m":134,"g":133},"454a2544":{"m":134,"g":133},"89619a99":{"m":134,"g":133},"cb30d056":{"m":134,"g":133},"677930c2":{"m":134,"g":133},"beae3f96":{"m":134,"g":133},"1167867e":{"m":134,"g":133},"ad7f35fb":{"m":134,"g":133},"f4100732":{"m":134,"g":133},"468931b5":{"m":134,"g":133},"796969ca":{"m":134,"g":133},"122c2503":{"m":134,"g":133},"a3a55223":{"m":134,"g":133},"0bf95e6d":{"m":134,"g":133},"a92de891":{"m":134,"g":133},"b9d78605":{"m":134,"g":133},"1354063a":{"m":134,"g":133},"254de6d2
":{"m":134,"g":133},"350fbbf4":{"m":134,"g":133},"a3912667":{"m":134,"g":133},"e1dcd0df":{"m":134,"g":133},"393e2f9b":{"m":134,"g":133},"8766a1dd":{"m":134,"g":133},"1d9ba2ce":{"m":134,"g":133},"ef001fb8":{"m":134,"g":133},"c69c1c4f":{"m":134,"g":133},"bed301a5":{"m":134,"g":133},"8fe3e374":{"m":134,"g":133},"60143655":{"m":134,"g":133},"fc05acc2":{"m":134,"g":133},"1ed94668":{"m":134,"g":133},"d7fbe73b":{"m":134,"g":133},"42bff706":{"m":134,"g":133},"26704c23":{"m":134,"g":133},"43b7c174":{"m":134,"g":133},"96740d69":{"m":134,"g":133},"9a3bdf2c":{"m":134,"g":133},"4b351f6b":{"m":134,"g":133},"7fa4906f":{"m":134,"g":133},"050f108c":{"m":134,"g":133},"47cdb65a":{"m":134,"g":133},"1d90b194":{"m":134,"g":133},"537ef18d":{"m":134,"g":133},"69412ccb":{"m":134,"g":133},"d3885d4b":{"m":134,"g":133},"0a346d3b":{"m":134,"g":133},"bc18cb86":{"m":134,"g":133},"bee8ac5b":{"m":134,"g":133},"41bd76e1":{"m":134,"g":133},"c6ca1b3a":{"m":134,"g":133},"8999ce75":{"m":134,"g":133},"dce2ed44":{"m":134,"g":133},"019517a3":{"m":134,"g":133},"1f1f05a8":{"m":134,"g":133},"165f5c04":{"m":134,"g":133},"3e01f3a5":{"m":134,"g":133},"51e2eaa4":{"m":134,"g":133},"6468cb58":{"m":134,"g":133},"e220da17":{"m":134,"g":133},"b82c7a0a":{"m":134,"g":133},"c0f9b519":{"m":134,"g":133},"5529ab58":{"m":134,"g":133},"74a3349b":{"m":134,"g":133},"b9ebf0ed":{"m":134,"g":133},"3c116d5e":{"m":134,"g":133},"71a60288":{"m":134,"g":133},"61405b3d":{"m":134,"g":133},"0adfc42b":{"m":134,"g":133},"ba72e759":{"m":134,"g":133},"6afc5d49":{"m":134,"g":133},"2ee6c810":{"m":134,"g":133},"bd16244d":{"m":134,"g":133},"b5eb0214":{"m":134,"g":133},"d72e908b":{"m":134,"g":133},"50cad014":{"m":134,"g":133},"ef908aeb":{"m":134,"g":133},"9d0347b3":{"m":134,"g":133},"05eb0bcc":{"m":134,"g":133},"5dccd9bd":{"m":134,"g":133},"933cef16":{"m":134,"g":133},"241ae17b":{"m":134,"g":133},"2c5a4460":{"m":134,"g":133},"f3705b01":{"m":134,"g":133},"5a0ad731":{"m":134,"g":133},"5045aa34":{"m":134,"g":133},"ff1e2ce2":{"m":134,"g":133},"1e58248
8":{"m":134,"g":133},"46be74b4":{"m":134,"g":133},"1c658026":{"m":134,"g":133},"ba410808":{"m":134,"g":133},"89512029":{"m":134,"g":133},"92e6b3c3":{"m":134,"g":133},"6559e43f":{"m":134,"g":133},"af780c59":{"m":134,"g":133},"a21aa87e":{"m":134,"g":133},"fb178457":{"m":134,"g":133},"65c09859":{"m":134,"g":133},"4bf06635":{"m":134,"g":133},"f2d64e67":{"m":134,"g":133},"0e869f08":{"m":134,"g":133},"17394092":{"m":134,"g":133},"f228b662":{"m":134,"g":133},"a36142aa":{"m":134,"g":133},"160a06ca":{"m":134,"g":133},"a0985dd5":{"m":134,"g":133},"f6c9db4b":{"m":134,"g":133},"e88e75a9":{"m":134,"g":133},"4b4050e2":{"m":134,"g":133},"b2803ff2":{"m":134,"g":133},"9e0ef04e":{"m":134,"g":133},"216067c0":{"m":134,"g":133},"e72b02db":{"m":134,"g":133},"29e8f7f9":{"m":134,"g":133},"e0963a6c":{"m":134,"g":133},"e0026f7c":{"m":134,"g":133},"9749d3e3":{"m":134,"g":133},"2b0ddf89":{"m":134,"g":133},"17e81c75":{"m":134,"g":133},"88a405cc":{"m":134,"g":133},"602fe3b2":{"m":134,"g":133},"ad9616f1":{"m":134,"g":133},"c5f4e20f":{"m":134,"g":133},"2c196f95":{"m":134,"g":133},"9a7641d7":{"m":134,"g":133},"793c96c3":{"m":134,"g":133},"d1f00632":{"m":134,"g":133},"4792d1f4":{"m":134,"g":133},"fea2d521":{"m":134,"g":133},"56d12b4a":{"m":134,"g":133},"374ad4cc":{"m":134,"g":133},"8b0a68f1":{"m":134,"g":133},"70607e55":{"m":134,"g":133},"ee1ca51d":{"m":134,"g":133},"ef7c29ac":{"m":134,"g":133},"58c840db":{"m":134,"g":133},"3d42b7e7":{"m":134,"g":133},"9970ee34":{"m":134,"g":133},"9e7656be":{"m":134,"g":133},"9d4f066f":{"m":134,"g":133},"41683536":{"m":134,"g":133},"891ee822":{"m":134,"g":133},"8fa3dc36":{"m":134,"g":133},"d20699a3":{"m":134,"g":133},"169a75df":{"m":134,"g":133},"4128d4f5":{"m":134,"g":133},"011d8d89":{"m":134,"g":133},"5290cef9":{"m":134,"g":133},"726fe3e7":{"m":134,"g":133},"5d087891":{"m":134,"g":133},"8451e227":{"m":134,"g":133},"53e15194":{"m":134,"g":133},"d747147a":{"m":134,"g":133},"0c002207":{"m":134,"g":133},"eeb2b9b2":{"m":134,"g":133},"c4aed389":{"m":134,"g":133},"9d04b5
70":{"m":134,"g":133},"3e690cce":{"m":134,"g":133},"b12b40de":{"m":134,"g":133},"6c4bf8a0":{"m":134,"g":133},"533851fb":{"m":134,"g":133},"0071fe9c":{"m":134,"g":133},"ffa7e035":{"m":134,"g":133},"712f44ee":{"m":134,"g":133},"8c34e181":{"m":134,"g":133},"e9abb525":{"m":134,"g":133},"88859433":{"m":134,"g":133},"feb8e30b":{"m":134,"g":133},"45a959d3":{"m":134,"g":133},"cdce5163":{"m":134,"g":133},"79ab57bd":{"m":134,"g":133},"4b8901ac":{"m":134,"g":133},"435d1c83":{"m":134,"g":133},"2bdbaef1":{"m":134,"g":133},"31d48d7f":{"m":134,"g":133},"0129c911":{"m":134,"g":133},"03f9eb25":{"m":134,"g":133},"7ec678eb":{"m":134,"g":133},"9d64a7b2":{"m":134,"g":133},"46ad4b98":{"m":134,"g":133},"71cb9037":{"m":134,"g":133},"c8c64876":{"m":134,"g":133},"0861dca8":{"m":134,"g":133},"da58df6b":{"m":134,"g":133},"49237e26":{"m":134,"g":133},"d92c1f8c":{"m":134,"g":133},"93070586":{"m":134,"g":133},"ccc8f3b2":{"m":134,"g":133},"a4c76281":{"m":134,"g":133},"28a19e49":{"m":134,"g":133},"0261c4af":{"m":134,"g":133},"8ac350f3":{"m":134,"g":133},"99401e7b":{"m":134,"g":133},"66824751":{"m":134,"g":133},"f95729b0":{"m":134,"g":133},"9f4ed93d":{"m":134,"g":133},"ecb401ed":{"m":134,"g":133},"b399e3ac":{"m":134,"g":133},"e27635a0":{"m":134,"g":133},"272c5fe4":{"m":134,"g":133},"3c8dc448":{"m":134,"g":133},"5e96beb3":{"m":134,"g":133},"9327482b":{"m":134,"g":133},"36fcf71f":{"m":134,"g":133},"6292d971":{"m":134,"g":133},"c8434195":{"m":134,"g":133},"3e4d431a":{"m":134,"g":133},"538e733e":{"m":134,"g":133},"22587bc0":{"m":134,"g":133},"02d24244":{"m":134,"g":133},"1da5cd63":{"m":134,"g":133},"a9a2cdd8":{"m":134,"g":133},"4733fcff":{"m":134,"g":133},"3ffa2604":{"m":134,"g":133},"61f362c6":{"m":134,"g":133},"e7157c9b":{"m":134,"g":133},"30da2f05":{"m":134,"g":133},"49016931":{"m":134,"g":133},"3d484be5":{"m":134,"g":133},"c0d94440":{"m":134,"g":133},"1dedb638":{"m":134,"g":133},"7bc8b153":{"m":134,"g":133},"9003a436":{"m":134,"g":133},"3518b331":{"m":134,"g":133},"b098b1ae":{"m":134,"g":133},"abd3e
048":{"m":134,"g":133},"92c29d43":{"m":134,"g":133},"bf643814":{"m":134,"g":133},"f03bfa4c":{"m":134,"g":133},"89ad3908":{"m":134,"g":133},"2ea844ec":{"m":134,"g":133},"1e2d7538":{"m":134,"g":133},"d16ff357":{"m":134,"g":133},"af49e302":{"m":134,"g":133},"01b955ac":{"m":134,"g":133},"16e6bc20":{"m":134,"g":133},"37250764":{"m":134,"g":133},"7b9156c7":{"m":134,"g":133},"21cfebac":{"m":134,"g":133},"702426b0":{"m":134,"g":133},"9e9a6169":{"m":134,"g":133},"bd9c3a47":{"m":134,"g":133},"1e641ee4":{"m":134,"g":133},"fb96669f":{"m":134,"g":133},"3912ee49":{"m":134,"g":133},"1ab9b8e0":{"m":134,"g":133},"e61dabf5":{"m":134,"g":133},"36e7c8c5":{"m":134,"g":133},"037c3982":{"m":134,"g":133},"62b3fdae":{"m":134,"g":133},"1cd0c3bf":{"m":134,"g":133},"2c899431":{"m":134,"g":133},"4513f549":{"m":134,"g":133},"c9690307":{"m":134,"g":133},"4449c170":{"m":134,"g":133},"5ca962ce":{"m":134,"g":133},"bab20a84":{"m":134,"g":133},"0e4108ba":{"m":134,"g":133},"99cb2ed9":{"m":134,"g":133},"0612175c":{"m":134,"g":133},"8c96fcda":{"m":134,"g":133},"ea7c69ce":{"m":134,"g":133},"8102e36b":{"m":134,"g":133},"3f048217":{"m":134,"g":133},"b11af135":{"m":134,"g":133},"f9bceea0":{"m":134,"g":133},"997ea57e":{"m":134,"g":133},"47633c19":{"m":134,"g":133},"64b5c3ab":{"m":134,"g":133},"d277a86d":{"m":134,"g":133},"9acb21ae":{"m":134,"g":133},"a9ce1623":{"m":134,"g":133},"6f0c77d7":{"m":134,"g":133},"e3f51e82":{"m":134,"g":133},"fdfabb7a":{"m":134,"g":133},"19c16748":{"m":134,"g":133},"5c75907e":{"m":134,"g":133},"4ea36422":{"m":134,"g":133},"f50af32d":{"m":134,"g":133},"72952919":{"m":134,"g":133},"2ae5bed1":{"m":134,"g":133},"54df514b":{"m":134,"g":133},"681c68cf":{"m":134,"g":133},"74ea45cc":{"m":134,"g":133},"0fa044ad":{"m":134,"g":133},"7d8e42c9":{"m":134,"g":133},"6abdf73f":{"m":134,"g":133},"20ce9938":{"m":134,"g":133},"168a31eb":{"m":134,"g":133},"69cfb17b":{"m":134,"g":133},"a7a4b175":{"m":134,"g":133},"fdc93b01":{"m":134,"g":133},"fd37cc5d":{"m":134,"g":133},"96705514":{"m":134,"g":133},"ab3f
fd1c":{"m":134,"g":133},"2285afff":{"m":134,"g":133},"a81cc1b8":{"m":134,"g":133},"3134d2b2":{"m":134,"g":133},"06b58c5d":{"m":134,"g":133},"ea07a283":{"m":134,"g":133},"ea91a720":{"m":134,"g":133},"5d9c6bac":{"m":134,"g":133},"e048ee90":{"m":134,"g":133},"ed52d01b":{"m":134,"g":133},"90e7d4f7":{"m":134,"g":133},"c20d43d2":{"m":134,"g":133},"d977dd2e":{"m":134,"g":133},"0c23331e":{"m":134,"g":133},"993278b4":{"m":134,"g":133},"9e9d9107":{"m":134,"g":133},"3b8a824b":{"m":134,"g":133},"f1bbd26f":{"m":134,"g":133},"d36299ad":{"m":134,"g":133},"0e7d7969":{"m":134,"g":133},"80554598":{"m":134,"g":133},"b2e240bc":{"m":134,"g":133},"dcc5f5c0":{"m":134,"g":133},"4eda4194":{"m":134,"g":133},"875f84db":{"m":134,"g":133},"f6031adf":{"m":134,"g":133},"2a39cfe0":{"m":134,"g":133},"bf17e769":{"m":134,"g":133},"9a5d6a84":{"m":134,"g":133},"31c23e5f":{"m":134,"g":133},"06617a9e":{"m":134,"g":133},"e79ca959":{"m":134,"g":133},"9d3b411c":{"m":134,"g":133},"05325db3":{"m":134,"g":133},"01e3b3f3":{"m":134,"g":133},"86988674":{"m":134,"g":133},"29139654":{"m":134,"g":133},"665cb020":{"m":134,"g":133},"71602838":{"m":134,"g":133},"77873343":{"m":134,"g":133},"267170bf":{"m":134,"g":133},"313f59ad":{"m":134,"g":133},"487cf81a":{"m":134,"g":133},"df111bc0":{"m":134,"g":133},"6d2b3324":{"m":134,"g":133},"8cc77261":{"m":134,"g":133},"d143b020":{"m":134,"g":133},"9b9d2131":{"m":134,"g":133},"44fd7017":{"m":134,"g":133},"9a56273a":{"m":134,"g":133},"b737a125":{"m":134,"g":133},"1b5e9034":{"m":134,"g":133},"171b442a":{"m":134,"g":133},"4b7b5af3":{"m":134,"g":133},"ec242f51":{"m":134,"g":133},"b2431546":{"m":134,"g":133},"526fd008":{"m":134,"g":133},"56d0ad47":{"m":134,"g":133},"306e5b8d":{"m":134,"g":133},"10c68f62":{"m":134,"g":133},"c7c837cd":{"m":134,"g":133},"3e1e7157":{"m":134,"g":133},"8fa8d9d7":{"m":134,"g":133},"c8cf1caf":{"m":134,"g":133},"d71baa72":{"m":134,"g":133},"82e33170":{"m":134,"g":133},"94e12511":{"m":134,"g":133},"4dabfbc8":{"m":134,"g":133},"d7ed8a8c":{"m":134,"g":133},"c05
d3afb":{"m":134,"g":133},"edb172e9":{"m":134,"g":133},"fe6d38d2":{"m":134,"g":133},"dab31e4c":{"m":134,"g":133},"a7fa31ff":{"m":134,"g":133},"bd91f882":{"m":134,"g":133},"b47adb80":{"m":134,"g":133},"b2b5bdb0":{"m":134,"g":133},"22fe5da1":{"m":134,"g":133},"1834401e":{"m":134,"g":133},"198c8ecf":{"m":134,"g":133},"c01b2ee0":{"m":134,"g":133},"8f5adac8":{"m":134,"g":133},"76743a98":{"m":134,"g":133},"8bf10e71":{"m":134,"g":133},"e4873d04":{"m":134,"g":133},"c660d8df":{"m":134,"g":133},"10146af0":{"m":134,"g":133},"b62e7e3b":{"m":134,"g":133},"6ce36b12":{"m":134,"g":133},"e9e7f15e":{"m":134,"g":133},"0aa3dec5":{"m":134,"g":133},"4885f8b9":{"m":134,"g":133},"f832994c":{"m":134,"g":133},"e59435c3":{"m":134,"g":133},"d6bd2d11":{"m":134,"g":133},"9975acf5":{"m":134,"g":133},"70758d45":{"m":134,"g":133},"fd1ebbb0":{"m":134,"g":133},"3d98bd5e":{"m":134,"g":133},"aa3716b2":{"m":134,"g":133},"6107268f":{"m":134,"g":133},"0189f41c":"m136","b6e4893a":"m136","3eb7da53":"m136","cb53ddc9":"m136","0c265321":"m136","6469c964":"m136","71705394":"m136","67589c16":"m136","a702c8f1":"m136","f2ae066a":"m136","0c2993ee":"m136","590969ee":"m136","e6ccb294":"m136","d6ea2c52":"m136","9be2a3a9":"m136","b74a57a8":"m136","95f59c13":"m136","85d9af51":"m136","858f317f":"m136","cf893516":"m136","1fdf5cac":"m136","cda43ffa":"m136","39089854":"m136","b827e9d3":"m136","d725487d":"m136","a95c9f5b":"m136","19089aa4":"m136","4f6f5d25":"m136","458fe5a3":"m136","2ff0880a":"m136","2c1b164a":"m136","bcc6d84f":"m136","a618202f":"m136","7520b929":"m136","e7224e96":"m136","e776239a":"m136","1b97fa76":"m136","236772c0":"m136","0d49b13f":"m136","0a7a2017":"m136","0a9099e1":"m136","0050c476":"m136","8251a74d":"m136","a54d75bf":"m136","3321eb4e":"m136","be5121b4":"m136","c3f9c30f":"m136","54a82179":"m136","aca354bc":"m136","aea57b33":"m136","823a046e":"m136","648aab0c":"m136","1e309030":"m136","60927215":"m136","38c233fd":"m136","20ed3822":"m136","4ecd9afd":"m136","eb38d644":"m136","6ea491e4":"m136","16802fb6":"m1
36","d97066d2":"m136","ce2d686e":"m136","76b06bee":"m136","612026ad":"m136","55c61642":"m136","91a4cd86":"m136","23d765d1":"m136","db2425a0":"m136","603f386c":"m136","f7a5e425":"m136","d50dcd9b":"m136","e6b7c049":"m136","a3addd62":"m136","6988a0f5":"m136","1b192cf1":"m136","c560e142":"m136","71cb9d03":"m136","8fb45523":"m136","84aef378":"m136","17c04b10":"m136","55c4288b":"m136","9f8b79f1":"m136","7dc3cbe7":"m136","79ddc34c":"m136","09a9d214":"m136","7e40d526":"m136","057b07fc":"m136","91d8c52d":"m136","e9a44ea6":"m136","c1282da2":"m136","71279e31":"m136","1a053a81":"m136","2ea02f06":"m136","20b0523e":"m136","ce8a6ac6":"m136","ebca5879":"m136","cc410a10":"m136","f374623f":"m136","5c022177":"m136","8916b9d0":"m136","a3d9a218":"m136","fb88fb67":"m136","2d72e168":"m136","64946679":"m136","5836324c":"m136","9fe56cd0":"m136","858a4d65":"m136","e619f531":"m136","fc4b932f":"m136","d2105d4a":"m136","84c83905":"m136","ea879c77":"m136","0227db89":"m136","ad1b4e47":"m136","51f147ad":"m136","d3eafc73":"m136","e00b4344":"m136","733de6be":"m136","93433726":"m136","330605cc":"m136","bb6055b4":"m136","4df74eb5":"m136","f3a7c7dc":"m136","2069050d":"m136","1fe0c82f":"m136","a45e0e5d":"m136","f78201f3":"m136","8fd33998":"m136","088758c1":"m136","6d29d8ab":"m136","e499258e":"m136","09491a9b":"m136","7edb0615":"m136","e486a4da":"m136","90399cbc":"m136","53609e5e":"m136","9c253064":"m136","d2c86387":"m136","737a1183":"m136","dc743fe4":"m136","dd99f818":"m136","8ce64aa1":"m136","eb768189":"m136","2cdd4370":"m136","a7b5f75d":"m136","305c1a57":"m136","c824ddd5":"m136","e18e0057":"m136","166396ca":"m136","43779f27":"m136","8b9e9357":"m136","d36f6f04":"m136","b0701f02":"m136","4229de3b":"m136","2e144079":"m136","d2ec128b":"m136","7f8353af":"m136","a7f5677a":"m136","3e968ab3":"m136","b4fce995":"m136","ec9b48ea":"m136","a0467589":"m136","82a1b645":"m136","6f10e17b":"m136","c771933d":"m136","9d8bbd42":"m136","daea5138":"m136","3355b6e2":"m136","a1dd3d48":"m136","d9ed80b9":"m136","7c39ea68":"m136
","daa4841e":"m136","968c4f55":"m136","669d309a":"m136","21ee597e":"m136","6ee970a3":"m136","8ec160ed":"m136","2740ed1a":"m136","d44f09ad":"m136","e7dc85c5":"m136","c81bad1b":"m136","0e86de7c":"m136","e3a95077":"m136","146b5fcc":"m136","8b22deef":"m136","69822c72":"m136","8b99af9a":"m136","72e2f70e":"m136","3d72944f":"m136","7dde3438":"m136","77fc4c4a":"m136","655d2c7c":"m136","3f44268f":"m136","f7ec8174":"m136","cd23c2f0":"m136","d1110e1c":"m136","9227d9f6":"m136","4c59782e":"m136","dda35ccb":"m136","d11e2dc6":"m136","16831ab6":"m136","c9a45b7e":"m136","e7df8bdc":"m136","e9979950":"m136","6586f44a":"m136","43fe3a4d":"m136","98096b5e":"m136","68e8d0f6":"m136","000ad422":"m136","9d5f16d4":"m136","7f8a58ff":"m136","6b065298":"m136","c020d300":"m136","4346db5f":"m136","424a3800":"m136","f0918583":"m136","5b1215d9":"m136","aa2b4f76":"m136","b3a3f513":"m136","de94d793":"m136","0d904ef4":"m136","969faaa4":"m136","48c2aca9":"m136","5af84c8a":"m136","feae615b":"m136","e75299a1":"m136","72bacc88":"m136","ba625c2d":"m136","b025cff4":"m136","030496eb":"m136","c86ca128":"m136","cd336945":"m136","c5e363e8":"m136","9479eca7":"m136","a5348eac":"m136","95240402":"m136","2122fea3":"m136","e2c8a50b":"m136","a4825ed5":"m136","afe285f7":"m136","cf25852a":"m136","5938c3b0":"m136","b8806071":"m136","339915ce":"m136","075c5a57":"m136","a0b4ba90":"m136","2a7b67ad":"m136","1d811094":"m136","2ab3ed3e":"m136","250477d2":"m136","af1232b2":"m136","7a869045":"m136","888d7e54":"m136","3cb1fbae":"m136","ba9f6d8f":"m136","740d3c0b":"m136","87165898":"m136","ff3ddb9d":"m136","9d3018f4":"m136","a8348427":"m136","47d485f3":"m136","2b423099":"m136","ae0baefb":"m136","1f0e3d7f":"m136","d3c08fb0":"m136","c6a64e9f":"m136","e0ac559a":"m136","6620548f":"m136","6e158e55":"m136","ed729d22":"m136","fa51b854":"m136","559ff9ec":"m136","7b682de8":"m136","d0092dec":"m136","76f69b77":"m136","9a628744":"m136","53dca74f":"m136","2dadf635":"m136","aab640c9":"m136","2b3791ed":"m136","9f5cd80a":"m136","b1ee75ae":"m136",
"f44c63ee":"m136","c54c70ab":"m136","aab906a3":"m136","feb39f77":"m136","38b30c7b":"m136","c581b5ed":"m136","a1c48943":"m136","5b7bed7c":"m136","38a88479":"m136","503c3d95":"m136","934ae89a":"m136","cf1426a7":"m136","17cb3c8e":"m136","2f4a6add":"m136","f9fc50ac":"m136","3c16c586":"m136","7b089ae4":"m136","8b5d4263":"m136","b5493f65":"m136","09e2571e":"m136","c0248d6f":"m136","cc25f9df":"m136","cf14feba":"m136","7c25687c":"m136","ff978142":"m136","d112f6a2":"m136","a2c2c09d":"m136","2a9344d3":"m136","78c41758":"m136","206db66f":"m136","5c72be1e":"m136","76d48817":"m136","d1ec93e3":"m136","bdb76b34":"m136","dae6a409":"m136","a0899bdb":"m136","641830c1":"m136","3fd88ea9":"m136","145bd54f":"m136","2d088b85":"m136","3a8b44fe":"m136","aeb480c1":"m136","9fd2358c":"m136","3c358736":"m136","6327dff2":"m136","675acece":"m136","ad201273":"m136","7f393d95":"m136","4b14f622":"m136","94fc26aa":"m136","67b61a4e":"m136","d27f16f3":"m136","c89949bb":"m136","20abaee2":"m136","32a569fb":"m136","3ed3b7ef":"m136","1f9d4795":"m136","e6d40bff":"m136","fbc128a3":"m136","9c64a15a":"m136","e91a7176":"m136","1f0ea4f9":"m136","15da3061":"m136","6406a596":"m136","cec19b56":"m136","ef35d8fe":"m136","08636f72":"m136","a6c29d4c":"m136","84ab32a2":"m136","70667115":"m136","bd1afeb5":"m136","8ef5b905":"m136","76b3c698":"m136","5dcff947":"m136","8eeffbe9":"m136","71e9c31c":"m136","7656d267":"m136","dd24ba90":"m136","068abe7e":"m136","64a31d4b":"m136","2babf88f":"m136","d56d14e5":"m136","0c4e155a":"m136","d6d5c3fd":"m136","e46f7943":"m136","9d4d57db":"m136","41609b52":"m136","ccd0fb32":"m136","87ee6b5e":"m136","f7c1d24b":"m136","cceb5e6a":"m136","c1c13c84":"m136","fcec35dc":"m136","75da784d":"m136","77d35665":"m136","05dfef92":"m136","74602407":{"m":136,"g":135},"f7f5c389":{"m":136,"g":135},"9e3a032a":{"m":136,"g":135},"1bc7aa58":{"m":136,"g":135},"8726d30c":{"m":136,"g":135},"a1b243d7":{"m":136,"g":135},"a9799277":{"m":136,"g":135},"ee71e773":{"m":136,"g":135},"55a8dd00":{"m":136,"g":135},"cf242321":
{"m":136,"g":135},"9d03af91":{"m":136,"g":135},"064ae341":{"m":136,"g":135},"49305fa1":{"m":136,"g":135},"4e999404":{"m":136,"g":135},"fbc24886":{"m":136,"g":135},"16880235":{"m":136,"g":135},"cda35611":{"m":136,"g":135},"8a45a9c6":{"m":136,"g":135},"aecd5f5f":{"m":136,"g":135},"05ab110e":{"m":136,"g":135},"c8dc4d2d":{"m":136,"g":135},"82a8d77b":{"m":136,"g":135},"294ff71d":{"m":136,"g":135},"f52ae586":{"m":136,"g":135},"2f8a3634":{"m":136,"g":135},"b6e8a0d8":{"m":136,"g":135},"fb7609f1":{"m":136,"g":135},"d2ea44f7":{"m":136,"g":135},"20ca2c6e":{"m":136,"g":135},"83abecd0":{"m":136,"g":135},"d54f0a10":{"m":136,"g":135},"fb04e7e3":{"m":136,"g":135},"1e5de05e":{"m":136,"g":135},"7dd679cb":{"m":136,"g":135},"3d51ae18":{"m":136,"g":135},"f9c04266":{"m":136,"g":135},"48b8dcd4":{"m":136,"g":135},"41b434a7":{"m":136,"g":135},"4935344f":{"m":136,"g":135},"63cc97f4":{"m":136,"g":135},"ab7d5829":{"m":136,"g":135},"261860e1":{"m":136,"g":135},"154740bd":{"m":136,"g":135},"1c09cbe3":{"m":136,"g":135},"e14f5ec8":{"m":136,"g":135},"8867d248":{"m":136,"g":135},"6b3f93c4":{"m":136,"g":135},"5a5cece5":{"m":136,"g":135},"d566739b":{"m":136,"g":135},"4c46ecde":{"m":136,"g":135},"12a0292b":{"m":136,"g":135},"a08dc5aa":{"m":136,"g":135},"eec7dbd3":{"m":136,"g":135},"109fe03a":{"m":136,"g":135},"bb798a1c":{"m":136,"g":135},"38dc5839":{"m":136,"g":135},"5e867f60":{"m":136,"g":135},"65bed838":{"m":136,"g":135},"156d97b2":{"m":136,"g":135},"24b30f77":{"m":136,"g":135},"3a4767da":{"m":136,"g":135},"6037267f":{"m":136,"g":135},"0241e046":{"m":136,"g":135},"0c474273":{"m":136,"g":135},"3e73e124":{"m":136,"g":135},"f4742558":{"m":136,"g":135},"7385834c":{"m":136,"g":135},"b5a94f8a":{"m":136,"g":135},"c356ed03":{"m":136,"g":135},"ee4d2287":{"m":136,"g":135},"153c69f6":{"m":136,"g":135},"55b79365":{"m":136,"g":135},"7fc12e0b":{"m":136,"g":135},"8729ad5e":{"m":136,"g":135},"e4320573":{"m":136,"g":135},"4d902c82":{"m":136,"g":135},"2ff87231":{"m":136,"g":135},"fd16c91c":{"m":136,"g":135},"32a6540a"
:{"m":136,"g":135},"62d0280f":{"m":136,"g":135},"d4b717c0":{"m":136,"g":135},"98a107d4":{"m":136,"g":135},"b86bbf84":{"m":136,"g":135},"48381c3b":{"m":136,"g":135},"8bce0853":{"m":136,"g":135},"6b8a9d70":{"m":136,"g":135},"973116e6":{"m":136,"g":135},"820e97d6":{"m":136,"g":135},"4c85f9d0":{"m":136,"g":135},"7d757d6f":{"m":136,"g":135},"5c04088b":{"m":136,"g":135},"f066036c":{"m":136,"g":135},"52de807d":{"m":136,"g":135},"70933f34":{"m":136,"g":135},"ce453fa4":{"m":136,"g":135},"3be1e734":{"m":136,"g":135},"38895a00":{"m":136,"g":135},"d8b81981":{"m":136,"g":135},"913b688f":{"m":136,"g":135},"951d16c8":{"m":136,"g":135},"53846746":{"m":136,"g":135},"534ac384":{"m":136,"g":135},"2a8d5493":{"m":136,"g":135},"90eac38a":{"m":136,"g":135},"badcd028":{"m":136,"g":135},"d874c8bb":{"m":136,"g":135},"9a21d89c":{"m":136,"g":135},"4c9ac856":{"m":136,"g":135},"fb5b71d0":{"m":136,"g":135},"05b54b6d":{"m":136,"g":135},"4f443f44":{"m":136,"g":135},"dce8b060":{"m":136,"g":135},"399d5283":{"m":136,"g":135},"2e0527dd":{"m":136,"g":135},"6beb50d6":{"m":136,"g":135},"18e2ef09":{"m":136,"g":135},"3271e0e7":{"m":136,"g":135},"d415d22d":{"m":136,"g":135},"a49b9a64":{"m":136,"g":135},"53497642":{"m":136,"g":135},"d57d8e7e":{"m":136,"g":135},"0cbd8f32":{"m":136,"g":135},"6e3fff13":{"m":136,"g":135},"2210155a":{"m":136,"g":135},"a3656cbb":{"m":136,"g":135},"21da2dc1":{"m":136,"g":135},"ed307a40":{"m":136,"g":135},"02722b91":{"m":136,"g":135},"f959250f":{"m":136,"g":135},"95934379":{"m":136,"g":135},"bc2f40be":{"m":136,"g":135},"2724b110":{"m":136,"g":135},"176266f3":{"m":136,"g":135},"5e5b1183":{"m":136,"g":135},"9bf76c11":{"m":136,"g":135},"1d7ad4af":{"m":136,"g":135},"fba785c4":{"m":136,"g":135},"5cfa901b":{"m":136,"g":135},"f27c6cdc":{"m":136,"g":135},"5097e1e8":{"m":136,"g":135},"861a35fb":{"m":136,"g":135},"6ffe1fc0":{"m":136,"g":135},"73398e22":{"m":136,"g":135},"17958c5f":{"m":136,"g":135},"3aa11ca7":{"m":136,"g":135},"7be1a8c7":{"m":136,"g":135},"84d13c54":{"m":136,"g":135},"c80c0e0f
":{"m":136,"g":135},"c105a312":{"m":136,"g":135},"4cf2bbd0":{"m":136,"g":135},"d7b706be":{"m":136,"g":135},"4a9537a4":{"m":136,"g":135},"ca922d4b":{"m":136,"g":135},"402a0bd6":{"m":136,"g":135},"76c71d1d":{"m":136,"g":135},"2d02c150":{"m":136,"g":135},"b98bd9a5":{"m":136,"g":135},"ce694b2b":{"m":136,"g":135},"6c0fb189":{"m":136,"g":135},"9a9f996f":{"m":136,"g":135},"4221b7c5":{"m":136,"g":135},"c371df2f":{"m":136,"g":135},"1751c75b":{"m":136,"g":135},"23849eba":{"m":136,"g":135},"51541404":{"m":136,"g":135},"45ef8344":{"m":136,"g":135},"5a2b1ed4":{"m":136,"g":135},"454dc9e2":{"m":136,"g":135},"1e41069a":{"m":136,"g":135},"2b4d6d81":{"m":136,"g":135},"a3914e3b":{"m":136,"g":135},"130f60ee":{"m":136,"g":135},"4397cda7":{"m":136,"g":135},"d56fd10c":{"m":136,"g":135},"abb06be9":{"m":136,"g":135},"c35eb0fd":{"m":136,"g":135},"7f35c46e":{"m":136,"g":135},"da2f8cc3":{"m":136,"g":135},"4308c25b":{"m":136,"g":135},"4d737db8":{"m":136,"g":135},"9d6029fb":{"m":136,"g":135},"dcfb92dd":{"m":136,"g":135},"bebd625b":{"m":136,"g":135},"b12258bf":{"m":136,"g":135},"012dc586":{"m":136,"g":135},"7f6a678f":{"m":136,"g":135},"399ca037":{"m":136,"g":135},"d93f37a6":{"m":136,"g":135},"e6fe092d":{"m":136,"g":135},"f02d8221":{"m":136,"g":135},"10174e11":{"m":136,"g":135},"2138ff48":{"m":136,"g":135},"e53160bb":{"m":136,"g":135},"4ea6a11c":{"m":136,"g":135},"2181bc9e":{"m":136,"g":135},"520c048d":{"m":136,"g":135},"561a3e04":{"m":136,"g":135},"f84487af":{"m":136,"g":135},"07827047":{"m":136,"g":135},"c63e9cb2":{"m":136,"g":135},"12cde0df":{"m":136,"g":135},"a7fd8108":{"m":136,"g":135},"ca80c19b":{"m":136,"g":135},"1e7b3264":{"m":136,"g":135},"e267ca0b":{"m":136,"g":135},"9a8ba3c1":{"m":136,"g":135},"0fee6bc6":{"m":136,"g":135},"0ff3747c":{"m":136,"g":135},"f8411ded":{"m":136,"g":135},"249c3563":{"m":136,"g":135},"12df1660":{"m":136,"g":135},"55d112dc":{"m":136,"g":135},"a1ed247f":{"m":136,"g":135},"87699d48":{"m":136,"g":135},"52c60434":{"m":136,"g":135},"ff0f370f":{"m":136,"g":135},"26f9e20
7":{"m":136,"g":135},"828cd893":{"m":136,"g":135},"4436dc0f":{"m":136,"g":135},"cf6800f6":{"m":136,"g":135},"f16606d6":{"m":136,"g":135},"387fad2f":{"m":136,"g":135},"76bc07a3":{"m":136,"g":135},"b328cd20":{"m":136,"g":135},"bf32cd83":{"m":136,"g":135},"27a08305":{"m":136,"g":135},"53479e22":{"m":136,"g":135},"ff978e7d":{"m":136,"g":135},"16e00651":{"m":136,"g":135},"1b2b95d8":{"m":136,"g":135},"9cac3c86":{"m":136,"g":135},"8d58b3dc":{"m":136,"g":135},"5f3eb377":{"m":136,"g":135},"22993880":{"m":136,"g":135},"d7aa0ce7":{"m":136,"g":135},"e797f0c5":{"m":136,"g":135},"5d4b7c78":{"m":136,"g":135},"9bd64d73":{"m":136,"g":135},"24e116ef":{"m":136,"g":135},"8ca95970":{"m":136,"g":135},"216ea910":{"m":136,"g":135},"f4ab2ec5":{"m":136,"g":135},"25fa2ac2":{"m":136,"g":135},"ef5ac6f0":{"m":136,"g":135},"c88aaf22":{"m":136,"g":135},"e139d2aa":{"m":136,"g":135},"2337b1bb":{"m":136,"g":135},"7bc13c90":{"m":136,"g":135},"877c8e3a":{"m":136,"g":135},"66dfb8c1":{"m":136,"g":135},"dcacc492":{"m":136,"g":135},"87ef05e2":{"m":136,"g":135},"b65c9889":{"m":136,"g":135},"c7c0d97f":{"m":136,"g":135},"6bc5a52f":{"m":136,"g":135},"2c09de34":{"m":136,"g":135},"38d48de9":{"m":136,"g":135},"7f2fa216":{"m":136,"g":135},"d0fb24ee":{"m":136,"g":135},"65b0b5b2":{"m":136,"g":135},"9a414b16":{"m":136,"g":135},"5b4f7902":{"m":136,"g":135},"d8ac5eec":{"m":136,"g":135},"bdde9496":{"m":136,"g":135},"b23e7ed1":{"m":136,"g":135},"9821fae5":{"m":136,"g":135},"078d9621":{"m":136,"g":135},"8b111b20":{"m":136,"g":135},"74a166cb":{"m":136,"g":135},"888e126a":{"m":136,"g":135},"bb23a8fe":{"m":136,"g":135},"8b869e32":{"m":136,"g":135},"6256936d":{"m":136,"g":135},"62f73a8c":{"m":136,"g":135},"7c1b4b1c":{"m":136,"g":135},"31ed68e7":{"m":136,"g":135},"24c91001":{"m":136,"g":135},"c4edcac6":{"m":136,"g":135},"e9343389":{"m":136,"g":135},"a2d4f58a":{"m":136,"g":135},"f66b0916":{"m":136,"g":135},"2f623368":{"m":136,"g":135},"bd9a2ced":{"m":136,"g":135},"0ca417d9":{"m":136,"g":135},"30cfb687":{"m":136,"g":135},"b7c7e0
3d":{"m":136,"g":135},"f0195627":{"m":136,"g":135},"d401d238":{"m":136,"g":135},"17041f46":{"m":136,"g":135},"f07e76b2":{"m":136,"g":135},"1cfd2b2d":{"m":136,"g":135},"0d244116":{"m":136,"g":135},"0eae8317":{"m":136,"g":135},"c483a5f4":{"m":136,"g":135},"f26f6c2c":{"m":136,"g":135},"b0213323":{"m":136,"g":135},"5062537b":{"m":136,"g":135},"698629d1":{"m":136,"g":135},"dd93e445":{"m":136,"g":135},"6c8587b5":{"m":136,"g":135},"02704260":{"m":136,"g":135},"bd48ad5e":{"m":136,"g":135},"749736ba":{"m":136,"g":135},"72549863":{"m":136,"g":135},"00562ee1":{"m":136,"g":135},"d7a8257b":{"m":136,"g":135},"f6f7af40":{"m":136,"g":135},"a3b1e8ef":{"m":136,"g":135},"db499e18":{"m":136,"g":135},"6cf3a6dd":{"m":136,"g":135},"90e24f5c":{"m":136,"g":135},"21de3e14":{"m":136,"g":135},"e4c1e441":{"m":136,"g":135},"b5af283b":{"m":136,"g":135},"130c6911":{"m":136,"g":135},"e0e50848":{"m":136,"g":135},"70a769bc":{"m":136,"g":135},"3a42c5e3":{"m":136,"g":135},"12b89e51":{"m":136,"g":135},"85184557":{"m":136,"g":135},"57d2ba92":{"m":136,"g":135},"417e75a6":{"m":136,"g":135},"0500fea9":{"m":136,"g":135},"2b461c15":{"m":136,"g":135},"5595ae14":{"m":136,"g":135},"ace6f300":{"m":136,"g":135},"2da49eec":{"m":136,"g":135},"d65ae0ec":{"m":136,"g":135},"60d7279c":{"m":136,"g":135},"abdf65d4":{"m":136,"g":135},"c1dfbc77":{"m":136,"g":135},"3b3c5a05":{"m":136,"g":135},"2667c857":{"m":136,"g":135},"fc643ffb":{"m":136,"g":135},"4280a18a":{"m":136,"g":135},"386e5415":{"m":136,"g":135},"b6267de5":{"m":136,"g":135},"e47afa02":{"m":136,"g":135},"5ed384d0":{"m":136,"g":135},"b4ce7a6d":{"m":136,"g":135},"25b48564":{"m":136,"g":135},"1c360bf7":{"m":136,"g":135},"3619ec61":{"m":136,"g":135},"db6b51a8":{"m":136,"g":135},"8a9ca41f":{"m":136,"g":135},"ac11e6a7":{"m":136,"g":135},"47a660d5":{"m":136,"g":135},"c0fc7a89":{"m":136,"g":135},"5bf0d862":{"m":136,"g":135},"75b72eb8":{"m":136,"g":135},"bc8b526e":{"m":136,"g":135},"3dfff6ae":{"m":136,"g":135},"ad2c1ee3":{"m":136,"g":135},"9940c6f5":{"m":136,"g":135},"00e60
711":{"m":136,"g":135},"4bc2f2e0":{"m":136,"g":135},"d17b9e63":{"m":136,"g":135},"ba67e006":{"m":136,"g":135},"45f3ad2f":{"m":136,"g":135},"f35b5da5":{"m":136,"g":135},"733a0c1a":{"m":136,"g":135},"b369aaa2":{"m":136,"g":135},"b9732025":{"m":136,"g":135},"cbff7ad9":{"m":136,"g":135},"4de59d83":{"m":136,"g":135},"39ca57cd":{"m":136,"g":135},"34498067":{"m":136,"g":135},"b3817fa9":{"m":136,"g":135},"059428bd":{"m":136,"g":135},"5d200dd8":{"m":136,"g":135},"7f9a3d06":{"m":136,"g":135},"664f611e":{"m":136,"g":135},"49adb37e":{"m":136,"g":135},"7518dc35":{"m":136,"g":135},"b6871ba7":{"m":136,"g":135},"c1f2241a":"m137","8da70e2a":"m137"},"g":"2026-02-12T19:54:15.483852"} diff --git a/docs/supported_models/diffusion_language_models.md b/docs/supported_models/diffusion_language_models.md deleted file mode 100644 index 73a24ead1eb8..000000000000 --- a/docs/supported_models/diffusion_language_models.md +++ /dev/null @@ -1,83 +0,0 @@ -# Diffusion Language Models - -Diffusion language models have shown promise for non-autoregressive text generation with parallel decoding capabilities. Unlike auto-regressive language models, different diffusion language models require different decoding strategies. - -## Example Launch Command - -```shell -python3 -m sglang.launch_server \ - --model-path inclusionAI/LLaDA2.0-mini \ # example HF/local path - --dllm-algorithm LowConfidence \ - --dllm-algorithm-config ./config.yaml \ # Optional. Uses the algorithm's default if not set. - --host 0.0.0.0 \ - --port 30000 -``` - -## Example Configuration File - -```yaml -# Confidence threshold for accepting predicted tokens -# - Higher values: More conservative, better quality but slower -# - Lower values: More aggressive, faster but potentially lower quality -# Range: 0.0 - 1.0 -threshold: 0.95 - -# Default: 32, for LLaDA2MoeModelLM -block_size: 32 -``` -## Example Client Code Snippet - -Just like other supported models, diffusion language models can be used via the REST API or Python client. 
-
-Python client example for making a generation request to the launched server:
-
-```python
-import sglang as sgl
-
-def main():
-    llm = sgl.Engine(model_path="inclusionAI/LLaDA2.0-mini",
-                     dllm_algorithm="LowConfidence",
-                     max_running_requests=1,
-                     trust_remote_code=True)
-
-    prompts = [
-        "SYSTEMdetailed thinking off<|role_end|>HUMAN Write a brief introduction of the great wall <|role_end|>ASSISTANT"
-    ]
-
-    sampling_params = {
-        "temperature": 0,
-        "max_new_tokens": 1024,
-    }
-
-    outputs = llm.generate(prompts, sampling_params)
-    print(outputs)
-
-if __name__ == '__main__':
-    main()
-```
-
-Curl example for making a generation request to the launched server:
-
-```bash
-curl -X POST "http://127.0.0.1:30000/generate" \
-     -H "Content-Type: application/json" \
-     -d '{
-        "text": [
-            "SYSTEMdetailed thinking off<|role_end|>HUMAN Write the number from 1 to 128 <|role_end|>ASSISTANT",
-            "SYSTEMdetailed thinking off<|role_end|>HUMAN Write a brief introduction of the great wall <|role_end|>ASSISTANT"
-        ],
-        "stream": true,
-        "sampling_params": {
-            "temperature": 0,
-            "max_new_tokens": 1024
-        }
-    }'
-```
-
-## Supported Models
-
-Below the supported models are summarized in a table.
-
-| Model Family | Example Model | Description |
-| ------------------------------------------ | -------------------------------------- | --------------------------------------------------------------------------- |
-| **LLaDA2.0 (mini, flash)** | `inclusionAI/LLaDA2.0-flash` | LLaDA2.0-flash is a diffusion language model featuring a 100B Mixture-of-Experts (MoE) architecture. |
diff --git a/docs/supported_models/diffusion_models.md b/docs/supported_models/diffusion_models.md
deleted file mode 100644
index 8ed55a944a00..000000000000
--- a/docs/supported_models/diffusion_models.md
+++ /dev/null
@@ -1,1278 +0,0 @@
-# Diffusion Models
-
-SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides an end-to-end unified pipeline with optimized kernels from sgl-kernel and an efficient scheduler loop.
-
-## Key Features
-
-- **Broad Model Support**: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
-- **Fast Inference**: Optimized kernels from sgl-kernel, efficient scheduler loop, and Cache-DiT acceleration
-- **Ease of Use**: OpenAI-compatible API, CLI, and Python SDK
-- **Multi-Platform**: NVIDIA GPUs (H100, H200, A100, B200, 4090) and AMD GPUs (MI300X, MI325X)
-
----
-
-# Install SGLang-diffusion
-
-You can install sglang-diffusion using one of the methods below.
-
-This page primarily applies to common NVIDIA GPU platforms. For AMD Instinct/ROCm environments see the dedicated [ROCm quickstart](#rocm-quickstart-for-sgl-diffusion), which lists the exact steps (including kernel builds) we used to validate sgl-diffusion on MI300X.
-
-## Method 1: With pip or uv
-
-It is recommended to use uv for a faster installation:
-
-```bash
-pip install --upgrade pip
-pip install uv
-uv pip install "sglang[diffusion]" --prerelease=allow
-```
-
-## Method 2: From source
-
-```bash
-# Use the latest release branch
-git clone https://github.com/sgl-project/sglang.git
-cd sglang
-
-# Install the Python packages
-pip install --upgrade pip
-pip install -e "python[diffusion]"
-
-# With uv
-uv pip install -e "python[diffusion]" --prerelease=allow
-```
-
-## Method 3: Using Docker
-
-The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang), built from the [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile).
-Replace `` below with your HuggingFace Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
- -```bash -docker run --gpus all \ - --shm-size 32g \ - -p 30000:30000 \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ - --env "HF_TOKEN=" \ - --ipc=host \ - lmsysorg/sglang:dev \ - sglang generate --model-path black-forest-labs/FLUX.1-dev \ - --prompt "A logo With Bold Large text: SGL Diffusion" \ - --save-output -``` - ---- - -# ROCm quickstart for sgl-diffusion - -```bash -docker run --device=/dev/kfd --device=/dev/dri --ipc=host \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ - --env HF_TOKEN= \ - lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \ - sglang generate --model-path black-forest-labs/FLUX.1-dev --prompt "A logo With Bold Large text: SGL Diffusion" --save-output -``` - ---- - -# Compatibility Matrix - -The table below shows every supported model and the optimizations supported for them. - -The symbols used have the following meanings: - -- ✅ = Full compatibility -- ❌ = No compatibility -- ⭕ = Does not apply to this model - -## Models x Optimization - -The `HuggingFace Model ID` can be passed directly to `from_pretrained()` methods, and sglang-diffusion will use the -optimal -default parameters when initializing and generating videos. - -### Video Generation Models - -| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | -|:-----------------------------|:--------------------------------------------------|:--------------------|:--------:|:-----------------:|:---------:|:----------------------------:| -| FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | -| FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | -| Wan2.2 TI2V 5B | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | 720p | ⭕ | ⭕ | ✅ | ⭕ | -| Wan2.2 T2V A14B | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | 480p
720p | ❌ | ❌ | ✅ | ⭕ | -| Wan2.2 I2V A14B | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | 480p, 720p | ❌ | ❌ | ✅ | ⭕ | -| HunyuanVideo | `hunyuanvideo-community/HunyuanVideo` | 720×1280, 544×960 | ❌ | ✅ | ✅ | ⭕ | -| FastHunyuan | `FastVideo/FastHunyuan-diffusers` | 720×1280
544×960 | ❌ | ✅ | ✅ | ⭕ | -| Wan2.1 T2V 1.3B | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | -| Wan2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | 480p, 720p | ✅ | ✅ | ✅ | ⭕ | -| Wan2.1 I2V 480P | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | -| Wan2.1 I2V 720P | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | 720p | ✅ | ✅ | ✅ | ⭕ | - -**Note**: Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue. - -### Image Generation Models - -| Model Name | HuggingFace Model ID | Resolutions | -|:-----------------|:----------------------------------------|:---------------| -| FLUX.1-dev | `black-forest-labs/FLUX.1-dev` | Any resolution | -| FLUX.2-dev | `black-forest-labs/FLUX.2-dev` | Any resolution | -| FLUX.2-Klein | `black-forest-labs/FLUX.2-klein-4B` | Any resolution | -| Z-Image-Turbo | `Tongyi-MAI/Z-Image-Turbo` | Any resolution | -| GLM-Image | `zai-org/GLM-Image` | Any resolution | -| Qwen Image | `Qwen/Qwen-Image` | Any resolution | -| Qwen Image 2512 | `Qwen/Qwen-Image-2512` | Any resolution | -| Qwen Image Edit | `Qwen/Qwen-Image-Edit` | Any resolution | - -## Verified LoRA Examples - -This section lists example LoRAs that have been explicitly tested and verified with each base model in the **SGLang Diffusion** pipeline. - -> Important: \ -> LoRAs that are not listed here are not necessarily incompatible. -> In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions. -> The entries below simply reflect configurations that have been manually validated by the SGLang team. - -### Verified LoRAs by Base Model - -| Base Model | Supported LoRAs | -|:-----------------|:----------------| -| Wan2.2 | `lightx2v/Wan2.2-Distill-Loras`
`Cseti/wan2.2-14B-Arcane_Jinx-lora-v1` | -| Wan2.1 | `lightx2v/Wan2.1-Distill-Loras` | -| Z-Image-Turbo | `tarn59/pixel_art_style_lora_z_image_turbo`, `wcde/Z-Image-Turbo-DeJPEG-Lora` | -| Qwen-Image | `lightx2v/Qwen-Image-Lightning`, `flymy-ai/qwen-image-realism-lora`, `prithivMLmods/Qwen-Image-HeadshotX`, `starsfriday/Qwen-Image-EVA-LoRA` | -| Qwen-Image-Edit | `ostris/qwen_image_edit_inpainting`, `lightx2v/Qwen-Image-Edit-2511-Lightning` | -| Flux | `dvyio/flux-lora-simple-illustration`, `XLabs-AI/flux-furry-lora`
`XLabs-AI/flux-RealismLora` | - -#### Special Requirements - -> [!NOTE] -> Sliding Tile Attention: Currently, only Hopper GPUs (H100s) are supported. - - ---- - -# SGLang diffusion CLI Inference - -The SGLang-diffusion CLI provides a quick way to access the inference pipeline for image and video generation. - -## Prerequisites - -- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`. -- Python 3.11+ if you plan to use the OpenAI Python SDK. - - -## Supported Arguments - -### Server Arguments - -- `--model-path {MODEL_PATH}`: Path to the model or model ID -- `--vae-path {VAE_PATH}`: Path to a custom VAE model or HuggingFace model ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`). If not specified, the VAE will be loaded from the main model path. -- `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied. -- `--lora-nickname {NAME}`: Nickname for the LoRA adapter. (default: `default`). -- `--num-gpus {NUM_GPUS}`: Number of GPUs to use -- `--tp-size {TP_SIZE}`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster) -- `--sp-degree {SP_SIZE}`: Sequence parallelism size (typically should match the number of GPUs) -- `--ulysses-degree {ULYSSES_DEGREE}`: The degree of DeepSpeed-Ulysses-style SP in USP -- `--ring-degree {RING_DEGREE}`: The degree of ring attention-style SP in USP - - -### Sampling Parameters - -- `--prompt {PROMPT}`: Text description for the video you want to generate -- `--num-inference-steps {STEPS}`: Number of denoising steps -- `--negative-prompt {PROMPT}`: Negative prompt to guide generation away from certain concepts -- `--seed {SEED}`: Random seed for reproducible generation - - -#### Image/Video Configuration - -- `--height {HEIGHT}`: Height of the generated output -- `--width {WIDTH}`: Width of the generated output -- `--num-frames {NUM_FRAMES}`: 
Number of frames to generate -- `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task - - -#### Output Options - -- `--output-path {PATH}`: Directory to save the generated video -- `--save-output`: Whether to save the image/video to disk -- `--return-frames`: Whether to return the raw frames - -### Using Configuration Files - -Instead of specifying all parameters on the command line, you can use a configuration file: - -```bash -sglang generate --config {CONFIG_FILE_PATH} -``` - -The configuration file should be in JSON or YAML format with the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file. - -Example configuration file (config.json): - -```json -{ - "model_path": "FastVideo/FastHunyuan-diffusers", - "prompt": "A beautiful woman in a red dress walking down a street", - "output_path": "outputs/", - "num_gpus": 2, - "sp_size": 2, - "tp_size": 1, - "num_frames": 45, - "height": 720, - "width": 1280, - "num_inference_steps": 6, - "seed": 1024, - "fps": 24, - "precision": "bf16", - "vae_precision": "fp16", - "vae_tiling": true, - "vae_sp": true, - "vae_config": { - "load_encoder": false, - "load_decoder": true, - "tile_sample_min_height": 256, - "tile_sample_min_width": 256 - }, - "text_encoder_precisions": [ - "fp16", - "fp16" - ], - "mask_strategy_file_path": null, - "enable_torch_compile": false -} -``` - -Or using YAML format (config.yaml): - -```yaml -model_path: "FastVideo/FastHunyuan-diffusers" -prompt: "A beautiful woman in a red dress walking down a street" -output_path: "outputs/" -num_gpus: 2 -sp_size: 2 -tp_size: 1 -num_frames: 45 -height: 720 -width: 1280 -num_inference_steps: 6 -seed: 1024 -fps: 24 -precision: "bf16" -vae_precision: "fp16" -vae_tiling: true -vae_sp: true -vae_config: - load_encoder: false - load_decoder: true - tile_sample_min_height: 256 - 
tile_sample_min_width: 256 -text_encoder_precisions: - - "fp16" - - "fp16" -mask_strategy_file_path: null -enable_torch_compile: false -``` - - -To see all the options, you can use the `--help` flag: - -```bash -sglang generate --help -``` - -## Serve - -Launch the SGLang diffusion HTTP server and interact with it using the OpenAI SDK and curl. - -### Start the server - -Use the following command to launch the server: - -```bash -SERVER_ARGS=( - --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers - --text-encoder-cpu-offload - --pin-cpu-memory - --num-gpus 4 - --ulysses-degree=2 - --ring-degree=2 -) - -sglang serve "${SERVER_ARGS[@]}" -``` - -- **--model-path**: Which model to load. The example uses `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`. -- **--port**: HTTP port to listen on (the default here is `30010`). - -For detailed API usage, including Image, Video Generation and LoRA management, please refer to the [OpenAI API Documentation](#sglang-diffusion-openai-api). - - -## Generate - -Run a one-off generation task without launching a persistent server. - -To use it, pass both server arguments and sampling parameters in one command, after the `generate` subcommand, for example: - -```bash -SERVER_ARGS=( - --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers - --text-encoder-cpu-offload - --pin-cpu-memory - --num-gpus 4 - --ulysses-degree=2 - --ring-degree=2 -) - -SAMPLING_ARGS=( - --prompt "A curious raccoon" - --save-output - --output-path outputs - --output-file-name "A curious raccoon.mp4" -) - -sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}" - -# Or, users can set `SGLANG_CACHE_DIT_ENABLED` env as `true` to enable cache acceleration -SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}" -``` - -Once the generation task has finished, the server will shut down automatically. - -> [!NOTE] -> The HTTP server-related arguments are ignored in this subcommand. 
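The same one-off invocation can also be assembled from Python, for example inside a test script. The following is a minimal sketch that only builds the argument list from the flags shown in the shell example; `build_generate_command` is a hypothetical helper name (not part of SGLang), and the actual `subprocess.run` call is left commented out because it needs an installed `sglang` CLI and suitable GPUs.

```python
import shlex

# Flags copied from the shell example; nothing here is validated against the CLI.
SERVER_ARGS = [
    "--model-path", "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    "--text-encoder-cpu-offload",
    "--pin-cpu-memory",
    "--num-gpus", "4",
    "--ulysses-degree=2",
    "--ring-degree=2",
]

SAMPLING_ARGS = [
    "--prompt", "A curious raccoon",
    "--save-output",
    "--output-path", "outputs",
    "--output-file-name", "A curious raccoon.mp4",
]


def build_generate_command(server_args, sampling_args):
    """Return the argv list for a one-off `sglang generate` run (hypothetical helper)."""
    return ["sglang", "generate", *server_args, *sampling_args]


cmd = build_generate_command(SERVER_ARGS, SAMPLING_ARGS)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually run it (requires an installed `sglang` CLI and suitable GPUs):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts and file names that contain spaces.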
- -## Diffusers Backend - -SGLang diffusion supports a **diffusers backend** that allows you to run any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes. - -### Arguments - -| Argument | Values | Description | -|----------|--------|-------------| -| `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fallback to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force vanilla diffusers pipeline. | -| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines. See [diffusers attention backends](https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends). | -| `--trust-remote-code` | flag | Required for models with custom pipeline classes (e.g., Ovis). | -| `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile). | -| `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). | -| `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer. | -| `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE. | - -### Example: Running Ovis-Image-7B - -[Ovis-Image-7B](https://huggingface.co/AIDC-AI/Ovis-Image-7B) is a 7B text-to-image model optimized for high-quality text rendering. 
- -```bash -sglang generate \ - --model-path AIDC-AI/Ovis-Image-7B \ - --backend diffusers \ - --trust-remote-code \ - --diffusers-attention-backend flash \ - --prompt "A serene Japanese garden with cherry blossoms" \ - --height 1024 \ - --width 1024 \ - --num-inference-steps 30 \ - --save-output \ - --output-path outputs \ - --output-file-name ovis_garden.png -``` - -### Extra Diffusers Arguments - -For pipeline-specific parameters not exposed via CLI, use `diffusers_kwargs` in a config file: - -```json -{ - "model_path": "AIDC-AI/Ovis-Image-7B", - "backend": "diffusers", - "prompt": "A beautiful landscape", - "diffusers_kwargs": { - "cross_attention_kwargs": {"scale": 0.5} - } -} -``` - -```bash -sglang generate --config config.json -``` - ---- - -# SGLang Diffusion OpenAI API - -The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management. - -## Serve - -Launch the server using the `sglang serve` command. - -### Start the server - -```bash -SERVER_ARGS=( - --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers - --text-encoder-cpu-offload - --pin-cpu-memory - --num-gpus 4 - --ulysses-degree=2 - --ring-degree=2 - --port 30010 -) - -sglang serve "${SERVER_ARGS[@]}" -``` - -- **--model-path**: Path to the model or model ID. -- **--port**: HTTP port to listen on (default: `30000`). - -#### Get Model Information - -**Endpoint:** `GET /models` - -Returns information about the model served by this server, including model path, task type, pipeline configuration, and precision settings. 
- -**Curl Example:** - -```bash -curl -sS -X GET "http://localhost:30010/models" -``` - -**Response Example:** - -```json -{ - "model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", - "task_type": "T2V", - "pipeline_name": "wan_pipeline", - "pipeline_class": "WanPipeline", - "num_gpus": 4, - "dit_precision": "bf16", - "vae_precision": "fp16" -} -``` - ---- - -## Endpoints - -### Image Generation - -The server implements an OpenAI-compatible Images API under the `/v1/images` namespace. - -#### Create an image - -**Endpoint:** `POST /v1/images/generations` - -**Python Example (b64_json response):** - -```python -import base64 -from openai import OpenAI - -client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1") - -img = client.images.generate( - prompt="A calico cat playing a piano on stage", - size="1024x1024", - n=1, - response_format="b64_json", -) - -image_bytes = base64.b64decode(img.data[0].b64_json) -with open("output.png", "wb") as f: - f.write(image_bytes) -``` - -**Curl Example:** - -```bash -curl -sS -X POST "http://localhost:30010/v1/images/generations" \ - -H "Content-Type: application/json" \ - -H "Authorization: Bearer sk-proj-1234567890" \ - -d '{ - "prompt": "A calico cat playing a piano on stage", - "size": "1024x1024", - "n": 1, - "response_format": "b64_json" - }' -``` - -> **Note** -> The `response_format=url` option is not supported for `POST /v1/images/generations` and will return a `400` error. - -#### Edit an image - -**Endpoint:** `POST /v1/images/edits` - -This endpoint accepts a multipart form upload with input images and a text prompt. The server can return either a base64-encoded image or a URL to download the image. 
- -**Curl Example (b64_json response):** - -```bash -curl -sS -X POST "http://localhost:30010/v1/images/edits" \ - -H "Authorization: Bearer sk-proj-1234567890" \ - -F "image=@local_input_image.png" \ - -F "url=image_url.jpg" \ - -F "prompt=A calico cat playing a piano on stage" \ - -F "size=1024x1024" \ - -F "response_format=b64_json" -``` - -**Curl Example (URL response):** - -```bash -curl -sS -X POST "http://localhost:30010/v1/images/edits" \ - -H "Authorization: Bearer sk-proj-1234567890" \ - -F "image=@local_input_image.png" \ - -F "url=image_url.jpg" \ - -F "prompt=A calico cat playing a piano on stage" \ - -F "size=1024x1024" \ - -F "response_format=url" -``` - -#### Download image content - -When `response_format=url` is used with `POST /v1/images/edits`, the API returns a relative URL like `/v1/images//content`. - -**Endpoint:** `GET /v1/images/{image_id}/content` - -**Curl Example:** - -```bash -curl -sS -L "http://localhost:30010/v1/images//content" \ - -H "Authorization: Bearer sk-proj-1234567890" \ - -o output.png -``` - -### Video Generation - -The server implements a subset of the OpenAI Videos API under the `/v1/videos` namespace. 
- -#### Create a video - -**Endpoint:** `POST /v1/videos` - -**Python Example:** - -```python -from openai import OpenAI - -client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1") - -video = client.videos.create( - prompt="A calico cat playing a piano on stage", - size="1280x720" -) -print(f"Video ID: {video.id}, Status: {video.status}") -``` - -**Curl Example:** - -```bash -curl -sS -X POST "http://localhost:30010/v1/videos" \ - -H "Content-Type: application/json" \ - -H "Authorization: Bearer sk-proj-1234567890" \ - -d '{ - "prompt": "A calico cat playing a piano on stage", - "size": "1280x720" - }' -``` - -#### List videos - -**Endpoint:** `GET /v1/videos` - -**Python Example:** - -```python -videos = client.videos.list() -for item in videos.data: - print(item.id, item.status) -``` - -**Curl Example:** - -```bash -curl -sS -X GET "http://localhost:30010/v1/videos" \ - -H "Authorization: Bearer sk-proj-1234567890" -``` - -#### Download video content - -**Endpoint:** `GET /v1/videos/{video_id}/content` - -**Python Example:** - -```python -import time - -# Poll for completion -while True: - page = client.videos.list() - item = next((v for v in page.data if v.id == video_id), None) - if item and item.status == "completed": - break - time.sleep(5) - -# Download content -resp = client.videos.download_content(video_id=video_id) -with open("output.mp4", "wb") as f: - f.write(resp.read()) -``` - -**Curl Example:** - -```bash -curl -sS -L "http://localhost:30010/v1/videos//content" \ - -H "Authorization: Bearer sk-proj-1234567890" \ - -o output.mp4 -``` - ---- - -### LoRA Management - -The server supports dynamic loading, merging, and unmerging of LoRA adapters. - -**Important Notes:** -- Mutual Exclusion: Only one LoRA can be *merged* (active) at a time -- Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one -- Caching: The server caches loaded LoRA weights in memory. 
Switching back to a previously loaded LoRA (same path) has little cost - -#### Set LoRA Adapter - -Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters. - -**Endpoint:** `POST /v1/set_lora` - -**Parameters:** -- `lora_nickname` (string or list of strings, required): A unique identifier for the LoRA adapter(s). Can be a single string or a list of strings for multiple LoRAs -- `lora_path` (string or list of strings/None, optional): Path to the `.safetensors` file(s) or Hugging Face repo ID(s). Required for the first load; optional if re-activating a cached nickname. If a list, must match the length of `lora_nickname` -- `target` (string or list of strings, optional): Which transformer(s) to apply the LoRA to. If a list, must match the length of `lora_nickname`. Valid values: - - `"all"` (default): Apply to all transformers - - `"transformer"`: Apply only to the primary transformer (high noise for Wan2.2) - - `"transformer_2"`: Apply only to transformer_2 (low noise for Wan2.2) - - `"critic"`: Apply only to the critic model -- `strength` (float or list of floats, optional): LoRA strength for merge, default 1.0. If a list, must match the length of `lora_nickname`. 
Values < 1.0 reduce the effect, values > 1.0 amplify the effect - -**Single LoRA Example:** - -```bash -curl -X POST http://localhost:30010/v1/set_lora \ - -H "Content-Type: application/json" \ - -d '{ - "lora_nickname": "lora_name", - "lora_path": "/path/to/lora.safetensors", - "target": "all", - "strength": 0.8 - }' -``` - -**Multiple LoRA Example:** - -```bash -curl -X POST http://localhost:30010/v1/set_lora \ - -H "Content-Type: application/json" \ - -d '{ - "lora_nickname": ["lora_1", "lora_2"], - "lora_path": ["/path/to/lora1.safetensors", "/path/to/lora2.safetensors"], - "target": ["transformer", "transformer_2"], - "strength": [0.8, 1.0] - }' -``` - -**Multiple LoRA with Same Target:** - -```bash -curl -X POST http://localhost:30010/v1/set_lora \ - -H "Content-Type: application/json" \ - -d '{ - "lora_nickname": ["style_lora", "character_lora"], - "lora_path": ["/path/to/style.safetensors", "/path/to/character.safetensors"], - "target": "all", - "strength": [0.7, 0.9] - }' -``` - -> [!NOTE] -> When using multiple LoRAs: -> - Every parameter that is passed as a list (`lora_nickname`, `lora_path`, `target`, `strength`) must have the same length -> - If `target` or `strength` is a single value, it will be applied to all LoRAs -> - Multiple LoRAs applied to the same target will be merged in order - - -#### Merge LoRA Weights - -Manually merges the currently set LoRA weights into the base model. - -> [!NOTE] -> `set_lora` automatically performs a merge, so this is typically only needed if you have manually unmerged but want to re-apply the same LoRA without calling `set_lora` again. - -**Endpoint:** `POST /v1/merge_lora_weights` - -**Parameters:** -- `target` (string, optional): Which transformer(s) to merge. One of "all" (default), "transformer", "transformer_2", "critic" -- `strength` (float, optional): LoRA strength for merge, default 1.0. 
Values < 1.0 reduce the effect, values > 1.0 amplify the effect - -**Curl Example:** - -```bash -curl -X POST http://localhost:30010/v1/merge_lora_weights \ - -H "Content-Type: application/json" \ - -d '{"strength": 0.8}' -``` - - -#### Unmerge LoRA Weights - -Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This **must** be called before setting a different LoRA. - -**Endpoint:** `POST /v1/unmerge_lora_weights` - -**Curl Example:** - -```bash -curl -X POST http://localhost:30010/v1/unmerge_lora_weights \ - -H "Content-Type: application/json" -``` - -#### List LoRA Adapters - -Returns loaded LoRA adapters and current application status per module. - -**Endpoint:** `GET /v1/list_loras` - -**Curl Example:** - -```bash -curl -sS -X GET "http://localhost:30010/v1/list_loras" -``` - -**Response Example:** - -```json -{ - "loaded_adapters": [ - { "nickname": "lora_a", "path": "/weights/lora_a.safetensors" }, - { "nickname": "lora_b", "path": "/weights/lora_b.safetensors" } - ], - "active": { - "transformer": [ - { - "nickname": "lora2", - "path": "tarn59/pixel_art_style_lora_z_image_turbo", - "merged": true, - "strength": 1.0 - } - ] - } -} -``` - -Notes: -- If LoRA is not enabled for the current pipeline, the server will return an error. -- `num_lora_layers_with_weights` counts only layers that have LoRA weights applied for the active adapter. - -### Example: Switching LoRAs - -1. Set LoRA A: - ```bash - curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_a", "lora_path": "path/to/A"}' - ``` -2. Generate with LoRA A... -3. Unmerge LoRA A: - ```bash - curl -X POST http://localhost:30010/v1/unmerge_lora_weights - ``` -4. Set LoRA B: - ```bash - curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}' - ``` -5. Generate with LoRA B... 
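The unmerge-then-set sequence above can be scripted. The sketch below uses only the Python standard library and builds the two requests without sending them; `post_request` and `switch_lora_requests` are hypothetical helper names, the server address is assumed to match the curl examples, and sending the unmerge call with no body is an assumption modelled on the curl command.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30010"  # assumed server address, as in the curl examples


def post_request(path, payload=None):
    """Build (but do not send) a POST request to a LoRA management endpoint."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def switch_lora_requests(new_nickname, new_path):
    """Return the requests needed to replace the currently merged LoRA, in order."""
    return [
        # 1. Restore the base weights; required before a different LoRA is set.
        post_request("/v1/unmerge_lora_weights"),
        # 2. Load (or re-activate from cache) and merge the new adapter.
        post_request(
            "/v1/set_lora",
            {"lora_nickname": new_nickname, "lora_path": new_path},
        ),
    ]


requests_to_send = switch_lora_requests("lora_b", "path/to/B")
for req in requests_to_send:
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment when a server is running
```

Keeping request construction separate from sending makes the ordering easy to check in a unit test before pointing the script at a live server.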
- ---- - -# Attention Backends - -This document describes the attention backends available in sglang diffusion (`sglang.multimodal_gen`) and how to select them. - -## Overview - -Attention backends are defined by `AttentionBackendEnum` (`sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum`) and selected via the CLI flag `--attention-backend`. - -Backend selection is performed by the shared attention layers (e.g. `LocalAttention` / `USPAttention` / `UlyssesAttention` in `sglang.multimodal_gen.runtime.layers.attention.layer`) and therefore applies to any model component using these layers (e.g. diffusion transformer / DiT and encoders). - -- **CUDA**: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA. -- **ROCm**: uses FlashAttention when available; otherwise falls back to PyTorch SDPA. -- **MPS**: always uses PyTorch SDPA. - -## Backend options - -The CLI accepts the lowercase names of `AttentionBackendEnum`. The table below lists the backends implemented by the built-in platforms. `fa3`/`fa4` are accepted as aliases for `fa`. - -| CLI value | Enum value | Notes | -|---|---|---| -| `fa` / `fa3` / `fa4` | `FA` | FlashAttention. `fa3/fa4` are normalized to `fa` during argument parsing (`ServerArgs.__post_init__`). | -| `torch_sdpa` | `TORCH_SDPA` | PyTorch `scaled_dot_product_attention`. | -| `sliding_tile_attn` | `SLIDING_TILE_ATTN` | Sliding Tile Attention (STA). Requires `st_attn` and a mask-strategy config file set via the `SGLANG_DIFFUSION_ATTENTION_CONFIG` environment variable. | -| `sage_attn` | `SAGE_ATTN` | Requires `sageattention`. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream `setup.py`: https://github.com/thu-ml/SageAttention/blob/main/setup.py. | -| `sage_attn_3` | `SAGE_ATTN_3` | Requires SageAttention3 installed per upstream instructions. | -| `video_sparse_attn` | `VIDEO_SPARSE_ATTN` | Requires `vsa`. 
| -| `vmoba_attn` | `VMOBA_ATTN` | Requires `kernel.attn.vmoba_attn.vmoba`. | -| `aiter` | `AITER` | Requires `aiter`. | - -## Selection priority - -The selection order in `runtime/layers/attention/selector.py` is: - -1. `global_force_attn_backend(...)` / `global_force_attn_backend_context_manager(...)` -2. CLI `--attention-backend` (`ServerArgs.attention_backend`) -3. Auto selection (platform capability, dtype, and installed packages) - -## Platform support matrix - -| Backend | CUDA | ROCm | MPS | Notes | -|---|---:|---:|---:|---| -| `fa` | ✅ | ✅ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. | -| `torch_sdpa` | ✅ | ✅ | ✅ | Most compatible option across platforms. | -| `sliding_tile_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `st_attn` and `SGLANG_DIFFUSION_ATTENTION_CONFIG`. | -| `sage_attn` | ✅ | ❌ | ❌ | CUDA-only (optional dependency). | -| `sage_attn_3` | ✅ | ❌ | ❌ | CUDA-only (optional dependency). | -| `video_sparse_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `vsa`. | -| `vmoba_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. | -| `aiter` | ✅ | ❌ | ❌ | Requires `aiter`. | - -## Usage - -### Select a backend via CLI - -```bash -sglang generate \ - --model-path \ - --prompt "..." \ - --attention-backend fa -``` - -```bash -sglang generate \ - --model-path \ - --prompt "..." \ - --attention-backend torch_sdpa -``` - -### Using Sliding Tile Attention (STA) - -```bash -export SGLANG_DIFFUSION_ATTENTION_CONFIG=/abs/path/to/mask_strategy.json - -sglang generate \ - --model-path \ - --prompt "..." \ - --attention-backend sliding_tile_attn -``` - -### Notes for ROCm / MPS - -- ROCm: use `--attention-backend torch_sdpa` or `fa` depending on what is available in your environment. -- MPS: the platform implementation always uses `torch_sdpa`. 
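The selection priority and alias handling described in this section can be illustrated with a small standalone sketch. This is not the code in `selector.py`; `normalize_backend` and `resolve_backend` are hypothetical names, and applying alias normalization to the forced backend is an assumption made for simplicity.

```python
from typing import Optional

# Aliases named in the backend table: `fa3`/`fa4` are normalized to `fa`.
ALIASES = {"fa3": "fa", "fa4": "fa"}


def normalize_backend(name: str) -> str:
    """Lowercase a backend name and fold known aliases into their canonical form."""
    lowered = name.lower()
    return ALIASES.get(lowered, lowered)


def resolve_backend(forced: Optional[str], cli: Optional[str], auto: str) -> str:
    """Pick a backend using the documented priority: forced > CLI > auto."""
    for candidate in (forced, cli):
        if candidate is not None:
            return normalize_backend(candidate)
    return normalize_backend(auto)


print(resolve_backend(None, "fa3", "torch_sdpa"))        # fa
print(resolve_backend(None, None, "torch_sdpa"))         # torch_sdpa
print(resolve_backend("sage_attn", "fa", "torch_sdpa"))  # sage_attn
```

In the real selector the automatic choice depends on platform capability, dtype, and installed packages; here it is reduced to a single precomputed string.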
- ---- - -# Cache-DiT Acceleration - -SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion -Transformers (DiT), to achieve up to **7.4x inference speedup** with minimal quality loss. - -## Overview - -**Cache-DiT** uses intelligent caching strategies to skip redundant computation in the denoising loop: - -- **DBCache (Dual Block Cache)**: Dynamically decides when to cache transformer blocks based on residual differences -- **TaylorSeer**: Uses Taylor expansion for calibration to optimize caching decisions -- **SCM (Step Computation Masking)**: Step-level caching control for additional speedup - -## Basic Usage - -Enable Cache-DiT by exporting the environment variable and using `sglang generate` or `sglang serve` : - -```bash -SGLANG_CACHE_DIT_ENABLED=true \ -sglang generate --model-path Qwen/Qwen-Image \ - --prompt "A beautiful sunset over the mountains" -``` - -## Advanced Configuration - -### DBCache Parameters - -DBCache controls block-level caching behavior: - -| Parameter | Env Variable | Default | Description | -|-----------|---------------------------|---------|------------------------------------------| -| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute | -| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute | -| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts | -| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold | -| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps | - -### TaylorSeer Configuration - -TaylorSeer improves caching accuracy using Taylor expansion: - -| Parameter | Env Variable | Default | Description | -|-----------|-------------------------------|---------|---------------------------------| -| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator | -| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) | - -### Combined Configuration 
Example - -DBCache and TaylorSeer are complementary strategies that work together, you can configure both sets of parameters -simultaneously: - -```bash -SGLANG_CACHE_DIT_ENABLED=true \ -SGLANG_CACHE_DIT_FN=2 \ -SGLANG_CACHE_DIT_BN=1 \ -SGLANG_CACHE_DIT_WARMUP=4 \ -SGLANG_CACHE_DIT_RDT=0.4 \ -SGLANG_CACHE_DIT_MC=4 \ -SGLANG_CACHE_DIT_TAYLORSEER=true \ -SGLANG_CACHE_DIT_TS_ORDER=2 \ -sglang generate --model-path black-forest-labs/FLUX.1-dev \ - --prompt "A curious raccoon in a forest" -``` - -### SCM (Step Computation Masking) - -SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and -which to use cached results. - -#### SCM Presets - -SCM is configured with presets: - -| Preset | Compute Ratio | Speed | Quality | -|----------|---------------|----------|------------| -| `none` | 100% | Baseline | Best | -| `slow` | ~75% | ~1.3x | High | -| `medium` | ~50% | ~2x | Good | -| `fast` | ~35% | ~3x | Acceptable | -| `ultra` | ~25% | ~4x | Lower | - -##### Usage - -```bash -SGLANG_CACHE_DIT_ENABLED=true \ -SGLANG_CACHE_DIT_SCM_PRESET=medium \ -sglang generate --model-path Qwen/Qwen-Image \ - --prompt "A futuristic cityscape at sunset" -``` - -#### Custom SCM Bins - -For fine-grained control over which steps to compute vs cache: - -```bash -SGLANG_CACHE_DIT_ENABLED=true \ -SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \ -SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \ -sglang generate --model-path Qwen/Qwen-Image \ - --prompt "A futuristic cityscape at sunset" -``` - -#### SCM Policy - -| Policy | Env Variable | Description | -|-----------|---------------------------------------|---------------------------------------------| -| `dynamic` | `SGLANG_CACHE_DIT_SCM_POLICY=dynamic` | Adaptive caching based on content (default) | -| `static` | `SGLANG_CACHE_DIT_SCM_POLICY=static` | Fixed caching pattern | - -## Environment Variables - -All Cache-DiT parameters can be set via the following environment variables: - -| 
Environment Variable | Default | Description | -|-------------------------------------|---------|------------------------------------------| -| `SGLANG_CACHE_DIT_ENABLED` | false | Enable Cache-DiT acceleration | -| `SGLANG_CACHE_DIT_FN` | 1 | First N blocks to always compute | -| `SGLANG_CACHE_DIT_BN` | 0 | Last N blocks to always compute | -| `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching | -| `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold | -| `SGLANG_CACHE_DIT_MC` | 3 | Max continuous cached steps | -| `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator | -| `SGLANG_CACHE_DIT_TS_ORDER` | 1 | TaylorSeer order (1 or 2) | -| `SGLANG_CACHE_DIT_SCM_PRESET` | none | SCM preset (none/slow/medium/fast/ultra) | -| `SGLANG_CACHE_DIT_SCM_POLICY` | dynamic | SCM caching policy | -| `SGLANG_CACHE_DIT_SCM_COMPUTE_BINS` | not set | Custom SCM compute bins | -| `SGLANG_CACHE_DIT_SCM_CACHE_BINS` | not set | Custom SCM cache bins | - -## Supported Models - -SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion: - -| Model Family | Example Models | -|--------------|-------------------------------------------| -| Wan | Wan2.1, Wan2.2 | -| Flux | FLUX.1-dev, FLUX.2-dev, FLUX.2-Klein | -| Z-Image | Z-Image-Turbo | -| Qwen | Qwen-Image, Qwen-Image-Edit | -| GLM | GLM-Image | -| Hunyuan | HunyuanVideo | - -## Performance Tips - -1. **Start with defaults**: The default parameters work well for most models -2. **Use TaylorSeer**: It typically improves both speed and quality -3. **Tune R threshold**: Lower values = better quality, higher values = faster -4. **SCM for extra speed**: Use `medium` preset for good speed/quality balance -5. 
**Warmup matters**: More warmup steps give more stable caching decisions - -## Limitations - -- **Single GPU only**: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically disabled when - `world_size > 1` -- **SCM minimum steps**: SCM requires at least 8 inference steps; with fewer steps it is disabled automatically -- **Model support**: Only models registered in Cache-DiT's BlockAdapterRegister are supported - -## Troubleshooting - -### Distributed environment warning - -``` -WARNING: cache-dit is disabled in distributed environment (world_size=N) -``` - -This is expected behavior. Cache-DiT currently only supports single-GPU inference. - -### SCM disabled for low step count - -For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache -acceleration still works. - -## References - -- [Cache-DiT](https://github.com/vipshop/cache-dit) -- [SGLang Diffusion](https://github.com/sgl-project/sglang/tree/main/python/sglang/multimodal_gen) - ---- - -# Profiling Multimodal Generation - -This guide covers profiling techniques for multimodal generation pipelines in SGLang. - -## PyTorch Profiler - -PyTorch Profiler provides detailed kernel execution time, call stack, and GPU utilization metrics. 
- -### Denoising Stage Profiling - -Profile the denoising stage with sampled timesteps (default: 5 steps after 1 warmup step): - -```bash -sglang generate \ - --model-path Qwen/Qwen-Image \ - --prompt "A Logo With Bold Large Text: SGL Diffusion" \ - --seed 0 \ - --profile -``` - -**Parameters:** -- `--profile`: Enable profiling for the denoising stage -- `--num-profiled-timesteps N`: Number of timesteps to profile after warmup (default: 5) - - Smaller values reduce trace file size - - Example: `--num-profiled-timesteps 10` profiles 10 steps after 1 warmup step - -### Full Pipeline Profiling - -Profile all pipeline stages (text encoding, denoising, VAE decoding, etc.): - -```bash -sglang generate \ - --model-path Qwen/Qwen-Image \ - --prompt "A Logo With Bold Large Text: SGL Diffusion" \ - --seed 0 \ - --profile \ - --profile-all-stages -``` - -**Parameters:** -- `--profile-all-stages`: Used with `--profile`, profile all pipeline stages instead of just denoising - -### Output Location - -By default, trace files are saved in the ./logs/ directory. - -The exact output file path will be shown in the console output, for example: - -```bash -[mm-dd hh:mm:ss] Saved profiler traces to: /sgl-workspace/sglang/logs/mocked_fake_id_for_offline_generate-5_steps-global-rank0.trace.json.gz -``` - -### View Traces - -Load and visualize trace files at: -- https://ui.perfetto.dev/ (recommended) -- chrome://tracing (Chrome only) - -For large trace files, reduce `--num-profiled-timesteps` or avoid using `--profile-all-stages`. - - -### `--perf-dump-path` (Stage/Step Timing Dump) - -Besides profiler traces, you can also dump a lightweight JSON report that contains: -- stage-level timing breakdown for the full pipeline -- step-level timing breakdown for the denoising stage (per diffusion step) - -This is useful to quickly identify which stage dominates end-to-end latency, and whether denoising steps have uniform runtimes (and if not, which step has an abnormal spike). 
- -The dumped JSON contains a `denoise_steps_ms` field formatted as an array of objects, each with a `step` key (the step index) and a `duration_ms` key. - -Example: - -```bash -sglang generate \ - --model-path \ - --prompt "" \ - --perf-dump-path perf.json -``` - -## Nsight Systems - -Nsight Systems provides low-level CUDA profiling with kernel details, register usage, and memory access patterns. - -### Installation - -See the [SGLang profiling guide](https://github.com/sgl-project/sglang/blob/main/docs/developer_guide/benchmark_and_profiling.md#profile-with-nsight) for installation instructions. - -### Basic Profiling - -Profile the entire pipeline execution: - -```bash -nsys profile \ - --trace-fork-before-exec=true \ - --cuda-graph-trace=node \ - --force-overwrite=true \ - -o QwenImage \ - sglang generate \ - --model-path Qwen/Qwen-Image \ - --prompt "A Logo With Bold Large Text: SGL Diffusion" \ - --seed 0 -``` - -### Targeted Stage Profiling - -Use `--delay` and `--duration` to capture specific stages and reduce file size: - -```bash -nsys profile \ - --trace-fork-before-exec=true \ - --cuda-graph-trace=node \ - --force-overwrite=true \ - --delay 10 \ - --duration 30 \ - -o QwenImage_denoising \ - sglang generate \ - --model-path Qwen/Qwen-Image \ - --prompt "A Logo With Bold Large Text: SGL Diffusion" \ - --seed 0 -``` - -**Parameters:** -- `--delay N`: Wait N seconds before starting capture (skip initialization overhead) -- `--duration N`: Capture for N seconds (focus on specific stages) -- `--force-overwrite`: Overwrite existing output files - -## Notes - -- **Reduce trace size**: Use `--num-profiled-timesteps` with smaller values or `--delay`/`--duration` with Nsight Systems -- **Stage-specific analysis**: Use `--profile` alone for denoising stage, add `--profile-all-stages` for full pipeline -- **Multiple runs**: Profile with different prompts and resolutions to identify bottlenecks across workloads - -## FAQ - -- If you are profiling `sglang generate` 
with Nsight Systems and find that the generated profiler file did not capture any CUDA kernels, you can resolve this issue by increasing the model's inference steps to extend the execution time. - ---- - -# Contributing to SGLang Diffusion - -This guide outlines the requirements for contributing to the SGLang Diffusion module (`sglang.multimodal_gen`). - -## 1. Commit Message Convention - -We follow a structured commit message format to maintain a clean history. - -**Format:** -```text -[diffusion] : -``` - -**Examples:** -- `[diffusion] cli: add --perf-dump-path argument` -- `[diffusion] scheduler: fix deadlock in batch processing` -- `[diffusion] model: support Stable Diffusion 3.5` - -**Rules:** -- **Prefix**: Always start with `[diffusion]`. -- **Scope** (Optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc. -- **Subject**: Imperative mood, short and clear (e.g., "add feature" not "added feature"). - -## 2. Performance Reporting - -For PRs that impact **latency**, **throughput**, or **memory usage**, you **should** provide a performance comparison report. - -### How to Generate a Report - -1. **Baseline**: run the benchmark (for a single generation task) - ```bash - $ sglang generate --model-path --prompt "A benchmark prompt" --perf-dump-path baseline.json - ``` - -2. **New**: run the same benchmark, without modifying any server_args or sampling_params - ```bash - $ sglang generate --model-path --prompt "A benchmark prompt" --perf-dump-path new.json - ``` - -3. **Compare**: run the compare script, which will print a Markdown table to the console - ```bash - $ python python/sglang/multimodal_gen/benchmarks/compare_perf.py baseline.json new.json [new2.json ...] - ### Performance Comparison Report - ... - ``` -4. **Paste**: paste the table into the PR description - -## 3. CI-Based Change Protection - -Consider adding tests to the `pr-test` or `nightly-test` suites to safeguard your changes, especially for PRs that: - -1. support a new model -2. 
support or fix important features -3. significantly improve performance - -See [test](https://github.com/sgl-project/sglang/tree/main/python/sglang/multimodal_gen/test) for examples - ---- - -# How to Support New Diffusion Models - -SGLang diffusion uses a modular pipeline architecture built around two key concepts: - -- **`ComposedPipeline`**: Orchestrates `PipelineStage`s to define the complete generation process -- **`PipelineStage`**: Modular components (prompt encoding, denoising loop, VAE decoding, etc.) - -To add a new model, you'll need to define: -1. **`PipelineConfig`**: Static model configurations (paths, precision settings) -2. **`SamplingParams`**: Runtime generation parameters (prompt, guidance_scale, steps) -3. **`ComposedPipeline`**: Chain together pipeline stages -4. **Modules**: Model components (text_encoder, transformer, vae, scheduler) - -For the complete implementation guide with examples, see: **[How to Support New Diffusion Models](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/support_new_models.md)** - ---- - -## References - -- [SGLang GitHub](https://github.com/sgl-project/sglang) -- [Cache-DiT](https://github.com/vipshop/cache-dit) -- [FastVideo](https://github.com/hao-ai-lab/FastVideo) -- [xDiT](https://github.com/xdit-project/xDiT) -- [Diffusers](https://github.com/huggingface/diffusers) diff --git a/docs/supported_models/extending/index.rst b/docs/supported_models/extending/index.rst new file mode 100644 index 000000000000..dbd5ff6cece4 --- /dev/null +++ b/docs/supported_models/extending/index.rst @@ -0,0 +1,12 @@ +Extending SGLang +================ + +Adding new models and alternative backends. + +.. 
toctree:: + :maxdepth: 1 + + support_new_models.md + transformers_fallback.md + modelscope.md + mindspore_models.md diff --git a/docs/supported_models/mindspore_models.md b/docs/supported_models/extending/mindspore_models.md similarity index 92% rename from docs/supported_models/mindspore_models.md rename to docs/supported_models/extending/mindspore_models.md index 0f8fc342bdf0..caa5ade9c166 100644 --- a/docs/supported_models/mindspore_models.md +++ b/docs/supported_models/extending/mindspore_models.md @@ -6,8 +6,8 @@ MindSpore is a high-performance AI framework optimized for Ascend NPUs. This doc ## Requirements -MindSpore currently only supports Ascend NPU devices. Users need to first install Ascend CANN software packages. -The CANN software packages can be downloaded from the [Ascend Official Website](https://www.hiascend.com). The recommended version is 8.3.RC2. +MindSpore currently only supports Ascend NPU devices. Users need to first install Ascend CANN 8.5. +The CANN software packages can be downloaded from the [Ascend Official Website](https://www.hiascend.com). ## Supported Models @@ -19,7 +19,7 @@ Currently, the following models are supported: ## Installation -> **Note**: Currently, MindSpore models are provided by an independent package `sgl-mindspore`. Support for MindSpore is built upon current SGLang support for Ascend NPU platform. Please first [install SGLang for Ascend NPU](../platforms/ascend_npu.md) and then install `sgl-mindspore`: +> **Note**: Currently, MindSpore models are provided by an independent package `sgl-mindspore`. Support for MindSpore is built upon current SGLang support for Ascend NPU platform. Please first [install SGLang for Ascend NPU](../../platforms/ascend/ascend_npu.md) and then install `sgl-mindspore`: ```shell git clone https://github.com/mindspore-lab/sgl-mindspore.git @@ -32,9 +32,9 @@ pip install -e . Current SGLang-MindSpore supports Qwen3 and DeepSeek V3/R1 models. This doc uses Qwen3-8B as an example. 
-### Offline infer +### Offline inference -Use the following script for offline infer: +Use the following script for offline inference: ```python import sglang as sgl diff --git a/docs/supported_models/modelscope.md b/docs/supported_models/extending/modelscope.md similarity index 100% rename from docs/supported_models/modelscope.md rename to docs/supported_models/extending/modelscope.md diff --git a/docs/supported_models/extending/support_new_models.md b/docs/supported_models/extending/support_new_models.md new file mode 100644 index 000000000000..7951631e9e21 --- /dev/null +++ b/docs/supported_models/extending/support_new_models.md @@ -0,0 +1,520 @@ +# How to Support New Models + +This document explains how to add support for new language models and multimodal large language models (MLLMs) in +SGLang. It also covers how to test new models and register external implementations. + +## How to Support a New Language Model + +To support a new model in SGLang, you only need to add a single file under +the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn +from existing model implementations and create a new file for your model. For most models, you should be able to find a +similar model to start with (e.g., starting from Llama). Also refer to how +to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang). + +## How to Support a New Multimodal Large Language Model + +To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the +standard LLM support: + +1. **Register your new model as multimodal**: + Extend `is_multimodal_model` + in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561) + to return `True` for your model. + +2.
**Register a new chat-template**: + This is needed only when your default chat template cannot accept images as input. In that case, register a new chat template in [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/parser/conversation.py) along with the corresponding matching function. + +3. **Multimodal Data Processor**: + Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your + model’s dedicated processor. + See [multimodal_processor.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors) + for more details. + +4. **Handle Multimodal Tokens**: + Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be + expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data + with `RadixAttention`. + +5. **Handle Image Feature Extraction**: + Implement a `get_image_feature` function for your new model, which extracts image features from raw image data and converts them into the embeddings used by the language model. + +6. **Adapt to Vision Attention**: + Replace the multi-headed `Attention` of the ViT with SGLang’s `VisionAttention`. + +You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or +other MLLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs. + +## Testing and Debugging + +Please note all your testing and benchmarking results in the PR description. + +### Interactive Debugging + +For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang.
The following two commands +should give the same text output and very similar prefill logits: + +- Get the reference output: + ```bash + python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm} + ``` +- Get the SGLang output: + ```bash + python3 -m sglang.bench_one_batch --correct --model [new model] + ``` + +### Add the Model to the Test Suite + +To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in +the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/registered/models/test_generation_models.py) +file. Then test the new model on your local machine and report the results on representative benchmarks (GSM8K, MMLU, MMMU, +MMMU-Pro, etc.) in your PR. \\ +For VLMs, also include a test in `test_vision_openai_server_{x}.py` (e.g. [test_vision_openai_server_a.py](https://github.com/sgl-project/sglang/blob/main/test/registered/vlm/test_vision_openai_server_a.py)). + +The following example command tests a new model on your local machine: + +```bash +ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others +``` + +### Benchmark + +- **(Required) MMMU**: follow the MMMU benchmark [README.md](https://github.com/sgl-project/sglang/blob/main/benchmark/mmmu/README.md) to get an SGLang vs. HF Transformers accuracy comparison. The accuracy score from the SGLang run should not be much lower than that from the HF Transformers run. Similarly, follow https://docs.sglang.io/developer_guide/benchmark_and_profiling.html to get a performance comparison: TTFT and throughput must meet or exceed baselines (e.g., HF Transformers). +- **(Optional) Other evals**: If you ran other evals, please note the results in the PR description. + +## Port a Model from vLLM to SGLang + +The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable +resource, as vLLM covers many models.
SGLang reuses vLLM’s interface and some layers, making it easier to port models +from vLLM to SGLang. + +To port a model from vLLM to SGLang: + +- Compare these two files for guidance: + - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) + - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) +- The major differences include: + - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). + - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** + - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** + - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. + - **Remove `Sample`.** + - **Change the `forward()` functions** and add a `forward_batch()` method. + - **Add `EntryClass`** at the end. + - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. + +Note: make sure you add your new model to the supported models list in the supported models documentation. + +## Registering an External Model Implementation + +In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. +This allows you to integrate your model without modifying the source code. + +For example: + +```python +from sglang.srt.models.registry import ModelRegistry +from sglang.srt.entrypoints.http_server import launch_server + +# For a single model, add it to the registry: +ModelRegistry.models[model_name] = model_class + +# For multiple models, you can imitate the import_model_classes() function: +from functools import lru_cache + +@lru_cache() +def import_new_model_classes(): + model_arch_name_to_cls = {} + # Populate model_arch_name_to_cls with your new model classes. + ... 
+ return model_arch_name_to_cls + +ModelRegistry.models.update(import_new_model_classes()) + +# Launch the server with your server arguments: +launch_server(server_args) +``` + +## Example: Implementing and Serving a Llama Wrapper Model + +Below is an introductory, step-by-step walkthrough on how to implement a new model end-to-end in SGLang and then run it via the [Offline Engine](https://github.com/sgl-project/sglang/blob/main/docs/basic_usage/offline_engine_api.ipynb). + +### Implementing Our Model + +To keep things simple, this new model will be a simple wrapper around [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), and our goal will be just to bias the output logits for each `forward` call by taking the square root of each individual logit. + +Let's start by defining our model in a file called `llama_wrapper.py`. +The first step is to import the necessary libraries from SRT, which is SGLang's internal backend. + +```python +# In the file `llama_wrapper.py` + +import torch +from transformers import LlamaConfig +from typing import Optional +from sglang.srt.layers.logits_processor import LogitsProcessorOutput +from sglang.srt.layers.quantization.base_config import QuantizationConfig +from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors + +from sglang.srt.models.llama import LlamaForCausalLM +``` + +Next, we declare a new `class` for our model and have it inherit from `LlamaForCausalLM`, which allows our model to access `LlamaForCausalLM`'s predefined modules and layers, such as `LlamaAttention` and `LlamaMLP`. +Note that almost all model implementations take in `config` and `quant_config` as arguments for their `__init__` method; `config` and `quant_config` are passed in via [`model_loader/loader.py`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_loader/loader.py#L219). 
+Because we have inherited from `LlamaForCausalLM`, we can pass our parameters directly to its constructor, which will set the member variables for us. + +```python +class LlamaWrapper(LlamaForCausalLM): + def __init__( + self, + config: LlamaConfig, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + ) -> None: + super().__init__(config=config, quant_config=quant_config, prefix=prefix) +``` + +Now, we want to define the `forward` method, which is what will be called at inference time. +Note that the signature for `forward` is essentially the same for any model; you can take a look at the other models defined in the [`models` directory](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/) for reference. +To see where exactly `forward` is called in the SGLang runtime's internals, take a look at [`forward_decode`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1705) and [`forward_extend`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1724) in the [`ModelRunner` class](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py). + +```python + @torch.no_grad() + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + forward_batch: ForwardBatch, + pp_proxy_tensors: Optional[PPProxyTensors] = None, + input_embeds: Optional[torch.Tensor] = None, + get_embedding: bool = False, + ) -> LogitsProcessorOutput: +``` + +We now call the `__call__` method for `self.model` (which is a member variable that `LlamaForCausalLM` defines in its `__init__` method), which eventually invokes that inner model's own `forward` method. +After that, we feed the `hidden_states` into our model's `LogitsProcessor` (again defined in `LlamaForCausalLM`).
+ +```python + hidden_states = self.model( + input_ids, + positions, + forward_batch, + input_embeds, + pp_proxy_tensors=pp_proxy_tensors, + ) + + res: LogitsProcessorOutput = self.logits_processor( + input_ids, + hidden_states, + self.lm_head, + forward_batch, + ) +``` + +After receiving the logits for the next token, we can finally perform our biasing step. + +```python + orig_logits = res.next_token_logits + res.next_token_logits = torch.where( + orig_logits > 0, + orig_logits.sqrt(), + orig_logits + ) + + return res +``` + +Now, our `LlamaWrapper` model is created and ready to be served! + +### Serving Our Model Via SGLang's Offline Engine + +The next step of this walkthrough involves hosting our new model offline, so that it can be served locally and without an HTTP server. + +First, create a new file called `run.py`. +Now, we must ensure that SGLang's `ModelRegistry` can find our model. +To do this, we first download the model's configuration and weights from Huggingface. + +```python +# In the file `run.py` + +import asyncio +from functools import lru_cache +from huggingface_hub import snapshot_download +from llama_wrapper import LlamaWrapper # Make sure to import our new model! +import sglang as sgl +from sglang.srt.models.registry import ModelRegistry + +# Make sure to request access to this model on Huggingface, then export your +# `HF_TOKEN` to download the model snapshot +llama_dir = snapshot_download( + repo_id="meta-llama/Llama-3.1-8B-Instruct", + local_dir="./llama_ckpt", +) +``` + +Now that we have our model on disk, we want to point it to `LlamaWrapper` by changing the `architectures` field in `./llama_ckpt/config.json` to be `LlamaWrapper`. +That way, when we pass in the path of our model checkpoint to SGLang, it will know that we want to use "LlamaWrapper" instead of "LlamaForCausalLM" as our model. + +```python +{ + "architectures": [ + # "LlamaForCausalLM" + "LlamaWrapper" + ], + ... 
+} +``` + +However, if we don't link our `LlamaWrapper` class to the "LlamaWrapper" registry keyword, then SGLang won't be able to find our model. +Thus, to register our `LlamaWrapper`, we want to follow the steps in the above section titled "Registering an External Model Implementation". + +```python +@lru_cache() +def import_new_model_classes(): + model_arch_name_to_cls = {"LlamaWrapper": LlamaWrapper} + return model_arch_name_to_cls + +ModelRegistry.models.update(import_new_model_classes()) +``` + +Lastly, when we create our `Engine`, we just pass in the path to the local model directory. +Then, our `LlamaWrapper` is ready to be served; for this walkthrough, we will use SGLang `Engine`'s non-streaming asynchronous generation endpoint. + +```python +def main(): + llm = sgl.Engine(model_path="./llama_ckpt") + sampling_params = {"temperature": 0.2, "top_k": 5} + prompts = [ + "Write a short, neutral self-introduction for a fictional character. Hello, my name is", + "Provide a concise factual statement about France’s capital city. The capital of France is", + "Explain possible future trends in artificial intelligence. The future of AI is", + ] + + asyncio.run(run_llm(llm, sampling_params, prompts)) + + llm.shutdown() + +async def run_llm( + llm, + sampling_params, + prompts, +) -> None: + outputs = await llm.async_generate(prompts, sampling_params) + + for prompt, output in zip(prompts, outputs): + print(f"\nPrompt: {prompt}") + print(f"Generated text: {output['text']}") + +if __name__ == "__main__": + main() +``` + +Now, when we call `python run.py`, we will get the outputs of our newly created model! + +## Serving External Models via the Standard CLI + +The previous sections show how to register a model programmatically via `ModelRegistry` and serve it through the Offline Engine. 
Similar to vLLM's model plugin mechanism, there is an alternative that lets you keep using the standard `python -m sglang.launch_server` CLI without modifying any SGLang source code: you can register your model using the `SGLANG_EXTERNAL_MODEL_PACKAGE` environment variable. + +### The `EntryClass` Variable + +When SGLang scans a model package, it looks for the variable `EntryClass` at the module level of your Python file. The [model registry](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/registry.py) imports your file, checks for `EntryClass`, and registers the class assigned to it. If your model is based on a Hugging Face checkpoint, the name of this class needs to match the `"architectures"` field in your model's `config.json`. + +For example, if you are implementing a Llama wrapper, add this line at the end of your model file: + +```python +# This is what "Add EntryClass at the end" means +EntryClass = LlamaWrapper +``` + +### Example: Text-Only Model + +Using the same Llama wrapper from the previous section, here is how to package and serve it via the CLI. + +1. Create your project + +``` +sglang_custom_project/ +|----setup.py +|----custom_llm/ + |----__init__.py + |----llama_wrapper.py +``` + +Write the `setup.py`: + +```python +# sglang_custom_project/setup.py + +from setuptools import setup, find_packages +setup( + name="sglang-custom-plugins", + version="0.1", + packages=find_packages(), +) +``` + +2.
Write your model code + +Inside `llama_wrapper.py`, write your model and include `EntryClass`: + +```python +# sglang_custom_project/custom_llm/llama_wrapper.py + +import torch +from typing import Optional +from sglang.srt.layers.logits_processor import LogitsProcessorOutput +from sglang.srt.layers.quantization.base_config import QuantizationConfig +from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors +from sglang.srt.models.llama import LlamaForCausalLM + +class LlamaWrapper(LlamaForCausalLM): + def __init__(self, config, quant_config: Optional[QuantizationConfig] = None, + prefix: str = "") -> None: + super().__init__(config=config, quant_config=quant_config, prefix=prefix) + @torch.no_grad() + def forward(self, input_ids, positions, forward_batch, + pp_proxy_tensors=None, input_embeds=None, get_embedding=False): + hidden_states = self.model( + input_ids, positions, forward_batch, input_embeds, + pp_proxy_tensors=pp_proxy_tensors, + ) + res: LogitsProcessorOutput = self.logits_processor( + input_ids, hidden_states, self.lm_head, forward_batch, + ) + + orig = res.next_token_logits + res.next_token_logits = torch.where(orig > 0, orig.sqrt(), orig) + return res + +# Don't forget to add EntryClass +EntryClass = LlamaWrapper +``` + +3. Install your package + +Run this inside your `sglang_custom_project` directory to install your code into the active Python environment: + +```bash +pip install -e . +``` + +4. Update your `config.json` + +Update the `config.json` under your HuggingFace model checkpoint directory so the `architectures` field matches your class name: + +```json +{ + "architectures": ["LlamaWrapper"], + ... +} +``` + +5. 
Launch the server + +Set the environment variable before running the CLI: + +```bash +export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_llm +python -m sglang.launch_server \ + --model-path /path/to/Llama-3.1-8B-Instruct \ + --port 8000 +``` + +`SGLANG_EXTERNAL_MODEL_PACKAGE` should be set to the name of the package (the folder) that contains your model code. In this example, that is `custom_llm`. + +### Example: Multimodal Model + +If you are working with multimodal models, setting `SGLANG_EXTERNAL_MODEL_PACKAGE` alone is not enough. SGLang also needs to recognize your architecture as multimodal to enable the image/video processing pipelines, and it needs a custom processor. + +You can handle this by setting two additional environment variables: + +- `SGLANG_EXTERNAL_MM_MODEL_ARCH`: Adds your architecture name to SGLang's internal list of multimodal models. +- `SGLANG_EXTERNAL_MM_PROCESSOR_PACKAGE`: Tells SGLang where to find your custom processor class. + +For example, let's build a custom model based on Qwen2-VL-Instruct that takes the square root of each positive logit.
+ +Create the project: + +``` +sglang_custom_project_vl/ +|----setup.py +|----custom_vlm/ + |----__init__.py + |----qwenvl_wrapper.py +``` + +Write `setup.py`: + +```python +# sglang_custom_project_vl/setup.py + +from setuptools import setup, find_packages +setup( + name="sglang-custom-plugins-vl", + version="0.1", + packages=find_packages(), +) +``` + +Write the model in `qwenvl_wrapper.py`: + +```python +# sglang_custom_project_vl/custom_vlm/qwenvl_wrapper.py +import torch +from sglang.srt.models.qwen2_vl import Qwen2VLForConditionalGeneration +from sglang.srt.multimodal.processors.qwen_vl import QwenVLImageProcessor + +class CustomQwen2VL(Qwen2VLForConditionalGeneration): + def forward(self, input_ids, positions, forward_batch, + input_embeds=None, get_embedding=False): + res = super().forward( + input_ids, positions, forward_batch, + input_embeds=input_embeds, get_embedding=get_embedding + ) + if not get_embedding: + orig = res.next_token_logits + res.next_token_logits = torch.where(orig > 0, orig.sqrt(), orig) + return res + +class CustomQwen2VLProcessor(QwenVLImageProcessor): + models = [CustomQwen2VL] + + def __init__(self, hf_config, server_args, _processor, *args, **kwargs): + super().__init__(hf_config, server_args, _processor, *args, **kwargs) + +EntryClass = CustomQwen2VL +``` + +**Note:** you don't need a separate `EntryClass` for the custom processor as long as you associate the processor with the specific model class. + +Install the package, update `config.json`, and launch: + +```bash +pip install -e . +``` + +```json +{ + "architectures": ["CustomQwen2VL"], + ... 
+} +``` + +```bash +export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_vlm +export SGLANG_EXTERNAL_MM_MODEL_ARCH=CustomQwen2VL +export SGLANG_EXTERNAL_MM_PROCESSOR_PACKAGE=custom_vlm + +python -m sglang.launch_server \ + --model-path /path/to/Qwen2-VL-2B-Instruct \ + --port 8000 \ + --enable-multimodal +``` + +## Documentation + +Add your model to the table of supported models in [generative_models.md](../text_generation/generative_models.md) or [multimodal_language_models.md](../text_generation/multimodal_language_models.md). + +--- + +By following these guidelines, you can add support for new language models and multimodal large language models in +SGLang and ensure they are thoroughly tested and easily integrated into the system. diff --git a/docs/supported_models/transformers_fallback.md b/docs/supported_models/extending/transformers_fallback.md similarity index 92% rename from docs/supported_models/transformers_fallback.md rename to docs/supported_models/extending/transformers_fallback.md index 3c7dd961c142..cd80d561236a 100644 --- a/docs/supported_models/transformers_fallback.md +++ b/docs/supported_models/extending/transformers_fallback.md @@ -18,7 +18,7 @@ python3 -m sglang.launch_server \ ### Quantization -Transformers fall back has supported most of available quantization in SGLang (except GGUF). See [Quantization page](../advanced_features/quantization.md) for more information about supported quantization in SGLang. +Transformers fall back has supported most of available quantization in SGLang (except GGUF). See [Quantization page](../../advanced_features/quantization.md) for more information about supported quantization in SGLang. ### Remote code diff --git a/docs/supported_models/index.rst b/docs/supported_models/index.rst new file mode 100644 index 000000000000..f90c6fba104c --- /dev/null +++ b/docs/supported_models/index.rst @@ -0,0 +1,13 @@ +Supported Models +================ + +SGLang supports a wide variety of model architectures for different use cases.
+Browse by category below to find models suited for your needs. + +.. toctree:: + :maxdepth: 2 + + text_generation/index + retrieval_ranking/index + specialized/index + extending/index diff --git a/docs/supported_models/multimodal_language_models.md b/docs/supported_models/multimodal_language_models.md deleted file mode 100644 index 1677bb574b57..000000000000 --- a/docs/supported_models/multimodal_language_models.md +++ /dev/null @@ -1,112 +0,0 @@ -# Multimodal Language Models - -These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders. - -## Example launch Command - -```shell -python3 -m sglang.launch_server \ - --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \ # example HF/local path - --host 0.0.0.0 \ - --port 30000 \ -``` - -> See the [OpenAI APIs section](https://docs.sglang.io/basic_usage/openai_api_vision.html) for how to send multimodal requests. - -## Supported models - -Below the supported models are summarized in a table. - -If you are unsure if a specific architecture is implemented, you can search for it via GitHub. For example, to search for `Qwen2_5_VLForConditionalGeneration`, use the expression: - -``` -repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen2_5_VLForConditionalGeneration -``` - -in the GitHub search bar. - - -| Model Family (Variants) | Example HuggingFace Identifier | Description | Notes | -|----------------------------|--------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------| -| **Qwen-VL** | `Qwen/Qwen3-VL-235B-A22B-Instruct` | Alibaba's vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. 
| | -| **DeepSeek-VL2** | `deepseek-ai/deepseek-vl2` | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. | | -| **Janus-Pro** (1B, 7B) | `deepseek-ai/Janus-Pro-7B` | DeepSeek's open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. | | -| **MiniCPM-V / MiniCPM-o** | `openbmb/MiniCPM-V-2_6` | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. | | -| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. | | -| **LLaVA** (v1.5 & v1.6) | *e.g.* `liuhaotian/llava-v1.5-13b` | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. | | -| **LLaVA-NeXT** (8B, 72B) | `lmms-lab/llava-next-72b` | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. | | -| **LLaVA-OneVision** | `lmms-lab/llava-onevision-qwen2-7b-ov` | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. | | -| **Gemma 3 (Multimodal)** | `google/gemma-3-4b-it` | Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. | | -| **Kimi-VL** (A3B) | `moonshotai/Kimi-VL-A3B-Instruct` | Kimi-VL is a multimodal model that can understand and generate text from images. 
| | -| **Mistral-Small-3.1-24B** | `mistralai/Mistral-Small-3.1-24B-Instruct-2503` | Mistral 3.1 is a multimodal model that can generate text from text or images input. It also supports tool calling and structured output. | | -| **Phi-4-multimodal-instruct** | `microsoft/Phi-4-multimodal-instruct` | Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. It supports text, vision and audio modalities in SGLang. | | -| **MiMo-VL** (7B) | `XiaomiMiMo/MiMo-VL-7B-RL` | Xiaomi's compact yet powerful vision-language model featuring a native resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. | | -| **GLM-4.5V** (106B) / **GLM-4.1V**(9B) | `zai-org/GLM-4.5V` | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Use `--chat-template glm-4v` | -| **DotsVLM** (General/OCR) | `rednote-hilab/dots.vlm1.inst` | RedNote's vision-language model built on a 1.2B vision encoder and DeepSeek V3 LLM, featuring NaViT vision encoder trained from scratch with dynamic resolution support and enhanced OCR capabilities through structured image data training. | | -| **DotsVLM-OCR** | `rednote-hilab/dots.ocr` | Specialized OCR variant of DotsVLM optimized for optical character recognition tasks with enhanced text extraction and document understanding capabilities. | Don't use `--trust-remote-code` | -| **NVILA** (8B, 15B, Lite-2B, Lite-8B, Lite-15B) | `Efficient-Large-Model/NVILA-8B` | `chatml` | NVILA explores the full stack efficiency of multi-modal design, achieving cheaper training, faster deployment and better performance. 
| -| **NVIDIA Nemotron Nano 2.0 VL** | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | NVIDIA Nemotron Nano v2 VL enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A and summarization capabilities. It builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, in order to achieve higher inference throughput in long document and video scenarios. | Use `--trust-remote-code`. You may need to adjust `--max-mamba-cache-size` [default is 512] to fit memory constraints. | -| **Ernie4.5-VL** | `baidu/ERNIE-4.5-VL-28B-A3B-PT` | Baidu's vision-language models(28B,424B). Support image and video comprehension, and also support thinking. | | -| **JetVLM** | | JetVLM is an vision-language model designed for high-performance multimodal understanding and generation tasks built upon Jet-Nemotron. | Coming soon | - -## Video Input Support - -SGLang supports video input for Vision-Language Models (VLMs), enabling temporal reasoning tasks such as video question answering, captioning, and holistic scene understanding. Video clips are decoded, key frames are sampled, and the resulting tensors are batched together with the text prompt, allowing multimodal inference to integrate visual and linguistic context. - -| Model Family | Example Identifier | Video notes | -|--------------|--------------------|-------------| -| **Qwen-VL** (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-Omni) | `Qwen/Qwen3-VL-235B-A22B-Instruct` | The processor gathers `video_data`, runs Qwen's frame sampler, and merges the resulting features with text tokens before inference. | -| **GLM-4v** (4.5V, 4.1V, MOE) | `zai-org/GLM-4.5V` | Video clips are read with Decord, converted to tensors, and passed to the model alongside metadata for rotary-position handling. | -| **NVILA** (Full & Lite) | `Efficient-Large-Model/NVILA-8B` | The runtime samples eight frames per clip and attaches them to the multimodal request when `video_data` is present. 
| -| **LLaVA video variants** (LLaVA-NeXT-Video, LLaVA-OneVision) | `lmms-lab/LLaVA-NeXT-Video-7B` | The processor routes video prompts to the LlavaVid video-enabled architecture, and the provided example shows how to query it with `sgl.video(...)` clips. | -| **NVIDIA Nemotron Nano 2.0 VL** | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | The processor samples at 2 FPS, at a max of 128 frames, as per model training. The model uses [EVS](../../python/sglang/srt/multimodal/evs/README.md), a pruning method that removes redundant tokens from video embeddings. By default `video_pruning_rate=0.7`. Change this by providing: `--json-model-override-args '{"video_pruning_rate": 0.0}'` to disable EVS, for example. | -| **JetVLM** | | The runtime samples eight frames per clip and attaches them to the multimodal request when `video_data` is present. | - -Use `sgl.video(path, num_frames)` when building prompts to attach clips from your SGLang programs. - -Example OpenAI-compatible request that sends a video clip: - -```python -import requests - -url = "http://localhost:30000/v1/chat/completions" - -data = { - "model": "Qwen/Qwen3-VL-30B-A3B-Instruct", - "messages": [ - { - "role": "user", - "content": [ - {"type": "text", "text": "What’s happening in this video?"}, - { - "type": "video_url", - "video_url": { - "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4" - }, - }, - ], - } - ], - "max_tokens": 300, -} - -response = requests.post(url, json=data) -print(response.text) -``` - -## Usage Notes - -### Performance Optimization - -For multimodal models, you can use the `--keep-mm-feature-on-device` flag to optimize for latency at the cost of increased GPU memory usage: - -- **Default behavior**: Multimodal feature tensors are moved to CPU after processing to save GPU memory -- **With `--keep-mm-feature-on-device`**: Feature tensors remain on GPU, reducing device-to-host copy overhead and improving latency, but consuming more 
GPU memory - -Use this flag when you have sufficient GPU memory and want to minimize latency for multimodal inference. - -### Multimodal Inputs Limitation - -- **Use `--mm-process-config '{"image":{"max_pixels":1048576},"video":{"fps":3,"max_pixels":602112,"max_frames":60}}'`**: To set `image`, `video`, and `audio` input limits. - -This can reduce GPU memory usage, improve inference speed, and help to avoid OOM, but may impact model performance, thus set a proper value based on your specific use case. Currently, only `qwen_vl` supports this config. Please refer to [qwen_vl processor](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/multimodal/processors/qwen_vl.py) for understanding the meaning of each parameter. diff --git a/docs/supported_models/classify_models.md b/docs/supported_models/retrieval_ranking/classify_models.md similarity index 100% rename from docs/supported_models/classify_models.md rename to docs/supported_models/retrieval_ranking/classify_models.md diff --git a/docs/supported_models/embedding_models.md b/docs/supported_models/retrieval_ranking/embedding_models.md similarity index 100% rename from docs/supported_models/embedding_models.md rename to docs/supported_models/retrieval_ranking/embedding_models.md diff --git a/docs/supported_models/retrieval_ranking/index.rst b/docs/supported_models/retrieval_ranking/index.rst new file mode 100644 index 000000000000..e7c669f9b7be --- /dev/null +++ b/docs/supported_models/retrieval_ranking/index.rst @@ -0,0 +1,11 @@ +Retrieval & Ranking +=================== + +Models for embeddings, reranking, and classification. + +.. 
toctree:: + :maxdepth: 1 + + embedding_models.md + rerank_models.md + classify_models.md diff --git a/docs/supported_models/rerank_models.md b/docs/supported_models/retrieval_ranking/rerank_models.md similarity index 96% rename from docs/supported_models/rerank_models.md rename to docs/supported_models/retrieval_ranking/rerank_models.md index bb989128a8ec..12f3e05e28da 100644 --- a/docs/supported_models/rerank_models.md +++ b/docs/supported_models/retrieval_ranking/rerank_models.md @@ -161,6 +161,7 @@ Example (with `top_n: 2`): ### Common Pitfalls +- **`--chat-template` is required.** Without `--chat-template examples/chat_template/qwen3_reranker.jinja`, the server does not recognize the model as a decoder-only reranker and returns a 400 error: `"This model does not appear to be an embedding model by default. Please add `--is-embedding`..."`. The fix is to add the chat template flag, NOT `--is-embedding`. - If you launch Qwen3-Reranker with `--is-embedding`, `/v1/rerank` cannot compute yes/no logprob scores. Relaunch **without** `--is-embedding`. - If you see a validation error like "score should be a valid number" and the backend returned a list, upgrade to a version that coerces `embedding[0]` into `score` for rerank responses. diff --git a/docs/supported_models/specialized/index.rst b/docs/supported_models/specialized/index.rst new file mode 100644 index 000000000000..40d108acb3df --- /dev/null +++ b/docs/supported_models/specialized/index.rst @@ -0,0 +1,9 @@ +Specialized Models +================== + +Models for specialized tasks like reward modeling. + +.. 
toctree:: + :maxdepth: 1 + + reward_models.md diff --git a/docs/supported_models/reward_models.md b/docs/supported_models/specialized/reward_models.md similarity index 100% rename from docs/supported_models/reward_models.md rename to docs/supported_models/specialized/reward_models.md diff --git a/docs/supported_models/support_new_models.md b/docs/supported_models/support_new_models.md deleted file mode 100644 index b71e06c47c9c..000000000000 --- a/docs/supported_models/support_new_models.md +++ /dev/null @@ -1,320 +0,0 @@ -# How to Support New Models - -This document explains how to add support for new language models and multimodal large language models (MLLMs) in -SGLang. It also covers how to test new models and register external implementations. - -## How to Support a New Language Model - -To support a new model in SGLang, you only need to add a single file under -the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn -from existing model implementations and create a new file for your model. For most models, you should be able to find a -similar model to start with (e.g., starting from Llama). Also refer how -to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang) - -## How to Support a New Multimodal Large Language Model - -To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the -standard LLM support: - -1. **Register your new model as multimodal**: - Extend `is_multimodal_model` - in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561) - to return `True` for your model. - -2. 
**Register a new chat-template**: - Only when your default chat-template is unable to accept images as input: Register a new chat template in [conversation.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/conversation.py) and the corresponding matching function. - -3. **Multimodal Data Processor**: - Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your - model’s dedicated processor. - See [multimodal_processor.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors) - for more details. - -4. **Handle Multimodal Tokens**: - Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be - expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data - with `RadixAttention`. - -5. **Handle Image Feature Extraction**: - Implement a `get_image_feature` function for your new model, which extracts image features from raw image data and converts them into the embeddings used by the language model. - -6. **Adapt to Vision Attention**: - Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`. - -You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or -other mllm implementations. These models demonstrate how to correctly handle both multimodal and textual inputs. - -## Testing and Debugging - -Please note all your testing and benchmarking results in PR description. - -### Interactive Debugging - -For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. 
The following two commands -should give the same text output and very similar prefill logits: - -- Get the reference output: - ```bash - python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm} - ``` -- Get the SGLang output: - ```bash - python3 -m sglang.bench_one_batch --correct --model [new model] - ``` - -### Add the Model to the Test Suite - -To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in -the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) -file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, -MMMU-Pro, etc.) in your PR. \\ -For VLMs, also include a test in `test_vision_openai_server_{x}.py` (e.g. [test_vision_openai_server_a.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server_a.py), [test_vision_openai_server_b.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server_b.py)). - - -This is an example command to run to test a new model on your local machine: - -```bash -ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others -``` - -### Benchmark - -- **(Required) MMMU**: follow MMMU benchmark [README.md](https://github.com/sgl-project/sglang/blob/main/benchmark/mmmu/README.md) to get SGLang vs. HF Transformer accuracy comparison. The accuracy score from SGLang run should not be much lower than that from HF Transformer run. Similarly, follow https://docs.sglang.io/developer_guide/benchmark_and_profiling.html to get performance comparison: TTFT and throughput must meet or exceed baselines (e.g., HF Transformer). -- **(Optional) Other evals**: If you ran other evals, please note the results in PR description. 
- -## Port a Model from vLLM to SGLang - -The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable -resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models -from vLLM to SGLang. - -To port a model from vLLM to SGLang: - -- Compare these two files for guidance: - - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) - - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) -- The major differences include: - - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). - - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** - - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** - - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. - - **Remove `Sample`.** - - **Change the `forward()` functions** and add a `forward_batch()` method. - - **Add `EntryClass`** at the end. - - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. - -Note: make sure you add your new model to the supported models list in the supported models documentation. - -## Registering an External Model Implementation - -In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. -This allows you to integrate your model without modifying the source code. 
- -For example: - -```python -from sglang.srt.models.registry import ModelRegistry -from sglang.srt.entrypoints.http_server import launch_server - -# For a single model, add it to the registry: -ModelRegistry.models[model_name] = model_class - -# For multiple models, you can imitate the import_model_classes() function: -from functools import lru_cache - -@lru_cache() -def import_new_model_classes(): - model_arch_name_to_cls = {} - # Populate model_arch_name_to_cls with your new model classes. - ... - return model_arch_name_to_cls - -ModelRegistry.models.update(import_new_model_classes()) - -# Launch the server with your server arguments: -launch_server(server_args) -``` - -## Example: Implementing and Serving a Llama Wrapper Model - -Below is an introductory, step-by-step walkthrough on how to implement a new model end-to-end in SGLang and then run it via the [Offline Engine](https://github.com/sgl-project/sglang/blob/main/docs/basic_usage/offline_engine_api.ipynb). - -### Implementing Our Model - -To keep things simple, this new model will be a simple wrapper around [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), and our goal will be just to bias the output logits for each `forward` call by taking the square root of each individual logit. - -Let's start by defining our model in a file called `llama_wrapper.py`. -The first step is to import the necessary libraries from SRT, which is SGLang's internal backend. 
- -```python -# In the file `llama_wrapper.py` - -import torch -from transformers import LlamaConfig -from typing import Optional -from sglang.srt.layers.logits_processor import LogitsProcessorOutput -from sglang.srt.layers.quantization.base_config import QuantizationConfig -from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors - -from sglang.srt.models.llama import LlamaForCausalLM -``` - -Next, we declare a new `class` for our model and have it inherit from `LlamaForCausalLM`, which allows our model to access `LlamaForCausalLM`'s predefined modules and layers, such as `LlamaAttention` and `LlamaMLP`. -Note that almost all model implementations take in `config` and `quant_config` as arguments for their `__init__` method; `config` and `quant_config` are passed in via [`model_loader/loader.py`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_loader/loader.py#L219). -Because we have inherited from `LlamaForCausalLM`, we can pass our parameters directly to its constructor, which will set the member variables for us. - -```python -class LlamaWrapper(LlamaForCausalLM): - def __init__( - self, - config: LlamaConfig, - quant_config: Optional[QuantizationConfig] = None, - prefix: str = "", - ) -> None: - super().__init__(config=config, quant_config=quant_config, prefix=prefix) -``` - -Now, we want to define the `forward` method, which is what will be called at inference time. -Note that the signature for `forward` is essentially the same for any model; you can take a look at the other models defined in the [`models` directory](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/) for references. 
-To see where exactly `forward` is called in the SGLang runtime's internals, take a look at [`forward_decode`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1705) and [`forward_extend`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1724) in the [`ModelRunner` class](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py). - -```python - @torch.no_grad() - def forward( - self, - input_ids: torch.Tensor, - positions: torch.Tensor, - forward_batch: ForwardBatch, - pp_proxy_tensors: Optional[PPProxyTensors] = None, - input_embeds: Optional[torch.Tensor] = None, - get_embedding: bool = False, - ) -> LogitsProcessorOutput: -``` - -We now call the `__call__` method for `self.model` (which is a member variable that `LlamaForCausalLM` defines in its `__init__` method), which eventually calls `LlamaForCausalLM`'s `forward` method. -After that, we feed the `hidden_states` into our model's `LogitsProcessor` (again defined in `LlamaForCausalLM`). - -```python - hidden_states = self.model( - input_ids, - positions, - forward_batch, - input_embeds, - pp_proxy_tensors=pp_proxy_tensors, - ) - - res: LogitsProcessorOutput = self.logits_processor( - input_ids, - hidden_states, - self.lm_head, - forward_batch, - ) -``` - -After receiving the logits for the next token, we can finally perform our biasing step. - -```python - orig_logits = res.next_token_logits - res.next_token_logits = torch.where( - orig_logits > 0, - orig_logits.sqrt(), - orig_logits - ) - - return res -``` -Now, our `LlamaWrapper` model is created and ready to be served! - -### Serving Our Model Via SGLang's Offline Engine - -The next step of this walkthrough involves hosting our new model offline, so that it can be served locally and without an HTTP server. - -First, create a new file called `run.py`. 
-Now, we must ensure that SGLang's `ModelRegistry` can find our model. -To do this, we first download the model's configuration and weights from Huggingface. - -```python -# In the file `run.py` - -import asyncio -from functools import lru_cache -from huggingface_hub import snapshot_download -from llama_wrapper import LlamaWrapper # Make sure to import our new model! -import sglang as sgl -from sglang.srt.models.registry import ModelRegistry - -# Make sure to request access to this model on Huggingface, then export your -# `HF_TOKEN` to download the model snapshot -llama_dir = snapshot_download( - repo_id="meta-llama/Llama-3.1-8B-Instruct", - local_dir="./llama_ckpt", -) -``` - -Now that we have our model on disk, we want to point it to `LlamaWrapper` by changing the `architectures` field in `./llama_ckpt/config.json` to be `LlamaWrapper`. -That way, when we pass in the path of our model checkpoint to SGLang, it will know that we want to use "LlamaWrapper" instead of "LlamaForCausalLM" as our model. - -```python -{ - "architectures": [ - # "LlamaForCausalLM" - "LlamaWrapper" - ], - ... -} -``` - -However, if we don't link our `LlamaWrapper` class to the "LlamaWrapper" registry keyword, then SGLang won't be able to find our model. -Thus, to register our `LlamaWrapper`, we want to follow the steps in the above section titled "Registering an External Model Implementation". - -```python -@lru_cache() -def import_new_model_classes(): - model_arch_name_to_cls = {"LlamaWrapper": LlamaWrapper} - return model_arch_name_to_cls - -ModelRegistry.models.update(import_new_model_classes()) -``` - -Lastly, when we create our `Engine`, we just pass in the path to the local model directory. -Then, our `LlamaWrapper` is ready to be served; for this walkthrough, we will use SGLang `Engine`'s non-streaming asynchronous generation endpoint. 
- -```python -def main(): - llm = sgl.Engine(model_path="./llama_ckpt") - sampling_params = {"temperature": 0.2, "top_k": 5} - prompts = [ - "Write a short, neutral self-introduction for a fictional character. Hello, my name is", - "Provide a concise factual statement about France’s capital city. The capital of France is", - "Explain possible future trends in artificial intelligence. The future of AI is", - ] - - asyncio.run(run_llm(llm, sampling_params, prompts)) - - llm.shutdown() - -async def run_llm( - llm, - sampling_params, - prompts, -) -> None: - outputs = await llm.async_generate(prompts, sampling_params) - - for prompt, output in zip(prompts, outputs): - print(f"\nPrompt: {prompt}") - print(f"Generated text: {output['text']}") - -if __name__ == "__main__": - main() -``` - -Now, when we call `python run.py`, we will get the outputs of our newly created model! - - -## Documentation -Add to table of supported models in [generative_models.md](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/generative_models.md) or [multimodal_language_models.md](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/multimodal_language_models.md) - ---- - -By following these guidelines, you can add support for new language models and multimodal large language models in -SGLang and ensure they are thoroughly tested and easily integrated into the system. diff --git a/docs/supported_models/text_generation/diffusion_language_models.md b/docs/supported_models/text_generation/diffusion_language_models.md new file mode 100644 index 000000000000..7dbb4828b695 --- /dev/null +++ b/docs/supported_models/text_generation/diffusion_language_models.md @@ -0,0 +1,111 @@ +# Diffusion Language Models + +Diffusion language models have shown promise for non-autoregressive text generation with parallel decoding capabilities. Unlike auto-regressive language models, different diffusion language models require different decoding strategies. 
+ +## Example Launch Command + +SGLang supports different DLLM algorithms such as `LowConfidence` and `JointThreshold`. + +```shell +python3 -m sglang.launch_server \ + --model-path inclusionAI/LLaDA2.0-mini \ # example HF/local path + --dllm-algorithm LowConfidence \ + --dllm-algorithm-config ./config.yaml \ # Optional. Uses the algorithm's default if not set. + --host 0.0.0.0 \ + --port 30000 +``` + +## Example Configuration File + +Depending on the algorithm selected, the configuration parameters vary. + +LowConfidence Config: + +```yaml +# Confidence threshold for accepting predicted tokens +# - Higher values: More conservative, better quality but slower +# - Lower values: More aggressive, faster but potentially lower quality +# Range: 0.0 - 1.0 +threshold: 0.95 + +# Default: 32, for LLaDA2MoeModelLM +block_size: 32 +``` + +JointThreshold Config: + +```yaml +# Decoding threshold for Mask-to-Token (M2T) phase +# - Higher values: More conservative, better quality but slower +# - Lower values: More aggressive, faster but potentially lower quality +# Range: 0.0 - 1.0 +threshold: 0.5 +# Decoding threshold for Token-to-Token (T2T) phase +# Range: 0.0 - 1.0 +# Setting to 0.0 allows full editing (recommended for most cases). +edit_threshold: 0.0 +# Max extra T2T steps after all masks are removed. Prevents infinite loops. +max_post_edit_steps: 16 +# 2-gram repetition penalty (default 0). +# An empirical value of 3 is often sufficient to mitigate most repetitions. +penalty_lambda: 0 +``` + +## Example Client Code Snippet + +Just like other supported models, diffusion language models can be used via the REST API or Python client. 
+ +Python client example for making a generation request to the launched server: + +```python +import sglang as sgl + +def main(): + llm = sgl.Engine(model_path="inclusionAI/LLaDA2.0-mini", + dllm_algorithm="LowConfidence", + max_running_requests=1, + trust_remote_code=True) + + prompts = [ + "SYSTEMdetailed thinking off<|role_end|>HUMAN Write a brief introduction of the great wall <|role_end|>ASSISTANT" + ] + + sampling_params = { + "temperature": 0, + "max_new_tokens": 1024, + } + + outputs = llm.generate(prompts, sampling_params) + print(outputs) + +if __name__ == '__main__': + main() +``` + +Curl example for making a generation request to the launched server: + +```bash +curl -X POST "http://127.0.0.1:30000/generate" \ + -H "Content-Type: application/json" \ + -d '{ + "text": [ + "SYSTEMdetailed thinking off<|role_end|>HUMAN Write the number from 1 to 128 <|role_end|>ASSISTANT", + "SYSTEMdetailed thinking off<|role_end|>HUMAN Write a brief introduction of the great wall <|role_end|>ASSISTANT" + ], + "stream": true, + "sampling_params": { + "temperature": 0, + "max_new_tokens": 1024 + } + }' +``` + +## Supported Models + +Below the supported models are summarized in a table. + +| Model Family | Example Model | Description | +| -------------------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------- | +| **LLaDA2.0 (mini, flash)** | `inclusionAI/LLaDA2.0-flash` | LLaDA2.0-flash is a diffusion language model featuring a 100B Mixture-of-Experts (MoE) architecture. | +| **SDAR (JetLM)** | `JetLM/SDAR-8B-Chat` | SDAR series diffusion language model (Chat), dense architecture. | +| **SDAR (JetLM)** | `JetLM/SDAR-30B-A3B-Chat` | SDAR series diffusion language model (Chat), MoE architecture. 
| diff --git a/docs/supported_models/generative_models.md b/docs/supported_models/text_generation/generative_models.md similarity index 77% rename from docs/supported_models/generative_models.md rename to docs/supported_models/text_generation/generative_models.md index 3d75fa3077d3..a3e263f68356 100644 --- a/docs/supported_models/generative_models.md +++ b/docs/supported_models/text_generation/generative_models.md @@ -1,67 +1,76 @@ -# Large Language Models - -These models accept text input and produce text output (e.g., chat completions). They are primarily large language models (LLMs), some with mixture-of-experts (MoE) architectures for scaling. - -## Example launch Command - -```shell -python3 -m sglang.launch_server \ - --model-path meta-llama/Llama-3.2-1B-Instruct \ # example HF/local path - --host 0.0.0.0 \ - --port 30000 \ -``` - -## Supported models - -Below the supported models are summarized in a table. - -If you are unsure if a specific architecture is implemented, you can search for it via GitHub. For example, to search for `Qwen3ForCausalLM`, use the expression: - -``` -repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen3ForCausalLM -``` - -in the GitHub search bar. - -| Model Family (Variants) | Example HuggingFace Identifier | Description | -|-------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------------| -| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. 
[SGLang provides Deepseek v3/R1 model-specific optimizations](../basic_usage/deepseek.md) and [Reasoning Parser](../advanced_features/separate_reasoning.ipynb)| -| **Kimi K2** (Thinking, Instruct) | `moonshotai/Kimi-K2-Instruct` | Moonshot AI's 1 trillion parameter MoE model (32B active) with 128K–256K context; state-of-the-art agentic intelligence with stable long-horizon agency across 200–300 sequential tool calls. Features MLA attention and native INT4 quantization. [See Reasoning Parser docs](../advanced_features/separate_reasoning.ipynb)| -| **Kimi Linear** (48B-A3B) | `moonshotai/Kimi-Linear-48B-A3B-Instruct` | Moonshot AI's hybrid linear attention model (48B total, 3B active) with 1M token context; features Kimi Delta Attention (KDA) for up to 6× faster decoding and 75% KV cache reduction vs full attention. | -| **GPT-OSS** | `openai/gpt-oss-20b`, `openai/gpt-oss-120b` | OpenAI’s latest GPT-OSS series for complex reasoning, agentic tasks, and versatile developer use cases.| -| **Qwen** (3, 3MoE, 3Next, 2.5, 2 series) | `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B` `Qwen/Qwen3-Next-80B-A3B-Instruct ` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; Support for MoE variants along with previous generation 2.5, 2, etc. [SGLang provides Qwen3 specific reasoning parser](../advanced_features/separate_reasoning.ipynb)| -| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](../basic_usage/llama4.md) | -| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. 
| -| **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. | -| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech, Phi-4-mini is a high-accuracy text model and Phi-3.5-MoE is a mixture-of-experts model. | -| **MiniCPM** (v3, 4B) | `openbmb/MiniCPM3-4B` | OpenBMB’s series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results in text tasks. | -| **OLMo** (2, 3) | `allenai/OLMo-3-1125-32B`, `allenai/OLMo-3-32B-Think`, `allenai/OLMo-2-1124-7B-Instruct` | Allen AI’s series of Open Language Models designed to enable the science of language models. | -| **OLMoE** (Open MoE) | `allenai/OLMoE-1B-7B-0924` | Allen AI’s open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation. | -| **MiniMax-M2** (M2, M2.1) | `minimax/MiniMax-M2`, `minimax/MiniMax-M2.1` | MiniMax’s SOTA LLM for coding & agentic workflows. | -| **StableLM** (3B, 7B) | `stabilityai/stablelm-tuned-alpha-7b` | StabilityAI’s early open-source LLM (3B & 7B) for general text generation; a demonstration model with basic instruction-following ability. | -| **Command-(R,A)** (Cohere) | `CohereLabs/c4ai-command-r-v01`, `CohereLabs/c4ai-command-r7b-12-2024`, `CohereLabs/c4ai-command-a-03-2025` | Cohere’s open conversational LLM (Command series) optimized for long context, retrieval-augmented generation, and tool use. | -| **DBRX** (Databricks) | `databricks/dbrx-instruct` | Databricks’ 132B-parameter MoE model (36B active) trained on 12T tokens; competes with GPT-3.5 quality as a fully open foundation model. 
| -| **Grok** (xAI) | `xai-org/grok-1` | xAI’s grok-1 model known for vast size(314B parameters) and high quality; integrated in SGLang for high-performance inference. | -| **ChatGLM** (GLM-130B family) | `THUDM/chatglm2-6b` | Zhipu AI’s bilingual chat model (6B) excelling at Chinese-English dialogue; fine-tuned for conversational quality and alignment. | -| **InternLM 2** (7B, 20B) | `internlm/internlm2-7b` | Next-gen InternLM (7B and 20B) from SenseTime, offering strong reasoning and ultra-long context support (up to 200K tokens). | -| **ExaONE 3** (Korean-English) | `LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct` | LG AI Research’s Korean-English model (7.8B) trained on 8T tokens; provides high-quality bilingual understanding and generation. | -| **Baichuan 2** (7B, 13B) | `baichuan-inc/Baichuan2-13B-Chat` | BaichuanAI’s second-generation Chinese-English LLM (7B/13B) with improved performance and an open commercial license. | -| **XVERSE** (MoE) | `xverse/XVERSE-MoE-A36B` | Yuanxiang’s open MoE LLM (XVERSE-MoE-A36B: 255B total, 36B active) supporting ~40 languages; delivers 100B+ dense-level performance via expert routing. | -| **SmolLM** (135M–1.7B) | `HuggingFaceTB/SmolLM-1.7B` | Hugging Face’s ultra-small LLM series (135M–1.7B params) offering surprisingly strong results, enabling advanced AI on mobile/edge devices. | -| **GLM-4** (Multilingual 9B) | `ZhipuAI/glm-4-9b-chat` | Zhipu’s GLM-4 series (up to 9B parameters) – open multilingual models with support for 1M-token context and even a 5.6B multimodal variant (Phi-4V). | -| **MiMo** (7B series) | `XiaomiMiMo/MiMo-7B-RL` | Xiaomi's reasoning-optimized model series, leverages Multiple-Token Prediction for faster inference. | -| **ERNIE-4.5** (4.5, 4.5MoE series) | `baidu/ERNIE-4.5-21B-A3B-PT` | Baidu's ERNIE-4.5 series which consists of MoE with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model. 
| -| **Arcee AFM-4.5B** | `arcee-ai/AFM-4.5B-Base` | Arcee's foundational model series for real world reliability and edge deployments. | -| **Persimmon** (8B) | `adept/persimmon-8b-chat` | Adept’s open 8B model with a 16K context window and fast inference; trained for broad usability and licensed under Apache 2.0. | -| **Solar** (10.7B) | `upstage/SOLAR-10.7B-Instruct-v1.0` | Upstage's 10.7B parameter model, optimized for instruction-following tasks. This architecture incorporates a depth-up scaling methodology, enhancing model performance. | -| **Tele FLM** (52B-1T) | `CofeAI/Tele-FLM` | BAAI & TeleAI's multilingual model, available in 52-billion and 1-trillion parameter variants. It is a decoder-only transformer trained on ~2T tokens | -| **Ling** (16.8B–290B) | `inclusionAI/Ling-lite`, `inclusionAI/Ling-plus` | InclusionAI’s open MoE models. Ling-Lite has 16.8B total / 2.75B active parameters, and Ling-Plus has 290B total / 28.8B active parameters. They are designed for high performance on NLP and complex reasoning tasks. | -| **Granite 3.0, 3.1** (IBM) | `ibm-granite/granite-3.1-8b-instruct` | IBM's open dense foundation models optimized for reasoning, code, and business AI use cases. Integrated with Red Hat and watsonx systems. | -| **Granite 3.0 MoE** (IBM) | `ibm-granite/granite-3.0-3b-a800m-instruct` | IBM’s Mixture-of-Experts models offering strong performance with cost-efficiency. MoE expert routing designed for enterprise deployment at scale. | -| **Orion** (14B) | `OrionStarAI/Orion-14B-Base` | A series of open-source multilingual large language models by OrionStarAI, pretrained on a 2.5T token multilingual corpus including Chinese, English, Japanese, Korean, etc, and it exhibits superior performance in these languages. 
| -| **Llama Nemotron Super** (v1, v1.5, NVIDIA) | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, `nvidia/Llama-3_3-Nemotron-Super-49B-v1_5` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. | -| **Llama Nemotron Ultra** (v1, NVIDIA) | `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. | -| **NVIDIA Nemotron Nano 2.0** | `nvidia/NVIDIA-Nemotron-Nano-9B-v2` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. `Nemotron-Nano-9B-v2` is a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. | -| **StarCoder2** (3B-15B) | `bigcode/starcoder2-7b` | StarCoder2 is a family of open large language models (LLMs) specialized for code generation and understanding. It is the successor to StarCoder, jointly developed by the BigCode project (a collaboration between Hugging Face, ServiceNow Research, and other contributors). | -| **Jet-Nemotron** | `jet-ai/Jet-Nemotron-2B` | Jet-Nemotron is a new family of hybrid-architecture language models that surpass state-of-the-art open-source full-attention language models, while achieving significant efficiency gains. | -| **Trinity** (Nano, Mini) | `arcee-ai/Trinity-Mini` | Arcee's foundational MoE Trinity family of models, open weights under Apache 2.0. | +# Large Language Models + +These models accept text input and produce text output (e.g., chat completions). 
They are primarily large language models (LLMs), some with mixture-of-experts (MoE) architectures for scaling.
+
+## Example launch Command
+
+```shell
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-1B-Instruct \
+  --host 0.0.0.0 \
+  --port 30000  # --model-path accepts a Hugging Face ID or a local path
+```
+
+## Supported models
+
+The supported models are summarized in the table below.
+
+If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for `Qwen3ForCausalLM`, use the expression:
+
+```
+repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen3ForCausalLM
+```
+
+in the GitHub search bar.
+
+| Model Family (Variants) | Example HuggingFace Identifier | Description |
+|-------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------------|
+| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides DeepSeek v3/R1 model-specific optimizations](../../basic_usage/deepseek_v3.md) and [Reasoning Parser](../../advanced_features/separate_reasoning.ipynb) |
+| **Kimi K2** (Thinking, Instruct) | `moonshotai/Kimi-K2-Instruct` | Moonshot AI's 1 trillion parameter MoE model (32B active) with 128K–256K context; state-of-the-art agentic intelligence with stable long-horizon agency across 200–300 sequential tool calls. Features MLA attention and native INT4 quantization. [See Reasoning Parser docs](../../advanced_features/separate_reasoning.ipynb) |
+| **Kimi Linear** (48B-A3B) | `moonshotai/Kimi-Linear-48B-A3B-Instruct` | Moonshot AI's hybrid linear attention model (48B total, 3B active) with 1M token context; features Kimi Delta Attention (KDA) for up to 6× faster decoding and 75% KV cache reduction vs full attention.
| +| **GPT-OSS** | `openai/gpt-oss-20b`, `openai/gpt-oss-120b` | OpenAI’s latest GPT-OSS series for complex reasoning, agentic tasks, and versatile developer use cases.| +| **Qwen** (3.5, 3, 3MoE, 3Next, 2.5, 2 series) | `Qwen/Qwen3.5-397B-A17B`, `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; Support for MoE variants along with previous generation 2.5, 2, etc. [SGLang provides Qwen3 specific reasoning parser](../../advanced_features/separate_reasoning.ipynb)| +| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](../../basic_usage/llama4.md) | +| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. | +| **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. | +| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech, Phi-4-mini is a high-accuracy text model and Phi-3.5-MoE is a mixture-of-experts model. | +| **MiniCPM** (v3, 4B) | `openbmb/MiniCPM3-4B` | OpenBMB’s series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results in text tasks. | +| **OLMo** (2, 3) | `allenai/OLMo-3-1125-32B`, `allenai/OLMo-3-32B-Think`, `allenai/OLMo-2-1124-7B-Instruct` | Allen AI’s series of Open Language Models designed to enable the science of language models. 
| +| **OLMoE** (Open MoE) | `allenai/OLMoE-1B-7B-0924` | Allen AI’s open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation. | +| **MiniMax-M2** (M2, M2.1, M2.5) | `MiniMaxAI/MiniMax-M2.5`, `MiniMaxAI/MiniMax-M2.1`, `MiniMaxAI/MiniMax-M2` | MiniMax's SOTA LLM for coding & agentic workflows. | +| **StableLM** (3B, 7B) | `stabilityai/stablelm-tuned-alpha-7b` | StabilityAI’s early open-source LLM (3B & 7B) for general text generation; a demonstration model with basic instruction-following ability. | +| **Command-(R,A)** (Cohere) | `CohereLabs/c4ai-command-r-v01`, `CohereLabs/c4ai-command-r7b-12-2024`, `CohereLabs/c4ai-command-a-03-2025` | Cohere’s open conversational LLM (Command series) optimized for long context, retrieval-augmented generation, and tool use. | +| **DBRX** (Databricks) | `databricks/dbrx-instruct` | Databricks’ 132B-parameter MoE model (36B active) trained on 12T tokens; competes with GPT-3.5 quality as a fully open foundation model. | +| **Grok** (xAI) | `xai-org/grok-1` | xAI’s grok-1 model known for vast size(314B parameters) and high quality; integrated in SGLang for high-performance inference. | +| **ChatGLM** (GLM-130B family) | `THUDM/chatglm2-6b` | Zhipu AI’s bilingual chat model (6B) excelling at Chinese-English dialogue; fine-tuned for conversational quality and alignment. | +| **InternLM 2** (7B, 20B) | `internlm/internlm2-7b` | Next-gen InternLM (7B and 20B) from SenseTime, offering strong reasoning and ultra-long context support (up to 200K tokens). | +| **ExaONE 3** (Korean-English) | `LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct` | LG AI Research’s Korean-English model (7.8B) trained on 8T tokens; provides high-quality bilingual understanding and generation. | +| **Baichuan 2** (7B, 13B) | `baichuan-inc/Baichuan2-13B-Chat` | BaichuanAI’s second-generation Chinese-English LLM (7B/13B) with improved performance and an open commercial license. 
| +| **XVERSE** (MoE) | `xverse/XVERSE-MoE-A36B` | Yuanxiang’s open MoE LLM (XVERSE-MoE-A36B: 255B total, 36B active) supporting ~40 languages; delivers 100B+ dense-level performance via expert routing. | +| **SmolLM** (135M–1.7B) | `HuggingFaceTB/SmolLM-1.7B` | Hugging Face’s ultra-small LLM series (135M–1.7B params) offering surprisingly strong results, enabling advanced AI on mobile/edge devices. | +| **GLM-4** (Multilingual 9B) | `ZhipuAI/glm-4-9b-chat` | Zhipu’s GLM-4 series (up to 9B parameters) – open multilingual models with support for 1M-token context and even a 5.6B multimodal variant (Phi-4V). | +| **MiMo** (7B series) | `XiaomiMiMo/MiMo-7B-RL` | Xiaomi's reasoning-optimized model series, leverages Multiple-Token Prediction for faster inference. | +| **ERNIE-4.5** (4.5, 4.5MoE series) | `baidu/ERNIE-4.5-21B-A3B-PT` | Baidu's ERNIE-4.5 series which consists of MoE with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model. | +| **Arcee AFM-4.5B** | `arcee-ai/AFM-4.5B-Base` | Arcee's foundational model series for real world reliability and edge deployments. | +| **Persimmon** (8B) | `adept/persimmon-8b-chat` | Adept’s open 8B model with a 16K context window and fast inference; trained for broad usability and licensed under Apache 2.0. | +| **Solar** (10.7B) | `upstage/SOLAR-10.7B-Instruct-v1.0` | Upstage's 10.7B parameter model, optimized for instruction-following tasks. This architecture incorporates a depth-up scaling methodology, enhancing model performance. | +| **Tele FLM** (52B-1T) | `CofeAI/Tele-FLM` | BAAI & TeleAI's multilingual model, available in 52-billion and 1-trillion parameter variants. It is a decoder-only transformer trained on ~2T tokens | +| **Ling** (16.8B–290B) | `inclusionAI/Ling-lite`, `inclusionAI/Ling-plus` | InclusionAI’s open MoE models. Ling-Lite has 16.8B total / 2.75B active parameters, and Ling-Plus has 290B total / 28.8B active parameters. 
They are designed for high performance on NLP and complex reasoning tasks. | +| **Granite 3.0, 3.1** (IBM) | `ibm-granite/granite-3.1-8b-instruct` | IBM's open dense foundation models optimized for reasoning, code, and business AI use cases. Integrated with Red Hat and watsonx systems. | +| **Granite 3.0 MoE** (IBM) | `ibm-granite/granite-3.0-3b-a800m-instruct` | IBM’s Mixture-of-Experts models offering strong performance with cost-efficiency. MoE expert routing designed for enterprise deployment at scale. | +| **GPT-J** (6B) | `EleutherAI/gpt-j-6b` | EleutherAI's GPT-2-like causal language model (6B) trained on the [Pile](https://pile.eleuther.ai/) dataset. | +| **Orion** (14B) | `OrionStarAI/Orion-14B-Base` | A series of open-source multilingual large language models by OrionStarAI, pretrained on a 2.5T token multilingual corpus including Chinese, English, Japanese, Korean, etc, and it exhibits superior performance in these languages. | +| **Llama Nemotron Super** (v1, v1.5, NVIDIA) | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, `nvidia/Llama-3_3-Nemotron-Super-49B-v1_5` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. | +| **Llama Nemotron Ultra** (v1, NVIDIA) | `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. | +| **NVIDIA Nemotron Nano 2.0** | `nvidia/NVIDIA-Nemotron-Nano-9B-v2` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. 
`Nemotron-Nano-9B-v2` is a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. | +| **NVIDIA Nemotron 3 Super** (NVIDIA) | `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) 3 Super is a 120B-parameter MoE model (12B active) delivering high-quality reasoning and generation for enterprise AI agents. | +| **NVIDIA Nemotron 3 Nano** (NVIDIA) | `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) 3 Nano is a compact model designed for efficient edge and enterprise deployment with strong reasoning capabilities. | +| **StarCoder2** (3B-15B) | `bigcode/starcoder2-7b` | StarCoder2 is a family of open large language models (LLMs) specialized for code generation and understanding. It is the successor to StarCoder, jointly developed by the BigCode project (a collaboration between Hugging Face, ServiceNow Research, and other contributors). | +| **Jet-Nemotron** | `jet-ai/Jet-Nemotron-2B` | Jet-Nemotron is a new family of hybrid-architecture language models that surpass state-of-the-art open-source full-attention language models, while achieving significant efficiency gains. | +| **Trinity** (Nano, Mini) | `arcee-ai/Trinity-Mini` | Arcee's foundational MoE Trinity family of models, open weights under Apache 2.0. | +| **LFM2** (350M, 1.2B) | `LiquidAI/LFM2.5-1.2B-Instruct` | Liquid AI's hybrid attention + short convolution language model. | +| **LFM2-MoE** (8B-A1B, 24B-A2B) | `LiquidAI/LFM2-8B-A1B` | Liquid AI's Mixture-of-Experts variant with sigmoid routing and top-k expert selection. | +| **Falcon-H1** (0.5B–34B) | `tiiuae/Falcon-H1-34B-Instruct` | TII's hybrid Mamba-Transformer architecture combining attention and state-space models for efficient long-context inference. 
|
+| **Hunyuan-Large** (389B, MoE) | `tencent/Tencent-Hunyuan-Large` | Tencent's open-source MoE model with 389B total / 52B active parameters, featuring Cross-Layer Attention (CLA) for improved efficiency. |
+| **IBM Granite 4.0 (Hybrid, Dense)** | `ibm-granite/granite-4.0-h-micro`, `ibm-granite/granite-4.0-micro` | IBM Granite 4.0 micro models: hybrid Mamba–MoE (`h-micro`) and dense (`micro`) variants. Enterprise-focused reasoning models. |
+| **Sarvam 2** (30B-A2B, 105B-A10B) | `sarvamai/sarvam-2` | Sarvam's Mixture-of-Experts models. The 105B variant uses MLA (Multi-head Latent Attention) and the 30B variant uses GQA, both with 128 routed experts. |
diff --git a/docs/supported_models/text_generation/index.rst b/docs/supported_models/text_generation/index.rst
new file mode 100644
index 000000000000..e315f83d1a05
--- /dev/null
+++ b/docs/supported_models/text_generation/index.rst
@@ -0,0 +1,11 @@
+Text Generation
+===============
+
+Models for generating text from text or multimodal inputs.
+
+.. toctree::
+   :maxdepth: 1
+
+   generative_models.md
+   multimodal_language_models.md
+   diffusion_language_models.md
diff --git a/docs/supported_models/text_generation/multimodal_language_models.md b/docs/supported_models/text_generation/multimodal_language_models.md
new file mode 100644
index 000000000000..a12113f6ba08
--- /dev/null
+++ b/docs/supported_models/text_generation/multimodal_language_models.md
@@ -0,0 +1,166 @@
+# Multimodal Language Models
+
+These models accept multimodal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders.
+
+## Example launch Command
+
+```shell
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
+  --host 0.0.0.0 \
+  --port 30000  # --model-path accepts a Hugging Face ID or a local path
+```
+
+> See the [OpenAI APIs section](https://docs.sglang.io/basic_usage/openai_api_vision.html) for how to send multimodal requests.
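A minimal sketch of an image request in the OpenAI-compatible chat format, assuming a server launched with the command above on `localhost:30000`. The helper `build_image_request` and the image URL are hypothetical placeholders, not part of SGLang.

```python
import json


def build_image_request(model, question, image_url, max_tokens=300):
    """Build a chat-completions payload with one text part and one image part.

    `build_image_request` is a hypothetical helper for illustration only.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


payload = build_image_request(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    question="What is shown in this image?",
    image_url="https://example.com/sample.jpg",  # placeholder URL
)
print(json.dumps(payload, indent=2))

# Sending requires a running server and the `requests` package:
# import requests
# print(requests.post("http://localhost:30000/v1/chat/completions", json=payload).text)
```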
+ +## Supported models + +Below the supported models are summarized in a table. + +If you are unsure if a specific architecture is implemented, you can search for it via GitHub. For example, to search for `Qwen2_5_VLForConditionalGeneration`, use the expression: + +``` +repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen2_5_VLForConditionalGeneration +``` + +in the GitHub search bar. + + +| Model Family (Variants) | Example HuggingFace Identifier | Description | Notes | +|----------------------------|--------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------| +| **Qwen-VL** | `Qwen/Qwen3-VL-235B-A22B-Instruct` | Alibaba's vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. | | +| **DeepSeek-VL2** | `deepseek-ai/deepseek-vl2` | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. | | +| **DeepSeek-OCR / OCR-2** | `deepseek-ai/DeepSeek-OCR-2` | OCR-focused DeepSeek models for document understanding and text extraction. | Use `--trust-remote-code`. | +| **Janus-Pro** (1B, 7B) | `deepseek-ai/Janus-Pro-7B` | DeepSeek's open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. | | +| **MiniCPM-V / MiniCPM-o** | `openbmb/MiniCPM-V-2_6` | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. 
| | +| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. | | +| **LLaVA** (v1.5 & v1.6) | *e.g.* `liuhaotian/llava-v1.5-13b` | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. | | +| **LLaVA-NeXT** (8B, 72B) | `lmms-lab/llava-next-72b` | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. | | +| **LLaVA-OneVision** | `lmms-lab/llava-onevision-qwen2-7b-ov` | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. | | +| **Gemma 3 (Multimodal)** | `google/gemma-3-4b-it` | Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. | | +| **Kimi-VL** (A3B) | `moonshotai/Kimi-VL-A3B-Instruct` | Kimi-VL is a multimodal model that can understand and generate text from images. | | +| **Mistral-Small-3.1-24B** | `mistralai/Mistral-Small-3.1-24B-Instruct-2503` | Mistral 3.1 is a multimodal model that can generate text from text or images input. It also supports tool calling and structured output. | | +| **Phi-4-multimodal-instruct** | `microsoft/Phi-4-multimodal-instruct` | Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. It supports text, vision and audio modalities in SGLang. | | +| **MiMo-VL** (7B) | `XiaomiMiMo/MiMo-VL-7B-RL` | Xiaomi's compact yet powerful vision-language model featuring a native resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. 
| |
+| **GLM-4.5V** (106B) / **GLM-4.1V** (9B) | `zai-org/GLM-4.5V` | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Use `--chat-template glm-4v` |
+| **GLM-OCR** | `zai-org/GLM-OCR` | GLM-OCR: A fast and accurate general OCR model | |
+| **DotsVLM** (General/OCR) | `rednote-hilab/dots.vlm1.inst` | RedNote's vision-language model built on a 1.2B vision encoder and DeepSeek V3 LLM, featuring NaViT vision encoder trained from scratch with dynamic resolution support and enhanced OCR capabilities through structured image data training. | |
+| **DotsVLM-OCR** | `rednote-hilab/dots.ocr` | Specialized OCR variant of DotsVLM optimized for optical character recognition tasks with enhanced text extraction and document understanding capabilities. | Don't use `--trust-remote-code` |
+| **NVILA** (8B, 15B, Lite-2B, Lite-8B, Lite-15B) | `Efficient-Large-Model/NVILA-8B` | NVILA explores the full stack efficiency of multi-modal design, achieving cheaper training, faster deployment and better performance. | Chat template: `chatml` |
+| **NVIDIA Nemotron Nano 2.0 VL** | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | NVIDIA Nemotron Nano v2 VL enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A and summarization capabilities. It builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, in order to achieve higher inference throughput in long document and video scenarios. | Use `--trust-remote-code`. You may need to adjust `--max-mamba-cache-size` [default is 512] to fit memory constraints. |
+| **Ernie4.5-VL** | `baidu/ERNIE-4.5-VL-28B-A3B-PT` | Baidu's vision-language models (28B, 424B). They support image and video comprehension as well as thinking. | |
+| **JetVLM** | | JetVLM is a vision-language model designed for high-performance multimodal understanding and generation tasks, built upon Jet-Nemotron.
| Coming soon | +| **Step3-VL** (10B) | `stepfun-ai/Step3-VL-10B` | StepFun's lightweight open-source 10B parameter VLM for multimodal intelligence, excelling in visual perception, complex reasoning, and human alignment. | | +| **Qwen3-ASR** (0.6B, 1.7B) | `Qwen/Qwen3-ASR-1.7B` | Alibaba's automatic speech recognition models supporting 52 languages. Served via the `/v1/audio/transcriptions` endpoint. | | +| **Qwen3-Omni** | `Qwen/Qwen3-Omni-30B-A3B-Instruct` | Alibaba's omni-modal MoE model. Currently supports the **Thinker** component (multimodal understanding for text, images, audio, and video), while the **Talker** component (audio generation) is not yet supported. | | +| **LFM2-VL** | `LiquidAI/LFM2.5-VL-1.6B` | Liquid AI's vision-language model combining a SigLip2 vision encoder (NaFlex variable-resolution) with the LFM2 hybrid attention + short convolution language model. Supports multi-image inputs. | | + +## Audio Transcription + +SGLang supports audio-only ASR models via the OpenAI-compatible `/v1/audio/transcriptions` endpoint. Upload an audio file and receive a transcription. + +### Launch Command + +```shell +sglang serve \ + --model-path Qwen/Qwen3-ASR-1.7B \ + --served-model-name qwen3-asr \ + --trust-remote-code \ + --host 0.0.0.0 --port 30000 +``` + +### Example Request + +```bash +curl http://localhost:30000/v1/audio/transcriptions \ + -F file=@audio.wav \ + -F model=qwen3-asr \ + -F response_format=verbose_json +``` + +| Model Family | Example Identifier | Notes | +|--------------|--------------------|-------| +| **Whisper** | `openai/whisper-large-v3` | OpenAI's speech recognition model. | +| **Qwen3-ASR** (0.6B, 1.7B) | `Qwen/Qwen3-ASR-1.7B` | Use `--trust-remote-code`. Supports 52 languages. | + +## Video Input Support + +SGLang supports video input for Vision-Language Models (VLMs), enabling temporal reasoning tasks such as video question answering, captioning, and holistic scene understanding. 
Video clips are decoded, key frames are sampled, and the resulting tensors are batched together with the text prompt, allowing multimodal inference to integrate visual and linguistic context. + +| Model Family | Example Identifier | Video notes | +|--------------|--------------------|-------------| +| **Qwen-VL** (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-Omni) | `Qwen/Qwen3-VL-235B-A22B-Instruct` | The processor gathers `video_data`, runs Qwen's frame sampler, and merges the resulting features with text tokens before inference. | +| **GLM-4v** (4.5V, 4.1V, MOE) | `zai-org/GLM-4.5V` | Video clips are read with Decord, converted to tensors, and passed to the model alongside metadata for rotary-position handling. | +| **NVILA** (Full & Lite) | `Efficient-Large-Model/NVILA-8B` | The runtime samples eight frames per clip and attaches them to the multimodal request when `video_data` is present. | +| **LLaVA video variants** (LLaVA-NeXT-Video, LLaVA-OneVision) | `lmms-lab/LLaVA-NeXT-Video-7B` | The processor routes video prompts to the LlavaVid video-enabled architecture, and the provided example shows how to query it with `sgl.video(...)` clips. | +| **NVIDIA Nemotron Nano 2.0 VL** | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | The processor samples at 2 FPS, at a max of 128 frames, as per model training. The model uses [EVS](../../../python/sglang/srt/multimodal/evs/README.md), a pruning method that removes redundant tokens from video embeddings. By default `video_pruning_rate=0.7`. Change this by providing: `--json-model-override-args '{"video_pruning_rate": 0.0}'` to disable EVS, for example. | +| **JetVLM** | | The runtime samples eight frames per clip and attaches them to the multimodal request when `video_data` is present. | + +Use `sgl.video(path, num_frames)` when building prompts to attach clips from your SGLang programs. 
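
The sampling policies in the table above (a fixed number of frames per clip, or a target frame rate with a cap) can be sketched as plain index selection. This is a minimal illustration, not SGLang's actual processor code: `uniform_frame_indices` and `fps_frame_indices` are hypothetical helpers, and real processors may pick frames differently.

```python
from __future__ import annotations


def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick up to num_frames indices spread evenly across a clip."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    n = min(num_frames, total_frames)
    # Split the clip into n equal segments and take the midpoint of each.
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def fps_frame_indices(
    total_frames: int,
    native_fps: float,
    target_fps: float = 2.0,
    max_frames: int = 128,
) -> list[int]:
    """Sample at roughly target_fps, never returning more than max_frames."""
    duration_s = total_frames / native_fps
    wanted = max(1, int(duration_s * target_fps))
    return uniform_frame_indices(total_frames, min(wanted, max_frames))


if __name__ == "__main__":
    # Eight evenly spaced frames from an 80-frame clip.
    print(uniform_frame_indices(80, 8))
    # A 10-minute clip at 30 FPS hits the 128-frame cap.
    print(len(fps_frame_indices(30 * 600, 30.0)))
```

The fixed-count policy keeps the token budget constant per clip, while the FPS-based policy scales with clip length until the cap applies.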
+ +Example OpenAI-compatible request that sends a video clip: + +```python +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "Qwen/Qwen3-VL-30B-A3B-Instruct", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "What’s happening in this video?"}, + { + "type": "video_url", + "video_url": { + "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4" + }, + }, + ], + } + ], + "max_tokens": 300, +} + +response = requests.post(url, json=data) +print(response.text) +``` + +## Usage Notes + +### Performance Optimization + +For multimodal models, you can use the `--keep-mm-feature-on-device` flag to optimize for latency at the cost of increased GPU memory usage: + +- **Default behavior**: Multimodal feature tensors are moved to CPU after processing to save GPU memory +- **With `--keep-mm-feature-on-device`**: Feature tensors remain on GPU, reducing device-to-host copy overhead and improving latency, but consuming more GPU memory + +Use this flag when you have sufficient GPU memory and want to minimize latency for multimodal inference. + +### Multimodal Inputs Limitation + +- **Use `--mm-process-config '{"image":{"max_pixels":1048576},"video":{"fps":3,"max_pixels":602112,"max_frames":60}}'`** to set `image`, `video`, and `audio` input limits. + +Lower limits can reduce GPU memory usage, improve inference speed, and help avoid OOM errors, but they may degrade model output quality, so choose values that fit your use case. The config entries are passed as `images_kwargs`, `videos_kwargs`, and `audio_kwargs` to the HuggingFace processor, so each modality's settings are kept separate and do not collide. Refer to the HuggingFace documentation for your model's processor to understand the available parameters.
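
Hand-writing the JSON for `--mm-process-config` is error-prone because of the nested quoting. The sketch below builds the same value programmatically and shell-quotes it. The helper name `build_mm_process_config_args` is hypothetical, and the keys are only the ones shown in the example flag above.

```python
import json
import shlex


def build_mm_process_config_args(
    image_max_pixels: int,
    video_fps: int,
    video_max_pixels: int,
    video_max_frames: int,
) -> list:
    """Return the flag and its compact JSON value as two argv entries."""
    config = {
        "image": {"max_pixels": image_max_pixels},
        "video": {
            "fps": video_fps,
            "max_pixels": video_max_pixels,
            "max_frames": video_max_frames,
        },
    }
    # Compact separators avoid spaces inside the JSON value.
    return ["--mm-process-config", json.dumps(config, separators=(",", ":"))]


args = build_mm_process_config_args(1048576, 3, 602112, 60)

if __name__ == "__main__":
    # shlex.join adds the single quotes needed to paste this into a shell.
    print(shlex.join(args))
```

When you launch the server from Python with `subprocess`, pass `args` as list elements directly and skip the shell quoting.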
+ +### Bidirectional Attention in Multimodal Model Serving +**Note for serving the Gemma-3 multimodal model**: + +As mentioned in [Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM](https://huggingface.co/blog/gemma3#multimodality), Gemma-3 employs bidirectional attention between image tokens during the prefill phase. Currently, SGLang supports bidirectional attention only with the Triton attention backend, and the current implementation is incompatible with both CUDA Graph and Chunked Prefill. + +To enable bidirectional attention, select the Triton attention backend (`TritonAttnBackend`) and disable CUDA Graph and Chunked Prefill. Example launch command: +```shell +# Triton attention backend, with CUDA Graph and Chunked Prefill disabled +python -m sglang.launch_server \ + --model-path google/gemma-3-4b-it \ + --host 0.0.0.0 --port 30000 \ + --enable-multimodal \ + --dtype bfloat16 --triton-attention-reduce-in-fp32 \ + --attention-backend triton \ + --disable-cuda-graph \ + --chunked-prefill-size -1 +``` + +If you need higher serving performance and can accept some accuracy loss, you can use another attention backend and enable CUDA Graph and Chunked Prefill. In that case, the model falls back to causal attention instead of bidirectional attention.
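
To make the difference between causal and bidirectional prefill attention concrete, here is a minimal sketch of the mask shape described above. It is an illustration only, not SGLang's Triton kernel. `build_prefill_mask` and the `image_ids` encoding are hypothetical, and the sketch assumes that bidirectional attention applies only among tokens of the same image.

```python
def build_prefill_mask(image_ids):
    """Build a boolean attention mask for one prefill sequence.

    image_ids[i] is None for a text token, or an integer identifying
    the image that token i belongs to. mask[q][k] is True when query
    position q may attend to key position k.
    """
    n = len(image_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_image = image_ids[q] is not None and image_ids[q] == image_ids[k]
            # Text follows the usual causal rule; tokens of the same
            # image may also look ahead at each other.
            mask[q][k] = causal or same_image
    return mask


if __name__ == "__main__":
    # One text token, a three-token image, then one text token.
    for row in build_prefill_mask([None, 0, 0, 0, None]):
        print("".join("1" if allowed else "0" for allowed in row))
```

With a purely causal backend, the `same_image` term is dropped, so early image tokens cannot see later tokens of the same image.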
diff --git a/docs_new/.gitignore b/docs_new/.gitignore new file mode 100644 index 000000000000..126ca65507d8 --- /dev/null +++ b/docs_new/.gitignore @@ -0,0 +1,30 @@ +# Node +node_modules/ +.env +.DS_Store +.cache/ +dist/ +.next/ +*.log + +# OS +Thumbs.db +Desktop.ini + +# VSCode +.vscode/ + +# Mintlify +.mintlify/ + +# Python (if any) +__pycache__/ +*.pyc + +# Misc +*.swp +*.swo + +.agents +.claude +skills-lock.json diff --git a/docs_new/.mintignore b/docs_new/.mintignore new file mode 100644 index 000000000000..9922f06dc8c1 --- /dev/null +++ b/docs_new/.mintignore @@ -0,0 +1,7 @@ +# Mintlify automatically ignores these files and directories: +# .git, .github, .claude, .agents, .idea, node_modules, +# README.md, LICENSE.md, CHANGELOG.md, CONTRIBUTING.md + +# Draft content +drafts/ +*.draft.mdx diff --git a/docs_new/AGENTS.md b/docs_new/AGENTS.md new file mode 100644 index 000000000000..dd60e68fbdec --- /dev/null +++ b/docs_new/AGENTS.md @@ -0,0 +1,381 @@ +--- +name: sglang-docs-mintlify +description: Build and maintain the SGLang documentation site and integrated cookbook using Mintlify. Use when + creating docs pages, configuring navigation, adding components, or setting up + API references. +license: Apache-2.0 +compatibility: Requires Node.js for CLI. Works with any Git-based workflow. +metadata: + author: SGLang Team + version: "1.0" + mintlify-proj: mintlify +--- + +# SGLang Mintlify documentation guide for agents + +## Non-negotiables + +- **Do not guess flags, defaults, or behavior.** + If you’re documenting CLI args, env vars, APIs, or performance behavior, verify against: + - the upstream codebase (`sgl-project/sglang`) + - the current public docs (`docs.sglang.io`) until the migration is complete + - or an authoritative vendor doc when platform-specific (ROCm, CANN/Ascend, Intel XPU). +- **Prefer fixing the docs-site version of an internal link** instead of copying links from older docs. 
+- **Keep examples copy/pasteable.** Use placeholders consistently (e.g., `MODEL_PATH`, `HF_TOKEN`, `HOST`, `PORT`). + +## Source of truth hierarchy + +1. **This repo** + - `docs.json` for site structure + navigation + - existing MDX pages for voice + conventions +2. **Canonical current docs** + - `docs.sglang.io` (Sphinx site) is currently the reference structure and content baseline. +3. **Implementation** + - `sgl-project/sglang` for anything that can change with releases (flags, env vars, defaults, supported models). +4. **Cookbook** + - `cookbook.sglang.io` / `sgl-project/sgl-cookbook` for recipe patterns and model-specific operational guidance. + +## Writing standards (SGLang-specific) + +### Voice and structure + +* Second person (“you”), active voice. +* Prefer **short, scannable sections** with clear outcomes. +* Headings in **sentence case**. +* Put prerequisites before commands. + +### Technical accuracy patterns + +For pages that include commands/configs, always specify: + +* **Platform** (NVIDIA CUDA / AMD ROCm / Intel XPU / Ascend NPU / CPU) +* **OS** (if relevant) and **version constraints** +* **Model identifier** format (e.g., Hugging Face repo id) and where it goes (`--model-path`, `--model`, etc.) +* **Parallelism knobs** used in the example (`--tp`, `--dp`, node count, etc.) +* Any required secrets/tokens (`HF_TOKEN`) and where they are used. + +# Mintlify best practices + +**Always consult [mintlify.com/docs](https://mintlify.com/docs) for components, configuration, and latest features.** + +If you are not already connected to the Mintlify MCP server, [https://mintlify.com/docs/mcp](https://mintlify.com/docs/mcp), add it so that you can search more efficiently. + +**Always** favor searching the current Mintlify documentation over whatever is in your training data about Mintlify. + +Mintlify is a documentation platform that transforms MDX files into documentation sites. 
Configure site-wide settings in the `docs.json` file, write content in MDX with YAML frontmatter, and favor built-in components over custom components. + +Full schema at [mintlify.com/docs.json](https://mintlify.com/docs.json). + +## Before you write + +### Understand the project + +Read `docs.json` in the project root. This file defines the entire site: navigation structure, theme, colors, links, API and specs. + +Understanding the project tells you: + +* What pages exist and how they're organized +* What navigation groups are used (and their naming conventions) +* How the site navigation is structured +* What theme and configuration the site uses + +### Check for existing content + +Search the docs before creating new pages. You may need to: + +* Update an existing page instead of creating a new one +* Add a section to an existing page +* Link to existing content rather than duplicating + +### Read surrounding content + +Before writing, read 2-3 similar pages to understand the site's voice, structure, formatting conventions, and level of detail. + +### Understand Mintlify components + +Review the Mintlify [components](https://www.mintlify.com/docs/components) to select and use any relevant components for the documentation request that you are working on. + +## Quick reference + +### CLI commands + +* `npm i -g mint` - Install the Mintlify CLI +* `mint dev` - Local preview at localhost:3000 +* `mint broken-links` - Check internal links +* `mint a11y` - Check for accessibility issues in content +* `mint rename` - Rename/move files and update references +* `mint validate` - Validate documentation builds + +### Required files + +* `docs.json` - Site configuration (navigation, theme, integrations, etc.). See [global settings](https://www.mintlify.com/docs/organize/settings) for all options. 
+* `*.mdx` files - Documentation pages with YAML frontmatter + +### Example file structure + +``` +project/ +├── docs.json # Site configuration +├── introduction.mdx +├── quickstart.mdx +├── guides/ +│ └── example.mdx +├── openapi.yml # API specification +├── images/ # Static assets +│ └── example.png +└── snippets/ # Reusable components + └── component.jsx +``` + +## Page frontmatter + +Every page requires `title` in its frontmatter. Include `description` for SEO and navigation. + +```yaml theme={null} +--- +title: "Clear, descriptive title" +description: "Concise summary for SEO and navigation." +--- +``` + +Optional frontmatter fields: + +* `sidebarTitle`: Short title for sidebar navigation. +* `icon`: Lucide or Font Awesome icon name, URL, or file path. +* `tag`: Label next to the page title in the sidebar (for example, "NEW"). +* `mode`: Page layout mode (`default`, `wide`, `custom`). +* `keywords`: Array of terms related to the page content for local search and SEO. +* Any custom YAML fields for use with personalization or conditional content. + +## File conventions + +* Match existing naming patterns in the directory +* If there are no existing files or inconsistent file naming patterns, use kebab-case: `getting-started.mdx`, `api-reference.mdx` +* Use root-relative paths without file extensions for internal links: `/getting-started/quickstart` +* Do not use relative paths (`../`) or absolute URLs for internal pages +* When you create a new page, add it to `docs.json` navigation or it won't appear in the sidebar + +## Organize content + +When a user asks about anything related to site-wide configurations, start by understanding the [global settings](https://www.mintlify.com/docs/organize/settings). See if a setting in the `docs.json` file can be updated to achieve what the user wants. + +### Navigation + +The `navigation` property in `docs.json` controls site structure. Choose one primary pattern at the root level, then nest others within it. 
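
As a rough sketch of "one primary pattern at the root, others nested within it", the snippet below models a tabs-containing-groups layout as a Python dictionary and lists every page it references. The `tabs`/`tab`/`groups`/`group`/`pages` keys follow the Mintlify schema as understood here, so verify them against the current `docs.json` schema. The tab names, group names, and page paths are illustrative assumptions.

```python
import json

# Hypothetical navigation: tabs at the root, groups nested inside each tab.
navigation = {
    "tabs": [
        {
            "tab": "Docs",
            "groups": [
                {"group": "Get started", "pages": ["docs/get-started/install"]},
            ],
        },
        {
            "tab": "Cookbook",
            "groups": [
                {"group": "Overview", "pages": ["cookbook/intro"]},
            ],
        },
    ]
}


def listed_pages(node):
    """Recursively collect every page path referenced under a navigation node."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, list):
        return [page for item in node for page in listed_pages(item)]
    if isinstance(node, dict):
        pages = []
        for key in ("tabs", "groups", "pages"):
            pages.extend(listed_pages(node.get(key, [])))
        return pages
    return []


if __name__ == "__main__":
    print(json.dumps({"navigation": navigation}, indent=2))
    print(listed_pages(navigation))
```

A helper like `listed_pages` is one way to check that a newly created page was actually added to the navigation before you open a pull request.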
+ +**Choose your primary pattern:** + +| Pattern | When to use | +| ------------- | ---------------------------------------------------------------------------------------------- | +| **Groups** | Default. Single audience, straightforward hierarchy | +| **Tabs** | Distinct sections with different audiences (Guides vs API Reference) or content types | +| **Anchors** | Want persistent section links at sidebar top. Good for separating docs from external resources | +| **Dropdowns** | Multiple doc sections users switch between, but not distinct enough for tabs | +| **Products** | Multi-product company with separate documentation per product | +| **Versions** | Maintaining docs for multiple API/product versions simultaneously | +| **Languages** | Localized content | + +**Within your primary pattern:** + +* **Groups** - Organize related pages. Can nest groups within groups, but keep hierarchy shallow +* **Menus** - Add dropdown navigation within tabs for quick jumps to specific pages +* **`expanded: false`** - Collapse nested groups by default. Use for reference sections users browse selectively +* **`openapi`** - Auto-generate pages from OpenAPI spec. Add at group/tab level to inherit + +**Common combinations:** + +* Tabs containing groups (most common for docs with API reference) +* Products containing tabs (multi-product SaaS) +* Versions containing tabs (versioned API docs) +* Anchors containing groups (simple docs with external resource links) + +### Links and paths + +* **Internal links:** Root-relative, no extension: `/getting-started/quickstart` +* **Images:** Store in `/images`, reference as `/images/example.png` +* **External links:** Use full URLs, they open in new tabs automatically + +## Customize docs sites + +**What to customize where:** + +* **Brand colors, fonts, logo** → `docs.json`. 
See [global settings](https://www.mintlify.com/docs/organize/settings) +* **Component styling, layout tweaks** → `custom.css` at project root +* **Dark mode** → Enabled by default. Only disable with `"appearance": "light"` in `docs.json` if brand requires it + +Start with `docs.json`. Only add `custom.css` when you need styling that config doesn't support. + +## Write content + +### Components + +The [components overview](https://mintlify.com/docs/components) organizes all components by purpose: structure content, draw attention, show/hide content, document APIs, link to pages, and add visual context. Start there to find the right component. + +**Common decision points:** + +| Need | Use | +| -------------------------- | ----------------------- | +| Hide optional details | `` | +| Long code examples | `` | +| User chooses one option | `` | +| Linked navigation cards | `` in `` | +| Sequential instructions | `` | +| Code in multiple languages | `` | +| API parameters | `` | +| API response fields | `` | + +**Callouts by severity:** + +* `` - Supplementary info, safe to skip +* `` - Helpful context such as permissions +* `` - Recommendations or best practices +* `` - Potentially destructive actions +* `` - Success confirmation + +### Reusable content + +**When to use snippets:** + +* Exact content appears on more than one page +* Complex components you want to maintain in one place +* Shared content across teams/repos + +**When NOT to use snippets:** + +* Slight variations needed per page (leads to complex props) + +Import snippets with `import { Component } from "/path/to/snippet-name.jsx"`. 
+ +## Writing standards + +### Voice and structure + +* Second-person voice ("you") +* Active voice, direct language +* Sentence case for headings ("Getting started", not "Getting Started") +* Sentence case for code block titles ("Expandable example", not "Expandable Example") +* Lead with context: explain what something is before how to use it +* Prerequisites at the start of procedural content + +### What to avoid + +**Never use:** + +* Marketing language ("powerful", "seamless", "robust", "cutting-edge") +* Filler phrases ("it's important to note", "in order to") +* Excessive conjunctions ("moreover", "furthermore", "additionally") +* Editorializing ("obviously", "simply", "just", "easily") + +**Watch for AI-typical patterns:** + +* Overly formal or stilted phrasing +* Unnecessary repetition of concepts +* Generic introductions that don't add value +* Concluding summaries that restate what was just said + +### Formatting + +* All code blocks must have language tags +* All images and media must have descriptive alt text +* Use bold and italics only when they serve the reader's understanding--never use text styling just for decoration +* No decorative formatting or emoji + +### Code examples + +* Keep examples simple and practical +* Use realistic values (not "foo" or "bar") +* One clear example is better than multiple variations +* Test that code works before including it + +## Deploy + +Mintlify deploys automatically when changes are pushed to the connected Git repository. + +**What agents can configure:** + +* **Redirects** → Add to `docs.json` with `"redirects": [{"source": "/old", "destination": "/new"}]` +* **SEO indexing** → Control with `"seo": {"indexing": "all"}` to include hidden pages in search + +**Requires dashboard setup (human task):** + +* Custom domains and subdomains +* Preview deployment settings +* DNS configuration + +For `/docs` subpath hosting with Vercel or Cloudflare, agents can help configure rewrite rules. 
See [/docs subpath](https://mintlify.com/docs/deploy/vercel). + +## Workflow + +### 1. Understand the task + +Identify what needs to be documented, which pages are affected, and what the reader should accomplish afterward. If any of these are unclear, ask. + +### 2. Research + +* Read `docs.json` to understand the site structure +* Search existing docs for related content +* Read similar pages to match the site's style + +### 3. Plan + +* Synthesize what the reader should accomplish after reading the docs and the current content +* Propose any updates or new content +* Verify that your proposed changes will help readers be successful + +### 4. Write + +* Start with the most important information +* Keep sections focused and scannable +* Use components appropriately (don't overuse them) +* Mark anything uncertain with a TODO comment: + +```mdx theme={null} +{/* TODO: Verify the default timeout value */} +``` + +### 5. Update navigation + +If you created a new page, add it to the appropriate group in `docs.json`. + +### 6. Verify + +Before submitting: + +* [ ] Frontmatter includes title and description +* [ ] All code blocks have language tags +* [ ] Internal links use root-relative paths without file extensions +* [ ] New pages are added to `docs.json` navigation +* [ ] Content matches the style of surrounding pages +* [ ] No marketing language or filler phrases +* [ ] TODOs are clearly marked for anything uncertain +* [ ] Run `mint broken-links` to check links +* [ ] Run `mint validate` to find any errors + +## Edge cases + +### Migrations + +If a user asks about migrating to Mintlify, ask if they are using ReadMe or Docusaurus. If they are, use the [@mintlify/scraping](https://www.npmjs.com/package/@mintlify/scraping) CLI to migrate content. If they are using a different platform to host their documentation, help them manually convert their content to MDX pages using Mintlify components. 
+ +### Hidden pages + +Any page that is not included in the `docs.json` navigation is hidden. Use hidden pages for content that should be accessible by URL or indexed for the assistant or search, but not discoverable through the sidebar navigation. + +### Exclude pages + +The `.mintignore` file is used to exclude files from a documentation repository from being processed. + +## Common gotchas + +1. **Component imports** - JSX components need explicit import, MDX components don't +2. **Frontmatter required** - Every MDX file needs `title` at minimum +3. **Code block language** - Always specify language identifier +4. **Never use `mint.json`** - `mint.json` is deprecated. Only ever use `docs.json` + +## Resources + +* [Documentation](https://mintlify.com/docs) +* [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang) +* [Configuration schema](https://mintlify.com/docs.json) +* [Feature requests](https://github.com/orgs/mintlify/discussions/categories/feature-requests) +* [Bugs and feedback](https://github.com/orgs/mintlify/discussions/categories/bugs-feedback) diff --git a/docs_new/CONTRIBUTING.md b/docs_new/CONTRIBUTING.md new file mode 100644 index 000000000000..fc42a9b2d5e0 --- /dev/null +++ b/docs_new/CONTRIBUTING.md @@ -0,0 +1,34 @@ +> **Customize this file**: Tailor this template to your project by noting specific contribution types you're looking for, adding a Code of Conduct, or adjusting the writing guidelines to match your style. + +# Contribute to the documentation + +Thank you for your interest in contributing to our documentation! This guide will help you get started. + +## How to contribute + +### Option 1: Edit directly on GitHub + +1. Navigate to the page you want to edit +2. Click the "Edit this file" button (the pencil icon) +3. Make your changes and submit a pull request + +### Option 2: Local development + +1. Fork and clone this repository +2. Install the Mintlify CLI: `npm i -g mint` +3. 
Create a branch for your changes +4. Make changes +5. Run `mint dev` +6. Preview your changes at `http://localhost:3000` +7. Commit your changes and submit a pull request + +For more details on local development, see our [development guide](development.mdx). + +## Writing guidelines + +- **Use active voice**: "Run the command" not "The command should be run" +- **Address the reader directly**: Use "you" instead of "the user" +- **Keep sentences concise**: Aim for one idea per sentence +- **Lead with the goal**: Start instructions with what the user wants to accomplish +- **Use consistent terminology**: Don't alternate between synonyms for the same concept +- **Include examples**: Show, don't just tell diff --git a/docs_new/LICENSE b/docs_new/LICENSE new file mode 100644 index 000000000000..261eeb9e9f8b --- /dev/null +++ b/docs_new/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. 
+ + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. 
If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. 
You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. 
Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. 
+ + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/docs_new/README.md b/docs_new/README.md new file mode 100644 index 000000000000..c3b11bce291e --- /dev/null +++ b/docs_new/README.md @@ -0,0 +1,126 @@ +# SGLang Documentation + +The official documentation and cookbook for [SGLang](https://github.com/sgl-project/sglang) — a high-performance serving framework for large language models and vision-language models. + +- **Docs**: Getting started guides, installation, and reference +- **Cookbook**: Battle-tested recipes for deploying specific models (Qwen, DeepSeek, Llama, GLM, etc.) on various hardware + + +## Project structure + +``` +. +├── docs.json # Site configuration (navigation, theme, metadata) +├── index.mdx # Homepage +├── docs/ # Documentation pages +│ └── get-started/ +│ └── install.mdx # Installation guide +└── cookbook/ # Model deployment recipes + ├── intro.mdx # Cookbook overview and recipe index + └── autoregressive/ # Autoregressive model recipes + └── Qwen/ + └── Qwen3.5.mdx +``` + +Pages are `.mdx` files with YAML frontmatter. Navigation is defined in `docs.json`. + +## Local development + +### Prerequisites + +- Node.js >= 20 + +### Setup + +```bash +# Install the CLI +npm i -g mint + +# Start the dev server (with hot reload) +mint dev +``` + +Preview at `http://localhost:3000`. 
+ +### Useful commands + +```bash +mint dev # Start local preview server +mint broken-links # Check for broken links +mint update # Update the CLI +``` + +## Contributing + +We welcome contributions! Whether you want to add a recipe for a new model, improve existing docs, or fix a typo, PRs are appreciated. + +### Quick edit (GitHub) + +1. Navigate to the file you want to edit on GitHub +2. Click the pencil icon to edit +3. Submit a pull request + +### Local development workflow + +```bash +# 1. Fork and clone the repo +git clone https://github.com//sgl-docs.git +cd sgl-docs + +# 2. Create a branch +git checkout -b my-changes + +# 3. Start the dev server and make your changes +mint dev + +# 4. Verify links aren't broken +mint broken-links + +# 5. Commit and push +git add +git commit -m "docs: describe your change" +git push origin my-changes + +# 6. Open a pull request on GitHub +``` + +### Adding a new cookbook recipe + +1. Create a new `.mdx` file under `cookbook/` following the existing directory structure (e.g., `cookbook/llm//.mdx` or `cookbook/vlm//.mdx`) +2. Use an existing recipe like `cookbook/autoregressive/Qwen/Qwen3.5.mdx` as a template +3. Add your page to the navigation in `docs.json` +4.
Each recipe should include: + - Model introduction and key specs + - Installation / environment setup + - Deployment configuration (with hardware recommendations) + - Usage examples (basic + advanced) + - Benchmarks (if available) + +### Writing guidelines + +- Use active voice: "Run the command" not "The command should be run" +- Address the reader as "you" +- Keep sentences concise — one idea per sentence +- Lead with the goal, then the steps +- Use consistent terminology +- Include concrete examples and code snippets + +## Acknowledgements + +Thank you to all the authors who contributed to the original documentation in [`sglang/docs/`](https://github.com/sgl-project/sglang/tree/main/docs) and the original cookbook in [`sgl-cookbook`](https://github.com/sgl-project/sgl-cookbook). The migration to the new Mintlify-based documentation was led by the following [ACM-VIT](https://github.com/ACM-VIT) students: + +[@Adhyan Jain](https://github.com/Adhyan-Jain), [@Maitri-shah29](https://github.com/Maitri-shah29), [@architnigam](https://github.com/architnigam), [@Nakul-Sinha](https://github.com/Nakul-Sinha), [@divyamagrawal06](https://github.com/divyamagrawal06), [@A-Taman](https://github.com/A-Taman), [@nimeshas](https://github.com/nimeshas), [@IshhanKheria](https://github.com/IshhanKheria), [@Krishang-Zinzuwadia](https://github.com/Krishang-Zinzuwadia), [@pokymono](https://github.com/pokymono), [@Ishitajoshii](https://github.com/Ishitajoshii), [@AdityaVKochar](https://github.com/AdityaVKochar) + +Advised by [@adarshxs](https://github.com/adarshxs) (ACM-VIT) and [@wisclmy0611](https://github.com/wisclmy0611), [@Richardczl98](https://github.com/Richardczl98) (LMSYS). 
+ +## Community + +- [GitHub](https://github.com/sgl-project/sglang) +- [Slack](https://slack.sglang.io/) +- [Discord](https://discord.gg/4ugb2t6YY2) +- [X / Twitter](https://x.com/lmsysorg) +- [LinkedIn](https://www.linkedin.com/company/sgl-project/) + +## License + +Apache License 2.0 — see the [LICENSE](LICENSE) for details. diff --git a/docs_new/cards/Autoregressive-benchmark-card.png b/docs_new/cards/Autoregressive-benchmark-card.png new file mode 100644 index 000000000000..fee0558b515a Binary files /dev/null and b/docs_new/cards/Autoregressive-benchmark-card.png differ diff --git a/docs_new/cards/Autoregressive-card.png b/docs_new/cards/Autoregressive-card.png new file mode 100644 index 000000000000..da4c367c36c0 Binary files /dev/null and b/docs_new/cards/Autoregressive-card.png differ diff --git a/docs_new/cards/Classification-card.png b/docs_new/cards/Classification-card.png new file mode 100644 index 000000000000..a26b9180adbd Binary files /dev/null and b/docs_new/cards/Classification-card.png differ diff --git a/docs_new/cards/Diffusion-benchmark-card.png b/docs_new/cards/Diffusion-benchmark-card.png new file mode 100644 index 000000000000..7799bbd2a9a6 Binary files /dev/null and b/docs_new/cards/Diffusion-benchmark-card.png differ diff --git a/docs_new/cards/Diffusion-card.png b/docs_new/cards/Diffusion-card.png new file mode 100644 index 000000000000..708171b43109 Binary files /dev/null and b/docs_new/cards/Diffusion-card.png differ diff --git a/docs_new/cards/Embedding-card.png b/docs_new/cards/Embedding-card.png new file mode 100644 index 000000000000..7a53d213f8b0 Binary files /dev/null and b/docs_new/cards/Embedding-card.png differ diff --git a/docs_new/cards/LLM-card.png b/docs_new/cards/LLM-card.png new file mode 100644 index 000000000000..36fa41267ceb Binary files /dev/null and b/docs_new/cards/LLM-card.png differ diff --git a/docs_new/cards/Omni-card.png b/docs_new/cards/Omni-card.png new file mode 100644 index 000000000000..203158c854f6 Binary 
files /dev/null and b/docs_new/cards/Omni-card.png differ diff --git a/docs_new/cards/Rerank-card.png b/docs_new/cards/Rerank-card.png new file mode 100644 index 000000000000..2000b30d5cc6 Binary files /dev/null and b/docs_new/cards/Rerank-card.png differ diff --git a/docs_new/cards/Reward-card.png b/docs_new/cards/Reward-card.png new file mode 100644 index 000000000000..11fbe240ef95 Binary files /dev/null and b/docs_new/cards/Reward-card.png differ diff --git a/docs_new/cards/VLM-card.png b/docs_new/cards/VLM-card.png new file mode 100644 index 000000000000..d0e8c059dc77 Binary files /dev/null and b/docs_new/cards/VLM-card.png differ diff --git a/docs_new/cards/dLLM-card.png b/docs_new/cards/dLLM-card.png new file mode 100644 index 000000000000..1bd217f090c1 Binary files /dev/null and b/docs_new/cards/dLLM-card.png differ diff --git a/docs_new/cards/logos/deepseek.png b/docs_new/cards/logos/deepseek.png new file mode 100644 index 000000000000..b553b56271e4 Binary files /dev/null and b/docs_new/cards/logos/deepseek.png differ diff --git a/docs_new/cards/logos/ernie.png b/docs_new/cards/logos/ernie.png new file mode 100644 index 000000000000..ac1a0bd55525 Binary files /dev/null and b/docs_new/cards/logos/ernie.png differ diff --git a/docs_new/cards/logos/fishaudio.png b/docs_new/cards/logos/fishaudio.png new file mode 100644 index 000000000000..a3c951c953da Binary files /dev/null and b/docs_new/cards/logos/fishaudio.png differ diff --git a/docs_new/cards/logos/flashlabs.png b/docs_new/cards/logos/flashlabs.png new file mode 100644 index 000000000000..0c15819889c6 Binary files /dev/null and b/docs_new/cards/logos/flashlabs.png differ diff --git a/docs_new/cards/logos/flux.png b/docs_new/cards/logos/flux.png new file mode 100644 index 000000000000..df4fde4b81c8 Binary files /dev/null and b/docs_new/cards/logos/flux.png differ diff --git a/docs_new/cards/logos/glm.png b/docs_new/cards/logos/glm.png new file mode 100644 index 000000000000..6d0f33657525 Binary files 
/dev/null and b/docs_new/cards/logos/glm.png differ diff --git a/docs_new/cards/logos/google.png b/docs_new/cards/logos/google.png new file mode 100644 index 000000000000..dc39c804e5f9 Binary files /dev/null and b/docs_new/cards/logos/google.png differ diff --git a/docs_new/cards/logos/inclusionai.png b/docs_new/cards/logos/inclusionai.png new file mode 100644 index 000000000000..0128c8371677 Binary files /dev/null and b/docs_new/cards/logos/inclusionai.png differ diff --git a/docs_new/cards/logos/internlm.png b/docs_new/cards/logos/internlm.png new file mode 100644 index 000000000000..655f7d467647 Binary files /dev/null and b/docs_new/cards/logos/internlm.png differ diff --git a/docs_new/cards/logos/internvl.png b/docs_new/cards/logos/internvl.png new file mode 100644 index 000000000000..e6f972289177 Binary files /dev/null and b/docs_new/cards/logos/internvl.png differ diff --git a/docs_new/cards/logos/jina.png b/docs_new/cards/logos/jina.png new file mode 100644 index 000000000000..2a660ab6867e Binary files /dev/null and b/docs_new/cards/logos/jina.png differ diff --git a/docs_new/cards/logos/llama.png b/docs_new/cards/logos/llama.png new file mode 100644 index 000000000000..e101baba7a79 Binary files /dev/null and b/docs_new/cards/logos/llama.png differ diff --git a/docs_new/cards/logos/minimax.png b/docs_new/cards/logos/minimax.png new file mode 100644 index 000000000000..dbb8f23adf1c Binary files /dev/null and b/docs_new/cards/logos/minimax.png differ diff --git a/docs_new/cards/logos/mistral.png b/docs_new/cards/logos/mistral.png new file mode 100644 index 000000000000..337646d72737 Binary files /dev/null and b/docs_new/cards/logos/mistral.png differ diff --git a/docs_new/cards/logos/moonshotai.png b/docs_new/cards/logos/moonshotai.png new file mode 100644 index 000000000000..0789af8d62bd Binary files /dev/null and b/docs_new/cards/logos/moonshotai.png differ diff --git a/docs_new/cards/logos/mova.png b/docs_new/cards/logos/mova.png new file mode 100644 index 
000000000000..4de81e642418 Binary files /dev/null and b/docs_new/cards/logos/mova.png differ diff --git a/docs_new/cards/logos/nvidia.png b/docs_new/cards/logos/nvidia.png new file mode 100644 index 000000000000..fa35ada3c83c Binary files /dev/null and b/docs_new/cards/logos/nvidia.png differ diff --git a/docs_new/cards/logos/openai.png b/docs_new/cards/logos/openai.png new file mode 100644 index 000000000000..89c332dd20ad Binary files /dev/null and b/docs_new/cards/logos/openai.png differ diff --git a/docs_new/cards/logos/qwen.png b/docs_new/cards/logos/qwen.png new file mode 100644 index 000000000000..5fa6c1ce9174 Binary files /dev/null and b/docs_new/cards/logos/qwen.png differ diff --git a/docs_new/cards/logos/stepfun.png b/docs_new/cards/logos/stepfun.png new file mode 100644 index 000000000000..18403cd1886f Binary files /dev/null and b/docs_new/cards/logos/stepfun.png differ diff --git a/docs_new/cards/logos/wan.png b/docs_new/cards/logos/wan.png new file mode 100644 index 000000000000..5fa6c1ce9174 Binary files /dev/null and b/docs_new/cards/logos/wan.png differ diff --git a/docs_new/cards/logos/xiaomi.png b/docs_new/cards/logos/xiaomi.png new file mode 100644 index 000000000000..22623d5e8182 Binary files /dev/null and b/docs_new/cards/logos/xiaomi.png differ diff --git a/docs_new/cards/logos/zimage.png b/docs_new/cards/logos/zimage.png new file mode 100644 index 000000000000..5fa6c1ce9174 Binary files /dev/null and b/docs_new/cards/logos/zimage.png differ diff --git a/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-Math-V2.mdx b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-Math-V2.mdx new file mode 100644 index 000000000000..bbb08460fb99 --- /dev/null +++ b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-Math-V2.mdx @@ -0,0 +1,522 @@ +--- +title: DeepSeek-Math-V2 +metatags: + description: "Deploy DeepSeek-Math-V2 with SGLang - advanced mathematical reasoning model with gold-level IMO/CMO performance and theorem-proving capabilities." 
+--- + +import { DeepSeekMathV2Deployment } from '/src/snippets/autoregressive/deepseek-math-v2-deployment.jsx'; + +## 1. Model Introduction + +[DeepSeek-Math-V2](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2) is DeepSeek's advanced mathematical reasoning model with strong theorem-proving capabilities. The model demonstrates exceptional performance on mathematical competitions, achieving gold-level scores on IMO 2025 and CMO 2024, and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute. + +**Key Features:** + +- **Strong Theorem-Proving**: Gold-level performance on IMO 2025 and CMO 2024 +- **Self-Verifiable Reasoning**: Implements self-verifiable mathematical reasoning for improved accuracy +- **Competition-Level Math**: Near-perfect score (118/120) on Putnam 2024 +- **Large MoE Model**: ~671B total parameters, requires high-memory GPUs (B200 183GB or B300 275GB) + +**Available Models:** + +- **BF16 (Full Weights)**: [deepseek-ai/DeepSeek-Math-V2](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2) - Full precision weights + +**License:** +To use DeepSeek-Math-V2, you must agree to DeepSeek's Community License. See [LICENSE](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2/blob/main/LICENSE) for details. + +## 2. SGLang Installation + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and deployment strategy. 
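
The deployment command for this model is a small set of flags, so it can be sketched in a few lines of Python. This is a rough illustration only: `build_serve_command` and its parameters are hypothetical names, not part of SGLang or of the interactive generator, and the flags are taken from the example commands in this cookbook. It assumes BF16 weights with `--tp 8` and, when DP attention is enabled, a `--dp` value that mirrors `--tp`.

```python
import shlex


# Hypothetical helper: the name and parameters are illustrative, not an SGLang API.
def build_serve_command(tp=8, ep=8, dp_attention=False, host="0.0.0.0", port=30000):
    """Assemble an `sglang serve` argument list for DeepSeek-Math-V2."""
    cmd = [
        "sglang", "serve",
        "--model-path", "deepseek-ai/DeepSeek-Math-V2",
        "--tp", str(tp),
        "--ep", str(ep),
        "--reasoning-parser", "deepseek-r1",
        "--host", host,
        "--port", str(port),
    ]
    if dp_attention:
        # Assumption: --dp mirrors --tp, which is the common pairing for DP attention.
        cmd += ["--dp", str(tp), "--enable-dp-attention"]
    return cmd


# Print the default (latency-oriented) command as a single shell line.
print(shlex.join(build_serve_command()))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting; `shlex.join` is used only for display.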
+ + + +### 3.2 Configuration Tips + +**Hardware Requirements:** + +- **B200 (183GB)**: BF16 tp=8 +- **B300 (275GB)**: BF16 tp=8 + +**DP Attention:** + +- Enable DP attention for high-throughput scenarios +- The `--dp` value commonly matches the `--tp` value +- Trade-off: Higher throughput at the cost of slightly increased latency + +## 4. Model Invocation + +### 4.1 Deployment Command + +Deploy the model using the command generated above. Example for B200: + +```shell Command +sglang serve --model-path deepseek-ai/DeepSeek-Math-V2 \ + --tp 8 \ + --ep 8 \ + --reasoning-parser deepseek-r1 \ + --host 0.0.0.0 \ + --port 30000 +``` + +### 4.2 Mathematical Reasoning + +DeepSeek-Math-V2 excels at mathematical problem-solving with step-by-step reasoning. + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Mathematical reasoning problem +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-Math-V2", + messages=[ + {"role": "user", "content": "Prove that for any positive integer n, the sum 1 + 2 + 3 + ... 
+ n = n(n+1)/2"} + ], + max_tokens=4096, + stream=True +) + +# Process the stream +thinking_started = False +has_thinking = False +has_answer = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +We need to prove that for any positive integer n, the sum 1 + 2 + 3 + ... + n = n(n+1)/2. + +This is a classic formula for the sum of the first n natural numbers. We can prove by induction. + +Base case: n=1, LHS = 1, RHS = 1*(1+1)/2 = 1*2/2 = 1. Holds. + +Inductive step: Assume true for n = k, i.e., 1 + 2 + ... + k = k(k+1)/2. Then for n = k+1, sum = 1 + 2 + ... + k + (k+1) = [k(k+1)/2] + (k+1) = (k(k+1) + 2(k+1))/2 = (k+1)(k+2)/2 = (k+1 +)((k+1)+1)/2. So holds for k+1. By induction, holds for all positive integers n. + +... +=============== Content ================= +We can prove the well-known formula for the sum of the first \(n\) positive integers in several ways. Two of the most elementary are presented below. + +--- + +### 1. Proof by mathematical induction + +**Base case (\(n=1\))**: +\[ +1 = \frac{1\cdot(1+1)}{2}= \frac{1\cdot2}{2}=1, +\] +so the formula holds for \(n=1\). + +**Inductive hypothesis:** +Assume that for some positive integer \(k\) the formula is true, i.e. +\[ +1+2+\dots+k = \frac{k(k+1)}{2}. 
+\] + +**Inductive step (\(k \to k+1\))**: +Consider the sum up to \(k+1\): +\[ +\begin{aligned} +1+2+\dots+k+(k+1) &= \bigl(1+2+\dots+k\bigr) + (k+1) \\[4pt] +&= \frac{k(k+1)}{2} + (k+1) \qquad\text{(by the induction hypothesis)}\\[4pt] +&= (k+1)\left(\frac{k}{2}+1\right)\\[4pt] +&= (k+1)\frac{k+2}{2}\\[4pt] +&= \frac{(k+1)(k+2)}{2}\\[4pt] +&= \frac{(k+1)\bigl((k+1)+1\bigr)}{2}. +\end{aligned} +\] +Thus the formula also holds for \(n=k+1\). + +By the principle of mathematical induction, +\[ +1+2+3+\dots+n = \frac{n(n+1)}{2} +\] +for every positive integer \(n\). + +--- + +### 2. Proof by pairing (Gauss’s trick) + +Let +\[ +S = 1 + 2 + 3 + \dots + n. +\] + +Write the same sum in reverse order: +\[ +S = n + (n-1) + (n-2) + \dots + 1. +\] + +Add the two equalities term‑by‑term: +\[ +\begin{aligned} +2S &= (1+n) + \bigl(2+(n-1)\bigr) + \bigl(3+(n-2)\bigr) + \dots + (n+1)\\ + &= \underbrace{(n+1)+(n+1)+\dots+(n+1)}_{n\ \text{times}}\\ + &= n\,(n+1). +\end{aligned} +\] + +Therefore +\[ +S = \frac{n(n+1)}{2}. +\] + +Both proofs are rigorous and show that the formula holds for all positive integers \(n\). +``` + +### 4.3 Competition-Level Problems + +**Example: IMO-style Problem:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# IMO-style problem +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-Math-V2", + messages=[ + {"role": "user", "content": "Let a, b, c be positive real numbers such that abc = 1. 
Prove that (a-1+1/b)(b-1+1/c)(c-1+1/a) <= 1."} + ], + max_tokens=8192, + stream=True +) + +# Process the stream +thinking_started = False +has_thinking = False +has_answer = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +We need to prove that for positive real numbers a,b,c with abc = 1, we have: + +\[ +(a - 1 + \frac{1}{b})(b - 1 + \frac{1}{c})(c - 1 + \frac{1}{a}) \le 1. +\] + +We can rewrite the expressions: Since abc=1, we have 1/b = ac, 1/c = ab, 1/a = bc. Wait careful: abc=1 => 1/b = ac? Actually 1/b = ac? Let's check: abc=1 => ac = 1/b? Multiply both sides by something: abc=1 => (ac) b = 1 => ac = 1/b. Yes, because (ac) * b = 1 => ac = 1/b. Similarly, ab = 1/c, bc = 1/a. So we can rewrite: + +... +=============== Content ================= + +We are given positive real numbers \(a,b,c\) with \(abc=1\). We must prove + +\[ +\Bigl(a-1+\frac1b\Bigr)\Bigl(b-1+\frac1c\Bigr)\Bigl(c-1+\frac1a\Bigr)\le 1 . +\] + +--- + +### 1. A convenient substitution + +Because \(abc=1\), we can write + +\[ +a=\frac{x}{y},\qquad b=\frac{y}{z},\qquad c=\frac{z}{x} +\] + +with positive numbers \(x,y,z\). +(For instance, take \(x=1,\;y=\frac1a,\;z=\frac1{ab}\); then indeed \(a=\frac{x}{y},\;b=\frac{y}{z}\) and, using \(abc=1\), we obtain \(c=\frac{z}{x}=\frac1{ab}=c\).) + +--- + +### 2. 
Rewriting the factors + +\[ +\begin{aligned} +a-1+\frac1b &=\frac{x}{y}-1+\frac{z}{y}= \frac{x+z-y}{y},\\[2mm] +b-1+\frac1c &=\frac{y}{z}-1+\frac{x}{z}= \frac{x+y-z}{z},\\[2mm] +c-1+\frac1a &=\frac{z}{x}-1+\frac{y}{x}= \frac{y+z-x}{x}. +\end{aligned} +\] + +Hence the product becomes + +\[ +P=\Bigl(a-1+\frac1b\Bigr)\Bigl(b-1+\frac1c\Bigr)\Bigl(c-1+\frac1a\Bigr) + =\frac{(x+z-y)(x+y-z)(y+z-x)}{xyz}. +\] + +--- + +### 3. Reducing to a known inequality + +We have to show \(P\le1\), i.e. + +\[ +(x+z-y)(x+y-z)(y+z-x)\le xyz . +\tag{1} +\] + +Set + +\[ +p=x+y+z,\qquad q=xy+yz+zx,\qquad r=xyz . +\] + +Notice that + +\[ +x+z-y=p-2y,\quad x+y-z=p-2z,\quad y+z-x=p-2x . +\] + +Therefore + +\[ +\begin{aligned} +(x+z-y)(x+y-z)(y+z-x) +&=(p-2x)(p-2y)(p-2z)\\ +&=p^{3}-2p^{2}(x+y+z)+4p(xy+yz+zx)-8xyz\\ +&=-p^{3}+4pq-8r . +\end{aligned} +\] + +Inequality (1) is thus equivalent to + +\[ +-p^{3}+4pq-8r\le r\quad\Longleftrightarrow\quad 4pq-p^{3}\le 9r . +\tag{2} +\] + +--- + +### 4. Applying Schur’s inequality + +Schur’s inequality of third degree states that for any non‑negative \(x,y,z\) + +\[ +p^{3}+9r\ge 4pq . +\] + +Rearranged, this is exactly \(4pq-p^{3}\le 9r\), which is (2). +Since our \(x,y,z\) are positive, Schur’s inequality applies and (2) holds. + +Consequently (1) is true, and we obtain \(P\le1\). + +--- + +### 5. Equality case + +Equality in Schur’s inequality for positive numbers occurs only when \(x=y=z\). +Then \(a=b=c=1\), and indeed the product equals \(1\). + +--- + +Thus for all positive \(a,b,c\) with \(abc=1\), + +\[ +\Bigl(a-1+\frac1b\Bigr)\Bigl(b-1+\frac1c\Bigr)\Bigl(c-1+\frac1a\Bigr)\le 1 . +\] + +∎ +``` + +## 5. 
Benchmark + +### 5.1 Accuracy Benchmark + +#### 5.1.1 GSM8K Benchmark + +**Benchmark Command:** + +```shell Command +python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --port 30000 +``` + +**Test Results:** + +```text Output +Accuracy: 0.975 +Invalid: 0.000 +Latency: 34.358 s +Output throughput: 540.162 token/s +``` + +### 5.2 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (8x, 183GB each) +- Model: DeepSeek-Math-V2 +- Tensor Parallelism: 8 +- SGLang Version: 0.5.8 + +#### 5.2.1 Latency Benchmark + +**Benchmark Command:** + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-Math-V2 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +**Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 53.34 +Total input tokens: 1972 +Total input text tokens: 1972 +Total generated tokens: 2784 +Total generated tokens (retokenized): 2778 +Request throughput (req/s): 0.19 +Input token throughput (tok/s): 36.97 +Output token throughput (tok/s): 52.19 +Peak output token throughput (tok/s): 56.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 89.16 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 5330.72 +Median E2E Latency (ms): 5879.28 +P90 E2E Latency (ms): 8320.33 +P99 E2E Latency (ms): 9921.29 +---------------Time to First Token---------------- +Mean TTFT (ms): 183.38 +Median TTFT (ms): 177.92 +P99 TTFT (ms): 217.64 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 17.96 +Median TPOT (ms): 18.39 +P99 TPOT (ms): 19.03 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 18.57 +Median ITL (ms): 18.63 +P95 ITL (ms): 19.26 +P99 ITL (ms): 19.48 +Max ITL (ms): 24.93 +================================================== +``` + +#### 5.2.2 Throughput Benchmark + +**Benchmark Command:** + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-Math-V2 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +**Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 217.36 +Total input tokens: 301701 +Total input text tokens: 301701 +Total generated tokens: 188375 +Total generated tokens (retokenized): 187456 +Request throughput (req/s): 4.60 +Input token throughput (tok/s): 1388.05 +Output token throughput (tok/s): 866.67 +Peak output token throughput (tok/s): 2589.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 2254.72 +Concurrency: 89.81 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 19521.73 +Median E2E Latency (ms): 12076.76 +P90 E2E Latency (ms): 47248.87 +P99 E2E Latency (ms): 86862.79 +---------------Time to First Token---------------- +Mean TTFT (ms): 790.40 +Median TTFT (ms): 456.81 +P99 TTFT (ms): 4223.33 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 106.52 +Median TPOT (ms): 107.24 +P99 TPOT (ms): 238.33 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 100.29 +Median ITL (ms): 38.34 +P95 ITL (ms): 237.00 +P99 ITL (ms): 347.49 +Max ITL (ms): 3642.56 +================================================== +``` diff --git a/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-OCR-2.mdx b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-OCR-2.mdx new file mode 100644 index 000000000000..1b744541f4b3 --- /dev/null +++ b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-OCR-2.mdx @@ -0,0 +1,250 @@ +--- +title: DeepSeek-OCR-2 +metatags: + description: "Deploy DeepSeek-OCR-2 with SGLang - high-accuracy text extraction from images and documents for OCR tasks." +--- + +import { DeepSeekOCR2Deployment } from '/src/snippets/autoregressive/deepseek-ocr-v2-deployment.jsx'; + +## 1. Model Introduction + +[DeepSeek-OCR-2](https://github.com/deepseek-ai/DeepSeek-OCR-2) is DeepSeek's next-generation OCR (Optical Character Recognition) model, building on DeepSeek-OCR with improved accuracy and broader document understanding capabilities. The model is optimized for high-accuracy text extraction from images across a wide variety of document types and formats. + +**Key Features:** + +- **Semantic-Aware Visual Encoding (DeepEncoder V2)**: DeepSeek-OCR-2 introduces DeepEncoder V2, which models document reading order in a more human-like, semantic-driven manner rather than relying on fixed raster scanning. This significantly improves logical reading flow in complex layouts (e.g., multi-column documents). +- **Stronger Layout and Structural Understanding**: DeepSeek-OCR-2 demonstrates improved performance on structured documents such as tables, forms, and dense multi-column pages. It reduces reading-order errors and improves overall document parsing robustness compared to the original version. 
+- **Improved Accuracy While Maintaining Token Efficiency**: The original DeepSeek-OCR emphasized aggressive visual token compression. OCR-2 maintains high token efficiency while delivering higher benchmark performance, particularly on document-level understanding tasks. +- **Better Generalization Across Complex Document Tasks**: DeepSeek-OCR-2 performs more consistently across multilingual documents, structured data extraction, and visually complex content, making it more suitable for real-world document intelligence scenarios beyond plain text OCR. + +**Available Models:** + +- **Base Model**: [deepseek-ai/DeepSeek-OCR-2](https://huggingface.co/deepseek-ai/DeepSeek-OCR-2) - Recommended for OCR tasks + +**License:** +To use DeepSeek-OCR-2, you must agree to DeepSeek's Community License. See [LICENSE](https://huggingface.co/deepseek-ai/DeepSeek-OCR-2/blob/main/LICENSE.txt) for details. + +For more details, please refer to the [official DeepSeek-OCR-2 repository](https://github.com/deepseek-ai/DeepSeek-OCR-2). + +## 2. SGLang Installation + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and deployment strategy. SGLang supports serving DeepSeek-OCR-2 on NVIDIA H200 and B200, and AMD MI300X, MI355X, and MI325X GPUs. + + + +**Note**: DeepSeek-OCR-2 has ~3.58B parameters and easily fits on a single modern GPU. For low-latency serving, no model parallelism is needed. 
For high-throughput requirements, consider using data parallelism with the SGLang Model Gateway; see [DP, DPA and SGLang DP Router](../../../docs/advanced_features/sgl_model_gateway) for more details. + +### 3.2 Configuration Tips + +For more detailed configuration tips, please refer to [DeepSeek V3/V3.1/R1 Usage](../../../docs/basic_usage/deepseek_v3). + +## 4. Model Invocation + +### 4.1 Basic Usage + +**OpenAI-compatible request example** + +```python Example +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "deepseek-ai/DeepSeek-OCR-2", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "\n<|grounding|>Convert the document to markdown."}, + {"type": "image_url", "image_url": {"url": "https://example.com/your_image.jpg"}}, + ], + } + ], + "max_tokens": 512, +} + +response = requests.post(url, json=data) +print(response.text) +``` + +**Reference** +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Recommended Prompts + +The following prompts are recommended by the [official model card](https://huggingface.co/deepseek-ai/DeepSeek-OCR-2#main-prompts). + +**Structured document conversion** (extracts text while preserving layout): + +```text Example + +<|grounding|>Convert the document to markdown. +``` + +**Free-form OCR** (extracts text without preserving layout): + +```text Example + +Free OCR. +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA H200 GPU (1x) +- Model: DeepSeek-OCR-2 +- Tensor Parallelism: 1 +- SGLang Version: 0.0.0.dev1+g93fca0bbc + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data, so it better reflects performance in actual use.
To simulate real-world usage patterns, we configure each request with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses. For more details on how to perform evaluation, see [Evaluating New Models with SGLang](../../../docs/developer_guide/evaluating_new_models). + +#### 5.1.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +sglang serve \ + --model-path deepseek-ai/DeepSeek-OCR-2 \ + --enable-multimodal \ + --host 0.0.0.0 \ + --port 30000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 0.0.0.0 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-OCR-2 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 3.54 +Total input tokens: 1972 +Total input text tokens: 1972 +Total generated tokens: 2784 +Total generated tokens (retokenized): 2710 +Request throughput (req/s): 2.83 +Input token throughput (tok/s): 557.53 +Output token throughput (tok/s): 787.10 +Peak output token throughput (tok/s): 818.00 +Peak concurrent requests: 5 +Total token throughput (tok/s): 1344.63 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 352.69 +Median E2E Latency (ms): 392.34 +P90 E2E Latency (ms): 540.64 +P99 E2E Latency (ms): 639.01 +---------------Time to First Token---------------- +Mean TTFT (ms): 18.08 +Median TTFT (ms): 16.57 +P99 TTFT (ms): 25.67 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 1.18 +Median TPOT (ms): 1.21 +P99 TPOT (ms): 1.22 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 1.21 +Median ITL (ms): 1.21 +P95 ITL (ms): 1.28 +P99 ITL (ms): 1.44 +Max ITL (ms): 4.32 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +sglang serve \ + --model-path deepseek-ai/DeepSeek-OCR-2 \ + --enable-multimodal \ + --tp 1 \ + --ep 1 \ + --dp 1 \ + --enable-dp-attention \ + --host 0.0.0.0 \ + --port 30000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 0.0.0.0 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-OCR-2 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 14.79 +Total input tokens: 301698 +Total input text tokens: 301698 +Total generated tokens: 188375 +Total generated tokens (retokenized): 185236 +Request throughput (req/s): 67.63 +Input token throughput (tok/s): 20402.54 +Output token throughput (tok/s): 12738.99 +Peak output token throughput (tok/s): 17508.00 +Peak concurrent requests: 187 +Total token throughput (tok/s): 33141.53 +Concurrency: 86.87 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1284.50 +Median E2E Latency (ms): 866.07 +P90 E2E Latency (ms): 3027.32 +P99 E2E Latency (ms): 5490.63 +---------------Time to First Token---------------- +Mean TTFT (ms): 86.08 +Median TTFT (ms): 50.09 +P99 TTFT (ms): 613.92 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 7.79 +Median TPOT (ms): 6.54 +P99 TPOT (ms): 50.10 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 6.42 +Median ITL (ms): 4.64 +P95 ITL (ms): 23.65 +P99 ITL (ms): 39.62 +Max ITL (ms): 452.65 +================================================== +``` diff --git a/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-OCR.mdx b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-OCR.mdx new file mode 100644 index 000000000000..55b8aca248b7 --- /dev/null +++ b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-OCR.mdx @@ -0,0 +1,204 @@ +--- +title: DeepSeek-OCR +metatags: + description: "Deploy DeepSeek-OCR with SGLang - high-accuracy text extraction from images and documents for OCR tasks." +--- + +## 1. Model Introduction + +[DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) is DeepSeek's advanced OCR (Optical Character Recognition) model designed for high-accuracy text extraction from images. The model is optimized for various document processing and image-to-text conversion tasks. + +**Key Features:** + +- **Advanced OCR**: High-accuracy text recognition from images and documents +- **Multi-Modality**: Supports various image formats and document types + +**Available Models:** + +- **Base Model**: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) - Recommended for OCR tasks + +**License:** +To use DeepSeek-OCR, you must agree to DeepSeek's Community License. See [LICENSE](https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LICENSE) for details. + +For more details, please refer to the [official DeepSeek-OCR repository](https://github.com/deepseek-ai/DeepSeek-OCR). + +## 2. SGLang Installation + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. 
+ +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and deployment strategy. + +import { DeepSeekOCRDeployment } from "/src/snippets/autoregressive/deepseek-ocr-deployment.jsx"; + + + +### 3.2 Configuration Tips + +For more detailed configuration tips, please refer to [DeepSeek V3/V3.1/R1 Usage](../../../docs/basic_usage/deepseek_v3). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: AMD MI300X GPU (1x) +- Model: DeepSeek-OCR +- Tensor Parallelism: 1 +- sglang version: 0.5.7 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. To simulate real-world usage patterns, we configure each request with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses. 
+ +#### 5.1.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-OCR \ + --tp 1 \ + --dtype float16 \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 8000 \ + --model deepseek-ai/DeepSeek-OCR \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 4.45 +Total input tokens: 1972 +Total input text tokens: 1972 +Total input vision tokens: 0 +Total generated tokens: 2784 +Total generated tokens (retokenized): 2770 +Request throughput (req/s): 2.25 +Input token throughput (tok/s): 442.89 +Output token throughput (tok/s): 625.26 +Peak output token throughput (tok/s): 635.00 +Peak concurrent requests: 4 +Total token throughput (tok/s): 1068.16 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 443.32 +Median E2E Latency (ms): 493.29 +---------------Time to First Token---------------- +Mean TTFT (ms): 21.59 +Median TTFT (ms): 20.89 +P99 TTFT (ms): 24.81 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 1.47 +Median TPOT (ms): 1.52 +P99 TPOT (ms): 1.53 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 1.52 +Median ITL (ms): 1.51 +P95 ITL (ms): 1.76 +P99 ITL (ms): 1.93 +Max ITL (ms): 8.28 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-OCR \ + --tp 1 \ + --ep 1 \ + --dp 1 \ + --enable-dp-attention \ + --dtype float16 \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 8000 \ + --model deepseek-ai/DeepSeek-OCR \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 16.24 +Total input tokens: 301698 +Total input text tokens: 301698 +Total input vision tokens: 0 +Total generated tokens: 188375 +Total generated tokens (retokenized): 186927 +Request throughput (req/s): 61.59 +Input token throughput (tok/s): 18582.90 +Output token throughput (tok/s): 11602.84 +Peak output token throughput (tok/s): 15479.00 +Peak concurrent requests: 179 +Total token throughput (tok/s): 30185.75 +Concurrency: 85.53 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1388.60 +Median E2E Latency (ms): 901.43 +---------------Time to First Token---------------- +Mean TTFT (ms): 73.36 +Median TTFT (ms): 50.21 +P99 TTFT (ms): 349.53 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 7.42 +Median TPOT (ms): 7.31 +P99 TPOT (ms): 27.99 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 7.04 +Median ITL (ms): 4.62 +P95 ITL (ms): 21.11 +P99 ITL (ms): 36.92 +Max ITL (ms): 172.15 +================================================== +``` diff --git a/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-R1.mdx b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-R1.mdx new file mode 100644 index 000000000000..432f85bbd4d7 --- /dev/null +++ b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-R1.mdx @@ -0,0 +1,910 @@ +--- +title: DeepSeek-R1 +metatags: + description: "Deploy DeepSeek-R1 reasoning model with SGLang - advanced step-by-step reasoning with FP8/FP4 quantization for NVIDIA and AMD GPUs." +--- + +import { DeepSeekR1BasicDeployment } from '/src/snippets/autoregressive/deepseek-r1-basic-deployment.jsx'; +import { DeepSeekR1AdvancedDeployment } from '/src/snippets/autoregressive/deepseek-r1-advanced-deployment.jsx'; + +## 1. Model Introduction + +[DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) is DeepSeek's advanced reasoning model that combines powerful language understanding with step-by-step reasoning capabilities. The model is available in multiple quantization formats optimized for different hardware platforms. 
+ +**Key Features:** + +- **Advanced Reasoning**: Built-in reasoning capabilities for complex problem-solving +- **Multiple Quantizations**: FP8 and FP4 variants for different performance/memory trade-offs +- **Hardware Optimization**: Specifically tuned for NVIDIA B200 (Blackwell) and H200 (Hopper) GPUs, and AMD MI300X, MI325X and MI355X GPUs +- **High Performance**: Optimized for both throughput and latency scenarios + +**Available Models:** + +- **FP8 (8-bit quantized)**: [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) - Recommended for H200 and MI300X +- **FP4 (4-bit quantized)**: [nvidia/DeepSeek-R1-0528-FP4-v2](https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4-v2) - Recommended for B200 and MI355X + +**License:** +To use DeepSeek-R1, you must agree to DeepSeek's Community License. See [LICENSE](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/LICENSE) for details. + +For more details, please refer to the [official DeepSeek-R1 repository](https://github.com/deepseek-ai/DeepSeek-R1). + +## 2. SGLang Installation + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate a basic deployment command for your hardware platform, quantization method, and deployment strategy. + + + +### 3.2 Optimal Configurations + +Pareto-optimal configurations for B200, H200, MI300X, MI325X, and MI355X hardware. + + + +### 3.3 Configuration Tips + +For more detailed configuration tips and advanced tuning, please refer to [DeepSeek V3/V3.1/R1 Usage](../../../docs/basic_usage/deepseek_v3). + +## 4. 
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +DeepSeek-R1 supports advanced reasoning capabilities with built-in thinking process. Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-R1-0528 \ + --reasoning-parser deepseek-r1 \ + --tp 8 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1-0528", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +To solve this problem, I need to calculate 15% of 240. 
+Step 1: Convert 15% to decimal: 15% = 0.15 +Step 2: Multiply 240 by 0.15 +Step 3: 240 × 0.15 = 36 +=============== Content ================= + +The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36. +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +DeepSeek-R1 supports tool calling capabilities. Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-R1-0528 \ + --reasoning-parser deepseek-r1 \ + --tool-call-parser deepseekv3 \ + --chat-template examples/chat_template/tool_chat_template_deepseekr1.jinja \ + --tp 8 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1-0528", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking 
=================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"🔧 Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. +I should call the function with location="Beijing". +=============== Content ================= + +🔧 Tool Call: get_weather + Arguments: +🔧 Tool Call: None + Arguments: {"location": "Beijing"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1-0528", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: B200 GPU (8x) +- Model: DeepSeek-R1-0528 +- Tensor Parallelism: 8 +- SGLang Version: 0.5.6.post1 + +**Benchmark Methodology:** + +We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms. + +#### 5.1.1 Standard Test Scenarios + +Three core scenarios reflect real-world usage patterns: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Scenario | Input Length | Output Length | Use Case |
| --- | --- | --- | --- |
| **Chat** | 1K | 1K | Most common conversational AI workload |
| **Reasoning** | 1K | 8K | Long-form generation, complex reasoning tasks |
| **Summarization** | 8K | 1K | Document summarization, RAG retrieval |
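
The three scenarios above map onto `sglang.bench_serving` flags. As a minimal sketch, the command line for each scenario could be assembled as follows. The `SCENARIOS` table and the `build_bench_command` helper are hypothetical names introduced here for illustration and are not part of SGLang; only the CLI flags are taken from the benchmark commands in this guide. The default multiplier of 5 prompts per concurrency slot is an assumption that follows the "Recommended" setting in section 5.1.3.

```python
import shlex

# Hypothetical scenario table: (input length, output length) in tokens,
# using the 1000/8000 values that the benchmark commands in this guide pass.
SCENARIOS = {
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}


def build_bench_command(scenario, concurrency, prompts_per_slot=5,
                        model="deepseek-ai/DeepSeek-R1-0528"):
    """Return a shell-quoted bench_serving command for one scenario.

    Illustrative helper only: it formats flags, it does not run anything.
    """
    input_len, output_len = SCENARIOS[scenario]
    args = [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        # num_prompts = concurrency x multiplier (5 is the assumed default)
        "--num-prompts", str(concurrency * prompts_per_slot),
        "--max-concurrency", str(concurrency),
        "--request-rate", "inf",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    for name in SCENARIOS:
        print(build_bench_command(name, concurrency=16))
```

Keeping the scenario definitions in one table makes it harder for the input and output lengths to drift between runs when the same scenario is repeated at several concurrency levels.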
+ +#### 5.1.2 Concurrency Levels + +Test each scenario at different concurrency levels to capture the throughput vs. latency trade-off: + +- **Low Concurrency**: `--max-concurrency 1` (Latency-optimized) +- **Medium Concurrency**: `--max-concurrency 16` (Balanced) +- **High Concurrency**: `--max-concurrency 100` (Throughput-optimized) + +#### 5.1.3 Number of Prompts + +For each concurrency level, configure `num_prompts` to simulate realistic user loads: + +- **Quick Test**: `num_prompts = concurrency × 1` (minimal test) +- **Recommended**: `num_prompts = concurrency × 5` (standard benchmark) +- **Stable Measurements**: `num_prompts = concurrency × 10` (production-grade) + +--- + +#### 5.1.4 Benchmark Commands + +**Scenario 1: Chat (1K/1K) - Most Important** + +- **Model Deployment** + +```bash Command +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-R1-0528 \ + --tp 8 +``` + +- Low Concurrency (Latency-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 40.00 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4210 +Total generated tokens (retokenized): 4205 +Request throughput (req/s): 0.25 +Input token throughput (tok/s): 152.52 +Output token throughput (tok/s): 105.24 +Peak output token throughput (tok/s): 110.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 257.76 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3998.40 +Median E2E Latency (ms): 3207.53 +---------------Time to First Token---------------- 
+Mean TTFT (ms): 153.00 +Median TTFT (ms): 140.76 +P99 TTFT (ms): 214.66 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 9.16 +Median TPOT (ms): 9.15 +P99 TPOT (ms): 9.21 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.16 +Median ITL (ms): 9.15 +P95 ITL (ms): 9.47 +P99 ITL (ms): 9.63 +Max ITL (ms): 15.45 +================================================== +``` + +- Medium Concurrency (Balanced) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 51.21 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40725 +Total generated tokens (retokenized): 40458 +Request throughput (req/s): 1.56 +Input token throughput (tok/s): 774.66 +Output token throughput (tok/s): 795.30 +Peak output token throughput (tok/s): 1088.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 1569.96 +Concurrency: 13.93 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8918.33 +Median E2E Latency (ms): 9466.16 +---------------Time to First Token---------------- +Mean TTFT (ms): 273.51 +Median TTFT (ms): 131.71 +P99 TTFT (ms): 839.57 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 17.56 +Median TPOT (ms): 17.46 +P99 TPOT (ms): 28.68 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 17.02 +Median ITL (ms): 14.70 +P95 ITL (ms): 16.41 +P99 ITL (ms): 112.38 +Max ITL (ms): 461.90 +================================================== +``` + +- High Concurrency (Throughput-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 110.46 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252162 +Total generated tokens (retokenized): 251441 +Request throughput (req/s): 4.53 +Input token throughput (tok/s): 2261.80 +Output token throughput (tok/s): 2282.90 +Peak output token throughput (tok/s): 3900.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 4544.71 +Concurrency: 92.26 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 20380.71 +Median E2E Latency (ms): 19391.65 +---------------Time to First Token---------------- +Mean TTFT (ms): 563.14 +Median TTFT (ms): 147.62 +P99 TTFT (ms): 2632.11 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 40.11 +Median TPOT (ms): 41.98 +P99 TPOT (ms): 50.10 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 39.37 +Median ITL (ms): 26.36 +P95 ITL (ms): 98.16 +P99 ITL (ms): 150.08 +Max ITL (ms): 2052.85 +================================================== +``` + +**Scenario 2: Reasoning (1K/8K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 411.34 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 44452 +Total generated tokens (retokenized): 44390 +Request throughput (req/s): 0.02 +Input token throughput (tok/s): 14.83 +Output token throughput (tok/s): 108.07 +Peak output token throughput (tok/s): 110.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 122.90 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 41132.04 +Median E2E Latency (ms): 44288.71 +---------------Time to First Token---------------- +Mean TTFT (ms): 125.76 +Median TTFT (ms): 126.19 +P99 TTFT (ms): 137.69 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 9.21 +Median TPOT (ms): 9.20 +P99 TPOT (ms): 9.27 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.23 +Median ITL (ms): 9.22 +P95 ITL (ms): 9.64 +P99 ITL (ms): 9.86 +Max ITL (ms): 15.18 +================================================== +``` + +- Medium Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 348.93 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 318226 +Total generated tokens (retokenized): 317630 +Request throughput (req/s): 0.23 +Input token throughput (tok/s): 113.69 +Output token throughput (tok/s): 912.02 +Peak output token throughput (tok/s): 1088.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 1025.70 +Concurrency: 14.07 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 61360.70 +Median E2E Latency (ms): 62071.20 +---------------Time to First Token---------------- +Mean TTFT (ms): 176.02 +Median TTFT (ms): 153.75 +P99 TTFT (ms): 268.44 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 15.42 +Median TPOT (ms): 15.59 +P99 TPOT (ms): 16.07 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 15.39 +Median ITL (ms): 15.17 +P95 ITL (ms): 16.62 +P99 ITL (ms): 18.13 +Max ITL (ms): 226.59 +================================================== +``` + +- High Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 589.31 +Total input tokens: 158939 +Total input text tokens: 158939 +Total input vision tokens: 0 +Total generated tokens: 1300705 +Total generated tokens (retokenized): 1297658 +Request throughput (req/s): 0.54 +Input token throughput (tok/s): 269.70 +Output token throughput (tok/s): 2207.16 +Peak output token throughput (tok/s): 2944.00 +Peak concurrent requests: 68 +Total token throughput (tok/s): 2476.86 +Concurrency: 57.03 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 105032.36 +Median E2E Latency (ms): 108229.09 +---------------Time to First Token---------------- +Mean TTFT (ms): 223.91 +Median TTFT (ms): 158.15 +P99 TTFT (ms): 474.86 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 25.94 +Median TPOT (ms): 26.72 +P99 TPOT (ms): 27.99 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 25.79 +Median ITL (ms): 25.37 +P95 ITL (ms): 26.70 +P99 ITL (ms): 105.49 +Max ITL (ms): 237.91 +================================================== +``` + +**Scenario 3: Summarization (8K/1K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 40.65 +Total input tokens: 41941 +Total input text tokens: 41941 +Total input vision tokens: 0 +Total generated tokens: 4210 +Total generated tokens (retokenized): 4195 +Request throughput (req/s): 0.25 +Input token throughput (tok/s): 1031.65 +Output token throughput (tok/s): 103.56 +Peak output token throughput (tok/s): 110.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 1135.20 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4063.62 +Median E2E Latency (ms): 3296.13 +---------------Time to First Token---------------- +Mean TTFT (ms): 165.91 +Median TTFT (ms): 154.96 +P99 TTFT (ms): 240.92 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 9.26 +Median TPOT (ms): 9.27 +P99 TPOT (ms): 9.42 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.28 +Median ITL (ms): 9.28 +P95 ITL (ms): 9.66 +P99 ITL (ms): 9.83 +Max ITL (ms): 14.06 +================================================== +``` + +- Medium Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 56.71 +Total input tokens: 300020 +Total input text tokens: 300020 +Total input vision tokens: 0 +Total generated tokens: 41589 +Total generated tokens (retokenized): 41490 +Request throughput (req/s): 1.41 +Input token throughput (tok/s): 5290.75 +Output token throughput (tok/s): 733.41 +Peak output token throughput (tok/s): 1024.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 6024.16 +Concurrency: 14.25 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 10098.99 +Median E2E Latency (ms): 10623.46 +---------------Time to First Token---------------- +Mean TTFT (ms): 486.80 +Median TTFT (ms): 189.59 +P99 TTFT (ms): 2138.73 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 19.06 +Median TPOT (ms): 19.23 +P99 TPOT (ms): 30.69 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 18.53 +Median ITL (ms): 15.63 +P95 ITL (ms): 16.64 +P99 ITL (ms): 109.71 +Max ITL (ms): 1471.36 +================================================== +``` + +- High Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-R1-0528 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 115.55 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total input vision tokens: 0 +Total generated tokens: 169680 +Total generated tokens (retokenized): 169275 +Request throughput (req/s): 2.77 +Input token throughput (tok/s): 11024.93 +Output token throughput (tok/s): 1468.50 +Peak output token throughput (tok/s): 2254.00 +Peak concurrent requests: 70 +Total token throughput (tok/s): 12493.43 +Concurrency: 59.45 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 21465.98 +Median E2E Latency (ms): 20686.26 +---------------Time to First Token---------------- +Mean TTFT (ms): 913.93 +Median TTFT (ms): 224.92 +P99 TTFT (ms): 6257.83 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 39.93 +Median TPOT (ms): 40.99 +P99 TPOT (ms): 60.91 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 38.83 +Median ITL (ms): 26.29 +P95 ITL (ms): 113.81 +P99 ITL (ms): 176.94 +Max ITL (ms): 5521.53 +================================================== +``` + +#### 5.1.5 Understanding the Results + +**Key Metrics:** + +- **Request Throughput (req/s)**: Number of requests processed per second +- **Output Token Throughput (tok/s)**: Total tokens generated per second +- **Mean TTFT (ms)**: Time to First Token - measures responsiveness +- **Mean TPOT (ms)**: Time Per Output Token - measures generation speed +- **Mean ITL (ms)**: Inter-Token Latency - measures streaming consistency + +**Why These Configurations Matter:** + +- **1K/1K (Chat)**: Represents the most common conversational AI workload. This is the highest priority scenario for most deployments. +- **1K/8K (Reasoning)**: Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations. +- **8K/1K (Summarization)**: Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks. +- **Variable Concurrency**: Captures the Pareto frontier - the optimal trade-off between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput. 
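The throughput and concurrency figures in a report can be cross-checked from a few of its totals. The sketch below assumes, without confirmation from this guide, that throughput is a total count divided by the benchmark duration and that the reported concurrency is the summed end-to-end latency divided by the duration (Little's law). The helper names are hypothetical; the numbers are copied from the 8K/1K high-concurrency report above.

```python
def throughput(total: float, duration_s: float) -> float:
    """Items (requests or tokens) completed per second of wall-clock time."""
    return total / duration_s


def mean_concurrency(num_requests: int, mean_e2e_ms: float, duration_s: float) -> float:
    """Average number of in-flight requests: summed request latency / duration."""
    return num_requests * (mean_e2e_ms / 1000.0) / duration_s


# Values copied from the 8K/1K high-concurrency report above.
duration_s = 115.55
num_requests = 320
output_tokens = 169680
mean_e2e_ms = 21465.98

req_per_s = throughput(num_requests, duration_s)
out_tok_per_s = throughput(output_tokens, duration_s)
concurrency = mean_concurrency(num_requests, mean_e2e_ms, duration_s)

print(f"Request throughput: {req_per_s:.2f} req/s")
print(f"Output throughput:  {out_tok_per_s:.2f} tok/s")
print(f"Concurrency:        {concurrency:.2f}")
```

Under these assumptions the recomputed values agree with the reported 2.77 req/s and 59.45 concurrency, and land within about 0.1 tok/s of the reported output throughput; the small gap is expected because the report rounds the duration to two decimals.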
+ +**Interpreting Results:** + +- Compare your results against baseline numbers for your hardware +- Higher throughput at same latency = better performance +- Lower TTFT = more responsive user experience +- Lower TPOT = faster generation speed + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py \ + --num-shots 8 \ + --num-questions 1316 \ + --parallel 1316 +``` + +**Test Results:** + +```text Output +Accuracy: 0.959 +Invalid: 0.000 +Latency: 29.185 s +Output throughput: 4854.672 token/s +``` diff --git a/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3.mdx b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3.mdx new file mode 100644 index 000000000000..b9c26d6bc9cb --- /dev/null +++ b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3.mdx @@ -0,0 +1,520 @@ +--- +title: "DeepSeek-V3" +metatags: + description: "Deploy DeepSeek-V3 MoE model with SGLang - efficient architecture with strong reasoning, coding, and tool-augmented capabilities." +--- + + +## 1. Model Introduction + +[DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) is a large-scale Mixture-of-Experts (MoE) language model developed by DeepSeek, designed to deliver strong general-purpose reasoning, coding, and tool-augmented capabilities with high training and inference efficiency. As the latest generation in the DeepSeek model family, DeepSeek V3 introduces systematic architectural and training innovations that significantly improve performance across reasoning, mathematics, coding, and long-context understanding, while maintaining a competitive compute cost. + +Key highlights include: + +- **Efficient MoE architecture**: DeepSeek V3 adopts a fine-grained Mixture-of-Experts design with a large number of experts and sparse activation, enabling high model capacity while keeping inference and training costs manageable. 
+- **Advanced reasoning and coding**: The model demonstrates strong performance on mathematical reasoning, logical inference, and real-world coding benchmarks, benefiting from improved data curation and training strategies. +- **Long-context capability**: DeepSeek V3 supports extended context lengths, allowing it to handle long documents, complex multi-step reasoning, and agent-style workflows more effectively. +- **Tool use and function calling**: The model is trained to support structured outputs and tool invocation, enabling seamless integration with external tools and agent frameworks during inference. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. + +import { DeepSeekV3Deployment } from "/src/snippets/autoregressive/deepseek-v3-deployment.jsx"; + + + +### 3.2 Configuration Tips +For more detailed configuration tips, please refer to [DeepSeek-V3 Usage](../../../docs/basic_usage/deepseek_v3). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [Basic API Usage](../../../docs/get-started/quickstart) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +DeepSeek-V3 supports reasoning mode. 
Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +python -m sglang.launch_server \ + --model deepseek-ai/DeepSeek-V3 \ + --reasoning-parser deepseek-v3 \ + --tp 8 \ + --port 8000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + extra_body = {"chat_template_kwargs": {"thinking": True}}, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +To determine 15% of a number, follow these steps: + +**Step 1: Understand the Problem** +You need to find 15% of a given number. Let's assume the number is 240 for this example. + +**Step 2: Convert the Percentage to a Decimal** +To work with percentages in calculations, convert the percentage to its decimal form. To do this, divide the percentage by 100. 
+ +\[ 15\% = \frac{15}{100} = 0.15 \] + +**Step 3: Multiply the Decimal by the Number** +Now, multiply the decimal form of the percentage by the number you want to find the percentage of. + +\[ 0.15 \times 240 \] + +**Step 4: Perform the Multiplication** +Calculate the product: + +\[ 0.15 \times 240 = 36 \] + +**Step 5: Conclusion** +Therefore, 15% of 240 is: + +\boxed{36} + +The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36. +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +DeepSeek-V3 supports tool calling capabilities. Enable the tool call parser: + +**Deployment Command:** + +```shell Command +python -m sglang.launch_server \ + --model deepseek-ai/DeepSeek-V3 \ + --tool-call-parser deepseekv3 \ + --reasoning-parser deepseek-v3 \ + --chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + extra_body = {"chat_template_kwargs": {"thinking": True}}, + temperature=0.7, + stream=True +) + +# 
Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_weather +```json +{"location": "Beijing", "unit": "celsius"} +```<|tool▁call▁end|><|tool▁calls▁end|> +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** 
+ +Please attach the code blocks below to the previous Python script. + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: AMD MI300X GPU (8x) +- Model: DeepSeek-V3 +- Tensor Parallelism: 8 +- sglang version: 0.5.7 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. To simulate real-world usage patterns, we configure each request with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses. 
+ +#### 5.1.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3 \ + --tp 8 \ + --dp 8 \ + --enable-dp-attention \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 8000 \ + --model deepseek-ai/DeepSeek-V3 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 81.27 +Total input tokens: 1972 +Total input text tokens: 1972 +Total input vision tokens: 0 +Total generated tokens: 2784 +Total generated tokens (retokenized): 2774 +Request throughput (req/s): 0.12 +Input token throughput (tok/s): 24.27 +Output token throughput (tok/s): 34.26 +Peak output token throughput (tok/s): 65.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 58.52 +Concurrency: 1.00 +Accept length: 2.61 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8123.17 +Median E2E Latency (ms): 7982.65 +---------------Time to First Token---------------- +Mean TTFT (ms): 1080.76 +Median TTFT (ms): 1248.82 +P99 TTFT (ms): 1896.37 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 25.04 +Median TPOT (ms): 24.76 +P99 TPOT (ms): 32.09 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 25.41 +Median ITL (ms): 20.14 +P95 ITL (ms): 60.28 +P99 ITL (ms): 60.99 +Max ITL (ms): 61.49 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3 \ + --tp 8 \ + --ep 8 \ + --dp 8 \ + --enable-dp-attention \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 8000 \ + --model deepseek-ai/DeepSeek-V3 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 406.16 +Total input tokens: 301701 +Total input text tokens: 301701 +Total input vision tokens: 0 +Total generated tokens: 188375 +Total generated tokens (retokenized): 187542 +Request throughput (req/s): 2.46 +Input token throughput (tok/s): 742.81 +Output token throughput (tok/s): 463.80 +Peak output token throughput (tok/s): 1299.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 1206.61 +Concurrency: 87.53 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 35552.98 +Median E2E Latency (ms): 21466.07 +---------------Time to First Token---------------- +Mean TTFT (ms): 1521.51 +Median TTFT (ms): 476.80 +P99 TTFT (ms): 8329.50 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 214.73 +Median TPOT (ms): 152.00 +P99 TPOT (ms): 1155.85 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 182.10 +Median ITL (ms): 79.18 +P95 ITL (ms): 398.60 +P99 ITL (ms): 1488.96 +Max ITL (ms): 43465.60 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000 +``` + +- **Test Results**: + - DeepSeek-V3 + ```text Output + Accuracy: 0.960 + Invalid: 0.000 + Latency: 32.450 s + Output throughput: 614.211 token/s + ``` + +#### 5.2.2 MMLU Benchmark + +- **Benchmark Command:** + +```shell Command +cd sglang +bash benchmark/mmlu/download_data.sh +python3 benchmark/mmlu/bench_sglang.py --nsub 10 --port 8000 +``` + +- **Test Results**: + - DeepSeek-V3 + ```text Output + subject: abstract_algebra, #q:100, acc: 0.800 + subject: anatomy, #q:135, acc: 0.874 + subject: astronomy, #q:152, acc: 0.928 + subject: business_ethics, #q:100, acc: 0.880 + subject: clinical_knowledge, #q:265, acc: 0.928 + subject: college_biology, #q:144, acc: 0.965 + subject: college_chemistry, #q:100, acc: 0.670 + subject: college_computer_science, #q:100, acc: 0.840 + subject: college_mathematics, #q:100, acc: 0.800 + subject: college_medicine, #q:173, acc: 0.861 + Total latency: 58.339 + Average accuracy: 0.871 + ``` diff --git a/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3_1.mdx b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3_1.mdx new file mode 100644 index 000000000000..1b4376a6c518 --- /dev/null +++ b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3_1.mdx @@ -0,0 +1,941 @@ +--- +title: DeepSeek-V3.1 +metatags: + description: "Deploy DeepSeek-V3.1 MoE model with SGLang - hybrid reasoning, improved tool calling, and agentic behavior for complex multi-step tasks." +--- + +## 1. 
Model Introduction + +[DeepSeek V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) is an advanced Mixture-of-Experts (MoE) large language model developed by DeepSeek, representing a major capability and usability upgrade over DeepSeek V3. As a refined iteration in the DeepSeek V3 family, DeepSeek V3.1 introduces a hybrid reasoning paradigm that supports both fast non-thinking responses and explicit multi-step reasoning, alongside significantly improved tool calling and agentic behavior. The model demonstrates strong performance across reasoning, mathematics, coding, long-context understanding, and real-world agent workflows, benefiting from continued training, alignment optimization, and inference-time refinements. DeepSeek V3.1 is designed to serve as a robust general-purpose foundation model, well suited for conversational AI, structured tool invocation, search-augmented generation, and complex multi-step tasks, while maintaining high efficiency through its sparse MoE architecture. + +**[DeepSeek-V3.1-Terminus](https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus)** is an experimental version designed for general conversations and long-context processing. It features hybrid thinking capabilities, allowing you to toggle between "Think" mode for deliberate reasoning and "Non-Think" mode for faster responses. Recommended for general conversations, long-context processing, and experimental use cases. + + + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. 
+ +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. + +import { DeepSeekV31Deployment } from "/src/snippets/autoregressive/deepseek-v31-deployment.jsx"; + + + +### 3.2 Configuration Tips +For more detailed configuration tips, please refer to [DeepSeek V3/V3.1/R1 Usage](../../../docs/basic_usage/deepseek_v3). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [Basic API Usage](../../../docs/get-started/quickstart) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +DeepSeek-V3.1 supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +python -m sglang.launch_server \ + --model deepseek-ai/DeepSeek-V3.1-Terminus \ + --reasoning-parser deepseek-v3 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3.1-Terminus", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + extra_body = {"chat_template_kwargs": {"thinking": True}}, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + 
thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +First, the problem is asking for 15% of 240. Percent means per hundred, so 15% is the same as 15 out of 100, or 15/100. + +To find a percentage of a number, I can multiply the number by the percentage expressed as a decimal. So, I need to convert 15% to a decimal. To do that, I divide 15 by 100, which gives me 0.15. + +Now, I multiply 0.15 by 240. So, the calculation is 0.15 × 240. + +I can compute this step by step. First, I know that 15% of 100 is 15, but since 240 is larger, I need to adjust. Alternatively, I can think of 10% of 240, which is easy because 10% is just 240 divided by 10, which is 24. Then, 5% is half of 10%, so half of 24 is 12. Therefore, 15% is 10% plus 5%, so 24 plus 12, which equals 36. + +I should also do the multiplication to confirm. 0.15 × 240. I can break it down: 0.15 × 200 = 30, and 0.15 × 40 = 6, so 30 + 6 = 36. Same answer. + +So, 15% of 240 is 36. + +The problem says "step by step," so I should present it clearly. +=============== Content ================= +To find 15% of 240, follow these steps: + +1. Understand that "percent" means "per hundred," so 15% is equivalent to \( \frac{15}{100} \). +2. Convert 15% to a decimal by dividing by 100: \( 15\% = \frac{15}{100} = 0.15 \). +3. Multiply the decimal by 240: \( 0.15 \times 240 \). +4. Perform the multiplication: + - \( 0.15 \times 200 = 30 \) + - \( 0.15 \times 40 = 6 \) + - Add the results: \( 30 + 6 = 36 \). 
+ +Alternatively, you can find 15% by breaking it into parts: +- 10% of 240 is \( \frac{10}{100} \times 240 = 0.10 \times 240 = 24 \). +- 5% of 240 is half of 10%, so \( \frac{24}{2} = 12 \). +- Add 10% and 5%: \( 24 + 12 = 36 \). + +Thus, 15% of 240 is 36. +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +DeepSeek-V3.1 and DeepSeek-V3.1-Terminus support tool calling capabilities. Enable the tool call parser: + +**Note:** DeepSeek-V3.1-Speciale does **NOT** support tool calling. It is designed exclusively for deep reasoning tasks. + +**Deployment Command:** + +```shell Command +python -m sglang.launch_server \ + --model deepseek-ai/DeepSeek-V3.1-Terminus \ + --tool-call-parser deepseekv31 \ + --reasoning-parser deepseek-v3 \ + --chat-template ./examples/chat_template/tool_chat_template_deepseekv31.jinja \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +For DeepSeek-V3.1, use `--tool-call-parser deepseekv31` as well. 
+ +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3.1-Terminus", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + extra_body = {"chat_template_kwargs": {"thinking": True}}, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + 
tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +Hmm, the user is asking for the weather in Beijing. This is a straightforward request that matches exactly what the weather tool can provide. + +I need to call the get_weather function with Beijing as the location parameter. The user didn't specify a temperature unit, so I'll default to Celsius since that's commonly used in most parts of the world. + +The tool call format needs to be precise - just the city name and unit selection. Once I get the weather data back, I'll present it clearly to the user.I'll check the weather in Beijing for you. +=============== Content ================= + +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +Please attach the code blocks below to the previous Python script. + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3.1-Terminus", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "Currently, it is **22°C and sunny** in Beijing." +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: AMD MI300X GPU (8x) +- Model: DeepSeek-V3.1-Terminus +- Tensor Parallelism: 8 +- sglang version: 0.5.7 + +**Benchmark Methodology:** + +We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms. + +#### 5.1.1 Standard Test Scenarios + +Three core scenarios reflect real-world usage patterns: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Scenario | Input Length | Output Length | Use Case |
|---|---|---|---|
| **Chat** | 1K | 1K | Most common conversational AI workload |
| **Reasoning** | 1K | 8K | Long-form generation, complex reasoning tasks |
| **Summarization** | 8K | 1K | Document summarization, RAG retrieval |
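
Each scenario maps directly onto the length flags of `sglang.bench_serving`. The sketch below builds the command line for one scenario at a given concurrency. It uses only flags that appear in this guide's commands and writes 1K/8K as 1000/8000, as those commands do. The `SCENARIOS` mapping, the `bench_command` helper, and the `prompts_per_slot` parameter are hypothetical names introduced for illustration; the default multiplier of 5 is an assumption that matches the guide's medium-concurrency runs (80 prompts at concurrency 16).

```python
import shlex

# (input_len, output_len) per scenario; 1K/8K written as 1000/8000.
SCENARIOS = {
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}


def bench_command(model, scenario, concurrency, prompts_per_slot=5):
    """Return the bench_serving argv list for one scenario and concurrency."""
    input_len, output_len = SCENARIOS[scenario]
    return [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(concurrency * prompts_per_slot),
        "--max-concurrency", str(concurrency),
        "--request-rate", "inf",
    ]


# Print one shell-ready command per scenario at medium concurrency.
for name in SCENARIOS:
    print(shlex.join(bench_command("deepseek-ai/DeepSeek-V3.1", name, 16)))
```

Returning an argument list instead of a formatted string keeps the result usable with `subprocess.run` without shell quoting concerns; `shlex.join` is only used for display.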
+ +#### 5.1.2 Concurrency Levels + +Test each scenario at different concurrency levels to capture the throughput vs. latency trade-off: + +- **Low Concurrency**: `--max-concurrency 1` (Latency-optimized) +- **Medium Concurrency**: `--max-concurrency 16` (Balanced) +- **High Concurrency**: `--max-concurrency 100` (Throughput-optimized) + +#### 5.1.3 Number of Prompts + +For each concurrency level, configure `num_prompts` to simulate realistic user loads: + +- **Quick Test**: `num_prompts = concurrency × 1` (minimal test) +- **Recommended**: `num_prompts = concurrency × 5` (standard benchmark) +- **Stable Measurements**: `num_prompts = concurrency × 10` (production-grade) + +--- + +#### 5.1.4 Benchmark Commands + +**Scenario 1: Chat (1K/1K) - Most Important** + +- **Model Deployment** + +```bash Command +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.1 \ + --tp 8 +``` + +- Low Concurrency (Latency-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 106.24 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4201 +Request throughput (req/s): 0.09 +Input token throughput (tok/s): 57.43 +Output token throughput (tok/s): 39.72 +Peak output token throughput (tok/s): 43.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 97.15 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 10620.29 +Median E2E Latency (ms): 8868.09 +---------------Time to First Token---------------- +Mean TTFT 
(ms): 557.85 +Median TTFT (ms): 213.58 +P99 TTFT (ms): 1625.28 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 23.84 +Median TPOT (ms): 23.90 +P99 TPOT (ms): 24.03 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 23.90 +Median ITL (ms): 23.92 +P95 ITL (ms): 24.15 +P99 ITL (ms): 24.25 +Max ITL (ms): 25.44 +================================================== +``` + +- Medium Concurrency (Balanced) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 107.71 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40625 +Request throughput (req/s): 0.74 +Input token throughput (tok/s): 368.28 +Output token throughput (tok/s): 378.84 +Peak output token throughput (tok/s): 508.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 747.12 +Concurrency: 13.72 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 18473.65 +Median E2E Latency (ms): 19558.42 +---------------Time to First Token---------------- +Mean TTFT (ms): 607.91 +Median TTFT (ms): 191.32 +P99 TTFT (ms): 2135.13 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 35.50 +Median TPOT (ms): 35.99 +P99 TPOT (ms): 43.62 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 35.10 +Median ITL (ms): 32.18 +P95 ITL (ms): 33.03 +P99 ITL (ms): 159.99 +Max ITL (ms): 453.99 +================================================== +``` + +- High Concurrency (Throughput-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 207.65 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 251238 +Request throughput (req/s): 2.41 +Input token throughput (tok/s): 1203.15 +Output token throughput (tok/s): 1216.79 +Peak output token throughput (tok/s): 2100.00 +Peak concurrent requests: 106 +Total token throughput (tok/s): 2419.94 +Concurrency: 91.02 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 37800.20 +Median E2E Latency (ms): 35921.56 +---------------Time to First Token---------------- +Mean TTFT (ms): 835.15 +Median TTFT (ms): 236.88 +P99 TTFT (ms): 2868.52 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 73.33 +Median TPOT (ms): 76.35 +P99 TPOT (ms): 97.63 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 73.30 +Median ITL (ms): 50.82 +P95 ITL (ms): 180.67 +P99 ITL (ms): 186.83 +Max ITL (ms): 1661.39 +================================================== +``` + +**Scenario 2: Reasoning (1K/8K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 1097.29 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 44462 +Total generated tokens (retokenized): 44313 +Request throughput (req/s): 0.01 +Input token throughput (tok/s): 5.56 +Output token throughput (tok/s): 40.52 +Peak output token throughput (tok/s): 43.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 46.08 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 109725.52 +Median E2E Latency (ms): 117748.67 +---------------Time to First Token---------------- +Mean TTFT (ms): 156.67 +Median TTFT (ms): 156.19 +P99 TTFT (ms): 159.87 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 24.41 +Median TPOT (ms): 24.51 +P99 TPOT (ms): 24.96 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 24.65 +Median ITL (ms): 24.58 +P95 ITL (ms): 25.68 +P99 ITL (ms): 25.93 +Max ITL (ms): 29.80 +================================================== +``` + +- Medium Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 775.02 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 318306 +Total generated tokens (retokenized): 317426 +Request throughput (req/s): 0.10 +Input token throughput (tok/s): 51.18 +Output token throughput (tok/s): 410.70 +Peak output token throughput (tok/s): 512.00 +Peak concurrent requests: 18 +Total token throughput (tok/s): 461.89 +Concurrency: 13.86 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 134236.65 +Median E2E Latency (ms): 135181.28 +---------------Time to First Token---------------- +Mean TTFT (ms): 214.35 +Median TTFT (ms): 194.12 +P99 TTFT (ms): 300.27 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 33.72 +Median TPOT (ms): 34.00 +P99 TPOT (ms): 34.75 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 33.69 +Median ITL (ms): 33.71 +P95 ITL (ms): 34.50 +P99 ITL (ms): 34.92 +Max ITL (ms): 164.76 +================================================== +``` + +- High Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 1231.97 +Total input tokens: 158939 +Total input text tokens: 158939 +Total input vision tokens: 0 +Total generated tokens: 1301025 +Total generated tokens (retokenized): 1296845 +Request throughput (req/s): 0.26 +Input token throughput (tok/s): 129.01 +Output token throughput (tok/s): 1056.05 +Peak output token throughput (tok/s): 1472.00 +Peak concurrent requests: 67 +Total token throughput (tok/s): 1185.07 +Concurrency: 56.17 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 216256.25 +Median E2E Latency (ms): 224192.84 +---------------Time to First Token---------------- +Mean TTFT (ms): 317.68 +Median TTFT (ms): 235.28 +P99 TTFT (ms): 649.39 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 53.30 +Median TPOT (ms): 55.10 +P99 TPOT (ms): 56.58 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 53.13 +Median ITL (ms): 52.95 +P95 ITL (ms): 56.23 +P99 ITL (ms): 181.04 +Max ITL (ms): 208.61 +================================================== +``` + +**Scenario 3: Summarization (8K/1K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 114.47 +Total input tokens: 41941 +Total input text tokens: 41941 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4194 +Request throughput (req/s): 0.09 +Input token throughput (tok/s): 366.39 +Output token throughput (tok/s): 36.87 +Peak output token throughput (tok/s): 42.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 403.26 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 11442.86 +Median E2E Latency (ms): 9508.87 +---------------Time to First Token---------------- +Mean TTFT (ms): 883.78 +Median TTFT (ms): 481.38 +P99 TTFT (ms): 2217.45 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 24.93 +Median TPOT (ms): 25.05 +P99 TPOT (ms): 26.11 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 25.08 +Median ITL (ms): 25.08 +P95 ITL (ms): 26.18 +P99 ITL (ms): 26.28 +Max ITL (ms): 27.41 +================================================== +``` + +- Medium Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 162.33 +Total input tokens: 300020 +Total input text tokens: 300020 +Total input vision tokens: 0 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41443 +Request throughput (req/s): 0.49 +Input token throughput (tok/s): 1848.27 +Output token throughput (tok/s): 256.70 +Peak output token throughput (tok/s): 467.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 2104.97 +Concurrency: 14.52 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 29456.89 +Median E2E Latency (ms): 27628.16 +---------------Time to First Token---------------- +Mean TTFT (ms): 1784.30 +Median TTFT (ms): 1347.21 +P99 TTFT (ms): 5384.54 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 53.65 +Median TPOT (ms): 52.09 +P99 TPOT (ms): 74.39 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 53.23 +Median ITL (ms): 34.52 +P95 ITL (ms): 35.81 +P99 ITL (ms): 513.25 +Max ITL (ms): 2865.73 +================================================== +``` + +- High Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model deepseek-ai/DeepSeek-V3.1 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 282.55 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total input vision tokens: 0 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169081 +Request throughput (req/s): 1.13 +Input token throughput (tok/s): 4508.6 +Output token throughput (tok/s): 601.67 +Peak output token throughput (tok/s): 1216 +Peak concurrent requests: 68 +Total token throughput (tok/s): 5110.27 +Concurrency: 59.81 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 52810.32 +Median E2E Latency (ms): 50981.81 +---------------Time to First Token---------------- +Mean TTFT (ms): 786.69 +Median TTFT (ms): 499.38 +P99 TTFT (ms): 2925.98 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 97.93 +Median TPOT (ms): 103.45 +P99 TPOT (ms): 157.84 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 98.11 +Median ITL (ms): 55.7 +P95 ITL (ms): 240.71 +P99 ITL (ms): 1114.36 +================================================== +``` + +#### 5.1.5 Understanding the Results + +**Key Metrics:** + +- **Request Throughput (req/s)**: Number of requests completed per second +- **Output Token Throughput (tok/s)**: Output tokens generated per second, summed across all requests +- **Mean TTFT (ms)**: Time to First Token, the delay before the first output token arrives; measures responsiveness +- **Mean TPOT (ms)**: Time Per Output Token, excluding the first token; measures generation speed +- **Mean ITL (ms)**: Inter-Token Latency, the gap between consecutive streamed tokens; measures streaming consistency + +**Why These Configurations Matter:** + +- **1K/1K (Chat)**: Represents the most common conversational AI workload. This is the highest priority scenario for most deployments. +- **1K/8K (Reasoning)**: Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations. +- **8K/1K (Summarization)**: Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks. +- **Variable Concurrency**: Captures the Pareto frontier - the optimal trade-off between throughput and latency at different load levels. Low concurrency shows best-case latency; high concurrency shows maximum throughput. 
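
The throughput metrics listed under **Key Metrics** can be cross-checked against the raw counts in any result block, which is a quick way to confirm that numbers were copied from the same run. The sketch below is illustrative only: `derive_throughput` is a hypothetical helper, not part of `sglang.bench_serving`, and the inputs are taken from the Scenario 1 low-concurrency output.

```python
def derive_throughput(duration_s, num_requests, input_tokens, output_tokens):
    """Recompute throughput figures from raw benchmark counts.

    Hypothetical helper for illustration; not part of sglang.bench_serving.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return {
        "request_throughput": num_requests / duration_s,
        "input_token_throughput": input_tokens / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
    }


# Values copied from the Scenario 1 (1K/1K), low-concurrency result block.
metrics = derive_throughput(
    duration_s=106.24, num_requests=10, input_tokens=6101, output_tokens=4220
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the recomputed values should agree with the throughput lines reported in that block; latency metrics such as TPOT are averaged per request and cannot be recovered this way.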
+ +**Interpreting Results:** + +- Compare your results against baseline numbers for your hardware +- Higher throughput at the same latency = better performance +- Lower TTFT = more responsive user experience +- Lower TPOT = faster generation speed + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py \ + --num-shots 8 \ + --num-questions 1316 \ + --parallel 1316 +``` + +**Test Results:** + +```text Output +Accuracy: 0.959 +Invalid: 0.000 +Latency: 29.185 s +Output throughput: 4854.672 token/s +``` diff --git a/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3_2.mdx b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3_2.mdx new file mode 100644 index 000000000000..e5c44234a812 --- /dev/null +++ b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V3_2.mdx @@ -0,0 +1,827 @@ +--- +title: DeepSeek-V3.2 +metatags: + description: "Deploy DeepSeek-V3.2 with SGLang - featuring DeepSeek Sparse Attention for efficient long-context processing and deep reasoning capabilities." +--- + +## 1. Model Introduction + +The DeepSeek-V3.2 series includes the following model variants, each optimized for different use cases: + +**[DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp)** is an upgraded version of DeepSeek-V3.1-Terminus, introducing the DeepSeek Sparse Attention (DSA) mechanism through continued training. DSA is a fine-grained sparse attention mechanism powered by a lightning indexer, enabling DeepSeek-V3.2-Exp to achieve significant efficiency improvements in long-context scenarios. Recommended for general conversations, long-context processing, and efficient inference. + +**[DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2)** is the standard version suitable for general tasks and conversational scenarios. For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 0.95. 
Recommended for standard conversations and general tasks. + +**[DeepSeek-V3.2-Speciale](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale)** is a special variant designed exclusively for deep reasoning tasks, optimized for scenarios requiring complex logical reasoning and deep thinking. However, this model does not support tool calls (see Section 4.2.2). For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 0.95. Recommended for deep reasoning tasks, complex logical problems, and mathematical reasoning. + +**[DeepSeek-V3.2-NVFP4](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4)** is an NVIDIA-optimized NVFP4-quantized variant of DeepSeek-V3.2 for Blackwell devices. It uses ModelOpt FP4 quantization with a choice of MoE runner backends (`flashinfer_trtllm` (recommended), `flashinfer_cutlass`, or `flashinfer_cutedsl`), enabling efficient deployment with lower tensor parallelism (TP=4). It supports the same features as DeepSeek-V3.2, including tool calling, reasoning, and speculative decoding (MTP). + +**[DeepSeek-V3.2-MXFP4](https://huggingface.co/amd/DeepSeek-V3.2-mxfp4)** is an OCP MXFP4-quantized variant of DeepSeek-V3.2 for AMD MI300X/MI355X devices. It uses OCP MXFP4 quantization with a Triton MXFP4 backend (the same backend used for gptoss-120B), enabling efficient deployment with lower tensor parallelism (TP=8) on a single node. It supports the same features as DeepSeek-V3.2, including tool calling, reasoning, FP8 KV cache, CP, TP, and speculative decoding (MTP). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. 
Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. SGLang supports serving DeepSeek V3.2 on NVIDIA H200, B200, and AMD MI300X/MI355X GPUs. + +import { DeepSeekV32Deployment } from "/src/snippets/autoregressive/deepseek-v32-deployment.jsx"; + + + +### 3.2 Configuration Tips +For more detailed configuration tips, please refer to [DeepSeek-V3.2 Usage](../../../docs/basic_usage/deepseek_v32). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [Basic API Usage](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +DeepSeek-V3.2 supports reasoning mode. 
Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +sglang serve \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --reasoning-parser deepseek-v3 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3.2-Exp", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + extra_body = {"chat_template_kwargs": {"thinking": True}}, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +To solve this problem, I need to calculate 15% of 240. +Step 1: Convert 15% to decimal: 15% = 0.15 +Step 2: Multiply 240 by 0.15 +Step 3: 240 × 0.15 = 36 +=============== Content ================= + +The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36. 
+``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +Both DeepSeek-V3.2 and DeepSeek-V3.2-Exp support tool calling, but they require different tool call parsers, and DeepSeek-V3.2-Exp additionally needs a custom chat template. Enable the tool call parser at deployment: + +**Note:** DeepSeek-V3.2-Speciale does **NOT** support tool calling. It is designed exclusively for deep reasoning tasks. + +**Deployment Command:** + +For DeepSeek-V3.2-Exp: + +```shell Command +sglang serve \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --tool-call-parser deepseekv31 \ + --reasoning-parser deepseek-v3 \ + --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja \ + --tp 8 \ + --host 0.0.0.0 \ + --port 30000 +``` + +For DeepSeek-V3.2, use `--tool-call-parser deepseekv32` and remove `--chat-template`. + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3.2-Exp", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + extra_body = {"chat_template_kwargs": {"thinking": True}}, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and 
len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. +I should call the function with location="Beijing". 
+=============== Content ================= + +Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3.2-Exp", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." 
+``` + +#### 4.2.3 Enabling PP, CP and TP with FP8 KV cache + +We suggest `DP2` + `MTP` for local deployment of agentic workflows with DeepSeek V3.2 on the Hopper platform; the comments in the command record the throughput and maximum ITL measured for each configuration: + +```shell Command +export SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS=32 +export SGLANG_SET_CPU_AFFINITY=1 + +# Test workload ISL/OSL=1k/1k, raw tap : 4948.16 toks/sec, MAX ITL 5970 +# dp 2 : 5019.54 toks/sec, MAX ITL 7233 +# dp 4 : 4942.82 toks/sec, MAX ITL 35654 +# dp 2 + mtp : 6842.51 toks/sec, MAX ITL 3081 +sglang_args=$(echo serve \ + --model-path $MAPPED_MODEL_PATH \ + --nccl-init $MASTER_ADDR:$MASTER_PORT --nnodes 2 --node-rank $RANK --tp 16 \ + --dp 2 --enable-dp-attention --page-size 64 \ + --trust-remote-code --host "0.0.0.0" --port 30000 \ + --log-requests \ + --context-length 65536 --max-running-requests 128 \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ + --allow-auto-truncate --enable-metrics \ + --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 \ + --served-model-name DeepSeek-V3.2-Opt-dp2-mtp +) + +sglang_args=($sglang_args) + +sglang "${sglang_args[@]}" 2>&1 | tee $LOG_DIR/$RANK.log +``` + +**CP + PP + EP + DP** + +`CP` is currently supported together with `PP=2` on the Hopper platform, which lets us reduce tensor parallelism from TP=16 (the standalone deployment above) to TP=8: + +```shell Command +# verified on Hopper platform +sglang_args=$(echo serve \ + --model-path $MAPPED_MODEL_PATH \ + --nccl-init $MASTER_ADDR:$MASTER_PORT --nnodes 2 --node-rank $RANK --tp 8 --pp-size 2 --dp 1 --enable-dp-attention \ + --moe-a2a-backend deepep --ep-size 16 \ + --page-size 128 \ + --chunked-prefill-size 16384 \ + --attention-backend nsa \ + --nsa-prefill-backend flashmla_sparse \ + --nsa-decode-backend flashmla_sparse \ + --enable-nsa-prefill-context-parallel \ + --nsa-prefill-cp-mode round-robin-split \ + --cuda-graph-max-bs 128 \ + --max-running-requests 128 \ + --trust-remote-code --host "0.0.0.0" --port 30000 \ + --log-requests \ + --context-length 
65536 \ + --allow-auto-truncate --enable-metrics \ + --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 \ + --served-model-name DeepSeek-V3.2-nsa-pp-cp-ep-dp +) + +sglang_args=($sglang_args) + +sglang "${sglang_args[@]}" 2>&1 | tee $LOG_DIR/$RANK.log +``` + +**FP8 KV + CP + PP** + +An FP8 KV cache (`--kv-cache-dtype fp8_e4m3`) reduces the memory footprint and can be combined with various parallelism schemes: + +```shell Command +# verified on Hopper platform +dp=1 + +dp_config=" \ + --dp 1 --enable-dp-attention \ +" + +cp_config=" \ + --enable-nsa-prefill-context-parallel \ +" + +if [ "$dp" -eq 1 ]; then + +cp_config=" \ + $cp_config \ + --nsa-prefill-cp-mode round-robin-split \ +" + +else +cp_config=" \ + $cp_config \ + --nsa-prefill-cp-mode in-seq-split \ +" +fi + +# see discussion : https://github.com/sgl-project/sglang/pull/12065 +sglang_args=$(echo serve \ + --model-path $MAPPED_MODEL_PATH \ + --nccl-init $MASTER_ADDR:$MASTER_PORT --nnodes 2 --node-rank $RANK --tp 8 --pp-size 2 --pp-async-batch-depth 1 \ + $dp_config \ + --trust-remote-code --host "0.0.0.0" --port 30000 \ + --log-requests \ + --context-length 65536 --max-running-requests 128 \ + $cp_config \ + --kv-cache-dtype fp8_e4m3 \ + --allow-auto-truncate --enable-metrics \ + --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 \ + --served-model-name DeepSeek-V3.2-Opt-fp8kv-pp2-cp4 +) + +sglang_args=($sglang_args) + +sglang "${sglang_args[@]}" 2>&1 | tee $LOG_DIR/$RANK.log +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark on Blackwell + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (8x) +- Model: DeepSeek-V3.2-Exp +- Tensor Parallelism: 8 +- sglang version: 0.5.6 + +We use SGLang's built-in benchmarking tool to evaluate performance on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and better reflects performance in actual usage scenarios. 
To simulate real-world usage patterns, we configure each request with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses. + +#### 5.1.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +sglang serve \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --tp 8 \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --host 0.0.0.0 \ + --port 30000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-V3.2-Exp \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 29.11 +Total input tokens: 1972 +Total input text tokens: 1972 +Total input vision tokens: 0 +Total generated tokens: 2784 +Total generated tokens (retokenized): 2777 +Request throughput (req/s): 0.34 +Input token throughput (tok/s): 67.73 +Output token throughput (tok/s): 95.62 +Peak output token throughput (tok/s): 157.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 163.36 +Concurrency: 1.00 +Accept length: 2.46 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2909.74 +Median E2E Latency (ms): 3088.27 +P90 E2E Latency (ms): 4200.62 +P99 E2E Latency (ms): 5588.52 +---------------Time to First Token---------------- +Mean TTFT (ms): 317.58 +Median TTFT (ms): 191.31 +P99 TTFT (ms): 740.79 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 9.09 +Median TPOT (ms): 9.25 +P99 TPOT (ms): 11.73 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.35 +Median ITL (ms): 7.64 +P95 ITL (ms): 22.81 +P99 ITL (ms): 23.33 +Max ITL (ms): 31.45 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +sglang serve \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --tp 8 \ + --ep 8 \ + --dp 8 \ + --enable-dp-attention \ + --host 0.0.0.0 \ + --port 30000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-V3.2-Exp \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 219.09 +Total input tokens: 301701 +Total input text tokens: 301701 +Total input vision tokens: 0 +Total generated tokens: 188375 +Total generated tokens (retokenized): 187443 +Request throughput (req/s): 4.56 +Input token throughput (tok/s): 1377.06 +Output token throughput (tok/s): 859.80 +Peak output token throughput (tok/s): 2465.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 2236.86 +Concurrency: 88.05 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 19291.23 +Median E2E Latency (ms): 11927.39 +---------------Time to First Token---------------- +Mean TTFT (ms): 530.36 +Median TTFT (ms): 444.00 +P99 TTFT (ms): 1504.78 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 106.16 +Median TPOT (ms): 106.69 +P99 TPOT (ms): 221.12 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 100.46 +Median ITL (ms): 41.73 +P95 ITL (ms): 225.67 +P99 ITL (ms): 392.37 +Max ITL (ms): 975.03 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 30000 +``` + +- **Test Results**: + - DeepSeek-V3.2-Exp + ``` + Accuracy: 0.980 + Invalid: 0.000 + Latency: 19.128 s + Output throughput: 965.919 token/s + ``` + +#### 5.2.2 MMLU Benchmark + +- **Benchmark Command:** + +```shell Command +cd sglang +bash benchmark/mmlu/download_data.sh +python3 benchmark/mmlu/bench_sglang.py --nsub 10 --port 30000 +``` + +- **Test Results**: + - DeepSeek-V3.2-Exp + ``` + subject: abstract_algebra, #q:100, acc: 0.780 + subject: anatomy, #q:135, acc: 0.874 + subject: astronomy, #q:152, acc: 0.961 + subject: business_ethics, #q:100, acc: 0.860 + subject: clinical_knowledge, #q:265, acc: 0.925 + subject: college_biology, #q:144, acc: 0.972 + subject: college_chemistry, #q:100, acc: 0.660 + subject: college_computer_science, #q:100, acc: 0.880 + subject: college_mathematics, #q:100, acc: 0.840 + subject: college_medicine, #q:173, acc: 0.879 + Total latency: 7.961 + Average accuracy: 0.879 + ``` + +### 5.3 Speed Benchmark on Hopper + +**Test Environment:** + +- Hardware: NVIDIA H800 GPU (16x) +- Model: DeepSeek-V3.2 +- Tensor Parallelism: 16 +- sglang version: 0.5.9 + +#### 5.3.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +export SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS=32 +export SGLANG_SET_CPU_AFFINITY=1 + +# Test workload ISL/OSL=1k/1k, raw tap : 4948.16 toks/sec, MAX ITL 5970 +# dp 2 : 5019.54 toks/sec, MAX ITL 7233 +# dp 4 : 4942.82 toks/sec, MAX ITL 35654 +# dp 2 + mtp : 6842.51 toks/sec, MAX ITL 3081 +sglang_args=$(echo 
serve \ + --model-path $MAPPED_MODEL_PATH \ + --nccl-init $MASTER_ADDR:$MASTER_PORT --nnodes 2 --node-rank $RANK --tp 16 \ + --dp 2 --enable-dp-attention --page-size 64 \ + --trust-remote-code --host "0.0.0.0" --port 30000 \ + --log-requests \ + --context-length 65536 --max-running-requests 128 \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ + --allow-auto-truncate --enable-metrics \ + --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 \ + --served-model-name DeepSeek-V3.2-Opt-dp2-mtp +) + +sglang_args=($sglang_args) + +sglang "${sglang_args[@]}" 2>&1 | tee $LOG_DIR/$RANK.log +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host $MASTER_ADDR \ + --port 30000 \ + --model deepseek-ai/DeepSeek-V3.2 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 64.0 +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 48.96 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4217 +Request throughput (req/s): 0.20 +Input token throughput (tok/s): 124.62 +Output token throughput (tok/s): 86.20 +Peak output token throughput (tok/s): 113.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 210.81 +Concurrency: 1.00 +Accept length: 3.27 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4893.12 +Median E2E Latency (ms): 3742.47 +P90 E2E Latency (ms): 8877.37 +P99 E2E Latency (ms): 10769.85 +---------------Time to First Token---------------- +Mean TTFT (ms): 199.88 +Median TTFT (ms): 176.15 +P99 TTFT (ms): 272.49 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.99 +Median TPOT (ms): 10.88 +P99 TPOT (ms): 13.93 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 11.15 +Median ITL (ms): 8.86 +P95 ITL (ms): 17.29 +P99 ITL (ms): 33.71 +Max ITL (ms): 36.84 +================================================== +``` + +#### 5.3.2 Throughput-Sensitive Benchmark + +We reuse the same deployment and raise throughput by increasing the request concurrency: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host $MASTER_ADDR \ + --port 30000 \ + --model deepseek-ai/DeepSeek-V3.2 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 2048 \ + --max-concurrency 1024 # see the figure below for why concurrency is 1024 (hence 2048 prompts) +``` + +DeepSeek-V3.2 sustains concurrency up to `1024`, but TTFT increases sharply once concurrency exceeds `128`: + +![DeepSeek V3.2 Concurrency ISL/OSL=1024/128](https://github.com/user-attachments/assets/d5c9c9fb-44f3-4793-a0fd-f8fa954546f5) + + +Performance record: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 64.0 +Max request concurrency: 1024 +Successful requests: 2048 +Benchmark duration (s): 408.09 +Total input tokens: 1048992 +Total input text tokens: 1048992 +Total generated tokens: 1032734 +Total generated tokens (retokenized): 1031817 +Request throughput (req/s): 5.02 +Input token throughput (tok/s): 2570.50 +Output token throughput (tok/s): 2530.66 +Peak output token throughput (tok/s): 5092.00 +Peak concurrent requests: 1035 +Total token throughput (tok/s): 5101.16 +Concurrency: 763.41 +Accept length: 3.26 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 152117.70 +Median E2E Latency (ms): 181704.84 +P90 E2E Latency (ms): 215924.77 +P99 E2E Latency (ms): 231679.59 +---------------Time to First Token---------------- +Mean TTFT (ms): 127729.28 +Median TTFT (ms): 170098.94
+P99 TTFT (ms): 185705.73 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 49.18 +Median TPOT (ms): 48.48 +P99 TPOT (ms): 77.24 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 48.46 +Median ITL (ms): 52.11 +P95 ITL (ms): 110.26 +P99 ITL (ms): 200.63 +Max ITL (ms): 2666.37 +================================================== +``` + +By adding `--random-range-ratio 1`, we could get even higher statistical numbers: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 64.0 +Max request concurrency: 1024 +Successful requests: 2048 +Benchmark duration (s): 612.87 +Total input tokens: 2097152 +Total input text tokens: 2097152 +Total generated tokens: 2097152 +Total generated tokens (retokenized): 2096201 +Request throughput (req/s): 3.34 +Input token throughput (tok/s): 3421.84 +Output token throughput (tok/s): 3421.84 +Peak output token throughput (tok/s): 9077.00 +Peak concurrent requests: 1039 +Total token throughput (tok/s): 6843.68 +Concurrency: 772.66 +Accept length: 3.26 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 231222.27 +Median E2E Latency (ms): 289846.24 +P90 E2E Latency (ms): 314480.41 +P99 E2E Latency (ms): 320392.27 +---------------Time to First Token---------------- +Mean TTFT (ms): 194081.02 +Median TTFT (ms): 252945.22 +P99 TTFT (ms): 279637.50 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 36.31 +Median TPOT (ms): 36.73 +P99 TPOT (ms): 46.33 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 36.31 +Median ITL (ms): 23.18 +P95 ITL (ms): 96.79 +P99 ITL (ms): 135.81 +Max ITL (ms): 3121.00 +================================================== +``` diff --git a/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx new file mode 100644 index 000000000000..e99e29499792 --- /dev/null +++ b/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx @@ -0,0 +1,489 @@ +--- +title: DeepSeek-V4 +metatags: + description: "Deploy DeepSeek-V4 with SGLang — a next-generation MoE model from DeepSeek." +tag: NEW +--- + +## 1. Model Introduction + +**DeepSeek-V4** is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an **MIT License**. It ships as two Instruct repos (one per variant) plus matching Base repos: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Variant | Total params | Active (MoE) | Use |
| --- | --- | --- | --- |
| DeepSeek-V4-Flash | 284B | 13B | single-node serving: B200 / GB200 / GB300 / H200 on 4 GPUs |
| DeepSeek-V4-Pro | 1.6T | 49B | high-capacity: B200 8 GPU / GB200 8 GPU (2 nodes) / GB300 4 GPU / H200 8 GPU (fp4) / 16 GPU (fp8) |
+ +The Instruct repos ship **FP4 MoE experts + FP8 attention / dense** (one mixed-precision checkpoint covers all GPUs that support FP4). The Base (pre-trained only) variants — `DeepSeek-V4-Flash-Base`, `DeepSeek-V4-Pro-Base` — ship pure FP8 mixed and are **not** for chat / tool calling. + +**Key Features** (per the official model card): + +- **Hybrid Attention Architecture** — combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for long-context efficiency. At 1M-token context, DeepSeek-V4-Pro uses only ~27% of per-token inference FLOPs and ~10% of KV cache compared with DeepSeek-V3.2. +- **Manifold-Constrained Hyper-Connections (mHC)** — strengthens residual connections, improving signal-propagation stability across layers while preserving expressivity. +- **Muon optimizer** — faster convergence and greater training stability. +- **Context length: 1M tokens**; pre-trained on 32T+ diverse, high-quality tokens. +- **Three reasoning modes**: *Non-think* (fast, intuitive responses), *Think High* (conscious logical analysis, slower but more accurate), *Think Max* (push reasoning to its fullest extent). Recommend a ≥ 384K context window when running Think Max. +- Ships with a dedicated `encoding_dsv4.encode_messages` Python encoder + DSML tool-call grammar (`<|DSML|tool_calls>` / `<|DSML|invoke>` / `<|DSML|parameter>`). + +**Recommended Generation Parameters:** `temperature=1.0`, `top_p=1.0` (per the official model card). + +**License:** MIT. + +**Resources:** + +- HuggingFace: [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash), [DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) +- ModelScope: [DeepSeek-V4-Flash](https://modelscope.cn/models/deepseek-ai/DeepSeek-V4-Flash), [DeepSeek-V4-Pro](https://modelscope.cn/models/deepseek-ai/DeepSeek-V4-Pro) + +## 2. SGLang Installation + +SGLang offers multiple installation methods. Choose based on your hardware platform. 
+ +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +**Docker Images by Hardware Platform:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hardware Platform | Docker Image |
| --- | --- |
| NVIDIA B300 | `lmsysorg/sglang:deepseek-v4-b300` |
| NVIDIA B200 | `lmsysorg/sglang:deepseek-v4-blackwell` |
| NVIDIA GB200 | `lmsysorg/sglang:deepseek-v4-grace-blackwell` |
| NVIDIA GB300 | `lmsysorg/sglang:deepseek-v4-grace-blackwell` |
| NVIDIA H200 | `lmsysorg/sglang:deepseek-v4-hopper` |
+ +For how to actually launch one of these images, see [Install → Method 3: Using Docker](../../../docs/get-started/install#method-3-using-docker). A minimal example (substitute the image tag for your platform and the inner `sglang serve ...` with whatever the [command generator](#3-model-deployment) below produces): + +```bash Command +docker run --gpus all \ + --shm-size 32g \ + -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=" \ + --ipc=host \ + lmsysorg/sglang:deepseek-v4-blackwell \ + sglang serve +``` + +## 3. Model Deployment + +SGLang supports three main serving recipes for DeepSeek-V4 with different latency/throughput trade-offs (`low-latency`, `balanced`, `max-throughput`), plus specialized recipes for long-context (`cp`, prefill context-parallel) and prefill/decode disaggregation (`pd-disagg`). The interactive generator below emits the exact launch command for any `(hardware, variant, recipe)` combination. + + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the selector below to generate the deployment command for your hardware + recipe combination. + +import { DeepSeekV4Deployment } from "/src/snippets/autoregressive/deepseek-v4-deployment.jsx"; + + + +### 3.2 Configuration Tips + +{/* TODO: expand this section as more recipes are validated end-to-end. */} + +**Concurrency & DeepEP dispatch buffer** + +Must hold: `max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK`. Violating it blows DeepEP's dispatch buffer at steady-state load (`deep_ep.cpp:1105`). When tuning, move `--cuda-graph-max-bs`, `--max-running-requests`, and the env together. + +The generator currently picks values on the **conservative** side (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table — please tune them up toward your actual workload's peak concurrency and report findings back so the defaults can be revised. 
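
The dispatch-buffer constraint above is plain arithmetic, so it can be checked before launching. The following is a minimal sketch, not part of SGLang: the function names are hypothetical, the numeric values are illustrative rather than recommended settings, and treating a disabled MTP as a draft-token factor of 1 is an assumption.

```python
def required_dispatch_tokens(max_running_requests: int, mtp_draft_tokens: int = 1) -> int:
    """Tokens one rank may need to dispatch at steady state.

    mtp_draft_tokens defaults to 1, assuming (not confirmed by the docs)
    that a recipe with MTP disabled contributes one token per request.
    """
    if max_running_requests < 1 or mtp_draft_tokens < 1:
        raise ValueError("both values must be positive integers")
    return max_running_requests * mtp_draft_tokens


def dispatch_buffer_ok(
    max_running_requests: int,
    mtp_draft_tokens: int,
    dispatch_tokens_per_rank: int,
) -> bool:
    """True when max-running-requests x draft tokens fits the value of
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK."""
    needed = required_dispatch_tokens(max_running_requests, mtp_draft_tokens)
    return needed <= dispatch_tokens_per_rank


# Illustrative numbers only: 128 requests with 4 draft tokens need 512 slots.
print(required_dispatch_tokens(128, 4))
print(dispatch_buffer_ok(128, 4, 512))
print(dispatch_buffer_ok(128, 4, 256))
```

Running a check like this whenever `--max-running-requests` or the MTP settings change makes it harder to move one knob without the others.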
+ +**MTP (Multi-Token Prediction, EAGLE)** + +- `low-latency`: steps=3, draft-tokens=4 → largest win at bs=1. +- `balanced`: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch. +- `max-throughput`: MTP disabled; at saturation the verify step costs more than it saves. +- MTP currently requires `SGLANG_ENABLE_SPEC_V2=1`. + + + +**Hopper (H200) note** + +We provide two options for running DeepSeek-V4 models on Hopper (H200) devices: +- Original FP4 checkpoints: run the original FP4 checkpoints with the w4a16 MoE kernels (marlin), as in the interactive command generator. This option supports only TP. The complete Pro model can run on a single H200 node with this option. +- Converted FP8 checkpoints: we also provide pre-converted FP8 checkpoints (`sgl-project/DeepSeek-V4-Flash-FP8`, `sgl-project/DeepSeek-V4-Pro-FP8`), which support more parallelism and features. + +PD-Disagg recipes on H200 may require `docker run --privileged --ulimit memlock=-1` +(or `--device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK`) so mooncake +can discover the IB HCAs; without IB exposure mooncake silently falls back to +TCP, which can lead to garbled KV transfer on large checkpoints. + +**GB300 PD-Disagg cross-pod MNNVL** + +On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may +fail with `nvlink_transport.cpp:497 Requested address ... not found!`. If +this happens, prepend `MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1` +to both prefill and decode `sglang serve` commands. + +## 4.
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, see: + +- [Basic API Usage](../../../docs/basic_usage/send_request) + +Once the server is running (for example via the command generator above), send a request: + +```shell Command +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "deepseek-ai/DeepSeek-V4-Flash", + "messages": [{"role": "user", "content": "What is 15% of 240?"}] + }' +``` + +> **PD-Disagg note**: if you deployed with the `pd-disagg` recipe from the generator above, the prefill server is on port `30000`, the decode server on `30001`, and the **router** on port `8000` — client traffic should target `http://localhost:8000`, not `:30000`. + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +Enable the `deepseek-v4` reasoning parser (check the box in the [command panel above](#3-model-deployment)) to separate thinking from the final answer into `reasoning_content` vs `content`. 
+ +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V4-Flash", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + max_tokens=2048, + extra_body={"chat_template_kwargs": {"thinking": True}}, + stream=True, +) + +thinking_started = False +has_thinking = False +has_answer = False + +for chunk in response: + if not chunk.choices: + continue + delta = chunk.choices[0].delta + + if getattr(delta, "reasoning_content", None): + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +Pending update — replace with real server output after deployment. +``` + +#### 4.2.2 Tool Calling + +Enable the `deepseekv4` tool-call parser (check the box in the [command panel above](#3-model-deployment)) to surface structured tool calls via `message.tool_calls`. 
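
When tool calls are streamed, the argument JSON for each call arrives as string fragments keyed by the call's `index`, and it only parses once every fragment has been joined. The sketch below shows that merge step in isolation. It is a hedged illustration: `merge_tool_call_fragments` is a hypothetical helper, the `(index, name, arguments)` tuples are a simplified stand-in for the entries of `delta.tool_calls`, and the fragment boundaries are invented for the example rather than taken from real server output.

```python
import json


def merge_tool_call_fragments(fragments):
    """Join streamed tool-call fragments and parse the argument JSON.

    fragments: iterable of (index, name_or_None, arguments_fragment) tuples,
    a simplified stand-in for streamed delta.tool_calls entries.
    """
    calls = {}
    for index, name, arguments in fragments:
        entry = calls.setdefault(index, {"name": None, "arguments": ""})
        if name:
            entry["name"] = name
        if arguments:
            entry["arguments"] += arguments

    merged = []
    for _, entry in sorted(calls.items()):
        # Parse only after all fragments for this call have been joined;
        # an empty argument string is treated as "no arguments".
        parsed = json.loads(entry["arguments"] or "{}")
        merged.append({"name": entry["name"], "arguments": parsed})
    return merged


# Invented fragments for illustration; real chunk boundaries will differ.
example = [
    (0, "get_weather", ""),
    (0, None, '{"location": "Bei'),
    (0, None, 'jing", "unit": "celsius"}'),
]
print(merge_tool_call_fragments(example))
```

Parsing each fragment on arrival would fail, because a partial string such as `{"location": "Bei` is not valid JSON on its own.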
+ +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string", "description": "The city name"}, + "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}, + }, + "required": ["location"], + }, + }, + } +] + +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V4-Flash", + messages=[{"role": "user", "content": "What's the weather in Beijing?"}], + tools=tools, + extra_body={"chat_template_kwargs": {"thinking": True}}, + stream=True, +) + +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if not chunk.choices: + continue + delta = chunk.choices[0].delta + + if getattr(delta, "reasoning_content", None): + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if getattr(delta, "tool_calls", None): + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = {"name": None, "arguments": ""} + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]["name"] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments + + if delta.content: + print(delta.content, end="", flush=True) + +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"Tool Call: {tool_call['name']}") + print(f" 
Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +Pending update — replace with real server output after deployment. +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark on Blackwell + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (4x) +- Model: DeepSeek-V4-Flash (FP4) +- Tensor Parallelism: 4 +- sglang version: Pending update + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. To simulate real-world usage patterns, we configure each request with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses. + +#### 5.1.1 Latency-Sensitive Benchmark + +- **Model Deployment Command:** see the [command panel above](#3-model-deployment). + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-V4-Flash \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +Pending update — replace with real bench_serving output after the latency run. +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- **Model Deployment Command:** see the [command panel above](#3-model-deployment). + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-V4-Flash \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +Pending update — replace with real bench_serving output after the throughput run. 
+``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 30000 +``` + +- **Test Results:** + - DeepSeek-V4-Flash (FP4, Blackwell) + ``` + Pending update + ``` + - DeepSeek-V4-Flash (FP8, Hopper) + ``` + Pending update + ``` + +#### 5.2.2 MMLU Benchmark + +- **Benchmark Command:** + +```shell Command +cd sglang +bash benchmark/mmlu/download_data.sh +python3 benchmark/mmlu/bench_sglang.py --nsub 10 --port 30000 +``` + +- **Test Results:** + - DeepSeek-V4-Flash (FP4, Blackwell) + ``` + Pending update + ``` + - DeepSeek-V4-Flash (FP8, Hopper) + ``` + Pending update + ``` + +### 5.3 Speed Benchmark on Hopper + +**Test Environment:** + +- Hardware: NVIDIA H200 GPU (4x) +- Model: DeepSeek-V4-Flash (FP8) +- Tensor Parallelism: 4 +- sglang version: Pending update + +#### 5.3.1 Latency-Sensitive Benchmark + +- **Model Deployment Command:** see the [command panel above](#3-model-deployment). + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-V4-Flash \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +Pending update — replace with real bench_serving output after the latency run. +``` + +#### 5.3.2 Throughput-Sensitive Benchmark + +- **Model Deployment Command:** see the [command panel above](#3-model-deployment). + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model deepseek-ai/DeepSeek-V4-Flash \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +Pending update — replace with real bench_serving output after the throughput run. 
+``` diff --git a/docs_new/cookbook/autoregressive/Ernie/Ernie4.5-VL.mdx b/docs_new/cookbook/autoregressive/Ernie/Ernie4.5-VL.mdx new file mode 100644 index 000000000000..fbd280360aae --- /dev/null +++ b/docs_new/cookbook/autoregressive/Ernie/Ernie4.5-VL.mdx @@ -0,0 +1,28 @@ +--- +title: Ernie4.5-VL +metatags: + description: "Deploy Ernie4.5-VL vision-language model with SGLang - community contribution guide for Baidu's multimodal model." +--- + +## 📝 Community Contribution Welcome + +This guide is currently under development. We welcome community contributions! + +If you have experience deploying **Ernie4.5-VL** with SGLang, please help us complete this documentation. + +## 🚀 How to Contribute + +```shell Command +git clone https://github.com/YOUR_USERNAME/sglang-cookbook.git +cd sglang-cookbook +git checkout -b add-ernie4-5-vl-guide +# Edit this file and submit a PR +``` + +## 📚 Reference + +- [GLM-4.6V](../GLM/GLM-4.6V) + +--- + +**Let's build this together!** 🌟 diff --git a/docs_new/cookbook/autoregressive/Ernie/Ernie4.5.mdx b/docs_new/cookbook/autoregressive/Ernie/Ernie4.5.mdx new file mode 100644 index 000000000000..c2dc5fda56be --- /dev/null +++ b/docs_new/cookbook/autoregressive/Ernie/Ernie4.5.mdx @@ -0,0 +1,696 @@ +--- +title: Ernie4.5 +metatags: + description: "Deploy Ernie4.5 with SGLang - community contribution guide for Baidu's Ernie 4.5 model deployment." +--- + +import { Ernie45Deployment } from '/src/snippets/autoregressive/ernie-45-deployment.jsx'; + +## 1. Model Introduction + +The **ERNIE-4.5** series is a family of large language models developed by Baidu. ERNIE (Enhanced Representation through Knowledge Integration) 4.5 represents an advanced version of the ERNIE series, optimized for general-purpose tasks and conversational scenarios. 
+ +ERNIE-4.5 delivers advanced features as below: +- **Heterogeneous Modality Structure**: MoE architecture that supports parameter sharing across modalities while allowing dedicated parameters for each individual modality, enhancing multimodal understanding without compromising, and even improving, performance on text-related tasks. +- **Vision Encoder**: Dedicated adaptive-resolution ViT with 2D RoPE and image packing; for video, adaptive frame sampling and timestamp rendering, supporting both shared and modality-specific visual processing. +- **Adapter**: Shared modality-bridging module with spatial and temporal compression to align vision to text embedding space, enabling cross-modal understanding without compromising text representations. +- **Multimodal Position Embedding**: Unified 3D RoPE (temporal, height, width) for vision and 1D RoPE for text in a single embedding space, supporting parameter sharing while encoding modality-specific positions. +- **Hardware Optimization**: Specifically tuned for AMD MI300X, MI325X, and MI355X GPUs. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. + + + +## 4. 
API Usage +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +The following example demonstrates deployment using ERNIE-4.5-21B-A3B-PT. + +```shell Command +python -m sglang.launch_server \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --tp 1 +``` + +**Basic Python Client Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="baidu/ERNIE-4.5-21B-A3B-PT", + messages=[ + {"role": "user", "content": "What is artificial intelligence?"} + ], + temperature=1.0, + top_p=0.95, + max_tokens=1024 +) + +print(response.choices[0].message.content) +``` + +**Output Example:** +```text Output +**Artificial Intelligence (AI)** is the simulation of human intelligence processes by machines, particularly computer systems. These processes include **learning** (acquiring information and rules for using the information), **reasoning** (using rules to reach approximate or definite conclusions), and **self-correction**. AI encompasses a wide range of techniques, algorithms, and methodologies designed to enable machines to perform tasks that typically require human intelligence. + +### Key Characteristics of AI: +... + +### In Summary: +AI represents a transformative force with the potential to revolutionize industries and enhance human capabilities. However, its development requires careful consideration of ethical, legal, and social implications to ensure that it benefits society as a whole. As AI continues to evolve, ongoing dialogue among stakeholders will be crucial to balancing innovation with responsibility. 
+``` + +**Streaming Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="baidu/ERNIE-4.5-21B-A3B-PT", + messages=[ + {"role": "user", "content": "Explain quantum computing in simple terms."} + ], + temperature=1.0, + top_p=0.95, + max_tokens=2048, + stream=True +) + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +Sure! Here’s a simple explanation of quantum computing: + +### **Quantum Computing: Making Computers Super Fast (But Weird) Using Quantum Rules** + +1. **Classic vs. Quantum Computers** + - **Normal computers** use **bits** (0s and 1s) to store and process information. + - **Quantum computers** use **qubits** (short for quantum bits). Unlike bits, qubits can be **0, 1, or both at the same time** (this is called **superposition**). + +2. **Superposition: The Magic Behind Speed** + - A single qubit can represent **0 and 1 simultaneously**, like a coin spinning in the air. + - Many qubits working together (in something called **quantum parallelism**) can **check multiple possibilities at once**, making quantum computers much faster for certain problems. + +3. **Entanglement: Making Qubits Link** + - When qubits are **entangled**, their states are linked—changing one instantly affects the other, no matter how far apart they are (this is called **spooky action at a distance** by Einstein). + - Entanglement allows quantum computers to process information in **very efficient ways**. + +4. **What Quantum Computers Are Good At** + - **Cracking encryption** (like RSA). + - **Factoring large numbers** (used in encryption and cryptography). + - **Searching unsorted databases** (way faster than classical computers). 
+ - **Simulating quantum systems** (like molecules for drug discovery). + - **Optimizing problems** (like logistics or finance). + +5. **Challenges & Current State** + - Qubits are **fragile** and easily disturbed (called **decoherence**). + - Engineers are working to keep qubits stable long enough to do useful calculations. + - Today’s quantum computers are **small and experimental**, but the goal is to build powerful ones that outperform classical supercomputers. + +### **Final Thought** +Quantum computing isn’t just a faster calculator—it’s a **new way of thinking about problems** using the weird laws of physics. While still new, it has the potential to revolutionize fields like medicine, AI, and cybersecurity. + +Would you like an example of how a quantum computer might solve a problem? 😊 +``` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: AMD MI300X GPU (1x) +- Model: ERNIE-4.5-21B-A3B-PT +- Tensor Parallelism: 1 +- SGLang Version: 0.5.7 + +**Benchmark Methodology:** + +We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms. 
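
As a quick consistency check on the reports that follow, the throughput fields can be recomputed from the token totals and the benchmark duration. The sketch below uses a hypothetical helper that is not part of `sglang.bench_serving`; it assumes each throughput figure is a token count divided by the wall-clock duration, which matches the low-concurrency report to within rounding of the printed duration.

```python
def token_throughput(input_tokens: int, output_tokens: int, duration_s: float) -> dict:
    """Recompute tokens-per-second figures from totals and duration.

    Assumes throughput = tokens / wall-clock duration (an assumption about
    how the report is derived, checked only against the printed numbers).
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return {
        "input_tok_s": input_tokens / duration_s,
        "output_tok_s": output_tokens / duration_s,
        "total_tok_s": (input_tokens + output_tokens) / duration_s,
    }


# Totals and duration copied from the low-concurrency report in 5.1.1.1.
stats = token_throughput(input_tokens=6101, output_tokens=4220, duration_s=58.72)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

If a recomputed value differs from the report by more than rounding error, the token totals and the duration probably come from different runs.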
+ +#### 5.1.1 Standard Scenario Benchmark + +- Model Deployment Command: + +```bash Command +python -m sglang.launch_server \ + --model-path baidu/ERNIE-4.5-21B-A3B-PT \ + --tp 1 +``` + +##### 5.1.1.1 Low Concurrency (Latency-Optimized) +- Benchmark Command: + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 58.72 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4219 +Request throughput (req/s): 0.17 +Input token throughput (tok/s): 103.90 +Output token throughput (tok/s): 71.87 +Peak output token throughput (tok/s): 245.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 175.77 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 5869.86 +Median E2E Latency (ms): 1870.80 +---------------Time to First Token---------------- +Mean TTFT (ms): 4152.58 +Median TTFT (ms): 36.81 +P99 TTFT (ms): 37498.23 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 4.07 +Median TPOT (ms): 4.09 +P99 TPOT (ms): 4.09 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 4.08 +Median ITL (ms): 4.08 +P95 ITL (ms): 4.14 +P99 ITL (ms): 4.20 +Max ITL (ms): 4.67 +================================================== +``` + +##### 5.1.1.2 Medium Concurrency (Balanced) +- Benchmark Command: + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 34.30 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40773 +Request throughput (req/s): 2.33 +Input token throughput (tok/s): 1156.62 +Output token throughput (tok/s): 1189.77 +Peak output token throughput (tok/s): 1392.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 2346.39 +Concurrency: 14.14 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6060.62 +Median E2E Latency (ms): 6496.70 +---------------Time to First Token---------------- +Mean TTFT (ms): 78.90 +Median TTFT (ms): 45.90 +P99 TTFT (ms): 234.33 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 11.99 +Median TPOT (ms): 12.16 +P99 TPOT (ms): 14.81 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 11.75 +Median ITL (ms): 11.48 +P95 ITL (ms): 12.24 +P99 ITL (ms): 34.85 +Max ITL (ms): 105.01 +================================================== +``` + +##### 5.1.1.3 High Concurrency (Throughput-Optimized) +- Benchmark Command: + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 66.63 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 252449 +Request throughput (req/s): 7.50 +Input token throughput (tok/s): 3749.79 +Output token throughput (tok/s): 3792.28 +Peak output token throughput (tok/s): 4902.00 +Peak concurrent requests: 113 +Total token throughput (tok/s): 7542.06 +Concurrency: 90.33 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 12036.90 +Median E2E Latency (ms): 11782.16 +---------------Time to First Token---------------- +Mean TTFT (ms): 104.86 +Median TTFT (ms): 84.62 +P99 TTFT (ms): 297.85 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 23.89 +Median TPOT (ms): 24.62 +P99 TPOT (ms): 26.91 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 23.66 +Median ITL (ms): 20.48 +P95 ITL (ms): 45.57 +P99 ITL (ms): 54.31 +Max ITL (ms): 185.12 +================================================== +``` + +#### 5.1.2 Reasoning Scenario Benchmark + +##### 5.1.2.1 Low Concurrency +- Benchmark Command: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 185.11 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 44462 +Total generated tokens (retokenized): 44423 +Request throughput (req/s): 0.05 +Input token throughput (tok/s): 32.96 +Output token throughput (tok/s): 240.19 +Peak output token throughput (tok/s): 245.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 273.15 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 18508.84 +Median E2E Latency (ms): 19866.81 +---------------Time to First Token---------------- +Mean TTFT (ms): 32.59 +Median TTFT (ms): 32.14 +P99 TTFT (ms): 38.58 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 4.13 +Median TPOT (ms): 4.13 +P99 TPOT (ms): 4.20 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 4.16 +Median ITL (ms): 4.12 +P95 ITL (ms): 4.31 +P99 ITL (ms): 4.36 +Max ITL (ms): 7.28 +================================================== +``` + +##### 5.1.2.2 Medium Concurrency + +- Benchmark Command: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 263.48 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 318306 +Total generated tokens (retokenized): 317984 +Request throughput (req/s): 0.30 +Input token throughput (tok/s): 150.55 +Output token throughput (tok/s): 1208.09 +Peak output token throughput (tok/s): 1408.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 1358.64 +Concurrency: 14.35 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 47249.55 +Median E2E Latency (ms): 47828.67 +---------------Time to First Token---------------- +Mean TTFT (ms): 62.77 +Median TTFT (ms): 57.10 +P99 TTFT (ms): 93.70 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 11.92 +Median TPOT (ms): 12.09 +P99 TPOT (ms): 12.50 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 11.86 +Median ITL (ms): 12.04 +P95 ITL (ms): 12.68 +P99 ITL (ms): 13.61 +Max ITL (ms): 39.94 +================================================== +``` + +##### 5.1.2.3 High Concurrency +- Benchmark Command: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 428.30 +Total input tokens: 158939 +Total input text tokens: 158939 +Total input vision tokens: 0 +Total generated tokens: 1301025 +Total generated tokens (retokenized): 1299877 +Request throughput (req/s): 0.75 +Input token throughput (tok/s): 371.09 +Output token throughput (tok/s): 3037.63 +Peak output token throughput (tok/s): 3880.00 +Peak concurrent requests: 69 +Total token throughput (tok/s): 3408.73 +Concurrency: 57.08 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 76392.58 +Median E2E Latency (ms): 79698.73 +---------------Time to First Token---------------- +Mean TTFT (ms): 92.79 +Median TTFT (ms): 78.71 +P99 TTFT (ms): 168.89 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 18.81 +Median TPOT (ms): 19.15 +P99 TPOT (ms): 19.81 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 18.77 +Median ITL (ms): 18.77 +P95 ITL (ms): 19.86 +P99 ITL (ms): 42.08 +Max ITL (ms): 74.36 +================================================== +``` + +#### 5.1.3 Summarization Scenario Benchmark + +##### 5.1.3.1 Low Concurrency +- Benchmark Command: + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 18.59 +Total input tokens: 41941 +Total input text tokens: 41941 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4216 +Request throughput (req/s): 0.54 +Input token throughput (tok/s): 2256.43 +Output token throughput (tok/s): 227.04 +Peak output token throughput (tok/s): 245.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 2483.46 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1856.72 +Median E2E Latency (ms): 1513.87 +---------------Time to First Token---------------- +Mean TTFT (ms): 86.66 +Median TTFT (ms): 72.30 +P99 TTFT (ms): 167.13 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 4.19 +Median TPOT (ms): 4.22 +P99 TPOT (ms): 4.30 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 4.20 +Median ITL (ms): 4.23 +P95 ITL (ms): 4.34 +P99 ITL (ms): 4.42 +Max ITL (ms): 5.68 +================================================== +``` + +##### 5.1.3.2 Medium Concurrency +- Benchmark Command: + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 40.25 +Total input tokens: 300020 +Total input text tokens: 300020 +Total input vision tokens: 0 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41646 +Request throughput (req/s): 1.99 +Input token throughput (tok/s): 7454.72 +Output token throughput (tok/s): 1035.37 +Peak output token throughput (tok/s): 1310.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 8490.09 +Concurrency: 14.37 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 7229.56 +Median E2E Latency (ms): 7578.95 +---------------Time to First Token---------------- +Mean TTFT (ms): 137.38 +Median TTFT (ms): 122.59 +P99 TTFT (ms): 485.34 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 14.04 +Median TPOT (ms): 14.24 +P99 TPOT (ms): 20.77 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 13.64 +Median ITL (ms): 12.36 +P95 ITL (ms): 14.72 +P99 ITL (ms): 57.39 +Max ITL (ms): 411.31 +================================================== +``` + +##### 5.1.3.3 High Concurrency + +- Benchmark Command: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model baidu/ERNIE-4.5-21B-A3B-PT \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 78.33 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total input vision tokens: 0 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169888 +Request throughput (req/s): 4.09 +Input token throughput (tok/s): 16262.33 +Output token throughput (tok/s): 2170.20 +Peak output token throughput (tok/s): 3005.00 +Peak concurrent requests: 73 +Total token throughput (tok/s): 18432.53 +Concurrency: 58.79 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 14392.52 +Median E2E Latency (ms): 14460.70 +---------------Time to First Token---------------- +Mean TTFT (ms): 184.82 +Median TTFT (ms): 155.24 +P99 TTFT (ms): 379.82 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 26.97 +Median TPOT (ms): 28.31 +P99 TPOT (ms): 33.61 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 26.79 +Median ITL (ms): 20.55 +P95 ITL (ms): 47.55 +P99 ITL (ms): 145.64 +Max ITL (ms): 287.62 +================================================== +``` + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command: + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py \ + --num-shots 8 \ + --num-questions 1316 \ + --parallel 1316 +``` + +- Test Results: + - ERNIE-4.5-21B-A3B-PT + ``` + Accuracy: 0.865 + Invalid: 0.000 + Latency: 21.669 s + Output throughput: 10359.790 token/s + ``` diff --git a/docs_new/cookbook/autoregressive/FlashLabs/Chroma1.0.mdx b/docs_new/cookbook/autoregressive/FlashLabs/Chroma1.0.mdx new file mode 100644 index 000000000000..8b5d6096473d --- /dev/null +++ b/docs_new/cookbook/autoregressive/FlashLabs/Chroma1.0.mdx @@ -0,0 +1,192 @@ +--- +title: Chroma-1.0 +metatags: + description: "Deploy Chroma-1.0 end-to-end speech conversation model with SGLang - real-time speech generation, voice cloning, and speech reasoning." +--- + +## 1. Model Introduction + +[Chroma-1.0](https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma) is an open-source end-to-end speech conversation model developed by FlashLabs, focusing on the following core capabilities: + +- **Real-time Speech Generation**: Supports low-latency speech synthesis, suitable for real-time conversational scenarios. +- **Customized Voice Cloning**: Capable of cloning and replicating specific speaker voice characteristics. +- **End-to-End Architecture**: Provides a complete processing workflow from speech to speech. +- **Speech Reasoning**: Equipped with reasoning capabilities to understand and process speech content. + +## 2. Architecture Overview + +**Chroma-1.0** utilizes a hybrid serving architecture rather than a direct SGLang deployment. 
This design choice is driven by: + +1. **Complex Model Architecture**: The end-to-end speech processing pipeline involves specialized components that go beyond standard text generation loops. +2. **KV Cache & State Management**: The model requires custom handling of KV caches that differs from standard implementations. +3. **Batching Limitations**: The current implementation supports only a batch size of 1, meaning SGLang's advanced continuous batching capabilities are not yet fully applicable. + +Therefore, you will start the **FlashLabs Server**, which manages the overall workflow and selectively leverages SGLang for specific inference components where supported. + +- **Outer Layer**: FlashLabs Server (Handles Audio I/O, State, and Model Logic) +- **Inner Engine**: SGLang Instance (Utilized for specific acceleration where applicable) + +## 3. Installation & Setup + +We recommend following these steps to set up the environment and prepare the model. + +### Step 1: Get the Docker Image + +Pull the official pre-built image from Docker Hub to ensure all dependencies are correctly configured. + +```bash Command +docker pull flashlabs/chroma:latest +``` + +### Step 2: Download Model Weights + +Download the **Chroma-4B** weights from Hugging Face. You can choose one of the following methods: + +**Method 1: Using the Hugging Face CLI (Recommended)** + +```bash Command +huggingface-cli download FlashLabs/Chroma-4B --local-dir Chroma-4B +``` + +**Method 2: Using Git Clone** + +Make sure you have Git LFS installed before cloning.
+ +```bash Command +# Install Git LFS first +git lfs install + +# Clone the repository +git clone https://huggingface.co/FlashLabs/Chroma-4B Chroma-4B +``` + +### Step 3: Download the Chroma Code (SGLang version) + +```bash Command +git clone https://github.com/FlashLabs-AI-Corp/Chroma-SGLang.git + +cd Chroma-SGLang +``` + +### Step 4: Run the Server + +```bash Command +docker run -d \ + --gpus all \ + -p 8000:8000 \ + -w /app/Chroma-SGLang \ + -v "your_Chroma-SGLang_path":/app/Chroma-SGLang \ + -v "your_chroma_path":/model \ + -e CHROMA_MODEL_PATH=/model \ + -e DP_SIZE="1" \ + flashlabs/chroma:latest \ + /opt/conda/bin/python -m uvicorn api_server:app \ + --host 0.0.0.0 \ + --port 8000 \ + --workers 1 +``` + +Alternatively, run the following one-line command: + +```bash Command +docker-compose up -d +``` + +## 4. Client Usage Example + +Once the server is running, you can interact with it using HTTP requests. + +### Python Client + +```python Example +import requests +import base64 + +url = "http://localhost:8000/v1/chat/completions" +headers = {"Content-Type": "application/json"} + +payload = { + "model": "chroma", + "messages": [ + { + "role": "system", + "content": "You are Chroma, a voice agent developed by FlashLabs."
+ }, + { + "role": "user", + "content": [ + {"type": "audio", "audio": "assets/question_audio.wav"} + ] + } + ], + "max_tokens": 1000, + "return_audio": True +} + +response = requests.post(url, json=payload, headers=headers) +result = response.json() + +if result.get("audio"): + audio_data = base64.b64decode(result["audio"]) + with open("output.wav", "wb") as f: + f.write(audio_data) + print("Audio saved to output.wav") +``` + +### OpenAI SDK Compatible Example + +```python Example +from openai import OpenAI + +client = OpenAI( + api_key="dummy", + base_url="http://localhost:8000/v1" +) + +response = client.chat.completions.create( + model="chroma", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + { + "role": "user", + "content": [ + {"type": "audio", "audio": "assets/question_audio.wav"} + ] + } + ], + extra_body={ + "prompt_text": "I have not... I'm so exhausted, I haven't slept in a very long time. It could be because... Well, I used our... Uh, I'm, I just use... This is what I use every day. I use our cleanser every day, I use serum in the morning and then the moistu- daily moisturizer. That's what I use every morning.", + "prompt_audio": "assets/ref_audio.wav", + "return_audio": True + } +) + +print(response) +``` + +### CLI (cURL) + +```bash Command +curl -X POST http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "chroma", + "messages": [ + { + "role": "system", + "content": "You are Chroma, a voice agent developed by FlashLabs." 
+ }, + { + "role": "user", + "content": [ + { + "type": "audio", + "audio": "assets/question_audio.wav" + } + ] + } + ], + "max_tokens": 1000, + "return_audio": true + }' | jq -r '.audio' | base64 -d > output.wav +``` diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-4.5.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-4.5.mdx new file mode 100644 index 000000000000..fe03bd8da368 --- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-4.5.mdx @@ -0,0 +1,490 @@ +--- +title: GLM-4.5 +metatags: + description: "Deploy GLM-4.5 with SGLang on AMD GPUs - advanced reasoning, function calling, BF16/FP8 quantization options." +--- + +## 1. Model Introduction + +[GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) is a powerful language model developed by Zhipu AI, featuring advanced capabilities in reasoning, function calling, and multi-modal understanding. + +**Key Features:** + +- **Advanced Reasoning**: Built-in reasoning capabilities for complex problem-solving +- **Multiple Quantizations**: BF16 and FP8 variants for different performance/memory trade-offs +- **Hardware Optimization**: Specifically tuned for AMD MI300X/MI325X/MI355X GPUs +- **High Performance**: Optimized for both throughput and latency scenarios + +**Available Models:** + +- **BF16 (Full precision)**: [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) - Recommended for MI300X/MI325X/MI355X +- **FP8 (8-bit quantized)**: [zai-org/GLM-4.5-FP8](https://huggingface.co/zai-org/GLM-4.5-FP8) - Recommended for MI300X/MI325X/MI355X + +**License:** + +Please refer to the [official GLM-4.5 model card](https://huggingface.co/zai-org/GLM-4.5) for license details. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. 
Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, deployment strategy, and thinking capabilities. + +import { GLM45Deployment } from "/src/snippets/autoregressive/glm-45-deployment.jsx"; + + + +### 3.2 Configuration Tips + +For more detailed configuration tips, please refer to [GLM-4.5/GLM-4.6 Usage](../../../docs/basic_usage/glm45). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +GLM-4.5 supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and the content sections: + +```shell Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.5 \ + --reasoning-parser glm45 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="zai-org/GLM-4.5", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking 
=================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +To solve this problem, I need to calculate 15% of 240. +Step 1: Convert 15% to decimal: 15% = 0.15 +Step 2: Multiply 240 by 0.15 +Step 3: 240 × 0.15 = 36 +=============== Content ================= + +The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36. +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +GLM-4.5 supports tool calling capabilities. 
Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.5 \ + --reasoning-parser glm45 \ + --tool-call-parser glm45 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="zai-org/GLM-4.5", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"Tool Call: {tool_call.function.name}") + print(f" Arguments: 
{tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. +I should call the function with location="Beijing". +=============== Content ================= + +Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: AMD MI300X (8x), AMD MI325X (8x), AMD MI355X (8x) +- Model: GLM-4.5 +- Tensor Parallelism: 8 +- SGLang Version: 0.5.6.post1 + +**Benchmark Methodology:** + +We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms. + +#### 5.1.1 Standard Test Scenarios + +Three core scenarios reflect real-world usage patterns: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Scenario | Input Length | Output Length | Use Case |
| --- | --- | --- | --- |
| **Chat** | 1K | 1K | Most common conversational AI workload |
| **Reasoning** | 1K | 8K | Long-form generation, complex reasoning tasks |
| **Summarization** | 8K | 1K | Document summarization, RAG retrieval |
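The three scenarios above differ only in their input and output lengths, so the benchmark commands later in this guide can be generated from a small lookup table. The following is a minimal sketch under stated assumptions: `SCENARIOS` and `build_bench_command` are hypothetical names, and only flags that already appear in this guide's `sglang.bench_serving` commands are used. The default multiplier of 5 follows the "Recommended" rule in section 5.1.3.

```python
import shlex

# Input/output lengths per scenario, mirroring the scenario table
# (1K and 8K are passed as 1000 and 8000, as in this guide's commands).
SCENARIOS = {
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}


def build_bench_command(model, scenario, concurrency, prompts_per_slot=5):
    """Return a bench_serving command line for one scenario/concurrency pair.

    Hypothetical helper: it only assembles a string and does not run anything.
    """
    input_len, output_len = SCENARIOS[scenario]
    args = [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(concurrency * prompts_per_slot),
        "--max-concurrency", str(concurrency),
        "--request-rate", "inf",
    ]
    return shlex.join(args)


command = build_bench_command("zai-org/GLM-4.5", "summarization", 16)
print(command)
```

Note that the low-concurrency commands in this guide use 10 prompts at concurrency 1, which corresponds to `prompts_per_slot=10` rather than the default.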
+ +#### 5.1.2 Concurrency Levels + +Test each scenario at three concurrency levels to capture the throughput vs. latency tradeoff (Pareto frontier): + +- **Low Concurrency**: `--max-concurrency 1` (Latency-optimized) +- **Medium Concurrency**: `--max-concurrency 16` (Balanced) +- **High Concurrency**: `--max-concurrency 100` (Throughput-optimized) + +#### 5.1.3 Number of Prompts + +For each concurrency level, configure `num_prompts` to simulate realistic user loads: + +- **Quick Test**: `num_prompts = concurrency × 1` (minimal test) +- **Recommended**: `num_prompts = concurrency × 5` (standard benchmark) +- **Stable Measurements**: `num_prompts = concurrency × 10` (production-grade) + +--- + +#### 5.1.4 Benchmark Commands + +**Scenario 1: Chat (1K/1K) - Most Important** + +- **Model Deployment** +```bash Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.5 \ + --tp 8 +``` + + +- Low Concurrency (Latency-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Medium Concurrency (Balanced) +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- High Concurrency (Throughput-Optimized) +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +**Scenario 2: Reasoning (1K/8K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 
1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Medium Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- High Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +**Scenario 3: Summarization (8K/1K)** + +- Low Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Medium Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- High Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +#### 5.1.5 Understanding the Results + +**Key Metrics:** + +- **Request Throughput (req/s)**: Number of requests processed per second +- **Output Token Throughput (tok/s)**: Total tokens generated per second +- **Mean TTFT (ms)**: Time to First Token - measures responsiveness +- **Mean TPOT (ms)**: Time Per Output Token - measures generation speed +- **Mean ITL (ms)**: Inter-Token Latency - measures streaming consistency + +**Why These Configurations 
Matter:** + +- **1K/1K (Chat)**: Represents the most common conversational AI workload. This is the highest priority scenario for most deployments. +- **1K/8K (Reasoning)**: Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations. +- **8K/1K (Summarization)**: Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks. +- **Variable Concurrency**: Captures the Pareto frontier - the optimal tradeoff between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput. + +**Interpreting Results:** + +- Compare your results against baseline numbers for your hardware +- Higher throughput at same latency = better performance +- Lower TTFT = more responsive user experience +- Lower TPOT = faster generation speed + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command +```bash Command +python -m sglang.test.few_shot_gsm8k \ + --num-questions 200 \ + --port 30000 +``` diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-4.5V.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-4.5V.mdx new file mode 100644 index 000000000000..3bd305d43973 --- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-4.5V.mdx @@ -0,0 +1,533 @@ +--- +title: GLM-4.5V +metatags: + description: "Deploy GLM-4.5V vision-language model with SGLang - SOTA multimodal performance, 64K context, image reasoning and video understanding." +--- + +## 1. Model Introduction + +[GLM-4.5V](https://huggingface.co/zai-org/GLM-4.5V) is a state-of-the-art multimodal vision-language model from ZhipuAI, built on the next-generation flagship text foundation model GLM-4.5-Air (106B parameters, 12B active). It achieves SOTA performance among models of the same scale across 42 public vision-language benchmarks. 
Through efficient hybrid training, GLM-4.5V focuses on real-world usability and enables full-spectrum vision reasoning across diverse visual content types. + +**Hardware Support:** NVIDIA B200/H100/H200, AMD MI300X/MI325X/MI355X + +GLM-4.5V introduces several key features: + +- **Image Reasoning & Grounding** Scene understanding, complex multi-image analysis, and spatial recognition with precise visual element localization. Supports bounding box predictions with normalized coordinates (0-1000) for accurate object detection. +- **Video Understanding** Long video segmentation and event recognition, supporting comprehensive temporal analysis across extended video sequences. +- **GUI Agent Tasks** Screen reading, icon recognition, and desktop operation assistance for agent-based applications. Enables natural interaction with graphical user interfaces. +- **Complex Chart & Long Document Parsing** Research report analysis and information extraction from documents with text, charts, tables, and figures. Processes up to 64K tokens of multimodal context. +- **Thinking Mode Switch** Allows users to balance between quick responses and deep reasoning. Users can enable/disable Chain-of-Thought reasoning based on task requirements for improved accuracy and interpretability. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +The GLM-4.5V offers models in various sizes and architectures, optimized for different hardware platforms. The recommended launch configurations vary by hardware and model size. 
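As a rough illustration of how the launch options on this page combine, the sketch below assembles a `sglang.launch_server` argument list in Python. It uses only flags that appear elsewhere in this guide (`--model-path`, `--tp`, `--host`, `--port`, `--reasoning-parser`, `--tool-call-parser`, `--mm-enable-dp-encoder`). The function name `launch_command` and its parameters are hypothetical helpers for illustration, not part of SGLang and not the logic of the interactive generator.

```python
import shlex


def launch_command(tp=4, port=30000, thinking=True, tool_calls=False):
    """Build a launch_server argument list (hypothetical helper, not an SGLang API)."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "zai-org/GLM-4.5V",
        "--tp", str(tp),
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    if thinking:
        # Separates reasoning content from the final answer.
        args += ["--reasoning-parser", "glm45"]
    if tool_calls:
        args += ["--tool-call-parser", "glm45"]
    if tp == 8:
        # The 12 vision attention heads do not divide evenly across 8 ranks,
        # so this guide suggests the data-parallel encoder in that case.
        args.append("--mm-enable-dp-encoder")
    return args


if __name__ == "__main__":
    print(shlex.join(launch_command(tp=8, tool_calls=True)))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.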
 + +**Interactive Command Generator**: Use the interactive configuration generator below to customize your deployment settings. Select your hardware platform, model size, quantization method, and other options to generate the appropriate launch command. + +import { GLM45VDeployment } from "/src/snippets/autoregressive/glm-45v-deployment.jsx"; + + + +### 3.2 Configuration Tips +- **TTFT Optimization**: Set `SGLANG_USE_CUDA_IPC_TRANSPORT=1` to use CUDA IPC for transferring multimodal features, which significantly improves TTFT. This consumes additional memory and may require adjusting `--mem-fraction-static` and/or `--max-running-requests`. The additional memory is proportional to the image size multiplied by the number of images in the currently running requests. +- **TP=8 Configuration**: When using Tensor Parallelism (TP) of 8, the vision attention's 12 heads cannot be divided evenly across the 8 ranks. You can resolve this by adding `--mm-enable-dp-encoder`. +- **Fast Model Loading**: For large models (like the 106B version), you can speed up model loading by using `--model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}'`. +- For more detailed configuration tips, please refer to [GLM-4.5V/GLM-4.6V Usage](../../../docs/basic_usage/glmv). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 Multi-Modal Inputs + +GLM-4.5V supports both image and video inputs.
Here's a basic example with image input: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "Describe this image in detail." + } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="zai-org/GLM-4.5V", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example Output:** + +```text Output +Response costs: 3.37s +Generated text: Auntie Anne's + +CINNAMON SUGAR +1 x 17,000 17,000 + +SUB TOTAL 17,000 + +GRAND TOTAL 17,000 + +CASH IDR 20,000 + +CHANGE DUE 3,000 +``` + +**Multi-Image Input Example:** + +GLM-4.5V can process multiple images in a single request for comparison or analysis: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://www.civitatis.com/f/china/hong-kong/guia/taxi.jpg" + } + }, + { + "type": "image_url", + "image_url": { + "url": "https://cdn.cheapoguides.com/wp-content/uploads/sites/7/2025/05/GettyImages-509614603-1280x600.jpg" + } + }, + { + "type": "text", + "text": "Compare these two images and describe the differences in 100 words or less. Focus on the key visual elements, colors, textures, and any notable contrasts between the two scenes. Be specific about what you see in each image." 
+ } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="zai-org/GLM-4.5V", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example Output:** + +```text Output +Response costs: 3.86s +Generated text: The first image shows a close - up of a few red taxis on a street with storefronts in the background. The taxis are in a line, and the scene has an urban, busy feel with visible shop displays. The second image is an aerial view of a large taxi parking area with numerous red and green taxis, some with hoods open. The scene is more open, with a parking lot layout, and includes elements like a bridge and grassy areas. Key differences: number of taxis (few vs many), perspective (close - up vs aerial), color variety (mostly red vs red and green), and setting (street with shops vs parking lot). +``` + +**Video Input Example:** + +GLM-4.5V supports video understanding by processing video URLs: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "video_url", + "video_url": { + "url": "https://videos.pexels.com/video-files/4114797/4114797-uhd_3840_2160_25fps.mp4" + } + }, + { + "type": "text", + "text": "Describe what happens in this video." 
+ } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="zai-org/GLM-4.5V", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Note:** + +- For video processing, ensure you have sufficient context length configured (up to 64K tokens) +- Video processing may require more memory; adjust `--mem-fraction-static` accordingly +- You can also provide local file paths using `file://` protocol + +**Example Output:** + +```text Output +Response costs: 3.89s +Generated text: A person wearing blue gloves is using a microscope. They are adjusting the focus knob with one hand while holding a pipette with the other, suggesting they are preparing or examining a sample on the slide beneath the objective lens. The microscope's 40x objective lens is positioned over the slide, indicating a high-magnification observation. The person carefully manipulates the slide and the microscope controls, likely to achieve a clear view of the specimen. +``` + +#### 4.2.2 Thinking Mode + +GLM-4.5V supports thinking mode for enhanced reasoning. 
Enable thinking mode during deployment: + +```shell Command +python -m sglang.launch_server \ + --model-path zai-org/GLM-4.5V \ + --reasoning-parser glm45 \ + --tp 4 \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="zai-org/GLM-4.5V", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +**Disable Thinking Mode:** + +To disable thinking mode for a specific request: + +```python Example +response = client.chat.completions.create( + model="zai-org/GLM-4.5V", + messages=[{"role": "user", "content": "What is the capital of France?"}], + extra_body={"chat_template_kwargs": {"enable_thinking": False}} +) +``` + +#### 4.2.3 Tool Calling + +GLM-4.5V supports tool calling capabilities. 
Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model-path zai-org/GLM-4.5V \ + --reasoning-parser glm45 \ + --tool-call-parser glm45 \ + --tp 4 \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="zai-org/GLM-4.5V", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + 
tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. +I should call the function with location="Beijing". +=============== Content ================= + +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="zai-org/GLM-4.5V", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +### 5.1 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.1.1 MMMU Benchmark + +- Benchmark Command + +```bash Command +python3 benchmark/mmmu/bench_sglang.py --response-answer-regex "<\|begin_of_box\|>(.*)<\|end_of_box\|>" --port 30000 --concurrency 64 +``` + +- Test Result + +```text Output +Benchmark time: 616.6163094160147 +answers saved to: ./answer_sglang.json +Evaluating... 
+answers saved to: ./answer_sglang.json +{'Accounting': {'acc': 0.867, 'num': 30}, + 'Agriculture': {'acc': 0.567, 'num': 30}, + 'Architecture_and_Engineering': {'acc': 0.667, 'num': 30}, + 'Art': {'acc': 0.667, 'num': 30}, + 'Art_Theory': {'acc': 0.9, 'num': 30}, + 'Basic_Medical_Science': {'acc': 0.8, 'num': 30}, + 'Biology': {'acc': 0.6, 'num': 30}, + 'Chemistry': {'acc': 0.533, 'num': 30}, + 'Clinical_Medicine': {'acc': 0.667, 'num': 30}, + 'Computer_Science': {'acc': 0.8, 'num': 30}, + 'Design': {'acc': 0.867, 'num': 30}, + 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.667, 'num': 30}, + 'Economics': {'acc': 0.833, 'num': 30}, + 'Electronics': {'acc': 0.433, 'num': 30}, + 'Energy_and_Power': {'acc': 0.733, 'num': 30}, + 'Finance': {'acc': 0.767, 'num': 30}, + 'Geography': {'acc': 0.667, 'num': 30}, + 'History': {'acc': 0.8, 'num': 30}, + 'Literature': {'acc': 0.9, 'num': 30}, + 'Manage': {'acc': 0.733, 'num': 30}, + 'Marketing': {'acc': 0.9, 'num': 30}, + 'Materials': {'acc': 0.567, 'num': 30}, + 'Math': {'acc': 0.8, 'num': 30}, + 'Mechanical_Engineering': {'acc': 0.767, 'num': 30}, + 'Music': {'acc': 0.3, 'num': 30}, + 'Overall': {'acc': 0.732, 'num': 900}, + 'Overall-Art and Design': {'acc': 0.683, 'num': 120}, + 'Overall-Business': {'acc': 0.82, 'num': 150}, + 'Overall-Health and Medicine': {'acc': 0.787, 'num': 150}, + 'Overall-Humanities and Social Science': {'acc': 0.783, 'num': 120}, + 'Overall-Science': {'acc': 0.707, 'num': 150}, + 'Overall-Tech and Engineering': {'acc': 0.648, 'num': 210}, + 'Pharmacy': {'acc': 0.9, 'num': 30}, + 'Physics': {'acc': 0.933, 'num': 30}, + 'Psychology': {'acc': 0.767, 'num': 30}, + 'Public_Health': {'acc': 0.9, 'num': 30}, + 'Sociology': {'acc': 0.667, 'num': 30}} +eval out saved to ./val_sglang.json +Overall accuracy: 0.732 +``` diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-4.6.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-4.6.mdx new file mode 100644 index 000000000000..dfefd313a31d --- /dev/null +++ 
b/docs_new/cookbook/autoregressive/GLM/GLM-4.6.mdx @@ -0,0 +1,888 @@ +--- +title: GLM-4.6 +metatags: + description: "Deploy GLM-4.6 with SGLang - 200K context window, superior coding, advanced reasoning, and enhanced agentic capabilities." +--- + +## 1. Model Introduction + +[GLM-4.6](https://huggingface.co/zai-org/GLM-4.6) is a powerful language model developed by Zhipu AI, featuring advanced capabilities in reasoning, function calling, and multi-modal understanding. + +As the latest iteration in the GLM series, GLM-4.6 achieves comprehensive enhancements across multiple domains, including real-world coding, long-context processing, reasoning, searching, writing, and agentic applications. Details are as follows: + +- **Longer context window**: The context window has been expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks. +- **Superior coding performance**: The model achieves higher scores on code benchmarks and demonstrates better real-world performance in applications such as Claude Code, Cline, Roo Code and Kilo Code, including improvements in generating visually polished front-end pages. +- **Advanced reasoning**: GLM-4.6 shows a clear improvement in reasoning performance and supports tool use during inference, leading to stronger overall capability. +- **More capable agents**: GLM-4.6 exhibits stronger performance in tool use and search-based agents, and integrates more effectively within agent frameworks. +- **Refined writing**: Better aligns with human preferences in style and readability, and performs more naturally in role-playing scenarios. + +For more details, please refer to the [official GLM-4.6 documentation](https://docs.z.ai/guides/llm/glm-4.6). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. 
+ +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, deployment strategy, and thinking capabilities. + +import { GLM46Deployment } from "/src/snippets/autoregressive/glm-46-deployment.jsx"; + + + +### 3.2 Configuration Tips + +For more detailed configuration tips, please refer to [GLM-4.5/GLM-4.6 Usage](../../../docs/basic_usage/glm45). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +GLM-4.6 supports Thinking mode by default. 
Enable the reasoning parser during deployment to separate the thinking and the content sections: + +```shell Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.6 \ + --reasoning-parser glm45 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="zai-org/GLM-4.6", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +To solve this problem, I need to calculate 15% of 240. +Step 1: Convert 15% to decimal: 15% = 0.15 +Step 2: Multiply 240 by 0.15 +Step 3: 240 × 0.15 = 36 +=============== Content ================= + +The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36. 
+``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +GLM-4.6 supports tool calling capabilities. Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.6 \ + --reasoning-parser glm45 \ + --tool-call-parser glm45 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="zai-org/GLM-4.6", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== 
Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"🔧 Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. +I should call the function with location="Beijing". +=============== Content ================= + +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="zai-org/GLM-4.6", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. 
Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (8x), AMD MI300X (8x), AMD MI325X (8x), AMD MI355X (8x) +- Model: GLM-4.6 +- Tensor Parallelism: 8 +- SGLang Version: 0.5.6.post1 + +**Benchmark Methodology:** + +We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms. + +#### 5.1.1 Standard Test Scenarios + +Three core scenarios reflect real-world usage patterns: +
| Scenario | Input Length | Output Length | Use Case |
| --- | --- | --- | --- |
| **Chat** | 1K | 1K | Most common conversational AI workload |
| **Reasoning** | 1K | 8K | Long-form generation, complex reasoning tasks |
| **Summarization** | 8K | 1K | Document summarization, RAG retrieval |
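
Each scenario in the table corresponds to a pair of `--random-input-len` and `--random-output-len` values passed to `sglang.bench_serving`. The sketch below generates the full grid of commands so the runs stay consistent. It assumes 1K maps to 1000 tokens and 8K to 8000, as in the commands in this guide. The names `SCENARIOS`, `bench_command`, and `prompts_per_slot` are hypothetical helpers, not part of SGLang, and the default multiplier of 5 is only one of the options this guide describes (the low-concurrency commands here use 10 prompts instead).

```python
import shlex

# Scenario name -> (input tokens, output tokens), following the table.
SCENARIOS = {
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}


def bench_command(model, scenario, concurrency, prompts_per_slot=5):
    """Build the bench_serving argument list for one scenario/concurrency pair."""
    input_len, output_len = SCENARIOS[scenario]
    return [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        # Total prompts scale with concurrency so each slot sees several requests.
        "--num-prompts", str(concurrency * prompts_per_slot),
        "--max-concurrency", str(concurrency),
        "--request-rate", "inf",
    ]


if __name__ == "__main__":
    for scenario in SCENARIOS:
        for concurrency in (1, 16, 100):
            print(shlex.join(bench_command("zai-org/GLM-4.6", scenario, concurrency)))
```

The returned list can be passed directly to `subprocess.run` once the server is up; printing the joined command first makes it easy to check the grid before starting long runs.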
+ +#### 5.1.2 Concurrency Levels + +Test each scenario at three concurrency levels to capture the throughput vs. latency tradeoff (Pareto frontier): + +- **Low Concurrency**: `--max-concurrency 1` (Latency-optimized) +- **Medium Concurrency**: `--max-concurrency 16` (Balanced) +- **High Concurrency**: `--max-concurrency 100` (Throughput-optimized) + +#### 5.1.3 Number of Prompts + +For each concurrency level, configure `num_prompts` to simulate realistic user loads: + +- **Quick Test**: `num_prompts = concurrency × 1` (minimal test) +- **Recommended**: `num_prompts = concurrency × 5` (standard benchmark) +- **Stable Measurements**: `num_prompts = concurrency × 10` (production-grade) + +--- + +#### 5.1.4 Benchmark Commands + +**Scenario 1: Chat (1K/1K) - Most Important** + +- **Model Deployment** +```bash Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.6 \ + --tp 8 +``` + + +- Low Concurrency (Latency-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 63.82 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4210 +Total generated tokens (retokenized): 4209 +Request throughput (req/s): 0.16 +Input token throughput (tok/s): 95.60 +Output token throughput (tok/s): 65.97 +Peak output token throughput (tok/s): 68.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 161.57 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6379.24 +Median E2E Latency (ms): 5085.00 +---------------Time to First Token---------------- +Mean TTFT (ms): 155.57 
+Median TTFT (ms): 149.79 +P99 TTFT (ms): 207.69 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 14.81 +Median TPOT (ms): 14.80 +P99 TPOT (ms): 14.84 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 14.82 +Median ITL (ms): 14.82 +P95 ITL (ms): 15.17 +P99 ITL (ms): 15.36 +Max ITL (ms): 25.05 +================================================== +``` + + +- Medium Concurrency (Balanced) +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +```text Output + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 72.06 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40725 +Total generated tokens (retokenized): 40672 +Request throughput (req/s): 1.11 +Input token throughput (tok/s): 550.47 +Output token throughput (tok/s): 565.14 +Peak output token throughput (tok/s): 752.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 1115.61 +Concurrency: 13.71 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 12348.93 +Median E2E Latency (ms): 13164.81 +---------------Time to First Token---------------- +Mean TTFT (ms): 196.08 +Median TTFT (ms): 155.22 +P99 TTFT (ms): 377.98 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 24.24 +Median TPOT (ms): 24.55 +P99 TPOT (ms): 30.42 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 23.92 +Median ITL (ms): 21.40 +P95 ITL (ms): 22.49 +P99 ITL (ms): 123.83 +Max ITL (ms): 486.54 +================================================== +``` + + +- High Concurrency (Throughput-Optimized) +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 138.50 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252162 +Total generated tokens (retokenized): 251841 +Request throughput (req/s): 3.61 +Input token throughput (tok/s): 1803.78 +Output token throughput (tok/s): 1820.61 +Peak output token throughput (tok/s): 2900.00 +Peak concurrent requests: 107 +Total token throughput (tok/s): 3624.40 +Concurrency: 90.91 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 25183.97 +Median E2E Latency (ms): 23968.49 +---------------Time to First Token---------------- +Mean TTFT (ms): 337.77 +Median TTFT (ms): 180.65 +P99 TTFT (ms): 906.14 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 49.97 +Median TPOT (ms): 52.20 +P99 TPOT (ms): 61.81 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 49.36 +Median ITL (ms): 35.05 +P95 ITL (ms): 124.91 +P99 ITL (ms): 187.69 +Max ITL (ms): 440.34 +================================================== +``` + +**Scenario 2: Reasoning (1K/8K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 666.64 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 44452 +Total generated tokens (retokenized): 44387 +Request throughput (req/s): 0.02 +Input token throughput (tok/s): 9.15 +Output token throughput (tok/s): 66.68 +Peak output token throughput (tok/s): 68.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 75.83 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 66661.35 +Median E2E Latency (ms): 71902.36 +---------------Time to First Token---------------- +Mean TTFT (ms): 160.21 +Median TTFT (ms): 140.32 +P99 TTFT (ms): 295.56 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 14.92 +Median TPOT (ms): 14.94 +P99 TPOT (ms): 15.02 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 14.96 +Median ITL (ms): 14.96 +P95 ITL (ms): 15.36 +P99 ITL (ms): 15.57 +Max ITL (ms): 19.06 +================================================== +``` + +- Medium Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 503.30 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 318226 +Total generated tokens (retokenized): 318025 +Request throughput (req/s): 0.16 +Input token throughput (tok/s): 78.82 +Output token throughput (tok/s): 632.28 +Peak output token throughput (tok/s): 752.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 711.09 +Concurrency: 13.88 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 87349.22 +Median E2E Latency (ms): 88248.04 +---------------Time to First Token---------------- +Mean TTFT (ms): 228.54 +Median TTFT (ms): 142.78 +P99 TTFT (ms): 569.84 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 21.97 +Median TPOT (ms): 22.14 +P99 TPOT (ms): 22.47 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 21.91 +Median ITL (ms): 21.80 +P95 ITL (ms): 22.30 +P99 ITL (ms): 22.78 +Max ITL (ms): 137.19 +================================================== +``` + +- High Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 772.28 +Total input tokens: 158939 +Total input text tokens: 158939 +Total input vision tokens: 0 +Total generated tokens: 1300705 +Total generated tokens (retokenized): 1299924 +Request throughput (req/s): 0.41 +Input token throughput (tok/s): 205.80 +Output token throughput (tok/s): 1684.24 +Peak output token throughput (tok/s): 2112.00 +Peak concurrent requests: 68 +Total token throughput (tok/s): 1890.05 +Concurrency: 56.17 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 135563.36 +Median E2E Latency (ms): 140888.88 +---------------Time to First Token---------------- +Mean TTFT (ms): 232.45 +Median TTFT (ms): 145.59 +P99 TTFT (ms): 576.49 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 33.47 +Median TPOT (ms): 34.02 +P99 TPOT (ms): 35.10 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 33.30 +Median ITL (ms): 32.63 +P95 ITL (ms): 34.27 +P99 ITL (ms): 104.39 +Max ITL (ms): 155.65 +================================================== +``` + +**Scenario 3: Summarization (8K/1K)** + +- Low +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 65.11 +Total input tokens: 41941 +Total input text tokens: 41941 +Total input vision tokens: 0 +Total generated tokens: 4210 +Total generated tokens (retokenized): 4210 +Request throughput (req/s): 0.15 +Input token throughput (tok/s): 644.17 +Output token throughput (tok/s): 64.66 +Peak output token throughput (tok/s): 68.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 708.83 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6508.31 +Median E2E Latency (ms): 5263.36 +---------------Time to First Token---------------- +Mean TTFT (ms): 189.48 +Median TTFT (ms): 159.23 +P99 TTFT (ms): 304.09 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 15.02 +Median TPOT (ms): 15.03 +P99 TPOT (ms): 15.27 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 15.04 +Median ITL (ms): 15.03 +P95 ITL (ms): 15.46 +P99 ITL (ms): 15.65 +Max ITL (ms): 24.20 +================================================== +``` + +- Medium Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 76.43 +Total input tokens: 300020 +Total input text tokens: 300020 +Total input vision tokens: 0 +Total generated tokens: 41589 +Total generated tokens (retokenized): 41577 +Request throughput (req/s): 1.05 +Input token throughput (tok/s): 3925.47 +Output token throughput (tok/s): 544.15 +Peak output token throughput (tok/s): 752.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 4469.62 +Concurrency: 13.95 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 13329.63 +Median E2E Latency (ms): 14141.09 +---------------Time to First Token---------------- +Mean TTFT (ms): 339.88 +Median TTFT (ms): 252.75 +P99 TTFT (ms): 906.54 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 25.37 +Median TPOT (ms): 25.73 +P99 TPOT (ms): 30.94 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 25.04 +Median ITL (ms): 21.68 +P95 ITL (ms): 22.69 +P99 ITL (ms): 146.98 +Max ITL (ms): 483.14 +================================================== +``` + + +- High Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.6 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 136.24 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total input vision tokens: 0 +Total generated tokens: 169680 +Total generated tokens (retokenized): 169452 +Request throughput (req/s): 2.35 +Input token throughput (tok/s): 9350.32 +Output token throughput (tok/s): 1245.44 +Peak output token throughput (tok/s): 1984.00 +Peak concurrent requests: 69 +Total token throughput (tok/s): 10595.77 +Concurrency: 58.46 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 24889.40 +Median E2E Latency (ms): 25123.37 +---------------Time to First Token---------------- +Mean TTFT (ms): 355.82 +Median TTFT (ms): 268.84 +P99 TTFT (ms): 858.64 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 46.62 +Median TPOT (ms): 49.04 +P99 TPOT (ms): 58.88 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 46.36 +Median ITL (ms): 32.46 +P95 ITL (ms): 135.23 +P99 ITL (ms): 204.27 +Max ITL (ms): 508.14 +================================================== +``` + +#### 5.1.5 Understanding the Results + +**Key Metrics:** + +- **Request Throughput (req/s)**: Number of requests processed per second +- **Output Token Throughput (tok/s)**: Total tokens generated per second +- **Mean TTFT (ms)**: Time to First Token - measures responsiveness +- **Mean TPOT (ms)**: Time Per Output Token - measures generation speed +- **Mean ITL (ms)**: Inter-Token Latency - measures streaming consistency + +**Why These Configurations Matter:** + +- **1K/1K (Chat)**: Represents the most common conversational AI workload. This is the highest priority scenario for most deployments. +- **1K/8K (Reasoning)**: Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations. +- **8K/1K (Summarization)**: Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks. +- **Variable Concurrency**: Captures the Pareto frontier - the optimal tradeoff between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput. 
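The throughput figures in a report can be recomputed from its raw counters, which is a quick sanity check when comparing runs. The sketch below is illustrative only: `BenchSummary`, `throughput`, and `tpot_ms` are hypothetical helpers, not part of `sglang.bench_serving`, and the per-request TPOT formula (end-to-end latency minus TTFT, divided by the output tokens after the first) is an assumption based on the "excl. 1st token" label in the report.

```python
from dataclasses import dataclass


@dataclass
class BenchSummary:
    """Raw counters copied by hand from a benchmark report (hypothetical container)."""
    successful_requests: int
    duration_s: float
    input_tokens: int
    output_tokens: int


def throughput(summary: BenchSummary) -> dict:
    """Derive the per-second rates from the raw counters."""
    req_per_s = summary.successful_requests / summary.duration_s
    in_tok_per_s = summary.input_tokens / summary.duration_s
    out_tok_per_s = summary.output_tokens / summary.duration_s
    return {
        "request_throughput": req_per_s,
        "input_token_throughput": in_tok_per_s,
        "output_token_throughput": out_tok_per_s,
        "total_token_throughput": in_tok_per_s + out_tok_per_s,
    }


def tpot_ms(e2e_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token for one request, excluding the first token (assumed formula)."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_ms - ttft_ms) / (output_tokens - 1)


# Counters taken from the 1K/1K high-concurrency run above.
chat_high = BenchSummary(
    successful_requests=500,
    duration_s=138.50,
    input_tokens=249831,
    output_tokens=252162,
)
rates = throughput(chat_high)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Because the report prints the duration rounded to two decimals, the recomputed token rates land within roughly 0.1 tok/s of the reported values rather than matching them exactly.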
+ +**Interpreting Results:** + +- Compare your results against baseline numbers for your hardware +- Higher throughput at same latency = better performance +- Lower TTFT = more responsive user experience +- Lower TPOT = faster generation speed + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command +```bash Command +python -m sglang.test.few_shot_gsm8k \ + --num-questions 200 \ + --port 30000 +``` + +- Test Result +```text Output +Accuracy: 0.975 +Invalid: 0.000 +Latency: 16.574 s +Output throughput: 1194.637 token/s +``` diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-4.6V.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-4.6V.mdx new file mode 100644 index 000000000000..a0fc9d2949e3 --- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-4.6V.mdx @@ -0,0 +1,382 @@ +--- +title: GLM-4.6V +metatags: + description: "Deploy GLM-4.6V vision-language model with SGLang - native function calling, 128K context, multimodal document understanding and frontend replication." +--- + +## 1. Model Introduction + +The GLM-4.6V series includes two versions: GLM-4.6V (106B), a foundation model designed for cloud and high-performance cluster scenarios, and GLM-4.6V-Flash (9B), a lightweight model optimized for local deployment and low-latency applications. GLM-4.6V scales its context window to 128K tokens in training and achieves SoTA performance in visual understanding among models of similar parameter scales. Crucially, the GLM team integrated native Function Calling capabilities for the first time. This effectively bridges the gap between "visual perception" and "executable action", providing a unified technical foundation for multimodal agents in real-world business scenarios. + +Beyond achieving SoTA performance across major multimodal benchmarks at comparable model scales, GLM-4.6V introduces several key features: + +- **Native Multimodal Function Calling** Enables native vision-driven tool use. 
Images, screenshots, and document pages can be passed directly as tool inputs without text conversion, while visual outputs (charts, search images, rendered pages) are interpreted and integrated into the reasoning chain. This closes the loop from perception to understanding to execution. Please refer to this [example](#tool-call-example). +- **Interleaved Image-Text Content Generation** Supports high-quality mixed media creation from complex multimodal inputs. GLM-4.6V takes a multimodal context—spanning documents, user inputs, and tool-retrieved images—and synthesizes coherent, interleaved image-text content tailored to the task. During generation it can actively call search and retrieval tools to gather and curate additional text and visuals, producing rich, visually grounded content. +- **Multimodal Document Understanding** GLM-4.6V can process up to 128K tokens of multi-document or long-document input, directly interpreting richly formatted pages as images. It understands text, layout, charts, tables, and figures jointly, enabling accurate comprehension of complex, image-heavy documents without requiring prior conversion to plain text. +- **Frontend Replication & Visual Editing** Reconstructs pixel-accurate HTML/CSS from UI screenshots and supports natural-language-driven edits. It detects layout, components, and styles visually, generates clean code, and applies iterative visual modifications through simple user instructions. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. 
+ +### 2.1 Docker Installation (Recommended) + +```shell Command +docker pull lmsysorg/sglang:latest +``` + +**Advantages:** + +- Ready to use out of the box, no manual environment configuration needed +- Avoids dependency conflict issues +- Easy to migrate between different environments + +### 2.2 Build from Source + +If you need to use the latest development version or require custom modifications, you can build from source: + +```bash Command +# Install SGLang using UV (recommended) +git clone https://github.com/sgl-project/sglang.git +cd sglang +uv venv +source .venv/bin/activate +uv pip install -e "python[all]" --index-url=https://pypi.org/simple +pip install nvidia-cudnn-cu12==9.16.0.29 +# Install ffmpeg to support video input +sudo apt update +sudo apt install ffmpeg +``` + +**Use Cases:** + +- Need to customize and modify SGLang source code +- Want to use the latest development features +- Participate in SGLang project development + +For general installation instructions, you can also refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. Model Deployment + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the interactive configuration generator below to customize your deployment settings. Select your hardware platform, model size, quantization method, and other options to generate the appropriate launch command. + +import { GLM46VDeployment } from "/src/snippets/autoregressive/glm-46v-deployment.jsx"; + + + +### 3.2 Configuration Tips +- **TTFT Optimization**: Set `SGLANG_USE_CUDA_IPC_TRANSPORT=1` to use CUDA IPC for transferring multimodal features, which significantly improves TTFT. This consumes additional memory and may require adjusting `--mem-fraction-static` and/or `--max-running-requests`. The additional memory is proportional to the image size multiplied by the number of images in the currently running requests. 
+- **TP=8 Configuration**: When using Tensor Parallelism (TP) of 8, the vision attention's 12 heads cannot be divided evenly across the 8 ranks. You can resolve this by adding `--mm-enable-dp-encoder` (which the generator above handles automatically). +- **Fast Model Loading**: For large models (like the 106B version), you can speed up model loading by using `--model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}'`. +- For more detailed configuration tips, please refer to [GLM-4.5V/GLM-4.6V Usage](../../../docs/basic_usage/glmv). + +## 4. Example APIs + +### Image Input Example + +#### API Payload +```python Example +import subprocess + +# Build the curl request; doubled braces are literal braces inside the f-string +curl_command = f""" +curl -s http://localhost:{30000}/v1/chat/completions \\ + -H "Content-Type: application/json" \\ + -d '{{ + "model": "default", + "messages": [ + {{ + "role": "user", + "content": [ + {{ + "type": "image_url", + "image_url": {{ + "url": "/home/jobuser/sgl_logo.png" + }} + }}, + {{ + "type": "text", + "text": "What is the image" + }} + ] + }} + ], + "temperature": "0", + "max_completion_tokens": "1000", + "max_tokens": "1000" + }}' +""" + +# Run the command and print the raw JSON response +response = subprocess.check_output(curl_command, shell=True).decode() +print(response) +``` + +#### API Response +```text Output +{"id":"b61596ca71394dd699fd8abd4f650c44","object":"chat.completion","created":1765259019,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"The image is a logo featuring the text \"SGL\" (in a bold, orange-brown font) alongside a stylized icon. The icon includes a network-like structure with circular nodes (suggesting connectivity or a tree/graph structure) and a tag with \"\" (a common symbol for coding, web development, or software). 
The color scheme uses warm orange-brown tones with a black background, giving it a tech-focused, modern aesthetic (likely representing a company, project, or tool related to software, web development, or digital technology).<|begin_of_box|>SGL logo (stylized text + network/coding icon)<|end_of_box|>","reasoning_content":"Okay, let's see. The image has a logo with the text \"SGL\" and a little icon on the left. The icon looks like a network or a tree structure with circles, and there's a tag with \"\" which is a common symbol for coding or web development. The colors are orange and brown tones, with a black background. So probably a logo for a company or project named SGL, maybe related to software, web development, or a tech company.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151336}],"usage":{"prompt_tokens":2222,"total_tokens":2448,"completion_tokens":226,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}} +``` + +### Video Input Example + +#### API Payload +```python Example +import subprocess + +# Build the curl request; doubled braces are literal braces inside the f-string +curl_command = f""" +curl -s http://localhost:{30000}/v1/chat/completions \\ + -H "Content-Type: application/json" \\ + -d '{{ + "model": "default", + "messages": [ + {{ + "role": "user", + "content": [ + {{ + "type": "video_url", + "video_url": {{ + "url": "/home/jobuser/jobs_presenting_ipod.mp4" + }} + }}, + {{ + "type": "text", + "text": "What is the image" + }} + ] + }} + ], + "temperature": "0", + "max_completion_tokens": "1000", + "max_tokens": "1000" + }}' +""" + +# Run the command and print the raw JSON response +response = subprocess.check_output(curl_command, shell=True).decode() +print(response) +``` + +#### API Response +```text Output +{"id":"520e0a079e5d4b17b82a6af619315a97","object":"chat.completion","created":1765259029,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"The image is a still from a presentation by a man on a stage. He is pointing to a small pocket on his jeans and asking the audience what the pocket is for. 
The video is being shared by Evan Carmichael. The man then reveals that the pocket is for an iPod Nano.","reasoning_content":"Based on the visual evidence in the video, here is a breakdown of what is being shown:\n\n* **Subject:** The video features a man on a stage, giving a presentation. He is wearing a black t-shirt and dark jeans.\n* **Action:** The man is pointing to a pocket on his jeans. He is asking the audience a question about the purpose of this pocket.\n* **Context:** The presentation is being filmed, and the video is being shared by \"Evan Carmichael,\" a well-known motivational speaker and content creator. The source of the clip is credited to \"JoshuaG.\"\n* **Reveal:** The man then reveals the answer to his question. He pulls a small, white, rectangular device out of the pocket. He identifies this device as an \"iPod Nano.\"\n\nIn summary, the image is a still from a presentation where a speaker is explaining the purpose of the small pocket found on many pairs of jeans.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151336}],"usage":{"prompt_tokens":30276,"total_tokens":30532,"completion_tokens":256,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}} +``` + +### Tool Call Example + +#### API Payload +```python Example +from openai import OpenAI +import argparse +import sys +import base64 + +def image_to_base64(image_path): + """Convert image file to base64 data URL format for OpenAI API""" + with open(image_path, 'rb') as image_file: + image_data = image_file.read() + base64_string = base64.b64encode(image_data).decode('utf-8') + return f"data:image/png;base64,{base64_string}" + +openai_api_key = "EMPTY" +openai_api_base = "http://127.0.0.1:30000/v1" +client = OpenAI(api_key=openai_api_key, base_url=openai_api_base) + + + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get current temperature for a given location.", + "parameters": { + 
"type": "object", + "properties": { + "location": { + "type": "string", + "description": "City and country e.g. Beijing, China", + } + }, + "required": ["location"], + "additionalProperties": False, + }, + }, + } +] + + +messages = [ + { + "role": "user", + "content": "Please help me check today’s weather in Beijing, and tell me whether the tool returned an image." + }, + { + "role": "assistant", + "tool_calls": [ + { + "id": "call_bk32t88BGpSdbtDgzT044Rh4", + "type": "function", + "function": { + "name": 'get_weather', + "arguments": '{"location":"Beijing, China"}' + } + } + ] + }, + { + "role": "tool", + "tool_call_id": "call_bk32t88BGpSdbtDgzT044Rh4", + "content": [ + { + "type": "text", + "text": "Weather report generated: Beijing, November 7, 2025, sunny, temperature 2°C." + }, + { + "type": "image_url", + "image_url": { + "url": "/home/jobuser/sgl_logo.png" + } + } + ] + }, +] + +response = client.chat.completions.create( + model="zai-org/GLM-4.6V", + messages=messages, + timeout=900, + tools=tools +) +print(response.choices[0].message.content.strip()) +``` + +#### Output + +```text Output +The weather in Beijing today (November 7, 2025) is sunny with a temperature of 2°C. + +Yes, the tool returned an image (the SGL logo). +``` + +## 5. Benchmark + +### 5.1. Text Benchmark: Latency, Throughput and Accuracy + +#### Command +```shell Command +python3 ./benchmark/gsm8k/bench_sglang.py +``` +#### Result Output +```text Output +Accuracy: 0.925 +Invalid: 0.000 +Latency: 15.327 s +Output throughput: 1788.375 token/s +``` + +### 5.2. 
Multimodal Benchmark - Latency and Throughput + +#### Command +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --port 30000 \ + --model zai-org/GLM-4.6V \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 128 \ + --max-concurrency 8 +``` + +#### Result Output +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 8 +Successful requests: 128 +Benchmark duration (s): 89.27 +Total input tokens: 315390 +Total input text tokens: 8702 +Total input vision tokens: 306688 +Total generated tokens: 66020 +Total generated tokens (retokenized): 31037 +Request throughput (req/s): 1.43 +Input token throughput (tok/s): 3533.17 +Output token throughput (tok/s): 739.59 +Peak output token throughput (tok/s): 823.00 +Peak concurrent requests: 12 +Total token throughput (tok/s): 4272.76 +Concurrency: 7.67 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 5349.20 +Median E2E Latency (ms): 5380.98 +---------------Time to First Token---------------- +Mean TTFT (ms): 1724.04 +Median TTFT (ms): 1688.16 +P99 TTFT (ms): 6152.34 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 8.15 +Median TPOT (ms): 7.77 +P99 TPOT (ms): 23.97 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 10.00 +Median ITL (ms): 8.44 +P95 ITL (ms): 9.23 +P99 ITL (ms): 116.02 +Max ITL (ms): 173.48 +================================================== +``` + + +### 5.3. Multimodal Accuracy Benchmark - MMMU + +#### Command +```shell Command +python3 benchmark/mmmu/bench_sglang.py --response-answer-regex "<\|begin_of_box\|>(.*)<\|end_of_box\|>" --port 30000 --concurrency 64 --extra-request-body '{"max_tokens": 4096}' +``` + +#### Result Output +```text Output +Benchmark time: 487.2229107860476 +answers saved to: ./answer_sglang.json +Evaluating... 
+answers saved to: ./answer_sglang.json +{'Accounting': {'acc': 0.962, 'num': 26}, + 'Agriculture': {'acc': 0.5, 'num': 30}, + 'Architecture_and_Engineering': {'acc': 0.733, 'num': 15}, + 'Art': {'acc': 0.833, 'num': 30}, + 'Art_Theory': {'acc': 0.9, 'num': 30}, + 'Basic_Medical_Science': {'acc': 0.733, 'num': 30}, + 'Biology': {'acc': 0.586, 'num': 29}, + 'Chemistry': {'acc': 0.654, 'num': 26}, + 'Clinical_Medicine': {'acc': 0.633, 'num': 30}, + 'Computer_Science': {'acc': 0.76, 'num': 25}, + 'Design': {'acc': 0.867, 'num': 30}, + 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.633, 'num': 30}, + 'Economics': {'acc': 0.862, 'num': 29}, + 'Electronics': {'acc': 0.5, 'num': 18}, + 'Energy_and_Power': {'acc': 0.875, 'num': 16}, + 'Finance': {'acc': 0.857, 'num': 28}, + 'Geography': {'acc': 0.714, 'num': 28}, + 'History': {'acc': 0.767, 'num': 30}, + 'Literature': {'acc': 0.897, 'num': 29}, + 'Manage': {'acc': 0.759, 'num': 29}, + 'Marketing': {'acc': 1.0, 'num': 26}, + 'Materials': {'acc': 0.833, 'num': 18}, + 'Math': {'acc': 0.76, 'num': 25}, + 'Mechanical_Engineering': {'acc': 0.619, 'num': 21}, + 'Music': {'acc': 0.286, 'num': 28}, + 'Overall': {'acc': 0.761, 'num': 803}, + 'Overall-Art and Design': {'acc': 0.729, 'num': 118}, + 'Overall-Business': {'acc': 0.884, 'num': 138}, + 'Overall-Health and Medicine': {'acc': 0.773, 'num': 150}, + 'Overall-Humanities and Social Science': {'acc': 0.78, 'num': 118}, + 'Overall-Science': {'acc': 0.728, 'num': 136}, + 'Overall-Tech and Engineering': {'acc': 0.671, 'num': 143}, + 'Pharmacy': {'acc': 0.933, 'num': 30}, + 'Physics': {'acc': 0.929, 'num': 28}, + 'Psychology': {'acc': 0.733, 'num': 30}, + 'Public_Health': {'acc': 0.933, 'num': 30}, + 'Sociology': {'acc': 0.724, 'num': 29}} +eval out saved to ./val_sglang.json +Overall accuracy: 0.761 +``` diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-4.7-Flash.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-4.7-Flash.mdx new file mode 100644 index 000000000000..76bff2893484 
--- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-4.7-Flash.mdx @@ -0,0 +1,931 @@ +--- +title: GLM-4.7-Flash +metatags: + description: "Deploy GLM-4.7-Flash 30B-A3B MoE model with SGLang - lightweight, efficient inference optimized for single-GPU deployment." +--- + +## 1. Model Introduction + +[GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) is a lightweight and high-speed model in the GLM-4.7 series developed by Zhipu AI, featuring state-of-the-art capabilities in reasoning, function calling, and efficient local deployment. + +As a compact variant in the GLM-4.7 family, GLM-4.7-Flash is a **30B-A3B MoE** model designed to balance performance and efficiency: + +- **Lightweight Architecture**: 30B total parameters with only 3B active parameters, enabling efficient inference +- **Enhanced Reasoning**: Inherits the reasoning capabilities from GLM-4.7 with optimized performance +- **Superior Coding**: Strong code generation and understanding capabilities +- **Advanced Tool Use**: Robust tool calling and agent capabilities for complex workflows +- **Optimized for Local Deployment**: Designed for single-GPU deployment scenarios + +For more details, please refer to the [official GLM-4.7 documentation](https://docs.z.ai/guides/llm/glm-4.7). + +**Key Features:** + +- **Efficient MoE Architecture**: 30B-A3B sparse activation for optimal performance/efficiency trade-off +- **Multiple Quantizations**: BF16 and FP8 variants for different performance/memory trade-offs +- **Hardware Optimization**: Specifically tuned for NVIDIA H100/H200/B200 GPUs +- **High Performance**: Optimized for both throughput and latency scenarios + +**Available Models:** + +- **BF16 (Full precision)**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) + +**License:** + +Please refer to the [official GLM-4.7-Flash model card](https://huggingface.co/zai-org/GLM-4.7-Flash) for license details. + +## 2. 
SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, deployment strategy, and thinking capabilities. + +import { GLM47FlashDeployment } from "/src/snippets/autoregressive/glm-47-flash-deployment.jsx"; + + + +### 3.2 Configuration Tips + +For more detailed configuration tips, please refer to [GLM-4.7 Usage](../../../docs/basic_usage/glm45). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +GLM-4.7-Flash supports Thinking mode by default. 
Enable the reasoning parser during deployment to separate the thinking and the content sections: + +```shell Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.7-Flash \ + --reasoning-parser glm45 \ + --attention-backend triton \ + --tp 1 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="zai-org/GLM-4.7-Flash", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +To solve this problem, I need to calculate 15% of 240. +Step 1: Convert 15% to decimal: 15% = 0.15 +Step 2: Multiply 240 by 0.15 +Step 3: 240 × 0.15 = 36 +=============== Content ================= + +The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36. 
+``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +GLM-4.7-Flash supports tool calling capabilities. Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.7-Flash \ + --reasoning-parser glm45 \ + --tool-call-parser glm47 \ + --attention-backend triton \ + --tp 1 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="zai-org/GLM-4.7-Flash", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls (tool call deltas may stream in multiple chunks) + if hasattr(delta, 'tool_calls') 
and delta.tool_calls: + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +if tool_calls_accumulator: + print("\n=============== Tool Calls =================", flush=True) + for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking for the weather in Beijing. I have the get_weather function available which can provide weather information for a location. The required parameter is "location" and the + user has provided "Beijing". There's an optional parameter "unit" for temperature unit, but the user hasn't specified which unit they prefer, and since it's optional, I should not ask about it or make up a value for it. I'll call the function with just the location parameter.I'll check the current weather in Beijing for you. 
+=============== Tool Calls ================= +Tool Call: get_weather + Arguments: {"location": "Beijing"} + +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="zai-org/GLM-4.7-Flash", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 (1x) +- Model: GLM-4.7-Flash +- Tensor Parallelism: 1 +- SGLang Version: 0.5.7 + +**Benchmark Methodology:** + +We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms. + +#### 5.1.1 Standard Test Scenarios + +Three core scenarios reflect real-world usage patterns: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Scenario | Input Length | Output Length | Use Case |
| --- | --- | --- | --- |
| **Chat** | 1K | 1K | Most common conversational AI workload |
| **Reasoning** | 1K | 8K | Long-form generation, complex reasoning tasks |
| **Summarization** | 8K | 1K | Document summarization, RAG retrieval |
+ +#### 5.1.2 Concurrency Levels + +Test each scenario at three concurrency levels to capture the throughput vs. latency tradeoff (Pareto frontier): + +- **Low Concurrency**: `--max-concurrency 1` (Latency-optimized) +- **Medium Concurrency**: `--max-concurrency 16` (Balanced) +- **High Concurrency**: `--max-concurrency 100` (Throughput-optimized) + +#### 5.1.3 Number of Prompts + +For each concurrency level, configure `num_prompts` to simulate realistic user loads: + +- **Quick Test**: `num_prompts = concurrency × 1` (minimal test) +- **Recommended**: `num_prompts = concurrency × 5` (standard benchmark) +- **Stable Measurements**: `num_prompts = concurrency × 10` (production-grade) + +--- + +#### 5.1.4 Benchmark Commands + +**Scenario 1: Chat (1K/1K) - Most Important** + +- **Model Deployment** + +```bash Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.7-Flash \ + --attention-backend triton \ + --tp 1 +``` + +- Low Concurrency (Latency-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 38.94 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.26 +Input token throughput (tok/s): 156.67 +Output token throughput (tok/s): 108.37 +Peak output token throughput (tok/s): 125.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 265.03 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3891.12 +Median E2E Latency (ms): 3061.48 +P90 E2E Latency (ms): 7172.25 +P99 E2E Latency (ms): 9042.62 
+---------------Time to First Token---------------- +Mean TTFT (ms): 131.36 +Median TTFT (ms): 94.55 +P99 TTFT (ms): 435.93 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 8.75 +Median TPOT (ms): 8.82 +P99 TPOT (ms): 9.39 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 8.93 +Median ITL (ms): 8.98 +P95 ITL (ms): 9.83 +P99 ITL (ms): 10.20 +Max ITL (ms): 18.50 +================================================== +``` + +- Medium Concurrency (Balanced) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 52.73 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40775 +Request throughput (req/s): 1.52 +Input token throughput (tok/s): 752.27 +Output token throughput (tok/s): 773.83 +Peak output token throughput (tok/s): 1040.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 1526.10 +Concurrency: 13.98 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 9217.90 +Median E2E Latency (ms): 9642.50 +P90 E2E Latency (ms): 15147.02 +P99 E2E Latency (ms): 18237.06 +---------------Time to First Token---------------- +Mean TTFT (ms): 299.02 +Median TTFT (ms): 105.98 +P99 TTFT (ms): 1109.29 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 18.03 +Median TPOT (ms): 18.00 +P99 TPOT (ms): 26.51 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 17.52 +Median ITL (ms): 16.07 +P95 ITL (ms): 18.14 +P99 ITL (ms): 89.43 +Max ITL (ms): 763.13 +================================================== +``` + +- High Concurrency (Throughput-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 91.48 +Total input tokens: 249831 +Total input text tokens: 249831 +Total generated tokens: 252662 +Total generated tokens (retokenized): 250941 +Request throughput (req/s): 5.47 +Input token throughput (tok/s): 2730.87 +Output token throughput (tok/s): 2761.82 +Peak output token throughput (tok/s): 4199.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 5492.69 +Concurrency: 90.54 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 16566.04 +Median E2E Latency (ms): 16134.36 +P90 E2E Latency (ms): 30167.60 +P99 E2E Latency (ms): 34034.04 +---------------Time to First Token---------------- +Mean TTFT (ms): 433.94 +Median TTFT (ms): 123.26 +P99 TTFT (ms): 1760.09 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 32.26 +Median TPOT (ms): 33.56 +P99 TPOT (ms): 38.78 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 31.99 +Median ITL (ms): 24.06 +P95 ITL (ms): 79.62 +P99 ITL (ms): 103.03 +Max ITL (ms): 1369.20 +================================================== +``` + + +**Scenario 2: Reasoning (1K/8K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 525.43 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 44462 +Total generated tokens (retokenized): 44451 +Request throughput (req/s): 0.02 +Input token throughput (tok/s): 11.61 +Output token throughput (tok/s): 84.62 +Peak output token throughput (tok/s): 125.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 96.23 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 52540.19 +Median E2E Latency (ms): 53694.45 +P90 E2E Latency (ms): 94742.08 +P99 E2E Latency (ms): 101224.18 +---------------Time to First Token---------------- +Mean TTFT (ms): 97.45 +Median TTFT (ms): 95.28 +P99 TTFT (ms): 105.64 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.94 +Median TPOT (ms): 11.25 +P99 TPOT (ms): 13.09 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 11.80 +Median ITL (ms): 11.51 +P95 ITL (ms): 15.83 +P99 ITL (ms): 16.86 +Max ITL (ms): 19.96 +================================================== +``` + +- Medium Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 473.92 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 318306 +Total generated tokens (retokenized): 317860 +Request throughput (req/s): 0.17 +Input token throughput (tok/s): 83.70 +Output token throughput (tok/s): 671.65 +Peak output token throughput (tok/s): 1040.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 755.35 +Concurrency: 13.80 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 81746.73 +Median E2E Latency (ms): 78508.54 +P90 E2E Latency (ms): 155292.49 +P99 E2E Latency (ms): 166769.99 +---------------Time to First Token---------------- +Mean TTFT (ms): 117.50 +Median TTFT (ms): 101.97 +P99 TTFT (ms): 182.88 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 20.36 +Median TPOT (ms): 20.48 +P99 TPOT (ms): 22.63 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 20.52 +Median ITL (ms): 20.42 +P95 ITL (ms): 23.41 +P99 ITL (ms): 26.29 +Max ITL (ms): 90.48 +================================================== +``` + +- High Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 714.72 +Total input tokens: 158939 +Total input text tokens: 158939 +Total generated tokens: 1301025 +Total generated tokens (retokenized): 1289431 +Request throughput (req/s): 0.45 +Input token throughput (tok/s): 222.38 +Output token throughput (tok/s): 1820.33 +Peak output token throughput (tok/s): 3200.00 +Peak concurrent requests: 68 +Total token throughput (tok/s): 2042.71 +Concurrency: 55.68 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 124364.58 +Median E2E Latency (ms): 129250.98 +P90 E2E Latency (ms): 219175.80 +P99 E2E Latency (ms): 247741.77 +---------------Time to First Token---------------- +Mean TTFT (ms): 149.40 +Median TTFT (ms): 114.78 +P99 TTFT (ms): 288.60 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 30.51 +Median TPOT (ms): 31.75 +P99 TPOT (ms): 33.32 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 30.56 +Median ITL (ms): 30.82 +P95 ITL (ms): 33.20 +P99 ITL (ms): 80.54 +Max ITL (ms): 117.72 +================================================== +``` + +**Scenario 3: Summarization (8K/1K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 58.27 +Total input tokens: 41941 +Total input text tokens: 41941 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.17 +Input token throughput (tok/s): 719.73 +Output token throughput (tok/s): 72.42 +Peak output token throughput (tok/s): 112.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 792.15 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 5825.08 +Median E2E Latency (ms): 4624.26 +P90 E2E Latency (ms): 12690.22 +P99 E2E Latency (ms): 13177.96 +---------------Time to First Token---------------- +Mean TTFT (ms): 296.01 +Median TTFT (ms): 195.59 +P99 TTFT (ms): 717.88 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 12.63 +Median TPOT (ms): 13.07 +P99 TPOT (ms): 16.68 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 13.13 +Median ITL (ms): 13.17 +P95 ITL (ms): 17.02 +P99 ITL (ms): 17.47 +Max ITL (ms): 19.84 +================================================== +``` + +- Medium Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 89.59 +Total input tokens: 300020 +Total input text tokens: 300020 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41656 +Request throughput (req/s): 0.89 +Input token throughput (tok/s): 3348.77 +Output token throughput (tok/s): 465.10 +Peak output token throughput (tok/s): 752.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 3813.87 +Concurrency: 14.39 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 16120.74 +Median E2E Latency (ms): 16246.55 +P90 E2E Latency (ms): 27279.72 +P99 E2E Latency (ms): 34577.93 +---------------Time to First Token---------------- +Mean TTFT (ms): 1943.94 +Median TTFT (ms): 382.19 +P99 TTFT (ms): 8980.41 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 27.87 +Median TPOT (ms): 28.26 +P99 TPOT (ms): 40.55 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 27.27 +Median ITL (ms): 21.74 +P95 ITL (ms): 23.32 +P99 ITL (ms): 232.65 +Max ITL (ms): 4282.01 +================================================== +``` + +- High Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7-Flash \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 167.01 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169226 +Request throughput (req/s): 1.92 +Input token throughput (tok/s): 7627.82 +Output token throughput (tok/s): 1017.93 +Peak output token throughput (tok/s): 1984.00 +Peak concurrent requests: 69 +Total token throughput (tok/s): 8645.75 +Concurrency: 59.68 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 31147.52 +Median E2E Latency (ms): 30603.34 +P90 E2E Latency (ms): 54889.44 +P99 E2E Latency (ms): 67665.30 +---------------Time to First Token---------------- +Mean TTFT (ms): 428.87 +Median TTFT (ms): 441.69 +P99 TTFT (ms): 1232.68 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 58.06 +Median TPOT (ms): 62.79 +P99 TPOT (ms): 82.23 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 57.93 +Median ITL (ms): 33.30 +P95 ITL (ms): 247.98 +P99 ITL (ms): 409.63 +Max ITL (ms): 1421.21 +================================================== +``` + +#### 5.1.5 Understanding the Results + +**Key Metrics:** + +- **Request Throughput (req/s)**: Number of requests processed per second +- **Output Token Throughput (tok/s)**: Total tokens generated per second +- **Mean TTFT (ms)**: Time to First Token - measures responsiveness +- **Mean TPOT (ms)**: Time Per Output Token - measures generation speed +- **Mean ITL (ms)**: Inter-Token Latency - measures streaming consistency + +**Why These Configurations Matter:** + +- **1K/1K (Chat)**: Represents the most common conversational AI workload. This is the highest priority scenario for most deployments. +- **1K/8K (Reasoning)**: Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations. +- **8K/1K (Summarization)**: Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks. +- **Variable Concurrency**: Captures the Pareto frontier - the optimal tradeoff between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput. 
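
The throughput metrics listed above can be cross-checked against the raw totals in a result block. The following is a minimal sketch, assuming each throughput figure is simply a total divided by the benchmark duration; the `BenchTotals` class and `throughput` function are hypothetical helpers written for this illustration and are not part of SGLang. The input values are copied from the Chat (1K/1K) low-concurrency output earlier in this section.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BenchTotals:
    """Raw totals copied by hand from one `bench_serving` result block."""
    successful_requests: int
    duration_s: float
    input_tokens: int
    generated_tokens: int


def throughput(t: BenchTotals) -> Dict[str, float]:
    """Recompute headline throughput numbers (assumed: total / duration)."""
    return {
        "request_throughput": t.successful_requests / t.duration_s,
        "input_token_throughput": t.input_tokens / t.duration_s,
        "output_token_throughput": t.generated_tokens / t.duration_s,
        "total_token_throughput": (t.input_tokens + t.generated_tokens) / t.duration_s,
    }


# Totals from the Chat (1K/1K) low-concurrency run in this section.
chat_low = BenchTotals(
    successful_requests=10,
    duration_s=38.94,
    input_tokens=6101,
    generated_tokens=4220,
)

for name, value in throughput(chat_low).items():
    print(f"{name}: {value:.2f}")
```

The recomputed values should land close to the reported ones (0.26 req/s, 156.67, 108.37, and 265.03 tok/s). Small differences in the last digit are expected because the printed duration is itself rounded to two decimals.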
+ +**Interpreting Results:** + +- Compare your results against baseline numbers for your hardware +- Higher throughput at same latency = better performance +- Lower TTFT = more responsive user experience +- Lower TPOT = faster generation speed + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command + +```bash Command +python -m sglang.test.few_shot_gsm8k \ + --num-questions 200 \ + --port 30000 +``` + +- Result + +```text Output +Accuracy: 0.845 +Invalid: 0.000 +Latency: 8.431 s +Output throughput: 2195.387 token/s +``` diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-4.7.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-4.7.mdx new file mode 100644 index 000000000000..98ad0084b75b --- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-4.7.mdx @@ -0,0 +1,546 @@ +--- +title: GLM-4.7 +metatags: + description: "Deploy GLM-4.7 with SGLang on AMD GPUs - state-of-the-art reasoning, enhanced coding, and robust tool calling capabilities." +--- + +## 1. Model Introduction + +[GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) is the latest and most powerful language model in the GLM series developed by Zhipu AI, featuring state-of-the-art capabilities in reasoning, function calling, and multi-modal understanding. 
+ +As the newest iteration in the GLM series, GLM-4.7 achieves significant improvements across all domains: + +- **Extended Context Window**: Expanded context window supporting even longer documents and complex multi-turn conversations +- **Enhanced Reasoning**: Improved reasoning capabilities with better chain-of-thought processing +- **Superior Coding**: Significantly improved code generation and understanding, with better real-world application performance +- **Advanced Tool Use**: More robust tool calling and agent capabilities for complex workflows +- **Optimized Performance**: Better throughput and latency characteristics across all hardware platforms + +For more details, please refer to the [official GLM-4.7 documentation](https://docs.z.ai/guides/llm/glm-4.7). + +**Key Features:** + +- **State-of-the-Art Reasoning**: Enhanced reasoning capabilities for the most complex problem-solving tasks +- **Multiple Quantizations**: BF16 and FP8 variants for different performance/memory trade-offs +- **Hardware Optimization**: Specifically tuned for AMD MI300X/MI325X/MI355X GPUs +- **High Performance**: Optimized for both throughput and latency scenarios + +**Available Models:** + +- **BF16 (Full precision)**: [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) - Recommended for MI300X/MI325X/MI355X +- **FP8 (8-bit quantized)**: [zai-org/GLM-4.7-FP8](https://huggingface.co/zai-org/GLM-4.7-FP8) - Recommended for MI300X/MI325X/MI355X + +**License:** + +Please refer to the [official GLM-4.7 model card](https://huggingface.co/zai-org/GLM-4.7) for license details. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. 
Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, deployment strategy, and thinking capabilities. + +import { GLM47Deployment } from "/src/snippets/autoregressive/glm-47-deployment.jsx"; + + + +### 3.2 Configuration Tips + +For more detailed configuration tips, please refer to [GLM-4.7 Usage](../../../docs/basic_usage/glm45). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +GLM-4.7 supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and the content sections: + +```shell Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.7 \ + --reasoning-parser glm47 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="zai-org/GLM-4.7", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking 
=================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +To solve this problem, I need to calculate 15% of 240. +Step 1: Convert 15% to decimal: 15% = 0.15 +Step 2: Multiply 240 by 0.15 +Step 3: 240 × 0.15 = 36 +=============== Content ================= + +The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36. +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +GLM-4.7 supports tool calling capabilities. 
Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.7 \ + --reasoning-parser glm47 \ + --tool-call-parser glm47 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="zai-org/GLM-4.7", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"Tool Call: {tool_call.function.name}") + print(f" Arguments: 
{tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. +I should call the function with location="Beijing". +=============== Content ================= + +Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="zai-org/GLM-4.7", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. 
+ +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: AMD MI300X (8x), AMD MI325X (8x), AMD MI355X (8x) +- Model: GLM-4.7 +- Tensor Parallelism: 8 +- SGLang Version: 0.5.6.post1 + +**Benchmark Methodology:** + +We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms. + +#### 5.1.1 Standard Test Scenarios + +Three core scenarios reflect real-world usage patterns: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Scenario | Input Length | Output Length | Use Case |
| --- | --- | --- | --- |
| **Chat** | 1K | 1K | Most common conversational AI workload |
| **Reasoning** | 1K | 8K | Long-form generation, complex reasoning tasks |
| **Summarization** | 8K | 1K | Document summarization, RAG retrieval |
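
As a small illustration of how the table maps onto the benchmark tool, the sketch below turns each scenario into the two length flags used by the `sglang.bench_serving` commands in this guide. Note that the commands pass `1000` and `8000` for "1K" and "8K", not 1024 and 8192. The `SCENARIO_LENGTHS` dictionary and `length_flags` helper are hypothetical names introduced here for illustration only.

```python
from typing import Dict, List, Tuple

# Scenario name -> (input length, output length) in tokens, matching the
# values passed to --random-input-len / --random-output-len in this guide.
SCENARIO_LENGTHS: Dict[str, Tuple[int, int]] = {
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}


def length_flags(scenario: str) -> List[str]:
    """Return the length-related bench_serving flags for one scenario."""
    input_len, output_len = SCENARIO_LENGTHS[scenario]
    return [
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
    ]


for name in SCENARIO_LENGTHS:
    print(f"{name}: {' '.join(length_flags(name))}")
```

Keeping the lengths in one place like this makes it harder for the input and output values to be swapped by accident when the same command is copied between scenarios.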
+ +#### 5.1.2 Concurrency Levels + +Test each scenario at three concurrency levels to capture the throughput vs. latency tradeoff (Pareto frontier): + +- **Low Concurrency**: `--max-concurrency 1` (Latency-optimized) +- **Medium Concurrency**: `--max-concurrency 16` (Balanced) +- **High Concurrency**: `--max-concurrency 100` (Throughput-optimized) + +#### 5.1.3 Number of Prompts + +For each concurrency level, configure `num_prompts` to simulate realistic user loads: + +- **Quick Test**: `num_prompts = concurrency × 1` (minimal test) +- **Recommended**: `num_prompts = concurrency × 5` (standard benchmark) +- **Stable Measurements**: `num_prompts = concurrency × 10` (production-grade) + +--- + +#### 5.1.4 Benchmark Commands + +**Scenario 1: Chat (1K/1K) - Most Important** + +- **Model Deployment** +```bash Command +python -m sglang.launch_server \ + --model zai-org/GLM-4.7 \ + --tp 8 +``` + + +- Low Concurrency (Latency-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Medium Concurrency (Balanced) +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- High Concurrency (Throughput-Optimized) +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +**Scenario 2: Reasoning (1K/8K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 
1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Medium Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- High Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +**Scenario 3: Summarization (8K/1K)** + +- Low Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Medium Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- High Concurrency +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-4.7 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +#### 5.1.5 Understanding the Results + +**Key Metrics:** + +- **Request Throughput (req/s)**: Number of requests processed per second +- **Output Token Throughput (tok/s)**: Total tokens generated per second +- **Mean TTFT (ms)**: Time to First Token - measures responsiveness +- **Mean TPOT (ms)**: Time Per Output Token - measures generation speed +- **Mean ITL (ms)**: Inter-Token Latency - measures streaming consistency + +**Why These Configurations 
Matter:** + +- **1K/1K (Chat)**: Represents the most common conversational AI workload. This is the highest priority scenario for most deployments. +- **1K/8K (Reasoning)**: Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations. +- **8K/1K (Summarization)**: Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks. +- **Variable Concurrency**: Captures the Pareto frontier - the optimal tradeoff between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput. + +**Interpreting Results:** + +- Compare your results against baseline numbers for your hardware +- Higher throughput at same latency = better performance +- Lower TTFT = more responsive user experience +- Lower TPOT = faster generation speed + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command +```bash Command +python -m sglang.test.few_shot_gsm8k \ + --num-questions 200 \ + --port 30000 +``` diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-5.1.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-5.1.mdx new file mode 100644 index 000000000000..cd40835615ac --- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-5.1.mdx @@ -0,0 +1,641 @@ +--- +title: GLM-5.1 +metatags: + description: "Deploy GLM-5.1 with SGLang on NVIDIA H100/H200/B200/GB300 and AMD MI300X/MI325X/MI355X." +tag: NEW +--- + +## 1. Model Introduction + +**Available Models:** + +- **BF16 (Full precision)**: [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) +- **FP8 (8-bit quantized)**: [zai-org/GLM-5.1-FP8](https://huggingface.co/zai-org/GLM-5.1-FP8) + +**License:** MIT + +## 2. SGLang Installation + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. 
Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and capabilities. SGLang supports serving GLM-5.1 on NVIDIA H100, H200, B200, GB300, and AMD MI300X/MI325X/MI355X GPUs. + +import { GLM51Deployment } from '/src/snippets/autoregressive/glm-51-deployment.jsx' + + + +### 3.2 Configuration Tips + +- Speculative decoding (MTP) can significantly reduce latency for interactive use cases. +- **DP Attention**: Enables data-parallel attention for higher throughput under high concurrency. DP attention trades low-concurrency latency for high-concurrency throughput, so disable it if your workload is latency-sensitive with few concurrent requests. +- The `--mem-fraction-static` flag is recommended for optimal memory utilization; adjust it based on your hardware and workload. +- On NVIDIA hardware, the BF16 model requires **2x the GPUs** of the FP8 model. + +
| Hardware | FP8 | BF16 |
| --- | --- | --- |
| H100 | tp=16 | tp=32 |
| H200 | tp=8 | tp=16 |
| B200 | tp=8 | tp=16 |
| GB300 | tp=4 | |
| MI300X/MI325X | tp=8 | tp=8 |
| MI355X | tp=8 | tp=8 |
+ +- **AMD GPUs**: Both BF16 and FP8 checkpoints are supported on MI300X/MI325X/MI355X at tp=8. Use `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang` for the NSA attention backend. Add `--chunked-prefill-size 131072` and `--watchdog-timeout 1200` (20 minutes for weight loading). FP8 uses approximately half the memory of BF16 (~89 GB/GPU vs ~175 GB/GPU). EAGLE speculative decoding is not currently supported on AMD for GLM-5.1. +- **GB300**: Only the FP8 checkpoint is recommended on GB300, with `tp=4`. For high-throughput DP attention on GB300, use `--dp 4`. +- For other configuration tips, please refer to [DeepSeek V3.2 documentation](../../../docs/basic_usage/deepseek_v32). GLM-5.1 and DeepSeek V3.2 share the same model structure, so the optimization techniques between these two models are also common (MTP, DSA kernel, Context Parallel...). +- Use `--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'` for GLM-5.1-FP8 if you want to enable the [IndexCache](https://github.com/THUDM/IndexCache) method. This feature is supported through [this PR](https://github.com/sgl-project/sglang/pull/21405) and introduces only a small accuracy loss. However, if you are running rigorous accuracy evaluations, it is not recommended to enable this feature. + +## 4. Model Invocation + +Deploy GLM-5.1 with the following command (FP8 on H200, all features enabled): + +```shell Command +SGLANG_ENABLE_SPEC_V2=1 sglang serve \ + --model-path zai-org/GLM-5.1-FP8 \ + --tp 8 \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --mem-fraction-static 0.85 \ + --host 0.0.0.0 \ + --port 30000 +``` + +### 4.1 MI300X/MI325X/MI355X (ROCm) Server Command + +The following ROCm commands are additional options for AMD GPUs and do not replace the NVIDIA instructions above. 
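
The AMD-specific flags listed in the configuration tips are the same for the FP8 and BF16 checkpoints, so they can be assembled in one place. The following is a minimal sketch, assuming a hypothetical helper `rocm_serve_args` (not part of SGLang) that only builds the argument list and does not launch anything:

```python
import shlex
from typing import List

# Flags quoted from the AMD configuration tip; nothing here is SGLang API.
ROCM_COMMON_FLAGS: List[str] = [
    "--nsa-prefill-backend", "tilelang",
    "--nsa-decode-backend", "tilelang",
    "--chunked-prefill-size", "131072",
    "--watchdog-timeout", "1200",  # 20 minutes, to allow for weight loading
]


def rocm_serve_args(model_path: str, tp: int = 8) -> List[str]:
    """Hypothetical helper: build an `sglang serve` argv for AMD GPUs."""
    return [
        "sglang", "serve",
        "--model-path", model_path,
        "--tp", str(tp),
        *ROCM_COMMON_FLAGS,
    ]


fp8_cmd = rocm_serve_args("zai-org/GLM-5.1-FP8")
bf16_cmd = rocm_serve_args("zai-org/GLM-5.1")

# Print shell-ready command lines for inspection.
print(shlex.join(fp8_cmd))
print(shlex.join(bf16_cmd))
```

The full commands in the subsections that follow add `--trust-remote-code`, `--mem-fraction-static`, host/port settings, and (for FP8) the tool-call and reasoning parsers on top of these shared flags.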
+ +#### FP8 (Recommended) + +```shell Command +sglang serve \ + --model-path zai-org/GLM-5.1-FP8 \ + --tp 8 \ + --trust-remote-code \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --nsa-prefill-backend tilelang \ + --nsa-decode-backend tilelang \ + --chunked-prefill-size 131072 \ + --mem-fraction-static 0.80 \ + --watchdog-timeout 1200 \ + --host 0.0.0.0 \ + --port 30000 +``` + +#### BF16 + +```shell Command +sglang serve \ + --model-path zai-org/GLM-5.1 \ + --tp 8 \ + --trust-remote-code \ + --nsa-prefill-backend tilelang \ + --nsa-decode-backend tilelang \ + --chunked-prefill-size 131072 \ + --mem-fraction-static 0.80 \ + --watchdog-timeout 1200 \ + --host 0.0.0.0 \ + --port 30000 +``` + +### 4.2 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.3 Advanced Usage + +#### 4.3.1 Reasoning Parser + +GLM-5.1 supports Thinking mode **by default**. Enable the reasoning parser during deployment to separate the thinking and content sections. The thinking process is returned via `reasoning_content` in the streaming response. + +To disable thinking and use Instruct mode, pass `chat_template_kwargs` at request time: + +- **Thinking mode** (default): The model performs step-by-step reasoning before answering. No extra parameters needed. +- **Instruct mode** (`{"enable_thinking": false}`): The model responds directly without a thinking process. + +**Example 1: Thinking Mode (Default)** + +Thinking mode is enabled by default. 
The model will reason step-by-step before answering, and the thinking process is returned via `reasoning_content`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Thinking mode is enabled by default, no extra parameters needed +response = client.chat.completions.create( + model="zai-org/GLM-5.1-FP8", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +1. **Understand the Goal:** The user wants to find 15% of 240, and they want the solution explained step-by-step. + +2. **Identify the Core Mathematical Concept:** "Percent" means "per hundred" or "out of 100". Finding "X% of Y" translates to the mathematical operation: $(X / 100) \times Y$. + +3. **Step-by-Step Breakdown:** + * *Step 1: Convert the percentage to a decimal (or fraction).* 15% means 15 out of 100, which is $15/100$ or $0.15$. + * *Step 2: Multiply the decimal by the given number.* Multiply $0.15$ by $240$. 
+ * *Step 3: Perform the calculation.* + * $0.15 \times 240$ + * I can break this down further to make it easy to follow: + * $0.10 \times 240 = 24$ (which is 10%) + * $0.05 \times 240 = 12$ (which is 5%, half of 10%) + * $24 + 12 = 36$ + * Alternatively, standard multiplication: + * $240 \times 15 = 3600$ + * Move decimal two places left -> $36$ + * *Step 4: State the final answer clearly.* + +4. **Draft the Response (incorporating the steps clearly):** + * *Introduction:* State the problem clearly. + * *Step 1:* Explain how to convert 15% to a decimal. + * *Step 2:* Explain the multiplication step. + * *Step 3:* Show the actual math (I'll provide the standard multiplication and the "mental math" trick as it adds value). + * *Conclusion:* Give the final answer. + +5. **Refine the Output (Self-Correction/Polishing during drafting):** + * *Drafting Step 1:* To find 15% of 240, first convert 15% into a decimal. Since percent means "per hundred," you divide 15 by 100. 15 ÷ 100 = 0.15. + * *Drafting Step 2:* Next, multiply this decimal by the number you are finding the percentage of (which is 240). So, calculate 0.15 × 240. + * *Drafting Step 3 (Standard way):* 0.15 × 240 = 36. + * *Adding the alternative mental math way:* It's often helpful to break it down into 10% and 5%. + * 10% of 240 = 24 (move the decimal point one place to the left) + * 5% is half of 10%, so half of 24 = 12 + * Add them together: 24 + 12 = 36. + * *Final Answer:* 15% of 240 is 36. + +6. **Final Review against User Prompt:** Does it solve the problem? Yes. Is it step-by-step? Yes. Is it clear? Yes. (Proceed to generate output). +=============== Content ================= +Here is the step-by-step solution to find 15% of 240: + +**Step 1: Convert the percentage to a decimal.** +To convert a percentage to a decimal, divide it by 100 (or simply move the decimal point two places to the left). 
+* 15% = 15 ÷ 100 = **0.15** + +**Step 2: Multiply the decimal by the number.** +Now, multiply the decimal (0.15) by the number you are finding the percentage of (240). +* 0.15 × 240 = **36** + +*(Alternative mental math method for Step 2)*: +If you don't want to multiply by 0.15 directly, you can break 15% down into 10% and 5%: +* **10% of 240** = 24 (just move the decimal point one place to the left) +* **5% of 240** = 12 (5% is half of 10%, so just divide 24 by 2) +* **Add them together**: 24 + 12 = **36** + +**Answer:** +15% of 240 is **36**. +``` + +**Example 2: Instruct Mode (Thinking Off)** + +To disable thinking and get a direct response, pass `{"enable_thinking": false}` via `chat_template_kwargs`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Disable thinking mode via chat_template_kwargs +response = client.chat.completions.create( + model="zai-org/GLM-5.1-FP8", + messages=[ + {"role": "user", "content": "What is 15% of 240?"} + ], + extra_body={"chat_template_kwargs": {"enable_thinking": False}}, + max_tokens=2048, + stream=True +) + +# In Instruct mode, the model responds directly without reasoning_content +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +15% of 240 is 36. + +Here is how to calculate it: +1. Convert the percentage to a decimal: 15% = 0.15 +2. Multiply the decimal by the number: 0.15 × 240 = 36 +``` + +#### 4.3.2 Tool Calling + +GLM-5.1 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`. 
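
With `stream=True`, an OpenAI-compatible endpoint typically delivers a tool call in pieces: the function name in an early delta and the JSON arguments as string fragments in later deltas. The following is a minimal sketch of reassembling them, assuming each fragment carries an `index` field as in the OpenAI streaming format. `merge_tool_call_deltas` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real `delta.tool_calls` entries so the snippet runs without a server:

```python
import json
from types import SimpleNamespace
from typing import Dict, Iterable


def merge_tool_call_deltas(fragments: Iterable) -> Dict[int, dict]:
    """Hypothetical helper: merge streamed tool-call fragments by index."""
    calls: Dict[int, dict] = {}
    for frag in fragments:
        call = calls.setdefault(frag.index, {"name": None, "arguments": ""})
        fn = frag.function
        if fn is None:
            continue
        if fn.name:  # the name usually appears only once
            call["name"] = fn.name
        if fn.arguments:  # arguments arrive as partial JSON text
            call["arguments"] += fn.arguments
    return calls


def _frag(name, arguments, index=0):
    # Stand-in for one entry of delta.tool_calls (assumed shape).
    return SimpleNamespace(
        index=index,
        function=SimpleNamespace(name=name, arguments=arguments),
    )


fragments = [
    _frag("get_weather", ""),
    _frag(None, "{"),
    _frag(None, '"location": "Be'),
    _frag(None, 'ijing"'),
    _frag(None, "}"),
]

merged = merge_tool_call_deltas(fragments)
# Parse only after the stream has finished; partial fragments are not valid JSON.
args = json.loads(merged[0]["arguments"])
print(merged[0]["name"], args)
```

In a real stream, the same helper would be fed the entries of `delta.tool_calls` collected across chunks.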
+ +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="zai-org/GLM-5.1-FP8", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user wants to know the weather in 
Beijing. I'll call the get_weather function with "Beijing" as the location. +=============== Content ================= +Tool Call: get_weather + Arguments: +Tool Call: None + Arguments: { +Tool Call: None + Arguments: "location": "Be +Tool Call: None + Arguments: ijing" +Tool Call: None + Arguments: } +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: H200 (8x) +- Model: GLM-5.1-FP8 +- Tensor Parallelism: 8 +- SGLang Version: commit 947927bdb + +#### 5.1.1 Latency Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-5.1-FP8 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 35.78 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4213 +Request throughput (req/s): 0.28 +Input token throughput (tok/s): 170.54 +Output token throughput (tok/s): 117.96 +Peak output token throughput (tok/s): 148.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 288.50 +Concurrency: 1.00 +Accept length: 3.48 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3576.31 +Median E2E Latency (ms): 2935.97 +P90 E2E Latency (ms): 5908.97 +P99 E2E Latency (ms): 8588.08 +---------------Time to First Token---------------- +Mean TTFT (ms): 290.88 +Median TTFT (ms): 282.34 +P99 TTFT (ms): 332.27 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 7.54 +Median TPOT (ms): 6.97 +P99 TPOT (ms): 9.04 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 7.80 +Median ITL (ms): 6.81 +P95 ITL (ms): 13.51 +P99 ITL (ms): 26.99 +Max ITL (ms): 29.50 +================================================== +``` + +#### 5.1.2 Throughput Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-5.1-FP8 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 1000 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 411.74 +Total input tokens: 502493 +Total input text tokens: 502493 +Total generated tokens: 500251 +Total generated tokens (retokenized): 499614 +Request throughput (req/s): 2.43 +Input token throughput (tok/s): 1220.41 +Output token throughput (tok/s): 1214.97 +Peak output token throughput (tok/s): 2648.00 +Peak concurrent requests: 105 +Total token throughput (tok/s): 2435.38 +Concurrency: 96.30 +Accept length: 3.50 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 39648.76 +Median E2E Latency (ms): 39058.12 +P90 E2E Latency (ms): 57009.82 +P99 E2E Latency (ms): 68880.33 +---------------Time to First Token---------------- +Mean TTFT (ms): 20613.80 +Median TTFT (ms): 21429.21 +P99 TTFT (ms): 29543.17 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 38.73 +Median TPOT (ms): 36.52 +P99 TPOT (ms): 67.09 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 38.13 +Median ITL (ms): 16.57 +P95 ITL (ms): 86.01 +P99 ITL (ms): 164.88 +Max ITL (ms): 1307.02 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --port 30000 +``` + +- Test Result +```text Output +Accuracy: 0.955 +Invalid: 0.000 +Latency: 32.470 s +Output throughput: 642.044 token/s +``` + +#### 5.2.2 MMLU Benchmark + +- Benchmark Command +```bash Command +python3 benchmark/mmlu/bench_sglang.py --port 30000 +``` + +- Test Result +```text Output +subject: abstract_algebra, #q:100, acc: 0.860 +subject: anatomy, #q:135, acc: 0.874 +subject: astronomy, #q:152, acc: 0.941 +subject: business_ethics, #q:100, acc: 0.880 +subject: clinical_knowledge, #q:265, acc: 0.932 +subject: college_biology, #q:144, acc: 0.972 +subject: college_chemistry, #q:100, acc: 0.640 +subject: college_computer_science, #q:100, acc: 0.900 +subject: college_mathematics, #q:100, acc: 0.810 +subject: college_medicine, #q:173, acc: 0.873 +subject: college_physics, #q:102, acc: 0.912 +subject: computer_security, #q:100, acc: 0.880 +subject: conceptual_physics, #q:235, acc: 0.928 +subject: econometrics, #q:114, acc: 0.807 +subject: electrical_engineering, #q:145, acc: 0.897 +subject: elementary_mathematics, #q:378, acc: 0.937 +subject: formal_logic, #q:126, acc: 0.778 +subject: global_facts, #q:100, acc: 0.710 +subject: high_school_biology, #q:310, acc: 0.961 +subject: high_school_chemistry, #q:203, acc: 0.847 +subject: high_school_computer_science, #q:100, acc: 0.960 +subject: high_school_european_history, #q:165, acc: 0.891 +subject: high_school_geography, #q:198, acc: 0.960 +subject: high_school_government_and_politics, #q:193, acc: 0.984 +subject: high_school_macroeconomics, #q:390, acc: 0.923 
+subject: high_school_mathematics, #q:270, acc: 0.696 +subject: high_school_microeconomics, #q:238, acc: 0.962 +subject: high_school_physics, #q:151, acc: 0.821 +subject: high_school_psychology, #q:545, acc: 0.956 +subject: high_school_statistics, #q:216, acc: 0.889 +subject: high_school_us_history, #q:204, acc: 0.941 +subject: high_school_world_history, #q:237, acc: 0.945 +subject: human_aging, #q:223, acc: 0.857 +subject: human_sexuality, #q:131, acc: 0.908 +subject: international_law, #q:121, acc: 0.934 +subject: jurisprudence, #q:108, acc: 0.907 +subject: logical_fallacies, #q:163, acc: 0.933 +subject: machine_learning, #q:112, acc: 0.830 +subject: management, #q:103, acc: 0.942 +subject: marketing, #q:234, acc: 0.940 +subject: medical_genetics, #q:100, acc: 0.990 +subject: miscellaneous, #q:783, acc: 0.959 +subject: moral_disputes, #q:346, acc: 0.873 +subject: moral_scenarios, #q:895, acc: 0.837 +subject: nutrition, #q:306, acc: 0.922 +subject: philosophy, #q:311, acc: 0.897 +subject: prehistory, #q:324, acc: 0.929 +subject: professional_accounting, #q:282, acc: 0.844 +subject: professional_law, #q:1534, acc: 0.714 +subject: professional_medicine, #q:272, acc: 0.941 +subject: professional_psychology, #q:612, acc: 0.913 +subject: public_relations, #q:110, acc: 0.791 +subject: security_studies, #q:245, acc: 0.878 +subject: sociology, #q:201, acc: 0.940 +subject: us_foreign_policy, #q:100, acc: 0.920 +subject: virology, #q:166, acc: 0.596 +subject: world_religions, #q:171, acc: 0.936 +Total latency: 165.275 +Average accuracy: 0.877 +``` + +### 5.3 AMD GPU Benchmarks + +#### 5.3.1 GSM8K Benchmark (MI325/MI35x) + +- MI325/MI35x Test (GLM-5.1 BF16, `tp=8`, TileLang NSA backends) + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 +``` + +```text Output +Accuracy: 0.970 +Invalid: 0.000 +``` + +Results from [AMD nightly CI](https://github.com/sgl-project/sglang/actions/runs/22556197510/attempts/2#summary-65346783629). 
See also [sglang#18911](https://github.com/sgl-project/sglang/pull/18911). diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-5.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-5.mdx new file mode 100644 index 000000000000..406e64aa10c4 --- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-5.mdx @@ -0,0 +1,666 @@ +--- +title: GLM-5 +metatags: + description: "Deploy GLM-5 with SGLang on NVIDIA H100/H200/B200 and AMD MI300X/MI325X/MI355X — state-of-the-art reasoning, enhanced coding, and robust tool calling capabilities." +--- + +## 1. Model Introduction + +[GLM-5](https://huggingface.co/zai-org/GLM-5) is the most powerful language model in the GLM series developed by Zhipu AI, targeting complex systems engineering and long-horizon agentic tasks. Scaling from GLM-4.5's 355B parameters (32B active) to 744B parameters (40B active), GLM-5 integrates DeepSeek Sparse Attention (DSA) to largely reduce deployment cost while preserving long-context capacity. + +With advances in both pre-training (28.5T tokens) and post-training via [slime](https://github.com/THUDM/slime) (a novel asynchronous RL infrastructure), GLM-5 delivers significant improvements over GLM-4.7 and achieves best-in-class performance among open-source models on reasoning, coding, and agentic tasks. 
+ +**Key Features:** + +- **Systems Engineering & Agentic Tasks**: Purpose-built for complex systems engineering and long-horizon agentic tasks +- **State-of-the-Art Performance**: Best-in-class among open-source models on reasoning (HLE, AIME, GPQA), coding (SWE-bench, Terminal-Bench), and agentic tasks (BrowseComp, Vending Bench 2) +- **DeepSeek Sparse Attention (DSA)**: Reduces deployment cost while preserving long-context capacity +- **Multiple Quantizations**: BF16 and FP8 variants for different performance/memory trade-offs +- **Speculative Decoding**: EAGLE-based speculative decoding support for lower latency + +**Available Models:** + +- **BF16 (Full precision)**: [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) +- **FP8 (8-bit quantized)**: [zai-org/GLM-5-FP8](https://huggingface.co/zai-org/GLM-5-FP8) + +**License:** MIT + +## 2. SGLang Installation + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and capabilities. SGLang supports serving GLM-5 on NVIDIA H100, H200, B200, and AMD MI300X/MI325X/MI355X GPUs. + +import { GLM5Deployment } from '/src/snippets/autoregressive/glm-5-deployment.jsx' + + + +### 3.2 Configuration Tips + +- Speculative decoding (MTP) can significantly reduce latency for interactive use cases. +- **DP Attention**: Enables data parallel attention for higher throughput under high concurrency. Note that DP attention trades off low-concurrency latency for high-concurrency throughput — disable it if your workload is latency-sensitive with few concurrent requests. 
+- The `--mem-fraction-static` flag is recommended for optimal memory utilization; adjust it based on your hardware and workload. +- On NVIDIA hardware, the BF16 model requires **2x the GPUs** of the FP8 model. + +
HardwareFP8BF16
H100tp=16tp=32
H200tp=8tp=16
B200tp=8tp=16
MI300X/MI325Xtp=8
MI355Xtp=8
+ +- **B200 (FP8)**: Use `--ep 1 --attention-backend nsa --nsa-decode-backend trtllm --nsa-prefill-backend trtllm --moe-runner-backend flashinfer_trtllm --enable-flashinfer-allreduce-fusion` for optimized NSA and MoE backends on Blackwell. Also add `--quantization fp8` for FP8 weight quantization. + +- **AMD GPUs**: Use `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang` for the NSA attention backend. Add `--chunked-prefill-size 131072` and `--watchdog-timeout 1200` (20 minutes for weight loading). EAGLE speculative decoding is not currently supported on AMD for GLM-5. +- For other configuration tips, please refer to [DeepSeek V3.2 documentation](../../../docs/basic_usage/deepseek_v32). GLM-5 and DeepSeek V3.2 share the same model structure, so the optimization techniques between these two models are also common (MTP, DSA kernel, Context Parallel...). +- Use `--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'` for GLM-5-FP8 if you want to enable the [IndexCache](https://github.com/THUDM/IndexCache) method. This feature is supported through [this PR](https://github.com/sgl-project/sglang/pull/21405) and introduces only a small accuracy loss. However, if you are running rigorous accuracy evaluations, it is not recommended to enable this feature. + + +**FP8 KV Cache**: `--kv-cache-dtype fp8_e4m3` quantizes the KV cache to FP8 at runtime. Since these FP8 model checkpoints do not include pre-calibrated KV cache scaling factors, SGLang defaults to a scale of 1.0, which may cause noticeable accuracy degradation on reasoning-heavy tasks. It is not included in the generated commands above; add it manually only if memory constraints require the trade-off. + + +## 4. 
Model Invocation + +Deploy GLM-5 with the following command (FP8 on H200, all features enabled): + +```shell Command +sglang serve \ + --model-path zai-org/GLM-5-FP8 \ + --tp 8 \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --mem-fraction-static 0.85 \ + --host 0.0.0.0 \ + --port 30000 +``` + +### 4.1 MI300X/MI325X/MI355X (ROCm) Server Command + +The following ROCm command is an additional option for AMD GPUs and does not replace the NVIDIA instructions above. + +```shell Command +sglang serve \ + --model-path zai-org/GLM-5 \ + --tp 8 \ + --trust-remote-code \ + --nsa-prefill-backend tilelang \ + --nsa-decode-backend tilelang \ + --chunked-prefill-size 131072 \ + --mem-fraction-static 0.80 \ + --watchdog-timeout 1200 \ + --host 0.0.0.0 \ + --port 30000 +``` + +### 4.2 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.3 Advanced Usage + +#### 4.3.1 Reasoning Parser + +GLM-5 supports Thinking mode **by default**. Enable the reasoning parser during deployment to separate the thinking and content sections. The thinking process is returned via `reasoning_content` in the streaming response. + +To disable thinking and use Instruct mode, pass `chat_template_kwargs` at request time: + +- **Thinking mode** (default): The model performs step-by-step reasoning before answering. No extra parameters needed. +- **Instruct mode** (`{"enable_thinking": false}`): The model responds directly without a thinking process. + +**Example 1: Thinking Mode (Default)** + +Thinking mode is enabled by default. 
The model will reason step-by-step before answering, and the thinking process is returned via `reasoning_content`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Thinking mode is enabled by default, no extra parameters needed +response = client.chat.completions.create( + model="zai-org/GLM-5-FP8", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user wants me to solve a math problem: "What is 15% of 240?". + +Step 1: Understand the problem. I need to calculate a percentage of a number. +Formula: Percentage × Number = Result. + +Step 2: Convert the percentage to a decimal or fraction. +15% = 15/100 or 0.15. + +Step 3: Perform the multiplication. +Method A: Decimal multiplication. +0.15 × 240. +Break it down: +10% of 240 = 24. +5% is half of 10%, so 12. +15% = 10% + 5% = 24 + 12 = 36. + +Method B: Fraction multiplication. +15/100 × 240. +Simplify 240/100 = 2.4. +15 × 2.4. +10 × 2.4 = 24. +5 × 2.4 = 12. +24 + 12 = 36. 
+ +Method C: Direct multiplication. +240 × 0.15. +240 × 0.10 = 24. +240 × 0.05 = 12. +24 + 12 = 36. + +Step 4: Final Verification. +Is 36 reasonable? +10% is 24. 20% is 48. +15% is halfway between 10% and 20%. +Halfway between 24 and 48 is 36. +The result is correct. + +Step 5: Structure the final response. I will present the calculation clearly, perhaps showing the fractional or decimal method, or the mental math shortcut (10% + 5%). +=============== Content ================= +Here is the step-by-step solution: + +**Step 1: Convert the percentage to a decimal.** +To convert 15% to a decimal, divide by 100. +$$15\% = \frac{15}{100} = 0.15$$ + +**Step 2: Multiply the decimal by the number.** +Now, multiply 0.15 by 240. +$$0.15 \times 240$$ + +**Step 3: Perform the calculation.** +You can break this down to make it easier: +$$0.15 = 0.10 + 0.05$$ + +* First, find 10% of 240: + $$0.10 \times 240 = 24$$ +* Next, find 5% (which is half of 10%): + $$\frac{24}{2} = 12$$ +* Add the two results together: + $$24 + 12 = 36$$ + +**Answer:** +15% of 240 is **36**. 
+``` + +**Example 2: Instruct Mode (Thinking Off)** + +To disable thinking and get a direct response, pass `{"enable_thinking": false}` via `chat_template_kwargs`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Disable thinking mode via chat_template_kwargs +response = client.chat.completions.create( + model="zai-org/GLM-5-FP8", + messages=[ + {"role": "user", "content": "What is 15% of 240?"} + ], + extra_body={"chat_template_kwargs": {"enable_thinking": False}}, + max_tokens=2048, + stream=True +) + +# In Instruct mode, the model responds directly without reasoning_content +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +To find **15% of 240**, follow these steps: + +### Step 1: Convert the Percentage to a Decimal +First, convert the percentage to a decimal by dividing by 100. + +\[ +15\% = \frac{15}{100} = 0.15 +\] + +### Step 2: Multiply by the Number +Next, multiply the decimal by the number you want to find the percentage of. + +\[ +0.15 \times 240 +\] + +### Step 3: Perform the Multiplication +Calculate the multiplication: + +\[ +0.15 \times 240 = 36 +\] + +### Final Answer +\[ +\boxed{36} +\] +``` + +#### 4.3.2 Tool Calling + +GLM-5 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`. 
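When streaming, tool-call arguments usually arrive as partial JSON fragments spread over several chunks, and the function name is typically present only on the first fragment. A minimal sketch of merging those fragments by `index` follows. `accumulate_tool_calls`, `fragment`, and the `SimpleNamespace` stand-ins for stream deltas are hypothetical names used for illustration; they are not part of SGLang or the OpenAI client.

```python
import json
from types import SimpleNamespace


def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls = {}
    for delta in deltas:
        # A delta may carry no tool calls at all (for example, plain content).
        for tool_call in getattr(delta, "tool_calls", None) or []:
            entry = calls.setdefault(tool_call.index, {"name": None, "arguments": ""})
            function = tool_call.function
            if function is None:
                continue
            if function.name:
                entry["name"] = function.name
            if function.arguments:
                entry["arguments"] += function.arguments
    return [calls[i] for i in sorted(calls)]


def fragment(index, name, arguments):
    """Hypothetical stand-in for one streamed delta carrying a tool-call fragment."""
    function = SimpleNamespace(name=name, arguments=arguments)
    return SimpleNamespace(tool_calls=[SimpleNamespace(index=index, function=function)])


# Fragments shaped like a streamed response: the name arrives once, the JSON in pieces.
stream = [
    fragment(0, "get_weather", ""),
    fragment(0, None, "{"),
    fragment(0, None, '"location": "Be'),
    fragment(0, None, 'ijing"'),
    fragment(0, None, "}"),
]

merged = accumulate_tool_calls(stream)
parsed_arguments = json.loads(merged[0]["arguments"])
print(merged[0]["name"], parsed_arguments)
```

With a real stream, the stand-in objects would be replaced by `chunk.choices[0].delta` for each chunk that has choices.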
+ +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="zai-org/GLM-5-FP8", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking for the weather in Beijing. 
I have access to a get_weather function that can provide current weather information. Let me check what parameters are required: + +- location: required, should be "Beijing" +- unit: optional (not in required array), can be "celsius" or "fahrenheit" + +Since the user didn't specify a unit preference and it's optional, I should not ask about it or make up a value. I'll just call the function with the required location parameter.I'll get the current weather in Beijing for you. +=============== Content ================= +Tool Call: get_weather + Arguments: +Tool Call: None + Arguments: { +Tool Call: None + Arguments: "location": "Be +Tool Call: None + Arguments: ijing" +Tool Call: None + Arguments: } +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: H200 (8x) +- Model: GLM-5-FP8 +- Tensor Parallelism: 8 +- SGLang Version: commit 947927bdb + +#### 5.1.1 Latency Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-5-FP8 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 35.78 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4213 +Request throughput (req/s): 0.28 +Input token throughput (tok/s): 170.54 +Output token throughput (tok/s): 117.96 +Peak output token throughput (tok/s): 148.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 288.50 +Concurrency: 1.00 +Accept length: 3.48 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3576.31 +Median E2E Latency (ms): 2935.97 +P90 E2E Latency (ms): 5908.97 +P99 E2E Latency (ms): 8588.08 +---------------Time to First 
Token---------------- +Mean TTFT (ms): 290.88 +Median TTFT (ms): 282.34 +P99 TTFT (ms): 332.27 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 7.54 +Median TPOT (ms): 6.97 +P99 TPOT (ms): 9.04 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 7.80 +Median ITL (ms): 6.81 +P95 ITL (ms): 13.51 +P99 ITL (ms): 26.99 +Max ITL (ms): 29.50 +================================================== +``` + +#### 5.1.2 Throughput Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/GLM-5-FP8 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 1000 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 411.74 +Total input tokens: 502493 +Total input text tokens: 502493 +Total generated tokens: 500251 +Total generated tokens (retokenized): 499614 +Request throughput (req/s): 2.43 +Input token throughput (tok/s): 1220.41 +Output token throughput (tok/s): 1214.97 +Peak output token throughput (tok/s): 2648.00 +Peak concurrent requests: 105 +Total token throughput (tok/s): 2435.38 +Concurrency: 96.30 +Accept length: 3.50 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 39648.76 +Median E2E Latency (ms): 39058.12 +P90 E2E Latency (ms): 57009.82 +P99 E2E Latency (ms): 68880.33 +---------------Time to First Token---------------- +Mean TTFT (ms): 20613.80 +Median TTFT (ms): 21429.21 +P99 TTFT (ms): 29543.17 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 38.73 +Median TPOT (ms): 36.52 +P99 TPOT (ms): 67.09 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 38.13 +Median ITL (ms): 16.57 +P95 ITL (ms): 86.01 +P99 ITL (ms): 164.88 +Max ITL (ms): 1307.02 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --port 30000 +``` + +- Test Result +```text Output +Accuracy: 0.955 +Invalid: 0.000 +Latency: 32.470 s +Output throughput: 642.044 token/s +``` + +#### 5.2.2 MMLU Benchmark + +- Benchmark Command +```bash Command +python3 benchmark/mmlu/bench_sglang.py --port 30000 +``` + +- Test Result +```text Output +subject: abstract_algebra, #q:100, acc: 0.860 +subject: anatomy, #q:135, acc: 0.874 +subject: astronomy, #q:152, acc: 0.941 +subject: business_ethics, #q:100, acc: 0.880 +subject: clinical_knowledge, #q:265, acc: 0.932 +subject: college_biology, #q:144, acc: 0.972 +subject: college_chemistry, #q:100, acc: 0.640 +subject: college_computer_science, #q:100, acc: 0.900 +subject: college_mathematics, #q:100, acc: 0.810 +subject: college_medicine, #q:173, acc: 0.873 +subject: college_physics, #q:102, acc: 0.912 +subject: computer_security, #q:100, acc: 0.880 +subject: conceptual_physics, #q:235, acc: 0.928 +subject: econometrics, #q:114, acc: 0.807 +subject: electrical_engineering, #q:145, acc: 0.897 +subject: elementary_mathematics, #q:378, acc: 0.937 +subject: formal_logic, #q:126, acc: 0.778 +subject: global_facts, #q:100, acc: 0.710 +subject: high_school_biology, #q:310, acc: 0.961 +subject: high_school_chemistry, #q:203, acc: 0.847 +subject: high_school_computer_science, #q:100, acc: 0.960 +subject: high_school_european_history, #q:165, acc: 0.891 +subject: high_school_geography, #q:198, acc: 0.960 +subject: high_school_government_and_politics, #q:193, acc: 0.984 +subject: high_school_macroeconomics, #q:390, acc: 0.923 
+subject: high_school_mathematics, #q:270, acc: 0.696 +subject: high_school_microeconomics, #q:238, acc: 0.962 +subject: high_school_physics, #q:151, acc: 0.821 +subject: high_school_psychology, #q:545, acc: 0.956 +subject: high_school_statistics, #q:216, acc: 0.889 +subject: high_school_us_history, #q:204, acc: 0.941 +subject: high_school_world_history, #q:237, acc: 0.945 +subject: human_aging, #q:223, acc: 0.857 +subject: human_sexuality, #q:131, acc: 0.908 +subject: international_law, #q:121, acc: 0.934 +subject: jurisprudence, #q:108, acc: 0.907 +subject: logical_fallacies, #q:163, acc: 0.933 +subject: machine_learning, #q:112, acc: 0.830 +subject: management, #q:103, acc: 0.942 +subject: marketing, #q:234, acc: 0.940 +subject: medical_genetics, #q:100, acc: 0.990 +subject: miscellaneous, #q:783, acc: 0.959 +subject: moral_disputes, #q:346, acc: 0.873 +subject: moral_scenarios, #q:895, acc: 0.837 +subject: nutrition, #q:306, acc: 0.922 +subject: philosophy, #q:311, acc: 0.897 +subject: prehistory, #q:324, acc: 0.929 +subject: professional_accounting, #q:282, acc: 0.844 +subject: professional_law, #q:1534, acc: 0.714 +subject: professional_medicine, #q:272, acc: 0.941 +subject: professional_psychology, #q:612, acc: 0.913 +subject: public_relations, #q:110, acc: 0.791 +subject: security_studies, #q:245, acc: 0.878 +subject: sociology, #q:201, acc: 0.940 +subject: us_foreign_policy, #q:100, acc: 0.920 +subject: virology, #q:166, acc: 0.596 +subject: world_religions, #q:171, acc: 0.936 +Total latency: 165.275 +Average accuracy: 0.877 +``` + +### 5.3 AMD GPU Benchmarks + +#### 5.3.1 GSM8K Benchmark (MI325/MI35x) + +- MI325/MI35x Test (GLM-5 BF16, `tp=8`, TileLang NSA backends) + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 +``` + +```text Output +Accuracy: 0.970 +Invalid: 0.000 +``` + +Results from [AMD nightly CI](https://github.com/sgl-project/sglang/actions/runs/22556197510/attempts/2#summary-65346783629). 
See also [sglang#18911](https://github.com/sgl-project/sglang/pull/18911). diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-Glyph.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-Glyph.mdx new file mode 100644 index 000000000000..050721fdc6c2 --- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-Glyph.mdx @@ -0,0 +1,829 @@ +--- +title: GLM Glyph +metatags: + description: "Deploy GLM-Glyph with SGLang - community contribution guide for Zhipu AI's GLM Glyph model deployment." +--- + +import { GLMGlyphDeployment } from '/src/snippets/autoregressive/glm-glyph-deployment.jsx'; + +## 1. Model Introduction + +[Glyph](https://huggingface.co/zai-org/Glyph) is a powerful language model developed by Zhipu AI, featuring advanced capabilities in reasoning, function calling, and multi-modal understanding. + +**Hardware Support:** NVIDIA B200/H100/H200, AMD MI300X/MI325X/MI355X + +**Key Features:** + +- **Advanced Reasoning**: Built-in reasoning capabilities for complex problem-solving +- **Multiple Quantizations**: BF16 and FP8 variants for different performance/memory trade-offs +- **High Performance**: Optimized for both throughput and latency scenarios + +**Available Models:** + +- **BF16 (Full precision)**: [zai-org/Glyph](https://huggingface.co/zai-org/Glyph) +- **FP8 (8-bit quantized)**: [zai-org/Glyph-FP8](https://huggingface.co/zai-org/Glyph-FP8) + +**License:** + +Please refer to the [official Glyph model card](https://huggingface.co/zai-org/Glyph) for license details. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. 
+ +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and other options. + + + +### 3.2 Configuration Tips + +For more detailed configuration tips, please refer to [GLM-4.5/GLM-4.6 Usage](../../../docs/basic_usage/glm45). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Thinking Mode + +Glyph supports thinking mode for enhanced reasoning. Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +python -m sglang.launch_server \ + --model-path zai-org/Glyph \ + --reasoning-parser glm45 \ + --tp 4 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="zai-org/Glyph", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + 
print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +**Disable Thinking Mode:** + +To disable thinking mode for a specific request: + +```python Example +response = client.chat.completions.create( + model="zai-org/Glyph", + messages=[{"role": "user", "content": "What is the capital of France?"}], + extra_body={"chat_template_kwargs": {"enable_thinking": False}} +) +``` + +#### 4.2.2 Tool Calling + +Glyph supports tool calling capabilities. Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model-path zai-org/Glyph \ + --reasoning-parser glm45 \ + --tool-call-parser glm45 \ + --tp 4 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="zai-org/Glyph", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking 
process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. +I should call the function with location="Beijing". 
+=============== Content ================= + +Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="zai-org/Glyph", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +This section reports serving speed for Glyph under fixed, repeatable `sglang.bench_serving` configurations. + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Model: Glyph +- SGLang Version: 0.5.6.post1 + +**Benchmark Methodology:** + +Each scenario uses the `random` dataset with fixed input/output lengths (1000/1000 for standard, 1000/8000 for reasoning, 8000/1000 for summarization) and is run at low, medium, and high concurrency with `--request-rate inf`, so results can be compared across frameworks and hardware platforms.
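The speed benchmark runs in this section differ only in input/output length and concurrency, so their command lines can be generated from a small table. The sketch below only builds the argument lists and prints them; it does not launch anything. The `SCENARIOS` table and the `build_command` helper are hypothetical names, while the flags and values are the ones used in this section's commands.

```python
import shlex

MODEL = "zai-org/Glyph"

# (scenario, random input length, random output length, [(num_prompts, max_concurrency), ...])
SCENARIOS = [
    ("standard", 1000, 1000, [(10, 1), (80, 16), (500, 100)]),
    ("reasoning", 1000, 8000, [(10, 1), (80, 16), (320, 64)]),
    ("summarization", 8000, 1000, [(10, 1), (80, 16), (320, 64)]),
]


def build_command(input_len, output_len, num_prompts, max_concurrency):
    """Return the bench_serving argument list for one benchmark run."""
    return [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--model", MODEL,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
        "--request-rate", "inf",
    ]


# Key each command by (scenario, max_concurrency) so a single run is easy to look up.
commands = {}
for name, input_len, output_len, levels in SCENARIOS:
    for num_prompts, max_concurrency in levels:
        commands[(name, max_concurrency)] = build_command(
            input_len, output_len, num_prompts, max_concurrency
        )

for key, argv in commands.items():
    print(key, shlex.join(argv))
```

Once the server is up, each argument list could be passed to `subprocess.run` to execute the corresponding run.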
+ +#### 5.1.1 Standard Scenario Benchmark + +- **Model Deployment** +```bash Command +python -m sglang.launch_server \ + --model zai-org/Glyph \ + --tp 2 +``` + +##### 5.1.1.1 Low Concurrency +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 17.03 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.59 +Input token throughput (tok/s): 358.17 +Output token throughput (tok/s): 247.74 +Peak output token throughput (tok/s): 251.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 605.91 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1702.14 +Median E2E Latency (ms): 1361.72 +---------------Time to First Token---------------- +Mean TTFT (ms): 22.35 +Median TTFT (ms): 22.61 +P99 TTFT (ms): 23.76 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 3.99 +Median TPOT (ms): 3.99 +P99 TPOT (ms): 4.01 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 3.99 +Median ITL (ms): 3.99 +P95 ITL (ms): 4.03 +P99 ITL (ms): 4.12 +Max ITL (ms): 7.46 +================================================== +``` + +##### 5.1.1.2 Medium Concurrency +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 16.27 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40804 +Request throughput (req/s): 4.92 +Input token throughput (tok/s): 2438.06 +Output token throughput (tok/s): 2507.94 +Peak output token throughput (tok/s): 3069.00 +Peak concurrent requests: 26 +Total token throughput (tok/s): 4946.00 +Concurrency: 13.44 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2733.43 +Median E2E Latency (ms): 2892.98 +---------------Time to First Token---------------- +Mean TTFT (ms): 33.10 +Median TTFT (ms): 27.73 +P99 TTFT (ms): 49.34 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 5.33 +Median TPOT (ms): 5.39 +P99 TPOT (ms): 5.86 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 5.30 +Median ITL (ms): 4.89 +P95 ITL (ms): 5.54 +P99 ITL (ms): 21.17 +Max ITL (ms): 25.14 +================================================== +``` + +##### 5.1.1.3 High Concurrency +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 25.67 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 252657 +Request throughput (req/s): 19.48 +Input token throughput (tok/s): 9733.69 +Output token throughput (tok/s): 9843.99 +Peak output token throughput (tok/s): 13398.00 +Peak concurrent requests: 127 +Total token throughput (tok/s): 19577.68 +Concurrency: 89.49 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4593.75 +Median E2E Latency (ms): 4431.03 +---------------Time to First Token---------------- +Mean TTFT (ms): 48.66 +Median TTFT (ms): 35.88 +P99 TTFT (ms): 120.61 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 9.10 +Median TPOT (ms): 9.55 +P99 TPOT (ms): 11.00 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.01 +Median ITL (ms): 6.51 +P95 ITL (ms): 23.19 +P99 ITL (ms): 25.54 +Max ITL (ms): 52.93 +================================================== +``` + +#### 5.1.2 Reasoning Scenario Benchmark + +##### 5.1.2.1 Low Concurrency +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 201.53 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 44462 +Total generated tokens (retokenized): 44455 +Request throughput (req/s): 0.05 +Input token throughput (tok/s): 30.27 +Output token throughput (tok/s): 220.63 +Peak output token throughput (tok/s): 251.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 250.90 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 20151.45 +Median E2E Latency (ms): 21576.31 +---------------Time to First Token---------------- +Mean TTFT (ms): 2362.23 +Median TTFT (ms): 23.03 +P99 TTFT (ms): 21310.14 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 4.00 +Median TPOT (ms): 4.00 +P99 TPOT (ms): 4.01 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 4.00 +Median ITL (ms): 4.00 +P95 ITL (ms): 4.05 +P99 ITL (ms): 4.08 +Max ITL (ms): 5.67 +================================================== +``` + +##### 5.1.2.2 Medium Concurrency +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 118.67 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 318306 +Total generated tokens (retokenized): 318270 +Request throughput (req/s): 0.67 +Input token throughput (tok/s): 334.27 +Output token throughput (tok/s): 2682.26 +Peak output token throughput (tok/s): 3264.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 3016.53 +Concurrency: 13.74 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 20387.23 +Median E2E Latency (ms): 20466.09 +---------------Time to First Token---------------- +Mean TTFT (ms): 132.47 +Median TTFT (ms): 27.19 +P99 TTFT (ms): 583.15 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 5.09 +Median TPOT (ms): 5.13 +P99 TPOT (ms): 5.19 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 5.09 +Median ITL (ms): 5.08 +P95 ITL (ms): 5.18 +P99 ITL (ms): 5.57 +Max ITL (ms): 522.26 +================================================== +``` + +##### 5.1.2.3 High Concurrency +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 150.00 +Total input tokens: 158939 +Total input text tokens: 158939 +Total input vision tokens: 0 +Total generated tokens: 1301025 +Total generated tokens (retokenized): 1300901 +Request throughput (req/s): 2.13 +Input token throughput (tok/s): 1059.59 +Output token throughput (tok/s): 8673.49 +Peak output token throughput (tok/s): 11899.00 +Peak concurrent requests: 71 +Total token throughput (tok/s): 9733.09 +Concurrency: 54.71 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 25645.42 +Median E2E Latency (ms): 26913.26 +---------------Time to First Token---------------- +Mean TTFT (ms): 163.75 +Median TTFT (ms): 93.67 +P99 TTFT (ms): 426.19 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 6.27 +Median TPOT (ms): 6.39 +P99 TPOT (ms): 6.59 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 6.27 +Median ITL (ms): 0.17 +P95 ITL (ms): 32.94 +P99 ITL (ms): 67.89 +Max ITL (ms): 136.00 +================================================== +``` + +#### 5.1.3 Summarization Scenario Benchmark + +##### 5.1.3.1 Low Concurrency +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 17.44 +Total input tokens: 41941 +Total input text tokens: 41941 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.57 +Input token throughput (tok/s): 2405.19 +Output token throughput (tok/s): 242.00 +Peak output token throughput (tok/s): 250.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 2647.19 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1742.54 +Median E2E Latency (ms): 1412.47 +---------------Time to First Token---------------- +Mean TTFT (ms): 53.48 +Median TTFT (ms): 45.05 +P99 TTFT (ms): 98.57 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 4.01 +Median TPOT (ms): 4.01 +P99 TPOT (ms): 4.03 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 4.01 +Median ITL (ms): 4.01 +P95 ITL (ms): 4.06 +P99 ITL (ms): 4.09 +Max ITL (ms): 4.95 +================================================== +``` + +##### 5.1.3.2 Medium Concurrency +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 16.90 +Total input tokens: 300020 +Total input text tokens: 300020 +Total input vision tokens: 0 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41668 +Request throughput (req/s): 4.73 +Input token throughput (tok/s): 17753.58 +Output token throughput (tok/s): 2465.75 +Peak output token throughput (tok/s): 3005.00 +Peak concurrent requests: 25 +Total token throughput (tok/s): 20219.33 +Concurrency: 13.68 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2890.33 +Median E2E Latency (ms): 3069.55 +---------------Time to First Token---------------- +Mean TTFT (ms): 41.46 +Median TTFT (ms): 31.75 +P99 TTFT (ms): 93.18 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 5.52 +Median TPOT (ms): 5.58 +P99 TPOT (ms): 6.14 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 5.48 +Median ITL (ms): 5.13 +P95 ITL (ms): 5.93 +P99 ITL (ms): 20.76 +Max ITL (ms): 36.01 +================================================== +``` + +##### 5.1.3.3 High Concurrency + +- **Benchmark Command**: +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model zai-org/Glyph \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` +- **Test Results**: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 35.54 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total input vision tokens: 0 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169994 +Request throughput (req/s): 9.01 +Input token throughput (tok/s): 35848.57 +Output token throughput (tok/s): 4783.96 +Peak output token throughput (tok/s): 8396.00 +Peak concurrent requests: 80 +Total token throughput (tok/s): 40632.53 +Concurrency: 59.26 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6580.96 +Median E2E Latency (ms): 6248.74 +---------------Time to First Token---------------- +Mean TTFT (ms): 345.27 +Median TTFT (ms): 96.06 +P99 TTFT (ms): 2823.92 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 12.26 +Median TPOT (ms): 12.53 +P99 TPOT (ms): 23.58 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 11.76 +Median ITL (ms): 6.57 +P95 ITL (ms): 27.66 +P99 ITL (ms): 91.24 +Max ITL (ms): 2609.64 +================================================== +``` + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command + +```bash Command +python -m sglang.test.few_shot_gsm8k \ + --num-questions 200 +``` + +- Test Result + +```text Output +Accuracy: 0.890 +Invalid: 0.000 +Latency: 3.718 s +Output throughput: 5245.606 token/s +``` diff --git a/docs_new/cookbook/autoregressive/GLM/GLM-OCR.mdx b/docs_new/cookbook/autoregressive/GLM/GLM-OCR.mdx new file mode 100644 index 000000000000..4aafb211aa5c --- /dev/null +++ b/docs_new/cookbook/autoregressive/GLM/GLM-OCR.mdx @@ -0,0 +1,227 @@ +--- +title: GLM-OCR +metatags: + description: "Deploy GLM-OCR with SGLang - state-of-the-art OCR performance for complex document understanding." +--- + +## 1. Model Introduction + +[GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. + +The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts. 
+ +**Hardware Support:** NVIDIA B200/H100/H200 + +**Key Features:** + +- **State-of-the-Art Performance**: Achieves 94.62 on OmniDocBench V1.5, ranking #1, and delivers SOTA results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction. +- **Optimized for Real-World Scenarios**: Specifically optimized for practical business cases, maintaining stable and accurate performance on complex tables, code documents, seals, and other challenging layouts. +- **Efficient Inference**: With only 0.9B parameters, GLM-OCR supports deployment via vLLM and SGLang, significantly reducing inference latency and compute cost—well suited for high-concurrency and edge deployments. +- **Easy to Use**: Fully open-sourced with a complete SDK and inference toolchain, enabling one-line invocation and seamless integration into existing systems. + +For more details, please refer to the [official GLM-OCR model card](https://huggingface.co/zai-org/GLM-OCR). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and deployment options. You can optionally enable MTP (Multi-Token Prediction) for faster inference using EAGLE speculative decoding. 
+ +import { GLMOCRDeployment } from '/src/snippets/autoregressive/glm-ocr-deployment.jsx' + + + +### 3.2 Configuration Tips + +- **CUDA IPC Transport**: The `SGLANG_USE_CUDA_IPC_TRANSPORT=1` environment variable enables CUDA IPC for transferring multimodal features, which significantly improves TTFT. +- **MTP (Multi-Token Prediction)**: Enable MTP to use EAGLE speculative decoding for faster inference. This feature predicts multiple tokens at once to reduce latency. +- **Memory Management**: For memory-constrained environments, you may need to adjust `--mem-fraction-static` and/or `--max-running-requests`. + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 OCR Image Processing + +GLM-OCR supports OCR tasks on various document types. Here's a basic example: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "Please extract all text from this image." 
+ } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="zai-org/GLM-OCR", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example Output:** + +```text Output +Response costs: 2.29s +Generated text: CINNAMON SUGAR +1 x 17,000 17,000 + +SUB TOTAL 17,000 + +GRAND TOTAL 17,000 + +CASH IDR 20,000 + +CHANGE DUE 3,000 + +``` + +#### 4.2.2 Complex Document Processing + +GLM-OCR excels at processing complex documents including: + +- **Tables**: Accurate extraction of tabular data with structure preservation +- **Formulas**: Mathematical formula recognition +- **Code Documents**: Source code extraction from screenshots +- **Seals and Stamps**: Recognition of seals and stamps in documents +- **Multi-layout Documents**: Mixed content with text, images, and tables + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +# Example: Processing a document with tables +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "YOUR_DOCUMENT_IMAGE_URL" + } + }, + { + "type": "text", + "text": "Please extract the table content from this document and format it as markdown." + } + ] + } +] + +response = client.chat.completions.create( + model="zai-org/GLM-OCR", + messages=messages, + max_tokens=4096 +) +print(response.choices[0].message.content) +``` + +## 5. 
Benchmark + +### 5.1 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.1.1 OCRBench Benchmark + +- Benchmark Command + +```bash Command +python3 -m lmms_eval \ + --model openai_compatible \ + --model_args "model_version=zai-org/GLM-OCR" \ + --tasks ocrbench \ + --batch_size 128 \ + --log_samples \ + --log_samples_suffix "openai_compatible" \ + --output_path ./logs +``` + +- Test Result + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| ocrbench | Yaml | none | 0 | ocrbench_accuracy | 0.806 | N/A |
+ +#### 5.1.2 OmniDocBench V1.5 + +GLM-OCR achieves **94.62** on OmniDocBench V1.5, ranking #1 among all models, demonstrating state-of-the-art performance across major document understanding benchmarks. diff --git a/docs_new/cookbook/autoregressive/Google/Gemma4.mdx b/docs_new/cookbook/autoregressive/Google/Gemma4.mdx new file mode 100644 index 000000000000..a462c2fa5df3 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Google/Gemma4.mdx @@ -0,0 +1,1346 @@ +--- +title: Gemma 4 +metatags: + description: "Deploy Gemma 4 with SGLang - Google's next-generation open models with MoE variants and multimodal support for text, vision, and audio." +tag: NEW +--- + +import { Gemma4Deployment } from '/src/snippets/autoregressive/gemma4-deployment.jsx'; + +## 1. Model Introduction + +Gemma 4 is Google's next-generation family of open models, building on the Gemma 3 architecture with improved performance, MoE variants, and multimodal support for text, vision, and audio. + +**Key Features:** + +- **Hybrid Attention**: Combines sliding window and full attention layers for efficient long-context processing +- **Multimodal**: Supports text, image, and audio inputs via dedicated vision and audio encoders +- **MoE Variant**: The 26B-A4B model uses a Mixture-of-Experts architecture for efficient inference +- **Per-Layer Embeddings (PLE)**: Layer-specific token embeddings for enhanced representations +- **Reasoning**: Built-in thinking mode with `gemma4` reasoning parser +- **Tool Calling**: Function call support with streaming via `gemma4` tool call parser +- **Fused Operations**: Triton-optimized RMSNorm + residual + scalar kernels + +**Available Models:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Architecture | Parameters |
| --- | --- | --- |
| [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) | Dense | ~2B |
| [google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it) | Dense | ~4B |
| [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) | Dense | 31B |
| [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) | MoE | 26B total / 4B active |
+ +## 2. SGLang Installation + +Gemma 4 support requires [sgl-project/sglang#21952](https://github.com/sgl-project/sglang/pull/21952) and a specific transformers commit: + +```bash Command +# Install SGLang from main branch (after sglang#21952 is merged) +pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python' + +# Install transformers with Gemma 4 support +pip install 'git+https://github.com/huggingface/transformers.git@91b1ab1fdfa81a552644a92fbe3e8d88de40e167' + +# Or use Docker AMD64 +docker pull lmsysorg/sglang:gemma4 # CUDA 12.9 +docker pull lmsysorg/sglang:cu13-gemma4 # CUDA 13 + +# For ARM64 (GB200 / GB300) +docker pull lmsysorg/sglang:dev-gemma4 # CUDA 12.9 +docker pull lmsysorg/sglang:dev-cu13-gemma4 # CUDA 13 +``` + +For the full Docker setup and other installation methods, please refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. Model Deployment + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model variant. + + + +### 3.2 Configuration Tips + +- SGLang automatically selects the Triton attention backend for Gemma 4 models (required for bidirectional image-token attention during prefill). +- For the 26B-A4B MoE model, consider `--tp 2` for high-throughput workloads. +- **Speculative Decoding (MTP)**: Each Gemma 4 variant ships with a paired `*-assistant` draft model that enables NEXTN multi-token prediction. Enable it via the selector above, or pass `--speculative-algorithm NEXTN --speculative-draft-model-path google/gemma-4--it-assistant --speculative-num-steps 5 --speculative-num-draft-tokens 6 --speculative-eagle-topk 1`. MTP can significantly reduce latency for interactive use cases. The 26B-A4B MoE model requires `--tp 2` when MTP is enabled. 
+- Hardware requirements: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Hardware | TP |
| --- | --- | --- |
| gemma-4-E2B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
| gemma-4-E4B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
| gemma-4-31B-it | 2x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 2 (H200) / 1 (AMD) |
| gemma-4-26B-A4B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
+ +### 3.3 AMD GPU Deployment (MI300X / MI325X / MI355X) + +SGLang automatically selects the correct attention backend on AMD GPUs. For the small E-models (`gemma-4-E2B-it`, `gemma-4-E4B-it`), disable AITER on AMD GPUs and use the same command line otherwise: + +```bash Command +SGLANG_USE_AITER=0 sglang serve --model-path google/gemma-4-E4B-it \ + --reasoning-parser gemma4 \ + --tool-call-parser gemma4 \ + --host 0.0.0.0 --port 30000 +``` + +For `gemma-4-31B-it` and `gemma-4-26B-A4B-it`, the same commands above work on MI300X, MI325X, and MI355X without additional command-line changes. + +> **Status**: AMD benchmarks are available in [Section 5.1](#51-speed-benchmark). + +## 4. Model Invocation + +Deploy gemma-4-26B-A4B-it (MoE) with all features enabled: + +```bash Command +sglang serve --model-path google/gemma-4-26B-A4B-it \ + --reasoning-parser gemma4 \ + --tool-call-parser gemma4 \ + --host 0.0.0.0 --port 30000 +``` + +#### Speculative Decoding (MTP) Server Commands + +Each Gemma 4 variant ships with a paired `*-assistant` draft model for NEXTN multi-token prediction. Use the commands below to enable MTP for the corresponding target model. These match the configuration generated when you toggle **Speculative Decoding (MTP) → Enabled** in the [interactive selector](#31-basic-configuration). 
+ +```bash Command +# Gemma 4 E2B + MTP +sglang serve \ + --model-path google/gemma-4-E2B-it \ + --speculative-algorithm NEXTN \ + --speculative-draft-model-path google/gemma-4-E2B-it-assistant \ + --speculative-num-steps 5 \ + --speculative-num-draft-tokens 6 \ + --speculative-eagle-topk 1 \ + --mem-fraction-static 0.85 +``` + +```bash Command +# Gemma 4 E4B + MTP +sglang serve \ + --model-path google/gemma-4-E4B-it \ + --speculative-algorithm NEXTN \ + --speculative-draft-model-path google/gemma-4-E4B-it-assistant \ + --speculative-num-steps 5 \ + --speculative-num-draft-tokens 6 \ + --speculative-eagle-topk 1 \ + --mem-fraction-static 0.85 +``` + +```bash Command +# Gemma 4 31B + MTP +sglang serve \ + --model-path google/gemma-4-31B-it \ + --tp-size 2 \ + --speculative-algorithm NEXTN \ + --speculative-draft-model-path google/gemma-4-31B-it-assistant \ + --speculative-num-steps 5 \ + --speculative-num-draft-tokens 6 \ + --speculative-eagle-topk 1 \ + --mem-fraction-static 0.85 +``` + +```bash Command +# Gemma 4 26B-A4B + MTP +sglang serve \ + --model-path google/gemma-4-26B-A4B-it \ + --tp-size 2 \ + --speculative-algorithm NEXTN \ + --speculative-draft-model-path google/gemma-4-26B-A4B-it-assistant \ + --speculative-num-steps 5 \ + --speculative-num-draft-tokens 6 \ + --speculative-eagle-topk 1 \ + --mem-fraction-static 0.85 +``` + +### 4.1 Basic Usage + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="google/gemma-4-26B-A4B-it", + messages=[ + {"role": "user", "content": "What are the key differences between TCP and UDP?"} + ], + max_tokens=1024 +) + +print(response.choices[0].message.content) +``` + +
+Example Output + +```text Output +The fundamental difference between **TCP (Transmission Control Protocol)** and **UDP (User Datagram +Protocol)** lies in how they prioritize data integrity versus speed. + +### 1. Connection Type +* **TCP (Connection-Oriented):** Before any data is sent, TCP performs a "three-way handshake." + The sender and receiver exchange signals to establish a formal connection. +* **UDP (Connectionless):** UDP does not establish a connection. It simply starts blasting packets + to the destination IP address without checking if the receiver is ready. + +### 2. Reliability and Error Checking +* **TCP (Reliable):** If a packet is lost or arrives corrupted, TCP detects the error and + retransmits the missing data. +* **UDP (Unreliable):** If a packet is lost or corrupted, it is simply discarded. There is no + mechanism to ask for a retransmission. + +### 3. Ordering of Data +* **TCP (Ordered):** Segments are assigned sequence numbers and reassembled in the correct order. +* **UDP (Unordered):** Packets may arrive in a different order than sent. + +### 4. Speed and Overhead +* **TCP (Slower):** Managing connections, tracking, and retransmissions adds significant overhead. +* **UDP (Faster):** No handshake, no tracking — extremely fast and ideal for real-time needs. + +| Feature | TCP | UDP | +| :--- | :--- | :--- | +| **Connection** | Connection-oriented | Connectionless | +| **Reliability** | Guaranteed delivery | Best-effort | +| **Ordering** | Maintains strict order | No guaranteed order | +| **Speed** | Slower (High overhead) | Faster (Low overhead) | +``` + +
+ +### 4.2 Vision Input + +Gemma 4 multimodal variants accept images alongside text: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="google/gemma-4-26B-A4B-it", + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg" + } + }, + { + "type": "text", + "text": "Describe this image in detail." + } + ] + } + ], + max_tokens=1024 +) + +print(response.choices[0].message.content) +``` + +
+Example Output + +```text Output +A vertical, full shot shows a girl and a boy standing in front of a giant teddy bear. The boy, who +is on the left, is of South Asian descent, has short dark hair, and is smiling at the camera. He is +wearing a navy blue sweatshirt with a white collar, blue jeans, and white, black, and red sneakers. +The girl, on the right, is also of South Asian descent and has long, dark hair. She is smiling at +the camera and is wearing a pink t-shirt, a white long-sleeve shirt underneath, blue jeans, and pink +sneakers. The giant teddy bear is light brown and is standing behind the two children. The bear has +large, dark eyes and a black nose. In the background, on the left, there is a large wooden basket +filled with small teddy bears. To the left of the basket, an American flag is hanging on the wall. +On the right side of the image, there is a green leafy plant. The floor is a dark purple carpet. The +lighting is bright and even. +``` + +
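When the image is a local file rather than a public URL, a common pattern with OpenAI-compatible servers is to send it as a base64 data URL in the same `image_url` field. The sketch below shows one way to build such a message. `to_data_url` and `build_vision_message` are hypothetical helper names, and data-URL support on the server side is an assumption that this page does not confirm.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str = "image.jpg") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file name; fall back to JPEG if unknown.
    mime, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime or 'image/jpeg'};base64,{encoded}"


def build_vision_message(image_bytes: bytes, prompt: str, filename: str = "image.jpg") -> dict:
    """Build a user message with the same shape as the vision example above."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, filename)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dict can be placed in the `messages` list of the `client.chat.completions.create` call shown above, with the bytes read from disk via `open(path, "rb").read()`.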
+ +### 4.3 Reasoning (Thinking Mode) + +Gemma 4 supports hybrid reasoning. Thinking is **not enabled by default** — pass `chat_template_kwargs: {"enable_thinking": true}` via `extra_body` to activate it. The reasoning parser separates thinking and content, returning the thinking process via `reasoning_content` in the streaming response. + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="google/gemma-4-26B-A4B-it", + messages=[ + {"role": "user", "content": "Solve step by step: If a train travels at 60 km/h for 2.5 hours, how far does it go?"} + ], + max_tokens=4096, + stream=True, + extra_body={"chat_template_kwargs": {"enable_thinking": True}} +) + +thinking_started = False +has_thinking = False +has_answer = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +
+Example Output + +```text Output +=============== Thinking ================= +* Input: Speed = 60 km/h, Time = 2.5 hours. + * Goal: Find the distance traveled. + * Distance = Speed × Time. + * Step 1: Identify given values. Speed = 60 km/h, Time = 2.5 hours + * Step 2: Formula. Distance = Speed × Time + * Step 3: Calculation. 60 × 2.5 + Mental math: 60 × 2 = 120; 60 × 0.5 = 30; 120 + 30 = 150. + * Step 4: Final Result. 150 km. + +=============== Content ================= +To find the distance traveled, you can follow these steps: + +### 1. Identify the given information: +* **Speed:** 60 km/h +* **Time:** 2.5 hours + +### 2. Use the distance formula: +Distance = Speed × Time + +### 3. Substitute the values: +Distance = 60 km/h × 2.5 hours + +### 4. Perform the calculation: +* 60 × 2 = 120 +* 60 × 0.5 = 30 +* 120 + 30 = 150 + +**Final Answer: The train travels 150 km.** +``` + +
+ +### 4.4 Tool Calling + +Gemma 4 supports function calling with the `gemma4` tool call parser. Enable it during deployment with `--tool-call-parser gemma4`. + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +response = client.chat.completions.create( + model="google/gemma-4-26B-A4B-it", + messages=[ + {"role": "user", "content": "What's the weather in Tokyo?"} + ], + tools=tools, + stream=True +) + +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if hasattr(delta, 'tool_calls') and delta.tool_calls: + if has_thinking and thinking_started: + print("\n=============== Tool Calls ================", flush=True) + thinking_started = False + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +
+Example Output + +```text Output +=============== Tool Calls ================ +Tool Call: get_weather + Arguments: {"location": "Tokyo"} +``` + +
+ +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: H200 +- SGLang Version: gemma4 branch + +#### gemma-4-E2B-it (1x H200, TP=1) + +Server Launch Command: +```bash Command +sglang serve --model-path google/gemma-4-E2B-it +``` + +**Latency Benchmark (Text)** + +```bash Command +python3 -m sglang.bench_serving --backend sglang \ + --host 0.0.0.0 --port 30000 \ + --dataset-name random --num-prompts 10 --max-concurrency 1 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 17.44 +Total input tokens: 6101 +Total generated tokens: 4220 +Request throughput (req/s): 0.57 +Output token throughput (tok/s): 242.03 +Total token throughput (tok/s): 591.94 +Mean TTFT (ms): 50.19 +Median TTFT (ms): 54.22 +Mean TPOT (ms): 3.99 +Median ITL (ms): 4.05 +================================================== +``` + +**Latency Benchmark (Image)** + +```bash Command +python3 -m sglang.bench_serving --backend sglang-oai-chat \ + --host 0.0.0.0 --port 30000 \ + --dataset-name image --image-count 2 --image-resolution 720p \ + --random-input-len 128 --random-output-len 1024 \ + --num-prompts 10 --max-concurrency 1 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 18.05 +Total input tokens: 6097 +Total input vision tokens: 5340 +Total generated tokens: 4220 +Request throughput (req/s): 0.55 +Output token throughput (tok/s): 233.84 +Total token throughput (tok/s): 571.69 +Mean TTFT (ms): 109.59 +Median TTFT (ms): 112.62 +Mean TPOT (ms): 4.01 +Median ITL (ms): 4.04 +================================================== +``` + +**Throughput Benchmark (Text)** + +```bash Command +python3 -m sglang.bench_serving --backend sglang \ + --host 0.0.0.0 --port 30000 \ + 
--dataset-name random --num-prompts 1000 --max-concurrency 100 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 51.73 +Total input tokens: 512842 +Total generated tokens: 510855 +Request throughput (req/s): 19.33 +Output token throughput (tok/s): 9876.36 +Peak output token throughput (tok/s): 13863.00 +Total token throughput (tok/s): 19791.14 +Mean TTFT (ms): 86.57 +Mean TPOT (ms): 9.56 +Median ITL (ms): 5.99 +================================================== +``` + +**Throughput Benchmark (Image)** + +```bash Command +python3 -m sglang.bench_serving --backend sglang-oai-chat \ + --host 0.0.0.0 --port 30000 \ + --dataset-name image --image-count 2 --image-resolution 720p \ + --random-input-len 128 --random-output-len 1024 \ + --num-prompts 1000 --max-concurrency 100 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 89.07 +Total input tokens: 617799 +Total input vision tokens: 534000 +Total generated tokens: 510855 +Request throughput (req/s): 11.23 +Output token throughput (tok/s): 5735.75 +Peak output token throughput (tok/s): 12823.00 +Total token throughput (tok/s): 12672.23 +Mean TTFT (ms): 636.46 +Mean TPOT (ms): 16.34 +Median ITL (ms): 5.68 +================================================== +``` + +#### gemma-4-E4B-it (1x H200, TP=1) + +Server Launch Command: +```bash Command +sglang serve --model-path google/gemma-4-E4B-it +``` + +**Latency Benchmark (Text)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 24.49 +Total input tokens: 6101 +Total generated tokens: 4220 +Request throughput (req/s): 0.41 +Output 
token throughput (tok/s): 172.32 +Total token throughput (tok/s): 421.45 +Mean TTFT (ms): 52.76 +Median TTFT (ms): 53.66 +Mean TPOT (ms): 5.64 +Median ITL (ms): 5.74 +================================================== +``` + +**Latency Benchmark (Image)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 25.04 +Total input tokens: 6124 +Total input vision tokens: 5340 +Total generated tokens: 4220 +Request throughput (req/s): 0.40 +Output token throughput (tok/s): 168.54 +Total token throughput (tok/s): 413.13 +Mean TTFT (ms): 110.15 +Median TTFT (ms): 108.24 +Mean TPOT (ms): 5.66 +Median ITL (ms): 5.73 +================================================== +``` + +**Throughput Benchmark (Text)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 72.95 +Total input tokens: 512842 +Total generated tokens: 510855 +Request throughput (req/s): 13.71 +Output token throughput (tok/s): 7002.68 +Peak output token throughput (tok/s): 9878.00 +Total token throughput (tok/s): 14032.60 +Mean TTFT (ms): 166.33 +Mean TPOT (ms): 13.36 +Median ITL (ms): 8.88 +================================================== +``` + +**Throughput Benchmark (Image)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 108.99 +Total input tokens: 616952 +Total input vision tokens: 534000 +Total generated tokens: 510855 +Request throughput (req/s): 9.18 +Output token throughput (tok/s): 4687.38 +Peak output token throughput (tok/s): 9277.00 +Total token throughput (tok/s): 10348.25 +Mean TTFT (ms): 626.17 +Mean TPOT (ms): 20.00 +Median ITL (ms): 8.64 
+================================================== +``` + +#### gemma-4-31B-it (2x H200, TP=2) + +Server Launch Command: +```bash Command +sglang serve --model-path google/gemma-4-31B-it --tp 2 +``` + +**Latency Benchmark (Text)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 53.05 +Total input tokens: 6101 +Total generated tokens: 4220 +Request throughput (req/s): 0.19 +Output token throughput (tok/s): 79.55 +Total token throughput (tok/s): 194.55 +Mean TTFT (ms): 72.77 +Median TTFT (ms): 75.05 +Mean TPOT (ms): 12.32 +Median ITL (ms): 12.53 +================================================== +``` + +**Latency Benchmark (Image)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 53.78 +Total input tokens: 6162 +Total input vision tokens: 5340 +Total generated tokens: 4220 +Request throughput (req/s): 0.19 +Output token throughput (tok/s): 78.46 +Total token throughput (tok/s): 193.03 +Mean TTFT (ms): 143.35 +Median TTFT (ms): 146.85 +Mean TPOT (ms): 12.37 +Median ITL (ms): 12.48 +================================================== +``` + +**Throughput Benchmark (Text)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 182.00 +Total input tokens: 512842 +Total generated tokens: 510855 +Request throughput (req/s): 5.49 +Output token throughput (tok/s): 2806.82 +Peak output token throughput (tok/s): 3798.00 +Total token throughput (tok/s): 5624.56 +Mean TTFT (ms): 324.67 +Mean TPOT (ms): 33.95 +Median ITL (ms): 25.44 +================================================== +``` + +**Throughput Benchmark (Image)** + +```text Output 
+============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 236.46 +Total input tokens: 621630 +Total input vision tokens: 534000 +Total generated tokens: 510855 +Request throughput (req/s): 4.23 +Output token throughput (tok/s): 2160.42 +Peak output token throughput (tok/s): 3745.00 +Total token throughput (tok/s): 4789.30 +Mean TTFT (ms): 952.02 +Mean TPOT (ms): 44.17 +Median ITL (ms): 26.81 +================================================== +``` + +#### gemma-4-26B-A4B-it (MoE, 1x H200, TP=1) + +Server Launch Command: +```bash Command +sglang serve --model-path google/gemma-4-26B-A4B-it +``` + +> **Tip**: Consider `--tp 2` for high-throughput workloads. + +**Latency Benchmark (Text)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 25.00 +Total input tokens: 6101 +Total generated tokens: 4220 +Request throughput (req/s): 0.40 +Output token throughput (tok/s): 168.81 +Total token throughput (tok/s): 412.85 +Mean TTFT (ms): 103.74 +Median TTFT (ms): 46.57 +Mean TPOT (ms): 5.60 +Median ITL (ms): 5.78 +================================================== +``` + +**Latency Benchmark (Image)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 25.31 +Total input tokens: 6164 +Total input vision tokens: 5340 +Total generated tokens: 4220 +Request throughput (req/s): 0.40 +Output token throughput (tok/s): 166.70 +Total token throughput (tok/s): 410.20 +Mean TTFT (ms): 129.22 +Median TTFT (ms): 132.54 +Mean TPOT (ms): 5.68 +Median ITL (ms): 5.75 +================================================== +``` + +**Throughput Benchmark (Text)** + +```text Output 
+============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 138.98 +Total input tokens: 512842 +Total generated tokens: 510855 +Request throughput (req/s): 7.20 +Output token throughput (tok/s): 3675.81 +Peak output token throughput (tok/s): 4799.00 +Total token throughput (tok/s): 7365.91 +Mean TTFT (ms): 153.77 +Mean TPOT (ms): 25.95 +Median ITL (ms): 20.23 +================================================== +``` + +**Throughput Benchmark (Image)** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 186.38 +Total input tokens: 621146 +Total input vision tokens: 534000 +Total generated tokens: 510855 +Request throughput (req/s): 5.37 +Output token throughput (tok/s): 2740.86 +Peak output token throughput (tok/s): 4962.00 +Total token throughput (tok/s): 6073.47 +Mean TTFT (ms): 854.71 +Mean TPOT (ms): 34.64 +Median ITL (ms): 19.08 +================================================== +``` + +#### gemma-4-31B-it (1x MI300X, TP=1) + +Server Launch Command: +```bash Command +sglang serve --model-path google/gemma-4-31B-it +``` + +> **Note**: The 31B dense model fits on a single MI300X (192 GB VRAM) at TP=1, unlike H200 (141 GB) which requires TP=2. 
+ +**Latency Benchmark (Text)** + +```bash Command +python3 -m sglang.bench_serving --backend sglang \ + --host 0.0.0.0 --port 30000 \ + --dataset-name random --num-prompts 10 --max-concurrency 1 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 103.55 +Total input tokens: 6101 +Total generated tokens: 4220 +Request throughput (req/s): 0.10 +Output token throughput (tok/s): 40.75 +Total token throughput (tok/s): 99.67 +Mean TTFT (ms): 152.35 +Median TTFT (ms): 169.66 +Mean TPOT (ms): 24.13 +Median ITL (ms): 24.23 +================================================== +``` + +**Throughput Benchmark (Text)** + +```bash Command +python3 -m sglang.bench_serving --backend sglang \ + --host 0.0.0.0 --port 30000 \ + --dataset-name random --num-prompts 1000 --max-concurrency 100 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 441.59 +Total input tokens: 512842 +Total generated tokens: 510855 +Request throughput (req/s): 2.26 +Output token throughput (tok/s): 1156.85 +Peak output token throughput (tok/s): 1759.00 +Total token throughput (tok/s): 2318.19 +Mean TTFT (ms): 819.22 +Mean TPOT (ms): 82.51 +Median ITL (ms): 63.45 +================================================== +``` + +#### gemma-4-26B-A4B-it (MoE, 1x MI300X, TP=1) + +Server Launch Command: +```bash Command +sglang serve --model-path google/gemma-4-26B-A4B-it +``` + +**Latency Benchmark (Text)** + +```bash Command +python3 -m sglang.bench_serving --backend sglang \ + --host 0.0.0.0 --port 30000 \ + --dataset-name random --num-prompts 10 --max-concurrency 1 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful 
requests: 10 +Benchmark duration (s): 43.73 +Total input tokens: 6101 +Total generated tokens: 4220 +Request throughput (req/s): 0.23 +Output token throughput (tok/s): 96.49 +Total token throughput (tok/s): 236.00 +Mean TTFT (ms): 185.58 +Median TTFT (ms): 90.18 +Mean TPOT (ms): 9.78 +Median ITL (ms): 9.57 +================================================== +``` + +**Throughput Benchmark (Text)** + +```bash Command +python3 -m sglang.bench_serving --backend sglang \ + --host 0.0.0.0 --port 30000 \ + --dataset-name random --num-prompts 1000 --max-concurrency 100 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 219.43 +Total input tokens: 512842 +Total generated tokens: 510855 +Request throughput (req/s): 4.56 +Output token throughput (tok/s): 2328.05 +Peak output token throughput (tok/s): 3500.00 +Total token throughput (tok/s): 4665.16 +Mean TTFT (ms): 168.44 +Mean TPOT (ms): 41.23 +Median ITL (ms): 29.31 +================================================== +``` + +### 5.2 Accuracy Benchmark + +**Test Environment:** + +- Hardware: H200 +- SGLang Version: gemma4 branch + +#### MMLU + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
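| Model | Humanities | Social Sciences | STEM | Other | Overall |
| --- | --- | --- | --- | --- | --- |
| gemma-4-E2B-it | 0.621 | 0.739 | 0.830 | 0.736 | **0.720** |
| gemma-4-E4B-it | 0.703 | 0.862 | 0.902 | 0.825 | **0.810** |
| gemma-4-31B-it | 0.878 | 0.921 | 0.884 | 0.911 | **0.896** |
| gemma-4-26B-A4B-it | 0.853 | 0.906 | 0.938 | 0.886 | **0.891** |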
+ +#### GSM8K + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
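| Model | Accuracy | Invalid | Latency (s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| gemma-4-E2B-it | 0.170 | 0.000 | 3.990 | 8041.739 |
| gemma-4-E4B-it | 0.745 | 0.000 | 4.174 | 4672.030 |
| gemma-4-31B-it | 0.805 | 0.005 | 16.148 | 1559.914 |
| gemma-4-26B-A4B-it | 0.450 | 0.010 | 13.001 | 4089.457 |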
+ +#### MMMU + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Overall |
| --- | --- |
| gemma-4-E2B-it | **0.307** |
| gemma-4-E4B-it | **0.396** |
| gemma-4-31B-it | **0.589** |
| gemma-4-26B-A4B-it | **0.549** |
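
The overall MMMU score can be cross-checked against the per-category results in the detailed scores. The sketch below assumes each accuracy is simply correct answers divided by `num`, so the overall value is a sample-weighted mean of the six `Overall-*` categories. The numbers are the gemma-4-E2B-it entries from this document; the helper name `weighted_overall` is hypothetical and not part of any benchmark script.

```python
# (num_samples, accuracy) per category for gemma-4-E2B-it,
# copied from the "Overall-*" entries of the detailed MMMU scores.
categories = {
    "Art and Design": (120, 0.45),
    "Business": (150, 0.26),
    "Science": (150, 0.273),
    "Health and Medicine": (150, 0.273),
    "Humanities and Social Science": (120, 0.4),
    "Tech and Engineering": (210, 0.252),
}


def weighted_overall(cats):
    """Return the sample-weighted mean accuracy across categories."""
    total = sum(num for num, _ in cats.values())
    correct = sum(num * acc for num, acc in cats.values())
    return correct / total


if __name__ == "__main__":
    print(f"samples: {sum(n for n, _ in categories.values())}")
    print(f"weighted overall: {weighted_overall(categories):.4f}")
```

The result lands within about 0.001 of the reported 0.307. A small gap is expected because the per-category accuracies are already rounded to three decimals.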
+ +
+MMMU detailed scores (per domain) + +**gemma-4-E2B-it** + +```json Config +{"Overall-Art and Design": {"num": 120, "acc": 0.45}, "Art": {"num": 30, "acc": 0.5}, "Art_Theory": {"num": 30, "acc": 0.467}, "Design": {"num": 30, "acc": 0.5}, "Music": {"num": 30, "acc": 0.333}, "Overall-Business": {"num": 150, "acc": 0.26}, "Accounting": {"num": 30, "acc": 0.367}, "Economics": {"num": 30, "acc": 0.233}, "Finance": {"num": 30, "acc": 0.2}, "Manage": {"num": 30, "acc": 0.233}, "Marketing": {"num": 30, "acc": 0.267}, "Overall-Science": {"num": 150, "acc": 0.273}, "Biology": {"num": 30, "acc": 0.233}, "Chemistry": {"num": 30, "acc": 0.267}, "Geography": {"num": 30, "acc": 0.367}, "Math": {"num": 30, "acc": 0.233}, "Physics": {"num": 30, "acc": 0.267}, "Overall-Health and Medicine": {"num": 150, "acc": 0.273}, "Basic_Medical_Science": {"num": 30, "acc": 0.5}, "Clinical_Medicine": {"num": 30, "acc": 0.233}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.233}, "Pharmacy": {"num": 30, "acc": 0.3}, "Public_Health": {"num": 30, "acc": 0.1}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.4}, "History": {"num": 30, "acc": 0.4}, "Literature": {"num": 30, "acc": 0.567}, "Sociology": {"num": 30, "acc": 0.333}, "Psychology": {"num": 30, "acc": 0.3}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.252}, "Agriculture": {"num": 30, "acc": 0.333}, "Architecture_and_Engineering": {"num": 30, "acc": 0.267}, "Computer_Science": {"num": 30, "acc": 0.233}, "Electronics": {"num": 30, "acc": 0.1}, "Energy_and_Power": {"num": 30, "acc": 0.3}, "Materials": {"num": 30, "acc": 0.2}, "Mechanical_Engineering": {"num": 30, "acc": 0.333}, "Overall": {"num": 900, "acc": 0.307}} +``` + +**gemma-4-E4B-it** + +```json Config +{"Overall-Art and Design": {"num": 120, "acc": 0.458}, "Art": {"num": 30, "acc": 0.433}, "Art_Theory": {"num": 30, "acc": 0.567}, "Design": {"num": 30, "acc": 0.667}, "Music": {"num": 30, "acc": 0.167}, "Overall-Business": {"num": 150, "acc": 0.287}, 
"Accounting": {"num": 30, "acc": 0.233}, "Economics": {"num": 30, "acc": 0.467}, "Finance": {"num": 30, "acc": 0.133}, "Manage": {"num": 30, "acc": 0.3}, "Marketing": {"num": 30, "acc": 0.3}, "Overall-Science": {"num": 150, "acc": 0.28}, "Biology": {"num": 30, "acc": 0.333}, "Chemistry": {"num": 30, "acc": 0.133}, "Geography": {"num": 30, "acc": 0.4}, "Math": {"num": 30, "acc": 0.2}, "Physics": {"num": 30, "acc": 0.333}, "Overall-Health and Medicine": {"num": 150, "acc": 0.427}, "Basic_Medical_Science": {"num": 30, "acc": 0.4}, "Clinical_Medicine": {"num": 30, "acc": 0.533}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.4}, "Pharmacy": {"num": 30, "acc": 0.4}, "Public_Health": {"num": 30, "acc": 0.4}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.7}, "History": {"num": 30, "acc": 0.633}, "Literature": {"num": 30, "acc": 0.867}, "Sociology": {"num": 30, "acc": 0.733}, "Psychology": {"num": 30, "acc": 0.567}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.324}, "Agriculture": {"num": 30, "acc": 0.533}, "Architecture_and_Engineering": {"num": 30, "acc": 0.3}, "Computer_Science": {"num": 30, "acc": 0.367}, "Electronics": {"num": 30, "acc": 0.133}, "Energy_and_Power": {"num": 30, "acc": 0.4}, "Materials": {"num": 30, "acc": 0.2}, "Mechanical_Engineering": {"num": 30, "acc": 0.333}, "Overall": {"num": 900, "acc": 0.396}} +``` + +**gemma-4-31B-it** + +```json Config +{"Overall-Art and Design": {"num": 120, "acc": 0.667}, "Art": {"num": 30, "acc": 0.667}, "Art_Theory": {"num": 30, "acc": 0.867}, "Design": {"num": 30, "acc": 0.8}, "Music": {"num": 30, "acc": 0.333}, "Overall-Business": {"num": 150, "acc": 0.573}, "Accounting": {"num": 30, "acc": 0.633}, "Economics": {"num": 30, "acc": 0.733}, "Finance": {"num": 30, "acc": 0.433}, "Manage": {"num": 30, "acc": 0.533}, "Marketing": {"num": 30, "acc": 0.533}, "Overall-Science": {"num": 150, "acc": 0.527}, "Biology": {"num": 30, "acc": 0.667}, "Chemistry": {"num": 30, "acc": 0.567}, 
"Geography": {"num": 30, "acc": 0.5}, "Math": {"num": 30, "acc": 0.267}, "Physics": {"num": 30, "acc": 0.633}, "Overall-Health and Medicine": {"num": 150, "acc": 0.673}, "Basic_Medical_Science": {"num": 30, "acc": 0.733}, "Clinical_Medicine": {"num": 30, "acc": 0.533}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.467}, "Pharmacy": {"num": 30, "acc": 0.8}, "Public_Health": {"num": 30, "acc": 0.833}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.825}, "History": {"num": 30, "acc": 0.833}, "Literature": {"num": 30, "acc": 0.867}, "Sociology": {"num": 30, "acc": 0.767}, "Psychology": {"num": 30, "acc": 0.833}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.405}, "Agriculture": {"num": 30, "acc": 0.667}, "Architecture_and_Engineering": {"num": 30, "acc": 0.2}, "Computer_Science": {"num": 30, "acc": 0.567}, "Electronics": {"num": 30, "acc": 0.333}, "Energy_and_Power": {"num": 30, "acc": 0.533}, "Materials": {"num": 30, "acc": 0.3}, "Mechanical_Engineering": {"num": 30, "acc": 0.233}, "Overall": {"num": 900, "acc": 0.589}} +``` + +**gemma-4-26B-A4B-it** + +```json Config +{"Overall-Art and Design": {"num": 120, "acc": 0.717}, "Art": {"num": 30, "acc": 0.733}, "Art_Theory": {"num": 30, "acc": 0.833}, "Design": {"num": 30, "acc": 0.867}, "Music": {"num": 30, "acc": 0.433}, "Overall-Business": {"num": 150, "acc": 0.493}, "Accounting": {"num": 30, "acc": 0.533}, "Economics": {"num": 30, "acc": 0.533}, "Finance": {"num": 30, "acc": 0.333}, "Manage": {"num": 30, "acc": 0.5}, "Marketing": {"num": 30, "acc": 0.567}, "Overall-Science": {"num": 150, "acc": 0.473}, "Biology": {"num": 30, "acc": 0.633}, "Chemistry": {"num": 30, "acc": 0.367}, "Geography": {"num": 30, "acc": 0.533}, "Math": {"num": 30, "acc": 0.267}, "Physics": {"num": 30, "acc": 0.567}, "Overall-Health and Medicine": {"num": 150, "acc": 0.62}, "Basic_Medical_Science": {"num": 30, "acc": 0.767}, "Clinical_Medicine": {"num": 30, "acc": 0.533}, 
"Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.433}, "Pharmacy": {"num": 30, "acc": 0.7}, "Public_Health": {"num": 30, "acc": 0.667}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.758}, "History": {"num": 30, "acc": 0.8}, "Literature": {"num": 30, "acc": 0.833}, "Sociology": {"num": 30, "acc": 0.733}, "Psychology": {"num": 30, "acc": 0.667}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.376}, "Agriculture": {"num": 30, "acc": 0.633}, "Architecture_and_Engineering": {"num": 30, "acc": 0.367}, "Computer_Science": {"num": 30, "acc": 0.533}, "Electronics": {"num": 30, "acc": 0.167}, "Energy_and_Power": {"num": 30, "acc": 0.367}, "Materials": {"num": 30, "acc": 0.367}, "Mechanical_Engineering": {"num": 30, "acc": 0.2}, "Overall": {"num": 900, "acc": 0.549}} +``` + +#### ASR + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | WER | Avg Latency (s) | Throughput (req/s) |
| --- | --- | --- | --- |
| gemma-4-E2B-it | 23.86% | 0.212 | 2.99 |
| gemma-4-E4B-it | 29.55% | 0.366 | 2.46 |
| gemma-4-31B-it | Not Supported | | |
| gemma-4-26B-A4B-it | Not Supported | | |
+ +#### FLEUR (EN_US) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | WER | Avg Latency (s) | Throughput (req/s) |
| --- | --- | --- | --- |
| gemma-4-E2B-it | 7.37% | 0.8963 | 16.25 |
| gemma-4-E4B-it | 6.08% | 0.8707 | 16.20 |
| gemma-4-31B-it | Not Supported | | |
| gemma-4-26B-A4B-it | Not Supported | | |
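
For readers unfamiliar with the WER column, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch of that standard definition, not the script used to produce the numbers above. The function name `word_error_rate` is hypothetical, and real evaluations usually normalize text (case, punctuation) first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```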
+ +### 5.3 Logits correctness validation + +**gemma-4-E2B-it** +```shell Command +$ python -m sglang.bench_one_batch --correct --model google/gemma-4-E2B-it .... +prefill logits (final): tensor([[-25.3063, -2.5718, -10.3674, ..., -25.3779, -25.5181, -25.2337]], + device='cuda:0') +.... + +$ python scripts/playground/reference_hf.py --model-path google/gemma-4-E2B-it +.... +prefill logits (final) tensor([-25.3281, -2.1367, -10.2266, ..., -25.4375, -25.5000, -25.2500], + device='cuda:0', dtype=torch.float16) +.... +``` + +**gemma-4-E4B-it** + +```shell Command +$ python -m sglang.bench_one_batch --correct --model google/gemma-4-E4B-it .... +prefill logits (final): tensor([[-17.6478, 7.9901, -5.6505, ..., -17.5658, -17.6478, -17.7293]], + device='cuda:0') +.... + +$ python scripts/playground/reference_hf.py --model-path google/gemma-4-E4B-it +.... +prefill logits (final) tensor([-17.5625, 8.0469, -5.5742, ..., -17.4688, -17.5625, -17.6719], + device='cuda:0', dtype=torch.float16) +.... +``` + +**gemma-4-31B-it** +```shell Command +$ python -m sglang.bench_one_batch --correct --model google/gemma-4-31B-it .... +prefill logits (final): tensor([[-2.0748, 1.1245, -7.4356, ..., -2.1059, -2.1525, -2.2303]], + device='cuda:0') +.... + +$ python scripts/playground/reference_hf.py --model-path google/gemma-4-31B-it +.... +prefill logits (final) tensor([-2.1133, 1.2656, -7.4766, ..., -2.1523, -2.2012, -2.2695], + device='cuda:0', dtype=torch.float16) +.... +``` + +
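
One way to read the logits pairs above is to compare the entries that both runs print. The sketch below uses only the six visible values for gemma-4-E2B-it (the `...` in the output elides the rest of the vocabulary), so it is illustrative rather than a full validation. The helper names and the 0.5 tolerance are assumptions, not thresholds used by SGLang.

```python
# Visible prefill logits for gemma-4-E2B-it, copied from the outputs above.
sglang_logits = [-25.3063, -2.5718, -10.3674, -25.3779, -25.5181, -25.2337]
hf_logits = [-25.3281, -2.1367, -10.2266, -25.4375, -25.5000, -25.2500]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    diff = max_abs_diff(sglang_logits, hf_logits)
    print(f"max abs diff over visible entries: {diff:.4f}")
    print(f"same top entry: {argmax(sglang_logits) == argmax(hf_logits)}")
    # Assumed tolerance for an FP16 reference; tune it for your own checks.
    print(f"within tolerance 0.5: {diff < 0.5}")
```

Small differences of this size are plausible given that the reference run prints `dtype=torch.float16`, while the SGLang output does not state its dtype.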
diff --git a/docs_new/cookbook/autoregressive/InclusionAI/LLaDA-2.1.mdx b/docs_new/cookbook/autoregressive/InclusionAI/LLaDA-2.1.mdx new file mode 100644 index 000000000000..aab3190ef080 --- /dev/null +++ b/docs_new/cookbook/autoregressive/InclusionAI/LLaDA-2.1.mdx @@ -0,0 +1,700 @@ +--- +title: LLaDA 2.1 +metatags: + description: "Deploy LLaDA 2.1 with SGLang - large-scale discrete diffusion language model with parallel token generation, iterative denoising, MoE architecture, and reinforcement learning for reasoning." +--- + +import { LLaDA21Deployment } from '/src/snippets/autoregressive/llada-21-deployment.jsx'; + +## 1. Model Introduction + +[LLaDA 2.1](https://github.com/inclusionAI/LLaDA2.X) is a series of large-scale discrete diffusion language models (dLLMs) developed by the InclusionAI team at Ant Group. Unlike traditional autoregressive models that generate text left-to-right one token at a time, LLaDA 2.1 uses a diffusion-based approach — drafting tokens in parallel and refining them through iterative denoising, enabling self-correction during generation. + +**Key Features:** + +- **Token Editing (T2T + M2T)**: Combines Mask-to-Token (M2T) and Token-to-Token (T2T) editing, allowing the model to not only unmask tokens but also revise already-generated tokens mid-flight +- **Dual Decoding Modes**: Speed Mode (S) for maximum throughput with T2T refinement, and Quality Mode (Q) for conservative thresholds and higher benchmark scores +- **MoE Architecture**: Both variants use Mixture-of-Experts architecture for efficient scaling +- **First Large-Scale RL for dLLMs**: Implements the first reinforcement learning framework specifically designed for diffusion language models, improving reasoning and instruction-following +- **Lightning-Fast Decoding**: Up to 892 tokens/s on HumanEval+ for the 100B model + +**Available Models:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Parameters | Architecture | Context Length | HuggingFace |
| --- | --- | --- | --- | --- |
| **LLaDA2.1-mini** | 16B | MoE (20 layers, 16 attention heads) | 32,768 tokens | [inclusionAI/LLaDA2.1-mini](https://huggingface.co/inclusionAI/LLaDA2.1-mini) |
| **LLaDA2.1-flash** | 100B | MoE | 32,768 tokens | [inclusionAI/LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) |
+ +**License:** + +Apache 2.0. Please refer to the [official LLaDA2.X repository](https://github.com/inclusionAI/LLaDA2.X) for details. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, and decoding mode. SGLang supports serving LLaDA-2.1 on NVIDIA H100, H200, B200, and AMD MI300X, MI325X, MI355X GPUs. + + + +### 3.2 Configuration Tips + +**dLLM-Specific Parameters:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Recommended Value |
| --- | --- | --- |
| `--dllm-algorithm` | Diffusion decoding algorithm | `JointThreshold` |
| `--trust-remote-code` | Required for LLaDA model loading | Always enabled |
| `--mem-fraction-static` | Static memory fraction for KV cache | `0.8` |
| `--max-running-requests` | Maximum concurrent requests | `1` (for best quality) |
| `--attention-backend` | Attention computation backend | `flashinfer` |
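
The recommended values above can be assembled into a launch command programmatically. This is a minimal sketch, assuming the flags and values listed in this document. The helper `build_launch_command` is hypothetical and only formats a string; it does not start a server.

```python
import shlex


def build_launch_command(model_path: str, tp: int = 1, port: int = 8000) -> str:
    """Format an sglang.launch_server command using the recommended dLLM settings."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--dllm-algorithm", "JointThreshold",
        "--tp", str(tp),
        "--trust-remote-code",
        "--mem-fraction-static", "0.8",
        "--max-running-requests", "1",
        "--attention-backend", "flashinfer",
        "--port", str(port),
    ]
    # shlex.join quotes arguments where needed so the result is shell-safe.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_launch_command("inclusionAI/LLaDA2.1-mini"))
```

Passing `tp=4` with `inclusionAI/LLaDA2.1-flash` would mirror the multi-GPU setup this document describes for the 100B model.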
+ +**Decoding Mode Comparison:** + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Mode | Threshold | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| **Quality Mode (Q)** | Conservative | Moderate | Higher benchmark scores | Accuracy-critical tasks |
| **Speed Mode (S)** | Aggressive | Very fast, relies on T2T editing | Slightly lower | Throughput-critical tasks |
+ +**Hardware Requirements:** + +- **LLaDA2.1-mini (16B)**: ~47 GB VRAM, runs on a single GPU (TP=1) +- **LLaDA2.1-flash (100B)**: Requires multi-GPU setup (TP=4 on H100/H200, TP=2 on B200) + +## 4. Model Invocation + +### 4.1 Deployment + +Start the server using the command generated above, for example: + +```shell Command +python -m sglang.launch_server \ + --model-path inclusionAI/LLaDA2.1-mini \ + --dllm-algorithm JointThreshold \ + --tp 1 \ + --trust-remote-code \ + --mem-fraction-static 0.8 \ + --max-running-requests 1 \ + --attention-backend flashinfer \ + --host 0.0.0.0 \ + --port 8000 +``` + +### 4.2 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +**Simple Completion Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="inclusionAI/LLaDA2.1-mini", + messages=[ + {"role": "user", "content": "Explain what a diffusion language model is in simple terms."} + ], + max_tokens=1024 +) + +print(response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +Sure! Let's break it down in simple terms. + +A **diffusion language model** is a type of artificial intelligence that learns to generate text—like sentences, stories, or emails—by studying a lot of written text. + +Here’s how it works, using a simple real-life analogy: + +Imagine you have a big book full of stories. A diffusion language model is trying to learn how to write a new story. Instead of being told the rules, it starts by looking at all the words in the book and trying to understand how words usually go together. + +Now, think of the process like this: + +1. **Start with random noise**: The model begins with a completely random set of words (like a scribble on paper). +2. 
** ** "clean up" the noise**: It gradually "denoises" the noise by turning it into meaningful text, word by word, based on what it learned learned from the book. +3. **Learn from patterns**: As it does this, it learns patterns—like how words often follow each other, or how sentences start. +4. **Generate new text**: Once it’s learned the patterns, it can create new, coherent sentences or stories by starting from a and and building it up word by word. + +So, the "diffusion" part comes from the idea of going from random noise to clear, meaningful text—like turning a scribble into a full story. + +In short: +A diffusion language model is an AI that learns to write text by reading lots of books and gradually turning random noise into coherent, meaningful sentences based on what it learned. +``` + +### 4.3 Advanced Usage + +#### 4.3.1 Streaming + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="inclusionAI/LLaDA2.1-mini", + messages=[ + {"role": "user", "content": "Write a Python function to compute the Fibonacci sequence."} + ], + max_tokens=2048, + stream=True +) + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +````text Output +Here are several ways to implement the Fibonacci sequence in Python: + +## 1. Recursive Approach (Simple but Inefficient) + +```python +def fibonacci_recursive(n): + """ + Compute the nth Fibonacci number using recursion. 
+ + Args: + n (int): The position in the Fibonacci sequence (0-indexed) + + Returns: + int: The nth Fibonacci number + + Raises: + ValueError: If n is negative + """ + if n < 0: + raise ValueError("n must be non-negative") + + if n <= 1: + return n + + return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2) + +# Example usage +print(fibonacci_recursive(10)) # Output: 55 +``` + +## 2. Iterative Approach (Efficient) +... +```` + +#### 4.3.2 Code Generation + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="inclusionAI/LLaDA2.1-mini", + messages=[ + {"role": "user", "content": "Write a Python function that checks if a string is a palindrome. Include docstring and test cases."} + ], + max_tokens=2048 +) + +print(response.choices[0].message.content) +``` + +**Output Example:** + +````text Output +```python +def is_palindrome(s): + """ + Check if a string is a palindrome. + + A palindrome is a word, phrase, or sequence that reads the same backward as forward. + This function ignores case, spaces, punctuation, and non characters characters. 
+ + Args: + s (str): The string to check + + Returns: + bool: True if the string is a palindrome, False otherwise + + Examples: + >>> is_palindrome("racecar") + True + >>> is_palindrome("A man a plan a canal Panama") + True + >>> is_palindrome("race a car") + False + >>> is_palindrome("") + True + >>> is_palindrome("a") + True + """ + # Remove non-alphanumeric characters and convert to lowercase + cleaned = ''.join(char.lower() for char in s if char.isalnum()) + + # Check if the cleaned string reads the same forwards and backwards + return cleaned == cleaned[::-1] + +# Test cases +def test_is_palindrome(): + """Test the is_palindrome function with various inputs.""" + + # Test basic palindromes + assert is_palindrome("racecar") == True + assert is_palindrome("level") == True + assert is_palindrome("madam") == True + assert is_palindrome("radar") == True + + # Test palindromes with spaces and punctuation + assert is_palindrome("A man a plan a canal Panama") == True + assert is_palindrome("race a car") == False + assert is_palindrome("Was it a car or a cat I saw?") == True + assert is_palindrome("Madam, I'm Adam") == True + + # Test edge cases + assert is_palindrome("") == True + assert is_palindrome("a") == True + assert is_palindrome("A") == True + assert is_palindrome("Aa") == True + + # Test non-palindromes + assert is_palindrome("hello") == False + assert is_palindrome("world") == False + assert is_palindrome("python") == False + + # Test single characters + assert is_palindrome("1") == True + assert is_palindrome("1") == True + + print("All tests passed!") + +# Run the tests +if __name__ == "__main__": + # Example usage + print("Testing isalindrome function:") + print(f"'racecar' {is_palindrome('racecar')}") + print(f"'A man a plan a canal Panama': {is_palindrome('A man a plan a canal Panama')}") + print(f"'race a car': {is_palindrome('race a car')}") + print(f"'hello': {is_palindrome('hello')}") + + # Run tests + test_is_palindrome() +``` + +This 
implementation includes: + +1. **Comprehensive function** `is_palindrome()` that: + - Ignores case by converting to lowercase + - Removes all non-alphanumeric characters (spaces, punctuation, etc.) + - Uses string slicing (`[::-1]`) to reverse the string + +2. **Detailed docstring** explaining: + - What the function does + - How it works + - Return value + - Examples of usage + +3. **Extensive test cases** covering: + - Basic palindromes + - Palindromes with spaces and punctuation + - Edge cases (empty string, single character) + - Non-palindromes + - Mixed case scenarios + +4. **Test function** that uses assertions to verify the function works correctly + +The function efficiently handles real-world palindrome checking by ignoring case, spaces, and punctuation, making it suitable for phrases like "A man a plan a canal Panama". +```` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 (4x) +- SGLang Version: 0.5.8+ + +#### 5.1.1 LLaDA2.1-mini + +**Model Deployment:** + +```bash Command +python -m sglang.launch_server \ + --model-path inclusionAI/LLaDA2.1-mini \ + --dllm-algorithm JointThreshold \ + --tp 1 \ + --trust-remote-code \ + --mem-fraction-static 0.8 \ + --max-running-requests 1 \ + --attention-backend flashinfer +``` + +- Latency Benchmark + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model inclusionAI/LLaDA2.1-mini \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- **Latency Result**: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 9.90 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated 
tokens (retokenized): 3433 +Request throughput (req/s): 1.01 +Input token throughput (tok/s): 616.26 +Output token throughput (tok/s): 426.26 +Peak output token throughput (tok/s): 1010.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 1042.53 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 988.87 +Median E2E Latency (ms): 655.27 +P90 E2E Latency (ms): 1952.50 +P99 E2E Latency (ms): 2932.19 +---------------Time to First Token---------------- +Mean TTFT (ms): 152.74 +Median TTFT (ms): 150.37 +P99 TTFT (ms): 229.78 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 2.16 +Median TPOT (ms): 2.08 +P99 TPOT (ms): 3.72 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 2.10 +Median ITL (ms): 1.99 +P95 ITL (ms): 4.03 +P99 ITL (ms): 6.34 +Max ITL (ms): 26.59 +================================================== +``` + +- Throughput Benchmark + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model inclusionAI/LLaDA2.1-mini \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +- **Throughput Result**: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 467.74 +Total input tokens: 249831 +Total input text tokens: 249831 +Total generated tokens: 252662 +Total generated tokens (retokenized): 189717 +Request throughput (req/s): 1.07 +Input token throughput (tok/s): 534.12 +Output token throughput (tok/s): 540.17 +Peak output token throughput (tok/s): 1753.00 +Peak concurrent requests: 105 +Total token throughput (tok/s): 1074.30 +Concurrency: 90.77 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 84912.27 +Median E2E Latency (ms): 86564.26 +P90 E2E Latency (ms): 110567.26 +P99 E2E 
Latency (ms): 114303.38 +---------------Time to First Token---------------- +Mean TTFT (ms): 83920.39 +Median TTFT (ms): 85669.54 +P99 TTFT (ms): 112969.91 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 2.67 +Median TPOT (ms): 1.65 +P99 TPOT (ms): 4.43 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 1.69 +Median ITL (ms): 1.46 +P95 ITL (ms): 3.96 +P99 ITL (ms): 4.84 +Max ITL (ms): 92.08 +================================================== +``` + +#### 5.1.2 LLaDA2.1-flash + +**Model Deployment:** + +```bash Command +python -m sglang.launch_server \ + --model-path inclusionAI/LLaDA2.1-flash \ + --dllm-algorithm JointThreshold \ + --tp 4 \ + --trust-remote-code \ + --mem-fraction-static 0.8 \ + --max-running-requests 1 \ + --attention-backend flashinfer +``` + +- Latency Benchmark + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model inclusionAI/LLaDA2.1-flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- **Latency Result**: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 14.46 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 3276 +Request throughput (req/s): 0.69 +Input token throughput (tok/s): 421.79 +Output token throughput (tok/s): 291.75 +Peak output token throughput (tok/s): 676.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 713.53 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1445.16 +Median E2E Latency (ms): 968.06 +P90 E2E Latency (ms): 3101.86 +P99 E2E Latency (ms): 4208.49 +---------------Time to First Token---------------- +Mean TTFT (ms): 231.63 +Median TTFT (ms): 242.67 +P99 TTFT (ms): 
341.33 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 3.04 +Median TPOT (ms): 2.79 +P99 TPOT (ms): 5.33 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 3.05 +Median ITL (ms): 2.41 +P95 ITL (ms): 7.25 +P99 ITL (ms): 8.27 +Max ITL (ms): 29.27 +================================================== +``` + +- Throughput Benchmark + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model inclusionAI/LLaDA2.1-flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +- **Throughput Result**: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 671.85 +Total input tokens: 249831 +Total input text tokens: 249831 +Total generated tokens: 252662 +Total generated tokens (retokenized): 177961 +Request throughput (req/s): 0.74 +Input token throughput (tok/s): 371.85 +Output token throughput (tok/s): 376.07 +Peak output token throughput (tok/s): 1521.00 +Peak concurrent requests: 103 +Total token throughput (tok/s): 747.92 +Concurrency: 91.28 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 122658.36 +Median E2E Latency (ms): 125265.55 +P90 E2E Latency (ms): 159554.07 +P99 E2E Latency (ms): 165174.88 +---------------Time to First Token---------------- +Mean TTFT (ms): 121009.17 +Median TTFT (ms): 124437.80 +P99 TTFT (ms): 163579.29 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 4.73 +Median TPOT (ms): 2.16 +P99 TPOT (ms): 7.13 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 2.38 +Median ITL (ms): 1.40 +P95 ITL (ms): 6.89 +P99 ITL (ms): 8.60 +Max ITL (ms): 176.78 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +```bash Command +python -m sglang.test.few_shot_gsm8k \ + --num-questions 200 \ + --port 8000 +``` + +**Results:** + +```text Output +Accuracy: 0.895 +Invalid: 0.000 +Latency: 100.552 s +Output throughput: 262.094 token/s +``` diff --git a/docs_new/cookbook/autoregressive/InclusionAI/Ling-2.5-1T.mdx b/docs_new/cookbook/autoregressive/InclusionAI/Ling-2.5-1T.mdx new file mode 100644 index 000000000000..68ab2a43c329 --- /dev/null +++ b/docs_new/cookbook/autoregressive/InclusionAI/Ling-2.5-1T.mdx @@ -0,0 +1,220 @@ +--- +title: Ling-2.5-1T +metatags: + description: "Deploy Ling-2.5-1T with SGLang - 1T parameter MoE model with 63B active parameters, trillion-scale context length up to 1M tokens, and agentic tool calling capabilities." +--- + +## 1. Model Introduction + +[Ling-2.5-1T](https://huggingface.co/inclusionAI/Ling-2.5-1T) is the latest flagship instant model in the Ling family. Thinking models raise the ceiling of intelligence, while instant models expand its reach by balancing efficiency and performance—making AGI not only more powerful, but also more accessible. Ling-2.5-1T delivers comprehensive upgrades across model architecture, token efficiency, and preference alignment, designed to bring universally accessible AI to a new level of quality. + +**Key Features:** + +- **Trillion-Scale Model**: 1T total parameters with 63B active parameters (up from 51B in the previous generation). Pre-training corpus expanded from 20T to 29T tokens. 
Leveraging an efficient hybrid linear attention architecture (1:7 MLA + Lightning Linear Attention), the model delivers exceptionally high throughput while processing context lengths of up to 1M tokens. +- **Token Efficiency**: By introducing a composite reward mechanism combining "Correctness" and "Process Redundancy", Ling-2.5-1T further pushes the frontier of efficiency-performance balance in instant models. At comparable token efficiency levels, Ling-2.5-1T's reasoning capabilities significantly outperform its predecessor, approaching the level of frontier "thinking models" that typically consume ~4x the output tokens. +- **Preference Alignment**: Through refined alignment strategies—such as bidirectional RL feedback and Agent-based instruction constraint verification—Ling-2.5-1T achieves substantial improvements over the previous generation in preference alignment tasks, including creative writing and instruction following. +- **Agentic Capabilities**: Trained with Agentic RL in large-scale high-fidelity interactive environments, Ling-2.5-1T is compatible with mainstream agent platforms such as Claude Code, OpenCode, and OpenClaw. It achieves leading open-source performance on the general tool-calling benchmark, BFCL-V4. +- **Context Length**: 256K -> 1M (YaRN) + +**Available Models:** + +- **BF16**: [inclusionAI/Ling-2.5-1T](https://huggingface.co/inclusionAI/Ling-2.5-1T) + +**License:** MIT + +## 2. SGLang Installation + +Ling-2.5-1T requires a specific SGLang Docker image: + +```bash Command +# For H200/B200 +docker pull lmsysorg/sglang:nightly-dev-20260213-a0ebaa64 + +# For GB200/GB300 +docker pull lmsysorg/sglang:nightly-dev-cu13-20260213-a0ebaa64 +``` + +For other installation methods, please refer to the [official SGLang installation guide](../../../docs/get-started/install). + +Ling-2.5-1T is also supported via the **nightly PyPI builds**. See the [SGLang Installation (PyPI)](../../../docs/get-started/install) guide for setup instructions. + +## 3. 
Model Deployment + +Ling-2.5-1T is a trillion-parameter BF16 model that requires multi-node deployment (at least 2 nodes). Use the configuration selector below to generate the deployment command for your hardware platform. + +import { Ling251TDeployment } from '/src/snippets/autoregressive/ling-25-1t-deployment.jsx' + + + +### Configuration Tips + +- The `--trust-remote-code` flag is required for this model due to custom modeling code. +- `--tp-size` can be set to a maximum of 8 for this model. If you have more GPUs available, increase `--pp-size` to scale across additional nodes. +- Adding `--model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}'` enables faster model loading. +- On H200/GB200/GB300 with 2-node deployment, `--mem-frac 0.95` is required to avoid OOM since the model occupies most of the GPU memory. For better throughput, consider 4-node deployment (ref [model card](https://huggingface.co/inclusionAI/Ling-2.5-1T#run-inference) for more details). + +## 4. 
Model Invocation + +### 4.1 Basic Usage + +For example, launch the server on 2 H200 nodes: + +```bash Command +export MASTER_IP=10.10.0.1 # The IP of Node 0 +export PORT=30000 +export DIST_PORT=50000 + +# Node 0: +python3 -m sglang.launch_server \ +--model-path inclusionAI/Ling-2.5-1T \ +--trust-remote-code \ +--tp-size 8 \ +--pp-size 2 \ +--nnodes 2 \ +--node-rank 0 \ +--host 0.0.0.0 \ +--port ${PORT} \ +--dist-init-addr ${MASTER_IP}:${DIST_PORT} \ +--tool-call-parser qwen \ +--model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' \ +--mem-frac 0.95 + + +# Node 1: +python3 -m sglang.launch_server \ +--model-path inclusionAI/Ling-2.5-1T \ +--trust-remote-code \ +--tp-size 8 \ +--pp-size 2 \ +--nnodes 2 \ +--node-rank 1 \ +--dist-init-addr ${MASTER_IP}:${DIST_PORT} \ +--tool-call-parser qwen \ +--model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' \ +--mem-frac 0.95 +``` + +Once the server is running, send requests to the master node: + +```bash Command +curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}' +``` +Output: +```json Config +{ + "id": "e82af153da844ee6aed7a27a3187f2f4", + "object": "chat.completion", + "created": 1771216764, + "model": "auto", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "The capital of France is **Paris**.\n\n**Additional details:**\n* It is the largest city in France.\n* It is located in the north-central part of the country along the Seine River.\n* Paris is often referred to as \"The City of Light\" (*La Ville Lumière*).", + "reasoning_content": null, + "tool_calls": null + }, + "logprobs": null, + "finish_reason": "stop", + "matched_stop": 156895 + } + ], + "usage": { + "prompt_tokens": 25, + "total_tokens": 93, + "completion_tokens": 68, + "prompt_tokens_details": null, + "reasoning_tokens": 
0 + } +} +``` + +For more API usage examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Tool Calling Example + +```bash Command +curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "inclusionAI/Ling-2.5-1T", + "messages": [{"role": "user", "content": "Search for the latest news about AI"}], + "tools": [{ + "type": "function", + "function": { + "name": "search", + "description": "Search for information on the internet", + "parameters": { + "type": "object", + "properties": { + "query": {"type": "string", "description": "The search query"} + }, + "required": ["query"] + } + } + }], + "tool_choice": "auto" + }' +``` +Output: +```json Config +{ + "id": "b968e45c7d414f7482c8ffc0f9c6b688", + "object": "chat.completion", + "created": 1771216520, + "model": "inclusionAI/Ling-2.5-1T", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": null, + "reasoning_content": null, + "tool_calls": [ + { + "id": "call_e75f711d8ad840ed9d382c9e", + "index": 0, + "type": "function", + "function": { + "name": "search", + "arguments": "{\"query\": \"latest news about AI\"}" + } + } + ] + }, + "logprobs": null, + "finish_reason": "tool_calls", + "matched_stop": null + } + ], + "usage": { + "prompt_tokens": 173, + "total_tokens": 196, + "completion_tokens": 23, + "prompt_tokens_details": null, + "reasoning_tokens": 0 + } +} +``` + +## 5. 
Benchmark + +### GSM8K + +- Benchmark Command +```bash Command +python3 benchmark/gsm8k/bench_sglang.py +``` + +- Test Result +```text Output +Accuracy: 0.960 +Invalid: 0.000 +Latency: 45.410 s +Output throughput: 560.642 token/s +``` diff --git a/docs_new/cookbook/autoregressive/InclusionAI/Ling-2.6.mdx b/docs_new/cookbook/autoregressive/InclusionAI/Ling-2.6.mdx new file mode 100644 index 000000000000..5bbb2343f6df --- /dev/null +++ b/docs_new/cookbook/autoregressive/InclusionAI/Ling-2.6.mdx @@ -0,0 +1,223 @@ +--- +title: Ling-2.6 +metatags: + description: "Deploy the Ling-2.6 family with SGLang - Ling-2.6-flash (104B total / 7.4B active BF16 MoE) and Ling-2.6-1T (~1T FP8 MoE) with hybrid linear attention and agentic tool calling." +--- + +## 1. Model Introduction + +The **Ling-2.6** family from inclusionAI is the next iteration of the Ling instant-model series. Continuing the architectural direction set by Ling-2.5, Ling-2.6 doubles down on **inference efficiency**, **token efficiency**, and **agent performance** — staying competitive with frontier instant models while being faster, leaner, and better suited for production agent workloads. + +**Key Features:** + +- **Hybrid Linear Attention**: A `1:7 MLA + Lightning Linear` hybrid built on top of a highly sparse MoE backbone. Compared with same-class SOTA models, Ling-2.6-flash shows up to ~4× higher prefill and decode throughput in long-context scenarios; Ling-2.6-1T is shipped in FP8 so it fits a single GB300 node with `--tp 4`. +- **Token Efficiency**: Trained with explicit token-efficiency objectives. On the full Artificial Analysis suite, Ling-2.6-flash uses only ~15M output tokens while remaining competitive — a meaningfully stronger intelligence-per-token profile than long-reasoning peers. +- **Agentic Capabilities**: Refined for tool use, multi-step planning, and long-horizon execution. 
Reaches SOTA-class results on **BFCL-V4**, **TAU2-bench**, **SWE-bench Verified**, **Claw-Eval**, and **PinchBench**, and is validated against Claude Code, Kilo Code, Qwen Code, Hermes Agent, and OpenClaw. +- **Long Context**: Native 128K, extendable to **256K (Ling-2.6-flash)** and **256K → 1M (Ling-2.6-1T via YaRN)**. + +**Available Models:** + +- **BF16**: [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) — 104B total / 7.4B active +- **FP8 (E4M3)**: [inclusionAI/Ling-2.6-1T](https://huggingface.co/inclusionAI/Ling-2.6-1T) — ~1T total + +**License:** MIT + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +### 3.1 Ling-2.6-flash + +Ling-2.6-flash is a 104B/7.4B-active MoE that runs comfortably on a single 4-GPU node. Use the selector below to generate the launch command for your hardware. + +import { Ling26FlashDeployment } from '/src/snippets/autoregressive/ling-26-flash-deployment.jsx' + + + +#### Configuration Tips + +- `--trust-remote-code` is required (custom `BailingMoeV2_5ForCausalLM` modeling code). +- `--tp-size 4` is the reference layout. On 4× H20-3e the model reaches ~340 tokens/s decode at TP=4, batch 32. +- Native context is 128K. Enable YaRN (`--json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, ...}}'`) to extend to 256K — the snippet does this for you. +- `--tool-call-parser qwen25` matches the model's `...` schema. +- The recommended baseline does **not** include `--reasoning-parser qwen3`. 
Ling-2.6 is a controllable-reasoning model whose chat template defaults to `detailed thinking off`; the SGLang `qwen3` reasoning parser, in contrast, assumes default-thinking semantics and would mis-route normal output into `reasoning_content`. Only enable it if you specifically want `...` blocks split out — see [§4.3 Thinking Mode](#4-3-thinking-mode). +- **MTP (multi-token prediction)** is supported. Add `--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mamba-scheduler-strategy extra_buffer` to enable it — see the [model card](https://huggingface.co/inclusionAI/Ling-2.6-flash#run-inference) for the full example. + +### 3.2 Ling-2.6-1T + +Ling-2.6-1T ships in **FP8 (E4M3)**, so unlike Ling-2.5-1T it fits a **single GB300 node with `--tp 4`**. On smaller GPUs (H200/B200), a 2-node deployment with `--pp-size 2` is required. + +import { Ling261TDeployment } from '/src/snippets/autoregressive/ling-26-1t-deployment.jsx' + + + +#### Configuration Tips + +- `--trust-remote-code` is required for the custom modeling code. +- `--model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}'` significantly speeds up the multi-shard FP8 weight load (26 safetensors shards + an MTP layer). +- Use `--tool-call-parser qwen` for tool calling. +- The recommended baseline does **not** include `--reasoning-parser qwen3`. Ling-2.6's chat template defaults to `detailed thinking off`, while SGLang's `qwen3` reasoning parser assumes default-thinking semantics — combining the two requires a per-request workaround for tool calls (see [§4.3 Thinking Mode](#4-3-thinking-mode)). Only enable `--reasoning-parser qwen3` if you specifically want `...` blocks split into `reasoning_content`. +- For 2-node deployments, set `MASTER_IP`, `PORT`, and `DIST_PORT` consistently across both nodes. + +## 4. 
Model Invocation + +For example, launch a Ling-2.6-1T server on a single GB300 node: + +```bash Command +sglang serve \ + --model-path inclusionAI/Ling-2.6-1T \ + --tp-size 4 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 \ + --tool-call-parser qwen \ + --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' +``` + +### 4.1 Basic Usage + +```bash Command +curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}' +``` + +Output: +```json Config +{ + "id": "...", + "object": "chat.completion", + "model": "auto", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "The capital of France is **Paris**.", + "reasoning_content": null, + "tool_calls": null + }, + "finish_reason": "stop" + } + ] +} +``` + +### 4.2 Tool Calling Example + +```bash Command +curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "auto", + "messages": [{"role": "user", "content": "Search for the latest news about AI"}], + "tools": [{ + "type": "function", + "function": { + "name": "search", + "description": "Search for information on the internet", + "parameters": { + "type": "object", + "properties": { + "query": {"type": "string", "description": "The search query"} + }, + "required": ["query"] + } + } + }], + "tool_choice": "auto" + }' +``` + +Output: +```json Config +{ + "choices": [ + { + "message": { + "role": "assistant", + "content": null, + "tool_calls": [ + { + "id": "call_...", + "type": "function", + "function": { + "name": "search", + "arguments": "{\"query\": \"latest news about AI\"}" + } + } + ] + }, + "finish_reason": "tool_calls" + } + ] +} +``` + +### 4.3 Thinking Mode + +Both Ling-2.6-flash and Ling-2.6-1T are **controllable-reasoning** models. 
Their chat template uses textual directives in the system message — `detailed thinking on` or `detailed thinking off` — to toggle thinking. The template **defaults to `detailed thinking off`** when neither phrase is present, and it does **not** read the Qwen3-style `enable_thinking` template variable. + +#### Enabling thinking + +Include `detailed thinking on` in the first system message: + +```bash Command +curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "auto", + "messages": [ + {"role": "system", "content": "detailed thinking on"}, + {"role": "user", "content": "If a box has 12 red balls and 8 blue balls, then 5 red balls are removed, how many balls remain?"} + ] + }' +``` + +If you already have a system prompt, append the directive on its own line: + +```json +{"role": "system", "content": "You are a helpful assistant.\ndetailed thinking on"} +``` + +When thinking is on, the model emits `...` blocks before its final answer. To get those split into `message.reasoning_content` automatically, also launch the server with `--reasoning-parser qwen3`. + +#### Caveat: `--reasoning-parser qwen3` + tool calling + +The SGLang `qwen3` reasoning parser was written for Qwen3, where models are **default-thinking** and clients opt out via `chat_template_kwargs.enable_thinking=false`. Ling-2.6 is the opposite — default-non-thinking, with toggling done in the system message. 
As a result, when the server is launched with **both** `--tool-call-parser qwen` and `--reasoning-parser qwen3`, every tool-call request must include `chat_template_kwargs.enable_thinking=false`, otherwise the parser routes the `...` block into `reasoning_content` instead of `message.tool_calls`: + +```bash Command +curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "auto", + "messages": [{"role": "user", "content": "Search for the latest news about AI"}], + "tools": [...], + "tool_choice": "auto", + "chat_template_kwargs": {"enable_thinking": false} + }' +``` + +`enable_thinking` here is consumed by the SGLang reasoning parser, **not** by the chat template — Ling-2.6's template ignores it. For the simplest configuration, just omit `--reasoning-parser qwen3` and toggle thinking via the system message. + +For more API examples, see the [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request). + +## 5. Benchmark + +### GSM8K (Ling-2.6-1T, GB300 × 4) + +Reference run on a single GB300 node with `--tp 4`: + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py +``` + +```text Output +Accuracy: 0.9621 (1269 / 1319) +``` + +For Ling-2.6-flash, see the official numbers on the [model card](https://huggingface.co/inclusionAI/Ling-2.6-flash) (BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval, PinchBench, Artificial Analysis). diff --git a/docs_new/cookbook/autoregressive/InclusionAI/Ring-2.5-1T.mdx b/docs_new/cookbook/autoregressive/InclusionAI/Ring-2.5-1T.mdx new file mode 100644 index 000000000000..ad711e7ee22f --- /dev/null +++ b/docs_new/cookbook/autoregressive/InclusionAI/Ring-2.5-1T.mdx @@ -0,0 +1,265 @@ +--- +title: Ring-2.5-1T +metatags: + description: "Deploy Ring-2.5-1T with SGLang - world's first open-source 1T parameter reasoning model with hybrid linear attention, deep reasoning, and agentic tool calling capabilities." +--- + +## 1. 
Model Introduction + +[Ring-2.5-1T](https://huggingface.co/inclusionAI/Ring-2.5-1T) is the world's first open-source trillion-parameter reasoning model based on a hybrid linear attention architecture, developed by InclusionAI. Building on Ring-1T, Ring-2.5-1T demonstrates substantial improvements in generation efficiency, reasoning depth, and long-horizon task execution. + +**Key Features:** + +- **Trillion-Scale Model**: ~1T total parameters with 63B active parameters, using a hybrid linear attention architecture (1:7 MLA + Lightning Linear Attention) +- **Generation Efficiency**: Reduces memory access overhead by over 10x and increases generation throughput by more than 3x for sequences exceeding 32K tokens +- **Deep Reasoning**: Achieves gold-medal level on both IMO 2025 and CMO 2025, with dense rewards for rigorous reasoning process feedback +- **Long-horizon Task Execution**: Enhanced autonomous execution capability through large-scale fully-async agentic RL training +- **Tool Calling**: Supports function calling with an XML-style tool call format +- **Context Length**: 128K -> 256K (YaRN) + +**Available Models:** + +- **FP8 (8-bit quantized)**: [inclusionAI/Ring-2.5-1T](https://huggingface.co/inclusionAI/Ring-2.5-1T) + +**License:** MIT + +## 2. SGLang Installation + +Ring-2.5-1T requires a specific SGLang Docker image: + +```bash Command +# For H200/B200 +docker pull lmsysorg/sglang:nightly-dev-20260213-a0ebaa64 + +# For GB200/GB300 +docker pull lmsysorg/sglang:nightly-dev-cu13-20260213-a0ebaa64 + +# For MI300X/325X +docker pull lmsysorg/sglang:v0.5.9-rocm700-mi30x + +# For MI355X +docker pull lmsysorg/sglang:v0.5.9-rocm700-mi35x +``` + +For other installation methods, please refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms. 
+ +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform. + +import { Ring251TDeployment } from '/src/snippets/autoregressive/ring-25-1t-deployment.jsx' + + + +### 3.2 Configuration Tips + +- The `--trust-remote-code` flag is required for this model due to custom modeling code. +- The model uses FP8 quantization (compressed-tensors format). + +## 4. Model Invocation + +Deploy Ring-2.5-1T with the following command (on H200, all features enabled): + +```shell Command +sglang serve \ + --model-path inclusionAI/Ring-2.5-1T \ + --tp 8 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +To enable reasoning output separation, add `--reasoning-parser deepseek-r1` when launching the server. The thinking process is returned via `reasoning_content` in the streaming response. + +```shell Command +sglang serve \ + --model-path inclusionAI/Ring-2.5-1T \ + --tp 8 \ + --trust-remote-code \ + --reasoning-parser deepseek-r1 \ + --host 0.0.0.0 \ + --port 30000 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="inclusionAI/Ring-2.5-1T", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + max_tokens=2048, + stream=True +) + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + print(delta.reasoning_content, end="", flush=True) + + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +
+Output Example + +````text Output +We are asked: "Solve this problem step by step: What is 15% of 240?" This is a straightforward percentage calculation. We need to show step-by-step solution. + +We can compute 15% of 240 as (15/100)*240 = 0.15 * 240 = 36. + +But we need to present step by step. Also ensure it's clear. + +We could also break down: 10% of 240 = 24, then 5% = 12, so 15% = 36. + +But any method is fine. + +We'll produce a solution with explanation: "To find 15% of 240, multiply 240 by 0.15 (or 15/100)." + +We'll show: + +15% = 15/100 = 0.15 + +Then 0.15 × 240 = 36. + +Alternatively: (15/100) × 240 = (15 × 240) / 100 = 3600/100 = 36. + +Finally, answer: 36. + +We can also illustrate stepwise: "First, convert the percentage to a decimal: 15% = 0.15. Then multiply by the number: 0.15 × 240 = 36." + +We'll present as a final answer: \boxed{36}. + +However, we need to provide step-by-step solution as per instructions. We'll write a full explanation. + +We can also use the fraction method: 15% of 240 = (15/100)*240 = (15*240)/100 = 3600/100 = 36. + +Alr. + +I think that's it. + + +**Step 1:** Write 15% as a fraction or decimal. +\[ 15\% = \frac{15}{100} = 0.15\] + +**Step 2:** Multiply the number (240) by this fraction/decimal. +\[ 240 \times 0.15 = 36\] + +Alternatively, using the fraction: +\[ \frac{15}{100} \times 240 = \frac{15 \times 240}{100} = \frac{3600}{100} = 36\] + +**Conclusion:** 15% of 240 is 36. + +\[ \boxed{36} \] +```` + +
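The streaming loop above prints reasoning and answer text as they arrive. When the two parts need to be stored separately instead, the same logic can be sketched as a small helper. This is a minimal illustration, not part of SGLang or the OpenAI client: `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks only mimic the `choices[0].delta` shape used in the example above so the sketch runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate reasoning text and answer text from streamed chunks.

    `chunks` is any iterable of objects shaped like streamed chat-completion
    chunks: each has `.choices`, and each choice has a `.delta` that may carry
    `reasoning_content` and/or `content`.
    """
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices; skip them
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Hypothetical stand-in for one streamed chunk (illustration only)."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    fake_chunk(reasoning="10% of 240 is 24; "),
    fake_chunk(reasoning="5% is 12."),
    SimpleNamespace(choices=[]),  # chunk without choices
    fake_chunk(content="36"),
]
reasoning, answer = collect_stream(demo)
print("reasoning:", reasoning)
print("answer:", answer)
```

With a live server, the `response` iterator from the example above could be passed to `collect_stream` in place of `demo`, assuming the server was launched with `--reasoning-parser deepseek-r1` so that `reasoning_content` is populated.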
+ +#### 4.2.2 Tool Calling + +To enable tool calling, add `--tool-call-parser qwen` when launching the server. + +```shell Command +sglang serve \ + --model-path inclusionAI/Ring-2.5-1T \ + --tp 8 \ + --trust-remote-code \ + --tool-call-parser qwen \ + --host 0.0.0.0 \ + --port 30000 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + } + }, + "required": ["location"] + } + } + } +] + +response = client.chat.completions.create( + model="inclusionAI/Ring-2.5-1T", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools +) + +print(response.choices[0].message.tool_calls) +``` + +**Output Example:** + +```text Output +[ChatCompletionMessageFunctionToolCall(id='call_770360e31d194ed79d32cd8c', function=Function(arguments='{"location": "Beijing"}', name='get_weather'), type='function', index=0)] +``` + +## 5. 
Benchmark + +### GSM8K + +- Deployment Command +```bash Command +sglang serve \ + --model-path inclusionAI/Ring-2.5-1T \ + --tp-size 8 \ + --trust-remote-code +``` + +- Benchmark Command +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --temperature 1.2 --top-p 0.8 --max-new-tokens 32768 --num-questions 200 --tokenizer-path inclusionAI/Ring-2.5-1T --enable-thinking +``` + +- Test Result +```text Output +Accuracy: 0.955 +Invalid: 0.010 +Latency: 615.833 s +Output throughput: 412.360 token/s +``` diff --git a/docs_new/cookbook/autoregressive/InternLM/Intern-S1.mdx b/docs_new/cookbook/autoregressive/InternLM/Intern-S1.mdx new file mode 100644 index 000000000000..ae68d7be004b --- /dev/null +++ b/docs_new/cookbook/autoregressive/InternLM/Intern-S1.mdx @@ -0,0 +1,28 @@ +--- +title: Intern-S1 +metatags: + description: "Deploy Intern-S1 with SGLang - community contribution guide for InternLM's Intern-S1 model deployment." +--- + +## 📝 Community Contribution Welcome + +This guide is currently under development. We welcome community contributions! + +If you have experience deploying **Intern-S1** with SGLang, please help us complete this documentation. + +## 🚀 How to Contribute + +```shell Command +git clone https://github.com/YOUR_USERNAME/sglang-cookbook.git +cd sglang-cookbook +git checkout -b add-intern-s1-guide +# Edit this file and submit a PR +``` + +## 📚 Reference + +- [GLM-4.6V](../GLM/GLM-4.6V) + +--- + +**Let's build this together!** 🌟 diff --git a/docs_new/cookbook/autoregressive/InternVL/InternVL3.5.mdx b/docs_new/cookbook/autoregressive/InternVL/InternVL3.5.mdx new file mode 100644 index 000000000000..a235290d7b65 --- /dev/null +++ b/docs_new/cookbook/autoregressive/InternVL/InternVL3.5.mdx @@ -0,0 +1,29 @@ +--- +title: InternVL3.5 +metatags: + description: "Deploy InternVL3.5 vision-language model with SGLang - community contribution guide for OpenGVLab's multimodal model." 
+--- + + +## 📝 Community Contribution Welcome + +This guide is currently under development. We welcome community contributions! + +If you have experience deploying **InternVL3.5** with SGLang, please help us complete this documentation. + +## 🚀 How to Contribute + +```shell Command +git clone https://github.com/YOUR_USERNAME/sglang-cookbook.git +cd sglang-cookbook +git checkout -b add-internvl3-5-guide +# Edit this file and submit a PR +``` + +## 📚 Reference + +- [GLM-4.6V](../GLM/GLM-4.6V) + +--- + +**Let's build this together!** 🌟 diff --git a/docs_new/cookbook/autoregressive/Jina/Jina-reranker-m0.mdx b/docs_new/cookbook/autoregressive/Jina/Jina-reranker-m0.mdx new file mode 100644 index 000000000000..25acd214be87 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Jina/Jina-reranker-m0.mdx @@ -0,0 +1,28 @@ +--- +title: Jina-reranker-m0 +metatags: + description: "Deploy Jina-reranker-m0 with SGLang - community contribution guide for Jina AI's reranker model deployment." +--- + +## 📝 Community Contribution Welcome + +This guide is currently under development. We welcome community contributions! + +If you have experience deploying **Jina-reranker-m0** with SGLang, please help us complete this documentation. + +## 🚀 How to Contribute + +```shell Command +git clone https://github.com/YOUR_USERNAME/sglang-cookbook.git +cd sglang-cookbook +git checkout -b add-jina-reranker-m0-guide +# Edit this file and submit a PR +``` + +## 📚 Reference + +- [DeepSeek-V3.2](../DeepSeek/DeepSeek-V3_2.md) + +--- + +**Let's build this together!** 🌟 diff --git a/docs_new/cookbook/autoregressive/Llama/Llama3.1.mdx b/docs_new/cookbook/autoregressive/Llama/Llama3.1.mdx new file mode 100644 index 000000000000..ac7679cf6856 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Llama/Llama3.1.mdx @@ -0,0 +1,644 @@ +--- +title: Llama-3.1 +metatags: + description: "Deploy Llama 3.1 (8B/70B/405B) with SGLang - 128K context, tool use, multilingual support, and speculative decoding optimization." 
+--- +## 1. Model Introduction + +Llama 3.1 is a collection of pretrained and instruction-tuned generative models, released in July 2024 by Meta. These models are available in 8B, 70B and 405B sizes, with the 405B variant being the most capable openly available model at the time. + +These models bring open intelligence to all, with several new features and improvements: + +- **Stronger General Intelligence**: These models showcase significant improvements in coding, state-of-the-art tool use, and overall stronger reasoning capabilities. +- **Extended Context Length**: Llama 3.1 extends the context length to 128K tokens to improve performance on long-context tasks such as summarization and code reasoning. +- **Tool Use**: Llama 3.1 is trained to interact with a search engine, Python interpreter and mathematical engine, and also improves zero-shot tool use capabilities to interact with potentially unseen tools. +- **Multilinguality**: Llama 3.1 supports 7 languages in addition to English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. + +For further details, please refer to the [Llama 3.1 blog](https://ai.meta.com/blog/meta-llama-3-1/) and the [Llama 3.1 model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to generate a launch command for the Llama 3.1 collection of models. 
+ +import { Llama31Deployment } from "/src/snippets/autoregressive/llama31-deployment.jsx"; + + +### 3.2 Configuration Tips + +**Speculative Decoding (NVIDIA GPUs):** + +- Using Speculative Decoding for latency-sensitive scenarios: + - `--speculative-algorithm EAGLE3`: Speculative decoding algorithm + - `--speculative-num-steps 3`: Number of speculative verification rounds + - `--speculative-eagle-topk 1`: Top-k sampling for draft tokens + - `--speculative-num-draft-tokens 4`: Number of draft tokens per step + - `--speculative-draft-model-path`: The path of the draft model weights. This can be a local folder or a Hugging Face repo ID such as [`yuhuili/EAGLE3-LLaMA3.1-Instruct-8B`](https://huggingface.co/yuhuili/EAGLE3-LLaMA3.1-Instruct-8B). + +**AMD GPU Deployment:** + +- **Hardware-Aware TP**: MI355X (256GB memory) supports lower TP values compared to MI300X/MI325X (192GB) +- **Verified TP Configurations**: + - MI300X/MI325X: 405B BF16 (TP=8), 405B FP8 (TP=4), 70B/8B (TP=1) + - MI355X: 405B BF16 (TP=4), 405B FP8 (TP=2), 70B/8B (TP=1) +- **FP8 Model Variants**: + - 405B: Use Meta's official `meta-llama/Llama-3.1-405B-Instruct-FP8` + - 70B/8B: Use AMD's optimized `amd/Llama-3.1-{size}-Instruct-FP8-KV` +- **Tool Calling**: Enable with `--tool-call-parser llama3` for Instruct models + +## 4. Model Invocation + +### 4.1 Basic Usage + +SGLang exposes an OpenAI-compatible endpoint. 
First, start the server + +```shell Command +sglang serve \ + --model-path Meta-Llama/Llama-3.1-405B-Instruct \ + --tp 8 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY", +) + +resp = client.chat.completions.create( + model="Meta-Llama/Llama-3.1-405B-Instruct", + messages=[ + {"role": "system", "content": "You are a helpful coding assistant."}, + {"role": "user", "content": "Write a Python function that retries a request with exponential backoff."}, + ], + temperature=0.2, + max_tokens=512, +) + +print(resp.choices[0].message.content) +``` + +**Output Example:** + +````text Output +**Exponential Backoff Retry Function in Python** +===================================================== + +Below is a Python function that uses the `requests` library to retry a request with exponential backoff. + +```python +import requests +import time +import random + +def exponential_backoff_retry(url, method, retries=3, backoff_factor=1, max_delay=60): + """ + Retry a request with exponential backoff. + + Args: + url (str): The URL to make the request to. + method (str): The HTTP method to use (e.g. 'GET', 'POST', etc.). + retries (int): The number of retries to attempt. Defaults to 3. + backoff_factor (int): The factor to multiply the delay by for each retry. Defaults to 1. + max_delay (int): The maximum delay to wait between retries in seconds. Defaults to 60. + + Returns: + The response object from the successful request. 
+ """ + + delay = 1 + for attempt in range(retries + 1): + try: + response = requests.request(method, url) + response.raise_for_status() # Raise an exception for HTTP errors + return response + except requests.RequestException as e: + if attempt < retries: + # Calculate the delay for this retry + delay = min(delay * backoff_factor, max_delay) + # Add a random jitter to the delay to prevent thundering herd problem + delay += random.uniform(0, delay * 0.1) + # Wait for the calculated delay before retrying + time.sleep(delay) + else: + # If all retries have failed, raise the exception + raise e +... +```` + +### 4.2 Advanced Usage + +#### 4.2.1 Tool Calling + +Llama3 supports tool calling capabilities. First, start the server with tool call parser enabled: + +```shell Command +sglang serve \ + --model-path Meta-Llama/Llama-3.1-405B-Instruct \ + --tool-call-parser llama3 \ + --tp 8 +``` + +**Python Example** + +```python Example +from openai import OpenAI + +client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:8000/v1") + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the weather in a given location", + "parameters": { + "type": "object", + "properties": { + "city": { + "type": "string", + "description": "The city to find the weather for, e.g. 
'San Francisco'", + }, + "unit": { + "type": "string", + "description": "The unit to fetch the temperature in", + "enum": ["celsius", "fahrenheit"], + }, + }, + "required": ["city", "unit"], + }, + }, + } +] + +response = client.chat.completions.create( + model="meta-llama/Llama-3.1-405B-Instruct", + messages=[ + { + "role": "user", + "content": "What's the weather like in Boston today?", + } + ], + temperature=0.7, + stream=True, + tools=tools, +) + + +arguments = [] + +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'tool_calls') and delta.tool_calls: + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +Reference: [SGLang Tool Parser Documentation](../../../docs/advanced_features/tool_parser#OpenAI-Compatible-API) + +**Output Example** + +```text Output +🔧 Tool Call: get_weather + Arguments: {"city": "Boston", "unit": "fahrenheit"} +``` + +**Handling Tool Call Results** +After getting the tool call, you can execute the function: + +```python Example +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather like in Boston today?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"city": "Boston", "unit": "fahrenheit"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Boston", "fahrenheit") + } +] + +final_response = client.chat.completions.create( + model="Meta-Llama/Llama-3.1-405B-Instruct", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The current weather in Boston is **22°C** and **sunny**. A perfect day to spend outside" +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA A100 GPU (8x) +- Model: Meta-Llama/Llama-3.1-70B +- Tensor Parallelism: 8 +- sglang version: 0.5.6 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios.
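
As a reading aid for the result tables in this section, the throughput rows can be recomputed from the token totals and the benchmark duration. The sketch below assumes the reported figures are simple ratios over wall-clock time and that `Concurrency` is the summed request latency divided by the duration; the helper name `throughput_summary` is hypothetical and is not part of SGLang. The inputs are taken from the low-concurrency run reported below.

```python
def throughput_summary(num_requests, input_tokens, output_tokens, duration_s, mean_e2e_ms):
    """Recompute headline rates from raw totals (hypothetical helper, not an SGLang API)."""
    return {
        "request_throughput": num_requests / duration_s,
        "input_tok_per_s": input_tokens / duration_s,
        "output_tok_per_s": output_tokens / duration_s,
        "total_tok_per_s": (input_tokens + output_tokens) / duration_s,
        # Assumed definition: total time spent in requests divided by wall-clock time.
        "concurrency": num_requests * (mean_e2e_ms / 1000.0) / duration_s,
    }


# Totals copied from the low-concurrency result table in this section.
low = throughput_summary(
    num_requests=10,
    input_tokens=6101,
    output_tokens=4220,
    duration_s=79.81,
    mean_e2e_ms=7977.81,
)
for name, value in low.items():
    print(f"{name}: {value:.2f}")
```

Recomputing the rows this way is a quick sanity check when comparing runs: if a copied table does not reproduce from its own totals, the numbers were likely taken from different runs.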
+ +#### 5.1.1 Standard Scenario Benchmark + +- Model Deployment Command: + +```shell Command +sglang serve \ + --model-path Meta-Llama/Llama-3.1-70B \ + --tp 8 +``` + +##### 5.1.1.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Meta-Llama/Llama-3.1-70B \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 79.81 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4208 +Request throughput (req/s): 0.13 +Input token throughput (tok/s): 76.44 +Output token throughput (tok/s): 52.88 +Peak output token throughput (tok/s): 54.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 129.32 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 7977.81 +Median E2E Latency (ms): 6373.48 +---------------Time to First Token---------------- +Mean TTFT (ms): 131.61 +Median TTFT (ms): 131.77 +P99 TTFT (ms): 163.88 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 18.63 +Median TPOT (ms): 18.63 +P99 TPOT (ms): 18.65 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 18.64 +Median ITL (ms): 18.64 +P95 ITL (ms): 18.69 +P99 ITL (ms): 18.74 +Max ITL (ms): 21.95 +================================================== +``` + +##### 5.1.1.2 Medium Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Meta-Llama/Llama-3.1-70B \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 79.47 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 38450 +Request throughput (req/s): 1.01 +Input token throughput (tok/s): 499.17 +Output token throughput (tok/s): 513.48 +Peak output token throughput (tok/s): 674.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 1012.65 +Concurrency: 13.47 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 13376.67 +Median E2E Latency (ms): 14130.48 +---------------Time to First Token---------------- +Mean TTFT (ms): 264.84 +Median TTFT (ms): 147.02 +P99 TTFT (ms): 791.93 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 26.09 +Median TPOT (ms): 26.08 +P99 TPOT (ms): 34.65 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 25.76 +Median ITL (ms): 23.95 +P95 ITL (ms): 24.72 +P99 ITL (ms): 98.32 +Max ITL (ms): 478.92 +================================================== +``` + +##### 5.1.1.3 High Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Meta-Llama/Llama-3.1-70B \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 131.64 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 243641 +Request throughput (req/s): 3.80 +Input token throughput (tok/s): 1897.87 +Output token throughput (tok/s): 1919.38 +Peak output token throughput (tok/s): 3100.00 +Peak concurrent requests: 107 +Total token throughput (tok/s): 3817.25 +Concurrency: 89.70 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 23616.71 +Median E2E Latency (ms): 22770.44 +---------------Time to First Token---------------- +Mean TTFT (ms): 245.98 +Median TTFT (ms): 184.22 +P99 TTFT (ms): 1251.67 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 47.19 +Median TPOT (ms): 48.67 +P99 TPOT (ms): 56.37 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 46.34 +Median ITL (ms): 33.46 +P95 ITL (ms): 108.61 +P99 ITL (ms): 166.11 +Max ITL (ms): 1107.09 +================================================== +``` + +#### 5.1.2 Summarization Scenario Benchmark + +##### 5.1.2.1 Low Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Meta-Llama/Llama-3.1-70B \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 83.25 +Total input tokens: 41941 +Total input text tokens: 41941 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.12 +Input token throughput (tok/s): 503.77 +Output token throughput (tok/s): 50.69 +Peak output token throughput (tok/s): 54.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 554.46 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8322.45 +Median E2E Latency (ms): 6873.36 +---------------Time to First Token---------------- +Mean TTFT (ms): 395.25 +Median TTFT (ms): 318.02 +P99 TTFT (ms): 850.80 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 18.80 +Median TPOT (ms): 18.81 +P99 TPOT (ms): 19.03 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 18.83 +Median ITL (ms): 18.81 +P95 ITL (ms): 19.06 +P99 ITL (ms): 19.08 +Max ITL (ms): 23.08 +================================================== +``` + +##### 5.1.2.2 Medium Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Meta-Llama/Llama-3.1-70B \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 107.12 +Total input tokens: 300020 +Total input text tokens: 300020 +Total input vision tokens: 0 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41603 +Request throughput (req/s): 0.75 +Input token throughput (tok/s): 2800.81 +Output token throughput (tok/s): 389.00 +Peak output token throughput (tok/s): 624.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 3189.81 +Concurrency: 14.18 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 18988.30 +Median E2E Latency (ms): 20290.66 +---------------Time to First Token---------------- +Mean TTFT (ms): 603.42 +Median TTFT (ms): 531.82 +P99 TTFT (ms): 2607.95 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 36.94 +Median TPOT (ms): 36.73 +P99 TPOT (ms): 79.19 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 35.36 +Median ITL (ms): 25.72 +P95 ITL (ms): 27.07 +P99 ITL (ms): 439.74 +Max ITL (ms): 2529.51 +================================================== +``` + +##### 5.1.2.3 High Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Meta-Llama/Llama-3.1-70B \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 215.66 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total input vision tokens: 0 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169035 +Request throughput (req/s): 1.48 +Input token throughput (tok/s): 5906.92 +Output token throughput (tok/s): 788.27 +Peak output token throughput (tok/s): 1920.00 +Peak concurrent requests: 69 +Total token throughput (tok/s): 6695.19 +Concurrency: 60.01 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 40443.85 +Median E2E Latency (ms): 39813.12 +---------------Time to First Token---------------- +Mean TTFT (ms): 633.32 +Median TTFT (ms): 616.38 +P99 TTFT (ms): 1912.97 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 74.95 +Median TPOT (ms): 82.85 +P99 TPOT (ms): 118.46 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 75.08 +Median ITL (ms): 34.12 +P95 ITL (ms): 261.18 +P99 ITL (ms): 828.12 +Max ITL (ms): 1970.03 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` + +- **Results**: + +```text Output +Accuracy: 0.830 +Invalid: 0.000 +Latency: 11.794 s +Output throughput: 1406.961 token/s +``` diff --git a/docs_new/cookbook/autoregressive/Llama/Llama3.3-70B.mdx b/docs_new/cookbook/autoregressive/Llama/Llama3.3-70B.mdx new file mode 100644 index 000000000000..f865b52fe2c3 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Llama/Llama3.3-70B.mdx @@ -0,0 +1,229 @@ +--- +title: Llama-3.3-70B +metatags: + description: "Deploy Llama-3.3-70B-Instruct with SGLang on AMD GPUs - 128K context, enhanced reasoning, tool calling, and multilingual support." +--- +## 1. Model Introduction + +[Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) is Meta's latest 70 billion parameter instruction-tuned language model, featuring improved performance and efficiency over Llama 3.1. With a 128K token context window and enhanced capabilities across reasoning, coding, and multilingual tasks, Llama 3.3 delivers state-of-the-art results while maintaining accessibility for production deployment. 
+ +**Key Features:** + +- **Enhanced Performance**: Improved instruction following, reasoning, and task completion over Llama 3.1 +- **Tool Calling**: Native support for function calling and tool use scenarios +- **Multilingual Support**: Optimized for 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai) +- **Extended Context**: 128K token context window for processing long documents and complex tasks +- **Efficient Deployment**: 70B parameters enable deployment on single GPU with AMD MI300X + +**License:** +Llama 3.3 is licensed under the Llama 3.3 Community License. See [LICENSE](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/LICENSE) for details. + +For more details, please refer to the [official Llama models repository](https://github.com/meta-llama/llama-models). + +## 2. SGLang Installation + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for AMD GPUs (MI300X, MI325X, MI355X). + +### 3.1 Interactive Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your AMD GPU setup. + +import { Llama33Deployment } from "/src/snippets/autoregressive/llama33-70b-deployment.jsx"; + + + +### 3.2 Configuration Tips + +**AMD GPU Deployment:** + +- All AMD GPUs (MI300X, MI325X, MI355X) support TP=1 for both BF16 and FP8 variants +- **FP8 Model Variant**: Use AMD's optimized `amd/Llama-3.3-70B-Instruct-FP8-KV` +- **Tool Calling**: Enable with `--tool-call-parser llama3` for function calling support +- **Higher Throughput**: Optional TP=2 or TP=4 can be used for increased throughput + +## 4. 
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Tool Calling + +Llama 3.3 70B Instruct supports native tool calling. Enable the tool parser during deployment: + +```shell Command +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.3-70B-Instruct \ + --tool-call-parser llama3 \ + --tp 1 \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Python Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request +response = client.chat.completions.create( + model="meta-llama/Llama-3.3-70B-Instruct", + messages=[ + {"role": "user", "content": "What's the weather in Tokyo?"} + ], + tools=tools, + temperature=0.7 +) + +# Check for tool calls +message = response.choices[0].message +if message.tool_calls: + tool_call = message.tool_calls[0] + print(f"Function: {tool_call.function.name}") + print(f"Arguments: {tool_call.function.arguments}") +``` + +**Handling Tool Call Results:** + +```python Example +# After executing the function, send the result back +def get_weather(location, unit="celsius"): + # Your weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Build conversation with tool result +messages = [ + {"role": "user", "content": "What's the weather in Tokyo?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Tokyo", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Tokyo", "celsius") + } +] + +final_response = client.chat.completions.create( + model="meta-llama/Llama-3.3-70B-Instruct", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The current weather in Tokyo is 22°C and sunny. A perfect day!" +``` + +#### 4.2.2 Long Context Processing + +Leverage the 128K context window for processing long documents: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Example with long document +long_document = "..." * 10000 # Your long document here + +response = client.chat.completions.create( + model="meta-llama/Llama-3.3-70B-Instruct", + messages=[ + {"role": "user", "content": f"Summarize this document:\n\n{long_document}"} + ], + temperature=0.7, + max_tokens=1000 +) + +print(response.choices[0].message.content) +``` + +## 5. 
Benchmarking + +Use the SGLang benchmarking suite to test model performance with different workload patterns: + +### 5.1 Basic Benchmark Command + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 1000 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --max-concurrency 16 +``` + +### 5.2 Adjusting Benchmark Parameters + +**Input/Output Length**: Adjust `--random-input-len` and `--random-output-len` to test different workload patterns: + +- Short conversations: `--random-input-len 1024 --random-output-len 1024` +- Long outputs: `--random-input-len 1024 --random-output-len 8192` +- Long inputs: `--random-input-len 8192 --random-output-len 1024` + +**Concurrency Levels**: Adjust `--max-concurrency` to test different load scenarios: + +- Low concurrency (latency-focused): `--max-concurrency 1 --num-prompts 100` +- Medium concurrency (balanced): `--max-concurrency 16 --num-prompts 1000` +- High concurrency (throughput-focused): `--max-concurrency 100 --num-prompts 2000` + +--- + +## 📚 Additional Resources + +- [Meta Llama Models Repository](https://github.com/meta-llama/llama-models) +- [Llama 3.3 Model Card](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) +- [SGLang Documentation](/) +- [AMD ROCm Documentation](https://rocm.docs.amd.com/) diff --git a/docs_new/cookbook/autoregressive/Llama/Llama4.mdx b/docs_new/cookbook/autoregressive/Llama/Llama4.mdx new file mode 100644 index 000000000000..81e7e343acbe --- /dev/null +++ b/docs_new/cookbook/autoregressive/Llama/Llama4.mdx @@ -0,0 +1,474 @@ +--- +title: Llama 4 +metatags: + description: "Deploy Llama 4 Scout and Maverick with SGLang - Meta's latest generation open-source LLMs with industry-leading performance." +--- + +import { Llama4ScoutDeployment } from '/src/snippets/autoregressive/llama4-scout-deployment.jsx'; +import { Llama4MaverickDeployment } from '/src/snippets/autoregressive/llama4-maverick-deployment.jsx'; + +## 1.
Model Introduction + +[Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) is Meta's latest generation of open-source LLMs, with industry-leading performance. + +SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5). + +Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118). + +This generation delivers comprehensive upgrades across the board: + +- The highly capable Llama 4 Maverick has 17B active parameters out of ~400B total, with 128 experts. +- The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts. +- Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi). + +For more details, please refer to the official Llama 4 page: https://www.llama.com/models/llama-4/ + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. + + + + + +## 4.
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 Launch the docker +```shell Command +docker pull lmsysorg/sglang:v0.5.9-rocm720-mi30x +``` + +```shell Command +docker run -d -it --ipc=host --network=host --privileged \ + --cap-add=CAP_SYS_ADMIN \ + --device=/dev/kfd --device=/dev/dri --device=/dev/mem \ + --group-add video --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v /:/work \ + -e SHELL=/bin/bash \ + --name Llama4 \ + lmsysorg/sglang:v0.5.9-rocm720-mi30x \ + /bin/bash +``` + +#### 4.2.2 Launch the server + +### Llama-4-Scout +8-GPU deployment command: + +```bash Command +sglang serve \ + --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --tp 8 \ + --context-length 1000000 \ + --trust-remote-code +``` + +### Llama-4-Maverick +8-GPU deployment command: + +```bash Command +sglang serve \ + --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --tp 8 \ + --context-length 1000000 \ + --trust-remote-code +``` + +## 5. 
Benchmark +### 5.1 Speed Benchmark +Test Environment: + +Hardware: AMD MI300x GPU + +Model: Llama-4-Scout + +Tensor Parallelism: 8 + +sglang version: 0.5.9 + +- **Model Deployment** +```bash Command +sglang serve \ + --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --tp 8 \ + --context-length 1000000 \ + --trust-remote-code +``` + +### 5.1.1 Low Concurrency (Latency-Optimized) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 74.62 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4211 +Request throughput (req/s): 0.14 +Input token throughput (tok/s): 82.88 +Output token throughput (tok/s): 57.42 +Peak output token throughput (tok/s): 146.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 140.20 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 7459.48 +Median E2E Latency (ms): 4489.77 +---------------Time to First Token---------------- +Mean TTFT (ms): 4246.98 +Median TTFT (ms): 68.57 +P99 TTFT (ms): 48091.05 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 7.49 +Median TPOT (ms): 7.40 +P99 TPOT (ms): 7.40 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 7.49 +Median ITL (ms): 7.49 +P95 ITL (ms): 7.47 +P99 ITL (ms): 7.52 +Max ITL (ms): 10.44 +================================================== +``` +### 5.1.2 Medium Concurrency (Balanced) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 45.41 +Total input tokens: 49668 +Total input text tokens: 49668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40516 +Request throughput (req/s): 2.26 +Input token throughput (tok/s): 1120.46 +Output token throughput (tok/s): 1152.47 +Peak output token throughput (tok/s): 1520.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 2272.84 +Concurrency: 14.76 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6089.22 +Median E2E Latency (ms): 6568.80 +---------------Time to First Token---------------- +Mean TTFT (ms): 124.44 +Median TTFT (ms): 87.42 +P99 TTFT (ms): 268.72 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 11.88 +Median TPOT (ms): 12.00 +P99 TPOT (ms): 15.49 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 11.72 +Median ITL (ms): 10.54 +P95 ITL (ms): 11.22 +P99 ITL (ms): 67.88 +Max ITL (ms): 74.05 +================================================== +``` +### 5.1.3 High Concurrency (Throughput-Optimized) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 85.84 +Total input tokens: 249841 +Total input text tokens: 249841 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 250498 +Request throughput (req/s): 5.84 +Input token throughput (tok/s): 2910.84 +Output token throughput (tok/s): 2944.82 +Peak output token throughput (tok/s): 4100.00 +Peak concurrent requests: 110 +Total token throughput (tok/s): 5854.65 +Concurrency: 92.24 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 15844.00 +Median E2E Latency (ms): 15262.56 +---------------Time to First Token---------------- +Mean TTFT (ms): 204.46 +Median TTFT (ms): 129.96 +P99 TTFT (ms): 528.54 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 41.56 +Median TPOT (ms): 42.90 +P99 TPOT (ms): 47.48 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 40.99 +Median ITL (ms): 24.46 +P95 ITL (ms): 84.46 +P99 ITL (ms): 87.64 +Max ITL (ms): 226.06 +================================================== +``` + +### 5.2 Speed Benchmark +Test Environment: + +Hardware: AMD MI300x GPU + +Model: Llama-4-Maverick + +Tensor Parallelism: 8 + +sglang version: 0.5.9 + +- **Model Deployment** +```bash Command +sglang serve \ + --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --tp 8 \ + --context-length 1000000 \ + --trust-remote-code +``` + +### 5.2.1 Low Concurrency (Latency-Optimized) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 68.08 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4202 +Request throughput (req/s): 0.15 +Input token throughput (tok/s): 89.62 +Output token throughput (tok/s): 61.99 +Peak output token throughput (tok/s): 168.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 151.61 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6805.62 +Median E2E Latency (ms): 2733.91 +---------------Time to First Token---------------- +Mean TTFT (ms): 4296.56 +Median TTFT (ms): 57.45 +P99 TTFT (ms): 38633.95 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 5.95 +Median TPOT (ms): 5.96 +P99 TPOT (ms): 5.97 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 5.96 +Median ITL (ms): 5.96 +P95 ITL (ms): 6.02 +P99 ITL (ms): 6.08 +Max ITL (ms): 7.02 +================================================== +``` +### 5.2.2 Medium Concurrency (Balanced) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 30.72 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40923 +Request throughput (req/s): 2.60 +Input token throughput (tok/s): 1291.39 +Output token throughput (tok/s): 1328.41 +Peak output token throughput (tok/s): 1760.00 +Peak concurrent requests: 22 +Total token throughput (tok/s): 2619.80 +Concurrency: 13.92 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 5345.15 +Median E2E Latency (ms): 5679.73 +---------------Time to First Token---------------- +Mean TTFT (ms): 259.30 +Median TTFT (ms): 72.60 +P99 TTFT (ms): 1063.45 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.53 +Median TPOT (ms): 10.22 +P99 TPOT (ms): 20.27 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.99 +Median ITL (ms): 9.10 +P95 ITL (ms): 9.87 +P99 ITL (ms): 55.62 +Max ITL (ms): 868.54 +================================================== +``` +### 5.2.3 High Concurrency (Throughput-Optimized) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 90.95 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 251625 +Request throughput (req/s): 5.50 +Input token throughput (tok/s): 2746.77 +Output token throughput (tok/s): 2777.90 +Peak output token throughput (tok/s): 3700.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 5524.67 +Concurrency: 93.04 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 16924.17 +Median E2E Latency (ms): 16294.85 +---------------Time to First Token---------------- +Mean TTFT (ms): 188.19 +Median TTFT (ms): 128.96 +P99 TTFT (ms): 534.81 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 33.63 +Median TPOT (ms): 35.37 +P99 TPOT (ms): 38.26 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 33.19 +Median ITL (ms): 27.66 +P95 ITL (ms): 76.91 +P99 ITL (ms): 78.82 +Max ITL (ms): 268.17 +================================================== +``` +### 5.3 Accuracy Benchmark + +#### 5.3.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` + - Llama-4-Scout-17B-16E-Instruct +```text Output +Accuracy: 0.945 +Invalid: 0.000 +Latency: 12.731 s +Output throughput: 1595.418 token/s +``` + - Llama-4-Maverick-17B-128E-Instruct +```text Output +Accuracy: 0.895 +Invalid: 0.000 +Latency: 9.739 s +Output throughput: 2405.505 token/s +``` diff --git a/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.5.mdx b/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.5.mdx new file mode 100644 index 000000000000..d4c46623fa75 --- /dev/null +++ b/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.5.mdx @@ -0,0 +1,1040 @@ +--- +title: MiniMax-M2.5 +metatags: + description: "Deploy MiniMax-M2.5 with SGLang - community contribution guide for MiniMax M2.5 model deployment." +--- + +import { MiniMaxM25Deployment } from '/src/snippets/autoregressive/minimax-m25-deployment.jsx'; + +## 1. Model Introduction + +[MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) is a powerful language model developed by MiniMax, built for real-world productivity with state-of-the-art performance across coding, reasoning, agentic tasks, and tool use. + +As the latest iteration in the MiniMax model series, MiniMax-M2.5 achieves comprehensive enhancements across multiple domains. Details are as follows: + +- **Superior coding performance**: Achieves 79.7 on Droid and 76.1 on OpenCode, surpassing Opus 4.6 (78.9 and 75.9 respectively). Strong results on SWE-bench Verified, SWE-bench Multilingual, SWE-bench-pro, and Multi-SWE-bench. 
+- **Advanced reasoning**: Demonstrates strong performance on AIME25 and other reasoning benchmarks, with robust tool use during inference. +- **More capable agents**: Excels in agentic tasks including web browsing (BrowseComp, Wide Search), information retrieval (RISE), and complex tool use scenarios (Terminal Bench 2, MEWC, Finance Modeling). +- **Real-world productivity**: Designed for production-grade workloads with strong performance on practical coding, data analysis, and multi-step reasoning tasks. + +For more details, please refer to the [official MiniMax-M2.5 announcement](https://www.minimax.io/news/minimax-m25). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +**For AMD MI300X/MI325X/MI355X GPUs:** + +```bash Command +# Docker (AMD MI300X/MI325X) +docker pull lmsysorg/sglang:v0.5.9-rocm720-mi30x + +# Docker (AMD MI355X) +docker pull lmsysorg/sglang:v0.5.9-rocm720-mi35x +``` + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and feature capabilities. + +<MiniMaxM25Deployment /> + +### 3.2 Configuration Tips + +**Key Parameters:** +
+| Parameter | Description | Recommended Value |
+|---|---|---|
+| `--tool-call-parser` | Tool call parser for function calling support | `minimax-m2` |
+| `--reasoning-parser` | Reasoning parser for thinking mode | `minimax-append-think` |
+| `--trust-remote-code` | Required for MiniMax model loading | Always enabled |
+| `--mem-fraction-static` | Static memory fraction for KV cache | `0.85` |
+| `--tp` | Tensor parallelism size | `2` (2-GPU) or `4` (4-GPU) or `8` (8-GPU) |
+| `--ep` | Expert parallelism size | `8` (NVIDIA 8-GPU) or EP=TP (AMD) |
+| `--kv-cache-dtype` | KV cache data type (AMD only) | `fp8_e4m3` |
+| `--attention-backend` | Attention backend (AMD only) | `triton` |
+ +**Hardware Requirements: NVIDIA** + +- **4-GPU deployment**: Requires 4× high-memory GPUs (e.g., H200, B200, A100, H100) with TP=4 +- **8-GPU deployment**: Requires 8× GPUs (e.g., H200, B200, A100, H100) with TP=8 and EP=8 + +**Hardware Requirements: AMD** + +- **2-GPU deployment**: Requires 2× high-memory GPUs (e.g., MI300X, MI325X, MI355X) with TP=2, EP=2 +- **4-GPU deployment**: Requires 4× GPUs (e.g., MI300X, MI325X, MI355X) with TP=4, EP=4 +- **8-GPU deployment**: Requires 8× GPUs (e.g., MI300X, MI325X, MI355X) with TP=8, EP=8 + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +**Testing Deployment:** + +After startup, you can test the SGLang OpenAI-compatible API with the following command: + +```bash Command +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "MiniMaxAI/MiniMax-M2.5", + "messages": [ + {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]}, + {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]} + ] + }' +``` + +**Simple Completion Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="MiniMaxAI/MiniMax-M2.5", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Who won the world series in 2020?"} + ], + max_tokens=1024 +) + +print(response.choices[0].message.content) +``` +**Example Output**: +```text Output +The user asks: "Who won the world series in 2020?" That is a straightforward factual question. The answer: the Los Angeles Dodgers. They won the 2020 World Series, beating the Tampa Bay Rays. The user is presumably expecting that answer. + +We must follow the policies. 
The question is safe: no disallowed content. It's just a factual question. Provide answer. + +We must ensure compliance: Use no disallowed content. Should we provide context? Just answer straightforwardly. + +The user simply asks "Who won the world series in 2020?" We'll answer: The Los Angeles Dodgers. + +No additional relevant info needed, but could elaborate briefly: They beat the Tampa Bay Rays in six games, the series was played in a bubble at Globe Life Field in Arlington, Texas due to COVID-19. + +No need for any extra. That's it. + + +The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in six games. +``` +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +MiniMax-M2.5 supports Thinking mode. Enable the reasoning parser during deployment to separate the thinking and the content sections: + +```shell Command +python -m sglang.launch_server \ +  --model-path MiniMaxAI/MiniMax-M2.5 \ +  --tp 4 \ +  --reasoning-parser minimax-append-think \ +  --trust-remote-code \ +  --mem-fraction-static 0.85 +``` + +**Streaming with Thinking Process** + +With `minimax-append-think`, the thinking content is wrapped in `<think>...</think>` tags within the `content` field. You can parse these tags on the client side to separate the thinking and content sections: + +```python Example +from openai import OpenAI + +client = OpenAI( +    base_url="http://localhost:8000/v1", +    api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( +    model="MiniMaxAI/MiniMax-M2.5", +    messages=[ +        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} +    ], +    temperature=0.7, +    max_tokens=2048, +    stream=True +) + +# Process the stream, separating <think>...</think>
from content +in_think = False +think_printed_header = False +content_printed_header = False +buffer = "" + +for chunk in response: +    if chunk.choices and len(chunk.choices) > 0: +        delta = chunk.choices[0].delta +        if delta.content: +            buffer += delta.content + +            while buffer: +                if in_think: +                    # Look for closing tag +                    end_idx = buffer.find("</think>") +                    if end_idx != -1: +                        print(buffer[:end_idx], end="", flush=True) +                        buffer = buffer[end_idx + len("</think>"):] +                        in_think = False +                    else: +                        # Still in thinking, print what we have +                        print(buffer, end="", flush=True) +                        buffer = "" +                else: +                    # Look for opening tag +                    start_idx = buffer.find("<think>") +                    if start_idx != -1: +                        # Print any content before +                        before = buffer[:start_idx] +                        if before: +                            if not content_printed_header: +                                print("=============== Content =================", flush=True) +                                content_printed_header = True +                            print(before, end="", flush=True) +                        buffer = buffer[start_idx + len("<think>"):] +                        in_think = True +                        if not think_printed_header: +                            print("=============== Thinking =================", flush=True) +                            think_printed_header = True +                    else: +                        # No tag, print as content +                        if not content_printed_header and think_printed_header: +                            print("\n=============== Content =================", flush=True) +                            content_printed_header = True +                        print(buffer, end="", flush=True) +                        buffer = "" + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user asks: "Solve this problem step by step: What is 15% of 240?" This is straightforward: 15% = 0.15; 0.15*240 = 36. So answer: 36. Provide step-by-step: convert percent to decimal, multiply. + +We need to obey policies. There's no policy violation. Just answer. Provide step by step. Should respond with solution. + +We can also mention alternative method: 15% = 15/100 = 3/20. Multiply 240 * 3/20 = (240/20)*3 = 12*3 = 36. + +Thus answer 36. + +We can add step-by-step. That's it. + +=============== Content ================= + +**Step‑by‑step solution** + +1.
**Convert the percent to a decimal** +   \[ +   15\% = \frac{15}{100}=0.15 +   \] + +2. **Multiply the decimal by the number** +   \[ +   0.15 \times 240 = 36 +   \] + +(You can also think of it as \(15\% = \frac{3}{20}\) and then \(240 \times \frac{3}{20}=12 \times 3 = 36\).) + +\[ +\boxed{36} +\] +``` + +**Note:** The `minimax-append-think` reasoning parser embeds the thinking process in `<think>...</think>` tags within the `content` field. The code above parses these tags in real-time to display thinking and content separately. + +#### 4.2.2 Tool Calling + +MiniMax-M2.5 supports tool calling capabilities. Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ +  --model-path MiniMaxAI/MiniMax-M2.5 \ +  --tp 4 \ +  --tool-call-parser minimax-m2 \ +  --reasoning-parser minimax-append-think \ +  --trust-remote-code \ +  --mem-fraction-static 0.85 +``` + +**Python Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( +    base_url="http://localhost:8000/v1", +    api_key="EMPTY" +) + +# Define available tools +tools = [ +    { +        "type": "function", +        "function": { +            "name": "get_weather", +            "description": "Get the current weather for a location", +            "parameters": { +                "type": "object", +                "properties": { +                    "location": { +                        "type": "string", +                        "description": "The city name" +                    }, +                    "unit": { +                        "type": "string", +                        "enum": ["celsius", "fahrenheit"], +                        "description": "Temperature unit" +                    } +                }, +                "required": ["location"] +            } +        } +    } +] + +# Non-streaming request +response = client.chat.completions.create( +    model="MiniMaxAI/MiniMax-M2.5", +    messages=[ +        {"role": "user", "content": "What's the weather in Beijing?"} +    ], +    tools=tools, +    temperature=0.7 +) + +message = response.choices[0].message + +# Check for tool calls +if message.tool_calls: +    for tool_call in message.tool_calls: +        print(f"Tool Call: {tool_call.function.name}") +        print(f"  Arguments: {tool_call.function.arguments}") +else: +    print(message.content) +``` + +**Output Example**: +```text Output +Tool
Call: get_weather + Arguments: {"location": "Beijing"} +``` + +**Note:** + +- Tool calls are returned in `message.tool_calls` with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="MiniMaxAI/MiniMax-M2.5", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. 
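
The `sglang.bench_serving` reports in this section share a fixed `label: value` layout, which makes them easy to compare across concurrency levels. The following is a minimal sketch, assuming that layout stays as printed in the reports of this section; `parse_bench_report` and `METRIC_ROW` are hypothetical helper names introduced here for illustration and are not part of SGLang.

```python
import re

# One "label: number" row of a report, e.g. "Mean TTFT (ms):   130.57".
# Hypothetical helper pattern; it assumes the layout printed in this section.
METRIC_ROW = re.compile(r"^\s*(.+?):\s+(-?\d+(?:\.\d+)?|inf)\s*$")


def parse_bench_report(text):
    """Return {label: float} for numeric rows; banners and text rows are skipped."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_ROW.match(line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


# Rows copied from the low-concurrency report in section 5.1.1.1.
SAMPLE_REPORT = """\
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Input token throughput (tok/s):          141.70
Output token throughput (tok/s):         98.17
Total token throughput (tok/s):          239.87
---------------Time to First Token----------------
Mean TTFT (ms):                          130.57
Median TTFT (ms):                        116.10
P99 TTFT (ms):                           190.90
==================================================
"""

metrics = parse_bench_report(SAMPLE_REPORT)

# Sanity check: input and output throughput should add up to the reported total.
combined = (
    metrics["Input token throughput (tok/s)"]
    + metrics["Output token throughput (tok/s)"]
)
print(f"input + output throughput: {combined:.2f} tok/s")

# Tail-to-median ratio is a quick indicator of TTFT jitter.
ttft_ratio = metrics["P99 TTFT (ms)"] / metrics["Median TTFT (ms)"]
print(f"P99 / median TTFT: {ttft_ratio:.2f}")
```

Rows whose value is not numeric (for example `Backend: sglang`) and the banner lines are ignored, so the same helper can be applied to each report and the resulting dictionaries compared side by side.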
+ +### 5.1 Speed Benchmark + +**Test Environment**: + +- Hardware: NVIDIA B200 GPU (8x) +- Model: MiniMax-M2.5 +- Tensor Parallelism: 8 +- Expert Parallelism: 8 +- sglang version: 0.5.8 + +#### 5.1.1 Standard Scenario Benchmark +- Model Deployment Command: +```shell Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2.5 \ + --tp 8 \ + --ep 8 \ + --reasoning-parser minimax-append-think \ + --trust-remote-code \ + --mem-fraction-static 0.85 \ + --tool-call-parser minimax-m2 +``` +##### 5.1.1.1 Low Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 42.99 +Total input tokens: 6091 +Total input text tokens: 6091 +Total generated tokens: 4220 +Total generated tokens (retokenized): 3804 +Request throughput (req/s): 0.23 +Input token throughput (tok/s): 141.70 +Output token throughput (tok/s): 98.17 +Peak output token throughput (tok/s): 102.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 239.87 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4295.92 +Median E2E Latency (ms): 3419.28 +P90 E2E Latency (ms): 7832.04 +P99 E2E Latency (ms): 9601.40 +---------------Time to First Token---------------- +Mean TTFT (ms): 130.57 +Median TTFT (ms): 116.10 +P99 TTFT (ms): 190.90 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 9.89 +Median TPOT (ms): 9.89 +P99 TPOT (ms): 9.91 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.89 +Median ITL (ms): 9.89 +P95 ITL (ms): 10.15 +P99 ITL (ms): 10.32 +Max ITL (ms): 14.46 +================================================== +``` +##### 5.1.1.2 Medium Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 48.43 +Total input tokens: 39588 +Total input text tokens: 39588 +Total generated tokens: 40805 +Total generated tokens (retokenized): 37142 +Request throughput (req/s): 1.65 +Input token throughput (tok/s): 817.37 +Output token throughput (tok/s): 842.49 +Peak output token throughput (tok/s): 1184.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 1659.86 +Concurrency: 13.67 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8274.32 +Median E2E Latency (ms): 8692.90 +P90 E2E Latency (ms): 13690.70 +P99 E2E Latency (ms): 16104.18 +---------------Time to First Token---------------- +Mean TTFT (ms): 305.44 +Median TTFT (ms): 106.75 +P99 TTFT (ms): 1053.26 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 16.20 +Median TPOT (ms): 16.06 +P99 TPOT (ms): 26.75 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 15.65 +Median ITL (ms): 13.63 +P95 ITL (ms): 14.90 +P99 ITL (ms): 87.99 +Max ITL (ms): 483.53 +================================================== +``` +##### 5.1.1.3 High Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 92.31 +Total input tokens: 249331 +Total input text tokens: 249331 +Total generated tokens: 252662 +Total generated tokens (retokenized): 218975 +Request throughput (req/s): 5.42 +Input token throughput (tok/s): 2700.94 +Output token throughput (tok/s): 2737.02 +Peak output token throughput (tok/s): 4479.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 5437.97 +Concurrency: 91.19 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 16835.82 +Median E2E Latency (ms): 16042.08 +P90 E2E Latency (ms): 31027.63 +P99 E2E Latency (ms): 34787.91 +---------------Time to First Token---------------- +Mean TTFT (ms): 391.06 +Median TTFT (ms): 133.12 +P99 TTFT (ms): 1712.92 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 33.04 +Median TPOT (ms): 34.29 +P99 TPOT (ms): 41.98 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 32.61 +Median ITL (ms): 21.67 +P95 ITL (ms): 87.76 +P99 ITL (ms): 118.81 +Max ITL (ms): 1145.62 +================================================== +``` +#### 5.1.2 Summarization Scenario Benchmark +- Model Deployment Command: +```shell Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2.5 \ + --tp 8 \ + --ep 8 \ + --reasoning-parser minimax-append-think \ + --trust-remote-code \ + --mem-fraction-static 0.85 \ + --tool-call-parser minimax-m2 +``` +##### 5.1.2.1 Low Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 43.49 +Total input tokens: 41941 +Total input text tokens: 41941 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.23 +Input token throughput (tok/s): 964.42 +Output token throughput (tok/s): 97.04 +Peak output token throughput (tok/s): 102.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 1061.46 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4346.83 +Median E2E Latency (ms): 3508.84 +P90 E2E Latency (ms): 7972.23 +P99 E2E Latency (ms): 9659.71 +---------------Time to First Token---------------- +Mean TTFT (ms): 131.50 +Median TTFT (ms): 126.76 +P99 TTFT (ms): 182.52 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.00 +Median TPOT (ms): 10.01 +P99 TPOT (ms): 10.12 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 10.01 +Median ITL (ms): 10.02 +P95 ITL (ms): 10.29 +P99 ITL (ms): 10.44 +Max ITL (ms): 14.11 +================================================== +``` +##### 5.1.2.2 Medium Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 50.12 +Total input tokens: 300020 +Total input text tokens: 300020 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41662 +Request throughput (req/s): 1.60 +Input token throughput (tok/s): 5986.00 +Output token throughput (tok/s): 831.38 +Peak output token throughput (tok/s): 1152.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 6817.38 +Concurrency: 13.93 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8727.66 +Median E2E Latency (ms): 9170.52 +P90 E2E Latency (ms): 14220.00 +P99 E2E Latency (ms): 16896.54 +---------------Time to First Token---------------- +Mean TTFT (ms): 282.56 +Median TTFT (ms): 149.37 +P99 TTFT (ms): 1278.62 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 16.60 +Median TPOT (ms): 16.61 +P99 TPOT (ms): 25.17 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 16.24 +Median ITL (ms): 13.89 +P95 ITL (ms): 15.96 +P99 ITL (ms): 105.79 +Max ITL (ms): 1065.02 +================================================== +``` +##### 5.1.2.3 High Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 93.92 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169999 +Request throughput (req/s): 3.41 +Input token throughput (tok/s): 13563.30 +Output token throughput (tok/s): 1810.01 +Peak output token throughput (tok/s): 2881.00 +Peak concurrent requests: 71 +Total token throughput (tok/s): 15373.31 +Concurrency: 58.87 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 17277.69 +Median E2E Latency (ms): 16827.33 +P90 E2E Latency (ms): 29045.40 +P99 E2E Latency (ms): 33496.77 +---------------Time to First Token---------------- +Mean TTFT (ms): 692.26 +Median TTFT (ms): 188.46 +P99 TTFT (ms): 4932.70 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 32.19 +Median TPOT (ms): 32.69 +P99 TPOT (ms): 50.46 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 31.28 +Median ITL (ms): 21.59 +P95 ITL (ms): 101.35 +P99 ITL (ms): 136.74 +Max ITL (ms): 4649.23 +================================================== +``` + +#### 5.1.3 H100 Benchmark + +**Test Environment**: + +- Hardware: NVIDIA H100 80GB HBM3 GPU (8x) +- Model: MiniMax-M2.5 +- Tensor Parallelism: 8 +- Expert Parallelism: 8 +- sglang version: 0.5.9 + +- Model Deployment Command: +```shell Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2.5 \ + --tp 8 \ + --ep 8 \ + --reasoning-parser minimax-append-think \ + --trust-remote-code \ + --mem-fraction-static 0.85 \ + --tool-call-parser minimax-m2 +``` +##### 5.1.3.1 Low Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 35.44 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.28 +Input token throughput (tok/s): 172.16 +Output token throughput (tok/s): 119.08 +Peak output token throughput (tok/s): 127.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 291.24 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3542.38 +Median E2E Latency (ms): 2791.92 +P90 E2E Latency (ms): 6317.77 +P99 E2E Latency (ms): 7780.15 +---------------Time to First Token---------------- +Mean TTFT (ms): 145.20 +Median TTFT (ms): 80.38 +P99 TTFT (ms): 633.08 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 8.05 +Median TPOT (ms): 8.08 +P99 TPOT (ms): 8.09 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 8.07 +Median ITL (ms): 8.08 +P95 ITL (ms): 8.12 +P99 ITL (ms): 8.16 +Max ITL (ms): 10.10 +================================================== +``` +##### 5.1.3.2 Medium Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 43.68 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40805 +Request throughput (req/s): 1.83 +Input token throughput (tok/s): 908.19 +Output token throughput (tok/s): 934.22 +Peak output token throughput (tok/s): 1184.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 1842.42 +Concurrency: 13.83 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 7551.91 +Median E2E Latency (ms): 8094.28 +P90 E2E Latency (ms): 12606.99 +P99 E2E Latency (ms): 14977.84 +---------------Time to First Token---------------- +Mean TTFT (ms): 116.86 +Median TTFT (ms): 82.33 +P99 TTFT (ms): 240.59 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 14.81 +Median TPOT (ms): 14.98 +P99 TPOT (ms): 17.98 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 14.61 +Median ITL (ms): 13.50 +P95 ITL (ms): 14.15 +P99 ITL (ms): 66.52 +Max ITL (ms): 107.39 +================================================== +``` +##### 5.1.3.3 High Concurrency +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 80.63 +Total input tokens: 249831 +Total input text tokens: 249831 +Total generated tokens: 252662 +Total generated tokens (retokenized): 252331 +Request throughput (req/s): 6.20 +Input token throughput (tok/s): 3098.45 +Output token throughput (tok/s): 3133.56 +Peak output token throughput (tok/s): 4800.00 +Peak concurrent requests: 113 +Total token throughput (tok/s): 6232.01 +Concurrency: 90.56 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 14604.59 +Median E2E Latency (ms): 14044.04 +P90 E2E Latency (ms): 26456.53 +P99 E2E Latency (ms): 30136.68 +---------------Time to First Token---------------- +Mean TTFT (ms): 149.32 +Median TTFT (ms): 95.16 +P99 TTFT (ms): 374.62 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 28.92 +Median TPOT (ms): 30.09 +P99 TPOT (ms): 34.31 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 28.66 +Median ITL (ms): 21.52 +P95 ITL (ms): 66.90 +P99 ITL (ms): 96.76 +Max ITL (ms): 376.34 +================================================== +``` + +### 5.2 Accuracy Benchmark +#### 5.2.1 GSM8K Benchmark +- Benchmark Command: +```shell Command +python benchmark/gsm8k/bench_sglang.py --port 30000 +``` +- Test Results: +```text Output +Accuracy: 0.950 +Invalid: 0.000 +Latency: 18.033 s +Output throughput: 1130.161 token/s +``` +#### 5.2.2 MMLU Benchmark +- Benchmark Command: +```shell Command +cd benchmark/mmlu +bash download_data.sh +python3 bench_sglang.py --port 30000 +``` +- Test Results: +```text Output +subject: abstract_algebra, #q:100, acc: 0.620 +subject: anatomy, #q:135, acc: 0.830 +subject: astronomy, #q:152, acc: 0.928 +subject: business_ethics, #q:100, acc: 0.810 +subject: clinical_knowledge, #q:265, acc: 0.891 +subject: college_biology, #q:144, acc: 0.951 +subject: college_chemistry, #q:100, acc: 0.670 +subject: college_computer_science, #q:100, acc: 0.820 +subject: college_mathematics, #q:100, acc: 0.660 +subject: college_medicine, #q:173, acc: 0.832 +subject: college_physics, #q:102, acc: 0.814 +subject: computer_security, #q:100, acc: 0.880 +subject: conceptual_physics, #q:235, acc: 0.915 +subject: econometrics, #q:114, acc: 0.719 +subject: electrical_engineering, #q:145, acc: 0.834 +subject: elementary_mathematics, #q:378, acc: 0.902 +subject: formal_logic, #q:126, acc: 0.698 +subject: global_facts, #q:100, acc: 0.710 +subject: high_school_biology, #q:310, acc: 0.926 +subject: high_school_chemistry, #q:203, acc: 0.793 +subject: high_school_computer_science, #q:100, acc: 0.910 +subject: high_school_european_history, #q:165, acc: 0.879 +subject: high_school_geography, #q:198, acc: 0.955 +subject: high_school_government_and_politics, #q:193, acc: 0.964 +subject: high_school_macroeconomics, 
#q:390, acc: 0.908 +subject: high_school_mathematics, #q:270, acc: 0.600 +subject: high_school_microeconomics, #q:238, acc: 0.954 +subject: high_school_physics, #q:151, acc: 0.781 +subject: high_school_psychology, #q:545, acc: 0.956 +subject: high_school_statistics, #q:216, acc: 0.847 +subject: high_school_us_history, #q:204, acc: 0.922 +subject: high_school_world_history, #q:237, acc: 0.916 +subject: human_aging, #q:223, acc: 0.839 +subject: human_sexuality, #q:131, acc: 0.893 +subject: international_law, #q:121, acc: 0.934 +subject: jurisprudence, #q:108, acc: 0.861 +subject: logical_fallacies, #q:163, acc: 0.890 +subject: machine_learning, #q:112, acc: 0.750 +subject: management, #q:103, acc: 0.883 +subject: marketing, #q:234, acc: 0.944 +subject: medical_genetics, #q:100, acc: 0.920 +subject: miscellaneous, #q:783, acc: 0.936 +subject: moral_disputes, #q:346, acc: 0.829 +subject: moral_scenarios, #q:895, acc: 0.632 +subject: nutrition, #q:306, acc: 0.863 +subject: philosophy, #q:311, acc: 0.833 +subject: prehistory, #q:324, acc: 0.907 +subject: professional_accounting, #q:282, acc: 0.720 +subject: professional_law, #q:1534, acc: 0.640 +subject: professional_medicine, #q:272, acc: 0.923 +subject: professional_psychology, #q:612, acc: 0.871 +subject: public_relations, #q:110, acc: 0.773 +subject: security_studies, #q:245, acc: 0.845 +subject: sociology, #q:201, acc: 0.930 +subject: us_foreign_policy, #q:100, acc: 0.940 +subject: virology, #q:166, acc: 0.614 +subject: world_religions, #q:171, acc: 0.895 +Total latency: 81.468 +Average accuracy: 0.825 +``` diff --git a/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.7.mdx b/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.7.mdx new file mode 100644 index 000000000000..a0dfeb1fdb16 --- /dev/null +++ b/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.7.mdx @@ -0,0 +1,723 @@ +--- +title: MiniMax-M2.7 +metatags: + description: "Deploy MiniMax-M2.7 with SGLang on NVIDIA and AMD GPUs — model self-evolution, 
professional software engineering, and native agent teams." +tag: NEW +--- + +## 1. Model Introduction + +[MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) is MiniMax's first model deeply participating in its own evolution. Built for real-world productivity, M2.7 excels at building complex agent harnesses and completing highly elaborate productivity tasks, leveraging Agent Teams, complex Skills, and dynamic tool search. + +Key highlights: + +- **Model Self-Evolution**: During development, M2.7 updates its own memory, builds complex skills for RL experiments, and improves its own learning process. An internal version autonomously optimized a programming scaffold over 100+ rounds, achieving a **30% performance improvement**. On MLE Bench Lite, M2.7 achieved a **66.6% medal rate**. +- **Professional Software Engineering**: Delivers outstanding real-world programming capabilities. On SWE-Pro, M2.7 achieved **56.22%**, with strong results on SWE Multilingual (76.5) and Multi SWE Bench (52.7). On Terminal Bench 2 (57.0%) and NL2Repo (39.8%), M2.7 demonstrates deep understanding of complex engineering systems. +- **Professional Work**: Achieved an ELO score of **1495** on GDPval-AA (highest among open-source models). On Toolathon, M2.7 reached **46.3%** accuracy (global top tier). +- **Native Agent Teams**: Supports multi-agent collaboration with stable role identity and autonomous decision-making. + +For more details, see the [official MiniMax-M2.7 blog post](https://www.minimax.io/news/minimax-m27-en). + +**License**: [Modified-MIT (MiniMax Model License)](https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE) + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. 
+ +**Docker Images by Hardware Platform:** + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hardware Platform | Docker Image |
| --- | --- |
| NVIDIA A100 / H100 / H200 / B200 | `lmsysorg/sglang:v0.5.10.post1` |
| NVIDIA B300 / GB300 | `lmsysorg/sglang:v0.5.10.post1-cu130` |
| AMD MI300X / MI325X | `lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x` |
| AMD MI355X | `lmsysorg/sglang:v0.5.10.post1-rocm720-mi35x` |
+ +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and feature capabilities. + +import { MiniMaxM27Deployment } from '/src/snippets/autoregressive/minimax-m27-deployment.jsx' + + + +### 3.2 Configuration Tips + +**Key Parameters:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Recommended Value |
| --- | --- | --- |
| `--tool-call-parser` | Tool call parser for function calling support | `minimax-m2` |
| `--reasoning-parser` | Reasoning parser for thinking mode | `minimax-append-think` |
| `--trust-remote-code` | Required for MiniMax model loading | Always enabled |
| `--mem-fraction-static` | Static memory fraction for KV cache | `0.85` |
| `--tp` | Tensor parallelism size | `2` / `4` / `8` depending on hardware |
| `--ep` | Expert parallelism size | `8` (NVIDIA 8-GPU) or EP=TP (AMD) |
| `--kv-cache-dtype` | KV cache data type (AMD only) | `fp8_e4m3` |
| `--attention-backend` | Attention backend (AMD only) | `triton` |
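
The recommended values in the table can be combined into a single launch command. The sketch below is only an illustration: `build_serve_command` is a hypothetical helper (not part of SGLang), and it assumes the flags are passed to `sglang serve` exactly as listed in the table.

```python
import shlex


def build_serve_command(tp, ep=None, amd=False,
                        model="MiniMaxAI/MiniMax-M2.7"):
    """Hypothetical helper: assemble an `sglang serve` argument list
    from the recommended values in the Key Parameters table."""
    args = [
        "sglang", "serve",
        "--model-path", model,
        "--tp", str(tp),
        "--tool-call-parser", "minimax-m2",
        "--reasoning-parser", "minimax-append-think",
        "--trust-remote-code",
        "--mem-fraction-static", "0.85",
    ]
    if ep is not None:
        # Expert parallelism is optional; the table suggests EP=8 on
        # NVIDIA 8-GPU nodes and EP=TP on AMD.
        args += ["--ep", str(ep)]
    if amd:
        # The table lists these two flags as AMD-only.
        args += ["--kv-cache-dtype", "fp8_e4m3",
                 "--attention-backend", "triton"]
    return args


# Print shell-ready commands for two of the documented configurations.
print(shlex.join(build_serve_command(tp=8, ep=8)))
print(shlex.join(build_serve_command(tp=4, ep=4, amd=True)))
```

Building the command as a list and joining it with `shlex.join` keeps quoting correct if a model path ever contains spaces. The interactive command generator remains the authoritative source for platform-specific flags.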
+ +**Hardware Requirements: NVIDIA** + +- **4-GPU deployment**: Requires 4× high-memory GPUs (e.g., H200, B200, A100, H100) with TP=4 +- **8-GPU deployment**: Requires 8× GPUs (e.g., H200, B200, A100, H100) with TP=8 and EP=8 + +**Hardware Requirements: NVIDIA GB300** + +- **2-GPU deployment**: GB300 (275GB per die) can host the model with TP=2 +- **4-GPU deployment**: Maximum single-node TP for GB300, recommended for higher throughput + +**Hardware Requirements: AMD** + +- **2-GPU deployment**: Requires 2× high-memory GPUs (e.g., MI300X, MI325X, MI355X) with TP=2, EP=2 +- **4-GPU deployment**: Requires 4× GPUs (e.g., MI300X, MI325X, MI355X) with TP=4, EP=4 +- **8-GPU deployment**: Requires 8× GPUs (e.g., MI300X, MI325X, MI355X) with TP=8, EP=8 + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +**Deployment Command:** + +```bash Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2.7 \ + --tp 4 \ + --tool-call-parser minimax-m2 \ + --reasoning-parser minimax-append-think \ + --trust-remote-code \ + --mem-fraction-static 0.85 +``` + +**Testing Deployment:** + +After startup, you can test the SGLang OpenAI-compatible API with the following command: + +```bash Command +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "MiniMaxAI/MiniMax-M2.7", + "messages": [ + {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]}, + {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]} + ] + }' +``` + +**Simple Completion Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="MiniMaxAI/MiniMax-M2.7", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, 
+ {"role": "user", "content": "Who won the world series in 2020?"} + ], + max_tokens=1024 +) + +print(response.choices[0].message.content) +``` + +**Example Output**: +```text Output +The user asks: "Who won the World Series in 2020?" That's a simple factual question. The answer: the Los Angeles Dodgers won the 2020 MLB World Series, defeating the Tampa Bay Rays. So answer accordingly. + +We must be mindful of policy: it's a factual question about sports. It's allowed. Provide answer with brief context. + +We should answer concisely. + +Hence final answer: The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in six games (best-of-seven series). Possibly mention it was played at a neutral site due to COVID-19, at Globe Life Field in Arlington, Texas. + +We must avoid disallowed content, no issue. + +Thus final. + + +The **Los Angeles Dodgers** won the 2020 World Series. They defeated the **Tampa Bay Rays** in six games (4‑2) in a best‑of‑seven series that was played at Globe Life Field in Arlington, Texas, under the MLB bubble‑like arrangements for the COVID‑19 pandemic. +``` + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +MiniMax-M2.7 supports Thinking mode. Enable the reasoning parser during deployment to separate the thinking and the content sections: + +```bash Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2.7 \ + --tp 4 \ + --reasoning-parser minimax-append-think \ + --trust-remote-code \ + --mem-fraction-static 0.85 +``` + +**Streaming with Thinking Process** + +With `minimax-append-think`, the thinking content is wrapped in `<think>...</think>` tags within the `content` field.
You can parse these tags on the client side to separate the thinking and content sections:
+
+```python Example
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="EMPTY"
+)
+
+# Enable streaming to see the thinking process in real-time
+response = client.chat.completions.create(
+    model="MiniMaxAI/MiniMax-M2.7",
+    messages=[
+        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
+    ],
+    max_tokens=2048,
+    stream=True
+)
+
+# Process the stream, separating <think>...</think> from content
+in_think = False
+think_printed_header = False
+content_printed_header = False
+buffer = ""
+
+for chunk in response:
+    if chunk.choices and len(chunk.choices) > 0:
+        delta = chunk.choices[0].delta
+        if delta.content:
+            buffer += delta.content
+
+            while buffer:
+                if in_think:
+                    # Look for closing </think> tag
+                    end_idx = buffer.find("</think>")
+                    if end_idx != -1:
+                        print(buffer[:end_idx], end="", flush=True)
+                        buffer = buffer[end_idx + len("</think>"):]
+                        in_think = False
+                    else:
+                        # Still in thinking, print what we have
+                        print(buffer, end="", flush=True)
+                        buffer = ""
+                else:
+                    # Look for opening <think> tag
+                    start_idx = buffer.find("<think>")
+                    if start_idx != -1:
+                        # Print any content before <think>
+                        before = buffer[:start_idx]
+                        if before:
+                            if not content_printed_header:
+                                print("=============== Content =================", flush=True)
+                                content_printed_header = True
+                            print(before, end="", flush=True)
+                        buffer = buffer[start_idx + len("<think>"):]
+                        in_think = True
+                        if not think_printed_header:
+                            print("=============== Thinking =================", flush=True)
+                            think_printed_header = True
+                    else:
+                        # No <think> tag, print as content
+                        if not content_printed_header and think_printed_header:
+                            print("\n=============== Content =================", flush=True)
+                            content_printed_header = True
+                        print(buffer, end="", flush=True)
+                        buffer = ""
+
+print()
+```
+
+**Output Example:**
+
+```text Output
+=============== Thinking =================
+The user asks: "Solve this problem step by step: What is 15% of 240?" Straightforward. Provide solution: 15% = 15/100 = 0.15. Multiply 240 * 0.15 = 36. Show steps. So answer: 36. Provide explanation.
+
+But also ensure we follow any policy? No issues. Just straightforward.
+
+I'll provide a step-by-step solution.
+
+Also could show fraction: 15% = 15/100 = 3/20, multiply 240 * 3/20 = (240/20)*3 = 12*3 = 36.
+
+Yes. Provide final answer. Also show verification: 10% of 240 is 24, 5% is 12, total 36.
+
+All good.
+
+=============== Content =================
+
+**Step‑by‑step solution**
+
+1. **Convert the percent to a decimal (or a fraction).**
+
+   15% = 15/100 = 0.15 = 3/20
+
+2. **Multiply the original number (240) by this decimal/fraction.**
+
+   Using the decimal:
+   240 × 0.15 = 36
+
+   Or using the fraction:
+   240 × 3/20 = (240/20) × 3 = 12 × 3 = 36
+
+3. **Result:**
+
+   15% of 240 = **36**
+
+*Check:*
+- 10% of 240 = 24
+- 5% of 240 = 12
+- Adding them: 24 + 12 = 36, which matches the calculation.
+``` + +**Note:** The `minimax-append-think` reasoning parser embeds the thinking process in `<think>...</think>` tags within the `content` field. The code above parses these tags in real time to display thinking and content separately. + +#### 4.2.2 Tool Calling + +MiniMax-M2.7 supports tool calling. Enable the tool call parser: + +```bash Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2.7 \ + --tp 4 \ + --tool-call-parser minimax-m2 \ + --reasoning-parser minimax-append-think \ + --trust-remote-code \ + --mem-fraction-static 0.85 +``` + +**Python Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Non-streaming request +response = client.chat.completions.create( + model="MiniMaxAI/MiniMax-M2.7", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools +) + +message = response.choices[0].message + +# Check for tool calls +if message.tool_calls: + for tool_call in message.tool_calls: + print(f"Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") +else: + print(message.content) +``` + +**Output Example**: +```text Output +Tool Call: get_weather + Arguments: {"location": "Beijing"} +``` + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny."
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="MiniMaxAI/MiniMax-M2.7", + messages=messages +) + +print(final_response.choices[0].message.content) +``` + +**Output Example:** +```text Output +The weather in Beijing is currently 22°C and sunny. +``` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. + +**Test Environment**: + +- Hardware: 2× NVIDIA GB300 (275GB per die) +- Docker Image: `lmsysorg/sglang:v0.5.10.post1-cu130` +- Model: MiniMax-M2.7 (FP8) +- Tensor Parallelism: 2 +- SGLang version: 0.5.10.post1 + +### 5.1 Accuracy Benchmark + +**Evaluation Tool**: [NVIDIA NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills) + +**Evaluation Settings**: temperature=0.6, top_p=0.95, 8 seeds, max_tokens=120,000, `parse_reasoning=True` + +#### 5.1.1 GPQA Diamond + +- Dataset: [GPQA Diamond](https://huggingface.co/datasets/Idavidrein/gpqa) (198 questions) +- Prompt: `eval/aai/mcq-4choices` (4-choice multiple choice, matching [Artificial Analysis methodology](https://artificialanalysis.ai/methodology/intelligence-benchmarking)) +- Evaluation command: +```bash Command +ns prepare_data gpqa + +ns eval \ + --cluster=local \ + --server_type=openai \ + --model=MiniMaxAI/MiniMax-M2.7 \ + --server_address=http://localhost:30000/v1 \ + --output_dir=./m2.7-eval/ \ + --benchmarks=gpqa:8 \ + ++prompt_config=eval/aai/mcq-4choices \ + ++inference.tokens_to_generate=120000 \ + ++inference.temperature=0.6 \ + ++inference.top_p=0.95 \ + ++parse_reasoning=True +``` +- Test Results: 
+ + + + + + + + + + + + + + + + + + + + + + + + + + +
| Evaluation Mode | Accuracy | No Answer |
| --- | --- | --- |
| pass@1 (avg-of-8) | 84.91% | 3.54% |
| **majority@8** | **88.89%** | 0.00% |
| pass@8 | 96.46% | 0.00% |
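
The three evaluation modes can be made concrete with a small sketch. It uses the common definitions of these metrics and toy data; it is not the NeMo-Skills implementation, and details such as tie-breaking in the majority vote are assumptions. The names `majority_vote` and `summarize` are hypothetical.

```python
from collections import Counter


def majority_vote(samples):
    """Most frequent extracted answer, ignoring missing ones (None)."""
    counts = Counter(a for a in samples if a is not None)
    return counts.most_common(1)[0][0] if counts else None


def summarize(answers, gold):
    """answers[q] holds the k sampled answers for question q;
    gold[q] is the reference answer for that question."""
    n, k = len(answers), len(answers[0])
    pairs = list(zip(answers, gold))
    return {
        # pass@1 (avg-of-k): accuracy averaged over every sample of every question
        "pass@1": sum(a == g for row, g in pairs for a in row) / (n * k),
        # majority@k: one voted answer per question
        "majority@k": sum(majority_vote(row) == g for row, g in pairs) / n,
        # pass@k: a question counts if any of its k samples is correct
        "pass@k": sum(any(a == g for a in row) for row, g in pairs) / n,
        # share of samples from which no answer could be extracted
        "no_answer@1": sum(a is None for row in answers for a in row) / (n * k),
    }


# Toy data: 2 questions, 4 samples each (None = no answer extracted).
toy_answers = [["A", "A", "B", None], ["C", "D", "D", "D"]]
toy_gold = ["A", "C"]
print(summarize(toy_answers, toy_gold))
# {'pass@1': 0.375, 'majority@k': 0.5, 'pass@k': 1.0, 'no_answer@1': 0.125}
```

Under these definitions, the reported figures line up with whole-question counts on the 198-question set: 88.89% is consistent with 176/198 and 96.46% with 191/198.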
+ +#### 5.1.2 AIME 2025 + +- Dataset: AIME 2025 (30 problems) +- Prompt: `generic/math` (boxed answer format) +- Evaluation command: +```bash Command +ns prepare_data aime25 + +ns eval \ + --cluster=local \ + --server_type=openai \ + --model=MiniMaxAI/MiniMax-M2.7 \ + --server_address=http://localhost:30000/v1 \ + --output_dir=./m2.7-eval/ \ + --benchmarks=aime25:8 \ + ++inference.tokens_to_generate=120000 \ + ++inference.temperature=0.6 \ + ++inference.top_p=0.95 \ + ++parse_reasoning=True +``` +- Test Results: + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Evaluation Mode | Accuracy | No Answer |
| --- | --- | --- |
| pass@1 (avg-of-8) | 92.50% ± 5.56% | 2.92% |
| **majority@8** | **97.08%** | 0.00% |
| pass@8 | 100.00% | 0.00% |
+ +#### 5.1.3 MMLU-Pro + +- Dataset: [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) (12,032 questions, 10-choice) +- Prompt: `eval/aai/mcq-10choices` (10-choice multiple choice) +- Evaluation command: +```bash Command +ns prepare_data mmlu-pro + +ns eval \ + --cluster=local \ + --server_type=openai \ + --model=MiniMaxAI/MiniMax-M2.7 \ + --server_address=http://localhost:30000/v1 \ + --output_dir=./m2.7-eval/ \ + --benchmarks=mmlu-pro \ + ++prompt_config=eval/aai/mcq-10choices \ + ++inference.tokens_to_generate=32768 \ + ++inference.temperature=0.0 \ + ++parse_reasoning=True +``` +- Test Results: + + + + + + + + + + + + + + + + +
| Evaluation Mode | Accuracy | No Answer |
| --- | --- | --- |
| pass@1 (greedy) | 69.41% | 18.75% |
+ +> **Note**: The high no-answer rate is due to the 32K token limit being insufficient for M2.7's extended thinking on some questions. A rerun with 120K tokens is expected to improve accuracy significantly. + +#### 5.1.4 GSM8K Benchmark +- Benchmark Method: 8-shot Chain-of-Thought, evaluated via OpenAI-compatible API +- Test Results: +```text Output +GSM8K Results (8-shot CoT) +Model: MiniMaxAI/MiniMax-M2.7 +Total: 1319 +Correct: 1218 +Accuracy: 92.34% +``` + +### 5.2 Speed Benchmark + +#### 5.2.1 Low Concurrency + +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.7 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 34.33 +Total input tokens: 6101 +Total generated tokens: 4220 +Request throughput (req/s): 0.29 +Input token throughput (tok/s): 177.71 +Output token throughput (tok/s): 122.92 +Total token throughput (tok/s): 300.63 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3431.21 +Median E2E Latency (ms): 2742.57 +---------------Time to First Token---------------- +Mean TTFT (ms): 50.28 +Median TTFT (ms): 53.85 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 8.02 +Median TPOT (ms): 8.01 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 8.03 +Median ITL (ms): 8.02 +================================================== +``` + +#### 5.2.2 High Concurrency + +- Benchmark Command: +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2.7 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 100.20 +Total input tokens: 249831 +Total generated tokens: 252662 +Request throughput (req/s): 4.99 +Input token throughput (tok/s): 2493.41 +Output token throughput (tok/s): 2521.66 +Total token throughput (tok/s): 5015.07 +Concurrency: 90.19 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 18072.69 +Median E2E Latency (ms): 17761.84 +---------------Time to First Token---------------- +Mean TTFT (ms): 247.94 +Median TTFT (ms): 92.05 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 35.75 +Median TPOT (ms): 36.67 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 35.34 +Median ITL (ms): 30.55 +================================================== +``` diff --git a/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.mdx b/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.mdx new file mode 100644 index 000000000000..11758aa1ce69 --- /dev/null +++ b/docs_new/cookbook/autoregressive/MiniMax/MiniMax-M2.mdx @@ -0,0 +1,541 @@ +--- +title: MiniMax-M2 +metatags: + description: "Deploy MiniMax-M2 with SGLang - community contribution guide for MiniMax M2 model deployment." +--- + +import { MiniMaxM2Deployment } from '/src/snippets/autoregressive/minimax-m2-deployment.jsx'; + +## 1. 
Model Introduction + +[MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) is a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. + +This generation delivers comprehensive upgrades across the board: + +- **Superior Intelligence**: MiniMax-M2 demonstrates highly competitive general intelligence across mathematics, science, instruction following, coding, and agentic tool use in [Artificial Analysis](https://artificialanalysis.ai/). Its composite score ranks #1 among open-source models globally. + +- **Advanced Coding**: Engineered for end-to-end developer workflows, MiniMax-M2 excels at multi-file edits, coding-run-fix loops, and test-validated repairs. Strong performance on Terminal-Bench and (Multi-)SWE-Bench–style tasks demonstrates practical effectiveness in terminals, IDEs, and CI across languages. + +- **Agent Performance**: MiniMax-M2 plans and executes complex, long-horizon toolchains across shell, browser, retrieval, and code runners. In BrowseComp-style evaluations, it consistently locates hard-to-surface sources, keeps evidence traceable, and gracefully recovers from flaky steps. + +- **Efficient Design**: With 10 billion activated parameters (230 billion in total), MiniMax-M2 delivers lower latency, lower cost, and higher throughput for interactive agents and batched sampling, in line with the shift toward highly deployable models that still perform well on coding and agentic tasks. + +For more details, please refer to the [official MiniMax GitHub repository](https://github.com/MiniMax-AI). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements.
+ +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. The AMD environment is currently available in SGLang via Docker image install. + +### 2.1 AMD Docker +#### 2.1.1 Launch docker +```shell Command +docker pull lmsysorg/sglang:v0.5.9-rocm720-mi30x +``` +```shell Command +docker run -d -it --ipc=host --network=host --privileged \ + --cap-add=CAP_SYS_ADMIN \ + --device=/dev/kfd --device=/dev/dri --device=/dev/mem \ + --group-add video --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v /:/work \ + -e SHELL=/bin/bash \ + --name Minimax \ + lmsysorg/sglang:v0.5.9-rocm720-mi30x \ + /bin/bash +``` + +#### 2.1.2 Make modifications inside the docker + +```shell Command +mv /sgl-workspace/sglang/python/sglang/srt/models/transformers.py \ + /sgl-workspace/sglang/python/sglang/srt/models/hf_transformers_model.py +``` + +#### 2.1.3 Fix torch compile +Comment out the following line: @torch.compile(dynamic=True, backend=get_compiler_backend()) in /sgl-workspace/sglang/python/sglang/srt/models/minimax_m2.py +```shell Command +#@torch.compile(dynamic=True, backend=get_compiler_backend()) +``` + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. + + + +## 4. 
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser +Server Command: +```shell Command + sglang serve \ + --model-path MiniMaxAI/MiniMax-M2 \ + --tp-size 4 \ + --reasoning-parser minimax-append-think \ + --trust-remote-code \ + --mem-fraction-static 0.85 +``` +Test Code: +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="MiniMaxAI/MiniMax-M2", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.6, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` +Output Example: +```text Output +First, the user asks: "What is 15% of 240?" This is a straightforward percentage calculation. I need to solve it step by step as per the instruction. + +The problem is: What is 15% of 240? + +To find a percentage of a number, I multiply the number by the percentage divided by 100. 
So, 15% is 15/100, which simplifies to 0.15. + +Therefore, 15% of 240 is 240 times 0.15. + +Let me calculate that: 240 × 0.15. + +I can break it down: 240 × 0.15 = 240 × (15/100) = (240 × 15) / 100. + +Now, 240 × 15. 200 × 15 = 3000, and 40 × 15 = 600, so total 3000 + 600 = 3600. + +Then, divide by 100: 3600 / 100 = 36. + +So, 15% of 240 is 36. + +I should confirm this with another method. For example, 10% of 240 is 24, and 5% is half of that, which is 12. Then 15% is 10% + 5% = 24 + 12 = 36. Same answer. + +Or, using fractions: 15% = 3/20, so 240 × 3/20 = (240 / 20) × 3 = 12 × 3 = 36. + +All methods confirm it's 36. + +The user said "solve this problem step by step," so I should present the steps clearly. + +Step 1: Understand that "15% of 240" means 15 per hundred of 240. + +Step 2: Convert percentage to decimal: 15% = 15/100 = 0.15. + +Step 3: Multiply the number by the decimal: 240 × 0.15. + +Step 4: Calculate the multiplication: 240 × 0.15 = 36. + +Step 5: Therefore, 15% of 240 is 36. + +I should also mention that percentage means per hundred, so it's straightforward. + +Finally, I need to box the answer as per the instruction. + +So, the final answer is 36. + + +To find 15% of 240, follow these steps: + +1. **Understand the percentage**: "15%" means 15 per hundred, or 15/100. +2. **Convert to a decimal**: 15/100 = 0.15. +3. **Multiply by the number**: 240 × 0.15. +4. **Calculate the result**: + - 240 × 0.15 = 36. + +Alternatively, you can break it down: +- 10% of 240 is 24 (since 240 ÷ 10 = 24). +- 5% of 240 is half of 10%, which is 12. +- Therefore, 15% is 10% + 5% = 24 + 12 = 36. + +Both methods confirm the result. 
+ +**Answer**: 36 +``` + +### 4.2.2 Tool Calling + +Server Command: +```shell Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2 \ + --tp-size 4 \ + --tool-call-parser minimax-m2 \ + --trust-remote-code \ + --mem-fraction-static 0.85 +``` +Test Code: +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="MiniMaxAI/MiniMax-M2", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + 
tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` +Output Example: +```text Output +Alright, the user is asking about the weather in Beijing. This is a straightforward request that I can help with using the get_weather tool that's available to me. + +Let me think about what I need to do here. The user wants to know the current weather conditions in Beijing, which is the capital city of China. To provide this information, I need to use the get_weather tool that's been provided to me. + +Looking at the tool's parameters, I can see it requires: +1. location - which is required and should be a string representing the city name +2. unit - which is optional and can be either "celsius" or "fahrenheit" + +For the location parameter, I'll use "Beijing" since that's what the user asked about. + +For the unit parameter, the user didn't specify their preference between celsius and fahrenheit. Since Beijing is in China, which primarily uses celsius, and celsius is the more standard unit internationally, I'll default to celsius. If the user wants the temperature in fahrenheit instead, they can ask in a follow-up message and I can provide that information. + +So I need to make a tool call to get_weather with the following parameters: +- location: "Beijing" +- unit: "celsius" + +This should return the current weather information for Beijing, which I can then share with the user. 
I'll format my response using the required XML tags for tool calls as specified in my instructions. +
+ +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +## 5. Benchmark +### 5.1 Speed Benchmark +**Test Environment**: + +- Hardware: AMD MI300X GPU(4x) + +- Model: MiniMax-M2 + +- Tensor Parallelism: 4 + +- sglang version: 0.5.7 + +**Model Deployment**: + +```bash Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2 \ + --tp-size 4 \ + --trust-remote-code \ + --mem-fraction-static 0.85 +``` + +### 5.1.1 Low Concurrency (Latency-Optimized) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf + +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 138.91 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.07 +Input token throughput (tok/s): 43.92 +Output token throughput (tok/s): 30.38 +Peak output token throughput (tok/s): 46.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 74.30 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 13887.62 +Median E2E Latency (ms): 10377.26 +---------------Time to First Token---------------- +Mean TTFT (ms): 4528.94 +Median TTFT (ms): 385.23 +P99 TTFT (ms): 38338.51 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 22.21 +Median TPOT (ms): 22.24 +P99 TPOT (ms): 22.25 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 22.23 +Median ITL (ms): 22.24 +P95 ITL (ms): 22.35 +P99 ITL (ms): 22.41 +Max ITL (ms): 23.64 +================================================== +``` + +### 5.1.2 Medium Concurrency (Balanced) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf + +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 81.07 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40803 +Request throughput (req/s): 0.99 +Input token throughput (tok/s): 489.29 +Output token throughput (tok/s): 503.32 +Peak output token throughput (tok/s): 704.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 992.61 +Concurrency: 13.74 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 13925.95 +Median E2E Latency (ms): 14348.75 +---------------Time to First Token---------------- +Mean TTFT (ms): 532.32 +Median TTFT (ms): 147.69 +P99 TTFT (ms): 1978.48 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 27.49 +Median TPOT (ms): 26.56 +P99 TPOT (ms): 46.52 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 26.31 +Median ITL (ms): 23.47 +P95 ITL (ms): 24.37 +P99 ITL (ms): 125.10 +Max ITL (ms): 1192.51 +================================================== +``` + +### 5.1.3 High Concurrency (Throughput-Optimized) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model MiniMaxAI/MiniMax-M2 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 153.71 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 250982 +Request throughput (req/s): 3.25 +Input token throughput (tok/s): 1625.33 +Output token throughput (tok/s): 1643.75 +Peak output token throughput (tok/s): 2597.00 +Peak concurrent requests: 107 +Total token throughput (tok/s): 3269.09 +Concurrency: 91.14 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 28017.24 +Median E2E Latency (ms): 26865.28 +---------------Time to First Token---------------- +Mean TTFT (ms): 387.41 +Median TTFT (ms): 183.90 +P99 TTFT (ms): 1192.44 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 55.23 +Median TPOT (ms): 57.84 +P99 TPOT (ms): 70.23 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 54.79 +Median ITL (ms): 39.01 +P95 ITL (ms): 143.10 +P99 ITL (ms): 150.46 +Max ITL (ms): 986.14 +================================================== +``` + +### 5.2 Accuracy Benchmark +#### 5.2.1 GSM8K Benchmark + +- **Server Command**: +```shell Command +sglang serve \ + --model-path MiniMaxAI/MiniMax-M2 \ + --tp-size 4 \ + --trust-remote-code \ + --mem-fraction-static 0.85 +``` + +- **Benchmark Command**: +```shell Command + python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` +- **Result**: + - MiniMax-M2 +```text Output + Accuracy: 0.950 + Invalid: 0.000 + Latency: 15.120 s + Output throughput: 1306.711 token/s +``` diff --git a/docs_new/cookbook/autoregressive/Mistral/Devstral-2.mdx b/docs_new/cookbook/autoregressive/Mistral/Devstral-2.mdx new file mode 100644 index 000000000000..845db315df02 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Mistral/Devstral-2.mdx @@ -0,0 +1,518 @@ +--- +title: Devstral 2 (Mistral) +metatags: + description: "Deploy Devstral 2 agentic coding models with SGLang - optimized for tool use, codebase exploration, and multi-file edits with 256K context." +--- + +## 1. Model Introduction + +**Devstral 2** is an agentic LLM family for software engineering tasks. It is designed for agentic workflows such as tool use, codebase exploration, and multi-file edits, and achieves strong performance on **SWE-bench**. + +The **Devstral 2 Instruct** checkpoints are instruction-tuned **FP8** models, making them a good fit for chat, tool-using agents, and instruction-following SWE workloads. 
+ +**Key Features:** + +- **Agentic coding**: Optimized for tool-driven coding and software engineering agents +- **Improved performance**: A step up compared to earlier Devstral models +- **Better generalization**: More robust across diverse prompts and coding environments +- **Long context**: Up to a **256K** context window + +**Use Cases:** +AI code assistants, agentic coding, and software engineering tasks that require deep codebase understanding and tool integration. + +For enterprises requiring specialized capabilities (increased context, domain-specific knowledge, etc.), please reach out to Mistral. + +**Models:** + +- **Collection**: [mistralai/devstral-2 (Hugging Face)](https://huggingface.co/collections/mistralai/devstral-2) +- **FP8 Instruct**: + - **[mistralai/Devstral-2-123B-Instruct-2512](https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512)** + - **[mistralai/Devstral-Small-2-24B-Instruct-2512](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512)** + +--- + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + + +Devstral 2 requires a recent `transformers`. Please verify `transformers >= 5.0.0.rc`: + +```shell Command +python -c "import transformers; print(transformers.__version__)" +``` + +If your version is lower, upgrade: + +```shell Command +pip install -U --pre "transformers>=5.0.0rc0" +``` + + +--- + +## 3. Model Deployment + +### 3.1 Basic configuration + +**Interactive Command Generator**: Use the configuration selector below to generate a launch command for Devstral Small 2 (24B) or Devstral 2 (123B). + + +The TP size is set to the minimum required for the selected model size. 
+ + + +import { Devstral2Deployment } from "/src/snippets/autoregressive/devstral-2-deployment.jsx"; + + + +### 3.2 Configuration tips + +- **Context length vs memory**: Devstral 2 advertises a long context window; if you are memory-constrained, start by lowering `--context-length` (for example `32768`) and increase once things are stable. +- **FP8 checkpoints**: Both Devstral Small 2 and Devstral 2 are published as **FP8** weights. If you hit kernel / dtype issues, try a newer SGLang build and recent CUDA drivers. + +--- + +## 4. Model Invocation + +### 4.1 Basic Usage (OpenAI-Compatible API) + +SGLang exposes an OpenAI-compatible endpoint. Example: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +resp = client.chat.completions.create( + model="mistralai/Devstral-Small-2-24B-Instruct-2512", + messages=[ + {"role": "system", "content": "You are a helpful coding assistant."}, + {"role": "user", "content": "Write a Python function that retries a request with exponential backoff."}, + ], + temperature=0.2, + max_tokens=512, +) + +print(resp.choices[0].message.content) +``` + +**Output Example:** + +```text Output + Here's a Python function that implements exponential backoff for retrying a request. This function uses the `requests` library to make HTTP requests and includes error handling for common HTTP and connection errors. + + ```python + import time + import requests + from requests.exceptions import RequestException + + def retry_with_exponential_backoff( + url, + max_retries=3, + initial_delay=1, + backoff_factor=2, + method="GET", + **kwargs + ): + """ + Retry a request with exponential backoff. + + Parameters: + - url: The URL to request. + - max_retries: Maximum number of retry attempts (default: 3). + - initial_delay: Initial delay in seconds (default: 1). + - backoff_factor: Multiplier for the delay between retries (default: 2). 
+ - method: HTTP method to use (default: "GET"). + - **kwargs: Additional arguments to pass to the request function (e.g., headers, data, etc.). + + Returns: + - Response object if the request succeeds. + - Raises an exception if all retries fail. + """ + retry_count = 0 + delay = initial_delay + + while retry_count < max_retries: + try: + response = requests.request(method, url, **kwargs) + # Check if the response status code indicates success + if response.status_code < 400: + return response + else: + raise RequestException(f"HTTP {response.status_code}: {response.text}") + + except RequestException as e: + if retry_count == max_retries - 1: + raise Exception(f"All retries failed. Last error: {e}") + + print(f"Attempt {retry_count + 1} failed. Retrying in {delay} seconds...") + time.sleep(delay) +... +``` + +### 4.2 Tool calling (optional) + +Devstral 2 supports tool calling capabilities. Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model mistralai/Devstral-2-123B-Instruct-2512 \ + --tp 2 \ + --tool-call-parser mistral +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="mistralai/Devstral-2-123B-Instruct-2512", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + 
+# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing"} +``` + + +## AMD GPU Support + +## 1. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. 
+ + +### 1.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 1.2 Advanced Usage + + +```shell Command +python3 -m sglang.launch_server \ + --model-path mistralai/Devstral-2-123B-Instruct-2512 \ + --tp 8 \ + --trust-remote-code \ + --port 8888 +``` + +## 2. Benchmark + +### 5.1 Benchmark Commands + +**Scenario 1: Chat (1K/1K) - Most Important** + +- **Model Deployment** + +```bash Command +python3 -m sglang.launch_server \ + --model-path mistralai/Devstral-2-123B-Instruct-2512 \ + --tp 8 \ + --trust-remote-code \ + --port 8888 +``` + +- Low Concurrency (Latency-Optimized) + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model mistralai/Devstral-2-123B-Instruct-2512 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf \ + --port 8888 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 94.30 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4206 +Request throughput (req/s): 0.11 +Input token throughput (tok/s): 64.70 +Output token throughput (tok/s): 44.75 +Peak output token throughput (tok/s): 82.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 109.44 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 9427.59 +Median E2E Latency (ms): 5637.23 +---------------Time to First Token---------------- +Mean TTFT (ms): 4253.85 +Median TTFT (ms): 116.95 +P99 TTFT (ms): 37764.48 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 12.28 +Median TPOT (ms): 12.29 +P99 TPOT (ms): 12.30 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 12.29 +Median ITL (ms): 12.29 +P95 ITL (ms): 12.38 +P99 ITL (ms): 12.42 +Max ITL (ms): 12.90 +================================================== +``` + +- Medium Concurrency (Balanced) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model mistralai/Devstral-2-123B-Instruct-2512 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf \ + --port 8888 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 52.11 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40761 +Request throughput (req/s): 1.54 +Input token throughput (tok/s): 761.31 +Output token throughput (tok/s): 783.13 +Peak output token throughput (tok/s): 1120.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 1544.44 +Concurrency: 13.60 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8856.19 +Median E2E Latency (ms): 9314.71 +---------------Time to First Token---------------- +Mean TTFT (ms): 398.80 +Median TTFT (ms): 127.81 +P99 TTFT (ms): 1500.32 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 17.32 +Median TPOT (ms): 16.90 +P99 TPOT (ms): 32.78 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 16.61 +Median ITL (ms): 14.26 +P95 ITL (ms): 15.07 +P99 ITL (ms): 114.46 +Max ITL (ms): 1224.45 +================================================== +``` + +- High Concurrency (Throughput-Optimized) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model mistralai/Devstral-2-123B-Instruct-2512 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf \ + --port 8888 +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 116.08 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 252523 +Request throughput (req/s): 4.31 +Input token throughput (tok/s): 2152.21 +Output token throughput (tok/s): 2176.60 +Peak output token throughput (tok/s): 3600.00 +Peak concurrent requests: 107 +Total token throughput (tok/s): 4328.81 +Concurrency: 92.42 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 21456.71 +Median E2E Latency (ms): 20126.82 +---------------Time to First Token---------------- +Mean TTFT (ms): 291.60 +Median TTFT (ms): 199.24 +P99 TTFT (ms): 866.02 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 42.42 +Median TPOT (ms): 45.18 +P99 TPOT (ms): 53.32 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 41.97 +Median ITL (ms): 27.59 +P95 ITL (ms): 130.43 +P99 ITL (ms): 137.87 +Max ITL (ms): 616.73 +================================================== +``` + + + +### 5.2 Understanding the Results + +**Key Metrics:** + +- **Request Throughput (req/s)**: Number of requests completed per second +- **Output Token Throughput (tok/s)**: Output tokens generated per second, summed over all concurrent requests +- **Mean TTFT (ms)**: Time to First Token - measures responsiveness +- **Mean TPOT (ms)**: Time Per Output Token, excluding the first token - measures generation speed +- **Mean ITL (ms)**: Inter-Token Latency - measures streaming consistency + +**Why These Configurations Matter:** + +- **1K/1K (Chat)**: Represents the most common conversational AI workload. This is the highest priority scenario for most deployments. +- **1K/8K (Reasoning)**: Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations. +- **8K/1K (Summarization)**: Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks. +- **Variable Concurrency**: Captures the Pareto frontier - the optimal trade-off between throughput and latency at different load levels. Low concurrency shows best-case latency; high concurrency shows maximum throughput.
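As a rough sanity check on the key metrics, the sketch below recomputes throughput and TPOT from raw counts. It assumes TPOT is (E2E latency - TTFT) / (output tokens - 1), which is consistent with the "excl. 1st token" label in the benchmark output but is not taken from the `sglang.bench_serving` source. The function names are illustrative only. The token counts and duration are copied from the high-concurrency run above.

```python
def throughput_tok_s(total_tokens: int, duration_s: float) -> float:
    """Tokens per second over the whole benchmark run."""
    return total_tokens / duration_s


def tpot_ms(e2e_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token for one request, excluding the first token.

    Assumed definition: (E2E latency - TTFT) / (output tokens - 1).
    """
    if output_tokens < 2:
        return 0.0  # no tokens after the first one, so TPOT is undefined
    return (e2e_latency_ms - ttft_ms) / (output_tokens - 1)


# Values copied from the high-concurrency benchmark output.
duration_s = 116.08
input_tokens = 249831
output_tokens = 252662

input_tps = throughput_tok_s(input_tokens, duration_s)
output_tps = throughput_tok_s(output_tokens, duration_s)
total_tps = input_tps + output_tps

print(f"input tok/s:  {input_tps:.2f}")
print(f"output tok/s: {output_tps:.2f}")
print(f"total tok/s:  {total_tps:.2f}")
```

The recomputed values should land within rounding distance of the reported throughput lines, because the printed duration is itself rounded to two decimals.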
+ +**Interpreting Results:** + +- Compare your results against baseline numbers for your hardware +- Higher throughput at same latency = better performance +- Lower TTFT = more responsive user experience +- Lower TPOT = faster generation speed + +### 5.3 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.3.1 GSM8K Benchmark + +- Benchmark Command + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py \ + --num-shots 8 \ + --num-questions 1316 \ + --parallel 1316 \ + --port 8888 +``` + +**Test Results:** + +```text Output +Accuracy: 0.922 +Invalid: 0.000 +Latency: 35.800 s +Output throughput: 4507.697 token/s +``` diff --git a/docs_new/cookbook/autoregressive/Mistral/Ministral-3.mdx b/docs_new/cookbook/autoregressive/Mistral/Ministral-3.mdx new file mode 100644 index 000000000000..dd2dd8eb1a45 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Mistral/Ministral-3.mdx @@ -0,0 +1,288 @@ +--- +title: Ministral-3 +metatags: + description: "Deploy Mistral 3 with SGLang - deployment configurations and usage patterns for Mistral's latest model." +--- + +import { Ministral3Deployment } from '/src/snippets/autoregressive/ministral-3-deployment.jsx'; + +## 1. Model Introduction +The largest model in the Ministral 3 family, Ministral 3 14B offers frontier capabilities and performance comparable to its larger Mistral Small 3.2 24B counterpart. A powerful and efficient language model with vision capabilities. + +The Ministral 3 14B Instruct model offers the following capabilities: + +Vision: Enables the model to analyze images and provide insights based on visual content, in addition to text. +Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic. +System Prompt: Maintains strong adherence and support for system prompts. +Agentic: Offers best-in-class agentic capabilities with native function calling and JSON outputting. 
+Edge-Optimized: Delivers best-in-class performance at a small scale, deployable anywhere. +Apache 2.0 License: Open-source license allowing usage and modification for both commercial and non-commercial purposes. +Large Context Window: Supports a 256k context window. + +For further details, please refer to the [official documentation](https://github.com/mistralai). + +## 2. SGLang Installation + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. + + + +### 3.2 Configuration Tips +**Context length vs memory**: Ministral-3 advertises a long context window; if you are memory-constrained, start by lowering `--context-length` (for example `32768`) and increase it once things are stable. + +**Pre-installation steps**: Run the following commands inside the Docker container after launching it: +```shell Command +pip install mistral-common --upgrade +pip install transformers==5.0.0.rc0 +``` +## 4.
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 Launch the docker +```shell Command +docker pull lmsysorg/sglang:v0.5.9-rocm720-mi30x +``` + +```shell Command +docker run -d -it --ipc=host --network=host --privileged \ + --cap-add=CAP_SYS_ADMIN \ + --device=/dev/kfd --device=/dev/dri --device=/dev/mem \ + --group-add video --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v /:/work \ + -e SHELL=/bin/bash \ + --name Ministral \ + lmsysorg/sglang:v0.5.9-rocm720-mi30x \ + /bin/bash +``` + +#### 4.2.2 Launch the server +```shell Command +sglang serve \ + --model-path mistralai/Ministral-3-14B-Instruct-2512 \ + --tp 1 \ + --trust-remote-code +``` + +## 5. Benchmark + +This section uses **industry-standard configurations** for comparable benchmark results. 
+ +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: MI300X GPU (8x) +- Model: mistralai/Ministral-3-14B-Instruct-2512 +- Tensor Parallelism: 1 +- SGLang Version: 0.5.7 + +- Model Deployment Command: + +```bash Command +sglang serve \ + --model-path mistralai/Ministral-3-14B-Instruct-2512 \ + --tp 1 \ + --trust-remote-code +``` + +##### Low Concurrency +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model mistralai/Ministral-3-14B-Instruct-2512 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 65.08 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4218 +Request throughput (req/s): 0.15 +Input token throughput (tok/s): 93.75 +Output token throughput (tok/s): 64.84 +Peak output token throughput (tok/s): 151.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 158.59 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6505.51 +Median E2E Latency (ms): 3037.37 +---------------Time to First Token---------------- +Mean TTFT (ms): 3709.33 +Median TTFT (ms): 53.72 +P99 TTFT (ms): 33320.77 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 6.63 +Median TPOT (ms): 6.64 +P99 TPOT (ms): 6.66 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 6.64 +Median ITL (ms): 6.65 +P95 ITL (ms): 6.75 +P99 ITL (ms): 6.82 +Max ITL (ms): 8.45 +================================================== +``` + +##### Medium Concurrency +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model mistralai/Ministral-3-14B-Instruct-2512 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 31.20 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40783 +Request throughput (req/s): 2.56 +Input token throughput (tok/s): 1271.38 +Output token throughput (tok/s): 1307.82 +Peak output token throughput (tok/s): 1760.00 +Peak concurrent requests: 22 +Total token throughput (tok/s): 2579.20 +Concurrency: 13.72 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 5351.07 +Median E2E Latency (ms): 5626.45 +---------------Time to First Token---------------- +Mean TTFT (ms): 280.87 +Median TTFT (ms): 68.16 +P99 TTFT (ms): 1194.79 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.47 +Median TPOT (ms): 10.10 +P99 TPOT (ms): 20.00 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.96 +Median ITL (ms): 9.10 +P95 ITL (ms): 9.87 +P99 ITL (ms): 51.39 +Max ITL (ms): 888.63 +================================================== +``` + +##### High Concurrency +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model mistralai/Ministral-3-14B-Instruct-2512 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 88.75 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 252547 +Request throughput (req/s): 5.63 +Input token throughput (tok/s): 2815.01 +Output token throughput (tok/s): 2846.91 +Peak output token throughput (tok/s): 4271.00 +Peak concurrent requests: 110 +Total token throughput (tok/s): 5661.93 +Concurrency: 93.04 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 16514.45 +Median E2E Latency (ms): 15834.45 +---------------Time to First Token---------------- +Mean TTFT (ms): 148.57 +Median TTFT (ms): 99.15 +P99 TTFT (ms): 455.86 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 32.93 +Median TPOT (ms): 34.73 +P99 TPOT (ms): 38.05 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 32.45 +Median ITL (ms): 27.30 +P95 ITL (ms): 71.73 +P99 ITL (ms): 73.45 +Max ITL (ms): 328.10 +================================================== +``` + +### 5.2 Accuracy Benchmark + +Document model accuracy on standard benchmarks: + +#### 5.2.1 GSM8K Benchmark + +- Benchmark Command + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py \ + --num-shots 8 \ + --num-questions 1316 \ + --parallel 1316 +``` + +**Test Results:** + +```text Output +Accuracy: 0.959 +Invalid: 0.000 +Latency: 29.185 s +Output throughput: 4854.672 token/s +``` diff --git a/docs_new/cookbook/autoregressive/Mistral/Mistral-Medium-3.5.mdx b/docs_new/cookbook/autoregressive/Mistral/Mistral-Medium-3.5.mdx new file mode 100644 index 000000000000..69af325d063a --- /dev/null +++ b/docs_new/cookbook/autoregressive/Mistral/Mistral-Medium-3.5.mdx @@ -0,0 +1,463 @@ +--- +title: Mistral Medium 3.5 +metatags: + description: "Deploy Mistral Medium 3.5 with SGLang - 128B dense flagship merged model with hybrid reasoning, 256K context, vision input, and FP8 quantization." +--- + +import { MistralMedium35Deployment } from '/src/snippets/autoregressive/mistral-medium-3-5-deployment.jsx'; + +## 1. Model Introduction + +**Mistral Medium 3.5** is Mistral AI's first flagship **merged model** — a single dense 128B checkpoint that handles instruction following, reasoning, and coding in one set of weights. It replaces Mistral Medium 3.1 and Magistral in Le Chat, and replaces Devstral 2 in the Vibe coding agent. Reasoning effort is configurable per request, so the same model can answer a quick chat reply or work through a deep agentic run. The vision encoder was trained from scratch to handle variable image sizes and aspect ratios. 
+ +**Key Features:** + +- **Dense 128B parameters** — no MoE, no MLA, plain GQA (96 heads, 8 KV heads, head_dim=128) +- **256K context window** — YARN RoPE scaling on top of the original 4K base +- **Hybrid Reasoning**: Toggle between instant reply and deep reasoning per request via `reasoning_effort` (`"none"` or `"high"`) +- **Vision**: Accepts text + image input; from-scratch encoder that handles variable image sizes/aspect ratios +- **Function Calling**: Native tool calling and JSON output +- **FP8 Native**: Released with FP8 e4m3 static-tensor quantization built in +- **Multilingual**: 24 supported languages including English, French, German, Spanish, Portuguese, Italian, Japanese, Korean, Russian, Chinese, Arabic, Persian, Indonesian, Malay, Nepali, Polish, Romanian, Serbian, Swedish, Turkish, Ukrainian, Vietnamese, Hindi, and Bengali +- **License**: Modified MIT (open for commercial and non-commercial use except for companies with large revenue) + +**Architecture:** + +- Mistral 3 backbone with YARN RoPE for 256K context +- Dense (no MoE), 128B parameters +- Standard GQA attention (not MLA) +- Pixtral-style vision encoder (48 layers, patch_size=14, spatial_merge=2, image_size=1540) trained from scratch +- Multimodal input: text + image + +**Models:** + +- **[mistralai/Mistral-Medium-3.5-128B](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B)** (FP8) + +The HuggingFace repo ships both the mistral native layout (`params.json` + `consolidated-*.safetensors`) and the HF layout (`config.json` + `model-*.safetensors`). SGLang auto-detects the format — the HF layout is preferred when both are present. + +--- + +## 2. SGLang Installation + +Refer to the [official SGLang installation guide](../../../docs/get-started/install). 
+ +**Docker Images by Hardware:** + +| Hardware | Docker Image | +| --- | --- | +| H100 / H200 (Hopper, CUDA 12.9) | `lmsysorg/sglang:dev-mistral-medium-3.5` | +| B200 / B300 (Blackwell, CUDA 13.0) | `lmsysorg/sglang:dev-cu13-mistral-medium-3.5` | + +> Day-0 support for Mistral Medium 3.5 is not yet in `lmsysorg/sglang:latest` — pull one of the tags above (matching your GPU's CUDA driver) until the changes propagate to the next stable release. + +--- + +## 3. Model Deployment + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to generate a launch command for Mistral Medium 3.5. + + + +### 3.2 Configuration Tips + +- **Tensor Parallelism**: Mistral Medium 3.5 FP8 (~130 GB) requires `--tp 4` on Hopper (H100/H200) and `--tp 2` on Blackwell (B200/B300). +- **Reasoning effort**: Reasoning depth is configurable per request via `reasoning_effort` (`"none"`, `"high"`). No restart required — toggle per call. +- **Recommended temperature**: `0.7` when `reasoning_effort="high"`. Anywhere from `0.0` to `0.7` when `reasoning_effort="none"`, depending on the task — lower for to-the-point answers, higher for creative output. +- **Context length vs memory**: The model has a 256K context window. If you are memory-constrained, lower `--context-length` (e.g. `32768`) and increase once things are stable. +- **Tool calling**: Enable `--tool-call-parser mistral` to activate native function calling support. +- **Reasoning parser**: Enable `--reasoning-parser mistral` to separate `reasoning_content` from the main response content. +- **System prompt**: The model ships with a recommended system prompt in `chat_template.jinja` and `SYSTEM_PROMPT.txt`. If you do not pass a system message yourself, the chat template injects Mistral's default (model identity, current date, tool-use guidelines). 
For full fidelity with Mistral's reference setup, load `SYSTEM_PROMPT.txt` from the HF repo and substitute `{name}`, `{today}`, `{yesterday}` (see Section 4.6). + +### 3.3 Speculative Decoding (EAGLE) + +Mistral ships an EAGLE draft head, [`mistralai/Mistral-Medium-3.5-128B-EAGLE`](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B-EAGLE), that lets you run speculative decoding on top of the dense 128B target. The draft is a 2-layer GQA body sharing the target's vocab/head, FP8-quantized like the target (~4 GB), and is meant for low-concurrency latency-bound serving. + +```bash Command +python -m sglang.launch_server \ + --model-path mistralai/Mistral-Medium-3.5-128B \ + --tp 4 \ + --dtype bfloat16 \ + --tool-call-parser mistral \ + --reasoning-parser mistral \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path mistralai/Mistral-Medium-3.5-128B-EAGLE \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --port 30000 +``` + +- **`--dtype bfloat16` is required.** The draft `params.json` does not carry a `dtype` field, so `--dtype auto` falls back to fp32 and downcasts to fp16, which conflicts with the bf16 target when the embed/head are shared. Setting bf16 explicitly keeps both sides aligned (this is a no-op for the target — it already loads as bf16). +- The draft uses the same vocab and lm_head as the target. Memory overhead on top of the base model is ~4 GB per TP shard. +- `(num-steps, eagle-topk, num-draft-tokens) = (3, 1, 4)` is the recommended starting point. Tune for your workload — wider trees (higher `eagle-topk` / `num-draft-tokens`) help high-acceptance (templated) outputs, narrower trees keep latency tight on more diverse text. +- EAGLE shines at low concurrency. At high concurrency, throughput is dominated by the target's batched forward pass and the draft's contribution shrinks; consider running without EAGLE for batch-serving workloads. + +--- + +## 4. 
Model Invocation + +### 4.1 Thinking Mode + +Mistral Medium 3.5 is a hybrid reasoning model. By default it does not produce a reasoning trace — pass `reasoning_effort="high"` to switch on the deep-reasoning path. Mistral recommends `temperature=0.7` for reasoning mode. + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +response = client.chat.completions.create( + model="mistralai/Mistral-Medium-3.5-128B", + messages=[ + {"role": "user", "content": "Solve step by step: what is 17 × 23 + 144 / 12?"}, + ], + temperature=0.7, + extra_body={"reasoning_effort": "high"}, +) + +print("Reasoning:", response.choices[0].message.reasoning_content) +print("Answer:", response.choices[0].message.content) +``` + +**Output:** + +```text Output +Reasoning: I need to follow the order of operations (PEMDAS/BODMAS): multiplication and +division before addition, evaluated left to right. + +17 × 23: I'll break it as 17 × (20 + 3) = 340 + 51 = 391. +144 / 12 = 12. +Finally, 391 + 12 = 403. + +Answer: **17 × 23 + 144 / 12 = 403** + +Step by step: +1. 17 × 23 = 391 +2. 144 / 12 = 12 +3. 391 + 12 = 403 +``` + +### 4.2 Instruct Mode (Reasoning Off) + +To skip the reasoning trace and get a fast direct response, set `reasoning_effort="none"`. For instruct mode, Mistral recommends temperature in the `0.0`–`0.7` range depending on how creative the task is: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +response = client.chat.completions.create( + model="mistralai/Mistral-Medium-3.5-128B", + messages=[ + {"role": "user", "content": "What is the capital of France?"}, + ], + temperature=0.1, + extra_body={"reasoning_effort": "none"}, +) + +print(response.choices[0].message.content) +``` + +**Output:** + +```text Output +The capital of France is **Paris**. 
It is one of the most famous and visited cities in +the world, known for its rich history, art, culture, and landmarks like the Eiffel Tower, +Louvre Museum, and Notre-Dame Cathedral. +``` + +### 4.3 Streaming with Reasoning + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +stream = client.chat.completions.create( + model="mistralai/Mistral-Medium-3.5-128B", + messages=[ + {"role": "user", "content": "Explain the difference between async and threading in Python."}, + ], + temperature=0.7, + extra_body={"reasoning_effort": "high"}, + stream=True, +) + +print("=== Reasoning ===") +for chunk in stream: + delta = chunk.choices[0].delta + if hasattr(delta, "reasoning_content") and delta.reasoning_content: + print(delta.reasoning_content, end="", flush=True) + elif delta.content: + print("\n=== Response ===") + print(delta.content, end="", flush=True) +print() +``` + +### 4.4 Tool Calling + +Mistral Medium 3.5 supports native function calling. 
Enable with `--tool-call-parser mistral`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a city", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string", "description": "City name"}, + "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}, + }, + "required": ["location"], + }, + }, + } +] + +response = client.chat.completions.create( + model="mistralai/Mistral-Medium-3.5-128B", + messages=[{"role": "user", "content": "What's the weather in Paris?"}], + tools=tools, + tool_choice="auto", +) + +tool_calls = response.choices[0].message.tool_calls +for tc in tool_calls: + print(f"Tool: {tc.function.name}") + print(f"Args: {tc.function.arguments}") +``` + +**Output:** + +```text Output +Tool: get_weather +Args: {"location": "Paris"} +``` + +### 4.5 Vision (Image Input) + +Mistral Medium 3.5 accepts image inputs alongside text. The vision encoder was retrained from scratch to handle variable image sizes and aspect ratios: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +response = client.chat.completions.create( + model="mistralai/Mistral-Medium-3.5-128B", + messages=[ + { + "role": "user", + "content": [ + {"type": "text", "text": "Describe what you see in this image."}, + { + "type": "image_url", + "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"}, + }, + ], + } + ], + temperature=0.7, + extra_body={"reasoning_effort": "none"}, +) + +print(response.choices[0].message.content) +``` + +**Output:** + +```text Output +The image features a stylized representation of the acronym "SGL." The letters +are large, bold, and orange with a brown outline, giving them a three-dimensional +effect. 
To the left of the letters, there is a graphic that resembles a neuron +or a node with connections, also in a similar orange and brown color scheme. The +node has a code symbol () inside a square, suggesting a connection to +programming or technology. +``` + +### 4.6 Loading the Reference System Prompt + +Mistral ships a `SYSTEM_PROMPT.txt` alongside the weights. The reference setup loads it from the HF repo and substitutes `{name}`, `{today}`, and `{yesterday}` at runtime so the model knows its identity and the current date. SGLang's chat template will inject a default system prompt if you omit one, but for full parity with Mistral's reference, load it explicitly: + +```python Example +from datetime import datetime, timedelta +from huggingface_hub import hf_hub_download +from openai import OpenAI + +MODEL = "mistralai/Mistral-Medium-3.5-128B" + +def load_system_prompt(repo_id: str, filename: str = "SYSTEM_PROMPT.txt") -> str: + path = hf_hub_download(repo_id=repo_id, filename=filename) + today = datetime.today().strftime("%Y-%m-%d") + yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d") + name = repo_id.split("/")[-1] + with open(path) as f: + return f.read().format(name=name, today=today, yesterday=yesterday) + +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +response = client.chat.completions.create( + model=MODEL, + messages=[ + {"role": "system", "content": load_system_prompt(MODEL)}, + {"role": "user", "content": "Write me a sentence where every word starts with the next letter in the alphabet — start with 'a' and end with 'z'."}, + ], + temperature=0.1, + extra_body={"reasoning_effort": "none"}, +) + +print(response.choices[0].message.content) +``` + +--- + +## 5. Benchmarks + +Validation runs on 4× H200 with `--tp 4`, served via the `/v1/chat/completions` endpoint. 
+ +### 5.1 Accuracy Benchmarks + +#### GSM8K + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --port 30000 +``` + +**Results:** + +```text Output +Accuracy: 0.945 +Invalid: 0.000 +Latency: 13.594 s +Output throughput: 1560.660 token/s +``` + +#### MMMU + +```bash Command +python3 benchmark/mmmu/bench_sglang.py --port 30000 +``` + +**Results:** + +```text Output +Overall accuracy: 0.586 +``` + +### 5.2 Speed Benchmarks + +#### Latency (Low Concurrency) + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 10 \ + --max-concurrency 1 \ + --random-input-len 1024 \ + --random-output-len 512 \ + --port 30000 +``` + +**Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Successful requests: 10 +Benchmark duration (s): 38.86 +Total input tokens: 6101 +Total generated tokens: 2684 +Output token throughput (tok/s): 69.07 +Mean E2E Latency (ms): 3883.80 +Median TTFT (ms): 95.90 +Median TPOT (ms): 14.19 +================================================== +``` + +#### Throughput (High Concurrency) + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 1000 \ + --max-concurrency 100 \ + --random-input-len 1024 \ + --random-output-len 512 \ + --port 30000 +``` + +**Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Successful requests: 1000 +Benchmark duration (s): 117.28 +Total input tokens: 512842 +Total generated tokens: 262023 +Output token throughput (tok/s): 2234.18 +Total token throughput (tok/s): 6607.01 +Mean E2E Latency (ms): 11303.79 +Median TTFT (ms): 152.95 +Median TPOT (ms): 42.53 +================================================== +``` + +### 5.3 EAGLE Speculative Decoding (Latency) + +Same 4× H200 setup, EAGLE configuration from [Section 3.3](#3-3-speculative-decoding-eagle). Single-stream latency benchmark (`--max-concurrency 1`). 
+ +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 10 \ + --max-concurrency 1 \ + --random-input-len 1024 \ + --random-output-len 512 \ + --port 30000 +``` + +**Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Successful requests: 10 +Benchmark duration (s): 27.64 +Total input tokens: 6101 +Total generated tokens: 2684 +Output token throughput (tok/s): 97.10 +Mean E2E Latency (ms): 2762.99 +Median TTFT (ms): 90.69 +Median TPOT (ms): 9.73 +Accept length: 1.72 +================================================== +``` + +EAGLE delivers **~1.41× output throughput and ~29% lower E2E latency** vs. the baseline in [Section 5.2](#5-2-speed-benchmarks) on the same workload. Acceptance length of 1.72 means each draft cycle averages roughly 1.7 accepted tokens. diff --git a/docs_new/cookbook/autoregressive/Mistral/Mistral-Small-4.mdx b/docs_new/cookbook/autoregressive/Mistral/Mistral-Small-4.mdx new file mode 100644 index 000000000000..23ffd4258075 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Mistral/Mistral-Small-4.mdx @@ -0,0 +1,393 @@ +--- +title: Mistral Small 4 +metatags: + description: "Deploy Mistral Small 4 with SGLang - unified hybrid model combining instruct, reasoning, and agentic capabilities with multimodal support." +--- + +import { MistralSmall4Deployment } from '/src/snippets/autoregressive/mistral-small-4-deployment.jsx'; + +## 1. Model Introduction + +**Mistral Small 4** is a powerful hybrid model from Mistral AI that unifies the capabilities of three different model families — **Instruct**, **Reasoning** (formerly called Magistral), and **Agentic (formerly called Devstral)** — into a single, unified model. + +With its multimodal capabilities, efficient MoE architecture, and flexible mode switching, Mistral Small 4 is a versatile general-purpose model for virtually any task. 
In a latency-optimized setup, it achieves a 40% reduction in end-to-end completion time; in a throughput-optimized setup, it delivers 3× more requests per second compared to Mistral Small 3. + +**Key Features:** + +- **Hybrid Reasoning**: Switch between instant reply mode and deep reasoning/thinking mode — reasoning effort is configurable per request +- **Vision**: Accepts both text and image inputs, providing insights based on visual content +- **Function Calling**: Native tool calling and JSON output support with best-in-class agentic capabilities +- **Multilingual**: Supports dozens of languages including English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, and more +- **Context Window**: 256K context window +- **Efficient MoE**: 119B total parameters, 128 experts, 4 active per token (6.5B activated parameters) +- **Apache 2.0 License**: Open-source, usable and modifiable for commercial and non-commercial purposes +- **Reasoning Effort**: Only `"none"` and `"high"` are supported + +**Architecture:** + +- Same general architecture as Mistral 3 +- MoE: 128 experts, 4 active per token +- 119B total parameters, 6.5B activated per token +- Multimodal input: text + image + +**Models:** + +- **[mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603)** (FP8) +- **[mistralai/Mistral-Small-4-119B-2603-NVFP4](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-NVFP4)** +- **[mistralai/Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603)** — same architecture, use the same launch commands as Mistral-Small-4-119B-2603 +- **[mistralai/Mistral-Small-4-119B-2603-eagle](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-eagle)** — EAGLE speculative decoding weights for faster inference + +--- + +## 2. SGLang Installation + +SGLang offers multiple installation methods. Choose the one that best fits your hardware platform and requirements.
+ +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + + +Mistral Small 4 support landed in [sgl-project/sglang#20708](https://github.com/sgl-project/sglang/pull/20708) and has been merged into `main`. A model-specific Docker image is no longer required. Use the standard SGLang installation methods from the [official installation guide](../../../docs/get-started/install). + + +--- + +## 3. Model Deployment + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to generate a launch command for Mistral Small 4. + + + +### 3.2 Configuration Tips + +- **Tensor Parallelism**: Mistral Small 4 FP8 (~119 GB) requires tp=2 on Hopper (H100/H200) and tp=1 on Blackwell (B200/B300). NVFP4 (~60 GB, Blackwell only) runs with tp=1. +- **Reasoning effort**: Reasoning depth is configurable per request via `reasoning_effort` (`"none"`, `"high"`). No restart required — toggle per call. +- **Context length vs memory**: The model has a 256K context window. If you are memory-constrained, lower `--context-length` (e.g. `32768`) and increase once things are stable. +- **Tool calling**: Enable `--tool-call-parser mistral` to activate native function calling support. +- **Reasoning parser**: Enable `--reasoning-parser mistral` to separate `reasoning_content` from the main response content. +- **Speculative decoding (EAGLE)**: Enable with `--speculative-algorithm EAGLE --speculative-draft-model-path mistralai/Mistral-Small-4-119B-2603-eagle` using the [EAGLE weights](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-eagle) for lower latency. + +--- + +## 4. Model Invocation + +### 4.1 Thinking Mode + +Mistral Small 4 is a hybrid reasoning model. By default, it does not produce a reasoning trace. Pass `reasoning_effort="high"` in the request (via `extra_body`) to toggle reasoning on.
+ +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +response = client.chat.completions.create( + model="mistralai/Mistral-Small-4-119B-2603", + messages=[ + {"role": "user", "content": "Solve step by step: what is 17 × 23 + 144 / 12?"}, + ], + extra_body={"reasoning_effort": "high"}, +) + +print("Reasoning:", response.choices[0].message.reasoning_content) +print("Answer:", response.choices[0].message.content) +``` + +**Output:** + +```text Output +Reasoning: First, I'll break down the problem into two parts: the multiplication and +the division. According to the order of operations (PEMDAS/BODMAS), multiplication and +division are performed from left to right before addition. + +17 × 23 = 17 × (20 + 3) = (17 × 20) + (17 × 3) = 340 + 51 = 391 +144 / 12 = 12 + +Finally, add the results: 391 + 12 = 403 + +Answer: The solution to the problem is as follows: + +1. First, perform the multiplication: 17 × 23. + - 17 × 20 = 340 + - 17 × 3 = 51 + - 340 + 51 = 391 + +2. Then, perform the division: 144 / 12 = 12. + +3. 
Finally, add the results: + - 391 + 12 = 403 + +**Answer:** \boxed{403} +``` + +### 4.2 Instruct Mode (Reasoning Off) + +To skip the reasoning trace and get a fast direct response, set `reasoning_effort` to `"none"`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +response = client.chat.completions.create( + model="mistralai/Mistral-Small-4-119B-2603", + messages=[ + {"role": "user", "content": "Write a Python function to reverse a string."}, + ], + extra_body={"reasoning_effort": "none"}, +) + +print(response.choices[0].message.content) +``` + +**Output:** + +````text Output +# Python Function to Reverse a String + +Here are several ways to write a Python function to reverse a string: + +## Method 1: Using String Slicing (Most Pythonic) +```python +def reverse_string(s): + """Reverse a string using slicing.""" + return s[::-1] +``` + +## Method 2: Using a Loop +```python Example +def reverse_string(s): + """Reverse a string using a loop.""" + reversed_str = "" + for char in s: + reversed_str = char + reversed_str + return reversed_str +``` + +## Method 3: Using reversed() function +```python Example +def reverse_string(s): + """Reverse a string using reversed() function.""" + return ''.join(reversed(s)) +``` + +The first method using string slicing (`s[::-1]`) is generally the most efficient and +recommended approach in Python. + +Example usage: +```python Example +original = "Hello, World!" 
+reversed_str = reverse_string(original) +print(reversed_str) # Output: "!dlroW ,olleH" +``` +```` + +### 4.3 Streaming with Reasoning + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +stream = client.chat.completions.create( + model="mistralai/Mistral-Small-4-119B-2603", + messages=[ + {"role": "user", "content": "Explain the difference between async and threading in Python."}, + ], + extra_body={"reasoning_effort": "high"}, + stream=True, +) + +print("=== Reasoning ===") +for chunk in stream: + delta = chunk.choices[0].delta + if hasattr(delta, "reasoning_content") and delta.reasoning_content: + print(delta.reasoning_content, end="", flush=True) + elif delta.content: + print("\n=== Response ===") + print(delta.content, end="", flush=True) +print() +``` + +**Output:** + +```text Output +=== Reasoning === +Okay, the user is asking about the difference between async and threading in Python. +I need to break this down clearly, covering the key aspects of both, like their +purposes, performance characteristics, and use cases... +=== Response === +In Python, **`async`/`asyncio`** and **`threading`** are two different concurrency +models, each suited for specific use cases. Here's a breakdown of their key differences: + +### 1. Model of Concurrency +- **Threading**: Based on preemptive multitasking using OS threads. +- **Async** (`asyncio`): Based on cooperative multitasking. Tasks voluntarily yield... +``` + +### 4.4 Tool Calling + +Mistral Small 4 supports native function calling. 
Enable with `--tool-call-parser mistral`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a city", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string", "description": "City name"}, + "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}, + }, + "required": ["location"], + }, + }, + } +] + +response = client.chat.completions.create( + model="mistralai/Mistral-Small-4-119B-2603", + messages=[{"role": "user", "content": "What's the weather in Paris?"}], + tools=tools, + tool_choice="auto", +) + +tool_calls = response.choices[0].message.tool_calls +for tc in tool_calls: + print(f"Tool: {tc.function.name}") + print(f"Args: {tc.function.arguments}") +``` + +**Output:** + +```text Output +Tool: get_weather +Args: {"location": "Paris"} +``` + +### 4.5 Vision (Image Input) + +Mistral Small 4 accepts image inputs alongside text: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +response = client.chat.completions.create( + model="mistralai/Mistral-Small-4-119B-2603", + messages=[ + { + "role": "user", + "content": [ + {"type": "text", "text": "Describe what you see in this image."}, + { + "type": "image_url", + "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"}, + }, + ], + } + ], +) + +print(response.choices[0].message.content) +``` + +**Output:** + +```text Output +The image is a copyright symbol, represented by a stylized version of the lowercase +letter "c" inside a circle. The "c" is depicted in a white or light-colored font, and +the circle is orange. The design is simple yet striking, using oval and elliptical +shapes to create a distinct symbol which signifies copyright protection. 
+``` + +--- + +## 5. Benchmarks + +### 5.1 Accuracy Benchmarks + +#### GSM8K + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --port 30000 +``` + +**Results:** + +```text Output +TODO +``` + +#### MMLU + +```bash Command +python3 benchmark/mmlu/bench_sglang.py --port 30000 +``` + +**Results:** + +```text Output +TODO +``` + +### 5.2 Speed Benchmarks + +#### Latency (Low Concurrency) + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 10 \ + --max-concurrency 1 \ + --random-input-len 1024 \ + --random-output-len 512 \ + --port 30000 +``` + +**Results:** + +```text Output +TODO +``` + +#### Throughput (High Concurrency) + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 1000 \ + --max-concurrency 100 \ + --random-input-len 1024 \ + --random-output-len 512 \ + --port 30000 +``` + +**Results:** + +```text Output +TODO +``` diff --git a/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.5.mdx b/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.5.mdx new file mode 100644 index 000000000000..9b3a37edb22e --- /dev/null +++ b/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.5.mdx @@ -0,0 +1,1244 @@ +--- +title: Kimi-K2.5 +metatags: + description: "Deploy Kimi-K2.5 MoE model with SGLang - 1T total parameters, 32B active, step-by-step reasoning and tool calling capabilities." +--- + +## 1. Model Introduction + +[Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) is an open-source, native multimodal agentic model by Moonshot AI, built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It integrates vision and language understanding with advanced agentic capabilities and supports both instant and thinking modes. + +**Key Features:** + +- **Native Multimodality**: Pre-trained on vision-language tokens, K2.5 excels in visual knowledge, cross-modal reasoning, and agentic tool use grounded in visual inputs.
+- **Coding with Vision**: K2.5 generates code from visual specifications (UI designs, video workflows) and autonomously orchestrates tools for visual data processing. +- **Agent Swarm**: K2.5 transitions from single-agent scaling to a self-directed, coordinated swarm-like execution scheme. It decomposes complex tasks into parallel sub-tasks executed by dynamically instantiated, domain-specific agents. +- **Speculative Decoding**: EAGLE-based speculative decoding support for lower latency. + +**Available Models**: +- INT4 (initial release): [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) +- NVFP4 (4-bit quantized): [nvidia/Kimi-K2.5-NVFP4](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) + +For details, see the [official documentation](https://huggingface.co/moonshotai/Kimi-K2.5) and the [deployment guidance](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/docs/deploy_guidance.md). + +## 2. SGLang Installation + +Refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. Model Deployment + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and capabilities. + +import { KimiK25Deployment } from '/src/snippets/autoregressive/kimi-k25-deployment.jsx' + + + +### 3.2 Configuration Tips + +- **Memory**: Requires GPUs with at least 140 GB of memory each. Supported platforms: H200 (8x, TP=8), B300 (8x, TP=8), MI300X/MI325X (4x, TP=4), MI350X/MI355X (4x, TP=4). Use `--context-length 128000` to conserve memory. +- **AMD GPU TP Constraint**: On AMD GPUs, TP must be <= 4 (not 8). Kimi-K2.5 has 64 attention heads; the AITER MLA kernel requires `heads_per_gpu % 16 == 0`. With TP=4, each GPU gets 64 / 4 = 16 heads (valid). With TP=8, each GPU gets 64 / 8 = 8 heads (invalid).
+- **AMD Docker Image**: Use `lmsysorg/sglang:v0.5.9-rocm700-mi35x` for MI350X/MI355X and `lmsysorg/sglang:v0.5.9-rocm700-mi30x` for MI300X/MI325X. The ROCm 7.2 images (`rocm720`) have an AITER compatibility issue. +- **DP Attention**: Enable with `--dp --enable-dp-attention` for production throughput. A common choice is to set `--dp` equal to `--tp`, but this is not required. +- **Reasoning Parser**: Add `--reasoning-parser kimi_k2` to separate thinking and content in model outputs. +- **Tool Call Parser**: Add `--tool-call-parser kimi_k2` for structured tool calls. + +## 4. Model Invocation + +### 4.1 Basic Usage + +See [Basic API Usage](../../../docs/basic_usage/send_request). + +### 4.2 Advanced Usage + +#### 4.2.1 Multimodal (Vision + Text) Input + +Kimi-K2.5 supports native multimodal input with images: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.5", + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "What is in this image? Describe it in detail." + } + ] + } + ] +) + +print(response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +This image shows a **receipt from Auntie Anne's** (a pretzel franchise restaurant). + +## Key Details: + +**Item Purchased:** +- **CINNAMON SUGAR** - 1 unit x 17,000 = **17,000** + +**Payment Summary:** +- **SUB TOTAL:** 17,000 +- **GRAND TOTAL:** 17,000 +- **CASH IDR:** 20,000 (Indonesian Rupiah) +- **CHANGE DUE:** 3,000 + +## Context: +The receipt indicates a transaction in **Indonesian Rupiah (IDR)**. A customer purchased one Cinnamon Sugar pretzel for 17,000 IDR, paid with a 20,000 IDR note, and received 3,000 IDR in change. 
+ +The top of the receipt shows the Auntie Anne's logo (a heart-shaped pretzel with a halo), and some text appears blurred for privacy, likely obscuring the store location, date, and transaction number. The receipt is printed on white thermal paper. +``` + +#### 4.2.2 Reasoning Output + +Kimi-K2.5 supports both thinking mode (default) and instant mode. + +**Thinking Mode (default)** -- reasoning content is automatically separated: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.5", + messages=[ + {"role": "user", "content": "Which one is bigger, 9.11 or 9.9? Think carefully."} + ] +) + +print("====== Reasoning Content (Thinking Mode) ======") +print(response.choices[0].message.reasoning_content) +print("====== Response (Thinking Mode) ======") +print(response.choices[0].message.content) +``` + +**Instant Mode (thinking off)** -- disable thinking for faster responses: + +```python Example +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.5", + messages=[ + {"role": "user", "content": "Which one is bigger, 9.11 or 9.9? Think carefully."} + ], + extra_body={"chat_template_kwargs": {"thinking": False}} +) + +print("====== Response (Instant Mode) ======") +print(response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +====== Reasoning Content (Thinking Mode) ====== +The user is asking which number is bigger: 9.11 or 9.9. + +At first glance, someone might think 9.11 is bigger because 11 > 9, but that's incorrect because we're dealing with decimal numbers, not whole numbers. + +Let me compare them properly: +- 9.9 = 9.90 +- 9.11 + +When comparing decimals, we look at each place value from left to right: +- Units place: 9 = 9 (tie) +- Tenths place: 9 vs 1 + +Since 9 > 1, we have 9.9 > 9.11. 
+ +Alternatively, we can think of it as: +- 9.9 = 9 + 9/10 = 9 + 0.9 = 9.90 +- 9.11 = 9 + 11/100 = 9 + 0.11 + +Since 0.90 > 0.11, then 9.9 > 9.11. + +So the answer is clearly 9.9 is bigger. + +The "think carefully" hint suggests the user is trying to catch the common error where people compare 11 and 9 as whole numbers rather than understanding decimal place value (tenths vs hundredths). + +I should explain this clearly to avoid confusion. +====== Response (Thinking Mode) ====== +**9.9 is bigger.** + +Here's why this can be tricky: Many people instinctively compare 11 and 9 and think "11 is bigger than 9," but that's comparing the wrong place values. + +When comparing decimals, align them by place value: +- 9.9 = 9.**90** +- 9.11 = 9.**11** + +After the decimal point: +- The first digit (tenths place): **9** vs **1** +- Since 9 > 1, we stop there. **9.9 is larger.** + +Think of it as money: +- $9.90 (nine dollars and ninety cents) +- $9.11 (nine dollars and eleven cents) + +$9.90 is clearly more than $9.11. +====== Response (Instant Mode) ====== + Let me think through this carefully. + +**9.9 is bigger than 9.11** + +Here's why: When comparing decimals, we need to align them by their decimal places: + +- 9.9 = 9.90 +- 9.11 = 9.11 + +Now comparing: +- The whole number parts are equal (9 = 9) +- Comparing tenths: **9 > 1** + +So 9.90 > 9.11 + +A common mistake is thinking 11 hundredths is larger than 9 tenths, but 9 tenths = 90 hundredths, which is clearly larger than 11 hundredths. 
+``` + +#### 4.2.3 Tool Calling + +Kimi-K2.5 supports tool calling capabilities for agentic tasks: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.5", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + stream=True +) + +# Process streaming response +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'tool_calls') and delta.tool_calls: + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = {'name': None, 'arguments': ''} + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + if delta.content: + print(delta.content, end="", flush=True) + +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") +``` + +**Output Example:** + +```text Output +Tool Call: get_weather + Arguments: {"location": "Beijing"} +``` + +**Handling Tool Call Results:** + +```python Example +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in 
Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": "The weather in Beijing is 22°C and sunny." + } +] + +final_response = client.chat.completions.create( + model="moonshotai/Kimi-K2.5", + messages=messages +) + +print(final_response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +The weather in Beijing is **22°C and sunny**. ☀️ + +It's a nice day there with comfortable temperatures and clear skies! +``` + +#### 4.2.4 Multimodal + Tool Calling (Agentic Vision) + +Combine vision understanding with tool calling for advanced agentic tasks: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +tools = [ + { + "type": "function", + "function": { + "name": "search_product", + "description": "Search for a product by name or description", + "parameters": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "The product name or description to search for" + } + }, + "required": ["query"] + } + } + } +] + +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.5", + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "Can you identify this product and search for similar items?" 
+ } + ] + } + ], + tools=tools +) + +msg = response.choices[0].message + +# Print reasoning process +if msg.reasoning_content: + print("=== Reasoning ===") + print(msg.reasoning_content) + +# Print response content +if msg.content: + print("=== Content ===") + print(msg.content) + +# Print tool calls +if msg.tool_calls: + print("=== Tool Calls ===") + for tc in msg.tool_calls: + print(f" Function: {tc.function.name}") + print(f" Arguments: {tc.function.arguments}") +``` + +**Output Example:** + +```text Output +=== Reasoning === +The user is asking me to identify a product from a receipt and search for similar items. +Looking at the receipt, I can see: + + 1. The store is "Auntie Anne's" - which is a popular pretzel chain + 2. The product purchased is "CINNAMON SUGAR" + 3. Price is 17,000 (likely Indonesian Rupiah based on "CASH IDR") + 4. Quantity is 1 + +So the product is a Cinnamon Sugar pretzel from Auntie Anne's. +Now I need to search for this product or similar items using the search_product function. +=== Content === +I can see from the receipt that the product is a **Cinnamon Sugar** item from **Auntie Anne's** (the famous pretzel chain). This appears to be a Cinnamon Sugar Pretzel purchased for 17,000 IDR (Indonesian Rupiah). 
+ +Let me search for this product and similar items: +=== Tool Calls === + Function: search_product + Arguments: {"query": "Auntie Anne's Cinnamon Sugar Pretzel"} +``` + +#### 4.2.5 Speculative Decoding + +**Nvidia** + +Deploy Kimi-K2.5 with the following command (H200/B200, all features enabled): + +```shell Command +SGLANG_ENABLE_SPEC_V2=1 sglang serve \ + --model-path moonshotai/Kimi-K2.5 \ + --tp 8 \ + --reasoning-parser kimi_k2 \ + --tool-call-parser kimi_k2 \ + --speculative-algorithm=EAGLE3 \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +Deploy Kimi-K2.5-NVFP4 with the following command (B200, all features enabled): + +```shell Command +SGLANG_ENABLE_SPEC_V2=1 sglang serve \ + --model-path nvidia/Kimi-K2.5-NVFP4 \ + --tp 8 \ + --reasoning-parser kimi_k2 \ + --tool-call-parser kimi_k2 \ + --kv-cache-dtype fp8_e4m3 \ + --speculative-algorithm=EAGLE3 \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +## 5. Benchmark + +### 5.1 Accuracy Benchmark + +#### 5.1.1 MMMU Benchmark + +You can evaluate the model's accuracy using the MMMU benchmark, which tests multimodal understanding and reasoning across various subjects: + +- **Benchmark Command:** + +```shell Command +python3 benchmark/mmmu/bench_sglang.py \ + --response-answer-regex "(?i)(?:answer|ans)[:\s]*(?:\*\*)?[\(\[]?([A-Za-z])[\)\]]?(?:\*\*)?" \ + --port 30000 \ + --concurrency 64 +``` + +- **Result:** + +```text Output +Benchmark time: 2785.4322692090645 +answers saved to: ./answer_sglang.json +Evaluating... 
+answers saved to: ./answer_sglang.json +{'Accounting': {'acc': 0.667, 'num': 30}, + 'Agriculture': {'acc': 0.567, 'num': 30}, + 'Architecture_and_Engineering': {'acc': 0.733, 'num': 30}, + 'Art': {'acc': 0.833, 'num': 30}, + 'Art_Theory': {'acc': 0.8, 'num': 30}, + 'Basic_Medical_Science': {'acc': 0.833, 'num': 30}, + 'Biology': {'acc': 0.6, 'num': 30}, + 'Chemistry': {'acc': 0.633, 'num': 30}, + 'Clinical_Medicine': {'acc': 0.733, 'num': 30}, + 'Computer_Science': {'acc': 0.667, 'num': 30}, + 'Design': {'acc': 0.7, 'num': 30}, + 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.5, 'num': 30}, + 'Economics': {'acc': 0.867, 'num': 30}, + 'Electronics': {'acc': 0.3, 'num': 30}, + 'Energy_and_Power': {'acc': 0.767, 'num': 30}, + 'Finance': {'acc': 0.833, 'num': 30}, + 'Geography': {'acc': 0.667, 'num': 30}, + 'History': {'acc': 0.767, 'num': 30}, + 'Literature': {'acc': 0.767, 'num': 30}, + 'Manage': {'acc': 0.733, 'num': 30}, + 'Marketing': {'acc': 0.833, 'num': 30}, + 'Materials': {'acc': 0.567, 'num': 30}, + 'Math': {'acc': 0.633, 'num': 30}, + 'Mechanical_Engineering': {'acc': 0.567, 'num': 30}, + 'Music': {'acc': 0.5, 'num': 30}, + 'Overall': {'acc': 0.698, 'num': 900}, + 'Overall-Art and Design': {'acc': 0.708, 'num': 120}, + 'Overall-Business': {'acc': 0.787, 'num': 150}, + 'Overall-Health and Medicine': {'acc': 0.74, 'num': 150}, + 'Overall-Humanities and Social Science': {'acc': 0.75, 'num': 120}, + 'Overall-Science': {'acc': 0.66, 'num': 150}, + 'Overall-Tech and Engineering': {'acc': 0.595, 'num': 210}, + 'Pharmacy': {'acc': 0.767, 'num': 30}, + 'Physics': {'acc': 0.767, 'num': 30}, + 'Psychology': {'acc': 0.667, 'num': 30}, + 'Public_Health': {'acc': 0.867, 'num': 30}, + 'Sociology': {'acc': 0.8, 'num': 30}} +eval out saved to ./val_sglang.json +Overall accuracy: 0.698 +``` + +### 5.2 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA H200 GPU (8x) +- Model: Kimi-K2.5 +- Tensor Parallelism: 8 +- SGLang Version: 0.5.6.post2 + +We use SGLang's 
built-in benchmarking tool with the `random` dataset for standardized performance evaluation. + +#### 5.2.1 Latency Benchmark + +- **Model Deployment:** + +```bash Command +sglang serve \ + --model-path moonshotai/Kimi-K2.5 \ + --tp 8 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +- **Benchmark Command:** + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- **Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 39.77 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4221 +Request throughput (req/s): 0.25 +Input token throughput (tok/s): 153.40 +Output token throughput (tok/s): 106.10 +Peak output token throughput (tok/s): 156.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 259.50 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3972.87 +Median E2E Latency (ms): 4044.55 +P90 E2E Latency (ms): 7046.30 +P99 E2E Latency (ms): 7441.13 +---------------Time to First Token---------------- +Mean TTFT (ms): 176.89 +Median TTFT (ms): 154.24 +P99 TTFT (ms): 285.75 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 9.22 +Median TPOT (ms): 9.32 +P99 TPOT (ms): 12.72 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.02 +Median ITL (ms): 8.80 +P95 ITL (ms): 13.23 +P99 ITL (ms): 14.17 +Max ITL (ms): 29.38 +================================================== +``` + +- Medium Concurrency (Balanced) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 158.05 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40775 +Request throughput (req/s): 0.51 +Input token throughput (tok/s): 250.99 +Output token throughput (tok/s): 258.18 +Peak output token throughput (tok/s): 1103.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 509.17 +Concurrency: 14.09 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 27837.05 +Median E2E Latency (ms): 23508.00 +P90 E2E Latency (ms): 57126.31 +P99 E2E Latency (ms): 66044.35 +---------------Time to First Token---------------- +Mean TTFT (ms): 374.30 +Median TTFT (ms): 375.51 +P99 TTFT (ms): 695.58 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 53.25 +Median TPOT (ms): 57.93 +P99 TPOT (ms): 85.45 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 53.95 +Median ITL (ms): 53.97 +P95 ITL (ms): 84.74 +P99 ITL (ms): 244.84 +Max ITL (ms): 655.61 +================================================== +``` + +- High Concurrency (Throughput-Optimized) + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +- **Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 996.64 +Total input tokens: 249831 +Total input text tokens: 249831 +Total generated tokens: 252662 +Total generated tokens (retokenized): 252588 +Request throughput (req/s): 0.50 +Input token throughput (tok/s): 250.67 +Output token throughput (tok/s): 253.51 +Peak output token throughput (tok/s): 1199.00 +Peak concurrent requests: 104 +Total token throughput (tok/s): 504.18 +Concurrency: 92.70 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 184773.75 +Median E2E Latency (ms): 174183.65 +P90 E2E Latency (ms): 343625.28 +P99 E2E Latency (ms): 404284.53 +---------------Time to First Token---------------- +Mean TTFT (ms): 1289.59 +Median TTFT (ms): 1313.35 +P99 TTFT (ms): 2346.78 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 364.70 +Median TPOT (ms): 403.32 +P99 TPOT (ms): 452.34 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 363.82 +Median ITL (ms): 316.21 +P95 ITL (ms): 745.91 +P99 ITL (ms): 1345.88 +Max ITL (ms): 3118.59 +================================================== +``` + +**Scenario 2: Reasoning (1K/8K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 680.26 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 44462 +Total generated tokens (retokenized): 44455 +Request throughput (req/s): 0.01 +Input token throughput (tok/s): 8.97 +Output token throughput (tok/s): 65.36 +Peak output token throughput (tok/s): 151.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 74.33 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 68019.29 +Median E2E Latency (ms): 70568.85 +P90 E2E Latency (ms): 113237.40 +P99 E2E Latency (ms): 121682.34 +---------------Time to First Token---------------- +Mean TTFT (ms): 206.17 +Median TTFT (ms): 177.28 +P99 TTFT (ms): 445.37 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 14.36 +Median TPOT (ms): 15.89 +P99 TPOT (ms): 16.43 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 15.26 +Median ITL (ms): 15.85 +P95 ITL (ms): 17.50 +P99 ITL (ms): 23.21 +Max ITL (ms): 45.22 +================================================== +``` + +- Medium Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 2475.98 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 318306 +Total generated tokens (retokenized): 318166 +Request throughput (req/s): 0.03 +Input token throughput (tok/s): 16.02 +Output token throughput (tok/s): 128.56 +Peak output token throughput (tok/s): 847.00 +Peak concurrent requests: 18 +Total token throughput (tok/s): 144.58 +Concurrency: 14.62 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 452592.46 +Median E2E Latency (ms): 486002.05 +P90 E2E Latency (ms): 833197.57 +P99 E2E Latency (ms): 957399.48 +---------------Time to First Token---------------- +Mean TTFT (ms): 359.38 +Median TTFT (ms): 350.78 +P99 TTFT (ms): 500.36 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 111.18 +Median TPOT (ms): 122.76 +P99 TPOT (ms): 145.90 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 113.69 +Median ITL (ms): 122.81 +P95 ITL (ms): 147.87 +P99 ITL (ms): 151.03 +Max ITL (ms): 272.05 +================================================== +``` + +- High Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +```text Output +Waiting for completion... +``` + +**Scenario 3: Summarization (8K/1K)** + +- Low Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 120.73 +Total input tokens: 41941 +Total input text tokens: 41941 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.08 +Input token throughput (tok/s): 347.41 +Output token throughput (tok/s): 34.96 +Peak output token throughput (tok/s): 73.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 382.36 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 12068.56 +Median E2E Latency (ms): 10211.36 +P90 E2E Latency (ms): 23203.32 +P99 E2E Latency (ms): 30677.66 +---------------Time to First Token---------------- +Mean TTFT (ms): 1625.64 +Median TTFT (ms): 1526.63 +P99 TTFT (ms): 3743.51 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 24.95 +Median TPOT (ms): 23.95 +P99 TPOT (ms): 35.40 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 24.80 +Median ITL (ms): 21.73 +P95 ITL (ms): 59.56 +P99 ITL (ms): 61.10 +Max ITL (ms): 62.70 +================================================== +``` + +- Medium Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 389.96 +Total input tokens: 300020 +Total input text tokens: 300020 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41670 +Request throughput (req/s): 0.21 +Input token throughput (tok/s): 769.36 +Output token throughput (tok/s): 106.86 +Peak output token throughput (tok/s): 304.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 876.22 +Concurrency: 14.95 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 72870.97 +Median E2E Latency (ms): 70495.88 +P90 E2E Latency (ms): 121820.46 +P99 E2E Latency (ms): 148933.09 +---------------Time to First Token---------------- +Mean TTFT (ms): 2460.45 +Median TTFT (ms): 1976.29 +P99 TTFT (ms): 7305.53 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 140.57 +Median TPOT (ms): 142.31 +P99 TPOT (ms): 273.40 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 135.44 +Median ITL (ms): 95.96 +P95 ITL (ms): 152.93 +P99 ITL (ms): 1488.37 +Max ITL (ms): 6540.24 +================================================== +``` + +- High Concurrency + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 1279.50 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169981 +Request throughput (req/s): 0.25 +Input token throughput (tok/s): 995.62 +Output token throughput (tok/s): 132.86 +Peak output token throughput (tok/s): 703.00 +Peak concurrent requests: 67 +Total token throughput (tok/s): 1128.49 +Concurrency: 60.12 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 240385.63 +Median E2E Latency (ms): 236266.30 +P90 E2E Latency (ms): 429882.12 +P99 E2E Latency (ms): 515158.36 +---------------Time to First Token---------------- +Mean TTFT (ms): 2710.44 +Median TTFT (ms): 2345.63 +P99 TTFT (ms): 7144.20 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 443.84 +Median TPOT (ms): 493.29 +P99 TPOT (ms): 606.19 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 448.23 +Median ITL (ms): 296.17 +P95 ITL (ms): 1869.15 +P99 ITL (ms): 2708.95 +Max ITL (ms): 7778.47 +================================================== +``` + +#### 5.2.2 Speculative Decoding Benchmark + +- **Model Deployment:** + +```bash Command +SGLANG_ENABLE_SPEC_V2=1 sglang serve \ + --model-path moonshotai/Kimi-K2.5 \ + --tp 8 \ + --reasoning-parser kimi_k2 \ + --tool-call-parser kimi_k2 \ + --speculative-algorithm=EAGLE3 \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +- **Benchmark Command:** + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- **Results:** + +```text Output +Pending update... +``` + +- Medium Concurrency (Balanced) + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +Pending update... +``` + +- High Concurrency (Throughput-Optimized) + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +Pending update... 
+``` + +### 5.3 Speed Benchmark (AMD MI350X) + +**Test Environment:** + +- Hardware: AMD Instinct MI350X GPU (4x) +- Model: Kimi-K2.5 (BF16) +- Tensor Parallelism: 4 +- SGLang Version: 0.5.9 +- Docker Image: `lmsysorg/sglang:v0.5.9-rocm700-mi35x` +- ROCm: 7.0 + +We use SGLang's built-in benchmarking tool with the `random` dataset for standardized performance evaluation. + +:::info AMD GPU TP Constraint +Kimi-K2.5 requires TP <= 4 on AMD GPUs. The model has 64 attention heads, and the AITER MLA kernel requires `heads_per_gpu % 16 == 0`. With TP=4, each GPU gets 16 heads (valid). With TP=8, each GPU gets 8 heads (invalid). +::: + +#### 5.3.1 Latency Benchmark + +- **Model Deployment:** + +```bash Command +SGLANG_USE_AITER=1 SGLANG_ROCM_FUSED_DECODE_MLA=0 \ +sglang serve \ + --model-path moonshotai/Kimi-K2.5 \ + --tp 4 \ + --mem-fraction-static 0.8 \ + --trust-remote-code \ + --reasoning-parser kimi_k2 \ + --host 0.0.0.0 \ + --port 30000 +``` + +- **Benchmark Command:** + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- **Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 155.81 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4222 +Request throughput (req/s): 0.06 +Input token throughput (tok/s): 39.16 +Output token throughput (tok/s): 27.09 +Peak output token throughput (tok/s): 29.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 66.24 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 15576.22 +Median E2E Latency (ms): 12539.80 +P90 E2E Latency (ms): 28150.56 +P99 E2E 
Latency (ms): 34873.51 +---------------Time to First Token---------------- +Mean TTFT (ms): 563.50 +Median TTFT (ms): 594.92 +P99 TTFT (ms): 830.31 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 35.61 +Median TPOT (ms): 35.66 +P99 TPOT (ms): 35.77 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 35.66 +Median ITL (ms): 35.69 +P95 ITL (ms): 35.96 +P99 ITL (ms): 36.13 +Max ITL (ms): 36.92 +================================================== +``` + +- Medium Concurrency (Balanced) + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.5 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 526.66 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40798 +Request throughput (req/s): 0.15 +Input token throughput (tok/s): 75.32 +Output token throughput (tok/s): 77.48 +Peak output token throughput (tok/s): 96.00 +Peak concurrent requests: 18 +Total token throughput (tok/s): 152.80 +Concurrency: 14.59 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 96023.27 +Median E2E Latency (ms): 93940.20 +P90 E2E Latency (ms): 159449.54 +P99 E2E Latency (ms): 194706.61 +---------------Time to First Token---------------- +Mean TTFT (ms): 989.08 +Median TTFT (ms): 886.42 +P99 TTFT (ms): 1543.60 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 191.04 +Median TPOT (ms): 195.20 +P99 TPOT (ms): 238.84 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 186.68 +Median ITL (ms): 183.82 +P95 ITL (ms): 189.90 +P99 ITL (ms): 673.64 +Max ITL (ms): 1633.20 +================================================== +``` diff --git a/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.6.mdx b/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.6.mdx new file mode 100644 index 000000000000..99ca67f4f11f --- /dev/null +++ b/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.6.mdx @@ -0,0 +1,1341 @@ +--- +title: Kimi-K2.6 +metatags: + description: "Deploy Kimi-K2.6 native multimodal agentic model with SGLang - reasoning, tool calling, and multimodal capabilities." +tag: NEW +--- + +## 1. Model Introduction + +[Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6) is an open-source, native multimodal agentic model by Moonshot AI, delivering industry-leading coding, long-horizon execution, and agent swarm capabilities. It matches or surpasses GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro across key benchmarks. + +**Key Features:** + +- **Long-Horizon Coding**: Excels at complex, end-to-end coding tasks with 13+ hours of continuous execution and 4,000+ lines of code modification, generalizing across languages (Rust, Go, Python) and tasks (frontend, devops, performance optimization). +- **Coding-Driven Design**: Transforms prompts and visual inputs into production-ready interfaces with motion-rich elements including WebGL shaders, GSAP + Framer Motion, and Three.js 3D. +- **Agent Swarms Elevated**: Scales to 300 parallel sub-agents executing 4,000 coordinated steps per run. One prompt, 100+ files. +- **Proactive Agents**: Powers OpenClaw, Hermes Agent, and other autonomous frameworks for 5-day continuous operation. 
+- **Native Multimodality**: Pre-trained on vision–language tokens with MoonViT (400M parameters) for visual understanding, cross-modal reasoning, and agentic tool use grounded in visual inputs. + +**Benchmarks (Open-Source SOTA):** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Benchmark | Score |
| --- | --- |
| HLE w/ tools | 54.0 |
| SWE-Bench Pro | 58.6 |
| SWE-bench Multilingual | 76.7 |
| BrowseComp | 83.2 |
| Toolathlon | 50.0 |
| AIME 2026 | 96.4 |
| GPQA-Diamond | 90.5 |
| LiveCodeBench | 89.6 |
+ +**Recommended Generation Parameters:** +- Thinking Mode: `temperature=1.0`, `top_p=0.95` +- Instant Mode: `temperature=0.6`, `top_p=0.95` + +**License:** Modified MIT + +For details, see [official documentation](https://huggingface.co/moonshotai/Kimi-K2.6) and [tech blog](https://kimi.com/blog/kimi-k2-6). + +## 2. SGLang Installation + +Refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. Model Deployment + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and capabilities. + +import { KimiK26Deployment } from '/src/snippets/autoregressive/kimi-k26-deployment.jsx' + + + +### 3.2 Configuration Tips + +- **Memory**: Requires GPUs with ≥140GB each. Supported platforms: H200 (8×, TP=8), B200 (8×, TP=8), B300 (8×, TP=8), GB200 (4×, TP=4), GB300 (4×, TP=4), MI300X/MI325X (4×, TP=4), MI350X/MI355X (4×, TP=4). Use `--context-length 128000` to conserve memory. +- **AMD GPU TP Constraint**: On AMD GPUs, TP must be ≤ 4 (not 8). Kimi-K2.6 has 64 attention heads; the AITER MLA kernel requires `heads_per_gpu % 16 == 0`. With TP=4, each GPU gets 16 heads (valid). With TP=8, each GPU gets 8 heads (invalid). +- **AMD Docker Image**: Use `lmsysorg/sglang:v0.5.9-rocm700-mi35x` for MI350X/MI355X and `lmsysorg/sglang:v0.5.9-rocm700-mi30x` for MI300X/MI325X. +- **DP Attention**: Enable with `--dp --enable-dp-attention` for production throughput. A common choice is to set `--dp` equal to `--tp`, but this is not required. +- **Reasoning Parser**: Add `--reasoning-parser kimi_k2` to separate thinking and content in model outputs. +- **Tool Call Parser**: Add `--tool-call-parser kimi_k2` for structured tool calls. 
+- **AMD FP8 KV Cache**: On AMD platforms the generator adds `--kv-cache-dtype fp8_e4m3` by default and sets `--mem-fraction-static 0.8` to fit the INT4 weights plus KV cache. FP8 KV cache trades a small amount of accuracy for memory; omit the flag if you observe accuracy regressions on your workload. + +## 4. Model Invocation + +### 4.1 Basic Usage + +See [Basic API Usage](../../../docs/basic_usage/send_request). + +### 4.2 Advanced Usage + +#### 4.2.1 Multimodal (Vision + Text) Input + +Kimi-K2.6 supports native multimodal input with images: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.6", + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "What is in this image? Describe it in detail." + } + ] + } + ] +) + +print(response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +This image shows a **paper receipt from Auntie Anne's**, the pretzel chain restaurant. 
Here's a detailed breakdown: + +## Header +- At the top left is the Auntie Anne's logo (a pretzel with a halo) +- The store name "**Auntie Anne's**" is printed prominently at the top +- Some text below the store name appears blurred/redacted (likely store location, address, or transaction details) + +## Purchase Details +- **Item**: CINNAMON SUGAR +- **Quantity & Price**: 1 × 17,000 +- **Item Total**: 17,000 + +## Financial Summary +- **SUB TOTAL**: 17,000 +- **GRAND TOTAL**: 17,000 +- **CASH IDR**: 20,000 (customer paid 20,000 Indonesian Rupiah) +- **CHANGE DUE**: 3,000 + +## Physical Description +- The receipt is printed on white thermal paper +- Some information in the middle section and toward the bottom is intentionally blurred/obscured +- The paper appears slightly curved/wrinkled and is placed on a dark brown surface (likely a table or counter) + +The transaction is in **Indonesian Rupiah (IDR)**, indicating this purchase was made at an Auntie Anne's location in Indonesia. The customer bought one Cinnamon Sugar pretzel for 17,000 IDR and received 3,000 IDR in change after paying with 20,000 IDR cash. +``` + +#### 4.2.2 Reasoning Output + +Kimi-K2.6 supports both thinking mode (default) and instant mode. + +**Thinking Mode (default)** — reasoning content is automatically separated: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.6", + messages=[ + {"role": "user", "content": "Which one is bigger, 9.11 or 9.9? 
Think carefully."} + ] +) + +print("====== Reasoning Content (Thinking Mode) ======") +print(response.choices[0].message.reasoning_content) +print("====== Response (Thinking Mode) ======") +print(response.choices[0].message.content) +``` + +**Instant Mode (thinking off)** — disable thinking for faster responses: + +```python Example +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.6", + messages=[ + {"role": "user", "content": "Which one is bigger, 9.11 or 9.9? Think carefully."} + ], + extra_body={"chat_template_kwargs": {"thinking": False}} +) + +print("====== Response (Instant Mode) ======") +print(response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +====== Reasoning Content (Thinking Mode) ====== +The user is asking which number is bigger: 9.11 or 9.9. This seems straightforward, but there's a viral internet debate about this due to decimal confusion. + +Let me think carefully: +- 9.11 means 9 + 11/100 = 9.11 +- 9.9 means 9 + 9/10 = 9.90 + +So 9.9 = 9.90, and 9.90 > 9.11 because 0.90 > 0.11. + +The confusion often comes from people thinking of software versioning (where 9.11 comes after 9.9) or comparing the numbers after the decimal as whole numbers (11 vs 9, thinking 11 > 9). + +So mathematically, 9.9 is clearly bigger. 9.9 - 9.11 = 0.79. + +I should explain this clearly and address the common misconception. +====== Response (Thinking Mode) ====== +Mathematically, **9.9 is bigger**. + +Here's why: + +**9.9 = 9.90** + +When comparing decimals, you need to look at the same place values: +- 9.11 = 9 ones, 1 tenth, and 1 hundredth +- 9.9 = 9 ones, 9 tenths, and 0 hundredths (9.90) + +Since **0.90 > 0.11**, it follows that **9.9 > 9.11**. + +The difference is: +9.9 - 9.11 = 0.79 + +**Why people get confused:** Many mistakenly treat the decimals like whole numbers (thinking "11 is bigger than 9") or confuse this with software version numbering (where version 9.11 comes after version 9.9). 
But in standard mathematics, 9.9 is definitively larger. +====== Response (Instant Mode) ====== +I need to compare 9.11 and 9.9. + +Let me think carefully by aligning the decimal places: + +- 9.11 = 9 and 11/100 = 9.11 +- 9.9 = 9 and 9/10 = 9.90 + +Since 0.90 > 0.11 + +**9.9 is bigger.** + +This is a common trick question because people sometimes mistakenly compare 11 and 9 as whole numbers after the decimal point, forgetting that 9.9 = 9.90, which is greater than 9.11. +``` + +#### 4.2.3 Tool Calling + +Kimi-K2.6 supports tool calling capabilities for agentic tasks: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.6", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + stream=True +) + +# Process streaming response +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'tool_calls') and delta.tool_calls: + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = {'name': None, 'arguments': ''} + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + if delta.content: + 
print(delta.content, end="", flush=True) + +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") +``` + +**Output Example:** + +```text Output +Tool Call: get_weather + Arguments: {"location": "Beijing"} +``` + +**Handling Tool Call Results:** + +```python Example +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": "The weather in Beijing is 22°C and sunny." + } +] + +final_response = client.chat.completions.create( + model="moonshotai/Kimi-K2.6", + messages=messages +) + +print(final_response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +The weather in Beijing is currently **22°C and sunny**. ☀️ + +It's a nice, warm day there—great for being outdoors! 
+``` + +#### 4.2.4 Multimodal + Tool Calling (Agentic Vision) + +Combine vision understanding with tool calling for advanced agentic tasks: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +tools = [ + { + "type": "function", + "function": { + "name": "search_product", + "description": "Search for a product by name or description", + "parameters": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "The product name or description to search for" + } + }, + "required": ["query"] + } + } + } +] + +response = client.chat.completions.create( + model="moonshotai/Kimi-K2.6", + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "Can you identify this product and search for similar items?" + } + ] + } + ], + tools=tools +) + +msg = response.choices[0].message + +# Print reasoning process +if msg.reasoning_content: + print("=== Reasoning ===") + print(msg.reasoning_content) + +# Print response content +if msg.content: + print("=== Content ===") + print(msg.content) + +# Print tool calls +if msg.tool_calls: + print("=== Tool Calls ===") + for tc in msg.tool_calls: + print(f" Function: {tc.function.name}") + print(f" Arguments: {tc.function.arguments}") +``` + +**Output Example:** + +```text Output +=== Reasoning === +The user wants me to identify the product from the receipt and search for similar items. Looking at the receipt, it's from Auntie Anne's and the item purchased is "CINNAMON SUGAR" for 17,000 IDR. This is likely a Cinnamon Sugar Pretzel from Auntie Anne's, which is a popular pretzel chain. + +I should search for this product using the search_product function. 
The query should be something like "Auntie Anne's Cinnamon Sugar Pretzel" or just "Cinnamon Sugar Pretzel" to find similar items. +=== Content === +Based on the receipt, the product is a **Cinnamon Sugar Pretzel** from **Auntie Anne's** (a popular pretzel bakery chain). The receipt shows it was purchased for 17,000 Indonesian Rupiah (IDR). + +Let me search for this product and similar items for you. +=== Tool Calls === + Function: search_product + Arguments: {"query":"Auntie Anne's Cinnamon Sugar Pretzel"} +``` + + +## 5. Benchmark + +### 5.1 Accuracy Benchmark + +**Test Environment:** + +- Hardware: 8× NVIDIA H200 +- Model: moonshotai/Kimi-K2.6 (INT4) +- Tensor Parallelism: 8 +- SGLang version: 0.5.9 +- Reasoning Parser: `kimi_k2` +- Tool Call Parser: `kimi_k2` + +#### 5.1.1 K2-Vendor-Verifier (Tool Calling) + +- Dataset: [K2-Vendor-Verifier](https://github.com/MoonshotAI/K2-Vendor-Verifier) tool-calls dataset (2,000 requests) +- Evaluation Tool: K2-Vendor-Verifier `tool_calls_eval.py` +- Settings: temperature=1.0, max_tokens=64,000, concurrency=256 + +**Evaluation Command:** + +```shell Command +cd K2-Vendor-Verifier + +python tool_calls_eval.py tool-calls/samples.jsonl \ + --model "moonshotai/Kimi-K2.6" \ + --base-url "http://localhost:30000/v1" \ + --api-key "placeholder" \ + --concurrency 256 \ + --temperature 1.0 \ + --max-tokens 64000 \ + --output kimi-k26-results.jsonl +``` + +**Results:** + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Metric | Value |
| --- | --- |
| Success Rate | 99.95% (1999/2000) |
| Tool Call Triggered | 970 |
| Tool Call Valid | 89.6% (869/970) |
| Tool Call Invalid (schema error) | 10.4% (101/970) |
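The percentages in this table follow directly from the raw counts. A minimal sketch that recomputes them, assuming the counts reported above; the variable and function names are illustrative and not part of K2-Vendor-Verifier:

```python
# Raw counts copied from the K2-Vendor-Verifier results table.
total_requests = 2000
successful_requests = 1999
tool_calls_triggered = 970
tool_calls_valid = 869
tool_calls_invalid = 101


def pct(numerator: int, denominator: int) -> float:
    """Return numerator / denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


success_rate = pct(successful_requests, total_requests)
valid_rate = pct(tool_calls_valid, tool_calls_triggered)
invalid_rate = pct(tool_calls_invalid, tool_calls_triggered)

print(f"Success rate:      {success_rate:.2f}%")
print(f"Tool call valid:   {valid_rate:.1f}%")
print(f"Tool call invalid: {invalid_rate:.1f}%")
```

Note that the valid and invalid rates are relative to the 970 requests that triggered a tool call, not to all 2,000 requests.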
+ +#### 5.1.2 AIME 2025 + +- Dataset: [AIME 2025](https://huggingface.co/datasets/nvidia/aime25) (30 problems) +- Evaluation Tool: [NVIDIA NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills) +- Prompt: `eval/matharena/aime` (MathArena format with `\boxed{}` answers) +- Settings: temperature=1.0, top_p=0.95, max_tokens=131,072, 32 seeds + +**Evaluation Command:** + +```shell Command +# Prepare dataset +python3 nemo_skills/dataset/aime25/prepare.py + +# Run 32 seeds in parallel +for RS in $(seq 0 31); do + python3 nemo_skills/inference/generate.py \ + input_file=nemo_skills/dataset/aime25/test.jsonl \ + output_file=results/kimi-k26/aime25/output-rs${RS}.jsonl \ + prompt_config=eval/matharena/aime \ + prompt_format=openai \ + +server.server_type=openai \ + +server.model=moonshotai/Kimi-K2.6 \ + +server.base_url=http://localhost:30000/v1 \ + ++inference.temperature=1.0 \ + ++inference.top_p=0.95 \ + ++inference.tokens_to_generate=131072 \ + ++inference.random_seed=${RS} \ + max_concurrent_requests=512 & +done +``` + +**Results:** + + + + + + + + + + + + + + + + + + + + + + +
| Evaluation Mode | Accuracy |
| --- | --- |
| pass@1 (avg-of-32) | 98.9% (29.7/30) |
| majority@32 | 100.0% (30/30) |
| pass@32 | 100.0% |
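The avg-of-32 pass@1 figure can be reproduced from per-seed scores. A small sketch, assuming the per-seed breakdown reported for this run (22 seeds at 30/30 and 10 seeds at 29/30); the variable names are illustrative:

```python
# Per-seed scores, assuming 22 perfect seeds and 10 seeds that missed one problem.
problems_per_seed = 30
seed_scores = [30] * 22 + [29] * 10

# pass@1 (avg-of-32) is the mean per-seed score divided by the problem count.
mean_score = sum(seed_scores) / len(seed_scores)
pass_at_1 = 100.0 * mean_score / problems_per_seed

print(f"Seeds: {len(seed_scores)}")
print(f"Mean score: {mean_score:.4f} / {problems_per_seed}")
print(f"pass@1 (avg-of-32): {pass_at_1:.2f}%")
```

Under that assumption the mean score is 29.6875/30, or about 98.96%. The table lists this as 98.9% (29.7/30), so the percentage there appears to be truncated rather than rounded.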
+ +> 22 out of 32 seeds achieved a perfect score of 30/30. The remaining 10 seeds each missed exactly 1 problem (29/30). + +#### 5.1.3 GPQA Diamond + +- Dataset: [GPQA Diamond](https://huggingface.co/datasets/Idavidrein/gpqa) (198 questions, 4-choice multiple choice) +- Evaluation Tool: [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) with `inspect_evals/gpqa_diamond` +- Settings: temperature=1.0, top_p=0.95, max_tokens=131,072, 4 epochs, cot=True + +**Evaluation Command:** + +```shell Command +OPENAI_BASE_URL=http://localhost:30000/v1 OPENAI_API_KEY=placeholder \ +inspect eval inspect_evals/gpqa_diamond \ + --model openai/moonshotai/Kimi-K2.6 \ + --max-tokens 131072 \ + --temperature 1.0 \ + --top-p 0.95 \ + --max-connections 128 \ + -T cot=True +``` + +**Results (partial — 553/792 samples across 4 epochs):** + + + + + + + + + + + + + + +
| Evaluation Mode | Accuracy |
| --- | --- |
| pass@1 (avg across epochs) | 96.9% |

| Epoch | Accuracy |
| --- | --- |
| 1 | 96.4% (160/166) |
| 2 | 96.9% (156/161) |
| 3 | 96.9% (155/160) |
| 4 | 98.5% (65/66) |
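Because the run is partial, the epochs contain different numbers of completed samples. The reported 96.9% appears consistent with pooling all completed samples rather than taking an unweighted mean of the per-epoch percentages. A short sketch, assuming the per-epoch counts in the table; the names are illustrative:

```python
# (correct, completed) per epoch, copied from the per-epoch table.
epochs = {1: (160, 166), 2: (156, 161), 3: (155, 160), 4: (65, 66)}

total_correct = sum(correct for correct, _ in epochs.values())
total_completed = sum(completed for _, completed in epochs.values())

# Pooled accuracy weights every completed sample equally.
pooled = 100.0 * total_correct / total_completed

# The unweighted mean gives the short epoch 4 the same weight as the others.
mean_of_epochs = sum(100.0 * c / n for c, n in epochs.values()) / len(epochs)

print(f"Completed samples: {total_completed}")
print(f"Pooled accuracy: {pooled:.1f}%")
print(f"Unweighted mean of epoch accuracies: {mean_of_epochs:.1f}%")
```

The completed-sample total matches the 553 stated in the results heading. The unweighted mean comes out slightly higher because epoch 4 has only 66 completed samples.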
+ +#### 5.1.4 OCRBench + +- Dataset: [OCRBench](https://huggingface.co/datasets/echo840/OCRBench) (1,000 questions with images) +- Evaluation Tool: [Kimi-Vendor-Verifier](https://github.com/MoonshotAI/Kimi-Vendor-Verifier) (inspect-ai based) +- Settings: max_tokens=4,096, thinking mode enabled (opensource) + +**Evaluation Command:** + +```shell Command +cd Kimi-Vendor-Verifier + +OPENAI_BASE_URL=http://localhost:30000/v1 OPENAI_API_KEY=placeholder \ +python3 eval.py ocrbench \ + --model openai/moonshotai/Kimi-K2.6 \ + --max-tokens 4096 \ + --think-mode opensource \ + --thinking \ + --max-connections 256 +``` + +**Results:** + + + + + + + + + + + + + + +
| Evaluation Mode | Accuracy |
| --- | --- |
| pass@1 | 90.8% |
+ +#### 5.1.5 MMMU Pro Vision + +- Dataset: [MMMU Pro](https://huggingface.co/datasets/MMMU/MMMU_Pro) standard 10-option subset (1,730 questions with images) +- Evaluation Tool: [Kimi-Vendor-Verifier](https://github.com/MoonshotAI/Kimi-Vendor-Verifier) (inspect-ai based) +- Settings: max_tokens=32,768, thinking mode (default), max_connections=256 + +> **Important**: Kimi-K2.6 is a reasoning model. Setting `max_tokens` too low (e.g., 4096) causes the thinking process to consume the entire token budget, leaving no tokens for the final answer. Use `max_tokens=32768` or higher. + +**Evaluation Command:** + +```shell Command +cd Kimi-Vendor-Verifier + +OPENAI_BASE_URL=http://localhost:30000/v1 OPENAI_API_KEY=placeholder \ +python3 eval.py mmmu \ + --model openai/moonshotai/Kimi-K2.6 \ + --max-tokens 32768 \ + --think-mode none \ + --max-connections 256 +``` + +**Results (1,481/1,730 samples completed):** + + + + + + + + + + + + + + +
| Evaluation Mode | Accuracy |
| --- | --- |
| pass@1 | 82.2% |
+ +### 5.2 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA H200 GPU (8x) +- Model: Kimi-K2.6 +- Tensor Parallelism: 8 +- SGLang Version: 0.5.9 + + +Kimi-K2.6 shares the same architecture as K2.5. Speed benchmarks are expected to be equivalent. The results below are measured with K2.5 and serve as a reference. + + +We use SGLang's built-in benchmarking tool with the `random` dataset for standardized performance evaluation. + +#### 5.2.1 Latency Benchmark + +- **Model Deployment:** + +```shell Command +sglang serve \ + --model-path moonshotai/Kimi-K2.6 \ + --tp 8 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Scenario 1: Chat (1K/1K)** + +- Low Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 39.77 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4221 +Request throughput (req/s): 0.25 +Input token throughput (tok/s): 153.40 +Output token throughput (tok/s): 106.10 +Peak output token throughput (tok/s): 156.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 259.50 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3972.87 +Median E2E Latency (ms): 4044.55 +P90 E2E Latency (ms): 7046.30 +P99 E2E Latency (ms): 7441.13 +---------------Time to First Token---------------- +Mean TTFT (ms): 176.89 +Median TTFT (ms): 154.24 +P99 TTFT (ms): 285.75 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 9.22 +Median TPOT (ms): 9.32 +P99 TPOT (ms): 12.72 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 9.02 +Median ITL (ms): 8.80 +P95 ITL (ms): 13.23 +P99 ITL (ms): 14.17 +Max ITL (ms): 29.38 +================================================== +``` + +- Medium Concurrency (Balanced) + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 158.05 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40775 +Request throughput (req/s): 0.51 +Input token throughput (tok/s): 250.99 +Output token throughput (tok/s): 258.18 +Peak output token throughput (tok/s): 1103.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 509.17 +Concurrency: 14.09 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 27837.05 +Median E2E Latency (ms): 23508.00 +P90 E2E Latency (ms): 57126.31 +P99 E2E Latency (ms): 66044.35 +---------------Time to First Token---------------- +Mean TTFT (ms): 374.30 +Median TTFT (ms): 375.51 +P99 TTFT (ms): 695.58 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 53.25 +Median TPOT (ms): 57.93 +P99 TPOT (ms): 85.45 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 53.95 +Median ITL (ms): 53.97 +P95 ITL (ms): 84.74 +P99 ITL (ms): 244.84 +Max ITL (ms): 655.61 +================================================== +``` + +- High Concurrency (Throughput-Optimized) + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 996.64 +Total input tokens: 249831 +Total input text tokens: 249831 +Total generated tokens: 252662 +Total generated tokens (retokenized): 252588 +Request throughput (req/s): 0.50 +Input token throughput (tok/s): 250.67 +Output token throughput (tok/s): 253.51 +Peak output token throughput (tok/s): 1199.00 +Peak concurrent requests: 104 +Total token throughput (tok/s): 504.18 +Concurrency: 92.70 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 184773.75 +Median E2E Latency (ms): 174183.65 +P90 E2E Latency (ms): 343625.28 +P99 E2E Latency (ms): 404284.53 +---------------Time to First Token---------------- +Mean TTFT (ms): 1289.59 +Median TTFT (ms): 1313.35 +P99 TTFT (ms): 2346.78 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 364.70 +Median TPOT (ms): 403.32 +P99 TPOT (ms): 452.34 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 363.82 +Median ITL (ms): 316.21 +P95 ITL (ms): 745.91 +P99 ITL (ms): 1345.88 +Max ITL (ms): 3118.59 +================================================== +``` + +**Scenario 2: Reasoning (1K/8K)** + +- Low Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 680.26 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 44462 +Total generated tokens (retokenized): 44455 +Request throughput (req/s): 0.01 +Input token throughput (tok/s): 8.97 +Output token throughput (tok/s): 65.36 +Peak output token throughput (tok/s): 151.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 74.33 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 68019.29 +Median E2E Latency (ms): 70568.85 +P90 E2E Latency (ms): 113237.40 +P99 E2E Latency (ms): 121682.34 +---------------Time to First Token---------------- +Mean TTFT (ms): 206.17 +Median TTFT (ms): 177.28 +P99 TTFT (ms): 445.37 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 14.36 +Median TPOT (ms): 15.89 +P99 TPOT (ms): 16.43 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 15.26 +Median ITL (ms): 15.85 +P95 ITL (ms): 17.50 +P99 ITL (ms): 23.21 +Max ITL (ms): 45.22 +================================================== +``` + +- Medium Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 2475.98 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 318306 +Total generated tokens (retokenized): 318166 +Request throughput (req/s): 0.03 +Input token throughput (tok/s): 16.02 +Output token throughput (tok/s): 128.56 +Peak output token throughput (tok/s): 847.00 +Peak concurrent requests: 18 +Total token throughput (tok/s): 144.58 +Concurrency: 14.62 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 452592.46 +Median E2E Latency (ms): 486002.05 +P90 E2E Latency (ms): 833197.57 +P99 E2E Latency (ms): 957399.48 +---------------Time to First Token---------------- +Mean TTFT (ms): 359.38 +Median TTFT (ms): 350.78 +P99 TTFT (ms): 500.36 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 111.18 +Median TPOT (ms): 122.76 +P99 TPOT (ms): 145.90 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 113.69 +Median ITL (ms): 122.81 +P95 ITL (ms): 147.87 +P99 ITL (ms): 151.03 +Max ITL (ms): 272.05 +================================================== +``` + +**Scenario 3: Summarization (8K/1K)** + +- Low Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 120.73 +Total input tokens: 41941 +Total input text tokens: 41941 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.08 +Input token throughput (tok/s): 347.41 +Output token throughput (tok/s): 34.96 +Peak output token throughput (tok/s): 73.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 382.36 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 12068.56 +Median E2E Latency (ms): 10211.36 +P90 E2E Latency (ms): 23203.32 +P99 E2E Latency (ms): 30677.66 +---------------Time to First Token---------------- +Mean TTFT (ms): 1625.64 +Median TTFT (ms): 1526.63 +P99 TTFT (ms): 3743.51 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 24.95 +Median TPOT (ms): 23.95 +P99 TPOT (ms): 35.40 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 24.80 +Median ITL (ms): 21.73 +P95 ITL (ms): 59.56 +P99 ITL (ms): 61.10 +Max ITL (ms): 62.70 +================================================== +``` + +- Medium Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 389.96 +Total input tokens: 300020 +Total input text tokens: 300020 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41670 +Request throughput (req/s): 0.21 +Input token throughput (tok/s): 769.36 +Output token throughput (tok/s): 106.86 +Peak output token throughput (tok/s): 304.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 876.22 +Concurrency: 14.95 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 72870.97 +Median E2E Latency (ms): 70495.88 +P90 E2E Latency (ms): 121820.46 +P99 E2E Latency (ms): 148933.09 +---------------Time to First Token---------------- +Mean TTFT (ms): 2460.45 +Median TTFT (ms): 1976.29 +P99 TTFT (ms): 7305.53 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 140.57 +Median TPOT (ms): 142.31 +P99 TPOT (ms): 273.40 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 135.44 +Median ITL (ms): 95.96 +P95 ITL (ms): 152.93 +P99 ITL (ms): 1488.37 +Max ITL (ms): 6540.24 +================================================== +``` + +- High Concurrency + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 1279.50 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169981 +Request throughput (req/s): 0.25 +Input token throughput (tok/s): 995.62 +Output token throughput (tok/s): 132.86 +Peak output token throughput (tok/s): 703.00 +Peak concurrent requests: 67 +Total token throughput (tok/s): 1128.49 +Concurrency: 60.12 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 240385.63 +Median E2E Latency (ms): 236266.30 +P90 E2E Latency (ms): 429882.12 +P99 E2E Latency (ms): 515158.36 +---------------Time to First Token---------------- +Mean TTFT (ms): 2710.44 +Median TTFT (ms): 2345.63 +P99 TTFT (ms): 7144.20 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 443.84 +Median TPOT (ms): 493.29 +P99 TPOT (ms): 606.19 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 448.23 +Median ITL (ms): 296.17 +P95 ITL (ms): 1869.15 +P99 ITL (ms): 2708.95 +Max ITL (ms): 7778.47 +================================================== +``` + +### 5.3 Speed Benchmark (AMD MI350X) + +**Test Environment:** + +- Hardware: AMD Instinct MI350X GPU (4x) +- Model: Kimi-K2.6 (INT4) +- Tensor Parallelism: 4 +- SGLang Version: 0.5.9 +- Docker Image: `lmsysorg/sglang:v0.5.9-rocm700-mi35x` +- ROCm: 7.0 + +We use SGLang's built-in benchmarking tool with the `random` dataset for standardized performance evaluation. + + +**AMD GPU TP Constraint**: Kimi-K2.6 requires TP ≤ 4 on AMD GPUs. The model has 64 attention heads, and the AITER MLA kernel requires `heads_per_gpu % 16 == 0`. With TP=4, each GPU gets 16 heads (valid). With TP=8, each GPU gets 8 heads (invalid). + + +#### 5.3.1 Latency Benchmark + +- **Model Deployment:** + +```shell Command +SGLANG_USE_AITER=1 SGLANG_ROCM_FUSED_DECODE_MLA=0 \ +sglang serve \ + --model-path moonshotai/Kimi-K2.6 \ + --tp 4 \ + --mem-fraction-static 0.8 \ + --trust-remote-code \ + --reasoning-parser kimi_k2 \ + --tool-call-parser kimi_k2 \ + --kv-cache-dtype fp8_e4m3 \ + --host 0.0.0.0 \ + --port 30000 +``` + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- **Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 155.81 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4222 +Request throughput (req/s): 0.06 
+Input token throughput (tok/s): 39.16 +Output token throughput (tok/s): 27.09 +Peak output token throughput (tok/s): 29.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 66.24 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 15576.22 +Median E2E Latency (ms): 12539.80 +P90 E2E Latency (ms): 28150.56 +P99 E2E Latency (ms): 34873.51 +---------------Time to First Token---------------- +Mean TTFT (ms): 563.50 +Median TTFT (ms): 594.92 +P99 TTFT (ms): 830.31 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 35.61 +Median TPOT (ms): 35.66 +P99 TPOT (ms): 35.77 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 35.66 +Median ITL (ms): 35.69 +P95 ITL (ms): 35.96 +P99 ITL (ms): 36.13 +Max ITL (ms): 36.92 +================================================== +``` + +- Medium Concurrency (Balanced) + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-K2.6 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 526.66 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40798 +Request throughput (req/s): 0.15 +Input token throughput (tok/s): 75.32 +Output token throughput (tok/s): 77.48 +Peak output token throughput (tok/s): 96.00 +Peak concurrent requests: 18 +Total token throughput (tok/s): 152.80 +Concurrency: 14.59 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 96023.27 +Median E2E Latency (ms): 93940.20 +P90 E2E Latency (ms): 159449.54 +P99 E2E Latency (ms): 194706.61 +---------------Time to First Token---------------- +Mean TTFT (ms): 
989.08 +Median TTFT (ms): 886.42 +P99 TTFT (ms): 1543.60 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 191.04 +Median TPOT (ms): 195.20 +P99 TPOT (ms): 238.84 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 186.68 +Median ITL (ms): 183.82 +P95 ITL (ms): 189.90 +P99 ITL (ms): 673.64 +Max ITL (ms): 1633.20 +================================================== +``` diff --git a/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.mdx b/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.mdx new file mode 100644 index 000000000000..abe1abea1fec --- /dev/null +++ b/docs_new/cookbook/autoregressive/Moonshotai/Kimi-K2.mdx @@ -0,0 +1,520 @@ +--- +title: Kimi-K2 +metatags: + description: "Deploy Kimi-K2 MoE model with SGLang - 1T total parameters, 32B active, step-by-step reasoning and tool calling capabilities." +--- + +import { KimiK2Deployment } from '/src/snippets/autoregressive/kimi-k2-deployment.jsx'; + +## 1. Model Introduction + +[Kimi-K2](https://moonshotai.github.io/Kimi-K2/) is a state-of-the-art MoE language model by Moonshot AI with 32B activated parameters and 1T total parameters. + +**Model Variants:** + +- **[Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct)**: Post-trained model optimized for general-purpose chat and agentic tasks. Compatible with vLLM, SGLang, KTransformers, and TensorRT-LLM. +- **[Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking)**: Advanced thinking model with step-by-step reasoning and tool calling. Native INT4 quantization with 256k context window. Ideal for complex reasoning and multi-step tool use. +- **ROCm Support**: Compatible with AMD MI300X GPUs via SGLang (verified). + +For details, see [official documentation](https://github.com/MoonshotAI/Kimi-K2) and [technical report](https://www.arxiv.org/abs/2507.20534). + +## 2. SGLang Installation + +Refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. 
Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and capabilities. + + + +### 3.2 Configuration Tips + +- **Memory**: Requires 8 GPUs with ≥140GB each (H200/B200). Use `--context-length 128000` to conserve memory. +- **Expert Parallelism (EP)**: Use `--ep` for better MoE throughput. See [EP docs](../../../docs/advanced_features/expert_parallelism). +- **Data Parallel (DP)**: Enable with `--dp 4 --enable-dp-attention` for production throughput. +- **KV Cache**: Use `--kv-cache-dtype fp8_e4m3` to roughly halve KV cache memory relative to a 16-bit cache (CUDA 11.8+). +- **Reasoning Parser**: Add `--reasoning-parser kimi_k2` for Kimi-K2-Thinking to separate thinking and content. +- **Tool Call Parser**: Add `--tool-call-parser kimi_k2` for structured tool calls. +- **AMD GPU**: Set `SGLANG_ROCM_FUSED_DECODE_MLA=0` before launching the server on AMD GPUs. + +## 4. Model Invocation + +### 4.1 Basic Usage + +See [Basic API Usage](../../../docs/get-started/quickstart). 
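For readers who want to see the shape of a basic request without installing a client library, the following is a minimal sketch using only the Python standard library. It assumes the server listens on `localhost:8000` (the port used by the launch commands in this guide) and exposes the OpenAI-compatible `/v1/chat/completions` route; the helper names `build_chat_request` and `send` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumptions for illustration: host, port, and model name follow the launch
# commands in this guide; adjust them to match your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "moonshotai/Kimi-K2-Instruct"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL, max_tokens=256):
    """Build (but do not send) a non-streaming chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once the server is running, `send(build_chat_request("What is 15% of 240?"))` should return the assistant's reply as a string. Building and sending are kept separate so the payload can be inspected before any network call is made.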
+ +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +Enable reasoning parser for Kimi-K2-Thinking: + +```shell Command +python -m sglang.launch_server \ + --model moonshotai/Kimi-K2-Thinking \ + --reasoning-parser kimi_k2 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="moonshotai/Kimi-K2-Thinking", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.6, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= + The user asks: "What is 15% of 240?" This is a straightforward percentage calculation problem. I need to solve it step by step. + +Step 1: Understand what "percent" means. +- "Percent" means "per hundred". So 15% means 15 per 100, or 15/100, or 0.15. + +Step 2: Convert the percentage to a decimal. +- 15% = 15 / 100 = 0.15 + +Step 3: Multiply the decimal by the number. 
+- 0.15 * 240 + +Step 4: Perform the multiplication. +- 0.15 * 240 = (15/100) * 240 +- = 15 * 240 / 100 +- = 3600 / 100 +- = 36 + +Alternatively, I can calculate it directly: +- 0.15 * 240 +- 15 * 240 = 3600 +- 3600 / 100 = 36 + +Or, break it down: +- 10% of 240 = 24 +- 5% of 240 = half of 10% = 12 +- 15% of 240 = 10% + 5% = 24 + 12 = 36 + +I should present the solution clearly with steps. The most standard method is converting to decimal and multiplying. + +Let me structure the answer: +1. Convert the percentage to a decimal. +2. Multiply the decimal by the number. +3. Show the calculation. +4. State the final answer. + +This is simple and easy to follow. +=============== Content ================= + Here is the step-by-step solution: + +**Step 1: Convert the percentage to a decimal** +15% means 15 per 100, which is 15 ÷ 100 = **0.15** + +**Step 2: Multiply the decimal by the number** +0.15 × 240 + +**Step 3: Calculate the result** +0.15 × 240 = **36** + +**Answer:** 15% of 240 is **36**. +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +Kimi-K2-Instruct and Kimi-K2-Thinking support tool calling capabilities. 
Enable the tool call parser during deployment: + +**Deployment Command:** + +```shell Command +python -m sglang.launch_server \ + --model moonshotai/Kimi-K2-Instruct \ + --tool-call-parser kimi_k2 \ + --tp 8 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="moonshotai/Kimi-K2-Thinking", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = 
tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= + The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. Beijing is a major city in China, so I should be able to get weather data for it. The location parameter is required, but the unit parameter is optional. Since the user didn't specify a temperature unit, I can just provide the location and let the function use its default. I'll check the weather in Beijing for you. +=============== Content ================= + + 🔧 Tool Call: get_weather + Arguments: {"location":"Beijing"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="moonshotai/Kimi-K2-Thinking", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (8x) +- Model: Kimi-K2-Instruct +- sglang version: 0.5.6.post1 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. 
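The throughput figures in the benchmark reports can be cross-checked from the token totals and the benchmark duration. The sketch below does this for the latency-sensitive run in Section 5.1.1 (1951 input tokens, 2755 generated tokens, 44.93 s). The per-request TPOT formula is an assumption based on the report's own label, "Time per Output Token (excl. 1st token)"; the function names are hypothetical.

```python
def throughput(total_tokens, duration_s):
    """Tokens per second over the whole benchmark run."""
    return total_tokens / duration_s


def tpot_ms(e2e_latency_ms, ttft_ms, output_tokens):
    """Per-request time per output token, excluding the first token.

    Assumed definition: decode time divided by the number of tokens
    produced after the first one.
    """
    if output_tokens <= 1:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_latency_ms - ttft_ms) / (output_tokens - 1)


# Totals copied from the Section 5.1.1 report.
duration_s = 44.93
input_tokens = 1951
generated_tokens = 2755

print(round(throughput(input_tokens, duration_s), 2))      # 43.42
print(round(throughput(generated_tokens, duration_s), 2))  # 61.32
print(round(throughput(input_tokens + generated_tokens, duration_s), 2))  # 104.74
```

Because the report prints a rounded duration, recomputed values for other runs may differ from the reported ones in the last decimal place.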
+ +#### 5.1.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path moonshotai/Kimi-K2-Instruct \ + --tp 8 \ + --dp 4 \ + --enable-dp-attention \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 8000 \ + --model moonshotai/Kimi-K2-Instruct\ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results**: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 44.93 +Total input tokens: 1951 +Total input text tokens: 1951 +Total input vision tokens: 0 +Total generated tokens: 2755 +Total generated tokens (retokenized): 2748 +Request throughput (req/s): 0.22 +Input token throughput (tok/s): 43.42 +Output token throughput (tok/s): 61.32 +Peak output token throughput (tok/s): 64.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 104.74 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4489.56 +Median E2E Latency (ms): 4994.53 +---------------Time to First Token---------------- +Mean TTFT (ms): 141.22 +Median TTFT (ms): 158.28 +P99 TTFT (ms): 166.90 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 18.40 +Median TPOT (ms): 15.63 +P99 TPOT (ms): 39.88 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 15.78 +Median ITL (ms): 15.76 +P95 ITL (ms): 16.36 +P99 ITL (ms): 16.59 +Max ITL (ms): 19.94 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path moonshotai/Kimi-K2-Instruct \ + --tp 8 \ + --dp 4 \ + --ep 4 \ + --enable-dp-attention \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 8000 \ + --model moonshotai/Kimi-K2-Instruct\ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results**: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 174.11 +Total input tokens: 296642 +Total input text tokens: 296642 +Total input vision tokens: 0 +Total generated tokens: 193831 +Total generated tokens (retokenized): 168687 +Request throughput (req/s): 5.74 +Input token throughput (tok/s): 1703.73 +Output token throughput (tok/s): 1113.25 +Peak output token throughput (tok/s): 2383.00 +Peak concurrent requests: 112 +Total token throughput (tok/s): 2816.97 +Concurrency: 89.60 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 15601.09 +Median E2E Latency (ms): 10780.52 +---------------Time to First Token---------------- +Mean TTFT (ms): 457.42 +Median TTFT (ms): 221.62 +P99 TTFT (ms): 2475.32 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 97.23 +Median TPOT (ms): 85.61 +P99 TPOT (ms): 435.95 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 78.61 +Median ITL (ms): 43.66 +P95 ITL (ms): 169.53 +P99 ITL (ms): 260.91 +Max ITL (ms): 1703.21 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- Server Command + +```shell Command +python3 -m sglang.launch_server \ + --model-path moonshotai/Kimi-K2-Instruct \ + --tp 8 \ + --dp 4 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000 +``` + +- **Result**: + +```text Output +Accuracy: 0.960 +Invalid: 0.000 +Latency: 15.956 s +Output throughput: 1231.699 token/s +``` diff --git a/docs_new/cookbook/autoregressive/Moonshotai/Kimi-Linear.mdx b/docs_new/cookbook/autoregressive/Moonshotai/Kimi-Linear.mdx new file mode 100644 index 000000000000..4c9d1158deb0 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Moonshotai/Kimi-Linear.mdx @@ -0,0 +1,297 @@ +--- +title: Kimi-Linear +metatags: + description: "Deploy Kimi-Linear with SGLang - community contribution guide for Moonshot AI's Kimi-Linear model deployment." +--- + +import { KimiLinearDeployment } from '/src/snippets/autoregressive/kimi-linear-deployment.jsx'; + +## AMD GPU Support + +## 1. Model Introduction +Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short, long, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA)—a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory. + +This generation delivers comprehensive upgrades across the board: + +Kimi Delta Attention (KDA): A linear attention mechanism that refines the gated delta rule with finegrained gating. 
+Hybrid Architecture: A 3:1 KDA-to-global MLA ratio reduces memory usage while maintaining or surpassing the quality of full attention. +Superior Performance: Outperforms full attention in a variety of tasks, including long-context and RL-style benchmarks on 1.4T token training runs with fair comparisons. +High Throughput: Achieves up to 6× faster decoding and significantly reduces time per output token (TPOT). + +For more details, please refer to the [official Kimi Linear GitHub repository](https://github.com/MoonshotAI/Kimi-Linear). + +## 2. SGLang Installation + +SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. + + + +## 4. 
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 Launch the Docker container +```shell Command +docker pull lmsysorg/sglang:v0.5.7-rocm700-mi30x +``` + +```shell Command +docker run -d -it --ipc=host --network=host --privileged \ + --cap-add=CAP_SYS_ADMIN \ + --device=/dev/kfd --device=/dev/dri --device=/dev/mem \ + --group-add video --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v /:/work \ + -e SHELL=/bin/bash \ + --name Kimi-linear \ + lmsysorg/sglang:v0.5.7-rocm700-mi30x \ + /bin/bash +``` + +#### 4.2.2 Pre-installation steps inside the container + +```shell Command +pip install sentencepiece tiktoken +``` + +#### 4.2.3 Launch the server +```shell Command +export SGLANG_ROCM_FUSED_DECODE_MLA=0 + +python3 -m sglang.launch_server \ + --model-path moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --tokenizer-path moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --tp 4 \ + --trust-remote-code +``` + +## 5. 
Benchmark +### 5.1 Speed Benchmark +Test Environment: + +Hardware: AMD MI300X GPU + +Model: Kimi-Linear-48B-A3B-Instruct + +Tensor Parallelism: 4 + +sglang version: 0.5.7 + +- **Model Deployment** + +```bash Command +SGLANG_ROCM_FUSED_DECODE_MLA=0 python3 -m sglang.launch_server \ + --model-path moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --tokenizer-path moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --tp 4 \ + --trust-remote-code +``` + +### 5.1.1 Low Concurrency (Latency-Optimized) + +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 23.86 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4001 +Request throughput (req/s): 0.42 +Input token throughput (tok/s): 255.70 +Output token throughput (tok/s): 176.86 +Peak output token throughput (tok/s): 190.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 432.56 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2383.93 +Median E2E Latency (ms): 1911.63 +---------------Time to First Token---------------- +Mean TTFT (ms): 141.33 +Median TTFT (ms): 126.27 +P99 TTFT (ms): 294.76 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 5.32 +Median TPOT (ms): 5.33 +P99 TPOT (ms): 5.36 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 5.33 +Median ITL (ms): 5.32 +P95 ITL (ms): 5.44 +P99 ITL (ms): 5.58 +Max ITL (ms): 11.46 +================================================== +``` + +### 5.1.2 Medium Concurrency (Balanced) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 31.38 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40805 +Total generated tokens (retokenized): 39667 +Request throughput (req/s): 2.55 +Input token throughput (tok/s): 1264.13 +Output token throughput (tok/s): 1300.37 +Peak output token throughput (tok/s): 1801.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 2564.50 +Concurrency: 14.13 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 5543.18 +Median E2E Latency (ms): 5755.31 +---------------Time to First Token---------------- +Mean TTFT (ms): 175.25 +Median TTFT (ms): 137.87 +P99 TTFT (ms): 292.92 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.75 +Median TPOT (ms): 10.87 +P99 TPOT (ms): 16.74 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 10.54 +Median ITL (ms): 7.95 +P95 ITL (ms): 13.68 +P99 ITL (ms): 116.80 +Max ITL (ms): 299.89 +================================================== + +``` + +### 5.1.3 High Concurrency (Throughput-Optimized) +- Benchmark Command: +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 \ + --request-rate inf +``` + +- Test Results: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 79.71 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252662 +Total generated tokens (retokenized): 228448 +Request throughput (req/s): 6.27 +Input token throughput (tok/s): 3134.20 +Output token throughput (tok/s): 3169.72 +Peak output token throughput (tok/s): 6109.00 +Peak concurrent requests: 110 +Total token throughput (tok/s): 6303.92 +Concurrency: 94.80 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 15113.92 +Median E2E Latency (ms): 13851.52 +---------------Time to First Token---------------- +Mean TTFT (ms): 564.46 +Median TTFT (ms): 226.04 +P99 TTFT (ms): 2683.14 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 29.63 +Median TPOT (ms): 31.28 +P99 TPOT (ms): 38.84 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 28.85 +Median ITL (ms): 16.29 +P95 ITL (ms): 123.42 +P99 ITL (ms): 157.80 +Max ITL (ms): 2481.11 +================================================== +``` +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- Server Command + +```shell Command +SGLANG_ROCM_FUSED_DECODE_MLA=0 python3 -m sglang.launch_server \ + --model-path moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --tokenizer-path moonshotai/Kimi-Linear-48B-A3B-Instruct \ + --tp 4 \ + --trust-remote-code +``` + +- Benchmark Command + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` + +- **Result**: + +```text Output +Accuracy: 0.705 +Invalid: 0.000 +Latency: 11.855 s +Output throughput: 3224.982 token/s +``` diff --git a/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Nano-Omni.mdx b/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Nano-Omni.mdx new file mode 100644 index 000000000000..492782b6b465 --- /dev/null +++ b/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Nano-Omni.mdx @@ -0,0 +1,657 @@ +--- +title: Nemotron 3 Nano Omni +metatags: + description: "Deploy NVIDIA Nemotron 3 Nano Omni multimodal MoE model with SGLang - text, image, video, and audio inputs with reasoning and tool calling." +tag: + NEW +--- + +import { Nemotron3NanoOmniDeployment } from '/src/snippets/autoregressive/nemotron3-nano-omni-deployment.jsx'; + +## 1. Model Introduction + +`NVIDIA Nemotron 3 Nano Omni` is a 30B-parameter hybrid MoE multimodal model that activates only 3B parameters per forward pass, combining vision and audio encoders into a unified architecture. Part of the Nemotron 3 family, it is designed to power multimodal sub-agents that perceive and reason across vision, audio, and language in a single inference loop — eliminating the fragmented stacks of separate models for each modality. 
+ +Architecture and key features: + +- **Hybrid Transformer-Mamba Architecture (MoE):** Combines Mixture of Experts with a hybrid Transformer-Mamba architecture for efficient routing and sequence modeling. +- **30B total / 3B active parameters:** Delivers strong multimodal accuracy at a fraction of the cost of dense models. +- **1M token context window:** Sustains coherent agent state across extended multimodal workflows — screen history, document content, and audio context remain in view without re-ingestion. +- **Unified vision and audio encoders:** One model replaces fragmented multimodal stacks; vision and audio perception happen in the same forward pass. +- **3D Convolution (Conv3D):** Efficient temporal-spatial processing for video inputs. +- **Efficient Video Sampling (EVS):** Enables longer video processing at the same compute budget via temporal-aware perception and adaptive frame sampling. +- **FP8 and NVFP4 quantization:** FP8 supports deployment from workstation (RTX 6000, DGX Spark) to cloud (H100, H200, B200, A100, L40S); NVFP4 requires Blackwell hardware. +- **9x higher throughput** than other open omni models at the same interactivity level. +- **~20% higher multimodal intelligence** compared to the best open alternative. +- **Post-trained with multi-environment reinforcement learning** via NVIDIA NeMo RL and NeMo Gym across text, image, audio, and video environments, improving instruction following and convergence to correct multimodal answers. 
+ +**Modalities:** Input: text, image, video, audio — Output: text + +**Supported GPUs:** NVIDIA B200, H100, H200, A100, L40S, DGX Spark, RTX 6000 + +Available model variants on HuggingFace: +- [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning) +- [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-BF16`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-BF16) +- [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-FP8`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-FP8) +- [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-NVFP4`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-NVFP4) + +**Agentic workloads this model enables:** +- **Computer Use Agent:** Perception loop for agents navigating GUIs — reads screens, understands UI state over time, validates outcomes. Collapses vision and reasoning into a single loop. +- **Document Intelligence:** Interprets documents, charts, tables, screenshots, and mixed media inputs for enterprise analysis and compliance workflows. +- **Audio & Video Understanding Agents:** Maintains continuous audio-video context for customer service, research, and monitoring workflows, tying what was said, shown, and documented into a single reasoning stream. + +## 2. SGLang Installation + +Install SGLang via pip or from source: + +```shell Command +# Install via pip +pip install sglang + +# Or install from source +uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python' + +# Or use Docker +docker pull lmsysorg/sglang:dev-cu13-nemotronh-nano-omni-reasoning-v3 +``` + +For the full Docker setup and other installation methods, refer to the [official SGLang installation guide](../../../docs/get-started/installation). + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance tuning. 
+ +### 3.1 Basic Configuration + +**Interactive Command Generator**: select hardware, model variant, and common knobs to generate a launch command. + + + +### 3.2 Configuration Tips + +- **Attention backend:** + + **H100/H200:** Use flash attention 3 backend by default. + **B200:** Use flashinfer backend by default. + +- **TP support:** + + To set tensor parallelism, use `--tp <1|2|4|8>`. A 4×H100 setup is recommended for the BF16/Reasoning variant. + +- **FP8 KV cache:** + + To enable FP8 KV cache, append `--kv-cache-dtype fp8_e4m3`. FP8 KV cache trades a small amount of accuracy for memory; omit the flag if you observe accuracy regressions on your workload. + +- **Reasoning parser:** + + Append `--reasoning-parser deepseek-r1` to enable structured reasoning traces (`reasoning_content` field in the response). + +- **Tool calling:** + + Append `--tool-call-parser qwen3_coder` to enable tool calling support. + +## 4. Model Invocation + +The command below launches the server for a 4×H100 setup with reasoning and tool calling enabled. See [Section 4.8](#48-fp8-and-nvfp4-deployment) for FP8 and NVFP4 variants. + +```shell Command +sglang serve \ + --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \ + --host 0.0.0.0 \ + --port 30000 \ + --tp 4 \ + --trust-remote-code \ + --tool-call-parser qwen3_coder \ + --reasoning-parser deepseek-r1 +``` + +### 4.1 Basic Usage (Text) + +SGLang provides an OpenAI-compatible endpoint. 
Example with the OpenAI Python client: + +```python Example +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +resp = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + {"role": "system", "content": "You are a helpful AI assistant."}, + {"role": "user", "content": "Give me 3 bullet points about SGLang."}, + ], + temperature=0.6, + max_tokens=512, +) +print(resp.choices[0].message.reasoning_content, resp.choices[0].message.content) +``` + +Output: +```text Output +Reasoning: SGLang is a serving framework I know from my training data. Let me recall the key features... + +Content: +- **Radix Attention** — SGLang reuses KV cache across requests sharing a common prefix, dramatically reducing memory and compute for multi-turn and few-shot workloads. +- **OpenAI-compatible API** — Drop-in replacement for the OpenAI Python client; no application code changes required to serve a locally-hosted model. +- **High-throughput serving** — Continuous batching, chunked prefill, and optimized CUDA kernels deliver state-of-the-art throughput on NVIDIA GPUs across A100, H100, and B200. +``` + +Streaming chat completion: + +```python Example +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +stream = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + {"role": "system", "content": "You are a helpful AI assistant."}, + {"role": "user", "content": "What are the first 5 prime numbers?"}, + ], + temperature=0.6, + max_tokens=512, + stream=True, +) +for chunk in stream: + delta = chunk.choices[0].delta + if delta and delta.content: + print(delta.content, end="", flush=True) +``` + +### 4.2 Image Understanding + +Pass image inputs using the OpenAI vision format. 
Supports both URLs and base64-encoded images: + +```python Example +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +# From URL +resp = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}, + }, + {"type": "text", "text": "Describe this image in detail."}, + ], + } + ], + temperature=0.6, + max_tokens=512, +) +print(resp.choices[0].message.reasoning_content) +print(resp.choices[0].message.content) +``` + +For local images, encode as base64: + +```python Example +import base64 +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +with open("screenshot.png", "rb") as f: + image_b64 = base64.b64encode(f.read()).decode("utf-8") + +resp = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": {"url": f"data:image/png;base64,{image_b64}"}, + }, + {"type": "text", "text": "What UI elements are visible on this screen? 
What action would you take next?"}, + ], + } + ], + temperature=0.6, + max_tokens=512, +) +print(resp.choices[0].message.content) +``` + +### 4.3 Video Understanding + +Nemotron 3 Nano Omni uses Conv3D layers and Efficient Video Sampling (EVS) for temporal-spatial video reasoning, processing longer videos at the same compute budget: + +```python Example +import base64 +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +with open("video.mp4", "rb") as f: + video_b64 = base64.b64encode(f.read()).decode("utf-8") + +resp = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + { + "role": "user", + "content": [ + { + "type": "video_url", + "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}, + }, + {"type": "text", "text": "Summarize what happens in this video step by step."}, + ], + } + ], + temperature=0.6, + max_tokens=1024, +) +print(resp.choices[0].message.reasoning_content) +print(resp.choices[0].message.content) +``` + +### 4.4 Audio Understanding + +Pass audio inputs as base64-encoded WAV or MP3 data: + +```python Example +import base64 +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +with open("audio.wav", "rb") as f: + audio_b64 = base64.b64encode(f.read()).decode("utf-8") + +resp = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + { + "role": "user", + "content": [ + { + "type": "input_audio", + "input_audio": {"data": audio_b64, "format": "wav"}, + }, + {"type": "text", "text": "Transcribe and summarize what was said in this audio."}, + ], + } + ], + temperature=0.6, + max_tokens=512, +) +print(resp.choices[0].message.content) +``` + +### 4.5 Mixed Multimodal Input + +Combine modalities in a single request. 
For example, an image alongside a text question about it: + +```python Example +import base64 +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +with open("chart.png", "rb") as f: + image_b64 = base64.b64encode(f.read()).decode("utf-8") + +resp = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": {"url": f"data:image/png;base64,{image_b64}"}, + }, + {"type": "text", "text": "Analyze this chart. What are the key trends and what conclusion does the data support?"}, + ], + } + ], + temperature=0.6, + max_tokens=1024, +) +print(resp.choices[0].message.reasoning_content) +print(resp.choices[0].message.content) +``` + +### 4.6 Reasoning + +The model supports two modes — Reasoning ON (default) vs OFF. Toggle per-request by setting `enable_thinking` to `False`: + +```python Example +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +# Reasoning ON (default) +print("Reasoning on") +resp = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is the derivative of x^3 sin(x)?"}, + ], + temperature=0.6, + max_tokens=1024, +) +print(f"Reasoning:\n{resp.choices[0].message.reasoning_content[:300]}...\nContent:\n{resp.choices[0].message.content}") +print("\n") + +# Reasoning OFF +print("Reasoning off") +resp = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is 15% of 200?"}, + ], + temperature=0.6, + max_tokens=256, + extra_body={"chat_template_kwargs": {"enable_thinking": False}}, +) 
+print(f"Content:\n{resp.choices[0].message.content}") +``` + +Output: +```text Output +Reasoning on +Reasoning: +The user wants the derivative of x^3 sin(x). I'll apply the product rule: d/dx[u·v] = u'v + uv'. Here u = x^3, v = sin(x). So u' = 3x^2, v' = cos(x). The result is 3x^2·sin(x) + x^3·cos(x)... +Content: +Using the product rule: d/dx[x³ sin(x)] = 3x² sin(x) + x³ cos(x) + + +Reasoning off +Content: +15% of 200 is **30**. +``` + +### 4.7 Tool Calling + +Call functions using the OpenAI Tools schema. The server must be launched with `--tool-call-parser qwen3_coder`: + +```python Example +from openai import OpenAI + +SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning" +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +TOOLS = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "City and state, e.g. San Francisco, CA", + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + }, + }, + "required": ["location"], + }, + }, + } +] + +completion = client.chat.completions.create( + model=SERVED_MODEL_NAME, + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is the weather like in Santa Clara, CA?"}, + ], + tools=TOOLS, + temperature=0.6, + top_p=0.95, + max_tokens=512, + stream=False, +) +print(completion.choices[0].message.reasoning_content) +print(completion.choices[0].message.tool_calls) +``` + +Output: +```text Output +The user is asking about weather in Santa Clara, CA. I have a get_weather function that takes a location and optional unit. I should call it with location="Santa Clara, CA". 
+ +[ChatCompletionMessageFunctionToolCall(id='call_abc123', function=Function(arguments='{"location": "Santa Clara, CA", "unit": "fahrenheit"}', name='get_weather'), type='function', index=0)] +``` + +### 4.8 FP8 and NVFP4 Deployment + +**FP8 variant** (recommended for throughput-critical serving on H100/H200/B200): + +```shell Command +sglang serve \ + --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-FP8 \ + --host 0.0.0.0 \ + --port 30000 \ + --tp 4 \ + --trust-remote-code \ + --tool-call-parser qwen3_coder \ + --reasoning-parser deepseek-r1 +``` + +**NVFP4 variant** (maximum efficiency on Blackwell B200): + +```shell Command +sglang serve \ + --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-NVFP4 \ + --host 0.0.0.0 \ + --port 30000 \ + --tp 2 \ + --trust-remote-code \ + --tool-call-parser qwen3_coder \ + --reasoning-parser deepseek-r1 +``` + +--- + +## 5. Benchmark + +### 5.1 Efficiency Benchmark + +Nemotron 3 Nano Omni achieves **9x higher throughput** than other open omni models at the same interactivity level, delivering lower cost and better scalability without sacrificing responsiveness. It also achieves **~20% higher multimodal intelligence** compared to the best open alternative across image, video, and audio reasoning tasks. 
+ +### 5.2 Speed Benchmark + +**Test Environment:** +- Hardware: B200 (8×) +- Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning +- Tensor Parallelism: 4 +- SGLang Version: main branch + +Model Deployment Command: + +```shell Command +sglang serve \ + --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \ + --trust-remote-code \ + --tp 4 \ + --max-running-requests 1024 \ + --host 0.0.0.0 \ + --attention-backend flashinfer \ + --port 30000 +``` + +Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 4096 \ + --max-concurrency 256 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 256 +Successful requests: 4096 +Benchmark duration (s): 206.52 +Total input tokens: 2081726 +Total input text tokens: 2081726 +Total generated tokens: 2087288 +Total generated tokens (retokenized): 1945477 +Request throughput (req/s): 19.83 +Input token throughput (tok/s): 10080.25 +Output token throughput (tok/s): 10107.18 +Peak output token throughput (tok/s): 20199.00 +Peak concurrent requests: 291 +Total token throughput (tok/s): 20187.44 +Concurrency: 250.83 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 12646.47 +Median E2E Latency (ms): 12371.84 +P90 E2E Latency (ms): 22889.81 +P99 E2E Latency (ms): 26528.70 +---------------Time to First Token---------------- +Mean TTFT (ms): 220.66 +Median TTFT (ms): 97.67 +P99 TTFT (ms): 2068.63 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 24.98 +Median TPOT (ms): 24.36 +P99 TPOT (ms): 44.97 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 24.43 +Median ITL (ms): 10.91 +P95 ITL (ms): 62.68 +P99 ITL (ms): 100.60 +Max ITL (ms): 2171.93 +================================================== +``` + +### 5.3 Accuracy Benchmark + +**Environment** +- Hardware: B200 (8×) +- Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning +- Tensor Parallelism: 4 +- SGLang Version: main branch + +**Launch Model** +```shell Command +sglang serve \ + --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \ + --trust-remote-code \ + --tp 4 \ + --attention-backend flashinfer \ + --reasoning-parser deepseek-r1 +``` + +#### 5.3.1 GSM8K Benchmark + +**Run Benchmark** +```shell Command +python3 benchmark/gsm8k/bench_sglang.py --port 30000 +``` + +**Test Results:** +```text Output +Accuracy: 0.830 +Invalid: 0.000 +Latency: 13.970 s +Output throughput: 1611.623 token/s +``` + +#### 5.3.2 MMLU Benchmark + +**Run Benchmark** +```shell Command +python3 benchmark/mmlu/bench_sglang.py --port 30000 +``` + +**Test Results:** +```text Output +subject: abstract_algebra, #q:100, acc: 0.510 +subject: anatomy, #q:135, acc: 0.711 +subject: astronomy, #q:152, acc: 0.829 +subject: business_ethics, #q:100, acc: 0.760 +subject: clinical_knowledge, #q:265, acc: 0.781 +subject: college_biology, #q:144, acc: 0.854 +subject: college_chemistry, #q:100, acc: 0.560 +subject: college_computer_science, #q:100, acc: 0.700 +subject: college_mathematics, #q:100, acc: 0.590 +subject: college_medicine, #q:173, acc: 0.775 +subject: college_physics, #q:102, acc: 0.559 +subject: computer_security, #q:100, acc: 0.750 +subject: conceptual_physics, #q:235, acc: 0.821 +subject: econometrics, #q:114, acc: 0.605 +subject: electrical_engineering, #q:145, acc: 0.759 +subject: elementary_mathematics, #q:378, acc: 0.638 +subject: formal_logic, #q:126, acc: 0.524 +subject: global_facts, #q:100, acc: 0.400 +subject: 
high_school_biology, #q:310, acc: 0.906 +subject: high_school_chemistry, #q:203, acc: 0.759 +subject: high_school_computer_science, #q:100, acc: 0.860 +subject: high_school_european_history, #q:165, acc: 0.812 +subject: high_school_geography, #q:198, acc: 0.889 +subject: high_school_government_and_politics, #q:193, acc: 0.933 +subject: high_school_macroeconomics, #q:390, acc: 0.785 +subject: high_school_mathematics, #q:270, acc: 0.496 +subject: high_school_microeconomics, #q:238, acc: 0.887 +subject: high_school_physics, #q:151, acc: 0.675 +subject: high_school_psychology, #q:545, acc: 0.895 +subject: high_school_statistics, #q:216, acc: 0.731 +subject: high_school_us_history, #q:204, acc: 0.858 +subject: high_school_world_history, #q:237, acc: 0.873 +subject: human_aging, #q:223, acc: 0.740 +subject: human_sexuality, #q:131, acc: 0.855 +subject: international_law, #q:121, acc: 0.851 +subject: jurisprudence, #q:108, acc: 0.815 +subject: logical_fallacies, #q:163, acc: 0.847 +subject: machine_learning, #q:112, acc: 0.598 +subject: management, #q:103, acc: 0.864 +subject: marketing, #q:234, acc: 0.910 +subject: medical_genetics, #q:100, acc: 0.880 +subject: miscellaneous, #q:783, acc: 0.881 +subject: moral_disputes, #q:346, acc: 0.780 +subject: moral_scenarios, #q:895, acc: 0.543 +subject: nutrition, #q:306, acc: 0.814 +subject: philosophy, #q:311, acc: 0.733 +subject: prehistory, #q:324, acc: 0.852 +subject: professional_accounting, #q:282, acc: 0.553 +subject: professional_law, #q:1534, acc: 0.565 +subject: professional_medicine, #q:272, acc: 0.779 +subject: professional_psychology, #q:612, acc: 0.760 +subject: public_relations, #q:110, acc: 0.709 +subject: security_studies, #q:245, acc: 0.759 +subject: sociology, #q:201, acc: 0.831 +subject: us_foreign_policy, #q:100, acc: 0.910 +subject: virology, #q:166, acc: 0.560 +subject: world_religions, #q:171, acc: 0.807 +Total latency: 67.512 +Average accuracy: 0.737 +``` diff --git 
a/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Nano.mdx b/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Nano.mdx new file mode 100644 index 000000000000..0da99a106c0e --- /dev/null +++ b/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Nano.mdx @@ -0,0 +1,375 @@ +--- +title: Nemotron3-Nano +metatags: + description: "Deploy NVIDIA Nemotron3-Nano 30B hybrid LLM with SGLang - MoE, Mamba2, and attention layers with BF16/FP8 precision options." +--- + +import { Nemotron3NanoDeployment } from '/src/snippets/autoregressive/nemotron3-nano-deployment.jsx'; + +## 1. Model Introduction + +`NVIDIA Nemotron3-Nano` is a 30B-parameter hybrid LLM that mixes Mixture-of-Experts (MoE) feed-forward layers, Mamba2 sequence-modeling layers, and standard self-attention layers in a single stack rather than classic “attention + MLP” transformer blocks. + +The BF16 variant (`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`) is designed as a high-fidelity reference model. For optimized inference performance on modern NVIDIA GPUs, the FP8 variant (`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8`) and the NVFP4 variant (`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`) are supported. + +At a high level: + +- **Hybrid layer stack (Mamba2 + MoE + attention):** The network is composed of interleaved layers that are *either* Mamba2, *or* MoE feed-forward, *or* attention-only. +- **Non-uniform layer ordering:** The order and mix of these specialized layers is not a simple, rigid pattern, enabling the model to trade off sequence modeling, routing capacity, and expressivity across depth. +- **Deployment-friendly precision:** Use BF16 for accuracy-sensitive and evaluation workloads; use FP8 for latency- and throughput-critical serving on recent NVIDIA GPUs. + +## 2. 
SGLang Installation + +Refer to the [official SGLang installation guide](../../../docs/get-started/install), or install the nightly wheel with: +```bash Command +uv pip install sglang==0.5.6.post3.dev1278+gad1b4e472 --extra-index-url https://sgl-project.github.io/whl/nightly/ +``` + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance tuning. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: select hardware, model variant, and common knobs to generate a launch command. + + + +### 3.2 Configuration Tips + +- **Attention backend**: + + **H200**: Use flash attention 3 backend by default. + **B200**: Use flashinfer backend by default. + +- **TP support**: + + To set the tensor-parallel size, use `--tp <1|2|4|8>`. + +- **FP8 KV cache**: + + To enable the FP8 KV cache, append `--kv-cache-dtype fp8_e4m3`. + +## 4. Model Invocation + +### 4.1 Basic Usage (OpenAI-Compatible API) + +SGLang provides an OpenAI-compatible endpoint. Example with the OpenAI Python client: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +resp = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Summarize what MoE models are in 5 bullets."}, + ], + temperature=0.7, + max_tokens=256, +) + +print(resp.choices[0].message.content) + +``` + +Streaming chat completion: +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +stream = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8", + messages=[ + {"role": "system", "content": "You are a helpful AI assistant."}, + {"role": "user", "content": "What are the first 5 prime numbers?"} + ], + temperature=0.7, + max_tokens=1024, + stream=True, +) +for chunk in stream: + delta 
= chunk.choices[0].delta + if delta and delta.content: + print(delta.content, end="", flush=True) +``` + +### 4.2 Reasoning +To enable reasoning, append `--reasoning-parser nemotron_3` to the launch command. The model supports two modes: Reasoning ON (default) and OFF. Toggle between them by setting `enable_thinking` to `False`, as shown below. + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +# Reasoning on (default) +print("Reasoning on") +resp = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Write a haiku about GPUs."} + ], + temperature=0.7, + max_tokens=512, +) +print(resp.choices[0].message.reasoning_content) + +# Reasoning off +print("Reasoning off") +resp = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Write a haiku about GPUs."} + ], + temperature=0.6, + max_tokens=256, + extra_body={"chat_template_kwargs": {"enable_thinking": False}} +) +print(resp.choices[0].message.reasoning_content) + +``` + +### 4.3 Tool calling +To enable tool calling, append `--tool-call-parser qwen3_coder` to the launch command. Call functions using the OpenAI Tools schema and inspect the returned `tool_calls`. 
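Once the server returns a `tool_calls` entry, running the function is the client's job. The following is a minimal sketch of that client-side step, not part of SGLang or the OpenAI client: `calculate_tip` and `dispatch_tool_call` are hypothetical helpers, and the name and argument string stand in for `tool_calls[0].function.name` and `tool_calls[0].function.arguments`.

```python
import json


# Hypothetical local implementation of the tool advertised in the schema.
def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    return bill_total * tip_percentage / 100


# Registry mapping tool names (as declared in the tools schema) to local callables.
TOOL_REGISTRY = {"calculate_tip": calculate_tip}


def dispatch_tool_call(name: str, arguments: str):
    """Decode the JSON argument string and call the matching local function."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments)
    return TOOL_REGISTRY[name](**kwargs)


# Stand-ins for the fields of a returned tool call (assumed values).
example_name = "calculate_tip"
example_arguments = '{"bill_total": 50, "tip_percentage": 15}'

result = dispatch_tool_call(example_name, example_arguments)
print(result)
```

The argument payload arrives as a JSON string rather than a dict, so it has to be decoded before the call. The request that produces such a `tool_calls` entry follows.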
+ +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY", +) + +# Tool calling via OpenAI tools schema +TOOLS = [ + { + "type": "function", + "function": { + "name": "calculate_tip", + "parameters": { + "type": "object", + "properties": { + "bill_total": { + "type": "integer", + "description": "The total amount of the bill" + }, + "tip_percentage": { + "type": "integer", + "description": "The percentage of tip to be applied" + } + }, + "required": ["bill_total", "tip_percentage"] + } + } + } +] + +completion = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8", + messages=[ + {"role": "system", "content": ""}, + {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"} + ], + tools=TOOLS, + temperature=0.6, + top_p=0.95, + max_tokens=512, + stream=False +) + +print(completion.choices[0].message.reasoning_content) +print(completion.choices[0].message.tool_calls) +``` + +--- + +## 5. 
Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU + +**FP8 variant** + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \ + --trust-remote-code \ + --max-running-requests 1024 \ + --host 0.0.0.0 \ + --port 30000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 4096 \ + --max-concurrency 256 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 256 +Successful requests: 4096 +Benchmark duration (s): 183.18 +Total input tokens: 2081726 +Total input text tokens: 2081726 +Total input vision tokens: 0 +Total generated tokens: 2116125 +Total generated tokens (retokenized): 1076256 +Request throughput (req/s): 22.36 +Input token throughput (tok/s): 11364.25 +Output token throughput (tok/s): 11552.04 +Peak output token throughput (tok/s): 24692.00 +Peak concurrent requests: 294 +Total token throughput (tok/s): 22916.30 +Concurrency: 251.19 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 11233.74 +Median E2E Latency (ms): 11142.97 +---------------Time to First Token---------------- +Mean TTFT (ms): 172.99 +Median TTFT (ms): 116.57 +P99 TTFT (ms): 1193.68 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 21.74 +Median TPOT (ms): 21.14 +P99 TPOT (ms): 41.12 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 21.45 +Median ITL (ms): 9.06 +P95 ITL (ms): 62.59 +P99 ITL (ms): 110.83 +Max ITL (ms): 5368.19 +================================================== +``` + +**BF16 variant** + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ + --trust-remote-code \ + --max-running-requests 1024 \ + --host 0.0.0.0 \ + --port 30000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 4096 \ + --max-concurrency 256 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 256 +Successful requests: 4096 +Benchmark duration (s): 360.22 +Total input tokens: 2081726 +Total input text tokens: 2081726 +Total input vision tokens: 0 +Total generated tokens: 2087288 +Total generated tokens (retokenized): 1940652 +Request throughput (req/s): 11.37 +Input token throughput (tok/s): 5779.10 +Output token throughput (tok/s): 5794.55 +Peak output token throughput (tok/s): 9169.00 +Peak concurrent requests: 276 +Total token throughput (tok/s): 11573.65 +Concurrency: 249.76 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 21965.10 +Median E2E Latency (ms): 21706.35 +---------------Time to First Token---------------- +Mean TTFT (ms): 211.54 +Median TTFT (ms): 93.06 +P99 TTFT (ms): 2637.66 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 43.27 +Median TPOT (ms): 43.04 +P99 TPOT (ms): 61.15 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 42.77 +Median ITL (ms): 28.46 +P95 ITL (ms): 71.85 +P99 ITL (ms): 113.20 +Max ITL (ms): 5237.28 +================================================== + +``` +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +**Environment** +- Hardware: NVIDIA B200 GPU +- Model: BF16 checkpoint + +**Launch Model** +```bash Command +python3 -m sglang.launch_server \ + --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ + --trust-remote-code \ + --reasoning-parser nemotron_3 +``` + +**Run Benchmark with lm-eval** +```bash Command +pip install lm-eval[api]==0.4.9.2 + +lm_eval --model local-completions --tasks gsm8k --model_args "model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16,base_url=http://127.0.0.1:30000/v1/completions,num_concurrent=4,max_retries=3,tokenized_requests=False,max_lengths=16384" --gen_kwargs '{"chat_template_kwargs":{"thinking":true}}' --batch_size 256 +``` + +**Test Results:** +```text Output +|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| +|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.5603|± |0.0137| +| | |strict-match | 5|exact_match|↑ |0.8453|± |0.0100| +``` diff --git a/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Super.mdx b/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Super.mdx new file mode 100644 index 000000000000..2adebe7a8196 --- /dev/null +++ b/docs_new/cookbook/autoregressive/NVIDIA/Nemotron3-Super.mdx @@ -0,0 +1,571 @@ +--- +title: NVIDIA Nemotron3-Super +metatags: + description: "Deploy NVIDIA Nemotron3-Super with SGLang - 120B hybrid MoE model (12B active) with 1M context window optimized for multi-agent systems and tool use." +--- + +import { Nemotron3SuperDeployment } from '/src/snippets/autoregressive/nemotron3-super-deployment.jsx'; + +## 1. 
Model Introduction + +`NVIDIA Nemotron3-Super` is a leading open model in the Nemotron 3 family, built for running many collaborating agents together. It is optimized for agentic systems that chain planning, reasoning, and tool use: workloads that generate far more tokens than single-turn chat and require strong reasoning at every step. + +Nemotron 3 Super is a 120B-parameter hybrid MoE model that activates only 12B parameters per forward pass, delivering strong accuracy for coding, tool calling, and instruction following at a fraction of the cost. It also supports a 1M-token context window so agents can keep conversation history and plan state in view across long workflows. + +Architecture and key features: + +- **Hybrid Transformer-Mamba Architecture (MoE):** Combines Mixture of Experts with a hybrid Transformer-Mamba architecture, enabling efficient routing and sequence modeling in a single stack. +- **Highest throughput efficiency in its size category:** Delivers up to 5x higher throughput compared to the previous Nemotron Super model (Llama Nemotron Super 1.5). +- **Multi-Token Prediction (MTP):** By predicting several future tokens simultaneously in a single forward pass, MTP drastically accelerates the generation of long-form text. +- **Thinking Budget support:** Supports Thinking Budget for optimal accuracy with minimum reasoning token generation. + +## 2. SGLang Installation + +SGLang from the main branch is required for Nemotron3-Super. You can install from source or use a nightly Docker image. + +```bash Command +# Install from source +uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python' + +# Or use Docker +docker pull lmsysorg/sglang:nightly-dev-20260310-0fd9a57d +``` + +For the full Docker setup and other installation methods, please refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. 
Model Deployment + +This section provides a progressive guide from quick deployment to performance tuning. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: select hardware, tensor parallelism, and common knobs to generate a launch command. + + + +### 3.2 Configuration Tips + +- **Attention backend**: + + **H200**: Use flash attention 3 backend by default. + **B200**: Use flashinfer backend by default. + +- **TP support**: + + To set tp size, use `--tp <2|4|8>`. + +- **FP8 KV cache**: + + To enable fp8 kv cache, please append `--kv-cache-dtype fp8_e4m3`. + +## 4. Model Invocation + +```shell Command +python3 -m sglang.launch_server \ + --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \ + --host 0.0.0.0 \ + --port 5000 \ + --trust-remote-code \ + --tp 4 \ + --tool-call-parser qwen3_coder \ + --reasoning-parser nemotron_3 +``` + +### 4.1 Basic Usage (OpenAI-Compatible API) + +SGLang provides an OpenAI-compatible endpoint. Example with the OpenAI Python client: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:5000/v1", + api_key="EMPTY", +) + +resp = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16", + messages=[ + {"role": "system", "content": "You are a helpful AI assistant."}, + {"role": "user", "content": "Give me 3 bullet points about SGLang."}, + ], + temperature=0.6, + max_tokens=1024, +) +print("Reasoning:", resp.choices[0].message.reasoning_content, "\nContent:", resp.choices[0].message.content) +print("\n") +``` + +Output: +```text Output +Reasoning: Okay, the user is asking for 3 bullet points about SGLang. Let me recall what I know about SGLang. It's a framework for serving large language models, right? Developed by the team at UC Berkeley and others. + +First, I should verify the key features. SGLang is known for its high-performance serving capabilities, especially with features like Radix Attention and chunked prefill. 
Those are important points to mention...(more tokens) + +Content: - SGLang introduces **Radix Attention**, an innovative attention mechanism that significantly reduces KV cache memory usage and improves computational efficiency during LLM serving by reusing intermediate states across tokens. +- It features **chunked prefill** for handling long prompts efficiently, breaking input sequences into manageable chunks to minimize latency and memory pressure while maintaining high throughput. +- Designed for **high-performance LLM serving**, SGLang achieves superior throughput and lower latency compared to traditional systems (like vLLM or TensorRT-LLM) through optimized kernel fusion, dynamic batching, and seamless integration with Hugging Face Transformers. +``` + +Streaming chat completion: +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:5000/v1", + api_key="EMPTY", +) + +stream = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16", + messages=[ + {"role": "system", "content": "You are a helpful AI assistant."}, + {"role": "user", "content": "What are the first 5 prime numbers?"} + ], + temperature=0.7, + max_tokens=1024, + stream=True, +) +for chunk in stream: + delta = chunk.choices[0].delta + if delta and delta.content: + print(delta.content, end="", flush=True) +``` + +Output: +```text Output +The first 5 prime numbers are: +**2, 3, 5, 7, 11**. + +### Explanation: +- A **prime number** is a natural number greater than 1 that has no positive divisors other than 1 and itself. +- **2** is the smallest and only even prime number. +- **3** is prime (divisible only by 1 and 3). +- **4** is not prime (divisible by 2). +- **5** is prime. +- **6** is not prime (divisible by 2 and 3). +- **7** is prime. +- **8, 9, 10** are not prime. +- **11** is prime (the fifth in the sequence). + +Note: **1 is not considered a prime number** by definition, as it has only one positive divisor. 
+This list is universally accepted in mathematics. Let me know if you'd like to explore more primes or related concepts! 😊 +``` + +### 4.2 Reasoning + +The model supports two modes: reasoning ON (default) and reasoning OFF. To turn reasoning off, set `enable_thinking` to `False` in `chat_template_kwargs`, as shown below. + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:5000/v1", + api_key="EMPTY", +) + +# Reasoning on (default) +print("Reasoning on") +resp = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Write a haiku about GPUs. Please make thinking process short."} + ], + temperature=1, + max_tokens=1024, +) +print(f"Reasoning: \n{resp.choices[0].message.reasoning_content[:200]}... \nContent: \n{resp.choices[0].message.content[:200]}...") +print("\n") +# Reasoning off: the answer is in message.content, reasoning_content is not populated +print("Reasoning off") +resp = client.chat.completions.create( + model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Give me 3 facts about SGLang."} + ], + temperature=0, + max_tokens=256, + extra_body={"chat_template_kwargs": {"enable_thinking": False}} +) +print(f"Content: \n{resp.choices[0].message.content[:200]}...") +``` + +Output: +```text Output +Reasoning on +Reasoning: +We need to output a haiku about GPUs, with short thinking process. Probably we just need to produce the haiku. No extra commentary needed. Provide a haiku: 5-7-5 syllable lines about GPUs. + +Let's deci... +Content: +Silicon hearts beat +Paint vivid worlds with bright light +GPU dreams rise... + +Reasoning off +Content: +Certainly! Here are three accurate and informative facts about **SGLang**: + +1. **SGLang is a high-performance serving system for large language models (LLMs)** + Developed by researchers at UC Berk... 
+``` + +### 4.3 Tool Calling + +Call functions using the OpenAI Tools schema and inspect returned `tool_calls`. + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:5000/v1", + api_key="EMPTY", +) + +# Tool calling via OpenAI tools schema +TOOLS = [ + { + "type": "function", + "function": { + "name": "calculate_tip", + "parameters": { + "type": "object", + "properties": { + "bill_total": { + "type": "integer", + "description": "The total amount of the bill" + }, + "tip_percentage": { + "type": "integer", + "description": "The percentage of tip to be applied" + } + }, + "required": ["bill_total", "tip_percentage"] + } + } + } +] + +completion = client.chat.completions.create( + model="nemotron", + messages=[ + {"role": "system", "content": ""}, + {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"} + ], + tools=TOOLS, + temperature=0.6, + top_p=0.95, + max_tokens=512, + stream=False +) + +print(completion.choices[0].message.reasoning_content) +print(completion.choices[0].message.tool_calls) +``` + +Output: +```text Output +The user wants to calculate a 15% tip on a $50 bill. I have a function called calculate_tip that takes bill_total and tip_percentage as parameters. The bill_total is $50, and tip_percentage is 15. I need to call the function with these values. Let me do that. + +[ChatCompletionMessageFunctionToolCall(id='call_ced9a83a3baa448e9d587aaf', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function', index=0)] +``` + +### 4.4 Controlling Reasoning Budget + +The `reasoning_budget` parameter allows you to limit the length of the model's reasoning trace. When the reasoning output reaches the specified token budget, the model will attempt to gracefully end the reasoning at the next newline character. 
+ +If no newline is encountered within 500 tokens after reaching the budget threshold, the reasoning trace will be forcibly terminated at `reasoning_budget + 500` tokens. + +```python Example +from typing import Any, Dict, List +import openai +from transformers import AutoTokenizer + +class ThinkingBudgetClient: + def __init__(self, base_url: str, api_key: str, tokenizer_name_or_path: str): + self.base_url = base_url + self.api_key = api_key + self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path) + self.client = openai.OpenAI(base_url=self.base_url, api_key=self.api_key) + + def chat_completion( + self, + model: str, + messages: List[Dict[str, Any]], + reasoning_budget: int = 512, + max_tokens: int = 1024, + **kwargs, + ) -> Dict[str, Any]: + assert ( + max_tokens > reasoning_budget + ), f"reasoning_budget must be smaller than max_tokens. Given {max_tokens=} and {reasoning_budget=}" + + # 1. first call chat completion to get reasoning content + response = self.client.chat.completions.create( + model=model, + messages=messages, + max_tokens=reasoning_budget, + **kwargs + ) + + reasoning_content = response.choices[0].message.reasoning_content or "" + + if "
</think>" not in reasoning_content: + # reasoning content is too long, closed with a period (.) + reasoning_content = f"{reasoning_content}.\n</think>\n\n" + + reasoning_tokens_used = len( + self.tokenizer.encode(reasoning_content, add_special_tokens=False) + ) + remaining_tokens = max_tokens - reasoning_tokens_used + + assert ( + remaining_tokens > 0 + ), f"remaining tokens must be positive. Given {remaining_tokens=}. Increase max_tokens or lower reasoning_budget." + + # 2. append reasoning content to messages and call completion + messages.append({"role": "assistant", "content": reasoning_content}) + prompt = self.tokenizer.apply_chat_template( + messages, + tokenize=False, + continue_final_message=True, + ) + + response = self.client.completions.create( + model=model, + prompt=prompt, + max_tokens=remaining_tokens, + **kwargs + ) + + response_data = { + "reasoning_content": reasoning_content.strip().strip("</think>
").strip(), + "content": response.choices[0].text, + "finish_reason": response.choices[0].finish_reason, + } + return response_data +``` + +Usage example with `reasoning_budget=128`: + +```python Example +SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16" + +# Client +client = ThinkingBudgetClient( + base_url="http://127.0.0.1:5000/v1", + api_key="null", + tokenizer_name_or_path=SERVED_MODEL_NAME +) + +resp = client.chat_completion( + model=SERVED_MODEL_NAME, + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Write a haiku about GPUs."} + ], + temperature=1, + max_tokens=512, + reasoning_budget=128 +) +print("Reasoning:", resp["reasoning_content"], "\nContent:", resp["content"]) +``` + +Output: +```text Output +Reasoning: Okay, the user wants a haiku about GPUs. Let me recall what a haiku is: a traditional Japanese poem with three lines, 5-7-5 syllable structure. So I need to make sure the syllable count is exact. + +First, I should think about what makes GPUs interesting. They're used for graphics rendering, parallel processing, AI, gaming, etc. Maybe focus on their speed, power, or how they handle many tasks at once. + +Let me brainstorm some words and phrases related to GPUs: silicon, cores, transistors, parallel, rendering, pixels, frames per second, CUDA, tensor. +Content: + +Silicon minds awaken, +Thousands of cores hum in unison— +Lightning paints the void. +``` + +--- + +## 5. 
Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** +- Hardware: H200 (4x) +- Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 +- Tensor Parallelism: 4 +- SGLang Version: main branch + +- Model Deployment Command: + +```shell Command +python3 -m sglang.launch_server \ + --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \ + --trust-remote-code \ + --tp 4 \ + --max-running-requests 1024 \ + --host 0.0.0.0 \ + --port 5000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 5000 \ + --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 4096 \ + --max-concurrency 256 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 256 +Successful requests: 4096 +Benchmark duration (s): 623.49 +Total input tokens: 2081726 +Total input text tokens: 2081726 +Total generated tokens: 2087288 +Total generated tokens (retokenized): 2044666 +Request throughput (req/s): 6.57 +Input token throughput (tok/s): 3338.85 +Output token throughput (tok/s): 3347.77 +Peak output token throughput (tok/s): 6349.00 +Peak concurrent requests: 270 +Total token throughput (tok/s): 6686.62 +Concurrency: 250.35 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 38108.46 +Median E2E Latency (ms): 37186.80 +P90 E2E Latency (ms): 69325.24 +P99 E2E Latency (ms): 77776.90 +---------------Time to First Token---------------- +Mean TTFT (ms): 436.49 +Median TTFT (ms): 114.90 +P99 TTFT (ms): 6938.11 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 75.02 +Median TPOT (ms): 76.02 +P99 TPOT (ms): 92.27 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 74.07 +Median ITL (ms): 38.45 +P95 ITL (ms): 230.42 +P99 ITL (ms): 242.70 +Max ITL (ms): 7181.72 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +**Environment** +- Hardware: H200 (4x) +- Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 +- Tensor Parallelism: 4 +- SGLang Version: main branch + +**Launch Model** +```bash Command +python3 -m sglang.launch_server \ + --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \ + --trust-remote-code \ + --tp 4 \ + --reasoning-parser nemotron_3 \ + --port 5000 +``` + +**Run Benchmark** +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --port 5000 +``` + +**Test Results:** +```text Output +Accuracy: 0.950 +Invalid: 0.000 +Latency: 21.442 s +Output throughput: 996.815 token/s +``` + +#### 5.2.2 MMLU Benchmark + +**Run Benchmark** +```bash Command +python3 benchmark/mmlu/bench_sglang.py --port 5000 +``` + +**Test Results:** +```text Output +subject: abstract_algebra, #q:100, acc: 0.730 +subject: anatomy, #q:135, acc: 0.830 +subject: astronomy, #q:152, acc: 0.934 +subject: business_ethics, #q:100, acc: 0.830 +subject: clinical_knowledge, #q:265, acc: 0.879 +subject: college_biology, #q:144, acc: 0.931 +subject: college_chemistry, #q:100, acc: 0.620 +subject: college_computer_science, #q:100, acc: 0.840 +subject: college_mathematics, #q:100, acc: 0.820 +subject: college_medicine, #q:173, acc: 0.821 +subject: college_physics, #q:102, acc: 0.794 +subject: computer_security, #q:100, acc: 0.880 +subject: conceptual_physics, #q:235, acc: 0.919 +subject: econometrics, #q:114, acc: 0.746 +subject: electrical_engineering, #q:145, acc: 0.828 +subject: elementary_mathematics, #q:378, acc: 0.926 +subject: formal_logic, #q:126, acc: 0.857 +subject: global_facts, #q:100, acc: 0.570 +subject: high_school_biology, #q:310, 
acc: 0.952 +subject: high_school_chemistry, #q:203, acc: 0.828 +subject: high_school_computer_science, #q:100, acc: 0.940 +subject: high_school_european_history, #q:165, acc: 0.861 +subject: high_school_geography, #q:198, acc: 0.939 +subject: high_school_government_and_politics, #q:193, acc: 0.990 +subject: high_school_macroeconomics, #q:390, acc: 0.928 +subject: high_school_mathematics, #q:270, acc: 0.700 +subject: high_school_microeconomics, #q:238, acc: 0.966 +subject: high_school_physics, #q:151, acc: 0.834 +subject: high_school_psychology, #q:545, acc: 0.960 +subject: high_school_statistics, #q:216, acc: 0.852 +subject: high_school_us_history, #q:204, acc: 0.926 +subject: high_school_world_history, #q:237, acc: 0.937 +subject: human_aging, #q:223, acc: 0.879 +subject: human_sexuality, #q:131, acc: 0.939 +subject: international_law, #q:121, acc: 0.934 +subject: jurisprudence, #q:108, acc: 0.898 +subject: logical_fallacies, #q:163, acc: 0.914 +subject: machine_learning, #q:112, acc: 0.821 +subject: management, #q:103, acc: 0.903 +subject: marketing, #q:234, acc: 0.944 +subject: medical_genetics, #q:100, acc: 0.980 +subject: miscellaneous, #q:783, acc: 0.945 +subject: moral_disputes, #q:346, acc: 0.861 +subject: moral_scenarios, #q:895, acc: 0.542 +subject: nutrition, #q:306, acc: 0.902 +subject: philosophy, #q:311, acc: 0.884 +subject: prehistory, #q:324, acc: 0.920 +subject: professional_accounting, #q:282, acc: 0.805 +subject: professional_law, #q:1534, acc: 0.681 +subject: professional_medicine, #q:272, acc: 0.923 +subject: professional_psychology, #q:612, acc: 0.889 +subject: public_relations, #q:110, acc: 0.800 +subject: security_studies, #q:245, acc: 0.837 +subject: sociology, #q:201, acc: 0.960 +subject: us_foreign_policy, #q:100, acc: 0.920 +subject: virology, #q:166, acc: 0.590 +subject: world_religions, #q:171, acc: 0.906 +Total latency: 150.267 +Average accuracy: 0.841 +``` diff --git a/docs_new/cookbook/autoregressive/OpenAI/GPT-OSS.mdx 
b/docs_new/cookbook/autoregressive/OpenAI/GPT-OSS.mdx new file mode 100644 index 000000000000..30e9f94ca12b --- /dev/null +++ b/docs_new/cookbook/autoregressive/OpenAI/GPT-OSS.mdx @@ -0,0 +1,572 @@ +--- +title: GPT-OSS +metatags: + description: "Deploy GPT-OSS (20B/120B) with SGLang - configurable reasoning, full chain-of-thought, MXFP4 quantization for single GPU deployment." +--- + +## 1.Model Introduction + +[GPT-OSS](https://huggingface.co/openai/gpt-oss-20b) is an advanced large language model developed by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It is available in two sizes: + +- **gpt-oss-120b**: for production, general purpose, high reasoning use cases that fit into a single 80GB GPU (like NVIDIA H100 80GB or AMD MI300X 192GB) (117B parameters with 5.1B active parameters) +- **gpt-oss-20b**: for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters) + +Key features of GPT-OSS: + +- **Configurable reasoning effort**: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs. +- **Full chain-of-thought**: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users. +- **Fine-tunable**: Fully customize models to your specific use case through parameter fine-tuning. +- **Agentic capabilities**: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs. +- **MXFP4 quantization**: The models were post-trained with MXFP4 quantization of the MoE weights, making gpt-oss-120b run on a single 80GB GPU (like NVIDIA H100 80GB or AMD MI300X 192GB) and the gpt-oss-20b model run within 16GB of memory. All evals were performed with the same MXFP4 quantization. + +## 2.SGLang Installation + +SGLang offers multiple installation methods. 
You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3.Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +The GPT-OSS series comes in two sizes. Recommended starting configurations vary depending on hardware. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities. + +import { GPTOSSDeployment } from "/src/snippets/autoregressive/gpt-oss-deployment.jsx"; + + + +### 3.2 Configuration Tips + +For more detailed configuration tips, please refer to [GPT-OSS Usage](../../../docs/basic_usage/gpt_oss). + +## 4.Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +GPT-OSS supports reasoning mode. 
Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +python -m sglang.launch_server \ + --model openai/gpt-oss-120b \ + --reasoning-parser gpt-oss \ + --tp 8 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="openai/gpt-oss-120b", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user asks: "Solve this problem step by step: What is 15% of 240?" So we need to provide step-by-step solution. Compute 15% of 240: 0.15 * 240 = 36. Provide steps: convert percent to decimal, multiply, maybe use fraction. Provide answer. +=============== Content ================= +**Step‑by‑step solution** + +1. **Understand what “percent” means** + “15 %” means 15 out of every 100 parts, i.e. the fraction \(\displaystyle \frac{15}{100}\). + +2. 
**Convert the percent to a decimal (or fraction)** + \[ + \frac{15}{100}=0.15 + \] + +3. **Set up the multiplication** + To find 15 % of 240 we multiply 240 by the decimal 0.15: + \[ + 240 \times 0.15 + \] + +4. **Do the multiplication** + One convenient way is to break it into two easier parts: + \[ + 240 \times 0.15 = 240 \times \left(\frac{15}{100}\right) + = \frac{240 \times 15}{100} + \] + + - First compute \(240 \times 15\): + \[ + 240 \times 15 = 240 \times (10 + 5) = 2400 + 1200 = 3600 + \] + + - Then divide by 100: + \[ + \frac{3600}{100} = 36 + \] + +5. **Write the result** + \[ + 15\% \text{ of } 240 = 36 + \] + +--- + +**Answer:** \(36\) +``` + +#### 4.2.2 Tool Calling + +GPT-OSS supports tool calling capabilities. Enable the tool call parser: + +**Python Example (without Thinking Process):** + +Start sglang server: + +```shell Command +python -m sglang.launch_server \ + --model openai/gpt-oss-120b \ + --tool-call-parser gpt-oss \ + --tp 8 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="openai/gpt-oss-120b", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = 
chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"🔧 Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Python Example (with Thinking Process):** + +Start sglang server: + +```shell Command +python -m sglang.launch_server \ + --model openai/gpt-oss-120b \ + --reasoning-parser gpt-oss \ + --tool-call-parser gpt-oss \ + --tp 8 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="openai/gpt-oss-120b", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + 
tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"🔧 Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +User asks: "What's the weather in Beijing?" We need to get current weather. Use function get_weather with location "Beijing". No unit specified; default? Probably use default (maybe Celsius). We can specify unit as "celsius". We'll call function. 
+=============== Content ================= +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="openai/gpt-oss-120b", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The current weather in Beijing is 22 °C and sunny. Let me know if you’d like a forecast for the next few days or any other details!" +``` + +## 5.Benchmark + +### 5.1 Speed Benchmark + +- Hardware: NVIDIA B200 GPU (8x) +- Tensor Parallelism: 8 +- Model: openai/gpt-oss-120b +- sglang version: 0.5.6 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. 
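The throughput figures printed in the benchmark reports in this section are related by simple arithmetic on the request count, token counts, and benchmark duration. The sketch below is only an illustration of that arithmetic, using the counts reported for the latency-sensitive run; the function name `throughput_summary` and its dictionary keys are hypothetical and are not part of SGLang. Because the printed duration is rounded to two decimals, the recomputed values can differ from the report in the last digit.

```python
def throughput_summary(successful_requests, duration_s, input_tokens, generated_tokens):
    """Recompute headline throughput numbers from raw counts (illustrative helper)."""
    return {
        # Requests completed per second of wall-clock benchmark time.
        "request_throughput": successful_requests / duration_s,
        # Prompt tokens processed per second.
        "input_token_throughput": input_tokens / duration_s,
        # Generated tokens produced per second.
        "output_token_throughput": generated_tokens / duration_s,
        # Prompt and generated tokens combined.
        "total_token_throughput": (input_tokens + generated_tokens) / duration_s,
    }


# Counts and duration copied from the latency-sensitive report (max concurrency 1).
latency_run = throughput_summary(
    successful_requests=100,
    duration_s=52.35,
    input_tokens=33178,
    generated_tokens=21251,
)

for name, value in latency_run.items():
    print(f"{name}: {value:.2f}")
```

Comparing these values with the reported lines is a quick sanity check when copying results between runs or documents.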
+ +#### 5.1.1 Latency-Sensitive Benchmark + +- Server Command: + +```shell Command +python -m sglang.launch_server \ + --model openai/gpt-oss-120b \ + --tp 8 +``` + +- Test Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --num-prompt 100 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 100 +Benchmark duration (s): 52.35 +Total input tokens: 33178 +Total input text tokens: 33178 +Total input vision tokens: 0 +Total generated tokens: 21251 +Total generated tokens (retokenized): 20868 +Request throughput (req/s): 1.91 +Input token throughput (tok/s): 633.76 +Output token throughput (tok/s): 405.93 +Peak output token throughput (tok/s): 433.00 +Peak concurrent requests: 8 +Total token throughput (tok/s): 1039.69 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 523.30 +Median E2E Latency (ms): 389.91 +---------------Time to First Token---------------- +Mean TTFT (ms): 33.71 +Median TTFT (ms): 31.79 +P99 TTFT (ms): 108.98 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 2.31 +Median TPOT (ms): 2.31 +P99 TPOT (ms): 2.39 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 2.31 +Median ITL (ms): 2.31 +P95 ITL (ms): 2.35 +P99 ITL (ms): 2.38 +Max ITL (ms): 3.54 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Server Command: + +```shell Command +python -m sglang.launch_server \ + --model openai/gpt-oss-120b \ + --tp 8 +``` + +- Test Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --num-prompt 1000 \ + --max-concurrency 100 +``` + +**Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 24.76 +Total input tokens: 297156 +Total input text tokens: 297156 +Total input vision tokens: 0 +Total generated tokens: 192432 +Total generated tokens (retokenized): 187145 +Request throughput (req/s): 40.39 +Input token throughput (tok/s): 12003.57 +Output token throughput (tok/s): 7773.26 +Peak output token throughput (tok/s): 13780.00 +Peak concurrent requests: 156 +Total token throughput (tok/s): 19776.83 +Concurrency: 89.23 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2208.97 +Median E2E Latency (ms): 1591.11 +---------------Time to First Token---------------- +Mean TTFT (ms): 102.94 +Median TTFT (ms): 31.53 +P99 TTFT (ms): 674.32 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 14.31 +Median TPOT (ms): 11.00 +P99 TPOT (ms): 91.28 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 11.00 +Median ITL (ms): 5.75 +P95 ITL (ms): 25.35 +P99 ITL (ms): 43.18 +Max ITL (ms): 621.42 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000 +``` + +- **Results**: + + - GPT-OSS-120b + + ```text Output + Accuracy: 0.880 + Invalid: 0.005 + Latency: 5.262 s + Output throughput: 12143.675 token/s + ``` + + - GPT-OSS-20b + + ```text Output + Accuracy: 0.535 + Invalid: 0.165 + Latency: 4.157 s + Output throughput: 19589.165 token/s + ``` diff --git a/docs_new/cookbook/autoregressive/Qwen/Qwen2.5-VL.mdx b/docs_new/cookbook/autoregressive/Qwen/Qwen2.5-VL.mdx new file mode 100644 index 000000000000..b8889f01b215 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Qwen/Qwen2.5-VL.mdx @@ -0,0 +1,392 @@ +--- +title: Qwen2.5-VL +metatags: + description: "Deploy Qwen2.5-VL vision-language models with SGLang on AMD MI300X - available in 3B to 72B sizes with enhanced visual understanding." +--- + +import { Qwen25VLDeployment } from '/src/snippets/autoregressive/qwen25-vl-deployment.jsx'; + +## 1. Model Introduction + +**[Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl)** is a vision-language model series from the Qwen team, offering significant improvements over its predecessor in understanding, reasoning, and multi-modal processing. + +**Key Features:** + +- **Understand things visually**: Proficient in recognizing common objects such as flowers, birds, fish, and insects, and highly capable of analyzing texts, charts, icons, graphics, and layouts within images. +- **More Agentic**: Acts as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use. 
+- **Understanding long videos and capturing events**: Comprehends videos longer than 1 hour and can capture events by pinpointing the relevant video segments. +- **Capable of visual localization in different formats**: Accurately localizes objects in an image by generating bounding boxes or points, and can provide stable JSON outputs for coordinates and attributes. +- **Generating structured outputs**: Supports structured outputs for data such as scans of invoices, forms, and tables, which benefits use cases in finance, commerce, and similar domains. +- **Dynamic Resolution and Frame Rate Training for Video Understanding**: Extends dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, mRoPE is updated in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments. +- **Multiple Sizes**: Available in 3B, 7B, 32B, and 72B variants to suit different deployment needs. +- **ROCm Support**: Compatible with AMD MI300X, MI325X and MI355X GPUs via SGLang (verified). + +For more details, please refer to the [official Qwen2.5-VL GitHub Repository](https://github.com/QwenLM/Qwen3-VL). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for AMD MI300X, MI325X and MI355X hardware platforms and different use cases. + +### 3.1 Basic Configuration + +The Qwen2.5-VL series offers models in various sizes. The following configurations have been verified on AMD MI300X, MI325X and MI355X GPUs. 
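As a rough illustration of how a verified configuration maps onto a launch command, the sketch below assembles one in Python. The tensor-parallel sizes and the `--context-length 128000` value come from the configuration tips in this guide; the helper name `build_launch_command`, the `DOCUMENTED_TP` table, and the assumption that checkpoints are named `Qwen/Qwen2.5-VL-<size>-Instruct` are illustrative choices, not part of SGLang. Sizes without a documented `--tp` value simply omit the flag.

```python
import shlex
from typing import Optional

# Tensor-parallel sizes quoted in this guide's tips for MI300X/MI325X/MI355X.
# Other sizes are left out on purpose: the guide does not state a value for them.
DOCUMENTED_TP = {"72B": 8, "32B": 2}


def build_launch_command(size: str, context_length: Optional[int] = None) -> str:
    """Return an sglang.launch_server command line for a Qwen2.5-VL checkpoint."""
    # Assumption: instruct checkpoints follow the Hugging Face naming pattern below.
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", f"Qwen/Qwen2.5-VL-{size}-Instruct",
    ]
    if size in DOCUMENTED_TP:
        args += ["--tp", str(DOCUMENTED_TP[size])]
    if context_length is not None:
        args += ["--context-length", str(context_length)]
    # shlex.join quotes arguments only when the shell would need it.
    return shlex.join(args)


print(build_launch_command("72B", context_length=128000))
print(build_launch_command("7B"))
```

The interactive generator remains the reference for exact flags; a helper like this is mainly useful for scripting repeated launches across model sizes.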
+ +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model size. + + + +### 3.2 Configuration Tips + +- **Memory Management**: For the 72B model on MI300X/MI325X/MI355X, we have verified successful deployment with `--context-length 128000`. Smaller context lengths can be used to reduce memory usage if needed. +- **Multi-GPU Deployment**: Use Tensor Parallelism (`--tp`) to scale across multiple GPUs. For example, use `--tp 8` for the 72B model and `--tp 2` for the 32B model on MI300X/MI325X/MI355X. + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 Multi-Modal Inputs + +Qwen2.5-VL supports image inputs. Here's a basic example with a single image input: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "Read all the text in the image."
+ } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="Qwen/Qwen2.5-VL-7B-Instruct", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example Output:** + +```text Output +Response costs: 2.31s +Generated text: Auntie Anne's + +CINNAMON SUGAR +1 x 17,000 +SUB TOTAL +17,000 + +GRAND TOTAL +17,000 + +CASH IDR +20,000 + +CHANGE DUE +3,000 +``` + +**Multi-Image Input Example:** + +Qwen2.5-VL can process multiple images in a single request for comparison or analysis: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://www.civitatis.com/f/china/hong-kong/guia/taxi.jpg" + } + }, + { + "type": "image_url", + "image_url": { + "url": "https://cdn.cheapoguides.com/wp-content/uploads/sites/7/2025/05/GettyImages-509614603-1280x600.jpg" + } + }, + { + "type": "text", + "text": "Compare these two images and describe the differences in 100 words or less." + } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="Qwen/Qwen2.5-VL-7B-Instruct", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example Output:** + +```text Output +Response costs: 13.79s +Generated text: The first image shows a single red taxi driving on a street with a few other taxis in the background. The second image shows a large number of taxis parked in a lot, with some appearing to be in various states of repair. The first image has a single taxi with a visible license plate, while the second image has multiple taxis with different license plates. 
The first image has a clear view of the street and surrounding area, while the second image is taken from an elevated perspective, showing a wider view of the parking lot and the surrounding area. +``` + +**Note:** + +- You can also provide local file paths using the `file://` protocol. +- Larger images may need more memory; adjust `--mem-fraction-static` accordingly. + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: AMD MI300X GPU (8x) +- Model: Qwen2.5-VL-72B-Instruct +- Tensor Parallelism: 8 +- SGLang Version: 0.5.6 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation with random images. To simulate real-world usage, you can specify different input and output lengths for each request. For example, each request can have 128 input tokens, two 720p images, and 1024 output tokens. + +#### 5.1.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen2.5-VL-72B-Instruct \ + --tp 8 \ + --host 0.0.0.0 \ + --port 30000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen2.5-VL-72B-Instruct \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Result: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 37.99 +Total input tokens: 24781 +Total input text tokens: 821 +Total input vision tokens: 23960 +Total generated tokens: 4220 +Total generated tokens (retokenized): 2365 +Request throughput (req/s): 0.26 +Input token throughput (tok/s): 652.26 +Output token throughput (tok/s): 111.07 +Peak output token throughput (tok/s): 128.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 763.34 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3797.61 +Median E2E Latency (ms): 3140.90 +P90 E2E Latency (ms): 6545.54 +P99 E2E Latency (ms): 7939.56 +---------------Time to First Token---------------- +Mean TTFT (ms): 504.45 +Median TTFT (ms): 510.93 +P99 TTFT (ms): 521.78 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 7.82 +Median TPOT (ms): 7.82 +P99 TPOT (ms): 7.84 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 10.07 +Median ITL (ms): 7.90 +P95 ITL (ms): 15.79 +P99 ITL (ms): 15.93 +Max ITL (ms): 23.60 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen2.5-VL-72B-Instruct \ + --tp 8 \ + --host 0.0.0.0 \ + --port 30000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen2.5-VL-72B-Instruct \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- Result: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 454.68 +Total input tokens: 2481865 +Total input text tokens: 85865 +Total input vision tokens: 2396000 +Total generated tokens: 510855 +Total generated tokens (retokenized): 296466 +Request throughput (req/s): 2.20 +Input token throughput (tok/s): 5458.50 +Output token throughput (tok/s): 1123.55 +Peak output token throughput (tok/s): 5004.00 +Peak concurrent requests: 106 +Total token throughput (tok/s): 6582.05 +Concurrency: 98.63 +----------------End-to-End Latency----------------
+Mean E2E Latency (ms): 44844.92 +Median E2E Latency (ms): 42866.15 +P90 E2E Latency (ms): 82798.20 +P99 E2E Latency (ms): 106306.30 +---------------Time to First Token---------------- +Mean TTFT (ms): 4507.79 +Median TTFT (ms): 1180.83 +P99 TTFT (ms): 39975.22 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 80.26 +Median TPOT (ms): 82.38 +P99 TPOT (ms): 152.89 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 100.66 +Median ITL (ms): 13.26 +P95 ITL (ms): 428.45 +P99 ITL (ms): 1393.35 +Max ITL (ms): 31943.26 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 MMMU Benchmark + +You can evaluate the model's accuracy using the MMMU dataset: + +- Benchmark Command: + +```shell Command +python3 benchmark/mmmu/bench_sglang.py \ + --port 30000 \ + --concurrency 64 +``` +```text Output +Benchmark time: 97.75084622902796 +answers saved to: ./answer_sglang.json +Evaluating... +answers saved to: ./answer_sglang.json +{'Accounting': {'acc': 0.633, 'num': 30}, + 'Agriculture': {'acc': 0.5, 'num': 30}, + 'Architecture_and_Engineering': {'acc': 0.367, 'num': 30}, + 'Art': {'acc': 0.767, 'num': 30}, + 'Art_Theory': {'acc': 0.9, 'num': 30}, + 'Basic_Medical_Science': {'acc': 0.7, 'num': 30}, + 'Biology': {'acc': 0.467, 'num': 30}, + 'Chemistry': {'acc': 0.433, 'num': 30}, + 'Clinical_Medicine': {'acc': 0.733, 'num': 30}, + 'Computer_Science': {'acc': 0.567, 'num': 30}, + 'Design': {'acc': 0.833, 'num': 30}, + 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.467, 'num': 30}, + 'Economics': {'acc': 0.767, 'num': 30}, + 'Electronics': {'acc': 0.433, 'num': 30}, + 'Energy_and_Power': {'acc': 0.467, 'num': 30}, + 'Finance': {'acc': 0.533, 'num': 30}, + 'Geography': {'acc': 0.633, 'num': 30}, + 'History': {'acc': 0.7, 'num': 30}, + 'Literature': {'acc': 0.867, 'num': 30}, + 'Manage': {'acc': 0.633, 'num': 30}, + 'Marketing': {'acc': 0.733, 'num': 30}, + 'Materials': {'acc': 0.333, 'num': 30}, + 
'Math': {'acc': 0.533, 'num': 30}, + 'Mechanical_Engineering': {'acc': 0.433, 'num': 30}, + 'Music': {'acc': 0.367, 'num': 30}, + 'Overall': {'acc': 0.62, 'num': 900}, + 'Overall-Art and Design': {'acc': 0.717, 'num': 120}, + 'Overall-Business': {'acc': 0.66, 'num': 150}, + 'Overall-Health and Medicine': {'acc': 0.693, 'num': 150}, + 'Overall-Humanities and Social Science': {'acc': 0.775, 'num': 120}, + 'Overall-Science': {'acc': 0.553, 'num': 150}, + 'Overall-Tech and Engineering': {'acc': 0.443, 'num': 210}, + 'Pharmacy': {'acc': 0.833, 'num': 30}, + 'Physics': {'acc': 0.7, 'num': 30}, + 'Psychology': {'acc': 0.767, 'num': 30}, + 'Public_Health': {'acc': 0.733, 'num': 30}, + 'Sociology': {'acc': 0.767, 'num': 30}} +eval out saved to ./val_sglang.json +Overall accuracy: 0.62 +``` diff --git a/docs_new/cookbook/autoregressive/Qwen/Qwen3-Coder-Next.mdx b/docs_new/cookbook/autoregressive/Qwen/Qwen3-Coder-Next.mdx new file mode 100644 index 000000000000..001b2fa6e113 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Qwen/Qwen3-Coder-Next.mdx @@ -0,0 +1,902 @@ +--- +title: Qwen3-Coder-Next +metatags: + description: "Deploy the code-focused Qwen3-Coder-Next model with SGLang on AMD MI300X - 80B total parameters with only 3B activated per token." +--- + +import { Qwen3CoderNextDeployment } from '/src/snippets/autoregressive/qwen3-coder-next-deployment.jsx'; + +## 1. Model Introduction + +[Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) is a cost-efficient code-focused language model from the Qwen team (Alibaba). With 80B total parameters but only 3B activated parameters, it achieves performance comparable to models with 10–20x more active parameters through its innovative hybrid architecture. + +**Key Features:** + +- **Hybrid Architecture**: Uses a 48-layer hybrid layout combining Gated DeltaNet and Gated Attention with Mixture-of-Experts (512 total experts, 10 activated, 1 shared), enabling exceptional efficiency.
+- **Tool Calling Support**: Advanced agentic capabilities with native support for function calling and tool use via the `qwen3_coder` parser. +- **Extended Context Length**: Supports up to 256K tokens for processing large codebases and long documents. +- **Cost-Efficient Inference**: Only 3B parameters activated per token, making it ideal for local development and cost-effective deployment at scale. +- **IDE Integration**: Compatible with Claude Code, Qwen Code, Cline, and other IDE platforms. + +For more details, please refer to the [Qwen3-Coder-Next model card](https://huggingface.co/Qwen/Qwen3-Coder-Next). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +**Note:** Qwen3-Coder-Next requires SGLang v0.5.8 or later. + +## 3. Model Deployment + +This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and deployment options. + + + +### 3.2 Configuration Tips + +- **Context Length**: The model supports up to 256K tokens natively. If you encounter OOM issues, try `--context-length 32768`. +- **Tool Use**: To enable tool calling capabilities, use the `--tool-call-parser qwen3_coder` flag. +- **Sampling Parameters**: SGLang automatically applies the recommended sampling parameters from the model's `generation_config.json`. No manual configuration is needed. +- **Mamba Radix Cache**: Qwen3-Coder-Next's hybrid Gated Delta Networks architecture supports two mamba scheduling strategies via `--mamba-scheduler-strategy`: + - **V1 (`no_buffer`)**: Default. 
No overlap scheduler, lower memory usage. + - **V2 (`extra_buffer`)**: Enables overlap scheduling and branching point caching with `--mamba-scheduler-strategy extra_buffer --page-size 64`. Requires FLA kernel backend. Trades higher mamba state memory for better throughput. Strictly superior in non-KV-cache-bound scenarios; in KV-cache-bound cases, weigh the overlap scheduling benefit against reduced max concurrency. `--page-size` must satisfy `FLA_CHUNK_SIZE % page_size == 0` or `page_size % FLA_CHUNK_SIZE == 0` (`FLA_CHUNK_SIZE` is currently 64). + +## 4. Model Invocation + +**Deployment Command:** + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Coder-Next \ + --tp 2 \ + --tool-call-parser qwen3_coder \ + --host 0.0.0.0 \ + --port 30000 +``` + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Code Generation Example + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="Qwen/Qwen3-Coder-Next", + messages=[ + {"role": "user", "content": "Write a Python function that implements binary search on a sorted list. Include type hints."} + ], + max_tokens=2048 +) + +print(response.choices[0].message.content) +``` + +**Example Output:** + +````text Output +Here's a Python function implementing binary search on a sorted list, with comprehensive type hints: + +```python +from typing import Sequence, TypeVar, Optional + +T = TypeVar('T') + +def binary_search(sorted_list: Sequence[T], target: T) -> Optional[int]: + """ + Perform binary search on a sorted list to find the index of a target element. + + Args: + sorted_list: A sequence (e.g., list, tuple) sorted in ascending order. + target: The element to search for in the list. 
+ + Returns: + The index of the target element if found, or None if not found. + + Time Complexity: O(log n) + Space Complexity: O(1) + + Note: + The function assumes the list is sorted in ascending order. + If the list contains duplicate elements, it returns the index of one of them. + """ + left = 0 + right = len(sorted_list) - 1 + + while left <= right: + mid = (left + right) // 2 + mid_val = sorted_list[mid] + + if mid_val == target: + return mid + elif mid_val < target: + left = mid + 1 + else: + right = mid - 1 + + return None +``` + +### Example usage: + +```python +# Example 1: Finding an existing element +numbers = [1, 3, 5, 7, 9, 11] +print(binary_search(numbers, 7)) # Output: 3 + +# Example 2: Element not in the list +print(binary_search(numbers, 4)) # Output: None + +# Example 3: Empty list +print(binary_search([], 5)) # Output: None + +# Example 4: Single element +print(binary_search([1], 1)) # Output: 0 +print(binary_search([1], 2)) # Output: None +``` + +### Key features: +- Uses `TypeVar` to support generic types (as long as comparison operations are defined) +- Returns `Optional[int]` to indicate either the index or no match found +- Uses `Sequence[T]` to accept any sequence type (list, tuple, etc.) 
+- Includes comprehensive docstring with time/space complexity +- Implements standard iterative binary search for O(1) space complexity +```` + +#### 4.2.2 Streaming Example + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="Qwen/Qwen3-Coder-Next", + messages=[ + {"role": "user", "content": "Explain the difference between a stack and a queue in 3 sentences."} + ], + max_tokens=512, + stream=True +) + +for chunk in response: + if chunk.choices and chunk.choices[0].delta.content: + print(chunk.choices[0].delta.content, end="", flush=True) +print() +``` + +**Example Output:** + +```text Output +A **stack** follows the **Last In, First Out (LIFO)** principle, meaning the last element added is the first one removed—operations like `push` (add) and `pop` (remove) occur at the same end, called the *top*. In contrast, a **queue** follows the **First In, First Out (FIFO)** principle, where elements are added at the *back* (enqueue) and removed from the *front* (dequeue), preserving the order of insertion. This structural difference makes stacks ideal for tasks like function call management and expression evaluation, while queues suit scheduling, buffering, and breadth-first traversal. +``` + +#### 4.2.3 Tool Calling Example + +Qwen3-Coder-Next supports tool calling capabilities. Make sure `--tool-call-parser qwen3_coder` is included in the deployment command above. 
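When the model decides to call a tool, the response carries the function name and a JSON-encoded argument string, and the client is responsible for decoding that string and running the matching function. The following is a minimal sketch of that client-side step. It assumes a hypothetical `dispatch_tool_call` helper and a stand-in `execute_code` handler that only counts lines rather than running anything; both names are illustrative and are not part of SGLang or the OpenAI client.

```python
import json


def dispatch_tool_call(name, arguments_json, registry):
    """Decode a tool call's JSON argument string and invoke the matching handler."""
    if name not in registry:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)  # arguments arrive as a JSON string
    return registry[name](**kwargs)


def execute_code(code):
    # Hypothetical stand-in: a real handler would run the code in a sandbox.
    # Here we only report how many lines of code were received.
    return len(code.splitlines())


# Map tool names (as declared in the `tools` list) to local handlers.
registry = {"execute_code": execute_code}

# Argument string in the same shape as a tool call's `function.arguments`.
arguments = r'{"code": "import math\nmath.factorial(10)"}'
result = dispatch_tool_call("execute_code", arguments, registry)
print(result)
```

In a real client, `name` and `arguments_json` would come from `tool_call.function.name` and `tool_call.function.arguments` on the response message. Rejecting unknown tool names up front keeps a malformed or unexpected call from silently doing nothing.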
+ +**Python Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "execute_code", + "description": "Execute Python code and return the result", + "parameters": { + "type": "object", + "properties": { + "code": { + "type": "string", + "description": "The Python code to execute" + } + }, + "required": ["code"] + } + } + } +] + +response = client.chat.completions.create( + model="Qwen/Qwen3-Coder-Next", + messages=[ + {"role": "user", "content": "Calculate the factorial of 10 using Python"} + ], + tools=tools +) + +# Check if the model wants to call a tool +if response.choices[0].message.tool_calls: + tool_call = response.choices[0].message.tool_calls[0] + print(f"Tool: {tool_call.function.name}") + print(f"Arguments: {tool_call.function.arguments}") +else: + print(response.choices[0].message.content) +``` + +**Example Output:** + +```text Output +Tool: execute_code +Arguments: {"code": "import math\nmath.factorial(10)"} +``` + +## 5. 
Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (2x) +- Model: Qwen/Qwen3-Coder-Next +- Tensor Parallelism: 2 +- sglang version: 0.5.8+ + +#### 5.1.1 Standard Scenario Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Coder-Next \ + --tp 2 \ + --host 0.0.0.0 \ + --port 30000 +``` + +##### 5.1.1.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` +- Result: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 27.86 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4218 +Request throughput (req/s): 0.36 +Input token throughput (tok/s): 219.00 +Output token throughput (tok/s): 151.48 +Peak output token throughput (tok/s): 166.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 370.48 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2784.14 +Median E2E Latency (ms): 2258.08 +P90 E2E Latency (ms): 5044.43 +P99 E2E Latency (ms): 6130.52 +---------------Time to First Token---------------- +Mean TTFT (ms): 161.68 +Median TTFT (ms): 168.09 +P99 TTFT (ms): 183.26 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 6.19 +Median TPOT (ms): 6.23 +P99 TPOT (ms): 6.32 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 6.23 +Median ITL (ms): 6.23 +P95 ITL (ms): 6.51 +P99 ITL (ms): 6.64 +Max ITL (ms): 13.45 +================================================== +``` + +##### 5.1.1.2 Medium Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` +- Result: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 39.06 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40789 +Request throughput (req/s): 2.05 +Input token throughput (tok/s): 1015.62 +Output token throughput (tok/s): 1044.73 +Peak output token throughput (tok/s): 1664.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 2060.34 +Concurrency: 14.16 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6910.97 +Median E2E Latency (ms): 7248.27 +P90 E2E Latency (ms): 11612.63 +P99 E2E Latency (ms): 13933.91 +---------------Time to First Token---------------- +Mean TTFT (ms): 183.48 +Median TTFT (ms): 156.50 +P99 TTFT (ms): 311.46 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 13.61 +Median TPOT (ms): 13.59 +P99 TPOT (ms): 21.11 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 13.22 +Median ITL (ms): 9.76 +P95 ITL (ms): 10.43 +P99 ITL (ms): 158.04 +Max ITL (ms): 394.39 +================================================== +``` + +##### 5.1.1.3 High Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 +``` +- Result: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 102.81 +Total input tokens: 249831 +Total input text tokens: 249831 +Total generated tokens: 252662 +Total generated tokens (retokenized): 252536 +Request throughput (req/s): 4.86 +Input token throughput (tok/s): 2429.99 +Output token throughput (tok/s): 2457.53 +Peak output token throughput (tok/s): 5299.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 4887.52 +Concurrency: 94.28 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 19385.20 +Median E2E Latency (ms): 17584.09 +P90 E2E Latency (ms): 36762.15 +P99 E2E Latency (ms): 42518.35 +---------------Time to First Token---------------- +Mean TTFT (ms): 270.62 +Median TTFT (ms): 159.65 +P99 TTFT (ms): 938.90 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 38.57 +Median TPOT (ms): 41.78 +P99 TPOT (ms): 53.28 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 37.90 +Median ITL (ms): 18.26 +P95 ITL (ms): 167.82 +P99 ITL (ms): 311.45 +Max ITL (ms): 993.20 +================================================== +``` + +#### 5.1.2 Reasoning Scenario Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Coder-Next \ + --tp 2 \ + --host 0.0.0.0 \ + --port 30000 +``` + +##### 5.1.2.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Result: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 285.02 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 44462 +Total generated tokens (retokenized): 44432 +Request throughput (req/s): 0.04 +Input token throughput (tok/s): 21.41 +Output token throughput (tok/s): 156.00 +Peak output token throughput (tok/s): 173.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 177.40 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 28499.54 +Median E2E Latency (ms): 30424.65 +P90 E2E Latency (ms): 49132.26 +P99 E2E Latency (ms): 51075.28 +---------------Time to First Token---------------- +Mean TTFT (ms): 95.51 +Median TTFT (ms): 93.86 +P99 TTFT (ms): 112.56 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 6.24 +Median TPOT (ms): 6.30 +P99 TPOT (ms): 6.60 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 6.39 +Median ITL (ms): 6.34 +P95 ITL (ms): 7.16 +P99 ITL (ms): 7.42 +Max ITL (ms): 12.48 +================================================== +``` + +##### 5.1.2.2 Medium Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +- Result: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 237.77 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 318306 +Total generated tokens (retokenized): 315646 +Request throughput (req/s): 0.34 +Input token throughput (tok/s): 166.83 +Output token throughput (tok/s): 1338.72 +Peak output token throughput (tok/s): 1727.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 1505.55 +Concurrency: 13.88 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 41266.21 +Median E2E Latency (ms): 41010.10 +P90 E2E Latency (ms): 77574.22 +P99 E2E Latency (ms): 82688.04 +---------------Time to First Token---------------- +Mean TTFT (ms): 140.73 +Median TTFT (ms): 84.52 +P99 TTFT (ms): 365.86 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.32 +Median TPOT (ms): 10.38 +P99 TPOT (ms): 10.87 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 10.34 +Median ITL (ms): 10.19 +P95 ITL (ms): 10.75 +P99 ITL (ms): 11.18 +Max ITL (ms): 206.79 +================================================== +``` + +##### 5.1.2.3 High Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 +``` + +- Result: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 384.82 +Total input tokens: 158939 +Total input text tokens: 158939 +Total generated tokens: 1301025 +Total generated tokens (retokenized): 1299908 +Request throughput (req/s): 0.83 +Input token throughput (tok/s): 413.02 +Output token throughput (tok/s): 3380.83 +Peak output token throughput (tok/s): 4317.00 +Peak concurrent requests: 69 +Total token throughput (tok/s): 3793.85 +Concurrency: 56.42 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 67847.54 +Median E2E Latency (ms): 70724.38 +P90 E2E Latency (ms): 120888.83 +P99 E2E Latency (ms): 133234.48 +---------------Time to First Token---------------- +Mean TTFT (ms): 212.24 +Median TTFT (ms): 115.96 +P99 TTFT (ms): 652.93 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 16.76 +Median TPOT (ms): 16.99 +P99 TPOT (ms): 18.18 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 16.64 +Median ITL (ms): 15.83 +P95 ITL (ms): 31.64 +P99 ITL (ms): 90.85 +Max ITL (ms): 576.60 +================================================== +``` + +#### 5.1.3 Summarization Scenario Benchmark + +##### 5.1.3.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Result: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 29.42 +Total input tokens: 41941 +Total input text tokens: 41941 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4220 +Request throughput (req/s): 0.34 +Input token throughput (tok/s): 1425.35 +Output token throughput (tok/s): 143.42 +Peak output token throughput (tok/s): 169.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 1568.77 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2941.19 +Median E2E Latency (ms): 2411.84 +P90 E2E Latency (ms): 5661.26 +P99 E2E Latency (ms): 6497.45 +---------------Time to First Token---------------- +Mean TTFT (ms): 139.46 +Median TTFT (ms): 160.33 +P99 TTFT (ms): 184.30 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 6.56 +Median TPOT (ms): 6.65 +P99 TPOT (ms): 7.29 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 6.65 +Median ITL (ms): 6.68 +P95 ITL (ms): 7.39 +P99 ITL (ms): 7.51 +Max ITL (ms): 16.34 +================================================== +``` + +##### 5.1.3.2 Medium Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +- Result: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 41.62 +Total input tokens: 300020 +Total input text tokens: 300020 +Total generated tokens: 41669 +Total generated tokens (retokenized): 41664 +Request throughput (req/s): 1.92 +Input token throughput (tok/s): 7208.67 +Output token throughput (tok/s): 1001.19 +Peak output token throughput (tok/s): 1536.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 8209.86 +Concurrency: 14.27 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 7421.29 +Median E2E Latency (ms): 7985.77 +P90 E2E Latency (ms): 12122.09 +P99 E2E Latency (ms): 14595.05 +---------------Time to First Token---------------- +Mean TTFT (ms): 248.49 +Median TTFT (ms): 179.25 +P99 TTFT (ms): 915.90 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 14.13 +Median TPOT (ms): 14.28 +P99 TPOT (ms): 24.02 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 13.80 +Median ITL (ms): 10.46 +P95 ITL (ms): 11.00 +P99 ITL (ms): 173.14 +Max ITL (ms): 823.32 +================================================== +``` + +##### 5.1.3.3 High Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3-Coder-Next \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 +``` + +- Result: +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 85.74 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total generated tokens: 170000 +Total generated tokens (retokenized): 169983 +Request throughput (req/s): 3.73 +Input token throughput (tok/s): 14858.12 +Output token throughput (tok/s): 1982.80 +Peak output token throughput (tok/s): 3734.00 +Peak concurrent requests: 70 +Total token throughput (tok/s): 16840.92 +Concurrency: 59.75 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 16008.12 +Median E2E Latency (ms): 15460.65 +P90 E2E Latency (ms): 27705.81 +P99 E2E Latency (ms): 32874.74 +---------------Time to First Token---------------- +Mean TTFT (ms): 476.99 +Median TTFT (ms): 177.50 +P99 TTFT (ms): 3014.39 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 29.81 +Median TPOT (ms): 31.19 +P99 TPOT (ms): 45.53 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 29.29 +Median ITL (ms): 15.75 +P95 ITL (ms): 173.94 +P99 ITL (ms): 202.00 +Max ITL (ms): 2783.23 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python benchmark/gsm8k/bench_sglang.py --port 30000 +``` + +- **Test Results:** + +```text Output +Accuracy: 0.965 +Invalid: 0.000 +Latency: 26.407 s +Output throughput: 929.132 token/s +``` + +#### 5.2.2 MMLU Benchmark + +- **Benchmark Command:** + +```shell Command +cd benchmark/mmlu +bash download_data.sh +python3 bench_sglang.py --port 30000 +``` + +- **Test Results:** + +```text Output +subject: abstract_algebra, #q:100, acc: 0.780 +subject: anatomy, #q:135, acc: 0.807 +subject: astronomy, #q:152, acc: 0.921 +subject: business_ethics, #q:100, acc: 0.820 +subject: clinical_knowledge, #q:265, acc: 0.860 +subject: college_biology, #q:144, acc: 0.944 +subject: college_chemistry, #q:100, acc: 0.590 +subject: college_computer_science, #q:100, acc: 0.820 +subject: college_mathematics, #q:100, acc: 0.800 +subject: college_medicine, #q:173, acc: 0.803 +subject: college_physics, #q:102, acc: 0.775 +subject: computer_security, #q:100, acc: 0.880 +subject: conceptual_physics, #q:235, acc: 0.936 +subject: econometrics, #q:114, acc: 0.807 +subject: electrical_engineering, #q:145, acc: 0.834 +subject: elementary_mathematics, #q:378, acc: 0.854 +subject: formal_logic, #q:126, acc: 0.802 +subject: global_facts, #q:100, acc: 0.610 +subject: high_school_biology, #q:310, acc: 0.971 +subject: high_school_chemistry, #q:203, acc: 0.803 +subject: high_school_computer_science, #q:100, acc: 0.920 +subject: high_school_european_history, #q:165, acc: 0.891 +subject: high_school_geography, #q:198, acc: 0.929 +subject: high_school_government_and_politics, #q:193, acc: 0.969 
+subject: high_school_macroeconomics, #q:390, acc: 0.903 +subject: high_school_mathematics, #q:270, acc: 0.689 +subject: high_school_microeconomics, #q:238, acc: 0.962 +subject: high_school_physics, #q:151, acc: 0.854 +subject: high_school_psychology, #q:545, acc: 0.947 +subject: high_school_statistics, #q:216, acc: 0.815 +subject: high_school_us_history, #q:204, acc: 0.907 +subject: high_school_world_history, #q:237, acc: 0.937 +subject: human_aging, #q:223, acc: 0.821 +subject: human_sexuality, #q:131, acc: 0.840 +subject: international_law, #q:121, acc: 0.934 +subject: jurisprudence, #q:108, acc: 0.870 +subject: logical_fallacies, #q:163, acc: 0.847 +subject: machine_learning, #q:112, acc: 0.812 +subject: management, #q:103, acc: 0.922 +subject: marketing, #q:234, acc: 0.923 +subject: medical_genetics, #q:100, acc: 0.970 +subject: miscellaneous, #q:783, acc: 0.941 +subject: moral_disputes, #q:346, acc: 0.850 +subject: moral_scenarios, #q:895, acc: 0.726 +subject: nutrition, #q:306, acc: 0.915 +subject: philosophy, #q:311, acc: 0.859 +subject: prehistory, #q:324, acc: 0.889 +subject: professional_accounting, #q:282, acc: 0.723 +subject: professional_law, #q:1534, acc: 0.648 +subject: professional_medicine, #q:272, acc: 0.923 +subject: professional_psychology, #q:612, acc: 0.845 +subject: public_relations, #q:110, acc: 0.782 +subject: security_studies, #q:245, acc: 0.796 +subject: sociology, #q:201, acc: 0.925 +subject: us_foreign_policy, #q:100, acc: 0.950 +subject: virology, #q:166, acc: 0.572 +subject: world_religions, #q:171, acc: 0.883 +Total latency: 208.985 +Average accuracy: 0.834 +``` diff --git a/docs_new/cookbook/autoregressive/Qwen/Qwen3-Coder.mdx b/docs_new/cookbook/autoregressive/Qwen/Qwen3-Coder.mdx new file mode 100644 index 000000000000..738d4654f63b --- /dev/null +++ b/docs_new/cookbook/autoregressive/Qwen/Qwen3-Coder.mdx @@ -0,0 +1,520 @@ +--- +title: Qwen3-Coder +metatags: + description: "Deploy Qwen3-Coder(480B, 30B) MoE coding model with 
SGLang on AMD MI300X (MI325X, MI355X)" +--- + +import { Qwen3CoderDeployment } from '/src/snippets/autoregressive/qwen3-coder-deployment.jsx'; + +## 1. Model Introduction + +[Qwen3-Coder](https://huggingface.co/collections/Qwen/qwen3-coder) is the latest code-focused large language model series from the Qwen team. Built on the foundation of Qwen3, Qwen3-Coder delivers exceptional performance in code generation, understanding, and reasoning tasks. + +**Key Features:** + +- **State-of-the-art Coding Performance**: Achieves top-tier results on HumanEval, MBPP, LiveCodeBench, and other major coding benchmarks. +- **Tool Calling Support**: Native support for function calling and tool use, enabling seamless integration with external APIs and services. +- **Extended Context Length**: Supports up to 256K tokens for processing large codebases and long documents. +- **Multilingual Code Support**: Proficient in Python, JavaScript, TypeScript, Java, C++, Go, Rust, and many other programming languages. +- **MoE Architecture**: Efficient Mixture-of-Experts design for optimal performance-to-cost ratio. +- **ROCm Support**: Compatible with AMD MI300X, MI325X and MI355X GPUs via SGLang (verified). +- **NVIDIA GPU Support**: Compatible with NVIDIA GB200 and B200 GPUs via SGLang (verified). + +For more details, please refer to the [official Qwen3-Coder GitHub Repository](https://github.com/QwenLM/Qwen3-Coder). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations verified on AMD MI300X, MI325X, MI355X and NVIDIA B200, GB200 hardware platforms. 
+ +### 3.1 Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, and quantization method. + + + +### 3.2 Configuration Tips + +**AMD (MI300X/MI325X/MI355X):** +* **Memory Management**: We have verified successful deployment on MI300X/MI325X/MI355X with `--context-length 8192`. Larger context lengths may be supported but require additional memory. +* **Expert Parallelism**: For 480B-A35B with FP8 quantization, `--ep 2` is required to satisfy the dimension alignment requirement. +* **Page Size**: `--page-size 32` is recommended for MoE models to optimize memory usage. +* **Environment Variable**: If you encounter aiter-related issues, try setting `SGLANG_USE_AITER=0`. + +**NVIDIA (B200/GB200):** +* **MoE Runner Backend**: FP8 uses `--moe-runner-backend triton`; NVFP4 uses `--moe-runner-backend flashinfer_cutlass`. +* **NVFP4 Quantization**: Requires `--quantization modelopt_fp4` and uses a different model path (`nvidia/Qwen3-Coder-...`). +* **DP Attention**: The NVFP4 configuration supports `--enable-dp-attention` for improved throughput. + +**General:** +* **Tool Use**: To enable tool calling capabilities, add `--tool-call-parser qwen3_coder` to the launch command. + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Code Generation Example + +```python Example +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": "Write a Python function that implements binary search on a sorted list. Include docstring and type hints."
+ } +] + +response = client.chat.completions.create( + model="Qwen/Qwen3-Coder-480B-A35B-Instruct", + messages=messages, + max_tokens=2048, + temperature=0.7 +) + +print(response.choices[0].message.content) +``` + +**Example Output:** + +````text Output +```python +from typing import List, Optional, TypeVar + +T = TypeVar('T') + +def binary_search(arr: List[T], target: T) -> Optional[int]: + """ + Perform binary search on a sorted list to find the index of a target element. + + This function implements the binary search algorithm, which efficiently finds + a target value in a sorted array by repeatedly dividing the search interval + in half. + + Args: + arr (List[T]): A sorted list of elements to search through. + target (T): The element to search for in the list. + + Returns: + Optional[int]: The index of the target element if found, None otherwise. + + Time Complexity: + O(log n) where n is the number of elements in the array. + + Space Complexity: + O(1) - iterative implementation uses constant extra space. + + Examples: + >>> binary_search([1, 2, 3, 4, 5], 3) + 2 + >>> binary_search([1, 2, 3, 4, 5], 6) + None + >>> binary_search(['a', 'b', 'c', 'd'], 'b') + 1 + >>> binary_search([], 1) + None + """ + if not arr: + return None + + left: int = 0 + right: int = len(arr) - 1 + + while left <= right: + mid: int = (left + right) // 2 + + if arr[mid] == target: + return mid + elif arr[mid] < target: + left = mid + 1 + else: + right = mid - 1 + + return None + +# Alternative recursive implementation +def binary_search_recursive(arr: List[T], target: T, left: int = 0, right: Optional[int] = None) -> Optional[int]: + """ + Perform binary search recursively on a sorted list to find the index of a target element. + + Args: + arr (List[T]): A sorted list of elements to search through. + target (T): The element to search for in the list. + left (int): Left boundary of the search range (inclusive). + right (Optional[int]): Right boundary of the search range (inclusive). 
+ + Returns: + Optional[int]: The index of the target element if found, None otherwise. + + Time Complexity: + O(log n) where n is the number of elements in the array. + + Space Complexity: + O(log n) due to recursive call stack. + + Examples: + >>> binary_search_recursive([1, 2, 3, 4, 5], 3) + 2 + >>> binary_search_recursive([1, 2, 3, 4, 5], 6) + None + """ + if not arr: + return None + + if right is None: + right = len(arr) - 1 + + if left > right: + return None + + mid: int = (left + right) // 2 + + if arr[mid] == target: + return mid + elif arr[mid] < target: + return binary_search_recursive(arr, target, mid + 1, right) + else: + return binary_search_recursive(arr, target, left, mid - 1) +``` + +This implementation provides: + +1. **Main function** (`binary_search`): An iterative implementation that's more memory-efficient +2. **Alternative function** (`binary_search_recursive`): A recursive implementation for educational purposes +3. **Type hints**: Using generics (`TypeVar`) to work with any comparable type +4. **Comprehensive docstring**: Including description, parameters, return value, complexity analysis, and examples +5. **Edge case handling**: Empty lists, elements not found, etc. +6. **Clear variable names**: Self-documenting code +7. **Examples**: Doctest-style examples in the docstring + +The function works with any sorted list of comparable elements (integers, strings, etc.) and returns the index of the target element if found, or `None` if not found. +```` + +#### 4.2.2 Tool Calling Example + +Qwen3-Coder supports tool calling capabilities. Enable the tool call parser during deployment. 
The following example uses the 30B-A3B model: + +```shell Command +SGLANG_USE_AITER=0 python -m sglang.launch_server \ + --model Qwen/Qwen3-Coder-30B-A3B-Instruct \ + --tp 1 \ + --context-length 8192 \ + --page-size 32 \ + --tool-call-parser qwen3_coder +``` + +**Python Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "execute_code", + "description": "Execute Python code and return the result", + "parameters": { + "type": "object", + "properties": { + "code": { + "type": "string", + "description": "The Python code to execute" + } + }, + "required": ["code"] + } + } + } +] + +response = client.chat.completions.create( + model="Qwen/Qwen3-Coder-30B-A3B-Instruct", + messages=[ + {"role": "user", "content": "Calculate the factorial of 10 using Python"} + ], + tools=tools, + temperature=0.7 +) + +# Check if the model wants to call a tool +if response.choices[0].message.tool_calls: + tool_call = response.choices[0].message.tool_calls[0] + print(f"Tool: {tool_call.function.name}") + print(f"Arguments: {tool_call.function.arguments}") +else: + # Model may return tool call in content format + print(response.choices[0].message.content) +``` + +**Example Output:** + +```text Output +Tool: execute_code +Arguments: {"code": "def factorial(n):\n if n == 0 or n == 1:\n return 1\n else:\n return n * factorial(n-1)\n\nresult = factorial(10)\nresult"} +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: AMD MI300X GPU (8x) +- Model: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 +- Tensor Parallelism: 8 +- Expert Parallelism: 2 +- sglang version: 0.5.7 + +We use SGLang's built-in benchmarking tool to evaluate performance on a random dataset.
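
As a quick sanity check when reading the reports in this section, the headline throughput figures can be recomputed from the token totals and the benchmark duration. The sketch below assumes each throughput is simply a token count divided by the wall-clock duration of the run; `throughput` and `summarize` are hypothetical helper names introduced here for illustration and are not part of `sglang.bench_serving`.

```python
def throughput(total_tokens: int, duration_s: float) -> float:
    """Tokens per second over the whole run (assumed definition)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return total_tokens / duration_s


def summarize(input_tokens: int, output_tokens: int, duration_s: float) -> dict:
    """Recompute input, output, and total token throughput in tok/s."""
    return {
        "input_tok_s": throughput(input_tokens, duration_s),
        "output_tok_s": throughput(output_tokens, duration_s),
        "total_tok_s": throughput(input_tokens + output_tokens, duration_s),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not taken from a real benchmark run.
    stats = summarize(input_tokens=6000, output_tokens=4000, duration_s=80.0)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

Plugging in the totals and duration from a report should land close to the printed throughput lines; small differences are expected because the printed duration is rounded to two decimals.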
+ +#### 5.1.1 Standard Scenario Benchmark + +- Model Deployment Command: + +```shell Command +SGLANG_USE_AITER=0 python -m sglang.launch_server \ + --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \ + --tp 8 \ + --ep 2 \ + --context-length 8192 \ + --page-size 32 \ + --trust-remote-code +``` + +##### 5.1.1.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 73.79 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4104 +Request throughput (req/s): 0.14 +Input token throughput (tok/s): 82.68 +Output token throughput (tok/s): 57.19 +Peak output token throughput (tok/s): 59.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 139.86 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 7376.26 +Median E2E Latency (ms): 5851.51 +P90 E2E Latency (ms): 13351.89 +P99 E2E Latency (ms): 16908.32 +---------------Time to First Token---------------- +Mean TTFT (ms): 191.93 +Median TTFT (ms): 126.06 +P99 TTFT (ms): 662.15 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 17.06 +Median TPOT (ms): 17.07 +P99 TPOT (ms): 17.08 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 17.06 +Median ITL (ms): 17.06 +P95 ITL (ms): 17.14 +P99 ITL (ms): 17.19 +Max ITL (ms): 18.53 +================================================== +``` + +##### 5.1.1.2 Medium Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 87.04 +Total input tokens: 39668 +Total input text tokens: 39668 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40364 +Request throughput (req/s): 0.92 +Input token throughput (tok/s): 455.77 +Output token throughput (tok/s): 468.83 +Peak output token throughput (tok/s): 608.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 924.59 +Concurrency: 13.76 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 14966.88 +Median E2E Latency (ms): 15871.93 +P90 E2E Latency (ms): 24983.41 +P99 E2E Latency (ms): 29504.85 +---------------Time to First Token---------------- +Mean TTFT (ms): 388.94 +Median TTFT (ms): 157.49 +P99 TTFT (ms): 1318.63 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 29.41 +Median TPOT (ms): 29.22 +P99 TPOT (ms): 43.48 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 28.64 +Median ITL (ms): 26.42 +P95 ITL (ms): 27.51 +P99 ITL (ms): 131.63 +Max ITL (ms): 995.11 +================================================== +``` + +##### 5.1.1.3 High Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 177.82 +Total input tokens: 158939 +Total input text tokens: 158939 +Total generated tokens: 170134 +Total generated tokens (retokenized): 168387 +Request throughput (req/s): 1.80 +Input token throughput (tok/s): 893.84 +Output token throughput (tok/s): 956.80 +Peak output token throughput (tok/s): 1728.00 +Peak concurrent requests: 70 +Total token throughput (tok/s): 1850.64 +Concurrency: 58.88 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 32716.53 +Median E2E Latency (ms): 30896.37 +P90 E2E Latency (ms): 65605.24 +P99 E2E Latency (ms): 80970.63 +---------------Time to First Token---------------- +Mean TTFT (ms): 372.97 +Median TTFT (ms): 181.67 +P99 TTFT (ms): 529.01 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 62.98 +Median TPOT (ms): 50.44 +P99 TPOT (ms): 204.24 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 60.95 +Median ITL (ms): 37.87 +P95 ITL (ms): 143.98 +P99 ITL (ms): 148.02 +Max ITL (ms): 36863.32 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` + +##### AMD (MI300X/MI325X/MI355X) + +- **Results**: + + - Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 + ``` + Accuracy: 0.965 + Invalid: 0.000 + Latency: 23.084 s + Output throughput: 1148.425 token/s + ``` + +##### NVIDIA (B200/GB200) + +For deployment commands, see [Section 3.1](#31-configuration). + + - Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 (tp=8, ep=2) + ``` + Accuracy: 0.950 + Invalid: 0.000 + Latency: 12.914 s + Output throughput: 2065.515 token/s + ``` + + - nvidia/Qwen3-Coder-480B-A35B-Instruct-NVFP (NVFP4, tp=8, ep=1) + ``` + Accuracy: 0.970 + Invalid: 0.000 + Latency: 71.280 s + Output throughput: 390.080 token/s + ``` diff --git a/docs_new/cookbook/autoregressive/Qwen/Qwen3-Next.mdx b/docs_new/cookbook/autoregressive/Qwen/Qwen3-Next.mdx new file mode 100644 index 000000000000..57bb422b9e3a --- /dev/null +++ b/docs_new/cookbook/autoregressive/Qwen/Qwen3-Next.mdx @@ -0,0 +1,774 @@ +--- +title: Qwen3-Next +metatags: + description: "Deploy Qwen3-Next with SGLang - hybrid attention architecture supporting 262K context, 80B MoE with 3B active parameters, and multi-token prediction." +--- + +import { Qwen3NextDeployment } from '/src/snippets/autoregressive/qwen3-next-deployment.jsx'; + +## 1. Model Introduction + +[Qwen3-Next](https://huggingface.co/collections/Qwen/qwen3-next) is an advanced large language model architecture developed by Alibaba's Qwen team, designed to enhance efficiency and performance in handling extensive contexts and large-scale parameters. 
It features advanced capabilities in reasoning, function calling, and multilingual understanding. + +Qwen3-Next introduces several groundbreaking innovations: + +- **Hybrid Attention Mechanism**: Replaces standard attention with a combination of **Gated DeltaNet** (linear attention) and **Full Attention**, enabling efficient processing of context lengths up to 262,144 tokens. This hybrid approach makes it ideal for analyzing lengthy documents such as entire books or contracts. + +- **Highly Sparse Mixture-of-Experts (MoE)**: Features an 80-billion parameter architecture where only 3 billion parameters are active during inference. This design reduces computational costs by up to 90% while maintaining high performance, drastically reducing FLOPs per token without compromising model capacity. + +- **Multi-Token Prediction (MTP)**: Enables generation of multiple tokens per inference step, significantly reducing latency and enhancing user experience in real-time applications. This innovation boosts both pretraining performance and inference speed. + +- **Multilingual Support**: Natively supports 119 languages, facilitating seamless cross-lingual tasks and making it versatile for global applications. + +- **Enterprise-Ready Deployment**: Released under the Apache 2.0 license, offering flexible deployment options including on-premises, virtual private cloud (VPC), and private cloud environments, ensuring security and compliance for enterprise use. + +- **Advanced Reasoning & Stability**: Demonstrates clear improvement in reasoning performance with support for tool use during inference. Includes stability optimizations such as **zero-centered** and **weight-decayed layernorm** for robust pre-training and post-training. + +For more details, please refer to the [official Qwen3-Next blog](https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. 
You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +The Qwen3-Next series comes in a single size but offers different thinking modes. Recommended starting configurations vary depending on hardware. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities. + + + +### 3.2 Configuration Tips + +- `--max-mamba-cache-size`: Increase this value to enlarge the mamba cache and raise the maximum number of running requests. The trade-off is less KV cache space, so tune it to your workload. + +- `--mamba-ssm-dtype`: `bfloat16` or `float32`. Use `bfloat16` to reduce mamba cache size and `float32` for more accurate results. The default is `float32`. + +- `--mamba-full-memory-ratio`: Sets the ratio of mamba state memory to full KV cache memory. The default is `0.9`. + +- **Mamba Radix Cache**: Qwen3-Next's hybrid Gated Delta Networks architecture supports two mamba scheduling strategies via `--mamba-scheduler-strategy`: + - **V1 (`no_buffer`)**: Default. No overlap scheduler, lower memory usage. + - **V2 (`extra_buffer`)**: Enables overlap scheduling and branching point caching with `--mamba-scheduler-strategy extra_buffer --page-size 64`. Requires the FLA kernel backend. Trades higher mamba state memory for better throughput. Strictly superior in non-KV-cache-bound scenarios; in KV-cache-bound cases, weigh the overlap scheduling benefit against reduced max concurrency.
`--page-size` must satisfy `FLA_CHUNK_SIZE % page_size == 0` or `page_size % FLA_CHUNK_SIZE == 0` (`FLA_CHUNK_SIZE` is currently 64). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +1. **Streaming with Thinking Process:** + + Qwen3-Next-80B-A3B-Thinking only supports thinking mode. Enable the reasoning parser during deployment to separate the thinking and the content sections. + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Thinking \ + --reasoning-parser qwen3 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="Qwen/Qwen3-Next-80B-A3B-Thinking", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text 
Output +=============== Thinking ================= +Okay, let's see. I need to find 15% of 240. Hmm, percentages. Right, "percent" means per hundred, so 15% is 15 per 100, or 15/100. To find a percentage of a number, I think you multiply the number by the percentage as a decimal. So first, maybe convert 15% to a decimal. To convert a percentage to a decimal, you divide by 100. So 15 divided by 100 is 0.15. Then, multiply that by 240. Let me check that. So 0.15 times 240. Let's calculate that. Maybe break it down. 10% of 240 is 24, because 10% is just moving the decimal one place left, so 240 becomes 24. Then 5% would be half of 10%, so half of 24 is 12. So 10% + 5% = 15%, so 24 + 12 = 36. Oh, that's another way to do it. Let me verify with the multiplication. 0.15 * 240. Let's do 240 * 0.1 = 24, 240 * 0.05 = 12, so 24 + 12 = 36. Yep, that works. Alternatively, 240 * 15 = 3600, then divide by 100, which is 36. Because 15% of 240 is (15/100)*240 = (15*240)/100. 15*240: 10*240=2400, 5*240=1200, so 2400+1200=3600. Then 3600/100=36. So that's 36. So the answer should be 36. Let me make sure. 15% of 240. If I take 240 and multiply by 0.15, 240*0.15. Let's compute 240*0.1=24, 240*0.05=12, so 24+12=36. Yep, that's right. So 15% of 240 is 36. + +=============== Content ================= + +To find **15% of 240**, follow these steps: + +--- + +### **Step 1: Understand what "percent" means** +- "Percent" means **per hundred**, so **15% = 15/100 = 0.15** in decimal form. 
+ +--- + +### **Step 2: Multiply the number by the decimal** +- To find 15% of 240, multiply: + $$ + 240 \times 0.15 + $$ + +--- + +### **Step 3: Break it down for clarity (optional but helpful)** +- **10% of 240** = $ 240 \times 0.1 = 24 $ +- **5% of 240** = $ 240 \times 0.05 = 12 $ +- Add them together: + $$ + 24 + 12 = 36 + $$ + +--- + +### **Step 4: Confirm with direct multiplication** +- $ 240 \times 0.15 = 36 $ + +--- + +### ✅ Final Answer: +$$ +\boxed{36} +$$ +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +2. **Turn off Thinking:** + + Qwen3-Next-80B-A3B-Instruct only supports instruct (non-thinking) mode. + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Turn off thinking process +response = client.chat.completions.create( + model="Qwen/Qwen3-Next-80B-A3B-Instruct", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True, + extra_body={"chat_template_kwargs": {"enable_thinking": False}} +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + 
print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +To find **15% of 240**, follow these steps: + +--- + +### **Step 1: Understand what percentage means** +"Percent" means "per hundred," so **15%** is the same as **15 per 100**, or the fraction: + +$$ +\frac{15}{100} +$$ + +--- + +### **Step 2: Multiply the fraction by the number** +To find 15% of 240, multiply: + +$$ +\frac{15}{100} \times 240 +$$ + +--- + +### **Step 3: Simplify the multiplication** +You can simplify this in a couple of ways. + +#### **Option A: Multiply first, then divide** +$$ +15 \times 240 = 3600 +$$ +Then divide by 100: +$$ +\frac{3600}{100} = 36 +$$ + +#### **Option B: Simplify the fraction first** +$$ +\frac{15}{100} = \frac{3}{20} \quad \text{(divided numerator and denominator by 5)} +$$ +Now multiply: +$$ +\frac{3}{20} \times 240 = \frac{3 \times 240}{20} = \frac{720}{20} = 36 +$$ + +--- + +### **Step 4: Final Answer** +$$ +\boxed{36} +$$ + +So, **15% of 240 is 36**. +``` + +#### 4.2.2 Tool Calling + +Both `Qwen/Qwen3-Next-80B-A3B-Instruct` and `Qwen/Qwen3-Next-80B-A3B-Thinking` support tool calling.
Enable the tool call parser: + +**Python Example (without Thinking Process):** + +Start sglang server: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --tool-call-parser qwen \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="Qwen/Qwen3-Next-80B-A3B-Instruct", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"🔧 Tool Call: {tool_call.function.name}") + print(f" 
Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output + +{"name": "get_weather", "arguments": {"location": "Beijing"}} + +``` + +**Python Example (with Thinking Process):** + +Start sglang server: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Thinking \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="Qwen/Qwen3-Next-80B-A3B-Thinking", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if 
has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"🔧 Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +Okay, the user is asking for the weather in Beijing. Let me check the available tools. There's a get_weather function that requires location and optionally unit. The location is needed, so I need to provide Beijing as the location. The unit is optional, but the user didn't specify Celsius or Fahrenheit. Since the default might be Celsius, but maybe I should check if the parameters require unit. Wait, the required field is only location, so unit is optional. So I can just call get_weather with location "Beijing" and not include the unit. Let me confirm the parameters. The parameters for get_weather have location as required, and unit is an enum with celsius or fahrenheit, but not required. So the correct call is to send location as Beijing, and omit unit. So the tool call should be {"name": "get_weather", "arguments": {"location": "Beijing"}}. + + +{"name": "get_weather", "arguments": {"location": "Beijing"}} + +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="Qwen/Qwen3-Next-80B-A3B-Thinking", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +#### 4.2.3 Processing Ultra-Long Texts + +Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method. + +**Qwen3-Next-80B-A3B-Instruct** + +```shell Command +SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 8 --host 0.0.0.0 --port 8000 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144}}' --context-length 1010000 + +``` + +**Qwen3-Next-80B-A3B-Thinking** + +```shell Command +SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Thinking --reasoning-parser qwen3 --tp 8 --host 0.0.0.0 --port 8000 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144}}' --context-length 1010000 + +``` + +## 5. 
Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (8x) +- Tensor Parallelism: 8 +- Model: Qwen/Qwen3-Next-80B-A3B-Instruct +- sglang version: 0.5.6 + +We use SGLang's built-in benchmarking tool to evaluate performance on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset, which contains real conversation data and better reflects performance in real-world use. + +#### 5.1.1 Latency-Sensitive Benchmark + +- Server Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --tp 8 +``` + +- Test Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --num-prompts 100 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 100 +Benchmark duration (s): 146.52 +Total input tokens: 33839 +Total input text tokens: 33839 +Total input vision tokens: 0 +Total generated tokens: 21640 +Total generated tokens (retokenized): 21619 +Request throughput (req/s): 0.68 +Input token throughput (tok/s): 230.95 +Output token throughput (tok/s): 147.70 +Peak output token throughput (tok/s): 164.00 +Peak concurrent requests: 6 +Total token throughput (tok/s): 378.65 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1464.81 +Median E2E Latency (ms): 1077.48 +---------------Time to First Token---------------- +Mean TTFT (ms): 127.88 +Median TTFT (ms): 132.88 +P99 TTFT (ms): 212.85 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 6.19 +Median TPOT (ms): 6.17 +P99 TPOT (ms): 6.64 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 6.21 +Median ITL (ms): 6.16 +P95 ITL (ms): 6.51 +P99 ITL (ms): 6.71 +Max ITL (ms): 10.07 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Server Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --tp 8 +``` + +- Test Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 100.32 +Total input tokens: 302118 +Total input text tokens: 302118 +Total input vision tokens: 0 +Total generated tokens: 195775 +Total generated tokens (retokenized): 195016 +Request throughput (req/s): 9.97 +Input token throughput (tok/s): 3011.69 +Output token throughput (tok/s): 1951.60 +Peak output token throughput (tok/s): 5909.00 +Peak concurrent requests: 120 +Total token throughput (tok/s): 4963.29 +Concurrency: 93.05 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 9333.98 +Median E2E Latency (ms): 6054.12 +---------------Time to First Token---------------- +Mean TTFT (ms): 161.77 +Median TTFT (ms): 137.94 +P99 TTFT (ms): 503.29 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 50.87 +Median TPOT (ms): 50.28 +P99 TPOT (ms): 122.87 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 47.11 +Median ITL (ms): 13.84 +P95 ITL (ms): 195.33 +P99 ITL (ms): 289.56 +Max ITL (ms): 486.38 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000 +``` + +- **Results:** + + - Qwen3-Next-80B-A3B-Instruct + + ``` + Accuracy: 0.960 + Invalid: 0.000 + Latency: 12.673 s + Output throughput: 2538.255 token/s + ``` + + - Qwen3-Next-80B-A3B-Thinking + + ``` + Accuracy: 0.935 + Invalid: 0.000 + Latency: 9.912 s + Output throughput: 3288.737 token/s + ``` + +#### 5.2.2 MMLU Benchmark + +- **Benchmark Command:** + +```shell Command +cd sglang +bash benchmark/mmlu/download_data.sh +python3 benchmark/mmlu/bench_sglang.py --nsub 10 +``` + +- **Results:** + + - Qwen3-Next-80B-A3B-Instruct + + ``` + subject: abstract_algebra, #q:100, acc: 0.800 + subject: anatomy, #q:135, acc: 0.807 + subject: astronomy, #q:152, acc: 0.947 + subject: business_ethics, #q:100, acc: 0.810 + subject: clinical_knowledge, #q:265, acc: 0.894 + subject: college_biology, #q:144, acc: 0.972 + subject: college_chemistry, #q:100, acc: 0.680 + subject: college_computer_science, #q:100, acc: 0.860 + subject: college_mathematics, #q:100, acc: 0.780 + subject: college_medicine, #q:173, acc: 0.861 + Total latency: 10.098 + Average accuracy: 0.856 + ``` + + - Qwen3-Next-80B-A3B-Thinking + + ``` + subject: abstract_algebra, #q:100, acc: 0.780 + subject: anatomy, #q:135, acc: 0.815 + subject: astronomy, #q:152, acc: 0.941 + subject: business_ethics, #q:100, acc: 0.870 + subject: clinical_knowledge, #q:265, acc: 0.894 + subject: college_biology, #q:144, acc: 0.965 + subject: college_chemistry, #q:100, acc: 0.670 + subject: college_computer_science, #q:100, acc: 0.840 + subject: 
college_mathematics, #q:100, acc: 0.770 + subject: college_medicine, #q:173, acc: 0.861 + Total latency: 10.236 + Average accuracy: 0.855 + ``` diff --git a/docs_new/cookbook/autoregressive/Qwen/Qwen3-VL.mdx b/docs_new/cookbook/autoregressive/Qwen/Qwen3-VL.mdx new file mode 100644 index 000000000000..849e648d401a --- /dev/null +++ b/docs_new/cookbook/autoregressive/Qwen/Qwen3-VL.mdx @@ -0,0 +1,777 @@ +--- +title: Qwen3-VL +metatags: + description: "Deploy Qwen3-VL vision-language models with SGLang - open model for text, 262K context, enhanced visual reasoning and agent capabilities." +--- + + +## 1. Model Introduction + +[Qwen3-VL series](https://github.com/QwenLM/Qwen3-VL) are the most powerful vision-language models in the Qwen series to date, featuring advanced capabilities in multi-modal understanding, reasoning, and agentic applications. + +This generation delivers comprehensive upgrades across the board: + +- **Superior text understanding & generation**: Qwen3-VL-235B-A22B-Instruct was ranked as the [#1 open model for text on lmarena.ai](https://x.com/arena/status/1973151703563460942) +- **Deeper visual perception & reasoning**: Enhanced image and video understanding capabilities. +- **Extended context length**: Supports up to 262K tokens for processing long documents and videos. +- **Enhanced spatial and video dynamics comprehension**: Better understanding of spatial relationships and temporal dynamics. +- **Stronger agent interaction capabilities**: Improved tool use and search-based agent performance. +- **Flexible deployment options**: Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions. + +For more details, please refer to the [official Qwen3-VL GitHub Repository](https://github.com/QwenLM/Qwen3-VL). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. 
+ +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +The Qwen3-VL series offers models in various sizes and architectures, optimized for different hardware platforms including NVIDIA and AMD GPUs. The recommended launch configurations vary by hardware and model size. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities. + +import { Qwen3VLDeployment } from "/src/snippets/autoregressive/qwen3-vl-deployment.jsx"; + + + +### 3.2 Configuration Tips + +* **Multimodal attention backend**: `--mm-attention-backend` defaults to `fa3` on H100/H200/A100 for better performance and to `triton_attn` on B200 for compatibility. +* **TTFT Optimization**: Set `SGLANG_USE_CUDA_IPC_TRANSPORT=1` to use CUDA IPC for transferring multimodal features, which significantly improves TTFT. This consumes additional memory and may require adjusting `--mem-fraction-static` and/or `--max-running-requests`. The additional memory is proportional to image size * number of images in the currently running requests. +* **Memory Management**: Set a lower `--context-length` to conserve memory. A value of `128000` is sufficient for most scenarios, down from the default 262K. +* **Expert Parallelism**: SGLang supports Expert Parallelism (EP) via `--ep`, allowing experts in MoE models to be deployed on separate GPUs for better throughput. 
For quantized models, set `--ep` to a value that satisfies `(moe_intermediate_size / moe_tp_size) % weight_block_size_n == 0`, where `moe_tp_size` equals `tp_size / ep_size`. Note that EP may perform worse in low-concurrency scenarios due to additional communication overhead. See [Expert Parallelism Deployment](../../../docs/advanced_features/expert_parallelism) for more details. +* **Kernel Tuning**: For MoE Triton kernel tuning on your specific hardware, refer to [fused_moe_triton](https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 Multi-Modal Inputs + +Qwen3-VL supports both image and video inputs. Here's a basic example with image input: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:8000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "Read all the text in the image." 
+ } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="Qwen/Qwen3-VL-235B-A22B-Instruct", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example Output:** + +```text Output +Response costs: 3.37s +Generated text: Auntie Anne's + +CINNAMON SUGAR +1 x 17,000 17,000 + +SUB TOTAL 17,000 + +GRAND TOTAL 17,000 + +CASH IDR 20,000 + +CHANGE DUE 3,000 +``` + +**Multi-Image Input Example:** + +Qwen3-VL can process multiple images in a single request for comparison or analysis: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:8000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://www.civitatis.com/f/china/hong-kong/guia/taxi.jpg" + } + }, + { + "type": "image_url", + "image_url": { + "url": "https://cdn.cheapoguides.com/wp-content/uploads/sites/7/2025/05/GettyImages-509614603-1280x600.jpg" + } + }, + { + "type": "text", + "text": "Compare these two images and describe the differences in 100 words or less. Focus on the key visual elements, colors, textures, and any notable contrasts between the two scenes. Be specific about what you see in each image." + } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="Qwen/Qwen3-VL-235B-A22B-Instruct", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example Output:** + +```text Output +Response costs: 10.18s +Generated text: The two images present starkly different portrayals of Hong Kong’s iconic red taxis, contrasting a dynamic street-level moment with a static, large-scale gathering. 
+ +The first image is a close-up, eye-level shot capturing a single red Toyota Crown taxi (license plate RX 5004) in motion or paused at an urban intersection. Its glossy red paint gleams under daylight, reflecting the vibrant, cluttered backdrop of a Hong Kong street — neon signs, glass-fronted shops displaying sunglasses, and Chinese characters. The taxi’s chrome grille, clear headlights, and black trim provide visual contrast. A green “4 SEATS” sticker and a “的士 TAXI” sign on the side reinforce its identity. The composition is intimate, focusing on the vehicle’s details — the texture of its paint, the slight reflections on the windows, and the crispness of its license plate. Other red taxis flank it, suggesting a bustling city rhythm, but the central taxi dominates the frame, conveying movement and immediacy. + +In contrast, the second image is an elevated, wide-angle shot of dozens of red taxis — along with a few green ones — parked in neat, grid-like rows on what appears to be a highway or staging area. The scene is static, almost ceremonial. Many taxis have their hoods open, suggesting maintenance, inspection, or protest. People are scattered among the vehicles, some inspecting engines, others conversing — adding a human, documentary element. The dominant color remains red, but the repetition creates a visual pattern rather than individual focus. The green taxis offer a subtle color contrast, hinting at different service zones (green for New Territories, red for urban areas). The setting is more utilitarian — concrete barriers, metal railings, and sparse vegetation — with an overpass looming in the background. The texture here is less about polished paint and more about the collective mass of vehicles, the asphalt, and the functional layout. 
+ +Key contrasts emerge: the first image is kinetic and personal, emphasizing the taxi as a working vehicle in the city’s daily flow; the second is static and collective, portraying the taxis as a fleet, possibly for logistical or political purposes. The lighting in both is bright daylight, but the first has richer color saturation and depth due to its proximity and urban backdrop, while the second feels flatter, more documentary in tone. The first image invites you into the city’s pulse; the second invites you to observe a system — organized, perhaps even paused — from a distance. + +In essence, the first image celebrates the individual taxi in its natural habitat; the second reveals the scale and structure behind the fleet, transforming the familiar red icon into a symbol of coordination, maintenance, or collective action. Both are quintessentially Hong Kong, yet they offer vastly different narratives — one of motion and commerce, the other of assembly and purpose. +``` + +**Video Input Example:** + +Qwen3-VL supports video understanding by processing video URLs: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:8000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "video_url", + "video_url": { + "url": "https://videos.pexels.com/video-files/4114797/4114797-uhd_3840_2160_25fps.mp4" + } + }, + { + "type": "text", + "text": "Describe what happens in this video." 
+ } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="Qwen/Qwen3-VL-235B-A22B-Instruct", + messages=messages, + max_tokens=2048 +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Note:** + +- For video processing, make sure the configured context length is large enough (up to 262K tokens). +- Video processing may require more memory; adjust `--mem-fraction-static` accordingly. +- You can also provide local file paths using the `file://` protocol. + +**Example Output:** + +```text Output +Response costs: 3.89s +Generated text: A person wearing blue gloves is using a microscope. They are adjusting the focus knob with one hand while holding a pipette with the other, suggesting they are preparing or examining a sample on the slide beneath the objective lens. The microscope's 40x objective lens is positioned over the slide, indicating a high-magnification observation. The person carefully manipulates the slide and the microscope controls, likely to achieve a clear view of the specimen. +``` + +#### 4.2.2 Reasoning Parser + +Qwen3-VL-Thinking supports reasoning mode. 
Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-VL-235B-A22B-Thinking \ + --reasoning-parser qwen3 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="Qwen/Qwen3-VL-235B-A22B-Thinking", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +To solve this problem, I need to calculate 15% of 240. +Step 1: Convert 15% to decimal: 15% = 0.15 +Step 2: Multiply 240 by 0.15 +Step 3: 240 × 0.15 = 36 +=============== Content ================= + +The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36. 
+``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.3 Tool Calling + +Qwen3-VL supports tool calling capabilities. Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-VL-235B-A22B-Thinking \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="Qwen/Qwen3-VL-235B-A22B-Thinking", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if 
has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. +I should call the function with location="Beijing". +=============== Content ================= + +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="Qwen/Qwen3-VL-235B-A22B-Thinking", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The weather in Beijing is currently 22°C and sunny." +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (8x) +- Model: Qwen3-VL-235B-A22B-Instruct +- Tensor Parallelism: 8 +- sglang version: 0.5.6 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation with random images. To simulate real-world usage, you can specify different input and output lengths for each request. For example, each request can have 128 input tokens, two 720p images, and 1024 output tokens. 
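
The per-request workload described above maps directly onto `sglang.bench_serving` flags. The script below is a minimal sketch of that mapping. The helper name `build_bench_command` and its default values are illustrative assumptions, not part of SGLang; it uses only flags that already appear in the benchmark commands in this section, and it prints the command rather than running it.

```python
import shlex


def build_bench_command(
    model: str,
    num_prompts: int,
    max_concurrency: int,
    image_count: int = 2,
    image_resolution: str = "720p",
    input_len: int = 128,
    output_len: int = 1024,
    host: str = "127.0.0.1",
    port: int = 8000,
) -> list[str]:
    """Hypothetical helper: assemble a sglang.bench_serving argument list."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang-oai-chat",
        "--host", host,
        "--port", str(port),
        "--model", model,
        "--dataset-name", "image",
        "--image-count", str(image_count),
        "--image-resolution", image_resolution,
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
    ]


if __name__ == "__main__":
    cmd = build_bench_command(
        model="Qwen/Qwen3-VL-235B-A22B-Instruct",
        num_prompts=10,
        max_concurrency=1,
    )
    # Print a copy-pasteable shell command; nothing is executed here.
    print(shlex.join(cmd))
```

Calling the helper with `num_prompts=1000` and `max_concurrency=100` yields the throughput-oriented setting used in 5.1.2.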
+ +#### 5.1.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-VL-235B-A22B-Instruct \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 8000 \ + --model Qwen/Qwen3-VL-235B-A22B-Instruct \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 45.97 +Total input tokens: 18348 +Total input text tokens: 708 +Total input vision tokens: 17640 +Total generated tokens: 4220 +Total generated tokens (retokenized): 3423 +Request throughput (req/s): 0.22 +Input token throughput (tok/s): 399.17 +Output token throughput (tok/s): 91.81 +Peak output token throughput (tok/s): 96.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 490.98 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4594.52 +Median E2E Latency (ms): 3725.04 +---------------Time to First Token---------------- +Mean TTFT (ms): 193.35 +Median TTFT (ms): 196.32 +P99 TTFT (ms): 222.75 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 10.44 +Median TPOT (ms): 10.44 +P99 TPOT (ms): 10.47 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 11.78 +Median ITL (ms): 10.48 +P95 ITL (ms): 21.01 +P99 ITL (ms): 31.40 +Max ITL (ms): 31.92 +================================================== +``` + +**Optimized Results (with CUDA IPC Transport):** + +For further TTFT optimization, enable CUDA IPC Transport for multimodal features by setting `SGLANG_USE_CUDA_IPC_TRANSPORT=1`. 
This uses CUDA IPC to transfer multimodal features, which is intended to reduce TTFT. + +- Model Deployment Command: + +```shell Command +SGLANG_USE_CUDA_IPC_TRANSPORT=1 python -m sglang.launch_server \ + --model Qwen/Qwen3-VL-235B-A22B-Instruct \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 8000 \ + --model Qwen/Qwen3-VL-235B-A22B-Instruct \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 100 \ + --max-concurrency 1 +``` + +- **Test Results:** + + With `SGLANG_USE_CUDA_IPC_TRANSPORT=1`, median TTFT in this run is 182.58 ms, compared with 196.32 ms in the baseline run above (mean: 191.16 ms versus 193.35 ms): + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 100 +Benchmark duration (s): 566.84 +Total input tokens: 183667 +Total input text tokens: 7267 +Total input vision tokens: 176400 +Total generated tokens: 52444 +Total generated tokens (retokenized): 28702 +Request throughput (req/s): 0.18 +Input token throughput (tok/s): 324.02 +Output token throughput (tok/s): 92.52 +Peak output token throughput (tok/s): 96.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 416.54 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 5667.50 +Median E2E Latency (ms): 5830.00 +---------------Time to First Token---------------- +Mean TTFT (ms): 191.16 +Median TTFT (ms): 182.58 +P99 TTFT (ms): 244.58 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.46 +Median TPOT (ms): 10.46 +P99 TPOT (ms): 10.48 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 13.91 +Median ITL (ms): 10.56 +P95 ITL (ms): 21.35 +P99 ITL (ms): 31.55 +Max ITL (ms): 42.36 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-VL-235B-A22B-Instruct \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 8000 \ + --model Qwen/Qwen3-VL-235B-A22B-Instruct \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 584.65 +Total input tokens: 1839015 +Total input text tokens: 75015 +Total input vision tokens: 1764000 +Total generated tokens: 510855 +Total generated tokens (retokenized): 284284 +Request throughput (req/s): 1.71 +Input token throughput (tok/s): 3145.50 +Output token throughput (tok/s): 873.78 +Peak output token throughput (tok/s): 2855.00 +Peak concurrent requests: 107 +Total token throughput (tok/s): 4019.29 +Concurrency: 98.35 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 57502.05 +Median E2E Latency (ms): 54301.08 +---------------Time to First Token---------------- +Mean TTFT (ms): 5802.23 +Median TTFT (ms): 1444.75 +P99 TTFT (ms): 46675.92 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 100.22 +Median TPOT (ms): 105.43 +P99 TPOT (ms): 144.37 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 134.20 +Median ITL (ms): 25.57 +P95 ITL (ms): 558.14 +P99 ITL (ms): 1449.01 +Max ITL (ms): 33453.23 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 MMMU Benchmark + +You can evaluate the model's accuracy using the MMMU dataset with `lmms_eval`: + +- Benchmark Command: + +```shell Command +uv pip install lmms_eval + +python3 -m lmms_eval \ + --model openai_compatible \ + --model_args "model=Qwen/Qwen3-VL-235B-A22B-Instruct,api_key=EMPTY,base_url=http://127.0.0.1:8000/v1/" \ + --tasks mmmu_val \ + --batch_size 128 \ + --log_samples \ + --log_samples_suffix "openai_compatible" \ + --output_path ./logs \ + --gen_kwargs "max_new_tokens=4096" +``` + +- **Test Results:** + +```text Output + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmmu_val | 0 | none | 0 | mmmu_acc | 0.6567 | ± N/A |
+``` diff --git a/docs_new/cookbook/autoregressive/Qwen/Qwen3.5.mdx b/docs_new/cookbook/autoregressive/Qwen/Qwen3.5.mdx new file mode 100644 index 000000000000..39a41e861864 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Qwen/Qwen3.5.mdx @@ -0,0 +1,1001 @@ +--- +title: Qwen3.5 +metatags: + description: "Deploy Qwen3.5 with SGLang - flagship Qwen model with unified vision-language foundation, hybrid architecture, and scalable reasoning." +--- + +import { Qwen35Deployment } from '/src/snippets/autoregressive/qwen35-deployment.jsx' + +## 1. Model Introduction + +[Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) is the latest flagship model in the Qwen series developed by Alibaba, representing a significant leap forward with a unified vision-language foundation, an efficient hybrid architecture, and scalable reinforcement learning. + +Qwen3.5 combines Gated Delta Networks with a sparse Mixture-of-Experts architecture (397B total parameters, 17B activated), delivering high-throughput inference with low latency. It supports multimodal inputs (text, image, video) and natively handles context lengths of up to 262,144 tokens, extensible to over 1M tokens. 
+ +**Key Features:** + +- **Unified Vision-Language Foundation**: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models +- **Efficient Hybrid Architecture**: Gated Delta Networks + sparse MoE (397B total / 17B active) for high-throughput inference +- **Hybrid Reasoning**: Thinking mode enabled by default with step-by-step reasoning, can be disabled for direct responses +- **Tool Calling**: Built-in tool calling support with `qwen3_coder` parser +- **Multi-Token Prediction (MTP)**: Speculative decoding support for lower latency +- **201 Language Support**: Expanded multilingual coverage across 201 languages and dialects + +**Available Models:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | BF16 (Full precision) | FP8 (8-bit Quantized) | FP4 (4-bit Quantized) |
|---|---|---|---|
| Qwen3.5-397B-A17B | [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) | [Qwen/Qwen3.5-397B-A17B-FP8](https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8) | [nvidia/Qwen3.5-397B-A17B-NVFP4](https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4) |
| Qwen3.5-122B-A10B | [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B) | [Qwen/Qwen3.5-122B-A10B-FP8](https://huggingface.co/Qwen/Qwen3.5-122B-A10B-FP8) | - |
| Qwen3.5-35B-A3B | [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) | [Qwen/Qwen3.5-35B-A3B-FP8](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-FP8) | - |
| Qwen3.5-27B | [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) | [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8) | - |
| Qwen3.5-9B | [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) | - | - |
| Qwen3.5-4B | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) | - | - |
| Qwen3.5-2B | [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) | - | - |
| Qwen3.5-0.8B | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) | - | - |
+ +**License:** Apache 2.0 + +## 2. SGLang Installation + +SGLang from the main branch is required for Qwen3.5. You can install from source or use a Docker image: + +```bash Command +# Install from source +uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python' + +# Or use Docker (NVIDIA GPUs) +docker pull lmsysorg/sglang:nightly-dev-20260216-d3bae71e + +# Or use Docker (AMD MI300X/MI325X) +docker pull lmsysorg/sglang:v0.5.9-rocm720-mi30x + +# Or use Docker (AMD MI355X) +docker pull lmsysorg/sglang:v0.5.9-rocm720-mi35x +``` + +For the full Docker setup and other installation methods, please refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and capabilities. + + + +### 3.2 Configuration Tips + +- Speculative decoding (MTP) can significantly reduce latency for interactive use cases. +- **Mamba Radix Cache**: Qwen3.5's hybrid Gated Delta Networks architecture supports two mamba scheduling strategies via `--mamba-scheduler-strategy`: + - **V1 (`no_buffer`)**: Default. No overlap scheduler, lower memory usage. Required for AMD MI GPUs. + - **V2 (`extra_buffer`)**: Enables overlap scheduling and branching point caching with `--mamba-scheduler-strategy extra_buffer --page-size 64`. Requires FLA kernel backend (NVIDIA GPUs only). Trades higher mamba state memory for better throughput. Strictly superior in non-KV-cache-bound scenarios; in KV-cache-bound cases, weigh the overlap scheduling benefit against reduced max concurrency. `--page-size` must satisfy `FLA_CHUNK_SIZE % page_size == 0` or `page_size % FLA_CHUNK_SIZE == 0` (`FLA_CHUNK_SIZE` is currently 64). 
+- Setting `--mem-fraction-static` is recommended for optimal memory utilization; adjust it based on your hardware and workload. +- Context length defaults to 262,144 tokens. If you encounter OOM errors, consider reducing it, but maintain at least 128K to preserve thinking capabilities. +- To speed up weight loading for this large model, add `--model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}'` to the launch command. +- **CUDA IPC Transport**: Add `SGLANG_USE_CUDA_IPC_TRANSPORT=1` as an environment variable to use CUDA IPC for transferring multimodal features, significantly improving TTFT (Time To First Token). Note: this consumes additional memory proportional to image size, so you may need to lower `--mem-fraction-static` or `--max-running-requests`. +- **Multimodal Attention Backend**: Use `--mm-attention-backend fa3` on H100/H200 for better vision performance, or `--mm-attention-backend fa4` on B200/B300. +- **B200 (FP8)**: Add `--enable-flashinfer-allreduce-fusion` for optimized throughput on Blackwell. +- For processing large images or videos, you may need to lower `--mem-fraction-static` to leave room for image feature tensors. +- Hardware requirements: + - **BF16**: ~397B parameters require ~800GB of GPU memory for weights. + - **H100 (80GB)** requires tp=16 (2 nodes) since each rank needs ~100GB at tp=8. + - **H200 (141GB)** runs with tp=8. + - **B200 (183GB)** runs with tp=8. + - **B300 (275GB)** runs with tp=4. + - **MI300X (192GB)** runs with tp=8. + - **MI325X (256GB)** runs with tp=4. + - **MI355X (288GB)** runs with tp=4. + - **FP8**: The FP8 quantized model requires ~400GB for weights, cutting memory in half. + - **H100 (80GB)** runs with tp=8. + - **H200 (141GB)** runs with tp=4. + - **B200 (183GB)** runs with tp=4. + - **B300 (275GB)** runs with tp=2. + - **MI300X (192GB)** runs with tp=4. + - **MI325X (256GB)** runs with tp=2. + - **MI355X (288GB)** runs with tp=2. 
+ - **FP4**: The FP4 quantized model requires ~250GB for weights, roughly 3x less than BF16 (~800GB). Only compatible with B200/B300 (Blackwell architecture). + - **B200 (183GB)** runs with tp=4. + - **B300 (275GB)** runs with tp=2. + +
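| Hardware | Memory | BF16 TP | FP8 TP | FP4 TP |
|---|---|---|---|---|
| H100 | 80GB | 16 | 8 | N/A |
| H200 | 141GB | 8 | 4 | N/A |
| B200 | 183GB | 8 | 4 | 4 |
| B300 | 275GB | 4 | 2 | 2 |
| MI300X | 192GB | 8 | 4 | N/A |
| MI325X | 256GB | 4 | 2 | N/A |
| MI355X | 288GB | 4 | 2 | N/A |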
+ + +**FP8 KV Cache**: `--kv-cache-dtype fp8_e4m3` quantizes the KV cache to FP8 at runtime. Since these FP8 model checkpoints do not include pre-calibrated KV cache scaling factors, SGLang defaults to a scale of 1.0, which may cause noticeable accuracy degradation on reasoning-heavy tasks. It is not included in the generated commands above; add it manually only if memory constraints require the trade-off. + + +## 4. Model Invocation + +**NVIDIA:** + +Deploy Qwen3.5-397B-A17B with the following command (H200, all features enabled): + +```shell Command +sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --speculative-algo NEXTN \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --mem-fraction-static 0.8 \ + --host 0.0.0.0 \ + --port 30000 +``` + +**AMD:** + +Deploy Qwen3.5-397B-A17B with the following command (MI300X/MI325X/MI355X): + +```shell Command +sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --mem-fraction-static 0.8 \ + --attention-backend triton \ + --host 0.0.0.0 \ + --port 30000 +``` +> **Note:** `--tp 8` works on all MI GPUs. On MI325X/MI355X, `--tp 4` is the minimum requirement. + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Vision Input + +Qwen3.5 supports image and video inputs as a unified vision-language model. 
Here is an example with an image: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="Qwen/Qwen3.5-397B-A17B", + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg" + } + }, + { + "type": "text", + "text": "Describe this image in detail." + } + ] + } + ], + max_tokens=2048, + stream=True +) + +thinking_started = False +has_thinking = False +has_answer = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user wants a detailed description of the provided image. + +1. **Identify the main components:** + * There is a central square. + * There are four circles of varying sizes. + +2. **Analyze the arrangement and relationships:** + * **The Square:** It's in the middle, oriented upright (sides are vertical and horizontal). It's drawn with a thin black line. + * **The Circles:** + * **Top Right:** A large circle. It overlaps the top right corner of the square. A significant portion of the circle is outside the square, but a chunk of it is inside. + * **Bottom Left:** A large circle, roughly the same size as the top right one. It overlaps the bottom left corner of the square. 
Again, a chunk is inside, most is outside. + * **Top Left:** A smaller circle. It is positioned near the top left corner of the square. It overlaps the corner slightly. + * **Bottom Right:** A smaller circle, roughly the same size as the top left one. It is positioned near the bottom right corner of the square. It overlaps the corner slightly. + +3. **Synthesize the description:** + * Start with a general overview: A geometric line drawing. + * Describe the central shape: A square. + * Describe the surrounding shapes: Four circles. + * Detail the positioning: + * Two large circles are diagonally opposite (bottom-left and top-right). + * Two smaller circles are diagonally opposite (top-left and bottom-right). + * Describe the interaction/overlap: The shapes are not just touching; they intersect. The lines cross over each other. The square acts as a central anchor. The circles seem to be "orbiting" or clustered around the corners of the square. + * Mention the style: Black lines on a white background. Minimalist. No shading or color. + +4. **Refine the details:** + * Let's look closer at the overlaps. + * **Top Left Circle:** It sits mostly above and to the left of the square's top-left corner. The bottom-right part of this small circle cuts into the square. + * **Top Right Circle:** This is the largest circle (or tied for largest). Its bottom-left quadrant overlaps significantly with the top-right quadrant of the square. + * **Bottom Left Circle:** This is also a large circle. Its top-right quadrant overlaps significantly with the bottom-left quadrant of the square. + * **Bottom Right Circle:** This is a small circle. Its top-left part overlaps with the bottom-right corner of the square. + +5. **Final Polish:** Structure the description logically. Start with the big picture, then break down each element's position relative to the center. + + * *Drafting the description:* The image shows a black and white line drawing of geometric shapes. In the center is a square. 
Surrounding it are four circles of two different sizes. + * *Adding detail:* There are two large circles and two small circles. + * *Positioning:* + * A large circle is in the bottom left. It overlaps the bottom-left corner of the square. + * A large circle is in the top right. It overlaps the top-right corner of the square. + * A small circle is in the top left. It overlaps the top-left corner of the square. + * A small circle is in the bottom right. It overlaps the bottom-right corner of the square. + * *Visual style:* The lines are thin and black. The background is white. The shapes are transparent (wireframe style), so you can see the lines of the square through the circles and vice versa where they overlap. + +6. **Final Review against the image:** Does this match? Yes. The description is accurate. The "transparency" point is important because it's a line drawing, not solid shapes. The lines cross. + + * Wait, let me look really closely at the "Top Left" circle. It actually looks like it's touching the top edge and left edge. It overlaps the corner. + * Let me look at the "Bottom Right" circle. It overlaps the bottom right corner. + * The arrangement creates a sort of diagonal symmetry. Large-Large on one diagonal, Small-Small on the other. + + Let's write the final output based on these observations. +=============== Content ================= + + +Based on the image provided, here is a detailed description: + +**Overview** +The image is a minimalist, black-and-white line drawing featuring geometric shapes. It consists of a central square surrounded by four circles of varying sizes. The lines are thin and black against a plain white background. The shapes are drawn in a "wireframe" style, meaning they are transparent outlines; where shapes overlap, the lines cross over each other rather than one blocking the other. + +**Detailed Breakdown** + +1. **The Central Square:** + * There is a single square positioned in the center of the composition. 
It is oriented upright with vertical and horizontal sides. + +2. **The Circles:** + * There are four circles arranged around the corners of the square. They appear in two distinct sizes: two large circles and two smaller circles. + * **Top Right:** A large circle is positioned at the top right. It overlaps significantly with the top-right corner of the square. A portion of the circle's interior is inside the square's boundary. + * **Bottom Left:** Another large circle (roughly the same size as the top right one) is positioned at the bottom left. It overlaps significantly with the bottom-left corner of the square. + * **Top Left:** A smaller circle is positioned near the top left corner. It overlaps slightly with the top-left corner of the square. + * **Bottom Right:** A smaller circle (roughly the same size as the top left one) is positioned near the bottom right corner. It overlaps slightly with the bottom-right corner of the square. + +**Composition** +The arrangement creates a diagonal symmetry. The two largest circles are on a diagonal from bottom-left to top-right, while the two smallest circles are on a diagonal from top-left to bottom-right. The intersecting lines create a complex web of curves and angles in the center of the image. +``` + +### 4.3 Advanced Usage + +#### 4.3.1 Reasoning Parser + +Qwen3.5 supports Thinking mode **by default**. Enable the reasoning parser during deployment to separate the thinking and content sections. The thinking process is returned via `reasoning_content` in the streaming response. + +To disable thinking and use Instruct mode, pass `chat_template_kwargs` at request time: + +- **Thinking mode** (default): The model performs step-by-step reasoning before answering. No extra parameters needed. +- **Instruct mode** (`{"enable_thinking": false}`): The model responds directly without a thinking process. + +**Example 1: Thinking Mode (Default)** + +Thinking mode is enabled by default. 
The model will reason step-by-step before answering, and the thinking process is returned via `reasoning_content`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Thinking mode is enabled by default, no extra parameters needed +response = client.chat.completions.create( + model="Qwen/Qwen3.5-397B-A17B", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +Thinking Process: + +1. **Analyze the Request:** The user wants to solve a math problem: "What is 15% of 240?" and expects a step-by-step explanation. + +2. **Identify the Core Operation:** The problem asks for a percentage of a number. The formula is: $\text{Percentage} \times \text{Number} = \text{Result}$. + * Percentage: $15\%$ + * Number: $240$ + +3. **Determine the Steps:** + * Step 1: Convert the percentage to a decimal or fraction. + * Step 2: Multiply the converted value by the number. + * Step 3: Calculate the final result. + * Alternative Step (Mental Math): Break down 15% into 10% + 5%. 
+ +4. **Draft the Explanation (Method 1: Decimal Conversion):** + * Convert $15\%$ to $0.15$. + * Multiply $0.15 \times 240$. + * Calculation: $15 \times 24 = 360$, adjust decimal places $\rightarrow 36.0$. + +5. **Draft the Explanation (Method 2: Fraction Conversion):** + * Convert $15\%$ to $\frac{15}{100}$ or $\frac{3}{20}$. + * Multiply $\frac{3}{20} \times 240$. + * $240 / 20 = 12$. + * $3 \times 12 = 36$. + +6. **Draft the Explanation (Method 3: Breaking it down - often easier for step-by-step):** + * Find 10% of 240. + * Find 5% of 240 (which is half of 10%). + * Add them together. + * 10% of 240 = 24. + * 5% of 240 = 12. + * $24 + 12 = 36$. + +7. **Select the Best Presentation:** Method 3 is very intuitive for "step-by-step" requests as it shows logical reasoning without heavy multiplication. However, the standard decimal multiplication (Method 1) is the most formal. I will present the standard method clearly, perhaps mentioning the breakdown method as a check or alternative way to understand it. Let's stick to the standard multiplication method for clarity but explain the conversion clearly. Actually, providing the breakdown method is often more helpful for understanding *why*. I'll provide the standard calculation method as the primary steps. + + * Step 1: Convert percentage to decimal. + * Step 2: Multiply. + * Step 3: State answer. + + Let's refine the calculation for the final output. + $15\% = 0.15$ + $0.15 \times 240$ + $= \frac{15}{100} \times 240$ + $= 15 \times 2.4$ + $= 36$ + + Or: + $10\% = 24$ + $5\% = 12$ + $15\% = 36$ + + I will write out the solution using the decimal/fraction multiplication method as it's universally applicable, but I'll make the steps very clear. + +8. **Final Review:** Does this answer the prompt accurately and step-by-step? Yes. + + * Step 1: Convert 15% to a decimal. + * Step 2: Multiply by 240. + * Step 3: Calculate. + + Let's write it out. + + Step 1: Convert the percentage to a decimal. 
+ $15\% = 0.15$ + + Step 2: Multiply the decimal by the number. + $0.15 \times 240$ + + Step 3: Perform the multiplication. + $15 \times 240 = 3600$ + Adjust decimal (2 places): $36.00$ + Result: 36. + + Wait, $0.15 \times 240$. + $15 \times 24 = 360$. + One zero from 240 cancels one decimal place? + $0.15 \times 240 = 15 \times 2.4 = 36$. + Or $\frac{15}{100} \times 240 = \frac{15 \times 240}{100} = \frac{3600}{100} = 36$. + This looks correct. + + I will also include the "10% + 5%" method as an alternative way to think about it, as it's often easier for mental math. + + Plan: + 1. Define the formula. + 2. Convert percent to decimal/fraction. + 3. Multiply. + 4. Provide the alternative mental math method for completeness. + 5. State final answer clearly. + + Actually, keeping it simple is better. Just one clear path. I'll use the fraction/decimal multiplication path. + + Step 1: Convert 15% to a decimal. + Step 2: Multiply 240 by 0.15. + Step 3: Final result. + + Let's go.cw +=============== Content ================= + + +Here is the step-by-step solution to find 15% of 240: + +**Step 1: Convert the percentage to a decimal.** +To convert a percentage to a decimal, divide by 100. +$$15\% = \frac{15}{100} = 0.15$$ + +**Step 2: Multiply the decimal by the number.** +Now, multiply 0.15 by 240. +$$0.15 \times 240$$ + +**Step 3: Calculate the result.** +You can think of this as: +$$15 \times 240 = 3600$$ +Since there are two decimal places in 0.15, move the decimal point in the result two places to the left: +$$3600 \rightarrow 36.00$$ + +**Alternative Method (Mental Math):** +* Find 10% of 240: $240 \div 10 = 24$ +* Find 5% of 240 (half of 10%): $24 \div 2 = 12$ +* Add them together (10% + 5% = 15%): $24 + 12 = 36$ + +**Answer:** +15% of 240 is **36**. 
+``` + +**Example 2: Instruct Mode (Thinking Off)** + +To disable thinking and get a direct response, pass `{"enable_thinking": false}` via `chat_template_kwargs`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Disable thinking mode via chat_template_kwargs +response = client.chat.completions.create( + model="Qwen/Qwen3.5-397B-A17B", + messages=[ + {"role": "user", "content": "What is 15% of 240?"} + ], + extra_body={"chat_template_kwargs": {"enable_thinking": False}}, + max_tokens=2048, + stream=True +) + +# In Instruct mode, the model responds directly without reasoning_content +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +To find 15% of 240, you can follow these steps: + +### Step-by-Step Deduction + +1. **Convert the percentage to a decimal +**: + To convert a percentage to a decimal, divide by 100. + $$15\% = \frac{15}{100} = 0.15$$ + +2. **Multiply the decimal by the number**: + Multiply $0.15$ by $240$. + $$0.15 \times 240$$ + + *Alternative Method (Mental Math)*: + - Find 10% of 240: $240 \times 0.10 = 24$ + - Find 5% of 240 (which is half of 10%): $24 / 2 = 12$ + - Add them together ($10\% + 5\% = 15\%$): $24 + 12 = 36$ + +3. **Calculation**: + $$240 \times 0.15 = 36$$ + +### Final Conclusion +15% of 240 is **36**. +``` + +#### 4.3.2 Tool Calling + +Qwen3.5 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`. 
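When a tool-calling request is streamed, the function name typically arrives in the first tool-call chunk and the JSON arguments arrive as string fragments in later chunks, so the fragments need to be concatenated before parsing. The sketch below shows one way to do that. It is a minimal illustration that uses plain dicts shaped like OpenAI-style streaming deltas; the helper name `accumulate_tool_calls` and the sample fragments are hypothetical, and with the OpenAI Python client you would read the same fields via attribute access on `delta.tool_calls`.

```python
import json
from collections import defaultdict


def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, keyed by their index."""
    calls = defaultdict(lambda: {"id": None, "name": None, "arguments": ""})
    for delta in deltas:
        for tc in delta.get("tool_calls") or []:
            call = calls[tc["index"]]
            if tc.get("id"):
                call["id"] = tc["id"]
            fn = tc.get("function") or {}
            if fn.get("name"):
                call["name"] = fn["name"]
            # Argument text arrives in pieces; append until the stream ends.
            call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


# Hypothetical fragments for illustration only (not captured from a real server).
fragments = [
    {"tool_calls": [{"index": 0, "id": "call_123",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"name": None, "arguments": "{"}}]},
    {"tool_calls": [{"index": 0, "function": {"name": None,
                                              "arguments": "\"location\": \"Beijing\""}}]},
    {"tool_calls": [{"index": 0, "function": {"name": None, "arguments": "}"}}]},
]

merged = accumulate_tool_calls(fragments)
args = json.loads(merged[0]["arguments"])  # parse only after all fragments are joined
print(merged[0]["name"], args)
```

Keying on `index` rather than on the name keeps parallel tool calls separate, since only the first fragment of each call carries the name and id.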
+ +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="Qwen/Qwen3.5-397B-A17B", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +The user is asking about the weather in 
Beijing. I have access to a get_weather function that can provide current weather information for a location. Let me check the parameters: + +- location (required): "Beijing" - this is provided by the user +- unit (optional): The user didn't specify a temperature unit, so I won't include this optional parameter + +I should call the get_weather function with Beijing as the location. + + +=============== Content ================= +Tool Call: get_weather + Arguments: +Tool Call: None + Arguments: { +Tool Call: None + Arguments: "location": "Beijing" +Tool Call: None + Arguments: } +``` + +## 5. Benchmark + +### 5.1 Accuracy Benchmark + +#### 5.1.1 GSM8K Benchmark + +- Benchmark Command +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --port 30000 +``` + +- Test Result +```text Output +100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:31<00:00, 6.43it/s] +Accuracy: 0.975 +Invalid: 0.005 +Latency: 31.784 s +Output throughput: 998.166 token/s +``` + +#### 5.1.2 MMMU Benchmark + +- Benchmark Command +```bash Command +python3 benchmark/mmmu/bench_sglang.py --concurrency 128 --port 30000 --max-new-tokens 512 +``` + +- Test Result +```text Output +{'Accounting': {'acc': 1.0, 'num': 3}, + 'Agriculture': {'acc': 1.0, 'num': 4}, + 'Art': {'acc': 1.0, 'num': 9}, + 'Art_Theory': {'acc': 1.0, 'num': 5}, + 'Basic_Medical_Science': {'acc': 1.0, 'num': 2}, + 'Biology': {'acc': 1.0, 'num': 1}, + 'Chemistry': {'acc': 1.0, 'num': 1}, + 'Computer_Science': {'acc': 1.0, 'num': 1}, + 'Design': {'acc': 0.909, 'num': 11}, + 'Diagnostics_and_Laboratory_Medicine': {'acc': 1.0, 'num': 1}, + 'Economics': {'acc': 1.0, 'num': 5}, + 'Finance': {'acc': 1.0, 'num': 2}, + 'Geography': {'acc': 1.0, 'num': 3}, + 'History': {'acc': 1.0, 'num': 3}, + 'Literature': {'acc': 0.938, 'num': 16}, + 'Manage': {'acc': 1.0, 'num': 2}, + 'Marketing': {'acc': 1.0, 'num': 5}, + 'Math': {'acc': 1.0, 'num': 1}, + 'Overall': {'acc': 
0.978, 'num': 91}, + 'Overall-Art and Design': {'acc': 0.96, 'num': 25}, + 'Overall-Business': {'acc': 1.0, 'num': 17}, + 'Overall-Health and Medicine': {'acc': 1.0, 'num': 7}, + 'Overall-Humanities and Social Science': {'acc': 0.966, 'num': 29}, + 'Overall-Science': {'acc': 1.0, 'num': 8}, + 'Overall-Tech and Engineering': {'acc': 1.0, 'num': 5}, + 'Pharmacy': {'acc': 1.0, 'num': 2}, + 'Physics': {'acc': 1.0, 'num': 2}, + 'Psychology': {'acc': 1.0, 'num': 4}, + 'Public_Health': {'acc': 1.0, 'num': 2}, + 'Sociology': {'acc': 1.0, 'num': 6}} +eval out saved to ./val_sglang.json +Overall accuracy: 0.978 +``` + +### 5.2 Speed Benchmark + +**Test Environment:** + +- Hardware: H200 (8x) +- Model: Qwen3.5-397B-A17B +- Tensor Parallelism: 8 +- SGLang Version: main branch + +Server Launch Command: +```bash Command +SGLANG_USE_CUDA_IPC_TRANSPORT=1 python -m sglang.launch_server \ + --model Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --speculative-algo NEXTN \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --mem-fraction-static 0.8 \ + --host 0.0.0.0 \ + --port 30000 +``` + +#### 5.2.1 Latency Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3.5-397B-A17B \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 18.94 +Total input tokens: 6101 +Total input text tokens: 6101 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4211 +Request throughput (req/s): 0.53 +Input token throughput (tok/s): 322.16 +Output token throughput (tok/s): 222.84 +Peak output token throughput (tok/s): 289.00 +Peak concurrent 
requests: 3 +Total token throughput (tok/s): 545.00 +Concurrency: 1.00 +Accept length: 3.12 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1892.35 +Median E2E Latency (ms): 1410.85 +P90 E2E Latency (ms): 3749.34 +P99 E2E Latency (ms): 4216.52 +---------------Time to First Token---------------- +Mean TTFT (ms): 190.40 +Median TTFT (ms): 208.46 +P99 TTFT (ms): 261.27 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 3.96 +Median TPOT (ms): 3.79 +P99 TPOT (ms): 4.96 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 4.04 +Median ITL (ms): 3.15 +P95 ITL (ms): 6.65 +P99 ITL (ms): 12.60 +Max ITL (ms): 58.03 +================================================== +``` + +#### 5.2.2 Throughput Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3.5-397B-A17B \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 1000 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 283.04 +Total input tokens: 502493 +Total input text tokens: 502493 +Total generated tokens: 500251 +Total generated tokens (retokenized): 498222 +Request throughput (req/s): 3.53 +Input token throughput (tok/s): 1775.37 +Output token throughput (tok/s): 1767.45 +Peak output token throughput (tok/s): 3630.00 +Peak concurrent requests: 108 +Total token throughput (tok/s): 3542.82 +Concurrency: 96.71 +Accept length: 3.31 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 27372.05 +Median E2E Latency (ms): 26660.21 +P90 E2E Latency (ms): 39951.91 +P99 E2E Latency (ms): 48405.51 +---------------Time to First Token---------------- +Mean TTFT (ms): 14247.21 +Median TTFT (ms): 14932.44 +P99 TTFT (ms): 20998.45 +-----Time per Output Token 
(excl. 1st token)------ +Mean TPOT (ms): 26.16 +Median TPOT (ms): 26.13 +P99 TPOT (ms): 41.33 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 26.29 +Median ITL (ms): 11.38 +P95 ITL (ms): 72.10 +P99 ITL (ms): 149.57 +Max ITL (ms): 1220.68 +================================================== +``` + +### 5.3 Vision Speed Benchmark + +We use SGLang's built-in benchmarking tool to conduct performance evaluation with random images. Each request has 128 input tokens, two 720p images, and 1024 output tokens. + +#### 5.3.1 Latency Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3.5-397B-A17B \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 \ + --request-rate inf +``` + +```text Output +TODO +``` + +#### 5.3.2 Throughput Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 30000 \ + --model Qwen/Qwen3.5-397B-A17B \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 \ + --request-rate inf +``` + +```text Output +TODO +``` diff --git a/docs_new/cookbook/autoregressive/Qwen/Qwen3.6.mdx b/docs_new/cookbook/autoregressive/Qwen/Qwen3.6.mdx new file mode 100644 index 000000000000..3a926d395bf4 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Qwen/Qwen3.6.mdx @@ -0,0 +1,491 @@ +--- +title: Qwen3.6 +metatags: + description: "Deploy Qwen3.6 with SGLang - open-weight multimodal series with a 35B MoE (3B active) variant and a 27B dense variant, hybrid reasoning, tool calling, MTP, and long-context support." +tag: NEW +--- + +import { Qwen36Deployment } from '/src/snippets/autoregressive/qwen36-deployment.jsx'; + +## 1. 
Model Introduction + +The Qwen3.6 series is developed by Alibaba. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, delivering substantial upgrades in agentic coding and thinking preservation. Two size/sparsity variants are released: + +- [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — **Sparse MoE** (35B total, 3B active) on a Gated Delta Networks backbone. +- [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) — **Dense** hybrid GDN; smaller weights footprint, single-GPU friendly. + +Both variants share the same hybrid reasoning, tool-calling, and multimodal interface and natively handle context lengths of up to 262,144 tokens, extensible to over 1M tokens. + +**Key Features:** + +- **Agentic Coding**: Handles frontend workflows and repository-level reasoning with greater fluency and precision +- **Thinking Preservation**: New option to retain reasoning context from historical messages, streamlining iterative development +- **Efficient Hybrid Architecture**: Gated Delta Networks backbone; sparse MoE (35B / 3B active) or dense 27B variant +- **Hybrid Reasoning**: Thinking mode enabled by default with step-by-step reasoning, can be disabled for direct responses +- **Tool Calling**: Built-in tool calling support with `qwen3_coder` parser +- **Multi-Token Prediction (MTP)**: Speculative decoding support for lower latency; both MoE and Dense variants ship `mtp.safetensors` +- **Multimodal**: Unified vision-language model supporting text, image, and video inputs + +**Available Models:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Architecture | Weights |
| --- | --- | --- |
| Qwen3.6-35B-A3B (BF16) | MoE 35B / 3B active | [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) |
| Qwen3.6-35B-A3B (FP8) | MoE 35B / 3B active | [Qwen/Qwen3.6-35B-A3B-FP8](https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8) |
| Qwen3.6-27B (BF16) | Dense 27B | [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) |
| Qwen3.6-27B (FP8) | Dense 27B | [Qwen/Qwen3.6-27B-FP8](https://huggingface.co/Qwen/Qwen3.6-27B-FP8) |
+ +**License:** Apache 2.0 + +## 2. SGLang Installation + +SGLang `>=0.5.10` is required for Qwen3.6. You can install from PyPI, from source, or use a Docker image: + +```bash Command +# Install from PyPI +uv pip install sglang + +# Or install from source +uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python' + +# Or use Docker (NVIDIA GPUs) +docker pull lmsysorg/sglang:latest +``` + +For the full Docker setup and other installation methods, please refer to the [official SGLang installation guide](../../../docs/get-started/install). + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and capabilities. + + + + +### 3.2 Configuration Tips + +- Speculative decoding (MTP) can significantly reduce latency for interactive use cases. +- **Mamba Radix Cache**: Qwen3.6's hybrid Gated Delta Networks architecture supports two mamba scheduling strategies via `--mamba-scheduler-strategy`: + - **V1 (`no_buffer`)**: Default. No overlap scheduler, lower memory usage. + - **V2 (`extra_buffer`)**: Enables overlap scheduling and branching point caching with `--mamba-scheduler-strategy extra_buffer --page-size 64`. Requires the FLA kernel backend (NVIDIA GPUs only). Trades higher mamba state memory for better throughput. +- Setting `--mem-fraction-static` is recommended for optimal memory utilization; adjust it based on your hardware and workload. +- Context length defaults to 262,144 tokens. If you encounter OOM errors, consider reducing it, but maintain at least 128K to preserve thinking capabilities.
+- **CUDA IPC Transport**: Add `SGLANG_USE_CUDA_IPC_TRANSPORT=1` as an environment variable to use CUDA IPC for transferring multimodal features, significantly improving TTFT (Time To First Token). Note: this consumes additional memory proportional to image size, so you may need to lower `--mem-fraction-static` or `--max-running-requests`. +- **Multimodal Attention Backend**: Use `--mm-attention-backend fa3` on H100/H200 for better vision performance, or `--mm-attention-backend fa4` on B200. +- For processing large images or videos, you may need to lower `--mem-fraction-static` to leave room for image feature tensors. +- Hardware requirements: + - **35B-A3B BF16**: ~70GB for weights. TP=1 fits on all supported hardware. + - **35B-A3B FP8**: ~35GB for weights. TP=1 fits on all supported hardware. + - **27B BF16**: ~54GB for weights. TP=1 fits on all supported hardware. + - **27B FP8**: ~27GB for weights. TP=1 fits on all supported hardware. + +All Qwen3.6 variants (MoE 35B-A3B and Dense 27B) fit on a single supported GPU at both precisions: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hardware | Memory | BF16 TP | FP8 TP |
| --- | --- | --- | --- |
| H100 | 80GB | 1 | 1 |
| H200 | 141GB | 1 | 1 |
| B200 | 183GB | 1 | 1 |
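
As a rough cross-check of the weight sizes and TP values listed above, weight memory is approximately the total parameter count times the bytes per parameter (2 for BF16, 1 for FP8). The sketch below is illustrative only: the helper names are hypothetical and not part of SGLang, and the estimate ignores KV cache, mamba state, activations, and multimodal feature tensors, so it only indicates whether the weights alone fit at a given TP.

```python
# Back-of-the-envelope weight footprint: total parameters x bytes per parameter.
# Helper names are illustrative assumptions, not SGLang APIs.
BYTES_PER_PARAM = {"BF16": 2, "FP8": 1}

# Total parameters in billions. For the MoE variant all 35B parameters are
# stored, even though only about 3B are active per token.
MODEL_PARAMS_B = {"Qwen3.6-35B-A3B": 35, "Qwen3.6-27B": 27}

# Per-GPU memory in GB, taken from the hardware table above.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 183}


def weight_gb(params_billion, precision):
    """Approximate weight size in GB for a given precision."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_tp_for_weights(params_billion, precision, gpu):
    """Smallest power-of-two TP at which the per-GPU weight shard fits.

    Weights only: runtime buffers and caches need additional headroom.
    """
    tp = 1
    while weight_gb(params_billion, precision) / tp > GPU_MEMORY_GB[gpu]:
        tp *= 2
    return tp


for model, params in MODEL_PARAMS_B.items():
    for precision in BYTES_PER_PARAM:
        size = weight_gb(params, precision)
        tps = {gpu: min_tp_for_weights(params, precision, gpu) for gpu in GPU_MEMORY_GB}
        print(f"{model} {precision}: ~{size} GB weights, min TP {tps}")
```

Because the estimate covers weights only, a configuration that barely fits (for example 35B-A3B BF16 on an 80GB H100) leaves little room for caches, which is where lowering `--mem-fraction-static` or the context length comes in.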
+ + +## 4. Model Invocation + +Deploy Qwen3.6 with the following command (H200, all features enabled). Swap `--model-path` to `Qwen/Qwen3.6-27B-FP8` for the dense 27B variant — all other flags carry over: + +```shell Command +SGLANG_ENABLE_SPEC_V2=1 sglang serve \ + --model-path Qwen/Qwen3.6-35B-A3B-FP8 \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --mem-fraction-static 0.8 \ + --host 0.0.0.0 \ + --port 30000 +``` + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Vision Input + +Qwen3.6 supports image and video inputs as a unified vision-language model. + +**Image Input Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="Qwen/Qwen3.6-35B-A3B-FP8", + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg" + } + }, + { + "type": "text", + "text": "Describe this image in detail." 
+ } + ] + } + ], + max_tokens=2048, + stream=True +) + +thinking_started = False +has_thinking = False +has_answer = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Video Input Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="Qwen/Qwen3.6-35B-A3B-FP8", + messages=[ + { + "role": "user", + "content": [ + { + "type": "video_url", + "video_url": { + "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4" + } + }, + { + "type": "text", + "text": "Describe what happens in this video." 
+ } + ] + } + ], + max_tokens=2048, + stream=True +) + +thinking_started = False +has_thinking = False +has_answer = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +### 4.3 Advanced Usage + +#### 4.3.1 Reasoning Parser + +Qwen3.6 supports Thinking mode **by default**. Enable the reasoning parser during deployment to separate the thinking and content sections. The thinking process is returned via `reasoning_content` in the streaming response. + +To disable thinking and use Instruct mode, pass `chat_template_kwargs` at request time: + +- **Thinking mode** (default): The model performs step-by-step reasoning before answering. No extra parameters needed. +- **Instruct mode** (`{"enable_thinking": false}`): The model responds directly without a thinking process. 
+ +**Example 1: Thinking Mode (Default)** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="Qwen/Qwen3.6-35B-A3B-FP8", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + max_tokens=2048, + stream=True +) + +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Example 2: Instruct Mode (Thinking Off)** + +To disable thinking and get a direct response, pass `{"enable_thinking": false}` via `chat_template_kwargs`: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="Qwen/Qwen3.6-35B-A3B-FP8", + messages=[ + {"role": "user", "content": "What is 15% of 240?"} + ], + extra_body={"chat_template_kwargs": {"enable_thinking": False}}, + max_tokens=2048, + stream=True +) + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` + +#### 4.3.2 Thinking Preservation + +Qwen3.6 has been trained to preserve and leverage thinking traces from historical messages. 
Enable this for agent scenarios where maintaining full reasoning context improves decision consistency: + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="Qwen/Qwen3.6-35B-A3B-FP8", + messages=[ + {"role": "user", "content": "Help me plan a web app architecture."} + ], + extra_body={"chat_template_kwargs": {"preserve_thinking": True}}, + max_tokens=2048, + stream=True +) + +thinking_started = False +has_thinking = False +has_answer = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if delta.content: + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +#### 4.3.3 Tool Calling + +Qwen3.6 supports tool calling capabilities. Enable the tool call parser during deployment. 
+ +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +response = client.chat.completions.create( + model="Qwen/Qwen3.6-35B-A3B-FP8", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + stream=True +) + +thinking_started = False +has_thinking = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + if hasattr(delta, 'tool_calls') and delta.tool_calls: + if has_thinking and thinking_started: + print("\n=============== Content =================", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + if tool_call.function: + print(f"Tool Call: {tool_call.function.name}") + print(f" Arguments: {tool_call.function.arguments}") + + if delta.content: + print(delta.content, end="", flush=True) + +print() +``` diff --git a/docs_new/cookbook/autoregressive/Qwen/Qwen3.mdx b/docs_new/cookbook/autoregressive/Qwen/Qwen3.mdx new file mode 100644 index 000000000000..8c4c140dd63b --- /dev/null +++ b/docs_new/cookbook/autoregressive/Qwen/Qwen3.mdx @@ -0,0 +1,884 @@ +--- +title: Qwen3 +metatags: + description: "Deploy Qwen3 series models with SGLang - featuring advanced reasoning, 256K context, and 
flexible Dense/MoE architectures for edge to cloud." +--- + + +## 1. Model Introduction + +[Qwen3 series](https://github.com/QwenLM/Qwen3) models are the most powerful language models in the Qwen series to date, featuring advanced capabilities in reasoning, multilingual understanding, and agentic applications. + +This generation delivers comprehensive upgrades across the board: + +- **Stronger general intelligence**: Significant improvements in instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage. +- **Broader multilingual knowledge**: Substantial gains in long-tail knowledge coverage across multiple languages. +- **More helpful & aligned responses**: Markedly better alignment with user preferences in subjective and open-ended tasks, enabling higher-quality, more useful text generation. +- **Extended context length**: Enhanced capabilities in understanding and reasoning over 256K-token long contexts. +- **Stronger agent interaction capabilities**: Improved tool use and search-based agent performance. +- **Flexible deployment options**: Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions. + +For more details, please refer to the [official Qwen3 GitHub Repository](https://github.com/QwenLM/Qwen3). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +The Qwen3 series offers models in various sizes and architectures, optimized for different hardware platforms including NVIDIA and AMD GPUs.
The recommended launch configurations vary by hardware and model size. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities. + +import { Qwen3Deployment } from "/src/snippets/autoregressive/qwen3-deployment.jsx"; + + + +### 3.2 Configuration Tips + +- **Memory Management**: Set a lower `--context-length` to conserve memory. A value of `128000` is sufficient for most scenarios, down from the default 262K. +- **Expert Parallelism**: SGLang supports Expert Parallelism (EP) via `--ep`, allowing experts in MoE models to be deployed on separate GPUs for better throughput. For quantized models, `--ep` must be set so that `(moe_intermediate_size / moe_tp_size) % weight_block_size_n == 0`, where `moe_tp_size` equals `tp_size` divided by `ep_size`. Note that EP may perform worse in low-concurrency scenarios due to additional communication overhead. Check out [Expert Parallelism Deployment](../../../docs/advanced_features/expert_parallelism) for more details. +- **Kernel Tuning**: For MoE Triton kernel tuning on your specific hardware, refer to [fused_moe_triton](https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton). +- **Speculative Decoding**: Use speculative decoding for latency-sensitive scenarios. + - `--speculative-algorithm EAGLE3`: Speculative decoding algorithm + - `--speculative-num-steps 3`: Number of speculative verification rounds + - `--speculative-eagle-topk 1`: Top-k sampling for draft tokens + - `--speculative-num-draft-tokens 4`: Number of draft tokens per step + - `--speculative-draft-model-path`: The path of the draft model weights.
This can be a local folder or a Hugging Face repo ID such as [`lmsys/SGLang-EAGLE3-Qwen3-235B-A22B-Instruct-2507-SpecForge-Meituan`](https://huggingface.co/lmsys/SGLang-EAGLE3-Qwen3-235B-A22B-Instruct-2507-SpecForge-Meituan). + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +Qwen3-235B-A22B supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-235B-A22B-Thinking-2507 \ + --reasoning-parser qwen3 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="Qwen/Qwen3-235B-A22B-Thinking-2507", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== 
Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= + +Okay, so I need to figure out what 15% of 240 is. Hmm, percentages can sometimes trip me up, but I think I remember some basics. Let me start by recalling that "percent" means "per hundred," so 15% is the same as 15 per 100, or 15/100. So, maybe I can convert 15% into a decimal first? Yeah, I think that's a common method. +... +So conclusion: The answer is 36. + +=============== Content ================= + + +To determine what 15% of 240 is, we can follow a systematic approach that involves converting the percentage to a decimal and then performing multiplication. Here's a step-by-step breakdown of the solution: + +.... + +### Final Answer: + +$$ +\boxed{36} +$$ + +Thus, 15% of 240 is **36**. +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.2 Tool Calling + +Qwen3 supports tool calling capabilities.
Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-235B-A22B-Thinking-2507 \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen25 \ + --tp 8 \ + --host 0.0.0.0 \ + --port 8000 +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:8000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="Qwen/Qwen3-235B-A22B-Thinking-2507", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = tool_call.index + if index not in 
tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"🔧 Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= + +Okay, the user is asking for the weather in Beijing. Let me check the tools available. There's a function called get_weather that takes location and unit parameters. The location is required, so I need to specify Beijing as the location. The unit is optional and can be either celsius or fahrenheit. Since the user didn't specify the unit, maybe I should default to a common one. In China, they usually use celsius, so I'll set unit to celsius. I'll call the get_weather function with location: Beijing and unit: celsius. That should get the current weather for them. + + + +=============== Content ================= + +🔧 Tool Call: get_weather + Arguments: {"location": "Beijing", "unit": "celsius"} +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="Qwen/Qwen3-235B-A22B-Thinking-2507", + messages=messages, + temperature=0.7 +) + +print(final_response.choices[0].message.content) +# Output: "The current weather in Beijing is **22°C** and **sunny**. A perfect day to enjoy outdoor activities! 🌞" +``` + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (8x) +- Model: Qwen3-235B-A22B-Instruct-2507 +- Tensor Parallelism: 8 +- sglang version: 0.5.6 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. 
+ +#### 5.1.1 Standard Scenario Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --tp 8 +``` + +##### 5.1.1.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 43.56 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 4210 +Total generated tokens (retokenized): 4206 +Request throughput (req/s): 0.23 +Input token throughput (tok/s): 140.07 +Output token throughput (tok/s): 96.65 +Peak output token throughput (tok/s): 100.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 236.72 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4353.63 +Median E2E Latency (ms): 3475.79 +---------------Time to First Token---------------- +Mean TTFT (ms): 99.03 +Median TTFT (ms): 92.18 +P99 TTFT (ms): 166.05 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.12 +Median TPOT (ms): 10.12 +P99 TPOT (ms): 10.15 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 10.13 +Median ITL (ms): 10.12 +P95 ITL (ms): 10.49 +P99 ITL (ms): 10.70 +Max ITL (ms): 13.45 +================================================== +``` + +##### 5.1.1.2 Medium Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 48.95 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 40725 +Total generated tokens (retokenized): 40716 +Request throughput (req/s): 1.63 +Input token throughput (tok/s): 810.44 +Output token throughput (tok/s): 832.04 +Peak output token throughput (tok/s): 1151.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 1642.48 +Concurrency: 13.61 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8326.72 +Median E2E Latency (ms): 8827.86 +---------------Time to First Token---------------- +Mean TTFT (ms): 215.70 +Median TTFT (ms): 88.82 +P99 TTFT (ms): 727.08 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 16.36 +Median TPOT (ms): 16.12 +P99 TPOT (ms): 24.09 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 15.96 +Median ITL (ms): 14.52 +P95 ITL (ms): 16.04 +P99 ITL (ms): 67.69 +Max ITL (ms): 457.52 +================================================== +``` + +##### 5.1.1.3 High Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 92.07 +Total input tokens: 249831 +Total input text tokens: 249831 +Total input vision tokens: 0 +Total generated tokens: 252162 +Total generated tokens (retokenized): 251124 +Request throughput (req/s): 5.43 +Input token throughput (tok/s): 2713.46 +Output token throughput (tok/s): 2738.78 +Peak output token throughput (tok/s): 4400.00 +Peak concurrent requests: 110 +Total token throughput (tok/s): 5452.24 +Concurrency: 90.50 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 16665.09 +Median E2E Latency (ms): 16060.10 +---------------Time to First Token---------------- +Mean TTFT (ms): 260.55 +Median TTFT (ms): 122.68 +P99 TTFT (ms): 863.11 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 32.94 +Median TPOT (ms): 34.04 +P99 TPOT (ms): 41.19 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 32.59 +Median ITL (ms): 23.54 +P95 ITL (ms): 69.79 +P99 ITL (ms): 119.09 +Max ITL (ms): 577.70 +================================================== +``` + +#### 5.1.2 Reasoning Scenario Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --tp 8 +``` + +##### 5.1.2.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 457.45 +Total input tokens: 6101 +Total input text tokens: 6101 +Total input vision tokens: 0 +Total generated tokens: 44452 +Total generated tokens (retokenized): 44059 +Request throughput (req/s): 0.02 +Input token throughput (tok/s): 13.34 +Output token throughput (tok/s): 97.17 +Peak output token throughput (tok/s): 100.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 110.51 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 45742.42 +Median E2E Latency (ms): 49266.87 +---------------Time to First Token---------------- +Mean TTFT (ms): 110.60 +Median TTFT (ms): 109.36 +P99 TTFT (ms): 167.43 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.23 +Median TPOT (ms): 10.24 +P99 TPOT (ms): 10.32 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 10.27 +Median ITL (ms): 10.26 +P95 ITL (ms): 10.71 +P99 ITL (ms): 10.97 +Max ITL (ms): 15.79 +================================================== +``` + +##### 5.1.2.2 Medium Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 340.17 +Total input tokens: 39668 +Total input text tokens: 39668 +Total input vision tokens: 0 +Total generated tokens: 318226 +Total generated tokens (retokenized): 318104 +Request throughput (req/s): 0.24 +Input token throughput (tok/s): 116.61 +Output token throughput (tok/s): 935.49 +Peak output token throughput (tok/s): 1120.00 +Peak concurrent requests: 19 +Total token throughput (tok/s): 1052.10 +Concurrency: 13.85 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 58885.30 +Median E2E Latency (ms): 59238.70 +---------------Time to First Token---------------- +Mean TTFT (ms): 169.71 +Median TTFT (ms): 101.61 +P99 TTFT (ms): 455.71 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 14.82 +Median TPOT (ms): 14.91 +P99 TPOT (ms): 15.20 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 14.76 +Median ITL (ms): 14.63 +P95 ITL (ms): 15.46 +P99 ITL (ms): 16.62 +Max ITL (ms): 104.94 +================================================== +``` + +##### 5.1.2.3 High Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 8000 \ + --num-prompts 320 \ + --max-concurrency 64 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 544.83 +Total input tokens: 158939 +Total input text tokens: 158939 +Total input vision tokens: 0 +Total generated tokens: 1300705 +Total generated tokens (retokenized): 1293015 +Request throughput (req/s): 0.59 +Input token throughput (tok/s): 291.72 +Output token throughput (tok/s): 2387.34 +Peak output token throughput (tok/s): 3008.00 +Peak concurrent requests: 68 +Total token throughput (tok/s): 2679.06 +Concurrency: 56.35 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 95937.70 +Median E2E Latency (ms): 99362.32 +---------------Time to First Token---------------- +Mean TTFT (ms): 265.03 +Median TTFT (ms): 129.11 +P99 TTFT (ms): 823.85 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 23.66 +Median TPOT (ms): 24.07 +P99 TPOT (ms): 24.97 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 23.54 +Median ITL (ms): 23.07 +P95 ITL (ms): 25.92 +P99 ITL (ms): 63.87 +Max ITL (ms): 408.30 +================================================== +``` + +#### 5.1.3 Summarization Scenario Benchmark + +##### 5.1.3.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 44.82 +Total input tokens: 41941 +Total input text tokens: 41941 +Total input vision tokens: 0 +Total generated tokens: 4210 +Total generated tokens (retokenized): 4210 +Request throughput (req/s): 0.22 +Input token throughput (tok/s): 935.86 +Output token throughput (tok/s): 93.94 +Peak output token throughput (tok/s): 99.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 1029.80 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 4479.60 +Median E2E Latency (ms): 3622.99 +---------------Time to First Token---------------- +Mean TTFT (ms): 139.90 +Median TTFT (ms): 114.85 +P99 TTFT (ms): 225.17 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 10.31 +Median TPOT (ms): 10.33 +P99 TPOT (ms): 10.51 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 10.33 +Median ITL (ms): 10.33 +P95 ITL (ms): 10.73 +P99 ITL (ms): 10.93 +Max ITL (ms): 14.48 +================================================== +``` + +##### 5.1.3.2 Medium Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 50.68 +Total input tokens: 300020 +Total input text tokens: 300020 +Total input vision tokens: 0 +Total generated tokens: 41589 +Total generated tokens (retokenized): 41578 +Request throughput (req/s): 1.58 +Input token throughput (tok/s): 5920.41 +Output token throughput (tok/s): 820.69 +Peak output token throughput (tok/s): 1200.00 +Peak concurrent requests: 20 +Total token throughput (tok/s): 6741.10 +Concurrency: 13.90 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8805.54 +Median E2E Latency (ms): 9368.79 +---------------Time to First Token---------------- +Mean TTFT (ms): 284.29 +Median TTFT (ms): 168.48 +P99 TTFT (ms): 1027.21 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 16.81 +Median TPOT (ms): 16.66 +P99 TPOT (ms): 27.18 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 16.42 +Median ITL (ms): 13.68 +P95 ITL (ms): 17.23 +P99 ITL (ms): 90.75 +Max ITL (ms): 574.64 +================================================== +``` + +##### 5.1.3.3 High Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --dataset-name random \ + --random-input-len 8000 \ + --random-output-len 1000 \ + --num-prompts 320 \ + --max-concurrency 64 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 64 +Successful requests: 320 +Benchmark duration (s): 94.77 +Total input tokens: 1273893 +Total input text tokens: 1273893 +Total input vision tokens: 0 +Total generated tokens: 169680 +Total generated tokens (retokenized): 169640 +Request throughput (req/s): 3.38 +Input token throughput (tok/s): 13441.86 +Output token throughput (tok/s): 1790.43 +Peak output token throughput (tok/s): 2687.00 +Peak concurrent requests: 70 +Total token throughput (tok/s): 15232.28 +Concurrency: 58.63 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 17364.14 +Median E2E Latency (ms): 17495.95 +---------------Time to First Token---------------- +Mean TTFT (ms): 238.22 +Median TTFT (ms): 203.27 +P99 TTFT (ms): 510.48 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 32.50 +Median TPOT (ms): 34.27 +P99 TPOT (ms): 40.59 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 32.36 +Median ITL (ms): 22.50 +P95 ITL (ms): 97.81 +P99 ITL (ms): 151.55 +Max ITL (ms): 352.79 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` + +- **Results**: + + - Qwen/Qwen3-235B-A22B-Instruct-2507 + ```text Output + Accuracy: 0.945 + Invalid: 0.000 + Latency: 11.980 s + Output throughput: 2358.105 token/s + ``` diff --git a/docs_new/cookbook/autoregressive/StepFun/Step3-VL-10B.mdx b/docs_new/cookbook/autoregressive/StepFun/Step3-VL-10B.mdx new file mode 100644 index 000000000000..845fb94f5848 --- /dev/null +++ b/docs_new/cookbook/autoregressive/StepFun/Step3-VL-10B.mdx @@ -0,0 +1,694 @@ +--- +title: Step3-VL-10B +metatags: + description: "Deploy Step3-VL-10B multimodal model with SGLang - compact 10B dense model with frontier-level vision understanding, complex reasoning, and tool calling capabilities." +--- + +import { Step3VL10BDeployment } from '/src/snippets/autoregressive/step-3vl-10b-deployment.jsx'; + +## 1. Model Introduction + +[Step3-VL-10B](https://huggingface.co/stepfun-ai/Step3-VL-10B) is a lightweight open-source multimodal model developed by StepFun, designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact 10B parameter footprint, Step3-VL-10B excels in visual perception, complex reasoning, and human-centric alignment. + +Key highlights of Step3-VL-10B include: + +- **STEM Reasoning**: Achieves 94.43% on AIME 2025 and 75.95% on MathVision (with PaCoRe), demonstrating exceptional complex reasoning capabilities that outperform models 10×–20× larger. 
+- **Visual Perception**: Records 92.05% on MMBench and 80.11% on MMMU, establishing strong general visual understanding and multimodal reasoning. +- **GUI & OCR**: Delivers state-of-the-art performance on ScreenSpot-V2 (92.61%), ScreenSpot-Pro (51.55%), and OCRBench (86.75%), optimized for agentic and document understanding tasks. +- **Spatial Understanding**: Demonstrates emergent spatial awareness with 66.79% on BLINK and 57.21% on All-Angles-Bench, establishing strong potential for embodied intelligence applications. + +For more details, please refer to the [Step3-VL-10B model card on Hugging Face](https://huggingface.co/stepfun-ai/Step3-VL-10B). + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +Step3-VL-10B is a compact 10B dense model that can run on a single GPU. Recommended starting configurations vary depending on hardware. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and quantization method. SGLang supports serving Step3-VL-10B on NVIDIA B200, H200, H100, and AMD MI355X, MI325X, MI300X GPUs. + + + +### 3.2 Configuration Tips + +- **Single GPU Deployment**: Step3-VL-10B fits comfortably on a single GPU with BF16 precision; no tensor parallelism is required. +- **Memory Management**: Set a lower `--context-length` to conserve memory if needed. A value of `32768` is sufficient for most scenarios. +- **FP8 Quantization**: Use FP8 quantization to further reduce memory usage while maintaining quality. + +## 4.
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) +- [SGLang OpenAI Vision API Guide](../../../docs/basic_usage/openai_api_vision) + +### 4.2 Advanced Usage + +#### 4.2.1 Multi-Modal Inputs + +Step3-VL-10B supports image inputs. Here's a basic example with image input: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png" + } + }, + { + "type": "text", + "text": "Read all the text in the image." + } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="stepfun-ai/Step3-VL-10B", + messages=messages, + max_tokens=2048, + extra_body={"top_k": -1} +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example output:** + +```text Output +Response costs: 5.89s +Generated text: Auntie Anne's + +CINNAMON SUGAR +1 × 17,000               17,000 + +SUB TOTAL                    17,000 + +GRAND TOTAL                 17,000 + +CASH IDR                    20,000 + +CHANGE DUE                 3,000 +``` + +**Multi-Image Input Example:** + +Step3-VL-10B can process multiple images in a single request for comparison or analysis: + +```python Example +import time +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url="http://localhost:30000/v1", + timeout=3600 +) + +messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://www.civitatis.com/f/china/hong-kong/guia/taxi.jpg" + } + }, + { + "type": "image_url", + "image_url": { + "url": 
"https://cdn.cheapoguides.com/wp-content/uploads/sites/7/2025/05/GettyImages-509614603-1280x600.jpg" + } + }, + { + "type": "text", + "text": "Compare these two images and describe the differences in 100 words or less." + } + ] + } +] + +start = time.time() +response = client.chat.completions.create( + model="stepfun-ai/Step3-VL-10B", + messages=messages, + max_tokens=2048, + extra_body={"top_k": -1} +) +print(f"Response costs: {time.time() - start:.2f}s") +print(f"Generated text: {response.choices[0].message.content}") +``` + +**Example Output:** + +```text Output +Response costs: 3.24s +Generated text: First image: Single red Hong Kong taxi close - up, clear license plate (RX 5004), “4 SEATS” sticker, urban street with shops behind. Second image: Aerial view of many taxis (red, green) on a highway with a viaduct, some hoods open, dense arrangement. Differences: Scale (single vs many), perspective (close - up vs aerial), context (street shops vs highway), and taxi conditions (normal vs some open hoods). +``` + +#### 4.2.2 Reasoning Parser + +Step3-VL-10B supports reasoning mode. 
Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +python -m sglang.launch_server \ + --model stepfun-ai/Step3-VL-10B \ + --reasoning-parser deepseek-r1 \ + --host 0.0.0.0 \ + --port 30000 \ + --trust-remote-code +``` + +**Streaming with Thinking Process:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="stepfun-ai/Step3-VL-10B", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True, + extra_body={"top_k": -1} +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content: + # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Example Output:** +```text Output +=============== Thinking ================= +Okay, let's see. The problem is asking for 15% of 240. Hmm, I need to remember how to calculate percentages. So, percentage means "per hundred," right? So, 15% is the same as 15 per 100, or 15/100. To find a percentage of a number, I think you convert the percentage to a decimal and then multiply it by the number. Let me check that. + +First, 15% as a decimal. 
To convert a percentage to a decimal, you divide by 100. So 15 divided by 100 is 0.15. Yeah, that's right. So 15% is 0.15 in decimal form. Then, to find 15% of 240, I need to multiply 0.15 by 240. Let me do that calculation. + +Let me write it out: 0.15 * 240. Let's compute that. Maybe break it down. 0.1 is 10%, and 0.05 is 5%, so 10% of 240 is 24, and 5% of 240 is 12. Then 10% + 5% is 15%, so 24 + 12 is 36. Oh, that's a good way to check. So 15% is 10% plus 5%, which adds up to 36. Let me verify with the decimal method. 0.15 * 240. Let's multiply 240 by 0.1 first: 24. Then 240 by 0.05: 12. Adding them gives 36. Yep, that matches. Alternatively, 240 * 15 = 3600, and then divide by 100 (since it's per hundred), so 3600 / 100 = 36. That's another way. So 15% of 240 is 36. Let me make sure I didn't make a mistake. Let's check with another method. 10% of 240 is 24, 20% would be 48, so 15% is halfway between 10% and 20%, which is (24 + 48)/2 = 36. Yep, that works too. So all methods point to 36. I think that's the answer. + +=============== Content ================= + +To solve the problem "What is 15% of 240?" step by step: + +--- + +### **Step 1: Understand the concept of percentage** +A percentage represents a portion of a whole. Specifically, "percent" means "per hundred." So, 15% means **15 out of 100**, or **15/100**.
+ +--- + +### **Step 2: Convert the percentage to a decimal** +To use percentages in calculations, convert them to decimals by dividing by 100: +$$ +15\% = \frac{15}{100} = 0.15 +$$ + +--- + +### **Step 3: Multiply the decimal by the given number** +Now, multiply 0.15 (the decimal form of 15%) by 240: +$$ +0.15 \times 240 = 36 +$$ + +--- + +### **Alternative Verification Methods** + +#### **Method A: Break into parts** +- 10% of 240 = $ 0.10 \times 240 = 24 $ +- 5% of 240 = $ 0.05 \times 240 = 12 $ +- Add them: $ 24 + 12 = 36 $ + +#### **Method B: Use direct multiplication** +- $ 15\% \text{ of } 240 = \frac{15}{100} \times 240 = \frac{3600}{100} = 36 $ + +#### **Method C: Estimate using known percentages** +- 20% of 240 = $ 0.20 \times 240 = 48 $ +- 10% of 240 = $ 0.10 \times 240 = 24 $ +- 15% is halfway between 10% and 20%: $ \frac{24 + 48}{2} = 36 $ + +--- + +### **Final Answer** +$$ +\boxed{36} +$$ +``` + +**Note:** The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions. + +#### 4.2.3 Tool Calling + +Step3-VL-10B supports tool calling capabilities. 
Enable the tool call parser: + +```shell Command +python -m sglang.launch_server \ + --model stepfun-ai/Step3-VL-10B \ + --reasoning-parser deepseek-r1 \ + --tool-call-parser hermes \ + --host 0.0.0.0 \ + --port 30000 \ + --trust-remote-code +``` + +**Python Example (with Thinking Process):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Define available tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city name" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + } +] + +# Make request with streaming to see thinking process +response = client.chat.completions.create( + model="stepfun-ai/Step3-VL-10B", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=0.7, + stream=True, + extra_body={"top_k": -1} +) + +# Process streaming response +thinking_started = False +has_thinking = False +tool_calls_accumulator = {} + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Accumulate tool calls + if hasattr(delta, 'tool_calls') and delta.tool_calls: + # Close thinking section if needed + if has_thinking and thinking_started: + print("\n=============== Content =================\n", flush=True) + thinking_started = False + + for tool_call in delta.tool_calls: + index = 
tool_call.index + if index not in tool_calls_accumulator: + tool_calls_accumulator[index] = { + 'name': None, + 'arguments': '' + } + + if tool_call.function: + if tool_call.function.name: + tool_calls_accumulator[index]['name'] = tool_call.function.name + if tool_call.function.arguments: + tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments + + # Print content + if delta.content: + print(delta.content, end="", flush=True) + +# Print accumulated tool calls +for index, tool_call in sorted(tool_calls_accumulator.items()): + print(f"Tool Call: {tool_call['name']}") + print(f" Arguments: {tool_call['arguments']}") + +print() +``` + +**Example Output:** +```text Output +=============== Thinking ================= +The user is asking about the weather in Beijing. I have a function called "get_weather" that can provide weather information for a location. Let me check the parameters: + +- location: required (string) - "Beijing" +- unit: optional (string, enum: ["celsius", "fahrenheit"]) - not specified by the user, so I won't include it + +I should call the function with location="Beijing". + + + +=============== Content ================= + +Tool Call: get_weather + Arguments: {"location": "Beijing"} +``` + +**Handling Tool Call Results:** + +```python Example +# After getting the tool call, execute the function +def get_weather(location, unit="celsius"): + # Your actual weather API call here + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." 
+ +# Send tool result back to the model +messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + { + "role": "assistant", + "content": None, + "tool_calls": [{ + "id": "call_123", + "type": "function", + "function": { + "name": "get_weather", + "arguments": '{"location": "Beijing", "unit": "celsius"}' + } + }] + }, + { + "role": "tool", + "tool_call_id": "call_123", + "content": get_weather("Beijing", "celsius") + } +] + +final_response = client.chat.completions.create( + model="stepfun-ai/Step3-VL-10B", + messages=messages, + temperature=0.7, + extra_body={"top_k": -1} +) + +print(final_response.choices[0].message.content) +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +## 5. Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (1x) +- Model: stepfun-ai/Step3-VL-10B +- Tensor Parallelism: 1 +- sglang version: 0.5.8+ + +We use SGLang's built-in benchmarking tool to conduct performance evaluation with random images. 
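The throughput lines in the reports below appear to be plain totals divided by the wall-clock benchmark duration. As a quick sanity check when reading a report, the sketch below recomputes them. `throughput` is a hypothetical helper, not an SGLang API, and the input numbers are copied from the latency-sensitive report in 5.1.1.

```python
def throughput(total, duration_s):
    """Return items per second (hypothetical helper, not an SGLang API)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return total / duration_s


# Values copied from the latency-sensitive report in 5.1.1.
duration_s = 30.85
requests = 10
input_tokens = 14120
generated_tokens = 4220

report = {
    "request_throughput": throughput(requests, duration_s),
    "input_throughput": throughput(input_tokens, duration_s),
    "output_throughput": throughput(generated_tokens, duration_s),
    "total_throughput": throughput(input_tokens + generated_tokens, duration_s),
}

for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

The recomputed values can differ from the printed report in the last decimal place, because the report shows the duration rounded to two decimals while the tool presumably divides by the unrounded duration.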
+ +#### 5.1.1 Latency-Sensitive Benchmark + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model stepfun-ai/Step3-VL-10B \ + --host 0.0.0.0 \ + --port 30000 \ + --trust-remote-code +``` + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 30000 \ + --model stepfun-ai/Step3-VL-10B \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Result: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 30.85 +Total input tokens: 14120 +Total input text tokens: 720 +Total input vision tokens: 13400 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4217 +Request throughput (req/s): 0.32 +Input token throughput (tok/s): 457.71 +Output token throughput (tok/s): 136.79 +Peak output token throughput (tok/s): 240.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 594.50 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3083.40 +Median E2E Latency (ms): 2747.00 +P90 E2E Latency (ms): 4574.50 +P99 E2E Latency (ms): 5462.49 +---------------Time to First Token---------------- +Mean TTFT (ms): 1327.69 +Median TTFT (ms): 1341.01 +P99 TTFT (ms): 1486.11 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 4.16 +Median TPOT (ms): 4.17 +P99 TPOT (ms): 4.18 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 4.17 +Median ITL (ms): 4.18 +P95 ITL (ms): 4.30 +P99 ITL (ms): 4.38 +Max ITL (ms): 8.24 +================================================== +``` + +#### 5.1.2 Throughput-Sensitive Benchmark + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 30000 \ + --model stepfun-ai/Step3-VL-10B \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- Result: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 976.52 +Total input tokens: 1416949 +Total input text tokens: 76949 +Total input vision tokens: 1340000 +Total generated tokens: 510855 +Total generated tokens (retokenized): 510526 +Request throughput (req/s): 1.02 +Input token throughput (tok/s): 1451.02 +Output token throughput (tok/s): 523.14 +Peak output token throughput (tok/s): 20429.00 +Peak concurrent requests: 103 +Total token throughput (tok/s): 1974.16 +Concurrency: 99.81 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 97463.22 +Median E2E Latency (ms): 91872.75 +P90 E2E Latency (ms): 118553.42 +P99 E2E Latency (ms): 198445.56 +---------------Time to First Token---------------- +Mean TTFT (ms): 94379.07 +Median TTFT (ms): 87163.09 +P99 TTFT (ms): 194871.41 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 5.89 +Median TPOT (ms): 5.72 +P99 TPOT (ms): 23.58 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 6.05 +Median ITL (ms): 0.13 +P95 ITL (ms): 0.56 +P99 ITL (ms): 3.99 +Max ITL (ms): 97551.06 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 MMMU Benchmark + +You can evaluate the model's accuracy using the MMMU dataset: + +- Model Deployment Command: + +```shell Command +python -m sglang.launch_server \ + --model stepfun-ai/Step3-VL-10B \ + --host 0.0.0.0 \ + --port 30000 \ + --trust-remote-code +``` + +- Benchmark Command: + +```shell Command +python3 benchmark/mmmu/bench_sglang.py \ + --port 30000 \ + --concurrency 64 +``` + +- Result: + +```text Output +Benchmark time: 934.6179109360091 +answers saved to: ./answer_sglang.json +Evaluating... +answers saved to: ./answer_sglang.json +{'Accounting': {'acc': 0.667, 'num': 30}, + 'Agriculture': {'acc': 0.367, 'num': 30}, + 'Architecture_and_Engineering': {'acc': 0.4, 'num': 30}, + 'Art': {'acc': 0.467, 'num': 30}, + 'Art_Theory': {'acc': 0.5, 'num': 30}, + 'Basic_Medical_Science': {'acc': 0.367, 'num': 30}, + 'Biology': {'acc': 0.3, 'num': 30}, + 'Chemistry': {'acc': 0.467, 'num': 30}, + 'Clinical_Medicine': {'acc': 0.567, 'num': 30}, + 'Computer_Science': {'acc': 0.467, 'num': 30}, + 'Design': {'acc': 0.567, 'num': 30}, + 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.3, 'num': 30}, + 'Economics': {'acc': 0.6, 'num': 30}, + 'Electronics': {'acc': 0.567, 'num': 30}, + 'Energy_and_Power': {'acc': 0.633, 'num': 30}, + 'Finance': {'acc': 0.733, 'num': 30}, + 'Geography': {'acc': 0.333, 'num': 30}, + 'History': {'acc': 0.533, 'num': 30}, + 'Literature': {'acc': 0.533, 'num': 30}, + 'Manage': {'acc': 0.6, 'num': 30}, + 'Marketing': {'acc': 0.767, 'num': 30}, + 'Materials': {'acc': 0.6, 'num': 30}, + 'Math': {'acc': 0.7, 'num': 30}, + 'Mechanical_Engineering': {'acc': 0.333, 'num': 30}, + 'Music': {'acc': 0.4, 'num': 
 30}, + 'Overall': {'acc': 0.523, 'num': 900}, + 'Overall-Art and Design': {'acc': 0.483, 'num': 120}, + 'Overall-Business': {'acc': 0.673, 'num': 150}, + 'Overall-Health and Medicine': {'acc': 0.513, 'num': 150}, + 'Overall-Humanities and Social Science': {'acc': 0.492, 'num': 120}, + 'Overall-Science': {'acc': 0.5, 'num': 150}, + 'Overall-Tech and Engineering': {'acc': 0.481, 'num': 210}, + 'Pharmacy': {'acc': 0.6, 'num': 30}, + 'Physics': {'acc': 0.7, 'num': 30}, + 'Psychology': {'acc': 0.467, 'num': 30}, + 'Public_Health': {'acc': 0.733, 'num': 30}, + 'Sociology': {'acc': 0.433, 'num': 30}} +eval out saved to ./val_sglang.json +Overall accuracy: 0.523 +``` diff --git a/docs_new/cookbook/autoregressive/StepFun/Step3.5.mdx b/docs_new/cookbook/autoregressive/StepFun/Step3.5.mdx new file mode 100644 index 000000000000..176c2bf96af3 --- /dev/null +++ b/docs_new/cookbook/autoregressive/StepFun/Step3.5.mdx @@ -0,0 +1,528 @@ +--- +title: Step-3.5 +metatags: + description: "Deploy Step-3.5 reasoning engine with SGLang." +--- + +import { Step35Deployment } from '/src/snippets/autoregressive/step-35-deployment.jsx'; + +## 1. Model Introduction + +[Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) is StepFun's production-grade reasoning engine, built to decouple elite intelligence from heavy compute. It cuts attention cost for low-latency, cost-effective long-context inference and is purpose-built for autonomous agents in real-world workflows. The model is available in multiple quantization formats optimized for different hardware platforms. + +This generation delivers comprehensive upgrades across the board: +- **Hybrid Attention Architecture**: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) with a 3:1 ratio and an aggressive 128-token window. This hybrid approach keeps performance consistent on long inputs such as large datasets or long codebases while significantly reducing the computational overhead typical of standard long-context models.
+- **Sparse Mixture-of-Experts**: Only 11B active parameters out of 196B total parameters. +- **Multi-Layer Multi-Token Prediction (MTP)**: Equipped with a 3-way Multi-Token Prediction (MTP-3). This allows for complex, multi-step reasoning chains with immediate responsiveness. + +## 2. SGLang Installation + +Step-3.5-Flash is currently available in SGLang via Docker image. + +### Docker (NVIDIA) +```bash Command +# Pull the docker image +docker pull lmsysorg/sglang:dev-pr-18084 + +# Launch the container +docker run -it --gpus all \ + --shm-size=32g \ + --ipc=host \ + --network=host \ + lmsysorg/sglang:dev-pr-18084 bash +``` + +### Docker (AMD ROCm) +```bash Command +# For MI300X/MI325X +docker pull lmsysorg/sglang:v0.5.9-rocm700-mi30x + +# For MI350X/MI355X +docker pull lmsysorg/sglang:v0.5.9-rocm700-mi35x + +docker run -it \ + --device=/dev/kfd --device=/dev/dri \ + --shm-size=32g \ + --ipc=host \ + --network=host \ + --group-add video --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + lmsysorg/sglang:v0.5.9-rocm700-mi30x bash # or mi35x for MI350X/MI355X +``` + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +The Step-3.5-Flash series comes in only one size. Recommended starting configurations vary depending on hardware. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities. + + + +### 3.2 Configuration Tips + +- **Memory**: Requires GPUs with high VRAM capacity. Supported platforms: H200 (4×, TP=4), MI300X/MI325X/MI350X/MI355X (4×, TP=4 EP=4). +- **AMD Docker Image**: Use `lmsysorg/sglang:v0.5.9-rocm700-mi30x` for MI300X/MI325X and `lmsysorg/sglang:v0.5.9-rocm700-mi35x` for MI350X/MI355X.
+- **AMD Expert Parallelism Required**: On AMD GPUs, always use `--ep 4` with `--tp 4`. Both BF16 and FP8 models require expert parallelism. Without EP, the MoE intermediate dimension is split across GPUs (N=320), which triggers an AITER CK GEMM incompatibility. With EP=4, each GPU handles 72 full experts (N=1280), which works correctly with CUDA graph enabled. +- **AITER JIT Compilation**: The first inference on AMD may take 30-40 seconds for AITER kernel JIT compilation. Subsequent requests use cached kernels. + +## 4. Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser + +Step-3.5-Flash only supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections: + +```shell Command +sglang serve \ + --model-path stepfun-ai/Step-3.5-Flash \ + --tp 4 \ + --ep 4 \ + --reasoning-parser step3p5 +``` + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# Enable streaming to see the thinking process in real-time +response = client.chat.completions.create( + model="stepfun-ai/Step-3.5-Flash", + messages=[ + {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"} + ], + temperature=0.7, + max_tokens=2048, + stream=True +) + +# Process the stream +has_thinking = False +has_answer = False +thinking_started = False + +for chunk in response: + if chunk.choices and len(chunk.choices) > 0: + delta = chunk.choices[0].delta + + # Print thinking process + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if not thinking_started: + print("=============== Thinking =================", flush=True) + thinking_started = True + has_thinking = True + print(delta.reasoning_content, end="", flush=True) + + # Print answer content + if delta.content:
+ # Close thinking section and add content header + if has_thinking and not has_answer: + print("\n=============== Content =================", flush=True) + has_answer = True + print(delta.content, end="", flush=True) + +print() +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +We are asked: "What is 15% of 240?" We need to solve step by step. + +Step 1: Understand that "15% of 240" means we need to calculate 15 percent of 240. In mathematical terms, it is (15/100) * 240. + +Step 2: Simplify the calculation. We can compute 15% of 240 by first finding 10% of 240 and then 5% of 240, and adding them. Alternatively, we can multiply directly. + +Method 1: +10% of 240 = 240 * 0.10 = 24. +5% is half of 10%, so 5% of 240 = 24 / 2 = 12. +Then 15% = 10% + 5% = 24 + 12 = 36. + +Method 2: Direct multiplication: 15% = 15/100 = 0.15, so 0.15 * 240 = 36. + +We can also compute fractionally: (15/100)*240 = (15*240)/100. 15*240 = 3600, divided by 100 gives 36. + +Thus, the answer is 36. + +We'll present the solution step by step. + +=============== Content ================= + +To find 15% of 240, follow these steps: + +1. **Convert the percentage to a decimal**: + \( 15\% = \frac{15}{100} = 0.15 \) + +2. **Multiply by the number**: + \( 0.15 \times 240 = 36 \) + +Alternatively, break it down: +- \( 10\% \text{ of } 240 = 240 \times 0.10 = 24 \) +- \( 5\% \text{ of } 240 = \frac{24}{2} = 12 \) (since 5% is half of 10%) +- \( 15\% = 10\% + 5\% = 24 + 12 = 36 \) + +**Answer:** 36 +``` + +#### 4.2.2 Tool Calling + +Step-3.5 supports tool calling capabilities. Enable the tool call parser: + +**Python Example:** + +Start sglang server: + +```shell Command +sglang serve \ + --model-path stepfun-ai/Step-3.5-Flash \ + --tp 4 \ + --ep 4 \ + --reasoning-parser step3p5 \ + --tool-call-parser step3p5 +``` + +```python Example +from openai import OpenAI +import json + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +# 1. 
define tools +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string", "description": "The city name"}, + "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"} + }, + "required": ["location"] + } + } + } +] + +# 2. tool run +def get_weather(location, unit="celsius"): + return f"The weather in {location} is 22°{unit[0].upper()} and sunny." + +# 3. send first request +print("--- Sending first request ---") +response = client.chat.completions.create( + model="stepfun-ai/Step-3.5-Flash", + messages=[ + {"role": "user", "content": "What's the weather in Beijing?"} + ], + tools=tools, + temperature=1.0, + stream=False +) + +message = response.choices[0].message + +# 4. Handle Reasoning Content +reasoning = getattr(message, 'reasoning_content', None) +if reasoning: + print("=============== Thinking =================") + print(reasoning) + print("==========================================") + +# 5. 
Handle Tool Calls +if message.tool_calls: + print("\n🔧 Tool Calls detected:") + history_messages = [ + {"role": "user", "content": "What's the weather in Beijing?"}, + message + ] + + for tool_call in message.tool_calls: + print(f" Tool: {tool_call.function.name}") + print(f" Args: {tool_call.function.arguments}") + + args = json.loads(tool_call.function.arguments) + tool_result = get_weather(args.get("location"), args.get("unit", "celsius")) + + history_messages.append({ + "role": "tool", + "tool_call_id": tool_call.id, + "content": tool_result + }) + + print("\n--- Sending tool results ---") + final_response = client.chat.completions.create( + model="stepfun-ai/Step-3.5-Flash", + messages=history_messages, + temperature=1.0, + stream=False + ) + + print("=============== Final Content =================") + print(final_response.choices[0].message.content) + +else: + if message.content: + print("=============== Content =================") + print(message.content) +``` + +**Output Example:** + +```text Output +--- Sending first request --- +=============== Thinking ================= +The user is asking for the weather in Beijing. I should use the get_weather function with location="Beijing". The unit parameter is optional and the user didn't specify a preference, so I'll leave it out (the default should be fine). + +========================================== + +🔧 Tool Calls detected: + Tool: get_weather + Args: {"location": "Beijing"} + +--- Sending tool results --- +=============== Final Content ================= +The weather in Beijing is 22°C and sunny. +``` + +**Note:** + +- The reasoning parser shows how the model decides to use a tool +- Tool calls are clearly marked with the function name and arguments +- You can then execute the function and send the result back to continue the conversation + +## 5. 
Benchmark + +### 5.1 Speed Benchmark + +**Test Environment:** + +- Hardware: NVIDIA H200 GPU (4x) +- Model: Step-3.5-Flash +- Tensor Parallelism: 4 +- Expert Parallelism: 4 +- sglang version: 0.5.8 + +We use SGLang's built-in benchmarking tool to conduct performance evaluation on the [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. + +#### 5.1.1 Standard Scenario Benchmark + +- Model Deployment Command: + +```shell Command +sglang serve \ + --model-path stepfun-ai/Step-3.5-Flash \ + --tp 4 \ + --ep 4 +``` + +##### 5.1.1.1 Low Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model stepfun-ai/Step-3.5-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 35.30 +Total input tokens: 6091 +Total input text tokens: 6091 +Total generated tokens: 4220 +Total generated tokens (retokenized): 4212 +Request throughput (req/s): 0.28 +Input token throughput (tok/s): 172.57 +Output token throughput (tok/s): 119.56 +Peak output token throughput (tok/s): 124.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 292.14 +Concurrency: 1.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3527.94 +Median E2E Latency (ms): 2884.72 +P90 E2E Latency (ms): 6350.38 +P99 E2E Latency (ms): 7858.53 +---------------Time to First Token---------------- +Mean TTFT (ms): 107.53 +Median TTFT (ms): 80.93 +P99 TTFT (ms): 269.52 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 8.12 +Median TPOT (ms): 8.13 +P99 TPOT (ms): 8.14 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 8.12 +Median ITL (ms): 8.11 +P95 ITL (ms): 8.61 +P99 ITL (ms): 8.91 +Max ITL (ms): 20.77 +================================================== +``` + +##### 5.1.1.2 Medium Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model stepfun-ai/Step-3.5-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 80 \ + --max-concurrency 16 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 16 +Successful requests: 80 +Benchmark duration (s): 54.06 +Total input tokens: 39588 +Total input text tokens: 39588 +Total generated tokens: 40805 +Total generated tokens (retokenized): 40479 +Request throughput (req/s): 1.48 +Input token throughput (tok/s): 732.33 +Output token throughput (tok/s): 754.84 +Peak output token throughput (tok/s): 928.00 +Peak concurrent requests: 21 +Total token throughput (tok/s): 1487.17 +Concurrency: 14.06 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 9501.23 +Median E2E Latency (ms): 10010.71 +P90 E2E Latency (ms): 15655.09 +P99 E2E Latency (ms): 18803.63 +---------------Time to First Token---------------- +Mean TTFT (ms): 198.34 +Median TTFT (ms): 89.50 +P99 TTFT (ms): 984.66 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 18.97 +Median TPOT (ms): 18.80 +P99 TPOT (ms): 35.67 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 18.27 +Median ITL (ms): 17.48 +P95 ITL (ms): 18.44 +P99 ITL (ms): 62.47 +Max ITL (ms): 460.85 +================================================== +``` + +##### 5.1.1.3 High Concurrency + +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model stepfun-ai/Step-3.5-Flash \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 +``` + +- Test Results: + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 500 +Benchmark duration (s): 125.88 +Total input tokens: 249331 +Total input text tokens: 249331 +Total generated tokens: 252662 +Total generated tokens (retokenized): 251323 +Request throughput (req/s): 3.97 +Input token throughput (tok/s): 1980.77 +Output token throughput (tok/s): 2007.23 +Peak output token throughput (tok/s): 2500.00 +Peak concurrent requests: 109 +Total token throughput (tok/s): 3987.99 +Concurrency: 92.25 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 23223.31 +Median E2E Latency (ms): 22631.90 +P90 E2E Latency (ms): 42269.38 +P99 E2E Latency (ms): 47637.53 +---------------Time to First Token---------------- +Mean TTFT (ms): 372.13 +Median TTFT (ms): 127.26 +P99 TTFT (ms): 1880.42 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 46.06 +Median TPOT (ms): 47.61 +P99 TPOT (ms): 51.34 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 45.31 +Median ITL (ms): 39.86 +P95 ITL (ms): 72.49 +P99 ITL (ms): 117.05 +Max ITL (ms): 1359.81 +================================================== +``` + +### 5.2 Accuracy Benchmark + +#### 5.2.1 GSM8K Benchmark + +- **Benchmark Command:** + +```shell Command +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` + +- **Results**: + + - Step-3.5-Flash + ``` + Accuracy: 0.885 + Invalid: 0.005 + Latency: 9.986 s + Output throughput: 1972.911 token/s + ``` diff --git a/docs_new/cookbook/autoregressive/Tencent/Hunyuan3-Preview.mdx b/docs_new/cookbook/autoregressive/Tencent/Hunyuan3-Preview.mdx new file mode 100644 index 000000000000..84749039bc87 --- /dev/null +++ b/docs_new/cookbook/autoregressive/Tencent/Hunyuan3-Preview.mdx @@ -0,0 +1,527 @@ +--- +title: Hunyuan 3 Preview +metatags: + description: "Deploy Tencent Hunyuan 3 Preview BF16 (~276B / ~20B active MoE) on NVIDIA GPUs with SGLang — hybrid thinking, native tool calling, 256K context, and built-in MTP speculative decoding." +tag: NEW +--- + +## 1. Model Introduction + +Hunyuan 3 Preview (Hy3-preview) is Tencent's preview of its third-generation flagship MoE language model, featuring hybrid thinking, native tool calling, long-context reasoning, and Multi-Token Prediction (MTP) for low-latency serving. + +**Key Features:** + +- **MoE Architecture**: 192 routed experts + 1 shared expert, 8 experts activated per token. ~276B total parameters with ~20B active, delivering dense-model quality at MoE inference cost. +- **Hybrid Thinking**: Reasoning modes (`high`, `medium`, `low`, `none`) controllable via OpenAI-standard `reasoning_effort`, allowing the same weights to trade off latency and depth of reasoning. +- **Native Tool Calling**: Trained on structured `` / `` / `` grammar. 
Pairs with SGLang's `hunyuan` tool-call parser for streaming OpenAI-compatible function-calling output. +- **Long Context**: 256K token context window (262,144 positions) for repository-scale code and document reasoning. +- **Multi-Token Prediction (MTP)**: Ships with a built-in MTP draft module enabling speculative decoding out of the box. + +**Available Models:** + +- [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview) — BF16 instruct +- [tencent/Hy3-preview-Base](https://huggingface.co/tencent/Hy3-preview-Base) — BF16 base + +**Recommended Generation Parameters:** + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Value |
| --- | --- |
| `temperature` | 0.7 |
| `top_p` | 0.9 |
| `reasoning_effort` | `high` / `medium` / `low` (thinking) or `none` (instant) |
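
The recommended values above can be bundled into request arguments once and reused. The following is a minimal sketch, assuming the OpenAI-compatible endpoint used elsewhere in this guide; `build_chat_kwargs`, `RECOMMENDED_SAMPLING`, and `VALID_EFFORTS` are hypothetical names introduced for illustration and are not part of SGLang or the OpenAI SDK.

```python
# Hypothetical helper (not an SGLang or OpenAI SDK API): packs the recommended
# sampling values from the table into keyword arguments for a chat request.
RECOMMENDED_SAMPLING = {"temperature": 0.7, "top_p": 0.9}
VALID_EFFORTS = ("high", "medium", "low", "none")


def build_chat_kwargs(prompt, reasoning_effort="none"):
    """Return kwargs intended for client.chat.completions.create(**kwargs)."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
    return {
        "model": "tencent/Hy3-preview",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
        **RECOMMENDED_SAMPLING,
    }


kwargs = build_chat_kwargs("What is 15% of 240?", reasoning_effort="high")
print(kwargs["temperature"], kwargs["top_p"], kwargs["reasoning_effort"])
```

Rejecting unknown effort values locally is a design choice of this sketch; it surfaces a typo before a request is sent rather than relying on the server's handling.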
+ +**License:** TODO — verify on HuggingFace model card. + +## 2. SGLang Installation + +SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang installation guide](../../../docs/get-started/install) for installation instructions. + +**Docker Images by Hardware Platform:** + + + + + + + + + + + + + + + + + + +
| Hardware Platform | Docker Image |
| --- | --- |
| NVIDIA H200 / B200 | `lmsysorg/sglang:hy3-preview` |
| NVIDIA B300 / GB300 | `lmsysorg/sglang:hy3-preview-cu130` |
+ +The `hy3-preview` tag bundles the HYV3 model code, the `hunyuan` tool-call / reasoning parsers, and the MTP draft-module runtime. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization, and feature capabilities. + +import { Hunyuan3PreviewDeployment } from '/src/snippets/autoregressive/hunyuan3-preview-deployment.jsx' + + + +### 3.2 Configuration Tips + +**Key Parameters:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Recommended Value |
| --- | --- | --- |
| `--tool-call-parser` | Tool call parser for function-calling support | `hunyuan` |
| `--reasoning-parser` | Reasoning parser for hybrid thinking modes | `hunyuan` |
| `--trust-remote-code` | Required for Hunyuan model loading | Always enabled |
| `--mem-fraction-static` | Static memory fraction (KV + activations) | `0.9` |
| `--tp` | Tensor parallelism size | `2` / `4` / `8` depending on hardware |
| `--attention-backend` | Attention backend (Blackwell only) | `trtllm_mha` |
| `--speculative-algorithm` | Speculative decoding via the bundled MTP draft | `EAGLE` + `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4` (set env `SGLANG_ENABLE_SPEC_V2=1`) |
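
To show how these parameters combine, here is a minimal sketch that assembles a launch command from the flags and recommended values in the table. `build_serve_command` and its arguments are hypothetical and exist only for this illustration; the flags themselves are the ones listed above, and the interactive generator remains the reference for exact commands.

```python
import shlex


# Hypothetical helper (not an SGLang API): builds the launch command from the
# flags and recommended values listed in the Key Parameters table.
def build_serve_command(tp=8, blackwell=False, mtp=False, mem_fraction=0.9):
    args = [
        "sglang", "serve",
        "--model-path", "tencent/Hy3-preview",
        "--tp", str(tp),
        "--reasoning-parser", "hunyuan",
        "--tool-call-parser", "hunyuan",
        "--trust-remote-code",
        "--mem-fraction-static", str(mem_fraction),
    ]
    if blackwell:
        # The table marks this backend as Blackwell only.
        args += ["--attention-backend", "trtllm_mha"]
    if mtp:
        # Speculative decoding through the bundled MTP draft module.
        args += [
            "--speculative-algorithm", "EAGLE",
            "--speculative-num-steps", "3",
            "--speculative-eagle-topk", "1",
            "--speculative-num-draft-tokens", "4",
        ]
    # The table pairs the speculative flags with this environment variable.
    env = {"SGLANG_ENABLE_SPEC_V2": "1"} if mtp else {}
    return env, args


env, args = build_serve_command(tp=4, blackwell=True, mtp=True)
prefix = " ".join(f"{key}={value}" for key, value in env.items())
print(f"{prefix} {shlex.join(args)}".strip())
```

Returning the environment separately from the argument list keeps the result usable with `subprocess.run(args, env=...)` as well as for printing a shell line.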
+ +**Hardware Requirements: NVIDIA BF16 (`Hy3-preview`, ~552GB weights)** + +- **H200 (141GB) / B200 (180GB)**: TP=8 (minimum for BF16 to fit single-node). +- **B300 (275GB) / GB300**: TP=4. +- **A100 / H100 (80GB)**: not supported single-node — BF16 requires multi-node TP=16+ on 80GB-class GPUs. + +**Blackwell (B200 / B300 / GB300):** Auto-selected attention backend can mis-route for HYV3 on Blackwell. Always pass `--attention-backend trtllm_mha` explicitly on Blackwell hardware (the config generator above enforces this). + +**Multi-Token Prediction (MTP):** The `Hy3-preview` release bundles an MTP draft module. SGLang runs it via its EAGLE speculative-decoding path — the draft module auto-loads from the same `--model-path`. Enable with the `SGLANG_ENABLE_SPEC_V2=1` env var and the standard MTP flags: + +```bash Command +SGLANG_ENABLE_SPEC_V2=1 sglang serve \ + --model-path tencent/Hy3-preview \ + --tp 8 \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --reasoning-parser hunyuan \ + --tool-call-parser hunyuan \ + --trust-remote-code \ + --mem-fraction-static 0.85 +``` + +Toggle the "Speculative Decoding (MTP)" option in the generator above to add these flags automatically. Tune `num-steps` / `num-draft-tokens` based on acceptance rate in your workload. + +## 4. 
Model Invocation + +### 4.1 Basic Usage + +For basic API usage and request examples, please refer to: + +- [SGLang Basic Usage Guide](../../../docs/basic_usage/send_request) + +**Deployment Command (H200 × 8, BF16 default):** + +```bash Command +sglang serve \ + --model-path tencent/Hy3-preview \ + --tp 8 \ + --reasoning-parser hunyuan \ + --tool-call-parser hunyuan \ + --trust-remote-code \ + --mem-fraction-static 0.9 +``` + +**Testing Deployment:** + +After startup, you can test the SGLang OpenAI-compatible API with the following command: + +```bash Command +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "tencent/Hy3-preview", + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Who won the world series in 2020?"} + ] + }' +``` + +**Simple Completion Example:** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="tencent/Hy3-preview", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Who won the world series in 2020?"} + ], + max_tokens=1024 +) + +print("Reasoning:", response.choices[0].message.reasoning_content) +print("Content: ", response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +Reasoning: None +Content: The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in six games (4-2). This was the Dodgers' first World Series championship since 1988. The series was notable for being played in a neutral-site bubble at Globe Life Field in Arlington, Texas, due to the COVID-19 pandemic. +``` + +When `reasoning_effort` is not set, the server defaults to instant mode (no thinking, `reasoning_content=None`). 
To opt into thinking, pass `reasoning_effort="high" / "medium" / "low"` on the request — see the Hybrid Thinking section below. + +### 4.2 Advanced Usage + +#### 4.2.1 Reasoning Parser (Hybrid Thinking) + +Hy3-preview is a hybrid-thinking model. Control the thinking budget via the OpenAI-standard `reasoning_effort`: + +- `high` / `medium` / `low` — increasing amounts of chain-of-thought in `reasoning_content` +- `none` — skip thinking entirely (instant responses, content-only) + +Enable the reasoning parser during deployment so that the thinking section (`...`) is separated into `reasoning_content`: + +```bash Command +sglang serve \ + --model-path tencent/Hy3-preview \ + --tp 8 \ + --reasoning-parser hunyuan \ + --trust-remote-code \ + --mem-fraction-static 0.9 +``` + +**Thinking Mode — High Effort:** + +```python Example +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +response = client.chat.completions.create( + model="tencent/Hy3-preview", + messages=[{"role": "user", "content": "Solve step by step: What is 15% of 240?"}], + reasoning_effort="high", + max_tokens=2048, +) + +msg = response.choices[0].message +print("=============== Thinking =================") +print(msg.reasoning_content) +print("=============== Content =================") +print(msg.content) +``` + +**Output Example:** + +```text Output +=============== Thinking ================= +We need to solve: "What is 15% of 240?" Step by step. So we need to compute 15% of 240. The process: 15% means 15 per hundred, i.e., 15/100 = 0.15. Multiply 0.15 by 240. Or we can do: 10% of 240 = 24, 5% is half of 10% = 12, so sum = 36. Or do multiplication: 15/100 * 240 = (15*240)/100 = (3600)/100 = 36. So answer is 36. + +We need to produce step-by-step explanation. The instruction: "Solve step by step: What is 15% of 240?" So we should provide a clear solution with steps. The final answer: 36. Also maybe include units? No units. 
+ +We'll output the solution in a clear manner. +=============== Content ================= +To find 15% of 240, follow these steps: + +1. **Understand that percent means "per hundred."** + So, 15% = 15/100 or 0.15. + +2. **Multiply the number (240) by the percentage in decimal form.** + 0.15 × 240. + + Alternatively, you can use fractions: + (15/100) × 240. + +3. **Perform the multiplication.** + 0.15 × 240 = 36. + Or: + (15 × 240) / 100 = 3600 / 100 = 36. + +4. **Check using an alternative method:** + - 10% of 240 = 24. + - 5% of 240 = half of 10% = 12. + - 15% = 10% + 5% = 24 + 12 = 36. + +Thus, **15% of 240 is 36**. +``` + +**Instant Mode — No Thinking:** + +```python Example +response = client.chat.completions.create( + model="tencent/Hy3-preview", + messages=[{"role": "user", "content": "Give me a one-line summary of relativity."}], + reasoning_effort="none", + max_tokens=256, +) + +print("Content:", response.choices[0].message.content) +``` + +**Output Example:** + +```text Output +Content: Relativity is Einstein's theory that space, time, mass, and gravity are interconnected and relative, not fixed, fundamentally changing our understanding of the universe. +``` + +#### 4.2.2 Tool Calling + +Hy3-preview supports streaming OpenAI-compatible tool calls. 
Enable both parsers together — the reasoning parser strips thinking tokens before the tool-call parser runs: + +```bash Command +sglang serve \ + --model-path tencent/Hy3-preview \ + --tp 8 \ + --reasoning-parser hunyuan \ + --tool-call-parser hunyuan \ + --trust-remote-code \ + --mem-fraction-static 0.9 +``` + +**Non-Streaming Example:** + +```python Example +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a city.", + "parameters": { + "type": "object", + "properties": { + "city": {"type": "string"}, + "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}, + }, + "required": ["city"], + }, + }, + } +] + +response = client.chat.completions.create( + model="tencent/Hy3-preview", + messages=[{"role": "user", "content": "What's the weather in Beijing? Use fahrenheit."}], + tools=tools, +) + +msg = response.choices[0].message +print("Reasoning:", msg.reasoning_content) +print("Content: ", msg.content) +for tc in msg.tool_calls or []: + print(f"Tool Call: {tc.function.name}") + print(f" Arguments: {tc.function.arguments}") +``` + +**Output Example:** + +```text Output +Reasoning: None +Content: I'll get the current weather for Beijing in Fahrenheit for you. +Tool Call: get_weather + Arguments: {"city": "Beijing", "unit": "fahrenheit"} +``` + +**Streaming Example (incremental argument deltas):** + +Hy3-preview's `hunyuan` tool-call parser emits tool names first, then argument JSON in incremental fragments — matching the OpenAI streaming contract: + +```python Example +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +stream = client.chat.completions.create( + model="tencent/Hy3-preview", + messages=[{"role": "user", "content": "What's the weather in Beijing? 
Use fahrenheit."}], + tools=tools, + stream=True, +) + +tool_buffer = {} +for chunk in stream: + delta = chunk.choices[0].delta + if delta.content: + print(delta.content, end="", flush=True) + for tc in delta.tool_calls or []: + buf = tool_buffer.setdefault(tc.index, {"name": "", "args": ""}) + if tc.function and tc.function.name: + buf["name"] += tc.function.name + if tc.function and tc.function.arguments: + buf["args"] += tc.function.arguments + +for idx, buf in tool_buffer.items(): + print(f"\nTool[{idx}] {buf['name']}({buf['args']})") +``` + +**Output Example:** + +```text Output +I'll check the current weather in Beijing for you using Fahrenheit. +Tool[0] get_weather({"city": "Beijing", "unit": "fahrenheit"}) +``` + +## 5. Benchmark + +### 5.1 Accuracy Benchmark + +**Test Environment:** + +- Hardware: 8× NVIDIA H200 (141GB) +- Docker Image: `lmsysorg/sglang:hy3-preview` +- Model: `tencent/Hy3-preview` (BF16) +- Tensor Parallelism: 8 +- SGLang version: latest `main` + +#### 5.1.1 GSM8K + +- Benchmark Method: 5-shot CoT on 200 questions, evaluated via SGLang native backend +- Benchmark Command: + +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 64 +``` + +- Test Results: + +```text Output +TODO — replace with real GSM8K accuracy after benchmark run on Hy3-preview (BF16). +``` + +#### 5.1.2 MMLU + +- Benchmark Method: 5-shot, all 57 subjects +- Benchmark Command: + +```bash Command +python3 benchmark/mmlu/bench_sglang.py --nsub 60 --parallel 64 +``` + +- Test Results: + +```text Output +TODO — replace with real MMLU accuracy after benchmark run on Hy3-preview (BF16). 
+``` + +#### 5.1.3 Tool-Call Accuracy (MiniMax-Provider-Verifier) + +- Benchmark Tool: [MiniMax-Provider-Verifier](https://github.com/MiniMax-AI/MiniMax-Provider-Verifier) +- Metric: function-call schema validity, argument match, and end-to-end response correctness +- Test Results: + +```text Output +TODO — replace with real tool-call accuracy after benchmark run on Hy3-preview (BF16). +``` + +### 5.2 Speed Benchmark + +#### 5.2.1 Low Concurrency + +- Benchmark Command: + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model tencent/Hy3-preview \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- Test Results: + +```text Output +TODO — replace with real low-concurrency output on Hy3-preview (BF16). +``` + +#### 5.2.2 High Concurrency + +- Benchmark Command: + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --model tencent/Hy3-preview \ + --dataset-name random \ + --random-input-len 1000 \ + --random-output-len 1000 \ + --num-prompts 500 \ + --max-concurrency 100 +``` + +- Test Results: + +```text Output +TODO — replace with real high-concurrency output on Hy3-preview (BF16). +``` diff --git a/docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2-Flash.mdx b/docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2-Flash.mdx new file mode 100644 index 000000000000..3823ba0ebf4f --- /dev/null +++ b/docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2-Flash.mdx @@ -0,0 +1,106 @@ +--- +title: MiMo-V2-Flash +metatags: + description: "Deploy MiMo-V2-Flash 309B MoE model with SGLang - hybrid attention, multi-token prediction, and 256K context for efficient inference." 
+--- + +## Introduction + +XiaomiMiMo/MiMo-V2-Flash, with 309B total parameters and 15B activated parameters, is an inference-centric model from the XiaomiMiMo Team, designed to maximize decoding efficiency. It is explicitly co-designed for real-world serving workloads, enabling flexible tradeoffs between throughput and latency on different hardware. + +This model strikes a new balance between long-context modeling capability and inference efficiency. Key features include: +- **Hybrid Attention Architecture**: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) with a 5:1 ratio and an aggressive 128-token window. This reduces KV-cache storage by nearly 6x while maintaining long-context performance via learnable attention sink bias. +- **Multi-Token Prediction (MTP)**: Equipped with a lightweight MTP module (0.33B params/block) using dense FFNs. This triples output speed during inference and also helps accelerate rollout in RL training. +- **Efficient Pre-Training**: Trained on 27T tokens using FP8 mixed precision and a native 32k sequence length. The context window supports up to 256k tokens. +- **Agentic Capabilities**: Post-training utilizes Multi-Teacher On-Policy Distillation (MOPD) and large-scale agentic RL, achieving superior performance on SWE-Bench and complex reasoning tasks. + + +## Installation + +MiMo-V2-Flash is currently available in SGLang via Docker image and pip install. 
+ +### Docker + +```bash Command +# Pull the docker image +docker pull lmsysorg/sglang:dev-pr-15207 + +# Launch the container +docker run -it --gpus all \ + --shm-size=32g \ + --ipc=host \ + --network=host \ + lmsysorg/sglang:dev-pr-15207 bash +``` + +### Pip Installation + +```bash Command +# On a machine with SGLang dependencies installed or inside a SGLang nightly container +# Start an SGLang nightly container +docker run -it --gpus all \ + --shm-size=32g \ + --ipc=host \ + --network=host \ + lmsysorg/sglang:nightly-dev-20251215-4449c170 bash + +# If you already have SGLang installed, uninstall the current SGLang version +pip uninstall sglang -y + +# Install the PyPI Package +pip install sglang==0.5.6.post2.dev8005+pr.15207.g39d5bd57a \ + --extra-index-url https://sgl-project.github.io/whl/pr/ +``` + +## Model Deployment + +Use the configuration selector below to automatically generate the appropriate deployment command. + +import { MiMoV2FlashDeployment } from "/src/snippets/autoregressive/mimo-v2-flash-deployment.jsx"; + + + +MI355X (ROCm) is validated in the selector above with `--tp-size 4`, Triton attention, and `--disable-custom-all-reduce`. `--tp-size 8` hit a QKV sharding error during validation. EAGLE speculative decoding is still WIP on MI355X. + +## Testing the deployment + +Once the server is running, test it with a chat completion request in another terminal: + +```bash Command +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "XiaomiMiMo/MiMo-V2-Flash", + "messages": [ + {"role": "user", "content": "Hello! What can you help me with?"} + ], + "temperature": 0.7, + "max_tokens": 100 + }' +``` + +**Expected response:** + +```json Config +{ + "id": "...", + "object": "chat.completion", + "model": "XiaomiMiMo/MiMo-V2-Flash", + "choices": [{ + "message": { + "role": "assistant", + "content": "Hello! I can help you with..." 
+ } + }] +} +``` + +## Troubleshooting + +**DeepGEMM Timeout Error** + +Occasionally DeepGEMM timeout errors occur during first launch. Simply rerun the server command in the same container - the compiled kernels are cached and subsequent launches will be fast. + +**ROCm MI355X Attention Backend** + +If you see an error such as `AiterAttnBackend.forward_decode() got an unexpected keyword argument 'sinks'` on MI355X, use the `MI355X` + `Performance Optimizations` command from the selector above, which switches to Triton attention and keeps `--disable-custom-all-reduce`. diff --git a/docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2.5.mdx b/docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2.5.mdx new file mode 100644 index 000000000000..653ef4c87bad --- /dev/null +++ b/docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2.5.mdx @@ -0,0 +1,666 @@ +--- +title: MiMo-V2.5 +metatags: + description: "Deploy XiaomiMiMo MiMo-V2.5-Pro (1.02T MoE, text) and MiMo-V2.5 (310B MoE, multimodal) with SGLang — EAGLE speculative decoding, hybrid attention, and 1M-token context." +tag: NEW +--- + +## 1. Model Introduction + +[MiMo-V2.5-Pro](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro) and [MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5) are next-generation Mixture-of-Experts models from the XiaomiMiMo Team. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Variant | Total params | Active (MoE) | Modalities |
| --- | --- | --- | --- |
| MiMo-V2.5-Pro | 1.02T | 42B | Text (multimodal planned) |
| MiMo-V2.5 | 310B | 15B | Text, Image, Video, Audio |
+ +**Key Features:** + +- **Hybrid Attention Architecture**: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) for reduced KV cache while preserving long-context capability. +- **Multi-Token Prediction (MTP)**: 3-layer MTP module accelerates decoding. Both variants support EAGLE speculative decoding with MTP weights. +- **1M-Token Context**: Both variants support up to 1 million token context windows. +- **Agentic Capabilities**: Post-training with large-scale agentic RL achieves strong performance on coding, reasoning, and tool-use benchmarks. +- **MiMo-V2.5 Multimodal** (V2.5 only): Native omnimodal architecture with a 729M-param ViT Vision Encoder (28 layers: 24 SWA + 4 Full) and a 261M-param Audio Transformer (24 layers: 12 SWA + 12 Full); supports image, video, and audio understanding via standard OpenAI-compatible multimodal API. + +**License:** Apache 2.0 + +## 2. SGLang Installation + +Refer to the [official SGLang installation guide](../../../docs/get-started/install). + +**Docker Images by Variant × Hardware:** + +| Variant | Hardware | Docker Image | +| --- | --- | --- | +| **MiMo-V2.5 (310B)** | H100 / H200 (Hopper, CUDA 12.9) | `lmsysorg/sglang:dev-mimo-v2.5` | +| **MiMo-V2.5 (310B)** | B200 / GB300 (Blackwell, CUDA 13.0) | `lmsysorg/sglang:dev-cu13-mimo-v2.5` | +| **MiMo-V2.5-Pro (1.02T)** | H100 / H200 (Hopper, CUDA 12.9) | `lmsysorg/sglang:dev-mimo-v2.5-pro` | +| **MiMo-V2.5-Pro (1.02T)** | B200 / GB300 (Blackwell, CUDA 13.0) | `lmsysorg/sglang:dev-cu13-mimo-v2.5-pro` | + +> Pull the image matching your GPU's CUDA driver. `lmsysorg/sglang:latest` will not load either checkpoint. + +**TPU (sgl-jax):** MiMo-V2.5-Pro can also be served on TPU via the JAX-based [sgl-jax](https://github.com/sgl-project/sglang-jax) runtime. The container image and `pip install` steps are listed in [§3.3 TPU Deployment](#33-tpu-deployment-mimo-v25-pro-sgl-jax). + +## 3. 
Model Deployment + +### 3.1 Basic Configuration + +Use the selector below to generate the deployment command for your variant and hardware. + +import { MiMoV25Deployment } from '/src/snippets/autoregressive/mimo-v25-deployment.jsx' + + + +### 3.2 Configuration Tips + +**MiMo-V2.5-Pro (1.02T):** +- **B200**: single node, TP=8 (verified). Uses `--attention-backend fa4` + `--moe-runner-backend flashinfer_trtllm` + `--mem-fraction-static 0.8`. Set `--swa-full-tokens-ratio 0.1` to keep KV-cache footprint within 192 GB HBM. +- **GB300**: 2 nodes, TP=8 (verified). Same Blackwell stack as B200; multi-node interconnect requires `NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1`. Default SWA ratio is fine. +- **H100/H200**: 2 nodes × 8 GPUs (TP=16, not yet verified). Uses the Hopper stack (`fa3` + DeepEP + EAGLE multi-layer); fits with `--mem-fraction-static 0.7` and `--swa-full-tokens-ratio 0.3`. DeepEP dispatch tuning: `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256` avoids memory spikes during prefill. +- EAGLE speculative decoding (3 steps, topk=1) typically yields a 2–3× decode speedup. Requires `SGLANG_ENABLE_SPEC_V2=1`; on Hopper also pass `--enable-multi-layer-eagle`. + +**MiMo-V2.5 (310B):** +- The checkpoint has a TP=4-interleaved fused `qkv_proj`; attention-TP per DP group **must** be 4. Use `--dp = TP / 4`; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare `--tp 8` without `--dp 2` will fail to load with `MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8`. +- Single-node deployments: H100/H200 8× GPUs (`--tp 8 --dp 2`), B200 4× GPUs (`--tp 4`, dp=1, no DP-attn flag needed), GB300 4× GPUs (`--tp 4`, single NVL4 node). FP8 quantization. +- `--enable-dp-lm-head` and `--mm-enable-dp-encoder` are required whenever `--enable-dp-attention` is on, to keep LM head and encoder sharding consistent. +- EAGLE MTP uses the checkpoint's MTP weights. 
For H100/H200, enable `SGLANG_ENABLE_SPEC_V2=1`, `--speculative-algorithm EAGLE`, and `--enable-multi-layer-eagle`. +- **Multimodal**: Supports image, video, and audio understanding; see Section 4.3 for invocation examples. + +**DeepEP (optional toggle, Hopper-only):** +- DeepEP replaces the default MoE all-to-all dispatch with a fused [DeepEP](https://github.com/deepseek-ai/DeepEP) backend; it lowers expert dispatch latency and memory traffic, so it pays off under **high concurrency / throughput-bound workloads** on H100/H200. Under concurrency=1 / latency-bound workloads the gain is negligible — leave it off. +- Enabling adds `--moe-a2a-backend deepep` + `--moe-dense-tp-size 1` (and `--ep ` for Pro) plus `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256` env to cap the dispatch buffer. Requires `pip install deep_ep` (not part of the default sglang install). +- On Blackwell (B200, GB300) the verified MoE backend is `flashinfer_trtllm`; the DeepEP toggle is a no-op there. + +### 3.3 TPU Deployment (MiMo-V2.5-Pro, sgl-jax) + +MiMo-V2.5-Pro can also be served on TPU via [sgl-jax](https://github.com/sgl-project/sglang-jax). The runtime is a separate JAX-based stack (`sgl_jax.launch_server`); pick **TPU v7x** or **TPU v6e** in the panel above to generate the launch command. Verified topologies: + +| TPU Type | Topology | Chips/Node | Nodes | Total Chips | JAX Devices/Chip | Total JAX Devices (= `--tp-size`) | +| --- | --- | --- | --- | --- | --- | --- | +| **v7x** | 2×2×4 | 4 | 4 | 16 | 2 | 32 | +| **v6e** | 4×4×4 | 4 | 16 | 64 | 1 | 64 | + +> v7x exposes **2 logical JAX devices per chip**, so `--tp-size = 16 chips × 2 = 32`. v6e exposes 1 device per chip, so `--tp-size = 64`. Always set `--tp-size` to the total JAX device count across all nodes, not the chip count. + +All nodes must sit in the same TPU slice and reach each other on the JAX init port (`20000`) and the TPU process port (`8471`). 
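
The `--tp-size` rule from the topology table above is plain arithmetic, and it can be checked before launching anything. The following is a minimal sketch, not part of sgl-jax: the helper name `total_jax_devices` and the `VERIFIED_TOPOLOGIES` dictionary are hypothetical, and the numbers are copied from the verified topologies listed above.

```python
# Hypothetical helper: derive --tp-size from a TPU topology.
# Values mirror the verified topologies table; nothing here calls sgl-jax.


def total_jax_devices(nodes, chips_per_node, jax_devices_per_chip):
    """Total JAX devices across all nodes, which is the value for --tp-size."""
    return nodes * chips_per_node * jax_devices_per_chip


VERIFIED_TOPOLOGIES = {
    # v7x exposes 2 logical JAX devices per chip.
    "v7x": {"nodes": 4, "chips_per_node": 4, "jax_devices_per_chip": 2},
    # v6e exposes 1 JAX device per chip.
    "v6e": {"nodes": 16, "chips_per_node": 4, "jax_devices_per_chip": 1},
}

if __name__ == "__main__":
    for name, topo in VERIFIED_TOPOLOGIES.items():
        print(f"{name}: --tp-size {total_jax_devices(**topo)}")
```

Counting devices rather than chips is the point of the sketch: on v7x the chip count (16) would give half the required `--tp-size`.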
+ +**Step 1 — Launch the JAX TPU container on every node:** + +```shell Command +docker run -it --privileged \ + --shm-size=32g \ + --ipc=host \ + --network=host \ + -v /dev:/dev \ + us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:jax0.8.1-rev1 bash +``` + +> The image is pinned to `jax0.8.1-rev1` to keep the JAX runtime aligned with sgl-jax's TPU extras. + +**Step 2 — Clone and install sgl-jax (inside the container):** + +```shell Command +git clone https://github.com/sgl-project/sglang-jax.git +cd sglang-jax +pip install -e "python[tpu]" +``` + +## 4. Model Invocation + +### 4.1 Basic Usage + +See [Basic API Usage](../../../docs/basic_usage/send_request). + +### 4.2 Reasoning Output + +Both variants support hybrid thinking mode. Thinking content is separated via the reasoning parser. + +**Thinking Mode (default):** + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +response = client.chat.completions.create( + model="XiaomiMiMo/MiMo-V2.5", + messages=[ + {"role": "user", "content": "Which is larger, 9.11 or 9.9? Think carefully."} + ] +) + +print("====== Reasoning ======") +print(response.choices[0].message.reasoning_content) +print("====== Answer ======") +print(response.choices[0].message.content) +``` + +**Output Example (MiMo-V2.5):** + +```text +====== Reasoning ====== +Comparing 9.11 and 9.9. + +The integer parts are both 9. Now compare the decimal parts: 0.11 vs 0.9. + +0.9 = 0.90, which is greater than 0.11. + +So 9.9 > 9.11. +====== Answer ====== +**9.9 is larger than 9.11.** + +Here's the reasoning: When comparing decimals, line them up to the same number of decimal places: + +- 9.11 +- 9.90 + +Both have a **9** in the ones place, but in the tenths place, **9 > 1**, so 9.90 > 0.11. 
+ +**9.9 > 9.11** +``` + +**Thinking Off (instant mode):** + +```python Example +response = client.chat.completions.create( + model="XiaomiMiMo/MiMo-V2.5", + messages=[ + {"role": "user", "content": "Which is larger, 9.11 or 9.9? Think carefully."} + ], + extra_body={"chat_template_kwargs": {"thinking": False}} +) + +print(response.choices[0].message.content) +``` + +**Output Example (MiMo-V2.5):** + +```text +## Comparing 9.11 and 9.9 + +**9.9 is larger.** + +The key is to compare them place by place. It helps to write them with the same number of decimal places: + +- **9.11** → 9.11 +- **9.9** → 9.90 + +Both have **9** in the ones place, but in the tenths place: **9** (in 9.90) is greater than **1** (in 9.11). + +So **9.90 > 9.11**. +``` + +### 4.3 Multimodal Invocation (V2.5 only) + +**Image Understanding:** + +```python Example +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY") + +response = client.chat.completions.create( + model="XiaomiMiMo/MiMo-V2.5", + messages=[{ + "role": "user", + "content": [ + {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sgl-test-files/refs/heads/main/images/man_ironing_on_back_of_suv.png"}}, + {"type": "text", "text": "Describe this image in detail."} + ] + }] +) + +print(response.choices[0].message.content) +``` + +**Output Example:** + +```text +Based on the image provided, here is a detailed description: + +The image captures a whimsical or surreal scene set on a busy city street, likely in New York City given the iconic yellow cabs. In the center foreground, a man is sitting on a folding chair, casually crossing his legs. He is wearing a bright yellow hoodie with a graphic on the front and blue jeans. He is intently focused on ironing a white dress shirt that rests on an ironing board set up directly on the asphalt. + +Behind him, a yellow SUV taxi cab is stopped or moving slowly, angled slightly away from the camera. 
To his left, another yellow taxi sedan is captured in motion blur, indicating it is driving past him. The background features tall city buildings with glass windows and storefronts. There are banners hanging from streetlights, and some greenery is visible in the distance. The overall impression is one of incongruity—performing a domestic chore like ironing in the middle of a chaotic urban environment. +``` + +**Video Understanding:** + +```python Example +response = client.chat.completions.create( + model="XiaomiMiMo/MiMo-V2.5", + messages=[{ + "role": "user", + "content": [ + {"type": "video_url", "video_url": {"url": "https://videos.pexels.com/video-files/4114797/4114797-uhd_3840_2160_25fps.mp4"}}, + {"type": "text", "text": "Summarize what happens in this video."} + ] + }] +) + +print(response.choices[0].message.content) +``` + +**Output Example:** + +```text +A person wearing blue protective gloves is shown operating a microscope in a close-up shot. The individual is adjusting a knob on the side of the microscope, which moves the stage holding a glass slide, likely focusing the lens on the specimen. +``` + +> Video decoding requires `decord` (`pip install decord`); SGLang's MiMo-V2.5 multimodal processor uses `decord.VideoReader` for frame extraction. + +**Audio Understanding:** + +```python Example +response = client.chat.completions.create( + model="XiaomiMiMo/MiMo-V2.5", + messages=[{ + "role": "user", + "content": [ + {"type": "audio_url", "audio_url": {"url": "https://raw.githubusercontent.com/sgl-project/sgl-test-files/refs/heads/main/audios/Trump_WEF_2018_10s.mp3"}}, + {"type": "text", "text": "Transcribe and summarize this audio."} + ] + }] +) + +print(response.choices[0].message.content) +``` + +**Output Example:** + +```text +**Transcript:** +"Thank you Klaus very much. It's a privilege to be here at this forum where leaders in business, science, art, diplomacy and world affairs have gathered for..." 
+ +**Summary:** +The speaker thanks Klaus for the introduction and expresses their honor at attending a forum. They highlight that the event has brought together high-level leaders from various sectors, including business, science, art, and diplomacy. +``` + +### 4.4 Tool Calling + +```python Example +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="EMPTY" +) + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string", "description": "City name"}, + "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} + }, + "required": ["location"] + } + } + } +] + +response = client.chat.completions.create( + model="XiaomiMiMo/MiMo-V2.5", + messages=[{"role": "user", "content": "What's the weather in Beijing?"}], + tools=tools +) + +msg = response.choices[0].message +if msg.reasoning_content: + print("=== Reasoning ===") + print(msg.reasoning_content) +if msg.tool_calls: + print("=== Tool Calls ===") + for tc in msg.tool_calls: + print(f" Function: {tc.function.name}") + print(f" Arguments: {tc.function.arguments}") +``` + +**Output Example (MiMo-V2.5):** + +```text +=== Reasoning === +The user wants to know the weather in Beijing. I have a function available called "get_weather" that can retrieve current weather for a location. Let me call that function with Beijing as the location. +=== Tool Calls === + Function: get_weather + Arguments: {"location": "Beijing"} +``` + +## 5. Benchmark + +Accuracy numbers come from `sglang.test.run_eval` (GSM8K standard 5-shot, MMMU validation split). Speed numbers come from `sglang.bench_serving` with generated random prompts; text runs use 1024 input tokens and 1024 output tokens per request, and the image run uses 2 random 720p images per request. 
+ +### 5.1 Accuracy Benchmark + +#### 5.1.1 GSM8K + +Standard 5-shot, `temperature=0`, `max_tokens=4096`, model defaults to thinking-on (responses contain `...` and the eval extracts the trailing number via regex). Server launch: see [Section 3](#3-model-deployment). + +**Benchmark Command:** + +```shell Command +python3 -m sglang.test.run_eval \ + --base-url http://127.0.0.1:30000 \ + --model XiaomiMiMo/MiMo-V2.5 \ + --eval-name gsm8k \ + --num-examples 200 \ + --num-threads 8 \ + --max-tokens 4096 \ + --temperature 0.0 +``` + +> `run_eval.py` automatically appends `/v1` to `--base-url`; pass the bare `host:port` URL (without trailing `/v1`), otherwise requests resolve to `/v1/v1/chat/completions` and 404. + +- **Test Results:** + - MiMo-V2.5-Pro (FP8) + ``` + Pending update + ``` + - MiMo-V2.5 (FP8, 8× H200) + ``` + Score: 0.980 (196 / 200) + Latency: 477.52 s + Output throughput: 88.9 tok/s + ``` + +#### 5.1.2 MMMU (V2.5 only) + +`MMMU/MMMU` validation split (multi-discipline multimodal), `concurrency=16`, default sampling. + +- **Benchmark Command:** + +```shell Command +python3 benchmark/mmmu/bench_sglang.py \ + --port 30000 \ + --model XiaomiMiMo/MiMo-V2.5 \ + --concurrency 16 +``` + +- **Test Results:** + - MiMo-V2.5 (FP8) + ``` + Pending update + ``` + +### 5.2 Speed Benchmark — MiMo-V2.5-Pro + +**Test Environment:** + +- Hardware: NVIDIA B200 GPU (8×) +- Model: `XiaomiMiMo/MiMo-V2.5-Pro` (FP8) +- Tensor Parallelism: 8 +- Recipe: Balanced (DP-attn + DeepEP + EAGLE MTP) +- sglang version: Pending update + +#### 5.2.1 Latency-Sensitive Benchmark + +- **Model Deployment Command:** see the [command panel above](#3-model-deployment). 
+- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model XiaomiMiMo/MiMo-V2.5-Pro \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +Pending update — replace with real bench_serving output after the latency run. +``` + +#### 5.2.2 Throughput-Sensitive Benchmark + +- **Model Deployment Command:** see the [command panel above](#3-model-deployment). +- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model XiaomiMiMo/MiMo-V2.5-Pro \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +Pending update — replace with real bench_serving output after the throughput run. +``` + +### 5.3 Speed Benchmark — MiMo-V2.5 + +**Test Environment:** + +- Hardware: NVIDIA H200 GPU (8×) +- Model: `XiaomiMiMo/MiMo-V2.5` (FP8) +- Tensor Parallelism: 8 (DP-attention with `--dp 2`) +- Recipe: Balanced (DP-attn + EAGLE MTP) +- sglang version: `0.0.0.dev1+g7d99af439` (`lmsysorg/sglang:dev-mimo-v2.5`) + +#### 5.3.1 Latency-Sensitive Benchmark + +- **Model Deployment Command:** select MiMo-V2.5, H200, and EAGLE MTP in the [command panel above](#3-model-deployment). 
+- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model XiaomiMiMo/MiMo-V2.5 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 14.72 +Total input tokens: 1997 +Total input text tokens: 1997 +Total generated tokens: 2798 +Total generated tokens (retokenized): 2697 +Request throughput (req/s): 0.68 +Input token throughput (tok/s): 135.67 +Output token throughput (tok/s): 190.09 +Peak output token throughput (tok/s): 245.00 +Peak concurrent requests: 3 +Total token throughput (tok/s): 325.77 +Concurrency: 1.00 +Accept length: 3.08 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1469.98 +Median E2E Latency (ms): 1652.84 +P90 E2E Latency (ms): 2210.80 +P99 E2E Latency (ms): 2823.86 +---------------Time to First Token---------------- +Mean TTFT (ms): 143.89 +Median TTFT (ms): 99.25 +P99 TTFT (ms): 481.01 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 4.87 +Median TPOT (ms): 4.30 +P99 TPOT (ms): 6.64 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 4.76 +Median ITL (ms): 3.46 +P95 ITL (ms): 13.52 +P99 ITL (ms): 13.84 +Max ITL (ms): 74.37 +================================================== +``` + +#### 5.3.2 Throughput-Sensitive Benchmark + +- **Model Deployment Command:** select MiMo-V2.5, H200, and EAGLE MTP in the [command panel above](#3-model-deployment). 
+- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 \ + --port 30000 \ + --model XiaomiMiMo/MiMo-V2.5 \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: inf +Max request concurrency: 100 +Successful requests: 1000 +Benchmark duration (s): 93.41 +Total input tokens: 302118 +Total input text tokens: 302118 +Total generated tokens: 195775 +Total generated tokens (retokenized): 188139 +Request throughput (req/s): 10.71 +Input token throughput (tok/s): 3234.48 +Output token throughput (tok/s): 2095.97 +Peak output token throughput (tok/s): 3019.00 +Peak concurrent requests: 121 +Total token throughput (tok/s): 5330.45 +Concurrency: 91.04 +Accept length: 2.95 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 8503.45 +Median E2E Latency (ms): 7491.96 +P90 E2E Latency (ms): 13706.99 +P99 E2E Latency (ms): 20474.33 +---------------Time to First Token---------------- +Mean TTFT (ms): 4399.20 +Median TTFT (ms): 4333.35 +P99 TTFT (ms): 8004.81 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 58.23 +Median TPOT (ms): 21.78 +P99 TPOT (ms): 747.79 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 20.06 +Median ITL (ms): 15.28 +P95 ITL (ms): 48.36 +P99 ITL (ms): 96.99 +Max ITL (ms): 969.61 +================================================== +``` + +#### 5.3.3 Multimodal (Image) Benchmark + +- **Model Deployment Command:** select MiMo-V2.5, H200, and EAGLE MTP in the [command panel above](#3-model-deployment). 
+- Benchmark Command: + +```shell Command +python3 -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --host 127.0.0.1 \ + --port 30000 \ + --model XiaomiMiMo/MiMo-V2.5 \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 \ + --random-output-len 1024 \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +- **Test Results:** + +```text Output +============ Serving Benchmark Result ============ +Backend: sglang-oai-chat +Traffic request rate: inf +Max request concurrency: 1 +Successful requests: 10 +Benchmark duration (s): 25.73 +Total input tokens: 661 +Total input text tokens: 631 +Total input vision tokens: 30 +Total generated tokens: 4220 +Total generated tokens (retokenized): 0 +Request throughput (req/s): 0.39 +Input token throughput (tok/s): 25.69 +Output token throughput (tok/s): 164.03 +Peak output token throughput (tok/s): 1.00 +Peak concurrent requests: 2 +Total token throughput (tok/s): 189.73 +Concurrency: 1.00 +Accept length: 2.94 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2570.74 +Median E2E Latency (ms): 2411.92 +P90 E2E Latency (ms): 3711.62 +P99 E2E Latency (ms): 4949.74 +---------------Time to First Token---------------- +Mean TTFT (ms): 0.00 +Median TTFT (ms): 0.00 +P99 TTFT (ms): 0.00 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 7.31 +Median TPOT (ms): 6.17 +P99 TPOT (ms): 17.18 +---------------Inter-Token Latency---------------- +Mean ITL (ms): 0.00 +Median ITL (ms): 0.00 +P95 ITL (ms): 0.00 +P99 ITL (ms): 0.00 +Max ITL (ms): 0.00 +================================================== +``` diff --git a/docs_new/cookbook/autoregressive/intro.mdx b/docs_new/cookbook/autoregressive/intro.mdx new file mode 100644 index 000000000000..3c172c4dfad4 --- /dev/null +++ b/docs_new/cookbook/autoregressive/intro.mdx @@ -0,0 +1,118 @@ +--- +title: Overview +mode: wide +description: Practical guides for deploying and using large language models and vision language models with SGLang. +metatags: + description: "Explore SGLang autoregressive model cookbooks for LLM and VLM deployment, invocation, optimization, and benchmarking examples." +--- + + + + + + + + + + + + + + + + + + + + + diff --git a/docs_new/cookbook/base/benchmarks/autoregressive_model_benchmark.mdx b/docs_new/cookbook/base/benchmarks/autoregressive_model_benchmark.mdx new file mode 100644 index 000000000000..6d04c632fb55 --- /dev/null +++ b/docs_new/cookbook/base/benchmarks/autoregressive_model_benchmark.mdx @@ -0,0 +1,287 @@ +--- +title: Autoregressive Model Benchmark Documentation +metatags: + description: "Benchmark LLM and VLM serving throughput and latency with sglang.bench_serving - supports SGLang, vLLM, and multiple datasets." +--- + +`sglang.bench_serving` is a command-line tool designed to benchmark the online serving throughput and latency of Large Language Models (LLMs) and Vision Language Models(VLMs). It supports various backends (`SGLang`, `vLLM`, etc.) and offers flexible configurations for request rates, dataset types, and profiling. + +## 1. Quick Start + +### Basic Usage (Random Data) + +Run a benchmark using randomly generated prompts with a local SGLang server. 
+ +```bash Command +python -m sglang.bench_serving --backend sglang --port 30000 --dataset-name random --num-prompts 100 +``` + +### Real-World Data (ShareGPT) + +Run a benchmark using the ShareGPT dataset with a specific request rate. + +```shell Command +python -m sglang.bench_serving \ + --backend sglang \ + --dataset-name sharegpt \ + --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \ + --num-prompts 1000 \ + --request-rate 10 +``` + +## 2. Parameter Reference + +### 2.1 Backend & Server Configuration + +These parameters define the target server and the inference engine being used. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description |
| --- | --- |
| `--backend` | **Required.** Specifies the backend engine. Options: `sglang`, `sglang-native`, `sglang-oai`, `sglang-oai-chat`, `vllm`, `vllm-chat`, `lmdeploy`, `lmdeploy-chat`, `trt`, `gserver`, `truss`. |
| `--base-url` | The API base URL (if not using specific host/port flags). |
| `--host` | Server hostname. Default: `0.0.0.0`. |
| `--port` | Server port. If not set, it defaults to the specific backend's standard port. |
| `--model` | Model name or path. If unset, it queries `/v1/models` for configuration. |
| `--served-model-name` | The model name used in the API request body. Defaults to the value of `--model`. |
| `--tokenizer` | Path or name of the tokenizer. Defaults to the model configuration. |
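
When the same benchmark is launched against several servers, it can help to assemble the command line programmatically. The sketch below only builds an argument list from the flags described in this section and does not invoke the tool. The function name `build_bench_argv` is hypothetical, and rejecting `--base-url` together with `--host`/`--port` is an assumption of this sketch rather than documented behavior of `sglang.bench_serving`.

```python
import shlex


def build_bench_argv(backend, host=None, port=None, base_url=None, model=None):
    """Assemble an argv list for sglang.bench_serving (illustrative only)."""
    # Assumption of this sketch: treat base_url and host/port as alternatives.
    if base_url is not None and (host is not None or port is not None):
        raise ValueError("pass either base_url or host/port, not both")
    argv = ["python", "-m", "sglang.bench_serving", "--backend", backend]
    if base_url is not None:
        argv += ["--base-url", base_url]
    if host is not None:
        argv += ["--host", host]
    if port is not None:
        argv += ["--port", str(port)]
    if model is not None:
        argv += ["--model", model]
    return argv


if __name__ == "__main__":
    # Print a shell-ready command; pass the list to subprocess.run to execute it.
    print(shlex.join(build_bench_argv("sglang", host="127.0.0.1", port=30000)))
```

Returning a list instead of a string keeps quoting concerns out of the builder; `shlex.join` is used only for display.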
+ +### 2.2 Dataset Configuration + +Controls the source of the prompts used for benchmarking. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description |
| --- | --- |
| `--dataset-name` | The type of dataset. Options: `sharegpt`, `custom`, `random`, `random-ids`, `generated-shared-prefix`, `mmmu`, `image`, `mooncake`. |
| `--dataset-path` | File path to the dataset (e.g., local JSON file for ShareGPT). |
| `--num-prompts` | Total number of prompts to process. Default: `1000`. |
| `--seed` | Random seed for reproducibility. |
| `--tokenize-prompt` | Uses integer IDs instead of strings for inputs. Useful for precise length control. |
+ +### 2.3 Input/Output Length Control + +Parameters to control the shape of requests (context length and generation length). + +#### For Random/Image Datasets: + +- `--random-input-len`: Number of input tokens per request. +- `--random-output-len`: Number of output tokens per request. +- `--random-range-ratio`: Range ratio for sampling input/output lengths. + +#### For ShareGPT Dataset: + +- `--sharegpt-output-len`: Overrides the output length defined in the dataset for each request. +- `--sharegpt-context-len`: Max context length. Requests exceeding this are dropped. + +#### General Request Modifiers: + +- `--extra-request-body`: Appends a JSON object to the request payload (e.g., \{"key": "value"\}). Useful for passing sampling parameters. +- `--prompt-suffix`: A string suffix appended to all user prompts. +- `--disable-ignore-eos`: If set, the model will stop generation upon hitting the EOS token (benchmarks usually ignore EOS to force max generation length). +- `--apply-chat-template`: Applies the model's chat template to the input. + +### 2.4 Traffic & Concurrency + +Controls how fast requests are sent to the server. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description |
| --- | --- |
| `--request-rate` | Requests per second (RPS). If `inf` (default), all requests are sent immediately (burst). Otherwise, arrival times follow a Poisson process. |
| `--max-concurrency` | The maximum number of active requests allowed at once. Even if `request-rate` is high, the client will hold back requests if this limit is reached. |
| `--warmup-requests` | Number of requests to run before the actual measurement begins to warm up the server. |
| `--flush-cache` | Flushes the server cache before starting the benchmark. |
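
The arrival behavior described for `--request-rate` can be illustrated with a short sketch. This is not the benchmark's actual code: the function name `arrival_offsets` and its parameters are hypothetical. It assumes that a Poisson process is realized by drawing exponentially distributed gaps with mean `1 / request_rate` seconds, and that `inf` collapses every send time to zero (burst).

```python
import math
import random


def arrival_offsets(num_requests, request_rate, seed=0):
    """Return the send time (seconds from start) of each request.

    Hypothetical helper for illustration only.
    """
    if math.isinf(request_rate):
        # Burst mode: every request is sent immediately.
        return [0.0] * num_requests

    rng = random.Random(seed)
    offsets = []
    now = 0.0
    for _ in range(num_requests):
        offsets.append(now)
        # Gaps between Poisson arrivals are exponentially distributed.
        now += rng.expovariate(request_rate)
    return offsets


burst = arrival_offsets(3, float("inf"))
paced = arrival_offsets(1000, 10.0)
```

With a finite rate, the average gap approaches `1 / request_rate`. A concurrency limit would act as a separate cap on how many of these requests are in flight at once.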
+ +### 2.5 Output & Logging + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description |
| --- | --- |
| `--output-file` | Path to save the results in JSONL format. |
| `--output-details` | Includes detailed metrics in the output. |
| `--print-requests` | Prints requests to stdout as they are sent (useful for debugging). |
| `--disable-tqdm` | Hides the progress bar. |
| `--disable-stream` | Disables streaming mode (waits for full response). |
| `--return-logprob` | Requests logprobs from the server. |
| `--tag` | An arbitrary string tag added to the output file for identification. |
+ +### 2.6 Advanced + +#### 2.6.1 Image / Multi-modal + +Only applicable when --dataset-name is set to image. + +- `--image-count`: Number of images per request. +- `--image-resolution`: Resolution (e.g., 1080p, 4k, or custom 1080x1920). +- `--image-format`: jpeg or png. +- `--image-content`: random (noise) or blank. + +#### 2.6.2 LoRA Benchmarking + +Used to simulate multi-LoRA serving scenarios. + +- `--lora-name`: A list of LoRA adapter names (e.g., `--lora-name` adapter1 adapter2). +- `--lora-request-distribution`: How requests are assigned to adapters: + - `uniform`: Equal probability. + - `distinct`: New adapter for every request. + - `skewed`: Follows a Zipf distribution (simulating hot/cold adapters). +- `--lora-zipf-alpha`: The alpha parameter for the Zipf distribution (if `skewed` is used). + +#### 2.6.3 Profiling + +Tools for deep performance analysis. + +- `--profile`: Enables Torch Profiler (Requires `SGLANG_TORCH_PROFILER_DIR` env var on server). +- `--plot-throughput`: Generates throughput/concurrency plots (requires `termplotlib` and `gnuplot`). +- `--profile-activities`: Activities to profile (CPU, GPU, CUDA_PROFILER). +- `--profile-num-steps`: Number of steps to profile. +- `--profile-by-stage` / `--profile-stages`: Profile specific processing stages. + +#### 2.6.4 PD Disaggregation + +For benchmarking Prefill-Decode (PD) separated architectures. + +- `--pd-separated`: Enable PD disaggregation benchmarking. +- `--profile-prefill-url`: URL(s) of prefill workers for profiling. +- `--profile-decode-url`: URL(s) of decode workers for profiling. + +Note: In PD mode, `prefill` and `decode` must be profiled separately. + +### 2.7 Specialized Datasets + +#### 2.7.1 Generated Shared Prefix (GSP): + +Designed to test system prompt caching/prefix sharing performance. + +- `--gsp-num-groups`: Number of unique system prompts. +- `--gsp-prompts-per-group`: How many user questions share the same system prompt. 
+- `--gsp-system-prompt-len`: Length of the shared prefix. +- `--gsp-fast-prepare`: Skips some statistics calculation for faster startup. + +#### 2.7.2 Mooncake + +Designed for trace replay. + +- `--mooncake-slowdown-factor`: Slows down the trace replay (e.g., 2.0 = 2x slower). +- `--mooncake-num-rounds`: Number of conversation rounds (supports multi-turn). +- `--use-trace-timestamps`: Schedules requests based on timestamps found in the trace file. + +## 3. Metrics + +After running the benchmark, the tool generally reports: + +- `E2E` (End-to-End Latency): The total time from sending the request to receiving the final token. +- `TTFT` (Time To First Token): The time between sending the request and seeing the first word appear. This represents the Prefill time (processing the image and text prompt). +- `TPOT` (Time per Output Token): The average time it takes to generate one token (excluding the first one). This is calculated per request. +- `ITL` (Inter-Token Latency): The time gap between two distinct streaming packets. While TPOT is an average, ITL measures the "jitter" or smoothness of the stream. diff --git a/docs_new/cookbook/base/benchmarks/diffusion_model_benchmark.mdx b/docs_new/cookbook/base/benchmarks/diffusion_model_benchmark.mdx new file mode 100644 index 000000000000..4c8e7a0928aa --- /dev/null +++ b/docs_new/cookbook/base/benchmarks/diffusion_model_benchmark.mdx @@ -0,0 +1,235 @@ +--- +title: Diffusion Models Benchmark Documentation +metatags: + description: "Benchmark diffusion model serving throughput and latency with SGLang - supports image and video generation with flexible configurations." +--- + +`sglang.multimodal_gen.benchmarks.bench_serving` is a command-line tool designed to benchmark the online serving throughput and latency of Diffusion Models. It supports two backends (`sglang-image`, `sglang-video`) and offers flexible configurations for request rates, dataset types, and profiling. + +## 1. 
Quick Start + +### 1.1 Benchmarking in Low Concurrency + +Run a benchmark on a local server (port 30000) generating 1 video/image from the `vbench` dataset. + +```bash Command +# For text to video: such as Wan2.2-T2V-A14B-Diffusers +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task t2v --num-prompts 1 --max-concurrency 1 + +# For image to video: such as Wan2.2-I2V-A14B-Diffusers +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task i2v --num-prompts 1 --max-concurrency 1 + +# For image-text to video: such as Wan2.2-TI2V-5B-Diffusers +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task ti2v --num-prompts 1 --max-concurrency 1 + +# For text to image: such as Qwen-Image +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task t2i --num-prompts 1 --max-concurrency 1 + +# For image-text to image: such as Qwen-Image-Edit +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task ti2i --num-prompts 1 --max-concurrency 1 +``` + +### 1.2 Benchmarking in High Concurrency + +Run a benchmark on a local server (port 30000) generating 20 videos/images from the `vbench` dataset. 
+ +```bash Command +# For text to video: such as Wan2.2-T2V-A14B-Diffusers +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task t2v --num-prompts 20 --max-concurrency 20 + +# For image to video: such as Wan2.2-I2V-A14B-Diffusers +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task i2v --num-prompts 20 --max-concurrency 20 + +# For image-text to video: such as Wan2.2-TI2V-5B-Diffusers +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task ti2v --num-prompts 20 --max-concurrency 20 + +# For text to image: such as Qwen-Image +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task t2i --num-prompts 20 --max-concurrency 20 + +# For image-text to image: such as Qwen-Image-Edit +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task ti2i --num-prompts 20 --max-concurrency 20 +``` + +## 2. Parameter Reference + +### 2.1 Connection & Backend Settings + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Default | Description |
| --- | --- | --- |
| `--backend` | **Required** | The backend type to use. Choices: `sglang-image`, `sglang-video`. |
| `--base-url` | `None` | Base URL of the server (e.g., `http://localhost:30000`). If specified, this overrides `--host` and `--port`. |
| `--host` | `None` | The server host (e.g., `127.0.0.1`). |
| `--port` | `None` | The server port. |
| `--model` | `None` | Model name or path. |
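
The precedence between `--base-url`, `--host`, and `--port` can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: `resolve_base_url` is a hypothetical helper, and the fallback host `127.0.0.1` and port `30000` are assumptions taken from the examples on this page rather than documented defaults.

```python
def resolve_base_url(base_url=None, host=None, port=None,
                     fallback_host="127.0.0.1", fallback_port=30000):
    """Pick the server URL, letting base_url win over host and port.

    Hypothetical helper; the fallback values are assumptions.
    """
    if base_url:
        # An explicit base URL overrides host and port entirely.
        return base_url.rstrip("/")
    chosen_host = host if host is not None else fallback_host
    chosen_port = port if port is not None else fallback_port
    return f"http://{chosen_host}:{chosen_port}"


explicit = resolve_base_url(base_url="http://localhost:30000/", host="10.0.0.5", port=8000)
from_parts = resolve_base_url(host="10.0.0.5", port=8000)
fallback = resolve_base_url()
```

Passing only `--base-url` is usually the least ambiguous choice when the server sits behind a proxy or uses a non-default scheme.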
+ +### 2.2 Workload & Task Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Choices / Default | Description |
| --- | --- | --- |
| `--task` | `t2v`, `i2v`, `ti2v`, `t2i`, `ti2i` | Defines the generation task: `t2v` (Text-to-Video), `i2v` (Image-to-Video), `ti2v` (Text+Image-to-Video), `t2i` (Text-to-Image), `ti2i` (Text+Image-to-Image). |
| `--dataset` | `vbench`, `random` | The source of prompts/inputs. |
| `--dataset-path` | `None` | (Optional) Path to a local dataset file if not using built-in presets. |
| `--num-prompts` | `None` | The total number of prompts/requests to execute during the benchmark. |
+ +### 2.3 Generation Parameters + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description |
| --- | --- |
| `--width` | The target width for the generated image or video. |
| `--height` | The target height for the generated image or video. |
| `--num-frames` | Number of frames to generate (specific to video backends). |
| `--fps` | Frames per second (specific to video backends). |
+ +### 2.4 Concurrency & Load Control + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description |
| --- | --- |
| `--request-rate` | The number of requests initiated per second. If set to `inf`, all requests are sent immediately (burst). If set to a number, request arrival times follow a Poisson process. |
| `--max-concurrency` | The maximum number of requests allowed to execute simultaneously. This simulates a semaphore or upstream limit. Even if `request-rate` is high, the actual processing rate is capped by this value. |
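
The semaphore behavior described for `--max-concurrency` can be sketched with `asyncio`. This is a minimal illustration, not the benchmark's code: `run_with_limit` is a hypothetical function, and `asyncio.sleep` stands in for the real HTTP request. All requests are scheduled at once (as with an `inf` request rate), yet only `max_concurrency` of them run at any moment.

```python
import asyncio


async def run_with_limit(num_requests, max_concurrency, work_seconds=0.01):
    """Run fake requests under a concurrency cap and return the peak in flight.

    Hypothetical helper for illustration only.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    active = 0
    peak = 0

    async def one_request():
        nonlocal active, peak
        async with semaphore:  # blocks while the cap is reached
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(work_seconds)  # stand-in for the HTTP call
            active -= 1

    await asyncio.gather(*(one_request() for _ in range(num_requests)))
    return peak


capped_peak = asyncio.run(run_with_limit(num_requests=20, max_concurrency=4))
uncapped_peak = asyncio.run(run_with_limit(num_requests=3, max_concurrency=10))
```

When the cap is lower than the number of pending requests, the observed peak equals the cap, which is why a high request rate alone does not raise the effective load.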
+ +### 2.5 Logging & Output + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description |
| --- | --- |
| `--output-file` | Path to save the benchmark metrics (JSON format). |
| `--disable-tqdm` | If set, disables the progress bar in the console. |
+ +## 3. Metrics + +- `Request Throughput` (req/s), Output Throughput (tok/s) +- `Latency Mean` (s): Mean latency per request +- `Peak Memory Max` (MB): Maximum memory usage during the run diff --git a/docs_new/cookbook/base/reference/server_arguments.mdx b/docs_new/cookbook/base/reference/server_arguments.mdx new file mode 100644 index 000000000000..737a1b35f4d0 --- /dev/null +++ b/docs_new/cookbook/base/reference/server_arguments.mdx @@ -0,0 +1,46 @@ +--- +title: Server Arguments +metatags: + description: "SGLang server CLI arguments reference - tensor parallelism, data parallelism, expert parallelism, and configuration options." +--- + +This guide explains the parallelism configuration fields used in SGLang model configurations and how they map to SGLang server command-line arguments. + +## Quick Reference + +
| Config Field | SGLang CLI Argument | Description |
| --- | --- | --- |
| `tp` | `--tp-size`, `--tensor-parallel-size` | Tensor Parallelism - splits model across GPUs |
| `dp` | `--dp-size`, `--data-parallel-size` | Data Parallelism - runs multiple model replicas |
| `ep` | `--ep-size`, `--expert-parallel-size`, `--ep` | Expert Parallelism - distributes MoE experts |
| `enable_dp_attention` | `--enable-dp-attention` | DP for attention, TP for FFN (hybrid) |
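
The mapping from config fields to CLI arguments can be sketched as a small translation step. This is an illustrative sketch only: `FIELD_TO_FLAG` and `to_cli_args` are hypothetical names, the config is assumed to be a plain dictionary, and only the first flag spelling listed for each field is used.

```python
# Config field -> CLI flag, using the first spelling listed for each field.
FIELD_TO_FLAG = {
    "tp": "--tp-size",
    "dp": "--dp-size",
    "ep": "--ep-size",
    "enable_dp_attention": "--enable-dp-attention",
}


def to_cli_args(config):
    """Translate a parallelism config dict into a list of CLI arguments.

    Hypothetical helper for illustration only.
    """
    args = []
    for field, flag in FIELD_TO_FLAG.items():
        if field not in config:
            continue
        value = config[field]
        if isinstance(value, bool):
            # Boolean fields become bare switches and are omitted when false.
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


cli_args = to_cli_args({"tp": 8, "dp": 2, "enable_dp_attention": True})
```

Here `cli_args` would be appended to the server launch command; fields that are absent from the config produce no arguments.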
diff --git a/docs_new/cookbook/diffusion/FLUX/FLUX.mdx b/docs_new/cookbook/diffusion/FLUX/FLUX.mdx new file mode 100644 index 000000000000..ada4058ab1b0 --- /dev/null +++ b/docs_new/cookbook/diffusion/FLUX/FLUX.mdx @@ -0,0 +1,294 @@ +--- +title: FLUX +metatags: + description: "Deploy FLUX diffusion models with SGLang - 12B/32B rectified flow transformers for high-quality text-to-image generation." +--- + +import { FluxDeployment } from '/src/snippets/diffusion/flux-deployment.jsx'; + +## 1. Model Introduction + +[FLUX](https://blackforestlabs.ai/) is a family of rectified flow transformer models developed by Black Forest Labs for high-quality image generation from text descriptions. + +[FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. + +**Key Features:** + +- **Cutting-edge Output Quality**: Second only to the state-of-the-art FLUX.1 [pro] model +- **Competitive Prompt Following**: Matches the performance of closed-source alternatives +- **Guidance Distillation**: Trained using guidance distillation for improved efficiency +- **Open Weights**: Available for personal, scientific, and commercial purposes under the FLUX [dev] Non-Commercial License + +[FLUX.2-dev](https://huggingface.co/black-forest-labs/FLUX.2-dev) is a 32 billion parameter rectified flow transformer capable of generating, editing, and combining images based on text instructions. 
+ +**Key Features:** + +- **State-of-the-art Performance**: Leading open model in text-to-image generation, single-reference editing, and multi-reference editing +- **No Finetuning Required**: Character, object, and style reference without additional training in one model +- **Guidance Distillation**: Trained using guidance distillation for improved efficiency +- **Open Weights**: Available for personal, scientific, and commercial purposes under the FLUX [dev] Non-Commercial License + +For more details, please refer to the [FLUX.1-dev HuggingFace page](https://huggingface.co/black-forest-labs/FLUX.1-dev), [FLUX.2-dev HuggingFace page](https://huggingface.co/black-forest-labs/FLUX.2-dev), and the [official blog post](https://blackforestlabs.ai/announcing-black-forest-labs/). + +## 2. SGLang-diffusion Installation + +SGLang-diffusion offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang-diffusion installation guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/install.md) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +FLUX models are optimized for high-quality image generation. The recommended launch configurations vary by hardware and model version. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model version. SGLang supports serving FLUX on NVIDIA B200, H200, H100, and AMD MI355X, MI325X, MI300X GPUs. + + + +### 3.2 Configuration Tips + +The currently supported optimizations are all listed [here](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/support_matrix.md). 
+ +- `--vae-path`: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE will be loaded from the main model path. +- `--num-gpus`: Number of GPUs to use +- `--tp-size`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster) +- `--sp-degree`: Sequence parallelism size (typically should match the number of GPUs) +- `--ulysses-degree`: The degree of DeepSpeed-Ulysses-style SP in USP +- `--ring-degree`: The degree of ring attention-style SP in USP + +## 4. API Usage + +For complete API documentation, please refer to the [official API usage guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/openai_api.md). + +### 4.1 Generate an Image + +```python Example +import base64 +from openai import OpenAI + +client = OpenAI(api_key="EMPTY", base_url="http://localhost:3000/v1") + +response = client.images.generate( + model="black-forest-labs/FLUX.1-dev", + prompt="A cat holding a sign that says hello world", + size="1024x1024", + n=1, + response_format="b64_json", +) + +# Save the generated image +image_bytes = base64.b64decode(response.data[0].b64_json) +with open("output.png", "wb") as f: + f.write(image_bytes) +``` + +### 4.2 Advanced Usage + +#### 4.2.1 Cache-DiT Acceleration + +SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set `SGLANG_CACHE_DIT_ENABLED=True` to enable it. For more details, please refer to the SGLang Cache-DiT [documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/cache_dit.md). 
+ +**Basic Usage** + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path black-forest-labs/FLUX.1-dev +``` + +**Advanced Usage** + +- DBCache Parameters: DBCache controls block-level caching behavior: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
  | Parameter | Env Variable | Default | Description |
  | --- | --- | --- | --- |
  | Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
  | Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
  | W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
  | R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
  | MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
+- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
  | Parameter | Env Variable | Default | Description |
  | --- | --- | --- | --- |
  | Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
  | Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
+ + Combined Configuration Example: + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang serve --model-path black-forest-labs/FLUX.1-dev +``` + +#### 4.2.2 CPU Offload + +- `--dit-cpu-offload`: Use CPU offload for DiT inference. Enable this if you run out of memory. +- `--text-encoder-cpu-offload`: Use CPU offload for text encoder inference. +- `--vae-cpu-offload`: Use CPU offload for VAE. +- `--pin-cpu-memory`: Pin memory for CPU offload. Use this only as a temporary workaround if you hit "CUDA error: invalid argument". + +## 5. Benchmark + +### 5.1 Speedup Benchmark + +#### 5.1.1 Generate an image + +Test Environment: + +- Hardware: NVIDIA B200 GPU (1x) +- Model: black-forest-labs/FLUX.1-dev +- sglang diffusion version: 0.5.6.post2 + +**Server Command**: + +```shell Command +sglang serve --model-path black-forest-labs/FLUX.1-dev --port 30000 +``` + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task t2v --num-prompts 1 --max-concurrency 1 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Backend: sglang-image +Model: black-forest-labs/FLUX.1-dev +Dataset: vbench +Task: t2v +-------------------------------------------------- +Benchmark duration (s): 50.97 +Request rate: inf +Max request concurrency: 1 +Successful requests: 1/1 +-------------------------------------------------- +Request throughput (req/s): 0.02 +Latency Mean (s): 50.9681 +Latency Median (s): 50.9681 +Latency P99 (s): 50.9681 +-------------------------------------------------- +Peak Memory Max (MB): 27905.19 +Peak Memory Mean (MB): 27905.19 +Peak Memory Median (MB): 27905.19 +============================================================ +``` + +#### 5.1.2 
Generate images with high concurrency + +**Server Command** : + +```shell Command +sglang serve --model-path black-forest-labs/FLUX.1-dev --port 30000 +``` + +**Benchmark Command** : + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task t2v --num-prompts 20 --max-concurrency 20 +``` + +**Result** : + +```text Output +================= Serving Benchmark Result ================= +Backend: sglang-image +Model: black-forest-labs/FLUX.1-dev +Dataset: vbench +Task: t2v +-------------------------------------------------- +Benchmark duration (s): 111.79 +Request rate: inf +Max request concurrency: 20 +Successful requests: 20/20 +-------------------------------------------------- +Request throughput (req/s): 0.18 +Latency Mean (s): 67.0646 +Latency Median (s): 66.9691 +Latency P99 (s): 110.8949 +-------------------------------------------------- +Peak Memory Max (MB): 27917.19 +Peak Memory Mean (MB): 27916.59 +Peak Memory Median (MB): 27917.19 +============================================================ +``` diff --git a/docs_new/cookbook/diffusion/LTX/LTX.mdx b/docs_new/cookbook/diffusion/LTX/LTX.mdx new file mode 100644 index 000000000000..967843e2e2c3 --- /dev/null +++ b/docs_new/cookbook/diffusion/LTX/LTX.mdx @@ -0,0 +1,206 @@ +--- +title: LTX +description: Run LTX-2 and LTX-2.3 video generation pipelines with SGLang Diffusion. +metatags: + description: "Deploy and use LTX-2 and LTX-2.3 video generation models with SGLang Diffusion, including one-stage, two-stage, HQ, TI2V, and LoRA examples." +--- + +import { LTXDeployment } from '/src/snippets/diffusion/ltx-deployment.jsx'; + +## 1. Model Introduction + +[LTX-2](https://huggingface.co/Lightricks/LTX-2) and [LTX-2.3](https://huggingface.co/Lightricks/LTX-2.3) are video generation models from Lightricks. 
SGLang Diffusion supports the LTX series through native one-stage and two-stage pipelines for text-to-video and image-conditioned video generation. + +Use `Lightricks/LTX-2` or `Lightricks/LTX-2.3` as `--model-path`. For two-stage generation, SGLang uses the spatial upsampler and distilled LoRA components from the model snapshot by default. LTX-2.3 also supports the HQ two-stage variant. + + +**License notice:** LTX-2 and LTX-2.3 are released under the LTX-2 Community License Agreement, not Apache 2.0. The license includes commercial-use restrictions for some entities. Review the [official Lightricks license](https://huggingface.co/Lightricks/LTX-2.3/blob/main/LICENSE) before production or commercial use; SGLang support does not grant additional model usage rights. + + +## 2. SGLang-diffusion Installation + +Install SGLang with diffusion dependencies: + +```bash Command +uv pip install "sglang[diffusion]" --prerelease=allow +``` + +For platform-specific setup, see the [SGLang Diffusion installation guide](/docs/sglang-diffusion/installation). + +## 3. Model Deployment + +This section provides deployment configurations optimized for different LTX pipelines and hardware targets. + +### 3.1 Basic Configuration + +The LTX series supports one-stage and two-stage pipelines. LTX-2.3 also supports the HQ two-stage pipeline. The recommended launch configuration depends on whether the target GPU can keep both two-stage DiTs resident. + +**Interactive Command Generator**: Use the configuration selector below to generate a deployment command. The default selection targets a single NVIDIA H200 with `resident` two-stage mode, which is the fastest startup path for the specified high-memory environment. + + + +### 3.2 Configuration Tips + +Choose the pipeline class based on the quality and latency target: + +| Use case | Pipeline class | Notes | +| --- | --- | --- | +| One-stage generation | `LTX2Pipeline` | Fastest LTX native path. Supports T2V and TI2V. 
| +| Two-stage generation | `LTX2TwoStagePipeline` | Uses a base stage and a refinement stage. Supported by LTX-2 and LTX-2.3. | +| Two-stage High Quality (HQ) generation | `LTX2TwoStageHQPipeline` | LTX-2.3 HQ path; defaults to 1920x1088 unless you override `--width` and `--height`. | + +Feature compatibility: + +| Pipeline class | T2V | TI2V (`--image-path`) | LoRA (`--lora-path`) | Notes | +| --- | --- | --- | --- | --- | +| `LTX2Pipeline` | Yes | Yes | Yes | One-stage path. Cannot be combined with HQ because HQ is a separate two-stage pipeline class. | +| `LTX2TwoStagePipeline` | Yes | Yes | Yes | Standard two-stage path for LTX-2 and LTX-2.3. | +| `LTX2TwoStageHQPipeline` | Yes | Yes | Yes | High Quality two-stage path for LTX-2.3. Use this instead of `LTX2Pipeline`; it is not a one-stage mode flag. | + +For two-stage pipelines, `--ltx2-two-stage-device-mode` controls transformer residency: + +| Mode | When to use it | +| --- | --- | +| `snapshot` | Recommended default. Balances latency and VRAM. | +| `resident` | Best latency on high-VRAM GPUs because both DiTs can stay resident. | +| `original` | Closest to the original two-stage switching semantics. | + +Other deployment flags: + +- `--lora-path`: Preload a community LoRA adapter. +- `--lora-weight-name`: Select the exact safetensors file when the LoRA repository contains multiple weight files. + + +For native LTX-2.3 two-stage serving without a user LoRA, `resident` is the fastest high-VRAM path. When you pass `--lora-path`, SGLang still applies the user LoRA during the two-stage switch, so use `resident` on H200-class GPUs for enough VRAM, but do not expect the same premerged-stage2 benefit as the no-user-LoRA path. + + +## 4. 
Model Invocation + +### 4.1 Basic Usage + +The examples below spell out the current SGLang sampling defaults for reproducibility: + +| Model path | Default output | Default frames | Default steps | +| --- | --- | --- | --- | +| `Lightricks/LTX-2` | 768x512 | 121 | 40 | +| `Lightricks/LTX-2.3` | 768x512 | 121 | 30 | +| `Lightricks/LTX-2.3` with `LTX2TwoStageHQPipeline` | 1920x1088 | 121 | 15 | + +#### 4.1.1 LTX-2 one-stage text-to-video + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2 \ + --pipeline-class-name LTX2Pipeline \ + --prompt "A quiet coastal town at sunrise, fishing boats moving slowly through golden mist, cinematic camera movement" \ + --save-output +``` + +#### 4.1.2 LTX-2.3 one-stage text-to-video + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --pipeline-class-name LTX2Pipeline \ + --prompt "A quiet coastal town at sunrise, fishing boats moving slowly through golden mist, cinematic camera movement" \ + --save-output +``` + +#### 4.1.3 LTX-2 two-stage text-to-video + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2 \ + --pipeline-class-name LTX2TwoStagePipeline \ + --prompt "A handheld shot follows a red tram crossing a rainy city square at night, reflections on the pavement, cinematic lighting" \ + --save-output +``` + +#### 4.1.4 LTX-2.3 two-stage text-to-video + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --pipeline-class-name LTX2TwoStagePipeline \ + --prompt "A handheld shot follows a red tram crossing a rainy city square at night, reflections on the pavement, cinematic lighting" \ + --save-output +``` + +#### 4.1.5 LTX-2.3 HQ text-to-video + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --pipeline-class-name LTX2TwoStageHQPipeline \ + --prompt "A wide cinematic shot of alpine clouds rolling over a mountain ridge, soft morning light, slow aerial camera movement" \ + --save-output +``` + +#### 4.1.6 Image-to-video with one 
reference image + +Pass one image to `--image-path` for image-conditioned generation: + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --pipeline-class-name LTX2TwoStagePipeline \ + --image-path ./inputs/start.png \ + --prompt "The camera slowly pushes forward as the subject turns toward warm window light, subtle natural motion, cinematic" \ + --save-output +``` + +#### 4.1.7 First-to-last-frame transition with two reference images + +Pass two images to `--image-path` for transition-style TI2V. The first image is used as the starting condition and the second image is used as the ending condition. + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --pipeline-class-name LTX2TwoStagePipeline \ + --image-path ./inputs/start.png ./inputs/end.png \ + --prompt "A smooth cinematic transition from the first scene into the final scene, dynamic camera motion, motion blur, zhuanchang" \ + --save-output +``` + +### 4.2 Advanced Usage + +#### 4.2.1 Use community LoRAs + +Use `--lora-path` to load a LoRA adapter. If the Hugging Face repo contains multiple safetensors files, use `--lora-weight-name` to select the exact file. `--lora-scale` maps to the standard LoRA merge scale and defaults to `1.0`. + +The following example uses [`valiantcat/LTX-2.3-Transition-LORA`](https://huggingface.co/valiantcat/LTX-2.3-Transition-LORA): + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --pipeline-class-name LTX2TwoStagePipeline \ + --lora-path valiantcat/LTX-2.3-Transition-LORA \ + --lora-weight-name ltx2.3-transition.safetensors \ + --prompt "A low-angle tracking shot moves through a foggy forest road. 
The camera rises above the treetops and transitions into a clear view of a snowy mountain peak under bright sunlight, zhuanchang" \ + --save-output +``` + +You can combine the Transition LoRA with two reference images: + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --pipeline-class-name LTX2TwoStagePipeline \ + --image-path ./inputs/start.png ./inputs/end.png \ + --lora-path valiantcat/LTX-2.3-Transition-LORA \ + --lora-weight-name ltx2.3-transition.safetensors \ + --prompt "A fast cinematic transition from the first image to the second image, whip-pan motion, atmospheric lighting, zhuanchang" \ + --save-output +``` + + +Some community LoRAs only include weights for transformer blocks. In that case, SGLang logs a concise coverage summary and leaves unmatched LoRA-capable layers on the base model weights. This is expected when the adapter format intentionally omits those layers. + + +## 5. Practical Tips + +- Use `--pipeline-class-name LTX2TwoStagePipeline` as the default LTX two-stage quality path. +- Use `--pipeline-class-name LTX2TwoStageHQPipeline` when you want the HQ path and have enough VRAM for larger outputs. +- Use `--ltx2-two-stage-device-mode resident` on high-VRAM GPUs if latency matters more than memory usage. +- Use `--ltx2-two-stage-device-mode original` when comparing against official two-stage behavior. +- Keep `--width` and `--height` aligned with the target model resolution; for LTX models, these are output video dimensions. diff --git a/docs_new/cookbook/diffusion/MOVA/MOVA.mdx b/docs_new/cookbook/diffusion/MOVA/MOVA.mdx new file mode 100644 index 000000000000..8f7235b8210d --- /dev/null +++ b/docs_new/cookbook/diffusion/MOVA/MOVA.mdx @@ -0,0 +1,268 @@ +--- +title: MOVA +metatags: + description: "Deploy MOVA with SGLang - simultaneous video and audio generation with asymmetric dual-tower architecture, precise lip-sync, and environment-aware sound effects." +--- + +## 1. 
Model Introduction + +[MOVA](https://github.com/OpenMOSS/MOVA) (MOSS Video and Audio) is a foundation model developed by the SII-OpenMOSS Team, designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that generate sound as an afterthought, MOVA synthesizes video and audio simultaneously in a single inference pass for perfect alignment. It adopts an Asymmetric Dual-Tower Architecture, fusing pre-trained video and audio towers through a bidirectional cross-attention mechanism to maintain tight synchronization between video and audio during generation. + +[MOVA-360p](https://huggingface.co/OpenMOSS-Team/MOVA-360p) is suitable for fast inference and resource-constrained environments. [MOVA-720p](https://huggingface.co/OpenMOSS-Team/MOVA-720p) provides higher resolution video generation. Both versions support generating up to 8 seconds of video-audio content. + +**Key Features:** + +- **Native Bimodal Generation**: Generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation from cascaded pipelines +- **Precise Lip-Sync**: Achieves state-of-the-art performance in multilingual lip-synchronization (LSE-D: 7.094, LSE-C: 7.452 with Dual CFG on Verse-Bench Set3) +- **Environment-Aware Sound Effects**: Generates corresponding environmental sound effects including physical interaction sounds, ambient sounds, and spatial/textural sound feedback +- **Fully Open-Source**: Model weights, inference code, training pipelines, and LoRA fine-tuning scripts are all open-sourced + +For more details, please refer to the [MOVA-360p HuggingFace page](https://huggingface.co/OpenMOSS-Team/MOVA-360p), the [MOVA-720p HuggingFace page](https://huggingface.co/OpenMOSS-Team/MOVA-720p), the [GitHub repository](https://github.com/OpenMOSS/MOVA), and the [technical report (arXiv)](https://arxiv.org/abs/2602.08794). + +## 2. SGLang-diffusion Installation + +SGLang-diffusion offers multiple installation methods. 
You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang-diffusion installation guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/install.md) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +MOVA supports both online serving and CLI generation modes. The recommended launch configurations vary by hardware and resolution. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform. + +import { MOVADeployment } from '/src/snippets/diffusion/mova-deployment.jsx' + + + +### 3.2 Configuration Tips + +Currently supported optimizations are listed [here](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/support_matrix.md). + +- `--num-gpus`: Number of GPUs to use +- `--tp`: Tensor parallelism size (should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster) +- `--ring-degree`: The degree of ring attention-style SP in USP +- `--ulysses-degree`: The degree of DeepSpeed-Ulysses-style SP in USP +- `--adjust-frames`: Whether to adjust frames automatically (set to `false` for MOVA) +- `--enable-torch-compile`: Enable torch.compile for faster inference + +## 4. API Usage + +For complete API documentation, please refer to the [official API usage guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/openai_api.md). + +### 4.1 CLI Generation (sglang generate) + +```bash Command +sglang generate \ + --model-path OpenMOSS-Team/MOVA-720p \ + --prompt "A man in a blue blazer and glasses speaks in a formal indoor setting, \ + framed by wooden furniture and a filled bookshelf. 
\ + Quiet room acoustics underscore his measured tone as he delivers his remarks. \ + At one point, he says, \"I would also believe that this advance in AI recently wasn't unexpected.\"" \ + --image-path "" \ + --adjust-frames false \ + --num-gpus 8 \ + --ring-degree 2 \ + --ulysses-degree 4 \ + --num-frames 193 \ + --fps 24 \ + --seed 67 \ + --num-inference-steps 25 \ + --enable-torch-compile \ + --save-output +``` + +### 4.2 Generate a Video + +```bash Command +curl -X POST "http://0.0.0.0:30002/v1/videos" \ + -F "prompt=A man in a blue blazer and glasses speaks in a formal indoor setting, framed by wooden furniture and a filled bookshelf. Quiet room acoustics underscore his measured tone as he delivers his remarks. At one point, he says, \"I would also believe that this advance in AI recently wasn't unexpected.\"" \ + -F "input_reference=@" \ + -F "size=640x352" \ + -F "num_frames=193" \ + -F "fps=24" \ + -F "seed=67" \ + -F "guidance_scale=5.0" \ + -F "num_inference_steps=25" \ + -o create_video.json +``` + +### 4.3 Advanced Usage + +#### 4.3.1 Cache-DiT Acceleration + +SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set `SGLANG_CACHE_DIT_ENABLED=True` to enable it. For more details, please refer to the SGLang Cache-DiT [documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/cache_dit.md). + +**Basic Usage** + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path OpenMOSS-Team/MOVA-720p +``` + +**Advanced Usage** + +- DBCache Parameters: DBCache controls block-level caching behavior: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
|-----------|--------------|---------|-------------|
| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
+ +- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion: + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
|-----------|--------------|---------|-------------|
| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
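
The two tables above map each Cache-DiT setting to an environment variable and a default. As a minimal sketch, assuming the server is launched from a Python script rather than a shell, the settings can be collected into a dictionary before being handed to a subprocess. The helper name `cache_dit_env` is hypothetical and not part of SGLang; only the variable names, the defaults, and the "order is 1 or 2" restriction come from the tables.

```python
# Documented defaults from the DBCache and TaylorSeer tables.
CACHE_DIT_DEFAULTS = {
    "SGLANG_CACHE_DIT_FN": 1,
    "SGLANG_CACHE_DIT_BN": 0,
    "SGLANG_CACHE_DIT_WARMUP": 4,
    "SGLANG_CACHE_DIT_RDT": 0.24,
    "SGLANG_CACHE_DIT_MC": 3,
    "SGLANG_CACHE_DIT_TAYLORSEER": False,
    "SGLANG_CACHE_DIT_TS_ORDER": 1,
}


def cache_dit_env(**overrides):
    """Hypothetical helper: build Cache-DiT environment variables as strings.

    Keyword names must be the documented variable names; anything else is
    rejected so a typo does not silently fall back to a default.
    """
    unknown = sorted(set(overrides) - set(CACHE_DIT_DEFAULTS))
    if unknown:
        raise KeyError(f"unknown Cache-DiT variables: {unknown}")

    settings = {**CACHE_DIT_DEFAULTS, **overrides}
    if settings["SGLANG_CACHE_DIT_TS_ORDER"] not in (1, 2):
        raise ValueError("SGLANG_CACHE_DIT_TS_ORDER must be 1 or 2")

    # Enabling Cache-DiT itself is a separate switch.
    env = {"SGLANG_CACHE_DIT_ENABLED": "true"}
    for name, value in settings.items():
        if isinstance(value, bool):
            env[name] = "true" if value else "false"
        else:
            env[name] = str(value)
    return env


if __name__ == "__main__":
    # Print the settings in the KEY=value form used by the shell examples.
    for key, value in cache_dit_env(SGLANG_CACHE_DIT_FN=2).items():
        print(f"{key}={value}")
```

The returned mapping can be merged with `os.environ` and passed as `env=` to `subprocess.run` when starting `sglang serve`; the launch itself is left out so the sketch runs without a GPU.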
+ + Combined Configuration Example: + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang serve --model-path OpenMOSS-Team/MOVA-720p +``` + +#### 4.3.2 CPU Offload + +- `--dit-cpu-offload`: Use CPU offload for DiT inference. Enable if you run out of memory. +- `--text-encoder-cpu-offload`: Use CPU offload for text encoder inference. +- `--vae-cpu-offload`: Use CPU offload for VAE. +- `--pin-cpu-memory`: Pin memory for CPU offload. Use only as a temporary workaround if you see "CUDA error: invalid argument". + +## 5. Benchmark + +### 5.1 Speedup Benchmark + +#### 5.1.1 Generate a video + +Test Environment: + +- Hardware: NVIDIA H200 x 8 +- git revision: 443b1a8 +- Model: OpenMOSS-Team/MOVA-720p + +**Server Command**: + +```bash Command +sglang serve --model-path OpenMOSS-Team/MOVA-720p --port 30002 \ + --adjust-frames false --num-gpus 8 --ring-degree 2 --ulysses-degree 4 \ + --tp 1 --enable-torch-compile +``` + +**Benchmark Command**: + +```bash Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --task image-to-video --dataset vbench --num-prompts 1 --max-concurrency 1 \ + --port 30002 +``` + +**Result**: +```text Output +================= Serving Benchmark Result ================= +Task: image-to-video +Model: OpenMOSS-Team/MOVA-720p +Dataset: vbench +-------------------------------------------------- +Benchmark duration (s): 590.76 +Request rate: inf +Max request concurrency: 1 +Successful requests: 1/1 +-------------------------------------------------- +Request throughput (req/s): 0.00 +Latency Mean (s): 590.7549 +Latency Median (s): 590.7549 +Latency P99 (s): 590.7549 +-------------------------------------------------- +Peak Memory Max (MB): 74996.00 +Peak Memory Mean (MB): 74996.00 +Peak Memory Median (MB): 74996.00 
+============================================================ +``` + +#### 5.1.2 Generate videos with high concurrency + +**Server Command**: + +```bash Command +sglang serve --model-path OpenMOSS-Team/MOVA-720p --port 30002 \ + --adjust-frames false --num-gpus 8 --ring-degree 2 --ulysses-degree 4 \ + --tp 1 --enable-torch-compile +``` + +**Benchmark Command**: + +```bash Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --task image-to-video --dataset vbench --num-prompts 20 --max-concurrency 20 \ + --port 30002 +``` diff --git a/docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image-Edit.mdx b/docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image-Edit.mdx new file mode 100644 index 000000000000..d833289e21f2 --- /dev/null +++ b/docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image-Edit.mdx @@ -0,0 +1,280 @@ +--- +title: Qwen-Image-Edit-2511 +metatags: + description: "Deploy Qwen-Image-Edit-2511 with SGLang - 20B image editing model with text rendering, character consistency, and geometric reasoning." +--- + +import { QwenImageEditDeployment } from '/src/snippets/diffusion/qwen-image-edit-deployment.jsx'; + +## 1. Model Introduction + +[Qwen-Image-Edit-2511](https://huggingface.co/Qwen/Qwen-Image-Edit-2511) is an enhanced version over Qwen-Image-Edit-2509, featuring multiple improvements—including notably better consistency. Built upon the 20B Qwen-Image model, Qwen-Image-Edit-2511 successfully extends Qwen-Image's unique text rendering capabilities to image editing tasks, enabling precise text editing. + +Key Enhancements in Qwen-Image-Edit-2511: + +- **Mitigate Image Drift**: Reduces unwanted changes in non-edited regions of the image. +- **Improved Character Consistency**: The model can perform imaginative edits based on an input portrait while preserving the identity and visual characteristics of the subject. 
+- **Multi-Person Consistency**: Enhanced consistency in multi-person group photos, enabling high-fidelity fusion of two separate person images into a coherent group shot. +- **Integrated LoRA Capabilities**: Selected popular community-created LoRAs are integrated directly into the base model, unlocking their effects without extra tuning (e.g., lighting enhancement, viewpoint generation). +- **Enhanced Industrial Design Generation**: Special attention to practical engineering scenarios, including batch industrial product design and material replacement for industrial components. +- **Strengthened Geometric Reasoning**: Stronger geometric reasoning capability for generating auxiliary construction lines for design or annotation purposes. + +For more details, please refer to the [official Qwen-Image-Edit-2511 HuggingFace page](https://huggingface.co/Qwen/Qwen-Image-Edit-2511), the [Blog](https://qwenlm.github.io/blog/qwen-image-edit-2511/), and the [Tech Report](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf). + +## 2. SGLang-diffusion Installation + +SGLang-diffusion offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang-diffusion installation guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/install.md) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +Qwen-Image-Edit-2511 is a 20B parameter model optimized for image editing tasks. The recommended launch configurations vary by hardware. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform. 
+ + + +### 3.2 Configuration Tips + +Currently supported optimizations are listed [here](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/support_matrix.md). + +- `--vae-path`: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE will be loaded from the main model path. +- `--num-gpus`: Number of GPUs to use +- `--tp-size`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster) +- `--sp-degree`: Sequence parallelism size (typically should match the number of GPUs) +- `--ulysses-degree`: The degree of DeepSpeed-Ulysses-style SP in USP +- `--ring-degree`: The degree of ring attention-style SP in USP + +## 4. API Usage + +For complete API documentation, please refer to the [official API usage guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/openai_api.md). + +### 4.1 Edit an Image + +```python Example +import base64 +from openai import OpenAI + +client = OpenAI(api_key="EMPTY", base_url="http://localhost:30000/v1") + +response = client.images.edit( + model="Qwen/Qwen-Image-Edit-2511", + image=open("input.png", "rb"), + prompt="Change the color of the taxi to black.", + n=1, + response_format="b64_json", +) + +# Save the edited image +image_bytes = base64.b64decode(response.data[0].b64_json) +with open("output.png", "wb") as f: + f.write(image_bytes) +``` + +### 4.2 Advanced Usage + +#### 4.2.1 Cache-DiT Acceleration + +SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set `SGLANG_CACHE_DIT_ENABLED=True` to enable it. For more details, please refer to the SGLang Cache-DiT [documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/cache_dit.md). 
+ +**Basic Usage** + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path Qwen/Qwen-Image-Edit-2511 +``` + +**Advanced Usage** + +- DBCache Parameters: DBCache controls block-level caching behavior: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
|-----------|--------------|---------|-------------|
| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
+- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
|-----------|--------------|---------|-------------|
| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
+ + Combined Configuration Example: + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang serve --model-path Qwen/Qwen-Image-Edit-2511 +``` + +#### 4.2.2 CPU Offload + +- `--dit-cpu-offload`: Use CPU offload for DiT inference. Enable if you run out of memory. +- `--text-encoder-cpu-offload`: Use CPU offload for text encoder inference. +- `--image-encoder-cpu-offload`: Use CPU offload for image encoder inference. +- `--vae-cpu-offload`: Use CPU offload for VAE. +- `--pin-cpu-memory`: Pin memory for CPU offload. Use only as a temporary workaround if you see "CUDA error: invalid argument". + +## 5. Benchmark + +Test Environment: + +- Hardware: NVIDIA B200 GPU (1x) +- Model: Qwen/Qwen-Image-Edit-2511 +- sglang diffusion version: 0.5.6.post2 + +### 5.1 Speedup Benchmark + +#### 5.1.1 Edit an image + +**Server Command**: + +```shell Command +sglang serve --model-path Qwen/Qwen-Image-Edit-2511 --port 30000 +``` + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task ti2i --num-prompts 1 --max-concurrency 1 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Backend: sglang-image +Model: Qwen/Qwen-Image-Edit-2511 +Dataset: vbench +Task: ti2i +-------------------------------------------------- +Benchmark duration (s): 35.31 +Request rate: inf +Max request concurrency: 1 +Successful requests: 1/1 +-------------------------------------------------- +Request throughput (req/s): 0.03 +Latency Mean (s): 35.3053 +Latency Median (s): 35.3053 +Latency P99 (s): 35.3053 +-------------------------------------------------- +Peak Memory Max (MB): 47959.35 +Peak Memory Mean (MB): 47959.35 +Peak Memory Median (MB): 47959.35 
+============================================================ +``` + +#### 5.1.2 Edit an image with high concurrency + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task ti2i --num-prompts 20 --max-concurrency 20 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Backend: sglang-image +Model: Qwen/Qwen-Image-Edit-2511 +Dataset: vbench +Task: ti2i +-------------------------------------------------- +Benchmark duration (s): 286.11 +Request rate: inf +Max request concurrency: 20 +Successful requests: 20/20 +-------------------------------------------------- +Request throughput (req/s): 0.07 +Latency Mean (s): 150.0428 +Latency Median (s): 150.0600 +Latency P99 (s): 283.3843 +-------------------------------------------------- +Peak Memory Max (MB): 47971.82 +Peak Memory Mean (MB): 47971.49 +Peak Memory Median (MB): 47971.29 +============================================================ +``` diff --git a/docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image.mdx b/docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image.mdx new file mode 100644 index 000000000000..dc1c0ecc5cf0 --- /dev/null +++ b/docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image.mdx @@ -0,0 +1,271 @@ +--- +title: Qwen-Image +metatags: + description: "Deploy Qwen-Image with SGLang - community contribution guide for Qwen's image generation model." +--- + +import { QwenImageDeployment } from '/src/snippets/diffusion/qwen-image-deployment.jsx'; + +## 1. Model Introduction + +[Qwen-Image](https://huggingface.co/Qwen/Qwen-Image) is a text-to-image diffusion model developed by the Qwen team. + +For more details, please refer to the [official Qwen-Image HuggingFace page](https://huggingface.co/Qwen/Qwen-Image), the [Blog](https://qwenlm.github.io/blog/qwen-image/), and the [Tech Report](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf). + +## 2. 
SGLang-diffusion Installation + +SGLang-diffusion offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang-diffusion installation guide](../../../docs/sglang-diffusion/installation) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +Qwen-Image is a text-to-image model. The recommended launch configurations vary by hardware. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform. + + + +### 3.2 Configuration Tips + +Currently supported optimizations are listed [here](../../../docs/sglang-diffusion/attention_backends#platform-support-matrix). + +- `--vae-path`: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE will be loaded from the main model path. +- `--num-gpus`: Number of GPUs to use +- `--tp-size`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster) +- `--sp-degree`: Sequence parallelism size (typically should match the number of GPUs) +- `--ulysses-degree`: The degree of DeepSpeed-Ulysses-style SP in USP +- `--ring-degree`: The degree of ring attention-style SP in USP + +**AMD ROCm Notes**: Requires SGLang >= v0.5.8. + +## 4. API Usage + +For complete API documentation, please refer to the [official API usage guide](../../../docs/sglang-diffusion/api/openai_api). 
+ +### 4.1 Generate an Image + +```python Example +import base64 +from openai import OpenAI + +client = OpenAI(api_key="EMPTY", base_url="http://localhost:30000/v1") + +response = client.images.generate( + model="Qwen/Qwen-Image", + prompt="A logo With Bold Large text: SGL Diffusion", + n=1, + response_format="b64_json", +) + +# Save the generated image +image_bytes = base64.b64decode(response.data[0].b64_json) +with open("output.png", "wb") as f: + f.write(image_bytes) +``` + +### 4.2 Advanced Usage + +#### 4.2.1 Cache-DiT Acceleration + +SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set `SGLANG_CACHE_DIT_ENABLED=True` to enable it. For more details, please refer to the SGLang Cache-DiT [documentation](../../../docs/sglang-diffusion/cache_dit). + +**Basic Usage** + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path Qwen/Qwen-Image +``` + +**Advanced Usage** + +- DBCache Parameters: DBCache controls block-level caching behavior: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
|-----------|--------------|---------|-------------|
| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
+- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
|-----------|--------------|---------|-------------|
| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
+ + Combined Configuration Example: + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang serve --model-path Qwen/Qwen-Image +``` + +#### 4.2.2 CPU Offload + +- `--dit-cpu-offload`: Use CPU offload for DiT inference. Enable if you run out of memory. +- `--text-encoder-cpu-offload`: Use CPU offload for text encoder inference. +- `--vae-cpu-offload`: Use CPU offload for VAE. +- `--pin-cpu-memory`: Pin memory for CPU offload. Use only as a temporary workaround if you see "CUDA error: invalid argument". + +## 5. Benchmark + +Test Environment: + +- Hardware: AMD Instinct MI300X GPU (1x) +- Model: Qwen/Qwen-Image +- Docker Image: lmsysorg/sglang:v0.5.8-rocm700-mi30x +- sglang diffusion version: 0.5.8 + +### 5.1 Speedup Benchmark + +#### 5.1.1 Generate an image + +**Server Command**: + +```shell Command +sglang serve --model-path Qwen/Qwen-Image \ + --ulysses-degree=1 --ring-degree=1 --port 30000 +``` + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task text-to-image --num-prompts 1 --max-concurrency 1 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Task: text-to-image +Model: Qwen/Qwen-Image +Dataset: vbench +-------------------------------------------------- +Benchmark duration (s): 29.04 +Request rate: inf +Max request concurrency: 1 +Successful requests: 1/1 +-------------------------------------------------- +Request throughput (req/s): 0.03 +Latency Mean (s): 29.0378 +Latency Median (s): 29.0378 +Latency P99 (s): 29.0378 +-------------------------------------------------- +Peak Memory Max (MB): 48018.83 +Peak Memory Mean (MB): 48018.83 +Peak Memory Median (MB): 48018.83 
+============================================================ +``` + +#### 5.1.2 Generate images with high concurrency + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task text-to-image --num-prompts 20 --max-concurrency 20 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Task: text-to-image +Model: Qwen/Qwen-Image +Dataset: vbench +-------------------------------------------------- +Benchmark duration (s): 300.79 +Request rate: inf +Max request concurrency: 20 +Successful requests: 14/20 +-------------------------------------------------- +Request throughput (req/s): 0.05 +Latency Mean (s): 154.5368 +Latency Median (s): 154.8363 +Latency P99 (s): 285.4603 +-------------------------------------------------- +Peak Memory Max (MB): 48030.31 +Peak Memory Mean (MB): 48030.30 +Peak Memory Median (MB): 48030.29 +============================================================ +``` diff --git a/docs_new/cookbook/diffusion/README.mdx b/docs_new/cookbook/diffusion/README.mdx new file mode 100644 index 000000000000..75e37534b2c7 --- /dev/null +++ b/docs_new/cookbook/diffusion/README.mdx @@ -0,0 +1,91 @@ +--- +title: "Diffusion Cookbook" +description: "Cookbook recipes for running diffusion models with SGLang" +metatags: + description: "Explore SGLang diffusion cookbook structure, categories, and contribution guidance for image and video generation recipes." +--- + +# SGLang Diffusion Cookbook + +
+ +Create a comprehensive cookbook for diffusion models in SGLang, demonstrating SGLang's performance advantages for image and video generation workloads. + +## 🎯 What You'll Find Here + +This cookbook aggregates battle-tested SGLang recipes covering: + +- **Models**: Mainstream image and video generation models +- **Use Cases**: Inference serving, deployment strategies +- **Hardware**: GPU and CPU configurations, optimization for different accelerators +- **Best Practices**: Configuration templates, performance tuning, troubleshooting guides + +Each recipe provides step-by-step instructions to help you quickly implement SGLang solutions for your specific requirements. + +## 🚀 Quick Start + +1. Browse the recipe index above to find your model +2. Follow the step-by-step instructions in each guide +3. Adapt configurations to your specific hardware and requirements +4. Join our community to share feedback and improvements + +The SGLang diffusion cookbook directory structure is shown below: + +```text Example +sgl-cookbook/docs/diffusion/ +├── README.md # Main cookbook (this file) +├── Qwen-Image/ # Qwen-Image series models docs +│ ├── Qwen-Image.md +│ └── Qwen-Image-Edit.md +├── Wan/ # Wan series models docs +│ ├── Wan2.1.md +│ └── Wan2.2.md +├── Z-Image/ # Z-Image series models docs +│ └── Z-Image-Turbo.md +└── ... +``` + +## 🤝 Contributing + +We believe the best documentation comes from practitioners. Whether you've optimized SGLang for a specific model, solved a tricky deployment challenge, or discovered performance improvements, we encourage you to contribute your recipes! 
+ +**💪How to Contribute** + +- Comment below if interested (mention which role) +- Join discussion on implementation details +- Fork repo and work on assigned section +- Submit PR following SGLang cookbook standards +- Iterate based on review feedback + +**To contribute:** + +```shell Command +# Fork the repo and clone locally +git clone https://github.com/YOUR_USERNAME/sglang-cookbook.git +cd sglang-cookbook + +# Create a new branch +git checkout -b add-my-recipe + +# Add your recipe following the template in DeepSeek-V3.2 +# Submit a PR! +``` + +## 📖 Resources + +- [SGLang GitHub](https://github.com/sgl-project/sglang) +- [SGLang Documentation](https://sgl-project.github.io) +- [SGLANG Diffusion Documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/README.md) +- [SLACK Channel](https://sgl-fru7574.slack.com/archives/C07GLLLESNR) +- [Community Slack/Discord](https://discord.gg/MpEEuAeb) + +## 📄 License + +This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/sgl-project/sgl-cookbook/blob/main/LICENSE) file for details. + +--- + +**Let's build this resource together!** 🚀 Star the repo and contribute your recipes to help the SGLang community grow. diff --git a/docs_new/cookbook/diffusion/Wan/Wan2.1.mdx b/docs_new/cookbook/diffusion/Wan/Wan2.1.mdx new file mode 100644 index 000000000000..7cafe28fc346 --- /dev/null +++ b/docs_new/cookbook/diffusion/Wan/Wan2.1.mdx @@ -0,0 +1,268 @@ +--- +title: Wan2.1 +metatags: + description: "Deploy Wan2.1 video generation models with SGLang - community contribution guide for Wan Video's diffusion models." +--- + +import { Wan21Deployment } from '/src/snippets/diffusion/wan21-deployment.jsx'; + +## 1. Model Introduction + +[Wan2.1 series](https://github.com/Wan-Video/Wan2.1) is an open and advanced suite of large-scale video generative models from Wan-AI. 
+ +Key characteristics: + +- **State-of-the-art video quality**: Consistently outperforms many open-source and commercial video models on internal and public benchmarks, especially for motion richness and temporal consistency. +- **Consumer GPU friendly**: The T2V-1.3B variant can generate 5-second 480P videos on consumer GPUs with modest VRAM requirements. +- **Multi-capability suite**: Supports Text-to-Video (T2V), Image-to-Video (I2V), video editing, text-to-image, and video-to-audio generation. +- **Robust text rendering**: First-generation Wan model capable of generating both Chinese and English text in videos with strong readability. +- **Powerful Wan-VAE**: A 3D causal VAE that encodes/decodes long 1080P videos while preserving temporal information, enabling efficient high-resolution video generation. + +For more details, refer to the official Wan2.1 resources: + +- **GitHub**: [Wan-Video/Wan2.1](https://github.com/Wan-Video/Wan2.1) +- **Hugging Face collection**: [Wan-AI Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) + +## 2. SGLang-diffusion Installation + +SGLang-diffusion offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang-diffusion installation guide](../../../docs/sglang-diffusion/installation) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +The Wan2.1 series offers models in multiple sizes and resolutions, optimized for different hardware platforms. The recommended launch configurations vary by hardware and model size. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate an appropriate deployment command for your model variant and options. 
+ + + +### 3.2 Configuration Tips + +Currently supported optimization options are listed in the [SGLang diffusion support matrix](../../../docs/sglang-diffusion/attention_backends#platform-support-matrix). + +- `--vae-path`: Path to a custom VAE model or HuggingFace model ID. If not specified, the VAE will be loaded from the main model path. +- `--num-gpus {NUM_GPUS}`: Number of GPUs to use. +- `--tp-size {TP_SIZE}`: Tensor parallelism size (for the encoder/DiT; keep it no larger than 1 if relying heavily on CPU offload). +- `--sp-degree {SP_SIZE}`: Sequence parallelism degree. +- `--ulysses-degree {ULYSSES_DEGREE}`: Degree of DeepSpeed-Ulysses-style SP in USP. +- `--ring-degree {RING_DEGREE}`: Degree of ring attention-style SP in USP. +- `--text-encoder-cpu-offload`, `--dit-cpu-offload`, `--vae-cpu-offload`: Use CPU offload to reduce peak GPU memory when needed. + +## 4. Model Invocation + +### 4.1 Basic Usage + +For more API usage and request examples, please refer to: +[SGLang Diffusion OpenAI API](../../../docs/sglang-diffusion/api/openai_api) + +#### 4.1.1 Launch a server and then send requests + +```bash Command +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers --port 30000 + +curl http://127.0.0.1:30000/v1/images/generations \ + -o >(jq -r '.data[0].b64_json' | base64 --decode > example.png) \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $OPENAI_API_KEY" \ + -d '{ + "model": "Wan-AI/Wan2.1-T2V-14B-Diffusers", + "prompt": "A cute baby sea otter", + "n": 1, + "size": "1024x1024", + "response_format": "b64_json" + }' +``` + +#### 4.1.2 Generate a video without launching a server + +```bash Command +SERVER_ARGS=( + --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers + --text-encoder-cpu-offload + --pin-cpu-memory + --num-gpus 4 + --ulysses-degree=2 + --enable-cfg-parallel +) + +SAMPLING_ARGS=( + --prompt "A curious raccoon" + --save-output + --output-path outputs + --output-file-name "A curious raccoon.mp4" +) + +sglang generate "${SERVER_ARGS[@]}" 
"${SAMPLING_ARGS[@]}" +``` + +### 4.2 Advanced Usage + +#### 4.2.1 Cache-DiT Acceleration + +SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve significant inference speedups with minimal quality loss. You can set `SGLANG_CACHE_DIT_ENABLED=True` to enable it. For more details, please refer to the SGLang Cache-DiT [documentation](../../../docs/sglang-diffusion/cache_dit). + +**Basic Usage** + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers +``` + +**Advanced Usage** + + Combined Configuration Example: + ```bash Command + SGLANG_CACHE_DIT_ENABLED=true \ + SGLANG_CACHE_DIT_FN=2 \ + SGLANG_CACHE_DIT_BN=1 \ + SGLANG_CACHE_DIT_WARMUP=4 \ + SGLANG_CACHE_DIT_RDT=0.4 \ + SGLANG_CACHE_DIT_MC=4 \ + SGLANG_CACHE_DIT_TAYLORSEER=true \ + SGLANG_CACHE_DIT_TS_ORDER=2 \ + sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers + ``` + +#### 4.2.2 GPU Optimization + +- `--dit-cpu-offload`: Use CPU offload for DiT inference. Enable if you run out of memory with FSDP. +- `--text-encoder-cpu-offload`: Use CPU offload for text encoder inference. +- `--image-encoder-cpu-offload`: Use CPU offload for image encoder inference. +- `--vae-cpu-offload`: Use CPU offload for VAE. +- `--pin-cpu-memory`: Pin memory for CPU offload. Use as a workaround if you see "CUDA error: invalid argument". + +#### 4.2.3 Supported LoRA Registry + +SGLang supports applying Wan2.1 LoRA adapters on top of base models: + + + + + + + + + + + + + + + + + + + + + + +
| origin model | supported LoRA |
| --- | --- |
| [Wan-AI/Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) | [NIVEDAN/wan2.1-lora](https://huggingface.co/NIVEDAN/wan2.1-lora) |
| [Wan-AI/Wan2.1-I2V-14B-720P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) | [valiantcat/Wan2.1-Fight-LoRA](https://huggingface.co/valiantcat/Wan2.1-Fight-LoRA) |
+ +**Example**: + +```bash Command +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers --port 30000 \ + --lora-path NIVEDAN/wan2.1-lora +``` + +## 5. Benchmark + +Test Environment: + +- Hardware: AMD MI300X GPU (1x) +- Model: Wan-AI/Wan2.1-T2V-14B-Diffusers +- SGLang Docker Image Version: 0.5.9 + +### 5.1 How to Run Benchmarks with SGLang + +You can use the built-in SGLang diffusion benchmark script to evaluate Wan2.1 performance on your hardware. + +#### 5.1.1 Generate a single video + +**Server Command**: + +```bash Command +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers +``` + +**Benchmark Command**: + +```bash Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task text-to-video --num-prompts 1 --max-concurrency 1 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Task: text-to-video +Model: Wan-AI/Wan2.1-T2V-14B-Diffusers +Dataset: vbench +-------------------------------------------------- +Benchmark duration (s): 1958.41 +Request rate: inf +Max request concurrency: 1 +Successful requests: 1/1 +-------------------------------------------------- +Request throughput (req/s): 0.00 +Latency Mean (s): 1958.4059 +Latency Median (s): 1958.4059 +Latency P99 (s): 1958.4059 +-------------------------------------------------- +Peak Memory Max (MB): 59662.00 +Peak Memory Mean (MB): 59662.00 +Peak Memory Median (MB): 59662.00 +============================================================ +``` + +#### 5.1.2 Generate videos with Cache-DiT acceleration + +**Server Command**: + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers +``` + +**Benchmark Command**: + +```bash Command +python3 -m 
sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task text-to-video --num-prompts 1 --max-concurrency 1 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Task: text-to-video +Model: Wan-AI/Wan2.1-T2V-14B-Diffusers +Dataset: vbench +-------------------------------------------------- +Benchmark duration (s): 556.99 +Request rate: inf +Max request concurrency: 1 +Successful requests: 1/1 +-------------------------------------------------- +Request throughput (req/s): 0.00 +Latency Mean (s): 556.9885 +Latency Median (s): 556.9885 +Latency P99 (s): 556.9885 +-------------------------------------------------- +Peak Memory Max (MB): 69306.00 +Peak Memory Mean (MB): 69306.00 +Peak Memory Median (MB): 69306.00 +============================================================ +``` diff --git a/docs_new/cookbook/diffusion/Wan/Wan2.2.mdx b/docs_new/cookbook/diffusion/Wan/Wan2.2.mdx new file mode 100644 index 000000000000..1a4f3a5c6699 --- /dev/null +++ b/docs_new/cookbook/diffusion/Wan/Wan2.2.mdx @@ -0,0 +1,346 @@ +--- +title: Wan2.2 +metatags: + description: "Deploy Wan2.2 video generation models with SGLang - MoE architecture, cinematic aesthetics, and efficient 720P@24fps generation." +--- + +import { Wan22Deployment } from '/src/snippets/diffusion/wan22-deployment.jsx'; + +## 1. Model Introduction + +[Wan2.2 series](https://github.com/Wan-Video/Wan2.2) are the most popular and open and advanced large-scale video generative models. + +This generation delivers comprehensive upgrades across the board: + +- **Effective MoE Architecture**: Introduces a Mixture-of-Experts (MoE) architecture into video diffusion models. By separating the denoising process cross timesteps with specialized powerful expert models, this enlarges the overall model capacity while maintaining the same computational cost. 
+- **Cinematic-level Aesthetics**: Incorporates meticulously curated aesthetic data, complete with detailed labels for lighting, composition, contrast, color tone, and more. This allows for more precise and controllable cinematic style generation, facilitating the creation of videos with customizable aesthetic preferences.
+- **Complex Motion Generation**: Trained on a significantly larger dataset, with 65.6% more images and 83.2% more videos. This expansion notably enhances the model's generalization across multiple dimensions such as motions, semantics, and aesthetics, achieving top performance among all open-source and closed-source models.
+- **Efficient High-Definition Hybrid TI2V**: Open-sources a 5B model built with our advanced Wan2.2-VAE that achieves a compression ratio of 16×16×4. This model supports both text-to-video and image-to-video generation at 720P resolution and 24fps, and can also run on consumer-grade graphics cards such as the 4090. It is one of the fastest 720P@24fps models currently available, capable of serving both the industrial and academic sectors.
+
+For more details, please refer to the [official Wan2.2 GitHub Repository](https://github.com/Wan-Video/Wan2.2).
+
+## 2. SGLang-diffusion Installation
+
+SGLang-diffusion offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements.
+
+Please refer to the [official SGLang-diffusion installation guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/install.md) for installation instructions.
+
+## 3. Model Deployment
+
+This section provides deployment configurations optimized for different hardware platforms and use cases.
+
+### 3.1 Basic Configuration
+
+The Wan2.2 series offers models in various sizes, architectures, and input types, optimized for different hardware platforms. The recommended launch configurations vary by hardware and model size.
+
+**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model size. SGLang supports serving Wan2.2 on NVIDIA B200 and H200 GPUs and on AMD MI300X, MI325X, and MI355X GPUs.
+
+
+
+### 3.2 Configuration Tips
+
+Currently supported optimizations are listed [here](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/support_matrix.md).
+
+- `--vae-path`: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE will be loaded from the main model path.
+- `--num-gpus {NUM_GPUS}`: Number of GPUs to use.
+- `--tp-size {TP_SIZE}`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster).
+- `--sp-degree {SP_SIZE}`: Sequence parallelism size (typically should match the number of GPUs).
+- `--ulysses-degree {ULYSSES_DEGREE}`: The degree of DeepSpeed-Ulysses-style SP in USP.
+- `--ring-degree {RING_DEGREE}`: The degree of ring attention-style SP in USP.
+
+## 4.
Model Invocation
+
+### 4.1 Basic Usage
+
+For more API usage and request examples, please refer to:
+[SGLang Diffusion OpenAI API](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/openai_api.md)
+
+#### 4.1.1 Launch a server and then send requests
+
+```shell Command
+sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers --port 3000
+
+curl http://127.0.0.1:3000/v1/images/generations \
+  -o >(jq -r '.data[0].b64_json' | base64 --decode > example.png) \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $OPENAI_API_KEY" \
+  -d '{
+    "model": "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+    "prompt": "A cute baby sea otter",
+    "n": 1,
+    "size": "1024x1024",
+    "response_format": "b64_json"
+  }'
+```
+
+#### 4.1.2 Generate a video without launching a server
+
+```shell Command
+SERVER_ARGS=(
+  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  --text-encoder-cpu-offload
+  --pin-cpu-memory
+  --num-gpus 4
+  --ulysses-degree=2
+  --enable-cfg-parallel
+)
+
+SAMPLING_ARGS=(
+  --prompt "A curious raccoon"
+  --save-output
+  --output-path outputs
+  --output-file-name "A curious raccoon.mp4"
+)
+
+sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
+```
+
+### 4.2 Advanced Usage
+
+#### 4.2.1 Cache-DiT Acceleration
+
+SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set `SGLANG_CACHE_DIT_ENABLED=True` to enable it. For more details, please refer to the SGLang Cache-DiT [documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/cache/cache_dit.md).
+ +**Basic Usage** + +```shell Command +SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers +``` + +**Advanced Usage** + +- DBCache Parameters: DBCache controls block-level caching behavior: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
+ +- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
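
The DBCache and TaylorSeer parameters above are plain environment variables, so a launch script can assemble them programmatically. The following is a minimal sketch, assuming the variable names and defaults listed in the tables; the helper `cache_dit_env` and its keyword arguments are hypothetical and not part of SGLang.

```python
def cache_dit_env(fn=1, bn=0, warmup=4, rdt=0.24, mc=3,
                  taylorseer=False, ts_order=1):
    """Build the SGLANG_CACHE_DIT_* variables (hypothetical helper).

    Keyword defaults mirror the defaults listed in the tables above.
    """
    if ts_order not in (1, 2):
        # The table lists 1 or 2 as the only Taylor expansion orders.
        raise ValueError("ts_order must be 1 or 2")
    return {
        "SGLANG_CACHE_DIT_ENABLED": "true",
        "SGLANG_CACHE_DIT_FN": str(fn),
        "SGLANG_CACHE_DIT_BN": str(bn),
        "SGLANG_CACHE_DIT_WARMUP": str(warmup),
        "SGLANG_CACHE_DIT_RDT": str(rdt),
        "SGLANG_CACHE_DIT_MC": str(mc),
        "SGLANG_CACHE_DIT_TAYLORSEER": "true" if taylorseer else "false",
        "SGLANG_CACHE_DIT_TS_ORDER": str(ts_order),
    }


# Same values as the combined configuration example in this guide.
env = cache_dit_env(fn=2, bn=1, warmup=4, rdt=0.4, mc=4,
                    taylorseer=True, ts_order=2)
for name, value in env.items():
    print(f"{name}={value}")
```

The resulting dictionary can be merged into `os.environ` and passed as the `env` argument of `subprocess.run` when starting `sglang serve` from Python.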
+ + Combined Configuration Example: + +```shell Command +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers +``` + +#### 4.2.2 GPU Optimization + +- `--dit-cpu-offload`: Use CPU offload for DiT inference. Enable if run out of memory with FSDP. +- `--text-encoder-cpu-offload`: Use CPU offload for text encoder inference. Enable if run out of memory with FSDP. +- `--image-encoder-cpu-offload`: Use CPU offload for image encoder inference. Enable if run out of memory with FSDP. +- `--vae-cpu-offload`: Use CPU offload for VAE. Enable if run out of memory. +- `--pin-cpu-memory`: Pin memory for CPU offload. Only added as a temp workaround if it throws "CUDA error: invalid argument". + +#### 4.2.3 Supported LoRA Registry + + + + + + + + + + + + + + + + + + + + + +
| origin model | supported LoRA |
| --- | --- |
| [Wan-AI/Wan2.2-I2V-A14B-Diffusers](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers) | [lightx2v/Wan2.2-Distill-Loras](https://huggingface.co/lightx2v/Wan2.2-Distill-Loras) |
| [Wan-AI/Wan2.2-T2V-A14B-Diffusers](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) | [Cseti/wan2.2-14B-Arcane_Jinx-lora-v1](https://huggingface.co/Cseti/wan2.2-14B-Arcane_Jinx-lora-v1) |
+**Example**: +```shell Command +sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers --port 3000 \ + --lora-path Cseti/wan2.2-14B-Arcane_Jinx-lora-v1 +``` + +## 5. Benchmark + +Test Environment: + +- Hardware: NVIDIA B200 GPU (1x) +- Model: Wan-AI/Wan2.2-T2V-A14B-Diffusers +- sglang diffusion version: 0.5.6.post2 + +### 5.1 Speedup Benchmark + +#### 5.1.1 Generate a video + +**Server Command**: + +```shell Command +sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers +``` + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task t2v --num-prompts 1 --max-concurrency 1 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Backend: sglang-video +Model: Wan-AI/Wan2.2-T2V-A14B-Diffusers +Dataset: vbench +Task: t2v +-------------------------------------------------- +Benchmark duration (s): 630.43 +Request rate: inf +Max request concurrency: 1 +Successful requests: 1/1 +-------------------------------------------------- +Request throughput (req/s): 0.00 +Latency Mean (s): 630.4277 +Latency Median (s): 630.4277 +Latency P99 (s): 630.4277 +-------------------------------------------------- +Peak Memory Max (MB): 62627.41 +Peak Memory Mean (MB): 62627.41 +Peak Memory Median (MB): 62627.41 + +============================================================ +``` + +#### 5.1.2 Generate videos with high concurrency + +**Server Command**: + +```shell Command +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers +``` + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-video --dataset vbench --task t2v --num-prompts 20 
--max-concurrency 20 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Backend: sglang-video +Model: Wan-AI/Wan2.2-T2V-A14B-Diffusers +Dataset: vbench +Task: t2v +-------------------------------------------------- +Benchmark duration (s): 5163.21 +Request rate: inf +Max request concurrency: 20 +Successful requests: 20/20 +-------------------------------------------------- +Request throughput (req/s): 0.00 +Latency Mean (s): 2739.7695 +Latency Median (s): 2742.0673 +Latency P99 (s): 5121.6331 +-------------------------------------------------- +Peak Memory Max (MB): 72523.56 +Peak Memory Mean (MB): 70253.34 +Peak Memory Median (MB): 70824.46 + +============================================================ +``` diff --git a/docs_new/cookbook/diffusion/Z-Image/Z-Image-Turbo.mdx b/docs_new/cookbook/diffusion/Z-Image/Z-Image-Turbo.mdx new file mode 100644 index 000000000000..8e43ce374865 --- /dev/null +++ b/docs_new/cookbook/diffusion/Z-Image/Z-Image-Turbo.mdx @@ -0,0 +1,281 @@ +--- +title: Z-Image-Turbo +metatags: + description: "Deploy Z-Image-Turbo with SGLang - community contribution guide for Z-Image's fast image generation model." +--- + +import { ZImageTurboDeployment } from '/src/snippets/diffusion/zimage-turbo-deployment.jsx'; + +## 1. Model Introduction + +[Z-Image](https://github.com/Tongyi-MAI/Z-Image) is a powerful and highly efficient image generation model family with 6B parameters, developed by Tongyi-MAI. It adopts a Scalable Single-Stream DiT (S3-DiT) architecture, where text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches. + +[Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). 
It is powered by two core techniques: **Decoupled-DMD** (few-step distillation) and **DMDR** (fusing DMD with Reinforcement Learning). + +**Key Features:** + +- **Sub-second Inference Latency**: Achieves sub-second inference on enterprise-grade H800 GPUs and fits comfortably within 16GB VRAM consumer devices +- **Photorealistic Image Generation**: Excels in high-quality photorealistic image generation with rich aesthetics +- **Bilingual Text Rendering**: Supports accurate bilingual text rendering in both English and Chinese +- **Robust Instruction Adherence**: Strong prompt following and instruction adherence capabilities +- **#1 Open-Source Model**: Ranked 8th overall and #1 among open-source models on the [Artificial Analysis Text-to-Image Leaderboard](https://artificialanalysis.ai/image/leaderboard/text-to-image) + +For more details, please refer to the [Z-Image-Turbo HuggingFace page](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo), the [GitHub repository](https://github.com/Tongyi-MAI/Z-Image), and the [technical report (arXiv)](https://arxiv.org/abs/2511.22699). + +## 2. SGLang-diffusion Installation + +SGLang-diffusion offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. + +Please refer to the [official SGLang-diffusion installation guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/install.md) for installation instructions. + +## 3. Model Deployment + +This section provides deployment configurations optimized for different hardware platforms and use cases. + +### 3.1 Basic Configuration + +Z-Image-Turbo is optimized for high-quality image generation with only 8 inference steps. The recommended launch configurations vary by hardware. + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform. 
+ + + +### 3.2 Configuration Tips + +Current supported optimization all listed [here](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/support_matrix.md). + +- `--vae-path`: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE will be loaded from the main model path. +- `--num-gpus`: Number of GPUs to use +- `--tp-size`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster) +- `--sp-degree`: Sequence parallelism size (typically should match the number of GPUs) +- `--ulysses-degree`: The degree of DeepSpeed-Ulysses-style SP in USP +- `--ring-degree`: The degree of ring attention-style SP in USP + +**AMD ROCm Notes**: Requires SGLang >= v0.5.8. + +## 4. API Usage + +For complete API documentation, please refer to the [official API usage guide](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/openai_api.md). + +### 4.1 Generate an Image + +```python Example +import base64 +from openai import OpenAI + +client = OpenAI(api_key="EMPTY", base_url="http://localhost:30000/v1") + +response = client.images.generate( + model="Tongyi-MAI/Z-Image-Turbo", + prompt="A logo With Bold Large text: SGL Diffusion", + n=1, + response_format="b64_json", +) + +# Save the generated image +image_bytes = base64.b64decode(response.data[0].b64_json) +with open("output.png", "wb") as f: + f.write(image_bytes) +``` + +### 4.2 Advanced Usage + +#### 4.2.1 Cache-DiT Acceleration + +SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set `SGLANG_CACHE_DIT_ENABLED=True` to enable it. 
For more details, please refer to the SGLang Cache-DiT [documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/docs/cache_dit.md). + +**Basic Usage** + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path Tongyi-MAI/Z-Image-Turbo +``` + +**Advanced Usage** + +- DBCache Parameters: DBCache controls block-level caching behavior: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
+- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
+ + Combined Configuration Example: + +```bash Command +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang serve --model-path Tongyi-MAI/Z-Image-Turbo +``` + +#### 4.2.2 CPU Offload + +- `--dit-cpu-offload`: Use CPU offload for DiT inference. Enable if run out of memory. +- `--text-encoder-cpu-offload`: Use CPU offload for text encoder inference. +- `--vae-cpu-offload`: Use CPU offload for VAE. +- `--pin-cpu-memory`: Pin memory for CPU offload. Only added as a temp workaround if it throws "CUDA error: invalid argument". + +## 5. Benchmark + +Test Environment: + +- Hardware: AMD Instinct MI300X GPU (1x) +- Model: Tongyi-MAI/Z-Image-Turbo +- Docker Image: lmsysorg/sglang:v0.5.8-rocm700-mi30x +- sglang diffusion version: 0.5.8 + +### 5.1 Speedup Benchmark + +#### 5.1.1 Generate an image + +**Server Command**: + +```shell Command +sglang serve --model-path Tongyi-MAI/Z-Image-Turbo \ + --ulysses-degree=1 --ring-degree=1 --port 30000 +``` + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task text-to-image --num-prompts 1 --max-concurrency 1 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Task: text-to-image +Model: Tongyi-MAI/Z-Image-Turbo +Dataset: vbench +-------------------------------------------------- +Benchmark duration (s): 1.84 +Request rate: inf +Max request concurrency: 1 +Successful requests: 1/1 +-------------------------------------------------- +Request throughput (req/s): 0.54 +Latency Mean (s): 1.8435 +Latency Median (s): 1.8435 +Latency P99 (s): 1.8435 +-------------------------------------------------- +Peak Memory Max (MB): 30689.20 +Peak Memory Mean (MB): 30689.20 +Peak Memory Median (MB): 30689.20 
+============================================================ +``` + +#### 5.1.2 Generate images with high concurrency + +**Benchmark Command**: + +```shell Command +python3 -m sglang.multimodal_gen.benchmarks.bench_serving \ + --backend sglang-image --dataset vbench --task text-to-image --num-prompts 20 --max-concurrency 20 +``` + +**Result**: + +```text Output +================= Serving Benchmark Result ================= +Task: text-to-image +Model: Tongyi-MAI/Z-Image-Turbo +Dataset: vbench +-------------------------------------------------- +Benchmark duration (s): 35.32 +Request rate: inf +Max request concurrency: 20 +Successful requests: 20/20 +-------------------------------------------------- +Request throughput (req/s): 0.57 +Latency Mean (s): 18.5672 +Latency Median (s): 18.5573 +Latency P99 (s): 34.9880 +-------------------------------------------------- +Peak Memory Max (MB): 30689.26 +Peak Memory Mean (MB): 30689.21 +Peak Memory Median (MB): 30689.21 +============================================================ +``` diff --git a/docs_new/cookbook/diffusion/intro.mdx b/docs_new/cookbook/diffusion/intro.mdx new file mode 100644 index 000000000000..15aa7c64850b --- /dev/null +++ b/docs_new/cookbook/diffusion/intro.mdx @@ -0,0 +1,46 @@ +--- +title: Overview +mode: wide +description: Practical guides for deploying and using diffusion models with SGLang. +metatags: + description: "Explore SGLang diffusion model cookbooks for image and video generation deployment, invocation, optimization, and benchmarking examples." 
+--- + + + + + + + + + diff --git a/docs_new/cookbook/intro copy.mdx b/docs_new/cookbook/intro copy.mdx new file mode 100644 index 000000000000..e59160118438 --- /dev/null +++ b/docs_new/cookbook/intro copy.mdx @@ -0,0 +1,226 @@ +--- +title: SGLang Cookbook +metatags: + description: The SGLang Cookbook is a practical collection of examples and guides that show developers how to efficiently run SGLang with a variety of models on different platforms. +--- + +[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) +[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/sgl-project/sgl-cookbook/pulls) + +A community-maintained repository of practical guides and recipes for deploying and using SGLang in production environments. Our mission is simple: answer the question **"How do I use SGLang (and related models) on hardware Y for task Z?"** with clear, actionable solutions. + +## 🎯 What You'll Find Here + +This cookbook aggregates battle-tested SGLang recipes covering: + +- **Models**: Mainstream LLMs and Vision-Language Models (VLMs) +- **Use Cases**: Inference serving, deployment strategies, multimodal applications +- **Hardware**: GPU and CPU configurations, optimization for different accelerators +- **Best Practices**: Configuration templates, performance tuning, troubleshooting guides + +Each recipe provides step-by-step instructions to help you quickly implement SGLang solutions for your specific requirements. 
+ +## Guides + +### Autoregressive Models + +#### Qwen + +- [x] [Qwen3.5](./autoregressive/Qwen/Qwen3.5) NEW +- [x] [Qwen3](./autoregressive/Qwen/Qwen3) +- [x] [Qwen3-Next](./autoregressive/Qwen/Qwen3-Next) +- [x] [Qwen3-VL](./autoregressive/Qwen/Qwen3-VL) +- [x] [Qwen3-Coder](./autoregressive/Qwen/Qwen3-Coder) +- [x] [Qwen3-Coder-Next](./autoregressive/Qwen/Qwen3-Coder-Next) NEW +- [x] [Qwen2.5-VL](./autoregressive/Qwen/Qwen2.5-VL) + +#### DeepSeek + +- [x] [DeepSeek-V3.2](./autoregressive/DeepSeek/DeepSeek-V3_2) +- [x] [DeepSeek-V3.1](./autoregressive/DeepSeek/DeepSeek-V3_1) +- [x] [DeepSeek-V3](./autoregressive/DeepSeek/DeepSeek-V3) +- [x] [DeepSeek-R1](./autoregressive/DeepSeek/DeepSeek-R1) +- [x] [DeepSeek-OCR](./autoregressive/DeepSeek/DeepSeek-OCR) +- [x] [DeepSeek-OCR-2](./autoregressive/DeepSeek/DeepSeek-OCR-2) NEW + +#### Llama + +- [ ] [Llama4-Scout](./autoregressive/Llama/Llama4) +- [x] [Llama3.3-70B](./autoregressive/Llama/Llama3.3-70B) +- [x] [Llama3.1](./autoregressive/Llama/Llama3.1) + +#### GLM + +- [ ] [GLM-Glyph](./autoregressive/GLM/GLM-Glyph) +- [x] [GLM-5](./autoregressive/GLM/GLM-5) NEW +- [x] [GLM-OCR](./autoregressive/GLM/GLM-OCR) NEW +- [x] [GLM-4.5](./autoregressive/GLM/GLM-4.5) +- [x] [GLM-4.5V](./autoregressive/GLM/GLM-4.5V) +- [x] [GLM-4.6](./autoregressive/GLM/GLM-4.6) +- [x] [GLM-4.6V](./autoregressive/GLM/GLM-4.6V) +- [x] [GLM-4.7](./autoregressive/GLM/GLM-4.7) +- [x] [GLM-4.7-Flash](./autoregressive/GLM/GLM-4.7-Flash) NEW + +#### OpenAI + +- [x] [gpt-oss](./autoregressive/OpenAI/GPT-OSS) + +#### Moonshotai + +- [x] [Kimi-K2.6](./autoregressive/Moonshotai/Kimi-K2.6) NEW +- [x] [Kimi-K2.5](./autoregressive/Moonshotai/Kimi-K2.5) +- [x] [Kimi-K2](./autoregressive/Moonshotai/Kimi-K2) +- [x] [Kimi-Linear](./autoregressive/Moonshotai/Kimi-Linear) + +#### MiniMax + +- [ ] [MiniMax-M2](./autoregressive/MiniMax/MiniMax-M2) +- [x] [MiniMax-M2.5](./autoregressive/MiniMax/MiniMax-M2.5) NEW + +#### NVIDIA + +- [x] [Nemotron 3 Nano 
Omni](./autoregressive/NVIDIA/Nemotron3-Nano-Omni) +- [x] [Nemotron-Nano-3-30B-A3B](./autoregressive/NVIDIA/Nemotron3-Nano) +- [x] [Nemotron3-Super](./autoregressive/NVIDIA/Nemotron3-Super) + +#### Ernie + +- [x] [Ernie4.5](./autoregressive/Ernie/Ernie4.5) +- [ ] [Ernie4.5-VL](./autoregressive/Ernie/Ernie4.5-VL) + +#### InternVL + +- [ ] [InternVL3.5](./autoregressive/InternVL/InternVL3.5) + +#### InternLM + +- [ ] [Intern-S1](./autoregressive/InternLM/Intern-S1) + +#### Jina AI + +- [ ] [Jina-reranker-m0](./autoregressive/Jina/Jina-reranker-m0) + +#### Mistral + +- [ ] [Mistral-3](./autoregressive/Mistral/Ministral-3) +- [x] [Devstral 2](./autoregressive/Mistral/Devstral-2) + +#### Xiaomi + +- [x] [MiMo-V2-Flash](./autoregressive/Xiaomi/MiMo-V2-Flash) + +#### FlashLabs + +- [x] [Chroma 1.0](./autoregressive/FlashLabs/Chroma1.0)NEW + +#### StepFun + +- [x] [Step-3.5-Flash](./autoregressive/StepFun/Step3.5) NEW +- [x] [Step3-VL-10B](./autoregressive/StepFun/Step3-VL-10B) NEW + +#### InclusionAI + +- [x] [Ling-2.5-1T](./autoregressive/InclusionAI/Ling-2.5-1T) NEW +- [x] [Ring-2.5-1T](./autoregressive/InclusionAI/Ring-2.5-1T) NEW +- [x] [LLaDA-2.1](./autoregressive/InclusionAI/LLaDA-2.1) NEW + +### Diffusion Models + +#### FLUX + +- [x] [FLUX](./diffusion/FLUX/FLUX) + +#### Qwen-Image + +- [ ] [Qwen-Image](./diffusion/Qwen-Image/Qwen-Image) +- [x] [Qwen-Image-Edit](./diffusion/Qwen-Image/Qwen-Image-Edit) + +#### Wan + +- [ ] [Wan2.1](./diffusion/Wan/Wan2.1) +- [x] [Wan2.2](./diffusion/Wan/Wan2.2) + +#### Z-Image + +- [x] [Z-Image-Turbo](./diffusion/Z-Image/Z-Image-Turbo) + +### Benchmarks + +- [x] [Diffusion Model Benchmark](./base/benchmarks/diffusion_model_benchmark.mdx) +- [x] [LLM Benchmark](./base/benchmarks/autoregressive_model_benchmark.mdx) + +## Reference + +- [Installation (PyPI)](../docs/get-started/install) - Install SGLang via pip or uv (stable and nightly) +- [Server arguments](./base/reference/server_arguments) - Understanding all the arguments + +## 🚀 
Quick Start + +1. Browse the recipe index above to find your model +2. Follow the step-by-step instructions in each guide +3. Adapt configurations to your specific hardware and requirements +4. Join our community to share feedback and improvements + +## 🤝 Contributing + +We believe the best documentation comes from practitioners. Whether you've optimized SGLang for a specific model, solved a tricky deployment challenge, or discovered performance improvements, we encourage you to contribute your recipes! + +**Ways to contribute:** + +- Add a new recipe for a model not yet covered +- Improve existing recipes with additional tips or configurations +- Report issues or suggest enhancements +- Share your production deployment experiences + +**To contribute:** + + +```bash Contribute a Recipe +# Fork the repo and clone locally +git clone https://github.com/YOUR_USERNAME/sglang-cookbook.git +cd sglang-cookbook + +# Create a new branch +git checkout -b add-my-recipe + +# Add your recipe following the template in DeepSeek-V3.2 +# Submit a PR! +``` + + +## 🛠️ Local Development + +### Prerequisites + +- Node.js >= 20.0 +- npm or yarn + +### Setup and Run + +Install dependencies and start the development server: + + +```bash Local Development +# Install dependencies +npm install + +# Start development server (hot reload enabled) +npm start +``` + + +The site will automatically open in your browser at `http://localhost:3000`. + +## 📖 Resources + +- [SGLang GitHub](https://github.com/sgl-project/sglang) +- [SGLang Documentation](https://sgl-project.github.io) +- [Community Slack/Discord](https://discord.gg/MpEEuAeb) + +## 📄 License + +This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/sgl-project/sgl-cookbook/blob/main/LICENSE) file for details. + +--- + +**Let's build this resource together!** 🚀 Star the repo and contribute your recipes to help the SGLang community grow. 
diff --git a/docs_new/cookbook/intro.mdx b/docs_new/cookbook/intro.mdx new file mode 100644 index 000000000000..3cd1747ee181 --- /dev/null +++ b/docs_new/cookbook/intro.mdx @@ -0,0 +1,42 @@ +--- +title: SGLang Cookbook +metatags: + description: The SGLang Cookbook is a practical collection of examples and guides that show developers how to efficiently run SGLang with a variety of models on different platforms. +--- + +A community-maintained repository of practical guides and recipes for deploying and using SGLang in production environments. Our mission is simple: answer the question **"How do I use SGLang (and related models) on hardware Y for task Z?"** with clear, actionable solutions. + + +## Guides + + + + + + +## Benchmarks + + + + + diff --git a/docs_new/cookbook/omni/FishAudio/S2-Pro.mdx b/docs_new/cookbook/omni/FishAudio/S2-Pro.mdx new file mode 100644 index 000000000000..2c659fd46107 --- /dev/null +++ b/docs_new/cookbook/omni/FishAudio/S2-Pro.mdx @@ -0,0 +1,208 @@ +--- +title: FishAudio S2 Pro +metatags: + description: "Deploy FishAudio S2 Pro with SGLang - state-of-the-art text-to-speech with dual-autoregressive architecture, voice cloning, prosody control, and 80+ language support." +tag: NEW +--- + +## 1. Model Introduction + +[FishAudio S2 Pro](https://huggingface.co/fishaudio/s2-pro) is a state-of-the-art text-to-speech model developed by [FishAudio](https://fish.audio), featuring fine-grained prosody and emotion control. Built on a Dual-Autoregressive (Dual-AR) transformer architecture with RVQ-based audio codec, S2 Pro achieves state-of-the-art quality across multiple TTS benchmarks. + +S2 Pro tops the Audio Turing Test (0.515 posterior mean) and EmergentTTS-Eval (81.88% win rate against gpt-4o-mini-tts) while achieving the lowest WER on Seed-TTS Eval among all evaluated models including closed-source systems. 
Trained on over 10 million hours of audio across approximately 100 languages and aligned with GRPO-based reinforcement learning, it supports voice cloning and fine-grained inline control of prosody and emotion through natural-language tags. + +**Key Features:** + +- **Dual-AR Architecture**: 5B parameter model (4B Slow AR + 400M Fast AR) with RVQ-based audio codec at 10 codebooks (~21 Hz frame rate) +- **Voice Cloning**: High-quality voice cloning from a short reference audio clip +- **Prosody & Emotion Control**: Fine-grained inline control of prosody and emotion through natural-language tags +- **Multilingual**: 80+ language support (Tier 1: Japanese, English, Chinese; Tier 2: Korean, Spanish, Portuguese, Arabic, Russian, French, German) +- **SGLang Integration**: Inherits LLM-native serving optimizations (paged KV cache, radix prefix caching) + +**License:** [FISH AUDIO RESEARCH LICENSE AGREEMENT](https://huggingface.co/fishaudio/s2-pro/blob/main/LICENSE.md) + +This work is a collaboration between the SGLang Omni Team and [FishAudio Team](https://fish.audio). For more details on S2 Pro's model design and training, see FishAudio's [S2 release blog post](https://fish.audio/blog/fish-audio-open-sources-s2/). + +## 2. Installation + +S2 Pro uses `sglang-omni`, an ecosystem project for SGLang. Start with the Docker image, then install the `sglang-omni` package inside the container. + +### 2.1 Docker + +```bash Command +docker pull frankleeeee/sglang-omni:dev + +docker run -it --shm-size 32g --gpus all frankleeeee/sglang-omni:dev /bin/zsh +``` + +### 2.2 Install sglang-omni (inside Docker) + +```bash Command +git clone https://github.com/sgl-project/sglang-omni.git +cd sglang-omni +uv venv .venv -p 3.12 && source .venv/bin/activate +uv pip install -v ".[s2pro]" +huggingface-cli download fishaudio/s2-pro +``` + +## 3. Model Deployment + +S2 Pro can be served via an OpenAI-compatible HTTP server or explored interactively through a Gradio playground. 
+ +### 3.1 Server + +```bash Command +python -m sglang_omni.cli.cli serve \ + --model-path fishaudio/s2-pro \ + --config examples/configs/s2pro_tts.yaml \ + --port 8000 +``` + +### 3.2 Interactive Playground + +We provide a Gradio-based interactive playground. We highly recommend using the playground since audio data is hard to interact with by CLI. + +```bash Command +./playground/tts/start.sh +``` + +## 4. Model Invocation + +### 4.1 Text-to-Speech + +Generate speech from text using the OpenAI-compatible `/v1/audio/speech` endpoint. + + +Without a reference audio clip, the generated voice will use a default voice. Provide a reference audio for voice cloning. + + +```bash Command +curl -X POST http://localhost:8000/v1/audio/speech \ + -H "Content-Type: application/json" \ + -d '{"input": "Hello, how are you?"}' \ + --output output.wav +``` + +### 4.2 Voice Cloning + +Provide a reference audio file and its transcript for high-quality voice cloning: + +```bash Command +curl -X POST http://localhost:8000/v1/audio/speech \ + -H "Content-Type: application/json" \ + -d '{ + "input": "Hello, how are you?", + "references": [{"audio_path": "ref.wav", "text": "Transcript of ref audio."}] + }' \ + --output output.wav +``` + +## 5. Architecture + +S2 Pro uses a 3-stage pipeline: + +```text Example +Text input ──► Preprocessing ──► SGLang AR Engine ──► DAC Vocoder ──► Audio output + (CPU) (GPU) (GPU) +``` + +**Stage 1 — Preprocessing:** Tokenizes the input text into a Qwen3-style chat prompt. For voice cloning, encodes the reference audio into VQ codes via the DAC codec and prepends them to the prompt as a system message. + +**Stage 2 — Dual-AR Generation:** The Slow AR runs inside SGLang along the time axis. At each decode step, it predicts a semantic token, then the Fast AR (4-layer transformer) generates the remaining 9 residual codebook tokens conditioned on the hidden state. 
VQ embeddings are injected into the input embedding at masked positions, allowing the model to attend over both text and audio context through SGLang's KV cache. + +**Stage 3 — Vocoder:** The accumulated codebook indices are decoded into a waveform by a DAC codec, producing the final audio output. + +## 6. Performance + +Evaluated on the full seed-tts-eval EN testset (1,088 samples) on a single H200 GPU. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
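| Metric | BS=1 | BS=2 | BS=4 | BS=8 |
| --- | --- | --- | --- | --- |
| Tok/s (mean) | 63.3 | 45.8 | 31.9 | 19.6 |
| RTF (mean) | 0.340 | 0.473 | 0.676 | 1.097 |
| Latency (mean) | 1.33s | 1.80s | 2.69s | 4.36s |
| TTFT (mean) | 19.6 ms | 22.0 ms | 31.6 ms | 50.7 ms |
| TTFB (mean) | 172.8 ms | 249.9 ms | 319.1 ms | 509.6 ms |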
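To help read these figures, the sketch below derives two quantities from the table. It assumes that Tok/s is reported per request (the source does not say so), so aggregate throughput is approximated as Tok/s times batch size. It also uses the usual definition of RTF as generation time divided by audio duration, so a value above 1.0 means slower than real time. The function names are illustrative and not part of `sglang-omni`.

```python
# Figures copied from the performance table; keys are batch sizes.
TOK_PER_S = {1: 63.3, 2: 45.8, 4: 31.9, 8: 19.6}
RTF = {1: 0.340, 2: 0.473, 4: 0.676, 8: 1.097}


def aggregate_tok_per_s(batch_size: int) -> float:
    """Approximate total tokens/s, ASSUMING Tok/s is measured per request."""
    return TOK_PER_S[batch_size] * batch_size


def is_faster_than_real_time(batch_size: int) -> bool:
    """RTF below 1.0 means audio is generated faster than it plays back."""
    return RTF[batch_size] < 1.0


if __name__ == "__main__":
    for bs in sorted(TOK_PER_S):
        print(bs, round(aggregate_tok_per_s(bs), 1), is_faster_than_real_time(bs))
```

Under that assumption, larger batches trade per-request speed for higher total throughput, and BS=8 is the only setting in the table where the mean RTF exceeds 1.0.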
+ +## 7. SGLang Omni Optimizations + +By integrating S2 Pro's Dual-AR backbone into SGLang's paged-attention engine, we inherit LLM-native optimizations: + +- **Paged KV cache** — SGLang manages KV cache for the Slow AR path, enabling efficient memory usage and high concurrency. +- **Radix prefix caching** — Shared system prompt and reference audio prefixes are cached across requests, keeping TTFT consistently low (~18ms). +- **torch.compile on Fast AR** — The 9-step codebook loop is compiled with torch.compile, achieving a 5x speedup over eager mode. +- **FlashAttention 3** — The FA3 backend is forced to match training-time attention numerics, avoiding early-EOS divergence from flashinfer. + +## 8. Future Optimizations + +Planned work to further improve throughput and latency: + +- **CUDA graphs with torch.compile enabled.** The current implementation uses torch.compile on the Fast AR codebook loop (achieving a 5x speedup over eager mode), but does not capture CUDA graphs for the Slow AR path. Enabling CUDA graphs requires resolving numerical divergence from deterministic-mode constraints and adapting SGLang's graph capture to S2 Pro's interleaved VQ embedding injection, which involves significant engineering work that we leave for a future release. + +- **Batched Fast AR head processing.** Currently, the Fast AR codebook decoding loop runs sequentially per request. Batching these steps across concurrent requests would improve GPU utilization at higher batch sizes, potentially improving throughput. + +## 9. Engineering Appendix + + + +### BF16 RoPE Precision Mismatch + +SGLang's default RoPE implementation precomputes `cos_sin_cache` in float32, but S2 Pro's model was trained entirely in bfloat16, including the RoPE frequencies. The precision difference caused logit divergence, producing garbled audio with abnormally long token sequences.
+ +This is worth noting for any future work on Fish Audio inference infrastructure, because it is uncommon, and hard to debug, for the inference engine to compute at a higher precision than the model was trained with. Once the problem is identified, the fix is simple: + +```python Example +def _truncate_rope_to_bf16(model: torch.nn.Module) -> None: + for module in model.modules(): + if hasattr(module, "cos_sin_cache"): + module.cos_sin_cache.data = module.cos_sin_cache.data.to(torch.bfloat16).to( + torch.float32 + ) +``` + +### Attention Backend Divergence Causing Early Stopping + +SGLang defaults to flashinfer for attention, but S2 Pro was trained with FlashAttention. If future work runs into an early-EOS issue, this backend mismatch is the first thing to check. + + diff --git a/docs_new/cookbook/omni/intro.mdx b/docs_new/cookbook/omni/intro.mdx new file mode 100644 index 000000000000..28702aae1e29 --- /dev/null +++ b/docs_new/cookbook/omni/intro.mdx @@ -0,0 +1,16 @@ +--- +title: Overview +mode: wide +description: Practical guides for deploying and using omni models (TTS, audio) with SGLang. +metatags: + description: "Explore SGLang omni model cookbooks for speech, audio, and multimodal deployment examples." +--- + + + + diff --git a/docs_new/cookbook/specbundle/specbundle_usage.mdx b/docs_new/cookbook/specbundle/specbundle_usage.mdx new file mode 100644 index 000000000000..c8735077e880 --- /dev/null +++ b/docs_new/cookbook/specbundle/specbundle_usage.mdx @@ -0,0 +1,152 @@ +--- +title: SpecBundle Usage +metatags: + description: "SpecBundle usage guide - production-grade EAGLE3 speculative decoding with SGLang for faster LLM inference." +--- + +![specbundle logo](/logo/logo.png) + +## About SpecBundle + +Speculative decoding, especially EAGLE3, offers strong theoretical guarantees alongside consistent empirical improvements in token acceptance rate and end-to-end inference speed.
However, despite these advances, adoption of speculative decoding—especially EAGLE3—remains limited in the open-source ecosystem, due primarily to three key factors. + +1. Lack of production-ready training infrastructure: Existing speculative decoding toolchains are largely research prototypes, offering limited system-level optimization and inadequate support for diverse architectures and large-scale models. +2. Scarcity of high-quality draft models: Effective speculative decoding depends on strong draft models, yet publicly available EAGLE3-compatible checkpoints are extremely limited, primarily originating from the original authors. +3. Insufficient training scale of existing drafts: Most available draft models are trained on small or curated datasets and fail to generalize to the large, diverse corpora used in modern LLM training, resulting in low token acceptance rates and diminished practical speedups. + +**SpecBundle** is a direct response to these limitations. Jointly driven by the open-source community and industry partners including **Ant Group**, **Meituan**, **Nex-AGI** and **EigenAI**, **SpecBundle** represents the **first open initiative** aimed at democratizing speculative decoding by providing high-performance, production-grade EAGLE3 draft model weights for mainstream open-source LLMs. This initiative also serves to verify the robustness of the [**SpecForge**](https://github.com/sgl-project/SpecForge) framework through multiple scales and architectures. + +## Installation + +```bash Command +git clone https://github.com/sgl-project/SpecForge.git +``` + +## Usage + +### Launch SGLang Server with SpecBundle models + +You can use the following command to launch the SGLang server with SpecBundle models. Please add `--tp`, `--ep` and `--mem-fraction-static` arguments when you encounter memory issues. 
+ +```bash Command +python3 -m sglang.launch_server \ + --model \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 +``` + +For example: + +```bash Command +SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ + --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path lmsys/SGLang-EAGLE3-Qwen3-30B-A3B-Instruct-2507-SpecForge-Nex \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --tp 4 +``` + +### Use SpecBundle to compare the performance of Speculative Decoding draft models + +We provide a benchmark suite to evaluate the performance of SpecBundle draft models [here](https://github.com/sgl-project/SpecForge/tree/main/benchmarks). + +#### Example: + +1. Launch an SGLang server + +```bash Command +SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ + --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path lmsys/SGLang-EAGLE3-Qwen3-30B-A3B-Instruct-2507-SpecForge-Nex \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --tp 4 +``` + +2. Use the benchmark suite to evaluate the performance of SpecBundle draft models + +`bench_eagle3.py` can launch an SGLang server process and a benchmarking process concurrently. This way, you don't have to launch the SGLang server manually; the script handles the SGLang launch automatically under different speculative decoding configurations. Some important arguments are: + +- `--model-path`: the path to the target model. +- `--speculative-draft-model-path`: the path to the draft model. +- `--port`: the port to launch the SGLang server. +- `--trust-remote-code`: trust the remote code. +- `--mem-fraction-static`: the memory fraction for the static memory.
+- `--tp-size`: the tensor parallelism size. +- `--attention-backend`: the attention backend. +- `--config-list`: the list of speculative decoding configurations to test; the format is `,,,`. +- `--benchmark-list`: the list of benchmarks to test; the format is `::`. + +```bash Command +cd SpecForge/benchmarks +python bench_eagle3.py \ + --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 \ + --port 30000 \ + --config-list 1,3,1,4 \ + --benchmark-list mtbench:5 gsm8k:100 \ + --skip-launch-server +``` + +**Interactive Command Generator**: Use the configuration selector below to automatically generate the appropriate test command for your model and benchmark. + +import { SpecBundleDeployment } from "/src/snippets/specbundle/specbundle-deployment.jsx"; + + + +The benchmark writes a JSON file whose content looks like this: + +```json Config +{ + "mtbench": [ + { + "batch_size": 1, + "steps": null, + "topk": null, + "num_draft_tokens": null, + "metrics": [ + { + "latency": 12.232808108034078, + "output_throughput": 319.71399906382845, + "accept_length": 2.170366259711432, + "accuracy": null, + "num_questions": 5, + "num_valid_predictions": 0, + "categorical_performance": null + } + ], + "num_samples": 5 + } + ], + "gsm8k": [ + { + "batch_size": 1, + "steps": null, + "topk": null, + "num_draft_tokens": null, + "metrics": [ + { + "latency": 37.42077191895805, + "output_throughput": 373.6160234823207, + "accept_length": 2.643410852713178, + "accuracy": 0.96, + "num_questions": 100, + "num_valid_predictions": 100, + "categorical_performance": null + } + ], + "num_samples": 100 + } + ] +} +``` + +## Performance Scores + +We evaluate the performance of SpecBundle draft models on various benchmarks. Please visit the [Performance Dashboard](https://docs.sglang.io/SpecForge/SpecBundle/index.html) for more details.
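The benchmark result JSON shown in the usage guide can be condensed into a few comparable numbers per benchmark. The following is a minimal sketch, assuming the file keeps the structure of that sample (benchmark name, then a list of runs, each with a `metrics` list). The `summarize` helper and the inlined `SAMPLE` string are illustrative and not part of SpecForge.

```python
import json

# Trimmed copy of the sample result file; a real run would load the
# JSON file that the benchmark script wrote instead of this string.
SAMPLE = """
{
  "mtbench": [{"batch_size": 1, "num_samples": 5, "metrics": [
    {"latency": 12.232808108034078, "output_throughput": 319.71399906382845,
     "accept_length": 2.170366259711432, "accuracy": null}]}],
  "gsm8k": [{"batch_size": 1, "num_samples": 100, "metrics": [
    {"latency": 37.42077191895805, "output_throughput": 373.6160234823207,
     "accept_length": 2.643410852713178, "accuracy": 0.96}]}]
}
"""


def summarize(results: dict) -> list:
    """Return (benchmark, batch_size, throughput, accept_length, accuracy) rows."""
    rows = []
    for benchmark, runs in results.items():
        for run in runs:
            for metric in run["metrics"]:
                rows.append((
                    benchmark,
                    run["batch_size"],
                    round(metric["output_throughput"], 1),
                    round(metric["accept_length"], 2),
                    metric["accuracy"],  # None when the benchmark has no accuracy score
                ))
    return rows


if __name__ == "__main__":
    for row in summarize(json.loads(SAMPLE)):
        print(row)
```

Keeping one row per metrics entry makes it straightforward to compare several `--config-list` settings side by side once the file contains more than one run per benchmark.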
diff --git a/docs_new/cookbook/specbundle/supported_models.mdx b/docs_new/cookbook/specbundle/supported_models.mdx new file mode 100644 index 000000000000..f3b0f6bf7846 --- /dev/null +++ b/docs_new/cookbook/specbundle/supported_models.mdx @@ -0,0 +1,191 @@ +--- +title: Supported Models +metatags: + description: "SpecBundle supported EAGLE3 draft models for speculative decoding - Llama, Qwen, DeepSeek, GLM, and more." +--- + +## [Released Models](https://huggingface.co/collections/lmsys/specbundle) + +We list the models released by the SpecForge and several industrial partners below. These models are released as part of the SpecBundle models, which are trained on large-scale multi-domain datasets and deliver exceptional performance on various benchmarks. + +> We also include some of the models previously trained by the SpecForge team but not technically part of the SpecBundle release. +> We mark models trained on ShareGPT+Ultrachat datasets with a **\*** mark and models trained on Perfect-Blend datasets but released before SpecBundle with **+** mark. + +### Llama Series + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Target Model | EAGLE3 Draft Model |
| --- | --- |
| meta-llama/Llama-3.1-8B-Instruct | [🤗 Hugging Face](https://huggingface.co/lmsys/SGLang-EAGLE3-Llama-3.1-8B-Instruct-SpecForge) |
| meta-llama/Llama-3.3-70B-Instruct | [🤗 Hugging Face](https://huggingface.co/lmsys/SGLang-EAGLE3-Llama-3.3-70B-Instruct-SpecForge) |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | [🤗 Hugging Face](https://huggingface.co/lmsys/SGLang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-SpecForge) |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct | [🤗 Hugging Face \*](https://huggingface.co/lmsys/sglang-EAGLE3-Llama-4-Maverick-17B-128E-Instruct-v1) |
+ +### Qwen Series + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Target Model | EAGLE3 Draft Model |
| --- | --- |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | [🤗 Hugging Face](https://huggingface.co/lmsys/SGLang-EAGLE3-Qwen3-30B-A3B-Instruct-2507-SpecForge-Nex) |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | [🤗 Hugging Face](https://huggingface.co/lmsys/SGLang-EAGLE3-Qwen3-235B-A22B-Instruct-2507-SpecForge-Meituan) |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | [🤗 Hugging Face](https://huggingface.co/lmsys/SGLang-EAGLE3-Qwen3-Next-80B-A3B-Instruct-FP8-perfect-blend-regenerated) |
+ +### Qwen Coder Series + + + + + + + + + + + + + + + + + + + + + + +
| Target Model | EAGLE3 Draft Model |
| --- | --- |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | [🤗 Hugging Face](https://huggingface.co/lmsys/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct-SpecForge) |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | [🤗 Hugging Face](https://huggingface.co/lmsys/SGLang-EAGLE3-Qwen3-Coder-480B-A35B-Instruct-SpecForge-EigenAI) |
+ +### Ling Series + + + + + + + + + + + + + + + + + + +
| Target Model | EAGLE3 Draft Model |
| --- | --- |
| inclusionAI/Ling-flash-2.0 | [🤗 Hugging Face](https://huggingface.co/AQ-MedAI/Ling-Flash-2.0-eagle3) |
+ +### Kimi Series + + + + + + + + + + + + + + + + + + +
| Target Model | EAGLE3 Draft Model |
| --- | --- |
| moonshotai/Kimi-K2-Instruct | [🤗 Hugging Face](https://huggingface.co/AQ-MedAI/Kimi-K2-Instruct-eagle3) |
+ +### GPT-OSS Series + + + + + + + + + + + + + + + + + + + + + + +
| Target Model | EAGLE3 Draft Model |
| --- | --- |
| openai/gpt-oss-20b | [🤗 Hugging Face +](https://huggingface.co/zhuyksir/EAGLE3-gpt-oss-20b-bf16) |
| openai/gpt-oss-120b | [🤗 Hugging Face +](https://huggingface.co/lmsys/EAGLE3-gpt-oss-120b-bf16) |
+ +### Nex Series + + + + + + + + + + + + + + + + + + + + + + +
| Target Model | EAGLE3 Draft Model |
| --- | --- |
| nex-agi/Qwen3-30B-A3B-Nex-N1 | [🤗 Hugging Face](https://huggingface.co/nex-agi/SGLANG-EAGLE3-Qwen3-30B-A3B-Nex-N1) |
| nex-agi/Qwen3-32B-Nex-N1 | [🤗 Hugging Face](https://huggingface.co/nex-agi/SGLANG-EAGLE3-Qwen3-32B-Nex-N1) |
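The target-to-draft pairs listed in these tables can be turned into a launch command programmatically. The sketch below is a hypothetical helper, not part of SGLang or SpecForge: it copies three pairs from the tables and reuses the flags and default values (3 steps, top-k 1, 4 draft tokens) from the SpecBundle usage example, which may not be the best settings for every model.

```python
import shlex

# A few target -> draft pairs copied from the tables above (not exhaustive).
DRAFT_MODELS = {
    "meta-llama/Llama-3.1-8B-Instruct": "lmsys/SGLang-EAGLE3-Llama-3.1-8B-Instruct-SpecForge",
    "Qwen/Qwen3-30B-A3B-Instruct-2507": "lmsys/SGLang-EAGLE3-Qwen3-30B-A3B-Instruct-2507-SpecForge-Nex",
    "openai/gpt-oss-120b": "lmsys/EAGLE3-gpt-oss-120b-bf16",
}


def launch_command(target: str, steps: int = 3, topk: int = 1, draft_tokens: int = 4) -> str:
    """Build a shell command string for serving `target` with its EAGLE3 draft."""
    draft = DRAFT_MODELS[target]  # raises KeyError if the pair is not listed here
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model", target,
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", draft,
        "--speculative-num-steps", str(steps),
        "--speculative-eagle-topk", str(topk),
        "--speculative-num-draft-tokens", str(draft_tokens),
    ]
    return shlex.join(args)  # quotes arguments only when the shell requires it


if __name__ == "__main__":
    print(launch_command("openai/gpt-oss-120b"))
```

Parallelism and memory flags such as `--tp` are deliberately left out, since they depend on the hardware rather than on the model pair.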
diff --git a/docs_new/custom.css b/docs_new/custom.css new file mode 100644 index 000000000000..3dd898727beb --- /dev/null +++ b/docs_new/custom.css @@ -0,0 +1,101 @@ +:where(*) { + scrollbar-width: none; + -ms-overflow-style: none; +} + +:where(*)::-webkit-scrollbar { + width: 0; + height: 0; +} + +:where(pre, code, .code-block, .code-group, #request-example, #response-example), +:where(pre, code, .code-block, .code-group, #request-example, #response-example) * { + scrollbar-width: auto; + -ms-overflow-style: auto; +} + +:where(pre, code, .code-block, .code-group, #request-example, #response-example)::-webkit-scrollbar, +:where(pre, code, .code-block, .code-group, #request-example, #response-example) *::-webkit-scrollbar { + width: 10px; + height: 10px; +} + +/* Global table styling to match vision-language-models.mdx reference */ +table { + width: 100%; + border-collapse: collapse; + table-layout: fixed; +} + +table thead tr { + border-bottom: 2px solid #d55816; +} + +table thead th { + text-align: left; + padding: 10px 12px; + font-weight: 700; + white-space: nowrap; +} + +table thead th:nth-child(odd) { + background-color: rgba(255,255,255,0.02); +} + +table thead th:nth-child(even) { + background-color: rgba(255,255,255,0.05); +} + +table tbody tr:nth-child(odd) td { + background-color: rgba(255,255,255,0.02); +} + +table tbody tr:nth-child(even) td { + background-color: rgba(255,255,255,0.05); +} + +table tbody td { + padding: 9px 12px; +} + +table tbody td:first-child { + font-weight: 500; +} + +/* Dark mode support for tables */ +html.dark table thead th:nth-child(odd), +[data-theme="dark"] table thead th:nth-child(odd) { + background-color: rgba(255,255,255,0.02); +} + +html.dark table thead th:nth-child(even), +[data-theme="dark"] table thead th:nth-child(even) { + background-color: rgba(255,255,255,0.05); +} + +html.dark table tbody tr:nth-child(odd) td, +[data-theme="dark"] table tbody tr:nth-child(odd) td { + background-color: rgba(255,255,255,0.02); 
+} + +html.dark table tbody tr:nth-child(even) td, +[data-theme="dark"] table tbody tr:nth-child(even) td { + background-color: rgba(255,255,255,0.05); +} + +/* Bold text (**text**) */ +.prose strong, .prose b { + font-weight: 600; +} + +/* Inline code (single backtick) */ +:not(pre) > code { + background-color: rgba(0, 0, 0, 0.07); + font-weight: 600; +} + +html.dark :not(pre) > code, +[data-theme="dark"] :not(pre) > code { + background-color: rgba(255, 255, 255, 0.13); + font-weight: 600; +} diff --git a/docs_new/docs.json b/docs_new/docs.json new file mode 100644 index 000000000000..7b150e01d9b3 --- /dev/null +++ b/docs_new/docs.json @@ -0,0 +1,1237 @@ +{ + "$schema": "https://mintlify.com/docs.json", + "theme": "aspen", + "name": "SGLang Documentation", + "seo": { + "metatags": { + "google-site-verification": "bX3ofyYQhraIpAYf4DpyZQXZO_G4xLR_RqeBAKnJA7g" + } + }, + "redirects": [ + { + "source": "/docs/references/learn_more", + "destination": "/" + }, + { + "source": "/cookbook", + "destination": "/cookbook/intro" + }, + { + "source": "/whl", + "destination": "https://sgl-project.github.io/whl/", + "permanent": false + }, + { + "source": "/whl/:path*", + "destination": "https://sgl-project.github.io/whl/:path*", + "permanent": false + }, + { + "source": "/sglang-omni", + "destination": "https://sgl-project.github.io/sglang-omni/", + "permanent": false + }, + { + "source": "/sglang-omni/:path*", + "destination": "https://sgl-project.github.io/sglang-omni/:path*", + "permanent": false + }, + { + "source": "/SpecForge", + "destination": "https://sgl-project.github.io/SpecForge/", + "permanent": false + }, + { + "source": "/SpecForge/:path*", + "destination": "https://sgl-project.github.io/SpecForge/:path*", + "permanent": false + }, + { + "source": "/specforge", + "destination": "https://sgl-project.github.io/SpecForge/", + "permanent": false + }, + { + "source": "/specforge/:path*", + "destination": "https://sgl-project.github.io/SpecForge/:path*", + "permanent": 
false + }, + { + "source": "/index.html", + "destination": "/" + }, + { + "source": "/advanced_features/adaptive_speculative_decoding.html", + "destination": "/docs/advanced_features/adaptive_speculative_decoding" + }, + { + "source": "/advanced_features/attention_backend.html", + "destination": "/docs/advanced_features/attention_backend" + }, + { + "source": "/advanced_features/breakable_cuda_graph.html", + "destination": "/docs/advanced_features/breakable_cuda_graph" + }, + { + "source": "/advanced_features/checkpoint_engine.html", + "destination": "/docs/advanced_features/checkpoint_engine" + }, + { + "source": "/advanced_features/cuda_graph_for_multi_modal_encoder.html", + "destination": "/docs/advanced_features/cuda_graph_for_multi_modal_encoder" + }, + { + "source": "/advanced_features/deterministic_inference.html", + "destination": "/docs/advanced_features/deterministic_inference" + }, + { + "source": "/advanced_features/dp_dpa_smg_guide.html", + "destination": "/docs/advanced_features/dp_dpa_smg_guide" + }, + { + "source": "/advanced_features/dp_for_multi_modal_encoder.html", + "destination": "/docs/advanced_features/dp_for_multi_modal_encoder" + }, + { + "source": "/advanced_features/epd_disaggregation.html", + "destination": "/docs/advanced_features/epd_disaggregation" + }, + { + "source": "/advanced_features/expert_parallelism.html", + "destination": "/docs/advanced_features/expert_parallelism" + }, + { + "source": "/advanced_features/forward_hooks.html", + "destination": "/docs/advanced_features/forward_hooks" + }, + { + "source": "/advanced_features/hicache.html", + "destination": "/docs/advanced_features/hicache" + }, + { + "source": "/advanced_features/hicache_best_practices.html", + "destination": "/docs/advanced_features/hicache_best_practices" + }, + { + "source": "/advanced_features/hicache_design.html", + "destination": "/docs/advanced_features/hicache_design" + }, + { + "source": "/advanced_features/hicache_storage_runtime_attach_detach.html", 
+ "destination": "/docs/advanced_features/hicache_storage_runtime_attach_detach" + }, + { + "source": "/advanced_features/hisparse_guide.html", + "destination": "/docs/advanced_features/hisparse_guide" + }, + { + "source": "/advanced_features/hyperparameter_tuning.html", + "destination": "/docs/advanced_features/hyperparameter_tuning" + }, + { + "source": "/advanced_features/lora.html", + "destination": "/docs/advanced_features/lora" + }, + { + "source": "/advanced_features/object_storage.html", + "destination": "/docs/advanced_features/object_storage" + }, + { + "source": "/advanced_features/observability.html", + "destination": "/docs/advanced_features/observability" + }, + { + "source": "/advanced_features/pd_disaggregation.html", + "destination": "/docs/advanced_features/pd_disaggregation" + }, + { + "source": "/advanced_features/piecewise_cuda_graph.html", + "destination": "/docs/advanced_features/piecewise_cuda_graph" + }, + { + "source": "/advanced_features/pipeline_parallelism.html", + "destination": "/docs/advanced_features/pipeline_parallelism" + }, + { + "source": "/advanced_features/quantization.html", + "destination": "/docs/advanced_features/quantization" + }, + { + "source": "/advanced_features/quantized_kv_cache.html", + "destination": "/docs/advanced_features/quantized_kv_cache" + }, + { + "source": "/advanced_features/rfork.html", + "destination": "/docs/advanced_features/rfork" + }, + { + "source": "/advanced_features/separate_reasoning.html", + "destination": "/docs/advanced_features/separate_reasoning" + }, + { + "source": "/advanced_features/server_arguments.html", + "destination": "/docs/advanced_features/server_arguments" + }, + { + "source": "/advanced_features/sgl_model_gateway.html", + "destination": "/docs/advanced_features/sgl_model_gateway" + }, + { + "source": "/advanced_features/sglang_for_rl.html", + "destination": "/docs/advanced_features/sglang_for_rl" + }, + { + "source": "/advanced_features/speculative_decoding.html", + 
"destination": "/docs/advanced_features/speculative_decoding" + }, + { + "source": "/advanced_features/structured_outputs.html", + "destination": "/docs/advanced_features/structured_outputs" + }, + { + "source": "/advanced_features/structured_outputs_for_reasoning_models.html", + "destination": "/docs/advanced_features/structured_outputs_for_reasoning_models" + }, + { + "source": "/advanced_features/tool_parser.html", + "destination": "/docs/advanced_features/tool_parser" + }, + { + "source": "/advanced_features/vlm_query.html", + "destination": "/docs/advanced_features/vlm_query" + }, + { + "source": "/basic_usage/deepseek_ocr.html", + "destination": "/docs/basic_usage/deepseek_ocr" + }, + { + "source": "/basic_usage/deepseek_v3.html", + "destination": "/docs/basic_usage/deepseek_v3" + }, + { + "source": "/basic_usage/deepseek_v32.html", + "destination": "/docs/basic_usage/deepseek_v32" + }, + { + "source": "/basic_usage/glm45.html", + "destination": "/docs/basic_usage/glm45" + }, + { + "source": "/basic_usage/glmv.html", + "destination": "/docs/basic_usage/glmv" + }, + { + "source": "/basic_usage/gpt_oss.html", + "destination": "/docs/basic_usage/gpt_oss" + }, + { + "source": "/basic_usage/llama4.html", + "destination": "/docs/basic_usage/llama4" + }, + { + "source": "/basic_usage/minimax_m2.html", + "destination": "/docs/basic_usage/minimax_m2" + }, + { + "source": "/basic_usage/native_api.html", + "destination": "/docs/basic_usage/native_api" + }, + { + "source": "/basic_usage/offline_engine_api.html", + "destination": "/docs/basic_usage/offline_engine_api" + }, + { + "source": "/basic_usage/ollama_api.html", + "destination": "/docs/basic_usage/ollama_api" + }, + { + "source": "/basic_usage/openai_api.html", + "destination": "/docs/basic_usage/openai_api" + }, + { + "source": "/basic_usage/openai_api_completions.html", + "destination": "/docs/basic_usage/openai_api_completions" + }, + { + "source": "/basic_usage/openai_api_embeddings.html", + "destination": 
"/docs/basic_usage/openai_api_embeddings" + }, + { + "source": "/basic_usage/openai_api_vision.html", + "destination": "/docs/basic_usage/openai_api_vision" + }, + { + "source": "/basic_usage/popular_model_usage.html", + "destination": "/docs/basic_usage/popular_model_usage" + }, + { + "source": "/basic_usage/qwen3.html", + "destination": "/docs/basic_usage/qwen3" + }, + { + "source": "/basic_usage/qwen3_5.html", + "destination": "/docs/basic_usage/qwen3_5" + }, + { + "source": "/basic_usage/qwen3_vl.html", + "destination": "/docs/basic_usage/qwen3_vl" + }, + { + "source": "/basic_usage/sampling_params.html", + "destination": "/docs/basic_usage/sampling_params" + }, + { + "source": "/basic_usage/send_request.html", + "destination": "/docs/basic_usage/send_request" + }, + { + "source": "/developer_guide/bench_serving.html", + "destination": "/docs/developer_guide/bench_serving" + }, + { + "source": "/developer_guide/benchmark_and_profiling.html", + "destination": "/docs/developer_guide/benchmark_and_profiling" + }, + { + "source": "/developer_guide/contribution_guide.html", + "destination": "/docs/developer_guide/contribution_guide" + }, + { + "source": "/developer_guide/development_guide_using_docker.html", + "destination": "/docs/developer_guide/development_guide_using_docker" + }, + { + "source": "/developer_guide/development_jit_kernel_guide.html", + "destination": "/docs/developer_guide/development_jit_kernel_guide" + }, + { + "source": "/developer_guide/evaluating_new_models.html", + "destination": "/docs/developer_guide/evaluating_new_models" + }, + { + "source": "/developer_guide/release_process.html", + "destination": "/docs/developer_guide/release_process" + }, + { + "source": "/developer_guide/setup_github_runner.html", + "destination": "/docs/developer_guide/setup_github_runner" + }, + { + "source": "/diffusion/api/cli.html", + "destination": "/docs/sglang-diffusion/api/cli" + }, + { + "source": "/diffusion/api/openai_api.html", + "destination": 
"/docs/sglang-diffusion/api/openai_api" + }, + { + "source": "/diffusion/api/post_processing.html", + "destination": "/docs/sglang-diffusion/api/post_processing" + }, + { + "source": "/diffusion/ci_perf.html", + "destination": "/docs/sglang-diffusion/ci_perf" + }, + { + "source": "/diffusion/compatibility_matrix.html", + "destination": "/docs/sglang-diffusion/compatibility_matrix" + }, + { + "source": "/diffusion/contributing.html", + "destination": "/docs/sglang-diffusion/contributing" + }, + { + "source": "/diffusion/development.html", + "destination": "/docs/sglang-diffusion/installation" + }, + { + "source": "/diffusion/disaggregation.html", + "destination": "/docs/sglang-diffusion/disaggregation" + }, + { + "source": "/diffusion/environment_variables.html", + "destination": "/docs/sglang-diffusion/environment_variables" + }, + { + "source": "/diffusion/index.html", + "destination": "/docs/sglang-diffusion/index" + }, + { + "source": "/diffusion/installation.html", + "destination": "/docs/sglang-diffusion/installation" + }, + { + "source": "/diffusion/performance/attention_backends.html", + "destination": "/docs/sglang-diffusion/attention_backends" + }, + { + "source": "/diffusion/performance/dynamic_batching.html", + "destination": "/docs/sglang-diffusion/dynamic_batching" + }, + { + "source": "/diffusion/performance/cache/cache_dit.html", + "destination": "/docs/sglang-diffusion/cache_dit" + }, + { + "source": "/diffusion/performance/cache/index.html", + "destination": "/docs/sglang-diffusion/caching-acceleration" + }, + { + "source": "/diffusion/performance/cache/teacache.html", + "destination": "/docs/sglang-diffusion/teacache" + }, + { + "source": "/diffusion/performance/index.html", + "destination": "/docs/sglang-diffusion/performance-optimization" + }, + { + "source": "/diffusion/performance/profiling.html", + "destination": "/docs/sglang-diffusion/profiling" + }, + { + "source": "/diffusion/performance/ring_sp_performance.html", + "destination": 
"/docs/sglang-diffusion/ring_sp_performance" + }, + { + "source": "/diffusion/quantization.html", + "destination": "/docs/sglang-diffusion/quantization" + }, + { + "source": "/diffusion/reference.html", + "destination": "/docs/sglang-diffusion/installation" + }, + { + "source": "/diffusion/support_new_models.html", + "destination": "/docs/sglang-diffusion/support_new_models" + }, + { + "source": "/diffusion/usage.html", + "destination": "/docs/sglang-diffusion/installation" + }, + { + "source": "/get_started/install.html", + "destination": "/docs/get-started/install" + }, + { + "source": "/platforms/amd_gpu.html", + "destination": "/docs/hardware-platforms/amd_gpu" + }, + { + "source": "/platforms/apple_metal.html", + "destination": "/docs/hardware-platforms/apple_metal" + }, + { + "source": "/platforms/ascend/ascend_contribution_guide.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_contribution_guide" + }, + { + "source": "/platforms/ascend/ascend_npu.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu" + }, + { + "source": "/platforms/ascend/ascend_npu_best_practice.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_best_practice" + }, + { + "source": "/platforms/ascend/ascend_npu_deepseek_example.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_deepseek_example" + }, + { + "source": "/platforms/ascend/ascend_npu_environment_variables.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_environment_variables" + }, + { + "source": "/platforms/ascend/ascend_npu_glm5_examples.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_glm5_examples" + }, + { + "source": "/platforms/ascend/ascend_npu_quantization.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_quantization" + }, + { + "source": "/platforms/ascend/ascend_npu_quick_start.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_quick_start" + 
}, + { + "source": "/platforms/ascend/ascend_npu_qwen3_5_examples.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_5_examples" + }, + { + "source": "/platforms/ascend/ascend_npu_qwen3_examples.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_examples" + }, + { + "source": "/platforms/ascend/ascend_npu_support.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_quick_start" + }, + { + "source": "/platforms/ascend/ascend_npu_support_features.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_support_features" + }, + { + "source": "/platforms/ascend/ascend_npu_support_models.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_support_models" + }, + { + "source": "/platforms/ascend/mindspore_backend.html", + "destination": "/docs/hardware-platforms/ascend-npus/mindspore_backend" + }, + { + "source": "/platforms/ascend_npu_ring_sp_performance.html", + "destination": "/docs/hardware-platforms/ascend-npus/ascend_npu_ring_sp_performance" + }, + { + "source": "/platforms/cpu_server.html", + "destination": "/docs/hardware-platforms/cpu_server" + }, + { + "source": "/platforms/mthreads_gpu.html", + "destination": "/docs/hardware-platforms/mthreads_gpu" + }, + { + "source": "/platforms/nvidia_jetson.html", + "destination": "/docs/hardware-platforms/nvidia_jetson" + }, + { + "source": "/platforms/plugin.html", + "destination": "/docs/hardware-platforms/plugin" + }, + { + "source": "/platforms/tpu.html", + "destination": "/docs/hardware-platforms/tpu" + }, + { + "source": "/platforms/xpu.html", + "destination": "/docs/hardware-platforms/xpu" + }, + { + "source": "/references/custom_chat_template.html", + "destination": "/docs/references/custom_chat_template" + }, + { + "source": "/references/environment_variables.html", + "destination": "/docs/references/environment_variables" + }, + { + "source": "/references/faq.html", + "destination": 
"/docs/references/faq" + }, + { + "source": "/references/frontend/choices_methods.html", + "destination": "/docs/references/frontend/choices_methods" + }, + { + "source": "/references/frontend/frontend_index.html", + "destination": "/docs/references/frontend/frontend_index" + }, + { + "source": "/references/frontend/frontend_tutorial.html", + "destination": "/docs/references/frontend/frontend_tutorial" + }, + { + "source": "/references/learn_more.html", + "destination": "/" + }, + { + "source": "/references/multi_node_deployment/deploy_on_k8s.html", + "destination": "/docs/references/multi_node_deployment/deploy_on_k8s" + }, + { + "source": "/references/multi_node_deployment/lws_pd/lws_pd_deploy.html", + "destination": "/docs/references/multi_node_deployment/lws_pd/lws_pd_deploy" + }, + { + "source": "/references/multi_node_deployment/multi_node.html", + "destination": "/docs/references/multi_node_deployment/multi_node" + }, + { + "source": "/references/multi_node_deployment/multi_node_index.html", + "destination": "/docs/references/multi_node_deployment/multi_node_index" + }, + { + "source": "/references/multi_node_deployment/rbg_pd/deepseekv32_pd.html", + "destination": "/docs/references/multi_node_deployment/rbg_pd/deepseekv32_pd" + }, + { + "source": "/references/post_training_integration.html", + "destination": "/docs/references/post_training_integration" + }, + { + "source": "/references/production_metrics.html", + "destination": "/docs/references/production_metrics" + }, + { + "source": "/references/production_request_trace.html", + "destination": "/docs/references/production_request_trace" + }, + { + "source": "/references/release_lookup.html", + "destination": "/docs/references/overview" + }, + { + "source": "/references/torch_compile_cache.html", + "destination": "/docs/references/torch_compile_cache" + }, + { + "source": "/supported_models/extending/index.html", + "destination": "/docs/supported-models" + }, + { + "source": 
"/supported_models/extending/mindspore_models.html", + "destination": "/docs/supported-models/mindspore_models" + }, + { + "source": "/supported_models/extending/modelscope.html", + "destination": "/docs/supported-models/modelscope" + }, + { + "source": "/supported_models/extending/support_new_models.html", + "destination": "/docs/supported-models/support_new_models" + }, + { + "source": "/supported_models/extending/transformers_fallback.html", + "destination": "/docs/supported-models/transformers_fallback" + }, + { + "source": "/supported_models/index.html", + "destination": "/docs/supported-models" + }, + { + "source": "/supported_models/retrieval_ranking/classify_models.html", + "destination": "/docs/supported-models/classify_models" + }, + { + "source": "/supported_models/retrieval_ranking/embedding_models.html", + "destination": "/docs/supported-models/embedding_models" + }, + { + "source": "/supported_models/retrieval_ranking/index.html", + "destination": "/docs/supported-models" + }, + { + "source": "/supported_models/retrieval_ranking/rerank_models.html", + "destination": "/docs/supported-models/rerank_models" + }, + { + "source": "/supported_models/specialized/index.html", + "destination": "/docs/supported-models" + }, + { + "source": "/supported_models/specialized/reward_models.html", + "destination": "/docs/supported-models/reward_models" + }, + { + "source": "/supported_models/text_generation/diffusion_language_models.html", + "destination": "/docs/supported-models/diffusion_language_models" + }, + { + "source": "/supported_models/text_generation/generative_models.html", + "destination": "/docs/supported-models/generative_models" + }, + { + "source": "/supported_models/text_generation/index.html", + "destination": "/docs/supported-models" + }, + { + "source": "/supported_models/text_generation/multimodal_language_models.html", + "destination": "/docs/supported-models/multimodal_language_models" + }, + { + "source": "/supported_models.html", + 
"destination": "/docs/supported-models" + }, + { + "source": "/diffusion.html", + "destination": "/docs/sglang-diffusion/index" + } + ], + "colors": { + "primary": "#d55816", + "light": "#d55816", + "dark": "#d55816" + }, + "background": { + "decoration": "grid", + "color": { + "dark": "#1d1d1d", + "light": "#fffcfb" + } + }, + "fonts": { + "heading": { + "family": "Inter", + "weight": 600 + }, + "body": { + "family": "Inter", + "weight": 400 + } + }, + "favicon": "/favicon.png", + "navigation": { + "tabs": [ + { + "tab": "Get Started", + "groups": [ + { + "group": "Get Started", + "icon": "play", + "pages": [ + "index", + "docs/get-started/install", + "docs/get-started/quickstart", + "docs/basic_usage/send_request" + ] + } + ] + }, + { + "tab": "User Guide", + "groups": [ + { + "group": "Basic Usage", + "icon": "book-open", + "pages": [ + "docs/basic_usage/overview", + { + "group": "OpenAI-Compatible APIs", + "pages": [ + "docs/basic_usage/openai_api", + "docs/basic_usage/openai_api_completions", + "docs/basic_usage/openai_api_vision", + "docs/basic_usage/openai_api_embeddings" + ] + }, + "docs/basic_usage/ollama_api", + "docs/basic_usage/offline_engine_api", + "docs/basic_usage/native_api", + "docs/basic_usage/sampling_params", + { + "group": "Popular Model Usage", + "pages": [ + "docs/basic_usage/popular_model_usage", + "docs/basic_usage/deepseek_v3", + "docs/basic_usage/deepseek_v32", + "docs/basic_usage/deepseek_ocr", + "docs/basic_usage/glm45", + "docs/basic_usage/glmv", + "docs/basic_usage/gpt_oss", + "docs/basic_usage/kimi_k2_5", + "docs/basic_usage/minimax_m2", + "docs/basic_usage/qwen3", + "docs/basic_usage/qwen3_5", + "docs/basic_usage/qwen3_vl", + "docs/basic_usage/llama4" + ] + } + ] + }, + { + "group": "Advanced Features", + "icon": "gears", + "pages": [ + "docs/advanced_features/overview", + "docs/advanced_features/server_arguments", + "docs/advanced_features/object_storage", + "docs/advanced_features/hyperparameter_tuning", + 
"docs/advanced_features/attention_backend", + "docs/advanced_features/hisparse_guide", + "docs/advanced_features/speculative_decoding", + "docs/advanced_features/adaptive_speculative_decoding", + "docs/advanced_features/structured_outputs", + "docs/advanced_features/structured_outputs_for_reasoning_models", + "docs/advanced_features/tool_parser", + "docs/advanced_features/separate_reasoning", + "docs/advanced_features/quantization", + "docs/advanced_features/quantized_kv_cache", + "docs/advanced_features/dp_dpa_smg_guide", + "docs/advanced_features/expert_parallelism", + "docs/advanced_features/lora", + "docs/advanced_features/pd_disaggregation", + "docs/advanced_features/epd_disaggregation", + "docs/advanced_features/pipeline_parallelism", + { + "group": "Hierarchical KV Caching (HiCache)", + "pages": [ + "docs/advanced_features/hicache", + "docs/advanced_features/hicache_best_practices", + "docs/advanced_features/hicache_design", + "docs/advanced_features/hicache_storage_runtime_attach_detach" + ] + }, + "docs/advanced_features/vlm_query", + "docs/advanced_features/dp_for_multi_modal_encoder", + "docs/advanced_features/cuda_graph_for_multi_modal_encoder", + "docs/advanced_features/breakable_cuda_graph", + "docs/advanced_features/piecewise_cuda_graph", + "docs/advanced_features/sgl_model_gateway", + "docs/advanced_features/deterministic_inference", + "docs/advanced_features/observability", + "docs/advanced_features/checkpoint_engine", + "docs/advanced_features/sglang_for_rl" + ] + }, + { + "group": "Supported Models", + "icon": "cubes", + "pages": [ + "docs/supported-models", + { + "group": "Text Generation", + "pages": [ + "docs/supported-models/generative_models", + "docs/supported-models/multimodal_language_models", + "docs/supported-models/diffusion_language_models" + ] + }, + { + "group": "Retrieval and Ranking", + "pages": [ + "docs/supported-models/embedding_models", + "docs/supported-models/rerank_models", + "docs/supported-models/classify_models" + ] + }, 
+ { + "group": "Specialized Models", + "pages": [ + "docs/supported-models/reward_models" + ] + }, + { + "group": "Extending SGLang", + "pages": [ + "docs/supported-models/support_new_models", + "docs/supported-models/transformers_fallback", + "docs/supported-models/modelscope", + "docs/supported-models/mindspore_models" + ] + } + ] + }, + { + "group": "Developer Guide", + "icon": "code", + "pages": [ + "docs/developer_guide/overview", + "docs/developer_guide/contribution_guide", + { + "group": "Development", + "pages": [ + "docs/developer_guide/development_guide_using_docker", + "docs/developer_guide/development_jit_kernel_guide" + ] + }, + { + "group": "Benchmarking", + "pages": [ + "docs/developer_guide/benchmark_and_profiling", + "docs/developer_guide/bench_serving" + ] + }, + "docs/developer_guide/evaluating_new_models", + "docs/developer_guide/msprobe_debugging_guide" + ] + }, + { + "group": "References", + "icon": "bookmark", + "pages": [ + "docs/references/overview", + "docs/references/faq", + "docs/references/environment_variables", + "docs/references/production_metrics", + "docs/references/production_request_trace", + { + "group": "Multi-Node Deployment", + "pages": [ + "docs/references/multi_node_deployment/multi_node_index", + "docs/references/multi_node_deployment/multi_node", + "docs/references/multi_node_deployment/deploy_on_k8s", + "docs/references/multi_node_deployment/lws_pd/lws_pd_deploy", + "docs/references/multi_node_deployment/rbg_pd/deepseekv32_pd" + ] + }, + "docs/references/custom_chat_template", + { + "group": "Frontend Language", + "pages": [ + "docs/references/frontend/frontend_index", + "docs/references/frontend/frontend_tutorial", + "docs/references/frontend/choices_methods" + ] + }, + { + "group": "Cookbook", + "pages": [ + "cookbook/base/reference/server_arguments" + ] + }, + "docs/references/post_training_integration" + ] + } + ] + }, + { + "tab": "Hardware", + "groups": [ + { + "group": "Hardware Platforms", + "icon": "microchip", 
+ "pages": [ + "docs/hardware-platforms/overview", + "docs/hardware-platforms/nvidia-gpus", + "docs/hardware-platforms/amd_gpu", + "docs/hardware-platforms/apple_metal", + { + "group": "Ascend NPUs", + "pages": [ + "docs/hardware-platforms/ascend-npus/ascend_npu_quick_start", + "docs/hardware-platforms/ascend-npus/ascend_npu", + "docs/hardware-platforms/ascend-npus/ascend_npu_support_features", + "docs/hardware-platforms/ascend-npus/ascend_npu_support_models", + "docs/hardware-platforms/ascend-npus/ascend_npu_quantization", + "docs/hardware-platforms/ascend-npus/ascend_npu_deepseek_example", + "docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_examples", + "docs/hardware-platforms/ascend-npus/mindspore_backend", + "docs/hardware-platforms/ascend-npus/ascend_contribution_guide", + "docs/hardware-platforms/ascend-npus/ascend_npu_best_practice", + "docs/hardware-platforms/ascend-npus/ascend_npu_ring_sp_performance", + "docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_5_examples", + "docs/hardware-platforms/ascend-npus/ascend_npu_glm5_examples", + "docs/hardware-platforms/ascend-npus/ascend_npu_environment_variables" + ] + }, + "docs/hardware-platforms/cpu_server", + { + "group": "Edge & Embedded", + "pages": [ + "docs/hardware-platforms/nvidia_jetson" + ] + }, + "docs/hardware-platforms/mthreads_gpu", + "docs/hardware-platforms/tpu", + "docs/hardware-platforms/xpu", + "docs/hardware-platforms/plugin" + ] + } + ] + }, + { + "tab": "Cookbook", + "groups": [ + { + "group": "Cookbook", + "icon": "book", + "pages": [ + "cookbook/intro", + { + "group": "Autoregressive Models", + "pages": [ + "cookbook/autoregressive/intro", + { + "group": "Qwen", + "pages": [ + "cookbook/autoregressive/Qwen/Qwen3.6", + "cookbook/autoregressive/Qwen/Qwen3.5", + "cookbook/autoregressive/Qwen/Qwen3", + "cookbook/autoregressive/Qwen/Qwen3-Next", + "cookbook/autoregressive/Qwen/Qwen3-Coder", + "cookbook/autoregressive/Qwen/Qwen3-Coder-Next", + "cookbook/autoregressive/Qwen/Qwen3-VL", + 
"cookbook/autoregressive/Qwen/Qwen2.5-VL" + ] + }, + { + "group": "DeepSeek", + "pages": [ + "cookbook/autoregressive/DeepSeek/DeepSeek-V4", + "cookbook/autoregressive/DeepSeek/DeepSeek-V3_2", + "cookbook/autoregressive/DeepSeek/DeepSeek-V3_1", + "cookbook/autoregressive/DeepSeek/DeepSeek-V3", + "cookbook/autoregressive/DeepSeek/DeepSeek-R1", + "cookbook/autoregressive/DeepSeek/DeepSeek-Math-V2", + "cookbook/autoregressive/DeepSeek/DeepSeek-OCR", + "cookbook/autoregressive/DeepSeek/DeepSeek-OCR-2" + ] + }, + { + "group": "Llama", + "pages": [ + "cookbook/autoregressive/Llama/Llama4", + "cookbook/autoregressive/Llama/Llama3.3-70B", + "cookbook/autoregressive/Llama/Llama3.1" + ] + }, + { + "group": "GLM", + "pages": [ + "cookbook/autoregressive/GLM/GLM-5.1", + "cookbook/autoregressive/GLM/GLM-5", + "cookbook/autoregressive/GLM/GLM-OCR", + "cookbook/autoregressive/GLM/GLM-Glyph", + "cookbook/autoregressive/GLM/GLM-4.7", + "cookbook/autoregressive/GLM/GLM-4.7-Flash", + "cookbook/autoregressive/GLM/GLM-4.6", + "cookbook/autoregressive/GLM/GLM-4.6V", + "cookbook/autoregressive/GLM/GLM-4.5", + "cookbook/autoregressive/GLM/GLM-4.5V" + ] + }, + { + "group": "Google", + "pages": [ + "cookbook/autoregressive/Google/Gemma4" + ] + }, + { + "group": "OpenAI", + "pages": [ + "cookbook/autoregressive/OpenAI/GPT-OSS" + ] + }, + { + "group": "Moonshotai", + "pages": [ + "cookbook/autoregressive/Moonshotai/Kimi-K2.6", + "cookbook/autoregressive/Moonshotai/Kimi-K2.5", + "cookbook/autoregressive/Moonshotai/Kimi-K2", + "cookbook/autoregressive/Moonshotai/Kimi-Linear" + ] + }, + { + "group": "MiniMax", + "pages": [ + "cookbook/autoregressive/MiniMax/MiniMax-M2.7", + "cookbook/autoregressive/MiniMax/MiniMax-M2", + "cookbook/autoregressive/MiniMax/MiniMax-M2.5" + ] + }, + { + "group": "NVIDIA", + "pages": [ + "cookbook/autoregressive/NVIDIA/Nemotron3-Nano-Omni", + "cookbook/autoregressive/NVIDIA/Nemotron3-Nano", + "cookbook/autoregressive/NVIDIA/Nemotron3-Super" + ] + }, + { + "group": 
"Ernie", + "pages": [ + "cookbook/autoregressive/Ernie/Ernie4.5", + "cookbook/autoregressive/Ernie/Ernie4.5-VL" + ] + }, + { + "group": "StepFun", + "pages": [ + "cookbook/autoregressive/StepFun/Step3.5", + "cookbook/autoregressive/StepFun/Step3-VL-10B" + ] + }, + { + "group": "InclusionAI", + "pages": [ + "cookbook/autoregressive/InclusionAI/Ling-2.6", + "cookbook/autoregressive/InclusionAI/Ling-2.5-1T", + "cookbook/autoregressive/InclusionAI/Ring-2.5-1T", + "cookbook/autoregressive/InclusionAI/LLaDA-2.1" + ] + }, + { + "group": "InternLM", + "pages": [ + "cookbook/autoregressive/InternLM/Intern-S1" + ] + }, + { + "group": "InternVL", + "pages": [ + "cookbook/autoregressive/InternVL/InternVL3.5" + ] + }, + { + "group": "Jina AI", + "pages": [ + "cookbook/autoregressive/Jina/Jina-reranker-m0" + ] + }, + { + "group": "Mistral", + "pages": [ + "cookbook/autoregressive/Mistral/Ministral-3", + "cookbook/autoregressive/Mistral/Mistral-Small-4", + "cookbook/autoregressive/Mistral/Mistral-Medium-3.5", + "cookbook/autoregressive/Mistral/Devstral-2" + ] + }, + { + "group": "Xiaomi", + "pages": [ + "cookbook/autoregressive/Xiaomi/MiMo-V2.5", + "cookbook/autoregressive/Xiaomi/MiMo-V2-Flash" + ] + }, + { + "group": "FlashLabs", + "pages": [ + "cookbook/autoregressive/FlashLabs/Chroma1.0" + ] + }, + { + "group": "Tencent", + "pages": [ + "cookbook/autoregressive/Tencent/Hunyuan3-Preview" + ] + } + ] + }, + { + "group": "Diffusion Models", + "pages": [ + "cookbook/diffusion/intro", + { + "group": "FLUX", + "pages": [ + "cookbook/diffusion/FLUX/FLUX" + ] + }, + { + "group": "Wan", + "pages": [ + "cookbook/diffusion/Wan/Wan2.1", + "cookbook/diffusion/Wan/Wan2.2" + ] + }, + { + "group": "LTX", + "pages": [ + "cookbook/diffusion/LTX/LTX" + ] + }, + { + "group": "Qwen-Image", + "pages": [ + "cookbook/diffusion/Qwen-Image/Qwen-Image", + "cookbook/diffusion/Qwen-Image/Qwen-Image-Edit" + ] + }, + { + "group": "Z-Image", + "pages": [ + "cookbook/diffusion/Z-Image/Z-Image-Turbo" + ] + }, 
+ { + "group": "MOVA", + "pages": [ + "cookbook/diffusion/MOVA/MOVA" + ] + } + ] + }, + { + "group": "SpecBundle", + "pages": [ + "cookbook/specbundle/supported_models", + "cookbook/specbundle/specbundle_usage" + ] + }, + { + "group": "Benchmarks", + "pages": [ + "cookbook/base/benchmarks/autoregressive_model_benchmark", + "cookbook/base/benchmarks/diffusion_model_benchmark" + ] + } + ] + } + ] + }, + { + "tab": "SGLang Diffusion", + "groups": [ + { + "group": "SGLang Diffusion", + "icon": "sparkles", + "pages": [ + "docs/sglang-diffusion/index", + "docs/sglang-diffusion/installation", + "docs/sglang-diffusion/compatibility_matrix", + "docs/sglang-diffusion/disaggregation", + "docs/sglang-diffusion/quantization", + { + "group": "Usage", + "pages": [ + "docs/sglang-diffusion/api/cli", + "docs/sglang-diffusion/api/openai_api", + "docs/sglang-diffusion/api/post_processing" + ] + }, + { + "group": "Performance Optimization", + "pages": [ + "docs/sglang-diffusion/performance-optimization", + "docs/sglang-diffusion/ring_sp_performance", + "docs/sglang-diffusion/attention_backends", + { + "group": "Inference Batching", + "pages": [ + "docs/sglang-diffusion/dynamic_batching" + ] + }, + "docs/sglang-diffusion/profiling", + "docs/sglang-diffusion/ci_perf" + ] + }, + { + "group": "Caching Strategies", + "pages": [ + "docs/sglang-diffusion/caching-acceleration", + "docs/sglang-diffusion/cache_dit", + "docs/sglang-diffusion/teacache" + ] + }, + { + "group": "References", + "pages": [ + "docs/sglang-diffusion/environment_variables", + "docs/sglang-diffusion/support_new_models", + "docs/sglang-diffusion/contributing" + ] + } + ] + } + ] + } + ], + "global": { + "anchors": [] + } + }, + "logo": { + "light": "/logo/logo.png", + "dark": "/logo/logo.png" + }, + "contextual": { + "options": [ + "copy", + "view", + "chatgpt", + "claude", + "perplexity", + "mcp", + "cursor", + "vscode" + ] + }, + "footer": { + "socials": { + "github": "https://github.com/sgl-project/sglang", + "x": 
"https://x.com/lmsysorg", + "linkedin": "https://www.linkedin.com/company/sgl-project/posts?feedView=all", + "slack": "https://slack.sglang.io/", + "discord": "https://discord.gg/4ugb2t6YY2" + } + } +} diff --git a/docs_new/docs/advanced_features/adaptive_speculative_decoding.mdx b/docs_new/docs/advanced_features/adaptive_speculative_decoding.mdx new file mode 100644 index 000000000000..7a59376fada5 --- /dev/null +++ b/docs_new/docs/advanced_features/adaptive_speculative_decoding.mdx @@ -0,0 +1,197 @@ +--- +title: "Adaptive Speculative Decoding" +metatags: + description: "Configure adaptive speculative decoding so SGLang can adjust speculative steps and draft tokens at runtime based on acceptance behavior." +--- + +Adaptive speculative decoding lets SGLang adjust `speculative_num_steps/speculative_num_draft_tokens` at runtime instead of keeping a single fixed value for the whole server lifetime. +It is designed for workloads whose accept length changes over time, where one static step count is rarely optimal. + +## Current support + +- Only `--speculative-algorithm EAGLE` +- Only `--speculative-eagle-topk 1` +- If either condition is not met, SGLang falls back to static speculative settings + +## Why adaptive steps help + +`speculative_num_steps` controls how many draft-model autoregressive steps run in each speculative round. In practice, the best value depends on the current workload. + +- If `num_steps` is too small, the draft model could have produced more accepted tokens, but the round stops too early. +- If `num_steps` is too large, the draft model produces many candidate tokens that the target model rejects, so extra draft work is wasted. + +Real traffic often moves between high-acceptance and low-acceptance phases, so one fixed step count is usually a compromise. Adaptive mode tries to follow the workload instead of hard-coding a single global `num_steps`. 
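The too-small versus too-large trade-off can be made concrete with a toy model. The sketch below is illustrative only and is not how SGLang measures acceptance: it assumes a single draft chain (`topk = 1`), a constant per-token acceptance probability `p_accept`, and that the first rejection discards all later draft tokens. The function names are hypothetical.

```python
def expected_accepted(p_accept: float, num_steps: int) -> float:
    """Expected number of accepted draft tokens in one speculative round.

    Toy model (assumption): one draft chain, each token accepted
    independently with probability p_accept, and the first rejection
    discards every later token in the chain.
    """
    # Token i survives only if tokens 1..i are all accepted: p_accept ** i.
    return sum(p_accept ** i for i in range(1, num_steps + 1))


def wasted_draft_tokens(p_accept: float, num_steps: int) -> float:
    """Drafted tokens that the target model is expected to reject."""
    return num_steps - expected_accepted(p_accept, num_steps)


if __name__ == "__main__":
    for p in (0.5, 0.9):
        for k in (1, 3, 7):
            print(
                f"p={p} steps={k} "
                f"accepted={expected_accepted(p, k):.2f} "
                f"wasted={wasted_draft_tokens(p, k):.2f}"
            )
```

Under these assumptions, going from 3 to 7 steps at `p_accept = 0.5` adds only about 0.12 expected accepted tokens while drafting four more, whereas at `p_accept = 0.9` it adds about 2.26. This is the kind of shift between phases that a fixed step count cannot follow.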
+ +## Design overview + +The adaptive mechanism has three pieces: + +- `AdaptiveSpeculativeParams`: the EMA-based policy +- `SpecRuntimeState`: the per-tier runtime state bundle +- `AdaptiveController`: the coordinator that chooses a tier and activates the matching runtime state + +At startup, SGLang pre-builds one runtime state per candidate tier. By default, the candidate tiers are `candidate_steps = [1, 3, 7]`. + +```mermaid +--- +title: "SpecRuntimeState — speculative_num_steps / speculative_num_draft_tokens" +--- +graph LR + subgraph SR[" "] + direction LR + subgraph D["Draft stage"] + direction TB + d1[attn_backend] + d2[cuda_graph] + end + subgraph V["Verify stage"] + direction TB + v1[attn_backend] + v2[cuda_graph] + end + subgraph E["Extend stage"] + direction TB + e1[attn_backend] + e2[cuda_graph] + end + end +``` + +This matters because `CudaGraphRunner` is shape-dependent. Each candidate tier owns its own graph and backend state, so runtime switching is a reference swap, not an online graph recapture. + +## Runtime flow + +The adaptive update happens after verify and affects the next round, not the current one: + +```mermaid +--- +title: "EAGLEWorker.forward_batch_generation() — decode path" +--- +flowchart TD + A["① draft(batch)
draft model multi-step generation with current tier"] + B["② verify(batch, spec_info)
target model tree verification → produces accept_length_per_req"] + C["③ forward_draft_extend_after_decode(batch)
draft model KV-cache catch-up"] + D["④ adaptive_controller.on_verify_complete(accept_lengths)
update EMA, apply warmup / interval / hysteresis gates
if tier changed, select a pre-built state from pool"] + E["worker.apply_runtime_state(state)"] + A --> B --> C --> D --> E +``` + +> Tier switch happens after the current round completes. Backends and CUDA graphs are never swapped mid-round. + +## How the policy decides + +After each verify pass, SGLang reads the accepted draft length per request, computes the batch average, smooths it with an exponential moving average (EMA), and switches among the pre-built candidate tiers `[1, 3, 7]` by default. + +The decision logic is intentionally conservative: + +- `warmup_batches` skips the first few batches +- `update_interval` avoids switching every batch +- `down_hysteresis` and `up_hysteresis` reduce oscillation + +Conceptually, the policy probes one step beyond the observed acceptance: + +```text +target_steps ≈ clamp(round(ema_accept_len) + 1, min(candidate_steps), max(candidate_steps)) +``` + +So if recent requests consistently accept more drafted tokens, the policy tends to move up. If they start rejecting earlier, it tends to move down. + +## Usage + +`--speculative-adaptive-config` is optional, but the speculative setup still needs to be valid for adaptive mode. + +```bash +python3 -m sglang.launch_server \ + --model meta-llama/Llama-2-7b-chat-hf \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \ + --speculative-eagle-topk 1 \ + --speculative-num-steps 3 \ + --speculative-num-draft-tokens 4 \ + --speculative-adaptive +``` + +If you want to override the defaults, add `--speculative-adaptive-config /path/to/adaptive_spec.json`. + +Example config: + +```json +{ + "candidate_steps": [1, 3, 7], + "ema_alpha": 0.2, + "warmup_batches": 10, + "update_interval": 5 +} +``` + +## Config file reference + +The config file is optional. Any omitted keys use defaults. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Key | Default | Meaning |
| --- | --- | --- |
| `candidate_steps` | `[1, 3, 7]` | Discrete `speculative_num_steps` tiers that adaptive mode can switch between |
| `ema_alpha` | `0.2` | EMA smoothing factor for accepted draft length |
| `update_interval` | `5` | Recompute interval, in verify batches, after warmup |
| `warmup_batches` | `10` | Number of verify batches to observe before switching |
| `down_hysteresis` | `-0.25` | Extra margin before moving to a smaller step |
| `up_hysteresis` | `0.0` | Extra margin before moving to a larger step |
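To show how the keys in the table above might interact, here is a minimal sketch of an EMA-driven tier policy. It is not SGLang's `AdaptiveSpeculativeParams` implementation: the class name `EmaStepPolicy` and its method are hypothetical, snapping the target to the nearest candidate tier is an assumption, and the hysteresis margins are left out because their exact semantics are not spelled out here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class EmaStepPolicy:
    """Hypothetical sketch of an EMA-based step policy (not SGLang's code)."""

    candidate_steps: List[int] = field(default_factory=lambda: [1, 3, 7])
    ema_alpha: float = 0.2
    warmup_batches: int = 10
    update_interval: int = 5
    current_steps: int = 3
    ema: Optional[float] = None
    batches_seen: int = 0

    def on_verify_complete(self, accept_lengths: Sequence[int]) -> int:
        """Update the EMA with one verify batch and return the active tier."""
        if not accept_lengths:
            return self.current_steps

        # Batch average of accepted draft length, smoothed with an EMA.
        batch_avg = sum(accept_lengths) / len(accept_lengths)
        if self.ema is None:
            self.ema = batch_avg
        else:
            self.ema = self.ema_alpha * batch_avg + (1 - self.ema_alpha) * self.ema
        self.batches_seen += 1

        # Gate 1: observe a few batches before allowing any switch.
        if self.batches_seen <= self.warmup_batches:
            return self.current_steps
        # Gate 2: only recompute every `update_interval` batches after warmup.
        if (self.batches_seen - self.warmup_batches) % self.update_interval != 0:
            return self.current_steps

        # Probe one step beyond the observed acceptance, clamped to the tiers.
        lo, hi = min(self.candidate_steps), max(self.candidate_steps)
        target = max(lo, min(hi, round(self.ema) + 1))
        # Assumption: snap to the nearest tier (ties go to the earlier entry).
        self.current_steps = min(self.candidate_steps, key=lambda s: abs(s - target))
        return self.current_steps
```

With `warmup_batches=2` and `update_interval=1`, feeding batches that accept six tokens per request moves this sketch to tier 7 on the third batch, and a long run of zero-acceptance batches brings it back down to tier 1 as the EMA decays.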
+ +The initial `--speculative-num-steps` is snapped to the nearest value in `candidate_steps`. + +## Monitoring + +You can inspect the active tier and acceptance metric via `/server_info`: + +```bash +curl -s http://127.0.0.1:30000/server_info | jq '.internal_states[0] | {speculative_num_steps, avg_spec_accept_length}' +``` + +- `speculative_num_steps` is the current active tier +- `avg_spec_accept_length` helps explain whether the server is likely to move up or down + +## Tuning tips + +- Start with the default candidate tiers `[1, 3, 7]` +- Use fewer tiers if you want lower startup and graph-memory overhead +- Increase `ema_alpha` to react faster, or lower it for more stability +- Increase `warmup_batches` or `update_interval` if tier switching is too noisy +- If your workload is already stable and one static setting is well tuned, adaptive mode may not help much diff --git a/docs_new/docs/advanced_features/attention_backend.mdx b/docs_new/docs/advanced_features/attention_backend.mdx new file mode 100644 index 000000000000..9ad998f51382 --- /dev/null +++ b/docs_new/docs/advanced_features/attention_backend.mdx @@ -0,0 +1,694 @@ +--- +title: "Attention Backend" +metatags: + description: "SGLang attention backend guide: FlashInfer, FA3, FA4, Triton, FlashMLA, TRTLLM MLA, hybrid attention. Support matrix for MHA and MLA models." +--- +SGLang supports a large variety of attention backends. Each of them has different pros and cons. +You can test them according to your needs. + + +Selecting an optimal attention backend is crucial for maximizing your performance. Different backends excel in various scenarios, so choose based on your model, hardware, and use case. Not all backends are supported on all platforms and model architectures. + +If you don't specify `--attention-backend`, SGLang makes a best effort to automatically select the most performant backend based on your hardware and model architecture. 
+ + +## Support Matrix + +The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention). For an explanation of the key differences between MHA and MLA, please see the [SGLang documentation on DeepSeek MLA](../basic_usage/deepseek_v3#multi-head-latent-attention-mla-throughput-optimizations) and the original [DeepSeek MLA paper](https://arxiv.org/pdf/2405.04434). + +### MHA Backends + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
**Backend****Page Size > 1 (native)****FP8 KV Cache****FP4 KV Cache****Spec topk=1****Spec topk>1****Sliding Window****MultiModal**
**FlashInfer**
**FA3 (FlashAttention 3)**
**FA4 (FlashAttention 4)**128
**Triton**
**Torch Native (SDPA)**
**FlexAttention (PyTorch)**
**TRTLLM MHA**16, 32 or 64
**Dual Chunk FlashAttention**
**AITER (ROCm)**
**Wave (ROCm)**
**Ascend (NPU)**
**Intel XPU**
**Intel AMX (CPU)**
+ +### MLA Backends + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
**Backend****Native Page Sizes****FP8 KV Cache****FP4 KV Cache****Chunked Prefix Cache****Spec topk=1****Spec topk>1**
**FlashInfer MLA**1
**FlashMLA**64
**Cutlass MLA**128
**TRTLLM MLA (Blackwell)**32 or 64
**FA3 (FlashAttention 3)**n/a⚠️ (page_size=1 only)
**Triton**n/a⚠️ (page_size=1 only)
**FA4**1
**Ascend MLA (NPU)**128
 + + +Multimodal attention is selected by `--mm-attention-backend`. The "MultiModal" column indicates whether a corresponding multimodal implementation exists for that backend family. + + + +- FlashAttention 4 supports both prefill and decode on SM90 (Hopper) and SM100 (Blackwell). FA4 MLA supports `page_size = 1`; FA4 MHA requires `page_size = 128`. On SM100, this is auto-enforced by the server; on SM90, users must set `--page-size 128` manually. +- NSA is specifically designed for [DeepSeek V3.2 DSA](https://lmsys.org/blog/2025-09-29-deepseek-V32/). See the [DSA Attention Backend (NSA)](#dsa-attention-backend-nsa) section and [DeepSeek V3.2 deployment guide](../basic_usage/deepseek_v32) for details. + + + +**FA4 on Hopper (SM90):** FA4 decode speed decreases as sequence length grows due to lack of SplitKV support. At batch=1 compared to FA3 on H100: ~-10% at 2K tokens, ~-18% at 4K, ~-31% at 8K, ~-49% at 16K. Larger batch sizes reduce the gap (e.g., batch=8: ~-2% at 2K, ~-8% at 4K). Blackwell (SM100) is not affected. + + + +In the KV4 FA4 scenario, FA4 must be paired with a different `--decode-attention-backend`. `trtllm_mha` is incompatible with FA4; all other decode backends behave as shown in the table. + + + +Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths. + + + +**Speculative Decoding V2 (Spec V2):** Spec V2 uses overlap scheduling (`SGLANG_ENABLE_SPEC_V2=True`) that benefits various attention backends. It requires `--speculative-eagle-topk 1` and currently applies to EAGLE and EAGLE3. + +**Verified backends:** TRTLLM MLA, TRTLLM MHA, FA3, Ascend (NPU), Triton. + +**Limited support:** FlashInfer can run under Spec V2, but its plan stream (used for split-KV optimization) introduces a synchronization point that limits overlap benefits. 
+ + + +Page size controls how many tokens are grouped into a KV cache block. For the prefix cache to take effect, the number of tokens must fill at least one complete page. For example, if your prompt is only 32 tokens and `page_size = 64`, it won't fill a complete page and cannot be matched in the prefix cache (pages cannot be padded). With 65 tokens and `page_size = 64`, only the first page of 64 tokens will be cached and matched; the remaining 1 token is discarded. Use `page_size = 1` for maximum prefix reuse (token-level matching). Note that higher page sizes generally improve attention kernel performance, so prefer `page_size > 1` when prefix cache reuse is not critical. + + +Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128). + +MLA page-size constraints: +- FlashInfer MLA: page_size = 1. +- FlashMLA: page_size = 64. +- Cutlass MLA: page_size = 128. +- TRTLLM MLA: page_size ∈ {32, 64}. + +### GDN Attention Backends + +GDN (Gated Delta Network) is a linear attention mechanism with O(n) complexity, used in hybrid models that alternate GDN linear attention layers with standard full attention layers. GDN is **not** selected via `--attention-backend`; it is automatically activated when the model architecture requires it (e.g., Qwen 3.5, Qwen 3 Next, Jet Nemotron, Jet VLM). + +The GDN linear attention layers have their own kernel backends, selected via `--linear-attn-backend` (default: `triton`). You can override the kernel per phase with `--linear-attn-decode-backend` and `--linear-attn-prefill-backend`. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
BackendDecodePrefill / ExtendSpec Decoding (Target Verify)
Triton (CUDA)
Triton (AMD/ROCm)
Triton (NPU)
Triton (CPU)
CuTe DSL (CUDA only)
+ + +GDN models are hybrid: the full-attention layers still require a standard `--attention-backend`. Platform constraints for the full-attention backend on hybrid GDN models: +- **Blackwell (e.g., B200)**: `triton`, `trtllm_mha`, or `fa4` only. +- **NPU (Ascend)**: `ascend` only. +- **AMD (ROCm)**: `triton` recommended. +- **Other CUDA (Hopper, Ampere, etc.)**: auto-selection works; no special constraints. + + +### DSA Attention Backend (NSA) + +DSA (Deepseek Sparse Attention) is a native sparse attention mechanism used by [DeepSeek V3.2](https://lmsys.org/blog/2025-09-29-deepseek-V32/). It is activated automatically when the model architecture requires it and is selected via `--attention-backend nsa`. + +Internally, the NSA backend dispatches to different sub-backends for prefill and decode phases. You can override these with `--nsa-prefill-backend` and `--nsa-decode-backend`: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Sub-backendPrefillDecodeNotes
flashmla_sparseDefault prefill on Hopper and Blackwell (bf16)
flashmla_kvDefault decode for FP8 on Blackwell with DP
flashmla_autoAuto-selects flashmla_sparse or flashmla_kv based on kv_cache_dtype
fa3Default decode on Hopper (bf16)
trtllmDefault decode on Blackwell (bf16); default for both on Blackwell without DP
tilelangDefault on AMD (ROCm)
aiterAMD-specific kernel library (requires aiter package)
+ +For deployment examples, see the [DeepSeek V3.2 deployment guide](../basic_usage/deepseek_v32). + +### Hybrid attention (different backends for prefill vs decode) (Experimental) + + +Hybrid attention is an experimental feature. + + +You can mix-and-match attention backends for prefill and decode. This is useful when one backend excels at prefill and another excels at decode. For the implementation details, please see `python/sglang/srt/layers/attention/hybrid_attn_backend.py`. + +```bash Command +# Example: Prefill with FA4, Decode with TRTLLM MLA (Blackwell) +python3 -m sglang.launch_server \ + --model-path nvidia/DeepSeek-R1-FP4 \ + --tp 8 \ + --attention-backend trtllm_mla \ + --moe-runner-backend flashinfer_trtllm \ + --quantization modelopt_fp4 \ + --prefill-attention-backend fa4 +``` + +#### Speculative decoding with hybrid attention + +Hybrid attention also works with speculative decoding. The backend used for draft decoding and target verification depends on `--speculative-attention-mode`: + +- `--speculative-attention-mode decode` (recommended): draft/verify use the decode backend. +- `--speculative-attention-mode prefill` (default): draft/verify use the prefill backend. + +Constraints when combining hybrid attention with speculative decoding: + +- If any attention backend is `trtllm_mha`, speculative decoding supports only `--speculative-eagle-topk 1`. +- For paged MHA backends with `--page-size > 1` and `--speculative-eagle-topk > 1`, only `flashinfer` is supported. +- CUDA Graph: the decode backend is always captured; the prefill backend is captured only when `--speculative-attention-mode prefill`. + + + +If you set only one of `--prefill-attention-backend` or `--decode-attention-backend`, the unspecified phase inherits `--attention-backend`. +If both are specified and differ, SGLang automatically enables a hybrid wrapper to dispatch to the chosen backend per phase. 
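The inheritance rule in the note above, together with `--speculative-attention-mode`, can be sketched in a few lines of Python. This is a minimal illustration, not SGLang's actual code: `resolve_backends`, `ResolvedBackends`, and the parameter names are hypothetical, and treating any prefill/decode mismatch as "hybrid" is an assumption drawn from the launch example earlier in this section.

```python
from typing import NamedTuple, Optional


class ResolvedBackends(NamedTuple):
    prefill: str
    decode: str
    hybrid: bool      # True when the two phases end up on different backends
    speculative: str  # backend used for draft decoding and target verification


def resolve_backends(
    attention_backend: str,
    prefill_backend: Optional[str] = None,
    decode_backend: Optional[str] = None,
    speculative_attention_mode: str = "prefill",
) -> ResolvedBackends:
    """Hypothetical helper mirroring the documented flag semantics."""
    if speculative_attention_mode not in ("prefill", "decode"):
        raise ValueError("speculative_attention_mode must be 'prefill' or 'decode'")

    # An unspecified phase inherits --attention-backend.
    prefill = prefill_backend or attention_backend
    decode = decode_backend or attention_backend

    # Draft/verify follow whichever phase the speculative mode names.
    speculative = decode if speculative_attention_mode == "decode" else prefill
    return ResolvedBackends(prefill, decode, prefill != decode, speculative)


# Mirrors the Blackwell example: --attention-backend trtllm_mla --prefill-attention-backend fa4
resolved = resolve_backends("trtllm_mla", prefill_backend="fa4")
print(resolved)
```

With only `--prefill-attention-backend` set, the decode phase falls back to `--attention-backend`, so the two phases differ and the hybrid path applies.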
+ + +## Attention Backend Selection Guide (CUDA) + +If the `--attention-backend` argument is not specified, SGLang automatically selects the best backend based on the hardware (CUDA) and model architecture. + +### Automatic Selection Logic + +**1. MHA Models (e.g., Llama, Qwen)** +- **Hopper (e.g., H100, H200)**: Defaults to `fa3` if using CUDA 12.3+ and the model configuration is supported. +- **Blackwell (e.g., B200)**: Defaults to `trtllm_mha`, unless using speculative decoding with `topk > 1`. +- **Other Architectures (Ampere, Ada, etc.)**: Defaults to `flashinfer` if available; otherwise falls back to `triton`. + +**2. MLA Models (e.g., DeepSeek V3)** +- **Hopper**: Defaults to `fa3` (requires CUDA 12.3+). +- **Blackwell**: Defaults to `flashinfer`; `trtllm_mla` is auto-selected for DeepSeek V3 models specifically. +- **Other Architectures**: Defaults to `triton`. + + +## User Guide + +### Launch Command for Different Attention Backends + +- FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40) +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend flashinfer +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-V3 \ + --attention-backend flashinfer \ + --trust-remote-code +``` + +- FlashAttention 3 (Default for Hopper Machines, e.g., H100, H200, H20) +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend fa3 +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-V3 \ + --trust-remote-code \ + --attention-backend fa3 +``` + +- Triton +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend triton +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-V3 \ + --attention-backend triton \ + --trust-remote-code +``` + +- FlashMLA +```bash Command +python3 -m sglang.launch_server \ + --tp 8 
\ + --model deepseek-ai/DeepSeek-R1 \ + --attention-backend flashmla \ + --trust-remote-code +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-R1 \ + --attention-backend flashmla \ + --kv-cache-dtype fp8_e4m3 \ + --trust-remote-code +``` + +- TRTLLM MLA (Optimized for Blackwell Architecture, e.g., B200) +```bash Command +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-R1 \ + --attention-backend trtllm_mla \ + --trust-remote-code +``` + +- TRTLLM MLA with FP8 KV Cache (Higher concurrency, lower memory footprint) +```bash Command +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-R1 \ + --attention-backend trtllm_mla \ + --kv-cache-dtype fp8_e4m3 \ + --trust-remote-code +``` + +- TRTLLM MHA (Optimized for Blackwell Architecture, e.g., B200) +```bash Command +python3 -m sglang.launch_server \ + --tp 4 \ + --model Qwen/Qwen3.5-35B-A3B-FP8 \ + --attention-backend trtllm_mha \ + --trust-remote-code +``` + +- TRTLLM MHA (XQA backend) (Optimized for SM90 and SM120, e.g., H20, H200, 5090) + Note that TRTLLM XQA backend only works well for pagesize 64. 
+```bash Command +python3 -m sglang.launch_server \ + --tp 4 \ + --model Qwen/Qwen3.5-35B-A3B-FP8 \ + --decode-attention-backend trtllm_mha \ + --trust-remote-code +``` + +- FlashAttention 4 (MHA & MLA) +```bash Command +# FA4 for both prefill and decode on SM90/SM100 +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \ + --attention-backend fa4 \ + --page-size 128 \ + --trust-remote-code + +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-R1 \ + --prefill-attention-backend fa4 \ + --trust-remote-code +``` + +- Cutlass MLA +```bash Command +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-R1 \ + --attention-backend cutlass_mla \ + --trust-remote-code +``` + +- Ascend +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend ascend +``` + +- Intel XPU +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend intel_xpu +``` + +- Wave +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend wave +``` + +- FlexAttention +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend flex_attention +``` + +- Dual Chunk FlashAttention +```bash Command +python3 -m sglang.launch_server \ + --model Qwen/Qwen2.5-14B-Instruct-1M \ + --attention-backend dual_chunk_flash_attn +``` + +- Torch Native +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend torch_native +``` + +## Steps to add a new attention backend +To add a new attention backend, you can learn from the existing backends +(`python/sglang/srt/layers/attention/triton_backend.py`, `python/sglang/srt/layers/attention/flashattention_backend.py`) +and follow the steps below. 
+ + +Linear attention kernel backends (GDN, KDA) follow a different pattern. They implement `LinearAttnKernelBase` in `python/sglang/srt/layers/attention/linear/kernels/` and are dispatched by `GDNKernelDispatcher` / `KDAKernelDispatcher` rather than registered via `@register_attention_backend`. + + +1. Run without CUDA graph. Implement the two forward functions and the metadata initializer + - forward_extend + - Will be used for prefill, prefill with KV cache, and target verification + - It will be called once per layer + - forward_decode + - Will be used for normal decode and draft decode + - It will be called once per layer + - init_forward_metadata + - Initialize the class and common metadata shared by all layers + - Call the plan function for optimizations like split_kv + - It will be called once per forward +2. Run with CUDA graph. It has two phases (capture and replay) and you need to implement three functions + - init_cuda_graph_state + - It will be called once during the backend's lifetime + - Create all common shared buffers + - init_forward_metadata_capture_cuda_graph + - It will be called before capturing a CUDA graph + - It is similar to init_forward_metadata but writes the metadata to pre-defined buffers + - init_forward_metadata_replay_cuda_graph + - It will be called before replaying a CUDA graph + - This function is in the critical path and needs to be fast diff --git a/docs_new/docs/advanced_features/breakable_cuda_graph.mdx b/docs_new/docs/advanced_features/breakable_cuda_graph.mdx new file mode 100644 index 000000000000..3497ba837313 --- /dev/null +++ b/docs_new/docs/advanced_features/breakable_cuda_graph.mdx @@ -0,0 +1,192 @@ +--- +title: "Breakable CUDA Graph" +metatags: + description: "Use Breakable CUDA Graph to insert targeted eager graph breaks for debugging and CUDA graph compatibility." +--- + +## Motivation + +Standard CUDA graphs capture an entire forward pass as a single, opaque graph. This is great for performance, but creates two problems: + +1.
**Debugging is hard.** When something goes wrong inside a captured graph (wrong outputs, numerical mismatches, crashes), there is no way to step through the operations or insert print statements because the graph replays as a monolithic unit. + +2. **Some ops are incompatible.** Certain operations — dynamic control flow, host-device synchronization, JIT compilation, or ops that change behavior across iterations — cannot be captured into a CUDA graph at all. Today, the only workaround is to disable CUDA graphs entirely, which sacrifices the kernel launch overhead savings for the rest of the model. + +**Breakable CUDA Graph** solves both problems by allowing graph breaks to be inserted at specific points. The computation is split into multiple captured graph segments with eager (non-graph) execution in between. This preserves most of the CUDA graph performance benefit while allowing targeted operations to run outside the graph. + +## Usage + +### Debug Mode: Run Everything Eagerly + +The simplest use case is debugging. The `--debug-cuda-graph` flag wraps the entire decode forward pass in a graph break, so every operation runs eagerly while still going through the full CUDA graph capture/replay code path. This lets you debug CUDA graph issues without changing model code. + +```bash +python -m sglang.launch_server \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --debug-cuda-graph +``` + +This mode is intended for debugging only — it eliminates the performance benefit of CUDA graphs since every op runs eagerly. + +### Selective Graph Breaks in Model Code + +For production use, you can mark specific functions as "non-graphable" using the `@eager_on_graph` decorator. During CUDA graph capture, these functions run eagerly between captured graph segments. Outside of capture, they behave normally. 
+ +```python +from sglang.srt.model_executor.breakable_cuda_graph.breakable_cuda_graph import eager_on_graph + +@eager_on_graph(enable=True) +def my_dynamic_op(x): + # This op is incompatible with CUDA graph capture + return some_dynamic_operation(x) +``` + +You can also insert a bare graph break (no computation) using the `break_graph()` helper: + +```python +from sglang.srt.model_executor.breakable_cuda_graph.breakable_cuda_graph import break_graph + +def forward(self, x): + x = self.layer1(x) + break_graph() # force a segment split here + x = self.layer2(x) + return x +``` + +To enable breakable CUDA graph at the environment level (without debug mode), set the environment variable: + +```bash +export SGLANG_USE_BREAKABLE_CUDA_GRAPH=1 +python -m sglang.launch_server \ + --model meta-llama/Llama-3.1-8B-Instruct +``` + +### Server Args + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Default | Description |
|---|---|---|
| `--debug-cuda-graph` | `False` | Enable debug/eager mode. Wraps the entire forward pass in a graph break so every op runs eagerly through the capture/replay path. |
| `SGLANG_USE_BREAKABLE_CUDA_GRAPH` | `0` | Environment variable. Enables breakable CUDA graph without debug mode. Required for `@eager_on_graph` decorators to take effect. |
+ +## How It Works + +### Capture + +Breakable CUDA graph extends PyTorch's `torch.cuda.CUDAGraph` by splitting a single capture into multiple segments separated by graph breaks. + +During capture, the flow is: + +``` +Begin capture (segment 1) + ... graphable ops ... + @eager_on_graph function encountered: + 1. End current capture segment + 2. Run the function eagerly (allocates output tensors) + 3. Record the function for later replay + 4. Begin new capture segment + ... more graphable ops ... +End capture (segment N) +``` + +Each segment is independently instantiated as a CUDA graph executable. The non-graph functions and their argument references are stored for replay. + +### Replay + +During replay: + +``` +For each segment i: + 1. Launch CUDA graph segment i + 2. Run the recorded non-graph function i eagerly +Launch final CUDA graph segment +``` + +The non-graph functions are re-invoked with the same tensor references as capture time. Since these references point to the CUDA graph's static input/output buffers, they see updated values on each replay. + +### Output Writeback + +When a non-graph function produces output during replay, the result must be written back into the same tensor buffers that downstream graph segments reference. The mechanism handles: + +- **Plain tensors**: In-place `copy_()` into the original buffer. +- **Structured outputs** (dataclasses, objects with tensor attributes): Tensor fields are copied in-place; non-tensor fields are replaced. +- **Dicts of tensors**: Tensor values are copied in-place; non-tensor values are replaced. + +### Stream Fork/Join Tracking + +Some models fork work onto secondary CUDA streams (e.g., for overlapped computation). Breakable CUDA graph hooks `torch.cuda.Stream.wait_stream` to track which streams are forked from the capture stream. When a graph break occurs, all forked streams are automatically joined back before ending the capture segment, and re-forked after beginning the next segment. 
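The replay loop and output writeback described above can be illustrated with a toy model. This is only a sketch under stated assumptions: plain Python lists stand in for static tensor buffers, ordinary callables stand in for captured graph segments, and `write_back` and `replay` are hypothetical names rather than functions from the SGLang code base. Structured (dataclass) outputs and stream fork/join handling are left out for brevity.

```python
from typing import Any, Callable, Sequence


def write_back(dst: Any, src: Any) -> Any:
    """Copy `src` into `dst` in place where possible; return the object to keep."""
    if isinstance(dst, list) and isinstance(src, list):
        dst[:] = src  # toy analogue of tensor.copy_(): same object, new contents
        return dst
    if isinstance(dst, dict) and isinstance(src, dict):
        for key, value in src.items():
            dst[key] = write_back(dst.get(key), value)
        return dst
    return src  # non-buffer value: simply replaced


def replay(
    segments: Sequence[Callable[[], None]],
    eager_fns: Sequence[Callable[[], Any]],
    eager_outputs: Sequence[Any],
) -> None:
    """Launch segment i, run eager function i, then launch the final segment."""
    assert len(segments) == len(eager_fns) + 1
    for segment, fn, captured_output in zip(segments, eager_fns, eager_outputs):
        segment()                         # stands in for launching a captured graph
        write_back(captured_output, fn())  # eager result lands in the captured buffer
    segments[-1]()


# Static buffers shared by "graph segments" and the eager function.
x = [1.0, 2.0]
mid = [0.0, 0.0]
eager_out = [0.0, 0.0]
out = [0.0, 0.0]


def segment_1() -> None:
    mid[:] = [v * 2 for v in x]


def eager_op() -> list:
    return [v + 1 for v in mid]  # returns a fresh object, like an eager op would


def segment_2() -> None:
    out[:] = [v * 10 for v in eager_out]


replay([segment_1, segment_2], [eager_op], [eager_out])
first = list(out)

x[:] = [3.0, 4.0]  # update the static input in place, then replay again
replay([segment_1, segment_2], [eager_op], [eager_out])
print(first, out)
```

The second segment only ever reads `eager_out`, so the eager result has to be copied into that exact object; rebinding the name to the fresh list would leave the downstream segment reading stale data.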
+ +## Compatibility + +- **NVIDIA CUDA only.** Breakable CUDA graph is not supported on ROCm/HIP or other non-CUDA platforms. On unsupported platforms, `--debug-cuda-graph` is automatically disabled with a warning. +- **Requires `cuda-python`.** The `cuda.bindings` package must be installed (`pip install cuda-python`). +- **Not compatible with memory saver mode.** Cannot be used together with `SGLANG_MEMORY_SAVER_CUDA_GRAPH`. + +## Performance + +When no graph breaks are inserted, breakable CUDA graph has minimal overhead compared to standard CUDA graph — the capture/replay path is nearly identical. + +Each graph break adds: +- One `cudaGraphLaunch` call (to replay the segment before the break) +- One eager Python function call +- One `cudaStreamBeginCapture` / `cudaStreamEndCapture` pair during capture + +For typical use cases with a small number of graph breaks, the overhead is negligible compared to the saved kernel launch overhead from the captured segments. + +## Code Reference + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| File | Description |
|---|---|
| `python/sglang/srt/model_executor/breakable_cuda_graph/breakable_cuda_graph.py` | Core implementation: `eager_on_graph`, `BreakableCUDAGraph`, `BreakableCUDAGraphCapture` |
| `python/sglang/srt/model_executor/breakable_cuda_graph/cuda_utils.py` | CUDA runtime binding utilities |
| `python/sglang/srt/model_executor/cuda_graph_runner.py` | Integration with the main CUDA graph runner |
| `python/sglang/srt/server_args.py` | `--debug-cuda-graph` flag and environment variable handling |
| `python/sglang/srt/environ.py` | `SGLANG_USE_BREAKABLE_CUDA_GRAPH` environment variable definition |
diff --git a/docs_new/docs/advanced_features/checkpoint_engine.mdx b/docs_new/docs/advanced_features/checkpoint_engine.mdx new file mode 100644 index 000000000000..59472e6094cc --- /dev/null +++ b/docs_new/docs/advanced_features/checkpoint_engine.mdx @@ -0,0 +1,257 @@ +--- +title: "Checkpoint Engine Integration" +metatags: + description: "SGLang checkpoint engine: distributed model weight loading, parallel multi-node setup, broadcast and P2P modes. Reduces loading time for large models." +--- +The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. This feature significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing the weight loading process across multiple processes and nodes. + +## Overview + +The checkpoint engine integration allows SGLang to: +- Load model weights in parallel using multiple processes +- Distribute weight loading across multiple nodes to increase effective disk bandwidth +- Overlap weight loading with other initialization tasks like CUDA graph capture +- Support both single-node and multi-node deployments + +## Installation + +First, install the checkpoint engine package: + +```bash Command +pip install 'checkpoint-engine[p2p]' +``` + +## Architecture + +The system consists of two main components: + +1. **SGLang Server**: Runs with `--wait-for-initial-weights` flag to wait for weights before becoming ready +2. 
**Checkpoint Engine Workers**: Separate processes (managed by torchrun) that load and distribute model weights + +The checkpoint engine uses a parameter server architecture with support for: +- **Broadcast mode**: Weights are broadcast from loading processes to inference processes +- **P2P mode**: Direct peer-to-peer weight transfer between processes +- **All mode**: Combination of both broadcast and P2P methods + +## Usage Examples + +### Single Node Setup + +**Terminal 1 - Launch SGLang Server:** +```bash Command +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-8B \ + --tp 8 \ + --load-format dummy \ + --wait-for-initial-weights +``` + +**Terminal 2 - Run Checkpoint Engine:** + +Using sglang entrypoint: +```bash Command +python -m sglang.srt.checkpoint_engine.update \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 8 +``` + +Using torchrun directly: +```bash Command +torchrun --nproc-per-node 8 \ + examples/checkpoint_engine/update.py \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 8 +``` + +### Multi-Node Setup (2 Nodes) + +**Node 0:** + +Launch SGLang server: +```bash Command +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-8B \ + --tp 8 \ + --load-format dummy \ + --wait-for-initial-weights \ + --host [IP] +``` + +Run checkpoint engine: + +Using sglang entrypoint (recommended): +```bash Command +python -m sglang.srt.checkpoint_engine.update \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 8 +``` + +Using torchrun directly: +```bash Command +torchrun --nproc-per-node 8 \ + --nnodes 2 \ + --node-rank 0 \ + --master-addr [IP] \ + --master-port 29500 \ + examples/checkpoint_engine/update.py \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 8 +``` + +**Node 1:** + +Launch SGLang server: +```bash Command +python -m 
sglang.launch_server \ + --model-path Qwen/Qwen3-8B \ + --tp 8 \ + --load-format dummy \ + --wait-for-initial-weights \ + --host [IP] +``` + +Run checkpoint engine: + +Using sglang entrypoint (recommended): +```bash Command +python -m sglang.srt.checkpoint_engine.update \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 8 +``` + +Using torchrun directly: +```bash Command +torchrun --nproc-per-node 8 \ + --nnodes 2 \ + --node-rank 1 \ + --master-addr [IP] \ + --master-port 29500 \ + examples/checkpoint_engine/update.py \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 8 +``` + +### Multi-Node Setup with Tensor Parallelism (TP=16) + +**Node 0:** + +Launch SGLang server: +```bash Command +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-8B \ + --tp 8 \ + --load-format dummy \ + --wait-for-initial-weights \ + --host [IP] \ + --dist-init-addr [IP]:9120 \ + --nnodes 2 \ + --node-rank 0 +``` + +Run checkpoint engine: + +Using sglang entrypoint (recommended): +```bash Command +python -m sglang.srt.checkpoint_engine.update \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 16 +``` + +Using torchrun directly: +```bash Command +torchrun --nproc-per-node 8 \ + --nnodes 2 \ + --node-rank 0 \ + --master-addr [IP] \ + --master-port 29500 \ + examples/checkpoint_engine/update.py \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 16 +``` + +**Node 1:** + +Launch SGLang server: +```bash Command +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-8B \ + --tp 8 \ + --load-format dummy \ + --wait-for-initial-weights \ + --host [IP] \ + --dist-init-addr [IP]:9120 \ + --nnodes 2 \ + --node-rank 1 +``` + +Run checkpoint engine: + +Using sglang entrypoint (recommended): +```bash Command +python -m sglang.srt.checkpoint_engine.update \ + --update-method 
broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 16 +``` + +Using torchrun directly: +```bash Command +torchrun --nproc-per-node 8 \ + --nnodes 2 \ + --node-rank 1 \ + --master-addr [IP] \ + --master-port 29500 \ + examples/checkpoint_engine/update.py \ + --update-method broadcast \ + --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ + --inference-parallel-size 16 +``` + +## Configuration Options + +### SGLang Server Options + +- `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks) +- `--wait-for-initial-weights`: Wait for checkpoint engine to provide weights before becoming ready +- `--host`: Host address for multi-node setups +- `--dist-init-addr`: Distributed initialization address for tensor parallelism + +### Checkpoint Engine Options + +- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`) +- `--checkpoint-path`: Path to model checkpoint directory +- `--inference-parallel-size`: Number of inference parallel processes +- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`) +- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`) +- `--save-metas-file`: File to save checkpoint metadata +- `--load-metas-file`: File to load checkpoint metadata from +- `--uds`: Unix domain socket path for communication +- `--weight-version`: Version identifier for weights + +## Performance Benefits + +The checkpoint engine provides significant time savings in two main aspects: + +1. **Multi-node Loading**: Each node only loads a portion of weights from disk, effectively increasing disk bandwidth. More participating nodes provide greater acceleration. Preliminary tests show 20-second acceleration when loading DeepSeek-R1 on H20-3e with two nodes. + +2. **Single Process Optimization**: Using dummy format allows overlapping disk-to-CPU transfer with CUDA graph capture and other initialization tasks, providing additional time savings. 
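As a rough way to reason about the multi-node benefit, the sketch below estimates disk read time when each node reads an equal share of the checkpoint. It is an idealized back-of-the-envelope model, not a measurement: `estimated_load_seconds` is a hypothetical helper, the checkpoint size and disk throughput are made-up illustrative inputs, and the cost of broadcasting weights between nodes is ignored.

```python
def estimated_load_seconds(
    checkpoint_gb: float, disk_gb_per_s: float, num_nodes: int
) -> float:
    """Idealized disk read time when every node reads 1/num_nodes of the checkpoint."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if disk_gb_per_s <= 0:
        raise ValueError("disk_gb_per_s must be positive")
    per_node_gb = checkpoint_gb / num_nodes  # each node loads only its own shard
    return per_node_gb / disk_gb_per_s


# Illustrative inputs only: a 640 GB checkpoint and 2 GB/s of disk throughput per node.
for nodes in (1, 2, 4):
    print(nodes, estimated_load_seconds(640.0, 2.0, nodes))
```

In practice the gain is smaller than this model suggests, because the weights still have to be transferred between processes after they are read.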
+ +## Troubleshooting + +- Ensure checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'` +- Verify network connectivity between nodes in multi-node setups +- Check that the checkpoint path contains valid model files +- Monitor logs for connection errors between SGLang server and checkpoint engine +- Use `--sleep-time` parameter to add delays if needed for debugging + +## References + +- [Checkpoint Engine Repository](https://github.com/MoonshotAI/checkpoint-engine) diff --git a/docs_new/docs/advanced_features/cuda_graph_for_multi_modal_encoder.mdx b/docs_new/docs/advanced_features/cuda_graph_for_multi_modal_encoder.mdx new file mode 100644 index 000000000000..1ee4639792e1 --- /dev/null +++ b/docs_new/docs/advanced_features/cuda_graph_for_multi_modal_encoder.mdx @@ -0,0 +1,76 @@ +--- +title: "Cuda Graph for Multi-Modal Encoder in SGLang" +metatags: + description: "CUDA Graph optimization for ViT in SGLang: reduce kernel launch overhead, dynamic input handling, support for Qwen2.5-VL and Qwen3-VL models." +--- +## Motivation + +In multimodal reasoning services, the visual encoder (ViT / Vision Transformer) typically has a few characteristic traits: + +Many layers, fragmented operators: Each layer includes LN, QKV projections, attention, MLP, residual connections, etc., resulting in extremely frequent kernel launches. + +Server-side “small batch / low latency” is common: The batch size is very small (sometimes it looks like 1 after “flattening” the batch), so kernel launch overhead accounts for a large portion of end-to-end latency. + +Input token count (number of patches) varies frequently: Different image/video resolutions and different batch composition lead to different sequence lengths +S — and this is precisely the biggest obstacle for CUDA Graph (unstable shapes). 
+ +The value of CUDA Graph: It captures a long sequence of GPU kernels with fixed shapes and fixed memory addresses into a graph; later, for the same shapes, it can replay the graph directly, dramatically reducing launch overhead and making GPU scheduling more compact. + +This motivated adding CUDA Graph support for ViT to improve its performance. + +## Design and Restrictions + +The new CUDA Graph enabled ViT logic is built on ViTCudaGraphRunner. This runner captures the "blocks + merger + deepstack merger (optional)" part of a vision transformer into a CUDA graph and replays it for identical shapes. See the following design considerations and restrictions for more details. + +### Dynamic inputs to fit static constraints of CUDA Graph + +Variable sequence length S is very common in ViT, while CUDA Graph requires fixed shapes. The solution is to build a graph cache keyed by S (e.g., graph_key = S). The first time a new S is seen, a graph is captured for it; afterwards, that graph is replayed. + +If there are many distinct S values, VRAM usage increases, because each captured graph keeps its own private memory pool. + +### Stable addresses + +Everything "parameter-like" becomes a static buffer: + +- block_input / block_ws / block_output +- cu_full_len / cu_window_len and their kk variants +- sin_cos_ws + +This satisfies the underlying requirement: during replay, tensors cannot be swapped; only their contents can be modified. + +### Attention backend arguments +Attention backend arguments are fixed inside the graph: + +TritonAttn expects [cu_seqlens, cu_seqlens_kk, max_len] +FA3 expects [cu_seqlens, max_len] + +max_len is frozen as an int constant. +cu_seqlens is cached into a dict during create_graph(), and its contents are not updated during subsequent replays. + +For the same graph_key = S, not only must the input shape match, but the segmentation pattern in cu_seqlens (and window seqlens) must also be identical.
Otherwise, attention will segment the sequence incorrectly. + +### Rotary buffer management +The feature reallocates a larger sin_cos_ws when seq_len increases. +max_content_len is used to determine the maximum size of the allocated rotary buffer. + + +## Command Example +You can enable CUDA Graph for ViT by setting the environment variable `SGLANG_VIT_ENABLE_CUDA_GRAPH=1`, for example: +```shell Command +SGLANG_VIT_ENABLE_CUDA_GRAPH=1 \ +python3 -m sglang.launch_server \ + --model Qwen/Qwen3-VL-8B-Instruct +``` +You can also run CUDA Graph for ViT together with the Piecewise CUDA Graph feature by setting both the environment variable `SGLANG_VIT_ENABLE_CUDA_GRAPH=1` and the `--enable-piecewise-cuda-graph` flag, for example: +```shell Command +SGLANG_VIT_ENABLE_CUDA_GRAPH=1 \ +python3 -m sglang.launch_server \ + --model Qwen/Qwen3-VL-8B-Instruct \ + --piecewise-cuda-graph-max-tokens 4096 \ + --enable-piecewise-cuda-graph \ + --piecewise-cuda-graph-compiler eager +``` + +## Known supported models +- Qwen2.5-VL (https://github.com/sgl-project/sglang/pull/14422) +- Qwen3-VL (https://github.com/sgl-project/sglang/pull/15320) diff --git a/docs_new/docs/advanced_features/deterministic_inference.mdx b/docs_new/docs/advanced_features/deterministic_inference.mdx new file mode 100644 index 000000000000..9c67328105d2 --- /dev/null +++ b/docs_new/docs/advanced_features/deterministic_inference.mdx @@ -0,0 +1,215 @@ +--- +title: "Deterministic Inference" +metatags: + description: "SGLang deterministic inference: consistent outputs for RL training, testing, and production. Supports FlashInfer, FA3, Triton backends with CUDA Graph." +--- +## Why Deterministic Inference Matters + +Deterministic inference ensures consistent LLM outputs across runs, which is critical for: +- **Reinforcement Learning**: Ensures consistent logprobs across runs, reducing stochastic noise and making RL training more stable, reproducible, and debuggable.
+- **Testing & Debugging**: Enables reproducible validation +- **Production**: Improves reliability and user experience + +Even with `temperature=0`, standard LLM inference can produce different outputs due to dynamic batching and varying reduction orders in GPU kernels. + +## The Root Cause of Non-Determinism + +The main source is **varying batch sizes**. Different batch sizes cause GPU kernels to split reduction operations differently, leading to different addition orders. Due to floating-point non-associativity (`(a + b) + c ≠ a + (b + c)`), this produces different results even for identical inputs. + + +## SGLang's Solution + +Building on [Thinking Machines Lab's batch-invariant operators](https://github.com/thinking-machines-lab/batch_invariant_ops), SGLang achieves fully deterministic inference while maintaining compatibility with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. The development roadmap for deterministic inference features can be found in this [issue](https://github.com/sgl-project/sglang/issues/10278). + +### Supported Backends + +Deterministic inference is only supported with the following three attention backends: **FlashInfer**, **FlashAttention 3 (FA3)**, and **Triton**. + +The following table shows feature compatibility for deterministic inference across different attention backends: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Attention Backend | CUDA Graph | Chunked Prefill | Radix Cache | Non-greedy Sampling (Temp > 0) |
|---|---|---|---|---|
| **FlashInfer** | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes |
| **FlashAttention 3 (FA3)** | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| **Triton** | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
+ +## Usage + +### Basic Usage + +Enable deterministic inference by adding the `--enable-deterministic-inference` flag: + +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-8B \ + --attention-backend fa3 \ + --enable-deterministic-inference +``` + +### Server Arguments + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Type/Default | Description |
|---|---|---|
| `--enable-deterministic-inference` | flag; default: disabled | Enable deterministic inference with batch-invariant operations |
| `--attention-backend` | string; default: fa3 | Choose attention backend (flashinfer, fa3, or triton) |
+ +### Example Configurations + +#### Qwen3-8B +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-8B \ + --attention-backend flashinfer \ + --enable-deterministic-inference +``` + +#### Llama Models +```bash Command +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --attention-backend fa3 \ + --enable-deterministic-inference +``` + +#### Qwen3-30B-A3B (MoE Model) +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-30B-A3B \ + --attention-backend fa3 \ + --enable-deterministic-inference +``` + +### Deterministic Inference with Non-Greedy Sampling (Temperature > 0) + +SGLang supports deterministic inference even with non-greedy sampling by using sampling seeds. This is particularly useful for reinforcement learning scenarios like GRPO (Group Relative Policy Optimization) where you need multiple diverse but reproducible responses. + +#### Default Behavior + +By default, SGLang uses a sampling seed of `42` for reproducible sampling: + +```python Example +import requests + +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "Tell me a joke", + "sampling_params": { + "temperature": 0.8, # Non-greedy sampling + "max_new_tokens": 128, + }, + }, +) +print(response.json()) +# This will always produce the same response across runs +``` + +#### Generating Multiple Reproducible Responses + +To sample different responses from the same prompt while maintaining reproducibility (e.g., for GRPO training), provide different sampling seeds in your requests: + +```python Example +import requests + +# Prepare a list of sampling seeds for different responses +sampling_seeds = [42, 43, 44, 45, 46] + +responses = [] +for seed in sampling_seeds: + response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "Tell me a joke", + "sampling_params": { + "temperature": 0.8, + "max_new_tokens": 128, + "sampling_seed": seed, # Specify sampling seed + }, + }, 
+ ) + responses.append(response.json()) + +# Each seed will produce a different but reproducible response +# Using the same seed will always produce the same response +``` + +This approach ensures that: +- Different seeds produce diverse responses +- The same seed always produces the same response across different runs +- Results are reproducible for debugging and evaluation + + +## Verification + +Run deterministic tests to verify consistent outputs: + +```bash Command +# Single test: same prompt, varying batch sizes +python3 -m sglang.test.test_deterministic --test-mode single --n-trials 50 + +# Prefix test: prompts with different prefix lengths +python3 -m sglang.test.test_deterministic --test-mode prefix --n-trials 50 + +# Radix Cache Consistency mode: test radix cache determinism (cached vs uncached prefill) +python3 -m sglang.test.test_deterministic --test-mode radix_cache +``` + +Expected result: All tests should show `Unique samples: 1` (perfectly deterministic). diff --git a/docs_new/docs/advanced_features/dp_dpa_smg_guide.mdx b/docs_new/docs/advanced_features/dp_dpa_smg_guide.mdx new file mode 100644 index 000000000000..cbb2545b75c3 --- /dev/null +++ b/docs_new/docs/advanced_features/dp_dpa_smg_guide.mdx @@ -0,0 +1,509 @@ +--- +title: "DP, DPA and SGLang DP Router" +metatags: + description: "Learn the differences between Data Parallelism, Data Parallelism Attention, and SGLang Model Gateway routing for production DP deployments." +--- + +This guide explains the difference between Data Parallelism (DP) and Data Parallelism Attention (DPA), how to enable each mode correctly, and how to use the SGLang Model Gateway (SMG) for production-grade DP deployments. + +## Data Parallelism (DP) + +**Data Parallelism (DP)** is the most common parallelism strategy that replicates the entire model across multiple GPU sets and processes different batches of requests in parallel. Each GPU set handles independent requests. 
With the routing strategies in SGLang Model Gateway, introduced later in this guide, the throughput of your serving system can scale nearly linearly with the number of replicas. + +### Key characteristics + +- Each replica has a full copy of the model +- Requests are distributed/scattered across replicas +- No inter-replica communication during one request's inference (for simple DP) + +## Data Parallelism Attention (DPA) + +**Data Parallelism Attention (DPA)**, also known as DP Attention, is an advanced parallelism strategy. While DPA provides the most significant benefits for **Multi-Head Latent Attention (MLA)** models (such as DeepSeek, MiniMax, Kimi-K2), it also supports **standard attention models** like Qwen. + +### The Problem with Tensor Parallelism for MLA Models + +The most common parallelism strategy for inference is **Tensor Parallelism (TP)**. However, TP might not be the most efficient strategy for certain models. For example, DeepSeek models use MLA and only have **one KV head**. If we use tensor parallelism on 8 GPUs, it will lead to: + +- **Duplicated KV cache** across all GPUs +- **Wasted memory** that limits batch size +- **Reduced throughput** due to memory constraints + +### How DPA Works + +DPA addresses these limitations by applying **data parallelism specifically to the attention component**. + +
*Figure: DPA + EP Architecture*

Each DP replica:

- Processes different batches independently (can be in different forward modes: prefill, decode, or idle)
- Maintains its own KV cache (no duplication)
- Enables significantly larger batch sizes due to memory savings

Communication patterns in DPA + EP:

- All2All (Dispatch): Routes tokens to expert sub-groups based on gating decisions
- All2All (Combine): Gathers computed results from experts back to original token positions
+ +### Key benefits of DPA + +1. **Significantly reduced KV cache memory**: Each DP replica only stores KV cache for its own batches +2. **Larger batch sizes**: Memory savings enable larger batch sizes +3. **Improved decoding throughput**: Significant throughput gains for MLA-based models +4. **Independent forward modes**: Each DP replica can be in different forward modes (prefill, decode, or idle) and handles its assigned batches independently during attention computation + +### DPA with Expert Parallelism for MoE + +For MoE models like DeepSeek, DPA is **often** paired with Expert Parallelism (EP) for best throughput at scale. However, **DPA does not require EP**: you can enable DPA without EP if your deployment does not need expert sharding. + +- Distribute 256+ expert weights across GPUs (cannot fit on a single GPU) +- Enable efficient all-to-all token routing via DeepEP +- Scale to large clusters (up to 5x throughput improvement over vanilla TP) + +### Recommended setup for DeepSeek + +```bash +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3 \ + --tp 8 \ + --dp-size 8 \ + --ep 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --moe-runner-backend deep_gemm +``` + +> **Note**: `--dp-size` must be explicitly set when using `--enable-dp-attention`. If `dp_size` is 1 (default), DPA will be disabled. + +For detailed EP configuration (DeepEP, Two-Batch Overlap, EPLB), see [Expert Parallelism](/docs/advanced_features/expert_parallelism). 
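
The flag rules for DPA can be summarized in a few lines of Python. This is only an illustrative sketch: `resolve_dp_attention` is a hypothetical helper, not an SGLang function. It encodes the note above (DPA stays off when `dp_size` is 1) together with the `tp_size % dp_size == 0` constraint stated in the Activation Logic section of this guide.

```python
def resolve_dp_attention(tp_size: int, dp_size: int, enable_dp_attention: bool) -> bool:
    """Return True if DPA would be active under the rules described in this guide.

    Hypothetical helper for illustration; SGLang's own argument handling may differ.
    """
    if not enable_dp_attention:
        return False
    if dp_size <= 1:
        # dp_size defaults to 1, in which case the guide says DPA is disabled.
        return False
    if tp_size % dp_size != 0:
        raise ValueError(
            f"tp_size ({tp_size}) must be divisible by dp_size ({dp_size})"
        )
    return True


# The recommended DeepSeek setup (--tp 8 --dp-size 8 --enable-dp-attention).
print(resolve_dp_attention(tp_size=8, dp_size=8, enable_dp_attention=True))
# Forgetting --dp-size leaves DPA off even though the flag is present.
print(resolve_dp_attention(tp_size=8, dp_size=1, enable_dp_attention=True))
```

A check like this can be handy in a launch script to catch a missing `--dp-size` before the server starts.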
+ +### Target Models + +DPA supports the following model architectures: + +- **MLA (Multi-Head Latent Attention) models** - where DPA provides the most significant benefits: + - DeepSeek family (DeepSeek-V2, DeepSeek-V3, DeepSeek-R1) + - MiniMax models + - Kimi-K2 + - Other models using MLA architecture + +- **Standard attention models** - also supported: + - Qwen models (see [PR #6121](https://github.com/sgl-project/sglang/pull/6121)) + +For models like Llama that use standard GQA, standard DP or TP is typically recommended. + +To enable DPA, add `--enable-dp-attention` to your server launch command together with a `--dp-size` greater than 1. + +### Activation Logic + +DPA is enabled explicitly via server arguments (CLI or config). You must set both `--dp-size` and `--enable-dp-attention`: + +```bash +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3 \ + --tp 8 \ + --dp-size 8 \ + --enable-dp-attention +``` + +**Important**: `--dp-size` must be greater than 1 for DPA to work. When `dp_size == 1` (default), `--enable-dp-attention` is automatically disabled. The constraint `tp_size % dp_size == 0` must also be satisfied. + +### Standard DP for MLA models + +MLA models also support standard DP. To enable it, first launch each replica of the MLA model independently; each replica may itself be launched with DPA enabled. After all replicas are running, launch an SMG and connect the replicas to it. SMG is explained in the next section. + +## Modern Data Parallelism SGLang Model Gateway (SMG) + +### Native DP Mode + +Native DP (built-in Data Parallelism) in SGLang creates multiple worker processes within a single SGLang instance. The workers are managed by `DataParallelController` and their number is set with the `--dp-size` launch parameter.
+ +```bash +# Native DP mode +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 +``` + +**Limitations:** + +- Built-in in-process load balancing only (e.g., `round_robin`, `total_requests`, `total_tokens`) +- No cache-aware routing +- Limited observability and metrics +- No fault tolerance or circuit breakers +- Not suitable for production workloads + +⚠️ Native DP is **strongly discouraged**. It is only used by some outdated RL frameworks. Use SGLang Model Gateway (SMG) for data parallelism in every use case. + +### SMG-Based DP (Recommended) + +Since September 2024, SGLang Model Gateway (SMG), formerly named SGLang DP Router, has been built as a production-ready DP routing system written in Rust. It started with DP routing, and its scope later expanded to coordinate RL, PD Disaggregation, and other scenarios. This doc only discusses SMG's usage in DP routing. For other usage, please refer to [SGLang Model Gateway Documentation](/docs/advanced_features/sgl_model_gateway). + +> To achieve the best production-level routing performance and keep overhead to a minimum, SMG is built in Rust rather than Python, because Python is not fast enough for this purpose. + +**We strongly recommend using the SGLang Model Gateway (SMG) for production-grade Data Parallelism.** SMG provides significant advantages over native DP mode. + +```bash +# SMG-based DP mode (Recommended) +python -m sglang_router.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 +``` + +⚠️ Note that **SMG and Native DP share the same launching parameter, `--dp-size`**. The difference is the entrypoint: Native DP uses `python -m sglang.launch_server`, while SMG uses `python -m sglang_router.launch_server`. + +**Advantages of SMG-Based DP:** + +
| Feature | Native DP | SMG-Based DP |
|---|---|---|
| Load Balancing | Built-in in-process methods | Advanced policies (cache-aware, power-of-two, etc.) |
| Cache Awareness | ❌ No | ✅ Yes - significantly higher cache hit rate |
| Throughput | Baseline | Significant improvement |
| Multi-Node Support | Limited | ✅ Full support |
| Worker Health Monitoring | Basic | ✅ Circuit breakers, health checks |
| Reliability | Basic | ✅ Retries, rate limiting, queuing |
| Observability | Basic metrics | ✅ 40+ Prometheus metrics, OpenTelemetry |
| Hot Worker Add/Remove | ❌ No | ✅ Yes |
+ +### SMG's Performance + +The cache-aware routing policy in SMG significantly improves performance for workloads with shared prefixes: + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Metric | Without Cache-Aware | With Cache-Aware SMG |
|---|---|---|
| Throughput (token/s) | 82,665 | 158,596 (+92%) |
| Cache Hit Rate | 20% | 75% (+275%) |
+ +*Benchmark from [SGLang v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/), workload with multiple long prefix groups, 8x A100 80GB GPUs, dp-size=8* + +### When to Use Each + +**Use Native DP when:** + +- ~Never use Native/Naive DP~ +- Learning material of DP routing + +**Use SMG-Based DP when:** + +- In any case, when you think DP is needed +- Production deployments +- Multi-node distributed setups +- Workloads with shared prefixes (high cache reuse potential) +- You need high availability and reliability features +- You require detailed observability and metrics +- You want to have highly efficient RL rollout systems + +Note that for RL rollout systems, **there are four crucial reasons that SMG-Based DP is far better than naive DP routing**. Details can be found at [Load Balancing Router in RL](/docs/advanced_features/sglang_for_rl#load-balancing-router). + +### Quick Start For SMG + +**Installation** + +```bash +pip install sglang-router +# or +pip install "sglang[all]" +``` + +**Option A: Co-launch Workers and SMG (Simplest)** + +This is the easiest way to get started - SMG and workers are launched together: + +```bash +python -m sglang_router.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Option B: Separate Launch (Multi-Node)** + +For distributed deployments across multiple machines: + +1. Launch workers on each node + +```bash +# Node 1 +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --port 8000 + +# Node 2 +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --port 8000 +``` + +2. 
Launch SMG pointing to workers + +```bash +python -m sglang_router.launch_router \ + --worker-urls http://node1:8000 http://node2:8000 \ + --policy cache_aware \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Option C: Dynamic Worker Registration** + +For elastic deployments where workers can be added/removed dynamically: + +```bash +# Launch SMG first +python -m sglang_router.launch_router \ + --policy cache_aware \ + --host 0.0.0.0 \ + --port 30000 + +# Register workers dynamically +curl -X POST http://localhost:30000/workers \ + -H "Content-Type: application/json" \ + -d '{"url": "http://worker1:8000"}' + +curl -X POST http://localhost:30000/workers \ + -H "Content-Type: application/json" \ + -d '{"url": "http://worker2:8000"}' +``` + +### Load Balancing Policies + +SMG supports multiple load balancing policies: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Policy | Description | Best For |
|---|---|---|
| `cache_aware` | Combines cache locality with load balancing | Recommended for most workloads |
| `round_robin` | Cycles through workers in order | Simple, predictable distribution |
| `random` | Random worker selection | Baseline, testing |
| `power_of_two` | Samples two workers, picks lighter one | Low latency requirements |
+ +**Cache-Aware Policy (Default, Recommended)** + +The cache-aware policy provides the best performance for most workloads: + +```bash +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8000 \ + --policy cache_aware \ + --cache-threshold 0.5 \ + --balance-abs-threshold 32 \ + --balance-rel-threshold 1.5 \ + --eviction-interval-secs 120 \ + --max-tree-size 67108864 +``` + +**How it works:** + +1. Maintains an approximate radix tree for each worker based on request history +2. Routes requests to workers with the highest prefix match (cache hit) +3. Falls back to shortest-queue routing when load is imbalanced +4. Automatically evicts old entries to prevent memory overflow + +### Best Practices + +1. **Start with `cache_aware` policy** - It provides the best balance between cache locality and load distribution for most workloads +2. **Use SMG for production** - Prefer `sglang_router.launch_server` over `sglang.launch_server` for better reliability and observability +3. **Enable health checks** - Configure `--router-health-check-interval-secs` to detect and remove unhealthy workers automatically + +**Recommended command with best practices applied:** + +```bash +python -m sglang_router.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 \ + --router-policy cache_aware \ + --router-health-check-interval-secs 30 \ + --router-prometheus-port 10001 \ + --host 0.0.0.0 \ + --port 30000 +``` + +For advanced configuration (circuit breakers, retries, Prometheus metrics, K8s integration), see [SGLang Model Gateway Documentation](/docs/advanced_features/sgl_model_gateway). + +### Verifying Traffic Distribution + +After launching SMG, verify that traffic is being distributed correctly: + +**1. Check worker status:** + +```bash +curl http://localhost:30000/workers +``` + +**2. Check load distribution:** + +```bash +curl http://localhost:30000/get_loads +``` + +**3. 
Monitor metrics (if Prometheus enabled):** + +```bash +# Key metrics to check +smg_router_requests_total{model="..."} +smg_worker_requests_active{worker="..."} +sglang_cache_hit_rate{source="..."} +``` + +For detailed metrics and monitoring setup, see [SGLang Model Gateway Documentation](/docs/advanced_features/sgl_model_gateway). + +## Reference + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Strategy | Use Case | Key Benefit |
|---|---|---|
| Native DP (`--dp-size`) | Never | Easy to understand; not Rust-based |
| SMG-Based DP | Production (recommended) | Cache-aware routing, high availability |
| DPA (`--dp-size N --enable-dp-attention`) | DeepSeek/MLA models | Eliminates KV cache duplication, improved throughput |
| DPA + EP | DeepSeek MoE models | Significant throughput improvement vs vanilla TP |
+ +**Recommended production setup for DeepSeek:** +1. Enable **DPA** for attention layers (`--dp-size 8 --enable-dp-attention`) +2. Enable **EP** for MoE layers (`--ep 8 --moe-a2a-backend deepep`) +3. Use **SMG** with **cache_aware** policy + +**Related documentation:** +- [Expert Parallelism](./expert_parallelism) - DeepEP, Two-Batch Overlap, EPLB +- [SGLang Model Gateway Documentation](./sgl_model_gateway) - SMG configuration & troubleshooting +- [Large-Scale EP Blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/) - 96 GPU deployment guide diff --git a/docs_new/docs/advanced_features/dp_for_multi_modal_encoder.mdx b/docs_new/docs/advanced_features/dp_for_multi_modal_encoder.mdx new file mode 100644 index 000000000000..afa991532496 --- /dev/null +++ b/docs_new/docs/advanced_features/dp_for_multi_modal_encoder.mdx @@ -0,0 +1,33 @@ +--- +title: "DP for Multi-Modal Encoder in SGLang" +metatags: + description: "Data parallelism for VLM vision encoder in SGLang: reduce TTFT, boost throughput. Supports Qwen2.5-VL, Qwen3-VL, InternVL, GLM-4.5V/4.6V." +--- +A typical VLM architecture involves two main components: a multi-modal encoder and a text decoder. + +Most VLMs utilize a Vision Transformer (ViT) as their multi-modal encoder. It is responsible for processing visual data, extracting features (objects, colors, textures, etc.), and transforming them into a format that can be understood by the model. + +The text decoder is based on an LLM. It processes textual data and generates output based on the encoded visual features. + +However, since the ViT is very small compared to language decoders, +there is relatively little gain from TP. On the other hand, TP incurs significant communication +overhead because an all-reduce is performed after every layer. + +Placing the ViT in data parallel while keeping the LLM in tensor parallel consistently lowers TTFT and boosts end-to-end throughput.
In this hybrid layout, the vision front-end becomes parallel and lightweight, while scarce interconnect bandwidth and collective ops are reserved for the LLM. + +Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel. + +## Command Example +You can enable batch-level DP by setting `mm-enable-dp-encoder`, for example: +```shell Command +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen2.5-VL-7B-Instruct \ + --tp 2 \ + --mm-enable-dp-encoder +``` + +## Known supported models +- Qwen2.5-VL (<https://github.com/sgl-project/sglang/pull/13126>) +- Qwen3-VL (<https://github.com/sgl-project/sglang/pull/13724>) +- InternVL (<https://github.com/sgl-project/sglang/pull/13925>) +- GLM-4.5V & GLM-4.6V (<https://github.com/sgl-project/sglang/pull/14097>) diff --git a/docs_new/docs/advanced_features/epd_disaggregation.mdx b/docs_new/docs/advanced_features/epd_disaggregation.mdx new file mode 100644 index 000000000000..db936196a05b --- /dev/null +++ b/docs_new/docs/advanced_features/epd_disaggregation.mdx @@ -0,0 +1,197 @@ +--- +title: "EPD Disaggregation" +metatags: + description: "SGLang EPD disaggregation: separate encoder, prefill, decode stages for VLM inference. Independent scaling, load balancing, three-tier architecture." +--- +## Why and What is EPD Disaggregation? + +In modern Vision-Language Model (VLM) inference, request execution naturally decomposes into three distinct stages: Encoder, Prefill, and Decode. +The Encoder stage performs vision preprocessing and ViT-based image encoding, which is highly compute-intensive but only required during request initialization. The Prefill stage processes the full multimodal input sequence to initialize the language model’s Key-Value (KV) cache, while the Decode stage is dominated by memory bandwidth and KV cache access for autoregressive token generation. 
+ +Existing deployments typically colocate these stages within a unified execution engine, or at best apply Prefill–Decode (PD) disaggregation. However, such designs still tightly couple vision encoding with language prefill, leading to inefficient resource utilization, limited scalability for image-heavy workloads, and suboptimal scheduling under load. + +To address these challenges, we introduce Encoder–Prefill–Decode (EPD) Disaggregation in SGLang. EPD further separates vision encoding from language processing, enabling independent horizontal scaling of encoder servers, improved load balancing for multimodal requests, and seamless integration with existing PD disaggregation to form a fully decoupled three-tier inference architecture. + +### Usage + +You can launch a language-only model using `--language-only`, or an encoder-only model using `--encoder-only`. +When launching a language-only model, you must additionally specify the encoder service endpoints via `--encoder-urls`. + +We support multiple encoder transfer backends, including zmq_to_scheduler, zmq_to_tokenizer, and mooncake (the default is zmq_to_scheduler). The backend can be selected using `--encoder-transfer-backend`. + +### Encoder transfer with Mooncake + +`--encoder-transfer-backend mooncake` controls **how encoder outputs are transferred** between encoder and language/prefill services. It is an encoder transfer option and can be used independently of the global multimodal embedding cache. 
+ +Example: + +```bash Command +# encoder +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --encoder-transfer-backend mooncake \ + --port 30000 + +# language-only server +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --language-only \ + --encoder-urls http://127.0.0.1:30000 \ + --encoder-transfer-backend mooncake \ + --port 30002 +``` + +### Global multimodal embedding cache with Mooncake + +SGLang also supports a Mooncake-backed **global multimodal embedding cache** for EPD workloads. When enabled on encoder servers, repeated image inputs can reuse previously computed ViT embeddings across instances instead of running the vision encoder again. + +This feature is useful when: + +- the deployment serves repeated or overlapping image inputs, +- encoder compute is the bottleneck, and +- Mooncake is already available in the cluster. + +At a high level, the encoder checks whether the image embedding already exists in Mooncake. Cache hits are prefetched from the global store, while misses are encoded normally and inserted into the cache in the background. + +To enable it: + +- install and configure Mooncake in the same way as other SGLang Mooncake integrations, +- add `--enable-mm-global-cache` on the encoder server. + +`--enable-mm-global-cache` controls **whether multimodal embeddings are looked up and stored in the global Mooncake cache**. It is separate from `--encoder-transfer-backend`, which only controls encoder output transport. + +For Mooncake deployment and configuration details, see [HiCache best practices](./hicache_best_practices#deployment-with-mooncake) and the [Mooncake backend README](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/mooncake_store/README.md). 
+ +Example: + +```bash Command +# Shared Mooncake configuration +export MOONCAKE_TE_META_DATA_SERVER="http://127.0.0.1:8080/metadata" +export MOONCAKE_MASTER="127.0.0.1:50051" +export MOONCAKE_PROTOCOL="rdma" +export MOONCAKE_GLOBAL_SEGMENT_SIZE="4gb" + +# encoder with global multimodal cache enabled +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --enable-mm-global-cache \ + --port 30000 + +# language-only server +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --language-only \ + --encoder-urls http://127.0.0.1:30000 \ + --port 30002 +``` + +Notes: + +- This cache is for **multimodal encoder embeddings**, not the language model KV cache. +- The feature currently uses Mooncake as the shared backing store. +- It can be enabled regardless of which `--encoder-transfer-backend` you use. +- It is most relevant for EPD or encoder-disaggregated VLM deployments where the same images are likely to appear across requests or instances. 
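
The lookup flow described above (check the shared store, reuse on a hit, encode and insert on a miss) can be sketched in plain Python. Everything in this sketch is an assumption made for illustration: `EmbeddingCache`, `fake_vit_encode`, and the SHA-256 content key are hypothetical, the store is a local dict rather than Mooncake, and the insert happens inline instead of in the background.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Toy stand-in for a shared embedding store (hypothetical, not SGLang's API)."""

    def __init__(self, encode_fn: Callable[[bytes], List[float]]):
        self._store: Dict[str, List[float]] = {}
        self._encode_fn = encode_fn
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(image_bytes: bytes) -> str:
        # Assumption: identical image bytes map to the same cache key.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(self, image_bytes: bytes) -> List[float]:
        key = self.key_for(image_bytes)
        cached = self._store.get(key)
        if cached is not None:
            self.hits += 1  # cache hit: skip the vision encoder entirely
            return cached
        self.misses += 1  # cache miss: encode normally, then store the result
        embedding = self._encode_fn(image_bytes)
        self._store[key] = embedding
        return embedding


def fake_vit_encode(image_bytes: bytes) -> List[float]:
    """Placeholder for the expensive ViT forward pass."""
    return [b / 255.0 for b in image_bytes[:4]]


cache = EmbeddingCache(fake_vit_encode)
first = cache.get_or_encode(b"\x00\xff\x10\x20")   # miss, runs the encoder
second = cache.get_or_encode(b"\x00\xff\x10\x20")  # hit, reuses the embedding
print(cache.hits, cache.misses)
```

The sketch only shows why repeated images avoid encoder compute; it says nothing about how Mooncake stores or transfers the embeddings.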
+ +#### Qwen VL + +- EP Disaggregation + +```bash Command +# encoder 0 +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30000 +# encoder 1 +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30001 +# language-only server +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --language-only \ + --encoder-urls http://127.0.0.1:30000 http://127.0.0.1:30001 \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30002 +``` + +- EPD Disaggregation + +```bash Command +# encoder 0 +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30000 +# encoder 1 +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30001 +# prefill 0 +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --disaggregation-mode prefill \ + --language-only \ + --encoder-urls http://127.0.0.1:30000 http://127.0.0.1:30001 \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30002 +# decode 0 +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --disaggregation-mode decode \ + --port 30003 +# router +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --prefill http://$PREFILL_HOST:30002 \ + --decode http://$DECODE_HOST:30003 \ + --port 8000 + +``` + +#### gRPC Encoder (EPD) + +You can run the encoder as a gRPC server while keeping prefill/decode as HTTP. +When using gRPC encoders, set `SGLANG_ENCODER_MM_RECEIVER_MODE=grpc` for the +prefill process so it uses the gRPC receiver. 
+ +```bash Command +# gRPC encoder +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --encoder-only \ + --grpc-mode \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30000 + +# prefill (HTTP) - tell it to use gRPC receiver +SGLANG_ENCODER_MM_RECEIVER_MODE=grpc \ +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --disaggregation-mode prefill \ + --language-only \ + --encoder-urls grpc://127.0.0.1:30000 \ + --encoder-transfer-backend zmq_to_scheduler \ + --port 30002 + +# decode (HTTP) +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --disaggregation-mode decode \ + --port 30003 + +# router +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --prefill http://$PREFILL_HOST:30002 \ + --decode http://$DECODE_HOST:30003 \ + --port 8000 +``` diff --git a/docs_new/docs/advanced_features/expert_parallelism.mdx b/docs_new/docs/advanced_features/expert_parallelism.mdx new file mode 100644 index 000000000000..1e0e1965550f --- /dev/null +++ b/docs_new/docs/advanced_features/expert_parallelism.mdx @@ -0,0 +1,304 @@ +--- +title: "Expert Parallelism" +metatags: + description: "SGLang Expert Parallelism: distribute MoE experts across GPUs, DeepEP all-to-all, grouped GEMMs, TBO/SBO overlap, EPLB load balancing." +--- +Expert Parallelism (EP) in SGLang distributes expert weights across multiple devices in Mixture-of-Experts (MoE) models, addressing memory bottlenecks and enabling efficient scaling for high-performance inference. It is particularly vital for serving large-scale MoE models where tokens are dynamically routed to specialized experts across GPUs. By leveraging optimized all-to-all communication and grouped matrix multiplications (GEMMs), EP reduces latency, boosts throughput, and minimizes idle GPU time. 
SGLang's EP offers strong extensibility through its modular framework, allowing seamless integration of custom kernels, backends, and optimizations without refactoring core logic, supporting diverse hardware and quantization schemes. + +## Supported Backends and Selection Guidance + +SGLang's EP integrates diverse, highly efficient backends for different use cases, allowing fine-grained control over performance trade-offs. Users specify backends via command-line flags: +- `--moe-a2a-backend`: Selects the backend for all-to-all communication. +- `--moe-runner-backend`: Selects the backend for MoE computation. + +### Backends for All-to-All Communication + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Backend | Description | Use Cases |
|---|---|---|
| **`none` (default)** | Disables all-to-all for EP. Uses All-Reduce or All-Gather for token dispatch. | Hybrid EP and TP setups. |
| `deepep` | DeepEP, a communication library for efficient token shuffling in MoE models. | Large-scale EP deployments. |
| `mooncake` | An extension of DeepEP for elastic inference, leveraging RDMA for high-performance data transfers. | Elastic EP serving. |
| `nixl` | NIXL-EP, an elastic EP communication library built on NVIDIA's NIXL framework with native RDMA and NVLink support. | Elastic EP serving with fault tolerance and dynamic scaling. |
| `mori` | MORI-EP, AMD's native all-to-all communication implementation optimized for ROCm. | AMD GPU deployments. |
| `flashinfer` | FlashInfer implementation of all-to-all. | Large-scale EP deployments. |
| `ascend_fuseep` | Ascend NPU native fused all-to-all communication. | Ascend NPU deployments. |
+ +DeepEP and Mooncake backends support two modes for token dispatch: `normal` mode (optimized for prefill workloads with high throughput) and `low_latency` mode (optimized for decode workloads with low latency and CUDA Graph compatibility). The MORI backend currently supports only `normal` mode. NIXL-EP currently operates in low-latency mode with CUDA Graph support. We recommend setting `--deepep-mode auto` to enable automatic dispatch mode switching at runtime. Setting `--deepep-mode normal` or `--deepep-mode low_latency` is useful for debugging or development purposes. + +Currently, DeepEP, Mooncake, NIXL-EP, `ascend_fuseep` and MORI only support cases where `ep_size = tp_size`. For hybrid EP and TP (i.e., `ep_size < tp_size`), only the `none` backend (All-Reduce or All-Gather-based dispatching) is supported. + +### Backends for MoE Computation + +
| Backend | Description | Use Cases |
| --- | --- | --- |
| **`auto` (default)** | Automatically selects the optimal backend based on model architecture, hardware (e.g., NVIDIA architecture like Ampere, Hopper, Blackwell), quantization scheme (e.g., FP8, FP4), and runtime conditions. | General-purpose deployments; ensures compatibility and performance without user intervention. |
| `triton` | Triton-based implementation for grouped GEMMs. To achieve higher performance, it's highly recommended to create tuned configurations. | Custom kernel development or scenarios requiring high extensibility with Torch compilation support. |
| `deep_gemm` | DeepGEMM backend optimized for MoE matrix multiplications, supporting contiguous layouts for prefill and masked layouts for decode; often JIT-compiled for performance. | Large-scale EP deployments with FP8 block-wise quantization. |
| `cutlass` | CUTLASS-based backend for efficient GEMMs. | NVIDIA architectures with CUTLASS support. |
| `flashinfer_trtllm` | FlashInfer integrated with TensorRT-LLM for accelerated MoE computations, supporting FP4 communication operators and high-performance GEMMs. | Blackwell with TRT-LLM. |
| `flashinfer_trtllm_routed` | FlashInfer integrated with TensorRT-LLM for accelerated routed MoE computations, consuming SGLang-computed top-k expert assignments and weights. | Blackwell with TRT-LLM. |
| `flashinfer_cutlass` | FlashInfer combined with CUTLASS for high-performance grouped GEMMs in MoE layers, handling FP4/FP8 quantization efficiently. | Blackwell with FP4/FP8 models. |
| `flashinfer_mxfp4` | FlashInfer variant optimized for MXFP4 (microscaling FP4) quantization in MoE runners, focusing on memory-efficient low-precision inference. | Low-precision models with MXFP4. |
| `flashinfer_cutedsl` | FlashInfer with a custom DSL for flexible and efficient MoE kernel generation, integrated with ModelOpt FP4 quantization. | Low-precision models with NVFP4. |
+ +### Examples + +Launch with DeepEP and DeepGEMM for DeepSeek-V3: + +```bash Command +python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --moe-a2a-backend deepep --moe-runner-backend deep_gemm --tp 8 --ep 8 +``` + +## Extensible EP Framework + +SGLang's EP framework provides modular abstractions for easy integration of custom kernels, backends, and optimizations. It decouples the MoE forward pass into stages (dispatch → pre-permute → core runner → post-permute → combine), enabling seamless extensions without refactoring core logic. + +### Framework Overview + +The framework centers on `FusedMoE` as the unified entry point for a single, extensible structure. Key components include: +- **Dispatcher**: Manages dispatch/combine for backends like DeepEP (implements `BaseDispatcher` subclasses). +- **MoeRunner**: Orchestrates grouped-GEMM execution via `MoeRunnerCore` implementations (e.g., `TritonRunnerCore`). +- **PermuteMethodPool**: Auto-registers layout conversions (e.g., pre/post-permute via `register_pre_permute` and `register_post_permute` for dynamic modes, or `register_fused_func` for static, torch.compile-compatible fused operations). +- **TopK Router**: Backend-agnostic expert selection. + +This design supports multiple backends via `--moe-a2a-backend` and `--moe-runner-backend`, with quantization integrated through a standardized `apply()` method. 
The computation flow ensures modularity: + +```text Output +[input_hidden_states] + | + v + TopK.forward -> select_experts / triton_kernels.routing / bypass + | + v + [TopKOutput] + | + v + FusedMoE.forward -> Dispatcher.dispatch -> DeepEP / bypass + | | + | v + | [DispatchOutput] + | | + | v + | quant_method.apply -> MoeRunner.forward + | | | + | | v + | | pre-permute + grouped_gemm + post-permute + | | | + | |-------------- + | v + | [CombineInput] + | | + | v + | Dispatcher.combine -> DeepEP / bypass + | | + |--------------------- + v +[final_hidden_states] +``` + +For details, see the [MoE Refactor Roadmap](https://github.com/sgl-project/sglang/issues/8715). + +### Implementing New Backends + +To add a new backend: +1. For a new all-to-all dispatcher, implement a `BaseDispatcher` subclass with `dispatch` and `combine` methods. +2. For a new MoE runner backend, define a `MoeRunnerCore` subclass for core operations (e.g., grouped GEMMs). +3. Define new input/output formats for the dispatcher or model runner (e.g., `RunnerInput`, `RunnerOutput`). +4. Register permute/unpermute methods to ensure compatibility: + - **Fused Mode** (static, torch.compile-compatible): Use `register_fused_func` for end-to-end operations. + - **Permute Mode** (dynamic): Register `register_pre_permute` and `register_post_permute` for flexible layouts. + +See the [MoE Refactor Implementation PR](https://github.com/sgl-project/sglang/pull/9269) for full changes, including type hints and config expansions. + +### Examples + +For an example implementation, see [moe_runner/triton.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/moe/moe_runner/triton.py), which demonstrates Triton-based grouped GEMMs with registered fused and permutation functions. + +## Computation and Communication Overlap + +SGLang's EP employs advanced overlap techniques to hide communication latency behind computation, maximizing GPU utilization in MoE layers. 
+ +### Two-Batch Overlap (TBO) + +TBO splits requests into micro-batches, interleaving attention computation with dispatch/combine operations. Yield points in the execution graph allow pausing for overlaps, increasing overall throughput without peak memory spikes: + +```python Example +operations = [ + self._forward_attn, + YieldOperation(), # Overlap with dispatch of prior micro-batch + self._forward_dispatch, + self._forward_mlp, + YieldOperation(), # Overlap with combine + self._forward_combine, +] +``` + +Users need to specify `--enable-two-batch-overlap` to unlock up to 2x throughput. For details, see the [Large-Scale EP Blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/#two-batch-overlap). + +### Single-Batch Overlap (SBO) + +SGLang introduces a dispatcher-hook system for Single-Batch Overlap (SBO), enabling the overlap of operations within a single batch—such as shared experts computation with communication—while decentralizing logic to enhance modularity. These hooks execute before and after the `dispatch` and `combine` operations without modifying core MoE modules. This design simplifies interfaces, reduces coupling, and improves extensibility. For implementation details and an example of overlapping shared experts with DeepEP's combine operation, refer to [PR #13327](https://github.com/sgl-project/sglang/pull/13327). Users can set `--enable-single-batch-overlap` to enable this feature. + + +## Workload Balancer + +SGLang integrates the [Expert Parallelism Load Balancer (EPLB)](https://github.com/deepseek-ai/EPLB) from DeepSeek to address routing imbalances in MoE models. By analyzing expert activation statistics, EPLB computes an optimal expert arrangement, strategically placing or replicating experts to minimize GPU utilization variance, reduce idle cycles, and enhance scalability. + +To enable EPLB, use the flags `--enable-eplb`. 
For optimal performance, increase batch sizes to stabilize activation statistics and configure periodic rebalancing (e.g., every 1000 requests) to adapt to evolving workloads. Simulations demonstrate significant improvements in load balancedness (ratio of mean to max computation time), correlating strongly with throughput gains. + +For more details, refer to the [EPLB Section in the Large-Scale EP Blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/#expert-parallelism-load-balancer) and the [EPLB Repository](https://github.com/deepseek-ai/eplb). + + +## EP with Speculative Decoding + + +When utilizing speculative decoding with MTP on MoE architectures, use the `--speculative-moe-runner-backend` and `--speculative-moe-a2a-backend` arguments to customize the MoE layer behavior for the draft model. While they default to the target model’s settings, users can set them separately when the target and draft models use different precisions. + +For a model like `nvidia/DeepSeek-R1-0528-NVFP4-v2`, the target model uses NVFP4 precision while the draft model uses BF16. To apply the `flashinfer_trtllm` kernel to the target MoE layers while falling back to the Triton fused MoE kernel for the draft MoE layers, set the arguments as follows: +```text Output +... +--moe-runner-backend flashinfer_trtllm \ +--speculative-moe-runner-backend triton \ +... +``` + + +## Ascend NPU Guidance +### Guidance on SGLang configuration in Ascend NPU +- `--moe-a2a-backend` only supports the `deepep` and `ascend_fuseep` backends: + + - `deepep`: The mechanism is consistent with the description above. + + - `ascend_fuseep`: Offers a large fused operator that integrates all operations between dispatch and combine to boost MoE computation. Only used for the decode stage in PD Disaggregation Mode. + +- `--moe-runner-backend` does not need to be configured. + +- `--deepep-mode`: + + - In PD mixed mode, please set `--deepep-mode auto`.
+ + - In PD Disaggregation Mode, prefill instance sets `--deepep-mode normal`, and decode instance sets `--deepep-mode low_latency`. + +### DeepEP Ascend Introduction +DeepEP Ascend is the adapted version of the DeepEP communication library for Huawei Ascend NPUs, specifically designed for Mixture-of-Experts (MoE) model Expert Parallelism (EP). +It supports the Ant-moving Function (Split the sequence length into rounds for streaming batch transmission) to optimize the buffer size occupied during collective communication in prefill stage, especially for long sequences. + +Ant-moving Function can be enabled for both the dispatch and combine phases via the following environment variables: + +- `DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS`: Enable ant-moving function in dispatch stage. Indicates the number of tokens transmitted per round on each rank, default 8192. + +- `DEEPEP_NORMAL_LONG_SEQ_ROUND`: Enable ant-moving function in dispatch stage. Indicates the number of rounds transmitted on each rank, default 1. + +- `DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ`: Enable ant-moving function in combine stage, default 0 (means disabled). + +`DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS * DEEPEP_NORMAL_LONG_SEQ_ROUND` means input sequence length. When the input sequence length exceeds 8192, it is recommended to enable the ant-moving function in both dispatch and combine phase. + +The environment variable `HCCL_BUFFSIZE` is used to configure the buffer size (MB) actually allocated. Its calculation formula is as follows: +```text Output +# Enable Ant-moving Function +HCCL_BUFFSIZE >= 2 * (102MB + 4MB + DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS * (hidden_size + hidden_size + hidden_size) * topk) + PADDING_BUFFSIZE + +# Disable Ant-moving Function +HCCL_BUFFSIZE >= 2 * (102MB + 4MB + TOTAL_SEQ_LEN * (hidden_size + hidden_size) * topk) + PADDING_BUFFSIZE +``` +Wherein the parameters are described as follows: + +- `hidden_size`: hidden size in model config. 
+ +- `topk`: The number of selected routing experts. + +- `TOTAL_SEQ_LEN`: input sequence length. + +- `PADDING_BUFFSIZE`: A value of 20 or greater is recommended. diff --git a/docs_new/docs/advanced_features/forward_hooks.mdx b/docs_new/docs/advanced_features/forward_hooks.mdx new file mode 100644 index 000000000000..a66f1548c5a1 --- /dev/null +++ b/docs_new/docs/advanced_features/forward_hooks.mdx @@ -0,0 +1,298 @@ +--- +title: "Model Forward Hooks" +metatags: + description: "SGLang forward hooks: attach PyTorch hooks to model submodules via JSON config. Log activations, debug internals, export hidden states." +--- + +## Model Hooks + +SGLang supports attaching PyTorch forward hooks to specific submodules in the loaded model, configured entirely via `server_args` JSON. + +This is useful for: + +* Logging intermediate activations +* Debugging model internals +* Exporting hidden states to external tooling + +Hooks are attached once during `ModelRunner.initialize` and run on every forward pass. + +*** +### Configuration overview + +Hooks are configured via a `ServerArgs` field: + +```python Example +class ServerArgs: + ... 
+ # For forward hooks + forward_hooks: Optional[List[dict[str, Any]]] = None +```` + +In JSON form, a minimal configuration looks like: + +```jsonc Example +{ + "forward_hooks": [ + { + "name": "outer_linear_hooks", + "target_modules": ["outer.0", "outer.1"], + "hook_factory": "my_project.hooks:dummy_hook_factory", + "config": { + "tag": "outer-layer" + } + } + ] +} +``` + +#### Top-level fields + +* `forward_hooks` (optional list of objects) + Each element is a hook spec describing: + + * Which modules to target + * Which Python factory to call + * What configuration to pass into that factory + +*** +### Hook spec schema + +Each entry in `forward_hooks` is a JSON object with the following shape: + +```jsonc Example +{ + "name": "optional-descriptive-name", + "target_modules": ["pattern1", "pattern2", "..."], + "hook_factory": "module.submodule:factory_name", + "config": { + "...": "arbitrary JSON" + } +} +``` + +#### `name` (optional) + +* Human-readable name for logging. +* Used only in log messages such as: + + ```text Output + Registered forward hook 'outer_linear_hooks' on outer.0 + ``` + +#### `target_modules` (required) + +* List of **module name patterns** used to match entries in `model.named_modules()`. +* Patterns are matched using `fnmatch.fnmatch`, so: + + * `"outer.0"` matches exactly `"outer.0"`. + * `"outer.*"` matches `"outer.0"`, `"outer.1"`, `"outer.inner"`, etc. + * `"outer.inner.*"` matches children under `outer.inner`. + +> If no modules match the given patterns, hook registration does **not** fail. +> Instead, SGLang logs a warning and continues: +> +> ```text +> No modules matched hook spec 'name' patterns=['...'] +> ``` + +#### `hook_factory` (required) + +* String path to the Python factory function that creates the hook. 
+* Supported formats: + + * `"package.module:factory_name"` + * `"package.module.submodule.factory_name"` + +The path is resolved via: + +```python Example +def resolve_callable(path: Optional[str]) -> Optional[Callable]: + if path is None: + return None + + if ":" in path: + module_name, fn_name = path.split(":", 1) + else: + parts = path.split(".") + if len(parts) < 2: + raise ValueError( + f"Invalid hook callable path '{path}'. " + "Expected 'module.submodule:factory' or 'module.submodule.factory'." + ) + *mod_parts, fn_name = parts + module_name = ".".join(mod_parts) + + module = importlib.import_module(module_name) + try: + return getattr(module, fn_name) + except AttributeError as e: + raise AttributeError( + f"Module '{module_name}' has no attribute '{fn_name}' " + f"(from hook path '{path}')" + ) from e +``` + +**Failure modes**: + +* If the path is malformed (not enough dots and no `:`), a `ValueError` is raised at startup. +* If the module imports but the attribute is missing, an `AttributeError` is raised with a clear error message. +* If the hook factory returns `None`, a warning is logged and no hook is registered for that spec (initialization continues). + +The first two cause initialization to fail fast with a descriptive error; the last one is non-fatal. + +#### `config` (optional) + +* Arbitrary JSON object. +* Passed directly to the hook factory as a Python `dict`. +* This lets you parameterize hook behavior from config (e.g. tags, log levels, sampling rates, etc.). + +*** +### Hook lifecycle and behavior + +Hooks are registered in `ModelRunner.initialize()`: + +```python Example +if server_args.forward_hooks: + register_forward_hooks(self.model, server_args.forward_hooks) +``` + +The actual registration logic is implemented by `register_forward_hooks`: + +```python Example +def register_forward_hooks(model: nn.Module, hook_specs: List[dict[str, Any]]) -> None: + """ + hook_specs is a list of dicts from server_args.forward_hooks. 
+ Attaches forward hooks to the matching modules. + """ + name_to_module = dict(model.named_modules()) + + for spec in hook_specs: + spec_name = spec.get("name", "") + target_patterns = spec.get("target_modules", []) + if not target_patterns: + logger.warning( + f"Hook spec '{spec_name}' has no 'target_modules', skipping" + ) + continue + + hook_factory_path = spec.get("hook_factory") + if not hook_factory_path: + logger.warning( + f"Hook spec '{spec_name}' has no 'hook_factory', skipping" + ) + continue + + config = spec.get("config") or {} + hook_factory = resolve_callable(hook_factory_path) + + hook = hook_factory(config) if hook_factory else None + if hook is None: + logger.warning( + f"Hook factory '{hook_factory_path}' for spec '{spec_name}' " + "returned None, not registering any hook" + ) + continue + + # Resolve patterns like "model.layers.*.mlp" + matched = [] + for name, module in name_to_module.items(): + if any(fnmatch.fnmatch(name, pattern) for pattern in target_patterns): + matched.append((name, module)) + + if not matched: + logger.warning( + f"No modules matched hook spec '{spec_name}' " + f"patterns={target_patterns}" + ) + continue + + for module_name, module in matched: + if hook: + _ = module.register_forward_hook(hook) + logger.info( + f"Registered forward hook '{spec_name}' " + f"on {module_name}" + ) +``` + +Key points: + +* Hooks are **forward hooks only** (via `module.register_forward_hook`). +* They are attached once at initialization. +* Hook handles are currently not stored on `ModelRunner` (they cannot be removed later via this API). +* Failure to match any modules is non-fatal; a warning is logged instead. +* If a hook factory returns `None`, a warning is logged and that spec is skipped. 
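The pattern matching used by `register_forward_hooks` can be tried in isolation before writing a hook spec. The following is a minimal sketch, assuming a hypothetical list of module names shaped like the keys of `dict(model.named_modules())`; `MODULE_NAMES` and `match_targets` are illustrative helpers, not part of SGLang.

```python
import fnmatch

# Hypothetical module names, shaped like the keys of dict(model.named_modules()).
MODULE_NAMES = [
    "model.layers.0.mlp",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.self_attn",
    "model.layers.1.mlp",
    "model.layers.1.experts.3.mlp",
    "outer.0",
    "outer.1",
    "outer.inner.0",
]


def match_targets(names, patterns):
    """Return the names matching any pattern, mirroring the matching loop above."""
    return [
        name
        for name in names
        if any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    for patterns in (["outer.0"], ["outer.*"], ["model.layers.*.mlp"]):
        print(patterns, "->", match_targets(MODULE_NAMES, patterns))
```

Note that `*` in `fnmatch` also matches dots, so `model.layers.*.mlp` selects `model.layers.1.experts.3.mlp` as well as the per-layer `mlp` modules, while `model.layers.0.mlp.gate_proj` is excluded because it does not end in `.mlp`.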
+ +*** +### Writing a hook factory + +A hook factory is a regular Python function: + +* Takes a `config: dict` (from JSON) +* Returns a forward hook function with signature `(module, inputs, output)` + +Example: + +```python Example +HOOK_CALLS = [] + +def dummy_hook_factory(config): + """Factory that returns a forward hook capturing a tag from config.""" + tag = config.get("tag", "default") + + def hook(module, inputs, output): + HOOK_CALLS.append( + { + "module_type": type(module).__name__, + "tag": tag, + "shape": tuple(output.shape), + } + ) + return output # must return output if you don’t want to modify the tensor + + return hook +``` + +In JSON: + +```jsonc Example +{ + "forward_hooks": [ + { + "name": "capture_outer", + "target_modules": ["outer.0", "outer.1"], + "hook_factory": "my_project.hooks:dummy_hook_factory", + "config": { + "tag": "outer" + } + } + ] +} +``` + +This will: + +* Resolve `my_project.hooks:dummy_hook_factory` to a Python callable. +* Call it with `config = {"tag": "outer"}`. +* Use the returned hook for all modules matching `outer.0` and `outer.1`. +* Append metadata about each call to `HOOK_CALLS`. + +*** +### Summary + +* Define `forward_hooks` as a list of specs in `ServerArgs` to turn on the feature. + +* Each spec: + + * selects modules via `target_modules` (glob patterns over `model.named_modules()`), + * points to a hook factory via `hook_factory`, + * passes arbitrary `config` into that factory. + +* Hook factories are resolved via `resolve_callable`, which supports `module:factory` and `module.submodule.factory`. + +* Hooks are standard PyTorch forward hooks, attached once at startup and invoked on every forward pass. + +* Misconfiguration is either: + + * **fatal and explicit** (bad path / missing attribute), or + * **non-fatal with clear warnings** (no targets matched, or factory returned `None`). 
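A hook factory can also be exercised without launching a server, since the returned hook is a plain callable. The sketch below assumes a hypothetical `sampling_hook_factory` with an `every_n` config key, and uses `SimpleNamespace` objects as stand-ins for a module and an output tensor; none of these names come from SGLang.

```python
from types import SimpleNamespace

SAMPLED_SHAPES = []


def sampling_hook_factory(config):
    """Hypothetical factory: record the output shape on every `every_n`-th call."""
    every_n = int(config.get("every_n", 1))
    if every_n < 1:
        raise ValueError("every_n must be >= 1")
    calls = 0

    def hook(module, inputs, output):
        nonlocal calls
        calls += 1
        if calls % every_n == 0:
            SAMPLED_SHAPES.append((type(module).__name__, tuple(output.shape)))
        # PyTorch treats a None return value as "leave the output unchanged".
        return None

    return hook


if __name__ == "__main__":
    hook = sampling_hook_factory({"every_n": 2})
    fake_module = SimpleNamespace()  # stand-in for an nn.Module
    fake_output = SimpleNamespace(shape=(2, 4))  # stand-in for a tensor
    for _ in range(4):
        hook(fake_module, (), fake_output)
    print(SAMPLED_SHAPES)
```

Keeping per-hook state in a closure, as here, means each spec in `forward_hooks` gets its own counter even when several specs share one factory.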
diff --git a/docs_new/docs/advanced_features/hicache.mdx b/docs_new/docs/advanced_features/hicache.mdx new file mode 100644 index 000000000000..6c083ad23fc9 --- /dev/null +++ b/docs_new/docs/advanced_features/hicache.mdx @@ -0,0 +1,8 @@ +--- +title: "Hierarchical KV Caching (HiCache)" +metatags: + description: "SGLang HiCache: three-tier KV caching (GPU, CPU, storage) for long-context and multi-turn inference. Supports Mooncake, 3FS, NIXL backends." +--- +- [Hicache Best Practices](./hicache_best_practices) +- [Hicache Design](./hicache_design) +- [Hicache Storage Runtime Attach Detach](./hicache_storage_runtime_attach_detach) diff --git a/docs_new/docs/advanced_features/hicache.rst b/docs_new/docs/advanced_features/hicache.rst new file mode 100644 index 000000000000..e7d83211dc9a --- /dev/null +++ b/docs_new/docs/advanced_features/hicache.rst @@ -0,0 +1,9 @@ +Hierarchical KV Caching (HiCache) +================================= + +.. toctree:: + :maxdepth: 1 + + hicache_best_practices.md + hicache_design.md + hicache_storage_runtime_attach_detach.md diff --git a/docs_new/docs/advanced_features/hicache_best_practices.mdx b/docs_new/docs/advanced_features/hicache_best_practices.mdx new file mode 100644 index 000000000000..df91d0aacd5b --- /dev/null +++ b/docs_new/docs/advanced_features/hicache_best_practices.mdx @@ -0,0 +1,219 @@ +--- +title: "SGLang HiCache Best Practices" +metatags: + description: "HiCache configuration guide: memory layout, prefetch policies, PD disaggregation, HF3FS and Mooncake deployment, custom storage backends." +--- +## Why HiCache Matters + +SGLang HiCache extends the traditional RadixAttention with a three-tier hierarchical KV caching system that dramatically improves performance for long-context and multi-turn conversation scenarios. By intelligently managing KV caches across GPU memory, host memory, and external storage backends, HiCache addresses the fundamental capacity bottleneck that limits cache hit rates in conventional systems. 
+ +## Configuration Guidelines + +## Core HiCache Parameters + +```bash Command +# Essential HiCache flags +--page-size 64 # Page size for cache management +--enable-hierarchical-cache # Enable HiCache +--hicache-ratio 2 # Host memory ratio (2x GPU memory) +--hicache-size 100 # Host memory size in GBs, will override the above ratio +--hicache-io-backend kernel # The I/O backend of moving data between CPU and GPU +--hicache-write-policy write_through # Cache write policy from GPU to CPU +--hicache-storage-backend # Optional storage backend (e.g., hf3fs, mooncake, etc.) +``` + +Notes: + +- Besides configuring `--hicache-storage-backend` at startup, SGLang also supports **runtime attach/detach** of the HiCache storage backend (no restart required) via HTTP admin endpoints. See [Runtime Attach/Detach HiCache Storage Backend](./hicache_storage_runtime_attach_detach). + +## Key Configurations with Storage Backends Enabled + +### Memory Layout Optimization + +```bash Command +# Page-first: Optimized for I/O efficiency with zero-copy (recommended with kernel backend) +--hicache-mem-layout page_first +# Page-first-direct: Optimized for direct I/O operations (Compatible with fa3 and same zero-copy performance as page_first) +--hicache-mem-layout page_first_direct +# Layer-first +--hicache-mem-layout layer_first +``` +**Layout Compatibility:** +- `page_first`: Only compatible with `kernel` I/O backend, automatically switches to `layer_first` with `direct` backend +- `page_first_direct`: Specifically designed for `direct` I/O backend with optimized memory organization + +### Heterogeneous TP Support (GQA/MHA models) + +HiCache storage supports cross-cluster KV reuse when different deployments use different TP sizes (for example, `tp=4` and `tp=8`) and share the same storage backend namespace. 
+ +Use `tp_lcm_size` in `--hicache-storage-backend-extra-config`: + +```bash Command +# Example: heterogeneous TP = {4, 8}, so lcm = 8 +--hicache-storage-backend-extra-config '{"tp_lcm_size": 8}' +``` + +Guidelines: + +- Set `tp_lcm_size` to the least common multiple (LCM) of all TP sizes that will share the same HiCache storage. +- For MHA models with Mooncake and `page_head` layout, HiCache will split head shards based on `tp_lcm_size` to make keys reusable across heterogeneous TP deployments. +- If all clusters use the same TP size, this option is not needed. + +### Prefetch Policies + +```bash Command +# Best-effort: Terminate prefetch when needed +--hicache-storage-prefetch-policy best_effort +# Wait-complete: Ensure complete prefetch, higher cache reuse +--hicache-storage-prefetch-policy wait_complete +# Timeout: Balance between completion and best-effort +--hicache-storage-prefetch-policy timeout +``` + +### Integration with PD Disaggregation + +HiCache works seamlessly with PD Disaggregation. You can choose between two configurations: + +1. **Prefill-only HiCache**: Enable HiCache only on Prefill nodes, allowing KV cache sharing among Prefill instances +2. 
**Full HiCache with async offloading**: Enable HiCache on Prefill nodes and async KV cache offloading on Decode nodes, allowing Prefill nodes to reuse KV caches from Decode nodes in multi-turn dialogue scenarios. + +```bash Command +# Prefill node with HiCache enabled for cross-prefill sharing (ideal for SystemPrompt scenarios) +python3 -m sglang.launch_server \ + --model-path /xxx/DeepSeek-R1/ \ + --tp 8 \ + --host 0.0.0.0 \ + --port 10000 \ + --enable-metrics \ + --enable-cache-report \ + --mem-fraction-static 0.85 \ + --page-size 64 \ + --enable-hierarchical-cache \ + --hicache-ratio 2 \ + --hicache-size 0 \ + --hicache-mem-layout page_first_direct \ + --hicache-io-backend direct \ + --hicache-write-policy write_through \ + --hicache-storage-backend hf3fs \ + --hicache-storage-prefetch-policy wait_complete \ + --disaggregation-ib-device mlx5_0 \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend mooncake + +# Decode node with async offloading enabled for KV cache reuse by Prefill (ideal for multi-turn conversations) +# --disaggregation-decode-enable-offload-kvcache enables async KV cache offloading on the decode node +python3 -m sglang.launch_server \ + --model-path /xxx/DeepSeek-R1/ \ + --tp 8 \ + --host 0.0.0.0 \ + --port 10000 \ + --enable-metrics \ + --enable-cache-report \ + --page-size 64 \ + --hicache-ratio 2 \ + --hicache-size 0 \ + --hicache-mem-layout page_first_direct \ + --hicache-io-backend direct \ + --hicache-write-policy write_through \ + --hicache-storage-backend hf3fs \ + --hicache-storage-prefetch-policy wait_complete \ + --disaggregation-decode-enable-offload-kvcache \ + --disaggregation-ib-device mlx5_0 \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend mooncake +``` + + +### Deployment with HF3FS + +Here is an example of deploying DeepSeek-R1 with HiCache-HF3FS. For more details, see the [HF3FS Documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/hf3fs/docs/README.md).
+ +```bash Command +python3 -m sglang.launch_server \ + --model-path /xxx/DeepSeek-R1/ \ + --log-level info \ + --tp 8 \ + --host 0.0.0.0 \ + --port 10000 \ + --enable-metrics \ + --enable-cache-report \ + --page-size 64 \ + --mem-fraction-static 0.85 \ + --enable-hierarchical-cache \ + --hicache-ratio 2 \ + --hicache-size 0 \ + --hicache-mem-layout page_first_direct \ + --hicache-io-backend direct \ + --hicache-write-policy write_through \ + --hicache-storage-backend hf3fs \ + --hicache-storage-prefetch-policy wait_complete +``` + +### Deployment with Mooncake + +Here is an example of deploying Qwen3-235B-A22B-Instruct-2507 with Mooncake. For more details, see the [Mooncake Documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/mooncake_store/README.md). + +```bash Command +# Set Mooncake environment variables +export MOONCAKE_TE_META_DATA_SERVER="http://127.0.0.1:8080/metadata" +export MOONCAKE_GLOBAL_SEGMENT_SIZE=816043786240 +export MOONCAKE_PROTOCOL="rdma" +export MOONCAKE_DEVICE="$DEVICE_LIST" +export MOONCAKE_MASTER=127.0.0.1:50051 + +# Launch SGLang server with Mooncake backend +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --tp 8 \ + --page-size 64 \ + --enable-hierarchical-cache \ + --hicache-ratio 2 \ + --hicache-mem-layout page_first_direct \ + --hicache-io-backend direct \ + --hicache-storage-backend mooncake \ + --hicache-write-policy write_through \ + --hicache-storage-prefetch-policy timeout +``` + + +## Custom Storage Backend Integration + +To integrate a new storage backend: + +1. **Implement three core methods:** + - `get(key)`: Retrieve value by key + - `exists(key)`: Check key existence + - `set(key, value)`: Store key-value pair + +2.
**Register your backend:** Add your storage backend to the HiCache [BackendFactory](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/backend_factory.py#L188) + +The HiCache controller handles all scheduling and synchronization automatically. + +### Dynamic Backend Loading + +Alternatively, you can use dynamic loading to avoid hard-coding your backend in the repository: + +```bash Command +python3 -m sglang.launch_server \ + --model-path your-model \ + --enable-hierarchical-cache \ + --hicache-storage-backend dynamic \ + --hicache-storage-backend-extra-config '{"backend_name":"custom_backend_name", "module_path": "your_module_path", "class_name": "YourHiCacheClassName"}' +``` + +**Configuration Parameters:** +- `--hicache-storage-backend`: Set to `dynamic` +- `--hicache-storage-backend-extra-config`: JSON configuration with: + - `backend_name`: Custom backend identifier + - `module_path`: Python module path to your implementation + - `class_name`: Your HiCache implementation class name + - `interface_v1`: 0 (disable) or 1 (enable) to control usage of batch_get_v1 and batch_set_v1 methods + + +## Community and Support + +- **GitHub Issues**: Report bugs and feature requests +- **Slack Channel**: Join community discussions in #sgl-kv-cache-store +- **Documentation**: Refer to storage backend-specific guides + +*** +*This document will be continuously updated based on community feedback and new features. Contributions and suggestions are welcome!* diff --git a/docs_new/docs/advanced_features/hicache_design.mdx b/docs_new/docs/advanced_features/hicache_design.mdx new file mode 100644 index 000000000000..15ab841af50b --- /dev/null +++ b/docs_new/docs/advanced_features/hicache_design.mdx @@ -0,0 +1,164 @@ +--- +title: "HiCache System Design and Optimization" +metatags: + description: "HiCache architecture: HiRadixTree metadata, L1/L2/L3 workflow, prefetch strategies, write-back policies, zero-copy transfers, multi-rank sync." 
+--- +This document provides a comprehensive overview of SGLang HiCache, covering its system architecture, workflow and key components. It also details configuration parameters, optimization techniques, and integration with various L3 storage backends, serving as a complete reference for users and developers to understand and tune HiCache for efficient LLM inference. + +## Why and What is HiCache? + +In large language model inference, the prefill phase is often time-consuming: input sequences need to be first converted into Key-Value cache (KV cache) for subsequent decoding. When multiple requests share the same prefix, the KV cache for that prefix is identical. By caching and reusing these shared KV caches, redundant computation can be avoided. To address this, SGLang introduced RadixAttention, which leverages idle GPU memory to cache and reuse prefix KV caches, and **HiCache**, which extends this idea to host memory and distributed storage. + +Inspired by the classic three-level cache design of modern CPUs, HiCache organizes GPU memory as L1, host memory as L2, and distributed storage as L3. This hierarchy enables HiCache to fully exploit the "idle" storage space of GPUs and CPUs, while integrating distributed cache systems such as Mooncake, 3FS, NIXL, and AIBrix KVCache for global KV cache storage and scheduling. As a result, HiCache significantly expands KV cache capacity while maintaining strong read performance—especially in workloads such as multi-QA and long-context inference, where KV cache reuse is frequent. For detailed benchmark results, see [this blog](https://lmsys.org/blog/2025-09-10-sglang-hicache/). + + +## System Design + +### Overall Architecture + +In many modern CPU architectures, the small but fast L1 and L2 caches are private to each core, enabling rapid access to the hottest data, while the larger L3 cache is shared across all cores to significantly reduce redundancy within the cache. 
Similarly, in HiCache, the L1 and L2 KV caches are private to each inference instance, whereas the L3 KV cache is shared among all inference instances within the cluster. + +### HiRadixTree: Metadata Organization in HiCache + +For KV cache data organization, HiCache builds upon the RadixTree structure introduced in RadixAttention and proposes HiRadixTree. In RadixAttention, each node of the RadixTree corresponds to the KV cache of a consecutive span of tokens in GPU memory. A path from the root to a leaf node represents the prefix of a request, and shared prefixes across multiple requests can reuse the same nodes, thereby avoiding redundant storage. + +HiRadixTree extends this idea: each node corresponds to the KV cache of a span of consecutive tokens and records where that KV cache is stored—whether in local GPU memory, CPU memory, L3 storage, or multiple of these tiers. If stored locally, HiRadixTree maintains precise metadata, including the exact storage address. However, to reduce overhead, HiRadixTree does not store or continuously synchronize metadata for L3 KV cache. Instead, when accessing L3 data, it queries the backend in real time to retrieve the necessary metadata, such as whether the data exists and on which server and location it resides. + +### Overall Workflow + +The workflow of HiCache mainly involves three key operations: **local match**, **prefetch** and **write-back**. When the system receives a new request, it first searches the local L1 and L2 caches for matching KV caches. For parts not found locally, it attempts to prefetch from L3. After prefetching, all required KV caches are loaded into the GPU for computation. Once the prefill computation is complete, the system considers storing the newly generated data into L2 or L3. 
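
The three operations can be pictured with a small planning sketch. This is only an illustration, not HiCache's actual code: `RequestPlan` and `plan_request` are hypothetical names, the hit lengths are passed in rather than looked up in a HiRadixTree, and the prefetch is assumed to run to completion. The 256-token default threshold is the one given in the "Prefetch from L3" section.

```python
from dataclasses import dataclass


@dataclass
class RequestPlan:
    local_tokens: int     # prefix already present in L1/L2 (local match)
    prefetch_tokens: int  # further prefix to pull from L3 into L2
    compute_tokens: int   # tokens that still need prefill computation


def plan_request(num_tokens: int, local_hit: int, l3_hit: int,
                 prefetch_threshold: int = 256) -> RequestPlan:
    """Hypothetical planner: local match first, then an optional L3 prefetch."""
    local = min(local_hit, num_tokens)
    remaining = num_tokens - local
    # L3 can only extend the prefix that was not found locally.
    l3 = min(l3_hit, remaining)
    # Prefetch is triggered only when the L3 hit length exceeds the threshold.
    prefetch = l3 if l3 > prefetch_threshold else 0
    return RequestPlan(local, prefetch, remaining - prefetch)


if __name__ == "__main__":
    print(plan_request(num_tokens=1000, local_hit=300, l3_hit=500))
```

Whatever remains in `compute_tokens` is what the prefill actually has to compute; the newly generated KV data is then a candidate for write-back to L2 or L3.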
+ + + HiCache Workflow + + +### Local Match + +Local matching is the first step in HiCache's workflow, where incoming request tokens are matched against the HiRadixTree to locate cached KV data in local memory tiers (L1 GPU memory and L2 host memory). + +The matching algorithm traverses the HiRadixTree from the root node, following child nodes that match the token sequence prefix. At each node, the incoming token sequence is compared with the node’s stored token sequence. When `page_size > 1`, matching is performed at the page granularity to optimize memory access patterns. If a match terminates within a node’s stored sequence, the node is automatically split to create an exact boundary, improving the efficiency of future matches. + +The algorithm returns a continuous prefix of the request, with the first part residing in L1 and the latter part in L2. + +Since the process only requires traversing the local HiRadixTree and does not involve any actual data copying, local matching is extremely fast. + +### Prefetch from L3 + +Data prefetching is one of HiCache’s core optimization techniques, designed to proactively load KV caches from L3 storage into local L2 memory, thereby reducing access latency during subsequent operations. + +**Prefetch Trigger Conditions**: +After local matching, for the parts not found in L1 or L2, the system queries L3 to retrieve metadata for the next continuous matching KV caches. If the length of hit cache in L3 exceeds a threshold (default: 256 tokens, configurable), a prefetch operation is triggered. + +**Prefetch Strategies**: HiCache provides three different prefetch termination strategies to address different scenario needs: +- **best_effort**: Terminates immediately when GPU can execute prefill computation, with no waiting time, suitable for scenarios extremely sensitive to latency. +- **wait_complete**: Must wait for all prefetch operations to complete, suitable for scenarios requiring high cache hit rates. 
+- **timeout**: Terminates after specified time or when complete, balancing latency and cache hit rate needs. + +After prefetching stops, the data already fetched is used together with the local data for the prefill computation. + +For **timeout** strategy, HiCache introduces two configuration parameters to support fine-grained control over prefetch timeout conditions: + +* `prefetch_timeout_base`: the base timeout, representing overhead unrelated to the number of tokens (e.g., scheduling and synchronization). +* `prefetch_timeout_per_ki_token`: the incremental timeout per thousand tokens. + +The timeout is computed as: + +```python Example +timeout = prefetch_timeout_base + prefetch_timeout_per_ki_token * num_token_to_fetch / 1024 +``` + +### Data Write-back + +The write-back mechanism is responsible for moving frequently accessed KV caches from L1 to L2 and L3, enabling larger and longer-term storage as well as cache sharing across instances. + +**Configurable Write-back Policies**: HiCache supports three write-back strategies: + +* **write_through**: Every access is immediately written back to the next level. When bandwidth is sufficient, this strategy provides the strongest caching benefit. +* **write_through_selective**: Data is written back only after the access frequency exceeds a threshold. This strategy backs up only hot data, reducing I/O overhead. +* **write_back**: Data is written back to the next level only when it is evicted from the upper level. This strategy alleviates storage pressure and is suitable for scenarios where storage capacity is limited but memory utilization must be maximized. + +**Cross-instance Sharing**: When data is written back from L2 to L3, only data not already present in L3 is transferred. KV caches stored in L3 can then be shared across all SGLang instances in the cluster (depending on the L3 backend implementation), significantly improving cache hit rates within the same memory budget. 
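
The difference between the three write-back policies comes down to when a KV page is copied to the next tier. The sketch below is a simplified decision function under stated assumptions: `should_write_back`, `hit_count`, `being_evicted`, and the `hot_threshold` default are hypothetical names and values chosen for illustration, not HiCache's internal API.

```python
def should_write_back(policy: str, hit_count: int, being_evicted: bool,
                      hot_threshold: int = 2) -> bool:
    """Decide whether a page is copied to the next tier (illustrative only)."""
    if policy == "write_through":
        # Every access is written to the next level immediately.
        return True
    if policy == "write_through_selective":
        # Only data whose access frequency exceeds the threshold is backed up.
        return hit_count > hot_threshold
    if policy == "write_back":
        # Data moves down only when it is evicted from the upper level.
        return being_evicted
    raise ValueError(f"unknown write policy: {policy}")


if __name__ == "__main__":
    for policy in ("write_through", "write_through_selective", "write_back"):
        print(policy, should_write_back(policy, hit_count=1, being_evicted=False))
```

The policy strings match the values accepted by `--hicache-write-policy`; the threshold value used here is an assumption and may differ from the real default.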
+ +### Multi-Rank Synchronization + +During multi-GPU parallel computation, such as tensor parallelism (TP), HiCache must ensure consistent states across different ranks. Therefore, critical computation steps require the use of `all_reduce` for state synchronization. + +For example, during prefetching, `all_reduce(op=min)` is used to ensure that all ranks obtain the same number of L3 hits, preventing inconsistent judgments about whether the prefetch threshold has been reached. Similarly, after prefetching completes or terminates, `all_reduce(op=min)` is again required to guarantee consensus among ranks on the prefix length of the successfully retrieved KV cache. + +### Data Transfer Optimization + +**Zero-Copy Data Transfers**: Both prefetching and write-back involve substantial data movement. Minimizing the number of data copies can significantly improve system performance. HiCache supports passing memory addresses and sizes directly when transferring data from L2 memory to an L3 backend. + +**“Batch-Oriented” Data Organization**: The granularity of data reads and writes has a major impact on performance. To address this, HiCache L3 stores and transfers KV cache data at the granularity of **pages** and supports different data layouts beyond the existing `layer first` scheme, including `page first` and `page first direct`. Under the `page first` and `page first direct` layouts, all KV cache data belonging to the same page is placed in contiguous memory, allowing it to be passed as a single object to L3 using zero-copy transfers. + + + HiCache L2 MEM layout + + +However, because GPU KV computation is naturally performed layer by layer, the GPU inherently operates in a `layer first` layout. When transferring `page first` data from L2 to the GPU, data must be transferred at the granularity of one token per layer. 
The `page first direct` layout mitigates this issue by grouping together all tokens of a given layer within a page, allowing transfers from L2 to GPU to be aggregated at the page-layer level. + +**CPU-to-GPU Transfer Optimizations**: In HiCache, moving data from CPU memory to GPU is as performance-critical as prefetching data from L3 to L2. HiCache employs several optimizations for this process: + +* **Compute-Transfer Overlap**: During the prefill phase, when transferring data from CPU to GPU, HiCache overlaps layers by concurrently loading the KV cache of layer N+1 while computing layer N. This effectively hides data transfer latency. +* **GPU-assisted I/O Kernels**: On top of `cudaMemcpyAsync`, HiCache implements a set of GPU-assisted I/O kernels specifically optimized for KV cache transfers between CPU and GPU. Compared to the baseline approach, these kernels achieve up to 3x higher transfer speed. + +**Write-back Optimization for MLA**: For MHA (Multi-Head Attention) models under multi-TP, each rank holds `1/tp_size` of a token’s KV data. In contrast, for MLA (Multi-head Latent Attention) models, all ranks hold the complete and identical KV data for each token. HiCache includes a dedicated optimization for MLA: only one rank initiates the write-back operation, ensuring that data is not redundantly stored across ranks. + +### Integration with PD-Disaggregation Deployment Mode + +SGLang supports a PD (Prefill-Decode) disaggregation deployment mode through the Mooncake TransferEngine (for details, see [this doc](./pd_disaggregation)). In the PD-disaggregation deployment mode, HiCache can be enabled on both the prefill nodes and decode nodes to optimize prefill performance. If enabled on decode nodes, the decode output will also be written back to L3.
+ +### Unified Interfaces and Rich L3 Storage Backends + +HiCache encapsulates all read, write, and query operations on L3 backends within the `class HiCacheStorage(ABC)`, exposing a set of simple and consistent interfaces. This design supports a wide range of L3 storage backends and allows users to select the one that best fits their specific use cases. + +- **Mooncake**: Mooncake is a high-performance caching system for LLM inference that leverages RDMA and multi-NIC resources to enable zero-copy, ultra-fast data transfers. Try Mooncake [here](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/mem_cache/storage/mooncake_store). + +- **DeepSeek 3FS (HF3FS)**: HF3FS is a Kubernetes-native distributed storage solution with operator-based deployment. Try HF3FS [here](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/mem_cache/storage/hf3fs). + +- **NIXL**: NIXL provides a unified API for accessing various storage plugins, including but not limited to DeepSeek's 3FS, GPU Direct Storage (GDS) and Amazon S3-compatible object storage. Try NIXL [here](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/mem_cache/storage/nixl). + +- **AIBrix KVCache**: AIBrix KVCache is a production-ready KVCache Offloading Framework, which enables efficient memory tiering and low-overhead cross-engine reuse. Try AIBrix KVCache [here](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/mem_cache/storage/aibrix_kvcache). + +- **HiCacheFile**: A simple file-based storage backend for demonstration purposes. + +Specifically, **LMCache**, an efficient KV cache layer for enterprise-scale LLM inference, provides an alternative solution to HiCache. Try LMCache [here](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/mem_cache/storage/lmcache). + +## Related Parameters + +- **`--enable-hierarchical-cache`**: Enable hierarchical cache functionality. This is required to use HiCache. 
+ +- **`--hicache-ratio HICACHE_RATIO`**: The ratio of the size of host KV cache memory pool to the size of device pool. For example, a value of 2 means the host memory pool is twice as large as the device memory pool. The value of this parameter must be greater than 1, as the current implementation requires the host memory allocated for the KV cache to be larger than the device memory allocated for the KV cache. + +- **`--hicache-size HICACHE_SIZE`**: The size of host KV cache memory pool in gigabytes. This parameter overrides `hicache-ratio` if set. For example, `--hicache-size 30` allocates 30GB (1GB = 1e9 bytes) for the host memory pool **for each rank**. If there are 8 ranks, then the total memory size is 240GB. Just like `hicache-ratio`, the value of this parameter must be larger than the size of device memory allocated for KV cache. + +**Note**: `--hicache-ratio` and `--hicache-size` are two critical parameters. In general, a larger HiCache size leads to a higher cache hit rate, which improves prefill performance. However, the relationship between cache size and hit rate is not linear. Once most reusable KV data—especially hot tokens—are already cached, further increasing the size may yield only marginal performance gains. Users can set these parameters based on their workload characteristics and performance requirements. + +- **`--page-size PAGE_SIZE`**: The number of tokens per page. This parameter determines the granularity of KV cache storage and retrieval. Larger page sizes reduce metadata overhead and improve I/O efficiency for storage backends, but may lower the cache hit rate when only part of a page matches the stored KV cache. For workloads with long common prefixes, larger pages can improve performance, while workloads with more diverse prefixes may benefit from smaller pages. See [Data Transfer Optimization](#data-transfer-optimization) for how page granularity affects I/O performance. 
+ +- **`--hicache-storage-prefetch-policy {best_effort,wait_complete,timeout}`**: Controls when prefetching from storage should stop. See [Prefetch from L3](#prefetch-from-l3) for details. + - `best_effort`: Prefetch as much as possible without blocking + - `wait_complete`: Wait for prefetch to complete before proceeding + - `timeout`: Terminates after specified time or when complete (Recommended for production environments, as setting an appropriate timeout helps the system meet required SLOs) + +- **`--hicache-write-policy {write_back,write_through,write_through_selective}`**: Controls how data is written from faster to slower memory tiers. See [Data Write-back](#data-write-back) for details. + - `write_through`: Immediately writes data to all tiers (strongest caching benefits) + - `write_through_selective`: Uses hit-count tracking to back up only frequently accessed data + - `write_back`: Writes data back to slower tiers only when eviction is needed (reduces I/O load) + +- **`--hicache-io-backend {direct,kernel}`**: Choose the I/O backend for KV cache transfer between CPU and GPU. See [Data Transfer Optimization](#data-transfer-optimization) for details. + - `direct`: Standard CUDA memory copy operations + - `kernel`: GPU-assisted I/O kernels (recommended for better performance) + +- **`--hicache-mem-layout {layer_first,page_first,page_first_direct}`**: Memory layout for the host memory pool. See [Data Transfer Optimization](#data-transfer-optimization) for details. + - `layer_first`: Compatible with GPU computation kernels (default for GPU memory) + - `page_first`: Optimized for I/O efficiency + - `page_first_direct`: Groups all tokens of a given layer within a page, allowing transfers from L2 to GPU to be aggregated at the page-layer level + +- **`--hicache-storage-backend {file,mooncake,hf3fs,nixl,aibrix,dynamic}`**: Choose the storage backend for the L3 tier. Built-in backends: file, mooncake, hf3fs, nixl, aibrix. 
For dynamic backend, use --hicache-storage-backend-extra-config to specify: `backend_name` (custom name), `module_path` (Python module path), `class_name` (backend class name). See [Unified Interfaces and Rich L3 Storage Backends](#unified-interfaces-and-rich-l3-storage-backends) for available backends. + +- **`--enable-lmcache`**: Using LMCache as an alternative hierarchical cache solution. + +- **`--hicache-storage-backend-extra-config HICACHE_STORAGE_BACKEND_EXTRA_CONFIG`**: the extra config can be either + - a JSON string containing extra configuration for the storage backend, e.g., `--hicache-storage-backend-extra-config '{"prefetch_threshold":512, "prefetch_timeout_base": 0.5, "prefetch_timeout_per_ki_token": 0.25}' `, or + - a TOML or JSON or YAML file specifying the extra configuration for the storage backend (to differentiate from the JSON string input, prepend a `@` in front of the file name), e.g., `--hicache-storage-backend-extra-config "@config.toml"` where `config.toml` is the config file containing the complex configurations. This can be useful when the configuration consists of many or complex key-value pairs (for instance, it is preferred to use a config file for NIXL backend as its configurations can be complex). diff --git a/docs_new/docs/advanced_features/hicache_storage_runtime_attach_detach.mdx b/docs_new/docs/advanced_features/hicache_storage_runtime_attach_detach.mdx new file mode 100644 index 000000000000..b245bf520691 --- /dev/null +++ b/docs_new/docs/advanced_features/hicache_storage_runtime_attach_detach.mdx @@ -0,0 +1,133 @@ +--- +title: "Runtime Attach/Detach HiCache Storage Backend (No Restart)" +metatags: + description: "Dynamically attach/detach HiCache L3 storage backends at runtime via HTTP API. No restart required, idle-state safety checks." 
+--- +This document explains how to **dynamically attach/detach the HiCache L3 storage backend at runtime** (e.g., `mooncake` / `hf3fs` / `nixl` / `file` / `aibrix` / `eic`) while **SGLang is already running and serving traffic**, without restarting the process. + +For safety and consistency, the current implementation **strictly requires** these operations to happen only when the service is **idle**: + +- **No running requests** +- **No waiting/queued requests** + +If the idle condition is not met, the API will fail fast (HTTP 400) and **will not modify** the current service state. + +*** +## 1. Background and implementation overview + +### 1.1 Architecture / control path + +The control path is: + +1. **HTTP Server** (`python/sglang/srt/entrypoints/http_server.py`) + - Exposes `PUT /hicache/storage-backend`, `DELETE /hicache/storage-backend`, `GET /hicache/storage-backend` +2. **TokenizerManager** (`python/sglang/srt/managers/tokenizer_control_mixin.py`) + - Sends the request to the Scheduler via `FanOutCommunicator` +3. **Scheduler** (`python/sglang/srt/managers/scheduler.py`) + - Performs a **strict idle check** + - Calls `tree_cache.attach_storage_backend(...)` / `detach_storage_backend(...)` +4. **HiRadixCache** (`python/sglang/srt/mem_cache/hiradix_cache.py`) + - Parses `hicache_storage_backend_extra_config_json` (supports both backend config and prefetch knobs) + - Calls `cache_controller.attach_storage_backend(...)` / `detach_storage_backend(...)` +5. **HiCacheController** (`python/sglang/srt/managers/cache_controller.py`) + - Creates/destroys the storage backend instance (via `StorageBackendFactory`) + - Starts/stops backend background threads at runtime (prefetch/backup) + +*** +## 2. 
Idle-state requirement (strict) + +The Scheduler uses `is_fully_idle()`, which checks: + +- No running batches (including chunked prefill, overlap, pipeline-parallel, and disaggregation paths) +- No waiting requests in any queue (waiting, grammar, disagg bootstrap/prealloc/transfer/inflight) +- No DLLM staging requests + +If the condition is not met, attach/detach returns an error like: + +- `Reject attach: scheduler is not idle. #queue-req=... #running-req=...` + + +Before switching, drain upstream traffic and wait for the server to become idle, then call attach/detach. + + +### 2.1 DP (data parallel) semantics + +When `dp_size > 1`, the tokenizer dispatches the request to **all DP scheduler instances** and aggregates their responses: + +- The final `success` is **true only if all DP ranks return success** +- The final `message` concatenates messages from all DP ranks + +This is intended to prevent “silent partial success”, but it also means you may see: + +- Overall **failure** even though **some ranks already succeeded** + +Currently there is **no automatic partial rollback** across DP ranks (see TODO in code). Operationally: + +- Prefer to keep backend config identical across ranks +- If attach fails, immediately call detach (best-effort/idempotent), fix config, then retry attach + +*** +## 3. How to use (HTTP Admin API) + +The examples below assume your SGLang HTTP server is at `http://127.0.0.1:30000`.
+ +### 3.1 Query current storage backend status + +```bash Command +curl -s http://127.0.0.1:30000/hicache/storage-backend +``` + +Example response: + +```json Config +{ + "hicache_storage_backend": "mooncake", + "hicache_storage_backend_extra_config": "{\"master_server_address\":\"127.0.0.1:50051\", ...}" +} +``` + +### 3.2 Attach (enable) a storage backend +```bash Command +curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \ + -H 'Content-Type: application/json' \ + -d '{ + "hicache_storage_backend": "mooncake" + }' +``` + +```bash Command +curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \ + -H 'Content-Type: application/json' \ + -d '{ + "hicache_storage_backend": "mooncake", + "hicache_storage_backend_extra_config_json": "{\"master_server_address\":\"127.0.0.1:50051\",\"protocol\":\"tcp\",\"global_segment_size\":\"4gb\",\"prefetch_threshold\":256}", + "hicache_storage_prefetch_policy": "timeout" + }' +``` + +Notes: + +- `hicache_storage_backend_extra_config_json` can include both: + - **Backend configuration** (e.g., Mooncake master/metadata/protocol, etc.) + - **Prefetch configuration** (`prefetch_threshold`, `prefetch_timeout_base`, `prefetch_timeout_per_ki_token`, `hicache_storage_pass_prefix_keys`) + +### 3.3 Detach (disable) the storage backend + +```bash Command +curl -s -X DELETE http://127.0.0.1:30000/hicache/storage-backend +``` + +Notes: + +- Detach only makes SGLang **stop using** the L3 storage backend and stops prefetch/backup threads +- It **does not automatically delete** data stored in Mooncake/HF3FS (or other remote backends) + +*** +## 4. 
Behavior and caveats + +- **No restart required**: attach/detach switches in-process at runtime +- **Must be idle**: otherwise the request is rejected to avoid consistency issues +- **Host KV layout constraints still apply**: for example, Mooncake still requires layouts like `page_first/page_first_direct/page_head`; if the server's HiCache host-memory layout does not satisfy the backend requirements, attach will fail with an error +- **Observability**: + - After attach, `server_args.hicache_storage_backend*` is updated on both the tokenizer and scheduler sides + - If metrics are enabled, attach will create a storage metrics collector in `HiRadixCache` on demand diff --git a/docs_new/docs/advanced_features/hisparse_guide.mdx b/docs_new/docs/advanced_features/hisparse_guide.mdx new file mode 100644 index 000000000000..9ec2e082bd74 --- /dev/null +++ b/docs_new/docs/advanced_features/hisparse_guide.mdx @@ -0,0 +1,187 @@ +--- +title: "HiSparse: Hierarchical Sparse Attention" +metatags: + description: "Use HiSparse hierarchical sparse attention to reduce decode GPU KV memory with CPU pinned host storage and PD disaggregation." +--- + +HiSparse reduces per-request GPU memory consumption during the decode phase by maintaining only a small "hot" KV buffer on GPU while keeping complete KV data in CPU pinned memory. Combined with PD disaggregation, it enables significantly higher decode concurrency. + +> **Prerequisites**: HiSparse only works with models that use **DeepSeek Sparse Attention (DSA)** architectures (e.g., DeepSeek-V3.2, GLM-5). These models natively select a subset of tokens for attention, making it possible to keep only the top-k KV on GPU while storing the full KV in host memory — without accuracy loss. Additionally, HiSparse currently requires **PD disaggregation mode** and is enabled on the **decode instance** only. + +## Why HiSparse? 
+ +In long-context LLM inference, each decoding request holds a full-length KV cache on GPU, limiting the number of concurrent requests a decode instance can serve. HiSparse addresses this by: + +- **Reducing GPU memory per request**: Each request occupies only a fixed-size device buffer (e.g., 4KB tokens) instead of the full sequence length. +- **On-demand swap-in**: A CUDA kernel dynamically loads the top-k most relevant KV entries from host memory based on attention scores. +- **Transparent to prefill**: HiSparse is entirely a decode-side optimization; the prefill instance requires no changes. + +## Design Overview + +### Decode Workflow + +Each decode step follows this flow: + +1. **Forward decode** — generate the next token +2. **Top-k selection** — select the most relevant token positions via attention scores +3. **Swap-in** — the CUDA kernel loads top-k KV entries from host to device buffer: + - *Short sequences* (`seq_len ≤ device_buffer_size`): fast path, all KV already in buffer + - *Long sequences*: hit detection → LRU reordering → miss handling (host → device copy) +4. **Decode attention** — compute attention using the top-k device locations +5. **Eager backup** — asynchronously copy the previous token's KV from device to host + +### PD Disaggregation Integration (Direct-to-Host) + +In PD disaggregation mode, the prefill instance transfers KV cache directly into the decode instance's host pool via RDMA, bypassing the GPU entirely on the decode side. This eliminates the transient GPU memory spike during KV transfer and removes the staging DMA step. + +``` +Prefill GPU ──RDMA──▶ Decode Host Pool (CPU pinned memory) + │ + ▼ + alloc device buffer (4KB) + │ + ▼ + swap-in kernel (on-demand top-k) +``` + +## Server Arguments + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Type / Default | Description |
| --- | --- | --- |
| `--enable-hisparse` | flag; default: disabled | Enable HiSparse on the decode instance |
| `--hisparse-config` | JSON string | Configuration for HiSparse (see below) |
+ +### HiSparse Config Parameters + +Pass as a JSON string via `--hisparse-config`: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type / Default | Description |
| --- | --- | --- |
| `top_k` | int | Number of top-k entries |
| `device_buffer_size` | int | Number of token slots in the per-request GPU device buffer |
| `host_to_device_ratio` | int | Ratio of logical pool size to device pool size, determining host memory capacity |
+ +Example: `--hisparse-config='{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10}'` + +## Deployment + +HiSparse currently requires **PD disaggregation mode** and is enabled only on the **decode instance**. + +### Prefill Instance + +```bash Command +python3 -m sglang.launch_server \ + --model-path /path/to/model \ + --trust-remote-code \ + --port 8000 --host 0.0.0.0 \ + --context-length 81920 \ + --chunked-prefill-size 65536 \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --mem-fraction-static 0.85 \ + --disaggregation-mode prefill \ + --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \ + --nnodes 1 --node-rank 0 +``` + +### Decode Instance (with HiSparse) + +```bash Command +python3 -m sglang.launch_server \ + --model-path /path/to/model \ + --trust-remote-code \ + --port 8000 --host 0.0.0.0 \ + --context-length 81920 \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --mem-fraction-static 0.85 \ + --kv-cache-dtype bfloat16 \ + --nsa-decode-backend flashmla_sparse \ + --disaggregation-mode decode \ + --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \ + --dist-init-addr 127.0.0.1:5757 \ + --nnodes 1 --node-rank 0 \ + --enable-hisparse \ + --hisparse-config='{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10}' +``` + +### Benchmark + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \ + --dataset-name random \ + --random-input 40000 \ + --random-output 20000 \ + --num-prompts 200 \ + --max-concurrency 200 \ + --request-rate 40 \ + --random-range-ratio 1.0 \ + --host 127.0.0.1 \ + --port 20000 \ + --model /path/to/model \ + --flush-cache +``` + +### Key Notes + +- The prefill instance does not need `--enable-hisparse`; it is unaware of HiSparse.
+- On the decode instance, the following flags are **required** for HiSparse: + - `--kv-cache-dtype bfloat16` — currently only bfloat16 KV cache is supported (more dtypes planned). + - `--nsa-decode-backend flashmla_sparse` — currently only `flashmla_sparse` backend is supported. + - `--enable-hisparse` — enables HiSparse. + - `--hisparse-config` — HiSparse configuration (top_k, device_buffer_size, host_to_device_ratio). + - `host_to_device_ratio` should be configured based on the host machine's available memory. For example: + - **~1 TB** host memory → `host_to_device_ratio: 5` + - **~2 TB** host memory → `host_to_device_ratio: 10` + +## Acknowledgments + +We would like to thank the SGLang team and community for the implementation and generous support, especially Zhiqiang Xie, Zhangheng Huang, Tingwei Huang, Shangming Cai, Teng Ma, and many others. We also thank the Alibaba Cloud TairKVCache team and the AntGroup SCT Inference team for their valuable contributions. diff --git a/docs_new/docs/advanced_features/hyperparameter_tuning.mdx b/docs_new/docs/advanced_features/hyperparameter_tuning.mdx new file mode 100644 index 000000000000..6a52d5a365d5 --- /dev/null +++ b/docs_new/docs/advanced_features/hyperparameter_tuning.mdx @@ -0,0 +1,82 @@ +--- +title: "Hyperparameter Tuning" +metatags: + description: "SGLang performance tuning: batch size, token usage, mem-fraction-static, chunked-prefill-size, CUDA graph, DP/TP optimization." +--- +## Achieving high throughput for offline batch inference + +Achieving a large batch size is the most important thing for attaining high throughput in offline batch inference. +When the server is running at full load in a steady state, look for the following in the log: + +```text Output +Decode batch. 
#running-req: 233, #token: 370959, token usage: 0.82, cuda graph: True, gen throughput (token/s): 4594.01, #queue-req: 317 +``` + +### Adjust the request submission speed to control `#queue-req` + +`#queue-req` indicates the number of requests in the queue. +If you frequently see `#queue-req: 0`, it suggests that your client code is submitting requests too slowly. +A healthy range for `#queue-req` is `100 - 2000`. +However, avoid making `#queue-req` too large, as this will increase the scheduling overhead on the server. + +### Achieve a high `token usage` + +`token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization. + +If you frequently see `token usage < 0.9` and `#queue-req > 0`, it means the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.3. +The case of a server being too conservative can happen when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS or stop strings. + +On the other hand, if you see `token usage` very high and you frequently see warnings like +`KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3. +If you see `KV cache pool is full. Retract requests.` occasionally but not frequently (~1 time per minute), it is okay. 
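
The rules of thumb above can be turned into a small log checker. This is a rough sketch, not part of SGLang: `advise` and the returned strings are hypothetical, the regular expression assumes the log format shown in the sample line, and the thresholds (queue range 100 to 2000, token usage 0.9) are the ones stated in this section. Since the guidance is about what you see frequently, a real script would aggregate over many log lines rather than act on a single one.

```python
import re

# Extracts the three fields used by the heuristics from a "Decode batch." line.
LOG_PATTERN = re.compile(
    r"#running-req: (?P<running>\d+).*?"
    r"token usage: (?P<usage>[\d.]+).*?"
    r"#queue-req: (?P<queue>\d+)"
)


def advise(log_line: str) -> str:
    """Map one decode-batch log line to a tuning hint (illustrative only)."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        raise ValueError("not a decode batch log line")
    usage = float(match.group("usage"))
    queue = int(match.group("queue"))
    if queue == 0:
        return "submit requests faster"
    if queue > 2000:
        return "submit requests more slowly"
    if usage < 0.9:
        return "lower --schedule-conservativeness"
    return "looks healthy"


if __name__ == "__main__":
    sample = (
        "Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, "
        "cuda graph: True, gen throughput (token/s): 4594.01, #queue-req: 317"
    )
    print(advise(sample))
```

Retraction warnings (`KV cache pool is full. Retract requests.`) are not covered by this sketch; counting how often they appear per minute would be the natural extension for deciding whether to raise `--schedule-conservativeness`.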
+ +### Tune `--mem-fraction-static` to increase KV cache pool capacity +SGLang allocates memory as follows: + +Total memory usage = model weights + KV cache pool + CUDA graph buffers + activations + +The `--mem-fraction-static` parameter determines how much memory is allocated to the first two components: + +mem_fraction_static = (model weights + KV cache pool) / GPU memory capacity + +To support higher concurrency, you should maximize the KV cache pool capacity by setting `--mem-fraction-static` as high as possible while still reserving enough memory for activations and CUDA graph buffers. + +SGLang uses simple heuristics to set the default value of `--mem-fraction-static`, but you can optimize it for your use cases. +As a rule of thumb, reserving 5–8 GB of memory for activations is typically sufficient. You can check this by inspecting the logs just before the server is ready. +Look for log entries like this: + +```text Output +[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=13.50 GB +``` + +Check the `available_gpu_mem` value. +- If it is between 5–8 GB, the setting is good. +- If it is too high (e.g., 10 - 20 GB), increase `--mem-fraction-static` to allocate more memory to the KV cache. +- If it is too low, you risk out-of-memory (OOM) errors later, so decrease `--mem-fraction-static`. + +Another straightforward approach is to increase `--mem-fraction-static` in increments of 0.01 until you encounter OOM errors for your workloads. + +### Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests` + +If you encounter out-of-memory (OOM) errors, you can adjust the following parameters: + +- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts. 
+- If OOM occurs during decoding, try lowering `--max-running-requests`. +- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput. + +### Tune `--cuda-graph-max-bs` +By default, CUDA graph is enabled only for small batch sizes (e.g., less than 160 or 256). +However, for some models, especially at large tensor parallelism sizes, CUDA graph can be useful for batch sizes up to 512 or 768. +Therefore, it may be beneficial to increase `--cuda-graph-max-bs` to a larger value. +Note that CUDA graph consumes more memory, so you may need to reduce `--mem-fraction-static` at the same time. + +### Tune `--dp-size` and `--tp-size` + +Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput. For data parallelism, prefer the [SGLang Model Gateway (formerly Router)](../advanced_features/sgl_model_gateway) over the `dp_size` parameter. + +### Try other options + +- `torch.compile` accelerates small models on small batch sizes. You can enable it with `--enable-torch-compile`. +- Try other quantization methods (e.g., FP8 quantization with `--quantization fp8`). +- Try other parallelism strategies (e.g., [expert parallelism](https://lmsys.org/blog/2025-05-05-large-scale-ep/)) or DP attention for DeepSeek models (with `--enable-dp-attention --dp-size 8`). +- If the workload has many shared prefixes, try `--schedule-policy lpm`. Here, `lpm` stands for longest prefix match. It reorders requests to encourage more cache hits but introduces more scheduling overhead.
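The `available_gpu_mem` check from the `--mem-fraction-static` section can be sketched as a small script. This is an illustration only: `advise_mem_fraction` is a hypothetical helper, the 5 GB and 8 GB bounds come from the rule of thumb in that section, and treating anything above 8 GB as "too high" is an assumption (the guide only gives 10 - 20 GB as an example).

```python
import re


def advise_mem_fraction(log_line: str, low_gb: float = 5.0, high_gb: float = 8.0) -> str:
    """Suggest a direction for --mem-fraction-static from a startup log line.

    Hypothetical helper: the bounds follow the 5-8 GB rule of thumb, and
    values above `high_gb` are assumed to mean memory is being left unused.
    """
    match = re.search(r"available_gpu_mem=([0-9.]+) GB", log_line)
    if match is None:
        return "available_gpu_mem not found"
    available = float(match.group(1))
    if available > high_gb:
        return "increase --mem-fraction-static"
    if available < low_gb:
        return "decrease --mem-fraction-static"
    return "keep --mem-fraction-static"


sample = (
    "[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, "
    "max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, "
    "available_gpu_mem=13.50 GB"
)
print(advise_mem_fraction(sample))
```

The right bounds depend on the model and workload, so treat the result as a starting point and confirm it by watching for OOM errors under real traffic.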
diff --git a/docs_new/docs/advanced_features/lora.ipynb b/docs_new/docs/advanced_features/lora.ipynb new file mode 100644 index 000000000000..8e6e6d0a02af --- /dev/null +++ b/docs_new/docs/advanced_features/lora.ipynb @@ -0,0 +1,714 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LoRA Serving" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Arguments for LoRA Serving" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following server arguments are relevant for multi-LoRA serving:\n", + "\n", + "* `enable_lora`: Enable LoRA support for the model. This argument is automatically set to True if `--lora-paths` is provided for backward compatibility.\n", + "\n", + "* `enable_lora_overlap_loading`: Enable asynchronous LoRA weight loading in order to overlap H2D transfers with GPU compute. This should be enabled if you find that your LoRA workloads are bottlenecked by adapter weight loading, for example when frequently loading large LoRA adapters.\n", + "\n", + "* `lora_paths`: The list of LoRA adapters to load. Each adapter must be specified in one of the following formats: | = | JSON with schema {\"lora_name\":str,\"lora_path\":str,\"pinned\":bool}.\n", + "\n", + "* `max_loras_per_batch`: Maximum number of adaptors used by each batch. This argument can affect the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. 
Defaults to 8.\n", + "\n", + "* `max_loaded_loras`: If specified, it limits the maximum number of LoRA adapters loaded in CPU memory at a time. The value must be greater than or equal to `max_loras_per_batch`.\n", + "\n", + "* `lora_eviction_policy`: LoRA adapter eviction policy when the GPU memory pool is full. `lru`: Least Recently Used (default, better cache efficiency). `fifo`: First-In-First-Out.\n", + "\n", + "* `lora_backend`: The backend used to run GEMM kernels for LoRA modules. Currently, the Triton LoRA backend (`triton`) and the Chunked SGMV backend (`csgmv`) are supported. Faster backends built on CUTLASS or CUDA kernels are planned.\n", + "\n", + "* `max_lora_rank`: The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of larger LoRA rank after server startup.\n", + "\n", + "* `lora_target_modules`: The union set of all target modules where LoRA should be applied (e.g., `q_proj`, `k_proj`, `gate_proj`). If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of different target modules after server startup. You can also set it to `all` to enable LoRA for all supported modules. However, enabling LoRA on additional modules introduces a minor performance overhead. If your application is performance-sensitive, we recommend only specifying the modules for which you plan to load adapters.\n", + "\n", + "* `max_lora_chunk_size`: Maximum chunk size for the ChunkedSGMV LoRA backend. Only used when `--lora-backend` is `csgmv`. Choosing a larger value might improve performance. Please tune this value based on your hardware and workload as needed. Defaults to 16.\n", + "\n", + "* `tp_size`: LoRA serving along with Tensor Parallelism is supported by SGLang.
`tp_size` controls the number of GPUs for tensor parallelism. More details on the tensor sharding strategy can be found in [S-Lora](https://arxiv.org/pdf/2311.03285) paper.\n", + "\n", + "From client side, the user needs to provide a list of strings as input batch, and a list of adaptor names that each input sequence corresponds to." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Usage\n", + "\n", + "### Serving Single Adaptor" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note:** SGLang supports LoRA adapters through two APIs:\n", + "\n", + "1. **OpenAI-Compatible API** (`/v1/chat/completions`, `/v1/completions`): Use the `model:adapter-name` syntax. See [OpenAI API with LoRA](../basic_usage/openai_api_completions.ipynb#Using-LoRA-Adapters) for examples.\n", + "\n", + "2. **Native API** (`/generate`): Pass `lora_path` in the request body (shown below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import requests\n", + "\n", + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import wait_for_server, terminate_process" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "server_process, port = launch_server_cmd(\n", + " # Here we set max-loras-per-batch to 2: one slot for adaptor and another one for base model\n", + " \"\"\"\n", + "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", + " --enable-lora \\\n", + " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n", + " --max-loras-per-batch 2 \\\n", + " --log-level warning \\\n", + "\"\"\"\n", + ")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = 
f\"http://127.0.0.1:{port}\"\n", + "json_data = {\n", + " \"text\": [\n", + " \"List 3 countries and their capitals.\",\n", + " \"List 3 countries and their capitals.\",\n", + " ],\n", + " \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n", + " # The first input uses lora0, and the second input uses the base model\n", + " \"lora_path\": [\"lora0\", None],\n", + "}\n", + "response = requests.post(\n", + " url + \"/generate\",\n", + " json=json_data,\n", + ")\n", + "print(f\"Output 0: {response.json()[0]['text']}\")\n", + "print(f\"Output 1: {response.json()[1]['text']}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Serving Multiple Adaptors" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "server_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", + " --enable-lora \\\n", + " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n", + " lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json \\\n", + " --max-loras-per-batch 2 \\\n", + " --log-level warning \\\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://127.0.0.1:{port}\"\n", + "json_data = {\n", + " \"text\": [\n", + " \"List 3 countries and their capitals.\",\n", + " \"List 3 countries and their capitals.\",\n", + " ],\n", + " \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n", + " # The first input uses lora0, and the second input uses lora1\n", + " \"lora_path\": [\"lora0\", \"lora1\"],\n", + "}\n", + "response = 
requests.post(\n", + " url + \"/generate\",\n", + " json=json_data,\n", + ")\n", + "print(f\"Output 0: {response.json()[0]['text']}\")\n", + "print(f\"Output 1: {response.json()[1]['text']}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dynamic LoRA loading" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Instead of specifying all adapters during server startup via `--lora-paths`, you can also load and unload LoRA adapters dynamically via the `/load_lora_adapter` and `/unload_lora_adapter` APIs.\n", + "\n", + "When using dynamic LoRA loading, it's recommended to explicitly specify both `--max-lora-rank` and `--lora-target-modules` at startup. For backward compatibility, SGLang will infer these values from `--lora-paths` if they are not explicitly provided. However, in that case, you must ensure that all dynamically loaded adapters either share the same shape (rank and target modules) as those in the initial `--lora-paths` or are strictly \"smaller\"."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lora0 = \"Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json\" # rank - 4, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj\n", + "lora1 = \"algoprog/fact-generation-llama-3.1-8b-instruct-lora\" # rank - 64, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj\n", + "lora0_new = \"philschmid/code-llama-3-1-8b-text-to-sql-lora\" # rank - 256, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj\n", + "\n", + "\n", + "# The `--lora-target-modules` param below is technically not needed, as the server will infer it from lora0 which already has all the target modules specified.\n", + "# We are adding it here just to demonstrate usage.\n", + "server_process, port = launch_server_cmd(\"\"\"\n", + " python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", + " --enable-lora \\\n", + " --cuda-graph-max-bs 2 \\\n", + " --max-loras-per-batch 2 \\\n", + " --max-lora-rank 256 \\\n", + " --lora-target-modules all \\\n", + " --log-level warning\n", + " \"\"\")\n", + "\n", + "url = f\"http://127.0.0.1:{port}\"\n", + "wait_for_server(url, process=server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load adapter lora0:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.post(\n", + " url + \"/load_lora_adapter\",\n", + " json={\n", + " \"lora_name\": \"lora0\",\n", + " \"lora_path\": lora0,\n", + " },\n", + ")\n", + "\n", + "if response.status_code == 200:\n", + " print(\"LoRA adapter loaded successfully.\", response.json())\n", + "else:\n", + " print(\"Failed to load LoRA adapter.\", response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load adapter lora1:" + ] + }, + { + "cell_type": "code", +
"execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.post(\n", + " url + \"/load_lora_adapter\",\n", + " json={\n", + " \"lora_name\": \"lora1\",\n", + " \"lora_path\": lora1,\n", + " },\n", + ")\n", + "\n", + "if response.status_code == 200:\n", + " print(\"LoRA adapter loaded successfully.\", response.json())\n", + "else:\n", + " print(\"Failed to load LoRA adapter.\", response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check inference output:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://127.0.0.1:{port}\"\n", + "json_data = {\n", + " \"text\": [\n", + " \"List 3 countries and their capitals.\",\n", + " \"List 3 countries and their capitals.\",\n", + " ],\n", + " \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n", + " # The first input uses lora0, and the second input uses lora1\n", + " \"lora_path\": [\"lora0\", \"lora1\"],\n", + "}\n", + "response = requests.post(\n", + " url + \"/generate\",\n", + " json=json_data,\n", + ")\n", + "print(f\"Output from lora0: \\n{response.json()[0]['text']}\\n\")\n", + "print(f\"Output from lora1 (updated): \\n{response.json()[1]['text']}\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Unload lora0 and replace it with a different adapter:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.post(\n", + " url + \"/unload_lora_adapter\",\n", + " json={\n", + " \"lora_name\": \"lora0\",\n", + " },\n", + ")\n", + "\n", + "response = requests.post(\n", + " url + \"/load_lora_adapter\",\n", + " json={\n", + " \"lora_name\": \"lora0\",\n", + " \"lora_path\": lora0_new,\n", + " },\n", + ")\n", + "\n", + "if response.status_code == 200:\n", + " print(\"LoRA adapter loaded successfully.\", response.json())\n", + "else:\n", + " 
print(\"Failed to load LoRA adapter.\", response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check output again:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://127.0.0.1:{port}\"\n", + "json_data = {\n", + " \"text\": [\n", + " \"List 3 countries and their capitals.\",\n", + " \"List 3 countries and their capitals.\",\n", + " ],\n", + " \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n", + " # The first input uses lora0, and the second input uses lora1\n", + " \"lora_path\": [\"lora0\", \"lora1\"],\n", + "}\n", + "response = requests.post(\n", + " url + \"/generate\",\n", + " json=json_data,\n", + ")\n", + "print(f\"Output from lora0: \\n{response.json()[0]['text']}\\n\")\n", + "print(f\"Output from lora1 (updated): \\n{response.json()[1]['text']}\\n\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### OpenAI-compatible API usage\n", + "\n", + "You can use LoRA adapters via the OpenAI-compatible APIs by specifying the adapter in the `model` field using the `base-model:adapter-name` syntax (for example, `qwen/qwen2.5-0.5b-instruct:adapter_a`). For more details and examples, see the “Using LoRA Adapters” section in the OpenAI API documentation: [openai_api_completions.ipynb](../basic_usage/openai_api_completions.ipynb).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### LoRA GPU Pinning" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another advanced option is to specify adapters as `pinned` during loading. 
When an adapter is pinned, it is permanently assigned to one of the available GPU pool slots (as configured by `--max-loras-per-batch`) and will not be evicted from GPU memory during runtime. Instead, it remains resident until it is explicitly unloaded.\n", + "\n", + "This can improve performance in scenarios where the same adapter is frequently used across requests, by avoiding repeated memory transfers and reinitialization overhead. However, since GPU pool slots are limited, pinning adapters reduces the flexibility of the system to dynamically load other adapters on demand. If too many adapters are pinned, it may lead to degraded performance, or in the most extreme case (`Number of pinned adapters == max-loras-per-batch`), halt all unpinned requests. Therefore, SGLang currently limits the maximum number of pinned adapters to `max-loras-per-batch - 1` to prevent unexpected starvation.\n", + "\n", + "In the example below, we start a server with `lora0` loaded as pinned, and `lora1` and `lora2` loaded as regular (unpinned) adapters. Note that we intentionally specify `lora1` and `lora2` in two different formats to demonstrate that both are supported."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "server_process, port = launch_server_cmd(\"\"\"\n", + " python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", + " --enable-lora \\\n", + " --cuda-graph-max-bs 8 \\\n", + " --max-loras-per-batch 3 \\\n", + " --max-lora-rank 256 \\\n", + " --lora-target-modules all \\\n", + " --lora-paths \\\n", + " {\"lora_name\":\"lora0\",\"lora_path\":\"Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json\",\"pinned\":true} \\\n", + " {\"lora_name\":\"lora1\",\"lora_path\":\"algoprog/fact-generation-llama-3.1-8b-instruct-lora\"} \\\n", + " lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora\n", + " --log-level warning\n", + " \"\"\")\n", + "\n", + "\n", + "url = f\"http://127.0.0.1:{port}\"\n", + "wait_for_server(url, process=server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also specify adapter as pinned during dynamic adapter loading. 
In the example below, we reload `lora1` as a pinned adapter:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.post(\n", + " url + \"/unload_lora_adapter\",\n", + " json={\n", + " \"lora_name\": \"lora1\",\n", + " },\n", + ")\n", + "\n", + "response = requests.post(\n", + " url + \"/load_lora_adapter\",\n", + " json={\n", + " \"lora_name\": \"lora1\",\n", + " \"lora_path\": \"algoprog/fact-generation-llama-3.1-8b-instruct-lora\",\n", + " \"pinned\": True, # Pin the adapter to GPU\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Verify that the results are as expected:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://127.0.0.1:{port}\"\n", + "json_data = {\n", + " \"text\": [\n", + " \"List 3 countries and their capitals.\",\n", + " \"List 3 countries and their capitals.\",\n", + " \"List 3 countries and their capitals.\",\n", + " ],\n", + " \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n", + " # The three inputs use lora0, lora1, and lora2, respectively\n", + " \"lora_path\": [\"lora0\", \"lora1\", \"lora2\"],\n", + "}\n", + "response = requests.post(\n", + " url + \"/generate\",\n", + " json=json_data,\n", + ")\n", + "print(f\"Output from lora0 (pinned): \\n{response.json()[0]['text']}\\n\")\n", + "print(f\"Output from lora1 (pinned): \\n{response.json()[1]['text']}\\n\")\n", + "print(f\"Output from lora2 (not pinned): \\n{response.json()[2]['text']}\\n\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Choosing LoRA Backend\n", + "\n", + "SGLang supports two LoRA backends that you can choose from using the `--lora-backend` argument:\n", + "\n", + "- 
`triton`: Basic Triton-based backend.\n", + "- `csgmv`: Default chunked SGMV backend optimized for high-concurrency scenarios.\n", + "\n", + "The `csgmv` backend was recently introduced to improve performance, especially in high-concurrency scenarios. Our benchmark shows that it achieves 20% to 80% latency improvements over the basic Triton backend." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "server_process, port = launch_server_cmd(\"\"\"\n", + " python3 -m sglang.launch_server \\\n", + " --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", + " --enable-lora \\\n", + " --lora-backend csgmv \\\n", + " --max-loras-per-batch 16 \\\n", + " --lora-paths lora1=path/to/lora1 lora2=path/to/lora2\n", + " \"\"\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## LoRA Overlap Loading" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By using the `--enable-lora-overlap-loading` server argument, the SGLang engine is able to overlap the loading of LoRA weights with prefill and decode compute, essentially hiding the data movement for LoRA weights behind GPU computation. Our benchmarks show that under adversarial conditions, enabling this feature can result in a ~35% reduction in median TTFT (see the [LoRA overlap loading PR](https://github.com/sgl-project/sglang/pull/15512) for detailed benchmarks)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lora0 = \"Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json\"\n", + "lora1 = \"algoprog/fact-generation-llama-3.1-8b-instruct-lora\"\n", + "lora2 = \"philschmid/code-llama-3-1-8b-text-to-sql-lora\"\n", + "\n", + "\n", + "server_process, port = launch_server_cmd(\"\"\"\n", + " python3 -m sglang.launch_server \\\n", + " --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", + " --enable-lora \\\n", + " --enable-lora-overlap-loading \\\n", + " --lora-paths lora0=Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json \\\n", + " lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n", + " lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora \\\n", + " --max-lora-rank 256 \\\n", + " --max-loras-per-batch 2 \\\n", + " --max-loaded-loras 4\n", + " \"\"\")\n", + "\n", + "url = f\"http://127.0.0.1:{port}\"\n", + "wait_for_server(url, process=server_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "json_data = {\n", + " \"text\": [\n", + " \"Write a very long fairy-tale.\",\n", + " \"List 3 countries and their capitals.\",\n", + " \"List 3 countries and their capitals.\",\n", + " ],\n", + " \"sampling_params\": [\n", + " {\"max_new_tokens\": 1024, \"temperature\": 0},\n", + " {\"max_new_tokens\": 64, \"temperature\": 0},\n", + " {\"max_new_tokens\": 64, \"temperature\": 0},\n", + " ],\n", + " \"lora_path\": [\"lora0\", \"lora1\", \"lora2\"],\n", + "}\n", + "\n", + "# lora0 and lora1 will be loaded into the memory pool first, and because max_loras_per_batch = 2, lora2's request will remain in the queue.\n", + "# lora1's request will likely finish first, and once it does, lora2 will be loaded. 
With --enable-lora-overlap-loading, this loading will\n", + "# occur asynchronously and thus decoding for lora0's request won't be blocked.\n", + "response = requests.post(\n", + " url + \"/generate\",\n", + " json=json_data,\n", + ")\n", + "\n", + "for i in range(3):\n", + " print(f\"Output from lora{i}: \\n{response.json()[i]['text']}\\n\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Limitations of LoRA Overlap Loading" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, LoRA overlap loading is not free and comes with two important caveats:\n", + "\n", + "1. **Pinned CPU memory requirement**:\n", + " Asynchronous H2D memory copies require LoRA weights to be pinned in CPU memory, which is a finite system resource. To mitigate excessive pinned-memory usage, SGLang currently restricts `max_loaded_loras` to be at most 2× `max_loras_per_batch` when LoRA overlap loading is enabled.\n", + "\n", + "2. **Reduced multi-adapter prefill batching**:\n", + " With overlap loading, adapters become available on the GPU at different times because each adapter is loaded asynchronously. This can reduce the scheduler’s ability to form multi-adapter prefill batches, since only requests whose adapters are currently loaded can be grouped together. As a result, requests for different adapters will be scheduled in separate (or smaller) prefill batches, which can increase TTFT when adapter load time is small compared to prefill compute time. 
This is why LoRA overlap loading is disabled by default: it should only be enabled when users have determined that LoRA weight loading is a bottleneck (e.g., high adapter churn, heavy adapter weights, or PCIe-bottlenecked workloads).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Example When Overlap Loading Results in Higher Latency" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For instance, suppose we have four LoRA adapters: `lora0`, `lora1`, `lora2`, and `lora3`. Loading any adapter takes 2 ms, while the prefill step for requests for that adapter takes 20 ms.\n", + "\n", + "1. **Baseline**:\n", + " The engine loads all four adapters synchronously, then runs one combined prefill batch, giving us a total time of ≈ `2 * 4 + 20 = 28ms`.\n", + "\n", + "2. **With LoRA overlap loading enabled**:\n", + " The engine begins loading `lora0` and, once it is ready, schedules a prefill batch containing only `lora0` while `lora1` loads in the background. Then it schedules `lora1`’s prefill while `lora2` loads, and so on. In the worst case, where prefill cannot be batched across adapters, the total time is ≈ `2 + 4 * 20 = 82ms`.\n", + "\n", + "In this scenario, overlap loading reduces adapter-load overhead, but the loss of multi-adapter prefill batching dominates and leads to higher TTFT." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Future Works\n", + "\n", + "The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Other features, including Embedding Layer, Unified Paging, and a CUTLASS backend, are still under development."
+ ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs_new/docs/advanced_features/lora.mdx b/docs_new/docs/advanced_features/lora.mdx new file mode 100644 index 000000000000..3ed6b4430df7 --- /dev/null +++ b/docs_new/docs/advanced_features/lora.mdx @@ -0,0 +1,509 @@ +--- +title: "LoRA Serving" +metatags: + description: "SGLang multi-LoRA serving: S-LoRA and Punica techniques, dynamic adapter loading, GPU pinning, overlap loading, Triton and CSGMV backends." +--- +SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs. + + +## Arguments for LoRA Serving + + +The following server arguments are relevant for multi-LoRA serving: + +* `enable_lora`: Enable LoRA support for the model. This argument is automatically set to True if `--lora-paths` is provided for backward compatibility. + +* `enable_lora_overlap_loading`: Enable asynchronous LoRA weight loading in order to overlap H2D transfers with GPU compute. This should be enabled if you find that your LoRA workloads are bottlenecked by adapter weight loading, for example when frequently loading large LoRA adapters. + +* `lora_paths`: The list of LoRA adapters to load. Each adapter must be specified in one of the following formats: <PATH> | <NAME>=<PATH> | JSON with schema {"lora_name":str,"lora_path":str,"pinned":bool}. + +* `max_loras_per_batch`: Maximum number of adaptors used by each batch. 
This argument can affect the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. Defaults to 8. + +* `max_loaded_loras`: If specified, it limits the maximum number of LoRA adapters loaded in CPU memory at a time. The value must be greater than or equal to `max_loras_per_batch`. + +* `lora_eviction_policy`: LoRA adapter eviction policy when the GPU memory pool is full. `lru`: Least Recently Used (default, better cache efficiency). `fifo`: First-In-First-Out. + +* `lora_backend`: The backend used to run GEMM kernels for LoRA modules. Currently, the Triton LoRA backend (`triton`) and the Chunked SGMV backend (`csgmv`) are supported. Faster backends built on CUTLASS or CUDA kernels are planned. + +* `max_lora_rank`: The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of larger LoRA rank after server startup. + +* `lora_target_modules`: The union set of all target modules where LoRA should be applied (e.g., `q_proj`, `k_proj`, `gate_proj`). If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of different target modules after server startup. You can also set it to `all` to enable LoRA for all supported modules. However, enabling LoRA on additional modules introduces a minor performance overhead. If your application is performance-sensitive, we recommend only specifying the modules for which you plan to load adapters. + +* `max_lora_chunk_size`: Maximum chunk size for the ChunkedSGMV LoRA backend. Only used when `--lora-backend` is `csgmv`. Choosing a larger value might improve performance. Please tune this value based on your hardware and workload as needed. Defaults to 16.
+ +* `lora_drain_wait_threshold`: When any LoRA adapter request waits longer than this threshold (in seconds), the scheduler will selectively drain one running adapter to make room. This mitigates extreme tail latency under high or skewed workloads by preventing a small set of adapters from monopolizing batch slots. Set to 0 to disable draining (default). + +* `tp_size`: LoRA serving along with Tensor Parallelism is supported by SGLang. `tp_size` controls the number of GPUs for tensor parallelism. More details on the tensor sharding strategy can be found in the [S-LoRA](https://arxiv.org/pdf/2311.03285) paper. + +On the client side, the user provides a list of strings as the input batch, along with a list of adapter names, one for each input sequence. + + +## Usage + +### Serving Single Adaptor + + +**Note:** SGLang supports LoRA adapters through two APIs: + +1. **OpenAI-Compatible API** (`/v1/chat/completions`, `/v1/completions`): Use the `model:adapter-name` syntax. See [OpenAI API with LoRA](../basic_usage/openai_api_completions#using-lora-adapters) for examples. + +2. **Native API** (`/generate`): Pass `lora_path` in the request body (shown below).
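For the OpenAI-compatible route, the `model:adapter-name` syntax amounts to joining the base model and adapter name with a colon in the `model` field. The sketch below only builds the request body and sends nothing; `build_chat_payload` is a hypothetical helper, and the `messages` layout assumes the standard chat-completions request format.

```python
import json


def build_chat_payload(base_model: str, adapter_name: str, prompt: str) -> dict:
    """Build a /v1/chat/completions body that selects a LoRA adapter.

    Hypothetical helper for illustration; only the `model:adapter-name`
    convention is taken from the note above.
    """
    return {
        "model": f"{base_model}:{adapter_name}",
        "messages": [{"role": "user", "content": prompt}],
    }


payload = build_chat_payload(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "lora0",
    "List 3 countries and their capitals.",
)
print(json.dumps(payload, indent=2))
```

The adapter name must match a name registered with the server, for example through `--lora-paths lora0=...` at startup.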
+ + + ```python Example +import json +import requests + +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, terminate_process +``` + + +```python Example +server_process, port = launch_server_cmd( + # Here we set max-loras-per-batch to 2: one slot for the adapter and another one for the base model + """ +python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --enable-lora \ + --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \ + --max-loras-per-batch 2 \ + --log-level warning \ +""" +) + +wait_for_server(f"http://localhost:{port}") +``` + + +```python Example +url = f"http://127.0.0.1:{port}" +json_data = { + "text": [ + "List 3 countries and their capitals.", + "List 3 countries and their capitals.", + ], + "sampling_params": {"max_new_tokens": 32, "temperature": 0}, + # The first input uses lora0, and the second input uses the base model + "lora_path": ["lora0", None], +} +response = requests.post( + url + "/generate", + json=json_data, +) +print(f"Output 0: {response.json()[0]['text']}") +print(f"Output 1: {response.json()[1]['text']}") +``` + + +```python Example +terminate_process(server_process) +``` + +### Serving Multiple Adapters + + + +```python Example +server_process, port = launch_server_cmd( + """ +python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --enable-lora \ + --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \ + lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json \ + --max-loras-per-batch 2 \ + --log-level warning \ +""" +) + +wait_for_server(f"http://localhost:{port}") +``` + + +```python Example +url = f"http://127.0.0.1:{port}" +json_data = { + "text": [ + "List 3 countries and their capitals.", + "List 3 countries and their capitals.", + ], + "sampling_params": {"max_new_tokens": 32, "temperature": 0}, + # The first input uses lora0, and the second input uses lora1 + 
"lora_path": ["lora0", "lora1"], +} +response = requests.post( + url + "/generate", + json=json_data, +) +print(f"Output 0: {response.json()[0]['text']}") +print(f"Output 1: {response.json()[1]['text']}") +``` + + +```python Example +terminate_process(server_process) +``` + +### Dynamic LoRA loading + + +Instead of specifying all adapters at server startup via `--lora-paths`, you can also load and unload LoRA adapters dynamically via the `/load_lora_adapter` and `/unload_lora_adapter` APIs. + +When using dynamic LoRA loading, it's recommended to explicitly specify both `--max-lora-rank` and `--lora-target-modules` at startup. For backward compatibility, SGLang will infer these values from `--lora-paths` if they are not explicitly provided. However, in that case, you must ensure that all dynamically loaded adapters share the same shape (rank and target modules) as those in the initial `--lora-paths` or are strictly "smaller". + + + +```python Example +lora0 = "Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json" # rank - 4, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj +lora1 = "algoprog/fact-generation-llama-3.1-8b-instruct-lora" # rank - 64, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj +lora0_new = "philschmid/code-llama-3-1-8b-text-to-sql-lora" # rank - 256, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj + + +# No adapters are passed via `--lora-paths` here, so `--max-lora-rank` and `--lora-target-modules` are specified explicitly +# to cover every adapter that is loaded later. 
+server_process, port = launch_server_cmd( + """ + python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --enable-lora \ + --cuda-graph-max-bs 2 \ + --max-loras-per-batch 2 \ + --max-lora-rank 256 \ + --lora-target-modules all \ + --log-level warning + """ +) + +url = f"http://127.0.0.1:{port}" +wait_for_server(url) +``` + +Load adapter lora0: + + + +```python Example +response = requests.post( + url + "/load_lora_adapter", + json={ + "lora_name": "lora0", + "lora_path": lora0, + }, +) + +if response.status_code == 200: + print("LoRA adapter loaded successfully.", response.json()) +else: + print("Failed to load LoRA adapter.", response.json()) +``` + +Load adapter lora1: + + + +```python Example +response = requests.post( + url + "/load_lora_adapter", + json={ + "lora_name": "lora1", + "lora_path": lora1, + }, +) + +if response.status_code == 200: + print("LoRA adapter loaded successfully.", response.json()) +else: + print("Failed to load LoRA adapter.", response.json()) +``` + +Check inference output: + + + +```python Example +url = f"http://127.0.0.1:{port}" +json_data = { + "text": [ + "List 3 countries and their capitals.", + "List 3 countries and their capitals.", + ], + "sampling_params": {"max_new_tokens": 32, "temperature": 0}, + # The first input uses lora0, and the second input uses lora1 + "lora_path": ["lora0", "lora1"], +} +response = requests.post( + url + "/generate", + json=json_data, +) +print(f"Output from lora0: \n{response.json()[0]['text']}\n") +print(f"Output from lora1: \n{response.json()[1]['text']}\n") +``` + +Unload lora0 and replace it with a different adapter: + + + +```python Example +response = requests.post( + url + "/unload_lora_adapter", + json={ + "lora_name": "lora0", + }, +) + +response = requests.post( + url + "/load_lora_adapter", + json={ + "lora_name": "lora0", + "lora_path": lora0_new, + }, +) + +if response.status_code == 200: + print("LoRA adapter loaded successfully.", 
response.json()) +else: + print("Failed to load LoRA adapter.", response.json()) +``` + +Check output again: + + + +```python Example +url = f"http://127.0.0.1:{port}" +json_data = { + "text": [ + "List 3 countries and their capitals.", + "List 3 countries and their capitals.", + ], + "sampling_params": {"max_new_tokens": 32, "temperature": 0}, + # The first input uses the replaced lora0, and the second input uses lora1 + "lora_path": ["lora0", "lora1"], +} +response = requests.post( + url + "/generate", + json=json_data, +) +print(f"Output from lora0 (updated): \n{response.json()[0]['text']}\n") +print(f"Output from lora1: \n{response.json()[1]['text']}\n") +``` + + +```python Example +terminate_process(server_process) +``` + +### OpenAI-compatible API usage + +You can use LoRA adapters via the OpenAI-compatible APIs by specifying the adapter in the `model` field using the `base-model:adapter-name` syntax (for example, `qwen/qwen2.5-0.5b-instruct:adapter_a`). For more details and examples, see the “Using LoRA Adapters” section in the OpenAI API documentation: [openai_api_completions](../basic_usage/openai_api_completions). + + + +### LoRA GPU Pinning + + +Another advanced option is to specify adapters as `pinned` during loading. When an adapter is pinned, it is permanently assigned to one of the available GPU pool slots (as configured by `--max-loras-per-batch`) and will not be evicted from GPU memory during runtime. Instead, it remains resident until it is explicitly unloaded. + +This can improve performance in scenarios where the same adapter is frequently used across requests, by avoiding repeated memory transfers and reinitialization overhead. However, since GPU pool slots are limited, pinning adapters reduces the flexibility of the system to dynamically load other adapters on demand. If too many adapters are pinned, it may lead to degraded performance, or in the most extreme case (`Number of pinned adapters == max-loras-per-batch`), halt all unpinned requests. 
Therefore, SGLang currently limits the number of pinned adapters to at most `max-loras-per-batch - 1` to prevent unexpected starvation. + +In the example below, we start a server with `lora0` loaded as pinned, and `lora1` and `lora2` loaded as regular (unpinned) adapters. Note that we intentionally specify `lora1` and `lora2` in two different formats to demonstrate that both are supported. + + + +```python Example +server_process, port = launch_server_cmd( + """ + python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --enable-lora \ + --cuda-graph-max-bs 8 \ + --max-loras-per-batch 3 \ + --max-lora-rank 256 \ + --lora-target-modules all \ + --lora-paths \ + {"lora_name":"lora0","lora_path":"Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json","pinned":true} \ + {"lora_name":"lora1","lora_path":"algoprog/fact-generation-llama-3.1-8b-instruct-lora"} \ + lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora \ + --log-level warning + """ +) + + +url = f"http://127.0.0.1:{port}" +wait_for_server(url) +``` + +You can also specify an adapter as pinned during dynamic adapter loading. 
In the example below, we reload `lora1` as a pinned adapter: + + + +```python Example +response = requests.post( + url + "/unload_lora_adapter", + json={ + "lora_name": "lora1", + }, +) + +response = requests.post( + url + "/load_lora_adapter", + json={ + "lora_name": "lora1", + "lora_path": "algoprog/fact-generation-llama-3.1-8b-instruct-lora", + "pinned": True, # Pin the adapter to GPU + }, +) +``` + +Verify that the results are as expected: + + + +```python Example +url = f"http://127.0.0.1:{port}" +json_data = { + "text": [ + "List 3 countries and their capitals.", + "List 3 countries and their capitals.", + "List 3 countries and their capitals.", + ], + "sampling_params": {"max_new_tokens": 32, "temperature": 0}, + # The three inputs use lora0, lora1, and lora2, respectively + "lora_path": ["lora0", "lora1", "lora2"], +} +response = requests.post( + url + "/generate", + json=json_data, +) +print(f"Output from lora0 (pinned): \n{response.json()[0]['text']}\n") +print(f"Output from lora1 (pinned): \n{response.json()[1]['text']}\n") +print(f"Output from lora2 (not pinned): \n{response.json()[2]['text']}\n") +``` + + +```python Example +terminate_process(server_process) +``` + +## Choosing LoRA Backend + +SGLang supports two LoRA backends that you can choose from using the `--lora-backend` argument: + +- `triton`: Basic Triton-based backend. +- `csgmv`: Default chunked SGMV backend optimized for high concurrency scenarios. + +The `csgmv` backend was recently introduced to improve performance, especially in high-concurrency scenarios. Our benchmarks show that it achieves 20% to 80% latency improvements over the basic Triton backend. 
+ + + ```python Example +server_process, port = launch_server_cmd( + """ + python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --enable-lora \ + --lora-backend csgmv \ + --max-loras-per-batch 16 \ + --lora-paths lora1=path/to/lora1 lora2=path/to/lora2 + """ +) +``` + + +```python Example +terminate_process(server_process) +``` + +## LoRA Overlap Loading + + +By using the `--enable-lora-overlap-loading` server argument, the SGLang engine is able to overlap the loading of LoRA weights with prefill and decode compute, essentially hiding the data movement for LoRA weights behind GPU computation. Our benchmarks show that under adversarial conditions, enabling this feature can result in a ~35% reduction in median TTFT (see the [LoRA overlap loading PR](https://github.com/sgl-project/sglang/pull/15512) for detailed benchmarks). + + + +```python Example +lora0 = "Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json" +lora1 = "algoprog/fact-generation-llama-3.1-8b-instruct-lora" +lora2 = "philschmid/code-llama-3-1-8b-text-to-sql-lora" + + +server_process, port = launch_server_cmd( + """ + python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --enable-lora \ + --enable-lora-overlap-loading \ + --lora-paths lora0=Nutanix/Meta-Llama-3.1-8B-Instruct_SFT_lora_4_alpha_16_humaneval_raw_json \ + lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora \ + lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora \ + --max-lora-rank 256 \ + --max-loras-per-batch 2 \ + --max-loaded-loras 4 + """ +) + +url = f"http://127.0.0.1:{port}" +wait_for_server(url) +``` + + +```python Example +json_data = { + "text": [ + "Write a very long fairy-tale.", + "List 3 countries and their capitals.", + "List 3 countries and their capitals.", + ], + "sampling_params": [ + {"max_new_tokens": 1024, "temperature": 0}, + {"max_new_tokens": 64, "temperature": 0}, + {"max_new_tokens": 64, "temperature": 0}, + ], 
+ "lora_path": ["lora0", "lora1", "lora2"], +} + +# lora0 and lora1 will be loaded into the memory pool first, and because max_loras_per_batch = 2, lora2's request will remain in the queue. +# lora1's request will likely finish first, and once it does, lora2 will be loaded. With --enable-lora-overlap-loading, this loading will +# occur asynchronously and thus decoding for lora0's request won't be blocked. +response = requests.post( + url + "/generate", + json=json_data, +) + +for i in range(3): + print(f"Output from lora{i}: \n{response.json()[i]['text']}\n") +``` + + +```python Example +terminate_process(server_process) +``` + +#### Limitations of LoRA Overlap Loading + + +LoRA overlap loading is not free and comes with two important caveats: + +1. **Pinned CPU memory requirement**: + Asynchronous H2D memory copies require LoRA weights to be pinned in CPU memory, which is a finite system resource. To mitigate excessive pinned-memory usage, SGLang currently restricts `max_loaded_loras` to be at most 2× `max_loras_per_batch` when LoRA overlap loading is enabled. + +2. **Reduced multi-adapter prefill batching**: + With overlap loading, adapters become available on the GPU at different times because each adapter is loaded asynchronously. This can reduce the scheduler’s ability to form multi-adapter prefill batches, since only requests whose adapters are currently loaded can be grouped together. As a result, requests for different adapters may be scheduled in separate (or smaller) prefill batches, which can increase TTFT when adapter load time is small compared to prefill compute time. This is why LoRA overlap loading is disabled by default: it should only be enabled when users have determined that LoRA weight loading is a bottleneck (e.g., high adapter churn, heavy adapter weights, or PCIe-bottlenecked workloads). 
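The sizing rules stated in this guide can be checked before launching a server. The following is a minimal sketch, not SGLang's actual validation code: `check_lora_limits` is a hypothetical helper that only encodes the two rules given in this document (`max_loaded_loras` must be at least `max_loras_per_batch`, and at most 2× `max_loras_per_batch` when overlap loading is enabled).

```python
from typing import Optional


def check_lora_limits(
    max_loras_per_batch: int,
    max_loaded_loras: Optional[int] = None,
    overlap_loading: bool = False,
) -> None:
    """Hypothetical pre-flight check; raises ValueError if a documented rule is violated."""
    if max_loaded_loras is None:
        # No CPU-side limit was requested, so there is nothing to validate.
        return
    if max_loaded_loras < max_loras_per_batch:
        raise ValueError(
            f"max_loaded_loras ({max_loaded_loras}) must be >= "
            f"max_loras_per_batch ({max_loras_per_batch})"
        )
    if overlap_loading and max_loaded_loras > 2 * max_loras_per_batch:
        raise ValueError(
            f"with overlap loading, max_loaded_loras ({max_loaded_loras}) must be <= "
            f"2 * max_loras_per_batch ({2 * max_loras_per_batch})"
        )


# Values from the overlap-loading launch command above: 2 batch slots, 4 loaded adapters.
check_lora_limits(max_loras_per_batch=2, max_loaded_loras=4, overlap_loading=True)
```

With these rules, the launch command above sits exactly at the upper bound: raising `--max-loaded-loras` to 5 while keeping `--max-loras-per-batch 2` would be rejected by this sketch.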
+ + + #### Example When Overlap Loading Results in Higher Latency + + +For instance, suppose we have four LoRA adapters: `lora0`, `lora1`, `lora2`, and `lora3`. Loading any adapter takes 2ms, while the prefill step for requests for that adapter takes 20ms. + +1. **Baseline**: + The engine loads all four adapters synchronously, then runs one combined prefill batch, giving us a total time of ≈ `2 * 4 + 20 = 28ms` + +2. **With LoRA overlap loading enabled**: + The engine begins loading `lora0` and, once it is ready, schedules a prefill batch containing only `lora0` while `lora1` loads in the background. Then it schedules `lora1`’s prefill while `lora2` loads, and so on. In the worst case where prefill cannot be batched across adapters, total time is ≈ `2 + 4 * 20 = 82ms` + +In this scenario, overlap loading reduces adapter-load overhead, but the loss of multi-adapter prefill batching dominates and leads to higher TTFT. + + +## Future Work + +The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Other features, including Embedding Layer, Unified Paging, and a CUTLASS backend, are still under development. diff --git a/docs_new/docs/advanced_features/object_storage.mdx b/docs_new/docs/advanced_features/object_storage.mdx new file mode 100644 index 000000000000..a6de5a206da2 --- /dev/null +++ b/docs_new/docs/advanced_features/object_storage.mdx @@ -0,0 +1,142 @@ +--- +title: "Loading Models from Object Storage" +metatags: + description: "Load SGLang models directly from S3, Google Cloud Storage, Azure Blob, and S3-compatible object storage with runai_streamer." +--- + +SGLang supports loading models directly from object storage (such as Amazon S3, Google Cloud Storage, and Azure Blob) without requiring a full local download. This feature uses the `runai_streamer` load format to stream model weights directly from cloud storage, significantly reducing startup time and local storage requirements. 
+ +## Overview + +When loading models from object storage, SGLang uses a two-phase approach: + +1. **Metadata Download** (once, before process launch): Configuration files and tokenizer files are downloaded to a local cache. +2. **Weight Streaming** (lazy, during model loading): Model weights are streamed directly from object storage as needed. + +## Supported Storage Backends + +1. **Amazon S3**: `s3://bucket-name/path/to/model/` +2. **Google Cloud Storage**: `gs://bucket-name/path/to/model/` +3. **Azure Blob**: `az://some-azure-container/path/` +4. **S3-compatible**: `s3://bucket-name/path/to/model/` + +## Quick Start + +### Basic Usage + +Simply provide an object storage URI as the model path: + +```bash +# S3 +python -m sglang.launch_server \ + --model-path s3://my-bucket/models/llama-3-8b/ \ + --load-format runai_streamer + +# Google Cloud Storage +python -m sglang.launch_server \ + --model-path gs://my-bucket/models/llama-3-8b/ \ + --load-format runai_streamer +``` + +**Note**: The `runai_streamer` load format is selected automatically when using object storage URIs, so you can omit `--load-format`: + +```bash +python -m sglang.launch_server \ + --model-path s3://my-bucket/models/llama-3-8b/ +``` + +### With Tensor Parallelism + +```bash +python -m sglang.launch_server \ + --model-path gs://my-bucket/models/llama-70b/ \ + --tp 4 \ + --model-loader-extra-config '{"distributed": true}' +``` + +## Configuration + +### Load Format + +The `runai_streamer` load format is specifically designed for object storage, SSDs, and shared file systems. 
+ +```bash +python -m sglang.launch_server \ + --model-path s3://bucket/model/ \ + --load-format runai_streamer +``` + +### Extended Configuration Parameters + +Use `--model-loader-extra-config` to pass additional configuration as a JSON string: + +```bash +python -m sglang.launch_server \ + --model-path s3://bucket/model/ \ + --model-loader-extra-config '{ + "distributed": true, + "concurrency": 8, + "memory_limit": 2147483648 + }' +``` + 
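Because `--model-loader-extra-config` takes a JSON string, it can be convenient to build it programmatically instead of writing the JSON by hand. The sketch below is illustrative only: it reuses the three parameter names from the command above, the bucket path is a placeholder, and `build_launch_command` is a hypothetical helper rather than part of SGLang.

```python
import json
import shlex


def build_launch_command(model_path: str, extra_config: dict) -> str:
    """Hypothetical helper: render a launch command with a JSON extra config."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--model-loader-extra-config", json.dumps(extra_config),
    ]
    # shlex.join quotes the JSON so the shell passes it as a single argument.
    return shlex.join(args)


extra_config = {
    "distributed": True,
    "concurrency": 8,
    "memory_limit": 2 * 1024**3,  # 2 GiB, i.e. 2147483648 bytes
}
command = build_launch_command("s3://bucket/model/", extra_config)
print(command)
```

Serializing with `json.dumps` avoids a common mistake when writing the config by hand: Python's `True` must appear as lowercase `true` in the JSON string.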
+#### Available Parameters + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `distributed` | bool | Enable distributed streaming for multi-GPU setups. Automatically set to true for object storage paths and CUDA-like devices. | Auto-detected |
| `concurrency` | int | Number of concurrent download streams. Higher values can improve throughput for large models. | 4 |
| `memory_limit` | int | Memory limit (in bytes) for the streaming buffer. | System-dependent |
+ +## Performance Considerations + +### Distributed Streaming + +For multi-GPU setups, enable distributed streaming to parallelize weight loading between the processes: + +```bash +python -m sglang.launch_server \ + --model-path s3://bucket/model/ \ + --tp 8 \ + --model-loader-extra-config '{"distributed": true}' +``` + +## Limitations + +- **Supported Formats**: Currently only the `.safetensors` weight format (the recommended format) is supported. +- **Supported Devices**: Distributed streaming is supported on CUDA-like devices. On other devices, loading falls back to non-distributed streaming. + +## See Also + +- [Runai model streamer documentation](https://github.com/run-ai/runai-model-streamer) diff --git a/docs_new/docs/advanced_features/observability.mdx b/docs_new/docs/advanced_features/observability.mdx new file mode 100644 index 000000000000..3b550f6add3f --- /dev/null +++ b/docs_new/docs/advanced_features/observability.mdx @@ -0,0 +1,38 @@ +--- +title: "Observability" +metatags: + description: "SGLang observability: Prometheus metrics, request logging, request dump and replay, crash dump debugging." +--- +## Production Metrics +SGLang exposes production metrics via Prometheus. You can enable them by adding `--enable-metrics` when launching the server. +You can query them with: +```bash Command +curl http://localhost:30000/metrics +``` + +See [Production Metrics](../references/production_metrics) and [Production Request Tracing](../references/production_request_trace) for more details. + +## Logging + +By default, SGLang does not log any request contents. You can log them by using `--log-requests`. +You can control the verbosity by using `--log-request-level`. +See [Logging](./server_arguments#logging) for more details. + +## Request Dump and Replay + +You can dump all requests and replay them later for benchmarking or other purposes. 
+ +To start dumping, use the following command to send a request to a server: +```bash Command +python3 -m sglang.srt.managers.configure_logging --url http://localhost:30000 --dump-requests-folder /tmp/sglang_request_dump --dump-requests-threshold 100 +``` +The server will dump the requests into a pickle file for every 100 requests. + +To replay the request dump, use `scripts/playground/replay_request_dump.py`. + +## Crash Dump and Replay +Sometimes the server might crash, and you may want to debug the cause of the crash. +SGLang supports crash dumping, which will dump all requests from the 5 minutes before the crash, allowing you to replay the requests and debug the reason later. + +To enable crash dumping, use `--crash-dump-folder /tmp/crash_dump`. +To replay the crash dump, use `scripts/playground/replay_request_dump.py`. diff --git a/docs_new/docs/advanced_features/overview.mdx b/docs_new/docs/advanced_features/overview.mdx new file mode 100644 index 000000000000..804f01bd4f6a --- /dev/null +++ b/docs_new/docs/advanced_features/overview.mdx @@ -0,0 +1,18 @@ +--- +title: Advanced Features +description: Advanced configuration, optimization, and deployment features for SGLang. 
+--- + +- [Server Arguments](./server_arguments) +- [Hyperparameter Tuning](./hyperparameter_tuning) +- [Attention Backend](./attention_backend) +- [Speculative Decoding](./speculative_decoding) +- [Structured Outputs](./structured_outputs) +- [Quantization](./quantization) +- [Expert Parallelism](./expert_parallelism) +- [LoRA](./lora) +- [PD Disaggregation](./pd_disaggregation) +- [Pipeline Parallelism](./pipeline_parallelism) +- [HiCache](./hicache_best_practices) +- [Observability](./observability) +- [And more…](./server_arguments) diff --git a/docs_new/docs/advanced_features/pd_disaggregation.mdx b/docs_new/docs/advanced_features/pd_disaggregation.mdx new file mode 100644 index 000000000000..86f4bf025483 --- /dev/null +++ b/docs_new/docs/advanced_features/pd_disaggregation.mdx @@ -0,0 +1,489 @@ +--- +title: "PD Disaggregation" +metatags: + description: "SGLang PD disaggregation: separate prefill and decode phases, Mooncake and NIXL transfer engines, multi-node DeepSeek deployment." +--- +## Why and What is PD Disaggregation? + +Large Language Model (LLM) inference comprises two distinct phases: **Prefill** and **Decode**. The Prefill phase is computation-intensive, processing the entire input sequence, while the Decode phase is memory-intensive, managing the Key-Value (KV) cache for token generation. Traditionally, these phases are handled within a unified engine, where combined scheduling of prefill and decode batches introduces inefficiencies. To address these challenges, we introduce **Prefill and Decoding (PD) Disaggregation** in SGLang. + +### Issues with Unified Scheduling + +The conventional unified engine, which processes prefill and decode batches together, results in two significant problems: + +1. **Prefill Interruption**: Incoming prefill batches frequently interrupt ongoing decode batches, causing substantial delays in token generation. +2. 
**DP Attention Imbalance**: In data-parallel (DP) attention, one DP worker may process a prefill batch while another handles a decode batch simultaneously, leading to increased decode latency. + +PD Disaggregation resolves these by separating the two stages, enabling tailored optimizations for each. + +For design details, refer to the [design document](https://docs.google.com/document/d/1rQXJwKd5b9b1aOzLh98mnyMhBMhlxXA5ATZTHoQrwvc/edit?tab=t.0). + +Currently, we support Mooncake and NIXL as transfer engines. + +## Profiling in PD Disaggregation Mode + +When you need to profile prefill or decode workers in PD disaggregation mode, please refer to the [Profile In PD Disaggregation Mode](../developer_guide/benchmark_and_profiling#profile-in-pd-disaggregation-mode) section in the Benchmark and Profiling guide. Due to torch profiler limitations, prefill and decode workers must be profiled separately using dedicated command-line options. + +## Router Integration + +For deploying PD disaggregation at scale with load balancing and fault tolerance, SGLang provides a router. The router can distribute requests between prefill and decode instances using various routing policies. For detailed information on setting up routing with PD disaggregation, including configuration options and deployment patterns, see the [SGLang Model Gateway (formerly Router)](./sgl_model_gateway#prefill-decode-disaggregation). 
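Once a router is running, clients talk to it instead of to the individual prefill or decode servers. The sketch below is a minimal example under stated assumptions: it assumes the router listens on `http://127.0.0.1:8000` (the port used by the `launch_router` commands in this guide) and that it forwards SGLang's native `/generate` endpoint. `build_generate_request` and `send_generate_request` are hypothetical helper names, not part of SGLang.

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:8000"  # assumed router address


def build_generate_request(prompt: str, max_new_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) a POST request for the native /generate API."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        ROUTER_URL + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_generate_request(prompt: str) -> dict:
    """Send the request; needs a running router plus prefill and decode servers."""
    with urllib.request.urlopen(build_generate_request(prompt)) as response:
        return json.loads(response.read())


request = build_generate_request("List 3 countries and their capitals.")
print(request.full_url)
```

Building the request separately from sending it keeps the payload easy to inspect; call `send_generate_request` only after the router and both server roles are up.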
+ + +## Mooncake +### Requirements + +```bash +uv pip install mooncake-transfer-engine +``` + +### Usage + +### Llama Single Node + +```bash +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode prefill \ + --port 30000 \ + --disaggregation-ib-device mlx5_roce0 +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode decode \ + --port 30001 \ + --base-gpu-id 1 \ + --disaggregation-ib-device mlx5_roce0 +python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000 +``` + +### DeepSeek Multi-Node + +```bash +# prefill 0 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-ib-device ${device_name} \ + --disaggregation-mode prefill \ + --host ${local_ip} \ + --port 30000 \ + --trust-remote-code \ + --dist-init-addr ${prefill_master_ip}:5000 \ + --nnodes 2 \ + --node-rank 0 \ + --tp-size 16 \ + --dp-size 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --mem-fraction-static 0.8 +# prefill 1 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-ib-device ${device_name} \ + --disaggregation-mode prefill \ + --host ${local_ip} \ + --port 30000 \ + --trust-remote-code \ + --dist-init-addr ${prefill_master_ip}:5000 \ + --nnodes 2 \ + --node-rank 1 \ + --tp-size 16 \ + --dp-size 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --mem-fraction-static 0.8 +# decode 0 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-ib-device ${device_name} \ + --disaggregation-mode decode \ + --host ${local_ip} \ + --port 30001 \ + --trust-remote-code \ + --dist-init-addr ${decode_master_ip}:5000 \ + --nnodes 2 \ + --node-rank 0 \ + --tp-size 16 \ + --dp-size 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --mem-fraction-static 0.8 
\ + --max-running-requests 128 +# decode 1 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-ib-device ${device_name} \ + --disaggregation-mode decode \ + --host ${local_ip} \ + --port 30001 \ + --trust-remote-code \ + --dist-init-addr ${decode_master_ip}:5000 \ + --nnodes 2 \ + --node-rank 1 \ + --tp-size 16 \ + --dp-size 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --mem-fraction-static 0.8 \ + --max-running-requests 128 +``` +### Advanced Configuration + +PD Disaggregation with Mooncake supports the following environment variables for fine-grained control over system behavior. + +#### NVLink Transport Configuration +To enable NVLink transport for KV cache transfers with the mooncake backend (recommended for NVL72 deployments), set the following environment variables. Note that auxiliary data transfer will still use TCP as a temporary workaround. + +```bash Command +export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=NVLINK +export MC_FORCE_MNNVL=True +``` + +The `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` environment variable enables the custom memory pool. Supported values are `NVLINK` (or `True`), `BAREX`, and `INTRA_NODE_NVLINK`. + +#### Prefill Server Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Variable | Description | Default |
| --- | --- | --- |
| **`SGLANG_DISAGGREGATION_THREAD_POOL_SIZE`** | Controls the total number of worker threads for KVCache transfer operations per TP rank | A dynamic value calculated as `int(0.75 * os.cpu_count()) // 8`, kept between 4 and 12 to ensure efficiency and prevent thread race conditions |
| **`SGLANG_DISAGGREGATION_QUEUE_SIZE`** | Sets the number of parallel transfer queues. KVCache transfer requests from multiple decode instances are sharded into these queues so that they can share the threads and the transfer bandwidth at the same time. If set to 1, requests are transferred one by one in first-come, first-served (FCFS) order | `4` |
| **`SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT`** | Timeout (seconds) for receiving destination KV indices during request initialization | `300` |
| **`SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL`** | Interval (seconds) between cleanups of bootstrap entries | `120` |
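As a rough illustration, the default given for `SGLANG_DISAGGREGATION_THREAD_POOL_SIZE` can be evaluated for a few CPU counts. This is a sketch under assumptions: `default_thread_pool_size` is a hypothetical name, and the bounds of 4 and 12 are treated as inclusive, which the description does not state explicitly.

```python
import os


def default_thread_pool_size(cpu_count: int) -> int:
    """Hypothetical helper mirroring the documented default, clamped to 4..12 (assumed inclusive)."""
    raw = int(0.75 * cpu_count) // 8
    return min(max(raw, 4), 12)


for cpus in (8, 64, 256):
    print(cpus, default_thread_pool_size(cpus))

# os.cpu_count() may return None on some platforms, so fall back to 1.
print("this machine:", default_thread_pool_size(os.cpu_count() or 1))
```

Under these assumptions, small hosts land on the lower bound and very large hosts on the upper bound, so setting the variable explicitly mainly matters when the default does not match the transfer load.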
+ +If a greater mean TTFT is acceptable, you can `export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600` (10 minutes) to relax the timeout condition. +Please be aware that this setting will cause prefill instances to take a longer time to clean up the affected memory resources when a running decode node loses connection. + +#### Decode Server Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Variable | Description | Default |
| --- | --- | --- |
| **`SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL`** | Interval (seconds) between health checks to prefill bootstrap servers | `5.0` |
| **`SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE`** | Consecutive heartbeat failures before marking prefill server offline | `2` |
| **`SGLANG_DISAGGREGATION_WAITING_TIMEOUT`** | Timeout (seconds) for receiving KV Cache after request initialization | `300` |
+ +If a greater mean TTFT is acceptable, you can `export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600` (10 minutes) to relax the timeout condition. + + +## Heterogeneous TP with GPU Staging Buffer + +When prefill and decode use different tensor parallelism (TP) sizes (e.g., prefill TP=4, decode DP attention with TP=1), the KV cache memory layout differs between the two sides. The **GPU staging buffer** solves this by gathering KV head slices into a contiguous buffer on the prefill side, performing bulk RDMA transfer, then scattering into the correct KV cache pages on the decode side. This provides **2–5x throughput improvement** over the default per-token slice approach at high concurrency and matches homogeneous TP baselines within ~5%. + +Enable the staging buffer when prefill and decode use **different TP sizes** with the **Mooncake** transfer backend. When both sides use the same TP size, staging is automatically bypassed even if enabled. + +> **Note:** The staging buffer is designed for non-MLA models (e.g. GQA, MHA). MLA models (e.g. DeepSeek-V2/V3) should not enable this flag. + +### Environment Variables + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Variable | Description | Default |
| --- | --- | --- |
| `SGLANG_DISAGG_STAGING_BUFFER` | Enable GPU staging buffer for heterogeneous TP KV transfer | `False` |
| `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB` | Prefill-side per-worker staging buffer size in MB | `64` |
| `SGLANG_DISAGG_STAGING_POOL_SIZE_MB` | Decode-side ring buffer pool total size in MB | `4096` |
+ +### Usage Example + +```bash Command +# Set staging buffer environment variables on BOTH prefill and decode +export SGLANG_DISAGG_STAGING_BUFFER=1 +export SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB=64 +export SGLANG_DISAGG_STAGING_POOL_SIZE_MB=4096 + +# Prefill with TP=4 +python -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --disaggregation-mode prefill \ + --port 30000 \ + --tp 4 \ + --trust-remote-code \ + --disaggregation-ib-device mlx5_1,mlx5_2 + +# Decode with TP=1 (or DP attention with effective attention TP=1) +python -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --disaggregation-mode decode \ + --port 30001 \ + --tp 4 \ + --dp 4 \ + --enable-dp-attention \ + --trust-remote-code \ + --disaggregation-ib-device mlx5_3,mlx5_4 + +# Router +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --prefill http://127.0.0.1:30000 \ + --decode http://127.0.0.1:30001 \ + --host 0.0.0.0 --port 8000 +``` + +## NIXL +### Requirements + +Install via pip. + +```bash +pip install nixl +``` + +Or build from source - may be required if you already have UCX installed. + +```bash +git clone https://github.com/ai-dynamo/nixl.git +cd nixl +pip install . 
--config-settings=setup-args="-Ducx_path=/path/to/ucx" +``` + + +### Usage + +### Llama Single Node + +```bash +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode prefill \ + --port 30000 \ + --disaggregation-transfer-backend nixl +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode decode \ + --port 30001 \ + --base-gpu-id 1 \ + --disaggregation-transfer-backend nixl +python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000 +``` + +### DeepSeek Multi-Node + +```bash +# prefill 0 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-transfer-backend nixl \ + --disaggregation-mode prefill \ + --host ${local_ip} \ + --port 30000 \ + --trust-remote-code \ + --dist-init-addr ${prefill_master_ip}:5000 \ + --nnodes 2 \ + --node-rank 0 \ + --tp-size 16 \ + --dp-size 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --mem-fraction-static 0.8 +# prefill 1 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-transfer-backend nixl \ + --disaggregation-mode prefill \ + --host ${local_ip} \ + --port 30000 \ + --trust-remote-code \ + --dist-init-addr ${prefill_master_ip}:5000 \ + --nnodes 2 \ + --node-rank 1 \ + --tp-size 16 \ + --dp-size 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --mem-fraction-static 0.8 +# decode 0 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-transfer-backend nixl \ + --disaggregation-mode decode \ + --host ${local_ip} \ + --port 30001 \ + --trust-remote-code \ + --dist-init-addr ${decode_master_ip}:5000 \ + --nnodes 2 \ + --node-rank 0 \ + --tp-size 16 \ + --dp-size 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --mem-fraction-static 0.8 \ + --max-running-requests 128 +# 
decode 1 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-transfer-backend nixl \ + --disaggregation-mode decode \ + --host ${local_ip} \ + --port 30001 \ + --trust-remote-code \ + --dist-init-addr ${decode_master_ip}:5000 \ + --nnodes 2 \ + --node-rank 1 \ + --tp-size 16 \ + --dp-size 8 \ + --enable-dp-attention \ + --moe-a2a-backend deepep \ + --mem-fraction-static 0.8 \ + --max-running-requests 128 +``` + +### Advanced Configuration + +#### NIXL Backend Selection + +By default, NIXL uses the **UCX** backend for KV cache transfers. You can select a different NIXL plugin backend depending on your infrastructure using the environment variable `SGLANG_DISAGGREGATION_NIXL_BACKEND`. + +Example: `export SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC` + +**Available backends:** UCX (default), LIBFABRIC, or any installed NIXL plugin. + +Example usage: +```bash +export SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend nixl \ + --port 30000 +``` + +## ASCEND + +### Usage + +Use ascend backend with [memfabric_hybrid](https://gitcode.com/Ascend/memfabric_hybrid) and ASCEND_MF_STORE_URL being set + +```bash Command +pip install memfabric-hybrid==1.0.0 +export ASCEND_MF_STORE_URL="tcp://xxx.xx.xxx.xxx:xxxx" +``` +Use mooncake backend, more details can be found in mooncake section. 
+```bash +export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true +``` +ASCEND_NPU_PHY_ID need to be set in container env +```bash +export ASCEND_NPU_PHY_ID=xxx +``` + + +### Llama Single Node + +```bash +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode prefill \ + --port 30000 \ + --disaggregation-transfer-backend ascend +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode decode \ + --port 30001 \ + --base-gpu-id 1 \ + --disaggregation-transfer-backend ascend +python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000 +``` + +### DeepSeek Multi-Node + +```bash +# prefill 0 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-transfer-backend ascend \ + --disaggregation-mode prefill \ + --host ${local_ip} \ + --port 30000 \ + --trust-remote-code \ + --dist-init-addr ${prefill_master_ip}:5000 \ + --nnodes 1 \ + --node-rank 0 \ + --tp-size 16 +# decode 0 +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --disaggregation-transfer-backend ascend \ + --disaggregation-mode decode \ + --host ${local_ip} \ + --port 30001 \ + --trust-remote-code \ + --dist-init-addr ${decode_master_ip}:5000 \ + --nnodes 1 \ + --node-rank 0 \ + --tp-size 16 +``` diff --git a/docs_new/docs/advanced_features/piecewise_cuda_graph.mdx b/docs_new/docs/advanced_features/piecewise_cuda_graph.mdx new file mode 100644 index 000000000000..701bb9ae1634 --- /dev/null +++ b/docs_new/docs/advanced_features/piecewise_cuda_graph.mdx @@ -0,0 +1,299 @@ +--- +title: "Piecewise CUDA Graph" +metatags: + description: "Use Piecewise CUDA Graph to reduce prefill and extend kernel launch overhead while supporting dynamic token shapes." +--- + +## Motivation + +Standard CUDA graphs capture the entire model forward pass as a single graph. 
This works well for decode (fixed batch size), but not for extend/prefill where the number of tokens varies across iterations. + +Piecewise CUDA Graph (PCG) solves this by splitting the model's computation graph into pieces (roughly one per layer) at "split points" (e.g., MoE dispatch ops). Each piece is captured as a separate CUDA graph for a set of pre-defined token lengths. At runtime, the input is padded to the nearest captured size, and each piece is replayed. This eliminates kernel launch overhead for prefill/extend while still supporting dynamic shapes. + +Recently we **enabled PCG by default**, which means that the old `--enable-piecewise-cuda-graph` flag is deprecated. Use `--disable-piecewise-cuda-graph` to turn it off. + +## Usage + +PCG is enabled by default for supported configurations. No extra flags needed: + +```bash +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct +``` + +### Disable PCG + +```bash +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disable-piecewise-cuda-graph +``` + +### Custom capture sizes + +```bash +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --piecewise-cuda-graph-max-tokens 2048 +``` + +### Server Args + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Default | Description |
| --- | --- | --- |
| `--disable-piecewise-cuda-graph` | `False` | Disable PCG for extend/prefill. |
| `--enforce-piecewise-cuda-graph` | `False` | Force-enable PCG, skipping all auto-disable conditions. For testing only. |
| `--piecewise-cuda-graph-max-tokens` | `None` (auto) | Maximum token count to capture. Defaults to `chunked_prefill_size` (non-MLA) or `2048` (MLA). |
| `--piecewise-cuda-graph-tokens` | `None` (auto) | Explicit list of token lengths to capture. Auto-generated if not set. |
| `--piecewise-cuda-graph-compiler` | `"eager"` | Compiler backend for the captured subgraphs. Choices: `eager`, `inductor`. |
| `--enable-piecewise-cuda-graph` | | Deprecated. PCG is now enabled by default. Use `--enforce-piecewise-cuda-graph` to skip auto-disable conditions. |
+ +## Bug Report + +PCG is enabled by default but is still in an experimental stage. Since PCG relies on `torch.compile` to trace the model's forward pass, most bugs are introduced by torch compile tracing failures (e.g., untraceable ops, dynamic control flow, or graph breaks). If you encounter any issues related to PCG, please disable it by adding `--disable-piecewise-cuda-graph` to your launch command and report the bug at [GitHub Issues](https://github.com/sgl-project/sglang/issues/new/choose). We greatly appreciate your help in improving this feature. + +### For Users + +If you see an error message like the following during server startup, it is a PCG bug: + +``` +Piecewise CUDA Graph is enabled by default as an experimental feature. +To work around this error, add --disable-piecewise-cuda-graph to your launch command. +Please report this issue at https://github.com/sgl-project/sglang/issues/new/choose +``` + +To work around it, add `--disable-piecewise-cuda-graph` to your launch command. When filing a bug report, please include: +1. The full error traceback +2. Model name and quantization method +3. Launch command with all arguments +4. GPU type and driver version + +### For Developers + +Since PCG relies on `torch.compile` to trace the model's forward pass, newly developed CUDA kernels (both JIT kernels and sgl-kernels) are typically not compatible with `torch.compile` out of the box. The tracing will fail on untraceable operations such as JIT compilation, file I/O, or dynamic module loading inside the kernel. + +To make a kernel compatible with PCG, you need to register it as a custom op using `register_custom_op` from `sglang.srt.utils.custom_op`. This wraps the kernel as an opaque node in the compiled graph so that `torch.compile` will not trace inside it. 
+ +**Example usage (JIT kernel):** + +```python +from sglang.srt.utils.custom_op import register_custom_op + +# Inplace operator (no return value) +@register_custom_op(mutates_args=["output_q", "output_s"]) +def per_token_group_quant_8bit( + input: torch.Tensor, + output_q: torch.Tensor, + output_s: torch.Tensor, +) -> None: + # kernel implementation ... +``` + +**Example usage (operator with output):** + +```python +# out_shape indicates which argument has the same shape as the output +@register_custom_op(mutates_args=["x"], out_shape=0) +def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor: + return x.add_(y) +``` + +For wrapping external library functions (e.g., FlashInfer kernels), use `register_custom_op_from_extern` instead. See `python/sglang/srt/utils/custom_op.py` for full API documentation. + +## How it works + +### Torch compile backend + +PCG uses `torch.compile` with a custom backend (`SGLangBackend`) to split and compile the model's forward pass. The flow is: + +``` +model.forward wrapper +→ torch.compile(..., backend=SGLangBackend) +→ FX graph +→ split_graph() at registered split ops +→ split_gm (top-level graph that chains the pieces) +→ replace capturable submodules with CUDAPiecewiseBackend +→ runtime dispatch: eager split ops + per-piece capture/replay +``` + +- **Install**: `install_torch_compiled()` replaces `model.forward` with a wrapper function. When `is_in_piecewise_cuda_graph()` returns True, the wrapper dispatches to the compiled callable; otherwise it falls back to the original forward. The first invocation through this path triggers Dynamo tracing and graph compilation — CUDA graph replay only happens after the capture phase completes. + +- **Split**: When `torch.compile` traces the model, `SGLangBackend` receives the FX graph and calls `split_graph()`. Ops listed in `CompilationConfig.split_ops` are treated as split points, so the graph is cut at each one. 
These split-op submodules are left to run eagerly at runtime, while the surrounding submodules are compiled and wrapped by `CUDAPiecewiseBackend`. The result is a top-level "stitching graph" (`split_gm`) with children such as `submod_0`, `submod_1`, … interleaving capturable subgraphs and eager split-op submodules. + +- **Replace**: `PiecewiseCompileInterpreter` iterates over each capturable submodule in `split_gm`, compiles it for general (dynamic) shapes, and replaces it in-place with a `CUDAPiecewiseBackend` instance. Split-op submodules (e.g., attention, all-reduce) are left as-is and run eagerly at runtime. + +- **Dispatch**: At runtime, calling `split_gm` executes the stitching graph, which calls each submodule in order. Split-op submodules run eagerly. Each `CUDAPiecewiseBackend` submodule goes through three phases: + - **Compile warmup** — runs the general-shape compiled path. + - **Capture** — for each capture size, runs one warmup pass then records a CUDA graph. + - **Steady-state replay** — replays the captured CUDA graph for each forward pass. + +### Piecewise cuda graph runner + +`PiecewiseCudaGraphRunner` orchestrates the full lifecycle through three phases: + +- **Compile** — Warms up JIT kernels with a dummy forward pass, then wraps the model with `torch.compile`, triggering Dynamo tracing to split the FX graph and create `CUDAPiecewiseBackend` instances for each subgraph piece. + +- **Capture** — Iterates over capture sizes in reverse order (largest first). For each size, runs the forward pass twice (one warmup, one CUDA graph capture). + +- **Replay** — At runtime, finds the smallest captured size >= actual token count via binary search, copies inputs into static buffers with zero-padding, replays the captured CUDA graphs, and slices outputs back to the actual token count. + +### Memory optimization + +The memory cost of PCG comes from two parts: **torch memory allocator** and **non-torch memory**. 
+ +The torch memory allocator overhead is trivial thanks to several optimizations: a global shared memory pool is reused across all CUDA graph runners and capture sizes, capture is done in reverse order (large to small) so smaller graphs reuse memory allocated by larger ones, and output tensors of the last subgraph are stored as weak references to maximize memory reuse. + +The main memory overhead comes from non-torch memory — the CUDA graph objects themselves require GPU memory to store the recorded kernel launch parameters and internal state. This overhead scales with the number of captured sizes, which is why `piecewise_cuda_graph_max_tokens` is capped conservatively by default. + +### Shape configuration + +Piecewise CUDA graph pre-captures graphs for a set of token counts. At runtime, the actual token count is rounded up to the nearest captured size (via binary search), and the corresponding graph is replayed. If the token count exceeds the largest captured size, the runtime falls back to the normal (non-graph) forward path. + +The default capture schedule is auto-generated with increasing granularity: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Token range | Step size |
| --- | --- |
| 4 – 32 | 4 |
| 48 – 256 | 16 |
| 288 – 512 | 32 |
| 576 – 1024 | 64 |
| 1280 – 4096 | 256 |
| 4096+ | 512 |
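
The schedule in the table and the round-up lookup described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in `compilation_config.py`: the helper names `build_capture_sizes` and `pad_to_captured` are hypothetical, and the handling of range boundaries (for example, where the 512-step range begins) is an assumption read off the table.

```python
from bisect import bisect_left
from typing import List, Optional

# (start, end, step) ranges copied from the table above. The open-ended
# "4096+" range is handled separately with a step of 512.
_RANGES = [(4, 32, 4), (48, 256, 16), (288, 512, 32), (576, 1024, 64), (1280, 4096, 256)]


def build_capture_sizes(max_tokens: int) -> List[int]:
    """Hypothetical helper: sorted capture sizes that do not exceed max_tokens."""
    sizes: List[int] = []
    for start, end, step in _RANGES:
        sizes.extend(range(start, end + 1, step))
    # Assumption: beyond 4096 the schedule continues in steps of 512.
    sizes.extend(range(4096 + 512, max_tokens + 1, 512))
    return [s for s in sizes if s <= max_tokens]


def pad_to_captured(num_tokens: int, sizes: List[int]) -> Optional[int]:
    """Return the smallest captured size >= num_tokens, or None to signal
    a fallback to the normal (non-graph) forward path."""
    idx = bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else None


sizes = build_capture_sizes(max_tokens=2048)
print(pad_to_captured(33, sizes))    # rounded up to the next captured size
print(pad_to_captured(4000, sizes))  # above the cap, so no graph is replayed
```

Keeping the list sorted is what makes the binary search valid; the padding cost at runtime is the gap between the actual token count and the size returned by the lookup.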
+ +For the auto-generated schedule, sizes are capped at `--piecewise-cuda-graph-max-tokens`. The default cap is `chunked_prefill_size` for non-MLA models and `2048` for MLA backend models. If `--max-total-tokens` is set, the cap is further limited to not exceed it. Additionally, Llama-2 models are auto-capped at 4096 tokens as a temporary workaround. + +## Compatibility + +PCG is auto-disabled in the following scenarios. We are actively working on expanding compatibility — support for many of these will be coming soon. + +- Disabled model architectures (e.g., `DeepseekV32ForCausalLM`) +- Speculative decoding +- DP attention +- Pipeline parallelism (`pp_size > 1`) +- Non-CUDA hardware (AMD ROCm, Ascend NPU) +- MoE A2A backend +- LoRA +- Multimodal / VLM models +- DLLM (diffusion LLM) +- Deterministic inference +- PD disaggregation +- Expert distribution recorder / EPLB + +Use `--enforce-piecewise-cuda-graph` to skip all auto-disable checks (for testing/debugging only). + +## Code Reference + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| File | Description |
| --- | --- |
| `python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py` | Main runner: init, capture, replay |
| `python/sglang/srt/compilation/compile.py` | `install_torch_compiled` trampoline |
| `python/sglang/srt/compilation/backend.py` | `SGLangBackend`, graph splitting, piecewise compilation |
| `python/sglang/srt/compilation/cuda_piecewise_backend.py` | Per-subgraph CUDA graph capture/replay |
| `python/sglang/srt/compilation/piecewise_context_manager.py` | Global context flags and `ForwardContext` |
| `python/sglang/srt/compilation/compilation_config.py` | Capture sizes, split ops, compiler config |
| `python/sglang/srt/utils/custom_op.py` | `register_custom_op` for `torch.compile` compatibility |
| `python/sglang/srt/server_args.py` | Server arguments and auto-disable logic |
diff --git a/docs_new/docs/advanced_features/pipeline_parallelism.mdx b/docs_new/docs/advanced_features/pipeline_parallelism.mdx new file mode 100644 index 000000000000..77c88c2c4f7e --- /dev/null +++ b/docs_new/docs/advanced_features/pipeline_parallelism.mdx @@ -0,0 +1,119 @@ +--- +title: "Pipeline Parallelism for Long Context" +metatags: + description: "SGLang pipeline parallelism: reduce TTFT for ultra-long sequences, dynamic chunking, async P2P communication, multi-node deployment." +--- +## Why Pipeline Parallelism? + +As Large Language Models (LLMs) scale toward trillion-parameter architectures and "infinite" context windows, the underlying serving infrastructure must evolve toward more granular, cross-node parallelization strategies. While KV cache techniques effectively mitigate redundant computation, they cannot circumvent the prohibitive Time to First Token (TTFT) inherent in ultra-long sequences with extremely large initial Input Token Length (ITL). Although Tensor Parallelism (TP) remains the conventional approach for intra-node scaling, it frequently encounters communication bottlenecks during multi-node deployments. On the other hand, pipeline parallelism only requires cross-node communication at the boundaries of each pipeline stage, which can achieve better computation-communication overlap compared to a large TP. Therefore, it is also a promising parallelization strategy for improving throughput. + +Detailed analysis can be found in this [blog](https://lmsys.org/blog/2026-01-15-chunked-pipeline/). + +## Implementation Refactoring based on Async Communication +With Dynamic Chunked Prefill, pipeline parallelism has the potential to reduce the TTFT of long-context inputs. For each request, its input tokens can be partitioned into multiple chunks, each no longer than the chunked prefill size. Different chunks of the same request can be processed simultaneously by different nodes, thus parallelizing the processing and reducing TTFT. 
SGLang has supported Pipeline Parallelism (#5724) for some time and made it compatible with the PD Disaggregation feature (#8846), but the implementation was not perfect and had significant room for performance improvements. + +To eliminate this performance hazard, SGLang implements a Micro-batching Event Loop with non-blocking asynchronous peer-to-peer (P2P) communication to overlap GPU computation with CPU metadata processing and PP communication. This ensures that while one micro-batch is being computed on the GPU, the next one is already being prepared and moved into position effectively, ensuring the pipeline remains as saturated as possible. This approach was first proposed in #7979 and has been redesigned and included in #11852. + +The key mechanisms of the implementation include: + +* **Decoupled Sync/Async Logic in the Event Loop:** The scheduler uses `async_send` in `_pp_send_pyobj_to_next_stage`. Instead of waiting for a transfer to complete, it returns a `P2PWork` handle. The actual synchronization (`P2PWork.work.wait()`) is deferred until `_pp_commit_comm_work` is called, allowing the CPU to perform other work—like scheduling the next batch or processing metadata—while data is in flight. +* **Multi-Stream Execution:** In addition to the main `default_stream`, which serves as the synchronization stream, SGLang utilizes dedicated `forward_stream` and `copy_stream` to execute forward pass GPU computation and Data-to-Host (D2H) memory transfers separately for better overlapping. While `_pp_launch_batch` is executing the current micro-batch on the GPU for the current stage, the CPU processes the previous micro-batch's results using `_pp_process_batch_result`. + +## Guidance about Dynamic Chunking + +### Why Dynamic Chunking +Chunked prefill with a fixed size can cause bubbles in the pipeline, especially when the pp size is large. 
The main reason is that chunk runtime is non-uniform even when every chunk has the same size: because of the Transformer structure, the longer the prefix sequence, the longer a chunk takes to run. These bubbles propagate to the next stage and significantly degrade the scaling efficiency of larger PP ranks. + +To address this issue, SGLang introduces a dynamic chunking mechanism that predicts the optimal size for the next chunk such that it satisfies this condition: + +Runtime(L + Next Chunk Size) - Runtime(L) = Runtime(Initial Chunk Size) + +where ***L*** denotes the Prefix Sequence Length. By profiling a series of requests with different ITLs, we model the cumulative runtime as a quadratic function of sequence length. Using this model, we solve for the optimal next chunk size at any given prefix length ***L***. Since the computational cost of the Attention mechanism scales with ***L***, the next chunk size is progressively reduced as ***L*** grows, keeping chunk execution time aligned across pipeline stages. + +Based on this method, the scheduler can predict and dynamically reduce the chunk size at runtime to minimize the bubbles caused by stage misalignment. Note that the scheduler does not use the raw predicted value. To keep KV cache memory management efficient and match hardware execution granularity, the value is rounded down to the nearest multiple of max(`--page-size`, 64). + + +### Chunked Prefill Size and Smoothing Factor + +When `--enable-dynamic-chunking` is enabled, the size of each chunk in a sequence is determined dynamically by the quadratic model, which predicts the next chunk size from the estimated runtime of the initial chunk. In this case, `--chunked-prefill-size` sets the initial chunk size.
When switching to the dynamic chunking mode, the initial chunk size (`--chunked-prefill-size`) should be set to a larger value comparable to the original chunked prefill size, so that there won't be too many chunks. + +**`SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR`** is an environmental variable that controls the smoothing factor for the dynamic chunking algorithm, defaulting to 0.75. It determines how much the chunk size can change during the prefill phase. A larger value means a more aggressive chunk size change, which may lead to better performance but also to greater chunk size changes (the chunk size at the end may become very small, which could lead to performance degradation) and more total chunks. When it is set to 1, the chunk size will be adjusted strictly based on the aforementioned quadratic model that predicts the next chunk size. A smaller value means a more conservative chunk size change, which may lead to smaller chunk size changes and fewer total chunks. When it is set to 0, the chunk size will not be adjusted dynamically, so it is identical to the traditional way with a fixed chunked prefill size. + +Due to the variation in hardware, models, and target workloads, a static configuration is seldom optimal across all scenarios. Consequently, achieving peak performance necessitates a degree of hyperparameter tuning when switching to the dynamic chunking mode. + +**Tuning Guidance for Dynamic Chunked Prefill** + +* **Step 1 \- Iterate to find the optimal fixed chunked prefill size for the targeted PP size**: Different PP sizes for targeted ITL may have different optimal chunked prefill sizes. Therefore, users should iterate to obtain the baseline according to the available resources for scaling. +* **Step 2 \- Initial Chunk Size Selection for Dynamic Chunking**: Set the initial size to 2× or 3× the optimal fixed chunked prefill size. This reduces the total number of chunks and prevents "tail chunks" from underutilizing hardware. 
 To maintain efficiency for extremely large Input Token Lengths (ITL), the dynamic predictor automatically ensures subsequent chunks are at least 1/4 of this initial size. For such cases, a larger initial chunk size (e.g., 4× the optimal fixed chunked prefill size) is also recommended. +* **Step 3 \- Smooth Factor Adjustment**: This factor controls how strictly the chunk size follows the prediction given by the quadratic performance fitting model. + * 1.0: Follows the model strictly. + * **0.6 – 0.85 (Recommended)**: In our experiments, this range typically gives the best balance between dynamic scaling and hardware stability. + * 0: Disables dynamic adjustment, reverting to traditional fixed-size chunking. +* **Another small optimization tip:** Put the larger partition in the higher PP rank when the layers are not evenly divisible across ranks. This can increase GPU utilization while a higher PP rank is waiting for the previous stage’s result, reducing the bubbles on higher PP ranks. Taking DeepSeek-V3.1 as an example, `SGLANG_PP_LAYER_PARTITION=15,15,15,16` usually performs better than `16,15,15,15`. + +## Best Practice for Long Context + +### Tuning the Chunked Prefill Size +Optimizing the chunked prefill size is crucial for balancing pipeline efficiency and resource utilization. The ideal size depends on factors including model architecture, hardware configuration, and typical input lengths. We recommend starting with a small chunk size, such as 4K, and gradually increasing it until you find the optimal size for your specific use case (different target ITLs and PP sizes may have different optimal chunked prefill sizes, so users should iterate to obtain a baseline that fits the resources available for scaling).
Alternatively, you can analyze the hardware capacity and determine the optimal chunk size based on the roofline model. + +### Enable Dynamic Chunking and Adjust Smoothing Factor for Ultra-long ITL +SGLang also offers a dynamic chunking solution that could further improve performance. This feature is currently an experimental feature that requires a certain amount of tuning experimentation and may not be suitable for all workloads. In addition, fine-tuning the smoothing factor can help optimize performance for specific workloads and model characteristics. + +### Case Study on NVIDIA H20 + +When evaluating pipeline parallelism with fixed chunked prefill sizes from 2K to 16K, experiment results show that a 4K chunk size delivered optimal prefill TTFT performance for the DeepSeek-V3.1, and a 6K chunk size delivered optimal prefill TTFT performance for the Qwen3-235B-A22B-FP8. + +When enabling dynamic chunking, we first scale the optimal fixed chunked prefill size by a factor of 3 as the initial chunk size. Through experimentation, we found that a multiplier of 2-3 provides an appropriate balance—avoiding excessive initial pipeline bubbles while ensuring that subsequent chunks don't become too small as context length increases. With the default dynamic chunking smoothing factor of 0.75, we performed parameter tuning and determined that a value of 0.65 works optimally with the 12K initial chunk size for the DeepSeek-V3.1, while a value of 0.8 works optimally with the 18K initial chunk size for the Qwen3-235B-A22B-FP8. 
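
To see how the initial chunk size and the smoothing factor interact, the dynamic chunking rule from the guidance section can be sketched as follows. This is a minimal sketch under stated assumptions, not SGLang's scheduler code: the function name `next_chunk_size` is hypothetical, the quadratic coefficients `a` and `b` are made-up values rather than profiled ones, and the smoothing factor is assumed to interpolate linearly between the initial chunk size (factor 0) and the model's prediction (factor 1). The alignment to a multiple of max(page size, 64) and the floor of 1/4 of the initial size follow the description above.

```python
import math


def next_chunk_size(prefix_len, initial_chunk, a, b, smooth=0.75, page_size=1):
    """Hypothetical helper sketching the dynamic chunking rule.

    Assumes cumulative runtime T(n) = a*n^2 + b*n with a > 0 and solves
    T(L + x) - T(L) = T(initial_chunk) for the next chunk size x.
    """
    target = a * initial_chunk**2 + b * initial_chunk
    # Expanding the condition gives a*x^2 + (2*a*L + b)*x - target = 0;
    # take the positive root.
    lin = 2 * a * prefix_len + b
    predicted = (-lin + math.sqrt(lin * lin + 4 * a * target)) / (2 * a)
    # Assumed smoothing: 0 keeps the initial size, 1 follows the model strictly.
    smoothed = initial_chunk + smooth * (predicted - initial_chunk)
    # Round down to a multiple of max(page_size, 64).
    align = max(page_size, 64)
    aligned = int(round(smoothed)) // align * align
    # Keep later chunks at least 1/4 of the initial size.
    return max(aligned, initial_chunk // 4)


# Made-up coefficients, used only to show chunk sizes shrinking with the prefix.
a, b = 1e-9, 1e-4
total_tokens = 131072
prefix, chunks = 0, []
while prefix < total_tokens:
    size = min(next_chunk_size(prefix, 12288, a, b, smooth=0.65), total_tokens - prefix)
    chunks.append(size)
    prefix += size
print(chunks)
```

Under these assumptions, a smaller smoothing factor keeps later chunks closer to the initial size (fewer chunks, less shrinkage), while a factor of 1 follows the quadratic model and produces more, smaller chunks toward the end of the sequence.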
+ +#### DeepSeek-V3.1 with 128K Input Token Length +```bash Command +# prefill node 0 (fixed chunked prefill size) +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.1 --trust-remote-code \ + --nnodes 4 --node-rank 0 --tp 8 --pp-size 4 \ + --port 30000 --dist-init-addr \ + --disable-radix-cache --mem-fraction-static 0.8 \ + --attention-backend fa3 --host 0.0.0.0 --watchdog-timeout 3600 \ + --max-running-requests 128 --chunked-prefill-size 4096 +``` + +```bash Command +# prefill node 0 (with dynamic chunking) +export SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR=0.65 +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.1 --trust-remote-code \ + --nnodes 4 --node-rank 0 --tp 8 --pp-size 4 \ + --port 30000 --dist-init-addr \ + --disable-radix-cache --mem-fraction-static 0.8 \ + --attention-backend fa3 --host 0.0.0.0 --watchdog-timeout 3600 \ + --max-running-requests 128 --chunked-prefill-size 12288 --enable-dynamic-chunking +``` + +#### Qwen3-235B-A22B-FP8 with 128K Input Token Length +```bash Command +# prefill node 0 (fixed chunked prefill size) +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-235B-A22B-FP8 --trust-remote-code \ + --nnodes 4 --node-rank 0 --tp 4 --pp-size 8 \ + --port 30000 --dist-init-addr \ + --disable-radix-cache --mem-fraction-static 0.8 \ + --attention-backend fa3 --host 0.0.0.0 --watchdog-timeout 3600 \ + --max-running-requests 128 --chunked-prefill-size 6144 +``` + +```bash Command +# prefill node 0 (with dynamic chunking) +export SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR=0.8 +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-235B-A22B-FP8 --trust-remote-code \ + --nnodes 4 --node-rank 0 --tp 4 --pp-size 8 \ + --port 30000 --dist-init-addr \ + --disable-radix-cache --mem-fraction-static 0.8 \ + --attention-backend fa3 --host 0.0.0.0 --watchdog-timeout 3600 \ + --max-running-requests 128 --chunked-prefill-size 18432 --enable-dynamic-chunking +``` + +Note: `--disable-radix-cache` is enabled 
only for reproducible benchmarking purposes. It is not recommended to use it in production. + +## Best Practice for Pipeline Parallelism with PD Disaggregation +To be added. Stay tuned for the latest updates on Pipeline Parallelism with PD Disaggregation. diff --git a/docs_new/docs/advanced_features/quantization.mdx b/docs_new/docs/advanced_features/quantization.mdx new file mode 100644 index 000000000000..bd59a0497bba --- /dev/null +++ b/docs_new/docs/advanced_features/quantization.mdx @@ -0,0 +1,807 @@ +--- +title: "Quantization" +metatags: + description: "SGLang quantization: FP8, FP4, AWQ, GPTQ, ModelOpt, torchao. Offline and online quantization methods for efficient LLM inference." +--- +SGLang supports various quantization methods, including offline quantization and online dynamic quantization. + +Offline quantization loads pre-quantized model weights directly during inference. This is required for quantization methods +such as GPTQ and AWQ, which collect and pre-compute various statistics from the original weights using the calibration dataset. + +Online quantization dynamically computes scaling parameters—such as the maximum/minimum values of model weights—during runtime. +Like NVIDIA FP8 training's [delayed scaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Mixed-precision-training-with-FP8) mechanism, online quantization calculates the appropriate scaling factors +on-the-fly to convert high-precision weights into a lower-precision format. + +**Note: For better performance, usability and convenience, offline quantization is recommended over online quantization.** + +If you use a pre-quantized model, do not add `--quantization` to enable online quantization at the same time. 
+For popular pre-quantized models, please visit [Unsloth](https://huggingface.co/unsloth), [NVIDIA ModelOpt](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer) +or [NeuralMagic](https://huggingface.co/collections/neuralmagic) collections on HF for some +popular quality validated quantized models. Quantized models must be validated via benchmarks post-quantization +to guard against abnormal quantization loss regressions. + +## Platform Compatibility + +The following table summarizes quantization method support across NVIDIA and AMD GPUs, Ascend NPUs. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | NVIDIA GPUs | AMD GPUs (MI300X/MI325X/MI350X) | Ascend NPUs (A2/A3) | Notes |
| --- | --- | --- | --- | --- |
| fp8 | Yes | Yes | WIP | Aiter or Triton backend on AMD |
| mxfp4 | Yes | Yes | WIP | Requires CDNA3/CDNA4 with MXFP support; uses Aiter |
| blockwise_int8 | Yes | Yes | No | Triton-based, works on both platforms |
| w8a8_int8 | Yes | Yes | No | |
| w8a8_fp8 | Yes | Yes | No | Aiter or Triton FP8 on AMD |
| awq | Yes | Yes | Yes | Uses Triton dequantize on AMD (vs. optimized CUDA kernels on NVIDIA). Uses CANN kernels on Ascend |
| gptq | Yes | Yes | Yes | Uses Triton or vLLM kernels on AMD. Uses CANN kernels on Ascend |
| compressed-tensors | Yes | Yes | Partial | Aiter paths for FP8/MoE on AMD. Uses CANN kernels on Ascend, FP8 not supported yet |
| quark | Yes | Yes | No | AMD Quark quantization; Aiter GEMM paths on AMD |
| auto-round | Yes | Yes | Partial | Platform-agnostic (Intel auto-round). Uses CANN kernels on Ascend |
| quark_int4fp8_moe | No | Yes | No | AMD-only; online INT4-to-FP8 MoE quantization (CDNA3/CDNA4) |
| awq_marlin | Yes | No | No | Marlin kernels are CUDA-only |
| gptq_marlin | Yes | No | No | Marlin kernels are CUDA-only |
| gguf | Yes | No | Yes | CUDA kernels in sgl-kernel; Ascend uses CPU pre-dequantization at load time |
| modelopt / modelopt_fp8 | Yes (Hopper/SM90+) | No | No | NVIDIA ModelOpt; requires NVIDIA hardware |
| modelopt_fp4 | Yes (Blackwell/SM100+) | No | No | NVIDIA ModelOpt; native FP4 on Blackwell (B200, GB200) |
| petit_nvfp4 | No | Yes (MI250/MI300X/MI325X) | No | Enables NVFP4 on ROCm via Petit; use modelopt_fp4 on NVIDIA Blackwell. Auto-selected when loading NVFP4 models on AMD. See LMSYS blog and AMD ROCm blog. |
| bitsandbytes | Yes | Experimental | No | Depends on bitsandbytes ROCm support |
| torchao (int4wo, etc.) | Yes | Partial | No | int4wo not supported on AMD; other methods may work |
| modelslim | No | No | Yes | Ascend quantization; Uses CANN kernels |
+ +On AMD, several of these methods use [Aiter](https://github.com/ROCm/aiter) for acceleration -- set `SGLANG_USE_AITER=1` where noted. See [AMD GPU setup](../hardware-platforms/amd_gpu) for installation and configuration details. + +On Ascend, various layers quantization configurations are supported, see [Ascend NPU quantization](../hardware-platforms/ascend-npus/ascend_npu_quantization) for details. + +## GEMM Backends for FP4/FP8 Quantization + + +Backend selection is supported only for **blockwise FP8** and **NVFP4** GEMM. When running FP8 or FP4 quantized models, you can select the GEMM backend via `--fp8-gemm-backend` and `--fp4-gemm-backend`. + + +### `--fp8-gemm-backend` (Blockwise FP8 GEMM) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Backend | Hardware | Description |
| --- | --- | --- |
| auto | All | Auto-selects based on hardware |
| deep_gemm | SM90, SM100 | JIT-compiled; enabled when DeepGEMM is installed |
| flashinfer_trtllm | SM100 | FlashInfer TensorRT-LLM backend; optimal for low-latency |
| flashinfer_cutlass | SM100/120 | FlashInfer CUTLASS groupwise FP8 GEMM |
| flashinfer_deepgemm | SM90 | Uses swapAB optimization for small M dimensions in decoding |
| cutlass | SM90, SM100/120 | sgl-kernel CUTLASS |
| triton | All | Fallback; widely compatible |
| aiter | ROCm | AMD AITER backend |
+ +**`auto` selection order:** 1) DeepGEMM (SM90/SM100, installed); 2) FlashInfer TRTLLM (SM100, FlashInfer available); 3) CUTLASS (SM90/SM100/120); 4) AITER (AMD); 5) Triton. **Exception:** SM120 always resolves to Triton. + +### `--fp4-gemm-backend` (NVFP4 GEMM) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Backend | Hardware | Description |
| --- | --- | --- |
| auto | SM100/120 | Auto-selects: flashinfer_cudnn on SM120; flashinfer_cutlass on SM100 |
| cutlass | SM100/120 | SGLang CUTLASS kernel |
| flashinfer_cutlass | SM100/120 | FlashInfer CUTLASS backend |
| flashinfer_cudnn | SM100/120 (CUDA 13+, cuDNN 9.15+) | FlashInfer cuDNN backend; used on SM120 for performance |
| flashinfer_trtllm | SM100 | FlashInfer TensorRT-LLM backend |
+ +When FlashInfer is unavailable for NVFP4, the SGLang CUTLASS kernel is used as an automatic fallback. + +## Offline Quantization + +To load already quantized models, simply load the model weights and config. **Again, if the model has been quantized offline, +there's no need to add `--quantization` argument when starting the engine. The quantization method will be parsed from the +downloaded Hugging Face or msModelSlim config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.** + +```bash Command +python3 -m sglang.launch_server \ + --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \ + --port 30000 --host 0.0.0.0 +``` + +Take note, if your model is **per-channel quantized (INT8 or FP8) with per-token dynamic quantization activation**, you can opt to include `--quantization w8a8_int8` or `--quantization w8a8_fp8` to invoke the corresponding CUTLASS int8_kernel or fp8_kernel in sgl-kernel. This action will ignore the Hugging Face config's quantization settings. For instance, with `neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic`, if you execute with `--quantization w8a8_fp8`, the system will use the `W8A8Fp8Config` from SGLang to invoke the sgl-kernel, rather than the `CompressedTensorsConfig` for vLLM kernels. + +```bash Command +python3 -m sglang.launch_server \ + --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic \ + --quantization w8a8_fp8 \ + --port 30000 --host 0.0.0.0 +``` + +### Examples of Offline Model Quantization + +#### Using [Unsloth](https://docs.unsloth.ai/basics/inference-and-deployment/sglang-guide) + +We strongly suggest the use of Unsloth to quantize and load the model. Please refer to [SGLang Deployment & Inference Guide with Unsloth](https://docs.unsloth.ai/basics/inference-and-deployment/sglang-guide). 
+ +#### Using [auto-round](https://github.com/intel/auto-round) + +```bash Command +# Install +pip install auto-round +``` + +- LLM quantization + +```py Example +# for LLM +from auto_round import AutoRound +model_id = "meta-llama/Llama-3.2-1B-Instruct" +quant_path = "Llama-3.2-1B-Instruct-autoround-4bit" +# Scheme examples: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc. +scheme = "W4A16" +format = "auto_round" +autoround = AutoRound(model_id, scheme=scheme) +autoround.quantize_and_save(quant_path, format=format) # quantize and save + +``` + +- VLM quantization +```py Example +# for VLMs +from auto_round import AutoRoundMLLM +model_name = "Qwen/Qwen2-VL-2B-Instruct" +quant_path = "Qwen2-VL-2B-Instruct-autoround-4bit" +scheme = "W4A16" +format = "auto_round" +# Pass the scheme by keyword, as in the LLM example above +autoround = AutoRoundMLLM(model_name, scheme=scheme) +autoround.quantize_and_save(quant_path, format=format) # quantize and save + +``` + +- Command Line Usage (Gaudi/CPU/Intel GPU/CUDA) + +```bash Command +auto-round \ + --model meta-llama/Llama-3.2-1B-Instruct \ + --bits 4 \ + --group_size 128 \ + --format "auto_round" \ + --output_dir ./tmp_autoround +``` + +- Known issues + +Several limitations currently affect offline quantized model loading in SGLang. These issues might be resolved in future updates of SGLang. If you experience any problems, consider using Hugging Face Transformers as an alternative. + +1. Mixed-bit Quantization Limitations + + Mixed-bit quantization is not fully supported. Due to vLLM's layer fusion (e.g., QKV fusion), applying different bit-widths to components within the same fused layer can lead to compatibility issues. + + +2. Limited Support for Quantized MoE Models + + Quantized MoE models may encounter inference issues due to kernel limitations (e.g., lack of support for mlp.gate layer quantization). Skip quantizing these layers to avoid such errors. + + +3.
Limited Support for Quantized VLMs + + {/* VLM failure cases */} + + Qwen2.5-VL-7B + + auto_round:auto_gptq format: Accuracy is close to zero. + + GPTQ format: Fails with: + ```text Output + The output size is not aligned with the quantized weight shape + ``` + auto_round:auto_awq and AWQ format: These work as expected. + + +#### Using [GPTQModel](https://github.com/ModelCloud/GPTQModel) + +```bash Command +# install +pip install gptqmodel --no-build-isolation -v +``` + +```py Example +from datasets import load_dataset +from gptqmodel import GPTQModel, QuantizeConfig + +model_id = "meta-llama/Llama-3.2-1B-Instruct" +quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit" + +calibration_dataset = load_dataset( + "allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", + split="train" + ).select(range(1024))["text"] + +quant_config = QuantizeConfig(bits=4, group_size=128) # quantization config +model = GPTQModel.load(model_id, quant_config) # load model + +model.quantize(calibration_dataset, batch_size=2) # quantize +model.save(quant_path) # save model +``` + +#### Using [LLM Compressor](https://github.com/vllm-project/llm-compressor/) + +```bash Command +# install +pip install llmcompressor +``` + +Here, we quantize `meta-llama/Meta-Llama-3-8B-Instruct` to `FP8` as an example of how to do offline quantization. + +```python Example +from transformers import AutoTokenizer +from llmcompressor.transformers import SparseAutoModelForCausalLM +from llmcompressor.transformers import oneshot +from llmcompressor.modifiers.quantization import QuantizationModifier + +# Step 1: Load the original model. +MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct" + +model = SparseAutoModelForCausalLM.from_pretrained( + MODEL_ID, device_map="auto", torch_dtype="auto") +tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) + +# Step 2: Perform offline quantization. +# Step 2.1: Configure the simple PTQ quantization.
+recipe = QuantizationModifier( + targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]) + +# Step 2.2: Apply the quantization algorithm. +oneshot(model=model, recipe=recipe) + +# Step 3: Save the model. +SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic" +model.save_pretrained(SAVE_DIR) +tokenizer.save_pretrained(SAVE_DIR) +``` + +Then, you can directly use the quantized model with `SGLang`, by using the following command: + +```bash Command +python3 -m sglang.launch_server \ + --model-path $PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic \ + --port 30000 --host 0.0.0.0 +``` + +#### Using [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer) + +NVIDIA Model Optimizer (ModelOpt) provides advanced quantization techniques optimized for NVIDIA hardware. + +**Offline vs. Online Quantization:** + +SGLang supports two modes for ModelOpt. + +* **Offline Quantization (pre-quantized):** + * **Usage:** Download a pre-quantized model from Hugging Face or run `hf_ptq.py` once to create a new quantized checkpoint. Then load this quantized checkpoint. + * **Pros:** Fast server startup, quantization can be validated before deployment, efficient resource usage. + * **Cons:** Requires an extra preparation step. + +* **Online Quantization (quant and serve):** + * **Usage:** Load a standard BF16/FP16 model and add a flag. The engine applies quantization *on startup*. + * **Pros:** Convenient (no new checkpoint needed). + * **Cons:** **High startup time**, increases VRAM usage during initialization (risk of OOM). + +The following sections guide you through using the Offline path: loading pre-quantized models or creating your own checkpoints. + +##### Using Pre-Quantized Checkpoints + +If a model is already quantized (e.g., from Hugging Face), you can load it directly. + +* **FP8 Models:** + Use `--quantization modelopt_fp8`. 
+ ```bash Command + python3 -m sglang.launch_server \ + --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \ + --quantization modelopt_fp8 \ + --port 30000 + ``` + +* **FP4 Models:** + Use `--quantization modelopt_fp4`. + ```bash Command + python3 -m sglang.launch_server \ + --model-path nvidia/Llama-3.3-70B-Instruct-NVFP4 \ + --quantization modelopt_fp4 \ + --port 30000 + ``` + +##### Creating Your Own Quantized Checkpoints + +If a pre-quantized checkpoint is not available for your model, you can create one using NVIDIA Model Optimizer's `hf_ptq.py` script. + +**Why quantize?** +- Reduce VRAM usage +- Higher throughput and lower latency +- More flexible deployment (on smaller GPUs) + +**What can be quantized?** +- The entire model +- MLP layers only +- KV cache + +**Key options in `hf_ptq.py`:** + +`--qformat`: Quantization formats `fp8`, `nvfp4`, `nvfp4_mlp_only` + +`--kv_cache_qformat`: KV cache quantization format (default: `fp8`) + +**Note:** The default `kv_cache_qformat` may not be optimal for all use cases. Consider setting this explicitly. + +**Hardware requirements:** Hopper and higher are recommended. Insufficient GPU memory may cause weight offloading, resulting in extremely long quantization time. + +For detailed usage and supported model architectures, see [NVIDIA Model Optimizer LLM PTQ](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq). + +SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment. + +##### Installation + +First, install ModelOpt: + +```bash Command +pip install nvidia-modelopt +``` + +##### Quantization and Export Workflow + +SGLang provides an example script that demonstrates the complete ModelOpt quantization and export workflow. 
Run from the SGLang repository root (see [modelopt_quantize_and_export.py](https://github.com/sgl-project/sglang/blob/main/examples/usage/modelopt_quantize_and_export.py)): + +```bash Command +# Quantize and export a model using ModelOpt FP8 quantization +python examples/usage/modelopt_quantize_and_export.py quantize \ + --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \ + --export-dir ./quantized_tinyllama_fp8 \ + --quantization-method modelopt_fp8 + +# For FP4 quantization (requires Blackwell GPU) +python examples/usage/modelopt_quantize_and_export.py quantize \ + --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \ + --export-dir ./quantized_tinyllama_fp4 \ + --quantization-method modelopt_fp4 +``` + +##### Available Quantization Methods + +- `modelopt_fp8`: FP8 quantization with optimal performance on NVIDIA Hopper and Blackwell GPUs +- `modelopt_fp4`: FP4 quantization with optimal performance on Nvidia Blackwell GPUs + +##### Python API Usage + +You can also use ModelOpt quantization programmatically: + +```python Example +import sglang as sgl +from sglang.srt.configs.device_config import DeviceConfig +from sglang.srt.configs.load_config import LoadConfig +from sglang.srt.configs.model_config import ModelConfig +from sglang.srt.model_loader.loader import get_model_loader + +# Configure model with ModelOpt quantization and export +model_config = ModelConfig( + model_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0", + quantization="modelopt_fp8", # or "modelopt_fp4" + trust_remote_code=True, +) + +load_config = LoadConfig( + modelopt_export_path="./exported_model", + modelopt_checkpoint_save_path="./checkpoint.pth", # optional, fake quantized checkpoint +) +device_config = DeviceConfig(device="cuda") + +# Load and quantize the model (export happens automatically) +model_loader = get_model_loader(load_config, model_config) +quantized_model = model_loader.load_model( + model_config=model_config, + device_config=device_config, +) +``` + +##### Deploying Quantized Models + 
+After quantization and export, you can deploy the model with SGLang: + +```bash Command +# Deploy the exported quantized model +python -m sglang.launch_server \ + --model-path ./quantized_tinyllama_fp8 \ + --quantization modelopt \ + --port 30000 --host 0.0.0.0 +``` + +Or using the Python API (use the same path as `modelopt_export_path` from the quantize step): + +```python Example +import sglang as sgl + +def main(): + # Deploy exported ModelOpt quantized model + # Path must match modelopt_export_path from quantize step (e.g., ./exported_model) + llm = sgl.Engine( + model_path="./exported_model", + quantization="modelopt", + ) + + # Run inference + prompts = [ + "Hello, how are you?", + "What is the capital of France?", + ] + sampling_params = { + "temperature": 0.8, + "top_p": 0.95, + "max_new_tokens": 100, + } + + outputs = llm.generate(prompts, sampling_params) + + for i, output in enumerate(outputs): + print(f"Prompt: {prompts[i]}") + print(f"Output: {output['text']}") + +if __name__ == "__main__": + main() + +``` + +##### Advanced Features + +**Checkpoint Management**: Save and restore fake quantized checkpoints for reuse: + +```bash Command +# Save the fake quantized checkpoint during quantization +python examples/usage/modelopt_quantize_and_export.py quantize \ + --model-path meta-llama/Llama-3.2-1B-Instruct \ + --export-dir ./quantized_model \ + --quantization-method modelopt_fp8 \ + --checkpoint-save-path ./my_checkpoint.pth + +# The checkpoint can be reused for future quantization runs and skip calibration +``` + +**Export-only Workflow**: If you have a pre-existing fake quantized ModelOpt checkpoint, you can export it directly. 
See [LoadConfig](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/load_config.py) for the full API: + +```python Example +from sglang.srt.configs.device_config import DeviceConfig +from sglang.srt.configs.load_config import LoadConfig +from sglang.srt.configs.model_config import ModelConfig +from sglang.srt.model_loader.loader import get_model_loader + +model_config = ModelConfig( + model_path="meta-llama/Llama-3.2-1B-Instruct", + quantization="modelopt_fp8", + trust_remote_code=True, +) + +load_config = LoadConfig( + modelopt_checkpoint_restore_path="./my_checkpoint.pth", + modelopt_export_path="./exported_model", +) + +# Load and export the model (DeviceConfig defaults to device="cuda") +model_loader = get_model_loader(load_config, model_config) +model_loader.load_model(model_config=model_config, device_config=DeviceConfig()) +``` + +##### Benefits of ModelOpt + +- **Hardware Optimization**: Specifically optimized for NVIDIA GPU architectures +- **Advanced Quantization**: Supports cutting-edge FP8 and FP4 quantization techniques +- **Seamless Integration**: Automatic export to HuggingFace format for easy deployment +- **Calibration-based**: Uses calibration datasets for optimal quantization quality +- **Production Ready**: Enterprise-grade quantization with NVIDIA support + +#### Using [ModelSlim](https://gitcode.com/Ascend/msmodelslim) +MindStudio-ModelSlim (msModelSlim) is a model offline quantization compression tool launched by MindStudio and optimized for Ascend hardware. + +- **Installation** + + ```bash Command + # Clone repo and install msmodelslim: + git clone https://gitcode.com/Ascend/msmodelslim.git + cd msmodelslim + bash install.sh + ``` + +- **LLM quantization** + + Download the original floating-point weights of the large model. Taking Qwen3-32B as an example, you can go to [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) to obtain the original model weights. 
Then install other dependencies (related to the model, refer to the Hugging Face model card). + > Note: You can find pre-quantized validated models on [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech). + + _Traditional quantization methods require calibration data files (```.jsonl``` format) for the calibration step of the quantization process._ + ```bash Command + Qwen3-32B/ # floating-point model downloaded from official HF (or modelscope) repo + msmodelslim/ # msmodelslim repo + |----- lab_calib # calibration data folder (put your dataset here in ```.jsonl``` format or use pre-prepared ones) + |----- some file (such as laos_calib.jsonl) + |----- lab_practice # best practice folder with configs for quantization + |----- model folder (such as qwen3_5_moe folder) # folder with quantization configs + |----- quant_config (such as qwen3_5_moe_w8a8.yaml) # quantization config + |----- other folders + output_folder/ # generated by the command below + |----- quant_model_weights-00001-of-0001.safetensors # quantized weights + |----- quant_model_description.json # file with description of the quantization methods for each layer (```W4A4_DYNAMIC```, etc.) + |----- other files (such as config.json, tokenizer.json, etc.)
+ ``` + Run quantization using one-click quantization (recommended): + ```bash Command + msmodelslim quant \ + --model_path ${MODEL_PATH} \ + --save_path ${SAVE_PATH} \ + --device npu:0,1 \ + --model_type Qwen3-32B \ + --quant_type w8a8 \ + --trust_remote_code True + ``` + +- **Usage Example** + ```bash Command + python3 -m sglang.launch_server \ + --model-path $PWD/Qwen3-32B-w8a8 \ + --port 30000 --host 0.0.0.0 + ``` + +- **Available Quantization Methods**: + - [x] ```W4A4_DYNAMIC``` linear with online quantization of activations + - [x] ```W8A8``` linear with offline quantization of activations + - [x] ```W8A8_DYNAMIC``` linear with online quantization of activations + - [x] ```W4A4_DYNAMIC``` MoE with online quantization of activations + - [x] ```W4A8_DYNAMIC``` MoE with online quantization of activations + - [x] ```W8A8_DYNAMIC``` MoE with online quantization of activations + - [ ] ```W4A8``` linear TBD + - [ ] ```W4A16``` linear TBD + - [ ] ```W48A16``` linear TBD + - [ ] ```W4A16``` MoE in progress + - [ ] ```W8A16``` MoE in progress + - [ ] ```KV Cache``` in progress + - [ ] ```Attention``` in progress + + +For more detailed model quantization examples and information about model support, see the [examples](https://gitcode.com/Ascend/msmodelslim/blob/master/example/README.md) section in the ModelSlim repo. + +## Online Quantization + +To enable online quantization, specify `--quantization` on the command line. For example, the following command launches the server with `FP8` quantization for the model `meta-llama/Meta-Llama-3.1-8B-Instruct`: + +```bash Command +python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --quantization fp8 \ + --port 30000 --host 0.0.0.0 +``` + +Our team is working on supporting more online quantization methods. SGLang will soon support methods including but not limited to `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`.
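As a rough illustration of what online quantization computes at load time, the sketch below derives a per-tensor scale from the largest absolute weight and maps the weights into the FP8 E4M3 range. It is a minimal pure-Python sketch, not SGLang's implementation: the function names are hypothetical, `448.0` is the largest finite value of the `float8_e4m3fn` dtype, and the final rounding onto the FP8 grid (done by the real GPU kernels) is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def per_tensor_scale(weights, qmax=FP8_E4M3_MAX):
    """Return the scale that maps the largest |weight| onto qmax."""
    amax = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    return amax / qmax if amax > 0 else 1.0


def to_quantized_range(weights, scale, qmax=FP8_E4M3_MAX):
    """Divide by the scale and clamp; a real kernel would also round to FP8."""
    return [max(-qmax, min(qmax, w / scale)) for w in weights]


weights = [0.02, -1.5, 0.75, 3.0]  # hypothetical high-precision weights
scale = per_tensor_scale(weights)
scaled = to_quantized_range(weights, scale)
restored = [q * scale for q in scaled]  # dequantize by multiplying back

print(f"scale={scale:.6f}, max |scaled|={max(abs(q) for q in scaled):.1f}")
```

Because the scale depends only on the weights themselves, no calibration dataset is needed, which is what distinguishes this path from offline methods such as GPTQ and AWQ.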
+ +### torchao online quantization method + +SGLang also supports quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` in the command line to support this feature. For example, if you want to enable `int4wo-128` for model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command: + +```bash Command +python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --torchao-config int4wo-128 \ + --port 30000 --host 0.0.0.0 +``` + +SGLang supports the following quantization methods based on torchao `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`. + +Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `"int8dq"` method currently has some bugs when using together with cuda graph capture. So we suggest to disable cuda graph capture when using `"int8dq"` method. Namely, please use the following command: + +```bash Command +python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --torchao-config int8dq \ + --disable-cuda-graph \ + --port 30000 --host 0.0.0.0 +``` + +### `quark_int4fp8_moe` online quantization method + +SGLang running on AMD GPUs (CDNA3 or CDNA4 architecture) supports the quantization method `--quantization quark_int4fp8_moe`, that will replace [MoE layers](https://github.com/sgl-project/sglang/blob/v0.4.8/python/sglang/srt/layers/moe/fused_moe_triton/layer.py#L271) originally in high precision (bfloat16, float16 or float32) to use weights dynamically quantized to int4, that are upcasted to float8 during inference to run compute in float8 precision with activations dynamically quantized on the fly to float8. + +Other layers (e.g. projections in the attention layers) have their weights quantized online to float8 directly. 
+ +## Reference + +- [GPTQModel](https://github.com/ModelCloud/GPTQModel) +- [LLM Compressor](https://github.com/vllm-project/llm-compressor/) +- [NVIDIA Model Optimizer (ModelOpt)](https://github.com/NVIDIA/Model-Optimizer) +- [NVIDIA Model Optimizer LLM PTQ](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq) +- [Petit: NVFP4 on ROCm](https://github.com/causalflow-ai/petit-kernel) — [LMSYS blog](https://lmsys.org/blog/2025-09-21-petit-amdgpu/), [AMD ROCm blog](https://rocm.blogs.amd.com/artificial-intelligence/fp4-mixed-precision/README.html) +- [Torchao: PyTorch Architecture Optimization](https://github.com/pytorch/ao) +- [vLLM Quantization](https://docs.vllm.ai/en/latest/quantization/) +- [auto-round](https://github.com/intel/auto-round) +- [ModelSlim](https://gitcode.com/Ascend/msmodelslim) diff --git a/docs_new/docs/advanced_features/quantized_kv_cache.mdx b/docs_new/docs/advanced_features/quantized_kv_cache.mdx new file mode 100644 index 000000000000..034741fe0258 --- /dev/null +++ b/docs_new/docs/advanced_features/quantized_kv_cache.mdx @@ -0,0 +1,256 @@ +--- +title: "Quantized KV Cache" +metatags: + description: "SGLang quantized KV cache: FP8 E4M3/E5M2 and FP4 E2M1 formats, memory savings up to 3.56x, scaling factors, accuracy benchmarks." +--- +Quantized KV cache reduces the memory footprint of key-value cache storage by using lower-precision data types (FP8 or FP4) instead of the default model precision in BF16. During autoregressive generation, LLMs cache previously computed key-value pairs to avoid redundant calculations. The KV cache typically consumes a significant portion of GPU memory, especially for long sequences. + +Quantized KV cache is a memory optimization technique that primarily benefits throughput by allowing more tokens to be cached, but may introduce minimal accuracy degradation depending on the quantization format used. 
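To get a feel for why the KV cache dominates memory, the per-token footprint can be estimated with the standard formula for multi-head or grouped-query attention: one key vector and one value vector per layer and per KV head. The sketch below is illustrative only; the model shape is hypothetical, and architectures such as MLA store a compressed cache that this formula does not cover.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes needed to cache one token: a K and a V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16_bytes = kv_cache_bytes_per_token(**shape, bytes_per_element=2)
fp8_bytes = kv_cache_bytes_per_token(**shape, bytes_per_element=1)

print(f"BF16: {bf16_bytes // 1024} KiB/token, FP8: {fp8_bytes // 1024} KiB/token")
```

Halving the bytes per element halves the footprint per token, so the same memory budget holds roughly twice as many cached tokens.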
+ + +**Performance Warning**: When quantized KV cache must be dequantized before use in attention operations, performance can be extremely slow if dequantization is not fused with the attention kernel. Always verify that your chosen attention backend supports quantized KV cache. Backends without fused support may experience significant throughput degradation, potentially negating the memory benefits. + +**Backend Support**: Not all attention backends support quantized KV cache. Refer to [Attention Backend](./attention_backend) for which backends support it. + + +## Supported Formats + +SGLang supports the following quantized KV cache formats: + +### FP8 Format + +[OCP (Open Compute Project)](https://www.opencompute.org) specifies two common 8-bit floating point formats: + +- **E5M2** (5 exponent bits, 2 mantissa bits): Larger dynamic range (±57344.0), lower precision +- **E4M3** (4 exponent bits, 3 mantissa bits): Higher precision, smaller dynamic range (±448.0) + +### FP4 Format + + +FP4 quantization is currently experimental. + + +[OCP (Open Compute Project)](https://www.opencompute.org) specifies MXFP4 (Microscaling FP4), a 4-bit floating-point format: + +- **E2M1** (1 sign bit, 2 exponent bits, 1 mantissa bit): Uses block-based microscaling where tensors are divided into blocks of consecutive elements, with each block sharing a single 8-bit exponential scaling factor. While OCP specifies blocks of 32 elements, SGLang's current implementation uses blocks of 16 elements for KV cache quantization.
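The block parameters above determine how much memory FP4 actually saves. As a back-of-the-envelope sketch, assume the shared 8-bit scale is the only overhead, so each FP4 value costs 4 bits plus 8/16 of a bit for its share of the block scale. The helper name is hypothetical, and the FP8 per-tensor scale is treated as negligible.

```python
def bits_per_element(element_bits, scale_bits=0, block_size=1):
    """Average storage per cached value, including the shared per-block scale."""
    return element_bits + scale_bits / block_size


bf16 = bits_per_element(16)
fp8 = bits_per_element(8)  # one scalar scale per tensor: overhead ignored
fp4 = bits_per_element(4, scale_bits=8, block_size=16)  # 4 + 8/16 = 4.5 bits

print(f"FP4 vs BF16: {bf16 / fp4:.2f}x more tokens")
print(f"FP4 vs FP8:  {fp8 / fp4:.2f}x more tokens")
print(f"FP8 vs BF16: {bf16 / fp8:.2f}x more tokens")
```

Under these assumptions the ratios come out to about 3.56x, 1.78x, and 2.00x, which is why FP4 falls short of the 4x that the raw bit width alone would suggest.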
+ +## Usage + +### Enabling Quantized KV Cache + +To enable quantized KV cache, use the `--kv-cache-dtype` argument when launching the server: + +```bash Command +# Enable FP8 E5M2 KV cache +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-R1-0528 \ + --kv-cache-dtype fp8_e5m2 + +# Enable FP8 E4M3 KV cache +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-R1-0528 \ + --kv-cache-dtype fp8_e4m3 + +# Enable FP4 E2M1 KV cache +python3 -m sglang.launch_server \ + --model-path nvidia/DeepSeek-R1-0528-NVFP4 \ + --kv-cache-dtype fp4_e2m1 +``` + +### Scaling Factors + +FP8 quantization requires scaling factors to properly quantize and dequantize the KV cache. + + +Currently, only per-tensor (scalar) scaling factors are supported. + + +Scaling factors can be: + +- **Loaded from checkpoints**: Pre-quantized models (e.g., ModelOpt) may include `k_scale` and `v_scale` parameters that are automatically loaded +- **Provided via JSON**: Supply scaling factors via `--quantization-param-path`. + +The JSON file should follow this format: + +```json Config +{ + "kv_cache": { + "dtype": "float8_e4m3fn", + "scaling_factor": { + "0": { + "0": 1.0, + "1": 1.0 + } + } + } +} +``` + +The outer keys in `scaling_factor` are tensor parallel ranks, and the inner keys are layer indices. + + +If scaling factors are neither provided nor found in the checkpoint, they default to 1.0, which may cause accuracy issues. + + + +**FP4 (MXFP4)**: Unlike FP8, FP4 quantization handles scaling factors automatically on-the-fly during quantization and dequantization. No pre-quantized models or external scaling factor files are required; the block-based scaling factors are computed dynamically as needed.
+ + +## Performance Considerations + +### Memory Savings + +Quantized KV cache provides significant memory savings: +- **BF16 → FP4**: Supports approximately 3.56× more tokens than BF16 (accounting for scaling factor overhead) + + +FP4 and FP8 quantization require additional memory for block-based scaling factors, which reduces the effective memory savings compared to the raw bit-width reduction. FP4 with block size 16 supports approximately 1.78× more tokens than FP8, and approximately 3.56× more tokens than BF16. The relative token capacity between FP8 and BF16 can be derived from these ratios. + + +This enables longer context lengths or more concurrent requests within the same memory budget. + +### Accuracy Impact + +#### FP8 Accuracy + +FP8 E4M3 quantization typically introduces minimal accuracy degradation. The impact depends on model architecture, sequence length, and quantization format (generally, E4M3 has better accuracy than E5M2). + +#### FP4 Accuracy + +FP4 (MXFP4) quantization provides significant memory savings with varying accuracy impact depending on model size and dataset complexity. Preliminary accuracy test results from [PR #10078](https://github.com/sgl-project/sglang/pull/10078) (MLA) and [PR #12612](https://github.com/sgl-project/sglang/pull/12612) (MHA) show: + +**Large Models (e.g., Qwen3-235B-A22B, DeepSeek-R1-0528)** + +On large-scale models, FP4 maintains accuracy close to FP8/BF16, especially on simpler datasets: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Dataset | KV16 | KV8 (FP8 E4M3) | KV4 (FP4 E2M1) |
| --- | --- | --- | --- | --- |
| Qwen3-235B-A22B | gsm8k | 0.9168 | 0.9181 | 0.9186 |
| Qwen3-235B-A22B | aime25 | 0.7733 | 0.7333 | 0.6000 |
| Qwen3-235B-A22B | gpqa_diamond | 0.7010 | 0.6899 | 0.6778 |
| DeepSeek-R1-0528 | gsm8k | 0.9157 | 0.9154 | 0.9124 |
| DeepSeek-R1-0528 | aime25 | 0.5067 | 0.4934 | 0.4000 |
| DeepSeek-R1-0528 | gpqa_diamond | 0.7707 | 0.7697 | 0.7273 |
+ +**Smaller Models (e.g., GPT-OSS-120B)** + +On smaller models, FP4 shows more pronounced accuracy drops, particularly on challenging datasets: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Dataset | KV16 | KV8 (FP8 E4M3) | KV4 (FP4 E2M1) |
| --- | --- | --- | --- | --- |
| GPT-OSS-120B | gsm8k | 0.9161 | 0.9163 | 0.9152 |
| GPT-OSS-120B | aime25 | 0.7533 | 0.7667 | 0.3533 |
| GPT-OSS-120B | gpqa_diamond | 0.5081 | 0.5434 | 0.3202 |
+ +**Key Observations:** + +- **Simple datasets (e.g., gsm8k)**: FP4 maintains accuracy close to FP8/BF16 across model sizes +- **Model size matters**: Large models (200B+ parameters) generally tolerate FP4 quantization better than smaller models +- **Context length**: Accuracy degradation may be more pronounced in long-context scenarios, as the accumulation of the quantization error may become significant. + + +Evaluate FP4 accuracy on your specific model and workload. Large models on simpler tasks typically show minimal degradation, while smaller models or complex reasoning tasks may require FP8 or BF16 for acceptable accuracy. + + +## Best Practices + +- **Use pre-quantized models**: Prefer models quantized offline with scaling factors included in the checkpoint. +- **Choose the right format**: Use `fp8_e4m3` for better accuracy (recommended), `fp8_e5m2` for larger dynamic range, or `fp4_e2m1` for maximum memory savings (experimental) +- **Check backend compatibility**: Verify that your chosen attention backend supports quantized KV cache + + +See also: +- [Quantization](./quantization) +- [Attention Backend](./attention_backend) +- [Server Arguments](./server_arguments) + diff --git a/docs_new/docs/advanced_features/rfork.mdx b/docs_new/docs/advanced_features/rfork.mdx new file mode 100644 index 000000000000..84bfaa8f8382 --- /dev/null +++ b/docs_new/docs/advanced_features/rfork.mdx @@ -0,0 +1,108 @@ +--- +title: "R-Fork" +metatags: + description: "SGLang R-Fork: zero-copy GPU-to-GPU weight loading, reduce boot-up time from minutes to seconds. NCCL and TransferEngine backends." +--- +R-Fork (Tensor Remote Fork) is a novel weight loading methodology that leverages efficient inter-node GPU-to-GPU data transfer path to load tensors from a running SGLang instance to a new instance with zero-copy. It can significantly optimize the SGLang instance boot-up time by reducing model weights loading from several minutes to mere seconds. 
+
+To learn more about R-Fork, see the **R-Fork blog**.
+
+## Usage
+
+| Argument | Usage |
+|----------|-------|
+| `load-format` | Set to `remote_instance` to enable R-Fork. |
+| `remote-instance-weight-loader-backend` | `nccl`, `transfer_engine`, or `modelexpress`. Default is `nccl`. |
+| `remote-instance-weight-loader-seed-instance-ip` | IP address of the seed instance that provides the model weights. Used by the `nccl` and `transfer_engine` backends. |
+| `remote-instance-weight-loader-seed-instance-service-port` | The port that the seed instance's HTTP server is listening on. Used by the `nccl` and `transfer_engine` backends. |
+| `remote-instance-weight-loader-send-weights-group-ports` | The list of available ports on the seed instance used to build NCCL communication groups between the seed and client instances. Only needed by the `nccl` backend. |
+| `remote-instance-weight-loader-start-seed-via-transfer-engine` | Set to start a seed service that supports TransferEngine as the backend. Needed on seed instances when using the `transfer_engine` backend. |
+| `modelexpress-config` | JSON config for the `modelexpress` backend. Keys: `"url"` (required, gRPC host:port of the ModelExpress server), `"model_name"` (optional, defaults to `--model-path`), `"source"` (optional bool, `true` for seed mode). |
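
Because `modelexpress-config` is passed as a JSON string, it can be easier to generate it than to hand-quote it. The following is a minimal sketch under stated assumptions: the address `mx-server:8001` and the model name `my-model` are hypothetical placeholders, and the helper names are illustrative rather than part of SGLang. It uses only the keys and flags listed in the table above, and it omits the model arguments a real launch would also need.

```python
import json
import shlex


def modelexpress_config(url, model_name=None, source=False):
    """Build the JSON string for --modelexpress-config (illustrative helper)."""
    cfg = {"url": url}  # "url" is the only required key
    if model_name is not None:
        cfg["model_name"] = model_name
    if source:
        cfg["source"] = True  # seed mode
    return json.dumps(cfg)


def client_argv(config_json):
    """Argument list for a client that loads weights via the modelexpress backend."""
    return [
        "python", "-m", "sglang.launch_server",
        "--load-format", "remote_instance",
        "--remote-instance-weight-loader-backend", "modelexpress",
        "--modelexpress-config", config_json,
    ]


# Hypothetical server address and model name, for illustration only.
seed_cfg = modelexpress_config("mx-server:8001", model_name="my-model", source=True)
client_cfg = modelexpress_config("mx-server:8001", model_name="my-model")

# shlex.join quotes the JSON so the command can be pasted into a shell.
print(shlex.join(client_argv(client_cfg)))
```

Generating the string with `json.dumps` avoids quoting mistakes such as writing Python's `True` where JSON expects `true`.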
+
+### NCCL as backend
+
+seed instance:
+```shell Command
+python -m sglang.launch_server [args]
+```
+
+client instance:
+```shell Command
+python -m sglang.launch_server [args] \
+    --load-format remote_instance \
+    --remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
+    --remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
+    --remote-instance-weight-loader-send-weights-group-ports [send_weights_nccl_group_ports_list] \
+    --remote-instance-weight-loader-backend nccl
+```
+
+### TransferEngine as backend
+
+seed instance:
+```shell Command
+python -m sglang.launch_server [args] \
+    --remote-instance-weight-loader-start-seed-via-transfer-engine
+```
+
+client instance:
+```shell Command
+python -m sglang.launch_server [args] \
+    --load-format remote_instance \
+    --remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
+    --remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
+    --remote-instance-weight-loader-backend transfer_engine
+```
+
+### ModelExpress as backend
+
+[ModelExpress](https://github.com/ai-dynamo/modelexpress) is a coordination service that manages P2P weight transfer metadata. It removes the need for direct seed IP/port configuration by providing a centralized registry that seeds publish to and clients discover from. Under the hood it uses TransferEngine (Mooncake) for the actual RDMA data transfer.
+
+A running ModelExpress server is required. See the [ModelExpress documentation](https://github.com/ai-dynamo/modelexpress) for setup instructions.
+
+seed instance:
+```bash Command
+python -m sglang.launch_server [args] \
+    --modelexpress-config '{"url": "[modelexpress_grpc_host:port]", "model_name": "[model_name]", "source": true}'
+```
+
+client instance:
+```bash Command
+python -m sglang.launch_server [args] \
+    --load-format remote_instance \
+    --remote-instance-weight-loader-backend modelexpress \
+    --modelexpress-config '{"url": "[modelexpress_grpc_host:port]", "model_name": "[model_name]"}'
+```
+
+The seed publishes its TransferEngine session ID and tensor layout to ModelExpress. The client queries ModelExpress to discover the seed, then pulls weights directly via RDMA. This enables dynamic seed discovery without hardcoding IPs, and supports multiple models through a single ModelExpress instance.
diff --git a/docs_new/docs/advanced_features/separate_reasoning.ipynb b/docs_new/docs/advanced_features/separate_reasoning.ipynb
new file mode 100644
index 000000000000..6277dd8bd4bc
--- /dev/null
+++ b/docs_new/docs/advanced_features/separate_reasoning.ipynb
@@ -0,0 +1,377 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Reasoning Parser\n",
+    "\n",
+    "SGLang supports parsing reasoning content out from \"normal\" content for reasoning models such as [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).\n",
+    "\n",
+    "## Supported Models & Parsers\n",
+    "\n",
+    "| Model  |  Reasoning tags      | Parser | Notes |\n",
+    "|---------|-----------------------------|------------------|-------|\n",
+    "| [DeepSeek‑R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |\n",
+    "| [DeepSeek‑V3 series](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Including [DeepSeek‑V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp). Supports `thinking` parameter |\n",
+    "| [Standard Qwen3 models](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |\n",
+    "| [Qwen3-Thinking models](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |\n",
+    "| [Kimi K2 Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) | `◁think▷` … `◁/think▷` | `kimi_k2` | Uses special thinking delimiters. Also requires `--tool-call-parser kimi_k2` for tool use. |\n",
+    "| [GPT OSS](https://huggingface.co/openai/gpt-oss-120b) | `<\\|channel\\|>analysis<\\|message\\|>` … `<\\|end\\|>` | `gpt-oss` | N/A |\n",
+    "### Model-Specific Behaviors\n",
+    "\n",
+    "**DeepSeek-R1 Family:**\n",
+    "- DeepSeek-R1: No `<think>` start tag, jumps directly to thinking content\n",
+    "- DeepSeek-R1-0528: Generates both `<think>` start and `</think>` end tags\n",
+    "- Both are handled by the same `deepseek-r1` parser\n",
+    "\n",
+    "**DeepSeek-V3 Family:**\n",
+    "- DeepSeek-V3.1/V3.2: Hybrid model supporting both thinking and non-thinking modes, use the `deepseek-v3` parser and `thinking` parameter (NOTE: not `enable_thinking`)\n",
+    "\n",
+    "**Qwen3 Family:**\n",
+    "- Standard Qwen3 (e.g., Qwen3-2507): Use `qwen3` parser, supports `enable_thinking` in chat templates\n",
+    "- Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): Use `qwen3` or `qwen3-thinking` parser, always thinks\n",
+    "\n",
+    "**Kimi K2:**\n",
+    "- Kimi K2 Thinking: Uses special `◁think▷` and `◁/think▷` tags. For agentic tool use, also specify `--tool-call-parser kimi_k2`.\n",
+    "\n",
+    "**GPT OSS:**\n",
+    "- GPT OSS: Uses special `<|channel|>analysis<|message|>` and `<|end|>` tags"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Usage\n",
+    "\n",
+    "### Launching the Server"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Specify the `--reasoning-parser` option."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "from openai import OpenAI\n", + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", + "\n", + "server_process, port = launch_server_cmd(\n", + " \"python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning\"\n", + ")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that `--reasoning-parser` defines the parser used to interpret responses." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### OpenAI Compatible API\n", + "\n", + "Using the OpenAI compatible API, the contract follows the [DeepSeek API design](https://api-docs.deepseek.com/guides/reasoning_model) established with the release of DeepSeek-R1:\n", + "\n", + "- `reasoning_content`: The content of the CoT.\n", + "- `content`: The content of the final answer." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initialize OpenAI-like client\n",
+    "client = OpenAI(api_key=\"None\", base_url=f\"http://0.0.0.0:{port}/v1\")\n",
+    "model_name = client.models.list().data[0].id\n",
+    "\n",
+    "messages = [\n",
+    "    {\n",
+    "        \"role\": \"user\",\n",
+    "        \"content\": \"What is 1+3?\",\n",
+    "    }\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Non-Streaming Request"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response_non_stream = client.chat.completions.create(\n",
+    "    model=model_name,\n",
+    "    messages=messages,\n",
+    "    temperature=0.6,\n",
+    "    top_p=0.95,\n",
+    "    stream=False,  # Non-streaming\n",
+    "    extra_body={\"separate_reasoning\": True},\n",
+    ")\n",
+    "print_highlight(\"==== Reasoning ====\")\n",
+    "print_highlight(response_non_stream.choices[0].message.reasoning_content)\n",
+    "\n",
+    "print_highlight(\"==== Text ====\")\n",
+    "print_highlight(response_non_stream.choices[0].message.content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Streaming Request"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response_stream = client.chat.completions.create(\n",
+    "    model=model_name,\n",
+    "    messages=messages,\n",
+    "    temperature=0.6,\n",
+    "    top_p=0.95,\n",
+    "    stream=True,  # Streaming\n",
+    "    extra_body={\"separate_reasoning\": True},\n",
+    ")\n",
+    "\n",
+    "reasoning_content = \"\"\n",
+    "content = \"\"\n",
+    "for chunk in response_stream:\n",
+    "    if chunk.choices[0].delta.content:\n",
+    "        content += chunk.choices[0].delta.content\n",
+    "    if chunk.choices[0].delta.reasoning_content:\n",
+    "        reasoning_content += chunk.choices[0].delta.reasoning_content\n",
+    "\n",
+    "print_highlight(\"==== Reasoning ====\")\n",
+    "print_highlight(reasoning_content)\n",
+    "\n",
+    "print_highlight(\"==== Text ====\")\n",
+    "print_highlight(content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, you can buffer the reasoning content to the last reasoning chunk (or the first chunk after the reasoning content)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response_stream = client.chat.completions.create(\n",
+    "    model=model_name,\n",
+    "    messages=messages,\n",
+    "    temperature=0.6,\n",
+    "    top_p=0.95,\n",
+    "    stream=True,  # Streaming, with reasoning delivered as a single buffered chunk\n",
+    "    extra_body={\"separate_reasoning\": True, \"stream_reasoning\": False},\n",
+    ")\n",
+    "\n",
+    "reasoning_content = \"\"\n",
+    "content = \"\"\n",
+    "for chunk in response_stream:\n",
+    "    if chunk.choices[0].delta.content:\n",
+    "        content += chunk.choices[0].delta.content\n",
+    "    if chunk.choices[0].delta.reasoning_content:\n",
+    "        reasoning_content += chunk.choices[0].delta.reasoning_content\n",
+    "\n",
+    "print_highlight(\"==== Reasoning ====\")\n",
+    "print_highlight(reasoning_content)\n",
+    "\n",
+    "print_highlight(\"==== Text ====\")\n",
+    "print_highlight(content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Reasoning separation is enabled by default when `--reasoning-parser` is specified.\n",
+    "**To disable it, set the `separate_reasoning` option to `False` in the request.**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response_non_stream = client.chat.completions.create(\n",
+    "    model=model_name,\n",
+    "    messages=messages,\n",
+    "    temperature=0.6,\n",
+    "    top_p=0.95,\n",
+    "    stream=False,  # Non-streaming\n",
+    "    extra_body={\"separate_reasoning\": False},\n",
+    ")\n",
+    "\n",
+    "print_highlight(\"==== Original Output ====\")\n",
+    "print_highlight(response_non_stream.choices[0].message.content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### SGLang Native API "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
+    "input = tokenizer.apply_chat_template(\n",
+    "    messages, tokenize=False, add_generation_prompt=True, return_dict=False\n",
+    ")\n",
+    "\n",
+    "gen_url = f\"http://localhost:{port}/generate\"\n",
+    "gen_data = {\n",
+    "    \"text\": input,\n",
+    "    \"sampling_params\": {\n",
+    "        \"skip_special_tokens\": False,\n",
+    "        \"max_new_tokens\": 1024,\n",
+    "        \"temperature\": 0.6,\n",
+    "        \"top_p\": 0.95,\n",
+    "    },\n",
+    "}\n",
+    "gen_response = requests.post(gen_url, json=gen_data).json()[\"text\"]\n",
+    "\n",
+    "print_highlight(\"==== Original Output ====\")\n",
+    "print_highlight(gen_response)\n",
+    "\n",
+    "parse_url = f\"http://localhost:{port}/separate_reasoning\"\n",
+    "separate_reasoning_data = {\n",
+    "    \"text\": gen_response,\n",
+    "    \"reasoning_parser\": \"deepseek-r1\",\n",
+    "}\n",
+    "separate_reasoning_response_json = requests.post(\n",
+    "    parse_url, json=separate_reasoning_data\n",
+    ").json()\n",
+    "print_highlight(\"==== Reasoning ====\")\n",
+    "print_highlight(separate_reasoning_response_json[\"reasoning_text\"])\n",
+    "print_highlight(\"==== Text ====\")\n",
+    "print_highlight(separate_reasoning_response_json[\"text\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "terminate_process(server_process)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Offline Engine API"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sglang as sgl\n",
+    "from sglang.srt.parser.reasoning_parser import ReasoningParser\n",
+    "from sglang.utils import print_highlight\n",
+    "\n",
+    "llm = sgl.Engine(model_path=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
+    "input = tokenizer.apply_chat_template(\n",
+    "    messages, tokenize=False, add_generation_prompt=True, return_dict=False\n",
+    ")\n",
+    "sampling_params = {\n",
+    "    \"max_new_tokens\": 1024,\n",
+    "    \"skip_special_tokens\": False,\n",
+    "    \"temperature\": 0.6,\n",
+    "    \"top_p\": 0.95,\n",
+    "}\n",
+    "result = llm.generate(prompt=input, sampling_params=sampling_params)\n",
+    "\n",
+    "generated_text = result[\"text\"]  # Assume there is only one prompt\n",
+    "\n",
+    "print_highlight(\"==== Original Output ====\")\n",
+    "print_highlight(generated_text)\n",
+    "\n",
+    "parser = ReasoningParser(\"deepseek-r1\")\n",
+    "reasoning_text, text = parser.parse_non_stream(generated_text)\n",
+    "print_highlight(\"==== Reasoning ====\")\n",
+    "print_highlight(reasoning_text)\n",
+    "print_highlight(\"==== Text ====\")\n",
+    "print_highlight(text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "llm.shutdown()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Supporting New Reasoning Model Schemas\n",
+    "\n",
+    "For future reasoning models, you can implement the reasoning parser as a subclass of `BaseReasoningFormatDetector` in `python/sglang/srt/parser/reasoning_parser.py` and specify the reasoning parser for new reasoning model schemas accordingly."
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/docs_new/docs/advanced_features/separate_reasoning.mdx b/docs_new/docs/advanced_features/separate_reasoning.mdx
new file mode 100644
index 000000000000..e0bea35eed0f
--- /dev/null
+++ b/docs_new/docs/advanced_features/separate_reasoning.mdx
@@ -0,0 +1,317 @@
+---
+title: "Reasoning Parser"
+metatags:
+    description: "SGLang reasoning parser: separate thinking content from output for DeepSeek R1, Qwen3, Kimi K2, GPT-OSS reasoning models."
+---
+SGLang supports parsing reasoning content out from "normal" content for reasoning models such as [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).
+
+## Supported Models & Parsers
+
+| Model | Reasoning tags | Parser | Notes |
+|-------|----------------|--------|-------|
+| [DeepSeek‑R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |
+| [DeepSeek‑V3 series](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Including [DeepSeek‑V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp). Supports `thinking` parameter |
+| [Standard Qwen3 models](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |
+| [Qwen3-Thinking models](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |
+| [Kimi K2 Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) | `◁think▷` … `◁/think▷` | `kimi_k2` | Uses special thinking delimiters. Also requires `--tool-call-parser kimi_k2` for tool use. |
+| [GPT OSS](https://huggingface.co/openai/gpt-oss-120b) | `<\|channel\|>analysis<\|message\|>` … `<\|end\|>` | `gpt-oss` | N/A |
+### Model-Specific Behaviors
+
+**DeepSeek-R1 Family:**
+- DeepSeek-R1: No `<think>` start tag, jumps directly to thinking content
+- DeepSeek-R1-0528: Generates both `<think>` start and `</think>` end tags
+- Both are handled by the same `deepseek-r1` parser
+
+**DeepSeek-V3 Family:**
+- DeepSeek-V3.1/V3.2: Hybrid model supporting both thinking and non-thinking modes, use the `deepseek-v3` parser and `thinking` parameter (NOTE: not `enable_thinking`)
+
+**Qwen3 Family:**
+- Standard Qwen3 (e.g., Qwen3-2507): Use `qwen3` parser, supports `enable_thinking` in chat templates
+- Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): Use `qwen3` or `qwen3-thinking` parser, always thinks
+
+**Kimi K2:**
+- Kimi K2 Thinking: Uses special `◁think▷` and `◁/think▷` tags. For agentic tool use, also specify `--tool-call-parser kimi_k2`.
+
+**GPT OSS:**
+- GPT OSS: Uses special `<|channel|>analysis<|message|>` and `<|end|>` tags
+
+
+## Usage
+
+### Launching the Server
+
+
+Specify the `--reasoning-parser` option.
+
+
+
+```python Example
+import requests
+from openai import OpenAI
+from sglang.test.doc_patch import launch_server_cmd
+from sglang.utils import wait_for_server, print_highlight, terminate_process
+
+server_process, port = launch_server_cmd(
+    "python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning"
+)
+
+wait_for_server(f"http://localhost:{port}")
+```
+
+Note that `--reasoning-parser` defines the parser used to interpret responses.
+
+
+### OpenAI Compatible API
+
+Using the OpenAI compatible API, the contract follows the [DeepSeek API design](https://api-docs.deepseek.com/guides/reasoning_model) established with the release of DeepSeek-R1:
+
+- `reasoning_content`: The content of the CoT.
+- `content`: The content of the final answer.
+
+
+
+```python Example
+# Initialize OpenAI-like client
+client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1")
+model_name = client.models.list().data[0].id
+
+messages = [
+    {
+        "role": "user",
+        "content": "What is 1+3?",
+    }
+]
+```
+
+#### Non-Streaming Request
+
+
+
+```python Example
+response_non_stream = client.chat.completions.create(
+    model=model_name,
+    messages=messages,
+    temperature=0.6,
+    top_p=0.95,
+    stream=False,  # Non-streaming
+    extra_body={"separate_reasoning": True},
+)
+print_highlight("==== Reasoning ====")
+print_highlight(response_non_stream.choices[0].message.reasoning_content)
+
+print_highlight("==== Text ====")
+print_highlight(response_non_stream.choices[0].message.content)
+```
+
+#### Streaming Request
+
+
+
+```python Example
+response_stream = client.chat.completions.create(
+    model=model_name,
+    messages=messages,
+    temperature=0.6,
+    top_p=0.95,
+    stream=True,  # Streaming
+    extra_body={"separate_reasoning": True},
+)
+
+reasoning_content = ""
+content = ""
+for chunk in response_stream:
+    if chunk.choices[0].delta.content:
+        content += chunk.choices[0].delta.content
+    if chunk.choices[0].delta.reasoning_content:
+        reasoning_content += chunk.choices[0].delta.reasoning_content
+
+print_highlight("==== Reasoning ====")
+print_highlight(reasoning_content)
+
+print_highlight("==== Text ====")
+print_highlight(content)
+```
+
+Optionally, you can buffer the reasoning content to the last reasoning chunk (or the first chunk after the reasoning content).
+
+
+
+```python Example
+response_stream = client.chat.completions.create(
+    model=model_name,
+    messages=messages,
+    temperature=0.6,
+    top_p=0.95,
+    stream=True,  # Streaming, with reasoning delivered as a single buffered chunk
+    extra_body={"separate_reasoning": True, "stream_reasoning": False},
+)
+
+reasoning_content = ""
+content = ""
+for chunk in response_stream:
+    if chunk.choices[0].delta.content:
+        content += chunk.choices[0].delta.content
+    if chunk.choices[0].delta.reasoning_content:
+        reasoning_content += chunk.choices[0].delta.reasoning_content
+
+print_highlight("==== Reasoning ====")
+print_highlight(reasoning_content)
+
+print_highlight("==== Text ====")
+print_highlight(content)
+```
+
+Reasoning separation is enabled by default when `--reasoning-parser` is specified.
+**To disable it, set the `separate_reasoning` option to `False` in the request.**
+
+
+
+```python Example
+response_non_stream = client.chat.completions.create(
+    model=model_name,
+    messages=messages,
+    temperature=0.6,
+    top_p=0.95,
+    stream=False,  # Non-streaming
+    extra_body={"separate_reasoning": False},
+)
+
+print_highlight("==== Original Output ====")
+print_highlight(response_non_stream.choices[0].message.content)
+```
+
+### SGLang Native API
+
+
+
+```python Example
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
+input = tokenizer.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True, return_dict=False
+)
+
+gen_url = f"http://localhost:{port}/generate"
+gen_data = {
+    "text": input,
+    "sampling_params": {
+        "skip_special_tokens": False,
+        "max_new_tokens": 1024,
+        "temperature": 0.6,
+        "top_p": 0.95,
+    },
+}
+gen_response = requests.post(gen_url, json=gen_data).json()["text"]
+
+print_highlight("==== Original Output ====")
+print_highlight(gen_response)
+
+parse_url = f"http://localhost:{port}/separate_reasoning"
+separate_reasoning_data = {
+    "text": gen_response,
+    "reasoning_parser": "deepseek-r1",
+}
+separate_reasoning_response_json = requests.post(
+    parse_url, json=separate_reasoning_data
+).json()
+print_highlight("==== Reasoning ====")
+print_highlight(separate_reasoning_response_json["reasoning_text"])
+print_highlight("==== Text ====")
+print_highlight(separate_reasoning_response_json["text"])
+```
+
+
+```python Example
+terminate_process(server_process)
+```
+
+### Offline Engine API
+
+
+
+```python Example
+import sglang as sgl
+from sglang.srt.parser.reasoning_parser import ReasoningParser
+from sglang.utils import print_highlight
+
+llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
+tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
+input = tokenizer.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True, return_dict=False
+)
+sampling_params = {
+    "max_new_tokens": 1024,
+    "skip_special_tokens": False,
+    "temperature": 0.6,
+    "top_p": 0.95,
+}
+result = llm.generate(prompt=input, sampling_params=sampling_params)
+
+generated_text = result["text"]  # Assume there is only one prompt
+
+print_highlight("==== Original Output ====")
+print_highlight(generated_text)
+
+parser = ReasoningParser("deepseek-r1")
+reasoning_text, text = parser.parse_non_stream(generated_text)
+print_highlight("==== Reasoning ====")
+print_highlight(reasoning_text)
+print_highlight("==== Text ====")
+print_highlight(text)
+```
+
+
+```python Example
+llm.shutdown()
+```
+
+## Supporting New Reasoning Model Schemas
+
+For future reasoning models, you can implement the reasoning parser as a subclass of `BaseReasoningFormatDetector` in `python/sglang/srt/parser/reasoning_parser.py` and specify the reasoning parser for new reasoning model schemas accordingly.
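
To make the tag-based separation concrete, here is a minimal standalone sketch of splitting `<think>` … `</think>` output into reasoning and answer text. It is an illustration only: `split_reasoning` is a hypothetical helper, it does not subclass `BaseReasoningFormatDetector`, and it is not how SGLang's parsers are implemented. Treating output without an end tag as a plain answer is an assumption made for this sketch.

```python
def split_reasoning(text, start="<think>", end="</think>"):
    """Split model output into (reasoning, answer) using think-style tags.

    Hypothetical helper for illustration; not part of SGLang.
    """
    if end not in text:
        # Assumption: without an end tag, treat everything as the answer.
        return "", text.strip()
    head, _, tail = text.partition(end)
    # Some models omit the start tag and begin directly with reasoning,
    # so the start tag is stripped only when it is present.
    if start in head:
        head = head.split(start, 1)[1]
    return head.strip(), tail.strip()


with_start_tag = split_reasoning("<think>1 + 3 = 4</think>The answer is 4.")
without_start_tag = split_reasoning("1 + 3 = 4</think>The answer is 4.")
print(with_start_tag)
print(without_start_tag)
```

A real detector also has to handle streaming, where a tag can arrive split across chunks; this sketch covers only the complete, non-streaming case.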
diff --git a/docs_new/docs/advanced_features/server_arguments.mdx b/docs_new/docs/advanced_features/server_arguments.mdx
new file mode 100644
index 000000000000..902601797552
--- /dev/null
+++ b/docs_new/docs/advanced_features/server_arguments.mdx
@@ -0,0 +1,2843 @@
+---
+title: "Server Arguments"
+metatags:
+    description: "SGLang server arguments: model selection, TP/DP parallelism, memory management, quantization, logging, and optimization options."
+---
+This page provides a list of server arguments used in the command line to configure the behavior
+and performance of the language model server during deployment. These arguments enable users to
+customize key aspects of the server, including model selection, parallelism policies,
+memory management, and optimization techniques.
+You can list all arguments with `python3 -m sglang.launch_server --help`.
+
+## Common launch commands
+
+- To use a configuration file, create a YAML file with your server arguments and specify it with `--config`. CLI arguments will override config file values.
+
+  ```bash Command
+  # Create config.yaml
+  cat > config.yaml << EOF
+  model-path: meta-llama/Meta-Llama-3-8B-Instruct
+  host: 0.0.0.0
+  port: 30000
+  tensor-parallel-size: 2
+  enable-metrics: true
+  log-requests: true
+  EOF
+
+  # Launch server with config file
+  python -m sglang.launch_server --config config.yaml
+  ```
+
+- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
+
+  ```bash Command
+  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
+  ```
+
+- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend [SGLang Model Gateway (formerly Router)](../advanced_features/sgl_model_gateway) for data parallelism.
+
+  ```bash Command
+  python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
+  ```
+
+- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
+
+  ```bash Command
+  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
+  ```
+
+- See [hyperparameter tuning](./hyperparameter_tuning) on tuning hyperparameters for better performance.
+- For Docker and Kubernetes runs, you need to set up shared memory, which is used for communication between processes. See `--shm-size` for Docker and the `/dev/shm` size update for Kubernetes manifests.
+- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
+
+  ```bash Command
+  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
+  ```
+- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
+- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e4m3` or `--kv-cache-dtype fp8_e5m2`.
+- To enable deterministic inference and batch invariant operations, add `--enable-deterministic-inference`. More details can be found in the [deterministic inference document](./deterministic_inference).
+- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template). If the tokenizer has multiple named templates (e.g., 'default', 'tool_use'), you can select one using `--hf-chat-template-name tool_use`.
+- (Note: this feature is no longer maintained and might cause errors.) To enable `torch.compile` acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. By default, the cache path is `/tmp/torchinductor_root`; you can customize it using the environment variable `TORCHINDUCTOR_CACHE_DIR`. For more details, please refer to the [PyTorch official documentation](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html) and [Enabling cache for torch.compile](../references/torch_compile_cache).
+- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; you can then use the following commands. If you hit a deadlock, try adding `--disable-cuda-graph`.
+  ```bash Command
+  # Node 0
+  python -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
+    --tp 4 \
+    --dist-init-addr sgl-dev-0:50000 \
+    --nnodes 2 \
+    --node-rank 0
+
+  # Node 1
+  python -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
+    --tp 4 \
+    --dist-init-addr sgl-dev-0:50000 \
+    --nnodes 2 \
+    --node-rank 1
+  ```
+
+Please consult the documentation below and [server_args.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py) to learn more about the arguments you may provide when launching a server.
+
+## Model and tokenizer
+
+| Argument | Description | Defaults | Options |
+|----------|-------------|----------|---------|
+| `--model-path`<br/>`--model` | The path of the model weights. This can be a local folder or a Hugging Face repo ID. | `None` | Type: str |
+| `--tokenizer-path` | The path of the tokenizer. | `None` | Type: str |
+| `--tokenizer-mode` | Tokenizer mode. 'auto' will use the fast tokenizer if available, and 'slow' will always use the slow tokenizer. | `auto` | auto, slow |
+| `--tokenizer-backend` | Tokenizer backend. 'huggingface' uses the default HuggingFace tokenizers library; 'fastokens' uses the fastokens library for faster tokenization. Requires the fastokens package to be installed. | `huggingface` | huggingface, fastokens |
+| `--tokenizer-worker-num` | The worker num of the tokenizer manager. | `1` | Type: int |
+| `--skip-tokenizer-init` | If set, skip init tokenizer and pass input_ids in generate request. | `False` | bool flag (set to enable) |
+| `--load-format` | The format of the model weights to load. "auto" will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available. "pt" will load the weights in the pytorch bin format. "safetensors" will load the weights in the safetensors format. "npcache" will load the weights in pytorch format and store a numpy cache to speed up the loading. "dummy" will initialize the weights with random values, which is mainly for profiling. "gguf" will load the weights in the gguf format. "bitsandbytes" will load the weights using bitsandbytes quantization. "layered" loads weights layer by layer so that one can quantize a layer before loading another to make the peak memory envelope smaller. "flash_rl" will load the weights in flash_rl format. "fastsafetensors" and "private" are also supported. "runai_streamer" enables direct model loading from object storage and shared file systems. | `auto` | auto, pt, safetensors, npcache, dummy, sharded_state, gguf, bitsandbytes, layered, flash_rl, remote, remote_instance, fastsafetensors, private, runai_streamer |
+| `--model-loader-extra-config` | Extra config for model loader. This will be passed to the model loader corresponding to the chosen load_format. | `{}` | Type: str |
+| `--trust-remote-code` | Whether or not to allow for custom models defined on the Hub in their own modeling files. | `False` | bool flag (set to enable) |
+| `--context-length` | The model's maximum context length. Defaults to None (will use the value from the model's config.json instead). | `None` | Type: int |
+| `--is-embedding` | Whether to use a CausalLM as an embedding model. | `False` | bool flag (set to enable) |
+| `--enable-multimodal` | Enable the multimodal functionality for the served model. If the model being served is not multimodal, nothing will happen. | `None` | bool flag (set to enable) |
+| `--revision` | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. | `None` | Type: str |
+| `--model-impl` | Which implementation of the model to use. "auto" will try to use the SGLang implementation if it exists and fall back to the Transformers implementation if no SGLang implementation is available. "sglang" will use the SGLang model implementation. "transformers" will use the Transformers model implementation. | `auto` | Type: str |

## HTTP server

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--host` | The host of the HTTP server. | `127.0.0.1` | Type: str |
| `--port` | The port of the HTTP server. | `30000` | Type: int |
| `--fastapi-root-path` | Root path to use when the app is behind a path-based routing proxy. | `""` | Type: str |
| `--grpc-mode` | If set, use gRPC server instead of HTTP server. | `False` | bool flag (set to enable) |
| `--skip-server-warmup` | If set, skip warmup. | `False` | bool flag (set to enable) |
| `--warmups` | Comma-separated list of custom warmup functions to run before the server starts, e.g., `--warmups=warmup_name1,warmup_name2` runs the functions `warmup_name1` and `warmup_name2` specified in `warmup.py` before the server starts listening for requests. | `None` | Type: str |
| `--nccl-port` | The port for NCCL distributed environment setup. Defaults to a random port. | `None` | Type: int |
| `--checkpoint-engine-wait-weights-before-ready` | If set, the server will wait for initial weights to be loaded via checkpoint-engine or other update methods before serving inference requests. | `False` | bool flag (set to enable) |
+ +## Quantization and data type + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentDescriptionDefaultsOptions
`--dtype`Data type for model weights and activations. * "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. * "half" for FP16. Recommended for AWQ quantization. * "float16" is the same as "half". * "bfloat16" for a balance between precision and range. * "float" is shorthand for FP32 precision. * "float32" for FP32 precision.`auto`auto, half, float16, bfloat16, float, float32
`--quantization`The quantization method.`None`awq, fp8, gptq, marlin, gptq_marlin, awq_marlin, bitsandbytes, gguf, modelopt, modelopt_fp8, modelopt_fp4, petit_nvfp4, w8a8_int8, w8a8_fp8, moe_wna16, qoq, w4afp8, mxfp4, mxfp8, auto-round, compressed-tensors, modelslim, quark_int4fp8_moe
`--quantization-param-path`Path to the JSON file containing the KV cache scaling factors. This should generally be supplied, when KV cache dtype is FP8. Otherwise, KV cache scaling factors default to 1.0, which may cause accuracy issues.`None`Type: Optional[str]
`--kv-cache-dtype`Data type for kv cache storage. "auto" will use model data type. "bf16" or "bfloat16" for BF16 KV cache. "fp8_e5m2" and "fp8_e4m3" are supported for CUDA 11.8+. "fp4_e2m1" (only mxfp4) is supported for CUDA 12.8+ and PyTorch 2.8.0+`auto`auto, fp8_e5m2, fp8_e4m3, bf16, bfloat16, fp4_e2m1
`--enable-fp32-lm-head`If set, the LM head outputs (logits) are in FP32.`False`bool flag (set to enable)
`--modelopt-quant`The ModelOpt quantization configuration. Supported values: 'fp8', 'int4_awq', 'w4a8_awq', 'nvfp4', 'nvfp4_awq'. This requires the NVIDIA Model Optimizer library to be installed: pip install nvidia-modelopt`None`Type: str
`--modelopt-checkpoint-restore-path`Path to restore a previously saved ModelOpt quantized checkpoint. If provided, the quantization process will be skipped and the model will be loaded from this checkpoint.`None`Type: str
`--modelopt-checkpoint-save-path`Path to save the ModelOpt quantized checkpoint after quantization. This allows reusing the quantized model in future runs.`None`Type: str
`--modelopt-export-path`Path to export the quantized model in HuggingFace format after ModelOpt quantization. The exported model can then be used directly with SGLang for inference. If not provided, the model will not be exported.`None`Type: str
`--quantize-and-serve`Quantize the model with ModelOpt and immediately serve it without exporting. This is useful for development and prototyping. For production, it's recommended to use separate quantization and deployment steps.`False`bool flag (set to enable)
`--rl-quant-profile`Path to the FlashRL quantization profile. Required when using --load-format flash_rl.`None`Type: str
+ +## Memory and scheduling + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentDescriptionDefaultsOptions
`--mem-fraction-static`The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors.`None`Type: float
`--max-running-requests`The maximum number of running requests.`None`Type: int
`--max-queued-requests`The maximum number of queued requests. This option is ignored when using disaggregation-mode.`None`Type: int
`--max-total-tokens`The maximum number of tokens in the memory pool. If not specified, it will be automatically calculated based on the memory usage fraction. This option is typically used for development and debugging purposes.`None`Type: int
`--chunked-prefill-size`The maximum number of tokens in a chunk for the chunked prefill. Setting this to -1 means disabling chunked prefill.`None`Type: int
`--prefill-max-requests`The maximum number of requests in a prefill batch. If not specified, there is no limit.`None`Type: int
`--enable-dynamic-chunking`Enable dynamic chunk size adjustment for pipeline parallelism. When enabled, chunk sizes are dynamically calculated based on fitted function to maintain consistent execution time across chunks.`False`bool flag (set to enable)
`--max-prefill-tokens`The maximum number of tokens in a prefill batch. The real bound will be the maximum of this value and the model's maximum context length.`16384`Type: int
`--schedule-policy`The scheduling policy of the requests.`fcfs``lpm`, `random`, `fcfs`, `dfs-weight`, `lof`, `priority`, `routing-key`
`--enable-priority-scheduling`Enable priority scheduling. Requests with higher priority integer values will be scheduled first by default.`False`bool flag (set to enable)
`--abort-on-priority-when-disabled`If set, abort requests that specify a priority when priority scheduling is disabled.`False`bool flag (set to enable)
`--schedule-low-priority-values-first`If specified with --enable-priority-scheduling, the scheduler will schedule requests with lower priority integer values first.`False`bool flag (set to enable)
`--priority-scheduling-preemption-threshold`Minimum difference in priorities for an incoming request to preempt running request(s).`10`Type: int
`--schedule-conservativeness`How conservative the schedule policy is. A larger value means more conservative scheduling. Use a larger value if you see requests being retracted frequently.`1.0`Type: float
`--page-size`The number of tokens in a page.`1`Type: int
`--swa-full-tokens-ratio`The ratio of SWA layer KV tokens / full layer KV tokens, regardless of the number of swa:full layers. It should be between 0 and 1. E.g. 0.5 means if each swa layer has 50 tokens, then each full layer has 100 tokens.`0.8`Type: float
`--disable-hybrid-swa-memory`Disable the hybrid SWA memory.`False`bool flag (set to enable)
`--radix-eviction-policy`The eviction policy of radix trees. 'lru' stands for Least Recently Used, 'lfu' stands for Least Frequently Used.`lru``lru`, `lfu`
`--enable-prefill-delayer`Enable prefill delayer for DP attention to reduce idle time.`False`bool flag (set to enable)
`--prefill-delayer-max-delay-passes`Maximum forward passes to delay prefill.`30`Type: int
`--prefill-delayer-token-usage-low-watermark`Token usage low watermark for prefill delayer.`None`Type: float
`--prefill-delayer-forward-passes-buckets`Custom buckets for prefill delayer forward passes histogram. 0 and max_delay_passes-1 will be auto-added.`None`List[float]
`--prefill-delayer-wait-seconds-buckets`Custom buckets for prefill delayer wait seconds histogram. 0 will be auto-added.`None`List[float]
+ +## Runtime options + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentDescriptionDefaultsOptions
`--device`The device to use ('cuda', 'xpu', 'hpu', 'npu', 'cpu'). Defaults to auto-detection if not specified.`None`Type: str
`--tensor-parallel-size` `--tp-size`The tensor parallelism size.`1`Type: int
`--pipeline-parallel-size` `--pp-size`The pipeline parallelism size.`1`Type: int
`--attention-context-parallel-size` `--attn-cp-size`The attention context parallelism size.`1`Type: int
`--moe-data-parallel-size` `--moe-dp-size`The moe data parallelism size.`1`Type: int
`--pp-max-micro-batch-size`The maximum micro batch size in pipeline parallelism.`None`Type: int
`--pp-async-batch-depth`The async batch depth of pipeline parallelism.`0`Type: int
`--stream-interval`The interval (or buffer size) for streaming in terms of the token length. A smaller value makes streaming smoother, while a larger value makes the throughput higher.`1`Type: int
`--incremental-streaming-output`Whether to output as a sequence of disjoint segments.`False`bool flag (set to enable)
`--random-seed`The random seed.`None`Type: int
`--constrained-json-whitespace-pattern`(outlines and llguidance backends only) Regex pattern for syntactic whitespaces allowed in JSON constrained output. For example, to allow the model to generate consecutive whitespaces, set the pattern to `[\n\t ]*`.`None`Type: str
`--constrained-json-disable-any-whitespace`(xgrammar and llguidance backends only) Enforce compact representation in JSON constrained output.`False`bool flag (set to enable)
`--watchdog-timeout`Set watchdog timeout in seconds. If a forward batch takes longer than this, the server will crash to prevent hanging.`300`Type: float
`--soft-watchdog-timeout`Set soft watchdog timeout in seconds. If a forward batch takes longer than this, the server will dump information for debugging.`None`Type: float
`--dist-timeout`Set timeout for torch.distributed initialization.`None`Type: int
`--download-dir`Model download directory for huggingface.`None`Type: str
`--model-checksum`Model file integrity verification. If provided without value, uses model-path as HF repo ID. Otherwise, provide checksums JSON file path or HuggingFace repo ID.`None`Type: str
`--base-gpu-id`The base GPU ID to start allocating GPUs from. Useful when running multiple instances on the same machine.`0`Type: int
`--gpu-id-step`The delta between consecutive GPU IDs that are used. For example, setting it to 2 will use GPU 0,2,4,...`1`Type: int
`--sleep-on-idle`Reduce CPU usage when sglang is idle.`False`bool flag (set to enable)
`--custom-sigquit-handler`Register a custom SIGQUIT handler so you can do additional cleanup after the server is shut down. This is only available for Engine, not for CLI.`None`Type: str
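
The interaction between `--base-gpu-id` and `--gpu-id-step` can be sketched as simple arithmetic. This is an illustration only, assuming one GPU per worker; the helper name `gpu_ids` and the `num_workers` parameter are hypothetical and not part of SGLang.

```python
def gpu_ids(base_gpu_id: int, gpu_id_step: int, num_workers: int) -> list[int]:
    """Hypothetical helper: GPU IDs used by `num_workers` workers.

    Assumes worker i is placed on GPU base_gpu_id + i * gpu_id_step.
    """
    return [base_gpu_id + i * gpu_id_step for i in range(num_workers)]


# With --base-gpu-id 0 --gpu-id-step 2, three workers land on GPUs 0, 2, 4.
print(gpu_ids(0, 2, 3))
# A second instance on the same machine could start at --base-gpu-id 1.
print(gpu_ids(1, 2, 3))
```

Interleaving two instances this way (even IDs for one, odd IDs for the other) is one plausible use of a step larger than 1.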
+ +## Logging + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentDescriptionDefaultsOptions
`--log-level`The logging level of all loggers.`info`Type: str
`--log-level-http`The logging level of HTTP server. If not set, reuse --log-level by default.`None`Type: str
`--log-requests`Log metadata, inputs, outputs of all requests. The verbosity is decided by --log-requests-level`False`bool flag (set to enable)
`--log-requests-level`0: Log metadata (no sampling parameters). 1: Log metadata and sampling parameters. 2: Log metadata, sampling parameters and partial input/output. 3: Log every input/output.`2`0, 1, 2, 3
`--log-requests-format`Format for request logging: 'text' (human-readable) or 'json' (structured)`text`text, json
`--log-requests-target`Target(s) for request logging: 'stdout' and/or directory path(s) for file output. Can specify multiple targets, e.g., '--log-requests-target stdout /my/path'.`None`List[str]
`--uvicorn-access-log-exclude-prefixes`Exclude uvicorn access logs whose request path starts with any of these prefixes. Defaults to empty (disabled).`[]`List[str]
`--crash-dump-folder`Folder path to dump requests from the last 5 min before a crash (if any). If not specified, crash dumping is disabled.`None`Type: str
`--show-time-cost`Show time cost of custom marks.`False`bool flag (set to enable)
`--enable-metrics`Enable logging of Prometheus metrics.`False`bool flag (set to enable)
`--enable-mfu-metrics`Enable estimated MFU-related prometheus metrics.`False`bool flag (set to enable)
`--enable-metrics-for-all-schedulers`Enable --enable-metrics-for-all-schedulers when you want schedulers on all TP ranks (not just TP 0) to record request metrics separately. This is especially useful when dp_attention is enabled, as otherwise all metrics appear to come from TP 0.`False`bool flag (set to enable)
`--tokenizer-metrics-custom-labels-header`Specify the HTTP header for passing custom labels for tokenizer metrics.`x-custom-labels`Type: str
`--tokenizer-metrics-allowed-custom-labels`The custom labels allowed for tokenizer metrics. The labels are specified via a dict in '--tokenizer-metrics-custom-labels-header' field in HTTP requests, e.g., {'label1': 'value1', 'label2': 'value2'} is allowed if '--tokenizer-metrics-allowed-custom-labels label1 label2' is set.`None`List[str]
`--bucket-time-to-first-token`The buckets of time to first token, specified as a list of floats.`None`List[float]
`--bucket-inter-token-latency`The buckets of inter-token latency, specified as a list of floats.`None`List[float]
`--bucket-e2e-request-latency`The buckets of end-to-end request latency, specified as a list of floats.`None`List[float]
`--collect-tokens-histogram`Collect prompt/generation tokens histogram.`False`bool flag (set to enable)
`--prompt-tokens-buckets`The buckets rule of prompt tokens. Supports 3 rule types: 'default' uses predefined buckets; 'tse <middle> <base> <count>' generates two-sided exponentially distributed buckets (e.g., 'tse 1000 2 8' generates buckets [984.0, 992.0, 996.0, 998.0, 1000.0, 1002.0, 1004.0, 1008.0, 1016.0]); 'custom <value1> <value2> ...' uses custom bucket values (e.g., 'custom 10 50 100 500').`None`List[str]
`--generation-tokens-buckets`The buckets rule for generation tokens histogram. Supports 3 rule types: 'default' uses predefined buckets; 'tse <middle> <base> <count>' generates two-sided exponentially distributed buckets (e.g., 'tse 1000 2 8' generates buckets [984.0, 992.0, 996.0, 998.0, 1000.0, 1002.0, 1004.0, 1008.0, 1016.0]); 'custom <value1> <value2> ...' uses custom bucket values (e.g., 'custom 10 50 100 500').`None`List[str]
`--gc-warning-threshold-secs`The threshold for long GC warning. If a GC takes longer than this, a warning will be logged. Set to 0 to disable.`0.0`Type: float
`--decode-log-interval`The log interval of decode batch.`40`Type: int
`--enable-request-time-stats-logging`Enable per-request time stats logging.`False`bool flag (set to enable)
`--kv-events-config`Config in JSON format for NVIDIA Dynamo KV event publishing. Publishing will be enabled if this flag is used.`None`Type: str
`--enable-trace`Enable OpenTelemetry tracing.`False`bool flag (set to enable)
`--otlp-traces-endpoint`Config opentelemetry collector endpoint if --enable-trace is set. format: <ip>:<port>`localhost:4317`Type: str
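
The `tse <middle> <base> <count>` rule used by `--prompt-tokens-buckets` and `--generation-tokens-buckets` can be illustrated with a short sketch. The function below is not SGLang's implementation; it is a hypothetical helper (`tse_buckets`) written only to reproduce the documented example, assuming `count // 2` buckets on each side of the middle at distances `base**1, base**2, ...`. Behavior for odd counts or values that would go negative is not covered by the description above and is left unhandled here.

```python
def tse_buckets(middle: float, base: float, count: int) -> list[float]:
    """Hypothetical sketch of two-sided exponential buckets around `middle`."""
    half = count // 2
    # Distances from the middle grow exponentially: base, base**2, ...
    offsets = [base ** i for i in range(1, half + 1)]
    below = [float(middle - d) for d in reversed(offsets)]
    above = [float(middle + d) for d in offsets]
    return below + [float(middle)] + above


# Matches the documented example for 'tse 1000 2 8'.
print(tse_buckets(1000, 2, 8))
```

The buckets are densest near the middle value, which suits workloads whose token counts cluster around a known typical length.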

## RequestMetricsExporter configuration

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--export-metrics-to-file` | Export performance metrics for each request to local file (e.g. for forwarding to external systems). | `False` | bool flag (set to enable) |
| `--export-metrics-to-file-dir` | Directory path for writing performance metrics files (required when --export-metrics-to-file is enabled). | `None` | Type: str |
+ +## API related + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentDescriptionDefaultsOptions
`--api-key`Set API key of the server. It is also used in the OpenAI API compatible server.`None`Type: str
`--admin-api-key`Set admin API key for administrative/control endpoints (e.g., weights update, cache flush, /server_info). Endpoints marked as admin-only require Authorization: Bearer <admin_api_key> when this is set.`None`Type: str
`--served-model-name`Override the model name returned by the v1/models endpoint in OpenAI API server.`None`Type: str
`--weight-version`Version identifier for the model weights. Defaults to 'default' if not specified.`default`Type: str
`--chat-template`The builtin chat template name or the path of the chat template file. This is only used for OpenAI-compatible API server.`None`Type: str
`--hf-chat-template-name`When the HuggingFace tokenizer has multiple chat templates (e.g., 'default', 'tool_use', 'rag'), specify which named template to use. If not set, the first available template is used.`None`Type: str
`--completion-template`The builtin completion template name or the path of the completion template file. This is only used for the OpenAI-compatible API server, and currently only for code completion.`None`Type: str
`--file-storage-path`The path of the file storage in backend.`sglang_storage`Type: str
`--enable-cache-report`Return number of cached tokens in usage.prompt_tokens_details for each openai request.`False`bool flag (set to enable)
`--reasoning-parser`Specify the parser for reasoning models. Supported parsers: [deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3].`None`deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3
`--tool-call-parser`Specify the parser for handling tool-call interactions. Supported parsers: [deepseekv3, deepseekv31, glm, glm45, glm47, gpt-oss, kimi_k2, llama3, mistral, pythonic, qwen, qwen25, qwen3_coder, step3].`None`deepseekv3, deepseekv31, glm, glm45, glm47, gpt-oss, kimi_k2, llama3, mistral, pythonic, qwen, qwen25, qwen3_coder, step3, gigachat3
`--tool-server`Either 'demo' or a comma-separated list of tool server urls to use for the model. If not specified, no tool server will be used.`None`Type: str
`--sampling-defaults`Where to get default sampling parameters. 'openai' uses SGLang/OpenAI defaults (temperature=1.0, top_p=1.0, etc.). 'model' uses the model's generation_config.json to get the recommended sampling parameters if available. Default is 'model'.`model`openai, model

## Data parallelism

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--data-parallel-size`<br>`--dp-size` | The data parallelism size. | `1` | Type: int |
| `--load-balance-method` | The load balancing strategy for data parallelism. The `total_tokens` algorithm can only be used when DP attention is applied. This algorithm performs load balancing based on the real-time token load of the DP workers. | `auto` | `auto`, `round_robin`, `follow_bootstrap_room`, `total_requests`, `total_tokens` |

## Multi-node distributed serving

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--dist-init-addr`<br>`--nccl-init-addr` | The host address for initializing distributed backend (e.g., `192.168.0.2:25000`). | `None` | Type: str |
| `--nnodes` | The number of nodes. | `1` | Type: int |
| `--node-rank` | The node rank. | `0` | Type: int |

## Model override args

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--json-model-override-args` | A dictionary in JSON string format used to override default model configurations. | `{}` | Type: str |
| `--preferred-sampling-params` | JSON-formatted sampling settings that will be returned in `/get_model_info`. | `None` | Type: str |
+ +## LoRA + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentDescriptionDefaultsOptions
`--enable-lora`Enable LoRA support for the model. This argument is automatically set to `True` if `--lora-paths` is provided for backward compatibility.`False`Bool flag (set to enable)
`--enable-lora-overlap-loading`Enable asynchronous LoRA weight loading in order to overlap H2D transfers with GPU compute. This should be enabled if you find that your LoRA workloads are bottlenecked by adapter weight loading, for example when frequently loading large LoRA adapters.`False`Bool flag (set to enable)
`--max-lora-rank`The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in --lora-paths. This argument is needed when you expect to dynamically load adapters of larger LoRA rank after server startup.`None`Type: int
`--lora-target-modules`The union set of all target modules where LoRA should be applied (e.g., q_proj, k_proj, gate_proj). If not specified, it will be automatically inferred from the adapters provided in --lora-paths. You can also set it to all to enable LoRA for all supported modules; note this may introduce minor performance overhead.`None`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, qkv_proj, gate_up_proj, all
`--lora-paths`The list of LoRA adapters to load. Each adapter must be specified in one of the following formats: <PATH> \| <NAME>=<PATH> \| JSON with schema {"lora_name": str, "lora_path": str, "pinned": bool}.`None`Type: List[str] / JSON objects
`--max-loras-per-batch`Maximum number of adapters for a running batch, including base-only requests.`8`Type: int
`--max-loaded-loras`If specified, limits the maximum number of LoRA adapters loaded in CPU memory at a time. Must be ≥ --max-loras-per-batch.`None`Type: int
`--lora-eviction-policy`LoRA adapter eviction policy when the GPU memory pool is full.`lru`lru, fifo
`--lora-backend`Choose the kernel backend for multi-LoRA serving.`csgmv`triton, csgmv, ascend, torch_native
`--max-lora-chunk-size`Maximum chunk size for the ChunkedSGMV LoRA backend. Only used when --lora-backend is csgmv. Larger values may improve performance.`16`16, 32, 64, 128
`--lora-drain-wait-threshold`When any LoRA adapter request waits longer than this threshold (in seconds), the scheduler will selectively drain one running adapter to make room. This mitigates extreme tail latency under high or skewed workloads by preventing a small set of adapters from monopolizing batch slots. Set to 0 to disable draining (default).`0`Type: float
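
The three accepted formats for `--lora-paths` entries can be told apart with a few string checks. The sketch below is not SGLang's parser; `parse_lora_spec` is a hypothetical helper, and two choices in it are assumptions: a bare `<PATH>` reuses the path as the adapter name, and `pinned` defaults to `False` when omitted.

```python
import json


def parse_lora_spec(spec: str) -> dict:
    """Hypothetical sketch: normalize one --lora-paths entry to a dict."""
    spec = spec.strip()
    if spec.startswith("{"):
        # JSON object with schema {"lora_name": str, "lora_path": str, "pinned": bool}
        obj = json.loads(spec)
        return {
            "lora_name": obj["lora_name"],
            "lora_path": obj["lora_path"],
            "pinned": bool(obj.get("pinned", False)),  # assumed default
        }
    if "=" in spec:
        # <NAME>=<PATH>; split only on the first '=' so paths may contain '='
        name, path = spec.split("=", 1)
        return {"lora_name": name, "lora_path": path, "pinned": False}
    # Bare <PATH>; assumption: the path doubles as the adapter name
    return {"lora_name": spec, "lora_path": spec, "pinned": False}


print(parse_lora_spec("/adapters/sql"))
print(parse_lora_spec("sql=/adapters/sql"))
print(parse_lora_spec('{"lora_name": "sql", "lora_path": "/adapters/sql", "pinned": true}'))
```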
+ +## Kernel Backends (Attention, Sampling, Grammar, GEMM) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentDescriptionDefaultsOptions
`--attention-backend`Choose the kernels for attention layers.`None`triton, torch_native, flex_attention, nsa, cutlass_mla, fa3, fa4, flashinfer, flashmla, trtllm_mla, trtllm_mha, dual_chunk_flash_attn, aiter, wave, intel_amx, ascend
`--prefill-attention-backend`Choose the kernels for prefill attention layers (have priority over --attention-backend).`None`triton, torch_native, flex_attention, nsa, cutlass_mla, fa3, fa4, flashinfer, flashmla, trtllm_mla, trtllm_mha, dual_chunk_flash_attn, aiter, wave, intel_amx, ascend
`--decode-attention-backend`Choose the kernels for decode attention layers (have priority over --attention-backend).`None`triton, torch_native, flex_attention, nsa, cutlass_mla, fa3, fa4, flashinfer, flashmla, trtllm_mla, trtllm_mha, dual_chunk_flash_attn, aiter, wave, intel_amx, ascend
`--sampling-backend`Choose the kernels for sampling layers.`None`flashinfer, pytorch, ascend
`--grammar-backend`Choose the backend for grammar-guided decoding.`None`xgrammar, outlines, llguidance, none
`--mm-attention-backend`Set multimodal attention backend.`None`sdpa, fa3, fa4, triton_attn, ascend_attn, aiter_attn
`--nsa-prefill-backend`Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention).`flashmla_sparse`flashmla_sparse, flashmla_kv, flashmla_auto, fa3, tilelang, aiter, trtllm
`--nsa-decode-backend`Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding.`fa3`flashmla_sparse, flashmla_kv, fa3, tilelang, aiter, trtllm
`--fp8-gemm-backend`Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (FlashInfer TRTLLM backend; SM100/SM103 only), 'flashinfer_cutlass' (FlashInfer CUTLASS backend, SM120 only), 'flashinfer_deepgemm' (Hopper SM90 only, uses swapAB optimization for small M dimensions in decoding), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only).`auto`auto, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_deepgemm, cutlass, triton, aiter
`--fp4-gemm-backend`Choose the runner backend for NVFP4 GEMM operations. Options: 'flashinfer_cutlass' (default), 'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback.`flashinfer_cutlass`auto, flashinfer_cudnn, flashinfer_cutlass, flashinfer_trtllm
`--disable-flashinfer-autotune`Flashinfer autotune is enabled by default. Set this flag to disable the autotune.`False`bool flag (set to enable)
+ +## Speculative decoding + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentDescriptionDefaultsOptions
`--speculative-algorithm`Speculative algorithm.`None``EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`
`--speculative-draft-model-path` `--speculative-draft-model`The path of the draft model weights. This can be a local folder or a Hugging Face repo ID.`None`Type: str
`--speculative-draft-model-revision`The specific draft model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.`None`Type: str
`--speculative-draft-load-format`The format of the draft model weights to load. If not specified, will use the same format as `--load-format`. Use 'dummy' to initialize draft model weights with random values for profiling.`None`Same as `--load-format` options
`--speculative-num-steps`The number of steps sampled from draft model in Speculative Decoding.`None`Type: int
`--speculative-eagle-topk`The number of tokens sampled from the draft model in eagle2 each step.`None`Type: int
`--speculative-num-draft-tokens`The number of tokens sampled from the draft model in Speculative Decoding.`None`Type: int
`--speculative-accept-threshold-single`Accept a draft token if its probability in the target model is greater than this threshold.`1.0`Type: float
`--speculative-accept-threshold-acc`The accept probability of a draft token is raised from its target probability p to min(1, p / threshold_acc).`1.0`Type: float
`--speculative-token-map`The path of the draft model's small vocab table.`None`Type: str
`--speculative-attention-mode`Attention backend for speculative decoding operations (both target verify and draft extend). Can be one of 'prefill' (default) or 'decode'.`prefill`prefill, decode
`--speculative-draft-attention-backend`Attention backend for speculative decoding drafting.`None`Same as attention backend options
`--speculative-moe-runner-backend`MOE backend for EAGLE speculative decoding, see `--moe-runner-backend` for options. Same as moe runner backend if unset.`None`Same as `--moe-runner-backend` options
`--speculative-moe-a2a-backend`MOE A2A backend for EAGLE speculative decoding, see `--moe-a2a-backend` for options. Same as moe a2a backend if unset.`None`Same as `--moe-a2a-backend` options
`--speculative-draft-model-quantization`The quantization method for speculative model.`None`Same as `--quantization` options

## Ngram speculative decoding

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--speculative-ngram-min-bfs-breadth` | The minimum breadth for BFS (Breadth-First Search) in ngram speculative decoding. | `1` | Type: int |
| `--speculative-ngram-max-bfs-breadth` | The maximum breadth for BFS (Breadth-First Search) in ngram speculative decoding. | `10` | Type: int |
| `--speculative-ngram-match-type` | Ngram tree-building mode. BFS selects recency-based expansion and PROB selects frequency-based expansion. This setting is forwarded to the ngram cache implementation. | `BFS` | BFS, PROB |
| `--speculative-ngram-max-trie-depth` | Maximum suffix length stored and matched by the ngram trie. | `18` | Type: int |
| `--speculative-ngram-capacity` | The cache capacity for ngram speculative decoding. | `10000000` | Type: int |

## Multi-layer Eagle speculative decoding

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--enable-multi-layer-eagle` | Enable multi-layer Eagle speculative decoding. | `False` | bool flag (set to enable) |

## MoE

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--expert-parallel-size`<br>`--ep-size`<br>`--ep` | The expert parallelism size. | `1` | Type: int |
| `--moe-a2a-backend` | Select the backend for all-to-all communication for expert parallelism. | `none` | none, deepep, mooncake, mori, nixl, ascend_fuseep |
| `--moe-runner-backend` | Choose the runner backend for MoE. | `auto` | auto, deep_gemm, triton, triton_kernel, flashinfer_trtllm, flashinfer_trtllm_routed, flashinfer_cutlass, flashinfer_mxfp4, flashinfer_cutedsl, cutlass |
| `--flashinfer-mxfp4-moe-precision` | Choose the computation precision of FlashInfer MXFP4 MoE. | `default` | default, bf16 |
| `--enable-flashinfer-allreduce-fusion` | Enable FlashInfer allreduce fusion with Residual RMSNorm. | `False` | bool flag (set to enable) |
| `--enable-aiter-allreduce-fusion` | Enable aiter allreduce fusion with Residual RMSNorm. | `False` | bool flag (set to enable) |
| `--deepep-mode` | Select the mode when DeepEP MoE is enabled: normal, low_latency, or auto. The default is auto, which means low_latency for decode batches and normal for prefill batches. | `auto` | normal, low_latency, auto |
| `--ep-num-redundant-experts` | Allocate this number of redundant experts in expert parallel. | `0` | Type: int |
| `--ep-dispatch-algorithm` | The algorithm to choose ranks for redundant experts in expert parallel. | `None` | Type: str |
| `--init-expert-location` | Initial location of EP experts. | `trivial` | Type: str |
| `--enable-eplb` | Enable the EPLB algorithm. | `False` | bool flag (set to enable) |
| `--eplb-algorithm` | The EPLB algorithm to use. | `auto` | Type: str |
| `--eplb-rebalance-num-iterations` | Number of iterations after which an EPLB rebalance is automatically triggered. | `1000` | Type: int |
| `--eplb-rebalance-layers-per-chunk` | Number of layers to rebalance per forward pass. | `None` | Type: int |
| `--eplb-min-rebalancing-utilization-threshold` | Minimum threshold for GPU average utilization to trigger EPLB rebalancing. Must be in the range [0.0, 1.0]. | `1.0` | Type: float |
| `--expert-distribution-recorder-mode` | Mode of expert distribution recorder. | `None` | Type: str |
| `--expert-distribution-recorder-buffer-size` | Circular buffer size of expert distribution recorder. Set to -1 to denote infinite buffer. | `None` | Type: int |
| `--enable-expert-distribution-metrics` | Enable logging metrics for expert balancedness. | `False` | bool flag (set to enable) |
| `--deepep-config` | Tuned DeepEP config suitable for your own cluster. It can be either a string with JSON content or a file path. | `None` | Type: str |
| `--moe-dense-tp-size` | TP size for MoE dense MLP layers. This flag is useful when, with large TP size, there are errors caused by weights in MLP layers having dimension smaller than the min dimension GEMM supports. | `none` | Type: int |
| `--elastic-ep-backend` | Specify the collective communication backend for elastic EP. Currently supports 'mooncake'. | `None` | none, mooncake |
| `--enable-elastic-expert-backup` | Enable the elastic EP feature that backs up expert weights in DRAM. Currently supports 'mooncake'. | `False` | bool flag (set to enable) |
| `--mooncake-ib-device` | The InfiniBand devices for Mooncake Backend transfer, accepts multiple comma-separated devices (e.g., --mooncake-ib-device mlx5_0,mlx5_1). Default is None, which triggers automatic device detection when Mooncake Backend is enabled. | `None` | Type: str |
+ +## Mamba Cache + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--max-mamba-cache-size` | The maximum size of the mamba cache. | `None` | Type: int |
| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | float32, bfloat16, float16 |
| `--mamba-full-memory-ratio` | The ratio of mamba state memory to full kv cache memory. | `0.9` | Type: float |
| `--mamba-scheduler-strategy` | The strategy to use for mamba scheduler. auto currently defaults to no_buffer. 1. no_buffer does not support overlap scheduler due to not allocating extra mamba state buffers. Branching point caching support is feasible but not implemented. 2. extra_buffer supports overlap schedule by allocating extra mamba state buffers to track mamba state for caching (mamba state usage per running req becomes 2x for non-spec; 1+(1/(2+speculative_num_draft_tokens))x for spec dec (e.g. 1.16x if speculative_num_draft_tokens==4)). 2a. extra_buffer is strictly better for non-KV-cache-bound cases; for KV-cache-bound cases, the tradeoff depends on whether enabling overlap outweighs reduced max running requests. 2b. mamba caching at radix cache branching point is strictly better than non-branch but requires kernel support (currently only FLA backend), currently only extra_buffer supports branching. | `auto` | auto, no_buffer, extra_buffer |
| `--mamba-track-interval` | The interval (in tokens) to track the mamba state during decode. Only used when --mamba-scheduler-strategy is extra_buffer. Must be divisible by page_size if set, and must be >= speculative_num_draft_tokens when using speculative decoding. | `256` | Type: int |
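
As a quick check of the `extra_buffer` arithmetic quoted for `--mamba-scheduler-strategy`, the sketch below evaluates the per-request mamba state multiplier (2x without speculative decoding, 1 + 1/(2 + speculative_num_draft_tokens) with it) and restates the `--mamba-track-interval` constraints as a predicate. The helper names are hypothetical and are not part of SGLang.

```python
from typing import Optional


def extra_buffer_state_multiplier(speculative_num_draft_tokens: Optional[int] = None) -> float:
    """Per-request mamba state usage under extra_buffer, relative to no_buffer (hypothetical helper)."""
    if speculative_num_draft_tokens is None:
        # Non-speculative case: state usage doubles.
        return 2.0
    return 1.0 + 1.0 / (2 + speculative_num_draft_tokens)


def track_interval_is_valid(
    track_interval: int,
    page_size: int,
    speculative_num_draft_tokens: Optional[int] = None,
) -> bool:
    """Restates the documented constraints on --mamba-track-interval (hypothetical helper)."""
    if track_interval % page_size != 0:
        return False
    if speculative_num_draft_tokens is not None and track_interval < speculative_num_draft_tokens:
        return False
    return True


print(round(extra_buffer_state_multiplier(4), 4))  # 1.1667
print(track_interval_is_valid(256, page_size=64, speculative_num_draft_tokens=4))  # True
```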
+ +## Hierarchical cache + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--enable-hierarchical-cache` | Enable hierarchical cache | `False` | bool flag (set to enable) |
| `--hicache-ratio` | The ratio of the size of host KV cache memory pool to the size of device pool. | `2.0` | Type: float |
| `--hicache-size` | The size of host KV cache memory pool in gigabytes, which will override the hicache_ratio if set. | `0` | Type: int |
| `--hicache-write-policy` | The write policy of hierarchical cache. | `write_through` | `write_back`, `write_through`, `write_through_selective` |
| `--hicache-io-backend` | The IO backend for KV cache transfer between CPU and GPU | `kernel` | `direct`, `kernel`, `kernel_ascend` |
| `--hicache-mem-layout` | The layout of host memory pool for hierarchical cache. | `layer_first` | `layer_first`, `page_first`, `page_first_direct`, `page_first_kv_split`, `page_head` |
| `--hicache-storage-backend` | The storage backend for hierarchical KV cache. Built-in backends: file, mooncake, hf3fs, nixl, aibrix. For dynamic backend, use --hicache-storage-backend-extra-config to specify: backend_name (custom name), module_path (Python module path), class_name (backend class name). | `None` | `file`, `mooncake`, `hf3fs`, `nixl`, `aibrix`, `dynamic`, `eic` |
| `--hicache-storage-prefetch-policy` | Control when prefetching from the storage backend should stop. | `best_effort` | `best_effort`, `wait_complete`, `timeout` |
| `--hicache-storage-backend-extra-config` | A dictionary in JSON string format, or a string starting with a `@` followed by a config file in JSON/YAML/TOML format, containing extra configuration for the storage backend. | `None` | Type: str |
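
For the `dynamic` storage backend, the extra config carries the three fields named in the `--hicache-storage-backend` description. A minimal sketch of building that JSON string follows; `my_backend`, `my_package.storage`, and `MyHiCacheStorage` are placeholder values, and any further keys would be backend-specific.

```python
import json

# Field names come from the table above; the values are placeholders.
extra_config = {
    "backend_name": "my_backend",
    "module_path": "my_package.storage",
    "class_name": "MyHiCacheStorage",
}

# Serialize to the JSON string form accepted by the flag.
extra_config_arg = json.dumps(extra_config)
print(extra_config_arg)
```

The resulting string would be passed as the value of `--hicache-storage-backend-extra-config` together with `--hicache-storage-backend dynamic`; the same content can instead live in a config file referenced with a leading `@`.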
+ +## Hierarchical sparse attention + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--hierarchical-sparse-attention-extra-config` | A dictionary in JSON string format for hierarchical sparse attention configuration. Required fields: `algorithm` (str), `backend` (str). All other fields are algorithm-specific and passed to the algorithm constructor. | `None` | Type: str |
+ +## LMCache + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--enable-lmcache` | Use LMCache as an alternative hierarchical cache solution. | `False` | bool flag (set to enable) |
+ +## Ktransformers + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--kt-weight-path` | [ktransformers parameter] The path of the quantized expert weights for amx kernel. A local folder. | `None` | Type: str |
| `--kt-method` | [ktransformers parameter] Quantization formats for CPU execution. | `AMXINT4` | Type: str |
| `--kt-cpuinfer` | [ktransformers parameter] The number of CPUInfer threads. | `None` | Type: int |
| `--kt-threadpool-count` | [ktransformers parameter] One-to-one with the number of NUMA nodes (one thread pool per NUMA). | `2` | Type: int |
| `--kt-num-gpu-experts` | [ktransformers parameter] The number of GPU experts. | `None` | Type: int |
| `--kt-max-deferred-experts-per-token` | [ktransformers parameter] Maximum number of experts deferred to CPU per token. All MoE layers except the final one use this value; the final layer always uses 0. | `None` | Type: int |
+ +## Diffusion LLM + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--dllm-algorithm` | The diffusion LLM algorithm, such as LowConfidence. | `None` | Type: str |
| `--dllm-algorithm-config` | The diffusion LLM algorithm configurations. Must be a YAML file. | `None` | Type: str |
+ +## Offloading + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--cpu-offload-gb` | How many GBs of RAM to reserve for CPU offloading. | `0` | Type: int |
| `--offload-group-size` | Number of layers per group in offloading. | `-1` | Type: int |
| `--offload-num-in-group` | Number of layers to be offloaded within a group. | `1` | Type: int |
| `--offload-prefetch-step` | Steps to prefetch in offloading. | `1` | Type: int |
| `--offload-mode` | Mode of offloading. | `cpu` | Type: str |
+ +## Args for multi-item scoring + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--multi-item-scoring-delimiter` | Delimiter token ID for multi-item scoring. Used to combine Query and Items into a single sequence: `Query<delimiter>Item1<delimiter>Item2<delimiter>...` This enables efficient batch processing of multiple items against a single query. | `None` | Type: int |
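
The sequence layout described for `--multi-item-scoring-delimiter` can be sketched at the token-ID level as follows. The token IDs and the helper name are made up for illustration, and placing a delimiter after every item (including the last) is an assumption read off the `Query<delimiter>Item1<delimiter>Item2<delimiter>...` pattern rather than a confirmed detail of SGLang's implementation.

```python
from typing import List


def build_multi_item_sequence(
    query_ids: List[int], items: List[List[int]], delimiter_id: int
) -> List[int]:
    """Concatenate a query and several items into one sequence (hypothetical helper)."""
    sequence = list(query_ids)
    for item_ids in items:
        # Each item is preceded by the delimiter token.
        sequence.append(delimiter_id)
        sequence.extend(item_ids)
    # Assumed trailing delimiter, following the documented pattern.
    sequence.append(delimiter_id)
    return sequence


print(build_multi_item_sequence([1, 2, 3], [[10, 11], [20]], delimiter_id=99))
# [1, 2, 3, 99, 10, 11, 99, 20, 99]
```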
+ +## Optimization/debug options + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--disable-radix-cache` | Disable RadixAttention for prefix caching. | `False` | bool flag (set to enable) |
| `--cuda-graph-max-bs` | Set the maximum batch size for cuda graph. It will extend the cuda graph capture batch size to this value. | `None` | Type: int |
| `--cuda-graph-bs` | Set the list of batch sizes for cuda graph. | `None` | List[int] |
| `--disable-cuda-graph` | Disable cuda graph. | `False` | bool flag (set to enable) |
| `--disable-cuda-graph-padding` | Disable cuda graph when padding is needed. Still uses cuda graph when padding is not needed. | `False` | bool flag (set to enable) |
| `--enable-profile-cuda-graph` | Enable profiling of cuda graph capture. | `False` | bool flag (set to enable) |
| `--enable-cudagraph-gc` | Enable garbage collection during CUDA graph capture. If disabled (default), GC is frozen during capture to speed up the process. | `False` | bool flag (set to enable) |
| `--enable-layerwise-nvtx-marker` | Enable layerwise NVTX profiling annotations for the model. This adds NVTX markers to every layer for detailed per-layer performance analysis with Nsight Systems. | `False` | bool flag (set to enable) |
| `--enable-nccl-nvls` | Enable NCCL NVLS for prefill heavy requests when available. | `False` | bool flag (set to enable) |
| `--enable-symm-mem` | Enable NCCL symmetric memory for fast collectives. | `False` | bool flag (set to enable) |
| `--disable-flashinfer-cutlass-moe-fp4-allgather` | Disable quantization before all-gather for FlashInfer CUTLASS MoE. | `False` | bool flag (set to enable) |
| `--enable-tokenizer-batch-encode` | Enable batch tokenization for improved performance when processing multiple text inputs. Do not use with image inputs, pre-tokenized input_ids, or input_embeds. | `False` | bool flag (set to enable) |
| `--disable-tokenizer-batch-decode` | Disable batch decoding when decoding multiple completions. | `False` | bool flag (set to enable) |
| `--disable-outlines-disk-cache` | Disable disk cache of outlines to avoid possible crashes related to file system or high concurrency. | `False` | bool flag (set to enable) |
| `--disable-custom-all-reduce` | Disable the custom all-reduce kernel and fall back to NCCL. | `False` | bool flag (set to enable) |
| `--enable-mscclpp` | Enable using mscclpp for small messages for all-reduce kernel and fall back to NCCL. | `False` | bool flag (set to enable) |
| `--enable-torch-symm-mem` | Enable using torch symm mem for all-reduce kernel and fall back to NCCL. Only supports CUDA device SM90 and above. SM90 supports world size 4, 6, 8. SM10 supports world size 6, 8. | `False` | bool flag (set to enable) |
| `--disable-overlap-schedule` | Disable the overlap scheduler, which overlaps the CPU scheduler with GPU model worker. | `False` | bool flag (set to enable) |
| `--enable-mixed-chunk` | Enable mixing prefill and decode in a batch when using chunked prefill. | `False` | bool flag (set to enable) |
| `--enable-dp-attention` | Enable data parallelism for attention and tensor parallelism for FFN. The dp size should be equal to the tp size. Currently DeepSeek-V2 and Qwen 2/3 MoE models are supported. | `False` | bool flag (set to enable) |
| `--enable-dp-lm-head` | Enable vocabulary parallel across the attention TP group to avoid all-gather across DP groups, optimizing performance under DP attention. | `False` | bool flag (set to enable) |
| `--enable-two-batch-overlap` | Enable two micro batches to overlap. | `False` | bool flag (set to enable) |
| `--enable-single-batch-overlap` | Let computation and communication overlap within one micro batch. | `False` | bool flag (set to enable) |
| `--tbo-token-distribution-threshold` | The threshold of token distribution between two batches in micro-batch-overlap, which determines whether to use two-batch-overlap or two-chunk-overlap. Set to 0 to disable two-chunk-overlap. | `0.48` | Type: float |
| `--enable-torch-compile` | Optimize the model with torch.compile. Experimental feature. | `False` | bool flag (set to enable) |
| `--enable-torch-compile-debug-mode` | Enable debug mode for torch compile. | `False` | bool flag (set to enable) |
| `--disable-piecewise-cuda-graph` | Disable piecewise cuda graph for extend/prefill. PCG is enabled by default. | `False` | bool flag (set to disable) |
| `--enforce-piecewise-cuda-graph` | Enforce piecewise cuda graph, skipping all auto-disable conditions. For testing only. | `False` | bool flag (set to enable) |
| `--piecewise-cuda-graph-tokens` | Set the list of tokens when using piecewise cuda graph. | `None` | Type: JSON list |
| `--piecewise-cuda-graph-compiler` | Set the compiler for piecewise cuda graph. Choices are: eager, inductor. | `eager` | eager, inductor |
| `--torch-compile-max-bs` | Set the maximum batch size when using torch compile. | `32` | Type: int |
| `--piecewise-cuda-graph-max-tokens` | Set the maximum tokens when using piecewise cuda graph. | `4096` | Type: int |
| `--torchao-config` | Optimize the model with torchao. Experimental feature. Current choices are: int8dq, int8wo, `int4wo-<group_size>`, fp8wo, fp8dq-per_tensor, fp8dq-per_row | `""` | Type: str |
| `--enable-nan-detection` | Enable the NaN detection for debugging purposes. | `False` | bool flag (set to enable) |
| `--enable-p2p-check` | Enable P2P check for GPU access; otherwise, P2P access is allowed by default. | `False` | bool flag (set to enable) |
| `--triton-attention-reduce-in-fp32` | Cast the intermediate attention results to fp32 to avoid possible crashes related to fp16. This only affects Triton attention kernels. | `False` | bool flag (set to enable) |
| `--triton-attention-num-kv-splits` | The number of KV splits in flash decoding Triton kernel. Larger value is better in longer context scenarios. The default value is 8. | `8` | Type: int |
| `--triton-attention-split-tile-size` | The size of split KV tile in flash decoding Triton kernel. Used for deterministic inference. | `None` | Type: int |
| `--num-continuous-decode-steps` | Run multiple continuous decoding steps to reduce scheduling overhead. This can potentially increase throughput but may also increase time-to-first-token latency. The default value is 1, meaning only run one decoding step at a time. | `1` | Type: int |
| `--delete-ckpt-after-loading` | Delete the model checkpoint after loading the model. | `False` | bool flag (set to enable) |
| `--enable-memory-saver` | Allow saving memory using release_memory_occupation and resume_memory_occupation. | `False` | bool flag (set to enable) |
| `--enable-weights-cpu-backup` | Save model weights to CPU memory during release_weights_occupation and resume_weights_occupation. | `False` | bool flag (set to enable) |
| `--enable-draft-weights-cpu-backup` | Save draft model weights to CPU memory during release_weights_occupation and resume_weights_occupation. | `False` | bool flag (set to enable) |
| `--allow-auto-truncate` | Allow automatically truncating requests that exceed the maximum input length instead of returning an error. | `False` | bool flag (set to enable) |
| `--enable-custom-logit-processor` | Enable users to pass custom logit processors to the server (disabled by default for security). | `False` | bool flag (set to enable) |
| `--flashinfer-mla-disable-ragged` | Do not use the ragged prefill wrapper when running FlashInfer MLA. | `False` | bool flag (set to enable) |
| `--disable-shared-experts-fusion` | Disable shared experts fusion optimization for deepseek v3/r1. | `False` | bool flag (set to enable) |
| `--disable-chunked-prefix-cache` | Disable chunked prefix cache feature for deepseek, which should save overhead for short sequences. | `False` | bool flag (set to enable) |
| `--disable-fast-image-processor` | Adopt base image processor instead of fast image processor. | `False` | bool flag (set to enable) |
| `--keep-mm-feature-on-device` | Keep multimodal feature tensors on device after processing to save D2H copy. | `False` | bool flag (set to enable) |
| `--enable-return-hidden-states` | Enable returning hidden states with responses. | `False` | bool flag (set to enable) |
| `--enable-return-routed-experts` | Enable returning routed experts of each layer with responses. | `False` | bool flag (set to enable) |
| `--scheduler-recv-interval` | The interval to poll requests in scheduler. Can be set to >1 to reduce the overhead of this. | `1` | Type: int |
| `--numa-node` | Sets the numa node for the subprocesses. i-th element corresponds to i-th subprocess. | `None` | List[int] |
| `--enable-deterministic-inference` | Enable deterministic inference mode with batch invariant ops. | `False` | bool flag (set to enable) |
| `--rl-on-policy-target` | The training system that SGLang needs to match for true on-policy. | `None` | fsdp |
| `--enable-attn-tp-input-scattered` | Allow input of attention to be scattered when only using tensor parallelism, to reduce the computational load of operations such as qkv latent. | `False` | bool flag (set to enable) |
| `--enable-nsa-prefill-context-parallel` | Enable context parallelism used in the long sequence prefill phase of DeepSeek v3.2. | `False` | bool flag (set to enable) |
| `--nsa-prefill-cp-mode` | Token splitting mode for the prefill phase of DeepSeek v3.2 under context parallelism. Optional values: round-robin-split (default), in-seq-split. round-robin-split distributes tokens across ranks based on `token_idx % cp_size`. It supports multi-batch prefill, fused MoE, and FP8 KV cache. | `in-seq-split` | in-seq-split, round-robin-split |
| `--enable-fused-qk-norm-rope` | Enable fused qk normalization and rope rotary embedding. | `False` | bool flag (set to enable) |
| `--enable-precise-embedding-interpolation` | Enable corner alignment for resize of embeddings grid to ensure more accurate (but slower) evaluation of interpolated embedding values. | `False` | bool flag (set to enable) |
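
The `round-robin-split` rule quoted for `--nsa-prefill-cp-mode` (token `token_idx` goes to rank `token_idx % cp_size`) can be illustrated with a few lines of Python. This is only a sketch of the indexing rule under that description, not SGLang's implementation, and the function name is hypothetical.

```python
from typing import List


def round_robin_split(num_tokens: int, cp_size: int) -> List[List[int]]:
    """Assign token indices to context-parallel ranks by token_idx % cp_size (hypothetical helper)."""
    ranks: List[List[int]] = [[] for _ in range(cp_size)]
    for token_idx in range(num_tokens):
        ranks[token_idx % cp_size].append(token_idx)
    return ranks


print(round_robin_split(10, 4))
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

With this rule the per-rank token counts differ by at most one, regardless of sequence length.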
+ +## Dynamic batch tokenizer + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--enable-dynamic-batch-tokenizer` | Enable async dynamic batch tokenizer for improved performance when multiple requests arrive concurrently. | `False` | bool flag (set to enable) |
| `--dynamic-batch-tokenizer-batch-size` | [Only used if --enable-dynamic-batch-tokenizer is set] Maximum batch size for dynamic batch tokenizer. | `32` | Type: int |
| `--dynamic-batch-tokenizer-batch-timeout` | [Only used if --enable-dynamic-batch-tokenizer is set] Timeout in seconds for batching tokenization requests. | `0.002` | Type: float |
+ +## Debug tensor dumps + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--debug-tensor-dump-output-folder` | The output folder for dumping tensors. | `None` | Type: str |
| `--debug-tensor-dump-layers` | The layer ids to dump. Dump all layers if not specified. | `None` | Type: JSON list |
| `--debug-tensor-dump-input-file` | The input filename for dumping tensors | `None` | Type: str |
| `--debug-tensor-dump-inject` | Inject the outputs from jax as the input of every layer. | `False` | Type: str |
+ +## PD disaggregation + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--disaggregation-mode` | Only used for PD disaggregation. "prefill" for prefill-only server, and "decode" for decode-only server. If not specified, the server is not PD disaggregated. | `null` | null, prefill, decode |
| `--disaggregation-transfer-backend` | The backend for disaggregation transfer. Default is mooncake. | `mooncake` | mooncake, nixl, ascend, fake |
| `--disaggregation-bootstrap-port` | Bootstrap server port on the prefill server. Default is 8998. | `8998` | Type: int |
| `--disaggregation-ib-device` | The InfiniBand devices for disaggregation transfer, accepts single device (e.g., --disaggregation-ib-device mlx5_0) or multiple comma-separated devices (e.g., --disaggregation-ib-device mlx5_0,mlx5_1). Default is None, which triggers automatic device detection when mooncake backend is enabled. | `None` | Type: str |
| `--disaggregation-decode-enable-offload-kvcache` | Enable async KV cache offloading on decode server (PD mode). | `False` | bool flag (set to enable) |
| `--num-reserved-decode-tokens` | Number of decode tokens that will have memory reserved when adding new request to the running batch. | `512` | Type: int |
| `--disaggregation-decode-polling-interval` | The interval to poll requests in decode server. Can be set to >1 to reduce the overhead of this. | `1` | Type: int |
+ +## Encode prefill disaggregation + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--encoder-only` | For MLLM with an encoder, launch an encoder-only server | `False` | bool flag (set to enable) |
| `--language-only` | For VLM, load weights for the language model only. | `False` | bool flag (set to enable) |
| `--encoder-transfer-backend` | The backend for encoder disaggregation transfer. Default is zmq_to_scheduler. | `zmq_to_scheduler` | `zmq_to_scheduler`, `zmq_to_tokenizer`, `mooncake` |
| `--encoder-urls` | List of encoder server urls. | `[]` | Type: JSON list |
+ +## Custom weight loader + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--custom-weight-loader` | The custom data loader used to update the model. Must be set to a valid import path, such as my_package.weight_load_func. | `None` | List[str] |
| `--weight-loader-disable-mmap` | Disable mmap while loading weights using safetensors. | `False` | bool flag (set to enable) |
| `--weight-loader-prefetch-checkpoints` | Prefetch checkpoint files into OS page cache before loading. Each rank prefetches a fraction of the shards in a background thread, reducing total network I/O on shared filesystems (NFS/Lustre) from N\*checkpoint to 1\*checkpoint. Recommended for models on network storage. | `False` | bool flag (set to enable) |
| `--weight-loader-prefetch-num-threads` | Number of threads per rank for checkpoint prefetching. | `4` | Type: int |
| `--remote-instance-weight-loader-seed-instance-ip` | The ip of the seed instance for loading weights from remote instance. | `None` | Type: str |
| `--remote-instance-weight-loader-seed-instance-service-port` | The service port of the seed instance for loading weights from remote instance. | `None` | Type: int |
| `--remote-instance-weight-loader-send-weights-group-ports` | The communication group ports for loading weights from remote instance. | `None` | Type: JSON list |
| `--remote-instance-weight-loader-backend` | The backend for loading weights from remote instance. Can be 'transfer_engine' or 'nccl'. Default is 'nccl'. | `nccl` | transfer_engine, nccl |
| `--remote-instance-weight-loader-start-seed-via-transfer-engine` | Start seed server via transfer engine backend for remote instance weight loader. | `False` | bool flag (set to enable) |
+ +## For PD-Multiplexing + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--enable-pdmux` | Enable PD-Multiplexing, PD running on greenctx stream. | `False` | bool flag (set to enable) |
| `--pdmux-config-path` | The path of the PD-Multiplexing config file. | `None` | Type: str |
| `--sm-group-num` | Number of sm partition groups. | `8` | Type: int |
+ +## Configuration file support + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--config` | Read CLI options from a config file. Must be a YAML file with configuration options. | `None` | Type: str |
+ +## For Multi-Modal + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--mm-max-concurrent-calls` | The max concurrent calls for async mm data processing. | `32` | Type: int |
| `--mm-per-request-timeout` | The timeout for each multi-modal request in seconds. | `10.0` | Type: int |
| `--enable-broadcast-mm-inputs-process` | Enable broadcast mm-inputs process in scheduler. | `False` | bool flag (set to enable) |
| `--mm-process-config` | Multimodal preprocessing config: a JSON config containing the keys image, video, and audio. | `{}` | Type: JSON / Dict |
| `--mm-enable-dp-encoder` | Enable data parallelism for the mm encoder. The dp size will be set to the tp size automatically. | `False` | bool flag (set to enable) |
| `--limit-mm-data-per-request` | Limit the number of multimodal inputs per request, e.g. `'{"image": 1, "video": 1, "audio": 1}'` | `None` | Type: JSON / Dict |
| `--enable-mm-global-cache` | Enable Mooncake-backed global multimodal embedding cache on encoder servers so repeated images can reuse cached ViT embeddings instead of recomputing them. | `False` | bool flag (set to enable) |
+ +## For checkpoint decryption + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--decrypted-config-file` | The path of the decrypted config file. | `None` | Type: str |
| `--decrypted-draft-config-file` | The path of the decrypted draft config file. | `None` | Type: str |
| `--enable-prefix-mm-cache` | Enable prefix multimodal cache. Currently only supports mm-only. | `False` | bool flag (set to enable) |
+ +## Forward hooks + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--forward-hooks` | JSON-formatted list of forward hook specifications. Each element must include `target_modules` (list of glob patterns matched against `model.named_modules()` names) and `hook_factory` (Python import path to a factory, e.g. `my_package.hooks:make_hook`). An optional `name` field is used for logging, and an optional `config` object is passed as a `dict` to the factory. | `None` | Type: JSON list |
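
A hook specification with the fields listed for `--forward-hooks` might be assembled like this. The glob pattern, the `name`, and the `config` contents are illustrative placeholders; only the field names and the `my_package.hooks:make_hook` import-path form come from the description.

```python
import json

# One hook specification; a real list may contain several.
forward_hooks = [
    {
        "name": "mlp_activation_logger",           # optional, used for logging (placeholder)
        "target_modules": ["model.layers.*.mlp"],  # glob patterns over module names (placeholder)
        "hook_factory": "my_package.hooks:make_hook",
        "config": {"log_every": 100},              # optional dict passed to the factory (placeholder)
    }
]

# Serialize to the JSON list form accepted by the flag.
forward_hooks_arg = json.dumps(forward_hooks)
print(forward_hooks_arg)
```

The printed string is the kind of value that would be supplied to `--forward-hooks` on the command line, quoted appropriately for the shell.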
+ +## Deprecated arguments + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--enable-ep-moe` | NOTE: --enable-ep-moe is deprecated. Please set `--ep-size` to the same value as `--tp-size` instead. | `None` | N/A |
| `--enable-deepep-moe` | NOTE: --enable-deepep-moe is deprecated. Please set `--moe-a2a-backend` to 'deepep' instead. | `None` | N/A |
| `--prefill-round-robin-balance` | NOTE: --prefill-round-robin-balance is deprecated. | `None` | N/A |
| `--enable-flashinfer-cutlass-moe` | NOTE: --enable-flashinfer-cutlass-moe is deprecated. Please set `--moe-runner-backend` to 'flashinfer_cutlass' instead. | `None` | N/A |
| `--enable-flashinfer-cutedsl-moe` | NOTE: --enable-flashinfer-cutedsl-moe is deprecated. Please set `--moe-runner-backend` to 'flashinfer_cutedsl' instead. | `None` | N/A |
| `--enable-flashinfer-trtllm-moe` | NOTE: --enable-flashinfer-trtllm-moe is deprecated. Please set `--moe-runner-backend` to 'flashinfer_trtllm' instead. | `None` | N/A |
| `--enable-triton-kernel-moe` | NOTE: --enable-triton-kernel-moe is deprecated. Please set `--moe-runner-backend` to 'triton_kernel' instead. | `None` | N/A |
| `--enable-flashinfer-mxfp4-moe` | NOTE: --enable-flashinfer-mxfp4-moe is deprecated. Please set `--moe-runner-backend` to 'flashinfer_mxfp4' instead. | `None` | N/A |
| `--crash-on-nan` | Crash the server on nan logprobs. | `False` | Type: str |
| `--hybrid-kvcache-ratio` | Mix ratio in [0,1] between uniform and hybrid kv buffers (0.0 = pure uniform: swa_size / full_size = 1) (1.0 = pure hybrid: swa_size / full_size = local_attention_size / context_length) | `None` | Optional[float] |
| `--load-watch-interval` | The interval of load watching in seconds. | `0.1` | Type: float |
| `--nsa-prefill` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_decode`, `fa3`, `tilelang`, `aiter` |
| `--nsa-decode` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `flashmla_kv` | `flashmla_prefill`, `flashmla_kv`, `fa3`, `tilelang`, `aiter` |
diff --git a/docs_new/docs/advanced_features/sgl_model_gateway.mdx b/docs_new/docs/advanced_features/sgl_model_gateway.mdx new file mode 100644 index 000000000000..049e6d4081b6 --- /dev/null +++ b/docs_new/docs/advanced_features/sgl_model_gateway.mdx @@ -0,0 +1,2816 @@ +--- +title: "SGLang Model Gateway" +metatags: + description: "SGLang Model Gateway: load balancing, PD disaggregation, multi-model routing, gRPC support, MCP integration, Kubernetes service discovery." +--- +SGLang Model Gateway is a high-performance model-routing gateway for large-scale LLM deployments. It centralizes worker lifecycle management, balances traffic across heterogeneous protocols (HTTP, gRPC, OpenAI-compatible), and provides enterprise-ready control over history storage, MCP tooling, and privacy-sensitive workflows. The gateway is deeply optimized for the SGLang serving runtime, but can route to any OpenAI-compatible backend. + +*** +## Table of Contents + +1. [Overview](#overview) +2. [Architecture](#architecture) + - [Control Plane](#control-plane) + - [Data Plane](#data-plane) + - [Storage and Privacy](#storage-and-privacy) +3. [Installation](#installation) +4. [Quick Start](#quick-start) +5. [Deployment Modes](#deployment-modes) + - [Co-launch Router and Workers](#co-launch-router-and-workers) + - [Separate Launch (HTTP)](#separate-launch-http) + - [gRPC Launch](#grpc-launch) + - [Prefill-Decode Disaggregation](#prefill-decode-disaggregation) + - [OpenAI Backend Proxy](#openai-backend-proxy) + - [Multi-Model Inference Gateway](#multi-model-inference-gateway) +6. [API Reference](#api-reference) + - [Inference Endpoints](#inference-endpoints) + - [Tokenization Endpoints](#tokenization-endpoints) + - [Parser Endpoints](#parser-endpoints) + - [Classification API](#classification-api) + - [Conversation and Response APIs](#conversation-and-response-apis) + - [Worker Management APIs](#worker-management-apis) + - [Admin and Health Endpoints](#admin-and-health-endpoints) +7. 
[Load Balancing Policies](#load-balancing-policies) +8. [Reliability and Flow Control](#reliability-and-flow-control) + - [Retries](#retries) + - [Circuit Breaker](#circuit-breaker) + - [Rate Limiting and Queuing](#rate-limiting-and-queuing) + - [Health Checks](#health-checks) +9. [Reasoning Parser Integration](#reasoning-parser-integration) +10. [Tool Call Parsing](#tool-call-parsing) +11. [Tokenizer Management](#tokenizer-management) +12. [MCP Integration](#mcp-integration) +13. [Service Discovery (Kubernetes)](#service-discovery-kubernetes) +14. [History and Data Connectors](#history-and-data-connectors) +15. [WASM Middleware](#wasm-middleware) +16. [Language Bindings](#language-bindings) +17. [Security and Authentication](#security-and-authentication) + - [TLS (HTTPS) for Gateway Server](#tls-https-for-gateway-server) + - [mTLS for Worker Communication](#mtls-for-worker-communication) +18. [Observability](#observability) + - [Prometheus Metrics](#prometheus-metrics) + - [OpenTelemetry Tracing](#opentelemetry-tracing) + - [Logging](#logging) +19. [Production Recommendations](#production-recommendations) + - [Security Best Practices](#security-best-practices) + - [High Availability](#high-availability) + - [Performance](#performance) + - [Kubernetes Deployment](#kubernetes-deployment) + - [Monitoring with PromQL](#monitoring-with-promql) +20. [Configuration Reference](#configuration-reference) +21. [Troubleshooting](#troubleshooting) + +*** +## Overview + +- **Unified control plane** for registering, monitoring, and orchestrating regular, prefill, and decode workers across heterogeneous model fleets. +- **Multi-protocol data plane** that routes traffic across HTTP, PD (prefill/decode), gRPC, and OpenAI-compatible backends with shared reliability primitives. +- **Industry-first gRPC pipeline** with native Rust tokenization, reasoning parsers, and tool-call execution for high-throughput, OpenAI-compatible serving; supports both single-stage and PD topologies. 
+- **Inference Gateway Mode (`--enable-igw`)** dynamically instantiates multiple router stacks (HTTP regular/PD, gRPC) and applies per-model policies for multi-tenant deployments. +- **Conversation & responses connectors** centralize chat history inside the router so the same context can be reused across models and MCP loops without leaking data to upstream vendors (memory, none, Oracle ATP, PostgreSQL). +- **Enterprise privacy**: agentic multi-turn `/v1/responses`, native MCP client (STDIO/HTTP/SSE/Streamable), and history storage all operate within the router boundary. +- **Reliability core**: retries with jitter, worker-scoped circuit breakers, token-bucket rate limiting with queuing, background health checks, and cache-aware load monitoring. +- **Comprehensive observability**: 40+ Prometheus metrics, OpenTelemetry distributed tracing, structured logging, and request ID propagation. + +*** +## Architecture + +### Control Plane + +- **Worker Manager** discovers capabilities (`/get_server_info`, `/get_model_info`), tracks load, and registers/removes workers in the shared registry. +- **Job Queue** serializes add/remove requests and exposes status (`/workers/{worker_id}`) so clients can track onboarding progress. +- **Load Monitor** feeds cache-aware and power-of-two policies with live worker load statistics. +- **Health Checker** continuously probes workers and updates readiness, circuit breaker state, and router metrics. +- **Tokenizer Registry** manages dynamically registered tokenizers with async loading from HuggingFace or local paths. + +### Data Plane + +- **HTTP routers** (regular & PD) implement `/generate`, `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, `/v1/embeddings`, `/v1/rerank`, `/v1/classify`, `/v1/tokenize`, `/v1/detokenize`, and associated admin endpoints. +- **gRPC router** streams tokenized requests directly to SRT gRPC workers, running fully in Rust—tokenizer, reasoning parser, and tool parser all reside in-process. 
Supports both single-stage and PD routing, including embeddings and classification. +- **OpenAI router** proxies OpenAI-compatible endpoints to external vendors (OpenAI, xAI, etc.) while keeping chat history and multi-turn orchestration local. + +### Storage and Privacy + +- Conversation and response history is stored at the router tier (memory, none, Oracle ATP, or PostgreSQL). The same history can power multiple models or MCP loops without sending data to upstream vendors. +- `/v1/responses` agentic flows, MCP sessions, and conversation APIs share the same storage layer, enabling compliance for regulated workloads. + +*** +## Installation + +### Docker + +Pre-built Docker images are available on Docker Hub with multi-architecture support (x86_64 and ARM64): + +```bash Command +docker pull lmsysorg/sgl-model-gateway:latest +``` + +### Prerequisites + +- **Rust and Cargo** + ```bash Command + curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh + source "$HOME/.cargo/env" + rustc --version + cargo --version + ``` +- **Python** with `pip` and virtualenv tooling available. 
+ +### Rust Binary + +```bash Command +cd sgl-model-gateway +cargo build --release +``` + +### Python Package + +```bash Command +pip install maturin + +# Fast development mode +cd sgl-model-gateway/bindings/python +maturin develop + +# Production build +maturin build --release --out dist --features vendored-openssl +pip install --force-reinstall dist/*.whl +``` + +*** +## Quick Start + +### Regular HTTP Routing + +```bash Command +# Rust binary +./target/release/sgl-model-gateway \ + --worker-urls http://worker1:8000 http://worker2:8000 \ + --policy cache_aware + +# Python launcher +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8000 \ + --policy cache_aware +``` + +### gRPC Routing + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls grpc://127.0.0.1:20000 \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --reasoning-parser deepseek-r1 \ + --tool-call-parser json \ + --host 0.0.0.0 --port 8080 +``` + +*** +## Deployment Modes + +### Co-launch Router and Workers + +Launch the router and a fleet of SGLang workers in one process: + +```bash Command +python -m sglang_router.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 \ + --host 0.0.0.0 \ + --port 30000 +``` + +Comprehensive example with router arguments (prefixed with `--router-`): + +```bash Command +python -m sglang_router.launch_server \ + --host 0.0.0.0 \ + --port 8080 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --tp-size 1 \ + --dp-size 8 \ + --grpc-mode \ + --log-level debug \ + --router-prometheus-port 10001 \ + --router-tool-call-parser llama \ + --router-model-path meta-llama/Llama-3.1-8B-Instruct \ + --router-policy round_robin \ + --router-log-level debug +``` + +### Separate Launch (HTTP) + +Run workers independently and point the router at their HTTP endpoints: + +```bash Command +# Worker nodes +python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 +python -m 
sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8001 + +# Router node +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --policy cache_aware \ + --host 0.0.0.0 --port 30000 +``` + +### gRPC Launch + +Use SRT gRPC workers to unlock the highest throughput and access native reasoning/tool pipelines: + +```bash Command +# Workers expose gRPC endpoints +python -m sglang.launch_server \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --grpc-mode \ + --port 20000 + +# Router +python -m sglang_router.launch_router \ + --worker-urls grpc://127.0.0.1:20000 \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --reasoning-parser deepseek-r1 \ + --tool-call-parser json \ + --host 0.0.0.0 --port 8080 +``` + +The gRPC router supports both regular HTTP-equivalent serving and PD (prefill/decode) serving. Provide `--tokenizer-path` or `--model-path` (HuggingFace ID or local directory) whenever connection mode resolves to gRPC. + +### Prefill-Decode Disaggregation + +Split prefill and decode workers for PD-aware caching and balancing: + +```bash Command +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --prefill http://prefill1:30001 9001 \ + --decode http://decode1:30011 \ + --prefill-policy cache_aware \ + --decode-policy power_of_two +``` + +Prefill entries accept an optional bootstrap port. PD mode merges prefill metadata with decode outputs and streams results back to the client. + +### OpenAI Backend Proxy + +Proxy OpenAI-compatible endpoints while keeping history and MCP sessions local: + +```bash Command +python -m sglang_router.launch_router \ + --backend openai \ + --worker-urls https://api.openai.com \ + --history-backend memory +``` + +OpenAI backend mode expects exactly one `--worker-urls` entry per router instance. 
+ +### Multi-Model Inference Gateway + +Enable IGW mode to route multiple models through a single router: + +```bash Command +./target/release/sgl-model-gateway \ + --enable-igw \ + --policy cache_aware \ + --max-concurrent-requests 512 + +# Register workers dynamically +curl -X POST http://localhost:30000/workers \ + -H "Content-Type: application/json" \ + -d '{ + "url": "http://worker-a:8000", + "model_id": "mistral", + "priority": 10, + "labels": {"tier": "gold"} + }' +``` + +*** +## API Reference + +### Inference Endpoints + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/generate` | SGLang generate API |
| `POST` | `/v1/chat/completions` | OpenAI-compatible chat completions (streaming/tool calls) |
| `POST` | `/v1/completions` | OpenAI-compatible text completions |
| `POST` | `/v1/embeddings` | Embedding generation (HTTP and gRPC) |
| `POST` | `/v1/rerank`, `/rerank` | Reranking requests |
| `POST` | `/v1/classify` | Text classification |
+ +### Tokenization Endpoints + +The gateway provides HTTP endpoints for text tokenization with batch support, designed to mirror the SGLang Python tokenization API. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/v1/tokenize` | Tokenize text to token IDs (single or batch) |
| `POST` | `/v1/detokenize` | Convert token IDs back to text (single or batch) |
| `POST` | `/v1/tokenizers` | Register a new tokenizer (async, returns job status) |
| `GET` | `/v1/tokenizers` | List all registered tokenizers |
| `GET` | `/v1/tokenizers/{id}` | Get tokenizer info by UUID |
| `GET` | `/v1/tokenizers/{id}/status` | Check async tokenizer loading status |
| `DELETE` | `/v1/tokenizers/{id}` | Remove a tokenizer from the registry |
+ +#### Tokenize Request + +```json Config +{ + "model": "meta-llama/Llama-3.1-8B-Instruct", + "prompt": "Hello, world!" +} +``` + +#### Batch Tokenize Request + +```json Config +{ + "model": "meta-llama/Llama-3.1-8B-Instruct", + "prompt": ["Hello", "World", "How are you?"] +} +``` + +#### Tokenize Response + +```json Config +{ + "tokens": [15339, 11, 1917, 0], + "count": 4, + "char_count": 13 +} +``` + +#### Detokenize Request + +```json Config +{ + "model": "meta-llama/Llama-3.1-8B-Instruct", + "tokens": [15339, 11, 1917, 0], + "skip_special_tokens": true +} +``` + +#### Detokenize Response + +```json Config +{ + "text": "Hello, world!" +} +``` + +#### Add Tokenizer (Async) + +```bash Command +curl -X POST http://localhost:30000/v1/tokenizers \ + -H "Content-Type: application/json" \ + -d '{"name": "llama3", "source": "meta-llama/Llama-3.1-8B-Instruct"}' +``` + +Response: +```json Config +{ + "id": "550e8400-e29b-41d4-a716-446655440000", + "status": "pending", + "message": "Tokenizer registration queued" +} +``` + +Check status: +```bash Command +curl http://localhost:30000/v1/tokenizers/550e8400-e29b-41d4-a716-446655440000/status +``` + +### Parser Endpoints + +The gateway provides admin endpoints for parsing reasoning content and function calls from LLM outputs. + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/parse/reasoning` | Separate reasoning (`<think>`) from normal text |
| `POST` | `/parse/function_call` | Parse function/tool calls from text |
+ +#### Separate Reasoning Request + +```json Config +{ + "text": "<think>Let me analyze this step by step...</think>The answer is 42.", + "parser": "deepseek-r1" +} +``` + +#### Response + +```json Config +{ + "normal_text": "The answer is 42.", + "reasoning_text": "Let me analyze this step by step..." +} +``` + +#### Function Call Parsing + +```json Config +{ + "text": "{\"name\": \"get_weather\", \"arguments\": {\"city\": \"NYC\"}}", + "parser": "json" +} +``` + +### Classification API + +The `/v1/classify` endpoint provides text classification using sequence classification models (e.g., `Qwen2ForSequenceClassification`, `BertForSequenceClassification`). + +#### Request + +```bash Command +curl http://localhost:30000/v1/classify \ + -H "Content-Type: application/json" \ + -d '{ + "model": "jason9693/Qwen2.5-1.5B-apeach", + "input": "I love this product!" + }' +``` + +#### Response + +```json Config +{ + "id": "classify-a1b2c3d4-5678-90ab-cdef-1234567890ab", + "object": "list", + "created": 1767034308, + "model": "jason9693/Qwen2.5-1.5B-apeach", + "data": [ + { + "index": 0, + "label": "positive", + "probs": [0.12, 0.88], + "num_classes": 2 + } + ], + "usage": { + "prompt_tokens": 6, + "completion_tokens": 0, + "total_tokens": 6 + } +} +``` + +#### Response Fields + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description |
| --- | --- |
| `label` | Predicted class label (from model's `id2label` config, or `LABEL_N` fallback) |
| `probs` | Probability distribution over all classes (softmax of logits) |
| `num_classes` | Number of classification classes |
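
The relationship between logits and the response fields above can be sketched in a few lines of Python. This is only an illustration, not the gateway's code: the function name `classify` and the integer-keyed `id2label` dictionary are assumptions made for the example.

```python
import math


def classify(logits, id2label=None):
    """Turn raw logits into a dict shaped like one /v1/classify data entry."""
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    # Fall back to a generic LABEL_N name when no id2label mapping exists.
    label = (id2label or {}).get(best, f"LABEL_{best}")
    return {"label": label, "probs": probs, "num_classes": len(probs)}
```

Calling `classify([0.0, 2.0], {0: "negative", 1: "positive"})` yields the label `positive`, while omitting the mapping produces `LABEL_1`.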
+ +#### Notes + +- Classification reuses the embedding backend—the scheduler returns logits which are converted to probabilities via softmax +- Labels come from the model's HuggingFace config (`id2label` field); models without this mapping use generic labels (`LABEL_0`, `LABEL_1`, etc.) +- Both HTTP and gRPC routers support classification + +### Conversation and Response APIs + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/v1/responses` | Create background responses (agentic loops) |
| `GET` | `/v1/responses/{id}` | Retrieve stored response |
| `POST` | `/v1/responses/{id}/cancel` | Cancel background response |
| `DELETE` | `/v1/responses/{id}` | Delete response |
| `GET` | `/v1/responses/{id}/input_items` | List response input items |
| `POST` | `/v1/conversations` | Create conversation |
| `GET` | `/v1/conversations/{id}` | Get conversation |
| `POST` | `/v1/conversations/{id}` | Update conversation |
| `DELETE` | `/v1/conversations/{id}` | Delete conversation |
| `GET` | `/v1/conversations/{id}/items` | List conversation items |
| `POST` | `/v1/conversations/{id}/items` | Add items to conversation |
| `GET` | `/v1/conversations/{id}/items/{item_id}` | Get conversation item |
| `DELETE` | `/v1/conversations/{id}/items/{item_id}` | Delete conversation item |
+ +### Worker Management APIs + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/workers` | Queue worker registration (returns 202 Accepted) |
| `GET` | `/workers` | List workers with health, load, and policy metadata |
| `GET` | `/workers/{worker_id}` | Inspect specific worker or job queue entry |
| `PUT` | `/workers/{worker_id}` | Queue worker update |
| `DELETE` | `/workers/{worker_id}` | Queue worker removal |
+ +#### Add Worker + +```bash Command +curl -X POST http://localhost:30000/workers \ + -H "Content-Type: application/json" \ + -d '{"url":"grpc://0.0.0.0:31000","worker_type":"regular"}' +``` + +#### List Workers + +```bash Command +curl http://localhost:30000/workers +``` + +Response: +```json Config +{ + "workers": [ + { + "id": "2f3a0c3e-3a7b-4c3f-8c70-1b7d4c3a6e1f", + "url": "http://0.0.0.0:31378", + "model_id": "mistral", + "priority": 50, + "cost": 1.0, + "worker_type": "regular", + "is_healthy": true, + "load": 0, + "connection_mode": "Http" + } + ], + "total": 1, + "stats": { + "prefill_count": 0, + "decode_count": 0, + "regular_count": 1 + } +} +``` + +### Admin and Health Endpoints + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Path | Description |
| --- | --- | --- |
| `GET` | `/liveness` | Health check (always returns OK) |
| `GET` | `/readiness` | Readiness check (checks healthy worker availability) |
| `GET` | `/health` | Alias for liveness |
| `GET` | `/health_generate` | Health generate test |
| `GET` | `/engine_metrics` | Engine-level metrics from workers |
| `GET` | `/v1/models` | List available models |
| `GET` | `/get_model_info` | Get model information |
| `GET` | `/get_server_info` | Get server information |
| `POST` | `/flush_cache` | Clear all caches |
| `GET` | `/get_loads` | Get all worker loads |
| `POST` | `/wasm` | Upload WASM module |
| `GET` | `/wasm` | List WASM modules |
| `DELETE` | `/wasm/{module_uuid}` | Remove WASM module |
+ +*** +## Load Balancing Policies + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Policy | Description | Usage |
| --- | --- | --- |
| `random` | Uniform random selection | `--policy random` |
| `round_robin` | Cycles through workers in order | `--policy round_robin` |
| `power_of_two` | Samples two workers and picks the lighter one | `--policy power_of_two` |
| `cache_aware` | Combines cache locality with load balancing (default) | `--policy cache_aware` |
| `bucket` | Divides workers into load buckets with dynamic boundaries | `--policy bucket` |
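
To make the `power_of_two` row concrete, here is a minimal Python sketch of the "sample two, keep the lighter" idea. The function name `pick_power_of_two` and the `loads` dictionary (worker URL to in-flight request count) are hypothetical; the gateway's actual selection code is written in Rust and may differ in detail.

```python
import random


def pick_power_of_two(loads, rng=random):
    """Sample two distinct workers and return the less loaded one."""
    workers = list(loads)
    if len(workers) == 1:
        return workers[0]
    # rng.sample raises ValueError on an empty worker set, which is
    # a reasonable signal that nothing can be routed.
    a, b = rng.sample(workers, 2)
    return a if loads[a] <= loads[b] else b
```

With three or more workers, the most heavily loaded worker is never chosen because it loses every pairwise comparison, yet only two load values need to be read per request.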
+ +### Cache-Aware Policy Tuning + +```bash Command +--cache-threshold 0.5 \ +--balance-abs-threshold 32 \ +--balance-rel-threshold 1.5 \ +--eviction-interval-secs 120 \ +--max-tree-size 67108864 +``` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Default | Description |
| --- | --- | --- |
| `--cache-threshold` | 0.3 | Minimum prefix match ratio for cache hit |
| `--balance-abs-threshold` | 64 | Absolute load difference before rebalancing |
| `--balance-rel-threshold` | 1.5 | Relative load ratio before rebalancing |
| `--eviction-interval-secs` | 120 | Cache eviction cadence in seconds |
| `--max-tree-size` | 67108864 | Maximum nodes in cache tree |
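
One plausible reading of how these thresholds interact is sketched below. It is an assumption-laden illustration rather than the gateway's algorithm: the function name `choose_worker`, the three input dictionaries, the AND-combination of the two balance thresholds, and the fallback to the worker with the smallest cache tree are all assumptions inferred from the parameter descriptions.

```python
def choose_worker(loads, match_ratios, tree_sizes,
                  cache_threshold=0.3, abs_threshold=64, rel_threshold=1.5):
    """Pick a worker using a cache-aware rule with a load-balancing override."""
    lo, hi = min(loads.values()), max(loads.values())
    # Assumed imbalance test: both the absolute and relative gaps are exceeded.
    imbalanced = (hi - lo) > abs_threshold and hi > rel_threshold * lo
    if imbalanced:
        # Ignore cache locality and send the request to the least loaded worker.
        return min(loads, key=loads.get)
    best = max(match_ratios, key=match_ratios.get)
    if match_ratios[best] > cache_threshold:
        # Enough of the prompt prefix is cached on this worker: reuse it.
        return best
    # Assumed fallback: the worker whose cache tree is smallest.
    return min(tree_sizes, key=tree_sizes.get)
```

Raising `--cache-threshold` makes cache hits harder to claim, so more requests fall through to the fallback branch; lowering the balance thresholds makes the load-balancing override trigger sooner.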
+ +*** +## Reliability and Flow Control + +### HTTP Client + +Configure upstream HTTP client connection settings: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Default | Description |
| --- | --- | --- |
| `--pool-idle-timeout-secs` | 50 | Idle timeout in seconds for pooled upstream HTTP connections. Can also be set with `SMG_POOL_IDLE_TIMEOUT_SECS`. |
| `--connect-timeout-secs` | 10 | Timeout in seconds for new upstream HTTP connections. Can also be set with `SMG_CONNECT_TIMEOUT_SECS`. |
| `--pool-max-idle-per-host` | 500 | Maximum idle upstream HTTP connections to keep per host. Can also be set with `SMG_POOL_MAX_IDLE_PER_HOST`. |
| `--tcp-keepalive-secs` | 30 | TCP keepalive idle time in seconds for upstream HTTP connections. Can also be set with `SMG_TCP_KEEPALIVE_SECS`. |
+ +### Retries + +Configure exponential backoff retries: + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --retry-max-retries 5 \ + --retry-initial-backoff-ms 50 \ + --retry-max-backoff-ms 30000 \ + --retry-backoff-multiplier 1.5 \ + --retry-jitter-factor 0.2 +``` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Default | Description |
| --- | --- | --- |
| `--retry-max-retries` | 5 | Maximum retry attempts |
| `--retry-initial-backoff-ms` | 50 | Initial backoff duration (ms) |
| `--retry-max-backoff-ms` | 5000 | Maximum backoff duration (ms) |
| `--retry-backoff-multiplier` | 2.0 | Exponential backoff multiplier |
| `--retry-jitter-factor` | 0.1 | Random jitter factor (0.0-1.0) |
| `--disable-retries` | false | Disable retries entirely |
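
The following Python sketch shows how the retry parameters could combine into a delay schedule. It is a rough model only: the function name `backoff_delays_ms` is hypothetical, and applying symmetric jitter after capping at the maximum backoff is an assumption about the order of operations.

```python
import random


def backoff_delays_ms(max_retries=5, initial_ms=50, max_ms=5000,
                      multiplier=2.0, jitter=0.1, rng=random):
    """Return the delay (in ms) to wait before each retry attempt."""
    delays = []
    for attempt in range(max_retries):
        # Exponential growth, capped at the maximum backoff.
        base = min(initial_ms * multiplier ** attempt, max_ms)
        # Assumed symmetric jitter: scale by a factor in [1 - jitter, 1 + jitter].
        delays.append(base * (1 + rng.uniform(-jitter, jitter)))
    return delays
```

With the default values and jitter set to zero, the schedule is 50, 100, 200, 400, and 800 ms, so five retries add roughly 1.5 seconds of waiting in the worst case.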
+ +**Retryable Status Codes:** 408, 429, 500, 502, 503, 504 + +### Circuit Breaker + +Per-worker circuit breakers prevent cascading failures: + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --cb-failure-threshold 5 \ + --cb-success-threshold 2 \ + --cb-timeout-duration-secs 30 \ + --cb-window-duration-secs 60 +``` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Default | Description |
| --- | --- | --- |
| `--cb-failure-threshold` | 5 | Consecutive failures to open circuit |
| `--cb-success-threshold` | 2 | Successes to close from half-open |
| `--cb-timeout-duration-secs` | 30 | Time before half-open attempt |
| `--cb-window-duration-secs` | 60 | Failure counting window |
| `--disable-circuit-breaker` | false | Disable circuit breaker |
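
As a rough mental model, the thresholds above can drive a small three-state machine (closed, open, half-open). The Python class below is a simplified sketch, not the gateway's implementation: the class and method names are invented for illustration, and the failure counting window (`--cb-window-duration-secs`) is deliberately left out.

```python
import time


class CircuitBreaker:
    """Simplified per-worker breaker driven by consecutive outcomes."""

    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout_secs=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        # After the timeout, an open breaker lets trial requests through.
        if self.state == "open" and self.clock() - self.opened_at >= self.timeout_secs:
            self.state = "half_open"
            self.successes = 0
        return self.state != "open"

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            # A success breaks any streak of consecutive failures.
            self.failures = 0

    def record_failure(self):
        if self.state == "half_open":
            # A failed trial request reopens the circuit immediately.
            self._open()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
```

The injectable `clock` argument exists only so the timeout behaviour can be exercised without sleeping.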
+ +**Circuit Breaker States:** +- **Closed**: Normal operation, requests allowed +- **Open**: Failing, requests rejected immediately +- **Half-Open**: Testing recovery, limited requests allowed + +### Rate Limiting and Queuing + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --max-concurrent-requests 256 \ + --rate-limit-tokens-per-second 512 \ + --queue-size 128 \ + --queue-timeout-secs 30 +``` + +Requests beyond the concurrency limit wait in a FIFO queue. Returns: +- `429 Too Many Requests` when queue is full +- `408 Request Timeout` when queue timeout expires + +### Health Checks + +```bash Command +--health-check-interval-secs 30 \ +--health-check-timeout-secs 10 \ +--health-success-threshold 2 \ +--health-failure-threshold 3 \ +--health-check-endpoint /health +``` + +*** +## Reasoning Parser Integration + +The gateway includes built-in reasoning parsers for models that use Chain-of-Thought (CoT) reasoning with explicit thinking blocks. + +### Supported Parsers + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parser ID | Model Family | Think Tokens |
| --- | --- | --- |
| `deepseek-r1` | DeepSeek-R1 | `<think>...</think>` (initial reasoning) |
| `qwen3` | Qwen-3 | `<think>...</think>` |
| `qwen3-thinking` | Qwen-3 Thinking | `<think>...</think>` (initial reasoning) |
| `kimi` | Kimi K2 | Unicode think tokens |
| `glm45` | GLM-4.5/4.6/4.7 | `<think>...</think>` |
| `step3` | Step-3 | `<think>...</think>` |
| `minimax` | MiniMax | `<think>...</think>` |
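
For complete (non-streaming) text, separating a `<think>...</think>` block from the rest can be sketched as below. This is an illustrative simplification: the function name `split_reasoning` is hypothetical, and the sketch does not attempt incremental streaming or the "initial reasoning" variants listed in the table.

```python
def split_reasoning(text, start="<think>", end="</think>"):
    """Split text into reasoning content and normal text around think tags."""
    head, found, rest = text.partition(start)
    if not found:
        # No reasoning block at all: everything is normal text.
        return {"normal_text": text, "reasoning_text": ""}
    # If the closing tag is missing, the remainder is treated as reasoning.
    reasoning, _closed, tail = rest.partition(end)
    return {"normal_text": head + tail, "reasoning_text": reasoning}
```

The returned keys mirror the `normal_text` and `reasoning_text` fields of the `/parse/reasoning` response.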
+ +### Usage + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls grpc://127.0.0.1:20000 \ + --model-path deepseek-ai/DeepSeek-R1 \ + --reasoning-parser deepseek-r1 +``` + +The gRPC router automatically: +1. Detects reasoning blocks in streaming output +2. Separates reasoning content from normal text +3. Applies incremental streaming parsing with buffer management +4. Handles partial token detection for correct streaming behavior + +*** +## Tool Call Parsing + +The gateway supports parsing function/tool calls from LLM outputs in multiple formats. + +### Supported Formats + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parser | Format | Description |
| --- | --- | --- |
| `json` | JSON | Standard JSON tool calls |
| `python` | Pythonic | Python function call syntax |
| `xml` | XML | XML-formatted tool calls |
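
As an illustration of the `json` format, the sketch below extracts tool calls from model output that is either a single JSON object or a list of objects with `name` and `arguments` keys. The function name `parse_json_tool_calls` and the choice to return an empty list on non-JSON text are assumptions for this example, not the gateway's behaviour.

```python
import json


def parse_json_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in JSON output."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all: treat the output as ordinary text.
        return []
    # Accept either one call object or a list of them.
    items = payload if isinstance(payload, list) else [payload]
    calls = []
    for item in items:
        if isinstance(item, dict) and "name" in item:
            calls.append({"name": item["name"],
                          "arguments": item.get("arguments", {})})
    return calls
```

Passing the `get_weather` payload from the function call parsing example yields one call with `{"city": "NYC"}` as its arguments.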
+ +### Usage + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls grpc://127.0.0.1:20000 \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --tool-call-parser json +``` + +*** +## Tokenizer Management + +### Tokenizer Sources + +The gateway supports multiple tokenizer backends: +- **HuggingFace**: Load from HuggingFace Hub by model ID +- **Local**: Load from local `tokenizer.json` or directory +- **Tiktoken**: Auto-detect OpenAI GPT models (gpt-4, davinci, etc.) + +### Configuration + +```bash Command +# HuggingFace model +--model-path meta-llama/Llama-3.1-8B-Instruct + +# Local tokenizer +--tokenizer-path /path/to/tokenizer.json + +# With chat template override +--chat-template /path/to/template.jinja +``` + +### Tokenizer Caching + +Two-level caching for optimal performance: + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Cache | Type | Description |
| --- | --- | --- |
| L0 | Exact match | Whole-string caching for repeated prompts |
| L1 | Prefix match | Prefix boundary matching for incremental prompts |
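
The L0 idea (whole-string caching) can be pictured as a bounded dictionary keyed by the full prompt. The sketch below is hypothetical: the class name `ExactMatchCache`, the `encode` callback, and the least-recently-used eviction policy are assumptions, since the eviction strategy is not described here.

```python
from collections import OrderedDict


class ExactMatchCache:
    """Bounded exact-match cache from prompt text to token IDs."""

    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get_or_encode(self, text, encode):
        if text in self._entries:
            # Cache hit: mark as recently used and skip tokenization.
            self._entries.move_to_end(text)
            return self._entries[text]
        tokens = encode(text)
        self._entries[text] = tokens
        if len(self._entries) > self.max_entries:
            # Assumed policy: evict the least recently used entry.
            self._entries.popitem(last=False)
        return tokens
```

A repeated system prompt is tokenized once and served from memory afterwards, which is the case an exact-match level targets.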
+ +```bash Command +--enable-l0-cache \ +--l0-max-entries 10000 \ +--enable-l1-cache \ +--l1-max-memory 52428800 # 50MB +``` + +*** +## MCP Integration + +The gateway provides native Model Context Protocol (MCP) client integration for tool execution. + +### Supported Transports + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Transport | Description |
| --- | --- |
| STDIO | Local process execution |
| SSE | Server-Sent Events (HTTP) |
| Streamable | Bidirectional streaming |
+ +### Configuration + +```bash Command +python -m sglang_router.launch_router \ + --mcp-config-path /path/to/mcp-config.yaml \ + --worker-urls http://worker1:8000 +``` + +### MCP Configuration File + +```yaml Config +servers: + - name: "filesystem" + command: "npx" + args: ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"] + protocol: "stdio" + required: false + + - name: "github" + url: "https://api.github.com/mcp" + token: "ghp_xxxxx" + protocol: "sse" + required: false + + - name: "custom-tools" + url: "https://tools.example.com/mcp" + protocol: "streamable" + required: true + +pool: + max_connections: 100 + idle_timeout: 300 + +proxy: + http: "http://proxy.internal:8080" + https: "https://proxy.internal:8443" + no_proxy: "localhost,127.0.0.1,*.internal" + +inventory: + enable_refresh: true + tool_ttl: 300 + refresh_interval: 300 +``` + +*** +## Service Discovery (Kubernetes) + +Enable automatic worker discovery via Kubernetes pod selectors: + +```bash Command +python -m sglang_router.launch_router \ + --service-discovery \ + --selector app=sglang-worker role=inference \ + --service-discovery-namespace production \ + --service-discovery-port 8000 +``` + +### PD Mode Discovery + +```bash Command +--pd-disaggregation \ +--prefill-selector app=sglang component=prefill \ +--decode-selector app=sglang component=decode \ +--service-discovery +``` + +Prefill pods can expose bootstrap ports via the `sglang.ai/bootstrap-port` annotation. RBAC must allow `get`, `list`, and `watch` on pods. + +*** +## History and Data Connectors + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Backend | Description | Usage |
| --- | --- | --- |
| `memory` | In-memory storage (default) | `--history-backend memory` |
| `none` | No persistence | `--history-backend none` |
| `oracle` | Oracle Autonomous Database | `--history-backend oracle` |
| `postgres` | PostgreSQL Database | `--history-backend postgres` |
| `redis` | Redis | `--history-backend redis` |
+ +### Oracle Configuration + +```bash Command +# Connection descriptor +export ATP_DSN="(description=(address=(protocol=tcps)(port=1522)(host=adb.region.oraclecloud.com))(connect_data=(service_name=service_name)))" + +# Or TNS alias (requires wallet) +export ATP_TNS_ALIAS="sglroutertestatp_high" +export ATP_WALLET_PATH="/path/to/wallet" + +# Credentials +export ATP_USER="admin" +export ATP_PASSWORD="secret" +export ATP_POOL_MIN=4 +export ATP_POOL_MAX=32 + +python -m sglang_router.launch_router \ + --backend openai \ + --worker-urls https://api.openai.com \ + --history-backend oracle +``` + +### PostgreSQL Configuration + +```bash Command +export POSTGRES_DB_URL="postgres://user:password@host:5432/dbname" + +python -m sglang_router.launch_router \ + --backend openai \ + --worker-urls https://api.openai.com \ + --history-backend postgres +``` + +### Redis Configuration + +```bash Command +export REDIS_URL="redis://localhost:6379" +export REDIS_POOL_MAX=16 +export REDIS_RETENTION_DAYS=30 + +python -m sglang_router.launch_router \ + --backend openai \ + --worker-urls https://api.openai.com \ + --history-backend redis \ + --redis-retention-days 30 +``` + +Use `--redis-retention-days -1` for persistent storage (default is 30 days). + +*** +## WASM Middleware + +The gateway supports WebAssembly (WASM) middleware modules for custom request/response processing. This enables organization-specific logic for authentication, rate limiting, billing, logging, and more—without modifying or recompiling the gateway. + +### Overview + +WASM middleware runs in a sandboxed environment with memory isolation, no network/filesystem access, and configurable resource limits. + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Attach Point | When Executed | Use Cases |
| --- | --- | --- |
| `OnRequest` | Before forwarding to workers | Auth, rate limiting, request modification |
| `OnResponse` | After receiving worker response | Logging, response modification, error handling |
+ + + + + + + + + + + + + + + + + + + + + + + + + + +
| Action | Description |
| --- | --- |
| `Continue` | Proceed without modification |
| `Reject(status)` | Reject request with HTTP status code |
| `Modify(...)` | Modify headers, body, or status |
+ +### Examples + +Complete working examples are available in `examples/wasm/`: + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Example | Description |
| --- | --- |
| `auth/` | API key authentication for protected routes |
| `rate_limit/` | Per-client rate limiting (requests/minute) |
| `logging/` | Request tracking headers and response modification |
+ +The interface definition is located at `src/wasm/interface`. + +### Building Modules + +```bash Command +# Prerequisites +rustup target add wasm32-wasip2 +cargo install wasm-tools + +# Build +cargo build --target wasm32-wasip2 --release + +# Convert to component format +wasm-tools component new \ + target/wasm32-wasip2/release/my_middleware.wasm \ + -o my_middleware.component.wasm +``` + +### Deploying Modules + +```bash Command +# Enable WASM support +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 \ + --enable-wasm + +# Upload module +curl -X POST http://localhost:30000/wasm \ + -H "Content-Type: application/json" \ + -d '{ + "modules": [{ + "name": "auth-middleware", + "file_path": "/absolute/path/to/auth.component.wasm", + "module_type": "Middleware", + "attach_points": [{"Middleware": "OnRequest"}] + }] + }' + +# List modules +curl http://localhost:30000/wasm + +# Remove module +curl -X DELETE http://localhost:30000/wasm/{module_uuid} +``` + +### Runtime Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Default | Description |
| --- | --- | --- |
| `max_memory_pages` | 1024 (64MB) | Maximum WASM memory |
| `max_execution_time_ms` | 1000 | Execution timeout |
| `max_stack_size` | 1MB | Stack size limit |
| `module_cache_size` | 10 | Cached modules per worker |
+ +**Note:** Rate limiting state is per-worker thread and not shared across gateway replicas. For production, consider implementing rate limiting at a shared layer (e.g., Redis) + +*** +## Language Bindings + +SGLang Model Gateway provides official language bindings for Python and Go, enabling integration with different technology stacks and organizational requirements. + +### Python Bindings + +The Python bindings provide a PyO3-based wrapper around the Rust gateway library. This is a straightforward binding that calls the gateway server startup from Python. + +#### Installation + +```bash Command +# From PyPI +pip install sglang-router + +# Development build +cd sgl-model-gateway/bindings/python +pip install maturin && maturin develop --features vendored-openssl +``` + +#### Usage + +The Python bindings are used throughout this documentation. See the [Quick Start](#quick-start) and [Deployment Modes](#deployment-modes) sections for detailed examples. + +Key components: +- `RouterArgs` dataclass with 50+ configuration options +- `Router.from_args()` for programmatic startup +- CLI commands: `smg launch`, `smg server`, `python -m sglang_router.launch_router` + +### Go Bindings + +The Go bindings provide a high-performance gRPC client library for organizations with Go-based infrastructure. 
This is ideal for: + +- Integration with internal Go services and tooling +- High-performance client applications +- Building custom OpenAI-compatible proxy servers + +#### Architecture + +```text Output ++-------------------------------------------+ +| High-Level Go API | +| (client.go - OpenAI-style interface) | ++-------------------------------------------+ +| gRPC Layer | ++-------------------------------------------+ +| Rust FFI Layer | +| (Tokenization, Parsing, Conversion) | ++-------------------------------------------+ +``` + +**Key Features:** +- Native Rust tokenization via FFI (thread-safe, lock-free) +- Full streaming support with context cancellation +- Configurable channel buffer sizes for high concurrency +- Built-in tool call parsing and chat template application + +#### Installation + +```bash Command +# Build the FFI library first +cd sgl-model-gateway/bindings/golang +make build && make lib + +# Then use in your Go project +go get github.com/sgl-project/sgl-go-sdk +``` + +**Requirements:** Go 1.24+, Rust toolchain + +#### Examples + +Complete working examples are available in `bindings/golang/examples/`: + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Example | Description |
| --- | --- |
| `simple/` | Non-streaming chat completion |
| `streaming/` | Streaming chat completion with SSE |
| `oai_server/` | Full OpenAI-compatible HTTP server |
+ +```bash Command +# Run examples +cd sgl-model-gateway/bindings/golang/examples/simple && ./run.sh +cd sgl-model-gateway/bindings/golang/examples/streaming && ./run.sh +cd sgl-model-gateway/bindings/golang/examples/oai_server && ./run.sh +``` + +#### Testing + +```bash Command +cd sgl-model-gateway/bindings/golang + +# Unit tests +go test -v ./... + +# Integration tests (requires running SGLang server) +export SGL_GRPC_ENDPOINT=grpc://localhost:20000 +export SGL_TOKENIZER_PATH=/path/to/tokenizer +go test -tags=integration -v ./... +``` + +### Comparison + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Feature | Python | Go |
| --- | --- | --- |
| **Primary Use** | Gateway server launcher | gRPC client library |
| **CLI Support** | Full CLI (smg, sglang-router) | Library only |
| **K8s Discovery** | Native support | N/A (client library) |
| **PD Mode** | Built-in | N/A (client library) |
+ +**When to Use Python:** Launching and managing the gateway server, service discovery, PD disaggregation. + +**When to Use Go:** Building custom client applications, integration with Go microservices, OpenAI-compatible proxy servers + +*** +## Security and Authentication + +### Router API Key + +```bash Command +python -m sglang_router.launch_router \ + --api-key "your-router-api-key" \ + --worker-urls http://worker1:8000 +``` + +Clients must supply `Authorization: Bearer ` for protected endpoints. + +### Worker API Keys + +```bash Command +# Add worker with explicit key +curl -H "Authorization: Bearer router-key" \ + -X POST http://localhost:8080/workers \ + -H "Content-Type: application/json" \ + -d '{"url":"http://worker:8000","api_key":"worker-key"}' +``` + +### Security Configurations + +1. **No Authentication** (default): Use only in trusted environments +2. **Router-only Authentication**: Clients authenticate to router +3. **Worker-only Authentication**: Router open, workers require keys +4. **Full Authentication**: Both router and workers protected + +### TLS (HTTPS) for Gateway Server + +Enable TLS to serve the gateway over HTTPS: + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 \ + --tls-cert-path /path/to/server.crt \ + --tls-key-path /path/to/server.key +``` + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description |
| --- | --- |
| `--tls-cert-path` | Path to server certificate (PEM format) |
| `--tls-key-path` | Path to server private key (PEM format) |
+ +Both parameters must be provided together. The gateway uses rustls with the ring crypto provider for TLS termination. If TLS is not configured, the gateway falls back to plain HTTP. + +### mTLS for Worker Communication + +Enable mutual TLS (mTLS) for secure communication with workers in HTTP mode: + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls https://worker1:8443 https://worker2:8443 \ + --client-cert-path /path/to/client.crt \ + --client-key-path /path/to/client.key \ + --ca-cert-path /path/to/ca.crt +``` + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description |
| --- | --- |
| `--client-cert-path` | Path to client certificate for mTLS (PEM format) |
| `--client-key-path` | Path to client private key for mTLS (PEM format) |
| `--ca-cert-path` | Path to CA certificate for verifying worker TLS (PEM format, repeatable) |
+ +**Key Points:** +- Client certificate and key must be provided together +- Multiple CA certificates can be added with multiple `--ca-cert-path` flags +- Uses rustls backend when TLS is configured +- Single HTTP client is created for all workers (assumes single security domain) +- TCP keepalive (30 seconds) is enabled for long-lived connections + +### Full TLS Configuration Example + +Gateway HTTPS + Worker mTLS + API Key authentication: + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls https://worker1:8443 https://worker2:8443 \ + --tls-cert-path /etc/certs/server.crt \ + --tls-key-path /etc/certs/server.key \ + --client-cert-path /etc/certs/client.crt \ + --client-key-path /etc/certs/client.key \ + --ca-cert-path /etc/certs/ca.crt \ + --api-key "secure-api-key" \ + --policy cache_aware +``` + +*** +## Observability + +### Prometheus Metrics + +Enable with `--prometheus-host`/`--prometheus-port` (defaults to `0.0.0.0:29000`). + +#### Metric Categories (40+ metrics) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Layer | Prefix | Metrics |
| --- | --- | --- |
| HTTP | `smg_http_*` | `requests_total`, `request_duration_seconds`, `responses_total`, `connections_active`, `rate_limit_total` |
| Router | `smg_router_*` | `requests_total`, `request_duration_seconds`, `request_errors_total`, `stage_duration_seconds`, `upstream_responses_total` |
| Inference | `smg_router_*` | `ttft_seconds`, `tpot_seconds`, `tokens_total`, `generation_duration_seconds` |
| Worker | `smg_worker_*` | `pool_size`, `connections_active`, `requests_active`, `health_checks_total`, `selection_total`, `errors_total` |
| Circuit Breaker | `smg_worker_cb_*` | `state`, `transitions_total`, `outcomes_total`, `consecutive_failures`, `consecutive_successes` |
| Retry | `smg_worker_*` | `retries_total`, `retries_exhausted_total`, `retry_backoff_seconds` |
| Discovery | `smg_discovery_*` | `registrations_total`, `deregistrations_total`, `sync_duration_seconds`, `workers_discovered` |
| MCP | `smg_mcp_*` | `tool_calls_total`, `tool_duration_seconds`, `servers_active`, `tool_iterations_total` |
| Database | `smg_db_*` | `operations_total`, `operation_duration_seconds`, `connections_active`, `items_stored` |
+ +#### Key Inference Metrics (gRPC mode) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Metric | Type | Description |
| --- | --- | --- |
| `smg_router_ttft_seconds` | Histogram | Time to first token |
| `smg_router_tpot_seconds` | Histogram | Time per output token |
| `smg_router_tokens_total` | Counter | Total tokens (input/output) |
| `smg_router_generation_duration_seconds` | Histogram | End-to-end generation time |
+ +#### Duration Buckets + +1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, 15s, 30s, 45s, 60s, 90s, 120s, 180s, 240s + +### OpenTelemetry Tracing + +Enable distributed tracing with OTLP export: + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 \ + --enable-trace \ + --otlp-traces-endpoint localhost:4317 +``` + +#### Features + +- OTLP/gRPC exporter (default port 4317) +- W3C Trace Context propagation for HTTP and gRPC +- Batch span processing (500ms delay, 64 span batch size) +- Custom filtering to reduce noise +- Trace context injection into upstream worker requests +- Service name: `sgl-router` + +### Logging + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 \ + --log-level debug \ + --log-dir ./router_logs +``` + +Structured tracing with optional file sink. Log levels: `debug`, `info`, `warn`, `error`. + +### Request ID Propagation + +```bash Command +--request-id-headers x-request-id x-trace-id x-correlation-id +``` + +Responses include `x-request-id` header for correlation. + +*** +## Production Recommendations + +This section provides guidance for deploying SGLang Model Gateway in production environments. 
+ +### Security Best Practices + +**Always enable TLS in production:** + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls https://worker1:8443 https://worker2:8443 \ + --tls-cert-path /etc/certs/server.crt \ + --tls-key-path /etc/certs/server.key \ + --client-cert-path /etc/certs/client.crt \ + --client-key-path /etc/certs/client.key \ + --ca-cert-path /etc/certs/ca.crt \ + --api-key "${ROUTER_API_KEY}" +``` + +**Security Checklist:** +- Enable TLS for gateway HTTPS termination +- Enable mTLS for worker communication when workers are on untrusted networks +- Set `--api-key` to protect router endpoints +- Use Kubernetes Secrets or a secrets manager for credentials +- Rotate certificates and API keys periodically +- Restrict network access with firewalls or network policies + +### High Availability + +**Scaling Strategy:** + +The gateway supports running multiple replicas behind a load balancer for high availability. However, there are important considerations: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Component | Shared Across Replicas | Impact |
| --- | --- | --- |
| Worker Registry | No (independent) | Each replica discovers workers independently |
| Radix Cache Tree | No (independent) | Cache hits may decrease by 10-20% |
| Circuit Breaker State | No (independent) | Each replica tracks failures independently |
| Rate Limiting | No (independent) | Limits apply per-replica, not globally |
+ +**Recommendations:** + +1. **Prefer horizontal scaling over vertical scaling**: Deploy multiple smaller gateway replicas rather than one large instance with excessive CPU and memory. This provides: + - Better fault tolerance (single replica failure doesn't take down the gateway) + - More predictable resource usage + - Easier capacity planning + +2. **Use Kubernetes Service Discovery**: Let the gateway automatically discover and manage workers: + ```bash Command + python -m sglang_router.launch_router \ + --service-discovery \ + --selector app=sglang-worker \ + --service-discovery-namespace production + ``` + +3. **Accept cache efficiency trade-off**: With multiple replicas, the cache-aware routing policy's radix tree is not synchronized across replicas. This means: + - Each replica builds its own cache tree + - Requests from the same user may hit different replicas + - Expected cache hit rate reduction: **10-20%** + - This is often acceptable given the HA benefits + +4. **Configure session affinity (optional)**: If cache efficiency is critical, configure your load balancer for session affinity based on a consistent hash of the request (e.g., user ID or API key). 
+ +**Example HA Architecture:** +```text Output + +-------------------+ + | Load Balancer | + | (L4/L7) | + +---------+---------+ + | + +-------------------+-------------------+ + | | | + v v v + +-----------+ +-----------+ +-----------+ + | Gateway | | Gateway | | Gateway | + | Replica 1 | | Replica 2 | | Replica 3 | + +-----+-----+ +-----+-----+ +-----+-----+ + | | | + +-------------------+-------------------+ + | + +-------------------+-------------------+ + | | | + v v v + +-----------+ +-----------+ +-----------+ + | Worker | | Worker | | Worker | + | Pod 1 | | Pod 2 | | Pod N | + +-----------+ +-----------+ +-----------+ +``` + +### Performance + +**Use gRPC mode for high throughput:** + +gRPC mode provides the highest performance for SGLang workers: + +```bash Command +# Start workers in gRPC mode +python -m sglang.launch_server \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --grpc-mode \ + --port 20000 + +# Configure gateway for gRPC +python -m sglang_router.launch_router \ + --worker-urls grpc://worker1:20000 grpc://worker2:20000 \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --policy cache_aware +``` + +**Performance Benefits of gRPC:** +- Native Rust tokenization (no Python overhead) +- Streaming with lower latency +- Built-in reasoning parser execution +- Tool call parsing in the gateway +- Reduced serialization overhead + +**Tuning Recommendations:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Recommendation | Reason |
| --- | --- | --- |
| `--policy` | `cache_aware` | Best for repeated prompts, ~30% latency reduction |
| `--max-concurrent-requests` | 2-4x worker count | Prevent overload while maximizing throughput |
| `--queue-size` | 2x max-concurrent | Buffer for burst traffic |
| `--request-timeout-secs` | Based on max generation length | Prevent stuck requests |
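
The sizing rules in the table reduce to simple arithmetic. The following is a minimal sketch, assuming a hypothetical helper `suggest_limits` that is not part of `sglang_router`; the multipliers are starting points taken from the table, not guarantees.

```python
def suggest_limits(worker_count: int, per_worker_factor: int = 3) -> dict:
    """Derive starting values for the concurrency flags from the worker count.

    Hypothetical helper for illustration only; the factors follow the
    tuning table (2-4x worker count, queue size 2x max-concurrent).
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if not 2 <= per_worker_factor <= 4:
        raise ValueError("per_worker_factor should stay within the 2-4x range")
    max_concurrent = worker_count * per_worker_factor
    return {
        "--max-concurrent-requests": max_concurrent,
        "--queue-size": 2 * max_concurrent,
    }


def to_cli_args(limits: dict) -> list:
    """Flatten the flag mapping into an argument list for a launcher script."""
    args = []
    for flag, value in limits.items():
        args.extend([flag, str(value)])
    return args
```

The resulting list can be appended to a `python -m sglang_router.launch_router` command line and then adjusted after observing real queue depth and latency.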
+ +### Kubernetes Deployment + +**Pod Labeling for Service Discovery:** + +For the gateway to discover workers automatically, label your worker pods consistently: + +```yaml Config +# Worker Deployment (Regular Mode) +apiVersion: apps/v1 +kind: Deployment +metadata: + name: sglang-worker + namespace: production +spec: + replicas: 4 + selector: + matchLabels: + app: sglang-worker + component: inference + template: + metadata: + labels: + app: sglang-worker + component: inference + model: llama-3-8b + spec: + containers: + - name: worker + image: lmsysorg/sglang:latest + ports: + - containerPort: 8000 + name: http + - containerPort: 20000 + name: grpc +``` + +**Gateway configuration for discovery:** +```bash Command +python -m sglang_router.launch_router \ + --service-discovery \ + --selector app=sglang-worker component=inference \ + --service-discovery-namespace production \ + --service-discovery-port 8000 +``` + +**PD (Prefill/Decode) Mode Labeling:** + +```yaml Config +# Prefill Worker +metadata: + labels: + app: sglang-worker + component: prefill + annotations: + sglang.ai/bootstrap-port: "9001" + +# Decode Worker +metadata: + labels: + app: sglang-worker + component: decode +``` + +**Gateway configuration for PD discovery:** +```bash Command +python -m sglang_router.launch_router \ + --service-discovery \ + --pd-disaggregation \ + --prefill-selector app=sglang-worker component=prefill \ + --decode-selector app=sglang-worker component=decode \ + --service-discovery-namespace production +``` + +**RBAC Requirements:** + +The gateway needs permissions to watch pods: + +```yaml Config +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: sglang-gateway + namespace: production +rules: +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "list", "watch"] +*** +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: sglang-gateway + namespace: production +subjects: +- kind: ServiceAccount + name: sglang-gateway + namespace: 
production +roleRef: + kind: Role + name: sglang-gateway + apiGroup: rbac.authorization.k8s.io +``` + +### Monitoring with PromQL + +Configure Prometheus to scrape the gateway metrics endpoint (default: `:29000/metrics`). + +**Essential Dashboards:** + +**1. Request Rate and Latency:** +```sql Example +# Request rate by endpoint +sum(rate(smg_http_requests_total[5m])) by (path, method) + +# P50 latency +histogram_quantile(0.50, sum(rate(smg_http_request_duration_seconds_bucket[5m])) by (le)) + +# P99 latency +histogram_quantile(0.99, sum(rate(smg_http_request_duration_seconds_bucket[5m])) by (le)) + +# Error rate +sum(rate(smg_http_responses_total{status=~"5.."}[5m])) / sum(rate(smg_http_responses_total[5m])) +``` + +**2. Worker Health:** +```sql Example +# Healthy workers +sum(smg_worker_pool_size) + +# Active connections per worker +smg_worker_connections_active + +# Worker health check failures +sum(rate(smg_worker_health_checks_total{result="failure"}[5m])) by (worker_id) +``` + +**3. Circuit Breaker Status:** +```sql Example +# Circuit breaker states (0=closed, 1=open, 2=half-open) +smg_worker_cb_state + +# Circuit breaker transitions +sum(rate(smg_worker_cb_transitions_total[5m])) by (worker_id, from_state, to_state) + +# Workers with open circuits +count(smg_worker_cb_state == 1) +``` + +**4. Inference Performance (gRPC mode):** +```sql Example +# Time to first token (P50) +histogram_quantile(0.50, sum(rate(smg_router_ttft_seconds_bucket[5m])) by (le, model)) + +# Time per output token (P99) +histogram_quantile(0.99, sum(rate(smg_router_tpot_seconds_bucket[5m])) by (le, model)) + +# Token throughput +sum(rate(smg_router_tokens_total[5m])) by (model, direction) + +# Generation duration P95 +histogram_quantile(0.95, sum(rate(smg_router_generation_duration_seconds_bucket[5m])) by (le)) +``` + +**5. 
Rate Limiting and Queuing:** +```sql Example +# Rate limit rejections +sum(rate(smg_http_rate_limit_total{decision="rejected"}[5m])) + +# Queue depth (if using concurrency limiting) +smg_worker_requests_active + +# Retry attempts +sum(rate(smg_worker_retries_total[5m])) by (worker_id) + +# Exhausted retries (failures after all retries) +sum(rate(smg_worker_retries_exhausted_total[5m])) +``` + +**6. MCP Tool Execution:** +```sql Example +# Tool call rate +sum(rate(smg_mcp_tool_calls_total[5m])) by (server, tool) + +# Tool latency P95 +histogram_quantile(0.95, sum(rate(smg_mcp_tool_duration_seconds_bucket[5m])) by (le, tool)) + +# Active MCP server connections +smg_mcp_servers_active +``` + +**Alerting Rules Example:** + +```yaml Config +groups: +- name: sglang-gateway + rules: + - alert: HighErrorRate + expr: | + sum(rate(smg_http_responses_total{status=~"5.."}[5m])) + / sum(rate(smg_http_responses_total[5m])) > 0.05 + for: 5m + labels: + severity: critical + annotations: + summary: "High error rate on SGLang Gateway" + + - alert: CircuitBreakerOpen + expr: count(smg_worker_cb_state == 1) > 0 + for: 2m + labels: + severity: warning + annotations: + summary: "Worker circuit breaker is open" + + - alert: HighLatency + expr: | + histogram_quantile(0.99, sum(rate(smg_http_request_duration_seconds_bucket[5m])) by (le)) > 30 + for: 5m + labels: + severity: warning + annotations: + summary: "P99 latency exceeds 30 seconds" + + - alert: NoHealthyWorkers + expr: sum(smg_worker_pool_size) == 0 + for: 1m + labels: + severity: critical + annotations: + summary: "No healthy workers available" +``` + +*** +## Configuration Reference + +### Core Settings + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `--host` | str | 127.0.0.1 | Router host |
| `--port` | int | 30000 | Router port |
| `--worker-urls` | list | [] | Worker URLs (HTTP or gRPC) |
| `--policy` | str | cache_aware | Routing policy |
| `--max-concurrent-requests` | int | -1 | Concurrency limit (-1 disables) |
| `--request-timeout-secs` | int | 600 | Request timeout |
| `--max-payload-size` | int | 256MB | Maximum request payload |
+ +### Prefill/Decode + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `--pd-disaggregation` | flag | false | Enable PD mode |
| `--prefill` | list | [] | Prefill URLs + optional bootstrap ports |
| `--decode` | list | [] | Decode URLs |
| `--prefill-policy` | str | None | Override policy for prefill nodes |
| `--decode-policy` | str | None | Override policy for decode nodes |
| `--worker-startup-timeout-secs` | int | 600 | Worker init timeout |
+ +### Kubernetes Discovery + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Description |
| --- | --- | --- |
| `--service-discovery` | flag | Enable discovery |
| `--selector` | list | Label selectors (key=value) |
| `--prefill-selector` / `--decode-selector` | list | PD mode selectors |
| `--service-discovery-namespace` | str | Namespace to watch |
| `--service-discovery-port` | int | Worker port (default 80) |
| `--bootstrap-port-annotation` | str | Annotation for bootstrap ports |
+ +### TLS Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Description |
| --- | --- | --- |
| `--tls-cert-path` | str | Server certificate for gateway HTTPS (PEM) |
| `--tls-key-path` | str | Server private key for gateway HTTPS (PEM) |
| `--client-cert-path` | str | Client certificate for worker mTLS (PEM) |
| `--client-key-path` | str | Client private key for worker mTLS (PEM) |
| `--ca-cert-path` | str | CA certificate for verifying workers (PEM, repeatable) |
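
The TLS flags come in pairs: the server certificate and key must be provided together, and so must the client certificate and key. A pre-launch check for that rule might look like the sketch below. The function `check_tls_flags` and the dictionary input are hypothetical, and the gateway's own validation may behave differently.

```python
# Flag pairs that the documentation says must be provided together.
REQUIRED_PAIRS = [
    ("--tls-cert-path", "--tls-key-path"),
    ("--client-cert-path", "--client-key-path"),
]


def check_tls_flags(flags: dict) -> list:
    """Return human-readable problems with the TLS flags (empty if none).

    `flags` maps a flag name to its value; missing or empty values count
    as "not provided". Hypothetical helper, for illustration only.
    """
    problems = []
    for cert_flag, key_flag in REQUIRED_PAIRS:
        has_cert = bool(flags.get(cert_flag))
        has_key = bool(flags.get(key_flag))
        if has_cert != has_key:
            problems.append(f"{cert_flag} and {key_flag} must be provided together")
    return problems
```

Running such a check in a deployment script surfaces a half-configured pair before the gateway starts, rather than at the first TLS handshake.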
+ +*** +## Troubleshooting + +### Workers Never Ready + +Increase `--worker-startup-timeout-secs` or ensure health probes respond before router startup. + +### Load Imbalance / Hot Workers + +Inspect `smg_router_requests_total` by worker and tune cache-aware thresholds (`--balance-*`, `--cache-threshold`). + +### Circuit Breaker Flapping + +Increase `--cb-failure-threshold` or extend the timeout/window durations. Consider temporarily disabling retries. + +### Queue Overflow (429) + +Increase `--queue-size` or reduce client concurrency. Ensure `--max-concurrent-requests` matches downstream capacity. + +### Memory Growth + +Reduce `--max-tree-size` or lower `--eviction-interval-secs` for more aggressive cache pruning. + +### Debugging + +```bash Command +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 \ + --log-level debug \ + --log-dir ./router_logs +``` + +### gRPC Connection Issues + +Ensure workers are started with `--grpc-mode` and verify `--model-path` or `--tokenizer-path` is provided to the router. + +### Tokenizer Loading Failures + +Check HuggingFace Hub credentials (`HF_TOKEN` environment variable) for private models. Verify local paths are accessible. + +*** +SGLang Model Gateway continues to evolve alongside the SGLang runtime. Keep CLI flags, integrations, and documentation aligned when adopting new features or contributing improvements. diff --git a/docs_new/docs/advanced_features/sglang_for_rl.mdx b/docs_new/docs/advanced_features/sglang_for_rl.mdx new file mode 100644 index 000000000000..7e2f4a1e1831 --- /dev/null +++ b/docs_new/docs/advanced_features/sglang_for_rl.mdx @@ -0,0 +1,655 @@ +--- +title: "SGLang for RL Systems" +metatags: + description: "SGLang for RL: engine sleep/wake, weight refit, partial rollout, deterministic inference, cache-aware load balancing for RLHF." +--- +This document is a practical guide for infrastructure teams integrating SGLang into RL and post-training systems. 
It focuses on the operational pain points in the loop (rollout, evaluation, training, weight sync) and maps them to concrete SGLang APIs, flags, and integration patterns. The focus is on maximizing rollout efficiency, accuracy and stability while keeping rollout-serving behavior aligned in production environments. + +## Why SGLang for RL Lifecycle? + +Let's embrace a guiding principle from early DeepMind's RL engineering: + +**Be a library, not a framework.** + +This philosophy empowers innovation by providing SGLang as flexible tools, not rigid structures. Here are five reasons to use SGLang for your RL lifecycle: + +* **Fine-Grained Engine Sleep and Wake Up**: facilitate maximum-powered rollout and training +* **Open-To-Use Refit Functionality**: diverse methods for co-location or disaggregation +* **Easy To Postpone Generation**: enable partial rollout and dedicated rollout control +* **Deterministic Inference**: achieve deterministic inference to enable zero training-inference mismatch +* **Load Balancing Router**: cache-aware load-balancing for high-throughput rollout + +The following sections cover these aspects in detail. + +## Fine-Grained Engine Sleep and Wake Up + +Rollout and training are both memory-intensive, and co-locating them on the same GPUs often leads to memory pressure and slow handoffs. SGLang provides a memory-aware sleep/wake mechanism that releases KV cache and weights while keeping the server process alive, then resumes them for rollout without a full restart. This avoids repeated disk I/O and CUDA graph recapture during each RL step. + +Under the hood, the RL team uses CUDA-graph-aware weight offload via [torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) to preserve virtual memory addresses for graph replay. For details, see: [Efficient RL Training - Optimizing Memory Usage in verl](https://hebiao064.github.io/rl-memory-management). 
+ +### Server flag + +Enable memory saver support when launching the server: + +```text Output +--enable-memory-saver +``` + +### Release Memory + +**Endpoint:** `POST /release_memory_occupation` + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `tags` | Which memory regions to release. If omitted, all are released. | `None` | Type: list[str], values: `kv_cache`, `weights` |
+{/* python/sglang/srt/managers/io_struct.py#L1381 currently only supports `kv_cache`, `weights` */} +**Behavior notes:** + +- This call asserts there are no ongoing requests. Ensure the engine is idle before calling it. +- If `kv_cache` is released, SGLang flushes cache; subsequent requests will rebuild KV cache as needed. + +### Resume Memory + +**Endpoint:** `POST /resume_memory_occupation` + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `tags` | Which memory regions to resume. If omitted, all are resumed. | `None` | Type: list[str], values: `kv_cache`, `weights` |
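
Both endpoints take the same optional `tags` field, so one request builder can cover release and resume. The sketch below uses only the Python standard library. The helper name `build_memory_request`, the base URL passed by the caller, and sending an empty JSON object when `tags` is omitted are assumptions rather than documented behavior.

```python
import json
import urllib.request

# Tag values listed in the request tables for both endpoints.
VALID_TAGS = {"kv_cache", "weights"}


def build_memory_request(base_url, action, tags=None):
    """Build (but do not send) a POST request for release or resume.

    `action` is "release" or "resume"; `tags` is None (all regions) or a
    list drawn from VALID_TAGS. Hypothetical helper for illustration.
    """
    if action not in ("release", "resume"):
        raise ValueError(f"unknown action: {action}")
    body = {}
    if tags is not None:
        unknown = set(tags) - VALID_TAGS
        if unknown:
            raise ValueError(f"unsupported tags: {sorted(unknown)}")
        body["tags"] = list(tags)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{action}_memory_occupation",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller would pass the returned object to `urllib.request.urlopen` once the engine is idle, since the release call asserts that no requests are in flight.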
+{/* python/sglang/srt/managers/io_struct.py#L1393 currently only supports `kv_cache`, `weights` */} + +## Open-To-Use Refit Functionality + +After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. For a deeper dive into SGLang's weight update utilities, see [RL System Deep Thinking: Weight Update Mechanisms](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/sys-design/readme-1-EN.md). + +**How to choose:** + +- **From disk** is simplest and best for elastic rollout scaling and checkpointing. +- **From tensor** is best for co-located training/rollout when you can pass in-memory tensors. +- **From distributed** is best for disaggregated training/rollout with dedicated communication groups (NCCL/IB). + +### Update Weights from Disk + +**When to use:** + +- Save checkpoint to disk and update weights from disk +- Dynamic scaling (new rollout instances can load from the same checkpoint) + +**Why it works well:** + +This path trades some I/O overhead for simplicity and flexibility. It integrates naturally with checkpointing and makes it trivial to add new rollout engines: point them at the same checkpoint and call the API. It is also the safest option for high availability because the checkpoint itself is the source of truth. + +**Endpoint:** `POST /update_weights_from_disk` + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `model_path` | The model path with the new weights. | Required | Type: str |
| `load_format` | The format to load the weights. | `None` | Type: str |
| `abort_all_requests` | Abort all running requests before update. | `False` | Type: bool |
| `weight_version` | Optional weight version label tracked by the server. | `None` | Type: str |
| `is_async` | Perform weight load asynchronously. | `False` | Type: bool |
| `torch_empty_cache` | Empty torch cache. | `False` | Type: bool |
| `keep_pause` | Keep scheduler paused after update. | `False` | Type: bool |
| `recapture_cuda_graph` | Recapture CUDA graphs after update. | `False` | Type: bool |
| `token_step` | Trainer step id for rollout bookkeeping. | `0` | Type: int |
| `flush_cache` | Flush KV cache after update. | `True` | Type: bool |
+ +**Response body:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `success` | Whether the update succeeded. | - | Type: bool |
| `message` | Status / error message. | - | Type: str |
| `num_paused_requests` | Number of paused requests during update. | `0` | Type: int |
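
The request and response schemas above map directly onto a small client. This is a sketch under stated assumptions: the helper names, the base URL, and the choice to raise on `success: false` are illustrative and not part of SGLang. Only the field names come from the tables.

```python
import json
import urllib.request


def build_disk_update_request(base_url, model_path, weight_version=None, flush_cache=True):
    """Build (but do not send) a POST request for /update_weights_from_disk."""
    body = {"model_path": model_path, "flush_cache": flush_cache}
    if weight_version is not None:
        body["weight_version"] = weight_version
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/update_weights_from_disk",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_update_response(raw: bytes) -> int:
    """Return num_paused_requests, or raise if the server reported failure."""
    payload = json.loads(raw)
    if not payload.get("success", False):
        raise RuntimeError(payload.get("message", "weight update failed"))
    return payload.get("num_paused_requests", 0)
```

In a training loop, the trainer would save a checkpoint, send this request to every rollout engine, and proceed only after each response parses without raising.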
+ +**Python Engine API:** `engine.update_weights_from_disk(model_path, load_format=None)` + +**Diffusion engine (SGLang-Diffusion):** The diffusion engine exposes the same `POST /update_weights_from_disk` endpoint with the following behavior: + +- **All-or-nothing with rollback:** if any module fails to load, all previously updated modules are rolled back to the original weights by reloading from the original model path. No partial updates are left behind. If rollback itself fails, the exception propagates so the caller knows the model is in an inconsistent state. +- **Offload-aware:** when layerwise offload (`--dit-layerwise-offload`) is enabled, the diffusion offload manager replaces GPU parameters with small `torch.empty((1,))` placeholders while real weights live in consolidated pinned CPU buffers. A naive `param.data.copy_()` would fail with a shape mismatch. Instead, the updater dynamically detects active offload managers and writes new weights directly into their CPU buffers, bypassing the placeholders entirely. For any layer that happens to be prefetched on GPU at update time, the live GPU tensor is also updated so the change takes effect immediately. This requires no extra GPU memory and does not disturb the offload state. +- **DTensor-aware:** parameters distributed via `torch.distributed.tensor` (tensor parallelism) are updated through `distribute_tensor` so that each shard is correctly placed on the right device mesh. + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `model_path` | The model path with the new weights. | Required | Type: str |
| `flush_cache` | Flush TeaCache state after update. | `True` | Type: bool |
| `target_modules` | List of module names to update (e.g. `["transformer"]`). If omitted, all `nn.Module` components are updated. | `None` | Type: list[str] |
+ +**Response body:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `success` | Whether the update succeeded. | - | Type: bool |
| `message` | Status / error message. | - | Type: str |
+ +> **Note:** The diffusion engine (SGLang-Diffusion) does not currently support hot refit (updating weights while inference is in progress). The diffusion scheduler processes one request at a time and completes the entire inference before handling the next request, so weight updates and inference never run concurrently. + +### Update Weights from Tensor + +**When to use:** + +- Co-located training and rollout, where training can provide tensors directly +- Fast in-memory updates + +**Important constraints:** + +This strategy requires the training process and rollout engine to share access to the tensors. Co-located setups must keep the model on GPU; moving tensors to CPU will break the update path. For high-performance MoE or specialized attention kernels, co-location may limit some optimizations compared to disaggregated rollouts. + +**Endpoint:** `POST /update_weights_from_tensor` + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `serialized_named_tensors` | Per-TP serialized tensor payloads. | Required | Type: list[str\|bytes] |
| `load_format` | Optional load format selector. | `None` | `None`, `direct`, `flattened_bucket`, or a custom loader path string |
| `flush_cache` | Flush KV cache after update. | `True` | Type: bool |
| `abort_all_requests` | Abort all running requests before update. | `False` | Type: bool |
| `weight_version` | Optional version label tracked by the server. | `None` | Type: str |
+ +**Note:** The serialized tensor payloads must be created with `MultiprocessingSerializer.serialize(...)` and should be base64-safe strings. + +**Python Engine API:** `engine.update_weights_from_tensor(named_tensors, load_format=None, flush_cache=True)` + +### Update Weights from Distributed Group + +**When to use:** + +- Disaggregated training and rollout +- NCCL or IB-backed weight broadcast from training workers to rollout workers + +**How it works:** + +Training workers gather weights (typically on TP rank 0), broadcast them to the rollout group, and each rollout TP shard loads the parameters it needs. This avoids disk I/O and keeps training and rollout decoupled, at the cost of managing a dedicated communication group. + +**Initialize weight update group** + +**Endpoint:** `POST /init_weights_update_group` + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `master_address` | Group master address. | Required | Type: str |
| `master_port` | Group master port. | Required | Type: int |
| `rank_offset` | Offset for local rank mapping. | Required | Type: int |
| `world_size` | Total world size. | Required | Type: int |
| `group_name` | Group name. | `weight_update_group` | Type: str |
| `backend` | Communication backend. | `nccl` | Type: str |
+ +**Update weight** + +**Endpoint:** `POST /update_weights_from_distributed` + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `names` | Parameter names to update. | Required | Type: list[str] |
| `dtypes` | Dtype strings for each parameter. | Required | Type: list[str] |
| `shapes` | Tensor shapes. | Required | Type: list[list[int]] |
| `group_name` | Group name. | `weight_update_group` | Type: str |
| `flush_cache` | Flush KV cache after update. | `True` | Type: bool |
| `abort_all_requests` | Abort all running requests before update. | `False` | Type: bool |
| `weight_version` | Optional version label. | `None` | Type: str |
| `load_format` | Optional format selector. | `None` | `None` or `flattened_bucket` |
+ +**Destroy weights update group** + +**Endpoint:** `POST /destroy_weights_update_group` + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `group_name` | Group name. | `weight_update_group` | Type: str |
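
The three endpoints form a lifecycle: initialize the group, send one or more updates, then destroy the group. The request bodies can be sketched as plain dictionaries, as below. The helper names, the example parameter name, and the dtype string format (`"bfloat16"`) are assumptions for illustration. Only the field names and defaults come from the tables above.

```python
DEFAULT_GROUP = "weight_update_group"


def init_group_body(master_address, master_port, rank_offset, world_size,
                    group_name=DEFAULT_GROUP, backend="nccl"):
    """Body for POST /init_weights_update_group."""
    return {
        "master_address": master_address,
        "master_port": master_port,
        "rank_offset": rank_offset,
        "world_size": world_size,
        "group_name": group_name,
        "backend": backend,
    }


def update_body(params, group_name=DEFAULT_GROUP, flush_cache=True):
    """Body for POST /update_weights_from_distributed.

    `params` is a list of (name, dtype_string, shape) triples; the three
    parallel lists in the request are kept in the same order.
    """
    return {
        "names": [name for name, _, _ in params],
        "dtypes": [dtype for _, dtype, _ in params],
        "shapes": [list(shape) for _, _, shape in params],
        "group_name": group_name,
        "flush_cache": flush_cache,
    }


def destroy_group_body(group_name=DEFAULT_GROUP):
    """Body for POST /destroy_weights_update_group."""
    return {"group_name": group_name}
```

Building `names`, `dtypes`, and `shapes` from a single list of triples keeps the three fields aligned, which matters because the rollout side matches them by position.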
+ +**Python Engine APIs:** + +- `engine.init_weights_update_group(...)` +- `engine.update_weights_from_distributed(names, dtypes, shapes, ...)` +- `engine.destroy_weights_update_group(group_name)` + +## Easy To Postpone Generation + +Multi-turn RL rollouts often suffer from long-tail requests that block the entire batch. A small number of slow interactions can stall all GPUs, and the long-tail behavior makes profiling and monitoring difficult. + +SGLang exposes explicit pause/resume APIs so you can pause slow requests and continue them later. This pattern matches systems like [APRIL](https://arxiv.org/abs/2509.18521), terminate once enough responses are collected, and recycle incomplete responses in the next step. The result is higher GPU utilization without discarding partial work. + +`pause_generation` --- update weights --- `continue_generation` is the correct execution flow when updating weights from training. An update can only happen when SGLang is not actively processing inference tasks. + +### Pause Generation + +**Endpoint:** `POST /pause_generation` + +**Request body:** + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `mode` | Pause mode. | `abort` | `abort`, `retract`, `in_place` |
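
The pause, update, continue flow described above can be sketched as a small wrapper. The function `refit_with_pause` and the injected `post` callable are hypothetical: `post(path, body)` stands in for whatever HTTP client the trainer already uses. Resuming in a `finally` block is a design choice of this sketch, not documented SGLang behavior.

```python
from typing import Callable

# Pause modes listed in the request table.
PAUSE_MODES = ("abort", "retract", "in_place")


def refit_with_pause(post: Callable[[str, dict], dict],
                     update_path: str,
                     update_body: dict,
                     mode: str = "abort") -> dict:
    """Pause generation, run one weight update, then continue generation.

    `post` sends a JSON body to an endpoint path and returns the decoded
    response. Generation is resumed even if the update raises, so the
    engine is not left paused by a failed refit.
    """
    if mode not in PAUSE_MODES:
        raise ValueError(f"unknown pause mode: {mode}")
    post("/pause_generation", {"mode": mode})
    try:
        return post(update_path, update_body)
    finally:
        post("/continue_generation", {})
```

Injecting `post` keeps the sketch independent of any particular HTTP library and makes the call order easy to check with a stub.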
+ +**Modes:** + +- `abort`: Default behavior, identical to `abort` endpoint with `abort_all` set. Pending requests from `waiting_queue` and `running_queue` will be returned immediately to the caller. +- `retract`: Put engine in "paused" state. Move running requests back to waiting queue. KV cache can be flushed and recomputed later. +- `in_place`: Put engine in "paused" state without changing states of the requests. Running requests rely on availability of KV caches to continue, so any subsequent `flush_cache` call will be unsuccessful. + +### Continue Generation + +**Endpoint:** `POST /continue_generation` + +## Deterministic Inference + +In many RL stacks, rollout and training are implemented with different kernels or batching behavior. Even when weights are identical, token probabilities can drift, silently breaking the on-policy assumption. This is the training–inference mismatch problem. + +SGLang supports a deterministic inference mode that reduces non-determinism across batch shapes. This mitigates variance introduced by runtime batching and kernel selection. To further achieve true on-policy training, you need to modify the training engine to use the same deterministic kernels. For implementation details, see these miles examples: [True On-Policy](https://github.com/radixark/miles/tree/main/examples/true_on_policy) and [True On-Policy for VLM](https://github.com/radixark/miles/tree/main/examples/true_on_policy_vlm). For additional context, see the blog post [Let Speed Be With Stability: All-In-One Solution to Training-Inference Mismatch with Miles](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/mismatch/blog-en.md). + +**Server flag:** + +```text Output +--enable-deterministic-inference +``` + +For more details, see [Deterministic Inference](./deterministic_inference) + +## Load Balancing Router + +SGLang Model Gateway is the recommended control plane for large‑scale RL rollouts. 
It provides async, non‑blocking request handling, cache‑aware load balancing, and fault‑tolerant routing across rollout and reward servers. This lets you keep GPUs saturated while avoiding long‑tail stalls and brittle, engine‑local concurrency logic. It has been deployed in the training of GLM 4.5+ models and proven to be highly efficient in production-level large-scale RL workloads. + +Key benefits for RL infrastructure: + +- **Async non-blocking efficiency**: SGLang’s native async server/router architecture (HTTPS/gRPC) manages concurrency automatically. This guarantees maximum GPU saturation and effective continuous batching without requiring complex, manual implementation by engineers. +- **Elasticity and fault tolerance**: By encapsulating the reward model and rollout as independent servers, SGLang decouples them logically and physically. This architecture provides robust disaster recovery for large-scale distributed training; if a server fails, the router automatically redirects traffic to healthy nodes, ensuring the training process continues without interruption. +- **Training–Inference alignment**: Using the SGLang Model Gateway for both training and inference ensures "What You See Is What You Get." This eliminates score discrepancies and the painful backend alignment issues often caused by using different engines for training versus deployment. +- **Dynamic load balancing and long-tail mitigation**: Unlike static partitioning, the SGLang Model Gateway enables request-level dynamic dispatching for multi-turn RL. It can distribute different turns of a conversation across different servers to balance workloads and eliminate long-tail latency caused by varying sequence lengths. 
+ +For deployment and configuration, see: [SGLang Model Gateway](./sgl_model_gateway) diff --git a/docs/advanced_features/speculative_decoding.ipynb b/docs_new/docs/advanced_features/speculative_decoding.ipynb similarity index 96% rename from docs/advanced_features/speculative_decoding.ipynb rename to docs_new/docs/advanced_features/speculative_decoding.ipynb index aa62b897a8b6..c24cac4025bd 100644 --- a/docs/advanced_features/speculative_decoding.ipynb +++ b/docs_new/docs/advanced_features/speculative_decoding.ipynb @@ -66,13 +66,11 @@ "metadata": {}, "outputs": [], "source": [ - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \\\n", " --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \\\n", " --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] @@ -121,14 +119,12 @@ "metadata": {}, "outputs": [], "source": [ - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \\\n", " --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \\\n", " --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \\\n", " --enable-torch-compile --torch-compile-max-bs 2 --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] @@ -181,14 +177,12 @@ "metadata": {}, "outputs": [], "source": [ - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model 
meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \\\n", " --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \\\n", " --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \\\n", " --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] @@ -237,14 +231,12 @@ "metadata": {}, "outputs": [], "source": [ - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", "python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \\\n", " --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \\\n", " --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \\\n", " --cuda-graph-max-bs 2 --dtype float16 --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] @@ -293,13 +285,11 @@ "metadata": {}, "outputs": [], "source": [ - "server_process, port = launch_server_cmd(\n", - " \"\"\"\n", + "server_process, port = launch_server_cmd(\"\"\"\n", " python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \\\n", " --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \\\n", " --mem-fraction 0.5 --log-level warning\n", - "\"\"\"\n", - ")\n", + "\"\"\")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] diff --git a/docs_new/docs/advanced_features/speculative_decoding.mdx b/docs_new/docs/advanced_features/speculative_decoding.mdx new file mode 100644 index 000000000000..931e5c05ae43 --- /dev/null +++ b/docs_new/docs/advanced_features/speculative_decoding.mdx @@ -0,0 +1,1049 @@ +--- 
+title: "Speculative Decoding" +metatags: + description: "SGLang speculative decoding: EAGLE-2/EAGLE-3, MTP, DFLASH, draft model configuration, and overlap-scheduler guidance." +--- +SGLang provides several speculative decoding options, including EAGLE-2/EAGLE-3, MTP, DFLASH, classic draft-model decoding, and an NGRAM-based variant. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines. + +## Summary + +### Jump to sections + +- [EAGLE Decoding](#eagle-decoding) + - [EAGLE-2 Decoding](#eagle-2-decoding) + - [EAGLE-2 Decoding with torch.compile](#eagle-2-decoding-with-torchcompile) + - [EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling](#eagle-2-decoding-via-frequency-ranked-speculative-sampling) + - [EAGLE-3 Decoding](#eagle-3-decoding) +- [Multi Token Prediction](#multi-token-prediction) +- [DFlash Decoding](#dflash-decoding) +- [Standalone Speculative Decoding (Small Draft Model)](#standalone-speculative-decoding-small-draft-model) +- [Speculative Decoding V2 (Overlap Scheduler)](#speculative-decoding-v2-overlap-scheduler) +- [Ngram Speculative Decoding](#ngram-speculative-decoding) +- [Full Parameter Reference](#full-parameter-reference) +- [OOM Troubleshooting](#oom-troubleshooting) +- [References](#references) + +### Quick guidance + +- **Best speed/quality (recommended)**: Use **EAGLE-3** with `--speculative-algorithm EAGLE3`. +- **Strong default / broad compatibility**: Use **EAGLE-2** with `--speculative-algorithm EAGLE`. +- **Workload acceptance changes over time**: Use [**Adaptive speculative decoding**](./adaptive_speculative_decoding) on top of **EAGLE** with `--speculative-eagle-topk 1`. +- **Lower `lm_head` overhead for EAGLE-2**: Enable **FR-Spec** with `--speculative-token-map`. +- **Model is MTP-enabled**: Use **MTP via speculative decoding** (often with small `speculative_num_steps/topk/num_draft_tokens`, see the example section). 
+- **You have a DFlash draft checkpoint**: Use **DFLASH** with `--speculative-algorithm DFLASH` and `--speculative-draft-model-path ...`. +- **You have a smaller draft LLM**: Use **STANDALONE** (`--speculative-algorithm STANDALONE`). +- **No extra model available**: Use **NGRAM** (`--speculative-algorithm NGRAM`, CUDA-only). +- **Want overlap scheduler (experimental)**: Enable **SpecV2** with `SGLANG_ENABLE_SPEC_V2=True` (requires `--speculative-eagle-topk 1`). + +### Method comparison (mini table) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Draft source | Separate draft model? | How to enable | Notes / constraints |
| --- | --- | --- | --- | --- |
| EAGLE-2 | EAGLE draft model (feature drafting + tree) | Typically yes | `--speculative-algorithm EAGLE` + `--speculative-draft-model-path ...` | Tune `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens` |
| EAGLE-2 + `torch.compile` | Same as EAGLE-2 | Typically yes | Add `--enable-torch-compile` (optionally `--torch-compile-max-bs`) | Benefit varies by hardware/model; benchmark to verify |
| EAGLE-2 + FR-Spec | Same as EAGLE-2 + token subset | Typically yes | Add `--speculative-token-map ...` | Reduces `lm_head` overhead with high-frequency token vocab |
| EAGLE-3 | EAGLE3 draft model | Yes | `--speculative-algorithm EAGLE3` + `--speculative-draft-model-path ...` | Best throughput in the benchmark below |
| MTP | Built-in multi-token heads (model-specific) | Often no | See **Multi Token Prediction** section | Uses speculative workflow; draft path may be auto-handled for some models |
| DFLASH | DFlash draft model (linear block verification) | Yes | `--speculative-algorithm DFLASH` + `--speculative-draft-model-path ...` | No `--enable-dp-attention`; `pp_size == 1`; disables overlap scheduler & mixed chunked prefill |
| STANDALONE | Smaller draft LLM (token-level) | Yes | `--speculative-algorithm STANDALONE` + `--speculative-draft-model-path ...` | Does not support `--enable-dp-attention` |
| SpecV2 (experimental) | V2 workers + overlap scheduler | N/A | `SGLANG_ENABLE_SPEC_V2=True` | Only supports `--speculative-eagle-topk 1`; applies to `EAGLE`, `EAGLE3`, `STANDALONE` |
| NGRAM | Ngram cache from previous tokens | No | `--speculative-algorithm NGRAM` | CUDA-only; no `--enable-dp-attention`; disables overlap scheduler & mixed chunked prefill |
+ +### Performance Highlights + +The table below shows the throughput gains from EAGLE-3 decoding for Llama 3.1 8B Instruct on MT-Bench: roughly 2.4x over the non-speculative baseline on a single H100. +For further details, see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840). + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Throughput (tokens/s) |
| --- | --- |
| SGLang (w/o speculative, 1x H100) | 158.34 |
| SGLang + EAGLE-2 (1x H100) | 244.10 |
| SGLang + EAGLE-3 (1x H100) | 373.25 |
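
To put the throughput figures above in relative terms, the short sketch below computes each method's speedup over the non-speculative baseline. The numbers are copied from the table; the dictionary keys and the `speedup` helper are illustrative names, not part of SGLang.

```python
# Throughput figures (tokens/s) copied from the table above.
throughput = {
    "baseline": 158.34,
    "EAGLE-2": 244.10,
    "EAGLE-3": 373.25,
}


def speedup(method: str) -> float:
    """Return the throughput of `method` relative to the non-speculative baseline."""
    return throughput[method] / throughput["baseline"]


for name in ("EAGLE-2", "EAGLE-3"):
    # Prints "EAGLE-2: 1.54x" and "EAGLE-3: 2.36x"
    print(f"{name}: {speedup(name):.2f}x")
```

These ratios describe this one benchmark only; speedups on other models, hardware, or workloads may differ.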
+ +--- + +## EAGLE Decoding + +To enable EAGLE speculative decoding the following parameters are relevant: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Default |
| --- | --- | --- |
| `--speculative-draft-model-path` | Draft model path/weights. Typically required for EAGLE/EAGLE3 and STANDALONE. For some MTP-enabled models, this can be omitted. | `None` |
| `--speculative-num-steps` | Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. | Auto (5 for Llama/Grok; 3 for many other models) |
| `--speculative-eagle-topk` | Branching factor per step. Improves candidate diversity and acceptance rate, but increases memory/compute consumption. | Auto (4 for Llama/Grok; 1 for many other models) |
| `--speculative-num-draft-tokens` | Maximum parallel verification capacity. Allows deeper tree evaluation but increases GPU memory usage. | Auto (8 for Llama/Grok; 4 for many other models). If `topk=1`, it is adjusted to `num_steps + 1`. |
| `--speculative-accept-threshold-single` | Acceptance threshold for single-token verification. Lower values accept more aggressively. | `1.0` |
| `--speculative-accept-threshold-acc` | Accumulated acceptance threshold across steps. | `1.0` |
| `--speculative-attention-mode` | Attention mode for speculative operations (`prefill` or `decode`), affecting both target verification and draft extension. | `"prefill"` |
| `--speculative-draft-attention-backend` | Override attention backend for the draft model. | `None` (same as target) |
| `--speculative-draft-model-quantization` | Quantization method for the draft model. Use `"unquant"` to force no quantization even when the target model is quantized. | Same as target model |
| `--speculative-draft-model-revision` | Specific revision/commit of the draft model to load. | `None` (auto-set to `"main"` when `--speculative-draft-model-path` is set and revision is omitted) |
| `--speculative-draft-load-format` | Load format for the draft model weights. | `None` |
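
The `topk=1` adjustment noted in the Default column for `--speculative-num-draft-tokens` can be sketched as follows. This is a minimal illustration of the documented rule only; `resolve_num_draft_tokens` is a hypothetical helper, not a function in SGLang, and the real server logic may apply further checks.

```python
def resolve_num_draft_tokens(num_steps: int, topk: int, num_draft_tokens: int) -> int:
    """Hypothetical helper mirroring the documented topk=1 adjustment."""
    if topk == 1:
        # Documented rule: with a single branch, the value becomes num_steps + 1.
        return num_steps + 1
    # With topk > 1 the requested verification capacity is kept as-is here.
    return num_draft_tokens


# A single-branch draft with 3 steps ends up with 4 draft tokens.
print(resolve_num_draft_tokens(num_steps=3, topk=1, num_draft_tokens=16))  # prints 4
# A branching draft keeps the requested capacity.
print(resolve_num_draft_tokens(num_steps=3, topk=4, num_draft_tokens=16))  # prints 16
```

The `topk 1` launch commands in this document follow the same pattern (for example, 1 step with 2 draft tokens, or 4 steps with 5 draft tokens).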
+ +These parameters are mostly the same for EAGLE-2 and EAGLE-3. `--speculative-token-map` is ignored for EAGLE-3 models. +For `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens`: leave all three unset to use auto-tuning, or set all three explicitly when tuning. +If you use EAGLE with `--speculative-eagle-topk 1` and your acceptance rate varies across requests, see [Adaptive Speculative Decoding](./adaptive_speculative_decoding). + +You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py). + + +### EAGLE-2 Decoding + +You can enable EAGLE-2 Decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model. + +**Launch the server:** + +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Llama-2-7b-chat-hf \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --speculative-num-draft-tokens 16 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Llama-2-7b-chat-hf", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +### EAGLE-2 Decoding with `torch.compile` + +You can optionally enable `torch.compile` to apply kernel-level optimizations (operator fusion, autotune) to the draft model. The actual speedup depends on your hardware, model architecture, and batch size. 
In some configurations (e.g., small draft models on H100 where cuBLAS is already optimal and CUDA graphs are enabled), the benefit may be negligible. We recommend benchmarking with and without this flag on your specific setup to verify whether it helps. + +To enable it, add `--enable-torch-compile` and optionally set `--torch-compile-max-bs`: + +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Llama-2-7b-chat-hf \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --speculative-num-draft-tokens 16 \ + --mem-fraction-static 0.7 \ + --enable-torch-compile \ + --torch-compile-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Llama-2-7b-chat-hf", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling + +By employing a truncated high-frequency token vocabulary in the draft model, EAGLE speculative decoding reduces `lm_head` computational overhead while accelerating the pipeline without quality degradation. For more details, check out [the paper](https://arxiv.org/pdf/2502.14856). + +In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens in FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec). Or you can obtain high-frequency tokens by directly downloading these tokens from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset). 
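
The idea behind the truncated vocabulary can be illustrated with a toy example: the draft scores only the rows of `lm_head` that belong to high-frequency tokens and returns an id in the full vocabulary. This is a simplified sketch under stated assumptions, not the FR-Spec or SGLang implementation; `draft_argmax`, `lm_head_rows`, and `hot_token_ids` are illustrative names, and a real `lm_head` is a dense tensor rather than a Python list.

```python
from typing import List, Sequence


def draft_argmax(
    hidden: Sequence[float],
    lm_head_rows: List[Sequence[float]],
    hot_token_ids: Sequence[int],
) -> int:
    """Return the best-scoring token id, scoring only the high-frequency rows."""
    best_id, best_score = -1, float("-inf")
    for token_id in hot_token_ids:
        # Dot product of the hidden state with one vocabulary row.
        score = sum(h * w for h, w in zip(hidden, lm_head_rows[token_id]))
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id


# Toy vocabulary of 4 tokens; only ids 0, 1, and 3 are treated as high-frequency.
lm_head_rows = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 3.0]]
hidden = [1.0, 1.0]

print(draft_argmax(hidden, lm_head_rows, hot_token_ids=[0, 1, 3]))  # prints 3
print(draft_argmax(hidden, lm_head_rows, hot_token_ids=[0, 1, 2, 3]))  # prints 2
```

The truncated draft scores three rows instead of four and may propose a different token than the full vocabulary would. Draft tokens are still verified by the target model, so the truncation affects only which candidates are proposed.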
+ +Thanks for the contribution from [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx). + +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3-8B-Instruct \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --speculative-num-draft-tokens 16 \ + --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --dtype float16 \ + --log-level warning +``` + +**Send a request:** + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3-8B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +### EAGLE-3 Decoding + +You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model. 
+ +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --speculative-num-draft-tokens 16 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --dtype float16 \ + --log-level warning +``` + +**Send a request:** + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3.1-8B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +## Multi Token Prediction + +We support [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang by using speculative decoding. We use `XiaomiMiMo/MiMo-7B-RL` as an example here (for DeepSeek MTP usage, refer to [deepseek_v32 doc](../basic_usage/deepseek_v32#multi-token-prediction)). + +```bash Command +python3 -m sglang.launch_server \ + --model XiaomiMiMo/MiMo-7B-RL \ + --host 0.0.0.0 \ + --trust-remote-code \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 1 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 2 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python Example +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "XiaomiMiMo/MiMo-7B-RL", + "messages": [{"role": "user", "content": "What is the capital of France?"}], +} + +response = requests.post(url, json=data) +print(response.json()) +``` + +--- + +## DFlash Decoding + +SGLang also supports **DFLASH** speculative decoding using a dedicated draft model checkpoint. 
Compared with EAGLE-style tree verification, DFLASH verifies a linear draft block and is configured around a block size / draft window. This path is useful when the target model has a matching DFlash draft checkpoint, such as `meta-llama/Llama-3.1-8B-Instruct` with `z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat`. + +Relevant parameters: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Default |
| --- | --- | --- |
| `--speculative-draft-model-path` | Required DFlash draft model path/weights. | `None` |
| `--speculative-num-draft-tokens` | DFlash verify block size. | Inferred from draft config, otherwise `16` |
| `--speculative-dflash-block-size` | Alias of `--speculative-num-draft-tokens` for DFlash. | `None` |
| `--speculative-dflash-draft-window-size` | Draft KV sliding-window size. Must be `>= speculative-num-draft-tokens` when set. | `None` |
+ +```bash Command +python3 -m sglang.launch_server \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --speculative-algorithm DFLASH \ + --speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat +``` + +**Send a request:** + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Llama-3.1-8B-Instruct", + messages=[ + {"role": "user", "content": "Write a quicksort implementation in Python."}, + ], + temperature=0, + max_tokens=128, +) + +print(response.choices[0].message.content) +``` + +--- + +## Standalone Speculative Decoding (Small Draft Model) + +Besides EAGLE/MTP, SGLang also supports **token-level speculative decoding** using a smaller **draft model**. Enable it with `--speculative-algorithm STANDALONE` and provide a draft model via `--speculative-draft-model-path`. + +Relevant parameters: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Default |
| --- | --- | --- |
| `--speculative-draft-model-path` | Draft model weights (smaller than the target model). | `None` |
| `--speculative-num-steps` | Draft depth (how many steps the draft model runs autoregressively). | `3` (auto default for STANDALONE) |
| `--speculative-eagle-topk` | Branching factor (token candidates per step). | `1` (auto default for STANDALONE) |
| `--speculative-num-draft-tokens` | Verification capacity. | `4` (auto default for STANDALONE) |
| `--speculative-draft-model-quantization` | Quantization for the draft model. Use `"unquant"` to disable quantization on the draft even when the target is quantized. | Same as target |
+ +> **Note:** Standalone speculative decoding currently **does not support** `--enable-dp-attention`. + +```bash Command +python3 -m sglang.launch_server \ + --model Qwen/Qwen2.5-7B-Instruct \ + --speculative-algorithm STANDALONE \ + --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \ + --speculative-num-steps 4 \ + --speculative-eagle-topk 2 \ + --speculative-num-draft-tokens 7 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +## Speculative Decoding V2 (Overlap Scheduler) + +SGLang provides an **experimental Speculative Decoding V2** implementation that enables an overlap scheduler and uses V2 speculative workers (e.g. `StandaloneWorkerV2`, `EAGLEWorkerV2`). + +To enable it, set the environment variable: +- `SGLANG_ENABLE_SPEC_V2=True` + +Notes: +- SpecV2 currently only supports `--speculative-eagle-topk 1`. When SpecV2 is enabled, **set `--speculative-eagle-topk 1` explicitly**. +- If you explicitly set `--speculative-eagle-topk > 1`, the server will error. +- If you omit `--speculative-eagle-topk`, auto-tuning may pick `topk > 1` for some models (e.g. Llama). This is incompatible with SpecV2 and may not always trigger an immediate config error, so set `--speculative-eagle-topk 1` explicitly. +- This applies to `EAGLE`, `EAGLE3`, and `STANDALONE`. 
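
Because omitting `--speculative-eagle-topk` may not trigger an immediate error, a launcher script can check the notes above before starting the server. The sketch below is a hypothetical pre-launch check, not part of SGLang; the helper names are illustrative, and the assumption that `"true"` or `"1"` enables the variable may not match how SGLang parses it.

```python
from typing import Mapping, Optional, Sequence

FLAG = "--speculative-eagle-topk"


def find_topk(argv: Sequence[str]) -> Optional[int]:
    """Return the explicitly passed topk value, or None if the flag is absent."""
    for i, arg in enumerate(argv):
        if arg == FLAG and i + 1 < len(argv):
            return int(argv[i + 1])
        if arg.startswith(FLAG + "="):
            return int(arg.split("=", 1)[1])
    return None


def spec_v2_problem(env: Mapping[str, str], argv: Sequence[str]) -> Optional[str]:
    """Describe a SpecV2 configuration problem, or return None if none is found."""
    # Assumption: "true"/"1" count as enabled; SGLang's own parsing may differ.
    enabled = env.get("SGLANG_ENABLE_SPEC_V2", "").lower() in ("true", "1")
    if not enabled:
        return None
    topk = find_topk(argv)
    if topk is None:
        return f"{FLAG} is not set; auto-tuning may pick topk > 1"
    if topk != 1:
        return f"{FLAG} is {topk}; SpecV2 only supports 1"
    return None


env = {"SGLANG_ENABLE_SPEC_V2": "True"}
print(spec_v2_problem(env, ["--speculative-algorithm", "EAGLE"]))
print(spec_v2_problem(env, ["--speculative-eagle-topk", "1"]))  # prints None
```

Calling `spec_v2_problem(os.environ, sys.argv[1:])` in a wrapper script and exiting on a non-`None` result is one way to use it.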
+ +```bash Command +SGLANG_ENABLE_SPEC_V2=True python3 -m sglang.launch_server \ + --model Qwen/Qwen2.5-7B-Instruct \ + --speculative-algorithm STANDALONE \ + --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \ + --speculative-num-steps 4 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 5 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +## Ngram Speculative Decoding + +SGLang also supports **ngram-based speculative decoding** (no separate draft model). It retrieves draft tokens from an ngram cache built from previously generated tokens, and then verifies them with the target model. + +Enable it with: +- `--speculative-algorithm NGRAM` + +### Ngram-specific parameters + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Default |
| --- | --- | --- |
| `--speculative-num-draft-tokens` | Number of draft tokens verified per step. If omitted, defaults to `min(--speculative-ngram-max-trie-depth, 12)`. | `12` (with default ngram settings) |
| `--speculative-ngram-min-bfs-breadth` | Minimum BFS breadth. | `1` |
| `--speculative-ngram-max-bfs-breadth` | Maximum BFS breadth. | `10` |
| `--speculative-ngram-match-type` | Ngram tree-building mode: `"BFS"` for recency-based expansion or `"PROB"` for frequency-based expansion. | `"BFS"` |
| `--speculative-ngram-max-trie-depth` | Maximum suffix length stored and matched by the ngram trie. | `18` |
| `--speculative-ngram-capacity` | Cache capacity (number of entries). | `10,000,000` |
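
The general idea of drafting from previously seen tokens can be shown with a toy lookup table. This is a deliberately simplified sketch, not SGLang's trie or BFS/PROB tree builder: it keeps one continuation per context (the most recent one), drafts a single chain, and leaves verification by the target model out. `build_cache` and `propose` are illustrative names.

```python
from typing import Dict, List, Tuple


def build_cache(tokens: List[int], max_depth: int) -> Dict[Tuple[int, ...], int]:
    """Map each context of up to max_depth tokens to the token that followed it."""
    cache: Dict[Tuple[int, ...], int] = {}
    for end in range(1, len(tokens)):
        for depth in range(1, min(max_depth, end) + 1):
            # Later occurrences overwrite earlier ones, so recency wins.
            cache[tuple(tokens[end - depth:end])] = tokens[end]
    return cache


def propose(tokens: List[int], cache: Dict[Tuple[int, ...], int],
            max_depth: int, num_draft: int) -> List[int]:
    """Draft up to num_draft tokens by repeatedly matching the longest known suffix."""
    draft: List[int] = []
    context = list(tokens)
    for _ in range(num_draft):
        next_token = None
        for depth in range(min(max_depth, len(context)), 0, -1):
            key = tuple(context[-depth:])
            if key in cache:
                next_token = cache[key]
                break
        if next_token is None:
            break  # no suffix matches; stop drafting
        draft.append(next_token)
        context.append(next_token)
    return draft


history = [1, 2, 3, 4, 1, 2]
cache = build_cache(history, max_depth=3)
print(propose(history, cache, max_depth=3, num_draft=4))  # prints [3, 4, 1, 2]
```

In this toy, `max_depth` plays a role similar to `--speculative-ngram-max-trie-depth` and `num_draft` to `--speculative-num-draft-tokens`; the breadth parameters have no counterpart because only one chain is drafted.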
+ +Notes: +- Ngram speculative decoding **only supports CUDA**. +- It currently **does not support** `--enable-dp-attention`. +- It disables the overlap scheduler and mixed chunked prefill. +- If `--speculative-ngram-max-bfs-breadth > 1` (thus `speculative_eagle_topk > 1`) and `page_size > 1`, use `--attention-backend flashinfer`; otherwise the server will error. +- Optional: set `SGLANG_NGRAM_FORCE_GREEDY_VERIFY=True` to force greedy verification. + +```bash Command +python3 -m sglang.launch_server \ + --model Qwen/Qwen2.5-7B-Instruct \ + --speculative-algorithm NGRAM \ + --speculative-num-draft-tokens 16 \ + --speculative-ngram-max-bfs-breadth 10 \ + --mem-fraction-static 0.7 \ + --cuda-graph-max-bs 8 \ + --log-level warning +``` + +**Send a request:** + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +--- + +## Full Parameter Reference + +Below is a comprehensive list of all speculative decoding parameters available in SGLang: + +### Core parameters + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `--speculative-algorithm` | `str` | `None` | Algorithm to use: `DFLASH`, `EAGLE`, `EAGLE3`, `STANDALONE`, `NGRAM`, `NEXTN` (alias of `EAGLE`) |
| `--speculative-draft-model-path` | `str` | `None` | Path to the draft model weights |
| `--speculative-draft-model-revision` | `str` | `None` | Specific revision/commit of the draft model (`"main"` is auto-used when draft path is set and revision is omitted) |
| `--speculative-draft-load-format` | `str` | `None` | Load format for draft model weights |
| `--speculative-num-steps` | `int` | `None` (auto-chosen when omitted) | Autoregressive drafting depth |
| `--speculative-eagle-topk` | `int` | `None` (auto-chosen when omitted) | Branching factor per drafting step |
| `--speculative-num-draft-tokens` | `int` | `None` (auto-chosen when omitted) | Maximum number of draft tokens for verification |
| `--speculative-dflash-block-size` | `int` | `None` | DFlash-only alias of `--speculative-num-draft-tokens` |
| `--speculative-dflash-draft-window-size` | `int` | `None` | DFlash-only draft KV sliding-window size |
| `--speculative-accept-threshold-single` | `float` | `1.0` | Single-token acceptance threshold |
| `--speculative-accept-threshold-acc` | `float` | `1.0` | Accumulated acceptance threshold |
| `--speculative-token-map` | `str` | `None` | Path to FR-Spec high-frequency token map |
| `--speculative-attention-mode` | `str` | `"prefill"` | Attention mode for speculative operations (`"prefill"` or `"decode"`) |
| `--speculative-draft-attention-backend` | `str` | `None` | Override attention backend for the draft model |
| `--speculative-moe-runner-backend` | `str` | `None` | MoE runner backend for the draft model |
| `--speculative-moe-a2a-backend` | `str` | `None` | MoE all-to-all backend for the draft model |
| `--speculative-draft-model-quantization` | `str` | Same as target | Quantization for the draft model (`"unquant"` to disable) |
+ +### Ngram-specific parameters + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `--speculative-ngram-min-bfs-breadth` | `int` | `1` | Minimum BFS breadth |
| `--speculative-ngram-max-bfs-breadth` | `int` | `10` | Maximum BFS breadth |
| `--speculative-ngram-match-type` | `str` | `"BFS"` | Ngram tree-building mode: `"BFS"` for recency-based expansion or `"PROB"` for frequency-based expansion |
| `--speculative-ngram-max-trie-depth` | `int` | `18` | Maximum suffix length stored and matched by the ngram trie |
| `--speculative-ngram-capacity` | `int` | `10,000,000` | Cache capacity |
+ +### Environment variables + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_ENABLE_SPEC_V2` | `False` | Enable Speculative Decoding V2 (overlap scheduler) |
| `SGLANG_NGRAM_FORCE_GREEDY_VERIFY` | `False` | Force greedy verification for ngram decoding |
+ +### Other related flags + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description |
| --- | --- |
| `--enable-multi-layer-eagle` | Enable multi-layer EAGLE (auto-enabled for MiMoV2 and Step3p5 models) |
| `--enable-torch-compile` | Enable `torch.compile` for kernel-level optimizations |
| `--torch-compile-max-bs` | Maximum batch size for `torch.compile` |
+ +--- + +## OOM Troubleshooting + +> [!WARNING] +> **Out of Memory (OOM)?** Speculative decoding may increase GPU memory usage because the draft tree, CUDA graphs, and verification-related buffers consume additional VRAM. If you encounter OOM errors, try the following adjustments. + +### Step 1: Lower static memory fraction (most effective) + +```bash Command +--mem-fraction-static 0.5 # when omitted, this value is auto-computed +``` + +- `--mem-fraction-static` controls the memory budget for model weights + KV cache pool. +- Lowering it directly increases dynamic headroom for activations and CUDA graph buffers. +- If omitted, SGLang auto-estimates this value from other settings, and those auto settings can still be too aggressive for some workloads. + +### Step 2: Reduce CUDA graph batch size + +```bash Command +# Fewer CUDA graph captures = less memory reserved +--cuda-graph-max-bs 4 # or even 2 for tight memory situations +``` + +- If omitted, `--cuda-graph-max-bs` is auto-selected based on GPU memory and TP size, and can be much larger on high-memory GPUs. 
+ +### Step 3: Reduce draft tree size + +These three parameters directly control how much memory the draft tree consumes: + +```bash Command +# Before (aggressive, high memory) +--speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 + +# After (conservative, lower memory) +--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 +``` + +### Step 4: Limit concurrent requests + +```bash Command +# Fewer concurrent requests lowers in-flight load and can reduce OOM risk +--max-running-requests 4 +``` + +### Quick OOM recovery recipe + +If you're hitting OOM and just want something that works, start with this minimal configuration and scale up: + +```bash Command +python3 -m sglang.launch_server \ + --model \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --cuda-graph-max-bs 2 \ + --mem-fraction-static 0.5 \ + --max-running-requests 4 \ + --log-level warning +``` + +Then gradually increase `--speculative-num-draft-tokens`, `--speculative-eagle-topk`, and `--cuda-graph-max-bs`. Increase `--mem-fraction-static` last, only after the run is stable. + +--- + +## References + +EAGLE process is as follows: + +- Within EAGLE the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, ..., f_k)$ and the token sequence $(t_2, ..., t_{k+1})$. +- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style—branching out multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter—to ensure a more coherent connection of context, and are given as input again. 
+- In SGLang's EAGLE-2 implementation, the draft tree is expanded for the configured steps and then reranked to select the top `speculative_num_draft_tokens` final nodes as draft tokens. +- EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained in an on-policy manner. + +This enhances drafting accuracy by operating on features instead of tokens for more regular inputs and by additionally passing tokens from the next timestep to reduce sampling randomness. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers. + +For guidance on how to train your own EAGLE model please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train). For EAGLE-3 training specifically, check out [SpecForge](https://github.com/sgl-project/SpecForge), the SGLang team's training framework designed for EAGLE-3 speculative decoding models with seamless porting to SGLang serving. See the [SpecForge documentation](https://docs.sglang.ai/SpecForge/) and [blog post](https://lmsys.org/blog/2025-07-25-spec-forge) for details. diff --git a/docs_new/docs/advanced_features/structured_outputs.ipynb b/docs_new/docs/advanced_features/structured_outputs.ipynb new file mode 100644 index 000000000000..8902c949765e --- /dev/null +++ b/docs_new/docs/advanced_features/structured_outputs.ipynb @@ -0,0 +1,994 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Structured Outputs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can specify a JSON schema, [regular expression](https://en.wikipedia.org/wiki/Regular_expression) or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. 
Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request.\n", +    "\n", +    "SGLang supports three grammar backends:\n", +    "\n", +    "- [XGrammar](https://github.com/mlc-ai/xgrammar) (default): Supports JSON schema, regular expression, and EBNF constraints.\n", +    "- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.\n", +    "- [Llguidance](https://github.com/guidance-ai/llguidance): Supports JSON schema, regular expression, and EBNF constraints.\n", +    "\n", +    "We suggest using XGrammar for its better performance and utility. XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md). For more details, see the [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n", +    "\n", +    "To use Outlines, simply add `--grammar-backend outlines` when launching the server.\n", +    "To use llguidance, add `--grammar-backend llguidance` when launching the server.\n", +    "If no backend is specified, XGrammar will be used as the default.\n", +    "\n", +    "For better output quality, **it's advisable to explicitly include instructions in the prompt to guide the model to generate the desired format.** For example, you can specify, 'Please generate the output in the following JSON format: ...'.\n" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "## OpenAI Compatible API" +   ] +  }, +  { +   "cell_type": "code", +   "execution_count": null, +   "metadata": {}, +   "outputs": [], +   "source": [ +    "import openai\n", +    "import os\n", +    "\n", +    "from sglang.test.doc_patch import launch_server_cmd\n", +    "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", +    "\n", +    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n", +    "\n", +    "\n", +    "server_process, port = launch_server_cmd(\n", +    "    \"python -m sglang.launch_server --model-path
meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --log-level warning\"\n", +    ")\n", +    "\n", +    "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", +    "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "### JSON\n", +    "\n", +    "You can directly define a JSON schema or use [Pydantic](https://docs.pydantic.dev/latest/) to define and validate the response." +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "**Using Pydantic**" +   ] +  }, +  { +   "cell_type": "code", +   "execution_count": null, +   "metadata": {}, +   "outputs": [], +   "source": [ +    "from pydantic import BaseModel, Field\n", +    "\n", +    "\n", +    "# Define the schema using Pydantic\n", +    "class CapitalInfo(BaseModel):\n", +    "    name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n", +    "    population: int = Field(..., description=\"Population of the capital city\")\n", +    "\n", +    "\n", +    "response = client.chat.completions.create(\n", +    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", +    "    messages=[\n", +    "        {\n", +    "            \"role\": \"user\",\n", +    "            \"content\": \"Please generate the information of the capital of France in the JSON format.\",\n", +    "        },\n", +    "    ],\n", +    "    temperature=0,\n", +    "    max_tokens=128,\n", +    "    response_format={\n", +    "        \"type\": \"json_schema\",\n", +    "        \"json_schema\": {\n", +    "            \"name\": \"foo\",\n", +    "            # convert the pydantic model to json schema\n", +    "            \"schema\": CapitalInfo.model_json_schema(),\n", +    "        },\n", +    "    },\n", +    ")\n", +    "\n", +    "response_content = response.choices[0].message.content\n", +    "# validate the JSON response by the pydantic model\n", +    "capital_info = CapitalInfo.model_validate_json(response_content)\n", +    "print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "**JSON Schema Directly**\n"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "json_schema = json.dumps(\n", + " {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n", + " \"population\": {\"type\": \"integer\"},\n", + " },\n", + " \"required\": [\"name\", \"population\"],\n", + " }\n", + ")\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", + " messages=[\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Give me the information of the capital of France in the JSON format.\",\n", + " },\n", + " ],\n", + " temperature=0,\n", + " max_tokens=128,\n", + " response_format={\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": {\"name\": \"foo\", \"schema\": json.loads(json_schema)},\n", + " },\n", + ")\n", + "\n", + "print_highlight(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### EBNF" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ebnf_grammar = \"\"\"\n", + "root ::= city | description\n", + "city ::= \"London\" | \"Paris\" | \"Berlin\" | \"Rome\"\n", + "description ::= city \" is \" status\n", + "status ::= \"the capital of \" country\n", + "country ::= \"England\" | \"France\" | \"Germany\" | \"Italy\"\n", + "\"\"\"\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"You are a helpful geography bot.\"},\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Give me the information of the capital of France.\",\n", + " },\n", + " ],\n", + " temperature=0,\n", + " max_tokens=32,\n", + " extra_body={\"ebnf\": ebnf_grammar},\n", + ")\n", + "\n", + 
"print_highlight(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Regular expression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = client.chat.completions.create(\n", + " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", + " messages=[\n", + " {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n", + " ],\n", + " temperature=0,\n", + " max_tokens=128,\n", + " extra_body={\"regex\": \"(Paris|London)\"},\n", + ")\n", + "\n", + "print_highlight(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Structural Tag" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tool_get_current_weather = {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"get_current_weather\",\n", + " \"description\": \"Get the current weather in a given location\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"city\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The city to find the weather for, e.g. 'San Francisco'\",\n", + " },\n", + " \"state\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"the two-letter abbreviation for the state that the city is\"\n", + " \" in, e.g. 
'CA' which would mean 'California'\",\n", + " },\n", + " \"unit\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The unit to fetch the temperature in\",\n", + " \"enum\": [\"celsius\", \"fahrenheit\"],\n", + " },\n", + " },\n", + " \"required\": [\"city\", \"state\", \"unit\"],\n", + " },\n", + " },\n", + "}\n", + "\n", + "tool_get_current_date = {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"get_current_date\",\n", + " \"description\": \"Get the current date and time for a given timezone\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"timezone\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The timezone to fetch the current date and time for, e.g. 'America/New_York'\",\n", + " }\n", + " },\n", + " \"required\": [\"timezone\"],\n", + " },\n", + " },\n", + "}\n", + "\n", + "schema_get_current_weather = tool_get_current_weather[\"function\"][\"parameters\"]\n", + "schema_get_current_date = tool_get_current_date[\"function\"][\"parameters\"]\n", + "\n", + "\n", + "def get_messages():\n", + " return [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": f\"\"\"\n", + "# Tool Instructions\n", + "- Always execute python code in messages that you share.\n", + "- When looking for real time information use relevant functions if available else fallback to brave_search\n", + "You have access to the following functions:\n", + "Use the function 'get_current_weather' to: Get the current weather in a given location\n", + "{tool_get_current_weather[\"function\"]}\n", + "Use the function 'get_current_date' to: Get the current date and time for a given timezone\n", + "{tool_get_current_date[\"function\"]}\n", + "If a you choose to call a function ONLY reply in the following format:\n", + "<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}\n", + "where\n", + "start_tag => ` a JSON dict with the function argument name as key and function argument value as value.\n", 
+ "end_tag => ``\n", + "Here is an example,\n", + "{{\"example_name\": \"example_value\"}}\n", + "Reminder:\n", + "- Function calls MUST follow the specified format\n", + "- Required parameters MUST be specified\n", + "- Only call one function at a time\n", + "- Put the entire function call reply on one line\n", + "- Always add your sources when using search results to answer the user query\n", + "You are a helpful assistant.\"\"\",\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"You are in New York. Please get the current date and time, and the weather.\",\n", + " },\n", + " ]\n", + "\n", + "\n", + "messages = get_messages()\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", + " messages=messages,\n", + " response_format={\n", + " \"type\": \"structural_tag\",\n", + " \"structures\": [\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_weather,\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_date,\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"triggers\": [\"\n", + "response = client.chat.completions.create(\n", + " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", + " messages=messages,\n", + " response_format={\n", + " \"type\": \"structural_tag\",\n", + " \"format\": {\n", + " \"type\": \"triggered_tags\",\n", + " \"triggers\": [\"\",\n", + " \"content\": {\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": schema_get_current_weather,\n", + " },\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"content\": {\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": schema_get_current_date,\n", + " },\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"at_least_one\": False,\n", + " \"stop_after_first\": False,\n", + " },\n", + " },\n", + ")\n", + "\n", + "print_highlight(response.choices[0].message.content)" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "## Native API and SGLang Runtime (SRT)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### JSON" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Using Pydantic**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import json\n", + "from pydantic import BaseModel, Field\n", + "\n", + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n", + "\n", + "\n", + "# Define the schema using Pydantic\n", + "class CapitalInfo(BaseModel):\n", + " name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n", + " population: int = Field(..., description=\"Population of the capital city\")\n", + "\n", + "\n", + "# Make API request\n", + "messages = [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Here is the information of the capital of France in the JSON format.\\n\",\n", + " }\n", + "]\n", + "text = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n", + ")\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": text,\n", + " \"sampling_params\": {\n", + " \"temperature\": 0,\n", + " \"max_new_tokens\": 64,\n", + " \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n", + " },\n", + " },\n", + ")\n", + "print_highlight(response.json())\n", + "\n", + "\n", + "response_data = json.loads(response.json()[\"text\"])\n", + "# validate the response by the pydantic model\n", + "capital_info = CapitalInfo.model_validate(response_data)\n", + "print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**JSON Schema Directly**" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "json_schema = json.dumps(\n", + " {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n", + " \"population\": {\"type\": \"integer\"},\n", + " },\n", + " \"required\": [\"name\", \"population\"],\n", + " }\n", + ")\n", + "\n", + "# JSON\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": text,\n", + " \"sampling_params\": {\n", + " \"temperature\": 0,\n", + " \"max_new_tokens\": 64,\n", + " \"json_schema\": json_schema,\n", + " },\n", + " },\n", + ")\n", + "\n", + "print_highlight(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### EBNF" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "messages = [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Give me the information of the capital of France.\",\n", + " }\n", + "]\n", + "text = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n", + ")\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": text,\n", + " \"sampling_params\": {\n", + " \"max_new_tokens\": 128,\n", + " \"temperature\": 0,\n", + " \"n\": 3,\n", + " \"ebnf\": (\n", + " \"root ::= city | description\\n\"\n", + " 'city ::= \"London\" | \"Paris\" | \"Berlin\" | \"Rome\"\\n'\n", + " 'description ::= city \" is \" status\\n'\n", + " 'status ::= \"the capital of \" country\\n'\n", + " 'country ::= \"England\" | \"France\" | \"Germany\" | \"Italy\"'\n", + " ),\n", + " },\n", + " \"stream\": False,\n", + " \"return_logprob\": False,\n", + " },\n", + ")\n", + "\n", + "print_highlight(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Regular expression" + 
] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "messages = [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Paris is the capital of\",\n", + " }\n", + "]\n", + "text = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n", + ")\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": text,\n", + " \"sampling_params\": {\n", + " \"temperature\": 0,\n", + " \"max_new_tokens\": 64,\n", + " \"regex\": \"(France|England)\",\n", + " },\n", + " },\n", + ")\n", + "print_highlight(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Structural Tag" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer\n", + "\n", + "# generate an answer\n", + "tokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n", + "\n", + "text = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n", + ")\n", + "payload = {\n", + " \"text\": text,\n", + " \"sampling_params\": {\n", + " \"structural_tag\": json.dumps(\n", + " {\n", + " \"type\": \"structural_tag\",\n", + " \"structures\": [\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_weather,\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_date,\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"triggers\": [\"\n", + "payload = {\n", + " \"text\": text,\n", + " \"sampling_params\": {\n", + " \"structural_tag\": json.dumps(\n", + " {\n", + " \"type\": \"structural_tag\",\n", + " \"format\": {\n", + " \"type\": \"triggered_tags\",\n", + " \"triggers\": [\"\",\n", + " \"content\": {\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": 
schema_get_current_weather,\n", + " },\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"content\": {\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": schema_get_current_date,\n", + " },\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"at_least_one\": False,\n", + " \"stop_after_first\": False,\n", + " },\n", + " }\n", + " )\n", + " },\n", + "}\n", + "\n", + "\n", + "# Send POST request to the API endpoint\n", + "response = requests.post(f\"http://localhost:{port}/generate\", json=payload)\n", + "print_highlight(response.json())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Offline Engine API" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sglang as sgl\n", + "\n", + "llm = sgl.Engine(\n", + " model_path=\"meta-llama/Meta-Llama-3.1-8B-Instruct\", grammar_backend=\"xgrammar\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### JSON" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Using Pydantic**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from pydantic import BaseModel, Field\n", + "\n", + "prompts = [\n", + " \"Give me the information of the capital of China in the JSON format.\",\n", + " \"Give me the information of the capital of France in the JSON format.\",\n", + " \"Give me the information of the capital of Ireland in the JSON format.\",\n", + "]\n", + "\n", + "\n", + "# Define the schema using Pydantic\n", + "class CapitalInfo(BaseModel):\n", + " name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n", + " population: int = Field(..., description=\"Population of the capital 
city\")\n", + "\n", + "\n", + "sampling_params = {\n", + " \"temperature\": 0.1,\n", + " \"top_p\": 0.95,\n", + " \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n", + "}\n", + "\n", + "outputs = llm.generate(prompts, sampling_params)\n", + "for prompt, output in zip(prompts, outputs):\n", + " print_highlight(\"===============================\")\n", + " print_highlight(f\"Prompt: {prompt}\") # validate the output by the pydantic model\n", + " capital_info = CapitalInfo.model_validate_json(output[\"text\"])\n", + " print_highlight(f\"Validated output: {capital_info.model_dump_json()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**JSON Schema Directly**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompts = [\n", + " \"Give me the information of the capital of China in the JSON format.\",\n", + " \"Give me the information of the capital of France in the JSON format.\",\n", + " \"Give me the information of the capital of Ireland in the JSON format.\",\n", + "]\n", + "\n", + "json_schema = json.dumps(\n", + " {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n", + " \"population\": {\"type\": \"integer\"},\n", + " },\n", + " \"required\": [\"name\", \"population\"],\n", + " }\n", + ")\n", + "\n", + "sampling_params = {\"temperature\": 0.1, \"top_p\": 0.95, \"json_schema\": json_schema}\n", + "\n", + "outputs = llm.generate(prompts, sampling_params)\n", + "for prompt, output in zip(prompts, outputs):\n", + " print_highlight(\"===============================\")\n", + " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### EBNF\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompts = [\n", + " \"Give me the 
information of the capital of France.\",\n", + " \"Give me the information of the capital of Germany.\",\n", + " \"Give me the information of the capital of Italy.\",\n", + "]\n", + "\n", + "sampling_params = {\n", + " \"temperature\": 0.8,\n", + " \"top_p\": 0.95,\n", + " \"ebnf\": (\n", + " \"root ::= city | description\\n\"\n", + " 'city ::= \"London\" | \"Paris\" | \"Berlin\" | \"Rome\"\\n'\n", + " 'description ::= city \" is \" status\\n'\n", + " 'status ::= \"the capital of \" country\\n'\n", + " 'country ::= \"England\" | \"France\" | \"Germany\" | \"Italy\"'\n", + " ),\n", + "}\n", + "\n", + "outputs = llm.generate(prompts, sampling_params)\n", + "for prompt, output in zip(prompts, outputs):\n", + " print_highlight(\"===============================\")\n", + " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Regular expression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompts = [\n", + " \"Please provide information about London as a major global city:\",\n", + " \"Please provide information about Paris as a major global city:\",\n", + "]\n", + "\n", + "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95, \"regex\": \"(France|England)\"}\n", + "\n", + "outputs = llm.generate(prompts, sampling_params)\n", + "for prompt, output in zip(prompts, outputs):\n", + " print_highlight(\"===============================\")\n", + " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Structural Tag" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "text = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n", + ")\n", + "prompts = [text]\n", + "\n", + "\n", + 
"sampling_params = {\n", + " \"temperature\": 0.8,\n", + " \"top_p\": 0.95,\n", + " \"structural_tag\": json.dumps(\n", + " {\n", + " \"type\": \"structural_tag\",\n", + " \"structures\": [\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_weather,\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_date,\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"triggers\": [\"\n", + "sampling_params = {\n", + " \"temperature\": 0.8,\n", + " \"top_p\": 0.95,\n", + " \"structural_tag\": json.dumps(\n", + " {\n", + " \"type\": \"structural_tag\",\n", + " \"format\": {\n", + " \"type\": \"triggered_tags\",\n", + " \"triggers\": [\"\",\n", + " \"content\": {\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": schema_get_current_weather,\n", + " },\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"content\": {\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": schema_get_current_date,\n", + " },\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"at_least_one\": False,\n", + " \"stop_after_first\": False,\n", + " },\n", + " }\n", + " ),\n", + "}\n", + "\n", + "\n", + "# Send POST request to the API endpoint\n", + "outputs = llm.generate(prompts, sampling_params)\n", + "for prompt, output in zip(prompts, outputs):\n", + " print_highlight(\"===============================\")\n", + " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "llm.shutdown()" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git 
a/docs_new/docs/advanced_features/structured_outputs.mdx b/docs_new/docs/advanced_features/structured_outputs.mdx new file mode 100644 index 000000000000..26e450bfd151 --- /dev/null +++ b/docs_new/docs/advanced_features/structured_outputs.mdx @@ -0,0 +1,803 @@ +--- +title: "Structured Outputs" +metatags: +    description: "SGLang structured outputs: JSON schema, regex, EBNF constraints. XGrammar, Outlines, Llguidance backends for guaranteed output format." +--- +You can specify a JSON schema, [regular expression](https://en.wikipedia.org/wiki/Regular_expression) or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request. + +SGLang supports three grammar backends: + +- [XGrammar](https://github.com/mlc-ai/xgrammar) (default): Supports JSON schema, regular expression, and EBNF constraints. +- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints. +- [Llguidance](https://github.com/guidance-ai/llguidance): Supports JSON schema, regular expression, and EBNF constraints. + +We suggest using XGrammar for its better performance and utility. XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md). For more details, see the [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar). + +To use Outlines, simply add `--grammar-backend outlines` when launching the server. +To use llguidance, add `--grammar-backend llguidance` when launching the server. +If no backend is specified, XGrammar will be used as the default.
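
As a rough illustration of the backend selection described above, the sketch below assembles the launch command for a chosen grammar backend. The helper `build_launch_command` and the constant `SUPPORTED_GRAMMAR_BACKENDS` are hypothetical names introduced only for this example and are not part of SGLang. The sketch also assumes that the lowercase backend names are the values `--grammar-backend` accepts.

```python
import shlex

# Backend names as used in this document; this tuple is an assumption for
# illustration and is not read from SGLang itself.
SUPPORTED_GRAMMAR_BACKENDS = ("xgrammar", "outlines", "llguidance")


def build_launch_command(model_path, grammar_backend=None):
    """Hypothetical helper: return the server launch command as a list."""
    cmd = ["python", "-m", "sglang.launch_server", "--model-path", model_path]
    if grammar_backend is not None:
        backend = grammar_backend.lower()
        if backend not in SUPPORTED_GRAMMAR_BACKENDS:
            raise ValueError(f"unknown grammar backend: {grammar_backend!r}")
        # Only pass the flag when a backend is requested explicitly;
        # omitting it leaves the server on its default (XGrammar).
        cmd += ["--grammar-backend", backend]
    return cmd


command = build_launch_command("meta-llama/Meta-Llama-3.1-8B-Instruct", "outlines")
print(shlex.join(command))
```

Leaving `grammar_backend` as `None` omits the flag entirely, which matches the default behavior described above.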
+ +For better output quality, **it's advisable to explicitly include instructions in the prompt to guide the model to generate the desired format.** For example, you can specify, 'Please generate the output in the following JSON format: ...'. + + + +## OpenAI Compatible API + + + +```python Example +import openai +import os + +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, print_highlight, terminate_process + +os.environ["TOKENIZERS_PARALLELISM"] = "false" + + +server_process, port = launch_server_cmd( +    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --log-level warning" +) + +wait_for_server(f"http://localhost:{port}") +client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None") +``` + +### JSON + +You can directly define a JSON schema or use [Pydantic](https://docs.pydantic.dev/latest/) to define and validate the response. + + +**Using Pydantic** + + + +```python Example +from pydantic import BaseModel, Field + + +# Define the schema using Pydantic +class CapitalInfo(BaseModel): +    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city") +    population: int = Field(..., description="Population of the capital city") + + +response = client.chat.completions.create( +    model="meta-llama/Meta-Llama-3.1-8B-Instruct", +    messages=[ +        { +            "role": "user", +            "content": "Please generate the information of the capital of France in the JSON format.", +        }, +    ], +    temperature=0, +    max_tokens=128, +    response_format={ +        "type": "json_schema", +        "json_schema": { +            "name": "foo", +            # convert the pydantic model to json schema +            "schema": CapitalInfo.model_json_schema(), +        }, +    }, +) + +response_content = response.choices[0].message.content +# validate the JSON response by the pydantic model +capital_info = CapitalInfo.model_validate_json(response_content) +print_highlight(f"Validated response: {capital_info.model_dump_json()}") +``` + +**JSON
Schema Directly** + + + + +```python Example +import json + +json_schema = json.dumps( + { + "type": "object", + "properties": { + "name": {"type": "string", "pattern": "^[\\w]+$"}, + "population": {"type": "integer"}, + }, + "required": ["name", "population"], + } +) + +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3.1-8B-Instruct", + messages=[ + { + "role": "user", + "content": "Give me the information of the capital of France in the JSON format.", + }, + ], + temperature=0, + max_tokens=128, + response_format={ + "type": "json_schema", + "json_schema": {"name": "foo", "schema": json.loads(json_schema)}, + }, +) + +print_highlight(response.choices[0].message.content) +``` + +### EBNF + + + +```python Example +ebnf_grammar = """ +root ::= city | description +city ::= "London" | "Paris" | "Berlin" | "Rome" +description ::= city " is " status +status ::= "the capital of " country +country ::= "England" | "France" | "Germany" | "Italy" +""" + +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3.1-8B-Instruct", + messages=[ + {"role": "system", "content": "You are a helpful geography bot."}, + { + "role": "user", + "content": "Give me the information of the capital of France.", + }, + ], + temperature=0, + max_tokens=32, + extra_body={"ebnf": ebnf_grammar}, +) + +print_highlight(response.choices[0].message.content) +``` + +### Regular expression + + + +```python Example +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3.1-8B-Instruct", + messages=[ + {"role": "user", "content": "What is the capital of France?"}, + ], + temperature=0, + max_tokens=128, + extra_body={"regex": "(Paris|London)"}, +) + +print_highlight(response.choices[0].message.content) +``` + +### Structural Tag + + + +```python Example +tool_get_current_weather = { + "type": "function", + "function": { + "name": "get_current_weather", + "description": "Get the current weather in a given location", + "parameters": { + "type": 
"object", + "properties": { + "city": { + "type": "string", + "description": "The city to find the weather for, e.g. 'San Francisco'", + }, + "state": { + "type": "string", + "description": "the two-letter abbreviation for the state that the city is" + " in, e.g. 'CA' which would mean 'California'", + }, + "unit": { + "type": "string", + "description": "The unit to fetch the temperature in", + "enum": ["celsius", "fahrenheit"], + }, + }, + "required": ["city", "state", "unit"], + }, + }, +} + +tool_get_current_date = { + "type": "function", + "function": { + "name": "get_current_date", + "description": "Get the current date and time for a given timezone", + "parameters": { + "type": "object", + "properties": { + "timezone": { + "type": "string", + "description": "The timezone to fetch the current date and time for, e.g. 'America/New_York'", + } + }, + "required": ["timezone"], + }, + }, +} + +schema_get_current_weather = tool_get_current_weather["function"]["parameters"] +schema_get_current_date = tool_get_current_date["function"]["parameters"] + + +def get_messages(): + return [ + { + "role": "system", + "content": f""" +# Tool Instructions +- Always execute python code in messages that you share. +- When looking for real time information use relevant functions if available else fallback to brave_search +You have access to the following functions: +Use the function 'get_current_weather' to: Get the current weather in a given location +{tool_get_current_weather["function"]} +Use the function 'get_current_date' to: Get the current date and time for a given timezone +{tool_get_current_date["function"]} +If a you choose to call a function ONLY reply in the following format: +<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}} +where +start_tag => ` a JSON dict with the function argument name as key and function argument value as value. 
+end_tag => `` +Here is an example, +{{"example_name": "example_value"}} +Reminder: +- Function calls MUST follow the specified format +- Required parameters MUST be specified +- Only call one function at a time +- Put the entire function call reply on one line +- Always add your sources when using search results to answer the user query +You are a helpful assistant.""", + }, + { + "role": "user", + "content": "You are in New York. Please get the current date and time, and the weather.", + }, + ] + + +messages = get_messages() + +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3.1-8B-Instruct", + messages=messages, + response_format={ + "type": "structural_tag", + "structures": [ + { + "begin": "", + "schema": schema_get_current_weather, + "end": "", + }, + { + "begin": "", + "schema": schema_get_current_date, + "end": "", + }, + ], + "triggers": ["", + "content": { + "type": "json_schema", + "json_schema": schema_get_current_weather, + }, + "end": "", + }, + { + "begin": "", + "content": { + "type": "json_schema", + "json_schema": schema_get_current_date, + }, + "end": "", + }, + ], + "at_least_one": False, + "stop_after_first": False, + }, + }, +) + +print_highlight(response.choices[0].message.content) +``` + +## Native API and SGLang Runtime (SRT) + + +### JSON + + +**Using Pydantic** + + + +```python Example +import requests +import json +from pydantic import BaseModel, Field + +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct") + + +# Define the schema using Pydantic +class CapitalInfo(BaseModel): + name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city") + population: int = Field(..., description="Population of the capital city") + + +# Make API request +messages = [ + { + "role": "user", + "content": "Here is the information of the capital of France in the JSON format.\n", + } +] +text = tokenizer.apply_chat_template( + messages, 
tokenize=False, add_generation_prompt=True, return_dict=False +) +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": text, + "sampling_params": { + "temperature": 0, + "max_new_tokens": 64, + "json_schema": json.dumps(CapitalInfo.model_json_schema()), + }, + }, +) +print_highlight(response.json()) + + +response_data = json.loads(response.json()["text"]) +# validate the response by the pydantic model +capital_info = CapitalInfo.model_validate(response_data) +print_highlight(f"Validated response: {capital_info.model_dump_json()}") +``` + +**JSON Schema Directly** + + + +```python Example +json_schema = json.dumps( + { + "type": "object", + "properties": { + "name": {"type": "string", "pattern": "^[\\w]+$"}, + "population": {"type": "integer"}, + }, + "required": ["name", "population"], + } +) + +# JSON +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": text, + "sampling_params": { + "temperature": 0, + "max_new_tokens": 64, + "json_schema": json_schema, + }, + }, +) + +print_highlight(response.json()) +``` + +### EBNF + + + +```python Example +messages = [ + { + "role": "user", + "content": "Give me the information of the capital of France.", + } +] +text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True, return_dict=False +) +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": text, + "sampling_params": { + "max_new_tokens": 128, + "temperature": 0, + "n": 3, + "ebnf": ( + "root ::= city | description\n" + 'city ::= "London" | "Paris" | "Berlin" | "Rome"\n' + 'description ::= city " is " status\n' + 'status ::= "the capital of " country\n' + 'country ::= "England" | "France" | "Germany" | "Italy"' + ), + }, + "stream": False, + "return_logprob": False, + }, +) + +print_highlight(response.json()) +``` + +### Regular expression + + + +```python Example +messages = [ + { + "role": "user", + "content": "Paris is the capital of", + } +] 
+text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True, return_dict=False +) +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": text, + "sampling_params": { + "temperature": 0, + "max_new_tokens": 64, + "regex": "(France|England)", + }, + }, +) +print_highlight(response.json()) +``` + +### Structural Tag + + + +```python Example +from transformers import AutoTokenizer + +# generate an answer +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct") + +text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True, return_dict=False +) +payload = { + "text": text, + "sampling_params": { + "structural_tag": json.dumps( + { + "type": "structural_tag", + "structures": [ + { + "begin": "", + "schema": schema_get_current_weather, + "end": "", + }, + { + "begin": "", + "schema": schema_get_current_date, + "end": "", + }, + ], + "triggers": ["", + "content": { + "type": "json_schema", + "json_schema": schema_get_current_weather, + }, + "end": "", + }, + { + "begin": "", + "content": { + "type": "json_schema", + "json_schema": schema_get_current_date, + }, + "end": "", + }, + ], + "at_least_one": False, + "stop_after_first": False, + }, + } + ) + }, +} + + +# Send POST request to the API endpoint +response = requests.post(f"http://localhost:{port}/generate", json=payload) +print_highlight(response.json()) +``` + + +```python Example +terminate_process(server_process) +``` + +## Offline Engine API + + + +```python Example +import sglang as sgl + +llm = sgl.Engine( + model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", grammar_backend="xgrammar" +) +``` + +### JSON + + +**Using Pydantic** + + + +```python Example +import json +from pydantic import BaseModel, Field + + +prompts = [ + "Give me the information of the capital of China in the JSON format.", + "Give me the information of the capital of France in the JSON format.", + "Give me the 
information of the capital of Ireland in the JSON format.", +] + + +# Define the schema using Pydantic +class CapitalInfo(BaseModel): + name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city") + population: int = Field(..., description="Population of the capital city") + + +sampling_params = { + "temperature": 0.1, + "top_p": 0.95, + "json_schema": json.dumps(CapitalInfo.model_json_schema()), +} + +outputs = llm.generate(prompts, sampling_params) +for prompt, output in zip(prompts, outputs): + print_highlight("===============================") + print_highlight(f"Prompt: {prompt}") # validate the output by the pydantic model + capital_info = CapitalInfo.model_validate_json(output["text"]) + print_highlight(f"Validated output: {capital_info.model_dump_json()}") +``` + +**JSON Schema Directly** + + + +```python Example +prompts = [ + "Give me the information of the capital of China in the JSON format.", + "Give me the information of the capital of France in the JSON format.", + "Give me the information of the capital of Ireland in the JSON format.", +] + +json_schema = json.dumps( + { + "type": "object", + "properties": { + "name": {"type": "string", "pattern": "^[\\w]+$"}, + "population": {"type": "integer"}, + }, + "required": ["name", "population"], + } +) + +sampling_params = {"temperature": 0.1, "top_p": 0.95, "json_schema": json_schema} + +outputs = llm.generate(prompts, sampling_params) +for prompt, output in zip(prompts, outputs): + print_highlight("===============================") + print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}") +``` + +### EBNF + + + + +```python Example +prompts = [ + "Give me the information of the capital of France.", + "Give me the information of the capital of Germany.", + "Give me the information of the capital of Italy.", +] + +sampling_params = { + "temperature": 0.8, + "top_p": 0.95, + "ebnf": ( + "root ::= city | description\n" + 'city ::= "London" | "Paris" | "Berlin" | "Rome"\n' + 
'description ::= city " is " status\n' + 'status ::= "the capital of " country\n' + 'country ::= "England" | "France" | "Germany" | "Italy"' + ), +} + +outputs = llm.generate(prompts, sampling_params) +for prompt, output in zip(prompts, outputs): + print_highlight("===============================") + print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}") +``` + +### Regular expression + + + +```python Example +prompts = [ + "Please provide information about London as a major global city:", + "Please provide information about Paris as a major global city:", +] + +sampling_params = {"temperature": 0.8, "top_p": 0.95, "regex": "(France|England)"} + +outputs = llm.generate(prompts, sampling_params) +for prompt, output in zip(prompts, outputs): + print_highlight("===============================") + print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}") +``` + +### Structural Tag + + + +```python Example +text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True, return_dict=False +) +prompts = [text] + + +sampling_params = { + "temperature": 0.8, + "top_p": 0.95, + "structural_tag": json.dumps( + { + "type": "structural_tag", + "structures": [ + { + "begin": "", + "schema": schema_get_current_weather, + "end": "", + }, + { + "begin": "", + "schema": schema_get_current_date, + "end": "", + }, + ], + "triggers": ["", + "content": { + "type": "json_schema", + "json_schema": schema_get_current_weather, + }, + "end": "", + }, + { + "begin": "", + "content": { + "type": "json_schema", + "json_schema": schema_get_current_date, + }, + "end": "", + }, + ], + "at_least_one": False, + "stop_after_first": False, + }, + } + ), +} + + +# Send POST request to the API endpoint +outputs = llm.generate(prompts, sampling_params) +for prompt, output in zip(prompts, outputs): + print_highlight("===============================") + print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}") +``` + + +```python 
Example +llm.shutdown() +``` diff --git a/docs_new/docs/advanced_features/structured_outputs_for_reasoning_models.ipynb b/docs_new/docs/advanced_features/structured_outputs_for_reasoning_models.ipynb new file mode 100644 index 000000000000..cfc07fd01629 --- /dev/null +++ b/docs_new/docs/advanced_features/structured_outputs_for_reasoning_models.ipynb @@ -0,0 +1,841 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Structured Outputs For Reasoning Models\n", + "\n", + "When working with reasoning models that use special tokens like `...` to denote reasoning sections, you might want to allow free-form text within these sections while still enforcing grammar constraints on the rest of the output.\n", + "\n", + "SGLang provides a feature to disable grammar restrictions within reasoning sections. This is particularly useful for models that need to perform complex reasoning steps before providing a structured output.\n", + "\n", + "To enable this feature, use the `--reasoning-parser` flag which decide the think_end_token, such as `
`, when launching the server. You can also specify the reasoning parser using the `--reasoning-parser` flag.\n", + "\n", + "## Supported Models\n", + "\n", + "Currently, SGLang supports the following reasoning models:\n", + "- [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d): The reasoning content is wrapped with `` and `` tags.\n", + "- [QwQ](https://huggingface.co/Qwen/QwQ-32B): The reasoning content is wrapped with `` and `` tags.\n", + "\n", + "\n", + "## Usage\n", + "\n", + "## OpenAI Compatible API" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Specify the `--grammar-backend`, `--reasoning-parser` option." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "import os\n", + "\n", + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", + "\n", + "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n", + "\n", + "\n", + "server_process, port = launch_server_cmd(\n", + " \"python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning\"\n", + ")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", + "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### JSON\n", + "\n", + "you can directly define a JSON schema or use [Pydantic](https://docs.pydantic.dev/latest/) to define and validate the response." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Using Pydantic**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pydantic import BaseModel, Field\n", + "\n", + "\n", + "# Define the schema using Pydantic\n", + "class CapitalInfo(BaseModel):\n", + " name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n", + " population: int = Field(..., description=\"Population of the capital city\")\n", + "\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n", + " messages=[\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": \"Give me the information and population of the capital of France in the JSON format.\",\n", + " },\n", + " ],\n", + " temperature=0,\n", + " max_tokens=2048,\n", + " response_format={\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": {\n", + " \"name\": \"foo\",\n", + " # convert the pydantic model to json schema\n", + " \"schema\": CapitalInfo.model_json_schema(),\n", + " },\n", + " },\n", + ")\n", + "\n", + "print_highlight(\n", + " f\"reasoing_content: {response.choices[0].message.reasoning_content}\\n\\ncontent: {response.choices[0].message.content}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**JSON Schema Directly**\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "json_schema = json.dumps(\n", + " {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n", + " \"population\": {\"type\": \"integer\"},\n", + " },\n", + " \"required\": [\"name\", \"population\"],\n", + " }\n", + ")\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n", + " messages=[\n", + " {\n", 
+ " \"role\": \"assistant\",\n", + " \"content\": \"Give me the information and population of the capital of France in the JSON format.\",\n", + " },\n", + " ],\n", + " temperature=0,\n", + " max_tokens=2048,\n", + " response_format={\n", + " \"type\": \"json_schema\",\n", + " \"json_schema\": {\"name\": \"foo\", \"schema\": json.loads(json_schema)},\n", + " },\n", + ")\n", + "\n", + "print_highlight(\n", + " f\"reasoing_content: {response.choices[0].message.reasoning_content}\\n\\ncontent: {response.choices[0].message.content}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### EBNF" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ebnf_grammar = \"\"\"\n", + "root ::= city | description\n", + "city ::= \"London\" | \"Paris\" | \"Berlin\" | \"Rome\"\n", + "description ::= city \" is \" status\n", + "status ::= \"the capital of \" country\n", + "country ::= \"England\" | \"France\" | \"Germany\" | \"Italy\"\n", + "\"\"\"\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"You are a helpful geography bot.\"},\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": \"Give me the information and population of the capital of France in the JSON format.\",\n", + " },\n", + " ],\n", + " temperature=0,\n", + " max_tokens=2048,\n", + " extra_body={\"ebnf\": ebnf_grammar},\n", + ")\n", + "\n", + "print_highlight(\n", + " f\"reasoing_content: {response.choices[0].message.reasoning_content}\\n\\ncontent: {response.choices[0].message.content}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Regular expression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = client.chat.completions.create(\n", + " 
model=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n", + " messages=[\n", + " {\"role\": \"assistant\", \"content\": \"What is the capital of France?\"},\n", + " ],\n", + " temperature=0,\n", + " max_tokens=2048,\n", + " extra_body={\"regex\": \"(Paris|London)\"},\n", + ")\n", + "\n", + "print_highlight(\n", + " f\"reasoing_content: {response.choices[0].message.reasoning_content}\\n\\ncontent: {response.choices[0].message.content}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Structural Tag" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tool_get_current_weather = {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"get_current_weather\",\n", + " \"description\": \"Get the current weather in a given location\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"city\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The city to find the weather for, e.g. 'San Francisco'\",\n", + " },\n", + " \"state\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"the two-letter abbreviation for the state that the city is\"\n", + " \" in, e.g. 
'CA' which would mean 'California'\",\n", + " },\n", + " \"unit\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The unit to fetch the temperature in\",\n", + " \"enum\": [\"celsius\", \"fahrenheit\"],\n", + " },\n", + " },\n", + " \"required\": [\"city\", \"state\", \"unit\"],\n", + " },\n", + " },\n", + "}\n", + "\n", + "tool_get_current_date = {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"get_current_date\",\n", + " \"description\": \"Get the current date and time for a given timezone\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"timezone\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The timezone to fetch the current date and time for, e.g. 'America/New_York'\",\n", + " }\n", + " },\n", + " \"required\": [\"timezone\"],\n", + " },\n", + " },\n", + "}\n", + "\n", + "schema_get_current_weather = tool_get_current_weather[\"function\"][\"parameters\"]\n", + "schema_get_current_date = tool_get_current_date[\"function\"][\"parameters\"]\n", + "\n", + "\n", + "def get_messages():\n", + " return [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": f\"\"\"\n", + "# Tool Instructions\n", + "- Always execute python code in messages that you share.\n", + "- When looking for real time information use relevant functions if available else fallback to brave_search\n", + "You have access to the following functions:\n", + "Use the function 'get_current_weather' to: Get the current weather in a given location\n", + "{tool_get_current_weather[\"function\"]}\n", + "Use the function 'get_current_date' to: Get the current date and time for a given timezone\n", + "{tool_get_current_date[\"function\"]}\n", + "If a you choose to call a function ONLY reply in the following format:\n", + "<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}\n", + "where\n", + "start_tag => ` a JSON dict with the function argument name as key and function argument value as value.\n", 
+ "end_tag => ``\n", + "Here is an example,\n", + "{{\"example_name\": \"example_value\"}}\n", + "Reminder:\n", + "- Function calls MUST follow the specified format\n", + "- Required parameters MUST be specified\n", + "- Only call one function at a time\n", + "- Put the entire function call reply on one line\n", + "- Always add your sources when using search results to answer the user query\n", + "You are a helpful assistant.\"\"\",\n", + " },\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": \"You are in New York. Please get the current date and time, and the weather.\",\n", + " },\n", + " ]\n", + "\n", + "\n", + "messages = get_messages()\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n", + " messages=messages,\n", + " response_format={\n", + " \"type\": \"structural_tag\",\n", + " \"max_new_tokens\": 2048,\n", + " \"structures\": [\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_weather,\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_date,\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"triggers\": [\" Note: For native API, as a work-around, you need to set `require_reasoning` argument to `True` to ensure the model will think before generating the structured output. It's not required for chat-completion API." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### JSON" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Using Pydantic**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "from pydantic import BaseModel, Field\n", + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n", + "\n", + "\n", + "# Define the schema using Pydantic\n", + "class CapitalInfo(BaseModel):\n", + " name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n", + " population: int = Field(..., description=\"Population of the capital city\")\n", + "\n", + "\n", + "messages = [\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": \"Give me the information and population of the capital of France in the JSON format.\",\n", + " },\n", + "]\n", + "text = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n", + ")\n", + "# Make API request\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": text,\n", + " \"require_reasoning\": True,\n", + " \"sampling_params\": {\n", + " \"temperature\": 0,\n", + " \"max_new_tokens\": 2048,\n", + " \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n", + " },\n", + " },\n", + ")\n", + "print(response.json())\n", + "\n", + "\n", + "reasoing_content = response.json()[\"text\"].split(\"
\")[0]\n", + "content = response.json()[\"text\"].split(\"\")[1]\n", + "print_highlight(f\"reasoing_content: {reasoing_content}\\n\\ncontent: {content}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**JSON Schema Directly**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "json_schema = json.dumps(\n", + " {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n", + " \"population\": {\"type\": \"integer\"},\n", + " },\n", + " \"required\": [\"name\", \"population\"],\n", + " }\n", + ")\n", + "\n", + "# JSON\n", + "text = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n", + ")\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": text,\n", + " \"require_reasoning\": True,\n", + " \"sampling_params\": {\n", + " \"temperature\": 0,\n", + " \"max_new_tokens\": 2048,\n", + " \"json_schema\": json_schema,\n", + " },\n", + " },\n", + ")\n", + "\n", + "print_highlight(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### EBNF" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": \"Give me the information of the capital of France.\",\n", + " \"require_reasoning\": True,\n", + " \"sampling_params\": {\n", + " \"max_new_tokens\": 2048,\n", + " \"temperature\": 0,\n", + " \"n\": 3,\n", + " \"ebnf\": (\n", + " \"root ::= city | description\\n\"\n", + " 'city ::= \"London\" | \"Paris\" | \"Berlin\" | \"Rome\"\\n'\n", + " 'description ::= city \" is \" status\\n'\n", + " 'status ::= \"the capital of \" country\\n'\n", + " 'country ::= \"England\" | \"France\" | \"Germany\" | 
\"Italy\"'\n", + " ),\n", + " },\n", + " \"stream\": False,\n", + " \"return_logprob\": False,\n", + " },\n", + ")\n", + "\n", + "print(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Regular expression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": \"Paris is the capital of\",\n", + " \"require_reasoning\": True,\n", + " \"sampling_params\": {\n", + " \"temperature\": 0,\n", + " \"max_new_tokens\": 2048,\n", + " \"regex\": \"(France|England)\",\n", + " },\n", + " },\n", + ")\n", + "print(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Structural Tag" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "text = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n", + ")\n", + "payload = {\n", + " \"text\": text,\n", + " \"require_reasoning\": True,\n", + " \"sampling_params\": {\n", + " \"max_new_tokens\": 2048,\n", + " \"structural_tag\": json.dumps(\n", + " {\n", + " \"type\": \"structural_tag\",\n", + " \"structures\": [\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_weather,\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_date,\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"triggers\": [\"\",\n", + " \"schema\": schema_get_current_weather,\n", + " \"end\": \"\",\n", + " },\n", + " {\n", + " \"begin\": \"\",\n", + " \"schema\": schema_get_current_date,\n", + " \"end\": \"\",\n", + " },\n", + " ],\n", + " \"triggers\": [\"...` to denote reasoning sections, you might want to allow free-form text within these sections while still enforcing grammar constraints on the rest of the output. 
+ +SGLang provides a feature to disable grammar restrictions within reasoning sections. This is particularly useful for models that need to perform complex reasoning steps before providing a structured output. + +To enable this feature, pass the `--reasoning-parser` flag when launching the server. The selected parser determines the think_end_token, such as `</think>`. + +## Supported Models + +Currently, SGLang supports the following reasoning models: +- [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d): The reasoning content is wrapped with `<think>` and `</think>` tags. +- [QwQ](https://huggingface.co/Qwen/QwQ-32B): The reasoning content is wrapped with `<think>` and `</think>` tags. + + +## Usage + +## OpenAI Compatible API + + +Specify the `--grammar-backend` and `--reasoning-parser` options. + + + +```python Example +import openai +import os + +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, print_highlight, terminate_process + +os.environ["TOKENIZERS_PARALLELISM"] = "false" + + +server_process, port = launch_server_cmd( + "python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning" +) + +wait_for_server(f"http://localhost:{port}") +client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None") +``` + +### JSON + +You can directly define a JSON schema or use [Pydantic](https://docs.pydantic.dev/latest/) to define and validate the response. 
+ + +**Using Pydantic** + + + +```python Example +from pydantic import BaseModel, Field + + +# Define the schema using Pydantic +class CapitalInfo(BaseModel): + name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city") + population: int = Field(..., description="Population of the capital city") + + +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", + messages=[ + { + "role": "assistant", + "content": "Give me the information and population of the capital of France in the JSON format.", + }, + ], + temperature=0, + max_tokens=2048, + response_format={ + "type": "json_schema", + "json_schema": { + "name": "foo", + # convert the pydantic model to json schema + "schema": CapitalInfo.model_json_schema(), + }, + }, +) + +print_highlight( + f"reasoning_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}" +) +``` + +**JSON Schema Directly** + + + + +```python Example +import json + +json_schema = json.dumps( + { + "type": "object", + "properties": { + "name": {"type": "string", "pattern": "^[\\w]+$"}, + "population": {"type": "integer"}, + }, + "required": ["name", "population"], + } +) + +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", + messages=[ + { + "role": "assistant", + "content": "Give me the information and population of the capital of France in the JSON format.", + }, + ], + temperature=0, + max_tokens=2048, + response_format={ + "type": "json_schema", + "json_schema": {"name": "foo", "schema": json.loads(json_schema)}, + }, +) + +print_highlight( + f"reasoning_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}" +) +``` + +### EBNF + + + +```python Example +ebnf_grammar = """ +root ::= city | description +city ::= "London" | "Paris" | "Berlin" | "Rome" +description ::= city " is " status +status ::= "the capital of " country +country ::= 
"England" | "France" | "Germany" | "Italy" +""" + +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", + messages=[ + {"role": "system", "content": "You are a helpful geography bot."}, + { + "role": "assistant", + "content": "Give me the information and population of the capital of France in the JSON format.", + }, + ], + temperature=0, + max_tokens=2048, + extra_body={"ebnf": ebnf_grammar}, +) + +print_highlight( + f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}" +) +``` + +### Regular expression + + + +```python Example +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", + messages=[ + {"role": "assistant", "content": "What is the capital of France?"}, + ], + temperature=0, + max_tokens=2048, + extra_body={"regex": "(Paris|London)"}, +) + +print_highlight( + f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}" +) +``` + +### Structural Tag + + + +```python Example +tool_get_current_weather = { + "type": "function", + "function": { + "name": "get_current_weather", + "description": "Get the current weather in a given location", + "parameters": { + "type": "object", + "properties": { + "city": { + "type": "string", + "description": "The city to find the weather for, e.g. 'San Francisco'", + }, + "state": { + "type": "string", + "description": "the two-letter abbreviation for the state that the city is" + " in, e.g. 
'CA' which would mean 'California'", + }, + "unit": { + "type": "string", + "description": "The unit to fetch the temperature in", + "enum": ["celsius", "fahrenheit"], + }, + }, + "required": ["city", "state", "unit"], + }, + }, +} + +tool_get_current_date = { + "type": "function", + "function": { + "name": "get_current_date", + "description": "Get the current date and time for a given timezone", + "parameters": { + "type": "object", + "properties": { + "timezone": { + "type": "string", + "description": "The timezone to fetch the current date and time for, e.g. 'America/New_York'", + } + }, + "required": ["timezone"], + }, + }, +} + +schema_get_current_weather = tool_get_current_weather["function"]["parameters"] +schema_get_current_date = tool_get_current_date["function"]["parameters"] + + +def get_messages(): + return [ + { + "role": "system", + "content": f""" +# Tool Instructions +- Always execute python code in messages that you share. +- When looking for real time information use relevant functions if available else fallback to brave_search +You have access to the following functions: +Use the function 'get_current_weather' to: Get the current weather in a given location +{tool_get_current_weather["function"]} +Use the function 'get_current_date' to: Get the current date and time for a given timezone +{tool_get_current_date["function"]} +If a you choose to call a function ONLY reply in the following format: +<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}} +where +start_tag => ` a JSON dict with the function argument name as key and function argument value as value. 
+end_tag => `` +Here is an example, +{{"example_name": "example_value"}} +Reminder: +- Function calls MUST follow the specified format +- Required parameters MUST be specified +- Only call one function at a time +- Put the entire function call reply on one line +- Always add your sources when using search results to answer the user query +You are a helpful assistant.""", + }, + { + "role": "assistant", + "content": "You are in New York. Please get the current date and time, and the weather.", + }, + ] + + +messages = get_messages() + +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", + messages=messages, + response_format={ + "type": "structural_tag", + "max_new_tokens": 2048, + "structures": [ + { + "begin": "", + "schema": schema_get_current_weather, + "end": "", + }, + { + "begin": "", + "schema": schema_get_current_date, + "end": "", + }, + ], + "triggers": [" Note: For native API, as a work-around, you need to set `require_reasoning` argument to `True` to ensure the model will think before generating the structured output. It's not required for chat-completion API. 
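With the native API, the reasoning and the final answer come back together in a single `text` field, so the caller has to separate them. The following is a minimal sketch of that step, not SGLang's own parsing logic. It assumes the model closes its reasoning with a `</think>` tag (the convention used by DeepSeek-R1 style models); the helper name `split_reasoning` and the sample string are hypothetical.

```python
def split_reasoning(text, end_tag="</think>"):
    """Split raw generated text into (reasoning, content).

    Assumption: the model ends its reasoning with `end_tag`.
    """
    reasoning, sep, content = text.partition(end_tag)
    if not sep:
        # No closing tag found: treat the whole text as final content.
        return "", text.strip()
    return reasoning.strip(), content.strip()


# Made-up sample output, for illustration only.
raw = 'France is a country in Europe. Its capital is Paris.</think>{"name": "Paris"}'
reasoning_content, content = split_reasoning(raw)
print("reasoning_content:", reasoning_content)
print("content:", content)
```

Using `str.partition` rather than `split(...)[1]` avoids an `IndexError` when the closing tag is missing, for example when generation stops at `max_new_tokens` before the reasoning ends.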
+ + +### JSON + + +**Using Pydantic** + + + +```python Example +import json +import requests +from pydantic import BaseModel, Field +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B") + + +# Define the schema using Pydantic +class CapitalInfo(BaseModel): + name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city") + population: int = Field(..., description="Population of the capital city") + + +messages = [ + { + "role": "assistant", + "content": "Give me the information and population of the capital of France in the JSON format.", + }, +] +text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True, return_dict=False +) +# Make API request +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": text, + "require_reasoning": True, + "sampling_params": { + "temperature": 0, + "max_new_tokens": 2048, + "json_schema": json.dumps(CapitalInfo.model_json_schema()), + }, + }, +) +print(response.json()) + + +# Split the raw text on the closing reasoning tag +reasoning_content = response.json()["text"].split("</think>")[0] +content = response.json()["text"].split("</think>")[1] +print_highlight(f"reasoning_content: {reasoning_content}\n\ncontent: {content}") +``` + +**JSON Schema Directly** + + + +```python Example +json_schema = json.dumps( + { + "type": "object", + "properties": { + "name": {"type": "string", "pattern": "^[\\w]+$"}, + "population": {"type": "integer"}, + }, + "required": ["name", "population"], + } +) + +# JSON +text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True, return_dict=False +) +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": text, + "require_reasoning": True, + "sampling_params": { + "temperature": 0, + "max_new_tokens": 2048, + "json_schema": json_schema, + }, + }, +) + +print_highlight(response.json()) +``` + +### EBNF + + + +```python Example +response = requests.post( + 
f"http://localhost:{port}/generate", + json={ + "text": "Give me the information of the capital of France.", + "require_reasoning": True, + "sampling_params": { + "max_new_tokens": 2048, + "temperature": 0, + "n": 3, + "ebnf": ( + "root ::= city | description\n" + 'city ::= "London" | "Paris" | "Berlin" | "Rome"\n' + 'description ::= city " is " status\n' + 'status ::= "the capital of " country\n' + 'country ::= "England" | "France" | "Germany" | "Italy"' + ), + }, + "stream": False, + "return_logprob": False, + }, +) + +print(response.json()) +``` + +### Regular expression + + + +```python Example +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": "Paris is the capital of", + "require_reasoning": True, + "sampling_params": { + "temperature": 0, + "max_new_tokens": 2048, + "regex": "(France|England)", + }, + }, +) +print(response.json()) +``` + +### Structural Tag + + + +```python Example +text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True, return_dict=False +) +payload = { + "text": text, + "require_reasoning": True, + "sampling_params": { + "max_new_tokens": 2048, + "structural_tag": json.dumps( + { + "type": "structural_tag", + "structures": [ + { + "begin": "", + "schema": schema_get_current_weather, + "end": "", + }, + { + "begin": "", + "schema": schema_get_current_date, + "end": "", + }, + ], + "triggers": ["", + "schema": schema_get_current_weather, + "end": "", + }, + { + "begin": "", + "schema": schema_get_current_date, + "end": "", + }, + ], + "triggers": [" is not trimmed.\n", + "\n", + "sampling_params = {\n", + " \"max_new_tokens\": 1024,\n", + " \"temperature\": 0,\n", + " \"top_p\": 0.95,\n", + " \"skip_special_tokens\": False,\n", + "}\n", + "\n", + "# 1) Offline generation\n", + "result = llm.generate(input_ids=input_ids, sampling_params=sampling_params)\n", + "generated_text = result[\"text\"] # Assume there is only one prompt\n", + "\n", + "print_highlight(\"=== Offline 
Engine Output Text ===\")\n", + "print_highlight(generated_text)\n", + "\n", + "\n", + "# 2) Parse using FunctionCallParser\n", + "def convert_dict_to_tool(tool_dict: dict) -> Tool:\n", + " function_dict = tool_dict.get(\"function\", {})\n", + " return Tool(\n", + " type=tool_dict.get(\"type\", \"function\"),\n", + " function=Function(\n", + " name=function_dict.get(\"name\"),\n", + " description=function_dict.get(\"description\"),\n", + " parameters=function_dict.get(\"parameters\"),\n", + " ),\n", + " )\n", + "\n", + "\n", + "tools = [convert_dict_to_tool(raw_tool) for raw_tool in tools]\n", + "\n", + "parser = FunctionCallParser(tools=tools, tool_call_parser=\"qwen25\")\n", + "normal_text, calls = parser.parse_non_stream(generated_text)\n", + "\n", + "print_highlight(\"=== Parsing Result ===\")\n", + "print(\"Normal text portion:\", normal_text)\n", + "print_highlight(\"Function call portion:\")\n", + "for call in calls:\n", + " # call: ToolCallItem\n", + " print_highlight(f\" - tool name: {call.name}\")\n", + " print_highlight(f\" parameters: {call.parameters}\")\n", + "\n", + "# 3) If needed, perform additional logic on the parsed functions, such as automatically calling the corresponding function to obtain a return value, etc." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "llm.shutdown()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tool Choice Mode\n", + "\n", + "SGLang supports OpenAI's `tool_choice` parameter to control when and which tools the model should call. 
This feature is implemented using EBNF (Extended Backus-Naur Form) grammar to ensure reliable tool calling behavior.\n", + "\n", + "### Supported Tool Choice Options\n", + "\n", + "- **`tool_choice=\"required\"`**: Forces the model to call at least one tool\n", + "- **`tool_choice={\"type\": \"function\", \"function\": {\"name\": \"specific_function\"}}`**: Forces the model to call a specific function\n", + "\n", + "### Backend Compatibility\n", + "\n", + "Tool choice is fully supported with the **Xgrammar backend**, which is the default grammar backend (`--grammar-backend xgrammar`). However, it may not be fully supported with other backends such as `outlines`.\n", + "\n", + "### Example: Required Tool Choice" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", + "from sglang.test.doc_patch import launch_server_cmd\n", + "\n", + "# Start a new server session for tool choice examples\n", + "server_process_tool_choice, port_tool_choice = launch_server_cmd(\n", + " \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\"\n", + ")\n", + "wait_for_server(\n", + " f\"http://localhost:{port_tool_choice}\", process=server_process_tool_choice\n", + ")\n", + "\n", + "# Initialize client for tool choice examples\n", + "client_tool_choice = OpenAI(\n", + " api_key=\"None\", base_url=f\"http://0.0.0.0:{port_tool_choice}/v1\"\n", + ")\n", + "model_name_tool_choice = client_tool_choice.models.list().data[0].id\n", + "\n", + "# Example with tool_choice=\"required\" - forces the model to call a tool\n", + "messages_required = [\n", + " {\"role\": \"user\", \"content\": \"Hello, what is the capital of France?\"}\n", + "]\n", + "\n", + "# Define tools\n", + "tools = [\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": 
{\n", + " \"name\": \"get_current_weather\",\n", + " \"description\": \"Get the current weather in a given location\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"city\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The city to find the weather for, e.g. 'San Francisco'\",\n", + " },\n", + " \"unit\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The unit to fetch the temperature in\",\n", + " \"enum\": [\"celsius\", \"fahrenheit\"],\n", + " },\n", + " },\n", + " \"required\": [\"city\", \"unit\"],\n", + " },\n", + " },\n", + " }\n", + "]\n", + "\n", + "response_required = client_tool_choice.chat.completions.create(\n", + " model=model_name_tool_choice,\n", + " messages=messages_required,\n", + " temperature=0,\n", + " max_tokens=1024,\n", + " tools=tools,\n", + " tool_choice=\"required\", # Force the model to call a tool\n", + ")\n", + "\n", + "print_highlight(\"Response with tool_choice='required':\")\n", + "print(\"Content:\", response_required.choices[0].message.content)\n", + "print(\"Tool calls:\", response_required.choices[0].message.tool_calls)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example: Specific Function Choice\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example with specific function choice - forces the model to call a specific function\n", + "messages_specific = [\n", + " {\"role\": \"user\", \"content\": \"What are the most attactive places in France?\"}\n", + "]\n", + "\n", + "response_specific = client_tool_choice.chat.completions.create(\n", + " model=model_name_tool_choice,\n", + " messages=messages_specific,\n", + " temperature=0,\n", + " max_tokens=1024,\n", + " tools=tools,\n", + " tool_choice={\n", + " \"type\": \"function\",\n", + " \"function\": {\"name\": \"get_current_weather\"},\n", + " }, # Force the model to call the specific get_current_weather 
function\n", + ")\n", + "\n", + "print_highlight(\"Response with specific function choice:\")\n", + "print(\"Content:\", response_specific.choices[0].message.content)\n", + "print(\"Tool calls:\", response_specific.choices[0].message.tool_calls)\n", + "\n", + "if response_specific.choices[0].message.tool_calls:\n", + " tool_call = response_specific.choices[0].message.tool_calls[0]\n", + " print_highlight(f\"Called function: {tool_call.function.name}\")\n", + " print_highlight(f\"Arguments: {tool_call.function.arguments}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process_tool_choice)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pythonic Tool Call Format (Llama-3.2 / Llama-3.3 / Llama-4)\n", + "\n", + "Some Llama models (such as Llama-3.2-1B, Llama-3.2-3B, Llama-3.3-70B, and Llama-4) support a \"pythonic\" tool call format, where the model outputs function calls as Python code, e.g.:\n", + "\n", + "```python\n", + "[get_current_weather(city=\"San Francisco\", state=\"CA\", unit=\"celsius\")]\n", + "```\n", + "\n", + "- The output is a Python list of function calls, with arguments as Python literals (not JSON).\n", + "- Multiple tool calls can be returned in the same list:\n", + "```python\n", + "[get_current_weather(city=\"San Francisco\", state=\"CA\", unit=\"celsius\"),\n", + " get_current_weather(city=\"New York\", state=\"NY\", unit=\"fahrenheit\")]\n", + "```\n", + "\n", + "For more information, refer to Meta’s documentation on [Zero shot function calling](https://github.com/meta-llama/llama-models/blob/main/models/llama4/prompt_format.md#zero-shot-function-calling---system-message).\n", + "\n", + "Note that this feature is still under development on Blackwell.\n", + "\n", + "### How to enable\n", + "- Launch the server with `--tool-call-parser pythonic`\n", + "- You may also specify --chat-template with the improved template for 
the model (e.g., `--chat-template=examples/chat_template/tool_chat_template_llama4_pythonic.jinja`).\n", + "This is recommended because the model expects a special prompt format to reliably produce valid pythonic tool call outputs. The template ensures that the prompt structure (e.g., special tokens, message boundaries like `<|eom|>`, and function call delimiters) matches what the model was trained or fine-tuned on. If you do not use the correct chat template, tool calling may fail or produce inconsistent results.\n", + "\n", + "#### Forcing Pythonic Tool Call Output Without a Chat Template\n", + "If you don't want to specify a chat template, you must give the model extremely explicit instructions in your messages to enforce pythonic output. For example, for `Llama-3.2-1B-Instruct`, you need:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "server_process, port = launch_server_cmd(\n", + " \" python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --tool-call-parser pythonic --tp 1 --log-level warning\" # llama-3.2-1b-instruct\n", + ")\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", + "\n", + "tools = [\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"get_weather\",\n", + " \"description\": \"Get the current weather for a given location.\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"location\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The name of the city or location.\",\n", + " }\n", + " },\n", + " \"required\": [\"location\"],\n", + " },\n", + " },\n", + " },\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"get_tourist_attractions\",\n", + " \"description\": \"Get a list of top tourist attractions for a given city.\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", 
+ " \"properties\": {\n", + " \"city\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The name of the city to find attractions for.\",\n", + " }\n", + " },\n", + " \"required\": [\"city\"],\n", + " },\n", + " },\n", + " },\n", + "]\n", + "\n", + "\n", + "def get_messages():\n", + " return [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are a travel assistant. \"\n", + " \"When asked to call functions, ALWAYS respond ONLY with a python list of function calls, \"\n", + " \"using this format: [func_name1(param1=value1, param2=value2), func_name2(param=value)]. \"\n", + " \"Do NOT use JSON, do NOT use variables, do NOT use any other format. \"\n", + " \"Here is an example:\\n\"\n", + " '[get_weather(location=\"Paris\"), get_tourist_attractions(city=\"Paris\")]'\n", + " ),\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": (\n", + " \"I'm planning a trip to Tokyo next week. What's the weather like and what are some top tourist attractions? 
\"\n", + " \"Propose parallel tool calls at once, using the python list of function calls format as shown above.\"\n", + " ),\n", + " },\n", + " ]\n", + "\n", + "\n", + "messages = get_messages()\n", + "\n", + "client = openai.Client(base_url=f\"http://localhost:{port}/v1\", api_key=\"xxxxxx\")\n", + "model_name = client.models.list().data[0].id\n", + "\n", + "\n", + "response_non_stream = client.chat.completions.create(\n", + " model=model_name,\n", + " messages=messages,\n", + " temperature=0,\n", + " top_p=0.9,\n", + " stream=False, # Non-streaming\n", + " tools=tools,\n", + ")\n", + "print_highlight(\"Non-stream response:\")\n", + "print_highlight(response_non_stream)\n", + "\n", + "response_stream = client.chat.completions.create(\n", + " model=model_name,\n", + " messages=messages,\n", + " temperature=0,\n", + " top_p=0.9,\n", + " stream=True,\n", + " tools=tools,\n", + ")\n", + "texts = \"\"\n", + "tool_calls = []\n", + "name = \"\"\n", + "arguments = \"\"\n", + "\n", + "for chunk in response_stream:\n", + " if chunk.choices[0].delta.content:\n", + " texts += chunk.choices[0].delta.content\n", + " if chunk.choices[0].delta.tool_calls:\n", + " tool_calls.append(chunk.choices[0].delta.tool_calls[0])\n", + "\n", + "print_highlight(\"Streaming Response:\")\n", + "print_highlight(\"==== Text ====\")\n", + "print_highlight(texts)\n", + "\n", + "print_highlight(\"==== Tool Call ====\")\n", + "for tool_call in tool_calls:\n", + " print_highlight(tool_call)\n", + "\n", + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> **Note:** \n", + "> The model may still default to JSON if it was heavily finetuned on that format. Prompt engineering (including examples) is the only way to increase the chance of pythonic output if you are not using a chat template." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How to support a new model?\n", + "1. 
Update the TOOLS_TAG_LIST in sglang/srt/function_call_parser.py with the model’s tool tags. Currently supported tags include:\n", + "```\n", + "\tTOOLS_TAG_LIST = [\n", + "\t “<|plugin|>“,\n", + "\t ““,\n", + "\t “<|python_tag|>“,\n", + "\t “[TOOL_CALLS]”\n", + "\t]\n", + "```\n", + "2. Create a new detector class in sglang/srt/function_call_parser.py that inherits from BaseFormatDetector. The detector should handle the model’s specific function call format. For example:\n", + "```\n", + " class NewModelDetector(BaseFormatDetector):\n", + "```\n", + "3. Add the new detector to the MultiFormatParser class that manages all the format detectors." + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs_new/docs/advanced_features/tool_parser.mdx b/docs_new/docs/advanced_features/tool_parser.mdx new file mode 100644 index 000000000000..a5fcc169b1eb --- /dev/null +++ b/docs_new/docs/advanced_features/tool_parser.mdx @@ -0,0 +1,740 @@ +--- +title: "Tool Parser" +metatags: + description: "SGLang function calling: tool parsers for DeepSeek, Llama, Qwen, Mistral, GLM, Kimi K2. OpenAI-compatible tool use API." +--- +This guide demonstrates how to use SGLang’s [Function calling](https://platform.openai.com/docs/guides/function-calling) functionality. + + +## Currently supported parsers: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parser | Supported Models | Notes |
|---|---|---|
| `deepseekv3` | DeepSeek-v3 (e.g., `deepseek-ai/DeepSeek-V3-0324`) | Recommend adding `--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja` to launch command. |
| `deepseekv31` | DeepSeek-V3.1 and DeepSeek-V3.2-Exp (e.g. `deepseek-ai/DeepSeek-V3.1`, `deepseek-ai/DeepSeek-V3.2-Exp`) | Recommend adding `--chat-template ./examples/chat_template/tool_chat_template_deepseekv31.jinja` (Or ..deepseekv32.jinja for DeepSeek-V3.2) to launch command. |
| `deepseekv32` | DeepSeek-V3.2 (`deepseek-ai/DeepSeek-V3.2`) | |
| `glm` | GLM series (e.g. `zai-org/GLM-4.6`) | |
| `gpt-oss` | GPT-OSS (e.g., `openai/gpt-oss-120b`, `openai/gpt-oss-20b`, `lmsys/gpt-oss-120b-bf16`, `lmsys/gpt-oss-20b-bf16`) | The gpt-oss tool parser filters out analysis channel events and only preserves normal text. This can cause the content to be empty when explanations are in the analysis channel. To work around this, complete the tool round by returning tool results as `role="tool"` messages, which enables the model to generate the final content. |
| `kimi_k2` | `moonshotai/Kimi-K2-Instruct` | |
| `llama3` | Llama 3.1 / 3.2 / 3.3 (e.g. `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`, `meta-llama/Llama-3.3-70B-Instruct`) | |
| `llama4` | Llama 4 (e.g. `meta-llama/Llama-4-Scout-17B-16E-Instruct`) | |
| `mistral` | Mistral (e.g. `mistralai/Mistral-7B-Instruct-v0.3`, `mistralai/Mistral-Nemo-Instruct-2407`, `mistralai/Mistral-7B-v0.3`) | |
| `pythonic` | Llama-3.2 / Llama-3.3 / Llama-4 | Model outputs function calls as Python code. Requires `--tool-call-parser pythonic` and is recommended to use with a specific chat template. |
| `qwen` | Qwen series (e.g. `Qwen/Qwen3-Next-80B-A3B-Instruct`, `Qwen/Qwen3-VL-30B-A3B-Thinking`) except Qwen3-Coder | |
| `qwen3_coder` | Qwen3-Coder (e.g. `Qwen/Qwen3-Coder-30B-A3B-Instruct`) | |
| `step3` | Step-3 | |
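
The parser name and the optional chat template from the table map directly onto launch flags. The following is a minimal sketch of composing such a command line; the helper `build_launch_command` is hypothetical (it is not part of SGLang) and only uses the `--model-path`, `--tool-call-parser`, and `--chat-template` flags that appear in this guide.

```python
import shlex


def build_launch_command(model_path, parser, chat_template=None):
    """Compose an `sglang.launch_server` command line (hypothetical helper)."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tool-call-parser", parser,
    ]
    if chat_template is not None:
        # Only some parsers recommend a dedicated chat template (see the Notes column).
        args += ["--chat-template", chat_template]
    # shlex.join quotes arguments safely for a POSIX shell.
    return shlex.join(args)


cmd = build_launch_command(
    "deepseek-ai/DeepSeek-V3-0324",
    "deepseekv3",
    chat_template="./examples/chat_template/tool_chat_template_deepseekv3.jinja",
)
print(cmd)
```

The resulting string can be passed to `launch_server_cmd` in the same way as the hard-coded commands in the examples that follow.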
+ + + +## OpenAI Compatible API + + +### Launching the Server + + + +```python Example +import json +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, print_highlight, terminate_process +from openai import OpenAI + +server_process, port = launch_server_cmd( + "python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning" # qwen25 +) +wait_for_server(f"http://localhost:{port}") +``` + +Note that `--tool-call-parser` defines the parser used to interpret responses. + + +### Define Tools for Function Call +Below is a Python snippet that shows how to define a tool as a dictionary. The dictionary includes the tool's name, a description, and a `parameters` schema that defines its properties. + + + +```python Example +# Define tools +tools = [ + { + "type": "function", + "function": { + "name": "get_current_weather", + "description": "Get the current weather in a given location", + "parameters": { + "type": "object", + "properties": { + "city": { + "type": "string", + "description": "The city to find the weather for, e.g. 'San Francisco'", + }, + "state": { + "type": "string", + "description": "the two-letter abbreviation for the state that the city is" + " in, e.g. 'CA' which would mean 'California'", + }, + "unit": { + "type": "string", + "description": "The unit to fetch the temperature in", + "enum": ["celsius", "fahrenheit"], + }, + }, + "required": ["city", "state", "unit"], + }, + }, + } +] +``` + +### Define Messages + + + +```python Example +def get_messages(): + return [ + { + "role": "user", + "content": "What's the weather like in Boston today? 
Output a reasoning before act, then use the tools to help you.", + } + ] + + +messages = get_messages() +``` + +### Initialize the Client + + + +```python Example +# Initialize OpenAI-like client +client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1") +model_name = client.models.list().data[0].id +``` + +### Non-Streaming Request + + + +```python Example +# Non-streaming mode test +response_non_stream = client.chat.completions.create( + model=model_name, + messages=messages, + temperature=0, + top_p=0.95, + max_tokens=1024, + stream=False, # Non-streaming + tools=tools, +) +print_highlight("Non-stream response:") +print_highlight(response_non_stream) +print_highlight("==== content ====") +print_highlight(response_non_stream.choices[0].message.content) +print_highlight("==== tool_calls ====") +print_highlight(response_non_stream.choices[0].message.tool_calls) +``` + +#### Handle Tools +When the engine determines it should call a particular tool, it will return arguments or partial arguments through the response. You can parse these arguments and later invoke the tool accordingly. 
+ + + +```python Example +name_non_stream = response_non_stream.choices[0].message.tool_calls[0].function.name +arguments_non_stream = ( + response_non_stream.choices[0].message.tool_calls[0].function.arguments +) + +print_highlight(f"Final streamed function call name: {name_non_stream}") +print_highlight(f"Final streamed function call arguments: {arguments_non_stream}") +``` + +### Streaming Request + + + +```python Example +# Streaming mode test +print_highlight("Streaming response:") +response_stream = client.chat.completions.create( + model=model_name, + messages=messages, + temperature=0, + top_p=0.95, + max_tokens=1024, + stream=True, # Enable streaming + tools=tools, +) + +texts = "" +tool_calls = [] +name = "" +arguments = "" +for chunk in response_stream: + if chunk.choices[0].delta.content: + texts += chunk.choices[0].delta.content + if chunk.choices[0].delta.tool_calls: + tool_calls.append(chunk.choices[0].delta.tool_calls[0]) +print_highlight("==== Text ====") +print_highlight(texts) + +print_highlight("==== Tool Call ====") +for tool_call in tool_calls: + print_highlight(tool_call) +``` + +#### Handle Tools +When the engine determines it should call a particular tool, it will return arguments or partial arguments through the response. You can parse these arguments and later invoke the tool accordingly. + + + +```python Example +# Parse and combine function call arguments +arguments = [] +for tool_call in tool_calls: + if tool_call.function.name: + print_highlight(f"Streamed function call name: {tool_call.function.name}") + + if tool_call.function.arguments: + arguments.append(tool_call.function.arguments) + +# Combine all fragments into a single JSON string +full_arguments = "".join(arguments) +print_highlight(f"streamed function call arguments: {full_arguments}") +``` + +### Define a Tool Function + + + +```python Example +# This is a demonstration, define real function according to your usage. 
+def get_current_weather(city: str, state: str, unit: str): + return ( + f"The weather in {city}, {state} is 85 degrees {unit}. It is " + "partly cloudy, with highs in the 90's." + ) + + +available_tools = {"get_current_weather": get_current_weather} +``` + + +### Execute the Tool + + + +```python Example +messages.append(response_non_stream.choices[0].message) + +# Call the corresponding tool function +tool_call = messages[-1].tool_calls[0] +tool_name = tool_call.function.name +tool_to_call = available_tools[tool_name] +result = tool_to_call(**(json.loads(tool_call.function.arguments))) +print_highlight(f"Function call result: {result}") +# messages.append({"role": "tool", "content": result, "name": tool_name}) +messages.append( + { + "role": "tool", + "tool_call_id": tool_call.id, + "content": str(result), + "name": tool_name, + } +) + +print_highlight(f"Updated message history: {messages}") +``` + +### Send Results Back to Model + + + +```python Example +final_response = client.chat.completions.create( + model=model_name, + messages=messages, + temperature=0, + top_p=0.95, + stream=False, + tools=tools, +) +print_highlight("Non-stream response:") +print_highlight(final_response) + +print_highlight("==== Text ====") +print_highlight(final_response.choices[0].message.content) +``` + +## Native API and SGLang Runtime (SRT) + + + +```python Example +from transformers import AutoTokenizer +import requests + +# generate an answer +tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct") + +messages = get_messages() + +input = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True, tools=tools, return_dict=False +) + +gen_url = f"http://localhost:{port}/generate" +gen_data = { + "text": input, + "sampling_params": { + "skip_special_tokens": False, + "max_new_tokens": 1024, + "temperature": 0, + "top_p": 0.95, + }, +} +gen_response = requests.post(gen_url, json=gen_data).json()["text"] +print_highlight("==== Response ====") 
+print_highlight(gen_response) + +# parse the response +parse_url = f"http://localhost:{port}/parse_function_call" + +function_call_input = { + "text": gen_response, + "tool_call_parser": "qwen25", + "tools": tools, +} + +function_call_response = requests.post(parse_url, json=function_call_input) +function_call_response_json = function_call_response.json() + +print_highlight("==== Text ====") +print(function_call_response_json["normal_text"]) +print_highlight("==== Calls ====") +print("function name: ", function_call_response_json["calls"][0]["name"]) +print("function arguments: ", function_call_response_json["calls"][0]["parameters"]) +``` + + +```python Example +terminate_process(server_process) +``` + +## Offline Engine API + + + +```python Example +import sglang as sgl +from sglang.srt.function_call.function_call_parser import FunctionCallParser +from sglang.srt.managers.io_struct import Tool, Function + +llm = sgl.Engine(model_path="Qwen/Qwen2.5-7B-Instruct") +tokenizer = llm.tokenizer_manager.tokenizer +input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, tools=tools, return_dict=False +) + +# Note that for gpt-oss tool parser, adding "no_stop_trim": True +# to make sure the tool call token is not trimmed. 
+ +sampling_params = { + "max_new_tokens": 1024, + "temperature": 0, + "top_p": 0.95, + "skip_special_tokens": False, +} + +# 1) Offline generation +result = llm.generate(input_ids=input_ids, sampling_params=sampling_params) +generated_text = result["text"] # Assume there is only one prompt + +print_highlight("=== Offline Engine Output Text ===") +print_highlight(generated_text) + + +# 2) Parse using FunctionCallParser +def convert_dict_to_tool(tool_dict: dict) -> Tool: + function_dict = tool_dict.get("function", {}) + return Tool( + type=tool_dict.get("type", "function"), + function=Function( + name=function_dict.get("name"), + description=function_dict.get("description"), + parameters=function_dict.get("parameters"), + ), + ) + + +tools = [convert_dict_to_tool(raw_tool) for raw_tool in tools] + +parser = FunctionCallParser(tools=tools, tool_call_parser="qwen25") +normal_text, calls = parser.parse_non_stream(generated_text) + +print_highlight("=== Parsing Result ===") +print("Normal text portion:", normal_text) +print_highlight("Function call portion:") +for call in calls: + # call: ToolCallItem + print_highlight(f" - tool name: {call.name}") + print_highlight(f" parameters: {call.parameters}") + +# 3) If needed, perform additional logic on the parsed functions, such as automatically calling the corresponding function to obtain a return value, etc. +``` + + +```python Example +llm.shutdown() +``` + +## Tool Choice Mode + +SGLang supports OpenAI's `tool_choice` parameter to control when and which tools the model should call. This feature is implemented using EBNF (Extended Backus-Naur Form) grammar to ensure reliable tool calling behavior. 
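
The two `tool_choice` forms used in this section can be produced by a small helper. This is a minimal sketch under stated assumptions: `make_tool_choice` is a hypothetical name, not an SGLang or OpenAI function, and it only builds the request value without contacting a server.

```python
def make_tool_choice(function_name=None):
    """Return a `tool_choice` value for a chat completion request (hypothetical helper).

    - No name given: force the model to call at least one tool.
    - Name given: force the model to call that specific function.
    """
    if function_name is None:
        return "required"
    return {"type": "function", "function": {"name": function_name}}


print(make_tool_choice())
print(make_tool_choice("get_current_weather"))
```

The returned value is passed as the `tool_choice` argument of `client.chat.completions.create(...)`, alongside `tools`.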
+ +### Supported Tool Choice Options + +- **`tool_choice="required"`**: Forces the model to call at least one tool +- **`tool_choice={"type": "function", "function": {"name": "specific_function"}}`**: Forces the model to call a specific function + +### Backend Compatibility + +Tool choice is fully supported with the **Xgrammar backend**, which is the default grammar backend (`--grammar-backend xgrammar`). However, it may not be fully supported with other backends such as `outlines`. + +### Example: Required Tool Choice + + + +```python Example +from openai import OpenAI +from sglang.utils import wait_for_server, print_highlight, terminate_process +from sglang.test.doc_patch import launch_server_cmd + +# Start a new server session for tool choice examples +server_process_tool_choice, port_tool_choice = launch_server_cmd( + "python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning" +) +wait_for_server(f"http://localhost:{port_tool_choice}") + +# Initialize client for tool choice examples +client_tool_choice = OpenAI( + api_key="None", base_url=f"http://0.0.0.0:{port_tool_choice}/v1" +) +model_name_tool_choice = client_tool_choice.models.list().data[0].id + +# Example with tool_choice="required" - forces the model to call a tool +messages_required = [ + {"role": "user", "content": "Hello, what is the capital of France?"} +] + +# Define tools +tools = [ + { + "type": "function", + "function": { + "name": "get_current_weather", + "description": "Get the current weather in a given location", + "parameters": { + "type": "object", + "properties": { + "city": { + "type": "string", + "description": "The city to find the weather for, e.g. 
'San Francisco'", + }, + "unit": { + "type": "string", + "description": "The unit to fetch the temperature in", + "enum": ["celsius", "fahrenheit"], + }, + }, + "required": ["city", "unit"], + }, + }, + } +] + +response_required = client_tool_choice.chat.completions.create( + model=model_name_tool_choice, + messages=messages_required, + temperature=0, + max_tokens=1024, + tools=tools, + tool_choice="required", # Force the model to call a tool +) + +print_highlight("Response with tool_choice='required':") +print("Content:", response_required.choices[0].message.content) +print("Tool calls:", response_required.choices[0].message.tool_calls) +``` + +### Example: Specific Function Choice + + + + +```python Example +# Example with specific function choice - forces the model to call a specific function +messages_specific = [ + {"role": "user", "content": "What are the most attractive places in France?"} +] + +response_specific = client_tool_choice.chat.completions.create( + model=model_name_tool_choice, + messages=messages_specific, + temperature=0, + max_tokens=1024, + tools=tools, + tool_choice={ + "type": "function", + "function": {"name": "get_current_weather"}, + }, # Force the model to call the specific get_current_weather function +) + +print_highlight("Response with specific function choice:") +print("Content:", response_specific.choices[0].message.content) +print("Tool calls:", response_specific.choices[0].message.tool_calls) + +if response_specific.choices[0].message.tool_calls: + tool_call = response_specific.choices[0].message.tool_calls[0] + print_highlight(f"Called function: {tool_call.function.name}") + print_highlight(f"Arguments: {tool_call.function.arguments}") +``` + + +```python Example +terminate_process(server_process_tool_choice) +``` + +## Pythonic Tool Call Format (Llama-3.2 / Llama-3.3 / Llama-4) + +Some Llama models (such as Llama-3.2-1B, Llama-3.2-3B, Llama-3.3-70B, and Llama-4) support a "pythonic" tool call format, where the model outputs
function calls as Python code, e.g.: + +```python Example +[get_current_weather(city="San Francisco", state="CA", unit="celsius")] +``` + +- The output is a Python list of function calls, with arguments as Python literals (not JSON). +- Multiple tool calls can be returned in the same list: +```python Example +[get_current_weather(city="San Francisco", state="CA", unit="celsius"), + get_current_weather(city="New York", state="NY", unit="fahrenheit")] +``` + +For more information, refer to Meta’s documentation on [Zero shot function calling](https://github.com/meta-llama/llama-models/blob/main/models/llama4/prompt_format.md#zero-shot-function-calling---system-message). + +Note that this feature is still under development on Blackwell. + +### How to enable +- Launch the server with `--tool-call-parser pythonic` +- You may also specify `--chat-template` with the improved template for the model (e.g., `--chat-template=examples/chat_template/tool_chat_template_llama4_pythonic.jinja`). +This is recommended because the model expects a special prompt format to reliably produce valid pythonic tool call outputs. The template ensures that the prompt structure (e.g., special tokens, message boundaries like `<|eom|>`, and function call delimiters) matches what the model was trained or fine-tuned on. If you do not use the correct chat template, tool calling may fail or produce inconsistent results. + +#### Forcing Pythonic Tool Call Output Without a Chat Template +If you don't want to specify a chat template, you must give the model extremely explicit instructions in your messages to enforce pythonic output.
For example, for `Llama-3.2-1B-Instruct`, you need: + + + +```python Example +import openai + +server_process, port = launch_server_cmd( + " python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --tool-call-parser pythonic --tp 1 --log-level warning" # llama-3.2-1b-instruct +) +wait_for_server(f"http://localhost:{port}") + +tools = [ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a given location.", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The name of the city or location.", + } + }, + "required": ["location"], + }, + }, + }, + { + "type": "function", + "function": { + "name": "get_tourist_attractions", + "description": "Get a list of top tourist attractions for a given city.", + "parameters": { + "type": "object", + "properties": { + "city": { + "type": "string", + "description": "The name of the city to find attractions for.", + } + }, + "required": ["city"], + }, + }, + }, +] + + +def get_messages(): + return [ + { + "role": "system", + "content": ( + "You are a travel assistant. " + "When asked to call functions, ALWAYS respond ONLY with a python list of function calls, " + "using this format: [func_name1(param1=value1, param2=value2), func_name2(param=value)]. " + "Do NOT use JSON, do NOT use variables, do NOT use any other format. " + "Here is an example:\n" + '[get_weather(location="Paris"), get_tourist_attractions(city="Paris")]' + ), + }, + { + "role": "user", + "content": ( + "I'm planning a trip to Tokyo next week. What's the weather like and what are some top tourist attractions? " + "Propose parallel tool calls at once, using the python list of function calls format as shown above." 
+ ), + }, + ] + + +messages = get_messages() + +client = openai.Client(base_url=f"http://localhost:{port}/v1", api_key="xxxxxx") +model_name = client.models.list().data[0].id + + +response_non_stream = client.chat.completions.create( + model=model_name, + messages=messages, + temperature=0, + top_p=0.9, + stream=False, # Non-streaming + tools=tools, +) +print_highlight("Non-stream response:") +print_highlight(response_non_stream) + +response_stream = client.chat.completions.create( + model=model_name, + messages=messages, + temperature=0, + top_p=0.9, + stream=True, + tools=tools, +) +texts = "" +tool_calls = [] +name = "" +arguments = "" + +for chunk in response_stream: + if chunk.choices[0].delta.content: + texts += chunk.choices[0].delta.content + if chunk.choices[0].delta.tool_calls: + tool_calls.append(chunk.choices[0].delta.tool_calls[0]) + +print_highlight("Streaming Response:") +print_highlight("==== Text ====") +print_highlight(texts) + +print_highlight("==== Tool Call ====") +for tool_call in tool_calls: + print_highlight(tool_call) + +terminate_process(server_process) +``` + +> **Note:** +> The model may still default to JSON if it was heavily finetuned on that format. Prompt engineering (including examples) is the only way to increase the chance of pythonic output if you are not using a chat template. + + +## How to support a new model? +1. Update the TOOLS_TAG_LIST in sglang/srt/function_call_parser.py with the model’s tool tags. Currently supported tags include: +```text Output + TOOLS_TAG_LIST = [ + “<|plugin|>“, + ““, + “<|python_tag|>“, + “[TOOL_CALLS]” + ] +``` +2. Create a new detector class in sglang/srt/function_call_parser.py that inherits from BaseFormatDetector. The detector should handle the model’s specific function call format. For example: +```text Output + class NewModelDetector(BaseFormatDetector): +``` +3. Add the new detector to the MultiFormatParser class that manages all the format detectors. 
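To make the steps above more concrete, here is a minimal, self-contained sketch of what a tag-based detector might do. It is an illustration only: `ToyTagDetector` and `ParsedCall` are hypothetical names, the class does not inherit from SGLang's `BaseFormatDetector`, and the assumption that the tool call is a JSON object with `name` and `parameters` keys following the tag is made for the example, not taken from the SGLang source.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ParsedCall:
    """Hypothetical container for one parsed tool call (not an SGLang type)."""

    name: str
    parameters: dict = field(default_factory=dict)


class ToyTagDetector:
    """Hypothetical stand-in for a format detector.

    Assumes the model emits `<tag>{"name": ..., "parameters": {...}}`.
    A real detector would subclass SGLang's BaseFormatDetector instead.
    """

    def __init__(self, tag: str = "<|python_tag|>"):
        self.tag = tag

    def has_tool_call(self, text: str) -> bool:
        # Step 1 analogue: the tag tells us a tool call is present.
        return self.tag in text

    def parse(self, text: str):
        """Return (normal_text, calls) for a complete, non-streamed output."""
        if not self.has_tool_call(text):
            return text, []
        normal_text, _, payload = text.partition(self.tag)
        calls = []
        # Each further occurrence of the tag starts another call.
        for chunk in payload.split(self.tag):
            chunk = chunk.strip()
            if not chunk:
                continue
            try:
                obj = json.loads(chunk)
            except json.JSONDecodeError:
                continue  # skip malformed payloads instead of raising
            if not isinstance(obj, dict) or "name" not in obj:
                continue
            calls.append(ParsedCall(obj["name"], obj.get("parameters", {})))
        return normal_text.strip(), calls


detector = ToyTagDetector()
raw_output = (
    "Let me check. "
    '<|python_tag|>{"name": "get_current_weather", '
    '"parameters": {"city": "Paris", "unit": "celsius"}}'
)
text, calls = detector.parse(raw_output)
print("Normal text:", text)
for call in calls:
    print("Call:", call.name, call.parameters)
```

The sketch only covers complete (non-streaming) output; a real detector also has to handle partial chunks arriving during streaming.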
diff --git a/docs_new/docs/advanced_features/vlm_query.ipynb b/docs_new/docs/advanced_features/vlm_query.ipynb new file mode 100644 index 000000000000..24bd7a90bc9f --- /dev/null +++ b/docs_new/docs/advanced_features/vlm_query.ipynb @@ -0,0 +1,379 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0", + "metadata": {}, + "source": [ + "# Query VLM with Offline Engine\n", + "\n", + "This tutorial demonstrates how to use SGLang's **offline Engine API** to query VLMs. We will demonstrate usage with Qwen2.5-VL and Llama 4. This section demonstrates three different calling approaches:\n", + "\n", + "1. **Basic Call**: Directly pass images and text.\n", + "2. **Processor Output**: Use HuggingFace processor for data preprocessing.\n", + "3. **Precomputed Embeddings**: Pre-calculate image features to improve inference efficiency." + ] + }, + { + "cell_type": "markdown", + "id": "1", + "metadata": {}, + "source": [ + "## Understanding the Three Input Formats\n", + "\n", + "SGLang supports three ways to pass visual data, each optimized for different scenarios:\n", + "\n", + "### 1. **Raw Images** - Simplest approach\n", + "- Pass PIL Images, file paths, URLs, or base64 strings directly\n", + "- SGLang handles all preprocessing automatically\n", + "- Best for: Quick prototyping, simple applications\n", + "\n", + "### 2. **Processor Output** - For custom preprocessing\n", + "- Pre-process images with HuggingFace processor\n", + "- Pass the complete processor output dict with `format: \"processor_output\"`\n", + "- Best for: Custom image transformations, integration with existing pipelines\n", + "- Requirement: Must use `input_ids` instead of text prompt\n", + "\n", + "### 3. 
**Precomputed Embeddings** - For maximum performance\n", + "- Pre-calculate visual embeddings using the vision encoder\n", + "- Pass embeddings with `format: \"precomputed_embedding\"`\n", + "- Best for: Repeated queries on same images, caching, high-throughput serving\n", + "- Performance gain: Avoids redundant vision encoder computation (30-50% speedup)\n", + "\n", + "**Key Rule**: Within a single request, use only one format for all images. Don't mix formats.\n", + "\n", + "The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models." + ] + }, + { + "cell_type": "markdown", + "id": "2", + "metadata": {}, + "source": [ + "## Querying Qwen2.5-VL Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3", + "metadata": {}, + "outputs": [], + "source": [ + "import nest_asyncio\n", + "\n", + "nest_asyncio.apply()\n", + "\n", + "import sglang.test.doc_patch # noqa: F401\n", + "\n", + "model_path = \"Qwen/Qwen2.5-VL-3B-Instruct\"\n", + "chat_template = \"qwen2-vl\"\n", + "example_image_url = \"https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4", + "metadata": {}, + "outputs": [], + "source": [ + "from io import BytesIO\n", + "import requests\n", + "from PIL import Image\n", + "\n", + "from sglang.srt.parser.conversation import chat_templates\n", + "\n", + "image = Image.open(BytesIO(requests.get(example_image_url).content))\n", + "\n", + "conv = chat_templates[chat_template].copy()\n", + "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n", + "conv.append_message(conv.roles[1], \"\")\n", + "conv.image_data = [image]\n", + "\n", + "print(\"Generated prompt text:\")\n", + "print(conv.get_prompt())\n", + "print(f\"\\nImage size: {image.size}\")\n", + "image" + ] + }, + { + "cell_type": "markdown", + "id": "5", + "metadata": {}, + "source": [ + "### Basic Offline Engine 
API Call" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6", + "metadata": {}, + "outputs": [], + "source": [ + "from sglang import Engine\n", + "\n", + "llm = Engine(model_path=model_path, chat_template=chat_template, log_level=\"warning\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7", + "metadata": {}, + "outputs": [], + "source": [ + "out = llm.generate(prompt=conv.get_prompt(), image_data=[image])\n", + "print(\"Model response:\")\n", + "print(out[\"text\"])" + ] + }, + { + "cell_type": "markdown", + "id": "8", + "metadata": {}, + "source": [ + "### Call with Processor Output\n", + "\n", + "Using a HuggingFace processor to preprocess text and images, and passing the `processor_output` directly into `Engine.generate`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoProcessor\n", + "\n", + "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n", + "processor_output = processor(\n", + " images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n", + ")\n", + "\n", + "out = llm.generate(\n", + " input_ids=processor_output[\"input_ids\"][0].detach().cpu().tolist(),\n", + " image_data=[dict(processor_output, format=\"processor_output\")],\n", + ")\n", + "print(\"Response using processor output:\")\n", + "print(out[\"text\"])" + ] + }, + { + "cell_type": "markdown", + "id": "10", + "metadata": {}, + "source": [ + "### Call with Precomputed Embeddings\n", + "\n", + "You can pre-calculate image features to avoid repeated visual encoding processes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoProcessor\n", + "from transformers import Qwen2_5_VLForConditionalGeneration\n", + "\n", + "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n", + "model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval()\n", + "vision = model.model.visual.cuda()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12", + "metadata": {}, + "outputs": [], + "source": [ + "processor_output = processor(\n", + " images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n", + ")\n", + "\n", + "input_ids = processor_output[\"input_ids\"][0].detach().cpu().tolist()\n", + "\n", + "precomputed_embeddings = vision(\n", + " processor_output[\"pixel_values\"].cuda(), processor_output[\"image_grid_thw\"].cuda()\n", + ")\n", + "precomputed_embeddings = precomputed_embeddings.pooler_output\n", + "\n", + "multi_modal_item = dict(\n", + " processor_output,\n", + " format=\"precomputed_embedding\",\n", + " feature=precomputed_embeddings,\n", + ")\n", + "\n", + "out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])\n", + "print(\"Response using precomputed embeddings:\")\n", + "print(out[\"text\"])\n", + "\n", + "llm.shutdown()" + ] + }, + { + "cell_type": "markdown", + "id": "13", + "metadata": {}, + "source": [ + "## Querying Llama 4 Vision Model\n", + "\n", + "```python\n", + "model_path = \"meta-llama/Llama-4-Scout-17B-16E-Instruct\"\n", + "chat_template = \"llama-4\"\n", + "\n", + "from io import BytesIO\n", + "import requests\n", + "from PIL import Image\n", + "\n", + "from sglang.srt.parser.conversation import chat_templates\n", + "\n", + "# Download the same example image\n", + "image = Image.open(BytesIO(requests.get(example_image_url).content))\n", + "\n", + "conv = chat_templates[chat_template].copy()\n", + "conv.append_message(conv.roles[0], f\"What's 
shown here: {conv.image_token}?\")\n", + "conv.append_message(conv.roles[1], \"\")\n", + "conv.image_data = [image]\n", + "\n", + "print(\"Llama 4 generated prompt text:\")\n", + "print(conv.get_prompt())\n", + "print(f\"Image size: {image.size}\")\n", + "\n", + "image\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "14", + "metadata": {}, + "source": [ + "### Llama 4 Basic Call\n", + "\n", + "Llama 4 requires more computational resources, so it's configured with multi-GPU parallelism (tp_size=4) and larger context length.\n", + "\n", + "```python\n", + "llm = Engine(\n", + " model_path=model_path,\n", + " enable_multimodal=True,\n", + " attention_backend=\"fa3\",\n", + " tp_size=4,\n", + " context_length=65536,\n", + ")\n", + "\n", + "out = llm.generate(prompt=conv.get_prompt(), image_data=[image])\n", + "print(\"Llama 4 response:\")\n", + "print(out[\"text\"])\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "15", + "metadata": {}, + "source": [ + "### Call with Processor Output\n", + "\n", + "Using HuggingFace processor to preprocess data can reduce computational overhead during inference.\n", + "\n", + "```python\n", + "from transformers import AutoProcessor\n", + "\n", + "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n", + "processor_output = processor(\n", + " images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n", + ")\n", + "\n", + "out = llm.generate(\n", + " input_ids=processor_output[\"input_ids\"][0].detach().cpu().tolist(),\n", + " image_data=[dict(processor_output, format=\"processor_output\")],\n", + ")\n", + "print(\"Response using processor output:\")\n", + "print(out)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "16", + "metadata": {}, + "source": [ + "### Call with Precomputed Embeddings\n", + "\n", + "```python\n", + "from transformers import AutoProcessor\n", + "from transformers import Llama4ForConditionalGeneration\n", + "\n", + "processor = 
AutoProcessor.from_pretrained(model_path, use_fast=True)\n", + "model = Llama4ForConditionalGeneration.from_pretrained(\n", + " model_path, torch_dtype=\"auto\"\n", + ").eval()\n", + "\n", + "vision = model.vision_model.cuda()\n", + "multi_modal_projector = model.multi_modal_projector.cuda()\n", + "\n", + "print(f'Image pixel values shape: {processor_output[\"pixel_values\"].shape}')\n", + "input_ids = processor_output[\"input_ids\"][0].detach().cpu().tolist()\n", + "\n", + "# Process image through vision encoder\n", + "image_outputs = vision(\n", + " processor_output[\"pixel_values\"].to(\"cuda\"), \n", + " aspect_ratio_ids=processor_output[\"aspect_ratio_ids\"].to(\"cuda\"),\n", + " aspect_ratio_mask=processor_output[\"aspect_ratio_mask\"].to(\"cuda\"),\n", + " output_hidden_states=False\n", + ")\n", + "image_features = image_outputs.last_hidden_state\n", + "\n", + "# Flatten image features and pass through multimodal projector\n", + "vision_flat = image_features.view(-1, image_features.size(-1))\n", + "precomputed_embeddings = multi_modal_projector(vision_flat)\n", + "\n", + "# Build precomputed embedding data item\n", + "mm_item = dict(\n", + " processor_output, \n", + " format=\"precomputed_embedding\", \n", + " feature=precomputed_embeddings\n", + ")\n", + "\n", + "# Use precomputed embeddings for efficient inference\n", + "out = llm.generate(input_ids=input_ids, image_data=[mm_item])\n", + "print(\"Llama 4 precomputed embedding response:\")\n", + "print(out[\"text\"])\n", + "```" + ] + } + ], + "metadata": { + "jupytext": { + "cell_metadata_filter": "-all", + "custom_cell_magics": "kql", + "encoding": "# -*- coding: utf-8 -*-", + "text_representation": { + "extension": ".py", + "format_name": "light", + "format_version": "1.5", + "jupytext_version": "1.16.1" + } + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": 
"python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs_new/docs/advanced_features/vlm_query.mdx b/docs_new/docs/advanced_features/vlm_query.mdx new file mode 100644 index 000000000000..ca411260fc2c --- /dev/null +++ b/docs_new/docs/advanced_features/vlm_query.mdx @@ -0,0 +1,249 @@ +--- +title: "Query VLM with Offline Engine" +metatags: + description: "SGLang VLM offline engine: raw images, processor output, precomputed embeddings. Qwen2.5-VL and Llama 4 examples." +--- +This tutorial demonstrates how to use SGLang's **offline Engine API** to query VLMs. We will demonstrate usage with Qwen2.5-VL and Llama 4. This section demonstrates three different calling approaches: + +1. **Basic Call**: Directly pass images and text. +2. **Processor Output**: Use HuggingFace processor for data preprocessing. +3. **Precomputed Embeddings**: Pre-calculate image features to improve inference efficiency. + +## Understanding the Three Input Formats + +SGLang supports three ways to pass visual data, each optimized for different scenarios: + +### 1. **Raw Images** - Simplest approach +- Pass PIL Images, file paths, URLs, or base64 strings directly +- SGLang handles all preprocessing automatically +- Best for: Quick prototyping, simple applications + +### 2. **Processor Output** - For custom preprocessing +- Pre-process images with HuggingFace processor +- Pass the complete processor output dict with `format: "processor_output"` +- Best for: Custom image transformations, integration with existing pipelines +- Requirement: Must use `input_ids` instead of text prompt + +### 3. 
**Precomputed Embeddings** - For maximum performance +- Pre-calculate visual embeddings using the vision encoder +- Pass embeddings with `format: "precomputed_embedding"` +- Best for: Repeated queries on same images, caching, high-throughput serving +- Performance gain: Avoids redundant vision encoder computation (30-50% speedup) + +**Key Rule**: Within a single request, use only one format for all images. Don't mix formats. + +The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models. + +## Querying Qwen2.5-VL Model + +```python Example +import nest_asyncio + +nest_asyncio.apply() + +import sglang.test.doc_patch # noqa: F401 + +model_path = "Qwen/Qwen2.5-VL-3B-Instruct" +chat_template = "qwen2-vl" +example_image_url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png" +``` + +```python Example +from io import BytesIO +import requests +from PIL import Image + +from sglang.srt.parser.conversation import chat_templates + +image = Image.open(BytesIO(requests.get(example_image_url).content)) + +conv = chat_templates[chat_template].copy() +conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?") +conv.append_message(conv.roles[1], "") +conv.image_data = [image] + +print("Generated prompt text:") +print(conv.get_prompt()) +print(f"\nImage size: {image.size}") +image +``` + +### Basic Offline Engine API Call + +```python Example +from sglang import Engine + +llm = Engine(model_path=model_path, chat_template=chat_template, log_level="warning") +``` + +```python Example +out = llm.generate(prompt=conv.get_prompt(), image_data=[image]) +print("Model response:") +print(out["text"]) +``` + +### Call with Processor Output + +Using a HuggingFace processor to preprocess text and images, and passing the `processor_output` directly into `Engine.generate`. 
+ +```python Example +from transformers import AutoProcessor + +processor = AutoProcessor.from_pretrained(model_path, use_fast=True) +processor_output = processor( + images=[image], text=conv.get_prompt(), return_tensors="pt" +) + +out = llm.generate( + input_ids=processor_output["input_ids"][0].detach().cpu().tolist(), + image_data=[dict(processor_output, format="processor_output")], +) +print("Response using processor output:") +print(out["text"]) +``` + +### Call with Precomputed Embeddings + +You can pre-calculate image features to avoid repeated visual encoding processes. + +```python Example +from transformers import AutoProcessor +from transformers import Qwen2_5_VLForConditionalGeneration + +processor = AutoProcessor.from_pretrained(model_path, use_fast=True) +model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval() +vision = model.model.visual.cuda() +``` + +```python Example +processor_output = processor( + images=[image], text=conv.get_prompt(), return_tensors="pt" +) + +input_ids = processor_output["input_ids"][0].detach().cpu().tolist() + +precomputed_embeddings = vision( + processor_output["pixel_values"].cuda(), processor_output["image_grid_thw"].cuda() +) +precomputed_embeddings = precomputed_embeddings.pooler_output + +multi_modal_item = dict( + processor_output, + format="precomputed_embedding", + feature=precomputed_embeddings, +) + +out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item]) +print("Response using precomputed embeddings:") +print(out["text"]) + +llm.shutdown() +``` + +## Querying Llama 4 Vision Model + +```python Example +model_path = "meta-llama/Llama-4-Scout-17B-16E-Instruct" +chat_template = "llama-4" + +from io import BytesIO +import requests +from PIL import Image + +from sglang.srt.parser.conversation import chat_templates + +# Download the same example image +image = Image.open(BytesIO(requests.get(example_image_url).content)) + +conv = chat_templates[chat_template].copy() 
+conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?") +conv.append_message(conv.roles[1], "") +conv.image_data = [image] + +print("Llama 4 generated prompt text:") +print(conv.get_prompt()) +print(f"Image size: {image.size}") + +image +``` + +### Llama 4 Basic Call + +Llama 4 requires more computational resources, so it's configured with multi-GPU parallelism (tp_size=4) and larger context length. + +```python Example +llm = Engine( + model_path=model_path, + enable_multimodal=True, + attention_backend="fa3", + tp_size=4, + context_length=65536, +) + +out = llm.generate(prompt=conv.get_prompt(), image_data=[image]) +print("Llama 4 response:") +print(out["text"]) +``` + +### Call with Processor Output + +Using HuggingFace processor to preprocess data can reduce computational overhead during inference. + +```python Example +from transformers import AutoProcessor + +processor = AutoProcessor.from_pretrained(model_path, use_fast=True) +processor_output = processor( + images=[image], text=conv.get_prompt(), return_tensors="pt" +) + +out = llm.generate( + input_ids=processor_output["input_ids"][0].detach().cpu().tolist(), + image_data=[dict(processor_output, format="processor_output")], +) +print("Response using processor output:") +print(out) +``` + +### Call with Precomputed Embeddings + +```python Example +from transformers import AutoProcessor +from transformers import Llama4ForConditionalGeneration + +processor = AutoProcessor.from_pretrained(model_path, use_fast=True) +model = Llama4ForConditionalGeneration.from_pretrained( + model_path, torch_dtype="auto" +).eval() + +vision = model.vision_model.cuda() +multi_modal_projector = model.multi_modal_projector.cuda() + +print(f'Image pixel values shape: {processor_output["pixel_values"].shape}') +input_ids = processor_output["input_ids"][0].detach().cpu().tolist() + +# Process image through vision encoder +image_outputs = vision( + processor_output["pixel_values"].to("cuda"), + 
aspect_ratio_ids=processor_output["aspect_ratio_ids"].to("cuda"), + aspect_ratio_mask=processor_output["aspect_ratio_mask"].to("cuda"), + output_hidden_states=False +) +image_features = image_outputs.last_hidden_state + +# Flatten image features and pass through multimodal projector +vision_flat = image_features.view(-1, image_features.size(-1)) +precomputed_embeddings = multi_modal_projector(vision_flat) + +# Build precomputed embedding data item +mm_item = dict( + processor_output, + format="precomputed_embedding", + feature=precomputed_embeddings +) + +# Use precomputed embeddings for efficient inference +out = llm.generate(input_ids=input_ids, image_data=[mm_item]) +print("Llama 4 precomputed embedding response:") +print(out["text"]) +``` diff --git a/docs_new/docs/basic_usage/deepseek_ocr.mdx b/docs_new/docs/basic_usage/deepseek_ocr.mdx new file mode 100644 index 000000000000..97b39e0a0b02 --- /dev/null +++ b/docs_new/docs/basic_usage/deepseek_ocr.mdx @@ -0,0 +1,58 @@ +--- +title: "DeepSeek OCR (OCR-1 / OCR-2)" +metatags: + description: "DeepSeek OCR models are multimodal (image + text) models for OCR and document understanding." +--- + +DeepSeek OCR models are multimodal (image + text) models for OCR and document understanding. + +## Launch server + +```shell +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-OCR-2 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +> You can replace `deepseek-ai/DeepSeek-OCR-2` with `deepseek-ai/DeepSeek-OCR`. + +## Prompt examples + +Recommended prompts from the model card: + +``` + +<|grounding|>Convert the document to markdown. +``` + +``` + +Free OCR. 
+``` + +## OpenAI-compatible request example + +```python +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "deepseek-ai/DeepSeek-OCR-2", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "\n<|grounding|>Convert the document to markdown."}, + {"type": "image_url", "image_url": {"url": "https://example.com/your_image.jpg"}}, + ], + } + ], + "max_tokens": 512, +} + +response = requests.post(url, json=data) +print(response.text) +``` diff --git a/docs_new/docs/basic_usage/deepseek_v3.mdx b/docs_new/docs/basic_usage/deepseek_v3.mdx new file mode 100644 index 000000000000..7d7c93d298f3 --- /dev/null +++ b/docs_new/docs/basic_usage/deepseek_v3.mdx @@ -0,0 +1,375 @@ +--- +title: "DeepSeek V3/V3.1/R1 Usage" +metatags: + description: "Deploy DeepSeek V3/R1 with SGLang: MLA optimization, FP8 quantization, multi-node TP, DP attention, MTP speculative decoding. Supports H200, B200, MI300X, A100." +--- +SGLang provides many optimizations specifically designed for the DeepSeek models, making it the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended) from Day 0. + +This document outlines current optimizations for DeepSeek. +For an overview of the implemented features see the completed [Roadmap](https://github.com/sgl-project/sglang/issues/2591). + +## Launch DeepSeek V3.1/V3/R1 with SGLang + +To run DeepSeek V3.1/V3/R1 models, the recommended settings are as follows: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Weight Type | Configuration |
| --- | --- |
| Full precision FP8 *(recommended)* | 8 x H200 |
| | 8 x B200 |
| | 8 x MI300X |
| | 2 x 8 x H100/800/20 |
| | Xeon 6980P CPU |
| Full precision (BF16) (upcast from original FP8) | 2 x 8 x H200 |
| | 2 x 8 x MI300X |
| | 4 x 8 x H100/800/20 |
| | 4 x 8 x A100/A800 |
| Quantized weights (INT8) | 16 x A100/800 |
| | 32 x L40S |
| | Xeon 6980P CPU |
| | 4 x Atlas 800I A3 |
| Quantized weights (W4A8) | 8 x H20/100, 4 x H200 |
| Quantized weights (AWQ) | 8 x H100/800/20 |
| | 8 x A100/A800 |
| Quantized weights (MXFP4) | 8, 4 x MI355X/350X |
| Quantized weights (NVFP4) | 8, 4 x B200 |
+ + + + +The official DeepSeek V3 is already in FP8 format, so you should not run it with any quantization arguments like `--quantization fp8`. + + +Detailed commands for reference: + +- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended) +- [4 x B200, 8 x B200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-one-b200-node) +- [8 x MI300X](../hardware-platforms/amd_gpu#running-deepseek-v3) +- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker) +- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes) +- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization) +- [16 x A100 (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization) +- [32 x L40S (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization) +- [Xeon 6980P CPU](../hardware-platforms/cpu_server#example-running-deepseek-r1) +- [4 x Atlas 800I A3 (INT8)](../hardware-platforms/ascend-npus/ascend_npu_deepseek_example#running-deepseek-with-pd-disaggregation-on-4-x-atlas-800i-a3) + +### Download Weights +If you encounter errors when starting the server, ensure the weights have finished downloading. We recommend downloading them beforehand, or restarting the server until all weights have been downloaded. Please refer to the official [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) guide to download the weights. + +### Launch with one node of 8 x H200 +Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#installation--launch).
+ +### Running examples on Multi-Node + +- [Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP](https://lmsys.org/blog/2025-06-16-gb200-part-1/) ([Part I](https://lmsys.org/blog/2025-06-16-gb200-part-1/), [Part II](https://lmsys.org/blog/2025-09-25-gb200-part-2/)) - Comprehensive guide on GB200 optimizations. + +- [Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs](https://lmsys.org/blog/2025-05-05-large-scale-ep/) - Guide on PD disaggregation and large-scale EP. + +- [Serving with two H20*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes). + +- [Best Practices for Serving DeepSeek-R1 on H20](https://lmsys.org/blog/2025-09-26-sglang-ant-group/) - Comprehensive guide on H20 optimizations, deployment and performance. + +- [Serving with two H200*8 nodes and docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker). + +- [Serving with four A100*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes). + +## Optimizations + +### Multi-head Latent Attention (MLA) Throughput Optimizations + +**Description**: [MLA](https://arxiv.org/pdf/2405.04434) is an innovative attention mechanism introduced by the DeepSeek team, aimed at improving inference efficiency. SGLang has implemented specific optimizations for this, including: + +- **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase. 
+ +- **MLA Attention Backends**: Currently SGLang supports different optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/attention.html#flashinfer-mla), [FlashMLA](https://github.com/deepseek-ai/FlashMLA), [CutlassMLA](https://github.com/sgl-project/sglang/pull/5390), **TRTLLM MLA** (optimized for Blackwell architecture), and [Triton](https://github.com/triton-lang/triton) backends. The default FA3 provides good performance across wide workloads. + +- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption. + +- **CUDA Graph & Torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and Torch.compile, which reduces latency and accelerates decoding speed for small batch sizes. + +- **Chunked Prefix Cache**: Chunked prefix cache optimization can increase throughput by cutting prefix cache into chunks, processing them with multi-head attention and merging their states. Its improvement can be significant when doing chunked prefill on long sequences. Currently this optimization is only available for FlashAttention3 backend. + +Overall, with these optimizations, we have achieved up to **7x** acceleration in output throughput compared to the previous version. + +

+[Figure: Multi-head Latent Attention for DeepSeek Series Models]

+ +**Usage**: MLA optimization is enabled by default. + +**Reference**: Check [Blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [Slides](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/lmsys_1st_meetup_deepseek_mla.pdf) for more details. + +### Data Parallelism Attention + +**Description**: This optimization involves data parallelism (DP) for the MLA attention mechanism of DeepSeek Series Models, which allows for a significant reduction in the KV cache size, enabling larger batch sizes. Each DP worker independently handles different types of batches (prefill, decode, idle), which are then synchronized before and after processing through the Mixture-of-Experts (MoE) layer. If you do not use DP attention, KV cache will be duplicated among all TP ranks. + +

+[Figure: Data Parallelism Attention for DeepSeek Series Models]

+ +With data parallelism attention enabled, we have achieved up to **1.9x** decoding throughput improvement compared to the previous version. + +

+[Figure: Data Parallelism Attention Performance Comparison]

+ +**Usage**: +- Append `--enable-dp-attention --tp 8 --dp 8` to the server arguments when using 8 H200 GPUs. This optimization improves peak throughput in high batch size scenarios where the server is limited by KV cache capacity. +- DP and TP attention can be flexibly combined. For example, to deploy DeepSeek-V3/R1 on 2 nodes with 8 H100 GPUs each, you can specify `--enable-dp-attention --tp 16 --dp 2`. This configuration runs attention with 2 DP groups, each containing 8 TP GPUs. + + +Data parallelism attention is not recommended for low-latency, small-batch use cases. It is optimized for high-throughput scenarios with large batch sizes. + + +**Reference**: Check [Blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models). + +### Multi-Node Tensor Parallelism + +**Description**: For users with limited memory on a single node, SGLang supports serving DeepSeek Series Models, including DeepSeek V3, across multiple nodes using tensor parallelism. This approach partitions the model parameters across multiple GPUs or nodes to handle models that are too large for one node's memory. + +**Usage**: Check [here](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker) for usage examples. + +### Block-wise FP8 + +**Description**: SGLang implements block-wise FP8 quantization with two key optimizations: + +- **Activation**: E4M3 format using per-token-per-128-channel sub-vector scales with online casting. + +- **Weight**: Per-128x128-block quantization for better numerical stability. + +- **DeepGEMM**: The [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) kernel library optimized for FP8 matrix multiplications. + +**Usage**: The activation and weight optimization above are turned on by default for DeepSeek V3 models. DeepGEMM is enabled by default on NVIDIA Hopper/Blackwell GPUs and disabled by default on other devices. 
DeepGEMM can also be manually turned off by setting the environment variable `SGLANG_ENABLE_JIT_DEEPGEMM=0`. + + +Before serving the DeepSeek model, precompile the DeepGEMM kernels to improve first-run performance. The precompilation process typically takes around 10 minutes to complete. + + +```bash Command +python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code +``` + +### Multi-token Prediction +**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](../advanced_features/speculative_decoding#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively on H200 TP8 setting. + +**Usage**: +Add `--speculative-algorithm EAGLE`. Other flags, like `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` are optional. For example: +```text Output +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3-0324 \ + --speculative-algorithm EAGLE \ + --trust-remote-code \ + --tp 8 +``` +- The default configuration for DeepSeek models is `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`. The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes. +- Most MLA attention backends fully support MTP usage. See [MLA Backends](../advanced_features/attention_backend#mla-backends) for details. 
+ + +To enable DeepSeek MTP for large batch sizes (>48), you need to adjust some parameters (Reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)): +- Adjust `--max-running-requests` to a larger number. The default value is `48` for MTP. For larger batch sizes, you should increase this value beyond the default value. +- Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The [default captured batch sizes for speculative decoding](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py#L888-L895) is 48. You can customize this by including more batch sizes. + + + +To enable the experimental overlap scheduler for EAGLE speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages. + + + +### Reasoning Content for DeepSeek R1 & V3.1 + +See [Reasoning Parser](../advanced_features/separate_reasoning) and [Thinking Parameter for DeepSeek V3.1](./openai_api_completions#Example:-DeepSeek-V3-Models). + + +### Function calling for DeepSeek Models + +Add arguments `--tool-call-parser deepseekv3` and `--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja`(recommended) to enable this feature. 
For example (running on 1 * H20 node): + +```text Output +python3 -m sglang.launch_server \ + --model deepseek-ai/DeepSeek-V3-0324 \ + --tp 8 \ + --port 30000 \ + --host 0.0.0.0 \ + --mem-fraction-static 0.9 \ + --tool-call-parser deepseekv3 \ + --chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja +``` + +Sample Request: + +``` +curl "http://127.0.0.1:30000/v1/chat/completions" \ +-H "Content-Type: application/json" \ +-d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324", "tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of a city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "How'\''s the weather like in Qingdao today"}]}' +``` + +Expected Response + +```text Output +{"id":"6501ef8e2d874006bf555bc80cddc7c5","object":"chat.completion","created":1745993638,"model":"deepseek-ai/DeepSeek-V3-0324","choices":[{"index":0,"message":{"role":"assistant","content":null,"reasoning_content":null,"tool_calls":[{"id":"0","index":null,"type":"function","function":{"name":"query_weather","arguments":"{\"city\": \"Qingdao\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","matched_stop":null}],"usage":{"prompt_tokens":116,"total_tokens":138,"completion_tokens":22,"prompt_tokens_details":null}} + +``` +Sample Streaming Request: +``` +curl "http://127.0.0.1:30000/v1/chat/completions" \ +-H "Content-Type: application/json" \ +-d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324","stream":true,"tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of a city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. 
Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "How'\''s the weather like in Qingdao today"}]}' +``` +Expected Streamed Chunks (simplified for clarity): +```text Output +data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"{\""}}]}}]} +data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"city"}}]}}]} +data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\":\""}}]}}]} +data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"Q"}}]}}]} +data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"ing"}}]}}]} +data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"dao"}}]}}]} +data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\"}"}}]}}]} +data: {"choices":[{"delta":{"tool_calls":null}}], "finish_reason": "tool_calls"} +data: [DONE] +``` +The client needs to concatenate all arguments fragments to reconstruct the complete tool call: +```text Output +{"city": "Qingdao"} +``` + + +1. Use a lower `"temperature"` value for better results. +2. To receive more consistent tool call results, it is recommended to use `--chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja`. It provides an improved unified prompt. + + + +### Thinking Budget for DeepSeek R1 + +In SGLang, we can implement thinking budget with `CustomLogitProcessor`. + +Launch a server with `--enable-custom-logit-processor` flag on. 
+ +```text Output +python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --port 30000 --host 0.0.0.0 --mem-fraction-static 0.9 --disable-cuda-graph --reasoning-parser deepseek-r1 --enable-custom-logit-processor +``` + +Sample Request: + +```python Sample Request +import openai +from rich.pretty import pprint +from sglang.srt.sampling.custom_logit_processor import DeepSeekR1ThinkingBudgetLogitProcessor + + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*") +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-R1", + messages=[ + { + "role": "user", + "content": "Question: Is Paris the Capital of France?", + } + ], + max_tokens=1024, + extra_body={ + "custom_logit_processor": DeepSeekR1ThinkingBudgetLogitProcessor().to_str(), + "custom_params": { + "thinking_budget": 512, + }, + }, +) +pprint(response) +``` + +## FAQ + +**Q: Model loading is taking too long, and I'm encountering an NCCL timeout. What should I do?** + +A: If you're experiencing extended model loading times and an NCCL timeout, you can try increasing the timeout duration. Add the argument `--dist-timeout 3600` when launching your model. This will set the timeout to one hour, which often resolves the issue. diff --git a/docs_new/docs/basic_usage/deepseek_v32.mdx b/docs_new/docs/basic_usage/deepseek_v32.mdx new file mode 100644 index 000000000000..1077a9956f0e --- /dev/null +++ b/docs_new/docs/basic_usage/deepseek_v32.mdx @@ -0,0 +1,601 @@ +--- +title: "DeepSeek V3.2/GLM-5 Usage" +metatags: + description: "Deploy DeepSeek V3.2/GLM-5 with SGLang: DeepSeek Sparse Attention (DSA), long-context optimization, MTP speculative decoding, function calling. Supports H200, B200, MI300X, MI350." +--- +DeepSeek-V3.2 model family equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. 
With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves efficiency improvements in long-context scenarios. + + +Note: This document is originally written for the usage of [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model. The usage of [DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) or [DeepSeek-V3.2-Speciale](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale) is the same as DeepSeek-V3.2-Exp except for the tool call parser. [GLM-5](https://huggingface.co/zai-org/GLM-5) model also applies DSA (DeepSeek Sparse Attention) structure, so it can share most of the usage here, except for the reasoning parser and tool call parser. + + +## Installation + +### Docker + +```bash Command +# H200/B200 +docker pull lmsysorg/sglang:latest + +# MI350/MI355 +docker pull lmsysorg/sglang:v0.5.8-rocm700-mi35x + +# MI300 +# v0.5.8-rocm700-mi30x does not include PR #17504. Prefer the newest MI30x ROCm +# image tag from Docker Hub when available, or build from source (below). 
+docker pull lmsysorg/sglang:v0.5.8-rocm700-mi30x + + +# NPUs +docker pull lmsysorg/sglang:dsv32-a2 +docker pull lmsysorg/sglang:dsv32-a3 +``` + +### Build From Source + +```bash Command +# Install SGLang +git clone https://github.com/sgl-project/sglang +cd sglang +pip3 install pip --upgrade +pip3 install -e "python" +``` + +## Launch DeepSeek V3.2/GLM-5 with SGLang + +To serve [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) on 8xH200/B200 GPUs: + +```bash Command +# Launch with TP + DP (Recommended) +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention + +# Launch with EP + DP +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 8 --enable-dp-attention + +# Launch with Pure TP +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 + +# Launch with TP on MI30x/MI35x +python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --nsa-prefill-backend tilelang --nsa-decode-backend tilelang +``` + +To serve GLM-5, just replace the `--model` argument with `zai-org/GLM-5-FP8`. + +### Configuration Tips +- **DP Attention**: To enable [DP Attention](../advanced_features/dp_dpa_smg_guide), please include `--enable-dp-attention --dp ` in command. DP Attention is better for large concurrency scenarios. +- **TP Attention**: Launching with TP attention is also supported. TP attention is better for low latency scenarios. +- **Short-sequence MHA prefill (adaptive)**: For short prefill sequences (default threshold: **2048 tokens**), the NSA backend uses standard MHA automatically (no extra flags). On H200 (SM90) this path uses the FlashAttention variable-length kernel; on B200 (SM100) it uses TRT-LLM ragged MHA. 
MHA uses `MHA_ONE_SHOT` for best performance, which computes multi-head attention over all tokens (both cached prefix and newly extended tokens) in a single kernel invocation, avoiding the overhead of chunked KV cache processing. This achieves optimal throughput for short sequences where total sequence length fits within the chunk capacity limit. +- **MHA prefill threshold relaxation**: To apply MHA attention to requests longer than 2048 tokens, please set the flag `SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD` to a value larger than 2048. As threshold grows larger, the prefill performance can be improved, but at the cost of potential accuracy drop. +- **Choices of Attention Kernels**: The attention backend is automatically set to `nsa` attention backend for DeepSeek V3.2 model. In this backend, different kernels for sparse prefilling/decoding are implemented, which can be specified by `--nsa-prefill-backend` and `--nsa-decode-backend` server arguments. The choices of nsa prefill/decode attention kernels include: + - `flashmla_sparse`: `flash_mla_sparse_fwd` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs. It requires bf16 q, kv inputs. + - `flashmla_kv`: `flash_mla_with_kvcache` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs. It requires bf16 q, fp8 k_cache inputs. + - `flashmla_auto`: enables automatic selection of either `flashmla_sparse` or `flashmla_kv` kernel for prefill based on KV cache dtype, hardware, and heuristics. With BF16 KV cache, `flashmla_sparse` is always used on both Hopper and Blackwell. With FP8 KV cache: On Hopper (SM90), it unconditionally uses `flashmla_kv`; On Blackwell (SM100), it uses `flashmla_sparse` when `total_kv_tokens < total_q_tokens * 512`, otherwise falls back to `flashmla_kv`. The heuristics may need to be tuned if the performance of either kernel changes significantly. + - `fa3`: `flash_attn_with_kvcache` kernel from `flash_attn` library. Can only run on Hopper GPUs. 
It requires bf16 q, kv inputs. + - `tilelang`: `tilelang` implementation that can run on GPU, HPU and NPU. + - `aiter`: Aiter kernel on AMD HPUs. Can only be used as a decode kernel. + - `trtllm`: `trtllm-mla` sparse kernel from the flashinfer library. Only runs on Blackwell GPUs. It requires q, k, v to be uniformly in bf16 or fp8_e4m3 format. + - Based on performance benchmarks, the default DSA kernels on Hopper and Blackwell are set as follows: + - Bfloat16 KV cache: On Hopper, `flashmla_sparse` prefill attention, `fa3` decode attention; On Blackwell, `flashmla_sparse` prefill attention, `trtllm` decode attention. + - Float8_e4m3fn KV cache: On Hopper, `flashmla_kv` prefill attention, `flashmla_kv` decode attention; On Blackwell, `trtllm` prefill attention and `trtllm` decode attention. +- **Index Cache**: Introduced in [this paper](https://arxiv.org/abs/2603.12201), IndexCache improves speed by reusing the indexer result across different layers, at the cost of a negligible accuracy loss. For the **GLM-5** model, we recommend appending `--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'` to the launch command for a better tradeoff between speedup and performance. + +## Multi-token Prediction +SGLang implements Multi-Token Prediction (MTP) for DeepSeek V3.2 based on [EAGLE speculative decoding](../advanced_features/speculative_decoding#EAGLE-Decoding). With this optimization, the decoding speed can be improved significantly on small batch sizes. See [this PR](https://github.com/sgl-project/sglang/pull/11652) for more information.
+ +Example usage with DP Attention: +```bash Command +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 +``` + +Example usage with Pure TP: +```bash Command +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 +``` + +- The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes. +- The default value of `--max-running-requests` is set to `48` for MTP. For larger batch sizes, this value should be increased beyond the default value. + + +To enable overlap scheduler for EAGLE speculative decoding, we recommend setting the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages. + + + +## Function Calling and Reasoning Parser +The usage of function calling and reasoning parser is the same as DeepSeek V3.1. Please refer to [Reasoning Parser](../advanced_features/separate_reasoning) and [Tool Parser](../advanced_features/tool_parser) documents. + +To launch `DeepSeek-V3.2-Exp` with function calling and reasoning parser: +> Note: It is recommended to specify the chat-template, ensuring that you are within the sglang's root directory. 
+```bash Command +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --trust-remote-code \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --tool-call-parser deepseekv31 \ + --reasoning-parser deepseek-v3 \ + --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja +``` + +To launch `DeepSeek-V3.2` with function calling and reasoning parser: +```bash Command +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2 \ + --trust-remote-code \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --tool-call-parser deepseekv32 \ + --reasoning-parser deepseek-v3 +``` + +`DeepSeek-V3.2-Speciale` does not support tool calling, so it can only be launched with the reasoning parser: +```bash Command +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2-Speciale \ + --trust-remote-code \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --reasoning-parser deepseek-v3 +``` + +To launch `GLM-5` with function calling and reasoning parser: +```bash Command +python -m sglang.launch_server \ + --model zai-org/GLM-5-FP8 \ + --tp-size 8 --dp-size 8 --enable-dp-attention \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 +``` + +## NVFP4 Checkpoint + +To launch the DeepSeek V3.2 [NVFP4 checkpoint](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4) on Blackwell devices, specify the quantization method as `modelopt_fp4` and the MoE runner backend as one of `flashinfer_trtllm` (recommended), `flashinfer_cutlass` and `flashinfer_cutedsl`. Any other usage (parallelism, reasoning parser, ...) is the same as for the FP8 checkpoint.
+ +An example launch command: +```bash Command +python -m sglang.launch_server --model nvidia/DeepSeek-V3.2-NVFP4 --tp 4 --quantization modelopt_fp4 --moe-runner-backend flashinfer_trtllm --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 +``` + +## PD Disaggregation + +Prefill command: +```bash Command +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --disaggregation-mode prefill \ + --host $LOCAL_IP \ + --port $PORT \ + --tp 8 \ + --dp 8 \ + --enable-dp-attention \ + --dist-init-addr ${HOST}:${DIST_PORT} \ + --trust-remote-code \ + --disaggregation-bootstrap-port 8998 \ + --mem-fraction-static 0.9 +``` + +Decode command: +```bash Command +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --disaggregation-mode decode \ + --host $LOCAL_IP \ + --port $PORT \ + --tp 8 \ + --dp 8 \ + --enable-dp-attention \ + --dist-init-addr ${HOST}:${DIST_PORT} \ + --trust-remote-code \ + --mem-fraction-static 0.9 +``` + +Router command: +```bash Command +python -m sglang_router.launch_router --pd-disaggregation \ + --prefill $PREFILL_ADDR 8998 \ + --decode $DECODE_ADDR \ + --host 127.0.0.1 \ + --port 8000 +``` + +For more advanced or production-ready deployment methods, such as RBG- or LWS-based deployment, please refer to [references/multi_node_deployment/rbg_pd/deepseekv32_pd.md](../references/multi_node_deployment/rbg_pd/deepseekv32_pd). That document also provides startup commands for DeepEP-based expert parallelism.
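Once the prefill, decode and router processes are running, a client only needs to talk to the router. The sketch below builds such a request with the Python standard library. It assumes the router forwards the OpenAI-compatible `/v1/chat/completions` route and listens on the `127.0.0.1:8000` address used in the router command; `build_chat_request` is a hypothetical helper for illustration, not part of SGLang.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completion endpoint.

    Assumption: the router exposes /v1/chat/completions like a single server does.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The router address matches the --host/--port values of the router command.
req = build_chat_request(
    "http://127.0.0.1:8000", "deepseek-ai/DeepSeek-V3.2-Exp", "Hello"
)
print(req.full_url)

# To actually send it once all three processes are up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The request is only constructed here, not sent, so the snippet can be run without a live deployment.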
+ + +## Benchmarking Results + +### Accuracy Test with `gsm8k` +A simple accuracy benchmark can be run with the `gsm8k` dataset: +```bash Command +python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 +``` + +The result is 0.956, which matches our expectation: +```text Output +Accuracy: 0.956 +Invalid: 0.000 +Latency: 25.109 s +Output throughput: 5226.235 token/s +``` + +To test long-context accuracy, run gsm8k with `--num-shots 20`. The results are very close to the 8-shot results: +```text Output +Accuracy: 0.956 +Invalid: 0.000 +Latency: 29.545 s +Output throughput: 4418.617 token/s +``` + + +### Accuracy Test with `gpqa-diamond` + +Long-context accuracy can be tested on the GPQA-diamond dataset with long output tokens and thinking enabled: +```bash Command +python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --thinking-mode deepseek-v3 +``` + +The mean accuracy over 8 runs is 0.797, close to the 0.799 reported in the official tech report. +```text Output +Repeat: 8, mean: 0.797 +Scores: ['0.808', '0.798', '0.808', '0.798', '0.783', '0.788', '0.803', '0.793'] +``` + +For DeepSeek V3.2, DeepSeek recommends setting the sampling parameters to temperature = 1.0, top_p = 0.95: + +```bash Command +python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3 + +Repeat: 8, mean: 0.840 +Scores: ['0.848', '0.808', '0.848', '0.838', '0.879', '0.813', '0.838', '0.848'] +``` +which is consistent with (slightly above) the official score of 0.824 reported in the [DeepSeek-V3.2 technical report](https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf).
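The two reported GPQA means can be re-derived from the per-run scores listed in the outputs above. This is only an arithmetic check; the variable and function names are illustrative and not part of `sglang.test.run_eval`.

```python
from statistics import mean

# Per-run GPQA-diamond scores copied from the two evaluation outputs.
default_sampling = [0.808, 0.798, 0.808, 0.798, 0.783, 0.788, 0.803, 0.793]
recommended_sampling = [0.848, 0.808, 0.848, 0.838, 0.879, 0.813, 0.838, 0.848]


def mean_score(scores):
    """Average the per-run accuracies, rounded to three decimals like the eval output."""
    return round(mean(scores), 3)


print(mean_score(default_sampling))      # 0.797
print(mean_score(recommended_sampling))  # 0.84
```

With the recommended sampling parameters the mean is about 1.6 points above the official 0.824, while the default-sampling run is within 0.2 points of 0.799.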
+ +### Accuracy Test with `aime 2025` + +Prepare the environment by installing NeMo-Skills in the docker or your own virtual environment: + + ``` + pip install git+https://github.com/NVIDIA/NeMo-Skills.git --ignore-installed blinker + ``` + +Then launch the SGLang server: +```text Output +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention +``` + +**For `DeepSeek-V3.2` and `DeepSeek-V3.2-Speciale`**: + +```text Output +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp-size 8 --dp-size 8 --enable-dp-attention --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 +``` + +Run the following script to evaluate AIME 2025: +```text Output +#! /bin/bash +export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1 + +ns prepare_data aime25 + +PORT=30000 +BACKEND=sglang +MODEL="deepseek-ai/DeepSeek-V3.2-Exp" # Should be changed to the model name +MODEL_NAME="dsv32-fp8" + +echo "Starting AIME25 evaluation with model $MODEL on port $PORT using backend $BACKEND..." +ns eval \ + --benchmarks=aime25:4 \ + --server_type=$BACKEND \ + --model=$MODEL \ + --server_address=http://localhost:${PORT}/v1 \ + --output_dir=nemo_skills_aime25_${MODEL_NAME}_output_${BACKEND}_$(date +%Y%m%d_%H%M%S) \ + ++chat_template_kwargs.thinking=true \ + ++inference.temperature=1.0 \ + ++inference.top_p=0.95 \ + ++inference.tokens_to_generate=64000 + # ++inference.tokens_to_generate=120000 for Speciale model +``` + +Test results (8*B200): + +DeepSeek-V3.2-Exp: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
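+| evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
+| --- | --- | --- | --- | --- | --- |
+| pass@1[avg-of-4] | 30 | 15040 | 1673 | 87.50% ± 1.67% | 0.00% |
+| majority@4 | 30 | 15040 | 1673 | 90.00% | 0.00% |
+| pass@4 | 30 | 15040 | 1673 | 90.00% | 0.00% |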
+ + +DeepSeek-V3.2: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
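+| evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
+| --- | --- | --- | --- | --- | --- |
+| pass@1[avg-of-4] | 30 | 13550 | 1632 | 92.50% ± 1.67% | 0.00% |
+| majority@4 | 30 | 13550 | 1632 | 94.71% | 0.00% |
+| pass@4 | 30 | 13550 | 1632 | 96.67% | 0.00% |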
+ + +DeepSeek-V3.2-Speciale: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
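+| evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
+| --- | --- | --- | --- | --- | --- |
+| pass@1[avg-of-4] | 30 | 24155 | 3583 | 95.00% ± 1.92% | 0.00% |
+| majority@4 | 30 | 24155 | 3583 | 95.83% | 0.00% |
+| pass@4 | 30 | 24155 | 3583 | 100.00% | 0.00% |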
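To read the three `evaluation_mode` rows, it helps to see how such metrics are commonly defined. The sketch below uses the usual definitions (mean per-sample accuracy, majority vote, and "any sample correct") on a made-up toy dataset. These definitions are an assumption; NeMo-Skills may compute them differently, for example in how it breaks ties in the majority vote. The `evaluate` function and the toy data are hypothetical.

```python
from collections import Counter


def evaluate(problems):
    """problems: list of (reference_answer, [k sampled answers]).

    Returns (pass@1 averaged over the k samples, majority@k, pass@k) as fractions.
    Ties in the majority vote go to the answer seen first (a simplification).
    """
    n = len(problems)
    k = len(problems[0][1])
    pass1 = sum(sum(a == ref for a in answers) for ref, answers in problems) / (n * k)
    majority = sum(
        Counter(answers).most_common(1)[0][0] == ref for ref, answers in problems
    ) / n
    pass_k = sum(any(a == ref for a in answers) for ref, answers in problems) / n
    return pass1, majority, pass_k


# Toy data, invented for illustration only (k = 4 samples per problem).
toy = [
    ("42", ["42", "42", "42", "41"]),  # 3/4 correct, majority correct
    ("7", ["7", "8", "8", "8"]),       # 1/4 correct, majority wrong, still passes @4
    ("10", ["10", "10", "10", "10"]),  # all correct
    ("3", ["1", "2", "4", "5"]),       # never correct
]
print(evaluate(toy))  # (0.5, 0.5, 0.75)
```

Under these definitions pass@4 is always at least as high as pass@1, which is the ordering seen in all three result tables.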
+ + +## DSA Long-Sequence Context Parallel Optimization (Experimental) + +**Note: This feature is only verified on Hopper machines** + +For context parallel in the DeepSeek V3.2 model, we provide two different modes of splitting tokens, which can be controlled with the argument `--nsa-prefill-cp-mode`. + +### In sequence splitting + +The first mode can be enabled by `--nsa-prefill-cp-mode in-seq-split`. This mode implements context parallel for DSA by splitting the sequence uniformly between context parallel ranks. At the attention stage, each CP rank computes the indexer results for its shard of the sequence and collects the whole KV cache through an all-gather operation. Use `attn_cp_size` to set the size of the communication group for context parallel. + +Note that the in-sequence splitting mode has the following restrictions: +- The batch size is restricted to 1 for prefill batches +- `moe_dense_tp_size=1`, `moe_a2a_backend = "deepep"` +- To ensure `cp_size > 1`, the passed-in `tp_size` must be larger than `dp_size` + +For more details, please refer to PR https://github.com/sgl-project/sglang/pull/12065. + +Example: +```bash Command +# In-seq splitting mode launched with EP + DP +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --attn-cp-size 4 --nsa-prefill-cp-mode in-seq-split --max-running-requests 32 +``` + +### Round robin splitting (default setting) + +This mode can be enabled by specifying the parameter `--nsa-prefill-cp-mode round-robin-split`, which distributes tokens across ranks based on `token_idx % cp_size`. + +Compared to the in-sequence splitting mode, this mode additionally supports the fused MoE backend (which may deliver better performance than DeepEP in single-machine scenarios), FP8 KV cache, and multi-batch prefill inference. However, it cannot be enabled together with DP attention.
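The difference between the two splitting modes can be illustrated with a small sketch. `round_robin_split` follows the `token_idx % cp_size` rule stated above. `in_seq_split` assumes contiguous, near-equal chunks as one reading of "splitting the sequence uniformly"; the actual SGLang layout may differ. Both function names are hypothetical and neither is SGLang code.

```python
def in_seq_split(num_tokens: int, cp_size: int) -> list[list[int]]:
    """Give each CP rank one contiguous, near-equal chunk of token indices (assumed layout)."""
    base, extra = divmod(num_tokens, cp_size)
    shards, start = [], 0
    for rank in range(cp_size):
        length = base + (1 if rank < extra else 0)
        shards.append(list(range(start, start + length)))
        start += length
    return shards


def round_robin_split(num_tokens: int, cp_size: int) -> list[list[int]]:
    """Assign token i to rank i % cp_size."""
    return [list(range(rank, num_tokens, cp_size)) for rank in range(cp_size)]


print(in_seq_split(8, 4))       # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(round_robin_split(8, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In the round-robin layout every rank holds both early and late positions of the sequence, whereas contiguous chunks place all late positions on the last ranks.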
+ +For more details, please refer to PR https://github.com/sgl-project/sglang/pull/13959. + +Example usage: +```bash Command +# Launch with FusedMoe + CP8 +python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --enable-nsa-prefill-context-parallel --attn-cp-size 8 --nsa-prefill-cp-mode round-robin-split --max-running-requests 32 +``` +### Pipeline Parallel + Context Parallel (PP + CP) + +This mode combines Pipeline Parallelism (PP) and Context Parallelism (CP) to scale across multiple nodes, which can achieve better throughput and Time To First Token (TTFT). Note that this method has only been tested on H20 96G. + +#### Standard Usage + +To launch with PP=2 and CP (via `round-robin-split` mode) on 2 nodes. This configuration uses the fused MoE kernel by default, which generally provides better performance. + +For related development details, please refer to: +- Fused MoE + CP support: [PR #13959](https://github.com/sgl-project/sglang/pull/13959) +- PP + CP support: [Issue #15358](https://github.com/sgl-project/sglang/issues/15358) and [PR #16380](https://github.com/sgl-project/sglang/pull/16380) + +Node 0: +```bash Command +export SGLANG_PP_LAYER_PARTITION=30,31 +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --nnodes 2 --node-rank 0 \ + --dist-init-addr :62001 \ + --tp 8 --pp-size 2 \ + --dp-size 1 --moe-dense-tp-size 1 \ + --enable-nsa-prefill-context-parallel \ + --attn-cp-size 8 \ + --nsa-prefill-cp-mode round-robin-split \ + --trust-remote-code \ + --disable-radix-cache \ + --mem-fraction-static 0.8 \ + --max-running-requests 128 \ + --chunked-prefill-size 16384 \ + --cuda-graph-max-bs 8 \ + --page-size 64 \ + --watchdog-timeout 3600 \ + --host 0.0.0.0 --port 8000 \ + --tool-call-parser deepseekv32 +``` + +Node 1: +```bash Command +export SGLANG_PP_LAYER_PARTITION=30,31 +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --nnodes 2 --node-rank 1 \ + --dist-init-addr 
:62001 \ + --tp 8 --pp-size 2 \ + --dp-size 1 --moe-dense-tp-size 1 \ + --enable-nsa-prefill-context-parallel \ + --attn-cp-size 8 \ + --nsa-prefill-cp-mode round-robin-split \ + --trust-remote-code \ + --disable-radix-cache \ + --mem-fraction-static 0.8 \ + --max-running-requests 128 \ + --chunked-prefill-size 16384 \ + --cuda-graph-max-bs 8 \ + --page-size 64 \ + --watchdog-timeout 3600 \ + --host 0.0.0.0 --port 8000 \ + --tool-call-parser deepseekv32 +``` + +#### PD Disaggregation with PP + CP + +If using PD (Prefill-Decode) Disaggregation, the Prefill nodes can be configured with PP + CP as follows. + +Prefill Node 0: +```bash Command +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --served-model-name deepseek-v32 \ + --nnodes 2 --node-rank 0 \ + --dist-init-addr :20102 \ + --tp 8 --pp-size 2 \ + --dp-size 1 --moe-dense-tp-size 1 \ + --enable-nsa-prefill-context-parallel \ + --attn-cp-size 8 \ + --nsa-prefill-cp-mode round-robin-split \ + --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \ + --trust-remote-code \ + --disable-radix-cache \ + --max-running-requests 512 \ + --chunked-prefill-size 4096 \ + --context-length 131072 \ + --mem-fraction-static 0.9 \ + --page-size 64 \ + --enable-metrics \ + --collect-tokens-histogram \ + --tokenizer-worker-num 8 \ + --host 0.0.0.0 --port 30000 +``` + +Prefill Node 1: +```bash Command +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3.2-Exp \ + --served-model-name deepseek-v32-prefill \ + --nnodes 2 --node-rank 1 \ + --dist-init-addr :20102 \ + --tp 8 --pp-size 2 \ + --dp-size 1 --moe-dense-tp-size 1 \ + --enable-nsa-prefill-context-parallel \ + --attn-cp-size 8 \ + --nsa-prefill-cp-mode round-robin-split \ + --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \ + --trust-remote-code \ + --disable-radix-cache \ + --max-running-requests 512 \ + --chunked-prefill-size 4096 \ + --context-length 131072 \ + 
--mem-fraction-static 0.9 \ + --page-size 64 \ + --enable-metrics \ + --collect-tokens-histogram \ + --tokenizer-worker-num 8 \ + --host 0.0.0.0 --port 30000 +``` + +For the Decode nodes, it is recommended to use the **EP mode**. + +## HiSparse: Hierarchical Sparse Attention for DSA (experimental) + +HiSparse reduces per-request GPU memory during decode by keeping only a small "hot" KV buffer on GPU while storing complete KV data in CPU pinned memory. A CUDA kernel dynamically swaps in the top-k most relevant KV entries from host memory on each decode step. This enables significantly higher decode concurrency for long-context DSA models. + +HiSparse currently requires PD disaggregation mode and is enabled on the decode instance only. For detailed design, configuration, and deployment instructions, see the [HiSparse Guide](../advanced_features/hisparse_guide). diff --git a/docs_new/docs/basic_usage/glm45.mdx b/docs_new/docs/basic_usage/glm45.mdx new file mode 100644 index 000000000000..210c857568d3 --- /dev/null +++ b/docs_new/docs/basic_usage/glm45.mdx @@ -0,0 +1,75 @@ +--- +title: "Launch GLM-4.5 / GLM-4.6 / GLM-4.7 with SGLang" +metatags: + description: "Deploy GLM-4.5/4.6/4.7 models with SGLang: FP8 inference, EAGLE speculative decoding, function calling support. Optimized for H100/H200 GPUs." +--- +## Launch GLM-4.5 / GLM-4.6 / GLM-4.7 with SGLang + +To serve GLM-4.5 / GLM-4.6 FP8 models on 8xH100/H200 GPUs: + +```bash Command +python3 -m sglang.launch_server --model zai-org/GLM-4.6-FP8 --tp 8 +``` + +### EAGLE Speculative Decoding + +**Description**: SGLang has supported GLM-4.5 / GLM-4.6 models +with [EAGLE speculative decoding](../advanced_features/speculative_decoding#EAGLE-Decoding). + +**Usage**: +Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and +`--speculative-num-draft-tokens` to enable this feature. 
For example: + +```bash Command +python3 -m sglang.launch_server \ + --model-path zai-org/GLM-4.6-FP8 \ + --tp-size 8 \ + --tool-call-parser glm45 \ + --reasoning-parser glm45 \ + --speculative-algorithm EAGLE \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --mem-fraction-static 0.9 \ + --served-model-name glm-4.6-fp8 \ + --enable-custom-logit-processor +``` + + +To enable the experimental overlap scheduler for EAGLE speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages. + + +### Thinking Budget for GLM-4.5 / GLM-4.6 +**Note**: For GLM-4.7, `--tool-call-parser` should be set to `glm47`, for GLM-4.5 and GLM-4.6, it should be set to `glm45`. + +In SGLang, we can implement thinking budget with `CustomLogitProcessor`. + +Launch a server with `--enable-custom-logit-processor` flag on. + +Sample Request: + +```python Example +import openai +from rich.pretty import pprint +from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor + + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*") +response = client.chat.completions.create( + model="zai-org/GLM-4.6", + messages=[ + { + "role": "user", + "content": "Question: Is Paris the Capital of France?", + } + ], + max_tokens=1024, + extra_body={ + "custom_logit_processor": Glm4MoeThinkingBudgetLogitProcessor().to_str(), + "custom_params": { + "thinking_budget": 512, + }, + }, +) +pprint(response) +``` diff --git a/docs_new/docs/basic_usage/glmv.mdx b/docs_new/docs/basic_usage/glmv.mdx new file mode 100644 index 000000000000..088ad0b924a0 --- /dev/null +++ b/docs_new/docs/basic_usage/glmv.mdx @@ -0,0 +1,139 @@ +--- +title: "GLM-4.6V / GLM-4.5V Usage" +metatags: + description: "Deploy GLM-4.6V/4.5V vision models with SGLang: FP8 and BF16 modes, expert parallelism, video understanding. 
Supports H100, H200, A100 GPUs." +--- +## Launch commands for SGLang + +Below are suggested launch commands tailored for different hardware / precision modes + +### FP8 (quantised) mode + +For high memory-efficiency and latency optimized deployments (e.g., on H100, H200) where FP8 checkpoint is supported: + +```bash Command +python3 -m sglang.launch_server \ + --model-path zai-org/GLM-4.6V-FP8 \ + --tp 2 \ + --ep 2 \ + --host 0.0.0.0 \ + --port 30000 \ + --keep-mm-feature-on-device +``` + +### Non-FP8 (BF16 / full precision) mode +For deployments on A100/H100 where BF16 is used (or FP8 snapshot not used): +```bash Command +python3 -m sglang.launch_server \ + --model-path zai-org/GLM-4.6V \ + --tp 4 \ + --ep 4 \ + --host 0.0.0.0 \ + --port 30000 +``` + +## Hardware-specific notes / recommendations + +- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency. +- On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference. +- On H200 & B200: The model can be run “out of the box”, supporting full context length plus concurrent image + video processing. 
+ +## Sending Image/Video Requests + +### Image input: + +```python Example +import requests + +url = f"http://localhost:30000/v1/chat/completions" + +data = { + "model": "zai-org/GLM-4.6V", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "What’s in this image?"}, + { + "type": "image_url", + "image_url": { + "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true" + }, + }, + ], + } + ], + "max_tokens": 300, +} + +response = requests.post(url, json=data) +print(response.text) +``` + +### Video Input: + +```python Example +import requests + +url = f"http://localhost:30000/v1/chat/completions" + +data = { + "model": "zai-org/GLM-4.6V", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "What’s happening in this video?"}, + { + "type": "video_url", + "video_url": { + "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4" + }, + }, + ], + } + ], + "max_tokens": 300, +} + +response = requests.post(url, json=data) +print(response.text) +``` + +## Important Server Parameters and Flags + +When launching the model server for **multimodal support**, you can use the following command-line arguments to fine-tune performance and behavior: + +- `--mm-attention-backend`: Specify multimodal attention backend. Eg. `fa3`(Flash Attention 3) +- `--mm-max-concurrent-calls `: Specifies the **maximum number of concurrent asynchronous multimodal data processing calls** allowed on the server. Use this to control parallel throughput and GPU memory usage during image/video inference. +- `--mm-per-request-timeout `: Defines the **timeout duration (in seconds)** for each multimodal request. If a request exceeds this time limit (e.g., for very large video inputs), it will be automatically terminated. +- `--keep-mm-feature-on-device`: Instructs the server to **retain multimodal feature tensors on the GPU** after processing. 
This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads. +- `--mm-enable-dp-encoder`: Placing the ViT in data parallel while keeping the LLM in tensor parallel consistently lowers TTFT and boosts end-to-end throughput. +- `SGLANG_USE_CUDA_IPC_TRANSPORT=1`: Shared memory pool based CUDA IPC for multi-modal data transport. For significantly improving e2e latency. + +### Example usage with the above optimizations: +```bash Command +SGLANG_USE_CUDA_IPC_TRANSPORT=1 \ +SGLANG_VLM_CACHE_SIZE_MB=0 \ +python -m sglang.launch_server \ + --model-path zai-org/GLM-4.6V \ + --host 0.0.0.0 \ + --port 30000 \ + --trust-remote-code \ + --tp-size 8 \ + --enable-cache-report \ + --log-level info \ + --max-running-requests 64 \ + --mem-fraction-static 0.65 \ + --chunked-prefill-size 8192 \ + --attention-backend fa3 \ + --mm-attention-backend fa3 \ + --mm-enable-dp-encoder \ + --enable-metrics +``` + +### Thinking Budget for GLM-4.5V / GLM-4.6V + +In SGLang, we can implement thinking budget with `CustomLogitProcessor`. + +Launch a server with the `--enable-custom-logit-processor` flag. Then, use `Glm4MoeThinkingBudgetLogitProcessor` in the request, similar to the `GLM-4.6` example in [glm45.md](./glm45). diff --git a/docs_new/docs/basic_usage/gpt_oss.mdx b/docs_new/docs/basic_usage/gpt_oss.mdx new file mode 100644 index 000000000000..25c656e9182b --- /dev/null +++ b/docs_new/docs/basic_usage/gpt_oss.mdx @@ -0,0 +1,181 @@ +--- +title: "GPT OSS Usage" +metatags: + description: "Deploy GPT-OSS with SGLang: OpenAI Responses API compatible, built-in tools for web search and Python execution, reasoning levels, MCP tool server support." +--- +Please refer to [#8833](https://github.com/sgl-project/sglang/issues/8833). + +## Responses API & Built-in Tools + +### Responses API + +GPT‑OSS is compatible with the OpenAI Responses API. 
Use `client.responses.create(...)` with `model`, `instructions`, `input`, and optional `tools` to enable built‑in tool use. You can set the reasoning level via `instructions`, e.g., "Reasoning: high". Three levels are supported: low (fast), medium (balanced), and high (deep).
+
+### Built-in Tools
+
+GPT‑OSS can call built‑in tools for web search and Python execution. You can use the demo tool server or connect to external MCP tool servers.
+
+#### Python Tool
+
+- Executes short Python snippets for calculations, parsing, and quick scripts.
+- Runs in a Docker-based sandbox by default. To run on the host, set `PYTHON_EXECUTION_BACKEND=UV` (this executes model-generated code locally; use with care).
+- Ensure Docker is available if you are not using the UV backend. It is recommended to run `docker pull python:3.11` in advance.
+
+#### Web Search Tool
+
+- Uses the Exa backend for web search.
+- Requires an Exa API key; set `EXA_API_KEY` in your environment. Create a key at `https://exa.ai`.
+
+### Tool & Reasoning Parser
+
+- SGLang supports the OpenAI reasoning and tool call parsers, as well as the SGLang native API for tool calls and reasoning. Refer to [reasoning parser](../advanced_features/separate_reasoning) and [tool call parser](../advanced_features/tool_parser) for more details.
+
+
+## Notes
+
+- Use **Python 3.12** for the demo tools, and install the required `gpt-oss` packages.
+- The default demo integrates the web search tool (Exa backend) and a demo Python interpreter via Docker.
+- For search, set `EXA_API_KEY`. For Python execution, either have Docker available or set `PYTHON_EXECUTION_BACKEND=UV`.
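The notes above can be turned into a small preflight check before launching the demo tool server. This is only a sketch: `check_demo_tool_env` is a hypothetical helper, not part of SGLang or `gpt-oss`, and it assumes the backend variable is compared against the exact string `UV` and that a `docker` binary on `PATH` is enough to count as "Docker available".

```python
import os
import shutil
from typing import List, Mapping, Optional


def check_demo_tool_env(
    env: Optional[Mapping[str, str]] = None,
    docker_available: Optional[bool] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the notes above are satisfied."""
    env = os.environ if env is None else env
    if docker_available is None:
        # Assumption: finding a `docker` executable on PATH is a good enough proxy.
        docker_available = shutil.which("docker") is not None

    problems = []
    if not env.get("EXA_API_KEY"):
        problems.append("EXA_API_KEY is not set (needed by the web search tool)")

    uses_uv = env.get("PYTHON_EXECUTION_BACKEND") == "UV"
    if not uses_uv and not docker_available:
        problems.append(
            "Docker not found and PYTHON_EXECUTION_BACKEND is not UV "
            "(needed by the Python tool)"
        )
    return problems


if __name__ == "__main__":
    for problem in check_demo_tool_env():
        print("WARNING:", problem)
```

Passing `env` and `docker_available` explicitly keeps the check easy to test without touching the real environment.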
+ +Examples: +```bash Command +export EXA_API_KEY=YOUR_EXA_KEY +# Optional: run Python tool locally instead of Docker (use with care) +export PYTHON_EXECUTION_BACKEND=UV +``` + +Launch the server with the demo tool server: + +```bash Command +python3 -m sglang.launch_server \ + --model-path openai/gpt-oss-120b \ + --tool-server demo \ + --tp 2 +``` + +For production usage, sglang can act as an MCP client for multiple services. An [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) is provided. Start the servers and point sglang to them: +```bash Command +mcp run -t sse browser_server.py:mcp +mcp run -t sse python_server.py:mcp + +python -m sglang.launch_server ... --tool-server ip-1:port-1,ip-2:port-2 +``` +The URLs should be MCP SSE servers that expose server information and well-documented tools. These tools are added to the system prompt so the model can use them. + +## Speculative Decoding + +SGLang supports speculative decoding for GPT-OSS models using EAGLE3 algorithm. This can significantly improve decoding speed, especially for small batch sizes. + +**Usage**: +Add `--speculative-algorithm EAGLE3` along with the draft model path. +```bash Command +python3 -m sglang.launch_server \ + --model-path openai/gpt-oss-120b \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \ + --tp 2 +``` + + +To enable the experimental overlap scheduler for EAGLE3 speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages. 
+
+
+### Quick Demo
+
+```python Example
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="sk-123456"
+)
+
+tools = [
+    {"type": "code_interpreter"},
+    {"type": "web_search_preview"},
+]
+
+# Reasoning level example
+response = client.responses.create(
+    model="openai/gpt-oss-120b",
+    instructions="You are a helpful assistant.",
+    reasoning={"effort": "high"},  # Supports high, medium, or low
+    input="In one sentence, explain the transformer architecture.",
+)
+print("====== reasoning: high ======")
+print(response.output_text)
+
+# Test python tool
+response = client.responses.create(
+    model="openai/gpt-oss-120b",
+    instructions="You are a helpful assistant, you could use python tool to execute code.",
+    input="Use python tool to calculate the sum of 29138749187 and 29138749187", # 58,277,498,374
+    tools=tools
+)
+print("====== test python tool ======")
+print(response.output_text)
+
+# Test browser tool
+response = client.responses.create(
+    model="openai/gpt-oss-120b",
+    instructions="You are a helpful assistant, you could use browser to search the web",
+    input="Search the web for the latest news about Nvidia stock price",
+    tools=tools
+)
+print("====== test browser tool ======")
+print(response.output_text)
+```
+
+Example output:
+```text Output
+====== test python tool ======
+The sum of 29,138,749,187 and 29,138,749,187 is **58,277,498,374**.
+====== test browser tool ======
+**Recent headlines on Nvidia (NVDA) stock**
+
| Date (2025) | Source | Key news points | Stock‑price detail |
|---|---|---|---|
| **May 13** | Reuters | The market data page shows Nvidia trading “higher” at **$116.61** with no change from the previous close. | **$116.61** – latest trade (delayed ≈ 15 min)【14†L34-L38】 |
| **Aug 18** | CNBC | Morgan Stanley kept an **overweight** rating and lifted its price target to **$206** (up from $200), implying a 14 % upside from the Friday close. The firm notes Nvidia shares have already **jumped 34 % this year**. | No exact price quoted, but the article signals strong upside expectations【9†L27-L31】 |
| **Aug 20** | The Motley Fool | Nvidia is set to release its Q2 earnings on Aug 27. The article lists the **current price of $175.36**, down 0.16 % on the day (as of 3:58 p.m. ET). | **$175.36** – current price on Aug 20【10†L12-L15】【10†L53-L57】 |
+ +**What the news tells us** + +* Nvidia’s share price has risen sharply this year – up roughly a third according to Morgan Stanley – and analysts are still raising targets (now $206). +* The most recent market quote (Reuters, May 13) was **$116.61**, but the stock has surged since then, reaching **$175.36** by mid‑August. +* Upcoming earnings on **Aug 27** are a focal point; both the Motley Fool and Morgan Stanley expect the results could keep the rally going. + +**Bottom line:** Nvidia’s stock is on a strong upward trajectory in 2025, with price targets climbing toward $200‑$210 and the market price already near $175 as of late August. + +``` diff --git a/docs_new/docs/basic_usage/kimi_k2_5.mdx b/docs_new/docs/basic_usage/kimi_k2_5.mdx new file mode 100644 index 000000000000..d87920198952 --- /dev/null +++ b/docs_new/docs/basic_usage/kimi_k2_5.mdx @@ -0,0 +1,106 @@ +--- +title: "Kimi-K2.5 Usage" +metatags: + description: "Deploy Kimi-K2.5 with SGLang: 1T-parameter multimodal MoE model, 256K context, MLA attention, MoonViT vision encoder, thinking and instant modes, tool calling support." +--- +[Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) is Moonshot AI's open-source, native multimodal, agentic MoE. It is a 1T-parameter model (32B active) with 256K context, MLA attention, and a MoonViT vision encoder, supporting both thinking and instant modes. + +In SGLang, Kimi-K2.5 uses the `kimi_k2` reasoning and tool-call parsers for correct thinking and tool handling. + +```{note} Example +Kimi-K2.5 support is in SGLang main and will land in the next release. Use the latest main or a nightly image until then. 
+``` + +Official deployment guide: [Kimi-K2.5 deployment guide](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/docs/deploy_guidance) + +## Install (Latest Main) + +```bash Command +uv pip install "sglang @ git+https://github.com/sgl-project/sglang.git#subdirectory=python" +# For CUDA 12: +uv pip install "nvidia-cudnn-cu12==9.16.0.29" +# For CUDA 13: +uv pip install "nvidia-cudnn-cu13==9.16.0.29" +``` + +## Launch Kimi-K2.5 with SGLang + +Example: single node, TP8 on H200. + +```bash Command +python3 -m sglang.launch_server \ + --model-path moonshotai/Kimi-K2.5 \ + --tp 8 \ + --trust-remote-code \ + --tool-call-parser kimi_k2 \ + --reasoning-parser kimi_k2 +``` + +### Parser Requirements + +- `--tool-call-parser kimi_k2`: Required for tool calling. +- `--reasoning-parser kimi_k2`: Required to parse thinking content; thinking mode is enabled by default. + +## Test the Deployment + +Thinking mode is enabled by default. To disable thinking (instant mode), pass `extra_body.chat_template_kwargs.thinking=false`. + +```bash Command +# Thinking mode (default) +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "moonshotai/Kimi-K2.5", + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Explain mixture-of-experts in one sentence."} + ], + "max_tokens": 256 + }' +``` + +```bash Command +# Instant mode (thinking disabled) +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "moonshotai/Kimi-K2.5", + "messages": [ + {"role": "user", "content": "Give one sentence on MoE models."} + ], + "max_tokens": 128, + "extra_body": {"chat_template_kwargs": {"thinking": false}} + }' +``` + +## Multimodal Inputs (Image/Video) + +Kimi-K2.5 is multimodal. Image inputs are supported via the OpenAI-compatible vision API. For more details, see `openai_api_vision.ipynb`. 
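For readers who prefer Python over `curl`, the request body for an image input can be assembled as in the sketch below. The helper name `build_image_chat_request` is hypothetical and only mirrors the OpenAI-compatible chat format used throughout this page; it builds the JSON payload and does not contact a server.

```python
import json
from typing import Any, Dict


def build_image_chat_request(
    image_url: str,
    prompt: str,
    model: str = "moonshotai/Kimi-K2.5",
    max_tokens: int = 256,
) -> Dict[str, Any]:
    """Build an OpenAI-compatible chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    body = build_image_chat_request(
        "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true",
        "Describe this image.",
    )
    # POST this JSON to http://localhost:30000/v1/chat/completions on a running server.
    print(json.dumps(body, indent=2))
```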
+ +```bash Command +# Image input (SGLang) +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "moonshotai/Kimi-K2.5", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "Describe this image."}, + { + "type": "image_url", + "image_url": { + "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true" + } + } + ] + } + ], + "max_tokens": 256 + }' +``` + + +Video chat is experimental and is only supported in the official Moonshot API for now. + diff --git a/docs_new/docs/basic_usage/llama4.mdx b/docs_new/docs/basic_usage/llama4.mdx new file mode 100644 index 000000000000..c68ea27c7a91 --- /dev/null +++ b/docs_new/docs/basic_usage/llama4.mdx @@ -0,0 +1,117 @@ +--- +title: "Llama4 Usage" +metatags: + description: "Deploy Llama 4 Scout (109B) and Maverick (400B) with SGLang: up to 10M context, hybrid KV cache, vision support. Optimized for H100/H200 GPUs." +--- +[Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD) is Meta's latest generation of open-source LLM model with industry-leading performance. + +SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5). + +Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118). + +## Launch Llama 4 with SGLang + +To serve Llama 4 models on 8xH100/H200 GPUs: + +```bash Command +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --tp 8 \ + --context-length 1000000 +``` + +### Configuration Tips + +- **OOM Mitigation**: Adjust `--context-length` to avoid a GPU out-of-memory issue. For the Scout model, we recommend setting this value up to 1M on 8\*H100 and up to 2.5M on 8\*H200. For the Maverick model, we don't need to set context length on 8\*H200. 
When hybrid kv cache is enabled, `--context-length` can be set up to 5M on 8\*H100 and up to 10M on 8\*H200 for the Scout model. + +- **Attention Backend Auto-Selection**: SGLang automatically selects the optimal attention backend for Llama 4 based on your hardware. You typically don't need to specify `--attention-backend` manually: + - **Blackwell GPUs (B200/GB200)**: `trtllm_mha` + - **Hopper GPUs (H100/H200)**: `fa3` + - **AMD GPUs**: `aiter` + - **Intel XPU**: `intel_xpu` + - **Other platforms**: `triton` (fallback) + + To override the auto-selection, explicitly specify `--attention-backend` with one of the supported backends: `fa3`, `aiter`, `triton`, `trtllm_mha`, or `intel_xpu`. + +- **Chat Template**: Add `--chat-template llama-4` for chat completion tasks. +- **Enable Multi-Modal**: Add `--enable-multimodal` for multi-modal capabilities. +- **Enable Hybrid-KVCache**: Set `--swa-full-tokens-ratio` to adjust the ratio of SWA layer (for Llama4, it's local attention layer) KV tokens / full layer KV tokens. (default: 0.8, range: 0-1) + + +### EAGLE Speculative Decoding +**Description**: SGLang has supported Llama 4 Maverick (400B) with [EAGLE speculative decoding](../advanced_features/speculative_decoding#EAGLE-Decoding). + +**Usage**: +Add arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example: +```text Output +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --trust-remote-code \ + --tp 8 \ + --context-length 1000000 +``` + +- **Note** The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* can only recognize conversations in chat mode. 
+
+## Benchmarking Results
+
+### Accuracy Test with `lm_eval`
+
+With SGLang, the accuracy of both Llama 4 Scout and Llama 4 Maverick matches the [official benchmark numbers](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).
+
+Benchmark results on the MMLU Pro dataset with 8*H100:
+
| | Llama-4-Scout-17B-16E-Instruct | Llama-4-Maverick-17B-128E-Instruct |
|---|---|---|
| Official Benchmark | 74.3 | 80.5 |
| SGLang | 75.2 | 80.7 |
+ +Commands: + +```bash Command +# Llama-4-Scout-17B-16E-Instruct model +python -m sglang.launch_server \ + --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --port 30000 \ + --tp 8 \ + --mem-fraction-static 0.8 \ + --context-length 65536 +lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0 + +# Llama-4-Maverick-17B-128E-Instruct +python -m sglang.launch_server \ + --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --port 30000 \ + --tp 8 \ + --mem-fraction-static 0.8 \ + --context-length 65536 +lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0 +``` + +Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/5092). diff --git a/docs_new/docs/basic_usage/minimax_m2.mdx b/docs_new/docs/basic_usage/minimax_m2.mdx new file mode 100644 index 000000000000..c248b1942590 --- /dev/null +++ b/docs_new/docs/basic_usage/minimax_m2.mdx @@ -0,0 +1,88 @@ +--- +title: "MiniMax M2.5/M2.1/M2 Usage" +metatags: + description: "Deploy MiniMax M2.5/M2.1/M2 with SGLang: 230B MoE model (10B active), up to 3M context, optimized for coding and agentic tasks, tool use support." +--- +[MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), [MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1), and [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) are advanced large language models created by [MiniMax](https://www.minimax.io/). + +The MiniMax-M2 series redefines efficiency for agents. 
These compact, fast, and cost-effective MoE models (230 billion total parameters with 10 billion active parameters) are built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, the MiniMax-M2 series provides sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever. + +## Supported Models + +This guide applies to the following models. You only need to update the model name during deployment. The following examples use **MiniMax-M2**: + +- [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) +- [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) +- [MiniMaxAI/MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) + +## System Requirements + +The following are recommended configurations; actual requirements should be adjusted based on your use case: + +- 4x 96GB GPUs: Supported context length of up to 400K tokens. +- 8x 144GB GPUs: Supported context length of up to 3M tokens. 
+ +## Deployment with Python + +4-GPU deployment command: + +```bash Command +python -m sglang.launch_server \ + --model-path MiniMaxAI/MiniMax-M2 \ + --tp-size 4 \ + --tool-call-parser minimax-m2 \ + --reasoning-parser minimax-append-think \ + --host 0.0.0.0 \ + --trust-remote-code \ + --port 8000 \ + --mem-fraction-static 0.85 +``` + +8-GPU deployment command: + +```bash Command +python -m sglang.launch_server \ + --model-path MiniMaxAI/MiniMax-M2 \ + --tp-size 8 \ + --ep-size 8 \ + --tool-call-parser minimax-m2 \ + --reasoning-parser minimax-append-think \ + --host 0.0.0.0 \ + --trust-remote-code \ + --port 8000 \ + --mem-fraction-static 0.85 +``` + +### AMD GPUs (MI300X/MI325X/MI355X) + +8-GPU deployment command: + +```bash Command +SGLANG_USE_AITER=1 python -m sglang.launch_server \ + --model-path MiniMaxAI/MiniMax-M2.5 \ + --tp-size 8 \ + --ep-size 8 \ + --attention-backend aiter \ + --tool-call-parser minimax-m2 \ + --reasoning-parser minimax-append-think \ + --host 0.0.0.0 \ + --trust-remote-code \ + --port 8000 \ + --mem-fraction-static 0.85 +``` + +## Testing Deployment + +After startup, you can test the SGLang OpenAI-compatible API with the following command: + +```bash Command +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "MiniMaxAI/MiniMax-M2", + "messages": [ + {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]}, + {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]} + ] + }' +``` diff --git a/docs_new/docs/basic_usage/native_api.ipynb b/docs_new/docs/basic_usage/native_api.ipynb new file mode 100644 index 000000000000..d3ead5e349d6 --- /dev/null +++ b/docs_new/docs/basic_usage/native_api.ipynb @@ -0,0 +1,675 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# SGLang Native APIs\n", + "\n", + "Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native 
server APIs. We introduce the following APIs:\n", + "\n", + "- `/generate` (text generation model)\n", + "- `/get_model_info`\n", + "- `/server_info`\n", + "- `/health`\n", + "- `/health_generate`\n", + "- `/flush_cache`\n", + "- `/update_weights`\n", + "- `/encode`(embedding model)\n", + "- `/v1/rerank`(cross encoder rerank model)\n", + "- `/v1/score`(decoder-only scoring)\n", + "- `/classify`(reward model)\n", + "- `/start_expert_distribution_record`\n", + "- `/stop_expert_distribution_record`\n", + "- `/dump_expert_distribution_record`\n", + "- `/tokenize`\n", + "- `/detokenize`\n", + "- A full list of these APIs can be found at [http_server.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py)\n", + "\n", + "We mainly use `requests` to test these APIs in the following examples. You can also use `curl`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Launch A Server" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", + "\n", + "server_process, port = launch_server_cmd(\n", + " \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n", + ")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generate (text generation model)\n", + "Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](sampling_params.md)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "url = f\"http://localhost:{port}/generate\"\n", + "data = {\"text\": \"What is the capital of France?\"}\n", + "\n", + "response = requests.post(url, json=data)\n", + "print_highlight(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Get Model Info\n", + "\n", + "Get the information of the model.\n", + "\n", + "- `model_path`: The path/name of the model.\n", + "- `is_generation`: Whether the model is used as generation model or embedding model.\n", + "- `tokenizer_path`: The path/name of the tokenizer.\n", + "- `preferred_sampling_params`: The default sampling params specified via `--preferred-sampling-params`. `None` is returned in this example as we did not explicitly configure it in server args.\n", + "- `weight_version`: This field contains the version of the model weights. This is often used to track changes or updates to the model’s trained parameters.\n", + "- `has_image_understanding`: Whether the model has image-understanding capability.\n", + "- `has_audio_understanding`: Whether the model has audio-understanding capability.\n", + "- `model_type`: The model type from the HuggingFace config (e.g., \"qwen2\", \"llama\").\n", + "- `architectures`: The model architectures from the HuggingFace config (e.g., [\"Qwen2ForCausalLM\"])." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://localhost:{port}/get_model_info\"\n", + "\n", + "response = requests.get(url)\n", + "response_json = response.json()\n", + "print_highlight(response_json)\n", + "assert response_json[\"model_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n", + "assert response_json[\"is_generation\"] is True\n", + "assert response_json[\"tokenizer_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n", + "assert response_json[\"preferred_sampling_params\"] is None\n", + "assert response_json.keys() == {\n", + " \"model_path\",\n", + " \"is_generation\",\n", + " \"tokenizer_path\",\n", + " \"preferred_sampling_params\",\n", + " \"weight_version\",\n", + " \"has_image_understanding\",\n", + " \"has_audio_understanding\",\n", + " \"model_type\",\n", + " \"architectures\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Get Server Info\n", + "Gets the server information including CLI arguments, token limits, and memory pool sizes.\n", + "- Note: `get_server_info` merges the following deprecated endpoints:\n", + " - `get_server_args`\n", + " - `get_memory_pool_size`\n", + " - `get_max_total_num_tokens`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://localhost:{port}/server_info\"\n", + "\n", + "response = requests.get(url)\n", + "print_highlight(response.text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Health Check\n", + "- `/health`: Check the health of the server.\n", + "- `/health_generate`: Check the health of the server by generating one token." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://localhost:{port}/health_generate\"\n", + "\n", + "response = requests.get(url)\n", + "print_highlight(response.text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://localhost:{port}/health\"\n", + "\n", + "response = requests.get(url)\n", + "print_highlight(response.text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Flush Cache\n", + "\n", + "Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API.\n", + "\n", + "Parameters:\n", + "- `timeout` (query, float, default `0`, unit: seconds): Wait time for idle state before flushing. `0` means fail fast if not idle. When HiCache async operations are in-flight, a non-zero timeout allows the server to wait until idle before flushing, avoiding unnecessary 400 errors.\n", + "\n", + "```bash\n", + "# With timeout (wait up to 30s for idle state)\n", + "curl -s -X POST \"http://127.0.0.1:30000/flush_cache?timeout=30\"\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = f\"http://localhost:{port}/flush_cache\"\n", + "\n", + "response = requests.post(url)\n", + "print_highlight(response.text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Update Weights From Disk\n", + "\n", + "Update model weights from disk without restarting the server. 
Only applicable for models with the same architecture and parameter size.\n", + "\n", + "SGLang supports the `update_weights_from_disk` API for continuous evaluation during training (save a checkpoint to disk, then update weights from disk).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# successful update with same architecture and size\n", + "\n", + "url = f\"http://localhost:{port}/update_weights_from_disk\"\n", + "data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct\"}\n", + "\n", + "response = requests.post(url, json=data)\n", + "print_highlight(response.text)\n", + "assert response.json()[\"success\"] is True\n", + "assert response.json()[\"message\"] == \"Succeeded to update model weights.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# failed update with different parameter size or wrong name\n", + "\n", + "url = f\"http://localhost:{port}/update_weights_from_disk\"\n", + "data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct-wrong\"}\n", + "\n", + "response = requests.post(url, json=data)\n", + "response_json = response.json()\n", + "print_highlight(response_json)\n", + "assert response_json[\"success\"] is False\n", + "assert response_json[\"message\"] == (\n", + " \"Failed to get weights iterator: \"\n", + " \"qwen/qwen2.5-0.5b-instruct-wrong\"\n", + " \" (repository not found).\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Encode (embedding model)\n", + "\n", + "Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.ipynb) and will raise an error for generation models.\n", + "Therefore, we launch a new server to serve an embedding model."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "embedding_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n", + " --host 0.0.0.0 --is-embedding --log-level warning\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=embedding_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# successful encode for embedding model\n", + "\n", + "url = f\"http://localhost:{port}/encode\"\n", + "data = {\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"text\": \"Once upon a time\"}\n", + "\n", + "response = requests.post(url, json=data)\n", + "response_json = response.json()\n", + "print_highlight(f\"Text embedding (first 10): {response_json['embedding'][:10]}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(embedding_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## v1/rerank (cross encoder rerank model)\n", + "Rerank a list of documents given a query using a cross-encoder model. 
Note that this API is only available for cross-encoder models such as [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3), and only with `--attention-backend` set to `triton` or `torch_native`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "reranker_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \\\n", + " --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=reranker_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# compute rerank scores for query and documents\n", + "\n", + "url = f\"http://localhost:{port}/v1/rerank\"\n", + "data = {\n", + " \"model\": \"BAAI/bge-reranker-v2-m3\",\n", + " \"query\": \"what is panda?\",\n", + " \"documents\": [\n", + " \"hi\",\n", + " \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\",\n", + " ],\n", + "}\n", + "\n", + "response = requests.post(url, json=data)\n", + "response_json = response.json()\n", + "for item in response_json:\n", + " print_highlight(f\"Score: {item['score']:.2f} - Document: '{item['document']}'\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(reranker_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## v1/score (decoder-only scoring)\n", + "\n", + "Compute token probabilities for specified tokens given a query and items.
This is useful for classification tasks, scoring responses, or computing log-probabilities.\n", + "\n", + "Parameters:\n", + "- `query`: Query text\n", + "- `items`: Item text(s) to score\n", + "- `label_token_ids`: Token IDs to compute probabilities for\n", + "- `apply_softmax`: Whether to apply softmax to get normalized probabilities (default: False)\n", + "- `item_first`: Whether items come first in concatenation order (default: False)\n", + "- `model`: Model name\n", + "\n", + "The response contains `scores` - a list of probability lists, one per item, each in the order of `label_token_ids`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "score_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n", + " --host 0.0.0.0 --log-level warning\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=score_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Score the probability of different completions given a query\n", + "query = \"The capital of France is\"\n", + "items = [\"Paris\", \"London\", \"Berlin\"]\n", + "\n", + "url = f\"http://localhost:{port}/v1/score\"\n", + "data = {\n", + " \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n", + " \"query\": query,\n", + " \"items\": items,\n", + " \"label_token_ids\": [9454, 2753], # e.g. 
\"Yes\" and \"No\" token ids\n", + " \"apply_softmax\": True, # Normalize probabilities to sum to 1\n", + "}\n", + "\n", + "response = requests.post(url, json=data)\n", + "response_json = response.json()\n", + "\n", + "# Display scores for each item\n", + "for item, scores in zip(items, response_json[\"scores\"]):\n", + " print_highlight(f\"Item '{item}': probabilities = {[f'{s:.4f}' for s in scores]}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(score_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Classify (reward model)\n", + "\n", + "SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Note that SGLang now treats embedding models and reward models as the same type of models.\n", + "# This will be updated in the future.\n", + "\n", + "reward_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=reward_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer\n", + "\n", + "PROMPT = (\n", + " \"What is the range of the numeric output of a sigmoid node in a neural network?\"\n", + ")\n", + "\n", + "RESPONSE1 = \"The output of a sigmoid node is bounded between -1 and 1.\"\n", + "RESPONSE2 = \"The output of a sigmoid node is bounded between 0 and 1.\"\n", + "\n", + "CONVS = [\n", + " [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE1}],\n", + " [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": 
\"assistant\", \"content\": RESPONSE2}],\n", + "]\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\")\n", + "prompts = tokenizer.apply_chat_template(CONVS, tokenize=False, return_dict=False)\n", + "\n", + "url = f\"http://localhost:{port}/classify\"\n", + "data = {\"model\": \"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\", \"text\": prompts}\n", + "\n", + "responses = requests.post(url, json=data).json()\n", + "for response in responses:\n", + " print_highlight(f\"reward: {response['embedding'][0]}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(reward_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Capture expert selection distribution in MoE models\n", + "\n", + "SGLang Runtime supports recording the number of times an expert is selected in a MoE model run for each expert in the model. This is useful when analyzing the throughput of the model and plan for optimization.\n", + "\n", + "*Note: We only print out the first 10 lines of the csv below for better readability. 
Please adjust accordingly if you want to analyze the results more deeply.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "expert_record_server_process, port = launch_server_cmd(\n", + " \"python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning\"\n", + ")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=expert_record_server_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.post(f\"http://localhost:{port}/start_expert_distribution_record\")\n", + "print_highlight(response)\n", + "\n", + "url = f\"http://localhost:{port}/generate\"\n", + "data = {\"text\": \"What is the capital of France?\"}\n", + "\n", + "response = requests.post(url, json=data)\n", + "print_highlight(response.json())\n", + "\n", + "response = requests.post(f\"http://localhost:{port}/stop_expert_distribution_record\")\n", + "print_highlight(response)\n", + "\n", + "response = requests.post(f\"http://localhost:{port}/dump_expert_distribution_record\")\n", + "print_highlight(response)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(expert_record_server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tokenize/Detokenize Example (Round Trip)\n", + "\n", + "This example demonstrates how to use the /tokenize and /detokenize endpoints together. We first tokenize a string, then detokenize the resulting IDs to reconstruct the original text. This workflow is useful when you need to handle tokenization externally but still leverage the server for detokenization." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer_free_server_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=tokenizer_free_server_process)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "from sglang.utils import print_highlight\n", + "\n", + "base_url = f\"http://localhost:{port}\"\n", + "tokenize_url = f\"{base_url}/tokenize\"\n", + "detokenize_url = f\"{base_url}/detokenize\"\n", + "\n", + "model_name = \"qwen/qwen2.5-0.5b-instruct\"\n", + "input_text = \"SGLang provides efficient tokenization endpoints.\"\n", + "print_highlight(f\"Original Input Text:\\n'{input_text}'\")\n", + "\n", + "# --- tokenize the input text ---\n", + "tokenize_payload = {\n", + " \"model\": model_name,\n", + " \"prompt\": input_text,\n", + " \"add_special_tokens\": False,\n", + "}\n", + "try:\n", + " tokenize_response = requests.post(tokenize_url, json=tokenize_payload)\n", + " tokenize_response.raise_for_status()\n", + " tokenization_result = tokenize_response.json()\n", + " token_ids = tokenization_result.get(\"tokens\")\n", + "\n", + " if not token_ids:\n", + " raise ValueError(\"Tokenization returned empty tokens.\")\n", + "\n", + " print_highlight(f\"\\nTokenized Output (IDs):\\n{token_ids}\")\n", + " print_highlight(f\"Token Count: {tokenization_result.get('count')}\")\n", + " print_highlight(f\"Max Model Length: {tokenization_result.get('max_model_len')}\")\n", + "\n", + " # --- detokenize the obtained token IDs ---\n", + " detokenize_payload = {\n", + " \"model\": model_name,\n", + " \"tokens\": token_ids,\n", + " \"skip_special_tokens\": True,\n", + " }\n", + "\n", + " detokenize_response = requests.post(detokenize_url, json=detokenize_payload)\n", + " 
detokenize_response.raise_for_status()\n", + " detokenization_result = detokenize_response.json()\n", + " reconstructed_text = detokenization_result.get(\"text\")\n", + "\n", + " print_highlight(f\"\\nDetokenized Output (Text):\\n'{reconstructed_text}'\")\n", + "\n", + " if input_text == reconstructed_text:\n", + " print_highlight(\n", + " \"\\nRound Trip Successful: Original and reconstructed text match.\"\n", + " )\n", + " else:\n", + " print_highlight(\n", + " \"\\nRound Trip Mismatch: Original and reconstructed text differ.\"\n", + " )\n", + "\n", + "except requests.exceptions.RequestException as e:\n", + " print_highlight(f\"\\nHTTP Request Error: {e}\")\n", + "except Exception as e:\n", + " print_highlight(f\"\\nAn error occurred: {e}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(tokenizer_free_server_process)" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs_new/docs/basic_usage/native_api.mdx b/docs_new/docs/basic_usage/native_api.mdx new file mode 100644 index 000000000000..42c5bd228319 --- /dev/null +++ b/docs_new/docs/basic_usage/native_api.mdx @@ -0,0 +1,448 @@ +--- +title: "SGLang Native APIs" +metatags: + description: "SGLang native server APIs for text generation, embedding, reranking, model info, cache management, and more." +--- +Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. 
We introduce the following APIs: + +- `/generate` (text generation model) +- `/get_model_info` +- `/server_info` +- `/health` +- `/health_generate` +- `/flush_cache` +- `/update_weights` +- `/encode`(embedding model) +- `/v1/rerank`(cross encoder rerank model) +- `/v1/score`(decoder-only scoring) +- `/classify`(reward model) +- `/start_expert_distribution_record` +- `/stop_expert_distribution_record` +- `/dump_expert_distribution_record` +- `/tokenize` +- `/detokenize` +- A full list of these APIs can be found at [http_server.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py) + +We mainly use `requests` to test these APIs in the following examples. You can also use `curl`. + +## Launch A Server + +```python Example +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, print_highlight, terminate_process + +server_process, port = launch_server_cmd( + "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning" +) + +wait_for_server(f"http://localhost:{port}", process=server_process) +``` + +## Generate (text generation model) +Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](./sampling_params). + +```python Example +import requests + +url = f"http://localhost:{port}/generate" +data = {"text": "What is the capital of France?"} + +response = requests.post(url, json=data) +print_highlight(response.json()) +``` + +## Get Model Info + +Get the information of the model. + +- `model_path`: The path/name of the model. +- `is_generation`: Whether the model is used as generation model or embedding model. +- `tokenizer_path`: The path/name of the tokenizer. +- `preferred_sampling_params`: The default sampling params specified via `--preferred-sampling-params`. `None` is returned in this example as we did not explicitly configure it in server args. 
+- `weight_version`: This field contains the version of the model weights. This is often used to track changes or updates to the model’s trained parameters. +- `has_image_understanding`: Whether the model has image-understanding capability. +- `has_audio_understanding`: Whether the model has audio-understanding capability. +- `model_type`: The model type from the HuggingFace config (e.g., "qwen2", "llama"). +- `architectures`: The model architectures from the HuggingFace config (e.g., ["Qwen2ForCausalLM"]). + +```python Example +url = f"http://localhost:{port}/get_model_info" + +response = requests.get(url) +response_json = response.json() +print_highlight(response_json) +assert response_json["model_path"] == "qwen/qwen2.5-0.5b-instruct" +assert response_json["is_generation"] is True +assert response_json["tokenizer_path"] == "qwen/qwen2.5-0.5b-instruct" +assert response_json["preferred_sampling_params"] is None +assert response_json.keys() == { + "model_path", + "is_generation", + "tokenizer_path", + "preferred_sampling_params", + "weight_version", + "has_image_understanding", + "has_audio_understanding", + "model_type", + "architectures", +} +``` + +## Get Server Info +Gets the server information including CLI arguments, token limits, and memory pool sizes. +- Note: `get_server_info` merges the following deprecated endpoints: + - `get_server_args` + - `get_memory_pool_size` + - `get_max_total_num_tokens` + +```python Example +url = f"http://localhost:{port}/server_info" + +response = requests.get(url) +print_highlight(response.text) +``` + +## Health Check +- `/health`: Check the health of the server. +- `/health_generate`: Check the health of the server by generating one token. 
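A minimal sketch of polling a health probe until the server is ready, assuming `/health` answers with HTTP 200 once the server is up. The helper name `wait_until_healthy` and its parameters are hypothetical and not part of SGLang; in the examples here, `sglang.utils.wait_for_server` plays this role.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # A refused connection usually just means the server is not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example probe (assumes `requests` is installed and `port` is the server port):
#   probe = lambda: requests.get(f"http://localhost:{port}/health").status_code == 200
#   wait_until_healthy(probe, timeout_s=60)
```

Passing the probe as a callable keeps the retry loop independent of the HTTP client, so the same loop could wrap `/health_generate` when a full generation round trip should be verified.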
+ +```python Example +url = f"http://localhost:{port}/health_generate" + +response = requests.get(url) +print_highlight(response.text) +``` + +```python Example +url = f"http://localhost:{port}/health" + +response = requests.get(url) +print_highlight(response.text) +``` + +## Flush Cache + +Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API. + +Parameters: +- `timeout` (query, float, default `0`, unit: seconds): Wait time for idle state before flushing. `0` means fail fast if not idle. When HiCache async operations are in-flight, a non-zero timeout allows the server to wait until idle before flushing, avoiding unnecessary 400 errors. + +```bash Command +# With timeout (wait up to 30s for idle state) +curl -s -X POST "http://127.0.0.1:30000/flush_cache?timeout=30" +``` + +```python Example +url = f"http://localhost:{port}/flush_cache" + +response = requests.post(url) +print_highlight(response.text) +``` + +## Update Weights From Disk + +Update model weights from disk without restarting the server. Only applicable for models with the same architecture and parameter size. + +SGLang supports the `update_weights_from_disk` API for continuous evaluation during training (save a checkpoint to disk, then update weights from disk). + +```python Example +# successful update with same architecture and size + +url = f"http://localhost:{port}/update_weights_from_disk" +data = {"model_path": "qwen/qwen2.5-0.5b-instruct"} + +response = requests.post(url, json=data) +print_highlight(response.text) +assert response.json()["success"] is True +assert response.json()["message"] == "Succeeded to update model weights."
+``` + +```python Example +# failed update with different parameter size or wrong name + +url = f"http://localhost:{port}/update_weights_from_disk" +data = {"model_path": "qwen/qwen2.5-0.5b-instruct-wrong"} + +response = requests.post(url, json=data) +response_json = response.json() +print_highlight(response_json) +assert response_json["success"] is False +assert response_json["message"] == ( + "Failed to get weights iterator: " + "qwen/qwen2.5-0.5b-instruct-wrong" + " (repository not found)." +) +``` + +```python Example +terminate_process(server_process) +``` + +## Encode (embedding model) + +Encode text into embeddings. Note that this API is only available for [embedding models](./openai_api_embeddings) and will raise an error for generation models. +Therefore, we launch a new server to serve an embedding model. + +```python Example +embedding_process, port = launch_server_cmd(""" +python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \ + --host 0.0.0.0 --is-embedding --log-level warning +""") + +wait_for_server(f"http://localhost:{port}", process=embedding_process) +``` + +```python Example +# successful encode for embedding model + +url = f"http://localhost:{port}/encode" +data = {"model": "Alibaba-NLP/gte-Qwen2-1.5B-instruct", "text": "Once upon a time"} + +response = requests.post(url, json=data) +response_json = response.json() +print_highlight(f"Text embedding (first 10): {response_json['embedding'][:10]}") +``` + +```python Example +terminate_process(embedding_process) +``` + +## v1/rerank (cross encoder rerank model) +Rerank a list of documents given a query using a cross-encoder model. Note that this API is only available for cross-encoder models such as [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3), and only with `--attention-backend` set to `triton` or `torch_native`.
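A sketch of post-processing a rerank response, assuming each returned item carries `score` and `document` fields, which are the fields the request example in this section reads. `top_documents` is a hypothetical helper, and the sample scores are made up for illustration rather than real model output.

```python
from typing import Dict, List, Union

RerankItem = Dict[str, Union[str, float]]


def top_documents(results: List[RerankItem], k: int = 1) -> List[str]:
    """Return the k highest-scoring documents, best first."""
    ranked = sorted(results, key=lambda item: item["score"], reverse=True)
    return [str(item["document"]) for item in ranked[:k]]


# Illustrative response shape; the numbers are invented.
sample = [
    {"score": -8.2, "document": "hi"},
    {"score": 5.3, "document": "The giant panda is a bear species endemic to China."},
]
print(top_documents(sample, k=1))
```

Sorting on the client side avoids relying on any particular ordering of the returned list.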
+ +```python Example +reranker_process, port = launch_server_cmd(""" +python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \ + --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning +""") + +wait_for_server(f"http://localhost:{port}", process=reranker_process) +``` + +```python Example +# compute rerank scores for query and documents + +url = f"http://localhost:{port}/v1/rerank" +data = { + "model": "BAAI/bge-reranker-v2-m3", + "query": "what is panda?", + "documents": [ + "hi", + "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.", + ], +} + +response = requests.post(url, json=data) +response_json = response.json() +for item in response_json: + print_highlight(f"Score: {item['score']:.2f} - Document: '{item['document']}'") +``` + +```python Example +terminate_process(reranker_process) +``` + +## v1/score (decoder-only scoring) + +Compute token probabilities for specified tokens given a query and items. This is useful for classification tasks, scoring responses, or computing log-probabilities. + +Parameters: +- `query`: Query text +- `items`: Item text(s) to score +- `label_token_ids`: Token IDs to compute probabilities for +- `apply_softmax`: Whether to apply softmax to get normalized probabilities (default: False) +- `item_first`: Whether items come first in concatenation order (default: False) +- `model`: Model name + +The response contains `scores` - a list of probability lists, one per item, each in the order of `label_token_ids`. 
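To make the response shape concrete, here is a small sketch that pairs each item with its per-label probabilities and picks the most likely label token. `best_label_per_item` is a hypothetical helper, and the probabilities below are illustrative values, not real model output.

```python
from typing import Dict, List, Sequence


def best_label_per_item(
    items: Sequence[str],
    label_token_ids: Sequence[int],
    scores: Sequence[Sequence[float]],
) -> Dict[str, int]:
    """Map each item to the label token ID with the highest probability."""
    if len(items) != len(scores):
        raise ValueError("expected one score list per item")
    best: Dict[str, int] = {}
    for item, probs in zip(items, scores):
        if len(probs) != len(label_token_ids):
            raise ValueError("each score list must match label_token_ids in length")
        # Scores are ordered like label_token_ids, so the argmax index maps back to a token ID.
        idx = max(range(len(probs)), key=lambda i: probs[i])
        best[item] = label_token_ids[idx]
    return best


# Illustrative values shaped like response_json["scores"] with apply_softmax=True.
example_scores: List[List[float]] = [[0.9, 0.1], [0.2, 0.8]]
print(best_label_per_item(["Paris", "London"], [9454, 2753], example_scores))
```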
+ +```python Example +score_process, port = launch_server_cmd(""" +python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \ + --host 0.0.0.0 --log-level warning +""") + +wait_for_server(f"http://localhost:{port}", process=score_process) +``` + +```python Example +# Score the probability of different completions given a query +query = "The capital of France is" +items = ["Paris", "London", "Berlin"] + +url = f"http://localhost:{port}/v1/score" +data = { + "model": "qwen/qwen2.5-0.5b-instruct", + "query": query, + "items": items, + "label_token_ids": [9454, 2753], # e.g. "Yes" and "No" token ids + "apply_softmax": True, # Normalize probabilities to sum to 1 +} + +response = requests.post(url, json=data) +response_json = response.json() + +# Display scores for each item +for item, scores in zip(items, response_json["scores"]): + print_highlight(f"Item '{item}': probabilities = {[f'{s:.4f}' for s in scores]}") +``` + +```python Example +terminate_process(score_process) +``` + +## Classify (reward model) + +SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations. + +```python Example +# Note that SGLang now treats embedding models and reward models as the same type of models. +# This will be updated in the future. + +reward_process, port = launch_server_cmd(""" +python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning +""") + +wait_for_server(f"http://localhost:{port}", process=reward_process) +``` + +```python Example +from transformers import AutoTokenizer + +PROMPT = ( + "What is the range of the numeric output of a sigmoid node in a neural network?" +) + +RESPONSE1 = "The output of a sigmoid node is bounded between -1 and 1." +RESPONSE2 = "The output of a sigmoid node is bounded between 0 and 1." 
+ +CONVS = [ + [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE1}], + [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE2}], +] + +tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2") +prompts = tokenizer.apply_chat_template(CONVS, tokenize=False, return_dict=False) + +url = f"http://localhost:{port}/classify" +data = {"model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", "text": prompts} + +responses = requests.post(url, json=data).json() +for response in responses: + print_highlight(f"reward: {response['embedding'][0]}") +``` + +```python Example +terminate_process(reward_process) +``` + +## Capture expert selection distribution in MoE models + +SGLang Runtime supports recording how many times each expert is selected during a MoE model run. This is useful for analyzing model throughput and planning optimizations. + +*Note: We only print out the first 10 lines of the csv below for better readability.
Please adjust accordingly if you want to analyze the results more deeply.* + +```python Example +expert_record_server_process, port = launch_server_cmd( + "python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning" +) + +wait_for_server(f"http://localhost:{port}", process=expert_record_server_process) +``` + +```python Example +response = requests.post(f"http://localhost:{port}/start_expert_distribution_record") +print_highlight(response) + +url = f"http://localhost:{port}/generate" +data = {"text": "What is the capital of France?"} + +response = requests.post(url, json=data) +print_highlight(response.json()) + +response = requests.post(f"http://localhost:{port}/stop_expert_distribution_record") +print_highlight(response) + +response = requests.post(f"http://localhost:{port}/dump_expert_distribution_record") +print_highlight(response) +``` + +```python Example +terminate_process(expert_record_server_process) +``` + +## Tokenize/Detokenize Example (Round Trip) + +This example demonstrates how to use the /tokenize and /detokenize endpoints together. We first tokenize a string, then detokenize the resulting IDs to reconstruct the original text. This workflow is useful when you need to handle tokenization externally but still leverage the server for detokenization. + +```python Example +tokenizer_free_server_process, port = launch_server_cmd(""" +python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct +""") + +wait_for_server(f"http://localhost:{port}", process=tokenizer_free_server_process) +``` + +```python Example +import requests +from sglang.utils import print_highlight + +base_url = f"http://localhost:{port}" +tokenize_url = f"{base_url}/tokenize" +detokenize_url = f"{base_url}/detokenize" + +model_name = "qwen/qwen2.5-0.5b-instruct" +input_text = "SGLang provides efficient tokenization endpoints." 
+print_highlight(f"Original Input Text:\n'{input_text}'") + +# --- tokenize the input text --- +tokenize_payload = { + "model": model_name, + "prompt": input_text, + "add_special_tokens": False, +} +try: + tokenize_response = requests.post(tokenize_url, json=tokenize_payload) + tokenize_response.raise_for_status() + tokenization_result = tokenize_response.json() + token_ids = tokenization_result.get("tokens") + + if not token_ids: + raise ValueError("Tokenization returned empty tokens.") + + print_highlight(f"\nTokenized Output (IDs):\n{token_ids}") + print_highlight(f"Token Count: {tokenization_result.get('count')}") + print_highlight(f"Max Model Length: {tokenization_result.get('max_model_len')}") + + # --- detokenize the obtained token IDs --- + detokenize_payload = { + "model": model_name, + "tokens": token_ids, + "skip_special_tokens": True, + } + + detokenize_response = requests.post(detokenize_url, json=detokenize_payload) + detokenize_response.raise_for_status() + detokenization_result = detokenize_response.json() + reconstructed_text = detokenization_result.get("text") + + print_highlight(f"\nDetokenized Output (Text):\n'{reconstructed_text}'") + + if input_text == reconstructed_text: + print_highlight( + "\nRound Trip Successful: Original and reconstructed text match." + ) + else: + print_highlight( + "\nRound Trip Mismatch: Original and reconstructed text differ." 
+ ) + +except requests.exceptions.RequestException as e: + print_highlight(f"\nHTTP Request Error: {e}") +except Exception as e: + print_highlight(f"\nAn error occurred: {e}") +``` + +```python Example +terminate_process(tokenizer_free_server_process) +``` diff --git a/docs_new/docs/basic_usage/offline_engine_api.ipynb b/docs_new/docs/basic_usage/offline_engine_api.ipynb new file mode 100644 index 000000000000..fe8a9e3045c0 --- /dev/null +++ b/docs_new/docs/basic_usage/offline_engine_api.ipynb @@ -0,0 +1,235 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Offline Engine API\n", + "\n", + "SGLang provides a direct inference engine that does not require an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. Here are two general use cases:\n", + "\n", + "- Offline Batch Inference\n", + "- Custom Server on Top of the Engine\n", + "\n", + "This document focuses on offline batch inference, demonstrating four different inference modes:\n", + "\n", + "- Non-streaming synchronous generation\n", + "- Streaming synchronous generation\n", + "- Non-streaming asynchronous generation\n", + "- Streaming asynchronous generation\n", + "\n", + "Additionally, you can easily build a custom server on top of the SGLang offline engine. 
A complete example script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Nest Asyncio\n", + "If you want to use the **Offline Engine** in IPython or other code that already runs an event loop, add the following code first:\n", + "```python\n", + "import nest_asyncio\n", + "\n", + "nest_asyncio.apply()\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Advanced Usage\n", + "\n", + "The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/tree/main/examples/runtime/hidden_states). \n", + "\n", + "Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Offline Batch Inference\n", + "\n", + "The SGLang offline engine supports batch inference with efficient scheduling." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# launch the offline engine\n", + "import asyncio\n", + "\n", + "import sglang as sgl\n", + "import sglang.test.doc_patch # noqa: F401\n", + "from sglang.utils import async_stream_and_merge, stream_and_merge\n", + "\n", + "llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Non-streaming Synchronous Generation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompts = [\n", + " \"Hello, my name is\",\n", + " \"The president of the United States is\",\n", + " \"The capital of France is\",\n", + " \"The future of AI is\",\n", + "]\n", + "\n", + "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n", + "\n", + "outputs = llm.generate(prompts, sampling_params)\n", + "for prompt, output in zip(prompts, outputs):\n", + " print(\"===============================\")\n", + " print(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Streaming Synchronous Generation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompts = [\n", + " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n", + " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n", + " \"Explain possible future trends in artificial intelligence. 
The future of AI is\",\n", + "]\n", + "\n", + "sampling_params = {\n", + " \"temperature\": 0.2,\n", + " \"top_p\": 0.9,\n", + "}\n", + "\n", + "print(\"\\n=== Testing synchronous streaming generation with overlap removal ===\\n\")\n", + "\n", + "for prompt in prompts:\n", + " print(f\"Prompt: {prompt}\")\n", + " merged_output = stream_and_merge(llm, prompt, sampling_params)\n", + " print(\"Generated text:\", merged_output)\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Non-streaming Asynchronous Generation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompts = [\n", + " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n", + " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n", + " \"Explain possible future trends in artificial intelligence. The future of AI is\",\n", + "]\n", + "\n", + "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n", + "\n", + "print(\"\\n=== Testing asynchronous batch generation ===\")\n", + "\n", + "\n", + "async def main():\n", + " outputs = await llm.async_generate(prompts, sampling_params)\n", + "\n", + " for prompt, output in zip(prompts, outputs):\n", + " print(f\"\\nPrompt: {prompt}\")\n", + " print(f\"Generated text: {output['text']}\")\n", + "\n", + "\n", + "asyncio.run(main())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Streaming Asynchronous Generation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompts = [\n", + " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n", + " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n", + " \"Explain possible future trends in artificial intelligence. 
The future of AI is\",\n", + "]\n", + "\n", + "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n", + "\n", + "print(\"\\n=== Testing asynchronous streaming generation (no repeats) ===\")\n", + "\n", + "\n", + "async def main():\n", + " for prompt in prompts:\n", + " print(f\"\\nPrompt: {prompt}\")\n", + " print(\"Generated text: \", end=\"\", flush=True)\n", + "\n", + " # Replace direct calls to async_generate with our custom overlap-aware version\n", + " async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):\n", + " print(cleaned_chunk, end=\"\", flush=True)\n", + "\n", + " print() # New line after each prompt\n", + "\n", + "\n", + "asyncio.run(main())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "llm.shutdown()" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs_new/docs/basic_usage/offline_engine_api.mdx b/docs_new/docs/basic_usage/offline_engine_api.mdx new file mode 100644 index 000000000000..4a814b319563 --- /dev/null +++ b/docs_new/docs/basic_usage/offline_engine_api.mdx @@ -0,0 +1,143 @@ +--- +title: "Offline Engine API" +metatags: + description: "Use SGLang's offline engine for direct batch inference without HTTP server overhead. Supports sync/async and streaming modes." +--- +SGLang provides a direct inference engine that does not require an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. 
Here are two general use cases: + +- Offline Batch Inference +- Custom Server on Top of the Engine + +This document focuses on offline batch inference, demonstrating four different inference modes: + +- Non-streaming synchronous generation +- Streaming synchronous generation +- Non-streaming asynchronous generation +- Streaming asynchronous generation + +Additionally, you can easily build a custom server on top of the SGLang offline engine. A complete example script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py). + +## Nest Asyncio +If you want to use the **Offline Engine** in IPython or other code that already runs an event loop, add the following code first: +```python Example +import nest_asyncio + +nest_asyncio.apply() + +``` + +## Advanced Usage + +The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/tree/main/examples/runtime/hidden_states). + +Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases. + +## Offline Batch Inference + +The SGLang offline engine supports batch inference with efficient scheduling. 
+ +```python Example +# launch the offline engine +import asyncio + +import sglang as sgl +import sglang.test.doc_patch +from sglang.utils import async_stream_and_merge, stream_and_merge + +llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct") +``` + +### Non-streaming Synchronous Generation + +```python Example +prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", +] + +sampling_params = {"temperature": 0.8, "top_p": 0.95} + +outputs = llm.generate(prompts, sampling_params) +for prompt, output in zip(prompts, outputs): + print("===============================") + print(f"Prompt: {prompt}\nGenerated text: {output['text']}") +``` + +### Streaming Synchronous Generation + +```python Example +prompts = [ + "Write a short, neutral self-introduction for a fictional character. Hello, my name is", + "Provide a concise factual statement about France’s capital city. The capital of France is", + "Explain possible future trends in artificial intelligence. The future of AI is", +] + +sampling_params = { + "temperature": 0.2, + "top_p": 0.9, +} + +print("\n=== Testing synchronous streaming generation with overlap removal ===\n") + +for prompt in prompts: + print(f"Prompt: {prompt}") + merged_output = stream_and_merge(llm, prompt, sampling_params) + print("Generated text:", merged_output) + print() +``` + +### Non-streaming Asynchronous Generation + +```python Example +prompts = [ + "Write a short, neutral self-introduction for a fictional character. Hello, my name is", + "Provide a concise factual statement about France’s capital city. The capital of France is", + "Explain possible future trends in artificial intelligence. 
The future of AI is", +] + +sampling_params = {"temperature": 0.8, "top_p": 0.95} + +print("\n=== Testing asynchronous batch generation ===") + +async def main(): + outputs = await llm.async_generate(prompts, sampling_params) + + for prompt, output in zip(prompts, outputs): + print(f"\nPrompt: {prompt}") + print(f"Generated text: {output['text']}") + +asyncio.run(main()) +``` + +### Streaming Asynchronous Generation + +```python Example +prompts = [ + "Write a short, neutral self-introduction for a fictional character. Hello, my name is", + "Provide a concise factual statement about France’s capital city. The capital of France is", + "Explain possible future trends in artificial intelligence. The future of AI is", +] + +sampling_params = {"temperature": 0.8, "top_p": 0.95} + +print("\n=== Testing asynchronous streaming generation (no repeats) ===") + +async def main(): + for prompt in prompts: + print(f"\nPrompt: {prompt}") + print("Generated text: ", end="", flush=True) + + # Replace direct calls to async_generate with our custom overlap-aware version + async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params): + print(cleaned_chunk, end="", flush=True) + + print() # New line after each prompt + +asyncio.run(main()) +``` + +```python Example +llm.shutdown() +``` diff --git a/docs_new/docs/basic_usage/ollama_api.mdx b/docs_new/docs/basic_usage/ollama_api.mdx new file mode 100644 index 000000000000..c92533c3ff74 --- /dev/null +++ b/docs_new/docs/basic_usage/ollama_api.mdx @@ -0,0 +1,157 @@ +--- +title: "Ollama-Compatible API" +metatags: + description: "SGLang provides Ollama API compatibility, allowing you to use the Ollama CLI and Python library with SGLang as the inference backend." +--- +SGLang provides Ollama API compatibility, allowing you to use the Ollama CLI and Python library with SGLang as the inference backend. 
+ +## Prerequisites + + +```bash Command +# Install the Ollama Python library (for Python client usage) +pip install ollama +``` + + +You don't need the Ollama server installed - SGLang acts as the backend. You only need the `ollama` CLI or Python library as the client. + +## Endpoints + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Endpoint | Method | Description |
| --- | --- | --- |
| `/` | GET, HEAD | Health check for Ollama CLI |
| `/api/tags` | GET | List available models |
| `/api/chat` | POST | Chat completions (streaming & non-streaming) |
| `/api/generate` | POST | Text generation (streaming & non-streaming) |
| `/api/show` | POST | Model information |
+ +## Quick Start + +### 1. Launch SGLang Server + + +```bash Command +python -m sglang.launch_server \ + --model Qwen/Qwen2.5-1.5B-Instruct \ + --port 30001 \ + --host 0.0.0.0 +``` + + +The model name used with `ollama run` must match exactly what you passed to `--model`. + +### 2. Use Ollama CLI + + +```bash Command +# List available models +OLLAMA_HOST=http://localhost:30001 ollama list + +# Interactive chat +OLLAMA_HOST=http://localhost:30001 ollama run "Qwen/Qwen2.5-1.5B-Instruct" +``` + + +If connecting to a remote server behind a firewall: + + +```bash Command +# SSH tunnel +ssh -L 30001:localhost:30001 user@gpu-server -N & + +# Then use Ollama CLI as above +OLLAMA_HOST=http://localhost:30001 ollama list +``` + + +### 3. Use Ollama Python Library + +```python Example +import ollama + +client = ollama.Client(host='http://localhost:30001') + +# Non-streaming +response = client.chat( + model='Qwen/Qwen2.5-1.5B-Instruct', + messages=[{'role': 'user', 'content': 'Hello!'}] +) +print(response['message']['content']) + +# Streaming +stream = client.chat( + model='Qwen/Qwen2.5-1.5B-Instruct', + messages=[{'role': 'user', 'content': 'Tell me a story'}], + stream=True +) +for chunk in stream: + print(chunk['message']['content'], end='', flush=True) +``` + +## Smart Router + +For intelligent routing between local Ollama (fast) and remote SGLang (powerful) using an LLM judge, see the [Smart Router documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/ollama/README). + +## Summary + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Component | Purpose |
| --- | --- |
| **Ollama API** | Familiar CLI/API that developers already know |
| **SGLang Backend** | High-performance inference engine |
| **Smart Router** | Intelligent routing - fast local for simple tasks, powerful remote for complex tasks |
diff --git a/docs_new/docs/basic_usage/openai_api.mdx b/docs_new/docs/basic_usage/openai_api.mdx new file mode 100644 index 000000000000..523fa2db6fea --- /dev/null +++ b/docs_new/docs/basic_usage/openai_api.mdx @@ -0,0 +1,7 @@ +--- +title: "OpenAI-Compatible APIs" +description: "Documentation for OpenAI-Compatible APIs" +--- +- [OpenAI API Completions](./openai_api_completions) +- [OpenAI API Vision](./openai_api_vision) +- [OpenAI API Embeddings](./openai_api_embeddings) diff --git a/docs_new/docs/basic_usage/openai_api.rst b/docs_new/docs/basic_usage/openai_api.rst new file mode 100644 index 000000000000..370abe99c567 --- /dev/null +++ b/docs_new/docs/basic_usage/openai_api.rst @@ -0,0 +1,9 @@ +OpenAI-Compatible APIs +====================== + +.. toctree:: + :maxdepth: 1 + + openai_api_completions.ipynb + openai_api_vision.ipynb + openai_api_embeddings.ipynb diff --git a/docs_new/docs/basic_usage/openai_api_completions.ipynb b/docs_new/docs/basic_usage/openai_api_completions.ipynb new file mode 100644 index 000000000000..ffa576ae52c5 --- /dev/null +++ b/docs_new/docs/basic_usage/openai_api_completions.ipynb @@ -0,0 +1,552 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# OpenAI APIs - Completions\n", + "\n", + "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", + "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n", + "\n", + "This tutorial covers the following popular APIs:\n", + "\n", + "- `chat/completions`\n", + "- `completions`\n", + "\n", + "Check out other tutorials to learn about [vision APIs](openai_api_vision.ipynb) for vision-language models and [embedding APIs](openai_api_embeddings.ipynb) for embedding models." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Launch A Server\n", + "\n", + "Launch the server in your terminal and wait for it to initialize." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", + "\n", + "server_process, port = launch_server_cmd(\n", + " \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n", + ")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", + "print(f\"Server started on http://localhost:{port}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Chat Completions\n", + "\n", + "### Usage\n", + "\n", + "The server fully implements the OpenAI API.\n", + "It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.\n", + "You can also specify a custom chat template with `--chat-template` when launching the server." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " messages=[\n", + " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n", + " ],\n", + " temperature=0,\n", + " max_tokens=64,\n", + ")\n", + "\n", + "print_highlight(f\"Response: {response}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Model Thinking/Reasoning Support\n", + "\n", + "Some models support internal reasoning or thinking processes that can be exposed in the API response. 
SGLang provides unified support for various reasoning models through the `chat_template_kwargs` parameter and compatible reasoning parsers.\n", + "\n", + "#### Supported Models and Configuration\n", + "\n", + "| Model Family | Chat Template Parameter | Reasoning Parser | Notes |\n", + "|--------------|------------------------|------------------|--------|\n", + "| DeepSeek-R1 (R1, R1-0528, R1-Distill) | `enable_thinking` | `--reasoning-parser deepseek-r1` | Standard reasoning models |\n", + "| DeepSeek-V3.1 | `thinking` | `--reasoning-parser deepseek-v3` | Hybrid model (thinking/non-thinking modes) |\n", + "| Qwen3 (standard) | `enable_thinking` | `--reasoning-parser qwen3` | Hybrid model (thinking/non-thinking modes) |\n", + "| Qwen3-Thinking | N/A (always enabled) | `--reasoning-parser qwen3-thinking` | Always generates reasoning |\n", + "| Kimi | N/A (always enabled) | `--reasoning-parser kimi` | Kimi thinking models |\n", + "| Gpt-Oss | N/A (always enabled) | `--reasoning-parser gpt-oss` | Gpt-Oss thinking models |\n", + "\n", + "#### Basic Usage\n", + "\n", + "To enable reasoning output, you need to:\n", + "1. Launch the server with the appropriate reasoning parser\n", + "2. Set the model-specific parameter in `chat_template_kwargs`\n", + "3. Optionally set `separate_reasoning: False` if you do not want the reasoning content returned separately (defaults to `True`)\n", + "\n", + "**Note for Qwen3-Thinking models:** These models always generate thinking content and do not support the `enable_thinking` parameter. 
Use `--reasoning-parser qwen3-thinking` or `--reasoning-parser qwen3` to parse the thinking content.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Example: Qwen3 Models\n", + "\n", + "```python\n", + "# Launch server:\n", + "# python3 -m sglang.launch_server --model Qwen/Qwen3-4B --reasoning-parser qwen3\n", + "\n", + "from openai import OpenAI\n", + "\n", + "client = OpenAI(\n", + " api_key=\"EMPTY\",\n", + " base_url=f\"http://127.0.0.1:30000/v1\",\n", + ")\n", + "\n", + "model = \"Qwen/Qwen3-4B\"\n", + "messages = [{\"role\": \"user\", \"content\": \"How many r's are in 'strawberry'?\"}]\n", + "\n", + "response = client.chat.completions.create(\n", + " model=model,\n", + " messages=messages,\n", + " extra_body={\n", + " \"chat_template_kwargs\": {\"enable_thinking\": True},\n", + " \"separate_reasoning\": True\n", + " }\n", + ")\n", + "\n", + "print(\"Reasoning:\", response.choices[0].message.reasoning_content)\n", + "print(\"-\"*100)\n", + "print(\"Answer:\", response.choices[0].message.content)\n", + "```\n", + "\n", + "**Example Output:**\n", + "```\n", + "Reasoning: Okay, so the user is asking how many 'r's are in the word 'strawberry'. Let me think. First, I need to make sure I have the word spelled correctly. Strawberry... S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me break it down.\n", + "\n", + "Starting with 'strawberry', let's write out the letters one by one. S, T, R, A, W, B, E, R, R, Y. Hmm, wait, that's 10 letters. Let me check again. S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). So the letters are S-T-R-A-W-B-E-R-R-Y. \n", + "...\n", + "Therefore, the answer should be three R's in 'strawberry'. But I need to make sure I'm not counting any other letters as R. Let me check again. S, T, R, A, W, B, E, R, R, Y. No other R's. So three in total. 
Yeah, that seems right.\n", + "\n", + "----------------------------------------------------------------------------------------------------\n", + "Answer: The word \"strawberry\" contains **three** letters 'r'. Here's the breakdown:\n", + "\n", + "1. **S-T-R-A-W-B-E-R-R-Y** \n", + " - The **third letter** is 'R'. \n", + " - The **eighth and ninth letters** are also 'R's. \n", + "\n", + "Thus, the total count is **3**. \n", + "\n", + "**Answer:** 3.\n", + "```\n", + "\n", + "**Note:** Setting `\"enable_thinking\": False` (or omitting it) will result in `reasoning_content` being `None`. Qwen3-Thinking models always generate reasoning content and don't support the `enable_thinking` parameter.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Logit Bias Support\n", + "\n", + "SGLang supports the `logit_bias` parameter for both chat completions and completions APIs. This parameter allows you to modify the likelihood of specific tokens being generated by adding bias values to their logits. The bias values can range from -100 to 100, where:\n", + "\n", + "- **Positive values** (0 to 100) increase the likelihood of the token being selected\n", + "- **Negative values** (-100 to 0) decrease the likelihood of the token being selected\n", + "- **-100** effectively prevents the token from being generated\n", + "\n", + "The `logit_bias` parameter accepts a dictionary where keys are token IDs (as strings) and values are the bias amounts (as floats).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Getting Token IDs\n", + "\n", + "To use `logit_bias` effectively, you need to know the token IDs for the words you want to bias. 
Here's how to get token IDs:\n", + "\n", + "```python\n", + "# Get tokenizer to find token IDs\n", + "import tiktoken\n", + "\n", + "# For OpenAI models, use the appropriate encoding\n", + "tokenizer = tiktoken.encoding_for_model(\"gpt-3.5-turbo\") # or your model\n", + "\n", + "# Get token IDs for specific words\n", + "word = \"sunny\"\n", + "token_ids = tokenizer.encode(word)\n", + "print(f\"Token IDs for '{word}': {token_ids}\")\n", + "\n", + "# For SGLang models, you can access the tokenizer through the client\n", + "# and get token IDs for bias\n", + "```\n", + "\n", + "**Important:** The `logit_bias` parameter uses token IDs as string keys, not the actual words.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Example: DeepSeek-V3 Models\n", + "\n", + "DeepSeek-V3 models support thinking mode through the `thinking` parameter:\n", + "\n", + "```python\n", + "# Launch server:\n", + "# python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.1 --tp 8 --reasoning-parser deepseek-v3\n", + "\n", + "from openai import OpenAI\n", + "\n", + "client = OpenAI(\n", + " api_key=\"EMPTY\",\n", + " base_url=f\"http://127.0.0.1:30000/v1\",\n", + ")\n", + "\n", + "model = \"deepseek-ai/DeepSeek-V3.1\"\n", + "messages = [{\"role\": \"user\", \"content\": \"How many r's are in 'strawberry'?\"}]\n", + "\n", + "response = client.chat.completions.create(\n", + " model=model,\n", + " messages=messages,\n", + " extra_body={\n", + " \"chat_template_kwargs\": {\"thinking\": True},\n", + " \"separate_reasoning\": True\n", + " }\n", + ")\n", + "\n", + "print(\"Reasoning:\", response.choices[0].message.reasoning_content)\n", + "print(\"-\"*100)\n", + "print(\"Answer:\", response.choices[0].message.content)\n", + "```\n", + "\n", + "**Example Output:**\n", + "```\n", + "Reasoning: First, the question is: \"How many r's are in 'strawberry'?\"\n", + "\n", + "I need to count the number of times the letter 'r' appears in the word \"strawberry\".\n", + 
"\n", + "Let me write out the word: S-T-R-A-W-B-E-R-R-Y.\n", + "\n", + "Now, I'll go through each letter and count the 'r's.\n", + "...\n", + "So, I have three 'r's in \"strawberry\".\n", + "\n", + "I should double-check. The word is spelled S-T-R-A-W-B-E-R-R-Y. The letters are at positions: 3, 8, and 9 are 'r's. Yes, that's correct.\n", + "\n", + "Therefore, the answer should be 3.\n", + "----------------------------------------------------------------------------------------------------\n", + "Answer: The word \"strawberry\" contains **3** instances of the letter \"r\". Here's a breakdown for clarity:\n", + "\n", + "- The word is spelled: S-T-R-A-W-B-E-R-R-Y\n", + "- The \"r\" appears at the 3rd, 8th, and 9th positions.\n", + "```\n", + "\n", + "**Note:** DeepSeek-V3 models use the `thinking` parameter (not `enable_thinking`) to control reasoning output.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example with logit_bias parameter\n", + "# Note: You need to get the actual token IDs from your tokenizer\n", + "# For demonstration, we'll use some example token IDs\n", + "response = client.chat.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " messages=[\n", + " {\"role\": \"user\", \"content\": \"Complete this sentence: The weather today is\"}\n", + " ],\n", + " temperature=0.7,\n", + " max_tokens=20,\n", + " logit_bias={\n", + " \"12345\": 50, # Increase likelihood of token ID 12345\n", + " \"67890\": -50, # Decrease likelihood of token ID 67890\n", + " \"11111\": 25, # Slightly increase likelihood of token ID 11111\n", + " },\n", + ")\n", + "\n", + "print_highlight(f\"Response with logit bias: {response.choices[0].message.content}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Parameters\n", + "\n", + "The chat completions API accepts OpenAI Chat Completions API's parameters. 
Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n", + "\n", + "SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = client.chat.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " messages=[\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n", + " },\n", + " {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n", + " },\n", + " {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n", + " ],\n", + " temperature=0.3, # Lower temperature for more focused responses\n", + " max_tokens=128, # Reasonable length for a concise response\n", + " top_p=0.95, # Slightly higher for better fluency\n", + " presence_penalty=0.2, # Mild penalty to avoid repetition\n", + " frequency_penalty=0.2, # Mild penalty for more natural language\n", + " n=1, # Single response is usually more stable\n", + " seed=42, # Keep for reproducibility\n", + ")\n", + "\n", + "print_highlight(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Streaming mode is also supported." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Logit Bias Support\n", + "\n", + "The completions API also supports the `logit_bias` parameter with the same functionality as described in the chat completions section above.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "stream = client.chat.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n", + " stream=True,\n", + ")\n", + "for chunk in stream:\n", + " if chunk.choices[0].delta.content is not None:\n", + " print(chunk.choices[0].delta.content, end=\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Returning Routed Experts (MoE Models)\n", + "\n", + "For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example with logit_bias parameter for completions API\n", + "# Note: You need to get the actual token IDs from your tokenizer\n", + "# For demonstration, we'll use some example token IDs\n", + "response = client.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " prompt=\"The best programming language for AI is\",\n", + " temperature=0.7,\n", + " max_tokens=20,\n", + " logit_bias={\n", + " \"12345\": 75, # Strongly favor token ID 12345\n", + " \"67890\": -100, # Completely avoid token ID 67890\n", + " \"11111\": -25, # Slightly discourage token ID 11111\n", + " },\n", + ")\n", + "\n", + "print_highlight(f\"Response with logit bias: {response.choices[0].text}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Completions\n", + "\n", + "### Usage\n", + "Completions API is similar to Chat Completions API, but without the `messages` parameter or chat templates." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = client.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " prompt=\"List 3 countries and their capitals.\",\n", + " temperature=0,\n", + " max_tokens=64,\n", + " n=1,\n", + " stop=None,\n", + ")\n", + "\n", + "print_highlight(f\"Response: {response}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Parameters\n", + "\n", + "The completions API accepts OpenAI Completions API's parameters. 
Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n", + "\n", + "Here is an example of a detailed completions request:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "response = client.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " prompt=\"Write a short story about a space explorer.\",\n", + " temperature=0.7, # Moderate temperature for creative writing\n", + " max_tokens=150, # Longer response for a story\n", + " top_p=0.9, # Balanced diversity in word choice\n", + " stop=[\"\\n\\n\", \"THE END\"], # Multiple stop sequences\n", + " presence_penalty=0.3, # Encourage novel elements\n", + " frequency_penalty=0.3, # Reduce repetitive phrases\n", + " n=1, # Generate one completion\n", + " seed=123, # For reproducible results\n", + ")\n", + "\n", + "print_highlight(f\"Response: {response}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Returning Routed Experts (MoE Models)\n", + "\n", + "For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Structured Outputs (JSON, Regex, EBNF)\n", + "\n", + "For OpenAI compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs.ipynb) for more details.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using LoRA Adapters\n", + "\n", + "SGLang supports LoRA (Low-Rank Adaptation) adapters with OpenAI-compatible APIs. 
You can specify which adapter to use directly in the `model` parameter using the `base-model:adapter-name` syntax.\n", + "\n", + "**Server Setup:**\n", + "```bash\n", + "python -m sglang.launch_server \\\n", + " --model-path qwen/qwen2.5-0.5b-instruct \\\n", + " --enable-lora \\\n", + " --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b\n", + "```\n", + "\n", + "For more details on LoRA serving configuration, see the [LoRA documentation](../advanced_features/lora.ipynb).\n", + "\n", + "**API Call:**\n", + "\n", + "(Recommended) Use the `model:adapter` syntax to specify which adapter to use:\n", + "```python\n", + "response = client.chat.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct:adapter_a\", # ← base-model:adapter-name\n", + " messages=[{\"role\": \"user\", \"content\": \"Convert to SQL: show all users\"}],\n", + " max_tokens=50,\n", + ")\n", + "```\n", + "\n", + "**Backward Compatible: Using `extra_body`**\n", + "\n", + "The old `extra_body` method is still supported for backward compatibility:\n", + "```python\n", + "# Backward compatible method\n", + "response = client.chat.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " messages=[{\"role\": \"user\", \"content\": \"Convert to SQL: show all users\"}],\n", + " extra_body={\"lora_path\": \"adapter_a\"}, # ← old method\n", + " max_tokens=50,\n", + ")\n", + "```\n", + "**Note:** When both `model:adapter` and `extra_body[\"lora_path\"]` are specified, the `model:adapter` syntax takes precedence." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs_new/docs/basic_usage/openai_api_completions.mdx b/docs_new/docs/basic_usage/openai_api_completions.mdx new file mode 100644 index 000000000000..c463fcca5ac7 --- /dev/null +++ b/docs_new/docs/basic_usage/openai_api_completions.mdx @@ -0,0 +1,456 @@ +--- +title: "OpenAI APIs - Completions" +metatags: + description: "This tutorial covers the following popular APIs: 'chat/completions' and 'completions'" +--- +SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. +A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference). + +This tutorial covers the following popular APIs: + +- `chat/completions` +- `completions` + +Check out other tutorials to learn about [vision APIs](./openai_api_vision) for vision-language models and [embedding APIs](./openai_api_embeddings) for embedding models. + +## Launch A Server + +Launch the server in your terminal and wait for it to initialize. + +```python Example +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, print_highlight, terminate_process + +server_process, port = launch_server_cmd( + "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning" +) + +wait_for_server(f"http://localhost:{port}") +print(f"Server started on http://localhost:{port}") +``` + +## Chat Completions + +### Usage + +The server fully implements the OpenAI API. 
+It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available. +You can also specify a custom chat template with `--chat-template` when launching the server. + +```python Example +import openai + +client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None") + +response = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print_highlight(f"Response: {response}") +``` + +### Model Thinking/Reasoning Support + +Some models support internal reasoning or thinking processes that can be exposed in the API response. SGLang provides unified support for various reasoning models through the `chat_template_kwargs` parameter and compatible reasoning parsers. + +#### Supported Models and Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model Family | Chat Template Parameter | Reasoning Parser | Notes |
|---|---|---|---|
| DeepSeek-R1 (R1, R1-0528, R1-Distill) | `enable_thinking` | `--reasoning-parser deepseek-r1` | Standard reasoning models |
| DeepSeek-V3.1 | `thinking` | `--reasoning-parser deepseek-v3` | Hybrid model (thinking/non-thinking modes) |
| Qwen3 (standard) | `enable_thinking` | `--reasoning-parser qwen3` | Hybrid model (thinking/non-thinking modes) |
| Qwen3-Thinking | N/A (always enabled) | `--reasoning-parser qwen3-thinking` | Always generates reasoning |
| Kimi | N/A (always enabled) | `--reasoning-parser kimi` | Kimi thinking models |
| Gpt-Oss | N/A (always enabled) | `--reasoning-parser gpt-oss` | Gpt-Oss thinking models |
+ +#### Basic Usage + +To enable reasoning output, you need to: +1. Launch the server with the appropriate reasoning parser +2. Set the model-specific parameter in `chat_template_kwargs` +3. Optionally use `separate_reasoning: False` to not get reasoning content separately (default to `True`) + + +**Note for Qwen3-Thinking models:** These models always generate thinking content and do not support the `enable_thinking` parameter. Use `--reasoning-parser qwen3-thinking` or `--reasoning-parser qwen3` to parse the thinking content. + + +#### Example: Qwen3 Models + +```python Example +# Launch server: +# python3 -m sglang.launch_server --model Qwen/Qwen3-4B --reasoning-parser qwen3 + +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url=f"http://127.0.0.1:30000/v1", +) + +model = "Qwen/Qwen3-4B" +messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}] + +response = client.chat.completions.create( + model=model, + messages=messages, + extra_body={ + "chat_template_kwargs": {"enable_thinking": True}, + "separate_reasoning": True + } +) + +print("Reasoning:", response.choices[0].message.reasoning_content) +print("-"*100) +print("Answer:", response.choices[0].message.content) +``` + +**ExampleOutput:** +```text Output +Reasoning: Okay, so the user is asking how many 'r's are in the word 'strawberry'. Let me think. First, I need to make sure I have the word spelled correctly. Strawberry... S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me break it down. + +Starting with 'strawberry', let's write out the letters one by one. S, T, R, A, W, B, E, R, R, Y. Hmm, wait, that's 10 letters. Let me check again. S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). So the letters are S-T-R-A-W-B-E-R-R-Y. +... +Therefore, the answer should be three R's in 'strawberry'. But I need to make sure I'm not counting any other letters as R. Let me check again. S, T, R, A, W, B, E, R, R, Y. No other R's. So three in total. 
Yeah, that seems right. + +---------------------------------------------------------------------------------------------------- +Answer: The word "strawberry" contains **three** letters 'r'. Here's the breakdown: + +1. **S-T-R-A-W-B-E-R-R-Y** + - The **third letter** is 'R'. + - The **eighth and ninth letters** are also 'R's. + +Thus, the total count is **3**. + +**Answer:** 3. +``` + +Setting `"enable_thinking": False` (or omitting it) will result in `reasoning_content` being `None`. Qwen3-Thinking models always generate reasoning content and don't support the `enable_thinking` parameter. + + +#### Logit Bias Support + +SGLang supports the `logit_bias` parameter for both chat completions and completions APIs. This parameter allows you to modify the likelihood of specific tokens being generated by adding bias values to their logits. The bias values can range from -100 to 100, where: + +- **Positive values** (0 to 100) increase the likelihood of the token being selected +- **Negative values** (-100 to 0) decrease the likelihood of the token being selected +- **-100** effectively prevents the token from being generated + +The `logit_bias` parameter accepts a dictionary where keys are token IDs (as strings) and values are the bias amounts (as floats). + +#### Getting Token IDs + +To use `logit_bias` effectively, you need to know the token IDs for the words you want to bias. Here's how to get token IDs: + +```python Example +# Get tokenizer to find token IDs +import tiktoken + +# For OpenAI models, use the appropriate encoding +tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo") # or your model + +# Get token IDs for specific words +word = "sunny" +token_ids = tokenizer.encode(word) +print(f"Token IDs for '{word}': {token_ids}") + +# For SGLang models, you can access the tokenizer through the client +# and get token IDs for bias +``` + +**Important:** The `logit_bias` parameter uses token IDs as string keys, not the actual words. 
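
As a minimal sketch of turning words into a `logit_bias` dictionary: the helper `build_logit_bias` below is hypothetical (it is not part of SGLang or the OpenAI client) and accepts any `encode` callable that returns token IDs. The `toy_encode` function is a stand-in that only exists so the snippet runs without downloading a model; it does not reflect any real tokenizer's vocabulary.

```python
from typing import Callable, Dict, Iterable, List


def build_logit_bias(
    encode: Callable[[str], List[int]],
    words: Iterable[str],
    bias: float,
) -> Dict[str, float]:
    """Map every token ID produced for `words` to `bias`, keyed by the ID as a string."""
    if not -100 <= bias <= 100:
        raise ValueError("bias must be within [-100, 100]")
    logit_bias: Dict[str, float] = {}
    for word in words:
        # A single word may be split into several tokens; bias each of them.
        for token_id in encode(word):
            logit_bias[str(token_id)] = bias
    return logit_bias


# Toy stand-in for a real tokenizer (assumption): one fake ID per character.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


print(build_logit_bias(toy_encode, ["hi"], -100))  # {'104': -100, '105': -100}
```

With a served model, one option is to pass the `encode` method of the tokenizer that matches the model (for example a Hugging Face tokenizer loaded with `AutoTokenizer.from_pretrained`). Token IDs differ between tokenizers, and a word preceded by a space often maps to different IDs than the bare word, so it is worth checking the IDs before relying on the bias.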
+ + +#### Example: DeepSeek-V3 Models + +DeepSeek-V3 models support thinking mode through the `thinking` parameter: + +```python Example +# Launch server: +# python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.1 --tp 8 --reasoning-parser deepseek-v3 + +from openai import OpenAI + +client = OpenAI( + api_key="EMPTY", + base_url=f"http://127.0.0.1:30000/v1", +) + +model = "deepseek-ai/DeepSeek-V3.1" +messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}] + +response = client.chat.completions.create( + model=model, + messages=messages, + extra_body={ + "chat_template_kwargs": {"thinking": True}, + "separate_reasoning": True + } +) + +print("Reasoning:", response.choices[0].message.reasoning_content) +print("-"*100) +print("Answer:", response.choices[0].message.content) +``` + +**Example Output:** +```text Output +Reasoning: First, the question is: "How many r's are in 'strawberry'?" + +I need to count the number of times the letter 'r' appears in the word "strawberry". + +Let me write out the word: S-T-R-A-W-B-E-R-R-Y. + +Now, I'll go through each letter and count the 'r's. +... +So, I have three 'r's in "strawberry". + +I should double-check. The word is spelled S-T-R-A-W-B-E-R-R-Y. The letters are at positions: 3, 8, and 9 are 'r's. Yes, that's correct. + +Therefore, the answer should be 3. +---------------------------------------------------------------------------------------------------- +Answer: The word "strawberry" contains **3** instances of the letter "r". Here's a breakdown for clarity: + +- The word is spelled: S-T-R-A-W-B-E-R-R-Y +- The "r" appears at the 3rd, 8th, and 9th positions. +``` + +DeepSeek-V3 models use the `thinking` parameter (not `enable_thinking`) to control reasoning output. 
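
The per-model differences can be collected in one place. The sketch below is illustrative only: `THINKING_PARAM` and `reasoning_extra_body` are hypothetical names, the family keys are labels chosen for this example, and the parameter names are copied from the "Supported Models and Configuration" table in this tutorial.

```python
from typing import Any, Dict, Optional

# Chat-template parameter per model family, as listed in this tutorial.
# None means reasoning is always on, so there is no switch to pass.
THINKING_PARAM: Dict[str, Optional[str]] = {
    "deepseek-r1": "enable_thinking",
    "deepseek-v3.1": "thinking",
    "qwen3": "enable_thinking",
    "qwen3-thinking": None,
    "kimi": None,
    "gpt-oss": None,
}


def reasoning_extra_body(
    family: str, enabled: bool = True, separate: bool = True
) -> Dict[str, Any]:
    """Build an `extra_body` dict for a chat completion request."""
    param = THINKING_PARAM[family]  # KeyError for families not listed above
    body: Dict[str, Any] = {"separate_reasoning": separate}
    if param is not None:
        body["chat_template_kwargs"] = {param: enabled}
    return body


print(reasoning_extra_body("deepseek-v3.1"))
print(reasoning_extra_body("qwen3-thinking"))
```

The returned dictionary can be passed as `extra_body=` to `client.chat.completions.create`, in the same way as the hand-written dictionaries in the Qwen3 and DeepSeek-V3 examples.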
+ + +```python Example +# Example with logit_bias parameter +# Note: You need to get the actual token IDs from your tokenizer +# For demonstration, we'll use some example token IDs +response = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + messages=[ + {"role": "user", "content": "Complete this sentence: The weather today is"} + ], + temperature=0.7, + max_tokens=20, + logit_bias={ + "12345": 50, # Increase likelihood of token ID 12345 + "67890": -50, # Decrease likelihood of token ID 67890 + "11111": 25, # Slightly increase likelihood of token ID 11111 + }, +) + +print_highlight(f"Response with logit bias: {response.choices[0].message.content}") +``` + +### Parameters + +The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details. + +SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor. 
+ +```python Example +response = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + messages=[ + { + "role": "system", + "content": "You are a knowledgeable historian who provides concise responses.", + }, + {"role": "user", "content": "Tell me about ancient Rome"}, + { + "role": "assistant", + "content": "Ancient Rome was a civilization centered in Italy.", + }, + {"role": "user", "content": "What were their major achievements?"}, + ], + temperature=0.3, # Lower temperature for more focused responses + max_tokens=128, # Reasonable length for a concise response + top_p=0.95, # Slightly higher for better fluency + presence_penalty=0.2, # Mild penalty to avoid repetition + frequency_penalty=0.2, # Mild penalty for more natural language + n=1, # Single response is usually more stable + seed=42, # Keep for reproducibility +) + +print_highlight(response.choices[0].message.content) +``` + +Streaming mode is also supported. + +#### Logit Bias Support + +The completions API also supports the `logit_bias` parameter with the same functionality as described in the chat completions section above. + +```python Example +stream = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + messages=[{"role": "user", "content": "Say this is a test"}], + stream=True, +) +for chunk in stream: + if chunk.choices[0].delta.content is not None: + print(chunk.choices[0].delta.content, end="") +``` + +#### Returning Routed Experts (MoE Models) + +For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`. 
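
To make the `routed_experts` layout concrete, here is a minimal decoding sketch using only the standard library. It assumes little-endian int32 values stored in row-major order, which the text above does not state explicitly, and the caller must supply `num_layers` and `top_k` because they are not carried in the payload. The function name `decode_routed_experts` is hypothetical.

```python
import base64
import struct
from typing import List


def decode_routed_experts(
    encoded: str, num_layers: int, top_k: int
) -> List[List[List[int]]]:
    """Decode base64 int32 expert IDs into nested lists shaped
    [num_tokens][num_layers][top_k]."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("payload is not a whole number of int32 values")
    count = len(raw) // 4
    # Assumption: little-endian byte order ("<").
    flat = struct.unpack(f"<{count}i", raw)
    per_token = num_layers * top_k
    if per_token <= 0 or count % per_token != 0:
        raise ValueError("payload size does not match num_layers * top_k")
    num_tokens = count // per_token
    return [
        [
            list(flat[(t * num_layers + l) * top_k : (t * num_layers + l + 1) * top_k])
            for l in range(num_layers)
        ]
        for t in range(num_tokens)
    ]


# Round-trip check with synthetic data: 1 token, 2 layers, top_k = 2.
sample = base64.b64encode(struct.pack("<4i", 3, 7, 1, 4)).decode("ascii")
print(decode_routed_experts(sample, num_layers=2, top_k=2))  # [[[3, 7], [1, 4]]]
```

For large responses, the same bytes can be loaded into an array library and reshaped instead of building nested lists.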
+ +```python Example +# Example with logit_bias parameter for completions API +# Note: You need to get the actual token IDs from your tokenizer +# For demonstration, we'll use some example token IDs +response = client.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + prompt="The best programming language for AI is", + temperature=0.7, + max_tokens=20, + logit_bias={ + "12345": 75, # Strongly favor token ID 12345 + "67890": -100, # Completely avoid token ID 67890 + "11111": -25, # Slightly discourage token ID 11111 + }, +) + +print_highlight(f"Response with logit bias: {response.choices[0].text}") +``` + +## Completions + +### Usage +Completions API is similar to Chat Completions API, but without the `messages` parameter or chat templates. + +```python Example +response = client.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + prompt="List 3 countries and their capitals.", + temperature=0, + max_tokens=64, + n=1, + stop=None, +) + +print_highlight(f"Response: {response}") +``` + +### Parameters + +The completions API accepts OpenAI Completions API's parameters. Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details. 
+ +Here is an example of a detailed completions request: + +```python Example +response = client.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + prompt="Write a short story about a space explorer.", + temperature=0.7, # Moderate temperature for creative writing + max_tokens=150, # Longer response for a story + top_p=0.9, # Balanced diversity in word choice + stop=["\n\n", "THE END"], # Multiple stop sequences + presence_penalty=0.3, # Encourage novel elements + frequency_penalty=0.3, # Reduce repetitive phrases + n=1, # Generate one completion + seed=123, # For reproducible results +) + +print_highlight(f"Response: {response}") +``` + +#### Returning Routed Experts (MoE Models) + +For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`. + +## Structured Outputs (JSON, Regex, EBNF) + +For OpenAI compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs) for more details. + +## Using LoRA Adapters + +SGLang supports LoRA (Low-Rank Adaptation) adapters with OpenAI-compatible APIs. You can specify which adapter to use directly in the `model` parameter using the `base-model:adapter-name` syntax. + +**Server Setup:** +```bash Command +python -m sglang.launch_server \ + --model-path qwen/qwen2.5-0.5b-instruct \ + --enable-lora \ + --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b +``` + +For more details on LoRA serving configuration, see the [LoRA documentation](../advanced_features/lora). 
+ +**API Call:** + +(Recommended) Use the `model:adapter` syntax to specify which adapter to use: +```python Example +response = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct:adapter_a", # ← base-model:adapter-name + messages=[{"role": "user", "content": "Convert to SQL: show all users"}], + max_tokens=50, +) +``` + +**Backward Compatible: Using `extra_body`** + +The old `extra_body` method is still supported for backward compatibility: +```python Example +# Backward compatible method +response = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + messages=[{"role": "user", "content": "Convert to SQL: show all users"}], + extra_body={"lora_path": "adapter_a"}, # ← old method + max_tokens=50, +) +``` +**Note:** When both `model:adapter` and `extra_body["lora_path"]` are specified, the `model:adapter` syntax takes precedence. + +```python Example +terminate_process(server_process) +``` diff --git a/docs_new/docs/basic_usage/openai_api_embeddings.ipynb b/docs_new/docs/basic_usage/openai_api_embeddings.ipynb new file mode 100644 index 000000000000..a6c90c06b5f0 --- /dev/null +++ b/docs_new/docs/basic_usage/openai_api_embeddings.ipynb @@ -0,0 +1,193 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# OpenAI APIs - Embedding\n", + "\n", + "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", + "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings).\n", + "\n", + "This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](../supported_models/retrieval_ranking/embedding_models.md)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Launch A Server\n", + "\n", + "Launch the server in your terminal and wait for it to initialize. 
Remember to add `--is-embedding` to the command." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", + "\n", + "embedding_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n", + " --host 0.0.0.0 --is-embedding --log-level warning\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=embedding_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using cURL" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess, json\n", + "\n", + "text = \"Once upon a time\"\n", + "\n", + "curl_text = f\"\"\"curl -s http://localhost:{port}/v1/embeddings \\\n", + " -H \"Content-Type: application/json\" \\\n", + " -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": \"{text}\"}}'\"\"\"\n", + "\n", + "result = subprocess.check_output(curl_text, shell=True)\n", + "\n", + "print(result)\n", + "\n", + "text_embedding = json.loads(result)[\"data\"][0][\"embedding\"]\n", + "\n", + "print_highlight(f\"Text embedding (first 10): {text_embedding[:10]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using Python Requests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "text = \"Once upon a time\"\n", + "\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/v1/embeddings\",\n", + " json={\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": text},\n", + ")\n", + "\n", + "text_embedding = response.json()[\"data\"][0][\"embedding\"]\n", + "\n", + "print_highlight(f\"Text embedding (first 10): 
{text_embedding[:10]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using OpenAI Python Client" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", + "\n", + "# Text embedding example\n", + "response = client.embeddings.create(\n", + " model=\"Alibaba-NLP/gte-Qwen2-1.5B-instruct\",\n", + " input=text,\n", + ")\n", + "\n", + "embedding = response.data[0].embedding[:10]\n", + "print_highlight(f\"Text embedding (first 10): {embedding}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using Input IDs\n", + "\n", + "SGLang also supports `input_ids` as input to get the embedding." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "from transformers import AutoTokenizer\n", + "\n", + "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"Alibaba-NLP/gte-Qwen2-1.5B-instruct\")\n", + "input_ids = tokenizer.encode(text)\n", + "\n", + "curl_ids = f\"\"\"curl -s http://localhost:{port}/v1/embeddings \\\n", + " -H \"Content-Type: application/json\" \\\n", + " -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": {json.dumps(input_ids)}}}'\"\"\"\n", + "\n", + "input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))[\"data\"][\n", + " 0\n", + "][\"embedding\"]\n", + "\n", + "print_highlight(f\"Input IDs embedding (first 10): {input_ids_embedding[:10]}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(embedding_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Multi-Modal Embedding Model\n", + "Please refer to [Multi-Modal 
Embedding Model](../supported_models/retrieval_ranking/embedding_models.md)" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs_new/docs/basic_usage/openai_api_embeddings.mdx b/docs_new/docs/basic_usage/openai_api_embeddings.mdx new file mode 100644 index 000000000000..a0528a0c407e --- /dev/null +++ b/docs_new/docs/basic_usage/openai_api_embeddings.mdx @@ -0,0 +1,126 @@ +--- +title: "OpenAI APIs - Embedding" +metatags: + description: "This tutorial covers the embedding APIs for embedding models." +--- +SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. +A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings). + +This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](../supported-models) + + + +## Launch A Server + +Launch the server in your terminal and wait for it to initialize. Remember to add `--is-embedding` to the command. 
+ + + +```python Example +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, print_highlight, terminate_process + +embedding_process, port = launch_server_cmd( + """ +python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \ + --host 0.0.0.0 --is-embedding --log-level warning +""" +) + +wait_for_server(f"http://localhost:{port}") +``` + +## Using cURL + + + +```python Example +import subprocess, json + +text = "Once upon a time" + +curl_text = f"""curl -s http://localhost:{port}/v1/embeddings \ + -H "Content-Type: application/json" \ + -d '{{"model": "Alibaba-NLP/gte-Qwen2-1.5B-instruct", "input": "{text}"}}'""" + +result = subprocess.check_output(curl_text, shell=True) + +print(result) + +text_embedding = json.loads(result)["data"][0]["embedding"] + +print_highlight(f"Text embedding (first 10): {text_embedding[:10]}") +``` + +## Using Python Requests + + + +```python Example +import requests + +text = "Once upon a time" + +response = requests.post( + f"http://localhost:{port}/v1/embeddings", + json={"model": "Alibaba-NLP/gte-Qwen2-1.5B-instruct", "input": text}, +) + +text_embedding = response.json()["data"][0]["embedding"] + +print_highlight(f"Text embedding (first 10): {text_embedding[:10]}") +``` + +## Using OpenAI Python Client + + + +```python Example +import openai + +client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None") + +# Text embedding example +response = client.embeddings.create( + model="Alibaba-NLP/gte-Qwen2-1.5B-instruct", + input=text, +) + +embedding = response.data[0].embedding[:10] +print_highlight(f"Text embedding (first 10): {embedding}") +``` + +## Using Input IDs + +SGLang also supports `input_ids` as input to get the embedding. 
+ + + +```python Example +import json +import os +from transformers import AutoTokenizer + +os.environ["TOKENIZERS_PARALLELISM"] = "false" + +tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-1.5B-instruct") +input_ids = tokenizer.encode(text) + +curl_ids = f"""curl -s http://localhost:{port}/v1/embeddings \ + -H "Content-Type: application/json" \ + -d '{{"model": "Alibaba-NLP/gte-Qwen2-1.5B-instruct", "input": {json.dumps(input_ids)}}}'""" + +input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))["data"][ + 0 +]["embedding"] + +print_highlight(f"Input IDs embedding (first 10): {input_ids_embedding[:10]}") +``` + + +```python Example +terminate_process(embedding_process) +``` + +## Multi-Modal Embedding Model +Please refer to [Multi-Modal Embedding Model](../supported-models) diff --git a/docs_new/docs/basic_usage/openai_api_vision.ipynb b/docs_new/docs/basic_usage/openai_api_vision.ipynb new file mode 100644 index 000000000000..b6e6a1a24eb3 --- /dev/null +++ b/docs_new/docs/basic_usage/openai_api_vision.ipynb @@ -0,0 +1,253 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# OpenAI APIs - Vision\n", + "\n", + "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", + "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n", + "This tutorial covers the vision APIs for vision language models.\n", + "\n", + "SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported_models/text_generation/multimodal_language_models.md).\n", + "\n", + "As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Launch A Server\n", + "\n", + "Launch the server in your terminal and wait for it to initialize." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", + "\n", + "example_image_url = \"https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png\"\n", + "logo_image_url = (\n", + " \"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png\"\n", + ")\n", + "\n", + "vision_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=vision_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using cURL\n", + "\n", + "Once the server is up, you can send test requests using curl or requests." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "\n", + "curl_command = f\"\"\"\n", + "curl -s http://localhost:{port}/v1/chat/completions \\\\\n", + " -H \"Content-Type: application/json\" \\\\\n", + " -d '{{\n", + " \"model\": \"Qwen/Qwen2.5-VL-7B-Instruct\",\n", + " \"messages\": [\n", + " {{\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {{\n", + " \"type\": \"text\",\n", + " \"text\": \"What’s in this image?\"\n", + " }},\n", + " {{\n", + " \"type\": \"image_url\",\n", + " \"image_url\": {{\n", + " \"url\": \"{example_image_url}\"\n", + " }}\n", + " }}\n", + " ]\n", + " }}\n", + " ],\n", + " \"max_tokens\": 300\n", + " }}'\n", + "\"\"\"\n", + "\n", + "response = subprocess.check_output(curl_command, shell=True).decode()\n", + "print_highlight(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using Python Requests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "url = f\"http://localhost:{port}/v1/chat/completions\"\n", + "\n", + "data = {\n", + " \"model\": \"Qwen/Qwen2.5-VL-7B-Instruct\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\"type\": \"text\", \"text\": \"What’s in this image?\"},\n", + " {\n", + " \"type\": \"image_url\",\n", + " \"image_url\": {\"url\": example_image_url},\n", + " },\n", + " ],\n", + " }\n", + " ],\n", + " \"max_tokens\": 300,\n", + "}\n", + "\n", + "response = requests.post(url, json=data)\n", + "print_highlight(response.text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using OpenAI Python Client" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": 
[], + "source": [ + "from openai import OpenAI\n", + "\n", + "client = OpenAI(base_url=f\"http://localhost:{port}/v1\", api_key=\"None\")\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n", + " messages=[\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": \"What is in this image?\",\n", + " },\n", + " {\n", + " \"type\": \"image_url\",\n", + " \"image_url\": {\"url\": example_image_url},\n", + " },\n", + " ],\n", + " }\n", + " ],\n", + " max_tokens=300,\n", + ")\n", + "\n", + "print_highlight(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Multiple-Image Inputs\n", + "\n", + "The server also supports multiple images and interleaved text and images if the model supports it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "\n", + "client = OpenAI(base_url=f\"http://localhost:{port}/v1\", api_key=\"None\")\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n", + " messages=[\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"image_url\",\n", + " \"image_url\": {\n", + " \"url\": example_image_url,\n", + " },\n", + " },\n", + " {\n", + " \"type\": \"image_url\",\n", + " \"image_url\": {\n", + " \"url\": logo_image_url,\n", + " },\n", + " },\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": \"I have two very different images. They are not related at all. 
\"\n", + " \"Please describe the first image in one sentence, and then describe the second image in another sentence.\",\n", + " },\n", + " ],\n", + " }\n", + " ],\n", + " temperature=0,\n", + ")\n", + "\n", + "print_highlight(response.choices[0].message.content)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(vision_process)" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs_new/docs/basic_usage/openai_api_vision.mdx b/docs_new/docs/basic_usage/openai_api_vision.mdx new file mode 100644 index 000000000000..e27bf160c94f --- /dev/null +++ b/docs_new/docs/basic_usage/openai_api_vision.mdx @@ -0,0 +1,176 @@ +--- +title: "OpenAI APIs - Vision" +metatags: + description: "This tutorial covers the vision APIs for vision language models." +--- +SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. +A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision). +This tutorial covers the vision APIs for vision language models. + +SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported-models/multimodal_language_models). + +As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py). + +## Launch A Server + +Launch the server in your terminal and wait for it to initialize. 
+ +```python Example +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, print_highlight, terminate_process + +example_image_url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png" +logo_image_url = ( + "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" +) + +vision_process, port = launch_server_cmd(""" +python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning +""") + +wait_for_server(f"http://localhost:{port}", process=vision_process) +``` + +## Using cURL + +Once the server is up, you can send test requests using curl or requests. + +```python Example +import subprocess + +curl_command = f""" +curl -s http://localhost:{port}/v1/chat/completions \\ + -H "Content-Type: application/json" \\ + -d '{{ + "model": "Qwen/Qwen2.5-VL-7B-Instruct", + "messages": [ + {{ + "role": "user", + "content": [ + {{ + "type": "text", + "text": "What’s in this image?" 
+ }}, + {{ + "type": "image_url", + "image_url": {{ + "url": "{example_image_url}" + }} + }} + ] + }} + ], + "max_tokens": 300 + }}' +""" + +response = subprocess.check_output(curl_command, shell=True).decode() +print_highlight(response) +``` + +## Using Python Requests + +```python Example +import requests + +url = f"http://localhost:{port}/v1/chat/completions" + +data = { + "model": "Qwen/Qwen2.5-VL-7B-Instruct", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "What’s in this image?"}, + { + "type": "image_url", + "image_url": {"url": example_image_url}, + }, + ], + } + ], + "max_tokens": 300, +} + +response = requests.post(url, json=data) +print_highlight(response.text) +``` + +## Using OpenAI Python Client + +```python Example +from openai import OpenAI + +client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None") + +response = client.chat.completions.create( + model="Qwen/Qwen2.5-VL-7B-Instruct", + messages=[ + { + "role": "user", + "content": [ + { + "type": "text", + "text": "What is in this image?", + }, + { + "type": "image_url", + "image_url": {"url": example_image_url}, + }, + ], + } + ], + max_tokens=300, +) + +print_highlight(response.choices[0].message.content) +``` + +## Multiple-Image Inputs + +If the model supports it, the server also accepts multiple images and interleaved text and images in a single request. + +```python Example +from openai import OpenAI + +client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None") + +response = client.chat.completions.create( + model="Qwen/Qwen2.5-VL-7B-Instruct", + messages=[ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": example_image_url, + }, + }, + { + "type": "image_url", + "image_url": { + "url": logo_image_url, + }, + }, + { + "type": "text", + "text": "I have two very different images. They are not related at all. 
" + "Please describe the first image in one sentence, and then describe the second image in another sentence.", + }, + ], + } + ], + temperature=0, +) + +print_highlight(response.choices[0].message.content) +``` + +```python Example +terminate_process(vision_process) +``` diff --git a/docs_new/docs/basic_usage/overview.mdx b/docs_new/docs/basic_usage/overview.mdx new file mode 100644 index 000000000000..6f15d1c33b6e --- /dev/null +++ b/docs_new/docs/basic_usage/overview.mdx @@ -0,0 +1,11 @@ +--- +title: Basic Usage +description: Core APIs and common usage patterns for SGLang. +--- + +- [OpenAI-Compatible APIs](./openai_api_completions) — Chat completions, vision, and embeddings +- [Ollama API](./ollama_api) +- [Offline Engine API](./offline_engine_api) +- [Native API](./native_api) +- [Sampling Parameters](./sampling_params) +- [Popular Model Usage](./popular_model_usage) — DeepSeek, GLM, Qwen, Llama, and more diff --git a/docs_new/docs/basic_usage/popular_model_usage.mdx b/docs_new/docs/basic_usage/popular_model_usage.mdx new file mode 100644 index 000000000000..4c5a25e2f511 --- /dev/null +++ b/docs_new/docs/basic_usage/popular_model_usage.mdx @@ -0,0 +1,17 @@ +--- +title: "Popular Model Usage (DeepSeek, GPT-OSS, GLM, Llama, MiniMax, Qwen, and more)" +description: "Documentation for Popular Model Usage (DeepSeek, GPT-OSS, GLM, Llama, MiniMax, Qwen, and more)" +--- +For more usage examples and recipes, visit the [SGLang Cookbook](https://cookbook.sglang.io/). 
+ +- [DeepSeek V3](./deepseek_v3) +- [DeepSeek V3.2](./deepseek_v32) +- [GLM-4.5](./glm45) +- [GLM-V](./glmv) +- [GPT-OSS](./gpt_oss) +- [MiniMax M2](./minimax_m2) +- [Qwen3-Next](./qwen3) +- [Qwen 3.5](./qwen3_5) +- [Qwen3-VL](./qwen3_vl) +- [DeepSeek OCR](./deepseek_ocr) +- [Llama 4](./llama4) diff --git a/docs_new/docs/basic_usage/popular_model_usage.rst b/docs_new/docs/basic_usage/popular_model_usage.rst new file mode 100644 index 000000000000..06d4266618d6 --- /dev/null +++ b/docs_new/docs/basic_usage/popular_model_usage.rst @@ -0,0 +1,16 @@ +Popular Model Usage (DeepSeek, GPT-OSS, GLM, Llama, MiniMax, Qwen, and more) +============================================================================ + +.. toctree:: + :maxdepth: 1 + + deepseek_v3.md + deepseek_v32.md + glm45.md + glmv.md + gpt_oss.md + kimi_k2_5.md + minimax_m2.md + qwen3.md + qwen3_vl.md + llama4.md diff --git a/docs_new/docs/basic_usage/qwen3.mdx b/docs_new/docs/basic_usage/qwen3.mdx new file mode 100644 index 000000000000..4c316dcf8379 --- /dev/null +++ b/docs_new/docs/basic_usage/qwen3.mdx @@ -0,0 +1,42 @@ +--- +title: "Qwen3-Next Usage" +metatags: + description: "Deploy Qwen3-Next with SGLang: 80B hybrid Mamba model, MambaRadixCache prefix caching, EAGLE speculative decoding. Supports H100/H200 GPUs." +--- +SGLang has supported Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking since [this PR](https://github.com/sgl-project/sglang/pull/10233). + +## Launch Qwen3-Next with SGLang + +To serve Qwen3-Next models on 4xH100/H200 GPUs: + +```bash Command +python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4 +``` + +### Configuration Tips +- `--max-mamba-cache-size`: Increase this to enlarge the mamba cache and raise the maximum number of running requests, at the cost of KV cache space. Tune it to your workload. +- `--mamba-ssm-dtype`: `bfloat16` or `float32`. Use `bfloat16` to reduce mamba cache size, or `float32` for more accurate results.
The default setting is `float32`. +- `--mamba-full-memory-ratio`: The ratio of mamba state memory to full KV cache memory. The default is 0.9. + +### Mamba Radix Cache +SGLang supports prefix caching for Qwen3-Next models through `MambaRadixCache`, which improves inference speed by reusing computation results. There are two versions of `MambaRadixCache`: +- `no_buffer`: The default version, which is also used by other hybrid linear models. When it is enabled, SGLang automatically disables the overlap schedule for compatibility reasons. +- `extra_buffer`: An optimized version that is compatible with features like page size > 1, overlap schedule, and speculative decoding. It also supports storing mamba state at branching positions. However, each request requires two extra mamba slots for a ping-pong buffer. To enable it, add the argument `--mamba-scheduler-strategy extra_buffer` when launching the server. + +### EAGLE Speculative Decoding +**Description**: SGLang supports Qwen3-Next models with [EAGLE speculative decoding](../advanced_features/speculative_decoding#EAGLE-Decoding). + +**Usage**: +Add the arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example: + +```bash Command +python3 -m sglang.launch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --tp 4 \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --speculative-algorithm NEXTN +``` + +For details, see [this PR](https://github.com/sgl-project/sglang/pull/10233). 
diff --git a/docs_new/docs/basic_usage/qwen3_5.mdx b/docs_new/docs/basic_usage/qwen3_5.mdx new file mode 100644 index 000000000000..88e897d0856d --- /dev/null +++ b/docs_new/docs/basic_usage/qwen3_5.mdx @@ -0,0 +1,80 @@ +--- +title: "Qwen 3.5 Usage" +metatags: + description: "Qwen 3.5 is Alibaba's latest generation LLM featuring a hybrid attention architecture, advanced MoE with shared experts, and native multimodal capabilities." +--- + +Qwen 3.5 is Alibaba's latest generation LLM featuring a hybrid attention architecture, advanced MoE with shared experts, and native multimodal capabilities. + +Key architecture features: +- **Hybrid Attention**: Gated Delta Networks (linear, O(n) complexity) combined with full attention every 4th layer for high associative recall +- **MoE with Shared Experts**: Top-8 active out of 64 routed experts plus a dedicated shared expert for universal features +- **Multimodal**: DeepStack Vision Transformer with Conv3d for native image and video understanding + +## Launch Qwen 3.5 with SGLang + +### Dense Model + +To serve `Qwen/Qwen3.5-397B-A17B` on 8 GPUs: + +```bash +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --trust-remote-code +``` + +### AMD GPU (MI300X / MI325X / MI35X) + +On AMD Instinct GPUs, use the `triton` attention backend. Both the full attention layers and the Gated Delta Net (linear attention) layers use Triton-based kernels on ROCm: + +```bash +SGLANG_USE_AITER=1 python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --attention-backend triton \ + --trust-remote-code +``` + + +Set `SGLANG_USE_AITER=1` to enable AMD's optimized aiter kernels for MoE and GEMM operations. + + +### Configuration Tips + +- `--attention-backend`: Use `triton` on AMD GPUs for Qwen 3.5. The hybrid attention architecture (Gated Delta Networks + full attention) works best with the Triton backend on ROCm. 
The linear attention (GDN) layers always use Triton kernels internally via the `GDNAttnBackend`. +- `--watchdog-timeout`: Increase to `1200` or higher for this large model, as weight loading takes significant time. +- `--model-loader-extra-config '{"enable_multithread_load": true}'`: Enables parallel weight loading for faster startup. + +### Reasoning and Tool Calling + +Qwen 3.5 supports reasoning and tool calling via the Qwen3 parsers: + +```bash +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3.5-397B-A17B \ + --tp 8 \ + --trust-remote-code \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder +``` + +## Accuracy Evaluation + +You can evaluate the model accuracy using `lm-eval`: + +```bash +pip install lm-eval[api] + +lm_eval --model local-completions \ + --model_args '{"base_url": "http://localhost:8000/v1/completions", "model": "Qwen/Qwen3.5-397B-A17B", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \ + --tasks gsm8k \ + --batch_size auto \ + --num_fewshot 5 \ + --trust_remote_code +``` + +## Additional Resources + +- [AMD Day 0 Support for Qwen 3.5 on AMD Instinct GPUs](https://www.amd.com/en/developer/resources/technical-articles/2026/day-0-support-for-qwen-3-5-on-amd-instinct-gpus.html) +- [HuggingFace Model Card](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) diff --git a/docs_new/docs/basic_usage/qwen3_vl.mdx b/docs_new/docs/basic_usage/qwen3_vl.mdx new file mode 100644 index 000000000000..a98a5b4e8c28 --- /dev/null +++ b/docs_new/docs/basic_usage/qwen3_vl.mdx @@ -0,0 +1,133 @@ +--- +title: "Qwen3-VL Usage" +metatags: + description: "Deploy Qwen3-VL vision models with SGLang: FP8 and BF16 modes, image and video input, expert parallelism. Supports H100, H200, A100 GPUs." +--- +[Qwen3-VL](https://huggingface.co/collections/Qwen/qwen3-vl) +is Alibaba’s latest multimodal large language model with strong text, vision, and reasoning capabilities. 
+SGLang supports the Qwen3-VL family of models with image and video input. + +## Launch commands for SGLang + +Below are suggested launch commands for different hardware and precision modes. + +### FP8 (quantised) mode +For memory-efficient, latency-optimized deployments (e.g., on H100, H200) where an FP8 checkpoint is supported: +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \ + --tp 8 \ + --ep 8 \ + --host 0.0.0.0 \ + --port 30000 \ + --keep-mm-feature-on-device +``` + +### Non-FP8 (BF16 / full precision) mode +For deployments on A100/H100 where BF16 is used (or the FP8 checkpoint is not used): +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \ + --tp 8 \ + --ep 8 \ + --host 0.0.0.0 \ + --port 30000 +``` + +## Hardware-specific notes / recommendations + +- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency. +- On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference. +- On H200 & B200: The model can be run “out of the box”, supporting full context length plus concurrent image + video processing.
+ +## Sending Image/Video Requests + +### Image Input + +```python Example +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "Qwen/Qwen3-VL-30B-A3B-Instruct", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "What’s in this image?"}, + { + "type": "image_url", + "image_url": { + "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true" + }, + }, + ], + } + ], + "max_tokens": 300, +} + +response = requests.post(url, json=data) +print(response.text) +``` + +### Video Input + +```python Example +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "Qwen/Qwen3-VL-30B-A3B-Instruct", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "What’s happening in this video?"}, + { + "type": "video_url", + "video_url": { + "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4" + }, + }, + ], + } + ], + "max_tokens": 300, +} + +response = requests.post(url, json=data) +print(response.text) +``` + +## Important Server Parameters and Flags + +When launching the model server for **multimodal support**, you can use the following command-line arguments to fine-tune performance and behavior: + +- `--mm-attention-backend`: Specifies the multimodal attention backend, e.g. `fa3` (Flash Attention 3). +- `--mm-max-concurrent-calls `: Specifies the **maximum number of concurrent asynchronous multimodal data processing calls** allowed on the server. Use this to control parallel throughput and GPU memory usage during image/video inference. +- `--mm-per-request-timeout `: Defines the **timeout duration (in seconds)** for each multimodal request. If a request exceeds this time limit (e.g., for very large video inputs), it will be automatically terminated. 
+- `--keep-mm-feature-on-device`: Instructs the server to **retain multimodal feature tensors on the GPU** after processing. This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads. +- `SGLANG_USE_CUDA_IPC_TRANSPORT=1`: Enables CUDA IPC transport, backed by a shared memory pool, for multimodal data. This can significantly improve end-to-end latency. + +### Example usage with the above optimizations: +```bash Command +SGLANG_USE_CUDA_IPC_TRANSPORT=1 \ +SGLANG_VLM_CACHE_SIZE_MB=0 \ +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \ + --host 0.0.0.0 \ + --port 30000 \ + --trust-remote-code \ + --tp-size 8 \ + --enable-cache-report \ + --log-level info \ + --max-running-requests 64 \ + --mem-fraction-static 0.65 \ + --chunked-prefill-size 8192 \ + --attention-backend fa3 \ + --mm-attention-backend fa3 \ + --enable-metrics +``` diff --git a/docs_new/docs/basic_usage/sampling_params.mdx b/docs_new/docs/basic_usage/sampling_params.mdx new file mode 100644 index 000000000000..4b271a229c9c --- /dev/null +++ b/docs_new/docs/basic_usage/sampling_params.mdx @@ -0,0 +1,576 @@ +--- +title: "Sampling Parameters" +metatags: + description: "Complete reference for SGLang sampling parameters: temperature, top_p, top_k, frequency penalty, stop tokens, and more." +--- +This doc describes the sampling parameters of the SGLang Runtime, as accepted by its low-level `/generate` endpoint. +If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](./openai_api_completions). + +## `/generate` Endpoint + +The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs. 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentType/DefaultDescription
text`Optional[Union[List[str], str]] = None`The input prompt. Can be a single prompt or a batch of prompts.
input_ids`Optional[Union[List[List[int]], List[int]]] = None`The token IDs for text; one can specify either text or input_ids.
input_embeds`Optional[Union[List[List[List[float]]], List[List[float]]]] = None`The embeddings for input_ids; one can specify either text, input_ids, or input_embeds.
image_data`Optional[Union[List[List[ImageDataItem]], List[ImageDataItem], ImageDataItem]] = None`The image input. Supports three formats: (1) **Raw images**: PIL Image, file path, URL, or base64 string; (2) **Processor output**: Dict with `format: "processor_output"` containing HuggingFace processor outputs; (3) **Precomputed embeddings**: Dict with `format: "precomputed_embedding"` and `feature` containing pre-calculated visual embeddings. Can be a single image, list of images, or list of lists of images. See [Multimodal Input Formats](#multimodal-input-formats) for details.
audio_data`Optional[Union[List[AudioDataItem], AudioDataItem]] = None`The audio input. Can be a file name, URL, or base64 encoded string.
sampling_params`Optional[Union[List[Dict], Dict]] = None`The sampling parameters as described in the sections below.
rid`Optional[Union[List[str], str]] = None`The request ID.
return_logprob`Optional[Union[List[bool], bool]] = None`Whether to return log probabilities for tokens.
logprob_start_len`Optional[Union[List[int], int]] = None`If return_logprob, the start location in the prompt for returning logprobs. Default is "-1", which returns logprobs for output tokens only.
top_logprobs_num`Optional[Union[List[int], int]] = None`If return_logprob, the number of top logprobs to return at each position.
token_ids_logprob`Optional[Union[List[List[int]], List[int]]] = None`If return_logprob, the token IDs to return logprob for.
return_text_in_logprobs`bool = False`Whether to detokenize tokens into text in the returned logprobs.
stream`bool = False`Whether to stream output.
lora_path`Optional[Union[List[Optional[str]], Optional[str]]] = None`The path to the LoRA.
custom_logit_processor`Optional[Union[List[Optional[str]], str]] = None`Custom logit processor for advanced sampling control. Must be a serialized instance of `CustomLogitProcessor` using its `to_str()` method. For usage see below.
return_hidden_states`Union[List[bool], bool] = False`Whether to return hidden states.
return_routed_experts`bool = False`Whether to return routed experts for MoE models. Requires `--enable-return-routed-experts` server flag. Returns base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`.
+ +## Sampling parameters + +The object is defined at `sampling_params.py::SamplingParams`. You can also read the source code to find more arguments and docs. + +### Note on defaults + +By default, SGLang initializes several sampling parameters from the model's `generation_config.json` (when the server is launched with `--sampling-defaults model`, which is the default). To use SGLang/OpenAI constant defaults instead, start the server with `--sampling-defaults openai`. You can always override any parameter per request via `sampling_params`. + +```bash Command +# Use model-provided defaults from generation_config.json (default behavior) +python -m sglang.launch_server --model-path --sampling-defaults model + +# Use SGLang/OpenAI constant defaults instead +python -m sglang.launch_server --model-path --sampling-defaults openai +``` + +### Core parameters + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentType/DefaultDescription
max_new_tokens`int = 128`The maximum output length measured in tokens.
stop`Optional[Union[str, List[str]]] = None`One or multiple [stop words](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop). Generation will stop if one of these words is sampled.
stop_token_ids`Optional[List[int]] = None`Provide stop words in the form of token IDs. Generation will stop if one of these token IDs is sampled.
stop_regex`Optional[Union[str, List[str]]] = None`One or multiple regex patterns. Generation will stop when the output matches any of them.
temperature`float (model default; fallback 1.0)`[Temperature](https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature) when sampling the next token. `temperature = 0` corresponds to greedy sampling, a higher temperature leads to more diversity.
top_p`float (model default; fallback 1.0)`[Top-p](https://platform.openai.com/docs/api-reference/chat/create#chat-create-top_p) selects tokens from the smallest sorted set whose cumulative probability exceeds `top_p`. When `top_p = 1`, this reduces to unrestricted sampling from all tokens.
top_k`int (model default; fallback -1)`[Top-k](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) randomly selects from the `k` highest-probability tokens.
min_p`float (model default; fallback 0.0)`[Min-p](https://github.com/huggingface/transformers/issues/27670) samples from tokens with probability larger than `min_p * highest_token_probability`.
+ +### Penalizers + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentType/DefaultDescription
frequency_penalty`float = 0.0`Penalizes tokens based on their frequency in the generation so far. Must be between `-2` and `2`, where negative values encourage repetition of tokens and positive values encourage sampling of new tokens. The penalty grows linearly with each appearance of a token.
presence_penalty`float = 0.0`Penalizes tokens if they have appeared in the generation so far. Must be between `-2` and `2`, where negative values encourage repetition of tokens and positive values encourage sampling of new tokens. The penalty is constant once a token has occurred.
repetition_penalty`float = 1.0`Scales the logits of previously generated tokens to discourage (values > 1) or encourage (values < 1) repetition. Valid range is `[0, 2]`; `1.0` leaves probabilities unchanged.
min_new_tokens`int = 0`Forces the model to generate at least `min_new_tokens` tokens before a stop word or EOS token can end generation. Note that this might lead to unintended behavior, for example, if the distribution is highly skewed towards these tokens.
+ +### Constrained decoding + +Please refer to our dedicated guide on [constrained decoding](../advanced_features/structured_outputs) for the following parameters. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentType/DefaultDescription
json_schema`Optional[str] = None`JSON schema for structured outputs.
regex`Optional[str] = None`Regex for structured outputs.
ebnf`Optional[str] = None`EBNF for structured outputs.
structural_tag`Optional[str] = None`The structural tag for structured outputs.
+ +### Other options + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentType/DefaultDescription
n`int = 1`Specifies the number of output sequences to generate per request. Generating multiple outputs in one request (n > 1) is discouraged; repeating the same prompt several times offers better control and efficiency.
ignore_eos`bool = False`Don't stop generation when EOS token is sampled.
skip_special_tokens`bool = True`Remove special tokens during decoding.
spaces_between_special_tokens`bool = True`Whether or not to add spaces between special tokens during detokenization.
no_stop_trim`bool = False`Don't trim stop words or EOS token from the generated text.
custom_params`Optional[List[Optional[Dict[str, Any]]]] = None`Used when employing `CustomLogitProcessor`. For usage, see below.
+ +## Examples + +### Normal + +Launch a server: + +```bash Command +python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 +``` + +Send a request: + +```python Example +import requests + +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "The capital of France is", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 32, + }, + }, +) +print(response.json()) +``` + +Detailed example in [send request](./send_request). + +### Streaming + +Send a request and stream the output: + +```python Example +import requests, json + +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "The capital of France is", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 32, + }, + "stream": True, + }, + stream=True, +) + +prev = 0 +for chunk in response.iter_lines(decode_unicode=False): + chunk = chunk.decode("utf-8") + if chunk and chunk.startswith("data:"): + if chunk == "data: [DONE]": + break + data = json.loads(chunk[5:].strip("\n")) + output = data["text"].strip() + print(output[prev:], end="", flush=True) + prev = len(output) +print("") +``` + +Detailed example in [openai compatible api](./openai_api_completions). 
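
The chunk handling in the streaming loop can be tried without a running server. The following is a minimal sketch, assuming (as in the streaming example) that each `data:` line carries a JSON object whose `text` field holds the cumulative output so far and that the stream ends with `data: [DONE]`. The helper name `iter_text_deltas` and the sample lines are hypothetical and exist only for illustration. Unlike the example, this sketch does not strip whitespace.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield only the newly generated text from each streamed chunk.

    Assumption: every "data:" line holds JSON with a cumulative "text"
    field, and the stream is terminated by "data: [DONE]".
    """
    prev = 0
    for raw in lines:
        chunk = raw.decode("utf-8")
        if not chunk.startswith("data:"):
            continue  # skip blank separator lines
        payload = chunk[len("data:"):].strip()
        if payload == "[DONE]":
            break
        text = json.loads(payload)["text"]
        yield text[prev:]  # emit only what was added since the last chunk
        prev = len(text)


# Hypothetical chunks shaped like the lines handled in the streaming loop.
sample_lines = [
    b'data: {"text": " Paris"}',
    b"",
    b'data: {"text": " Paris, a city"}',
    b"data: [DONE]",
]
deltas = list(iter_text_deltas(sample_lines))
print("".join(deltas))
```

Because each chunk repeats the text generated so far, tracking the previous length is what keeps the printed output free of duplicates.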
+ +### Multimodal + +Launch a server: + +```bash Command +python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov +``` + +Download an image: + +```bash Command +curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true +``` + +Send a request: + +```python Example +import requests + +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n" + "<|im_start|>user\n\nDescribe this image in a very short sentence.<|im_end|>\n" + "<|im_start|>assistant\n", + "image_data": "example_image.png", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 32, + }, + }, +) +print(response.json()) +``` + +The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`. + +Streaming is supported in a similar manner as [above](#streaming). + +Detailed example in [OpenAI API Vision](./openai_api_vision). + +### Structured Outputs (JSON, Regex, EBNF) + +You can specify a JSON schema, regular expression or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request. + +SGLang supports two grammar backends: + +- [XGrammar](https://github.com/mlc-ai/xgrammar) (default): Supports JSON schema, regular expression, and EBNF constraints. + - XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README). +- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints. 
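
Since only one of `json_schema`, `regex`, or `ebnf` may be set per request, it can help to check this on the client before sending. The sketch below is one possible way to do that; `build_constrained_params` is a hypothetical helper written for this illustration and is not part of SGLang.

```python
from typing import Any, Dict, Optional


def build_constrained_params(
    json_schema: Optional[str] = None,
    regex: Optional[str] = None,
    ebnf: Optional[str] = None,
    **other: Any,
) -> Dict[str, Any]:
    """Return a sampling_params dict with at most one constraint set."""
    constraints = {"json_schema": json_schema, "regex": regex, "ebnf": ebnf}
    chosen = {k: v for k, v in constraints.items() if v is not None}
    if len(chosen) > 1:
        # Mirror the documented rule: one constraint parameter per request.
        raise ValueError(f"Specify only one of {sorted(chosen)} per request")
    return {**other, **chosen}


params = build_constrained_params(regex="(France|England)", temperature=0)
print(params)
```

The returned dict can be passed as the `sampling_params` field of a `/generate` request body.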
+ +To use the Outlines backend instead, pass the `--grammar-backend outlines` flag: + +```bash Command +python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ +--port 30000 --host 0.0.0.0 --grammar-backend outlines # xgrammar or outlines (default: xgrammar) +``` + +```python Example +import json +import requests + +json_schema = json.dumps({ + "type": "object", + "properties": { + "name": {"type": "string", "pattern": "^[\\w]+$"}, + "population": {"type": "integer"}, + }, + "required": ["name", "population"], +}) + +# JSON (works with both Outlines and XGrammar) +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "Here is the information of the capital of France in the JSON format.\n", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 64, + "json_schema": json_schema, + }, + }, +) +print(response.json()) + +# Regular expression (works with both Outlines and XGrammar) +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "Paris is the capital of", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 64, + "regex": "(France|England)", + }, + }, +) +print(response.json()) + +# EBNF (XGrammar backend only) +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "Write a greeting.", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 64, + "ebnf": 'root ::= "Hello" | "Hi" | "Hey"', + }, + }, +) +print(response.json()) +``` + +Detailed example in [structured outputs](../advanced_features/structured_outputs). + +### Custom logit processor + +Launch a server with the `--enable-custom-logit-processor` flag: + +```bash Command +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3-8B-Instruct \ + --port 30000 \ + --enable-custom-logit-processor +``` + +Define a custom logit processor that always samples a specific token id. 
+ +```python Example +from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor + +class DeterministicLogitProcessor(CustomLogitProcessor): + """A dummy logit processor that changes the logits to always + sample the given token id. + """ + + def __call__(self, logits, custom_param_list): + # Check that the number of logits matches the number of custom parameters + assert logits.shape[0] == len(custom_param_list) + key = "token_id" + + for i, param_dict in enumerate(custom_param_list): + # Mask all other tokens + logits[i, :] = -float("inf") + # Assign highest probability to the specified token + logits[i, param_dict[key]] = 0.0 + return logits +``` + +Send a request: + +```python Example +import requests + +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "The capital of France is", + "custom_logit_processor": DeterministicLogitProcessor().to_str(), + "sampling_params": { + "temperature": 0.0, + "max_new_tokens": 32, + "custom_params": {"token_id": 5}, + }, + }, +) +print(response.json()) +``` + +Send an OpenAI chat completion request: + +```python Example +import openai +from sglang.utils import print_highlight + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="meta-llama/Meta-Llama-3-8B-Instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0.0, + max_tokens=32, + extra_body={ + "custom_logit_processor": DeterministicLogitProcessor().to_str(), + "custom_params": {"token_id": 5}, + }, +) + +print_highlight(f"Response: {response}") +``` diff --git a/docs_new/docs/basic_usage/send_request.ipynb b/docs_new/docs/basic_usage/send_request.ipynb new file mode 100644 index 000000000000..968a23b8d632 --- /dev/null +++ b/docs_new/docs/basic_usage/send_request.ipynb @@ -0,0 +1,251 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Sending 
Requests\n", + "This notebook provides a quick-start guide to use SGLang in chat completions after installation. Once your server is running, API documentation is available at `http://localhost:30000/docs` (Swagger UI), `http://localhost:30000/redoc` (ReDoc), or `http://localhost:30000/openapi.json` (OpenAPI spec, useful for AI agents). Replace `30000` with your port if using a different one.\n", + "\n", + "- For Vision Language Models, see [OpenAI APIs - Vision](openai_api_vision.ipynb).\n", + "- For Embedding Models, see [OpenAI APIs - Embedding](openai_api_embeddings.ipynb) and [Encode (embedding model)](native_api.html#Encode-(embedding-model)).\n", + "- For Reward Models, see [Classify (reward model)](native_api.html#Classify-(reward-model))." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Launch A Server" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", + "\n", + "# This is equivalent to running the following command in your terminal\n", + "# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\n", + "\n", + "server_process, port = launch_server_cmd(\"\"\"\n", + "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n", + " --host 0.0.0.0 --log-level warning\n", + "\"\"\")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using cURL\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess, json\n", + "\n", + "curl_command = f\"\"\"\n", + "curl -s http://localhost:{port}/v1/chat/completions \\\n", + " -H \"Content-Type: application/json\" \\\n", + " -d '{{\"model\": 
\"qwen/qwen2.5-0.5b-instruct\", \"messages\": [{{\"role\": \"user\", \"content\": \"What is the capital of France?\"}}]}}'\n", + "\"\"\"\n", + "\n", + "response = json.loads(subprocess.check_output(curl_command, shell=True))\n", + "print_highlight(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using Python Requests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "url = f\"http://localhost:{port}/v1/chat/completions\"\n", + "\n", + "data = {\n", + " \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n", + " \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n", + "}\n", + "\n", + "response = requests.post(url, json=data)\n", + "print_highlight(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using OpenAI Python Client" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", + "\n", + "response = client.chat.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " messages=[\n", + " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n", + " ],\n", + " temperature=0,\n", + " max_tokens=64,\n", + ")\n", + "print_highlight(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Streaming" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", + "\n", + "# Use stream=True for streaming responses\n", + "response = client.chat.completions.create(\n", + " model=\"qwen/qwen2.5-0.5b-instruct\",\n", + " messages=[\n", + " {\"role\": \"user\", 
\"content\": \"List 3 countries and their capitals.\"},\n", + " ],\n", + " temperature=0,\n", + " max_tokens=64,\n", + " stream=True,\n", + ")\n", + "\n", + "# Handle the streaming output\n", + "for chunk in response:\n", + " if chunk.choices[0].delta.content:\n", + " print(chunk.choices[0].delta.content, end=\"\", flush=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using Native Generation APIs\n", + "\n", + "You can also use the native `/generate` endpoint with requests, which provides more flexibility. An API reference is available at [Sampling Parameters](sampling_params.md)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": \"The capital of France is\",\n", + " \"sampling_params\": {\n", + " \"temperature\": 0,\n", + " \"max_new_tokens\": 32,\n", + " },\n", + " },\n", + ")\n", + "\n", + "print_highlight(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Streaming" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests, json\n", + "\n", + "response = requests.post(\n", + " f\"http://localhost:{port}/generate\",\n", + " json={\n", + " \"text\": \"The capital of France is\",\n", + " \"sampling_params\": {\n", + " \"temperature\": 0,\n", + " \"max_new_tokens\": 32,\n", + " },\n", + " \"stream\": True,\n", + " },\n", + " stream=True,\n", + ")\n", + "\n", + "prev = 0\n", + "for chunk in response.iter_lines(decode_unicode=False):\n", + " chunk = chunk.decode(\"utf-8\")\n", + " if chunk and chunk.startswith(\"data:\"):\n", + " if chunk == \"data: [DONE]\":\n", + " break\n", + " data = json.loads(chunk[5:].strip(\"\\n\"))\n", + " output = data[\"text\"]\n", + " print(output[prev:], end=\"\", flush=True)\n", + " 
prev = len(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs_new/docs/basic_usage/send_request.mdx b/docs_new/docs/basic_usage/send_request.mdx new file mode 100644 index 000000000000..15ce9b121aa6 --- /dev/null +++ b/docs_new/docs/basic_usage/send_request.mdx @@ -0,0 +1,172 @@ +--- +title: "Tutorial: Sending a request" +metatags: + description: "This notebook provides a quick-start guide to use SGLang in chat completions after installation. " +--- +This notebook provides a quick-start guide to use SGLang in chat completions after installation. Once your server is running, API documentation is available at `http://localhost:30000/docs` (Swagger UI), `http://localhost:30000/redoc` (ReDoc), or `http://localhost:30000/openapi.json` (OpenAPI spec, useful for AI agents). Replace `30000` with your port if using a different one. + +- For Vision Language Models, see [OpenAI APIs - Vision](./openai_api_vision). +- For Embedding Models, see [OpenAI APIs - Embedding](./openai_api_embeddings) and [Encode (embedding model)](./native_api#encode-embedding-model). +- For Reward Models, see [Classify (reward model)](./native_api#classify-reward-model). 
+ + +## Launch A Server + + + +```python Example +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import wait_for_server, print_highlight, terminate_process + +# This is equivalent to running the following command in your terminal +# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 + +server_process, port = launch_server_cmd( + """ +python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \ + --host 0.0.0.0 --log-level warning +""" +) + +wait_for_server(f"http://localhost:{port}") +``` + +## Using cURL + + + + +```python Example +import subprocess, json + +curl_command = f""" +curl -s http://localhost:{port}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{{"model": "qwen/qwen2.5-0.5b-instruct", "messages": [{{"role": "user", "content": "What is the capital of France?"}}]}}' +""" + +response = json.loads(subprocess.check_output(curl_command, shell=True)) +print_highlight(response) +``` + +## Using Python Requests + + + +```python Example +import requests + +url = f"http://localhost:{port}/v1/chat/completions" + +data = { + "model": "qwen/qwen2.5-0.5b-instruct", + "messages": [{"role": "user", "content": "What is the capital of France?"}], +} + +response = requests.post(url, json=data) +print_highlight(response.json()) +``` + +## Using OpenAI Python Client + + + +```python Example +import openai + +client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None") + +response = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) +print_highlight(response) +``` + +### Streaming + + + +```python Example +import openai + +client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None") + +# Use stream=True for streaming responses +response = client.chat.completions.create( + 
model="qwen/qwen2.5-0.5b-instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, + stream=True, +) + +# Handle the streaming output +for chunk in response: + if chunk.choices[0].delta.content: + print(chunk.choices[0].delta.content, end="", flush=True) +``` + +## Using Native Generation APIs + +You can also use the native `/generate` endpoint with requests, which provides more flexibility. An API reference is available at [Sampling Parameters](./sampling_params). + + + +```python Example +import requests + +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": "The capital of France is", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 32, + }, + }, +) + +print_highlight(response.json()) +``` +### Streaming + + + +```python Example +import requests, json + +response = requests.post( + f"http://localhost:{port}/generate", + json={ + "text": "The capital of France is", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 32, + }, + "stream": True, + }, + stream=True, +) + +prev = 0 +for chunk in response.iter_lines(decode_unicode=False): + chunk = chunk.decode("utf-8") + if chunk and chunk.startswith("data:"): + if chunk == "data: [DONE]": + break + data = json.loads(chunk[5:].strip("\n")) + output = data["text"] + print(output[prev:], end="", flush=True) + prev = len(output) +``` + +```python Example +terminate_process(server_process) +``` diff --git a/docs_new/docs/developer_guide/bench_serving.mdx b/docs_new/docs/developer_guide/bench_serving.mdx new file mode 100644 index 000000000000..b77097808637 --- /dev/null +++ b/docs_new/docs/developer_guide/bench_serving.mdx @@ -0,0 +1,393 @@ +--- +title: "Bench Serving Guide" +metatags: + description: "SGLang bench_serving: benchmark throughput, TTFT, ITL with random/sharegpt/image datasets. Multi-backend support." 
+--- +This guide explains how to benchmark online serving throughput and latency using `python -m sglang.bench_serving`. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs. + +### What it does + +- Generates synthetic or dataset-driven prompts and submits them to a target serving endpoint +- Measures throughput, time-to-first-token (TTFT), inter-token latency (ITL), per-request end-to-end latency, and more +- Supports streaming or non-streaming modes, rate control, and concurrency limits + +### Supported backends and endpoints + +- `sglang` / `sglang-native`: `POST /generate` +- `sglang-oai`, `vllm`, `lmdeploy`: `POST /v1/completions` +- `sglang-oai-chat`, `vllm-chat`, `lmdeploy-chat`: `POST /v1/chat/completions` +- `trt` (TensorRT-LLM): `POST /v2/models/ensemble/generate_stream` +- `gserver`: Custom server (Not Implemented yet in this script) +- `truss`: `POST /v1/models/model:predict` + +If `--base-url` is provided, requests are sent to it. Otherwise, `--host` and `--port` are used. When `--model` is not provided, the script will attempt to query `GET /v1/models` for an available model ID (OpenAI-compatible endpoints). + +### Prerequisites + +- Python 3.10+ +- Dependencies typically used by this script: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install as needed. 
+- An inference server running and reachable via the endpoints above +- If your server requires authentication, set environment variable `OPENAI_API_KEY` (used as `Authorization: Bearer `) + +### Quick start + +Run a basic benchmark against an sglang server exposing `/generate`: + +```bash Command +python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct +``` + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + --num-prompts 1000 \ + --model meta-llama/Llama-3.1-8B-Instruct +``` + +Or, using an OpenAI-compatible endpoint (completions): + +```bash Command +python3 -m sglang.bench_serving \ + --backend vllm \ + --base-url http://127.0.0.1:8000 \ + --num-prompts 1000 \ + --model meta-llama/Llama-3.1-8B-Instruct +``` + +### Datasets + +Select with `--dataset-name`: + +- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len` +- `random`: random text lengths; sampled from ShareGPT token space +- `random-ids`: random token ids (can lead to gibberish) +- `image`: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types +- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions +- `mmmu`: samples from MMMU (Math split) and includes images + +Common dataset flags: + +- `--num-prompts N`: number of requests +- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/image +- `--image-count`: Number of images per request (for `image` dataset). 
+ +- `--apply-chat-template`: apply tokenizer chat template when constructing prompts +- `--dataset-path PATH`: file path for ShareGPT json; if blank and missing, it will be downloaded and cached + +Generated Shared Prefix flags (for `generated-shared-prefix`): + +- `--gsp-num-groups` +- `--gsp-prompts-per-group` +- `--gsp-system-prompt-len` +- `--gsp-question-len` +- `--gsp-output-len` + +Image dataset flags (for `image`): + +- `--image-count`: Number of images per request +- `--image-resolution`: Image resolution; supports presets (4k, 1080p, 720p, 360p) or custom 'heightxwidth' format (e.g., 1080x1920, 512x768) +- `--image-format`: Image format (jpeg or png) +- `--image-content`: Image content type (random or blank) + +### Examples + +1. To benchmark image dataset with 3 images per request, 500 prompts, 512 input length, and 512 output length, you can run: + +```bash Command +python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache +``` + +```bash Command +python -m sglang.bench_serving \ + --backend sglang-oai-chat \ + --dataset-name image \ + --num-prompts 500 \ + --image-count 3 \ + --image-resolution 720p \ + --random-input-len 512 \ + --random-output-len 512 +``` + +2. To benchmark random dataset with 3000 prompts, 1024 input length, and 1024 output length, you can run: + +```bash Command +python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct +``` + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 3000 \ + --random-input 1024 \ + --random-output 1024 \ + --random-range-ratio 0.5 +``` + +### Choosing model and tokenizer + +- `--model` is required unless the backend exposes `GET /v1/models`, in which case the first model ID is auto-selected. +- `--tokenizer` defaults to `--model`. Both can be HF model IDs or local paths. 
+- For ModelScope workflows, setting `SGLANG_USE_MODELSCOPE=true` enables fetching via ModelScope (weights are skipped for speed). +- If your tokenizer lacks a chat template, the script warns because token counting can be less robust for gibberish outputs. + +### Rate, concurrency, and streaming + +- `--request-rate`: requests per second. `inf` sends all immediately (burst). Non-infinite rate uses a Poisson process for arrival times. +- `--max-concurrency`: caps concurrent in-flight requests regardless of arrival rate. +- `--disable-stream`: switch to non-streaming mode when supported; TTFT then equals total latency for chat completions. + +### Other key options + +- `--output-file FILE.jsonl`: append JSONL results to file; auto-named if unspecified +- `--output-details`: include per-request arrays (generated texts, errors, ttfts, itls, input/output lens) +- `--extra-request-body '{"top_p":0.9,"temperature":0.6}'`: merged into payload (sampling params, etc.) +- `--disable-ignore-eos`: pass through EOS behavior (varies by backend) +- `--warmup-requests N`: run warmup requests with short output first (default 1) +- `--flush-cache`: call `/flush_cache` (sglang) before main run +- `--profile`: call `/start_profile` and `/stop_profile` (requires server to enable profiling, e.g., `SGLANG_TORCH_PROFILER_DIR`) +- `--lora-name name1 name2 ...`: randomly pick one per request and pass to backend (e.g., `lora_path` for sglang) +- `--tokenize-prompt`: send integer IDs instead of text (currently supports `--backend sglang` only) + +### Authentication + +If your target endpoint requires OpenAI-style auth, set: + +```bash Command +export OPENAI_API_KEY=sk-...yourkey... +``` + +The script will add `Authorization: Bearer $OPENAI_API_KEY` automatically for OpenAI-compatible routes. 
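As an illustration of the behavior described above, the header can be derived from the environment roughly as follows. This is a sketch under assumptions, not the script's actual code; the helper name `build_auth_headers` is hypothetical.

```python
import os
from typing import Dict, Mapping


def build_auth_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return an OpenAI-style Authorization header if OPENAI_API_KEY is set.

    Hypothetical helper for illustration; not bench_serving's implementation.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        return {}  # no key configured: send the request without auth
    return {"Authorization": f"Bearer {api_key}"}


print(build_auth_headers({"OPENAI_API_KEY": "sk-example"}))
print(build_auth_headers({}))
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dictionary instead of mutating the real process environment.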
+ +### Metrics explained + +Printed after each run: + +- Request throughput (req/s) +- Input token throughput (tok/s) - includes both text and vision tokens +- Output token throughput (tok/s) +- Total token throughput (tok/s) - includes both text and vision tokens +- Total input text tokens and Total input vision tokens - per-modality breakdown +- Concurrency: aggregate time of all requests divided by wall time +- End-to-End Latency (ms): mean/median/std/p99 per-request total latency +- Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode +- Inter-Token Latency (ITL, ms): mean/median/std/p95/p99/max between tokens +- TPOT (ms): Token processing time after first token, i.e., `(latency - ttft)/(tokens-1)` +- Accept length (sglang-only, if available): speculative decoding accept length + +The script also retokenizes generated text with the configured tokenizer and reports "retokenized" counts. + +### JSONL output format + +When `--output-file` is set, one JSON object is appended per run. Base fields: + +- Arguments summary: backend, dataset, request_rate, max_concurrency, etc. 
+- Duration and totals: completed, total_input_tokens, total_output_tokens, retokenized totals +- Throughputs and latency statistics as printed in the console +- `accept_length` when available (sglang) + +With `--output-details`, an extended object also includes arrays: + +- `input_lens`, `output_lens` +- `ttfts`, `itls` (per request: ITL arrays) +- `generated_texts`, `errors` + +### End-to-end examples + +1) sglang native `/generate` (streaming): + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --dataset-name random \ + --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \ + --num-prompts 2000 \ + --request-rate 100 \ + --max-concurrency 512 \ + --output-file sglang_random.jsonl --output-details +``` + +2) OpenAI-compatible Completions (e.g., vLLM): + +```bash Command +python3 -m sglang.bench_serving \ + --backend vllm \ + --base-url http://127.0.0.1:8000 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --dataset-name sharegpt \ + --num-prompts 1000 \ + --sharegpt-output-len 256 +``` + +3) OpenAI-compatible Chat Completions (streaming): + +```bash Command +python3 -m sglang.bench_serving \ + --backend vllm-chat \ + --base-url http://127.0.0.1:8000 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --dataset-name random \ + --num-prompts 500 \ + --apply-chat-template +``` + +4) Images (VLM) with chat template: + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + --model your-vlm-model \ + --dataset-name image \ + --image-count 2 \ + --image-resolution 720p \ + --random-input-len 128 --random-output-len 256 \ + --num-prompts 200 \ + --apply-chat-template +``` + +4a) Images with custom resolution: + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + --model your-vlm-model \ + --dataset-name image \ + --image-count 1 \ + 
--image-resolution 512x768 \ + --random-input-len 64 --random-output-len 128 \ + --num-prompts 100 \ + --apply-chat-template +``` + +4b) 1080p images with PNG format and blank content: + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + --model your-vlm-model \ + --dataset-name image \ + --image-count 1 \ + --image-resolution 1080p \ + --image-format png \ + --image-content blank \ + --random-input-len 64 --random-output-len 128 \ + --num-prompts 100 \ + --apply-chat-template +``` + +5) Generated shared prefix (long system prompts + short questions): + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --dataset-name generated-shared-prefix \ + --gsp-num-groups 64 --gsp-prompts-per-group 16 \ + --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \ + --num-prompts 1024 +``` + +6) Tokenized prompts (ids) for strict length control (sglang only): + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --dataset-name random \ + --tokenize-prompt \ + --random-input-len 2048 --random-output-len 256 --random-range-ratio 0.2 +``` + +7) Profiling and cache flush (sglang): + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --profile \ + --flush-cache +``` + +8) TensorRT-LLM streaming endpoint: + +```bash Command +python3 -m sglang.bench_serving \ + --backend trt \ + --base-url http://127.0.0.1:8000 \ + --model your-trt-llm-model \ + --dataset-name random \ + --num-prompts 100 \ + --disable-ignore-eos +``` + +9) Evaluating large-scale KVCache sharing with mooncake trace (sglang only): + +```bash Command +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30000 \ + 
--model model-name \ + --dataset-name mooncake \ + --mooncake-slowdown-factor 1.0 \ + --mooncake-num-rounds 1000 \ + --mooncake-workload conversation \ + --use-trace-timestamps true \ + --random-output-len 256 +``` + +`--mooncake-workload` accepts one of `conversation`, `mooncake`, `agent`, or `synthetic`. + +10) Fake decode stress testing (PD disaggregation, decode-only): + +When benchmarking pure decode performance in a PD disaggregation setup, you can bypass the prefill node entirely by using `--fake-prefill`. This requires the decode server to be started with `--disaggregation-transfer-backend fake`: + +```bash Command +# Step 1: Start a decode-only server with fake transfer backend +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend fake \ + --port 30001 + +# Step 2: Run bench_serving with --fake-prefill +python3 -m sglang.bench_serving \ + --backend sglang \ + --host 127.0.0.1 --port 30001 \ + --model meta-llama/Llama-3.1-8B-Instruct \ + --dataset-name random \ + --num-prompts 500 \ + --random-input-len 1024 --random-output-len 256 \ + --fake-prefill +``` + +Similarly, `bench_one_batch_server` also supports `--fake-prefill`: + +```bash Command +python3 -m sglang.bench_one_batch_server \ + --base-url http://127.0.0.1:30001 \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --batch-size 32 --input-len 1024 --output-len 256 \ + --fake-prefill +``` + +The `--fake-prefill` flag automatically injects special sentinel values into each request, telling the decode server to skip real KV transfer and generate fake KV data locally. + +### Troubleshooting + +- All requests failed: verify `--backend`, server URL/port, `--model`, and authentication. Check warmup errors printed by the script. +- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate. 
+- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent. +- Image/MMMU datasets: ensure you installed extra deps (`pillow`, `datasets`, `pybase64`). +- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server. + +### Notes + +- The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to help with many concurrent connections. +- For sglang, `/server_info` is queried post-run to report speculative decoding accept length when available. diff --git a/docs_new/docs/developer_guide/benchmark_and_profiling.mdx b/docs_new/docs/developer_guide/benchmark_and_profiling.mdx new file mode 100644 index 000000000000..33c94b5b29d9 --- /dev/null +++ b/docs_new/docs/developer_guide/benchmark_and_profiling.mdx @@ -0,0 +1,503 @@ +--- +title: "Benchmark and Profiling" +metatags: + description: "SGLang benchmarking and profiling: PyTorch Profiler, Nsight Systems, layerwise NVTX, PD disaggregation profiling." +--- +## Benchmark + +SGLang provides four benchmark tools that operate at different levels of the stack. The table below summarizes their key differences: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Tool | HTTP Server | Scheduler | Use Case |
| --- | --- | --- | --- |
| `bench_serving` | Yes (async HTTP client to a running server) | Yes (indirectly, via server) | Realistic online serving benchmarks with latency metrics (TTFT, TPOT, ITL) |
| `bench_one_batch_server` | Yes (sends HTTP requests to a running server) | Yes (indirectly, via server) | End-to-end single-batch latency including HTTP and scheduler overhead |
| `bench_offline_throughput` | No | Yes (directly uses Engine in-process) | Maximum throughput measurement without HTTP overhead |
| `bench_one_batch` | No | No (directly calls ModelRunner) | Kernel-level latency profiling of a single static batch |
+ +Use `bench_serving` by default unless there are specific needs. + +**`bench_serving`** is an async HTTP load-testing client that sends requests at controlled rates with configurable concurrency to a running server. It measures realistic online serving metrics including time-to-first-token (TTFT), time-per-output-token (TPOT), inter-token latency (ITL), and throughput. Use `num-prompts >= 5 * max-concurrency` to measure steady-state performance. Launch a server with `sglang.launch_server` first. + + ```bash Command + python3 -m sglang.bench_serving --backend sglang --max-concurrency 16 --num-prompts 80 --random-input-len 256 --random-output-len 32 --dataset-name random + ``` + +**`bench_one_batch_server`** sends a single batch as one HTTP request to a running server. Due to only having a single batch, the server is never in a steady-state and metrics will be biased. Launch a server with `sglang.launch_server` first. + + ```bash Command + python3 -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32 + ``` + +**`bench_offline_throughput`** directly instantiates the `Engine` object in-process (no HTTP server) and submits all requests at once via `engine.generate()`. The engine's scheduler handles batching and execution. This measures maximum achievable throughput without any network overhead. + + ```bash Command + python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 + ``` + +**`bench_one_batch`** is the lowest-level tool. It directly instantiates a `ModelRunner` and calls `extend()` / `decode()` on a fixed static batch, bypassing the scheduler entirely. The prefill and decode phases are run separately, making profiling easier but rendering the metrics unrealistic. 
Because there is no dynamic batching, it may run out of memory for batch sizes that a real server can handle (a real server chunks prefill into smaller batches). This is best suited for profiling individual kernel performance. + + ```bash Command + python3 -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32 + ``` + +## Profile with PyTorch Profiler + +[PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) is a convenient basic tool for inspecting kernel execution time, call stacks, and kernel overlap and occupancy. + +### Profile a server with `sglang.bench_serving` + +```bash Command +# set trace path +export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log + +# start server +python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct + +# send profiling request from client +python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile +``` + +For `bench_serving --profile`, the output directory is selected on the client side from `--profile-output-dir` or `SGLANG_TORCH_PROFILER_DIR` (fallback: `/tmp`), then sent in the `/start_profile` request. +If you call `/start_profile` directly and do not provide `output_dir`, the server uses its own `SGLANG_TORCH_PROFILER_DIR` (fallback: `/tmp`). + +Setting `SGLANG_TORCH_PROFILER_DIR` on both server and client is still recommended to avoid confusion about where traces are written. + +For more details, please refer to [Bench Serving Guide](./bench_serving). + +### Profile In PD Disaggregation Mode + +When profiling in PD disaggregation mode, prefill and decode workers **must be profiled separately** due to torch profiler limitations.
The `bench_serving` command provides dedicated options for this: + +#### Profile Prefill Workers + +```bash Command +# set trace path +export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log + +# start prefill and decode servers (see PD disaggregation docs for setup) +python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill +python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1 + +# start router +python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000 + +# send profiling request targeting prefill workers +python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile --pd-separated --profile-prefill-url http://127.0.0.1:30000 +``` + +#### Profile Decode Workers + +```bash Command +# send profiling request targeting decode workers +python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile --pd-separated --profile-decode-url http://127.0.0.1:30001 +``` + +#### Important Notes + +- `--profile-prefill-url` and `--profile-decode-url` are **mutually exclusive** - you cannot profile both at the same time +- Both options support multiple worker URLs for multi-instance setups: + ```bash Command + # Profile multiple prefill workers + python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --profile --pd-separated --profile-prefill-url http://127.0.0.1:30000 http://127.0.0.1:30002 + + # Profile multiple decode workers + python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --profile --pd-separated --profile-decode-url http://127.0.0.1:30001 http://127.0.0.1:30003 + ``` +- Make sure 
`SGLANG_TORCH_PROFILER_DIR` is set on all worker nodes before starting the servers +- For more details on setting up PD disaggregation, see [PD Disaggregation Guide](../advanced_features/pd_disaggregation) + +### Profile a server with `sglang.bench_offline_throughput` +```bash Command +export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log + +# profile one batch with bench_one_batch.py +# batch size can be controlled with --batch argument +python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile + +# profile multiple batches with bench_offline_throughput.py +python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8 +``` + +### Profile a server with `sglang.profiler` + +When the server is running (e.g., processing a decoding request), you can start live profiling immediately by sending a profile request to the server. + +You can do this by running `python3 -m sglang.profiler`. For example: + +```text Output +# Terminal 1: Send a generation request +python3 -m sglang.test.send_one + +# Terminal 2: Before the above request finishes, quickly launch the following command in a separate terminal. +# It will generate a profile of the above request for several decoding batches. +python3 -m sglang.profiler +``` + +You can also combine the above operations into a single command + +```text Output +python3 -m sglang.test.send_one --profile +``` + +### Profile a server with HTTP API endpoints + +SGLang provides HTTP API endpoints to control profiling on a running server. This allows you to start and stop profiling programmatically, which is useful for capturing specific workload patterns. + +#### Using `/start_profile` endpoint + +The `/start_profile` endpoint starts profiling on the server. 
You can control when profiling begins and how long it runs using the following parameters: + +**Basic usage:** + +```bash Command +# Start profiling immediately for 10 steps +curl -X POST http://127.0.0.1:30000/start_profile \ + -H "Content-Type: application/json" \ + -d '{ + "num_steps": 10 + }' +``` + +**Parameters:** + +- `output_dir` (optional): Directory where profile traces will be saved. If not specified, uses `SGLANG_TORCH_PROFILER_DIR` environment variable, or `/tmp` as the default +- `num_steps` (optional): Number of steps to profile. If not specified, profiling continues until manually stopped with `/stop_profile` +- `start_step` (optional): Step number at which to start profiling (inclusive). Useful for skipping warmup iterations +- `activities` (optional): List of activities to profile, e.g., `["CPU", "GPU"]`. Default is `["CPU", "GPU"]` +- `merge_profiles` (optional): Whether to merge distributed traces. Default is `false` + +**Note on step ranges:** Profiling starts at `start_step` (inclusive) and continues for `num_steps` iterations. For example, with `start_step=3` and `num_steps=10`, profiling captures steps 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12 (10 steps total, starting from step 3). + +**Advanced usage with `start_step`:** + +```bash Command +# Wait 5 steps (warmup), then profile for 10 steps +curl -X POST http://127.0.0.1:30000/start_profile \ + -H "Content-Type: application/json" \ + -d '{ + "output_dir": "/tmp/profiles", + "start_step": 5, + "num_steps": 10, + "activities": ["CPU", "GPU"] + }' +``` + +**Continuous profiling (manual stop):** + +```bash Command +# Start profiling without num_steps - must manually stop with /stop_profile +curl -X POST http://127.0.0.1:30000/start_profile +``` + +#### Using `/stop_profile` endpoint + +The `/stop_profile` endpoint stops an ongoing profiling session and saves the trace file. 
+ +```bash Command +# Stop profiling and save traces +curl -X POST http://127.0.0.1:30000/stop_profile +``` + +This is only needed when you start profiling without specifying `num_steps`. If `num_steps` is specified, profiling will automatically stop after that many steps. + +#### Example workflow + +```bash Command +# Terminal 1: Start the server +export SGLANG_TORCH_PROFILER_DIR=/tmp/profiles +python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct + +# Terminal 2: Start continuous profiling +curl -X POST http://127.0.0.1:30000/start_profile \ + -H "Content-Type: application/json" \ + -d '{ + "start_step": 3 + }' + +# Terminal 3: Send requests to generate load +python -m sglang.bench_serving --backend sglang --num-prompts 100 + +# Terminal 2: Stop profiling when done +curl -X POST http://127.0.0.1:30000/stop_profile +``` + +### Profiler Trace Merger for Distributed Traces + +SGLang now supports automatic merging of profiling traces from distributed setups with multiple parallelism types (TP, DP, PP, EP). This feature is particularly useful for analyzing performance across distributed runs. + +#### Multi-Node Profiling and Shared Storage Considerations + +Single-node profiler output merging is completely supported. When profiling in distributed environments spanning multiple nodes, shared storage (e.g., NFS, Lustre) should be accessible by all nodes for the output directory to enable merging of trace files. + +If there is no shared storage accessible across nodes, automatic merging of trace files during profiling is not supported directly as of now. 
+ +#### HTTP API Usage + +```bash Command +# Start profiling with automatic trace merging enabled +# output_dir: where to store profile traces +# merge_profiles: optional argument to merge profile traces (default: false) +curl -X POST /start_profile \ + -H "Content-Type: application/json" \ + -d '{ + "output_dir": "/tmp/profiles", + "num_steps": 10, + "activities": ["CPU", "GPU"], + "merge_profiles": true + }' +``` + +#### Command Line Usage + +```bash Command +# Start profiling with merge enabled +python -m sglang.profiler \ + --num-steps 10 \ + --cpu \ + --gpu \ + --output-dir /tmp/profiles \ + --merge-profiles # optional argument to merge profile traces (default=False) +``` + +#### Output Files + +The profile merger generates: +- Individual rank trace files: `{profile_id}-TP-{tp}-DP-{dp}-PP-{pp}-EP-{ep}.trace.json.gz` +- Merged trace file: `merged-{profile_id}.trace.json.gz` + +### Possible PyTorch bugs +If you encounter the following error (for example, when using Qwen 2.5 VL): +```bash Command +RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty. +``` +This is likely a PyTorch bug reported in [Bug: vLLM Profiler](https://github.com/vllm-project/vllm/issues/18240) and [Bug: torch.profiler.profile](https://github.com/pytorch/pytorch/issues/101632). As a workaround, you may disable `with_stack` with an environment variable as follows: +```bash Command +export SGLANG_PROFILE_WITH_STACK=False +python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8 +``` + +### View traces + +Trace files can be loaded and visualized from: + +1. https://ui.perfetto.dev/ (any browser) +2.
chrome://tracing (Chrome browser only) + +If the browser cannot open the trace file because it is too large, +the client can generate a smaller trace file (<100MB) by limiting the number of prompts and the length of the outputs. +For example, when profiling a server: + +```bash Command +python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile +``` + +This command sets the number of prompts to 2 with the `--num-prompts` argument and limits the length of output sequences to 100 with the `--sharegpt-output-len` argument, which produces a trace file small enough for the browser to open smoothly. + +Additionally, if you want to locate the SGLang Python source code from a CUDA kernel in the trace, you need to disable CUDA Graph when starting the service. This can be done by adding the `--disable-cuda-graph` parameter to the command that starts the service. + +## Profile with Nsight + +[Nsight Systems](https://docs.nvidia.com/nsight-systems/) is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions and low-level CUDA APIs and events. + +1. Prerequisite: + + Install using apt, or run inside a [NVIDIA Docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) or [SGLang Docker container](https://github.com/sgl-project/sglang/tree/main/docker). + + ```bash Command + # install nsys + # https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html + apt update + apt install -y --no-install-recommends gnupg + echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list + apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub + apt update + apt install nsight-systems-cli + ``` + +2.
To profile a single batch, use + + ```bash Command + nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512 + ``` + +3. To profile a server, e.g. + + ```bash Command + # launch the server, set the delay and duration times according to needs + # after the duration time has been used up, server will be killed by nsys + + nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache + + # client + python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input 1024 --random-output 512 + ``` + + In practice, we recommend setting the `--duration` argument to a large value. Whenever you want the server to stop profiling, first run: + + ```bash Command + nsys sessions list + ``` + + to get the session id in the form of `profile-XXXXX`, then run: + + ```bash Command + nsys stop --session=profile-XXXXX + ``` + + to manually stop the profiler and generate the `nsys-rep` files immediately. + +4. Use NVTX to annotate code regions, e.g. to see their execution time. + + ```bash Command + # install nvtx + pip install nvtx + ``` + + ```python Example + # code snippets + import nvtx + with nvtx.annotate("description", color="blue"): +     # some critical code goes here +     pass + ``` + +### Layer-wise NVTX Profiling with Nsight Systems + +SGLang provides built-in layerwise NVTX annotations that can be combined with the CUDA Profiler for detailed per-layer profiling in Nsight Systems. This is particularly useful for identifying performance bottlenecks at the layer level. + +#### Using `--enable-layerwise-nvtx-marker` with Nsight Systems and `/start_profile` + +The `--enable-layerwise-nvtx-marker` flag automatically adds NVTX markers to every layer in your model.
This is particularly powerful when combined with Nsight Systems profiling to see detailed per-layer performance. + +**Method 1: Using `/start_profile` with CUDA_PROFILER (for programmatic control)** + +This method allows you to control exactly when profiling starts/stops via HTTP API while Nsight Systems is running. + +1. Launch the server with layerwise NVTX enabled under Nsight Systems: + + ```bash Command + # Terminal 1: Start server with nsys and capture-range option + nsys profile --trace-fork-before-exec=true \ + --cuda-graph-trace=node \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + -o layerwise_profile \ + python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --enable-layerwise-nvtx-marker \ + --disable-cuda-graph + ``` + + Note: NVTX markers are not emitted for kernel launches captured by CUDA graphs. Use `--disable-cuda-graph` to ensure all layerwise NVTX markers are emitted in the trace. + +2. In another terminal, control profiling via `/start_profile` with `CUDA_PROFILER` activity: + + ```bash Command + # Terminal 2: Wait for server to be ready, then start CUDA profiling + # Wait 3 steps for warmup, then profile for 10 steps + curl -X POST http://127.0.0.1:30000/start_profile \ + -H "Content-Type: application/json" \ + -d '{ + "start_step": 3, + "num_steps": 10, + "activities": ["CUDA_PROFILER"] + }' + ``` + +3. Send requests to generate load: + + ```bash Command + # Terminal 3: Generate workload + python -m sglang.bench_serving --backend sglang --num-prompts 100 + ``` + +4. Profiling will automatically stop after 10 steps (due to `num_steps: 10`). 
If you hadn't specified `num_steps`, you would need to manually stop it: + + ```bash Command + # Terminal 2: Only needed if num_steps was not specified + curl -X POST http://127.0.0.1:30000/stop_profile + ``` + +The `--capture-range=cudaProfilerApi` option tells Nsight Systems to only capture data between `cudaProfilerStart()` and `cudaProfilerStop()` calls (triggered by `/start_profile` and `/stop_profile`), reducing overhead and file size. The `start_step` parameter skips the first 3 steps to avoid capturing warmup overhead. + +**Method 2: Simpler approach without `/start_profile` API** + +For simpler use cases where you don't need fine-grained control over profiling start/stop, you can profile with Nsight Systems capturing the entire workload: + +```bash Command +# Terminal 1: Start server with layerwise NVTX +# Note: --disable-cuda-graph ensures all NVTX markers are emitted +python -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --enable-layerwise-nvtx-marker \ + --disable-cuda-graph + +# Terminal 2: Profile the benchmarking client +nsys profile --trace-fork-before-exec=true \ + --cuda-graph-trace=node \ + -o layerwise_profile \ + python -m sglang.bench_serving --backend sglang --num-prompts 10 +``` + +This approach profiles the entire client execution, including all server interactions. The layerwise NVTX markers will be visible in the Nsight Systems timeline. + +**Viewing the profiling results:** + +Open the generated report file with Nsight Systems (`.nsys-rep` in current releases; older releases wrote `.qdrep`): + +```bash Command +nsys-ui layerwise_profile.nsys-rep +``` + +In the Nsight Systems GUI, you'll see: +- **NVTX ranges**: Each layer appears as a labeled range in the timeline with detailed information in the marker metadata +- **CUDA kernels**: All GPU kernels are shown alongside the layer annotations +- **Layer hierarchy**: The full module path (e.g., `meta-llama/Meta-Llama-3.1-8B-Instruct.model.layers.0.self_attn.qkv_proj`) helps identify specific layers.
The prefix uses the full model path from `--model-path`. +- **Tensor shapes**: Input/output dimensions and parameter shapes are included in the NVTX marker data + +**Benefits of layerwise NVTX profiling:** + +- **Granular visibility**: See exactly which layers are taking the most time +- **Memory tracking**: Identify layers with large memory allocations +- **Bottleneck identification**: Quickly locate inefficient operations +- **Communication overhead**: In multi-GPU setups, see per-layer communication costs +- **Development debugging**: Validate that model architecture changes have the expected performance impact + +## Other tips + +1. You can benchmark a model using dummy weights by only providing the config.json file. This allows for quick testing of model variants without training. To do so, add `--load-format dummy` to the above commands and then you only need a correct `config.json` under the checkpoint folder. +2. You can benchmark a model with modified configs (e.g., fewer layers) by using `--json-model-override-args`. For example, you can benchmark a model with only 1 layer and 1 KV head using: + + ```bash Command + python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 --load-format dummy --json-model-override-args '{"num_hidden_layers": 1, "num_key_value_heads": 1}' + ``` + +3. With Nsight Systems, you can pass `--python-backtrace=cuda` to `nsys profile` to see the Python call stack for all CUDA kernels, as in PyTorch Profiler. (Caveat: this can cause inaccurately long kernel runtimes for CUDA event based timing) +4. For more arguments see [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html).
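
As a recap of the `/start_profile` step-range rule from the HTTP API section of this guide (profiling starts at `start_step`, inclusive, and runs for `num_steps` iterations), the following is a minimal Python sketch. The helper names `profiled_steps` and `build_start_profile_request` are hypothetical and not part of SGLang; the endpoint path, parameter names, and the address `http://127.0.0.1:30000` are taken from the curl examples in this guide. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def profiled_steps(start_step: int, num_steps: int) -> list[int]:
    """Hypothetical helper: steps captured when profiling starts at
    start_step (inclusive) and continues for num_steps iterations."""
    return list(range(start_step, start_step + num_steps))


def build_start_profile_request(base_url: str, **params) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a POST request for the
    /start_profile endpoint with a JSON body made from params."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/start_profile",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Same numbers as the step-range note: start_step=3, num_steps=10.
    print(profiled_steps(3, 10))
    req = build_start_profile_request(
        "http://127.0.0.1:30000",  # assumed server address from the examples
        start_step=3,
        num_steps=10,
        activities=["CPU", "GPU"],
    )
    print(req.get_method(), req.full_url)
    # To send it against a running server: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the sketch runnable without a server; with a server running, passing the request to `urllib.request.urlopen` should behave like the curl commands shown earlier.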
diff --git a/docs_new/docs/developer_guide/contribution_guide.mdx b/docs_new/docs/developer_guide/contribution_guide.mdx new file mode 100644 index 000000000000..9c01980c2b7a --- /dev/null +++ b/docs_new/docs/developer_guide/contribution_guide.mdx @@ -0,0 +1,188 @@ +--- +title: "Contribution Guide" +mode: wide +metatags: + description: "SGLang contribution guide: source install, pre-commit, unit tests, CI triggers, code style, sgl-kernel updates." +--- +Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process. + +## Install SGLang from Source + +### Fork and clone the repository + +**Note**: New contributors do **not** have the write permission to push to the official SGLang repo. Please fork the repository under your GitHub account, then clone your fork locally. + +```bash +git clone https://github.com//sglang.git +``` + +### Build from source + +Refer to [Install SGLang from Source](../get-started/install#method-2-from-source). + +## Format code with pre-commit + +We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run: + +```bash +pip3 install pre-commit +pre-commit install +pre-commit run --all-files +``` + +- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request. +- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch. +- Link checking with lychee is **enforced in CI**. By default, it is not blocking local commits. 
+- To run local link checks manually, use: `pre-commit run --hook-stage manual lychee --all-files`. + +## Run and add unit tests + +If you add a new feature or fix a bug, please add corresponding unit tests to ensure coverage and prevent regression. + +### Unit tests (no server required) + +Unit tests live under [`test/registered/unit/`](https://github.com/sgl-project/sglang/tree/main/test/registered/unit), organized to mirror the `python/sglang/srt/` source tree. These tests validate component logic **without** launching a server or loading real model weights. +SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework with [pytest](https://docs.pytest.org/) as the test runner. + +**When to add a unit test:** If you modify a file under `python/sglang/srt/`, check whether a corresponding test exists in `test/registered/unit/` and add coverage for your changes. For example: + +``` +srt/mem_cache/radix_cache.py → unit/mem_cache/test_radix_cache.py +srt/sampling/sampling_params.py → unit/sampling/test_sampling_params.py +``` + +**Run unit tests locally:** + +```bash Command +pytest test/registered/unit/ -v # all unit tests +pytest test/registered/unit/mem_cache/ -v # one module +``` + +**Run with coverage:** + +```bash Command +pytest test/registered/unit/ --cov --cov-config=.coveragerc -v +``` + +For conventions on CI registration, test structure, and examples, see [`test/registered/unit/README.md`](https://github.com/sgl-project/sglang/tree/main/test/registered/unit/README.md). + +### E2E tests (server required) + +For tests that require launching a server, refer to [`test/registered/README.md`](https://github.com/sgl-project/sglang/tree/main/test/registered/README.md) for guidance on where to place your test. + +For detailed instructions on running tests and integrating them into CI, refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md). 
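
As a rough illustration of the unit-test conventions described above (plain `unittest` test cases, discovered and run by `pytest`, with no server and no model weights), here is a minimal sketch. The function `clamp_batch_size` is a hypothetical stand-in for component logic and does not exist in SGLang; a real test would import the module under `python/sglang/srt/` that it mirrors.

```python
import unittest


def clamp_batch_size(requested: int, max_batch_size: int) -> int:
    """Hypothetical helper (not part of SGLang): clamp a requested
    batch size into the range [1, max_batch_size]."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be >= 1")
    return max(1, min(requested, max_batch_size))


class TestClampBatchSize(unittest.TestCase):
    def test_within_range(self):
        # A value inside the range is returned unchanged.
        self.assertEqual(clamp_batch_size(8, 32), 8)

    def test_clamped_to_max(self):
        # Values above the limit are reduced to the limit.
        self.assertEqual(clamp_batch_size(64, 32), 32)

    def test_clamped_to_min(self):
        # Values below 1 are raised to 1.
        self.assertEqual(clamp_batch_size(0, 32), 1)

    def test_invalid_max_raises(self):
        # A non-positive limit is rejected.
        with self.assertRaises(ValueError):
            clamp_batch_size(8, 0)
```

Saved as a `test_*.py` file under the unit test tree, a file like this would be picked up by the `pytest test/registered/unit/ -v` command shown earlier; CI registration details are covered in the README linked above.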
+ +## Write documentation + +We recommend that new contributors start by writing documentation, which helps you quickly understand the SGLang codebase. +For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md). + +## Test the accuracy +If your code changes the model output, please run the accuracy tests. A quick sanity check is the few-shot GSM8K. + +```text Output +# Launch a server +python3 -m sglang.launch_server --model Qwen/Qwen2-7B-Instruct + +# Evaluate +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` + +Please note that the above script is primarily a sanity check, not a rigorous accuracy or speed test. +This test can have significant variance (1%–5%) in accuracy due to batching and the non-deterministic nature of the inference engine. +Also, do not rely on the "Latency/Output throughput" from this script, as it is not a proper speed test. + +GSM8K is too easy for state-of-the-art models nowadays. Please try your own more challenging accuracy tests. +You can find additional accuracy eval examples in: +- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/registered/eval/test_eval_accuracy_large.py) +- [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/registered/core/test_gpt_oss_1gpu.py) + +## Benchmark the speed +Refer to [Benchmark and Profiling](./benchmark_and_profiling). + +## Requesting a review for merge +You can follow the pull request merge process described in [MAINTAINER.md](https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md). +You will need to work with the Merge Oncall, Codeowner, and other reviewers to get their approvals. +Then your PR can be merged. + +## How to Trigger CI Tests + +We have a lot of open PRs but limited CI machines, so only top and trusted contributors have permission to trigger CI tests.
+Users with permission are listed in the [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json) + +**PR authors** can always use `/rerun-failed-ci` on their own PRs, even if they are not listed in `CI_PERMISSIONS.json`. + +For CI to run on a pull request, it must have the "run-ci" label. Authorized users can add the label or rerun failed tests by commenting on the PR with one of these commands: + +- `/tag-run-ci-label`: Adds the "run-ci" label. Every future commit will trigger CI. +- `/rerun-failed-ci`: Reruns the failed or flaky tests from the most recent commit. +- `/tag-and-rerun-ci`: A single command that performs both `/tag-run-ci-label` and `/rerun-failed-ci`. +- `/rerun-stage `: Reruns a specific test stage without waiting for its dependencies. This is useful when you want to quickly validate a fix for a specific test failure instead of waiting ~30 minutes for preceding stages to complete. + +If you have permission, the [Slash Command Handler](https://github.com/sgl-project/sglang/actions/workflows/slash-command-handler.yml) will run your command and react with a 👍 to your comment. It may take up to a few minutes for the reaction to appear. Here’s a usage [example](https://github.com/sgl-project/sglang/pull/14253#issuecomment-3599509302). + +To avoid spamming a PR with too many `/rerun-failed-ci` comments, you can also trigger the command by editing an existing comment and adding any suffix (e.g., `/rerun-failed-ci try again`). + +Example of rerunning a single test stage: `/rerun-stage unit-test-backend-4-gpu`. + +If you don’t have permission and you’re not the PR author, please ask maintainers to trigger CI for you. + +### CI rate limits + +Due to CI scheduling and limited resources, higher-priority PRs may preempt running jobs. In such cases, you may need to rerun the tests. +We apply CI rate limits to prevent abuse and ensure fair usage of our CI resources. 
+ +Each CI workflow has a default limit defined in its workflow configuration file. For example, in [pr-gate.yml](https://github.com/sgl-project/sglang/blob/main/.github/workflows/pr-gate.yml), the default cooldown period is 120 minutes, and each workflow can override it via the `cool-down-minutes` input parameter: + +```yaml Config +cool-down-minutes: + description: "Default cooldown period in minutes; 0 disables rate limiting" + type: number + default: 120 +``` + +Users listed in [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json) may have a per-user cooldown interval. In practice, we use the minimum of the workflow’s default window and the user-specific interval. + +## Code style guidance +- Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function. +- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code. +- Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize all minor overheads as much as possible, especially in the model forward code. + - A common pattern is some runtime checks in the model forward pass (e.g., [this](https://github.com/sgl-project/sglang/blob/f1b0eda55c2c4838e8ab90a0fac7fb1e3d7064ab/python/sglang/srt/models/deepseek_v2.py#L486-L491)). These are very likely the same for every layer. Please cache the result as a single boolean value whenever possible. +- Make functions as pure as possible. Avoid in-place modification of arguments. +- Keep files concise. If a file exceeds 2,000 lines of code, split it into multiple smaller files. (e.g., `scheduler.py`, `scheduler_output_processor_mixin.py`) +- Keep tests fast.
+ - If a single test file runs longer than 500 seconds, split it into multiple smaller files (e.g., `test_eagle_infer_a.py`, `test_eagle_infer_b.py`). + - If a single job in a GitHub workflow runs longer than 30 minutes, split it into smaller jobs/steps. + - Reuse server launches in your unit tests to make tests run faster. +- Never use `pickle.loads()`, `pickle.load()`, or `recv_pyobj()` to deserialize untrusted or network-received data. Python's [pickle module is not secure](https://docs.python.org/3/library/pickle.html): it can execute arbitrary code during deserialization. Use safe serialization formats such as [msgpack](https://github.com/jcrist/msgspec) or JSON instead. +- When supporting new hardware or features, follow these guidelines: + - Do not drastically change existing code. + - Always prefer new files to introduce specific components for your new hardware (e.g., `allocator_ascend.py`). + - If you write multiple if/else blocks for new features, ensure the common path (e.g., NVIDIA hardware or the existing code path) is the first branch. + +## How to update sgl-kernel +Since sglang and the `sglang-kernel` (previously `sgl-kernel`) distribution are separate Python packages, our current GitHub CI infrastructure does not support updating a kernel and using it immediately within the same pull request (PR). +To add a new kernel or modify an existing one in the `sgl-kernel/` source tree, you must use multiple PRs. + +Follow these steps: + +1. Submit a PR to update the sgl-kernel source code without using it in the sglang Python package (e.g., [#8884](https://github.com/sgl-project/sglang/pull/8884/files)). +2. Bump the version of the kernel package (e.g., [#9220](https://github.com/sgl-project/sglang/pull/9220/files)). + - Once merged, this will trigger an automatic release of the `sglang-kernel` wheel to PyPI. + - If not urgent, you can wait for other people to release the wheel. A new version will typically be released within one week. +3.
Apply the changes: + - Update the `sglang-kernel` version in `sglang/python/pyproject.toml` to use the modified kernels. + - Update the related caller code in the sglang to use the new kernel. + +## Tips for newcomers + +If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. + +Also check out the following materials as startup guide: +- [Mini-SGLang](https://github.com/sgl-project/mini-sglang) for a quick overview on the structure of sglang. +- [Code Walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow. +- [GTC-2026 Training Lab](https://drive.google.com/file/d/1mwOZEtipNLJzrflCTodj34KhuOZEoEw5/view?usp=drive_link) for hands-on practices of how to do optimization, benchmarking, or profiling on a launched SGLang instance. + +If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://slack.sglang.io). + +Thank you for your interest in SGLang. Happy coding! diff --git a/docs_new/docs/developer_guide/development_guide_using_docker.mdx b/docs_new/docs/developer_guide/development_guide_using_docker.mdx new file mode 100644 index 000000000000..70e69aa28b72 --- /dev/null +++ b/docs_new/docs/developer_guide/development_guide_using_docker.mdx @@ -0,0 +1,119 @@ +--- +title: "Development Guide Using Docker" +sidebarTitle: "Using Docker" +metatags: + description: "SGLang Docker development: VSCode dev container, remote tunnels, debugger setup, nsys profiling." +--- +## Setup VSCode on a Remote Host +(Optional - you can skip this step if you plan to run sglang dev container locally) + +1. 
On the remote host, download `code` from [VSCode](https://code.visualstudio.com/download) and run `code tunnel` in a shell. + +Example +```bash Command +wget https://vscode.download.prss.microsoft.com/dbazure/download/stable/fabdb6a30b49f79a7aba0f2ad9df9b399473380f/vscode_cli_alpine_x64_cli.tar.gz +tar xf vscode_cli_alpine_x64_cli.tar.gz + +# https://code.visualstudio.com/docs/remote/tunnels +./code tunnel +``` + +2. On your local machine, press F1 in VSCode and choose "Remote Tunnels: Connect to Tunnel". + +## Setup Docker Container + +### Option 1. Use the default dev container automatically from VSCode +The sglang repository root contains a `.devcontainer` folder that lets VSCode start up inside a dev container automatically. You can read more about this VSCode extension in the official VSCode document [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers). + + VSCode Dev Container Architecture + + +*Figure 1: Diagram from VSCode official documentation [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).* + +To enable this, you only need to: +1. Start Visual Studio Code and install [VSCode dev container extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers). +2. Press F1, then type and choose "Dev Container: Open Folder in Container". +3. Enter the local `sglang` repo path on your machine and press Enter. + +Opening the folder in the dev container for the first time may take longer because of the Docker pull and build. Once it succeeds, the status bar at the bottom left should show that you are in a dev container: + + + VSCode Dev Container Status Bar + + +Now when you run `sglang.launch_server` in the VSCode terminal or start debugging using F5, the sglang server will be started in the dev container with all your local changes applied automatically: + + + SGLang Server Running in Dev Container + + + +### Option 2.
Start up containers manually (advanced) + +The following startup command is an example for internal development by the SGLang team. You can **modify or add directory mappings as needed**, especially for model weight downloads, to prevent repeated downloads by different Docker containers. + +❗️ **Note on RDMA** + + 1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them, but keeping them does no harm. Thus, we enable these two flags by default in the commands below. + 2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`. + +```bash Command +# Change the name to yours and replace <volumes-to-mount> with your own mappings +docker run -itd --shm-size 32g --gpus all -v <volumes-to-mount> --ipc=host --network=host --privileged --name sglang_dev lmsysorg/sglang:dev /bin/zsh +docker exec -it sglang_dev /bin/zsh +``` +Some useful volumes to mount are: +1. **HuggingFace model cache**: mounting the model cache avoids re-downloading models every time the container restarts. The default location on Linux is `~/.cache/huggingface/`. +2. **SGLang repository**: code changes in the local SGLang repository will be automatically synced to the dev container. + +Example 1: Mounting the local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the dev container. +```bash Command +docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh +docker exec -it sglang_zhyncs /bin/zsh +``` +Example 2: Mounting both the HuggingFace cache and the local SGLang repo. Local code changes are automatically synced to the dev container because SGLang is installed in editable mode in the dev image.
+```bash Command +docker run -itd --shm-size 32g --gpus all -v $HOME/.cache/huggingface/:/root/.cache/huggingface -v $HOME/src/sglang:/sgl-workspace/sglang --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh +docker exec -it sglang_zhyncs /bin/zsh +``` +## Debug SGLang with VSCode Debugger +1. (Create if not exist) open `launch.json` in VSCode. +2. Add the following config and save. Please note that you can edit the script as needed to apply different parameters or debug a different program (e.g. benchmark script). + ```JSON Config + { + "version": "0.2.0", + "configurations": [ + { + "name": "Python Debugger: launch_server", + "type": "debugpy", + "request": "launch", + "module": "sglang.launch_server", + "console": "integratedTerminal", + "args": [ + "--model-path", "meta-llama/Llama-3.2-1B", + "--host", "0.0.0.0", + "--port", "30000", + "--trust-remote-code", + ], + "justMyCode": false + } + ] + } + ``` + +3. Press "F5" to start. VSCode debugger will ensure that the program will pause at the breakpoints even if the program is running at remote SSH/Tunnel host + dev container. + +## Profile + +```bash Command +# Change batch size, input, output and add `disable-cuda-graph` (for easier analysis) +# e.g. DeepSeek V3 +nsys profile -o deepseek_v3 python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --disable-cuda-graph +``` + +## Evaluation + +```bash Command +# e.g. 
gsm8k 8 shot +python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 +``` diff --git a/docs_new/docs/developer_guide/development_jit_kernel_guide.mdx b/docs_new/docs/developer_guide/development_jit_kernel_guide.mdx new file mode 100644 index 000000000000..8b92fcd10a3f --- /dev/null +++ b/docs_new/docs/developer_guide/development_jit_kernel_guide.mdx @@ -0,0 +1,425 @@ +--- +title: "Development Guide for JIT Kernels" +sidebarTitle: "JIT Kernels" +metatags: + description: "SGLang JIT kernel development: clangd setup, TensorMatcher, LaunchKernel, add_constant example walkthrough." +--- +## Environment Setup + +We strongly recommend using `clangd` as the language server for JIT kernel development. +For Ubuntu/Debian, you can download clangd from [apt.llvm.org](https://apt.llvm.org/). +If you are using VS Code, we recommend installing the `clangd` extension for better IDE integration. + +All JIT-related files are located in `python/sglang/jit_kernel`. +Unlike `sgl-kernel`, which compiles CUDA/C++ binaries ahead of time (AOT), just-in-time (JIT) kernels are compiled at runtime. +Consequently, a static `compile_commands.json` cannot be generated. +To enable code completion with `clangd`, run `python -m sglang.jit_kernel` to generate a `.clangd` configuration file in your current directory. +After generating the file, restart the clangd language server. It should now recognize all JIT kernel files. + +## Code Structure + +### C++ Implementation + +C++ source code is located in `python/sglang/jit_kernel/csrc`. +Reusable functions should be placed in `python/sglang/jit_kernel/include`. + +We use [tvm-ffi](https://github.com/apache/tvm-ffi) for efficient foreign language bindings. +Refer to the [documentation](https://tvm.apache.org/ffi/) for advanced usage, such as exporting C++ objects. +Typically, `tvm::ffi::TensorView` is sufficient for passing PyTorch Tensors from Python. 
+ +### Python Interface + +Python interfaces are defined in `python/sglang/jit_kernel`. +The `load_jit` utility function in `python/sglang/jit_kernel/utils.py` loads and returns the compiled module. +To export a C++ function (e.g., `cpp_func`), pass `cuda_wrappers=[("func", "cpp_func")]` to `load_jit`. +The function can then be called in Python as `module.func`. + +For caching compiled modules, prefer `sglang.jit_kernel.utils.cache_once` over `functools.lru_cache`. +`functools.lru_cache` is not compatible with `torch.compile`. + +### C++ Utilities + +The following C++ utilities are available: + +#### Integer Range + +Similar to PyTorch, we provide an `irange` function to represent an integer range. + +```C++ Example +#include <sgl_kernel/utils.h> // For irange + +void test() { + for (auto i : host::irange(100)) { // [0, 100) + // do something + } + for (auto i : host::irange(0, 100)) { // [0, 100) + // do something + } +} + +``` + +#### Runtime Checking + +`RuntimeCheck` validates conditions at runtime. It accepts optional arguments for error reporting. +If the check fails, these arguments are output to aid debugging. +`RuntimeDeviceCheck` verifies the status of the last kernel launch. + +```C++ Example +#include <sgl_kernel/utils.cuh> // For RuntimeDeviceCheck +#include <sgl_kernel/utils.h> // For RuntimeCheck + +void test() { + host::RuntimeCheck(1 + 1 == 2, 1 + 1, " != ", 2); + host::RuntimeDeviceCheck(); + // check the provided `cudaError_t` + host::RuntimeDeviceCheck(cudaGetLastError()); +} + +``` + +#### Tensor Checking + +`TensorMatcher` provides a readable way to validate and extract tensor shape information.
+ +```cpp Example +#include <sgl_kernel/tensor.h> // For TensorMatcher, SymbolicSize, SymbolicDType, SymbolicDevice + +void test(const tvm::ffi::TensorView k_cache, const tvm::ffi::TensorView v_cache) { + using namespace host; + + auto D = SymbolicSize{"D"}; // cache dimension + auto N = SymbolicSize{"N"}; // kvcache stride + auto dtype = SymbolicDType{}; + auto device = SymbolicDevice{}; + + TensorMatcher({-1, D}) // + .with_strides({N, 1}) + .with_dtype(dtype) + .with_device(device) + .verify(k_cache) + .verify(v_cache); +} +``` + +Configure the `TensorMatcher` with expected stride, dtype, and device properties before verification. +- If `with_strides` is omitted, the tensor is expected to be contiguous. +- Template arguments in `with_dtype` restrict the allowed data types. +- Template arguments in `with_device` restrict the allowed devices. +- Values passed to `with_xxx` methods enforce equality checks. +- Passing `-1` for size or stride allows matching any value. + +A `Symbolic` variable must resolve to the same value across all verifications. +Use `.unwrap()` to retrieve the matched value after verification. + +> Note: `TensorMatcher` is a temporary expression and should not be stored in a variable. + +> Tip: Add a trailing `//` after `TensorMatcher(...)`, as in the example, to enforce proper indentation of the chained calls. + +#### Kernel Launching + +`LaunchKernel::resolve_device` retrieves the current `cudaStream_t` from PyTorch. +Kernels can also be launched directly using `LaunchKernel`. + +```cpp Example +#include <sgl_kernel/utils.cuh> // For LaunchKernel + +#include <dlpack/dlpack.h> // For DLDevice + +__global__ void kernel() {} + +void test() { + const auto num_blocks = 1; + const auto num_threads = 32; + const auto dynamic_smem = 0; + + DLDevice dev; // suppose this is initialized properly + host::LaunchKernel(num_blocks, num_threads, dev)(kernel); + + cudaStream_t stream = host::LaunchKernel::resolve_device(dev); + host::LaunchKernel(num_blocks, num_threads, stream, dynamic_smem)(kernel); +} + +``` + +## Add new kernels + +This section walks through a complete, end-to-end example of adding a new JIT kernel to the system.
+We use a simple `add_constant` kernel as a running example, which adds a constant integer value to every element of an input tensor. + +Conceptually, the Python interface looks like this: + +```python Example +def add_constant(src: torch.Tensor, c: int): + return src + c +``` + +### STEP 1: Write the C++ kernel + +Write your CUDA kernel in [jit_kernel/csrc/add_constant.cuh](https://github.com/sgl-project/sglang/blob/main/python/sglang/jit_kernel/csrc/add_constant.cuh). For demonstration purposes, we pass the constant value as a template parameter. + +```cpp Example +#include <sgl_kernel/tensor.h> // For TensorMatcher, SymbolicSize, SymbolicDevice +#include <sgl_kernel/utils.cuh> // For LaunchKernel +#include <sgl_kernel/utils.h> // For div_ceil, RuntimeCheck + +#include <dlpack/dlpack.h> // For DLDevice +#include <tvm/ffi/container/tensor.h> // For tvm::ffi::TensorView + +#include <cstddef> // For size_t +#include <cstdint> // For int32_t + +namespace { + +template <int32_t kConstant> +__global__ void add_constant_kernel(int32_t* dst, const int32_t* src, size_t length) { + size_t idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < length) { + dst[idx] = src[idx] + kConstant; + } +} + +constexpr size_t kBlockSize = 256; + +// You can also use struct with static method as an alternative +template <int32_t kConstant> +void add_constant(tvm::ffi::TensorView dst, tvm::ffi::TensorView src) { + using namespace host; + + // 1. Validate input tensors + SymbolicSize N = {"num_elements"}; + SymbolicDevice device_; + TensorMatcher({N}) // 1D tensor, must be contiguous + .with_dtype<int32_t>() // must be int32 + .with_device<kDLCUDA>(device_) // must be on CUDA device + .verify(dst) // check tensor dst + .verify(src); // check tensor src + + // 2. Extract required parameters, prepare for kernel launch + const size_t num_elements = N.unwrap(); + const size_t grid_size = div_ceil(num_elements, kBlockSize); + const DLDevice device = device_.unwrap(); + // some extra runtime checks using host::RuntimeCheck + RuntimeCheck(num_elements > 0, "We only support non-empty tensors, got num_elements = ", num_elements); + + // 3. Launch the kernel. Error code will be automatically checked.
+ LaunchKernel(grid_size, kBlockSize, device /*, dynamic_smem*/)( + // kernel function + add_constant_kernel<kConstant>, + // kernel arguments + static_cast<int32_t*>(dst.data_ptr()), + static_cast<const int32_t*>(src.data_ptr()), + num_elements); +} + +} // namespace + +``` + +### STEP 2: Create Python Interfaces + +Next, expose the kernel through a Python wrapper. +Create a new file at [jit_kernel/add_constant.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/jit_kernel/add_constant.py) and expose the needed interfaces. + +```python Example +from __future__ import annotations +from typing import TYPE_CHECKING + +import torch + +from sglang.jit_kernel.utils import cache_once, load_jit, make_cpp_args + +if TYPE_CHECKING: + from tvm_ffi.module import Module + + +@cache_once +def _jit_add_constant_module(constant: int) -> Module: + args = make_cpp_args(constant) # pass all the template arguments + return load_jit( + "add_constant", + *args, + cuda_files=["add_constant.cuh"], + cuda_wrappers=[("add_constant", f"add_constant<{args}>")], + ) + + +def add_constant(src: torch.Tensor, constant: int) -> torch.Tensor: + if not src.is_cuda: + raise RuntimeError("src must be a CUDA tensor") + if src.dtype != torch.int32: + raise RuntimeError(f"Unsupported dtype {src.dtype}. Supported: int32") + dst = torch.empty_like(src) + module = _jit_add_constant_module(constant) + module.add_constant(dst, src) + return dst + +``` + +Keep the Python wrapper thin, but still validate the basic invariants such as device and dtype before dispatch. In the current JIT/FFI path, invalid tensors are not always rejected safely before launch. + +### STEP 3: Use your kernel + +Finally, import and use the kernel like a regular Python function: + +```python Example +from sglang.jit_kernel.add_constant import add_constant +``` + +For a complete, runnable example, refer to [test_add_constant.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/jit_kernel/tests/test_add_constant.py).
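
To make the launch arithmetic from STEP 1 concrete, here is a minimal pure-Python sketch (CPU only, no CUDA required) of how the grid size and the `idx < length` guard interact. `div_ceil` mirrors the formula of `host::div_ceil`; `simulate_add_constant` is a hypothetical helper written only for this illustration and is not part of SGLang.

```python
K_BLOCK_SIZE = 256  # same value as kBlockSize in the C++ example


def div_ceil(a: int, b: int) -> int:
    """Integer ceiling division, mirroring host::div_ceil: (a + b - 1) // b."""
    return (a + b - 1) // b


def simulate_add_constant(src: list[int], constant: int) -> list[int]:
    """Emulate the kernel on the CPU: one inner-loop iteration per CUDA thread."""
    length = len(src)
    grid_size = div_ceil(length, K_BLOCK_SIZE)
    dst = [0] * length
    for block_idx in range(grid_size):
        for thread_idx in range(K_BLOCK_SIZE):
            idx = block_idx * K_BLOCK_SIZE + thread_idx
            if idx < length:  # threads past the end of the tensor do nothing
                dst[idx] = src[idx] + constant
    return dst


src = list(range(1000))
out = simulate_add_constant(src, 7)
print(div_ceil(1000, K_BLOCK_SIZE))  # blocks launched for 1000 elements
print(out == [x + 7 for x in src])
```

For 1000 elements the grid has 4 blocks, so 1024 threads are launched and the last 24 are masked out by the bounds check. That is why the guard in the kernel is needed whenever the length is not a multiple of the block size.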
+ +## C++ Include Library Reference + +The JIT kernel framework provides a set of reusable C++ headers in +`python/sglang/jit_kernel/include/sgl_kernel/`. Each header is designed +to be lightweight and self-contained. Below is a summary of each header +and its key APIs. + +### Core Utilities + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Header | Namespace | Purpose |
| --- | --- | --- |
| `utils.h` | `host` | Host-side essentials: `RuntimeCheck`, `Panic`, `div_ceil`, `irange` |
| `utils.cuh` | `device` / `host` | Type aliases (`fp16_t`, `bf16_t`, ...), `SGL_DEVICE` macro, PDL helpers, `LaunchKernel`, `RuntimeDeviceCheck` |
| `source_location.h` | (global) | Portable `std::source_location` wrapper for error reporting |
| `runtime.cuh` | `host::runtime` | CUDA runtime queries: `get_blocks_per_sm`, `get_sm_count`, `get_cc_major`, `get_runtime_version`, `get_available_dynamic_smem_per_block` |
+ +### Tensor Validation + + + + + + + + + + + + + + + + +
| Header | Namespace | Purpose |
| --- | --- | --- |
| `tensor.h` | `host` | `TensorMatcher`, `SymbolicSize`, `SymbolicDType`, `SymbolicDevice` |
+ +### Math & Type System + + + + + + + + + + + + + + + + + + + + + +
| Header | Namespace | Purpose |
| --- | --- | --- |
| `math.cuh` | `device::math` | `max`, `min`, `abs`, `sqrt`, `rsqrt`, `exp`, `sin`, `cos`, constants |
| `type.cuh` | (global) / `device` | `dtype_trait<T>`, `packed_t<T>`, `device::cast<To>(from)` |
+ +### Memory Access + + + + + + + + + + + + + + + + + + + + + +
| Header | Namespace | Purpose |
| --- | --- | --- |
| `vec.cuh` | `device` | `AlignedVector<T, N>` - vectorized load/store (up to 128-bit; 256-bit requires Blackwell GPUs) |
| `tile.cuh` | `device::tile` | `Memory<T>` - cooperative tiled memory I/O (thread/warp/CTA) |
+ +### Parallel Primitives + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Header | Namespace | Purpose |
| --- | --- | --- |
| `warp.cuh` | `device::warp` | `reduce_sum`, `reduce_max` via `__shfl_xor_sync` |
| `cta.cuh` | `device::cta` | `reduce_max` across warps via shared memory |
| `atomic.cuh` | `device::atomic` | `max` - atomic float max (CUDA + ROCm fallback) |
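
As a rough illustration of the XOR-shuffle pattern that a warp-level `reduce_sum` relies on, the following pure-Python sketch emulates a 32-lane butterfly reduction on the CPU. It is an assumption about the general technique (in each round, every lane adds the value held by lane `lane ^ offset`, which is what `__shfl_xor_sync` delivers), not a transcription of `warp.cuh`; the function name `warp_reduce_sum` is hypothetical.

```python
K_WARP_THREADS = 32  # matches device::kWarpThreads


def warp_reduce_sum(lane_values: list[float]) -> list[float]:
    """Emulate an XOR-butterfly sum reduction across one warp."""
    assert len(lane_values) == K_WARP_THREADS
    values = list(lane_values)
    offset = K_WARP_THREADS // 2
    while offset > 0:
        # Every lane reads its partner's value from the previous round, then adds it.
        values = [values[lane] + values[lane ^ offset] for lane in range(K_WARP_THREADS)]
        offset //= 2
    return values  # after 5 rounds (offsets 16, 8, 4, 2, 1) every lane holds the full sum


result = warp_reduce_sum([float(i) for i in range(K_WARP_THREADS)])
print(result[0], len(set(result)))
```

Because the exchange is symmetric, all 32 lanes end up with the same total, so no separate broadcast step is needed afterwards.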
+ +### Reusable Kernel Templates + + + + + + + + + + + + + + + + +
| Header | Namespace | Purpose |
| --- | --- | --- |
| `impl/norm.cuh` | `host::norm` / `device::norm` | RMSNorm building blocks (warp & CTA paths, `StorageType`) |
diff --git a/docs_new/docs/developer_guide/evaluating_new_models.mdx b/docs_new/docs/developer_guide/evaluating_new_models.mdx new file mode 100644 index 000000000000..9f5aade7a6fb --- /dev/null +++ b/docs_new/docs/developer_guide/evaluating_new_models.mdx @@ -0,0 +1,149 @@ +--- +title: "Evaluating New Models with SGLang" +metatags: + description: "SGLang model evaluation: MMLU, GSM8K, GPQA, HumanEval, MMMU benchmarks. Latency and throughput testing commands." +--- +This document provides commands for evaluating models' accuracy and performance. Before open-sourcing new models, we strongly suggest running these commands to verify whether the score matches your internal benchmark results. + +**For cross verification, please submit commands for installation, server launching, and benchmark running with all the scores and hardware requirements when open-sourcing your models.** + +[Reference: MiniMax M2](https://github.com/sgl-project/sglang/pull/12129) + +## Accuracy + +### LLMs + +SGLang provides built-in scripts to evaluate common benchmarks. + +**MMLU** + +```bash Command +python -m sglang.test.run_eval \ + --eval-name mmlu \ + --port 30000 \ + --num-examples 1000 \ + --max-tokens 8192 +``` + +**GSM8K** + +```bash Command +python -m sglang.test.few_shot_gsm8k \ + --host http://127.0.0.1 \ + --port 30000 \ + --num-questions 200 \ + --num-shots 5 +``` + +**HellaSwag** + +```bash Command +python benchmark/hellaswag/bench_sglang.py \ + --host http://127.0.0.1 \ + --port 30000 \ + --num-questions 200 \ + --num-shots 20 +``` + +**GPQA** + +```bash Command +python -m sglang.test.run_eval \ + --eval-name gpqa \ + --port 30000 \ + --num-examples 198 \ + --max-tokens 120000 \ + --repeat 8 +``` + + +For reasoning models, add `--thinking-mode ` (e.g., `qwen3`, `deepseek-r1`, `deepseek-v3`). You may skip it if the model has forced thinking enabled. 
+ + +**HumanEval** + +```bash Command +pip install human_eval + +python -m sglang.test.run_eval \ + --eval-name humaneval \ + --num-examples 10 \ + --port 30000 +``` + +### VLMs + +**MMMU** + +```bash Command +python benchmark/mmmu/bench_sglang.py \ + --port 30000 \ + --concurrency 64 +``` + + +You can set max tokens by passing `--extra-request-body '{"max_tokens": 4096}'`. + + +For models capable of processing video, we recommend extending the evaluation to include `VideoMME`, `MVBench`, and other relevant benchmarks. + +## Performance + +Performance benchmarks measure **Latency** (Time To First Token - TTFT) and **Throughput** (tokens/second). + +### LLMs + +**Latency-Sensitive Benchmark** + +This simulates a scenario with low concurrency (e.g., single user) to measure latency. + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --host 0.0.0.0 \ + --port 30000 \ + --dataset-name random \ + --num-prompts 10 \ + --max-concurrency 1 +``` + +**Throughput-Sensitive Benchmark** + +This simulates a high-traffic scenario to measure maximum system throughput. + +```bash Command +python -m sglang.bench_serving \ + --backend sglang \ + --host 0.0.0.0 \ + --port 30000 \ + --dataset-name random \ + --num-prompts 1000 \ + --max-concurrency 100 +``` + +**Single Batch Performance** + +You can also benchmark the performance of processing a single batch offline. + +```bash Command +python -m sglang.bench_one_batch_server \ + --model \ + --batch-size 8 \ + --input-len 1024 \ + --output-len 1024 +``` + +You can run more granular benchmarks: + +- **Low Concurrency**: `--num-prompts 10 --max-concurrency 1` +- **Medium Concurrency**: `--num-prompts 80 --max-concurrency 16` +- **High Concurrency**: `--num-prompts 500 --max-concurrency 100` + +## Reporting Results + +For each evaluation, please report: + +1. **Metric Score**: Accuracy % (LLMs and VLMs); Latency (ms) and Throughput (tok/s) (LLMs only). +2. 
**Environment settings**: GPU type/count, SGLang commit hash. +3. **Launch configuration**: Model path, TP size, and any special flags. +4. **Evaluation parameters**: Number of shots, examples, max tokens. diff --git a/docs_new/docs/developer_guide/msprobe_debugging_guide.mdx b/docs_new/docs/developer_guide/msprobe_debugging_guide.mdx new file mode 100644 index 000000000000..97b6b0dd399f --- /dev/null +++ b/docs_new/docs/developer_guide/msprobe_debugging_guide.mdx @@ -0,0 +1,599 @@ +--- +title: "MSProbe Debugging Guide" +metatags: + description: "Debugging AI model accuracy anomalies and numerical errors during inference using MSProbe in SGLang." +--- +MSProbe is a debugging tool for AI models that diagnoses accuracy anomalies and +numerical errors during model training and inference. It captures and monitors intermediate data (feature maps, weights, +activations, layer outputs) and contextual metadata (prompts, tensor dtypes, hardware configuration), and supports +visual analysis to systematically trace the root cause of accuracy degradation or numerical errors (e.g., NaN/Inf, +output drift, mismatched predictions). + +## Basic Details + +### Background Concepts: MSProbe Dumping Levels + +MSProbe supports three accuracy levels for data dumping, each for different debugging needs: + +- **L0**: Dumps tensors/statistics at the **module level** and generates `construct.json` (for network structure + reconstruction in visualization). Requires passing a model/submodule handle. +- **L1**: Dumps tensors/statistics at the **torch API level**, suitable for fine-grained API-level numerical checking. +- **mix**: Combines L0 + L1, ideal for scenarios that require both **graph reconstruction** and **numerical comparison**. + +### Prerequisites: Install MSProbe + +Install MSProbe with pip: + +```shell +pip install mindstudio-probe --pre +``` + +### Key Configuration Parameters + +MSProbe uses a JSON configuration file for customized data dumping. 
All core parameters are listed in the table below, +with the default JSON configuration provided for reference. + +#### Configuration Parameter Table + +| Field | Description | Required | +|:------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| +| `task` | Type of dump task. Common PyTorch values include `"statistics"` and `"tensor"`. A statistics task collects tensor statistics (mean, variance, max, min, etc.) while a tensor task captures arbitrary tensors. | Yes | +| `dump_path` | Directory where dump results are stored. When omitted, `MSProbe` uses its default path. | No | +| `rank` | Ranks to sample. An empty list collects every rank. For single-card tasks you must set this field to `[]`. | No | +| `step` | Token iteration(s) to sample. An empty list means every iteration. | No | +| `level` | Dump level string (`"L0"`, `"L1"`, or `"mix"`). `L0` targets `nn.Module`, `L1` targets `torch.api`, and `mix` collects both. | Yes | +| `async_dump` | Whether to enable asynchronous dump (supported for PyTorch `statistics`/`tensor` tasks). Defaults to `false`. | No | +| `scope` | Customize the scope of dump. Provide two module or API names that follow the tool's naming convention to lock a range, only data between the two names will be dumped. 
An empty list dumps every module or torch API.

Examples:
`"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]`
`"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]`

The `level` setting determines what can be provided—modules when `level=L0`, APIs when `level=L1`, and either modules or APIs when `level=mix`. | No | +| `list` | Customize dump list, only dumps elements from the list. An empty list dumps every module or torch API. Options include:

•Supply the full names of specific APIs in PyTorch eager mode to only dump those APIs. Example: `"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]`.
•When `level=mix`, you can provide module names so that the dump expands to everything produced while the module is running. Example: `"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]`.
•Provide a substring such as `"list": ["relu"]` to dump every API whose name contains the substring. When `level=mix`, modules whose names contain the substring are also expanded. | No | + +#### Default configuration + +```json +{ + "task": "statistics", + "dump_path": "./dump_path", + "rank": [], + "step": [], + "level": "L1", + "async_dump": false, + "statistics": { + "scope": [], + "list": [], + "data_mode": [ + "all" + ], + "summary_mode": "statistics" + }, + "tensor": { + "scope": [], + "list": [], + "data_mode": [ + "all" + ], + "file_format": "npy" + }, + "acc_check": { + "white_list": [], + "black_list": [], + "error_data_path": "./" + } +} +``` + +#### Outputs + +Dump files are written to the `dump_path` you configured. They usually contain: + +- `dump.json`: records metadata such as dtype, shape, min, max, mean, L2 norm, and `requires_grad`. +- `construct.json`: hierarchical structure description, required for visualization. It is non-empty only when `level` is `L0` or `mix`. +- `stack.json`: records the call stack information of each API/module. +- `dump_tensor_data`: generated when `task` is `tensor`; stores the collected tensor data. + +See [dump directory description](#dump-directory-description) for details. + +> **Note**: When MSProbe is enabled, CUDA graph is disabled (`disable_cuda_graph=True`) because MSProbe only supports dumping +> in eager mode, and warmup is skipped (`skip_server_warmup=True`) because there is no need to dump data for that stage. + +## End-to-End Examples + +MSProbe’s full debugging workflow follows **Enable → Collect Data → Visualize → Analyze Root Cause**. Below is a common +E2E example for SGLang-based model inference debugging. + +### Example: Advanced Debugging with Custom Configuration + +Suitable for targeted debugging (e.g., only collect statistics data for specific ranks/steps, enable mix level for graph +reconstruction + numerical comparison) and root cause analysis via **problem vs. benchmark comparison**.
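
Before launching the server, a custom configuration for this kind of targeted run can be sanity-checked with a few lines of Python. This is a minimal sketch: `check_msprobe_config` is a hypothetical helper, not an MSProbe API, and it only encodes the two fields marked as required and the three `level` values listed in the parameter table.

```python
import json

REQUIRED_FIELDS = ("task", "level")  # the fields marked "Yes" in the Required column
ALLOWED_LEVELS = ("L0", "L1", "mix")


def check_msprobe_config(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    config = json.loads(text)
    problems = [f"missing required field: {name}" for name in REQUIRED_FIELDS if name not in config]
    level = config.get("level")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {level!r}")
    return problems


example = '{"task": "statistics", "dump_path": "./problem_dump", "rank": [0, 1], "step": [0, 1], "level": "mix"}'
print(check_msprobe_config(example))
```

The check is deliberately conservative: it does not restrict `task`, because the table lists `"statistics"` and `"tensor"` only as common values.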
+ +#### Step 1: Enable +##### Prepare Custom Configuration JSON + +Create `msprobe-config.json` (dump statistics data for rank0/1, step0/1, mix level): + +```json +{ + "task": "statistics", + "dump_path": "./problem_dump", + "rank": [ + 0, + 1 + ], + "step": [ + 0, + 1 + ], + "level": "mix", + "async_dump": false, + "statistics": { + "scope": [], + "list": [], + "data_mode": [ + "all" + ], + "summary_mode": "statistics" + } +} +``` + +##### Enable MSProbe with Custom Configuration in SGLang + +Launch the SGLang server and specify the configuration file path with `--msprobe-dump-config`: + +```bash +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen2.5-0.5B-Instruct \ + --host 127.0.0.1 \ + --port 1027 \ + --msprobe-dump-config /home/msprobe-config.json +``` +#### Step 2: Collect Data +##### Collect Dump Data for Problem & Benchmark Sides + +Send normal inference requests to trigger model running (MSProbe automatically collects data during request processing): + +```bash +curl -H "Content-type: application/json" \ + -X POST \ + -d '{ + "model": "Qwen/Qwen2.5-0.5B-Instruct", + "messages": [ + { + "role": "user", + "content": "Hello, my name is" + } + ], + "max_tokens": 10 + }' \ + http://127.0.0.1:1027/v1/chat/completions +``` + +- **Problem side**: Run the above SGLang server (with the accuracy/numerical issue) and send inference request; dump + data is saved to `./problem_dump`. +- **Benchmark side**: Launch a normal SGLang server (without the issue, e.g., stable framework version/operator) with + the **same custom configuration** and send the **same inference request**; rename the dump directory + to `./bench_dump`. + +> **Key Requirement**: Problem and benchmark dumps must use the same inputs and sampling points (rank/step) +> for valid comparison. 
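
The sampling-point half of this requirement can be checked mechanically before visualizing. The sketch below assumes the `step{N}/rank{M}` directory layout described in the appendix and reports sampling points present on one side only; `sampling_points` and `compare_dumps` are hypothetical helper names, not part of MSProbe.

```python
from pathlib import Path


def sampling_points(dump_root: str) -> set[tuple[str, str]]:
    """Collect (step, rank) directory-name pairs from a dump root laid out as step*/rank*."""
    root = Path(dump_root)
    return {
        (step_dir.name, rank_dir.name)
        for step_dir in root.glob("step*")
        if step_dir.is_dir()
        for rank_dir in step_dir.glob("rank*")
        if rank_dir.is_dir()
    }


def compare_dumps(problem_root: str, bench_root: str) -> dict[str, list[tuple[str, str]]]:
    """List the sampling points that exist in only one of the two dump directories."""
    problem = sampling_points(problem_root)
    bench = sampling_points(bench_root)
    return {
        "only_in_problem": sorted(problem - bench),
        "only_in_bench": sorted(bench - problem),
    }


print(compare_dumps("./problem_dump", "./bench_dump"))
```

Two empty lists mean both sides cover the same ranks and steps. The script cannot verify that the inputs were identical, so that part still has to be ensured when sending the requests.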
+ +##### Check Generated Dump Files + +Dump files are saved to `./problem_dump` and `./bench_dump` you defined and include core files for subsequent analysis: + +- `dump.json`: Records tensor metadata of APIs and modules (dtype, shape, min/max/mean, L2 norm, `requires_grad`, etc.). +- `stack.json`: Logs call stack information of APIs and modules. +- `construct.json`: hierarchical structure description, required for visualization, its content is not empty. + +#### Step 3: Visualize +##### Visualize Problem vs. Benchmark Comparison (Multi-Rank) + +Generate a multi-rank comparison visualization file (mix level generates `construct.json` for graph reconstruction): + +```shell +msprobe graph_visualize -tp ./problem_dump/step0 -gp ./bench_dump/step0 -o ./graph_output +``` + +- `-tp`: Path to problem-side dump data +- `-gp`: Path to benchmark-side dump data +- `-o`: Output directory for visualization files + +If you want overflow check (for NaN/Inf detection), please specify the parameter `-oc` + +```shell +msprobe graph_visualize -tp ./problem_dump/step0 -gp ./bench_dump/step0 -o ./graph_output -oc +``` + +After the comparison or build task finishes, a `compare_{timestamp}.vis.db` file is created under `graph_output`. + +##### Launch TensorBoard + +Start TensorBoard: +```bash +tensorboard --logdir ./graph_output --bind_all --port 6006 +``` +#### Step 4: Analyze Root Cause +##### Locate Root Cause + +Root Cause Analysis in TensorBoard: +- Divergent nodes (with accuracy/numerical differences) are highlighted in **red** (darker red = larger difference). +- Click on divergent nodes to view detailed tensor data (inputs/outputs, parameters) and API/module call stacks. +- Use the **search/filter** function to quickly locate key layers/APIs (e.g., "relu", "conv"). +- Switch between ranks/steps via the UI to check cross-rank/cross-step divergence. +- Check the **overflow check** tab for NaN/Inf values in specific nodes (the direct cause of numerical instability). 
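
Outside TensorBoard, a quick scan of `dump.json` can also point at suspicious entries. The sketch below makes no assumption about the exact `dump.json` schema: it walks whatever JSON it is given and reports the path of every numeric value that is NaN or infinite. `find_non_finite` is a hypothetical helper, not an MSProbe command, and the sample document is made up for illustration.

```python
import json
import math


def find_non_finite(node, path=""):
    """Yield the JSON path of every float that is NaN or +/-Inf."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_non_finite(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_non_finite(value, f"{path}[{index}]")
    elif isinstance(node, float) and not math.isfinite(node):
        yield path


# Python's json module accepts the NaN / Infinity literals by default.
sample = json.loads('{"Tensor.add.0.forward": {"output": [{"Max": Infinity, "Min": -1.5, "Mean": NaN}]}}')
print(list(find_non_finite(sample)))
```

To use it on a real dump, load the file with `json.load(open(path))` and pass the result to `find_non_finite`. The earliest reported API or module is usually a good starting point for narrowing the `scope`/`list` settings.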
+ +##### Verify the Root Cause + +After locating the divergent node (e.g., a specific Conv layer or torch API with abnormal tensor values), verify by: + +- Narrowing the dump scope to this node (via `scope`/`list` in the configuration file) for fine-grained data collection. +- Modifying the problematic layer/API (e.g., replacing the operator, adjusting the dtype) and re-running the debugging + workflow to confirm the issue is resolved. + +## Troubleshooting + +### No Dump Files Generated + +1. Check whether MSProbe is installed by running `pip show mindstudio_probe`, which prints the version information when the package is present. If it is not installed, install it with `pip install mindstudio-probe --pre`; +2. Confirm the `--msprobe-dump-config` parameter points to the **correct JSON file path**. + +### Dump Files Are Too Large (Excessive Data) + +1. Start with `task: "statistics"` instead of `"tensor"` to collect only tensor statistics (avoids raw tensor dump); +2. Narrow the dump range with the `scope` field (specify start/end module/API); +3. Filter dump targets with the `list` field (only dump specific modules/APIs or substrings); +4. Sample specific `rank` and `step` (avoid dumping all ranks/iterations). + +### TensorBoard Visualization Fails + +1. Confirm `construct.json` is not empty (requires `level: L0` or `mix`; L1 does not generate graph files); +2. Check that the `-tp` (problem dump) and `-gp` (benchmark dump) paths point to **valid rank/step subdirectories** ( + e.g., `step0/rank0`); +3. Ensure the MSProbe version is up-to-date (reinstall with `pip install mindstudio-probe --pre --upgrade`); +4. Verify TensorBoard is installed and the `--logdir` parameter points to the directory containing `.vis.db` files (not + the file itself). + +### Numerical Comparison Shows No Divergence But Model Accuracy Is Low + +1.
Expand the dump `step` range (check more token iterations for late-stage divergence); +2. Switch to `task: "tensor"` (statistics may mask subtle numerical differences in raw tensor data); +3. Ensure the problem and benchmark dumps use **the same input data/hardware configuration** (different inputs lead to + invalid comparisons); +4. Use the `manual mapping` feature in TensorBoard (automatic mapping may miss some nodes for custom models). + +--- + +## Appendix + +### Dump directory description + +```text +├── problem_dump or bench_dump +│ ├── step0 +│ │ ├── rank0 +│ │ │ ├── dump_tensor_data +│ │ │ │ ├── Tensor.permute.1.forward.pt +│ │ │ │ ├── Functional.linear.5.backward.output.pt # Format: {api_type}.{api_name}.{call_count}.{forward/backward}.{input/output}.{arg_index}. +│ │ │ │ │ # arg_index is the nth input or output of the API. If an input is a list, keep numbering with decimals (e.g., 1.1 is the first element of the first argument). +│ │ │ │ ├── Module.conv1.Conv2d.forward.0.input.0.pt # Format: {Module}.{module_name}.{class_name}.{forward/backward}.{call_count}.{input/output}.{arg_index}. +│ │ │ │ ├── Module.conv1.Conv2d.forward.0.parameters.bias.pt # Module parameter data: {Module}.{module_name}.{class_name}.forward.{call_count}.parameters.{parameter_name}. +│ │ │ │ └── Module.conv1.Conv2d.parameters_grad.weight.pt # Module parameter gradients: {Module}.{module_name}.{class_name}.parameters_grad.{parameter_name}. Gradients do not include call_count because the same gradient updates all invocations. +│ │ │ │ # When the `model` argument passed to dump is a List[torch.nn.Module] or Tuple[torch.nn.Module], module-level data names also include the index inside the list ({Module}.{index}.*), e.g., Module.0.conv1.Conv2d.forward.0.input.0.pt. +│ │ │ ├── dump.json +│ │ │ ├── stack.json +│ │ │ ├── dump_error_info.log +│ │ │ └── construct.json +│ │ ├── rank1 +│ │ │ ├── dump_tensor_data +│ │ │ │ └── ... 
+│ │ │ ├── dump.json +│ │ │ ├── stack.json +│ │ │ ├── dump_error_info.log +│ │ │ └── construct.json +│ │ ├── ... +│ │ │ +│ │ └── rank7 +│ ├── step1 +│ │ ├── ... +│ ├── step2 +``` + +- `rank`: Device ID. Each card writes its data to the corresponding `rank{ID}` directory. In non-distributed scenarios + the directory is simply named `rank`. +- `dump_tensor_data`: Save the collected tensor data. +- `dump.json`: Statistics for the forward data of each API or module, including names, dtype, shape, max, min, mean, L2 + norm (square root of the L2 variance), and CRC-32 when `summary_mode="md5"`. + See [dump.json file description](#dumpjson-file-description) for details. +- `dump_error_info.log`: Present only when the dump tool encountered an error and records the failure log. +- `stack.json`: Call stacks for APIs/modules. +- `construct.json`: Hierarchical structure description. Empty when `level=L1`. + +### dump.json file description + +#### L0 level + +An L0 `dump.json` contains forward/backward I/O for modules together with parameters and parameter gradients. Using +PyTorch's `Conv2d` as an example, the network code looks like: + +`output = self.conv2(input) # self.conv2 = torch.nn.Conv2d(64, 128, 5, padding=2, bias=True)` + +`dump.json` contains the following entries: + +- `Module.conv2.Conv2d.forward.0`: Forward data of the module. `input_args` represents positional inputs, `input_kwargs` + represents keyword inputs, `output` stores forward outputs, and `parameters` stores weights/biases. +- `Module.conv2.Conv2d.parameters_grad`: Parameter gradients (weight and bias). +- `Module.conv2.Conv2d.backward.0`: Backward data of the module. `input` represents gradients that flow into the + module (gradients of the forward outputs) and `output` represents gradients that flow out (gradients of the module + inputs). 
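
The entry layout described above (`input_args`, `input_kwargs`, `output`, `parameters`, and the `parameters_grad` entries) can be walked generically. The following is a minimal sketch, not part of MSProbe: `iter_tensor_records` and `summarize_dump` are hypothetical helper names, and the code assumes that every tensor record is a JSON object whose `type` is `"torch.Tensor"` and that all entries live under the top-level `data` key.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def iter_tensor_records(node: Any, path: str = "") -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (path, record) for every tensor record nested under `node`.

    Assumption: a tensor record is a dict whose "type" field is "torch.Tensor".
    """
    if isinstance(node, dict):
        if node.get("type") == "torch.Tensor":
            yield path, node
            return
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            yield from iter_tensor_records(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_tensor_records(value, f"{path}[{index}]")


def summarize_dump(dump: Dict[str, Any]) -> List[Tuple[str, str, Any, tuple, Any]]:
    """Return one (entry, path, dtype, shape, norm) row per tensor record."""
    rows = []
    for entry_name, entry in dump.get("data", {}).items():
        for path, record in iter_tensor_records(entry):
            rows.append(
                (
                    entry_name,
                    path,
                    record.get("dtype"),
                    tuple(record.get("shape", [])),
                    record.get("Norm"),
                )
            )
    return rows


def load_dump(path: str) -> Dict[str, Any]:
    """Load a dump.json file from disk."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `summarize_dump(load_dump("./problem_dump/step0/rank0/dump.json"))` would list every recorded tensor with its dtype, shape, and norm, assuming the dump was written to the directory layout shown earlier.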
+ +**Note**: When the `model` parameter passed to the dump API is `List[torch.nn.Module]` or `Tuple[torch.nn.Module]`, +module-level names include the index inside the list (`{Module}.{index}.*`). Example: `Module.0.conv1.Conv2d.forward.0`. + +
+ +L0 dump.json + +```json +{ + "task": "tensor", + "level": "L0", + "framework": "pytorch", + "dump_data_dir": "/dump/path", + "data": { + "Module.conv2.Conv2d.forward.0": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 16, + 14, + 14 + ], + "Max": 1.638758659362793, + "Min": 0.0, + "Mean": 0.2544615864753723, + "Norm": 70.50277709960938, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.input.0.pt" + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 32, + 10, + 10 + ], + "Max": 1.6815717220306396, + "Min": -1.5120246410369873, + "Mean": -0.025344856083393097, + "Norm": 149.65576171875, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.output.0.pt" + } + ], + "parameters": { + "weight": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 5, + 5 + ], + "Max": 0.05992485210299492, + "Min": -0.05999220535159111, + "Mean": -0.0006165213999338448, + "Norm": 3.421217441558838, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.parameters.weight.pt" + }, + "bias": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32 + ], + "Max": 0.05744686722755432, + "Min": -0.04894155263900757, + "Mean": 0.006410328671336174, + "Norm": 0.17263513803482056, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt" + } + } + }, + "Module.conv2.Conv2d.parameters_grad": { + "weight": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 5, + 5 + ], + "Max": 0.018550323322415352, + "Min": -0.008627401664853096, + "Mean": 0.0006675920449197292, + "Norm": 0.26084786653518677, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.parameters_grad.weight.pt" + } + ], + "bias": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32 + ], + "Max": 0.014914230443537235, + 
"Min": -0.006656786892563105, + "Mean": 0.002657240955159068, + "Norm": 0.029451673850417137, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.parameters_grad.bias.pt" + } + ] + }, + "Module.conv2.Conv2d.backward.0": { + "input": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 32, + 10, + 10 + ], + "Max": 0.0015069986693561077, + "Min": -0.001139344065450132, + "Mean": 3.3215508210560074e-06, + "Norm": 0.020567523315548897, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.backward.0.input.0.pt" + } + ], + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 16, + 14, + 14 + ], + "Max": 0.0007466732058674097, + "Min": -0.00044813455315306783, + "Mean": 6.814070275140693e-06, + "Norm": 0.01474067009985447, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.backward.0.output.0.pt" + } + ] + } + } +} +``` + +
+ +#### L1 level + +An L1 `dump.json` records forward/backward I/O for APIs. Using PyTorch's `relu` function as an +example (`output = torch.nn.functional.relu(input)`), the file contains: + +- `Functional.relu.0.forward`: Forward data of the API. `input_args` are positional inputs, `input_kwargs` are keyword + inputs, and `output` stores the forward outputs. +- `Functional.relu.0.backward`: Backward data of the API. `input` represents the gradients of the forward outputs, + and `output` represents the gradients that flow back to the forward inputs. + +
+ +L1 dump.json + +```json +{ + "task": "tensor", + "level": "L1", + "framework": "pytorch", + "dump_data_dir": "/dump/path", + "data": { + "Functional.relu.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 1.3864083290100098, + "Min": -1.3364859819412231, + "Mean": 0.03711778670549393, + "Norm": 236.20692443847656, + "requires_grad": true, + "data_name": "Functional.relu.0.forward.input.0.pt" + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 1.3864083290100098, + "Min": 0.0, + "Mean": 0.16849493980407715, + "Norm": 175.23345947265625, + "requires_grad": true, + "data_name": "Functional.relu.0.forward.output.0.pt" + } + ] + }, + "Functional.relu.0.backward": { + "input": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 0.0001815402356442064, + "Min": -0.00013352684618439525, + "Mean": 0.00011915402356442064, + "Norm": 0.007598237134516239, + "requires_grad": false, + "data_name": "Functional.relu.0.backward.input.0.pt" + } + ], + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 0.0001815402356442064, + "Min": -0.00012117840378778055, + "Mean": 2.0098118724831693e-08, + "Norm": 0.006532244384288788, + "requires_grad": false, + "data_name": "Functional.relu.0.backward.output.0.pt" + } + ] + } + } +} +``` + +
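
Because both the problem-side and benchmark-side dumps share this format, a quick first pass before opening TensorBoard is to compare the recorded `Norm` values entry by entry. The sketch below is an illustration under stated assumptions, not the comparison that `msprobe graph_visualize` performs: `flatten_tensor_records` and `diverging_norms` are hypothetical names, the relative tolerance is arbitrary, and only entries present in both dumps are compared.

```python
import math
from typing import Any, Dict, List, Tuple


def flatten_tensor_records(dump: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map '<entry name>.<nested path>' to each tensor record under dump['data']."""
    flat: Dict[str, Dict[str, Any]] = {}

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            if node.get("type") == "torch.Tensor":
                flat[path] = node
            else:
                for key, value in node.items():
                    walk(value, f"{path}.{key}")
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{path}.{index}")

    for entry_name, entry in dump.get("data", {}).items():
        walk(entry, entry_name)
    return flat


def diverging_norms(
    problem: Dict[str, Any], bench: Dict[str, Any], rel_tol: float = 1e-3
) -> List[Tuple[str, float, float, float]]:
    """Return (key, problem_norm, bench_norm, relative_diff), largest difference first."""
    problem_flat = flatten_tensor_records(problem)
    bench_flat = flatten_tensor_records(bench)
    rows = []
    for key in sorted(problem_flat.keys() & bench_flat.keys()):
        p_norm = problem_flat[key].get("Norm")
        b_norm = bench_flat[key].get("Norm")
        if not isinstance(p_norm, (int, float)) or not isinstance(b_norm, (int, float)):
            continue  # skip records without numeric statistics
        if not (math.isfinite(p_norm) and math.isfinite(b_norm)):
            # NaN/Inf norms are always reported, ranked first.
            rows.append((key, p_norm, b_norm, math.inf))
        elif not math.isclose(p_norm, b_norm, rel_tol=rel_tol, abs_tol=1e-12):
            rel_diff = abs(p_norm - b_norm) / max(abs(b_norm), 1e-12)
            rows.append((key, p_norm, b_norm, rel_diff))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

Matching norms do not prove that two tensors are equal, so this kind of check only narrows down where to look; the raw tensors collected with `task: "tensor"` remain the reference for a detailed comparison.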
+ +#### mix level + +A `mix`-level `dump.json` contains both L0 and L1 level data; the file format is the same as the examples above. diff --git a/docs_new/docs/developer_guide/overview.mdx b/docs_new/docs/developer_guide/overview.mdx new file mode 100644 index 000000000000..1ad72a0c416f --- /dev/null +++ b/docs_new/docs/developer_guide/overview.mdx @@ -0,0 +1,12 @@ +--- +title: Developer Guide +description: Contributing to SGLang, including development setup, benchmarking, and evaluation. +--- + +- [Contribution Guide](./contribution_guide) +- [Development Guide (Docker)](./development_guide_using_docker) +- [JIT Kernels](./development_jit_kernel_guide) +- [Benchmark and Profiling](./benchmark_and_profiling) +- [Bench Serving](./bench_serving) +- [Evaluating New Models](./evaluating_new_models) +- [MSProbe Debugging Guide](./msprobe_debugging_guide) diff --git a/docs_new/docs/developer_guide/release_process.mdx b/docs_new/docs/developer_guide/release_process.mdx new file mode 100644 index 000000000000..cac283675636 --- /dev/null +++ b/docs_new/docs/developer_guide/release_process.mdx @@ -0,0 +1,21 @@ +--- +title: "PyPI Package Release Process" +metatags: + description: "SGLang PyPI release: version update, upload_pypi.sh script, GitHub release creation." +--- +## Update the version in code +Update the package version in `python/pyproject.toml` and `python/sglang/__init__.py`. + +## Upload the PyPI package + +```bash Command +pip install build twine +``` + +```bash Command +cd python +bash upload_pypi.sh +``` + +## Make a release in GitHub +Create a new release at https://github.com/sgl-project/sglang/releases/new.
diff --git a/docs_new/docs/developer_guide/setup_github_runner.mdx b/docs_new/docs/developer_guide/setup_github_runner.mdx new file mode 100644 index 000000000000..213a25d4763c --- /dev/null +++ b/docs_new/docs/developer_guide/setup_github_runner.mdx @@ -0,0 +1,54 @@ +--- +title: "Set Up Self-Hosted Runners for GitHub Actions" +metatags: + description: "SGLang GitHub Actions self-hosted runner: Docker setup for NVIDIA/AMD GPUs, config.sh and run.sh." +--- +## Add a Runner + +### Step 1: Start a docker container. + +**You can mount a folder for the shared Hugging Face model weights cache.** +The command below uses `/tmp/huggingface` as an example. + +```bash +docker pull nvidia/cuda:12.9.1-devel-ubuntu22.04 +# NVIDIA +docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.9.1-devel-ubuntu22.04 /bin/bash +# AMD +docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.8-rocm700-mi30x /bin/bash +# AMD, only the last 2 GPUs +docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.8-rocm700-mi30x /bin/bash +``` + +### Step 2: Configure the runner by `config.sh` + +Run these commands inside the container. + +```bash Command +apt update && apt install -y curl python3-pip git +pip install --upgrade pip +export RUNNER_ALLOW_RUNASROOT=1 +``` + +Then follow https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners to run `config.sh`. + +**Notes** +- You do not need to specify the runner group. +- Give it a name (e.g., `test-sgl-gpu-0`) and some labels (e.g., `1-gpu-h100`). The labels can be edited later in the GitHub settings. +- You do not need to change the work folder.
+ +### Step 3: Run the runner by `run.sh` + +- Set up environment variables +```bash Command +export HF_HOME=/hf_home +export SGLANG_IS_IN_CI=true +export HF_TOKEN=hf_xxx +export OPENAI_API_KEY=sk-xxx +export CUDA_VISIBLE_DEVICES=0 +``` + +- Run the runner in a restart loop so it comes back up whenever it exits +```bash Command +while true; do ./run.sh; echo "Restarting..."; sleep 2; done +``` diff --git a/docs_new/docs/get-started/install.mdx b/docs_new/docs/get-started/install.mdx new file mode 100644 index 000000000000..a7231d153117 --- /dev/null +++ b/docs_new/docs/get-started/install.mdx @@ -0,0 +1,208 @@ +--- +title: Installation +description: Install SGLang with pip/uv, source, Docker, Kubernetes, and cloud deployment options. +keywords: + - installation + - sglang + - pip + - docker +--- +You can install SGLang using one of the methods below. +This page primarily applies to common NVIDIA GPU platforms. +For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../hardware-platforms/amd_gpu), [Intel Xeon CPUs](../hardware-platforms/cpu_server), [Google TPU](../hardware-platforms/tpu), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/), [NVIDIA Jetson](../hardware-platforms/nvidia_jetson), [Ascend NPUs](../hardware-platforms/ascend-npus/ascend_npu), and [Intel XPU](../hardware-platforms/xpu). + + +Prerequisites: Python 3.10 or higher. + + +## Method 1: With pip or uv + +It is recommended to use uv for faster installation: + +```bash Command +pip install --upgrade pip +pip install uv +uv pip install sglang +``` + +### Quick fixes to common problems +- If you encounter `OSError: CUDA_HOME environment variable is not set`, set `CUDA_HOME` to your CUDA install root using either of the following solutions: + 1. Use `export CUDA_HOME=/usr/local/cuda-` to set the `CUDA_HOME` environment variable. + 2. Install FlashInfer first following the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
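
If you are unsure which directory to use for `CUDA_HOME`, one way to find a candidate is to locate `nvcc` and take the directory two levels above it. This is only a sketch: `guess_cuda_home` is a hypothetical helper, not part of SGLang, and it assumes the conventional layout in which `nvcc` lives at `<CUDA root>/bin/nvcc`. Verify the result against your actual installation before exporting it.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def guess_cuda_home() -> Optional[str]:
    """Return a likely CUDA install root, or None if it cannot be inferred.

    Assumption: nvcc is installed at <CUDA root>/bin/nvcc.
    """
    # Respect an existing setting rather than overriding it.
    env_value = os.environ.get("CUDA_HOME")
    if env_value:
        return env_value

    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None  # nvcc is not on PATH, so nothing can be inferred

    # Resolve symlinks (e.g., /usr/local/cuda -> a versioned directory) first.
    return str(Path(nvcc).resolve().parent.parent)
```

If the function returns a path, you can export it as `CUDA_HOME` in your shell; if it returns `None`, the CUDA toolkit may not be installed or `nvcc` may not be on your `PATH`.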
+ +## Method 2: From source + +```bash Command +# Use the last release branch +git clone -b v0.5.9 https://github.com/sgl-project/sglang.git +cd sglang + +# Install the python packages +pip install --upgrade pip +pip install -e "python" +``` + +**Quick fixes to common problems** + +- If you want to develop SGLang, you can try the dev docker image. Please refer to [setup docker container](../developer_guide/development_guide_using_docker#setup-docker-container). The docker image is `lmsysorg/sglang:dev`. + +## Method 3: Using docker + +The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker). +Replace `` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens). + +```bash Command +docker run --gpus all \ + --shm-size 32g \ + -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000 +``` + +For production deployments, use the `runtime` variant which is significantly smaller (~40% reduction) by excluding build tools and development dependencies: + +```bash Command +docker run --gpus all \ + --shm-size 32g \ + -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=" \ + --ipc=host \ + lmsysorg/sglang:latest-runtime \ + python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000 +``` + +You can also find the nightly docker images [here](https://hub.docker.com/r/lmsysorg/sglang/tags?name=nightly). + +Notes: +- SGLang is shipped with CUDA 13 environment by default. To run SGLang on CUDA 12 environment, please use images with `-cu12` or `-cu129` suffix, such as `lmsysorg/sglang:latest-cu129` or `lmsysorg/sglang:dev-cu12`. 
+ +## Method 4: Using Kubernetes + +Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs). + + + +1. Option 1: For single node serving (typically when the model size fits into GPUs on one node) + + Run `kubectl apply -f docker/k8s-sglang-service.yaml` to create a Kubernetes deployment and service, using llama-31-8b as an example. + +2. Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as `DeepSeek-R1`) + + Modify the LLM model path and arguments as necessary, then run `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node Kubernetes StatefulSet and serving service. + + + +## Method 5: Using docker compose + + + +> This method is recommended if you plan to serve it as a service. +> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml). + +1. Copy the [compose.yaml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine. +2. Execute the command `docker compose up -d` in your terminal. + + + +## Method 6: Run on Kubernetes or Clouds with SkyPilot + + + +To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot). + +1. Install SkyPilot and set up a Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html). +2.
Deploy on your own infra with a single command and get the HTTP API endpoint: + +SkyPilot YAML: `sglang.yaml` + +```yaml Config +# sglang.yaml +envs: + HF_TOKEN: null + +resources: + image_id: docker:lmsysorg/sglang:latest + accelerators: A100 + ports: 30000 + +run: | + conda deactivate + python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --host 0.0.0.0 \ + --port 30000 +``` + + + +```bash Command +# Deploy on any cloud or Kubernetes cluster. Use --cloud to select a specific cloud provider. +HF_TOKEN= sky launch -c sglang --env HF_TOKEN sglang.yaml + +# Get the HTTP API endpoint +sky status --endpoint 30000 sglang +``` + +3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve). + + + +## Method 7: Run on AWS SageMaker + + + +To deploy SGLang on AWS SageMaker, check out [AWS SageMaker Inference](https://aws.amazon.com/sagemaker/ai/deploy). + +Amazon Web Services provides support for SGLang containers along with routine security patching. For available SGLang containers, check out [AWS SGLang DLCs](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sglang-containers). + +To host a model with your own container, follow these steps: + +1. Build a docker container with [sagemaker.Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/sagemaker.Dockerfile) alongside the [serve](https://github.com/sgl-project/sglang/blob/main/docker/serve) script. +2. Push your container onto AWS ECR. + +Dockerfile Build Script: `build-and-push.sh` + +```bash Command +#!/bin/bash +AWS_ACCOUNT="" +AWS_REGION="" +REPOSITORY_NAME="" +IMAGE_TAG="" + +ECR_REGISTRY="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com" +IMAGE_URI="${ECR_REGISTRY}/${REPOSITORY_NAME}:${IMAGE_TAG}" + +echo "Starting build and push process..."
+ +# Login to ECR +echo "Logging into ECR..." +aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ECR_REGISTRY} + +# Build the image +echo "Building Docker image..." +docker build -t ${IMAGE_URI} -f sagemaker.Dockerfile . + +echo "Pushing ${IMAGE_URI}" +docker push ${IMAGE_URI} + +echo "Build and push completed successfully!" +``` + + + +3. To deploy a model for serving on AWS SageMaker, refer to [deploy_and_serve_endpoint.py](https://github.com/sgl-project/sglang/blob/main/examples/sagemaker/deploy_and_serve_endpoint.py). For more information, check out [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk). + 1. By default, the model server on SageMaker runs with the following command: `python3 -m sglang.launch_server --model-path opt/ml/model --host 0.0.0.0 --port 8080`. This default suits hosting your own model with SageMaker. + 2. To change the serving parameters, set environment variables with the prefix `SM_SGLANG_`; the [serve](https://github.com/sgl-project/sglang/blob/main/docker/serve) script accepts every option listed by `python3 -m sglang.launch_server --help`. + 3. The serve script automatically converts each environment variable with the prefix `SM_SGLANG_` into a CLI argument for `python3 -m sglang.launch_server`; for example, `SM_SGLANG_INPUT_ARGUMENT` becomes `--input-argument`. + 4. For example, to run [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) with the reasoning parser, add the environment variables `SM_SGLANG_MODEL_PATH=Qwen/Qwen3-0.6B` and `SM_SGLANG_REASONING_PARSER=qwen3`. + + + +## Common Notes + +- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above.
If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub. +- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`. diff --git a/docs_new/docs/get-started/quickstart.mdx b/docs_new/docs/get-started/quickstart.mdx new file mode 100644 index 000000000000..a80ef9b0010a --- /dev/null +++ b/docs_new/docs/get-started/quickstart.mdx @@ -0,0 +1,332 @@ +--- +title: "Quickstart" +description: "Get up and running with SGLang in minutes: install, launch a server, and send your first request." +--- + +## Overview + +This guide walks you through the entire flow of getting started with SGLang: + +1. **Install** SGLang +2. **Launch** an inference server +3. **Send requests** using cURL, OpenAI Python client, Python `requests`, or the native SGLang API + +By the end, you'll have a working SGLang server responding to your prompts. + +--- + +## Prerequisites + +- **Python**: 3.10 or higher +- **GPU**: NVIDIA GPU with CUDA support (sm80 and above, e.g., A10, A100, L4, L40S, H100) +- **OS**: Linux (recommended) + + +For other platforms, see the dedicated guides for [AMD GPUs](../hardware-platforms/amd_gpu), [Intel Xeon CPUs](../hardware-platforms/cpu_server), [Google TPUs](../hardware-platforms/tpu), [NVIDIA Jetson](../hardware-platforms/nvidia_jetson), [Ascend NPUs](../hardware-platforms/ascend-npus/ascend_npu), and [Intel XPU](../hardware-platforms/xpu). 
+ + +--- + +## Installation + + + + We recommend using **uv** for faster installation: + +```bash +pip install --upgrade pip +pip install uv +uv pip install sglang +``` + + +```bash +# Clone and install from source +git clone https://github.com/sgl-project/sglang.git +cd sglang +pip install --upgrade pip +pip install -e "python" +``` + + + The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags). + + Replace `` with your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens): + + ```bash + docker run --gpus all \ + --shm-size 32g \ + -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000 + ``` + + For production deployments, use the smaller **runtime** variant (~40% size reduction): + + ```bash + docker run --gpus all \ + --shm-size 32g \ + -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=" \ + --ipc=host \ + lmsysorg/sglang:latest-runtime \ + python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000 + ``` + + + + +If you encounter `OSError: CUDA_HOME environment variable is not set`, set it with: +```bash +export CUDA_HOME=/usr/local/cuda- +``` + + +--- + +## Launch a Server + +Start the SGLang server with a model. Here we use `qwen/qwen2.5-0.5b-instruct` as a lightweight example: + +```bash +python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000 +``` + +Wait until you see `The server is fired up and ready to roll!` in the terminal output. 
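
When the server is launched from a script rather than watched in a terminal, waiting for the log line can be replaced by polling an HTTP route until it answers. The following is a minimal sketch using only the Python standard library; `wait_for_server` is a hypothetical helper, and the choice of URL is an assumption: pass any route on your server that returns HTTP 200 once it is ready (for example a health-check route, if your SGLang version exposes one).

```python
import http.client
import time
import urllib.request


def wait_for_server(url: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds elapse.

    Returns True if the server answered with 200 in time, otherwise False.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Covers connection refused, timeouts, and non-2xx responses
            # (urllib raises HTTPError, a subclass of OSError, for those).
            pass
        time.sleep(interval_s)
    return False
```

A call such as `wait_for_server("http://localhost:30000/health")` would then block until the server responds or five minutes pass; the `/health` path is an assumption here and should be replaced with a route you have confirmed on your deployment.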
+ + +Once the server is running, API documentation is available at: +- **Swagger UI**: `http://localhost:30000/docs` +- **ReDoc**: `http://localhost:30000/redoc` +- **OpenAPI Spec**: `http://localhost:30000/openapi.json` + + + +The server automatically applies the chat template from the Hugging Face tokenizer. You can override it with `--chat-template` when launching. + + +--- + +## Send Requests + +SGLang is fully **OpenAI API-compatible**, so you can use the same tools and libraries you already know. + +### Using cURL + +```bash +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "qwen/qwen2.5-0.5b-instruct", + "messages": [ + {"role": "user", "content": "What is the capital of France?"} + ] + }' +``` + +### Using OpenAI Python Client + +Install the OpenAI Python library if you haven't: + +```bash +pip install openai +``` + +Then send a request: + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) + +print(response.choices[0].message.content) +``` + +#### Streaming + +```python Example +import openai + +client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="qwen/qwen2.5-0.5b-instruct", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, + stream=True, +) + +for chunk in response: + if chunk.choices[0].delta.content: + print(chunk.choices[0].delta.content, end="", flush=True) +``` + +### Using Python Requests + +```python Example +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "qwen/qwen2.5-0.5b-instruct", + "messages": [{"role": "user", "content": 
"What is the capital of France?"}], +} + +response = requests.post(url, json=data) +print(response.json()) +``` + +### Using the Native `/generate` API + +SGLang also provides a native `/generate` endpoint for more flexibility. + +```python Example +import requests + +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "The capital of France is", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 32, + }, + }, +) + +print(response.json()) +``` + +#### Streaming with `/generate` + +```python Example +import requests +import json + +response = requests.post( + "http://localhost:30000/generate", + json={ + "text": "The capital of France is", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 32, + }, + "stream": True, + }, + stream=True, +) + +prev = 0 +for chunk in response.iter_lines(decode_unicode=False): + chunk = chunk.decode("utf-8") + if chunk and chunk.startswith("data:"): + if chunk == "data: [DONE]": + break + data = json.loads(chunk[5:].strip("\n")) + output = data["text"] + print(output[prev:], end="", flush=True) + prev = len(output) +``` + +--- + +## Offline Batch Inference (No Server) + +SGLang also supports offline batch inference using the `Engine` class directly -- no HTTP server required. 
+ +```python Example +import sglang as sgl + +llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct") + +prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", +] + +sampling_params = {"temperature": 0.8, "top_p": 0.95} + +outputs = llm.generate(prompts, sampling_params) + +for prompt, output in zip(prompts, outputs): + print(f"Prompt: {prompt}\nGenerated text: {output['text']}\n") + +llm.shutdown() +``` + +--- + +## Common Troubleshooting + + + + Set the `CUDA_HOME` environment variable to your CUDA install root: + ```bash + export CUDA_HOME=/usr/local/cuda- + ``` + + + Switch to alternative backends by adding these flags when launching the server: + ```bash + --attention-backend triton --sampling-backend pytorch + ``` + + + ```bash + pip3 install --upgrade flashinfer-python --force-reinstall --no-deps + rm -rf ~/.cache/flashinfer + ``` + + + ```bash + export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas + ``` + + + +--- + +{/* +WIP, TBD linked later +## What's Next? + + + + Explore the full Chat Completions and Completions APIs, including multi-turn conversations. + + + Send image inputs alongside text using OpenAI-compatible vision APIs. + + + Fine-tune generation with temperature, top-p, frequency penalty, and more. + + + Customize server behavior with advanced launch arguments like tensor parallelism. + + + Constrain model output to JSON, regex, or EBNF grammars. + + + Use the familiar Ollama CLI and Python library with SGLang as the backend. + + +*/} diff --git a/docs_new/docs/hardware-platforms/amd_gpu.mdx b/docs_new/docs/hardware-platforms/amd_gpu.mdx new file mode 100644 index 000000000000..4bc551d860af --- /dev/null +++ b/docs_new/docs/hardware-platforms/amd_gpu.mdx @@ -0,0 +1,196 @@ +--- +title: "AMD GPUs" +--- +This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues). 
+ +## System Configuration + +When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning: + +- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) +- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html) +- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html) +- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html) +- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html) + +**NOTE:** We strongly recommend reading these docs and guides entirely to fully utilize your system. + +Below are a few key settings to confirm or enable for SGLang: + +### Update GRUB Settings + +In `/etc/default/grub`, append the following to `GRUB_CMDLINE_LINUX`: + +```text GRUB Configuration +pci=realloc=off iommu=pt +``` + +Afterward, run `sudo update-grub` (or your distro’s equivalent) and reboot. + +### Disable NUMA Auto-Balancing + +```bash Disable NUMA +sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' +``` + +You can automate or verify this change using [this helpful script](https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh). + +Again, please go through the entire documentation to confirm your system is using the recommended configuration. + +## Install SGLang + +You can install SGLang using one of the methods below. 
+ +### Install from Source + +```bash Command +# Use the last release branch +git clone -b v0.5.9 https://github.com/sgl-project/sglang.git +cd sglang + +# Compile sgl-kernel +pip install --upgrade pip +cd sgl-kernel +python setup_rocm.py install + +# Install sglang python package along with diffusion support +cd .. +rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml +pip install -e "python[all_hip]" +``` + +### Install Using Docker (Recommended) + +The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [rocm.Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker). + +The steps below show how to build and use an image. + +1. Build the docker image. + If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image names in the steps below. + + ```bash Command + docker build -t sglang_image -f rocm.Dockerfile . + ``` + +2. Create a convenient alias. + + ```bash Command + alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \ + --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -v $HOME/dockerx:/dockerx \ + -v /data:/data' + ``` + + If you are using RDMA, please note that: + - `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them. + - You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`. + +3. Launch the server. + + **NOTE:** Replace `` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens). + + ```bash Command + drun -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=" \ + sglang_image \ + python3 -m sglang.launch_server \ + --model-path NousResearch/Meta-Llama-3.1-8B \ + --host 0.0.0.0 \ + --port 30000 + ``` + +4. 
To verify the utility, you can run a benchmark in another terminal or refer to [other docs](../basic_usage/openai_api_completions) to send requests to the engine. + + ```bash Command + drun sglang_image \ + python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 4000 \ + --random-input 128 \ + --random-output 128 + ``` + +With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities. + +## Quantization on AMD GPUs + +The [Quantization documentation](../advanced_features/quantization#platform-compatibility) has a full compatibility matrix. The short version: FP8, AWQ, MXFP4, W8A8, GPTQ, compressed-tensors, Quark, and **petit_nvfp4** (NVFP4 on ROCm via [Petit](https://github.com/causalflow-ai/petit-kernel)) all work on AMD. Methods that depend on Marlin or NVIDIA-specific kernels (`awq_marlin`, `gptq_marlin`, `gguf`, `modelopt_fp8`, `modelopt_fp4`) do not. + +A few things to keep in mind: + +- FP8 works via Aiter or Triton. Pre-quantized FP8 models like DeepSeek-V3/R1 work out of the box. +- AWQ uses Triton dequantization kernels on AMD. The faster Marlin path is not available. +- MXFP4 requires CDNA3/CDNA4 and `SGLANG_USE_AITER=1`. +- `petit_nvfp4` enables NVFP4 models (e.g., [Llama 3.3 70B FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4)) on MI250/MI300X via [Petit](https://github.com/causalflow-ai/petit-kernel). Install with `pip install petit-kernel`; no `--quantization` flag needed when loading pre-quantized NVFP4 models. +- `quark_int4fp8_moe` is an AMD-only online quantization method for MoE models on CDNA3/CDNA4. + +Several of these backends are accelerated by [Aiter](https://github.com/ROCm/aiter). 
Enable it with: + +```bash Command +export SGLANG_USE_AITER=1 +``` + +Example -- serving an AWQ model: + +```bash Command +python3 -m sglang.launch_server \ + --model-path hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 \ + --trust-remote-code \ + --port 30000 --host 0.0.0.0 +``` + +Example -- FP8 online quantization: + +```bash Command +python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --quantization fp8 \ + --port 30000 --host 0.0.0.0 +``` + +## Examples + +### Running DeepSeek-V3 + +The only difference when running DeepSeek-V3 is in how you start the server. Here's an example command; note the `--model-path` value: + +```bash Command +drun -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --ipc=host \ + --env "HF_TOKEN=" \ + sglang_image \ + python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V3 \ + --tp 8 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference. + +### Running Llama3.1 + +Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is the model specified with `--model-path` when starting the server, as in the following example command: + +```bash Command +drun -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --ipc=host \ + --env "HF_TOKEN=" \ + sglang_image \ + python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --tp 8 \ + --trust-remote-code \ + --host 0.0.0.0 \ + --port 30000 +``` + +### Warmup Step + +When the server displays `The server is fired up and ready to roll!`, the startup has completed successfully.
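
Instead of watching the log for the ready message, a script can poll the server until it answers. The following is a minimal sketch, not part of SGLang itself: `http_probe` and `wait_until_ready` are hypothetical helper names, and the URL in the commented usage assumes the server from the examples above is listening on port 30000 and exposes a `/health` route.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage (assumed URL; adjust host, port, and route to your deployment):
# ready = wait_until_ready(lambda: http_probe("http://127.0.0.1:30000/health"))
# print("server ready" if ready else "server did not become ready in time")
```

Passing the probe as a callable keeps the retry loop independent of HTTP, so the same loop can wrap a real request (for example a short generation call) if a plain health check is not a strong enough signal for your setup.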
diff --git a/docs_new/docs/hardware-platforms/apple_metal.mdx b/docs_new/docs/hardware-platforms/apple_metal.mdx new file mode 100644 index 000000000000..81e79c7486b0 --- /dev/null +++ b/docs_new/docs/hardware-platforms/apple_metal.mdx @@ -0,0 +1,78 @@ +--- +title: "Apple Silicon with Metal" +metatags: + description: "Run SGLang on Apple Silicon using the Metal backend." +--- + +This document describes how to run SGLang on Apple Silicon using [Metal (MLX)](https://opensource.apple.com/projects/mlx/). If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues). + +## Install SGLang + +You can install SGLang using one of the methods below. + +### Install from Source + +```bash +# Use the default branch +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Install sglang python package +pip install --upgrade pip +rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml +uv pip install -e "python[all_mps]" +``` + +## Launch of the Serving Engine + +Launch the server with: + +```bash +SGLANG_USE_MLX=1 python -m sglang.launch_server \ + --model \ + --disable-cuda-graph \ + --host 0.0.0.0 +``` + +**Key Parameters Explained:** + +1. `SGLANG_USE_MLX=1` - Enables MLX as the SGLang runtime backend. If it is not set, SGLang falls back to `torch.mps`, which has less support. +2. `--disable-cuda-graph` - Disables usage of CUDA graph, which is not relevant for Apple Metal. +3. `--disable-overlap-schedule` - Optional. Disables overlap scheduling, which is enabled by default (when the flag is absent) and is implemented with MLX's `async_eval()`. + + +## Benchmarking with Requests + +`sglang.bench_one_batch` calls the synchronous prefill/decode methods directly without going through the scheduler and the overlap code path. + +`sglang.bench_offline_throughput` goes through the scheduler and the overlap code path, so it can toggle overlap scheduling with the `--disable-overlap-schedule` flag.
+ +### Throughput Testing + +Basic synchronous one batch throughput: +```bash +SGLANG_USE_MLX=1 python -m sglang.bench_one_batch \ + --model-path \ + --disable-cuda-graph \ + --tp-size 1 \ + --batch-size 1 \ + --input-len 60 \ + --output-len 10 +``` + +Synchronous offline throughput: +```bash +SGLANG_USE_MLX=1 python -m sglang.bench_offline_throughput \ + --model-path \ + --disable-cuda-graph \ + --num-prompts 1 \ + --disable-overlap-schedule +``` + +Asynchronous offline throughput: +```bash +SGLANG_USE_MLX=1 python -m sglang.bench_offline_throughput \ + --model-path \ + --disable-cuda-graph \ + --num-prompts 1 +``` diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_contribution_guide.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_contribution_guide.mdx new file mode 100644 index 000000000000..3005aa246336 --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_contribution_guide.mdx @@ -0,0 +1,167 @@ +--- +title: "Contribution Guide" +metatags: + description: "Set up the Ascend NPU development environment, run tests, build documentation, and open SGLang pull requests." +--- + +Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process. + +## Install SGLang from Source + +### Prepare Environment + +Before contributing, please ensure that your environment is set up correctly. Follow the steps in the [Installation Guide](./ascend_npu) to install the necessary dependencies. We recommend [using docker](./ascend_npu#method-2-using-docker-image) to build the environment. + +### Fork and clone the repository + +**Note**: New contributors do **not** have the write permission to push to the official SGLang repo. 
Please fork the repository under your GitHub account, then clone your fork locally. + +```bash +git clone https://github.com//sglang.git +# if you are using docker, the environment is already set up. +cd sglang +export PYTHONPATH=$PWD/python:$PYTHONPATH +``` + +## Format code with pre-commit + +We use [pre-commit](https://pre-commit.com/) to maintain a consistent code style. Before pushing your changes, please run: + +```bash +pip3 install pre-commit +pre-commit install +pre-commit run --all-files +``` + +- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request. +- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch. + +## Run and add unit tests + +If you add a new feature or fix a bug, please add corresponding unit tests to ensure coverage and prevent regression. +SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework. +For detailed instructions on running tests and integrating them into CI, refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md). + +If you need to use a model that is not listed in `python/sglang/test/ascend/test_ascend_utils.py`, follow these steps: +1. Register an account and upload your model to [modelscope](https://modelscope.cn/models). +2. Make sure your model is pre-cached on the CI server at the path "/data/ascend-ci-share-pkking-sglang/modelscope/hub/models/{your_model_repo}/{your_model}".
+If this is not the case, use the following command on the CI server: + ```bash + modelscope download \ + --model {your_model_repo}/{your_model} \ + --local_dir /data/ascend-ci-share-pkking-sglang/modelscope/hub/models/{your_model_repo}/{your_model} + ``` + > Note: If you don’t have access to the CI server, please ask the maintainers (zl19940307@163.com) to download your model. +3. Add the model to ```python/sglang/test/ascend/test_ascend_utils.py``` (use the docker path ```"/root/.cache/modelscope/hub/models/{your_model_repo}/{your_model}"```). + +## Write documentation + +We recommend that new contributors start by writing documentation, which helps you quickly understand the SGLang codebase. +For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md). + +## Test the accuracy +If your code changes the model output, please run the accuracy tests. A quick sanity check is the few-shot GSM8K. + +``` +# Launch a server +python3 -m sglang.launch_server --model Qwen/Qwen2-7B-Instruct + +# Evaluate +python3 -m sglang.test.few_shot_gsm8k --num-questions 200 +``` + +Please note that the above script is primarily a sanity check, not a rigorous accuracy or speed test. +This test can have significant variance (1%–5%) in accuracy due to batching and the non-deterministic nature of the inference engine. +Also, do not rely on the "Latency/Output throughput" from this script, as it is not a proper speed test. + +GSM8K is too easy for state-of-the-art models nowadays. Please try your own more challenging accuracy tests. +You can find additional accuracy eval examples in: +- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/registered/eval/test_eval_accuracy_large.py) +- [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/registered/core/test_gpt_oss_1gpu.py) + +## Benchmark the speed +Refer to [Benchmark and Profiling](../../developer_guide/benchmark_and_profiling).
+ +## Requesting a review for merge +You can follow the pull request merge process described in [MAINTAINER.md](https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md). +You will need to work with the Merge Oncall, Codeowner, and other reviewers to get their approvals. +Then your PR can be merged. + +## How to Trigger CI Tests + +We have a lot of open PRs but limited CI machines, so only top and trusted contributors have permission to trigger CI tests. +Users with permission are listed in the [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json) + +For CI to run on a pull request, it must have the "run-ci" label. Authorized users can add the label or rerun failed tests by commenting on the PR with one of these commands: + +- `/tag-run-ci-label`: Adds the "run-ci" label. Every future commit will trigger CI. +- `/rerun-failed-ci`: Reruns the failed or flaky tests from the most recent commit. +- `/tag-and-rerun-ci`: A single command that performs both `/tag-run-ci-label` and `/rerun-failed-ci`. +- `/rerun-stage `: Reruns a specific test stage without waiting for its dependencies. This is useful when you want to quickly validate a fix for a specific test failure instead of waiting ~30 minutes for preceding stages to complete. + +If you have permission, the [Slash Command Handler](https://github.com/sgl-project/sglang/actions/workflows/slash-command-handler.yml) will run your command and react with a 👍 to your comment. It may take up to a few minutes for the reaction to appear. Here’s a usage [example](https://github.com/sgl-project/sglang/pull/14253#issuecomment-3599509302). + +To avoid spamming a PR with too many `/rerun-failed-ci` comments, you can also trigger the command by editing an existing comment and adding any suffix (e.g., `/rerun-failed-ci try again`). + +Example of rerunning a single test stage: `/rerun-stage unit-test-backend-4-gpu`. 
+ +If you don’t have permission, please ask maintainers to trigger CI for you. + +### CI rate limits + +Due to CI scheduling and limited resources, higher-priority PRs may preempt running jobs. In such cases, you may need to rerun the tests. + +We apply CI rate limits to prevent abuse and ensure fair usage of our CI resources. + +Each CI workflow has a default limit defined in its workflow configuration file. For example, in [pr-gate.yml](https://github.com/sgl-project/sglang/blob/main/.github/workflows/pr-gate.yml), the default cooldown period is 120 minutes, and each workflow can override it via the `cool-down-minutes` input parameter: + +```yaml +cool-down-minutes: + description: "Cooldown period in minutes for low-permission users; 0 disables rate limiting" + type: number + default: 120 +``` + +Users listed in [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json) may have a per-user cooldown interval. In practice, we use the minimum of the workflow’s default window and the user-specific interval. + +## Code style guidance +- Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function. +- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code. +- Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize all minor overheads as much as possible, especially in the model forward code. + - A common pattern is some runtime checks in the model forward pass (e.g., [this](https://github.com/sgl-project/sglang/blob/f1b0eda55c2c4838e8ab90a0fac7fb1e3d7064ab/python/sglang/srt/models/deepseek_v2.py#L486-L491)). These are very likely the same for every layer. Please cache the result as a single boolean value whenever possible. +- Make functions as pure as possible. 
Avoid in-place modification of arguments. +- Keep files concise. If a file exceeds 2,000 lines of code, split it into multiple smaller files (e.g., `scheduler.py`, `scheduler_output_processor_mixin.py`). +- Keep tests fast. + - If a single test file runs longer than 500 seconds, split it into multiple smaller files (e.g., `test_eagle_infer_a.py`, `test_eagle_infer_b.py`). + - If a single job in a GitHub workflow runs longer than 30 minutes, split it into smaller jobs/steps. + - Reuse server launches in your unit tests to make tests run faster. +- When supporting new hardware or features, follow these guidelines: + - Do not drastically change existing code. + - Always prefer new files to introduce specific components for your new hardware (e.g., `allocator_npu.py`). + - If you write multiple if/else blocks for new features, ensure the common path (e.g., NVIDIA hardware or the existing code path) is the first branch. + +## How to update sgl-kernel +Since sglang and sgl-kernel are separate Python packages, our current GitHub CI infrastructure does not support updating a kernel and using it immediately within the same pull request (PR). +To add a new kernel or modify an existing one in the `sgl-kernel/` source tree, you must use multiple PRs. + +Follow these steps: + +1. Submit a PR to update the sgl-kernel source code without using it in the sglang Python package (e.g., [#8884](https://github.com/sgl-project/sglang/pull/8884/files)). +2. Bump the version of the kernel package (e.g., [#9220](https://github.com/sgl-project/sglang/pull/9220/files)). + - Once merged, this will trigger an automatic release of the `sglang-kernel` wheel to PyPI. + - If not urgent, you can wait for other people to release the wheel. A new version will typically be released within one week. +3. Apply the changes: + - Update the `sglang-kernel` version in `sglang/python/pyproject.toml` to use the modified kernels. + - Update the related caller code in sglang to use the new kernel.
+ +## How to update sgl-kernel-npu + +Sgl-kernel-npu is the kernel package for Ascend NPU and is maintained in the [sgl-kernel-npu](https://github.com/sgl-project/sgl-kernel-npu) repository. If you want to add a new kernel and use it in sglang, please follow the steps in the [Contribution Guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/docs/developer_guide/contribution_guide.md). + +## Tips for newcomers + +If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow. + +If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://slack.sglang.io). + +Thank you for your interest in SGLang. Happy coding! diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu.mdx new file mode 100644 index 000000000000..3fe88df704ac --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu.mdx @@ -0,0 +1,294 @@ +--- +title: SGLang installation with NPUs support +--- +You can install SGLang using any of the methods below. Please go through the `System Settings` section to ensure your cluster runs at maximum performance. Feel free to open an issue [here at sglang](https://github.com/sgl-project/sglang/issues) if you run into any problems. + +## Component Version Mapping For SGLang +
| Component | Version | Obtain Way |
| --- | --- | --- |
| HDK | 25.5.2 | link |
| CANN | 8.5.0 | Obtain Images |
| Pytorch Adapter | 7.3.0 | link |
| MemFabric | 1.0.5 | `pip install memfabric-hybrid==1.0.5` |
| Triton | 3.2.0 | `pip install triton-ascend` |
| SGLang NPU Kernel | NA | link |
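
The pinned versions above can be checked against the local Python environment. The sketch below is only an illustration: `installed_version` and `check_pins` are hypothetical helpers, and the only pin taken from the table is `memfabric-hybrid==1.0.5`. Whether the other components map one-to-one to pip distribution names and versions is an assumption you would need to confirm before adding them.

```python
from importlib import metadata
from typing import Callable, Dict, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Compare expected pins with what `lookup` reports; return only the mismatches."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found is None:
            problems[name] = f"not installed (expected {wanted})"
        elif found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


# Pin taken from the table above; extend with care (see the assumptions in the lead-in).
PINS = {"memfabric-hybrid": "1.0.5"}

for package, message in check_pins(PINS).items():
    print(f"{package}: {message}")
```

The lookup function is injectable so the comparison logic can be exercised without the packages being installed.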
+ +### Obtain CANN Image +You can obtain the dependencies for a specific version of CANN through an image. +```bash Command +# for Atlas 800I A3 and Ubuntu OS +docker pull quay.io/ascend/cann:8.5.0-a3-ubuntu22.04-py3.11 +# for Atlas 800I A2 and Ubuntu OS +docker pull quay.io/ascend/cann:8.5.0-910b-ubuntu22.04-py3.11 +``` + +## Preparing the Running Environment + +### Method 1: Installing from source with prerequisites + +#### Python Version + +Only `python==3.11` is currently supported. If you don't want to break the system's pre-installed Python, try installing with [conda](https://github.com/conda/conda). + +```bash Command +conda create --name sglang_npu python=3.11 +conda activate sglang_npu +``` + +#### CANN + +Before working with SGLang on Ascend, you need to install the CANN Toolkit, the Kernels operator package, and NNAL, all at version 8.5.0. See the [installation guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850/softwareinst/instg/instg_0008.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit). + +#### MemFabric-Hybrid + +If you want to use PD disaggregation mode, you need to install MemFabric-Hybrid. MemFabric-Hybrid is a drop-in replacement for the Mooncake Transfer Engine that enables KV cache transfer on Ascend NPU clusters. + +```bash Command +pip install memfabric-hybrid==1.0.5 +``` + +#### Pytorch and Pytorch Framework Adaptor on Ascend + +```bash Command +PYTORCH_VERSION=2.8.0 +TORCHVISION_VERSION=0.23.0 +TORCH_NPU_VERSION=2.8.0.post2 +pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu +pip install torch_npu==$TORCH_NPU_VERSION +``` + +If you are using a different version of `torch` and need the matching `torch_npu`, check the [installation guide](https://github.com/Ascend/pytorch/blob/master/README.md). + +#### Triton on Ascend + +We provide our own implementation of Triton for Ascend.
+ +```bash Command +pip install triton-ascend +``` +For installation of Triton on Ascend nightly builds or from sources, follow [installation guide](https://gitcode.com/Ascend/triton-ascend/blob/master/docs/sources/getting-started/installation.md) + +#### SGLang Kernels NPU +We provide SGL kernels for Ascend NPU, check [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/sgl_kernel_npu/README.md). + +#### DeepEP-compatible Library +We provide a DeepEP-compatible Library as a drop-in replacement of deepseek-ai's DeepEP library, check the [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md). + +#### Some other dependencies + +```bash Command +# libGL +apt update +apt install libgl1 libglib2.0-0 + +# ensure setuptools contains pkg_resources module +pip install "setuptools<80" +``` + +#### Installing SGLang from source + +```bash Command +# Use the last release branch +git clone https://github.com/sgl-project/sglang.git +cd sglang +mv python/pyproject_npu.toml python/pyproject.toml +pip install -e python[all_npu] +``` + +### Method 2: Using Docker Image +#### Obtain Image +You can download the SGLang image or build an image based on Dockerfile to obtain the Ascend NPU image. +1. Download SGLang image +```angular2html +dockerhub: docker.io/lmsysorg/sglang:$tag +# Main-based tag, change main to specific version like v0.5.6, +# you can get image for specific version +Atlas 800I A3 : {main}-cann8.5.0-a3 +Atlas 800I A2: {main}-cann8.5.0-910b +``` +2. Build an image based on Dockerfile +```bash Command +# Clone the SGLang repository +git clone https://github.com/sgl-project/sglang.git +cd sglang/docker + +# Build the docker image +# If there are network errors, please modify the Dockerfile to use offline dependencies or use a proxy +# is the target architecture of the image, e.g. amd64, arm64 +docker build --build-arg TARGETARCH= -t -f npu.Dockerfile . 
+``` + +#### Create Docker +__Notice:__ `--privileged` and `--network=host` are required by RDMA, which is typically needed by Ascend NPU clusters. + +__Notice:__ The following docker command is based on Atlas 800I A3 machines. If you are using Atlas 800I A2, make sure only `davinci[0-7]` are mapped into container. + +```bash Command + +alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \ + --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \ + --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \ + --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \ + --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \ + --device=/dev/davinci_manager --device=/dev/hisi_hdc \ + --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \ + --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \ + --volume /etc/ascend_install.info:/etc/ascend_install.info \ + --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/' + +# Add HF_TOKEN env for download model by SGLang. +drun --env "HF_TOKEN=" \ + \ + python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend +``` + +## System Settings + +### CPU performance power scheme + +The default power scheme on Ascend hardware is `ondemand` which could affect performance, changing it to `performance` is recommended. 
+ +```bash Command +echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor + +# Make sure changes are applied successfully +cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # shows performance +``` + +### Disable NUMA balancing + +```bash Command +sudo sysctl -w kernel.numa_balancing=0 +# Check +cat /proc/sys/kernel/numa_balancing # shows 0 +``` + +### Prevent swapping out system memory + +```bash Command +sudo sysctl -w vm.swappiness=10 + +# Check +cat /proc/sys/vm/swappiness # shows 10 +``` + +## Running SGLang Service +### Running Service For Large Language Models +#### PD Mixed Scene +```bash Command +# Enabling CPU Affinity +export SGLANG_SET_CPU_AFFINITY=1 +python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend +``` + +#### PD Disaggregation Scene +1. Launch Prefill Server +```bash Command +# Enabling CPU Affinity +export SGLANG_SET_CPU_AFFINITY=1 + +# PIP: recommended to config first Prefill Server IP +# PORT: one free port +# all sglang servers need to be config the same PIP and PORT, +export ASCEND_MF_STORE_URL="tcp://PIP:PORT" +# if you are Atlas 800I A2 hardware and use rdma for kv cache transfer, add this parameter +export ASCEND_MF_TRANSFER_PROTOCOL="device_rdma" +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend ascend \ + --disaggregation-bootstrap-port 8995 \ + --attention-backend ascend \ + --device npu \ + --base-gpu-id 0 \ + --tp-size 1 \ + --host 127.0.0.1 \ + --port 8000 +``` + +2. 
Launch Decode Server +```bash Command +# PIP: recommended to config first Prefill Server IP +# PORT: one free port +# all sglang servers need to be config the same PIP and PORT, +export ASCEND_MF_STORE_URL="tcp://PIP:PORT" +# if you are Atlas 800I A2 hardware and use rdma for kv cache transfer, add this parameter +export ASCEND_MF_TRANSFER_PROTOCOL="device_rdma" +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.1-8B-Instruct \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend ascend \ + --attention-backend ascend \ + --device npu \ + --base-gpu-id 1 \ + --tp-size 1 \ + --host 127.0.0.1 \ + --port 8001 +``` + +3. Launch Router +```bash Command +python3 -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://127.0.0.1:8000 8995 \ + --decode http://127.0.0.1:8001 \ + --host 127.0.0.1 \ + --port 6688 +``` + +### Running Service For Multimodal Language Models +#### PD Mixed Scene +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen3-VL-30B-A3B-Instruct \ + --host 127.0.0.1 \ + --port 8000 \ + --tp 4 \ + --device npu \ + --attention-backend ascend \ + --mm-attention-backend ascend_attn \ + --disable-radix-cache \ + --trust-remote-code \ + --enable-multimodal \ + --sampling-backend ascend +``` diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_best_practice.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_best_practice.mdx new file mode 100644 index 000000000000..e57d590d1889 --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_best_practice.mdx @@ -0,0 +1,4256 @@ +--- +title: "Best Practice on Ascend NPU" +metatags: + description: "Documentation for Best Practice on Ascend NPU" +--- +This section describes the best practice data of mainstream LLM models such as DeepSeek and Qwen on the Ascend NPU. If +you encounter issues or have any questions, please [open an issue](https://github.com/sgl-project/sglang/issues). 
+ +## DeepSeek Series Models + +### Low Latency +
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Deepseek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 6K+1.6K | 20ms | W8A8 INT8 | Optimal Configuration |
| Deepseek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.9K+1K | 19ms | W8A8 INT8 | Optimal Configuration |
| Deepseek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 19ms | W8A8 INT8 | Optimal Configuration |
| Deepseek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-V3.2 | Atlas 800I A3 | 32 | PD Disaggregation | 128K+1K | 26ms | W8A8 INT8 | Optimal Configuration |
+ +### High Throughput +
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Deepseek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Deepseek-R1 | Atlas 800I A3 | 24 | PD Disaggregation | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Deepseek-R1 | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W4A8 INT8 | Optimal Configuration |
| Deepseek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 2K+2K | 50ms | W4A8 INT8 | Optimal Configuration |
| Deepseek-R1 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |
| Deepseek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |
+ +## Qwen Series Models + +### Low Latency +
ModelHardwareCardsDeploy ModeDatasetTPOTQuantizationConfiguration
Qwen3-235B-A22BAtlas 800I A38PD Mixed11K+1K10msBF16Optimal Configuration
Qwen3-32BAtlas 800I A34PD Mixed6K+1.5K18msBF16Optimal Configuration
Qwen3-32BAtlas 800I A34PD Mixed4K+1.5K11msBF16Optimal Configuration
Qwen3-32BAtlas 800I A38PD Mixed18K+4K6msBF16Optimal Configuration
Qwen3-32BAtlas 800I A28PD Mixed6K+1.5K18msW8A8 INT8Optimal Configuration
Qwen3-32BAtlas 800I A28PD Mixed4K+1.5K11msBF16Optimal Configuration
Qwen3-32BAtlas 800I A32PD Mixed1K+0.3K12msW8A8 INT8Optimal Configuration
Qwen3-32BAtlas 800I A32PD Mixed6K+1.5K17msW8A8 INT8Optimal Configuration
Qwen3-8BAtlas 800I A31PD Mixed1K+0.3K7msW8A8 INT8Optimal Configuration
Qwen3-8BAtlas 800I A31PD Mixed6K+1.5K12msW8A8 INT8Optimal Configuration
Qwen3-8BAtlas 800I A31PD Mixed3.5K+1.5K5msW8A8 INT8Optimal Configuration
Qwen3-30B-A3BAtlas 800I A31PD Mixed6K+1.5K10msW8A8 INT8Optimal Configuration
Qwen3-30B-A3BAtlas 800I A31PD Mixed1K+0.3K7msW8A8 INT8Optimal Configuration
Qwen3-Next-A3B-InstructAtlas 800I A32PD Mixed1K+0.3K14.21msW8A8 INT8Optimal Configuration
Qwen3-Next-A3B-InstructAtlas 800I A32PD Mixed6K+1.5K15.62msW8A8 INT8Optimal Configuration
Qwen3-Next-A3B-InstructAtlas 800I A31PD Mixed3.5K+1.5K20msW8A8 INT8Optimal Configuration
Qwen3-14BAtlas 800I A31PD Mixed3.5K+1.5K9msW8A8 INT8Optimal Configuration
+
+### High Throughput
+
+| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| Qwen3-235B-A22B | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 100ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-235B-A22B | Atlas 800I A3 | 16 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 16 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-14B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
+| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
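One way to sanity-check a configuration against the Cards column is to add up the `--tp-size` (or `--tp`) values of every launched instance. In the DeepSeek-R1 commands in this document, that sum comes to twice the card count: for example, two prefill instances at `--tp-size 16` plus one decode instance at `--tp-size 32` for the 32-card setup. This suggests that each Atlas 800I A3 card exposes two NPU devices, but that reading is an inference from the commands rather than something the source states, and the function name `total_ranks` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum tensor-parallel ranks over all instances.
# Arguments: <prefill_instances> <prefill_tp> <decode_instances> <decode_tp>
total_ranks() {
    echo $(( $1 * $2 + $3 * $4 ))
}

# 32-card PD disaggregation: 2 prefill instances (tp 16) + 1 decode instance (tp 32)
total_ranks 2 16 1 32
# 24-card PD disaggregation: 1 prefill instance (tp 16) + 1 decode instance (tp 32)
total_ranks 1 16 1 32
# 8-card PD mixed: a single instance launched with --tp 16
total_ranks 1 16 0 0
```

If the printed value is not twice the card count for your own setup, re-check the instance counts and `--tp-size` flags before launching.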
+ +## Optimal Configuration + +### DeepSeek-R1 3_5K-1_5K 50ms on A3 32 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +export SGLANG_SET_CPU_AFFINITY=1 +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_OP_EXPANSION_MODE=AIV +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export SGLANG_NPU_USE_MULTI_STREAM=1 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1' 'your prefill ip2') + +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export SGLANG_USE_AG_AFTER_QLORA=1 + export HCCL_BUFFSIZE=800 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + export SGLANG_NPU_FUSED_MOE_MODE=2 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=131072 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.778 --attention-backend ascend --device 
npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 16 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 60000 --moe-a2a-backend ascend_fuseep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=600 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64 + export TASK_QUEUE_ENABLE=1 + export SGLANG_NPU_FUSED_MOE_MODE=1 + export SGLANG_LM_HEAD_TP=8 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \ + --mem-fraction-static 0.82 --max-running-requests 1024 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --moe-dense-tp 1 \ + --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell Command +export SGLANG_DP_ROUND_ROBIN=1 +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware 
\ + --prefill http://P_IP:8000 8998 \ + --prefill http://P_IP:8000 8999 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 1024 --random-input-len 3584 --random-output-len 1536 --num-prompts 7168 --random-range-ratio 1 --request-rate 40 +``` + +### DeepSeek-R1 2K-2K 50ms on A3 24 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 24Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1') +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=1600 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + export SGLANG_USE_AG_AFTER_QLORA=1 + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + + python -m sglang.launch_server --model-path 
${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.8 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 20 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=800 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=102 + export TASK_QUEUE_ENABLE=1 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + export SGLANG_NPU_FUSED_MOE_MODE=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \ + --mem-fraction-static 0.81 --max-running-requests 1088 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ + --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 
--speculative-num-draft-tokens 3 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```bash Command +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --prefill http://P_IP:8000 8999 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python -m sglang.bench_serving --dataset-name random --backend sglang \ +--host 127.0.0.1 \ +--port 6688 \ +--max-concurrency 1088 \ +--random-input-len 2048 \ +--random-output-len 2048 \ +--num-prompts 12800 \ +--random-range-ratio 1 \ +--request-rate 24 +``` + +### DeepSeek-R1 6K-1_6K 20ms on A3 32 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 6K+1.6K + +TPOT: 20ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +export SGLANG_SET_CPU_AFFINITY=1 +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1' 'your prefill ip2') + +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" +# prefill +for i in "${!P_IP[@]}"; +do + if [[ 
"$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=1536 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 4 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=650 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16 + export TASK_QUEUE_ENABLE=1 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr DIP1:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 8 \ + --mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ + --cuda-graph-bs 2 4 6 
--disaggregation-transfer-backend ascend --watchdog-timeout 9000 \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell Command +export SGLANG_DP_ROUND_ROBIN=1 +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --prefill http://P_IP:8000 8999 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python -m sglang.bench_serving --dataset-name random --backend sglang \ + --host 127.0.0.1 \ + --port 6688 \ + --max-concurrency 32 \ + --random-input-len 6000 \ + --random-output-len 1600 \ + --num-prompts 32 \ + --random-range-ratio 1 \ + --request-rate 16 +``` + +### DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.9K+1K + +TPOT: 19ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1' 'your prefill ip2') +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + 
+LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=1536 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 4 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=650 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=12 + export TASK_QUEUE_ENABLE=1 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr DIP1:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 16 \ + --mem-fraction-static 0.75 --max-running-requests 
32 --attention-backend ascend --device npu --quantization modelslim \
+        --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
+        --cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
+        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
+        --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
+        --load-balance-method round_robin
+        NODE_RANK=$i
+        break
+    fi
+done
+```
+
+```bash Command
+export SGLANG_DP_ROUND_ROBIN=1
+python -m sglang_router.launch_router \
+    --pd-disaggregation \
+    --policy cache_aware \
+    --prefill http://P_IP:8000 8998 \
+    --prefill http://P_IP:8000 8999 \
+    --decode http://D_IP:8001 \
+    --host 127.0.0.1 \
+    --port 6688 \
+    --mini-lb
+```
+
+#### Benchmark
+
+We tested it based on the `RANDOM` dataset.
+
+```bash Command
+python -m sglang.bench_serving --dataset-name random --backend sglang \
+    --host 127.0.0.1 \
+    --port 6688 \
+    --max-concurrency 32 \
+    --random-input-len 3900 \
+    --random-output-len 1024 \
+    --num-prompts 32 \
+    --random-range-ratio 1 \
+    --request-rate 16
+```
+
+### DeepSeek-R1 3_5K-1_5K 19ms on A3 32 Cards Disaggregation Mode
+
+Model: Deepseek R1
+
+Hardware: Atlas 800I A3 32Card
+
+DeployMode: PD Disaggregation
+
+Dataset: random
+
+Input Output Length: 3.5K+1.5K
+
+TPOT: 19ms
+
+#### Model Deployment
+
+The deployment is the same as in [DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode](#deepseek-r1-3_9k-1k-19ms-on-a3-32-cards-disaggregation-mode).
+
+#### Benchmark
+
+We tested it based on the `RANDOM` dataset.
+
+```bash Command
+python -m sglang.bench_serving --dataset-name random --backend sglang \
+    --host 127.0.0.1 \
+    --port 6688 \
+    --max-concurrency 32 \
+    --random-input-len 3500 \
+    --random-output-len 1500 \
+    --num-prompts 32 \
+    --random-range-ratio 1 \
+    --request-rate 16
+```
+
+### DeepSeek-R1 3_5K-1K 19ms on A3 32 Cards Disaggregation Mode
+
+Model: Deepseek R1
+
+Hardware: Atlas 800I A3 32Card
+
+DeployMode: PD Disaggregation
+
+Dataset: random
+
+Input Output Length: 3.5K+1K
+
+TPOT: 19ms
+
+#### Model Deployment
+
+The deployment is the same as in [DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode](#deepseek-r1-3_9k-1k-19ms-on-a3-32-cards-disaggregation-mode).
+
+#### Benchmark
+
+We tested it based on the `RANDOM` dataset.
+
+```bash Command
+python -m sglang.bench_serving --dataset-name random --backend sglang \
+    --host 127.0.0.1 \
+    --port 6688 \
+    --max-concurrency 32 \
+    --random-input-len 3500 \
+    --random-output-len 1024 \
+    --num-prompts 32 \
+    --random-range-ratio 1 \
+    --request-rate 16
+```
+
+### DeepSeek-R1 2K-2K 50ms on A3 8 Cards Mixed Mode
+
+Model: Deepseek R1
+
+Hardware: Atlas 800I A3 8Card
+
+DeployMode: PD Mixed
+
+Dataset: random
+
+Input Output Length: 2K+2K
+
+TPOT: 50ms
+
+#### Model Deployment
+
+```bash Command
+echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+sysctl -w vm.swappiness=0
+sysctl -w kernel.numa_balancing=0
+sysctl -w kernel.sched_migration_cost_ns=50000
+
+export SGLANG_SET_CPU_AFFINITY=1
+unset https_proxy
+unset http_proxy
+unset HTTPS_PROXY
+unset HTTP_PROXY
+unset ASCEND_LAUNCH_BLOCKING
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+source /usr/local/Ascend/nnal/atb/set_env.sh
+export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
+
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+export STREAMS_PER_DEVICE=32
+export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
+export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
+
+export HCCL_SOCKET_IFNAME=lo
+export
GLOO_SOCKET_IFNAME=lo + +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=88 +export HCCL_BUFFSIZE=1600 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=10 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512 + +MODEL_PATH=xxx + +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_USE_FIA_NZ=1 + +python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ +--tp 16 \ +--trust-remote-code \ +--attention-backend ascend \ +--device npu \ +--quantization modelslim \ +--watchdog-timeout 9000 \ +--host 127.0.0.1 --port 6699 \ +--cuda-graph-bs 4 8 20 21 22 \ +--mem-fraction-static 0.78 \ +--max-running-requests 352 \ +--disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 1500 \ +--moe-a2a-backend deepep --deepep-mode auto \ +--enable-dp-attention --dp-size 16 --enable-dp-lm-head \ +--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ +--dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```bash Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 352 --random-input-len 2048 --random-output-len 2048 --num-prompts 1408 --random-range-ratio 1 +``` + +### DeepSeek-R1 2K-2K 50ms on A3 16 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 16Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" + +P_IP=('your prefill ip1') + +D_IP=('your decode ip1') + +MODEL_PATH=xxx + +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export ENABLE_MOE_NZ=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=2600 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 
--mem-fraction-static 0.7 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 10240 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=900 + export SGLANG_DP_ROUND_ROBIN=1 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=112 + export TASK_QUEUE_ENABLE=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \ + --mem-fraction-static 0.8 --max-running-requests 448 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \ + --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```bash Command +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --decode http://D_IP:8001 \ 
+ --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 448 --random-input-len 2048 --random-output-len 2048 --num-prompts 1792 --random-range-ratio 1 --request-rate 32 +``` + +### DeepSeek-R1 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +export STREAMS_PER_DEVICE=32 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=56 +export HCCL_BUFFSIZE=1200 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=10 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_USE_FIA_NZ=1 + +MODEL_PATH=xxx + +python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ +--tp 16 \ +--trust-remote-code \ +--attention-backend ascend \ +--device npu \ +--quantization modelslim \ +--watchdog-timeout 9000 \ +--host 127.0.0.1 --port 6699 \ +--cuda-graph-bs 4 8 12 14 \ +--mem-fraction-static 0.77 \ 
+--max-running-requests 224 \ +--context-length 8188 --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 3000 \ +--moe-a2a-backend deepep --deepep-mode auto \ +--enable-dp-attention --dp-size 16 --enable-dp-lm-head \ +--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ +--dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 224 --random-input-len 3500 --random-output-len 1500 --num-prompts 896 --random-range-ratio 1 +``` + +### DeepSeek-R1 3_5K-1_5K 50ms on A3 16 Cards Disaggregation Mode + +Model: Deepseek R1 + +Hardware: Atlas 800I A3 16Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" + +P_IP=('your prefill ip1') + +D_IP=('your decode ip1') + +MODEL_PATH=xxx + +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export ENABLE_MOE_NZ=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == 
"${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=3500 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.62 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 20480 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=800 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=78 + export TASK_QUEUE_ENABLE=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \ + --mem-fraction-static 0.805 --max-running-requests 416 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \ + --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN 
--speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ + --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```bash Command +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 416 --random-input-len 3500 --random-output-len 1500 --num-prompts 1664 --random-range-ratio 1 +``` + +### DeepSeek-V3.2 128K-1K 26ms on A3 32 Cards Disaggregation Mode + +Model: DeepSeek-V3.2-W8A8 + +Hardware: Atlas 800I A3 32Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 128K+1K + +TPOT: 26ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH} +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24670" + +P_IP=('your prefill ip1' 'your prefill ip2') +D_IP=('your decode ip1' 'your decode ip2') +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F 
" " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=1200 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + + python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ + --tp 32 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --watchdog-timeout 9000 \ + --host ${P_IP[$i]} --port 8000 \ + --mem-fraction-static 0.73 \ + --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 68000 \ + --max-running-requests 1 \ + --moe-a2a-backend deepep --deepep-mode normal \ + --quantization modelslim \ + --disaggregation-transfer-backend ascend \ + --disaggregation-mode prefill \ + --disable-cuda-graph \ + --nnodes 2 --node-rank $i \ + --disaggregation-bootstrap-port 8995 \ + --moe-dense-tp-size 1 \ + --enable-nsa-prefill-context-parallel \ + --nsa-prefill-cp-mode in-seq-split \ + --attn-cp-size 32 \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dist-init-addr ${P_IP[0]}:10000 + break + fi +done + + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + + export TASK_QUEUE_ENABLE=0 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + + DP=8 + export HCCL_BUFFSIZE=400 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=8 + + python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ + --tp 32 \ + --dp ${DP} \ + --ep 32 \ + --moe-dense-tp-size 1 \ + --enable-dp-attention \ + --enable-dp-lm-head \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ 
+ --watchdog-timeout 9000 \ + --host ${D_IP[$i]} --port 8001 \ + --mem-fraction-static 0.79 \ + --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 68000 \ + --max-running-requests 32 \ + --cuda-graph-max-bs 4 \ + --moe-a2a-backend deepep \ + --deepep-mode low_latency \ + --quantization modelslim \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disaggregation-transfer-backend ascend \ + --disaggregation-mode decode \ + --nnodes 2 --node-rank $i \ + --dist-init-addr ${D_IP[0]}:10000 + break + fi +done +``` + + +```bash Command +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP1:8000 8995 \ + --decode http://D_IP1:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 8 --random-input-len 131076 --random-output-len 1024 --num-prompts 8 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 24Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 
+export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_DP_ROUND_ROBIN=1 +export SGLANG_NPU_FUSED_MOE_MODE=2 + +MODEL_PATH=xxx +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667" +P_IP=('your prefill ip1') +D_IP=('your decode ip1' 'your decode ip2') + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + + +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + source /usr/local/Ascend/ascend-toolkit/set_env.sh + source /usr/local/Ascend/nnal/atb/set_env.sh + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416 + export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 + export DEEPEP_NORMAL_LONG_SEQ_ROUND=16 + export HCCL_BUFFSIZE=4300 + export TASK_QUEUE_ENABLE=2 + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + export STREAMS_PER_DEVICE=32 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + + # Prefill node + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \ + --host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \ + --nnodes 1 --node-rank $i --tp-size 16 --dp-size 16 --mem-fraction-static 0.6 \ + --disable-radix-cache \ + --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --speculative-draft-model-quantization unquant \ + --max-running-requests 128 --chunked-prefill-size 94208 --max-prefill-tokens 262144 \ + --enable-dp-attention \ + --moe-a2a-backend ascend_fuseep --dtype bfloat16 + NODE_RANK=$i + break + fi +done + + +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + source
/usr/local/Ascend/ascend-toolkit/set_env.sh + source /usr/local/Ascend/nnal/atb/set_env.sh + export DP_ROUND_ROBIN=1 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536 + export HCCL_BUFFSIZE=800 + export HCCL_SOCKET_IFNAME=data0.3001 + export GLOO_SOCKET_IFNAME=data0.3001 + export STREAMS_PER_DEVICE=32 + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \ + --host ${D_IP[$i]} --port 8001 --trust-remote-code \ + --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.83 --max-running-requests 768 \ + --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ + --moe-a2a-backend ascend_fuseep --cuda-graph-bs 6 8 12 15 18 20 22 24 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-draft-model-quantization unquant \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --dist-init-addr xxx:5000 \ + --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 \ + --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```shell Command +export SGLANG_DP_ROUND_ROBIN=1 +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://PIP:8000 8995 \ + --decode http://DIP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang-oai --host 127.0.0.1 --port 7239 --max-concurrency 860 --random-input-len 3500 --random-output-len 1500 --num-prompts 3440 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=570 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100 + +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416 +export SGLANG_NPU_FUSED_MOE_MODE=2 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 
432 --context-length 8192 --dtype bfloat16 \ + --chunked-prefill-size 94208 --max-prefill-tokens 458880 --sampling-backend ascend \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disable-radix-cache --moe-a2a-backend ascend_fuseep --speculative-draft-model-quantization unquant \ + --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.8 --cuda-graph-bs 1 2 4 8 16 20 24 26 27 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 272 --random-input-len 3500 --random-output-len 1500 --num-prompts 1088 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 2K-2K 100ms on A3 8 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 100ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=1200 +export 
HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=144 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 576 --context-length 8192 --dtype bfloat16 \ + --chunked-prefill-size 32768 --max-prefill-tokens 458880 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --speculative-draft-model-quantization unquant \ + --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.84 --cuda-graph-bs 8 16 20 24 32 36 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 576 --random-input-len 2000 --random-output-len 2000 --num-prompts 576 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 2K-2K 50ms on A3 8 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=450 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=147456 +export SGLANG_NPU_FUSED_MOE_MODE=2 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 624 
--context-length 8192 --dtype bfloat16 \ + --chunked-prefill-size 73728 --max-prefill-tokens 458880 --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --disable-radix-cache --moe-a2a-backend ascend_fuseep \ + --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.83 --cuda-graph-bs 4 8 16 24 28 29 30 32 34 36 37 38 39 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 480 --random-input-len 2048 --random-output-len 2048 --num-prompts 480 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 2K-2K 50ms on A3 16 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 16Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=1600 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" + 
+MIX_IP=('IP1' 'IP2') + +for i in "${!MIX_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]]; + then + echo "${MIX_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + + python -m sglang.launch_server --model-path ${MODEL_PATH} \ + --host 127.0.0.1 --port 7439 --trust-remote-code \ + --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.8 --max-running-requests 768 \ + --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ + --moe-a2a-backend deepep --deepep-mode auto --cuda-graph-bs 6 8 10 12 18 24 \ + --dist-init-addr ${MIX_IP[0]}:5000 --chunked-prefill-size 131072 --max-prefill-tokens 458880 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx --speculative-draft-model-quantization unquant \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --context-length 8192 --disable-radix-cache \ + --enable-dp-lm-head --dtype bfloat16 + NODE_RANK=$i + break + fi +done + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset.
+ +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 768 --random-input-len 2000 --random-output-len 2000 --num-prompts 768 --random-range-ratio 1 +``` + +### Qwen3-235B-A22B 11K-1K 10ms on A3 8 Cards Mixed Mode + +Model: Qwen3-235B-A22B-W8A8 + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 11K+1K + +TPOT: 10ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=1600 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 1 --dtype bfloat16 \ + --chunked-prefill-size -1 --max-prefill-tokens 16384 --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + 
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --disable-radix-cache --enable-dp-lm-head \ + --tp 16 --mem-fraction-static 0.78 --cuda-graph-bs 1 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 1 --random-input-len 11000 --random-output-len 1000 --num-prompts 1 --random-range-ratio 1 +``` + +### Qwen3-32B 6K-1_5K 18ms on A3 4 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 4Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 18ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu \ + --max-running-requests 32 \ + --disable-radix-cache \ + 
--chunked-prefill-size 24576 --max-prefill-tokens 65536 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1 +``` + +### Qwen3-32B 4K-1_5K 11ms on A3 4 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 4Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 4K+1.5K + +TPOT: 11ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + 
--attention-backend ascend --device npu \ + --max-running-requests 1 \ + --disable-radix-cache \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size 24576 --max-prefill-tokens 65536 \ + --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4 +``` + +### Qwen3-32B 18K-4K 6ms on A3 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 18K+4K + +TPOT: 6ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \
+ --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu \ + --max-running-requests 1 \ + --disable-radix-cache --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 65536 \ + --tp-size 16 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 4000 --random-input-len 18000 --num-prompts 1 +``` + +### Qwen3-32B 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV"
+export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 78 \ + --disable-radix-cache --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-prefill-tokens 49152 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 16 32 64 68 72 78 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1 +``` + +### Qwen3-32B 2K-2K 50ms on A3 2 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + 
+echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 120 \ + --disable-radix-cache --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --chunked-prefill-size -1 --max-prefill-tokens 49152 \ + --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16 + +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 480 --random-range-ratio 1 +``` + +### Qwen3-30B-A3B 3_5K-1_5K 50ms on A3 1 Card Mixed Mode + +Model: Qwen3-30B-A3B-Instruct-2507 + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export 
PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export ASCEND_LAUNCH_BLOCKING=0 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 162 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --chunked-prefill-size -1 --max-prefill-tokens 35000 \ + --tp-size 2 --mem-fraction-static 0.87 --cuda-graph-bs 1 5 15 40 70 100 120 130 140 146 150 154 156 158 160 162 \ + --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
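With `--random-range-ratio 1`, the sampled request lengths should equal the requested input and output lengths, so the size of a run can be estimated before launching it. The sketch below is illustrative arithmetic only, using the values from the benchmark command that follows; the variable names are our own and are not part of `sglang.bench_serving`.

```shell
#!/usr/bin/env bash
# Illustrative arithmetic only; variable names are hypothetical.
max_concurrency=156   # --max-concurrency
num_prompts=624       # --num-prompts
input_len=3500        # --random-input-len
output_len=1500       # --random-output-len

# Number of "waves" of concurrent requests (ceiling division).
waves=$(( (num_prompts + max_concurrency - 1) / max_concurrency ))

# Total prompt and generated tokens across the whole run.
total_input_tokens=$(( num_prompts * input_len ))
total_output_tokens=$(( num_prompts * output_len ))

echo "waves=${waves} input_tokens=${total_input_tokens} output_tokens=${total_output_tokens}"
```

Here the prompt count is four times the concurrency, so the run covers roughly four waves of requests. This is an approximation, since requests do not necessarily finish in lockstep.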
+ +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 156 --random-input-len 3500 --random-output-len 1500 --num-prompts 624 --random-range-ratio 1 +``` + +### Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode + +Model: Qwen3-Coder-480B-A35B-Instruct + +Hardware: Atlas 800I A3 24Card + +DeployMode: PD Disaggregation + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export SGLANG_NPU_FUSED_MOE_MODE=2 + +MODEL_PATH=xxx +export ASCEND_MF_STORE_URL="tcp://PIP:24667" +P_IP=('PIP') +D_IP=('DIP1' 'DIP2') +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + + +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + source /usr/local/Ascend/ascend-toolkit/set_env.sh + source /usr/local/Ascend/nnal/atb/set_env.sh + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=327680 + export HCCL_BUFFSIZE=1550 + export TASK_QUEUE_ENABLE=2 + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \ + --host ${P_IP[$i]} --port 8000 
--disaggregation-bootstrap-port 8995 --trust-remote-code \ + --nnodes 1 --node-rank $i --tp-size 16 --dp-size 2 --mem-fraction-static 0.7 \ + --disable-radix-cache \ + --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \ + --max-running-requests 16 --chunked-prefill-size 20480 --max-prefill-tokens 20480 \ + --enable-dp-attention \ + --moe-a2a-backend ascend_fuseep --dtype bfloat16 \ + --disable-overlap-schedule + NODE_RANK=$i + break + fi +done + +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + source /usr/local/Ascend/ascend-toolkit/set_env.sh + source /usr/local/Ascend/nnal/atb/set_env.sh + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536 + export HCCL_BUFFSIZE=600 + export SGLANG_NPU_FUSED_MOE_MODE=2 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \ + --host ${D_IP[$i]} --port 8001 --trust-remote-code \ + --nnodes 2 --node-rank $i --tp-size 32 --dp-size 4 --mem-fraction-static 0.75 --max-running-requests 544 \ + --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \ + --moe-a2a-backend ascend_fuseep --cuda-graph-bs 16 32 56 72 80 88 96 104 112 120 128 136 \ + --dist-init-addr DIP1:5000 \ + --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 --load-balance-method round_robin + NODE_RANK=$i + break + fi +done + +``` + +```bash Command +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://PIP:8000 8995 \ + --decode http://DIP1:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
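Before running the benchmark, it can help to confirm which role and rank each host will take. The two `for` loops in the deployment script above compare the first two addresses reported by `hostname -I` with the `P_IP` and `D_IP` arrays and pass the matching index as `--node-rank`. A minimal sketch of that matching step, assuming bash; `find_node_rank` and the addresses are hypothetical and only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the index of the first list entry that matches
# either local address, or return 1 if this host is not in the list.
find_node_rank() {
  local host1="$1" host2="$2"
  shift 2
  local ips=("$@")
  local i
  for i in "${!ips[@]}"; do
    if [[ "$host1" == "${ips[$i]}" || "$host2" == "${ips[$i]}" ]]; then
      echo "$i"
      return 0
    fi
  done
  return 1
}

# Example with placeholder addresses (not real cluster IPs).
D_IP=('10.0.0.11' '10.0.0.12')
rank=$(find_node_rank '192.168.1.5' '10.0.0.12' "${D_IP[@]}")
echo "decode node rank: ${rank}"
```

A non-zero return status means the host appears in neither list, which in the script above corresponds to neither loop launching a server.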
+ +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 410 --random-input-len 3500 --random-output-len 1500 --num-prompts 1640 --random-range-ratio 1 --request-rate 8 +``` + +### Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 16 Cards Mixed Mode + +Model: Qwen3-Coder-480B-A35B-Instruct + +Hardware: Atlas 800I A3 16Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=72 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=1800 +export HCCL_SOCKET_IFNAME=xxx +export GLOO_SOCKET_IFNAME=xxx +export HCCL_OP_EXPANSION_MODE="AIV" + +MIX_IP=('IP1' 'IP2') + +for i in "${!MIX_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]]; + then + echo "${MIX_IP[$i]}" + + python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 2 --node-rank $i \ + --dist-init-addr IP1:5000 \ + --attention-backend ascend --device npu --quantization modelslim \ + 
--max-running-requests 288 --context-length 8192 --dtype bfloat16 \ + --chunked-prefill-size 114688 --max-prefill-tokens 458880 \ + --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto \ + --tp 32 --dp-size 4 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.7 --cuda-graph-bs 56 64 72 + NODE_RANK=$i + break + fi +done +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 288 --random-input-len 3500 --random-output-len 1500 --num-prompts 1152 --random-range-ratio 1 --request-rate 20 +``` + +### Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode + +Model: Qwen3-Coder-480B-A35B-Instruct + +Hardware: Atlas 800I A3 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=2100 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" + +python -m sglang.launch_server --model-path 
$MODEL_PATH \ +--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ +--attention-backend ascend --device npu --quantization modelslim \ +--max-running-requests 80 --context-length 8192 --dtype bfloat16 \ +--chunked-prefill-size 28672 --max-prefill-tokens 458880 \ +--disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --enable-dp-attention --enable-dp-lm-head \ +--tp 16 --dp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 16 20 24 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 80 --random-input-len 3500 --random-output-len 1500 --num-prompts 320 --random-range-ratio 1 +``` + +### Qwen3-Next-80B-A3B-Instruct 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode + +Model: Qwen3-Next-80B-A3B-Instruct + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell Command +export cann_path=/usr/local/Ascend/ascend-toolkit/latest +source /usr/local/Ascend/driver/bin/setenv.bash +source ${cann_path}/../set_env.sh +source ${cann_path}/../../nnal/atb/set_env.sh +source ${cann_path}/opp/vendors/customize/bin/set_env.bash +export ASCEND_HOME_PATH=${cann_path} +source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh + +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +export HCCL_OP_EXPANSION_MODE=AIV +export HCCL_ALGO="level0:NA;level1:ring" + +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=20 +export HCCL_BUFFSIZE=2000 + +python -m sglang.launch_server \ + --model-path 
/path/to/Qwen3-Next-80B-A3B-Instruct-W8A8-3 \ + --host 127.0.0.1 \ + --port 6699 \ + --tp-size 4 \ + --device npu \ + --attention-backend ascend \ + --mem-fraction-static 0.685 \ + --max-running-requests 80 \ + --watchdog-timeout 3600 \ + --disable-radix-cache \ + --cuda-graph-bs 80 \ + --max-prefill-tokens 28672 --max-total-tokens 450560 \ + --moe-a2a-backend deepep --deepep-mode auto \ + --quantization modelslim \ + --chunked-prefill-size -1 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 80 --random-output-len 1536 --random-input-len 3584 --num-prompts 160 --random-range-ratio 1 +``` + +### Qwen3-32B 6K-1_5K 18ms on A2 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A2 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 18ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 
+ +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 32 \ + --disable-radix-cache \ + --chunked-prefill-size 24576 --max-prefill-tokens 65536 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1 +``` + +### Qwen3-32B 4K-1_5K 11ms on A2 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A2 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 4K+1.5K + +TPOT: 11ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export 
HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu \ + --max-running-requests 32 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 65536 \ + --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 6 12 18 24 30 32 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4 +``` + +### Qwen3-32B 1K-0_3K 12ms on A3 2 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 1K+0.3K + +TPOT: 12ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + 
+export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --chunked-prefill-size -1 --max-prefill-tokens 16384 \ + --tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 8 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16 +``` + +### Qwen3-32B 6K-1_5K 17ms on A3 2 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 17ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + 
--chunked-prefill-size -1 --max-prefill-tokens 16384 \ + --tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 10 15 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16 +``` + +### Qwen3-8B 1K-0_3K 7ms on A3 1 Cards Mixed Mode + +Model: Qwen3-8B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 1K+0.3K + +TPOT: 7ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + 
--speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 16384 \ + --tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 2 4 6 9 10 15 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16 +``` + +### Qwen3-8B 6K-1_5K 12ms on A3 1 Cards Mixed Mode + +Model: Qwen3-8B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 12ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ 
+ --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 16384 \ + --tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 5 15 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16 +``` + +### Qwen3-32B 3_5K-1_5K 50ms on A2 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A2 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo 
+export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 78 \ + --disable-radix-cache --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-prefill-tokens 65536 \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 8 16 32 64 68 72 78 --dtype bfloat16 --base-gpu-id 4 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1 +``` + +### Qwen3-32B 2K-2K 50ms on A2 8 Cards Mixed Mode + +Model: Qwen3-32B + +Hardware: Atlas 800I A2 8Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 2K+2K + +TPOT: 50ms + +#### Model Deployment + +```shell Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " 
'{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 120 \ + --disable-radix-cache \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-prefill-tokens 49152 --base-gpu-id 4 \ + --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```shell Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 120 --random-range-ratio 1 +``` + +### Qwen3-30B-A3B 6K-1_5K 10ms on A3 1 Cards Mixed Mode + +Model: Qwen3-30B-A3B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 10ms + +#### Model Deployment + +```bash Command +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 16 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 35000 \ + --tp-size 2 --mem-fraction-static 0.6 --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16 +``` + +### Qwen3-30B-A3B 1K-0_3K 7ms on A3 1 Cards Mixed Mode + +Model: Qwen3-30B-A3B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 1K+0.3K + +TPOT: 7ms + +#### Model Deployment + +```bash Command +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=400 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --max-running-requests 8 \ + --disable-radix-cache \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \ + --chunked-prefill-size -1 --max-prefill-tokens 35000 \ + --tp-size 2 --mem-fraction-static 0.7 --cuda-graph-bs 1 2 3 4 5 6 7 8 --dtype bfloat16 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 8 --random-output-len 300 --random-input-len 1024 --num-prompts 8 +``` + +### Qwen3-Next 1K-0_3K 14_21ms on A3 2 Cards Mixed Mode + +Model: Qwen3-Next-80B-A3B-Instruct + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 1K+0.3K + +TPOT: 14.21ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=5 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000 +export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 + +export ASCEND_USE_FIA=1 +export SGLANG_NPU_USE_MULTI_STREAM=1 + +export SGLANG_WARMUP_TIMEOUT=3600 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export FORCE_DRAFT_MODEL_NON_QUANT=1 + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=2000 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export 
SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ + --page-size 128 \ + --tp-size 4 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --watchdog-timeout 9000 \ + --host 127.0.0.1 --port 6699 \ + --mem-fraction-static 0.75 \ + --disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-running-requests 312 \ + --cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \ + --mamba-ssm-dtype bfloat16 \ + --base-gpu-id 0 \ + --speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \ + --quantization modelslim \ + --moe-a2a-backend deepep --deepep-mode auto \ +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16 +``` + +### Qwen3-Next 6K-1_5K 15_62ms on A3 2 Cards Mixed Mode + +Model: Qwen3-Next-80B-A3B-Instruct + +Hardware: Atlas 800I A3 2Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 6K+1.5K + +TPOT: 15.62ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export 
PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=5 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000 +export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 + +export ASCEND_USE_FIA=1 +export SGLANG_NPU_USE_MULTI_STREAM=1 + +export SGLANG_WARMUP_TIMEOUT=3600 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export FORCE_DRAFT_MODEL_NON_QUANT=1 + +MODEL_PATH=xxx + +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +export HCCL_BUFFSIZE=2000 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ + --page-size 128 \ + --tp-size 4 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --watchdog-timeout 9000 \ + --host 127.0.0.1 --port 6699 \ + --mem-fraction-static 0.75 \ + --disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-running-requests 312 \ + --cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \ + --mamba-ssm-dtype bfloat16 \ + --base-gpu-id 0 \ + --speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \ + --quantization modelslim \ + --moe-a2a-backend deepep --deepep-mode auto \ +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16 +``` + +### Qwen3-14B 3_5K-1_5K 9ms on A3 1 Cards Mixed Mode + +Model: Qwen3-14B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 9ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_OP_EXPANSION_MODE="AIV" +export STREAMS_PER_DEVICE=32 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export ASCEND_USE_FIA=0 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --disable-radix-cache --mem-fraction-static 0.8 \ + --tp-size 1 --dp-size 1 \ + --sampling-backend ascend --max-running-requests 8 \ + --served-model-name Qwen3-14B \ + --chunked-prefill-size -1 \ + --cuda-graph-bs 8 \ + 
--dtype bfloat16 \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --schedule-conservativeness 0.01 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 8 --random-range-ratio 1 +``` + +### Qwen3-14B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode + +Model: Qwen3-14B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_OP_EXPANSION_MODE="AIV" +export STREAMS_PER_DEVICE=32 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo +export ASCEND_USE_FIA=0 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" 
+ +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --disable-radix-cache --mem-fraction-static 0.89 \ + --tp-size 1 --dp-size 2 \ + --sampling-backend ascend --max-running-requests 144 \ + --max-prefill-tokens 12288 \ + --served-model-name Qwen3-14B \ + --chunked-prefill-size -1 \ + --cuda-graph-bs 8 16 32 44 48 50 52 \ + --dtype bfloat16 \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \ + --schedule-conservativeness 0.01 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 144 --random-output-len 1500 --random-input-len 3500 --num-prompts 576 --random-range-ratio 1 +``` + +### Qwen3-8B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode + +Model: Qwen3-8B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 50ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export 
PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1 +export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=50 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --disable-radix-cache --mem-fraction-static 0.9 \ + --tp-size 1 \ + --max-running-requests 70 \ + --max-prefill-tokens 16384 \ + --served-model-name Qwen3-8B \ + --chunked-prefill-size 16384 \ + --cuda-graph-bs 8 12 24 36 48 51 55 60 63 64 66 68 70 \ + --dtype bfloat16 \ + --speculative-draft-model-quantization unquant \ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. 
+ +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 64 --random-output-len 1500 --random-input-len 3500 --num-prompts 256 --random-range-ratio 1 +``` + +### Qwen3-8B 3_5K-1_5K 5ms on A3 1 Cards Mixed Mode + +Model: Qwen3-8B + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 5ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +MODEL_PATH=xxx + +export SGLANG_SET_CPU_AFFINITY=1 +export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_OP_EXPANSION_MODE="AIV" +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python -m sglang.launch_server --model-path $MODEL_PATH \ + --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \ + --attention-backend ascend --device npu --quantization modelslim \ + --disable-radix-cache --mem-fraction-static 0.894 \ + --tp-size 2 \ + --max-running-requests 1 \ + --max-prefill-tokens 16384 \ + --served-model-name Qwen3-8B \ + --chunked-prefill-size -1 \ + --cuda-graph-bs 1 \ + --dtype bfloat16 \ + --speculative-draft-model-quantization unquant 
\ + --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \ + --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 4 --random-range-ratio 1 +``` + +### Qwen3-Next 3_5K-1_5K 20ms on A3 1 Cards Mixed Mode + +Model: Qwen3-Next-80B-A3B-Instruct + +Hardware: Atlas 800I A3 1Card + +DeployMode: PD Mixed + +Dataset: random + +Input Output Length: 3.5K+1.5K + +TPOT: 20ms + +#### Model Deployment + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=400 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=10 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048 +export HCCL_OP_EXPANSION_MODE="AIV" +export TASK_QUEUE_ENABLE=1 +export ASCEND_USE_FIA=1 +export SGLANG_NPU_USE_MULTI_STREAM=0 +export SGLANG_WARMUP_TIMEOUT=3600 +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export FORCE_DRAFT_MODEL_NON_QUANT=1 +export HCCL_BUFFSIZE=2000 +export ZBCCL_LOCAL_MEM_SIZE=60416 +export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0 + +export 
ZBCCL_BOOTSTRAP_URL=tcp://127.0.0.1:24669 +export ZBCCL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True +export ZBCCL_ENABLE_GRAPH=1 + +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +MODEL_PATH=xxx + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` + +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" + +python3 -m sglang.launch_server --model-path ${MODEL_PATH} \ + --page-size 128 \ + --tp-size 2 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --quantization modelslim \ + --watchdog-timeout 9000 \ + --host 127.0.0.1 --port 6699 \ + --mem-fraction-static 0.85 \ + --disable-radix-cache --max-prefill-tokens 28672 --context-length 26384 --max-total-tokens 122304 \ + --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \ + --chunked-prefill-size -1 --max-running-requests 2 \ + --cuda-graph-bs 2 \ + --mamba-ssm-dtype bfloat16 \ + --speculative-draft-model-path /path/to/Qwen3-Next-80B-A3B-Instruct +``` + +#### Benchmark + +We tested it based on the `RANDOM` dataset. + +```bash Command +python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 1 +``` diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_deepseek_example.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_deepseek_example.mdx new file mode 100644 index 000000000000..6bf89d21f963 --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_deepseek_example.mdx @@ -0,0 +1,301 @@ +--- +title: "DeepSeek Examples" +metatags: + description: "Examples for running DeepSeek models on Ascend NPUs, including PD mixed mode, PD disaggregation, and SGLang Model Gateway." 
+--- + +## Running DeepSeek-V3 + +### Running DeepSeek in PD mixed mode on 1 x Atlas 800I A3. + +W4A8 Model weights could be found [here](https://modelers.cn/models/Modelers_Park/DeepSeek-R1-0528-w4a8). + +```shell Launch Server +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 + +#Deepep communication settings +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32 +export HCCL_BUFFSIZE=1600 + +#spec overlap +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + +#npu acceleration operator +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 + +python3 -m sglang.launch_server \ + --model-path ${MODEL_PATH} \ + --tp 16 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --quantization modelslim \ + --watchdog-timeout 9000 \ + --cuda-graph-bs 8 16 24 28 32 \ + --mem-fraction-static 0.68 \ + --max-running-requests 128 \ + --context-length 8188 \ + --disable-radix-cache \ + --chunked-prefill-size -1 \ + --max-prefill-tokens 16384 \ + --moe-a2a-backend deepep \ + --deepep-mode auto \ + --enable-dp-attention \ + --dp-size 4 \ + --enable-dp-lm-head \ + --speculative-algorithm NEXTN \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --dtype bfloat16 +``` + +### Running DeepSeek with PD disaggregation mode on 2 x Atlas 800I A3. + +W4A8 Model weights could be found [here](https://modelers.cn/models/Modelers_Park/DeepSeek-R1-0528-w4a8). + +1. 
Prefill: + +```bash Command +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 + +#memfabric config store +export ASCEND_MF_STORE_URL="tcp://:" + +#Deepep communication settings +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export HCCL_BUFFSIZE=1536 + +#npu acceleration operator +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 +export TASK_QUEUE_ENABLE=2 + +python -m sglang.launch_server \ + --model-path ${MODEL_PATH} \ + --host $PREFILL_HOST_IP \ + --port 8000 \ + --disaggregation-mode prefill \ + --disaggregation-bootstrap-port 8996 \ + --disaggregation-transfer-backend ascend \ + --trust-remote-code \ + --nnodes 1 \ + --node-rank 0 \ + --tp-size 16 \ + --mem-fraction-static 0.6 \ + --attention-backend ascend \ + --device npu \ + --quantization modelslim \ + --load-balance-method round_robin \ + --max-running-requests 8 \ + --context-length 8192 \ + --disable-radix-cache \ + --chunked-prefill-size -1 \ + --max-prefill-tokens 28680 \ + --moe-a2a-backend deepep \ + --deepep-mode normal \ + --speculative-algorithm NEXTN \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --dp-size 2 \ + --enable-dp-attention \ + --disable-shared-experts-fusion \ + --dtype bfloat16 +``` + +2. Decode: + +```bash Command +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 + +#memfabric config store +export ASCEND_MF_STORE_URL="tcp://:" + +#Deepep communication settings +export HCCL_BUFFSIZE=720 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=88 + +#spec overlap +export SGLANG_ENABLE_SPEC_V2=1 +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + +#npu acceleration operator +unset TASK_QUEUE_ENABLE +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 + +# suggest max-running-requests <= max-cuda-graph-bs * dp_size, Because when this value is exceeded, performance will significantly degrade. 
+python -m sglang.launch_server \ + --model-path ${MODEL_PATH} \ + --disaggregation-mode decode \ + --host $DECODE_HOST_IP \ + --port 8001 \ + --trust-remote-code \ + --nnodes 1 \ + --node-rank 0 \ + --tp-size 16 \ + --dp-size 16 \ + --mem-fraction-static 0.8 \ + --max-running-requests 352 \ + --attention-backend ascend \ + --device npu \ + --quantization modelslim \ + --moe-a2a-backend deepep \ + --enable-dp-attention \ + --deepep-mode low_latency \ + --enable-dp-lm-head \ + --cuda-graph-bs 8 10 12 14 16 18 20 22 \ + --disaggregation-transfer-backend ascend \ + --watchdog-timeout 9000 \ + --context-length 8192 \ + --speculative-algorithm NEXTN \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 4 \ + --disable-shared-experts-fusion \ + --dtype bfloat16 \ + --tokenizer-worker-num 4 +``` + +3. SGLang Model Gateway (former Router) + +```bash Command +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://:8000 8996 \ + --decode http://:8001 \ + --host 127.0.0.1 \ + --port 6688 +``` + +### Running DeepSeek with PD disaggregation on 4 x Atlas 800I A3. + +W8A8 Model weights could be found [here](https://modelers.cn/models/State_Cloud/Deepseek-R1-bf16-hfd-w8a8). + +1. 
Prefill & Decode: + +```bash Command +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +export SGLANG_SET_CPU_AFFINITY=1 +unset ASCEND_LAUNCH_BLOCKING +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh +export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH + +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 + +export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669" + +P_IP=('your prefill ip1' 'your prefill ip2') + +D_IP=('your decode ip1' 'your decode ip2') + +MODEL_PATH=xxx + +export SGLANG_NPU_USE_MLAPO=1 +export SGLANG_USE_FIA_NZ=1 + +LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'` +LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'` +echo "${LOCAL_HOST1}" +echo "${LOCAL_HOST2}" +# prefill +for i in "${!P_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]]; + then + echo "${P_IP[$i]}" + export HCCL_BUFFSIZE=1536 + export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 + export TASK_QUEUE_ENABLE=2 + + export HCCL_SOCKET_IFNAME=lo + export GLOO_SOCKET_IFNAME=lo + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \ + --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \ + --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \ + --disaggregation-transfer-backend ascend --max-running-requests 8 --context-length 8192 --disable-radix-cache \ + --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \ + --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ + --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 
--enable-attn-tp-input-scattered + NODE_RANK=$i + break + fi +done + +# decode +for i in "${!D_IP[@]}"; +do + if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]]; + then + echo "${D_IP[$i]}" + export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 + export SGLANG_ENABLE_SPEC_V2=1 + export HCCL_BUFFSIZE=650 + export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=78 + export TASK_QUEUE_ENABLE=1 + export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1 + export HCCL_SOCKET_IFNAME=xxx + export GLOO_SOCKET_IFNAME=xxx + python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \ + --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \ + --mem-fraction-static 0.815 --max-running-requests 832 --attention-backend ascend --device npu --quantization modelslim \ + --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \ + --cuda-graph-bs 12 14 16 18 20 22 24 26 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \ + --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \ + --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \ + --load-balance-method decode_round_robin + NODE_RANK=$i + break + fi +done +``` + +2. 
SGLang Model Gateway (former Router): + +```bash Command +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://P_IP:8000 8998 \ + --prefill http://P_IP:8000 8999 \ + --decode http://D_IP:8001 \ + --host 127.0.0.1 \ + --port 6688 \ + --mini-lb +``` + +### test gsm8k + +```python Test GSM8K +from types import SimpleNamespace +from sglang.test.few_shot_gsm8k import run_eval + +def gsm8k(): + args = SimpleNamespace( + num_shots=5, + data_path=None, + num_questions=200, + max_new_tokens=512, + parallel=32, + host=f"http://127.0.0.1", + port=6688, + ) + metrics = run_eval(args) + print(f"{metrics=}") + print(f"{metrics['accuracy']=}") +if __name__ == "__main__": + gsm8k() +``` diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_environment_variables.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_environment_variables.mdx new file mode 100644 index 000000000000..2ea91da4c676 --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_environment_variables.mdx @@ -0,0 +1,149 @@ +--- +title: "Environment Variables" +metatags: + description: "Reference commonly used Ascend NPU environment variables for configuring SGLang runtime behavior." +--- +SGLang supports various environment variables related to Ascend NPU that can be used to configure its runtime behavior. +This document provides a list of commonly used environment variables and aims to stay updated over time. + +## Directly Used in SGLang + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_NPU_USE_MLAPO` | Uses the MLAPO fusion operator in the attention preprocessing stage of MLA models. | `false` |
| `SGLANG_USE_FIA_NZ` | Reshapes the KV cache into the FIA NZ format. `SGLANG_USE_FIA_NZ` must be enabled together with `SGLANG_NPU_USE_MLAPO`. | `false` |
| `SGLANG_NPU_USE_MULTI_STREAM` | Enables dual-stream computation of shared experts and routing experts in DeepSeek models, and dual-stream computation in the DeepSeek NSA Indexer. | `false` |
| `SGLANG_NPU_DISABLE_ACL_FORMAT_WEIGHT` | Disables casting model weight tensors to a specific NPU ACL format. | `false` |
| `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | The maximum number of dispatched tokens on each rank. | `128` |

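
The table notes that `SGLANG_USE_FIA_NZ` must be enabled together with `SGLANG_NPU_USE_MLAPO`. A minimal sketch of a pre-launch check for that dependency follows. The helper names `is_enabled` and `check_fia_nz` are hypothetical, and the rule for which strings count as "enabled" is an assumption rather than SGLang's own parsing logic.

```python
import os


def is_enabled(env, name):
    """Return True if the variable looks enabled.

    Assumption: "1", "true" and "yes" (case-insensitive) mean enabled;
    SGLang's actual parsing may differ.
    """
    return env.get(name, "false").strip().lower() in {"1", "true", "yes"}


def check_fia_nz(env):
    """Return a list of problems with the FIA NZ / MLAPO combination."""
    problems = []
    if is_enabled(env, "SGLANG_USE_FIA_NZ") and not is_enabled(env, "SGLANG_NPU_USE_MLAPO"):
        problems.append("SGLANG_USE_FIA_NZ requires SGLANG_NPU_USE_MLAPO to be enabled")
    return problems
```

Calling `check_fia_nz(os.environ)` before starting the server returns an empty list when the combination is consistent.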
+ +## Used in DeepEP Ascend + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS` | Enables the ant-moving function in the dispatch stage. Indicates the number of tokens transmitted per round on each rank. | `8192` |
| `DEEPEP_NORMAL_LONG_SEQ_ROUND` | Enables the ant-moving function in the dispatch stage. Indicates the number of rounds transmitted on each rank. | `1` |
| `DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ` | Enables the ant-moving function in the combine stage. The value `0` means disabled. | `0` |
| `MOE_ENABLE_TOPK_NEG_ONE` | Must be enabled when the expert IDs to be processed by DeepEP contain `-1`. | `0` |
| `DEEP_NORMAL_MODE_USE_INT8_QUANT` | Quantizes `x` to int8 and returns `(tensor, scales)` in the dispatch operator. | `0` |

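
The two long-sequence variables describe a number of rounds and a number of tokens per round on each rank. The sketch below multiplies them to estimate how many tokens one rank can move in the dispatch stage. Treating the product as the per-rank capacity is an assumption drawn from the descriptions above, not a documented formula, and `long_seq_capacity` is a hypothetical helper.

```python
def long_seq_capacity(env):
    """Estimate tokens one rank can dispatch across all rounds.

    Assumption: capacity = rounds * tokens_per_round. The fallbacks are the
    defaults listed in the table (1 round, 8192 tokens per round).
    """
    rounds = int(env.get("DEEPEP_NORMAL_LONG_SEQ_ROUND", "1"))
    per_round = int(env.get("DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS", "8192"))
    return rounds * per_round


# Values used in the Qwen3-Next 2-card example: 5 rounds of 3000 tokens.
qwen3_next_env = {
    "DEEPEP_NORMAL_LONG_SEQ_ROUND": "5",
    "DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS": "3000",
}
print(long_seq_capacity(qwen3_next_env))
```

Under this reading, the Qwen3-Next settings give 5 × 3000 = 15000 tokens per rank, which is above the `--max-prefill-tokens 14080` used in that example.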
+ +## Others + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `TASK_QUEUE_ENABLE` | Controls the optimization level of the dispatch queue for the task_queue operator. Detail | `1` |
| `INF_NAN_MODE_ENABLE` | Controls whether the chip uses saturation mode or INF_NAN mode. Detail | `1` |
| `STREAMS_PER_DEVICE` | Configures the maximum number of streams for the stream pool. Detail | `32` |
| `PYTORCH_NPU_ALLOC_CONF` | Controls the behavior of the cache allocator. This variable changes memory usage and may cause performance fluctuations. Detail | |
| `ASCEND_MF_STORE_URL` | The address of the config store in MemFabric during PD separation, generally set to the IP address of the P (prefill) primary node with an arbitrary port number. | |
| `ASCEND_LAUNCH_BLOCKING` | Controls whether synchronous mode is enabled during operator execution. Detail | `0` |
| `HCCL_OP_EXPANSION_MODE` | Configures the expansion position for communication algorithm scheduling. Detail | |
| `HCCL_BUFFSIZE` | Controls the size of the buffer area for shared data between two NPUs. The unit is MB, and the value must be greater than or equal to 1. Detail | `200` |
| `HCCL_SOCKET_IFNAME` | Configures the name of the network card used by the host during HCCL initialization. Detail | |
| `GLOO_SOCKET_IFNAME` | Configures the network interface name for GLOO communication. | |

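
The `HCCL_BUFFSIZE` row states a unit (MB), a lower bound (1), and a default (200). A minimal sketch of validating the value before launch follows; `hccl_buffsize_mb` is a hypothetical helper, not part of SGLang or HCCL.

```python
def hccl_buffsize_mb(env):
    """Return HCCL_BUFFSIZE in MB, using the documented default of 200.

    Raises ValueError if the value is not an integer or is below 1 MB.
    """
    value = int(env.get("HCCL_BUFFSIZE", "200"))
    if value < 1:
        raise ValueError(f"HCCL_BUFFSIZE must be >= 1 MB, got {value}")
    return value
```

Passing `os.environ` (after `import os`) checks the current shell; a plain dict works for testing a planned configuration.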
diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_glm5_examples.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_glm5_examples.mdx new file mode 100644 index 000000000000..0f2fd5ce7886 --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_glm5_examples.mdx @@ -0,0 +1,203 @@ +--- +title: "GLM-5 examples" +metatags: + description: "Documentation for GLM-5 examples" +--- +## Introduction + +The GLM (General Language Model) series is an open-source bilingual large language model family jointly developed by the KEG Laboratory of Tsinghua University and Zhipu AI. The series performs strongly on Chinese NLP tasks thanks to its unified pre-training framework and bilingual capabilities. [GLM-5](https://huggingface.co/zai-org/GLM-5) adopts the DeepSeek-V3/V3.2 architecture, including sparse attention (DSA) and multi-token prediction (MTP). Ascend provides day-0 support for GLM-5 on the SGLang inference framework, with low-code enablement and compatibility with the mainstream distributed parallelism features already available in SGLang. We welcome developers to download and try it. + +## Environment Preparation + +### Model Weight + +- `GLM-5.0` (BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5). +- `GLM-5.0-w4a8` (quantized version without MTP): [Download model weight](https://modelers.cn/models/Eco-Tech/GLM-5-w4a8). +- You can also use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantize the model yourself. + + +### Installation + +The dependencies required for the NPU runtime environment are packaged in a Docker image published online. You can pull it directly. 
+
+```bash Command
+# Atlas 800 A3
+docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann8.5.0-a3-glm5
+# Atlas 800 A2
+docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann8.5.0-910b-glm5
+
+# start container
+docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
+--net=host \
+-v /var/queue_schedule:/var/queue_schedule \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /usr/local/sbin:/usr/local/sbin \
+-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
+--device=/dev/davinci0:/dev/davinci0 \
+--device=/dev/davinci1:/dev/davinci1 \
+--device=/dev/davinci2:/dev/davinci2 \
+--device=/dev/davinci3:/dev/davinci3 \
+--device=/dev/davinci4:/dev/davinci4 \
+--device=/dev/davinci5:/dev/davinci5 \
+--device=/dev/davinci6:/dev/davinci6 \
+--device=/dev/davinci7:/dev/davinci7 \
+--device=/dev/davinci8:/dev/davinci8 \
+--device=/dev/davinci9:/dev/davinci9 \
+--device=/dev/davinci10:/dev/davinci10 \
+--device=/dev/davinci11:/dev/davinci11 \
+--device=/dev/davinci12:/dev/davinci12 \
+--device=/dev/davinci13:/dev/davinci13 \
+--device=/dev/davinci14:/dev/davinci14 \
+--device=/dev/davinci15:/dev/davinci15 \
+--device=/dev/davinci_manager:/dev/davinci_manager \
+--device=/dev/hisi_hdc:/dev/hisi_hdc \
+--entrypoint=bash \
+swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:${TAG}
+```
+
+### Best Practices
+Note: To follow the **best practices** with this image, update transformers to version 5.3.0.
+```bash
+# reinstall transformers
+
+# Option 1: install transformers 5.3.0 from PyPI
+pip install transformers==5.3.0
+
+# Option 2: install the v5.3.0 tag from GitHub
+pip install git+https://github.com/huggingface/transformers.git@v5.3.0
+```
+
+## Deployment
+
+### Single-node Deployment
+
+- Quantized model `glm5_w4a8` can be deployed on 1 Atlas 800 A3 (64G × 16).
+
+Run the following script to execute online inference.
+
+```shell Launch Server
+# high performance cpu
+echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+sysctl -w vm.swappiness=0
+sysctl -w kernel.numa_balancing=0
+sysctl -w kernel.sched_migration_cost_ns=50000
+# bind cpu
+export SGLANG_SET_CPU_AFFINITY=1
+
+unset https_proxy
+unset http_proxy
+unset HTTPS_PROXY
+unset HTTP_PROXY
+unset ASCEND_LAUNCH_BLOCKING
+# cann
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+source /usr/local/Ascend/nnal/atb/set_env.sh
+
+export STREAMS_PER_DEVICE=32
+export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
+export SGLANG_ENABLE_SPEC_V2=1
+export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+export SGLANG_NPU_USE_MULTI_STREAM=1
+export HCCL_BUFFSIZE=1000
+export HCCL_OP_EXPANSION_MODE=AIV
+export HCCL_SOCKET_IFNAME=lo
+export GLOO_SOCKET_IFNAME=lo
+
+python3 -m sglang.launch_server \
+    --model-path $MODEL_PATH \
+    --attention-backend ascend \
+    --device npu \
+    --tp-size 16 --nnodes 1 --node-rank 0 \
+    --chunked-prefill-size 16384 --max-prefill-tokens 280000 \
+    --trust-remote-code \
+    --host 127.0.0.1 \
+    --mem-fraction-static 0.7 \
+    --port 8000 \
+    --served-model-name glm-5 \
+    --cuda-graph-bs 16 \
+    --quantization modelslim \
+    --moe-a2a-backend deepep --deepep-mode auto
+```
+
+### Multi-node Deployment
+
+- `GLM-5-bf16`: requires at least 2 Atlas 800 A3 (64G × 16).
+
+**A3 series**
+
+Set the IPs of the two nodes in the script, then run the same script on both nodes.
+
+**node 0/1**
+
+```shell Launch Multi-node Server
+echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+sysctl -w vm.swappiness=0
+sysctl -w kernel.numa_balancing=0
+sysctl -w kernel.sched_migration_cost_ns=50000
+# bind cpu
+export SGLANG_SET_CPU_AFFINITY=1
+
+unset https_proxy
+unset http_proxy
+unset HTTPS_PROXY
+unset HTTP_PROXY
+unset ASCEND_LAUNCH_BLOCKING
+# cann
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+source /usr/local/Ascend/nnal/atb/set_env.sh
+
+export STREAMS_PER_DEVICE=32
+export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
+export SGLANG_ENABLE_SPEC_V2=1
+export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+export SGLANG_NPU_USE_MULTI_STREAM=1
+export HCCL_BUFFSIZE=1000
+export HCCL_OP_EXPANSION_MODE=AIV
+
+# Run ifconfig on both nodes and find the interface whose inet addr matches your node IP. That is your public interface; set it here.
+export HCCL_SOCKET_IFNAME=lo
+export GLOO_SOCKET_IFNAME=lo
+
+
+P_IP=('your ip1' 'your ip2')
+P_MASTER="${P_IP[0]}:your port"
+export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
+
+export SGLANG_ENABLE_SPEC_V2=1
+export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+
+LOCAL_HOST1=$(hostname -I | awk -F " " '{print $1}')
+LOCAL_HOST2=$(hostname -I | awk -F " " '{print $2}')
+for i in "${!P_IP[@]}";
+do
+    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
+    then
+        echo "${P_IP[$i]}"
+        python3 -m sglang.launch_server \
+            --model-path $MODEL_PATH \
+            --attention-backend ascend \
+            --device npu \
+            --tp-size 32 --nnodes 2 --node-rank $i --dist-init-addr $P_MASTER \
+            --chunked-prefill-size 16384 --max-prefill-tokens 131072 \
+            --trust-remote-code \
+            --host 127.0.0.1 \
+            --mem-fraction-static 0.8 \
+            --port 8000 \
+            --served-model-name glm-5 \
+            --cuda-graph-max-bs 16 \
+            --disable-radix-cache
+        NODE_RANK=$i
+        break
+    fi
+done
+
+```
+
+### Prefill-Decode Disaggregation
+
+Not tested yet.
+
+### Using Benchmark
+
+Refer to [Benchmark and Profiling](../../developer_guide/benchmark_and_profiling) for details.
diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_quantization.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_quantization.mdx
new file mode 100644
index 000000000000..c952a3606948
--- /dev/null
+++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_quantization.mdx
@@ -0,0 +1,309 @@
+---
+title: "Quantization on Ascend"
+metatags:
+  description: "Load, export, and serve quantized models on Ascend NPUs with SGLang."
+---
+To load already quantized models, simply load the model weights and config. If the model has been quantized offline, there is no need to add the `--quantization` argument when starting the engine. The quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json` config.
+
+SGLang supports **mix-bits** quantization (each layer is defined and loaded independently, according to the quantization type specified in `quant_model_description.json`). [Advanced mix-bits for MoE](https://github.com/sgl-project/sglang/pull/17361) is in progress and will add independent quantization determination for the w13 (up-gate) and w2 (down) layers.
+
+[ModelSlim on Ascend support](https://github.com/sgl-project/sglang/pull/14504)
+
Quantization schemeLayer typeA2 SupportedA3 SupportedA5 SupportedDiffusion models
W4A4 dynamicLinearTBD
W8A8 staticLinearTBD
W8A8 dynamicLinearTBD
MXFP8LinearxxWIPWIP
W4A4 dynamicMoETBDx
W4A8 dynamicMoETBDx
W8A8 dynamicMoETBDx
MXFP8MoExxWIPx
+ +[AWQ on Ascend support](https://github.com/sgl-project/sglang/pull/10158): + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Quantization schemeLayer typeA2 SupportedA3 SupportedA5 Supported
W4A16LinearTBD
W8A16LinearTBD
W4A16MoETBD
+ +GPTQ on Ascend support + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Quantization schemeLayer typeA2 SupportedA3 SupportedA5 Supported
W4A16LinearTBD
W8A16LinearTBD
W4A16 MOEMoETBD
W8A16 MOEMoETBD
+ +[Auto-round on Ascend support](https://github.com/sgl-project/sglang/pull/16699) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Quantization schemeLayer typeA2 SupportedA3 SupportedA5 Supported
W4A16LinearTBD
W8A16LinearTBD
W4A16MoETBD
W8A16MoETBD
+ +Compressed-tensors (LLM Compressor) on Ascend support: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Quantization schemeLayer typeA2 SupportedA3 SupportedA5 Supported
W8A8 dynamicLinearTBD
W4A8 dynamic with/without activation clipMoETBD
W4A16 MOEMoETBD
W8A8 dynamicMoETBD
+ +[GGUF on Ascend support](https://github.com/sgl-project/sglang/pull/17883) + + + + + + + + + + + + + + + + + + + + + + + + + + +
Quantization typeLayer typeA2 SupportedA3 SupportedA5 Supported
All GGUF types (standard, K-quant)LinearTBD
All GGUF types (standard, K-quant)MoETBD
+ +**Usage Examples:** + +- Dense model (e.g. Qwen3-14B-Q4_K_M.gguf): + +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen3-14B-Q4_K_M.gguf \ + --device npu --attention-backend ascend \ + --host 0.0.0.0 --port 30000 \ + --mem-fraction-static 0.7 --tp-size 2 +``` + +- MoE model (e.g. Qwen3-30B-A3B-Q4_K_M.gguf): + +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen3-30B-A3B-Q4_K_M.gguf \ + --device npu --attention-backend ascend \ + --host 0.0.0.0 --port 30000 \ + --mem-fraction-static 0.8 --tp-size 2 +``` + +> **Implementation Notes:** +> - GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead). +> - MoE layers use `npu_grouped_matmul` and `npu_moe_init_routing` / `npu_moe_finalize_routing` for high-performance expert computation. +> - TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models. diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_quick_start.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_quick_start.mdx new file mode 100644 index 000000000000..7a88a5e93df6 --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_quick_start.mdx @@ -0,0 +1,107 @@ +--- +title: "Ascend NPU Quickstart" +metatags: + description: "Quickstart for running SGLang on Ascend NPUs with the official container image, including server launch and test request examples." +--- + +## Prerequisites + +### Supported Devices + +- Atlas 800I A2 inference series (Atlas 800I A2) +- Atlas 800I A3 inference series (Atlas 800I A3) + +## Setup environment using container + +__Notice:__ The following commands are based on Atlas 800I A3 machines. If you are using Atlas 800I A2, some changes are needed. + +- The image tag needs to be `main-cann8.5.0-a3` for Atlas 800I A3 and `main-cann8.5.0-910b` for Atlas 800I A2. 
+- The device mapping in `docker run` command needs to be changed to `davinci[0-7]` for Atlas 800I A2. + +```shell Command +# For Atlas 800I A3 +export IMAGE=quay.io/ascend/sglang:main-cann8.5.0-a3 + +docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \ + --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \ + --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \ + --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \ + --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \ + --device=/dev/davinci_manager \ + --device=/dev/hisi_hdc \ + --volume /usr/local/sbin:/usr/local/sbin \ + --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \ + --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \ + --volume /etc/ascend_install.info:/etc/ascend_install.info \ + --volume /var/queue_schedule:/var/queue_schedule \ + --volume ~/.cache/:/root/.cache/ \ + --entrypoint=bash \ + $IMAGE +``` + +## Usage + +The SGLang server is installed in the container by default. You can use `pip show sglang` to check the version. + +### Start SGLang server + +SGLang will automatically download the model from Hugging Face. + +```shell Command +# Set HF_ENDPOINT to a mirror site if network is not available +export HF_ENDPOINT=https://hf-mirror.com + +# Set your own HF_TOKEN to download restricted models +export HF_TOKEN= + +# Start SGLang server +# It may take several minutes to download the model on the first run +sglang serve --model-path Qwen/Qwen2.5-7B-Instruct --attention-backend ascend & +``` + +If you see output like the following, the server is running. + +```log Output +INFO: Waiting for application startup. +INFO: Application startup complete. +INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit) +The server is fired up and ready to roll! 
+``` + +### Send a test request + +You can do inference using the server: + +```shell Command +curl -X POST http://localhost:30000/generate \ + -H "Content-Type: application/json" \ + -d '{ + "text": "The capital of France is", + "sampling_params": { + "temperature": 0, + "max_new_tokens": 16 + } + }' +``` + +If the "text" field in the response contains "Paris", the server is working as expected. + +### Stop server and exit container + +The SGLang server is running as a background process. You can send a `SIGINT` signal to stop it. + +```shell Command +SGLANG_PID=$(pgrep -f "sglang serve") +kill -SIGINT $SGLANG_PID +``` + +The output should be like the following: + +```log Output +INFO: Shutting down +INFO: Waiting for application shutdown. +INFO: Application shutdown complete. +INFO: Finished server process [25310] +``` + +The server has now stopped. You can verify it with `ps -ef | grep sglang`, then exit the container by pressing `Ctrl+D`. diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_5_examples.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_5_examples.mdx new file mode 100644 index 000000000000..f9fad5ad3596 --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_5_examples.mdx @@ -0,0 +1,234 @@ +--- +title: "Qwen3.5 examples" +metatags: + description: "Documentation for Qwen3.5 examples" +--- +## Environment Preparation + +### Installation + +The dependencies required for the NPU runtime environment have been integrated into a Docker image and uploaded to the quay.io platform. You can directly pull it. 
+ +```bash Command +#Atlas 800 A3 +docker pull quay.io/ascend/sglang:main-cann8.5.0-a3 +#Atlas 800 A2 +docker pull quay.io/ascend/sglang:main-cann8.5.0-910b + +#start container +docker run -itd --shm-size=16g --privileged=true --name ${NAME} \ +--privileged=true --net=host \ +-v /var/queue_schedule:/var/queue_schedule \ +-v /etc/ascend_install.info:/etc/ascend_install.info \ +-v /usr/local/sbin:/usr/local/sbin \ +-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ +-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \ +--device=/dev/davinci0:/dev/davinci0 \ +--device=/dev/davinci1:/dev/davinci1 \ +--device=/dev/davinci2:/dev/davinci2 \ +--device=/dev/davinci3:/dev/davinci3 \ +--device=/dev/davinci4:/dev/davinci4 \ +--device=/dev/davinci5:/dev/davinci5 \ +--device=/dev/davinci6:/dev/davinci6 \ +--device=/dev/davinci7:/dev/davinci7 \ +--device=/dev/davinci8:/dev/davinci8 \ +--device=/dev/davinci9:/dev/davinci9 \ +--device=/dev/davinci10:/dev/davinci10 \ +--device=/dev/davinci11:/dev/davinci11 \ +--device=/dev/davinci12:/dev/davinci12 \ +--device=/dev/davinci13:/dev/davinci13 \ +--device=/dev/davinci14:/dev/davinci14 \ +--device=/dev/davinci15:/dev/davinci15 \ +--device=/dev/davinci_manager:/dev/davinci_manager \ +--device=/dev/hisi_hdc:/dev/hisi_hdc \ +--entrypoint=bash \ +quay.io/ascend/sglang:${tag} +``` + +## Deployment + +### Single-node Deployment + +Run the following script to execute online inference. 
+ +#### Qwen3.5 397B + +```bash Command +# high performance cpu +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1000 +export HCCL_OP_EXPANSION_MODE=AIV +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 16 --nnodes 1 --node-rank 0 \ + --chunked-prefill-size 4096 --max-prefill-tokens 280000 \ + --disable-radix-cache \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.7 \ + --port 8000 \ + --cuda-graph-bs 16 \ + --quantization modelslim \ + --enable-multimodal \ + --mm-attention-backend ascend_attn \ + --dtype bfloat16 +``` + +#### Qwen3.5 122B + +```bash Command +# high performance cpu +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1000 +export HCCL_OP_EXPANSION_MODE=AIV +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 8 --nnodes 1 --node-rank 0 \ + --chunked-prefill-size 4096 
--max-prefill-tokens 280000 \ + --disable-radix-cache \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.7 \ + --port 8000 \ + --cuda-graph-bs 16 \ + --quantization modelslim \ + --enable-multimodal \ + --mm-attention-backend ascend_attn \ + --dtype bfloat16 +``` + +#### Qwen3.5 35B + +```bash Command +# high performance cpu +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1000 +export HCCL_OP_EXPANSION_MODE=AIV +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 2 --nnodes 1 --node-rank 0 \ + --chunked-prefill-size 4096 --max-prefill-tokens 280000 \ + --disable-radix-cache \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.7 \ + --port 8000 \ + --cuda-graph-bs 16 \ + --quantization modelslim \ + --enable-multimodal \ + --mm-attention-backend ascend_attn \ + --dtype bfloat16 +``` + +#### Qwen3.5 27B + +```bash Command +# high performance cpu +echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +sysctl -w vm.swappiness=0 +sysctl -w kernel.numa_balancing=0 +sysctl -w kernel.sched_migration_cost_ns=50000 +# bind cpu +export SGLANG_SET_CPU_AFFINITY=1 + +unset https_proxy +unset http_proxy +unset HTTPS_PROXY +unset HTTP_PROXY +unset ASCEND_LAUNCH_BLOCKING +# cann +source /usr/local/Ascend/ascend-toolkit/set_env.sh +source /usr/local/Ascend/nnal/atb/set_env.sh + +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1000 +export 
HCCL_OP_EXPANSION_MODE=AIV +export HCCL_SOCKET_IFNAME=lo +export GLOO_SOCKET_IFNAME=lo + +python3 -m sglang.launch_server \ + --model-path $MODEL_PATH \ + --attention-backend ascend \ + --device npu \ + --tp-size 2 \ + --chunked-prefill-size -1 --max-prefill-tokens 120000 \ + --disable-radix-cache \ + --trust-remote-code \ + --host 127.0.0.1 \ + --mem-fraction-static 0.8 \ + --port 8000 \ + --cuda-graph-bs 32 \ + --enable-multimodal \ + --mm-attention-backend ascend_attn +``` + +### Prefill-Decode Disaggregation + +Not test yet. + +### Using Benchmark + +Refer to [Benchmark and Profiling](../../developer_guide/benchmark_and_profiling) for details. diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_examples.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_examples.mdx new file mode 100644 index 000000000000..22bcb24bca35 --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_qwen3_examples.mdx @@ -0,0 +1,212 @@ +--- +title: "Qwen3 Examples" +metatags: + description: "Documentation for Qwen3 Examples" +--- +## Qwen3 examples + +### Running Qwen3 + +#### Running Qwen3-32B on 1 x Atlas 800I A3. + +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-32B) + +```shell Launch Server +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1536 +export HCCL_OP_EXPANSION_MODE=AIV + +python -m sglang.launch_server \ + --device npu \ + --attention-backend ascend \ + --trust-remote-code \ + --tp-size 4 \ + --model-path Qwen/Qwen3-32B \ + --mem-fraction-static 0.8 +``` + +#### Running Qwen3-32B on 1 x Atlas 800I A3 with Qwen3-32B-Eagle3. 
+ +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-32B) + +Speculative model weights could be found [here](https://huggingface.co/Zhihu-ai/Zhi-Create-Qwen3-32B-Eagle3) + +```shell Launch Server with Eagle3 +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_OP_EXPANSION_MODE=AIV +export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 +export SGLANG_ENABLE_SPEC_V2=1 + +python -m sglang.launch_server \ + --device npu \ + --attention-backend ascend \ + --trust-remote-code \ + --tp-size 4 \ + --model-path Qwen/Qwen3-32B \ + --mem-fraction-static 0.8 \ + --speculative-algorithm EAGLE3 \ + --speculative-draft-model-path Qwen/Qwen3-32B-Eagle3 \ + --speculative-num-steps 1 \ + --speculative-eagle-topk 1 \ + --speculative-num-draft-tokens 2 +``` + +#### Running Qwen3-30B-A3B MOE on 1 x Atlas 800I A3. + +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-30B-A3B) + +```shell Launch Server +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1536 +export HCCL_OP_EXPANSION_MODE=AIV +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32 +export SGLANG_DEEPEP_BF16_DISPATCH=1 + +python -m sglang.launch_server \ + --device npu \ + --attention-backend ascend \ + --trust-remote-code \ + --tp-size 4 \ + --model-path Qwen/Qwen3-30B-A3B \ + --mem-fraction-static 0.8 +``` + +#### Running Qwen3-235B-A22B-Instruct-2507 MOE on 1 x Atlas 800I A3. 
+ +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) + +```shell Launch Server +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1536 +export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32 +export SGLANG_DEEPEP_BF16_DISPATCH=1 + +python -m sglang.launch_server \ + --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \ + --tp-size 16 \ + --trust-remote-code \ + --attention-backend ascend \ + --device npu \ + --watchdog-timeout 9000 \ + --mem-fraction-static 0.8 +``` + +#### Running Qwen3-235B-A22B-Instruct-2507 with 256K long sequence on 2 x Atlas 800I A3 without CP + +This example uses **PD disaggregation** for long-sequence inference and keeps **context parallel disabled**. + +Set the shared environment variables on both nodes first: + +```bash Command +export ASCEND_USE_FIA=1 +export SGLANG_SET_CPU_AFFINITY=1 +export ASCEND_MF_STORE_URL="tcp://:12345" +export HCCL_SOCKET_IFNAME= +export GLOO_SOCKET_IFNAME= + +MODEL_PATH=/root/.cache/modelscope/hub/models/zcgy26/Qwen3-235B-A22B-Instruct-2507-w8a8 +``` + +**Prefill node:** + +```bash Command +export ASCEND_LAUNCH_BLOCKING=1 +export DEEP_NORMAL_MODE_USE_INT8_QUANT=1 +export HCCL_BUFFSIZE=1500 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=128 +export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 + +python3 -m sglang.launch_server \ + --model-path ${MODEL_PATH} \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend ascend \ + --disaggregation-bootstrap-port 8995 \ + --attention-backend ascend \ + --disable-radix-cache \ + --quantization modelslim \ + --chunked-prefill-size -1 \ + --skip-server-warmup \ + --device npu \ + --tp-size 16 \ + --mem-fraction-static 0.45 \ + --max-running-requests 1 \ + --host \ + --port 8000 \ + --dist-init-addr :5000 \ + --nnodes 1 \ + --node-rank 0 \ + --moe-a2a-backend deepep \ + --deepep-mode normal 
+``` + +**Decode node:** + +```bash Command +export SGLANG_DEEPEP_BF16_DISPATCH=0 +export HCCL_BUFFSIZE=4000 +export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096 +export DEEPEP_NORMAL_LONG_SEQ_ROUND=16 + +python3 -m sglang.launch_server \ + --model-path ${MODEL_PATH} \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend ascend \ + --attention-backend ascend \ + --mem-fraction-static 0.8 \ + --disable-cuda-graph \ + --device npu \ + --disable-radix-cache \ + --quantization modelslim \ + --chunked-prefill-size 8192 \ + --skip-server-warmup \ + --tp-size 16 \ + --max-running-requests 1 \ + --host \ + --port 8232 \ + --moe-a2a-backend deepep \ + --deepep-mode low_latency \ + --disable-overlap-schedule +``` + +**Router:** + +```bash Command +python3 -m sglang_router.launch_router \ + --pd-disaggregation \ + --policy cache_aware \ + --prefill http://:8000 8995 \ + --decode http://:8232 \ + --host \ + --port 6689 \ + --prometheus-port 29010 +``` + +#### Running Qwen3-VL-8B-Instruct on 1 x Atlas 800I A3. 
+ +Model weights could be found [here](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) + +```shell Launch Server +export SGLANG_SET_CPU_AFFINITY=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export STREAMS_PER_DEVICE=32 +export HCCL_BUFFSIZE=1536 +export HCCL_OP_EXPANSION_MODE=AIV + +python -m sglang.launch_server \ + --enable-multimodal \ + --attention-backend ascend \ + --mm-attention-backend ascend_attn \ + --trust-remote-code \ + --tp-size 4 \ + --model-path Qwen/Qwen3-VL-8B-Instruct \ + --mem-fraction-static 0.8 +``` diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_ring_sp_performance.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_ring_sp_performance.mdx new file mode 100644 index 000000000000..2f0385c56efc --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_ring_sp_performance.mdx @@ -0,0 +1,110 @@ +--- +title: "Ascend NPU Ring-SP Performance (Wan2.1-T2V-1.3B)" +metatags: + description: "This page reports Ring-SP performance on Ascend NPU with torchnpu==2.10.0." +--- + +This page reports Ring-SP performance on Ascend NPU with `torch_npu==2.10.0`. + +- Baseline config: `ulysses=1, ring=1` (short: `u1r1`) +- Ring-SP config: `ulysses=1, ring=2` (short: `u1r2`) + +## Benchmark Setup + +- Model: `Wan2.1-T2V-1.3B-Diffusers` +- Prompt: `"a cat is playing piano"` +- Framework command: `sglang generate` +- Runtime: `torch_npu==2.10.0` + +## Generate Commands + +### Baseline (`u1r1`) + +```bash +sglang generate --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \ + --prompt "a cat is playing piano" --num-gpus 1 --ring-degree 1 \ + --save-output +``` + +### Ring-SP (`u1r2`) + +```bash +sglang generate --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \ + --prompt "a cat is playing piano" --num-gpus 2 --ring-degree 2 \ + --save-output +``` + +## Benchmarks + +Benchmark Disclaimer + +These numbers are from one fixed setup and one prompt case. 
Actual performance may vary by model settings, environment, and workload. + +### Stage Time Breakdown + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Stage / Metric | u1r2 (s) | u1r1 baseline (s) | Speedup |
| --- | --- | --- | --- |
| InputValidation | 0.0003 | 0.0002 | 0.67x |
| TextEncoding | 3.5936 | 3.5820 | 1.00x |
| LatentPreparation | 0.0007 | 0.0055 | 7.86x |
| TimestepPreparation | 0.0008 | 0.0007 | 0.88x |
| Denoising | 121.2788 | 239.2580 | 1.97x |
| Decoding | 13.8685 | 16.4969 | 1.19x |
| Total (Pixel data generated) | 141.86 | 266.50 | 1.88x |
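
The Speedup column is consistent with dividing the `u1r1` baseline time by the `u1r2` Ring-SP time for the same stage. A minimal sketch of that arithmetic follows; the function name `speedup` is illustrative only, not part of SGLang, and the inputs are copied from the table rows.

```shell
# Illustrative helper: speedup = baseline seconds / Ring-SP seconds.
speedup() {
    # $1: u1r1 baseline time (s), $2: u1r2 Ring-SP time (s)
    awk -v base="$1" -v ring="$2" 'BEGIN { printf "%.2fx\n", base / ring }'
}

speedup 239.2580 121.2788   # Denoising: prints 1.97x
speedup 16.4969 13.8685     # Decoding:  prints 1.19x
speedup 266.50 141.86       # Total:     prints 1.88x
```

A value below `1.00x`, as in the InputValidation row, means the Ring-SP run was slower than the baseline for that stage; for sub-millisecond stages the difference is negligible in the total.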
+
+## Summary
+
+- With `torch_npu==2.10.0`, Ring-SP (`u1r2`) runs successfully on NPU for this case.
+- End-to-end generation time improves from `266.50s` to `141.86s` (`1.88x`).
+- The main gain comes from `DenoisingStage` (`1.97x`), while decoding also improves (`1.19x`).
diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_features.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_features.mdx
new file mode 100644
index 000000000000..884d8f1dcbb8
--- /dev/null
+++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_features.mdx
@@ -0,0 +1,2804 @@
+---
+title: "Support Features on Ascend NPU"
+metatags:
+  description: "Documentation for Support Features on Ascend NPU"
+---
+This section describes the basic functions and features supported by the Ascend NPU. If you encounter issues or have any
+questions, please [open an issue](https://github.com/sgl-project/sglang/issues).
+
+If you want to know the meaning and usage of each parameter,
+click [Server Arguments](../../advanced_features/server_arguments).
+
+## Model and tokenizer
+
| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| `--model-path`<br/>`--model` | `None` | Type: str | A2, A3 |
| `--tokenizer-path` | `None` | Type: str | A2, A3 |
| `--tokenizer-mode` | `auto` | auto, slow | A2, A3 |
| `--tokenizer-worker-num` | `1` | Type: int | A2, A3 |
| `--skip-tokenizer-init` | `False` | bool flag (set to enable) | A2, A3 |
| `--load-format` | `auto` | auto, safetensors, gguf | A2, A3 |
| `--model-loader-extra-config` | `{}` | Type: str | A2, A3 |
| `--trust-remote-code` | `False` | bool flag (set to enable) | A2, A3 |
| `--context-length` | `None` | Type: int | A2, A3 |
| `--is-embedding` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-multimodal` | `None` | bool flag (set to enable) | A2, A3 |
| `--revision` | `None` | Type: str | A2, A3 |
| `--model-impl` | `auto` | auto, sglang, transformers | A2, A3 |
+ +## HTTP server + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| `--host` | `127.0.0.1` | Type: str | A2, A3 |
| `--port` | `30000` | Type: int | A2, A3 |
| `--skip-server-warmup` | `False` | bool flag (set to enable) | A2, A3 |
| `--warmups` | `None` | Type: str | A2, A3 |
| `--nccl-port` | `None` | Type: int | A2, A3 |
| `--fastapi-root-path` | `None` | Type: str | A2, A3 |
| `--grpc-mode` | `False` | False | Planned |
+ +## SSL/TLS + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| `--ssl-keyfile` | `None` | Type: str | A2, A3 |
| `--ssl-certfile` | `None` | Type: str | A2, A3 |
| `--ssl-keyfile-password` | `None` | Type: str | A2, A3 |
| `--enable-ssl-refresh` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-http2` | `False` | bool flag (set to enable) | A2, A3 |
+ +## Quantization and data type + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| `--dtype` | `auto` | auto, float16, bfloat16 | A2, A3 |
| `--quantization` | `None` | `modelslim` | A2, A3 |
| `--quantization-param-path` | `None` | Type: str | Special For GPU |
| `--kv-cache-dtype` | `auto` | `auto` | A2, A3 |
| `--enable-fp32-lm-head` | `False` | bool flag (set to enable) | A2, A3 |
| `--modelopt-quant` | `None` | Type: str | Special For GPU |
| `--modelopt-checkpoint-restore-path` | `None` | Type: str | Special For GPU |
| `--modelopt-checkpoint-save-path` | `None` | Type: str | Special For GPU |
| `--modelopt-export-path` | `None` | Type: str | Special For GPU |
| `--quantize-and-serve` | `False` | bool flag (set to enable) | Special For GPU |
| `--rl-quant-profile` | `None` | Type: str | Special For GPU |
+
+## Memory and scheduling
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--mem-fraction-static` | `None` | Type: float | A2, A3 |
+| `--max-running-requests` | `None` | Type: int | A2, A3 |
+| `--prefill-max-requests` | `None` | Type: int | A2, A3 |
+| `--max-queued-requests` | `None` | Type: int | A2, A3 |
+| `--max-total-tokens` | `None` | Type: int | A2, A3 |
+| `--chunked-prefill-size` | `None` | Type: int | A2, A3 |
+| `--max-prefill-tokens` | `16384` | Type: int | A2, A3 |
+| `--schedule-policy` | `fcfs` | lpm, fcfs | A2, A3 |
+| `--enable-priority-scheduling` | `False` | bool flag (set to enable) | A2, A3 |
+| `--disable-priority-preemption` | `False` | bool flag (set to enable) | A2, A3 |
+| `--default-priority-value` | `None` | Type: int | A2, A3 |
+| `--schedule-low-priority-values-first` | `False` | bool flag (set to enable) | A2, A3 |
+| `--priority-scheduling-preemption-threshold` | `10` | Type: int | A2, A3 |
+| `--schedule-conservativeness` | `1.0` | Type: float | A2, A3 |
+| `--page-size` | `128` | Type: int | A2, A3 |
+| `--swa-full-tokens-ratio` | `0.8` | Type: float | Planned |
+| `--disable-hybrid-swa-memory` | `False` | bool flag (set to enable) | Planned |
+| `--radix-eviction-policy` | `lru` | lru, lfu | A2, A3 |
+| `--enable-prefill-delayer` | `False` | bool flag (set to enable) | A2, A3 |
+| `--prefill-delayer-max-delay-passes` | `30` | Type: int | A2, A3 |
+| `--prefill-delayer-token-usage-low-watermark` | `None` | Type: float | A2, A3 |
+| `--prefill-delayer-forward-passes-buckets` | `None` | List[float] | A2, A3 |
+| `--prefill-delayer-wait-seconds-buckets` | `None` | List[float] | A2, A3 |
+| `--abort-on-priority-when-disabled` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-dynamic-chunking` | `False` | bool flag (set to enable) | Experimental |
+
+## Runtime options
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--device` | `None` | Type: str | A2, A3 |
+| `--tensor-parallel-size`, `--tp-size` | `1` | Type: int | A2, A3 |
+| `--pipeline-parallel-size`, `--pp-size` | `1` | Type: int; Currently 2 not supported | Experimental |
+| `--attention-context-parallel-size`, `--attn-cp-size` | `1` | Type: int; must be equal to `--tp-size` | A2, A3 |
+| `--moe-data-parallel-size`, `--moe-dp-size` | `1` | Type: int | Planned |
+| `--pp-max-micro-batch-size` | `None` | Type: int | Experimental |
+| `--pp-async-batch-depth` | `None` | Type: int | Experimental |
+| `--stream-interval` | `1` | Type: int | A2, A3 |
+| `--incremental-streaming-output` | `False` | bool flag (set to enable) | A2, A3 |
+| `--stream-response-default-include-usage` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-streaming-session` | `False` | bool flag (set to enable) | A2, A3 |
+| `--random-seed` | `None` | Type: int | A2, A3 |
+| `--constrained-json-whitespace-pattern` | `None` | Type: str | A2, A3 |
+| `--constrained-json-disable-any-whitespace` | `False` | bool flag (set to enable) | A2, A3 |
+| `--watchdog-timeout` | `300` | Type: float | A2, A3 |
+| `--soft-watchdog-timeout` | `300` | Type: float | A2, A3 |
+| `--dist-timeout` | `None` | Type: int | A2, A3 |
+| `--download-dir` | `None` | Type: str | A2, A3 |
+| `--model-checksum` | `None` | Type: str | Planned |
+| `--base-gpu-id` | `0` | Type: int | A2, A3 |
+| `--gpu-id-step` | `1` | Type: int | A2, A3 |
+| `--sleep-on-idle` | `False` | bool flag (set to enable) | A2, A3 |
+| `--use-ray` | `False` | bool flag (set to enable) | A2, A3 |
+| `--custom-sigquit-handler` | `None` | Only for engine | A2, A3 |
+
+## Logging
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--log-level` | `info` | Type: str | A2, A3 |
+| `--log-level-http` | `None` | Type: str | A2, A3 |
+| `--log-requests` | `False` | bool flag (set to enable) | A2, A3 |
+| `--log-requests-level` | `2` | 0, 1, 2, 3 | A2, A3 |
+| `--log-requests-format` | `text` | text, json | A2, A3 |
+| `--crash-dump-folder` | `None` | Type: str | A2, A3 |
+| `--enable-metrics` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-mfu-metrics` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-metrics-for-all-schedulers` | `False` | bool flag (set to enable) | A2, A3 |
+| `--tokenizer-metrics-custom-labels-header` | `x-custom-labels` | Type: str | A2, A3 |
+| `--tokenizer-metrics-allowed-custom-labels` | `None` | List[str] | A2, A3 |
+| `--extra-metric-labels` | `None` | Type: JSON/Dict | A2, A3 |
+| `--bucket-time-to-first-token` | `None` | List[float] | A2, A3 |
+| `--bucket-inter-token-latency` | `None` | List[float] | A2, A3 |
+| `--bucket-e2e-request-latency` | `None` | List[float] | A2, A3 |
+| `--collect-tokens-histogram` | `False` | bool flag (set to enable) | A2, A3 |
+| `--prompt-tokens-buckets` | `None` | List[str] | A2, A3 |
+| `--generation-tokens-buckets` | `None` | List[str] | A2, A3 |
+| `--gc-warning-threshold-secs` | `0.0` | Type: float | A2, A3 |
+| `--decode-log-interval` | `40` | Type: int | A2, A3 |
+| `--enable-request-time-stats-logging` | `False` | bool flag (set to enable) | A2, A3 |
+| `--kv-events-config` | `None` | Type: str | Special for GPU |
+| `--enable-trace` | `False` | bool flag (set to enable) | A2, A3 |
+| `--oltp-traces-endpoint` | `localhost:4317` | Type: str | A2, A3 |
+| `--log-requests-target` | `None` | Type: str | A2, A3 |
+| `--uvicorn-access-log-exclude-prefixes` | `[]` | List[str] | A2, A3 |
+
+## RequestMetricsExporter configuration
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--export-metrics-to-file` | `False` | bool flag (set to enable) | A2, A3 |
+| `--export-metrics-to-file-dir` | `None` | Type: str | A2, A3 |
+
+## API related
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--api-key` | `None` | Type: str | A2, A3 |
+| `--admin-api-key` | `None` | Type: str | A2, A3 |
+| `--served-model-name` | `None` | Type: str | A2, A3 |
+| `--weight-version` | `default` | Type: str | A2, A3 |
+| `--chat-template` | `None` | Type: str | A2, A3 |
+| `--hf-chat-template-name` | `None` | Type: str | A2, A3 |
+| `--completion-template` | `None` | Type: str | A2, A3 |
+| `--file-storage-path` | `sglang_storage` | Type: str | Unused reserved parameter |
+| `--enable-cache-report` | `False` | bool flag (set to enable) | A2, A3 |
+| `--reasoning-parser` | `None` | deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3 | A2, A3 |
+| `--tool-call-parser` | `None` | llama3, pythonic, qwen, qwen3_coder | A2, A3 |
+| `--sampling-defaults` | `model` | openai, model | A2, A3 |
+
+## Data parallelism
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--data-parallel-size`, `--dp-size` | `1` | Type: int | A2, A3 |
+| `--load-balance-method` | `auto` | auto, round_robin, follow_bootstrap_room, total_requests, total_tokens | A2, A3 |
+
+## Multi-node distributed serving
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--dist-init-addr`, `--nccl-init-addr` | `None` | Type: str | A2, A3 |
+| `--nnodes` | `1` | Type: int | A2, A3 |
+| `--node-rank` | `0` | Type: int | A2, A3 |
+
+## Model override args
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--json-model-override-args` | `{}` | Type: str | A2, A3 |
+| `--preferred-sampling-params` | `None` | Type: str | A2, A3 |
+
+## LoRA
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--enable-lora` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-lora-overlap-loading` | `False` | bool flag (set to enable) | A2, A3 |
+| `--max-lora-rank` | `None` | Type: int | A2, A3 |
+| `--lora-target-modules` | `None` | all | A2, A3 |
+| `--lora-paths` | `None` | Type: List[str] / JSON objects | A2, A3 |
+| `--max-loras-per-batch` | `8` | Type: int | A2, A3 |
+| `--max-loaded-loras` | `None` | Type: int | A2, A3 |
+| `--lora-eviction-policy` | `lru` | lru, fifo | A2, A3 |
+| `--lora-backend` | `csgmv` | triton, csgmv, ascend, torch_native | A2, A3 |
+| `--experts-shared-outer-loras` | `None` | Type: bool | A2, A3 |
+| `--lora-use-virtual-experts` | `False` | bool flag (set to enable) | A2, A3 |
+| `--lora-strict-loading` | `False` | Type: bool | A2, A3 |
+| `--max-lora-chunk-size` | `16` | 16, 32, 64, 128 | Special for GPU |
+
+## Kernel Backends (Attention, Sampling, Grammar, GEMM)
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--attention-backend` | `None` | `ascend` | A2, A3 |
+| `--prefill-attention-backend` | `None` | `ascend` | A2, A3 |
+| `--decode-attention-backend` | `None` | `ascend` | A2, A3 |
+| `--sampling-backend` | `None` | pytorch, ascend | A2, A3 |
+| `--grammar-backend` | `None` | `xgrammar` | A2, A3 |
+| `--mm-attention-backend` | `None` | `ascend_attn` | A2, A3 |
+| `--nsa-prefill-backend` | `flashmla_sparse` | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | Special for GPU |
+| `--nsa-decode-backend` | `fa3` | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | Special for GPU |
+| `--fp8-gemm-backend` | `auto` | auto, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_deepgemm, cutlass, triton, aiter | Special for GPU |
+| `--disable-flashinfer-autotune` | `False` | bool flag (set to enable) | Special for GPU |
+
+## Speculative decoding
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--speculative-algorithm` | `None` | EAGLE3, NEXTN | A2, A3 |
+| `--speculative-draft-model-path`, `--speculative-draft-model` | `None` | Type: str | A2, A3 |
+| `--speculative-draft-model-revision` | `None` | Type: str, branch name, tag name, commit id | A2, A3 |
+| `--speculative-draft-load-format` | `auto` | auto, dummy | A2, A3 |
+| `--speculative-num-steps` | `None` | Type: int | A2, A3 |
+| `--speculative-eagle-topk` | `None` | Type: int | A2, A3 |
+| `--speculative-num-draft-tokens` | `None` | Type: int | A2, A3 |
+| `--speculative-accept-threshold-single` | `1.0` | Type: float | Special for GPU |
+| `--speculative-accept-threshold-acc` | `1.0` | Type: float | Special for GPU |
+| `--speculative-token-map` | `None` | Type: str | A2, A3 |
+| `--speculative-attention-mode` | `prefill` | prefill, decode | A2, A3 |
+| `--speculative-moe-runner-backend` | `None` | `auto` | A2, A3 |
+| `--speculative-moe-a2a-backend` | `None` | `ascend_fuseep` | A2, A3 |
+| `--speculative-draft-attention-backend` | `None` | `ascend` | A2, A3 |
+| `--speculative-draft-model-quantization` | `None` | `unquant` | A2, A3 |
+
+## Ngram speculative decoding
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--speculative-ngram-min-match-window-size` | `1` | Type: int | Experimental |
+| `--speculative-ngram-max-match-window-size` | `12` | Type: int | Experimental |
+| `--speculative-ngram-min-bfs-breadth` | `1` | Type: int | Experimental |
+| `--speculative-ngram-max-bfs-breadth` | `10` | Type: int | Experimental |
+| `--speculative-ngram-match-type` | `BFS` | BFS, PROB | Experimental. BFS uses recency-based expansion; PROB uses frequency-based expansion. |
+| `--speculative-ngram-max-trie-depth` | `18` | Type: int | Experimental |
+| `--speculative-ngram-capacity` | `10000000` | Type: int | Experimental |
+| `--speculative-ngram-external-corpus-path` | `None` | Type: str | Experimental |
+| `--speculative-ngram-external-sam-budget` | `0` | Type: int | Experimental |
+| `--speculative-ngram-external-corpus-max-tokens` | `10000000` | Type: int | Experimental |
+
+## Expert parallelism
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--expert-parallel-size`, `--ep-size`, `--ep` | `1` | Type: int | A2, A3 |
+| `--moe-a2a-backend` | `none` | none, deepep, ascend_fuseep (it is incompatible with eplb) | A2, A3 |
+| `--moe-runner-backend` | `auto` | auto, triton | A2, A3 |
+| `--flashinfer-mxfp4-moe-precision` | `default` | default, bf16 | Special for GPU |
+| `--enable-flashinfer-allreduce-fusion` | `False` | bool flag (set to enable) | Special for GPU |
+| `--deepep-mode` | `auto` | normal, low_latency, auto | A2, A3 |
+| `--deepep-config` | `None` | Type: str | Special for GPU |
+| `--ep-num-redundant-experts` | `0` | Type: int | A2, A3 |
+| `--ep-dispatch-algorithm` | `None` | static, dynamic, fake | A2, A3 |
+| `--init-expert-location` | `trivial` | trivial, `<path.pt>`, `<path.json>`, `<json_string>` | A2, A3 |
+| `--enable-eplb` | `False` | bool flag (set to enable) | A2, A3 |
+| `--eplb-algorithm` | `deepseek` | auto, deepseek | A2, A3 |
+| `--eplb-rebalance-num-iterations` | `1000` | Type: int | A2, A3 |
+| `--eplb-rebalance-layers-per-chunk` | `None` | Type: int | A2, A3 |
+| `--eplb-min-rebalancing-utilization-threshold` | `1.0` | Type: float | A2, A3 |
+| `--expert-distribution-recorder-mode` | `None` | stat, stat_approx, per_pass, per_token | A2, A3 |
+| `--expert-distribution-recorder-buffer-size` | `None` | Type: int | A2, A3 |
+| `--enable-expert-distribution-metrics` | `False` | bool flag (set to enable) | A2, A3 |
+| `--moe-dense-tp-size` | `None` | 1 | A2, A3 |
+| `--elastic-ep-backend` | `None` | none, mooncake | Special for GPU |
+| `--mooncake-ib-device` | `None` | Type: str | Special for GPU |
+
+## Mamba Cache
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--max-mamba-cache-size` | `None` | Type: int | A2, A3 |
+| `--mamba-ssm-dtype` | `float32` | float32, bfloat16, float16 | A2, A3 |
+| `--mamba-full-memory-ratio` | `0.9` | Type: float | A2, A3 |
+| `--mamba-scheduler-strategy` | `auto` | auto, no_buffer, extra_buffer | A2, A3 |
+| `--mamba-track-interval` | `256` | Type: int | A2, A3 |
+
+## Hierarchical cache
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--enable-hierarchical-cache` | `False` | bool flag (set to enable). Currently, mamba cache is not supported. | A2, A3 |
+| `--hicache-ratio` | `2.0` | Type: float | A2, A3 |
+| `--hicache-size` | `0` | Type: int | A2, A3 |
+| `--hicache-write-policy` | `write_through` | Currently only write_back supported | A2, A3 |
+| `--hicache-io-backend` | `kernel` | kernel_ascend, direct | A2, A3 |
+| `--hicache-mem-layout` | `layer_first` | page_first_direct, page_first_kv_split | A2, A3 |
+| `--hicache-storage-backend` | `None` | file | A2, A3 |
+| `--hicache-storage-prefetch-policy` | `best_effort` | best_effort, wait_complete, timeout | Special for GPU |
+| `--hicache-storage-backend-extra-config` | `None` | Type: str | Special for GPU |
+
+## LMCache
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--enable-lmcache` | `False` | bool flag (set to enable) | Special for GPU |
+
+## Diffusion LLM
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--dllm-algorithm` | `None` | Type: str | A2, A3 |
+| `--dllm-algorithm-config` | `None` | Type: str | A2, A3 |
+
+## Offloading (must be used with `--disable-cuda-graph`)
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--cpu-offload-gb` | `0` | Type: int | A2, A3 |
+| `--offload-group-size` | `-1` | Type: int (DeepSeek only) | A2, A3 |
+| `--offload-num-in-group` | `1` | Type: int (DeepSeek only) | A2, A3 |
+| `--offload-prefetch-step` | `1` | Type: int (DeepSeek only) | A2, A3 |
+| `--offload-mode` | `cpu` | cpu (DeepSeek only), meta (DeepSeek only), sharded_gpu (DeepSeek only) | A2, A3 |
+
+## Args for multi-item scoring
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--multi-item-scoring-delimiter` | `None` | Type: int | A2, A3 |
+
+## Optimization/debug options
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--disable-radix-cache` | `False` | bool flag (set to enable) | A2, A3 |
+| `--cuda-graph-max-bs` | `None` | Type: int | A2, A3 |
+| `--cuda-graph-bs` | `None` | List[int] | A2, A3 |
+| `--disable-cuda-graph` | `False` | bool flag (set to enable) | A2, A3 |
+| `--disable-cuda-graph-padding` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-profile-cuda-graph` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-cudagraph-gc` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-nccl-nvls` | `False` | bool flag (set to enable) | Special for GPU |
+| `--enable-symm-mem` | `False` | bool flag (set to enable) | Special for GPU |
+| `--disable-flashinfer-cutlass-moe-fp4-allgather` | `False` | bool flag (set to enable) | Special for GPU |
+| `--enable-tokenizer-batch-encode` | `False` | bool flag (set to enable) | A2, A3 |
+| `--disable-tokenizer-batch-decode` | `False` | bool flag (set to enable) | A2, A3 |
+| `--disable-custom-all-reduce` | `False` | bool flag (set to enable) | Special for GPU |
+| `--enable-mscclpp` | `False` | bool flag (set to enable) | Special for GPU |
+| `--enable-torch-symm-mem` | `False` | bool flag (set to enable) | Special for GPU |
+| `--disable-overlap-schedule` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-mixed-chunk` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-dp-attention` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-dp-attention-local-control-broadcast` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-dp-lm-head` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-two-batch-overlap` | `False` | bool flag (set to enable) | Planned |
+| `--enable-single-batch-overlap` | `False` | bool flag (set to enable) | A2, A3 |
+| `--tbo-token-distribution-threshold` | `0.48` | Type: float | Planned |
+| `--enable-torch-compile` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-torch-compile-debug-mode` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enforce-piecewise-cuda-graph` | `False` | bool flag (set to enable); Currently, Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct models are supported. | A2, A3 |
+| `--piecewise-cuda-graph-tokens` | `None` | Type: JSON list | A2, A3 |
+| `--piecewise-cuda-graph-compiler` | `eager` | eager | A2, A3 |
+| `--torch-compile-max-bs` | `32` | Type: int | A2, A3 |
+| `--piecewise-cuda-graph-max-tokens` | `None` | Type: int | A2, A3 |
+| `--torchao-config` | `""` | Type: str | Special for GPU |
+| `--enable-nan-detection` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-p2p-check` | `False` | bool flag (set to enable) | Special for GPU |
+| `--triton-attention-reduce-in-fp32` | `False` | bool flag (set to enable) | Special for GPU |
+| `--triton-attention-num-kv-splits` | `8` | Type: int | Special for GPU |
+| `--triton-attention-split-tile-size` | `None` | Type: int | Special for GPU |
+| `--delete-ckpt-after-loading` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-memory-saver` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-weights-cpu-backup` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-draft-weights-cpu-backup` | `False` | bool flag (set to enable) | A2, A3 |
+| `--allow-auto-truncate` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-custom-logit-processor` | `False` | bool flag (set to enable) | A2, A3 |
+| `--flashinfer-mla-disable-ragged` | `False` | bool flag (set to enable) | Special for GPU |
+| `--disable-shared-experts-fusion` | `True` | bool flag (set to enable) | A2, A3 |
+| `--enforce-shared-experts-fusion` | `False` | bool flag (set to enable) | A2, A3 |
+| `--disable-chunked-prefix-cache` | `True` | bool flag (set to enable) | A2, A3 |
+| `--disable-fast-image-processor` | `False` | bool flag (set to enable) | A2, A3 |
+| `--keep-mm-feature-on-device` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-return-hidden-states` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-return-routed-experts` | `False` | bool flag (set to enable) | A2, A3 |
+| `--scheduler-recv-interval` | `1` | Type: int | A2, A3 |
+| `--numa-node` | `None` | List[int] | A2, A3 |
+| `--enable-deterministic-inference` | `False` | bool flag (set to enable) | Planned |
+| `--rl-on-policy-target` | `None` | `fsdp` | Planned |
+| `--enable-layerwise-nvtx-marker` | `False` | bool flag (set to enable) | Special for GPU |
+| `--enable-attn-tp-input-scattered` | `False` | bool flag (set to enable) | Experimental |
+| `--enable-nsa-prefill-context-parallel` | `False` | bool flag (set to enable) | A2, A3 |
+| `--enable-prefill-context-parallel` | `False` | bool flag (set to enable) | A2, A3 |
+| `--prefill-cp-mode` | `in-seq-split` | Type: str | A2, A3 |
+| `--enable-fused-qk-norm-rope` | `False` | bool flag (set to enable) | Special for GPU |
+| `--enable-precise-embedding-interpolation` | `False` | bool flag (set to enable) | A2, A3 |
+| `--gc-threshold` | `None` | List[int] | A2, A3 |
+
+## Dynamic batch tokenizer
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--enable-dynamic-batch-tokenizer` | `False` | bool flag (set to enable) | A2, A3 |
+| `--dynamic-batch-tokenizer-batch-size` | `32` | Type: int | A2, A3 |
+| `--dynamic-batch-tokenizer-batch-timeout` | `0.002` | Type: float | A2, A3 |
+
+## Debug tensor dumps
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--debug-tensor-dump-output-folder` | `None` | Type: str | A2, A3 |
+| `--debug-tensor-dump-layers` | `None` | List[int] | A2, A3 |
+| `--debug-tensor-dump-input-file` | `None` | Type: str | A2, A3 |
+
+## PD disaggregation
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--disaggregation-mode` | `null` | null, prefill, decode | A2, A3 |
+| `--disaggregation-transfer-backend` | `mooncake` | `ascend` | A2, A3 |
+| `--disaggregation-bootstrap-port` | `8998` | Type: int | A2, A3 |
+| `--disaggregation-ib-device` | `None` | Type: str | Special for GPU |
+| `--disaggregation-decode-enable-offload-kvcache` | `False` | False | A2, A3 |
+| `--num-reserved-decode-tokens` | `512` | Type: int | A2, A3 |
+| `--disaggregation-decode-polling-interval` | `1` | Type: int | A2, A3 |
+
+## Encode prefill disaggregation
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--enable-adaptive-dispatch-to-encoder` | `False` | bool flag (set to enable adaptive dispatch) | A2, A3 |
+| `--encoder-only` | `False` | bool flag (set to launch an encoder-only server) | A2, A3 |
+| `--language-only` | `False` | bool flag (set to load weights for the language model only) | A2, A3 |
+| `--encoder-transfer-backend` | `zmq_to_scheduler` | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3 |
+| `--encoder-urls` | `[]` | List[str] (list of encoder server URLs) | A2, A3 |
+
+## Custom weight loader
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--custom-weight-loader` | `None` | List[str] | A2, A3 |
+| `--weight-loader-disable-mmap` | `False` | bool flag (set to enable) | A2, A3 |
+| `--weight-loader-prefetch-checkpoints` | `False` | bool flag (set to enable) | A2, A3 |
+| `--weight-loader-prefetch-num-threads` | `4` | Type: int | A2, A3 |
+| `--remote-instance-weight-loader-seed-instance-ip` | `None` | Type: str | A2, A3 |
+| `--remote-instance-weight-loader-seed-instance-service-port` | `None` | Type: int | A2, A3 |
+| `--remote-instance-weight-loader-send-weights-group-ports` | `None` | Type: JSON list | A2, A3 |
+| `--remote-instance-weight-loader-backend` | `nccl` | transfer_engine, nccl | A2, A3 |
+| `--remote-instance-weight-loader-start-seed-via-transfer-engine` | `False` | bool flag (set to enable) | Special for GPU |
+
+## For PD-Multiplexing
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--enable-pdmux` | `False` | bool flag (set to enable) | Special for GPU |
+| `--pdmux-config-path` | `None` | Type: str | Special for GPU |
+| `--sm-group-num` | `8` | Type: int | Special for GPU |
+
+## For Multi-Modal
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--enable-broadcast-mm-inputs-process` | `False` | bool flag (set to enable) | A2, A3 |
+| `--mm-process-config` | `None` | Type: JSON / Dict | A2, A3 |
+| `--mm-enable-dp-encoder` | `False` | bool flag (set to enable) | A2, A3 |
+| `--limit-mm-data-per-request` | `None` | Type: JSON / Dict | A2, A3 |
+
+## For checkpoint decryption
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--decrypted-config-file` | `None` | Type: str | A2, A3 |
+| `--decrypted-draft-config-file` | `None` | Type: str | A2, A3 |
+| `--enable-prefix-mm-cache` | `False` | bool flag (set to enable) | A2, A3 |
+
+## Forward hooks
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--forward-hooks` | `None` | Type: JSON list | A2, A3 |
+
+## Configuration file support
+
+| Argument | Defaults | Options | Server supported |
+| --- | --- | --- | --- |
+| `--config` | `None` | Type: str | A2, A3 |
+
+## Other Params
+
+The following parameters are not supported because the third-party components they depend on, such as KTransformers
+and checkpoint-engine, are not compatible with the NPU.
+
+| Argument | Defaults | Options |
+| --- | --- | --- |
+| `--checkpoint-engine-wait-weights-before-ready` | `False` | bool flag (set to enable) |
+| `--kt-weight-path` | `None` | Type: str |
+| `--kt-method` | `AMXINT4` | Type: str |
+| `--kt-cpuinfer` | `None` | Type: int |
+| `--kt-threadpool-count` | `2` | Type: int |
+| `--kt-num-gpu-experts` | `None` | Type: int |
+| `--kt-max-deferred-experts-per-token` | `None` | Type: int |
+
+The following parameters have some functional deficiencies in the community version:
+
+| Argument | Defaults | Options |
+| --- | --- | --- |
+| `--tool-server` | `None` | Type: str |
diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_models.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_models.mdx
new file mode 100644
index 000000000000..d7ef03691547
--- /dev/null
+++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_models.mdx
@@ -0,0 +1,692 @@
+---
+title: "Support Models on Ascend NPU"
+metatags:
+  description: "Documentation for Support Models on Ascend NPU"
+---
+This section describes the models supported on the Ascend NPU, including Large Language Models, Multimodal Language
+Models, Embedding Models, Reward Models and Rerank Models. Mainstream DeepSeek/Qwen/GLM series are included.
+You are welcome to enable various models based on your business requirements.
+
+## Large Language Models
+
+| Models | Model Family | A2 Supported | A3 Supported |
+| --- | --- | --- | --- |
+| DeepSeek V3/V3.1 | DeepSeek | | |
+| DeepSeek-V3.2-W8A8 | DeepSeek | | |
+| DeepSeek-R1-0528-W8A8 | DeepSeek | | |
+| DeepSeek-V2-Lite-W8A8 | DeepSeek | | |
+| Qwen/Qwen3.5-397B-A17B | Qwen | | |
+| Qwen/Qwen3-30B-A3B-Instruct-2507 | Qwen | | |
+| Qwen/Qwen3-32B | Qwen | | |
+| Qwen/Qwen3-0.6B | Qwen | | |
+| Qwen3-235B-A22B-W8A8 | Qwen | | |
+| Qwen/Qwen3-Next-80B-A3B-Instruct | Qwen | | |
+| Qwen3-Coder-480B-A35B-Instruct-w8a8-QuaRot | Qwen | | |
+| Qwen/Qwen2.5-7B-Instruct | Qwen | | |
+| QWQ-32B-W8A8 | Qwen | | |
+| meta-llama/Llama-4-Scout-17B-16E-Instruct | Llama | | |
+| AI-ModelScope/Llama-3.1-8B-Instruct | Llama | | |
+| LLM-Research/llama-2-7b | Llama | | |
+| LLM-Research/Llama-3.2-1B-Instruct | Llama | | |
+| mistralai/Mistral-7B-Instruct-v0.2 | Mistral | | |
+| google/gemma-3-4b-it | Gemma | | |
+| microsoft/Phi-4-multimodal-instruct | Phi | | |
+| allenai/OLMoE-1B-7B-0924 | OLMoE | | |
+| stabilityai/stablelm-2-1_6b | StableLM | | |
+| CohereForAI/c4ai-command-r-v01 | Command-R | | |
+| huihui-ai/grok-2 | Grok | | |
+| ZhipuAI/chatglm2-6b | ChatGLM | | |
+| Shanghai_AI_Laboratory/internlm2-7b | InternLM 2 | | |
+| LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct | ExaONE 3 | | |
+| xverse/XVERSE-MoE-A36B | XVERSE | | |
+| HuggingFaceTB/SmolLM-1.7B | SmolLM | | |
+| ZhipuAI/glm-4-9b-chat | GLM-4 | | |
+| XiaomiMiMo/MiMo-7B-RL | MiMo | | |
+| arcee-ai/AFM-4.5B-Base | Arcee AFM-4.5B | | |
+| Howeee/persimmon-8b-chat | Persimmon | | |
+| inclusionAI/Ling-lite | Ling | | |
+| ibm-granite/granite-3.1-8b-instruct | Granite | | |
+| ibm-granite/granite-3.0-3b-a800m-instruct | Granite MoE | | |
+| AI-ModelScope/dbrx-instruct | DBRX (Databricks) | | |
+| baichuan-inc/Baichuan2-13B-Chat | Baichuan 2 (7B, 13B) | | |
+| baidu/ERNIE-4.5-21B-A3B-PT | ERNIE-4.5 (4.5, 4.5MoE series) | | |
+| OpenBMB/MiniCPM3-4B | MiniCPM (v3, 4B) | | |
+| moonshotai/Kimi-K2-Thinking | Kimi | | |
+| moonshotai/Kimi-Linear-48B-A3B-Instruct | Kimi Linear (48B-A3B) | | |
+| eigen-ai-labs/gpt-oss-120b-bf16 | GPTOSS | | |
+| allenai/OLMo-2-1124-7B-Instruct | OLMo | | |
+| cyankiwi/MiniMax-M2-BF16 | MiniMax-M2 | | |
+| upstage/SOLAR-10.7B-Instruct-v1.0 | Solar | | |
+| FLM/Tele-FLM | Tele FLM (52B-1T) | | |
+| bigcode/starcoder2-7b | StarCoder2 | | |
+| arcee-ai/Trinity-Mini | Trinity (Nano, Mini) | | |
+| OrionStarAI/Orion-14B-Base | Orion (14B) | | |
+| EleutherAI/gpt-j-6b | GPT-J (6B) | | |
+
+## Multimodal Language Models
+
+| Models | Model Family (Variants) | A2 Supported | A3 Supported |
+| --- | --- | --- | --- |
+| Qwen/Qwen2.5-VL-3B-Instruct | Qwen-VL | | |
+| Qwen/Qwen2.5-VL-72B-Instruct | Qwen-VL | | |
+| Qwen/Qwen3-VL-30B-A3B-Instruct | Qwen-VL | | |
+| Qwen/Qwen3-VL-8B-Instruct | Qwen-VL | | |
+| Qwen/Qwen3-VL-4B-Instruct | Qwen-VL | | |
+| Qwen/Qwen3-VL-235B-A22B-Instruct | Qwen-VL | | |
+| deepseek-ai/deepseek-vl2 | DeepSeek-VL2 | | |
+| deepseek-ai/Janus-Pro-1B | Janus-Pro (1B, 7B) | | |
+| deepseek-ai/Janus-Pro-7B | Janus-Pro (1B, 7B) | | |
+| openbmb/MiniCPM-V-2_6 | MiniCPM-V / MiniCPM-o | | |
+| openbmb/MiniCPM-o-2_6 | MiniCPM-V / MiniCPM-o | | |
+| google/gemma-3-4b-it | Gemma 3 (Multimodal) | | |
+| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.1-24B | | |
+| microsoft/Phi-4-multimodal-instruct | Phi-4-multimodal-instruct | | |
+| XiaomiMiMo/MiMo-VL-7B-RL | MiMo-VL (7B) | | |
+| AI-ModelScope/llava-v1.6-34b | LLaVA (v1.5 & v1.6) | | |
+| lmms-lab/llava-next-72b | LLaVA-NeXT (8B, 72B) | | |
+| lmms-lab/llava-onevision-qwen2-7b-ov | LLaVA-OneVision | | |
+| moonshotai/Kimi-VL-A3B-Instruct | Kimi-VL (A3B) | | |
+| ZhipuAI/GLM-4.5V | GLM-4.5V (106B) | | |
+| LLM-Research/Llama-3.2-11B-Vision-Instruct | Llama 3.2 Vision (11B) | | |
+| rednote-hilab/dots.ocr | DotsVLM-OCR | | |
+| PaddlePaddle/ERNIE-4.5-VL-28B-A3B-PT | Ernie4.5-VL | | |
+| Qwen/Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni | | |
+| stepfun-ai/Step3-VL-10B | Step3-VL (10B) | | |
+
+## Diffusion language models
+
+| Models | Model Family | A2 Supported | A3 Supported |
+| --- | --- | --- | --- |
+| inclusionAI/LLaDA2.0-flash | LLaDA2.0 (mini, flash) | | |
+| JetLM/SDAR-8B-Chat | SDAR (JetLM) | | |
+| JetLM/SDAR-30B-A3B-Chat | SDAR (JetLM) | | |
+
+## Embedding Models
+
+| Models | Model Family | A2 Supported | A3 Supported |
+| --- | --- | --- | --- |
+| intfloat/e5-mistral-7b-instruct | E5 (Llama/Mistral based) | | |
+| iic/gte_Qwen2-1.5B-instruct | GTE-Qwen2 | | |
+| Qwen/Qwen3-Embedding-8B | Qwen3-Embedding | | |
+| Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | GME (Multimodal) | | |
+| AI-ModelScope/clip-vit-large-patch14-336 | CLIP | | |
+| BAAI/bge-large-en-v1.5 | BGE | | |
+
+## Reward Models
+
+| Models | Model Family | A2 Supported | A3 Supported |
+| --- | --- | --- | --- |
+| Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Llama3.1 Reward | | |
+| Shanghai_AI_Laboratory/internlm2-7b-reward | InternLM 2 Reward | | |
+| Qwen/Qwen2.5-Math-RM-72B | Qwen2.5 Reward - Math | | |
+| Howeee/Qwen2.5-1.5B-apeach | Qwen2.5 Reward - Sequence | | |
+| AI-ModelScope/Skywork-Reward-Gemma-2-27B-v0.2 | Gemma 2-27B Reward | | |
+
+## Rerank Models
+
+| Models | Model Family | A2 Supported | A3 Supported |
+| --- | --- | --- | --- |
+| BAAI/bge-reranker-v2-m3 | BGE-Reranker | | |
+| Qwen/Qwen3-Reranker-8B | Qwen3-Reranker (decoder-only yes/no) | | |
+| Qwen/Qwen3-VL-Reranker-2B | Qwen3-VL-Reranker (multimodal yes/no) | | |
diff --git a/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_new_models.mdx b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_new_models.mdx
new file mode 100644
index 000000000000..ac00fba81793
--- /dev/null
+++ b/docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_support_new_models.mdx
@@ -0,0 +1,551 @@
+---
+title: "How to Support New Models"
+description: "This document explains how to add support for new language models and multimodal large language models (MLLMs) in SGLang. It also covers how to test new models and register external implementations."
+---
+This document explains how to add support for new language models and multimodal large language models (MLLMs) in
+SGLang. It also covers how to test new models and register external implementations.
+
+## How to Support a New Language Model
+
+To support a new model in SGLang, you only need to add a single file under
+the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn
+from existing model implementations and create a new file for your model. For most models, you should be able to find a
+similar model to start with (e.g., starting from Llama). Also refer to how
+to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang).
+
+NPU adaptations are embedded in existing model files (e.g., `llama.py`, `qwen3_vl.py`) through `_is_npu` conditional
+branches. The NPU hardware backend lives at `sglang/srt/hardware_backend/npu/`. Some ops may need to use `torch_npu`
+APIs in place of CUDA equivalents.
+
+## How to Support a New Multimodal Large Language Model
+
+To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the
+standard LLM support:
+
+1. **Register your new model as multimodal**:
+   Extend `is_multimodal_model`
+   in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561)
+   to return `True` for your model.
+
+2. **Register a new chat-template**:
+   This is needed only when your default chat template cannot accept images as input. Register a new chat template in [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/parser/conversation.py), along with the corresponding matching function.
+
+3. **Multimodal Data Processor**:
+   Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your
+   model’s dedicated processor.
+   See [multimodal_processor.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors)
+   for more details.
+
+4. **Handle Multimodal Tokens**:
+   Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be
+   expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data
+   with `RadixAttention`.
+
+5. **Handle Image Feature Extraction**:
+   Implement a `get_image_feature` function for your new model, which extracts image features from raw image data and converts them into the embeddings used by the language model.
+
+6. **Adapt to Vision Attention**:
+   Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.
+
+You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or
+other MLLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
+
+On Ascend NPU, ensure vision processors and image feature extraction are compatible with the `torch_npu` backend.
+Refer to `vit_npu_graph_runner.py` under `hardware_backend/npu/graph_runner/` and `qwen_vl_processor.py` under
+`hardware_backend/npu/modules/` for NPU vision adaptation patterns.
+
+## Testing and Debugging
+
+Please record all your testing and benchmarking results in the PR description.
+
+### Interactive Debugging
+
+For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands
+should give the same text output and very similar prefill logits:
+
+- Get the reference output:
+  ```bash Command
+  python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm}
+  ```
+- Get the SGLang output:
+  ```bash Command
+  python3 -m sglang.bench_one_batch --correct --model [new model]
+  ```
+
+### Add the Model to the Test Suite
+
+To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in
+the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/registered/models/test_generation_models.py)
+file. Then test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU,
+MMMU, MMMU-Pro, etc.) in your PR.
+For VLMs, also include a test in `test_vision_openai_server_{x}.py` (e.g. [test_vision_openai_server_a.py](https://github.com/sgl-project/sglang/blob/main/test/registered/vlm/test_vision_openai_server_a.py)).
+
+The following example command tests a new model on your local machine:
+
+```bash Run Test
+ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
+```
+
+### Benchmark
+
+- **(Required) MMMU**: follow the MMMU benchmark [README.md](https://github.com/sgl-project/sglang/blob/main/benchmark/mmmu/README.md) to get an SGLang vs. HF Transformers accuracy comparison. The accuracy score from the SGLang run should not be much lower than that from the HF Transformers run.
Similarly, follow https://docs.sglang.io/developer_guide/benchmark_and_profiling.html to get performance comparison: TTFT and throughput must meet or exceed baselines (e.g., HF Transformer). +- **(Optional) Other evals**: If you ran other evals, please note the results in PR description. + + +For NPU-adapted models: add the corresponding test under `test/registered/ascend/` and verify correctness on Ascend NPU +hardware; run benchmarks on the NPU device and report performance metrics (TTFT, throughput), comparing against SGLang +GPU results as the primary baseline. Fall back to HF Transformer comparison when no GPU adaptation is available. + + +## Port a Model from vLLM to SGLang + +The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable +resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models +from vLLM to SGLang. + +To port a model from vLLM to SGLang: + +- Compare these two files for guidance: + - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) + - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) +- The major differences include: + - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). + - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** + - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** + - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. + - **Remove `Sample`.** + - **Change the `forward()` functions** and add a `forward_batch()` method. + - **Add `EntryClass`** at the end. + - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. 
+ - **For Ascend NPU**: Reference existing NPU-adapted models (e.g., `llama.py`, `deepseek_v2.py`) for NPU-specific + patterns, such as replacing CUDA kernels with `torch_npu` equivalents. The NPU backend is at + `sglang/srt/hardware_backend/npu/`. + +Note: make sure you add your new model to the supported models list in the supported models documentation. + +## Registering an External Model Implementation + +In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. +This allows you to integrate your model without modifying the source code. + +For example: + +```python Register Model +from sglang.srt.models.registry import ModelRegistry +from sglang.srt.entrypoints.http_server import launch_server + +# For a single model, add it to the registry: +ModelRegistry.models[model_name] = model_class + +# For multiple models, you can imitate the import_model_classes() function: +from functools import lru_cache + +@lru_cache() +def import_new_model_classes(): + model_arch_name_to_cls = {} + # Populate model_arch_name_to_cls with your new model classes. + ... + return model_arch_name_to_cls + +ModelRegistry.models.update(import_new_model_classes()) + +# Launch the server with your server arguments: +launch_server(server_args) +``` + +## Example: Implementing and Serving a Llama Wrapper Model + +Below is an introductory, step-by-step walkthrough on how to implement a new model end-to-end in SGLang and then run it via the [Offline Engine](../basic_usage/offline_engine_api). + +### Implementing Our Model + +To keep things simple, this new model will be a simple wrapper around [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), and our goal will be just to bias the output logits for each `forward` call by taking the square root of each individual logit. + +Let's start by defining our model in a file called `llama_wrapper.py`. 
+The first step is to import the necessary libraries from SRT, which is SGLang's internal backend. + +```python Example +# In the file `llama_wrapper.py` + +import torch +from transformers import LlamaConfig +from typing import Optional +from sglang.srt.layers.logits_processor import LogitsProcessorOutput +from sglang.srt.layers.quantization.base_config import QuantizationConfig +from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors + +from sglang.srt.models.llama import LlamaForCausalLM +``` + +Next, we declare a new `class` for our model and have it inherit from `LlamaForCausalLM`, which allows our model to access `LlamaForCausalLM`'s predefined modules and layers, such as `LlamaAttention` and `LlamaMLP`. +Note that almost all model implementations take in `config` and `quant_config` as arguments for their `__init__` method; `config` and `quant_config` are passed in via [`model_loader/loader.py`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_loader/loader.py#L219). +Because we have inherited from `LlamaForCausalLM`, we can pass our parameters directly to its constructor, which will set the member variables for us. + +```python Class Definition +class LlamaWrapper(LlamaForCausalLM): + def __init__( + self, + config: LlamaConfig, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + ) -> None: + super().__init__(config=config, quant_config=quant_config, prefix=prefix) +``` + +Now, we want to define the `forward` method, which is what will be called at inference time. +Note that the signature for `forward` is essentially the same for any model; you can take a look at the other models defined in the [`models` directory](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/) for references. 
+To see where exactly `forward` is called in the SGLang runtime's internals, take a look at [`forward_decode`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1705) and [`forward_extend`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1724) in the [`ModelRunner` class](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py). + +```python Forward Method Signature + @torch.no_grad() + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + forward_batch: ForwardBatch, + pp_proxy_tensors: Optional[PPProxyTensors] = None, + input_embeds: Optional[torch.Tensor] = None, + get_embedding: bool = False, + ) -> LogitsProcessorOutput: +``` + +We now call the `__call__` method for `self.model` (which is a member variable that `LlamaForCausalLM` defines in its `__init__` method), which eventually calls `LlamaForCausalLM`'s `forward` method. +After that, we feed the `hidden_states` into our model's `LogitsProcessor` (again defined in `LlamaForCausalLM`). + +```python Call Model and LogitsProcessor + hidden_states = self.model( + input_ids, + positions, + forward_batch, + input_embeds, + pp_proxy_tensors=pp_proxy_tensors, + ) + + res: LogitsProcessorOutput = self.logits_processor( + input_ids, + hidden_states, + self.lm_head, + forward_batch, + ) +``` + +After receiving the logits for the next token, we can finally perform our biasing step. + +```python Logit Biasing + orig_logits = res.next_token_logits + res.next_token_logits = torch.where( + orig_logits > 0, + orig_logits.sqrt(), + orig_logits + ) + + return res +``` + +Now, our `LlamaWrapper` model is created and ready to be served! 
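
To make the biasing rule concrete, the following is a minimal, framework-free sketch of what the `torch.where` call above computes for each logit. The helper `bias_logits` is hypothetical and exists only for illustration; it is not part of SGLang, and the real model applies the rule to a tensor rather than a Python list.

```python
import math


def bias_logits(logits):
    """Hypothetical helper mirroring the walkthrough's rule:
    take the square root of positive logits, leave the rest unchanged."""
    return [math.sqrt(x) if x > 0 else x for x in logits]


# Positive values are square-rooted; zero and negative values pass through.
print(bias_logits([4.0, 0.25, 0.0, -3.0]))  # [2.0, 0.5, 0.0, -3.0]
```

In the tensor version, `orig_logits.sqrt()` is evaluated for every element, so negative entries produce NaN in that intermediate result; `torch.where` then discards those entries by selecting the original logit wherever the condition `orig_logits > 0` is false.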
+ +### Serving Our Model Via SGLang's Offline Engine + +The next step of this walkthrough involves hosting our new model offline, so that it can be served locally and without an HTTP server. + +First, create a new file called `run.py`. +Now, we must ensure that SGLang's `ModelRegistry` can find our model. +To do this, we first download the model's configuration and weights from Huggingface. + +```python Example +# In the file `run.py` + +import asyncio +from functools import lru_cache +from huggingface_hub import snapshot_download +from llama_wrapper import LlamaWrapper # Make sure to import our new model! +import sglang as sgl +from sglang.srt.models.registry import ModelRegistry + +# Make sure to request access to this model on Huggingface, then export your +# `HF_TOKEN` to download the model snapshot +llama_dir = snapshot_download( + repo_id="meta-llama/Llama-3.1-8B-Instruct", + local_dir="./llama_ckpt", +) +``` + +Now that we have our model on disk, we want to point it to `LlamaWrapper` by changing the `architectures` field in `./llama_ckpt/config.json` to be `LlamaWrapper`. +That way, when we pass in the path of our model checkpoint to SGLang, it will know that we want to use "LlamaWrapper" instead of "LlamaForCausalLM" as our model. + +```python Example +{ + "architectures": [ + # "LlamaForCausalLM" + "LlamaWrapper" + ], + ... +} +``` + +However, if we don't link our `LlamaWrapper` class to the "LlamaWrapper" registry keyword, then SGLang won't be able to find our model. +Thus, to register our `LlamaWrapper`, we want to follow the steps in the above section titled "Registering an External Model Implementation". + +```python Register LlamaWrapper +@lru_cache() +def import_new_model_classes(): + model_arch_name_to_cls = {"LlamaWrapper": LlamaWrapper} + return model_arch_name_to_cls + +ModelRegistry.models.update(import_new_model_classes()) +``` + +Lastly, when we create our `Engine`, we just pass in the path to the local model directory. 
+Then, our `LlamaWrapper` is ready to be served; for this walkthrough, we will use SGLang `Engine`'s non-streaming asynchronous generation endpoint. + +```python Example +def main(): + llm = sgl.Engine(model_path="./llama_ckpt") + sampling_params = {"temperature": 0.2, "top_k": 5} + prompts = [ + "Write a short, neutral self-introduction for a fictional character. Hello, my name is", + "Provide a concise factual statement about France’s capital city. The capital of France is", + "Explain possible future trends in artificial intelligence. The future of AI is", + ] + + asyncio.run(run_llm(llm, sampling_params, prompts)) + + llm.shutdown() + +async def run_llm( + llm, + sampling_params, + prompts, +) -> None: + outputs = await llm.async_generate(prompts, sampling_params) + + for prompt, output in zip(prompts, outputs): + print(f"\nPrompt: {prompt}") + print(f"Generated text: {output['text']}") + +if __name__ == "__main__": + main() +``` + +Now, when we call `python run.py`, we will get the outputs of our newly created model! + +## Serving External Models via the Standard CLI + +The previous sections show how to register a model programmatically via `ModelRegistry` and serve it through the Offline Engine. Similar to vLLM model plugin, there is an alternative that lets you keep using the standard `python -m sglang.launch_server` CLI without modifying any SGLang source code: you can register your model using the `SGLANG_EXTERNAL_MODEL_PACKAGE` environment variable. + + +On Ascend NPU, `--device` and `--attention-backend` are auto-detected and can be omitted from the launch command. +SGLang sets the device to `npu` and attention backend to `ascend` automatically when `torch.npu.is_available()`. + + +### The `EntryClass` Variable + +When SGLang scans a model package, it looks for the variable `EntryClass` at the module level of your Python file. 
The [model registry](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/registry.py) imports your file, checks for `EntryClass`, and registers the class assigned to it. If you are using a model based on HuggingFace, the name of this class needs to match the `"architectures"` field in your model's `config.json`. + +For example, if you are implementing a Llama wrapper, add this line at the end of your model file: + +```python Example +# This is what "Add EntryClass at the end" means +EntryClass = LlamaWrapper +``` + +### Example: Text-Only Model + +Using the same Llama wrapper from the previous section, here is how to package and serve it via the CLI. + +1. Create your project + +``` +sglang_custom_project/ +|----setup.py +|----custom_llm/ + |----__init__.py + |----llama_wrapper.py +``` + +Write the `setup.py`: + +```python Example +# sglang_custom_project/setup.py + +from setuptools import setup, find_packages +setup( + name="sglang-custom-plugins", + version="0.1", + packages=find_packages(), +) +``` + +2. 
Write your model code + +Inside `llama_wrapper.py`, write your model and include `EntryClass`: + +```python Example +# sglang_custom_project/custom_llm/llama_wrapper.py + +import torch +from typing import Optional +from sglang.srt.layers.logits_processor import LogitsProcessorOutput +from sglang.srt.layers.quantization.base_config import QuantizationConfig +from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors +from sglang.srt.models.llama import LlamaForCausalLM + +class LlamaWrapper(LlamaForCausalLM): + def __init__(self, config, quant_config: Optional[QuantizationConfig] = None, + prefix: str = "") -> None: + super().__init__(config=config, quant_config=quant_config, prefix=prefix) + @torch.no_grad() + def forward(self, input_ids, positions, forward_batch, + pp_proxy_tensors=None, input_embeds=None, get_embedding=False): + hidden_states = self.model( + input_ids, positions, forward_batch, input_embeds, + pp_proxy_tensors=pp_proxy_tensors, + ) + res: LogitsProcessorOutput = self.logits_processor( + input_ids, hidden_states, self.lm_head, forward_batch, + ) + + orig = res.next_token_logits + res.next_token_logits = torch.where(orig > 0, orig.sqrt(), orig) + return res + +# Don't forget to add EntryClass +EntryClass = LlamaWrapper +``` + +3. Install your package + +Run this inside your `sglang_custom_project` directory to install your code into the active Python environment: + +```bash Command +pip install -e . +``` + +4. Update your `config.json` + +Update the `config.json` under your HuggingFace model checkpoint directory so the `architectures` field matches your class name: + +```json Config +{ + "architectures": ["LlamaWrapper"], + ... +} +``` + +5. 
Launch the server + +Set the environment variable before running the CLI: + +```bash Command +export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_llm +python -m sglang.launch_server \ + --model-path /path/to/Llama-3.1-8B-Instruct \ + --port 8000 +``` + +The `SGLANG_EXTERNAL_MODEL_PACKAGE` should be the parent folder name containing your model-related code. In this example, it should be `custom_llm`. + +### Example: Multimodal Model + +If you are working with multimodal models, setting `SGLANG_EXTERNAL_MODEL_PACKAGE` alone is not enough. SGLang also needs to recognize your architecture as multimodal to enable the image/video processing pipelines, and it needs a custom processor. + +You can handle this by setting two additional environment variables: + +- `SGLANG_EXTERNAL_MM_MODEL_ARCH`: Adds your architecture name to SGLang's internal list of multimodal models. +- `SGLANG_EXTERNAL_MM_PROCESSOR_PACKAGE`: Tells SGLang where to find your custom processor class. + +For example, let's build a custom model based on Qwen2-VL-Instruct that takes the square root of the logits. 
+ +Create the project: + +``` +sglang_custom_project_vl/ +|----setup.py +|----custom_vlm/ + |----__init__.py + |----qwenvl_wrapper.py +``` + +Write `setup.py`: + +```python Example +# sglang_custom_project_vl/setup.py + +from setuptools import setup, find_packages +setup( + name="sglang-custom-plugins-vl", + version="0.1", + packages=find_packages(), +) +``` + +Write the model in `qwenvl_wrapper.py`: + +```python Example +# sglang_custom_project_vl/custom_vlm/qwenvl_wrapper.py +import torch +from sglang.srt.models.qwen2_vl import Qwen2VLForConditionalGeneration +from sglang.srt.multimodal.processors.qwen_vl import QwenVLImageProcessor + +class CustomQwen2VL(Qwen2VLForConditionalGeneration): + def forward(self, input_ids, positions, forward_batch, + input_embeds=None, get_embedding=False): + res = super().forward( + input_ids, positions, forward_batch, + input_embeds=input_embeds, get_embedding=get_embedding + ) + if not get_embedding: + orig = res.next_token_logits + res.next_token_logits = torch.where(orig > 0, orig.sqrt(), orig) + return res + +class CustomQwen2VLProcessor(QwenVLImageProcessor): + models = [CustomQwen2VL] + + def __init__(self, hf_config, server_args, _processor, *args, **kwargs): + super().__init__(hf_config, server_args, _processor, *args, **kwargs) + +EntryClass = CustomQwen2VL +``` + +**Note:** you don't need a separate `EntryClass` for the custom processor as long as you associate the processor with the specific model class. + +Install the package, update `config.json`, and launch: + +```bash Command +pip install -e . +``` + +```json Config +{ + "architectures": ["CustomQwen2VL"], + ... 
+} +``` + +```bash Command +export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_vlm +export SGLANG_EXTERNAL_MM_MODEL_ARCH=CustomQwen2VL +export SGLANG_EXTERNAL_MM_PROCESSOR_PACKAGE=custom_vlm + +python -m sglang.launch_server \ + --model-path /path/to/Qwen2-VL-2B-Instruct \ + --port 8000 \ + --enable-multimodal +``` + +## Documentation + +Add to table of supported models in [generative_models.md](./generative_models) or [multimodal_language_models.md](./multimodal_language_models) + + +For NPU-adapted models, also add entries to the NPU support models table in +[ascend_npu_support_models.mdx](./ascend_npu_support_models). + + +--- + +By following these guidelines, you can add support for new language models and multimodal large language models in +SGLang and ensure they are thoroughly tested and easily integrated into the system. diff --git a/docs_new/docs/hardware-platforms/ascend-npus/mindspore_backend.mdx b/docs_new/docs/hardware-platforms/ascend-npus/mindspore_backend.mdx new file mode 100644 index 000000000000..32adcbb7d70c --- /dev/null +++ b/docs_new/docs/hardware-platforms/ascend-npus/mindspore_backend.mdx @@ -0,0 +1,169 @@ +## Introduction + +MindSpore is a high-performance AI framework optimized for Ascend NPUs. This doc guides users to run MindSpore models in SGLang. + +## Requirements + +MindSpore currently only supports Ascend NPU devices. Users need to first install Ascend CANN software packages. +The CANN software packages can be downloaded from the [Ascend Official Website](https://www.hiascend.com). The recommended version is 8.3.RC2. + +## Supported Models + +Currently, the following models are supported: + +- **Qwen3**: Dense and MoE models +- **DeepSeek V3/R1** +- *More models coming soon...* + +## Installation + + +Currently, MindSpore models are provided by an independent package `sgl-mindspore`. Support for MindSpore is built upon current SGLang support for Ascend NPU platform. 
Please first [install SGLang for Ascend NPU](./ascend_npu) and then install `sgl-mindspore`: + + + +```shell Install +git clone https://github.com/mindspore-lab/sgl-mindspore.git +cd sgl-mindspore +pip install -e . +``` + + + +## Run Model + +Current SGLang-MindSpore supports Qwen3 and DeepSeek V3/R1 models. This doc uses Qwen3-8B as an example. + +### Offline infer + +Use the following script for offline infer: + + +```python Offline Inference +import sglang as sgl + +# Initialize the engine with MindSpore backend +llm = sgl.Engine( + model_path="/path/to/your/model", # Local model path + device="npu", # Use NPU device + model_impl="mindspore", # MindSpore implementation + attention_backend="ascend", # Attention backend + tp_size=1, # Tensor parallelism size + dp_size=1 # Data parallelism size +) + +# Generate text +prompts = [ + "Hello, my name is", + "The capital of France is", + "The future of AI is" +] + +sampling_params = {"temperature": 0, "top_p": 0.9} +outputs = llm.generate(prompts, sampling_params) + +for prompt, output in zip(prompts, outputs): + print(f"Prompt: {prompt}") + print(f"Generated: {output['text']}") + print("---") +``` + + +### Start server + +Launch a server with MindSpore backend: + + +```bash Launch Server +# Basic server startup +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ + --model-impl mindspore \ + --attention-backend ascend \ + --tp-size 1 \ + --dp-size 1 +``` + + +For distributed server with multiple nodes: + + +```bash Multi-node Distributed +# Multi-node distributed server +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ + --model-impl mindspore \ + --attention-backend ascend \ + --dist-init-addr 127.0.0.1:29500 \ + --nnodes 2 \ + --node-rank 0 \ + --tp-size 4 \ + --dp-size 2 +``` + + +## Troubleshooting + +#### Debug Mode + +Enable sglang debug logging by log-level argument. 
+ + +```bash Debug Mode +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ + --model-impl mindspore \ + --attention-backend ascend \ + --log-level DEBUG +``` + + +Enable MindSpore INFO or DEBUG logging by setting the `GLOG_v` environment variable to one of the following values. + + +```bash Set Log Level +export GLOG_v=1 # INFO +export GLOG_v=0 # DEBUG +``` + + +#### Explicitly select devices + +Use the following environment variable to explicitly select the devices to use. + + +```shell Select Devices +export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7 # select the devices to use +``` + + +#### Some communication environment issues + +In environments with a special communication setup, users may need to set the following environment variable. + + +```shell Disable LCCL +export MS_ENABLE_LCCL=off # LCCL communication mode is currently not supported in SGLang-MindSpore +``` + + +#### Some dependencies of protobuf + +In environments with a special protobuf version, users may need to set the following environment variable to avoid a binary version mismatch. + + +```shell Fix Protobuf +export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python # to avoid protobuf binary version mismatch +``` + + +## Support +For MindSpore-specific issues: + +- Refer to the [MindSpore documentation](https://www.mindspore.cn/) diff --git a/docs_new/docs/hardware-platforms/cpu_server.mdx b/docs_new/docs/hardware-platforms/cpu_server.mdx new file mode 100644 index 000000000000..743766656ef3 --- /dev/null +++ b/docs_new/docs/hardware-platforms/cpu_server.mdx @@ -0,0 +1,387 @@ +--- +title: "CPU Servers" +--- +This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers. +SGLang is enabled and optimized on CPUs equipped with Intel® AMX® instructions, +that is, 4th generation or newer Intel® Xeon® Scalable Processors.
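
Before installing, it can help to confirm that the host actually exposes AMX. The sketch below is one possible check on Linux, assuming the kernel reports the feature flags `amx_tile`, `amx_bf16`, and `amx_int8` in `/proc/cpuinfo`; the helper name `amx_flags_in` is hypothetical and not part of SGLang.

```python
from pathlib import Path

# Assumption: these are the AMX feature flag names shown in /proc/cpuinfo.
AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}


def amx_flags_in(cpuinfo_text: str) -> set[str]:
    """Return the AMX flags listed on the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return AMX_FLAGS & set(value.split())
    return set()


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    found = amx_flags_in(path.read_text()) if path.exists() else set()
    print("AMX flags found:", sorted(found) or "none")
```

If no flags are reported, the CPU may predate AMX or the kernel may be too old to expose them; this check is only a quick heuristic and does not replace the platform requirements stated above.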
+ +## Optimized Model List + +A list of popular LLMs are optimized and run efficiently on CPU, +including the most notable open-source models like Llama series, Qwen series, +and DeepSeek series like DeepSeek-R1 and DeepSeek-V3.1-Terminus. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model Name | BF16 | W8A8_INT8 | FP8 |
|---|---|---|---|
| DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | deepseek-ai/DeepSeek-R1 |
| DeepSeek-V3.1-Terminus | | IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 | deepseek-ai/DeepSeek-V3.1-Terminus |
| Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | |
| Llama-3.1-8B | meta-llama/Llama-3.1-8B-Instruct | RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 | |
| QwQ-32B | | RedHatAI/QwQ-32B-quantized.w8a8 | |
| DeepSeek-Distilled-Llama | | RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | |
| Qwen3-235B | | | Qwen/Qwen3-235B-A22B-FP8 |
+ +**Note:** The model identifiers listed in the table above +have been verified on 6th Gen Intel® Xeon® P-core platforms. + +## Installation + +### Install Using Docker + +It is recommended to use Docker for setting up the SGLang environment. +A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile) is provided to facilitate the installation. +Replace `` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens). + +```bash Command +# Clone the SGLang repository +git clone https://github.com/sgl-project/sglang.git +cd sglang/docker + +# Build the docker image +docker build -t sglang-cpu:latest -f xeon.Dockerfile . + +# Initiate a docker container +docker run \ + -it \ + --privileged \ + --ipc=host \ + --network=host \ + -v /dev/shm:/dev/shm \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + -p 30000:30000 \ + -e "HF_TOKEN=" \ + sglang-cpu:latest /bin/bash +``` + +### Install From Source + +If you prefer to install SGLang in a bare metal environment, +the setup process is as follows: + +Please install the required packages and libraries beforehand if +they are not already present on your system. +You can refer to the Ubuntu-based installation commands in +[the Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile#L11) +for guidance. + +1. Install `uv` package manager, then create and activate a virtual environment: + +```bash Command +# Taking '/opt' as the example uv env folder, feel free to change it as needed +cd /opt +curl -LsSf https://astral.sh/uv/install.sh | sh +source $HOME/.local/bin/env +uv venv --python 3.12 +source .venv/bin/activate +``` + +2. Create a config file to direct the installation channel + (a.k.a. 
index-url) of `torch` related packages: + +```bash Command +vim .venv/uv.toml +``` + +Press 'a' to enter insert mode of `vim`, paste the following content into the created file + +```file +[[index]] +name = "torch" +url = "https://download.pytorch.org/whl/cpu" + +[[index]] +name = "torchvision" +url = "https://download.pytorch.org/whl/cpu" + +[[index]] +name = "torchaudio" +url = "https://download.pytorch.org/whl/cpu" + +[[index]] +name = "triton" +url = "https://download.pytorch.org/whl/cpu" + +``` + +Save the file (in `vim`, press 'esc' to exit insert mode, then ':x+Enter'), +and set it as the default `uv` config. + +```bash Command +export UV_CONFIG_FILE=/opt/.venv/uv.toml +``` + +3. Clone the `sglang` source code and build the packages + +```bash Command +# Clone the SGLang code +git clone https://github.com/sgl-project/sglang.git +cd sglang +git checkout + +# Use dedicated toml file +cd python +cp pyproject_cpu.toml pyproject.toml +# Install SGLang dependent libs, and build SGLang main package +uv pip install --upgrade pip setuptools +uv pip install . + +# Build the CPU backend kernels +cd ../sgl-kernel +cp pyproject_cpu.toml pyproject.toml +uv pip install . +``` + +4. Set the required environment variables + +```bash Command +export SGLANG_USE_CPU_ENGINE=1 + +# Set 'LD_LIBRARY_PATH' and 'LD_PRELOAD' to ensure the libs can be loaded by sglang processes +export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu +export LD_PRELOAD=${LD_PRELOAD}:/opt/.venv/lib/libiomp5.so:${LD_LIBRARY_PATH}/libtcmalloc.so.4:${LD_LIBRARY_PATH}/libtbbmalloc.so.2 +``` + +Notes: + +- Note that the environment variable `SGLANG_USE_CPU_ENGINE=1` + is required to enable the SGLang service with the CPU engine. + +- If you encounter code compilation issues during the `sgl-kernel` building process, + please check your `gcc` and `g++` versions and upgrade them if they are outdated. + It is recommended to use `gcc-13` and `g++-13` as they have been verified + in the official Docker container. 
+ +- The system library path is typically located in one of the following directories: + `~/.local/lib/`, `/usr/local/lib/`, `/usr/local/lib64/`, `/usr/lib/`, `/usr/lib64/` + and `/usr/lib/x86_64-linux-gnu/`. In the above example commands, `/usr/lib/x86_64-linux-gnu` + is used. Please adjust the path according to your server configuration. + +- It is recommended to add the following to your `~/.bashrc` file to + avoid setting these variables every time you open a new terminal: + + ```bash Command + source .venv/bin/activate + export SGLANG_USE_CPU_ENGINE=1 + export LD_LIBRARY_PATH= + export LD_PRELOAD= + ``` + +## Launch of the Serving Engine + +Example command to launch SGLang serving: + +```bash Launch Server +sglang serve \ + --model-path \ + --trust-remote-code \ + --disable-overlap-schedule \ + --device cpu \ + --host 0.0.0.0 \ + --tp 6 +``` + +Notes: + +1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`. + +2. The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6). + The number of TP specified is how many TP ranks will be used during the execution. + On a CPU platform, a TP rank means a sub-NUMA cluster (SNC). + Usually we can get the SNC information (How many available) from the Operating System with e.g. `lscpu` command. + + If the specified TP rank number differs from the total SNC count, + the system will automatically utilize the first `n` SNCs. + Note that `n` cannot exceed the total SNC number, doing so will result in an error. + + `SGLANG_CPU_OMP_THREADS_BIND` allows explicit control of CPU cores for each tensor parallel (TP) rank. 
 + + **Example 1**: To run the SGLang service with TP=6, using the first 40 cores of each SNC on a Xeon® 6980P server, + which has 43-43-42 cores on the 3 SNCs of a socket, set: + + ```bash Command + export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253" + ``` + This configuration is equivalent to: + - rank 0: `numactl -C 0-39 -m 0` + - rank 1: `numactl -C 43-82 -m 1` + - rank 2: `numactl -C 86-125 -m 2` + - rank 3: `numactl -C 128-167 -m 3` + - rank 4: `numactl -C 171-210 -m 4` + - rank 5: `numactl -C 214-253 -m 5` + + + **Example 2**: To run the SGLang service with TP=2, using 96 cores across 3 SNCs on a Xeon® 6972P server, + which has 32-32-32 cores on the 3 SNCs in a socket, set: + ```bash Command + export SGLANG_CPU_OMP_THREADS_BIND="0-95|96-191" + ``` + This configuration is equivalent to: + - rank 0: `numactl -C 0-95 -m 0-2` + - rank 1: `numactl -C 96-191 -m 3-5` + + Please be aware that with `SGLANG_CPU_OMP_THREADS_BIND` set, + the amount of memory available to each rank may not be determined in advance. + You may need to set a suitable `--max-total-tokens` to avoid out-of-memory errors. + +3. For optimizing decoding with torch.compile, please add the flag `--enable-torch-compile`. + To specify the maximum batch size when using `torch.compile`, set the flag `--torch-compile-max-bs`. + For example, `--enable-torch-compile --torch-compile-max-bs 4` means using `torch.compile` + and setting the maximum batch size to 4. + +4. A warmup step is automatically triggered when the service is started. + The server is ready when you see the log `The server is fired up and ready to roll!`. + +## Benchmarking with Requests + +You can benchmark the performance via the `bench_serving` script. +Run the command in another terminal.
An example command would be: + +```bash Run Benchmark +python -m sglang.bench_serving \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1 \ + --request-rate inf \ + --random-range-ratio 1.0 +``` + +Detailed parameter descriptions are available via the command: + +```bash Benchmark Help +python -m sglang.bench_serving -h +``` + +Additionally, requests can be formatted using +[the OpenAI Completions API](../basic_usage/openai_api_completions) +and sent via the command line (e.g., using `curl`) or through your own scripts. + +## Example Usage Commands + +Large Language Models can range from fewer than 1 billion to several hundred billion parameters. +Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors +with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer, +or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common +4th generation or newer Intel® Xeon® processors, or utilize a single socket of the flagship 6th Gen Intel® Xeon® processors. 
+ +### Example: Running DeepSeek-V3.1-Terminus + +An example command to launch service of W8A8_INT8 DeepSeek-V3.1-Terminus on a Xeon® 6980P server: + +```bash W8A8_INT8 +sglang serve \ + --model-path IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 \ + --trust-remote-code \ + --disable-overlap-schedule \ + --device cpu \ + --quantization w8a8_int8 \ + --enable-torch-compile \ + --torch-compile-max-bs 4 \ + --host 0.0.0.0 \ + --tp 6 +``` + +Similarly, an example command to launch service of FP8 DeepSeek-V3.1-Terminus would be: + +```bash FP8 +sglang serve \ + --model-path deepseek-ai/DeepSeek-V3.1-Terminus \ + --trust-remote-code \ + --disable-overlap-schedule \ + --device cpu \ + --enable-torch-compile \ + --torch-compile-max-bs 4 \ + --host 0.0.0.0 \ + --tp 6 +``` + +Note: Please set `--torch-compile-max-bs` to the maximum desired batch size for your deployment. +The value `4` in the examples is illustrative. + +### Example: Running Llama-3.2-3B + +An example command to launch service of Llama-3.2-3B with BF16 precision: + +```bash BF16 +sglang serve \ + --model-path meta-llama/Llama-3.2-3B-Instruct \ + --trust-remote-code \ + --disable-overlap-schedule \ + --device cpu \ + --enable-torch-compile \ + --torch-compile-max-bs 16 \ + --host 0.0.0.0 \ + --tp 3 +``` + +The example command to launch service of W8A8_INT8 version of Llama-3.2-3B: + +```bash W8A8_INT8 +sglang serve \ + --model-path RedHatAI/Llama-3.2-3B-quantized.w8a8 \ + --trust-remote-code \ + --disable-overlap-schedule \ + --device cpu \ + --quantization w8a8_int8 \ + --enable-torch-compile \ + --torch-compile-max-bs 16 \ + --host 0.0.0.0 \ + --tp 3 +``` + +Note: The `--torch-compile-max-bs` and `--tp` settings are examples that should be adjusted for your setup. +For instance, use `--tp 3` to utilize 1 socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server. 
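
For the `SGLANG_CPU_OMP_THREADS_BIND` setting described in the launch notes, the per-rank core ranges can be derived mechanically from the SNC core counts. The helper below is a minimal sketch, not part of SGLang: the function name `build_omp_threads_bind` is hypothetical, and it assumes one TP rank per SNC with core IDs numbered contiguously SNC by SNC (check the actual numbering with `lscpu` before relying on it).

```python
from typing import List


def build_omp_threads_bind(snc_sizes: List[int], cores_per_rank: int) -> str:
    """Build a SGLANG_CPU_OMP_THREADS_BIND value with one rank per SNC.

    snc_sizes lists the core count of every SNC in core-ID order; each rank
    is given the first `cores_per_rank` cores of its SNC.
    Assumption: core IDs are contiguous and ordered SNC by SNC.
    """
    ranges = []
    first_core = 0
    for size in snc_sizes:
        if cores_per_rank > size:
            raise ValueError(
                f"SNC has only {size} cores, requested {cores_per_rank}"
            )
        ranges.append(f"{first_core}-{first_core + cores_per_rank - 1}")
        first_core += size  # the next SNC starts after all cores of this one
    return "|".join(ranges)


# Xeon 6980P from example 1: 43-43-42 cores per socket, two sockets, TP=6.
bind = build_omp_threads_bind([43, 43, 42, 43, 43, 42], cores_per_rank=40)
print(f'export SGLANG_CPU_OMP_THREADS_BIND="{bind}"')
```

With these inputs the helper reproduces the string from example 1. A layout where one rank spans several SNCs, as in example 2, would need a different grouping rule.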
 + +Once the server has been launched, you can test it using the `bench_serving` command or create +your own commands or scripts following [the benchmarking example](#benchmarking-with-requests). diff --git a/docs_new/docs/hardware-platforms/mthreads_gpu.mdx b/docs_new/docs/hardware-platforms/mthreads_gpu.mdx new file mode 100644 index 000000000000..a1df3bd05cb4 --- /dev/null +++ b/docs_new/docs/hardware-platforms/mthreads_gpu.mdx @@ -0,0 +1,29 @@ +--- +title: "Moore Threads GPUs" +metatags: + description: "Run SGLang on Moore Threads GPUs." +--- + +This document describes how to run SGLang on Moore Threads GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues). + +## Install SGLang + +You can install SGLang using one of the methods below. + +### Install from Source + +```bash +# Use the default branch +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Compile sgl-kernel +pip install --upgrade pip +cd sgl-kernel +python setup_musa.py install + +# Install sglang python package +cd .. +rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml +pip install -e "python[all_musa]" +``` diff --git a/docs_new/docs/hardware-platforms/nvidia-gpus.mdx b/docs_new/docs/hardware-platforms/nvidia-gpus.mdx new file mode 100644 index 000000000000..979bed1207ab --- /dev/null +++ b/docs_new/docs/hardware-platforms/nvidia-gpus.mdx @@ -0,0 +1,5 @@ +--- +title: NVIDIA GPUs +--- + +Please refer to the [Installation Guide](../get-started/install) to get started with SGLang on NVIDIA GPUs. diff --git a/docs_new/docs/hardware-platforms/nvidia_jetson.mdx b/docs_new/docs/hardware-platforms/nvidia_jetson.mdx new file mode 100644 index 000000000000..26f8e58d472d --- /dev/null +++ b/docs_new/docs/hardware-platforms/nvidia_jetson.mdx @@ -0,0 +1,82 @@ +--- +title: NVIDIA Jetson Orin +description: Guide for installing and running SGLang on NVIDIA Jetson Orin devices.
 +--- +## Prerequisites + +Before starting, ensure the following: + +- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later. +- **CUDA Toolkit** and **cuDNN** are installed. +- Verify that the Jetson AGX Orin is in **high-performance mode**: +```bash +sudo nvpmodel -m 0 +``` +* * * * * +## Installing and running SGLang with Jetson Containers +Clone the jetson-containers GitHub repository: +```bash +git clone https://github.com/dusty-nv/jetson-containers.git +``` +Run the installation script: +```bash +bash jetson-containers/install.sh +``` +Build the container image: +```bash +jetson-containers build sglang +``` +Run the container: +```bash +jetson-containers run $(autotag sglang) +``` +Alternatively, run a container manually with this command: +```bash +docker run --runtime nvidia -it --rm --network=host IMAGE_NAME +``` +* * * * * + +Running Inference +----------------------------------------- + +Launch the server: +```bash +python -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ + --device cuda \ + --dtype half \ + --attention-backend flashinfer \ + --mem-fraction-static 0.8 \ + --context-length 8192 +``` +The half-precision dtype and limited context length (`--dtype half --context-length 8192`) are due to the limited computational resources of the [NVIDIA Jetson kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation can be found in [Server Arguments](../advanced_features/server_arguments). + +After launching the engine, refer to [Chat completions](../basic_usage/openai_api_completions#Usage) to test it. +* * * * * +Running quantization with TorchAO +------------------------------------- +TorchAO is recommended for NVIDIA Jetson Orin.
+```bash Command +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --device cuda \ + --dtype bfloat16 \ + --attention-backend flashinfer \ + --mem-fraction-static 0.8 \ + --context-length 8192 \ + --torchao-config int4wo-128 +``` +This enables TorchAO's int4 weight-only quantization with a 128-group size. The usage of `--torchao-config int4wo-128` is also for memory efficiency. + + +* * * * * +Structured output with XGrammar +------------------------------- +Please refer to [SGLang doc structured output](../advanced_features/structured_outputs). +* * * * * + +Thanks to the support from [Nurgaliyev Shakhizat](https://github.com/shahizat), [Dustin Franklin](https://github.com/dusty-nv) and [Johnny Núñez Cano](https://github.com/johnnynunez). + +References +---------- +- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin) diff --git a/docs_new/docs/hardware-platforms/overview.mdx b/docs_new/docs/hardware-platforms/overview.mdx new file mode 100644 index 000000000000..5bb3c46f9638 --- /dev/null +++ b/docs_new/docs/hardware-platforms/overview.mdx @@ -0,0 +1,12 @@ +--- +title: Hardware Platforms +description: Platform-specific guides for running SGLang on GPUs, TPUs, NPUs, CPUs, and more. +--- + +- [NVIDIA GPUs](./nvidia-gpus) +- [AMD GPUs](./amd_gpu) +- [Ascend NPUs](./ascend-npus/ascend_npu) +- [CPU Server](./cpu_server) +- [NVIDIA Jetson Orin](./nvidia_jetson) +- [TPU](./tpu) +- [XPU](./xpu) diff --git a/docs_new/docs/hardware-platforms/plugin.mdx b/docs_new/docs/hardware-platforms/plugin.mdx new file mode 100644 index 000000000000..676eb045082f --- /dev/null +++ b/docs_new/docs/hardware-platforms/plugin.mdx @@ -0,0 +1,849 @@ +--- +title: "SGLang Plugin System" +metatags: + description: "Allows hardware vendors and developers to extend SGLang without modifying the main repository code." 
+--- + +## Overview + +Allows hardware vendors and developers to extend SGLang **without modifying the main repository code**. + +The framework provides two plugin types, both discovered via Python's standard `setuptools` entry_points: + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Plugin Type | Entry Point Group | Purpose |
| --- | --- | --- |
| Hardware Platform Plugin | `sglang.srt.platforms` | Register a custom hardware platform (device operations, KV cache pools, attention backends, graph capture, compilation backends, etc.) |
| General Plugin | `sglang.srt.plugins` | Inject hooks (before/after/around/replace) into any function/method, or replace entire classes |
+ +### Principles + +- **Non-intrusive**: Existing CUDA/ROCm/NPU/XPU code remains unchanged. OOT code paths are added alongside existing hardware-specific logic. +- **Zero configuration**: Plugins are automatically discovered after `pip install`, no sglang code changes required. +- **Environment variable control**: `SGLANG_PLATFORM` selects or validates the active platform plugin; `SGLANG_PLUGINS` (comma-separated) controls which general plugins to load. + +### Current Scope & Future Direction + +The plugin system currently targets **out-of-tree (OOT) hardware platforms** — enabling new devices to integrate with SGLang without any changes to the main repository. The main-repo hardware paths (CUDA, ROCm, NPU, XPU, etc.) continue to use the existing `is_cuda()`/`is_npu()`/… utility functions. + +As the plugin interfaces mature and stabilize, in-tree hardware backends can be gradually migrated to the same plugin architecture. This would replace the scattered `if device == "cuda" … elif device == "npu" …` branches throughout the codebase with a single polymorphic dispatch through the platform interface, making each hardware backend self-contained and the core engine hardware-agnostic. + +## Architecture + +### Platform Hierarchy + +The platform hierarchy uses a DeviceMixin pattern to share device operations between SRT (LLM inference) and Multimodal subsystems: + +``` +DeviceMixin (shared device identity + operations) +├── SRTPlatform(DeviceMixin) # + graph runner, KV pool, … +│ └── MySRTPlatform(SRTPlatform, MyDeviceMixin) # OOT plugin +└── MMPlatform(DeviceMixin) # + attention backend, VAE, … (future) + └── MyMMPlatform(MMPlatform, MyDeviceMixin) # OOT plugin +``` + +Key design points: +- **DeviceMixin** provides platform identity queries (`is_cuda()`, `is_npu()`, etc.) and device operations (`set_device()`, `get_device_name()`, etc.) 
+- **SRTPlatform** adds SRT-specific factory methods, capability flags, and lifecycle hooks +- OOT plugins implement a **device mixin** (vendor-specific operations) and compose it with **SRTPlatform** via multiple inheritance +- All methods are **instance methods** (not classmethods), called through the `current_platform` singleton +- Device operations and factory methods raise `NotImplementedError` by default (fail-fast) +- Capability flags use safe conservative defaults (`False`/`pass`) +- Methods are annotated `[Active]` (called by SGLang core) or `[Planned]` (reserved for future migration) + +### Platform Discovery (`current_platform`) + +`current_platform` is a **lazy singleton** in `sglang.srt.platforms`. On first access it resolves the active platform through the following priority chain: + +``` +entry_points("sglang.srt.platforms") → Enumerate ALL plugins by name (metadata only) + │ + ├─ SGLANG_PLATFORM set (front-loading filter): + │ ├─ Name not found in discovered → RuntimeError + │ ├─ activate() returns non-None → load that platform + │ └─ activate() returns None → RuntimeError (hardware unavailable) + │ + └─ SGLANG_PLATFORM unset (auto-discover, activate all): + ├─ 0 activated → fallback base SRTPlatform + ├─ 1 activated → use it + └─ N activated → RuntimeError (must set SGLANG_PLATFORM) +``` + +### Plugin Loading Flow + +`load_plugins()` discovers and executes general plugins, then applies all registered hooks. It is called at four points: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Call Site | Process | Timing |
| --- | --- | --- |
| `cli/serve.py` `serve()` | Main | Before `prepare_server_args()` |
| `launch_server.py` `__main__` | Main | Before `prepare_server_args()` |
| `engine.py` `_launch_subprocesses()` | Main | Before `server_args.check_server_args()` |
| `scheduler.py` `run_scheduler_process()` | Subprocess | Before `Scheduler()` construction |
+ +> **Note**: `load_plugins()` is idempotent (guarded by `_plugins_loaded` flag). In spawn'd subprocesses the flag resets, so plugins are correctly re-loaded. + +``` +load_plugins() + ├── _get_excluded_dists() → compute dists to skip (via SGLANG_PLATFORM) + ├── load_plugins_by_group("sglang.srt.plugins", → discover entry_points, filter by SGLANG_PLUGINS + │ excluded_dists=...) skip plugins from unselected platform packages + ├── for each plugin: → set _current_plugin_source context var + │ func() side effects (register hooks with source tracking) + └── HookRegistry.apply_hooks() → monkey-patch targets +``` + +--- + +## Plugin Type 1: Hardware Platform Plugin + +### Description + +A hardware platform plugin registers an `SRTPlatform` subclass that tells SGLang how to interact with a specific hardware backend. + +### Quick Start + +**1. Create a minimal package:** + +``` +my_platform_plugin/ +├── pyproject.toml +└── my_platform_plugin/ + ├── __init__.py # activate() function + ├── device.py # MyDeviceMixin + └── platform.py # MySRTPlatform +``` + +**2. `pyproject.toml`:** + +```toml +[build-system] +requires = ["setuptools"] +build-backend = "setuptools.build_meta" + +[project] +name = "my-platform-plugin" +version = "0.1.0" + +[project.entry-points."sglang.srt.platforms"] +my_device = "my_platform_plugin:activate" +``` + +**3. `__init__.py`** — activation function: + +```python +def activate(): + """Return fully-qualified class name to activate, or None to skip.""" + if _my_device_is_available(): + return "my_platform_plugin.platform.MySRTPlatform" + return None +``` + +**4. `device.py`** — device mixin: + +```python +from sglang.srt.platforms.device_mixin import DeviceMixin, PlatformEnum + +class MyDeviceMixin(DeviceMixin): + _enum = PlatformEnum.OOT + device_name = "my_device" + device_type = "my_device" # torch device type + + def set_device(self, device) -> None: ... + def get_device_name(self, device_id=0) -> str: ... 
+ def get_device_total_memory(self, device_id=0) -> int: ... + def get_current_memory_usage(self, device=None) -> float: ... + def get_device_capability(self, device_id=0): ... + def get_torch_distributed_backend_str(self) -> str: ... +``` + +**5. `platform.py`** — SRT platform: + +```python +from sglang.srt.platforms.interface import SRTPlatform +from my_platform_plugin.device import MyDeviceMixin + +class MySRTPlatform(SRTPlatform, MyDeviceMixin): + def get_default_attention_backend(self) -> str: ... + def support_cuda_graph(self) -> bool: ... + # ... override other methods as needed +``` + +**6. Install and verify:** + +```bash +pip install -e my_platform_plugin/ +python -c "from sglang.srt.platforms import current_platform; print(current_platform)" +``` + +### Platform Interface Reference + +#### Identity Queries (from DeviceMixin) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Default | Description |
| --- | --- | --- |
| `is_cuda()` | Based on `_enum` | Whether this is an NVIDIA CUDA platform |
| `is_rocm()` | Based on `_enum` | Whether this is an AMD ROCm platform |
| `is_npu()` | Based on `_enum` | Whether this is a Huawei NPU platform |
| `is_cpu()` | Based on `_enum` | Whether this is a CPU-only platform |
| `is_xpu()` | Based on `_enum` | Whether this is an Intel XPU platform |
| `is_musa()` | Based on `_enum` | Whether this is a Moore Threads MUSA platform |
| `is_cuda_alike()` | CUDA+ROCM+MUSA | True if the hardware supports CUDA-like APIs |
| `is_out_of_tree()` | True for OOT | Automatically detected based on `_enum = PlatformEnum.OOT` |
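
The "Based on `_enum`" default in the table above can be pictured as plain comparisons against a class attribute. The following is a minimal sketch and not the real `DeviceMixin`: `PlatformEnumSketch` and `DeviceMixinSketch` are stand-ins written for illustration, and only the member names and the CUDA+ROCM+MUSA grouping are taken from this document.

```python
import enum


class PlatformEnumSketch(enum.Enum):
    # Stand-in for PlatformEnum; member names follow the Types table.
    CUDA = enum.auto()
    ROCM = enum.auto()
    CPU = enum.auto()
    XPU = enum.auto()
    MUSA = enum.auto()
    NPU = enum.auto()
    TPU = enum.auto()
    MPS = enum.auto()
    OOT = enum.auto()
    UNSPECIFIED = enum.auto()


class DeviceMixinSketch:
    # Subclasses override this single attribute; every query derives from it.
    _enum = PlatformEnumSketch.UNSPECIFIED

    def is_cuda(self) -> bool:
        return self._enum == PlatformEnumSketch.CUDA

    def is_npu(self) -> bool:
        return self._enum == PlatformEnumSketch.NPU

    def is_cuda_alike(self) -> bool:
        # Grouping taken from the table: CUDA, ROCm and MUSA.
        return self._enum in (
            PlatformEnumSketch.CUDA,
            PlatformEnumSketch.ROCM,
            PlatformEnumSketch.MUSA,
        )

    def is_out_of_tree(self) -> bool:
        return self._enum == PlatformEnumSketch.OOT


class MyDeviceMixinSketch(DeviceMixinSketch):
    # An out-of-tree plugin only declares its identity.
    _enum = PlatformEnumSketch.OOT


platform = MyDeviceMixinSketch()
print(platform.is_out_of_tree(), platform.is_cuda(), platform.is_cuda_alike())
```

Under this reading, an OOT device mixin never needs to override the identity queries itself; setting `_enum` is enough.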
+ +#### Device Operations (from DeviceMixin) + +> Methods annotated **[Active]** are called by SGLang core through `current_platform` — OOT implementations take effect immediately. +> Methods annotated **[Planned]** are reserved interfaces — SGLang core still uses hardcoded calls (e.g. `torch.cuda.empty_cache()`). OOT implementations will NOT take effect until the core is migrated in a future PR. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Default | Status | Description |
| --- | --- | --- | --- |
| `get_device(local_rank)` | `raise NotImplementedError` | Planned | Return `torch.device` for a given local rank |
| `set_device(device)` | `raise NotImplementedError` | Planned | Set the current device |
| `get_device_name(device_id)` | `raise NotImplementedError` | Planned | Get human-readable device name |
| `get_device_uuid(device_id)` | `raise NotImplementedError` | Planned | Get unique device identifier |
| `get_device_capability(device_id)` | `raise NotImplementedError` | Planned | Get `DeviceCapability(major, minor)`. `None` if N/A |
| `empty_cache()` | `pass` | Planned | Release cached device memory |
| `synchronize()` | `pass` | Planned | Synchronize device operations |
| `get_device_total_memory(device_id)` | `raise NotImplementedError` | Active | Get total device memory in bytes |
| `get_available_memory(device_id)` | `raise NotImplementedError` | Planned | Return `(free_bytes, total_bytes)` |
| `get_current_memory_usage(device)` | `raise NotImplementedError` | Active | Get current peak memory usage in bytes |
| `get_torch_distributed_backend_str()` | `raise NotImplementedError` | Planned | Distributed backend string (e.g. "nccl", "hccl") |
| `get_communicator_class()` | `None` | Planned | Platform-specific communicator class |
| `inference_mode()` | `torch.inference_mode(True)` | Planned | Return inference mode context manager |
| `seed_everything(seed)` | Set random/np/torch seeds | Planned | Set random seeds for reproducibility |
| `verify_quantization(quant)` | `pass` | Planned | Validate quantization method support |
| `get_cpu_architecture()` | Auto-detect x86/arm | Planned | Detect CPU architecture (`CpuArchEnum`) |
+ +#### Types (from DeviceMixin) + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Type | Description |
| --- | --- |
| `PlatformEnum` | Enumeration of platform types: CUDA, ROCM, CPU, XPU, MUSA, NPU, TPU, MPS, OOT, UNSPECIFIED |
| `CpuArchEnum` | CPU architecture: X86, ARM, UNSPECIFIED |
| `DeviceCapability` | `NamedTuple(major, minor)` with comparison support. Methods: `as_version_str()`, `to_int()` |
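
To make the "comparison support" of `DeviceCapability` concrete, here is a hedged sketch of what such a `NamedTuple` could look like. `DeviceCapabilitySketch` is a stand-in, not the class shipped in `device_mixin.py`; in particular the `major * 10 + minor` encoding used by `to_int()` is an assumption made for illustration, so check the real implementation before depending on it.

```python
from typing import NamedTuple


class DeviceCapabilitySketch(NamedTuple):
    major: int
    minor: int

    def as_version_str(self) -> str:
        return f"{self.major}.{self.minor}"

    def to_int(self) -> int:
        # Assumption: single-digit minor version, encoded as major * 10 + minor.
        return self.major * 10 + self.minor


cap = DeviceCapabilitySketch(major=9, minor=0)

# NamedTuple instances compare like plain tuples: major first, then minor.
print(cap >= DeviceCapabilitySketch(8, 6))
print(cap.as_version_str(), cap.to_int())
```

Tuple ordering is what lets a plugin write capability gates such as "at least 8.6" without any custom comparison methods.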
+ +#### Capability Flags (from SRTPlatform) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Default | Description |
| --- | --- | --- |
| `support_cuda_graph()` | `False` | Whether device graph capture is supported (plain CUDA graph) |
| `support_piecewise_cuda_graph()` | `False` | Whether piecewise CUDA graph (`torch.compile` backend) is supported |
| `supports_fp8()` | `False` | Whether FP8 quantization is supported |
| `is_pin_memory_available()` | `True` | Whether pinned memory is available |
+ +#### Subsystem Factory Methods (from SRTPlatform) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Default | Description |
| --- | --- | --- |
| `get_default_attention_backend()` | `raise NotImplementedError` | Default attention backend name |
| `get_graph_runner_cls()` | `raise NotImplementedError` | Graph Runner class |
| `get_mha_kv_pool_cls()` | `raise NotImplementedError` | MHA KV cache pool class |
| `get_mla_kv_pool_cls()` | `raise NotImplementedError` | MLA KV cache pool class |
| `get_nsa_kv_pool_cls()` | `raise NotImplementedError` | NSA KV cache pool class (DeepSeek V3.2) |
| `get_paged_allocator_cls()` | `raise NotImplementedError` | Paged allocator class |
| `get_piecewise_backend_cls()` | `raise NotImplementedError` | Piecewise compilation backend class |
| `get_compile_backend(mode)` | `"inductor"` | Compilation backend string |
| `get_dispatch_key_name()` | `"native"` | MultiPlatformOp dispatch key name |
+ +#### Lifecycle Hooks (from SRTPlatform) + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Invocation Timing | Purpose |
| --- | --- | --- |
| `apply_server_args_defaults(server_args)` | After `ServerArgs` parsing, in `__post_init__` | Set platform-specific defaults |
| `init_backend()` | In each worker, before model construction | One-time backend initialization |
+ +### Environment Variables + + + + + + + + + + + + + + + + + + + + + + +
| Variable | Description |
| --- | --- |
| `SGLANG_PLATFORM` | Select the platform plugin by entry_point name (e.g. `kunlun`, `demo_cuda`). When set, only the named plugin's `activate()` is called (front-loading filter); other plugins are not touched. Additionally, general plugins (`sglang.srt.plugins`) from unselected platform packages are automatically skipped to avoid importing their dependencies. Required when multiple plugins would activate. Errors if the name is not found or if the plugin's hardware is unavailable. |
| `SGLANG_PLUGINS` | Comma-separated whitelist of general plugin names to load (group: `sglang.srt.plugins`). If unset, all discovered general plugins are loaded. |
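
The `SGLANG_PLATFORM` rules above, together with the priority chain in the Platform Discovery section, can be summarized as a small decision function. This is a sketch of the documented behavior, not SGLang's actual code: `select_platform` and its parameters are hypothetical, plugins are modeled as a plain dict of `activate` callables, and the class paths in the example are made up.

```python
from typing import Callable, Dict, Optional


def select_platform(
    discovered: Dict[str, Callable[[], Optional[str]]],
    requested: Optional[str],
    fallback: str = "SRTPlatform",
) -> str:
    """Return the qualified class name of the platform to load."""
    if requested:
        # SGLANG_PLATFORM set: only the named plugin's activate() is called.
        if requested not in discovered:
            raise RuntimeError(f"platform plugin {requested!r} not found")
        target = discovered[requested]()
        if target is None:
            raise RuntimeError(f"hardware for {requested!r} is unavailable")
        return target

    # SGLANG_PLATFORM unset: call activate() on every discovered plugin.
    activated = {}
    for name, activate in discovered.items():
        target = activate()
        if target is not None:
            activated[name] = target
    if not activated:
        return fallback  # 0 activated: base platform
    if len(activated) > 1:
        raise RuntimeError(
            f"multiple platforms activated {sorted(activated)}; "
            "set SGLANG_PLATFORM"
        )
    return next(iter(activated.values()))


# Hypothetical plugins: one with hardware present, one without.
plugins = {
    "kunlun": lambda: "kunlun_plugin.platform.KunlunSRTPlatform",
    "demo_cuda": lambda: None,
}
print(select_platform(plugins, requested=None))
```

In a real process the `requested` argument would come from `os.environ.get("SGLANG_PLATFORM")`; it is kept as a parameter here so the branches can be exercised without touching the environment.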
+ +--- + +## Plugin Type 2: General Plugin + +### Description + +General function plugins inject behavior into sglang **without requiring a custom platform**. Use cases include: + +- **Observability**: Add logging, metrics, and tracing to any function +- **Behavior modification**: Modify function arguments or return values +- **Performance profiling**: Add timing to critical functions +- **A/B testing**: Replace implementations at runtime + +### Quick Start + +**1. Create a minimal package:** + +``` +my_general_plugin/ +├── pyproject.toml +└── my_general_plugin/ + └── __init__.py # register() function +``` + +**2. `pyproject.toml`:** + +```toml +[build-system] +requires = ["setuptools"] +build-backend = "setuptools.build_meta" + +[project] +name = "my-general-plugin" +version = "0.1.0" + +[project.entry-points."sglang.srt.plugins"] +my_plugin = "my_general_plugin:register" +``` + +**3. `__init__.py`** — register hooks: + +```python +from sglang.srt.plugins.hook_registry import HookRegistry, HookType + +def register(): + """Entry point called by load_plugins().""" + HookRegistry.register( + "sglang.srt.managers.scheduler.Scheduler.__init__", + my_hook, + HookType.AROUND, + ) + +def my_hook(original_fn, self, *args, **kwargs): + result = original_fn(self, *args, **kwargs) + print(f"Scheduler initialized! gpu_id={self.gpu_id}") + return result +``` + +**4. Install and run:** + +```bash +pip install -e my_general_plugin/ +sglang serve --model-path [options] +# Look for "Scheduler initialized!" in logs +``` + +### Hook Types + +`HookRegistry` supports four hook types: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hook Type | Signature | Description |
| --- | --- | --- |
| `BEFORE` | `fn(*args, **kwargs) -> (args, kwargs) \| None` | Runs before the original. Return `None` to keep args unchanged, or `(args, kwargs)` to modify. |
| `AFTER` | `fn(result, *args, **kwargs) -> new_result \| None` | Runs after the original. Return `None` to keep result, or a new value to replace. |
| `AROUND` | `fn(original_fn, *args, **kwargs) -> result` | Wraps the original. You must call `original_fn` yourself. Full control over execution. |
| `REPLACE` | `fn(*args, **kwargs) -> result` or class | Replace the original function or class entirely. For class targets, pass a replacement class directly; it is substituted via `setattr`, preserving `isinstance()`/`issubclass()` semantics. |
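
The hook signatures in the table above translate into three small wrappers. The sketch below illustrates the contract only and is not how `HookRegistry` is implemented; `apply_before`, `apply_after` and `apply_around` are hypothetical helper names.

```python
import functools


def apply_before(original, hook):
    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        modified = hook(*args, **kwargs)
        if modified is not None:  # None means "keep the arguments"
            args, kwargs = modified
        return original(*args, **kwargs)
    return wrapper


def apply_after(original, hook):
    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)
        new_result = hook(result, *args, **kwargs)
        return result if new_result is None else new_result
    return wrapper


def apply_around(original, hook):
    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        # The hook decides if and how the original is called.
        return hook(original, *args, **kwargs)
    return wrapper


def add(a, b):
    return a + b


def double_first(a, b):           # BEFORE: rewrite the arguments
    return (a * 2, b), {}


def negate(result, a, b):         # AFTER: replace the result
    return -result


def plus_hundred(original_fn, *args, **kwargs):   # AROUND: wrap the call
    return original_fn(*args, **kwargs) + 100


print(apply_before(add, double_first)(1, 2))   # add(2, 2)
print(apply_after(add, negate)(1, 2))          # -(1 + 2)
print(apply_around(add, plus_hundred)(1, 2))   # (1 + 2) + 100
```

One consequence of the `None` convention is that an `AFTER` hook cannot replace a result with `None` itself; an `AROUND` hook is the natural choice for that case.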
+ +> **Note**: Only `REPLACE` accepts a class as the hook. Passing a class to `BEFORE`/`AFTER`/`AROUND` raises `TypeError` at registration time. + +### Registration API + +Hooks can be registered using the **imperative API** or the **decorator API**: + +```python +# --- Imperative API --- +from sglang.srt.plugins.hook_registry import HookRegistry, HookType + +def my_timer(original_fn, *args, **kwargs): + start = time.perf_counter() + result = original_fn(*args, **kwargs) + print(f"Elapsed: {time.perf_counter() - start:.3f}s") + return result + +HookRegistry.register( + "sglang.srt.managers.scheduler.Scheduler.get_next_batch_to_run", + my_timer, + HookType.AROUND, +) + +# --- Decorator API --- +from sglang.srt.plugins.hook_registry import plugin_hook, HookType + +@plugin_hook( + "sglang.srt.managers.scheduler.Scheduler.get_next_batch_to_run", + type=HookType.AROUND, +) +def my_timer(original_fn, *args, **kwargs): + start = time.perf_counter() + result = original_fn(*args, **kwargs) + print(f"Elapsed: {time.perf_counter() - start:.3f}s") + return result + +# --- Class replacement (REPLACE) --- +from sglang.srt.plugins.hook_registry import plugin_hook, HookType +from sglang.srt.managers.scheduler import Scheduler + +@plugin_hook( + "sglang.srt.managers.scheduler.Scheduler", + type=HookType.REPLACE, +) +class MyScheduler(Scheduler): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + print("Enhanced scheduler initialized!") +``` + +### Hook Target Resolution + +Target paths use fully-qualified dotted notation. Both formats are supported: + +- **Dotted**: `sglang.srt.managers.scheduler.Scheduler.__init__` +- **Entry-points style**: `sglang.srt.managers.scheduler:Scheduler.__init__` (colon treated as dot) + +### Common Hook Targets + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Target | Description |
| --- | --- |
| `sglang.srt.server_args.ServerArgs.add_cli_args` | Add custom CLI arguments |
| `sglang.srt.server_args.ServerArgs.__post_init__` | Modify ServerArgs after parsing |
| `sglang.srt.server_args.ServerArgs.check_server_args` | Add/relax validation |
| `sglang.srt.managers.scheduler.Scheduler.__init__` | Custom scheduler state |
| `sglang.srt.managers.scheduler.Scheduler.get_next_batch_to_run` | Custom scheduling policy |
| `sglang.srt.managers.scheduler.Scheduler.run_batch` | Profiling / inspection |
| `sglang.srt.managers.scheduler.Scheduler.process_batch_result` | Custom metrics |
| `sglang.srt.managers.tp_worker.TpModelWorker.__init__` | Custom worker state |
| `sglang.srt.managers.tp_worker.TpModelWorker.forward_batch_generation` | Forward pass wrapping |
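
Target strings like the ones in the table above have to be turned into an object and an attribute name before anything can be patched. The function below is one plausible way to do that with `importlib`, shown on a standard-library target so it runs without SGLang installed; `resolve_target` is a hypothetical name and the real resolver may behave differently.

```python
import importlib
import json
from typing import Any, Tuple


def resolve_target(path: str) -> Tuple[Any, str]:
    """Return (parent_object, attribute_name) for a dotted hook target."""
    # The entry-points style colon is treated like a dot.
    parts = path.replace(":", ".").split(".")
    # Try the longest importable module prefix first, then walk attributes.
    for split in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:split]))
        except ImportError:
            continue
        for name in parts[split:-1]:
            obj = getattr(obj, name)
        if not hasattr(obj, parts[-1]):
            raise AttributeError(f"{path!r} has no attribute {parts[-1]!r}")
        return obj, parts[-1]
    raise ImportError(f"cannot resolve hook target {path!r}")


parent, attr = resolve_target("json.decoder:JSONDecoder.decode")
print(parent.__name__, attr)
```

Once the parent and attribute name are known, monkey-patching amounts to `setattr(parent, attr, wrapper)`, which is why a method target resolves to its class rather than to the function object.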
+ +--- + +## File Reference + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| File | Description |
| --- | --- |
| `sglang/srt/platforms/device_mixin.py` | `PlatformEnum` + `DeviceMixin` base class |
| `sglang/srt/platforms/interface.py` | `SRTPlatform` base class (extends `DeviceMixin`) |
| `sglang/srt/platforms/__init__.py` | `current_platform` lazy singleton + discovery logic |
| `sglang/srt/plugins/__init__.py` | `load_plugins()` + `load_plugins_by_group()` |
| `sglang/srt/plugins/hook_registry.py` | `HookRegistry`, `HookType`, `plugin_hook` decorator |
diff --git a/docs_new/docs/hardware-platforms/tpu.mdx b/docs_new/docs/hardware-platforms/tpu.mdx new file mode 100644 index 000000000000..b3d7f2516823 --- /dev/null +++ b/docs_new/docs/hardware-platforms/tpu.mdx @@ -0,0 +1,629 @@ +--- +title: "TPU" +description: "SGLang supports high-performance TPU inference through the SGLang-JAX backend, which is specifically optimized for Google Cloud TPUs. The JAX-based implementation delivers exceptional throughput and low latency for Large Language Model (LLM) serving workloads on TPU hardware." +--- +SGLang supports high-performance TPU inference through the SGLang-JAX backend, which is specifically optimized for Google Cloud TPUs. The JAX-based implementation delivers exceptional throughput and low latency for Large Language Model (LLM) serving workloads on TPU hardware. + +For TPU-specific issues or feature requests, please visit the [sglang-jax GitHub issues page](https://github.com/sgl-project/sglang-jax/issues). + +**NOTE:** SGLang TPU support is implemented via the SGLang-JAX backend, a dedicated JAX-based inference engine maintained as a separate repository at [https://github.com/sgl-project/sglang-jax](https://github.com/sgl-project/sglang-jax). + +## System Requirements + +### Supported TPU Hardware + + + + + + + + + + + + + + + + + + + + + + + + + + +
| TPU Type | HBM Memory | Availability |
| --- | --- | --- |
| TPU v6e | 32 GB | Google Cloud |
| TPU v7 | 96 GB per core | Google Cloud |
+ +### Software Requirements + +- **Python:** 3.12 or higher +- **JAX:** Latest version with TPU support +- **Environment:** Google Cloud TPU VM or compatible TPU runtime +- **Optional:** SkyPilot for simplified cloud deployment + +## Feature Support Matrix + +SGLang-JAX provides comprehensive TPU-optimized features for production LLM serving: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Feature | Support Status | Description |
| --- | --- | --- |
| High-Throughput Continuous Batching | | Dynamic request batching for maximum TPU utilization |
| Radix Tree KV Cache | | Memory-efficient prefix sharing between requests |
| FlashAttention Backend | | TPU-optimized attention kernel for long sequences |
| Tensor Parallelism | | Distribute models across multiple TPU cores |
| Paged Attention | | Flexible KV cache management with paging |
| Speculative Decoding (EAGLE/EAGLE3) | | 20-40% throughput improvement for compatible models |
| Chunked Prefill | | Mixed prefill-decode batching |
| OpenAI-Compatible API | | Drop-in replacement for OpenAI API |
| Data Parallel Attention | 🚧 | In development - Attention computation with data parallelism |
| Quantization | 🚧 | In development - Model quantization for reduced memory usage |
| Multi-LoRA | 🚧 | In development - Serve multiple LoRA adapters simultaneously |
+ +### Attention Backend Comparison + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| **Backend** | **Paged Attention** | **Spec Decoding** | **MLA** | **Sliding Window** |
| --- | --- | --- | --- | --- |
| FlashAttention (fa) | | | | |
| Native | | | | |
+ +**NOTE:** FlashAttention backend is recommended for production workloads due to superior memory efficiency and performance. + +## Optimized Model List + +The following models have been tested and optimized for TPU deployment: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model Family | Performance Status |
| --- | --- |
| Qwen 3 | ⭐ Recommended for production |
| Qwen 3 MoE | ⭐ Best performance |
| Qwen 2 | Needs improvement |
| Qwen 2 MoE | Needs improvement |
| Qwen 1.5 | Needs improvement |
| Llama/LLaMA | Needs improvement |
| Grok-2 | Needs improvement |
| Gemma 2 | Verified on TPU |
| Bailing MoE | Needs improvement |
+ +## Installation + +### Method 1: Using PyPI (Recommended) + +```bash Command +pip install sglang-jax +``` + +### Method 2: From Source + +```bash Command +git clone https://github.com/sgl-project/sglang-jax +cd sglang-jax +uv venv --python 3.12 && source .venv/bin/activate +uv pip install -e "python[all]" +``` + +### Method 3: Using Docker + +**NOTE:** Docker support for TPU is currently under development. Please use PyPI or source installation methods. + +### Method 4: Cloud TPU with SkyPilot + +[SkyPilot](https://github.com/skypilot-org/skypilot) provides simplified deployment on Google Cloud TPU: + +1. Install SkyPilot and configure GCP access (see [SkyPilot documentation](https://skypilot.readthedocs.io/)) + +2. Create a SkyPilot configuration file: + +SkyPilot YAML: `sglang-jax.sky.yaml` + +```yaml Config +# sglang-jax.sky.yaml +resources: + accelerators: tpu-v6e-4 + accelerator_args: + tpu_vm: True + runtime_version: v2-alpha-tpuv6e + +run: | + git clone https://github.com/sgl-project/sglang-jax.git + cd sglang-jax + uv venv --python 3.12 + source .venv/bin/activate + uv pip install -e "python[all]" +``` + +3. Launch your TPU cluster: + +```bash Command +# Standard deployment +sky launch -c sglang-jax sglang-jax.sky.yaml --infra=gcp + +# With spot instances for cost savings +sky launch -c sglang-jax sglang-jax.sky.yaml --infra=gcp --use-spot +``` + +## Launch of the Serving Engine + +### Basic Example: Qwen-7B + +```bash Command +JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ + --model-path Qwen/Qwen-7B-Chat \ + --trust-remote-code \ + --dist-init-addr=0.0.0.0:10011 \ + --nnodes=1 \ + --tp-size=4 \ + --device=tpu \ + --random-seed=3 \ + --node-rank=0 \ + --mem-fraction-static=0.8 \ + --max-prefill-tokens=8192 \ + --download-dir=/tmp \ + --dtype=bfloat16 \ + --skip-server-warmup \ + --host 0.0.0.0 \ + --port 30000 +``` + +**Key Parameters Explained:** + +1.
`JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache` - Enables JIT compilation caching to accelerate server startup on subsequent runs +2. `--tp-size=4` - Tensor parallelism size; match this to your TPU core count (typically 1, 4, or 8) +3. `--device=tpu` - Specifies TPU device (this is the default for sglang-jax) +4. `--dtype=bfloat16` - Uses bfloat16 precision, which TPUs are optimized for +5. `--mem-fraction-static=0.8` - Allocates 80% of TPU HBM for static memory (adjustable from 0.2 to 0.9) +6. `--max-prefill-tokens=8192` - Maximum number of tokens processed in the prefill phase + +### High-Performance Configuration: Qwen3-8B + +For production workloads with optimal throughput: + +```bash Command +python3 -u -m sgl_jax.launch_server \ + --model-path Qwen/Qwen3-8B \ + --trust-remote-code \ + --tp-size=4 \ + --device=tpu \ + --mem-fraction-static=0.8 \ + --chunked-prefill-size=2048 \ + --dtype=bfloat16 \ + --max-running-requests=256 \ + --page-size=128 \ + --attention-backend=fa +``` + +### Advanced: Speculative Decoding (EAGLE3) + +Speculative decoding can improve throughput by 20-40% for compatible models: + +```bash Command +python3 -u -m sgl_jax.launch_server \ + --model-path Qwen/Qwen3-32B \ + --trust-remote-code \ + --device=tpu \ + --tp-size=4 \ + --mem-fraction-static=0.8 \ + --max-prefill-tokens=4096 \ + --attention-backend=fa \ + --dtype=bfloat16 \ + --port=30000 \ + --host=0.0.0.0 \ + --disable-overlap-schedule \ + --speculative-algorithm=EAGLE3 \ + --speculative-draft-model-path=AngelSlim/Qwen3-32B_eagle3 \ + --page-size=64 \ + --speculative-eagle-topk=1 \ + --speculative-num-steps=3 \ + --speculative-num-draft-tokens=4 +``` + +**NOTE:** Speculative decoding is currently supported for Qwen3 and LLaMA model families. See the [Speculative Decoding documentation](https://github.com/sgl-project/sglang-jax/blob/main/docs/features/speculative_decoding.md) for detailed configuration guidance. 
+ + +### Multi-Node Distributed Serving + +For large models requiring multiple TPU VMs: + +```bash Command +# Node 0 (coordinator) +python3 -m sgl_jax.launch_server \ + --model-path MODEL_PATH \ + --dist-init-addr=NODE0_IP:10011 \ + --nnodes=2 \ + --node-rank=0 \ + --tp-size=8 \ + [other parameters...] + +# Node 1 (worker) +python3 -m sgl_jax.launch_server \ + --model-path MODEL_PATH \ + --dist-init-addr=NODE0_IP:10011 \ + --nnodes=2 \ + --node-rank=1 \ + --tp-size=8 \ + [other parameters...] +``` + +## Benchmarking with Requests + +### Throughput Testing + +Basic throughput benchmark: + +```bash Command +python3 -m sgl_jax.bench_serving \ + --backend sgl-jax \ + --dataset-name random \ + --num-prompts=100 \ + --random-input=512 \ + --random-output=128 \ + --max-concurrency=8 \ + --random-range-ratio=1 \ + --warmup-requests=0 +``` + +### Latency Testing + +Measure single-batch latency: + +```bash Command +python3 -m sgl_jax.bench_one_batch_server \ + --base-url http://127.0.0.1:30000 \ + --model-path Qwen/Qwen-7B-Chat \ + --batch-size=32 \ + --input-len=256 \ + --output-len=32 +``` + +### Comprehensive Benchmark Script + +For systematic performance evaluation across different configurations: + +```bash Command +#!/bin/bash +set -e + +backend=${1:-sgl-jax} +num_prompts_per_concurrency=3 +input_seq_lens=(1024 4096 8192) +output_seq_lens=(1 1024) +max_concurrencies=(8 16 32 64 128 256) + +for input_seq_len in "${input_seq_lens[@]}"; do + for output_seq_len in "${output_seq_lens[@]}"; do + echo "=======================================" + echo "Testing ISL/OSL: $input_seq_len/$output_seq_len" + echo "=======================================" + for max_concurrency in "${max_concurrencies[@]}"; do + num_prompts=$((num_prompts_per_concurrency * max_concurrency)) + python3 -m sgl_jax.bench_serving \ + --backend ${backend} \ + --dataset-name random \ + --num-prompts ${num_prompts} \ + --random-input ${input_seq_len} \ + --random-output ${output_seq_len} \ + --max-concurrency 
${max_concurrency} \ + --random-range-ratio 1 \ + --disable-ignore-eos \ + --warmup-requests 0 + done + done +done +``` + +For detailed help on all benchmark parameters: + +```bash Command +python3 -m sgl_jax.bench_serving --help +``` + +See the [Benchmark and Profiling Guide](https://github.com/sgl-project/sglang-jax/blob/main/docs/developer_guide/benchmark_and_profiling.md) for advanced benchmarking techniques and profiling with JAX Profiler. + +## Performance Optimization + +### Memory Optimization + +**Reduce memory usage:** +- Lower `--mem-fraction-static` (from 0.8 → 0.5 → 0.3) +- Decrease `--max-prefill-tokens` (from 16384 → 8192 → 4096) +- Reduce `--max-running-requests` + +**Handle OOM errors:** +- Start with conservative memory settings (`--mem-fraction-static=0.5`) +- Gradually increase until you find the optimal balance +- Increase `--page-size` for better memory locality (1 → 16 → 64 → 128) + +### Throughput Optimization + +To maximize tokens per second: + +- Use FlashAttention backend: `--attention-backend=fa` +- Enable speculative decoding (EAGLE3) for Qwen3 models (20-40% improvement) +- Increase `--max-running-requests` to 256+ +- Set `--mem-fraction-static` to 0.8+ (if memory allows) +- Use larger page sizes (64-128) +- Enable chunked prefill: `--chunked-prefill-size=2048` + +### Latency Optimization + +To minimize time-to-first-token (TTFT) and inter-token latency: + +- Reduce `--page-size` to 1-4 +- Lower `--max-running-requests` (16-32) for smaller batches +- Reduce `--chunked-prefill-size` +- Use conservative memory settings to avoid GC pauses + +### TPU-Specific Optimizations + +1. **JIT Compilation Cache:** + ```bash Command + export JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache + ``` + Always set this environment variable to cache compiled kernels and accelerate server startup. + +2. **Data Type Optimization:** + Use `--dtype=bfloat16` for TPU native optimization. TPUs are specifically designed for bfloat16 computations. + +3. 
**Tensor Parallelism:** + Match `--tp-size` to your TPU core configuration (1, 4, or 8) for optimal model distribution. + +4. **Attention Backend:** + Always use `--attention-backend=fa` (FlashAttention) for production workloads. + +## Troubleshooting + +### OOM (Out of Memory) Errors + +If you encounter out-of-memory errors: + +1. Reduce `--mem-fraction-static` from 0.8 to 0.5 or lower +2. Decrease `--max-prefill-tokens` from 8192 to 4096 or 2048 +3. Lower `--max-running-requests` to reduce concurrent batch size +4. Increase `--page-size` for better memory layout efficiency + +### Long Compilation Time + +If the server takes too long to start: + +1. Ensure `JAX_COMPILATION_CACHE_DIR` is properly set +2. Expect the first run to spend time on JIT compilation (this is normal) +3. Subsequent runs will be significantly faster with cached compilations +4. Consider using `--skip-server-warmup` to defer compilation until the first request + +### Low Throughput + +If you're not achieving expected throughput: + +1. Verify `--tp-size` matches your TPU core configuration +2. Check that `--attention-backend=fa` is enabled +3. Increase `--max-running-requests` to enable larger batch formation +4. Consider enabling speculative decoding for compatible models +5. Ensure memory settings allow for sufficient batch sizes + +### Connection Issues + +If clients cannot connect to the server: + +1. Ensure `--host=0.0.0.0` for external access (not just `127.0.0.1`) +2. Verify firewall rules allow traffic on the specified port (default: 30000) +3. Check that the server process is running: `curl http://localhost:30000/health` + +## Advanced Features + +### Speculative Decoding + +SGLang-JAX supports EAGLE and EAGLE3 speculative decoding algorithms for Qwen3 and LLaMA model families. Speculative decoding can improve throughput by 20-40% without affecting output quality.
+ +See the [Speculative Decoding documentation](https://github.com/sgl-project/sglang-jax/blob/main/docs/features/speculative_decoding.md) for detailed configuration and supported model combinations. + +### Chunked Prefill + +Enable mixed prefill-decode batching for better TPU utilization: + +```bash Command +--chunked-prefill-size=2048 --enable-mixed-chunk +``` + +This allows the scheduler to mix prefill operations with decode operations in the same batch, improving overall throughput. + +### Custom Attention Backends + +SGLang-JAX supports a plugin-based attention backend system. You can implement custom attention kernels optimized for specific use cases. + +See the [Attention Backend documentation](https://github.com/sgl-project/sglang-jax/blob/main/docs/features/attention_backend.md) for implementation details. + +### Environment Verification + +Verify your TPU setup before deploying: + +```bash Command +python -c "from sgl_jax import check_env; check_env.check_env()" +``` + +This command checks: +- Installed package versions +- TPU device availability and specifications +- System resources and configuration +- Compatibility of settings + +## Contributing + +We welcome contributions to improve TPU support in SGLang-JAX! + +### Areas for Contribution + +**Check the [Development Roadmap](https://github.com/sgl-project/sglang-jax/issues/190)** to see planned features and find opportunities to contribute new functionality. + +Current contribution areas include: + +- Performance optimizations for specific TPU generations +- Support for additional model architectures +- Documentation improvements and examples +- Bug reports and fixes +- Benchmark results and performance analysis + +### How to Contribute + +1. Visit the [sglang-jax repository](https://github.com/sgl-project/sglang-jax) +2. Read the [Contribution Guide](https://github.com/sgl-project/sglang-jax/blob/main/docs/developer_guide/contribution_guide.md) +3. 
Join the [SGL-JAX Slack community](https://sgl-fru7574.slack.com/archives/C09EBE5HT5X) for discussions +4. Report issues at [sglang-jax/issues](https://github.com/sgl-project/sglang-jax/issues) + +### Testing on TPU + +For contributors who need TPU access for testing: + +- Refer to the [TPU Resources Guide](https://github.com/sgl-project/sglang-jax/blob/main/docs/developer_guide/tpu_resources_guide.md) for information on accessing TPU hardware +- Use SkyPilot with spot instances for cost-effective testing +- Follow the [Benchmark and Profiling Guide](https://github.com/sgl-project/sglang-jax/blob/main/docs/developer_guide/benchmark_and_profiling.md) for performance validation + +## References + +### Documentation + +- [SGLang-JAX Repository](https://github.com/sgl-project/sglang-jax) +- [SGLang-JAX Installation Guide](https://github.com/sgl-project/sglang-jax/blob/main/docs/get_started/install.md) +- [Qwen Models Quick Start](https://github.com/sgl-project/sglang-jax/blob/main/docs/basic_usage/qwen.md) +- [Benchmark and Profiling Guide](https://github.com/sgl-project/sglang-jax/blob/main/docs/developer_guide/benchmark_and_profiling.md) +- [Speculative Decoding](https://github.com/sgl-project/sglang-jax/blob/main/docs/features/speculative_decoding.md) + +### External Resources + +- [JAX Documentation](https://jax.readthedocs.io/) +- [Google Cloud TPU Documentation](https://cloud.google.com/tpu/docs) +- [SkyPilot Documentation](https://skypilot.readthedocs.io/) diff --git a/docs_new/docs/hardware-platforms/xpu.mdx b/docs_new/docs/hardware-platforms/xpu.mdx new file mode 100644 index 000000000000..0c9f0ac835df --- /dev/null +++ b/docs_new/docs/hardware-platforms/xpu.mdx @@ -0,0 +1,140 @@ +--- +title: XPU +sidebarTitle: Intel GPUs (XPU) +--- +The document addresses how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on Intel GPU, [see more context about Intel GPU support within PyTorch 
ecosystem](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html). + +Specifically, SGLang is optimized for [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/242616/intel-arc-pro-b-series-graphics.html) and [Intel® Arc™ B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/240391/intel-arc-b-series-graphics.html). + +## Optimized Model List + +The following LLMs have been optimized on Intel GPUs, and more are on the way:

| Model Name | BF16 |
| --- | --- |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
| Qwen2.5-1.5B | [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) |
+ +**Note:** The model identifiers listed in the table above +have been verified on [Intel® Arc™ B580 Graphics](https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html). + +## Installation + +### Install From Source + +Currently SGLang XPU only supports installation from source. Please refer to ["Getting Started on Intel GPU"](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html) to install XPU dependency. + +```bash Command +# Create and activate a conda environment +conda create -n sgl-xpu python=3.12 -y +conda activate sgl-xpu + +# Set PyTorch XPU as primary pip install channel to avoid installing the larger CUDA-enabled version and prevent potential runtime issues. +pip3 install torch==2.11.0+xpu torchao torchvision torchaudio==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu +pip3 install xgrammar --no-deps # xgrammar will introduce CUDA-enabled triton which might conflict with XPU + +# Clone the SGLang code +git clone https://github.com/sgl-project/sglang.git +cd sglang +git checkout + +# Use dedicated toml file +cd python +cp pyproject_xpu.toml pyproject.toml +# Install SGLang dependent libs, and build SGLang main package +pip install --upgrade pip setuptools +pip install -v . --extra-index-url https://download.pytorch.org/whl/xpu +``` + +### Install Using Docker + +[The SGLang XPU Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xpu.Dockerfile) is provided to facilitate the installation. +Replace `` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens). + +```bash Command +# Clone the SGLang repository +git clone https://github.com/sgl-project/sglang.git +cd sglang/docker + +# Build the docker image +docker build -t sglang-xpu:latest -f xpu.Dockerfile . 
+ +# Initiate a docker container +docker run \ + -it \ + --privileged \ + --ipc=host \ + --network=host \ + --group-add $(getent group video | cut -d: -f3) \ + --device /dev/dri \ + -v /dev/dri/by-path:/dev/dri/by-path \ + -v /dev/shm:/dev/shm \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + -p 30000:30000 \ + -e "HF_TOKEN=" \ + sglang-xpu:latest /bin/bash +``` + +## Launch of the Serving Engine + +Example command to launch SGLang serving: + +```bash +# --tp 2: use multiple GPUs +# --attention-backend intel_xpu: use the Intel-optimized XPU attention backend +# --page-size: the intel_xpu attention backend supports [32, 64, 128] +# (comments are kept above the command because text after a trailing backslash breaks the line continuation) +sglang serve \ + --model-path \ + --trust-remote-code \ + --disable-overlap-schedule \ + --device xpu \ + --host 0.0.0.0 \ + --tp 2 \ + --attention-backend intel_xpu \ + --page-size +``` + +## Benchmarking with Requests + +You can benchmark the performance via the `bench_serving` script. +Run the command in another terminal. + +```bash +python -m sglang.bench_serving \ + --dataset-name random \ + --random-input-len 1024 \ + --random-output-len 1024 \ + --num-prompts 1 \ + --request-rate inf \ + --random-range-ratio 1.0 +``` + +Detailed explanations of the parameters are available via: + +```bash +python -m sglang.bench_serving -h +``` + +Additionally, requests can be formed with the +[OpenAI Completions API](../basic_usage/openai_api_completions) +and sent via the command line (e.g. using `curl`) or via your own script. diff --git a/docs_new/docs/references/custom_chat_template.mdx b/docs_new/docs/references/custom_chat_template.mdx new file mode 100644 index 000000000000..19cae6004479 --- /dev/null +++ b/docs_new/docs/references/custom_chat_template.mdx @@ -0,0 +1,54 @@ +--- +title: "Custom Chat Template" +metatags: + description: "SGLang custom chat templates: JSON and Jinja formats for OpenAI-compatible API server. Override tokenizer defaults." +--- +**NOTE**: There are two chat template systems in the SGLang project.
This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)). + +By default, the server uses the chat template specified in the model tokenizer from Hugging Face. +It should just work for most official models such as Llama-2/Llama-3. + +If needed, you can also override the chat template when launching the server: + +```bash Command +python -m sglang.launch_server \ + --model-path meta-llama/Llama-2-7b-chat-hf \ + --port 30000 \ + --chat-template llama-2 +``` + +If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file. + +## JSON Format + +You can load the JSON format, which is defined by `conversation.py`. + +```json Config +{ + "name": "my_model", + "system": "<|im_start|>system", + "user": "<|im_start|>user", + "assistant": "<|im_start|>assistant", + "sep_style": "CHATML", + "sep": "<|im_end|>", + "stop_str": ["<|im_end|>", "<|im_start|>"] +} +``` + +```bash Command +python -m sglang.launch_server \ + --model-path meta-llama/Llama-2-7b-chat-hf \ + --port 30000 \ + --chat-template ./my_model_template.json +``` + +## Jinja Format + +You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers. 
+ +```bash Command +python -m sglang.launch_server \ + --model-path meta-llama/Llama-2-7b-chat-hf \ + --port 30000 \ + --chat-template ./my_model_template.jinja +``` diff --git a/docs_new/docs/references/environment_variables.mdx b/docs_new/docs/references/environment_variables.mdx new file mode 100644 index 000000000000..26791cbdd29b --- /dev/null +++ b/docs_new/docs/references/environment_variables.mdx @@ -0,0 +1,836 @@ +--- +title: "Environment Variables" +metatags: + description: "SGLang environment variables: SGLANG_* and SGL_* configs for performance, memory, DeepGEMM, DeepEP, profiling." +--- +SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time. + +*Note: SGLang uses two prefixes for environment variables: `SGL_` and `SGLANG_`. This is likely due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.* + +## General Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_MODELSCOPE` | Enable using models from ModelScope | `false` |
| `SGLANG_HOST_IP` | Host IP address for the server | `0.0.0.0` |
| `SGLANG_PORT` | Port for the server | auto-detected |
| `SGLANG_LOGGING_CONFIG_PATH` | Custom logging configuration path | Not set |
| `SGLANG_LOG_REQUEST_HEADERS` | Comma-separated list of additional HTTP headers to log when `--log-requests` is enabled. Appends to the default `x-smg-routing-key`. | Not set |
| `SGLANG_HEALTH_CHECK_TIMEOUT` | Timeout for health check in seconds | `20` |
| `SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL` | The interval of passes to collect the metric of selected count of physical experts on each layer and GPU rank. `0` means disabled. | `0` |
| `SGLANG_FORWARD_UNKNOWN_TOOLS` | Forward unknown tool calls to clients instead of dropping them | `false` (drop unknown tools) |
| `SGLANG_REQ_WAITING_TIMEOUT` | Timeout (in seconds) for requests waiting in the queue before being scheduled | `-1` |
| `SGLANG_REQ_RUNNING_TIMEOUT` | Timeout (in seconds) for requests running in the decode batch | `-1` |
| `SGLANG_CACHE_DIR` | Cache directory for model weights and other data | `~/.cache/sglang` |
| `SGLANG_PREFETCH_BLOCK_SIZE_MB` | Block size (in MB) for sequential checkpoint prefetch reads that warm the OS page cache before workers load weights via mmap | `16` |
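
Since these settings are read from the process environment, they can be applied per launch rather than exported globally. The following is a minimal sketch under stated assumptions: `build_server_env` is a hypothetical helper (not part of SGLang), and the header names and timeout value are arbitrary examples.

```python
import os


def build_server_env(overrides, base=None):
    """Return a copy of the environment with SGLang variables applied.

    Hypothetical helper for illustration; it is not part of SGLang.
    """
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        # SGLang variables start with either SGLANG_ or SGL_.
        if not name.startswith(("SGLANG_", "SGL_")):
            raise ValueError(f"not an SGLang variable: {name}")
        # Environment values must be strings.
        env[name] = str(value)
    return env


env = build_server_env(
    {
        "SGLANG_HEALTH_CHECK_TIMEOUT": 60,  # seconds (documented default: 20)
        "SGLANG_LOG_REQUEST_HEADERS": "x-request-id,x-tenant-id",  # example header names
    },
    base={"PATH": "/usr/bin"},
)
```

The resulting mapping can then be passed as the `env` argument of `subprocess.run` when starting `python -m sglang.launch_server`.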
+ +## Performance Tuning + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_ENABLE_TORCH_INFERENCE_MODE` | Control whether to use `torch.inference_mode` | `false` |
| `SGLANG_ENABLE_TORCH_COMPILE` | Enable `torch.compile` | `false` |
| `SGLANG_SET_CPU_AFFINITY` | Enable CPU affinity setting (often set to `1` in Docker builds) | `false` |
| `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN` | Allows the scheduler to overwrite longer context length requests (often set to `1` in Docker builds) | `false` |
| `SGLANG_IS_FLASHINFER_AVAILABLE` | Control FlashInfer availability check | `true` |
| `SGLANG_SKIP_P2P_CHECK` | Skip P2P (peer-to-peer) access check | `false` |
| `SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD` | Sets the threshold for enabling chunked prefix caching | `8192` |
| `SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION` | Enable RoPE fusion in fused MLA (Multi-head Latent Attention) | `1` |
| `SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP` | Disable overlap schedule for consecutive prefill batches | `false` |
| `SGLANG_SCHEDULER_MAX_RECV_PER_POLL` | Set the maximum number of requests per poll, with a negative value indicating no limit | `-1` |
| `SGLANG_DISABLE_FA4_WARMUP` | Disable Flash Attention 4 warmup passes (set to `1`, `true`, `yes`, or `on` to disable) | `false` |
| `SGLANG_DATA_PARALLEL_BUDGET_INTERVAL` | Interval for DPBudget updates | `1` |
| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT` | Default weight value for the scheduler recv skipper counter (used when the forward mode doesn't match specific modes). Only active when `--scheduler-recv-interval` > 1. The counter accumulates weights and triggers request polling when reaching the interval threshold. | `1000` |
| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE` | Weight increment for decode forward mode in the scheduler recv skipper. Works with `--scheduler-recv-interval` to control polling frequency during the decode phase. | `1` |
| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_TARGET_VERIFY` | Weight increment for target verify forward mode in the scheduler recv skipper. Works with `--scheduler-recv-interval` to control polling frequency during the verification phase. | `1` |
| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE` | Weight increment when the forward mode is None in the scheduler recv skipper. Works with `--scheduler-recv-interval` to control polling frequency when no specific forward mode is active. | `1` |
| `SGLANG_MM_BUFFER_SIZE_MB` | Size of preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. When set to a positive value, temporarily moves features to GPU for faster hash computation, then moves them back to CPU to save GPU memory. Larger features benefit more from GPU hashing. Set to `0` to disable. | `0` |
| `SGLANG_MM_PRECOMPUTE_HASH` | Enable precomputing of hash values for MultimodalDataItem | `false` |
| `SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH` | Enable NCCL for gathering when preparing the MLP sync batch under the overlap scheduler (without this flag, gloo is used for gathering) | `false` |
| `SGLANG_SYMM_MEM_PREALLOC_GB_SIZE` | Size of preallocated GPU buffer (in GB) for the NCCL symmetric memory pool to limit memory fragmentation. Only has an effect when the server argument `--enable-symm-mem` is set. | `-1` |
| `SGLANG_CUSTOM_ALLREDUCE_ALGO` | The algorithm of custom all-reduce. Set to `oneshot` or `1stage` to force one-shot. Set to `twoshot` or `2stage` to force two-shot. | `""` (empty) |
| `SGLANG_SKIP_SOFTMAX_PREFILL_THRESHOLD_SCALE_FACTOR` | Skip-softmax threshold scale factor for TRT-LLM prefill attention in flashinfer. `None` means standard attention. See https://arxiv.org/abs/2512.12087 | `None` |
| `SGLANG_SKIP_SOFTMAX_DECODE_THRESHOLD_SCALE_FACTOR` | Skip-softmax threshold scale factor for TRT-LLM decode attention in flashinfer. `None` means standard attention. See https://arxiv.org/abs/2512.12087 | `None` |
| `SGLANG_USE_SGL_FA3_KERNEL` | Use the sgl-kernel implementation for FlashAttention v3 | `true` |
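
The interaction between `--scheduler-recv-interval` and the `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_*` variables can be pictured with a toy counter. This is a sketch of the behavior described in the table, not SGLang's actual scheduler code: the class name, the string mode names (including `"extend"` as a stand-in for an unlisted mode), and the reset-to-zero step are all assumptions.

```python
class RecvSkipperSketch:
    """Toy model of a weighted polling counter (not SGLang's actual class)."""

    def __init__(self, interval, weights, default_weight=1000):
        self.interval = interval              # stands in for --scheduler-recv-interval
        self.weights = weights                # per-forward-mode weight increments
        self.default_weight = default_weight  # used for modes without a specific weight
        self.counter = 0

    def should_poll(self, forward_mode):
        # Accumulate the weight for this forward mode (or the default weight).
        self.counter += self.weights.get(forward_mode, self.default_weight)
        if self.counter >= self.interval:
            self.counter = 0  # assumed: the counter resets after a poll
            return True
        return False


skipper = RecvSkipperSketch(
    interval=4,
    weights={"decode": 1, "target_verify": 1, None: 1},
)
# With weight 1 and interval 4, only every 4th decode step polls for requests.
decode_polls = [skipper.should_poll("decode") for _ in range(8)]
# A mode without a specific weight adds 1000, so it polls immediately.
prefill_poll = skipper.should_poll("extend")
```

Under these assumptions, a large default weight means polling is only skipped in the modes that have small explicit weights.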
+ + +## DeepGEMM Configuration (Advanced Optimization) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_ENABLE_JIT_DEEPGEMM` | Enable Just-In-Time compilation of DeepGEMM kernels (enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs when the DeepGEMM package is installed; set to `"0"` to disable) | `"true"` |
| `SGLANG_JIT_DEEPGEMM_PRECOMPILE` | Enable precompilation of DeepGEMM kernels | `"true"` |
| `SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS` | Number of workers for parallel DeepGEMM kernel compilation | `4` |
| `SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE` | Indicator flag used during the DeepGEMM precompile script | `"false"` |
| `SGLANG_DG_CACHE_DIR` | Directory for caching compiled DeepGEMM kernels | `~/.cache/deep_gemm` |
| `SGLANG_DG_USE_NVRTC` | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | `"false"` |
| `SGLANG_USE_DEEPGEMM_BMM` | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | `"false"` |
| `SGLANG_JIT_DEEPGEMM_FAST_WARMUP` | Precompile fewer kernels during warmup, which reduces the warmup time from 30 min to less than 3 min. Might cause performance degradation at runtime. | `"false"` |
+ +## DeepEP Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DEEPEP_BF16_DISPATCH` | Use Bfloat16 for dispatch | `"false"` |
| `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | The maximum number of dispatched tokens on each GPU | `"128"` |
| `SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | The maximum number of dispatched tokens on each GPU for `--moe-a2a-backend=flashinfer` | `"1024"` |
| `SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS` | Number of SMs used for DeepEP combine when single batch overlap is enabled | `"32"` |
| `SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO` | Run shared experts on an alternate stream when single batch overlap is enabled on GB200. When this flag is not set, shared experts and down GEMM are overlapped together with DeepEP combine. | `"false"` |
+ +## MORI Configuration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_MORI_DISPATCH_DTYPE` | Override MoRI-EP dispatch quantization type. `auto` uses auto-detection from weight dtype; `bf16`/`fp8`/`fp4` forces the specified type for all layers | `"auto"` |
| `SGLANG_MORI_FP8_COMB` | Use FP8 for combine | `"false"` |
| `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | `4096` |
| `SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD` | Threshold for switching between InterNodeV1 and InterNodeV1LL kernel types. InterNodeV1LL is used if `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` is less than or equal to this threshold; otherwise, InterNodeV1 is used. | `256` |
| `SGLANG_MORI_PREALLOC_MAX_RECV_TOKENS` | Custom number of tokens preallocated for a rank, in place of the maximum derived from `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK`. The valid range is 1 to `world_size * SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK`; the default `0` means the maximum. A smaller value reduces the memory footprint, but too small a value can cause buffer overflow. | `0` |
| `SGLANG_MORI_MOE_MAX_INPUT_TOKENS` | Truncate the dispatch buffer to this many rows before MoE computation, reducing kernel overhead on padding tokens. The value must be >= the actual number of received tokens (`totalRecvTokenNum`); setting it too small causes incorrect results. `0` disables truncation (use full buffer). | `0` |
| `SGLANG_MORI_QP_PER_TRANSFER` | Number of RDMA Queue Pairs (QPs) used per transfer operation | `1` |
| `SGLANG_MORI_POST_BATCH_SIZE` | Number of RDMA work requests posted in a single batch to each QP | `-1` |
| `SGLANG_MORI_NUM_WORKERS` | Number of worker threads in the RDMA executor thread pool | `1` |
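
The kernel-switch threshold and the preallocation range above can be written out as two small functions. This is only a sketch of the documented rules: both function names are hypothetical, and how SGLang actually reads and applies these variables is not shown in this document.

```python
def pick_inter_node_kernel(num_max_dispatch_tokens_per_rank=4096, switch_threshold=256):
    """Hypothetical: apply the documented InterNodeV1 / InterNodeV1LL switch rule."""
    if num_max_dispatch_tokens_per_rank <= switch_threshold:
        return "InterNodeV1LL"
    return "InterNodeV1"


def resolve_prealloc_recv_tokens(prealloc, world_size, num_max_dispatch_tokens_per_rank=4096):
    """Hypothetical: resolve SGLANG_MORI_PREALLOC_MAX_RECV_TOKENS to a token count."""
    maximum = world_size * num_max_dispatch_tokens_per_rank
    if prealloc == 0:
        return maximum  # the default 0 means "use the maximum"
    if not 1 <= prealloc <= maximum:
        raise ValueError(f"prealloc must be in [1, {maximum}], got {prealloc}")
    return prealloc


default_kernel = pick_inter_node_kernel()                # defaults: 4096 tokens vs. threshold 256
small_kernel = pick_inter_node_kernel(256)               # equal to the threshold
default_prealloc = resolve_prealloc_recv_tokens(0, world_size=8)
custom_prealloc = resolve_prealloc_recv_tokens(1024, world_size=8)
```

With the documented defaults (4096 tokens per rank, threshold 256), the rule selects InterNodeV1; the low-latency variant is chosen only when the per-rank token limit is lowered to the threshold or below.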
+ +## NSA Backend Configuration (For DeepSeek V3.2) + +{/* # Environment variable to control mtp precomputing of metadata for multi-step speculative decoding */} + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_NSA_FUSE_TOPK` | Fuse the operation of picking top-k logits and picking top-k indices from the page table | `true` |
| `SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA` | Precompute metadata that can be shared among different draft steps when MTP is enabled | `true` |
| `SGLANG_USE_FUSED_METADATA_COPY` | Control whether to use the fused metadata copy kernel for CUDA graph replay | `true` |
| `SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD` | When the maximum KV length in the current prefill batch exceeds this value, the sparse MLA kernel is applied; otherwise it falls back to the dense MHA implementation. Defaults to the model's index top-k (2048 for DeepSeek V3.2). | `2048` |
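
The prefill threshold rule is a strict comparison on the longest sequence in the batch. A minimal sketch, assuming a hypothetical function name and string labels for the two code paths:

```python
def pick_prefill_attention(kv_lens, threshold=2048):
    """Hypothetical: choose the prefill attention path for one batch."""
    # The sparse MLA kernel is used only when the longest KV length exceeds the threshold.
    if max(kv_lens) > threshold:
        return "sparse_mla"
    return "dense_mha"


short_batch = pick_prefill_attention([512, 2048])  # max equals the threshold
long_batch = pick_prefill_attention([512, 4096])   # max exceeds the threshold
```

Note that a batch whose longest sequence is exactly at the threshold still takes the dense path, since the description says "exceeds".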
+ + +## Memory Management + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DEBUG_MEMORY_POOL` | Enable memory pool debugging | `false` |
| `SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION` | Clip max new tokens estimation for memory planning | `4096` |
| `SGLANG_DETOKENIZER_MAX_STATES` | Maximum states for detokenizer | Default value based on system |
| `SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK` | Enable checks for memory imbalance across Tensor Parallel ranks | `true` |
| `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` | Configure the custom memory pool type for Mooncake. Supports `NVLINK`, `BAREX`, `INTRA_NODE_NVLINK`. If set to `true`, it defaults to `NVLINK`. | `None` |
+ +## Model-Specific Options + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_AITER` | Use the AITER-optimized implementation | `false` |
| `SGLANG_MOE_PADDING` | Enable MoE padding (sets padding size to 128 if value is `1`, often set to `1` in Docker builds) | `false` |
| `SGLANG_CUTLASS_MOE` (deprecated) | Use Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated, use `--moe-runner-backend=cutlass`) | `false` |
+ +## Quantization + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_INT4_WEIGHT | Enable INT4 weight quantization | false |
| SGLANG_FORCE_FP8_MARLIN | Force using FP8 MARLIN kernels even if other FP8 kernels are available | false |
| SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN | Quantize q_b_proj from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | false |
| SGLANG_MOE_NVFP4_DISPATCH | Use nvfp4 for moe dispatch (on flashinfer_cutlass or flashinfer_cutedsl moe runner backend) | "false" |
| SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE | Quantize moe of nextn layer from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | `false` |
| SGLANG_QUANT_ALLOW_DOWNCASTING | Allow weight dtype downcasting during loading (e.g., fp32 → fp16). By default, SGLang rejects this kind of downcasting when using quantization. | `false` |
| SGLANG_FP8_IGNORED_LAYERS | A comma-separated list of layer names to ignore during FP8 quantization. For example: model.layers.0,model.layers.1.,qkv_proj. | "" |
+ + +## Distributed Computing + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_BLOCK_NONZERO_RANK_CHILDREN` | Control blocking of non-zero rank children processes | `1` |
| `SGLANG_IS_FIRST_RANK_ON_NODE` | Indicates if the current process is the first rank on its node | `"true"` |
| `SGLANG_PP_LAYER_PARTITION` | Pipeline parallel layer partition specification | Not set |
| `SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS` | Set one visible device per process for distributed computing | `false` |
+ +## PD Disaggregation — Staging Buffer (Heterogeneous TP) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DISAGG_STAGING_BUFFER | Enable GPU staging buffer for heterogeneous TP KV transfer. Required when prefill and decode use different TP/attention-TP sizes. Only for non-MLA models (e.g. GQA, MHA). | false |
| SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB | Prefill-side per-worker staging buffer size in MB. Used for gathering KV head slices before bulk RDMA transfer. | 64 |
| SGLANG_DISAGG_STAGING_POOL_SIZE_MB | Decode-side ring buffer pool total size in MB. Shared buffer receiving RDMA data from all prefill ranks. Larger values support higher concurrency. | 4096 |
| SGLANG_STAGING_USE_TORCH | Force using PyTorch gather/scatter fallback instead of Triton fused kernels for staging operations. Useful for debugging. | false |
+ +## Testing & Debugging (Internal/CI) + +*These variables are primarily used for internal testing, continuous integration, or debugging.* + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_IS_IN_CI | Indicates if running in CI environment | false |
| SGLANG_IS_IN_CI_AMD | Indicates running in AMD CI environment | false |
| SGLANG_TEST_RETRACT | Enable retract decode testing | `false` |
| SGLANG_TEST_RETRACT_NO_PREFILL_BS | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds SGLANG_TEST_RETRACT_NO_PREFILL_BS. | 2 ** 31 |
| SGLANG_RECORD_STEP_TIME | Record step time for profiling | `false` |
| SGLANG_TEST_REQUEST_TIME_STATS | Test request time statistics | `false` |
| SGLANG_DEBUG_SYMM_MEM | Enable debug checks that verify tensors passed to NCCL communication ops are allocated in the symmetric memory pool. Logs warnings (rank 0 only) with stack traces for any tensor not in the pool. | `false` |
| SGLANG_KERNEL_API_LOGLEVEL | Controls crash-debug kernel API logging. 0 disables logging, 1 logs API names, 3 logs tensor metadata, 5 adds tensor statistics, and 10 also writes pre-call dump snapshots. | 0 |
| SGLANG_KERNEL_API_LOGDEST | Destination for crash-debug kernel API logs. Use stdout, stderr, or a file path. %i is replaced with the process PID. | stdout |
| SGLANG_KERNEL_API_DUMP_DIR | Output directory for level-10 kernel API input/output dumps. %i is replaced with the process PID. | sglang_kernel_api_dumps |
| SGLANG_KERNEL_API_DUMP_INCLUDE | Comma-separated wildcard patterns for kernel API names to include in level-10 dumps. | Not set |
| SGLANG_KERNEL_API_DUMP_EXCLUDE | Comma-separated wildcard patterns for kernel API names to exclude from level-10 dumps. | Not set |
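
The `%i` substitution and the log levels described for the `SGLANG_KERNEL_API_*` variables can be pictured with a small sketch. Both helpers below are hypothetical and exist only for illustration. In particular, treating the levels as cumulative thresholds (so that, say, level 4 behaves like level 3) is an assumption; the table only documents the values 0, 1, 3, 5, and 10.

```python
import os


def resolve_pid_placeholder(template, pid=None):
    """Hypothetical helper: replace every %i in a path template with a PID."""
    pid = os.getpid() if pid is None else pid
    return template.replace("%i", str(pid))


def enabled_log_features(level):
    """Hypothetical helper: list what a given log level would turn on.

    Assumption: each documented level also enables everything below it.
    """
    thresholds = [
        (1, "api_names"),
        (3, "tensor_metadata"),
        (5, "tensor_statistics"),
        (10, "pre_call_dumps"),
    ]
    return [name for minimum, name in thresholds if level >= minimum]
```

For example, a destination such as `/tmp/kernel_api_%i.log` would give each process its own log file, which avoids interleaved output when several ranks log at once.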
+ +## Profiling & Benchmarking + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_TORCH_PROFILER_DIR` | Directory for PyTorch profiler output | `/tmp` |
| `SGLANG_PROFILE_WITH_STACK` | Set `with_stack` option (bool) for PyTorch profiler (capture stack trace) | `true` |
| `SGLANG_PROFILE_RECORD_SHAPES` | Set `record_shapes` option (bool) for PyTorch profiler (record shapes) | `true` |
| `SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS` | Config BatchSpanProcessor.schedule_delay_millis if tracing is enabled | `500` |
| `SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE` | Config BatchSpanProcessor.max_export_batch_size if tracing is enabled | `64` |
+ +## Storage & Caching + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_WAIT_WEIGHTS_READY_TIMEOUT | Timeout period for waiting on weights | 120 |
| SGLANG_DISABLE_OUTLINES_DISK_CACHE | Disable Outlines disk cache | false |
| SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE | Use SGLang's custom Triton kernel cache implementation for lower overheads (automatically enabled on CUDA) | false |
| SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE | Decode-side incremental KV cache offload stride. Rounded down to a multiple of --page-size (min is --page-size). If unset/invalid/<=0, it falls back to --page-size. | Not set (uses --page-size) |
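
The rounding rule for `SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE` is easier to follow with concrete numbers. The function below is a minimal sketch of the rule as described, not SGLang's implementation; the name `effective_offload_stride` is hypothetical.

```python
def effective_offload_stride(raw_value, page_size):
    """Hypothetical helper: derive the stride from the raw env value.

    Unset, non-integer, or non-positive values fall back to page_size.
    Valid values are rounded down to a multiple of page_size, with
    page_size itself as the minimum.
    """
    try:
        stride = int(raw_value)
    except (TypeError, ValueError):
        return page_size
    if stride <= 0:
        return page_size
    rounded = (stride // page_size) * page_size
    return max(page_size, rounded)
```

With a page size of 64, a requested stride of 200 becomes 192, and a requested stride of 10 is raised to 64.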
+ + +## Function Calling / Tool Use + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_TOOL_STRICT_LEVEL | Controls the strictness level of tool call parsing and validation. <br>Level 0: Off - No strict validation <br>Level 1: Function strict - Enables structural tag constraints for all tools (even if none have strict=True set) <br>Level 2: Parameter strict - Enforces strict parameter validation for all tools, treating them as if they all have strict=True set | 0 |
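
The three strictness levels of `SGLANG_TOOL_STRICT_LEVEL` are ordered, so one way to picture them is as an integer enum. This is a sketch under stated assumptions: `ToolStrictLevel` and `read_tool_strict_level` are hypothetical names, and how SGLang parses the variable internally is not shown in this document.

```python
import os
from enum import IntEnum


class ToolStrictLevel(IntEnum):
    """Hypothetical enum mirroring the documented levels."""

    OFF = 0        # no strict validation
    FUNCTION = 1   # structural tag constraints for all tools
    PARAMETER = 2  # strict parameter validation for all tools


def read_tool_strict_level(environ=None):
    """Hypothetical helper: read the level, defaulting to 0 as documented."""
    environ = os.environ if environ is None else environ
    return ToolStrictLevel(int(environ.get("SGLANG_TOOL_STRICT_LEVEL", "0")))
```

Because the enum is ordered, a check such as `level >= ToolStrictLevel.FUNCTION` covers both level 1 and level 2.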
diff --git a/docs_new/docs/references/faq.mdx b/docs_new/docs/references/faq.mdx new file mode 100644 index 000000000000..a2cb48695143 --- /dev/null +++ b/docs_new/docs/references/faq.mdx @@ -0,0 +1,42 @@ +--- +title: "Troubleshooting and Frequently Asked Questions" +metatags: + description: "SGLang troubleshooting: CUDA OOM, illegal memory access, server hangs, non-deterministic outputs." +--- +## Troubleshooting + +This page lists common errors and tips for resolving them. + +### CUDA Out of Memory +If you encounter out-of-memory (OOM) errors, you can adjust the following parameters: + +- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts. +- If OOM occurs during decoding, try lowering `--max-running-requests`. +- You can also decrease `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput. +- Another common case for OOM is requesting input logprobs for a long prompt as it requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`. + +### CUDA Error: Illegal Memory Access Encountered +This error may result from kernel errors or out-of-memory issues: +- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub. +- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues. + +### The server hangs +- If the server hangs during initialization or running, it can be memory issues (out of memory), network issues (nccl errors), or other bugs in sglang. 
+ - If it is out of memory, you might see that `avail mem` is very low during the initialization or right after initialization. In this case, + you can try to decrease `--mem-fraction-static`, decrease `--cuda-graph-max-bs`, or decrease `--chunked-prefill-size`. +- For other bugs, please file an issue on GitHub. + + +## Frequently Asked Questions + +### The results are not deterministic, even with a temperature of 0 + +You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0. + +From our initial investigation, this nondeterminism arises from two factors: dynamic batching and prefix caching. Roughly speaking, dynamic batching accounts for about 95% of the nondeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/cuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. These differences accumulate across many layers, resulting in nondeterministic output when the batch size changes. Similarly, enabling prefix caching can also cause dispatch to different kernels. Even when the computations are mathematically equivalent, small numerical differences between kernel implementations lead to nondeterministic final outputs. + +To achieve more deterministic outputs in the current code, you can add `--disable-radix-cache` and send only one request at a time. The results will be mostly deterministic under this setting. + +**Update**: +We recently introduced a deterministic mode, which you can enable with `--enable-deterministic-inference`.
+Please find more details in this blog post: https://lmsys.org/blog/2025-09-22-sglang-deterministic/ diff --git a/docs_new/docs/references/frontend/choices_methods.mdx b/docs_new/docs/references/frontend/choices_methods.mdx new file mode 100644 index 000000000000..6bfa8747c16b --- /dev/null +++ b/docs_new/docs/references/frontend/choices_methods.mdx @@ -0,0 +1,81 @@ +--- +title: "Choices Methods in SGLang" +metatags: + description: "SGLang choices methods: token_length_normalized, greedy_token_selection, unconditional_likelihood_normalized." +--- +This doc describes the choices methods supported by SGLang. + +The optional `choices_method` arg determines how options supplied to SGLang's `choices` primitive are selected. Only the `RuntimeEndpoint` backend supports the `choices_method` arg. Other backends, such as `OpenAI`, have bespoke selection implementations due to API limitations. + +## Methods + +### Token Length Normalized + +Token length normalized is the default SGLang choices method. It selects the option with the highest average logprob across all of its tokens. + +Usage example (alternatively, simply omit the `choices_method` arg): +```python Example +@sgl.function +def example(s): + s += sgl.user("What is the capital of France?") + s += sgl.assistant( + sgl.gen( + "answer", + choices=["London", "Paris", "Berlin"], + choices_method=sgl.token_length_normalized, + ) + ) +``` + + +This can perform poorly if an option contains many tokens, where its later tokens are predicted with high confidence based on its earlier tokens. For instance, even strong models will fail the above example if the specified options are `["Paris", "Antidisestablishmentarianism"]`. + +### Greedy Token Selection + +Greedy token selection simply selects the option with the highest logprob for its initial token. 
For overlapping options where one option is a subset of a longer option, the logprobs of the shorter option are extended using its average logprob for comparison against the longer option. + +Usage example: +```python Example +@sgl.function +def example(s): + s += sgl.user("What is the capital of France?") + s += sgl.assistant( + sgl.gen( + "answer", + choices=["London", "Paris", "Berlin"], + choices_method=sgl.greedy_token_selection, + ) + ) +``` + +This can perform poorly if an option misleads the model down a bad path based on an attractive initial token. For instance, greedy selection will result in an incorrect response for this example: +```python Example +@sgl.function +def us_president_example(s): + s += sgl.user("Name a US president.") + s += sgl.assistant( + sgl.gen( + "answer", + choices=["Donald Duck", "Millard Fillmore"], + choices_method=sgl.greedy_token_selection, + ) + ) +``` + +### Unconditional Likelihood Normalized + +Unconditional likelihood normalized selects the option with the highest average token logprob once normalized by the unconditional token logprobs, as described in [this EleutherAI blogpost](https://blog.eleuther.ai/multiple-choice-normalization/). This method incurs an additional LLM call to obtain the unconditional likelihoods. + +Usage example: +```python Example +@sgl.function +def example(s): + s += sgl.user("What is the capital of France?") + s += sgl.assistant( + sgl.gen( + "answer", + choices=["London", "Paris", "Berlin"], + choices_method=sgl.unconditional_likelihood_normalized, + ) + ) +``` diff --git a/docs_new/docs/references/frontend/frontend_index.mdx b/docs_new/docs/references/frontend/frontend_index.mdx new file mode 100644 index 000000000000..0e5daf0c9bd4 --- /dev/null +++ b/docs_new/docs/references/frontend/frontend_index.mdx @@ -0,0 +1,7 @@ +--- +title: "Frontend Language" +metatags: + description: "SGLang frontend language documentation: tutorials and choices methods reference." 
+--- +- [Frontend Tutorial](./frontend_tutorial) +- [Choices Methods](./choices_methods) diff --git a/docs_new/docs/references/frontend/frontend_index.rst b/docs_new/docs/references/frontend/frontend_index.rst new file mode 100644 index 000000000000..62544cba5987 --- /dev/null +++ b/docs_new/docs/references/frontend/frontend_index.rst @@ -0,0 +1,9 @@ +Frontend Language +================= + +.. toctree:: + :maxdepth: 1 + :caption: Frontend Language + + frontend_tutorial.ipynb + choices_methods.md diff --git a/docs_new/docs/references/frontend/frontend_tutorial.ipynb b/docs_new/docs/references/frontend/frontend_tutorial.ipynb new file mode 100644 index 000000000000..9c4da052c397 --- /dev/null +++ b/docs_new/docs/references/frontend/frontend_tutorial.ipynb @@ -0,0 +1,456 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# SGLang Frontend Language" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "SGLang frontend language can be used to define simple and easy prompts in a convenient, structured way." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Launch A Server\n", + "\n", + "Launch the server in your terminal and wait for it to initialize." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sglang import assistant_begin, assistant_end\n", + "from sglang import assistant, function, gen, system, user\n", + "from sglang import image\n", + "from sglang import RuntimeEndpoint\n", + "from sglang.lang.api import set_default_backend\n", + "from sglang.srt.utils import load_image\n", + "from sglang.test.doc_patch import launch_server_cmd\n", + "from sglang.utils import print_highlight, terminate_process, wait_for_server\n", + "\n", + "server_process, port = launch_server_cmd(\n", + " \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --log-level warning\"\n", + ")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", + "print(f\"Server started on http://localhost:{port}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set the default backend. Note: Besides the local server, you may also use `OpenAI` or other API endpoints." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "set_default_backend(RuntimeEndpoint(f\"http://localhost:{port}\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Basic Usage\n", + "\n", + "The simplest way to use the SGLang frontend language is a question-answer dialog between a user and an assistant."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def basic_qa(s, question):\n", + " s += system(f\"You are a helpful assistant that can answer questions.\")\n", + " s += user(question)\n", + " s += assistant(gen(\"answer\", max_tokens=512))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "state = basic_qa(\"List 3 countries and their capitals.\")\n", + "print_highlight(state[\"answer\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Multi-turn Dialog\n", + "\n", + "SGLang frontend language can also be used to define multi-turn dialogs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def multi_turn_qa(s):\n", + " s += system(f\"You are a helpful assistant that can answer questions.\")\n", + " s += user(\"Please give me a list of 3 countries and their capitals.\")\n", + " s += assistant(gen(\"first_answer\", max_tokens=512))\n", + " s += user(\"Please give me another list of 3 countries and their capitals.\")\n", + " s += assistant(gen(\"second_answer\", max_tokens=512))\n", + " return s\n", + "\n", + "\n", + "state = multi_turn_qa()\n", + "print_highlight(state[\"first_answer\"])\n", + "print_highlight(state[\"second_answer\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Control flow\n", + "\n", + "You may use any Python code within the function to define more complex control flows." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def tool_use(s, question):\n", + " s += assistant(\n", + " \"To answer this question: \"\n", + " + question\n", + " + \". I need to use a \"\n", + " + gen(\"tool\", choices=[\"calculator\", \"search engine\"])\n", + " + \".
\"\n", + " )\n", + "\n", + " if s[\"tool\"] == \"calculator\":\n", + " s += assistant(\"The math expression is: \" + gen(\"expression\"))\n", + " elif s[\"tool\"] == \"search engine\":\n", + " s += assistant(\"The key word to search is: \" + gen(\"word\"))\n", + "\n", + "\n", + "state = tool_use(\"What is 2 * 2?\")\n", + "print_highlight(state[\"tool\"])\n", + "print_highlight(state[\"expression\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parallelism\n", + "\n", + "Use `fork` to launch parallel prompts. Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def tip_suggestion(s):\n", + " s += assistant(\n", + " \"Here are two tips for staying healthy: \"\n", + " \"1. Balanced Diet. 2. Regular Exercise.\\n\\n\"\n", + " )\n", + "\n", + " forks = s.fork(2)\n", + " for i, f in enumerate(forks):\n", + " f += assistant(\n", + " f\"Now, expand tip {i+1} into a paragraph:\\n\"\n", + " + gen(\"detailed_tip\", max_tokens=256, stop=\"\\n\\n\")\n", + " )\n", + "\n", + " s += assistant(\"Tip 1:\" + forks[0][\"detailed_tip\"] + \"\\n\")\n", + " s += assistant(\"Tip 2:\" + forks[1][\"detailed_tip\"] + \"\\n\")\n", + " s += assistant(\n", + " \"To summarize the above two tips, I can say:\\n\" + gen(\"summary\", max_tokens=512)\n", + " )\n", + "\n", + "\n", + "state = tip_suggestion()\n", + "print_highlight(state[\"summary\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Constrained Decoding\n", + "\n", + "Use `regex` to specify a regular expression as a decoding constraint. This is only supported for local models." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def regular_expression_gen(s):\n", + " s += user(\"What is the IP address of the Google DNS servers?\")\n", + " s += assistant(\n", + " gen(\n", + " \"answer\",\n", + " temperature=0,\n", + " regex=r\"((25[0-5]|2[0-4]\\d|[01]?\\d\\d?).){3}(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\",\n", + " )\n", + " )\n", + "\n", + "\n", + "state = regular_expression_gen()\n", + "print_highlight(state[\"answer\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use `regex` to define a `JSON` decoding schema." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "character_regex = (\n", + " r\"\"\"\\{\\n\"\"\"\n", + " + r\"\"\" \"name\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n", + " + r\"\"\" \"house\": \"(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)\",\\n\"\"\"\n", + " + r\"\"\" \"blood status\": \"(Pure-blood|Half-blood|Muggle-born)\",\\n\"\"\"\n", + " + r\"\"\" \"occupation\": \"(student|teacher|auror|ministry of magic|death eater|order of the phoenix)\",\\n\"\"\"\n", + " + r\"\"\" \"wand\": \\{\\n\"\"\"\n", + " + r\"\"\" \"wood\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n", + " + r\"\"\" \"core\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n", + " + r\"\"\" \"length\": [0-9]{1,2}\\.[0-9]{0,2}\\n\"\"\"\n", + " + r\"\"\" \\},\\n\"\"\"\n", + " + r\"\"\" \"alive\": \"(Alive|Deceased)\",\\n\"\"\"\n", + " + r\"\"\" \"patronus\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n", + " + r\"\"\" \"bogart\": \"[\\w\\d\\s]{1,16}\"\\n\"\"\"\n", + " + r\"\"\"\\}\"\"\"\n", + ")\n", + "\n", + "\n", + "@function\n", + "def character_gen(s, name):\n", + " s += user(\n", + " f\"{name} is a character in Harry Potter. 
Please fill in the following information about this character.\"\n", + " )\n", + " s += assistant(gen(\"json_output\", max_tokens=256, regex=character_regex))\n", + "\n", + "\n", + "state = character_gen(\"Harry Potter\")\n", + "print_highlight(state[\"json_output\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Batching \n", + "\n", + "Use `run_batch` to run a batch of prompts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def text_qa(s, question):\n", + " s += user(question)\n", + " s += assistant(gen(\"answer\", stop=\"\\n\"))\n", + "\n", + "\n", + "states = text_qa.run_batch(\n", + " [\n", + " {\"question\": \"What is the capital of the United Kingdom?\"},\n", + " {\"question\": \"What is the capital of France?\"},\n", + " {\"question\": \"What is the capital of Japan?\"},\n", + " ],\n", + " progress_bar=True,\n", + ")\n", + "\n", + "for i, state in enumerate(states):\n", + " print_highlight(f\"Answer {i+1}: {states[i]['answer']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Streaming \n", + "\n", + "Use `stream` to stream the output to the user." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def text_qa(s, question):\n", + " s += user(question)\n", + " s += assistant(gen(\"answer\", stop=\"\\n\"))\n", + "\n", + "\n", + "state = text_qa.run(\n", + " question=\"What is the capital of France?\", temperature=0.1, stream=True\n", + ")\n", + "\n", + "for out in state.text_iter():\n", + " print(out, end=\"\", flush=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Complex Prompts\n", + "\n", + "You may use `{system|user|assistant}_{begin|end}` to define complex prompts." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def chat_example(s):\n", + " s += system(\"You are a helpful assistant.\")\n", + " # Same as: s += s.system(\"You are a helpful assistant.\")\n", + "\n", + " with s.user():\n", + " s += \"Question: What is the capital of France?\"\n", + "\n", + " s += assistant_begin()\n", + " s += \"Answer: \" + gen(\"answer\", max_tokens=100, stop=\"\\n\")\n", + " s += assistant_end()\n", + "\n", + "\n", + "state = chat_example()\n", + "print_highlight(state[\"answer\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Multi-modal Generation\n", + "\n", + "You may use SGLang frontend language to define multi-modal prompts.\n", + "See [here](https://docs.sglang.io/supported_models/text_generation/multimodal_language_models.html) for supported models." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "server_process, port = launch_server_cmd(\n", + " \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --log-level warning\"\n", + ")\n", + "\n", + "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n", + "print(f\"Server started on http://localhost:{port}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "set_default_backend(RuntimeEndpoint(f\"http://localhost:{port}\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ask a question about an image." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@function\n", + "def image_qa(s, image_file, question):\n", + " s += user(image(image_file) + question)\n", + " s += assistant(gen(\"answer\", max_tokens=256))\n", + "\n", + "\n", + "image_url = \"https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png\"\n", + "image_bytes, _ = load_image(image_url)\n", + "state = image_qa(image_bytes, \"What is in the image?\")\n", + "print_highlight(state[\"answer\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "terminate_process(server_process)" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs_new/docs/references/frontend/frontend_tutorial.mdx b/docs_new/docs/references/frontend/frontend_tutorial.mdx new file mode 100644 index 000000000000..7d09ca0a5d91 --- /dev/null +++ b/docs_new/docs/references/frontend/frontend_tutorial.mdx @@ -0,0 +1,287 @@ +--- +title: "SGLang Frontend Language" +metatags: + description: "SGLang frontend tutorial: multi-turn dialog, fork parallelism, regex constraints, batching, streaming." +--- +SGLang frontend language can be used to define simple and easy prompts in a convenient, structured way. + +## Launch A Server + +Launch the server in your terminal and wait for it to initialize. 
+ +```python Example +from sglang import assistant_begin, assistant_end +from sglang import assistant, function, gen, system, user +from sglang import image +from sglang import RuntimeEndpoint +from sglang.lang.api import set_default_backend +from sglang.srt.utils import load_image +from sglang.test.doc_patch import launch_server_cmd +from sglang.utils import print_highlight, terminate_process, wait_for_server + +server_process, port = launch_server_cmd( + "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --log-level warning" +) + +wait_for_server(f"http://localhost:{port}", process=server_process) +print(f"Server started on http://localhost:{port}") +``` + +Set the default backend. Note: Besides the local server, you may also use `OpenAI` or other API endpoints. + +```python Example +set_default_backend(RuntimeEndpoint(f"http://localhost:{port}")) +``` + +## Basic Usage + +The simplest way to use the SGLang frontend language is a question-answer dialog between a user and an assistant. + +```python Example +@function +def basic_qa(s, question): + s += system(f"You are a helpful assistant that can answer questions.") + s += user(question) + s += assistant(gen("answer", max_tokens=512)) +``` + +```python Example +state = basic_qa("List 3 countries and their capitals.") +print_highlight(state["answer"]) +``` + +## Multi-turn Dialog + +SGLang frontend language can also be used to define multi-turn dialogs.
+ +```python Example +@function +def multi_turn_qa(s): + s += system(f"You are a helpful assistant that can answer questions.") + s += user("Please give me a list of 3 countries and their capitals.") + s += assistant(gen("first_answer", max_tokens=512)) + s += user("Please give me another list of 3 countries and their capitals.") + s += assistant(gen("second_answer", max_tokens=512)) + return s + + +state = multi_turn_qa() +print_highlight(state["first_answer"]) +print_highlight(state["second_answer"]) +``` + +## Control flow + +You may use any Python code within the function to define more complex control flows. + +```python Example +@function +def tool_use(s, question): + s += assistant( + "To answer this question: " + + question + + ". I need to use a " + + gen("tool", choices=["calculator", "search engine"]) + + ". " + ) + + if s["tool"] == "calculator": + s += assistant("The math expression is: " + gen("expression")) + elif s["tool"] == "search engine": + s += assistant("The key word to search is: " + gen("word")) + + +state = tool_use("What is 2 * 2?") +print_highlight(state["tool"]) +print_highlight(state["expression"]) +``` + +## Parallelism + +Use `fork` to launch parallel prompts. Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel. + +```python Example +@function +def tip_suggestion(s): + s += assistant( + "Here are two tips for staying healthy: " + "1. Balanced Diet. 2.
Regular Exercise.\n\n" + ) + + forks = s.fork(2) + for i, f in enumerate(forks): + f += assistant( + f"Now, expand tip {i+1} into a paragraph:\n" + + gen("detailed_tip", max_tokens=256, stop="\n\n") + ) + + s += assistant("Tip 1:" + forks[0]["detailed_tip"] + "\n") + s += assistant("Tip 2:" + forks[1]["detailed_tip"] + "\n") + s += assistant( + "To summarize the above two tips, I can say:\n" + gen("summary", max_tokens=512) + ) + + +state = tip_suggestion() +print_highlight(state["summary"]) +``` + +## Constrained Decoding + +Use `regex` to specify a regular expression as a decoding constraint. This is only supported for local models. + +```python Example +@function +def regular_expression_gen(s): + s += user("What is the IP address of the Google DNS servers?") + s += assistant( + gen( + "answer", + temperature=0, + regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)", + ) + ) + + +state = regular_expression_gen() +print_highlight(state["answer"]) +``` + +Use `regex` to define a `JSON` decoding schema. + +```python Example +character_regex = ( + r"""\{\n""" + + r""" "name": "[\w\d\s]{1,16}",\n""" + + r""" "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n""" + + r""" "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n""" + + r""" "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n""" + + r""" "wand": \{\n""" + + r""" "wood": "[\w\d\s]{1,16}",\n""" + + r""" "core": "[\w\d\s]{1,16}",\n""" + + r""" "length": [0-9]{1,2}\.[0-9]{0,2}\n""" + + r""" \},\n""" + + r""" "alive": "(Alive|Deceased)",\n""" + + r""" "patronus": "[\w\d\s]{1,16}",\n""" + + r""" "bogart": "[\w\d\s]{1,16}"\n""" + + r"""\}""" +) + + +@function +def character_gen(s, name): + s += user( + f"{name} is a character in Harry Potter. Please fill in the following information about this character." 
+ ) + s += assistant(gen("json_output", max_tokens=256, regex=character_regex)) + + +state = character_gen("Harry Potter") +print_highlight(state["json_output"]) +``` + +## Batching + +Use `run_batch` to run a batch of prompts. + +```python Example +@function +def text_qa(s, question): + s += user(question) + s += assistant(gen("answer", stop="\n")) + + +states = text_qa.run_batch( + [ + {"question": "What is the capital of the United Kingdom?"}, + {"question": "What is the capital of France?"}, + {"question": "What is the capital of Japan?"}, + ], + progress_bar=True, +) + +for i, state in enumerate(states): + print_highlight(f"Answer {i+1}: {states[i]['answer']}") +``` + +## Streaming + +Use `stream` to stream the output to the user. + +```python Example +@function +def text_qa(s, question): + s += user(question) + s += assistant(gen("answer", stop="\n")) + + +state = text_qa.run( + question="What is the capital of France?", temperature=0.1, stream=True +) + +for out in state.text_iter(): + print(out, end="", flush=True) +``` + +## Complex Prompts + +You may use `{system|user|assistant}_{begin|end}` to define complex prompts. + +```python Example +@function +def chat_example(s): + s += system("You are a helpful assistant.") + # Same as: s += s.system("You are a helpful assistant.") + + with s.user(): + s += "Question: What is the capital of France?" + + s += assistant_begin() + s += "Answer: " + gen("answer", max_tokens=100, stop="\n") + s += assistant_end() + + +state = chat_example() +print_highlight(state["answer"]) +``` + +```python Example +terminate_process(server_process) +``` + +## Multi-modal Generation + +You may use SGLang frontend language to define multi-modal prompts. +See [here](../../supported-models/multimodal_language_models) for supported models. 
+ +```python Example +server_process, port = launch_server_cmd( + "python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --log-level warning" +) + +wait_for_server(f"http://localhost:{port}", process=server_process) +print(f"Server started on http://localhost:{port}") +``` + +```python Example +set_default_backend(RuntimeEndpoint(f"http://localhost:{port}")) +``` + +Ask a question about an image. + +```python Example +@function +def image_qa(s, image_file, question): + s += user(image(image_file) + question) + s += assistant(gen("answer", max_tokens=256)) + + +image_url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png" +image_bytes, _ = load_image(image_url) +state = image_qa(image_bytes, "What is in the image?") +print_highlight(state["answer"]) +``` + +```python Example +terminate_process(server_process) +``` diff --git a/docs_new/docs/references/multi_node_deployment/deploy_on_k8s.mdx b/docs_new/docs/references/multi_node_deployment/deploy_on_k8s.mdx new file mode 100644 index 000000000000..e2070fd1caef --- /dev/null +++ b/docs_new/docs/references/multi_node_deployment/deploy_on_k8s.mdx @@ -0,0 +1,340 @@ +--- +title: "Deploy On Kubernetes" +metatags: + description: "SGLang on Kubernetes with LWS: DeepSeek-R1 multi-node deployment, RoCE RDMA setup, NCCL debugging." +--- +This document is for deploying a RoCE network-based SGLang two-node inference service on a Kubernetes (K8S) cluster. + +[LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is for multi-host/multi-node distributed inference. + +SGLang can also be deployed with LWS on Kubernetes for distributed model serving. + +Please see this guide for more details on deploying SGLang on Kubernetes using LWS. + +Here we take the deployment of DeepSeek-R1 as an example. + +## Prerequisites + +1. 
At least two Kubernetes nodes, each with two H20 systems and eight GPUs, are required. + +2. Make sure your K8S cluster has LWS correctly installed. If it hasn't been set up yet, please follow the [installation instructions](https://github.com/kubernetes-sigs/lws/blob/main/site/content/en/docs/installation/_index). **Note:** For LWS versions ≤0.5.x, you must use the Downward API to obtain `LWS_WORKER_INDEX`, as native support for this feature was introduced in v0.6.0. + +## Basic example + +For the basic example documentation, refer to [Deploy Distributed Inference Service with SGLang and LWS on GPUs](https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/sglang). + +However, that document only covers the basic NCCL socket mode. + +In this section, we’ll make some simple modifications to adapt the setup to the RDMA scenario. + +## RDMA RoCE case + +* Check your env: + +```bash Command +[root@node1 ~]# ibstatus +Infiniband device 'mlx5_bond_0' port 1 status: + default gid: fe80:0000:0000:0000:0225:9dff:fe64:c79a + base lid: 0x0 + sm lid: 0x0 + state: 4: ACTIVE + phys state: 5: LinkUp + rate: 200 Gb/sec (2X NDR) + link_layer: Ethernet + +Infiniband device 'mlx5_bond_1' port 1 status: + default gid: fe80:0000:0000:0000:0225:9dff:fe6e:c3ec + base lid: 0x0 + sm lid: 0x0 + state: 4: ACTIVE + phys state: 5: LinkUp + rate: 200 Gb/sec (2X NDR) + link_layer: Ethernet + +Infiniband device 'mlx5_bond_2' port 1 status: + default gid: fe80:0000:0000:0000:0225:9dff:fe73:0dd7 + base lid: 0x0 + sm lid: 0x0 + state: 4: ACTIVE + phys state: 5: LinkUp + rate: 200 Gb/sec (2X NDR) + link_layer: Ethernet + +Infiniband device 'mlx5_bond_3' port 1 status: + default gid: fe80:0000:0000:0000:0225:9dff:fe36:f7ff + base lid: 0x0 + sm lid: 0x0 + state: 4: ACTIVE + phys state: 5: LinkUp + rate: 200 Gb/sec (2X NDR) + link_layer: Ethernet +``` + +* Prepare the `lws.yaml` file for deploying on k8s. 
+ +```yaml Config +apiVersion: leaderworkerset.x-k8s.io/v1 +kind: LeaderWorkerSet +metadata: + name: sglang +spec: + replicas: 1 + leaderWorkerTemplate: + size: 2 + restartPolicy: RecreateGroupOnPodRestart + leaderTemplate: + metadata: + labels: + role: leader + spec: + dnsPolicy: ClusterFirstWithHostNet + hostNetwork: true + hostIPC: true + containers: + - name: sglang-leader + image: sglang:latest + securityContext: + privileged: true + env: + - name: NCCL_IB_GID_INDEX + value: "3" + command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --mem-fraction-static + - "0.93" + - --torch-compile-max-bs + - "8" + - --max-running-requests + - "20" + - --tp + - "16" # Size of Tensor Parallelism + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20000 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --host + - "0.0.0.0" + - --port + - "40000" + resources: + limits: + nvidia.com/gpu: "8" + ports: + - containerPort: 40000 + readinessProbe: + tcpSocket: + port: 40000 + initialDelaySeconds: 15 + periodSeconds: 10 + volumeMounts: + - mountPath: /dev/shm + name: dshm + - name: model + mountPath: /work/models + - name: ib + mountPath: /dev/infiniband + volumes: + - name: dshm + emptyDir: + medium: Memory + - name: model + hostPath: + path: '< your models dir >' # modify it according your models dir + - name: ib + hostPath: + path: /dev/infiniband + workerTemplate: + spec: + dnsPolicy: ClusterFirstWithHostNet + hostNetwork: true + hostIPC: true + containers: + - name: sglang-worker + image: sglang:latest + securityContext: + privileged: true + env: + - name: NCCL_IB_GID_INDEX + value: "3" + command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --mem-fraction-static + - "0.93" + - --torch-compile-max-bs + - "8" + - --max-running-requests + - "20" + - --tp + - "16" # Size of Tensor Parallelism + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20000 + - --nnodes + - 
$(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + resources: + limits: + nvidia.com/gpu: "8" + volumeMounts: + - mountPath: /dev/shm + name: dshm + - name: model + mountPath: /work/models + - name: ib + mountPath: /dev/infiniband + volumes: + - name: dshm + emptyDir: + medium: Memory + - name: ib + hostPath: + path: /dev/infiniband + - name: model + hostPath: + path: /data1/models/deepseek_v3_moe +--- +apiVersion: v1 +kind: Service +metadata: + name: sglang-leader +spec: + selector: + leaderworkerset.sigs.k8s.io/name: sglang + role: leader + ports: + - protocol: TCP + port: 40000 + targetPort: 40000 + +``` + +* Then run `kubectl apply -f lws.yaml`. Listing the pods (for example with `kubectl get pods`) should show output like this: + +```text Output +NAME READY STATUS RESTARTS AGE +sglang-0 0/1 Running 0 9s +sglang-0-1 1/1 Running 0 9s +``` + +Wait for the sglang leader (`sglang-0`) status to change to 1/1, which indicates it is `Ready`. + +You can use the command `kubectl logs -f sglang-0` to view the logs of the leader node. + +Once successful, you should see output like this: + +```text Output +[2025-02-17 05:27:24 TP1] Capture cuda graph end.
Time elapsed: 84.89 s +[2025-02-17 05:27:24 TP6] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840 +[2025-02-17 05:27:24 TP0] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840 +[2025-02-17 05:27:24 TP7] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840 +[2025-02-17 05:27:24 TP3] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840 +[2025-02-17 05:27:24 TP2] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840 +[2025-02-17 05:27:24 TP4] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840 +[2025-02-17 05:27:24 TP1] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840 +[2025-02-17 05:27:24 TP5] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840 +[2025-02-17 05:27:24] INFO: Started server process [1] +[2025-02-17 05:27:24] INFO: Waiting for application startup. +[2025-02-17 05:27:24] INFO: Application startup complete. +[2025-02-17 05:27:24] INFO: Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit) +[2025-02-17 05:27:25] INFO: 127.0.0.1:48908 - "GET /get_model_info HTTP/1.1" 200 OK +[2025-02-17 05:27:25 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0 +[2025-02-17 05:27:32] INFO: 127.0.0.1:48924 - "POST /generate HTTP/1.1" 200 OK +[2025-02-17 05:27:32] The server is fired up and ready to roll! 
+``` + +If it doesn’t start up successfully, please follow these steps to check for any remaining issues. Thanks! + +### Debug + +* Set `NCCL_DEBUG=TRACE` to check if it is a NCCL communication problem. + +This should resolve most NCCL-related issues. + +***Notice: If you find that NCCL_DEBUG=TRACE is not effective in the container environment, but the process is stuck or you encounter hard-to-diagnose issues, try switching to a different container image. Some images may not handle standard error output properly.*** + +#### RoCE scenario + +* Please make sure that RDMA devices are available in the cluster environment. +* Please make sure that the nodes in the cluster have Mellanox NICs with RoCE. In this example, we use Mellanox ConnectX 5 model NICs, and the proper OFED driver has been installed. If not, please refer to the document [Install OFED Driver](https://docs.nvidia.com/networking/display/mlnxofedv461000/installing+mellanox+ofed) to install the driver. +* Check your env: + + ```shell Command + $ lspci -nn | grep Eth | grep Mellanox + 0000:7f:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01) + 0000:7f:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01) + 0000:c7:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01) + 0000:c7:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01) + 0001:08:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01) + 0001:08:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01) + 0001:a2:00.0 Ethernet controller [0200]: Mellanox 
Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01) + 0001:a2:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01) + ``` + +* Check the OFED driver: + + ```shell Command + ofed_info -s + OFED-internal-23.07-0.5.0: + ``` + +* Show RDMA link status and check IB devices: + + ```shell Command + $ rdma link show + 8/1: mlx5_bond_0/1: state ACTIVE physical_state LINK_UP netdev reth0 + 9/1: mlx5_bond_1/1: state ACTIVE physical_state LINK_UP netdev reth2 + 10/1: mlx5_bond_2/1: state ACTIVE physical_state LINK_UP netdev reth4 + 11/1: mlx5_bond_3/1: state ACTIVE physical_state LINK_UP netdev reth6 + + $ ibdev2netdev + 8/1: mlx5_bond_0/1: state ACTIVE physical_state LINK_UP netdev reth0 + 9/1: mlx5_bond_1/1: state ACTIVE physical_state LINK_UP netdev reth2 + 10/1: mlx5_bond_2/1: state ACTIVE physical_state LINK_UP netdev reth4 + 11/1: mlx5_bond_3/1: state ACTIVE physical_state LINK_UP netdev reth6 + ``` + +* Test RoCE network speed on the host: + + ```shell Command + yum install qperf + # On the server: start qperf with no arguments so it listens for the client + qperf + # On the client: replace <server_ip> with the address of the server node + qperf -t 60 -cm1 <server_ip> rc_rdma_write_bw + ``` + +* Check that RDMA is accessible in your container: + + ```shell Command + # ibv_devices + # ibv_devinfo + ``` + +## Keys to success + +* In the YAML configuration above, pay attention to the NCCL environment variables. For older versions of NCCL, you should check the `NCCL_IB_GID_INDEX` environment setting. +* `NCCL_SOCKET_IFNAME` is also crucial, but in a containerized environment, this typically isn’t an issue. +* In some cases, it’s necessary to configure `GLOO_SOCKET_IFNAME` correctly. +* `NCCL_DEBUG` is essential for troubleshooting, but I've found that sometimes it doesn't show error logs within containers. This could be related to the Docker image you're using. You may want to try switching images if needed.
+* Avoid using Docker images based on Ubuntu 18.04, as they tend to have compatibility issues. + +## Remaining issues + +* In Kubernetes, Docker, or Containerd environments, we use hostNetwork to prevent performance degradation. +* We use privileged mode, which isn’t secure. Additionally, in containerized environments, full GPU isolation cannot be achieved. + +## TODO + +* Integrate with [k8s-rdma-shared-dev-plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin). diff --git a/docs_new/docs/references/multi_node_deployment/lws_pd/lws_pd_deploy.mdx b/docs_new/docs/references/multi_node_deployment/lws_pd/lws_pd_deploy.mdx new file mode 100644 index 000000000000..14eac03fdf27 --- /dev/null +++ b/docs_new/docs/references/multi_node_deployment/lws_pd/lws_pd_deploy.mdx @@ -0,0 +1,786 @@ +--- +title: "LWS Based PD Deploy" +metatags: + description: "SGLang LWS PD deployment: DeepSeek R1 prefill/decode disaggregation on Kubernetes with RDMA." +--- +## 0. Prerequisites + +1. Kubernetes >= 1.26 +2. LWS installed on the cluster. + +## 1. Image Preparation + +`lmsysorg/sglang:deepep` + +## 2. Deployment Manifest Files + +***Notice: We will package all deployment files into Helm Chart format in the near future.
Interested community members can contact us to contribute*** + +### Prefill + +Prefill manifest file [prefill.yaml](https://github.com/sgl-project/sglang/blob/main/docs/references/multi_node_deployment/lws_pd/lws-examples/p.yaml) + +*Note: The NodeSelector section, model location section, and taint toleration section can be adjusted according to your actual deployment environment* + +```yaml Config +apiVersion: leaderworkerset.x-k8s.io/v1 +kind: LeaderWorkerSet +metadata: + name: deepseekr10528-prefill-main +spec: + leaderWorkerTemplate: + leaderTemplate: + metadata: + labels: + role: leader + spec: + containers: + - command: + - python3 + - -m + - sglang.launch_server + - --port + - "30000" + - --host + - "0.0.0.0" + - --model-path + - /work/models + - --disaggregation-ib-device + # should modify according your rdma env + - mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 + - --chunked-prefill-size + - "524288" + - --max-prefill-tokens + - "32768" + - --page-size + - "64" + # - --init-expert-location + # - /home/aiges/tuned/attachment_ep_statistics/prefill_in1024.json + - --ep-dispatch-algorithm + - dynamic + - --eplb-algorithm + - deepseek + # - --deepep-config + # - /home/aiges/tuned/tuned_8sms.json + - --enable-dp-lm-head + - --enable-dp-attention + - --dp-size + - "16" + - --disable-radix-cache + - --moe-a2a-backend + - deepep + - --disaggregation-mode + - prefill + - --mem-fraction-static + - "0.7" + - --context-length + - "32768" + - --tp + - "16" + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + - --max-running-requests + - "1024" + env: +# - name: NVSHMEM_HCA_PE_MAPPING +# value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2" +# - name: NVSHMEM_HCA_LIST +# value: "mlx5_bond_0:1,mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1" + - name: NVSHMEM_IB_GID_INDEX + value: "3" + 
- name: NVSHMEM_ENABLE_NIC_PE_MAPPING + value: "1" + - name: SGLANG_SET_CPU_AFFINITY + value: "true" + - name: SGLANG_ENABLE_JIT_DEEPGEMM + value: "1" + - name: NCCL_IB_QPS_PER_CONNECTION + value: "8" + - name: NCCL_IB_SPLIT_DATA_ON_QPS + value: "1" + - name: NCCL_NET_PLUGIN + value: none + - name: NCCL_IB_TC + value: "136" + - name: NCCL_MIN_NCHANNELS + value: "4" + - name: MC_TE_METRIC + value: "false" + - name: NCCL_IB_SL + value: "5" + - name: NCCL_IB_HCA + value: ^=mlx5_0,mlx5_5,mlx5_6 + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + image: lmsysorg/sglang:deepep + name: sglang-leader + ports: + - containerPort: 30000 + protocol: TCP + readinessProbe: + periodSeconds: 30 + tcpSocket: + port: 30000 + resources: + limits: + nvidia.com/gpu: "8" + securityContext: + capabilities: + add: + - IPC_LOCK + privileged: true + volumeMounts: + - mountPath: /dev/shm + name: dshm + - mountPath: /work/models + name: model + - mountPath: /dev/infiniband + name: ib + - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/configs + name: cf + - mountPath: /root/.cache + name: sgl-cache + dnsPolicy: ClusterFirstWithHostNet + hostIPC: true + hostNetwork: true + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + - key: node-role + operator: Exists + volumes: + - emptyDir: + medium: Memory + name: dshm + - hostPath: + # modify according to you deployment env + path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528 + name: model + - hostPath: + path: /dev/infiniband + name: ib + - hostPath: + # modify according to you deployment env + path: /data1/maas_hosted_models/models/fused_moe_triton/configs + name: cf + - hostPath: + # modify according to you deployment env + path: /data1/sgl_cache + type: DirectoryOrCreate + name: sgl-cache + restartPolicy: RecreateGroupOnPodRestart + size: 2 + workerTemplate: + metadata: {} + spec: + 
containers: + - command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --disaggregation-ib-device + - mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 + - --chunked-prefill-size + - "524288" + - --max-prefill-tokens + - "32768" + - --page-size + - "64" + #- --init-expert-location + #- /home/aiges/tuned/attachment_ep_statistics/prefill_in1024.json + - --ep-dispatch-algorithm + - dynamic + - --eplb-algorithm + - deepseek +# - --deepep-config +# - /home/aiges/tuned/tuned_8sms.json + - --enable-dp-lm-head + - --enable-dp-attention + - --dp-size + - "16" + - --disable-radix-cache + - --moe-a2a-backend + - deepep + - --disaggregation-mode + - prefill + - --mem-fraction-static + - "0.7" + - --context-length + - "32768" + - --tp + - "16" + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + - --max-running-requests + - "1024" + env: + - name: SGLANG_SET_CPU_AFFINITY + value: "true" + - name: SGLANG_HACK_DEEPEP_NUM_SMS + value: "8" + - name: SGLANG_HACK_DEEPEP_NEW_MODE + value: "0" +# - name: NVSHMEM_HCA_PE_MAPPING +# value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2" +# - name: NVSHMEM_HCA_LIST +# value: "mlx5_bond_0:1,mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1" + - name: NCCL_IB_HCA + value: ^=mlx5_0,mlx5_5,mlx5_6 + - name: NVSHMEM_IB_TRAFFIC_CLASS + value: "16" + - name: NVSHMEM_IB_GID_INDEX + value: "3" + - name: NVSHMEM_ENABLE_NIC_PE_MAPPING + value: "1" + - name: CUDA_LAUNCH_BLOCKING + value: "0" + - name: SGLANG_MOONCAKE_TRANS_THREAD + value: "8" + - name: SGLANG_ENABLE_JIT_DEEPGEMM + value: "1" + - name: SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD + value: "0" + - name: NCCL_IB_QPS_PER_CONNECTION + value: "8" + - name: NCCL_IB_SPLIT_DATA_ON_QPS + value: "1" + - name: NCCL_NET_PLUGIN + value: none + - name: NCCL_IB_TC + value: "136" + - name: 
NCCL_MIN_NCHANNELS + value: "4" + - name: MC_TE_METRIC + value: "true" + - name: NCCL_IB_SL + value: "5" + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + image: lmsysorg/sglang:deepep + name: sglang-worker + ports: + - containerPort: 30001 + protocol: TCP + resources: + limits: + nvidia.com/gpu: "8" + securityContext: + capabilities: + add: + - IPC_LOCK + privileged: true + volumeMounts: + + - mountPath: /root/.cache + name: sgl-cache + - mountPath: /dev/shm + name: dshm + - mountPath: /work/models + name: model + - mountPath: /dev/infiniband + name: ib + - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/configs + name: cf + dnsPolicy: ClusterFirstWithHostNet + hostIPC: true + hostNetwork: true + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + - key: node-role + operator: Exists + volumes: + - emptyDir: + medium: Memory + name: dshm + - hostPath: + path: /dev/infiniband + name: ib + - hostPath: + path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528 + name: model + - hostPath: + path: /data1/maas_hosted_models/models/fused_moe_triton/configs + name: cf + - hostPath: + path: /data1/sgl_cache + type: DirectoryOrCreate + name: sgl-cache + +``` + +### Decode + +Decode node deployment manifest file [decode.yaml](https://github.com/sgl-project/sglang/blob/main/docs/references/multi_node_deployment/lws_pd/lws-examples/d.yaml) + +*Note: The NodeSelector section, model location section, and taint toleration section can be adjusted according to your actual deployment environment* + +```yaml Config +apiVersion: leaderworkerset.x-k8s.io/v1 +kind: LeaderWorkerSet +metadata: + name: deepseekr10528-decode-main +spec: + leaderWorkerTemplate: + leaderTemplate: + metadata: + labels: + role: leader + spec: + containers: + - command: + - python3 + - -m + - sglang.launch_server + - --port + - "30000" + - --host + - "0.0.0.0" + 
- --model-path + - /work/models + - --chunked-prefill-size + - "262144" + - --page-size + - "64" + - --enable-dp-attention + - --enable-dp-lm-head + - --dp-size + - "16" + - --moe-a2a-backend + - deepep + - --disaggregation-mode + - decode + - --mem-fraction-static + - "0.849" + - --context-length + - "32768" + - --disaggregation-ib-device + - "mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3" + - --cuda-graph-max-bs + - "64" + - --max-running-requests + - "2048" + - --tp-size + - "16" # Size of Tensor Parallelism + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + env: + - name: CUDA_LAUNCH_BLOCKING + value: "0" + - name: NVSHMEM_IB_GID_INDEX + value: "3" + - name: NVSHMEM_ENABLE_NIC_PE_MAPPING + value: "1" + - name: NCCL_IB_QPS_PER_CONNECTION + value: "8" + - name: NCCL_IB_SPLIT_DATA_ON_QPS + value: "1" + - name: NCCL_NET_PLUGIN + value: "none" + - name: NCCL_IB_TC + value: "136" + - name: NCCL_MIN_NCHANNELS + value: "4" + - name: NCCL_IB_SL + value: "5" + - name: MC_TE_METRIC + value: "true" + - name: SGLANG_MOONCAKE_TRANS_THREAD + value: "16" + - name: SGLANG_ENABLE_JIT_DEEPGEMM + value: "1" + - name: NCCL_IB_HCA + value: ^=mlx5_0,mlx5_5,mlx5_6 + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + image: lmsysorg/sglang:deepep + name: sglang-leader + ports: + - containerPort: 30000 + protocol: TCP + readinessProbe: + periodSeconds: 30 + tcpSocket: + port: 30000 + resources: + limits: + nvidia.com/gpu: "8" + securityContext: + capabilities: + add: + - IPC_LOCK + privileged: true + volumeMounts: + - mountPath: /root/.cache + name: sgl-cache + - mountPath: /dev/shm + name: dshm + - mountPath: /work/models + name: model + - mountPath: /dev/infiniband + name: ib + - mountPath: 
/sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/configs + name: cf + dnsPolicy: ClusterFirstWithHostNet + hostIPC: true + hostNetwork: true + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + - key: node-role + operator: Exists + volumes: + - hostPath: + path: /data1/sgl_cache1 + type: DirectoryOrCreate + name: sgl-cache + - emptyDir: + medium: Memory + name: dshm + - hostPath: + path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528 + name: model + - hostPath: + path: /dev/infiniband + name: ib + - hostPath: + path: /data1/maas_hosted_models/models/fused_moe_triton/configs + name: cf + restartPolicy: RecreateGroupOnPodRestart + size: 2 + workerTemplate: + metadata: {} + spec: + containers: + - command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --chunked-prefill-size + - "262144" + - --page-size + - "64" + - --enable-dp-attention + - --enable-dp-lm-head + #- --enable-two-batch-overlap + - --dp-size + - "16" + - --moe-a2a-backend + - deepep + - --disaggregation-mode + - decode + - --mem-fraction-static + - "0.849" + - --context-length + - "32768" + - --disaggregation-ib-device + # should modify according your rdma env + - "mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3" + - --cuda-graph-max-bs + - "64" + - --max-running-requests + - "2048" + - --tp-size + - "16" # Size of Tensor Parallelism + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + env: + - name: SGLANG_HACK_DEEPEP_NUM_SMS + value: "24" + - name: SGLANG_HACK_DEEPEP_NEW_MODE + value: "0" + - name: NVSHMEM_IB_TRAFFIC_CLASS + value: "16" + - name: NVSHMEM_IB_GID_INDEX + value: "3" + - name: NVSHMEM_ENABLE_NIC_PE_MAPPING + value: "1" + - name: NCCL_IB_QPS_PER_CONNECTION + value: "8" + - name: NCCL_IB_SPLIT_DATA_ON_QPS + value: "1" 
+ - name: NCCL_NET_PLUGIN + value: "none" + - name: NCCL_IB_TC + value: "136" + - name: NCCL_MIN_NCHANNELS + value: "4" + - name: MC_TE_METRIC + value: "true" + - name: NCCL_IB_SL + value: "5" + - name: SGLANG_MOONCAKE_TRANS_THREAD + value: "16" + - name: SGLANG_ENABLE_JIT_DEEPGEMM + value: "1" + - name: NCCL_IB_HCA + value: ^=mlx5_0,mlx5_5,mlx5_6 + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + image: lmsysorg/sglang:deepep + name: sglang-worker + ports: + - containerPort: 30001 + resources: + limits: + nvidia.com/gpu: "8" + securityContext: + capabilities: + add: + - IPC_LOCK + privileged: true + volumeMounts: + - mountPath: /root/.cache + name: sgl-cache + - mountPath: /dev/shm + name: dshm + - mountPath: /work/models + name: model + - mountPath: /dev/infiniband + name: ib + - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/configs + name: cf + dnsPolicy: ClusterFirstWithHostNet + hostIPC: true + hostNetwork: true + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + - key: node-role + operator: Exists + volumes: + - hostPath: + path: /data1/sgl_cache1 + type: DirectoryOrCreate + name: sgl-cache + - emptyDir: + medium: Memory + name: dshm + - hostPath: + path: /dev/infiniband + name: ib + - hostPath: + # modify according to you deployment env + path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528 + name: model + - hostPath: + # modify according to you deployment env + path: /data1/maas_hosted_models/models/fused_moe_triton/configs + name: cf + networkConfig: + subdomainPolicy: Shared + replicas: 1 + rolloutStrategy: + rollingUpdateConfiguration: + maxSurge: 0 + maxUnavailable: 1 + type: RollingUpdate + startupPolicy: LeaderCreated +``` + +Execute separately: + +```bash Command +kubectl apply -f p.yaml +kubectl apply -f d.yaml +``` + +At this point, we have completed the deployment of the 1P1D SGLang engine 
part. + +To allow our users to directly experience the model API, we still need a load balancer to handle sequential calls between prefill and decode. Different companies implement LBs differently, and the community will also officially release a new LB component written in Rust in the near future. + +Currently, we use a static K8S service + minilb approach to implement model API calls. + +### Creating Service for Prefill and Decode + +#### Create prefill k8s service +[p-svc.yaml](https://github.com/sgl-project/sglang/blob/main/docs/references/multi_node_deployment/lws_pd/lws-examples/p-svc.yaml) +```yaml Config +apiVersion: v1 +kind: Service +metadata: + name: deepseekr10528-prefill-main +spec: + selector: + leaderworkerset.sigs.k8s.io/name: deepseekr10528-prefill-main + role: leader + ports: + - protocol: TCP + port: 30000 + targetPort: 30000 +``` +Execute `kubectl apply -f p-svc.yaml` + +#### Create decode k8s service +[d-svc.yaml](https://github.com/sgl-project/sglang/blob/main/docs/references/multi_node_deployment/lws_pd/lws-examples/d-svc.yaml) +```yaml Config +apiVersion: v1 +kind: Service +metadata: + name: deepseekr10528-decode-main +spec: + selector: + leaderworkerset.sigs.k8s.io/name: deepseekr10528-decode-main + role: leader + ports: + - protocol: TCP + port: 30000 + targetPort: 30000 +``` +Execute `kubectl apply -f d-svc.yaml` + +#### Deploy minilb and lb service +[lb.yaml](https://github.com/sgl-project/sglang/blob/main/docs/references/multi_node_deployment/lws_pd/lws-examples/lb.yaml) +```yaml Config +apiVersion: apps/v1 +kind: Deployment +metadata: + name: deepseekr10528-lb-main + labels: + app: deepseekr10528-lb +spec: + replicas: 1 + selector: + matchLabels: + app: deepseekr10528-lb + template: + metadata: + labels: + app: deepseekr10528-lb + spec: + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + - key: node-role + operator: Exists + containers: + - name: sgl-minilb + image: lmsysorg/sglang:deepep + command: + - python + 
- -m + - sglang_router.launch_router + - --pd-disaggregation + - --prefill + - http://deepseekr10528-prefill-main:30000 + - --decode + - http://deepseekr10528-decode-main:30000 + - --host + - 0.0.0.0 + - --port + - "8000" + ports: + - containerPort: 8000 +--- +apiVersion: v1 +kind: Service +metadata: + name: deepseekr10528-lb-service +spec: + type: NodePort + selector: + app: deepseekr10528-lb + ports: + - protocol: TCP + port: 8000 # Service port (in-cluster) + targetPort: 8000 # Container port + nodePort: 30800 +``` +Execute `kubectl apply -f lb.yaml` + +Once all deployments are ready, you should see output like the following: + +```bash Command +[root@ecs-001]# kubectl get po +deepseekr10528-decode-main-0 1/1 Running 0 74m +deepseekr10528-decode-main-0-1 1/1 Running 0 74m +deepseekr10528-lb-main-9c5dbfc57-6lcbd 1/1 Running 0 22m +deepseekr10528-prefill-main-0 1/1 Running 0 74m +deepseekr10528-prefill-main-0-1 1/1 Running 0 74m +[root@ecs-cbm-x1-pd-cpu-001 main_doc]# kubectl get svc |grep dee +deepseekr10528-decode-main ClusterIP None 97m +deepseekr10528-lb-service NodePort 172.16.242.169 8000:30800/TCP 22m +deepseekr10528-prefill-main ClusterIP None 97m +``` + +At this point, you can reach the service on port 30800 (the nodePort) of any node's IP address: + +```bash Command +[root@ecs-001]# curl -X POST "http://{nodeIP}:30800/v1/chat/completions" \ +> -H "Content-Type: application/json" \ +> -H "Authorization: Bearer None" \ +> -d '{ +> "rid":"ccccdd", +> "model": "r1", +> "messages": [ +> {"role": "system", "content": "0: You are a helpful AI assistant"}, +> {"role": "user", "content": "你是谁?."} +> ], +> "max_tokens":221 +> }'
+{"id":"ccccdd","object":"chat.completion","created":1750252498,"model":"qwen2","choices":[{"index":0,"message":{"role":"assistant","content":"\n嗯,用户问了一个很基础的自我介绍问题"你是谁?"。这可能是第一次互动时的常规开场白,也可能是想确认我的身份和功能范围。\n\n用户没有提供任何背景信息,语气简洁中性。这种场景下新用户的可能性较高,需要给出清晰友好的自我介绍,同时突出实用价值来降低陌生感。\n\n考虑到中文用户,应该用简体中文回复。重点要说明三点:身份归属(深度求索)、功能定位(AI助手)、服务范围(学习/工作/生活)。结尾用开放性问题引导对话很关键——既能了解需求,又能避免让用户面对空白输入框时不知所措。\n\n用波浪线结尾可以软化语气,那个笑脸表情😊刚好能中和AI的机械感。不过要控制表情符号数量,避免显得轻浮。\n\n你好呀!我是你的AI助手,由深度求索公司(DeepSeek)开发的语言模型,名字叫 **DeepSeek-R1**。你可以把我当成一个知识丰富、随叫随到的小帮手~😊\n\n我的任务就是陪你聊天、解答问题、","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":14,"total_tokens":235,"completion_tokens":221,"prompt_tokens_details":null}} + +``` +## FAQ + +1. The current deployment startup parameters may not be fully compatible with all RDMA scenarios. Different RDMA NCCL-related environment configurations may be needed in different network environments. + +2. Some preset, optimized configurations for EPLB are not used here. You can adjust them according to [6017](https://github.com/sgl-project/sglang/issues/6017) as needed. diff --git a/docs_new/docs/references/multi_node_deployment/multi_node.mdx b/docs_new/docs/references/multi_node_deployment/multi_node.mdx new file mode 100644 index 000000000000..645203ea8ed8 --- /dev/null +++ b/docs_new/docs/references/multi_node_deployment/multi_node.mdx @@ -0,0 +1,103 @@ +--- +title: "Multi-Node Deployment" +metatags: + description: "SGLang multi-node: Llama 405B on 2 nodes, DeepSeek V3/R1, SLURM cluster deployment examples." 
+--- +## Llama 3.1 405B + +**Run 405B (fp16) on Two Nodes** + +```bash Command +# Replace 172.16.4.52:20000 with the IP address and port of your first node. +# Run the first command on node 0 and the second command on node 1. + +python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-405B-Instruct \ + --tp 16 \ + --dist-init-addr 172.16.4.52:20000 \ + --nnodes 2 \ + --node-rank 0 + +python3 -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-405B-Instruct \ + --tp 16 \ + --dist-init-addr 172.16.4.52:20000 \ + --nnodes 2 \ + --node-rank 1 +``` + +Note that Llama 3.1 405B (fp8) can also be launched on a single node. + +```bash Command +python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8 +``` + +## DeepSeek V3/R1 + +Please refer to the [DeepSeek documentation](../../basic_usage/deepseek_v3#running-examples-on-multi-node). + +## Multi-Node Inference on SLURM + +This example shows how to serve SGLang across multiple nodes with SLURM. Submit the following job to the SLURM cluster. + +```bash Command +#!/bin/bash -l + +#SBATCH -o SLURM_Logs/%x_%j_master.out +#SBATCH -e SLURM_Logs/%x_%j_master.err +#SBATCH -D ./ +#SBATCH -J Llama-405B-Online-Inference-TP16-SGL + +#SBATCH --nodes=2 +#SBATCH --ntasks=2 +#SBATCH --ntasks-per-node=1 # Ensure 1 task per node +#SBATCH --cpus-per-task=18 +#SBATCH --mem=224GB +#SBATCH --partition="lmsys.org" +#SBATCH --gres=gpu:8 +#SBATCH --time=12:00:00 + +echo "[INFO] Activating environment on node $SLURM_PROCID" +if ! 
source ENV_FOLDER/bin/activate; then + echo "[ERROR] Failed to activate environment" >&2 + exit 1 +fi + +# Define parameters +model=MODEL_PATH +tp_size=16 + +echo "[INFO] Running inference" +echo "[INFO] Model: $model" +echo "[INFO] TP Size: $tp_size" + +# Set NCCL initialization address using the hostname of the head node +HEAD_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n 1) +NCCL_INIT_ADDR="${HEAD_NODE}:8000" +echo "[INFO] NCCL_INIT_ADDR: $NCCL_INIT_ADDR" + +# Launch the model server on each node using SLURM +srun --ntasks=2 --nodes=2 --output="SLURM_Logs/%x_%j_node$SLURM_NODEID.out" \ + --error="SLURM_Logs/%x_%j_node$SLURM_NODEID.err" \ + python3 -m sglang.launch_server \ + --model-path "$model" \ + --grammar-backend "xgrammar" \ + --tp "$tp_size" \ + --dist-init-addr "$NCCL_INIT_ADDR" \ + --nnodes 2 \ + --node-rank "$SLURM_NODEID" & + +# Wait for the SGLang HTTP server on the head node (default port 30000) to accept connections +while ! nc -z "$HEAD_NODE" 30000; do + sleep 1 + echo "[INFO] Waiting for $HEAD_NODE:30000 to accept connections" +done + +echo "[INFO] $HEAD_NODE:30000 is ready to accept connections" + +# Keep the script running until the SLURM job times out +wait +``` + +You can then test the server by sending requests as described in the [OpenAI-compatible API documentation](../../basic_usage/openai_api_completions). + +Thanks to [aflah02](https://github.com/aflah02) for providing the example, which is based on his [blog post](https://aflah02.substack.com/p/multi-node-llm-inference-with-sglang). diff --git a/docs_new/docs/references/multi_node_deployment/multi_node_index.mdx b/docs_new/docs/references/multi_node_deployment/multi_node_index.mdx new file mode 100644 index 000000000000..eba1a29555b2 --- /dev/null +++ b/docs_new/docs/references/multi_node_deployment/multi_node_index.mdx @@ -0,0 +1,11 @@ +--- +title: "Multi-Node Deployment" +metatags: + description: "SGLang multi-node deployment index: K8s, LWS, RBG, PD disaggregation guides." 
+--- +- [Multi Node](./multi_node) +- [Deploy On K8S](./deploy_on_k8s) +- [Lws Pd Deploy](./lws_pd/lws_pd_deploy) +- [Deepseekv32 Pd](./rbg_pd/deepseekv32_pd) +- [Deploying DeepSeek with PD Disaggregation on 96 H100 GPUs](https://lmsys.org/blog/2025-05-05-large-scale-ep/) +- [Deploying Kimi K2 with PD Disaggregation on 128 H200 GPUs](https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/) diff --git a/docs_new/docs/references/multi_node_deployment/multi_node_index.rst b/docs_new/docs/references/multi_node_deployment/multi_node_index.rst new file mode 100644 index 000000000000..78636869ec26 --- /dev/null +++ b/docs_new/docs/references/multi_node_deployment/multi_node_index.rst @@ -0,0 +1,14 @@ +Multi-Node Deployment +===================== + +.. toctree:: + :maxdepth: 1 + :caption: Multi-Node Deployment + + multi_node.md + deploy_on_k8s.md + lws_pd/lws_pd_deploy.md + rbg_pd/deepseekv32_pd.md + +- `Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs `_ +- `Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs `_ diff --git a/docs_new/docs/references/multi_node_deployment/rbg_pd/deepseekv32_pd.mdx b/docs_new/docs/references/multi_node_deployment/rbg_pd/deepseekv32_pd.mdx new file mode 100644 index 000000000000..fbc63eb1b53e --- /dev/null +++ b/docs_new/docs/references/multi_node_deployment/rbg_pd/deepseekv32_pd.mdx @@ -0,0 +1,570 @@ +--- +title: "DeepSeekV32-Exp RBG Based PD Deploy" +metatags: + description: "SGLang DeepSeek V3.2 RBG deployment: RoleBasedGroup PD disaggregation on Kubernetes." +--- +## 0. Prerequisites + +1. k8s >=1.26 +2. lws installed on k8s. +3. rbg installed on k8s. + +For RBG installation, please refer to: https://github.com/sgl-project/rbg + +## 1. Image Preparation + +`lmsysorg/sglang:latest` + + +### 2. 
All In One manifest file + +*Note: The NodeSelector section, model location section, and taint toleration section can be adjusted according to your actual deployment environment* + +rbg-dsv32.yml + +```yaml Config +apiVersion: workloads.x-k8s.io/v1alpha1 +kind: RoleBasedGroup +metadata: + name: deepseek-rbg-32exp + namespace: default +spec: + roles: + - name: prefill + replicas: 1 + workload: + apiVersion: leaderworkerset.x-k8s.io/v1 + kind: LeaderWorkerSet + restartPolicy: None + leaderWorkerSet: + size: 1 + patchLeaderTemplate: + metadata: + labels: + role: leader + pd_role: prefill + spec: + containers: + - command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --port + - "30000" + - --trust-remote + - --host + - 0.0.0.0 + - --disaggregation-ib-device + - mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 + - --disable-radix-cache + - --chunked-prefill-size + - "131072" + - --page-size + - "64" + # - --enable-eplb + - --ep-dispatch-algorithm + - dynamic + - --eplb-algorithm + - deepseek + - --enable-dp-lm-head + - --enable-dp-attention + - --dp-size + - "8" + - --moe-a2a-backend + - deepep + - --deepep-mode + - normal + - --disaggregation-mode + - prefill + - --mem-fraction-static + - "0.8" + - --max-prefill-tokens + - "32768" + - --context-length + - "32768" + - --tp + - "8" + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + - --max-running-requests + - "1024" + env: + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + livenessProbe: + failureThreshold: 3000 + httpGet: + path: /health + port: 30000 + initialDelaySeconds: 300 + periodSeconds: 60 + successThreshold: 1 + timeoutSeconds: 10 + readinessProbe: + failureThreshold: 20 + httpGet: + path: /health + port: 30000 + 
periodSeconds: 30 + successThreshold: 1 + timeoutSeconds: 10 + name: sglang + ports: + - containerPort: 30000 + name: sglang-http + protocol: TCP + + patchWorkerTemplate: {} + template: + metadata: + labels: + inference-framework: sglang + inference-stack.io/monitoring: "enabled" + spec: + containers: + - name: sglang + image: lmsysorg/sglang:latest + env: + - name: SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK + value: "1" + - name: CUDA_LAUNCH_BLOCKING + value: "0" + - name: SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT + value: "1000000000" + - name: NVSHMEM_IB_TRAFFIC_CLASS + value: "16" + - name: NVSHMEM_DISABLE_P2P + value: "0" + - name: ENABLE_METRICS + value: "true" + - name: NVSHMEM_IB_GID_INDEX + value: "3" + - name: NVSHMEM_IB_SL + value: "5" + - name: SGLANG_SET_CPU_AFFINITY + value: "true" + - name: SGL_ENABLE_JIT_DEEPGEMM + value: "1" + - name: NCCL_IB_QPS_PER_CONNECTION + value: "8" + - name: NCCL_IB_SPLIT_DATA_ON_QPS + value: "1" + - name: NCCL_NET_PLUGIN + value: "none" + - name: NCCL_IB_TC + value: "136" + - name: NCCL_IB_SL + value: "5" + - name: NCCL_IB_TIMEOUT + value: "22" + - name: NCCL_IB_GID_INDEX + value: "3" + - name: NCCL_MIN_NCHANNELS + value: "4" + - name: NCCL_SOCKET_IFNAME + value: bond0 + - name: GLOO_SOCKET_IFNAME + value: bond0 + - name: NCCL_IB_HCA + value: ^=mlx5_0,mlx5_5,mlx5_6 + - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME + value: "bond0" + - name: MC_TE_METRIC + value: "false" + resources: + limits: + nvidia.com/gpu: "8" + securityContext: + capabilities: + add: + - IPC_LOCK + privileged: true + volumeMounts: + - mountPath: /root/.cache + name: sgl-cache + - mountPath: /dev/shm + name: dshm + - mountPath: /work/models + name: model + - mountPath: /dev/infiniband + name: ib + - mountPath: /sgl-workspace/sglang + name: src + + dnsPolicy: ClusterFirstWithHostNet + hostIPC: true + hostNetwork: true + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + volumes: + - hostPath: + path: /var/run/sys-topology + name: topo + - 
hostPath: + path: /data1/sgl_cache4 + type: DirectoryOrCreate + name: sgl-cache + - emptyDir: + medium: Memory + name: dshm + - hostPath: + path: /data/DeepSeek-V3.2-Exp + name: model + - hostPath: + path: /dev/infiniband + name: ib + - hostPath: + path: /data/src/sglang + type: DirectoryOrCreate + name: src + + - name: decode + replicas: 1 + workload: + apiVersion: leaderworkerset.x-k8s.io/v1 + kind: LeaderWorkerSet + leaderWorkerSet: + size: 1 + patchLeaderTemplate: + metadata: + labels: + role: leader + pd_role: decode + spec: + containers: + - command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --port + - "30000" + - --trust-remote + - --host + - 0.0.0.0 + - --disaggregation-ib-device + - mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 + - --chunked-prefill-size + - "131072" + - --eplb-rebalance-layers-per-chunk + - "29" + - --page-size + - "64" + - --enable-dp-attention + - --enable-dp-lm-head + - --dp-size + - "8" + - --moe-a2a-backend + - deepep + - --deepep-mode + - low_latency + - --disaggregation-mode + - decode + - --mem-fraction-static + - "0.8" + - --context-length + - "32768" + - --max-running-requests + - "2048" + - --tp-size + - "8" # Size of Tensor Parallelism + - --cuda-graph-max-bs + - "16" + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + env: + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + livenessProbe: + failureThreshold: 30000 + httpGet: + path: /health + port: 30000 + initialDelaySeconds: 300 + periodSeconds: 60 + successThreshold: 1 + timeoutSeconds: 10 + name: sglang + readinessProbe: + failureThreshold: 20 + httpGet: + path: /health + port: 30000 + periodSeconds: 30 + successThreshold: 1 + timeoutSeconds: 10 + patchWorkerTemplate: + spec: + 
containers: + - command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --crash-dump-folder + - /log + - --chunked-prefill-size + - "262144" + - --eplb-rebalance-layers-per-chunk + - "29" + - --page-size + - "64" + - --enable-dp-attention + - --enable-dp-lm-head + - --dp-size + - "32" + - --moe-a2a-backend + - "deepep" + - --deepep-mode + - low_latency + - --disaggregation-mode + - decode + - --mem-fraction-static + - "0.849" + - --context-length + - "32768" + - --disaggregation-ib-device + - mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 + - --max-running-requests + - "4096" + - --cuda-graph-max-bs + - "16" + - --tp-size + - "8" # Size of Tensor Parallelism + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + env: + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + name: sglang + template: + metadata: + labels: + inference-framework: sglang-unuse + inference-stack.io/monitoring: "enabled" + spec: + containers: + - image: lmsysorg/sglang:latest + name: sglang + resources: + limits: + nvidia.com/gpu: "8" + securityContext: + capabilities: + add: + - IPC_LOCK + privileged: true + volumeMounts: + - mountPath: /root/.cache + name: sgl-cache + - mountPath: /dev/shm + name: dshm + - mountPath: /work/models + name: model + - mountPath: /dev/infiniband + name: ib + - mountPath: /sgl-workspace/sglang + name: src + env: + - name: SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK + value: "1" + - name: SGLANG_DISAGGREGATION_WAITING_TIMEOUT + value: "100000000" + - name: NVSHMEM_DISABLE_P2P + value: "0" + - name: NVSHMEM_IB_TRAFFIC_CLASS + value: "16" + - name: NVSHMEM_IB_SL + value: "5" + - name: ENABLE_METRICS + value: "true" + - name: CUDA_LAUNCH_BLOCKING + value: "0" + - name: NVSHMEM_IB_GID_INDEX + 
value: "3" + - name: NCCL_IB_QPS_PER_CONNECTION + value: "8" + - name: NCCL_IB_SPLIT_DATA_ON_QPS + value: "1" + - name: NCCL_NET_PLUGIN + value: "none" + - name: NCCL_IB_TC + value: "136" + - name: NCCL_IB_SL + value: "5" + - name: NCCL_IB_TIMEOUT + value: "22" + - name: NCCL_IB_GID_INDEX + value: "3" + - name: NCCL_MIN_NCHANNELS + value: "4" + - name: NCCL_SOCKET_IFNAME + value: bond0 + - name: GLOO_SOCKET_IFNAME + value: bond0 + - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME + value: "bond0" + - name: NCCL_IB_HCA + value: ^=mlx5_0,mlx5_5,mlx5_6 + - name: MC_TE_METRIC + value: "false" + - name: SGL_ENABLE_JIT_DEEPGEMM + value: "1" + dnsPolicy: ClusterFirstWithHostNet + hostIPC: true + hostNetwork: true + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + volumes: + - hostPath: + path: /var/run/sys-topology + name: topo + - hostPath: + path: /data1/sgl_cache4 + type: DirectoryOrCreate + name: sgl-cache + - hostPath: + path: /data/src/sglang + type: DirectoryOrCreate + name: src + - emptyDir: + medium: Memory + name: dshm + - hostPath: + path: /data/DeepSeek-V3.2-Exp + name: model + - hostPath: + path: /dev/infiniband + name: ib + - name: router + replicas: 1 + dependencies: [ "decode", "prefill" ] + template: + spec: + containers: + - name: scheduler + image: lmsysorg/sglang:latest + command: + - sh + - -c + - > + python3 -m sglang_router.launch_router + --host 0.0.0.0 + --port 8080 + --pd-disaggregation + --policy random + --service-discovery + --service-discovery-namespace ${NAMESPACE} + --service-discovery-port 30000 + --prefill-selector pd_role=prefill + --decode-selector pd_role=decode + --max-payload-size 2147483648 + --worker-startup-timeout-secs 1200 + env: + - name: NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace +*** +apiVersion: v1 +kind: Service +metadata: + labels: + app: deepseek-rbg-32exp + name: deepseek-rbg-32exp + namespace: default +spec: + ports: + - name: http + port: 8080 + protocol: TCP + 
targetPort: 8080 + nodePort: 30080 + + selector: + rolebasedgroup.workloads.x-k8s.io/name: deepseek-rbg-32exp + rolebasedgroup.workloads.x-k8s.io/role: router + type: NodePort + +``` + +```bash Command +[root@ecs-001]# kubectl get po -n default +deepseek-rbg-32exp-decode-main-0 1/1 Running 0 74m +deepseek-rbg-32exp-decode-0-1 1/1 Running 0 74m +deepseek-rbg-32exp-router-9c5dbfc57 1/1 Running 0 22m +deepseek-rbg-32exp-prefill-0 1/1 Running 0 74m + +[root@ecs-cbm-x1-pd-cpu-001 main_doc]# kubectl get svc |grep dee +deepseek-rbg-32exp-decode ClusterIP None 97m +deepseek-rbg-32exp-router-service NodePort 172.16.242.169 8000:30800/TCP 22m +deepseek-rbg-32exp-prefill ClusterIP None 97m +``` + +At this point, select a nodePort:30800 to access: + +```bash Command +[root@ecs-001]# curl -X POST "http://{nodePort}:30800/v1/chat/completions" \ +> -H "Content-Type: application/json" \ +> -H "Authorization: Bearer None" \ +> -d '{ +> "rid":"ccccdd", +> "model": "dsv32", +> "messages": [ +> {"role": "system", "content": "0: You are a helpful AI assistant"}, +> {"role": "user", "content": "你是谁?."} +> ], +> "max_tokens":221 +> }' +{"id":"ccccdd","object":"chat.completion","created":1750252498,"model":"qwen2","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n嗯,用户问了一个很基础的自我介绍问题"你是谁?"。这可能是第一次互动时的常规开场白,也可能是想确认我的身份和功能范围。\n\n用户没有提供任何背景信息,语气简洁中性。这种场景下新用户的可能性较高,需要给出清晰友好的自我介绍,同时突出实用价值来降低陌生感。\n\n考虑到中文用户,应该用简体中文回复。重点要说明三点:身份归属(深度求索)、功能定位(AI助手)、服务范围(学习/工作/生活)。结尾用开放性问题引导对话很关键——既能了解需求,又能避免让用户面对空白输入框时不知所措。\n\n用波浪线结尾可以软化语气,那个笑脸表情😊刚好能中和AI的机械感。不过要控制表情符号数量,避免显得轻浮。\n</think>\n你好呀!我是你的AI助手,由深度求索公司(DeepSeek)开发的语言模型,名字叫 **DeepSeek-V32**。你可以把我当成一个知识丰富、随叫随到的小帮手~😊\n\n我的任务就是陪你聊天、解答问题、","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":14,"total_tokens":235,"completion_tokens":221,"prompt_tokens_details":null}} + +``` +## FAQ + +1. 
The current deployment startup parameters may not be fully compatible with all RDMA scenarios. Different RDMA NCCL-related environment configurations may be needed in different network environments. + +2. Please ensure that the sglang code in the image has incorporated the changes from [PR #10912](https://github.com/sgl-project/sglang/pull/10912). diff --git a/docs_new/docs/references/overview.mdx b/docs_new/docs/references/overview.mdx new file mode 100644 index 000000000000..60af634539e0 --- /dev/null +++ b/docs_new/docs/references/overview.mdx @@ -0,0 +1,13 @@ +--- +title: References +description: FAQ, environment variables, production metrics, deployment guides, and more. +--- + +- [FAQ](./faq) +- [Environment Variables](./environment_variables) +- [Production Metrics](./production_metrics) +- [Production Request Trace](./production_request_trace) +- [Multi-Node Deployment](./multi_node_deployment/multi_node) +- [Custom Chat Template](./custom_chat_template) +- [Frontend Language](./frontend/frontend_tutorial) +- [Post-Training Integration](./post_training_integration) diff --git a/docs_new/docs/references/post_training_integration.mdx b/docs_new/docs/references/post_training_integration.mdx new file mode 100644 index 000000000000..d4544fd0f289 --- /dev/null +++ b/docs_new/docs/references/post_training_integration.mdx @@ -0,0 +1,34 @@ +--- +title: "Post-Training Integration" +metatags: + description: "SGLang post-training: RLHF integration with Miles, slime, AReaL, ROLL, verl, Unsloth, LLaMA Factory." +--- +SGLang has become the de facto inference backend for modern LLM training frameworks, powering state-of-the-art models across the industry. From GLM-4.6 to Qwen3, leading models leverage SGLang's high-performance inference during reinforcement learning and post-training workflows. + +What makes SGLang essential for post-training? 
+ +- Open-To-Use Refit Functionality: diverse methods for colocated or disaggregated deployments +- Easy To Postpone Generation: enables partial rollout and dedicated rollout control +- Fine-Grained Engine Sleep And Wake Up: facilitates maximum-powered rollout and training +- Training Serving Alignment: ensures consistent performance between training and serving +- Load Balancing Router: cache-aware load balancing for high-throughput rollout +- Deterministic Inference: ensures zero KL divergence between rollout and training + +These capabilities, combined with native integration support across major frameworks, have established SGLang as the infrastructure backbone for modern LLM/VLM post-training. We also share our latest work in this slide deck, [Optimizing Large-Scale RL with SGLang](https://gamma.app/docs/Optimizing-RL-with-SGLang-y0kqgj877k34779). + +## Adoption + +- [**Miles**](https://github.com/radixark/miles): Enterprise-scale RL framework for large MoE models with SGLang-native rollout, speculative training, and production-grade stability +- [**slime**](https://github.com/THUDM/slime): Post-training framework combining Megatron and SGLang, used to train GLM-4.6 +- [**AReaL**](https://github.com/inclusionAI/AReaL): Fully asynchronous RL system achieving 2.77x speedup with SGLang backend for continuous rollout generation +- [**ROLL**](https://github.com/alibaba/ROLL): Efficient and user-friendly RL library for large language models that utilizes large-scale GPU resources +- [**verl**](https://github.com/volcengine/verl): Full-stack RLHF framework supporting PPO, GRPO, and ReMax with modular SGLang integration +- [**Unsloth**](https://docs.unsloth.ai/basics/inference-and-deployment/sglang-guide): 2x faster fine-tuning with optimized kernels, deploys seamlessly with SGLang inference +- [**LLaMA Factory**](https://github.com/hiyouga/LLaMA-Factory): Unified framework for training 100+ LLMs with LoRA, QLoRA, and full fine-tuning methods +- 
[**Tunix**](https://github.com/google/tunix): Google's JAX-native library for LLM post-training with SFT, DPO, PPO, and GRPO support +- [**RL2**](https://github.com/ChenmienTan/RL2): Ray Less Reinforcement Learning, a concise post-training library for large language models + + +## Collaboration + +To protect the privacy of our design partners, we cannot list the companies that have adopted SGLang for post-training. More than 10 top companies and frontier labs across the US and China have made this choice, and we are happy to share details if you are interested. If you would like to integrate SGLang with your training framework or need technical support, we're here to help! Reach out to us at **rl_team@lmsys.org** for partnerships, integration guidance, and custom feature development. diff --git a/docs_new/docs/references/production_metrics.mdx b/docs_new/docs/references/production_metrics.mdx new file mode 100644 index 000000000000..97de5a708f8d --- /dev/null +++ b/docs_new/docs/references/production_metrics.mdx @@ -0,0 +1,270 @@ +--- +title: "Production Metrics" +metatags: + description: "SGLang Prometheus metrics: TTFT, TPOT, throughput, cache hit rate. Grafana dashboard setup guide." +--- +SGLang exposes the following metrics via Prometheus. You can enable them by adding `--enable-metrics` when you launch the server. + +An example of the monitoring dashboard is available in [examples/monitoring/grafana.json](https://github.com/sgl-project/sglang/blob/main/examples/monitoring/grafana/dashboards/json/sglang-dashboard.json). + +Here is an example of the metrics: + +```text Output +$ curl http://localhost:30000/metrics +# HELP sglang:prompt_tokens_total Number of prefill tokens processed. +# TYPE sglang:prompt_tokens_total counter +sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.128902e+06 +# HELP sglang:generation_tokens_total Number of generation tokens processed. 
+# TYPE sglang:generation_tokens_total counter +sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.557572e+06 +# HELP sglang:token_usage The token usage +# TYPE sglang:token_usage gauge +sglang:token_usage{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.28 +# HELP sglang:cache_hit_rate The cache hit rate +# TYPE sglang:cache_hit_rate gauge +sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.007507552643049313 +# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds. +# TYPE sglang:time_to_first_token_seconds histogram +sglang:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 2.3518979474117756e+06 +sglang:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0 +sglang:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0 +sglang:time_to_first_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0 +sglang:time_to_first_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0 +sglang:time_to_first_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0 +sglang:time_to_first_token_seconds_bucket{le="0.06",model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0 +sglang:time_to_first_token_seconds_bucket{le="0.08",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:time_to_first_token_seconds_bucket{le="0.25",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:time_to_first_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:time_to_first_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 27.0 
+sglang:time_to_first_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0 +sglang:time_to_first_token_seconds_bucket{le="5.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 314.0 +sglang:time_to_first_token_seconds_bucket{le="7.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 941.0 +sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1330.0 +sglang:time_to_first_token_seconds_bucket{le="15.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1970.0 +sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 2326.0 +sglang:time_to_first_token_seconds_bucket{le="25.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 2417.0 +sglang:time_to_first_token_seconds_bucket{le="30.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 2513.0 +sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 11008.0 +sglang:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 11008.0 +# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds +# TYPE sglang:e2e_request_latency_seconds histogram +sglang:e2e_request_latency_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.116093850019932e+06 +sglang:e2e_request_latency_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0 +sglang:e2e_request_latency_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:e2e_request_latency_seconds_bucket{le="1.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 
+sglang:e2e_request_latency_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0 +sglang:e2e_request_latency_seconds_bucket{le="5.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0 +sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 10.0 +sglang:e2e_request_latency_seconds_bucket{le="15.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 11.0 +sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 14.0 +sglang:e2e_request_latency_seconds_bucket{le="30.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 247.0 +sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 486.0 +sglang:e2e_request_latency_seconds_bucket{le="50.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 845.0 +sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1513.0 +sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 11228.0 +sglang:e2e_request_latency_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 11228.0 +# HELP sglang:time_per_output_token_seconds Histogram of time per output token in seconds. 
+# TYPE sglang:time_per_output_token_seconds histogram +sglang:time_per_output_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 866964.5791549598 +sglang:time_per_output_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0 +sglang:time_per_output_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 73.0 +sglang:time_per_output_token_seconds_bucket{le="0.015",model_name="meta-llama/Llama-3.1-8B-Instruct"} 382.0 +sglang:time_per_output_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 593.0 +sglang:time_per_output_token_seconds_bucket{le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 855.0 +sglang:time_per_output_token_seconds_bucket{le="0.03",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1035.0 +sglang:time_per_output_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1815.0 +sglang:time_per_output_token_seconds_bucket{le="0.05",model_name="meta-llama/Llama-3.1-8B-Instruct"} 11685.0 +sglang:time_per_output_token_seconds_bucket{le="0.075",model_name="meta-llama/Llama-3.1-8B-Instruct"} 433413.0 +sglang:time_per_output_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 4.950195e+06 +sglang:time_per_output_token_seconds_bucket{le="0.15",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.039435e+06 +sglang:time_per_output_token_seconds_bucket{le="0.2",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.171662e+06 +sglang:time_per_output_token_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.266055e+06 +sglang:time_per_output_token_seconds_bucket{le="0.4",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.296752e+06 +sglang:time_per_output_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.312226e+06 +sglang:time_per_output_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.339675e+06 
+sglang:time_per_output_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.357747e+06 +sglang:time_per_output_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.389414e+06 +sglang:time_per_output_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.400757e+06 +sglang:time_per_output_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.400757e+06 +# HELP sglang:func_latency_seconds Function latency in seconds +# TYPE sglang:func_latency_seconds histogram +sglang:func_latency_seconds_sum{name="generate_request"} 4.514771912145079 +sglang:func_latency_seconds_bucket{le="0.05",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="0.07500000000000001",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="0.1125",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="0.16875",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="0.253125",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="0.3796875",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="0.56953125",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="0.8542968750000001",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="1.2814453125",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="1.9221679687500002",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="2.8832519531250003",name="generate_request"} 14006.0 +sglang:func_latency_seconds_bucket{le="4.3248779296875",name="generate_request"} 14007.0 +sglang:func_latency_seconds_bucket{le="6.487316894531251",name="generate_request"} 14007.0 +sglang:func_latency_seconds_bucket{le="9.730975341796876",name="generate_request"} 14007.0 +sglang:func_latency_seconds_bucket{le="14.596463012695313",name="generate_request"} 14007.0 
+sglang:func_latency_seconds_bucket{le="21.89469451904297",name="generate_request"} 14007.0 +sglang:func_latency_seconds_bucket{le="32.84204177856446",name="generate_request"} 14007.0 +sglang:func_latency_seconds_bucket{le="49.26306266784668",name="generate_request"} 14007.0 +sglang:func_latency_seconds_bucket{le="+Inf",name="generate_request"} 14007.0 +sglang:func_latency_seconds_count{name="generate_request"} 14007.0 +# HELP sglang:num_running_reqs The number of running requests +# TYPE sglang:num_running_reqs gauge +sglang:num_running_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 162.0 +# HELP sglang:num_used_tokens The number of used tokens +# TYPE sglang:num_used_tokens gauge +sglang:num_used_tokens{model_name="meta-llama/Llama-3.1-8B-Instruct"} 123859.0 +# HELP sglang:gen_throughput The generate throughput (token/s) +# TYPE sglang:gen_throughput gauge +sglang:gen_throughput{model_name="meta-llama/Llama-3.1-8B-Instruct"} 86.50814177726902 +# HELP sglang:num_queue_reqs The number of requests in the waiting queue +# TYPE sglang:num_queue_reqs gauge +sglang:num_queue_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 2826.0 +``` + +## Setup Guide + +This section describes how to set up the monitoring stack (Prometheus + Grafana) provided in the `examples/monitoring` directory. + +### Prerequisites + +- Docker and Docker Compose installed +- SGLang server running with metrics enabled + +### Usage + +1. **Start your SGLang server with metrics enabled:** + + ```bash Command + python -m sglang.launch_server \ + --model-path \ + --port 30000 \ + --enable-metrics \ + --enable-mfu-metrics + ``` + Replace `` with the actual path to your model (e.g., `meta-llama/Meta-Llama-3.1-8B-Instruct`). Ensure the server is accessible from the monitoring stack (you might need `--host 0.0.0.0` if running in Docker). By default, the metrics endpoint will be available at `http://:30000/metrics`. + +2. 
**Navigate to the monitoring example directory:** + ```bash Command + cd examples/monitoring + ``` + +3. **Start the monitoring stack:** + ```bash Command + docker compose up -d + ``` + This command will start Prometheus and Grafana in the background. + +4. **Access the monitoring interfaces:** + * **Grafana:** Open your web browser and go to [http://localhost:3000](http://localhost:3000). + * **Prometheus:** Open your web browser and go to [http://localhost:9090](http://localhost:9090). + +5. **Log in to Grafana:** + * Default Username: `admin` + * Default Password: `admin` + You will be prompted to change the password upon your first login. + +6. **View the Dashboard:** + The SGLang dashboard is pre-configured and should be available automatically. Navigate to `Dashboards` -> `Browse` -> `SGLang Monitoring` folder -> `SGLang Dashboard`. + +### Troubleshooting + +* **Port Conflicts:** If you encounter errors like "port is already allocated," check if other services (including previous instances of Prometheus/Grafana) are using ports `9090` or `3000`. Use `docker ps` to find running containers and `docker stop ` to stop them, or use `lsof -i :` to find other processes using the ports. You might need to adjust the ports in the `docker-compose.yaml` file if they permanently conflict with other essential services on your system. + +To move Grafana to a different port (for example, `3090`), set it explicitly under the `grafana` service in your Docker Compose file, using one of the following options. + + Option 1: Add `GF_SERVER_HTTP_PORT` to the environment section: + ``` + environment: + - GF_AUTH_ANONYMOUS_ENABLED=true + - GF_SERVER_HTTP_PORT=3090 # <-- Add this line + ``` + Option 2: Use port mapping: + ``` + grafana: + image: grafana/grafana:latest + container_name: grafana + ports: + - "3090:3000" # <-- Host:Container port mapping + ``` +* **Connection Issues:** + * Ensure both Prometheus and Grafana containers are running (`docker ps`).
+ * Verify the Prometheus data source configuration in Grafana (usually auto-configured via `grafana/datasources/datasource.yaml`). Go to `Connections` -> `Data sources` -> `Prometheus`. The URL should point to the Prometheus service (e.g., `http://prometheus:9090`). + * Confirm that your SGLang server is running and the metrics endpoint (`http://:30000/metrics`) is accessible *from the Prometheus container*. If SGLang is running on your host machine and Prometheus is in Docker, use `host.docker.internal` (on Docker Desktop) or your machine's network IP instead of `localhost` in the `prometheus.yaml` scrape configuration. +* **No Data on Dashboard:** + * Generate some traffic to your SGLang server to produce metrics. For example, run a benchmark: + ```bash Command + python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 100 --random-input 128 --random-output 128 + ``` + * Check the Prometheus UI (`http://localhost:9090`) under `Status` -> `Targets` to see if the SGLang endpoint is being scraped successfully. + * Verify the `model_name` and `instance` labels in your Prometheus metrics match the variables used in the Grafana dashboard. You might need to adjust the Grafana dashboard variables or the labels in your Prometheus configuration. + +### Configuration Files + +The monitoring setup is defined by the following files within the `examples/monitoring` directory: + +* `docker-compose.yaml`: Defines the Prometheus and Grafana services. +* `prometheus.yaml`: Prometheus configuration, including scrape targets. +* `grafana/datasources/datasource.yaml`: Configures the Prometheus data source for Grafana. +* `grafana/dashboards/config/dashboard.yaml`: Tells Grafana to load dashboards from the specified path. +* `grafana/dashboards/json/sglang-dashboard.json`: The actual Grafana dashboard definition in JSON format. + +You can customize the setup by modifying these files. 
For instance, you might need to update the `static_configs` target in `prometheus.yaml` if your SGLang server runs on a different host or port. + +#### Check if the metrics are being collected + +Run: +```text Output +python3 -m sglang.bench_serving \ + --backend sglang \ + --dataset-name random \ + --num-prompts 3000 \ + --random-input 1024 \ + --random-output 1024 \ + --random-range-ratio 0.5 +``` + +to generate some requests. + +Then you should be able to see the metrics in the Grafana dashboard. + +## Estimated Performance Metrics (MFU-related) + +SGLang exports the following estimated per-GPU counters that can be used to derive +Model FLOPs Utilization (MFU)-related signals: + +- `sglang:estimated_flops_per_gpu_total`: Estimated floating-point operations. +- `sglang:estimated_read_bytes_per_gpu_total`: Estimated bytes read from memory. +- `sglang:estimated_write_bytes_per_gpu_total`: Estimated bytes written to memory. + +These metrics are available when both `--enable-metrics` and +`--enable-mfu-metrics` are enabled. + +These are cumulative counters. Use Prometheus `rate(...)` to get per-second values. + +### PromQL examples + +Average TFLOPS per GPU: + +```promql +rate(sglang:estimated_flops_per_gpu_total[1m]) / 1e12 +``` + +Average estimated memory bandwidth in GB/s: + +```promql +(rate(sglang:estimated_read_bytes_per_gpu_total[1m]) + + rate(sglang:estimated_write_bytes_per_gpu_total[1m])) / 1e9 +``` + +### Notes + +- These metrics are estimates intended for observability and trend analysis. +- Estimated memory bytes reflect modeled traffic and are not a direct hardware + counter from GPU profilers. 
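
As a rough illustration of how these counters could be turned into a utilization figure, the sketch below computes a per-second rate from two samples of a cumulative counter (roughly what `rate(...)` does, ignoring counter resets) and divides it by a peak-throughput value. The sample values and `PEAK_FLOPS` are assumptions made up for the example; substitute the peak throughput of your own GPU and precision.

```python
def counter_rate(v0: float, t0: float, v1: float, t1: float) -> float:
    """Average per-second increase of a cumulative counter between two samples."""
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def mfu(flops_per_second: float, peak_flops_per_second: float) -> float:
    """Fraction of the assumed peak throughput that was achieved."""
    return flops_per_second / peak_flops_per_second


# Hypothetical samples of sglang:estimated_flops_per_gpu_total, taken 60 s apart.
flops_rate = counter_rate(v0=1.0e15, t0=0.0, v1=1.3e16, t1=60.0)

# Assumed peak FLOP/s for the GPU and precision in use (not a real spec value).
PEAK_FLOPS = 1.0e15

tflops = flops_rate / 1e12
utilization = mfu(flops_rate, PEAK_FLOPS)
print(f"{tflops:.1f} TFLOPS, estimated MFU {utilization:.1%}")
```

Because the underlying counters are estimates, the resulting ratio is best read as a trend indicator rather than a measured hardware utilization.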
diff --git a/docs_new/docs/references/production_request_trace.mdx b/docs_new/docs/references/production_request_trace.mdx new file mode 100644 index 000000000000..d81e41e56460 --- /dev/null +++ b/docs_new/docs/references/production_request_trace.mdx @@ -0,0 +1,136 @@ +--- +title: "Production Request Tracing" +metatags: + description: "SGLang OpenTelemetry tracing: Jaeger visualization, trace context propagation, PD disaggregation support." +--- +SGLang exports request trace data through the OpenTelemetry Collector. To enable tracing, pass `--enable-trace` and set the OpenTelemetry Collector endpoint with `--otlp-traces-endpoint` when launching the server. + +You can find example screenshots of the visualization in https://github.com/sgl-project/sglang/issues/8965. + +## Setup Guide +This section explains how to configure request tracing and export the trace data. +1. Install the required packages and tools + * Install Docker and Docker Compose + * Install the dependencies + ```bash Command + # enter the SGLang root directory + pip install -e "python[tracing]" + + # or manually install the dependencies using pip + pip install opentelemetry-sdk opentelemetry-api opentelemetry-exporter-otlp opentelemetry-exporter-otlp-proto-grpc + ``` + +2. Launch OpenTelemetry collector and Jaeger + ```bash Command + docker compose -f examples/monitoring/tracing_compose.yaml up -d + ``` + +3. Start your SGLang server with tracing enabled + ```bash Command + # set env variables + export SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS=500 + export SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE=64 + # start the prefill and decode server + python -m sglang.launch_server --enable-trace --otlp-traces-endpoint 0.0.0.0:4317 + # start the model gateway + python -m sglang_router.launch_router --enable-trace --otlp-traces-endpoint 0.0.0.0:4317 + ``` + + Replace `0.0.0.0:4317` with the actual endpoint of the OpenTelemetry collector.
If you launched the OpenTelemetry collector with `tracing_compose.yaml`, the default receiving port is 4317. + + To use the HTTP/protobuf span exporter, set the following environment variable and point to an HTTP endpoint, for example, `http://0.0.0.0:4318/v1/traces`. + ```bash Command + export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf + ``` + + +4. Send some requests +5. Observe whether trace data is being exported + * Access port 16686 of Jaeger using a web browser to visualize the request traces. + * The OpenTelemetry Collector also exports trace data in JSON format to `/tmp/otel_trace.json`. In a follow-up patch, we will provide a tool to convert this data into a Perfetto-compatible format, enabling visualization of requests in the Perfetto UI. + +6. Dynamically adjust trace level + The trace level accepts values from `0` to `3`, with the following meanings: + ``` + 0: Disable tracing + 1: Trace important slices + 2: Trace all slices except nested ones + 3: Trace all slices + ``` + The trace level can be set dynamically via the HTTP API, for example: + ```bash Command + curl "http://0.0.0.0:30000/set_trace_level?level=2" + ``` + Replace `0.0.0.0:30000` with your actual server address, and replace `level=2` with the level you want to set. + + **Note**: You must launch the server with `--enable-trace`; otherwise, tracing stays disabled regardless of any dynamic adjustments to the trace level. + +## How to Add Tracing for Slices You're Interested In (API Introduction) +We have already inserted instrumentation points in the tokenizer and scheduler main threads. If you wish to trace additional request execution segments or perform finer-grained tracing, please use the APIs from the tracing package as described below. + +**All of the following is implemented in `python/sglang/srt/observability/req_time_stats.py`. If you want to add another slice, add it there.** + +1.
Initialization + + Every process involved in tracing should execute the following during its initialization phase: + ```python Example + process_tracing_init(otlp_traces_endpoint, server_name) + ``` + The `otlp_traces_endpoint` is obtained from the arguments, and you can set `server_name` freely, but it should remain consistent across all processes. + + Every thread involved in tracing should execute the following during its initialization phase: + ```python Example + trace_set_thread_info("thread label", tp_rank, dp_rank) + ``` + The "thread label" can be regarded as the name of the thread, used to distinguish different threads in the visualization view. + +2. Create a trace context for a request + Each request needs to call `TraceReqContext()` to initialize a request context, which is used to generate slice spans and record request stage info. You can either store it within the request object or maintain it as a global variable. + +3. Mark the beginning and end of a request + ``` + trace_ctx.trace_req_start() + trace_ctx.trace_req_finish() + ``` + `trace_req_start()` and `trace_req_finish()` must be called within the same process, for example, in the tokenizer. + +4. Add tracing for a slice + + * Add slice tracing normally: + ```python Example + trace_ctx.trace_slice_start(RequestStage.TOKENIZER.stage_name) + trace_ctx.trace_slice_end(RequestStage.TOKENIZER.stage_name) + + # or, with an existing TraceSliceContext object `slice_ctx`: + trace_ctx.trace_slice(slice_ctx) + ``` + + - The end of the last slice in a thread must be marked with `thread_finish_flag=True`, or you must explicitly call `trace_ctx.abort()`; otherwise, the thread's span will not be properly generated. + ```python Example + trace_ctx.trace_slice_end(RequestStage.D.stage_name, thread_finish_flag=True) + # or, to abort instead: + trace_ctx.abort() + ``` + +5. When the request execution flow transfers to another thread, the thread context needs to be explicitly rebuilt.
+ - receiver: Execute the following code after receiving the request via ZMQ + ```python Example + trace_ctx.rebuild_thread_context() + ``` + +## How to Extend the Tracing Framework to Support Complex Tracing Scenarios + +The currently provided tracing package still has potential for further development. If you wish to build more advanced features upon it, you must first understand its existing design principles. + +The core of the tracing framework's implementation lies in the design of the span structure and the trace context. To aggregate scattered slices and enable concurrent tracking of multiple requests, we have designed a three-level trace context structure or span structure: `TraceReqContext`, `TraceThreadContext` and `TraceSliceContext`. Their relationship is as follows: +``` +TraceReqContext (req_id="req-123") +├── TraceThreadContext(thread_label="scheduler", tp_rank=0) +| └── TraceSliceContext(slice_name="prefill") +| +└── TraceThreadContext(thread_label="scheduler", tp_rank=1) + └── TraceSliceContext(slice_name="prefill") +``` + +Each traced request maintains a global `TraceReqContext` and creates a corresponding request span. For every thread that processes the request, a `TraceThreadContext` is recorded and a thread span is created. The `TraceThreadContext` is nested within the `TraceReqContext`, and each currently traced code slice—potentially nested—is stored in its associated `TraceThreadContext`. + +In addition to the above hierarchy, each slice also records its previous slice via Span.add_link(), which can be used to trace the execution flow. 
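
The hierarchy described above can be pictured with a small standalone model. This is only a sketch under stated assumptions: the class names (`ReqCtx`, `ThreadCtx`, `SliceCtx`) and their methods are hypothetical stand-ins rather than the SGLang tracing API, nested slices are kept on a per-thread stack, and the "previous slice" link is tracked per thread as a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SliceCtx:
    """Hypothetical stand-in for a traced code slice."""
    name: str
    prev: Optional["SliceCtx"] = None  # link to the previously finished slice


@dataclass
class ThreadCtx:
    """Hypothetical stand-in for per-thread state within one request."""
    label: str
    tp_rank: int
    open_slices: List[SliceCtx] = field(default_factory=list)  # stack of nested slices
    last_finished: Optional[SliceCtx] = None


@dataclass
class ReqCtx:
    """Hypothetical stand-in for the per-request context."""
    req_id: str
    threads: Dict[Tuple[str, int], ThreadCtx] = field(default_factory=dict)

    def thread(self, label: str, tp_rank: int) -> ThreadCtx:
        # One thread context per (label, tp_rank) pair, created on first use.
        return self.threads.setdefault((label, tp_rank), ThreadCtx(label, tp_rank))

    def slice_start(self, label: str, tp_rank: int, name: str) -> SliceCtx:
        t = self.thread(label, tp_rank)
        s = SliceCtx(name, prev=t.last_finished)
        t.open_slices.append(s)
        return s

    def slice_end(self, label: str, tp_rank: int, name: str) -> SliceCtx:
        t = self.thread(label, tp_rank)
        s = t.open_slices.pop()
        if s.name != name:
            raise ValueError(f"expected to close {s.name!r}, got {name!r}")
        t.last_finished = s
        return s


req = ReqCtx("req-123")
prefill = req.slice_start("scheduler", 0, "prefill")
req.slice_end("scheduler", 0, "prefill")
decode = req.slice_start("scheduler", 0, "decode")
req.slice_start("scheduler", 1, "prefill")  # a second thread context, tp_rank=1
print(decode.prev.name, len(req.threads))
```

In this toy model, following `prev` from any slice walks back through the slices that finished earlier on the same thread, which mirrors how span links can be used to reconstruct execution order.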
diff --git a/docs_new/docs/references/torch_compile_cache.mdx b/docs_new/docs/references/torch_compile_cache.mdx new file mode 100644 index 000000000000..0e3c850cac4e --- /dev/null +++ b/docs_new/docs/references/torch_compile_cache.mdx @@ -0,0 +1,16 @@ +--- +title: "Enabling cache for torch.compile" +metatags: + description: "SGLang torch.compile cache: TORCHINDUCTOR_CACHE_DIR for faster deployment across multiple machines." +--- +SGLang uses `max-autotune-no-cudagraphs` mode of torch.compile. The auto-tuning can be slow. +If you want to deploy a model on many different machines, you can ship the torch.compile cache to these machines and skip the compilation steps. + +This is based on https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html + + +1. Generate the cache by setting TORCHINDUCTOR_CACHE_DIR and running the model once. +```text Output +TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile +``` +2. Copy the cache folder to other machines and launch the server with `TORCHINDUCTOR_CACHE_DIR`. diff --git a/docs_new/docs/sglang-diffusion/api/cli.mdx b/docs_new/docs/sglang-diffusion/api/cli.mdx new file mode 100644 index 000000000000..1cb04974d74a --- /dev/null +++ b/docs_new/docs/sglang-diffusion/api/cli.mdx @@ -0,0 +1,294 @@ +--- +title: CLI reference +sidebarTitle: CLI +description: Run one-off generation tasks and launch the HTTP server from the command line. +--- +Use the CLI for one-off generation with `sglang generate` or to start a persistent HTTP server with `sglang serve`. + +### Overlay repos for non-diffusers models + +If `--model-path` points to a supported non-diffusers source repo, SGLang can resolve it +through a self-hosted overlay repo. + +SGLang first checks a built-in overlay registry. Concrete built-in mappings can be added over time without changing the CLI surface. 
+ +Override example: + +```bash Command +export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY='{ + "Wan-AI/Wan2.2-S2V-14B": { + "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay", + "overlay_revision": "main" + } +}' + +sglang generate \ + --model-path Wan-AI/Wan2.2-S2V-14B \ + --config configs/wan_s2v.yaml +``` + +The overlay repo should be a complete diffusers-style/componentized repo + +You can also pass the overlay repo itself as `--model-path` if it contains `_overlay/overlay_manifest.json`. + +Notes: +1. `SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY` is only an optional override for +development and debugging. It accepts either a JSON object or a path to a JSON +file, and can extend or replace built-in entries for the current process. +2. On the first load, SGLang will: + - download overlay metadata from the overlay repo + - download the required files from the original source repo + - materialize a local standard component repo under `~/.cache/sgl_diffusion/materialized_models/` +3. Later loads reuse the materialized local repo. The materialized repo is what the runtime loads as a normal componentized model directory. + + +## Quick Start + +### Generate + +```bash Command +sglang generate \ + --model-path Qwen/Qwen-Image \ + --prompt "A beautiful sunset over the mountains" \ + --save-output +``` + +### Serve + +```bash Command +sglang serve \ + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --num-gpus 4 \ + --ulysses-degree 2 \ + --ring-degree 2 \ + --port 30010 +``` + +For request and response examples, see [OpenAI-Compatible API](./openai_api). + + +Use `sglang generate --help` and `sglang serve --help` for the full argument list. The CLI help output is the source of truth for exhaustive flags. 
+ + +## Common Options + +### Model and runtime + +- `--model-path {MODEL}`: model path or Hugging Face model ID +- `--lora-path {PATH}` and `--lora-nickname {NAME}`: load a LoRA adapter +- `--num-gpus {N}`: number of GPUs to use +- `--tp-size {N}`: tensor parallelism size, mainly for encoders +- `--sp-degree {N}`: sequence parallelism size +- `--ulysses-degree {N}` and `--ring-degree {N}`: USP parallelism controls +- `--attention-backend {BACKEND}`: attention backend for native SGLang pipelines +- `--component-attention-backends {MAP}`: per-component attention backend overrides, for example `text_encoder=torch_sdpa,transformer=fa` +- `--attention-backend-config {CONFIG}`: attention backend configuration + +### Sampling and output + +- `--prompt {PROMPT}` and `--negative-prompt {PROMPT}` +- `--image-path {PATH} [{PATH} ...]`: input image(s) for image-to-video or image-to-image generation +- `--num-inference-steps {STEPS}` and `--seed {SEED}` +- `--height {HEIGHT}`, `--width {WIDTH}`, `--num-frames {N}`, `--fps {FPS}` +- `--output-path {PATH}`, `--output-file-name {NAME}`, `--save-output`, `--return-frames` + +For frame interpolation and upscaling, see [Post-Processing](./post_processing). + +### Quantized transformers + +For quantized transformer checkpoints, prefer: + +- `--model-path` for the base pipeline +- `--transformer-path` for a quantized `transformers` transformer component folder +- `--transformer-weights-path` for a quantized safetensors file, directory, or repo + +See [Quantization](../quantization) for supported quantization families and examples. + +## Configuration Files + +Use `--config` to load JSON or YAML configuration. Command-line flags override values from the config file. 
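
The precedence rule can be sketched as a simple dictionary merge. This is not SGLang's actual config loader; `merge_options` is a hypothetical helper, and it assumes a flag that was not given on the command line shows up as `None`.

```python
from typing import Any, Dict, Optional


def merge_options(
    config: Dict[str, Any], cli: Dict[str, Optional[Any]]
) -> Dict[str, Any]:
    """Start from config-file values, then let explicitly set CLI flags win."""
    merged = dict(config)
    for key, value in cli.items():
        if value is not None:  # None means "flag not provided"
            merged[key] = value
    return merged


config_file = {"num_gpus": 2, "seed": 1024, "fps": 24}
cli_flags = {"num_gpus": 4, "seed": None}
effective = merge_options(config_file, cli_flags)
print(effective)
```

Here `num_gpus` comes from the command line, while `seed` and `fps` fall back to the config file.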
+ +```bash Command +sglang generate --config config.yaml +``` + +Example: + +```yaml Config +model_path: FastVideo/FastHunyuan-diffusers +prompt: A beautiful woman in a red dress walking down a street +output_path: outputs/ +num_gpus: 2 +sp_size: 2 +tp_size: 1 +num_frames: 45 +height: 720 +width: 1280 +num_inference_steps: 6 +seed: 1024 +fps: 24 +precision: bf16 +vae_precision: fp16 +vae_tiling: true +vae_sp: true +enable_torch_compile: false +``` + +## Generate + +`sglang generate` runs a single generation job and exits when the job finishes. + +```bash Command +sglang generate \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --text-encoder-cpu-offload \ + --pin-cpu-memory \ + --num-gpus 4 \ + --ulysses-degree 2 \ + --ring-degree 2 \ + --prompt "A curious raccoon" \ + --save-output \ + --output-path outputs \ + --output-file-name "a-curious-raccoon.mp4" +``` + + +HTTP server-only arguments are ignored by `sglang generate`. + + +For diffusers pipelines, Cache-DiT can be enabled with `SGLANG_CACHE_DIT_ENABLED=true` or `--cache-dit-config`. See [Cache-DiT](../cache_dit). + +## Serve + +`sglang serve` starts the HTTP server and keeps the model loaded for repeated requests. + +```bash Command +sglang serve \ + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --text-encoder-cpu-offload \ + --pin-cpu-memory \ + --num-gpus 4 \ + --ulysses-degree 2 \ + --ring-degree 2 \ + --port 30010 +``` + +### Cloud Storage + +SGLang Diffusion can upload generated images and videos to S3-compatible object storage after generation. + +```bash Command +export SGLANG_CLOUD_STORAGE_TYPE=s3 +export SGLANG_S3_BUCKET_NAME=my-bucket +export SGLANG_S3_ACCESS_KEY_ID=your-access-key +export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key +export SGLANG_S3_ENDPOINT_URL=https://minio.example.com +``` + +See [Environment Variables](../environment_variables) for the full set of storage options. 
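
Before launching the server, it can help to confirm that the storage variables you intend to use are actually set. The sketch below is a hypothetical pre-flight check, not part of SGLang: it only reports which of the variables shown above are unset or empty, and does not assume that every one of them is mandatory.

```python
import os
from typing import List, Mapping

# Variable names taken from the export example above.
STORAGE_VARS = [
    "SGLANG_CLOUD_STORAGE_TYPE",
    "SGLANG_S3_BUCKET_NAME",
    "SGLANG_S3_ACCESS_KEY_ID",
    "SGLANG_S3_SECRET_ACCESS_KEY",
    "SGLANG_S3_ENDPOINT_URL",
]


def unset_storage_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the listed variables that are missing or empty in `env`."""
    return [name for name in STORAGE_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "SGLANG_CLOUD_STORAGE_TYPE": "s3",
    "SGLANG_S3_BUCKET_NAME": "my-bucket",
}
missing = unset_storage_vars(example_env)
print("unset:", ", ".join(missing) or "none")
```

Calling `unset_storage_vars()` with no argument checks the current process environment.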
+ +## Component Path Overrides + +Override individual pipeline components such as `vae`, `transformer`, or `text_encoder` with `---path`. + +```bash Command +sglang serve \ + --model-path black-forest-labs/FLUX.2-dev \ + --vae-path fal/FLUX.2-Tiny-AutoEncoder +``` + +The component key must match the key in the model's `model_index.json`, and the path must be either a Hugging Face repo ID or a complete component directory. + +## Component Attention Backend Overrides + +Use `--component-attention-backends` when one pipeline component needs a different native attention backend from the global `--attention-backend`. + +```bash Command +sglang generate \ + --model-path Lightricks/LTX-2.3 \ + --attention-backend fa \ + --component-attention-backends text_encoder=torch_sdpa +``` + +The component key must match a pipeline module key such as `text_encoder`, `text_encoder_2`, `transformer`, `transformer_2`, or `connectors`. Component overrides take precedence over the global `--attention-backend` only while that component is being constructed. + +You can also pass dotted CLI entries: + +```bash Command +sglang generate \ + --model-path \ + --component-attention-backends.text_encoder torch_sdpa \ + --component-attention-backends.transformer fa +``` + +## Diffusers Backend + +Use `--backend diffusers` to force vanilla diffusers pipelines when no native SGLang implementation exists or when a model requires a custom pipeline class. + +### Key Options + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Values | Description |
| --- | --- | --- |
| `--backend` | `auto`, `sglang`, `diffusers` | Choose native SGLang, force native, or force diffusers |
| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes |
| `--vae-tiling` and `--vae-slicing` | flag | Lower memory usage for VAE decode |
| `--dit-precision` and `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision controls |
| `--enable-torch-compile` | flag | Enable `torch.compile` |
| `--cache-dit-config` | `{PATH}` | Cache-DiT config for diffusers pipelines |
+ +### Example + +```bash +sglang generate \ + --model-path AIDC-AI/Ovis-Image-7B \ + --backend diffusers \ + --trust-remote-code \ + --diffusers-attention-backend flash \ + --prompt "A serene Japanese garden with cherry blossoms" \ + --height 1024 \ + --width 1024 \ + --num-inference-steps 30 \ + --save-output \ + --output-path outputs \ + --output-file-name ovis_garden.png +``` + +For pipeline-specific arguments not exposed in the CLI, pass `diffusers_kwargs` in a config file. diff --git a/docs_new/docs/sglang-diffusion/api/openai_api.mdx b/docs_new/docs/sglang-diffusion/api/openai_api.mdx new file mode 100644 index 000000000000..95874f991278 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/api/openai_api.mdx @@ -0,0 +1,450 @@ +--- +title: OpenAI API +sidebarTitle: OpenAI API +description: Image and video generation endpoints with LoRA adapter management. +--- +The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management. + +## Prerequisites + +- Python 3.11+ if you plan to use the OpenAI Python SDK. + +## Serve + +Launch the server using the `sglang serve` command. + +### Start the server + +```bash +SERVER_ARGS=( + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers + --text-encoder-cpu-offload + --pin-cpu-memory + --num-gpus 4 + --ulysses-degree=2 + --ring-degree=2 + --port 30010 +) + +sglang serve "${SERVER_ARGS[@]}" +``` + +- **--model-path**: Path to the model or model ID. +- **--port**: HTTP port to listen on (default: `30000`). + +**Get Model Information** + +**Endpoint:** `GET /models` + +Returns information about the model served by this server, including model path, task type, pipeline configuration, and precision settings. 
+ +**Curl Example:** + +```bash curl +curl -sS -X GET "http://localhost:30010/models" +``` + +**Response Example:** + +```json +{ + "model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", + "task_type": "T2V", + "pipeline_name": "wan_pipeline", + "pipeline_class": "WanPipeline", + "num_gpus": 4, + "dit_precision": "bf16", + "vae_precision": "fp16" +} +``` + +--- + +## Endpoints + +### Image Generation + +The server implements an OpenAI-compatible Images API under the `/v1/images` namespace. + +**Create an image** + +**Endpoint:** `POST /v1/images/generations` + +**Python Example (b64_json response):** + +```python Python +import base64 +from openai import OpenAI + +client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1") + +img = client.images.generate( + prompt="A calico cat playing a piano on stage", + size="1024x1024", + n=1, + response_format="b64_json", +) + +image_bytes = base64.b64decode(img.data[0].b64_json) +with open("output.png", "wb") as f: + f.write(image_bytes) +``` + +**Curl Example:** + +```bash curl +curl -sS -X POST "http://localhost:30010/v1/images/generations" \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -d '{ + "prompt": "A calico cat playing a piano on stage", + "size": "1024x1024", + "n": 1, + "response_format": "b64_json" + }' +``` + +> **Note** +> If `response_format=url` is used and cloud storage is not configured, the API returns +> a relative URL like `/v1/images//content`. + +**Edit an image** + +**Endpoint:** `POST /v1/images/edits` + +This endpoint accepts a multipart form upload with input images and a text prompt. The server can return either a base64-encoded image or a URL to download the image. 
+ +**Curl Example (b64_json response):** + +```bash Command +curl -sS -X POST "http://localhost:30010/v1/images/edits" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -F "image=@local_input_image.png" \ + -F "url=image_url.jpg" \ + -F "prompt=A calico cat playing a piano on stage" \ + -F "size=1024x1024" \ + -F "response_format=b64_json" +``` + +**Curl Example (URL response):** + +```bash Command +curl -sS -X POST "http://localhost:30010/v1/images/edits" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -F "image=@local_input_image.png" \ + -F "url=image_url.jpg" \ + -F "prompt=A calico cat playing a piano on stage" \ + -F "size=1024x1024" \ + -F "response_format=url" +``` + +**Download image content** + +When `response_format=url` is used with `POST /v1/images/generations` or `POST /v1/images/edits`, +the API returns a relative URL like `/v1/images//content`. + +**Endpoint:** `GET /v1/images/{image_id}/content` + +**Curl Example:** + +```bash +curl -sS -L "http://localhost:30010/v1/images//content" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -o output.png +``` + +### Video Generation + +The server implements a subset of the OpenAI Videos API under the `/v1/videos` namespace. 
+ +**Create a video (text-to-video)** + +**Endpoint:** `POST /v1/videos` + +**Python Example:** + +```python Python +from openai import OpenAI + +client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1") + +video = client.videos.create( + prompt="A calico cat playing a piano on stage", + size="1280x720" +) +print(f"Video ID: {video.id}, Status: {video.status}") +``` + +**Curl Example:** + +```bash curl +curl -sS -X POST "http://localhost:30010/v1/videos" \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -d '{ + "prompt": "A calico cat playing a piano on stage", + "size": "1280x720" + }' +``` + +**Create a video (image-to-video)** + +For I2V or TI2V models (e.g., Wan2.1 I2V, LTX-2.3 two-stage), pass an input image via multipart form upload or a reference URL. + +**Curl Example (multipart form upload):** + +```bash Command +curl -sS -X POST "http://localhost:30010/v1/videos" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -F "prompt=A cat playing a piano" \ + -F "input_reference=@input_image.png" \ + -F "size=1280x720" +``` + +**Curl Example (reference URL):** + +```bash Command +curl -sS -X POST "http://localhost:30010/v1/videos" \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -d '{ + "prompt": "A cat playing a piano", + "reference_url": "https://example.com/input_image.png", + "size": "1280x720" + }' +``` + +**List videos** + +**Endpoint:** `GET /v1/videos` + +**Python Example:** + +```python Python +videos = client.videos.list() +for item in videos.data: + print(item.id, item.status) +``` + +**Curl Example:** + +```bash curl +curl -sS -X GET "http://localhost:30010/v1/videos" \ + -H "Authorization: Bearer sk-proj-1234567890" +``` + +**Download video content** + +**Endpoint:** `GET /v1/videos/{video_id}/content` + +**Python Example:** + +```python Python +import time + +# Poll for completion +while True: + page = client.videos.list() + item = 
next((v for v in page.data if v.id == video_id), None) + if item and item.status == "completed": + break + time.sleep(5) + +# Download content +resp = client.videos.download_content(video_id=video_id) +with open("output.mp4", "wb") as f: + f.write(resp.read()) +``` + +**Curl Example:** + +```bash curl +curl -sS -L "http://localhost:30010/v1/videos//content" \ + -H "Authorization: Bearer sk-proj-1234567890" \ + -o output.mp4 +``` + +--- + +### LoRA Management + +The server supports dynamic loading, merging, and unmerging of LoRA adapters. + +**Important Notes:** +- Mutual Exclusion: Only one LoRA can be *merged* (active) at a time +- Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one +- Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost + +**Set LoRA Adapter** + +Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters. + +**Endpoint:** `POST /v1/set_lora` + +**Parameters:** +- `lora_nickname` (string or list of strings, required): A unique identifier for the LoRA adapter(s). Can be a single string or a list of strings for multiple LoRAs +- `lora_path` (string or list of strings/None, optional): Path to the `.safetensors` file(s) or Hugging Face repo ID(s). Required for the first load; optional if re-activating a cached nickname. If a list, must match the length of `lora_nickname` +- `target` (string or list of strings, optional): Which transformer(s) to apply the LoRA to. If a list, must match the length of `lora_nickname`. 
Valid values: + - `"all"` (default): Apply to all transformers + - `"transformer"`: Apply only to the primary transformer (high noise for Wan2.2) + - `"transformer_2"`: Apply only to transformer_2 (low noise for Wan2.2) + - `"critic"`: Apply only to the critic model +- `strength` (float or list of floats, optional): LoRA strength for merge, default 1.0. If a list, must match the length of `lora_nickname`. Values < 1.0 reduce the effect, values > 1.0 amplify the effect + +**Single LoRA Example:** + +```bash Command +curl -X POST http://localhost:30010/v1/set_lora \ + -H "Content-Type: application/json" \ + -d '{ + "lora_nickname": "lora_name", + "lora_path": "/path/to/lora.safetensors", + "target": "all", + "strength": 0.8 + }' +``` + +**Multiple LoRA Example:** + +```bash Command +curl -X POST http://localhost:30010/v1/set_lora \ + -H "Content-Type: application/json" \ + -d '{ + "lora_nickname": ["lora_1", "lora_2"], + "lora_path": ["/path/to/lora1.safetensors", "/path/to/lora2.safetensors"], + "target": ["transformer", "transformer_2"], + "strength": [0.8, 1.0] + }' +``` + +**Multiple LoRA with Same Target:** + +```bash Command +curl -X POST http://localhost:30010/v1/set_lora \ + -H "Content-Type: application/json" \ + -d '{ + "lora_nickname": ["style_lora", "character_lora"], + "lora_path": ["/path/to/style.safetensors", "/path/to/character.safetensors"], + "target": "all", + "strength": [0.7, 0.9] + }' +``` + +> [!NOTE] +> When using multiple LoRAs: +> - All list parameters (`lora_nickname`, `lora_path`, `target`, `strength`) must have the same length +> - If `target` or `strength` is a single value, it will be applied to all LoRAs +> - Multiple LoRAs applied to the same target will be merged in order + + +**Merge LoRA Weights** + +Manually merges the currently set LoRA weights into the base model. 
+ +> [!NOTE] +> `set_lora` automatically performs a merge, so this is typically only needed if you have manually unmerged but want to re-apply the same LoRA without calling `set_lora` again.* + +**Endpoint:** `POST /v1/merge_lora_weights` + +**Parameters:** +- `target` (string, optional): Which transformer(s) to merge. One of "all" (default), "transformer", "transformer_2", "critic" +- `strength` (float, optional): LoRA strength for merge, default 1.0. Values < 1.0 reduce the effect, values > 1.0 amplify the effect + +**Curl Example:** + +```bash +curl -X POST http://localhost:30010/v1/merge_lora_weights \ + -H "Content-Type: application/json" \ + -d '{"strength": 0.8}' +``` + + +**Unmerge LoRA Weights** + +Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This **must** be called before setting a different LoRA. + +**Endpoint:** `POST /v1/unmerge_lora_weights` + +**Curl Example:** + +```bash +curl -X POST http://localhost:30010/v1/unmerge_lora_weights \ + -H "Content-Type: application/json" +``` + +**List LoRA Adapters** + +Returns loaded LoRA adapters and current application status per module. + +**Endpoint:** `GET /v1/list_loras` + +**Curl Example:** + +```bash +curl -sS -X GET "http://localhost:30010/v1/list_loras" +``` + +**Response Example:** + +```json +{ + "loaded_adapters": [ + { "nickname": "lora_a", "path": "/weights/lora_a.safetensors" }, + { "nickname": "lora_b", "path": "/weights/lora_b.safetensors" } + ], + "active": { + "transformer": [ + { + "nickname": "lora2", + "path": "tarn59/pixel_art_style_lora_z_image_turbo", + "merged": true, + "strength": 1.0 + } + ] + } +} +``` + +Notes: +- If LoRA is not enabled for the current pipeline, the server will return an error. +- `num_lora_layers_with_weights` counts only layers that have LoRA weights applied for the active adapter. + +### Example: Switching LoRAs + +1. 
Set LoRA A: + ```bash Command + curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_a", "lora_path": "path/to/A"}' + ``` +2. Generate with LoRA A... +3. Unmerge LoRA A: + ```bash Command + curl -X POST http://localhost:30010/v1/unmerge_lora_weights + ``` +4. Set LoRA B: + ```bash Command + curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}' + ``` +5. Generate with LoRA B... + +### Adjust Output Quality + +The server supports adjusting output quality and compression levels for both image and video generation through the `output-quality` and `output-compression` parameters. + +#### Parameters + +- **`output-quality`** (string, optional): Preset quality level that automatically sets compression. **Default is `"default"`**. Valid values: + - `"maximum"`: Highest quality (100) + - `"high"`: High quality (90) + - `"medium"`: Medium quality (55) + - `"low"`: Lower quality (35) + - `"default"`: Auto-adjust based on media type (50 for video, 75 for image) + +- **`output-compression`** (integer, optional): Direct compression level override (0-100). **Default is `None`**. When provided (not `None`), takes precedence over `output-quality`. + - `0`: Lowest quality, smallest file size + - `100`: Highest quality, largest file size + +#### Notes + +- **Precedence**: When both `output-quality` and `output-compression` are provided, `output-compression` takes precedence +- **Format Support**: Quality settings apply to JPEG, and video formats. 
PNG uses lossless compression and ignores these settings +- **File Size vs Quality**: Lower compression values (or "low" quality preset) produce smaller files but may show visible artifacts diff --git a/docs_new/docs/sglang-diffusion/api/post_processing.mdx b/docs_new/docs/sglang-diffusion/api/post_processing.mdx new file mode 100644 index 000000000000..132363a5a615 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/api/post_processing.mdx @@ -0,0 +1,237 @@ +--- +title: "Post-Processing" +metatags: + description: "Use SGLang Diffusion post-processing for frame interpolation and spatial upscaling after generation." +--- + +SGLang diffusion supports optional post-processing steps that run after +generation to improve temporal smoothness (frame interpolation) or spatial +resolution (upscaling). These steps are independent of the diffusion model and +can be combined in a single run. + +When both are enabled, **frame interpolation runs first** (increasing the frame +count), then **upscaling runs on every frame** (increasing the spatial +resolution). + +--- + +## Frame Interpolation (video only) + +Frame interpolation synthesizes new frames between each pair of consecutive +generated frames, producing smoother motion without re-running the diffusion +model. + +The `--frame-interpolation-exp` flag controls how many rounds of interpolation +to apply: each round inserts one new frame into every gap between adjacent +frames, so the output frame count follows the formula: + +> **(N − 1) × 2^exp + 1** +> +> e.g. 5 original frames with `exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames; +> with `exp=2` → **17** frames. + +### CLI Arguments + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description |
| --- | --- |
| `--enable-frame-interpolation` | Enable frame interpolation. Model weights are downloaded automatically on first use. |
| `--frame-interpolation-exp {EXP}` | Interpolation exponent: 1 = 2× temporal resolution, 2 = 4×, etc. (default: 1) |
| `--frame-interpolation-scale {SCALE}` | RIFE inference scale; use 0.5 for high-resolution inputs to save memory (default: 1.0) |
| `--frame-interpolation-model-path {PATH}` | Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically) |
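
As a quick check of how `--frame-interpolation-exp` affects output length, the frame-count formula given earlier, (N − 1) × 2^exp + 1, can be sketched in Python. The helper name `interpolated_frame_count` is illustrative and not part of SGLang.

```python
def interpolated_frame_count(num_frames: int, exp: int = 1) -> int:
    """Return the frame count after `exp` rounds of interpolation.

    Each round inserts one new frame into every gap between adjacent
    frames, so the number of gaps doubles per round:
    (N - 1) * 2**exp + 1.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if exp < 0:
        raise ValueError("exp must be non-negative")
    return (num_frames - 1) * 2**exp + 1


print(interpolated_frame_count(5, exp=1))  # 9
print(interpolated_frame_count(5, exp=2))  # 17
```

Both results match the worked figures in the formula note (9 and 17 frames from 5 originals).
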
+ +### Supported Models + +Frame interpolation uses the [RIFE](https://github.com/hzwer/Practical-RIFE) +(Real-Time Intermediate Flow Estimation) architecture. Only **RIFE 4.22.lite** +(`IFNet` with 4-scale `IFBlock` backbone) is supported. The network topology is +hard-coded, so custom weights provided via `--frame-interpolation-model-path` +must be a `flownet.pkl` checkpoint that is compatible with this architecture. + +Other RIFE versions (e.g., older `v4.x` variants with different block counts) +or entirely different frame interpolation methods (FILM, AMT, etc.) are **not +supported**. + + + + + + + + + + + + + + + + + + + + + +
| Weight | HuggingFace Repo | Description |
| --- | --- | --- |
| RIFE 4.22.lite *(default)* | `elfgum/RIFE-4.22.lite` | Lightweight model, downloaded automatically on first use |
+ +### Example + +Generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2¹ + 1 = 9): + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --prompt "A dog running through a park" \ + --num-frames 5 \ + --enable-frame-interpolation \ + --frame-interpolation-exp 1 \ + --save-output +``` + +--- + +## Upscaling (image and video) + +Upscaling increases the spatial resolution of generated images or video frames +using [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN). The model weights +are downloaded automatically on first use and cached for subsequent runs. + +### CLI Arguments + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Argument | Description |
| --- | --- |
| `--enable-upscaling` | Enable post-generation upscaling using Real-ESRGAN. |
| `--upscaling-scale {SCALE}` | Desired upscaling factor (default: 4). The 4× model is used internally; if a different scale is requested, a bicubic resize is applied after the network output. |
| `--upscaling-model-path {PATH}` | Local `.pth` file, HuggingFace repo ID, or `repo_id:filename` for Real-ESRGAN weights (default: `ai-forever/Real-ESRGAN` with `RealESRGAN_x4.pth`, downloaded automatically). Use the `repo_id:filename` format to specify a custom weight file from a HuggingFace repo (e.g. `my-org/my-esrgan:weights.pth`). |
+ +### Supported Models + +Upscaling supports two Real-ESRGAN network architectures. The correct +architecture is **auto-detected** from the checkpoint keys, so you only need to +point `--upscaling-model-path` at a valid `.pth` file: + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Architecture | Example Weights | Description |
| --- | --- | --- |
| RRDBNet | `RealESRGAN_x4plus.pth` | Heavier model with higher quality; best for photos |
| SRVGGNetCompact | `RealESRGAN_x4.pth` *(default)*, `realesr-animevideov3.pth`, `realesr-general-x4v3.pth` | Lightweight model; faster inference, good for video |
+ +The default weight file is +[`ai-forever/Real-ESRGAN`](https://huggingface.co/ai-forever/Real-ESRGAN) with +`RealESRGAN_x4.pth` (SRVGGNetCompact, 4× native scale). + +Other super-resolution models (e.g., SwinIR, HAT, BSRGAN) are **not supported** +— only Real-ESRGAN checkpoints using the two architectures above are +compatible. + +### Examples + +Generate a 1024×1024 image and upscale to 4096×4096: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --prompt "A cat sitting on a windowsill" \ + --output-size 1024x1024 \ + --enable-upscaling \ + --save-output +``` + +Generate a video and upscale each frame by 4×: + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --prompt "A curious raccoon" \ + --enable-upscaling \ + --upscaling-scale 4 \ + --save-output +``` + +--- + +## Combining Frame Interpolation and Upscaling + +Frame interpolation and upscaling can be combined in a single run. +Interpolation is applied first (increasing the frame count), then upscaling is +applied to every frame (increasing the spatial resolution). + +Example — generate 5 frames, interpolate to 9 frames, and upscale each frame +by 4×: + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --prompt "A curious raccoon" \ + --num-frames 5 \ + --enable-frame-interpolation \ + --frame-interpolation-exp 1 \ + --enable-upscaling \ + --upscaling-scale 4 \ + --save-output +``` diff --git a/docs_new/docs/sglang-diffusion/attention_backends.mdx b/docs_new/docs/sglang-diffusion/attention_backends.mdx new file mode 100644 index 000000000000..4aaa735bbab8 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/attention_backends.mdx @@ -0,0 +1,486 @@ +--- +title: "Attention Backends" +description: "Select and configure attention backends for SGLang diffusion pipelines." +--- +This document describes the attention backends available in sglang diffusion (`sglang.multimodal_gen`) and how to select them. 
+ +## Overview + +Attention backends are defined by `AttentionBackendEnum` (`sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum`) and selected via the CLI flag `--attention-backend`. + +Backend selection is performed by the shared attention layers (e.g. `LocalAttention` / `USPAttention` / `UlyssesAttention` in `sglang.multimodal_gen.runtime.layers.attention.layer`) and therefore applies to any model component using these layers (e.g. diffusion transformer / DiT and encoders). + +When using the diffusers backend, `--attention-backend` is passed through to diffusers' +`set_attention_backend` (e.g., `flash`, `_flash_3_hub`, `sage`, `xformers`, `native`). + +- **CUDA**: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA. +- **ROCm**: uses FlashAttention when available; otherwise falls back to PyTorch SDPA. +- **Intel XPU**: uses the XPU Flash Attention backend (fp16/bf16, head sizes 64/96/128/192/256); otherwise falls back to PyTorch SDPA. +- **MUSA**: uses FlashAttention when available; otherwise falls back to PyTorch SDPA. +- **MPS**: always uses PyTorch SDPA. +- **NPU**: uses FlashAttention for ring attention; otherwise uses PyTorch SDPA. + +## Backend options + +For SGLang-native pipelines, the CLI accepts the lowercase names of `AttentionBackendEnum`. The table below lists the backends implemented by the built-in platforms. `fa3`/`fa4` are accepted as aliases for `fa`. +
| CLI value | Enum value | Notes |
| --- | --- | --- |
| `fa` / `fa3` / `fa4` | `FA` | FlashAttention. `fa3`/`fa4` are normalized to `fa` during argument parsing (`ServerArgs.__post_init__`). |
| `torch_sdpa` | `TORCH_SDPA` | PyTorch `scaled_dot_product_attention`. |
| `sliding_tile_attn` | `SLIDING_TILE_ATTN` | Sliding Tile Attention (STA). Requires `st_attn`. Configure via `--attention-backend-config`. |
| `sage_attn` | `SAGE_ATTN` | Requires `sageattention`. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream `setup.py`: https://github.com/thu-ml/SageAttention/blob/main/setup.py. |
| `sage_attn_3` | `SAGE_ATTN_3` | Requires SageAttention3 installed per upstream instructions. |
| `video_sparse_attn` | `VIDEO_SPARSE_ATTN` | Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
| `vmoba_attn` | `VMOBA_ATTN` | Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
| `aiter` | `AITER` | Requires `aiter`. |
| `aiter_sage` | `AITER_SAGE` | Requires `aiter`. |
| `sla_attn` | `SLA_ATTN` | Sparse Linear Attention. Requires SpargeAttn. Install with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation`. |
| `sage_sla_attn` | `SAGE_SLA_ATTN` | SageAttention + Sparse Linear Attention. Requires SpargeAttn (same install as SLA). |
| `sparse_video_gen_2_attn` | `SPARSE_VIDEO_GEN_2_ATTN` | Requires `svg`. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. |
+ +## Selection priority + +The selection order in `runtime/layers/attention/selector.py` is: + +1. `global_force_attn_backend(...)` / `global_force_attn_backend_context_manager(...)` +2. Component override from `--component-attention-backends` while that component is being constructed +3. CLI `--attention-backend` (`ServerArgs.attention_backend`) +4. Auto selection (platform capability, dtype, and installed packages) + +## Configuration + +Some backends require additional configuration. You can pass these parameters via `--attention-backend-config`. This argument accepts: +- A path to a JSON or YAML configuration file. +- A JSON string (e.g., `'{"sparsity": 0.5}'`). +- Key-value pairs (e.g., `"sparsity=0.5,enable_x=true"`). + +### Supported Configuration Parameters + +**Sliding Tile Attention (`sliding_tile_attn`)** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `mask_strategy_file_path` | `str` | **Required.** Path to the mask strategy JSON file. | - |
| `sta_mode` | `str` | Mode of STA. | `STA_inference` |
| `skip_time_steps` | `int` | Number of steps to use full attention before switching to sparse attention. | `15` |
+ +**Video Sparse Attention (`video_sparse_attn`)** + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `sparsity` | `float` | Validation sparsity (0.0 - 1.0). | `0.0` |
+ +**V-MoBA (`vmoba_attn`)** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `temporal_chunk_size` | `int` | Chunk size for temporal dimension. | - |
| `temporal_topk` | `int` | Top-K tokens to select in temporal dimension. | - |
| `spatial_chunk_size` | `list[int]` | Chunk size for spatial dimension (H, W). | - |
| `spatial_topk` | `int` | Top-K tokens to select in spatial dimension. | - |
| `st_chunk_size` | `list[int]` | Chunk size for spatiotemporal dimension (T, H, W). | - |
| `st_topk` | `int` | Top-K tokens to select in spatiotemporal dimension. | - |
| `moba_select_mode` | `str` | Selection mode (e.g., `threshold`). | `threshold` |
| `moba_threshold` | `float` | Threshold value for selection. | `0.25` |
| `moba_threshold_type` | `str` | Type of thresholding (e.g., `query_head`). | `query_head` |
| `first_full_step` | `int` | Number of initial steps to use full attention. | `12` |
| `first_full_layer` | `int` | Number of initial layers to use full attention. | `0` |
| `temporal_layer` | `int` | Number of temporal layers. | `1` |
| `spatial_layer` | `int` | Number of spatial layers. | `1` |
| `st_layer` | `int` | Number of spatiotemporal layers. | `1` |
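
Because V-MoBA takes several list-valued parameters, a JSON string is often the easiest form to pass to `--attention-backend-config`. The sketch below builds one in Python and assembles the matching command line. The chunk sizes and top-k values are placeholders chosen for illustration, not recommended settings, and `<MODEL_PATH>` stands in for your own model.

```python
import json
import shlex

# Placeholder values for illustration only; tune them for your model.
vmoba_config = {
    "temporal_chunk_size": 2,
    "temporal_topk": 3,
    "spatial_chunk_size": [4, 13],    # (H, W)
    "spatial_topk": 6,
    "st_chunk_size": [4, 4, 13],      # (T, H, W)
    "st_topk": 18,
    "moba_select_mode": "threshold",
    "moba_threshold": 0.25,
}

# Serialize once so the same string can be logged and passed to the CLI.
config_json = json.dumps(vmoba_config)

command = [
    "sglang", "generate",
    "--model-path", "<MODEL_PATH>",
    "--prompt", "...",
    "--attention-backend", "vmoba_attn",
    "--attention-backend-config", config_json,
]

# shlex.join quotes the JSON so it survives a copy-paste into a shell.
print(shlex.join(command))
```

Building the argument list and quoting it with `shlex.join` avoids hand-escaping the nested quotes in the JSON string.
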
+ +## Platform support matrix + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
BackendCUDAROCmXPUMUSAMPSNPUNotes
`fa`YesYesCUDA requires SM80+ and fp16/bf16. XPU uses its own flash attention backend. FlashAttention is only used when the required runtime is installed; otherwise it falls back to torch_sdpa. No extra installations are required for NPU
`torch_sdpa`YesYesYesYesMost compatible option across platforms.
`sliding_tile_attn`YesNoNoNoCUDA-only. Requires st_attn. Configure via --attention-backend-config.
`sage_attn`YesNoNoNoCUDA-only (optional dependency).
`sage_attn_3`YesNoNoNoCUDA-only (optional dependency).
`video_sparse_attn`YesNoNoNoCUDA-only. Requires vsa. Configure sparsity via --attention-backend-config.
sla_attnYesNoNoNoCUDA-only. Requires SpargeAttn.
sage_sla_attnYesNoNoNoCUDA-only. Requires SpargeAttn.
vmoba_attnYesNoNoNoCUDA-only. Requires kernel.attn.vmoba_attn.vmoba. Configure via --attention-backend-config.
aiterNoNoRequires aiter.
aiter_sageNoNoRequires aiter.
`sparse_video_gen_2_attn`YesNoNoNoCUDA-only. Requires svg.
+ +## Usage + +### Select a backend via CLI + +```bash +sglang generate \ + --model-path \ + --prompt "..." \ + --attention-backend fa +``` + +```bash +sglang generate \ + --model-path \ + --prompt "..." \ + --attention-backend torch_sdpa +``` + +### Override one component + +Use component overrides when a specific module needs different attention semantics from the main transformer: + +```bash +sglang generate \ + --model-path \ + --prompt "..." \ + --attention-backend fa \ + --component-attention-backends text_encoder=torch_sdpa +``` + +Component keys match pipeline module names from `model_index.json`, such as `text_encoder`, `text_encoder_2`, `transformer`, `transformer_2`, or `connectors`. + +### Using Sliding Tile Attention (STA) + +```bash +# Pass the mask strategy file path via config +sglang generate \ + --model-path \ + --prompt "..." \ + --attention-backend sliding_tile_attn \ + --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json" +``` + +### Notes for ROCm / MPS + +- ROCm: use `--attention-backend torch_sdpa` or `fa` depending on what is available in your environment. +- MPS: the platform implementation always uses `torch_sdpa`. diff --git a/docs_new/docs/sglang-diffusion/cache_dit.mdx b/docs_new/docs/sglang-diffusion/cache_dit.mdx new file mode 100644 index 000000000000..2c6c88bc00f9 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/cache_dit.mdx @@ -0,0 +1,577 @@ +--- +title: "Cache-DiT Acceleration" +description: "Configure Cache-DiT acceleration for diffusion inference." +--- +SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to **1.69x inference speedup** with minimal quality loss. 
+ +## Overview + +**Cache-DiT** uses intelligent caching strategies to skip redundant computation in the denoising loop: + +- **DBCache (Dual Block Cache)**: Dynamically decides when to cache transformer blocks based on residual differences +- **TaylorSeer**: Uses Taylor expansion for calibration to optimize caching decisions +- **SCM (Step Computation Masking)**: Step-level caching control for additional speedup + +## Basic Usage + +Enable Cache-DiT by exporting the environment variable and using `sglang generate` or `sglang serve` : + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A beautiful sunset over the mountains" +``` + +## Diffusers Backend + +Cache-DiT supports loading acceleration configs from a custom YAML file. For +diffusers pipelines (`diffusers` backend), pass the YAML/JSON path via `--cache-dit-config`. This +flow requires cache-dit >= 1.2.0 (`cache_dit.load_configs`). + +### Single GPU inference + +Define a `cache.yaml` file that contains: + +- DBCache + TaylorSeer + +```yaml +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 +``` + +Then apply the config with: + +```bash +sglang generate \ + --backend diffusers \ + --model-path Qwen/Qwen-Image \ + --cache-dit-config cache.yaml \ + --prompt "A beautiful sunset over the mountains" +``` + +- DBCache + TaylorSeer + SCM (Step Computation Mask) + +```yaml Config +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 + # Must set the num_inference_steps for SCM. The SCM will automatically + # generate the steps computation mask based on the num_inference_steps. 
+ # Reference: https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/#scm-steps-computation-masking + num_inference_steps: 28 + steps_computation_mask: fast +``` + +- DBCache + TaylorSeer + SCM (Step Computation Mask) + Cache CFG + +```yaml Config +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 + num_inference_steps: 28 + steps_computation_mask: fast + enable_sperate_cfg: true # e.g, Qwen-Image, Wan, Chroma, Ovis-Image, etc. +``` + +### Distributed inference + +- 1D Parallelism + +Define a parallelism only config yaml `parallel.yaml` file that contains: + +```yaml Config +parallelism_config: + ulysses_size: auto + attention_backend: native +``` + +Then, apply the distributed inference acceleration config from yaml. `ulysses_size: auto` means that cache-dit will auto detect the `world_size` as the ulysses_size. Otherwise, you should manually set it as specific int number, e.g, 4. + +Then apply the distributed config with: (Note: please add `--num-gpus N` to specify the number of gpus for distributed inference) + +```bash +sglang generate \ + --backend diffusers \ + --num-gpus 4 \ + --model-path Qwen/Qwen-Image \ + --cache-dit-config parallel.yaml \ + --prompt "A futuristic cityscape at sunset" +``` + +- 2D Parallelism + +You can also define a 2D parallelism config yaml `parallel_2d.yaml` file that contains: + +```yaml Config +parallelism_config: + ulysses_size: auto + tp_size: 2 + attention_backend: native +``` +Then, apply the 2D parallelism config from yaml. Here `tp_size: 2` means using tensor parallelism with size 2. The `ulysses_size: auto` means that cache-dit will auto detect the `world_size // tp_size` as the ulysses_size. 
+ +- 3D Parallelism + +You can also define a 3D parallelism config yaml `parallel_3d.yaml` file that contains: + +```yaml Config +parallelism_config: + ulysses_size: 2 + ring_size: 2 + tp_size: 2 + attention_backend: native +``` +Then, apply the 3D parallelism config from yaml. Here `ulysses_size: 2`, `ring_size: 2`, `tp_size: 2` means using ulysses parallelism with size 2, ring parallelism with size 2 and tensor parallelism with size 2. + +- Ulysses Anything Attention + +To enable Ulysses Anything Attention, you can define a parallelism config yaml `parallel_uaa.yaml` file that contains: + +```yaml Config +parallelism_config: + ulysses_size: auto + attention_backend: native + ulysses_anything: true +``` + +- Ulysses FP8 Communication + +For device that don't have NVLink support, you can enable Ulysses FP8 Communication to further reduce the communication overhead. You can define a parallelism config yaml `parallel_fp8.yaml` file that contains: + +```yaml Config +parallelism_config: + ulysses_size: auto + attention_backend: native + ulysses_float8: true +``` + +- Async Ulysses CP + +You can also enable async ulysses CP to overlap the communication and computation. Define a parallelism config yaml `parallel_async.yaml` file that contains: + +```yaml Config +parallelism_config: + ulysses_size: auto + attention_backend: native + ulysses_async: true # Now, only support for FLUX.1, Qwen-Image, Ovis-Image and Z-Image. +``` +Then, apply the config from yaml. Here `ulysses_async: true` means enabling async ulysses CP. + +- TE-P and VAE-P + +You can also specify the extra parallel modules in the yaml config. 
For example, define a parallelism config yaml `parallel_extra.yaml` file that contains: + +```yaml Config +parallelism_config: + ulysses_size: auto + attention_backend: native + extra_parallel_modules: ["text_encoder", "vae"] +``` + + +### Hybrid Cache and Parallelism + +Define a hybrid cache and parallel acceleration config yaml `hybrid.yaml` file that contains: + +```yaml Config +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 +parallelism_config: + ulysses_size: auto + attention_backend: native + extra_parallel_modules: ["text_encoder", "vae"] +``` + +Then, apply the hybrid cache and parallel acceleration config from yaml. + +```bash +sglang generate \ + --backend diffusers \ + --num-gpus 4 \ + --model-path Qwen/Qwen-Image \ + --cache-dit-config hybrid.yaml \ + --prompt "A beautiful sunset over the mountains" +``` + +### Attention Backend + +In some cases, users may want to only specify the attention backend without any other optimization configs. In this case, you can define a yaml file `attention.yaml` that only contains: + +```yaml Config +attention_backend: "flash" # '_flash_3' for Hopper +``` + +### Quantization + +You can also specify the quantization config in the yaml file, required `torchao>=0.16.0`. For example, define a yaml file `quantize.yaml` that contains: + +```yaml Config +quantize_config: # quantization configuration for transformer modules + # float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc. + quant_type: "float8" + # layers to exclude from quantization (transformer). layers that contains any of the + # keywords in the exclude_layers list will be excluded from quantization. This is useful + # for some sensitive layers that are not robust to quantization, e.g., embedding layers. 
+ exclude_layers: + - "embedder" + - "embed" + verbose: false # whether to print verbose logs during quantization +``` +Then, apply the quantization config from yaml. Please also enable torch.compile for better performance if you are using quantization. For example: + +```bash Command +sglang generate \ + --backend diffusers \ + --model-path Qwen/Qwen-Image \ + --warmup \ + --cache-dit-config quantize.yaml \ + --enable-torch-compile \ + --dit-cpu-offload false \ + --text-encoder-cpu-offload false \ + --prompt "A beautiful sunset over the mountains" +``` + +### Combined Configs: Cache + Parallelism + Quantization + +You can also combine all the above configs together in a single yaml file `combined.yaml` that contains: + +```yaml Config +cache_config: + max_warmup_steps: 8 + warmup_interval: 2 + max_cached_steps: -1 + max_continuous_cached_steps: 2 + Fn_compute_blocks: 1 + Bn_compute_blocks: 0 + residual_diff_threshold: 0.12 + enable_taylorseer: true + taylorseer_order: 1 +parallelism_config: + ulysses_size: auto + attention_backend: native + extra_parallel_modules: ["text_encoder", "vae"] +quantize_config: + quant_type: "float8" + exclude_layers: + - "embedder" + - "embed" + verbose: false +``` +Then, apply the combined cache, parallelism and quantization config from yaml. Please also enable torch.compile for better performance if you are using quantization. + +## Advanced Configuration + +### DBCache Parameters + +DBCache controls block-level caching behavior: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
+ +### TaylorSeer Configuration + +TaylorSeer improves caching accuracy using Taylor expansion: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
+ +### Combined Configuration Example + +DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters +simultaneously: + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_FN=2 \ +SGLANG_CACHE_DIT_BN=1 \ +SGLANG_CACHE_DIT_WARMUP=4 \ +SGLANG_CACHE_DIT_RDT=0.4 \ +SGLANG_CACHE_DIT_MC=4 \ +SGLANG_CACHE_DIT_TAYLORSEER=true \ +SGLANG_CACHE_DIT_TS_ORDER=2 \ +sglang generate --model-path black-forest-labs/FLUX.1-dev \ + --prompt "A curious raccoon in a forest" +``` + +### SCM (Step Computation Masking) + +SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and +which to serve from cached results. + +**SCM Presets** + +SCM is configured with presets: +
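| Preset | Compute Ratio | Speed | Quality |
| --- | --- | --- | --- |
| `none` | 100% | Baseline | Best |
| `slow` | ~75% | ~1.3x | High |
| `medium` | ~50% | ~2x | Good |
| `fast` | ~35% | ~3x | Acceptable |
| `ultra` | ~25% | ~4x | Lower |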
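
The speed column tracks the compute ratio fairly closely. If only a fraction of the steps run the full transformer, and cached steps were free, the ideal speedup would be the reciprocal of that fraction. The sketch below works through that arithmetic under this simplifying assumption; real speedups depend on the model and on caching overhead, and the helper name `ideal_speedup` is illustrative only.

```python
# Approximate compute ratios taken from the preset table.
PRESET_COMPUTE_RATIO = {
    "none": 1.00,
    "slow": 0.75,
    "medium": 0.50,
    "fast": 0.35,
    "ultra": 0.25,
}


def ideal_speedup(compute_ratio: float) -> float:
    """Upper-bound speedup, assuming cached steps cost nothing."""
    if not 0.0 < compute_ratio <= 1.0:
        raise ValueError("compute_ratio must be in (0, 1]")
    return 1.0 / compute_ratio


for preset, ratio in PRESET_COMPUTE_RATIO.items():
    print(f"{preset:>6}: ~{ideal_speedup(ratio):.2f}x")
```

For `fast`, the reciprocal of 0.35 is roughly 2.9, consistent with the ~3x listed in the table.
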
+ +**Usage** + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_SCM_PRESET=medium \ +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A futuristic cityscape at sunset" +``` + +**Custom SCM Bins** + +For fine-grained control over which steps to compute vs cache: + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \ +SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \ +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A futuristic cityscape at sunset" +``` + +**SCM Policy** + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Policy | Env Variable | Description |
| --- | --- | --- |
| `dynamic` | `SGLANG_CACHE_DIT_SCM_POLICY=dynamic` | Adaptive caching based on content (default) |
| `static` | `SGLANG_CACHE_DIT_SCM_POLICY=static` | Fixed caching pattern |
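
To get a feel for what the custom bins in `SGLANG_CACHE_DIT_SCM_COMPUTE_BINS` and `SGLANG_CACHE_DIT_SCM_CACHE_BINS` describe, the sketch below expands them into a per-step mask. It assumes the bins alternate, starting with a compute bin; that interleaving is an assumption made for illustration, and `expand_bins` is a hypothetical helper rather than the Cache-DiT implementation.

```python
from itertools import zip_longest


def expand_bins(compute_bins: list[int], cache_bins: list[int]) -> list[int]:
    """Interleave compute and cache bins into a per-step mask.

    1 marks a fully computed step, 0 a step served from cache.
    Assumption: bins alternate, starting with a compute bin.
    """
    mask: list[int] = []
    for n_compute, n_cache in zip_longest(compute_bins, cache_bins, fillvalue=0):
        mask.extend([1] * n_compute)
        mask.extend([0] * n_cache)
    return mask


# The same comma-separated values used in the custom-bins example.
compute_bins = [int(x) for x in "8,3,3,2,2".split(",")]
cache_bins = [int(x) for x in "1,2,2,2,3".split(",")]
mask = expand_bins(compute_bins, cache_bins)

print(len(mask))              # total steps covered by the bins
print(sum(mask) / len(mask))  # fraction of steps fully computed
```

Under this reading, the example bins cover 28 steps in total (18 computed, 10 cached), so the bin sums should add up to the number of inference steps you run.
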
+ +## Environment Variables + +All Cache-DiT parameters can be configured via environment variables. +See [Environment Variables](./environment_variables) for the complete list. + +## Supported Models + +SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model Family | Example Models |
| --- | --- |
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| Hunyuan | HunyuanVideo |
+ +## Performance Tips + +1. **Start with defaults**: The default parameters work well for most models +2. **Use TaylorSeer**: It typically improves both speed and quality +3. **Tune R threshold**: Lower values = better quality, higher values = faster +4. **SCM for extra speed**: Use `medium` preset for good speed/quality balance +5. **Warmup matters**: Higher warmup = more stable caching decisions + +## Limitations + +- **SGLang-native pipelines**: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically + disabled when `world_size > 1`. +- **SCM minimum steps**: SCM requires >= 8 inference steps to be effective +- **Model support**: Only models registered in Cache-DiT's BlockAdapterRegister are supported + +## Troubleshooting + +### SCM disabled for low step count + +For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache +acceleration still works. + +## References + +- [Cache-DiT](https://github.com/vipshop/cache-dit) +- [SGLang Diffusion](./performance-optimization) diff --git a/docs_new/docs/sglang-diffusion/caching-acceleration.mdx b/docs_new/docs/sglang-diffusion/caching-acceleration.mdx new file mode 100644 index 000000000000..d0f3e73dfe55 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/caching-acceleration.mdx @@ -0,0 +1,87 @@ +--- +title: "Caching Acceleration" +description: "Compare caching acceleration strategies for diffusion models." +--- +SGLang provides two complementary caching strategies for Diffusion Transformer (DiT) models. Both reduce denoising cost by skipping redundant computation, but they operate at different levels. + +## Overview + +SGLang supports two complementary caching approaches: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Strategy | Scope | Mechanism | Best For |
|----------|-------|-----------|----------|
| Cache-DiT | Block-level | Skip individual transformer blocks dynamically | Advanced, higher speedup |
| TeaCache | Timestep-level | Skip entire denoising steps based on L1 similarity | Simple, built-in |
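
The timestep-level row can be illustrated with a minimal sketch. It assumes a relative L1 distance between consecutive inputs and a single threshold; `relative_l1` and `StepSkipper` are hypothetical names for illustration, not SGLang or TeaCache APIs, and the real implementation operates on tensors and may rescale the distance.

```python
def relative_l1(prev, curr):
    """Relative L1 distance between two equally sized sequences of floats."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev) or 1.0  # avoid division by zero
    return num / den


class StepSkipper:
    """Hypothetical helper: decide per step whether to run the full forward pass."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.accumulated = 0.0
        self.prev = None

    def should_compute(self, modulated_input):
        # The first step has nothing to compare against, so always compute.
        if self.prev is None:
            self.prev = list(modulated_input)
            return True
        self.accumulated += relative_l1(self.prev, modulated_input)
        self.prev = list(modulated_input)
        if self.accumulated < self.threshold:
            return False  # inputs barely changed: reuse the cached residual
        self.accumulated = 0.0  # enough drift: recompute and restart accumulation
        return True


skipper = StepSkipper(threshold=0.1)
decisions = [
    skipper.should_compute([1.0, 2.0]),
    skipper.should_compute([1.0, 2.0]),
    skipper.should_compute([2.0, 4.0]),
]
print(decisions)
```

A larger threshold skips more steps at the cost of more drift from the uncached result.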
+ +## Cache-DiT + +[Cache-DiT](https://github.com/vipshop/cache-dit) provides block-level caching with +advanced strategies like DBCache and TaylorSeer. It can achieve up to **1.69x speedup**. + +See [cache_dit.md](./cache_dit) for detailed configuration. + +### Quick Start + +```bash +SGLANG_CACHE_DIT_ENABLED=true \ +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A beautiful sunset over the mountains" +``` + +### Key Features + +- **DBCache**: Dynamic block-level caching based on residual differences +- **TaylorSeer**: Taylor expansion-based calibration for optimized caching +- **SCM**: Step-level computation masking for additional speedup + +## TeaCache + +TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely. + +See [teacache.md](./teacache) for detailed documentation. + +### Quick Overview + +- Tracks L1 distance between modulated inputs across timesteps +- When accumulated distance is below threshold, reuses cached residual +- Supports CFG with separate positive/negative caches + +### Supported Models + +- Wan (wan2.1, wan2.2) +- Hunyuan (HunyuanVideo) +- Z-Image + +For Flux and Qwen models, TeaCache is automatically disabled when CFG is enabled. + + +## References + +- [Cache-DiT Repository](https://github.com/vipshop/cache-dit) +- [TeaCache Paper](https://arxiv.org/abs/2411.14324) diff --git a/docs_new/docs/sglang-diffusion/ci_perf.mdx b/docs_new/docs/sglang-diffusion/ci_perf.mdx new file mode 100644 index 000000000000..2f87e63586bc --- /dev/null +++ b/docs_new/docs/sglang-diffusion/ci_perf.mdx @@ -0,0 +1,33 @@ +--- +title: "CI Performance Baselines" +description: "Generate and update diffusion performance baselines used in CI." 
+--- +## Perf Baseline Generation Script + +`python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py` starts a local diffusion server, issues requests for selected test cases, aggregates stage/denoise-step/E2E timings from the perf log, and writes the results back to the `scenarios` section of `perf_baselines.json`. + +### Usage + +Update a single case: + +```bash +python python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py --case qwen_image_t2i +``` + +Select by regex: + +```bash +python python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py --match 'qwen_image_.*' +``` + +Run all keys from the baseline file `scenarios`: + +```bash +python python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py --all-from-baseline +``` + +Specify input/output paths and timeout: + +```bash +python python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py --baseline python/sglang/multimodal_gen/test/server/perf_baselines.json --out /tmp/perf_baselines.json --timeout 600 +``` diff --git a/docs_new/docs/sglang-diffusion/compatibility_matrix.mdx b/docs_new/docs/sglang-diffusion/compatibility_matrix.mdx new file mode 100644 index 000000000000..0c9e038b64fa --- /dev/null +++ b/docs_new/docs/sglang-diffusion/compatibility_matrix.mdx @@ -0,0 +1,647 @@ +--- +title: "Supported Models" +description: "Check model compatibility across diffusion optimizations and backends." +--- +The table below shows every supported model and the optimizations supported for them. + +The symbols used have the following meanings: + +- ✅ = Full compatibility +- ❌ = No compatibility +- ⭕ = Does not apply to this model + +## Models x Optimization + +The `HuggingFace Model ID` can be passed directly to `from_pretrained()` methods, and sglang-diffusion will use the +optimal +default parameters when initializing and generating videos. 
+ +### Video Generation Models + +Optimization columns are abbreviated to keep the matrix readable: + +- `Tea` = TeaCache +- `Tile` = Sliding Tile Attention +- `Sage` = Sage Attention +- `VSA` = Video Sparse Attention +- `SLA` = Sparse Linear Attention +- `SageSLA` = Sage Sparse Linear Attention +- `SVG2` = Sparse Video Gen 2 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Model NameHugging Face Model IDResolutionTeaTileSageVSASLASageSLASVG2
FastWan2.1 T2V 1.3B`FastVideo/FastWan2.1-T2V-1.3B-Diffusers`480p
FastWan2.2 TI2V 5B Full Attn`FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers`720p
Wan2.2 TI2V 5B`Wan-AI/Wan2.2-TI2V-5B-Diffusers`720p
Wan2.2 T2V A14B`Wan-AI/Wan2.2-T2V-A14B-Diffusers`480p
720p
Wan2.2 I2V A14B`Wan-AI/Wan2.2-I2V-A14B-Diffusers`480p
720p
HunyuanVideo`hunyuanvideo-community/HunyuanVideo`720×1280
544×960
FastHunyuan`FastVideo/FastHunyuan-diffusers`720×1280
544×960
Wan2.1 T2V 1.3B`Wan-AI/Wan2.1-T2V-1.3B-Diffusers`480p
Wan2.1 T2V 14B`Wan-AI/Wan2.1-T2V-14B-Diffusers`480p, 720p
Wan2.1 I2V 480P`Wan-AI/Wan2.1-I2V-14B-480P-Diffusers`480p
Wan2.1 I2V 720P`Wan-AI/Wan2.1-I2V-14B-720P-Diffusers`720p
TurboWan2.1 T2V 1.3B`IPostYellow/TurboWan2.1-T2V-1.3B-Diffusers`480p
TurboWan2.1 T2V 14B`IPostYellow/TurboWan2.1-T2V-14B-Diffusers`480p
TurboWan2.1 T2V 14B 720P`IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers`720p
TurboWan2.2 I2V A14B`IPostYellow/TurboWan2.2-I2V-A14B-Diffusers`720p
Wan2.1 Fun 1.3B InPweizhou03/Wan2.1-Fun-1.3B-InP-Diffusers480p
Helios BaseBestWishYsh/Helios-Base720p
Helios MidBestWishYsh/Helios-Mid720p
Helios DistilledBestWishYsh/Helios-Distilled720p
LTX-2 (one/two-stage/TI2V)Lightricks/LTX-2768×512
1536×1024
LTX-2.3 (one/two-stage/TI2V/HQ)Lightricks/LTX-2.3768×512
1536×1024
1920×1088 (HQ default)
+ +**Note**: + +1. Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue. +2. SageSLA is based on SpargeAttn. Install it first with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation` +3. LTX pipeline selection: + - One-stage: `--pipeline-class-name LTX2Pipeline` + - Two-stage: `--pipeline-class-name LTX2TwoStagePipeline` + - Two-stage HQ: `--pipeline-class-name LTX2TwoStageHQPipeline` (HQ defaults to 1920×1088; you can still override `--width/--height`) + - LTX-2 and LTX-2.3 support both T2V and TI2V (`--image-path`) on one-stage and two-stage pipelines (including HQ). + - The spatial upsampler and distilled LoRA are auto-resolved from the model snapshot by default, and can still be overridden with `--spatial-upsampler-path` and `--distilled-lora-path`. + - For LTX models, the `Resolutions` column uses output video `width×height` semantics, matching `sglang generate --width ... --height ...`. +4. LTX-2 / LTX-2.3 two-stage also supports `--ltx2-two-stage-device-mode {original,snapshot,resident}`: + - `snapshot` is the default and recommended mode. + - `resident` usually provides the best latency/throughput but uses much more VRAM. + - `original` keeps official two-stage semantics without the premerged stage-2 transformer path. + - Example (one prior run): `original` `154.67s`, `snapshot` `114.05s`, `resident` `75.71s`; peak VRAM trend is `original < snapshot < resident`. + +### Image Generation Models + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Model NameHuggingFace Model ID
FLUX.1-dev`black-forest-labs/FLUX.1-dev`
FLUX.2-dev`black-forest-labs/FLUX.2-dev`
FLUX.2-dev-NVFP4black-forest-labs/FLUX.2-dev-NVFP4
FLUX.2-Klein-4Bblack-forest-labs/FLUX.2-klein-4B
FLUX.2-Klein-9Bblack-forest-labs/FLUX.2-klein-9B
Z-ImageTongyi-MAI/Z-Image
Z-Image-TurboTongyi-MAI/Z-Image-Turbo
GLM-Imagezai-org/GLM-Image
Qwen ImageQwen/Qwen-Image
Qwen Image 2512Qwen/Qwen-Image-2512
Qwen Image Edit`Qwen/Qwen-Image-Edit`
Qwen Image Edit 2509Qwen/Qwen-Image-Edit-2509
Qwen Image Edit 2511Qwen/Qwen-Image-Edit-2511
Qwen Image LayeredQwen/Qwen-Image-Layered
SD3 Mediumstabilityai/stable-diffusion-3-medium-diffusers
SD3.5 Mediumstabilityai/stable-diffusion-3.5-medium-diffusers
SD3.5 Largestabilityai/stable-diffusion-3.5-large-diffusers
Hunyuan3D-2tencent/Hunyuan3D-2
SANA 1.5 1.6BEfficient-Large-Model/SANA1.5_1.6B_1024px_diffusers
SANA 1.5 4.8BEfficient-Large-Model/SANA1.5_4.8B_1024px_diffusers
SANA 1600M 1024pxEfficient-Large-Model/Sana_1600M_1024px_diffusers
SANA 600M 1024pxEfficient-Large-Model/Sana_600M_1024px_diffusers
SANA 1600M 512pxEfficient-Large-Model/Sana_1600M_512px_diffusers
SANA 600M 512pxEfficient-Large-Model/Sana_600M_512px_diffusers
FireRed-Image-Edit 1.0FireRedTeam/FireRed-Image-Edit-1.0
FireRed-Image-Edit 1.1FireRedTeam/FireRed-Image-Edit-1.1
ERNIE-Imagebaidu/ERNIE-Image
ERNIE-Image-Turbobaidu/ERNIE-Image-Turbo
+ +## Supported Components + +SGLang Diffusion supports overriding individual pipeline components with +`--<component>-path`. The value can be either a Hugging Face repo ID or a local +component directory. + +The same overrides can also be provided in config files through +`component_paths.<component>`. + +### Common Syntax + +CLI: + +```bash Command +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --vae-path black-forest-labs/FLUX.2-small-decoder \ + --transformer-path /models/flux2/transformer +``` + +Config file: + +```yaml Config +model_path: black-forest-labs/FLUX.2-dev +component_paths: + vae: black-forest-labs/FLUX.2-small-decoder + transformer: /models/flux2/transformer +``` + +Use the component name from the pipeline's `model_index.json` or the native pipeline's registered module name: + + + + +
| Component Type | Supported Keys | Notes |
|----------------|----------------|-------|
| VAE | `vae`, `video_vae`, `audio_vae` | `vae` is the common image-generation override |
| Transformer / DiT | `transformer`, `video_dit`, `audio_dit` | `transformer` is the standard override for the main denoiser |
| Text / Preprocess | `text_encoder`, `text_encoder_2`, `tokenizer`, `processor`, `image_processor` | Replacement encoders often need matching preprocessing assets |
| Auxiliary | `scheduler`, `spatial_upsampler`, `vocoder`, `connectors`, `dual_tower_bridge`, `image_encoder`, `vision_language_encoder` | Only valid for pipelines that expose these components |
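
The relationship between config keys and CLI flags can be sketched as follows. This assumes the naming pattern visible in this document (for example `text_encoder_2` corresponds to `--text-encoder-2-path`); `component_flags` is a hypothetical helper for illustration, not part of SGLang.

```python
def component_flags(component_paths):
    """Turn a component_paths mapping into the equivalent CLI arguments.

    Assumption: each key maps to a flag by replacing underscores with
    hyphens and appending "-path", as the flags in this document suggest.
    """
    args = []
    for key, path in component_paths.items():
        args += [f"--{key.replace('_', '-')}-path", path]
    return args


overrides = {
    "vae": "black-forest-labs/FLUX.2-small-decoder",
    "transformer": "/models/flux2/transformer",
}
print(" ".join(component_flags(overrides)))
```

An override is only meaningful when the target pipeline actually exposes that component.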
+ +### Known Component Repos + +The table below lists concrete Hugging Face component repos that are already used in SGLang Diffusion docs or tests. It is not an exhaustive catalog of all compatible component repos. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Base Model | Override Key | Example Repo | Notes |
|------------|--------------|--------------|-------|
| `black-forest-labs/FLUX.2-dev` | `vae` | `black-forest-labs/FLUX.2-small-decoder` | Decoder-only FLUX.2 VAE override |
| `black-forest-labs/FLUX.2-dev` | `vae` | `fal/FLUX.2-Tiny-AutoEncoder` | Existing tested custom VAE path |
+ +### VAE + +- `--vae-path` is the common image-generation override. +- `--video-vae-path` and `--audio-vae-path` are only relevant for pipelines with separate video or audio VAEs. + +### Transformer / DiT + +- `--transformer-path` is the standard override for the main denoising transformer. +- For quantized transformers, prefer `--transformer-path` or `--transformer-weights-path`; see `quantization.md`. +- `--video-dit-path` and `--audio-dit-path` are only for pipelines that split denoisers by modality. + +### Text Encoders and Preprocessors + +- `--text-encoder-path` and `--text-encoder-2-path` override primary and secondary text encoders. +- `--tokenizer-path`, `--processor-path`, and `--image-processor-path` are useful when the replacement encoder requires matching preprocessing assets. + +### Auxiliary Components + +- `--scheduler-path` is only relevant when the pipeline exposes a scheduler component. +- `--spatial-upsampler-path` is mainly for two-stage pipelines such as `LTX2TwoStagePipeline`. +- `--vocoder-path`, `--connectors-path`, `--dual-tower-bridge-path`, `--image-encoder-path`, and `--vision-language-encoder-path` are only valid for pipelines that expose those components. + +### Notes + +1. Component overrides are only valid when the target pipeline actually uses + that component. +2. The override key should match the component name in the pipeline's + `model_index.json` or the native pipeline's registered module name. + +## Verified LoRA Examples + +This section lists example LoRAs that have been explicitly tested and verified with each base model in the **SGLang Diffusion** pipeline. + + +LoRAs that are not listed here are not necessarily incompatible. +In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions. +The entries below simply reflect configurations that have been manually validated by the SGLang team. 
+ + +### Verified LoRAs by Base Model + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Base Model | Supported LoRAs |
|------------|-----------------|
| Wan2.2 | `lightx2v/Wan2.2-Distill-Loras` |
| | `Cseti/wan2.2-14B-Arcane_Jinx-lora-v1` |
| Wan2.1 | `lightx2v/Wan2.1-Distill-Loras` |
| Z-Image-Turbo | `tarn59/pixel_art_style_lora_z_image_turbo` |
| | `wcde/Z-Image-Turbo-DeJPEG-Lora` |
| Qwen-Image | `lightx2v/Qwen-Image-Lightning` |
| | `flymy-ai/qwen-image-realism-lora` |
| | `prithivMLmods/Qwen-Image-HeadshotX` |
| | `starsfriday/Qwen-Image-EVA-LoRA` |
| Qwen-Image-Edit | `ostris/qwen_image_edit_inpainting` |
| | `lightx2v/Qwen-Image-Edit-2511-Lightning` |
| Flux | `dvyio/flux-lora-simple-illustration` |
| | `XLabs-AI/flux-furry-lora` |
| | `XLabs-AI/flux-RealismLora` |
+ +## Special requirements + +### Sliding Tile Attention + +- Currently, only Hopper GPUs (H100s) are supported. diff --git a/docs_new/docs/sglang-diffusion/contributing.mdx b/docs_new/docs/sglang-diffusion/contributing.mdx new file mode 100644 index 000000000000..f447518f63e7 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/contributing.mdx @@ -0,0 +1,77 @@ +--- +title: "Contributing to SGLang Diffusion" +metatags: + description: "This guide outlines the requirements for contributing to the SGLang Diffusion module (sglang.multimodal_gen)." +--- + +This guide outlines the requirements for contributing to the SGLang Diffusion module (`sglang.multimodal_gen`). + +## Contributor Guides + +- [Support New Models](./support_new_models): implementation guide for adding new diffusion pipelines +- [CI Performance](./ci_perf): update and regenerate perf baselines + + +## On AI-Assisted ("Vibe Coding") PRs + +Vibe-coded PRs are welcome; we judge code quality, not how it was produced. The bar is the same for all PRs: + +- **No over-commenting.** If the name says it all, skip the docstring. +- **No over-catching.** Don't guard against errors that virtually never happen in practice. +- **Test before submitting.** AI-generated code can be subtly wrong, so verify correctness end-to-end. + +## Commit Message Convention + +We follow a structured commit message format to maintain a clean history. + +**Format:** +```text +[diffusion] <scope>: <subject> +``` + +**Examples:** +- `[diffusion] cli: add --perf-dump-path argument` +- `[diffusion] scheduler: fix deadlock in batch processing` +- `[diffusion] model: support Stable Diffusion 3.5` + +**Rules:** +- **Prefix**: Always start with `[diffusion]`. +- **Scope** (Optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc. +- **Subject**: Imperative mood, short and clear (e.g., "add feature" not "added feature").
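
The commit message rules can be checked mechanically. The sketch below assumes the shape `[diffusion]`, then an optional lowercase scope followed by a colon, then a subject; the regular expression and `parse_commit_subject` are illustrative assumptions, not a hook shipped with the repository, and they do not check for imperative mood.

```python
import re

# Assumed shape: "[diffusion] " + optional "scope: " + non-empty subject.
COMMIT_RE = re.compile(
    r"^\[diffusion\] (?:(?P<scope>[a-z][a-z0-9_-]*): )?(?P<subject>\S.*)$"
)


def parse_commit_subject(line):
    """Return (scope, subject) if the line follows the convention, else None."""
    match = COMMIT_RE.match(line)
    if match is None:
        return None
    return match.group("scope"), match.group("subject")


for line in [
    "[diffusion] cli: add --perf-dump-path argument",
    "[diffusion] support Stable Diffusion 3.5",
    "fix deadlock in batch processing",
]:
    print(line, "->", parse_commit_subject(line))
```

A check like this could run in a local `commit-msg` hook if a contributor wants early feedback.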
+ +## Performance Reporting + +For PRs that impact **latency**, **throughput**, or **memory usage**, you **should** provide a performance comparison report. + +### How to Generate a Report + +1. **Baseline**: run the benchmark (for a single generation task) + ```bash + $ sglang generate --model-path <MODEL_PATH> --prompt "A benchmark prompt" --perf-dump-path baseline.json + ``` + +2. **New**: run the same benchmark, without modifying any server_args or sampling_params + ```bash + $ sglang generate --model-path <MODEL_PATH> --prompt "A benchmark prompt" --perf-dump-path new.json + ``` + +3. **Compare**: run the compare script, which will print a Markdown table to the console + ```bash + $ python python/sglang/multimodal_gen/benchmarks/compare_perf.py baseline.json new.json [new2.json ...] + ### Performance Comparison Report + ... + ``` +4. **Paste**: paste the table into the PR description + +## CI-Based Change Protection + +Consider adding tests to the `pr-test` or `nightly-test` suites to safeguard your changes, especially for PRs that: + +- support a new model + - add a test case for this new model to `testcase_configs.py` +- support or fix important features +- significantly improve performance + +Please run the corresponding test case, then update or add the baseline in `perf_baselines.json` by following the instructions printed to the console, if applicable. + +See [test](https://github.com/sgl-project/sglang/tree/main/python/sglang/multimodal_gen/test) for examples. diff --git a/docs_new/docs/sglang-diffusion/disaggregation.mdx b/docs_new/docs/sglang-diffusion/disaggregation.mdx new file mode 100644 index 000000000000..fcdc4e679472 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/disaggregation.mdx @@ -0,0 +1,361 @@ +--- +title: "Disaggregated Diffusion Pipeline" +metatags: + description: "Split SGLang Diffusion pipelines into independent encoder, denoiser, and decoder services for disaggregated serving."
+--- + +Split a monolithic text-to-video/image pipeline into independent **Encoder**, **Denoiser**, and **Decoder** roles, each running on its own GPU(s). A central **DiffusionServer** routes requests through the pipeline. + +## Quick Start + +Disaggregation is controlled by a single flag: `--disagg-role`. Each component is launched independently, just like LLM PD disaggregation. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| `--disagg-role` | What it runs |
|-----------------|--------------|
| `monolithic` | (Default) Standard single-server mode |
| `encoder` | All stages with the default `RoleType.ENCODER` affinity: `InputValidationStage`, `TextEncodingStage` (plus `ImageEncodingStage` / `ImageVAEEncodingStage` for image-conditioned pipelines), `LatentPreparationStage`, `TimestepPreparationStage`, and any model-specific "before denoising" stage (e.g. `QwenImageLayeredBeforeDenoisingStage`, `GlmImageBeforeDenoisingStage`). |
| `denoiser` | `DenoisingStage` (and its subclasses: `CausalDMDDenoisingStage`, `DmdDenoisingStage`, `LTX2AVDenoisingStage`, `LTX2RefinementStage`, `Hunyuan3DShapeDenoisingStage`, ...): the DiT forward loop plus the scheduler stepping it drives. |
| `decoder` | `DecodingStage` (VAE decode) and its subclasses (`LTX2AVDecodingStage`, `HeliosDecodingStage`, ...). |
| `server` | DiffusionServer head node + HTTP server (no GPU) |
+ +> Each stage declares its role via the `role_affinity` property on `PipelineStage` (default `ENCODER`). When `--disagg-role` is not `monolithic`, the pipeline only instantiates stages whose affinity matches, so the above table is the source of truth for what actually runs in each process. + +### Single-Machine Example (Verified) + +The following commands have been tested end-to-end on an 8×H200 machine with +`Wan-AI/Wan2.1-T2V-1.3B-Diffusers`. Each role runs on a separate GPU via +`--base-gpu-id`; the `server` head node requires no GPU. + +```bash +# Terminal 1: Encoder (GPU 0) +sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --disagg-role encoder \ + --disagg-server-addr tcp://127.0.0.1:19655 \ + --scheduler-port 19000 \ + --num-gpus 1 --base-gpu-id 0 + +# Terminal 2: Denoiser (GPU 1) +sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --disagg-role denoiser \ + --disagg-server-addr tcp://127.0.0.1:19655 \ + --scheduler-port 19001 \ + --num-gpus 1 --base-gpu-id 1 + +# Terminal 3: Decoder (GPU 2) +sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --disagg-role decoder \ + --disagg-server-addr tcp://127.0.0.1:19655 \ + --scheduler-port 19002 \ + --num-gpus 1 --base-gpu-id 2 + +# Terminal 4: DiffusionServer head (no GPU, receives HTTP requests) +sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \ + --disagg-role server \ + --encoder-urls "tcp://127.0.0.1:19000" \ + --denoiser-urls "tcp://127.0.0.1:19001" \ + --decoder-urls "tcp://127.0.0.1:19002" \ + --host 0.0.0.0 --port 22000 \ + --scheduler-port 19655 + +# Send request (video generation) +curl http://127.0.0.1:22000/v1/videos \ + -H "Content-Type: application/json" \ + -d '{"model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", "prompt": "A curious raccoon exploring a garden, cinematic", "size": "832x480"}' +``` + +> **Tested result (8×H200):** +> Encoder 2.3 s (TextEncoding) → Denoiser 312.8 s (50 steps, layerwise offload) → Decoder 7.1 s (VAE decode). 
+> Total ~322 s for 81-frame 1024×1024 video. + +> **Tip:** `--base-gpu-id` controls which physical GPU the role uses. +> Encoder and Decoder can share a GPU (e.g. both `--base-gpu-id 0`) to save resources, +> but make sure the combined GPU memory is sufficient. + +### Multi-Machine Example + +The exact same CLI pattern — just replace `127.0.0.1` with actual IPs and add +RDMA flags for direct transfer: + +```bash +# Machine A (10.0.0.1): Encoder +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --disagg-role encoder \ + --disagg-server-addr tcp://10.0.0.4:19655 \ + --scheduler-port 19000 \ + --num-gpus 1 \ + --disagg-p2p-hostname 10.0.0.1 --disagg-ib-device mlx5_0 + +# Machine B (10.0.0.2): Denoiser (4 GPUs with SP) +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --disagg-role denoiser \ + --disagg-server-addr tcp://10.0.0.4:19655 \ + --scheduler-port 19001 \ + --num-gpus 4 --denoiser-sp 4 --denoiser-ulysses 2 --denoiser-ring 2 \ + --disagg-p2p-hostname 10.0.0.2 --disagg-ib-device mlx5_0 + +# Machine C (10.0.0.3): Decoder +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --disagg-role decoder \ + --disagg-server-addr tcp://10.0.0.4:19655 \ + --scheduler-port 19002 \ + --num-gpus 1 \ + --disagg-p2p-hostname 10.0.0.3 --disagg-ib-device mlx5_0 + +# Machine D (10.0.0.4): DiffusionServer head +sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --disagg-role server \ + --encoder-urls "tcp://10.0.0.1:19000" \ + --denoiser-urls "tcp://10.0.0.2:19001" \ + --decoder-urls "tcp://10.0.0.3:19002" \ + --host 0.0.0.0 --port 30000 \ + --scheduler-port 19655 \ + --disagg-dispatch-policy max_free_slots +``` + +> ZMQ handles startup order gracefully — instances and head can start in any order. + +## Multiple Instances per Role + +Use semicolons in `--*-urls` to register multiple instances: + +```bash +# 2 encoders + 2 denoisers (4-GPU SP each) + 1 decoder +sglang serve --model-path ... 
--disagg-role server \ + --encoder-urls "tcp://10.0.0.1:35000;tcp://10.0.0.2:35000" \ + --denoiser-urls "tcp://10.0.0.3:35000;tcp://10.0.0.4:35000" \ + --decoder-urls "tcp://10.0.0.5:35000" +``` + +## Port Convention + +Result endpoints are derived deterministically from the head node's `--scheduler-port` (default: 5555): + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Socket | Port |
|--------|------|
| DS frontend (ROUTER) | `scheduler_port` |
| Encoder result (PULL) | `scheduler_port + 1` |
| Denoiser result (PULL) | `scheduler_port + 2` |
| Decoder result (PULL) | `scheduler_port + 3` |
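
The port arithmetic in the table can be written out as a short sketch. `derive_disagg_ports` and its dictionary keys are hypothetical names for illustration, not an SGLang API; only the offsets and the default of 5555 come from this document.

```python
def derive_disagg_ports(scheduler_port=5555):
    """Return the socket ports listed in the table for a head-node port."""
    return {
        "ds_frontend_router": scheduler_port,
        "encoder_result_pull": scheduler_port + 1,
        "denoiser_result_pull": scheduler_port + 2,
        "decoder_result_pull": scheduler_port + 3,
    }


# Head node started with --scheduler-port 19655, as in the examples above.
for name, port in derive_disagg_ports(19655).items():
    print(f"{name}: {port}")
```

With a head-node port of 19655 this gives 19655 through 19658, so none of those four ports should be taken by another process on the head node.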
+ +Role instances derive their result endpoint automatically from `--disagg-server-addr`. No manual endpoint configuration needed. + +## Transfer Mechanism + +Tensor data between roles (encoder→denoiser, denoiser→decoder) is transferred via a P2P transfer engine. The DiffusionServer only routes lightweight control messages (alloc/push/ready); actual tensor data flows directly between instances. + +**mooncake-transfer-engine** is required for disaggregated diffusion. It provides RDMA for direct GPU-to-GPU data movement. + +```bash +pip install mooncake-transfer-engine +``` + +### Transfer Flow + +1. **Sender** (encoder/denoiser) stages tensors: async copy to transfer buffer (GPU or CPU pinned, depending on GPUDirect support), overlapped with metadata JSON serialization. +2. **Sender** sends `transfer_staged` control message to DiffusionServer (metadata only, no tensor data). +3. **DiffusionServer** sends `transfer_alloc` to receiver → receiver allocates buffer slot → replies `transfer_allocated`. +4. **DiffusionServer** sends `transfer_push` to receiver with sender's address info. +5. **Receiver** pulls data via transfer engine (Mooncake RDMA or mock), sends `transfer_ready`. +6. **Receiver** loads tensors async on a dedicated transfer stream, overlapped with the previous request's compute. + +Decoder results (final output) flow back through DiffusionServer as raw ZMQ frames to the HTTP client. + +### RDMA Flags + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Flag | Default | Description |
|------|---------|-------------|
| `--disagg-p2p-hostname` | `127.0.0.1` | RDMA-reachable hostname/IP of this instance |
| `--disagg-ib-device` | None | InfiniBand device (e.g., `mlx5_0`, `mlx5_roce0`) |
| `--disagg-transfer-pool-size` | 256 MiB | Pinned memory pool per instance |
+ +Set `--disagg-p2p-hostname` to the actual IP on each machine. For multi-machine, `--disagg-ib-device` specifies the RDMA NIC. + +## Per-Role Parallelism + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Flag | Description |
|------|-------------|
| `--encoder-tp` | Encoder tensor parallelism |
| `--denoiser-tp` / `--denoiser-sp` / `--denoiser-ulysses` / `--denoiser-ring` | Denoiser parallelism |
| `--decoder-tp` | Decoder tensor parallelism |
+ +If not specified, parallelism is auto-derived from `--num-gpus`. + +## Other Options + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Flag | Default | Description |
|------|---------|-------------|
| `--disagg-timeout` | 600 | Timeout (seconds) for pending requests |
| `--disagg-dispatch-policy` | `round_robin` | `round_robin` or `max_free_slots` |
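
The two dispatch policy names suggest the following behavior. This is only a sketch of what the names imply, under the assumption that `round_robin` cycles through instances in order and `max_free_slots` prefers the instance with the most free slots; the functions below are hypothetical and do not reflect the actual DiffusionServer code.

```python
from itertools import count


def round_robin(instances, counter):
    """Pick instances in a fixed rotating order (assumed semantics)."""
    return instances[next(counter) % len(instances)]


def max_free_slots(instances, free_slots):
    """Pick the instance reporting the most free slots (assumed semantics)."""
    return max(instances, key=lambda inst: free_slots[inst])


denoisers = ["tcp://10.0.0.3:35000", "tcp://10.0.0.4:35000"]
counter = count()
print([round_robin(denoisers, counter) for _ in range(3)])
print(max_free_slots(denoisers, {denoisers[0]: 1, denoisers[1]: 3}))
```

Under these assumptions, a load-aware policy would matter most when request durations vary widely across instances.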
+ +## Python API + +For programmatic single-machine deployment, `launch_pool_disagg_server()` is available: + +```python +from sglang.multimodal_gen.runtime.server_args import ServerArgs +from sglang.multimodal_gen.runtime.launch_server import launch_pool_disagg_server + +server_args = ServerArgs.from_kwargs( + model_path="Wan-AI/Wan2.1-T2V-14B-Diffusers", + denoiser_sp=4, denoiser_ulysses=2, denoiser_ring=2, + disagg_ib_device="mlx5_0", +) + +launch_pool_disagg_server( + server_args, + encoder_gpus=[[0]], + denoiser_gpus=[[1, 2, 3, 4], [5, 6, 7, 8]], + decoder_gpus=[[0]], +) +``` + +## Architecture + +``` +Client ─── HTTP (port 30000) ──► FastAPI Server + │ + ▼ + DiffusionServer (ROUTER, scheduler_port) + ┌───────┼───────┐ + PUSH work │ │ │ PUSH work + ▼ │ ▼ + Encoder[0..N] │ Decoder[0..K] + │ │ ▲ + P2P tensor │ │ │ P2P tensor + transfer ▼ │ │ transfer + Denoiser[0..M] ─────┘ + │ + PULL results ◄────┘ (decoder → DS → client) +``` + +### Request State Machine + +``` +PENDING → ENCODER_WAITING → ENCODER_RUNNING → ENCODER_DONE + │ + DENOISING_WAITING → DENOISING_RUNNING → DENOISING_DONE + │ + DECODER_WAITING → DECODER_RUNNING → DONE +``` + +Any state can transition to `FAILED` or `TIMED_OUT`. diff --git a/docs_new/docs/sglang-diffusion/dynamic_batching.mdx b/docs_new/docs/sglang-diffusion/dynamic_batching.mdx new file mode 100644 index 000000000000..b05a6eb892f2 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/dynamic_batching.mdx @@ -0,0 +1,143 @@ +--- +title: "Inference Batching" +description: "Batch compatible native SGLang-Diffusion requests during serving." +mode: wide +--- +Dynamic batching is an opt-in SGLang-Diffusion serving mode that merges compatible queued requests into one native pipeline batch. It is separate from LLM continuous batching and tokenizer batching. + +Use it for concurrent T2I or T2V traffic with the same model and sampling shape. Keep singleton serving for latency-sensitive or highly mixed traffic. 
+ +## Enable + +Dynamic batching is disabled by default with `--batching-max-size 1`. + +```bash Command +sglang serve \ + --model-path black-forest-labs/FLUX.1-dev \ + --port 30010 \ + --batching-mode dynamic \ + --batching-max-size 8 \ + --batching-delay-ms 5 \ + --enable-batching-metrics +``` + +For request formats, see the [OpenAI-Compatible API](./api/openai_api). + +Use `--batching-config /path/to/batching_config.json` to load JSON rules when a model or resolution needs a lower cap than `--batching-max-size`: + +```json Config +{ + "schema_version": 1, + "rules": [ + { + "model_contains": "Qwen-Image", + "resolution": "1024x1024", + "max_batch_size": 1 + } + ] +} +``` + +## Compatibility + +An initial implementation of dynamic batching for T2I and T2V models can be found in [#18764](https://github.com/sgl-project/sglang/pull/18764). The current compatibility grid is below and will be updated as more coverage is added. See [Supported Models](./compatibility_matrix) for full model IDs. + +`✅` means supported, `❌` means not currently supported, `?` means untested, and `-` means not applicable. + +### Image + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelT2II2I
FLUX.1-dev-
FLUX.2-dev
FLUX.2-dev-NVFP4??
FLUX.2-Klein-4B
FLUX.2-Klein-9B??
Z-Image?-
Z-Image-Turbo-
GLM-Image-
Qwen Image-
Qwen Image 2512-
Qwen Image Edit-
Qwen Image Edit 2509-?
Qwen Image Edit 2511-?
Qwen Image Layered??
SD3 Medium?-
SD3.5 Medium?-
SD3.5 Large?-
Hunyuan3D-2?-
SANA 1.5 1.6B-
SANA 1.5 4.8B-
SANA 1600M 1024px?-
SANA 600M 1024px?-
SANA 1600M 512px?-
SANA 600M 512px?-
FireRed-Image-Edit 1.0-?
FireRed-Image-Edit 1.1-?
ERNIE-Image?-
ERNIE-Image-Turbo?-
+ +### Video + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSupport
FastWan2.1 T2V 1.3B
FastWan2.2 TI2V 5B Full Attn
Wan2.2 TI2V 5B
Wan2.2 T2V A14B
Wan2.2 I2V A14B
HunyuanVideo
FastHunyuan
Wan2.1 T2V 1.3B
Wan2.1 T2V 14B
Wan2.1 I2V 480P?
Wan2.1 I2V 720P?
TurboWan2.1 T2V 1.3B
TurboWan2.1 T2V 14B
TurboWan2.1 T2V 14B 720P
TurboWan2.2 I2V A14B?
Wan2.1 Fun 1.3B InP?
Helios Base?
Helios Mid?
Helios Distilled?
LTX-2?
LTX-2.3?
+ +## Notes + +- Requests batch only when model inputs, sampling parameters, output handling, and any configured rules are compatible. +- There is no startup probing, runtime learning, OOM retry, or automatic fallback to singletons. If a merged batch fails or cannot be split, every request in that batch receives an error. +- Batch shape can change kernels, so singleton and dynamic outputs are not expected to be bit-exact. +- Use `--enable-batching-metrics` to inspect realized batches: + +```text +Dynamic batch dispatch: size=2/8, user_max=8, queue_wait=5.12ms, stop_reason=delay +Dynamic batch dispatch: size=1/8, user_max=8, queue_wait=0.04ms, stop_reason=config_cap:1 +Dynamic batch stats (last 5 dispatches): avg_size=2.80, merged_rate=60.0%, full_rate=20.0%, utilization=35.0%, wait_avg=3.21ms, wait_p95=5.12ms, top_rejects=none +``` diff --git a/docs_new/docs/sglang-diffusion/environment_variables.mdx b/docs_new/docs/sglang-diffusion/environment_variables.mdx new file mode 100644 index 000000000000..8ade9a7ca47a --- /dev/null +++ b/docs_new/docs/sglang-diffusion/environment_variables.mdx @@ -0,0 +1,395 @@ +--- +title: "Environment Variables" +description: "Configure SGLang diffusion behavior with environment variables." +--- +## Runtime + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `SGLANG_DIFFUSION_TARGET_DEVICE` | `cuda` | Target device for inference (`cuda`, `rocm`, `xpu`, `npu`, `musa`, `mps`, `cpu`) |
| `SGLANG_DIFFUSION_ATTENTION_BACKEND` | not set | Override attention backend via env var (e.g. `fa`, `torch_sdpa`, `sage_attn`) |
| `SGLANG_DIFFUSION_ATTENTION_CONFIG` | not set | Path to attention backend configuration file (JSON/YAML) |
| `SGLANG_DIFFUSION_STAGE_LOGGING` | `false` | Enable per-stage timing logs |
| `SGLANG_DIFFUSION_SERVER_DEV_MODE` | `false` | Enable dev-only HTTP endpoints for debugging |
| `SGLANG_DIFFUSION_TORCH_PROFILER_DIR` | not set | Directory for torch profiler traces (absolute path). Enables profiling when set |
| `SGLANG_DIFFUSION_CACHE_ROOT` | `~/.cache/sgl_diffusion` | Root directory for cache files |
| `SGLANG_DIFFUSION_CONFIG_ROOT` | `~/.config/sgl_diffusion` | Root directory for configuration files |
| `SGLANG_DIFFUSION_LOGGING_LEVEL` | `INFO` | Default logging level |
| `SGLANG_DIFFUSION_WORKER_MULTIPROC_METHOD` | `fork` | Multiprocess context for workers (`fork` or `spawn`) |
| `SGLANG_USE_RUNAI_MODEL_STREAMER` | `true` | Use Run:AI model streamer for model loading |
+ +## Platform-Specific + +### Apple MPS + + + + + + + + + + + + + + + + + + + + + +
Environment VariableDefaultDescription
SGLANG_USE_MLXnot setSet to 1 to enable MLX fused Metal kernels for norm ops on MPS
+ +### ROCm (AMD GPUs) + + + + + + + + + + + + + + + + + + + + + + + + + + +
Environment VariableDefaultDescription
SGLANG_USE_ROCM_VAEfalseUse AITer GroupNorm in VAE for improved performance on ROCm
SGLANG_USE_ROCM_CUDNN_BENCHMARKfalseEnable MIOpen auto-tuning for VAE conv layers on ROCm
+ +### Quantization + + + + + + + + + + + + + + + + + + + + + +
Environment VariableDefaultDescription
SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKENDnot setFlashInfer FP4 GEMM backend for generic NVFP4 fallback
+ +## Caching Acceleration + +These variables configure caching acceleration for Diffusion Transformer (DiT) models. +SGLang supports multiple caching strategies - see [caching documentation](./caching-acceleration) for an overview. + +### Cache-DiT Configuration + +See [cache-dit documentation](./cache_dit) for detailed configuration. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Environment VariableDefaultDescription
`SGLANG_CACHE_DIT_ENABLED`falseEnable Cache-DiT acceleration
`SGLANG_CACHE_DIT_FN`1First N blocks to always compute
`SGLANG_CACHE_DIT_BN`0Last N blocks to always compute
`SGLANG_CACHE_DIT_WARMUP`4Warmup steps before caching
`SGLANG_CACHE_DIT_RDT`0.24Residual difference threshold
`SGLANG_CACHE_DIT_MC`3Max continuous cached steps
`SGLANG_CACHE_DIT_TAYLORSEER`falseEnable TaylorSeer calibrator
`SGLANG_CACHE_DIT_TS_ORDER`1TaylorSeer order (1 or 2)
`SGLANG_CACHE_DIT_SCM_PRESET`noneSCM preset (none/slow/medium/fast/ultra)
`SGLANG_CACHE_DIT_SCM_POLICY`dynamicSCM caching policy
`SGLANG_CACHE_DIT_SCM_COMPUTE_BINS`not setCustom SCM compute bins
`SGLANG_CACHE_DIT_SCM_CACHE_BINS`not setCustom SCM cache bins
+ +### Cache-DiT Secondary Transformer + +For dual-transformer models (e.g., Wan2.2 with high/low-noise experts), these variables configure caching for the secondary transformer. Each falls back to its primary counterpart if not set. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Environment VariableDefaultDescription
SGLANG_CACHE_DIT_SECONDARY_FN(from primary)First N blocks to always compute
SGLANG_CACHE_DIT_SECONDARY_BN(from primary)Last N blocks to always compute
SGLANG_CACHE_DIT_SECONDARY_WARMUP(from primary)Warmup steps before caching
SGLANG_CACHE_DIT_SECONDARY_RDT(from primary)Residual difference threshold
SGLANG_CACHE_DIT_SECONDARY_MC(from primary)Max continuous cached steps
SGLANG_CACHE_DIT_SECONDARY_TAYLORSEER(from primary)Enable TaylorSeer calibrator
SGLANG_CACHE_DIT_SECONDARY_TS_ORDER(from primary)TaylorSeer order (1 or 2)
+ +## Cloud Storage + +These variables configure S3-compatible cloud storage for automatically uploading generated images and videos. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Environment VariableDefaultDescription
`SGLANG_CLOUD_STORAGE_TYPE`not setSet to `s3` to enable cloud storage
`SGLANG_S3_BUCKET_NAME`not setThe name of the S3 bucket
`SGLANG_S3_ENDPOINT_URL`not setCustom endpoint URL (for MinIO, OSS, etc.)
`SGLANG_S3_REGION_NAME`us-east-1AWS region name
`SGLANG_S3_ACCESS_KEY_ID`not setAWS Access Key ID
`SGLANG_S3_SECRET_ACCESS_KEY`not setAWS Secret Access Key
+ +## CUDA Crash Debugging + +These variables enable kernel API logging and optional input/output dumps around diffusion CUDA kernel call boundaries. They are useful when tracking down CUDA crashes such as illegal memory access, device-side assert, or shape mismatches in custom kernels. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Environment VariableDefaultDescription
SGLANG_KERNEL_API_LOGLEVEL0Controls crash-debug kernel API logging. 1 logs API names, 3 logs tensor metadata, 5 adds tensor statistics, and 10 also writes dump snapshots.
SGLANG_KERNEL_API_LOGDESTstdoutDestination for crash-debug kernel API logs. Use stdout, stderr, or a file path. %i is replaced with the process PID.
SGLANG_KERNEL_API_DUMP_DIRsglang_kernel_api_dumpsOutput directory for level-10 kernel API dumps. %i is replaced with the process PID.
SGLANG_KERNEL_API_DUMP_INCLUDEnot setComma-separated wildcard patterns for kernel API names to include in level-10 dumps.
SGLANG_KERNEL_API_DUMP_EXCLUDEnot setComma-separated wildcard patterns for kernel API names to exclude from level-10 dumps.
diff --git a/docs_new/docs/sglang-diffusion/index.mdx b/docs_new/docs/sglang-diffusion/index.mdx new file mode 100644 index 000000000000..42a9ffbe1fde --- /dev/null +++ b/docs_new/docs/sglang-diffusion/index.mdx @@ -0,0 +1,56 @@ +--- +title: SGLang Diffusion +description: Accelerated image and video generation with diffusion models. +--- +SGLang Diffusion is a high-performance inference framework for image and video generation. It provides native SGLang pipelines, diffusers backend support, an OpenAI-compatible server, and an optimized kernel stack built on both precompiled `sgl-kernel` operators and JIT kernels for key inference paths. + +## Key Features + +- Broad model support across Wan, Hunyuan, Qwen-Image, FLUX, Z-Image, GLM-Image, and more +- Fast inference with `sgl-kernel`, JIT kernels, scheduler improvements, and caching acceleration +- Multiple interfaces: `sglang generate`, `sglang serve`, and an OpenAI-compatible API +- Multi-platform support for NVIDIA, AMD, Intel XPU, Ascend, Apple Silicon, and Moore Threads + +## Quick Start + +```bash +uv pip install "sglang[diffusion]" --prerelease=allow +``` + +```bash +sglang generate --model-path Qwen/Qwen-Image \ + --prompt "A beautiful sunset over the mountains" \ + --save-output +``` + +```bash +sglang serve --model-path Qwen/Qwen-Image --port 30010 +``` + +## Start Here + +- [Installation](/docs/sglang-diffusion/installation): install SGLang Diffusion and platform dependencies +- [Compatibility Matrix](/docs/sglang-diffusion/compatibility_matrix): check model, optimization, and component override support +- [CLI](/docs/sglang-diffusion/api/cli): run one-off generation jobs or launch a persistent server +- [OpenAI-Compatible API](/docs/sglang-diffusion/api/openai_api): send image and video requests to the HTTP server +- [Attention Backends](/docs/sglang-diffusion/attention_backends): choose the best backend for your model and hardware +- [Inference Batching](/docs/sglang-diffusion/dynamic_batching): batch 
compatible native diffusion requests during serving +- [Caching Acceleration](/docs/sglang-diffusion/caching-acceleration): use Cache-DiT or TeaCache to reduce denoising cost +- [Quantization](/docs/sglang-diffusion/quantization): load quantized transformer checkpoints +- [Contributing](/docs/sglang-diffusion/contributing): contribution workflow, adding new models, and CI perf baselines + +## Additional Documentation + +- [Post-Processing](/docs/sglang-diffusion/api/post_processing): frame interpolation and upscaling +- [Performance Overview](/docs/sglang-diffusion/performance-optimization): overview of attention, caching, and profiling +- [Environment Variables](/docs/sglang-diffusion/environment_variables): platform, caching, storage, and debugging configuration +- [Support New Models](/docs/sglang-diffusion/support_new_models): implementation guide for new diffusion pipelines +- [CI Performance](/docs/sglang-diffusion/ci_perf): performance baseline generation + +## References + +- [SGLang GitHub](https://github.com/sgl-project/sglang) +- [Cache-DiT](https://github.com/vipshop/cache-dit) +- [FastVideo](https://github.com/hao-ai-lab/FastVideo) +- [xDiT](https://github.com/xdit-project/xDiT) +- [Diffusers](https://github.com/huggingface/diffusers) diff --git a/docs_new/docs/sglang-diffusion/installation.mdx b/docs_new/docs/sglang-diffusion/installation.mdx new file mode 100644 index 000000000000..13210d83b1b6 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/installation.mdx @@ -0,0 +1,130 @@ +--- +title: Install SGLang Diffusion +description: Install SGLang Diffusion on NVIDIA, AMD, MUSA, and Ascend platforms. +--- +You can install SGLang-Diffusion using one of the methods below. The standard installation already includes SGLang's optimized kernel stack, including both `sgl-kernel` and JIT kernels used by diffusion workloads. 
+ +## Standard Installation (NVIDIA GPUs) + +### Method 1: With pip or uv + +It is recommended to use uv for a faster installation: + +```bash Command +pip install --upgrade pip +pip install uv +uv pip install "sglang[diffusion]" --prerelease=allow +``` + +### Method 2: From source + +```bash Command +# Clone the repository +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Install the Python packages +pip install --upgrade pip +pip install -e "python[diffusion]" + +# Or, with uv +uv pip install -e "python[diffusion]" --prerelease=allow +``` + +### Method 3: Using Docker + +The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang), built from the [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile). +Replace `` below with your HuggingFace Hub [token](https://huggingface.co/docs/hub/en/security-tokens). + +```bash Command +docker run --gpus all \ + --shm-size 32g \ + -p 30000:30000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=" \ + --ipc=host \ + lmsysorg/sglang:dev \ + zsh -c '\ + echo "Installing diffusion dependencies..." && \ + pip install -e "python[diffusion]" && \ + echo "Starting SGLang-Diffusion..." && \ + sglang generate \ + --model-path black-forest-labs/FLUX.1-dev \ + --prompt "A logo With Bold Large text: SGL Diffusion" \ + --save-output \ + ' +``` + +## Platform-Specific: ROCm (AMD GPUs) + +For AMD Instinct GPUs (e.g., MI300X), you can use the ROCm-enabled Docker image: + +```bash Command +docker run --device=/dev/kfd --device=/dev/dri --ipc=host \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env HF_TOKEN= \ + lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \ + sglang generate --model-path black-forest-labs/FLUX.1-dev --prompt "A logo With Bold Large text: SGL Diffusion" --save-output +``` + +For detailed ROCm system configuration and installation from source, see [AMD GPUs](../hardware-platforms/amd_gpu).
+ +## Platform-Specific: MUSA (Moore Threads GPUs) + +For Moore Threads GPUs (MTGPU) with the MUSA software stack, please follow the instructions below to install from source: + +```bash Command +# Clone the repository +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Install the Python packages +pip install --upgrade pip +rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml +pip install -e "python[all_musa]" +``` + +## Platform-Specific: Intel XPU + +For Intel Data Center GPU Max or Arc GPUs, follow the [XPU installation guide](../hardware-platforms/xpu) to set up the base environment, then install diffusion dependencies: + +```bash Command +pip install -e "python[diffusion]" +``` + +## Platform-Specific: Ascend NPU + +For Ascend NPU, please follow the [NPU installation guide](../hardware-platforms/ascend-npus/ascend_npu). + +Quick test: + +```bash Command +sglang generate --model-path black-forest-labs/FLUX.1-dev \ + --prompt "A logo With Bold Large text: SGL Diffusion" \ + --save-output +``` + +## Platform-Specific: Apple MPS + +For Apple MPS, please follow the instructions below to install from source: + +```bash Command +# Install ffmpeg +brew install ffmpeg + +# Install uv +brew install uv + +# Clone the repository +git clone https://github.com/sgl-project/sglang.git +cd sglang + +# Create and activate a virtual environment +uv venv -p 3.11 sglang-diffusion +source sglang-diffusion/bin/activate + +# Install the Python packages +uv pip install --upgrade pip +rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml +uv pip install -e "python[all_mps]" +``` diff --git a/docs_new/docs/sglang-diffusion/performance-optimization.mdx b/docs_new/docs/sglang-diffusion/performance-optimization.mdx new file mode 100644 index 000000000000..ef5a8950f638 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/performance-optimization.mdx @@ -0,0 +1,73 @@ +--- +title: "Performance Optimization" 
+description: "Optimize SGLang diffusion performance with caching, kernels, and profiling." +--- +This section covers the main performance levers for SGLang Diffusion: attention backends, caching acceleration, and profiling. + +## Overview + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
OptimizationTypeDescription
Cache-DiTCachingBlock-level caching with DBCache, TaylorSeer, and SCM
TeaCacheCachingTimestep-level caching based on temporal similarity
Attention BackendsKernelOptimized attention implementations (FlashAttention, SageAttention, etc.)
Inference BatchingSchedulerRequest batching for native diffusion serving
ProfilingDiagnosticsPyTorch Profiler and Nsight Systems guidance
+ +## Start Here + +- Use [Attention Backends](./attention_backends) to choose the best backend for your model and hardware. +- Use [Inference Batching](./dynamic_batching) to improve throughput for compatible concurrent requests. +- Use [Caching Acceleration](./caching-acceleration) to reduce denoising cost with Cache-DiT or TeaCache. +- Use [Profiling](./profiling) when you need to diagnose a bottleneck rather than guess. + +## Caching at a Glance + +- [Cache-DiT](./cache_dit) is block-level caching for diffusers pipelines and higher speedup-oriented tuning. +- [TeaCache](./teacache) is timestep-level caching built into SGLang model families. + + +## Current Baseline Snapshot + +For Ring SP benchmark details, see: + +- [Ring SP Performance](./ring_sp_performance) + +## References + +- [Cache-DiT Repository](https://github.com/vipshop/cache-dit) +- [TeaCache Paper](https://arxiv.org/abs/2411.14324) diff --git a/docs_new/docs/sglang-diffusion/profiling.mdx b/docs_new/docs/sglang-diffusion/profiling.mdx new file mode 100644 index 000000000000..2fb327a2a2d5 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/profiling.mdx @@ -0,0 +1,138 @@ +--- +title: "Profiling" +description: "Profile SGLang diffusion workloads with PyTorch Profiler and Nsight Systems." +--- +This guide covers profiling techniques for multimodal generation pipelines in SGLang. + +## PyTorch Profiler + +PyTorch Profiler provides detailed kernel execution time, call stack, and GPU utilization metrics. 
+ +### Denoising Stage Profiling + +Profile the denoising stage with sampled timesteps (default: 5 steps after 1 warmup step): + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --seed 0 \ + --profile +``` + +**Parameters:** +- `--profile`: Enable profiling for the denoising stage +- `--num-profiled-timesteps N`: Number of timesteps to profile after warmup (default: 5) + - Smaller values reduce trace file size + - Example: `--num-profiled-timesteps 10` profiles 10 steps after 1 warmup step + +### Full Pipeline Profiling + +Profile all pipeline stages (text encoding, denoising, VAE decoding, etc.): + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --seed 0 \ + --profile \ + --profile-all-stages +``` + +**Parameters:** +- `--profile-all-stages`: Used with `--profile`, profile all pipeline stages instead of just denoising + +### Output Location + +By default, trace files are saved in the ./logs/ directory. + +The exact output file path will be shown in the console output, for example: + +```bash +[mm-dd hh:mm:ss] Saved profiler traces to: /sgl-workspace/sglang/logs/mocked_fake_id_for_offline_generate-5_steps-global-rank0.trace.json.gz +``` + +### View Traces + +Load and visualize trace files at: +- https://ui.perfetto.dev/ (recommended) +- chrome://tracing (Chrome only) + +For large trace files, reduce `--num-profiled-timesteps` or avoid using `--profile-all-stages`. + + +### `--perf-dump-path` (Stage/Step Timing Dump) + +Besides profiler traces, you can also dump a lightweight JSON report that contains: +- stage-level timing breakdown for the full pipeline +- step-level timing breakdown for the denoising stage (per diffusion step) + +This is useful to quickly identify which stage dominates end-to-end latency, and whether denoising steps have uniform runtimes (and if not, which step has an abnormal spike). 
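
As a rough illustration of that triage, the Python sketch below ranks denoising steps by duration. It assumes the dump is a JSON object with a top-level `denoise_steps_ms` array whose entries carry `step` and `duration_ms` keys; the helper names `load_report` and `slowest_steps` and the file name `perf.json` are illustrative and not part of SGLang.

```python
import json
from pathlib import Path


def load_report(path: str) -> dict:
    """Read a JSON report written via --perf-dump-path (path is caller-chosen)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def slowest_steps(report: dict, top_k: int = 3) -> list:
    """Return up to top_k denoising steps, slowest first.

    Assumes report["denoise_steps_ms"] is a list of
    {"step": int, "duration_ms": float} objects.
    """
    steps = report.get("denoise_steps_ms", [])
    return sorted(steps, key=lambda s: s["duration_ms"], reverse=True)[:top_k]


# Hypothetical data in the assumed shape; real values would come from perf.json.
sample = {
    "denoise_steps_ms": [
        {"step": 0, "duration_ms": 412.7},
        {"step": 1, "duration_ms": 98.3},
        {"step": 2, "duration_ms": 97.9},
    ]
}

for entry in slowest_steps(sample, top_k=2):
    print(f"step {entry['step']}: {entry['duration_ms']:.1f} ms")
```

On a real dump, `slowest_steps(load_report("perf.json"))` would point at the steps worth inspecting first in a profiler trace.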
+ +The dumped JSON contains a `denoise_steps_ms` field formatted as an array of objects, each with a `step` key (the step index) and a `duration_ms` key. + +Example: + +```bash +sglang generate \ + --model-path \ + --prompt "" \ + --perf-dump-path perf.json +``` + +## Nsight Systems + +Nsight Systems provides low-level CUDA profiling with kernel details, register usage, and memory access patterns. + +### Installation + +See the [SGLang profiling guide](../developer_guide/benchmark_and_profiling#profile-with-nsight) for installation instructions. + +### Basic Profiling + +Profile the entire pipeline execution: + +```bash +nsys profile \ + --trace-fork-before-exec=true \ + --cuda-graph-trace=node \ + --force-overwrite=true \ + -o QwenImage \ + sglang generate \ + --model-path Qwen/Qwen-Image \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --seed 0 +``` + +### Targeted Stage Profiling + +Use `--delay` and `--duration` to capture specific stages and reduce file size: + +```bash +nsys profile \ + --trace-fork-before-exec=true \ + --cuda-graph-trace=node \ + --force-overwrite=true \ + --delay 10 \ + --duration 30 \ + -o QwenImage_denoising \ + sglang generate \ + --model-path Qwen/Qwen-Image \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --seed 0 +``` + +**Parameters:** +- `--delay N`: Wait N seconds before starting capture (skip initialization overhead) +- `--duration N`: Capture for N seconds (focus on specific stages) +- `--force-overwrite`: Overwrite existing output files + +## Notes + +- **Reduce trace size**: Use `--num-profiled-timesteps` with smaller values or `--delay`/`--duration` with Nsight Systems +- **Stage-specific analysis**: Use `--profile` alone for denoising stage, add `--profile-all-stages` for full pipeline +- **Multiple runs**: Profile with different prompts and resolutions to identify bottlenecks across workloads + +## FAQ + +- If you are profiling `sglang generate` with Nsight Systems and find that the generated 
profiler file did not capture any CUDA kernels, you can resolve this issue by increasing the model's inference steps to extend the execution time. diff --git a/docs_new/docs/sglang-diffusion/quantization.mdx b/docs_new/docs/sglang-diffusion/quantization.mdx new file mode 100644 index 000000000000..392c0831b06b --- /dev/null +++ b/docs_new/docs/sglang-diffusion/quantization.mdx @@ -0,0 +1,601 @@ +--- +title: "Quantization" +metatags: + description: "SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep the base model and the quantized transformer override separate." +--- + +SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep +the base model and the quantized transformer override separate. + +## Quick Reference + +Use these paths: + +- `--model-path`: the base or original model +- `--transformer-path`: a quantized transformers-style transformer component directory that already contains its own `config.json` +- `--transformer-weights-path`: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID + +Recommended example: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \ + --prompt "a curious pikachu" +``` + +For quantized transformers-style transformer component folders: + +```bash +sglang generate \ + --model-path /path/to/base-model \ + --transformer-path /path/to/quantized-transformer \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" +``` + +NOTE: Some model-specific integrations also accept a quantized repo or local +directory directly as `--model-path`, but that is a compatibility path. If a +repo contains multiple candidate checkpoints, pass +`--transformer-weights-path` explicitly. + +## Quant Families + +Here, `quant_family` means a checkpoint and loading family with shared CLI +usage and loader behavior. 
It is not just the numeric precision or a kernel +backend. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
quant_familycheckpoint formcanonical CLIsupported modelsextra dependencyplatform / notes
fp8Quantized transformer component folder, or safetensors with quantization_config metadata--transformer-path or --transformer-weights-pathALLNoneComponent-folder and single-file flows are both supported
modelopt-fp8Converted ModelOpt FP8 transformer directory or repo with config.json--transformer-pathFLUX.1, FLUX.2, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image EditNoneSerialized config stays quant_method=modelopt with quant_algo=FP8; dit_layerwise_offload is supported and dit_cpu_offload stays disabled
modelopt-nvfp4Mixed transformer directory/repo with config.json, or raw NVFP4 safetensors export/repo--transformer-path for mixed overrides; --transformer-weights-path for raw exportsFLUX.1, FLUX.2, Wan2.2NoneMixed override repos keep the base model separate; raw exports such as black-forest-labs/FLUX.2-dev-NVFP4 still use the weights-path flow
nunchaku-svdqPre-quantized Nunchaku transformer weights, usually named svdq-{int4\|fp4}_r{rank}-...--transformer-weights-pathModel-specific support such as Qwen-Image, FLUX, and Z-ImagenunchakuSGLang can infer precision and rank from the filename and supports both int4 and nvfp4
msmodelslimPre-quantized msmodelslim transformer weights--model-pathWan2.2 familyNoneCurrently only compatible with the Ascend NPU family and supports both w8a8 and w4a4
+ +## Validated ModelOpt Checkpoints + +This section is the canonical support matrix for the nine diffusion ModelOpt +checkpoints currently wired up in SGLang docs and B200 CI coverage. + +Published checkpoints keep the serialized quantization config as +`quant_method=modelopt`; the FP8 vs NVFP4 split below is a documentation label +derived from `quant_algo`. + +Eight of the nine repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the +official `black-forest-labs/FLUX.2-dev-NVFP4` repo. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Quant AlgoBase ModelPreferred CLIHF RepoCurrent ScopeNotes
FP8black-forest-labs/FLUX.1-dev--transformer-pathlmsys/flux1-dev-modelopt-fp8-sglang-transformersingle-transformer override, deterministic latent/image comparison, H100 benchmark, torch-profiler traceSGLang converter keeps a validated BF16 fallback set for modulation and FF projection layers; use --model-id FLUX.1-dev for local mirrors
FP8black-forest-labs/FLUX.2-dev--transformer-pathlmsys/flux2-dev-modelopt-fp8-sglang-transformersingle-transformer override load and generation pathpublished SGLang-ready transformer override
FP8Wan-AI/Wan2.2-T2V-A14B-Diffusers--transformer-pathlmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformerprimary transformer quantized, transformer_2 kept BF16primary-transformer-only path; keep transformer_2 on the base checkpoint, and do not describe this as dual-transformer full-model FP8 unless that path is validated separately
FP8hunyuanvideo-community/HunyuanVideo--transformer-pathlmsys/hunyuanvideo-modelopt-fp8-sglang-transformersingle-transformer override, BF16-vs-FP8 video comparison, H100 benchmark, torch-profiler traceHunyuanVideo uses different ModelOpt/diffusers and SGLang runtime module names; the converter maps those names before writing FP8 scale tensors and BF16 fallback ignores
FP8Qwen/Qwen-Image--transformer-pathlmsys/qwen-image-modelopt-fp8-sglang-transformersingle-transformer override, BF16-vs-FP8 image comparison, H100 benchmark, torch-profiler traceshares the Qwen Image FP8 fallback preset; keep img_in, txt_in, timestep embedder, norm_out.linear, proj_out, img_mod/txt_mod, and img_mlp.net.2 in BF16
FP8Qwen/Qwen-Image-Edit-2511--transformer-pathlmsys/qwen-image-edit-modelopt-fp8-sglang-transformerTI2I edit path, BF16-vs-FP8 image comparison, H100 benchmarkshares QwenImageTransformer2DModel with Qwen Image and uses the same Qwen Image FP8 fallback preset
NVFP4black-forest-labs/FLUX.1-dev--transformer-pathlmsys/flux1-dev-modelopt-nvfp4-sglang-transformermixed BF16+NVFP4 transformer override, correctness validation, 4x RTX 5090 benchmark, torch-profiler traceuse build_modelopt_nvfp4_transformer.py; validated builder keeps selected FLUX.1 modules in BF16 and sets swap_weight_nibbles=false
NVFP4black-forest-labs/FLUX.2-dev--transformer-weights-pathblack-forest-labs/FLUX.2-dev-NVFP4packed-QKV load pathofficial raw export repo; validated packed export detection and runtime layout handling
NVFP4Wan-AI/Wan2.2-T2V-A14B-Diffusers--transformer-pathlmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformerprimary transformer quantized with ModelOpt NVFP4, transformer_2 kept BF16primary-transformer-only path; keep transformer_2 on the base checkpoint, and current B200/Blackwell bring-up uses SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn
+ +These nine checkpoints are also the intended case set for the B200 diffusion CI +job (`multimodal-gen-test-1-b200`). + +## ModelOpt FP8 + +### Usage Examples + +Converted ModelOpt FP8 checkpoints should be loaded as transformer component +overrides. If the repo or local directory already contains `config.json`, use +`--transformer-path`. + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --transformer-path lmsys/flux2-dev-modelopt-fp8-sglang-transformer \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --save-output +``` + +```bash +sglang generate \ + --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --transformer-path lmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformer \ + --prompt "a fox walking through neon rain" \ + --save-output +``` + +```bash +sglang generate \ + --model-path hunyuanvideo-community/HunyuanVideo \ + --transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer \ + --height 544 --width 960 --num-frames 17 \ + --prompt "A cinematic shot of a red sports car driving through rain at night" \ + --save-output +``` + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --transformer-path lmsys/qwen-image-modelopt-fp8-sglang-transformer \ + --prompt "A tiny astronaut reading a book under a glass greenhouse" \ + --save-output +``` + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image-Edit-2511 \ + --transformer-path lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer \ + --image-path /path/to/input.png \ + --prompt "Turn the scene into a warm watercolor illustration" \ + --save-output +``` + +### Notes + +- `--transformer-path` is the canonical flag for converted ModelOpt FP8 + transformer component repos or directories that already carry `config.json`. +- If the override repo or local directory contains its own `config.json`, + SGLang reads the quantization config from that override instead of relying on + the base model config. 
+- `--transformer-weights-path` still works when you intentionally point at raw + weight files or a directory that should be metadata-probed as weights first. +- `dit_layerwise_offload` is supported for ModelOpt FP8 checkpoints. +- `dit_cpu_offload` still stays disabled for ModelOpt FP8 checkpoints. +- The layerwise offload path now preserves the non-contiguous FP8 weight stride + expected by the runtime FP8 GEMM path. +- On disk, the quantization config stays `quant_method=modelopt` with + `quant_algo=FP8`; the `modelopt-fp8` label in this document is a support + family name, not a serialized config key. +- To build the converted checkpoint yourself from a ModelOpt diffusers export, + use `python -m sglang.multimodal_gen.tools.build_modelopt_fp8_transformer`. + +## ModelOpt NVFP4 + +### Usage Examples + +For mixed ModelOpt NVFP4 transformer overrides that already contain +`config.json`, keep the base model and quantized transformer separate and use +`--transformer-path`: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.1-dev \ + --transformer-path lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --save-output +``` + +For raw NVFP4 exports such as the official FLUX.2 release, use +`--transformer-weights-path`: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev \ + --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --save-output +``` + +SGLang also supports passing the NVFP4 repo or local directory directly as +`--model-path`: + +```bash +sglang generate \ + --model-path black-forest-labs/FLUX.2-dev-NVFP4 \ + --prompt "A Logo With Bold Large Text: SGL Diffusion" \ + --save-output +``` + +For a dual-transformer Wan2.2 export where only the primary `transformer` +was quantized: + +```bash +SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn \ +sglang generate \ + --model-path 
Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --transformer-path lmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformer \ + --prompt "a fox walking through neon rain" \ + --save-output +``` + +### Notes + +- Use `--transformer-path` for mixed ModelOpt NVFP4 transformer repos or local + directories that already include `config.json`. +- Use `--transformer-weights-path` for raw NVFP4 exports, individual + safetensors files, or repo layouts that should be treated as weights first. +- For dual-transformer pipelines such as `Wan2.2-T2V-A14B-Diffusers`, the + primary `--transformer-path` override targets only `transformer`. Use a + per-component override such as `--transformer-2-path` only when you + intentionally want a non-default `transformer_2`. +- On Blackwell, the validated Wan2.2 ModelOpt NVFP4 path currently prefers + FlashInfer FP4 GEMM via + `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn`. +- This environment-variable override is a current workaround for NVFP4 cases + where the default sglang JIT/CUTLASS `sm100` path rejects a large-M shape at + `can_implement()`. The intended long-term fix is to add a validated CUTLASS + fallback for those shapes rather than rely on the override. +- Direct `--model-path` loading is a compatibility path for FLUX.2 NVFP4-style + repos or local directories. +- If `--transformer-weights-path` is provided explicitly, it takes precedence + over the compatibility `--model-path` flow. +- For local directories, SGLang first looks for `*-mixed.safetensors`, then + falls back to loading from the directory. +- To force the generic diffusion ModelOpt FP4 path onto a specific FlashInfer + backend, set `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND`. Supported values + include `flashinfer_cudnn`, `flashinfer_cutlass`, and `flashinfer_trtllm`. +- On disk, the quantization config stays `quant_method=modelopt` with + `quant_algo=NVFP4`; the `modelopt-nvfp4` label here is again a documentation + family name rather than a serialized config key. 
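
The precedence rules in the notes above can be summarized in a short Python sketch. This is not SGLang's loader: the function name `resolve_nvfp4_weights` is hypothetical, and choosing the first match in sorted order when several `*-mixed.safetensors` files exist is an assumption made only for illustration.

```python
from pathlib import Path
from typing import Optional


def resolve_nvfp4_weights(
    model_path: str,
    transformer_weights_path: Optional[str] = None,
) -> str:
    """Illustrative resolution order for FLUX.2 NVFP4-style inputs."""
    # An explicit --transformer-weights-path takes precedence over the
    # compatibility --model-path flow.
    if transformer_weights_path is not None:
        return transformer_weights_path

    root = Path(model_path)
    if root.is_dir():
        # For local directories, look for a *-mixed.safetensors file first.
        # Assumption: if several files match, take the first in sorted order.
        mixed = sorted(root.glob("*-mixed.safetensors"))
        if mixed:
            return str(mixed[0])

    # Otherwise fall back to the directory (or repo ID) itself.
    return model_path
```

Keeping the explicit override as the first branch mirrors the documented behavior that `--transformer-weights-path` always wins when both inputs are supplied.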
+ +## Nunchaku (SVDQuant) + +### Install + +Install the runtime dependency first: + +```bash +pip install nunchaku +``` + +For platform-specific installation methods and troubleshooting, see the +[Nunchaku installation guide](https://nunchaku.tech/docs/nunchaku/installation/installation.html). + +### File Naming and Auto-Detection + +For Nunchaku checkpoints, `--model-path` should still point to the original +base model, while `--transformer-weights-path` points to the quantized +transformer weights. + +If the basename of `--transformer-weights-path` contains the pattern +`svdq-(int4|fp4)_r{rank}`, SGLang will automatically: +- enable SVDQuant +- infer `--quantization-precision` +- infer `--quantization-rank` + +Examples: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| checkpoint name fragment | inferred precision | inferred rank | notes |
| --- | --- | --- | --- |
| `svdq-int4_r32` | `int4` | `32` | Standard INT4 checkpoint |
| `svdq-int4_r128` | `int4` | `128` | Higher-quality INT4 checkpoint |
| `svdq-fp4_r32` | `nvfp4` | `32` | `fp4` in the filename maps to CLI value `nvfp4` |
| `svdq-fp4_r128` | `nvfp4` | `128` | Higher-quality NVFP4 checkpoint |
+ +Common filenames: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| filename | precision | rank | typical use |
| --- | --- | --- | --- |
| `svdq-int4_r32-qwen-image.safetensors` | `int4` | `32` | Balanced default |
| `svdq-int4_r128-qwen-image.safetensors` | `int4` | `128` | Quality-focused |
| `svdq-fp4_r32-qwen-image.safetensors` | `nvfp4` | `32` | RTX 50-series / NVFP4 path |
| `svdq-fp4_r128-qwen-image.safetensors` | `nvfp4` | `128` | Quality-focused NVFP4 |
| `svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors` | `int4` | `32` | Lightning 4-step |
| `svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors` | `int4` | `128` | Lightning 8-step |
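
The filename convention above can be illustrated with a short sketch. It is not SGLang's actual detection code: the function name `infer_svdquant_settings` and the returned dictionary keys are hypothetical, and `{rank}` is assumed to be a run of decimal digits.

```python
import os
import re

# Pattern from the text: svdq-(int4|fp4)_r{rank}; {rank} assumed to be digits.
_SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")


def infer_svdquant_settings(weights_path):
    """Hypothetical helper: derive quantization settings from the basename,
    or return None when the name does not follow the convention."""
    match = _SVDQ_PATTERN.search(os.path.basename(weights_path))
    if match is None:
        return None
    # `fp4` in the filename corresponds to the CLI value `nvfp4`.
    precision = "nvfp4" if match.group(1) == "fp4" else "int4"
    return {
        "enable_svdquant": True,
        "quantization_precision": precision,
        "quantization_rank": int(match.group(2)),
    }
```

Only the basename is inspected, so a directory name containing `svdq-int4_r32` does not trigger detection in this sketch.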
+ +If your checkpoint name does not follow this convention, pass +`--enable-svdquant`, `--quantization-precision`, and `--quantization-rank` +explicitly. + +### Usage Examples + +Recommended auto-detected flow: + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \ + --prompt "a beautiful sunset" \ + --save-output +``` + +Manual override when the filename does not encode the quant settings: + +```bash +sglang generate \ + --model-path Qwen/Qwen-Image \ + --transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \ + --enable-svdquant \ + --quantization-precision int4 \ + --quantization-rank 128 \ + --prompt "a beautiful sunset" \ + --save-output +``` + +### Notes + +- `--transformer-weights-path` is the canonical flag for Nunchaku checkpoints. + Older config names such as `quantized_model_path` are treated as + compatibility aliases. +- Auto-detection only happens when the checkpoint basename matches + `svdq-(int4|fp4)_r{rank}`. +- The CLI values are `int4` and `nvfp4`. In filenames, the NVFP4 variant is + written as `fp4`. +- Lightning checkpoints usually expect matching `--num-inference-steps`, such + as `4` or `8`. +- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x) + or SM12x GPUs. Hopper (SM90) is currently rejected. + +## [ModelSlim](https://gitcode.com/Ascend/msmodelslim) +MindStudio-ModelSlim (msModelSlim) is a model offline quantization compression tool launched by MindStudio and optimized for Ascend hardware. + +- **Installation** + + ```bash + # Clone repo and install msmodelslim: + git clone https://gitcode.com/Ascend/msmodelslim.git + cd msmodelslim + bash install.sh + ``` + +- **Multimodal_sd quantization** + + Download the original floating-point weights of the large model. 
Taking Wan2.2-T2V-A14B as an example, the original model weights are available at [Wan2.2-T2V-A14B](https://modelscope.cn/models/Wan-AI/Wan2.2-T2V-A14B). Then install the model's other dependencies (see the ModelScope model card). + > Note: You can find pre-quantized validated models on [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech). + + Run the one-click quantization command (recommended): + + ```bash + msmodelslim quant \ + --model_path /path/to/wan2_2_float_weights \ + --save_path /path/to/wan2_2_quantized_weights \ + --device npu \ + --model_type Wan2_2 \ + --quant_type w8a8 \ + --trust_remote_code True + ``` + + For more quantization examples and information about model support, see the [examples](https://gitcode.com/Ascend/msmodelslim/blob/master/example/multimodal_sd/README.md) section of the msModelSlim repo. + + > Note: SGLang does not support quantized embeddings; disable this option when quantizing with msmodelslim. + +- **Auto-Detection and different formats** + + For msmodelslim checkpoints, specifying ```--model-path``` is enough: quantization is detected automatically for each layer by parsing the `quant_model_description.json` config.
+ + For `Wan2.2`, only the `Diffusers` weight storage format is supported, whereas msmodelslim saves the quantized model in the original `Wan2.2` format. + To convert it, use the `python/sglang/multimodal_gen/tools/wan_repack.py` script: + + ```bash + python wan_repack.py \ + --input-path {path_to_quantized_model} \ + --output-path {path_to_converted_model} + ``` + + After that, copy all remaining files from the original `Diffusers` checkpoint (everything except the `transformer`/`transformer_2` folders). + +- **Usage Example** + + With auto-detected flow: + + ```bash + sglang generate \ + --model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-w8a8 \ + --prompt "a beautiful sunset" \ + --save-output + ``` + +- **Available Quantization Methods**: + - [x] ```W4A4_DYNAMIC``` linear with online quantization of activations + - [x] ```W8A8``` linear with offline quantization of activations + - [x] ```W8A8_DYNAMIC``` linear with online quantization of activations + - [ ] ```mxfp8``` linear in progress diff --git a/docs_new/docs/sglang-diffusion/ring_sp_performance.mdx b/docs_new/docs/sglang-diffusion/ring_sp_performance.mdx new file mode 100644 index 000000000000..eca61967ada2 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/ring_sp_performance.mdx @@ -0,0 +1,158 @@ +--- +title: "Ring SP Benchmark: Wan2.2-TI2V-5B (u1r2 vs Baseline)" +metatags: + description: "Review Ring-SP benchmark results for Wan2.2-TI2V-5B-Diffusers in SGLang Diffusion."
+--- + +This page reports Ring-SP performance for `Wan2.2-TI2V-5B-Diffusers` using: + +- Parallel config: `sp=2, ulysses=1, ring=2` (short: `u1r2`) +- Baseline config: `sp=1, ulysses=1, ring=1` (short: `u1r1`) + +## Benchmark Setup + +- Model: `Wan2.2-TI2V-5B-Diffusers` +- GPU: `48G RTX40 series * 2` + +## Online Serving + +### Ring SP (`u1r2`) + +```bash +sglang serve \ + --model-type diffusion \ + --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \ + --num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \ + --port 8898 +``` + +### Baseline (`u1r1`) + +```bash +sglang serve \ + --model-type diffusion \ + --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \ + --num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \ + --port 8898 +``` + +## Benchmarks + +### Benchmark Disclaimer + +These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns. + +### Stage Time Breakdown + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
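| Stage / Metric | u1r2 (s) | u1r1 baseline (s) | Speedup |
| --- | --- | --- | --- |
| InputValidation | 0.1060 | 0.1029 | 0.97x |
| TextEncoding | 1.3965 | 2.2261 | 1.59x |
| LatentPreparation | 0.0002 | 0.0002 | 1.00x |
| TimestepPreparation | 0.0003 | 0.0004 | 1.33x |
| Denoising | 52.6358 | 71.6785 | 1.36x |
| Decoding | 7.6708 | 13.4314 | 1.75x |
| Total | 63.74 | 90.63 | 1.42x |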
+ +### Memory Usage + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Memory Metric | u1r2 (GB) | u1r1 baseline (GB) | Delta |
| --- | --- | --- | --- |
| Peak GPU Memory | 20.07 | 27.40 | -7.33 |
| Peak Allocated | 13.35 | 20.40 | -7.05 |
| Memory Overhead | 6.72 | 7.00 | -0.28 |
| Overhead Ratio | 33.5% | 25.6% | +7.9pp |
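
The derived columns in the two tables appear to follow simple formulas: speedup is baseline time divided by Ring-SP time, memory overhead is peak minus allocated, and the overhead ratio is overhead divided by peak. The sketch below recomputes them from the published (rounded) values; the formulas are inferred from the numbers rather than stated by the benchmark, so treat them as an assumption.

```python
# (u1r2 seconds, u1r1 baseline seconds), copied from the stage time table.
stage_times = {
    "InputValidation": (0.1060, 0.1029),
    "TextEncoding": (1.3965, 2.2261),
    "LatentPreparation": (0.0002, 0.0002),
    "TimestepPreparation": (0.0003, 0.0004),
    "Denoising": (52.6358, 71.6785),
    "Decoding": (7.6708, 13.4314),
    "Total": (63.74, 90.63),
}


def speedup(u1r2_s, baseline_s):
    """Assumed definition: baseline time divided by Ring-SP time."""
    return round(baseline_s / u1r2_s, 2)


speedups = {name: speedup(*times) for name, times in stage_times.items()}

# Ring-SP (u1r2) memory figures in GB, copied from the memory table.
peak_gb, allocated_gb = 20.07, 13.35
overhead_gb = round(peak_gb - allocated_gb, 2)
overhead_ratio_pct = round(100 * overhead_gb / peak_gb, 1)
```

Recomputing from rounded table values can differ from the reported figure in the last digit: the same formula gives 25.5% rather than 25.6% for the baseline overhead ratio, presumably because the table was produced from unrounded measurements.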
+ +## Summary + +- End-to-end latency improves from `90.63s` to `63.74s` (`1.42x`). +- Main gains come from `Denoising` (`1.36x`) and `Decoding` (`1.75x`). +- Absolute memory usage drops noticeably on Ring-SP (`Peak GPU Memory -7.33GB`, `Peak Allocated -7.05GB`). +- Overhead ratio rises (`+7.9pp`), so future tuning can focus on reducing communication/runtime overhead while preserving the latency gain. diff --git a/docs_new/docs/sglang-diffusion/support_new_models.mdx b/docs_new/docs/sglang-diffusion/support_new_models.mdx new file mode 100644 index 000000000000..766e24b82691 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/support_new_models.mdx @@ -0,0 +1,601 @@ +--- +title: "How to Support New Diffusion Models" +metatags: + description: "This document explains how to add support for new diffusion models in SGLang Diffusion." +--- + +This document explains how to add support for new diffusion models in SGLang Diffusion. + +## Architecture Overview + +SGLang Diffusion is engineered for both performance and flexibility, built upon a pipeline architecture. This +design allows developers to construct pipelines for various diffusion models while keeping the core generation +loop standardized for optimization. + +At its core, the architecture revolves around two key concepts, as highlighted in our [blog post](https://lmsys.org/blog/2025-11-07-sglang-diffusion/#architecture): + +- **`ComposedPipeline`**: This class orchestrates a series of `PipelineStage`s to define the complete generation process for a specific model. It acts as the main entry point for a model and manages the data flow between the different stages of the diffusion process. +- **`PipelineStage`**: Each stage is a modular component that encapsulates a function within the diffusion process. Examples include prompt encoding, the denoising loop, or VAE decoding. + +### Two Pipeline Styles + +SGLang Diffusion supports two pipeline composition styles. 
Both are valid; choose the one that best fits your model. + +#### Style A: Hybrid Monolithic Pipeline (Recommended Default) + +The recommended default for most new models. Uses a three-stage structure: + +``` +BeforeDenoisingStage (model-specific) → DenoisingStage (standard) → DecodingStage (standard) +``` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Stage | Ownership | Responsibility |
| --- | --- | --- |
| `{Model}BeforeDenoisingStage` | Model-specific | All pre-processing: input validation, text/image encoding, latent preparation, timestep computation |
| `DenoisingStage` | Framework-standard | The denoising loop (DiT/UNet forward passes), shared across all models |
| `DecodingStage` | Framework-standard | VAE decoding from latent space to pixel space, shared across all models |
+ +**Why recommended?** Modern diffusion models often have highly heterogeneous pre-processing requirements — different text encoders, different latent formats, different conditioning mechanisms. The Hybrid approach keeps pre-processing isolated per model, avoids fragile shared stages with excessive conditional logic, and lets developers port Diffusers reference code quickly. + +#### Style B: Modular Composition Style + +Uses the framework's fine-grained standard stages (`TextEncodingStage`, `LatentPreparationStage`, `TimestepPreparationStage`, etc.) to build the pipeline by composition. Convenience methods like `add_standard_t2i_stages()` and `add_standard_ti2i_stages()` make this very concise. + +This style is appropriate when: +- **The new model's pre-processing can largely reuse existing stages** — e.g., a model that uses standard CLIP/T5 text encoding + standard latent preparation with minimal customization. +- **A model-specific optimization needs to be extracted as a standalone stage** — e.g., a specialized encoding or conditioning step that benefits from being a separate stage for profiling, parallelism control, or reuse across multiple pipeline variants. + +#### How to Choose + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Situation | Recommended Style |
| --- | --- |
| Model has unique/complex pre-processing (VLM captioning, AR token generation, custom latent packing, etc.) | Hybrid: consolidate into a `BeforeDenoisingStage` |
| Model fits neatly into standard text-to-image or text+image-to-image pattern | Modular: use `add_standard_t2i_stages()` / `add_standard_ti2i_stages()` |
| Porting a Diffusers pipeline with many custom steps | Hybrid: copy the `__call__` logic into a single stage |
| Adding a variant of an existing model that shares most logic | Modular: reuse existing stages, customize via `PipelineConfig` callbacks |
| A specific pre-processing step needs special parallelism or profiling isolation | Modular: extract that step as a dedicated stage |
+ +## Key Components for Implementation + +To add support for a new diffusion model, you will need to define or configure the following components: + +1. **`PipelineConfig`**: A dataclass holding static configurations for your model pipeline — precision settings, model architecture parameters, and callback methods used by the standard `DenoisingStage` and `DecodingStage`. Each model has its own subclass. + +2. **`SamplingParams`**: A dataclass defining runtime generation parameters — `prompt`, `negative_prompt`, `guidance_scale`, `num_inference_steps`, `seed`, `height`, `width`, etc. + +3. **Pre-processing stage(s)**: Either a single model-specific `{Model}BeforeDenoisingStage` (Hybrid style) or a combination of standard stages (Modular style). See [Two Pipeline Styles](#two-pipeline-styles) above. + +4. **`ComposedPipeline`**: A class that wires together your pre-processing stage(s) with the standard `DenoisingStage` and `DecodingStage`. See base definitions: + - [`ComposedPipelineBase`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py) + - [`PipelineStage`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py) + - [Central registry](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/registry.py) + +5. **Modules (model components)**: Each pipeline references modules loaded from the model repository (e.g., Diffusers `model_index.json`): + - `text_encoder`: Encodes text prompts into embeddings. + - `tokenizer`: Tokenizes raw text input for the text encoder(s). + - `processor`: Preprocesses images and extracts features; often used in image-to-image tasks. + - `image_encoder`: Specialized image feature extractor. + - `dit/transformer`: The core denoising network (DiT/UNet architecture) operating in latent space. + - `scheduler`: Controls the timestep schedule and denoising dynamics. 
+ - `vae`: Variational Autoencoder for encoding/decoding between pixel space and latent space. + +## Pipeline Stages Reference + +### Core Stages (used by all pipelines) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Stage Class | Description |
| --- | --- |
| `DenoisingStage` | Executes the main denoising loop, iteratively applying the model (DiT/UNet) to refine the latents. |
| `DecodingStage` | Decodes the final latent tensor back into pixel space using the VAE. |
| `DmdDenoisingStage` | A specialized denoising stage for DMD model architectures. |
| `CausalDMDDenoisingStage` | A specialized causal denoising stage for specific video models. |
+ +### Pre-processing Stages (for Modular Composition Style) + +The following fine-grained stages can be composed to build the pre-processing portion of a pipeline. They are best suited for models whose pre-processing largely fits the standard patterns. If your model requires significant customization, consider the Hybrid style with a single `BeforeDenoisingStage` instead. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Stage Class | Description |
| --- | --- |
| `InputValidationStage` | Validates user-provided `SamplingParams`. |
| `TextEncodingStage` | Encodes text prompts into embeddings using one or more text encoders. |
| `ImageEncodingStage` | Encodes input images into embeddings, often used in image-to-image tasks. |
| `ImageVAEEncodingStage` | Encodes an input image into latent space using the VAE. |
| `TimestepPreparationStage` | Prepares the scheduler's timesteps for the diffusion process. |
| `LatentPreparationStage` | Creates the initial noisy latent tensor that will be denoised. |
+ +## Implementation Guide + +### Step 1: Obtain and Study the Reference Implementation + +Before writing any code, obtain the model's original implementation or Diffusers pipeline code: +- The model's Diffusers pipeline source (e.g., the `pipeline_*.py` file from the `diffusers` library or HuggingFace repo) +- Or the model's official reference implementation (e.g., from the model author's GitHub repo) +- Or the HuggingFace model ID to look up `model_index.json` and the associated pipeline class + +Once you have the reference code, study it thoroughly: + +1. Find the model's `model_index.json` to identify required modules. +2. Read the Diffusers pipeline's `__call__` method to understand: + - How text prompts are encoded + - How latents are prepared (shape, dtype, scaling) + - How timesteps/sigmas are computed + - What conditioning kwargs the DiT expects + - How the denoising loop works + - How VAE decoding is done + +### Step 2: Evaluate Reuse of Existing Pipelines and Stages + +Before creating any new files, check whether an existing pipeline or stage can be reused or extended. Only create new pipelines/stages when the existing ones would need substantial structural changes or when no architecturally similar implementation exists. + +- **Compare against existing pipelines** (Flux, Wan, Qwen-Image, GLM-Image, HunyuanVideo, LTX, etc.). If the new model shares most of its structure with an existing one, prefer adding a new config variant or reusing existing stages. +- **Check existing stages** in `runtime/pipelines_core/stages/` and `stages/model_specific_stages/`. +- **Check existing model components** — many models share VAEs (e.g., `AutoencoderKL`), text encoders (CLIP, T5), and schedulers. Reuse these directly. 
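
For Step 1, the required modules can be read straight out of `model_index.json`. The sketch below assumes the usual Diffusers layout, in which keys starting with `_` are metadata and each component is stored as a `[library, class_name]` pair; the function name `list_pipeline_modules` and the sample JSON (including `MyModelPipeline` and `MyModelTransformer2DModel`) are hypothetical.

```python
import json


def list_pipeline_modules(model_index_text):
    """Hypothetical helper: return the pipeline class name and a mapping of
    module name -> (library, class name) from a model_index.json string."""
    index = json.loads(model_index_text)
    modules = {}
    for name, value in index.items():
        if name.startswith("_"):
            continue  # metadata such as _class_name or _diffusers_version
        if isinstance(value, list) and len(value) == 2 and all(value):
            modules[name] = (value[0], value[1])
    return index.get("_class_name"), modules


# Made-up example content, for illustration only.
SAMPLE_MODEL_INDEX = """
{
  "_class_name": "MyModelPipeline",
  "_diffusers_version": "0.0.0",
  "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler"],
  "text_encoder": ["transformers", "T5EncoderModel"],
  "tokenizer": ["transformers", "T5Tokenizer"],
  "transformer": ["diffusers", "MyModelTransformer2DModel"],
  "vae": ["diffusers", "AutoencoderKL"]
}
"""

pipeline_class, modules = list_pipeline_modules(SAMPLE_MODEL_INDEX)
```

The resulting module names are also a quick way to check for components, such as a shared VAE or text encoder, that an existing pipeline already loads.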
+ +### Step 3: Implement Model Components + +Adapt the model's core components: + +- **DiT/Transformer**: Implement in [`runtime/models/dits/`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/models/dits/) +- **Encoders**: Implement in [`runtime/models/encoders/`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/models/encoders/) +- **VAEs**: Implement in [`runtime/models/vaes/`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/models/vaes/) +- **Schedulers**: Implement in [`runtime/models/schedulers/`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/models/schedulers/) if needed + +Use SGLang's fused kernels where possible (see `LayerNormScaleShift`, `RMSNormScaleShift`, `apply_qk_norm`, etc.). + +**Tensor Parallel (TP) and Sequence Parallel (SP)**: For multi-GPU deployment, it is recommended to add TP/SP support to the DiT model. This can be done incrementally after the single-GPU implementation is verified. 
Reference implementations: +- **Wan model** (`runtime/models/dits/wanvideo.py`) — Full TP + SP: `ColumnParallelLinear`/`RowParallelLinear` for attention, sequence dimension sharding via `get_sp_world_size()` +- **Qwen-Image model** (`runtime/models/dits/qwen_image.py`) — SP via `USPAttention` (Ulysses + Ring Attention) + +### Step 4: Create Configs + +- **DiT Config**: `configs/models/dits/{model_name}.py` +- **VAE Config**: `configs/models/vaes/{model_name}.py` +- **SamplingParams**: `configs/sample/{model_name}.py` + +### Step 5: Create PipelineConfig + +The `PipelineConfig` provides callbacks that the standard `DenoisingStage` and `DecodingStage` use: + +```python +# python/sglang/multimodal_gen/configs/pipeline_configs/my_model.py + +@dataclass +class MyModelPipelineConfig(ImagePipelineConfig): + task_type: ModelTaskType = ModelTaskType.T2I + vae_precision: str = "bf16" + should_use_guidance: bool = True + dit_config: DiTConfig = field(default_factory=MyModelDitConfig) + vae_config: VAEConfig = field(default_factory=MyModelVAEConfig) + + def get_freqs_cis(self, batch, device, rotary_emb, dtype): + """Prepare rotary position embeddings for the DiT.""" + ... + + def prepare_pos_cond_kwargs(self, batch, latent_model_input, t, **kwargs): + """Build positive conditioning kwargs for each denoising step.""" + return { + "hidden_states": latent_model_input, + "encoder_hidden_states": batch.prompt_embeds[0], + "timestep": t, + } + + def prepare_neg_cond_kwargs(self, batch, latent_model_input, t, **kwargs): + """Build negative conditioning kwargs for CFG.""" + return { + "hidden_states": latent_model_input, + "encoder_hidden_states": batch.negative_prompt_embeds[0], + "timestep": t, + } + + def get_decode_scale_and_shift(self): + """Return (scale, shift) for latent denormalization before VAE decode.""" + ... 
+``` + +### Step 6: Implement Pre-processing + +Choose based on your model's needs (see [How to Choose](#how-to-choose)): + +#### Option A: BeforeDenoisingStage (Hybrid Style) + +Create a single stage that handles all pre-processing. Best when the model has custom/complex pre-processing logic. + +```python +# python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages/my_model.py + +class MyModelBeforeDenoisingStage(PipelineStage): + """Monolithic pre-processing stage for MyModel. + + Consolidates: input validation, text/image encoding, latent + preparation, and timestep computation. + """ + + def __init__(self, vae, text_encoder, tokenizer, transformer, scheduler): + super().__init__() + self.vae = vae + self.text_encoder = text_encoder + self.tokenizer = tokenizer + self.transformer = transformer + self.scheduler = scheduler + + @torch.no_grad() + def forward(self, batch: Req, server_args: ServerArgs) -> Req: + device = get_local_torch_device() + + # 1. Encode prompt (model-specific logic) + prompt_embeds, negative_prompt_embeds = self._encode_prompt(...) + + # 2. Prepare latents + latents = self._prepare_latents(...) + + # 3. Prepare timesteps + timesteps, sigmas = self._prepare_timesteps(...) + + # 4. Populate batch for DenoisingStage + batch.prompt_embeds = [prompt_embeds] + batch.negative_prompt_embeds = [negative_prompt_embeds] + batch.latents = latents + batch.timesteps = timesteps + batch.num_inference_steps = len(timesteps) + batch.sigmas = sigmas.tolist() + batch.generator = generator + batch.raw_latent_shape = latents.shape + return batch +``` + +#### Option B: Standard Stages (Modular Style) + +Skip creating a custom stage entirely — configure via `PipelineConfig` callbacks and use framework helpers. Best when the model fits standard patterns. + +(This option has no separate stage file; the pipeline class in Step 7 calls `add_standard_t2i_stages()` directly.) 
+ +**Key batch fields that `DenoisingStage` expects** (regardless of which option you choose): + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Type | Description |
| --- | --- | --- |
| `batch.latents` | `torch.Tensor` | Initial noisy latent tensor |
| `batch.timesteps` | `torch.Tensor` | Timestep schedule |
| `batch.num_inference_steps` | `int` | Number of denoising steps |
| `batch.sigmas` | `list[float]` | Sigma schedule (must be a Python list, not numpy) |
| `batch.prompt_embeds` | `list[torch.Tensor]` | Positive prompt embeddings (wrapped in a list) |
| `batch.negative_prompt_embeds` | `list[torch.Tensor]` | Negative prompt embeddings (wrapped in a list) |
| `batch.generator` | `torch.Generator` | RNG generator for reproducibility |
| `batch.raw_latent_shape` | `tuple` | Original latent shape before any packing |
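
While developing a pre-processing stage, a small checker for these fields can catch common mistakes before the batch reaches `DenoisingStage`. The sketch below is a debugging aid, not part of the framework: `check_denoising_inputs` is a hypothetical name, and the example uses `types.SimpleNamespace` with plain Python lists as stand-ins for `Req` and tensors so that it runs without `torch`.

```python
from types import SimpleNamespace

REQUIRED_FIELDS = (
    "latents", "timesteps", "num_inference_steps", "sigmas",
    "prompt_embeds", "negative_prompt_embeds", "generator", "raw_latent_shape",
)


def check_denoising_inputs(batch):
    """Hypothetical helper: return a list of problems (empty means OK)."""
    problems = [
        f"batch.{name} is not set"
        for name in REQUIRED_FIELDS
        if getattr(batch, name, None) is None
    ]
    sigmas = getattr(batch, "sigmas", None)
    if sigmas is not None and not (
        isinstance(sigmas, list) and all(isinstance(s, float) for s in sigmas)
    ):
        problems.append("batch.sigmas must be a Python list of floats")
    for name in ("prompt_embeds", "negative_prompt_embeds"):
        value = getattr(batch, name, None)
        if value is not None and not isinstance(value, list):
            problems.append(f"batch.{name} must be wrapped in a list")
    steps = getattr(batch, "num_inference_steps", None)
    timesteps = getattr(batch, "timesteps", None)
    if steps is not None and timesteps is not None and steps != len(timesteps):
        problems.append("batch.num_inference_steps does not match len(timesteps)")
    return problems


# Stand-in batch: plain lists replace tensors, object() replaces the generator.
example_batch = SimpleNamespace(
    latents=[[0.0, 0.0]], timesteps=[999.0, 500.0], num_inference_steps=2,
    sigmas=[1.0, 0.5], prompt_embeds=[[0.1]], negative_prompt_embeds=[[0.0]],
    generator=object(), raw_latent_shape=(1, 2),
)
```

Calling `check_denoising_inputs(batch)` at the end of a stage's `forward()` during development, and asserting that the result is empty, is one way to use it.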
+ +### Step 7: Define the Pipeline Class + +#### Hybrid Style + +```python +# python/sglang/multimodal_gen/runtime/pipelines/my_model.py + +class MyModelPipeline(LoRAPipeline, ComposedPipelineBase): + pipeline_name = "MyModelPipeline" # Must match model_index.json _class_name + + _required_config_modules = [ + "text_encoder", "tokenizer", "vae", "transformer", "scheduler", + ] + + def create_pipeline_stages(self, server_args: ServerArgs): + # 1. Monolithic pre-processing (model-specific) + self.add_stage( + MyModelBeforeDenoisingStage( + vae=self.get_module("vae"), + text_encoder=self.get_module("text_encoder"), + tokenizer=self.get_module("tokenizer"), + transformer=self.get_module("transformer"), + scheduler=self.get_module("scheduler"), + ), + ) + + # 2. Standard denoising loop (framework-provided) + self.add_stage( + DenoisingStage( + transformer=self.get_module("transformer"), + scheduler=self.get_module("scheduler"), + ), + ) + + # 3. Standard VAE decoding (framework-provided) + self.add_standard_decoding_stage() + + +EntryClass = [MyModelPipeline] +``` + +#### Modular Style + +```python +# python/sglang/multimodal_gen/runtime/pipelines/my_model.py + +class MyModelPipeline(LoRAPipeline, ComposedPipelineBase): + pipeline_name = "MyModelPipeline" + + _required_config_modules = [ + "text_encoder", "tokenizer", "vae", "transformer", "scheduler", + ] + + def create_pipeline_stages(self, server_args: ServerArgs): + # All pre-processing + denoising + decoding in one call + self.add_standard_t2i_stages( + prepare_extra_timestep_kwargs=[prepare_mu], # model-specific hooks + ) + + +EntryClass = [MyModelPipeline] +``` + +### Step 8: Register the Model + +Register your configs in [`registry.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/registry.py): + +```python +register_configs( + model_family="my_model", + sampling_param_cls=MyModelSamplingParams, + pipeline_config_cls=MyModelPipelineConfig, + hf_model_paths=["org/my-model-name"], 
+) +``` + +The `EntryClass` in your pipeline file is automatically discovered by the registry — no additional registration needed for the pipeline class itself. + +### Step 9: Verify Output Quality + +After implementation, verify that the generated output is not noise. A noisy or garbled output is the most common sign of an incorrect implementation. Common causes include: + +- Incorrect latent scale/shift factors +- Wrong timestep/sigma schedule (order, dtype, or value range) +- Mismatched conditioning kwargs +- Rotary embedding style mismatch (`is_neox_style`) + +Debug by comparing intermediate tensor values against the Diffusers reference pipeline with the same seed. + +## Reference Implementations + +### Hybrid Style + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Pipeline | BeforeDenoisingStage | PipelineConfig |
| --- | --- | --- | --- |
| GLM-Image | `runtime/pipelines/glm_image.py` | `stages/model_specific_stages/glm_image.py` | `configs/pipeline_configs/glm_image.py` |
| Qwen-Image-Layered | `runtime/pipelines/qwen_image.py` | `stages/model_specific_stages/qwen_image_layered.py` | `configs/pipeline_configs/qwen_image.py` |
+ +### Modular Style + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Pipeline | Notes |
| --- | --- | --- |
| Qwen-Image (T2I) | `runtime/pipelines/qwen_image.py` | Uses `add_standard_t2i_stages()` |
| Qwen-Image-Edit | `runtime/pipelines/qwen_image.py` | Uses `add_standard_ti2i_stages()` |
| Flux | `runtime/pipelines/flux.py` | Uses `add_standard_t2i_stages()` with custom `prepare_mu` |
| Wan | `runtime/pipelines/wan_pipeline.py` | Uses `add_standard_ti2v_stages()` |
+ +## Checklist + +Before submitting your implementation, verify: + +**Common (both styles):** +- [ ] **Pipeline file** at `runtime/pipelines/{model_name}.py` with `EntryClass` +- [ ] **PipelineConfig** at `configs/pipeline_configs/{model_name}.py` +- [ ] **SamplingParams** at `configs/sample/{model_name}.py` +- [ ] **DiT model** at `runtime/models/dits/{model_name}.py` +- [ ] **Model configs** (DiT, VAE) at `configs/models/dits/` and `configs/models/vaes/` +- [ ] **Registry entry** in `registry.py` via `register_configs()` +- [ ] `pipeline_name` matches Diffusers `model_index.json` `_class_name` +- [ ] `_required_config_modules` lists all modules from `model_index.json` +- [ ] `PipelineConfig` callbacks (`prepare_pos_cond_kwargs`, etc.) match the DiT's `forward()` signature +- [ ] Uses framework-standard `DenoisingStage` and `DecodingStage` (not custom denoising loops) +- [ ] **TP/SP support** considered for DiT model (recommended; reference `wanvideo.py` for TP+SP, `qwen_image.py` for USPAttention) +- [ ] **Output quality verified** — generated images/videos are not noise; compared against Diffusers reference output + +**Hybrid style only:** +- [ ] **BeforeDenoisingStage** at `stages/model_specific_stages/{model_name}.py` +- [ ] `BeforeDenoisingStage.forward()` populates all batch fields required by `DenoisingStage` diff --git a/docs_new/docs/sglang-diffusion/teacache.mdx b/docs_new/docs/sglang-diffusion/teacache.mdx new file mode 100644 index 000000000000..d7f86219f6a1 --- /dev/null +++ b/docs_new/docs/sglang-diffusion/teacache.mdx @@ -0,0 +1,143 @@ +--- +title: "TeaCache Acceleration" +description: "Configure TeaCache for temporal similarity-based diffusion acceleration." +--- + +> **Note**: This is one of two caching strategies available in SGLang. +> For an overview of all caching options, see [caching](./caching-acceleration). 
+ +TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely. + +## Overview + +TeaCache works by: +1. Tracking the L1 distance between modulated inputs across consecutive timesteps +2. Accumulating the rescaled L1 distance over steps +3. When accumulated distance is below a threshold, reusing the cached residual +4. Supporting CFG (Classifier-Free Guidance) with separate positive/negative caches + +## How It Works + +### L1 Distance Tracking + +At each denoising step, TeaCache computes the relative L1 distance between the current and previous modulated inputs: + +```text +rel_l1 = |current - previous|.mean() / |previous|.mean() +``` + +This distance is then rescaled using polynomial coefficients and accumulated: + +```text +accumulated += poly(coefficients)(rel_l1) +``` + +### Cache Decision + +- If `accumulated >= threshold`: Force computation, reset accumulator +- If `accumulated < threshold`: Skip computation, use cached residual + +### CFG Support + +For models that support CFG cache separation (Wan, Hunyuan, Z-Image), TeaCache maintains separate caches for positive and negative branches: +- `previous_modulated_input` / `previous_residual` for positive branch +- `previous_modulated_input_negative` / `previous_residual_negative` for negative branch + +For models that don't support CFG separation (Flux, Qwen), TeaCache is automatically disabled when CFG is enabled. + +## Configuration + +TeaCache is configured via `TeaCacheParams` in the sampling parameters: + +```python +from sglang.multimodal_gen.configs.sample.teacache import TeaCacheParams + +params = TeaCacheParams( + teacache_thresh=0.1, # Threshold for accumulated L1 distance + coefficients=[1.0, 0.0, 0.0], # Polynomial coefficients for L1 rescaling +) +``` + +### Parameters + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Type | Description |
| --- | --- | --- |
| `teacache_thresh` | float | Threshold for accumulated L1 distance. Higher = more caching, faster but potentially lower quality |
| `coefficients` | list[float] | Polynomial coefficients for L1 rescaling. Model-specific tuning |
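
To make the interaction between these two parameters concrete, here is a minimal sketch of the decision rule from "How It Works". It is an illustration under stated assumptions, not SGLang's implementation: the class name `TeaCacheSketch` is hypothetical, plain lists of floats stand in for the modulated-input tensors, the coefficients are assumed to be ordered highest degree first, and the first step is assumed to always compute because nothing is cached yet.

```python
def poly_eval(coefficients, x):
    """Horner evaluation; assumes highest-degree coefficient first."""
    result = 0.0
    for c in coefficients:
        result = result * x + c
    return result


class TeaCacheSketch:
    """Hypothetical stand-in for the per-branch caching decision."""

    def __init__(self, teacache_thresh, coefficients):
        self.thresh = teacache_thresh
        self.coefficients = coefficients
        self.accumulated = 0.0
        self.previous = None  # previous modulated input

    def should_compute(self, modulated_input):
        """Return True to run the full forward pass, False to reuse the cache."""
        if self.previous is None:
            compute = True  # assumption: no cached residual on the first step
        else:
            n = len(modulated_input)
            diff = sum(abs(c - p) for c, p in zip(modulated_input, self.previous)) / n
            scale = sum(abs(p) for p in self.previous) / n
            rel_l1 = diff / scale  # assumes the previous input is not all zeros
            self.accumulated += poly_eval(self.coefficients, rel_l1)
            compute = self.accumulated >= self.thresh
        if compute:
            self.accumulated = 0.0  # forced computation resets the accumulator
        self.previous = list(modulated_input)
        return compute


cache = TeaCacheSketch(teacache_thresh=0.1, coefficients=[1.0, 0.0])
decisions = [
    cache.should_compute(x)
    for x in ([1.0, 1.0], [1.02, 1.02], [1.04, 1.04], [1.5, 1.5])
]
```

With identity rescaling (`[1.0, 0.0]`), the two small changes accumulate to roughly 0.04 and are skipped, while the jump to `1.5` pushes the accumulator past `0.1` and forces a recomputation. For CFG, one such object per branch would mirror the separate positive and negative caches described earlier.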
+ +### Model-Specific Configurations + +Different models may have different optimal configurations. The coefficients are typically tuned per-model to balance speed and quality. + +## Supported Models + +TeaCache is built into the following model families: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model Family | CFG Cache Separation | Notes |
| --- | --- | --- |
| Wan (wan2.1, wan2.2) | Yes | Full support |
| Hunyuan (HunyuanVideo) | Yes | To be supported |
| Z-Image | Yes | To be supported |
| Flux | No | To be supported |
| Qwen | No | To be supported |
+ + +## References + +- [TeaCache: Accelerating Diffusion Models with Temporal Similarity](https://arxiv.org/abs/2411.14324) diff --git a/docs_new/docs/supported-models.mdx b/docs_new/docs/supported-models.mdx new file mode 100644 index 000000000000..bcfcf89a75ca --- /dev/null +++ b/docs_new/docs/supported-models.mdx @@ -0,0 +1,87 @@ +--- +title: Supported models +description: See which families of SGLang-compatible models are actively maintained. +mode: wide +--- + +SGLang supports model families across text generation, retrieval, and reward workflows. Browse the sections below for the primary product paths and jump to the detail pages when you are ready to explore a specific class. + +### Text generation + + + + Production-tuned Llama and Qwen families validated for high-throughput + serving. + + + Vision-text hybrids that stay responsive on multi-GPU setups. + + + Score-based and diffusion backbones for structured text generation + workflows. + + + +### Retrieval and ranking + + + + Dense and sparse embeddings optimized with FlashInfer kernels. + + + Low-latency rerankers for multi-stage retrieval pipelines. + + + Lightweight classifiers covering safety, intent, and context filters. + + + +### Specialized models + + + + RLHF and reward scoring pipelines optimized for production latency. + + diff --git a/docs_new/docs/supported-models/classify_models.mdx b/docs_new/docs/supported-models/classify_models.mdx new file mode 100644 index 000000000000..8883f4ec05d4 --- /dev/null +++ b/docs_new/docs/supported-models/classify_models.mdx @@ -0,0 +1,163 @@ +--- +title: Classification Models +--- +This document describes the `/v1/classify` API endpoint implementation in SGLang, which is compatible with vLLM's classification API format. + +## Overview + +The classification API allows you to classify text inputs using classification models. This implementation follows the same format as vLLM's 0.7.0 classification API. 
+ +## API Endpoint + +```text Output +POST /v1/classify +``` + +## Request Format + +```json Config +{ + "model": "model_name", + "input": "text to classify" +} +``` + +### Parameters + +- `model` (string, required): The name of the classification model to use +- `input` (string, required): The text to classify +- `user` (string, optional): User identifier for tracking +- `rid` (string, optional): Request ID for tracking +- `priority` (integer, optional): Request priority + +## Response Format + +```json Config +{ + "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682", + "object": "list", + "created": 1745383213, + "model": "jason9693/Qwen2.5-1.5B-apeach", + "data": [ + { + "index": 0, + "label": "Default", + "probs": [0.565970778465271, 0.4340292513370514], + "num_classes": 2 + } + ], + "usage": { + "prompt_tokens": 10, + "total_tokens": 10, + "completion_tokens": 0, + "prompt_tokens_details": null + } +} +``` + +### Response Fields + +- `id`: Unique identifier for the classification request +- `object`: Always "list" +- `created`: Unix timestamp when the request was created +- `model`: The model used for classification +- `data`: Array of classification results + - `index`: Index of the result + - `label`: Predicted class label + - `probs`: Array of probabilities for each class + - `num_classes`: Total number of classes +- `usage`: Token usage information + - `prompt_tokens`: Number of input tokens + - `total_tokens`: Total number of tokens + - `completion_tokens`: Number of completion tokens (always 0 for classification) + - `prompt_tokens_details`: Additional token details (optional) + +## Example Usage + +### Using curl + +```bash Command +curl -v "http://127.0.0.1:8000/v1/classify" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "jason9693/Qwen2.5-1.5B-apeach", + "input": "Loved the new café—coffee was great." 
+ }' +``` + +### Using Python + +```python Example +import requests +import json + +# Make classification request +response = requests.post( + "http://127.0.0.1:8000/v1/classify", + headers={"Content-Type": "application/json"}, + json={ + "model": "jason9693/Qwen2.5-1.5B-apeach", + "input": "Loved the new café—coffee was great." + } +) + +# Parse response +result = response.json() +print(json.dumps(result, indent=2)) +``` + +## Supported Models + +The classification API works with any classification model supported by SGLang, including: + +### Classification Models (Multi-class) +- `LlamaForSequenceClassification` - Multi-class classification +- `Qwen2ForSequenceClassification` - Multi-class classification +- `Qwen3ForSequenceClassification` - Multi-class classification +- `BertForSequenceClassification` - Multi-class classification +- `Gemma2ForSequenceClassification` - Multi-class classification + +**Label Mapping**: The API automatically uses the `id2label` mapping from the model's `config.json` file to provide meaningful label names instead of generic class names. If `id2label` is not available, it falls back to `LABEL_0`, `LABEL_1`, etc., or `Class_0`, `Class_1` as a last resort. + +### Reward Models (Single score) +- `InternLM2ForRewardModel` - Single reward score +- `Qwen2ForRewardModel` - Single reward score +- `LlamaForSequenceClassificationWithNormal_Weights` - Special reward model + +**Note**: The `/classify` endpoint in SGLang was originally designed for reward models but now supports all non-generative models. Our `/v1/classify` endpoint provides a standardized vLLM-compatible interface for classification tasks. 
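The label-mapping fallback described above can be sketched in a few lines. This is only an illustration under stated assumptions, not SGLang's actual implementation: `resolve_label` is a hypothetical helper, the config is assumed to be the parsed `config.json`, and the final `Class_i` fallback is left out because the text does not say when it applies.

```python
import json


def resolve_label(config: dict, class_index: int) -> str:
    """Return a readable label for class_index (hypothetical helper)."""
    id2label = config.get("id2label") or {}
    # config.json is JSON, so id2label keys are strings; accept ints as well.
    for key in (str(class_index), class_index):
        if key in id2label:
            return id2label[key]
    # No mapping available: fall back to the generic LABEL_<i> name.
    return f"LABEL_{class_index}"


config_with_labels = json.loads('{"id2label": {"0": "negative", "1": "positive"}}')
print(resolve_label(config_with_labels, 1))  # positive
print(resolve_label({}, 0))                  # LABEL_0
```

Checking both string and integer keys keeps the sketch usable with a config loaded from JSON and with one built in Python.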
+
+## Error Handling
+
+The API returns appropriate HTTP status codes and error messages:
+
+- `400 Bad Request`: Invalid request format or missing required fields
+- `500 Internal Server Error`: Server-side processing error
+
+Error response format:
+```json Config
+{
+  "error": "Error message",
+  "type": "error_type",
+  "code": 400
+}
+```
+
+## Implementation Details
+
+The classification API is implemented using:
+
+1. **Rust Model Gateway**: Handles routing and request/response models in `sgl-model-gateway/src/protocols/spec.rs`
+2. **Python HTTP Server**: Implements the actual endpoint in `python/sglang/srt/entrypoints/http_server.py`
+3. **Classification Service**: Handles the classification logic in `python/sglang/srt/entrypoints/openai/serving_classify.py`
+
+## Testing
+
+Use the provided test script to verify the implementation:
+
+```bash Command
+python test_classify_api.py
+```
+
+## Compatibility
+
+This implementation is compatible with vLLM's classification API format, allowing seamless migration from vLLM to SGLang for classification tasks.
diff --git a/docs_new/docs/supported-models/diffusion_language_models.mdx b/docs_new/docs/supported-models/diffusion_language_models.mdx
new file mode 100644
index 000000000000..4dea5392055b
--- /dev/null
+++ b/docs_new/docs/supported-models/diffusion_language_models.mdx
@@ -0,0 +1,133 @@
+---
+title: Diffusion language models
+---
+Diffusion language models have shown promise for non-autoregressive text generation with parallel decoding capabilities. Unlike autoregressive language models, diffusion language models differ in the decoding strategy each one requires.
+
+## Example Launch Command
+
+SGLang supports different DLLM algorithms such as `LowConfidence` and `JointThreshold`.
+
+```bash Command
+# --model-path takes an example HF/local path.
+# --dllm-algorithm-config is optional; the algorithm's default is used if not set.
+python3 -m sglang.launch_server \
+  --model-path inclusionAI/LLaDA2.0-mini \
+  --dllm-algorithm LowConfidence \
+  --dllm-algorithm-config ./config.yaml \
+  --host 0.0.0.0 \
+  --port 30000
+```
+
+## Example Configuration File
+
+The configuration parameters vary with the algorithm selected.
+
+LowConfidence Config:
+
+```yaml Config
+# Confidence threshold for accepting predicted tokens
+# - Higher values: More conservative, better quality but slower
+# - Lower values: More aggressive, faster but potentially lower quality
+# Range: 0.0 - 1.0
+threshold: 0.95
+
+# Default: 32, for LLaDA2MoeModelLM
+block_size: 32
+```
+
+JointThreshold Config:
+
+```yaml Config
+# Decoding threshold for Mask-to-Token (M2T) phase
+# - Higher values: More conservative, better quality but slower
+# - Lower values: More aggressive, faster but potentially lower quality
+# Range: 0.0 - 1.0
+threshold: 0.5
+# Decoding threshold for Token-to-Token (T2T) phase
+# Range: 0.0 - 1.0
+# Setting to 0.0 allows full editing (recommended for most cases).
+edit_threshold: 0.0
+# Max extra T2T steps after all masks are removed. Prevents infinite loops.
+max_post_edit_steps: 16
+# 2-gram repetition penalty (default 0).
+# An empirical value of 3 is often sufficient to mitigate most repetitions.
+penalty_lambda: 0
+```
+
+## Example Client Code Snippet
+
+Just like other supported models, diffusion language models can be used via the REST API or Python client.
+ +Python client example for making a generation request to the launched server: + +```python Example +import sglang as sgl + +def main(): + llm = sgl.Engine(model_path="inclusionAI/LLaDA2.0-mini", + dllm_algorithm="LowConfidence", + max_running_requests=1, + trust_remote_code=True) + + prompts = [ + "SYSTEMdetailed thinking off<|role_end|>HUMAN Write a brief introduction of the great wall <|role_end|>ASSISTANT" + ] + + sampling_params = { + "temperature": 0, + "max_new_tokens": 1024, + } + + outputs = llm.generate(prompts, sampling_params) + print(outputs) + +if __name__ == '__main__': + main() +``` + +Curl example for making a generation request to the launched server: + +```bash Command +curl -X POST "http://127.0.0.1:30000/generate" \ + -H "Content-Type: application/json" \ + -d '{ + "text": [ + "SYSTEMdetailed thinking off<|role_end|>HUMAN Write the number from 1 to 128 <|role_end|>ASSISTANT", + "SYSTEMdetailed thinking off<|role_end|>HUMAN Write a brief introduction of the great wall <|role_end|>ASSISTANT" + ], + "stream": true, + "sampling_params": { + "temperature": 0, + "max_new_tokens": 1024 + } + }' +``` + +## Supported Models + +Below the supported models are summarized in a table. + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Model Family | Example Model | Description |
+| --- | --- | --- |
+| **LLaDA2.0 (mini, flash)** | `inclusionAI/LLaDA2.0-flash` | LLaDA2.0-flash is a diffusion language model featuring a 100B Mixture-of-Experts (MoE) architecture. |
+| **SDAR (JetLM)** | `JetLM/SDAR-8B-Chat` | SDAR series diffusion language model (Chat), dense architecture. |
+| **SDAR (JetLM)** | `JetLM/SDAR-30B-A3B-Chat` | SDAR series diffusion language model (Chat), MoE architecture. |
diff --git a/docs_new/docs/supported-models/embedding_models.mdx b/docs_new/docs/supported-models/embedding_models.mdx new file mode 100644 index 000000000000..3edf834492a5 --- /dev/null +++ b/docs_new/docs/supported-models/embedding_models.mdx @@ -0,0 +1,173 @@ +--- +title: Embedding models +description: Dense and sparse embedding models with FlashInfer acceleration and SGLang's batching infrastructure. +--- +SGLang provides robust support for embedding models by integrating efficient serving mechanisms with its flexible programming interface. This integration allows for streamlined handling of embedding tasks, facilitating faster and more accurate retrieval and semantic search operations. SGLang's architecture enables better resource utilization and reduced latency in embedding model deployment. + + +Embedding models are executed with `--is-embedding` flag and some may require `--trust-remote-code` + + +## Quick Start + +### Launch Server + +```bash +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-Embedding-4B \ + --is-embedding \ + --host 0.0.0.0 \ + --port 30000 +``` + +### Client Request + +```python +import requests + +url = "http://127.0.0.1:30000" + +payload = { + "model": "Qwen/Qwen3-Embedding-4B", + "input": "What is the capital of France?", + "encoding_format": "float" +} + +response = requests.post(url + "/v1/embeddings", json=payload).json() +print("Embedding:", response["data"][0]["embedding"]) +``` + + +## Multimodal Embedding Example + +For multimodal models like GME that support both text and images: + +```bash +python3 -m sglang.launch_server \ + --model-path Alibaba-NLP/gme-Qwen2-VL-2B-Instruct \ + --is-embedding \ + --chat-template gme-qwen2-vl \ + --host 0.0.0.0 \ + --port 30000 +``` + +```python Example +import requests + +url = "http://127.0.0.1:30000" + +text_input = "Represent this image in embedding space." 
+image_path = "https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/023.jpg" + +payload = { + "model": "gme-qwen2-vl", + "input": [ + { + "text": text_input + }, + { + "image": image_path + } + ], +} + +response = requests.post(url + "/v1/embeddings", json=payload).json() + +print("Embeddings:", [x.get("embedding") for x in response.get("data", [])]) +``` + +## Matryoshka Embedding Example + +[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows user to trade off between performance and cost. + +### 1. Launch a Matryoshka‑capable model + +If the model config already includes `matryoshka_dimensions` or `is_matryoshka` then no override is needed. Otherwise, you can use `--json-model-override-args` as below: + +```bash Command +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-Embedding-0.6B \ + --is-embedding \ + --host 0.0.0.0 \ + --port 30000 \ + --json-model-override-args '{"matryoshka_dimensions": [128, 256, 512, 1024, 1536]}' +``` + +1. Setting `"is_matryoshka": true` allows truncating to any dimension. Otherwise, the server will validate that the specified dimension in the request is one of `matryoshka_dimensions`. +2. Omitting `dimensions` in a request returns the full vector. + +### 2. Make requests with different output dimensions + +```python +import requests + +url = "http://127.0.0.1:30000" + +# Request a truncated (Matryoshka) embedding by specifying a supported dimension. 
+payload = { + "model": "Qwen/Qwen3-Embedding-0.6B", + "input": "Explain diffusion models simply.", + "dimensions": 512 # change to 128 / 1024 / omit for full size +} + +response = requests.post(url + "/v1/embeddings", json=payload).json() +print("Embedding:", response["data"][0]["embedding"]) +``` + + +## Supported Models + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Model Family | Example Model | Chat template | Description |
+| --- | --- | --- | --- |
+| **E5 (Llama/Mistral based)** | `intfloat/e5-mistral-7b-instruct` | N/A | High-quality text embeddings based on Mistral/Llama architectures |
+| **GTE-Qwen2** | `Alibaba-NLP/gte-Qwen2-7B-instruct` | N/A | Alibaba's text embedding model with multilingual support |
+| **Qwen3-Embedding** | `Qwen/Qwen3-Embedding-4B` | N/A | Latest Qwen3-based text embedding model for semantic representation |
+| **BGE** | `BAAI/bge-large-en-v1.5` | N/A | BAAI's text embeddings (requires attention-backend triton/torch_native) |
+| **GME (Multimodal)** | `Alibaba-NLP/gme-Qwen2-VL-2B-Instruct` | `gme-qwen2-vl` | Multimodal embedding for text and image cross-modal tasks |
+| **CLIP** | `openai/clip-vit-large-patch14-336` | N/A | OpenAI's CLIP for image and text embeddings |
diff --git a/docs_new/docs/supported-models/generative_models.mdx b/docs_new/docs/supported-models/generative_models.mdx
new file mode 100644
index 000000000000..b5258001b804
--- /dev/null
+++ b/docs_new/docs/supported-models/generative_models.mdx
@@ -0,0 +1,287 @@
+---
+title: Large Language Models
+---
+These models accept text input and produce text output (e.g., chat completions). They are primarily large language models (LLMs), some with mixture-of-experts (MoE) architectures for scaling.
+
+## Example Launch Command
+
+```shell Command
+# --model-path takes an example HF/local path.
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-1B-Instruct \
+  --host 0.0.0.0 \
+  --port 30000
+```
+
+## Supported models
+
+The supported models are summarized in the table below.
+
+If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for `Qwen3ForCausalLM`, enter the following expression in the GitHub search bar:
+
+```text Output
+repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen3ForCausalLM
+```
+
+| Model Family (Variants) | Example HuggingFace Identifier | Description |
+| --- | --- | --- |
+| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. SGLang provides DeepSeek V3/R1 model-specific optimizations and a reasoning parser. |
+| **Kimi K2** (Thinking, Instruct) | `moonshotai/Kimi-K2-Instruct` | Moonshot AI's 1 trillion parameter MoE model (32B active) with 128K–256K context; state-of-the-art agentic intelligence with stable long-horizon agency across 200–300 sequential tool calls. Features MLA attention and native INT4 quantization. See the Reasoning Parser docs. |
+| **Kimi Linear** (48B-A3B) | `moonshotai/Kimi-Linear-48B-A3B-Instruct` | Moonshot AI's hybrid linear attention model (48B total, 3B active) with 1M token context; features Kimi Delta Attention (KDA) for up to 6× faster decoding and 75% KV cache reduction vs full attention. |
+| **GPT-OSS** | `openai/gpt-oss-20b`, `openai/gpt-oss-120b` | OpenAI’s latest GPT-OSS series for complex reasoning, agentic tasks, and versatile developer use cases. |
+| **Qwen** (3.5, 3, 3MoE, 3Next, 2.5, 2 series) | `Qwen/Qwen3.5-397B-A17B`, `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; supports MoE variants along with previous generations (2.5, 2, etc.). SGLang provides a Qwen3-specific reasoning parser. |
+| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. SGLang provides Llama-4 model-specific optimizations. |
+| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. |
+| **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
+| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech, Phi-4-mini is a high-accuracy text model and Phi-3.5-MoE is a mixture-of-experts model. |
+| **MiniCPM** (v3, 4B) | `openbmb/MiniCPM3-4B` | OpenBMB’s series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results in text tasks. |
+| **OLMo** (2, 3) | `allenai/OLMo-3-1125-32B`, `allenai/OLMo-3-32B-Think`, `allenai/OLMo-2-1124-7B-Instruct` | Allen AI’s series of Open Language Models designed to enable the science of language models. |
+| **OLMoE** (Open MoE) | `allenai/OLMoE-1B-7B-0924` | Allen AI’s open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation. |
+| **MiniMax-M2** (M2, M2.1, M2.5) | `MiniMaxAI/MiniMax-M2.5`, `MiniMaxAI/MiniMax-M2.1`, `MiniMaxAI/MiniMax-M2` | MiniMax's SOTA LLM for coding & agentic workflows. |
+| **StableLM** (3B, 7B) | `stabilityai/stablelm-tuned-alpha-7b` | StabilityAI’s early open-source LLM (3B & 7B) for general text generation; a demonstration model with basic instruction-following ability. |
+| **Command-(R,A)** (Cohere) | `CohereLabs/c4ai-command-r-v01`, `CohereLabs/c4ai-command-r7b-12-2024`, `CohereLabs/c4ai-command-a-03-2025` | Cohere’s open conversational LLM (Command series) optimized for long context, retrieval-augmented generation, and tool use. |
+| **DBRX** (Databricks) | `databricks/dbrx-instruct` | Databricks’ 132B-parameter MoE model (36B active) trained on 12T tokens; competes with GPT-3.5 quality as a fully open foundation model. |
+| **Grok** (xAI) | `xai-org/grok-1` | xAI’s grok-1 model known for vast size (314B parameters) and high quality; integrated in SGLang for high-performance inference. |
+| **ChatGLM** (GLM-130B family) | `THUDM/chatglm2-6b` | Zhipu AI’s bilingual chat model (6B) excelling at Chinese-English dialogue; fine-tuned for conversational quality and alignment. |
+| **InternLM 2** (7B, 20B) | `internlm/internlm2-7b` | Next-gen InternLM (7B and 20B) from SenseTime, offering strong reasoning and ultra-long context support (up to 200K tokens). |
+| **ExaONE 3** (Korean-English) | `LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct` | LG AI Research’s Korean-English model (7.8B) trained on 8T tokens; provides high-quality bilingual understanding and generation. |
+| **Baichuan 2** (7B, 13B) | `baichuan-inc/Baichuan2-13B-Chat` | BaichuanAI’s second-generation Chinese-English LLM (7B/13B) with improved performance and an open commercial license. |
+| **XVERSE** (MoE) | `xverse/XVERSE-MoE-A36B` | Yuanxiang’s open MoE LLM (XVERSE-MoE-A36B: 255B total, 36B active) supporting ~40 languages; delivers 100B+ dense-level performance via expert routing. |
+| **SmolLM** (135M–1.7B) | `HuggingFaceTB/SmolLM-1.7B` | Hugging Face’s ultra-small LLM series (135M–1.7B params) offering surprisingly strong results, enabling advanced AI on mobile/edge devices. |
+| **GLM-4** (Multilingual 9B) | `ZhipuAI/glm-4-9b-chat` | Zhipu’s GLM-4 series (up to 9B parameters): open multilingual models with support for 1M-token context and a multimodal variant (GLM-4V-9B). |
+| **MiMo** (7B series) | `XiaomiMiMo/MiMo-7B-RL` | Xiaomi's reasoning-optimized model series, which leverages Multiple-Token Prediction for faster inference. |
+| **ERNIE-4.5** (4.5, 4.5MoE series) | `baidu/ERNIE-4.5-21B-A3B-PT` | Baidu's ERNIE-4.5 series, which consists of MoE models with 47B and 3B active parameters (the largest having 424B total parameters) as well as a 0.3B dense model. |
+| **Arcee AFM-4.5B** | `arcee-ai/AFM-4.5B-Base` | Arcee's foundational model series for real world reliability and edge deployments. |
+| **Persimmon** (8B) | `adept/persimmon-8b-chat` | Adept’s open 8B model with a 16K context window and fast inference; trained for broad usability and licensed under Apache 2.0. |
+| **Solar** (10.7B) | `upstage/SOLAR-10.7B-Instruct-v1.0` | Upstage's 10.7B parameter model, optimized for instruction-following tasks. This architecture incorporates a depth-up scaling methodology, enhancing model performance. |
+| **Tele FLM** (52B-1T) | `CofeAI/Tele-FLM` | BAAI & TeleAI's multilingual model, available in 52-billion and 1-trillion parameter variants. It is a decoder-only transformer trained on ~2T tokens. |
+| **Ling** (16.8B–290B) | `inclusionAI/Ling-lite`, `inclusionAI/Ling-plus` | InclusionAI’s open MoE models. Ling-Lite has 16.8B total / 2.75B active parameters, and Ling-Plus has 290B total / 28.8B active parameters. They are designed for high performance on NLP and complex reasoning tasks. |
+| **Granite 3.0, 3.1** (IBM) | `ibm-granite/granite-3.1-8b-instruct` | IBM's open dense foundation models optimized for reasoning, code, and business AI use cases. Integrated with Red Hat and watsonx systems. |
+| **Granite 3.0 MoE** (IBM) | `ibm-granite/granite-3.0-3b-a800m-instruct` | IBM’s Mixture-of-Experts models offering strong performance with cost-efficiency. MoE expert routing designed for enterprise deployment at scale. |
+| **GPT-J** (6B) | `EleutherAI/gpt-j-6b` | EleutherAI's GPT-2-like causal language model (6B) trained on the Pile dataset. |
+| **Orion** (14B) | `OrionStarAI/Orion-14B-Base` | A series of open-source multilingual large language models by OrionStarAI, pretrained on a 2.5T token multilingual corpus including Chinese, English, Japanese, Korean, and other languages, with strong performance in these languages. |
+| **Llama Nemotron Super** (v1, v1.5, NVIDIA) | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, `nvidia/Llama-3_3-Nemotron-Super-49B-v1_5` | The NVIDIA Nemotron family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. |
+| **Llama Nemotron Ultra** (v1, NVIDIA) | `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | The NVIDIA Nemotron family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. |
+| **NVIDIA Nemotron Nano 2.0** | `nvidia/NVIDIA-Nemotron-Nano-9B-v2` | The NVIDIA Nemotron family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. |
+| **NVIDIA Nemotron 3 Super** (NVIDIA) | `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` | The NVIDIA Nemotron 3 Super is a 120B-parameter MoE model (12B active) delivering high-quality reasoning and generation for enterprise AI agents. |
+| **NVIDIA Nemotron 3 Nano** (NVIDIA) | `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` | The NVIDIA Nemotron 3 Nano is a compact model designed for efficient edge and enterprise deployment with strong reasoning capabilities. |
+| **StarCoder2** (3B-15B) | `bigcode/starcoder2-7b` | StarCoder2 is a family of open large language models (LLMs) specialized for code generation and understanding. It is the successor to StarCoder, jointly developed by the BigCode project (a collaboration between Hugging Face, ServiceNow Research, and other contributors). |
+| **Jet-Nemotron** | `jet-ai/Jet-Nemotron-2B` | Jet-Nemotron is a new family of hybrid-architecture language models that surpass state-of-the-art open-source full-attention language models, while achieving significant efficiency gains. |
+| **Trinity** (Nano, Mini) | `arcee-ai/Trinity-Mini` | Arcee's foundational MoE Trinity family of models, open weights under Apache 2.0. |
+| **LFM2** (350M, 1.2B) | `LiquidAI/LFM2.5-1.2B-Instruct` | Liquid AI's hybrid attention + short convolution language model. |
+| **LFM2-MoE** (8B-A1B, 24B-A2B) | `LiquidAI/LFM2-8B-A1B` | Liquid AI's Mixture-of-Experts variant with sigmoid routing and top-k expert selection. |
+| **Falcon-H1** (0.5B–34B) | `tiiuae/Falcon-H1-34B-Instruct` | TII's hybrid Mamba-Transformer architecture combining attention and state-space models for efficient long-context inference. |
+| **Hunyuan-Large** (389B, MoE) | `tencent/Tencent-Hunyuan-Large` | Tencent's open-source MoE model with 389B total / 52B active parameters, featuring Cross-Layer Attention (CLA) for improved efficiency. |
+| **IBM Granite 4.0 (Hybrid, Dense)** | `ibm-granite/granite-4.0-h-micro`, `ibm-granite/granite-4.0-micro` | IBM Granite 4.0 micro models: hybrid Mamba–MoE (h-micro) and dense (micro) variants. Enterprise-focused reasoning models. |
+| **Sarvam 2** (30B-A2B, 105B-A10B) | `sarvamai/sarvam-2` | Sarvam's Mixture-of-Experts models. The 105B variant uses MLA (Multi-head Latent Attention) and the 30B variant uses GQA, both with 128 routed experts. |
diff --git a/docs_new/docs/supported-models/mindspore_models.mdx b/docs_new/docs/supported-models/mindspore_models.mdx new file mode 100644 index 000000000000..83ecd3a6f918 --- /dev/null +++ b/docs_new/docs/supported-models/mindspore_models.mdx @@ -0,0 +1,152 @@ +--- +title: "MindSpore Models" +--- +## Introduction + +MindSpore is a high-performance AI framework optimized for Ascend NPUs. This doc guides users to run MindSpore models in SGLang. + +## Requirements + +MindSpore currently only supports Ascend NPU devices. Users need to first install Ascend CANN 8.5. +The CANN software packages can be downloaded from the [Ascend Official Website](https://www.hiascend.com). + +## Supported Models + +Currently, the following models are supported: + +- **Qwen3**: Dense and MoE models +- **DeepSeek V3/R1** +- *More models coming soon...* + +## Installation + +> **Note**: Currently, MindSpore models are provided by an independent package `sgl-mindspore`. Support for MindSpore is built upon current SGLang support for Ascend NPU platform. Please first [install SGLang for Ascend NPU](../hardware-platforms/ascend-npus/ascend_npu) and then install `sgl-mindspore`: + +```bash Install +git clone https://github.com/mindspore-lab/sgl-mindspore.git +cd sgl-mindspore +pip install -e . +``` + + +## Run Model + +Current SGLang-MindSpore supports Qwen3 and DeepSeek V3/R1 models. This doc uses Qwen3-8B as an example. 
+ +### Offline inference + +Use the following script for offline inference: + +```python Offline Infer +import sglang as sgl + +# Initialize the engine with MindSpore backend +llm = sgl.Engine( + model_path="/path/to/your/model", # Local model path + device="npu", # Use NPU device + model_impl="mindspore", # MindSpore implementation + attention_backend="ascend", # Attention backend + tp_size=1, # Tensor parallelism size + dp_size=1 # Data parallelism size +) + +# Generate text +prompts = [ + "Hello, my name is", + "The capital of France is", + "The future of AI is" +] + +sampling_params = {"temperature": 0, "top_p": 0.9} +outputs = llm.generate(prompts, sampling_params) + +for prompt, output in zip(prompts, outputs): + print(f"Prompt: {prompt}") + print(f"Generated: {output['text']}") + print("---") +``` + +### Start server + +Launch a server with MindSpore backend: + +```bash Command +# Basic server startup +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ + --model-impl mindspore \ + --attention-backend ascend \ + --tp-size 1 \ + --dp-size 1 +``` + +For distributed server with multiple nodes: + +```bash Command +# Multi-node distributed server +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ + --model-impl mindspore \ + --attention-backend ascend \ + --dist-init-addr 127.0.0.1:29500 \ + --nnodes 2 \ + --node-rank 0 \ + --tp-size 4 \ + --dp-size 2 +``` + +## Troubleshooting + +#### Debug Mode + +Enable sglang debug logging by log-level argument. + +```bash Debug Mode +python3 -m sglang.launch_server \ + --model-path /path/to/your/model \ + --host 0.0.0.0 \ + --device npu \ + --model-impl mindspore \ + --attention-backend ascend \ + --log-level DEBUG +``` + +Enable mindspore info and debug logging by setting environments. 
+
+```bash Command
+# Pick one level; if both lines are run, the last export wins.
+export GLOG_v=1 # INFO
+export GLOG_v=0 # DEBUG
+```
+
+#### Explicitly select devices
+
+Use the following environment variable to explicitly select the devices to use.
+
+```bash Command
+export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7 # select the devices to use
+```
+
+#### Some communication environment issues
+
+In environments with a non-standard communication setup, set the following environment variable.
+
+```bash Disable LCCL
+export MS_ENABLE_LCCL=off # LCCL communication mode is not currently supported in SGLang-MindSpore
+```
+
+#### Some dependencies of protobuf
+
+In environments with a mismatched protobuf version, set the following environment variable to avoid a binary version mismatch.
+
+```bash Command
+export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python # to avoid protobuf binary version mismatch
+```
+
+## Support
+For MindSpore-specific issues:
+
+- Refer to the [MindSpore documentation](https://www.mindspore.cn/)
diff --git a/docs_new/docs/supported-models/modelscope.mdx b/docs_new/docs/supported-models/modelscope.mdx
new file mode 100644
index 000000000000..2e26eb292923
--- /dev/null
+++ b/docs_new/docs/supported-models/modelscope.mdx
@@ -0,0 +1,29 @@
+---
+title: "Use Models From ModelScope"
+---
+To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable `SGLANG_USE_MODELSCOPE`.
+
+```bash Set Environment Variable
+export SGLANG_USE_MODELSCOPE=true
+```
+
+We take [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) as an example.
+
+Launch the server:
+```bash Python
+python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
+```
+
+Or start it with Docker:
+
+```bash Docker
+docker run --gpus all \
+  -p 30000:30000 \
+  -v ~/.cache/modelscope:/root/.cache/modelscope \
+  --env "SGLANG_USE_MODELSCOPE=true" \
+  --ipc=host \
+  lmsysorg/sglang:latest \
+  python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
+```
+
+Note that ModelScope uses a different cache directory than Hugging Face. You may need to set it manually to avoid running out of disk space.
diff --git a/docs_new/docs/supported-models/multimodal_language_models.mdx b/docs_new/docs/supported-models/multimodal_language_models.mdx
new file mode 100644
index 000000000000..2686c660db8f
--- /dev/null
+++ b/docs_new/docs/supported-models/multimodal_language_models.mdx
@@ -0,0 +1,375 @@
+---
+title: "Multimodal Language Models"
+metatags:
+  description: "Documentation for Multimodal Language Models"
+---
+These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders.
+
+## Example Launch Command
+
+```bash Launch Server
+# --model-path takes an example HF/local path.
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
+  --host 0.0.0.0 \
+  --port 30000
+```
+
+> See the [OpenAI APIs section](../basic_usage/openai_api_vision) for how to send multimodal requests.
+
+## Supported models
+
+The supported models are summarized in the table below.
+
+If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for `Qwen2_5_VLForConditionalGeneration`, enter the following expression in the GitHub search bar:
+
+```text GitHub Search
+repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen2_5_VLForConditionalGeneration
+```
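To check several architectures, the search expression above can be generated rather than typed by hand. The sketch below is a convenience only. `build_search_query` and `build_search_url` are hypothetical helper names, and the `https://github.com/search?q=...&type=code` URL shape is an assumption about GitHub's web search, not something this documentation specifies.

```python
from urllib.parse import urlencode


def build_search_query(architecture: str) -> str:
    """Build the code-search expression for a model architecture class name."""
    return (
        "repo:sgl-project/sglang "
        r"path:/^python\/sglang\/srt\/models\// "
        + architecture
    )


def build_search_url(architecture: str) -> str:
    """Wrap the expression in a GitHub search URL (URL shape is an assumption)."""
    params = urlencode({"q": build_search_query(architecture), "type": "code"})
    return f"https://github.com/search?{params}"


print(build_search_query("Qwen2_5_VLForConditionalGeneration"))
print(build_search_url("Qwen2_5_VLForConditionalGeneration"))
```

The first line printed is the expression to paste into the search bar; the second is the same query percent-encoded for use as a link.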
+| Model Family (Variants) | Example HuggingFace Identifier | Description | Notes |
+|---|---|---|---|
+| Qwen-VL | `Qwen/Qwen3-VL-235B-A22B-Instruct` | Alibaba's vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. | |
+| DeepSeek-VL2 | `deepseek-ai/deepseek-vl2` | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. | |
+| DeepSeek-OCR / OCR-2 | `deepseek-ai/DeepSeek-OCR-2` | OCR-focused DeepSeek models for document understanding and text extraction. | Use `--trust-remote-code`. |
+| Janus-Pro (1B, 7B) | `deepseek-ai/Janus-Pro-7B` | DeepSeek's open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture with separate visual encoding paths, enhancing performance in both tasks. | |
+| MiniCPM-V / MiniCPM-o | `openbmb/MiniCPM-V-2_6` | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. | |
+| Llama 3.2 Vision (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. | |
+| LLaVA (v1.5 & v1.6) | e.g. `liuhaotian/llava-v1.5-13b` | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. | |
+| LLaVA-NeXT (8B, 72B) | `lmms-lab/llava-next-72b` | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. | |
+| LLaVA-OneVision | `lmms-lab/llava-onevision-qwen2-7b-ov` | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. | |
+| Gemma 3 (Multimodal) | `google/gemma-3-4b-it` | Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. | |
+| Kimi-VL (A3B) | `moonshotai/Kimi-VL-A3B-Instruct` | Kimi-VL is a multimodal model that can understand and generate text from images. | |
+| Mistral-Small-3.1-24B | `mistralai/Mistral-Small-3.1-24B-Instruct-2503` | Mistral 3.1 is a multimodal model that can generate text from text or image input. It also supports tool calling and structured output. | |
+| Phi-4-multimodal-instruct | `microsoft/Phi-4-multimodal-instruct` | Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. It supports text, vision and audio modalities in SGLang. | |
+| MiMo-VL (7B) | `XiaomiMiMo/MiMo-VL-7B-RL` | Xiaomi's compact yet powerful vision-language model featuring a native resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. | |
+| GLM-4.5V (106B) / GLM-4.1V (9B) | `zai-org/GLM-4.5V` | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Use `--chat-template glm-4v`. |
+| GLM-OCR | `zai-org/GLM-OCR` | GLM-OCR: A fast and accurate general OCR model | |
+| DotsVLM (General/OCR) | `rednote-hilab/dots.vlm1.inst` | RedNote's vision-language model built on a 1.2B vision encoder and DeepSeek V3 LLM, featuring NaViT vision encoder trained from scratch with dynamic resolution support and enhanced OCR capabilities through structured image data training. | |
+| DotsVLM-OCR | `rednote-hilab/dots.ocr` | Specialized OCR variant of DotsVLM optimized for optical character recognition tasks with enhanced text extraction and document understanding capabilities. | Do not use `--trust-remote-code`. |
+| NVILA (8B, 15B, Lite-2B, Lite-8B, Lite-15B) | `Efficient-Large-Model/NVILA-8B` | NVILA explores the full stack efficiency of multi-modal design, achieving cheaper training, faster deployment and better performance. | Chat template: `chatml`. |
+| NVIDIA Nemotron Nano 2.0 VL | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | NVIDIA Nemotron Nano v2 VL enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A and summarization capabilities. It builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, in order to achieve higher inference throughput in long document and video scenarios. | Use `--trust-remote-code`. You may need to adjust `--max-mamba-cache-size` (default is 512) to fit memory constraints. |
+| Ernie4.5-VL | `baidu/ERNIE-4.5-VL-28B-A3B-PT` | Baidu's vision-language models (28B, 424B). They support image and video comprehension, and also support thinking. | |
+| JetVLM | | JetVLM is a vision-language model designed for high-performance multimodal understanding and generation tasks, built upon Jet-Nemotron. | Coming soon |
+| Step3-VL (10B) | `stepfun-ai/Step3-VL-10B` | StepFun's lightweight open-source 10B parameter VLM for multimodal intelligence, excelling in visual perception, complex reasoning, and human alignment. | |
+| Qwen3-ASR (0.6B, 1.7B) | `Qwen/Qwen3-ASR-1.7B` | Alibaba's automatic speech recognition models supporting 52 languages. Served via the `/v1/audio/transcriptions` endpoint. | |
+| Qwen3-Omni | `Qwen/Qwen3-Omni-30B-A3B-Instruct` | Alibaba's omni-modal MoE model. Currently supports the Thinker component (multimodal understanding for text, images, audio, and video), while the Talker component (audio generation) is not yet supported. | |
+| LFM2-VL | `LiquidAI/LFM2.5-VL-1.6B` | Liquid AI's vision-language model combining a SigLip2 vision encoder (NaFlex variable-resolution) with the LFM2 hybrid attention + short convolution language model. Supports multi-image inputs. | |
+ +## Audio Transcription + +SGLang supports audio-only ASR models via the OpenAI-compatible `/v1/audio/transcriptions` endpoint. Upload an audio file and receive a transcription. + +### Launch Command + +```bash Command +sglang serve \ + --model-path Qwen/Qwen3-ASR-1.7B \ + --served-model-name qwen3-asr \ + --trust-remote-code \ + --host 0.0.0.0 --port 30000 +``` + +### Example Request + +```bash Command +curl http://localhost:30000/v1/audio/transcriptions \ + -F file=@audio.wav \ + -F model=qwen3-asr \ + -F response_format=verbose_json +``` + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Model Family | Example Identifier | Notes |
+|---|---|---|
+| Whisper | `openai/whisper-large-v3` | OpenAI's speech recognition model. |
+| Qwen3-ASR (0.6B, 1.7B) | `Qwen/Qwen3-ASR-1.7B` | Use `--trust-remote-code`. Supports 52 languages. |
+ +## Video Input Support + +SGLang supports video input for Vision-Language Models (VLMs), enabling temporal reasoning tasks such as video question answering, captioning, and holistic scene understanding. Video clips are decoded, key frames are sampled, and the resulting tensors are batched together with the text prompt, allowing multimodal inference to integrate visual and linguistic context. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Model Family | Example Identifier | Video notes |
+|---|---|---|
+| Qwen-VL (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-Omni) | `Qwen/Qwen3-VL-235B-A22B-Instruct` | The processor gathers `video_data`, runs Qwen's frame sampler, and merges the resulting features with text tokens before inference. |
+| GLM-4v (4.5V, 4.1V, MOE) | `zai-org/GLM-4.5V` | Video clips are read with Decord, converted to tensors, and passed to the model alongside metadata for rotary-position handling. |
+| NVILA (Full & Lite) | `Efficient-Large-Model/NVILA-8B` | The runtime samples eight frames per clip and attaches them to the multimodal request when `video_data` is present. |
+| LLaVA video variants (LLaVA-NeXT-Video, LLaVA-OneVision) | `lmms-lab/LLaVA-NeXT-Video-7B` | The processor routes video prompts to the `LlavaVid` video-enabled architecture, and the provided example shows how to query it with `sgl.video(...)` clips. |
+| NVIDIA Nemotron Nano 2.0 VL | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | The processor samples at 2 FPS, up to a maximum of 128 frames, matching model training. The model uses EVS, a pruning method that removes redundant tokens from video embeddings. By default, `video_pruning_rate=0.7`. To change it, pass `--json-model-override-args '{"video_pruning_rate": 0.0}'` (this example disables EVS). |
+| JetVLM | | The runtime samples eight frames per clip and attaches them to the multimodal request when `video_data` is present. |
+ +Use `sgl.video(path, num_frames)` when building prompts to attach clips from your SGLang programs. + +Example OpenAI-compatible request that sends a video clip: + +```python Complete Example +import requests + +url = "http://localhost:30000/v1/chat/completions" + +data = { + "model": "Qwen/Qwen3-VL-30B-A3B-Instruct", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "What’s happening in this video?"}, + { + "type": "video_url", + "video_url": { + "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4" + }, + }, + ], + } + ], + "max_tokens": 300, +} + +response = requests.post(url, json=data) +print(response.text) +``` + +## Usage Notes + +### Performance Optimization + +For multimodal models, you can use the `--keep-mm-feature-on-device` flag to optimize for latency at the cost of increased GPU memory usage: + +- **Default behavior**: Multimodal feature tensors are moved to CPU after processing to save GPU memory +- **With `--keep-mm-feature-on-device`**: Feature tensors remain on GPU, reducing device-to-host copy overhead and improving latency, but consuming more GPU memory + +Use this flag when you have sufficient GPU memory and want to minimize latency for multimodal inference. + +### Multimodal Inputs Limitation + +- **Use `--mm-process-config '{"image":{"max_pixels":1048576},"video":{"fps":3,"max_pixels":602112,"max_frames":60}}'`**: To set `image`, `video`, and `audio` input limits. + +This can reduce GPU memory usage, improve inference speed, and help to avoid OOM, but may impact model performance, thus set a proper value based on your specific use case. The config entries are passed as `images_kwargs`, `videos_kwargs`, and `audio_kwargs` to the HuggingFace processor, so each modality's settings are kept separate and do not collide. Refer to the HuggingFace documentation for your model's processor to understand the available parameters. 
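A minimal sketch of building the `--mm-process-config` value programmatically, so the JSON stays valid and correctly quoted for a POSIX shell. The helper name `build_mm_process_config_flag` is hypothetical and not part of SGLang; the keys simply mirror the example above, and which keys are accepted depends on your model's HuggingFace processor.

```python
import json
import shlex


def build_mm_process_config_flag(limits: dict) -> str:
    """Return a shell-safe `--mm-process-config ...` argument string.

    `limits` maps a modality name ("image", "video", "audio") to the
    keyword arguments for that modality. Illustrative helper only.
    """
    # Compact separators keep the JSON on one line without spaces.
    payload = json.dumps(limits, separators=(",", ":"))
    # shlex.quote wraps the JSON in single quotes for POSIX shells.
    return "--mm-process-config " + shlex.quote(payload)


flag = build_mm_process_config_flag(
    {
        "image": {"max_pixels": 1048576},
        "video": {"fps": 3, "max_pixels": 602112, "max_frames": 60},
    }
)
print(flag)
```

Generating the flag from a dictionary avoids hand-editing nested quotes when the limits change between deployments.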
+
+### Bidirectional Attention in Multimodal Model Serving
+**Note for serving the Gemma-3 multimodal model**:
+
+As mentioned in [Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM](https://huggingface.co/blog/gemma3#multimodality), Gemma-3 employs bidirectional attention between image tokens during the prefill phase. Currently, SGLang only supports bidirectional attention when using the Triton Attention Backend. Note, however, that SGLang's current bidirectional attention implementation is incompatible with both CUDA Graph and Chunked Prefill.
+
+To enable bidirectional attention, you can use the `TritonAttnBackend` while disabling CUDA Graph and Chunked Prefill. Example launch command:
+```bash Bidirectional Attention
+# Triton attention backend, with CUDA Graph and Chunked Prefill disabled
+python -m sglang.launch_server \
+  --model-path google/gemma-3-4b-it \
+  --host 0.0.0.0 --port 30000 \
+  --enable-multimodal \
+  --dtype bfloat16 --triton-attention-reduce-in-fp32 \
+  --attention-backend triton \
+  --disable-cuda-graph \
+  --chunked-prefill-size -1
+```
+
+If higher serving performance is required and some accuracy loss is acceptable, you can use other attention backends and enable features such as CUDA Graph and Chunked Prefill. Note that the model then falls back to causal attention instead of bidirectional attention.
diff --git a/docs_new/docs/supported-models/rerank_models.mdx b/docs_new/docs/supported-models/rerank_models.mdx
new file mode 100644
index 000000000000..034896dc8f07
--- /dev/null
+++ b/docs_new/docs/supported-models/rerank_models.mdx
@@ -0,0 +1,340 @@
+---
+title: Rerank models
+---
+SGLang offers comprehensive support for rerank models by incorporating optimized serving frameworks with a flexible programming interface.
This setup enables efficient processing of cross-encoder reranking tasks, improving the accuracy and relevance of search result ordering. SGLang’s design ensures high throughput and low latency during reranker model deployment, making it ideal for semantic-based result refinement in large-scale retrieval systems. + + +Rerank models in SGLang fall into two categories: + +- **Cross-encoder rerank models**: run with `--is-embedding` (embedding runner). +- **Decoder-only rerank models**: run **without** `--is-embedding` and use next-token logprob scoring (yes/no). + - Text-only (e.g. Qwen3-Reranker) + - Multimodal (e.g. Qwen3-VL-Reranker): also supports image/video content + +Some models may require `--trust-remote-code`. + + +## Supported rerank models + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Model Family (Rerank) | Example HuggingFace Identifier | Chat Template | Description |
+|---|---|---|---|
+| BGE-Reranker (BgeRerankModel) | `BAAI/bge-reranker-v2-m3` | N/A | Currently only supports the `triton` and `torch_native` attention backends. High-performance cross-encoder reranker model from BAAI. Suitable for reranking search results based on semantic relevance. |
+| Qwen3-Reranker (decoder-only yes/no) | `Qwen/Qwen3-Reranker-8B` | `examples/chat_template/qwen3_reranker.jinja` | Decoder-only reranker using next-token logprob scoring for labels (yes/no). Launch without `--is-embedding`. |
+| Qwen3-VL-Reranker (multimodal yes/no) | `Qwen/Qwen3-VL-Reranker-2B` | `examples/chat_template/qwen3_vl_reranker.jinja` | Multimodal decoder-only reranker supporting text, images, and videos. Uses yes/no logprob scoring. Launch without `--is-embedding`. |
+ + +## Cross-Encoder Rerank (embedding runner) + +### Launch Command + +```shell +python3 -m sglang.launch_server \ + --model-path BAAI/bge-reranker-v2-m3 \ + --host 0.0.0.0 \ + --disable-radix-cache \ + --chunked-prefill-size -1 \ + --attention-backend triton \ + --is-embedding \ + --port 30000 +``` + +### Example Client Request + +```python +import requests + +url = "http://127.0.0.1:30000/v1/rerank" + +payload = { + "model": "BAAI/bge-reranker-v2-m3", + "query": "what is panda?", + "documents": [ + "hi", + "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." + ], + "top_n": 1, + "return_documents": True +} + +response = requests.post(url, json=payload) +response_json = response.json() + +for item in response_json: + if item.get("document"): + print(f"Score: {item['score']:.2f} - Document: '{item['document']}'") + else: + print(f"Score: {item['score']:.2f} - Index: {item['index']}") +``` + +**Request Parameters:** + +- `query` (required): The query text to rank documents against +- `documents` (required): List of documents to be ranked +- `model` (required): Model to use for reranking +- `top_n` (optional): Maximum number of documents to return. Defaults to returning all documents. If specified value is greater than the total number of documents, all documents will be returned. +- `return_documents` (optional): Whether to return documents in the response. Defaults to `True`. + +## Qwen3-Reranker (decoder-only yes/no rerank) + +### Launch Command + +```shell +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-Reranker-0.6B \ + --trust-remote-code \ + --disable-radix-cache \ + --host 0.0.0.0 \ + --port 8001 \ + --chat-template examples/chat_template/qwen3_reranker.jinja +``` + + +Qwen3-Reranker uses decoder-only logprob scoring (yes/no). Do NOT launch it with `--is-embedding`. 
+ + +### Example Client Request (supports optional instruct, top_n, and return_documents) + +```shell +curl -X POST http://127.0.0.1:8001/v1/rerank \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3-Reranker-0.6B", + "query": "法国首都是哪里?", + "documents": [ + "法国的首都是巴黎。", + "德国的首都是柏林。", + "香蕉是黄色的水果。" + ], + "instruct": "Given a web search query, retrieve relevant passages that answer the query.", + "top_n": 2, + "return_documents": true + }' +``` + +**Request Parameters:** + +- `query` (required): The query text to rank documents against +- `documents` (required): List of documents to be ranked +- `model` (required): Model to use for reranking +- `instruct` (optional): Instruction text for the reranker +- `top_n` (optional): Maximum number of documents to return. Defaults to returning all documents. If specified value is greater than the total number of documents, all documents will be returned. +- `return_documents` (optional): Whether to return documents in the response. Defaults to `True`. + +### Response Format + +`/v1/rerank` returns a list of objects (sorted by descending score): + +- `score`: float, higher means more relevant +- `document`: the original document string (only included when `return_documents` is `true`) +- `index`: the original index in the input `documents` +- `meta_info`: optional debug/usage info (may be present for some models) + +The number of returned results is controlled by the `top_n` parameter. If `top_n` is not specified or is greater than the total number of documents, all documents are returned. 
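The ordering and `top_n` rule described above can be mirrored on the client side. This is a sketch for illustration only: `select_top_n` is a hypothetical helper, not SGLang's server implementation, and the sample scores are made up.

```python
from typing import Optional


def select_top_n(results: list, top_n: Optional[int] = None) -> list:
    """Sort rerank results by descending score and keep at most `top_n`."""
    ranked = sorted(results, key=lambda item: item["score"], reverse=True)
    # No limit, or a limit beyond the list length, returns everything.
    if top_n is None or top_n >= len(ranked):
        return ranked
    return ranked[:top_n]


# Made-up scores in arbitrary order; `index` refers to the input position.
scored = [
    {"score": 0.01, "index": 1},
    {"score": 0.99, "index": 0},
    {"score": 0.00, "index": 2},
]

print([item["index"] for item in select_top_n(scored, top_n=2)])
print(len(select_top_n(scored, top_n=10)))
```

Because each result carries its original `index`, the caller can map scores back to the input `documents` even when `return_documents` is disabled.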
+ +Example (with `return_documents: true`): + +```json +[ + {"score": 0.99, "document": "法国的首都是巴黎。", "index": 0}, + {"score": 0.01, "document": "德国的首都是柏林。", "index": 1}, + {"score": 0.00, "document": "香蕉是黄色的水果。", "index": 2} +] +``` + +Example (with `return_documents: false`): + +```json +[ + {"score": 0.99, "index": 0}, + {"score": 0.01, "index": 1}, + {"score": 0.00, "index": 2} +] +``` + +Example (with `top_n: 2`): + +```json +[ + {"score": 0.99, "document": "法国的首都是巴黎。", "index": 0}, + {"score": 0.01, "document": "德国的首都是柏林。", "index": 1} +] +``` + +### Common Pitfalls + +- **`--chat-template` is required.** Without `--chat-template examples/chat_template/qwen3_reranker.jinja`, the server does not recognize the model as a decoder-only reranker and returns a 400 error: `"This model does not appear to be an embedding model by default. Please add `--is-embedding`..."`. The fix is to add the chat template flag, NOT `--is-embedding`. +- If you launch Qwen3-Reranker with `--is-embedding`, `/v1/rerank` cannot compute yes/no logprob scores. Relaunch **without** `--is-embedding`. +- If you see a validation error like "score should be a valid number" and the backend returned a list, upgrade to a version that coerces `embedding[0]` into `score` for rerank responses. + +## Qwen3-VL-Reranker (multimodal decoder-only rerank) + +Qwen3-VL-Reranker extends the Qwen3-Reranker to support multimodal content, allowing reranking of documents containing text, images, and videos. + +### Launch Command + +```shell +python3 -m sglang.launch_server \ + --model-path Qwen/Qwen3-VL-Reranker-2B \ + --trust-remote-code \ + --disable-radix-cache \ + --host 0.0.0.0 \ + --port 30000 \ + --chat-template examples/chat_template/qwen3_vl_reranker.jinja +``` + + +Qwen3-VL-Reranker uses decoder-only logprob scoring (yes/no) like Qwen3-Reranker. Do NOT launch it with `--is-embedding`. 
+ + +### Text-Only Reranking (backward compatible) + +```python +import requests + +url = "http://127.0.0.1:30000/v1/rerank" + +payload = { + "model": "Qwen3-VL-Reranker-2B", + "query": "What is machine learning?", + "documents": [ + "Machine learning is a branch of artificial intelligence that enables computers to learn from data.", + "The weather in Paris is usually mild with occasional rain.", + "Deep learning is a subset of machine learning using neural networks with many layers.", + ], + "instruct": "Retrieve passages that answer the question.", + "return_documents": True +} + +response = requests.post(url, json=payload) +results = response.json() + +for item in results: + print(f"Score: {item['score']:.4f} - {item['document'][:60]}...") +``` + +### Image Reranking (text query, image/mixed documents) + +```python +import requests + +url = "http://127.0.0.1:30000/v1/rerank" + +payload = { + "query": "A woman playing with her dog on a beach at sunset.", + "documents": [ + # Document 1: Text description + "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset.", + # Document 2: Image URL + [ + { + "type": "image_url", + "image_url": { + "url": "https://example.com/beach_dog.jpeg" + } + } + ], + # Document 3: Text + Image (mixed) + [ + {"type": "text", "text": "A joyful scene at the beach:"}, + { + "type": "image_url", + "image_url": { + "url": "https://example.com/beach_dog.jpeg" + } + } + ] + ], + "instruct": "Retrieve images or text relevant to the user's query.", + "return_documents": False +} + +response = requests.post(url, json=payload) +results = response.json() + +for item in results: + print(f"Index: {item['index']}, Score: {item['score']:.4f}") +``` + +### Multimodal Query Reranking (query with image) + +```python +import requests + +url = "http://127.0.0.1:30000/v1/rerank" + +payload = { + # Query with text and image + "query": [ + {"type": "text", "text": "Find similar images to this:"}, + { + "type": "image_url", 
+ "image_url": { + "url": "https://example.com/reference_image.jpeg" + } + } + ], + "documents": [ + "A cat sleeping on a couch.", + "A woman and her dog enjoying the sunset at the beach.", + "A busy city street with cars and pedestrians.", + [ + { + "type": "image_url", + "image_url": { + "url": "https://example.com/similar_image.jpeg" + } + } + ] + ], + "instruct": "Find images or descriptions similar to the query image." +} + +response = requests.post(url, json=payload) +results = response.json() + +for item in results: + print(f"Index: {item['index']}, Score: {item['score']:.4f}") +``` + +### Request Parameters (Multimodal) + +- `query` (required): Can be a string (text-only) or a list of content parts: + - `{"type": "text", "text": "..."}` for text + - `{"type": "image_url", "image_url": {"url": "..."}}` for images + - `{"type": "video_url", "video_url": {"url": "..."}}` for videos +- `documents` (required): List where each document can be a string or list of content parts (same format as query) +- `instruct` (optional): Instruction text for the reranker +- `top_n` (optional): Maximum number of documents to return +- `return_documents` (optional): Whether to return documents in the response (default: `false`) + +### Common Pitfalls + +- Always use `--chat-template examples/chat_template/qwen3_vl_reranker.jinja` for Qwen3-VL-Reranker. +- Do NOT launch with `--is-embedding`. +- For best results, use `--disable-radix-cache` to avoid caching issues with multimodal content. +- **Note**: Currently only `Qwen3-VL-Reranker-2B` is tested and supported. The 8B model may have different behavior and is not guaranteed to work with this template. 
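The content-part shapes listed under the multimodal request parameters can be wrapped in small helpers so payloads are assembled consistently. The function names below (`text_part`, `image_part`, `video_part`, `build_rerank_payload`) are hypothetical and not part of SGLang; only the dictionary shapes come from the parameter list above, and the URL is a placeholder.

```python
import json


def text_part(text: str) -> dict:
    """Content part for plain text."""
    return {"type": "text", "text": text}


def image_part(url: str) -> dict:
    """Content part for an image referenced by URL."""
    return {"type": "image_url", "image_url": {"url": url}}


def video_part(url: str) -> dict:
    """Content part for a video referenced by URL."""
    return {"type": "video_url", "video_url": {"url": url}}


def build_rerank_payload(query, documents, instruct=None, top_n=None,
                         return_documents=None) -> dict:
    """Assemble a /v1/rerank request body, omitting unset optional fields."""
    payload = {"query": query, "documents": documents}
    optional = {
        "instruct": instruct,
        "top_n": top_n,
        "return_documents": return_documents,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_rerank_payload(
    query="A woman playing with her dog on a beach at sunset.",
    documents=[
        "A woman and her dog enjoying the sunset at the beach.",
        [image_part("https://example.com/beach_dog.jpeg")],
        [text_part("A joyful scene at the beach:"),
         image_part("https://example.com/beach_dog.jpeg")],
    ],
    instruct="Retrieve images or text relevant to the user's query.",
    return_documents=False,
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, as in the examples above.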
diff --git a/docs_new/docs/supported-models/reward_models.mdx b/docs_new/docs/supported-models/reward_models.mdx
new file mode 100644
index 000000000000..0ba8ccf51997
--- /dev/null
+++ b/docs_new/docs/supported-models/reward_models.mdx
@@ -0,0 +1,30 @@
+---
+title: Reward models
+---
+
+These models output a scalar reward score or classification result, often used in reinforcement learning or content moderation tasks.
+
+They are executed with `--is-embedding` and some may require `--trust-remote-code`.
+
+## Example launch Command
+
+
+```shell Command
+# --model-path takes a Hugging Face ID or a local path; --tp-size sets the tensor-parallel degree
+python3 -m sglang.launch_server \
+  --model-path Qwen/Qwen2.5-Math-RM-72B \
+  --is-embedding \
+  --host 0.0.0.0 --port 30000 \
+  --tp-size=4
+```
+
+
+## Supported models
+
+| Model Family (Reward) | Example HuggingFace Identifier | Description |
+|---|---|---|
+| **Llama (3.1 Reward / `LlamaForSequenceClassification`)** | `Skywork/Skywork-Reward-Llama-3.1-8B-v0.2` | Reward model (preference classifier) based on Llama 3.1 (8B) for scoring and ranking responses for RLHF. |
+| **Gemma 2 (27B Reward / `Gemma2ForSequenceClassification`)** | `Skywork/Skywork-Reward-Gemma-2-27B-v0.2` | Derived from Gemma‑2 (27B), this model provides human preference scoring for RLHF and multilingual tasks. |
+| **InternLM 2 (Reward / `InternLM2ForRewardModel`)** | `internlm/internlm2-7b-reward` | InternLM 2 (7B)–based reward model used in alignment pipelines to guide outputs toward preferred behavior. |
+| **Qwen2.5 (Reward - Math / `Qwen2ForRewardModel`)** | `Qwen/Qwen2.5-Math-RM-72B` | A 72B math-specialized RLHF reward model from the Qwen2.5 series, tuned for evaluating and refining responses.
|
+| **Qwen2.5 (Reward - Sequence / `Qwen2ForSequenceClassification`)** | `jason9693/Qwen2.5-1.5B-apeach` | A smaller Qwen2.5 variant used for sequence classification, offering an alternative RLHF scoring mechanism. |
diff --git a/docs_new/docs/supported-models/support_new_models.mdx b/docs_new/docs/supported-models/support_new_models.mdx
new file mode 100644
index 000000000000..ba7836732787
--- /dev/null
+++ b/docs_new/docs/supported-models/support_new_models.mdx
@@ -0,0 +1,522 @@
+---
+title: "How to Support New Models"
+description: "This document explains how to add support for new language models and multimodal large language models (MLLMs) in SGLang. It also covers how to test new models and register external implementations."
+---
+This document explains how to add support for new language models and multimodal large language models (MLLMs) in
+SGLang. It also covers how to test new models and register external implementations.
+
+## How to Support a New Language Model
+
+To support a new model in SGLang, you only need to add a single file under
+the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn
+from existing model implementations and create a new file for your model. For most models, you should be able to find a
+similar model to start with (e.g., starting from Llama). Also refer to how
+to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang).
+
+## How to Support a New Multimodal Large Language Model
+
+To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the
+standard LLM support:
+
+1. **Register your new model as multimodal**:
+   Extend `is_multimodal_model`
+   in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561)
+   to return `True` for your model.
+
+2.
**Register a new chat-template**: + Only when your default chat-template is unable to accept images as input: Register a new chat template in [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/parser/conversation.py) and the corresponding matching function. + +3. **Multimodal Data Processor**: + Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your + model’s dedicated processor. + See [multimodal_processor.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors) + for more details. + +4. **Handle Multimodal Tokens**: + Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be + expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data + with `RadixAttention`. + +5. **Handle Image Feature Extraction**: + Implement a `get_image_feature` function for your new model, which extracts image features from raw image data and converts them into the embeddings used by the language model. + +6. **Adapt to Vision Attention**: + Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`. + +You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or +other mllm implementations. These models demonstrate how to correctly handle both multimodal and textual inputs. + +## Testing and Debugging + +Please note all your testing and benchmarking results in PR description. + +### Interactive Debugging + +For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. 
The following two commands +should give the same text output and very similar prefill logits: + +- Get the reference output: + ```bash Command + python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm} + ``` +- Get the SGLang output: + ```bash Command + python3 -m sglang.bench_one_batch --correct --model [new model] + ``` + +### Add the Model to the Test Suite + +To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in +the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/registered/models/test_generation_models.py) +file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, +MMMU-Pro, etc.) in your PR. \\ +For VLMs, also include a test in `test_vision_openai_server_{x}.py` (e.g. [test_vision_openai_server_a.py](https://github.com/sgl-project/sglang/blob/main/test/registered/vlm/test_vision_openai_server_a.py)). + +This is an example command to run to test a new model on your local machine: + +```bash Run Test +ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others +``` + +### Benchmark + +- **(Required) MMMU**: follow MMMU benchmark [README.md](https://github.com/sgl-project/sglang/blob/main/benchmark/mmmu/README.md) to get SGLang vs. HF Transformer accuracy comparison. The accuracy score from SGLang run should not be much lower than that from HF Transformer run. Similarly, follow https://docs.sglang.io/developer_guide/benchmark_and_profiling.html to get performance comparison: TTFT and throughput must meet or exceed baselines (e.g., HF Transformer). +- **(Optional) Other evals**: If you ran other evals, please note the results in PR description. 
+ +## Port a Model from vLLM to SGLang + +The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable +resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models +from vLLM to SGLang. + +To port a model from vLLM to SGLang: + +- Compare these two files for guidance: + - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) + - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) +- The major differences include: + - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). + - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** + - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** + - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. + - **Remove `Sample`.** + - **Change the `forward()` functions** and add a `forward_batch()` method. + - **Add `EntryClass`** at the end. + - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. + +Note: make sure you add your new model to the supported models list in the supported models documentation. + +## Registering an External Model Implementation + +In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. +This allows you to integrate your model without modifying the source code. 
+ +For example: + +```python Register Model +from sglang.srt.models.registry import ModelRegistry +from sglang.srt.entrypoints.http_server import launch_server + +# For a single model, add it to the registry: +ModelRegistry.models[model_name] = model_class + +# For multiple models, you can imitate the import_model_classes() function: +from functools import lru_cache + +@lru_cache() +def import_new_model_classes(): + model_arch_name_to_cls = {} + # Populate model_arch_name_to_cls with your new model classes. + ... + return model_arch_name_to_cls + +ModelRegistry.models.update(import_new_model_classes()) + +# Launch the server with your server arguments: +launch_server(server_args) +``` + +## Example: Implementing and Serving a Llama Wrapper Model + +Below is an introductory, step-by-step walkthrough on how to implement a new model end-to-end in SGLang and then run it via the [Offline Engine](../basic_usage/offline_engine_api). + +### Implementing Our Model + +To keep things simple, this new model will be a simple wrapper around [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), and our goal will be just to bias the output logits for each `forward` call by taking the square root of each individual logit. + +Let's start by defining our model in a file called `llama_wrapper.py`. +The first step is to import the necessary libraries from SRT, which is SGLang's internal backend. 
+ +```python Example +# In the file `llama_wrapper.py` + +import torch +from transformers import LlamaConfig +from typing import Optional +from sglang.srt.layers.logits_processor import LogitsProcessorOutput +from sglang.srt.layers.quantization.base_config import QuantizationConfig +from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors + +from sglang.srt.models.llama import LlamaForCausalLM +``` + +Next, we declare a new `class` for our model and have it inherit from `LlamaForCausalLM`, which allows our model to access `LlamaForCausalLM`'s predefined modules and layers, such as `LlamaAttention` and `LlamaMLP`. +Note that almost all model implementations take in `config` and `quant_config` as arguments for their `__init__` method; `config` and `quant_config` are passed in via [`model_loader/loader.py`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_loader/loader.py#L219). +Because we have inherited from `LlamaForCausalLM`, we can pass our parameters directly to its constructor, which will set the member variables for us. + +```python Class Definition +class LlamaWrapper(LlamaForCausalLM): + def __init__( + self, + config: LlamaConfig, + quant_config: Optional[QuantizationConfig] = None, + prefix: str = "", + ) -> None: + super().__init__(config=config, quant_config=quant_config, prefix=prefix) +``` + +Now, we want to define the `forward` method, which is what will be called at inference time. +Note that the signature for `forward` is essentially the same for any model; you can take a look at the other models defined in the [`models` directory](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/) for references. 
+To see where exactly `forward` is called in the SGLang runtime's internals, take a look at [`forward_decode`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1705) and [`forward_extend`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1724) in the [`ModelRunner` class](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py). + +```python Forward Method Signature + @torch.no_grad() + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + forward_batch: ForwardBatch, + pp_proxy_tensors: Optional[PPProxyTensors] = None, + input_embeds: Optional[torch.Tensor] = None, + get_embedding: bool = False, + ) -> LogitsProcessorOutput: +``` + +We now call the `__call__` method for `self.model` (which is a member variable that `LlamaForCausalLM` defines in its `__init__` method), which eventually calls `LlamaForCausalLM`'s `forward` method. +After that, we feed the `hidden_states` into our model's `LogitsProcessor` (again defined in `LlamaForCausalLM`). + +```python Call Model and LogitsProcessor + hidden_states = self.model( + input_ids, + positions, + forward_batch, + input_embeds, + pp_proxy_tensors=pp_proxy_tensors, + ) + + res: LogitsProcessorOutput = self.logits_processor( + input_ids, + hidden_states, + self.lm_head, + forward_batch, + ) +``` + +After receiving the logits for the next token, we can finally perform our biasing step. + +```python Logit Biasing + orig_logits = res.next_token_logits + res.next_token_logits = torch.where( + orig_logits > 0, + orig_logits.sqrt(), + orig_logits + ) + + return res +``` + +Now, our `LlamaWrapper` model is created and ready to be served! 
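
To see what the biasing step does numerically, here is a dependency-free sketch of the same elementwise rule in plain Python. `bias_logits` is a hypothetical helper introduced only for illustration; it is not part of SGLang and works on a list of floats, whereas the wrapper above applies the equivalent rule to a `torch.Tensor` with `torch.where`.

```python
import math


def bias_logits(logits):
    """Replace each positive logit with its square root; keep the rest unchanged.

    Hypothetical stand-in for torch.where(x > 0, x.sqrt(), x) on a plain list.
    """
    return [math.sqrt(x) if x > 0 else x for x in logits]


example = [4.0, 0.25, 0.0, -9.0]
print(bias_logits(example))  # prints [2.0, 0.5, 0.0, -9.0]
```

Positive logits above 1 shrink and positive logits below 1 grow, so the rule pulls positive logits toward 1 while leaving zero and negative logits untouched. The guard on `x > 0` is also what keeps the square root away from negative inputs.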
+ +### Serving Our Model Via SGLang's Offline Engine + +The next step of this walkthrough is to run our new model offline, so that it can be served locally and without an HTTP server. + +First, create a new file called `run.py`. +Next, we must ensure that SGLang's `ModelRegistry` can find our model. +To do this, we first download the model's configuration and weights from Hugging Face. + +```python Example +# In the file `run.py` + +import asyncio +from functools import lru_cache +from huggingface_hub import snapshot_download +from llama_wrapper import LlamaWrapper # Make sure to import our new model! +import sglang as sgl +from sglang.srt.models.registry import ModelRegistry + +# Make sure to request access to this model on Hugging Face, then export your +# `HF_TOKEN` to download the model snapshot +llama_dir = snapshot_download( + repo_id="meta-llama/Llama-3.1-8B-Instruct", + local_dir="./llama_ckpt", +) +``` + +Now that we have our model on disk, we want to point it to `LlamaWrapper` by changing the `architectures` field in `./llama_ckpt/config.json` to be `LlamaWrapper`. +That way, when we pass in the path of our model checkpoint to SGLang, it will know that we want to use "LlamaWrapper" instead of "LlamaForCausalLM" as our model. + +```python Example +{ + "architectures": [ + # "LlamaForCausalLM" + "LlamaWrapper" + ], + ... +} +``` + +However, if we don't link our `LlamaWrapper` class to the "LlamaWrapper" registry keyword, then SGLang won't be able to find our model. +Thus, to register our `LlamaWrapper`, we follow the steps in the section above titled "Registering an External Model Implementation". + +```python Register LlamaWrapper +@lru_cache() +def import_new_model_classes(): + model_arch_name_to_cls = {"LlamaWrapper": LlamaWrapper} + return model_arch_name_to_cls + +ModelRegistry.models.update(import_new_model_classes()) +``` + +Lastly, when we create our `Engine`, we just pass in the path to the local model directory.
+Then, our `LlamaWrapper` is ready to be served; for this walkthrough, we will use SGLang `Engine`'s non-streaming asynchronous generation endpoint. + +```python Example +def main(): + llm = sgl.Engine(model_path="./llama_ckpt") + sampling_params = {"temperature": 0.2, "top_k": 5} + prompts = [ + "Write a short, neutral self-introduction for a fictional character. Hello, my name is", + "Provide a concise factual statement about France’s capital city. The capital of France is", + "Explain possible future trends in artificial intelligence. The future of AI is", + ] + + asyncio.run(run_llm(llm, sampling_params, prompts)) + + llm.shutdown() + +async def run_llm( + llm, + sampling_params, + prompts, +) -> None: + outputs = await llm.async_generate(prompts, sampling_params) + + for prompt, output in zip(prompts, outputs): + print(f"\nPrompt: {prompt}") + print(f"Generated text: {output['text']}") + +if __name__ == "__main__": + main() +``` + +Now, when we run `python run.py`, we will get the outputs of our newly created model! + +## Serving External Models via the Standard CLI + +The previous sections show how to register a model programmatically via `ModelRegistry` and serve it through the Offline Engine. Similar to vLLM's model plugin mechanism, there is an alternative that lets you keep using the standard `python -m sglang.launch_server` CLI without modifying any SGLang source code: you can register your model using the `SGLANG_EXTERNAL_MODEL_PACKAGE` environment variable. + +### The `EntryClass` Variable + +When SGLang scans a model package, it looks for the variable `EntryClass` at the module level of your Python file. The [model registry](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/registry.py) imports your file, checks for `EntryClass`, and registers the class assigned to it. If your model is based on a Hugging Face checkpoint, the name of this class must match the `"architectures"` field in your model's `config.json`.
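
The scan described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not SGLang's actual registry code: `collect_entry_classes` and the in-memory module `fake_module` are hypothetical names introduced here, and a real registry would import the modules of a package from disk rather than receive them as a list.

```python
import types


def collect_entry_classes(modules):
    """Map class name -> class for every module that defines EntryClass."""
    registry = {}
    for module in modules:
        entry = getattr(module, "EntryClass", None)
        if entry is None:
            continue  # modules without EntryClass are skipped
        # Assumption for this sketch: EntryClass is a class or a list of classes.
        entries = entry if isinstance(entry, (list, tuple)) else [entry]
        for cls in entries:
            registry[cls.__name__] = cls
    return registry


class LlamaWrapper:  # stand-in for the real model class
    pass


# Build an in-memory module instead of importing a file from disk.
fake_module = types.ModuleType("custom_llm.llama_wrapper")
fake_module.EntryClass = LlamaWrapper

registry = collect_entry_classes([fake_module, types.ModuleType("empty")])
print(sorted(registry))  # prints ['LlamaWrapper']
```

In this sketch the lookup key is the class name, which illustrates why the class name and the `architectures` entry in `config.json` have to agree.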
+ +For example, if you are implementing a Llama wrapper, add this line at the end of your model file: + +```python Example +# This is what "Add EntryClass at the end" means +EntryClass = LlamaWrapper +``` + +### Example: Text-Only Model + +Using the same Llama wrapper from the previous section, here is how to package and serve it via the CLI. + +1. Create your project + +``` +sglang_custom_project/ +|----setup.py +|----custom_llm/ + |----__init__.py + |----llama_wrapper.py +``` + +Write the `setup.py`: + +```python Example +# sglang_custom_project/setup.py + +from setuptools import setup, find_packages +setup( + name="sglang-custom-plugins", + version="0.1", + packages=find_packages(), +) +``` + +2. Write your model code + +Inside `llama_wrapper.py`, write your model and include `EntryClass`: + +```python Example +# sglang_custom_project/custom_llm/llama_wrapper.py + +import torch +from typing import Optional +from sglang.srt.layers.logits_processor import LogitsProcessorOutput +from sglang.srt.layers.quantization.base_config import QuantizationConfig +from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors +from sglang.srt.models.llama import LlamaForCausalLM + +class LlamaWrapper(LlamaForCausalLM): + def __init__(self, config, quant_config: Optional[QuantizationConfig] = None, + prefix: str = "") -> None: + super().__init__(config=config, quant_config=quant_config, prefix=prefix) + @torch.no_grad() + def forward(self, input_ids, positions, forward_batch, + pp_proxy_tensors=None, input_embeds=None, get_embedding=False): + hidden_states = self.model( + input_ids, positions, forward_batch, input_embeds, + pp_proxy_tensors=pp_proxy_tensors, + ) + res: LogitsProcessorOutput = self.logits_processor( + input_ids, hidden_states, self.lm_head, forward_batch, + ) + + orig = res.next_token_logits + res.next_token_logits = torch.where(orig > 0, orig.sqrt(), orig) + return res + +# Don't forget to add EntryClass +EntryClass = LlamaWrapper +``` + 
+3. Install your package + +Run this inside your `sglang_custom_project` directory to install your code into the active Python environment: + +```bash Command +pip install -e . +``` + +4. Update your `config.json` + +Update the `config.json` under your HuggingFace model checkpoint directory so the `architectures` field matches your class name: + +```json Config +{ + "architectures": ["LlamaWrapper"], + ... +} +``` + +5. Launch the server + +Set the environment variable before running the CLI: + +```bash Command +export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_llm +python -m sglang.launch_server \ + --model-path /path/to/Llama-3.1-8B-Instruct \ + --port 8000 +``` + +The `SGLANG_EXTERNAL_MODEL_PACKAGE` should be the parent folder name containing your model-related code. In this example, it should be `custom_llm`. + +### Example: Multimodal Model + +If you are working with multimodal models, setting `SGLANG_EXTERNAL_MODEL_PACKAGE` alone is not enough. SGLang also needs to recognize your architecture as multimodal to enable the image/video processing pipelines, and it needs a custom processor. + +You can handle this by setting two additional environment variables: + +- `SGLANG_EXTERNAL_MM_MODEL_ARCH`: Adds your architecture name to SGLang's internal list of multimodal models. +- `SGLANG_EXTERNAL_MM_PROCESSOR_PACKAGE`: Tells SGLang where to find your custom processor class. + +For example, let's build a custom model based on Qwen2-VL-Instruct that takes the square root of the logits. 
+ +Create the project: + +``` +sglang_custom_project_vl/ +|----setup.py +|----custom_vlm/ + |----__init__.py + |----qwenvl_wrapper.py +``` + +Write `setup.py`: + +```python Example +# sglang_custom_project_vl/setup.py + +from setuptools import setup, find_packages +setup( + name="sglang-custom-plugins-vl", + version="0.1", + packages=find_packages(), +) +``` + +Write the model in `qwenvl_wrapper.py`: + +```python Example +# sglang_custom_project_vl/custom_vlm/qwenvl_wrapper.py +import torch +from sglang.srt.models.qwen2_vl import Qwen2VLForConditionalGeneration +from sglang.srt.multimodal.processors.qwen_vl import QwenVLImageProcessor + +class CustomQwen2VL(Qwen2VLForConditionalGeneration): + def forward(self, input_ids, positions, forward_batch, + input_embeds=None, get_embedding=False): + res = super().forward( + input_ids, positions, forward_batch, + input_embeds=input_embeds, get_embedding=get_embedding + ) + if not get_embedding: + orig = res.next_token_logits + res.next_token_logits = torch.where(orig > 0, orig.sqrt(), orig) + return res + +class CustomQwen2VLProcessor(QwenVLImageProcessor): + models = [CustomQwen2VL] + + def __init__(self, hf_config, server_args, _processor, *args, **kwargs): + super().__init__(hf_config, server_args, _processor, *args, **kwargs) + +EntryClass = CustomQwen2VL +``` + +**Note:** you don't need a separate `EntryClass` for the custom processor as long as you associate the processor with the specific model class. + +Install the package, update `config.json`, and launch: + +```bash Command +pip install -e . +``` + +```json Config +{ + "architectures": ["CustomQwen2VL"], + ... 
+} +``` + +```bash Command +export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_vlm +export SGLANG_EXTERNAL_MM_MODEL_ARCH=CustomQwen2VL +export SGLANG_EXTERNAL_MM_PROCESSOR_PACKAGE=custom_vlm + +python -m sglang.launch_server \ + --model-path /path/to/Qwen2-VL-2B-Instruct \ + --port 8000 \ + --enable-multimodal +``` + +## Documentation + +Add the new model to the table of supported models in [generative_models.md](./generative_models) or [multimodal_language_models.md](./multimodal_language_models). + +--- + +By following these guidelines, you can add support for new language models and multimodal large language models in +SGLang and ensure they are thoroughly tested and easily integrated into the system. diff --git a/docs_new/docs/supported-models/transformers_fallback.mdx b/docs_new/docs/supported-models/transformers_fallback.mdx new file mode 100644 index 000000000000..8db61f5282a2 --- /dev/null +++ b/docs_new/docs/supported-models/transformers_fallback.mdx @@ -0,0 +1,59 @@ +--- +title: "Transformers Fallback in SGLang" +--- +`sglang` can fall back to using models that are available in `transformers`. This works for most decoder-style language models; support for vision-language models is coming soon. + +## Example Launch Command + +By default, SGLang uses its own implementation of a model if one is available and otherwise falls back to the `transformers` implementation. You can also select the `transformers` implementation explicitly by setting `--model-impl` to `transformers`. + +```shell Launch Server +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.2-1B-Instruct \ + --host 0.0.0.0 \ + --port 30000 \ + --model-impl transformers +``` + +## Supported features + +### Quantization + +The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the [Quantization page](../advanced_features/quantization) for more information about supported quantization in SGLang.
+ +### Remote code + +This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` and that correctly implements attention can be used in production! + +A model just needs the following two things: an attention module whose `forward` accepts `**kwargs` and dispatches through `ALL_ATTENTION_FUNCTIONS`, and a model class that sets `_supports_attention_backend = True`. + +```python Example +from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel +from torch import nn + +class MyAttention(nn.Module): + + def forward(self, hidden_states, **kwargs): # <- kwargs are required + + ... + attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation] + attn_output, attn_weights = attention_interface( + self, + query_states, + key_states, + value_states, + **kwargs, + ) + ... + +class MyModel(PreTrainedModel): + _supports_attention_backend = True +``` + +Here is what happens in the background: + +1. The config is loaded. +2. The `MyModel` Python class is loaded from the `auto_map`, and we check that the model sets `_supports_attention_backend`. +3. The `TransformersModel` backend is used. See `/srt/models/transformers`, which sets `self.config._attn_implementation = "sglang"`; this is why the attention module must go through `ALL_ATTENTION_FUNCTIONS`. + +That's it! diff --git a/docs_new/docs_migration_plan.md b/docs_new/docs_migration_plan.md new file mode 100644 index 000000000000..6391c6f9b3c3 --- /dev/null +++ b/docs_new/docs_migration_plan.md @@ -0,0 +1,133 @@ +# SGLang Documentation Migration Plan + +## Background + +Migrate the new Mintlify-based documentation (currently in the standalone `sgl-docs` repo) into the main `sglang` repo under `docs_new/`, and point `staging.docs.sglang.io` to it. + +### Current State + +| Item | Location | Stack | Domain | +|------|----------|-------|--------| +| Old docs | `sglang/docs/` | Sphinx + GitHub Pages | `docs.sglang.io` | +| New docs + cookbook | `sgl-project/sgl-docs` repo | Mintlify | `lmsysorg.mintlify.app` (temp preview) | + +- The cookbook already lives inside `sgl-docs/cookbook/`, so no separate repo is needed.
+- Old docs CI (`execute-notebook.yml`, `lint.yml`) only watches `docs/**`, will not be triggered by `docs_new/**`. + +--- + +## Phase 1: Git Subtree Merge (Local Experiment) + +> Goal: Merge `sgl-docs` into `sglang` repo's `docs_new/` directory, preserving full commit history and authorship. + +```bash +# 1. Create a new branch (sglang remote is NOT affected) +cd /path/to/sglang +git checkout -b docs-new-migration + +# 2. Add sgl-docs as a remote (sgl-docs repo is NOT affected, read-only fetch) +git remote add sgl-docs git@github.com:sgl-project/sgl-docs.git +git fetch sgl-docs + +# 3. Subtree merge — all sgl-docs content goes into docs_new/, full history preserved +git subtree add --prefix=docs_new sgl-docs main + +# 4. Keep the remote for ongoing sync during migration period +# (remove only after sgl-docs is officially archived) +``` + +### Safety Guarantees + +- `sgl-docs` original repo: **unaffected** (fetch only, no push) +- `sglang` remote: **unaffected** (local branch, no push until ready) +- Rollback: `git checkout main && git branch -D docs-new-migration` + +### Side Effect: Contributors + +`git subtree add` (without `--squash`) imports all original commits. Authors from `sgl-docs` will appear in `sglang`'s git history and GitHub Contributors list. This is intentional — it gives proper credit. + +--- + +## Phase 2: Configure Mintlify for `docs_new/` (on branch) + +> Goal: Make Mintlify read from `sglang` repo's `docs_new/` subdirectory instead of the standalone `sgl-docs` repo. **No need to merge to main first** — Mintlify can point to a specific branch for validation. + +1. Log in to [Mintlify Dashboard](https://dashboard.mintlify.com) +2. Change the project's **GitHub repository** from `sgl-project/sgl-docs` to `sgl-project/sglang` +3. Set **Branch** to `docs-new-migration` (temporarily, for validation) +4. Set **Documentation directory** to `docs_new` (Mintlify supports monorepo subdirectory) +5. 
`docs.json` (Mintlify config) will be at `docs_new/docs.json` after the subtree merge — paths inside it (e.g., `cookbook/llm/Qwen/Qwen3`) are relative to `docs_new/`, so no changes needed +6. Verify the preview build succeeds on Mintlify + +--- + +## Phase 3: DNS & Custom Domain for `staging.docs.sglang.io` + +> Goal: Make `staging.docs.sglang.io` serve the new Mintlify docs. + +1. **DNS**: Add a CNAME record for `staging.docs.sglang.io` pointing to Mintlify's endpoint (typically `cname.mintlify.dev`) +2. **Mintlify Dashboard**: Settings > Custom Domain > add `staging.docs.sglang.io` +3. Mintlify handles SSL certificate automatically +4. Verify `staging.docs.sglang.io` loads correctly + +--- + +## Phase 4: Ongoing Sync During Migration Period + +> During the transition, `sgl-docs` may still receive updates. Sync them into `docs_new/` as needed. + +```bash +# Pull latest changes from sgl-docs into docs_new/ +git subtree pull --prefix=docs_new sgl-docs main +``` + +Once `sgl-docs` is frozen, this step is no longer needed. + +--- + +## Phase 5: CI/CD (Optional, Post-Migration) + +Current `docs/**` CI workflows will **NOT** trigger for `docs_new/**` changes. This is fine initially since Mintlify has its own GitHub integration for auto-deployment on push to main. + +Optional additions later: +- Link checking (lychee) for `docs_new/**/*.mdx` +- Mintlify broken-link or build validation on PR + +--- + +## Phase 6: Final Cutover + +> Goal: Promote staging to production. + +| Stage | `docs.sglang.io` | `staging.docs.sglang.io` | +|-------|-------------------|--------------------------| +| After Phase 3 | Sphinx (old docs) | Mintlify (new docs) | +| After cutover | Mintlify (new docs) | Keep or remove | + +Cutover steps: +1. Confirm `staging.docs.sglang.io` is stable and content-complete +2. Update DNS: point `docs.sglang.io` CNAME from GitHub Pages to Mintlify (`cname.mintlify.dev`) +3. Update Mintlify Dashboard custom domain to `docs.sglang.io` +4. 
Remove or archive old resources: + - Delete `sglang/docs/` (old Sphinx docs) + - Delete `.github/workflows/release-docs.yml` and `.github/workflows/execute-notebook.yml` + - Archive `sgl-project/sgl-docs` repo on GitHub + - Remove the `sgl-docs` git remote: `git remote remove sgl-docs` + - Optionally archive `sgl-project/sgl-project.github.io` repo + +--- + +## Execution Order + +> Mintlify supports pointing to a specific branch, so we can validate on `docs-new-migration` **before** merging to main. + +| Step | Action | Who | Dependency | +|------|--------|-----|------------| +| 1 | Phase 1: subtree merge on local branch | Dev | — | +| 2 | Push branch to `sgl-project/sglang` | Dev | Step 1 | +| 3 | Phase 2: configure Mintlify Dashboard to read from `sgl-project/sglang` branch `docs-new-migration` `docs_new/` | Admin (Mintlify access) | Step 2 | +| 4 | Phase 3: DNS CNAME + Mintlify custom domain for `staging.docs.sglang.io` | Admin (DNS access) | Step 3 | +| 5 | Verify staging site | Team | Step 4 | +| 6 | Merge PR to main, switch Mintlify branch back to `main` | Dev + Admin | Step 5 confirmed OK | +| 7 | Phase 4: sync any remaining sgl-docs updates | Dev | As needed | +| 8 | Phase 6: final cutover when ready | Admin | Step 6 done | diff --git a/docs_new/favicon.png b/docs_new/favicon.png new file mode 100644 index 000000000000..3e0fe3eda519 Binary files /dev/null and b/docs_new/favicon.png differ diff --git a/docs_new/fonts/Approach-Medium.woff2 b/docs_new/fonts/Approach-Medium.woff2 new file mode 100644 index 000000000000..8fc25399eafa Binary files /dev/null and b/docs_new/fonts/Approach-Medium.woff2 differ diff --git a/docs_new/fonts/Approach-Regular.woff2 b/docs_new/fonts/Approach-Regular.woff2 new file mode 100644 index 000000000000..2d57a149f126 Binary files /dev/null and b/docs_new/fonts/Approach-Regular.woff2 differ diff --git a/docs_new/images/dpa.png b/docs_new/images/dpa.png new file mode 100644 index 000000000000..672e022186e4 Binary files /dev/null and 
b/docs_new/images/dpa.png differ diff --git a/docs_new/index.mdx b/docs_new/index.mdx new file mode 100644 index 000000000000..6a5b1ed19ddb --- /dev/null +++ b/docs_new/index.mdx @@ -0,0 +1,497 @@ +--- +title: Welcome to SGLang +description: High-performance serving framework for large language and multimodal models. +keywords: + - sglang + - llm serving + - multimodal + - inference runtime +mode: wide +--- + + + Star + + + Fork + + +

+ + + + Designed for low-latency, high-throughput inference with RadixAttention, prefix caching, and multi-GPU parallelism. + + + + Broad support for Llama, Qwen, DeepSeek, and more. Compatible with Hugging + Face and OpenAI APIs. + + + + Native support across Hardware Platforms + including NVIDIA, AMD, Intel Xeon, Google TPU, and Ascend NPU accelerators. + + + + Open-source with widespread adoption, powering 400k+ GPUs and integrated with major RL frameworks. + + + +SGLang powers large-scale production deployments, generating trillions of tokens each day across more than 400,000 GPUs worldwide. It is hosted under the non-profit open-source organization [LMSYS](https://lmsys.org/about/). + +--- + +## Get Started + +SGLang is an inference framework meant for production level serving. +It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters. + + + + Install SGLang with pip, from source, or via Docker on your preferred hardware platform. + + + + Launch your first model server and send requests in minutes with OpenAI-compatible APIs. + + + +## News and latest blogs + +{/* BEGIN_LMSYS_SGLANG_BLOG_CARDS */} + +{/* END_LMSYS_SGLANG_BLOG_CARDS */} + +--- + +## Learn more and join the community + +
+**Stay connected**
+
+- Development roadmap: follow current priorities and upcoming work.
+- Weekly public development meeting: hear updates and join open discussions.
+- Slack: questions, feedback, and community support.
+- X (Twitter) and LinkedIn: project updates.
+- LMSYS blog: release notes, benchmarks, and technical deep dives.
+- Learning materials: blogs, slides, and videos.
diff --git a/docs_new/logo/logo.png b/docs_new/logo/logo.png new file mode 100644 index 000000000000..2a8bc258f666 Binary files /dev/null and b/docs_new/logo/logo.png differ diff --git a/docs_new/scripts/gen_redirects.py b/docs_new/scripts/gen_redirects.py new file mode 100755 index 000000000000..77e2dd7f189e --- /dev/null +++ b/docs_new/scripts/gen_redirects.py @@ -0,0 +1,216 @@ +#!/usr/bin/env python3 +"""Generate Mintlify docs.json redirects from old Sphinx paths to new Mintlify paths.""" + +from __future__ import annotations + +import json +import os +from pathlib import Path + +REPO = Path(__file__).resolve().parent.parent.parent +OLD_DOCS = REPO / "docs" +NEW_DOCS = REPO / "docs_new" / "docs" + +# Directory-level renames (old → new, under /docs/ prefix) +SECTION_RENAMES = { + "get_started": "get-started", + "platforms": "hardware-platforms", + "supported_models": "supported-models", + "diffusion": "sglang-diffusion", +} + +# Explicit file-level mappings. Keys are old URL paths (no .html, with leading /). +# Values are new URL paths (with /docs/ prefix, no extension). 
+EXPLICIT = { + # get_started → get-started + "/get_started/install": "/docs/get-started/installation", + # developer_guide rename + "/developer_guide/development_jit_kernel_guide": "/docs/developer_guide/JIT_kernels", + # platforms → hardware-platforms (with file renames) + "/platforms/amd_gpu": "/docs/hardware-platforms/amd-gpus", + "/platforms/cpu_server": "/docs/hardware-platforms/cpu-server", + "/platforms/tpu": "/docs/hardware-platforms/tpu", + "/platforms/xpu": "/docs/hardware-platforms/xpu", + # platforms/ascend → hardware-platforms/ascend-npus (flattened, renamed) + "/platforms/ascend/ascend_npu": "/docs/hardware-platforms/ascend-npus/SGLang-installation-with-NPUs-support", + "/platforms/ascend/ascend_npu_best_practice": "/docs/hardware-platforms/ascend-npus/Best-Practice-on-Ascend-NPU", + "/platforms/ascend/ascend_npu_deepseek_example": "/docs/hardware-platforms/ascend-npus/DeepSeek-Examples", + "/platforms/ascend/ascend_npu_glm5_examples": "/docs/hardware-platforms/ascend-npus/GLM-5", + "/platforms/ascend/ascend_npu_qwen3_examples": "/docs/hardware-platforms/ascend-npus/Qwen3-Examples", + "/platforms/ascend/ascend_npu_qwen3_5_examples": "/docs/hardware-platforms/ascend-npus/Qwen3.5", + "/platforms/ascend/ascend_npu_support_features": "/docs/hardware-platforms/ascend-npus/Support-Features-on-Ascend-NPU", + "/platforms/ascend/ascend_npu_support_models": "/docs/hardware-platforms/ascend-npus/Support-Models-on-Ascend-NPU", + # Old pages dropped — redirect to section overview + "/platforms/ascend/ascend_contribution_guide": "/docs/hardware-platforms/overview", + "/platforms/ascend/ascend_npu_environment_variables": "/docs/hardware-platforms/overview", + "/platforms/ascend/ascend_npu_quantization": "/docs/hardware-platforms/overview", + "/platforms/ascend/ascend_npu_support": "/docs/hardware-platforms/overview", + "/platforms/ascend/mindspore_backend": "/docs/hardware-platforms/overview", + "/platforms/ascend_npu_ring_sp_performance": 
"/docs/hardware-platforms/overview", + "/platforms/apple_metal": "/docs/hardware-platforms/overview", + "/platforms/mthreads_gpu": "/docs/hardware-platforms/overview", + "/platforms/nvidia_jetson": "/docs/hardware-platforms/overview", + "/platforms/plugin": "/docs/hardware-platforms/overview", + # supported_models → supported-models (flattened, renamed) + "/supported_models": "/docs/supported-models", + "/supported_models/index": "/docs/supported-models", + "/supported_models/extending/mindspore_models": "/docs/supported-models/mindspore-models", + "/supported_models/extending/modelscope": "/docs/supported-models/modelscope", + "/supported_models/extending/support_new_models": "/docs/supported-models/new-model-support", + "/supported_models/extending/transformers_fallback": "/docs/supported-models/transformers-fallback", + "/supported_models/extending/index": "/docs/supported-models", + "/supported_models/retrieval_ranking/classify_models": "/docs/supported-models/classification-models", + "/supported_models/retrieval_ranking/embedding_models": "/docs/supported-models/embedding-models", + "/supported_models/retrieval_ranking/rerank_models": "/docs/supported-models/rerank-models", + "/supported_models/retrieval_ranking/index": "/docs/supported-models", + "/supported_models/specialized/reward_models": "/docs/supported-models/reward-models", + "/supported_models/specialized/index": "/docs/supported-models", + "/supported_models/text_generation/generative_models": "/docs/supported-models/large-language-models", + "/supported_models/text_generation/multimodal_language_models": "/docs/supported-models/vision-language-models", + "/supported_models/text_generation/diffusion_language_models": "/docs/supported-models/diffusion-language-models", + "/supported_models/text_generation/index": "/docs/supported-models", + # diffusion → sglang-diffusion (file renames snake_case → kebab-case) + "/diffusion": "/docs/sglang-diffusion/installation", + "/diffusion/index": 
"/docs/sglang-diffusion/installation", + "/diffusion/installation": "/docs/sglang-diffusion/installation", + "/diffusion/environment_variables": "/docs/sglang-diffusion/environment-variables", + "/diffusion/ci_perf": "/docs/sglang-diffusion/ci-performance", + "/diffusion/api/cli": "/docs/sglang-diffusion/api/cli", + "/diffusion/api/openai_api": "/docs/sglang-diffusion/api/openai-api", + "/diffusion/performance/attention_backends": "/docs/sglang-diffusion/attention-backends", + "/diffusion/performance/cache/cache_dit": "/docs/sglang-diffusion/cache-dit", + "/diffusion/performance/cache/index": "/docs/sglang-diffusion/caching-acceleration", + "/diffusion/performance/cache/teacache": "/docs/sglang-diffusion/tea-cache", + "/diffusion/performance/index": "/docs/sglang-diffusion/performance-optimization", + "/diffusion/performance/profiling": "/docs/sglang-diffusion/profiling", + # Diffusion pages dropped + "/diffusion/api/post_processing": "/docs/sglang-diffusion/installation", + "/diffusion/compatibility_matrix": "/docs/sglang-diffusion/installation", + "/diffusion/contributing": "/docs/sglang-diffusion/installation", + "/diffusion/development": "/docs/sglang-diffusion/installation", + "/diffusion/disaggregation": "/docs/sglang-diffusion/installation", + "/diffusion/performance/ring_sp_performance": "/docs/sglang-diffusion/performance-optimization", + "/diffusion/quantization": "/docs/sglang-diffusion/installation", + "/diffusion/reference": "/docs/sglang-diffusion/installation", + "/diffusion/support_new_models": "/docs/sglang-diffusion/installation", + "/diffusion/usage": "/docs/sglang-diffusion/installation", + # basic_usage dropped pages + "/basic_usage/deepseek_ocr": "/docs/basic_usage/overview", + "/basic_usage/qwen3_5": "/docs/basic_usage/qwen3", + # advanced_features dropped pages + "/advanced_features/adaptive_speculative_decoding": "/docs/advanced_features/speculative_decoding", + "/advanced_features/hisparse_guide": "/docs/advanced_features/overview", + # 
references dropped + "/references/learn_more": "/", + "/references/release_lookup": "/docs/references/overview", + # Root index + "/index": "/", + "/": "/", +} + + +def old_url_from_path(rel: Path) -> str | None: + """Convert old docs/ to its Sphinx URL path (no .html, leading /).""" + parts = list(rel.parts) + stem = rel.stem + # Skip README, release_lookup/README, top-level non-doc files + if stem == "README": + return None + # Drop the extension → URL path + new_parts = parts[:-1] + [stem] + return "/" + "/".join(new_parts) + + +def new_url_for(old_url: str, new_files_set: set[str]) -> str | None: + """Compute new URL from old URL using section rename + explicit overrides.""" + if old_url in EXPLICIT: + return EXPLICIT[old_url] + # Default rule: `/section/path` → `/docs/section/path`, applying section renames + parts = old_url.strip("/").split("/") + if not parts or not parts[0]: + return None + section = parts[0] + section = SECTION_RENAMES.get(section, section) + new_url = "/docs/" + "/".join([section] + parts[1:]) + # Verify destination exists in new file tree + if new_url in new_files_set: + return new_url + return None # unmapped + + +def list_new_urls() -> set[str]: + urls = set() + for p in NEW_DOCS.rglob("*"): + if not p.is_file(): + continue + if p.suffix not in (".mdx", ".ipynb", ".md"): + continue + rel = p.relative_to(NEW_DOCS) + # Mintlify routes .mdx / .ipynb as `/docs/` + url = "/docs/" + str(rel.with_suffix("")).replace(os.sep, "/") + urls.add(url) + return urls + + +def main(): + new_urls = list_new_urls() + redirects: list[dict] = [] + seen_sources: set[str] = set() + unmapped: list[str] = [] + + # Iterate all old files + old_files = [] + for p in sorted(OLD_DOCS.rglob("*")): + if not p.is_file(): + continue + if p.suffix not in (".md", ".rst", ".ipynb"): + continue + rel = p.relative_to(OLD_DOCS) + # Skip non-doc dirs + if rel.parts and rel.parts[0] in ( + "_static", + "performance_dashboard", + "release_lookup", + ): + continue + 
old_files.append(rel) + + for rel in old_files: + old_url = old_url_from_path(rel) + if old_url is None: + continue + # Old Sphinx URLs end in .html + source = old_url + ".html" + if source in seen_sources: + continue + new_url = new_url_for(old_url, new_urls) + if new_url is None: + unmapped.append(source) + continue + redirects.append({"source": source, "destination": new_url}) + seen_sources.add(source) + + # Also add explicit entries whose source key wasn't derived from a file (e.g. index variants) + for old_key, new_val in EXPLICIT.items(): + source = old_key + ".html" + if source in seen_sources: + continue + # Only add if old_key corresponds to an actual old page pattern we care about + # Skip bare "/" and "/index" (handled by Mintlify default) + if old_key in ("/", "/index"): + continue + redirects.append({"source": source, "destination": new_val}) + seen_sources.add(source) + + # Output + print(f"# Total redirects: {len(redirects)}") + print(f"# Unmapped old URLs: {len(unmapped)}") + if unmapped: + print("# --- UNMAPPED ---") + for u in unmapped: + print(f"# {u}") + print(json.dumps(redirects, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/docs_new/scripts/update_lmsys_sglang_blogs.py b/docs_new/scripts/update_lmsys_sglang_blogs.py new file mode 100755 index 000000000000..4f29a5d8cab7 --- /dev/null +++ b/docs_new/scripts/update_lmsys_sglang_blogs.py @@ -0,0 +1,286 @@ +#!/usr/bin/env python3 +"""Sync SGLang-related LMSYS blog cards into index.mdx.""" + +from __future__ import annotations + +import json +import os +import re +import urllib.request +from dataclasses import dataclass +from pathlib import Path + +ROOT = Path(__file__).resolve().parents[1] +INDEX_PATH = ROOT / "index.mdx" + +START_MARKER = "{/* BEGIN_LMSYS_SGLANG_BLOG_CARDS */}" +END_MARKER = "{/* END_LMSYS_SGLANG_BLOG_CARDS */}" + +LMSYS_BLOG_API_URL = ( + "https://api.github.com/repos/lm-sys/lm-sys.github.io/contents/blog" +) +LMSYS_BLOG_BASE_URL = "https://lmsys.org/blog" 
+LMSYS_BASE_URL = "https://lmsys.org" +DEFAULT_IMAGE_URL = "https://lmsys.org/social.png" + +MAX_CARDS = int(os.getenv("LMSYS_SGLANG_MAX_CARDS", "6")) +KEYWORDS = [ + "sglang", + "sgl-project/sglang", + "sgl-kernel", + "sglang-jax", + "sgl diffusion", + "sglang diffusion", +] + +FRONTMATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n?", flags=re.DOTALL) +HTML_IMG_RE = re.compile(r"<img[^>]*\ssrc=[\"']([^\"']+)[\"']", flags=re.IGNORECASE) +MD_IMG_RE = re.compile(r"!\[[^\]]*]\(([^)]+)\)") + + +@dataclass +class BlogPost: + slug: str + title: str + url: str + image: str + date: str + + +def build_headers() -> dict[str, str]: + headers = { + "Accept": "application/vnd.github+json", + "User-Agent": "sgl-docs-lmsys-blog-sync", + } + token = os.getenv("GITHUB_TOKEN") + if token: + headers["Authorization"] = f"Bearer {token}" + return headers + + +def download_blog_sources() -> list[tuple[str, str]]: + # Fetch the directory listing for /blog only — no need to download the whole repo. + request = urllib.request.Request(LMSYS_BLOG_API_URL, headers=build_headers()) + with urllib.request.urlopen(request, timeout=60) as response: + items: list[dict] = json.loads(response.read()) + + sources: list[tuple[str, str]] = [] + for item in items: + if item.get("type") != "file" or not item.get("name", "").endswith(".md"): + continue + download_url = item.get("download_url") + if not download_url: + continue + raw_request = urllib.request.Request(download_url, headers=build_headers()) + with urllib.request.urlopen(raw_request, timeout=30) as raw_response: + content = raw_response.read().decode("utf-8", errors="replace") + sources.append((item["name"], content)) + + return sources + + +def split_frontmatter(content: str) -> tuple[dict[str, str], str]: + match = FRONTMATTER_RE.match(content) + if not match: + return {}, content + + frontmatter: dict[str, str] = {} + for raw_line in match.group(1).splitlines(): + line = raw_line.strip() + if not line or ":" not in line: + continue + + key, value 
= line.split(":", 1) + cleaned = value.strip() + if ( + (cleaned.startswith('"') and cleaned.endswith('"')) + or (cleaned.startswith("'") and cleaned.endswith("'")) + ) and len(cleaned) >= 2: + cleaned = cleaned[1:-1] + frontmatter[key.strip()] = cleaned + + return frontmatter, content[match.end() :] + + +def first_image_from_body(body: str) -> str | None: + markdown_match = MD_IMG_RE.search(body) + if markdown_match: + candidate = markdown_match.group(1).strip() + if candidate.startswith("<") and candidate.endswith(">"): + candidate = candidate[1:-1] + if " " in candidate: + candidate = candidate.split(" ", 1)[0] + return candidate + + html_match = HTML_IMG_RE.search(body) + if html_match: + return html_match.group(1).strip() + + return None + + +def to_absolute_url(url_or_path: str | None) -> str: + if not url_or_path: + return DEFAULT_IMAGE_URL + + value = url_or_path.strip() + if value.startswith(("http://", "https://")): + return value + if value.startswith("//"): + return f"https:{value}" + return f"{LMSYS_BASE_URL}/{value.lstrip('/')}" + + +def is_relevant(slug: str, title: str, body: str) -> bool: + searchable = f"{slug}\n{title}\n{body}".lower() + return any(keyword in searchable for keyword in KEYWORDS) + + +def parse_blog_post(filename: str, content: str) -> BlogPost | None: + if not filename.endswith(".md"): + return None + + slug = filename[:-3] + frontmatter, body = split_frontmatter(content) + + title = frontmatter.get("title", "").strip() or slug.replace("-", " ").title() + preview_img = frontmatter.get("previewImg") or first_image_from_body(body) + image = to_absolute_url(preview_img) + url = f"{LMSYS_BLOG_BASE_URL}/{slug}/" + date = frontmatter.get("date", "").strip() or slug[:10] + + if not is_relevant(slug=slug, title=title, body=body): + return None + + return BlogPost(slug=slug, title=title, url=url, image=image, date=date) + + +def render_cards(posts: list[BlogPost]) -> str: + if not posts: + return "No relevant LMSYS blog posts matched the 
current sync keywords." + + lines = [ + '
', + " ", + ] + for post in posts: + safe_title = json.dumps(post.title) + safe_url = json.dumps(post.url) + safe_image = json.dumps(post.image) + lines.extend( + [ + " ", + " ", + " ", + "
", + '
', + " ", + f" {{{safe_title}}}", + "

", + " ", + f" {{{json.dumps(post.date)}}}", + "

", + "
", + " ", + ] + ) + lines.extend([" ", ""]) + return "\n".join(lines) + + +def replace_generated_block(index_text: str, generated_cards: str) -> str: + pattern = re.compile( + rf"{re.escape(START_MARKER)}.*?{re.escape(END_MARKER)}", + flags=re.DOTALL, + ) + replacement = f"{START_MARKER}\n{generated_cards}\n{END_MARKER}" + updated_text, replacements = pattern.subn( + lambda _match: replacement, index_text, count=1 + ) + if replacements != 1: + raise RuntimeError( + f"Could not find exactly one marker block in {INDEX_PATH.name}. " + f"Expected markers: {START_MARKER} ... {END_MARKER}" + ) + return updated_text + + +def main() -> None: + sources = download_blog_sources() + relevant_posts: list[BlogPost] = [] + + for filename, content in sources: + post = parse_blog_post(filename=filename, content=content) + if post is not None: + relevant_posts.append(post) + + relevant_posts.sort(key=lambda post: post.slug, reverse=True) + selected_posts = relevant_posts[:MAX_CARDS] + + generated_cards = render_cards(selected_posts) + current_index = INDEX_PATH.read_text(encoding="utf-8") + updated_index = replace_generated_block( + index_text=current_index, generated_cards=generated_cards + ) + + if updated_index != current_index: + INDEX_PATH.write_text(updated_index, encoding="utf-8") + + print( + "Scanned " + f"{len(sources)} blog files, matched {len(relevant_posts)} posts, " + f"published {len(selected_posts)} cards." 
+ ) + + +if __name__ == "__main__": + main() diff --git a/docs_new/src/snippets/autoregressive/deepseek-math-v2-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-math-v2-deployment.jsx new file mode 100644 index 000000000000..5906d78770ce --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-math-v2-deployment.jsx @@ -0,0 +1,359 @@ +export const DeepSeekMathV2Deployment = () => { + const modelFamily = 'deepseek-ai'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', subtitle: '183GB', default: true }, + { id: 'b300', label: 'B300', subtitle: '275GB', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser deepseek-r1' : null + }, + dpattention: { + name: 'dpattention', + title: 'DP Attention', + items: [ + { id: 'disabled', label: 'Disabled', subtitle: 'Low Latency', default: true }, + { id: 'enabled', label: 'Enabled', subtitle: 'High Throughput', default: false } + ], + commandRule: null + } + }; + + // BF16 only, B200/B300 tp=8 + const modelConfigs = { + b200: { bf16: { tp: 8, mem: null } }, + b300: { bf16: { tp: 8, mem: null } } + }; + + const generateCommand = (values) => { + const { hardware } = values; + + const modelName = `${modelFamily}/DeepSeek-Math-V2`; + + const hwConfig = modelConfigs[hardware].bf16; + const tpValue = hwConfig.tp; + const memFraction = hwConfig.mem; + + let cmd = 'sglang serve --model-path'; + cmd += ` ${modelName}`; + + // TP setting + cmd += ` \\\n --tp ${tpValue}`; + + // DP Attention: --dp matches --tp + if (values.dpattention === 'enabled') { + cmd += ` \\\n --dp ${tpValue} \\\n --enable-dp-attention`; + } + + // EP setting (commonly matches tp for MoE models) + cmd += ` \\\n --ep ${tpValue}`; + + // 
Apply commandRule from all options + Object.entries(options).forEach(([key, option]) => { + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) { + cmd += ` \\\n ${rule}`; + } + } + }); + + // Memory fraction based on hardware and quantization (skip for 8-card configs) + if (memFraction) { + cmd += ` \\\n --mem-fraction-static ${memFraction}`; + } + + cmd += ' \\\n --host 0.0.0.0 \\\n --port 30000'; + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/deepseek-ocr-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-ocr-deployment.jsx new file mode 100644 index 000000000000..ce2b70b64021 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-ocr-deployment.jsx @@ -0,0 +1,168 @@ +export const DeepSeekOCRDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'fp16', label: 'FP16', default: true } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false }, + { id: 'ep', label: 'EP', subtitle: 'Expert Parallel', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode - prioritize page theme over system preference + useEffect(() => { + const checkDarkMode = () => { + // Check Mintlify's theme class on html element + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization, strategy } = values; + const strategyArray = Array.isArray(strategy) ? 
strategy : []; + + // Validation checks + // Check MI300X compatibility - MI300X + DeepSeek-OCR only supports FP16 quantization + if ((hardware === 'mi300x') && quantization !== 'fp16') { + return '# Error: MI300X + DeepSeek-OCR only supports FP16 quantization\n# Please select FP16 quantization'; + } + + // Model path + let modelPath = 'deepseek-ai/DeepSeek-OCR'; + + let cmd = 'python3 -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelPath}`; + cmd += ` \\\n --dtype float16`; + + // TP strategy + if (strategyArray.includes('tp')) { + cmd += ` \\\n --tp 1`; + } + + // DP strategy + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp 1 \\\n --enable-dp-attention`; + } + + // EP strategy + if (strategyArray.includes('ep')) { + cmd += ` \\\n --ep 1`; + } + + cmd += ` \\\n --enable-symm-mem # Optional: improves performance, but may be unstable`; + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? 
'#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/deepseek-ocr-v2-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-ocr-v2-deployment.jsx new file mode 100644 index 000000000000..caf66e233c24 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-ocr-v2-deployment.jsx @@ -0,0 +1,342 @@ +export const DeepSeekOCR2Deployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false }, + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'fp16', label: 'FP16', default: true }, + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false }, + { id: 'ep', label: 'EP', subtitle: 'Expert Parallel', default: false } + ] + }, + }; + + const generateCommand = (values) => { + const { hardware, strategy } = values; + + const strategyArray = Array.isArray(strategy) ? 
strategy : []; + + let modelPath = 'deepseek-ai/DeepSeek-OCR-2'; + + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelPath}`; + cmd += ` \\\n --enable-multimodal`; + + if (strategyArray.includes('tp')) { + cmd += ` \\\n --tp 1`; + } + + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp 1 \\\n --enable-dp-attention`; + } + + if (strategyArray.includes('ep')) { + cmd += ` \\\n --ep 1`; + } + + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd += ` \\\n --attention-backend triton` + ` \\\n --trust-remote-code`; + } + + cmd += ` \\\n --host 0.0.0.0 \\\n --port 30000`; + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/deepseek-r1-advanced-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-r1-advanced-deployment.jsx new file mode 100644 index 000000000000..b342d6b84e47 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-r1-advanced-deployment.jsx @@ -0,0 +1,856 @@ +export const DeepSeekR1AdvancedDeployment = () => { +const lookupData = { + "model": "deepseek-r1", + "version": "v0.5.6", + "ui_options": { + "hardware": [ + { + "id": "b200", + "label": "B200", + "default": true + }, + { + "id": "h200", + "label": "H200", + "default": false + }, + { + "id": "mi300x", + "label": "MI300X", + "default": false + }, + { + "id": "mi325x", + "label": "MI325X", + "default": false + }, + { + "id": "mi355x", + "label": "MI355X", + "default": false + } + ], + "quantization": [ + { + "id": "fp8", + "label": "FP8", + "default": true + }, + { + "id": "fp4", + "label": "FP4", + "default": false + } + ], + "scenario": [ + { + "id": "low-latency", + "label": "Low Latency", + "subtitle": "Concurrency 4-8", + "default": true + }, + { + "id": "high-throughput", + "label": "High Throughput", + "subtitle": "Concurrency 16-128", + "default": false + } + ], + "gpu_count": [ + { + "id": 4, + "label": "4 GPUs", + "default": false + }, + { + "id": 8, + "label": "8 GPUs", + "default": true + } + ] + }, + "configs": [ + { + "hardware": "b200", + "quantization": "fp4", + "gpu_count": 4, + "scenario": "low-latency", + "parameters": { + "model_path": "nvidia/DeepSeek-R1-0528-FP4-v2", + "tensor_parallel_size": 4, + "cuda_graph_max_bs": 256, + "max_running_requests": 256, + "mem_fraction_static": 0.85, + "ep_size": 4, + "scheduler_recv_interval": 10, + "enable_symm_mem": true, + "stream_interval": 10 + } + }, + { + "hardware": "b200", + "quantization": "fp4", + "gpu_count": 4, + "scenario": "high-throughput", + "parameters": { + "model_path": "nvidia/DeepSeek-R1-0528-FP4-v2", + "tensor_parallel_size": 4, + "cuda_graph_max_bs": 256, + 
"max_running_requests": 256, + "mem_fraction_static": 0.85, + "ep_size": 4, + "scheduler_recv_interval": 30, + "enable_symm_mem": true, + "stream_interval": 10 + } + }, + { + "hardware": "b200", + "quantization": "fp4", + "gpu_count": 8, + "scenario": "low-latency", + "parameters": { + "model_path": "nvidia/DeepSeek-R1-0528-FP4-v2", + "tensor_parallel_size": 8, + "cuda_graph_max_bs": 256, + "max_running_requests": 256, + "mem_fraction_static": 0.85, + "kv_cache_dtype": "fp8_e4m3", + "chunked_prefill_size": 16384, + "ep_size": 8, + "scheduler_recv_interval": 10, + "enable_symm_mem": true, + "stream_interval": 10 + } + }, + { + "hardware": "b200", + "quantization": "fp4", + "gpu_count": 8, + "scenario": "high-throughput", + "parameters": { + "model_path": "nvidia/DeepSeek-R1-0528-FP4-v2", + "tensor_parallel_size": 8, + "cuda_graph_max_bs": 256, + "max_running_requests": 256, + "mem_fraction_static": 0.85, + "kv_cache_dtype": "fp8_e4m3", + "chunked_prefill_size": 16384, + "ep_size": 8, + "scheduler_recv_interval": 30, + "enable_symm_mem": true, + "stream_interval": 10 + } + }, + { + "hardware": "b200", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "low-latency", + "parameters": { + "env_vars": "SGLANG_ENABLE_JIT_DEEPGEMM=false", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "tensor_parallel_size": 8, + "cuda_graph_max_bs": 128, + "max_running_requests": 128, + "mem_fraction_static": 0.82, + "kv_cache_dtype": "fp8_e4m3", + "chunked_prefill_size": 32768, + "max_prefill_tokens": 32768, + "scheduler_recv_interval": 10, + "stream_interval": 30, + "fp8_gemm_backend": "flashinfer_trtllm" + } + }, + { + "hardware": "b200", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "high-throughput", + "parameters": { + "env_vars": "SGLANG_ENABLE_JIT_DEEPGEMM=false", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "tensor_parallel_size": 8, + "cuda_graph_max_bs": 128, + "max_running_requests": 128, + "mem_fraction_static": 0.82, + "kv_cache_dtype": "fp8_e4m3", + 
"chunked_prefill_size": 32768, + "max_prefill_tokens": 32768, + "scheduler_recv_interval": 30, + "stream_interval": 30, + "fp8_gemm_backend": "flashinfer_trtllm" + } + }, + { + "hardware": "h200", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "low-latency", + "parameters": { + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "disable_radix_cache": true, + "max_running_requests": 256, + "cuda_graph_max_bs": 256, + "chunked_prefill_size": 32768, + "max_prefill_tokens": 32768, + "mem_fraction_static": 0.82, + "attention_backend": "flashinfer", + "stream_interval": 10, + "decode_log_interval": 1 + } + }, + { + "hardware": "h200", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "high-throughput", + "parameters": { + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "disable_radix_cache": true, + "max_running_requests": 512, + "cuda_graph_max_bs": 512, + "chunked_prefill_size": 32768, + "max_prefill_tokens": 32768, + "mem_fraction_static": 0.82, + "attention_backend": "flashinfer", + "stream_interval": 10, + "decode_log_interval": 1 + } + }, + { + "hardware": "mi300x", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "low-latency", + "parameters": { + "env_vars": "SGLANG_USE_AITER=1 SGLANG_AITER_MLA_PERSIST=1", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "mem_fraction_static": 0.8, + "cuda_graph_max_bs": 128, + "chunked_prefill_size": 131072, + "num_continuous_decode_steps": 4, + "max_prefill_tokens": 131072, + "kv_cache_dtype": "fp8_e4m3", + "attention_backend": "aiter", + "disable_radix_cache": true + } + }, + { + "hardware": "mi300x", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "high-throughput", + "parameters": { + "env_vars": "SGLANG_USE_AITER=1 SGLANG_AITER_MLA_PERSIST=1", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + 
"tensor_parallel_size": 8, + "mem_fraction_static": 0.8, + "cuda_graph_max_bs": 512, + "chunked_prefill_size": 131072, + "num_continuous_decode_steps": 4, + "max_prefill_tokens": 131072, + "kv_cache_dtype": "fp8_e4m3", + "attention_backend": "aiter", + "disable_radix_cache": true + } + }, + { + "hardware": "mi325x", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "low-latency", + "parameters": { + "env_vars": "SGLANG_USE_AITER=1 SGLANG_AITER_MLA_PERSIST=1", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "mem_fraction_static": 0.8, + "cuda_graph_max_bs": 128, + "chunked_prefill_size": 131072, + "num_continuous_decode_steps": 4, + "max_prefill_tokens": 131072, + "kv_cache_dtype": "fp8_e4m3", + "attention_backend": "aiter", + "disable_radix_cache": true + } + }, + { + "hardware": "mi325x", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "high-throughput", + "parameters": { + "env_vars": "SGLANG_USE_AITER=1 SGLANG_AITER_MLA_PERSIST=1", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "mem_fraction_static": 0.8, + "cuda_graph_max_bs": 512, + "chunked_prefill_size": 131072, + "num_continuous_decode_steps": 4, + "max_prefill_tokens": 131072, + "kv_cache_dtype": "fp8_e4m3", + "attention_backend": "aiter", + "disable_radix_cache": true + } + }, + { + "hardware": "mi355x", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "low-latency", + "parameters": { + "env_vars": "SGLANG_USE_AITER=1 RCCL_MSCCL_ENABLE=0 ROCM_QUICK_REDUCE_QUANTIZATION=INT4", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "mem_fraction_static": 0.8, + "disable_radix_cache": true, + "chunked_prefill_size": 196608, + "num_continuous_decode_steps": 4, + "max_prefill_tokens": 196608, + "cuda_graph_max_bs": 128, + "attention_backend": "aiter", + "kv_cache_dtype": "fp8_e4m3" + } + }, + { + "hardware": 
"mi355x", + "quantization": "fp8", + "gpu_count": 8, + "scenario": "high-throughput", + "parameters": { + "env_vars": "SGLANG_USE_AITER=1 RCCL_MSCCL_ENABLE=0 ROCM_QUICK_REDUCE_QUANTIZATION=INT4", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "mem_fraction_static": 0.8, + "disable_radix_cache": true, + "chunked_prefill_size": 196608, + "num_continuous_decode_steps": 4, + "max_prefill_tokens": 196608, + "cuda_graph_max_bs": 512, + "attention_backend": "aiter", + "kv_cache_dtype": "fp8_e4m3" + } + }, + { + "hardware": "mi355x", + "quantization": "fp4", + "gpu_count": 8, + "scenario": "low-latency", + "parameters": { + "env_vars": "SGLANG_USE_AITER=1 ROCM_QUICK_REDUCE_QUANTIZATION=INT4", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "mem_fraction_static": 0.8, + "disable_radix_cache": true, + "chunked_prefill_size": 196608, + "num_continuous_decode_steps": 4, + "max_prefill_tokens": 196608, + "cuda_graph_max_bs": 128, + "attention_backend": "aiter", + "kv_cache_dtype": "fp8_e4m3" + } + }, + { + "hardware": "mi355x", + "quantization": "fp4", + "gpu_count": 8, + "scenario": "high-throughput", + "parameters": { + "env_vars": "SGLANG_USE_AITER=1 ROCM_QUICK_REDUCE_QUANTIZATION=INT4", + "model_path": "deepseek-ai/DeepSeek-R1-0528", + "trust_remote_code": true, + "tensor_parallel_size": 8, + "mem_fraction_static": 0.8, + "disable_radix_cache": true, + "chunked_prefill_size": 196608, + "num_continuous_decode_steps": 4, + "max_prefill_tokens": 196608, + "cuda_graph_max_bs": 512, + "attention_backend": "aiter", + "kv_cache_dtype": "fp8_e4m3" + } + } + ], + "validation": [ + { + "hardware": "h200", + "quantization": "fp4", + "error": "FP4 is only available for B200 hardware. Please select FP8 quantization." 
+ } + ] +}; + +const fieldToFlag = { + model_path: 'model-path', + trust_remote_code: 'trust-remote-code', + tensor_parallel_size: 'tp', + data_parallel_size: 'dp', + ep_size: 'ep-size', + cuda_graph_max_bs: 'cuda-graph-max-bs', + max_running_requests: 'max-running-requests', + mem_fraction_static: 'mem-fraction-static', + kv_cache_dtype: 'kv-cache-dtype', + chunked_prefill_size: 'chunked-prefill-size', + max_prefill_tokens: 'max-prefill-tokens', + enable_flashinfer_allreduce_fusion: 'enable-flashinfer-allreduce-fusion', + scheduler_recv_interval: 'scheduler-recv-interval', + enable_symm_mem: 'enable-symm-mem', + disable_radix_cache: 'disable-radix-cache', + attention_backend: 'attention-backend', + moe_runner_backend: 'moe-runner-backend', + stream_interval: 'stream-interval', + quantization: 'quantization', + decode_log_interval: 'decode-log-interval', + fp8_gemm_backend: 'fp8-gemm-backend', + num_continuous_decode_steps: 'num-continuous-decode-steps', +}; + +const findConfig = (hardware, quantization, gpuCount, scenario) => { + const match = lookupData.configs.find((entry) => { + const hardwareMatch = entry.hardware === hardware; + const quantizationMatch = entry.quantization === quantization; + const gpuCountMatch = !entry.gpu_count || entry.gpu_count === Number.parseInt(gpuCount, 10); + const scenarioMatch = entry.scenario === scenario; + return hardwareMatch && quantizationMatch && gpuCountMatch && scenarioMatch; + }); + return match ? match.parameters : null; +}; + +const getAvailableGpuCounts = (hardware, quantization) => { + const entries = lookupData.configs.filter( + (entry) => entry.hardware === hardware && entry.quantization === quantization + ); + const gpuCounts = [...new Set(entries.map((entry) => entry.gpu_count))].filter(Boolean); + return gpuCounts.length > 0 ? 
gpuCounts.sort((a, b) => a - b) : [8]; +}; + +const generateCommandFromConfig = (config) => { + if (!config) { + return '# Error: Configuration not found'; + } + + let command = ''; + if (config.env_vars) { + command = `${config.env_vars} `; + } + + command += 'python3 -m sglang.launch_server \\\n'; + command += ` --model-path ${config.model_path}`; + + for (const [key, value] of Object.entries(config)) { + if (key === 'model_path' || key === 'env_vars') { + continue; + } + + const flagName = fieldToFlag[key]; + if (!flagName) { + continue; + } + + if (typeof value === 'boolean') { + if (value) { + command += ` \\\n --${flagName}`; + } + continue; + } + + command += ` \\\n --${flagName} ${value}`; + } + + return command; +}; + +const validateSelection = (hardware, quantization) => { + for (const rule of lookupData.validation || []) { + const hardwareMatch = Array.isArray(rule.hardware) + ? rule.hardware.includes(hardware) + : rule.hardware === hardware; + const quantizationMatch = Array.isArray(rule.quantization) + ? rule.quantization.includes(quantization) + : rule.quantization === quantization; + if (hardwareMatch && quantizationMatch) { + return rule.error; + } + } + return null; +}; + +const resolveItems = (option, values) => + typeof option.getDynamicItems === 'function' ? 
option.getDynamicItems(values) : option.items; + + const uiOptions = lookupData.ui_options; + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: uiOptions.hardware + .filter((option) => + ['b200', 'h200', 'mi300x', 'mi325x', 'mi355x'].includes(option.id) + ) + .map((option) => ({ + id: option.id, + label: option.label, + default: option.id === 'b200', + })), + }, + quantization: { + name: 'quantization', + title: 'Quantization', + getDynamicItems: (values) => + uiOptions.quantization.map((option) => { + const fp4Disabled = ['h200', 'mi300x', 'mi325x'].includes(values.hardware) && option.id === 'fp4'; + return { + id: option.id, + label: option.label, + default: + ['h200', 'mi300x', 'mi325x'].includes(values.hardware) + ? option.id === 'fp8' + : option.default, + disabled: fp4Disabled, + disabledReason: fp4Disabled ? 'FP4 not supported on H200, MI300X, MI325X' : '', + }; + }), + }, + gpuCount: { + name: 'gpuCount', + title: 'GPU Count', + getDynamicItems: (values) => { + const availableGpuCounts = getAvailableGpuCounts(values.hardware, values.quantization); + const allGpuCounts = uiOptions.gpu_count.map((option) => + typeof option.id === 'number' ? option.id : Number.parseInt(option.id, 10) + ); + const defaultGpuCount = Math.max(...availableGpuCounts); + + return allGpuCounts.map((count) => ({ + id: String(count), + label: `${count} GPUs`, + default: count === defaultGpuCount, + disabled: !availableGpuCounts.includes(count), + disabledReason: availableGpuCounts.includes(count) + ? 
'' + : `${count} GPUs not available for ${values.hardware.toUpperCase()} ${values.quantization.toUpperCase()}`, + })); + }, + }, + scenario: { + name: 'scenario', + title: 'Scenario', + items: uiOptions.scenario.map((option) => ({ + id: option.id, + label: option.label, + subtitle: option.subtitle, + default: option.default, + })), + }, + }; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, initialState) || []; + const fallback = + items.find((item) => item.default && !item.disabled) || + items.find((item) => !item.disabled) || + items[0]; + initialState[key] = fallback ? fallback.id : ''; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => { + const next = { ...prev, [optionName]: value }; + for (const [key, option] of Object.entries(options)) { + if (typeof option.getDynamicItems !== 'function') { + continue; + } + const items = option.getDynamicItems(next); + const current = items.find((item) => item.id === next[key]); + if (!current || current.disabled) { + const fallback = + items.find((item) => item.default && !item.disabled) || + items.find((item) => !item.disabled); + if (fallback) { + next[key] = fallback.id; + } + } + } + return next; + }); + }; + + const handleCheckboxChange = 
(optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const generateCommand = (vals) => { + const validationError = validateSelection(vals.hardware, vals.quantization); + if (validationError) { + return `# Error: ${validationError}`; + } + + const config = findConfig( + vals.hardware, + vals.quantization, + vals.gpuCount || '8', + vals.scenario + ); + if (!config) { + return `# Error: No configuration found for: +# Hardware: ${vals.hardware} +# Quantization: ${vals.quantization} +# GPU Count: ${vals.gpuCount} +# Scenario: ${vals.scenario} +# This combination is not yet supported.`; + } + + return generateCommandFromConfig(config); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} + +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/deepseek-r1-basic-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-r1-basic-deployment.jsx new file mode 100644 index 000000000000..d95ca5a2dfe1 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-r1-basic-deployment.jsx @@ -0,0 +1,394 @@ +export const DeepSeekR1BasicDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: true }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false }, + ], + }, + quantization: { + name: 'quantization', + title: 'Quantization', + getDynamicItems: (values) => { + const fp4Disabled = values.hardware === 'h100' || values.hardware === 'mi300x'; + return [ + { id: 'fp8', label: 'FP8', default: true }, + { + id: 'fp4', + label: 'FP4', + default: false, + disabled: fp4Disabled, + disabledReason: 'H100 and MI300X only support FP8 quantization', + }, + ]; + }, + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false }, + { id: 'ep', label: 'EP', subtitle: 'Expert Parallel', default: false }, + { id: 'mtp', label: 'MTP', subtitle: 'Multi-token Prediction', default: false }, + ], + }, + thinking: { + name: 'thinking', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false }, + ], + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false }, + ], + 
}, + }; + + const resolveItems = (option, values) => + typeof option.getDynamicItems === 'function' ? option.getDynamicItems(values) : option.items; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + continue; + } + + const items = resolveItems(option, initialState) || []; + const fallback = + items.find((item) => item.default && !item.disabled) || + items.find((item) => !item.disabled) || + items[0]; + initialState[key] = fallback ? fallback.id : ''; + } + return initialState; + }; + + const generateCommand = (values) => { + const { hardware, quantization, strategy, thinking, toolcall } = values; + const strategyValues = Array.isArray(strategy) ? strategy : []; + + if ((hardware === 'h100' || hardware === 'mi300x') && quantization === 'fp4') { + return '# Error: H100 and MI300X only support FP8 quantization'; + } + + const modelPath = + quantization === 'fp4' + ? 
'nvidia/DeepSeek-R1-0528-FP4-v2' + : 'deepseek-ai/DeepSeek-R1-0528'; + + let command = 'python3 -m sglang.launch_server \\\n'; + command += ` --model-path ${modelPath}`; + + if (strategyValues.includes('tp')) { + command += ' \\\n --tp 8'; + } + if (strategyValues.includes('dp')) { + command += ' \\\n --dp 8 \\\n --enable-dp-attention'; + } + if (strategyValues.includes('ep')) { + command += ' \\\n --ep 8'; + } + if (strategyValues.includes('mtp')) { + command = 'SGLANG_ENABLE_SPEC_V2=1 ' + command; + command += + ' \\\n --speculative-algorithm EAGLE' + + ' \\\n --speculative-num-steps 3' + + ' \\\n --speculative-eagle-topk 1' + + ' \\\n --speculative-num-draft-tokens 4'; + } + + command = `# Optional: --enable-symm-mem improves performance, but may be unstable\n${command} \\\n --enable-symm-mem`; // notes get their own line: a trailing "# ..." would swallow the "\" continuation + + if (hardware === 'b200' || (hardware === 'mi355x' && quantization === 'fp8')) { + command = + `# Optional: --kv-cache-dtype fp8_e4m3 enables fp8 kv cache and fp8 attention kernels to improve performance\n${command} \\\n --kv-cache-dtype fp8_e4m3`; + } + + if (thinking === 'enabled') { + command += ' \\\n --reasoning-parser deepseek-r1'; + } + if (toolcall === 'enabled') { + command += + ' \\\n --tool-call-parser deepseekv3' + + ' \\\n --chat-template examples/chat_template/tool_chat_template_deepseekr1.jinja'; + } + + return command; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => { + 
const next = { ...prev, [optionName]: value }; + if (optionName === 'hardware') { + const quantizationItems = resolveItems(options.quantization, next); + const current = quantizationItems.find((item) => item.id === next.quantization); + if (!current || current.disabled) { + const fallback = + quantizationItems.find((item) => item.default && !item.disabled) || + quantizationItems.find((item) => !item.disabled); + if (fallback) { + next.quantization = fallback.id; + } + } + } + return next; + }); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} + +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/deepseek-v3-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-v3-deployment.jsx new file mode 100644 index 000000000000..8615b01be643 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-v3-deployment.jsx @@ -0,0 +1,186 @@ +export const DeepSeekV3Deployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: true }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'fp8', label: 'FP8', default: true }, + { id: 'fp4', label: 'FP4', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false }, + { id: 'ep', label: 'EP', subtitle: 'Expert Parallel', default: false }, + { id: 'mtp', label: 'MTP', subtitle: 'Multi-token Prediction', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => 
item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode - prioritize page theme over system preference + useEffect(() => { + const checkDarkMode = () => { + // Check Mintlify's theme class on html element + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization, strategy, thinking, toolcall } = values; + const strategyArray = Array.isArray(strategy) ? strategy : []; + + // Validation - H100/H200/MI300X/MI325X only support FP8 + if (['h100', 'h200', 'mi300x', 'mi325x'].includes(hardware) && quantization === 'fp4') { + return '# Error: This hardware only supports FP8 quantization\n# Please select FP8 quantization or use B200/MI355X hardware'; + } + + const modelPath = quantization === 'fp4' ? 
'nvidia/DeepSeek-V3-0324-NVFP4' : 'deepseek-ai/DeepSeek-V3'; + + let cmd = 'python3 -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelPath}`; + + if (strategyArray.includes('tp')) cmd += ' \\\n --tp 8'; + if (strategyArray.includes('dp')) cmd += ' \\\n --dp 8 \\\n --enable-dp-attention'; + if (strategyArray.includes('ep')) cmd += ' \\\n --ep 8'; + if (strategyArray.includes('mtp')) { + cmd = 'SGLANG_ENABLE_SPEC_V2=1 ' + cmd; + cmd += ' \\\n --speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4'; + } + + cmd = `# Optional: --enable-symm-mem improves performance, but may be unstable\n${cmd} \\\n --enable-symm-mem`; // notes get their own line: a trailing "# ..." would swallow the "\" continuation + + if (hardware === 'b200') { + cmd = `# Optional: --kv-cache-dtype fp8_e4m3 enables fp8 kv cache and fp8 attention kernels\n${cmd} \\\n --kv-cache-dtype fp8_e4m3`; + } + + if (thinking === 'enabled') cmd += ' \\\n --reasoning-parser deepseek-v3'; + if (toolcall === 'enabled') cmd += ' \\\n --tool-call-parser deepseekv3 \\\n --chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja'; + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/deepseek-v31-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-v31-deployment.jsx new file mode 100644 index 000000000000..b09fa61f8bc3 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-v31-deployment.jsx @@ -0,0 +1,197 @@ +export const DeepSeekV31Deployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelname: { + name: 'modelname', + title: 'Model Name', + items: [ + { id: 'v31', label: 'DeepSeek-V3.1', default: true }, + { id: 'v31terminus', label: 'DeepSeek-V3.1-Terminus', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', default: true, required: true }, + { id: 'dp', label: 'DP attention', default: false }, + { id: 'ep', label: 'EP', default: false }, + { id: 'mtp', label: 'Multi-token Prediction', default: false } + ] + }, + reasoningParser: { + name: 'reasoningParser', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => 
item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode - prioritize page theme over system preference + useEffect(() => { + const checkDarkMode = () => { + // Check Mintlify's theme class on html element + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, modelname, strategy, reasoningParser, toolcall } = values; + const strategyArray = Array.isArray(strategy) ? 
strategy : []; + + // Model name mapping + const modelMap = { + 'v31': 'DeepSeek-V3.1', + 'v31terminus': 'DeepSeek-V3.1-Terminus' + }; + + const modelName = `deepseek-ai/${modelMap[modelname]}`; + + let cmd = 'python3 -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelName}`; + + // TP is mandatory + cmd += ` \\\n --tp 8`; + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp 8 \\\n --enable-dp-attention`; + } + if (strategyArray.includes('ep')) { + cmd += ` \\\n --ep 8`; + } + // Multi-token prediction (MTP) configuration + if (strategyArray.includes('mtp')) { + cmd += ` \\\n --speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4`; + } + + // Add tool-call-parser if enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser deepseekv31`; + } + + // Add reasoning-parser when enabled + if (reasoningParser === 'enabled') { + cmd += ` \\\n --reasoning-parser deepseek-v3`; + } + + // Add chat-template if tool calling is enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --chat-template ./examples/chat_template/tool_chat_template_deepseekv31.jinja`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/deepseek-v32-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-v32-deployment.jsx new file mode 100644 index 000000000000..641afd385687 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-v32-deployment.jsx @@ -0,0 +1,318 @@ +export const DeepSeekV32Deployment = () => { + // Config mirrors sgl-cookbook src/components/autoregressive/DeepSeekConfigGenerator/index.js. + // + // Model variants: + // DeepSeek-V3.2, V3.2-Exp, V3.2-Speciale → deepseek-ai/ family, TP=8 + // DeepSeek-V3.2-NVFP4 → nvidia/ family, B200 only, TP=4 + // DeepSeek-V3.2-MXFP4 → amd/ family, MI300X/MI355X only, TP=8 + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelname: { + name: 'modelname', + title: 'Model Name', + getDynamicItems: (values) => { + const hw = values.hardware; + const isB200 = hw === 'b200'; + const isAMD = hw === 'mi300x' || hw === 'mi355x'; + return [ + { id: 'v32', label: 'DeepSeek-V3.2', default: !isB200 && !isAMD }, + { id: 'v32speciale', label: 'DeepSeek-V3.2-Speciale', default: false }, + { id: 'v32exp', label: 'DeepSeek-V3.2-Exp', default: false }, + { id: 'v32nvfp4', label: 'DeepSeek-V3.2-NVFP4', default: isB200, disabled: !isB200, disabledReason: 'NVFP4 requires B200 (Blackwell)' }, + { id: 'v32mxfp4', label: 'DeepSeek-V3.2-MXFP4', default: isAMD, disabled: !isAMD, disabledReason: 'MXFP4 requires AMD MI300X/MI355X' } + ]; + } + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + condition: (values) => values.modelname !== 'v32nvfp4' && values.modelname !== 'v32mxfp4', + items: [ + { id: 'tp', label: 'TP', default: true, required: true }, + { id: 'dp', label: 'DP attention', default: false 
}, + { id: 'ep', label: 'EP', default: false }, + { id: 'mtp', label: 'Multi-token Prediction', default: false } + ] + }, + reasoningParser: { + name: 'reasoningParser', + title: 'Reasoning Parser', + condition: (values) => values.modelname !== 'v32nvfp4' && values.modelname !== 'v32mxfp4', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + condition: (values) => values.modelname !== 'v32nvfp4' && values.modelname !== 'v32mxfp4' && values.modelname !== 'v32speciale', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + const resolveItems = (option, vals) => { + if (typeof option.getDynamicItems === 'function') return option.getDynamicItems(vals); + return option.items; + }; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + if (option.type === 'checkbox') { + const items = resolveItems(option, initialState); + initialState[key] = items.filter(i => i.default).map(i => i.id); + } else { + const items = resolveItems(option, initialState); + const def = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled) || items[0]; + initialState[key] = def.id; + } + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => 
observer.disconnect(); + }, []); + + // When hardware changes, re-resolve model name defaults (NVFP4→B200, MXFP4→AMD). + useEffect(() => { + setValues(prev => { + const next = { ...prev }; + for (const [key, option] of Object.entries(options)) { + if (typeof option.getDynamicItems !== 'function') continue; + const items = option.getDynamicItems(next); + const current = items.find(i => i.id === next[key]); + if (!current || current.disabled) { + const fallback = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled); + if (fallback) next[key] = fallback.id; + } + } + return next; + }); + }, [values.hardware]); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + const generateCommand = () => { + const { hardware, modelname, strategy, reasoningParser, toolcall } = values; + + const isNvfp4 = modelname === 'v32nvfp4'; + const isMxfp4 = modelname === 'v32mxfp4'; + const isAMD = hardware === 'mi300x' || hardware === 'mi355x'; + + // Validation: NVFP4 requires B200 + if (isNvfp4 && hardware !== 'b200') { + return `# Error: DeepSeek-V3.2-NVFP4 requires NVIDIA B200 (Blackwell) hardware\n# Please select "B200" for Hardware Platform or choose a different model`; + } + + // Validation: MXFP4 requires AMD MI300X/MI355X + if (isMxfp4 && !isAMD) { + return `# Error: DeepSeek-V3.2-MXFP4 requires AMD MI300X/MI355X hardware\n# Please select "MI300X" or "MI355X" for Hardware Platform or choose a different model`; + } + + // Validation: Speciale doesn't support tool calling + if (modelname === 'v32speciale' && toolcall === 'enabled') { + return `# Error: DeepSeek-V3.2-Speciale 
doesn't support tool calling\n# Please select "Disabled" for Tool Call Parser or choose a different model`; + } + + // Model name mapping + const modelMap = { + 'v32': 'DeepSeek-V3.2', + 'v32exp': 'DeepSeek-V3.2-Exp', + 'v32speciale': 'DeepSeek-V3.2-Speciale', + 'v32nvfp4': 'DeepSeek-V3.2-NVFP4', + 'v32mxfp4': 'DeepSeek-V3.2-mxfp4' + }; + + let modelFamily; + if (isNvfp4) modelFamily = 'nvidia'; + else if (isMxfp4) modelFamily = 'amd'; + else modelFamily = 'deepseek-ai'; + + const modelName = `${modelFamily}/${modelMap[modelname]}`; + + // NVFP4: fixed config + if (isNvfp4) { + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ' \\\n --tp 4'; + cmd += ' \\\n --quantization modelopt_fp4'; + cmd += ' \\\n --moe-runner-backend flashinfer_trtllm'; + return cmd; + } + + // MXFP4: fixed config for AMD + if (isMxfp4) { + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ' \\\n --tp 8'; + cmd += ' \\\n --trust-remote-code'; + return cmd; + } + + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + + // Hardware platform specific parameters + if (isAMD) { + cmd += ' \\\n --trust-remote-code'; + cmd += ' \\\n --nsa-prefill-backend tilelang'; + cmd += ' \\\n --nsa-decode-backend tilelang'; + cmd += ' \\\n --cuda-graph-max-bs 64'; + } + + // Strategy configurations + const strategyArray = Array.isArray(strategy) ? 
strategy : []; + const tpSize = 8; + const dpSize = 8; + const epSize = 8; + cmd += ` \\\n --tp ${tpSize}`; + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp ${dpSize} \\\n --enable-dp-attention`; + } + if (strategyArray.includes('ep')) { + cmd += ` \\\n --ep ${epSize}`; + } + + // Multi-token prediction (MTP) configuration + if (strategyArray.includes('mtp')) { + cmd += ' \\\n --speculative-algorithm EAGLE'; + cmd += ' \\\n --speculative-num-steps 3'; + cmd += ' \\\n --speculative-eagle-topk 1'; + cmd += ' \\\n --speculative-num-draft-tokens 4'; + } + + // Add tool-call-parser if enabled (not supported for Speciale) + if (toolcall === 'enabled' && modelname !== 'v32speciale') { + if (modelname === 'v32exp') { + cmd += ' \\\n --tool-call-parser deepseekv31'; + } else if (modelname === 'v32') { + cmd += ' \\\n --tool-call-parser deepseekv32'; + } + } + + // Add reasoning-parser when enabled + if (reasoningParser === 'enabled') { + cmd += ' \\\n --reasoning-parser deepseek-v3'; + } + + // Add chat-template if tool calling is enabled (only for v32exp) + if (toolcall === 'enabled' && modelname === 'v32exp') { + cmd += ' \\\n --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja'; + } + + return cmd; + }; + + // Styles + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (typeof option.condition === 'function' && !option.condition(values)) return null; + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required || !!item.disabled; + return ( + + ); + }) + ) : ( + items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/deepseek-v4-deployment.jsx b/docs_new/src/snippets/autoregressive/deepseek-v4-deployment.jsx new file mode 100644 index 000000000000..0f71bf95fe72 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/deepseek-v4-deployment.jsx @@ -0,0 +1,943 @@ +export const DeepSeekV4Deployment = () => { + // DeepSeek-V4 deployment matrix (small / real checkpoint): + // Hardware × Recipe → concrete launch command. + // + // Hardware (quantization determined by GPU generation): + // B200 → FP4 weights, Flash TP=4 / Pro TP=8, both single-node + // GB200 → FP4 weights, Flash TP=4 single-node / Pro TP=8 2-node + // GB300 → FP4 weights, Flash TP=4 / Pro TP=4, both single-node + // H200 → FP8 weights, Flash TP=4 single-node / Pro TP=16 2-node + // Model variant → HF slug: + // Flash (285B) → deepseek-ai/DeepSeek-V4-Flash + // Pro (1.6T) → deepseek-ai/DeepSeek-V4-Pro + // + // Recipe: + // low-latency → TP only (no DP on H200 or Blackwell), MTP 3/4 + // balanced → DP-attn + DeepEP + MTP 1/2 + // max-throughput → DP-attn + DeepEP, no MTP + // cp → TP + DeepEP + context-parallel flags, no MTP + // pd-disagg → 1P1D (prefill + decode + router), separate commands shown together + // + // HF slugs, parser names, and `sglang serve` flag parity are all confirmed; + // see cookbook_v2/DISCUSSION.md (sections "人类提供的事实" [Human-provided facts] and "设计决定" [Design decisions] §3).
+ + const options = { + hardware: { + name: "hardware", + title: "Hardware Platform", + items: [ + { id: "b200", label: "B200 (FP4)", default: true }, + { id: "b300", label: "B300 (FP4)", default: false }, + { id: "gb200", label: "GB200 (FP4)", default: false }, + { id: "gb300", label: "GB300 (FP4)", default: false }, + { id: "h200", label: "H200 (FP8)", default: false }, + { id: "h200-fp4", label: "H200 (FP4)", default: false }, + ], + }, + modelSize: { + name: "modelSize", + title: "Model Variant", + items: [ + { id: "small", label: "Flash", default: true, subtitle: "285B" }, + { id: "big", label: "Pro", default: false, subtitle: "1.6T" }, + ], + }, + recipe: { + name: "recipe", + title: "Recipe", + items: [ + { id: "low-latency", label: "Low-Latency", default: true }, + { id: "balanced", label: "Balanced", default: false }, + { id: "max-throughput", label: "Max-Throughput", default: false }, + { id: "cp", label: "Context-Parallel", default: false }, + { id: "pd-disagg", label: "PD-Disagg", default: false }, + ], + }, + reasoningParser: { + name: "reasoningParser", + title: "Reasoning Parser", + items: [ + { id: "disabled", label: "Disabled", default: true }, + { id: "enabled", label: "Enabled", default: false, subtitle: "deepseek-v4" }, + ], + }, + toolcall: { + name: "toolcall", + title: "Tool Call Parser", + items: [ + { id: "disabled", label: "Disabled", default: true }, + { id: "enabled", label: "Enabled", default: false, subtitle: "deepseekv4" }, + ], + }, + }; + + // Recipes that are not supported on the H200 (FP4) Marlin path. + const H200_FP4_UNSUPPORTED_RECIPES = new Set(["cp", "pd-disagg"]); + + const resolveItems = (option, vals) => { + if (option.name === "recipe" && vals && vals.hardware === "h200-fp4") { + return option.items.map((it) => + H200_FP4_UNSUPPORTED_RECIPES.has(it.id) + ? 
{ ...it, disabled: true, disabledReason: "Not supported on H200 (FP4)" } + : it + ); + } + return option.items; + }; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option); + const def = items.find((i) => i.default && !i.disabled) || items.find((i) => !i.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains("dark") || + html.getAttribute("data-theme") === "dark" || + html.style.colorScheme === "dark"; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ["class", "data-theme", "style"], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => { + const next = { ...prev, [optionName]: value }; + // Switching to H200 (FP4) while cp / pd-disagg is selected: fall back + // to low-latency since those recipes are not supported on this path. + if ( + optionName === "hardware" && + value === "h200-fp4" && + H200_FP4_UNSUPPORTED_RECIPES.has(next.recipe) + ) { + next.recipe = "low-latency"; + } + return next; + }); + }; + + // ============================================================================ + // generateCommand — strict mirror of sunrise_allinone.py LAUNCH_COMMANDS + // for BOTH small and big (1.6T) real-checkpoint rows. + // + // SOURCE OF TRUTH: sunrise_final/sunrise_allinone.py LAUNCH_COMMANDS dict. + // Allowed deviations are documented in cookbook_v2/DISCUSSION.md + // → "Human-approved diffs from allinone": + // 1. 
NVSHMEM env (B200) removed — personal hardware NIC mapping + // 2. Model path uses HF slug instead of allinone's local paths + // 3. `sglang serve` instead of `python3 -m sglang.launch_server` + // 4. (retired — big is now a real ckpt and exposed) + // 5. GB300 PD MNNVL topology envs (MC_FORCE_MNNVL / NCCL_*) removed; + // SGLANG_MOONCAKE_CUSTOM_MEM_POOL kept. + // + // Any other diff vs allinone is a bug — fix the JSX, not the whitelist. + // ============================================================================ + + // === SHARED BEGIN === + // Constants reachable by both generateCommand and buildPDDisaggCommand. + // verify_commands.mjs also scrapes this block between the SHARED markers and + // prepends it to the extracted function bodies (since `new Function(body)` + // loses closure scope). Don't rename the markers. + + // Per (hardware, modelSize) spec derived from allinone _MODEL_SPEC. + // "small" (JSX id) = DeepSeek-V4-Flash (285B); "big" = DeepSeek-V4-Pro (1.6T). + // The internal ids match allinone's model="small" / model="big" keys so the + // verify_commands.py diff is mechanical. One HF repo per variant holds both + // FP8 and FP4 weights (quantization picked by hardware, not by repo suffix). + const HW_SIZE_SPEC = { + "b200|small": { slug: "deepseek-ai/DeepSeek-V4-Flash", tp: 4, multinode: false }, + "b200|big": { slug: "deepseek-ai/DeepSeek-V4-Pro", tp: 8, multinode: false }, + "gb300|small": { slug: "deepseek-ai/DeepSeek-V4-Flash", tp: 4, multinode: false }, + "gb300|big": { slug: "deepseek-ai/DeepSeek-V4-Pro", tp: 4, multinode: false }, + "gb200|small": { slug: "deepseek-ai/DeepSeek-V4-Flash", tp: 4, multinode: false }, + "gb200|big": { slug: "deepseek-ai/DeepSeek-V4-Pro", tp: 8, multinode: true, nnodes: 2 }, + // H200 needs an FP8-only Instruct ckpt (deepseek-ai's Flash/Pro repos ship + // FP4-mixed weights that Hopper can't run). sgl-project publishes FP8 + // repackagings for both variants. 
+ "h200|small": { slug: "sgl-project/DeepSeek-V4-Flash-FP8", tp: 4, multinode: false }, + "h200|big": { slug: "sgl-project/DeepSeek-V4-Pro-FP8", tp: 16, multinode: true, nnodes: 2 }, + // H200 (FP4) runs the original FP4-mixed Instruct repos through the Marlin + // MoE runner: experts are dequantized from FP4 to FP16 at runtime, so a + // single-node TP=4 / TP=8 deployment fits Flash / Pro on Hopper. + "h200-fp4|small": { slug: "deepseek-ai/DeepSeek-V4-Flash", tp: 4, multinode: false }, + "h200-fp4|big": { slug: "deepseek-ai/DeepSeek-V4-Pro", tp: 8, multinode: false }, + }; + // Per (hardware, modelSize) PD role TP (from allinone _PD_SPEC). + const PD_TP_SPEC = { + "b200|small": { tp: 2, multinode: false }, + "b200|big": { tp: 8, multinode: false }, + "gb300|small": { tp: 4, multinode: false }, + "gb300|big": { tp: 4, multinode: false }, + "gb200|small": { tp: 4, multinode: false }, + "gb200|big": { tp: 8, multinode: true, nnodes: 2 }, + "h200|small": { tp: 4, multinode: false }, + "h200|big": { tp: 16, multinode: true, nnodes: 2 }, + }; + // Recipes that have been end-to-end verified on the latest (Flash/Pro) HF + // checkpoints. Every cell NOT listed here is emitted with its entire body + // commented out (every line prefixed with `# `) plus a "being verified" + // banner on top — so copy-pasting an unverified command is a no-op in shell. + // To mark a cell verified, add its "hardware|modelSize|recipe" string here + // and the cell renders as a normal, runnable command. + // pd-disagg is verified as a single unit (both prefill and decode together). 
+ const VERIFIED_RECIPES = new Set([ + "b200|small|low-latency", + "b200|small|balanced", + "b200|small|max-throughput", + "b200|small|cp", + "b200|small|pd-disagg", + "b200|big|low-latency", + "b200|big|balanced", + "b200|big|max-throughput", + "b200|big|cp", + "h200|small|low-latency", + "h200|small|balanced", + "h200|small|max-throughput", + "gb300|small|low-latency", + "gb300|big|low-latency", + "gb300|small|balanced", + "gb300|big|balanced", + "gb300|small|max-throughput", + "gb300|big|max-throughput", + "h200|small|cp", + "h200|small|pd-disagg", + "h200|big|low-latency", + "h200|big|balanced", + "h200|big|max-throughput", + "h200|big|pd-disagg", + "gb300|small|cp", + "gb300|big|cp", + "gb300|small|pd-disagg", + "gb300|big|pd-disagg", + "gb200|small|low-latency", + "gb200|small|balanced", + "gb200|small|max-throughput", + "gb200|small|cp", + "gb200|big|low-latency", + "gb200|big|balanced", + "gb200|big|max-throughput", + "h200-fp4|small|low-latency", + "h200-fp4|small|balanced", + "h200-fp4|small|max-throughput", + "h200-fp4|big|low-latency", + "h200-fp4|big|balanced", + "h200-fp4|big|max-throughput", + ]); + // Recipes whose command is intentionally not yet provided (e.g. blocked by an + // upstream limitation). Showing a minimal placeholder is friendlier to users + // than emitting a commented-out invalid command. + const TBD_RECIPES = new Set([ + "h200|big|cp", + "gb200|small|pd-disagg", + "gb200|big|pd-disagg", + ]); + const TBD_PLACEHOLDER = "# to be provided"; + const BEING_VERIFIED_NOTE = + "# NOTE: this recipe is being verified on the latest checkpoint"; + + // Prefix every line with "# " so the whole command becomes a shell no-op. + const commentOutCommand = (cmd) => + cmd + .split("\n") + .map((line) => (line.length ? `# ${line}` : "#")) + .join("\n"); + + // DeepEP large SMS flag (allinone _DEEPEP_LARGE_SMS_FLAG). 
+ const DEEPEP_LARGE_SMS_FLAG = + ` --deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'`; + + // Multi-node flags (renders with / placeholders; + // allinone template uses {node0_ip} / {node_rank} that verify_commands.py formats + // with the same placeholder strings so dynamic-diff stays exact). + const multiNodeFlags = (nnodes) => [ + ` --nnodes ${nnodes}`, + ` --node-rank `, + ` --dist-init-addr :20000`, + ]; + + const prependMultiNodeNote = (cmd, nnodes) => + `# Multi-node (${nnodes} nodes). Run the same command on every node with:\n` + + `# = 0 on the head node, 1..${nnodes - 1} on the others\n` + + `# = IP of the head node (reachable from all others)\n` + + `${cmd}`; + // === SHARED END === + + const generateCommand = () => { + const { hardware: rawHardware, modelSize, recipe, reasoningParser, toolcall } = values; + // B300 usage is identical to B200 — alias so we don't duplicate every spec entry. + const hardware = rawHardware === "b300" ? "b200" : rawHardware; + const specKey = `${hardware}|${modelSize}`; + const spec = HW_SIZE_SPEC[specKey]; + const { slug, tp, multinode, nnodes } = spec; + const isBig = modelSize === "big"; + + if (recipe === "pd-disagg") { + return buildPDDisaggCommand(hardware, modelSize); + } + + // H200 (FP4) Marlin path: dedicated branch — Hopper runs the FP4-mixed + // Instruct repos through the Marlin MoE runner, so it doesn't share envs + // or flags with either the FP8 H200 path or the Blackwell paths. 
+ // Flash: TP=4, single node Pro: TP=8, single node + // low-latency: MTP 3 / 1 / 4 (steps / topk / draft-tokens) + // balanced: MTP 1 / 1 / 2 + // max-throughput: MTP disabled + if (hardware === "h200-fp4") { + const verifyKey = `${hardware}|${modelSize}|${recipe}`; + if (TBD_RECIPES.has(verifyKey)) return TBD_PLACEHOLDER; + + const fp4Flags = [ + " --trust-remote-code", + ` --model-path ${slug}`, + ` --tp ${tp}`, + " --moe-runner-backend marlin", + ]; + if (recipe === "low-latency") { + fp4Flags.push(" --speculative-algo EAGLE"); + fp4Flags.push(" --speculative-num-steps 3"); + fp4Flags.push(" --speculative-eagle-topk 1"); + fp4Flags.push(" --speculative-num-draft-tokens 4"); + } else if (recipe === "balanced") { + fp4Flags.push(" --speculative-algo EAGLE"); + fp4Flags.push(" --speculative-num-steps 1"); + fp4Flags.push(" --speculative-eagle-topk 1"); + fp4Flags.push(" --speculative-num-draft-tokens 2"); + } + if (isBig) fp4Flags.push(" --mem-fraction-static 0.88"); + if (toolcall === "enabled") fp4Flags.push(" --tool-call-parser deepseekv4"); + if (reasoningParser === "enabled") fp4Flags.push(" --reasoning-parser deepseek-v4"); + fp4Flags.push(" --host 0.0.0.0"); + fp4Flags.push(" --port 30000"); + + const fp4Cmd = `sglang serve \\\n${fp4Flags.join(" \\\n")}`; + return VERIFIED_RECIPES.has(verifyKey) + ? fp4Cmd + : `${BEING_VERIFIED_NOTE}\n${commentOutCommand(fp4Cmd)}`; + } + + // ---- env ---- + // _LAUNCH_HEAD always prepends these: + // Per-hardware env (whitelist #1: NVSHMEM removed for B200). + const HW_ENV = { + h200: ["SGLANG_DSV4_FP4_EXPERTS=0"], // allinone _ENV_H200 + b200: [], // _ENV_B200 minus NVSHMEM + gb300: [], // _ENV_GB300 + // GB200 multinode needs NCCL MNNVL for cross-node NVLink communication. + gb200: multinode ? ["NCCL_MNNVL_ENABLE=1", "NCCL_CUMEM_ENABLE=1"] : [], + }[hardware]; + + // Recipe-specific env (matches allinone exactly, taking size into account). 
+ const recipeEnv = []; + if (recipe === "low-latency") { + // Big low-latency dispatch-token cap. + if (hardware === "h200" && isBig) { + recipeEnv.push("SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128"); + } else if (hardware === "gb200" && isBig) { + recipeEnv.push("SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256"); + } + // B200/B300 Pro accuracy-verified env vars. + if (isBig && hardware === "b200") { + recipeEnv.push( + "SGLANG_JIT_DEEPGEMM_PRECOMPILE=0", + "SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT=1", + "SGLANG_OPT_USE_JIT_NORM=1", + "SGLANG_OPT_USE_JIT_INDEXER_METADATA=1", + "SGLANG_OPT_USE_TOPK_V2=1", + "SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1", + ); + } + } else if (recipe === "balanced") { + if (hardware === "h200") { + recipeEnv.push(isBig + ? "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128" + : "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256"); + } else if (isBig && hardware === "b200") { + // B200/B300 Pro accuracy-verified env vars. + recipeEnv.push( + "SGLANG_JIT_DEEPGEMM_PRECOMPILE=0", + "SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT=1", + "SGLANG_OPT_USE_JIT_NORM=1", + "SGLANG_OPT_USE_JIT_INDEXER_METADATA=1", + "SGLANG_OPT_USE_TOPK_V2=1", + "SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1", + "SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1", + "SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=0", + "SGLANG_OPT_FIX_HASH_MEGA_MOE=0", + "SGLANG_OPT_USE_FAST_MASK_EP=1", + "SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1", + "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096", + "SGLANG_OPT_FIX_NEXTN_MEGA_MOE=1", + "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0", + ); + } else { + recipeEnv.push(isBig + ? "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256" + : "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024"); + } + } else if (recipe === "max-throughput") { + if (hardware === "h200") { + recipeEnv.push(isBig + ? 
"SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128" + : "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256"); + } else if (isBig && hardware === "b200") { + // B200/B300 Pro accuracy-verified env vars. + recipeEnv.push( + "SGLANG_JIT_DEEPGEMM_PRECOMPILE=0", + "SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT=1", + "SGLANG_OPT_USE_JIT_NORM=1", + "SGLANG_OPT_USE_JIT_INDEXER_METADATA=1", + "SGLANG_OPT_USE_TOPK_V2=1", + "SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1", + "SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1", + "SGLANG_OPT_USE_FAST_MASK_EP=1", + "SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1", + "SGLANG_OPT_FIX_NEXTN_MEGA_MOE=1", + "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0", + "NVSHMEM_DISABLE_IB=1", + "SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW=1", + "SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1", + "SGLANG_OPT_FIX_HASH_MEGA_MOE=1", + "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8320", + ); + } else { + recipeEnv.push(isBig + ? "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256" + : "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024"); + } + } else if (recipe === "cp") { + recipeEnv.push("SGLANG_OPT_USE_JIT_INDEXER_METADATA=1"); + if (hardware === "h200") { + recipeEnv.push("SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024"); + } else { + // Blackwell cp: small=1024, big=256 (allinone ternary). + recipeEnv.push(isBig + ? "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256" + : "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024"); + } + } + // SGLANG_ENABLE_SPEC_V2=1 was in allinone's _ENV_MTP for low-latency / balanced + // recipes, but V4 auto-enables spec-v2 when MTP is detected — human confirmed + // the env is redundant on the public cookbook path. Kept as a no-op reference + // in allinone for legacy runs. 
+ + // ---- flags ---- + const flags = []; + flags.push(" --trust-remote-code"); // _LAUNCH_HEAD + flags.push(` --model-path ${slug}`); + + if (recipe === "low-latency") { + // allinone: + // H200 small: pure TP + MTP_314 + // H200 big: DP-attn + DeepEP + MTP_314 + cg=32 max-run=64 + multi-node + mem-frac 0.82 + // GB200 big: pure TP + multinode + flashinfer_mxfp4 + MTP_314 + mem-frac 0.82 (no DP-attn/DeepEP) + // Blackwell: TP + flashinfer_mxfp4 + MTP_314 + chunked-prefill-size 4096 + autotune-fix + // Big Blackwell additionally: mem-frac 0.82 + flags.push(` --tp ${tp}`); + if (hardware === "h200" && isBig) { + flags.push(` --dp ${tp}`); + flags.push(" --enable-dp-attention"); + } + if (multinode) flags.push(...multiNodeFlags(nnodes)); + if (hardware === "h200" && isBig) { + flags.push(" --moe-a2a-backend deepep"); + } + if (hardware !== "h200") { + flags.push(" --moe-runner-backend flashinfer_mxfp4"); + } + if (hardware === "h200" && isBig) { + flags.push(" --cuda-graph-max-bs 8"); + flags.push(" --max-running-requests 32"); + } + // MTP 3/4 + flags.push(" --speculative-algo EAGLE"); + flags.push(" --speculative-num-steps 3"); + flags.push(" --speculative-eagle-topk 1"); + flags.push(" --speculative-num-draft-tokens 4"); + if (hardware !== "h200") { + // B200/B300 Pro accuracy-verified: chunked-prefill-size 8192 + flags.push(isBig ? " --chunked-prefill-size 8192" : " --chunked-prefill-size 4096"); + flags.push(" --disable-flashinfer-autotune"); + flags.push(" --swa-full-tokens-ratio 0.1"); + } + // B200/B300 Pro accuracy-verified: mem-fraction-static 0.90 + if (isBig && hardware !== "h200") { + flags.push(" --mem-fraction-static 0.90"); + } else if (isBig) { + flags.push(" --mem-fraction-static 0.88"); + } + } else if (recipe === "balanced") { + // allinone balanced: TP + DP + DP-attn + DeepEP + MTP_112. 
+ // H200 small: cg=128 max-run=128 | H200 big: cg=128 max-run=128 (same) + // B200 small: no cg/max-run | B200 big: cg=64 max-run=128 + // GB300 small: no cg/max-run | GB300 big: cg=128 max-run=256 + flags.push(` --tp ${tp}`); + flags.push(` --dp ${tp}`); + flags.push(" --enable-dp-attention"); + if (multinode) flags.push(...multiNodeFlags(nnodes)); + // B200/B300 Pro accuracy-verified: flashinfer_mxfp4 (not deepep) for balanced. + if (isBig && hardware === "b200") { + flags.push(" --moe-runner-backend flashinfer_mxfp4"); + flags.push(" --disable-flashinfer-autotune"); + flags.push(" --chunked-prefill-size 32768"); + flags.push(" --swa-full-tokens-ratio 0.1"); + } else { + flags.push(" --moe-a2a-backend deepep"); + } + flags.push(" --speculative-algo EAGLE"); + flags.push(" --speculative-num-steps 1"); + flags.push(" --speculative-eagle-topk 1"); + flags.push(" --speculative-num-draft-tokens 2"); + if (hardware === "h200" && isBig) { + flags.push(" --mem-fraction-static 0.88"); + } else if (isBig && hardware === "gb300") { + flags.push(" --mem-fraction-static 0.9"); + } else if (isBig && hardware === "gb200") { + flags.push(" --mem-fraction-static 0.78"); + } else if (isBig) { + flags.push(" --mem-fraction-static 0.92"); + } + if (hardware === "h200" && isBig) { + flags.push(" --cuda-graph-max-bs 8"); + flags.push(" --max-running-requests 32"); + } else if (hardware === "h200") { + flags.push(" --cuda-graph-max-bs 128"); + flags.push(" --max-running-requests 128"); + } else if (isBig && hardware === "b200") { + flags.push(" --cuda-graph-max-bs 256"); + } else if (isBig && hardware === "gb300") { + flags.push(" --cuda-graph-max-bs 128"); + flags.push(" --max-running-requests 256"); + } else if (isBig && hardware === "gb200") { + flags.push(" --cuda-graph-max-bs 64"); + flags.push(" --max-running-requests 128"); + } + // allinone H200 gates DEEPEP_LARGE_SMS_FLAG on !multinode — only H200 big + // is multi-node; all Blackwell cells get the flag unconditionally. 
+ if (!multinode) flags.push(DEEPEP_LARGE_SMS_FLAG); + } else if (recipe === "max-throughput") { + // allinone max-throughput: TP + DP + DP-attn + DeepEP (NO MTP). + // H200 small: cg=128 max-run=256 | H200 big: cg=128 max-run=256 (same) + // B200 small: no cg/max-run | B200 big: cg=64 max-run=256 + // GB300 small: no cg/max-run | GB300 big: cg=128 max-run=256 + flags.push(` --tp ${tp}`); + flags.push(` --dp ${tp}`); + flags.push(" --enable-dp-attention"); + if (multinode) flags.push(...multiNodeFlags(nnodes)); + flags.push(" --moe-a2a-backend deepep"); + if (hardware === "h200" && isBig) { + flags.push(" --mem-fraction-static 0.88"); + } else if (isBig && hardware === "gb300") { + flags.push(" --mem-fraction-static 0.9"); + } else if (isBig && hardware === "gb200") { + flags.push(" --mem-fraction-static 0.78"); + } else if (isBig) { + flags.push(" --mem-fraction-static 0.835"); + } + if (hardware === "h200") { + flags.push(" --cuda-graph-max-bs 128"); + flags.push(" --max-running-requests 256"); + } else if (isBig && hardware === "b200") { + // B200/B300 Pro accuracy-verified max-throughput config. + flags.push(" --cuda-graph-max-bs 544"); + flags.push(" --swa-full-tokens-ratio 0.075"); + flags.push(" --chunked-prefill-size 65536"); + flags.push(" --tokenizer-worker-num 8"); + flags.push(" --enable-prefill-delayer"); + } else if (isBig && hardware === "gb300") { + flags.push(" --cuda-graph-max-bs 128"); + flags.push(" --max-running-requests 256"); + } else if (isBig && hardware === "gb200") { + flags.push(" --cuda-graph-max-bs 64"); + flags.push(" --max-running-requests 256"); + } + if (!multinode) flags.push(DEEPEP_LARGE_SMS_FLAG); + } else if (recipe === "cp") { + // allinone cp: TP (NO --dp) + DeepEP + _CP_FLAGS (mem-frac 0.78, max-run 1024). + // Blackwell big additionally: mem-frac 0.70 (overrides), cg=256, max-run=256. + // No flashinfer_mxfp4 even on Blackwell (allinone omits). 
+ flags.push(` --tp ${tp}`); + if (multinode) flags.push(...multiNodeFlags(nnodes)); + flags.push(" --moe-a2a-backend deepep"); + flags.push(" --enable-nsa-prefill-context-parallel"); + flags.push(" --nsa-prefill-cp-mode round-robin-split"); + flags.push(" --chunked-prefill-size 16384"); + // GB300 big CP needs higher mem-fraction-static: Pro 1.6T weights at + // tp=4 are ~224 GB/card on a 273 GB GB300, so 0.78 leaves a negative + // KV pool (init_memory_pool fails: "Not enough memory ... weights + // 224 GB > static target 213 GB"). 0.88 gives weights 224 + KV 16 + + // runtime 33. Other Blackwell tp=8 paths fit fine at 0.78. + // Verified on 2026-04-25 (journal 2026-04-25-001 Cell B, Δ4). + if (hardware === "gb300" && isBig) { + flags.push(" --mem-fraction-static 0.88"); + } else { + flags.push(" --mem-fraction-static 0.78"); + } + // allinone _CP_FLAGS has --max-running-requests 1024; Blackwell big cp overrides + // to 256. Human directed (2026-04-24) to emit only one value — keep 256 override + // for big Blackwell, else the default 1024. + if (isBig && hardware !== "h200") { + flags.push(" --cuda-graph-max-bs 256"); + flags.push(" --max-running-requests 256"); + } else { + flags.push(" --max-running-requests 1024"); + } + // H200 CP gates DEEPEP_LARGE_SMS_FLAG on !multinode; Blackwell always gets it. + if (!multinode) flags.push(DEEPEP_LARGE_SMS_FLAG); + } + + // Optional parsers (cookbook UI extension; not in allinone — opt-in toggles only). + if (toolcall === "enabled") flags.push(" --tool-call-parser deepseekv4"); + if (reasoningParser === "enabled") flags.push(" --reasoning-parser deepseek-v4"); + + flags.push(" --host 0.0.0.0"); + flags.push(" --port 30000"); + + // Assemble: [HW env] [recipe env] \ sglang serve \ flags... + const envAll = [...HW_ENV, ...recipeEnv]; + const envBlock = envAll.length ? 
envAll.join(" \\\n") + " \\\n" : ""; + // B200/B300 Pro recipes carry many accuracy-verified env vars that will be + // consolidated; prepend a shell comment so users know these are temporary. + const simplifyNote = (isBig && hardware === "b200" && recipeEnv.length > 2) + ? "# flags will be simplified\n" + : ""; + const base = `${simplifyNote}${envBlock}sglang serve \\\n${flags.join(" \\\n")}`; + // GB200 multinode may need machine-specific NVSHMEM / Gloo env vars; + // emit them as commented hints above the env block so users know to check. + let cmd = base; + if (hardware === "gb200" && multinode) { + cmd = + `# The following env vars may be needed depending on your cluster:\n` + + `# GLOO_SOCKET_IFNAME=\n` + + `# NVSHMEM_ENABLE_NIC_PE_MAPPING=1\n` + + `# NVSHMEM_HCA_LIST=\n` + + cmd; + } + const withMultinode = multinode ? prependMultiNodeNote(cmd, nnodes) : cmd; + + // H200 Pro low-latency: show BOTH a single-node (TP=8 marlin) variant + // and the existing multi-node (TP=16 DP-attn + DeepEP) variant. 
+ if (hardware === "h200" && isBig && recipe === "low-latency") { + const singleFlags = [ + " --trust-remote-code", + " --model-path deepseek-ai/DeepSeek-V4-Pro", + " --tp 8", + " --moe-runner-backend marlin", + " --speculative-algo EAGLE", + " --speculative-num-steps 3", + " --speculative-eagle-topk 1", + " --speculative-num-draft-tokens 4", + " --chunked-prefill-size 4096", + " --disable-flashinfer-autotune", + " --mem-fraction-static 0.88", + ]; + if (toolcall === "enabled") singleFlags.push(" --tool-call-parser deepseekv4"); + if (reasoningParser === "enabled") singleFlags.push(" --reasoning-parser deepseek-v4"); + singleFlags.push(" --host 0.0.0.0"); + singleFlags.push(" --port 30000"); + const singleNodeCmd = `sglang serve \\\n${singleFlags.join(" \\\n")}`; + const combined = + `# --- Single-Node (TP=8, Marlin) ---\n${singleNodeCmd}\n\n` + + `# --- Multi-Node (2 nodes, TP=16, DP-Attn + DeepEP) ---\n${withMultinode}`; + const verifyKey = `${hardware}|${modelSize}|${recipe}`; + if (TBD_RECIPES.has(verifyKey)) return TBD_PLACEHOLDER; + return VERIFIED_RECIPES.has(verifyKey) + ? combined + : `${BEING_VERIFIED_NOTE}\n${commentOutCommand(combined)}`; + } + + const verifyKey = `${hardware}|${modelSize}|${recipe}`; + if (TBD_RECIPES.has(verifyKey)) return TBD_PLACEHOLDER; + return VERIFIED_RECIPES.has(verifyKey) + ? withMultinode + : `${BEING_VERIFIED_NOTE}\n${commentOutCommand(withMultinode)}`; + }; + + // ============================================================================ + // buildPDDisaggCommand — mirror of allinone pd-p / pd-d for small AND big. + // + // _PD_SPEC[(hw, size)] → tp (and whether multinode). + // H200-fp8 small: tp=4 single-node, ib=mlx5_0 + // H200-fp8 big: tp=16 2-node, ib=mlx5_0 + // B200 small: tp=2 single-node, ib=mlx5_7 + // B200 big: tp=8 single-node, ib=mlx5_7 + // GB300 small/big: tp=4 single-node, ib="" (uses MNNVL, no IB device) + // + // deepep flag only on Blackwell PD; H200 PD does NOT use deepep. 
+ // cap_env (SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024) only on B200 decode. + // SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True only on GB300. + // --dist-init-addr for disagg wiring only on non-GB300. + // --max-running-requests 256 only on decode (PD decode can't retract). + // No flashinfer_mxfp4 / autotune-fix / MTP / mem-fraction-static on PD (allinone omits). + // ============================================================================ + const buildPDDisaggCommand = (rawHardware, modelSize) => { + // B300 usage is identical to B200 — alias so we don't duplicate every spec entry. + const hardware = rawHardware === "b300" ? "b200" : rawHardware; + const specKey = `${hardware}|${modelSize}`; + const { tp: pdTp, multinode, nnodes } = PD_TP_SPEC[specKey]; + const slug = HW_SIZE_SPEC[specKey].slug; + const ibDevice = { h200: "mlx5_0", b200: "mlx5_7", gb300: "", gb200: "" }[hardware]; + const isGB300 = hardware === "gb300"; + const isBlackwell = hardware === "b200" || hardware === "gb200" || isGB300; + + const HW_ENV = { + h200: ["SGLANG_DSV4_FP4_EXPERTS=0"], + b200: [], + gb300: [], + gb200: [], + }[hardware]; + // Whitelist #5: only SGLANG_MOONCAKE_CUSTOM_MEM_POOL kept; MC_FORCE_MNNVL / + // NCCL_MNNVL_ENABLE / NCCL_CUMEM_ENABLE may also be needed depending on the + // GB300 cluster's NVLink/IB topology — see §3.2 "Configuration Tips" note. + const MNNVL_ENV = isGB300 ? ["SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True"] : []; + + const buildRole = (mode, port, distPort) => { + const roleEnv = []; + if (hardware === "b200" && mode === "decode") { + roleEnv.push("SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024"); + } + // GB300 PD needs DeepEP dispatch buffer cap on BOTH prefill + decode; + // without it, the first forward fails `deep_ep.cpp:1233` assertion + // `x.size(0) <= num_max_dispatch_tokens_per_rank`. 
The cap also + // co-moves with --max-running-requests below: 256 for big (which + // uses --max-running-requests 128, per-rank=32 ≤ 256), 1024 for + // small (--max-running-requests 256, per-rank=64 ≤ 1024). + // Verified on 2026-04-25 (journal 2026-04-25-001 §C/§D). + if (isGB300) { + roleEnv.push(modelSize === "big" + ? "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256" + : "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024"); + } + // H200 Pro PD: tp=16 multinode + DeepEP needs the dispatch buffer cap on + // BOTH prefill + decode (matches production playground LWS for the same + // hw/model combo). Verified on 2026-04-25 (journal 2026-04-25-014). + if (hardware === "h200" && modelSize === "big") { + roleEnv.push("SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128"); + } + const envAll = [...HW_ENV, ...roleEnv, ...MNNVL_ENV]; + const envBlock = envAll.length ? envAll.join(" \\\n") + " \\\n" : ""; + + const flags = []; + flags.push(" --trust-remote-code"); + flags.push(` --model-path ${slug}`); + flags.push(` --tp ${pdTp}`); + flags.push(` --dp ${pdTp}`); + flags.push(" --enable-dp-attention"); + if (multinode) flags.push(...multiNodeFlags(nnodes)); + // H200 Pro PD also needs deepep: at tp=16 the FP8 block_n=128 doesn't + // divide moe intermediate_size_per_partition (3072 / 16 = 192) so MoE + // experts must be kept on a single rank rather than TP-sharded. Verified + // on 2026-04-25 (journal 2026-04-25-014, candidate cookbook Bug L). + if (isBlackwell || (hardware === "h200" && modelSize === "big")) { + flags.push(" --moe-a2a-backend deepep"); + } + flags.push(` --disaggregation-mode ${mode}`); + flags.push(" --disaggregation-transfer-backend mooncake"); + if (ibDevice) flags.push(` --disaggregation-ib-device ${ibDevice}`); + // Same-host PD bootstrap addr; for multinode PD (h200 big tp=16 across 2 + // nodes) skip this — argparse would override the multinode dist-init-addr + // already emitted by multiNodeFlags above. 
Verified 2026-04-25 (journal + // 2026-04-25-014). sglang falls back to its own bootstrap port (default + // 8998) which works for cross-node mooncake handshake. + if (!isGB300 && !multinode) flags.push(` --dist-init-addr 127.0.0.1:${distPort}`); + // H200 Pro PD memory-budget: cookbook defaults give available_gpu_memory + // ~17.93 GB after weights but reserve target = (1 - mem_fraction_static) + // × 138 GB = 87 GB → "Not enough memory" at memory profile. mem-frac 0.90 + // and cg-max-bs 128 verified on 2026-04-25 (journal 2026-04-25-014). 128 + // matches gb300|big|pd decode and gives larger decode batching headroom; + // CG capture takes ~1 hr (one-time, vs ~5 min for cg=64) but runtime + // throughput is better. + if (hardware === "h200" && modelSize === "big") { + flags.push(" --cuda-graph-max-bs 128"); + flags.push(" --mem-fraction-static 0.9"); + } + if (mode === "decode") { + // GB300 big PD decode is the most memory-pressured PD role: Pro 1.6T + // weights at tp=4 take ~224 GB/card on a 273 GB GB300; runtime needs + // headroom for DeepEP buffer + mooncake KV recv + CG private pool. + // Cookbook defaults (mem-frac 0.874, cg_max_bs 512, max-running 256) + // OOM during CG capture. mem-frac sweep at 0.83 / 0.87 / 0.89 / 0.91 + // all pass static validation; 0.9 picked as the default — leaves + // ~14 GB / GPU post-CG headroom for mooncake transfer + activation + // peaks while giving ~1M-token KV pool. + if (isGB300 && modelSize === "big") { + flags.push(" --max-running-requests 128"); + flags.push(" --mem-fraction-static 0.9"); + flags.push(" --cuda-graph-max-bs 128"); + } else { + flags.push(" --max-running-requests 256"); + } + } + flags.push(" --host 0.0.0.0"); + flags.push(` --port ${port}`); + + return `${envBlock}sglang serve \\\n${flags.join(" \\\n")}`; + }; + + const prefillHeader = multinode + ? 
`# --- Prefill role (port 30000) — multi-node, run on each of ${nnodes} nodes ---` + : "# --- Prefill role (port 30000) ---"; + const decodeHeader = multinode + ? `# --- Decode role (port 30001) — multi-node, run on each of ${nnodes} nodes ---` + : "# --- Decode role (port 30001) ---"; + + const prefill = `${prefillHeader}\n${buildRole("prefill", 30000, 30335)}`; + const decode = `${decodeHeader}\n${buildRole("decode", 30001, 30435)}`; + // Router addresses prefill / decode by their reachable hostnames / IPs. + // Substitute / with the actual hosts before + // running. On a same-host deployment, both can be 127.0.0.1. + const router = `# --- Router (port 8000) --- +python3 -m sglang_router.launch_router \\ + --pd-disaggregation \\ + --prefill http://:30000 \\ + --decode http://:30001 \\ + --host 0.0.0.0 --port 8000 \\ + --disable-circuit-breaker \\ + --health-check-interval-secs 999999`; + + const full = `${prefill}\n\n${decode}\n\n${router}`; + const verifyKey = `${hardware}|${modelSize}|pd-disagg`; + if (TBD_RECIPES.has(verifyKey)) return TBD_PLACEHOLDER; + return VERIFIED_RECIPES.has(verifyKey) + ? full + : `${BEING_VERIFIED_NOTE}\n${commentOutCommand(full)}`; + }; + + // ---- styles ---- + const containerStyle = { maxWidth: "900px", margin: "0 auto", display: "flex", flexDirection: "column", gap: "4px" }; + const cardStyle = { + padding: "8px 12px", + border: `1px solid ${isDark ? "#374151" : "#e5e7eb"}`, + borderLeft: `3px solid ${isDark ? "#E85D4D" : "#D45D44"}`, + borderRadius: "4px", + display: "flex", + alignItems: "center", + gap: "12px", + background: isDark ? "#1f2937" : "#fff", + }; + const titleStyle = { fontSize: "13px", fontWeight: "600", minWidth: "140px", flexShrink: 0, color: isDark ? "#e5e7eb" : "inherit" }; + const itemsStyle = { display: "flex", rowGap: "2px", columnGap: "6px", flexWrap: "wrap", alignItems: "center", flex: 1 }; + const labelBaseStyle = { + padding: "4px 10px", + border: `1px solid ${isDark ? 
"#9ca3af" : "#d1d5db"}`, + borderRadius: "3px", + cursor: "pointer", + display: "inline-flex", + flexDirection: "column", + alignItems: "center", + justifyContent: "center", + fontWeight: "500", + fontSize: "13px", + transition: "all 0.2s", + userSelect: "none", + minWidth: "45px", + textAlign: "center", + flex: 1, + background: isDark ? "#374151" : "#fff", + color: isDark ? "#e5e7eb" : "inherit", + }; + const checkedStyle = { background: "#D45D44", color: "white", borderColor: "#D45D44" }; + const disabledStyle = { cursor: "not-allowed", opacity: 0.4 }; + const subtitleStyle = { display: "block", fontSize: "9px", marginTop: "1px", lineHeight: "1.1", opacity: 0.7 }; + const commandDisplayStyle = { + flex: 1, + padding: "12px 16px", + background: isDark ? "#111827" : "#f5f5f5", + borderRadius: "6px", + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: "12px", + lineHeight: "1.5", + color: isDark ? "#e5e7eb" : "#374151", + whiteSpace: "pre-wrap", + overflowX: "auto", + margin: 0, + border: `1px solid ${isDark ? "#374151" : "#e5e7eb"}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ Enabling MegaMoE +

+ MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput on MoE layers. + It is currently verified on B200/B300 Pro (balanced & max-throughput recipes above). + The full hardware/recipe matrix has not been tested yet, but it is expected to work on other hardware (GB200, GB300) and on the Flash model. + To enable it, add the flag and env vars: +

+
{
+`# Add this flag to the sglang serve command:
+--moe-a2a-backend deepep
+
+# And set these env vars:
+SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
+SGLANG_OPT_FIX_HASH_MEGA_MOE=1
+SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1
+SGLANG_OPT_FIX_NEXTN_MEGA_MOE=1
+SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8320
+SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0`
+        }
+

+ Adjust SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK based on your chunked prefill size (e.g. 4096 for balanced, 8320 for max-throughput).
+ With SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0, any DeepEP dispatch buffer constraints mentioned for your config do not apply.
+ This flag and these env vars are expected to be simplified in a future release. +

+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/devstral-2-deployment.jsx b/docs_new/src/snippets/autoregressive/devstral-2-deployment.jsx new file mode 100644 index 000000000000..a3a66a415f1a --- /dev/null +++ b/docs_new/src/snippets/autoregressive/devstral-2-deployment.jsx @@ -0,0 +1,182 @@ +export const Devstral2Deployment = () => { + // Config options based on Devstral2ConfigGenerator + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + model: { + name: 'model', + title: 'Model', + items: [ + { id: 'small', label: 'Devstral Small 2 (24B)', default: true }, + { id: 'large', label: 'Devstral 2 (123B)', default: false } + ] + }, + weights: { + name: 'weights', + title: 'Weights / Precision', + items: [ + { id: 'fp8', label: 'FP8', default: true } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Model configurations + const modelConfigs = { + small: { + modelId: 'mistralai/Devstral-Small-2-24B-Instruct-2512', + tpByHardware: { h100: 1, h200: 1, b200: 1, mi300x: 1, mi325x: 1, mi355x: 1 }, + allowedWeights: ['fp8'] + }, + large: { + modelId: 'mistralai/Devstral-2-123B-Instruct-2512', + tpByHardware: { h100: 4, h200: 2, b200: 2, mi300x: 2, mi325x: 2, mi355x: 2 }, + allowedWeights: ['fp8'] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => 
item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, model, weights, toolcall } = values; + + const modelCfg = modelConfigs[model]; + if (!modelCfg) return `# Error: Unknown model selection: ${model}`; + + if (!modelCfg.allowedWeights.includes(weights)) { + const allowed = modelCfg.allowedWeights.map(w => w.toUpperCase()).join(', '); + return `# Error: ${modelCfg.modelId} only supports: ${allowed}\n# Please change "Weights / Precision" to a supported value.`; + } + + const tp = modelCfg.tpByHardware[hardware]; + if (!tp) return `# Error: Unknown hardware platform: ${hardware}`; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelCfg.modelId}`; + + if (tp > 1) { + 
cmd += ` \\\n --tp ${tp}`; + } + + // Add tool-call-parser if enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser mistral`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/ernie-45-deployment.jsx b/docs_new/src/snippets/autoregressive/ernie-45-deployment.jsx new file mode 100644 index 000000000000..9adf1adf4815 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/ernie-45-deployment.jsx @@ -0,0 +1,345 @@ +export const Ernie45Deployment = () => { + const options = { + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '21b', label: '21B', subtitle: 'A3B', default: true }, + { id: '300b', label: '300B', subtitle: 'A47B', default: false } + ] + }, + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false, disabledWhen: (values) => values.modelsize === '21b' }, + { id: 'ep', label: 'EP', subtitle: 'Expert Parallel', default: false, disabledWhen: (values) => values.modelsize === '21b' } + ] + } + }; + + const generateCommand = (values) => { + const { modelsize, hardware, strategy } = values; + + const strategyArray = Array.isArray(strategy) ? strategy : []; + + let modelPath; + if (modelsize === '21b') { + modelPath = 'baidu/ERNIE-4.5-21B-A3B-PT'; + } else if (modelsize === '300b') { + modelPath = 'baidu/ERNIE-4.5-300B-A47B-PT'; + } else { + modelPath = 'baidu/ERNIE-4.5-21B-A3B-PT'; + } + + let cmd = 'python3 -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelPath}`; + + const tpValue = modelsize === '300b' ? 8 : 1; + const dpValue = modelsize === '300b' ? 8 : null; + const epValue = modelsize === '300b' ? 
8 : null; + + if (strategyArray.includes('tp')) { + cmd += ` \\\n --tp ${tpValue}`; + } + + if (strategyArray.includes('dp') && modelsize === '300b') { + cmd += ` \\\n --dp ${dpValue} \\\n --enable-dp-attention`; + } + + if (strategyArray.includes('ep') && modelsize === '300b') { + cmd += ` \\\n --ep ${epValue}`; + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/gemma4-deployment.jsx b/docs_new/src/snippets/autoregressive/gemma4-deployment.jsx new file mode 100644 index 000000000000..47a907584b64 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/gemma4-deployment.jsx @@ -0,0 +1,398 @@ +export const Gemma4Deployment = () => { + const options = { + modelSize: { + name: 'modelSize', + title: 'Model Variant', + items: [ + { id: 'e2b', label: 'E2B (~2B)', default: false }, + { id: 'e4b', label: 'E4B (~4B)', default: true }, + { id: '31b', label: '31B (Dense)', default: false }, + { id: '26b-a4b', label: '26B-A4B (MoE)', default: false }, + ] + }, + hardware: { + name: 'hardware', + title: 'Hardware Platform', + getDynamicItems: (values) => { + const size = values.modelSize; + const showMI300X = size === '31b' || size === '26b-a4b'; + return [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false, disabled: !showMI300X }, + ]; + } + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser gemma4' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? 
'--tool-call-parser gemma4' : null + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding (MTP)', + condition: (values) => !['mi300x'].includes(values.hardware), + items: [ + { id: 'disabled', label: 'Disabled', subtitle: 'Baseline', default: true }, + { id: 'enabled', label: 'Enabled', subtitle: 'Lower Latency', default: false } + ] + }, + }; + + const modelConfigs = { + h200: { + e2b: { tp: 1, mem: 0.85 }, + e4b: { tp: 1, mem: 0.85 }, + '31b': { tp: 2, mem: 0.85 }, + '26b-a4b': { tp: 1, mem: 0.85 }, + }, + b200: { + e2b: { tp: 1, mem: 0.9 }, + e4b: { tp: 1, mem: 0.9 }, + '31b': { tp: 1, mem: 0.9 }, + '26b-a4b': { tp: 1, mem: 0.9 }, + }, + mi300x: { + '31b': { tp: 1, mem: 0.80 }, + '26b-a4b': { tp: 1, mem: 0.80 }, + }, + }; + + const generateCommand = (values) => { + const { hardware, modelSize } = values; + + const hwConfig = modelConfigs[hardware]?.[modelSize]; + if (!hwConfig) return `# Error: Unknown hardware/model combination`; + + let { tp, mem } = hwConfig; + + const modelNames = { + 'e2b': 'google/gemma-4-E2B-it', + 'e4b': 'google/gemma-4-E4B-it', + '31b': 'google/gemma-4-31B-it', + '26b-a4b': 'google/gemma-4-26B-A4B-it', + }; + + const mtpEnabled = values.speculative === 'enabled'; + if (mtpEnabled && modelSize === '26b-a4b' && hardware !== 'mi300x') { + tp = 2; + } + + let cmd = `sglang serve --model-path ${modelNames[modelSize]}`; + if (tp > 1) { + cmd += ` \\\n --tp ${tp}`; + } + + Object.entries(options).forEach(([key, option]) => { + if (key === 'modelSize' || key === 'hardware') return; + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) cmd += ` \\\n ${rule}`; + } + }); + + if (mtpEnabled) { + cmd += ` \\\n --speculative-algorithm NEXTN`; + cmd += ` \\\n --speculative-draft-model-path ${modelNames[modelSize]}-assistant`; + cmd += ` \\\n --speculative-num-steps 5`; + cmd += ` \\\n --speculative-num-draft-tokens 6`; + cmd += ` \\\n --speculative-eagle-topk 1`; + } + + cmd += ` \\\n 
--mem-fraction-static ${mem}`; + cmd += ` \\\n --host 0.0.0.0 --port 30000`; + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-45-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-45-deployment.jsx new file mode 100644 index 000000000000..90ee08ffe08d --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-45-deployment.jsx @@ -0,0 +1,197 @@ +export const GLM45Deployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false }, + { id: 'ep', label: 'EP', subtitle: 'Expert Parallel', default: false }, + { id: 'mtp', label: 'MTP', subtitle: 'Multi-token Prediction', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Thinking Capabilities', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization, strategy, thinking, toolcall } = values; + const strategyArray = Array.isArray(strategy) ? strategy : []; + + const modelSuffix = quantization === 'fp8' ? '-FP8' : ''; + const modelName = `zai-org/GLM-4.5${modelSuffix}`; + + // Determine TP value based on hardware and quantization + let tpValue = 4; // Default for MI300X/MI325X + if (hardware === 'mi355x') { + tpValue = quantization === 'fp8' ? 
2 : 4; // MI355X: TP=2 for FP8, TP=4 for BF16 + } + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + // TP is mandatory + cmd += ` \\\n --tp ${tpValue}`; + + // MI300X/MI325X BF16 requires extra flags + if ((hardware === 'mi300x' || hardware === 'mi325x') && quantization === 'bf16') { + cmd += ` \\\n --max-context-length 8192 \\\n --mem-fraction-static 0.9`; + } + + // Strategy-specific parameters + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp 8 \\\n --enable-dp-attention`; + } + if (strategyArray.includes('ep')) { + cmd += ` \\\n --ep 8`; + } + if (strategyArray.includes('mtp')) { + cmd = 'SGLANG_ENABLE_SPEC_V2=1 ' + cmd; + cmd += ` \\\n --speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4`; + } + + // Add tool call parser if enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser glm45`; + } + + // Add thinking parser if enabled + if (thinking === 'enabled') { + cmd += ` \\\n --reasoning-parser glm45`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-45v-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-45v-deployment.jsx new file mode 100644 index 000000000000..8fb67ad25b49 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-45v-deployment.jsx @@ -0,0 +1,175 @@ +export const GLM45VDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization, reasoning, toolcall } = values; + + // Model configuration (MI325X is assumed to use the same TP as MI300X/MI355X) + const config = { + baseName: 'GLM-4.5V', + b200: { tp: 4 }, + h100: { tp: 4 }, + h200: { tp: 4 }, + mi300x: { tp: 4 }, mi325x: { tp: 4 }, + mi355x: { tp: 4 } + }; + + const hwConfig = config[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantSuffix = quantization === 'fp8' ?
'-FP8' : ''; + const modelName = `zai-org/${config.baseName}${quantSuffix}`; + + // Check if AMD hardware + const isAMD = ['mi300x', 'mi325x', 'mi355x'].includes(hardware); + + let cmd = ''; + if (isAMD) { + cmd = 'SGLANG_USE_AITER=0 python3 -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ` \\\n --tp-size ${hwConfig.tp}`; + } else { + cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + } + + // Add reasoning parser + if (reasoning === 'enabled') { + cmd += ' \\\n --reasoning-parser glm45'; + } + + // Add tool call parser + if (toolcall === 'enabled') { + cmd += ' \\\n --tool-call-parser glm45'; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? 
'#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-46-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-46-deployment.jsx new file mode 100644 index 000000000000..2c55fdbbafa3 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-46-deployment.jsx @@ -0,0 +1,207 @@ +export const GLM46Deployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false }, + { id: 'ep', label: 'EP', subtitle: 'Expert Parallel', default: false }, + { id: 'mtp', label: 'MTP', subtitle: 'Multi-token Prediction', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Thinking Capabilities', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => 
item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization, strategy, thinking, toolcall } = values; + const strategyArray = Array.isArray(strategy) ? strategy : []; + + // Check for H100 + BF16 error + if (hardware === 'h100' && quantization === 'bf16') { + return '# Error: GLM-4.6 in BF16 precision requires more VRAM than 8*H100\n# Please use H200/B200 or select FP8 quantization'; + } + + const modelSuffix = quantization === 'fp8' ? 
'-FP8' : ''; + const modelName = `zai-org/GLM-4.6${modelSuffix}`; + + // Determine TP value based on hardware and quantization + let tpValue = 8; // Default for NVIDIA GPUs + if (hardware === 'mi300x' || hardware === 'mi325x') { + tpValue = 4; // MI300X/MI325X: TP=4 for both BF16 and FP8 + } else if (hardware === 'mi355x') { + tpValue = quantization === 'fp8' ? 2 : 4; // MI355X: TP=2 for FP8, TP=4 for BF16 + } + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + // TP is mandatory + cmd += ` \\\n --tp ${tpValue}`; + + // MI300X/MI325X BF16 requires extra flags + if ((hardware === 'mi300x' || hardware === 'mi325x') && quantization === 'bf16') { + cmd += ` \\\n --max-context-length 8192 \\\n --mem-fraction-static 0.9`; + } + + // Strategy-specific parameters + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp 8 \\\n --enable-dp-attention`; + } + if (strategyArray.includes('ep')) { + cmd += ` \\\n --ep 8`; + } + if (strategyArray.includes('mtp')) { + cmd = 'SGLANG_ENABLE_SPEC_V2=1 ' + cmd; + cmd += ` \\\n --speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4`; + } + + // Add tool call parser if enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser glm45`; + } + + // Add thinking parser if enabled + if (thinking === 'enabled') { + cmd += ` \\\n --reasoning-parser glm45`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-46v-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-46v-deployment.jsx new file mode 100644 index 000000000000..237fc307ec3e --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-46v-deployment.jsx @@ -0,0 +1,196 @@ +export const GLM46VDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '106b', label: '106B', subtitle: 'GLM-4.6V', default: true }, + { id: '9b', label: '9B', subtitle: 'GLM-4.6V-Flash', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command + const generateCommand = () => { + const { hardware, modelsize, quantization, reasoning, toolcall } = values; + + // Model configurations + const modelConfigs = { + '106b': { + baseName: 'GLM-4.6V', + h100: { tp: 8 }, + h200: { tp: 8 }, + b200: { tp: 8 }, + mi300x: { tp: 8 }, mi325x: { tp: 8 }, // MI325X is selectable above; TP assumed to match MI300X + mi355x: { tp: 8 } + }, + '9b': { + baseName: 'GLM-4.6V-Flash', + h100: { tp: 1 }, + h200: { tp: 1 }, + b200: { tp: 1 }, + mi300x: { tp: 1 }, mi325x: { tp: 1 }, // MI325X is selectable above; TP assumed to match MI300X + mi355x: { tp: 1 } + } + }; + + const config = modelConfigs[modelsize]; + if (!config) { + return `# Error: Unknown model size: ${modelsize}`; + } + + const hwConfig = config[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantSuffix = quantization === 'fp8' ? 
'-FP8' : ''; + const modelName = `zai-org/${config.baseName}${quantSuffix}`; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + if (hwConfig.tp === 8) { + cmd += ` \\\n --mm-enable-dp-encoder`; + } + } + + // Add reasoning parser if enabled + if (reasoning === 'enabled') { + cmd += ` \\\n --reasoning-parser glm45`; + } + + // Add tool call parser if enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser glm45`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? 
'#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-47-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-47-deployment.jsx new file mode 100644 index 000000000000..f90b33c402f5 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-47-deployment.jsx @@ -0,0 +1,197 @@ +export const GLM47Deployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false }, + { id: 'ep', label: 'EP', subtitle: 'Expert Parallel', default: false }, + { id: 'mtp', label: 'MTP', subtitle: 'Multi-token Prediction', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Thinking Capabilities', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization, strategy, thinking, toolcall } = values; + const strategyArray = Array.isArray(strategy) ? strategy : []; + + const modelSuffix = quantization === 'fp8' ? '-FP8' : ''; + const modelName = `zai-org/GLM-4.7${modelSuffix}`; + + // Determine TP value based on hardware and quantization + let tpValue = 4; // Default for MI300X and MI325X + if (hardware === 'mi355x') { + tpValue = quantization === 'fp8' ? 
2 : 4; // MI355X: TP=2 for FP8, TP=4 for BF16 + } + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + // TP is mandatory + cmd += ` \\\n --tp ${tpValue}`; + + // MI300X/MI325X BF16 requires extra flags + if ((hardware === 'mi300x' || hardware === 'mi325x') && quantization === 'bf16') { + cmd += ` \\\n --max-context-length 8192 \\\n --mem-fraction-static 0.9`; + } + + // Strategy-specific parameters + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp 8 \\\n --enable-dp-attention`; + } + if (strategyArray.includes('ep')) { + cmd += ` \\\n --ep 8`; + } + if (strategyArray.includes('mtp')) { + cmd = 'SGLANG_ENABLE_SPEC_V2=1 ' + cmd; + cmd += ` \\\n --speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4`; + } + + // Add tool call parser if enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser glm47`; + } + + // Add thinking parser if enabled + if (thinking === 'enabled') { + cmd += ` \\\n --reasoning-parser glm47`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-47-flash-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-47-flash-deployment.jsx new file mode 100644 index 000000000000..afdcdf46c138 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-47-flash-deployment.jsx @@ -0,0 +1,191 @@ +export const GLM47FlashDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', subtitle: 'Tensor Parallel', default: true, required: true }, + { id: 'dp', label: 'DP', subtitle: 'Data Parallel', default: false }, + { id: 'mtp', label: 'MTP', subtitle: 'Multi-token Prediction', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Thinking Capabilities', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization, strategy, thinking, toolcall } = values; + const strategyArray = Array.isArray(strategy) ? 
strategy : []; + + const modelName = `zai-org/GLM-4.7-Flash`; + + // GLM-4.7-Flash is a 30B-A3B MoE model, lighter than GLM-4.7 + const tpValue = 1; // Default for single GPU + + let cmd = 'python -m sglang.launch_server \\\n '; + cmd += ` --model ${modelName}`; + + // TP is mandatory + cmd += ` \\\n --tp ${tpValue}`; + + if (hardware === 'b200') { + cmd += ` \\\n --attention-backend triton`; + } + + // Strategy-specific parameters + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp 1 \\\n --enable-dp-attention`; + } + if (strategyArray.includes('mtp')) { + cmd = 'SGLANG_ENABLE_SPEC_V2=1 ' + cmd; + + if (hardware === 'b200') { + cmd += ` \\\n --speculative-draft-attention-backend triton`; + } + cmd += ` \\\n --speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4`; + } + + // Add tool call parser if enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser glm47`; + } + + // Add thinking parser if enabled + if (thinking === 'enabled') { + cmd += ` \\\n --reasoning-parser glm45`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-5-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-5-deployment.jsx new file mode 100644 index 000000000000..26f1931ad575 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-5-deployment.jsx @@ -0,0 +1,266 @@ +export const GLM5Deployment = () => { + // Config mirrors sgl-cookbook src/components/autoregressive/GLM5ConfigGenerator/index.js. + // + // Supported quantization per hardware: + // H100 / H200 → FP8 (default), BF16 + // MI300X / MI325X / MI355X → BF16 only (FP8 not verified on AMD) + // B200 → NVFP4 (default), FP8, BF16 + // On NVIDIA GPUs, BF16 needs 2x the GPUs of FP8. + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi300x', label: 'MI300X/MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + getDynamicItems: (values) => { + const hw = values.hardware; + const isAMD = hw === 'mi300x' || hw === 'mi355x'; + const isB200 = hw === 'b200'; + return [ + { id: 'bf16', label: 'BF16', subtitle: 'Full Weights', default: isAMD }, + { id: 'fp8', label: 'FP8', subtitle: 'High Throughput', default: !isAMD && !isB200, disabled: isAMD, disabledReason: 'FP8 not verified on AMD' }, + { id: 'nvfp4', label: 'NVFP4', subtitle: 'Highest Throughput', default: isB200, disabled: !isB200, disabledReason: 'NVFP4 only on B200' } + ]; + } + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + condition: (values) => values.quantization !== 'nvfp4', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + condition: (values) => values.quantization !== 
'nvfp4', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + dpattention: { + name: 'dpattention', + title: 'DP Attention', + condition: (values) => values.quantization !== 'nvfp4', + items: [ + { id: 'disabled', label: 'Disabled', subtitle: 'Low Latency', default: true }, + { id: 'enabled', label: 'Enabled', subtitle: 'High Throughput', default: false } + ] + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding', + condition: (values) => values.hardware !== 'mi300x' && values.hardware !== 'mi355x' && values.quantization !== 'nvfp4', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + } + }; + + // BF16 always 2× the GPUs of FP8. + const modelConfigs = { + h100: { fp8: { tp: 16, mem: 0.85 }, bf16: { tp: 32, mem: 0.85 } }, + h200: { fp8: { tp: 8, mem: 0.85 }, bf16: { tp: 16, mem: 0.85 } }, + b200: { nvfp4: { tp: 4, mem: 0.9 }, fp8: { tp: 8, mem: 0.9 }, bf16: { tp: 16, mem: 0.9 } }, + mi300x: { bf16: { tp: 8, mem: 0.80 } }, + mi355x: { bf16: { tp: 8, mem: 0.80 } } + }; + + const resolveItems = (option, values) => { + if (typeof option.getDynamicItems === 'function') return option.getDynamicItems(values); + return option.items; + }; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, initialState); + const def = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 
'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + // When hardware changes, re-resolve quantization (and downstream) defaults to + // stay consistent (AMD→BF16, B200→NVFP4, etc.). + useEffect(() => { + setValues(prev => { + const next = { ...prev }; + for (const [key, option] of Object.entries(options)) { + if (typeof option.getDynamicItems !== 'function') continue; + const items = option.getDynamicItems(next); + const current = items.find(i => i.id === next[key]); + if (!current || current.disabled) { + const fallback = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled); + if (fallback) next[key] = fallback.id; + } + } + return next; + }); + }, [values.hardware]); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const generateCommand = () => { + const { hardware, quantization } = values; + const isAMD = hardware === 'mi300x' || hardware === 'mi355x'; + const isNVFP4 = quantization === 'nvfp4' && hardware === 'b200'; // NVFP4 is B200-only; a stale selection can survive one render after a hardware switch + const effectiveQuant = isAMD ? 'bf16' : (quantization === 'nvfp4' && !isNVFP4 ? 'fp8' : quantization); + + let modelName; + if (isNVFP4) { + modelName = 'nvidia/GLM-5-NVFP4'; + } else { + const suffix = effectiveQuant === 'fp8' ? '-FP8' : ''; + modelName = `zai-org/GLM-5${suffix}`; + } + + const hwConfig = modelConfigs[hardware][effectiveQuant]; + const tpValue = hwConfig.tp; + const memFraction = hwConfig.mem; + + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ` \\\n --tp ${tpValue}`; + + // NVFP4 B200: trtllm NSA backends, flashinfer fusion, FP8 KV cache. 
+ if (isNVFP4) { + cmd += ' \\\n --trust-remote-code'; + cmd += ' \\\n --quantization modelopt_fp4'; + cmd += ' \\\n --kv-cache-dtype fp8_e4m3'; + cmd += ' \\\n --nsa-decode-backend trtllm'; + cmd += ' \\\n --nsa-prefill-backend trtllm'; + cmd += ' \\\n --moe-runner-backend flashinfer_trtllm'; + cmd += ' \\\n --enable-flashinfer-allreduce-fusion'; + cmd += ' \\\n --enable-dp-lm-head'; + cmd += ' \\\n --disable-radix-cache'; + cmd += ' \\\n --max-prefill-tokens 32768'; + cmd += ' \\\n --chunked-prefill-size 32768'; + cmd += ` \\\n --mem-fraction-static ${memFraction}`; + cmd += ' \\\n --scheduler-recv-interval 10'; + cmd += ' \\\n --tokenizer-worker-num 6'; + return cmd; + } + + // AMD-specific: NSA tilelang backend. + if (isAMD) { + cmd += ' \\\n --trust-remote-code'; + cmd += ' \\\n --nsa-prefill-backend tilelang'; + cmd += ' \\\n --nsa-decode-backend tilelang'; + cmd += ' \\\n --chunked-prefill-size 131072'; + cmd += ' \\\n --watchdog-timeout 1200'; + } + + if (values.dpattention === 'enabled') { + cmd += ` \\\n --dp ${tpValue} \\\n --enable-dp-attention`; + } + if (values.reasoning === 'enabled') cmd += ' \\\n --reasoning-parser glm45'; + if (values.toolcall === 'enabled') cmd += ' \\\n --tool-call-parser glm47'; + if (values.speculative === 'enabled') { + cmd += ' \\\n --speculative-algorithm EAGLE'; + cmd += ' \\\n --speculative-num-steps 3'; + cmd += ' \\\n --speculative-eagle-topk 1'; + cmd += ' \\\n --speculative-num-draft-tokens 4'; + } + + // B200 FP8: consolidated optimized flags. 
+ if (hardware === 'b200' && effectiveQuant === 'fp8') { + cmd += ' \\\n --ep 1'; + cmd += ' \\\n --quantization fp8'; + cmd += ' \\\n --attention-backend nsa'; + cmd += ' \\\n --nsa-decode-backend trtllm'; + cmd += ' \\\n --nsa-prefill-backend trtllm'; + cmd += ' \\\n --moe-runner-backend flashinfer_trtllm'; + cmd += ' \\\n --enable-flashinfer-allreduce-fusion'; + } + + cmd += ` \\\n --mem-fraction-static ${memFraction}`; + return cmd; + }; + + // Styles + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? 
'#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (typeof option.condition === 'function' && !option.condition(values)) return null; + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-51-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-51-deployment.jsx new file mode 100644 index 000000000000..fd2734362eed --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-51-deployment.jsx @@ -0,0 +1,230 @@ +export const GLM51Deployment = () => { + // Config mirrors sgl-cookbook src/components/autoregressive/GLM51ConfigGenerator/index.js. + // + // Supported quantization per hardware: + // H100 / H200 / B200 → BF16 + FP8 + // GB300 → FP8 only + // MI300X/MI325X/MI355X → BF16 (FP8 not verified on AMD) + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'gb300', label: 'GB300', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + getDynamicItems: (values) => { + const hw = values.hardware; + const isAMD = ['mi300x', 'mi325x', 'mi355x'].includes(hw); + const isGB300 = hw === 'gb300'; + return [ + { id: 'bf16', label: 'BF16', subtitle: 'Full Weights', default: isAMD, disabled: isGB300, disabledReason: isGB300 ? 'BF16 is not recommended on GB300 for GLM-5.1' : '' }, + { id: 'fp8', label: 'FP8', subtitle: 'High Throughput', default: !isAMD, disabled: isAMD, disabledReason: isAMD ? 
'FP8 not verified on AMD' : '' } + ]; + } + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + dpattention: { + name: 'dpattention', + title: 'DP Attention', + items: [ + { id: 'disabled', label: 'Disabled', subtitle: 'Low Latency', default: true }, + { id: 'enabled', label: 'Enabled', subtitle: 'High Throughput', default: false } + ] + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding', + condition: (values) => !['mi300x', 'mi325x', 'mi355x'].includes(values.hardware), + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + } + }; + + const modelConfigs = { + h100: { fp8: { tp: 16, mem: 0.85 }, bf16: { tp: 32, mem: 0.85 } }, + h200: { fp8: { tp: 8, mem: 0.85 }, bf16: { tp: 16, mem: 0.85 } }, + b200: { fp8: { tp: 8, mem: 0.9 }, bf16: { tp: 16, mem: 0.9 } }, + gb300: { fp8: { tp: 4, mem: 0.9 } }, + mi300x: { bf16: { tp: 8, mem: 0.80 } }, + mi325x: { bf16: { tp: 8, mem: 0.80 } }, + mi355x: { bf16: { tp: 8, mem: 0.80 } } + }; + + const resolveItems = (option, values) => { + if (typeof option.getDynamicItems === 'function') return option.getDynamicItems(values); + return option.items; + }; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, initialState); + const def = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() 
=> { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + useEffect(() => { + setValues(prev => { + const next = { ...prev }; + for (const [key, option] of Object.entries(options)) { + if (typeof option.getDynamicItems !== 'function') continue; + const items = option.getDynamicItems(next); + const current = items.find(i => i.id === next[key]); + if (!current || current.disabled) { + const fallback = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled); + if (fallback) next[key] = fallback.id; + } + } + return next; + }); + }, [values.hardware]); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const generateCommand = () => { + const { hardware, quantization } = values; + const isAMD = ['mi300x', 'mi325x', 'mi355x'].includes(hardware); + const isGB300 = hardware === 'gb300'; + const effectiveQuant = isAMD ? 'bf16' : (isGB300 && quantization === 'bf16' ? 'fp8' : quantization); + const suffix = effectiveQuant === 'fp8' ? 
'-FP8' : ''; + const modelName = `zai-org/GLM-5.1${suffix}`; + + const hwConfig = modelConfigs[hardware][effectiveQuant]; + if (!hwConfig) return '# Configuration not available for the selected hardware and quantization.'; + + const tpValue = hwConfig.tp; + const memFraction = hwConfig.mem; + const enableSpec = values.speculative === 'enabled'; + + let cmd = ''; + if (enableSpec) cmd += 'SGLANG_ENABLE_SPEC_V2=1 '; + cmd += 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ` \\\n --tp ${tpValue}`; + + if (isAMD) { + cmd += ' \\\n --trust-remote-code'; + cmd += ' \\\n --nsa-prefill-backend tilelang'; + cmd += ' \\\n --nsa-decode-backend tilelang'; + cmd += ' \\\n --chunked-prefill-size 131072'; + cmd += ' \\\n --watchdog-timeout 1200'; + } + + if (values.dpattention === 'enabled') { + cmd += ` \\\n --dp ${tpValue} \\\n --enable-dp-attention`; + } + if (values.reasoning === 'enabled') cmd += ' \\\n --reasoning-parser glm45'; + if (values.toolcall === 'enabled') cmd += ' \\\n --tool-call-parser glm47'; + if (enableSpec) { + cmd += ' \\\n --speculative-algorithm EAGLE'; + cmd += ' \\\n --speculative-num-steps 3'; + cmd += ' \\\n --speculative-eagle-topk 1'; + cmd += ' \\\n --speculative-num-draft-tokens 4'; + } + + cmd += ` \\\n --mem-fraction-static ${memFraction}`; + return cmd; + }; + + // Styles + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (typeof option.condition === 'function' && !option.condition(values)) return null; + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-glyph-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-glyph-deployment.jsx new file mode 100644 index 000000000000..62162a9443ec --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-glyph-deployment.jsx @@ -0,0 +1,372 @@ +export const GLMGlyphDeployment = () => { + const modelFamily = 'zai-org'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser glm45' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? 
'--tool-call-parser glm45' : null + } + }; + + const modelConfig = { + baseName: 'Glyph', + b200: { tp: 4, bf16: true, fp8: true }, + h100: { tp: 4, bf16: true, fp8: true }, + h200: { tp: 4, bf16: true, fp8: true }, + mi300x: { tp: 4, bf16: true, fp8: true }, + mi325x: { tp: 4, bf16: true, fp8: true }, + mi355x: { tp: 2, bf16: true, fp8: true } + }; + + const generateCommand = (values) => { + const { hardware, quantization } = values; + + const hwConfig = modelConfig[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantSuffix = quantization === 'fp8' ? '-FP8' : ''; + const modelName = `${modelFamily}/${modelConfig.baseName}${quantSuffix}`; + + const isAMD = ['mi300x', 'mi325x', 'mi355x'].includes(hardware); + + let cmd = ''; + if (isAMD) { + cmd = 'python3 -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ` \\\n --tp ${hwConfig.tp}`; + } else { + cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + } + + for (const [key, option] of Object.entries(options)) { + if (key === 'hardware' || key === 'quantization') continue; + + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) { + cmd += ` \\\n ${rule}`; + } + } + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) 
=> item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const 
cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? 
'#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/glm-ocr-deployment.jsx b/docs_new/src/snippets/autoregressive/glm-ocr-deployment.jsx new file mode 100644 index 000000000000..773a922a0534 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/glm-ocr-deployment.jsx @@ -0,0 +1,140 @@ +export const GLMOCRDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'mtp', label: 'MTP', subtitle: 'Multi-token Prediction', default: true } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode - prioritize page theme over system preference + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { strategy } = values; + const strategyArray = Array.isArray(strategy) ? 
strategy : []; + + const modelName = 'zai-org/GLM-OCR'; + + let cmd = 'SGLANG_USE_CUDA_IPC_TRANSPORT=1 python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + if (strategyArray.includes('mtp')) { + cmd += ` \\\n --speculative-algorithm EAGLE`; + cmd += ` \\\n --speculative-num-steps 3`; + cmd += ` \\\n --speculative-eagle-topk 1`; + cmd += ` \\\n --speculative-num-draft-tokens 4`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? 
'#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/gpt-oss-deployment.jsx b/docs_new/src/snippets/autoregressive/gpt-oss-deployment.jsx new file mode 100644 index 000000000000..6740554b9e1f --- /dev/null +++ b/docs_new/src/snippets/autoregressive/gpt-oss-deployment.jsx @@ -0,0 +1,237 @@ +export const GPTOSSDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '120b', label: '120B', subtitle: 'MOE', default: true }, + { id: '20b', label: '20B', subtitle: 'MOE', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'mxfp4', label: 'MXFP4', default: true }, + { id: 'bf16', label: 'BF16', default: false } + ] + }, + reasoningParser: { + name: 'reasoningParser', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => 
item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command + const generateCommand = () => { + const { hardware, modelsize, quantization, reasoningParser, toolcall, speculative } = values; + + // Model configurations + const modelConfigs = { + '120b': { + baseName: '120b', + h100: { tp: 8 }, + h200: { tp: 8 }, + b200: { tp: 8 }, + mi300x: { tp: 8 }, + mi325x: { tp: 8 }, + mi355x: { tp: 8 } + }, + '20b': { + baseName: '20b', + h100: { tp: 1 }, + h200: { tp: 1 }, + b200: { tp: 1 }, + mi300x: { tp: 1 }, + mi325x: { tp: 1 }, + mi355x: { tp: 1 } + } + }; + + const config = modelConfigs[modelsize]; + if (!config) { + return `# Error: Unknown model size: ${modelsize}`; + } + + const hwConfig = config[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantSuffix = quantization === 'bf16' ? '-bf16' : ''; + const orgPrefix = quantization === 'bf16' ? 
'lmsys' : 'openai'; + const modelName = `${orgPrefix}/gpt-oss-${config.baseName}${quantSuffix}`; + + let cmd = ''; + + // AMD MI300X/MI325X/MI355X with speculative decoding: Work In Progress + if ((hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') && speculative === 'enabled') { + return '# MI30x GPUs Speculative Decoding: Work In Progress'; + } + + // MI300X/MI325X MXFP4: Work In Progress (only MI355X with gfx950 supports MXFP4) + if ((hardware === 'mi300x' || hardware === 'mi325x') && quantization === 'mxfp4') { + return '# MI300X/MI325X GPUs with MXFP4 quantization: Work In Progress'; + } + + // AMD MI300X/MI325X/MI355X require SGLANG_USE_AITER=0 due to YaRN RoPE precision issues + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd += 'SGLANG_USE_AITER=0 '; + } + + if (speculative === 'enabled') { + cmd += 'SGLANG_ENABLE_SPEC_V2=1 SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 '; + } + + cmd += 'python -m sglang.launch_server \\\n'; + + cmd += ` --model ${modelName}`; + + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + + // Add reasoning parser if enabled + if (reasoningParser === 'enabled') { + cmd += ` \\\n --reasoning-parser gpt-oss`; + } + + // Add tool call parser if enabled + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser gpt-oss`; + } + + // Add speculative decoding if enabled (the AMD case returns early above) + if (speculative === 'enabled') { + cmd += ` \\\n --speculative-algorithm EAGLE3 \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4`; + + if (modelsize === '120b') { + cmd += ` \\\n --speculative-draft-model-path nvidia/gpt-oss-120b-Eagle3`; + } else if (modelsize === '20b') { + cmd += ` \\\n --speculative-draft-model-path zhuyksir/EAGLE3-gpt-oss-20b-bf16`; + } + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const
cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/hunyuan3-preview-deployment.jsx b/docs_new/src/snippets/autoregressive/hunyuan3-preview-deployment.jsx new file mode 100644 index 000000000000..82abba7a8b21 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/hunyuan3-preview-deployment.jsx @@ -0,0 +1,174 @@ +export const Hunyuan3PreviewDeployment = () => { + // Hunyuan 3 Preview (~276B total / ~20B active MoE) — BF16 only. + // ~552GB weights; 80GB-class GPUs (A100/H100) cannot fit single-node. + // H200 (141GB): tp=8 + // B200 (180GB): tp=8 + // B300 (275GB): tp=4 + // GB300 (275GB, 4-GPU node): tp=4 + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'b300', label: 'B300', default: false }, + { id: 'gb300', label: 'GB300', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding (MTP)', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', subtitle: 'Low Latency', default: false } + ] + } + }; + + const modelConfigs = { + h200: { tp: 8, mem: 0.9 }, + b200: { tp: 8, mem: 0.9 }, + b300: { tp: 4, mem: 0.9 }, + gb300: { tp: 4, mem: 0.9 } + }; + + const resolveItems = (option, values) => { + if (typeof option.getDynamicItems === 'function') return option.getDynamicItems(values); + return option.items; + }; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = 
resolveItems(option, initialState); + const def = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const generateCommand = () => { + const { hardware } = values; + const isBlackwell = hardware === 'b200' || hardware === 'b300' || hardware === 'gb300'; + const hwConfig = modelConfigs[hardware]; + if (!hwConfig) return '# Configuration not available for the selected hardware.'; + + const modelName = 'tencent/Hy3-preview'; + const tpValue = hwConfig.tp; + const memFraction = hwConfig.mem; + const enableSpec = values.speculative === 'enabled'; + + let cmd = ''; + if (enableSpec) cmd += 'SGLANG_ENABLE_SPEC_V2=1 '; + cmd += 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ` \\\n --tp ${tpValue}`; + + if (values.reasoning === 'enabled') cmd += ' \\\n --reasoning-parser hunyuan'; + if (values.toolcall === 'enabled') cmd += ' \\\n --tool-call-parser hunyuan'; + if (enableSpec) { + cmd += ' \\\n --speculative-algorithm EAGLE'; + cmd += ' \\\n --speculative-num-steps 3'; + cmd += ' \\\n --speculative-eagle-topk 1'; + cmd += ' \\\n --speculative-num-draft-tokens 4'; + } + + cmd += ' \\\n --trust-remote-code'; + cmd += ` \\\n 
--mem-fraction-static ${memFraction}`; + + if (isBlackwell) cmd += ' \\\n --attention-backend trtllm_mha'; + + return cmd; + }; + + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (typeof option.condition === 'function' && !option.condition(values)) return null; + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/kimi-k2-deployment.jsx b/docs_new/src/snippets/autoregressive/kimi-k2-deployment.jsx new file mode 100644 index 000000000000..425ae3a97066 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/kimi-k2-deployment.jsx @@ -0,0 +1,373 @@ +export const KimiK2Deployment = () => { + const modelFamily = 'moonshotai'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelname: { + name: 'modelname', + title: 'Model Name', + items: [ + { id: 'instruct', label: 'Kimi-K2-Instruct', default: true }, + { id: 'thinking', label: 'Kimi-K2-Thinking', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', default: true, required: true }, + { id: 'dp', label: 'DP attention', default: false }, + { id: 'ep', label: 'EP', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + const generateCommand = (values) => { + const { hardware, modelname, strategy, reasoning, toolcall } = values; + + if (modelname === 'instruct' && reasoning === 'enabled') { + return `# Error: Kimi-K2-Instruct doesn't support reasoning parser\n# Please select "Disabled" for Reasoning Parser or choose Kimi-K2-Thinking model`; + } + + const modelMap = { + 'instruct': 'Kimi-K2-Instruct', + 
'thinking': 'Kimi-K2-Thinking' + }; + + const modelName = `${modelFamily}/${modelMap[modelname]}`; + + let cmd = 'python3 -m sglang.launch_server \\\n'; + + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd = 'SGLANG_ROCM_FUSED_DECODE_MLA=0 ' + cmd; + } + + cmd += ` --model-path ${modelName}`; + + const strategyArray = Array.isArray(strategy) ? strategy : []; + cmd += ` \\\n --tp 8`; + if (strategyArray.includes('dp')) { + cmd += ` \\\n --dp 4 \\\n --enable-dp-attention`; + } + if (strategyArray.includes('ep')) { + cmd += ` \\\n --ep 4`; + } + + cmd += ` \\\n --trust-remote-code`; + + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser kimi_k2`; + } + + if (reasoning === 'enabled') { + cmd += ` \\\n --reasoning-parser kimi_k2`; + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/kimi-k25-deployment.jsx b/docs_new/src/snippets/autoregressive/kimi-k25-deployment.jsx new file mode 100644 index 000000000000..7376cb955833 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/kimi-k25-deployment.jsx @@ -0,0 +1,264 @@ +export const KimiK25Deployment = () => { + // Config mirrors sgl-cookbook src/components/autoregressive/KimiK25ConfigGenerator/index.js. + // + // GPU requirements: + // H200: tp=8 + // B300: tp=8 + // MI300X: tp=4 (64 heads / 4 = 16 heads per GPU, AITER MLA requires heads_per_gpu % 16 == 0) + // MI325X: tp=4 (same constraint as MI300X) + // MI350X: tp=4 (same constraint as MI300X) + // MI355X: tp=4 (same constraint as MI300X) + // + // NVFP4 quantization is only supported on NVIDIA Blackwell (B300). + // Speculative decoding is only supported on H200 and B300. + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b300', label: 'B300', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi350x', label: 'MI350X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + getDynamicItems: (values) => { + const hw = values.hardware; + const isB300 = hw === 'b300'; + return [ + { id: 'int4', label: 'INT4', subtitle: 'initial model', default: true }, + { id: 'nvfp4', label: 'NVFP4', subtitle: 'Blackwell only', default: false, disabled: !isB300, disabledReason: 'NVFP4 only on B300' } + ]; + } + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false 
}, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + dpattention: { + name: 'dpattention', + title: 'DP Attention', + items: [ + { id: 'disabled', label: 'Disabled', subtitle: 'Low Latency', default: true }, + { id: 'enabled', label: 'Enabled', subtitle: 'High Throughput', default: false } + ] + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding', + condition: (values) => values.hardware === 'h200' || values.hardware === 'b300', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + const modelConfigs = { + h200: { tp: 8 }, + b300: { tp: 8 }, + mi300x: { tp: 4 }, + mi325x: { tp: 4 }, + mi350x: { tp: 4 }, + mi355x: { tp: 4 } + }; + + const resolveItems = (option, values) => { + if (typeof option.getDynamicItems === 'function') return option.getDynamicItems(values); + return option.items; + }; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, initialState); + const def = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + // When hardware changes, re-resolve quantization defaults (NVFP4 only on B300). 
+ useEffect(() => { + setValues(prev => { + const next = { ...prev }; + for (const [key, option] of Object.entries(options)) { + if (typeof option.getDynamicItems !== 'function') continue; + const items = option.getDynamicItems(next); + const current = items.find(i => i.id === next[key]); + if (!current || current.disabled) { + const fallback = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled); + if (fallback) next[key] = fallback.id; + } + } + return next; + }); + }, [values.hardware]); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command - mirrors sgl-cookbook's config.generateCommand(values) exactly. + const generateCommand = () => { + const { hardware, quantization, speculative } = values; + const isAMD = hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi350x' || hardware === 'mi355x'; + + // NVFP4 is only supported on NVIDIA Blackwell (B300) + if (quantization === 'nvfp4' && hardware !== 'b300') { + return '# NVFP4 quantization is only supported on NVIDIA Blackwell GPUs (B300)'; + } + + // Speculative decoding only supported on H200 and B300 + if (speculative === 'enabled' && hardware !== 'h200' && hardware !== 'b300') { + return '# Speculative Decoding for Kimi-K2.5 is only supported on H200 and B300'; + } + + // Model path depends on quantization + const modelName = quantization === 'nvfp4' + ? 
'nvidia/Kimi-K2.5-NVFP4' + : 'moonshotai/Kimi-K2.5'; + + const hwConfig = modelConfigs[hardware]; + const tpValue = hwConfig.tp; + + let cmd = ''; + + // AMD ROCm environment variables + if (isAMD) { + cmd += 'SGLANG_USE_AITER=1 SGLANG_ROCM_FUSED_DECODE_MLA=0 '; + } + + // Speculative decoding env var + if (speculative === 'enabled') { + cmd += 'SGLANG_ENABLE_SPEC_V2=1 '; + } + + // If we added any env vars above, break to a new line for readability + if (isAMD || speculative === 'enabled') { + cmd += '\\\n'; + } + + cmd += 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ` \\\n --tp ${tpValue}`; + cmd += ' \\\n --trust-remote-code'; + + // DP Attention: --dp matches --tp + if (values.dpattention === 'enabled') { + cmd += ` \\\n --dp ${tpValue} \\\n --enable-dp-attention`; + } + + // Reasoning parser + if (values.reasoning === 'enabled') { + cmd += ' \\\n --reasoning-parser kimi_k2'; + } + + // Tool call parser + if (values.toolcall === 'enabled') { + cmd += ' \\\n --tool-call-parser kimi_k2'; + } + + // Speculative decoding (EAGLE3) + if (speculative === 'enabled') { + cmd += ' \\\n --speculative-algorithm EAGLE3 \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4 \\\n --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3'; + } + + // AMD: FP8 KV cache for memory efficiency + if (isAMD) { + cmd += ' \\\n --kv-cache-dtype fp8_e4m3'; + } + + cmd += ' \\\n --host 0.0.0.0 \\\n --port 30000'; + + return cmd; + }; + + // Styles + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? 
'#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (typeof option.condition === 'function' && !option.condition(values)) return null; + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/kimi-k26-deployment.jsx b/docs_new/src/snippets/autoregressive/kimi-k26-deployment.jsx new file mode 100644 index 000000000000..11a1b68d94f1 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/kimi-k26-deployment.jsx @@ -0,0 +1,187 @@ +export const KimiK26Deployment = () => { + // Config mirrors sgl-cookbook src/components/autoregressive/KimiK26ConfigGenerator/index.js. + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'b300', label: 'B300', default: false }, + { id: 'gb200', label: 'GB200', default: false }, + { id: 'gb300', label: 'GB300', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi350x', label: 'MI350X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false }, + ], + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true }, + ], + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true }, + ], + }, + dpattention: { + name: 'dpattention', + title: 'DP Attention', + items: [ + { id: 'disabled', label: 'Disabled', subtitle: 'Low Latency', default: true }, + { id: 'enabled', label: 'Enabled', subtitle: 'High Throughput', default: false }, + ], + }, + }; + + const modelConfigs = { + h200: { tp: 8 }, + b200: { tp: 8 }, + b300: { tp: 8 }, + gb200: { tp: 4 }, + gb300: { tp: 4 }, + mi300x: { tp: 4 }, + mi325x: { tp: 4 }, + mi350x: { tp: 4 }, + mi355x: { tp: 4 }, + }; + + const resolveItems = (option, values) => + typeof option.getDynamicItems === 'function' ? 
option.getDynamicItems(values) : option.items || []; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, initialState); + const def = items.find((item) => item.default && !item.disabled) || items.find((item) => !item.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const generateCommand = () => { + const { hardware, reasoning, toolcall, dpattention } = values; + const isAMD = hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi350x' || hardware === 'mi355x'; + const hwConfig = modelConfigs[hardware]; + const tpValue = hwConfig.tp; + + let cmd = ''; + + if (isAMD) { + cmd += 'SGLANG_USE_AITER=1 SGLANG_ROCM_FUSED_DECODE_MLA=0 \\\n'; + } + + cmd += 'sglang serve \\\n'; + cmd += ' --model-path moonshotai/Kimi-K2.6'; + cmd += ` \\\n --tp ${tpValue}`; + if (isAMD) { + cmd += ' \\\n --mem-fraction-static 0.8'; + } + cmd += ' \\\n --trust-remote-code'; + + if (dpattention === 'enabled') { + cmd += ` \\\n --dp ${tpValue} \\\n --enable-dp-attention`; + } + + if (reasoning === 'enabled') { + cmd += ' \\\n --reasoning-parser kimi_k2'; + } + + if (toolcall === 'enabled') { + cmd += ' \\\n 
--tool-call-parser kimi_k2'; + } + + if (isAMD) { + cmd += ' \\\n --kv-cache-dtype fp8_e4m3'; + } + + cmd += ' \\\n --host 0.0.0.0 \\\n --port 30000'; + return cmd; + }; + + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/kimi-linear-deployment.jsx b/docs_new/src/snippets/autoregressive/kimi-linear-deployment.jsx new file mode 100644 index 000000000000..74e4d11ef3dd --- /dev/null +++ b/docs_new/src/snippets/autoregressive/kimi-linear-deployment.jsx @@ -0,0 +1,358 @@ +export const KimiLinearDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelname: { + name: 'modelname', + title: 'Model Name', + items: [ + { id: 'instruct', label: 'Kimi-Linear-48B-A3B-Instruct', default: true }, + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', default: true, required: true }, + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + const generateCommand = (values) => { + const { hardware, modelname, strategy, reasoning, toolcall } = values; + + if (modelname === 'instruct' && reasoning === 'enabled') { + return `# Error: Kimi-Linear-48B-A3B-Instruct doesn't support the reasoning parser\n# Please select "Disabled" for Reasoning Parser`; + } + + const modelMap = { + 'instruct': 'moonshotai/Kimi-Linear-48B-A3B-Instruct', + }; + + const modelName = modelMap[modelname]; + + let cmd = 'python3 -m sglang.launch_server \\\n'; + + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd = 'SGLANG_ROCM_FUSED_DECODE_MLA=0 ' + cmd; + } + + cmd += `
--model-path ${modelName}`; + + cmd += ` \\\n --tp 4`; + + cmd += ` \\\n --trust-remote-code`; + + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser kimi_k2`; + } + + if (reasoning === 'enabled') { + cmd += ` \\\n --reasoning-parser kimi_k2`; + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/ling-25-1t-deployment.jsx b/docs_new/src/snippets/autoregressive/ling-25-1t-deployment.jsx new file mode 100644 index 000000000000..f69c81e676e9 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/ling-25-1t-deployment.jsx @@ -0,0 +1,189 @@ +export const Ling251TDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'gb200', label: 'GB200', default: false }, + { id: 'gb300', label: 'GB300', default: false } + ] + }, + parallelism: { + name: 'parallelism', + title: 'Parallelism Strategy', + items: [ + { id: 'tp4pp2', label: 'TP4 + PP2', default: true }, + { id: 'tp8', label: 'TP8', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, parallelism, toolcall } = values; + + const isGB = hardware === 'gb200' || hardware === 'gb300'; + const envPrefix = isGB ? 
'NCCL_IB_DISABLE=1 ' : ''; + + let tp, pp; + if (isGB && parallelism === 'tp8') { + tp = 8; + pp = null; + } else if (isGB) { + tp = 4; + pp = 2; + } else { + tp = 8; + pp = 2; + } + + const needMemFrac = hardware === 'h200' || (isGB && parallelism !== 'tp8'); + + const generateNodeCmd = (rank) => { + let cmd = `${envPrefix}python3 -m sglang.launch_server \\\n`; + cmd += ` --model-path inclusionAI/Ling-2.5-1T \\\n`; + cmd += ` --trust-remote-code \\\n`; + cmd += ` --tp-size ${tp} \\\n`; + if (pp) { + cmd += ` --pp-size ${pp} \\\n`; + } + cmd += ` --nnodes 2 \\\n`; + cmd += ` --node-rank ${rank} \\\n`; + if (rank === 0) { + cmd += ` --host 0.0.0.0 \\\n`; + cmd += ` --port \${PORT} \\\n`; + } + cmd += ` --dist-init-addr \${MASTER_IP}:\${DIST_PORT}`; + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser qwen`; + } + if (needMemFrac) { + cmd += ` \\\n --mem-fraction-static 0.95`; + } + return cmd; + }; + + let output = `# MASTER_IP is the IP address of Node 0. Choose PORT and DIST_PORT yourself.\n\n`; + output += `# Node 0:\n`; + output += generateNodeCmd(0); + output += `\n\n\n# Node 1:\n`; + output += generateNodeCmd(1); + + return output; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + const isGB = values.hardware === 'gb200' || values.hardware === 'gb300'; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + // Only show parallelism for GB200/GB300 + if (key === 'parallelism' && !isGB) return null; + return ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isItemDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/ling-26-1t-deployment.jsx b/docs_new/src/snippets/autoregressive/ling-26-1t-deployment.jsx new file mode 100644 index 000000000000..c0bca2a4beca --- /dev/null +++ b/docs_new/src/snippets/autoregressive/ling-26-1t-deployment.jsx @@ -0,0 +1,178 @@ +export const Ling261TDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'gb300', label: 'GB300 ×4 (1 node)', default: true }, + { id: 'gb200', label: 'GB200 ×4 (1 node)', default: false }, + { id: 'h200', label: 'H200 ×8 (2 nodes)', default: false }, + { id: 'b200', label: 'B200 ×8 (2 nodes)', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'qwen3 (split )', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, toolcall, reasoning } = values; + const isSingleNode = hardware === 'gb300' || hardware === 'gb200'; + + const tail = (cmd) => { + let out = cmd; + out += ` \\\n --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}'`; + if (toolcall === 'enabled') out += ` \\\n --tool-call-parser qwen`; + if (reasoning === 'enabled') out += ` \\\n --reasoning-parser qwen3`; + return out; + }; + + if (isSingleNode) { + let cmd = `sglang serve \\\n`; + cmd += ` --model-path inclusionAI/Ling-2.6-1T \\\n`; + cmd += ` --tp-size 4 \\\n`; + cmd += ` --trust-remote-code \\\n`; + cmd += ` --host 0.0.0.0 \\\n`; + cmd += ` --port \${PORT}`; + return tail(cmd); + } + + // Two-node deployment + const generateNodeCmd = (rank) => { + let cmd = `sglang serve 
\\\n`; + cmd += ` --model-path inclusionAI/Ling-2.6-1T \\\n`; + cmd += ` --tp-size 8 \\\n`; + cmd += ` --pp-size 2 \\\n`; + cmd += ` --nnodes 2 \\\n`; + cmd += ` --node-rank ${rank} \\\n`; + cmd += ` --trust-remote-code \\\n`; + if (rank === 0) { + cmd += ` --host 0.0.0.0 \\\n`; + cmd += ` --port \${PORT} \\\n`; + } + cmd += ` --dist-init-addr \${MASTER_IP}:\${DIST_PORT}`; + return tail(cmd); + }; + + let output = `# MASTER_IP is the IP address of Node 0. Choose PORT and DIST_PORT yourself.\n\n`; + output += `# Node 0:\n`; + output += generateNodeCmd(0); + output += `\n\n\n# Node 1:\n`; + output += generateNodeCmd(1); + + return output; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? 
'#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isItemDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/ling-26-flash-deployment.jsx b/docs_new/src/snippets/autoregressive/ling-26-flash-deployment.jsx new file mode 100644 index 000000000000..71802b1908fe --- /dev/null +++ b/docs_new/src/snippets/autoregressive/ling-26-flash-deployment.jsx @@ -0,0 +1,160 @@ +export const Ling26FlashDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h20', label: 'H20-3e ×4', default: true }, + { id: 'h100', label: 'H100 ×4', default: false }, + { id: 'h200', label: 'H200 ×4', default: false }, + { id: 'b200', label: 'B200 ×4', default: false } + ] + }, + yarn: { + name: 'yarn', + title: 'Context Length', + items: [ + { id: 'enabled', label: '256K (YaRN ×2)', default: true }, + { id: 'disabled', label: '128K (default)', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'qwen3 (split )', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command + const generateCommand = () => { + const { yarn, toolcall, reasoning } = values; + + let cmd = `sglang serve \\\n`; + cmd += ` --model-path inclusionAI/Ling-2.6-flash \\\n`; + cmd += ` --tp-size 4 \\\n`; + cmd += ` --trust-remote-code \\\n`; + cmd += ` --host 0.0.0.0 \\\n`; + cmd += ` --port \${PORT}`; + if (yarn === 'enabled') { + cmd += ` \\\n --context-length 262144`; + cmd += ` \\\n --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, "rope_theta": 6000000, "partial_rotary_factor": 0.5, "original_max_position_embeddings": 131072}}'`; + } + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser qwen25`; + } + if (reasoning === 'enabled') { + cmd += ` \\\n --reasoning-parser qwen3`; + } + return cmd; + }; + + // Styles - with dark mode support + const containerStyle 
= { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isItemDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/llada-21-deployment.jsx b/docs_new/src/snippets/autoregressive/llada-21-deployment.jsx new file mode 100644 index 000000000000..025bf82479f9 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/llada-21-deployment.jsx @@ -0,0 +1,338 @@ +export const LLaDA21Deployment = () => { + const modelFamily = 'inclusionAI'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: 'mini', label: 'Mini (16B)', subtitle: 'MoE', default: true }, + { id: 'flash', label: 'Flash (100B)', subtitle: 'MoE', default: false } + ] + } + }; + + const generateCommand = (values) => { + const { hardware, modelsize } = values; + + const modelName = modelsize === 'mini' ? 
'LLaDA2.1-mini' : 'LLaDA2.1-flash'; + const modelPath = `${modelFamily}/${modelName}`; + + let tpSize; + if (modelsize === 'mini') { + tpSize = 1; + } else { + if (hardware === 'b200') { + tpSize = 2; + } else { + tpSize = 4; + } + } + + const args = []; + args.push(`--model-path ${modelPath}`); + args.push(`--dllm-algorithm JointThreshold`); + args.push(`--tp ${tpSize}`); + args.push(`--trust-remote-code`); + args.push(`--mem-fraction-static 0.8`); + args.push(`--max-running-requests 1`); + if (hardware === 'h100' || hardware === 'h200' || hardware === 'b200') { + args.push(`--attention-backend flashinfer`); + } + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` ${args.join(' \\\n ')}`; + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/llama31-deployment.jsx b/docs_new/src/snippets/autoregressive/llama31-deployment.jsx new file mode 100644 index 000000000000..d319b9cc9f13 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/llama31-deployment.jsx @@ -0,0 +1,252 @@ +export const Llama31Deployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '8b', label: '8B', default: false }, + { id: '70b', label: '70B', default: true }, + { id: '405b', label: '405B', default: false } + ] + }, + category: { + name: 'category', + title: 'Category', + items: [ + { id: 'base', label: 'Base', default: false }, + { id: 'instruct', label: 'Instruct', default: true } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + optimization: { + name: 'optimization', + title: 'Optimization Mode', + items: [ + { id: 'basic', label: 'Basic', default: true }, + { id: 'throughput', label: 'Throughput Optimized', default: false }, + { id: 'latency', label: 'Latency Optimized', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + const defaultItem = 
option.items.find(item => item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command + const generateCommand = () => { + const { hardware, optimization, modelsize, category, toolcall, quantization } = values; + + const isAMD = hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x'; + + // Model size mapping + const sizeMap = { + '8b': '8B', + '70b': '70B', + '405b': '405B' + }; + const sizeToken = sizeMap[modelsize] || '70B'; + const categorySuffix = category === 'instruct' ? 
'-Instruct' : ''; + + // Determine model path + let modelPath; + if (quantization === 'fp8' && category === 'instruct') { + if (modelsize === '405b') { + // Meta official FP8 for 405B + modelPath = `meta-llama/Llama-3.1-${sizeToken}${categorySuffix}-FP8`; + } else if (isAMD) { + // AMD FP8-KV variants for 70B/8B on AMD GPUs + modelPath = `amd/Llama-3.1-${sizeToken}${categorySuffix}-FP8-KV`; + } else { + modelPath = `meta-llama/Llama-3.1-${sizeToken}${categorySuffix}`; + } + } else { + modelPath = `meta-llama/Llama-3.1-${sizeToken}${categorySuffix}`; + } + + // Determine TP size + let tpSize; + if (isAMD) { + // AMD GPU TP configuration + const amdTpConfig = { + 'mi300x': { + '405b': { bf16: 8, fp8: 4 }, + '70b': { bf16: 1, fp8: 1 }, + '8b': { bf16: 1, fp8: 1 } + }, + 'mi325x': { + '405b': { bf16: 8, fp8: 4 }, + '70b': { bf16: 1, fp8: 1 }, + '8b': { bf16: 1, fp8: 1 } + }, + 'mi355x': { + '405b': { bf16: 4, fp8: 2 }, + '70b': { bf16: 1, fp8: 1 }, + '8b': { bf16: 1, fp8: 1 } + } + }; + tpSize = quantization === 'fp8' + ? 
amdTpConfig[hardware][modelsize].fp8 + : amdTpConfig[hardware][modelsize].bf16; + } else { + // NVIDIA GPU TP configuration + if (modelsize === '405b') { + tpSize = 8; + } else if (modelsize === '70b' && (hardware === 'h100' || hardware === 'h200')) { + tpSize = 2; + } + } + + // Build command args + const args = []; + args.push(`--model-path ${modelPath}`); + + if (tpSize) { + args.push(`--tp ${tpSize}`); + } + + // Add the quantization flag only when the selected checkpoint is not already an FP8 variant + // (otherwise choosing FP8 for an Instruct model without an FP8 checkpoint would have no effect) + if (quantization === 'fp8' && !modelPath.includes('FP8')) { + args.push(`--quantization fp8`); + } + + // NVIDIA-specific optimizations + if (!isAMD) { + if (optimization === 'throughput') { + args.push(`--enable-dp-attention`); + args.push(`--mem-fraction-static 0.85`); + } else if (optimization === 'latency') { + args.push(`--speculative-algorithm EAGLE3`); + args.push(`--speculative-num-steps 3`); + args.push(`--speculative-eagle-topk 1`); + args.push(`--speculative-num-draft-tokens 4`); + if (modelsize === '8b' && category === 'instruct') { + args.push(`--speculative-draft-model-path yuhuili/EAGLE3-LLaMA3.1-Instruct-8B`); + } else { + args.push(`--speculative-draft-model-path \${EAGLE3_MODEL_PATH}`); + } + args.push(`--disable-shared-experts-fusion`); + args.push(`--max-running-requests 64`); + args.push(`--mem-fraction-static 0.85`); + args.push(`--kv-cache-dtype fp8_e4m3`); + args.push(`--context-length 32768`); + } + } + + if (toolcall === 'enabled') { + args.push(`--tool-call-parser llama3`); + } + + let cmd = 'sglang serve \\\n'; + cmd += ` ${args.join(' \\\n ')}`; + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? 
'#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/llama33-70b-deployment.jsx b/docs_new/src/snippets/autoregressive/llama33-70b-deployment.jsx new file mode 100644 index 000000000000..ca53bb394fe0 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/llama33-70b-deployment.jsx @@ -0,0 +1,138 @@ +export const Llama33Deployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Calling', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization, toolcall } = values; + + // Select model based on quantization + const modelPath = quantization === 'fp8' + ? 'amd/Llama-3.3-70B-Instruct-FP8-KV' + : 'meta-llama/Llama-3.3-70B-Instruct'; + + // Build command + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelPath} \\\n`; + cmd += ` --tp 1`; + + // Add tool calling parser + if (toolcall === 'enabled') { + cmd += ' \\\n --tool-call-parser llama3'; + } + + cmd += ' \\\n --host 0.0.0.0 \\\n'; + cmd += ' --port 30000'; + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
 + ); +}; diff --git a/docs_new/src/snippets/autoregressive/llama4-maverick-deployment.jsx b/docs_new/src/snippets/autoregressive/llama4-maverick-deployment.jsx new file mode 100644 index 000000000000..b93fb6ed94c3 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/llama4-maverick-deployment.jsx @@ -0,0 +1,347 @@ +export const Llama4MaverickDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300x', default: true }, + { id: 'mi325x', label: 'MI325x', default: false }, + { id: 'mi355x', label: 'MI355x', default: false } + ] + }, + host: { + name: 'host', + title: 'Host', + type: 'text', + default: '0.0.0.0', + placeholder: '0.0.0.0' + }, + port: { + name: 'port', + title: 'Port', + type: 'text', + default: '8000', + placeholder: '8000' + } + }; + + const generateCommand = (values) => { + const { hardware, quantization, toolcall, speculative, host, port } = values; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct`; + + if (hardware === 'h100' || hardware === 'h200') { + cmd += ` \\\n --tp 8`; + } else if (hardware === 'b200') { + cmd += ` \\\n --tp 8`; + } else if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd += ` \\\n --tp 8`; + } + + if (quantization === 'fp8') { + cmd += ` \\\n --quantization fp8`; + } + + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser pythonic`; + } + + if (speculative === 'enabled') { + cmd += ` \\\n --speculative-algorithm EAGLE3 \\\n`; + cmd += ` --speculative-draft-model-path lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1 \\\n`; + cmd += ` --speculative-num-steps 3 \\\n`; + cmd += ` --speculative-eagle-topk 1 \\\n`; + cmd += ` --speculative-num-draft-tokens 4 \\\n`; + cmd += ` --mem-fraction-static 0.75 \\\n`; + cmd += ` --cuda-graph-max-bs 2`; + } + + cmd += ` \\\n --enable-multimodal`; + cmd += ` \\\n --context-length 
65536`; + cmd += ` \\\n --dtype bfloat16`; + cmd += ` \\\n --trust-remote-code`; + cmd += ` \\\n --host ${host || '0.0.0.0'}`; + cmd += ` \\\n --port ${port || '8000'}`; + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/llama4-scout-deployment.jsx b/docs_new/src/snippets/autoregressive/llama4-scout-deployment.jsx new file mode 100644 index 000000000000..14d4029f19da --- /dev/null +++ b/docs_new/src/snippets/autoregressive/llama4-scout-deployment.jsx @@ -0,0 +1,374 @@ +export const Llama4ScoutDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: false }, + { id: 'h100', label: 'H100', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'mi300x', label: 'MI300x', default: false }, + { id: 'mi325x', label: 'MI325x', default: false }, + { id: 'mi355x', label: 'MI355x', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding (EAGLE3)', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enable EAGLE3', default: false } + ] + }, + host: { + name: 'host', + title: 'Host', + type: 'text', + default: '0.0.0.0', + placeholder: '0.0.0.0' + }, + port: { + name: 'port', + title: 'Port', + type: 'text', + default: '8000', + placeholder: '8000' + } + }; + + const generateCommand = (values) => { + const { hardware, quantization, toolcall, speculative, host, port } = values; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct`; + + if (hardware === 'h100' || hardware === 'h200') { + cmd += ` \\\n --tp 8`; + } else if (hardware === 'b200') { + cmd += ` \\\n --tp 8`; + } else if (hardware === 'mi300x' || 
hardware === 'mi325x' || hardware === 'mi355x') { + cmd += ` \\\n --tp 8`; + } + + if (quantization === 'fp8') { + cmd += ` \\\n --quantization fp8`; + } + + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser pythonic`; + } + + if (speculative === 'enabled') { + cmd += ` \\\n --speculative-algorithm EAGLE3 \\\n`; + cmd += ` --speculative-draft-model-path lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1 \\\n`; + cmd += ` --speculative-num-steps 3 \\\n`; + cmd += ` --speculative-eagle-topk 1 \\\n`; + cmd += ` --speculative-num-draft-tokens 4 \\\n`; + cmd += ` --mem-fraction-static 0.75 \\\n`; + cmd += ` --cuda-graph-max-bs 2`; + } + + cmd += ` \\\n --enable-multimodal`; + cmd += ` \\\n --context-length 65536`; + cmd += ` \\\n --dtype bfloat16`; + cmd += ` \\\n --trust-remote-code`; + cmd += ` \\\n --host ${host || '0.0.0.0'}`; + cmd += ` \\\n --port ${port || '8000'}`; + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? 
defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? 
'#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/mimo-v2-flash-deployment.jsx b/docs_new/src/snippets/autoregressive/mimo-v2-flash-deployment.jsx new file mode 100644 index 000000000000..985d1a285b0d --- /dev/null +++ b/docs_new/src/snippets/autoregressive/mimo-v2-flash-deployment.jsx @@ -0,0 +1,194 @@ +export const MiMoV2FlashDeployment = () => { + // Config mirrors sgl-cookbook src/components/autoregressive/MiMoConfigGenerator/index.js. + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelname: { + name: 'modelname', + title: 'Model Name', + items: [ + { id: 'mimo-v2-flash', label: 'MiMo-V2-Flash', default: true } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP 8 (Required)', default: true, disabled: true }, + { id: 'dp', label: 'DP Attention (DP 2)', default: true }, + { id: 'mtp', label: 'Multi-token Prediction (MTP)', default: true }, + { id: 'optimization', label: 'Performance Optimizations', default: true } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning & Tools', + type: 'checkbox', + items: [ + { id: 'reasoning', label: 'Reasoning Parser (Qwen3)', default: true }, + { id: 'toolcall', label: 'Tool Call Parser', default: true } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command — mirrors sgl-cookbook's config.generateCommand(values) exactly + const generateCommand = () => { + const { hardware, strategy, reasoning } = values; + const isMI355X = hardware === 'mi355x'; + + const modelPath = 'XiaomiMiMo/MiMo-V2-Flash'; + const strategyArray = Array.isArray(strategy) ? strategy : []; + const reasoningArray = Array.isArray(reasoning) ? reasoning : []; + + if (isMI355X && strategyArray.includes('mtp')) { + return '# MI355X Speculative Decoding (EAGLE): Work In Progress\n' + + '# Uncheck "Multi-token Prediction (MTP)" to view the validated non-speculative MI355X command.'; + } + + const commandPrefix = isMI355X + ? 'PYTHONPATH=/sgl-workspace/aiter SGLANG_USE_AITER=0 USE_ROCM_AITER_ROPE_BACKEND=0' + : 'SGLANG_ENABLE_SPEC_V2=1'; + const tpSize = isMI355X ? 
4 : 8; + + let cmd = `${commandPrefix} sglang serve \\\n`; + cmd += ` --model-path ${modelPath} \\\n`; + cmd += ` --trust-remote-code \\\n`; + cmd += ` --tp-size ${tpSize}`; + + // DP settings + if (!isMI355X && strategyArray.includes('dp')) { + cmd += ` \\\n --dp-size 2 \\\n --enable-dp-attention`; + } + + // Performance Optimizations + if (strategyArray.includes('optimization')) { + cmd += ` \\\n --mem-fraction-static 0.75 \\\n --max-running-requests 128 \\\n --chunked-prefill-size 16384 \\\n --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}'`; + cmd += isMI355X + ? ` \\\n --attention-backend triton \\\n --prefill-attention-backend triton \\\n --decode-attention-backend triton \\\n --disable-custom-all-reduce` + : ` \\\n --attention-backend fa3`; + } + + // MTP/Speculative settings + if (!isMI355X && strategyArray.includes('mtp')) { + cmd += ` \\\n --speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4 \\\n --enable-multi-layer-eagle`; + } + + // Reasoning Parser + if (reasoningArray.includes('reasoning')) { + cmd += ` \\\n --reasoning-parser qwen3`; + } + + // Tool Call Parser + if (reasoningArray.includes('toolcall')) { + cmd += ` \\\n --tool-call-parser mimo`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = item.disabled; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/mimo-v25-deployment.jsx b/docs_new/src/snippets/autoregressive/mimo-v25-deployment.jsx new file mode 100644 index 000000000000..31b0fcad9de6 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/mimo-v25-deployment.jsx @@ -0,0 +1,463 @@ +export const MiMoV25Deployment = () => { + // MiMo-V2.5 family deployment matrix: + // Variant × Hardware → slug, tp, multinode, blackwell + // + // V2.5-Pro (1.02T / 42B active) — text-only: + // H200 → tp=16, 2 nodes, FP8 (Hopper: fa3 + DeepEP) + // H100 → tp=16, 2 nodes, FP8 (Hopper: fa3 + DeepEP) + // B200 → tp=8, single-node, FP8 (Blackwell verified: fa4 + flashinfer_trtllm) + // GB300 → tp=8, 2 nodes, FP8 (Blackwell verified: fa4 + flashinfer_trtllm + NCCL_MNNVL) + // V2.5 (310B / 15B active) — multimodal. Checkpoint is TP=4 interleaved, + // so attention-TP per DP group must be 4; effective parallelism = TP/DP = 4. + // H200 → tp=8, dp=2, single-node, FP8 (verified) + // H100 → tp=8, dp=2, single-node, FP8 + // B200 → tp=4, dp=1, single-node, FP8 + // GB300 → tp=4, dp=1, single-node, FP8 + // + // Optional toggles: + // EAGLE MTP — adds --speculative-* flags + SGLANG_ENABLE_SPEC_V2=1. + // DeepEP — Hopper only (Blackwell uses flashinfer_trtllm). Adds + // --moe-a2a-backend deepep + --moe-dense-tp-size 1 + // (and --ep on Pro) + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256. + // Requires `pip install deep_ep`. 
+ + const options = { + modelVariant: { + name: "modelVariant", + title: "Model Variant", + items: [ + { id: "pro", label: "V2.5-Pro", default: true, subtitle: "1.02T / 42B" }, + { id: "base", label: "V2.5", default: false, subtitle: "310B / 15B" }, + ], + }, + hardware: { + name: "hardware", + title: "Hardware Platform", + items: [ + { id: "h200", label: "H200", default: true }, + { id: "h100", label: "H100", default: false }, + { id: "b200", label: "B200", default: false }, + { id: "gb300", label: "GB300", default: false }, + { id: "tpu-v7x", label: "TPU v7x", default: false, subtitle: "sgl-jax, Pro only" }, + { id: "tpu-v6e", label: "TPU v6e", default: false, subtitle: "sgl-jax, Pro only" }, + ], + }, + eagleMtp: { + name: "eagleMtp", + title: "EAGLE MTP", + items: [ + { id: "enabled", label: "Enabled", default: true, subtitle: "EAGLE" }, + { id: "disabled", label: "Disabled", default: false }, + ], + }, + dpAttention: { + name: "dpAttention", + title: "DP Attention", + items: [ + { id: "enabled", label: "Enabled", default: false, subtitle: "auto for V2.5" }, + { id: "disabled", label: "Disabled", default: true }, + ], + }, + expertParallelism: { + name: "expertParallelism", + title: "Expert Parallelism", + items: [ + { id: "enabled", label: "Enabled", default: false, subtitle: "Pro Hopper" }, + { id: "disabled", label: "Disabled", default: true }, + ], + }, + deepep: { + name: "deepep", + title: "DeepEP", + items: [ + { id: "enabled", label: "Enabled", default: false, subtitle: "needs deep_ep" }, + { id: "disabled", label: "Disabled", default: true, subtitle: "default" }, + ], + }, + reasoningParser: { + name: "reasoningParser", + title: "Reasoning Parser", + items: [ + { id: "enabled", label: "Enabled", default: true, subtitle: "mimo" }, + { id: "disabled", label: "Disabled", default: false }, + ], + }, + toolcall: { + name: "toolcall", + title: "Tool Call Parser", + items: [ + { id: "enabled", label: "Enabled", default: true, subtitle: "mimo" }, + { id: 
"disabled", label: "Disabled", default: false }, + ], + }, + }; + + // Per (variant, hardware): HF slug, tp, multinode info, Blackwell flag. + // V2.5 (base) checkpoint has TP=4-interleaved fused qkv_proj, so attention + // TP per DP group MUST be 4. Effective TP/DP = 4. With tp=8 → dp=2; tp=4 → dp=1. + // TPU rows go through the sgl-jax stack (`python -m sgl_jax.launch_server`), + // not the CUDA `sglang serve` binary; tp == total JAX devices across nodes. + const HW_VARIANT_SPEC = { + "pro|h200": { slug: "XiaomiMiMo/MiMo-V2.5-Pro", tp: 16, multinode: true, nnodes: 2, blackwell: false, jax: false }, + "pro|h100": { slug: "XiaomiMiMo/MiMo-V2.5-Pro", tp: 16, multinode: true, nnodes: 2, blackwell: false, jax: false }, + "pro|b200": { slug: "XiaomiMiMo/MiMo-V2.5-Pro", tp: 8, multinode: false, blackwell: true, jax: false }, + "pro|gb300": { slug: "XiaomiMiMo/MiMo-V2.5-Pro", tp: 8, multinode: true, nnodes: 2, blackwell: true, jax: false }, + "pro|tpu-v7x": { slug: "XiaomiMiMo/MiMo-V2.5-Pro", tp: 32, multinode: true, nnodes: 4, blackwell: false, jax: true }, + "pro|tpu-v6e": { slug: "XiaomiMiMo/MiMo-V2.5-Pro", tp: 64, multinode: true, nnodes: 16, blackwell: false, jax: true }, + "base|h200": { slug: "XiaomiMiMo/MiMo-V2.5", tp: 8, multinode: false, blackwell: false, jax: false, dp: 2 }, + "base|h100": { slug: "XiaomiMiMo/MiMo-V2.5", tp: 8, multinode: false, blackwell: false, jax: false, dp: 2 }, + "base|b200": { slug: "XiaomiMiMo/MiMo-V2.5", tp: 4, multinode: false, blackwell: true, jax: false, dp: 1 }, + "base|gb300": { slug: "XiaomiMiMo/MiMo-V2.5", tp: 4, multinode: false, blackwell: true, jax: false, dp: 1 }, + }; + + const multiNodeFlags = (nnodes) => [ + ` --nnodes ${nnodes}`, + ` --node-rank `, + ` --dist-init-addr :20000`, + ]; + + const prependMultiNodeNote = (cmd, nnodes) => + `# Multi-node (${nnodes} nodes). 
Run the same command on every node with:\n` + + `# = 0 on the head node, 1..${nnodes - 1} on the others\n` + + `# = IP of the head node (reachable from all others)\n` + + `${cmd}`; + + // Toggles whose value is forced by the current variant + hardware. Returns + // { optionName -> { force: "enabled" | "disabled", reason } }. The render + // layer grays out the OTHER radio, and a useEffect snaps the value to the + // forced choice so the UI never disagrees with the generated command. + const computeConstraints = (variant, hardware) => { + const isPro = variant === "pro"; + const spec = HW_VARIANT_SPEC[`${variant}|${hardware}`]; + const blackwell = spec ? spec.blackwell : false; + // Derive the TPU flag from the hardware id rather than from spec: there is + // no "base|tpu-*" entry, yet the Pro-only constraint must still apply there. + const jax = hardware.startsWith("tpu-"); + const c = {}; + if (!isPro) { + // V2.5 checkpoint is TP=4-interleaved; tp/dp must equal 4. With dp>1 we + // must use DP-attention; with dp=1 it must be off (single attention group). + if (spec && spec.dp > 1) { + c.dpAttention = { force: "enabled", reason: "V2.5 checkpoint is TP=4-interleaved; DP-attention is required (--dp = tp/4)." }; + } else { + c.dpAttention = { force: "disabled", reason: "Single attention group on this hardware (tp=4, dp=1)." }; + } + } + if (blackwell) { + // DeepEP upstream targets Ampere/Hopper PTX; only experimental paths exist + // for sm_100 in sglang and the verified Blackwell stack uses flashinfer_trtllm. + c.deepep = { force: "disabled", reason: "Blackwell uses flashinfer_trtllm; DeepEP is Hopper / Ampere only." }; + } + if (jax) { + // sgl-jax stack: only V2.5-Pro is supported on TPU today; speculative + // decoding and the DeepEP CUDA backend do not apply to the JAX runtime. + // EP is always on (both verified launch commands set --ep-size = --tp-size). + c.modelVariant = { force: "pro", reason: "sgl-jax TPU runtime only supports MiMo-V2.5-Pro today." }; + c.eagleMtp = { force: "disabled", reason: "EAGLE MTP is not supported on the sgl-jax TPU runtime."
}; + c.deepep = { force: "disabled", reason: "DeepEP is a CUDA-only backend; sgl-jax uses the fused Pallas MoE kernel." }; + c.expertParallelism = { force: "enabled", reason: "sgl-jax TPU recipes always use EP = TP." }; + } + return c; + }; + + const resolveItems = (option, constraints) => { + const c = constraints[option.name]; + if (!c) return option.items; + // Gray out every item that doesn't match the forced choice. Works for both + // binary (enabled/disabled) toggles and N-way options like modelVariant. + return option.items.map((item) => + item.id !== c.force ? { ...item, disabled: true, disabledReason: c.reason } : item, + ); + }; + + const getInitialState = () => { + const initialState = {}; + const constraints = computeConstraints("pro", "h200"); + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, constraints); + const def = items.find((i) => i.default && !i.disabled) || items.find((i) => !i.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains("dark") || + html.getAttribute("data-theme") === "dark" || + html.style.colorScheme === "dark"; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ["class", "data-theme", "style"], + }); + return () => observer.disconnect(); + }, []); + + // Snap forced toggles to their required value whenever variant/hardware + // changes — keeps the visible radio in sync with the generated command. 
+ useEffect(() => { + const constraints = computeConstraints(values.modelVariant, values.hardware); + let patch = null; + for (const [key, c] of Object.entries(constraints)) { + if (values[key] !== c.force) { + patch = patch || {}; + patch[key] = c.force; + } + } + if (patch) setValues((prev) => ({ ...prev, ...patch })); + }, [values.modelVariant, values.hardware]); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const generateCommand = () => { + const { modelVariant, hardware, eagleMtp, dpAttention, expertParallelism, deepep, reasoningParser, toolcall } = values; + const specKey = `${modelVariant}|${hardware}`; + const spec = HW_VARIANT_SPEC[specKey]; + // Guard: pairs without a spec entry (e.g. V2.5 on a TPU) would otherwise + // throw when destructuring undefined; show a placeholder instead. + if (!spec) return "# Please select a compatible model variant and hardware combination"; + const { slug, tp, multinode, nnodes, blackwell, jax } = spec; + const isPro = modelVariant === "pro"; + + // ---------------- sgl-jax (TPU) branch ---------------- + if (jax) { + // Recipe sources: + // v7x: tp=ep=32, dp=4, omits --attention-backend, mem-frac 0.95, swa 0.25 + // v6e: tp=ep=64, dp=8, --attention-backend fa, mem-frac 0.92, swa 0.15 + // + // sgl-jax conventions: + // - `--tp-size` is always the total JAX device count; per-DP TP is + // derived automatically as tp/dp. + // - No `--enable-dp-attention` flag: DP attention is the default + // (FFN layers auto-pick EP-split for MoE, attn-TP-split for dense). + const isV7x = hardware === "tpu-v7x"; + const useEp = expertParallelism === "enabled"; + const useDpAttn = dpAttention === "enabled"; + const dpSize = isV7x ?
4 : 8; + const flags = []; + flags.push(` --model-path ${slug}`); + flags.push(" --trust-remote-code"); + flags.push(` --tp-size ${tp}`); + if (useEp) flags.push(` --ep-size ${tp}`); + if (useDpAttn) flags.push(` --dp-size ${dpSize}`); + flags.push(" --moe-backend fused"); + if (!isV7x) flags.push(" --attention-backend fa"); + flags.push(" --host 0.0.0.0"); + flags.push(" --port 30000"); + flags.push(" --page-size 256"); + flags.push(" --context-length 262144"); + flags.push(" --chunked-prefill-size 4096"); + flags.push(" --max-running-requests 512"); + if (isV7x) { + flags.push(" --dtype bfloat16"); + flags.push(" --mem-fraction-static 0.95"); + flags.push(" --swa-full-tokens-ratio 0.25"); + flags.push(" --log-level info"); + } else { + flags.push(" --max-seq-len 4096"); + flags.push(" --max-prefill-tokens 16384"); + flags.push(" --mem-fraction-static 0.92"); + flags.push(" --swa-full-tokens-ratio 0.15"); + } + if (reasoningParser === "enabled") flags.push(" --reasoning-parser mimo"); + if (toolcall === "enabled") flags.push(" --tool-call-parser mimo"); + flags.push(` --nnodes ${nnodes}`); + flags.push(" --node-rank "); + flags.push(" --dist-init-addr :20000"); + const cmd = `JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python -m sgl_jax.launch_server \\\n${flags.join(" \\\n")}`; + return prependMultiNodeNote(cmd, nnodes); + } + + // ---------------- CUDA (sglang serve) branch ---------------- + // Toggles. EAGLE MTP / EP / DeepEP / DP-attn are gated by hardware + variant + // through computeConstraints; here we just read the (already-snapped) value. + const useMtp = eagleMtp === "enabled"; + const useDeepep = !blackwell && deepep === "enabled"; + const useEp = isPro && !blackwell && expertParallelism === "enabled"; + const useDpAttn = dpAttention === "enabled"; + // dp size: V2.5 picks tp/4 from spec; Pro picks tp. + const dpSize = !isPro ? 
spec.dp : tp; + + // ---- env (kept inline before `sglang serve`, matching the verified launch style) ---- + const envVars = []; + if (isPro && blackwell && multinode) { + envVars.push("NCCL_MNNVL_ENABLE=1", "NCCL_CUMEM_ENABLE=1"); + } + if (useMtp) envVars.push("SGLANG_ENABLE_SPEC_V2=1"); + if (useDeepep) envVars.push("SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256"); + + // ---- flags ---- + const flags = []; + flags.push(" --trust-remote-code"); + flags.push(` --model-path ${slug}`); + flags.push(` --tp ${tp}`); + + if (useDpAttn) { + flags.push(` --dp ${dpSize}`); + flags.push(" --enable-dp-attention"); + if (!isPro) { + flags.push(" --enable-dp-lm-head"); + flags.push(" --mm-enable-dp-encoder"); + } + } + + if (useEp) flags.push(` --ep ${tp}`); + + if (multinode) flags.push(...multiNodeFlags(nnodes)); + + // MoE backend: Blackwell uses flashinfer_trtllm (hardware-driven); Hopper + // optionally uses DeepEP (toggle). + if (isPro && blackwell) { + flags.push(" --moe-runner-backend flashinfer_trtllm"); + } else if (useDeepep) { + flags.push(" --moe-a2a-backend deepep"); + if (!isPro) flags.push(" --deepep-mode auto"); + flags.push(" --moe-dense-tp-size 1"); + } + + if (isPro) { + if (blackwell) { + flags.push(" --attention-backend fa4"); + flags.push(" --mem-fraction-static 0.8"); + flags.push(" --max-running-requests 128"); + flags.push(" --chunked-prefill-size 16384"); + if (hardware === "b200") flags.push(" --swa-full-tokens-ratio 0.1"); + flags.push(` --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}'`); + } else { + flags.push(" --mem-fraction-static 0.7"); + flags.push(" --max-running-requests 128"); + flags.push(" --chunked-prefill-size 32768"); + flags.push(" --cuda-graph-max-bs 64"); + flags.push(" --page-size 64"); + flags.push(" --swa-full-tokens-ratio 0.3"); + flags.push(` --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}'`); + } + } else { + flags.push(" --mem-fraction-static 
0.65"); + flags.push(" --chunked-prefill-size 16384"); + } + + if (useMtp) { + flags.push(" --speculative-algorithm EAGLE"); + flags.push(" --speculative-num-steps 3"); + flags.push(" --speculative-eagle-topk 1"); + flags.push(" --speculative-num-draft-tokens 4"); + if (!blackwell) flags.push(" --enable-multi-layer-eagle"); + } + + if (reasoningParser === "enabled") flags.push(" --reasoning-parser mimo"); + if (toolcall === "enabled") flags.push(" --tool-call-parser mimo"); + + flags.push(" --host 0.0.0.0"); + flags.push(" --port 30000"); + + const envInline = envVars.length ? envVars.join(" ") + " " : ""; + const base = `${envInline}sglang serve \\\n${flags.join(" \\\n")}`; + return multinode ? prependMultiNodeNote(base, nnodes) : base; + }; + + // ---- styles ---- + const containerStyle = { maxWidth: "900px", margin: "0 auto", display: "flex", flexDirection: "column", gap: "4px" }; + const cardStyle = { + padding: "8px 12px", + border: `1px solid ${isDark ? "#374151" : "#e5e7eb"}`, + borderLeft: `3px solid ${isDark ? "#E85D4D" : "#D45D44"}`, + borderRadius: "4px", + display: "flex", + alignItems: "center", + gap: "12px", + background: isDark ? "#1f2937" : "#fff", + }; + const titleStyle = { fontSize: "13px", fontWeight: "600", minWidth: "140px", flexShrink: 0, color: isDark ? "#e5e7eb" : "inherit" }; + const itemsStyle = { display: "flex", rowGap: "2px", columnGap: "6px", flexWrap: "wrap", alignItems: "center", flex: 1 }; + const labelBaseStyle = { + padding: "4px 10px", + border: `1px solid ${isDark ? "#9ca3af" : "#d1d5db"}`, + borderRadius: "3px", + cursor: "pointer", + display: "inline-flex", + flexDirection: "column", + alignItems: "center", + justifyContent: "center", + fontWeight: "500", + fontSize: "13px", + transition: "all 0.2s", + userSelect: "none", + minWidth: "45px", + textAlign: "center", + flex: 1, + background: isDark ? "#374151" : "#fff", + color: isDark ? 
"#e5e7eb" : "inherit", + }; + const checkedStyle = { background: "#D45D44", color: "white", borderColor: "#D45D44" }; + const disabledStyle = { cursor: "not-allowed", opacity: 0.4 }; + const subtitleStyle = { display: "block", fontSize: "9px", marginTop: "1px", lineHeight: "1.1", opacity: 0.7 }; + const commandDisplayStyle = { + flex: 1, + padding: "12px 16px", + background: isDark ? "#111827" : "#f5f5f5", + borderRadius: "6px", + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: "12px", + lineHeight: "1.5", + color: isDark ? "#e5e7eb" : "#374151", + whiteSpace: "pre-wrap", + overflowX: "auto", + margin: 0, + border: `1px solid ${isDark ? "#374151" : "#e5e7eb"}`, + }; + + const constraints = computeConstraints(values.modelVariant, values.hardware); + + return ( +
+ {Object.entries(options).map(([key, option]) => { + const items = resolveItems(option, constraints); + return ( +
+
{option.title}
+
+ {items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/minimax-m2-deployment.jsx b/docs_new/src/snippets/autoregressive/minimax-m2-deployment.jsx new file mode 100644 index 000000000000..420199f529df --- /dev/null +++ b/docs_new/src/snippets/autoregressive/minimax-m2-deployment.jsx @@ -0,0 +1,353 @@ +export const MiniMaxM2Deployment = () => { + const modelFamily = 'MiniMaxAI'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelname: { + name: 'modelname', + title: 'Model Name', + items: [ + { id: 'M2.1', label: 'MiniMax-M2.1', default: true }, + { id: 'M2', label: 'MiniMax-M2', default: false } + ] + }, + strategy: { + name: 'strategy', + title: 'Deployment Strategy', + type: 'checkbox', + items: [ + { id: 'tp', label: 'TP', default: true, required: true }, + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + const generateCommand = (values) => { + const { hardware, modelname, strategy, reasoning, toolcall } = values; + + const modelMap = { + 'M2.1': 'MiniMax-M2.1', + 'M2': 'MiniMax-M2' + }; + + const modelName = `${modelFamily}/${modelMap[modelname]}`; + + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + + cmd += ` \\\n --tp 4`; + + cmd += ` \\\n --trust-remote-code`; + + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser minimax-m2`; + } + + if (reasoning === 'enabled') { + cmd += ` \\\n --reasoning-parser minimax-append-think`; + } + + return cmd; + }; + + const 
getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/minimax-m25-deployment.jsx b/docs_new/src/snippets/autoregressive/minimax-m25-deployment.jsx new file mode 100644 index 000000000000..a0aa9574009c --- /dev/null +++ b/docs_new/src/snippets/autoregressive/minimax-m25-deployment.jsx @@ -0,0 +1,390 @@ +export const MiniMaxM25Deployment = () => { + const modelFamily = 'MiniMaxAI'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'a100', label: 'A100', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + gpuCount: { + name: 'gpuCount', + title: 'GPU Count', + getDynamicItems: (values) => { + const isAMD = values.hardware === 'mi300x' || values.hardware === 'mi325x' || values.hardware === 'mi355x'; + return [ + { + id: '2gpu', + label: '2', + default: isAMD, + disabled: !isAMD + }, + { + id: '4gpu', + label: '4', + default: !isAMD, + disabled: false + }, + { + id: '8gpu', + label: '8', + default: false, + disabled: false + } + ]; + } + }, + thinking: { + name: 'thinking', + title: 'Thinking Capabilities', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser minimax-append-think' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? 
'--tool-call-parser minimax-m2' : null + } + }; + + const generateCommand = (values) => { + const { hardware, gpuCount, thinking, toolcall } = values; + + const isAMD = hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x'; + if (gpuCount === '2gpu' && !isAMD) { + return '# Please select compatible hardware\n# 2-GPU requires AMD MI300X/MI325X/MI355X'; + } + + const modelName = `${modelFamily}/MiniMax-M2.5`; + + let cmd = ''; + cmd += 'python -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelName}`; + + if (gpuCount === '8gpu') { + cmd += ` \\\n --tp 8`; + cmd += ` \\\n --ep 8`; + } else if (gpuCount === '4gpu') { + cmd += ` \\\n --tp 4`; + if (isAMD) { + cmd += ` \\\n --ep 4`; + } + } else if (gpuCount === '2gpu') { + cmd += ` \\\n --tp 2`; + if (isAMD) { + cmd += ` \\\n --ep 2`; + } + } + + if (toolcall === 'enabled') { + cmd += ` \\\n --tool-call-parser minimax-m2`; + } + + if (thinking === 'enabled') { + cmd += ` \\\n --reasoning-parser minimax-append-think`; + } + + cmd += ` \\\n --trust-remote-code`; + cmd += ` \\\n --mem-fraction-static 0.85`; + + if (isAMD) { + cmd += ` \\\n --kv-cache-dtype fp8_e4m3`; + cmd += ` \\\n --attention-backend triton`; + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || 
''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? 
'#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/minimax-m27-deployment.jsx b/docs_new/src/snippets/autoregressive/minimax-m27-deployment.jsx new file mode 100644 index 000000000000..198f4ca5b719 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/minimax-m27-deployment.jsx @@ -0,0 +1,203 @@ +export const MiniMaxM27Deployment = () => { + // Config options. `getDynamicItems(values)` is evaluated at render time so that + // e.g. the 2-GPU option is only enabled on AMD or GB300 hardware. + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'gb300', label: 'GB300', default: false }, + { id: 'a100', label: 'A100', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + gpuCount: { + name: 'gpuCount', + title: 'GPU Count', + getDynamicItems: (values) => { + const hw = values.hardware; + const isAMD = hw === 'mi300x' || hw === 'mi325x' || hw === 'mi355x'; + const isGB300 = hw === 'gb300'; + const canUse2GPU = isAMD || isGB300; + return [ + { id: '2gpu', label: '2', default: canUse2GPU, disabled: !canUse2GPU }, + { id: '4gpu', label: '4', default: !canUse2GPU, disabled: false }, + { id: '8gpu', label: '8', default: false, disabled: isGB300 } + ]; + } + }, + thinking: { + name: 'thinking', + title: 'Thinking Capabilities', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + } + }; + + // Helper: resolve an option's items (static or dynamic) given current values + const 
resolveItems = (option, values) => { + if (typeof option.getDynamicItems === 'function') { + return option.getDynamicItems(values); + } + return option.items; + }; + + const getInitialState = () => { + const initialState = {}; + // Resolve hardware first so gpuCount's dynamic items can see it + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, initialState); + const defaultItem = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled) || items[0]; + initialState[key] = defaultItem.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + // When hardware changes, re-evaluate gpuCount so disabled/default shifts apply + useEffect(() => { + setValues(prev => { + const next = { ...prev }; + for (const [key, option] of Object.entries(options)) { + if (typeof option.getDynamicItems !== 'function') continue; + const items = option.getDynamicItems(next); + const current = items.find(i => i.id === next[key]); + if (!current || current.disabled) { + const fallback = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled); + if (fallback) next[key] = fallback.id; + } + } + return next; + }); + }, [values.hardware]); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command mirrors sgl-cookbook 
src/components/autoregressive/MiniMaxM27ConfigGenerator/index.js + const generateCommand = () => { + const { hardware, gpuCount, thinking, toolcall } = values; + + const isAMD = hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x'; + const isGB300 = hardware === 'gb300'; + const canUse2GPU = isAMD || isGB300; + + if (gpuCount === '2gpu' && !canUse2GPU) { + return '# Please select compatible hardware\n# 2-GPU requires AMD MI300X/MI325X/MI355X or GB300'; + } + + const modelName = 'MiniMaxAI/MiniMax-M2.7'; + + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + + if (gpuCount === '8gpu') { + cmd += ' \\\n --tp 8'; + cmd += ' \\\n --ep 8'; + } else if (gpuCount === '4gpu') { + cmd += ' \\\n --tp 4'; + if (isAMD) cmd += ' \\\n --ep 4'; + } else if (gpuCount === '2gpu') { + cmd += ' \\\n --tp 2'; + if (isAMD) cmd += ' \\\n --ep 2'; + } + + if (toolcall === 'enabled') cmd += ' \\\n --tool-call-parser minimax-m2'; + if (thinking === 'enabled') cmd += ' \\\n --reasoning-parser minimax-append-think'; + + cmd += ' \\\n --trust-remote-code'; + cmd += ' \\\n --mem-fraction-static 0.85'; + + if (isAMD) { + cmd += ' \\\n --kv-cache-dtype fp8_e4m3'; + cmd += ' \\\n --attention-backend triton'; + } + + return cmd; + }; + + // Styles + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/ministral-3-deployment.jsx b/docs_new/src/snippets/autoregressive/ministral-3-deployment.jsx new file mode 100644 index 000000000000..8092c8e702e8 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/ministral-3-deployment.jsx @@ -0,0 +1,348 @@ +export const Ministral3Deployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + model: { + name: 'model', + title: 'Model', + items: [ + { id: 'small', label: 'Ministral-3-8B-Instruct-2512', default: true }, + { id: 'large', label: 'Ministral-3-14B-Instruct-2512', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ], + commandRule: (value) => (value === 'enabled' ?
'--tool-call-parser mistral' : null) + } + }; + + const modelConfigs = { + small: { + modelId: 'mistralai/Ministral-3-8B-Instruct-2512', + tpByHardware: { mi300x: 1, mi325x: 1, mi355x: 1 } + }, + large: { + modelId: 'mistralai/Ministral-3-14B-Instruct-2512', + tpByHardware: { mi300x: 1, mi325x: 1, mi355x: 1 } + } + }; + + const generateCommand = (values) => { + const { hardware, model } = values; + + const modelCfg = modelConfigs[model]; + if (!modelCfg) return `# Error: Unknown model selection: ${model}`; + + const tp = modelCfg.tpByHardware[hardware]; + if (!tp) return `# Error: Unknown hardware platform: ${hardware}`; + + let cmd = 'sglang serve \\\n'; + + cmd += ` --model-path ${modelCfg.modelId}`; + + if (tp > 1) { + cmd += ` \\\n --tp ${tp}`; + } + + cmd += ` \\\n --trust-remote-code`; + + for (const [key, option] of Object.entries(options)) { + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) cmd += ` \\\n ${rule}`; + } + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? 
defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? 
'#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/mistral-medium-3-5-deployment.jsx b/docs_new/src/snippets/autoregressive/mistral-medium-3-5-deployment.jsx new file mode 100644 index 000000000000..0976d5b3978e --- /dev/null +++ b/docs_new/src/snippets/autoregressive/mistral-medium-3-5-deployment.jsx @@ -0,0 +1,349 @@ +export const MistralMedium35Deployment = () => { + const modelId = 'mistralai/Mistral-Medium-3.5-128B'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'b300', label: 'B300', default: false }, + ], + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser mistral' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? '--tool-call-parser mistral' : null + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding (EAGLE)', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? 
'--dtype bfloat16 \\\n --speculative-algorithm EAGLE \\\n --speculative-draft-model-path mistralai/Mistral-Medium-3.5-128B-EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4' : null + }, + }; + + // 128B dense FP8 ≈ 130GB, plus KV cache headroom + const modelConfigs = { + h100: { tp: 4 }, + h200: { tp: 4 }, + b200: { tp: 2 }, + b300: { tp: 2 }, + }; + + const generateCommand = (values) => { + const { hardware } = values; + const hwConfig = modelConfigs[hardware]; + if (!hwConfig) return `# Error: Unknown hardware platform: ${hardware}`; + const { tp } = hwConfig; + + let cmd = `sglang serve --model-path ${modelId}`; + cmd += ` \\\n --tp ${tp}`; + + Object.entries(options).forEach(([key, option]) => { + if (key === 'hardware') return; + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) cmd += ` \\\n ${rule}`; + } + }); + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ?
defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? 
'#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/mistral-small-4-deployment.jsx b/docs_new/src/snippets/autoregressive/mistral-small-4-deployment.jsx new file mode 100644 index 000000000000..d70f47b58273 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/mistral-small-4-deployment.jsx @@ -0,0 +1,365 @@ +export const MistralSmall4Deployment = () => { + const modelId = 'mistralai/Mistral-Small-4-119B-2603'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + getDynamicItems: (values) => { + const isNvfp4 = values.quantization === 'fp4'; + return [ + { id: 'h100', label: 'H100', default: !isNvfp4, disabled: isNvfp4 }, + { id: 'h200', label: 'H200', default: false, disabled: isNvfp4 }, + { id: 'b200', label: 'B200', default: isNvfp4, disabled: false }, + { id: 'b300', label: 'B300', default: false, disabled: false }, + ]; + } + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'fp8', label: 'FP8', default: true }, + { id: 'fp4', label: 'NVFP4', subtitle: 'Blackwell only', default: false }, + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser mistral' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? '--tool-call-parser mistral' : null + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding (EAGLE)', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? 
'--speculative-algorithm EAGLE \\\n --speculative-draft-model-path mistralai/Mistral-Small-4-119B-2603-eagle \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4' : null + }, + }; + + const modelConfigs = { + h100: { fp8: { tp: 2 } }, + h200: { fp8: { tp: 2 } }, + b200: { fp8: { tp: 1 }, fp4: { tp: 1 } }, + b300: { fp8: { tp: 1 }, fp4: { tp: 1 } }, + }; + + const generateCommand = (values) => { + const { hardware, quantization } = values; + + const hwConfig = modelConfigs[hardware]?.[quantization]; + if (!hwConfig) return `# Error: Unknown hardware/quantization combination`; + + const { tp } = hwConfig; + + const modelName = quantization === 'fp4' + ? 'mistralai/Mistral-Small-4-119B-2603-NVFP4' + : modelId; + + let cmd = `sglang serve --model-path ${modelName}`; + cmd += ` \\\n --tp ${tp}`; + + Object.entries(options).forEach(([key, option]) => { + if (key === 'quantization' || key === 'hardware') return; + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) cmd += ` \\\n ${rule}`; + } + }); + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = 
innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? 
'#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/nemotron3-nano-deployment.jsx b/docs_new/src/snippets/autoregressive/nemotron3-nano-deployment.jsx new file mode 100644 index 000000000000..421bcb46ed80 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/nemotron3-nano-deployment.jsx @@ -0,0 +1,371 @@ +export const Nemotron3NanoDeployment = () => { + const modelFamily = 'nvidia'; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: true } + ] + }, + modelVariant: { + name: 'modelVariant', + title: 'Model Variant', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false }, + { id: 'nvfp4', label: 'NVFP4', default: false } + ] + }, + tp: { + name: 'tp', + title: 'Tensor Parallel (TP)', + items: [ + { id: '1', label: 'TP=1', default: true }, + { id: '2', label: 'TP=2', default: false }, + { id: '4', label: 'TP=4', default: false }, + { id: '8', label: 'TP=8', default: false } + ] + }, + kvcache: { + name: 'kvcache', + title: 'KV Cache DType', + items: [ + { id: 'fp8_e4m3', label: 'fp8_e4m3', default: true }, + { id: 'bf16', label: 'bf16', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser nemotron_3' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? 
'--tool-call-parser qwen3_coder' : null + } + }; + + const generateCommand = (values) => { + const { hardware, modelVariant, tp, kvcache, thinking, toolcall } = values; + + // Default to FP8 if not selected + const variant = modelVariant || 'fp8'; + const baseName = 'NVIDIA-Nemotron-3-Nano-30B-A3B'; + + const modelName = `${modelFamily}/${baseName}-${variant.toUpperCase()}`; + + let cmd = 'python3 -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelName} \\\n`; + cmd += ` --trust-remote-code \\\n`; + cmd += ` --tp ${tp} \\\n`; + cmd += ` --kv-cache-dtype ${kvcache} \\\n`; + + // Add thinking parser and tool call parser if enabled + for (const [key, option] of Object.entries(options)) { + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) { + cmd += ` ${rule} \\\n`; + } + } + } + + // Remove trailing backslash from last option + cmd = cmd.trimEnd(); + if (cmd.endsWith('\\')) { + cmd = cmd.slice(0, -1).trimEnd(); + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? 
defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? 
'#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/nemotron3-nano-omni-deployment.jsx b/docs_new/src/snippets/autoregressive/nemotron3-nano-omni-deployment.jsx new file mode 100644 index 000000000000..3793cbff89ef --- /dev/null +++ b/docs_new/src/snippets/autoregressive/nemotron3-nano-omni-deployment.jsx @@ -0,0 +1,200 @@ +export const Nemotron3NanoOmniDeployment = () => { + const MODEL_PATHS = { + reasoning: 'nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning', + bf16: 'nvidia/Nemotron-3-Nano-Omni-30B-A3B-BF16', + fp8: 'nvidia/Nemotron-3-Nano-Omni-30B-A3B-FP8', + nvfp4: 'nvidia/Nemotron-3-Nano-Omni-30B-A3B-NVFP4', + }; + + const options = { + model: { + name: 'model', + title: 'Model', + items: [ + { id: 'reasoning', label: 'Reasoning', default: true }, + { id: 'bf16', label: 'BF16', default: false }, + { id: 'fp8', label: 'FP8', default: false }, + { id: 'nvfp4', label: 'NVFP4', default: false }, + ], + }, + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: false }, + { id: 'a100', label: 'A100', default: false }, + { id: 'l40s', label: 'L40S', default: false }, + ], + }, + tp: { + name: 'tp', + title: 'Tensor Parallel (TP)', + items: [ + { id: '1', label: 'TP=1', default: false }, + { id: '2', label: 'TP=2', default: false }, + { id: '4', label: 'TP=4', default: true }, + { id: '8', label: 'TP=8', default: false }, + ], + }, + kvcache: { + name: 'kvcache', + title: 'KV Cache DType', + items: [ + { id: 'none', label: 'None', default: true }, + { id: 'fp8_e4m3', label: 'fp8_e4m3', default: false }, + ], + }, + thinking: { + name: 'thinking', + title: 'Reasoning Parser', + items: [ + { id: 'thinking_on', label: 'Enabled', default: true }, + { id: 'thinking_off', label: 'Disabled', default: false }, + ], + commandRule: (value) => value === 'thinking_on' ? 
'--reasoning-parser deepseek-r1' : null, + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'toolcall_on', label: 'Enabled', default: true }, + { id: 'toolcall_off', label: 'Disabled', default: false }, + ], + commandRule: (value) => value === 'toolcall_on' ? '--tool-call-parser qwen3_coder' : null, + }, + }; + + const generateCommand = (values) => { + const { tp, kvcache, model, hardware } = values; + + if (model === 'nvfp4' && hardware !== 'b200') { + return '# NVFP4 requires Blackwell hardware. Please select B200.'; + } + + if (hardware === 'l40s' && tp === '1') { + return '# TP=1 is not supported on L40S for this model. Please use TP=2 or higher.'; + } + + const modelPath = MODEL_PATHS[model] || MODEL_PATHS.reasoning; + + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelPath} \\\n`; + cmd += ' --host 0.0.0.0 \\\n'; + cmd += ' --port 30000 \\\n'; + cmd += ' --trust-remote-code \\\n'; + cmd += ` --tp ${tp} \\\n`; + + if (kvcache && kvcache !== 'none') { + cmd += ` --kv-cache-dtype ${kvcache} \\\n`; + } + + for (const [key, option] of Object.entries(options)) { + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) { + cmd += ` ${rule} \\\n`; + } + } + } + + cmd = cmd.trimEnd(); + if (cmd.endsWith('\\')) { + cmd = cmd.slice(0, -1).trimEnd(); + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + const items = option.items || []; + const defaultItem = items.find((item) => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : items[0]?.id || ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? 
'#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + const items = option.items || []; + return ( +
+
{option.title}
+
+ {items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/nemotron3-super-deployment.jsx b/docs_new/src/snippets/autoregressive/nemotron3-super-deployment.jsx new file mode 100644 index 000000000000..c9c623695fb9 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/nemotron3-super-deployment.jsx @@ -0,0 +1,381 @@ +export const Nemotron3SuperDeployment = () => { + const MODEL_PATHS = { + bf16: 'nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16', + fp8: 'nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8', + nvfp4: 'nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4', + }; + + const options = { + model: { + name: 'model', + title: 'Model', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false }, + { id: 'nvfp4', label: 'NVFP4', default: false }, + ] + }, + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: true } + ] + }, + tp: { + name: 'tp', + title: 'Tensor Parallel (TP)', + items: [ + { id: '2', label: 'TP=2', default: false }, + { id: '4', label: 'TP=4', default: true }, + { id: '8', label: 'TP=8', default: false } + ] + }, + mtp: { + name: 'mtp', + title: 'Multi-token Prediction (MTP)', + items: [ + { id: 'enabled', label: 'Enabled', default: false }, + { id: 'disabled', label: 'Disabled', default: true } + ], + commandRule: (value) => value === 'enabled' ? 
'--speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4 \\\n --disable-radix-cache' : null + }, + kvcache: { + name: 'kvcache', + title: 'KV Cache DType', + items: [ + { id: 'none', label: 'None', default: true }, + { id: 'fp8_e4m3', label: 'fp8_e4m3', default: false }, + { id: 'bf16', label: 'bf16', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Reasoning Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser nemotron_3' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--tool-call-parser qwen3_coder' : null + } + }; + + const generateCommand = (values) => { + const { tp, kvcache, model } = values; + + const modelPath = MODEL_PATHS[model] || MODEL_PATHS['bf16']; + + let cmd = 'python3 -m sglang.launch_server \\\n'; + cmd += ` --model-path ${modelPath} \\\n`; + cmd += ` --trust-remote-code \\\n`; + cmd += ` --tp ${tp} \\\n`; + + if (kvcache && kvcache !== 'none') { + cmd += ` --kv-cache-dtype ${kvcache} \\\n`; + } + + for (const [key, option] of Object.entries(options)) { + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) { + cmd += ` ${rule} \\\n`; + } + } + } + + cmd = cmd.trimEnd(); + if (cmd.endsWith('\\')) { + cmd = cmd.slice(0, -1).trimEnd(); + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] 
= option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + 
+ const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? 
'#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen25-vl-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen25-vl-deployment.jsx new file mode 100644 index 000000000000..4c686cb42e84 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen25-vl-deployment.jsx @@ -0,0 +1,364 @@ +export const Qwen25VLDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '72b', label: '72B', subtitle: 'Dense', default: true }, + { id: '32b', label: '32B', subtitle: 'Dense', default: false }, + { id: '7b', label: '7B', subtitle: 'Dense', default: false }, + { id: '3b', label: '3B', subtitle: 'Dense', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true } + ] + } + }; + + const modelConfigs = { + '72b': { + baseName: '72B', + mi300x: { tp: 8, ep: 0 }, + mi325x: { tp: 8, ep: 0 }, + mi355x: { tp: 8, ep: 0 } + }, + '32b': { + baseName: '32B', + mi300x: { tp: 2, ep: 0 }, + mi325x: { tp: 2, ep: 0 }, + mi355x: { tp: 2, ep: 0 } + }, + '7b': { + baseName: '7B', + mi300x: { tp: 1, ep: 0 }, + mi325x: { tp: 1, ep: 0 }, + mi355x: { tp: 1, ep: 0 } + }, + '3b': { + baseName: '3B', + mi300x: { tp: 1, ep: 0 }, + mi325x: { tp: 1, ep: 0 }, + mi355x: { tp: 1, ep: 0 } + } + }; + + const generateCommand = (values) => { + const { hardware, modelsize: modelSize } = values; + + const modelSizeConfig = modelConfigs[modelSize]; + if (!modelSizeConfig) { + return `# Error: Unknown model size: ${modelSize}`; + } + + const hwConfig = modelSizeConfig[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const modelName = 
`Qwen/Qwen2.5-VL-${modelSizeConfig.baseName}-Instruct`; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + + if ((hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') && modelSize === '72b') { + cmd += ` \\\n --context-length 128000`; + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen3-coder-480b-a35b-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen3-coder-480b-a35b-deployment.jsx new file mode 100644 index 000000000000..a249854643c3 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen3-coder-480b-a35b-deployment.jsx @@ -0,0 +1,139 @@ +export const Qwen3CoderDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + } + }; + + // Model configurations + const modelConfigs = { + '480b': { + baseName: '480B-A35B', + mi300x: { tp: 8, ep: 0 } + } + }; + + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command + const generateCommand = () => { + const { hardware, quantization } = values; + + const config = modelConfigs['480b']; + const hwConfig = config[hardware]; + + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + // Build model name + const quantSuffix = quantization === 'fp8' ? 
'-FP8' : ''; + const modelName = `Qwen/Qwen3-Coder-${config.baseName}-Instruct${quantSuffix}`; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + // TP is always 8 for this model + cmd += ` \\\n --tp ${hwConfig.tp}`; + + // FP8 requires EP=2 for MoE dimension alignment + if (quantization === 'fp8') { + cmd += ` \\\n --ep 2`; + } + + // Context length verified on MI300X + cmd += ` \\\n --context-length 8192`; + + // Page size for MoE models + cmd += ` \\\n --page-size 32`; + + // FP8 requires trust-remote-code + if (quantization === 'fp8') { + cmd += ` \\\n --trust-remote-code`; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? 
'#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = item.disabled; + return ( + + ); + })} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen3-coder-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen3-coder-deployment.jsx new file mode 100644 index 000000000000..1ceccbf42e09 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen3-coder-deployment.jsx @@ -0,0 +1,426 @@ +export const Qwen3CoderDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false }, + { id: 'b200', label: 'B200', default: false }, + { id: 'gb200', label: 'GB200', default: false } + ] + }, + modelSize: { + name: 'modelSize', + title: 'Model Size', + items: [ + { id: '480b', label: '480B', subtitle: 'MOE', default: true }, + { id: '30b', label: '30B', subtitle: 'MOE', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false }, + { id: 'nvfp4', label: 'NVFP4', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? 
'--tool-call-parser qwen3_coder' : null + } + }; + + const modelConfigs = { + '480b': { + baseName: '480B-A35B', + mi300x: { tp: 8 }, + mi325x: { tp: 8 }, + mi355x: { tp: 8 }, + b200: { tp: 8 }, + gb200: { tp: 8 } + }, + '30b': { + baseName: '30B-A3B', + mi300x: { tp: 1 }, + mi325x: { tp: 1 }, + mi355x: { tp: 1 } + } + }; + + const generateCommand = (values) => { + const { hardware, modelSize, quantization } = values; + + const isNvidia = hardware === 'b200' || hardware === 'gb200'; + + const modelConfig = modelConfigs[modelSize]; + const hwConfig = modelConfig[hardware]; + + if (!hwConfig) { + return `# Configuration not available: ${modelSize.toUpperCase()} model has not been verified on ${hardware.toUpperCase()}.`; + } + + // NVFP4 is only available on NVIDIA hardware + if (quantization === 'nvfp4' && !isNvidia) { + return `# NVFP4 quantization is only available on NVIDIA B200/GB200 hardware.`; + } + + // BF16 not verified on NVIDIA + if (quantization === 'bf16' && isNvidia) { + return `# BF16 deployment on ${hardware.toUpperCase()} has not been verified yet. Please use FP8 or NVFP4.`; + } + + // Build model name + let modelName; + if (quantization === 'nvfp4') { + modelName = `nvidia/Qwen3-Coder-${modelConfig.baseName}-Instruct-NVFP4`; + } else { + const quantSuffix = quantization === 'fp8' ?
'-FP8' : ''; + modelName = `Qwen/Qwen3-Coder-${modelConfig.baseName}-Instruct${quantSuffix}`; + } + + let cmd = ''; + if (!isNvidia) { + cmd += 'SGLANG_USE_AITER=0 '; + } + cmd += 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + // TP setting + cmd += ` \\\n --tp ${hwConfig.tp}`; + + // EP and DP attention settings + if (quantization === 'nvfp4') { + cmd += ` \\\n --ep 1`; + cmd += ` \\\n --enable-dp-attention`; + } else if (modelSize === '480b' && quantization === 'fp8') { + // FP8 requires EP=2 for 480B model due to MoE dimension alignment + // moe_intermediate_size=2560, with tp=8 ep=1: 2560/8=320, 320%128!=0 + // with tp=8 ep=2: 2560/4=640, 640%128=0 ✓ + cmd += ` \\\n --ep 2`; + } + + // MOE runner backend for NVIDIA + if (isNvidia) { + if (quantization === 'nvfp4') { + cmd += ` \\\n --moe-runner-backend flashinfer_cutlass`; + cmd += ` \\\n --quantization modelopt_fp4`; + } else if (quantization === 'fp8') { + cmd += ` \\\n --moe-runner-backend triton`; + } + } + + // Apply commandRule from all options + Object.entries(options).forEach(([key, option]) => { + if (option.commandRule && values[key]) { + // Pass the full values object so commandRule can access other option values + const additionalCmd = option.commandRule(values[key], values); + if (additionalCmd) { + cmd += ` \\\n ${additionalCmd}`; + } + } + }); + + // AMD-specific flags + if (!isNvidia) { + // Context length verified on MI300X/MI325X/MI355X + cmd += ` \\\n --context-length 8192`; + + // Page size for MoE models + cmd += ` \\\n --page-size 32`; + + // FP8 requires trust-remote-code + if (quantization === 'fp8') { + cmd += ` \\\n --trust-remote-code`; + } + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 
'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id 
!== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? 
'#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen3-coder-next-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen3-coder-next-deployment.jsx new file mode 100644 index 000000000000..700768c9a0af --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen3-coder-next-deployment.jsx @@ -0,0 +1,370 @@ +export const Qwen3CoderNextDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'h100', label: 'H100', default: false }, + { id: 'b200', label: 'B200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'enabled', label: 'Enabled', default: true }, + { id: 'disabled', label: 'Disabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--tool-call-parser qwen3_coder' : null + }, + mambaCache: { + name: 'mambaCache', + title: 'Mamba Radix Cache', + items: [ + { id: 'v1', label: 'V1', default: true }, + { id: 'v2', label: 'V2', default: false } + ], + commandRule: (value) => value === 'v2' ? 
'--mamba-scheduler-strategy extra_buffer \\\n --page-size 64' : null + } + }; + + const modelConfigs = { + default: { + baseName: 'Qwen3-Coder-Next', + h100: { bf16: { tp: 4 }, fp8: { tp: 2 } }, + h200: { bf16: { tp: 2 }, fp8: { tp: 1 } }, + b200: { bf16: { tp: 2 }, fp8: { tp: 1 } }, + mi300x: { bf16: { tp: 2 }, fp8: { tp: 1 } }, + mi325x: { bf16: { tp: 2 }, fp8: { tp: 1 } }, + mi355x: { bf16: { tp: 2 }, fp8: { tp: 1 } } + } + }; + + const generateCommand = (values) => { + const { hardware, quantization } = values; + + const hwConfig = modelConfigs.default[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantConfig = hwConfig[quantization]; + const quantSuffix = quantization === 'fp8' ? '-FP8' : ''; + const modelName = `Qwen/${modelConfigs.default.baseName}${quantSuffix}`; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + // TP setting + if (quantConfig.tp > 1) { + cmd += ` \\\n --tp ${quantConfig.tp}`; + } + + // Apply commandRule from all options + Object.entries(options).forEach(([key, option]) => { + if (option.commandRule && values[key]) { + const additionalCmd = option.commandRule(values[key], values); + if (additionalCmd) { + cmd += ` \\\n ${additionalCmd}`; + } + } + }); + + // AMD GPUs require triton attention backend + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd += ` \\\n --attention-backend triton`; + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, 
innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const 
containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? 
'#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen3-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen3-deployment.jsx new file mode 100644 index 000000000000..ee99a57938b2 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen3-deployment.jsx @@ -0,0 +1,330 @@ +export const Qwen3Deployment = () => { + // Model configurations + const modelConfigs = { + '235b': { + baseName: '235B-A22B', + hasThinkingVariants: true, + h100: { tp: 8, ep: 0, bf16: true, fp8: true }, + h200: { tp: 8, ep: 0, bf16: true, fp8: true }, + b200: { tp: 8, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 4, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 4, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 4, ep: 0, bf16: true, fp8: true } + }, + '30b': { + baseName: '30B-A3B', + hasThinkingVariants: true, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '32b': { + baseName: '32B', + hasThinkingVariants: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '14b': { + baseName: '14B', + hasThinkingVariants: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '8b': { + baseName: '8B', + hasThinkingVariants: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true 
}, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '4b': { + baseName: '4B', + hasThinkingVariants: true, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '1.7b': { + baseName: '1.7B', + hasThinkingVariants: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '0.6b': { + baseName: '0.6B', + hasThinkingVariants: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + } + }; + + // Base options + const baseOptions = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '235b', label: '235B', subtitle: 'MOE', default: true }, + { id: '30b', label: '30B', subtitle: 'MOE', default: false }, + { id: '32b', label: '32B', subtitle: 'Dense', default: false 
}, + { id: '14b', label: '14B', subtitle: 'Dense', default: false }, + { id: '8b', label: '8B', subtitle: 'Dense', default: false }, + { id: '4b', label: '4B', subtitle: 'Dense', default: false }, + { id: '1.7b', label: '1.7B', subtitle: 'Dense', default: false }, + { id: '0.6b', label: '0.6B', subtitle: 'Dense', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + category: { + name: 'category', + title: 'Categories', + items: [ + { id: 'base', label: 'Base', default: true }, + { id: 'instruct', label: 'Instruct', default: false }, + { id: 'thinking', label: 'Thinking', default: false } + ] + }, + reasoningParser: { + name: 'reasoningParser', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Get dynamic options based on current values + const getDisplayOptions = (values) => { + const options = { ...baseOptions }; + const currentModelConfig = modelConfigs[values.modelsize]; + + // If model doesn't have thinking variants, disable non-base category options + if (currentModelConfig && !currentModelConfig.hasThinkingVariants) { + options.category = { + ...baseOptions.category, + items: baseOptions.category.items.map(item => ({ + ...item, + disabled: item.id !== 'base' + })) + }; + } + + // Only show reasoningParser when category is not 'instruct' + if (values.category === 'instruct') { + delete options.reasoningParser; + } + + return options; + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(baseOptions).forEach(([key, option]) => { + const defaultItem 
= option.items.find(item => item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => { + const newValues = { ...prev, [optionName]: value }; + + // Auto-switch to 'base' category for models without thinking variants + if (optionName === 'modelsize') { + const modelConfig = modelConfigs[value]; + if (modelConfig && !modelConfig.hasThinkingVariants) { + if (newValues.category !== 'base') { + newValues.category = 'base'; + } + } + } + + // Reset reasoningParser when switching to 'instruct' category + if (optionName === 'category' && value === 'instruct') { + newValues.reasoningParser = 'disabled'; + } + + return newValues; + }); + }; + + // Generate command + const generateCommand = () => { + const { hardware, modelsize, quantization, category, reasoningParser, toolcall } = values; + + // Special error handling + const commandKey = `${hardware}-${modelsize}-${quantization}-${category}`; + if (commandKey === 'h100-235b-bf16-instruct' || commandKey === 'h100-235b-bf16-thinking') { + return '# Error: Model is too large, cannot fit into 8*H100\n# Please use H200 (141GB) or select FP8 quantization'; + } + + const config = modelConfigs[modelsize]; + if (!config)
{ + return `# Error: Unknown model size: ${modelsize}`; + } + + const hwConfig = config[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantSuffix = quantization === 'fp8' ? '-FP8' : ''; + + // Build model name based on model category + let modelName; + if (config.hasThinkingVariants) { + if (category === 'base') { + modelName = `Qwen/Qwen3-${config.baseName}${quantSuffix}`; + } else { + const thinkingSuffix = category === 'thinking' ? '-Thinking' : '-Instruct'; + const dateSuffix = '-2507'; + modelName = `Qwen/Qwen3-${config.baseName}${thinkingSuffix}${dateSuffix}${quantSuffix}`; + } + } else { + modelName = `Qwen/Qwen3-${config.baseName}${quantSuffix}`; + } + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + + let ep = hwConfig.ep; + if (quantization === 'fp8' && hwConfig.tp === 8) { + ep = 2; + } + + if (ep > 0) { + cmd += ` \\\n --ep ${ep}`; + } + + // Add reasoning parser + if (reasoningParser === 'enabled' && category !== 'instruct') { + cmd += ' \\\n --reasoning-parser qwen3'; + } + + // Add tool call parser + if (toolcall === 'enabled') { + cmd += ' \\\n --tool-call-parser qwen25'; + } + + return cmd; + }; + + // Get current display options + const displayOptions = getDisplayOptions(values); + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(displayOptions).map(([key, option]) => ( +
+
{option.title}
+
+ {option.items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = item.disabled; + return ( + + ); + })} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen3-next-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen3-next-deployment.jsx new file mode 100644 index 000000000000..87dee3ee25bd --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen3-next-deployment.jsx @@ -0,0 +1,409 @@ +export const Qwen3NextDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '80b', label: '80B', subtitle: 'MOE', default: true }, + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', subtitle: 'Full Weights', default: true }, + { id: 'fp8', label: 'FP8', subtitle: 'High Throughput', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Thinking Capabilities', + items: [ + { id: 'instruct', label: 'Instruct', subtitle: 'General Purpose', default: true }, + { id: 'thinking', label: 'Thinking', subtitle: 'Reasoning / CoT', default: false } + ], + commandRule: (value) => value === 'thinking' ? '--reasoning-parser qwen3' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--tool-call-parser qwen' : null + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? 
'--speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4' : null + }, + mambaCache: { + name: 'mambaCache', + title: 'Mamba Radix Cache', + items: [ + { id: 'v1', label: 'V1', default: true }, + { id: 'v2', label: 'V2', default: false } + ], + commandRule: (value) => value === 'v2' ? '--mamba-scheduler-strategy extra_buffer \\\n --page-size 64' : null + } + }; + + const modelConfigs = { + '80b': { + baseName: '80B-A3B', + isMOE: true, + h100: { tp: 4, ep: 0, bf16: true, fp8: true }, + h200: { tp: 2, ep: 0, bf16: true, fp8: true }, + b200: { tp: 2, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 2, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 2, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 2, ep: 0, bf16: true, fp8: true } + } + }; + + const generateCommand = (values) => { + const { hardware, modelsize: modelSize, quantization, thinking } = values; + const commandKey = `${hardware}-${modelSize}-${quantization}-${thinking}`; + + const modelSizeConfig = modelConfigs[modelSize]; + if (!modelSizeConfig) { + return `# Error: Unknown model size: ${modelSize}`; + } + + const hwConfig = modelSizeConfig[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantSuffix = quantization === 'fp8' ? '-FP8' : ''; + const thinkingSuffix = thinking === 'thinking' ? 
'-Thinking' : '-Instruct'; + const modelName = `Qwen/Qwen3-Next-${modelSizeConfig.baseName}${thinkingSuffix}${quantSuffix}`; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + + let ep = hwConfig.ep; + if (quantization === 'fp8' && hwConfig.tp === 8) { + ep = 2; + } + + if (ep > 0) { + cmd += ` \\\n --ep ${ep}`; + } + + for (const [key, option] of Object.entries(options)) { + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) { + cmd += ` \\\n ${rule}`; + } + } + } + + // AMD GPUs require triton attention backend + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd += ` \\\n --attention-backend triton`; + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen3-vl-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen3-vl-deployment.jsx new file mode 100644 index 000000000000..06137bd0c4a3 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen3-vl-deployment.jsx @@ -0,0 +1,245 @@ +export const Qwen3VLDeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '235b', label: '235B', subtitle: 'MOE', default: true }, + { id: '30b', label: '30B', subtitle: 'MOE', default: false }, + { id: '32b', label: '32B', subtitle: 'Dense', default: false }, + { id: '8b', label: '8B', subtitle: 'Dense', default: false }, + { id: '4b', label: '4B', subtitle: 'Dense', default: false }, + { id: '2b', label: '2B', subtitle: 'Dense', default: false } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + thinking: { + name: 'thinking', + title: 'Thinking Capabilities', + items: [ + { id: 'instruct', label: 'Instruct', default: true }, + { id: 'thinking', label: 'Thinking', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + // Model configurations + const modelConfigs = { + '235b': { + baseName: '235B-A22B', + isMOE: true, + h100: { tp: 8, ep: 0, bf16: true, fp8: true }, + h200: { tp: 8, ep: 0, bf16: 
true, fp8: true }, + b200: { tp: 8, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 8, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 8, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 8, ep: 0, bf16: true, fp8: true } + }, + '30b': { + baseName: '30B-A3B', + isMOE: true, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '32b': { + baseName: '32B', + isMOE: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '8b': { + baseName: '8B', + isMOE: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '4b': { + baseName: '4B', + isMOE: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + }, + '2b': { + baseName: '2B', + isMOE: false, + h100: { tp: 1, ep: 0, bf16: true, fp8: true }, + h200: { tp: 1, ep: 0, bf16: true, fp8: true }, + b200: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi300x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi325x: { tp: 1, ep: 0, bf16: true, fp8: true }, + mi355x: { tp: 1, ep: 0, bf16: true, fp8: true } + } + }; + + + // Initialize 
state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command + const generateCommand = () => { + const { hardware, modelsize, quantization, thinking, toolcall } = values; + const commandKey = `${hardware}-${modelsize}-${quantization}-${thinking}`; + + // Special error handling + if (commandKey === 'h100-235b-bf16-instruct' || commandKey === 'h100-235b-bf16-thinking') { + return '# Error: Model is too large, cannot fit into 8*H100\n# Please use H200 (141GB) or select FP8 quantization'; + } + + const config = modelConfigs[modelsize]; + if (!config) { + return `# Error: Unknown model size: ${modelsize}`; + } + + const hwConfig = config[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantSuffix = quantization === 'fp8' ? '-FP8' : ''; + const thinkingSuffix = thinking === 'thinking' ? 
'-Thinking' : '-Instruct'; + const modelName = `Qwen/Qwen3-VL-${config.baseName}${thinkingSuffix}${quantSuffix}`; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + + let ep = hwConfig.ep; + if (quantization === 'fp8' && hwConfig.tp === 8) { + ep = 2; + } + + if (ep > 0) { + cmd += ` \\\n --ep ${ep}`; + } + + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + if (modelsize === '32b' && quantization === 'bf16') { + cmd += ` \\\n --context-length 65536`; + } + } + + if (thinking === 'thinking') { + cmd += ' \\\n --reasoning-parser qwen3'; + } + + if (toolcall === 'enabled') { + cmd += ' \\\n --tool-call-parser qwen'; + } + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? 
'#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = item.disabled; + return ( + + ); + })} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen35-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen35-deployment.jsx new file mode 100644 index 000000000000..68b074ce2c00 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen35-deployment.jsx @@ -0,0 +1,442 @@ +export const Qwen35Deployment = () => { + // Qwen3.5 Configuration Generator + // + // MoE models (Gated Delta Networks + sparse MoE, hybrid architecture): + // 397B-A17B, 122B-A10B, 35B-A3B + // + // Dense models (standard transformer): + // 27B, 9B, 4B, 2B, 0.8B + // + // GPU requirements (BF16): + // 397B-A17B: H100 tp=16, H200 tp=8, B200 tp=8, B300 tp=4, MI300X tp=8, MI325X tp=4, MI355X tp=4 + // 122B-A10B: H100 tp=4, H200 tp=2, B200 tp=2, B300 tp=1, MI300X tp=2, MI325X tp=1, MI355X tp=1 + // 35B-A3B: H100 tp=1, H200 tp=1, B200 tp=1, B300 tp=1, MI300X tp=1, MI325X tp=1, MI355X tp=1 + // 27B/9B/4B/2B/0.8B: tp=1 on all hardware (including MI300X, MI325X, MI355X) + // + // GPU requirements (FP8, where available): + // 397B-A17B: H100 tp=8, H200 tp=8 ep=8, B200 tp=4, B300 tp=2, MI300X tp=4, MI325X tp=2, MI355X tp=2 + // 122B-A10B: H100 tp=2, H200 tp=1, B200 tp=1, B300 tp=1, MI300X tp=1, MI325X tp=1, MI355X tp=1 + // 35B-A3B: H100 tp=1, H200 tp=1, B200 tp=1, B300 tp=1, MI300X tp=1, MI325X tp=1, MI355X tp=1 + // 27B: tp=1 on all hardware (including MI300X, MI325X, MI355X) + // + // FP4 (397B only, Blackwell required): B200 tp=4, B300 tp=2 + + const MOE_MODELS = new Set(['397b', '122b', '35b']); + const FP8_MODELS = new Set(['397b', '122b', '35b', '27b']); + + // Maps model id -> HuggingFace model name suffix + const MODEL_SUFFIX = { + '397b': '397B-A17B', + '122b': '122B-A10B', + '35b': '35B-A3B', + '27b': '27B', + '9b': '9B', + '4b': '4B', + '2b': '2B', + '0.8b': '0.8B', + }; + + const options = { + model: { + name: 'model', + title: 'Model Variant', + items: [ + { id: '397b', label: '397B', subtitle: 'MoE', default: true }, + { id: '122b', label: '122B', subtitle: 'MoE', 
default: false }, + { id: '35b', label: '35B', subtitle: 'MoE', default: false }, + { id: '27b', label: '27B', subtitle: 'Dense', default: false }, + { id: '9b', label: '9B', subtitle: 'Dense', default: false }, + { id: '4b', label: '4B', subtitle: 'Dense', default: false }, + { id: '2b', label: '2B', subtitle: 'Dense', default: false }, + { id: '0.8b', label: '0.8B', subtitle: 'Dense', default: false }, + ] + }, + hardware: { + name: 'hardware', + title: 'Hardware Platform', + getDynamicItems: (values) => { + const isNvfp4 = values.quantization === 'fp4'; + return [ + { id: 'h100', label: 'H100', default: !isNvfp4, disabled: isNvfp4 }, + { id: 'h200', label: 'H200', default: false, disabled: isNvfp4 }, + { id: 'b200', label: 'B200', default: false, disabled: false }, + { id: 'b300', label: 'B300', default: isNvfp4, disabled: false }, + { id: 'mi300x', label: 'MI300X', default: false, disabled: isNvfp4 }, + { id: 'mi325x', label: 'MI325X', default: false, disabled: isNvfp4 }, + { id: 'mi355x', label: 'MI355X', default: false, disabled: isNvfp4 } + ]; + } + }, + quantization: { + name: 'quantization', + title: 'Quantization', + getDynamicItems: (values) => { + const hasFp8 = FP8_MODELS.has(values.model); + const hasFp4 = values.model === '397b'; + return [ + { id: 'bf16', label: 'BF16', default: !hasFp8 }, + { id: 'fp8', label: 'FP8', default: hasFp8, disabled: !hasFp8, + disabledReason: 'No FP8 variant available for this model' }, + { id: 'fp4', label: 'FP4', default: false, disabled: !hasFp4, + disabledReason: 'FP4 is only available for Qwen3.5-397B-A17B' } + ]; + } + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + 
speculative: { + name: 'speculative', + title: 'Speculative Decoding (MTP)', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true } + ] + }, + mambaCache: { + name: 'mambaCache', + title: 'Mamba Radix Cache', + condition: (values) => MOE_MODELS.has(values.model), + getDynamicItems: (currentValues) => { + const amdGpus = ['mi300x', 'mi325x', 'mi355x']; + const isAmdGpu = amdGpus.includes(currentValues.hardware); + const mtpEnabled = currentValues.speculative === 'enabled'; + + // MTP requires V2 mamba radix cache + if (mtpEnabled && !isAmdGpu) { + return [ + { id: 'v1', label: 'V1', default: false, disabled: true }, + { id: 'v2', label: 'V2', default: true } + ]; + } + + // Show V2 as disabled for AMD GPUs (V2 requires FLA backend, NVIDIA only) + if (isAmdGpu) { + return [ + { id: 'v1', label: 'V1', default: true }, + { id: 'v2', label: 'V2', default: false, disabled: true } + ]; + } + + // Show both V1 and V2 enabled for NVIDIA GPUs + return [ + { id: 'v1', label: 'V1', default: true }, + { id: 'v2', label: 'V2', default: false } + ]; + } + } + }; + + const modelConfigs = { + '397b': { + h100: { bf16: { tp: 16, mem: 0.8 }, fp8: { tp: 8, mem: 0.8 } }, + h200: { bf16: { tp: 8, mem: 0.8 }, fp8: { tp: 8, ep: 8, mem: 0.8 } }, + b200: { bf16: { tp: 8, mem: 0.8 }, fp8: { tp: 4, mem: 0.8 }, fp4: { tp: 4, mem: 0.85 } }, + b300: { bf16: { tp: 4, mem: 0.8 }, fp8: { tp: 2, mem: 0.8 }, fp4: { tp: 2, mem: 0.8 } }, + mi300x: { bf16: { tp: 8, mem: 0.8 }, fp8: { tp: 4, mem: 0.8 } }, + mi325x: { bf16: { tp: 4, mem: 0.8 }, fp8: { tp: 2, mem: 0.8 } }, + mi355x: { bf16: { tp: 4, mem: 0.8 }, fp8: { tp: 2, mem: 0.8 } } + }, + '122b': { + h100: { bf16: { tp: 4, mem: 0.8 }, fp8: { tp: 2, mem: 0.8 } }, + h200: { bf16: { tp: 2, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 2, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + b300: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi300x: { bf16: { tp: 2, mem: 
0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi325x: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi355x: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } } + }, + '35b': { + h100: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + h200: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + b300: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi300x: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi325x: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi355x: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } } + }, + '27b': { + h100: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + h200: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + b300: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi300x: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi325x: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + mi355x: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } } + }, + '9b': { + h100: { bf16: { tp: 1, mem: 0.8 } }, + h200: { bf16: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 1, mem: 0.8 } }, + b300: { bf16: { tp: 1, mem: 0.8 } }, + mi300x: { bf16: { tp: 1, mem: 0.8 } }, + mi325x: { bf16: { tp: 1, mem: 0.8 } }, + mi355x: { bf16: { tp: 1, mem: 0.8 } } + }, + '4b': { + h100: { bf16: { tp: 1, mem: 0.8 } }, + h200: { bf16: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 1, mem: 0.8 } }, + b300: { bf16: { tp: 1, mem: 0.8 } }, + mi300x: { bf16: { tp: 1, mem: 0.8 } }, + mi325x: { bf16: { tp: 1, mem: 0.8 } }, + mi355x: { bf16: { tp: 1, mem: 0.8 } } + }, + '2b': { + h100: { bf16: { tp: 1, mem: 0.8 } }, + h200: { bf16: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 1, mem: 0.8 } }, + b300: { bf16: { tp: 1, mem: 0.8 } }, + mi300x: { bf16: { tp: 1, mem: 0.8 } }, + mi325x: { bf16: { tp: 1, mem: 0.8 } }, + mi355x: { bf16: { tp: 1, mem: 0.8 } } 
+ }, + '0.8b': { + h100: { bf16: { tp: 1, mem: 0.8 } }, + h200: { bf16: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 1, mem: 0.8 } }, + b300: { bf16: { tp: 1, mem: 0.8 } }, + mi300x: { bf16: { tp: 1, mem: 0.8 } }, + mi325x: { bf16: { tp: 1, mem: 0.8 } }, + mi355x: { bf16: { tp: 1, mem: 0.8 } } + } + }; + + const resolveItems = (option, vals) => + typeof option.getDynamicItems === 'function' ? option.getDynamicItems(vals) : option.items; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, initialState); + const def = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + // When hardware or model changes, re-resolve dynamic selections to stay consistent. 
+ useEffect(() => { + setValues(prev => { + const next = { ...prev }; + for (const [key, option] of Object.entries(options)) { + if (typeof option.getDynamicItems !== 'function') continue; + const items = option.getDynamicItems(next); + const current = items.find(i => i.id === next[key]); + if (!current || current.disabled) { + const fallback = items.find(i => i.default && !i.disabled) || items.find(i => !i.disabled); + if (fallback) next[key] = fallback.id; + } + } + return next; + }); + }, [values.hardware, values.model]); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command — must produce byte-identical output to sgl-cookbook's + // config.generateCommand(values) for every valid combination. + const generateCommand = () => { + const { model, hardware, quantization, speculative, mambaCache } = values; + + const hwConfig = modelConfigs[model]?.[hardware]?.[quantization]; + if (!hwConfig) { + if (quantization === 'fp4') { + return '# FP4 requires B200/B300 (Blackwell) and is only available for Qwen3.5-397B-A17B'; + } + return '# Please select a valid hardware and quantization combination'; + } + + let modelName; + if (quantization === 'fp4') { + modelName = 'nvidia/Qwen3.5-397B-A17B-NVFP4'; + } else { + const suffix = MODEL_SUFFIX[model]; + const quantSuffix = quantization === 'fp8' ? '-FP8' : ''; + modelName = `Qwen/Qwen3.5-${suffix}${quantSuffix}`; + } + + const tpValue = hwConfig.tp; + const epValue = hwConfig.ep; + const memFraction = hwConfig.mem; + + // Initialize the base command + let cmd = `sglang serve --model-path ${modelName}`; + if (tpValue > 1) { + cmd += ` \\\n --tp ${tpValue}`; + } + if (epValue) { + cmd += ` \\\n --expert-parallel-size ${epValue}`; + } + + // Force Mamba V1 for AMD GPUs (V2 requires FLA backend) + // Force Mamba V2 when MTP is enabled + const amdGpus = ['mi300x', 'mi325x', 'mi355x']; + const actualMambaCache = amdGpus.includes(hardware) ? 
'v1' : (speculative === 'enabled' ? 'v2' : mambaCache); + + // Apply commandRules from options (reasoning, toolcall, speculative, mambaCache) + // Skip quantization and model (handled via model name) + const commandRules = { + reasoning: (value) => value === 'enabled' ? '--reasoning-parser qwen3' : null, + toolcall: (value) => value === 'enabled' ? '--tool-call-parser qwen3_coder' : null, + speculative: (value) => value === 'enabled' ? '--speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4' : null, + mambaCache: (value) => value === 'v2' ? '--mamba-scheduler-strategy extra_buffer' : null, + }; + + // Iterate options in order, applying commandRules + for (const [key, option] of Object.entries(options)) { + if (key === 'quantization' || key === 'model') continue; + // Skip options that don't pass their condition + if (option.condition && !option.condition(values)) continue; + const rule = commandRules[key]; + if (rule) { + const adjustedValue = key === 'mambaCache' ? actualMambaCache : values[key]; + const result = rule(adjustedValue); + if (result) { + cmd += ` \\\n ${result}`; + } + } + } + + // Chunked prefill tuning for H200 FP8 + MTP (validated on H200 only) + if (hardware === 'h200' && quantization === 'fp8' && speculative === 'enabled') { + cmd += ` \\\n --max-running-requests 128`; + cmd += ` \\\n --chunked-prefill-size 16384`; + cmd += ` \\\n --tokenizer-worker-num 6`; + } + + // Enable allreduce fusion for all Qwen3.5 configs (skip for FP4: benchmark only enables this for TP>=8). 
+ if (quantization !== 'fp4') { + cmd += ` \\\n --enable-flashinfer-allreduce-fusion`; + } + + // H200 FP8-specific optimizations + if (hardware === 'h200' && quantization === 'fp8') { + cmd += ` \\\n --attention-backend flashinfer`; + if (MOE_MODELS.has(model)) { + cmd += ` \\\n --mamba-ssm-dtype bfloat16`; + } + } + + // Append backend configurations + if (hardware === 'b200' || hardware === 'b300') { + cmd += ` \\\n --attention-backend trtllm_mha`; + } + + // Append AMD GPU-specific backend configurations + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd += ` \\\n --attention-backend triton`; + } + + // Tokenizer workers for H200 and B200/B300 + if (hardware === 'h200' || hardware === 'b200' || hardware === 'b300') { + if (speculative === 'disabled') { + cmd += ` \\\n --tokenizer-worker-num 6`; + } + } + + // FP4-specific backend settings + if (quantization === 'fp4') { + cmd += ' \\\n --quantization modelopt_fp4'; + cmd += ' \\\n --fp4-gemm-backend flashinfer_cutlass'; + cmd += ' \\\n --kv-cache-dtype fp8_e4m3'; + cmd += ' \\\n --moe-runner-backend flashinfer_trtllm'; + cmd += ' \\\n --chunked-prefill-size 32768'; + cmd += ' \\\n --max-prefill-tokens 32768'; + cmd += ' \\\n --max-running-requests 128'; + cmd += ' \\\n --stream-interval 30'; + cmd += ' \\\n --disable-radix-cache'; + } + + // Add memory fraction last + cmd += ` \\\n --mem-fraction-static ${memFraction}`; + + return cmd; + }; + + // Styles + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (typeof option.condition === 'function' && !option.condition(values)) return null; + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map(item => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/qwen36-deployment.jsx b/docs_new/src/snippets/autoregressive/qwen36-deployment.jsx new file mode 100644 index 000000000000..891d852e422c --- /dev/null +++ b/docs_new/src/snippets/autoregressive/qwen36-deployment.jsx @@ -0,0 +1,237 @@ +export const Qwen36Deployment = () => { + // Config mirrors sgl-cookbook src/components/autoregressive/Qwen36ConfigGenerator/index.js. + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h100', label: 'H100', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'b200', label: 'B200', default: false }, + ], + }, + modelSize: { + name: 'modelSize', + title: 'Model Size', + items: [ + { id: '35b-a3b', label: '35B-A3B (MoE)', default: true }, + { id: '27b', label: '27B (Dense)', default: false }, + ], + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'fp8', label: 'FP8', default: true }, + { id: 'bf16', label: 'BF16', default: false }, + ], + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true }, + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser qwen3' : null, + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true }, + ], + commandRule: (value) => value === 'enabled' ? '--tool-call-parser qwen3_coder' : null, + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding (MTP)', + items: [ + { id: 'disabled', label: 'Disabled', default: false }, + { id: 'enabled', label: 'Enabled', default: true }, + ], + commandRule: (value) => value === 'enabled' ? 
'--speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4' : null, + }, + mambaCache: { + name: 'mambaCache', + title: 'Mamba Radix Cache', + getDynamicItems: (values) => { + const mtpEnabled = values.speculative === 'enabled'; + if (mtpEnabled) { + return [ + { id: 'v1', label: 'V1', default: false, disabled: true }, + { id: 'v2', label: 'V2', default: true }, + ]; + } + return [ + { id: 'v1', label: 'V1', default: true }, + { id: 'v2', label: 'V2', default: false }, + ]; + }, + commandRule: (value) => value === 'v2' ? '--mamba-scheduler-strategy extra_buffer' : null, + }, + }; + + const modelConfigs = { + '35b-a3b': { + baseName: '35B-A3B', + h100: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + h200: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + }, + '27b': { + baseName: '27B', + h100: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + h200: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + b200: { bf16: { tp: 1, mem: 0.8 }, fp8: { tp: 1, mem: 0.8 } }, + }, + }; + + const resolveItems = (option, vals) => + typeof option.getDynamicItems === 'function' ? 
option.getDynamicItems(vals) : option.items; + + const getInitialState = () => { + const initialState = {}; + for (const [key, option] of Object.entries(options)) { + const items = resolveItems(option, initialState); + const def = items.find((item) => item.default && !item.disabled) || items.find((item) => !item.disabled) || items[0]; + initialState[key] = def.id; + } + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + useEffect(() => { + setValues((prev) => { + const next = { ...prev }; + for (const [key, option] of Object.entries(options)) { + if (typeof option.getDynamicItems !== 'function') continue; + const items = option.getDynamicItems(next); + const current = items.find((item) => item.id === next[key]); + if (!current || current.disabled) { + const fallback = items.find((item) => item.default && !item.disabled) || items.find((item) => !item.disabled); + if (fallback) next[key] = fallback.id; + } + } + return next; + }); + }, [values.speculative]); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const generateCommand = () => { + const { hardware, modelSize, quantization, speculative } = values; + const sizeConfig = modelConfigs[modelSize]; + const hwConfig = sizeConfig?.[hardware]?.[quantization]; + if (!hwConfig) { + return '# Please select a valid hardware and quantization combination'; + } 
+ + const quantSuffix = quantization === 'fp8' ? '-FP8' : ''; + const modelName = `Qwen/Qwen3.6-${sizeConfig.baseName}${quantSuffix}`; + + let cmd = ''; + if (speculative === 'enabled') { + cmd += 'SGLANG_ENABLE_SPEC_V2=1 '; + } + + cmd += `sglang serve --model-path ${modelName}`; + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + + const adjustedValues = { + ...values, + mambaCache: speculative === 'enabled' ? 'v2' : values.mambaCache, + }; + + for (const [key, option] of Object.entries(options)) { + if (key === 'quantization' || key === 'hardware' || key === 'modelSize') continue; + if (!option.commandRule) continue; + const rule = option.commandRule(adjustedValues[key]); + if (rule) { + cmd += ` \\\n ${rule}`; + } + } + + if (hardware === 'b200') { + cmd += ` \\\n --attention-backend trtllm_mha`; + } + + cmd += ` \\\n --mem-fraction-static ${hwConfig.mem}`; + return cmd; + }; + + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? 
'#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.4 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + const items = resolveItems(option, values); + return ( +
+
{option.title}
+
+ {items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = !!item.disabled; + return ( + + ); + })} +
+
+ ); + })} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/ring-25-1t-deployment.jsx b/docs_new/src/snippets/autoregressive/ring-25-1t-deployment.jsx new file mode 100644 index 000000000000..6ffa5169e547 --- /dev/null +++ b/docs_new/src/snippets/autoregressive/ring-25-1t-deployment.jsx @@ -0,0 +1,217 @@ +export const Ring251TDeployment = () => { + // Config mirrors sgl-cookbook src/components/autoregressive/Ring25ConfigGenerator/index.js. + // + // GPU requirements: + // H200 / B200 / GB200 / GB300 / MI355X: single-node (tp per platform) + // MI300X / MI325X: two nodes, tp-size 8, pp-size 2 (multi-node scripts) + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'b200', label: 'B200', default: false }, + { id: 'gb200', label: 'GB200', default: false }, + { id: 'gb300', label: 'GB300', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ] + } + }; + + const modelConfigs = { + h200: { fp8: { tp: 8 } }, + b200: { fp8: { tp: 8 } }, + gb200: { fp8: { tp: 4 } }, + gb300: { fp8: { tp: 4 } }, + mi300x: { fp8: { tp: 8, pp: 2, nnodes: 2 } }, + mi325x: { fp8: { tp: 8, pp: 2, nnodes: 2 } }, + mi355x: { fp8: { tp: 8 } } + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = option.items.filter(item => item.default).map(item => item.id); + } else { + 
const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? defaultItem.id : option.items[0].id; + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues(prev => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } else { + return { ...prev, [optionName]: currentValues.filter(id => id !== itemId) }; + } + }); + }; + + // Generate command — byte-identical to sgl-cookbook Ring25ConfigGenerator + const generateCommand = () => { + const { hardware, reasoning, toolcall } = values; + const modelName = 'inclusionAI/Ring-2.5-1T'; + const amdMultiNode = hardware === 'mi300x' || hardware === 'mi325x'; + + // Extra flags from reasoning / toolcall + const extraFlags = []; + if (reasoning === 'enabled') extraFlags.push('--reasoning-parser deepseek-r1'); + if (toolcall === 'enabled') extraFlags.push('--tool-call-parser qwen'); + + if (amdMultiNode) { + const hwConfig = modelConfigs[hardware].fp8; + const tpSize = hwConfig.tp; + const ppSize = hwConfig.pp; + + const buildAmdNodeCmd = (nodeRank) => { + let cmd = 'sglang serve \\\n'; + cmd += `--model-path ${modelName} \\\n`; + cmd += 
'--trust-remote-code \\\n'; + cmd += `--tp-size ${tpSize} \\\n`; + cmd += `--pp-size ${ppSize} \\\n`; + cmd += `--nnodes ${hwConfig.nnodes} \\\n`; + cmd += `--node-rank ${nodeRank} \\\n`; + if (nodeRank === 0) { + cmd += '--host 0.0.0.0 \\\n'; + cmd += '--port 30000 \\\n'; + } + cmd += '--dist-init-addr ${MASTER_IP}:${DIST_PORT} \\\n'; + cmd += '--attention-backend triton \\\n'; + cmd += '--model-loader-extra-config \'{"enable_multithread_load": "true","num_threads": 64}\' \\\n'; + cmd += '--mem-frac 0.95'; + extraFlags.forEach((flag) => { + cmd += ` \\\n${flag}`; + }); + return cmd; + }; + + const envBlock = + 'export MASTER_IP= # Replace with the IP of Node 0\n' + + 'export PORT=30000\n' + + 'export DIST_PORT=20000\n' + + '# Replace with your actual NIC interface name\n' + + 'export GLOO_SOCKET_IFNAME=\n' + + 'export TP_SOCKET_IFNAME=\n'; + + let out = envBlock + '\n'; + + out += '\n# Node 0:\n'; + out += buildAmdNodeCmd(0); + + out += '\n\n\n# Node 1:\n'; + out += buildAmdNodeCmd(1); + + return out; + } + + // Single-node path (H200, B200, GB200, GB300, MI355X) + const hwConfig = modelConfigs[hardware].fp8; + const tpValue = hwConfig.tp; + + let cmd = 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + cmd += ` \\\n --tp ${tpValue}`; + cmd += ' \\\n --trust-remote-code'; + + extraFlags.forEach((flag) => { + cmd += ` \\\n ${flag}`; + }); + + return cmd; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const disabledStyle = { cursor: 'not-allowed', opacity: 0.5 }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.type === 'checkbox' ? ( + option.items.map(item => { + const isChecked = (values[option.name] || []).includes(item.id); + const isItemDisabled = item.required; + return ( + + ); + }) + ) : ( + option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + }) + )} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/step-35-deployment.jsx b/docs_new/src/snippets/autoregressive/step-35-deployment.jsx new file mode 100644 index 000000000000..9e633b64c13a --- /dev/null +++ b/docs_new/src/snippets/autoregressive/step-35-deployment.jsx @@ -0,0 +1,393 @@ +export const Step35Deployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', default: true }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi350x', label: 'MI350X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '196b', label: '196B', subtitle: 'MOE', default: true }, + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + reasoningParser: { + name: 'reasoningParser', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser step3p5' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? 
'--tool-call-parser step3p5' : null + }, + speculative: { + name: 'speculative', + title: 'Speculative Decoding', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => { + if (value !== 'enabled') return null; + + const cmd = '--speculative-algorithm EAGLE \\\n --speculative-num-steps 3 \\\n --speculative-eagle-topk 1 \\\n --speculative-num-draft-tokens 4 \\\n --enable-multi-layer-eagle'; + + return cmd; + } + } + }; + + const modelConfigs = { + '196b': { + baseName: '196b', + isMOE: true, + h200: { tp: 4, bf16: true }, + mi300x: { tp: 4, bf16: true }, + mi325x: { tp: 4, bf16: true }, + mi350x: { tp: 4, bf16: true }, + mi355x: { tp: 4, bf16: true }, + }, + }; + + const generateCommand = (values) => { + const { hardware, modelsize: modelSize, quantization } = values; + const isAMD = hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi350x' || hardware === 'mi355x'; + + const modelSizeConfig = modelConfigs[modelSize]; + const hwConfig = modelSizeConfig[hardware]; + const quantSuffix = quantization === 'fp8' ?
'-FP8' : ''; + const modelName = `stepfun-ai/Step-3.5-Flash${quantSuffix}`; + + let tpValue = hwConfig.tp; + + let cmd = ''; + + cmd += 'sglang serve \\\n'; + cmd += ` --model-path ${modelName}`; + + if (tpValue > 1) { + cmd += ` \\\n --tp ${tpValue}`; + } + // EP required for FP8, and for AMD BF16 (AITER CK GEMM N=320 crash without EP) + if (quantSuffix === '-FP8' || isAMD) { + cmd += ` \\\n --ep ${tpValue}`; + } + + // Trust remote code for custom architecture + cmd += ' \\\n --trust-remote-code'; + + for (const [key, option] of Object.entries(options)) { + if (option.commandRule) { + const rule = option.commandRule(values[key], values); + + if (rule) { + cmd += ` \\\n ${rule}`; + } + } + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/autoregressive/step-3vl-10b-deployment.jsx b/docs_new/src/snippets/autoregressive/step-3vl-10b-deployment.jsx new file mode 100644 index 000000000000..d6e80970a7fc --- /dev/null +++ b/docs_new/src/snippets/autoregressive/step-3vl-10b-deployment.jsx @@ -0,0 +1,383 @@ +export const Step3VL10BDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h100', label: 'H100', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'a100', label: 'A100', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '10b', label: '10B', subtitle: 'Dense', default: true } + ] + }, + quantization: { + name: 'quantization', + title: 'Quantization', + items: [ + { id: 'bf16', label: 'BF16', default: true }, + { id: 'fp8', label: 'FP8', default: false } + ] + }, + reasoning: { + name: 'reasoning', + title: 'Reasoning Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? '--reasoning-parser deepseek-r1' : null + }, + toolcall: { + name: 'toolcall', + title: 'Tool Call Parser', + items: [ + { id: 'disabled', label: 'Disabled', default: true }, + { id: 'enabled', label: 'Enabled', default: false } + ], + commandRule: (value) => value === 'enabled' ? 
'--tool-call-parser hermes' : null + } + }; + + const modelConfigs = { + '10b': { + baseName: '10B', + isMOE: false, + b200: { tp: 1, bf16: true, fp8: true }, + h100: { tp: 1, bf16: true, fp8: true }, + h200: { tp: 1, bf16: true, fp8: true }, + a100: { tp: 1, bf16: true, fp8: true }, + mi300x: { tp: 1, bf16: true, fp8: true }, + mi325x: { tp: 1, bf16: true, fp8: true }, + mi355x: { tp: 1, bf16: true, fp8: true } + } + }; + + const generateCommand = (values) => { + const { hardware, modelsize: modelSize, quantization } = values; + + const modelSizeConfig = modelConfigs[modelSize]; + if (!modelSizeConfig) { + return `# Error: Unknown model size: ${modelSize}`; + } + + const hwConfig = modelSizeConfig[hardware]; + if (!hwConfig) { + return `# Error: Unknown hardware platform: ${hardware}`; + } + + const quantSuffix = quantization === 'fp8' ? '-FP8' : ''; + const modelName = `stepfun-ai/Step3-VL-10B${quantSuffix}`; + + let cmd = 'python -m sglang.launch_server \\\n'; + cmd += ` --model ${modelName}`; + + if (hwConfig.tp > 1) { + cmd += ` \\\n --tp ${hwConfig.tp}`; + } + + cmd += ' \\\n --host 0.0.0.0 \\\n --port 30000'; + if (hardware === 'mi300x' || hardware === 'mi325x' || hardware === 'mi355x') { + cmd += ' \\\n --attention-backend triton'; + } + cmd += ' \\\n --trust-remote-code'; + + for (const [key, option] of Object.entries(options)) { + if (option.commandRule) { + const rule = option.commandRule(values[key]); + if (rule) { + cmd += ` \\\n ${rule}`; + } + } + } + + return cmd; + }; + + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + 
Object.entries(options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command 
= generateCommand(values); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? 
'#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + return ( + + ); + }) + )} +
+
+ ); + })} +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/diffusion/flux-deployment.jsx b/docs_new/src/snippets/diffusion/flux-deployment.jsx new file mode 100644 index 000000000000..2a004865c2a2 --- /dev/null +++ b/docs_new/src/snippets/diffusion/flux-deployment.jsx @@ -0,0 +1,335 @@ +export const FluxDeployment = () => { + const config = { + modelFamily: 'FLUX', + + options: { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi355x', label: 'MI355X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + ] + }, + version: { + name: 'version', + title: 'Model Version', + items: [ + { id: 'flux1-dev', label: 'FLUX.1-dev', subtitle: '12B', default: true }, + { id: 'flux2-dev', label: 'FLUX.2-dev', subtitle: '32B', default: false } + ] + } + }, + + modelConfigs: { + 'flux1-dev': { repoId: 'black-forest-labs/FLUX.1-dev' }, + 'flux2-dev': { repoId: 'black-forest-labs/FLUX.2-dev' } + }, + + generateCommand: function(values) { + const { version } = values; + const config = this.modelConfigs[version]; + + return `sglang serve \\ + --model-path ${config.repoId} \\ + --ulysses-degree=1 \\ + --ring-degree=1`; + } + }; + + if (!config || !config.options) { + return
Error: Invalid configuration provided
; + } + + const getInitialState = () => { + const initialState = {}; + Object.entries(config.options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(config.options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = config.generateCommand ? config.generateCommand.call(config, values) : ''; + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(config.options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + + return ( + + ); + }) + )} +
+
+ ); + })} + +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/diffusion/ltx-deployment.jsx b/docs_new/src/snippets/diffusion/ltx-deployment.jsx new file mode 100644 index 000000000000..e3627c20921e --- /dev/null +++ b/docs_new/src/snippets/diffusion/ltx-deployment.jsx @@ -0,0 +1,233 @@ +export const LTXDeployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'h200', label: 'H200', subtitle: 'Fastest, resident', default: true }, + { id: 'standard', label: 'Standard CUDA', subtitle: 'Snapshot mode', default: false }, + { id: 'official', label: 'Official Match', subtitle: 'Original switching', default: false }, + ], + }, + model: { + name: 'model', + title: 'Model', + items: [ + { id: 'ltx23', label: 'LTX-2.3', default: true }, + { id: 'ltx2', label: 'LTX-2', default: false }, + ], + }, + pipeline: { + name: 'pipeline', + title: 'Pipeline', + items: [ + { id: 'two-stage', label: 'Two Stage', default: true, validModels: ['ltx2', 'ltx23'] }, + { id: 'two-stage-hq', label: 'Two Stage HQ', subtitle: 'High Quality', default: false, validModels: ['ltx23'] }, + { id: 'one-stage', label: 'One Stage', default: false, validModels: ['ltx2', 'ltx23'] }, + ], + }, + }; + + const modelConfigs = { + ltx2: { + repoId: 'Lightricks/LTX-2', + pipelines: { + 'one-stage': 'LTX2Pipeline', + 'two-stage': 'LTX2TwoStagePipeline', + }, + supportedLoras: [], + }, + ltx23: { + repoId: 'Lightricks/LTX-2.3', + pipelines: { + 'one-stage': 'LTX2Pipeline', + 'two-stage': 'LTX2TwoStagePipeline', + 'two-stage-hq': 'LTX2TwoStageHQPipeline', + }, + supportedLoras: [ + { + id: 'transition', + path: 'valiantcat/LTX-2.3-Transition-LORA', + weightName: 'ltx2.3-transition.safetensors', + validPipelines: ['two-stage', 'two-stage-hq'], + }, + ], + }, + }; + + const getInitialState = () => ({ + hardware: 'h200', + model: 'ltx23', + pipeline: 'two-stage', + selectedLoraPath: 'none', + }); + + const [values, setValues] = useState(getInitialState); + const [isDark, 
setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const availableLoras = (() => { + const config = modelConfigs[values.model]; + return (config?.supportedLoras || []).filter((lora) => lora.validPipelines.includes(values.pipeline)); + })(); + + const handleRadioChange = (optionName, itemId) => { + setValues((prev) => { + const next = { ...prev, [optionName]: itemId }; + + const validPipeline = options.pipeline.items.some((item) => ( + item.id === next.pipeline && item.validModels.includes(next.model) + )); + if (!validPipeline) { + next.pipeline = 'two-stage'; + } + + const config = modelConfigs[next.model]; + const nextSupported = (config?.supportedLoras || []).filter((lora) => lora.validPipelines.includes(next.pipeline)); + const isValid = nextSupported.some((lora) => lora.path === prev.selectedLoraPath); + if (!isValid) { + next.selectedLoraPath = 'none'; + } + return next; + }); + }; + + const handleLoraToggle = (path) => { + setValues((prev) => ({ + ...prev, + selectedLoraPath: prev.selectedLoraPath === path ? 
'none' : path, + })); + }; + + const getDeviceMode = () => { + if (values.hardware === 'h200') { + return 'resident'; + } + if (values.hardware === 'official') { + return 'original'; + } + return 'snapshot'; + }; + + const generateCommand = () => { + const config = modelConfigs[values.model]; + const pipelineClass = config.pipelines[values.pipeline]; + if (!pipelineClass) { + return '# Error: Invalid configuration'; + } + + let command = `sglang serve \\\n --model-path ${config.repoId} \\\n --pipeline-class-name ${pipelineClass}`; + if (values.pipeline !== 'one-stage') { + command += ` \\\n --ltx2-two-stage-device-mode ${getDeviceMode()}`; + } + + const selectedLora = availableLoras.find((lora) => lora.path === values.selectedLoraPath); + if (selectedLora) { + command += ` \\\n --lora-path ${selectedLora.path} \\\n --lora-weight-name ${selectedLora.weightName}`; + } + + command += ` \\\n --port 30000`; + return command; + }; + + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? 
'#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + const itemsToDisplay = key === 'pipeline' + ? option.items.filter((item) => item.validModels.includes(values.model)) + : option.items; + + return ( +
+
{option.title}
+
+ {itemsToDisplay.map((item) => { + const isChecked = values[option.name] === item.id; + return ( + + ); + })} +
+
+ ); + })} + +
+
Select LoRA Model
+
+ {availableLoras.length === 0 && ( +
+ No LoRA models available for this configuration. +
+ )} + {availableLoras.map((lora) => { + const isSelected = values.selectedLoraPath === lora.path; + return ( + + ); + })} +
+
+ +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/diffusion/mova-deployment.jsx b/docs_new/src/snippets/diffusion/mova-deployment.jsx new file mode 100644 index 000000000000..aabd6284eaa1 --- /dev/null +++ b/docs_new/src/snippets/diffusion/mova-deployment.jsx @@ -0,0 +1,115 @@ +export const MOVADeployment = () => { + // Config options + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'a100', label: 'A100', default: false } + ] + }, + resolution: { + name: 'resolution', + title: 'Resolution', + items: [ + { id: '360p', label: '360p', subtitle: 'Fast inference, lower VRAM', default: true }, + { id: '720p', label: '720p', subtitle: 'Higher resolution', default: false } + ] + } + }; + + // Initialize state + const getInitialState = () => { + const initialState = {}; + Object.entries(options).forEach(([key, option]) => { + const defaultItem = option.items.find(item => item.default); + initialState[key] = defaultItem ? 
defaultItem.id : option.items[0].id; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues(prev => ({ ...prev, [optionName]: value })); + }; + + // Generate command + const generateCommand = () => { + const { resolution } = values; + const modelPath = resolution === '720p' + ? 'OpenMOSS-Team/MOVA-720p' + : 'OpenMOSS-Team/MOVA-360p'; + + return `export SG_OUTPUT_DIR=/root/output_mova +mkdir -p "$SG_OUTPUT_DIR" + +sglang serve \\ + --model-path ${modelPath} \\ + --host 0.0.0.0 \\ + --port 30002 \\ + --adjust-frames false \\ + --num-gpus 8 \\ + --ring-degree 2 \\ + --ulysses-degree 4 \\ + --tp 1 \\ + --enable-torch-compile \\ + --save-output \\ + --output-dir "$SG_OUTPUT_DIR"`; + }; + + // Styles - with dark mode support + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit' }; + const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + })} +
+
+ ))} +
+
Run this Command:
+
{generateCommand()}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/diffusion/qwen-image-deployment.jsx b/docs_new/src/snippets/diffusion/qwen-image-deployment.jsx new file mode 100644 index 000000000000..1328c819d0d4 --- /dev/null +++ b/docs_new/src/snippets/diffusion/qwen-image-deployment.jsx @@ -0,0 +1,316 @@ +export const QwenImageDeployment = () => { + const config = { + modelFamily: 'Qwen-Image', + + options: { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + } + }, + + generateCommand: function(values) { + return `sglang serve \\ + --model-path Qwen/Qwen-Image \\ + --ulysses-degree=1 \\ + --ring-degree=1`; + } + }; + + if (!config || !config.options) { + return
Error: Invalid configuration provided
; + } + + const getInitialState = () => { + const initialState = {}; + Object.entries(config.options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(config.options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = config.generateCommand ? config.generateCommand.call(config, values) : ''; + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(config.options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + + return ( + + ); + }) + )} +
+
+ ); + })} + +
+
Run this Command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/diffusion/qwen-image-edit-deployment.jsx b/docs_new/src/snippets/diffusion/qwen-image-edit-deployment.jsx new file mode 100644 index 000000000000..866cbd70edf8 --- /dev/null +++ b/docs_new/src/snippets/diffusion/qwen-image-edit-deployment.jsx @@ -0,0 +1,319 @@ +export const QwenImageEditDeployment = () => { + const config = { + modelFamily: 'Qwen-Image-Edit', + + options: { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'h100', label: 'H100', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false } + ] + } + }, + + generateCommand: function(values) { + return `sglang serve \\ + --model-path Qwen/Qwen-Image-Edit-2511 \\ + --ulysses-degree=1 \\ + --ring-degree=1`; + } + }; + + if (!config || !config.options) { + return
Error: Invalid configuration provided
; + } + + const getInitialState = () => { + const initialState = {}; + Object.entries(config.options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(config.options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = config.generateCommand ? config.generateCommand.call(config, values) : ''; + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(config.options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + + return ( + + ); + }) + )} +
+
+ ); + })} + +
+
Run this command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/diffusion/wan21-deployment.jsx b/docs_new/src/snippets/diffusion/wan21-deployment.jsx new file mode 100644 index 000000000000..e1f44a60e748 --- /dev/null +++ b/docs_new/src/snippets/diffusion/wan21-deployment.jsx @@ -0,0 +1,337 @@ +export const Wan21Deployment = () => { + const MODELSIZE_DEFS = [ + { + id: '14b', + label: '14B', + subtitle: 'High-quality, 480P/720P', + default: true, + validTasks: ['t2v', 'i2v'], + }, + { + id: '1_3b', + label: '1.3B', + subtitle: 'Lightweight, 480P', + default: false, + validTasks: ['t2v'], + }, + ]; + + const modelConfigs = { + 't2v-14b': { + repoId: 'Wan-AI/Wan2.1-T2V-14B-Diffusers', + supportedLoras: [ + { id: 'general', label: 'General Wan2.1 LoRA', path: 'NIVEDAN/wan2.1-lora' }, + ], + }, + 't2v-1_3b': { + repoId: 'Wan-AI/Wan2.1-T2V-1.3B-Diffusers', + supportedLoras: [], + }, + 'i2v-14b': { + repoId: 'Wan-AI/Wan2.1-I2V-14B-720P-Diffusers', + supportedLoras: [ + { id: 'fight', label: 'Fight Style LoRA', path: 'valiantcat/Wan2.1-Fight-LoRA' }, + ], + }, + }; + + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [{ id: 'mi300x', label: 'MI300X/MI325X/MI355X', default: true }], + }, + task: { + name: 'task', + title: 'Task Type', + items: [ + { id: 't2v', label: 'Text-to-Video (T2V)', default: true }, + { id: 'i2v', label: 'Image-to-Video (I2V)', default: false }, + ], + }, + modelsize: { + name: 'modelsize', + title: 'Model Variant', + items: MODELSIZE_DEFS.map(({ validTasks, ...rest }) => rest), + }, + bestPractice: { + name: 'bestPractice', + title: 'Sequence Parallelism', + items: [ + { id: 'off', label: 'Standard', default: true }, + { id: 'on', label: 'Best Practice (4 GPUs)', default: false }, + ], + }, + }; + + function modelSizeItemsForTask(task) { + return MODELSIZE_DEFS.filter((item) => item.validTasks.includes(task)).map( + ({ validTasks, ...rest }) => rest + ); + } + + const getInitialState = () => { + const task = 't2v'; + const 
sizes = modelSizeItemsForTask(task); + const modelsize = sizes.find((size) => size.default)?.id || sizes[0].id; + const configKey = `${task}-${modelsize}`; + const supported = modelConfigs[configKey]?.supportedLoras || []; + return { + hardware: 'mi300x', + task, + modelsize, + bestPractice: 'off', + selectedLoraPath: supported[0]?.path ?? '', + }; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, itemId) => { + setValues((prev) => { + let next = { ...prev, [optionName]: itemId }; + + if (optionName === 'task') { + const sizes = modelSizeItemsForTask(itemId); + if (!sizes.some((size) => size.id === next.modelsize)) { + next.modelsize = sizes.find((size) => size.default)?.id || sizes[0].id; + } + } + + if (optionName === 'task' || optionName === 'modelsize') { + const configKey = `${next.task}-${next.modelsize}`; + const supported = modelConfigs[configKey]?.supportedLoras || []; + if (supported.length === 0) { + next.selectedLoraPath = ''; + } else if ( + next.selectedLoraPath && + !supported.some((lora) => lora.path === next.selectedLoraPath) + ) { + next.selectedLoraPath = supported[0].path; + } + } + + return next; + }); + }; + + const handleLoraToggle = (path) => { + setValues((prev) => ({ + ...prev, + selectedLoraPath: prev.selectedLoraPath === path ? 
'' : path, + })); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const generateCommand = () => { + const { task, modelsize, selectedLoraPath, bestPractice } = values; + const configKey = `${task}-${modelsize}`; + const config = modelConfigs[configKey]; + + if (!config) { + return '# Error: Invalid configuration'; + } + + let command = `sglang serve \\\n --model-path ${config.repoId} \\\n --dit-layerwise-offload true`; + + if (bestPractice === 'on') { + command += ` \\\n --num-gpus 4 \\\n --ulysses-degree 2 \\\n --enable-cfg-parallel`; + } + + if (selectedLoraPath) { + command += ` \\\n --lora-path ${selectedLoraPath}`; + } + + return command; + }; + + const modelSizeItems = modelSizeItemsForTask(values.task); + const loraConfigKey = `${values.task}-${values.modelsize}`; + const availableLoras = modelConfigs[loraConfigKey]?.supportedLoras || []; + const command = generateCommand(); + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? '#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? 
'#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(options).map(([key, option]) => ( +
+
{option.title}
+
+ {(key === 'modelsize' ? modelSizeItems : option.items).map((item) => { + const isChecked = values[option.name] === item.id; + return ( + + ); + })} +
+
+ ))} + + {availableLoras.length > 0 && ( +
+
Select LoRA Model (only some of the supported LoRAs are listed here)
+
+ {availableLoras.map((lora) => { + const isChecked = values.selectedLoraPath === lora.path; + return ( + + ); + })} +
+
+ )} + +
+
Run this command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/diffusion/wan22-deployment.jsx b/docs_new/src/snippets/diffusion/wan22-deployment.jsx new file mode 100644 index 000000000000..fe749d5316f5 --- /dev/null +++ b/docs_new/src/snippets/diffusion/wan22-deployment.jsx @@ -0,0 +1,216 @@ + + export const Wan22Deployment = () => { + const options = { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'b200', label: 'B200', default: true }, + { id: 'h200', label: 'H200', default: false }, + { id: 'mi300x', label: 'MI300X', default: false }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false }, + ], + }, + task: { + name: 'task', + title: 'Task Type', + items: [ + { id: 'i2v', label: 'Image-to-Video (I2V)', default: false }, + { id: 't2v', label: 'Text-to-Video (T2V)', default: true }, + { id: 'ti2v', label: 'Text/Image-to-Video (TI2V)', default: false }, + ], + }, + modelsize: { + name: 'modelsize', + title: 'Model Size', + items: [ + { id: '14b', label: 'A14B', subtitle: 'Diffusers (A14B)', default: true, validTasks: ['i2v', 't2v'] }, + { id: '5b', label: '5B', subtitle: 'Diffusers', default: false, validTasks: ['ti2v'] }, + ], + }, + bestPractice: { + name: 'bestPractice', + title: 'Sequence Parallelism', + items: [ + { id: 'off', label: 'Standard', default: true }, + { id: 'on', label: 'Best Practice (4 GPUs)', default: false }, + ], + }, + }; + + const modelConfigs = { + 'i2v-14b': { + repoId: 'Wan-AI/Wan2.2-I2V-A14B-Diffusers', + supportedLoras: [{ id: 'distill', path: 'lightx2v/Wan2.2-Distill-Loras' }], + }, + 't2v-14b': { + repoId: 'Wan-AI/Wan2.2-T2V-A14B-Diffusers', + supportedLoras: [{ id: 'arcane', path: 'Cseti/wan2.2-14B-Arcane_Jinx-lora-v1' }], + }, + 'ti2v-5b': { + repoId: 'Wan-AI/Wan2.2-TI2V-5B-Diffusers', + supportedLoras: [], + }, + }; + + const getInitialState = () => ({ + hardware: 'b200', + task: 't2v', + modelsize: '14b', + bestPractice: 'off', + selectedLoraPath: 'none', + }); 
+ + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + const availableLoras = (() => { + const configKey = `${values.task}-${values.modelsize}`; + return modelConfigs[configKey]?.supportedLoras || []; + })(); + + const handleRadioChange = (optionName, itemId) => { + setValues((prev) => { + const next = { ...prev, [optionName]: itemId }; + if (optionName === 'task') { + next.modelsize = itemId === 'ti2v' ? '5b' : '14b'; + } + + const configKey = `${next.task}-${next.modelsize}`; + const nextSupported = modelConfigs[configKey]?.supportedLoras || []; + const isValid = nextSupported.some((lora) => lora.path === prev.selectedLoraPath); + if (!isValid) { + next.selectedLoraPath = 'none'; + } + return next; + }); + }; + + const handleLoraToggle = (path) => { + setValues((prev) => ({ + ...prev, + selectedLoraPath: prev.selectedLoraPath === path ? 
'none' : path, + })); + }; + + const generateCommand = () => { + const { task, modelsize, selectedLoraPath, bestPractice } = values; + const configKey = `${task}-${modelsize}`; + const config = modelConfigs[configKey]; + if (!config) { + return '# Error: Invalid configuration'; + } + + let command = `sglang serve \\\n --model-path ${config.repoId} \\\n --dit-layerwise-offload true`; + if (bestPractice === 'on') { + command += ` \\\n --num-gpus 4 \\\n --ulysses-degree 2 \\\n --enable-cfg-parallel`; + } + if (selectedLoraPath && selectedLoraPath !== 'none') { + command += ` \\\n --lora-path ${selectedLoraPath}`; + } + return command; + }; + +const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; +const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'center', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; +const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '140px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit' }; +const itemsStyle = { display: 'flex', rowGap: '2px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center', flex: 1 }; +const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', flex: 1, background: isDark ? '#374151' : '#fff', color: isDark ? 
'#e5e7eb' : 'inherit' }; +const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; +const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; +const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? '#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + return ( +
+ {Object.entries(options).map(([key, option]) => { + const itemsToDisplay = key === 'modelsize' + ? option.items.filter((item) => item.validTasks.includes(values.task)) + : option.items; + + return ( +
+
{option.title}
+
+ {itemsToDisplay.map((item) => { + const isChecked = values[option.name] === item.id; + return ( + + ); + })} +
+
+ ); + })} + +
+
Select LoRA Model (only some of the supported LoRAs are listed here)
+
+ {availableLoras.length === 0 && ( +
+ No LoRA models available for this model. +
+ )} + {availableLoras.map((lora) => { + const isSelected = values.selectedLoraPath === lora.path; + return ( + + ); + })} +
+
+ +
+
Run this command:
+
{generateCommand()}
+
+
+ ); + }; diff --git a/docs_new/src/snippets/diffusion/zimage-turbo-deployment.jsx b/docs_new/src/snippets/diffusion/zimage-turbo-deployment.jsx new file mode 100644 index 000000000000..71d9d80f8eba --- /dev/null +++ b/docs_new/src/snippets/diffusion/zimage-turbo-deployment.jsx @@ -0,0 +1,319 @@ +export const ZImageTurboDeployment = () => { + const config = { + modelFamily: 'Z-Image-Turbo', + + options: { + hardware: { + name: 'hardware', + title: 'Hardware Platform', + items: [ + { id: 'mi300x', label: 'MI300X', default: true }, + { id: 'mi325x', label: 'MI325X', default: false }, + { id: 'mi355x', label: 'MI355X', default: false }, + { id: 'b200', label: 'B200', default: false }, + { id: 'h200', label: 'H200', default: false }, + { id: 'h100', label: 'H100', default: false } + ] + } + }, + + generateCommand: function(values) { + return `sglang serve \\ + --model-path Tongyi-MAI/Z-Image-Turbo \\ + --ulysses-degree=1 \\ + --ring-degree=1`; + } + }; + + if (!config || !config.options) { + return
Error: Invalid configuration provided
; + } + + const getInitialState = () => { + const initialState = {}; + Object.entries(config.options).forEach(([key, option]) => { + if (option.type === 'checkbox') { + initialState[key] = (option.items || []) + .filter((item) => item.default) + .map((item) => item.id); + return; + } + + if (option.type === 'text') { + initialState[key] = option.default || ''; + return; + } + + let items = option.items || []; + if (option.getDynamicItems) { + const defaultValues = {}; + Object.entries(config.options).forEach(([innerKey, innerOption]) => { + if (innerOption.type === 'checkbox') { + defaultValues[innerKey] = (innerOption.items || []) + .filter((item) => item.default) + .map((item) => item.id); + } else if (innerOption.type === 'text') { + defaultValues[innerKey] = innerOption.default || ''; + } else if (innerOption.items && innerOption.items.length > 0) { + const defaultItem = innerOption.items.find((item) => item.default); + defaultValues[innerKey] = defaultItem ? defaultItem.id : innerOption.items[0].id; + } + }); + items = option.getDynamicItems(defaultValues); + } + + const defaultItem = items && items.find((item) => item.default); + initialState[key] = defaultItem ? defaultItem.id : items && items[0] ? 
items[0].id : ''; + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = + html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { + attributes: true, + attributeFilter: ['class', 'data-theme', 'style'], + }); + + return () => observer.disconnect(); + }, []); + + const handleRadioChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const handleCheckboxChange = (optionName, itemId, isChecked) => { + setValues((prev) => { + const currentValues = prev[optionName] || []; + if (isChecked) { + return { ...prev, [optionName]: [...currentValues, itemId] }; + } + return { + ...prev, + [optionName]: currentValues.filter((id) => id !== itemId), + }; + }); + }; + + const handleTextChange = (optionName, value) => { + setValues((prev) => ({ ...prev, [optionName]: value })); + }; + + const command = config.generateCommand ? config.generateCommand.call(config, values) : ''; + + const containerStyle = { + maxWidth: '900px', + margin: '0 auto', + display: 'flex', + flexDirection: 'column', + gap: '4px', + }; + const cardStyle = { + padding: '8px 12px', + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, + borderRadius: '4px', + display: 'flex', + alignItems: 'center', + gap: '12px', + background: isDark ? '#1f2937' : '#fff', + }; + const titleStyle = { + fontSize: '13px', + fontWeight: '600', + minWidth: '140px', + flexShrink: 0, + color: isDark ? 
'#e5e7eb' : 'inherit', + }; + const itemsStyle = { + display: 'flex', + rowGap: '2px', + columnGap: '6px', + flexWrap: 'wrap', + alignItems: 'center', + flex: 1, + }; + const labelBaseStyle = { + padding: '4px 10px', + border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, + borderRadius: '3px', + cursor: 'pointer', + display: 'inline-flex', + flexDirection: 'column', + alignItems: 'center', + justifyContent: 'center', + fontWeight: '500', + fontSize: '13px', + transition: 'all 0.2s', + userSelect: 'none', + minWidth: '45px', + textAlign: 'center', + flex: 1, + background: isDark ? '#374151' : '#fff', + color: isDark ? '#e5e7eb' : 'inherit', + }; + const checkedStyle = { + background: '#D45D44', + color: 'white', + borderColor: '#D45D44', + }; + const disabledStyle = { + cursor: 'not-allowed', + opacity: 0.5, + }; + const subtitleStyle = { + display: 'block', + fontSize: '9px', + marginTop: '1px', + lineHeight: '1.1', + opacity: 0.7, + }; + const textInputStyle = { + flex: 1, + padding: '8px 10px', + borderRadius: '4px', + border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, + background: isDark ? '#111827' : '#fff', + color: isDark ? '#e5e7eb' : '#111827', + fontSize: '13px', + }; + const commandDisplayStyle = { + flex: 1, + padding: '12px 16px', + background: isDark ? '#111827' : '#f5f5f5', + borderRadius: '6px', + fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", + fontSize: '12px', + lineHeight: '1.5', + color: isDark ? '#e5e7eb' : '#374151', + whiteSpace: 'pre-wrap', + overflowX: 'auto', + margin: 0, + border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, + }; + + return ( +
+ {Object.entries(config.options).map(([key, option]) => { + if (option.condition && !option.condition(values)) { + return null; + } + + const items = option.getDynamicItems ? option.getDynamicItems(values) : option.items || []; + + return ( +
+
{option.title}
+
+ {option.type === 'text' ? ( + handleTextChange(option.name, event.target.value)} + style={textInputStyle} + /> + ) : option.type === 'checkbox' ? ( + (option.items || []).map((item) => { + const isChecked = (values[option.name] || []).includes(item.id); + const isDisabled = + item.required || + (typeof item.disabledWhen === 'function' && item.disabledWhen(values)); + + return ( + + ); + }) + ) : ( + items.map((item) => { + const isChecked = values[option.name] === item.id; + const isDisabled = Boolean(item.disabled); + + return ( + + ); + }) + )} +
+
+ ); + })} + +
+
Run this command:
+
{command}
+
+
+ ); +}; diff --git a/docs_new/src/snippets/specbundle/specbundle-deployment.jsx b/docs_new/src/snippets/specbundle/specbundle-deployment.jsx new file mode 100644 index 000000000000..aa40caa43605 --- /dev/null +++ b/docs_new/src/snippets/specbundle/specbundle-deployment.jsx @@ -0,0 +1,214 @@ +export const SpecBundleDeployment = () => { + // Config options based on SpecBundleConfigGenerator - matching original structure exactly + const baseConfig = { + options: { + mode: { + name: 'mode', + title: 'Launch Mode', + renderType: 'radio', + items: [ + { id: 'with-server', label: 'With Server', subtitle: 'Launch SGLang server & Benchmark concurrently', default: true }, + { id: 'without-server', label: 'Without Server', subtitle: 'Connect to an existing server (--skip-launch-server)', default: false } + ] + }, + common: { + name: 'common', + title: 'Common Configuration', + renderType: 'inputs', + items: [ + { id: 'modelPath', label: 'Model Path', type: 'text', placeholder: 'e.g., meta-llama/Llama-3.1-8B-Instruct', default: 'meta-llama/Llama-3.1-8B-Instruct', description: 'Path to the target model.' }, + { id: 'port', label: 'Port', type: 'number', default: 30000, description: 'Port to launch/connect the SGLang server.' }, + { id: 'configList', label: 'Config List', type: 'text', default: '1,3,1,4', description: 'Format: ,,,' }, + { id: 'benchmarkList', label: 'Benchmark List', type: 'textarea', default: 'mtbench:5 ceval:5:accountant', description: 'Format: ::. Supported: aime, ceval, financeqa, gpqa, gsm8k, humaneval, livecodebench, math500, mmlu, mmstar, mtbench, simpleqa' } + ] + }, + server: { + name: 'server', + title: 'Server Configuration', + renderType: 'inputs', + requiredMode: 'with-server', + items: [ + { id: 'draftModelPath', label: 'Draft Model Path', type: 'text', placeholder: 'Path to draft model', default: '', description: 'Path to the speculative draft model.' 
}, + { id: 'tpSize', label: 'TP Size', type: 'number', default: 1, description: 'Number of GPUs for Tensor Parallelism.' }, + { id: 'memFraction', label: 'Memory Fraction Static', type: 'number', step: '0.1', default: 0.9, description: 'The memory fraction for the static memory.' }, + { id: 'attentionBackend', label: 'Attention Backend', type: 'text', default: '', description: 'The attention backend used in sglang' }, + { id: 'trustRemoteCode', label: 'Trust Remote Code', type: 'checkbox', default: true, description: 'Whether to trust remote code.' } + ] + } + } + }; + + // Initialize state - matching original logic + const getInitialState = () => { + const initialState = {}; + Object.values(baseConfig.options).forEach(option => { + if (option.renderType === 'radio') { + const defaultItem = option.items.find(item => item.default); + initialState[option.name] = defaultItem ? defaultItem.id : option.items[0].id; + } else if (option.renderType === 'inputs') { + option.items.forEach(item => { + initialState[item.id] = item.default; + }); + } + }); + return initialState; + }; + + const [values, setValues] = useState(getInitialState); + const [isDark, setIsDark] = useState(false); + + // Detect dark mode + useEffect(() => { + const checkDarkMode = () => { + const html = document.documentElement; + const isDarkMode = html.classList.contains('dark') || + html.getAttribute('data-theme') === 'dark' || + html.style.colorScheme === 'dark'; + setIsDark(isDarkMode); + }; + checkDarkMode(); + const observer = new MutationObserver(checkDarkMode); + observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class', 'data-theme', 'style'] }); + return () => observer.disconnect(); + }, []); + + // Get display options based on current mode + const getDisplayOptions = () => { + const options = {}; + const currentMode = values.mode; + Object.entries(baseConfig.options).forEach(([key, option]) => { + if (option.requiredMode && option.requiredMode !== currentMode) { 
+ return; + } + options[key] = option; + }); + return options; + }; + + const handleRadioChange = (optionName, itemId) => { + setValues(prev => ({ ...prev, [optionName]: itemId })); + }; + + const handleInputChange = (itemId, value) => { + setValues(prev => ({ ...prev, [itemId]: value })); + }; + + const handleCheckboxChange = (itemId, checked) => { + setValues(prev => ({ ...prev, [itemId]: checked })); + }; + + // Generate command - matching original logic + const generateCommand = () => { + const { mode, modelPath, port, configList, benchmarkList, draftModelPath, tpSize, memFraction, attentionBackend, trustRemoteCode } = values; + + let cmd = 'python bench_eagle3.py'; + if (modelPath) cmd += ` \\\n --model-path ${modelPath}`; + if (port) cmd += ` \\\n --port ${port}`; + if (configList) cmd += ` \\\n --config-list ${configList}`; + if (benchmarkList) cmd += ` \\\n --benchmark-list ${benchmarkList.replace(/\n/g, ' ')}`; + + if (mode === 'without-server') { + cmd += ' \\\n --skip-launch-server'; + } else { + if (draftModelPath) cmd += ` \\\n --speculative-draft-model-path ${draftModelPath}`; + if (tpSize) cmd += ` \\\n --tp-size ${tpSize}`; + if (memFraction) cmd += ` \\\n --mem-fraction-static ${memFraction}`; + if (attentionBackend) cmd += ` \\\n --attention-backend ${attentionBackend}`; + if (trustRemoteCode) cmd += ` \\\n --trust-remote-code`; + } + + return cmd; + }; + + // Styles + const containerStyle = { maxWidth: '900px', margin: '0 auto', display: 'flex', flexDirection: 'column', gap: '4px' }; + const cardStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}`, borderLeft: `3px solid ${isDark ? '#E85D4D' : '#D45D44'}`, borderRadius: '4px', display: 'flex', alignItems: 'flex-start', gap: '12px', background: isDark ? '#1f2937' : '#fff' }; + const titleStyle = { fontSize: '13px', fontWeight: '600', minWidth: '180px', flexShrink: 0, color: isDark ? 
'#e5e7eb' : 'inherit', paddingTop: '4px' }; + const contentStyle = { flex: 1 }; + const itemsStyle = { display: 'flex', rowGap: '4px', columnGap: '6px', flexWrap: 'wrap', alignItems: 'center' }; + const labelBaseStyle = { padding: '4px 10px', border: `1px solid ${isDark ? '#9ca3af' : '#d1d5db'}`, borderRadius: '3px', cursor: 'pointer', display: 'inline-flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', fontWeight: '500', fontSize: '13px', transition: 'all 0.2s', userSelect: 'none', minWidth: '45px', textAlign: 'center', background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit' }; + const checkedStyle = { background: '#D45D44', color: 'white', borderColor: '#D45D44' }; + const subtitleStyle = { display: 'block', fontSize: '9px', marginTop: '1px', lineHeight: '1.1', opacity: 0.7 }; + const inputGroupStyle = { display: 'flex', flexDirection: 'column', gap: '8px' }; + const inputRowStyle = { display: 'flex', alignItems: 'flex-start', gap: '12px' }; + const inputLabelStyle = { fontSize: '13px', fontWeight: '500', minWidth: '180px', flexShrink: 0, color: isDark ? '#e5e7eb' : 'inherit', paddingTop: '8px' }; + const inputContentStyle = { flex: 1, display: 'flex', flexDirection: 'column' }; + const inputStyle = { padding: '8px 12px', border: `1px solid ${isDark ? '#4b5563' : '#d1d5db'}`, borderRadius: '4px', fontSize: '13px', background: isDark ? '#374151' : '#fff', color: isDark ? '#e5e7eb' : 'inherit', width: '100%', boxSizing: 'border-box' }; + const textareaStyle = { ...inputStyle, minHeight: '60px', resize: 'vertical' }; + const descStyle = { color: isDark ? '#9ca3af' : '#666', marginTop: '4px', fontSize: '11px' }; + const commandDisplayStyle = { flex: 1, padding: '12px 16px', background: isDark ? '#111827' : '#f5f5f5', borderRadius: '6px', fontFamily: "'Menlo', 'Monaco', 'Courier New', monospace", fontSize: '12px', lineHeight: '1.5', color: isDark ? 
'#e5e7eb' : '#374151', whiteSpace: 'pre-wrap', overflowX: 'auto', margin: 0, border: `1px solid ${isDark ? '#374151' : '#e5e7eb'}` }; + + const displayOptions = getDisplayOptions(); + + return ( +
+ {Object.entries(displayOptions).map(([key, option]) => ( +
+ {/* Render Radio Group - title on left */} + {option.renderType === 'radio' && ( +
+
{option.title}
+
+ {option.items.map(item => { + const isChecked = values[option.name] === item.id; + return ( + + ); + })} +
+
+ )} + + {/* Render Input Group - each input has label on left */} + {option.renderType === 'inputs' && ( +
+ {option.items.map(item => ( +
+
{item.label}
+
+ {item.type === 'textarea' ? ( +