
Commit 1ec931c

[2/3][Feat]: Offline DFlash training (#1343)
### What does this PR do?

Type of change: new feature

Part 2 of a 3-PR series splitting #1271:

- **[1/3] #1296**: File reorg + deprecate `ParallelDraft`
- **[2/3] this PR**: Offline DFlash training (depends on #1296)
- **[3/3] #1297**: Extract `HFSpecDecMixin`

Changes:

- Add `dflash_offline` flag to `DFlashConfig` for training from pre-computed hidden states; deletes base model layers to save memory.
- Add Pydantic validators on `DFlashConfig`:
  - `_derive_dflash_offline` — auto-derive `dflash_offline` from `data_args.offline_data_path` in validation context. Not user-configurable: any user-supplied value is overridden by the derived value.
  - `_resolve_mask_token_id` — auto-detect `dflash_mask_token_id` from `tokenizer.mask_token_id`.
  - `_check_mask_token_id` — fail fast if unset after resolution.
- `HFDFlashModel.modify()`: select `num_orig_hidden_layers` when offline; pick `_base_model_lm_head` device when no base layers present; drop base-model `layers` module.
- `HFDFlashModel.forward()`: add offline branch — consumes precomputed `base_model_outputs` via `DFlashBaseModelOutput.from_offline_dict`, and when `dflash_self_logit_distillation` is enabled with `base_model_logits` absent, recomputes logits from `base_model_hidden_states` via `_base_model_lm_head`. Raises a clear error from the non-training / `pseudo_speculative_generate` paths when `dflash_offline=True`, since base-model layers have been deleted.
- `DFlashBaseModelOutput` dataclass in `modeling_dflash.py` (with `from_offline_dict` classmethod) to unify online/offline output shapes. `aux_hidden_states` is required in `from_offline_dict` so missing keys fail fast at the entry point rather than deeper in the forward.
- `examples/speculative_decoding/main.py`: replace inline `mask_token_id` auto-detect with `DFlashConfig.model_validate(dflash_cfg, context={"tokenizer": tokenizer, "data_args": data_args})`.

### Silent bug fix — `add_generation_template` → `add_generation_prompt`

The pre-refactor `compute_hidden_states_hf.py` passed `add_generation_template=False` to `tokenizer.apply_chat_template`. This kwarg does not exist on HF `apply_chat_template` and was being silently ignored, so the intended "don't append a generation prompt" behavior was never actually applied. The new `tokenize_with_loss_mask` helper in `examples/speculative_decoding/collect_hidden_states/common.py` uses the correct `add_generation_prompt=False`.

**This is a real behavior change** for anyone re-dumping hidden states: trailing generation prompts that were previously appended to the tokenized sequences will no longer be included.

### Testing

- New tests:
  - `tests/unit/torch/speculative/plugins/test_hf_dflash_offline.py` — CPU unit tests for the convert path (online keeps base layers, offline deletes them; `num_orig_hidden_layers` drives `target_layer_ids` in offline mode) and the `DFlashConfig._derive_dflash_offline` validator.
  - `TestDFlashOfflineForwardGPU` in `tests/gpu/torch/speculative/plugins/test_hf_dflash.py` — GPU forward smoke test with precomputed `base_model_outputs`, plus the `dflash_self_logit_distillation` logit-recompute path.
- Training test:

  <img width="454" height="317" alt="image" src="https://github.com/user-attachments/assets/79b92790-4d15-4313-bb9b-f35665b012e6" />
  <img width="456" height="310" alt="image" src="https://github.com/user-attachments/assets/4558559f-9c35-49ed-b36e-82fbc99eab23" />

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ — additive `dflash_offline` flag defaulting to `False`; validators fall through when no context is provided.
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: ✅ — see the Testing section above.
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

### TODO (follow-up)

- [x] Update `examples/speculative_decoding/collect_hidden_states/compute_hidden_states_*.py` to support DFlash offline data. Current scripts are Eagle-specific — they hardcode the `[2, N/2, N-3]` aux-layer selection and emit `{input_ids, hidden_states, aux_hidden_states}`. DFlash offline needs:
  - Aux layer indices driven by `build_target_layer_ids(num_orig_hidden_layers, num_draft_layers)` (or a configurable list), not the Eagle triplet.
  - A `base_model_hidden_states` key (last-layer hidden states) so `DFlashBaseModelOutput.from_offline_dict` and the `dflash_self_logit_distillation` recompute path can consume it.
  - An optional `base_model_logits` dump so offline training can skip the self-distillation logit recomputation when logits are available.

### Additional Information

Base branch is #1296 (file reorg). Retarget to `main` once #1296 merges.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Offline DFlash speculative-decoding training from precomputed base-model hidden states
  * Answer-only-loss training with persisted loss masks and optional chat-template support
  * Flexible auxiliary-layer selection via CLI and an exposed default aux-layer helper
  * Auto-derived offline flag in config and automatic memory optimization during offline conversion
* **Documentation**
  * Updated guides for the offline pipeline, aux-layer selection, and loss-masking options
* **Tests**
  * New unit, GPU, and regression tests covering offline conversion, training, and config derivation

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
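For orientation (illustrative, not part of the diff): a minimal sketch of the new validation entry point described above. The tokenizer name, data path, and the explicit mask-token value are placeholders; in the example training script this call happens inside `train()` in `examples/speculative_decoding/main.py`.

```python
# Hypothetical standalone use of the new DFlashConfig validators.
import argparse

from transformers import AutoTokenizer

from modelopt.torch.speculative.config import DFlashConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder model
data_args = argparse.Namespace(offline_data_path="/data/dumped_hidden_states")  # placeholder path

# dflash_mask_token_id is auto-detected from tokenizer.mask_token_id when omitted;
# a placeholder value is set here only so the sketch also works with tokenizers
# that define no mask token. dflash_offline is derived from
# data_args.offline_data_path and overrides any user-supplied value.
dflash_cfg = {"dflash_mask_token_id": 151643}

dflash_cfg = DFlashConfig.model_validate(
    dflash_cfg, context={"tokenizer": tokenizer, "data_args": data_args}
).model_dump()
print(dflash_cfg["dflash_offline"], dflash_cfg["dflash_mask_token_id"])
```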
1 parent 7c80d85 commit 1ec931c

15 files changed

Lines changed: 704 additions & 78 deletions


CHANGELOG.rst

Lines changed: 2 additions & 1 deletion
@@ -18,6 +18,7 @@ Changelog
 **New Features**

 - Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter <modelopt.onnx.export.fp8_exporter.FP8QuantExporter>`, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx/torch_quant_to_onnx.py>`_ for the general timm-model quantize→ONNX workflow.
+- Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.

 0.44 (2026-05-xx)
 ^^^^^^^^^^^^^^^^^
@@ -34,7 +35,7 @@ Changelog
 - Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#load-qatptq-model-and-serve-in-vllm-wip>`_ for more details.
 - [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
 - Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml>`_ for usage.
-- Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.kernels.quantization.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
+- Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.

 **Backward Breaking Changes**

examples/speculative_decoding/collect_hidden_states/common.py

Lines changed: 171 additions & 0 deletions

@@ -0,0 +1,171 @@
# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Shared helpers for ``compute_hidden_states_*`` dump scripts.

Groups two concerns used by both the HF and vLLM dump entry points:

- Aux-layer selection via the ``--aux-layers`` flag (``"eagle"`` / ``"dflash"``
  / explicit comma-separated list). Returned values are **0-based transformer
  layer IDs**; callers indexing into HuggingFace's ``outputs.hidden_states``
  tuple must add ``+1`` because ``hidden_states[0]`` is the embedding output.
- Answer-only-loss support: registering ``--answer-only-loss`` /
  ``--chat-template`` flags, loading a chat template file, verifying the
  template contains ``{% generation %}`` tags, and computing per-conversation
  ``loss_mask`` from the tokenizer's ``assistant_masks``.
"""

import argparse
from pathlib import Path

import torch

_DFLASH_DEFAULT_NUM_DRAFT_LAYERS = 5


def add_aux_layers_args(parser: argparse.ArgumentParser) -> None:
    """Register the ``--aux-layers`` flag on ``parser``."""
    parser.add_argument(
        "--aux-layers",
        type=str,
        default="eagle",
        help=(
            "Aux layer indices to capture. One of: "
            "'eagle' (EAGLE-3 default from modelopt), "
            f"'dflash' ({_DFLASH_DEFAULT_NUM_DRAFT_LAYERS}-layer DFlash default from modelopt), "
            "or a comma-separated list like '2,5,8' to override. Default: eagle."
        ),
    )


def resolve_aux_layers(args: argparse.Namespace, num_hidden_layers: int) -> list[int]:
    """Resolve ``args.aux_layers`` to a sorted, de-duped list of 0-based layer IDs."""
    value = args.aux_layers.strip().lower()
    if value == "eagle":
        from modelopt.torch.speculative.plugins.hf_eagle import default_eagle_aux_layer_ids

        return default_eagle_aux_layer_ids(num_hidden_layers)
    if value == "dflash":
        from modelopt.torch.speculative.plugins.modeling_dflash import build_target_layer_ids

        return sorted(
            set(build_target_layer_ids(num_hidden_layers, _DFLASH_DEFAULT_NUM_DRAFT_LAYERS))
        )
    try:
        indices = [int(tok) for tok in args.aux_layers.split(",") if tok.strip()]
    except ValueError as e:
        raise ValueError(
            f"--aux-layers must be 'eagle', 'dflash', or a comma-separated int list, "
            f"got: {args.aux_layers!r}"
        ) from e
    if not indices:
        raise ValueError(f"--aux-layers int list is empty: {args.aux_layers!r}")
    for i in indices:
        if not 0 <= i < num_hidden_layers:
            raise ValueError(f"--aux-layers index {i} out of range [0, {num_hidden_layers})")
    return sorted(set(indices))


def add_answer_only_loss_args(parser: argparse.ArgumentParser) -> None:
    """Register ``--answer-only-loss`` and ``--chat-template`` flags on ``parser``."""
    parser.add_argument(
        "--answer-only-loss",
        action="store_true",
        help=(
            "If set, compute an assistant-token mask via the tokenizer's "
            "{% generation %} tags and save it as 'loss_mask' in each .pt file. "
            "Downstream offline training uses this to apply loss only on "
            "assistant-produced tokens."
        ),
    )
    parser.add_argument(
        "--chat-template",
        type=Path,
        default=None,
        help=(
            "Path to a Jinja chat template file that overrides tokenizer.chat_template. "
            "Required with --answer-only-loss if the model's default template lacks "
            "{% generation %} / {% endgeneration %} tags."
        ),
    )


def load_chat_template(path: Path | None) -> str | None:
    """Read a Jinja chat template from ``path``, or return ``None`` if not provided."""
    if path is None:
        return None
    with open(path) as f:
        return f.read()


def verify_generation_tags(chat_template: str | None) -> None:
    """Raise if ``chat_template`` lacks ``{% generation %}`` / ``{% endgeneration %}`` tags.

    These tags are required for ``apply_chat_template(..., return_assistant_tokens_mask=True)``
    to return the assistant-token mask needed for answer-only-loss training.
    """
    if chat_template and "generation" in chat_template and "endgeneration" in chat_template:
        return
    raise ValueError(
        "--answer-only-loss requires {% generation %} / {% endgeneration %} tags in the "
        "chat template, but the current template does not have them.\n\n"
        "To fix, pass --chat-template pointing to a template with generation tags:\n"
        " 1. Copy the model's chat_template from tokenizer_config.json\n"
        " 2. Wrap assistant content with {% generation %} / {% endgeneration %}\n"
        "See https://huggingface.co/docs/transformers/en/chat_templating"
        "#train-on-completions-only for details."
    )


def tokenize_with_loss_mask(
    tokenizer,
    conversations: list,
    answer_only_loss: bool,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Tokenize one conversation and derive its loss mask from the same call.

    Uses a single ``apply_chat_template`` invocation so ``input_ids`` and
    ``loss_mask`` are guaranteed to come from the same tokenization — this
    eliminates the risk of argument drift between two separate calls.

    Returns:
        input_ids: ``LongTensor`` of shape ``(1, seq_len)``.
        loss_mask: ``LongTensor`` of shape ``(seq_len,)``. All-ones when
            ``answer_only_loss=False``; the assistant-token mask from the
            tokenizer when ``answer_only_loss=True`` (requires ``{% generation %}``
            tags in the chat template — verify beforehand).
    """
    out = tokenizer.apply_chat_template(
        conversations,
        return_tensors="pt",
        return_dict=True,
        return_assistant_tokens_mask=answer_only_loss,
        add_generation_prompt=False,
    )
    input_ids = out["input_ids"]
    seq_len = input_ids.shape[-1]
    if answer_only_loss:
        mask = out["assistant_masks"]
        if not isinstance(mask, torch.Tensor):
            mask = torch.tensor(mask, dtype=torch.long)
        loss_mask = mask.squeeze(0).to(torch.long)
        if loss_mask.shape[0] != seq_len:
            raise RuntimeError(
                f"assistant_masks length {loss_mask.shape[0]} does not match "
                f"input_ids length {seq_len}"
            )
    else:
        loss_mask = torch.ones(seq_len, dtype=torch.long)
    return input_ids, loss_mask
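As a usage illustration only (not part of the diff): a minimal sketch of wiring these helpers into a dump script, assuming it runs from `examples/speculative_decoding/collect_hidden_states/` with this PR's `modelopt` installed. The tokenizer name, layer count, and conversation are placeholders; the real wiring is in `compute_hidden_states_hf.py` below.

```python
# Hypothetical wiring of the helpers above; tokenizer, layer count, and the
# conversation are placeholders. Run from collect_hidden_states/ so `common`
# is importable.
import argparse

from common import (
    add_answer_only_loss_args,
    add_aux_layers_args,
    load_chat_template,
    resolve_aux_layers,
    tokenize_with_loss_mask,
    verify_generation_tags,
)
from transformers import AutoTokenizer

parser = argparse.ArgumentParser()
add_aux_layers_args(parser)
add_answer_only_loss_args(parser)
args = parser.parse_args(["--aux-layers", "dflash"])

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
override_template = load_chat_template(args.chat_template)
if override_template is not None:
    tokenizer.chat_template = override_template
if args.answer_only_loss:
    verify_generation_tags(tokenizer.chat_template)

# 0-based transformer layer IDs; add +1 when indexing outputs.hidden_states.
selected_layer_ids = resolve_aux_layers(args, num_hidden_layers=28)

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
]
input_ids, loss_mask = tokenize_with_loss_mask(tokenizer, conversation, args.answer_only_loss)
print(selected_layer_ids, tuple(input_ids.shape), tuple(loss_mask.shape))
```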

examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py

Lines changed: 33 additions & 24 deletions
@@ -20,6 +20,14 @@
 from pathlib import Path

 import torch
+from common import (
+    add_answer_only_loss_args,
+    add_aux_layers_args,
+    load_chat_template,
+    resolve_aux_layers,
+    tokenize_with_loss_mask,
+    verify_generation_tags,
+)
 from datasets import load_dataset
 from tqdm import tqdm as tqdm
 from transformers import AutoModel, AutoTokenizer
@@ -90,6 +98,8 @@ def parse_args() -> argparse.Namespace:
         action="store_true",
         help="Set trust_remote_code for Huggingface models and tokenizers",
     )
+    add_aux_layers_args(parser)
+    add_answer_only_loss_args(parser)

     return parser.parse_args()

@@ -138,12 +148,20 @@ def keep_conversation(entry):
         args.model, dtype="auto", device_map="auto", trust_remote_code=args.trust_remote_code
     )
     num_hidden_layers = getattr(model.config, "num_hidden_layers", None)
+    if num_hidden_layers is None:
+        raise ValueError(f"model.config has no 'num_hidden_layers' attribute: {model.config}")
+    selected_layer_ids = resolve_aux_layers(args, num_hidden_layers)

     tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=args.trust_remote_code)
     if tokenizer.pad_token is None:
         tokenizer.pad_token = tokenizer.eos_token
+    override_template = load_chat_template(args.chat_template)
+    if override_template is not None:
+        tokenizer.chat_template = override_template
     if tokenizer.chat_template is not None:
         tokenizer.chat_template = tokenizer.chat_template.replace(REMOVE_THINK_CHAT_TEMPLATE, "")
+    if args.answer_only_loss:
+        verify_generation_tags(tokenizer.chat_template)

     output_dir = args.output_dir
     output_dir.mkdir(parents=True, exist_ok=True)
@@ -152,29 +170,21 @@ def keep_conversation(entry):
     num_success = 0
     pbar = tqdm(total=len(dataset), desc=f"DP#{args.dp_rank} Processing conversations")

-    async def dump_hidden_states(idx: int, conversation_id: int, input_ids: torch.Tensor):
+    async def dump_hidden_states(
+        idx: int,
+        conversation_id: int,
+        input_ids: torch.Tensor,
+        loss_mask: torch.Tensor,
+    ):
         nonlocal num_success
-        nonlocal num_hidden_layers

         # Get hidden states
         with torch.inference_mode():
             outputs = model(input_ids=input_ids.to(model.device), output_hidden_states=True)
-            if num_hidden_layers is None:
-                num_hidden_layers = len(outputs.hidden_states) - 1
-            else:
-                assert num_hidden_layers + 1 == len(outputs.hidden_states), (
-                    f"Expected {num_hidden_layers}+1 layers of hidden states, but got {len(outputs.hidden_states)}."
-                )
-        # Extract hidden states from layers with index (2, N/2, N-3), and the output hidden states
+        # outputs.hidden_states[0] is the embedding output; layer k output is at index k+1.
         hidden_states = outputs.hidden_states
-        selected_layer_indices = [
-            2,
-            max(0, num_hidden_layers // 2),
-            max(1, num_hidden_layers - 3),
-        ]
-        selected_layer_indices = sorted(set(selected_layer_indices))
         aux_hidden_states = torch.cat(
-            [hidden_states[i].squeeze(0).cpu() for i in selected_layer_indices], dim=-1
+            [hidden_states[lid + 1].squeeze(0).cpu() for lid in selected_layer_ids], dim=-1
         )
         output_hidden_states = hidden_states[-1].squeeze(0).cpu()
         output_file = output_dir / f"{conversation_id}.pt"
@@ -185,6 +195,7 @@ async def dump_hidden_states(idx: int, conversation_id: int, input_ids: torch.Te
                     "input_ids": input_ids.squeeze(0).cpu(),
                     "hidden_states": output_hidden_states,
                     "aux_hidden_states": aux_hidden_states,
+                    "loss_mask": loss_mask,
                     "conversation_id": conversation_id,
                 },
                 f,
@@ -206,19 +217,17 @@ async def submit_generates():
                 num_invalid += 1
                 continue

-            # Tokenize and check length
-            # return_dict=True ensures BatchEncoding is returned on all transformers
-            # versions: in <5.0 the default is False (returns raw tensor), in 5.0+
-            # the default changed to True (returns BatchEncoding).
-            input_ids = tokenizer.apply_chat_template(
-                conversations, return_tensors="pt", return_dict=True, add_generation_template=False
-            )["input_ids"]
+            # Single apply_chat_template call produces both input_ids and loss_mask,
+            # guaranteeing they come from the same tokenization.
+            input_ids, loss_mask = tokenize_with_loss_mask(
+                tokenizer, conversations, args.answer_only_loss
+            )
             num_input_tokens = input_ids.shape[1]
             if num_input_tokens <= 10 or num_input_tokens > args.max_seq_len:
                 num_skipped_too_long += 1
                 continue

-            tasks.append(dump_hidden_states(idx, conversation_id, input_ids))
+            tasks.append(dump_hidden_states(idx, conversation_id, input_ids, loss_mask))
             # Increment only for valid conversations to match dump file index
             idx += 1
         await asyncio.gather(*tasks)
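As a quick sanity check (illustrative, not part of the diff): loading one record written by the updated script. The path is a placeholder; the expected keys and shapes follow the `torch.save` call shown above.

```python
# Illustrative check of one dumped record; the path is a placeholder.
import torch

record = torch.load("dumped_hidden_states/0.pt", weights_only=True)
print(sorted(record))
# ['aux_hidden_states', 'conversation_id', 'hidden_states', 'input_ids', 'loss_mask']

seq_len = record["input_ids"].shape[0]
# Last-layer hidden states: (seq_len, hidden_size).
assert record["hidden_states"].shape[0] == seq_len
# Selected aux layers concatenated on the last dim: (seq_len, hidden_size * num_aux_layers).
assert record["aux_hidden_states"].shape[0] == seq_len
# All ones unless --answer-only-loss was set at dump time.
assert record["loss_mask"].shape == (seq_len,)
```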

examples/speculative_decoding/collect_hidden_states/compute_hidden_states_trtllm.py

Lines changed: 3 additions & 1 deletion
@@ -23,6 +23,7 @@
 from pathlib import Path

 import torch
+from common import add_aux_layers_args, resolve_aux_layers
 from datasets import load_dataset
 from tensorrt_llm import LLM, SamplingParams
 from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig, SaveHiddenStatesDecodingConfig
@@ -122,6 +123,7 @@ def parse_args() -> argparse.Namespace:
         default=None,
         help="""moe_cluster_parallel_size for TRTLLM.""",
     )
+    add_aux_layers_args(parser)

     return parser.parse_args()

@@ -194,7 +196,7 @@ def keep_conversation(entry):
         "output_directory": str(args.output_dir),
         "write_interval": 1,
         "file_prefix": f"dp_{args.dp_rank}",
-        "eagle3_layers_to_capture": {1, num_hidden_layers // 2 - 1, num_hidden_layers - 4},
+        "eagle3_layers_to_capture": set(resolve_aux_layers(args, num_hidden_layers)),
     }
     sampling_params = SamplingParams(max_tokens=32, temperature=0)

examples/speculative_decoding/eagle_utils.py

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@ def make_speculative_data_module(
         raise ValueError("sample_size must be -1 (use all samples) or a positive integer")
     if data_args.sample_size > 0:
         dumped_files = dumped_files[: data_args.sample_size]
-    train_dataset = OfflineSupervisedDataset(dumped_files)
+    train_dataset = OfflineSupervisedDataset(dumped_files, answer_only_loss=answer_only_loss)
     data_collator = EagleOfflineDataCollator(train_len=train_len)

     return {

examples/speculative_decoding/main.py

Lines changed: 4 additions & 13 deletions
@@ -49,7 +49,7 @@

 import modelopt.torch.opt as mto
 import modelopt.torch.speculative as mtsp
-from modelopt.torch.speculative.config import EagleConfig
+from modelopt.torch.speculative.config import DFlashConfig, EagleConfig
 from modelopt.torch.speculative.utils import load_vlm_or_llm, patch_transformers5_params_loading
 from modelopt.torch.utils import print_rank_0

@@ -318,18 +318,9 @@ def train():
             model.eagle_module.d2t = torch.load(data_args.draft_vocab_cache, weights_only=True)
             print_rank_0(f"Loaded draft vocab cache from {data_args.draft_vocab_cache}.")
     elif training_args.mode == "dflash":
-        # Auto-detect mask_token_id from tokenizer if not set
-        if not dflash_cfg.get("dflash_mask_token_id"):
-            if tokenizer.mask_token_id is not None:
-                dflash_cfg["dflash_mask_token_id"] = tokenizer.mask_token_id
-                print_rank_0(
-                    f"Auto-detected mask_token_id={tokenizer.mask_token_id} from tokenizer"
-                )
-            else:
-                raise ValueError(
-                    "mask_token_id not found in tokenizer and not set in config. "
-                    "Set dflash.dflash_mask_token_id in the training YAML."
-                )
+        dflash_cfg = DFlashConfig.model_validate(
+            dflash_cfg, context={"tokenizer": tokenizer, "data_args": data_args}
+        ).model_dump()
         mtsp.convert(model, [("dflash", dflash_cfg)])
     else:
         raise Exception(f"{training_args.mode} is not supported!")
