
Commit 7291fca

Remove Nemotron-SFT-Math-v3 from data blend as not correctly used
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 5bda5fc commit 7291fca

4 files changed

Lines changed: 36 additions & 43 deletions


examples/dataset/MEGATRON_DATA_PREP.md

Lines changed: 6 additions & 19 deletions
@@ -5,16 +5,19 @@ Tokenization commands for the Nemotron Pre-Training and Post-Training dataset co
 Two parameters vary by model — set them before running the commands below:
 
 ```bash
-TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2  # HuggingFace tokenizer (or local path)
-OUTPUT_DIR=tokenized_nano_v2                 # Output directory for tokenized files
+TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2  # HuggingFace tokenizer (or local path)
+OUTPUT_DIR=tokenized_nemotron_v2             # Output directory for tokenized files
 ```
 
 Output files are written in Megatron binary format (`.bin` / `.idx`). See [examples/dataset/README.md](../dataset/README.md) for full tokenization documentation.
 
 > [!TIP]
 > Token count for a `.bin` file = file size in bytes ÷ 4. This is also printed by the tokenization script on completion.
 
-> Tokenizing each of the datasets below will take anywhere between 10 minutes to 1 hour. You can tokenize all in parallel to speed up the process.
+> [!NOTE]
+> Tokenizing each of the datasets below will take anywhere from 10 minutes to a few hours. You can tokenize them all in parallel to speed up the process.
+>
+> You may tokenize more datasets or skip some, depending on your needs.
 
 ---
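As a quick check of the token-count rule in the tip above (tokens = file size in bytes ÷ 4), here is a minimal bash sketch added for illustration; the filename is one of the expected outputs listed later in this file, and the GNU/BSD `stat` fallback is an assumption about the local environment, not something this commit ships:

```bash
# Tokens = bytes / 4, per the tip above; the filename matches an expected output
# listed further down in MEGATRON_DATA_PREP.md.
BIN=${OUTPUT_DIR}/nvidia--Nemotron-Math-v2_default_high_part00_messages.bin

# stat -c%s is GNU coreutils; the -f%z fallback covers BSD/macOS.
BYTES=$(stat -c%s "${BIN}" 2>/dev/null || stat -f%z "${BIN}")
echo "${BIN}: $(( BYTES / 4 )) tokens"
```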

@@ -79,21 +82,6 @@ for SPLIT in high_part00 high_part01; do
 done
 ```
 
-**[nvidia/Nemotron-SFT-Math-v3](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Math-v3)**:
-
-```bash
-python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
-    --hf_dataset nvidia/Nemotron-SFT-Math-v3 \
-    --hf_name default \
-    --hf_split train \
-    --json_keys messages \
-    --tokenizer ${TOKENIZER} \
-    --output_dir ${OUTPUT_DIR} \
-    --workers 96 \
-    --max_sequence_length 256_000 \
-    --reasoning_content inline
-```
-
 **[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:
 
 ```bash
@@ -157,7 +145,6 @@ nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bi
 nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
 nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
 nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
-nvidia--Nemotron-SFT-Math-v3_default_train_messages.{bin,idx}
 competitive_programming_python_00_messages.{bin,idx}
 competitive_programming_cpp_00_messages.{bin,idx}
 MCQ_messages.{bin,idx}

examples/dataset/README.md

Lines changed: 1 addition & 1 deletion
@@ -193,7 +193,7 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
 Omit `--hf_name` to process all subsets, `--hf_split` for all splits, or `--hf_max_samples_per_split` for all samples.
 To quickly test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).
 
-For **very large datasets** (tens of millions of documents), add `--hf_streaming --hf_max_samples_per_split <num_samples>` to avoid downloading the full dataset — only the rows actually consumed are fetched.
+For very large datasets (tens of millions of documents), or for datasets with complex nested message schemas (e.g. `tool_calls` or `function_call` fields) that cause Arrow type-cast errors in non-streaming mode, add `--hf_streaming` to avoid downloading the full dataset — only the rows actually consumed are fetched. Optionally pair it with `--hf_max_samples_per_split <num_samples>` to cap the row count; without it, streaming still works but re-downloads the data on every run with no disk cache.
 
 > **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.
 > Re-runs read from cache and are much faster.
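To make the streaming path concrete, here is a minimal sketch of an invocation combining the two flags discussed above. The dataset is the quick-test sample dataset mentioned earlier in this README; the sample cap and worker count are illustrative values, not taken from this commit:

```bash
# Stream the quick-test dataset and stop after a capped number of rows per split.
# The --hf_max_samples_per_split and --workers values here are illustrative.
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-Dataset-sample \
    --json_keys text \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 8 \
    --hf_streaming \
    --hf_max_samples_per_split 100000
```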

examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md

Lines changed: 14 additions & 16 deletions
@@ -62,27 +62,26 @@ Distillation uses the **30% Pretraining (Code 5, General 20, MATH 5) + 70% Post-
 
 See [examples/dataset/MEGATRON_DATA_PREP.md](../../../dataset/MEGATRON_DATA_PREP.md) for tokenization commands for all datasets used in this blend.
 
-For this experiment: `TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2`, `OUTPUT_DIR=tokenized_nano_v2`.
+For this experiment: `TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2`, `OUTPUT_DIR=tokenized_nemotron_v2`.
 
 #### Data Blend
 
 **30% Pretraining (Code 5, General 20, MATH 5) + 70% Post-training v1/v3 (Math 30, Coding 20, Science 15, IF 5)**
 
 ```bash
 DATA_BLEND=" \
-    5 tokenized_nano_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000 \
-    20 tokenized_nano_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000 \
-    5 tokenized_nano_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000 \
-    15 tokenized_nano_v2/nvidia--Nemotron-Math-v2_default_high_part00_messages \
-    10 tokenized_nano_v2/nvidia--Nemotron-Math-v2_default_high_part01_messages \
-    5 tokenized_nano_v2/nvidia--Nemotron-SFT-Math-v3_default_train_messages \
-    15 tokenized_nano_v2/competitive_programming_python_00_messages \
-    5 tokenized_nano_v2/competitive_programming_cpp_00_messages \
-    10 tokenized_nano_v2/nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000 \
-    3 tokenized_nano_v2/MCQ_messages \
-    2 tokenized_nano_v2/RQA_messages \
-    3 tokenized_nano_v2/reasoning_on_messages \
-    2 tokenized_nano_v2/reasoning_off_messages \
+    5 tokenized_nemotron_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000 \
+    20 tokenized_nemotron_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000 \
+    5 tokenized_nemotron_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000 \
+    15 tokenized_nemotron_v2/nvidia--Nemotron-Math-v2_default_high_part00_messages \
+    15 tokenized_nemotron_v2/nvidia--Nemotron-Math-v2_default_high_part01_messages \
+    15 tokenized_nemotron_v2/competitive_programming_python_00_messages \
+    5 tokenized_nemotron_v2/competitive_programming_cpp_00_messages \
+    10 tokenized_nemotron_v2/nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000 \
+    3 tokenized_nemotron_v2/MCQ_messages \
+    2 tokenized_nemotron_v2/RQA_messages \
+    3 tokenized_nemotron_v2/reasoning_on_messages \
+    2 tokenized_nemotron_v2/reasoning_off_messages \
 "
 ```

@@ -92,8 +91,7 @@ DATA_BLEND=" \
 | Nemotron-Pretraining-SFT-v1 / General (10M samples) | 16B | 20 | Upweighted to better close MMLU gap |
 | Nemotron-Pretraining-SFT-v1 / MATH (10M samples) | 12B | 5 | Pretraining math |
 | Nemotron-Math-v2 / high_part00 | 9B | 15 | Hard math reasoning |
-| Nemotron-Math-v2 / high_part01 | 11B | 10 | Hard math reasoning |
-| Nemotron-SFT-Math-v3 | 2B | 5 | Tool-Integrated Reasoning (TIR) traces |
+| Nemotron-Math-v2 / high_part01 | 11B | 15 | Hard math reasoning |
 | Nemotron-SFT-Competitive-Programming-v2 / python_00 | 7B | 15 | Python reasoning traces |
 | Nemotron-SFT-Competitive-Programming-v2 / cpp_00 | 7B | 5 | C++ reasoning traces |
 | Nemotron-Post-Training-Dataset-v1 / stem (5M samples) | 20B | 10 | Broad STEM |
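The weights in this table mirror the `DATA_BLEND` string above and should total 100 (5+20+5+15+15+15+5+10+3+2+3+2). Below is an illustrative sanity check, not part of this commit; it assumes `DATA_BLEND` was defined exactly as in the README block above:

```bash
# DATA_BLEND expands to alternating "<weight> <tokenized-prefix>" tokens.
# Check that every prefix has a .bin file, then confirm the weights sum to 100.
echo ${DATA_BLEND} | xargs -n 2 | while read -r weight prefix; do
    [ -f "${prefix}.bin" ] || echo "missing: ${prefix}.bin"
done
echo ${DATA_BLEND} | awk '{ for (i = 1; i < NF; i += 2) sum += $i; print "total weight:", sum }'
```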

modelopt/torch/utils/plugins/megatron_preprocess_data.py

Lines changed: 15 additions & 7 deletions
@@ -78,8 +78,9 @@
     --strip_newlines
 ```
 
-Note: ``--hf_streaming`` without ``--hf_max_samples_per_split`` falls back to non-streaming,
-since streaming the full dataset is slower than the cached non-streaming path.
+Note: streaming does not cache to disk, so re-runs re-download. For full-dataset streaming
+without a sample cap this is slower than non-streaming mode, but it avoids Arrow schema
+compatibility issues with complex nested message types.
 """
 
 import argparse
191192
if tools:
192193
kwargs["tools"] = tools
193194
value = self._process_messages(value)
194-
text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
195+
try:
196+
text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
197+
except Exception as e:
198+
print(
199+
f"apply_chat_template failed: {e}\nData:\n{json.dumps(data, indent=2, default=str)}",
200+
flush=True,
201+
)
202+
raise
195203
# chat template already embeds all special tokens; don't add BOS again
196204
add_special_tokens = False
197205
else:
@@ -452,8 +460,9 @@ def megatron_preprocess_data(
         hf_split: Hugging Face Hub dataset split. Defaults to None (all splits).
         hf_max_samples_per_split: Maximum number of rows to consume per split.
         hf_streaming: Load HuggingFace datasets in streaming mode. Only consumed rows are
-            downloaded — useful for very large pretraining datasets. Note: streaming does not
-            cache to disk, so re-runs re-download. Defaults to False.
+            downloaded — useful for very large pretraining datasets or datasets with complex
+            nested message schemas that cause Arrow type-cast errors in non-streaming mode.
+            Note: streaming does not cache to disk, so re-runs re-download. Defaults to False.
         output_dir: Path to directory to save binary output files.
         tokenizer_name_or_path: Name or path of the Hugging Face tokenizer to use.
         json_keys: Key or list of keys to extract from json. Defaults to ["text"].
@@ -485,10 +494,9 @@ def megatron_preprocess_data(
         warnings.warn(
             "--hf_streaming is set but --hf_max_samples_per_split is not. "
             "Streaming without a sample cap re-downloads the full dataset on every run with no "
-            "disk cache, which is slower than non-streaming mode. Falling back to streaming=False.",
+            "disk cache, which is slower than the cached non-streaming path.",
             stacklevel=2,
         )
-        hf_streaming = False
 
     Path(output_dir).mkdir(parents=True, exist_ok=True)
     vocab_size = AutoTokenizer.from_pretrained(tokenizer_name_or_path).vocab_size
