
Commit 7291fca

Remove Nemotron-SFT-Math-v3 from data blend as not correctly used
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 5bda5fc commit 7291fca

4 files changed

Lines changed: 36 additions & 43 deletions


examples/dataset/MEGATRON_DATA_PREP.md

Lines changed: 6 additions & 19 deletions
@@ -5,16 +5,19 @@ Tokenization commands for the Nemotron Pre-Training and Post-Training dataset co
 Two parameters vary by model — set them before running the commands below:
 
 ```bash
-TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2  # HuggingFace tokenizer (or local path)
-OUTPUT_DIR=tokenized_nano_v2                 # Output directory for tokenized files
+TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2  # HuggingFace tokenizer (or local path)
+OUTPUT_DIR=tokenized_nemotron_v2             # Output directory for tokenized files
 ```
 
 Output files are written in Megatron binary format (`.bin` / `.idx`). See [examples/dataset/README.md](../dataset/README.md) for full tokenization documentation.
 
 > [!TIP]
 > Token count for a `.bin` file = file size in bytes ÷ 4. This is also printed by the tokenization script on completion.
 
-> Tokenizing each of the datasets below will take anywhere between 10 minutes to 1 hour. You can tokenize all in parallel to speed up the process.
+> [!NOTE]
+> Tokenizing each of the datasets below will take anywhere from 10 minutes to a few hours. You can tokenize them all in parallel to speed up the process.
+>
+> You may tokenize more datasets or skip some, depending on your needs.
 
 ---
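As a quick check of the token-count rule in the tip above (tokens = file size in bytes ÷ 4), here is a minimal bash sketch added for illustration; the filename is one of the expected outputs listed later in this file, and the GNU/BSD `stat` fallback is an assumption about the local environment, not something this commit ships:

```bash
# Tokens = bytes / 4, per the tip above; the filename matches an expected output
# listed further down in MEGATRON_DATA_PREP.md.
BIN=${OUTPUT_DIR}/nvidia--Nemotron-Math-v2_default_high_part00_messages.bin

# stat -c%s is GNU coreutils; the -f%z fallback covers BSD/macOS.
BYTES=$(stat -c%s "${BIN}" 2>/dev/null || stat -f%z "${BIN}")
echo "${BIN}: $(( BYTES / 4 )) tokens"
```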

@@ -79,21 +82,6 @@ for SPLIT in high_part00 high_part01; do
 done
 ```
 
-**[nvidia/Nemotron-SFT-Math-v3](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Math-v3)**:
-
-```bash
-python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
-    --hf_dataset nvidia/Nemotron-SFT-Math-v3 \
-    --hf_name default \
-    --hf_split train \
-    --json_keys messages \
-    --tokenizer ${TOKENIZER} \
-    --output_dir ${OUTPUT_DIR} \
-    --workers 96 \
-    --max_sequence_length 256_000 \
-    --reasoning_content inline
-```
-
 **[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:
 
 ```bash
@@ -157,7 +145,6 @@ nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bi
 nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
 nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
 nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
-nvidia--Nemotron-SFT-Math-v3_default_train_messages.{bin,idx}
 competitive_programming_python_00_messages.{bin,idx}
 competitive_programming_cpp_00_messages.{bin,idx}
 MCQ_messages.{bin,idx}

examples/dataset/README.md

Lines changed: 1 addition & 1 deletion
@@ -193,7 +193,7 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
 Omit `--hf_name` to process all subsets, `--hf_split` for all splits, or `--hf_max_samples_per_split` for all samples.
 To quickly test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).
 
-For **very large datasets** (tens of millions of documents), add `--hf_streaming --hf_max_samples_per_split <num_samples>` to avoid downloading the full dataset — only the rows actually consumed are fetched.
+For very large datasets (tens of millions of documents), or for datasets with complex nested message schemas (e.g. `tool_calls` or `function_call` fields) that cause Arrow type-cast errors in non-streaming mode, add `--hf_streaming` to avoid downloading the full dataset — only the rows actually consumed are fetched. Optionally pair it with `--hf_max_samples_per_split <num_samples>` to cap the row count; without it, streaming still works but re-downloads the data on every run with no disk cache.
 
 > **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.
 > Re-runs read from cache and are much faster.
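To make the streaming path concrete, here is a minimal sketch of an invocation combining the two flags discussed above. The dataset is the quick-test sample dataset mentioned earlier in this README; the sample cap and worker count are illustrative values, not taken from this commit:

```bash
# Stream the quick-test dataset and stop after a capped number of rows per split.
# The --hf_max_samples_per_split and --workers values here are illustrative.
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-Dataset-sample \
    --json_keys text \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 8 \
    --hf_streaming \
    --hf_max_samples_per_split 100000
```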

examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md

Lines changed: 14 additions & 16 deletions
@@ -62,27 +62,26 @@ Distillation uses the **30% Pretraining (Code 5, General 20, MATH 5) + 70% Post-
 
 See [examples/dataset/MEGATRON_DATA_PREP.md](../../../dataset/MEGATRON_DATA_PREP.md) for tokenization commands for all datasets used in this blend.
 
-For this experiment: `TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2`, `OUTPUT_DIR=tokenized_nano_v2`.
+For this experiment: `TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2`, `OUTPUT_DIR=tokenized_nemotron_v2`.
 
 #### Data Blend
 
 **30% Pretraining (Code 5, General 20, MATH 5) + 70% Post-training v1/v3 (Math 30, Coding 20, Science 15, IF 5)**
 
 ```bash
 DATA_BLEND=" \
-    5 tokenized_nano_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000 \
-    20 tokenized_nano_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000 \
-    5 tokenized_nano_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000 \
-    15 tokenized_nano_v2/nvidia--Nemotron-Math-v2_default_high_part00_messages \
-    10 tokenized_nano_v2/nvidia--Nemotron-Math-v2_default_high_part01_messages \
-    5 tokenized_nano_v2/nvidia--Nemotron-SFT-Math-v3_default_train_messages \
-    15 tokenized_nano_v2/competitive_programming_python_00_messages \
-    5 tokenized_nano_v2/competitive_programming_cpp_00_messages \
-    10 tokenized_nano_v2/nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000 \
-    3 tokenized_nano_v2/MCQ_messages \
-    2 tokenized_nano_v2/RQA_messages \
-    3 tokenized_nano_v2/reasoning_on_messages \
-    2 tokenized_nano_v2/reasoning_off_messages \
+    5 tokenized_nemotron_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000 \
+    20 tokenized_nemotron_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000 \
+    5 tokenized_nemotron_v2/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000 \
+    15 tokenized_nemotron_v2/nvidia--Nemotron-Math-v2_default_high_part00_messages \
+    15 tokenized_nemotron_v2/nvidia--Nemotron-Math-v2_default_high_part01_messages \
+    15 tokenized_nemotron_v2/competitive_programming_python_00_messages \
+    5 tokenized_nemotron_v2/competitive_programming_cpp_00_messages \
+    10 tokenized_nemotron_v2/nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000 \
+    3 tokenized_nemotron_v2/MCQ_messages \
+    2 tokenized_nemotron_v2/RQA_messages \
+    3 tokenized_nemotron_v2/reasoning_on_messages \
+    2 tokenized_nemotron_v2/reasoning_off_messages \
 "
 ```

@@ -92,8 +91,7 @@ DATA_BLEND=" \
 | Nemotron-Pretraining-SFT-v1 / General (10M samples) | 16B | 20 | Upweighted to better close MMLU gap |
 | Nemotron-Pretraining-SFT-v1 / MATH (10M samples) | 12B | 5 | Pretraining math |
 | Nemotron-Math-v2 / high_part00 | 9B | 15 | Hard math reasoning |
-| Nemotron-Math-v2 / high_part01 | 11B | 10 | Hard math reasoning |
-| Nemotron-SFT-Math-v3 | 2B | 5 | Tool-Integrated Reasoning (TIR) traces |
+| Nemotron-Math-v2 / high_part01 | 11B | 15 | Hard math reasoning |
 | Nemotron-SFT-Competitive-Programming-v2 / python_00 | 7B | 15 | Python reasoning traces |
 | Nemotron-SFT-Competitive-Programming-v2 / cpp_00 | 7B | 5 | C++ reasoning traces |
 | Nemotron-Post-Training-Dataset-v1 / stem (5M samples) | 20B | 10 | Broad STEM |
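The weights in this table mirror the `DATA_BLEND` string above and should total 100 (5+20+5+15+15+15+5+10+3+2+3+2). Below is an illustrative sanity check, not part of this commit; it assumes `DATA_BLEND` was defined exactly as in the README block above:

```bash
# DATA_BLEND expands to alternating "<weight> <tokenized-prefix>" tokens.
# Check that every prefix has a .bin file, then confirm the weights sum to 100.
echo ${DATA_BLEND} | xargs -n 2 | while read -r weight prefix; do
    [ -f "${prefix}.bin" ] || echo "missing: ${prefix}.bin"
done
echo ${DATA_BLEND} | awk '{ for (i = 1; i < NF; i += 2) sum += $i; print "total weight:", sum }'
```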

modelopt/torch/utils/plugins/megatron_preprocess_data.py

Lines changed: 15 additions & 7 deletions
@@ -78,8 +78,9 @@
     --strip_newlines
 ```
 
-Note: ``--hf_streaming`` without ``--hf_max_samples_per_split`` falls back to non-streaming,
-since streaming the full dataset is slower than the cached non-streaming path.
+Note: streaming does not cache to disk, so re-runs re-download. For full-dataset streaming
+without a sample cap this is slower than non-streaming mode, but it avoids Arrow schema
+compatibility issues with complex nested message types.
 """
 
 import argparse
191192
if tools:
192193
kwargs["tools"] = tools
193194
value = self._process_messages(value)
194-
text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
195+
try:
196+
text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
197+
except Exception as e:
198+
print(
199+
f"apply_chat_template failed: {e}\nData:\n{json.dumps(data, indent=2, default=str)}",
200+
flush=True,
201+
)
202+
raise
195203
# chat template already embeds all special tokens; don't add BOS again
196204
add_special_tokens = False
197205
else:
@@ -452,8 +460,9 @@ def megatron_preprocess_data(
         hf_split: Hugging Face Hub dataset split. Defaults to None (all splits).
         hf_max_samples_per_split: Maximum number of rows to consume per split.
         hf_streaming: Load HuggingFace datasets in streaming mode. Only consumed rows are
-            downloaded — useful for very large pretraining datasets. Note: streaming does not
-            cache to disk, so re-runs re-download. Defaults to False.
+            downloaded — useful for very large pretraining datasets or datasets with complex
+            nested message schemas that cause Arrow type-cast errors in non-streaming mode.
+            Note: streaming does not cache to disk, so re-runs re-download. Defaults to False.
         output_dir: Path to directory to save binary output files.
         tokenizer_name_or_path: Name or path of the Hugging Face tokenizer to use.
         json_keys: Key or list of keys to extract from json. Defaults to ["text"].
@@ -485,10 +494,9 @@ def megatron_preprocess_data(
         warnings.warn(
             "--hf_streaming is set but --hf_max_samples_per_split is not. "
             "Streaming without a sample cap re-downloads the full dataset on every run with no "
-            "disk cache, which is slower than non-streaming mode. Falling back to streaming=False.",
+            "disk cache, which is slower than the cached non-streaming path.",
             stacklevel=2,
         )
-        hf_streaming = False
 
     Path(output_dir).mkdir(parents=True, exist_ok=True)
     vocab_size = AutoTokenizer.from_pretrained(tokenizer_name_or_path).vocab_size
