Two parameters vary by model — set them before running the commands below:
```bash
TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2 # HuggingFace tokenizer (or local path)
OUTPUT_DIR=tokenized_nemotron_v2 # Output directory for tokenized files
```
Output files are written in Megatron binary format (`.bin` / `.idx`). See [examples/dataset/README.md](../dataset/README.md) for full tokenization documentation.
> [!TIP]
> Token count for a `.bin` file = file size in bytes ÷ 4. This is also printed by the tokenization script on completion.
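
As a quick shell check (a minimal sketch; the `.bin` file name is a placeholder, so substitute one of your tokenized shards):

```bash
# Each token is stored as 4 bytes, so bytes / 4 = token count.
BIN_FILE="$OUTPUT_DIR/my_dataset_text_document.bin"  # placeholder name
echo "$(( $(stat -c%s "$BIN_FILE") / 4 )) tokens"    # GNU stat; use 'stat -f%z' on macOS
```
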
> [!NOTE]
> Tokenizing each of the datasets below takes anywhere from 10 minutes to a few hours. You can tokenize them all in parallel to speed up the process, as sketched after this note.
>
> You may tokenize additional datasets or skip some, depending on your needs.
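
A minimal parallel-run sketch, assuming one tokenization command per dataset (the script name and dataset names are placeholders):

```bash
# Launch one background job per dataset, then wait for all of them.
for DATASET in dataset_a dataset_b dataset_c; do  # placeholder names
  ./tokenize_one.sh "$DATASET" &                  # placeholder per-dataset command
done
wait  # blocks until every background job has finished
```
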
---
**[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:
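
One way to fetch the raw files, sketched with the Hugging Face CLI (the local directory name is a placeholder; any download method that produces the JSONL files works):

```bash
# Download the dataset repo's raw JSONL files to a local directory.
huggingface-cli download nvidia/Nemotron-SFT-Competitive-Programming-v2 \
  --repo-type dataset \
  --local-dir competitive_programming_v2  # placeholder directory
```
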
Omit `--hf_name` to process all subsets, `--hf_split` to process all splits, or `--hf_max_samples_per_split` to process all samples.

For a quick test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).

For very large datasets (tens of millions of documents), or for datasets with complex nested message schemas (e.g. `tool_calls` or `function_call` fields) that cause Arrow type-cast errors in non-streaming mode, add `--hf_streaming` to avoid downloading the full dataset — only the rows actually consumed are fetched. Optionally pair it with `--hf_max_samples_per_split <num_samples>` to cap the row count; without a cap, streaming still works but re-downloads the data on every run, since streaming keeps no disk cache.
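
A hedged illustration of the flag pairing (here `tokenize.py`, `--tokenizer`, and `--output_dir` are placeholders for this guide's actual tokenization entry point; only the `--hf_*` flags come from the text above):

```bash
# Stream rows instead of downloading every shard, and cap rows per split.
python tokenize.py \
    --tokenizer "$TOKENIZER" \
    --output_dir "$OUTPUT_DIR" \
    --hf_streaming \
    --hf_max_samples_per_split 1000000
```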
> **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.