Commit b8ca0ed

docs: add training guide and improve wandb config
1 parent 203d866 commit b8ca0ed

2 files changed

Lines changed: 236 additions & 0 deletions

docs/training.md

Lines changed: 235 additions & 0 deletions
@@ -0,0 +1,235 @@
## Training Pipeline

This guide explains how to train PVNet using the open-data-pvnet pipeline. It covers:
- Configuration structure and best practices
- The two supported training workflows (with clear recommendations)
- Common pitfalls and how to resolve them

---

## 1. Configuration Basics

All configuration files live in a structured directory:

```
open-data-pvnet/
└── src/
    └── open_data_pvnet/
        └── configs/
            └── PVNet_configs/
```

The main configuration file is `config.yaml`, which tells the system which sub-configurations to use. Think of it as the master control panel that connects everything together.

---

## 2. Understanding the Key Configuration Parts

### Trainer Configuration
Controls how your model trains:
- GPU/CPU usage
- Training duration
- Precision settings

### Model Configuration
Defines your model's architecture:
- Which encoders to use (GFS, satellite, etc.)
- Forecast horizon
- Optimizer settings

### Data Configuration (Most Important!)
This determines how your data is loaded. PVNet supports two approaches:
- **Streamed batches**: Loaded directly from Zarr files during training
- **Premade batches**: Loaded from pre-generated samples

You must choose exactly one approach at a time; mixing settings from both will cause errors.

---

## 3. Two Ways to Train PVNet

### Method 1: Streamed Batches (Direct Zarr Loading)

This approach loads data directly from Zarr files during training.

**When to use it:**
- You have sufficient disk space and bandwidth
- You don't want to pre-generate samples

**How to set it up:**
In your `config.yaml`, ensure you have:
```yaml
- datamodule: streamed_batches.yaml
```

**Important settings:**
```yaml
_target_: pvnet.data.DataModule
configuration: /ABSOLUTE/PATH/example_configuration.yaml
batch_size: 8
num_workers: 2
prefetch_factor: 2
train_period:
- null
- "2023-06-30"
val_period:
- "2023-07-01"
- "2023-12-31"
```
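
As a quick sanity check on the `train_period` and `val_period` settings above, the following minimal Python sketch tests whether two date ranges overlap. The helper `periods_overlap` is hypothetical and not part of PVNet. It assumes each period is a two-element list of ISO date strings and that `null` (Python `None`) means an open-ended bound, which is an interpretation rather than something this guide states.

```python
from datetime import date
from typing import Optional, Sequence


def _bound(value: Optional[str], default: date) -> date:
    """Parse an ISO date string; None (YAML null) is treated as open-ended (assumption)."""
    return default if value is None else date.fromisoformat(value)


def periods_overlap(a: Sequence[Optional[str]], b: Sequence[Optional[str]]) -> bool:
    """Return True if two inclusive [start, end] periods share at least one day."""
    a_start, a_end = _bound(a[0], date.min), _bound(a[1], date.max)
    b_start, b_end = _bound(b[0], date.min), _bound(b[1], date.max)
    return a_start <= b_end and b_start <= a_end


# Values copied from the settings shown above.
train_period = [None, "2023-06-30"]
val_period = ["2023-07-01", "2023-12-31"]
print(periods_overlap(train_period, val_period))  # False
```

A result of `False` means the training and validation ranges are disjoint, which is usually what you want to avoid leaking validation days into training.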

**What to avoid:**
Don't include `sample_output_dir`, `num_train_samples`, or `num_val_samples` in your streamed configuration. These keys are used for sample generation and the premade workflow, and they will cause errors here.
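
The rule above can be checked mechanically before launching a run. This is a minimal sketch, assuming the datamodule configuration has already been loaded into a plain Python dictionary; the function name `premade_keys_in_streamed` is hypothetical and not part of PVNet or open-data-pvnet.

```python
from typing import Any, Dict, List

# Keys this guide says must not appear in a streamed configuration.
PREMADE_ONLY_KEYS = {"sample_output_dir", "num_train_samples", "num_val_samples"}


def premade_keys_in_streamed(cfg: Dict[str, Any]) -> List[str]:
    """Return the offending keys found in a streamed datamodule config, sorted."""
    return sorted(PREMADE_ONLY_KEYS.intersection(cfg))


# Illustrative config that mistakenly carries a premade-only key.
streamed_cfg = {
    "_target_": "pvnet.data.DataModule",
    "configuration": "/ABSOLUTE/PATH/example_configuration.yaml",
    "batch_size": 8,
    "num_train_samples": 10,
}
print(premade_keys_in_streamed(streamed_cfg))  # ['num_train_samples']
```

An empty list means the configuration contains none of the keys listed above.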

**To start training:**
```bash
python run.py experiment=example_simple
```

### Method 2: Premade Batches (Recommended for Beginners)

This approach uses pre-generated samples, making training more stable and reproducible.

**When to use it:**
- You want consistent, reproducible results
- You want faster iteration during development
- You've encountered issues with Zarr or storage

**Step 1: Generate samples**
Navigate to `open-data-pvnet/src/open_data_pvnet/scripts` and run:
```bash
python save_samples.py \
    +datamodule.sample_output_dir="GFS_samples" \
    +datamodule.num_train_samples=10 \
    +datamodule.num_val_samples=2 \
    datamodule.num_workers=2
```

This creates a directory with your samples:
```
scripts/
└── GFS_samples/
    ├── data_configuration.yaml
    ├── train/
    └── val/
```
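
Before switching the datamodule, it can help to confirm that the sample directory has the layout shown above. The following is a minimal sketch using only the standard library; `missing_sample_entries` is a hypothetical helper, and it checks only for the three entries named in this guide, not for the contents of `train/` or `val/`.

```python
from pathlib import Path
from typing import List


def missing_sample_entries(sample_dir: str) -> List[str]:
    """Return the expected entries that are absent from a sample directory."""
    root = Path(sample_dir)
    # Expected layout as described in this guide: one YAML file and two folders.
    expected = {
        "data_configuration.yaml": Path.is_file,
        "train": Path.is_dir,
        "val": Path.is_dir,
    }
    return [name for name, check in expected.items() if not check(root / name)]


# "GFS_samples" is the directory name used in the command above.
print(missing_sample_entries("GFS_samples"))
```

An empty list means all three entries were found.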

**Step 2: Switch to premade batches**
In your `config.yaml`, change to:
```yaml
- datamodule: premade_batches.yaml
```

**Step 3: Configure the premade batches**
```yaml
_target_: pvnet.data.DataModule
sample_output_dir: /ABSOLUTE/PATH/TO/GFS_samples
batch_size: 8
num_workers: 2
prefetch_factor: 2
```

**Important:** Always use absolute paths! Hydra changes the working directory at runtime, so relative paths will break.
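
One way to obtain the absolute path for `sample_output_dir` is to resolve it once, from the directory where the samples were generated, and paste the result into the YAML. This is a minimal sketch; `to_absolute` is a hypothetical helper and not part of PVNet or Hydra.

```python
from pathlib import Path


def to_absolute(path_str: str) -> str:
    """Resolve a path against the current working directory and return it as a string."""
    return str(Path(path_str).expanduser().resolve())


# Run this from the directory that contains GFS_samples, then copy the output
# into the sample_output_dir setting.
print(to_absolute("GFS_samples"))
```

Because the path is resolved before training starts, it stays valid no matter which directory Hydra switches to at runtime.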

**Step 4: Train**
```bash
python run.py experiment=example_simple
```

---

## 4. Data Configuration Best Practices

When setting up your data configuration (`example_configuration.yaml`), we strongly recommend:

1. Download data locally
2. Point to local Zarr paths
3. Set `public: false` for local data

Example configuration:
```yaml
gsp:
  zarr_path: C:/data/gsp/combined_2023_gsp.zarr
  public: false

nwp:
  gfs:
    zarr_path: C:/data/nwp/nwp_gfs.zarr
    public: false
```

### Important Note About Local Data

If you're using local data, make sure to set `public: false` in your configuration. When `public: true` is set with a local path, you'll encounter this error:

```
ValueError: storage_options passed with non-fsspec path
```

For example, if your configuration looks like this:
```yaml
gsp:
  zarr_path: "s3://ocf-open-data-pvnet/data/uk/pvlive/v2/combined_2023_gsp.zarr"
  # ... other settings ...
  public: True
```

And you're trying to use local data, change it to:
```yaml
gsp:
  zarr_path: "/path/to/your/local/data/combined_2023_gsp.zarr"
  # ... other settings ...
  public: False
```
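
The combination that triggers the error above (a local path with `public` enabled) can be detected ahead of time. This is a minimal sketch, assuming that a remote path is any string containing a URL scheme separator such as `s3://`; both helper names are hypothetical, and the check is a heuristic rather than the logic PVNet itself uses. Testing for `://` instead of parsing the URL keeps Windows paths such as `C:/data/...` classified as local.

```python
def is_remote_path(zarr_path: str) -> bool:
    """Heuristic: paths with a scheme separator (s3://, gs://, ...) count as remote."""
    return "://" in zarr_path


def public_flag_problem(zarr_path: str, public: bool) -> bool:
    """Return True for the case this guide warns about: local path with public=True."""
    return public and not is_remote_path(zarr_path)


# Paths copied from the examples in this guide.
print(public_flag_problem("/path/to/your/local/data/combined_2023_gsp.zarr", True))  # True
print(public_flag_problem("C:/data/gsp/combined_2023_gsp.zarr", False))  # False
```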

---

## 5. Common Hydra Issues

### Hydra Override Error

If you encounter an override error when using the `example_simple` experiment, it might be due to conflicting override declarations. The `example_simple` configuration includes these overrides:

```yaml
defaults:
- override /trainer: default.yaml
- override /model: multimodal.yaml
- override /datamodule: premade_samples.yaml
- override /callbacks: default.yaml
- override /logger: wandb.yaml
- override /hydra: default.yaml
```

If you see errors like "Could not override 'hydra'", you may need to temporarily comment out the hydra override in your config file:

```yaml
# - override /hydra: default.yaml
```

### Working Directory Changes

Remember that Hydra runs experiments inside timestamped directories (e.g., `outputs/YYYY-MM-DD/HH-MM-SS/`). This is why absolute paths are essential: a relative path is resolved against the new working directory and will no longer point at your files.

---

## 6. Quick Comparison: Streamed vs. Premade

| Feature                     | Streamed | Premade |
| --------------------------- | -------- | ------- |
| Reads Zarr at runtime       | Yes      | No      |
| Needs pre-generated samples | No       | Yes     |
| Sensitive to storage flags  | Yes      | No      |
| Recommended for beginners   | No       | Yes     |

---

## 7. External Requirements

### Google Cloud CLI

Even if you're using local data, some metadata utilities require the Google Cloud CLI. Install it from:
https://cloud.google.com/sdk/docs/install

Then authenticate:
```bash
gcloud auth application-default login
```

src/open_data_pvnet/configs/PVNet_configs/logger/wandb.yaml

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ wandb:
   save_dir: "PLACEHOLDER"
   offline: False # set True to store all logs only locally
   id: null # pass correct id to resume experiment!
+  # use only one id source at a time!
   id: "${oc.env:WANDB_RUN_ID}"
   # entity: "" # set to name of your wandb team or just remove it
   log_model: True