Training and Evaluation pipeline #105

Closed
Sharkyii wants to merge 4 commits into openclimatefix:main from Sharkyii:feature/training-eval-pipeline

@Sharkyii

Description

This PR introduces a stable, end-to-end training and evaluation pipeline for PVNet, currently validated on GSP-only data.

To simplify debugging during initial integration, the optimizer was intentionally switched from EmbAdamWReduceLROnPlateau to plain AdamW. Dropping the learning-rate scheduling removes one source of training complexity while the data flow, model wiring, and configuration handling are being validated.
The original optimizer and its learning-rate scheduling will be reintroduced in a follow-up PR once the full multi-encoder setup is finalized.

At this stage:

  • The pipeline is fully functional for GSP data
  • Training and evaluation pipelines run end-to-end without errors
  • Configuration and Hydra overrides are verified and stable
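
For contributors unfamiliar with dotted overrides, the sketch below illustrates how a `key.subkey=value` override maps onto a nested config. It is a simplified stand-in written in plain Python, not Hydra's own implementation (Hydra is stricter, for example it requires a `+` prefix to add keys that do not already exist). The config keys and values are hypothetical and are not taken from this PR.

```python
from copy import deepcopy
from typing import Any


def _parse_scalar(text: str) -> Any:
    """Interpret a value as int, then float, else leave it as a string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Return a copy of `config` with `a.b.c=value` style overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, sep, raw_value = item.partition("=")
        if not sep:
            raise ValueError(f"Override {item!r} is not of the form key=value")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections on demand (unlike Hydra's strict mode).
            node = node.setdefault(name, {})
        node[leaf] = _parse_scalar(raw_value)
    return result


# Hypothetical config keys, chosen only for illustration.
base = {"optimizer": {"name": "AdamW", "lr": 1e-4}, "trainer": {"max_epochs": 10}}
updated = apply_overrides(base, ["optimizer.lr=0.001", "trainer.max_epochs=2"])
print(updated)
```

The function returns a modified copy and leaves `base` untouched, which makes it easy to compare a run's effective config against the defaults.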

Planned follow-ups:

  • Extend support to NWP and satellite data
  • Properly integrate and validate GSP + NWP multi-encoder setup
  • Restore EmbAdamWReduceLROnPlateau once architecture and inputs are finalized

Two reference documents have been added to explain the pipeline design and usage for future contributors.

Fixes #7


How Has This Been Tested?

  • End-to-end training and evaluation pipelines executed successfully
  • Verified data loading, batching, model forward pass, logging, and checkpointing
  • Tested using GSP-only configuration
  • Metrics logged correctly via W&B (offline mode)
  • Sanity checks performed on loss and MAE trends
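
A sanity check on MAE trends of the kind listed above could look roughly like the following. This is a minimal sketch in plain Python, not the code used in this PR: the helper names, the window size, and all numbers are made up for illustration.

```python
from statistics import fmean


def mean_absolute_error(predictions, targets):
    """Mean of |prediction - target| over paired values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one value is required")
    return fmean(abs(p - t) for p, t in zip(predictions, targets))


def is_improving(history, window=3):
    """True if the mean of the last `window` values is below the mean of the first `window`."""
    if len(history) < 2 * window:
        raise ValueError("history must contain at least 2 * window values")
    return fmean(history[-window:]) < fmean(history[:window])


# Made-up numbers, for illustration only.
mae = mean_absolute_error([0.2, 0.5, 0.9], [0.1, 0.5, 1.0])
val_mae_per_epoch = [0.30, 0.28, 0.25, 0.21, 0.20, 0.19]
improving = is_improving(val_mae_per_epoch)
print(f"MAE: {mae:.4f}, improving: {improving}")
```

Comparing window means rather than consecutive epochs keeps the check from failing on a single noisy epoch.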

If your changes affect data processing, have you plotted any changes?

  • Yes (basic sanity checks on metrics and loss behaviour)

Checklist

Comment thread pyproject.toml
"zarr==2.18.3",
"pvnet==4.1.19",
"ocf-data-sampler==0.2.32",
"ocf-data-sampler==0.2.10",
Contributor

how come you had to move this down?

# number_of_conv3d_layers: 6
# conv3d_channels: 32
# image_size_pixels: 24
ukv:
Contributor

how did you use ukv? I thought only gfs was accessible?

"cpu", "--device", help="Device to run evaluation on ('cpu' or 'cuda')"
),
quantiles: str = typer.Option(
"0.02,0.1,0.25,0.5,0.75,0.9,0.98",
Contributor

Could you get this from the config?
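
One way to act on this suggestion is to treat the CLI string as an optional override and otherwise read the quantiles from the loaded config. The sketch below assumes the config is available as a nested dict and that the quantiles live under a `model.output_quantiles` key. Both are assumptions for illustration, as are the helper names and the example config values.

```python
from typing import Optional


def parse_quantiles(text: str) -> list[float]:
    """Parse a comma-separated string such as '0.1,0.5,0.9' into validated floats."""
    values = [float(part) for part in text.split(",") if part.strip()]
    if not values:
        raise ValueError("no quantiles given")
    if any(not 0.0 < q < 1.0 for q in values):
        raise ValueError("quantiles must lie strictly between 0 and 1")
    if values != sorted(set(values)):
        raise ValueError("quantiles must be strictly increasing")
    return values


def resolve_quantiles(config: dict, cli_value: Optional[str] = None) -> list[float]:
    """Prefer an explicit CLI value; otherwise fall back to the config."""
    if cli_value is not None:
        return parse_quantiles(cli_value)
    # Hypothetical key path; adjust to wherever the model config stores quantiles.
    quantiles = config.get("model", {}).get("output_quantiles")
    if quantiles is None:
        raise KeyError("model.output_quantiles not found in config")
    return parse_quantiles(",".join(str(q) for q in quantiles))


# Hypothetical config contents, for illustration only.
example_config = {"model": {"output_quantiles": [0.1, 0.5, 0.9]}}
from_config = resolve_quantiles(example_config)
from_cli = resolve_quantiles(example_config, "0.02,0.1,0.25,0.5,0.75,0.9,0.98")
print(from_config, from_cli)
```

With this shape, the `typer.Option` default could become `None`, so the evaluation quantiles always match the trained model unless a user overrides them deliberately.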

@@ -0,0 +1,183 @@
import logging
Contributor

@Sharkyii closed this Dec 24, 2025

Development

Successfully merging this pull request may close these issues.

train model + evaluate model

2 participants