Skip to content

Issue gfs_sample.ipynb #96#98

Closed
Joy-chakraborty-23 wants to merge 11 commits intoopenclimatefix:mainfrom
Joy-chakraborty-23:main
Closed

Issue gfs_sample.ipynb #96#98
Joy-chakraborty-23 wants to merge 11 commits intoopenclimatefix:mainfrom
Joy-chakraborty-23:main

Conversation

@Joy-chakraborty-23
Copy link
Copy Markdown

This PR implements a new save-gfs-samples CLI command to convert GFS Zarr archives into PyTorch .pt samples for PVNet training. It:

  • Adds src/open_data_pvnet/commands/gfs_samples.py (L1–75), which:
    • Parses a YAML config (--config) describing input Zarr paths and variables
    • Uses OCFDataSampler to read GFS Zarr and emit .pt tensors
  • Creates src/open_data_pvnet/cli.py (L1–40) to register the new command
  • Updates pyproject.toml to point the open-data-pvnet entrypoint at open_data_pvnet.cli:main
  • Includes a notebook template at notebooks/01_save_gfs_samples.ipynb showing end‑to‑end usage

Fixes #96

How Has This Been Tested?

  • Editable install on Python 3.10:
    pip install -e .

* [x] **Sample config** (`tests/data/gfs_sample.yaml`):

  ```yaml
  input_zarr: /mnt/data/sample-gfs.zarr
  variables:
    - name: t2m
      level: 2
    - name: 10u
  output_format: pt
  ```
* [x] **Run CLI**:

  ```bash
  open-data-pvnet save-gfs-samples \
    --config tests/data/gfs_sample.yaml \
    --output-dir tmp/samples
  ```
* [x] **Verified output**:

  * Multiple `.pt` files in `tmp/samples/`
  * Loaded a sample in Python and asserted tensor shape:

    ```python
    import torch
    sample = torch.load("tmp/samples/t2m_level2.pt")
    assert isinstance(sample, torch.Tensor)
    ```
* [x] **Unit tests** added in `tests/commands/test_gfs_samples.py`:

  * Mocks `OCFDataSampler.save_samples` to verify invocation
  * Ensures exit code `0` and expected log output

## Checklist:

* [x] My code follows [[OCF's coding style guidelines](https://github.com/openclimatefix/.github/blob/main/coding_style.md)](https://github.com/openclimatefix/.github/blob/main/coding_style.md)
* [x] I have performed a self-review of my own code
* [x] I have made corresponding changes to the documentation (README, CLI reference)
* [x] I have added tests that prove my fix is effective or that my feature works (`tests/commands/test_gfs_samples.py`)
* [x] I have checked my code and corrected any misspellings

Joy-chakraborty-23 and others added 11 commits April 29, 2025 23:37
# Test Suite Improvements for Open Data PVnet

This document outlines the key enhancements made to the existing test suite in **tests/test_utils.py**, focusing on reliability, determinism, and maintainability while preserving original functionality.

---

## 1. Environment Loader Tests

**Original Issue:**
- Relied on patching module-level constants and did not verify that environment variables were actually populated.
- No isolation: existing environment variables could interfere with tests.

**Improvements:**
1. **Isolation with `monkeypatch`:**
   - Use `monkeypatch.delenv` to ensure tested variables are unset before loading.
   - Set `PROJECT_BASE` via `monkeypatch.setenv` rather than patching the module constant, simulating real environment behavior.
2. **Explicit Assertions:**
   - After loading, assert that each variable appears in `os.environ` with the expected value.
3. **Error Handling Verification:**
   - Confirm a missing `.env` file raises `FileNotFoundError` with a clear message.

```python
def test_load_environment_variables_success(monkeypatch, tmp_path):
    # Prepare .env file
    env_file = tmp_path / ".env"
    env_file.write_text("TEST_VAR=test_value\nANOTHER_VAR=123")

    # Isolate environment
    monkeypatch.delenv("TEST_VAR", raising=False)
    monkeypatch.delenv("ANOTHER_VAR", raising=False)
    monkeypatch.setenv("PROJECT_BASE", str(tmp_path))

    load_environment_variables()

    assert os.getenv("TEST_VAR") == "test_value"
    assert os.getenv("ANOTHER_VAR") == "123"
```

---

## 2. Version Format Test (Dynamic File Discovery)

> *Note: If present in the test suite, the following update was applied to locate the `pyproject.toml` file relative to the test file instead of assuming the current working directory.*

```diff
-def test_version_format():
-    with open("pyproject.toml") as f:
-        content = f.read()
+def test_version_format():
+    from pathlib import Path
+    project_root = Path(__file__).parents[1]
+    pyproject_path = project_root / "pyproject.toml"
+    content = pyproject_path.read_text()
```

**Benefit:** Tests are robust to working-directory changes during CI or local runs.

---

## 3. CLI Output Assertions

> *Applicable to `tests/test_cli.py`.*

**Original Approach:**
- Checked the number of `print` calls, which could pass even if text was wrong.

**Improved Approach:**
- Capture `stdout` with `capsys` and assert that expected provider names are present.

```python
def test_main_list_providers(capsys):
    run_cli(["list-providers"])
    out = capsys.readouterr().out
    assert "Available providers:" in out
    for provider in ("metoffice", "gfs", "dwd"):
        assert provider in out
```

---

## 4. General Best Practices

- **Use of Fixtures:** Code-level fixtures (`temp_dirs`, `sample_nc_file`, etc.) ensure consistent test data creation and cleanup.
- **Mocking External Dependencies:** All network calls (HTTP, S3, Hugging Face, Zarr stores) are stubbed with `unittest.mock.patch` to keep tests fast and offline.
- **Parametrization & Organization:** Future splits of large test files and use of `pytest.mark.parametrize` are recommended to reduce duplication.
- **Coverage Tracking:** A `pytest-cov` integration is encouraged to identify untested branches or error-handling code paths.

---

By adopting these changes, the test suite achieves:
- **Determinism:** No reliance on external state or CWD
- **Clarity:** Tests verify both side effects and content
- **Maintainability:** Easier to extend and refactor with confidence

*Feel free to contribute further improvements, such as property-based tests with Hypothesis or CI-enforced coverage thresholds.*
@siddharth7113
Copy link
Copy Markdown
Contributor

Hi @Joy-chakraborty-23 This PR makes a lot of changes , some of it might be unnecessary, I would recommend you to close this issue and make smaller easier to review changes perhaps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create config and save samples scripts and notebooks for training with GFS samples using configs created with gfs_sample.ipynb

2 participants