test(rf3): add integration tests for rf3 #323

Merged
woodsh17 merged 10 commits into production from rf3-integration-tests on Jun 26, 2026

@woodsh17

Member

Summary

  • Adds an integration test suite for rf3 fold covering three input modes
    (JSON protein-only, JSON with ligand, CIF with ligand) and six flag
    behaviours (skip_existing, early_stopping, annotate_b_factor, seed
    reproducibility, template_selection, ground_truth_conformer_selection)
  • Adds a CPU/GPU parity test that checks scalar confidence metrics against
    a committed GPU baseline within ±0.05
  • Fixes three inference bugs uncovered during test development (see commit 85b65b4)
  • Adds a GitHub Actions CI workflow that runs the full suite in a single job
    on every pull request and on every push to main/production
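
For reviewers who want a concrete picture of the parity check, here is a minimal sketch under stated assumptions. The function name `parity_failures` and the CPU values are hypothetical and are not taken from test_cpu_gpu_parity.py; the metric keys, the ±0.05 tolerance, and the baseline values come from this PR. A missing metric is reported as a failure rather than skipped.

```python
TOLERANCE = 0.05  # the ±0.05 band quoted in the summary


def parity_failures(cpu_metrics, gpu_baseline, keys, tol=TOLERANCE):
    """Return human-readable failures; an empty list means parity holds.

    Hypothetical helper: the real test may be structured differently.
    """
    failures = []
    for key in keys:
        cpu_val = cpu_metrics.get(key)
        gpu_val = gpu_baseline.get(key)
        # A dropped metric key must fail explicitly, never skip silently.
        if cpu_val is None or gpu_val is None:
            failures.append(f"{key}: missing (cpu={cpu_val}, gpu={gpu_val})")
            continue
        if abs(cpu_val - gpu_val) > tol:
            failures.append(f"{key}: cpu={cpu_val} vs gpu={gpu_val} exceeds {tol}")
    return failures


# Baseline values as reported in this PR for 1cyo_from_json.
GPU_BASELINE = {"overall_plddt": 0.6552, "ptm": 0.308, "iptm": 0.0}
# Made-up CPU numbers, for illustration only.
cpu_run = {"overall_plddt": 0.64, "ptm": 0.33, "iptm": 0.0}

KEYS = ["overall_plddt", "ptm", "iptm"]
failures = parity_failures(cpu_run, GPU_BASELINE, KEYS)
assert not failures, failures
```

Returning a list of failures instead of asserting inside the loop means one run reports every drifting metric at once.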

Test results

10/10 passed on a CPU-only machine (30 min). All 28 shared-layer unit tests
pass. Formatting and linting clean.

Known limitations (documented in test files)

  • GPU baseline covers protein-only input only; ligand baselines are future work
  • iptm=0.0 for single-chain inputs is a pre-existing inference engine bug;
    both CPU and GPU reproduce it so parity still holds
  • one_model_per_file flag is unimplemented in the inference engine; no test
    is included for it

woodsh17 and others added 6 commits June 16, 2026 18:07
Adds end-to-end integration tests for `rf3 fold` that run on CPU in CI,
and fixes three production bugs surfaced during the test run.

Bug fixes:
- RF3.py: n_recycles=1 + default early_stopping_plddt_threshold=0.5
  crashed because next() consumed the only generator item for the
  early-stop check, leaving nothing for the deque. Fall back to
  first_recycle_outputs when the generator is exhausted.
- inference_engines/rf3.py: early-stop path wrote score.csv +
  _metrics.csv; changed to write _ranking_scores.csv with
  early_stopped=True so output is consistent with the normal path.
- utils/inference.py: skip_existing checked for _metrics.csv (only
  written on early-stopped runs); changed sentinel to _ranking_scores.csv
  so completed folds are correctly detected and skipped.
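
The first fix (generator exhausted after a single recycle) can be illustrated with a minimal sketch. This is not the code in RF3.py: `last_recycle_output` and the `should_stop_early` predicate are hypothetical stand-ins, and only the name `first_recycle_outputs` and the deque fallback idea come from the commit message.

```python
from collections import deque


def last_recycle_output(recycle_outputs, should_stop_early):
    """Return (output, early_stopped) from an iterator of per-recycle outputs.

    Hypothetical helper, assuming the iterator yields at least one item.
    """
    # The first item is consumed here for the early-stop check.
    first_recycle_outputs = next(recycle_outputs)
    if should_stop_early(first_recycle_outputs):
        return first_recycle_outputs, True

    # Drain the remaining recycles, keeping only the final one.
    remaining = deque(recycle_outputs, maxlen=1)
    if not remaining:
        # With a single recycle the iterator is already exhausted,
        # so fall back to the output consumed above.
        return first_recycle_outputs, False
    return remaining[0], False


def never_stop(_output):
    return False


# One recycle: the fallback path is taken.
single = last_recycle_output(iter(["recycle_0"]), never_stop)
# Several recycles: the last output wins.
several = last_recycle_output(iter(["recycle_0", "recycle_1", "recycle_2"]), never_stop)
```

Without the `if not remaining` branch, indexing the empty deque would raise IndexError in the single-recycle case, which mirrors the crash described above.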

New files:
- .github/workflows/test_integration.yaml: CI job (push to main/production
  or workflow_dispatch); caches the RF3 checkpoint, runs integration suite.
- models/rf3/tests/integration/: conftest.py with session-scoped fixtures
  (speed flags: n_recycles=1, num_steps=20, diffusion_batch_size=1),
  test_basic_fold.py (3 input-mode tests), test_options.py (7 flag tests),
  test_cpu_gpu_parity.py (auto-skipped until GPU baseline is committed).
- models/rf3/tests/data/: 1cyo.cif, 1cyo_from_json.json,
  1cyo_with_ligand.json, integration_baselines/README.md.
- pyproject.toml: registers integration and gpu pytest markers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add integration_slow marker to the 7 options tests (each requires its
own rf3 fold subprocess). The 3 basic input-mode tests keep only the
integration marker and remain fast (~5-10 min, one shared fold call).

CI split:
- quick job: -m "integration and not integration_slow"
  runs on every push to main/production; timeout 20 min
- full job:  -m integration
  runs only on push to production; timeout 60 min

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add early_stopping_plddt_threshold=0.0 to SPEED_FLAGS so the default
  threshold (0.5) cannot silently early-stop future low-pLDDT test inputs
- Raise _FOLD_TIMEOUT 600→1800 s to cover the batched 3-input fixture and
  slow CI runners without a hard 10-minute cap
- Fix num_steps docstring: default is 50 (inference engine), not 200
  (training full-rollout)
- Fix seed_dirs and ground_truth_conformer_dir fixtures to unpack the
  (out_dir, stderr) tuple returned by run_rf3_fold, consistent with all
  other fixtures
- Fix parity test to assert metrics are non-None rather than silently
  skipping; a dropped metric key now fails explicitly
- Update parity test skipif reason string (baseline is now committed)
- Fix annotate_b_factor docstring: flag does not force one_model_per_file
- Fix B-factor range check to strict-exclusive (0 < v < 1) matching the
  pLDDT bin scheme; the old inclusive form would accept 0.0/1.0 which
  would indicate a binning bug
- Fix one_model_per_file test docstring to describe actual behaviour
- Remove stale "← not yet committed" annotation from baselines README
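
The strict-exclusive B-factor check in this commit can be sketched as follows. This is an illustration under assumptions: `out_of_range` is a hypothetical helper, and the sample values are made up; only the 0 < v < 1 condition comes from the commit message.

```python
def out_of_range(b_factors):
    """Return the values that fall outside the open interval (0, 1).

    Hypothetical helper: assumes B-factors hold pLDDT bin values as fractions.
    """
    # Strictly exclusive: exactly 0.0 or 1.0 would indicate a binning bug.
    return [v for v in b_factors if not (0.0 < v < 1.0)]


# Made-up B-factor columns, for illustration only.
good_column = [0.35, 0.65, 0.95]
bad_column = [0.0, 0.5, 1.0]

assert out_of_range(good_column) == []
offenders = out_of_range(bad_column)
```

Collecting the offending values, rather than returning a bare boolean, gives a more useful assertion message when the check fails.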

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The one_model_per_file flag in RF3InferenceEngine.run() is currently a
no-op in the normal prediction path — per-sample subdirectory CIFs are
always written unconditionally. The test passed regardless of whether
the flag was set, giving false confidence that it worked.

Remove both the fixture and the test. If the flag is ever properly
implemented, a real test can be added at that point.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Commit GPU-generated summary_confidences.json for 1cyo_from_json
  (overall_plddt=0.6552, ptm=0.308, seed=1, n_recycles=1, num_steps=20).
  test_confidence_metrics_match_gpu_baseline will now run instead of skip.
- Reword early-stopping warning to be more descriptive
- Formatting-only change in test_basic_fold.py (ruff blank-line fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a "Known limitations" section to the test_cpu_gpu_parity.py module
docstring covering stale baselines, the iptm=0.0 single-chain bug,
narrow input coverage (protein-only), and the low-quality speed-flag
baseline. Add a brief pointer section to integration_baselines/README.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@woodsh17 requested a review from lyskov June 16, 2026 23:13
woodsh17 and others added 4 commits June 17, 2026 20:49
Collapse the quick/slow split into a single job so every trigger runs
the full suite. Remove timeout-minutes to let the job run as long as
needed (GitHub Actions default of 6 hours applies).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Match the unit test workflow: run on every PR (bare pull_request trigger,
no base-branch filter) in addition to pushes to main/production.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

${{ env.HOME }} in a YAML env: block is a GitHub Actions expression
context lookup (not the shell $HOME) and was resolving to empty,
producing a path of /.cache/... that caused all tests to be skipped.
Inline the env var assignment in the run: step where $HOME expands
correctly in the shell.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All six option tests now carry only @pytest.mark.integration so they
run in per-PR CI alongside the other integration tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@lyskov left a comment

Member

LGTM!

@woodsh17 merged commit 62eba66 into production Jun 26, 2026
6 checks passed
@woodsh17 deleted the rf3-integration-tests branch June 26, 2026 13:20