test(rf3): add integration tests for rf3 by woodsh17 · Pull Request #323 · RosettaCommons/foundry

woodsh17 · 2026-06-16T23:12:28Z

Summary

- Adds an integration test suite for `rf3 fold` covering three input modes
  (JSON protein-only, JSON with ligand, CIF with ligand) and six flag
  behaviours (skip_existing, early_stopping, annotate_b_factor, seed
  reproducibility, template_selection, ground_truth_conformer_selection)
- Adds a CPU/GPU parity test that checks scalar confidence metrics against
  a committed GPU baseline within ±0.05
- Fixes three inference bugs uncovered during test development (see commit 85b65b4)
- Adds a GitHub Actions CI workflow (quick tier on every push to main/production,
  full tier on production only)

Test results

10/10 passed on a CPU-only machine (30 min). All 28 shared-layer unit tests
pass. Formatting and linting clean.

Known limitations (documented in test files)

- GPU baseline covers protein-only input only; ligand baselines are future work
- iptm=0.0 for single-chain inputs is a pre-existing inference engine bug;
  both CPU and GPU reproduce it so parity still holds
- one_model_per_file flag is unimplemented in the inference engine; no test

Adds end-to-end integration tests for `rf3 fold` that run on CPU in CI, along with fixes for three production bugs surfaced during the test run.

Bug fixes:
- RF3.py: n_recycles=1 + default early_stopping_plddt_threshold=0.5 crashed because next() consumed the only generator item for the early-stop check, leaving nothing for the deque. Fall back to first_recycle_outputs when the generator is exhausted.
- inference_engines/rf3.py: early-stop path wrote score.csv + _metrics.csv; changed to write _ranking_scores.csv with early_stopped=True so output is consistent with the normal path.
- utils/inference.py: skip_existing checked for _metrics.csv (only written on early-stopped runs); changed sentinel to _ranking_scores.csv so completed folds are correctly detected and skipped.

New files:
- .github/workflows/test_integration.yaml: CI job (push to main/production or workflow_dispatch); caches the RF3 checkpoint, runs integration suite.
- models/rf3/tests/integration/: conftest.py with session-scoped fixtures (speed flags: n_recycles=1, num_steps=20, diffusion_batch_size=1), test_basic_fold.py (3 input-mode tests), test_options.py (7 flag tests), test_cpu_gpu_parity.py (auto-skipped until GPU baseline is committed).
- models/rf3/tests/data/: 1cyo.cif, 1cyo_from_json.json, 1cyo_with_ligand.json, integration_baselines/README.md.
- pyproject.toml: registers integration and gpu pytest markers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
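The RF3.py fix above can be illustrated with a minimal sketch. This is not the code from RF3.py: `recycle_outputs`, `final_outputs`, and the pLDDT values are hypothetical stand-ins, and only the pattern (peek with next(), drain into a deque, fall back to the first item when the generator is exhausted) is taken from the commit message.

```python
from collections import deque
from typing import Any, Dict, Iterator


def recycle_outputs(n_recycles: int) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the per-recycle output generator."""
    for i in range(n_recycles):
        yield {"recycle": i, "plddt": 0.6 + 0.05 * i}


def final_outputs(n_recycles: int, threshold: float = 0.5) -> Dict[str, Any]:
    gen = recycle_outputs(n_recycles)

    # Peeking at the first recycle for the early-stop check consumes one item.
    first_recycle_outputs = next(gen)
    if first_recycle_outputs["plddt"] < threshold:
        return {**first_recycle_outputs, "early_stopped": True}

    # Keep only the last remaining item. With n_recycles=1 the generator is
    # already exhausted here, so the deque is empty.
    remaining = deque(gen, maxlen=1)

    # Fallback: indexing an empty deque would raise IndexError, so reuse the
    # first recycle's outputs when nothing is left.
    last = remaining[-1] if remaining else first_recycle_outputs
    return {**last, "early_stopped": False}
```

With `n_recycles=1` the fallback branch is the one taken; with more recycles the deque holds the last recycle's outputs and the fallback is never used.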

Add integration_slow marker to the 7 options tests (each requires its own rf3 fold subprocess). The 3 basic input-mode tests keep only the integration marker and remain fast (~5-10 min, one shared fold call).

CI split:
- quick job: -m "integration and not integration_slow" runs on every push to main/production; timeout 20 min
- full job: -m integration runs only on push to production; timeout 60 min

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add early_stopping_plddt_threshold=0.0 to SPEED_FLAGS so the default threshold (0.5) cannot silently early-stop future low-pLDDT test inputs
- Raise _FOLD_TIMEOUT 600→1800 s to cover the batched 3-input fixture and slow CI runners without a hard 10-minute cap
- Fix num_steps docstring: default is 50 (inference engine), not 200 (training full-rollout)
- Fix seed_dirs and ground_truth_conformer_dir fixtures to unpack the (out_dir, stderr) tuple returned by run_rf3_fold, consistent with all other fixtures
- Fix parity test to assert metrics are non-None rather than silently skipping; a dropped metric key now fails explicitly
- Update parity test skipif reason string (baseline is now committed)
- Fix annotate_b_factor docstring: flag does not force one_model_per_file
- Fix B-factor range check to strict-exclusive (0 < v < 1) matching the pLDDT bin scheme; the old inclusive form would accept 0.0/1.0 which would indicate a binning bug
- Fix one_model_per_file test docstring to describe actual behaviour
- Remove stale "← not yet committed" annotation from baselines README

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
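As a rough illustration of the strict-exclusive B-factor check described above: the helper name is hypothetical, and the sketch assumes the B-factor column has already been parsed into floats. It is not the assertion used in test_options.py.

```python
from typing import Iterable


def bfactors_strictly_inside_unit_interval(values: Iterable[float]) -> bool:
    """Return True only if at least one value is given and every v has 0 < v < 1."""
    values = list(values)
    if not values:
        # An empty column should not pass vacuously.
        return False
    # Strict inequalities: exactly 0.0 or 1.0 is rejected, since the commit
    # message treats those endpoints as a sign of a binning bug.
    return all(0.0 < v < 1.0 for v in values)
```

An inclusive check (`0.0 <= v <= 1.0`) would accept the endpoint values that the strict form is meant to catch.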

The one_model_per_file flag in RF3InferenceEngine.run() is currently a no-op in the normal prediction path: per-sample subdirectory CIFs are always written unconditionally. The test passed regardless of whether the flag was set, giving false confidence that it worked. Remove both the fixture and the test. If the flag is ever properly implemented, a real test can be added at that point.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Commit GPU-generated summary_confidences.json for 1cyo_from_json (overall_plddt=0.6552, ptm=0.308, seed=1, n_recycles=1, num_steps=20). test_confidence_metrics_match_gpu_baseline will now run instead of skip.
- Reword early-stopping warning to be more descriptive
- Formatting-only change in test_basic_fold.py (ruff blank-line fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
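A minimal sketch of how a parity comparison against this baseline could look, using the ±0.05 tolerance from the PR summary and the baseline values quoted above. The function name, the failure-message format, and the example CPU values are assumptions; the real test_cpu_gpu_parity.py may be structured differently.

```python
from typing import Dict, List, Optional, Sequence

TOLERANCE = 0.05  # tolerance stated in the PR summary

# Baseline values quoted in the commit message above.
GPU_BASELINE: Dict[str, Optional[float]] = {"overall_plddt": 0.6552, "ptm": 0.308}


def parity_failures(
    predicted: Dict[str, Optional[float]],
    baseline: Dict[str, Optional[float]],
    keys: Sequence[str],
    tol: float = TOLERANCE,
) -> List[str]:
    """Return one message per metric that is missing or outside the tolerance."""
    failures = []
    for key in keys:
        got, want = predicted.get(key), baseline.get(key)
        if got is None or want is None:
            # A dropped metric key is reported rather than silently skipped.
            failures.append(f"{key}: missing (got={got}, baseline={want})")
            continue
        if abs(got - want) > tol:
            failures.append(f"{key}: {got} differs from baseline {want} by more than {tol}")
    return failures
```

A test would then assert that the returned list is empty, for example `assert parity_failures(cpu_metrics, GPU_BASELINE, ["overall_plddt", "ptm"]) == []`, where `cpu_metrics` is a hypothetical dict loaded from the CPU run's summary_confidences.json.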

Add a "Known limitations" section to the test_cpu_gpu_parity.py module docstring covering stale baselines, the iptm=0.0 single-chain bug, narrow input coverage (protein-only), and the low-quality speed-flag baseline. Add a brief pointer section to integration_baselines/README.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Collapse the quick/slow split into a single job so every trigger runs the full suite. Remove timeout-minutes to let the job run as long as needed (GitHub Actions default of 6 hours applies).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Match the unit test workflow: run on every PR (bare pull_request trigger, no base-branch filter) in addition to pushes to main/production.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

${{ env.HOME }} in a YAML env: block is a GitHub Actions expression context lookup (not the shell $HOME) and was resolving to an empty string, producing a path of /.cache/... that caused all tests to be skipped. Inline the env var assignment in the run: step, where $HOME expands correctly in the shell.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All six option tests now carry only @pytest.mark.integration so they run in per-PR CI alongside the other integration tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lyskov

LGTM!

woodsh17 and others added 6 commits June 16, 2026 18:07

woodsh17 requested a review from lyskov June 16, 2026 23:13

woodsh17 and others added 4 commits June 17, 2026 20:49

ci(rf3): trigger integration tests on pull requests

9f2ccd6


test(rf3): remove integration_slow marker from test_options

3575e26


lyskov approved these changes Jun 25, 2026


woodsh17 merged commit 62eba66 into production Jun 26, 2026
6 checks passed

woodsh17 deleted the rf3-integration-tests branch June 26, 2026 13:20

woodsh17 merged 10 commits into production from rf3-integration-tests
