test(rf3): add integration tests for rf3 #323
Merged
Adds end-to-end integration tests for `rf3 fold` that run on CPU in CI, along with fixes for three production bugs surfaced during the test run.

Bug fixes:
- RF3.py: n_recycles=1 + default early_stopping_plddt_threshold=0.5 crashed because next() consumed the only generator item for the early-stop check, leaving nothing for the deque. Fall back to first_recycle_outputs when the generator is exhausted.
- inference_engines/rf3.py: early-stop path wrote score.csv + _metrics.csv; changed to write _ranking_scores.csv with early_stopped=True so output is consistent with the normal path.
- utils/inference.py: skip_existing checked for _metrics.csv (only written on early-stopped runs); changed sentinel to _ranking_scores.csv so completed folds are correctly detected and skipped.

New files:
- .github/workflows/test_integration.yaml: CI job (push to main/production or workflow_dispatch); caches the RF3 checkpoint, runs integration suite.
- models/rf3/tests/integration/: conftest.py with session-scoped fixtures (speed flags: n_recycles=1, num_steps=20, diffusion_batch_size=1), test_basic_fold.py (3 input-mode tests), test_options.py (7 flag tests), test_cpu_gpu_parity.py (auto-skipped until GPU baseline is committed).
- models/rf3/tests/data/: 1cyo.cif, 1cyo_from_json.json, 1cyo_with_ligand.json, integration_baselines/README.md.
- pyproject.toml: registers integration and gpu pytest markers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
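The RF3.py fix above can be illustrated with a minimal, self-contained sketch. This is not the actual RF3 code: `last_recycle_output` and `fake_recycles` are hypothetical names, the output dicts are stand-ins, and the assumption that a first-recycle pLDDT below the threshold triggers the early stop is inferred from the commit messages.

```python
from collections import deque
from typing import Iterator, Optional


def last_recycle_output(
    recycle_outputs: Iterator[dict],
    plddt_threshold: Optional[float] = 0.5,
) -> dict:
    """Return the final recycle output, or the first one on early stop.

    Hypothetical stand-in for the recycling loop, not the real RF3 code.
    """
    # The early-stop check inspects the first recycle's output, which
    # consumes one item from the generator.
    first_recycle_outputs = next(recycle_outputs)
    if (
        plddt_threshold is not None
        and first_recycle_outputs["plddt"] < plddt_threshold
    ):
        return {**first_recycle_outputs, "early_stopped": True}

    # Keep only the last remaining item. With n_recycles=1 the generator
    # is already exhausted here, so the deque ends up empty.
    remaining = deque(recycle_outputs, maxlen=1)

    # The fix: fall back to the first output instead of popping from an
    # empty deque (which would raise IndexError).
    final = remaining.pop() if remaining else first_recycle_outputs
    return {**final, "early_stopped": False}


def fake_recycles(plddts):
    """Yield one fake output dict per recycle (illustration only)."""
    for i, plddt in enumerate(plddts):
        yield {"recycle": i, "plddt": plddt}
```

With a single recycle, `last_recycle_output(fake_recycles([0.7]))` returns the first (and only) output rather than raising, which is the behaviour the fallback is meant to guarantee.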
Add integration_slow marker to the 7 options tests (each requires its own rf3 fold subprocess). The 3 basic input-mode tests keep only the integration marker and remain fast (~5-10 min, one shared fold call).

CI split:
- quick job: -m "integration and not integration_slow" runs on every push to main/production; timeout 20 min
- full job: -m integration runs only on push to production; timeout 60 min

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add early_stopping_plddt_threshold=0.0 to SPEED_FLAGS so the default threshold (0.5) cannot silently early-stop future low-pLDDT test inputs
- Raise _FOLD_TIMEOUT 600→1800 s to cover the batched 3-input fixture and slow CI runners without a hard 10-minute cap
- Fix num_steps docstring: default is 50 (inference engine), not 200 (training full-rollout)
- Fix seed_dirs and ground_truth_conformer_dir fixtures to unpack the (out_dir, stderr) tuple returned by run_rf3_fold, consistent with all other fixtures
- Fix parity test to assert metrics are non-None rather than silently skipping; a dropped metric key now fails explicitly
- Update parity test skipif reason string (baseline is now committed)
- Fix annotate_b_factor docstring: flag does not force one_model_per_file
- Fix B-factor range check to strict-exclusive (0 < v < 1) matching the pLDDT bin scheme; the old inclusive form would accept 0.0/1.0 which would indicate a binning bug
- Fix one_model_per_file test docstring to describe actual behaviour
- Remove stale "← not yet committed" annotation from baselines README

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
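The strict-exclusive B-factor check from the list above could look roughly like the following. The helper name `b_factors_in_open_unit_interval` is hypothetical, and treating an empty input as a failure is an added assumption, not something the commit states.

```python
from typing import Iterable


def b_factors_in_open_unit_interval(b_factors: Iterable[float]) -> bool:
    """Return True only if every value satisfies 0 < v < 1.

    Hypothetical helper. Exact 0.0 or 1.0 is rejected because, per the
    commit message, such values would indicate a pLDDT binning bug.
    """
    values = list(b_factors)
    # Assumption: an empty list is treated as a failure so that a CIF
    # with no annotated atoms cannot pass the check vacuously.
    return bool(values) and all(0.0 < v < 1.0 for v in values)
```

The old inclusive form (`0 <= v <= 1`) would have accepted a structure whose B-factors were all exactly 0.0, which is the case the stricter check is meant to catch.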
The one_model_per_file flag in RF3InferenceEngine.run() is currently a no-op in the normal prediction path: per-sample subdirectory CIFs are always written unconditionally. The test passed regardless of whether the flag was set, giving false confidence that it worked. Remove both the fixture and the test. If the flag is ever properly implemented, a real test can be added at that point.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Commit GPU-generated summary_confidences.json for 1cyo_from_json (overall_plddt=0.6552, ptm=0.308, seed=1, n_recycles=1, num_steps=20). test_confidence_metrics_match_gpu_baseline will now run instead of skip.
- Reword early-stopping warning to be more descriptive
- Formatting-only change in test_basic_fold.py (ruff blank-line fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
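A parity check against the committed baseline might be sketched as below. The ±0.05 tolerance comes from the PR summary and the baseline values from the commit above; the function name `compare_to_baseline`, the assumption that the JSON keys are literally `overall_plddt` and `ptm`, and the choice to collect failures in a list are illustrative only.

```python
import math
from typing import Mapping, Optional, Sequence


def compare_to_baseline(
    current: Mapping[str, Optional[float]],
    baseline: Mapping[str, Optional[float]],
    keys: Sequence[str] = ("overall_plddt", "ptm"),
    tol: float = 0.05,
) -> list:
    """Return a list of failure messages; an empty list means parity holds."""
    failures = []
    for key in keys:
        cur, ref = current.get(key), baseline.get(key)
        if cur is None or ref is None:
            # A dropped metric is reported as a failure rather than
            # being skipped silently.
            failures.append(f"{key}: missing (current={cur}, baseline={ref})")
        elif not math.isclose(cur, ref, rel_tol=0.0, abs_tol=tol):
            failures.append(f"{key}: {cur} differs from baseline {ref} by more than {tol}")
    return failures


# Baseline values quoted in the commit message above.
GPU_BASELINE = {"overall_plddt": 0.6552, "ptm": 0.308}
```

A test would then assert that `compare_to_baseline(cpu_metrics, GPU_BASELINE)` is empty, so that a missing key and an out-of-tolerance value both fail explicitly.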
Add a "Known limitations" section to the test_cpu_gpu_parity.py module docstring covering stale baselines, the iptm=0.0 single-chain bug, narrow input coverage (protein-only), and the low-quality speed-flag baseline. Add a brief pointer section to integration_baselines/README.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Collapse the quick/slow split into a single job so every trigger runs the full suite. Remove timeout-minutes to let the job run as long as needed (GitHub Actions default of 6 hours applies).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Match the unit test workflow: run on every PR (bare pull_request trigger, no base-branch filter) in addition to pushes to main/production.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
${{ env.HOME }} in a YAML env: block is a GitHub Actions expression
context lookup (not the shell $HOME) and was resolving to empty,
producing a path of /.cache/... that caused all tests to be skipped.
Inline the env var assignment in the run: step where $HOME expands
correctly in the shell.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
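One way to make this class of failure loud rather than silent is to validate the cache root on the Python side before any test decides to skip. The sketch below is an illustration under assumptions: `checkpoint_cache_dir` is a hypothetical helper, and the `example-checkpoints` subdirectory is a placeholder, not the path the workflow actually uses.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def checkpoint_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Build a cache directory under $HOME, refusing an empty HOME.

    Hypothetical helper; the subdirectory name is a placeholder.
    """
    env = os.environ if env is None else env
    home = env.get("HOME", "")
    if not home:
        # An empty HOME would silently produce a path rooted at
        # "/.cache/...", which is the failure mode described above.
        raise RuntimeError("HOME is empty or unset; refusing to build a cache path")
    return Path(home) / ".cache" / "example-checkpoints"
```

Raising here turns a misconfigured environment into an error instead of a fully skipped test run.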
All six option tests now carry only @pytest.mark.integration so they run in per-PR CI alongside the other integration tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
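- End-to-end integration tests for `rf3 fold` covering three input modes (JSON protein-only, JSON with ligand, CIF with ligand) and six flag behaviours (skip_existing, early_stopping, annotate_b_factor, seed reproducibility, template_selection, ground_truth_conformer_selection)
- CPU/GPU parity test that checks confidence metrics against a committed GPU baseline within ±0.05
- CI workflow (quick tier on pushes to main/production, full tier on production only)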
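The `skip_existing` behaviour listed above depends on a sentinel file that every completed fold writes. A minimal sketch of that check follows, assuming a layout in which the sentinel is named `<example_id>_ranking_scores.csv` directly inside the output directory; the helper `is_fold_complete` and that exact layout are assumptions, not the actual code in utils/inference.py.

```python
from pathlib import Path
from typing import Union

# Written by both the normal and the early-stopped path after the fix,
# according to the commit message; the exact file layout is assumed.
SENTINEL_SUFFIX = "_ranking_scores.csv"


def is_fold_complete(out_dir: Union[str, Path], example_id: str) -> bool:
    """Return True if the sentinel file for this example already exists."""
    return (Path(out_dir) / f"{example_id}{SENTINEL_SUFFIX}").is_file()
```

Keying on a file that only one code path writes (as `_metrics.csv` was before the fix) would make completed folds look unfinished, so they would be rerun instead of skipped.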
Test results
10/10 passed on a CPU-only machine (30 min). All 28 shared-layer unit tests
pass. Formatting and linting clean.
Known limitations (documented in test files)
- `iptm=0.0` for single-chain inputs is a pre-existing inference engine bug; both CPU and GPU reproduce it, so parity still holds
- `one_model_per_file` flag is unimplemented in the inference engine; no test