feat(mpnn): report per-design confidence in inference outputs by daylight-00 · Pull Request #306 · RosettaCommons/foundry

daylight-00 · 2026-06-05T07:53:12Z

Summary

MPNN inference currently reports only sequence recovery, which needs a native sequence and is meaningless for de novo design. The original ProteinMPNN/LigandMPNN additionally report a per-sequence confidence (exp(-mean NLL)), the standard metric for ranking/filtering designs. This PR brings that to the foundry re-implementation.

What's added

For each design, inference now reports:

confidence: exp(-mean NLL over designed residues), where the per-residue NLL is the negative log-probability of the sampled residue; in (0, 1], higher = more confident.
ligand_interface_confidence: same, restricted to polymer-ligand interface residues (ligand_mpnn only); omitted when there are no interface residues.
per-residue confidence: exp(-NLL) per position (0 at non-designed positions), stored in a dedicated mpnn_confidence atom-array field.
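
As a rough illustration of the three quantities listed above, here is a minimal framework-free sketch. It assumes the log-probability of each sampled residue is already available as a plain list; the function name `design_confidence` and its arguments are hypothetical and are not part of the PR, which computes these values on tensors inside the metric classes.

```python
import math


def design_confidence(log_probs, designed_mask):
    """Return (per_residue, overall) confidence for one design.

    log_probs[i] is the log-probability (<= 0) assigned to the residue
    sampled at position i; designed_mask[i] is True where the position
    was actually designed. Both names are illustrative assumptions.
    """
    # Per-residue confidence is exp(-NLL_i) = exp(log p_i), and 0 elsewhere.
    per_residue = [
        math.exp(lp) if designed else 0.0
        for lp, designed in zip(log_probs, designed_mask)
    ]
    nll = [-lp for lp, designed in zip(log_probs, designed_mask) if designed]
    if not nll:
        # Nothing was designed, so the mean is undefined.
        return per_residue, None
    overall = math.exp(-sum(nll) / len(nll))
    return per_residue, overall


log_probs = [math.log(0.9), math.log(0.5), math.log(0.2), math.log(0.7)]
mask = [True, True, True, False]
per_residue, overall = design_confidence(log_probs, mask)
```

Written this way, the overall confidence is the geometric mean of the per-residue probabilities at designed positions, which is why it stays in (0, 1].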

Output format

FASTA header: >name_b{b}_d{d}, confidence=..., ligand_interface_confidence=..., sequence_recovery=..., ligand_interface_sequence_recovery=...
CIF:
- _mpnn_output.confidence / _mpnn_output.ligand_interface_confidence (per-design scalars).
- per-residue confidence in a dedicated _atom_site.mpnn_confidence column (via the existing extra_fields path, like mpnn_temperature); the input b_factor annotation is preserved.
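
The FASTA header layout above could be produced along these lines. This is a sketch only: `format_fasta_header` is a hypothetical helper, not the PR's `write_fasta`, and the four-decimal formatting is an assumption based on the example output in the Verification section.

```python
HEADER_KEYS = (
    "confidence",
    "ligand_interface_confidence",
    "sequence_recovery",
    "ligand_interface_sequence_recovery",
)


def format_fasta_header(name, batch_idx, design_idx, metrics):
    """Build a header of the form '>name_b{b}_d{d}, key=value, ...'."""
    fields = [f"{name}_b{batch_idx}_d{design_idx}"]
    for key in HEADER_KEYS:
        value = metrics.get(key)
        if value is None:
            # Undefined metrics (e.g. no interface residues) are omitted.
            continue
        fields.append(f"{key}={value:.4f}")
    return ">" + ", ".join(fields)


header = format_fasta_header(
    "design",  # hypothetical input name
    0,
    0,
    {
        "confidence": 0.4741,
        "ligand_interface_confidence": None,
        "sequence_recovery": 0.5082,
    },
)
```

Skipping `None` values in the writer keeps the "omitted when there are no interface residues" behavior in one place.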

Implementation

metrics/nll.py:
- SampledNLL / SampledLigandInterfaceNLL: score the sampled sequence (S_sampled) under log_softmax(logits), reusing the existing NLL math and interface-mask machinery. Note: decoder_features["log_probs"] is temperature/bias-scaled (default T=0.1), which would pin confidence near 1.0, so the raw logits are used to match LigandMPNN's T=1.0 definition.
- MPNNConfidence / MPNNLigandInterfaceConfidence: thin wrappers that expose the derived confidence (exp(-NLL)), so the NLL-to-confidence transform lives in the metric layer rather than the engine. Registered as confidence / ligand_interface_confidence.
user_settings.py: expose logits in the minimal-return decoder features.
inference_engines/mpnn.py: register the confidence metrics, read the per-design / per-residue confidence directly, and store per-residue confidence in a dedicated mpnn_confidence atom-array annotation (leaving b_factor untouched). When there are no polymer-ligand interface residues the interface confidence is omitted (None) rather than written as NaN.
utils/inference.py: emit the confidence header fields in write_fasta and add mpnn_confidence to the CIF extra_fields.
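
To see why the note on `metrics/nll.py` insists on raw logits, the following toy sketch compares the probability of the sampled residue at T=1.0 and at T=0.1. The logits are made-up numbers, the `log_softmax` helper is written out by hand rather than taken from the PR, and bias terms are ignored.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]


logits = [2.0, 1.0, 0.0]  # made-up logits over a 3-letter alphabet
sampled = 0               # index of the residue that was sampled

# Confidence at this position under the raw logits (T = 1.0).
conf_raw = math.exp(log_softmax(logits, temperature=1.0)[sampled])
# The same position scored under sampling-temperature-scaled logits (T = 0.1).
conf_scaled = math.exp(log_softmax(logits, temperature=0.1)[sampled])
```

Here `conf_raw` is about 0.67 while `conf_scaled` is above 0.9999, so scores taken from the temperature-scaled distribution would sit near 1.0 for almost every design and carry little ranking signal.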

Tests

tests/test_metrics.py: self-contained unit tests for the new metrics (no structure fixtures, so they run without the test-data assets the integration suite needs). They pin: the metrics read the sampled sequence and the raw logits (not the native sequence / temperature-scaled log_probs); the confidence equals exp(-NLL) of the sampled sequence under log_softmax(logits); the interface confidence is restricted to interface residues; and the no-interface case is cleanly undefined (not a crash).
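
The properties these tests pin can be illustrated with a small hand-computable case. This is a sketch in plain Python, not the PR's test code: `masked_mean_nll` and the toy inputs are hypothetical, and the real metrics operate on tensors.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def masked_mean_nll(logits, sequence, mask):
    """Mean NLL of `sequence` under log_softmax(logits) over masked-in positions.

    Returns NaN when the mask selects no positions.
    """
    terms = [
        -log_softmax(row)[aa]
        for row, aa, keep in zip(logits, sequence, mask)
        if keep
    ]
    return sum(terms) / len(terms) if terms else float("nan")


logits = [[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]]  # 3 positions, 2-letter alphabet
native = [0, 0, 0]
sampled = [0, 1, 1]
designed = [True, True, True]
interface = [False, False, True]

nll_sampled = masked_mean_nll(logits, sampled, designed)
nll_native = masked_mean_nll(logits, native, designed)
nll_interface = masked_mean_nll(logits, sampled, interface)
nll_empty = masked_mean_nll(logits, sampled, [False, False, False])
```

Scoring the sampled and native sequences gives different values, the interface value depends only on the last position, and the empty-mask case is NaN rather than an exception.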

Verification

Ran ligand_mpnn (legacy weights ligandmpnn_v_32_010_25.pt) on a dsDNA binder backbone:

>na_binder_design_dsDNA_basic_0_model_0.cif_b0_d0, confidence=0.4741, ligand_interface_confidence=0.4091, sequence_recovery=0.5082, ligand_interface_sequence_recovery=0.2895

In the output CIF, the input _atom_site.B_iso_or_equiv is preserved and per-residue confidence is written to a separate _atom_site.mpnn_confidence column (range ~0.14-0.93 on designed positions, 0.0 on ligand/DNA/fixed atoms).
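
For downstream filtering, the FASTA header shown above can be split back into its fields. The parser below is a hypothetical sketch, not part of Foundry, and it assumes the design name itself contains no ", " separator.

```python
def parse_fasta_header(header):
    """Split '>name, key=value, ...' into (name, {key: float})."""
    name, *pairs = header.lstrip(">").split(", ")
    metrics = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        metrics[key] = float(value)
    return name, metrics


header = (
    ">na_binder_design_dsDNA_basic_0_model_0.cif_b0_d0, confidence=0.4741, "
    "ligand_interface_confidence=0.4091, sequence_recovery=0.5082, "
    "ligand_interface_sequence_recovery=0.2895"
)
name, metrics = parse_fasta_header(header)
```

Sorting a list of parsed designs by `metrics["confidence"]` in descending order is one simple way to rank them; designs without an interface will lack the `ligand_interface_confidence` key, so use `metrics.get(...)` when reading it.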

Notes / scope

Per-residue confidence is stored in a dedicated mpnn_confidence field rather than b_factor, so the input structure's B-factors are preserved. (The original LigandMPNN overwrites PDB b-factors; Foundry keeps them.)
ligand_interface_confidence is omitted when the input has no polymer-ligand interface residues (e.g. ligand_mpnn on a monomer).
Confidence values are stochastic across runs unless a seed is set.
FASTA/CIF output overlaps with the fix/mpnn-fasta-output branch; confidence values live in output_dict, so they thread into that branch's writer by adding the two fields to its header allowlist.

MPNN inference only reported sequence recovery, which requires a native sequence and is meaningless for de novo design. The original ProteinMPNN/LigandMPNN also report a per-sequence confidence (exp(-mean NLL)) used to rank designs. This adds that to parity.

- Add SampledNLL / SampledInterfaceNLL metrics that score the *sampled* sequence (reusing the existing NLL math + interface mask) instead of the native sequence used by the training-time NLL metric.
- Compute confidence from the un-temperatured, un-biased logits (log_softmax of raw logits), matching LigandMPNN's T=1.0 definition. Using decoder_features["log_probs"] directly would be wrong: it is temperature- and bias-scaled (default T=0.1), pinning confidence near 1.0 and making it useless for ranking.
- Expose raw logits in the minimal-return decoder features.
- Wire overall_confidence + ligand_confidence (ligand_mpnn) per design into MPNNInferenceEngine, plus per-residue confidence.
- Write overall_confidence/ligand_confidence to FASTA headers and to the CIF mpnn_output category; write per-residue confidence into the standard b-factor column (_atom_site.B_iso_or_equiv), overwriting the inherited input b-factors as the original LigandMPNN does.

overall_confidence = exp(-mean_over_designed_residues(log_probs)), range (0, 1], higher = more confident; matches LigandMPNN README.

Cover the SampledNLL / SampledInterfaceNLL metrics added in the previous commit:

- Assert the kwargs remapping reads the sampled sequence (S_sampled) and the raw logits, not the native sequence or temperature-scaled log_probs.
- Assert SampledNLL.compute equals the hand-computed NLL of the sampled sequence under log_softmax(logits), with masked positions zeroed.

These are self-contained (no structure fixtures) so they run without the test-data assets the integration suite requires. Also add Google-format docstrings to the overridden compute() methods.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds “confidence” metrics that score the sampled (designed) sequence using raw logits, and propagates those confidences through inference outputs (FASTA headers and structure B-factors) to mirror LigandMPNN’s reporting.

Changes:

Introduces SampledNLL / SampledInterfaceNLL to compute NLL (and derived confidence) on the sampled sequence using log_softmax(logits).
Updates inference to emit overall_confidence / ligand_confidence, write them to FASTA headers, and store per-residue confidence in structure b_factor.
Ensures logits are preserved through feature aggregation and adds unit tests for the new sampled-confidence wiring.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

File	Description
models/mpnn/tests/test_metrics.py	Adds tests to ensure sampled-confidence metrics read sampled sequence and raw logits.
models/mpnn/src/mpnn/utils/inference.py	Adds confidence fields to output docs and FASTA header rendering.
models/mpnn/src/mpnn/transforms/feature_aggregation/user_settings.py	Ensures `logits` are kept in aggregated decoder features for confidence computation.
models/mpnn/src/mpnn/metrics/nll.py	Implements `SampledNLL` and `SampledInterfaceNLL` based on raw logits + sampled sequence.
models/mpnn/src/mpnn/inference_engines/mpnn.py	Computes confidences from new metrics, writes per-residue confidence into `b_factor`, and adds output fields.

Address PR review feedback:

- Rename the `log_probs` compute parameter and its `kwargs_to_compute_args` key to `logits` in SampledNLL/SampledInterfaceNLL. The argument carries raw logits, so the name now matches the contents.
- Add a unit test for SampledInterfaceNLL that injects a known interface mask and asserts the interface NLL is restricted to interface positions and scores the sampled sequence (not native S) under log_softmax(logits).

daylight-00 · 2026-06-07T00:20:16Z

Resolves #239

AndrewKubaney

Thanks for implementing this! It will be useful to have the ProteinMPNN/LigandMPNN confidences included in the output.

I have two changes/questions:

I would recommend adding MPNNConfidence and MPNNInterfaceConfidence metrics, corresponding to overall_confidence and ligand_confidence here. I think those names would make the metrics clearer and help distinguish them from other confidence metrics. These could be wrappers around the sampled NLL metrics you added, which would also move the NLL-to-confidence calculation out of the top-level MPNNInferenceEngine.
I agree that it would be useful to store the per-residue confidences in the atom array. I have some changes in another branch that could enable writing these per-residue confidences to the output CIF. My only concern is that overwriting the b-factor may be undesirable if users want to retain the original b-factor annotation. Could we instead create a separate atom array field for the MPNN confidence? If the goal is to write this into the CIF output, we can also discuss how to merge this with the changes in my fix/mpnn-fasta-output branch.

Happy to discuss further!

-Andrew

…nce in a dedicated field

Address review feedback on RosettaCommons#306:

- Add MPNNConfidence / MPNNInterfaceConfidence wrappers around SampledNLL / SampledInterfaceNLL that expose confidence = exp(-NLL) directly, moving the NLL-to-confidence transform out of MPNNInferenceEngine into the metric layer (registered as overall_confidence / ligand_confidence).
- Store per-residue confidence in a dedicated `mpnn_confidence` AtomArray annotation (serialized to CIF as _atom_site.mpnn_confidence via extra_fields) instead of overwriting b_factor, preserving the input structure's B-factors.
- Scope SampledNLL/SampledInterfaceNLL docstrings to NLL only (confidence is documented on the wrappers); copy the parent kwargs mapping defensively (dict + pop) and mask per-residue confidence with torch.where.
- Add unit tests for the confidence wrappers.

daylight-00 · 2026-06-07T05:10:10Z

Thanks for the review, Andrew. I've addressed both points.

I added MPNNConfidence / MPNNInterfaceConfidence wrappers so the sampled NLL metrics and the derived exp(-NLL) confidence values stay separated. The inference engine now consumes the confidence outputs directly instead of performing the NLL-to-confidence transform itself.
I also changed the per-residue confidence storage to avoid overwriting b_factor. I originally used b_factor because the original LigandMPNN writes per-residue confidence there, but I agree that preserving the input B-factor is cleaner for Foundry. The values now go into a dedicated mpnn_confidence AtomArray annotation.
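
The separation described in the first point above, an NLL metric plus a thin wrapper that derives confidence from it, might look roughly like this. Both class names are hypothetical stand-ins written in plain Python; they do not reproduce the signatures of the actual SampledNLL or MPNNConfidence classes.

```python
import math


class SampledNLLSketch:
    """Hypothetical stand-in for a metric returning the mean sampled NLL."""

    def compute(self, log_probs, mask):
        terms = [-lp for lp, keep in zip(log_probs, mask) if keep]
        # Undefined (NaN) when no position is selected.
        return sum(terms) / len(terms) if terms else float("nan")


class ConfidenceSketch:
    """Thin wrapper: delegates to the NLL metric, then applies exp(-NLL)."""

    def __init__(self, nll_metric):
        self._nll_metric = nll_metric

    def compute(self, log_probs, mask):
        nll = self._nll_metric.compute(log_probs, mask)
        # exp(-NaN) is NaN, so an undefined NLL stays undefined here.
        return math.exp(-nll)


metric = ConfidenceSketch(SampledNLLSketch())
confidence = metric.compute([math.log(0.5), math.log(0.5)], [True, True])
undefined = metric.compute([math.log(0.5)], [False])
```

With the transform living in the wrapper, a caller such as the inference engine only reads the confidence value and never needs to know it came from an NLL.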

For CIF output, I added mpnn_confidence to the existing extra_fields allowlist, so it serializes as _atom_site.mpnn_confidence, following the same path as other MPNN-specific fields like mpnn_temperature. I took a look at your fix/mpnn-fasta-output branch, and this seems compatible with the write_structure changes there, but I'm happy to rename/re-categorize the field if that branch settles on a different output convention.

For the FASTA output overlap, I also checked the refactored writer in your branch. Since the confidence values are now in output_dict, I think they should thread into the new design-entry writer by adding overall_confidence and ligand_confidence to the design-entry header allowlist. Let me know whether you'd prefer me to rebase onto your branch or land this first and reconcile afterward.

David

AndrewKubaney

Apologies for the delay in my review! This looks great; thank you for implementing this! I would suggest a few minor changes:

For the kwargs_to_compute_args function in your new metrics classes, I think it would be more straightforward to construct the mapping dictionary from scratch, rather than relying on super. This will avoid the .pop logic, which might be hard to understand.
I think MPNNInterfaceConfidence could be renamed to MPNNLigandInterfaceConfidence, so that people don't misinterpret this as applying to interfaces other than protein-ligand ones (for instance, protein-protein interfaces).
I think mpnn_confidence is a good name for the atom array feature; what do you think about changing the final confidence names (the ones that end up in the FASTA) from overall_confidence to confidence or mpnn_confidence (and similarly ligand_confidence to ligand_interface_confidence or mpnn_ligand_interface_confidence)? I think those might be more descriptive, but they do deviate from the original ProteinMPNN/LigandMPNN norm.
One edge case involves attempting to compute the ligand interface confidence when no ligand is present (if someone runs LigandMPNN on a protein monomer) or there are no valid ligand-interface residues. Could you add a test case to make sure this is handled cleanly?

Thanks again for your work on this! After addressing these, I think we should merge. The other branch can be merged at a later time.

Andrew

…nterface edge case

Address @AndrewKubaney's second review on RosettaCommons#306:

- Build kwargs_to_compute_args explicitly in SampledNLL / SampledLigandInterfaceNLL instead of copying the parent mapping and popping "log_probs" (I'd used copy+pop to avoid mutating the parent, but the explicit literal reads more clearly).
- Rename MPNNInterfaceConfidence -> MPNNLigandInterfaceConfidence and SampledInterfaceNLL -> SampledLigandInterfaceNLL so the polymer-ligand interface scope is unambiguous (vs e.g. protein-protein interfaces).
- Rename reported fields overall_confidence -> confidence and ligand_confidence -> ligand_interface_confidence, mirroring the existing sequence_recovery / ligand_interface_sequence_recovery keys and Foundry's un-prefixed output-key convention (the mpnn_confidence atom-array field is unchanged).
- Handle the no-interface edge case: a ligand_mpnn run with no polymer-ligand interface residues yields an undefined (NaN) interface confidence; the engine now emits None (omitted) rather than NaN. Add a unit test.

daylight-00 · 2026-06-18T10:15:37Z

Thanks, Andrew — all four addressed.

kwargs_to_compute_args: switched to constructing the mapping explicitly. (I'd used a copy + pop("log_probs") to avoid mutating the parent mapping, but I agree the explicit literal is clearer.)
Renamed MPNNInterfaceConfidence -> MPNNLigandInterfaceConfidence (and the underlying SampledInterfaceNLL -> SampledLigandInterfaceNLL) so the polymer-ligand scope is explicit.
Renamed the reported fields: overall_confidence -> confidence and ligand_confidence -> ligand_interface_confidence. I like this better too — it mirrors the existing sequence_recovery / ligand_interface_sequence_recovery keys and Foundry's un-prefixed output-key convention (the mpnn_ prefix is used for atom-array fields like mpnn_confidence, which I kept as-is).
Added handling + a test for the no-interface case: when ligand_mpnn runs on an input with no polymer-ligand interface residues, the interface confidence is undefined (NaN, flagged by valid_examples_mask), and the engine now converts it to None so it's omitted rather than written as NaN.
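
The NaN-to-None handling in the last point can be sketched as below. `nan_to_none` is a hypothetical helper for illustration; the PR performs the equivalent conversion inside the inference engine, where the undefined case is also flagged by valid_examples_mask.

```python
import math


def nan_to_none(value):
    """Map an undefined (NaN) metric value to None; pass others through."""
    if value is None or math.isnan(value):
        return None
    return value


# An input with interface residues keeps its value.
with_interface = nan_to_none(0.4091)
# A monomer-only input yields NaN upstream, which becomes None here.
without_interface = nan_to_none(float("nan"))
```

Returning `None` lets writers drop the field entirely instead of emitting a literal `nan` into FASTA headers or CIF categories.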

One small naming question: the sibling metric classes in mpnn/metrics are unprefixed (NLL, SequenceRecovery, InterfaceNLL, ...), so for consistency I could drop the MPNN prefix on the new classes -> Confidence / LigandInterfaceConfidence. I'm equally happy to keep MPNN* if you prefer it.

Sounds good on merging this and leaving the fix/mpnn-fasta-output integration for later. Thanks for the thorough review!

David

daylight-00 added 2 commits June 5, 2026 16:27

Copilot AI review requested due to automatic review settings June 5, 2026 07:53

Copilot AI reviewed Jun 5, 2026

rclune assigned AndrewKubaney Jun 5, 2026

AndrewKubaney requested changes Jun 7, 2026

daylight-00 requested a review from AndrewKubaney June 7, 2026 05:10

AndrewKubaney requested changes Jun 18, 2026

daylight-00 requested a review from AndrewKubaney June 18, 2026 10:16

daylight-00 changed the title ~~feat(mpnn): report overall/ligand confidence in inference outputs~~ feat(mpnn): report per-design confidence in inference outputs Jun 18, 2026

daylight-00 wants to merge 5 commits into RosettaCommons:production from daylight-00:feat/mpnn-confidence-output

3 participants
