feat(mpnn): report per-design confidence in inference outputs #306

Open
daylight-00 wants to merge 5 commits into RosettaCommons:production from daylight-00:feat/mpnn-confidence-output
Conversation

@daylight-00

@daylight-00 daylight-00 commented Jun 5, 2026

Contributor

Summary

MPNN inference currently reports only sequence recovery, which needs a native sequence and is meaningless for de novo design. The original ProteinMPNN/LigandMPNN additionally report a per-sequence confidence (exp(-mean NLL)), the standard metric for ranking/filtering designs. This PR brings that to the foundry re-implementation.

What's added

For each design, inference now reports:

  • confidence: exp(-mean NLL over designed residues), where NLL is the negative log-probability of the sampled residue at each position; in (0, 1], higher = more confident.
  • ligand_interface_confidence: same, restricted to polymer-ligand interface residues (ligand_mpnn only); omitted when there are no interface residues.
  • per-residue confidence: exp(-NLL) per position (0 at non-designed positions), stored in a dedicated mpnn_confidence atom-array field.
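As a rough illustration of the quantities listed above, here is a minimal pure-Python sketch. It assumes the log-probability of the sampled residue at each position is already available (taken at T=1.0); the function names `per_residue_confidence` and `design_confidence` are hypothetical and are not the metric classes added in this PR.

```python
import math


def per_residue_confidence(sampled_log_probs, designed_mask):
    """Return exp(-NLL) at designed positions and 0.0 everywhere else."""
    return [
        math.exp(lp) if designed else 0.0  # exp(-NLL) == exp(log p)
        for lp, designed in zip(sampled_log_probs, designed_mask)
    ]


def design_confidence(sampled_log_probs, designed_mask):
    """Return exp(-mean NLL) over designed positions (a geometric mean of p)."""
    nll = [-lp for lp, designed in zip(sampled_log_probs, designed_mask) if designed]
    if not nll:
        # Sketch-only choice: this helper assumes at least one designed position.
        raise ValueError("no designed positions")
    return math.exp(-sum(nll) / len(nll))


# Illustrative values only: probabilities of the sampled residue at 4 positions.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.2), math.log(0.7)]
mask = [True, True, False, True]  # position 2 is fixed (not designed)

print(design_confidence(log_probs, mask))
print(per_residue_confidence(log_probs, mask))
```

Because each per-position probability lies in (0, 1], the per-design value is their geometric mean and also lies in (0, 1].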

Output format

  • FASTA header: >name_b{b}_d{d}, confidence=..., ligand_interface_confidence=..., sequence_recovery=..., ligand_interface_sequence_recovery=...
  • CIF:
    • _mpnn_output.confidence / _mpnn_output.ligand_interface_confidence (per-design scalars).
    • per-residue confidence in a dedicated _atom_site.mpnn_confidence column (via the existing extra_fields path, like mpnn_temperature); the input b_factor annotation is preserved.
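The header layout above can be sketched as follows. This is not the PR's `write_fasta`; `format_fasta_header` is a hypothetical helper, and the four-decimal formatting is an assumption based on the example header shown under Verification.

```python
def format_fasta_header(name, batch_idx, design_idx, fields):
    """Build a '>name_b{b}_d{d}, key=value, ...' header, skipping None values."""
    parts = [f">{name}_b{batch_idx}_d{design_idx}"]
    for key, value in fields.items():
        if value is None:
            # e.g. ligand_interface_confidence when there are no interface residues
            continue
        parts.append(f"{key}={value:.4f}")
    return ", ".join(parts)


header = format_fasta_header(
    "example",  # placeholder design name
    0,
    0,
    {
        "confidence": 0.4741,
        "ligand_interface_confidence": None,  # omitted from the header
        "sequence_recovery": 0.5082,
    },
)
print(header)
```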

Implementation

  • metrics/nll.py:
    • SampledNLL / SampledLigandInterfaceNLL: score the sampled sequence (S_sampled) under log_softmax(logits), reusing the existing NLL math and interface-mask machinery. Note: decoder_features["log_probs"] is temperature/bias-scaled (default T=0.1), which would pin confidence near 1.0, so the raw logits are used to match LigandMPNN's T=1.0 definition.
    • MPNNConfidence / MPNNLigandInterfaceConfidence: thin wrappers that expose the derived confidence (exp(-NLL)), so the NLL-to-confidence transform lives in the metric layer rather than the engine. Registered as confidence / ligand_interface_confidence.
  • user_settings.py: expose logits in the minimal-return decoder features.
  • inference_engines/mpnn.py: register the confidence metrics, read the per-design / per-residue confidence directly, and store per-residue confidence in a dedicated mpnn_confidence atom-array annotation (leaving b_factor untouched). When there are no polymer-ligand interface residues the interface confidence is omitted (None) rather than written as NaN.
  • utils/inference.py: emit the confidence header fields in write_fasta and add mpnn_confidence to the CIF extra_fields.
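To see why temperature-scaled log_probs would pin the confidence near 1.0, consider this small stdlib-only sketch. The logits are made-up illustrative values, and `log_softmax` here is a local helper rather than the framework function the PR uses; only the temperature effect is shown, not the bias term.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]


logits = [2.0, 1.0, 0.5, 0.0]  # illustrative raw logits for one position
sampled = 0  # index of the sampled residue type

p_t1 = math.exp(log_softmax(logits, temperature=1.0)[sampled])
p_t01 = math.exp(log_softmax(logits, temperature=0.1)[sampled])

print(f"T=1.0: {p_t1:.4f}")   # about 0.58
print(f"T=0.1: {p_t01:.4f}")  # very close to 1.0
```

At T=0.1 the distribution is sharpened so much that the sampled residue's probability saturates, which removes the spread between designs that ranking depends on.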

Tests

  • tests/test_metrics.py: self-contained unit tests for the new metrics (no structure fixtures, so they run without the test-data assets the integration suite needs). They pin: the metrics read the sampled sequence and the raw logits (not the native sequence / temperature-scaled log_probs); the confidence equals exp(-NLL) of the sampled sequence under log_softmax(logits); the interface confidence is restricted to interface residues; and the no-interface case is cleanly undefined (not a crash).

Verification

Ran ligand_mpnn (legacy weights ligandmpnn_v_32_010_25.pt) on a dsDNA binder backbone:

>na_binder_design_dsDNA_basic_0_model_0.cif_b0_d0, confidence=0.4741, ligand_interface_confidence=0.4091, sequence_recovery=0.5082, ligand_interface_sequence_recovery=0.2895

In the output CIF, the input _atom_site.B_iso_or_equiv is preserved and per-residue confidence is written to a separate _atom_site.mpnn_confidence column (range ~0.14-0.93 on designed positions, 0.0 on ligand/DNA/fixed atoms).
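For downstream filtering, a header like the one above can be parsed back into numeric fields. The sketch below is a hypothetical helper, not part of this PR; it assumes the `, `-separated `key=value` layout shown in the example and would need adjusting if a design name itself contains `, `.

```python
def parse_fasta_header(header):
    """Split '>name, key=value, ...' into (name, {key: float})."""
    name, *pairs = header.lstrip(">").split(", ")
    fields = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        fields[key] = float(value)
    return name, fields


header = (
    ">na_binder_design_dsDNA_basic_0_model_0.cif_b0_d0, confidence=0.4741, "
    "ligand_interface_confidence=0.4091, sequence_recovery=0.5082, "
    "ligand_interface_sequence_recovery=0.2895"
)
name, fields = parse_fasta_header(header)

# ligand_interface_confidence may be absent, so use .get() when filtering.
print(name, fields["confidence"], fields.get("ligand_interface_confidence"))
```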

Notes / scope

  • Per-residue confidence is stored in a dedicated mpnn_confidence field rather than b_factor, so the input structure's B-factors are preserved. (The original LigandMPNN overwrites PDB b-factors; Foundry keeps them.)
  • ligand_interface_confidence is omitted when the input has no polymer-ligand interface residues (e.g. ligand_mpnn on a monomer).
  • Confidence values are stochastic across runs unless a seed is set.
  • FASTA/CIF output overlaps with the fix/mpnn-fasta-output branch; confidence values live in output_dict, so they thread into that branch's writer by adding the two fields to its header allowlist.

MPNN inference only reported sequence recovery, which requires a native
sequence and is meaningless for de novo design. The original
ProteinMPNN/LigandMPNN also report a per-sequence confidence
(exp(-mean NLL)) used to rank designs. This adds it for parity.

- Add SampledNLL / SampledInterfaceNLL metrics that score the *sampled*
  sequence (reusing the existing NLL math + interface mask) instead of
  the native sequence used by the training-time NLL metric.
- Compute confidence from the un-temperatured, un-biased logits
  (log_softmax of raw logits), matching LigandMPNN's T=1.0 definition.
  Using decoder_features["log_probs"] directly would be wrong: it is
  temperature- and bias-scaled (default T=0.1), pinning confidence near
  1.0 and making it useless for ranking.
- Expose raw logits in the minimal-return decoder features.
- Wire overall_confidence + ligand_confidence (ligand_mpnn) per design
  into MPNNInferenceEngine, plus per-residue confidence.
- Write overall_confidence/ligand_confidence to FASTA headers and to the
  CIF mpnn_output category; write per-residue confidence into the
  standard b-factor column (_atom_site.B_iso_or_equiv), overwriting the
  inherited input b-factors as the original LigandMPNN does.

overall_confidence = exp(-mean_over_designed_residues(NLL)),
range (0, 1], higher = more confident; matches LigandMPNN README.

Cover the SampledNLL / SampledInterfaceNLL metrics added in the previous
commit:
- Assert the kwargs remapping reads the sampled sequence (S_sampled) and
  the raw logits, not the native sequence or temperature-scaled log_probs.
- Assert SampledNLL.compute equals the hand-computed NLL of the sampled
  sequence under log_softmax(logits), with masked positions zeroed.
These are self-contained (no structure fixtures) so they run without the
test-data assets the integration suite requires.

Also add Google-format docstrings to the overridden compute() methods.
Copilot AI review requested due to automatic review settings June 5, 2026 07:53

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds “confidence” metrics that score the sampled (designed) sequence using raw logits, and propagates those confidences through inference outputs (FASTA headers and structure B-factors) to mirror LigandMPNN’s reporting.

Changes:

  • Introduces SampledNLL / SampledInterfaceNLL to compute NLL (and derived confidence) on the sampled sequence using log_softmax(logits).
  • Updates inference to emit overall_confidence / ligand_confidence, write them to FASTA headers, and store per-residue confidence in structure b_factor.
  • Ensures logits are preserved through feature aggregation and adds unit tests for the new sampled-confidence wiring.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| models/mpnn/tests/test_metrics.py | Adds tests to ensure sampled-confidence metrics read sampled sequence and raw logits. |
| models/mpnn/src/mpnn/utils/inference.py | Adds confidence fields to output docs and FASTA header rendering. |
| models/mpnn/src/mpnn/transforms/feature_aggregation/user_settings.py | Ensures logits are kept in aggregated decoder features for confidence computation. |
| models/mpnn/src/mpnn/metrics/nll.py | Implements SampledNLL and SampledInterfaceNLL based on raw logits + sampled sequence. |
| models/mpnn/src/mpnn/inference_engines/mpnn.py | Computes confidences from new metrics, writes per-residue confidence into b_factor, and adds output fields. |

Comment thread models/mpnn/src/mpnn/metrics/nll.py Outdated
Comment thread models/mpnn/src/mpnn/metrics/nll.py Outdated
Comment thread models/mpnn/src/mpnn/metrics/nll.py Outdated
Comment thread models/mpnn/src/mpnn/metrics/nll.py
Comment thread models/mpnn/src/mpnn/metrics/nll.py Outdated
Comment thread models/mpnn/src/mpnn/metrics/nll.py
Comment thread models/mpnn/src/mpnn/inference_engines/mpnn.py Outdated
Comment thread models/mpnn/src/mpnn/metrics/nll.py Outdated
Comment thread models/mpnn/src/mpnn/metrics/nll.py Outdated
Comment thread models/mpnn/src/mpnn/metrics/nll.py
Address PR review feedback:
- Rename the `log_probs` compute parameter and its `kwargs_to_compute_args`
  key to `logits` in SampledNLL/SampledInterfaceNLL. The argument carries
  raw logits, so the name now matches the contents.
- Add a unit test for SampledInterfaceNLL that injects a known interface
  mask and asserts the interface NLL is restricted to interface positions
  and scores the sampled sequence (not native S) under log_softmax(logits).
@daylight-00

Contributor Author

Resolves #239

@AndrewKubaney AndrewKubaney left a comment

Collaborator

Thanks for implementing this! It will be useful to have the ProteinMPNN/LigandMPNN confidences included in the output.

I have two changes/questions:

  1. I would recommend adding MPNNConfidence and MPNNInterfaceConfidence metrics, corresponding to overall_confidence and ligand_confidence here. I think those names would make the metrics clearer and help distinguish them from other confidence metrics. These could be wrappers around the sampled NLL metrics you added, which would also move the NLL-to-confidence calculation out of the top-level MPNNInferenceEngine.

  2. I agree that it would be useful to store the per-residue confidences in the atom array. I have some changes in another branch that could enable writing these per-residue confidences to the output CIF. My only concern is that overwriting the b-factor may be undesirable if users want to retain the original b-factor annotation. Could we instead create a separate atom array field for the MPNN confidence? If the goal is to write this into the CIF output, we can also discuss how to merge this with the changes in my fix/mpnn-fasta-output branch.

Happy to discuss further!

-Andrew

…nce in a dedicated field

Address review feedback on RosettaCommons#306:
- Add MPNNConfidence / MPNNInterfaceConfidence wrappers around SampledNLL /
  SampledInterfaceNLL that expose confidence = exp(-NLL) directly, moving the
  NLL-to-confidence transform out of MPNNInferenceEngine into the metric layer
  (registered as overall_confidence / ligand_confidence).
- Store per-residue confidence in a dedicated `mpnn_confidence` AtomArray
  annotation (serialized to CIF as _atom_site.mpnn_confidence via extra_fields)
  instead of overwriting b_factor, preserving the input structure's B-factors.
- Scope SampledNLL/SampledInterfaceNLL docstrings to NLL only (confidence is
  documented on the wrappers); copy the parent kwargs mapping defensively
  (dict + pop) and mask per-residue confidence with torch.where.
- Add unit tests for the confidence wrappers.
@daylight-00

Contributor Author

Thanks for the review, Andrew. I've addressed both points.

  1. I added MPNNConfidence / MPNNInterfaceConfidence wrappers so the sampled NLL metrics and the derived exp(-NLL) confidence values stay separated. The inference engine now consumes the confidence outputs directly instead of performing the NLL-to-confidence transform itself.

  2. I also changed the per-residue confidence storage to avoid overwriting b_factor. I originally used b_factor because the original LigandMPNN writes per-residue confidence there, but I agree that preserving the input B-factor is cleaner for Foundry. The values now go into a dedicated mpnn_confidence AtomArray annotation.
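The separation described in point 1 can be sketched as a thin wrapper over an NLL metric. The class names, the `compute` signature, and the use of composition below are all assumptions made for illustration; the PR's actual wrappers may be structured differently (for example as subclasses).

```python
import math


class SampledNLLSketch:
    """Hypothetical stand-in: mean NLL of the sampled residues over a mask."""

    def compute(self, sampled_log_probs, mask):
        selected = [-lp for lp, keep in zip(sampled_log_probs, mask) if keep]
        if not selected:
            return float("nan")  # undefined when the mask selects nothing
        return sum(selected) / len(selected)


class ConfidenceSketch:
    """Hypothetical wrapper: exposes exp(-NLL) so callers never see the NLL."""

    def __init__(self, nll_metric):
        self._nll_metric = nll_metric

    def compute(self, sampled_log_probs, mask):
        return math.exp(-self._nll_metric.compute(sampled_log_probs, mask))


metric = ConfidenceSketch(SampledNLLSketch())
value = metric.compute([math.log(0.8), math.log(0.2)], [True, True])
print(value)
```

With this split, an engine-like caller only reads `value` and does not need to know how the confidence is derived from the NLL.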

For CIF output, I added mpnn_confidence to the existing extra_fields allowlist, so it serializes as _atom_site.mpnn_confidence, following the same path as other MPNN-specific fields like mpnn_temperature. I took a look at your fix/mpnn-fasta-output branch, and this seems compatible with the write_structure changes there, but I'm happy to rename/re-categorize the field if that branch settles on a different output convention.

For the FASTA output overlap, I also checked the refactored writer in your branch. Since the confidence values are now in output_dict, I think they should thread into the new design-entry writer by adding overall_confidence and ligand_confidence to the design-entry header allowlist. Let me know whether you'd prefer me to rebase onto your branch or land this first and reconcile afterward.

David

@daylight-00 daylight-00 requested a review from AndrewKubaney June 7, 2026 05:10

@AndrewKubaney AndrewKubaney left a comment

Collaborator

Apologies for the delay in my review! This looks great; thank you for implementing this! I would suggest a few minor changes:

  1. For the kwargs_to_compute_args function in your new metrics classes, I think it would be more straightforward to construct the mapping dictionary from scratch, rather than relying on super. This will avoid the .pop logic, which might be hard to understand.

  2. I think MPNNInterfaceConfidence could be renamed to MPNNLigandInterfaceConfidence, so that people don't misinterpret this as applying to interfaces other than protein-ligand ones (for instance, protein-protein interfaces).

  3. I think mpnn_confidence is a good name for the atom array feature; what do you think about changing the final confidence names (the ones that end up in the FASTA) from overall_confidence to confidence or mpnn_confidence (and similarly ligand_confidence to ligand_interface_confidence or mpnn_ligand_interface_confidence)? I think those might be more descriptive, but they do deviate from the original ProteinMPNN/LigandMPNN norm.

  4. One edge case involves attempting to compute the ligand interface confidence when no ligand is present (if someone runs LigandMPNN on a protein monomer) or there are no valid ligand-interface residues. Could you add a test case to make sure this is handled cleanly?

Thanks again for your work on this! After addressing these, I think we should merge. The other branch can be merged at a later time.

-Andrew

…nterface edge case

Address @AndrewKubaney's second review on RosettaCommons#306:
- Build kwargs_to_compute_args explicitly in SampledNLL /
  SampledLigandInterfaceNLL instead of copying the parent mapping and popping
  "log_probs" (I'd used copy+pop to avoid mutating the parent, but the explicit
  literal reads more clearly).
- Rename MPNNInterfaceConfidence -> MPNNLigandInterfaceConfidence and
  SampledInterfaceNLL -> SampledLigandInterfaceNLL so the polymer-ligand
  interface scope is unambiguous (vs e.g. protein-protein interfaces).
- Rename reported fields overall_confidence -> confidence and
  ligand_confidence -> ligand_interface_confidence, mirroring the existing
  sequence_recovery / ligand_interface_sequence_recovery keys and Foundry's
  un-prefixed output-key convention (the mpnn_confidence atom-array field is
  unchanged).
- Handle the no-interface edge case: a ligand_mpnn run with no polymer-ligand
  interface residues yields an undefined (NaN) interface confidence; the engine
  now emits None (omitted) rather than NaN. Add a unit test.
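The first bullet contrasts two ways of building the same mapping. The sketch below is only a schematic comparison: the real shape of `kwargs_to_compute_args` is not shown in this thread, so the dictionary values and the `mask_for_loss` key are placeholders.

```python
# Placeholder parent mapping: compute-argument name -> where the value comes from.
PARENT_MAPPING = {
    "log_probs": "decoder_features.log_probs",
    "S": "input_features.S",
    "mask": "input_features.mask_for_loss",
}


def mapping_via_copy_and_pop(parent):
    """Earlier approach: copy the parent mapping, then swap entries."""
    mapping = dict(parent)  # copy so the parent is not mutated
    mapping.pop("log_probs")
    mapping["logits"] = "decoder_features.logits"
    mapping["S"] = "decoder_features.S_sampled"
    return mapping


def mapping_explicit():
    """Reviewed approach: spell out the whole mapping as one literal."""
    return {
        "logits": "decoder_features.logits",
        "S": "decoder_features.S_sampled",
        "mask": "input_features.mask_for_loss",
    }


print(mapping_via_copy_and_pop(PARENT_MAPPING) == mapping_explicit())
```

Both produce the same mapping; the literal version just makes every key visible at the point of definition, which is the readability argument made in the review.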
@daylight-00

daylight-00 commented Jun 18, 2026

Contributor Author

Thanks, Andrew — all four addressed.

  1. kwargs_to_compute_args: switched to constructing the mapping explicitly. (I'd used a copy + pop("log_probs") to avoid mutating the parent mapping, but I agree the explicit literal is clearer.)

  2. Renamed MPNNInterfaceConfidence -> MPNNLigandInterfaceConfidence (and the underlying SampledInterfaceNLL -> SampledLigandInterfaceNLL) so the polymer-ligand scope is explicit.

  3. Renamed the reported fields: overall_confidence -> confidence and ligand_confidence -> ligand_interface_confidence. I like this better too — it mirrors the existing sequence_recovery / ligand_interface_sequence_recovery keys and Foundry's un-prefixed output-key convention (the mpnn_ prefix is used for atom-array fields like mpnn_confidence, which I kept as-is).

  4. Added handling + a test for the no-interface case: when ligand_mpnn runs on an input with no polymer-ligand interface residues, the interface confidence is undefined (NaN, flagged by valid_examples_mask), and the engine now converts it to None so it's omitted rather than written as NaN.
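The no-interface behavior in point 4 can be sketched in plain Python. The helper names are hypothetical and the real metric operates on tensors with a valid_examples_mask; this only illustrates the NaN-to-None conversion under those simplifying assumptions.

```python
import math


def interface_confidence(sampled_log_probs, interface_mask):
    """exp(-mean NLL) over interface positions; NaN if there are none."""
    nll = [-lp for lp, at_iface in zip(sampled_log_probs, interface_mask) if at_iface]
    if not nll:
        return float("nan")  # undefined: nothing to average over
    return math.exp(-sum(nll) / len(nll))


def nan_to_none(value):
    """Convert an undefined (NaN) metric to None so writers can omit it."""
    return None if math.isnan(value) else value


log_probs = [math.log(0.6), math.log(0.3)]

with_interface = nan_to_none(interface_confidence(log_probs, [True, False]))
without_interface = nan_to_none(interface_confidence(log_probs, [False, False]))
print(with_interface, without_interface)
```

Returning None instead of NaN lets the output writers drop the field entirely, rather than emitting a value that parses as a number but carries no meaning.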

One small naming question: the sibling metric classes in mpnn/metrics are unprefixed (NLL, SequenceRecovery, InterfaceNLL, ...), so for consistency I could drop the MPNN prefix on the new classes -> Confidence / LigandInterfaceConfidence. I'm equally happy to keep MPNN* if you prefer it.

Sounds good on merging this and leaving the fix/mpnn-fasta-output integration for later. Thanks for the thorough review!

David

@daylight-00 daylight-00 changed the title feat(mpnn): report overall/ligand confidence in inference outputs feat(mpnn): report per-design confidence in inference outputs Jun 18, 2026

3 participants