Add Sound Encoder to Cosmos3 by MaciejBalaNV · Pull Request #13911 · huggingface/diffusers

MaciejBalaNV · 2026-06-10T13:21:27Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Signed-off-by: Maciej Bala <mbala@nvidia.com>

yiyixuxu · 2026-06-10T20:35:30Z

+    def _disable_encoder(self):
+        self.encoder = None
+        self._encoder_available = False
+        self.register_to_config(encoder_enabled=False)
+
+    def _fix_state_dict_keys_on_load(self, state_dict: OrderedDict) -> None:
+        super()._fix_state_dict_keys_on_load(state_dict)
+        if self.encoder is not None and not any(key.startswith("encoder.") for key in state_dict):
+            self._disable_encoder()
+


Why do we need these two methods?

It's an extra safety net for checkpoints that do not have the encoder weights. We will update the main checkpoint to include encoder weights, but I think it's still worth keeping these methods for cases such as cached local checkpoints. We don't want those to break for people who don't need the encoder weights.
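A minimal, framework-free sketch of the fallback described above, for illustration only. `TinyAutoencoder`, `has_encoder_weights`, and `load` are hypothetical stand-ins, not the classes or hooks from this PR; the sketch only mirrors the key-prefix check shown in the diff.

```python
from collections import OrderedDict


def has_encoder_weights(state_dict, prefix="encoder."):
    """Return True if any checkpoint key belongs to the encoder submodule."""
    return any(key.startswith(prefix) for key in state_dict)


class TinyAutoencoder:
    """Hypothetical stand-in for a model with an optional encoder."""

    def __init__(self):
        self.encoder = object()  # placeholder for a real encoder module
        self.encoder_enabled = True

    def load(self, state_dict):
        # Same condition as in the diff: an encoder exists on the model,
        # but the checkpoint carries no "encoder." keys.
        if self.encoder is not None and not has_encoder_weights(state_dict):
            self.encoder = None
            self.encoder_enabled = False


# A decoder-only checkpoint (e.g. an older cached copy) disables the encoder.
decoder_only = OrderedDict([("decoder.conv.weight", 0.0)])
old_model = TinyAutoencoder()
old_model.load(decoder_only)

# A full checkpoint leaves the encoder in place.
full = OrderedDict([("encoder.conv.weight", 0.0), ("decoder.conv.weight", 0.0)])
new_model = TinyAutoencoder()
new_model.load(full)
```

With this pattern, a decoder-only checkpoint loads without missing-key errors, and callers can check a flag such as `encoder_enabled` before attempting to encode.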

dg845 · 2026-06-11T02:12:07Z

        return hidden_states


+class Cosmos3AudioSnakeBeta(nn.Module):


It looks like the existing Snake1d module implements essentially the same logic as Cosmos3AudioSnakeBeta. Could we use it for the encoder as well?

The math should be the same, but we'd need a reshape on load, since Cosmos3AudioSnakeBeta has 1D parameters instead of 3D. Let me think about it for a bit.

I kept the separate classes for native checkpoint loading, but they now share a forward implementation.
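The parameter-shape point can be sketched without any framework. The example below assumes the common Snake form, x + sin²(αx) / (β + ε), applied per channel; whether the PR's modules use exactly this form (or a log-scale parameterization) is an assumption. `snake_forward` and `flatten_3d_param` are hypothetical helper names, and nested lists stand in for tensors.

```python
import math


def snake_forward(x, alpha, beta, eps=1e-9):
    """Shared math: v + sin^2(alpha * v) / (beta + eps), per channel.

    x is nested lists shaped [batch][channel][time];
    alpha and beta are flat per-channel lists of length `channel`.
    """
    return [
        [
            [v + math.sin(a * v) ** 2 / (b + eps) for v in channel]
            for channel, a, b in zip(sample, alpha, beta)
        ]
        for sample in x
    ]


def flatten_3d_param(param):
    """Convert a (1, C, 1) nested list into a flat length-C list."""
    return [row[0] for row in param[0]]


x = [[[0.0, 0.5, 1.0], [-1.0, 2.0, 3.0]]]  # batch=1, channels=2, time=3

# 1D layout: one value per channel.
alpha_1d, beta_1d = [1.0, 2.0], [1.0, 0.5]

# 3D layout: the same values stored with shape (1, C, 1).
alpha_3d, beta_3d = [[[1.0], [2.0]]], [[[1.0], [0.5]]]

out_1d = snake_forward(x, alpha_1d, beta_1d)
out_3d = snake_forward(x, flatten_3d_param(alpha_3d), flatten_3d_param(beta_3d))
```

Both layouts yield identical outputs once the parameters are brought to a common shape, which is why two classes can keep their native checkpoint layouts while sharing one forward function.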

dg845

Thanks for the PR! Left an initial design review :).

Initial version of sound encoder

0ffee41

github-actions bot added the models, tests, and size/L (PR with diff > 200 LOC) labels on Jun 10, 2026

yiyixuxu reviewed Jun 10, 2026

yiyixuxu requested a review from dg845 June 10, 2026 20:37