Add Sound Encoder to Cosmos3 #13911

Draft
MaciejBalaNV wants to merge 2 commits into huggingface:main from MaciejBalaNV:cosmos3_sound_encoder

@MaciejBalaNV

Contributor

What does this PR do?

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Signed-off-by: Maciej Bala <mbala@nvidia.com>
github-actions (bot) added the models, tests, and size/L (PR with diff > 200 LOC) labels on Jun 10, 2026
Comment on lines +617 to +626
    def _disable_encoder(self):
        self.encoder = None
        self._encoder_available = False
        self.register_to_config(encoder_enabled=False)

    def _fix_state_dict_keys_on_load(self, state_dict: OrderedDict) -> None:
        super()._fix_state_dict_keys_on_load(state_dict)
        if self.encoder is not None and not any(key.startswith("encoder.") for key in state_dict):
            self._disable_encoder()

@yiyixuxu yiyixuxu Jun 10, 2026

Collaborator

Why do we need these two methods?

Contributor Author

It's an extra safety net for checkpoints that do not have the encoder weights. We will update the main checkpoint to include the encoder weights, but I think it's still worth keeping these methods for cases such as locally cached checkpoints. Those should not break for people who don't need the encoder.
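
The fallback described above can be sketched without any framework code. This is a minimal illustration, not the PR's implementation: `TinyAutoencoder`, `load_keys`, `has_encoder_weights`, and the checkpoint key names are hypothetical stand-ins, and only the `"encoder."` prefix check mirrors the snippet under review.

```python
from collections import OrderedDict


def has_encoder_weights(state_dict, prefix="encoder."):
    """Return True if any checkpoint key belongs to the encoder submodule."""
    return any(key.startswith(prefix) for key in state_dict)


class TinyAutoencoder:
    """Hypothetical stand-in for the model; it only tracks encoder availability."""

    def __init__(self):
        self.encoder = object()  # placeholder for a real encoder module
        self.encoder_enabled = True

    def load_keys(self, state_dict):
        # A decoder-only checkpoint disables the encoder instead of
        # failing later on missing "encoder.*" keys.
        if self.encoder is not None and not has_encoder_weights(state_dict):
            self.encoder = None
            self.encoder_enabled = False


# Hypothetical key names: an older decoder-only checkpoint and a newer full one.
decoder_only = OrderedDict([("decoder.conv_in.weight", 0.0)])
full = OrderedDict([("encoder.conv_in.weight", 0.0), ("decoder.conv_in.weight", 0.0)])

old_model = TinyAutoencoder()
old_model.load_keys(decoder_only)  # encoder gets disabled

new_model = TinyAutoencoder()
new_model.load_keys(full)  # encoder stays available
```

With this shape of check, a cached decoder-only checkpoint keeps loading for decode-only use, while a checkpoint that ships encoder weights leaves the encoder in place.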

yiyixuxu requested a review from dg845 on June 10, 2026, 20:37
        return hidden_states


class Cosmos3AudioSnakeBeta(nn.Module):

Collaborator

It looks like the existing Snake1d module implements essentially the same logic as Cosmos3AudioSnakeBeta. Could we use it for the encoder as well?

Contributor Author

The math should be the same, but we'd need a reshape on load, since Cosmos3AudioSnakeBeta has 1D parameters instead of 3D. Let me think about it for a bit.
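
For reference, a reshape on load could look roughly like the sketch below. It assumes the per-channel parameters are stored under keys ending in `alpha` and `beta` and that the 3D layout is `(1, channels, 1)`. The helper name `reshape_snake_params` and the example keys are hypothetical, not taken from the PR.

```python
import torch


def reshape_snake_params(state_dict, suffixes=("alpha", "beta")):
    """Return a copy in which 1D per-channel snake parameters become (1, C, 1)."""
    converted = {}
    for key, value in state_dict.items():
        # Only touch 1D tensors whose last key component names a snake parameter.
        if key.split(".")[-1] in suffixes and value.ndim == 1:
            value = value.reshape(1, -1, 1)
        converted[key] = value
    return converted


# Hypothetical checkpoint with 1D snake parameters and an unrelated conv weight.
checkpoint = {
    "act.alpha": torch.zeros(4),
    "act.beta": torch.zeros(4),
    "conv.weight": torch.zeros(4, 4, 3),
}
converted = reshape_snake_params(checkpoint)
```

The original dictionary is left untouched, so the conversion can run as a pure preprocessing step before the weights are handed to the module.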

Contributor Author

I kept the separate classes so that native checkpoints still load, but they now share a forward implementation.
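
One way to read "separate classes, shared forward" is sketched below. The class names `SnakeFlat` and `SnakeBroadcast` and the function `snake_beta_forward` are hypothetical, and the formula is the plain snake activation with a separate `beta`, `x + sin(alpha * x)^2 / beta`. Whether the real modules store their parameters in log scale is not stated in this thread, so that detail is left out.

```python
import torch
from torch import nn


def snake_beta_forward(x, alpha, beta, eps=1e-9):
    # x has shape (batch, channels, time); alpha and beta must broadcast against it.
    return x + torch.sin(alpha * x).pow(2) / (beta + eps)


class SnakeFlat(nn.Module):
    """Hypothetical variant whose parameters have shape (channels,)."""

    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.ones(channels))

    def forward(self, x):
        # View the 1D parameters as (1, C, 1) only at call time,
        # so the stored shapes still match a 1D checkpoint.
        return snake_beta_forward(x, self.alpha.view(1, -1, 1), self.beta.view(1, -1, 1))


class SnakeBroadcast(nn.Module):
    """Hypothetical variant whose parameters have shape (1, channels, 1)."""

    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return snake_beta_forward(x, self.alpha, self.beta)


x = torch.randn(2, 4, 8)
flat, broadcast = SnakeFlat(4), SnakeBroadcast(4)
same = torch.allclose(flat(x), broadcast(x))
```

Each class keeps the parameter shapes its checkpoints expect, so neither needs a conversion step on load, while the math lives in one place.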

7 outdated comment threads on src/diffusers/models/autoencoders/autoencoder_cosmos3_audio.py

dg845 left a comment

Collaborator

Thanks for the PR! Left an initial design review :).

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Labels

models, size/L (PR with diff > 200 LOC), tests

3 participants