Add audio modality support for CLAP evaluation #148
…enchmark into audio_benchmarks
Remove the hard dependency on the old laion_clap package in clap.py
and rewrite it to load old LAION-CLAP pretrained checkpoints directly
into the new open_clip CLAP architecture via state_dict key remapping.
This makes the eval pipeline fully self-contained with only open_clip
as the model backend.
clap.py changes (old LAION-CLAP checkpoint loader):
- Remove `import laion_clap` and `from transformers import RobertaTokenizer`
- Add state_dict key remapping: audio_branch->audio.encoder,
audio_projection->audio.proj, text_branch->text.transformer,
text_projection->text.proj, logit_scale_a->logit_scale
- Auto-detect fusion from checkpoint keys (fusion_model, mel_conv2d)
- Fix text projection shape mismatch (old: 768->512 Linear+ReLU+Linear
with bias; new: 768->640 Linear+GELU+Linear without bias)
- Add FusionAudioLoader for fusion checkpoints (4-channel mel_fusion:
global resized + 3 deterministic local chunks)
- Use open_clip.get_tokenizer() instead of RobertaTokenizer directly
- Support both HuggingFace checkpoint names (630k-best,
630k-fusion-best, etc.) and local file paths
- Tested on both non-fusion (630k-audioset-best) and fusion
(630k-audioset-fusion-best) old checkpoints
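
The key remapping and fusion auto-detection listed above can be sketched roughly as follows. The prefix pairs and the marker substrings come from this commit message; the helper names (`remap_key`, `remap_state_dict`, `looks_like_fusion`), the trailing dots on the prefixes, and the use of a plain dict as a stand-in for a checkpoint are assumptions for illustration, not the actual code in clap.py.

```python
# Old LAION-CLAP prefix -> new open_clip prefix (pairs from the commit message).
# The trailing dots are an assumption about how the keys are structured.
KEY_PREFIX_MAP = [
    ("audio_branch.", "audio.encoder."),
    ("audio_projection.", "audio.proj."),
    ("text_branch.", "text.transformer."),
    ("text_projection.", "text.proj."),
]

# Substrings whose presence in any key suggests a fusion checkpoint.
FUSION_MARKERS = ("fusion_model", "mel_conv2d")


def remap_key(key):
    """Translate one old-style key into the new naming scheme (hypothetical helper)."""
    if key == "logit_scale_a":
        return "logit_scale"
    for old_prefix, new_prefix in KEY_PREFIX_MAP:
        if key.startswith(old_prefix):
            return new_prefix + key[len(old_prefix):]
    return key  # keys without a known prefix pass through unchanged


def remap_state_dict(state_dict):
    """Return a new dict with every key remapped; values are left untouched."""
    return {remap_key(key): value for key, value in state_dict.items()}


def looks_like_fusion(state_dict):
    """Guess whether a checkpoint was trained with fusion, based on its keys."""
    return any(marker in key for key in state_dict for marker in FUSION_MARKERS)
```

Running the fusion check on the original (pre-remap) keys keeps the detection independent of the renaming step.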
clap_v2.py (new open_clip CLAP checkpoint loader):
- New file for loading our own open_clip CLAP training checkpoints
- Direct state_dict loading (no key remapping needed)
- Auto-detects fusion from checkpoint keys and creates model with
enable_fusion=True when detected
- Supports both non-fusion (waveform input) and fusion (mel_fusion)
- Verified with real training checkpoints (strict=True load)
- Verified with synthetic fusion round-trip test
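
For context on the strict load mentioned above: PyTorch's `load_state_dict(strict=True)` raises an error when the checkpoint has missing or unexpected keys. The sketch below mimics only that key comparison in plain Python; `strict_key_diff` and `assert_strict_compatible` are hypothetical helpers, and the real check in PyTorch also validates tensor shapes.

```python
def strict_key_diff(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, the two groups a strict load rejects."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # model expects, checkpoint lacks
    unexpected = sorted(checkpoint_keys - model_keys)   # checkpoint has, model does not
    return missing, unexpected


def assert_strict_compatible(model_keys, checkpoint_keys):
    """Raise KeyError if the two key sets differ in either direction."""
    missing, unexpected = strict_key_diff(model_keys, checkpoint_keys)
    if missing or unexpected:
        raise KeyError(f"missing={missing}, unexpected={unexpected}")
```

A pre-check like this can help debug a failing load, since it reports both sides of the mismatch at once.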
__init__.py:
- Register clap and clap_v2 model types with try/except ImportError
guards (graceful fallback if open_clip not installed)
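
A guarded registration of this kind might look like the sketch below. The registry dict, the `register_optional` helper, and the module/attribute names passed to it are hypothetical; the actual `__init__.py` may be structured differently.

```python
import importlib


def register_optional(registry, name, module_path, attr):
    """Register `module_path.attr` under `name`; skip quietly if the import fails.

    Returns True when the entry was registered, False when the module
    (or one of its own dependencies) could not be imported.
    """
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return False
    registry[name] = getattr(module, attr)
    return True


# Usage with placeholder names: a stdlib module succeeds, a missing one is skipped.
MODEL_LOADERS = {}
register_optional(MODEL_LOADERS, "json_dumps", "json", "dumps")
register_optional(MODEL_LOADERS, "clap", "hypothetical_missing_backend", "load_clap")
```

With this pattern, image-only users without the audio backend installed still get a working registry.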
Eval pipeline fixes (classification + retrieval):
- Disable torch.autocast — float16 precision on GH200 destroys cosine
similarity discriminability for CLAP models (acc drops to random
chance). The --no_amp CLI flag was already added but the metrics code
still used autocast internally.
- Cast features to float32 before F.normalize and cosine similarity
- Handle non-tensor targets in classification (torch.tensor conversion)
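
As a toy illustration of why half precision can hurt here (not a reproduction of the GH200 issue, and not the PR's torch code): float16 has a 10-bit mantissa, so values just below 1.0 are spaced about 0.0005 apart, and nearby cosine similarities can round to the same number. The sketch simulates float16 rounding with the standard library; `to_float16` is a hypothetical helper.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Two distinct similarity scores, as a float32/float64 pipeline would keep them.
score_a = 0.9991
score_b = 0.9992

# After rounding to float16 the two scores become indistinguishable,
# so a ranking or argmax over them can no longer separate the classes.
collapsed = to_float16(score_a) == to_float16(score_b)
```

Keeping the normalization and similarity computation in float32 avoids this kind of collapse.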
builder.py:
- Support both .txt (newline-separated) and .json ({"text": [...]})
caption formats in retrieval WebDatasets (audiocaps uses .json)
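
Handling the two caption formats could look roughly like this. The function name `parse_captions` and its signature are hypothetical; only the two formats (newline-separated `.txt` and `.json` with a `"text"` list) come from the commit message. Dropping blank lines is an extra assumption of this sketch.

```python
import json


def parse_captions(payload, extension):
    """Decode a caption payload from a retrieval sample into a list of strings."""
    text = payload.decode("utf-8") if isinstance(payload, bytes) else payload
    if extension == "json":
        captions = json.loads(text)["text"]
        # Tolerate a single caption stored as a bare string (assumption).
        return [captions] if isinstance(captions, str) else list(captions)
    if extension == "txt":
        # One caption per line; blank lines are ignored (assumption).
        return [line for line in text.split("\n") if line.strip()]
    raise ValueError(f"unsupported caption format: {extension!r}")
```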
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mehdidc please have a look
Use getattr() with safe defaults for audio-specific CLI args (modality, dump_classnames, dump_templates) so that callers without these attributes — including the existing unit tests — continue to work unchanged. Fixes test_clip_benchmark.py::test_base AttributeError. With assistance by Claude Code Opus 4.6
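
The `getattr()` pattern described in this commit can be sketched as follows. The attribute names come from the commit message; the default values (`"auto"`, `False`) and the `read_audio_args` helper are assumptions for illustration.

```python
from types import SimpleNamespace


def read_audio_args(args):
    """Read audio-specific options, falling back to defaults when they are absent."""
    return {
        "modality": getattr(args, "modality", "auto"),                # assumed default
        "dump_classnames": getattr(args, "dump_classnames", False),   # assumed default
        "dump_templates": getattr(args, "dump_templates", False),     # assumed default
    }


# A caller that predates the audio options (e.g. an older unit test) still works.
legacy_args = SimpleNamespace(dataset="cifar10")
legacy_options = read_audio_args(legacy_args)

# A caller that sets the new options gets them passed through.
audio_args = SimpleNamespace(modality="audio", dump_classnames=True)
audio_options = read_audio_args(audio_args)
```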
Updated accuracy metrics and gain columns for various audio datasets.
Thanks a lot @Spatenfe @JeniaJitsev for the PR. Looking good to me. I have one remark, see above. For audio-specific dependencies, after merging this, will add a
@Spatenfe @JeniaJitsev after the merge of CLAP into open_clip master, do we still need a new model_type in CLIP Benchmark to handle CLAP?
I did not test this because I don't have anything CLIP-related set up. We should definitely test this before merging.
That depends on how CLAP was integrated into open_clip, specifically on how the `open_clip.create_model_and_transforms` function handles CLAP. If done right, I think we can remove /models/clap.py and edit clip_benchmark/models/open_clip.py slightly.
@mehdidc I just noticed that a clap_v2 file is part of the pull request. We should remove it for now and add it once v2 is done.
OK, will test it.
Summary
This PR integrates audio modality support into CLIP Benchmark, enabling standardized evaluation of CLAP (Contrastive Language-Audio Pretraining) models alongside existing image-language CLIP models. It supports both the original LAION-CLAP (v1) pretrained checkpoints and models trained with the most recent open_clip CLAP implementation (v2).
Key features
- `clap`: loads old LAION-CLAP v1 pretrained checkpoints (e.g. `630k-audioset-best`) into the new open_clip architecture via state_dict key remapping. No dependency on the old `laion_clap` package; only `open_clip` is needed
- `clap_v2`: loads checkpoints from recent open_clip CLAP training directly (no key remapping)
- Audio decoding via `librosa`, with proper padding/truncation and mel spectrogram computation for fusion models
- `--modality` CLI flag: explicit `image`/`audio` selection with auto-detection from the loaded model type
Evaluation results
Zero-shot classification
Results using CLAP (HTSAT-tiny) with LAION-CLAP v1 pretrained checkpoints:
Zero-shot retrieval
Linear probe
Changes by file
- `clip_benchmark/models/clap.py`: remaps `laion_clap` state_dict keys to open_clip format, auto-detects fusion, fixes text projection shape mismatch. Downloads checkpoints from HuggingFace.
- `clip_benchmark/models/clap_v2.py`
- `clip_benchmark/models/__init__.py`: registers `clap` and `clap_v2` model types with graceful ImportError fallback
- `clip_benchmark/cli.py`: `--modality` flag (image/audio/auto), passes `audio_loader` and `modality` through the evaluation pipeline
- `clip_benchmark/datasets/builder.py`: `audio_loader` integration, mixed `.txt`/`.json` caption format handling for retrieval
- `clip_benchmark/metrics/zeroshot_classification.py`: `.float()` cast before normalization for numerical stability
- `clip_benchmark/metrics/zeroshot_retrieval.py`: `.float()` cast before normalization
- `clip_benchmark/metrics/linear_probe.py`
- `clip_benchmark/metrics/utils.py`
- `AUDIO_README.md`
Usage examples
Dependencies
- `open_clip` (for model architecture and tokenizer)
- `librosa` (audio decoding)
- `torchaudio` (mel spectrogram computation for fusion models)
- `huggingface_hub` (optional, for downloading old LAION-CLAP checkpoints)
No dependency on the old `laion_clap` package.
Test plan
- `clap_v2` loader verified with own training checkpoints (strict load)
With assistance by Claude Code Opus 4.6