[ASR] fix streaming multitalker asr timestamp computation#15701

Open
thanhtvt wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
thanhtvt:main
@thanhtvt thanhtvt commented May 14, 2026

What does this PR do ?

Fixes timestamp computation in streaming multitalker ASR for the Parakeet model. The _compute_hypothesis_timestamps function had three compounding bugs that produced incorrect segment boundaries: utterances were merged across long pauses, and hypothesis durations were inflated.

Collection: ASR

Changelog

  • Added _prev_token_counts (in ASRState) to track per-speaker progress across streaming chunks, initialized/reset in __init__, _reset_speaker_wise_sentences, and reset.
  • Added _prev_decoded_lengths (in ASRState) to store the decoder's accumulated frame count per speaker for recovering from silent gaps.
  • Fixed _compute_hypothesis_timestamps to compute start_time from timestamp[prev_token_count] (the first new token in the current chunk) instead of timestamp[0] (the first token ever emitted).
  • Fixed _compute_hypothesis_timestamps to undo the decoder's decoded_lengths shift using decoded_length_before before applying offset, fixing double-counting.
  • Updated update_sessionwise_seglsts_for_parallel to pass prev_token_count and decoded_length_before to _compute_hypothesis_timestamps and update _prev_decoded_lengths after each chunk.
  • Updated the docstring of _compute_hypothesis_timestamps.

Usage

Follow the official guide on how to run Multitalker Parakeet Streaming 0.6B:

python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
          asr_model="/path/to/your/multitalker-parakeet-streaming-0.6b-v1.nemo" \
          diar_model="/path/to/your/nvidia/diar_streaming_sortformer_4spk-v2.nemo" \
          att_context_size="[70,13]" \
          generate_realtime_scripts=False \
          audio_file="/path/to/example.wav" \
          output_path="/path/to/example_output.json"

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests? → No need to write new tests
  • Did you add or update any necessary documentation? → I updated the docstring of the modified method.
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc) → No
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

I gently tag @nithinraok for this PR, per Contributor guidelines

Additional Information

Root cause: At each streaming chunk, the decoder shifts timestamp indices by prev_batched_state.decoded_lengths, so the indices it reports are global frame indices rather than chunk-local ones. The original code was unaware of this shift, which compounded three issues:

  1. Wrong token index: Used timestamp[0] (the first token emitted since audio began) instead of the first new token from the current chunk, identified by prev_token_count.
  2. Offset double-counting: Added offset (chunk start time) on top of already-shifted global timestamps, causing all timestamps to drift forward with each chunk.
  3. Silent-gap underestimation: The decoder's decoded_lengths accumulates only while a speaker is active in the batch. When a speaker falls silent for multiple chunks, their decoded_lengths freezes. When the speaker resumed, the timestamps did not account for the elapsed silence, so start_time ≈ last_active_time + small_delta. That value always fell within sent_break_sec of the previous segment, so all of the speaker's utterances were merged into one segment.
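
The drift described in issue 2 can be illustrated with a small, self-contained sketch. It is not NeMo code: FRAME_LEN_SEC, FRAMES_PER_CHUNK, and all function names are hypothetical, and the 0.08 s frame length is an assumption chosen only because the timestamps reported in this PR are multiples of 0.08. The sketch assumes a speaker who is active in every chunk, so the decoder's accumulated length grows by one chunk of frames each step.

```python
from typing import List

FRAME_LEN_SEC = 0.08   # assumed frame duration in seconds (not taken from NeMo)
FRAMES_PER_CHUNK = 14  # hypothetical number of frames per streaming chunk


def buggy_time(offset_sec: float, global_frame: int) -> float:
    """Original behavior: add the chunk offset to an already-global frame index."""
    return offset_sec + global_frame * FRAME_LEN_SEC


def local_time(offset_sec: float, global_frame: int, decoded_length_before: int) -> float:
    """Undo the decoder shift first, then add the chunk offset exactly once."""
    return offset_sec + (global_frame - decoded_length_before) * FRAME_LEN_SEC


def drift_per_chunk(num_chunks: int, local_frame: int = 3) -> List[float]:
    """Return buggy minus corrected time for one token per chunk."""
    drifts = []
    decoded_length = 0  # frames the decoder has accumulated so far
    for chunk_idx in range(num_chunks):
        offset_sec = chunk_idx * FRAMES_PER_CHUNK * FRAME_LEN_SEC
        global_frame = decoded_length + local_frame  # index as the decoder reports it
        drifts.append(
            buggy_time(offset_sec, global_frame)
            - local_time(offset_sec, global_frame, decoded_length)
        )
        decoded_length += FRAMES_PER_CHUNK  # speaker stays active in every chunk
    return drifts


if __name__ == "__main__":
    for chunk_idx, drift in enumerate(drift_per_chunk(4)):
        print(f"chunk {chunk_idx}: drift = {drift:.2f} s")
```

Under these assumptions the error is zero in the first chunk and grows by one chunk duration with every later chunk, which matches the forward drift described above.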

Fix: Track _prev_decoded_lengths[spk_idx], subtract it from the decoder's global timestamps to recover chunk-local frame indices, and only then add offset:

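# Frames the decoder had accumulated for this speaker before the current chunk
decoded_length_before = _prev_decoded_lengths[spk_idx]
# Undo the decoder shift to get chunk-local frame indices
start_local = timestamp[prev_token_count] - decoded_length_before
end_local = timestamp[-1] - decoded_length_before
# Add the chunk offset exactly once
start_time = offset + start_local * frame_len_sec
end_time = offset + (end_local + 1) * frame_len_sec
# Remember the accumulated length for the next chunk
_prev_decoded_lengths[spk_idx] = hypothesis.dec_state.decoded_length.item()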
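
For readers who want to trace the arithmetic, here is a minimal runnable sketch of the same computation. It is an illustration under stated assumptions, not the NeMo implementation: SpeakerProgress and chunk_times are hypothetical names, plain dicts stand in for the ASRState fields, the decoder's accumulated length is passed in as an integer, the default frame_len_sec of 0.08 is assumed, and the point at which the token count is updated is a guess.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SpeakerProgress:
    """Hypothetical stand-in for the per-speaker fields the PR adds to ASRState."""
    prev_token_counts: Dict[int, int] = field(default_factory=dict)
    prev_decoded_lengths: Dict[int, int] = field(default_factory=dict)


def chunk_times(
    state: SpeakerProgress,
    spk_idx: int,
    timestamps: List[int],      # cumulative global frame index of every token so far
    decoded_length_after: int,  # decoder's accumulated frame count after this chunk
    offset_sec: float,          # start time of the current chunk
    frame_len_sec: float = 0.08,  # assumed frame duration
) -> Optional[Tuple[float, float]]:
    """Return (start_time, end_time) for the tokens added in this chunk, if any."""
    prev_token_count = state.prev_token_counts.get(spk_idx, 0)
    decoded_length_before = state.prev_decoded_lengths.get(spk_idx, 0)

    times = None
    if prev_token_count < len(timestamps):  # at least one new token arrived
        # Undo the decoder shift to recover chunk-local frame indices.
        start_local = timestamps[prev_token_count] - decoded_length_before
        end_local = timestamps[-1] - decoded_length_before
        # Add the chunk offset exactly once.
        times = (
            offset_sec + start_local * frame_len_sec,
            offset_sec + (end_local + 1) * frame_len_sec,
        )

    # Carry progress forward to the next chunk (assumed update point).
    state.prev_token_counts[spk_idx] = len(timestamps)
    state.prev_decoded_lengths[spk_idx] = decoded_length_after
    return times


if __name__ == "__main__":
    state = SpeakerProgress()
    # Chunk starting at 0.0 s: three tokens, decoder accumulates 14 frames.
    print(chunk_times(state, 0, [2, 5, 9], 14, offset_sec=0.0))
    # Speaker resumes in a chunk starting at 11.2 s; decoded length stayed at 14.
    print(chunk_times(state, 0, [2, 5, 9, 17, 20], 28, offset_sec=11.2))
```

In the second call the first new token sits at global frame 17. Read as an absolute position, that is 1.36 s, well before the resumed chunk begins; after subtracting the stored length of 14 and adding the 11.2 s offset, the segment starts at about 11.44 s, so the silent gap is preserved.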

Behavior

For reproducibility, I used the NVIDIA multi-talker ASR video demo on HuggingFace, extracted the .wav audio, and ran the processing script:

Before Fix (Incorrect Durations)

[
    {
        "speaker": "speaker_0",
        "start_time": 1.04,
        "end_time": 38.96,
        "words": "The NVIDIA multitalker ASR system separates and transcribes multiple voices automatically. No enrollment or voice registration is needed. It simply listens, figures out who speaking when, who generates an individual transcript for each person in real time",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_1",
        "start_time": 16.8,
        "end_time": 39.84,
        "words": "It is built to handle overlapping speech naturally. When people fight over each other, the model runs one strain per voice, so each speaker's words stay clear, accurate, and well organized.",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_2",
        "start_time": 29.36,
        "end_time": 65.84,
        "words": "The system also works live. It processes audio as it's captured, delivering captions almost instantly. You can even tune the settings to balance latency and accuracy depending on your application's needs",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_3",
        "start_time": 44.16,
        "end_time": 76.08,
        "words": "And it all builds on the state of the art single speaker ASR Foundation from NVIDIA. We start from a model that already captures human speech with high precision, then extend it to understand many voices at once without sacrificing clarity or performance",
        "session_id": "nvidia-multitalker-asr-demo"
    }
]

After Fix (Corrected Durations)

[
    {
        "speaker": "speaker_0",
        "start_time": 1.04,
        "end_time": 19.92,
        "words": "The NVIDIA multitalker ASR system separates and transcribes multiple voices automatically. No enrollment or voice registration is needed. It simply listens, figures out who speaking when, who generates an individual transcript for each person in real time",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_1",
        "start_time": 16.8,
        "end_time": 27.52,
        "words": "It is built to handle overlapping speech naturally. When people fight over each other, the model runs one strain per voice, so each speaker's words stay clear, accurate, and well organized.",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_2",
        "start_time": 29.2,
        "end_time": 45.68,
        "words": "The system also works live. It processes audio as it's captured, delivering captions almost instantly. You can even tune the settings to balance latency and accuracy depending on your application's needs",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_3",
        "start_time": 44.16,
        "end_time": 59.28,
        "words": "And it all builds on the state of the art single speaker ASR Foundation from NVIDIA. We start from a model that already captures human speech with high precision, then extend it to understand many voices at once without sacrificing clarity or performance",
        "session_id": "nvidia-multitalker-asr-demo"
    }
]
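
To compare the two outputs at a glance, the segment durations can be computed directly from the start_time and end_time values listed above. The snippet below is only a convenience sketch: the dictionary layout and the helper name durations are assumptions made for illustration, and the numbers are copied from the two JSON listings.

```python
from typing import Dict, Tuple

# (start_time, end_time) per speaker, copied from the listings above.
BEFORE: Dict[str, Tuple[float, float]] = {
    "speaker_0": (1.04, 38.96),
    "speaker_1": (16.8, 39.84),
    "speaker_2": (29.36, 65.84),
    "speaker_3": (44.16, 76.08),
}
AFTER: Dict[str, Tuple[float, float]] = {
    "speaker_0": (1.04, 19.92),
    "speaker_1": (16.8, 27.52),
    "speaker_2": (29.2, 45.68),
    "speaker_3": (44.16, 59.28),
}


def durations(segments: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return end_time - start_time per speaker, rounded to centiseconds."""
    return {spk: round(end - start, 2) for spk, (start, end) in segments.items()}


if __name__ == "__main__":
    before, after = durations(BEFORE), durations(AFTER)
    for spk in BEFORE:
        print(f"{spk}: before {before[spk]:.2f} s, after {after[spk]:.2f} s")
```

For this recording, each segment before the fix is roughly twice as long as the corresponding segment after it.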

Signed-off-by: thanhtvt <trantrongthanhhp@gmail.com>

copy-pr-bot Bot commented May 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 16, 2026

Labels

ASR community-request waiting-on-maintainers Waiting on maintainers to respond

2 participants