[ASR] fix streaming multitalker asr timestamp computation by thanhtvt · Pull Request #15701 · NVIDIA-NeMo/NeMo

thanhtvt · 2026-05-14T18:14:38Z

What does this PR do ?

Fixes timestamp computation in streaming multitalker ASR for the Parakeet model. The _compute_hypothesis_timestamps function had three compounding bugs that produced incorrect segment boundaries: utterances were merged across long pauses, and hypothesis durations were inflated.

Collection: ASR

Changelog

Added _prev_token_counts (in ASRState) to track per-speaker progress across streaming chunks, initialized/reset in __init__, _reset_speaker_wise_sentences, and reset.
Added _prev_decoded_lengths (in ASRState) to store the decoder's accumulated frame count per speaker for recovering from silent gaps.
Fixed _compute_hypothesis_timestamps to use prev_token_count (first new token) instead of timestamp[0] (first token ever) for start_time.
Fixed _compute_hypothesis_timestamps to undo the decoder's decoded_lengths shift using decoded_length_before before applying offset, fixing double-counting.
Updated update_sessionwise_seglsts_for_parallel to pass prev_token_count and decoded_length_before to _compute_hypothesis_timestamps and update _prev_decoded_lengths after each chunk.
Updated the docstring of _compute_hypothesis_timestamps.
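The per-speaker bookkeeping described in the changelog can be pictured with a small sketch. This is not the NeMo ASRState class: `SpeakerProgress`, `snapshot`, and `commit` are hypothetical names used only for illustration, and the real fields may be stored differently (for example as lists or tensors).

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SpeakerProgress:
    """Hypothetical stand-in for the two per-speaker fields the PR adds to ASRState."""

    # Number of tokens already consumed for each speaker in earlier chunks.
    prev_token_counts: Dict[int, int] = field(default_factory=dict)
    # Decoder's accumulated frame count per speaker before the current chunk.
    prev_decoded_lengths: Dict[int, int] = field(default_factory=dict)

    def snapshot(self, spk_idx: int) -> Tuple[int, int]:
        """Return (prev_token_count, decoded_length_before); 0 for an unseen speaker."""
        return (
            self.prev_token_counts.get(spk_idx, 0),
            self.prev_decoded_lengths.get(spk_idx, 0),
        )

    def commit(self, spk_idx: int, token_count: int, decoded_length: int) -> None:
        """Record the speaker's progress after a chunk has been processed."""
        self.prev_token_counts[spk_idx] = token_count
        self.prev_decoded_lengths[spk_idx] = decoded_length

    def reset(self) -> None:
        """Forget all per-speaker progress, e.g. at the start of a new session."""
        self.prev_token_counts.clear()
        self.prev_decoded_lengths.clear()


progress = SpeakerProgress()
progress.commit(0, token_count=5, decoded_length=14)
first = progress.snapshot(0)   # values stored for speaker 0
progress.reset()
after_reset = progress.snapshot(0)  # back to the defaults
```

The point of the sketch is only that both values must be read before a chunk is processed, written after it, and cleared wherever the rest of the streaming state is reset.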

Usage

Follow the official guide on how to run Multitalker Parakeet Streaming 0.6B:

python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
          asr_model="/path/to/your/multitalker-parakeet-streaming-0.6b-v1.nemo" \
          diar_model="/path/to/your/nvidia/diar_streaming_sortformer_4spk-v2.nemo" \
          att_context_size="[70,13]" \
          generate_realtime_scripts=False \
          audio_file="/path/to/example.wav" \
          output_path="/path/to/example_output.json"

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests? → No need to write new tests
Did you add or update any necessary documentation? → I updated the docstring of the modified method.
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc) → No
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

I gently tag @nithinraok for this PR, per Contributor guidelines

Additional Information

Root cause: The decoder shifts timestamp indices by prev_batched_state.decoded_lengths at each streaming chunk (global frame indices). The original code was unaware of this shift and compounded three issues:

Wrong token index: Used timestamp[0] (the first token emitted since audio began) instead of the first new token from the current chunk, identified by prev_token_count.
Offset double-counting: Added offset (chunk start time) on top of already-shifted global timestamps, causing all timestamps to drift forward with each chunk.
Silent-gap underestimation: The decoder's decoded_lengths accumulates only while a speaker is active in the batch. When a speaker falls silent for multiple chunks, their decoded_lengths freezes. When such a speaker resumed, the resulting timestamps did not account for the elapsed silence, so start_time ≈ last_active_time + small_delta. That value always fell within sent_break_sec of the previous segment, which forced all of the speaker's utterances into one merged segment.
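The offset double-counting issue above can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the original NeMo code: the frame duration (0.08 s), the chunk size (14 frames), and the helper names are all hypothetical, and the speaker is assumed to be active in every chunk so that the global frame index alone already gives the correct time.

```python
FRAME_LEN_SEC = 0.08  # assumed frame duration, for illustration only
CHUNK_FRAMES = 14     # assumed number of frames per streaming chunk


def buggy_time(offset_sec: float, global_frame: int) -> float:
    """Adds the chunk offset on top of an index that is already global."""
    return offset_sec + global_frame * FRAME_LEN_SEC


def reference_time(global_frame: int) -> float:
    """Correct time while the speaker has been active in every chunk."""
    return global_frame * FRAME_LEN_SEC


def drift_per_chunk(n_chunks: int, local_frame: int = 3) -> list:
    """Drift of the buggy formula for one token at the same local frame in each chunk."""
    drifts = []
    for chunk_idx in range(n_chunks):
        offset_sec = chunk_idx * CHUNK_FRAMES * FRAME_LEN_SEC
        global_frame = chunk_idx * CHUNK_FRAMES + local_frame
        drifts.append(buggy_time(offset_sec, global_frame) - reference_time(global_frame))
    return drifts


for chunk_idx, drift in enumerate(drift_per_chunk(4)):
    print(f"chunk {chunk_idx}: drift = {drift:.2f} s")
```

Under these assumptions the drift equals the chunk offset itself, so it grows by one chunk duration per chunk.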

Fix: Track _prev_decoded_lengths[spk_idx] to undo the decoder shift, recovering local frame indices.

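# Frames the decoder had accumulated for this speaker before the current chunk
decoded_length_before = _prev_decoded_lengths[spk_idx]
# Undo the decoder's shift to recover frame indices local to the current chunk
start_local = timestamp[prev_token_count] - decoded_length_before
end_local = timestamp[-1] - decoded_length_before
# Apply the chunk offset exactly once
start_time = offset + start_local * frame_len_sec
end_time = offset + (end_local + 1) * frame_len_sec
# Store the updated length for the next chunk
_prev_decoded_lengths[spk_idx] = hypothesis.dec_state.decoded_length.item()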
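As a worked example, the snippet above can be wrapped in a standalone function and applied to a speaker who resumes after a silent gap. This is a minimal sketch, not the signature of _compute_hypothesis_timestamps: the function name, the `None` return for a chunk with no new tokens, and all numeric values (0.08 s frames, 14-frame chunks) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def compute_segment_times(
    timestamp: List[int],
    prev_token_count: int,
    decoded_length_before: int,
    offset: float,
    frame_len_sec: float,
) -> Optional[Tuple[float, float]]:
    """Return (start_time, end_time) for the tokens emitted in the current chunk."""
    if prev_token_count >= len(timestamp):
        return None  # assumed behaviour: no new tokens, so no new segment
    # Undo the decoder's shift to get frame indices local to the current chunk.
    start_local = timestamp[prev_token_count] - decoded_length_before
    end_local = timestamp[-1] - decoded_length_before
    # Apply the chunk offset exactly once.
    start_time = offset + start_local * frame_len_sec
    end_time = offset + (end_local + 1) * frame_len_sec
    return start_time, end_time


# Illustrative scenario: the speaker was active in chunk 0 (14 frames, 3 tokens),
# silent during chunks 1-3, and resumes in chunk 4 with 2 new tokens.
frame_len_sec = 0.08
offset = 4 * 14 * frame_len_sec   # start time of chunk 4
timestamp = [2, 5, 9, 15, 20]     # the last two indices are shifted by the frozen length 14
result = compute_segment_times(
    timestamp,
    prev_token_count=3,
    decoded_length_before=14,
    offset=offset,
    frame_len_sec=frame_len_sec,
)
print(tuple(round(t, 2) for t in result))
```

In this scenario the resumed segment starts about 4.56 s into the audio instead of just after the speaker's last active frame, so the silent gap is preserved and a sent_break_sec comparison can see it.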

Behavior

For reproducibility, I used the NVIDIA multi-talker ASR video demo on HuggingFace, extracted the .wav audio, and ran the processing script:

Before Fix (Incorrect Durations)

[
    {
        "speaker": "speaker_0",
        "start_time": 1.04,
        "end_time": 38.96,
        "words": "The NVIDIA multitalker ASR system separates and transcribes multiple voices automatically. No enrollment or voice registration is needed. It simply listens, figures out who speaking when, who generates an individual transcript for each person in real time",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_1",
        "start_time": 16.8,
        "end_time": 39.84,
        "words": "It is built to handle overlapping speech naturally. When people fight over each other, the model runs one strain per voice, so each speaker's words stay clear, accurate, and well organized.",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_2",
        "start_time": 29.36,
        "end_time": 65.84,
        "words": "The system also works live. It processes audio as it's captured, delivering captions almost instantly. You can even tune the settings to balance latency and accuracy depending on your application's needs",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_3",
        "start_time": 44.16,
        "end_time": 76.08,
        "words": "And it all builds on the state of the art single speaker ASR Foundation from NVIDIA. We start from a model that already captures human speech with high precision, then extend it to understand many voices at once without sacrificing clarity or performance",
        "session_id": "nvidia-multitalker-asr-demo"
    }
]

After Fix (Corrected Durations)

[
    {
        "speaker": "speaker_0",
        "start_time": 1.04,
        "end_time": 19.92,
        "words": "The NVIDIA multitalker ASR system separates and transcribes multiple voices automatically. No enrollment or voice registration is needed. It simply listens, figures out who speaking when, who generates an individual transcript for each person in real time",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_1",
        "start_time": 16.8,
        "end_time": 27.52,
        "words": "It is built to handle overlapping speech naturally. When people fight over each other, the model runs one strain per voice, so each speaker's words stay clear, accurate, and well organized.",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_2",
        "start_time": 29.2,
        "end_time": 45.68,
        "words": "The system also works live. It processes audio as it's captured, delivering captions almost instantly. You can even tune the settings to balance latency and accuracy depending on your application's needs",
        "session_id": "nvidia-multitalker-asr-demo"
    },
    {
        "speaker": "speaker_3",
        "start_time": 44.16,
        "end_time": 59.28,
        "words": "And it all builds on the state of the art single speaker ASR Foundation from NVIDIA. We start from a model that already captures human speech with high precision, then extend it to understand many voices at once without sacrificing clarity or performance",
        "session_id": "nvidia-multitalker-asr-demo"
    }
]
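The effect on segment length is easier to see as durations. The short script below only does arithmetic on the start_time and end_time values listed in the two outputs above; the variable and function names are illustrative.

```python
# (start_time, end_time) per speaker, copied from the outputs above.
before = {
    "speaker_0": (1.04, 38.96),
    "speaker_1": (16.8, 39.84),
    "speaker_2": (29.36, 65.84),
    "speaker_3": (44.16, 76.08),
}
after = {
    "speaker_0": (1.04, 19.92),
    "speaker_1": (16.8, 27.52),
    "speaker_2": (29.2, 45.68),
    "speaker_3": (44.16, 59.28),
}


def durations(segments: dict) -> dict:
    """Segment duration in seconds per speaker, rounded to 2 decimals."""
    return {spk: round(end - start, 2) for spk, (start, end) in segments.items()}


dur_before = durations(before)
dur_after = durations(after)
for spk in before:
    print(f"{spk}: {dur_before[spk]:.2f} s -> {dur_after[spk]:.2f} s")
```

For these four segments the durations go from 37.92, 23.04, 36.48, and 31.92 seconds before the fix to 18.88, 10.72, 16.48, and 15.12 seconds after it.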

Signed-off-by: thanhtvt <trantrongthanhhp@gmail.com>

copy-pr-bot · 2026-05-14T18:14:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

fix: streaming multitalker asr timestamp computation

ca84fd9

Signed-off-by: thanhtvt <trantrongthanhhp@gmail.com>

github-actions bot added the ASR and community-request labels on May 14, 2026

Merge branch 'main' into main

61e90ad

svcnvidia-nemo-ci added the waiting-on-maintainers label ("Waiting on maintainers to respond") on May 16, 2026

thanhtvt wants to merge 2 commits into NVIDIA-NeMo:main from thanhtvt:main
