[ASR] fix streaming multitalker asr timestamp computation #15701
Open
thanhtvt wants to merge 2 commits into
Signed-off-by: thanhtvt <trantrongthanhhp@gmail.com>
What does this PR do ?
Fix timestamp computation in streaming multitalker ASR for the Parakeet model. The `_compute_hypothesis_timestamps` function had three compounding bugs that caused incorrect segment boundaries, merging utterances across long pauses and producing inflated hypothesis durations.
Collection: ASR
Changelog
- `_prev_token_counts` (in `ASRState`) to track per-speaker progress across streaming chunks, initialized/reset in `__init__`, `_reset_speaker_wise_sentences`, and `reset`.
- `_prev_decoded_lengths` (in `ASRState`) to store the decoder's accumulated frame count per speaker for recovering from silent gaps.
- `_compute_hypothesis_timestamps` to use `prev_token_count` (first new token) instead of `timestamp[0]` (first token ever) for `start_time`.
- `_compute_hypothesis_timestamps` to undo the decoder's `decoded_lengths` shift using `decoded_length_before` before applying `offset`, fixing double-counting.
- `update_sessionwise_seglsts_for_parallel` to pass `prev_token_count` and `decoded_length_before` to `_compute_hypothesis_timestamps` and update `_prev_decoded_lengths` after each chunk.
- `_compute_hypothesis_timestamps`.
Usage
Follow the official guide on how to run Multitalker Parakeet Streaming 0.6B:
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
I gently tag @nithinraok for this PR, per Contributor guidelines
Additional Information
Root cause: The decoder shifts timestamp indices by `prev_batched_state.decoded_lengths` at each streaming chunk, so the indices it reports are global frame indices. The original code was unaware of this shift and compounded three issues:
1. The start time was taken from `timestamp[0]` (the first token emitted since audio began) instead of the first new token from the current chunk, identified by `prev_token_count`.
2. `offset` (chunk start time) was added on top of already-shifted global timestamps, causing all timestamps to drift forward with each chunk.
3. `decoded_lengths` accumulates only while a speaker is active in the batch. When a speaker falls silent for multiple chunks, their `decoded_lengths` freezes. Resuming speakers produced timestamps that did not account for elapsed silence, causing `start_time ≈ last_active_time + small_delta`, always within `sent_break_sec` of the previous segment, forcing all utterances into one merged segment.
Fix: Track `_prev_decoded_lengths[spk_idx]` to undo the decoder shift, recovering local frame indices.
Behavior
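As a rough illustration of the arithmetic behind the fix described under Root cause, here is a minimal standalone sketch. It is not the NeMo implementation: the function names `local_frames` and `chunk_times`, the parameters `offset_sec` and `frame_sec`, the 0.08 s frame duration, and all the numbers in the example are assumptions made for illustration only. The sketch only mirrors the two ideas stated above: start from the first new token (`prev_token_count`) and subtract the decoder's accumulated frame count (`decoded_length_before`) before adding the chunk offset.

```python
from typing import List, Optional, Tuple


def local_frames(
    global_frames: List[int],
    prev_token_count: int,
    decoded_length_before: int,
) -> List[int]:
    """Return chunk-local frame indices of the tokens emitted in the current chunk.

    global_frames: per-token frame indices as shifted by the decoder (hypothetical input).
    prev_token_count: number of tokens this speaker had before the current chunk.
    decoded_length_before: decoder's accumulated frame count before the current chunk.
    """
    # Skip tokens from earlier chunks, then undo the decoder's shift.
    new_frames = global_frames[prev_token_count:]
    return [f - decoded_length_before for f in new_frames]


def chunk_times(
    global_frames: List[int],
    prev_token_count: int,
    decoded_length_before: int,
    offset_sec: float,
    frame_sec: float,
) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds for the new tokens, or None if there are none."""
    frames = local_frames(global_frames, prev_token_count, decoded_length_before)
    if not frames:
        return None
    # The chunk offset is applied exactly once, to local (unshifted) indices.
    start = offset_sec + frames[0] * frame_sec
    end = offset_sec + frames[-1] * frame_sec
    return start, end


# Toy scenario (all values assumed): a speaker emits three tokens in the first
# 14-frame chunk, stays silent for ten chunks, then resumes in the chunk that
# starts at 11 * 14 * 0.08 = 12.32 s. The frozen decoder count is still 14, so
# the two new tokens carry global indices 17 and 21, only a few frames past
# the earlier token at index 9, even though about 12 s of audio went by.
timestamps = [2, 5, 9, 17, 21]
result = chunk_times(
    timestamps,
    prev_token_count=3,
    decoded_length_before=14,
    offset_sec=12.32,
    frame_sec=0.08,
)
print(result)
```

In this toy case the recovered local indices are 3 and 7, so the resumed utterance is placed a little after the 12.32 s chunk start rather than right next to the speaker's earlier tokens, which is the separation that segmenting on `sent_break_sec` depends on.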
For reproducibility, I used the NVIDIA multi-talker ASR video demo on HuggingFace, extracted the `.wav` audio, and ran the processing script:
Before Fix (Incorrect Durations)
After Fix (Corrected Durations)