Adding Conformer encoder I/O-styled Transformer encoder #15703
Signed-off-by: taejinp <tango4j@gmail.com>
    pre_block_norm: bool = True,
    subsampling_factor: int = 4,
    pos_emb_max_len: int = 5000,
    xscaling: bool = True,
Shall we set the default for xscaling to False, since we already know that the layer norm will zero out the effect of xscaling?
Thanks for pointing this out. Setting the default to False.
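To make the reviewer's point concrete, here is a minimal sketch in plain Python (no NeMo code involved). It assumes xscaling multiplies the input by sqrt(d_model) and that a layer norm without learned affine parameters is applied directly afterwards. The function `layer_norm` and the sample values are illustrative only, and the residual path of a real encoder block is not modeled.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a single feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


d_model = 8
x = [0.3, -1.2, 0.7, 2.1, -0.5, 0.0, 1.4, -0.9]

# Assumption: xscaling multiplies the input by sqrt(d_model).
xscale = math.sqrt(d_model)
scaled = [v * xscale for v in x]

plain_out = layer_norm(x)
scaled_out = layer_norm(scaled)

# The two outputs differ only through the eps term in the denominator.
max_diff = max(abs(a - b) for a, b in zip(plain_out, scaled_out))
print(f"max abs difference after layer norm: {max_diff:.2e}")
```

Under these assumptions the constant factor cancels in the normalization, so the only difference left comes from eps.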
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
/ok to test 5398604
[🤖]: Hi @tango4j 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
Thanks Taejin! Have you had a chance to run training with this? Does it converge similarly with positional embedding enabled, and how do the results compare to previous runs?
@KunalDhawan is working on using part of this PR to train his MoE transformer experiments. Recently, after doing some survey, it appeared to me that the convnet frontend and positional encoding can affect performance a lot. So I think we need to test these two configurations separately (ablations).
Signed-off-by: Taejin Park <tango4j@gmail.com>
Signed-off-by: Taejin Park <tango4j@gmail.com>
@nithinraok I also found that the Filterbank Stacking feature is as good as dw_striding (the 3-level convnet frontend). It would be better to switch to Filterbank Stacking to make this model low-precision friendly. @ipmedenn @KunalDhawan @stevehuang52
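For readers unfamiliar with the term, the sketch below shows the general idea of frame stacking in plain Python: consecutive filterbank frames are concatenated along the feature axis, so the time axis shrinks by the stacking factor without any learned convolution. This is not NeMo's FeatureStacking implementation. The function name `stack_frames` and the zero-padding of the tail are assumptions made for this example.

```python
def stack_frames(features, factor):
    """Concatenate every `factor` consecutive frames into one wider frame.

    features: list of T frames, each a list of D floats.
    Returns ceil(T / factor) frames of width D * factor. The tail is
    zero-padded so no input frame is dropped (a choice made for this
    sketch, not taken from the PR).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not features:
        return []
    dim = len(features[0])
    pad = (-len(features)) % factor
    padded = features + [[0.0] * dim for _ in range(pad)]
    return [
        [value for frame in padded[start:start + factor] for value in frame]
        for start in range(0, len(padded), factor)
    ]


# 5 frames with 2 filterbank bins each, stacked by 4: 2 frames of 8 values.
frames = [[float(t), float(t) + 0.5] for t in range(5)]
stacked = stack_frames(frames, factor=4)
print(len(stacked), len(stacked[0]))
```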
/ok to test 6725930
[🤖]: Hi @tango4j 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
Could you also move https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/conf/fastconformer/transformer_stacking_tdt_bpe.yaml to the examples/asr/conf/transformer/ folder?
from nemo.collections.asr.parts.utils.regularization_utils import compute_stochastic_depth_drop_probs
from nemo.core.classes.common import typecheck
from nemo.core.classes.exportable import Exportable
from nemo.core.classes.mixins import AccessMixin
from nemo.core.classes.module import NeuralModule
from nemo.core.neural_types import AcousticEncodedRepresentation, BoolType, LengthsType, NeuralType, SpectrogramType
Please remove support for all these
    RelPositionalEncoding,
    RelPositionMultiHeadAttention,
)
from nemo.collections.asr.parts.submodules.subsampling import ConvSubsampling, FeatureStacking, StackingSubsampling
Let's drop ConvSubsampling from this PR.
This module lives under ASR, and profiling shows it underperforms FeatureStacking on training throughput and complicates inference-tool support, with no accuracy upside. Removing it keeps the PR focused and avoids carrying forward a suboptimal path. Users can add it later if they find it useful for other tasks.
What does this PR do?
Follow-up work after the initial TF encoder PR (#15661). Many NeMo Speech AI maintainers have asked for the new Transformer encoder implementation to have the pre-encode and positional encoding features of the Conformer encoder.
Aligns the ASR TransformerEncoder module with the offline ConformerEncoder module surface while preserving Transformer-specific attention parameters and behavior.
Streaming encoder and adapter implementations are not included in this PR. These features will be added later on.
Tested LibriSpeech training with Transformer + CTC (BPE). Added transformer_ctc_bpe.yaml with the default configurations.
Collection: ASR
Changelog
- TransformerEncoder to inherit NeMo module/export/access mixins and expose Conformer-style input/output type metadata.
- input_example, forward_for_export, forward_internal, bypass_pre_encode, feat_out, positional encoding, pad mask toggling, stochastic depth, and inter-CTC tensor capture.
- FeatureStacking path as subsampling="feature_stacking".
- FeatureStacking into the shared ASR subsampling module so it can be imported from nemo.collections.asr.parts.submodules.subsampling.
- self_attention_model mirroring Conformer's positional-encoding switch: "rel_pos" (default), "abs_pos", and "no_pos" (None is accepted as a YAML alias for "no_pos").
- score_mod closure, (c) bias folded as Q + pos_bias_u; rel-shift is shared with ConformerEncoder via RelPositionMultiHeadAttention.rel_shift.
- self_attention_model tests, including a T != n_heads regression for pos_bias_{u,v} broadcasting.
- torch.no_grad() so FlexAttention's CPU path doesn't raise under model.train().
Usage
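As a rough illustration of the self_attention_model switch listed in the changelog, the sketch below normalizes a config value to one of the three documented modes and treats None as "no_pos". The helper `normalize_self_attention_model` and the `encoder_cfg` dict are hypothetical and written only for this example; they are not taken from the PR's code or its YAML file. Only the key names and string values come from the description above.

```python
# Modes named in the changelog; "rel_pos" is described as the default.
VALID_MODES = ("rel_pos", "abs_pos", "no_pos")


def normalize_self_attention_model(value="rel_pos"):
    """Hypothetical helper: map a config value to one of the three modes.

    None is mapped to "no_pos", matching the YAML alias mentioned in the
    changelog. Anything else outside VALID_MODES is rejected.
    """
    if value is None:
        return "no_pos"
    if value not in VALID_MODES:
        raise ValueError(
            f"self_attention_model must be one of {VALID_MODES} or None, got {value!r}"
        )
    return value


# Illustrative settings only; this is not the shipped transformer_ctc_bpe.yaml.
encoder_cfg = {
    "subsampling": "feature_stacking",
    "subsampling_factor": 4,
    "self_attention_model": None,  # a YAML null would arrive here as None
    "xscaling": False,
}
encoder_cfg["self_attention_model"] = normalize_self_attention_model(
    encoder_cfg["self_attention_model"]
)
print(encoder_cfg["self_attention_model"])
```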
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
Follow-up work after the initial TF encoder PR (#15661).