Adding Conformer encoder I/O-styled Transformer encoder by tango4j · Pull Request #15703 · NVIDIA-NeMo/NeMo

tango4j · 2026-05-15T15:58:51Z

What does this PR do?

Follow-up to the initial Transformer (TF) encoder PR (#15661). Several NeMo Speech AI maintainers have asked for the new Transformer encoder to support the pre-encode and positional-encoding features that the Conformer encoder provides.

Aligns the ASR TransformerEncoder module with the offline ConformerEncoder module surface while preserving Transformer-specific attention parameters and behavior.

Streaming encoder and adapter implementations are not included in this PR. These features will be added later on.

Tested LibriSpeech training with Transformer + CTC (BPE). Added transformer_ctc_bpe.yaml with the default configurations.

Collection: ASR

Changelog

Updated TransformerEncoder to inherit NeMo module/export/access mixins and expose Conformer-style input/output type metadata.
Added Conformer-style offline encoder utilities, including input_example, forward_for_export, forward_internal, bypass_pre_encode, feat_out, positional encoding, pad mask toggling, stochastic depth, and inter-CTC tensor capture.
Added Conformer-style pre-encoder options while preserving the Transformer-native FeatureStacking path as subsampling="feature_stacking".
Moved FeatureStacking into the shared ASR subsampling module so it can be imported from nemo.collections.asr.parts.submodules.subsampling.
Added self_attention_model mirroring Conformer's positional-encoding switch: "rel_pos" (default), "abs_pos", and "no_pos" (None is accepted as a YAML alias for "no_pos").
Implemented Transformer-XL relative PE on FlexAttention — (b)+(d) bias via a score_mod closure, (c) bias folded as Q + pos_bias_u; rel-shift is shared with ConformerEncoder via RelPositionMultiHeadAttention.rel_shift.
Added Transformer encoder tests mirroring relevant Conformer encoder test procedures for stochastic depth and bypass pre-encode behavior.
Added self_attention_model tests, including a T != n_heads regression for pos_bias_{u,v} broadcasting.
Updated Transformer encoder tests to use typed-module keyword arguments and validate output lengths.
Wrapped CPU forward tests in torch.no_grad() so FlexAttention's CPU path doesn't raise under model.train().
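
For readers unfamiliar with the (b), (c), (d) labels in the relative-PE entry above: they refer to terms of the Transformer-XL attention score. The sketch below is a pure-Python illustration and not the FlexAttention code from this PR; the helper names and toy values are assumptions. It shows why adding the bias u to the query covers term (c), and why terms (b) and (d) can be computed together as a single position-dependent bias.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def add(x, y):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(x, y)]


def score_four_terms(q, k, r, u, v):
    """Transformer-XL score written as its four separate terms.

    q: projected query for position i
    k: projected content key for position j
    r: projected relative-position embedding for offset i - j
    u, v: learned global biases (pos_bias_u / pos_bias_v in the changelog)
    """
    term_a = dot(q, k)  # (a) content-content
    term_b = dot(q, r)  # (b) content-position
    term_c = dot(u, k)  # (c) global content bias
    term_d = dot(v, r)  # (d) global position bias
    return term_a + term_b + term_c + term_d


def score_folded(q, k, r, u, v):
    """Same score with (c) folded into the query and (b)+(d) as one bias."""
    content = dot(add(q, u), k)        # (a) + (c), i.e. (Q + pos_bias_u) . K
    position_bias = dot(add(q, v), r)  # (b) + (d), added to the score as a bias
    return content + position_bias


# Toy vectors, chosen only for illustration.
q, k, r = [0.5, -1.0, 2.0], [1.0, 0.25, -0.5], [0.0, 1.5, 1.0]
u, v = [0.1, 0.2, 0.3], [-0.3, 0.0, 0.4]
print(score_four_terms(q, k, r, u, v), score_folded(q, k, r, u, v))
```

The two printed values agree up to floating-point rounding, which is the identity the folded formulation relies on.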

Usage

from nemo.collections.asr.modules.transformer_encoder import TransformerEncoder

encoder = TransformerEncoder(
    feat_in=128,
    d_model=512,
    n_heads=8,
    n_layers=17,
    subsampling="feature_stacking",
    subsampling_factor=4,
    self_attention_model="rel_pos",  # one of "rel_pos" | "abs_pos" | "no_pos" (or None)
)

encoded, encoded_len = encoder(audio_signal=features, length=feature_lengths)
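
To make the shape contract of the call above more concrete, here is a minimal sketch of what a stacking front end with subsampling_factor=4 does to a feature sequence. This is not the NeMo FeatureStacking module: `stack_frames` and `stacked_length` are hypothetical helpers, and zero-padding the tail to a multiple of the factor is an assumption about the padding policy.

```python
import math


def stack_frames(frames, factor):
    """Concatenate every `factor` consecutive frames into one wider frame.

    frames: list of T feature vectors, each a list of feat_in floats.
    The tail is zero-padded so that T becomes a multiple of `factor`
    (an assumption made for this sketch).
    """
    if not frames:
        return []
    feat_in = len(frames[0])
    pad = (-len(frames)) % factor
    padded = frames + [[0.0] * feat_in for _ in range(pad)]
    return [
        [value for frame in padded[start:start + factor] for value in frame]
        for start in range(0, len(padded), factor)
    ]


def stacked_length(length, factor):
    """Number of output frames for an input of `length` frames."""
    return math.ceil(length / factor)


# Toy input: 10 frames with 3 features each (the usage example has feat_in=128).
frames = [[float(t)] * 3 for t in range(10)]
stacked = stack_frames(frames, factor=4)
print(len(stacked), len(stacked[0]), stacked_length(10, 4))  # prints: 3 12 3
```

Under these assumptions the time axis shrinks by the stacking factor while the feature width grows by the same factor; a projection to d_model would typically follow and is omitted here.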

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Follow up work after the initial TF encoder PR (#15661)

Signed-off-by: taejinp <tango4j@gmail.com>

copy-pr-bot · 2026-05-15T15:58:55Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

stevehuang52 · 2026-05-15T17:42:20Z

        pre_block_norm: bool = True,
-        subsampling_factor: int = 4,
+        pos_emb_max_len: int = 5000,
+        xscaling: bool = True,


Shall we set the default of xscaling to False, since we already know that the LayerNorm will cancel out the effect of xscaling?

Thanks for pointing this out. Setting the default to False.
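
The reasoning behind the xscaling question can be checked numerically: xscaling multiplies the features by a constant, and LayerNorm is invariant to a positive rescaling of its input up to its epsilon term. The pure-Python sketch below assumes the scaled features reach a LayerNorm with nothing added in between; `layer_norm` is a simplified stand-in without learned affine parameters, not the PyTorch module, and the sqrt(d_model) factor is an assumption made for the illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm over one feature vector, without learned scale or shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


d_model = 8
x = [0.3, -1.2, 0.7, 2.1, -0.4, 0.0, 1.5, -0.9]
scale = math.sqrt(d_model)  # assumed xscaling factor for this sketch

plain = layer_norm(x)
scaled = layer_norm([scale * v for v in x])

# Largest element-wise gap between the two normalized vectors.
max_gap = max(abs(a - b) for a, b in zip(plain, scaled))
print(max_gap)
```

The remaining gap comes only from eps. If something is added to the scaled features before the norm (for example absolute positional embeddings), the scale changes the relative weight of the two terms, so the argument applies less directly in that case.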

Signed-off-by: taejinp <tango4j@gmail.com>

tango4j · 2026-05-16T23:41:19Z

/ok to test 5398604

github-actions · 2026-05-17T00:39:52Z

[🤖]: Hi @tango4j 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

nithinraok · 2026-05-18T12:36:53Z

Thanks Taejin!

Have you had a chance to run training with this? Does it converge similarly with positional embedding enabled, and how do the results compare to previous runs?

tango4j · 2026-05-19T23:59:38Z

@KunalDhawan is using this part of the PR for his MoE Transformer training experiments.
If time allows, I will also try LibriSpeech-only training with several setups as a sanity check (to test the points you mentioned).

Recently, after doing a survey, it appeared to me that the convnet front end and positional encoding can affect performance a lot. So I think we need to test these two configurations separately (ablations).

Signed-off-by: Taejin Park <tango4j@gmail.com>

tango4j · 2026-05-21T06:18:08Z

@nithinraok
I have tested relative positional encoding in a training job, comparing FastConformer CTC against Transformer CTC.
Having rel_pos in the Transformer encoder affects performance a lot: without it, the error is 10 to 20% higher. It needs to be the default.

I also found that filterbank stacking performs as well as dw_striding (a 3-level convnet front end). It is better to switch to filterbank stacking to make this model low-precision friendly.

@ipmedenn @KunalDhawan @stevehuang52
The Transformer encoder is now fully equipped with relative positional encoding.

tango4j · 2026-05-21T06:18:36Z

/ok to test 6725930

github-actions · 2026-05-21T07:14:11Z

[🤖]: Hi @tango4j 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

nithinraok · 2026-05-21T16:43:42Z

Could you also move https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/conf/fastconformer/transformer_stacking_tdt_bpe.yaml to this folder: examples/asr/conf/transformer/

nithinraok · 2026-05-21T16:44:09Z

+from nemo.collections.asr.parts.utils.regularization_utils import compute_stochastic_depth_drop_probs
+from nemo.core.classes.common import typecheck
+from nemo.core.classes.exportable import Exportable
+from nemo.core.classes.mixins import AccessMixin
+from nemo.core.classes.module import NeuralModule
+from nemo.core.neural_types import AcousticEncodedRepresentation, BoolType, LengthsType, NeuralType, SpectrogramType


Please remove support for all these

nithinraok · 2026-05-21T16:48:33Z

+    RelPositionalEncoding,
+    RelPositionMultiHeadAttention,
+)
+from nemo.collections.asr.parts.submodules.subsampling import ConvSubsampling, FeatureStacking, StackingSubsampling


Let's drop ConvSubsampling from this PR.

This module lives under ASR, and profiling shows it underperforms FeatureStacking on training throughput and complicates inference-tool support, with no accuracy upside. Removing it keeps the PR focused and avoids carrying forward a suboptimal path. Users can add it later if they find it useful for other tasks.

Adding Conformer encoder style Transformer encoder

f86e346

Signed-off-by: taejinp <tango4j@gmail.com>

github-actions Bot added the ASR label May 15, 2026

tango4j requested review from KunalDhawan, ipmedenn and stevehuang52 May 15, 2026 15:59

stevehuang52 reviewed May 15, 2026

Adding final touch up

ba9d40d

Signed-off-by: taejinp <tango4j@gmail.com>

tango4j requested review from nithinraok and pzelasko May 15, 2026 22:39

Fixing Black issue

5398604

Signed-off-by: taejinp <tango4j@gmail.com>

tango4j marked this pull request as ready for review May 15, 2026 22:52

copy-pr-bot Bot temporarily deployed to public May 16, 2026 23:42 Inactive

copy-pr-bot Bot temporarily deployed to test May 16, 2026 23:42 Inactive

copy-pr-bot Bot temporarily deployed to public May 16, 2026 23:45 Inactive

copy-pr-bot Bot temporarily deployed to public May 16, 2026 23:46 Inactive

copy-pr-bot Bot temporarily deployed to public May 16, 2026 23:49 Inactive

tango4j added 2 commits May 20, 2026 22:59

Adding relative position encoding and transformer-ctc yaml

c19ca03

Signed-off-by: Taejin Park <tango4j@gmail.com>

Apply black formatting

6725930

Signed-off-by: Taejin Park <tango4j@gmail.com>

copy-pr-bot Bot temporarily deployed to public May 21, 2026 06:19 Inactive

copy-pr-bot Bot deployed to test May 21, 2026 06:20 Active

copy-pr-bot Bot temporarily deployed to public May 21, 2026 06:22 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 06:23 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 06:26 Inactive

tango4j changed the title ~~Adding Conformer encoder style Transformer encoder~~ Adding Conformer encoder I/O-styled Transformer encoder May 21, 2026

nithinraok requested changes May 21, 2026

3 participants
