Reference for ultralytics/nn/modules/transformer.py
Note

This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/nn/modules/transformer.py. If you spot a problem, please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!
ultralytics.nn.modules.transformer.TransformerEncoderLayer
```python
TransformerEncoderLayer(
    c1, cm=2048, num_heads=8, dropout=0.0, act=nn.GELU(), normalize_before=False
)
```
Bases: Module
Defines a single layer of the transformer encoder.
Attributes:

| Name | Type | Description |
|---|---|---|
| `ma` | `MultiheadAttention` | Multi-head attention module. |
| `fc1` | `Linear` | First linear layer in the feedforward network. |
| `fc2` | `Linear` | Second linear layer in the feedforward network. |
| `norm1` | `LayerNorm` | Layer normalization after attention. |
| `norm2` | `LayerNorm` | Layer normalization after feedforward network. |
| `dropout` | `Dropout` | Dropout layer for the feedforward network. |
| `dropout1` | `Dropout` | Dropout layer after attention. |
| `dropout2` | `Dropout` | Dropout layer after feedforward network. |
| `act` | `Module` | Activation function. |
| `normalize_before` | `bool` | Whether to apply normalization before attention and feedforward. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `c1` | `int` | Input dimension. | required |
| `cm` | `int` | Hidden dimension in the feedforward network. | `2048` |
| `num_heads` | `int` | Number of attention heads. | `8` |
| `dropout` | `float` | Dropout probability. | `0.0` |
| `act` | `Module` | Activation function. | `GELU()` |
| `normalize_before` | `bool` | Whether to apply normalization before attention and feedforward. | `False` |
Source code in ultralytics/nn/modules/transformer.py
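A minimal usage sketch (shapes are illustrative; it assumes the layer's attention runs batch-first, so `src` is `[batch, tokens, channels]`):

```python
import torch
from ultralytics.nn.modules.transformer import TransformerEncoderLayer

layer = TransformerEncoderLayer(c1=256, cm=1024, num_heads=8)
src = torch.randn(2, 100, 256)  # illustrative [batch, tokens, channels] input
out = layer(src)                # src_mask, src_key_padding_mask, pos all default to None
print(out.shape)                # expected: torch.Size([2, 100, 256])
```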
forward
Forward propagates the input through the encoder module.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `src` | `Tensor` | Input tensor. | required |
| `src_mask` | `Tensor` | Mask for the src sequence. | `None` |
| `src_key_padding_mask` | `Tensor` | Mask for the src keys per batch. | `None` |
| `pos` | `Tensor` | Positional encoding. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor after transformer encoder layer. |
Source code in ultralytics/nn/modules/transformer.py
forward_post
Perform forward pass with post-normalization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `src` | `Tensor` | Input tensor. | required |
| `src_mask` | `Tensor` | Mask for the src sequence. | `None` |
| `src_key_padding_mask` | `Tensor` | Mask for the src keys per batch. | `None` |
| `pos` | `Tensor` | Positional encoding. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor after attention and feedforward. |
Source code in ultralytics/nn/modules/transformer.py
forward_pre
Perform forward pass with pre-normalization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `src` | `Tensor` | Input tensor. | required |
| `src_mask` | `Tensor` | Mask for the src sequence. | `None` |
| `src_key_padding_mask` | `Tensor` | Mask for the src keys per batch. | `None` |
| `pos` | `Tensor` | Positional encoding. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor after attention and feedforward. |
Source code in ultralytics/nn/modules/transformer.py
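The only difference between `forward_post` and `forward_pre` is where layer normalization sits relative to each sublayer. A schematic sketch of the two orderings (illustrative, not the library's exact code; dropout and masks omitted):

```python
def encoder_layer(x, attn, ffn, norm1, norm2, normalize_before=False):
    """Schematic ordering of norm, sublayer, and residual in the two variants."""
    if normalize_before:          # pre-norm: normalize, apply sublayer, add residual
        x = x + attn(norm1(x))
        return x + ffn(norm2(x))
    x = norm1(x + attn(x))        # post-norm: apply sublayer, add residual, normalize
    return norm2(x + ffn(x))
```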
with_pos_embed
staticmethod

Add position embeddings to the tensor if provided.
ultralytics.nn.modules.transformer.AIFI
Bases: TransformerEncoderLayer
Defines the AIFI transformer layer.
This class extends TransformerEncoderLayer to work with 2D data by adding positional embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `c1` | `int` | Input dimension. | required |
| `cm` | `int` | Hidden dimension in the feedforward network. | `2048` |
| `num_heads` | `int` | Number of attention heads. | `8` |
| `dropout` | `float` | Dropout probability. | `0` |
| `act` | `Module` | Activation function. | `GELU()` |
| `normalize_before` | `bool` | Whether to apply normalization before attention and feedforward. | `False` |
Source code in ultralytics/nn/modules/transformer.py
build_2d_sincos_position_embedding
staticmethod
Build 2D sine-cosine position embedding.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `w` | `int` | Width of the feature map. | required |
| `h` | `int` | Height of the feature map. | required |
| `embed_dim` | `int` | Embedding dimension. | `256` |
| `temperature` | `float` | Temperature for the sine/cosine functions. | `10000.0` |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Position embedding with shape [1, embed_dim, h*w]. |
Source code in ultralytics/nn/modules/transformer.py
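A self-contained sketch of the standard 2D sine-cosine construction (an illustration of the idea, not necessarily the library's exact code; this sketch returns shape `[1, w*h, embed_dim]`):

```python
import torch

def sincos_pos_embed_2d(w: int, h: int, embed_dim: int = 256, temperature: float = 10000.0):
    """Build a 2D sin-cos embedding: a quarter of the channels each for sin/cos of x/y."""
    assert embed_dim % 4 == 0, "embed_dim must be divisible by 4"
    grid_w, grid_h = torch.meshgrid(
        torch.arange(w, dtype=torch.float32), torch.arange(h, dtype=torch.float32), indexing="ij"
    )
    pos_dim = embed_dim // 4
    omega = 1.0 / temperature ** (torch.arange(pos_dim, dtype=torch.float32) / pos_dim)
    out_w = grid_w.flatten()[:, None] * omega[None]  # [w*h, pos_dim]
    out_h = grid_h.flatten()[:, None] * omega[None]
    return torch.cat([out_w.sin(), out_w.cos(), out_h.sin(), out_h.cos()], dim=1)[None]
```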
forward
Forward pass for the AIFI transformer layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor with shape [B, C, H, W]. | required |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor with shape [B, C, H, W]. |
Source code in ultralytics/nn/modules/transformer.py
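A quick usage sketch (the 20×20 spatial size is illustrative):

```python
import torch
from ultralytics.nn.modules.transformer import AIFI

aifi = AIFI(c1=256, cm=1024, num_heads=8)
x = torch.randn(1, 256, 20, 20)  # [B, C, H, W] feature map
y = aifi(x)
print(y.shape)                   # expected: torch.Size([1, 256, 20, 20])
```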
ultralytics.nn.modules.transformer.TransformerLayer
Bases: Module
Transformer layer https://arxiv.org/abs/2010.11929 (LayerNorm layers removed for better performance).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `c` | `int` | Input and output channel dimension. | required |
| `num_heads` | `int` | Number of attention heads. | required |
Source code in ultralytics/nn/modules/transformer.py
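A sketch consistent with the description above: separate q/k/v projections feed multi-head attention, followed by a two-layer feedforward, each with a residual connection and no LayerNorm (illustrative, not the library's exact code):

```python
import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    def __init__(self, c: int, num_heads: int):
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):
        x = self.ma(self.q(x), self.k(x), self.v(x))[0] + x  # attention + residual
        return self.fc2(self.fc1(x)) + x                      # feedforward + residual
```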
forward
Apply a transformer block to the input x and return the output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor. | required |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor after transformer layer. |
Source code in ultralytics/nn/modules/transformer.py
ultralytics.nn.modules.transformer.TransformerBlock
Bases: Module
Vision Transformer https://arxiv.org/abs/2010.11929.
Attributes:

| Name | Type | Description |
|---|---|---|
| `conv` | `Conv` | Convolution layer if input and output channels differ. |
| `linear` | `Linear` | Learnable position embedding. |
| `tr` | `Sequential` | Sequential container of transformer layers. |
| `c2` | `int` | Output channel dimension. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `c1` | `int` | Input channel dimension. | required |
| `c2` | `int` | Output channel dimension. | required |
| `num_heads` | `int` | Number of attention heads. | required |
| `num_layers` | `int` | Number of transformer layers. | required |
Source code in ultralytics/nn/modules/transformer.py
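A usage sketch (channel and spatial sizes are illustrative; when `c1 != c2` the block first projects channels with a convolution):

```python
import torch
from ultralytics.nn.modules.transformer import TransformerBlock

block = TransformerBlock(c1=64, c2=128, num_heads=4, num_layers=2)
x = torch.randn(1, 64, 32, 32)
y = block(x)
print(y.shape)  # expected: torch.Size([1, 128, 32, 32])
```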
forward
Forward propagates the input through the bottleneck module.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor with shape [b, c1, w, h]. | required |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor with shape [b, c2, w, h]. |
Source code in ultralytics/nn/modules/transformer.py
ultralytics.nn.modules.transformer.MLPBlock
Bases: Module
Implements a single block of a multi-layer perceptron.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embedding_dim` | `int` | Input and output dimension. | required |
| `mlp_dim` | `int` | Hidden dimension. | required |
| `act` | `Module` | Activation function. | `GELU` |
Source code in ultralytics/nn/modules/transformer.py
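A usage sketch (dimensions are illustrative); the block expands to `mlp_dim`, applies the activation, and projects back to `embedding_dim`:

```python
import torch
from ultralytics.nn.modules.transformer import MLPBlock

mlp = MLPBlock(embedding_dim=256, mlp_dim=1024)
x = torch.randn(2, 100, 256)   # [..., embedding_dim]
y = mlp(x)
print(y.shape)                 # expected: torch.Size([2, 100, 256])
```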
forward
Forward pass for the MLPBlock.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor. | required |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor after MLP block. |
Source code in ultralytics/nn/modules/transformer.py
ultralytics.nn.modules.transformer.MLP
Bases: Module
Implements a simple multi-layer perceptron (also called FFN).
Attributes:

| Name | Type | Description |
|---|---|---|
| `num_layers` | `int` | Number of layers in the MLP. |
| `layers` | `ModuleList` | List of linear layers. |
| `sigmoid` | `bool` | Whether to apply sigmoid to the output. |
| `act` | `Module` | Activation function. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_dim` | `int` | Input dimension. | required |
| `hidden_dim` | `int` | Hidden dimension. | required |
| `output_dim` | `int` | Output dimension. | required |
| `num_layers` | `int` | Number of layers. | required |
| `act` | `Module` | Activation function. | `ReLU` |
| `sigmoid` | `bool` | Whether to apply sigmoid to the output. | `False` |
Source code in ultralytics/nn/modules/transformer.py
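A usage sketch, in the spirit of the box-regression heads this module is typically used for (dimensions illustrative):

```python
import torch
from ultralytics.nn.modules.transformer import MLP

head = MLP(input_dim=256, hidden_dim=256, output_dim=4, num_layers=3)
queries = torch.randn(2, 300, 256)  # e.g. decoder query embeddings
boxes = head(queries)
print(boxes.shape)                  # expected: torch.Size([2, 300, 4])
```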
forward
Forward pass for the entire MLP.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor. | required |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor after MLP. |
Source code in ultralytics/nn/modules/transformer.py
ultralytics.nn.modules.transformer.LayerNorm2d
Bases: Module
2D Layer Normalization module inspired by Detectron2 and ConvNeXt implementations.
Original implementations in https://github.com/facebookresearch/detectron2/blob/main/detectron2/layers/batch_norm.py and https://github.com/facebookresearch/ConvNeXt/blob/main/models/convnext.py.
Attributes:

| Name | Type | Description |
|---|---|---|
| `weight` | `Parameter` | Learnable scale parameter. |
| `bias` | `Parameter` | Learnable bias parameter. |
| `eps` | `float` | Small constant for numerical stability. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `num_channels` | `int` | Number of channels in the input. | required |
| `eps` | `float` | Small constant for numerical stability. | `1e-06` |
Source code in ultralytics/nn/modules/transformer.py
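The functional core is a per-position normalization over the channel dimension of an NCHW tensor; a minimal sketch mirroring the ConvNeXt/Detectron2 formulation (illustrative, not the library's exact code):

```python
import torch

def layer_norm_2d(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float = 1e-6):
    """Normalize over channels (dim 1) of an [N, C, H, W] tensor, then scale and shift."""
    u = x.mean(1, keepdim=True)                 # per-position channel mean
    s = (x - u).pow(2).mean(1, keepdim=True)    # per-position channel variance
    x = (x - u) / torch.sqrt(s + eps)
    return weight[:, None, None] * x + bias[:, None, None]
```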
forward
Perform forward pass for 2D layer normalization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor. | required |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Normalized output tensor. |
Source code in ultralytics/nn/modules/transformer.py
ultralytics.nn.modules.transformer.MSDeformAttn
Bases: Module
Multiscale Deformable Attention Module based on Deformable-DETR and PaddleDetection implementations.
https://github.com/fundamentalvision/Deformable-DETR/blob/main/models/ops/modules/ms_deform_attn.py
Attributes:

| Name | Type | Description |
|---|---|---|
| `im2col_step` | `int` | Step size for im2col operations. |
| `d_model` | `int` | Model dimension. |
| `n_levels` | `int` | Number of feature levels. |
| `n_heads` | `int` | Number of attention heads. |
| `n_points` | `int` | Number of sampling points per attention head per feature level. |
| `sampling_offsets` | `Linear` | Linear layer for generating sampling offsets. |
| `attention_weights` | `Linear` | Linear layer for generating attention weights. |
| `value_proj` | `Linear` | Linear layer for projecting values. |
| `output_proj` | `Linear` | Linear layer for projecting output. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `d_model` | `int` | Model dimension. | `256` |
| `n_levels` | `int` | Number of feature levels. | `4` |
| `n_heads` | `int` | Number of attention heads. | `8` |
| `n_points` | `int` | Number of sampling points per attention head per feature level. | `4` |
Source code in ultralytics/nn/modules/transformer.py
forward
Perform forward pass for multiscale deformable attention.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Tensor` | Tensor with shape [bs, query_length, C]. | required |
| `refer_bbox` | `Tensor` | Tensor with shape [bs, query_length, n_levels, 2], range in [0, 1], top-left (0, 0), bottom-right (1, 1), including padding area. | required |
| `value` | `Tensor` | Tensor with shape [bs, value_length, C]. | required |
| `value_shapes` | `list` | List with shape [n_levels, 2], [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]. | required |
| `value_mask` | `Tensor` | Tensor with shape [bs, value_length], True for non-padding elements, False for padding elements. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor with shape [bs, query_length, C]. |
Source code in ultralytics/nn/modules/transformer.py
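A usage sketch with a single feature level (all sizes illustrative):

```python
import torch
from ultralytics.nn.modules.transformer import MSDeformAttn

attn = MSDeformAttn(d_model=256, n_levels=1, n_heads=8, n_points=4)
bs, num_queries = 2, 300
shapes = [(32, 32)]                             # one 32x32 feature level
value = torch.randn(bs, 32 * 32, 256)           # flattened level features
query = torch.randn(bs, num_queries, 256)
refer_bbox = torch.rand(bs, num_queries, 1, 2)  # normalized (x, y) per level
out = attn(query, refer_bbox, value, shapes)
print(out.shape)                                # expected: torch.Size([2, 300, 256])
```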
ultralytics.nn.modules.transformer.DeformableTransformerDecoderLayer
```python
DeformableTransformerDecoderLayer(
    d_model=256,
    n_heads=8,
    d_ffn=1024,
    dropout=0.0,
    act=nn.ReLU(),
    n_levels=4,
    n_points=4,
)
```
Bases: Module
Deformable Transformer Decoder Layer inspired by PaddleDetection and Deformable-DETR implementations.
https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/transformers/deformable_transformer.py https://github.com/fundamentalvision/Deformable-DETR/blob/main/models/deformable_transformer.py
Attributes:

| Name | Type | Description |
|---|---|---|
| `self_attn` | `MultiheadAttention` | Self-attention module. |
| `dropout1` | `Dropout` | Dropout after self-attention. |
| `norm1` | `LayerNorm` | Layer normalization after self-attention. |
| `cross_attn` | `MSDeformAttn` | Cross-attention module. |
| `dropout2` | `Dropout` | Dropout after cross-attention. |
| `norm2` | `LayerNorm` | Layer normalization after cross-attention. |
| `linear1` | `Linear` | First linear layer in the feedforward network. |
| `act` | `Module` | Activation function. |
| `dropout3` | `Dropout` | Dropout in the feedforward network. |
| `linear2` | `Linear` | Second linear layer in the feedforward network. |
| `dropout4` | `Dropout` | Dropout after the feedforward network. |
| `norm3` | `LayerNorm` | Layer normalization after the feedforward network. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `d_model` | `int` | Model dimension. | `256` |
| `n_heads` | `int` | Number of attention heads. | `8` |
| `d_ffn` | `int` | Dimension of the feedforward network. | `1024` |
| `dropout` | `float` | Dropout probability. | `0.0` |
| `act` | `Module` | Activation function. | `ReLU()` |
| `n_levels` | `int` | Number of feature levels. | `4` |
| `n_points` | `int` | Number of sampling points. | `4` |
Source code in ultralytics/nn/modules/transformer.py
forward
Perform the forward pass through the entire decoder layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embed` | `Tensor` | Input embeddings. | required |
| `refer_bbox` | `Tensor` | Reference bounding boxes. | required |
| `feats` | `Tensor` | Feature maps. | required |
| `shapes` | `list` | Feature shapes. | required |
| `padding_mask` | `Tensor` | Padding mask. | `None` |
| `attn_mask` | `Tensor` | Attention mask. | `None` |
| `query_pos` | `Tensor` | Query position embeddings. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor after decoder layer. |
Source code in ultralytics/nn/modules/transformer.py
forward_ffn
Perform forward pass through the Feed-Forward Network part of the layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tgt` | `Tensor` | Input tensor. | required |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Output tensor after FFN. |
Source code in ultralytics/nn/modules/transformer.py
with_pos_embed
staticmethod

Add positional embeddings to the input tensor, if provided.
ultralytics.nn.modules.transformer.DeformableTransformerDecoder
Bases: Module
Implementation of Deformable Transformer Decoder based on PaddleDetection.
Attributes:

| Name | Type | Description |
|---|---|---|
| `layers` | `ModuleList` | List of decoder layers. |
| `num_layers` | `int` | Number of decoder layers. |
| `hidden_dim` | `int` | Hidden dimension. |
| `eval_idx` | `int` | Index of the layer to use during evaluation. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `hidden_dim` | `int` | Hidden dimension. | required |
| `decoder_layer` | `Module` | Decoder layer module. | required |
| `num_layers` | `int` | Number of decoder layers. | required |
| `eval_idx` | `int` | Index of the layer to use during evaluation. | `-1` |
Source code in ultralytics/nn/modules/transformer.py
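Constructing the decoder from its layer (a sketch; the full forward pass additionally needs bbox/score heads and a position MLP, as documented below):

```python
from ultralytics.nn.modules.transformer import (
    DeformableTransformerDecoder,
    DeformableTransformerDecoderLayer,
)

layer = DeformableTransformerDecoderLayer(d_model=256, n_heads=8, d_ffn=1024)
decoder = DeformableTransformerDecoder(hidden_dim=256, decoder_layer=layer, num_layers=6)
```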
forward
```python
forward(
    embed,
    refer_bbox,
    feats,
    shapes,
    bbox_head,
    score_head,
    pos_mlp,
    attn_mask=None,
    padding_mask=None,
)
```
Perform the forward pass through the entire decoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embed` | `Tensor` | Decoder embeddings. | required |
| `refer_bbox` | `Tensor` | Reference bounding boxes. | required |
| `feats` | `Tensor` | Image features. | required |
| `shapes` | `list` | Feature shapes. | required |
| `bbox_head` | `Module` | Bounding box prediction head. | required |
| `score_head` | `Module` | Score prediction head. | required |
| `pos_mlp` | `Module` | Position MLP. | required |
| `attn_mask` | `Tensor` | Attention mask. | `None` |
| `padding_mask` | `Tensor` | Padding mask. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `dec_bboxes` | `Tensor` | Decoded bounding boxes. |
| `dec_cls` | `Tensor` | Decoded classification scores. |