Abbreviations

1x1 Filter

A 1x1 filter, also known as a pointwise convolution, applies a linear transformation to the input feature map, combining information from different channels or reducing the number of channels.
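
A minimal PyTorch sketch (the channel counts are arbitrary):

```python
import torch
import torch.nn as nn

# A 1x1 convolution mixes information across channels at every spatial
# position; here it reduces 256 channels to 64 without touching H or W.
pointwise = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)

x = torch.randn(1, 256, 32, 32)   # (batch, channels, height, width)
y = pointwise(x)
print(y.shape)                    # torch.Size([1, 64, 32, 32])
```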

Anchor

In object detection, an anchor is a predefined bounding box that is used as a reference for detecting objects in an image. Anchors are typically defined at different scales and aspect ratios to handle objects of various sizes and shapes. During training, the network learns to adjust the anchors to better fit the objects in the image. Anchors play a crucial role in the region proposal network (RPN) of object detection models such as Faster R-CNN and RetinaNet.
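
A simplified sketch of how RPN-style anchors can be generated; the base size, scales, and aspect ratios below are illustrative, not taken from any particular model:

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchors centered at the origin for every
    combination of scale and aspect ratio (here, ratio = h / w)."""
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:
            w = np.sqrt(area / ratio)   # keep the area fixed per scale
            h = w * ratio
            anchors.append((0.0, 0.0, w, h))
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4) -- 3 scales x 3 ratios
```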

Anchor-free

In object detection, anchor-free methods do not rely on predefined anchors to detect objects in an image. Instead, they directly predict the bounding boxes and class probabilities without the need for anchor boxes. Anchor-free methods have gained popularity due to their simplicity and flexibility in handling objects of various sizes and aspect ratios. They have been successfully applied in object detection models such as CenterNet and FCOS.

AP

Average Precision: measures detection accuracy by calculating the area under the precision-recall curve.

AP@[0.5:0.95]

Average Precision averaged over IoU thresholds from 0.5 to 0.95 (in steps of 0.05)

APlarge

AP for large objects (area > 96²)

APmedium

AP for medium objects (32² < area < 96²)

APsmall

AP for small objects (area < 32²)

Backbone

The backbone of a neural network refers to the main architecture or structure of the network. It typically consists of multiple layers or modules that extract features from the input data.

Batch Normalization

Batch normalization is a technique that normalizes the activations of each batch in a neural network. It helps to stabilize and speed up the training process by reducing internal covariate shift and allowing higher learning rates. Batch normalization is commonly used in deep neural networks.
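
A minimal NumPy sketch of the training-time computation (the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch to zero mean / unit variance per feature,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 10) * 5 + 3   # batch of 32 samples, 10 features
y = batch_norm(x)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1
```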

BoF

Bag of Freebies – methods that improve accuracy by changing the training process, not the inference cost

Bottleneck

In machine learning, a bottleneck refers to a layer or a set of layers in a neural network that has a smaller number of units compared to the preceding and succeeding layers. Bottlenecks are often used in architectures like ResNet to reduce the computational complexity and memory requirements of the network. They can also act as a bottleneck for information flow, forcing the network to learn more compact and informative representations.
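
A sketch of the ResNet-style bottleneck pattern (normalization, activations, and the skip connection are omitted for brevity):

```python
import torch
import torch.nn as nn

# A 1x1 conv shrinks 256 channels to 64, the 3x3 conv then works in the
# cheap 64-channel space, and a final 1x1 conv expands back to 256.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 256, kernel_size=1),
)

x = torch.randn(1, 256, 32, 32)
print(bottleneck(x).shape)   # torch.Size([1, 256, 32, 32])
```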

CBN

Cross Batch Normalization – takes the mean and standard deviation of the last four batches

CIoU

Complete Intersection over Union (CIoU) is an extension of the Intersection over Union (IoU) metric used in object detection. It not only measures the overlap between two bounding boxes but also takes into account the distance between the central points and the aspect ratio of the bounding boxes. It’s a more comprehensive metric that can distinguish between different relative positions and aspect ratios of the boxes.
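
A NumPy sketch of \( \text{CIoU} = \text{IoU} - \frac{\rho^2}{c^2} - \alpha v \) for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
import numpy as np

def ciou(box1, box2):
    """CIoU between two boxes given as (x1, y1, x2, y2)."""
    # Plain IoU.
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    iou = inter / (w1 * h1 + w2 * h2 - inter)

    # Squared distance between the centers (rho^2) over the squared
    # diagonal (c^2) of the smallest box enclosing both.
    rho2 = ((box1[0] + box1[2]) - (box2[0] + box2[2])) ** 2 / 4 \
         + ((box1[1] + box1[3]) - (box2[1] + box2[3])) ** 2 / 4
    cw = max(box1[2], box2[2]) - min(box1[0], box2[0])
    ch = max(box1[3], box2[3]) - min(box1[1], box2[1])
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term.
    v = (4 / np.pi ** 2) * (np.arctan(w2 / h2) - np.arctan(w1 / h1)) ** 2
    alpha = v / (1 - iou + v)
    return iou - rho2 / c2 - alpha * v

print(ciou((0, 0, 10, 10), (2, 2, 12, 12)))
```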

CmBN

Cross Iteration Mini Batch Normalization – takes CBN further by assuming four mini-batches inside a single batch

CSP

Cross-Stage Partial Connection

CSPNet

CSPNet (Cross Stage Partial Network) is a convolutional neural network architecture that improves the performance of object detection tasks. It introduces a cross stage partial connection module, which enhances the information flow between different stages of the network. CSPNet has been shown to achieve state-of-the-art results on various object detection benchmarks.

DenseNet

DenseNet is a convolutional neural network architecture that connects each layer to every other layer in a feed-forward fashion. It is known for its dense connectivity pattern, where each layer receives feature maps from all preceding layers. DenseNet has been shown to improve gradient flow, encourage feature reuse, and reduce the number of parameters compared to traditional convolutional neural networks.

DFL

Distribution Focal Loss – builds on Focal Loss, which has proven effective at balancing training by increasing the loss on hard-to-classify examples

Dropout

Dropout is a regularization technique that randomly sets a fraction of the input units to zero during training. It helps to prevent overfitting by reducing the co-adaptation of neurons and encouraging the network to learn more robust features. Dropout is commonly used in deep neural networks.
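
A NumPy sketch of the commonly used inverted variant:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during
    training and rescale the survivors so the expected value is unchanged."""
    if not training or p == 0.0:
        return x
    mask = np.random.rand(*x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones((2, 8))
print(dropout(x, p=0.5))   # roughly half the entries are 0, the rest 2.0
```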

Head

The head of a neural network refers to the final layers or modules that are responsible for producing the output predictions. It takes the features extracted by the backbone and processes them to generate the desired output.

IID

Independent and identically distributed data

Image Segmentation

Encompasses Instance Segmentation (things) and Semantic Segmentation (stuff)

Instance Segmentation

Studies things (e.g. Mask R-CNN, Faster R-CNN, PANet, YOLACT), measurement: AP

IoU

Intersection over Union – the area of overlap between two bounding boxes divided by the area of their union
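
For two axis-aligned boxes in (x1, y1, x2, y2) form, a minimal sketch:

```python
def iou(box1, box2):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```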

mAP

mean Average Precision – AP averaged over all classes

Mosaic Data Augmentation

Mosaic data augmentation is a technique used in computer vision tasks, such as object detection, to improve the performance of deep learning models. It involves combining multiple images into a single mosaic image and using it as training data. Mosaic data augmentation helps to increase the diversity and complexity of the training data, leading to better generalization and robustness of the model.
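
A simplified NumPy sketch; real implementations also jitter the crop center and remap the bounding-box labels, both of which are omitted here:

```python
import numpy as np

def mosaic(imgs, size=416):
    """Paste crops of four equally sized images into the quadrants
    of a single canvas."""
    assert len(imgs) == 4
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    canvas[:half, :half] = imgs[0][:half, :half]   # top-left
    canvas[:half, half:] = imgs[1][:half, :half]   # top-right
    canvas[half:, :half] = imgs[2][:half, :half]   # bottom-left
    canvas[half:, half:] = imgs[3][:half, :half]   # bottom-right
    return canvas

imgs = [np.random.randint(0, 255, (416, 416, 3), np.uint8) for _ in range(4)]
print(mosaic(imgs).shape)  # (416, 416, 3)
```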

Neck

The neck of a neural network refers to an intermediate set of layers or modules that connect the backbone and the head. It is responsible for further refining the features extracted by the backbone before passing them to the head for final processing.
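
Putting the Backbone, Neck, and Head entries together, a toy sketch of the wiring (the layer sizes and the 85-channel output are illustrative, not from any particular model):

```python
import torch
import torch.nn as nn

# Toy detector layout: the backbone extracts features, the neck refines
# them, and the head maps them to the final predictions. Real models use
# far deeper modules; this only shows how the three parts connect.
backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
neck = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
head = nn.Conv2d(64, 85, 1)   # e.g. 85 = 4 box coords + 1 objectness + 80 classes

x = torch.randn(1, 3, 416, 416)
out = head(neck(backbone(x)))
print(out.shape)   # torch.Size([1, 85, 416, 416])
```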

NICO

Non-IID Image dataset with Contexts

NMS

Non-Maximum Suppression is a technique used in object detection to eliminate redundant and overlapping bounding boxes. It selects the most probable bounding box and eliminates any box that has a high overlap (as measured by the Intersection over Union (IoU) metric) with the chosen box.
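
A greedy NumPy sketch of the procedure described above (the threshold value is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining
    box whose IoU with it exceeds iou_thresh, then repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]   # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        ix1 = np.maximum(x1[i], x1[order[1:]])
        iy1 = np.maximum(y1[i], y1[order[1:]])
        ix2 = np.minimum(x2[i], x2[order[1:]])
        iy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, ix2 - ix1) * np.maximum(0, iy2 - iy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- the second box overlaps the first
```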

Non-IID

Data that is not independent and identically distributed

Padding

Padding is the process of adding extra pixels around the input image or feature map. It is commonly used in convolutional neural networks to preserve spatial dimensions and prevent information loss at the edges of the image. Padding can be done with zeros (zero-padding) or with values from the original image (reflective padding or symmetric padding).
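
A small PyTorch illustration of zero-padding and reflective padding:

```python
import torch
import torch.nn.functional as F

x = torch.arange(9.0).reshape(1, 1, 3, 3)        # (batch, channel, H, W)
zero = F.pad(x, (1, 1, 1, 1), mode="constant")   # zero-padding
refl = F.pad(x, (1, 1, 1, 1), mode="reflect")    # reflective padding
print(zero.shape, refl.shape)                    # both: (1, 1, 5, 5)
```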

PANet

Path Aggregation Network

Panoptic Segmentation

Studies both things and stuff (most models are based on Mask R-CNN)

Pooling

Pooling is a downsampling operation that reduces the spatial dimensions of the input feature map. It is commonly used to reduce the computational complexity of the network and to extract the most important features. The most common types of pooling are max pooling and average pooling, which take the maximum or average value within a pooling window, respectively.
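
A small PyTorch illustration on a 4x4 feature map:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

print(nn.MaxPool2d(kernel_size=2)(x))   # [[6, 8], [14, 16]]
print(nn.AvgPool2d(kernel_size=2)(x))   # [[3.5, 5.5], [11.5, 13.5]]
```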

Precision

Relevant retrieved / retrieved, i.e. \( \text{Precision} = \frac{TP}{TP + FP} \)

Recall

Relevant retrieved / relevant, i.e. \( \text{Recall} = \frac{TP}{TP + FN} \)
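
A hypothetical worked example covering both precision and recall (the counts are made up):

```python
# A detector retrieves 8 boxes, 5 of them correct (TP) and 3 wrong (FP);
# there are 10 ground-truth objects, so 5 were missed (FN).
tp, fp, fn = 5, 3, 5
precision = tp / (tp + fp)   # 5 / 8  = 0.625
recall = tp / (tp + fn)      # 5 / 10 = 0.5
print(precision, recall)
```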

Region Proposal

Region proposal is a technique used in object detection to generate potential bounding boxes around objects in an image. It helps to narrow down the search space for the object detector by proposing regions that are likely to contain objects. Region proposal methods, such as Selective Search and EdgeBoxes, use various algorithms to generate these potential regions based on image features and similarity measures.

Residual

\( y = F(x) + x \)

SAM

Spatial Attention Module

SAT

Self Adversarial Training

Semantic Segmentation

Studies stuff (e.g. SegNet, U-Net, DeconvNet), measurement: IoU

SENet

Squeeze and Excitation Network – learns which channels are more or less important in a feature map

SiLU

Sigmoid Linear Unit, aka Swish: \( \text{SiLU}(x) = x \cdot \sigma(x) \)

Skip Connection

Skip connection, also known as residual connection, is a technique that adds the input of a layer to the output of a subsequent layer. It allows the network to learn residual functions, which can help to alleviate the vanishing gradient problem and improve the flow of gradients during training. Skip connections are commonly used in deep residual networks (ResNet).
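
A minimal PyTorch residual block; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The input is added back to the output of two convolutions,
    so the block learns y = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # the skip connection

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```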

SPP

Spatial Pyramid Pooling – increases the receptive field by pooling the same feature map at multiple scales and concatenating the results

Stride

Stride refers to the number of pixels the convolutional kernel moves at each step during the convolution operation. A stride of 1 means the kernel moves one pixel at a time, while a stride of 2 means the kernel moves two pixels at a time. Stride affects the output size of the feature map, as well as the amount of computation required.
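
For an input of width \( W \), kernel size \( K \), padding \( P \), and stride \( S \), the output width is \( \lfloor (W - K + 2P) / S \rfloor + 1 \). For example, a 224-pixel-wide input with a 3x3 kernel, padding 1, and stride 2 gives \( \lfloor (224 - 3 + 2) / 2 \rfloor + 1 = 112 \).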

Upsampling

Upsampling is the process of increasing the spatial dimensions of an image or feature map. It is commonly used in tasks such as image super-resolution, semantic segmentation, and generative modeling. Upsampling can be done using techniques such as transposed convolution, nearest-neighbor interpolation, or bilinear interpolation.
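
A small PyTorch illustration using F.interpolate:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])          # (1, 1, 2, 2)

nearest = F.interpolate(x, scale_factor=2, mode="nearest")
bilinear = F.interpolate(x, scale_factor=2, mode="bilinear",
                         align_corners=False)
print(nearest[0, 0])    # each input value repeated in a 2x2 block
print(bilinear.shape)   # torch.Size([1, 1, 4, 4])
```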