Abbreviations

1x1 Filter

A 1x1 filter, also known as a pointwise convolution, applies a linear transformation to the input feature map, combining information from different channels or reducing the number of channels.
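
A minimal PyTorch sketch (the channel counts are arbitrary):

```python
import torch
import torch.nn as nn

# A 1x1 convolution mixes information across channels at every spatial
# position; here it reduces 256 channels to 64 without touching H or W.
pointwise = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)

x = torch.randn(1, 256, 32, 32)   # (batch, channels, height, width)
y = pointwise(x)
print(y.shape)                    # torch.Size([1, 64, 32, 32])
```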

Anchor

In object detection, an anchor is a predefined bounding box that is used as a reference for detecting objects in an image. Anchors are typically defined at different scales and aspect ratios to handle objects of various sizes and shapes. During training, the network learns to adjust the anchors to better fit the objects in the image. Anchors play a crucial role in the region proposal network (RPN) of object detection models such as Faster R-CNN and RetinaNet.
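
A simplified sketch of how RPN-style anchors can be generated; the base size, scales, and aspect ratios below are illustrative, not taken from any particular model:

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchors centered at the origin for every
    combination of scale and aspect ratio (here, ratio = h / w)."""
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:
            w = np.sqrt(area / ratio)   # keep the area fixed per scale
            h = w * ratio
            anchors.append((0.0, 0.0, w, h))
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4) -- 3 scales x 3 ratios
```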

Anchor-free

In object detection, anchor-free methods do not rely on predefined anchors to detect objects in an image. Instead, they directly predict the bounding boxes and class probabilities without the need for anchor boxes. Anchor-free methods have gained popularity due to their simplicity and flexibility in handling objects of various sizes and aspect ratios. They have been successfully applied in object detection models such as CenterNet and FCOS.

AP

Average Precision: measures detection accuracy by calculating the area under the precision-recall curve.

AP@[0.5:0.95]

Average Precision averaged over IoU thresholds from 0.5 to 0.95 (in steps of 0.05)

APlarge

AP for large objects (area > 96²)

APmedium

AP for medium objects (32² < area < 96²)

APsmall

AP for small objects (area < 32²)

Backbone

The backbone of a neural network refers to the main architecture or structure of the network. It typically consists of multiple layers or modules that extract features from the input data.

Batch Normalization

Batch normalization is a technique that normalizes the activations of each batch in a neural network. It helps to stabilize and speed up the training process by reducing internal covariate shift and allowing higher learning rates. Batch normalization is commonly used in deep neural networks.
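
A minimal NumPy sketch of the training-time computation (the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch to zero mean / unit variance per feature,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 10) * 5 + 3   # batch of 32 samples, 10 features
y = batch_norm(x)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1
```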

BoF

Bag of Freebies – methods that improve accuracy by changing the training process, not the inference cost

Bottleneck

In machine learning, a bottleneck refers to a layer or a set of layers in a neural network that has a smaller number of units compared to the preceding and succeeding layers. Bottlenecks are often used in architectures like ResNet to reduce the computational complexity and memory requirements of the network. They can also act as a bottleneck for information flow, forcing the network to learn more compact and informative representations.
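
A sketch of the ResNet-style bottleneck pattern (normalization, activations, and the skip connection are omitted for brevity):

```python
import torch
import torch.nn as nn

# A 1x1 conv shrinks 256 channels to 64, the 3x3 conv then works in the
# cheap 64-channel space, and a final 1x1 conv expands back to 256.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 256, kernel_size=1),
)

x = torch.randn(1, 256, 32, 32)
print(bottleneck(x).shape)   # torch.Size([1, 256, 32, 32])
```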

CBN

Cross Batch Normalization – takes the mean and standard deviation of the last four batches

CIoU

Complete Intersection over Union (CIoU) is an extension of the Intersection over Union (IoU) metric used in object detection. It not only measures the overlap between two bounding boxes but also takes into account the distance between the central points and the aspect ratio of the bounding boxes. It’s a more comprehensive metric that can distinguish between different relative positions and aspect ratios of the boxes.
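
A NumPy sketch of \( \text{CIoU} = \text{IoU} - \frac{\rho^2}{c^2} - \alpha v \) for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
import numpy as np

def ciou(box1, box2):
    """CIoU between two boxes given as (x1, y1, x2, y2)."""
    # Plain IoU.
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    iou = inter / (w1 * h1 + w2 * h2 - inter)

    # Squared distance between the centers (rho^2) over the squared
    # diagonal (c^2) of the smallest box enclosing both.
    rho2 = ((box1[0] + box1[2]) - (box2[0] + box2[2])) ** 2 / 4 \
         + ((box1[1] + box1[3]) - (box2[1] + box2[3])) ** 2 / 4
    cw = max(box1[2], box2[2]) - min(box1[0], box2[0])
    ch = max(box1[3], box2[3]) - min(box1[1], box2[1])
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term.
    v = (4 / np.pi ** 2) * (np.arctan(w2 / h2) - np.arctan(w1 / h1)) ** 2
    alpha = v / (1 - iou + v)
    return iou - rho2 / c2 - alpha * v

print(ciou((0, 0, 10, 10), (2, 2, 12, 12)))
```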

CmBN

Cross Iteration Mini Batch Normalization – takes CBN further by assuming four mini-batches inside a single batch

CSP

Cross-Stage Partial Connection

CSPNet

CSPNet (Cross Stage Partial Network) is a convolutional neural network architecture that improves the performance of object detection tasks. It introduces a cross stage partial connection module, which enhances the information flow between different stages of the network. CSPNet has been shown to achieve state-of-the-art results on various object detection benchmarks.

DenseNet

DenseNet is a convolutional neural network architecture that connects each layer to every other layer in a feed-forward fashion. It is known for its dense connectivity pattern, where each layer receives feature maps from all preceding layers. DenseNet has been shown to improve gradient flow, encourage feature reuse, and reduce the number of parameters compared to traditional convolutional neural networks.

DFL

Distribution Focal Loss – builds on Focal Loss, which has proven effective at balancing training by increasing the loss on hard-to-classify examples

Dropout

Dropout is a regularization technique that randomly sets a fraction of the input units to zero during training. It helps to prevent overfitting by reducing the co-adaptation of neurons and encouraging the network to learn more robust features. Dropout is commonly used in deep neural networks.
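
A NumPy sketch of the commonly used inverted variant:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during
    training and rescale the survivors so the expected value is unchanged."""
    if not training or p == 0.0:
        return x
    mask = np.random.rand(*x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones((2, 8))
print(dropout(x, p=0.5))   # roughly half the entries are 0, the rest 2.0
```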

Head

The head of a neural network refers to the final layers or modules that are responsible for producing the output predictions. It takes the features extracted by the backbone and processes them to generate the desired output.

IID

Independent and identically distributed data

Image Segmentation

Encompasses Instance Segmentation (things) and Semantic Segmentation (stuff)

Instance Segmentation

Studies things (e.g. Mask R-CNN, Faster R-CNN, PANet, YOLACT), measurement: AP

IoU

Intersection over Union – the area of overlap between two bounding boxes divided by the area of their union
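
For two axis-aligned boxes in (x1, y1, x2, y2) form, a minimal sketch:

```python
def iou(box1, box2):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```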

mAP

mean Average Precision – AP averaged over all classes

Mosaic Data Augmentation

Mosaic data augmentation is a technique used in computer vision tasks, such as object detection, to improve the performance of deep learning models. It involves combining multiple images into a single mosaic image and using it as training data. Mosaic data augmentation helps to increase the diversity and complexity of the training data, leading to better generalization and robustness of the model.
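
A simplified NumPy sketch; real implementations also jitter the crop center and remap the bounding-box labels, both of which are omitted here:

```python
import numpy as np

def mosaic(imgs, size=416):
    """Paste crops of four equally sized images into the quadrants
    of a single canvas."""
    assert len(imgs) == 4
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    canvas[:half, :half] = imgs[0][:half, :half]   # top-left
    canvas[:half, half:] = imgs[1][:half, :half]   # top-right
    canvas[half:, :half] = imgs[2][:half, :half]   # bottom-left
    canvas[half:, half:] = imgs[3][:half, :half]   # bottom-right
    return canvas

imgs = [np.random.randint(0, 255, (416, 416, 3), np.uint8) for _ in range(4)]
print(mosaic(imgs).shape)  # (416, 416, 3)
```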

Neck

The neck of a neural network refers to an intermediate set of layers or modules that connect the backbone and the head. It is responsible for further refining the features extracted by the backbone before passing them to the head for final processing.
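
Putting the Backbone, Neck, and Head entries together, a toy sketch of the wiring (the layer sizes and the 85-channel output are illustrative, not from any particular model):

```python
import torch
import torch.nn as nn

# Toy detector layout: the backbone extracts features, the neck refines
# them, and the head maps them to the final predictions. Real models use
# far deeper modules; this only shows how the three parts connect.
backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
neck = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
head = nn.Conv2d(64, 85, 1)   # e.g. 85 = 4 box coords + 1 objectness + 80 classes

x = torch.randn(1, 3, 416, 416)
out = head(neck(backbone(x)))
print(out.shape)   # torch.Size([1, 85, 416, 416])
```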

NICO

Non-IID Image dataset with Contexts

NMS

Non-Maximum Suppression is a technique used in object detection to eliminate redundant and overlapping bounding boxes. It selects the most probable bounding box and eliminates any box that has a high overlap (as measured by the Intersection over Union (IoU) metric) with the chosen box.
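
A greedy NumPy sketch of the procedure described above (the threshold value is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining
    box whose IoU with it exceeds iou_thresh, then repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]   # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        ix1 = np.maximum(x1[i], x1[order[1:]])
        iy1 = np.maximum(y1[i], y1[order[1:]])
        ix2 = np.minimum(x2[i], x2[order[1:]])
        iy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, ix2 - ix1) * np.maximum(0, iy2 - iy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- the second box overlaps the first
```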

Non-IID

Data that is not independent and identically distributed

Padding

Padding is the process of adding extra pixels around the input image or feature map. It is commonly used in convolutional neural networks to preserve spatial dimensions and prevent information loss at the edges of the image. Padding can be done with zeros (zero-padding) or with values from the original image (reflective padding or symmetric padding).
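
A small PyTorch illustration of zero-padding and reflective padding:

```python
import torch
import torch.nn.functional as F

x = torch.arange(9.0).reshape(1, 1, 3, 3)        # (batch, channel, H, W)
zero = F.pad(x, (1, 1, 1, 1), mode="constant")   # zero-padding
refl = F.pad(x, (1, 1, 1, 1), mode="reflect")    # reflective padding
print(zero.shape, refl.shape)                    # both: (1, 1, 5, 5)
```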

PANet

Path Aggregation Network

Panoptic Segmentation

Studies both things and stuff (most models are based on Mask R-CNN)

Pooling

Pooling is a downsampling operation that reduces the spatial dimensions of the input feature map. It is commonly used to reduce the computational complexity of the network and to extract the most important features. The most common types of pooling are max pooling and average pooling, which take the maximum or average value within a pooling window, respectively.
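
A small PyTorch illustration on a 4x4 feature map:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

print(nn.MaxPool2d(kernel_size=2)(x))   # [[6, 8], [14, 16]]
print(nn.AvgPool2d(kernel_size=2)(x))   # [[3.5, 5.5], [11.5, 13.5]]
```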

Precision

Relevant retrieved / retrieved, i.e. \( \text{Precision} = \frac{TP}{TP + FP} \)

Recall

Relevant retrieved / relevant, i.e. \( \text{Recall} = \frac{TP}{TP + FN} \)
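
A hypothetical worked example covering both precision and recall (the counts are made up):

```python
# A detector retrieves 8 boxes, 5 of them correct (TP) and 3 wrong (FP);
# there are 10 ground-truth objects, so 5 were missed (FN).
tp, fp, fn = 5, 3, 5
precision = tp / (tp + fp)   # 5 / 8  = 0.625
recall = tp / (tp + fn)      # 5 / 10 = 0.5
print(precision, recall)
```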

Region Proposal

Region proposal is a technique used in object detection to generate potential bounding boxes around objects in an image. It helps to narrow down the search space for the object detector by proposing regions that are likely to contain objects. Region proposal methods, such as Selective Search and EdgeBoxes, use various algorithms to generate these potential regions based on image features and similarity measures.

Residual

\( y = F(x) + x \)

SAM

Spatial Attention Module

SAT

Self Adversarial Training

Semantic Segmentation

Studies stuff (e.g. SegNet, U-Net, DeconvNet), measurement: IoU

SENet

Squeeze and Excitation Network – learns which channels are more or less important in a feature map

SiLU

Sigmoid Linear Unit, aka Swish: \( \text{SiLU}(x) = x \cdot \sigma(x) \)

Skip Connection

Skip connection, also known as residual connection, is a technique that adds the input of a layer to the output of a subsequent layer. It allows the network to learn residual functions, which can help to alleviate the vanishing gradient problem and improve the flow of gradients during training. Skip connections are commonly used in deep residual networks (ResNet).
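
A minimal PyTorch residual block; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The input is added back to the output of two convolutions,
    so the block learns y = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # the skip connection

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```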

SPP

Spatial Pyramid Pooling – increases the receptive field by pooling the same feature map at multiple scales and concatenating the results

Stride

Stride refers to the number of pixels the convolutional kernel moves at each step during the convolution operation. A stride of 1 means the kernel moves one pixel at a time, while a stride of 2 means the kernel moves two pixels at a time. Stride affects the output size of the feature map, as well as the amount of computation required.
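
For an input of width \( W \), kernel size \( K \), padding \( P \), and stride \( S \), the output width is \( \lfloor (W - K + 2P) / S \rfloor + 1 \). For example, a 224-pixel-wide input with a 3x3 kernel, padding 1, and stride 2 gives \( \lfloor (224 - 3 + 2) / 2 \rfloor + 1 = 112 \).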

Upsampling

Upsampling is the process of increasing the spatial dimensions of an image or feature map. It is commonly used in tasks such as image super-resolution, semantic segmentation, and generative modeling. Upsampling can be done using techniques such as transposed convolution, nearest-neighbor interpolation, or bilinear interpolation.
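
A small PyTorch illustration using F.interpolate:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])          # (1, 1, 2, 2)

nearest = F.interpolate(x, scale_factor=2, mode="nearest")
bilinear = F.interpolate(x, scale_factor=2, mode="bilinear",
                         align_corners=False)
print(nearest[0, 0])    # each input value repeated in a 2x2 block
print(bilinear.shape)   # torch.Size([1, 1, 4, 4])
```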