Chen Zheng, Zhiwei Li, Hui Liu, Sha Huang, Yanjia Zhao, Light-resistant target detection improvement algorithm for overexposed environments, Transportation Safety and Environment, Volume 7, Issue 1, March 2025, tdaf011, https://doi.org/10.1093/tse/tdaf011
Abstract
In strong light environments, images often appear overexposed, which seriously impacts the accuracy of target detection. Most existing research, however, requires additional modules to assist in detection, which affects the timeliness of the detection process. To address the reduced accuracy and timeliness of target detection in overexposed environments, this paper proposes a real-time anti-light target detection improvement algorithm based on you-only-look-once v8n (YOLO v8n), focusing on enhancing the model's ability to extract features from overexposed images without the need for additional modules. Firstly, online overexposure enhancement is integrated into model training to simulate the overexposed images produced in such environments, improving the model's robustness when detecting in overexposed environments. Deformable convolutional networks v2 is used to improve the cross-stage partial bottleneck with two convolutions layer, addressing the poor feature extraction performance of traditional convolution on overexposed images, thereby helping the model capture targets with weakened or missing features and enhancing its ability to construct the geometric shape of targets. Secondly, large separable kernel attention is introduced to enhance the spatial pyramid pooling fast layer, strengthening the model's overall connectivity for targets with missing features. Finally, distance intersection over union is utilized to optimize the detection accuracy of overlapping targets in overexposed environments. The experimental results show that, compared with the original model, the mAP50 and mAP50–95 of the proposed model are improved by 23.2% and 15.7%, respectively, while the model size increases by only 0.3 M. The model thus improves detection accuracy while still meeting the lightweight requirements for practical deployment.
1. Introduction
Object detection, through the identification, localization and classification of diverse targets within images or video footage, aids computers in the analysis and understanding of visual content, as well as in addressing problems in alignment with real-world requirements. This is particularly pronounced within the domain of industrial automation and robotics, where it empowers the automation of intricate tasks including assembly, sorting and quality control [1]. In the practical application of object detection tasks, it is necessary to capture images of targets such as goods and personnel through a camera system. These images are then fed into a network to extract target features, and finally, the detection results are determined and output based on these features [2]. The quality of the input images affects the accuracy of the detection task to some extent. If the target images are not clear enough, the target features become difficult for the model to capture, which can lead to a degradation in the performance of the detection model. Illumination plays a crucial role in determining the clarity of target imaging [3]. In modern industrial environments, such as smart warehouses and factories, characterized by vast areas, a wide variety of goods and high densities, the use of a large number of high-brightness lamps is necessary to meet lighting requirements. However, this abundance of intense lighting can lead to the occurrence of overexposed environments. Higher exposure levels can lead to the appearance of highlight areas and fading in images, resulting in distortion and loss of corresponding details [4]. An increase in exposure level will exacerbate these issues. The degree of overexposure in images also varies under different intensities of light. Particularly in images with high exposure levels, important features may appear significantly weakened or even be entirely absent, severely compromising the accuracy of detection [5].
To mitigate the impact of overexposure, previous researchers tried to enhance images by restoring overexposed areas so that they more closely resemble images captured under conventional lighting conditions. Guo et al. [6] employed a tone-mapping algorithm in conjunction with a weighted sum of adjacent colours to separately correct the image's brightness and colour, thereby enhancing the information that has been diminished in overexposed images. Kapoor and Arora [7] extended the method of histogram equalization from greyscale images to colour images. This process involves converting the image to the Hue-Saturation-Value (HSV) colour space, decomposing it into two parts based on an exposure threshold and then applying histogram equalization to each part separately. The final enhancement range is controlled by a clipping threshold. Ma et al. [8] developed a novel aggregated Retinex propagation method, establishing a Retinex image propagation framework with shared weights. By incorporating a fusion calculation module, they achieved precise exposure correction for a single image. Afifi et al. [4] proposed a deep neural network model that utilizes a Laplacian pyramid to incrementally enhance the colours and details of overexposed images from coarse to fine, thereby optimizing the highlighted areas caused by overexposure and enhancing the overall quality of the image. Rinanto and Su [9] employed two models, each incorporating distinct attention mechanisms, to separately optimize the luminance channel and the colour channels. They then fused the outputs of these models to achieve an appropriate level of exposure in the final image. These methods are largely based on Retinex theory enhancement algorithms [10], multi-exposure image fusion and technological coupling optimization at the image level. Although the aforementioned approaches to enhancing overexposed images can improve image quality and increase clarity to enrich the attenuated image information, they also present new challenges. 1) The extra time required affects the timeliness of target detection. The enhancement methods mentioned are primarily utilized in the post-processing stage of photography, such as multi-exposure image fusion, which requires capturing several sets of images at different exposures from the same angle for subsequent fusion. Consequently, it demands additional time both for acquiring the images and for their fusion. Furthermore, in target detection tasks it may not always be feasible to capture each set of images from exactly the same angle. 2) They require a certain amount of computational resources. Both Retinex-based enhancement algorithms and multi-exposure image fusion involve relatively large models, which require a corresponding level of computational resources. However, the vast majority of platforms used for target detection tasks are unable to handle such a substantial amount of computation. 3) The effectiveness in enhancing detection accuracy is limited. For example, technological coupling optimization at the image level enhances the visual experience of overexposed images for the human eye, but may not be capable of restoring features that have already been lost. As a result, the improvement in the model's detection accuracy for overexposed images may not fully meet expectations.
Given these issues, the practical deployment of these overexposed-image enhancement methods in target detection tasks presents certain challenges. Consequently, some researchers have proposed alternative approaches that can be utilized within target detection tasks. Gao [11] solved the issue of overexposure in photographing licence plates by enhancing the dynamic range of light sensing through the use of an image sensor with a logarithmic response curve in hardware. Additionally, Gao implemented an adaptive exposure algorithm in software, allowing the model to adjust according to real-time light conditions and exposure time. Arad et al. [12] proposed a flash-no-flash controlled illumination acquisition protocol that brings out the appearance of the target scene by simultaneously acquiring two sets of images, with and without glare, and subtracting the no-flash image from the glare image at the pixel level to exclude overexposed pixels caused by glare. Yuan [13] applied a Gamma function-based adaptive brightness correction algorithm to the detection model to improve image contrast by suppressing the high-frequency components of illumination in the image and enhancing the low-frequency parts, thereby enhancing the image display quality. Yao et al. [14] isolated the S channel from the HSV image data, which has a strong ability to resist illumination, and combined it with the original red-green-blue data to create red-green-blue-saturation data, thereby enhancing the resistance of the input data to illumination. These approaches have indeed lowered the computational requirements relative to the previously discussed methods. However, they all incorporate extra computational modules within the model, which to a certain degree continue to impact the speed of real-time detection. This article utilizes a one-stage object detection algorithm, the you-only-look-once (YOLO) series of models, to achieve light-resistant target detection. By enhancing the model's capability to extract features of targets in overexposed images, the accuracy of light-resistant target detection is improved. YOLO v1 was proposed by Redmon et al. in 2015 [15], and the series has been developed continuously since then, up to the release of YOLO v8 by ultralytics in January 2023. Considering the limited computational resources available on devices in industrial applications, this paper's experiments utilize the smallest-scale YOLO v8n model as the baseline model. While only slightly increasing the model size, we strengthen the detection model's ability to extract weakened and missing features in overexposed images, thereby achieving a light-resistant target detection model that enhances performance in overexposed environments while still meeting the real-time requirements of target detection tasks.
2. Light-resistant target detection model
YOLO v8 demonstrates excellent performance in detecting various scenarios, but it still falls short in handling overexposed environments. In overexposed environments, images become overexposed due to the intense illumination, characterized by an overall excessive brightness and varying degrees of feature weakening and loss. This makes it challenging for the model to extract features, as it struggles to connect features that are missing. Consequently, the detection model is unable to learn the overall characteristics of the target, which affects the subsequent model's detection performance.
To address these issues, this paper proposes an improved model based on YOLO v8n, which is capable of enhancing detection performance in overexposed environments. Firstly, to preserve accuracy on normally illuminated targets, random online overexposure enhancement is used: before an image is input into the model, a normal light-source image from the dataset is overexposed to a random degree with a probability of 50%. This simulates detection in overexposed environments while guaranteeing detection accuracy under normal light sources. Then, the faster implementation of CSP bottleneck with 2 convolutions (C2F) layer is optimized by introducing DCNv2 to enhance the model's feature extraction effect for overexposed images and to improve the model's ability to construct the target's geometric shape [16]. Secondly, large separable kernel attention (LSKA) is introduced to optimize the spatial pyramid pooling fast (SPPF) layer, increasing the long-range dependency of the convolution process and expanding the SPPF receptive field [17]. Finally, distance intersection over union (DIoU) is utilized as the loss function to address the incorrect suppression of overlapping targets during non-maximum suppression (NMS), which is caused by feature weakening [18]. The structure of the improved model is shown in Fig. 1.

2.1. C2F module improvements
The convolution (Conv) used in the C2F layer in YOLO v8 consists of three parts: traditional two-dimensional convolution (Conv2d), two-dimensional batch normalization (BatchNorm2d) and an activation function Sigmoid linear unit (SiLU). Two-dimensional convolution typically involves using a fixed-size convolution kernel matrix to sample the image as it slides, performing element-wise multiplication with the corresponding elements of the input image and summing the results weighted by W to obtain the desired output. The formula for the calculation is presented in Equation (1), where R signifies the convolution kernel, Pn refers to the traversal of all positions within the kernel R and P0 represents each position of the kernel R on the feature map y.
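Written in the standard form consistent with these definitions (with w denoting the convolution weights and x the input feature map), the referenced formula is
\[
y\left(P_{0}\right)=\sum_{P_{n}\in R} w\left(P_{n}\right)\cdot x\left(P_{0}+P_{n}\right) \qquad (1)
\]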
As shown by the heatmap comparison in Fig. 2, Fig. 2(a) shows the heatmap resulting from the original model's (YOLO v8n) object detection, where the target features in the normal light source image are quite distinct. The model can utilize traditional convolution to extract sufficient features for object judgement. Fig. 2(b) depicts the heatmap obtained from detecting overexposed images with the original model, where the target features are weakened due to the high brightness. The model's ability to extract features from overexposed targets is relatively poor, and the features obtained are insufficient for the model's detection purposes. Fig. 2(c) shows the heatmap resulting from incorporating online overexposure augmentation during the model training process, which enhances the model's ability to extract features from overexposed images. However, traditional convolution struggles with the significant weakening of target features, especially when there is partial feature loss in the images. The effectiveness of traditional convolution in extracting image features is poor, leading to errors and omissions. During detection, this can result in the model making incorrect or missed detections, thereby affecting its performance.

Heat map comparison: (a) original model detects normal images; (b) original model detects overexposed images; (c) the online overexposure augmentation model detects overexposed images.
Aiming at the problem of the insufficient ability of traditional two-dimensional convolution to extract target features in overexposed images, this paper proposes C2F-DCNv2, which introduces deformable convolutional networks to improve the C2F layer. The idea of capturing geometric change with deformable convolutional networks is used to expand the effective receptive field and to improve the model's sampling of weakened and missing features. The structure diagram is shown in Fig. 3. DCNv1 [19] is the first version of deformable convolutional networks. The convolution is calculated in the same way as traditional convolution; the main idea is to add a learnable offset ∆Pn to the convolution kernel R, so that the sampling location changes from a regular convolution kernel region to an irregular region given by the kernel plus the offset ∆Pn. The formula for DCNv1 can be seen in Eq. (2), and a comparison of the convolutions can be seen in Fig. 4. The traditional pooling process divides the region into several uniform squares, as shown in the right panel of Fig. 4. After the division, sampling points (blue dots) with fixed positions are selected for the pooling operation, whereas the sampling points (red dots) selected by the deformable convolutional networks are variable; this sampling-point selection helps the network obtain features at different positions and enlarges the network's effective receptive field.
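Following the same notation, the DCNv1 sampling formula referenced as Eq. (2) takes the standard form
\[
y\left(P_{0}\right)=\sum_{P_{n}\in R} w\left(P_{n}\right)\cdot x\left(P_{0}+P_{n}+\Delta P_{n}\right) \qquad (2)
\]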


Convolution comparison chart: (a) traditional convolution; (b) deformable convolutional, red boxes are convolution kernel sizes, blue dots are feature sampling points and green arrows are offsets; (c) pooling point comparison, red dots are deformable convolutional sampling points.
DCNv2, continuing the idea of the DCNv1 offsets, corrects the input features from different spatial locations by introducing an additional modulation term ∆Mk, which is computed as shown in Eq. (3), where the value of ∆Mk lies in the range [0,1]. Compared with DCNv1, the modulation mechanism of DCNv2 adapts better to the image, i.e. it can be adjusted according to the learned feature magnitude to match the target shape more closely, enhancing the network's ability to model geometric transformations.
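With ∆Mk denoting the modulation term for the k-th sampling point, the DCNv2 formulation referenced as Eq. (3) has the standard form
\[
y\left(P_{0}\right)=\sum_{k=1}^{K} w\left(P_{k}\right)\cdot x\left(P_{0}+P_{k}+\Delta P_{k}\right)\cdot \Delta M_{k} \qquad (3)
\]
where K is the number of sampling points in the kernel.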
Overexposed images have some weakened and missing features due to the presence of higher brightness and exposure. Traditional convolution, with its fixed sampling points, cannot connect the remaining features during the C2F sampling process. As a result, fewer features are acquired, and the limited receptive field becomes insufficient; the model can only rely on a small number of relatively complete features within a specific region of the target for making judgements. DCNv2 can adjust the sampling points based on the offset and modulation parameters, thereby expanding the network's receptive field and reconnecting the fragmented and weakened target features. This increases the effective receptive field, as shown in Fig. 5. Fig. 5(a) represents the detection of normal image features using C2F with conventional convolution, where the blue dots indicate the sampling points at which the conventional convolution successfully captures effective features; the contours are clear, the characteristics distinct and the information sufficient. Fig. 5(b) depicts the features obtained by C2F when detecting overexposed images. The white dots indicate the effective feature points that are missing compared with normal images. Due to overexposure, the image features are weakened or lost, the target features are no longer clear and some contours are lost. The effective features that can be extracted using traditional convolution are insufficient for accurate detection. Fig. 5(c) illustrates the features of overexposed images detected by the C2F-DCNv2 model. The orange dots indicate the effective feature points acquired by C2F-DCNv2, where the positions of the sampling points can be altered using offsets to capture more effective feature points. Compared with the original C2F layer, the improved C2F-DCNv2 is more adept at constructing an effective receptive field that closely resembles the target's geometric shape. This enhances the detection model's performance on overexposed images, as shown in Fig. 6.
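As a concrete illustration of how a modulated deformable convolution can stand in for the Conv block inside C2F, the following minimal PyTorch sketch uses torchvision's DeformConv2d; the module layout and names are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DCNv2Conv(nn.Module):
    """Illustrative DCNv2-style replacement for the Conv block inside a C2F
    bottleneck: a plain conv predicts per-position offsets and modulation
    masks, and a modulated deformable conv samples the shifted positions."""

    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        p = k // 2
        self.k2 = k * k
        # 2 offset values (x, y) plus 1 modulation value per kernel position
        self.offset_mask = nn.Conv2d(c_in, 3 * self.k2, k, s, p)
        self.dcn = DeformConv2d(c_in, c_out, k, s, p)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = torch.split(om, [2 * self.k2, self.k2], dim=1)
        mask = torch.sigmoid(mask)  # modulation term, constrained to [0, 1] like ΔM_k
        return self.act(self.bn(self.dcn(x, offset, mask)))


# Example: a 64-channel feature map passed through the deformable block
feat = torch.randn(1, 64, 80, 80)
out = DCNv2Conv(64, 64)(feat)   # shape preserved: (1, 64, 80, 80)
```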

Schematic of sampling points: (a) normal images detected by C2F; (b) overexposed images detected by C2F; (c) overexposed images detected by C2F-DCNv2.

Heat map comparison: (a) normal images detected by C2F; (b) overexposed images detected by C2F; (c) overexposed images detected by C2F-DCNv2.
2.2. SPPF module enhancement
The last layer in the YOLO v8 backbone is the SPPF, which is composed of two primary structures: Conv and MaxPool2d. It processes the features of different scales generated by the backbone network, pooling these features and connecting them to achieve a feature map-level fusion of local and global features. This enhances the network's ability to capture multi-scale information and improves the model's detection performance for targets of varying scales. To enhance the model's ability to capture weakened and missing features in overexposed images, we propose SPPF-LSKA, which strengthens the model's feature capture capability with a visual attention network (VAN) incorporating a large kernel attention (LKA) mechanism [20]. However, because of the computational cost associated with enlarging the convolutional kernels, we opted for the computationally less-intensive LSKA to improve SPPF. LSKA is applied in the convolution process of the VAN to enhance the feature capture capability. By adjusting the size of the convolutional kernels, it increases the long-range dependency of the convolution process and expands the receptive field of SPPF, making the VAN more biased towards capturing the target's shape rather than its internal features. Interacting with the preceding C2F-DCNv2 layer, it strengthens the model's detection of weakened features in overexposed images. Additionally, it decomposes the 2D kernels of the depthwise convolutional layer into stacked horizontal and vertical 1D kernels to reduce the computational load, balancing performance with computational cost. The SPPF-LSKA structure diagram is shown in Fig. 7.
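A minimal sketch of this separable large-kernel attention idea is given below; the kernel sizes (a 5-element local pair and a 7-element dilated pair with dilation 3) and the module layout are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LSKA(nn.Module):
    """Illustrative large separable kernel attention block: the 2D depthwise
    large-kernel convolutions of LKA are decomposed into cascaded horizontal
    and vertical 1D depthwise convolutions, followed by a pointwise conv,
    and the result weights the input features element-wise."""

    def __init__(self, dim, local_k=5, dilated_k=7, dilation=3):
        super().__init__()
        # Local 1D depthwise pair (short-range context)
        self.dw_h = nn.Conv2d(dim, dim, (1, local_k), padding=(0, local_k // 2), groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, (local_k, 1), padding=(local_k // 2, 0), groups=dim)
        # Dilated 1D depthwise pair (long-range dependency, enlarged receptive field)
        pad = dilation * (dilated_k // 2)
        self.dwd_h = nn.Conv2d(dim, dim, (1, dilated_k), padding=(0, pad),
                               dilation=(1, dilation), groups=dim)
        self.dwd_v = nn.Conv2d(dim, dim, (dilated_k, 1), padding=(pad, 0),
                               dilation=(dilation, 1), groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)  # channel mixing

    def forward(self, x):
        attn = self.dw_v(self.dw_h(x))
        attn = self.dwd_v(self.dwd_h(attn))
        attn = self.pw(attn)
        return attn * x  # attention weighting of the input features


# Example: applying the attention to the concatenated SPPF pooling outputs
pooled = torch.randn(1, 256, 20, 20)
weighted = LSKA(256)(pooled)   # shape preserved: (1, 256, 20, 20)
```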

2.3. IoU optimization
Overexposed images, due to their high brightness, lead to a certain degree of feature weakening, and the contours may also appear partially missing. When multiple targets are piled up and overlapping in one place, the overlapping contours and features become weakened or missing, making it difficult for the model to discriminate between targets. This can lead to the misdetection of multiple targets as a single target, and the NMS may inadvertently suppress correct detections. To address these problems, the DIoU loss function is used to optimize the issue of false positives and false negatives caused by overlapping targets with missing features in overexposed images.
The DIoU loss function, when calculated, takes into account the overlap, distance and scale between the predicted box and the ground truth box, as shown in Equation (4). Here, b and b^gt represent the centre points of the predicted box and the ground truth box, respectively, and ρ denotes the Euclidean distance between the two points b and b^gt. DIoU uses the minimum bounding rectangle that encloses both the predicted and ground truth boxes, whose longest diagonal length is denoted by C (as illustrated in Fig. 8).
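Using these symbols, the DIoU loss referenced as Equation (4) has the standard form
\[
\mathcal{L}_{\mathrm{DIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(b,\, b^{gt}\right)}{C^{2}} \qquad (4)
\]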
When NMS suppresses redundant bounding boxes in the last step of the target detection algorithm, it is prone to erroneous suppression, especially when targets overlap or occlude one another. When using DIoU, NMS considers both the overlapping area of the bounding boxes and the distance between their centre points as determining factors. This effectively alleviates the erroneous suppression of neighbouring detected bounding boxes caused by a lack of features, and helps to address the detection difficulties caused by the weakened and missing features of overlapping targets in overexposed images.
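For reference, the DIoU-based NMS criterion proposed alongside the DIoU loss [18] keeps or suppresses a candidate box B_i with score s_i according to
\[
s_{i}=
\begin{cases}
s_{i}, & \mathrm{IoU}\left(\mathcal{M}, B_{i}\right)-\dfrac{\rho^{2}\left(\mathcal{M}, B_{i}\right)}{C^{2}} < \varepsilon \\
0, & \text{otherwise}
\end{cases}
\]
where M is the current highest-scoring box and ε is the NMS threshold.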

3. Experimental results and analyses
3.1. Experimental environment
The experimental environment of this paper is as follows: on the Windows 10 operating system, the development environment is built with Python 3.8, CUDA 11.8 and torch 2.0.1 as the framework, and an NVIDIA GeForce RTX 4060 Ti 16 GB graphics card is used for training. YOLO v8n is selected as the base model, the batch size is set to 64, the brightness change amplitude value hsv_v is set to 0.7, the image input size defaults to 640×640, the initial learning rate is 0.01, the early-stop parameter is set to 50 and 200 epochs are trained for each set of experiments.
Batch refers to the number of samples selected for each training session of the model. The size of the batch affects the training speed and accuracy of the model. Within a certain range, a larger batch size generally leads to better training results. However, excessively large values can result in an increase in the number of epochs required for training and potential issues with insufficient GPU memory. Additionally, since computer storage is typically binary, setting the batch size to a power of two can facilitate faster parallel computation. Given the limited hardware conditions of the computer used in this study, a batch size of 64 was chosen as the final value.
The brightness variation amplitude value, hsv_v, represents the degree of change during online overexposure augmentation. The higher the value, the more intense the brightness enhancement during the overexposure process, thereby simulating a higher intensity of overexposure. After conducting several sets of comparative experiments ranging from 0.4 to 0.9, the results are shown in Fig. 9(a). The optimal value of 0.7 was selected as the final parameter.

Stochastic gradient descent (SGD) is suitable for online learning scenarios, allowing for real-time updates based on sample changes [21], which aligns well with the online overexposure augmentation used in this model. Its computational and memory efficiency can also alleviate the hardware load of the computer used. The initial learning rate is typically set between 0.01 and 0.001. In this paper, the larger value of 0.01 is chosen to provide a relatively stable training process, which facilitates the gradual decay of the learning rate by the optimizer to obtain more appropriate values.
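For reference, the stated training configuration can be reproduced roughly as follows with the ultralytics Python API; the dataset YAML path is a placeholder, and the remaining arguments mirror the settings described above.

```python
# Minimal sketch of the training configuration described in this section.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")              # smallest-scale baseline model
model.train(
    data="kuls_warehouse.yaml",         # placeholder dataset configuration file
    epochs=200,
    batch=64,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,                           # initial learning rate
    patience=50,                        # early-stop parameter
    hsv_v=0.7,                          # brightness change amplitude
    hsv_h=0.015,
    hsv_s=0.7,
)
```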
To investigate the impact of using different kernel sizes in LSKA on performance, we conducted six sets of experiments with kernel sizes of 7, 11, 23, 35, 41 and 53, respectively, to determine the optimal kernel size. The results can be seen in Fig. 9(b).
3.2. Introduction to the dataset
The experimental dataset in this paper is adopted from the Kuls-Warehouse computer vision project taken by Ocean University in Korea [22], which can be downloaded from the Roboflow website recommended by YOLO v8. The original image set consists of 8828 images, with the main categories being goods, people and forklift trucks, which is able to simulate the environment of modern warehouses, factories and other industrial settings. The dataset is divided into training, validation and test sets at a ratio of 0.7, 0.2 and 0.1. Roboflow supports offline data enhancement techniques and provides a second version of the image set that applies the following enhancements to create three versions of each source image: 1) random rotation between −10° and +10°; 2) random shear between −15° and +15° horizontally and between −15° and +15° vertically; 3) random Gaussian blur ranging from 0 to 2.5 pixels. The original training set is thus expanded threefold. To simulate overexposed environments, the original validation and test sets are doubled, and the exposure of the expanded images is adjusted using OpenCV, with the exposure scaling factor set to 3−3.1. The final dataset consists of 18 477 images for training and 3576 images for validation.
Steffens et al. [23] created a dataset of overexposed images to simulate overexposed images captured by the camera by setting the overexposure parameters 0, +1 and +1.5 in order to directly increase the exposure of the image. This article uses the cv2.convertScaleAbs function from OpenCV to adjust the exposure of the image. This function controls the exposure of the image by multiplying each pixel value by a set scaling factor (i.e. 3−3.1 in this case). The reason for setting the scaling factor to 3−3.1 is as follows. According to the division of the brightness histogram, the range of 0−255 can be divided into five regions from dark to light: shadow (the first 5%), dark (5%−20%), midtone (20%−80%), highlight (80%−95%) and specular highlight (the last 5%). For images captured under natural light, the distribution of pixel values is based on the characteristics of the image's colours, but most of them fall within the range of 5%−95%. Overexposed images, due to their higher exposure levels, tend to have a pixel value distribution that leans towards the 50%−100% range. As the degree of overexposure increases, the pixel value distribution will more closely resemble the 100% highlight area. By setting the scaling factor to 3−3.1, we can simulate the loss of pixel values in the 30%−100% range, effectively mimicking the distortion state of overexposed images. Furthermore, through testing, it has been found that using this value to simulate overexposed images results in significant loss of image details, yet the images are not entirely unrecognizable. If the value is increased, the image distortion becomes severe and unidentifiable; if the value is decreased, it becomes difficult to simulate overexposed images with high exposure levels. Therefore, by using this value, we aim to enhance the overexposure intensity as much as possible while ensuring that the image still contains enough information for object detection purposes.
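A minimal sketch of this offline overexposure simulation, assuming a random scaling factor drawn from [3.0, 3.1] and no additive offset, is:

```python
# Offline overexposure simulation: each pixel value is multiplied by a random
# scaling factor in [3.0, 3.1]; cv2.convertScaleAbs also clips results to 0-255.
import random
import cv2

def simulate_overexposure(image):
    alpha = random.uniform(3.0, 3.1)    # scaling factor range stated in the text
    return cv2.convertScaleAbs(image, alpha=alpha, beta=0)

img = cv2.imread("warehouse_sample.jpg")    # placeholder image path
overexposed = simulate_overexposure(img)
```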
3.3. Online overexposure enhancement
This paper's experiment requires random overexposure data augmentation on the training set to simulate overexposure scenarios under different conditions. The online data augmentation technique is now used to randomly add exposure to images. Compared to offline data augmentation, online data augmentation allows the same image to present different exposure levels, better simulating the varying exposure conditions caused by different lighting intensities in real-world scenarios. The specific operations for online overexposure enhancement are as follows.
1) Obtain the pre-set change amplitude values for the three HSV channels, hsv_h, hsv_s and hsv_v (in this paper's experiment, hsv_h = 0.015, hsv_s = 0.7 and hsv_v = 0.7). Multiply each change amplitude value by a random number R between −1 and 1 and then add 1 to obtain three change values, R_h, R_s and R_v, which are always greater than or equal to 0. These values ensure that the image pixel values remain within the 0−255 range after the changes.
2) Separate the image into H, S and V channels, converting them into lookup arrays lut_h, lut_s and lut_v within the range 0−255, and then multiply each of these arrays by the corresponding change value obtained in step 1).
3) After obtaining the product for the H channel, take the remainder after division by 180. For the S channel, use the product directly. For the V channel, two cases are distinguished: when R_v is less than or equal to 1, the V channel values undergo only minor changes, simulating a normal image; when R_v is greater than 1, the brightness change amplitude increases, simulating images under overexposed conditions.
After the above processing, the probabilities of generating normal images and overexposed images during online overexposure enhancement are both 50%. The online overexposure enhancement flowchart is shown in Fig. 10.
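A minimal Python sketch of the procedure in steps 1)–3), using OpenCV lookup tables, is shown below; the structure is illustrative and the variable names follow the text rather than the paper's code.

```python
# Online overexposure enhancement following steps 1)-3): random HSV gains are
# applied through lookup tables; r_v > 1 (roughly half the time) simulates an
# overexposed image, otherwise the image stays close to normal brightness.
import random
import numpy as np
import cv2

def online_overexposure(img, hsv_h=0.015, hsv_s=0.7, hsv_v=0.7):
    # Step 1: change values R_h, R_s, R_v, always >= 0
    r_h = random.uniform(-1, 1) * hsv_h + 1
    r_s = random.uniform(-1, 1) * hsv_s + 1
    r_v = random.uniform(-1, 1) * hsv_v + 1

    # Step 2: split channels and build lookup tables over 0-255
    hue, sat, val = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2HSV))
    x = np.arange(256, dtype=np.float32)
    lut_h = ((x * r_h) % 180).astype(np.uint8)          # H channel wraps at 180
    lut_s = np.clip(x * r_s, 0, 255).astype(np.uint8)   # S channel used directly
    lut_v = np.clip(x * r_v, 0, 255).astype(np.uint8)   # V channel brightness change

    # Step 3: apply the tables and convert back to BGR
    img_hsv = cv2.merge((cv2.LUT(hue, lut_h), cv2.LUT(sat, lut_s), cv2.LUT(val, lut_v)))
    return cv2.cvtColor(img_hsv, cv2.COLOR_HSV2BGR)
```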

3.4. Performance indicators
The experiments in this paper use the metrics output by YOLO v8: recall (Recall) and mean average precision (mAP), with mAP divided into mAP50 and mAP50–95, representing the mAP at an IoU threshold of 50% and averaged over IoU thresholds from 50% to 95%, respectively. Because mAP50–95 better reflects the detection accuracy of the model, the criterion used in this paper's experiments to select the best-weights file (best.pt) is set according to a fitness score weighted towards mAP50–95.
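Assuming the default ultralytics fitness weighting, which emphasizes mAP50–95 as described, this criterion takes the form
\[
\mathrm{fitness} = 0.1\times \mathrm{mAP50} + 0.9\times \mathrm{mAP50\text{-}95}
\]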
The performance of the models is also compared in terms of the number of parameters (Parameters) and giga floating-point operations (GFLOPs).
3.5. Analysis of results
This paper presents experiments designed to improve the light-resistant target detection algorithm based on the YOLO v8n model. To demonstrate the detection performance of the model, we conducted performance comparison experiments using different versions of the YOLO series of models. The results are presented in Table 1 and Fig. 11. From the data in Table 1, it can be observed that, at the same number of epochs, v7 has the lowest detection accuracy. After analysis, it is believed that v7 has a larger model size, resulting in a relatively slower convergence speed compared with the other models [24]; even at epoch 200 it has not yet converged, making it difficult to obtain the best weights. v6 converges faster than v7 [25], with slightly higher accuracy, but it has not fully converged by epoch 200 either. v5 has a smaller model size and lower computational requirements, allowing it to converge quickly; it converges by epoch 200, enabling the acquisition of the best weights [26]. The precision demonstrated by the model in this paper is superior to the other versions of the YOLO model. However, there are still some shortcomings: in terms of computational load and model size, v5 has an advantage, as it uses only one detection head whereas v8 uses three. The smaller model size and computational load of v5 make it more advantageous for devices with extremely limited computational power. The v10n has a somewhat smaller model volume than the v8n, as well as slightly higher accuracy, and its model volume is also smaller than that of the model presented in this paper [27]. The light-resistant target detection improvement model designed in this paper, while still having some shortcomings, has a significantly higher detection accuracy than the other models. Moreover, the computational power and model size required can still meet the demands of industrial deployment, achieving a balance between lightweight design and performance. Therefore, the algorithm improved in this paper exhibits a clear advantage on this dataset. The test results are shown in Fig. 12.

Results normalized comparison graph: (a) recall comparison chart; (b) mAP50 comparison chart; (c) mAP50–95 comparison chart.

Comparison chart of test results: (a) v5; (b) v6; (c) v7; (d) v8; (e) v10; (f) our model.
Table 1. Performance comparison of different versions of the YOLO series.

| Model | Recall | mAP50 | mAP50–95 | GFLOPs | Parameters/M |
| --- | --- | --- | --- | --- | --- |
| YOLO v5n | 0.451 | 0.494 | 0.319 | 4.2 | 1.77 |
| YOLO v6n | 0.426 | 0.469 | 0.335 | 11.8 | 4.23 |
| YOLO v7n | 0.436 | 0.407 | 0.244 | 105.3 | 37.2 |
| YOLO v8n | 0.449 | 0.525 | 0.387 | 8.1 | 3.00 |
| YOLO v10n | 0.559 | 0.564 | 0.427 | 6.5 | 2.27 |
| Our YOLO | 0.695 | 0.757 | 0.544 | 8.2 | 3.31 |
To demonstrate the effectiveness of the proposed method, the generalization performance of the method was validated using two additional datasets: VOC 2012 [28] and the AutoDrive Dataset [29]. The results are shown in Table 2.
Table 2. Generalization results on the VOC 2012 and AutoDrive datasets.

| Dataset | Model | Recall | mAP50 | mAP50–95 |
| --- | --- | --- | --- | --- |
| VOC 2012 | v8 | 0.457 | 0.496 | 0.349 |
| VOC 2012 | Ours | 0.515 | 0.572 | 0.413 |
| AutoDrive | v8 | 0.573 | 0.638 | 0.452 |
| AutoDrive | Ours | 0.625 | 0.692 | 0.502 |
Table 3. Ablation experiment results.

| Model | A | B | C | D | E | Recall | mAP50 | mAP50–95 | FPS | GFLOPs | Parameters/M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model 1 | √ | | | | | 0.449 | 0.525 | 0.387 | 344 | 8.1 | 3.007 |
| Model 2 | √ | √ | | | | 0.590 | 0.644 | 0.472 | 344 | 8.1 | 3.007 |
| Model 3 | √ | √ | √ | | | 0.572 | 0.685 | 0.505 | 314 | 8.0 | 3.039 |
| Model 4 | √ | √ | | √ | | 0.550 | 0.688 | 0.504 | 320 | 8.3 | 3.280 |
| Model 5 | √ | √ | | | √ | 0.581 | 0.683 | 0.495 | 348 | 8.1 | 3.007 |
| Model 6 | √ | √ | √ | √ | | 0.583 | 0.710 | 0.521 | 309 | 8.2 | 3.312 |
| Model 7 | √ | √ | √ | √ | √ | 0.695 | 0.757 | 0.544 | 304 | 8.2 | 3.312 |
Note: The unoptimized baseline model YOLO v8n is defined as A; the online overexposure enhancement is defined as B; the C2F layer optimized by DCNv2 is defined as C; the SPPF layer optimized by LSKA is defined as D; and the use of DIoU as the loss function is defined as E.
4. Ablation experiment
The light-resistant target detection model designed in this paper primarily improves the C2F layer and SPPF in the YOLO v8n backbone and uses DIoU as the loss function. Additionally, an online data augmentation module is designed to simulate overexposed environments by randomly adjusting the brightness of the images within a certain range. To better analyse the optimization contributed by each part of the improved model and to verify the effectiveness of each improvement in enhancing model performance, this paper designs seven sets of experiments as ablation experiments to compare and analyse each improvement point. Detection speed is compared using frames per second (FPS). The evaluation metrics for model performance include Recall, mAP50, mAP50–95, FPS, GFLOPs and the number of parameters (Parameters), which represents the model size. A checkmark (√) indicates that the improvement method is used, while no checkmark indicates that the corresponding improvement is not used. The specific experimental results are shown in Table 3, and the mAP comparison charts for each group of experiments are visible in Fig. 13. From Table 3, it is clear that Model 1 refers to the original model, i.e. the baseline model YOLO v8n, with no modifications made; it uses the data augmentation provided by v8. The baseline model performs poorly in detecting overexposed images, with mAP50 and mAP50–95 values of 0.525 and 0.387, respectively, a recall rate of 0.449, an FPS of 344, a parameter count of 3.007 M and GFLOPs of 8.1. This level of detection accuracy is insufficient to meet practical detection requirements. The other models all employ the improved online overexposure enhancement module, leading to an improvement in detection accuracy; however, because of differences in the optimization measures at other stages, the detection accuracy varies. Model 2 only uses random brightness enhancement during training, without modifying the model's network structure. After introducing the online overexposure enhancement, the model learns to detect weakened features in overexposed images, effectively improving model performance. The mAP50 and mAP50–95 values increase by 11.9% and 8.5%, respectively, and the recall rate increases by 14.1%. Since the model's structure itself is not altered, there is no change in the parameter count, FPS or GFLOPs.

Normalized mAP comparison for the ablation experiments: (a) normalized comparison of mAP50; (b) normalized comparison of mAP50–95.
Model 3 introduces DCNv2 to improve the C2F layer, adaptively adjusting the receptive field to enhance the model's ability to capture the contours of overexposed images and extract target features efficiently. This improves the model's capability to construct the geometric shape of the target. Compared to Model 2, the mAP50 and mAP50–95 are improved by 4.1% and 3.3%, respectively, while the recall rate decreases by 1.8%; GFLOPs are reduced by 0.1 and the parameter count increases by only about 1%. Owing to the additional offset added during the convolution process, the computational requirement increases, resulting in a decrease of 30 FPS. Model 4 employs LSKA to enhance the SPPF layer's ability to extract features of different scales from overexposed images. Compared to Model 2, the mAP50 and mAP50–95 are improved by 4.4% and 3.2%, respectively; GFLOPs increase by 0.2, the introduction of LSKA increases the model's parameter count slightly, by 9%, and FPS is reduced by 24. Model 5 employs DIoU as the loss function, replacing CIoU, to address the issue of false positives and false negatives caused by overlapping targets with missing features in overexposed images. This enhances the accuracy of detecting weakened targets in overexposed images. Compared to Model 2, the mAP50 and mAP50–95 are improved by 3.9% and 2.3%, respectively, while the recall rate decreases by 0.9%. The FPS increases by 4, and there is no alteration to the model's structure, with no change in the parameter count or GFLOPs. Model 6 combines the improvements of the C2F and SPPF layers. Compared to Model 2, the mAP50 and mAP50–95 are improved by 6.6% and 4.9%, respectively, while the recall rate decreases by 0.7%. This comes with an increase in computational requirements, a 10.4% increase in the parameter count and a decrease in FPS of 35.
Model 7 incorporates all the aforementioned improvements. Although the parameter count increases by 10.1%, the mAP50 and mAP50–95 are improved by 11.3% and 6.8% compared to Model 2, respectively, and the recall rate increases by 10.5%. Compared to Model 1, the mAP50, mAP50–95 and recall rate are improved by 23.2%, 15.7% and 24.6%, respectively, significantly enhancing the model's ability to detect against bright light. Even though the FPS decreases by 40, the FPS of 304 still meets the real-time requirements for detection.
5. Conclusions
This paper focuses on object detection in overexposed environments and improves the YOLO v8 network model to propose a real-time anti-light object detection model suitable for overexposed environments. It designs an online overexposure enhancement module to simulate overexposed environments, introduces DCNv2 into the C2F layer and LSKA into the SPPF layer, and adopts DIoU, which is more suitable for this scenario, as the loss function. These enhancements improve the model's ability to detect targets with weakened and missing features in overexposed environments, as well as the detection accuracy of overlapping targets in such scenes. Additionally, an ablation experiment is designed to verify the effectiveness of each module in improving the model's performance. The experimental results show that, compared with the original model, the model size increases by 0.3 M and the FPS decreases from 344 to 304, but the mAP50, mAP50–95 and recall rate are improved by 23.2%, 15.7% and 24.6%, respectively.
While achieving good detection results, the model proposed in this paper does not have the optimal memory usage or computational resource requirements compared to the other models under comparison. In practical deployment, due to the limited computing resources of most devices, this leads to increased model inference times. A smaller number of parameters can save more computational costs and better reduce inference time. Therefore, reducing the model's redundant parameters and computational resource needs through methods such as pruning and distillation to lower the difficulty of actual deployment will be a key focus for future research.
Acknowledgements
This research was supported by the Natural Science Foundation of Guangdong Province (Grant No. 2022A1515010011) and the Basic and Theoretical Science and Technology Programme of Jiangmen City (Grant No. 2023JC01020) in 2023.
Author contributions
Chen Zheng designed the method. Chen Zheng and Zhiwei Li performed the data analysis and wrote the manuscript. Hui Liu provided technical guidance. Sha Huang assisted in organizing the manuscript. Yanjia Zhao contributed to the data processing.
Conflict of interest statement. None declared.