An improved RSMamba network based on multi-domain image fusion for wheelset bearing fault diagnosis under composite conditions

fault diagnosis, wheelset bearing, multi-domain fusion, deep learning, RSMamba model

Highlights

An improved RSMamba network based on multi-domain image fusion is presented.
An RGB-CF strategy based on multi-domain image fusion is proposed for generating 2D images.
The CSRA module is embedded in the RSMamba to improve the multi-label recognition capability.

List of Symbols

DL
Deep learning

SSM
State space model

STFT
Short-Time Fourier Transform

CWT
Continuous Wavelet Transform

RGB-CF strategy
Red–green–blue channel fusion strategy

GAF
Gramian angular field

Bisp
Bispectrum

ST
Stockwell Transform

CSRA
Class-specific residual attention

WBCTR
Wheelset bearing comprehensive test rig

RVRSW
Rolling and vibration rig of single wheelset

T-SNE
T-distributed stochastic neighbor embedding

1. Introduction

As railways continue to evolve rapidly, ensuring the operational safety of high-speed trains emerges as a paramount priority in railway transportation. Wheelset bearings, crucial rotating elements in train bogies, directly impact the stability and safety of trains. During operation, wheelset bearings endure challenging conditions, including alternating load impacts and complex excitations from wheels and rails. Fatigue, wear, inadequate lubrication, and other factors lead to bearing failures like cracks, spalling, and excessive wear, ultimately resulting in severe train safety incidents (Wang et al., 2024). Consequently, developing efficient and precise wheelset bearing fault diagnosis technology is crucial for safeguarding the safe operation of high-speed railroads.

Conventional methods for diagnosing faults in wheelset bearings encompass vibration analysis, temperature monitoring, and acoustic detection (Zhao & Chen, 2024). While these methods can identify certain fault characteristics in bearings, they are constrained by low diagnostic accuracy, vulnerability to environmental noise, and an inability to detect early faults effectively. Recently, the rapid advancement of artificial intelligence technology has led to significant achievements in deep learning (DL) across fields such as image processing, speech recognition, and natural language processing (Chen et al., 2024; Kim et al., 2023). This advancement has presented new opportunities in mechanical fault diagnosis, particularly for rotating mechanical equipment, offering innovative diagnostic approaches.

In contrast to traditional signal processing techniques for bearing fault diagnosis, including resonance demodulation (Chen et al., 2017), spectrum kurtosis (Hu & Peng, 2016), morphological filtering (Li et al., 2020), and various time-frequency decomposition methods (Chen et al., 2016; Cui et al., 2021), the DL-based model demonstrates robust nonlinear mapping capabilities and autonomous feature learning, efficiently handling vast amounts of monitoring data. This data-driven DL model facilitates end-to-end fault classification, outputting probabilities without requiring prior knowledge. When it comes to network architecture, traditional DL models applied in fault diagnosis can be classified into stacked autoencoders (SAEs), deep belief networks (DBNs), convolutional neural networks (CNNs), graph neural networks (GNNs), and recurrent neural networks (RNNs) (Wang et al., 2023). Luo et al. (2022) introduced a convolutional shortcuts-based SAE model specifically designed for rolling bearing fault classification, wherein the Kullback–Leibler divergence was substituted with convolutional shortcuts. Wang et al. (2020) devised a dynamic extended DBN-based fault classifier within their proposed DBN model, targeting fault diagnosis in the Tennessee Eastman process. Yu & Liu (2020) integrated confidence and classification rules into their DBN architecture, significantly enhancing the network’s feature learning capabilities. Both SAEs and DBNs possess limitations, requiring considerable computational resources and undergoing complex training processes, particularly for large-scale datasets.

CNNs, characterized by advantages such as weight parameter sharing, local connectivity, spatial subsampling, and an efficient training process, are highly suited for image recognition and processing tasks, making them prevalent in fault diagnosis applications. Eren et al. (2019) designed a compact and adaptive 1D CNN model for real-time induction bearing fault classification. Wang et al. (2021) proposed a 1D-CNN-based network for bearing fault diagnosis that integrates signals from both an accelerometer and a microphone. Chen et al. (2020) introduced a 1D-CNN architecture featuring variations in convolutional kernel sizes and numbers, aiming to enhance diagnostic accuracy. The 1D-CNN method eliminates the need for preprocessing one-dimensional vibration data from rolling bearings. Although the aforementioned 1D CNNs can directly process collected 1D signals without requiring additional 1D-to-2D data conversion, general CNNs are notably adept at processing image data for more comprehensive feature extraction.

CNNs, being a pivotal architecture in the domain of DL, have witnessed substantial advancements in recent years. In 2012, AlexNet achieved a significant victory in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC; Yuan & Zhang, 2016). Following this success, VGGNet, GoogleNet, ResNet, DenseNet, and variations of CNNs have emerged rapidly (Alom et al., 2018). The aforementioned 2D-CNNs have emerged as the primary applications in the field of fault diagnosis. Although CNNs have achieved remarkable success, they are not without limitations and drawbacks. (1) CNNs excel in local feature extraction but are limited in capturing long-range dependencies and global contextual information across the entire dataset. (2) The convolutional filter’s fixed receptive field restricts the model’s capacity to capture information beyond a certain range. (3) When confronted with large-scale datasets or high-dimensional data, CNNs experience significant computational complexity, particularly during extensive convolutional operations.

Since its introduction by Vaswani (2017), the Transformer model has exhibited outstanding performance across various tasks, including machine translation, text classification, question answering, and language modeling. The Transformer architecture, renowned for its parallel processing, long-range dependency modeling, and exceptional flexibility and interpretability, offers notable advantages over CNNs. Hence, variations of the Transformer, e.g., Vision Transformer (VIT), Swin Transformer (SwinT), and hybrids of Transformer and CNN, are rapidly being adopted in the field of fault diagnosis. Tang et al. (2022) proposed a VIT-based model specifically designed for rolling bearing fault classification. Ding et al. (2022) designed a time-frequency Transformer capable of extracting effective time-frequency representation features from bearing datasets. Wu et al. (2023) developed a Transformer-based classifier capable of identifying various known fault types and detecting novel fault types within rotating machinery systems. Hou et al. (2023) introduced an enhanced Transformer model for bearing fault classification, incorporating a multi-feature parallel fusion encoder into its fundamental architecture.

As DL technology has evolved, the limitations of the Transformer have been recognized. The self-attention mechanism exhibits quadratic complexity in the sequence length, which degrades modeling efficiency and increases memory consumption, particularly as the input sequence lengthens or network depth increases. Mamba, an emerging and promising model in machine learning, has attracted considerable attention for its innovative design and outstanding performance across diverse applications. Rooted in structured state space models (SSMs), Mamba has demonstrated revolutionary potential in recent years for data processing, simulation tasks, and large-scale computations (Gu & Dao, 2023). SSMs (Gu et al., 2021) can attain near-linear complexity through state transitions that establish long-range dependencies, with these transitions executed via convolutional computations. By integrating SSMs into its framework and leveraging selective scanning alongside hardware-aware algorithms, Mamba addresses the limitations of SSMs in modeling long-range dependencies and in content awareness. In comparison to the Transformer model, Mamba is highly competitive, making it a formidable alternative. Vision Mamba (Zhu et al., 2024), a specialized variant tailored for computer vision tasks, has exhibited remarkable performance in processing lengthy sequences and high-resolution imagery. nnMamba (Gong et al., 2024) combines CNNs with the long-range modeling capability of SSMs and effectively improves medical image analysis. LocalMamba (Huang et al., 2024) introduces a novel local scanning strategy that partitions the image into distinct windows, enabling the efficient capture of local dependencies while maintaining a comprehensive global perspective. NetMamba, an efficient linear-time SSM proposed by Wang et al. (2024), adopts a specially selected and improved unidirectional Mamba structure.
Nevertheless, despite promising outcomes, the Mamba model remains in its nascent stage of development, with widespread adoption and large-scale applied research yet to materialize. Furthermore, the Mamba model has yet to demonstrate successful applications in the context of wheelset bearing fault diagnosis.

Despite the promising outcomes achieved by diverse DL models in bearing fault diagnosis, accurate diagnosis remains elusive due to the intricate mechanical structure and operational conditions of wheelset bearings in high-speed trains, posing urgent challenges. Firstly, wheelset bearings function in arduous environments, so the collected signals are frequently contaminated by background noise, alternating heavy load impacts, mechanical perturbations, and intricate wheel-track interactions. Constructing a DL model with robust generalization capabilities is crucial for accurately extracting fault features and identifying nuanced fault signatures. Secondly, the acquired 1D vibration signals must be transformed into 2D images before they are input into a 2D DL model. The existing 2D image conversion techniques, including Markov Transition Field, Short-Time Fourier Transform (STFT), and Continuous Wavelet Transform (CWT), possess inherent constraints. The conversion may distort the original signal's representation and lose critical information, particularly when dealing with the weak characteristics and high noise levels inherent in wheelset bearing fault signals. Thirdly, DL models, particularly those with complex architectures, require substantial computational resources and lengthy training times. This model complexity and training cost pose challenges in the engineering application of wheelset bearing fault diagnosis, which demands prompt and accurate results.

To cope with the aforementioned obstacles, this study introduces an improved RSMamba network leveraging multi-domain image fusion for fault diagnosis of wheelset bearings under complex operating conditions. A novel red–green–blue channel fusion (RGB-CF) strategy is developed for multi-domain image fusion, transforming 1D vibration signals into 2D images using Gramian angular field (GAF), bispectrum, and Stockwell Transform (ST), and fusing them to form a new image with corresponding R, G, and B channels. The RSMamba architecture comprises dynamic multi-path Mamba blocks, each featuring triplicate paths (forward, reverse, and shuffle) to bolster non-causal relationship capture. Additionally, a class-specific residual attention (CSRA) is introduced to augment the model's capacity for learning discriminative representations for multi-label recognition. The primary contributions of this study can be listed as follows:

The proposed multi-domain image fusion method transforms the bearing 1D vibration signals into 2D images in the time, frequency, and time-frequency domains, respectively, and integrates them into a new image by using the RGB-CF strategy. This fused image concurrently encapsulates information from the vibration signal in the time, frequency, and time-frequency domains, thereby significantly enriching the representation of image feature information.
The improved RSMamba model, based on the original Mamba structure, enhances the representational capacity to handle non-causal data by introducing a dynamic multi-path activation mechanism, which improves its applicability to 2D image data. Additionally, the incorporation of the CSRA module notably elevates the performance metrics of the proposed model in the realm of multi-label image recognition.
Comprehensive experiments are conducted utilizing two distinct vibration datasets collected from wheelset bearings. Our proposed method effectively classifies wheelset bearings exhibiting various failure patterns across diverse loads and rotational speeds. In comparison to existing CNN-based and Transformer-based methods, our approach demonstrates significant advantages in terms of representational capacity, computational efficiency, and resource utilization.

The structure of the remainder of this paper is as follows. Section 2 provides background knowledge on the proposed approach. Sections 3 and 4 detail the proposed RGB-CF strategy and the improved RSMamba model, respectively. Section 5 introduces and analyzes two experiments conducted in this study. Finally, Section 6 presents the conclusion of the paper.

2. Related Work

2.1. Fault diagnosis techniques of wheelset bearing

Wheelset bearings endure extreme conditions, including heavy loads, variable speeds, and environmental factors, thus requiring advanced diagnostic technologies for timely detection and mitigation of potential failures. Currently, hot-box detection is the most prevalent method employed to monitor wheelset bearing conditions through the detection of axlebox temperatures. There are two approaches, a real-time on-board system with temperature sensors mounted in the axlebox of each carriage, and a wayside hot wheelset bearing temperature detection system (Cao et al., 2016). Elevated temperatures may signify bearing wear, lubrication problems, or excessive friction. Threshold-based alerting serves as an early warning system for impending faults. Nevertheless, the initial stages of bearing failure may not manifest in obvious temperature rises, and temperature detection often yields false alarms, thereby rendering temperature monitoring insufficient for accurate fault diagnosis.

Acoustic emission monitoring employs high-sensitivity sensors to detect sound waves emanating from bearing faults. Liu et al. (2017) implemented a wayside rectangular microphone array system to enhance the accuracy of fault diagnosis for wheelset bearings. Huang et al. (2019) tackled the challenges associated with the Doppler effect and devised a more robust wayside acoustic fault diagnosis system specifically for railway-vehicle wheelset bearings. Nevertheless, high levels of environmental background noise significantly impact the precision of bearing fault identification. Furthermore, the overlap and similarity of specific acoustic frequency components pose significant challenges to accurate acoustic analysis.

Oil analysis provides valuable insights into the internal state of lubricated bearings. Methods such as infrared spectroscopy, spectroscopy, and contamination analysis (Peng et al., 2005; Zhao et al., 2023) can identify wear debris, contaminants, and alterations in lubricant properties, providing early indications of potential bearing faults. The primary drawbacks of oil monitoring are that it is time-consuming, costly, and unsuitable for on-board deployment.

Vibration signals from bearings encompass abundant information, making vibration analysis the prevalent approach for bearing fault diagnosis. Currently, vibration sensors are increasingly being utilized for condition monitoring of critical components in new-generation high-speed trains (Zhang et al., 2023). During operation, vibration signals emitted by bearings are collected, processed, and analyzed to discern fault-indicative features. In addition, machine learning (ML) and DL algorithms based on vibration signals have been widely used to improve the accuracy and efficiency of wheelset bearing fault diagnosis.

2.2. 2D image conversion techniques

The conversion of 1D signals into 2D images has emerged as a crucial technique across diverse fields, such as signal processing, pattern recognition, and machine learning. This conversion enables the use of sophisticated image analysis techniques to extract intricate information embedded in 1D signals, thereby enhancing their interpretability and facilitating the development of innovative applications. Time-delay embedding (Pan & Duraisamy, 2020) is a fundamental approach that constructs 2D images by arranging embedded vectors as pixels in a grid, where lagged versions of the signal serve as distinct dimensions. However, its sensitivity to embedding dimension and lag selection can potentially result in information loss or distortion. A recurrence plot (RP; Marwan et al., 2007) visualizes the recurrence of states of a 1D signal in a phase space obtained through phase space reconstruction. RP is capable of revealing the dynamic properties of 1D signals and facilitating the detection of periodicity and chaos, albeit with increasing complexity as the signal lengthens. GAF encodes a 1D signal in a polar coordinate framework and subsequently transforms it into a 2D image utilizing angular cosine and sine values. This method effectively maintains temporal information while enhancing the interpretability of the resulting 2D image. CWT decomposes a signal into time-frequency components, allowing for their rearrangement into a 2D image representation. Other similar time-frequency decomposition methods, such as STFT, Hilbert–Huang Transform, Wigner–Ville Distribution, and Synchrosqueezing Transform (Yang et al., 2019), can also decompose a 1D signal to generate a 2D image. While these methodologies have demonstrated successful applications, domain-specific image conversion methods still have limitations, such as distortions in the original signal representation and the loss of critical information during the conversion process.

3. 2D Image Conversion Based on RGB-CF

Traditional image conversion techniques usually convert 1D signals to 2D images within a specific domain, encompassing time, frequency, and time-frequency domains. Nonetheless, this approach may omit the signal's feature information in other domains, leading to distorted representations and the loss of critical information. In this section, we introduce the RGB-CF strategy, aimed at deeply integrating signal information across time, frequency, and time-frequency domains to effectively augment information representation.

3.1. GAF

GAF encodes time series data into angular values within polar coordinates, subsequently transforming both angular and radial components into matrix form. This facilitates the visual representation of 1D sequence data. Consider a 1D time series X = [x₁, x₂,…, x_n], where x_i represents the ith value. First, the time series X is normalized so that its values lie in [−1, 1]. Subsequently, each normalized value is converted into an angular value through the inverse cosine. Ultimately, two Gramian matrices can be constructed: the Gramian Angular Summation Field (GASF) and the Gramian Angular Difference Field (GADF), which are derived from the trigonometric sums and differences of the angles, respectively. The GASF is chosen in this study, and its calculation can be detailed as follows:

$$\begin{eqnarray} \textit{GASF}(i,j) = {\tilde x_i}{\tilde x_j} - \sqrt {1 - {{\tilde x}_i}^2} \sqrt {1 - {{\tilde x}_j}^2} \end{eqnarray}$$

(1)

where |${\tilde x_i}$| and |${\tilde x_j}$| are the ith and jth values after the normalization, respectively. The GAF outputs a 2D image, with each pixel encoding the angular relationship between pairs of time points. As a result, the GAF effectively preserves the temporal dependencies within the time series data.
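
As an illustration of Eq. (1), the following minimal NumPy sketch computes a GASF image. The function name `gasf`, the min-max rescaling to [−1, 1], and the sinusoidal test signal are assumptions made for this example; the study's exact normalization and image size are not specified here.

```python
import numpy as np

def gasf(x):
    """Gramian Angular Summation Field of a 1D series, following Eq. (1)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("a constant series cannot be rescaled to [-1, 1]")
    # Min-max rescaling to [-1, 1] (assumed normalization choice).
    x_tilde = 2.0 * (x - x.min()) / span - 1.0
    # Guard against round-off pushing values slightly outside [-1, 1].
    x_tilde = np.clip(x_tilde, -1.0, 1.0)
    # With phi = arccos(x_tilde), sin(phi) = sqrt(1 - x_tilde^2).
    sin_phi = np.sqrt(1.0 - x_tilde ** 2)
    # cos(phi_i + phi_j) = cos(phi_i)cos(phi_j) - sin(phi_i)sin(phi_j)
    return np.outer(x_tilde, x_tilde) - np.outer(sin_phi, sin_phi)

# Hypothetical test signal: two periods of a sine wave, 64 samples.
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))
image = gasf(signal)  # 64 x 64 matrix with entries in [-1, 1]
```

Each entry equals cos(φ_i + φ_j), so the matrix is symmetric and its main diagonal retains the normalized series through cos(2φ_i).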

3.2. Bispectrum

In contrast to traditional Fourier-based approaches, which are limited to capturing linear correlations, bispectrum analysis, a higher-order spectral technique, surpasses the constraints of second-order statistics exemplified by the power spectrum. Bispectrum analysis can detect and characterize nonlinear interactions and phase couplings among diverse frequency components. The bispectrum is defined as the Fourier transform of the third-order cumulant and thus represents the third-order statistics of the signal in the frequency domain. For the 1D time series X, its bispectrum |$B({f_1},{f_2})$| is mathematically expressed as

$$\begin{eqnarray} B({f_1},{f_2}) = E(X({f_1})X({f_2}){X^*}({f_1} + {f_2})), \end{eqnarray}$$

(2)

where X(f) is the Fourier transform of X, X*(f) denotes its complex conjugate, and |$E( \cdot )$| denotes the expectation operator. The bispectrum characterizes the interactions among frequency components f₁ and f₂, offering insights into phase relationships and amplitude modulations. By utilizing third-order statistics, the bispectrum effectively suppresses Gaussian noise while preserving critical frequency domain information, including amplitude and phase, which is then visualized as a 2D image.
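
Eq. (2) can be estimated in practice by replacing the expectation with an average over signal segments. The sketch below uses this direct (segment-averaged FFT) estimator. The segment length, Hann window, and synthetic phase-coupled test signal are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

def bispectrum(x, nfft=64):
    """Direct bispectrum estimate: average of X(f1) X(f2) X*(f1+f2) over segments."""
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // nfft
    if n_seg == 0:
        raise ValueError("signal is shorter than one segment")
    segs = x[: n_seg * nfft].reshape(n_seg, nfft)
    segs = segs - segs.mean(axis=1, keepdims=True)  # remove the mean per segment
    segs = segs * np.hanning(nfft)                  # assumed window choice
    X = np.fft.fft(segs, axis=1)                    # shape (n_seg, nfft)
    f = np.arange(nfft // 2)                        # non-negative frequency bins
    idx_sum = (f[:, None] + f[None, :]) % nfft      # bin index of f1 + f2
    triple = X[:, f][:, :, None] * X[:, f][:, None, :] * np.conj(X[:, idx_sum])
    return triple.mean(axis=0)                      # sample mean approximates E(.)

# Hypothetical test signal: components at bins 6 and 10 plus a phase-coupled one at bin 16.
nfft = 64
n = np.arange(nfft * 16)
k1, k2 = 6, 10
x_demo = (np.cos(2 * np.pi * k1 * n / nfft)
          + np.cos(2 * np.pi * k2 * n / nfft + 0.5)
          + np.cos(2 * np.pi * (k1 + k2) * n / nfft + 0.5))
B = bispectrum(x_demo, nfft=nfft)
peak = np.unravel_index(np.argmax(np.abs(B)), B.shape)
```

The magnitude |B| peaks at the bin pair of the two coupled components, which is the kind of structure that would appear in the corresponding 2D image.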

3.3. ST

The ST is an advanced time-frequency representation technique that integrates the strengths of both the STFT and the wavelet transform. It utilizes an adaptive windowing mechanism that dynamically adjusts according to frequency. As a result, it achieves high time resolution for high-frequency events and high frequency resolution for low-frequency phenomena, effectively addressing the limitations of both the STFT and the wavelet transform. Fundamentally, the ST is a hybrid approach combining the Fourier transform with a scalable Gaussian window, enabling frequency-dependent adaptation. Mathematically, the ST result |$S(\tau ,f)$| of a 1D time series x(t) is defined as

$$\begin{eqnarray} \left\{ \begin{array}{@{}l@{}} G(t - \tau ,f) = \exp \left( - \frac{{{{(t - \tau )}^2}{f^2}}}{2}\right)\\ S(\tau ,f) = \int_{ - \infty }^\infty {x(t)\frac{{\left| f \right|}}{{\sqrt {2\pi } }}G(t - \tau ,f)\exp ( - j2\pi ft)} {\rm d}t \end{array} \right. \end{eqnarray}$$

(3)

where τ represents the time shift, while f and t represent frequency and time, respectively. |$G(t - \tau ,f)$| denotes the Gaussian window function, whose width scales inversely with frequency, conferring robust multi-resolution capabilities. The ST generates a time-frequency matrix, enabling the direct extraction of amplitude and phase information. The transform's magnitude represents the time-frequency energy distribution, while the phase component captures instantaneous phase variations across frequencies. Bearing vibration signals typically exhibit non-stationary characteristics during faults. The ST's ability to track temporal frequency variations effectively converts 1D data into a 2D time-frequency image, illustrating the evolution of frequency content over time.
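
A discrete ST is commonly computed in the frequency domain, where the Gaussian window becomes a Gaussian weighting of the shifted spectrum. The sketch below follows that standard formulation and assumes the usual unit-area normalization of the window. The function name `stockwell` and the two-tone test signal are illustrative assumptions and do not reproduce the settings of this study.

```python
import numpy as np

def stockwell(x):
    """Discrete Stockwell transform; rows are frequency bins 0..n//2, columns are time."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    X = np.fft.fft(x)
    alpha = np.fft.fftfreq(n) * n              # signed integer frequency offsets
    n_freq = n // 2 + 1
    S = np.zeros((n_freq, n), dtype=complex)
    S[0, :] = x.mean()                         # zero-frequency row is the signal mean
    for k in range(1, n_freq):
        # Fourier transform of the unit-area Gaussian window at frequency bin k.
        gauss = np.exp(-2.0 * np.pi ** 2 * alpha ** 2 / k ** 2)
        # np.roll(X, -k)[m] equals X[(m + k) % n], i.e. the spectrum shifted by k.
        S[k, :] = np.fft.ifft(np.roll(X, -k) * gauss)
    return S

# Hypothetical non-stationary signal: bin 8 in the first half, bin 20 in the second half.
n_samples = 128
t = np.arange(n_samples)
x_demo = np.where(t < n_samples // 2,
                  np.sin(2 * np.pi * 8 * t / n_samples),
                  np.sin(2 * np.pi * 20 * t / n_samples))
S = stockwell(x_demo)
tf_image = np.abs(S)   # magnitude used as the time-frequency image
```

In `tf_image`, the dominant frequency row moves from about bin 8 to about bin 20 halfway through the record, which illustrates how the ST tracks changes in frequency content over time.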

3.4. RGB-CF strategy

The objective of the proposed RGB-CF strategy is to convert the captured 1D vibration signals from wheelset bearings into 2D images that integrate information from the time, frequency, and time-frequency domains. Integrating multi-domain feature information represents the fault features comprehensively and intuitively, avoids the limitations of traditional single-domain image generation methods, and greatly enriches the feature information carried by the image. The framework of the RGB-CF strategy is depicted in Figure 1. The approach comprises three key steps, implemented as follows:

Step 1: 2D image conversion. The 1D vibration signals are individually converted into 2D images with three RGB channels using the GAF, bispectrum, and ST. These operations result in 2D representations of the original signal in the time, frequency, and time-frequency domains, respectively.
Step 2: 2D image processing. Each of the images in the time, frequency, and time-frequency domains undergoes a pixel-wise addition along the RGB channels. Specifically, the corresponding pixel values in the R-channel images from GAF, bispectrum, and ST are summed element-wise, and this process is repeated for the G- and B-channel images. Subsequently, the pixel values of the resulting R-, G-, and B-channel images are averaged element-wise.
Step 3: 2D image fusion. The aforementioned three-channel images are fused by concatenating along the R, G, and B channels. The final image after the RGB-CF is still an RGB three-channel image, intricately integrating time, frequency, and time-frequency domain feature information from the bearing vibration signal.
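
The three steps above can be sketched as follows, under one literal reading of Steps 2 and 3: corresponding channels of the three RGB images are summed, averaged, and stacked back into a single RGB image. The function name `rgb_cf`, the 8-bit image format, and the requirement that all three images share the same size are assumptions of this sketch. The colormap used in Step 1 to render each domain as an RGB image is not specified in the text, so it is omitted.

```python
import numpy as np

def rgb_cf(gaf_rgb, bisp_rgb, st_rgb):
    """Fuse three equally sized H x W x 3 images channel by channel."""
    images = [np.asarray(im, dtype=np.float64) for im in (gaf_rgb, bisp_rgb, st_rgb)]
    shape = images[0].shape
    if len(shape) != 3 or shape[2] != 3 or any(im.shape != shape for im in images):
        raise ValueError("expected three H x W x 3 images of identical size")
    fused_channels = []
    for c in range(3):                                   # R, G, B in turn
        summed = sum(im[:, :, c] for im in images)       # Step 2: element-wise sum
        fused_channels.append(summed / len(images))      # Step 2: element-wise average
    fused = np.stack(fused_channels, axis=-1)            # Step 3: concatenate channels
    return np.rint(fused).astype(np.uint8)

# Hypothetical 4 x 4 inputs with constant per-channel values, for illustration only.
gaf_img = np.zeros((4, 4, 3), dtype=np.uint8) + np.array([0, 30, 60], dtype=np.uint8)
bisp_img = np.zeros((4, 4, 3), dtype=np.uint8) + np.array([30, 60, 90], dtype=np.uint8)
st_img = np.zeros((4, 4, 3), dtype=np.uint8) + np.array([60, 90, 120], dtype=np.uint8)
fused_img = rgb_cf(gaf_img, bisp_img, st_img)
```

The computation is done in floating point before converting back to 8 bits, so the channel sums cannot overflow the image data type.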

Figure 1:

Schematic diagram of the RGB-CF strategy.

4. Improved RSMamba Model

4.1. Mamba model

Mamba is an efficient architecture for sequence processing, leveraging SSMs (Gu & Dao, 2023). Derived from modern control theory for continuous linear time-invariant systems, SSMs can be viewed as a fusion of RNNs and CNNs. The dynamics of an SSM are described by the following linear ordinary differential equation:

$$\begin{eqnarray} \left\{ \begin{array}{@{}l@{}} {h^{\prime}}\left( t \right) = Ah\left( t \right) + Bx\left( t \right)\\ y(t) = Ch(t) \end{array}, \right. \end{eqnarray}$$

(4)

where A represents the state transition matrix, and B and C denote the projection matrices. x(t) is the input signal of the system, y(t) is the corresponding output, and h(t) represents the implicit latent state. To obtain the discretized form of the continuous system, a zero-order hold discretization method with a time scale parameter |$\Delta $| is applied to A and B. The calculation can be expressed as follows:

$$\begin{eqnarray} \left\{ \begin{array}{@{}l@{}} \overline A = \exp \left( {\Delta A} \right)\\ \overline B = {\left( {\Delta A} \right)^{ - 1}}\left( {\exp \left( {\Delta A} \right) - I} \right)\Delta B \end{array} \right. \end{eqnarray}$$

(5)

After discretization, the SSM can be written in recurrent form as

$$\begin{eqnarray} \left\{ \begin{array}{@{}l@{}} {h_k} = \overline A {h_{k - 1}} + \overline B {x_k}\\ {y_k} = C{h_k} \end{array} \right. \end{eqnarray}$$

(6)

Finally, the output of the system can be described using a convolution form, which can be written as

$$\begin{eqnarray} \left\{ \begin{array}{@{}l@{}} y = x * \overline K \\ \overline K = \left( {C\overline B ,C\overline A \overline B ,...,C{{\overline A }^{L - 1}}\overline B } \right) \end{array}, \right. \end{eqnarray}$$

(7)

where L represents the length of the input signal, and |$\overline K $| is the structured convolutional kernel.
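
The equivalence between the recurrent form of Eq. (6) and the convolutional form of Eq. (7) can be checked numerically. The sketch below assumes a diagonal state matrix A, so that the matrix exponential in Eq. (5) reduces to an element-wise exponential, together with a scalar input and output. The state size, Δ, and random parameters are arbitrary illustrative values.

```python
import numpy as np

def discretize(a_diag, B, delta):
    """Zero-order hold of Eq. (5) for a diagonal A, stored as the vector a_diag."""
    A_bar = np.exp(delta * a_diag)
    # (delta*A)^-1 (exp(delta*A) - I) delta*B reduces to (A_bar - 1) / a * B element-wise.
    B_bar = (A_bar - 1.0) / a_diag * B
    return A_bar, B_bar

def ssm_recurrent(A_bar, B_bar, C, x):
    """Eq. (6): h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k, with a zero initial state."""
    h = np.zeros_like(A_bar)
    y = np.empty(len(x))
    for k, x_k in enumerate(x):
        h = A_bar * h + B_bar * x_k
        y[k] = C @ h
    return y

def ssm_kernel(A_bar, B_bar, C, L):
    """Eq. (7): K = (C B_bar, C A_bar B_bar, ..., C A_bar^(L-1) B_bar)."""
    powers = A_bar[None, :] ** np.arange(L)[:, None]   # shape (L, N)
    return powers @ (C * B_bar)

rng = np.random.default_rng(0)
a_diag = -np.arange(1.0, 5.0)          # stable diagonal A (negative real entries)
B = rng.standard_normal(4)
C = rng.standard_normal(4)
x_seq = rng.standard_normal(32)

A_bar, B_bar = discretize(a_diag, B, delta=0.1)
y_rec = ssm_recurrent(A_bar, B_bar, C, x_seq)
K_bar = ssm_kernel(A_bar, B_bar, C, len(x_seq))
y_conv = np.convolve(x_seq, K_bar)[: len(x_seq)]      # causal convolution y = x * K
```

Both outputs agree to numerical precision. The recurrent form suits step-by-step inference, whereas the convolutional form allows the whole sequence to be processed in parallel.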

The Mamba network employs a modular architecture, primarily consisting of numerous Mamba blocks. The Mamba block integrates the Hungry Hungry Hippos (H3) architecture with the Transformer model, as depicted in Figure 2. Initially, the multidimensional input data is transformed into feature representations via an embedding layer, which are then fed into the Mamba block. The features undergo linear projection to the hidden state dimension, followed by convolution and nonlinear activation with the Sigmoid Linear Unit (SiLU) function. Subsequently, these results are input into the SSM. In a parallel shortcut branch, the features pass through a linear projection and SiLU activation and are then multiplied by the output of the SSM. Lastly, a linear projection layer is appended to reduce the dimensionality of the output tensor of the Mamba block.

Figure 2:

Architecture of Mamba block.

4.2. RSMamba model

RSMamba is a recently introduced SSM-based model framework for remote sensing image analysis, as detailed by Chen et al. (2024). The RSMamba model employs a modular architecture, primarily comprising a series of interconnected multi-path SSM blocks. While inheriting the advantages of the traditional Mamba model, including linear complexity and a global receptive field, RSMamba incorporates a dynamic multi-path activation mechanism into its SSM, thereby significantly enhancing its capability to handle non-causal sequence data.

The architecture of the multi-path SSM block is shown in Figure 3. Unlike the Vision Mamba encoder, which uses a class token to aggregate the global representation, RSMamba feeds the input sequence directly into the multi-path SSM block. The traditional Mamba can only model in a single direction and is position-agnostic. Its unidirectional path makes it difficult to model spatial positional relationships, thereby limiting its applicability to visual data representation. The dynamic multi-path activation mechanism is introduced to augment the capacity for processing 2D data. The input sequence is duplicated into three copies, corresponding to three different paths: the forward path, the random shuffle path, and the reverse path. Within each path, a Mamba block with shared parameters is utilized to capture the dependency relationships between tokens across the three sequences. Subsequently, the tokens in each sequence are rearranged to their original order, and a linear layer is applied to compress the sequence information. An activation gate is utilized in each path to enhance the representation of the unique sequence information. By incorporating these three paths, RSMamba can effectively capture both causal and non-causal relationships, significantly augmenting its ability to model complex, non-causal, and position-sensitive data. Ultimately, category prediction is performed at the end of the RSMamba model through mean pooling and linear projection of the sequence features. The computational procedure of the RSMamba model can be succinctly summarized as

$$\begin{eqnarray} \left\{ \begin{array}{@{}l@{}} {T^i} = \varphi _{\rm mp - ssm}^i({T^{i - 1}}) + {T^{i - 1}}\\ T = {\varphi _{\text{line} - \text{proj}}}\left( {{\varphi _{\text{mean}}}({T^N})} \right) \end{array} \right. \end{eqnarray}$$

(8)

where |${T^i}$| and |${T^{i - 1}}$| represent the output sequences of the ith and (i − 1)th multi-path SSM blocks, and |${T^N}$| represents the final output of the N multi-path SSM blocks. |${\varphi _{\rm mp - ssm}}$|, |${\varphi _{\text{mean}}}$|, and |${\varphi _{\text{line} - \text{proj}}}$| denote the multi-path SSM block, the mean pooling operation, and the linear projection, respectively.
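
The three-path idea can be illustrated with a simplified NumPy sketch. Here a causal running mean stands in for the shared Mamba block, and a softmax gate computed from the mean-pooled path features weights the three paths. Both choices, along with the names `causal_mixer` and `multi_path_block`, are simplifying assumptions for illustration and do not reproduce the actual RSMamba implementation.

```python
import numpy as np

def causal_mixer(seq):
    """Stand-in for the shared Mamba block: causal running mean over tokens."""
    counts = np.arange(1, seq.shape[0] + 1)[:, None]
    return np.cumsum(seq, axis=0) / counts

def multi_path_block(tokens, gate_weight, rng, mixer=causal_mixer):
    """Forward, reverse, and shuffle paths, restored to token order and gated."""
    length, dim = tokens.shape
    perm = rng.permutation(length)
    inv_perm = np.argsort(perm)                    # undoes the shuffle
    forward = mixer(tokens)
    reverse = mixer(tokens[::-1])[::-1]            # scan backwards, then restore order
    shuffle = mixer(tokens[perm])[inv_perm]        # scan shuffled, then restore order
    paths = np.stack([forward, reverse, shuffle])  # shape (3, length, dim)
    summary = paths.mean(axis=1).reshape(-1)       # compress each path over tokens
    logits = summary @ gate_weight                 # linear layer -> one score per path
    gate = np.exp(logits - logits.max())
    gate = gate / gate.sum()                       # softmax gate over the three paths
    return np.tensordot(gate, paths, axes=1)       # gated sum, shape (length, dim)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))              # 16 tokens of dimension 8
gate_weight = 0.1 * rng.standard_normal((3 * 8, 3))
block_out = tokens + multi_path_block(tokens, gate_weight, rng)  # residual of Eq. (8)
```

Because each path is returned to the original token order before gating, every token can aggregate context from earlier, later, and randomly ordered positions, even though the mixer itself is causal.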

Figure 3:

Architecture of multi-path SSM block.

4.3. Improved RSMamba architecture

In a multi-label classification task, various spatial regions are occupied by diverse categories of image features. Hence, the model must be capable of capturing these regions and generating corresponding category features. Consequently, to further bolster the multi-label recognition capability of the RSMamba model, a CSRA module is designed at the end of the model, generating more precise class-specific features for each category. Figure 4 illustrates the structure of the CSRA module, comprising two main components: mean pooling and spatial pooling. Mean pooling derives category features by averaging the feature map across each category, which is straightforward but fails to fully exploit spatial feature information. Spatial pooling generates category features by assessing the weight of each category at each spatial position and performing weighted summation, enabling it to capture information about various categories in distinct spatial regions, thereby producing more representative features. The CSRA module integrates mean pooling and spatial pooling via hyperparametric λ-weighted fusion, enabling it to extract category-specific features and consequently enhance the effectiveness of multi-label recognition.
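
The λ-weighted fusion of mean pooling and spatial pooling can be sketched as below. The sketch follows the commonly used formulation of class-specific residual attention, in which a softmax over spatial positions provides the class-specific weights. The temperature parameter, the value of λ, and the function name `csra_logits` are assumptions, since the text does not give these details.

```python
import numpy as np

def csra_logits(tokens, class_weights, lam=0.1, temperature=1.0):
    """Class logits from mean pooling plus lam-weighted class-specific spatial pooling.

    tokens:        (L, D) features at L spatial positions
    class_weights: (C, D) one classifier vector per category
    """
    scores = tokens @ class_weights.T                 # (L, C) score of each class at each position
    mean_term = scores.mean(axis=0)                   # mean pooling: classifier on averaged features
    z = temperature * scores
    z = z - z.max(axis=0, keepdims=True)              # numerically stable softmax
    attention = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)  # over positions, per class
    spatial_term = (attention * scores).sum(axis=0)   # class-specific weighted summation
    return mean_term + lam * spatial_term

rng = np.random.default_rng(0)
feat = rng.standard_normal((49, 32))                  # e.g. 7 x 7 positions, 32 channels
W_cls = rng.standard_normal((5, 32))                  # 5 hypothetical fault categories
logits = csra_logits(feat, W_cls, lam=0.2)
```

Setting `lam=0` recovers plain mean pooling followed by a linear classifier, while a large temperature makes the spatial term approach a per-class maximum over positions.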

Figure 4:

Structure of the CSRA module.
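The fusion of mean pooling and spatial pooling can be sketched as follows. This is a minimal NumPy illustration of the general class-specific residual attention form (base logit plus λ times an attention-pooled logit), not the authors' code; the function name `csra_logits`, the tensor shapes, and the `temperature` parameter are assumptions made for the example.

```python
import numpy as np

def csra_logits(features, class_weights, lam=0.2, temperature=1.0):
    """features: (P, D) tokens or spatial positions; class_weights: (C, D), one row per class."""
    scores = features @ class_weights.T                 # (P, C) class score at every position
    base = scores.mean(axis=0)                          # mean pooling branch
    z = temperature * scores
    attn = np.exp(z - z.max(axis=0, keepdims=True))
    attn /= attn.sum(axis=0, keepdims=True)             # softmax over positions, per class
    spatial = (attn * scores).sum(axis=0)               # spatial pooling branch
    return base + lam * spatial                         # lambda-weighted fusion

rng = np.random.default_rng(0)
features = rng.normal(size=(49, 16))                    # hypothetical: 49 positions, width 16
class_weights = rng.normal(size=(8, 16))                # hypothetical: 8 health states
logits = csra_logits(features, class_weights, lam=0.2)
print(logits.shape)
```

Because λ is a single scalar applied at fusion time, changing it adds no parameters and does not alter the network structure.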

Figure 5 presents an overview of the proposed method for diagnosing axlebox bearing faults. The collected 1D vibration signals are transformed into 2D images through the RGB-CF strategy. A 2D convolution operation is then used to map each image into pixel-wise feature maps.

Figure 5:

Flowchart of the proposed method for axlebox bearing fault diagnosis.

These local patches are subsequently flattened into 1D sequences, while the relative spatial positional relationships within the image are preserved through position encoding. The sequences are then passed through N stacked multi-path SSM blocks to extract long-range dependency features; the influence of N is evaluated in the experimental section. Finally, the class-specific features extracted by the CSRA module are used for category prediction.
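The tokenization step can be sketched in NumPy under one common assumption: the 2D convolution uses a kernel size equal to its stride, in which case it is equivalent to a linear projection of non-overlapping patches. The image size, patch size, embedding width, and the names `patch_embed`, `w_proj`, and `pos_embed` are illustrative and not taken from the paper.

```python
import numpy as np

def patch_embed(image, w_proj, pos_embed, patch):
    """Split an (H, W, C) image into non-overlapping patches, project them, add positions."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch]
    x = x.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    x = x.reshape(gh * gw, patch * patch * c)       # one flattened patch per row, row-major
    return x @ w_proj + pos_embed                   # (L, D) token sequence with positions

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                     # hypothetical RGB image size
patch, dim = 16, 8                                  # hypothetical patch size and width
n_tokens = (64 // patch) ** 2
w_proj = rng.normal(scale=0.02, size=(patch * patch * 3, dim))
pos_embed = rng.normal(scale=0.02, size=(n_tokens, dim))
tokens = patch_embed(image, w_proj, pos_embed, patch)
print(tokens.shape)
```

The added position embedding is what lets the later shuffle and reverse paths reorder tokens without losing track of where each patch came from.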

5. Case Study

In this section, we conduct two case studies to demonstrate the effectiveness of the improved RSMamba, using the wheelset bearing comprehensive test rig (WBCTR) and the rolling and vibration rig of single wheelset (RVRSW). Each test rig undergoes three sets of tests under different operating conditions, and the resulting datasets are split into training and testing sets for the subsequent analyses. Furthermore, to highlight the improved RSMamba’s advantages in diagnostic accuracy and computational efficiency, we select cutting-edge DL-based methods for comparison. These methods are ResNet34 (He et al., 2016), ShuffleNet V2 (Zhang et al., 2018), ViT (Dosovitskiy et al., 2020), SwinT (Liu et al., 2021), and Scale-Aware Modulation Transformer (SMT; Lin et al., 2023).

The proposed model is written in Python 3.10 and based on the PyTorch 3.11 DL framework. The experiments are conducted on a computer equipped with an Intel Core i7-13700H CPU and an NVIDIA GeForce RTX 4060 GPU with 16 GB of dedicated memory. The proposed RGB-CF strategy is used to generate the 2D images that are fed to all of the above methods, and each method is run in five independent trials to mitigate randomness.

5.1. Fault diagnosis on the dataset of WBCTR

5.1.1. Data description

The WBCTR comprises a support bearing at one end and a test wheelset bearing at the other, along with a driving motor, a hydraulic loading device, and a data acquisition system, as depicted in Figure 6a. The health states of the test wheelset bearing fall into eight categories, illustrated in Figure 7: outer race serious and slight faults, inner race serious and slight faults, rolling element serious and slight faults, a compound fault of the outer race and rolling element, and the normal state, corresponding to labels H1 to H8. Notably, all of these wheelset bearing failures are genuine, resulting from extended train operation without any artificial processing.

Figure 6:

Experimental test rig: (a) WBCTR and (b) RVRSW.

Figure 7:

Wheelset bearings with different health states for the WBCTR: (a) outer race serious fault, (b) outer race slight fault, (c) inner race serious fault, (d) inner race slight fault, (e) rolling serious fault, (f) rolling slight fault, (g) compound fault, and (h) normal.

The experimental sampling frequency is set at 25 600 Hz. The bearings are tested at three rotational speeds, 1010, 760, and 505 r/min, and the hydraulic loading device applies a different degree of load at each speed. Since quantitative information on the hydraulic loading is not available, load numbers are used to indicate the different degrees of loading. Based on the operating conditions of the training and testing sets, three testing tasks are defined. Detailed dataset information is presented in Table 1.

Table 1:

Detailed information on the WBCTR dataset.

	Training set			Testing set
Task	Speed (r/min)	Load (no.)	Sample number	Speed (r/min)	Load (no.)	Sample number
A1	1010, 760	1, 2	200	505	3	100
A2	760, 505	2, 3	200	1010	1	100
A3	1010, 505	1, 3	200	760	2	100
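The three tasks in Table 1 follow a leave-one-condition-out pattern: two operating conditions are used for training and the remaining one for testing. A minimal sketch of that split is given below, assuming (as the table rows suggest) that each speed is paired with one load number; the helper name `leave_one_condition_out` is hypothetical.

```python
# (speed in r/min, load number) pairs read from Table 1
conditions = [(1010, 1), (760, 2), (505, 3)]

def leave_one_condition_out(conditions, held_out_order):
    """Build one task per held-out condition; the other conditions form the training set."""
    tasks = {}
    for index, held_out in enumerate(held_out_order, start=1):
        train = [c for c in conditions if c != held_out]
        tasks[f"A{index}"] = {"train": train, "test": [held_out]}
    return tasks

# Table 1 holds out 505 r/min for A1, 1010 r/min for A2 and 760 r/min for A3
tasks = leave_one_condition_out(conditions, [(505, 3), (1010, 1), (760, 2)])
for name, split in tasks.items():
    print(name, "train:", split["train"], "test:", split["test"])
```

Because the test condition never appears in the training set, each task measures generalization to an unseen speed and load combination rather than memorization of a seen one.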

5.1.2. Diagnosis result

The accuracy and loss curves over the training epochs for the three testing tasks are depicted in Figure 8a–c, respectively. After 120 epochs, the accuracy and loss values have converged in all three tasks, indicating that the model's performance has stabilized. The confusion matrices of the testing results, presented in Figure 8d–f, indicate that all samples in categories H1, H2, H4, H7, and H8 are classified correctly, whereas one sample in each of H3 and H5 and three samples in H6 are misclassified. The overall classification accuracy is 99.67% for testing task A1, 99.74% for A2, and 99.56% for A3. The results demonstrate that the proposed method effectively identifies the fault types and degrees of damage in wheelset bearings under different rotational speeds and load conditions.

Figure 8:

Model performance curves and confusion matrices of different testing tasks for the WBCTR dataset. (a) A1 performance curve, (b) A2 performance curve, (c) A3 performance curve, (d) A1 confusion matrix, (e) A2 confusion matrix, and (f) A3 confusion matrix.

To demonstrate the overall superiority of the proposed model, we compare it with the methods listed above. The model training time, total number of parameters (Params), number of floating point operations (FLOPs), and frames per second (FPS) are selected as lightweight metrics to evaluate the computational efficiency of the different models. The comparative results for the three testing tasks are presented in Table 2. SwinT and SMT exhibit markedly higher classification accuracy than the three traditional methods (ResNet34, ShuffleNet V2, and ViT), mainly because traditional DL methods struggle to extract the feature information pertinent to the sample categories under complex operating conditions. Across the three testing tasks (A1, A2, and A3), our method achieves an average classification accuracy that is 2.23, 1.30, and 1.85 percentage points higher, respectively, than that of the second-best method, SMT. ShuffleNet V2, a well-known lightweight network, has the shortest training time, the fewest parameters, and the fewest FLOPs among the five comparative models. Even so, our model trains faster, has fewer parameters, and reaches a higher FPS than ShuffleNet V2 in all three tasks, although its FLOPs are somewhat higher (about 0.87 × 10⁹ versus 0.59 × 10⁹).
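The accuracy margins over SMT and the efficiency comparison with ShuffleNet V2 can be reproduced directly from the Table 2 values. The short sketch below only restates that arithmetic; the dictionary layout and variable names are illustrative.

```python
# Average accuracies (%) from Table 2
avg_acc = {
    "A1": {"SMT": 97.44, "Ours": 99.67},
    "A2": {"SMT": 98.44, "Ours": 99.74},
    "A3": {"SMT": 97.71, "Ours": 99.56},
}
# Margin in percentage points, not relative percent
margins = {task: round(v["Ours"] - v["SMT"], 2) for task, v in avg_acc.items()}

# Lightweight metrics for task A1 from Table 2
a1 = {
    "ShuffleNet V2": {"time_s": 151, "params_m": 7.94, "gflops": 0.59, "fps": 126.71},
    "Ours": {"time_s": 49, "params_m": 1.71, "gflops": 0.87, "fps": 347.43},
}
# Ratio ShuffleNet V2 / Ours: above 1 means ours is smaller on that metric
ratios = {k: a1["ShuffleNet V2"][k] / a1["Ours"][k] for k in a1["Ours"]}

print(margins)
print({k: round(v, 2) for k, v in ratios.items()})
```

Read this way, the table shows our model with roughly a quarter of the parameters and a third of the training time of ShuffleNet V2 on A1, while ShuffleNet V2 retains the lower FLOP count.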

Table 2:

Comparative analysis results of different testing tasks for the WBCTR dataset.

		Classification accuracy (%)			Lightweight metrics
Task	Model	Max-acc	Min-acc	Avg-acc	Time (s)	Params (M)	FLOPs (10⁹)	FPS
A1	ResNet34	90.12	87.78	88.95	196	21.28	3.68	210.03
	ShuffleNet V2	93.56	89.91	91.73	151	7.94	0.59	126.71
	ViT	90.55	88.76	89.66	484	86.58	16.89	129.02
	SwinT	96.53	96.21	96.38	336	28.29	4.38	199.09
	SMT	97.77	97.11	97.44	563	62.67	7.67	173.34
	Ours	99.83	99.53	99.67	49	1.71	0.87	347.43
A2	ResNet34	90.24	88.33	89.29	194	21.29	3.69	209.79
	ShuffleNet V2	92.36	89.74	91.05	150	7.93	0.59	125.23
	ViT	91.63	88.98	90.31	483	86.57	16.89	129.56
	SwinT	96.74	96.23	96.49	336	28.30	4.39	199.13
	SMT	99.03	97.86	98.44	562	62.69	7.67	173.39
	Ours	99.81	99.68	99.74	48	1.70	0.86	347.39
A3	ResNet34	90.76	87.23	88.99	198	21.31	3.71	209.44
	ShuffleNet V2	92.41	88.89	90.65	153	7.93	0.60	125.04
	ViT	91.04	88.76	89.91	487	86.57	16.88	129.37
	SwinT	96.53	96.21	96.37	338	28.27	4.39	199.11
	SMT	98.23	97.34	97.71	562	62.68	7.66	173.45
	Ours	99.62	99.50	99.56	48	1.73	0.87	347.37

To further show the feature learning and aggregation capability of the proposed method, t-distributed stochastic neighbor embedding (t-SNE) is employed to visualize the output features of the model, and the results are shown in Figure 9. In all three testing tasks, the features of different categories form clearly separated clusters, which confirms that the proposed method extracts discriminative features under composite working conditions.

Figure 9:

Visualization of representation results in different testing tasks for the WBCTR dataset: (a) A1, (b) A2, and (c) A3.

5.2. Fault diagnosis on the dataset of RVRSW

5.2.1. Data description

The RVRSW primarily comprises a hydraulic station, hydraulic actuators, a supportive framework, an exciter, and rail wheels. Its mechanical configuration is depicted in Figure 6b. The experimental focus is a single wheelset bogie sourced from the China Railway High-speed (CRH380B) train, specifically utilizing the right wheelset bearing as the subject of testing. Positioned atop the bench, a hydraulic actuator is capable of applying vertical loads reaching up to 7 tons. Six distinct categories of axlebox bearings, encompassing various health states such as normal, outer race fault, inner race slight spalling, inner race serious spalling, rolling fault, and a compound fault involving both outer race and rolling, are selected for testing, labeled from C1 to C6. A depiction of selected faulty wheelset bearings is presented in Figure 10. An accelerometer is mounted at the upper end of the axlebox during the experiment to gather vibration signals, sampled at a frequency of 25.6 kHz. We have established three distinct operating conditions, featuring speeds of 150, 200, and 250 km/h, which are associated with varying levels of static loading: 7, 6, and 5 tons, respectively. Consequently, we have devised three testing tasks, and the detailed information about these tasks can be found in Table 3.

Figure 10:

Part of faulty wheelset bearings for the RVRSW: (a) outer race fault, (b) inner race serious spalling fault, (c) inner race slight spalling fault, and (d) rolling fault.

Table 3:

Detailed information on the RVRSW dataset.

	Training set			Testing set
Task	Speed (km/h)	Load (tons)	Sample number	Speed (km/h)	Load (tons)	Sample number
B1	250, 200	5, 6	200	150	7	100
B2	250, 150	5, 7	200	200	6	100
B3	200, 150	6, 7	200	250	5	100

5.2.2. Diagnosis result

Figures 11a–c present the accuracy and loss trends of the proposed model over the training epochs for the three testing tasks, respectively. After 120 epochs, both accuracy and loss have stabilized in all three testing tasks (B1, B2, and B3), indicating that training has converged. Figures 11d–f display the confusion matrices, which correspond to classification accuracies of 99.57%, 99.68%, and 99.75% for tasks B1, B2, and B3, respectively. These results demonstrate that the proposed method accurately identifies fault types and assesses damage levels in the RVRSW dataset under complex operating conditions.

Figure 11:

Model performance curves and confusion matrices of different testing tasks for the RVRSW dataset. (a) B1 performance curve, (b) B2 performance curve, (c) B3 performance curve, (d) B1 confusion matrix, (e) B2 confusion matrix, and (f) B3 confusion matrix.

The comparative analysis results are presented in Table 4. In tasks B1, B2, and B3, our model achieves average classification accuracies that are 3.43, 1.20, and 1.91 percentage points higher, respectively, than those of the second-best SMT model. Compared with the classical lightweight ShuffleNet V2 model, our model also trains faster, has fewer parameters, and reaches a higher FPS, although its FLOPs are somewhat higher. Lastly, the visualization of output features is presented in Figure 12. The clear separation of features in all three tasks underscores the feature extraction capability of the proposed method under composite conditions. Overall, our model outperforms the five comparative models in classification accuracy and in every lightweight metric except FLOPs, where ShuffleNet V2 is lower.

Figure 12:

Visualization of representation results in different testing tasks for the RVRSW dataset: (a) B1, (b) B2, and (c) B3.

Table 4:

Comparative analysis results of different testing tasks for the RVRSW dataset.

		Classification accuracy (%)			Lightweight metrics
Task	Model	Max-acc	Min-acc	Avg-acc	Time (s)	Params (M)	FLOPs (10⁹)	FPS
B1	ResNet34	87.28	82.88	90.08	105	21.29	3.68	98.78
	ShuffleNet V2	88.75	86.89	92.82	139	7.93	0.58	155.67
	ViT	85.57	83.73	89.68	319	86.57	16.88	123.39
	SwinT	94.69	94.88	94.29	481	28.77	4.29	199.09
	SMT	96.74	96.88	96.14	597	14.41	7.63	137.34
	Ours	99.64	99.51	99.57	32	1.71	0.85	343.43
B2	ResNet34	87.56	83.54	85.55	104	21.28	3.69	98.79
	ShuffleNet V2	87.74	85.64	86.69	138	7.94	0.59	155.42
	ViT	85.47	83.57	84.52	321	86.58	16.89	129.37
	SwinT	93.78	92.79	93.28	482	28.59	4.26	199.11
	SMT	96.71	95.77	98.48	596	14.38	7.64	137.45
	Ours	99.76	99.61	99.68	32	1.72	0.85	343.67
B3	ResNet34	87.57	84.88	89.29	105	21.28	3.70	98.77
	ShuffleNet V2	87.69	85.74	91.05	139	7.93	0.59	155.45
	ViT	84.72	82.98	88.85	319	86.59	16.89	129.56
	SwinT	93.89	93.07	96.49	482	28.64	4.28	199.13
	SMT	97.41	97.08	97.84	597	14.39	7.63	137.39
	Ours	99.82	99.68	99.75	33	1.71	0.86	343.51

5.3. Discussion

5.3.1. Effect of the weight λ in the CSRA module

To illustrate the effect of the weight λ in the CSRA module on the classification performance and operational efficiency of the proposed model, we conduct comparative analyses with λ values of 0.05, 0.1, 0.2, 0.3, and 0.4 for testing tasks A1 and B1. The results, presented in Table 5, indicate that λ affects the classification accuracy of the proposed model but has minimal impact on its lightweight metrics. This is because λ only sets the relative weights of the fused feature information and does not alter the network structure or the model parameters, leaving operational efficiency unchanged. With λ = 0.2, the model achieves the highest classification accuracy in both A1 and B1. Therefore, λ is set to 0.2 in the proposed model.
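The selection rule described here amounts to picking the λ with the highest average accuracy. A minimal sketch using the Avg-acc column of Table 5 follows; averaging the two tasks with equal weight is an assumption of this example, and the helper name `best_lambda` is hypothetical.

```python
# Average accuracy (%) per lambda value, taken from Table 5
avg_acc = {
    0.05: {"A1": 96.41, "B1": 97.63},
    0.1: {"A1": 97.48, "B1": 98.71},
    0.2: {"A1": 99.67, "B1": 99.57},
    0.3: {"A1": 98.67, "B1": 99.01},
    0.4: {"A1": 98.49, "B1": 98.65},
}

def best_lambda(table):
    """Return the lambda whose accuracy, averaged over tasks, is highest."""
    return max(table, key=lambda lam: sum(table[lam].values()) / len(table[lam]))

best = best_lambda(avg_acc)
best_per_task = {task: max(avg_acc, key=lambda lam: avg_acc[lam][task]) for task in ("A1", "B1")}
print(best, best_per_task)
```

Both tasks agree on the same value, so the choice does not depend on how the two tasks are weighted.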

Table 5:

Comparative analysis results for different values of λ.

		Classification accuracy (%)			Lightweight metrics
Task	λ value	Max-acc	Min-acc	Avg-acc	Time (s)	Params (M)	FLOPs (10⁹)	FPS
A1	0.05	96.45	96.37	96.41	49.31	1.72	0.88	347.52
	0.1	97.52	97.41	97.48	49.32	1.71	0.88	347.24
	0.2	99.73	99.61	99.67	49.32	1.71	0.87	347.43
	0.3	98.71	98.65	98.67	49.32	1.71	0.87	347.67
	0.4	98.52	98.41	98.49	49.31	1.71	0.87	347.59
B1	0.05	97.65	97.56	97.63	32.21	1.71	0.86	343.51
	0.1	98.75	98.68	98.71	32.22	1.71	0.85	343.88
	0.2	99.64	99.52	99.57	32.23	1.71	0.85	343.43
	0.3	99.13	98.89	99.01	32.23	1.72	0.85	343.57
	0.4	98.69	98.60	98.65	32.21	1.71	0.85	343.79

5.3.2. Effect of the number of multi-path SSM blocks

The number of multi-path SSM blocks strongly influences the recognition ability and operational efficiency of the proposed model. Increasing the number of blocks deepens the network and enhances its feature learning capability, but the deeper architecture incurs higher computational cost and a greater risk of overfitting. Too few blocks, in contrast, may result in insufficient feature extraction and diminished performance. We conduct comparative analyses for the three testing tasks of the WBCTR dataset using 4, 6, 8, 10, and 12 multi-path SSM blocks, with classification accuracy and model training time as evaluation metrics. The results are presented in Figure 13. With 4 blocks, the classification accuracy is notably lower in all three testing tasks. As the block count increases further, the model’s accuracy remains stable, suggesting that peak performance has been reached, while the training time increases markedly because of the heavier computation in the deeper architecture. Weighing classification capability against computational efficiency, the number of multi-path SSM blocks is set to 6.

Figure 13:

Effect of the multi-path SSM blocks for the WBCTR dataset. (a) A1, (b) A2, and (c) A3.

5.3.3. Effect of the proposed RGB-CF strategy

To further demonstrate the merits of the proposed RGB-CF strategy, we compare it with advanced and commonly employed 2D image generation methods: GAF, bispectrum (Bisp), STFT, CWT, and ST. The 2D images derived from these methods are fed into the improved RSMamba model for evaluation. To ensure a comprehensive comparison, we select accuracy (Acc), precision (Pre), recall (Re), and the F1 score as evaluation metrics. Figure 14 presents the comparative diagnosis results for the three testing tasks of the WBCTR dataset. The evaluation metrics of the GAF and Bisp methods are notably lower, which is attributable to their capturing features from only the time domain or only the frequency domain of the 1D vibration signals. The STFT, CWT, and ST methods, rooted in time-frequency decomposition, retain both time and frequency domain features of the 1D signals and therefore achieve better diagnostic performance. The RGB-CF strategy achieves the best diagnostic performance in all three testing tasks because it integrates information from the time, frequency, and time-frequency domains, thereby enriching the representation of image features.
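The four evaluation metrics can all be derived from a confusion matrix. The sketch below uses macro averaging over classes, which is an assumption since the averaging scheme is not stated here, and a small made-up 3-class matrix rather than the paper's data.

```python
def macro_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    precision, recall, f1 = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))   # column sum: predicted as class k
        actual = sum(cm[k])                           # row sum: truly class k
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    mean = lambda values: sum(values) / n
    return {"acc": accuracy, "pre": mean(precision), "re": mean(recall), "f1": mean(f1)}

toy_cm = [[10, 0, 0], [1, 8, 1], [0, 2, 8]]           # illustrative counts only
metrics = macro_metrics(toy_cm)
print({name: round(value, 4) for name, value in metrics.items()})
```

Reporting precision and recall alongside accuracy shows whether errors are concentrated in a few classes, which accuracy alone can hide.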

Figure 14:

Effect of the proposed RGB-CF strategy for the WBCTR dataset. (a) A1, (b) A2, (c) A3.

5.3.4. Ablation study

This section presents an ablation study that examines the contributions of the multi-path SSM block and the CSRA module to the proposed model. The study is conducted on the three testing tasks of the WBCTR dataset with four model structures: M1 (traditional Mamba blocks only), M2 (RSMamba blocks only), M3 (traditional Mamba blocks and the CSRA module), and M4 (the proposed model). For a fair comparison, the RGB-CF strategy is used to generate the 2D input images for all four models. Figure 15 illustrates the results, with classification accuracy as the primary evaluation metric. The average classification accuracies of M1, M2, M3, and M4 across the three testing tasks are 91.09%, 97.97%, 93.62%, and 99.67%, respectively. M1 exhibits the lowest classification accuracy. M2 is 6.88 percentage points more accurate than M1 owing to its dynamic multi-path activation mechanism, which shows that incorporating this mechanism into the SSM significantly enhances its capability to process non-causal sequence data. M3 is 2.53 percentage points more accurate than M1, which underscores the role of the CSRA module in extracting more precise class-specific features. Finally, by combining the multi-path SSM block and the CSRA module, the proposed model significantly outperforms the other ablation models.
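The gains quoted for the ablation models are differences in percentage points relative to M1. A minimal sketch of that arithmetic, using only the four average accuracies reported above, is:

```python
# Average accuracy (%) across the three WBCTR tasks, as reported in the ablation study
avg_acc = {"M1": 91.09, "M2": 97.97, "M3": 93.62, "M4": 99.67}

# Which components each variant uses: (multi-path SSM block, CSRA module)
components = {"M1": (False, False), "M2": (True, False), "M3": (False, True), "M4": (True, True)}

# Gain over the plain-Mamba baseline M1, in percentage points
gains = {model: round(acc - avg_acc["M1"], 2) for model, acc in avg_acc.items()}

for model in avg_acc:
    multi_path, csra = components[model]
    print(model, "multi-path:", multi_path, "CSRA:", csra, "gain over M1:", gains[model])
```

The multi-path block accounts for the larger share of the improvement, and the full model M4 gains 8.58 percentage points over M1.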

Figure 15:

Results of the ablation study for the WBCTR dataset. (a) A1, (b) A2, and (c) A3.

6. Conclusions

In the complex and dynamic operating environments of high-speed trains, traditional DL methods that rely on vibration analysis face significant limitations in accurately identifying and efficiently diagnosing wheelset bearing faults. We propose an improved RSMamba network based on multi-domain image fusion for wheelset bearing fault diagnosis. The RGB-CF strategy employs the principle of multi-domain fusion, transforming a 1D vibration signal into a 2D image representation spanning the time, frequency, and time-frequency domains, followed by RGB channel fusion to enrich feature representation. To improve non-causal dependency modeling, the RSMamba network integrates dynamic multi-path Mamba blocks. Subsequently, a CSRA module is incorporated into the model to improve multi-label recognition accuracy.

Extensive experiments on two real-world wheelset bearing datasets validate the proposed method’s effectiveness in accurately identifying fault categories and assessing damage severity under composite conditions. The improved RSMamba network outperforms state-of-the-art CNN- and Transformer-based models, demonstrating superior accuracy, computational efficiency, and resource utilization. These findings highlight its potential as an efficient and robust solution for fault diagnosis in critical rotating components of high-speed trains. However, the study has limitations, particularly in computational overhead and the risk of overfitting when incorporating excessive multi-path blocks. Future research should focus on optimizing computational efficiency while maintaining high accuracy and exploring the feasibility of real-time fault diagnosis under diverse environmental conditions.

Conflicts of Interest

The authors declare no conflict of interest.

Author Contributions

Feiyue Deng (Conceptualization, Methodology, Writing - original draft, Writing - review & editing), Yunlong Zhu (Data curation, Formal analysis), Rujiang Hao (Visualization, Funding acquisition), Shaopu Yang (Supervision).

Funding

This work is supported by the National Natural Science Foundation of China (12032017, 12272243).

Data Availability

The data underlying this article will be shared on reasonable request to the corresponding author.

Acknowledgments

The authors thank the State Key Laboratory of Mechanical Behavior and System Safety of Traffic Engineering Structures for its support.

References

Alom, M. Z., Taha, T. M., Yakopcic, Westberg, Sidike, Nasrin, M. S., Asari, V. K. (2018). The history began from AlexNet: A comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164.

Cao, Fan, Zhou (2016). Wheel-bearing fault diagnosis of trains using empirical wavelet transform. Measurement, 439–449. doi:10.1016/j.measurement.2016.01.023

Chen, C. C., Liu, Yang, C. C. (2020). An improved fault diagnosis using 1D-convolutional neural network model. Electronics. doi:10.3390/electronics10010059

Chen, Pan, Chen, Yuan (2016). Wavelet transform based on inner product in fault diagnosis of rotating machinery: A review. Mechanical Systems and Signal Processing. doi:10.1016/j.ymssp.2015.08.023

Chen, Liu, Zou, Shi (2024). RSMamba: Remote sensing image classification with state space model. IEEE Geoscience and Remote Sensing Letters, 8002605.

Chen, Zhang, Feng, Jiang (2017). Optimal resonant band demodulation based on an improved correlated kurtosis and its application in bearing fault diagnosis. Sensors, 360.

Cui, Guan, Chen (2021). Rolling element fault diagnosis based on VMD and sensitivity MCKD. IEEE Access, 120297–120308. doi:10.1109/ACCESS.2021.3108972

Ding, Jia, Miao, Cao (2022). A novel time-frequency transformer based on self-attention mechanism and its application in fault diagnosis of rolling bearings. Mechanical Systems and Signal Processing, 168, 108616. doi:10.1016/j.ymssp.2021.108616

Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Houlsby (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Eren, Ince, Kiranyaz (2019). A generic intelligent bearing fault diagnosis system using compact adaptive 1D CNN classifier. Journal of Signal Processing Systems, 179–189. doi:10.1007/s11265-018-1378-3

Gong, Kang, Wang, Wan (2024). nnMamba: 3D biomedical image segmentation, classification and landmark detection with state space model. arXiv preprint arXiv:2402.03526.

Gu, Dao (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

Gu, Goel, Ré (2021). Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.

He, Zhang, Ren, Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778). IEEE Computer Society.

Hou, Wang, Chen (2023). DiagnosisFormer: An efficient rolling bearing fault diagnosis method based on improved Transformer. Engineering Applications of Artificial Intelligence, 124, 106507. doi:10.1016/j.engappai.2023.106507

Peng (2016). Frequency band selection based on the kurtosis of the squared envelope spectrum and its application in bearing fault diagnosis. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 230, 1113–1125.

Huang, Liu, Geng, Liu, Ren, Zhao (2019). Fault diagnosis accuracy improvement using wayside rectangular microphone array for health monitoring of railway-vehicle wheel bearing. IEEE Access, 87410–87424. doi:10.1109/ACCESS.2019.2924832

10.1109/ICCV51070.2023.00553

Huang

Pei

You

Wang

Qian

(

2024

Localmamba: Visual state space model with windowed selective scan

preprint arxiv:2403.09338

Kim

Park

C. H.

Suh

Chae

Yoon

Youn

B. D.

(

2023

MPARN: Multi-scale path attention residual network for fault diagnosis of rotating machines

Journal of Computational Design and Engineering

860

–

872

Liang

Chen

Lin

(

2020

Wheelset bearing fault detection using morphological signal and image analysis

Structural Control and Health Monitoring

e2619

Lin

Chen

Huang

Jin

(

2023

Scale-aware modulation meet transformer

. In

Proceedings of the IEEE/CVF International Conference on Computer Vision

(pp.

5992

–

6003

.).

IEEE Computer Society

Liu

Qian

Liu

(

2017

Wayside acoustic fault diagnosis of train wheel bearing based on Doppler effect correction and fault-relevant information enhancement

Advances in Mechanical Engineering

1687814017732676

10.1177/1687814017732676

10.1109/ICCV48922.2021.00986

Liu

Lin

Cao

Zhang

Guo

. (

2021

Swin transformer: Hierarchical vision transformer using shifted windows

. In

Proceedings of the IEEE/CVF International Conference on Computer Vision

(pp.

9992

–

10002

.).

10.1016/j.physrep.2006.11.001

Luo

Huang

Wang

Luo

Zhou

(

2022

Transfer Learning Based on Improved Stacked Autoencoder for Bearing Fault Diagnosis

Knowledge-Based Systems

256

109846

Marwan

Romano

M. C.

Thiel

Kurths

(

2007

Recurrence plots for the analysis of complex systems

Physics Reports

438

237

–

329

10.1016/j.wear.2004.11.020

Pan

Duraisamy

(

2020

On the structure of time-delay embedding in linear models of non-linear dynamical systems

Chaos: An Interdisciplinary Journal of Nonlinear Science

30(7)

073135

Peng

Kessissoglou

N. J.

Cox

(

2005

A study of the effect of contaminant particles in lubricants using wear debris and vibration condition monitoring techniques

Wear

258

1651

–

1662

10.48550/arXiv.1706.03762

Tang

Wang

(

2022

A novel fault diagnosis method of rolling bearing based on integrated vision transformer model

Sensors

3878

Vaswani

(

2017

Attention is all you need

. In

Proceedings of the 31st International Conference on Neural Information Processing Systems

(pp.

6000

–

6010

.).

Curran Associates, Inc

Google Preview

Wang

Shao

Peng

Liu

(

2024

PSparseFormer: Enhancing fault feature extraction based on parallel sparse self-attention and multiscale broadcast feed-forward block

IEEE Internet of Things Journal

11(13)

22982

–

22991

10.1016/j.measurement.2020.108518

Wang

Xie

Wang

Zhao

Cui

(

2024

NetMamba: Efficient network traffic classification via pre-training unidirectional Mamba

. In

2024 IEEE 32nd International Conference on Network Protocols (ICNP)

(pp.

–

.).

IEEE Computer Society

Wang

Mao

(

2021

Bearing fault diagnosis based on vibro-acoustic data fusion and 1D-CNN network

Measurement

173

108518

10.1016/j.isatra.2019.07.001

Wang

Pan

Yuan

Yang

Gui

(

2020

A novel deep learning based fault diagnosis approach for chemical process with extended deep belief network

ISA Transactions

457

–

467

Wang

Liu

(

2023

A diagnosis method for imbalanced bearing data based on improved SMOTE model combined with CNN-AM

Journal of Computational Design and Engineering

1930

–

1940

Triebe

M. J.

Sutherland

J. W.

(

2023

A transformer-based approach for novel fault detection and fault classification/diagnosis in manufacturing: A rotary system application

Journal of Manufacturing Systems

439

–

452

10.1016/j.jmsy.2023.02.018

10.1016/j.ymssp.2018.07.039

Yang

Peng

Zhang

Meng

(

2019

Parameterised time-frequency analysis methods and their engineering applications: A review of recent advances

Mechanical Systems and Signal Processing

119

182

–

221

10.1007/s10443-022-10095-4

Liu

(

2020

Knowledge Extraction and Insertion to Deep Belief Network for Gearbox Fault Diagnosis

(Vol. 197, p. 05883).

Knowledge-Based Systems

Yuan

Z. W.

Zhang

(

2016

August

Feature extraction and image retrieval based on AlexNet

. In

Eighth International Conference on Digital Image Processing (ICDIP 2016)

(Vol.

10033

, pp.

–

.).

SPIE

Zhang

Wang

Chen

Zhang

Qian

Cui

(

2023

A new composite leaf spring for in-board bogie of new generation high-speed trains

Applied Composite Materials

1377

–

1392

Zhang

Zhou

Lin

Sun

(

2018

ShuffleNet: An extremely efficient convolutional neural network for mobile devices

. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

(pp.

6848

–

6856

.).

Zhao

Chen

(

2024

Dynamics and fault diagnosis of railway vehicle gearboxes: A review

Journal of Dynamics, Monitoring and Diagnostics

852

–

876