Machine learning prediction of brain metastasis invasion pattern on brain magnetic resonance imaging scans

Association Between Imaging Features by Expert Assessment and the Ground Truth of Brain Metastasis Invasion Pattern (BMIP) Determined by Histopathological Assessment

Criteria	Fuzzy border	Tumor size	Edema	Mass effect	Leptomeningeal involvement	Multifocality
TP (n)	74	42	66	45	41	58
TN (n)	4	23	12	40	39	29
FP (n)	45	31	40	14	13	29
FN (n)	19	51	20	45	46	50
Scans excluded (n)	23	18	25	21	26	0
Accuracy	54.9	44.2	56.5	59.0	57.6	52.4
Precision	62.2	57.5	62.3	76.3	75.9	66.7
Recall/sensitivity	79.6	45.1	76.7	50.0	47.1	53.7
Specificity	8.2	42.6	23.1	74.1	75.0	50.0
F1-score	69.8	50.6	68.8	60.4	58.2	59.5

Criteria	Fuzzy border	Tumor size	Edema	Mass effect	Leptomeningeal involvement	Multifocality
TP (n)	74	42	66	45	41	58
TN (n)	4	23	12	40	39	29
FP (n)	45	31	40	14	13	29
FN (n)	19	51	20	45	46	50
Scans excluded (n)	23	18	25	21	26	0
Accuracy	54.9	44.2	56.5	59.0	57.6	52.4
Precision	62.2	57.5	62.3	76.3	75.9	66.7
Recall/sensitivity	79.6	45.1	76.7	50.0	47.1	53.7
Specificity	8.2	42.6	23.1	74.1	75.0	50.0
F1-score	69.8	50.6	68.8	60.4	58.2	59.5

A neuroradiologist retrospectively evaluated T2W and T1W-contrast enhanced images of preoperative scans in patients with surgically resected BrM and correlated findings with BMIP. Fuzzy border, tumor size greater than the median, edema score greater than or equal to 2, mass effect scores less than or equal to one, multifocality of greater than one metastatic lesion identified on the preoperative scan, and the presence of leptomeningeal involvement in the surgically resected specimen were used for evaluation. TP, true positive; TN, true negative; FP, false positive; FN, false negative; n = number of samples. In addition, positive class represents the HI samples, while negative class represents the MI ones.

Table 1.

Open in new tab Download slide

Association Between Imaging Features by Expert Assessment and the Ground Truth of Brain Metastasis Invasion Pattern (BMIP) Determined by Histopathological Assessment

Criteria	Fuzzy border	Tumor size	Edema	Mass effect	Leptomeningeal involvement	Multifocality
TP (n)	74	42	66	45	41	58
TN (n)	4	23	12	40	39	29
FP (n)	45	31	40	14	13	29
FN (n)	19	51	20	45	46	50
Scans excluded (n)	23	18	25	21	26	0
Accuracy	54.9	44.2	56.5	59.0	57.6	52.4
Precision	62.2	57.5	62.3	76.3	75.9	66.7
Recall/sensitivity	79.6	45.1	76.7	50.0	47.1	53.7
Specificity	8.2	42.6	23.1	74.1	75.0	50.0
F1-score	69.8	50.6	68.8	60.4	58.2	59.5

Criteria	Fuzzy border	Tumor size	Edema	Mass effect	Leptomeningeal involvement	Multifocality
TP (n)	74	42	66	45	41	58
TN (n)	4	23	12	40	39	29
FP (n)	45	31	40	14	13	29
FN (n)	19	51	20	45	46	50
Scans excluded (n)	23	18	25	21	26	0
Accuracy	54.9	44.2	56.5	59.0	57.6	52.4
Precision	62.2	57.5	62.3	76.3	75.9	66.7
Recall/sensitivity	79.6	45.1	76.7	50.0	47.1	53.7
Specificity	8.2	42.6	23.1	74.1	75.0	50.0
F1-score	69.8	50.6	68.8	60.4	58.2	59.5

Brain MRI Scan Acquisition

To construct DL and traditional ML models, features were extracted from T2W and T1WC + images from preoperative brain MRIs obtained as part of the standard of care, acquired before and after intravenous administration of gadopentetate dimeglumine (Magnevist®) at a dose of 0.01 mmol/kg body weight. The images were obtained using 2 MRI machines: a 3.0 T Philips MRI machine and a 1.5 T GE Signa scanner. All images were stored in the DICOM image format. Figure 1 demonstrates an example of a subset of slices from T2W and T1WC + sequences and associated manually contoured masks that were subsequently used to generate additional computationally derived masks.

Figure 1.

Examples of 2 MRI images, each in a row, with their overlaid manually segmented masks of (1) the metastatic tumor of interest, determined by the outer margins of the enhancing lesion on contrast-enhanced T1W images (shown in column A) and (2) edema (including primary tumor), determined by the outer margins of abnormal hyperintense signal on T2W images (visualized in column B). Computationally generated masks isolating the area of edema alone on T2W images were also generated (column C).

MRI Sequence Selection and Manual Lesion Segmentation

We classified the MRI (T2W and T1WC+) sequences into 4 distinct groups by primary tumor type (lung, breast, melanoma, other) and divided the total patient population into training and testing subsets. Due to the predominance of lung and breast (LB) primaries in BrM, with a limited number of melanoma or other metastases, melanoma and other (MO) samples were exclusively included in the training set, while LB samples were divided, randomly allocating 80% to the training subset and 20% to the testing subset. Therefore, the final classification performance was only for LB BrM. In developing the convolution-based DL (CDL) models, 20% of the split training set was allocated for validation. However, the entire training set was utilized for developing the classic ML models. For patients with multiple resected specimens, we adopted a patient-wise split strategy to prevent any information leakage between our training and testing subsets (Supplementary Table S1).

In this study, different ML-based models were constructed using features extracted from T2W and T1WC + images from patient preoperative brain MRI scans. For each case, a volume of interest (VOI) was manually drawn around the (1) tumor, represented by an enhancing lesion seen on T1WC+, consisting of the outer margin of the contiguous homogeneous or heterogeneously enhancing tumor component, including, if present, immediately contiguous leptomeningeal and pachymeningeal components; and (2) edema, determined based on the maximum area of high T2 signal surrounding a tumor. At the time of manual segmentation, the edema VOI also included the BrM (Figure 1), subsequently separated computationally. Segmentation was first performed by a medical student (BR) trained to perform this task. Thereafter, all contours were reviewed (and modified, if necessary) with a single board-certified neuroradiologist (SL, MCL, or RF). Each neuroradiologist reviewed approximately one-third of the contours. Adjustments made to the contours included boundary refinements, ensuring the inclusion of leptomeningeal invasion, exclusion of vessels, and correcting contours to account for imaging artifacts. Additionally, facilitating discussions and consensus meetings among neuroradiologists was implemented to agree on challenging cases, which helped standardize the contouring process and reduce variability. Manual contours were generated using the open-source medical image visualization software, 3D Slicer version 5.0.3. Following initial manual contouring of the BrM (T1WC+) and (edema + tumor based on T2W images), additional contours of edema only from T2W images were generated computationally. Leveraging 3D Slicer, a blend of interactive tools, such as intensity-based thresholding, region-growing algorithms, and manual adjustments, were used to ensure accurate and precise outlining of the tumor boundaries. Manual adjustments were executed to ensure alignment of the segmented regions with the tumor edges.

Registration, Data Processing, and Additional VOI Generation

To perform a comprehensive evaluation, we analyzed and tested models extracting features from the tumor on T1WC + (T), combined tumor and associated edema on T2W (T + E), or edema without the actual tumor on T2W images (E). In this specific use case, we were also interested in capturing features from edema elicited by the tumor. We reasoned that important predictive information from the peritumoral invasion of the brain may be captured by analysis of signal changes within the brain parenchyma immediately surrounding the tumor in the area of tumor-associated edema. To accomplish this, we leveraged the manual segmentations described previously to generate additional masks. This was done both for (1) efficiency and (2) obtaining the optimal mask. The rationale for this protocol is the following. On MRI, the gold standard for delineation of the actual brain metastasis is its contour as demonstrated on the T1WC + sequence. However, the optimal delineation of edema is done on T2W images. As such, a computational approach based on the aforementioned manually generated masks is not only more efficient (or less manually laborious) but also would be considered most accurate if the images are co-registered.

Using T1WC + volumes as a reference and using advanced normalization tools^10,11 python package, we registered the T2W volumes and their associated masks to match the dimensions of the T1W data to define a dataset called T1WR. We also subtracted the T1WC + volumes from the corresponding T2W volumes to define a dataset called T1WC + E and T2WE. Together, these approaches allowed us to apply isolated edema masks (without tumor) to high-resolution T2 volumes (E), retaining only the masked area while eliminating the remaining regions of the image.

Radiomic Feature Extraction

Feature sets extracted from T, T + E, and E images were used for prediction independently as well as in combination. We employed PyRadiomics version v3.0.1, and Python version 3.8 package^7,12 to extract a comprehensive set of 107 radiomic features from each sample in the datasets.¹³ In order to avoid overfitting, we limited the number of features to approximately 10% of the sample size. Subsequently, we refined this feature set to include only the top 10 representative features through a 2-step process: (1) eliminating approximately 50% of features, which was experimentally and analytically achieved by removing features with a variance lower than 0.03, and (2) applying the chi-square statistical feature selection technique¹⁴ to select from the remaining features with the highest degree of association with the target (BMIP). Note that features with low variance are typically removed because they exhibit minimal variation across samples, thereby offering little discriminatory power between classes. Additionally, feature selection was conducted exclusively on the training set and performed after the standardization of the features. We specifically performed image-level standardization to adjust the images to a common intensity scale, thereby reducing variations caused due to differences in calibration and sensitivity between the 2 utilized MRI machines. Moreover, we applied feature-level normalization to ensure a consistent feature scale among all the extracted features. Due to the differences in MRI machine configurations, MRI scan dimensionalities, and various feature scales, these 2 steps were required for stable model training and evaluation processes. We examined the T1WC+, T2W, and Registered T2-weighted (T2WR) images individually, employing identical image processing and feature extraction procedures. This methodology guarantees a uniform approach to image processing and feature extraction across all sequences, thereby maintaining the integrity of our final comparisons and ensuring an unbiased assessment. For additional details on processing steps, refer to the Supplementary Materials section.

Model Training

We employed a Voting Ensemble approach, individually enclosing both traditional TML and CDL models. During this process, we constructed 3 distinct TML models: support vector classifier,¹⁵ Random Forest (RF),¹⁶ and Multi-Layer Perceptron (MLP).¹⁷ In this phase, we incorporated the grid search method from the Scikit-Learn Python package with 3-fold cross-validation to select the best-performing model. Experimentally, we observed that further adjusting the class weights in the support vector classifier model, when necessary, improves performance across both classes. Additionally, 3 variations of the EfficientNet¹⁸ model were employed for CDL. Finally, we determined the majority of votes at 2 levels to derive our final results (Figure 2): (1) aggregating the prediction of the images belonging to the same volume; (2) aggregating all 3 models’ predictions for each volume. A visual representation of the process is provided in Figure 2 for prediction using the peritumoral edema. For additional details on model development, please refer to the Supplementary Materials section.

Figure 2.

The process of model development and evaluation involved multiple traditional machine learning approaches (TML) and deep learning (DL) approaches. The predicted labels from each set of TML and DL models were independently aggregated, and the final prediction, for both TML and DL models, was generated by applying the majority vote during the aggregation process. Although MLP is commonly categorized as a deep model, we treated it as a TML model in this context, given its use as a classifier training on the extracted handcrafted radiomics features. SVC, support vector classifier; RF, random forest; MLP, multi-layer perceptron; B0, EfficientNet-B0; B1, EfficientNet-B1; B2, EfficientNet-B2.

Open in new tab Download slide

Results

Conventional MR Imaging Features Do Not Reliably Predict BMIP

A neuroradiologist blinded to the BMIP status of patients was asked to assess the following pre-determined imaging features on preoperative MRI: fuzziness at the tumor-brain interface, tumor size, peritumoral edema, mass effect, multifocality, and leptomeningeal involvement of the surgical specimen. Imaging assessment was performed in 139-166 patients, depending on the parameter of interest, after excluding those cases with motion artifacts, absence of contrast-enhanced sequences, or severely hemorrhagic lesions. Using BMIP determined by histopathology as the ground truth, the presence of a fuzzy tumor-brain interface as assessed by the neuroradiologist, was able to predict BMIP in 54.9% of cases (Table 1). When attempting to predict HI BMIP, sensitivity was 79.6%, specificity was 8.2%, and F1-score was 69.8%. Despite the fact that the development of leptomeningeal metastases has been previously associated with HI BMIP, the presence of leptomeningeal involvement in the target lesion on the preoperative MRI performed poorly as a predictor of BMIP (F1-Score = 0.582; Table 1).

The method of calling BMIP was validated by calculating the inter-observer agreement between the 2 observers to ensure the reliability and consistency of the scoring system. This analysis revealed inter-observer agreement of 83.7% and a Cohen’s Kappa coefficient of 0.66 (95% CI: 0.55–0.77), suggesting significant agreement between the 2 reviewers (Supplementary Table S2).

Using a logistic regression model, we investigated the utility of semantic imaging features, including mass effect, leptomeningeal disease, and multifocal disease, for predicting BMIP. We conducted a hyperparameter grid search with 3- and 5-fold cross-validation techniques, and our best model obtained an accuracy of 59.0% and an F1 score of 71.8% (Supplementary Figure S1). The results indicate that the model performs poorly in classifying the MI and HI classes. Accordingly, the semantic imaging features lack sufficient discriminative information for the model to discriminate between these classes effectively.

Noninvasive Prediction of BMIP From MRI Images Using ML

We next investigated the use of radiomics and ML to determine whether BMIP can be predicted noninvasively based on computerized analysis of MRI images. Given that LB cancer BrM constituted the largest proportion of patients in our cohort, performance in the test set was only evaluated on patients with lung or breast cancer metastasis. We included a total of 112 BrM in the training set (55 lungs, 24 breasts, 13 melanomas, 20 “others”) and 20 BrM in the test set (15 lungs and 5 breasts), representing 20% of the specimens from these primary tumor types (Supplementary Table S1).

To evaluate the top-performing models, which were selected based on the highest F1-score obtained during training on the validation set, we utilized accuracy, precision, recall, and F1-score metrics to compute the model performance on the independent test set. We incorporated the performance of the models after aggregation, aiming to establish a consensus among our best-performing models. Multiple models were evaluated based on tumor features on T1WC + images, tumor and edema on T2W images, and edema only on T2W images. In addition, both TML and CDL models were evaluated.

As shown in Table 2, TML models using edema on T2W images (E) exhibited superior performance compared to CDL models, demonstrating an approximate 18% improvement for the ensembled TML models with an overall accuracy of 85%, precision of 93%, a recall of 87%, and F1-score of 90%. Among the models, the RF model attains the highest F1-score; however, its confusion matrix (Figure 3) indicates only 60% accuracy for the MI class, implying challenges in achieving satisfactory results for both positive (HI) and negative (MI) classes. The MLP (MLP) model itself demonstrates high performance in both classes, significantly contributing to the enhanced results of our ensembled model (MLEnsemble in Table 2). Nevertheless, MLEnsemble not only achieved a comparable 90% F1 score with the RF model but also demonstrated proficiency in accurately predicting both MI and HI classes, achieving the greatest robustness and accuracy using this dataset.

Table 2.

Performance Evaluation of Various Models Based on Accuracy, Precision, Recall, and F1-Score on the Independent Test Set

Learning approach	Data	Model	Accuracy	Precision	Recall	F1-score
Traditional machine-learning (TML)	T—Tumor on T1WC + images	SVC	35.0	100.0	13.3	23.5
		RF	75.0	91.6	73.3	81.4
		MLP	70.0	90.9	66.6	76.9
		MLEnsemble	70.0	90.9	66.6	76.9
	T + E—Tumor and edema on T2W images	SVC	65.0	100.0	53.3	69.5
		RF	65.0	90.0	60.0	72.0
		MLP	50.0	72.7	53.3	61.5
		MLEnsemble	60.0	100.0	46.6	63.6
	E—Edema only on T2W images	SVC	75.0	91.6	73.3	81.4
		RF	85.0	87.0	93.3	90.3
		MLP	80.0	92.3	80.0	85.7
		MLEnsemble	85.0	92.8	86.6	89.6
Convolution-based deep learning (CDL)	T—Tumor on T1WC + images	Eff-B0	57.8	56.6	57.8	55.4
		Eff-B1	54.6	53.5	54.6	53.4
		Eff-B2	70.3	70.2	70.3	70.2
		DLEnsemble	75.0	73.4	75.0	74.0
	T + E—Tumor and edema on T2W images	Eff-B0	51.5	51.8	51.5	51.6
		Eff-B1	54.6	46.9	54.6	45.2
		Eff-B2	60.9	61.6	60.9	61.1
		DLEnsemble	65.0	62.5	65.0	63.6
	E—Edema only on T2W images	Eff-B0	60.94	61.14	60.94	61.03
		Eff-B1	54.69	60.60	54.69	53.36
		Eff-B2	71.88	75.47	71.88	71.88
		DLEnsemble	69.99	86.36	69.99	71.87

Learning approach	Data	Model	Accuracy	Precision	Recall	F1-score
Traditional machine-learning (TML)	T—Tumor on T1WC + images	SVC	35.0	100.0	13.3	23.5
		RF	75.0	91.6	73.3	81.4
		MLP	70.0	90.9	66.6	76.9
		MLEnsemble	70.0	90.9	66.6	76.9
	T + E—Tumor and edema on T2W images	SVC	65.0	100.0	53.3	69.5
		RF	65.0	90.0	60.0	72.0
		MLP	50.0	72.7	53.3	61.5
		MLEnsemble	60.0	100.0	46.6	63.6
	E—Edema only on T2W images	SVC	75.0	91.6	73.3	81.4
		RF	85.0	87.0	93.3	90.3
		MLP	80.0	92.3	80.0	85.7
		MLEnsemble	85.0	92.8	86.6	89.6
Convolution-based deep learning (CDL)	T—Tumor on T1WC + images	Eff-B0	57.8	56.6	57.8	55.4
		Eff-B1	54.6	53.5	54.6	53.4
		Eff-B2	70.3	70.2	70.3	70.2
		DLEnsemble	75.0	73.4	75.0	74.0
	T + E—Tumor and edema on T2W images	Eff-B0	51.5	51.8	51.5	51.6
		Eff-B1	54.6	46.9	54.6	45.2
		Eff-B2	60.9	61.6	60.9	61.1
		DLEnsemble	65.0	62.5	65.0	63.6
	E—Edema only on T2W images	Eff-B0	60.94	61.14	60.94	61.03
		Eff-B1	54.69	60.60	54.69	53.36
		Eff-B2	71.88	75.47	71.88	71.88
		DLEnsemble	69.99	86.36	69.99	71.87

The table demonstrates the performance of both traditional machine learning (TML) and deep learning (DL) models used for BMIP prediction based on features from the (1) tumor on T1WC + images, (2) tumor + edema on T2W images, and (3) edema only on T2W images. Model performance is shown for multiple individual and ensemble models using TML or DL.

Table 2.

Open in new tab Download slide

Performance Evaluation of Various Models Based on Accuracy, Precision, Recall, and F1-Score on the Independent Test Set

Learning approach	Data	Model	Accuracy	Precision	Recall	F1-score
Traditional machine-learning (TML)	T—Tumor on T1WC + images	SVC	35.0	100.0	13.3	23.5
		RF	75.0	91.6	73.3	81.4
		MLP	70.0	90.9	66.6	76.9
		MLEnsemble	70.0	90.9	66.6	76.9
	T + E—Tumor and edema on T2W images	SVC	65.0	100.0	53.3	69.5
		RF	65.0	90.0	60.0	72.0
		MLP	50.0	72.7	53.3	61.5
		MLEnsemble	60.0	100.0	46.6	63.6
	E—Edema only on T2W images	SVC	75.0	91.6	73.3	81.4
		RF	85.0	87.0	93.3	90.3
		MLP	80.0	92.3	80.0	85.7
		MLEnsemble	85.0	92.8	86.6	89.6
Convolution-based deep learning (CDL)	T—Tumor on T1WC + images	Eff-B0	57.8	56.6	57.8	55.4
		Eff-B1	54.6	53.5	54.6	53.4
		Eff-B2	70.3	70.2	70.3	70.2
		DLEnsemble	75.0	73.4	75.0	74.0
	T + E—Tumor and edema on T2W images	Eff-B0	51.5	51.8	51.5	51.6
		Eff-B1	54.6	46.9	54.6	45.2
		Eff-B2	60.9	61.6	60.9	61.1
		DLEnsemble	65.0	62.5	65.0	63.6
	E—Edema only on T2W images	Eff-B0	60.94	61.14	60.94	61.03
		Eff-B1	54.69	60.60	54.69	53.36
		Eff-B2	71.88	75.47	71.88	71.88
		DLEnsemble	69.99	86.36	69.99	71.87

Learning approach	Data	Model	Accuracy	Precision	Recall	F1-score
Traditional machine-learning (TML)	T—Tumor on T1WC + images	SVC	35.0	100.0	13.3	23.5
		RF	75.0	91.6	73.3	81.4
		MLP	70.0	90.9	66.6	76.9
		MLEnsemble	70.0	90.9	66.6	76.9
	T + E—Tumor and edema on T2W images	SVC	65.0	100.0	53.3	69.5
		RF	65.0	90.0	60.0	72.0
		MLP	50.0	72.7	53.3	61.5
		MLEnsemble	60.0	100.0	46.6	63.6
	E—Edema only on T2W images	SVC	75.0	91.6	73.3	81.4
		RF	85.0	87.0	93.3	90.3
		MLP	80.0	92.3	80.0	85.7
		MLEnsemble	85.0	92.8	86.6	89.6
Convolution-based deep learning (CDL)	T—Tumor on T1WC + images	Eff-B0	57.8	56.6	57.8	55.4
		Eff-B1	54.6	53.5	54.6	53.4
		Eff-B2	70.3	70.2	70.3	70.2
		DLEnsemble	75.0	73.4	75.0	74.0
	T + E—Tumor and edema on T2W images	Eff-B0	51.5	51.8	51.5	51.6
		Eff-B1	54.6	46.9	54.6	45.2
		Eff-B2	60.9	61.6	60.9	61.1
		DLEnsemble	65.0	62.5	65.0	63.6
	E—Edema only on T2W images	Eff-B0	60.94	61.14	60.94	61.03
		Eff-B1	54.69	60.60	54.69	53.36
		Eff-B2	71.88	75.47	71.88	71.88
		DLEnsemble	69.99	86.36	69.99	71.87

Figure 3.

The confusion matrices of the traditional machine-learning (TML) models, including SVC, RF, and MLP, along with their corresponding ensemble aggregation, are based on computerized analysis and machine-learning prediction using edema on T2W images.

Among the other models, the next best-performing model based on accuracy was the CDL model using prediction from analysis of the enhancing tumor component on T1WC + images (Table 2). Similar to the TDL model, the ensembled model had the best performance, with an accuracy of 75%, precision of 73%, recall of 75%, and F1-score of 74% (Table 2). The performance metrics of the other models are provided in Table 2. The hand-crafted radiomic features used for BMIP prediction are described in detail in Supplemental Table S3.

We also conducted experiments by training our TML models exclusively on LB edema radiomics features to assess the impact of primary tumor type on the model performance. Note that this decreases the training size by 43%. To ensure a fair and comparable evaluation with the other analyses performed, we used the same approach as in our main TML experiments for data splitting and parameter tunning on training and testing sets, excluding only the MO samples from the training set in these analyses. These analyses demonstrate only minimal performance degradation after excluding the MO samples from the training set, achieving an accuracy of 80.0% and an F1 score of 86.7% (Supplementary Figure S2), compared to our top-performing TML model, which achieved an F1 score of 89%. Hence, including MO samples in the training process does not harm and may in fact enhance the robustness of the developed models, underscoring the importance of diverse sample inclusion.

Discussion

In this study, we investigated the potential of radiomics and ML for predicting BMIP in BrM using preoperative brain MRI scans. Surgically resected BrM can be classified into minimally (MI) and highly (HI) invasive BMIP subtypes based on their histopathological growth pattern and relation to the adjacent brain parenchyma.² HI BrM have been demonstrated to be associated with shortened local recurrence-free, leptomeningeal metastasis-free, and overall survival.^2,4 In preclinical models, the HI BMIP pattern has also been suggested to serve as a predictive biomarker for emerging, but not yet approved, therapies targeting pSTAT3-expressing astrocytes in the BrM microenvironment.¹⁹ However, any future clinical application and use of this biomarker requires a priori knowledge of the BMIP before therapy, which cannot currently be achieved given the fact that BMIP can only be determined with pathological evaluation of resected tumor specimens. Our study demonstrates the potential of radiomics and ML to predict BMIP noninvasively, prior to resection, with a high accuracy using an ensemble TML model. Such a model has the potential to serve as a noninvasive image-based biomarker for determining prognosis and response to therapy in patients with BrM. While other studies have attempted to correlate imaging-based features with invasion in BrM,^20,21 the results described herein are the first to noninvasively predict BMIP using radiomics and ML performed on brain MRI scans.

The prediction of BMIP in this study builds upon an increasing literature demonstrating the use of ML for developing image-based biomarkers that can enhance or augment expert evaluation, providing lesion characterization beyond what can be done using conventional largely qualitative image analysis performed by the naked human eye.⁸ The best-performing model in our study, the ensemble TML model, achieved an accuracy of 85%, whereas none of the conventional imaging features assessed by the expert neuroradiologist on the same dataset were found to reliably predict BMIP. Similarly, the TML model achieved a 90% F1-Score, compared to the 70% F1 score by evaluating conventional imaging features.

Importantly, the ML model with the highest accuracy was the model based on the peritumoral edema. This is congruent with the current biological understanding of BMIP, where one can hypothesize that the features most representative and predictive of BMIP are the stromal reaction and edema in the invaded brain parenchyma.

In our sample, the TML models were superior in predicting BMIP compared to the CDL models. Given that it is well established that deep neural networks typically require much larger sample sizes for training compared to traditional ML approaches, the most likely explanation for this finding in our cohort is the limited sample size. As such, while this study lays out the framework for image-based prediction of BMIP, refinement of these models with larger sample sizes has the potential to improve predictive performance and more effectively use DL architectures for model development. Despite the small sample size, we took multiple steps to ensure the reliability of the performance metrics reported, which include (1) a robust and consistent preprocessing pipeline, (2) random assignment and use of an independent test set for performance evaluation, and (3) ensuring that when more than one metastasis was resected and used for model development, data from the same patient was not used both in the training and test sets to avoid data leakage and violation of the independence assumption.

With further development and validation of additional datasets, a tool such as the one established herein could have important clinical applications. BMIP has been demonstrated to be an important prognostic tool for patients with surgically resected BrM.^2,4 Possessing knowledge of BMIP prior to surgical resection from a preoperative MRI may result in modifications to surgical planning and extent of resection, which may lead to improved patient outcomes. Furthermore, adjuvant stereotactic radiosurgery (SRS) after neurosurgical resection of BrM is the current standard of care,²² and radiation oncologists may choose to irradiate with larger margins if a resected specimen has a HI BMIP. However, there are ongoing studies to assess whether radiation therapy prior to surgical resection is superior. Knowing the BMIP from the preoperative MRI may therefore help with SRS planning in this context. Additionally, work is ongoing to use BMIP as a predictive biomarker for anti-cancer treatment. In the context of liver metastases, this proof-of-concept has been extended to patients, with replacement growth patterns being associated with poor responses to anti-angiogenic therapy in patients.²³

Our study has several limitations. The most important limitations are the small sample size and the absence of an external validation set. These are largely the result of the study evaluating a unique endpoint that is not routinely evaluated or reported in clinical practice at most centers. As the reporting of BMIP becomes more widely adopted, future studies expanding on our observations will be easier to perform. Our observations will require additional independent evaluation and validation using larger and more diverse datasets, and eventually, prospective studies demonstrating efficacy as a biomarker. While the model described herein was trained on BrM of various primary tumor types, the performance was only tested on LB cancer BrM in the independent test set, a decision made because of the small sample size of the less common primary types, which would preclude a reliable or meaningful evaluation. Ideally, such a study would be performed separately for each primary cancer type, if sample size allowed. Since LB cancer BrM represents approximately 75% of BrM, the results may be applicable to a majority of patients with BrM. Furthermore, this study was performed using data exclusively from patients who underwent surgical resection and had sufficient tissue for histopathological analysis, potentially limiting the generalizability of the findings to the broader population of patients with BrM. In order for these findings to be generalizable to all BrM, it is imperative for future studies to be performed with patients who had non-resectable lesions and subsequent autopsy to determine BMIP. Finally, the use of 2D convolutional models for a task involving spatial data like MRI might overlook crucial spatial relationships captured in 3D structures. Future studies may aim to incorporate additional analyses to explore spatial relationships of captured features.

The ground truth used for BMIP in this study was histopathological assessment. It is important to note that this methodology, while the gold standard for BMIP assessment, is likely imperfect. BMIP is only discernable on specimens deemed to have sufficient brain-tumor interface for histopathological evaluation. While this methodology has demonstrated its clinical relevance given the association of BMIP with clinical outcomes,² approximately one-third of surgically resected BrM have insufficient tissue for BMIP determination. Furthermore, of the specimens that are amenable to BMIP determination, the totality of the brain-tumor interface is seldom able to be examined given the palliative intent of neurosurgical resection of brain metastases that do not require negative microscopic margins circumferentially around resected the metastatic lesion.⁹ This implies the possibility that some patients with MI lesions may have undetected components of the tumor with prominent invasion. While this can be seen as a limitation of this paradigm, it may also serve as a strength, in that a noninvasive model such as the one established herein may be able to stratify patients with indeterminate BMIP as determined by histopathology, to predict their clinical course or response to treatment.

In conclusion, this study demonstrates the feasibility of ML for the development of a noninvasive image-based biomarker for predicting BMIP. This is an important proof-of-concept demonstrating that imaging features, particularly in the peritumoral brain, may be used to identify invasive metastatic cancer cells in the brain. Furthermore, these findings highlight the fact that BMIP may be more widely studied as a predictive biomarker in preclinical and clinical contexts, given encouraging results suggesting that it can be determined accurately in a noninvasive manner and in the absence of a surgical specimen.

Funding

This work was funded by Spark Grants on the Application of Disruptive Technologies in Cancer Prevention and Early Detection of the Canadian Cancer Society and the Canadian Institutes of Health Research—Institute of Cancer Research and Brain Canada Foundation (CCS grant #707078/CIHR grant #707078). This Project has been made possible with the financial support of Health Canada, through the Canada Brain Research Fund, an innovative partnership between the Government of Canada (through Health Canada) and Brain Canada, and the Canadian Cancer Society. While at McGill, R.F. was also a clinical research scholar (chercheur-boursier clinicien) supported by the Fonds de recherche en santé du Québec (FRQS) and had an operating grant jointly funded by the FRQS and the Fondation de l’Association des radiologistes du Québec (FARQ).

Acknowledgments

The authors thank all of the patients who donated their brain metastase tissues to this research. We thank Dr. Farhad Maleki for their insightful comments on the manuscript.

Conflicts of interest statement

R.F. has had a research collaboration/grant and has acted as consultant and/or speaker for Nuance Communications/Microsoft Inc., Canon Medical Systems Inc., and GE Healthcare. R.F. has also served on the clinical advisory board of Automated Imaging Diagnostics/Neuropacs Inc. R.F. is also a co-investigator on a National Institutes of Health STTR grant subaward and a co-principal investigator on a National Science Foundation grant. The authors declare no other conflicts of interest. All authors have reviewed and approved the final version of the article.

Authorship statement

K.N. participated in the design of the study, in particular the ML component, and was involved in the development and execution of the ML part of the study. B.R. played a key role in cohort discovery and initial image processing and performed manual tumor segmentations that were subsequently used for additional computationally derived contour as well as feature extraction and lesion analysis for ML algorithm training and evaluation. A.N. compiled clinical data and performed statistical analyses. N.M. performed image segmentation and supported study execution. S.G. contributed to cohort discovery, clinical lesion assessment, and supported study execution. K.P. provided clinical expertise and supported study execution. R.Z. contributed to the initial grant to grant preparation and study execution. C.R. contributed to the study design and supported the study execution. A.B-F. and J.K.W. contributed to specific parts of the study design and execution planning, particularly the approach for image registration and computational derivation of the edema maps. M-C.G. performed histopathological interpretation of patient specimens to determine BMIP. M-C.L. provided clinical expertise and oversaw part of the tumor segmentation. S.L. provided clinical expertise, oversaw part of the tumor segmentation, and performed the expert evaluation for prediction of BMIP. P.M.S. and K.P. provided study supervision, guiding clinical and experimental rationale. M.D. conceived the study concept and was involved in study design, grant preparation, cohort discovery, and determination of ground truth BMIP on pathology slides. R.F. was involved in every aspect of this study including the initial inception and design of the study, he was the principal investigator on a grant funding this study and oversaw the overall execution of the study. All authors were involved in manuscript drafting and/or review.

Data availability

The source numerical data from this study will be made available upon reasonable request. The actual patient images cannot be shared publicly due to patient privacy restrictions.

References

Achrol

Rennert

Anders

, et al.

Brain metastases

Nat Rev Dis Primers.

2019

;

(

Dankner

Caron

Al-Saadi

, et al.

Invasive growth associated with Cold-Inducible RNA-Binding Protein expression drives recurrence of surgically resected brain metastases

Neuro-Oncology.

2021

;

(

1470

–

1480

Berghoff

Rajky

Winkler

, et al.

Invasion patterns in brain metastases of solid cancers

Neuro-Oncology.

2013

;

(

1664

–

1672

Siam

Bleckmann

Chaung

, et al.

The metastatic infiltration at the metastasis/brain parenchyma-interface is very heterogeneous and has a significant impact on survival in a prospective study

Oncotarget

2015

;

(

29254

–

29267

Haneberg

Pierre

Winter-Reinhold

, et al.

Introduction to Radiomics and Artificial Intelligence: A Primer for Radiologists

. Semin Roentgenol.

2023

;

(

152

–

157

Google Preview

Forghani

Precision digital oncology: Emerging role of radiomics-based biomarkers and artificial intelligence for advanced imaging and characterization of brain tumors

Radiol Imaging Cancer

2020

;

(

e190047

Gillies

Kinahan

Hricak

Radiomics: Images are more than pictures, they are data

Radiology.

2016

;

278

(

563

–

577

Nowakowski

Lahijanian

Panet-Raymond

, et al.

Radiomics as an emerging tool in the management of brain metastases

Neurooncol. Adv..

2022

;

(

vdac141

PubMed

Vogelbaum

Brown

Messersmith

, et al.

Treatment for brain metastases: ASCO-SNO-ASTRO Guideline

J Clin Oncol.

2022

;

(

492

–

516

10.

Tustison

Cook

Klein

, et al.

Large-scale evaluation of ANTs and FreeSurfer cortical thickness measurements

Neuroimage.

2014

;

166

–

179

11.

Avants

Tustison

Stauffer

, et al.

The Insight ToolKit image registration framework

Front Neuroinf.

2014

;

Crossref

12.

Van Griethuysen

Fedorov

Parmar

, et al.

Computational radiomics system to decode the radiographic phenotype

Cancer Res.

2017

;

(

e104

–

e107

13.

Karu

Jain

Bolle

RM.

Is there any texture in the image

Pattern Recognit.

1996

;

(

1437

–

1446

Crossref

14.

Ferri

Pudil

Hatef

Kittler

Comparative study of techniques for large-scale feature selection

. In: Gelsema ES, Kanal LS, eds.

Machine Intelligence and Pattern Recognition

. Vol

Amsterdam, Netherlands

Elsevier

;

1994

403

–

413

Google Preview

15.

Platt

Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods

Adv Large Margin Classif

1999

;

(

–

16.

Breiman

Random forests

Mach Learn.

2001

;

–

Crossref

17.

Hinton

GE.

Connectionist learning procedures

. In: Kodratoff Y, Michalski RS, eds.

Machine Learning

Amsterdam, Netherlands

Elsevier

;

1990

555

–

610

18.

Tan

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

. In:

PMLR

;

2019

6105

–

6114

Google Preview