Marieke M van Buchem, Anne A H de Hond, Claudio Fanconi, Vaibhavi Shah, Max Schuessler, Ilse M J Kant, Ewout W Steyerberg, Tina Hernandez-Boussard, Applying natural language processing to patient messages to identify depression concerns in cancer patients, Journal of the American Medical Informatics Association, Volume 31, Issue 10, October 2024, Pages 2255–2262, https://doi.org/10.1093/jamia/ocae188
Abstract
This study aims to explore and develop tools for early identification of depression concerns among cancer patients by leveraging the novel data source of messages sent through a secure patient portal.
We developed classifiers based on logistic regression (LR), support vector machines (SVMs), and 2 Bidirectional Encoder Representations from Transformers (BERT) models (original and Reddit-pretrained) on 6600 patient messages from a cancer center (2009-2022), annotated by a panel of healthcare professionals. Performance was compared using AUROC scores, and model fairness and explainability were examined. We also examined correlations between model predictions and depression diagnosis and treatment.
RedditBERT and BERT attained AUROC scores of 0.88 and 0.86, respectively, compared with 0.79 for LR and 0.83 for SVM. BERT showed larger differences in performance across sex, race, and ethnicity than RedditBERT. Patients who sent messages classified as concerning were more likely to receive a depression diagnosis, a prescription for antidepressants, or a referral to the psycho-oncologist. Explanations from BERT and RedditBERT differed, with no clear preference among annotators.
We show the potential of BERT and RedditBERT in identifying depression concerns in messages from cancer patients. Performance disparities across demographic groups highlight the need for careful consideration of potential biases. Further research is needed to address biases, evaluate real-world impacts, and ensure responsible integration into clinical settings.
This work represents a significant methodological advancement in the early identification of depression concerns among cancer patients. By leveraging BERT-based models, our work contributes to a pathway for reducing clinical burden while enhancing overall patient care.
Background and significance
Depression is common in cancer patients and negatively associated with treatment outcomes, prognosis, and quality of life.1–5 Despite its prevalence in cancer patients (20% in the United States6), depression often remains underdiagnosed, leading to delayed intervention, poorer treatment adherence, and potential exacerbation of the patient’s overall health status.1–3,7–9 Early identification of depression symptoms may facilitate timely mental health support.
A majority of tools for depression screening in cancer patients utilize structured surveys. Many of these tools perform well in identifying cancer patients with depression. However, most clinicians do not use structured depression scales during routine clinical care, as they perceive them as too long.10 To address this, machine learning (ML) approaches using clinical data have also been explored.11 For example, Cho et al trained an ML model to predict depression using nationwide clinical check-up data, attaining an AUC of 0.84.12 While some ML tools demonstrate promising performance, they primarily rely on clinician-generated data, which may not fully capture the patient’s perspective or experiences. This limitation highlights the need for alternative data sources that can provide a more comprehensive view of the patient’s mental health status.
Patient-generated health data, such as messages sent through secure patient portals, present a unique yet underutilized source of information for identifying signs of depression. These messages, often exchanged between patients and healthcare providers, can provide insights into the patient’s mental health status, potentially enabling early detection and treatment of depression symptoms. However, the increasing volume of patient messages also creates challenges for healthcare professionals. Previous studies suggest that the high volume of clinical communications can lead to professional exhaustion and burnout.13–15 Applying natural language processing (NLP) to patient-generated health data might offer a solution by monitoring all incoming messages for potential signs of depression.
Most previous work on applying NLP to identify mental health issues in patient-generated health data has focused on social media data.16–25 Social media is a valuable source, as it provides extensive documentation of people’s personal thoughts, experiences, and ideas. Reddit in particular is an interesting medium, as it is fully anonymous, enabling honest conversations between users.26,27 However, the downside of detecting mental health issues through these kinds of media is that it is difficult to provide support to the individual users. A recent study described the development of an NLP model to predict suicide-related events from patient portal messages, showing promising results comparable to commonly used assessment tools.28 This suggests that it might be possible to assist healthcare professionals in identifying cancer patients who potentially suffer from depression through their portal messages.
Objective
This study aims to develop a proof-of-concept model using NLP to analyze incoming patient messages and identify those messages indicative of depression concerns. We compare classical ML approaches with neural network-based Bidirectional Encoder Representations from Transformers (BERT) models29 and investigate whether domain-adaptive pretraining on Reddit data improves model performance.
Methods
Data
The dataset consisted of patient-initiated messages that were sent through a secure patient portal by patients from a comprehensive cancer cohort, containing all patients who visited the Stanford Cancer Center from 2009 until 2022. The secure patient portal allows patients to send messages to their care team. We only included English, patient-initiated messaging threads and excluded all standard communication, eg, reminders for appointments and invitations for patient satisfaction questionnaires. We did not exclude patients with a prior depression diagnosis, as this has previously been found to be the most predictive factor for developing depression.30,31 We aimed for a final sample size of at least 5000 manually annotated messages, based on similar studies leveraging BERT for binary text classification on manually annotated data.32–34 For the final annotation sample, we randomly sampled 50% of messages from the dataset. To decrease the imbalance in the sample towards non-concerning messages, we selected the other 50% of the annotation sample to contain potentially alarming or concerning content. To this end, an experienced social worker and data scientist created 2 lists of words that signaled concern or alarm, respectively (see Supplementary Appendix A). The selected sample included all the messages containing alarming words, supplemented with randomly sampled messages containing concerning words. This distribution was chosen to maximize the number of potentially concerning messages within the annotation sample. Additionally, special attention was given to discerning depression from anxiety. Our word lists included terms related to both depression and anxiety to ensure these messages were manually annotated, enhancing the model’s capability to distinguish messages expressing anxiety from potential depression.
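As an illustration of this sampling procedure, the sketch below assembles such an annotation sample in Python, assuming the messages are available in a pandas DataFrame with a "text" column. The file name, column name, and word lists are placeholders; the actual word lists appear in Supplementary Appendix A.

```python
import pandas as pd

# Placeholder word lists; the actual lists are in Supplementary Appendix A.
ALARMING_WORDS = ["hopeless", "suicidal"]
CONCERNING_WORDS = ["depressed", "anxious", "crying"]

def contains_any(text, words):
    text = str(text).lower()
    return any(word in text for word in words)

# Hypothetical file with one patient-initiated message per row.
messages = pd.read_csv("patient_messages.csv")

alarming = messages[messages["text"].apply(lambda t: contains_any(t, ALARMING_WORDS))]
concerning = messages[messages["text"].apply(lambda t: contains_any(t, CONCERNING_WORDS))]

n_target = 6600
# 50% of the annotation sample is drawn at random from all messages.
random_half = messages.sample(n=n_target // 2, random_state=42)

# The other 50% contains all messages with alarming words, topped up with
# randomly sampled messages containing concerning words.
n_topup = max(n_target // 2 - len(alarming), 0)
enriched_half = pd.concat([alarming, concerning.sample(n=n_topup, random_state=42)])

annotation_sample = pd.concat([random_half, enriched_half]).drop_duplicates()
```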
Ethical considerations
This study was approved by the Stanford institutional review board (#47644). Informed consent was waived for this retrospective study for access to personally identifiable health information as it would not be reasonable, feasible, or practical. The data are housed in the Stanford Nero Computing Platform, which is a highly secure, fully integrated internal research data platform meeting all security standards for high risk and protected health information data. The security is managed and monitored, and the platform is updated and adapted to meet regulatory changes.
Annotation process
The definition of when a message was “concerning for depression” was created through multi-stakeholder input (see Supplementary Appendix B). We organized brainstorming sessions with oncologists, data scientists, medical students, and a social worker, during which we iteratively worked towards a definition and annotation guideline that everyone agreed on (see Supplementary Appendix B). When consensus was reached, the annotation was performed by 7 healthcare professionals. We assembled a diverse group of annotators in terms of clinical experience, age, gender, race, and ethnicity. The final set of 6600 annotated messages was used as the reference standard. Of these 6600 messages, a set of 100 messages was annotated by all annotators. We computed the inter-annotator agreement (IAA) on this sample using Krippendorff’s alpha and used a majority vote to define the reference standard for these 100 messages. See Figure 1 for an overview of the annotation process.
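The sketch below shows how the agreement and the majority-vote reference standard can be computed with the krippendorff Python package, using a synthetic 7 × 100 label matrix in place of the actual annotations.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Synthetic example: 7 annotators x 100 messages, binary labels
# (1 = concerning for depression). The real matrix comes from the annotation round.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(7, 100))

# Krippendorff's alpha for nominal data; np.nan entries would mark missing labels.
alpha = krippendorff.alpha(reliability_data=ratings.astype(float),
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")

# Majority vote across the 7 annotators defines the reference standard.
reference_standard = (ratings.sum(axis=0) >= 4).astype(int)
```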

Models
For this study, we aimed to compare the performance of 2 baseline ML models (logistic regression [LR] and support vector machines [SVMs]) to 2 BERT models: a naïve BERT base model and a BERT base model that underwent continued domain-adaptive pretraining on Reddit data. Continued domain-adaptive pretraining is computationally less expensive than pretraining from scratch, while it has been shown to improve performance on specific tasks.35–40 We chose the BERT base model in combination with continued pretraining on Reddit data, as opposed to ClinicalBERT41 or BioBERT,36 because Reddit data include a wide range of discussions, including those related to personal experiences and mental health, which are closer to the type of conversational and informal language found in patient portal messages. Patient portal messages often reflect patients’ everyday language and concerns, which may not be captured in more formal medical records or biomedical literature. This similarity in language and context can help the model better understand and interpret patient messages. We specifically used the Depression subreddit to include the type of language that people use to talk about depression.
For the LR and SVM models, we first preprocessed the data by lowercasing all text, removing stop words, and stemming the words. We used term frequency-inverse document frequency (TF-IDF) to extract features from the data. The final dataset was split into 70-15-15 train-validation-test sets. To find the best hyperparameters for the TF-IDF vectorizer, the LR, and the SVM, we performed a grid search using the training and validation sets. We then fit the LR and SVM models with the best hyperparameters and performed a bootstrap with 1000 samples on the test set to determine the performance of the models.
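The sketch below outlines this baseline pipeline for the LR (the SVM is analogous) using scikit-learn; the placeholder data and hyperparameter grid are illustrative rather than the settings reported in Supplementary Appendix E.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.pipeline import Pipeline

# Placeholder data; in practice, the preprocessed (lowercased, stop-word-free,
# stemmed) patient messages and their annotations are used.
texts = ["feel hopeless and exhausted", "question about my next appointment"] * 200
labels = [1, 0] * 200

# 70-15-15 train-validation-test split.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)

pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)],
              "tfidf__min_df": [1, 5],
              "clf__C": [0.1, 1, 10]}

# Grid search over the fixed train/validation split (not k-fold cross-validation).
split = PredefinedSplit([-1] * len(X_train) + [0] * len(X_val))
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=split)
search.fit(X_train + X_val, y_train + y_val)

# Bootstrap the held-out test set (1000 samples) to estimate the AUROC and its 95% CI.
probs = search.best_estimator_.predict_proba(X_test)[:, 1]
rng = np.random.default_rng(42)
aurocs = [roc_auc_score(np.array(y_test)[idx], probs[idx])
          for idx in (rng.integers(0, len(y_test), len(y_test)) for _ in range(1000))]
print(np.mean(aurocs), np.percentile(aurocs, [2.5, 97.5]))
```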
For our domain-adaptive pretraining task, we chose a specific Reddit community (“subreddit”) called “r/Depression.” It is the largest subreddit focused on depression, with more than 900 000 members and millions of posts since 2009, providing extensive data over a long time span. From this subreddit, we scraped 1 000 000 posts and continued pretraining the BERT base model for 20 epochs.35 The pretraining was performed using 4 GPUs on the Google Cloud Platform; the final model is referred to as RedditBERT. Both BERT base and RedditBERT were finetuned on the binary classification task of identifying concerning messages, using the annotated sample of patient messages.
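The sketch below condenses the two steps (continued masked-language-model pretraining followed by fine-tuning) using the Hugging Face transformers and datasets libraries. The example posts, messages, output paths, and training arguments are placeholders; our hyperparameters are listed in Supplementary Appendix E.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder inputs; in practice, the 1 000 000 scraped r/Depression posts and
# the 6600 annotated patient messages are used.
reddit_posts = ["i have not slept in days and feel empty", "nothing helps anymore"]
texts, labels = ["i feel hopeless", "refill request for my medication"], [1, 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Step 1: continued domain-adaptive pretraining (masked language modeling) -> RedditBERT.
reddit_ds = Dataset.from_dict({"text": reddit_posts}).map(tokenize, batched=True)
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="redditbert", num_train_epochs=20),
    train_dataset=reddit_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
).train()
mlm_model.save_pretrained("redditbert")
tokenizer.save_pretrained("redditbert")

# Step 2: fine-tuning on the binary task of identifying concerning messages.
clf_ds = Dataset.from_dict({"text": texts, "label": labels}).map(tokenize, batched=True)
clf_model = AutoModelForSequenceClassification.from_pretrained("redditbert", num_labels=2)
Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="redditbert-finetuned", num_train_epochs=5),
    train_dataset=clf_ds,
    tokenizer=tokenizer,  # enables dynamic padding of the classification batches
).train()
```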
Associations between model predictions and patient characteristics
We conducted a comparative analysis of patient characteristics and clinical outcomes between patients who sent messages deemed concerning by RedditBERT and those who did not. Included outcomes were a depression diagnosis, prescriptions for depression medication, and mental health referrals (Supplementary Appendix C). Differences in categorical variables were assessed using a chi-square test. For continuous variables, an independent-samples t-test was performed; when continuous variables were compared across more than 2 groups, a one-way analysis of variance (ANOVA) was performed.
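A minimal sketch of these comparisons with scipy is shown below, using a synthetic patient-level table with hypothetical column names.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway, ttest_ind

# Synthetic patient-level table; the column names ("concerning", "race", "age") are hypothetical.
rng = np.random.default_rng(0)
patients = pd.DataFrame({
    "concerning": rng.integers(0, 2, 300).astype(bool),  # sent >=1 message flagged by the model
    "race": rng.choice(["Asian", "Black", "White", "Other"], 300),
    "age": rng.normal(61, 14, 300),
})

# Categorical variable vs classification: chi-square test on the contingency table.
chi2, p_chi2, _, _ = chi2_contingency(pd.crosstab(patients["race"], patients["concerning"]))

# Continuous variable, 2 groups: independent-samples t-test.
t_stat, p_t = ttest_ind(patients.loc[patients["concerning"], "age"],
                        patients.loc[~patients["concerning"], "age"])

# Continuous variable, more than 2 groups: one-way ANOVA.
f_stat, p_f = f_oneway(*[g["age"].to_numpy() for _, g in patients.groupby("race")])
```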
Explainability
We used Local Interpretable Model-Agnostic Explanations (LIME) to compare BERT and RedditBERT explanations on a patient message level.42 LIME provides the local importance of each word to the model’s classification of a specific patient message. Our sample included 32 texts categorized into 4 distinct buckets based on the outputs of the 2 BERT models (BERT and RedditBERT) and their alignment with the reference standard (human annotation) (see Figure 2). Subsequently, our 7 annotators were asked to compare the predictions generated by the 2 models and the corresponding LIME explanations. Through a structured survey, annotators were asked to indicate which prediction they agreed with and which explanation they preferred (Supplementary Appendix D). Additionally, the survey provided an opportunity for the annotators to explain their decisions.

Figure 2. Description of the four buckets used for evaluation of the explainability. An empty cell indicates agreement, a diagonal line indicates disagreement.
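A minimal sketch of how a LIME explanation can be generated for a single message is shown below; the model path, class names, and example message are hypothetical.

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a fine-tuned classifier (path is hypothetical).
tokenizer = AutoTokenizer.from_pretrained("redditbert-finetuned")
model = AutoModelForSequenceClassification.from_pretrained("redditbert-finetuned")
model.eval()

def predict_proba(texts):
    # LIME perturbs the message and needs class probabilities for every perturbed variant.
    enc = tokenizer(list(texts), truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["not concerning", "concerning"])
message = "I have been feeling hopeless lately and barely sleep."  # invented example
explanation = explainer.explain_instance(message, predict_proba, num_features=10)
print(explanation.as_list())  # local word-level importance for this single message
```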
Results
Patient characteristics
The total dataset included 6600 messages from 3312 unique patients. The cohort was predominantly female (60%), had a mean age of 61 years, and consisted largely of White and Asian, privately insured patients (see Table 1 for more characteristics). Our final test set consisted of 907 messages (14% of the total labeled set) from 760 unique patients; 90 of these messages (10%) were annotated as concerning for depression.
Table 1. Characteristics of the 3312 unique patients in the dataset.

| Characteristic | N = 3312 |
|---|---|
| Demographics | |
| Female sex, N (%) | 2002 (60) |
| Age, mean (SD) | 61.3 (13.7) |
| English speaking, N (%) | 3080 (93) |
| Race, N (%) | |
| Asian | 725 (22) |
| Black | 65 (2) |
| White | 2133 (64) |
| Other | 364 (11) |
| Ethnicity, N (%) | |
| Hispanic/Latino | 204 (6) |
| Non-Hispanic/non-Latino | 3060 (92) |
| Other | 47 (1) |
| Depression diagnosis, N (%) | 1116 (34) |
| Insurance, N (%) | |
| Private | 1980 (60) |
| Medicare | 550 (17) |
| Medicaid | 260 (8) |
| Other | 522 (16) |
Inter-annotator agreement
The IAA calculated over all 7 annotators was 0.38 according to Krippendorff’s alpha, which can be considered moderate. We observed large variation in IAA between different sets of annotators, ranging from 0.32 to 0.52, depending on which annotator was removed from the set.
Model performance
The TF-IDF parameters and hyperparameters of the LR and SVM can be found in Supplementary Appendix E. The LR model had a mean area under the ROC curve (AUROC) of 0.79 (95% confidence interval [CI]: 0.74-0.83) while the SVM attained an AUROC of 0.83 (95% CI: 0.78-0.87).
Both BERT and RedditBERT were trained and validated for 5 epochs on 5693 labeled messages. See Supplementary Appendix E for hyperparameters. Both models outperformed the LR and SVM, and RedditBERT slightly outperformed BERT, with a mean AUROC of 0.88 (95% CI: 0.85-0.91) versus 0.86 (95% CI: 0.82-0.90). A threshold of 0.5 led to the highest F1 score for the BERT models and the SVM; for the LR, a threshold of 0.2 led to the highest F1 score (Table 2). In total, RedditBERT labeled 200 messages as concerning (22%). When comparing predictive performance per subgroup, BERT showed larger differences in performance across sex, race, and ethnicity than RedditBERT. Both models showed decreased performance for Black patients (Table 3).
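As an illustration, the sketch below selects the decision threshold that maximizes the F1 score on predicted probabilities; the labels and scores are synthetic.

```python
import numpy as np
from sklearn.metrics import f1_score

# Synthetic labels and predicted probabilities; in practice these come from the test set.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 500)
probs = np.clip(0.35 * y_true + 0.65 * rng.random(500), 0, 1)

# Evaluate F1 over a grid of thresholds and keep the one with the highest score.
thresholds = np.arange(0.05, 1.00, 0.05)
f1_scores = [f1_score(y_true, (probs >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1_scores))]
print(f"Best threshold: {best_threshold:.2f}, F1 = {max(f1_scores):.2f}")
```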
Table 2. Performance metrics of 4 models classifying patient messages as concerning for depression.

| Metric, mean [95% CI] based on 1000 bootstraps | Log Reg, threshold: 0.2a | SVM, threshold: 0.5a | BERT, threshold: 0.5a | RedditBERT, threshold: 0.5a |
|---|---|---|---|---|
| AUROC | 0.79 [0.74-0.83] | 0.83 [0.78-0.87] | 0.86 [0.82-0.90] | 0.88 [0.85-0.91] |
| Precision | 0.32 [0.25-0.39] | 0.36 [0.28-0.44] | 0.37 [0.30-0.44] | 0.33 [0.26-0.39] |
| Recall | 0.51 [0.40-0.61] | 0.60 [0.49-0.70] | 0.68 [0.59-0.78] | 0.74 [0.66-0.84] |
| F1-score | 0.39 [0.31-0.47] | 0.45 [0.37-0.52] | 0.48 [0.40-0.55] | 0.46 [0.39-0.53] |

a Threshold chosen that led to the highest F1 score.
Table 3. Subgroup performance (AUROC and recall) of BERT and RedditBERT on the test set.

| Subgroup | BERT AUC [95% CI] | BERT recall [95% CI] | RedditBERT AUC [95% CI] | RedditBERT recall [95% CI] |
|---|---|---|---|---|
| Overall | 0.86 [0.82-0.90] | 0.69 [0.59-0.78] | 0.88 [0.85-0.91] | 0.74 [0.65-0.83] |
| Sex | | | | |
| Female (n = 476) | 0.85 [0.79-0.91] | 0.73 [0.61-0.84] | 0.88 [0.84-0.92] | 0.73 [0.61-0.84] |
| Male (n = 284) | 0.89 [0.83-0.94] | 0.62 [0.43-0.78] | 0.90 [0.84-0.94] | 0.76 [0.62-0.90] |
| Race | | | | |
| Asian (n = 136) | 0.87 [0.74-0.98] | 0.75 [0.53-0.95] | 0.91 [0.83-0.97] | 0.63 [0.37-0.86] |
| Black (n = 16) | 0.82 [N/A] | 0.33 [N/A] | 0.75 [N/A] | 0.33 [N/A] |
| White (n = 519) | 0.86 [0.81-0.90] | 0.67 [0.54-0.79] | 0.88 [0.85-0.92] | 0.79 [0.68-0.89] |
| Other (n = 81) | 0.83 [0.64-0.98] | 0.74 [0.44-1.00] | 0.90 [0.79-0.98] | 0.75 [0.44-1.00] |
| Ethnicity | | | | |
| Hispanic/Latino (n = 44) | 0.80 [0.54-0.99] | 0.66 [0.33-1.00] | 0.88 [0.73-0.99] | 0.78 [0.44-1.00] |
| Non-Hispanic/non-Latino (n = 709) | 0.87 [0.83-0.91] | 0.69 [0.58-0.79] | 0.89 [0.86-0.92] | 0.75 [0.66-0.84] |
| Other (n = 7) | 0.95 [N/A] | 0.67 [N/A] | 0.95 [N/A] | 0.33 [N/A] |
Associations between model predictions and patient characteristics
The classification of patients’ messages differed significantly by race: messages from White patients were more often classified as concerning, while messages from Asian patients were less often classified as concerning (see Supplementary Appendix F). Furthermore, patients on Medicaid or Medicare also sent more messages classified as concerning. Patients who sent messages classified as concerning by RedditBERT were more likely to receive a depression diagnosis, a prescription for antidepressants, or a mental health referral within 3, 6, and 12 months after sending the concerning message. Patients sending a concerning message were also more likely to already have a depression diagnosis, a prescription for antidepressants, or a mental health referral (see Supplementary Appendix F).
Explainability
The explanations of which words contributed to the prediction of a given message differed between BERT and RedditBERT, with RedditBERT highlighting more words than BERT (see Supplementary Appendix G). Annotators preferred BERT’s explanations over RedditBERT’s for 14 out of 26 texts (54%). Annotators often opted for RedditBERT’s explanation when it highlighted words or sentences that BERT missed. Conversely, annotators sometimes preferred BERT’s explanation because RedditBERT highlighted words that did not make sense to them. Furthermore, several annotators mentioned that the words highlighted as “not concerning” did not always seem sensible to them (Table 4).
Table 4. Annotators’ reasons for preferring RedditBERT’s or BERT’s explanations.

| Reasons for choosing RedditBERT | Reasons for choosing BERT |
|---|---|
| “Difficult. I like the explanations of [RedditBERT] a bit more, because it seems to pick out more complete sentences like “am extremely tired” and “have not … able to sleep more”.” | “I prefer [BERT] because the blue [non-concerning] words in [RedditBERT], do not make sense to me. Why should testosterone be marked as non-concerning.” |
| “This is the best use case of this model. A clear cry for help. I like the explanations of [RedditBERT] better because it picks out more complete sentences “I’m pretty depressed” and has a stronger reaction on the “psychologist”.” | “[RedditBERT] highlights a lot of text that I do not think relevant in either direction.” |
| “I think in general it is good [RedditBERT] picks up on prescription names.” | “Again, there is a lot of text highlighted in both that does not really make sense to me. [BERT] highlights less text.” |
| “I like how [RedditBERT] picks up on the “love to talk to somebody”.” | “I do not agree with the extra highlighted words really in [RedditBERT], as the only indication of concern is the “depressed”.” |
Discussion
In this study, we demonstrate a proof-of-concept for leveraging patient-generated health data for the early identification of depression concerns in cancer patients. By employing NLP techniques, specifically BERT and domain-adaptive pretraining on Reddit data (RedditBERT), we highlight the potential of artificial intelligence in enhancing mental health surveillance. However, the performance disparities observed across patient subgroups, notably with respect to race and ethnicity, necessitate careful consideration of the ethical implications and potential biases introduced by these models.
The good discriminatory ability across all models showed the potential of patient messages as a valuable source for depression risk stratification. Our results are comparable to the one other study that used patient portal messages to identify a mental health event, namely suicide.28 That study reported an AUROC of 0.71. Both findings underline the potential of patient messages as a unique data source that provides a current snapshot of how a patient is feeling and directly represents the patient’s voice. This untapped data source offers an opportunity to improve personalized, proactive identification of mental health issues. However, more research on this topic is needed, as these are the only studies describing the application of NLP to this data source.
There was no significant difference between the naïve BERT base model and the domain-adaptively pretrained RedditBERT model. This finding contradicts previous studies in which domain-specific models, such as BioBERT (pretrained on biomedical texts) and ClinicalBERT (pretrained on clinical texts), and BERT models with continued pretraining outperformed base BERT.35,36,41,43 However, within the mental health domain, a recent study also found that depression classification did not improve significantly with continued pretraining.44 More research is needed to assess the value of social media data for continued pretraining in the mental health domain.
We found a notable difference in how words were weighed in the explanations provided for BERT and RedditBERT, but no difference on average in the quality of the explanations. Explanations may help generate trust in deep neural network models, like BERT, which are inherently uninterpretable.45 Yet, post-hoc explainability methods like LIME are difficult to validate, and their effect on clinical decision-making is still unknown. More research is needed on whether explainability methods increase trust or instead have the potential to harm it.46 Alternatively, neural networks with a more inherent interpretability mechanism could lead to better explanations.47
The subgroup analysis showed slight differences in performance across sex, race, and ethnicity. Compared to BERT, RedditBERT performed more consistently between subgroups and had slightly better recall for male patients and White patients, which could be due to Reddit being predominantly used by males.48 Furthermore, both models performed worse for Black patients, which can be explained by the low number of Black patients in our sample. This finding highlights the importance of addressing potential biases and ethical considerations associated with deploying AI models in healthcare, emphasizing the need for equitable and unbiased implementations. The National Institutes of Health’s All of Us Research Program is a prime example of an initiative aiming to collect data from a diverse group of participants across the United States.49 For future research, we recommend training models on such a diverse dataset to decrease differences in subgroup performance.
A depression diagnosis or prescription of depression medication occurred more often after a concerning message was sent than after a non-concerning message. This suggests that our models were able to identify messages that were truly indicative of depression concerns; these may be patients who could benefit from additional mental healthcare outreach. It is important to note, however, that some of these patients had already received a depression diagnosis or treatment. This highlights the model’s ability to classify current concerns, although it might not perform well for prediction. This interpretation is supported by a recent study in which we show that using the output of our model does not improve the performance of a prediction model for depression.31
Limitations
One limitation is the moderate inter-annotator agreement. This can be attributed to the diversity among the annotators and the inherently subjective interpretation of what qualifies as a “concerning” message. This is highlighted by the large variation in IAA, depending on which annotators are included. Although we believe the IAA could be improved by excluding some annotators, it also mirrors the real world, where different healthcare professionals may interpret patient communications differently. Literature describing similar use cases, such as annotating Twitter data for mental health symptoms, reports similarly moderate inter-annotator agreement.50,51 Despite the moderate agreement, the study’s rigorous approach of involving multiple annotators and the alignment with existing literature provide valuable insights into the complexities of labeling subjective content. Taking this into account, we conclude that the use of patient messages combined with labels from our diverse, clinical group of annotators greatly improves the method’s potential applicability in healthcare practice.
Furthermore, our current approach to upsampling concerning messages may lead to a bias in the training data towards messages that are more easily identifiable as concerning. As the current study is a proof-of-concept, we chose this method to keep the manual annotation feasible while still ensuring that there were enough concerning messages to train the model effectively. However, future work should explore more sophisticated sampling techniques to better represent the full spectrum of patient messages and minimize potential biases.
Another limitation of this study is the focus on a single institution, which may limit the generalizability of our results to other settings. In particular, the high proportion of privately insured patients is not representative of the general population. Nevertheless, this cohort included a diverse population in terms of race and ethnicity, with a substantial percentage of Hispanic and Asian patients. This study can thus be seen as a proof-of-concept and sets the stage for future investigations into the ways different ethnic and cultural groups express mental health concerns in their communications. By recognizing and addressing these differences, subsequent studies can delve deeper into tailoring interventions that resonate effectively across diverse patient populations.
Lastly, the study’s framework might not capture patients who do not communicate through the patient portal or are hesitant to reach out, thereby potentially missing a subset of the population in need.
Future implications
Given our exploration of advanced NLP models for the identification of depression concerns in cancer patients, the broader implications for healthcare are significant. The advantage of BERT and RedditBERT over traditional methods underscores the potential of integrating more sophisticated language models into clinical practice. With the ongoing advancements in NLP, especially in the field of large language models (LLMs), there is the potential to further refine these models, making them even more relevant and effective in a clinical context. Future work should focus on comparing several newer language models to determine whether they provide improved performance in identifying depression. Recent studies have shown that LLMs can also be used to create chatbots for counseling, offering another promising avenue for providing mental health support.52 However, while the promise of these advanced NLP models in healthcare is evident, it is crucial to approach their integration with caution. Before such models can be responsibly incorporated into clinical settings, additional research is required to address potential biases, such as those demonstrated in the current study, and to evaluate the real-world impact on physician-patient interaction and clinical outcomes.53–55
Furthermore, our study contributes to the literature by emphasizing the underutilized potential of patient-generated health data, specifically messages sent through a secure patient portal. This approach taps into valuable information exchanged between patients and healthcare providers, offering insights into a patient’s mental health state and enabling early detection of depression concerns. These data are systematically collected, as opposed to, for example, patient-reported outcomes (PROs). The collection of PROs is often burdensome to patients and healthcare providers, may not capture all patients’ concerns, and relies on patients’ memory to report symptoms that occurred prior to the patient’s visit.18 Thus, patient messages should be seen as a valuable additional data source for clinical research and surveillance.
Following our proof-of-concept study, we propose several next steps. First, to ensure broader applicability of such a tool, the training dataset should be extended with data representative of the general population. Second, it is important to conduct a temporal validation to assess the model’s performance over time. Third, other types of explainability methods should be tested to determine whether some provide a better understanding of the model’s behavior than the current method. These steps will help refine the model further and enhance its applicability and trustworthiness in a clinical setting.
Conclusion
In conclusion, this work represents a significant methodological advancement in the early identification of depression concerns among cancer patients, addressing a critical gap in patient care. By leveraging BERT-based models, our work contributes to a pathway for reducing clinical burden while enhancing overall patient care. Further research is needed to address biases, evaluate real-world impacts, and ensure responsible integration into clinical settings. As the study highlights, the interpretability of these models is paramount for clinician trust and responsible implementation in healthcare settings, particularly for vulnerable patient populations.
Author contributions
Marieke M. van Buchem, Anne A.H. de Hond, Ewout W. Steyerberg, Ilse M.J. Kant, and Tina Hernandez-Boussard were responsible for the conceptualization and design of the study. Marieke M. van Buchem and Anne A.H. de Hond performed the data extraction. Marieke M. van Buchem performed the data analysis. Max Schuessler and Vaibhavi Shah provided clinical advice. Marieke M. van Buchem drafted the original manuscript. All authors had full access to all the data, critically analyzed, reviewed, contributed, and approved the final manuscript.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This work was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR003142 and the National Library of Medicine of the National Institutes of Health under Award Number R01LM013362. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. M.v.B. and A.d.H. received a travel grant from the Catharine van Tussenbroek Fund and the Prins Bernhard Cultuur Fund to support this research.
Conflicts of interest
The authors have no competing interests to disclose.
Data availability
The data underlying this article cannot be shared due to the sensitivity of the content and the privacy of the individuals who participated in the study.