Proceedings of the 13th International Conference on Thrombosis and Hemostasis Issues in Cancer, 2026

Automated identification of cancer-associated thrombosis events via natural language processing: a systematic review of the literature

Published: 16 April 2026

Authors

Boyne A, Zhou E, Li A

Accurate identification of cancer-associated thrombosis (CAT) in electronic health records is essential for disease surveillance, trial design, and the development of risk stratification models. Manual chart review is impractical at scale, while extraction based on ICD codes or similar coding systems often misses events, fails to distinguish incident from prevalent events, or conflates bland and tumor thrombi. Natural language processing (NLP) is a promising solution, offering scalable extraction of structured CAT events directly from clinical text. We systematically reviewed the literature on NLP-based pipelines for CAT identification according to PRISMA guidelines. PubMed, Embase, and Web of Science were queried for English-language studies published from 2010 to 2025. NLP methodology, training strategy, and model performance were extracted from relevant studies. Seven studies, implementing NLP approaches ranging from lexicon- or rule-based pipelines to transformer models, met inclusion criteria. Most studies developed and evaluated models using text from a single institution, and only one distinguished incident from prevalent CAT events. Reported metrics varied between studies, though models tended to exhibit higher specificity and negative predictive value than sensitivity and positive predictive value. Overall, current NLP systems for CAT identification perform well enough to assist, but not supplant, manual chart review. Major limitations include small, institution-specific datasets, absence of external validation, and lack of distinction between incident and prevalent events. Development of generalizable models will require large, multi-institutional datasets as the field moves toward transformer-based models, and standardized evaluation metrics with shared benchmark test sets are needed for an unbiased measure of progress.
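
The four metrics compared in the abstract all derive from the same 2x2 table of model output against chart-review labels. The following is a minimal sketch of that arithmetic, not a method taken from any reviewed study: the `ConfusionCounts` type, the `diagnostic_metrics` function, and the counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 counts of NLP output versus manual chart review (hypothetical)."""
    tp: int  # model flags CAT, chart review confirms
    fp: int  # model flags CAT, chart review does not confirm
    fn: int  # model misses a chart-confirmed CAT event
    tn: int  # model and chart review agree there is no CAT event


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a margin of the table is empty.
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute sensitivity, specificity, PPV, and NPV from 2x2 counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Illustrative counts only (assumed 10% event prevalence in 1,000 notes);
# these are not results from the reviewed literature.
example = ConfusionCounts(tp=80, fp=40, fn=20, tn=860)
metrics = diagnostic_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, specificity and negative predictive value exceed sensitivity and positive predictive value. When events are uncommon, even a modest number of false positives lowers positive predictive value noticeably, which is one reason metrics reported on cohorts with different event rates are hard to compare directly.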


Supporting Agencies

Cancer Prevention and Research Institute of Texas, American Society of Hematology, ASCO Foundation

How to Cite

Boyne A, Zhou E, Li A. Automated identification of cancer-associated thrombosis events via natural language processing: a systematic review of the literature. Bleeding Thromb Vasc Biol [Internet]. 2026 Apr. 16 [cited 2026 Apr. 17];5(s1). Available from: https://www.btvb.org/btvb/article/view/437