Automated identification of cancer-associated thrombosis events via natural language processing: a systematic review of the literature

Aidan Boyne; Emily Zhou; Ang Li

doi:10.4081/btvb..193

Authors

Aidan Boyne

Section of Hematology-Oncology, Department of Medicine, Baylor College of Medicine, Houston, TX, United States.

Emily Zhou

McGovern Medical School, University of Texas Health Science Center, Houston, TX, United States.

Ang Li

ang.li2@bcm.edu

Section of Hematology-Oncology, Department of Medicine, Baylor College of Medicine, Houston, TX, United States.

Accurate identification of cancer-associated thrombosis (CAT) in electronic health records is essential for disease surveillance, trial design, and the development of risk stratification models. Manual chart review is impractical at scale, while extraction based on ICD codes or similar coding systems often misses events, fails to distinguish incident from prevalent events, or conflates bland and tumor thrombi. Natural language processing (NLP) is a promising solution, offering scalable extraction of structured CAT events directly from clinical text. We systematically reviewed the literature on NLP-based pipelines for CAT identification according to PRISMA guidelines. PubMed, Embase, and Web of Science were queried for English-language studies published from 2010 to 2025. NLP methodology, training strategy, and model performance were extracted from relevant studies. Seven studies, implementing NLP approaches ranging from lexicon- or rules-based pipelines to transformer models, met inclusion criteria. Most studies developed and evaluated models using text from a single institution, and only one distinguished incident from prevalent CAT events. Reported metrics varied between studies, though models tended to exhibit higher specificity and negative predictive value than sensitivity and positive predictive value. Overall, current NLP systems for CAT identification perform well enough to assist, but not supplant, manual chart review. Major limitations include small, institution-specific datasets, absence of external validation, and lack of distinction between incident and prevalent events. Development of generalizable models will require large, multi-institutional datasets as the field moves towards transformer-based models, and standardized evaluation metrics with shared benchmark test sets are needed for an unbiased measure of progress.
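The four metrics compared across studies (sensitivity, specificity, positive predictive value, and negative predictive value) all derive from the same confusion-matrix counts. The following is a minimal sketch of that calculation, assuming binary note-level labels from manual chart review (1 = CAT event, 0 = no event) and binary model predictions. The function names and the example data are hypothetical and purely illustrative; they are not taken from any of the reviewed studies.

```python
from typing import Dict, Optional, Sequence


def confusion_counts(labels: Sequence[int], predictions: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary labels (1 = event)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(labels, predictions):
        if truth and pred:
            counts["tp"] += 1
        elif pred:
            counts["fp"] += 1
        elif truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def diagnostic_metrics(labels: Sequence[int], predictions: Sequence[int]) -> Dict[str, Optional[float]]:
    """Compute sensitivity, specificity, PPV, and NPV against a reference standard."""
    c = confusion_counts(labels, predictions)
    return {
        "sensitivity": safe_ratio(c["tp"], c["tp"] + c["fn"]),
        "specificity": safe_ratio(c["tn"], c["tn"] + c["fp"]),
        "ppv": safe_ratio(c["tp"], c["tp"] + c["fp"]),
        "npv": safe_ratio(c["tn"], c["tn"] + c["fn"]),
    }


# Hypothetical toy data: 2 true events among 10 notes (illustrative only).
example_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
example_predictions = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
for name, value in diagnostic_metrics(example_labels, example_predictions).items():
    print(f"{name}: {value}")
```

Because CAT events are uncommon relative to all clinical notes, true negatives dominate the counts in a toy example like this one, which is one reason specificity and negative predictive value can look stronger than sensitivity and positive predictive value. Reporting all four on a shared test set makes results easier to compare between pipelines.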

Supporting Agencies

Cancer Prevention and Research Institute of Texas, American Society of Hematology, ASCO Foundation

How to Cite

Boyne A, Zhou E, Li A. Automated identification of cancer-associated thrombosis events via natural language processing: a systematic review of the literature. Bleeding Thromb Vasc Biol [Internet]. 2026 Apr. 16 [cited 2026 May 28];5(s1). Available from: https://www.btvb.org/btvb/article/view/437

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
