Discovering monogenic patients with a confirmed molecular diagnosis in millions of clinical notes with MonoMiner

  • David Wei Wu
    Department of Computer Science, Stanford University School of Engineering, Stanford, CA

    Medical Scientist Training Program, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA
  • Jonathan A. Bernstein
    Department of Pediatrics, Stanford University School of Medicine, Stanford, CA
  • Gill Bejerano
    Correspondence and requests for materials should be addressed to Gill Bejerano, Department of Computer Science, Stanford School of Engineering, Stanford University, Beckman Center B-300, 279 Campus Drive West (MC 5329), Stanford, CA 94305-5329
    Department of Computer Science, Stanford University School of Engineering, Stanford, CA

    Department of Pediatrics, Stanford University School of Medicine, Stanford, CA

    Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA

    Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA
Published: August 17, 2022



      Cohort building is a powerful foundation for improving clinical care, performing biomedical research, recruiting for clinical trials, and many other applications. We set out to build a cohort of all monogenic patients with a definitive causal gene diagnosis in a 3-million-patient hospital system.


      We define a subset (4461) of OMIM diseases that have at least 1 known monogenic causal gene. We then introduce MonoMiner, a natural language processing framework to identify molecularly confirmed monogenic patients from free-text clinical notes.
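As a rough illustration only (the abstract does not describe MonoMiner's internals, and this is not the authors' implementation), the toy sketch below shows the kind of rule such a framework might start from: keep diseases that have at least one known causal gene, then flag a note when a disease name and one of its causal gene symbols appear in the same sentence. The mapping `DISEASE_TO_GENES`, the function names, and the sample note are all hypothetical.

```python
import re

# Hypothetical toy mapping from disease name to known causal genes.
# A real mapping would be derived from OMIM; these entries are illustrative.
DISEASE_TO_GENES = {
    "cystic fibrosis": {"CFTR"},
    "niemann-pick disease type c": {"NPC1", "NPC2"},
    "unmapped syndrome": set(),  # no known causal gene: excluded below
}


def monogenic_subset(disease_to_genes):
    """Keep only diseases with at least one known causal gene."""
    return {d: g for d, g in disease_to_genes.items() if g}


def co_mentions(note, disease_to_genes):
    """Return (disease, gene) pairs named together in one sentence of a note."""
    hits = set()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        lowered = sentence.lower()
        for disease, genes in disease_to_genes.items():
            if disease not in lowered:
                continue
            for gene in genes:
                # Match the gene symbol as a whole, case-sensitive token.
                if re.search(rf"\b{re.escape(gene)}\b", sentence):
                    hits.add((disease, gene))
    return hits


note = (
    "Patient seen for follow-up. Genetic testing confirmed cystic fibrosis "
    "with two pathogenic CFTR variants. Family history reviewed."
)
diseases = monogenic_subset(DISEASE_TO_GENES)
print(co_mentions(note, diseases))
```

A co-mention rule like this says nothing about whether the diagnosis was molecularly confirmed, negated, or attributed to a relative, so a real system would need considerably more than this sketch.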


      We show that ICD-10-CM codes cover only a fraction of monogenic diseases and that, even where available, ICD-10-CM code-based patient retrieval offers only 0.14 precision. Searching by causal gene symbol offers high recall but even lower precision (0.07). MonoMiner achieves 6 to 11 times higher precision (0.80), with 0.87 precision on disease diagnosis alone, tagging 4259 patients with 560 monogenic diseases and 534 causal genes, at 0.48 recall.
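The "6 to 11 times" figure can be checked directly from the reported precisions. The short sketch below does that arithmetic; the constant and function names are illustrative, and the values are the ones stated in the abstract.

```python
# Precision and recall figures as reported in the abstract.
PRECISION_ICD = 0.14        # ICD-10-CM code-based retrieval
PRECISION_GENE = 0.07       # causal gene symbol search
PRECISION_MONOMINER = 0.80  # MonoMiner


def fold_improvement(new, baseline):
    """Ratio of a new precision to a baseline precision."""
    return new / baseline


vs_icd = fold_improvement(PRECISION_MONOMINER, PRECISION_ICD)
vs_gene = fold_improvement(PRECISION_MONOMINER, PRECISION_GENE)

print(f"vs ICD-10-CM codes: {vs_icd:.1f}x")  # about 5.7x, rounded to 6
print(f"vs gene symbols:    {vs_gene:.1f}x")  # about 11.4x, rounded to 11
```

The two ratios, roughly 5.7 and 11.4, match the stated range once rounded to whole numbers.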


      MonoMiner enables the discovery of a large, high-precision cohort of patients with monogenic diseases with an established molecular diagnosis, empowering numerous downstream uses. Because it relies solely on clinical notes, MonoMiner is highly portable, and its approach is adaptable to other domains and languages.



