Rare Disease Information Retrieval

Project website

Technical University of Denmark · DTU Informatics

University of Copenhagen · Department of Computer Science

Description


Given a user query, the task of a search engine is to retrieve relevant information from a typically large and heterogeneous collection of documents. General purpose web search engines such as Google are widely used for a variety of web search related tasks, including assisting clinicians in generating hypotheses for diagnosing difficult cases, such as rare diseases [1].

This project addresses the task of searching for relevant rare diseases given a query of patient data. The patient data is given as free text, which means that the queries do not have to use a controlled vocabulary or specific query language restrictions as in conventional diagnostic assistance systems. The patient data submitted as a query to the information retrieval (IR) system could consist of patient age, gender, demographic information, symptoms, evidence of diseases, test results, previous diagnoses, and other information that a clinician might find relevant in the differential diagnosis.

Document collection

The collection of documents indexed for retrieval consists of medical articles on the topic of rare diseases. Because 80% of rare diseases are of genetic origin, we also index medical articles on the topic of genetic diseases.

We ran experiments on two subsets of the document collection: one containing only the articles about rare diseases (referred to as RARE), and one containing both rare and genetic disease articles (referred to as RARE&GENET). The RARE subset contains 10,263 documents with a raw size of 543 MB, and the RARE&GENET subset contains 31,590 documents with a raw size of 719 MB.

Rare disease information retrieval in a nutshell

To rank documents with respect to queries, we use a probabilistic model, specifically a statistical language model. Given a query (q) consisting of patient data, each article (d) is ranked based on the probability of generating the terms of the query from the article's language model (P(q|Md)). Estimating the probability of a query being generated by a document corresponds to estimating how likely it is for a document to be about the correct diagnosis of the patient described in the query [2].
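The query likelihood ranking described above can be illustrated with a minimal Python sketch. It is not the Indri implementation: it assumes a unigram language model with Dirichlet smoothing, an illustrative smoothing parameter mu, whitespace tokenization, and two made-up toy documents. The function and variable names are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, mu=2500.0):
    """Return log P(q | M_d) under a Dirichlet-smoothed unigram model.

    mu is an assumed smoothing parameter chosen for illustration.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Background probability of the term in the whole collection.
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            # Terms unseen in the collection are skipped in this sketch.
            continue
        # Document evidence interpolated with the collection model.
        p_term = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Toy documents (invented for illustration, not from the real collection).
docs = {
    "d1": "fever rash joint pain in a young child".split(),
    "d2": "muscle weakness and progressive hearing loss".split(),
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_counts.values())

query = "child with fever and rash".split()
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(
        query, docs[d], collection_counts, collection_len),
    reverse=True,
)
print(ranking)
```

Documents are sorted by decreasing log-likelihood, so the article whose language model best explains the patient description comes first.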

We use the Indri system for retrieval. On top of the baseline ranking model, we use evidence about the collected medical articles to revise the ranking. Specifically, reasoning that documents about rare diseases are more relevant when searching for a rare disease diagnosis, we use the origin/publisher of each article to assign a higher prior probability of relevance to documents about rare diseases and a lower one to documents about genetic diseases. Thus, if P(D) denotes the prior probability of a document being relevant, and C denotes the collection containing both rare and genetic disease documents, then:

where x = φy (φ is the boosting factor), and P(R|C) (resp. P(G|C)) denotes the probability of all rare disease (resp. genetic disease) documents in the collection C.
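As a rough illustration of how such a boosted prior could be computed, the Python sketch below assigns a prior x to every rare disease document and y to every genetic disease document with x = φy. The normalization (priors summing to 1 over the collection) and the way the prior is added to the query log-likelihood are assumptions made for this sketch, not the project's exact formulation. It also assumes that the genetic disease documents are the RARE&GENET documents that are not in RARE. All function names are hypothetical.

```python
import math


def document_priors(n_rare, n_genet, boost):
    """Return (x, y): priors for one rare and one genetic disease document.

    Assumptions: x = boost * y, and the priors of all documents in the
    collection sum to 1.
    """
    y = 1.0 / (boost * n_rare + n_genet)
    x = boost * y
    return x, y


def score_with_prior(log_likelihood, is_rare, x, y):
    """Combine log P(q | M_d) with the log of the document prior P(D)."""
    return log_likelihood + math.log(x if is_rare else y)


# Collection sizes from the text; GENET size is derived by subtraction,
# assuming the two subsets do not overlap.
n_rare = 10263
n_genet = 31590 - n_rare

x, y = document_priors(n_rare, n_genet, boost=4)
print(x, y)
```

With equal query likelihoods, a rare disease document then outranks a genetic disease document by a constant log(φ) margin.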

Preliminary results

One medical doctor and two non-experts extracted 30 queries from rare disease cases published in the Orphanet Journal of Rare Diseases (OJRD). These queries were used to evaluate our rare disease search engine, the standard Google Search, and a Google Custom Search. The queries include patient symptoms but not the correct diagnosis. Google Custom Search was customized to emphasize the sources of the RARE&GENET documents.

We ran the queries on our Indri-based system against both collections, RARE and RARE&GENET. Retrieval on RARE&GENET was also repeated with a ranking boost for the RARE documents. All runs used the query likelihood language model with default settings.

In the evaluation, a document was deemed relevant if it was predominantly about the correct disease or one of its synonyms or variations. In contrast, a document was considered non-relevant if it did not mention the correct disease in the title, in the first 400 words, or in the first 10 items of a list, or if it was a restricted-access document that was not relevant based on the freely available information.

Table 1. Retrieval on 30 queries from the web and from our collection. Best scores in bold. Relevance is measured by precision at rank 10 (P@10), precision at rank 20 (P@20) and by the mean reciprocal rank (MRR)

                                        Binary Relevance
Retrieval Approach                      P@10    P@20    MRR
Standard Google Search                  .023    .013    .056
Google Custom on RARE                   .030    .017    .173
RARE&GENET (RARE boost factor of 4)     .173    .115    .469
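
The measures reported in Table 1 can be computed from binary relevance judgments as in the following Python sketch. The judgment lists are invented toy data, not the project's actual judgments, and the function names are hypothetical.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results judged relevant (binary judgments)."""
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def evaluate(runs, ks=(10, 20)):
    """Average P@k and reciprocal rank over all queries."""
    n = len(runs)
    scores = {f"P@{k}": sum(precision_at_k(r, k) for r in runs) / n
              for k in ks}
    scores["MRR"] = sum(reciprocal_rank(r) for r in runs) / n
    return scores


# One list of 0/1 judgments per query, in ranked order (toy data).
runs = [
    [0, 1] + [0] * 18,  # first relevant document at rank 2
    [0] * 20,           # no relevant document in the top 20
]
scores = evaluate(runs)
print(scores)
```

Each score is averaged over queries, so a query with no relevant document in the top results contributes zero to all three measures.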


[1] H. Tang and J. H. K. Ng, "Googling for a diagnosis - use of Google as a diagnostic aid: internet based study," BMJ Clinical Research Ed., vol. 333, pp. 1143-5, Dec. 2006.
[2] B. Croft, D. Metzler, and T. Strohman, Search engines: information retrieval in practice. Addison-Wesley Publishing Company, USA, 2009.

Search Engine

Search Engine Interface


Web interface to the rare disease IR system

Resource Collection

RARE collection

GENET collection
  • Genetics Home Reference: A service of the U.S. National Library of Medicine.
    Available on ghr.nlm.nih.gov/BrowseConditions
  • Wikipedia: The free encyclopedia. Wikimedia Foundation, Inc., Category Syndromes.
    Available on en.wikipedia.org/wiki/Category:Syndromes
  • Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD).
    Available on www.ncbi.nlm.nih.gov/omim/

Publications

Rare Disease Diagnosis as an Information Retrieval Task


Radu Dragusin, Paula Petcu, Christina Lioma, Birger Larsen, Henrik Jorgensen, and Ole Winther

Proceedings of the 3rd International Conference on the Theory of Information Retrieval 2011, Lecture Notes in Computer Science, Springer. 2011 (to appear)

A Vertical Search Engine Supporting the Diagnosis of Rare Diseases


Radu Dragusin and Paula Petcu

Technical report. Institute of Computer Science at University of Copenhagen. August 2011.

Improving clinical practice using a computerized clinical decision support system for diagnosing rare diseases: literature review, challenges, and possible paths forward


Radu Dragusin and Paula Petcu

Technical report. Institute of Computer Science at University of Copenhagen. October 2010.