DTU · Rare Disease Information Retrieval

... Description

Introduction

Given a user query, the task of a search engine is to retrieve relevant information from a typically large and heterogeneous collection of documents. General-purpose web search engines such as Google are widely used for a variety of search tasks, including helping clinicians generate hypotheses when diagnosing difficult cases, such as rare diseases [1].

This project addresses the task of searching for relevant rare diseases given a query of patient data. The patient data is given as free text, which means that the queries do not have to use a controlled vocabulary or specific query language restrictions as in conventional diagnostic assistance systems. The patient data submitted as a query to the information retrieval (IR) system could consist of patient age, gender, demographic information, symptoms, evidence of diseases, test results, previous diagnoses, and other information that a clinician might find relevant in the differential diagnosis.

Document collection

The collection of documents indexed for retrieval consists of medical articles on the topic of rare diseases. Because 80% of rare diseases are of genetic origin, we also index medical articles on the topic of genetic diseases.

We ran experiments on two subsets of the document collection: one including only the articles about rare diseases (referred to as RARE), and one including both rare and genetic disease articles (referred to as RARE&GENET). The RARE subset contains 10,263 documents with a raw size of 543 MB, and the RARE&GENET subset contains 31,590 documents with a raw size of 719 MB.

Rare disease information retrieval in a nutshell

To rank documents with respect to queries, we use a probabilistic model, specifically a statistical language model. Given a query (q) consisting of patient data, each article (d) is ranked based on the probability of generating the terms of the query from the article's language model (P(q|M_d)). Estimating the probability of a query being generated by a document corresponds to estimating how likely it is for a document to be about the correct diagnosis of the patient described in the query [2].
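
As a minimal sketch of the query likelihood idea described above, the Python snippet below ranks two toy documents by log P(q|M_d) using a Dirichlet-smoothed unigram language model. The smoothing method, the value of mu, the whitespace tokenizer, and the toy documents are all assumptions made for illustration; they are not taken from our actual configuration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def query_log_likelihood(query, doc_tokens, coll_counts, coll_len, mu=2500.0):
    """Return log P(q | M_d) under a Dirichlet-smoothed unigram model."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        p_coll = coll_counts.get(term, 0) / coll_len
        if p_coll == 0.0:
            # Term never occurs in the collection: skipped in this sketch.
            continue
        p_term = (doc_counts.get(term, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Hypothetical toy collection; real articles are far longer.
docs = {
    "marfan": "tall stature long limbs lens dislocation aortic dilation",
    "fabry": "burning pain hands feet angiokeratoma kidney failure",
}
tokenized = {name: tokenize(text) for name, text in docs.items()}
coll_counts = Counter(t for toks in tokenized.values() for t in toks)
coll_len = sum(coll_counts.values())

query = "tall boy with lens dislocation"
ranking = sorted(
    tokenized,
    key=lambda name: query_log_likelihood(
        query, tokenized[name], coll_counts, coll_len
    ),
    reverse=True,
)
```

In this toy setting the document that shares terms with the query is ranked first. The smoothing term mu * p_coll keeps a document from being ruled out just because it lacks one query term.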

We use the Indri system. On top of the baseline ranking model, we use evidence about the collected medical articles to revise the ranking. Specifically, reasoning that documents about rare diseases are more relevant when searching for a rare disease diagnosis, we use the origin/publisher of each article to give documents about rare diseases a higher prior probability of being relevant, and documents about genetic diseases a lower one. Thus, if P(D) denotes the prior probability of a document being relevant, and C denotes the collection containing both rare and genetic disease documents, then:
$P(R|C)\,x + P(G|C)\,y = 1$
where x and y are the prior weights given to rare disease and genetic disease documents respectively, x = φy (φ is the boosting factor), and P(R|C) (resp. P(G|C)) denotes the probability that a document in the collection C is a rare disease (resp. genetic disease) document.
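
Substituting x = φy into the constraint gives y = 1 / (φ P(R|C) + P(G|C)). The sketch below works through this for a boosting factor of 4. It assumes that P(R|C) and P(G|C) are simple document proportions, and that the number of genetic disease documents equals the RARE&GENET size minus the RARE size (that is, no overlap between the two subsets). The function name `document_priors` is hypothetical.

```python
def document_priors(n_rare, n_genet, boost):
    """Solve P(R|C)*x + P(G|C)*y = 1 subject to x = boost * y."""
    total = n_rare + n_genet
    p_rare = n_rare / total    # P(R|C), taken as a document proportion
    p_genet = n_genet / total  # P(G|C), taken as a document proportion
    y = 1.0 / (boost * p_rare + p_genet)
    x = boost * y
    return x, y


# Assumption: GENET size = RARE&GENET size minus RARE size (no overlap).
n_rare = 10_263
n_genet = 31_590 - 10_263
x, y = document_priors(n_rare, n_genet, boost=4.0)
```

Under these assumptions x comes out at about 2.03 and y at about 0.51, so rare disease documents are weighted four times as heavily as genetic disease documents while the weighted sum stays at 1.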

Preliminary results

From rare disease cases published in the Orphanet Journal of Rare Diseases (OJRD), one medical doctor and two non-experts extracted 30 queries. These were used to evaluate our rare disease search engine, the standard Google Search, and a Google Custom Search. The queries consist of patient symptoms; the correct diagnosis was not included in the query terms. Google Custom Search was customized to emphasize the sources of the RARE&GENET documents.

Retrieval with these queries on our Indri-based system was run on the two collections, RARE and RARE&GENET. In addition, retrieval on the second collection, RARE&GENET, was repeated using a ranking boost for the RARE documents. All runs used the query likelihood language model with default settings.

In the evaluation, a document was deemed relevant if it was predominantly about the correct disease or one of its synonyms or variations. In contrast, a document was considered non-relevant if it did not mention the correct disease in the title, in the first 400 words, or in the first 10 items of a list, or if it was a document with restricted access that was not relevant based on the freely available information.

Table 1. Retrieval on 30 queries from the web and from our collection. The best scores, all from the boosted RARE&GENET run, are in the last row. Relevance is binary and is measured by precision at rank 10 (P@10), precision at rank 20 (P@20), and mean reciprocal rank (MRR).
Retrieval Approach	P@10	P@20	MRR
Standard Google Search	.023	.013	.056
Google Custom on RARE	.030	.017	.173
RARE	.123	.073	.445
RARE&GENET	.157	.105	.467
RARE&GENET (RARE boost factor of 4)	.173	.115	.469
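
For reference, the measures in Table 1 can be computed from binary relevance judgments as in the sketch below. The judgment lists are hypothetical and are not taken from our evaluation data; the function names are illustrative.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (binary labels)."""
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def evaluate(runs, ks=(10, 20)):
    """Average P@k and reciprocal rank over all queries."""
    n = len(runs)
    scores = {f"P@{k}": sum(precision_at_k(r, k) for r in runs) / n for k in ks}
    scores["MRR"] = sum(reciprocal_rank(r) for r in runs) / n
    return scores


# Hypothetical judgments for two queries, 20 results each
# (1 = relevant, 0 = non-relevant).
runs = [
    [0, 1, 0, 0, 1] + [0] * 15,
    [0] * 20,
]
scores = evaluate(runs)
```

A query with no relevant result in the ranked list contributes 0 to every measure, which is why averages over difficult queries can be low even when some queries are answered at rank 1.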

References

[1] H. Tang and J. H. K. Ng, "Googling for a diagnosis - use of Google as a diagnostic aid: internet based study," BMJ Clinical Research Ed., vol. 333, pp. 1143-1145, Dec. 2006.
[2] B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice. Addison-Wesley Publishing Company, USA, 2009.

... Search Engine

Search Engine Interface

Web interface to the rare disease IR system

... Dataset

Resource Collection

RARE collection

Orphanet: an online rare disease and orphan drug data base. Copyright, INSERM 1997.
Available on www.orpha.net

Wikipedia: The free encyclopedia. Wikimedia Foundation, Inc., Category Rare Diseases.
Available on en.wikipedia.org/wiki/Category:Rare_diseases

NORD Rare Disease Database and Organizational Database. The National Organization for Rare Disorders (NORD).
Available on rarediseases.org/

The Genetic and Rare Diseases Information Center (GARD).
Available on rarediseases.info.nih.gov/GARD

Swedish Information Centre for Rare Diseases. Swedish National Board of Health and Welfare.
Available on www.socialstyrelsen.se/rarediseases

m-Power Rare Disease Database. Madisons Foundation.
Available on www.madisonsfoundation.org/

Health On the Net Foundation.
Available on www.hon.ch/HONselect/RareDiseases/

Rare Diseases. About.com Health.
Available on rarediseases.about.com/

GENET collection

Genetics Home Reference: A service of the U.S. National Library of Medicine.
Available on ghr.nlm.nih.gov/BrowseConditions

Wikipedia: The free encyclopedia. Wikimedia Foundation, Inc., Category Syndromes.
Available on en.wikipedia.org/wiki/Category:Syndromes

Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD).
Available on www.ncbi.nlm.nih.gov/omim/

... Publications

Rare Disease Diagnosis as an Information Retrieval Task

pdf

Radu Dragusin, Paula Petcu, Christina Lioma, Birger Larsen, Henrik Jorgensen, and Ole Winther

Proceedings of the 3rd International Conference on the Theory of Information Retrieval 2011, Lecture Notes in Computer Science, Springer. 2011 (to appear)

A Vertical Search Engine Supporting the Diagnosis of Rare Diseases

pdf

Radu Dragusin and Paula Petcu

Technical report. Institute of Computer Science at University of Copenhagen. August 2011.

Improving clinical practice using a computerized clinical decision support system for diagnosing rare diseases: literature review, challenges, and possible paths forward

pdf

Radu Dragusin and Paula Petcu

Technical report. Institute of Computer Science at University of Copenhagen. October 2010.

Technical University of Denmark · DTU Informatics

University of Copenhagen · Department of Computer Science

Radu Dragusin · Paula Petcu · Ole Winther · Christina Lioma

raddr >at< imm.dtu.dk · paupe >at< imm.dtu.dk · owi >at< imm.dtu.dk · camli >at< imm.dtu.dk