...
Description
Introduction
Given a user query, the task of a search engine is to retrieve relevant information from a typically large and heterogeneous collection of documents. General purpose web search engines such as Google are widely used for a variety of web search related tasks, including assisting clinicians in generating hypotheses for diagnosing difficult cases, such as rare diseases [1].
This project addresses the task of searching for relevant rare diseases given a query of patient data. The patient data is given as free text, which means that the queries do not have to use a controlled vocabulary or specific query language restrictions as in conventional diagnostic assistance systems. The patient data submitted as a query to the information retrieval (IR) system could consist of patient age, gender, demographic information, symptoms, evidence of diseases, test results, previous diagnoses, and other information that a clinician might find relevant in the differential diagnosis.
Document collection
The collection of documents indexed for retrieval consists of medical articles on the topic of rare diseases. As 80% of rare diseases are of genetic origin, we also index medical articles on the topic of genetic diseases.
We ran experiments on two subsets of the document collection: one containing only the articles about rare diseases (referred to as RARE), and one containing both rare and genetic disease articles (referred to as RARE&GENET). The RARE subset contains 10,263 documents with a raw size of 543 MB, and the RARE&GENET subset contains 31,590 documents with a raw size of 719 MB.
Rare disease information retrieval in a nutshell
To rank documents with respect to queries, we use a probabilistic model, specifically a statistical language model. Given a query (q) consisting of patient data, each article (d) is ranked based on the probability of generating the terms of the query from the article's language model (P(q|Md)). Estimating the probability of a query being generated by a document corresponds to estimating how likely it is for a document to be about the correct diagnosis of the patient described in the query [2].
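The query likelihood ranking described above can be sketched in a few lines. This is a minimal illustration, not the Indri implementation: the function name, the toy documents, and the choice of Dirichlet smoothing with mu = 2500 are assumptions made for the example, and query terms that never occur in the collection are simply skipped.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_tf, collection_len, mu=2500.0):
    """Return log P(q | M_d) for a unigram document model.

    Assumption: Dirichlet smoothing with parameter mu; the source does not
    state which smoothing method was used.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        coll_count = collection_tf.get(term, 0)
        if coll_count == 0:
            # Term unseen in the whole collection: skip it so the score stays finite.
            continue
        p_coll = coll_count / collection_len
        p_term = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Toy collection (hypothetical text, not taken from the indexed articles).
docs = {
    "marfan": "tall stature long limbs lens dislocation aortic dilation".split(),
    "fabry": "burning pain hands feet angiokeratoma renal failure".split(),
}
collection_tf = Counter(term for terms in docs.values() for term in terms)
collection_len = sum(collection_tf.values())

query = "tall patient lens dislocation".split()
scores = {
    name: query_log_likelihood(query, terms, collection_tf, collection_len)
    for name, terms in docs.items()
}
# Rank documents by decreasing log-likelihood of the query.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Working in log space avoids numerical underflow when a query has many terms, and smoothing keeps a document from being ruled out just because it lacks one of the query terms.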
We use the Indri system. On top of the baseline ranking model, we use evidence about the collected medical articles to revise the ranking. The reasoning is that documents about rare diseases are more relevant when searching for a rare disease diagnosis. We therefore use the origin/publisher of each article to assign a higher prior probability of relevance to documents about rare diseases and a lower one to documents about genetic diseases. Thus, if P(D) denotes the prior probability of a document being relevant, and C denotes the collection containing both rare and genetic disease documents, then:
where x = φy (φ is the boosting factor), and P(R|C) (resp. P(G|C)) denotes the probability of all rare disease (resp. genetic disease) documents in the collection C.
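One possible reading of the boosted prior can be sketched numerically. The sketch assumes that every rare disease document receives the same prior x, every genetic disease document the same prior y, and that the priors sum to one over C; it also assumes the two subsets do not overlap, so the number of genetic disease documents is 31,590 - 10,263. The function name is hypothetical, and none of these assumptions is confirmed by the text.

```python
def class_priors(n_rare, n_genet, boost):
    """Per-document priors (x, y) with x = boost * y.

    Assumption: priors are uniform within each class and sum to 1 over the
    whole collection, i.e. n_rare * x + n_genet * y = 1.
    """
    y = 1.0 / (boost * n_rare + n_genet)
    x = boost * y
    return x, y


n_rare = 10_263             # documents in RARE
n_genet = 31_590 - 10_263   # assumed: RARE&GENET minus RARE, no overlap

x, y = class_priors(n_rare, n_genet, boost=4.0)

# Total prior mass per class under these assumptions.
p_rare_mass = n_rare * x
p_genet_mass = n_genet * y
```

In a ranking function, a prior of this kind would typically be added in log space to the query log-likelihood of each document, so that rare disease documents move up by a constant offset of log(boost) relative to genetic disease documents.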
Preliminary results
From rare disease cases published in the Orphanet Journal of Rare Diseases (OJRD), one medical doctor and two non-experts extracted 30 queries. These were used to evaluate our rare disease search engine, the standard Google Search, and a Google Custom Search. The queries include patient symptoms but not the correct diagnosis. Google Custom Search was customized to emphasize the sources of the RARE&GENET documents.
The queries were run on our Indri-based system against the two collections, RARE and RARE&GENET. Retrieval on the second collection, RARE&GENET, was also repeated using a ranking boost for the RARE documents. All runs used the query likelihood language model with default settings.
In the evaluation, a document was deemed relevant if it was predominantly about the correct disease or one of its synonyms or variations. In contrast, a document was considered non-relevant if it did not mention the correct disease in the title, in the first 400 words, or in the first 10 items of a list, or if it was a document with restricted access that was not relevant based on the freely available information.
Retrieval Approach (binary relevance) | P@10 | P@20 | MRR |
Standard Google Search | .023 | .013 | .056 |
Google Custom on RARE | .030 | .017 | .173 |
RARE | .123 | .073 | .445 |
RARE&GENET | .157 | .105 | .467 |
RARE&GENET (RARE boost factor of 4) | .173 | .115 | .469 |
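For reference, the measures reported in the table (precision at rank k and mean reciprocal rank) can be computed from binary relevance judgments as in the sketch below. The function names and the three toy judgment lists are illustrative assumptions; they are not the judgments from the evaluation.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (relevance is a list of 0/1)."""
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Made-up judgments for three queries, top 10 results each (1 = relevant).
runs = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # first relevant document at rank 2
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # nothing relevant retrieved
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # first relevant document at rank 1
]

p_at_10 = mean(precision_at_k(r, 10) for r in runs)
mrr = mean(reciprocal_rank(r) for r in runs)
```

Both measures are averaged over queries, so a single query with no relevant result in the top ranks pulls the mean down without being excluded.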
References
[1] H. Tang and J. H. K. Ng, "Googling for a diagnosis - use of Google as a diagnostic aid: internet based study," BMJ Clinical Research Ed., vol. 333, pp. 1143-5, Dec. 2006.
[2] B. Croft, D. Metzler, and T. Strohman, Search engines: information retrieval in practice. Addison-Wesley Publishing Company, USA, 2009.
...
Search Engine
Search Engine Interface
Web interface to the rare disease IR system
Resource Collection
RARE collection
- Orphanet: an online rare disease and orphan drug data base. Copyright, INSERM 1997.
Available on www.orpha.net
- Wikipedia: The free encyclopedia. Wikimedia Foundation, Inc., Category Rare Diseases.
Available on en.wikipedia.org/wiki/Category:Rare_diseases
- NORD Rare Disease Database and Organizational Database. The National Organization for Rare Disorders (NORD).
Available on rarediseases.org/
- The Genetic and Rare Diseases Information Center (GARD).
Available on rarediseases.info.nih.gov/GARD
- Swedish Information Centre for Rare Diseases. Swedish National Board of Health and Welfare.
Available on www.socialstyrelsen.se/rarediseases
- m-Power Rare Disease Database. Madisons Foundation.
Available on www.madisonsfoundation.org/
- Health On the Net Foundation.
Available on www.hon.ch/HONselect/RareDiseases/
- Rare Diseases. About.com Health.
Available on rarediseases.about.com/
GENET collection
- Genetics Home Reference: A service of the U.S. National Library of Medicine.
Available on ghr.nlm.nih.gov/BrowseConditions
- Wikipedia: The free encyclopedia. Wikimedia Foundation, Inc., Category Syndromes.
Available on en.wikipedia.org/wiki/Category:Syndromes
- Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD).
Available on www.ncbi.nlm.nih.gov/omim/
...
Publications