SeMedico is a semantic search engine for the life sciences. In this context, semantic refers to the system's knowledge of life science concepts and of their relations with each other as expressed in the system's citations. In SeMedico, concepts include diseases, genes, proteins and further categories (see also section X for terminologies / ontologies).
The terminologies and ontologies used for SeMedico arrange their concepts hierarchically, from general concepts down to more specific items. For example, the concept Nervous System Diseases is a quite general category of which Parkinson's Disease is a specific kind. SeMedico exploits this information in its citation analysis. Thus, when searching for nervous system diseases in SeMedico, mentions of Parkinson's will be included.
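The hierarchical expansion described above can be illustrated with a minimal sketch. It is not SeMedico's actual implementation: the toy hierarchy, the intermediate concept and the function names `expand_concept` and `matches` are assumptions made for illustration only.

```python
from collections import deque

# Hypothetical toy hierarchy: broader concept -> directly narrower concepts.
# The intermediate level is illustrative, not a copy of any real terminology.
NARROWER = {
    "Nervous System Diseases": ["Neurodegenerative Diseases"],
    "Neurodegenerative Diseases": ["Parkinson's Disease"],
}


def expand_concept(concept, narrower=NARROWER):
    """Return the concept together with all of its transitive narrower concepts."""
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def matches(query_concept, document_concepts):
    """A document matches if it mentions the query concept or any narrower one."""
    return not expand_concept(query_concept).isdisjoint(document_concepts)
```

In this sketch, a query for "Nervous System Diseases" matches a document annotated only with "Parkinson's Disease", while the reverse does not hold, because expansion only runs from general to specific.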
On top of this terminological analysis, SeMedico identifies a range of protein-protein interactions (PPIs) mentioned in the analyzed text. The documents in SeMedico are enriched with this semantic metadata, which is then used to find the semantically most relevant results.
SeMedico searches citations from MedLine - the largest part of PubMed - and full texts taken from the open access portion of PubMed Central.
To find publications of interest, begin typing a keyword into the search field on the index page, just as with Google, Bing etc. SeMedico will then suggest concepts from its knowledge base. The suggestions are ordered by semantic category. If a suggestion is selected, it forms a query token and will be searched as one concept, even if it contains multiple words. If no suggestion seems adequate, the item under the special keyword category may be selected, in which case the entered word is added to the query without being resolved to a concept. Finally, the user may simply enter one or several words and start the search by hitting enter or clicking the find button. In this case, SeMedico will analyze the query for concepts automatically.
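The three ways of building a query described above (selecting a suggestion, selecting the keyword item, or entering free text) can be sketched as follows. This is a hypothetical illustration: the `QueryToken` class, the concept IDs and the greedy longest-match strategy for free text are assumptions, not a description of how SeMedico actually analyzes queries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryToken:
    text: str
    concept_id: Optional[str]  # None marks a plain keyword without a concept


def from_suggestion(label, concept_id):
    """A selected suggestion becomes one concept token, even if multi-word."""
    return QueryToken(label, concept_id)


def from_keyword(word):
    """The keyword item adds the word without resolving it to a concept."""
    return QueryToken(word, None)


def from_free_text(text, lexicon):
    """Assumed strategy: greedily match the longest known phrase at each position."""
    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest span first
            phrase = " ".join(words[i:j])
            concept_id = lexicon.get(phrase.lower())
            if concept_id is not None:
                tokens.append(QueryToken(phrase, concept_id))
                i = j
                break
        else:
            # No known phrase starts here: keep the word as a plain keyword.
            tokens.append(QueryToken(words[i], None))
            i += 1
    return tokens
```

With a hypothetical lexicon such as `{"parkinson's disease": "C1", "il2": "C2"}`, the free-text query "Parkinson's disease IL2 therapy" yields two concept tokens followed by the plain keyword "therapy".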
After a query has been sent, the document hits are presented in pages of 10 hits each. Some of the text portions causing the hit are displayed together with the title and bibliographical details about the citation. The text snippets are typically complete sentences to allow quick assessment of the hit. The snippets are highlighted to prominently show words, concepts and even semantic relations that matched the query. For PPIs, all elements of the relation are highlighted, i.e. arguments (genes/proteins), relation types (e.g. regulation, binding etc.) and confidence clues (e.g. ‘maybe’, ‘the data suggest that’) modifying the respective PPI.
SeMedico runs on the entire MedLine collection, currently (as of December 2015) more than 25M documents. MedLine documents consist of bibliographical entries, which include author names, the title of the paper, the journal in which it appeared, etc., as well as assorted keywords that are either derived from the Medical Subject Headings (MeSH) or deliberately chosen free-text descriptors. Furthermore, they include author-supplied abstracts, but not the full text. Where available, external links to full texts (e.g. purchasable at publishers’ websites) are provided.
PubMed Central documents, on the other hand, include not only the abstract of a publication but also the full text portion. The non-textual information pieces (audio or video information, photos, graphics, tables, etc.) are available from within SeMedico.
Formulating a reasonable query is, even for a domain expert, often a very challenging task. Two main problems have to be dealt with: the terminological variety in the domain of discourse and the various linguistic variations single terms may undergo.
On the one hand, the terminological level addressed by a semantic search engine relates to alternative denotations of the same meaning unit (so-called synonyms), such as ‘Appendicitis’ and ‘Blinddarmentzündung’ within one natural language or across different natural languages (‘Blinddarmentzündung’ and ‘inflammation of the appendix’). On the other hand, it relates to semantic, mostly taxonomic, relations that hold between terms, such as between a narrower term (e.g. ‘Appendicitis’) and a broader term (e.g. ‘inflammation’, or even ‘disease’).
The linguistic variations address other forms of synonymous relations such as spelling variations (e.g. ‘Cortex’ or ‘Kortex’), inflectional variants (e.g. ‘kidney’, ‘kidney’s’, ‘kidneys’) and long and short forms (e.g. ‘electroencephalogram’ and ‘EEG’).
Letting searchers anticipate (or sometimes just guess) all these variants while formulating a search query is an approach that has repeatedly been shown in experiments to fail. Semantic search engines are instead designed to compensate automatically for both the terminological and the linguistic variety of natural language and to normalize these mostly synonymous variants. In effect, one mention of a variant form in a query leads to the identification of all other synonyms in the course of the document search.
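The normalization idea can be sketched in a few lines: every surface variant is mapped to one shared concept identifier, both when documents are indexed and when the query is interpreted. The variant table, the concept IDs and the function names below are hypothetical, and the whitespace tokenization is a deliberate simplification.

```python
# Hypothetical variant table: lower-cased surface form -> concept identifier.
VARIANT_TO_CONCEPT = {
    "electroencephalogram": "C_EEG",
    "eeg": "C_EEG",
    "kidney": "C_KIDNEY",
    "kidneys": "C_KIDNEY",
    "kidney's": "C_KIDNEY",
    "cortex": "C_CORTEX",
    "kortex": "C_CORTEX",
}


def normalize(term):
    """Map a surface form to its concept ID, or keep the lower-cased word."""
    key = term.strip().lower()
    return VARIANT_TO_CONCEPT.get(key, key)


def index_document(text):
    """Return the set of normalized tokens of a document (naive tokenization)."""
    return {normalize(token.strip(".,;:()")) for token in text.split()}


def search(query_term, documents):
    """Return the documents that mention any variant of the queried concept."""
    concept = normalize(query_term)
    return [doc for doc in documents if concept in index_document(doc)]
```

Because both sides are normalized to the same identifier, a query for "electroencephalogram" in this sketch also finds a document that only mentions "EEG".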
SeMedico’s terminological system has its roots in multiple resources. One heavily exploited resource is the MeSH taxonomy. For suggestion display, all categories but ‘Genes and Proteins’, ‘Gene Ontology’ and ‘Gene Regulation Ontology’ (see below) were derived from the MeSH Main Headings and Supplementary Concepts. To reflect the focus on the biological foundations of medicine, some parts of the MeSH Main Headings were regrouped to make the categories shown for suggestions easier to understand.
In particular, the ‘Genes and Proteins’ facet was not taken from the MeSH but rather originates from NCBI Gene. It was imported into SeMedico with a filter that restricts the displayed terms to those actually recognized by GeNo, the gene classifier SeMedico employs. To provide taxonomic access to gene terms, cross-species gene terms were created. Each of these cross-species terms groups the terms for the same gene in different organisms. For example, to summarize the terms ‘IL2 (Homo Sapiens)’, ‘IL2 (Mus Musculus)’, ‘IL2 (Ovis aries)’ etc., a new term ‘IL2 (any organism)’ was created. Selecting this term triggers a search for ‘IL2’ independent of the concrete species.
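The construction of cross-species terms can be pictured with the following minimal sketch. It assumes, purely for illustration, that genes of different organisms are grouped by identical gene symbol. The record layout, the placeholder IDs (not real NCBI Gene identifiers) and the function name `build_cross_species_terms` are hypothetical.

```python
from collections import defaultdict

# Hypothetical gene records: (symbol, organism, placeholder gene ID).
GENES = [
    ("IL2", "Homo Sapiens", "gene:1"),
    ("IL2", "Mus Musculus", "gene:2"),
    ("IL2", "Ovis aries", "gene:3"),
    ("TP53", "Homo Sapiens", "gene:4"),
]


def build_cross_species_terms(genes):
    """Group species-specific gene IDs under one '<symbol> (any organism)' term."""
    groups = defaultdict(list)
    for symbol, _organism, gene_id in genes:
        groups[symbol].append(gene_id)
    return {f"{symbol} (any organism)": ids for symbol, ids in groups.items()}
```

Selecting ‘IL2 (any organism)’ would then correspond to searching for any of the species-specific IDs collected under that term.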
The ‘Gene Ontology’ and the ‘Gene Regulation Ontology’ were integrated from BioPortal. They are well-known ontologies and fitting extensions to the MeSH-based facets. They also provide the semantic foundation of the PPI types for which SeMedico offers search capabilities.