INtERAcT: Interaction Network Inference from Vector Representations of Words

Background
In recent years, the number of biomedical publications made freely available through literature archives has grown steadily, resulting in a rich source of untapped knowledge. Most biomedical facts are, however, buried in unstructured text, and exploiting them requires expert knowledge and time-consuming manual curation of published articles. Hence, the development of novel methodologies that can automatically analyze textual sources, extract facts and knowledge, and produce summarized representations that capture the most relevant information is a timely and pressing need.

Results
We present INtERAcT, a novel approach to infer interactions between molecular entities extracted from literature using an unsupervised procedure that takes advantage of recent developments in automatic text mining and text analysis. INtERAcT implements a new metric acting on the vector space of word representations to estimate an interaction score between two molecules. We show that the proposed metric outperforms other metrics at the task of identifying known molecular interactions reported in publicly available databases for different cancer types.
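As a rough illustration of what scoring entity pairs in a word-embedding space can look like, the sketch below ranks pairs by cosine similarity. This is a generic baseline of the kind such a metric could be compared against, not the INtERAcT metric itself, which this abstract does not specify. The entity names (`geneA`, `geneB`, `geneC`), the three-dimensional vectors, and the helper functions are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pairs(embeddings):
    """Score every unordered pair of entities and sort by decreasing similarity."""
    names = sorted(embeddings)
    scored = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine_similarity(embeddings[first], embeddings[second])
            scored.append((first, second, score))
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical toy embeddings; real word vectors would be learned from a text corpus.
toy_embeddings = {
    "geneA": [0.9, 0.1, 0.0],
    "geneB": [0.8, 0.2, 0.1],
    "geneC": [0.0, 0.1, 0.9],
}

ranked = rank_pairs(toy_embeddings)
for first, second, score in ranked:
    print(f"{first} - {second}: {score:.3f}")
```

In this toy setting the highest-ranked pair is the one whose vectors point in nearly the same direction. A ranked list of this kind is what would be compared against interactions reported in reference databases.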

Conclusions
Our findings suggest that INtERAcT may increase our capability to summarize the understanding of a specific disease or biological process using published literature in an automated and unsupervised fashion. Furthermore, our approach does not require text annotation, manual curation or the definition of semantic rules based on expert knowledge, and hence it can be readily and efficiently applied to different scientific domains, enabling the automatic reconstruction of domain-specific networks of molecular interactions.
Key words: Natural Language Processing, word embeddings, protein-protein interactions, knowledge extraction, prostate cancer

By: Matteo Manica, Roland Mathis, Maria Rodriguez Martinez

Published in: RZ3918 in 2018

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
