Grants and Contributions:
Grant or Award spanning more than one fiscal year. (2017-2018 to 2022-2023)
The massive amount of publicly available data offers a major opportunity for artificial intelligence to play a key role in the life sciences. Automatic approaches have proven effective in supporting life sciences research, yet mining complex and unstructured data remains a major challenge. Life scientists looking for existing knowledge face critical challenges such as discovering entities in documents, retrieving documents and data relevant to specific topics, and analyzing data according to their contribution to experiments. In this context, the objective of my research program is to contribute to knowledge discovery in the life sciences by easing access to existing knowledge and supporting its exploration. I propose to reach this objective by creating algorithms that jointly retrieve and mine textual and non-textual data.
Over the next five years, my research will therefore focus on two objectives:
O1. The investigation of new models and algorithms to jointly retrieve various types of documents from natural language (NL) queries.
Document retrieval is a critical step in the life sciences, since the retrieved results can be used as input for a variety of tasks such as curation, triage, or biological network modeling. The challenge is twofold: understanding NL queries, and retrieving heterogeneous types of documents. The objective is to investigate how best to analyze NL queries and expand them in directions that trigger the retrieval of articles, gene or protein sequences, related database entries, experimental data, etc.
O2. The exploration of new algorithms to discover bio-entities in documents, and link them to relevant knowledge bases.
Though much work has been done on entity discovery and linking (EDL) in social media and news, many challenges remain in the life sciences. Because automatically annotated documents support researchers in building computational models of biological processes, further work on the bio-entity discovery and linking task is necessary.
EDL is very challenging in genomics because bio-entities are often highly ambiguous, and little context is usually available for disambiguation. The objective is to investigate how generic approaches for solving the EDL task can be adapted to the genomics field, and how several reference databases can be used together to support linking and disambiguation of bio-entities.
This research program is cross-disciplinary. In the computer science domain, the program combines natural language processing, information retrieval, machine learning, and big data mining. My collaboration with genomics researchers provides a challenging environment involving real users.
Highly Qualified Personnel involved in the program will receive advanced training in natural language processing, applied machine learning, and text/data mining.
The resulting work will be released as open source so that it can be easily reused by the community and transferred to industry.