Grants and Contributions:
Grant or Award spanning more than one fiscal year. (2017-2018 to 2022-2023)
Automatically extracting knowledge from a large set of mostly unstructured documents (such as the Web) and organizing it into a knowledge base (KB) is a key challenge in artificial intelligence. Intuitively, such KBs should directly improve the quality of many NLP applications such as question answering, information retrieval and text analytics. Open information extraction, the task of extracting knowledge from texts with little supervision (in particular, without prescribing the kind of information to mine), has brought new hope for such an endeavour.
Although a number of well-designed components for extracting facts and relations (so-called tuples) from texts are now widespread and readily available, tapping the information in large text collections still raises a number of issues. The technology embedded in a typical knowledge extraction pipeline is fraught with shortcomings: errors in coreference resolution, named-entity resolution and parsing compound, so that many tuples (if not the vast majority) are simply useless. Moreover, most work targets very frequent entities and relations, which excludes a large quantity of information found in the domain-specific texts that are pervasive on the Web.
Our long-term objective is to develop the expertise needed to populate, curate, maintain and use a KB. Our proposal differs from several existing initiatives in a number of key respects. First, since specific domains are prevalent on the Web, we want our technology to be domain-aware. Second, since today's world is multilingual and not everything is written in English, we also want our technology to be multilingual in nature. Last, most work is devoted to developing fully automatic technology for assisting humans; in our proposal, we are instead interested in measuring how far games with a purpose can lead humans to assist the computer.
To this end, this proposal targets the development of deFacto, a multi-domain, bilingual (French-English) KB acquired iteratively from texts mined from the Web, with the help of feedback collected from users through serious games.