Grants and Contributions:

Title:
Pattern and Knowledge Discovery on Relational, Biosequence and Multiple Temporal Sequence Data
Agreement Number:
RGPIN
Agreement Value:
$115,000.00
Agreement Date:
May 10, 2017 -
Organization:
Natural Sciences and Engineering Research Council of Canada
Location:
Ontario, CA
Reference Number:
GC-2017-Q1-02400
Agreement Type:
Grant
Report Type:
Grants and Contributions
Additional Information:

Grant or Award spanning more than one fiscal year. (2017-2018 to 2022-2023)

Recipient's Legal Name:
Wong, Andrew (University of Waterloo)
Program:
Discovery Grants Program - Individual
Program Purpose:

As information technology advances, a tremendous amount of data is generated in all industries. A growing number of efforts attempt to leverage this data, with the assistance of high-performance computing, to develop intelligent systems for various applications. Despite their success, the models that turn data into these applications are often black boxes. This has two drawbacks. First, users cannot trust the models because the “how” is hidden. Second, it is difficult for humans to interpret the “why”.

Here we introduce a new paradigm: From Pattern to Knowledge (P2K). It first autonomously discovers strong statistical associations/relations from data. It represents them as patterns, pattern clusters and their associations/co-occurrences to reflect the “what” and “where” of critical information, without explicit reliance on prior knowledge, which is usually unavailable or difficult to obtain. It then provides the “how”, in the form of robust algorithms that conduct analysis and direct further search, to disclose the “why” of the underlying mechanisms in an interpretable and verifiable way. P2K will make existing machine intelligence approaches more robust and reliable while revealing useful and actionable knowledge.
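
As a rough illustration of what "discovering strong statistical associations" can mean for relational data, the sketch below flags value pairs of two categorical attributes that co-occur more often than independence would predict, using adjusted standardized residuals of a contingency table. This is a generic textbook test chosen for illustration, not the P2K algorithm itself; the function name `significant_pairs`, the 1.96 threshold, and the toy records are all assumptions.

```python
from collections import Counter
from math import sqrt


def significant_pairs(records, threshold=1.96):
    """Return value pairs (a, b) that co-occur more often than expected.

    `records` is a list of (a, b) tuples, one per row of a two-attribute table.
    Hypothetical helper for illustration only; not taken from the P2K work.
    """
    n = len(records)
    joint = Counter(records)                 # observed count of each (a, b)
    row = Counter(a for a, _ in records)     # marginal counts of attribute 1
    col = Counter(b for _, b in records)     # marginal counts of attribute 2

    found = {}
    for a in row:
        for b in col:
            # Expected count if the two attributes were independent.
            expected = row[a] * col[b] / n
            # Variance term of the adjusted standardized residual.
            variance = expected * (1 - row[a] / n) * (1 - col[b] / n)
            if variance == 0:
                continue  # an attribute with a single value carries no signal
            residual = (joint[(a, b)] - expected) / sqrt(variance)
            if residual > threshold:
                found[(a, b)] = residual
    return found


# Toy data (assumed): "x" mostly appears with "p", "y" mostly with "q".
toy = [("x", "p")] * 40 + [("x", "q")] * 10 + [("y", "p")] * 10 + [("y", "q")] * 40
patterns = significant_pairs(toy)
```

On the toy data, only the pairs ("x", "p") and ("y", "q") exceed the threshold. Pairs flagged this way could then serve as candidate patterns for clustering or as features, in the spirit of the paragraph above.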

Hence, the objective of this proposal is to develop P2K, targeting three types of data in its initial phase: relational, biosequence and multiple temporal sequence data. We choose biosequences from bioinformatics as a platform to validate the scientific value and effectiveness of P2K. In the last five years, working from biosequences, we have developed algorithms to a) discover, prune, locate and analyze statistically significant patterns, pattern clusters and their associations/co-occurrences so as to reveal local and distant functional domains and relationships without relying explicitly on prior knowledge or clues; and b) use the patterns discovered as features for predictive analysis. The effectiveness of P2K is backed by strong publications.

In the next five years, for biosequence data, we will develop algorithms to predict binding sites and partners between proteins, between proteins and DNA/RNA, and between proteins and aptamers. This will reduce users' reliance on structures, saving them time and budget, and will help drug discovery and disease treatment by identifying small molecules that can inhibit binding. For relational data, we will complete a scalable system to discover and analyze patterns in mixed-mode data, including the use of patterns extracted from business/finance reports via our text mining module, Text-P2K, in a semi-supervised fashion to assist decision making. For multiple time-series data, we will leverage the discovered temporal associations of pattern clusters to capture a wide range of local relations along and across individual series, and use them as features/patterns for interpretation and forecasting. This can help financial firms identify rare movements to control risk, and help factories identify machinery faults in advance.