Grants and Contributions:
Title:
Topology and Transformers in tandem: developing methodology to combine these cutting-edge tools
Agreement Number:
1012497
Agreement Value:
$183,333.00
Agreement Date:
Dec 1, 2023 - Mar 31, 2026
Description:
The Core AI4Design proposal seeks to integrate two powerful methods for
data analysis, namely Topological Data Analysis (TDA) and machine
learning (ML). These two approaches have so far been relatively
independent of each other in practice. This proposal aims to develop new
techniques that combine the strengths of both approaches to create next-generation
tools for analyzing complex data. More precisely, the many types
of sequence data where transformers are used - sequences of words in
natural language texts, or sequences of programming code, or sequences
of video frames and audio - are all highly complex in that the data
themselves are high-dimensional (e.g. video), they involve very high-order
structure (e.g. the way in which pixels in video allow us to see a cat moving
is not something that can be summarized in a handful of equations) and the
corpora of data that one needs for training are huge. TDA on the other hand
is very effective at summarizing high-order structure and actual geometry in
data. We propose to develop a systematic approach for combining TDA and
transformers for more efficient learning and analysis of complex data.
The proposal comprises two main streams. The first stream aims to develop a hierarchical version of TDA structures that utilize successive layers, with
each layer pooling the points of the layer below, akin to the operation of
convolutional neural networks. The second stream targets the ML paradigm
of transformer networks and proposes to use TDA persistence images as
the data representation, which will be served to transformers. The
developed tools will be tested on datasets where the current state of the art uses transformers and where "shape" or geometry is believed to play a role in the underlying learning problem but has not yet been leveraged, because
such information is absent from the typical input for transformers. An
example is the discovery of new proteins predicted to have desirable
luminosity or therapeutic properties. The state of the art in this field uses
transformers to generate sequences of amino acids likely to have the
desired properties (these are then synthesized and tested). It is known that
the geometry of a fluorescent protein contributes directly to luminosity, and
folded shape information is available for many proteins used to train the
transformer, but this geometric information is not fed to the transformer. We
propose to summarize it with TDA persistence images and use these to
train slightly modified transformers. Our aim is to demonstrate improved
generative ability in the TDA-transformer compared to a basic transformer alone.
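As an illustration only, and not the project's own method or code, the following minimal Python sketch shows how a persistence diagram might be rasterized into a persistence image and flattened into a fixed-length vector that could accompany a transformer's input. The function names, the grid resolution, the Gaussian width `sigma`, the linear persistence weighting, and the toy diagram are all assumptions made for this example. For simplicity the Gaussian is sampled at pixel centres rather than integrated over each pixel, and no normalization constant is applied.

```python
import math


def persistence_image(diagram, resolution=8, sigma=0.1, bounds=(0.0, 1.0)):
    """Rasterize a persistence diagram into a resolution x resolution grid.

    diagram: list of (birth, death) pairs with death >= birth.
    Each pair is mapped to (birth, persistence), weighted linearly by its
    persistence, and smoothed with an isotropic Gaussian of width sigma.
    All parameters are illustrative assumptions, not values from the proposal.
    """
    lo, hi = bounds
    step = (hi - lo) / resolution
    # Pixel centres along each axis (same grid for birth and persistence).
    centres = [lo + (i + 0.5) * step for i in range(resolution)]
    image = [[0.0] * resolution for _ in range(resolution)]
    for birth, death in diagram:
        pers = death - birth
        if pers <= 0:
            continue  # points on the diagonal carry no weight
        for r, y in enumerate(centres):      # rows index persistence
            for c, x in enumerate(centres):  # columns index birth
                d2 = (x - birth) ** 2 + (y - pers) ** 2
                image[r][c] += pers * math.exp(-d2 / (2 * sigma ** 2))
    return image


def flatten(image):
    """Flatten the grid row by row into one fixed-length feature vector."""
    return [value for row in image for value in row]


# Toy diagram (hypothetical values): one long-lived feature, one short-lived
# feature, and one point on the diagonal that contributes nothing.
diagram = [(0.1, 0.9), (0.2, 0.3), (0.5, 0.5)]
image = persistence_image(diagram)
tokens = flatten(image)
```

In this sketch the long-lived feature dominates the image, which reflects the usual motivation for persistence weighting: features that persist across scales are treated as more informative than short-lived ones. How such a vector would actually be combined with sequence tokens is not specified in the description above.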
Organization:
National Research Council Canada
Expected Results:
In the short term, anticipated outcomes will be strengthened collaborations across industry, academia, and government to support research excellence. In the medium term, anticipated outcomes will be the development of new and potentially disruptive technologies with collaborators. In the long term, anticipated outcomes will be collaborative solutions to public policy challenges and stronger innovation systems.
Location:
Ottawa, Ontario, CA K1N 6N5
Reference Number:
172-2023-2024-Q3-1012497
Agreement Type:
Grant
Report Type:
Grants and Contributions
Recipient Business Number:
119278877
Recipient Type:
Academia
Recipient's Legal Name:
University of Ottawa
Federal Riding Name:
Ottawa–Vanier
Federal Riding Number:
35078
Program:
Collaborative Science, Technology and Innovation Program - Collaborative R&D Initiatives
Program Purpose:
Collaborate on multiparty research and development programs to catalyze transformative, high-risk, high-reward research with the potential for game-changing scientific discoveries and technological breakthroughs in priority areas.
NAICS Code:
541710