Grants and Contributions:
Grant or Award spanning more than one fiscal year. (2017-2018 to 2022-2023)
We are drowning in a sea of data. Jobs that used to rely on other skills now increasingly rely on the ability to interpret, understand, and trust data. For example, many biologists now spend less time in the lab and more time analyzing the data that they have created. Thanks to improved tools for creating data, they are also expected to manage and understand this data by themselves. Unfortunately, this data is often exceedingly difficult to understand and trust, even for experts in the field. There are many reasons for this. One is that the representation of the data is often designed for computers, not for people (i.e., for storage, not accessibility). Another is that users may spend a lot of time using systems that combine data from multiple sources, which leaves them struggling to integrate data that they neither understand nor trust.
This lack of understanding and trust can cause problems both large and small. For example, the recent financial crisis was partially caused by the inability to track money across multiple sources (e.g., mutual funds are in a different database than savings accounts). Because it was impossible to understand and trust data that was spread across various sources, regulators did not recognize some of the problematic flows of money until after the crash. On a smaller scale, without being able to find all of the data needed to evaluate building design choices, it is difficult to explore building design alternatives, which stifles innovation and leads to less efficient buildings.
My proposed research will improve the ability of users to access the data that they need and to trust that this data is curated well enough for their purposes. It will build on my current work, which has examined case studies in economic data and civil engineering, and will apply techniques that I have learned from working with recommender systems to recommend data items to users. In this proposal I describe a number of specific directions that I intend to pursue with my students in this space, including: (1) using query logs to recommend parts of the data that users may be interested in; (2) helping users understand the complex connections that appear in XML files; and (3) improving the understanding of provenance data.