Grants and Contributions:

Title:
Reinventing the tuning and debugging tools for multi-thousand cores computer systems
Agreement Number:
RGPIN
Agreement Value:
$140,000.00
Agreement Date:
May 10, 2017 -
Organization:
Natural Sciences and Engineering Research Council of Canada
Location:
Quebec, CA
Reference Number:
GC-2017-Q1-02775
Agreement Type:
Grant
Report Type:
Grants and Contributions
Additional Information:

Grant or Award spanning more than one fiscal year. (2017-2018 to 2022-2023)

Recipient's Legal Name:
Dagenais, Michel (École Polytechnique de Montréal)
Program:
Discovery Grants Program - Individual
Program Purpose:

The top supercomputers in recent years owe much of their power to the number-crunching ability of Graphical Processing Units (GPU) used for General Purpose computation (GPGPU). However, GPGPUs are also frequently used in smaller clusters and individual workstations for demanding tasks. Even in mobile phones and embedded systems, GPGPUs are an efficient medium to accomplish specialised tasks such as image processing. Furthermore, GPGPUs are at the forefront of a wider trend towards heterogeneous system architecture (HSA), where specialised co-procesors are used for a number of tasks beyond graphics and video, such as signal processing and network packet processing.

The internal architecture of these complex GPGPU devices, with over 4000 cores and arithmetic units, was not well documented until recently. The architecture evolved significantly in recent years with a much tighter integration with main processing CPU cores, offering shared virtual memory and user-level queues. The problem is that the programming tools for debugging and tuning these systems are severely lacking. The objective of the proposed research is to devise new more efficient algorithms and software architectures for the debugging, tracing and profiling tools, such that they can adequately cope with hardware architectures boasting thousands of cores and putting a strong pressure on the available memory and bandwidth. The challenges are at several levels: insuring that the tracing and profiling tools impose a minimal overhead on the system studied to avoid affecting its behaviour, efficiently exploiting all the available tracing and profiling hardware assistance, and efficiently interfacing the debugging, tracing and profiling tools to the operating system, device driver and application to be monitored.

The significance and novelty of this work is in addressing a serious challenge currently faced by all users of such systems: having suitable tools to efficiently debug and tune their applications on these systems. It is widely recognised that the extremely rapid advances in heterogeneous parallel processing hardware availability has not been followed by an equivalent progress in the software development tools. As a result, a large number of users are not achieving the level of performance that would be attainable with proper tools, or are simply not taking advantage of such hardware, because of the difficulty in efficiently programming, debugging and tuning these devices. The proposed research program will thus have a significant impact, helping all users of sophisticated computing devices in better exploiting the available hardware.