People of ACM - Torsten Hoefler

June 25, 2024

What is the specific focus of ETH Zurich’s Scalable Parallel Computing Laboratory and what makes this lab unique in the field?

Our lab has always been guided by the vision of “Performance as a Science,” which aims to improve our world through faster and more efficient computer systems. Our main research area is developing mathematical performance and requirements models to understand applications, systems, and the mapping between them. We focus specifically on large-scale AI and HPC workloads and on the systems used to train models such as GPT-3 and GPT-4. Computer performance has driven human progress for decades; the best example is the recent AI revolution, which was enabled by accelerated AI computing. We continue this trend by driving one of the most important fields combining AI and scientific simulation (“AI4Science”): climate simulations that help us understand human impact on our planet at a finer granularity.

Your most cited paper, “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis,” written with Tal Ben-Nun, surveys recent research on the relationship between the deep neural networks used to train machine learning applications and high-performance computing clusters. What trends in high-performance computing are having the most impact on machine learning?

With this paper, we were among the first to combine all the different forms of parallelism (data, pipeline, and operator) into a single coherent view of how to parallelize and distribute deep learning workloads. Within Microsoft, we coined the term “3D parallelism” based on this work, and it quickly caught on in the community. It also led to the insight that the communication pattern can be expressed as a 3D torus, which became the basis of the HammingMesh topology and of the later, topologically equivalent TPUv4 optical network. Most AI training frameworks, and some inference frameworks, now engage all three levels of parallelism to scale to large systems.
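To make this concrete, here is a minimal MPI sketch (my illustration, not code from the paper or any framework) of how a job can be mapped onto a 3D process grid with one sub-communicator per parallelism dimension; the grid factorization and communicator names are assumptions chosen for the example.

    /* Sketch: map ranks onto a 3D torus for data x pipeline x operator parallelism. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world_size, world_rank;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Let MPI choose a balanced 3D factorization of the job size;
         * wrap-around links make the grid a 3D torus. */
        int dims[3] = {0, 0, 0};
        int periods[3] = {1, 1, 1};
        MPI_Dims_create(world_size, 3, dims);

        MPI_Comm grid;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &grid);

        /* One sub-communicator per dimension: gradient all-reduces run along the
         * data-parallel dimension, activations flow along the pipeline dimension,
         * and operator (tensor) parallelism reduces partial results. */
        MPI_Comm data_comm, pipe_comm, op_comm;
        int keep_data[3] = {1, 0, 0};
        int keep_pipe[3] = {0, 1, 0};
        int keep_op[3]   = {0, 0, 1};
        MPI_Cart_sub(grid, keep_data, &data_comm);
        MPI_Cart_sub(grid, keep_pipe, &pipe_comm);
        MPI_Cart_sub(grid, keep_op, &op_comm);

        if (world_rank == 0)
            printf("3D grid: %d x %d x %d (data x pipeline x operator)\n",
                   dims[0], dims[1], dims[2]);

        MPI_Comm_free(&data_comm);
        MPI_Comm_free(&pipe_comm);
        MPI_Comm_free(&op_comm);
        MPI_Comm_free(&grid);
        MPI_Finalize();
        return 0;
    }

The three sub-communicators make the 3D-torus structure of the communication pattern explicit: each form of parallelism only ever communicates along its own dimension of the grid.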

You’ve co-authored a book and you are considered an expert on the Message-Passing Interface (MPI). Why is the MPI so central to high performance computing, and what are the most recent advances in this area?

MPI is the de facto HPC programming model. Furthermore, specialized AI communication frameworks such as NCCL, ACCL, RCCL, etc., often called collective communication libraries (CCLs), are essentially mini-MPI libraries that extract the subset of MPI semantics and calls relevant to their use case. Specifically, MPI principles such as collective operations, and the nonblocking (asynchronous) versions of them that we introduced in the MPI-3 specification and various research papers, are now the bread and butter of AI communication in both training and inference systems.
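As an illustration of why these semantics matter for AI workloads, the following minimal sketch (my example, not code from the MPI book or any CCL; the buffer size and layer structure are invented) uses the MPI-3 nonblocking MPI_Iallreduce to overlap a gradient all-reduce with further computation.

    /* Sketch: overlap a gradient all-reduce with computing the next layer. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024  /* illustrative gradient size for one layer */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        float *grad = calloc(N, sizeof(float));
        float *summed = calloc(N, sizeof(float));

        /* Start summing this layer's gradients across all ranks (MPI-3
         * nonblocking collective); communication progresses in the background. */
        MPI_Request req;
        MPI_Iallreduce(grad, summed, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &req);

        /* ... compute the next layer's gradients here, overlapping with
         * the ongoing communication ... */

        /* Block only when the reduced gradients are actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(grad);
        free(summed);
        MPI_Finalize();
        return 0;
    }

This start-compute-wait pattern is exactly what AI frameworks implement on top of their CCLs to hide communication time behind computation.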

In 2022 the world’s first exascale computer, Frontier, became operational. What has surprised you about Frontier’s performance, and / or what new challenges to the field is Frontier presenting?

Frontier is an impressive machine, designed for the largest HPC workloads. AI and mixed AI-HPC workloads, however, will require much higher network bandwidths than today’s machines, including Frontier, offer. But of course, networking equipment is expensive and can thus not easily be “scaled up.” We are currently attacking this problem on two fronts: AI-optimized topologies such as HammingMesh, and new technologies such as Ultra Ethernet, where I co-chair the transport working group. HammingMesh exploits the observation that most of the required network bandwidth stays within a localized part of the network. The Ultra Ethernet Consortium designs protocols for congestion control, routing, and reliability that scale to the largest computers and support modern AI workloads.

 

Torsten Hoefler is a Professor at the Swiss Federal Institute of Technology (ETH) Zurich, where he serves as Director of the Scalable Parallel Computing Laboratory. He is also the Chief Architect for Machine Learning at the Swiss National Supercomputing Centre (CSCS) and a long-term consultant to Microsoft in areas including large-scale AI and networking. His research interests center on performance-centric system design, including scalable networks, parallel programming techniques, and performance modeling for large-scale simulations and AI systems.

Among his honors, Hoefler was named an ACM Fellow for foundational contributions to high-performance computing and the application of HPC techniques to machine learning. He was also part of a team that won the ACM Gordon Bell Prize. He is the youngest recipient of the IEEE CS Sidney Fernbach Award, given for outstanding contributions in the application of high-performance computers using innovative approaches. Hoefler is also the inaugural recipient of the ISC Jack Dongarra Early Career Award, which is given to an early- to mid-career researcher who has been a catalyst for scientific progress through exceptional work in areas including numerical algorithms and software libraries, computational sciences, mathematics, and machine learning.