People of ACM - Joseph Gonzalez
June 18, 2019
Why is this an exciting time to be working at the intersection of machine learning and data systems?
Machine learning is powered by data, and advances in machine learning have been driven by advances in data systems. Part of the exciting recent progress in machine learning stems from larger datasets, faster training, and perhaps most important, tools and abstractions that have enabled a much larger group of people to advance the field. When I started my PhD in 2006, studying systems for machine learning was largely viewed as engineering and not research. Furthermore, developing new models often meant developing new training algorithms and inference procedures that required a deep understanding of statistics and optimization. I remember spending evenings checking my math and pthread implementation to debug a flawed approximate parallel Markov chain Monte Carlo inference procedure that was failing due to issues with my model and underlying data.
Today, there are entire conferences devoted to systems for machine learning, and undergraduates are developing and training advanced models on GPU clusters that outperform work done by senior PhD students (including me) less than a decade ago. Progress is now largely limited by creativity and our budget for compute resources and data. Machine learning frameworks, like PyTorch and TensorFlow, provide the necessary abstractions to hide the complexity of differentiation, optimization, and parallel computation, freeing the modern data scientist to focus on the learning problem. These frameworks build on advances in data systems and scientific computing to unlock new parallel hardware.
Despite all this progress, I think there are still tremendous opportunities to pursue and substantial challenges to overcome. The energy behind model development has increasingly ignored the data, and there are opportunities to better integrate model development and data systems. Furthermore, as machine learning becomes an increasingly common part of complex software systems, there is a need to develop machine learning engineering practices and tools to address problems ranging from deployment and serving to versioning, monitoring, and debugging models in production. Finally, there is a growing need to address privacy and security, as well as to consider the long-term consequences of models and systems on society.
In a Research for Practice article in the March/April issue of ACM Queue, you and Dan Crankshaw surveyed recent research in machine-learning serving systems. Why is this an important area in the field right now?
Much of the attention of the machine learning community, and more recently the systems community, has gone to the fundamental problem of training increasingly complex models on ever-larger datasets. However, until recently, very little attention has been given to how we use these trained models to solve problems. What we and others are finding is that there is a rich set of machine learning and systems problems that arise after training is finished. These problems range from managing the tradeoff between latency, throughput, cost, and accuracy in response to varying workloads and resource availability, to online model selection, learning, and failure detection. In the future, rendering predictions from machine learning models will be the dominant use of processor cycles.
You recently co-authored two papers on serverless computing. What are the major challenges with respect to serverless computing and data science?
This is a challenging question and one that we are still studying. Fundamentally, serverless computing is about separating application logic and service requirements from the management and provisioning of the physical infrastructure in a way that enables applications to be truly elastic (scaling all the way to 0). Today, serverless applications typically leverage high-level compute (e.g., AWS Lambda) and storage (e.g., S3) services to achieve this goal. As more compute or storage is needed, more hardware can be automatically and transparently added by the cloud provider.
However, as we note in those papers, we are still a long way from realizing the full potential of the serverless vision. Interestingly, a key challenge to the application of serverless computing in data science is that serverless computing isolates compute and storage to enable independent elastic scaling, but at the expense of data locality. Data science, as you might expect, tightly couples compute and storage. As a consequence, today, purpose-built data services (e.g., Cloud Dataflow or BigQuery) achieve much of the serverless vision in the narrow context of data science. However, we might like to imagine a future where these technologies and their future descendants can be built to scale elastically and efficiently in tomorrow’s serverless cloud.
What is an interesting advance, or exciting avenue of research, that will enhance autonomous vehicle systems in the next few years?
Autonomous vehicles are still in their infancy but could fundamentally change the world by substantially reducing automotive fatalities and traffic-related carbon emissions. However, while much of the research world (including my group) is busy trying to solve the incredibly challenging perception, planning, and control problems of autonomous driving, there is far less attention to the software platform that will host these advances. Instead, many of the leading industrial and academic autonomous vehicle projects are developing on a decade-old open-source robotics framework (ROS) intended for research into home robotics. The fact that this has become the dominant solution actually speaks to the brilliance of the ROS project, which prioritized simplicity and composition, important goals when trying to solve challenging AI problems. However, ROS was never intended to connect mission-critical systems that must process gigabytes of data every second across an array of hardware accelerators with human life on the line.
My research group is studying how to enable rapid innovation in the underlying AI components while supporting diverse parallel hardware and providing the time and availability guarantees needed for autonomous driving. While this project is still in its early stages, the key insight is that a car is really a distributed dataflow system on wheels.
Joseph "Joey" Gonzalez is an Assistant Professor in the Electrical Engineering and Computer Science Department at the University of California, Berkeley and a founding member of the Berkeley RISE Lab, where he and his colleagues conduct research on data-intensive systems. RISE Lab projects include accelerated deep learning for high-resolution computer vision, dynamic deep neural networks for transfer learning, and real-time model serving. He is also a founder of Turi Inc. (formerly GraphLab), which was recently acquired by Apple Inc., and is on the technical advisory board for Deepscale.ai, which is developing new computer vision software and systems for autonomous vehicles.
At the ACM-IMS Interdisciplinary Summit on the Foundation of Data Science, Gonzalez moderated the panel “Deep Learning, Reinforcement Learning, and Role of Methods in Data Science.”