People of ACM - Volker Markl
October 18, 2022
Why is this an exciting time to be working in big data and AI?
Data management (DM) and machine learning (ML) are the key drivers of the current wave of innovation in artificial intelligence (AI) and data science. Today, we are witnessing a convergence of the DM and ML communities.
In the past, the ML community at large has primarily focused on the design of intelligent algorithms and their use in various applications, and less on the operationalization of ML processes on huge, diverse datasets or continuous data streams, including the associated systems challenges, such as scalability, performance, ease of use, and the end-to-end management of these processes. In contrast, the DM community has traditionally considered these systems challenges, but mostly for relational algebra and its derivatives, and not in the context of linear algebra and more complex iterative computations.
Collaboration at the intersection of these two communities will bring a set of interesting research problems and their associated challenges to light: problems that require profound knowledge of both ML/theory and DM/systems research and that can lead to novel, groundbreaking applications.
What is an important challenge you and your group at TU Berlin are working on right now?
The ever-increasing activities in digitization around the globe are creating digital twins of almost every natural phenomenon in the sciences, industry, and our personal lives. These digital twins are generating massively distributed data streams derived from many millions of sensors and data sources which need to be cleansed, integrated, and analyzed with ML and other AI methods in near-real time. Moreover, these tasks need to be performed securely, be privacy-aware, and ensure legal compliance (e.g., with applicable regulatory directives).
In our NebulaStream program, my research group is working on a novel data processing system that can process complex analytics beyond relational operations on millions of distributed data streams. Our key design points are ease of use, energy efficiency, and high performance. We seek to overcome the challenges of running thousands of continuous queries concurrently that involve relational algebra, linear algebra, and iterative operations in a massively distributed edge/fog/cloud environment. In particular, we are investigating how to achieve low latency and high throughput via the use of compiler techniques, so as to more easily integrate various programming languages. In addition, we are focusing on the use of declarative languages and automatic optimization techniques to distribute and select the right physical implementations of operations automatically, so that they may be pushed onto edge or fog devices whenever possible. Ultimately, our goal is to build a system that minimizes latency, maximizes throughput, and preserves energy efficiency in a resource-constrained environment. Naturally, to ensure success we are always looking for talented research-oriented and systems-oriented individuals to join us and contribute to our objectives.
One of your most cited papers in the ACM Digital Library is “Robust Query Processing Through Progressive Optimization,” which you and your co-authors presented at SIGMOD 2004. How did this paper improve upon the existing state-of-the-art in query optimization?
Traditionally, query optimization has been static. At compile time, query optimizers examine the data distributions of both tables and columns involved in relational queries, as well as the execution environment. Based on a cost model informed by these statistics, the query optimizer determines the best execution strategy (i.e., order and physical implementation of relational operators) for a relational query. However, the cost model is based on assumptions (such as the independence of conjunctive predicates in a selection or the inclusion of tables involved in a join), which may or may not be satisfied. Because these assumptions can fail and statistics are often inaccurate, the execution strategy is often suboptimal.
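To make the independence assumption concrete, here is a minimal Python sketch. It is an illustration only, not taken from the paper or from any real optimizer; the toy table, its column values, and the helper names are hypothetical. It estimates the result size of a conjunctive selection by multiplying per-predicate selectivities and compares that with the true count on perfectly correlated columns.

```python
# Hypothetical toy table of (make, model) pairs; the two columns are
# perfectly correlated, which violates the independence assumption.
rows = [("Honda", "Accord")] * 10 + [("Toyota", "Camry")] * 30 + [("Ford", "Focus")] * 60


def count(rows, predicate):
    """Number of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row))


def is_honda(row):
    return row[0] == "Honda"


def is_accord(row):
    return row[1] == "Accord"


total = len(rows)

# Per-predicate selectivities, as single-column statistics would give them.
sel_make = count(rows, is_honda) / total
sel_model = count(rows, is_accord) / total

# Independence assumption: multiply the individual selectivities.
estimated_rows = sel_make * sel_model * total

# Ground truth: evaluate the conjunction directly.
actual_rows = count(rows, lambda row: is_honda(row) and is_accord(row))

print(f"estimated rows: {estimated_rows:.1f}, actual rows: {actual_rows}")
```

In this toy case the estimate is too low by a factor of ten, which is the kind of error that can push a cost model toward a poor join order or operator choice.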
The key idea of progressive optimization is to introduce a feedback loop into query optimization. To do this, at runtime, the query processor monitors the actual parameters of the cost model and compares them with the estimates used at compile time, which determined the current execution strategy. In the event of a discrepancy that would render the current execution strategy suboptimal, progressive optimization calls the query optimizer once again to determine an alternative, better execution plan. In particular, for long-running queries, changing the query plan during execution improves performance and reduces resource consumption. This work is highly relevant even today, as in streaming systems, like NebulaStream, we have to deal with these types of long-running queries and therefore require progressive optimization.
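The feedback loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation from the SIGMOD 2004 paper: the cost rule in `choose_join`, the threshold of 100 rows, and the `tolerance` factor are all hypothetical.

```python
def choose_join(outer_rows: int) -> str:
    """Hypothetical cost rule: nested-loop join for small inputs, hash join otherwise."""
    return "nested_loop_join" if outer_rows <= 100 else "hash_join"


def run_with_checkpoint(estimated_rows: int, actual_rows: int, tolerance: float = 10.0):
    """Pick a plan from the estimate, then re-optimize if the observed
    cardinality deviates from the estimate by more than `tolerance` times.

    Returns (plan, reoptimized).
    """
    # Compile time: the plan is chosen from the estimated cardinality.
    plan = choose_join(estimated_rows)

    # Runtime: compare the observed cardinality with the estimate.
    ratio = max(actual_rows, 1) / max(estimated_rows, 1)
    if ratio > tolerance or ratio < 1.0 / tolerance:
        # Discrepancy is large enough: call the optimizer again,
        # this time with the observed cardinality.
        return choose_join(actual_rows), True

    # Estimate was close enough: keep the original plan.
    return plan, False


# Badly underestimated input: the checkpoint triggers re-optimization.
print(run_with_checkpoint(estimated_rows=50, actual_rows=5000))
# Accurate estimate: the original plan is kept.
print(run_with_checkpoint(estimated_rows=50, actual_rows=60))
```

A real system would also need to decide where to place such checkpoints and how to reuse intermediate results already computed; the sketch leaves both out.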
You led the project which resulted in the development of the Apache Flink open-source big data analytics system, for which you received a Humboldt Innovation Award. How has Apache Flink impacted the field?
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been published under the Apache 2.0 License and has been designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. Flink has had an impact along several dimensions. First, Flink has become a world-leading system for data stream processing, with a vibrant community of more than 800 open-source developers and an ecosystem with an active conference culture, fueled by tens of thousands of participants in meetups, and a large number of production users, including major companies from Europe, the US, and China. Second, Flink has shown a successful innovation pipeline. It is an example of how systems-oriented research, built on a solid foundation, originating from a European university, and funded by public sources, can successfully be transferred to industry. Third, Flink has become a foundation in many systems and application-oriented areas: many research papers published in top-tier venues, such as ACM SIGMOD and VLDB, make use of and have further advanced Flink. Fourth, Flink has resulted in a successful startup, which had its exit in 2019. Fifth, the story of Apache Flink serves as an inspiration and role model to many students at TU Berlin. It showcases that TU Berlin is a great environment for foundational, system-oriented research in Big Data and Machine Learning Systems. Currently, we are looking to repeat the success story and impact of Flink with NebulaStream, our newest initiative.
As an advisor to the German Federal Government and the European Union on matters related to big data and AI, what are some priorities for the European Union to remain a global leader in AI development and deployment in the coming decades?
In contrast to the popular statement that data is the new oil or the new gold, I consider data to be a factor of production in the economic sense. Data is not consumed like a resource but is an atypical good. Like soil, data needs to be watered and fertilized, to grow and be harvested. We need to curate, integrate, and analyze data to derive new insights and ultimately value from it. Thus, I like to speak of data as the new soil, and not the new oil.
This has profound consequences, since this means that we have to build infrastructures that facilitate the provisioning, sharing, and analysis of data. We require an interoperable, scalable, easy-to-use and broadly accessible data infrastructure where data, algorithms, and processing resources can be offered as open data/open-source, shared in collaborative spaces, or sold in the context of novel business models in an information economy. Building such an infrastructure will require research, standardization, legislative efforts, and entrepreneurial activities. At the same time, it requires a closer interaction of the DM and ML communities and the attraction and retention of top talent across Europe.
The German federal government, jointly with various state governments, has created several permanently funded AI Competence Centers to meet these challenges. As Directors of BIFOLD, the biggest among these centers, Klaus-Robert Müller and I offer attractive working conditions for postdocs (with tenure-track possibilities). We are currently establishing new research groups in various DM and ML research areas in an exciting ecosystem in Berlin composed of top universities, scientific institutions, companies, and a rich startup scene. I am confident that Germany and Europe are taking the right steps to retain a leading role in AI and big data.
Volker Markl is a Professor of Computer Science, Chair of the Database Management and Information Systems (DIMA) Group, and Director of the Berlin Institute for the Foundations of Learning and Data (BIFOLD) at the Technische Universität (TU) Berlin. He is also Head of the Intelligent Analytics for Massive Data Group at the German Research Center for Artificial Intelligence (DFKI). His current research interests lie at the intersection of distributed systems, scalable data processing, and machine learning.
Over the course of his career, Markl has published over 280 scientific papers, received an ACM SIGMOD Best Paper Award and two ACM SIGMOD Research Highlight Awards, and to date has been issued eighteen patents. Among his honors, he was selected as one of Germany’s leading “Digital Minds” (Deutschlands Digitale Köpfe) in 2014 by the German Informatics Society and named an ACM Fellow in 2020 for his contributions to query optimization, scalable data processing, and data programmability.