People of ACM - Alfons Kemper
December 19, 2023
Will you tell us a little about the Database Systems Group at TU Munich, and how the focus of your lab is unique within the field?
I am glad that the CS Department at TUM was able to form one of the best systems-oriented database groups in the world by successively hiring Thomas Neumann, Jana Giceva, and Viktor Leis as additional core database faculty. Our cooperative database work is unique for an academic environment in the sense that we build comprehensive systems into which all our conceptual ideas are integrated and end-to-end tested. We also use our own database systems for academic teaching. My large introductory database class, with about 2,000 students each year, uses our HyPer-db.com and Umbra-db.com web interfaces for teaching the SQL query language and logical optimization techniques. This is not only beneficial for the students but also helps us debug these systems. Furthermore, my colleague Thomas Neumann offers a practical course on database system engineering in which students build a scaled-down system along the lines of the Umbra database architecture. Graduates of this systems-centered course are highly sought after by the various database companies that have established development branches in Germany.
What was a key innovation in the development of HyPer, and how will Umbra expand on that?
HyPer was one of the earliest so-called HTAP systems, combining high-performance analytical and transactional processing on one database state. This ensures that analytical queries always reflect the most recent state of the dynamic transactional database, and therefore the current state of the modeled world. Such real-time analytics is the basis for the so-called real-world awareness that is essential for effective decision-making in our fast-moving society, e.g., in economics or environmental monitoring. Our performance results demonstrate that HyPer indeed combines the best of the two worlds: HyPer's OLTP performance is comparable to that of dedicated OLTP engines (like VoltDB or Hekaton), and HyPer's OLAP query response times match those of the best pure OLAP engines (e.g., MonetDB and VectorWise). It should be emphasized that HyPer can match (or beat) these best-of-breed transaction (VoltDB) and query (MonetDB, VectorWise) processing engines at the same time, by performing both workloads in parallel on the same most current database state. This performance evaluation was based on a new business intelligence benchmark we designed (the so-called CH-benCHmark) that combines the transactional workload of TPC-C with the OLAP queries of TPC-H, executed against the same database state. This workload has become a de facto standard benchmark for NewSQL database system developers.
HyPer's excellent performance is due to several design choices:
One, HyPer relies on in-memory data management without the overhead that traditional database systems incur for DBMS-controlled page structures and buffer management. The SQL table definitions are transformed into simple vector-based virtual-memory representations without the indirect addressing schemes imposed by buffer management. These vectors constitute the so-called column store approach, which is particularly beneficial for analytical processing as it makes better use of the storage hierarchy of modern processors, as the sketch below illustrates.
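The layout difference can be shown with a minimal C++ sketch over a hypothetical orders table (the names are invented for illustration; this is not HyPer's code):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical relation: orders(o_id, o_custkey, o_totalprice).

// Row store: one struct per tuple (array of structs).
struct OrderRow { int32_t o_id; int32_t o_custkey; double o_totalprice; };

// Column store: one contiguous vector per attribute (struct of arrays),
// in the spirit of HyPer's vector-based virtual-memory representation.
struct OrdersColumns {
    std::vector<int32_t> o_id;
    std::vector<int32_t> o_custkey;
    std::vector<double>  o_totalprice;
};

// SELECT SUM(o_totalprice) FROM orders: the scan touches only the
// o_totalprice column, so every cache line carries useful data and
// the loop is trivially vectorizable.
double sum_totalprice(const OrdersColumns& orders) {
    double sum = 0.0;
    for (double price : orders.o_totalprice) sum += price;
    return sum;
}
```

With the row layout, the same scan would drag the unused o_id and o_custkey bytes through the cache as well.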
Two, HyPer pioneered the concept of query compilation, whereby declarative SQL queries are translated into machine code. Compared to traditional interpretation-based evaluation approaches, this makes far better use of the underlying high-performance hardware; the sketch below contrasts the two styles.
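The contrast can be approximated in plain C++, where template instantiation stands in for the machine-code generation HyPer performs at query time (the predicate and names are illustrative assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Interpreted evaluation: a predicate such as "x > 42" is walked as an
// expression tree, paying an indirect (virtual) call for every tuple.
struct Expr {
    virtual ~Expr() = default;
    virtual bool eval(int64_t x) const = 0;
};
struct GreaterThan : Expr {
    int64_t c;
    explicit GreaterThan(int64_t c) : c(c) {}
    bool eval(int64_t x) const override { return x > c; }
};

size_t count_interpreted(const std::vector<int64_t>& col, const Expr& pred) {
    size_t n = 0;
    for (int64_t x : col) n += pred.eval(x);  // virtual call per tuple
    return n;
}

// Compiled evaluation: the predicate is baked into the scan loop, so the
// compiler emits one tight, inlinable loop. HyPer achieves this effect at
// runtime by generating machine code for each SQL query.
template <typename Pred>
size_t count_compiled(const std::vector<int64_t>& col, Pred pred) {
    size_t n = 0;
    for (int64_t x : col) n += pred(x);  // inlined, vectorizable
    return n;
}

// Usage: count_interpreted(col, GreaterThan{42});
//        count_compiled(col, [](int64_t x) { return x > 42; });
```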
And three, HyPer's query execution is designed to scale to hundreds of cores and takes full advantage of the parallel computing resources in multi-core servers. Our so-called morsel-driven parallelism uses contention-free, skew-resilient parallel algorithms that provide near-perfect scaling for large analytic queries while retaining low latency for short transactional queries; a stripped-down sketch follows.
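The core dispatch idea can be sketched as follows, assuming a simple sum aggregation (an illustration, not HyPer's actual scheduler; the morsel size is an invented constant):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Workers repeatedly grab the next fixed-size morsel via an atomic
// counter: no locks are taken, and faster cores simply claim more
// morsels, which is what makes the scheme skew-resilient.
constexpr size_t kMorselSize = 10'000;  // illustrative value

int64_t parallel_sum(const std::vector<int64_t>& col, unsigned workers) {
    std::atomic<size_t> next{0};    // contention-free work dispenser
    std::atomic<int64_t> total{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            int64_t local = 0;      // aggregate locally, publish once
            for (;;) {
                size_t begin = next.fetch_add(kMorselSize);
                if (begin >= col.size()) break;
                size_t end = std::min(begin + kMorselSize, col.size());
                for (size_t i = begin; i < end; ++i) local += col[i];
            }
            total.fetch_add(local);
        });
    }
    for (auto& t : pool) t.join();
    return total.load();
}
```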
The OLAP analytical processing is separated from the mission-critical OLTP transaction processing by relying on lock-free multi-version concurrency control: transactional updates are carried out on new versions of the data objects, while queries can still access older versions that retain each data object's state as of the beginning of the query's execution.
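The underlying visibility rule can be sketched in a few lines of C++ (single-threaded and illustrative only; the real system maintains its version chains with lock-free techniques):

```cpp
#include <cstdint>
#include <vector>

// Every update appends a new version stamped with the committing
// transaction's timestamp; a query that started at `start_ts` reads the
// newest version with commit_ts <= start_ts, so long-running OLAP
// queries never block concurrent OLTP writers.
struct Version { uint64_t commit_ts; int64_t value; };

struct Record {
    std::vector<Version> versions;  // ordered by ascending commit_ts

    void write(uint64_t commit_ts, int64_t value) {
        versions.push_back({commit_ts, value});
    }

    // Snapshot read: scan backwards for the version visible at start_ts.
    // Returns false if no version was committed before the query began.
    bool read(uint64_t start_ts, int64_t& out) const {
        for (auto it = versions.rbegin(); it != versions.rend(); ++it)
            if (it->commit_ts <= start_ts) { out = it->value; return true; }
        return false;
    }
};
```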
After the successful technology transfer of HyPer from an academic project into a commercial system used by millions of users within the Tableau/Salesforce software stack, our TUM database group started the new Umbra project. The commercial version of HyPer is further enhanced by a Salesforce/Tableau development group located in Munich that employs mostly our TUM database graduates.
While HyPer was originally architected as a pure main-memory system, the goal of Umbra was to build an extensible database system with in-memory performance and graceful performance degradation once the working set grows beyond DRAM capacity. This is achieved by a lean buffer management approach that incurs only marginal overhead while data objects reside in main memory. In this respect, Umbra is being designed as a cloud-native system for the Big Data era. Furthermore, Umbra was designed as a so-called computational database system whose extensibility allows integrating computationally complex tasks (in particular, machine learning pipelines) directly into the database kernel, thereby avoiding the data-shipping penalty between different systems.
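One way such low-overhead buffering can be realized, in the spirit of the pointer-swizzling technique published for Umbra and LeanStore (the sketch below is a simplified assumption, not their actual code, and load_page_from_disk is a hypothetical stand-in for real I/O):

```cpp
#include <cstdint>

// A "swip" packs either a raw pointer to a memory-resident page or a
// tagged on-disk page id into one 64-bit word. While the page is hot,
// dereferencing costs a single branch plus a pointer chase; only the
// cold path pays full buffer-manager overhead.
struct Page { uint64_t id; /* ... page contents ... */ };

// Hypothetical stand-in for the real read-from-disk routine.
Page* load_page_from_disk(uint64_t page_id) { return new Page{page_id}; }

struct Swip {
    static constexpr uint64_t kColdBit = 1ull << 63;
    uint64_t word;  // pointer if hot, (kColdBit | page_id) if cold

    bool is_hot() const { return (word & kColdBit) == 0; }

    Page* resolve() {
        if (is_hot())  // hot path: near-zero overhead
            return reinterpret_cast<Page*>(word);
        Page* p = load_page_from_disk(word & ~kColdBit);   // cold path
        word = reinterpret_cast<uint64_t>(p);              // swizzle in place
        return p;
    }
};
```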
In a May 2023 article for Communications of the ACM, you extolled the virtues of FoundationDB—a free and open-source multi-model distributed NoSQL database developed by Apple Inc. You wrote that the design of FoundationDB lays the “foundation for future cloud information systems that must balance performance and consistency requirements.” Why are you hopeful that database systems will be able to effectively manage “Internet scalability”?
The explosive growth of data in the Big Data era has made data management a very challenging problem. The volume, the variety, and the velocity of accumulated data have grown dramatically, even exponentially. Internet scalability imposes two challenges: not only storing and selectively retrieving the data, but also processing the Big Data interactively. Fortunately, recent hardware development has come to the rescue of large-scale database development. In particular, the emergence of large multi-core, multi-socket systems with massive main memory capacities provides new opportunities for data processing and combines unprecedented computing power into one machine. However, exploiting the radically new hardware landscape for data management requires a fundamental re-design of established database technology from the ground up. Big Data management systems have to be scalable in two dimensions: scaling up, by exploiting the vast multi-core power of modern processors, and scaling out, by distributing data and processing across a cluster of machines once the working set grows beyond the capacity of a single server.
Within our cooperative work group at TUM, we want to address these data processing challenges in the context of integrated computation. The overall vision is to derive interactive insights into Big Data by combining database technology with complex analysis and computation. Besides meeting challenging requirements from the application domains, we aim to offer seamless integration of data mining and data exploration functionality, allowing for a multi-scale analysis pipeline in which extensional data is combined as needed with computations to derive new insights. We achieve this goal by automatically compiling database query fragments and application logic into unified data processing pipelines that are then parallelized and distributed across the available computing cores in order to maximize hardware utilization. Maximizing data locality and minimizing synchronization overhead are two of the main challenges here. This development will enable domain specialists to interactively derive insights from their Big Data sets.
Given the success of your textbook and current trends, what additional material might you include in future editions of Datenbanksysteme?
I teamed up with my colleague Viktor Leis to work on the next (11th) edition of this textbook. We will particularly enhance the contents on cloud database technology. Focus areas are data formats for data lakes, such as Parquet and Iceberg, as well as log-structured merge trees, disaggregated database architectures, and distributed synchronization methods, such as Raft. Obviously, we will also describe in detail our newly developed systems, Umbra and LeanStore. Machine learning workflows will be addressed by introducing the essential data processing features of Big Data systems like Spark and Flink. Database systems play an increasingly important role in customizing generative AI in the form of Vector Databases that efficiently support similarity search over feature vectors generated by Large Language Models.
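To give a sense of the core operation a vector database accelerates, here is an illustrative C++ sketch of exact cosine-similarity search; real systems replace this brute-force loop with approximate indexes (e.g., HNSW) to scale to millions of vectors:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Cosine similarity between two embedding vectors of equal dimension.
float cosine(const Vec& a, const Vec& b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Exact nearest neighbor in O(n * d): compare the query against every
// stored vector and keep the most similar one.
size_t most_similar(const std::vector<Vec>& store, const Vec& query) {
    size_t best = 0;
    float best_sim = -1.0f;
    for (size_t i = 0; i < store.size(); ++i) {
        float s = cosine(store[i], query);
        if (s > best_sim) { best_sim = s; best = i; }
    }
    return best;
}
```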
Alfons Kemper is a Professor of Database Systems and Head of the Department of Computer Science at the Technical University of Munich, as well as Principal Technical Advisor at Tableau/Salesforce. His research interests focus on advanced, scalable database and data exploration systems. Together with his colleague Thomas Neumann, Kemper led the development of the innovative NewSQL database system HyPer, a main-memory database system that was acquired by Tableau Software. Neumann and Kemper's group is currently working on its successor, Umbra.
Kemper's textbook on database systems, now in its 10th edition, is used at universities and colleges throughout Germany, Austria, and Switzerland. He has received two Best Paper Awards and a Ten-Year Influential Paper Award from the IEEE International Conference on Data Engineering (ICDE). Kemper was named an ACM Fellow for contributions to database management system technology.