People of ACM - Felix Naumann
January 4, 2022
What is the main focus of your research right now?
Two interrelated topics drive my current research: data quality and data preparation. With the advent of machine learning, research and industry have recognized that success depends not only on the design and size of the trained model, but to a large degree also on the quality of the training and test data. We want to measure (and improve) data quality along many dimensions, including traditional ones, such as accuracy and completeness, novel ones, such as diversity and bias, and also less tangible dimensions, such as liability and explainability, which reach beyond computer science into law and ethics.
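As a small, hypothetical illustration of what measuring a single quality dimension can look like, the following Python sketch computes completeness as the share of non-missing cells in a table; the record layout and values are invented for this example and are not from the interview.

```python
# Hypothetical sketch: measuring one data quality dimension (completeness)
# as the fraction of non-missing cells in a table of records.
from typing import Any, Dict, List

def completeness(records: List[Dict[str, Any]], columns: List[str]) -> float:
    """Return the share of cells that are neither None nor empty strings."""
    total = len(records) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for row in records
        for col in columns
        if row.get(col) not in (None, "")
    )
    return filled / total

customers = [
    {"name": "Ada Lovelace", "city": "London", "email": None},
    {"name": "Alan Turing", "city": "", "email": "alan@example.org"},
]
print(completeness(customers, ["name", "city", "email"]))  # 4 of 6 cells filled
```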
To make the work of data scientists easier, we are also exploring one particular aspect of data quality, namely the ability to even load and ingest raw data into a system. Our research on data preparation is concerned with ill-formed raw data files, which might contain non-data elements, multiple tables, inconsistent formatting, etc. Recognizing and eliminating such problems manually is often easy but quite tedious. Yet, automatically performing such data preparation is surprisingly challenging.
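To give a flavor of the kind of heuristic such preparation might involve (this is an illustrative sketch, not the group's actual tooling), the following snippet guesses where the data table begins in a raw file that starts with non-data preamble lines; the file content and delimiter are assumptions for the example.

```python
# Hypothetical sketch: guessing where the data table starts in a raw CSV
# file whose first lines are non-data preamble.
import csv
import io

RAW = """Report generated 2021-12-01
Contact: office@example.org

id;name;city
1;Ada;London
2;Alan;Manchester
"""

def guess_table_start(lines, delimiter=";"):
    """Return the index of the first line whose field count matches the
    field count of the following line: a crude header heuristic."""
    counts = [line.count(delimiter) for line in lines]
    for i in range(len(lines) - 1):
        if counts[i] > 0 and counts[i] == counts[i + 1]:
            return i
    return 0

lines = RAW.splitlines()
start = guess_table_start(lines)
reader = csv.DictReader(io.StringIO("\n".join(lines[start:])), delimiter=";")
for row in reader:
    print(row)  # the preamble lines are skipped, only data rows are parsed
```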
One of your most downloaded papers from the ACM Digital Library is the 2008 article “Data Fusion,” which you co-authored with Jens Bleiholder. Will you briefly explain what you mean by the term "data fusion"? Have developments over the past 12 years (e.g., the availability of data, specific data management innovations) in any way altered your perspectives on data fusion (or data integration more broadly)?
Data fusion, in the context of data cleaning, refers to the merging of so-called duplicates, i.e., different records that refer to the same real-world object. Such duplicates appear during data integration, for instance when integrating two customer databases. Data fusion must decide which of the alternative values is the correct one to keep in the merged record. For instance, if two customer records show different addresses, one could choose the more recent one.
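A minimal sketch of one such conflict-resolution rule, assuming duplicate records stored as Python dictionaries with an invented last_updated field, could look like this:

```python
# Hypothetical sketch of a simple conflict-resolution rule for data fusion:
# when two duplicate records disagree, keep the value from the more recently
# updated record (field names and timestamps are made up for illustration).
from datetime import date

record_a = {"name": "J. Smith", "address": "12 Oak St",
            "last_updated": date(2019, 5, 2)}
record_b = {"name": "John Smith", "address": "7 Elm Ave",
            "last_updated": date(2021, 11, 14)}

def fuse(a, b, attributes):
    """Merge two duplicate records attribute by attribute."""
    newer, older = (a, b) if a["last_updated"] >= b["last_updated"] else (b, a)
    fused = {}
    for attr in attributes:
        # Prefer the newer record's value; fall back to the older one if missing.
        fused[attr] = newer.get(attr) or older.get(attr)
    return fused

print(fuse(record_a, record_b, ["name", "address"]))
# {'name': 'John Smith', 'address': '7 Elm Ave'}
```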
Since writing our survey, data fusion has seen many advances, for instance under the name “truth discovery.” Luna Dong and Divesh Srivastava have introduced the notion of “copying relationships,” which helps avoid choosing a majority value that was simply copied more often than the others. Fully automatic (and correct) data fusion is still out of reach, as are most data cleaning challenges that must decide upon semantics based on syntactic information.
In the 2015 paper “Profiling relational data: a survey,” which you co-authored with Ziawasch Abedjan and Lukasz Golab, you review current data-profiling tools and touch on the direction in which data profiling is heading. Why is data profiling important and how might data profiling tools look different in five to 10 years?
Data profiling is the act of discovering metadata in a given dataset. Particularly difficult data profiling tasks include the discovery of key candidates, inclusion dependencies, functional dependencies, or denial constraints. Knowing such metadata serves multiple purposes, such as data cleaning, data integration, or query optimization. After much research by the community to design efficient discovery algorithms, an important direction is now to classify discovered metadata as genuine (true) or as spurious (coincidental)—again, a challenge that bridges the syntactic and the semantic worlds.
Further challenges are the discovery of approximate dependencies, which allow for some violations in the data, and the incremental discovery of metadata as the data changes. In 10 years, such complex metadata will be part of the common canon, just like traditional metadata such as data types and cardinalities.
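To make the notions of exact and approximate functional dependencies concrete, here is a small illustrative sketch; the toy table and the 10% violation threshold are assumptions for this example, not part of the interview.

```python
# Hypothetical sketch: checking whether a candidate functional dependency
# lhs -> rhs holds exactly or approximately (tolerating a few violating rows).
from collections import defaultdict

rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Brelin"},   # a typo that violates zip -> city
]

def fd_violations(rows, lhs, rhs):
    """Count rows that would have to be removed for lhs -> rhs to hold."""
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # For each lhs value, every row not carrying the majority rhs value violates.
    return sum(sum(counts.values()) - max(counts.values())
               for counts in groups.values())

violations = fd_violations(rows, "zip", "city")
print(violations)                          # 1
print("exact FD holds:", violations == 0)  # False
print("approximate FD (<=10% violations):",
      violations / len(rows) <= 0.10)      # False, since 25% of rows violate
```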
What is another innovation in your field that will be especially important in the coming years?
For a few years now, we have been exploring a new dimension of data: their change behavior. Data (and metadata) experience many kinds of change: values are inserted, deleted or updated; rows appear and disappear; columns are added or repurposed, etc. In such a dynamic situation, users might have many questions about changes in the dataset, for instance: How many changes have there been in the past minutes, days or years? What kinds of changes were made, and when? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has recently been changed may be the target of vandalism; and so on. Moreover, observations about past changes allow us to predict future changes and to alert users when those changes do not occur at all or when they occur prematurely. In general, regarding data not as they are, but as artifacts with a change history leading up to their current values, provides a new perspective for many use cases.
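As an illustration of this kind of change exploration, the following sketch counts recent updates per attribute and flags frequently changed values; the change-log format, entity names, and threshold are invented for the example.

```python
# Hypothetical sketch: exploring a dataset's change history by counting how
# often each attribute of each entity was updated within a recent time window.
from datetime import datetime, timedelta

change_log = [
    ("2021-12-01 09:00", "city_42", "name", "Mumbai"),
    ("2021-12-01 09:05", "city_42", "name", "Bombay"),
    ("2021-12-01 09:20", "city_42", "name", "Mumbai"),
    ("2021-11-15 10:00", "city_17", "population", "3644826"),
]

def frequently_changed(log, since, threshold=2):
    """Return (entity, attribute) pairs updated at least `threshold` times
    since the given timestamp (candidates for controversial values)."""
    counts = {}
    for ts, entity, attribute, _value in log:
        if datetime.strptime(ts, "%Y-%m-%d %H:%M") >= since:
            key = (entity, attribute)
            counts[key] = counts.get(key, 0) + 1
    return [key for key, n in counts.items() if n >= threshold]

recent = datetime(2021, 12, 1) - timedelta(days=7)
print(frequently_changed(change_log, since=recent))
# [('city_42', 'name')]  <- the repeatedly edited city name
```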
Felix Naumann is a Professor of Information Systems at the University of Potsdam’s Hasso Plattner Institute. His research interests include data quality, data cleaning, data analysis, data integration and data profiling.
Naumann has been both an Associate Editor and a Senior Associate Editor of the ACM Journal of Data and Information Quality (JDIQ). He is presently an editor of Datenbank-Spektrum. Naumann has held several leadership roles at data management conferences, including serving as Tutorial Co-chair for ACM SIGMOD/PODS 2020 and Co-Program Committee Chair for VLDB 2021. Naumann was named an ACM Distinguished Member for outstanding engineering contributions to computing.