People of ACM - Jayant Haritsa
February 12, 2019
You have said that a solid understanding of mathematics and/or physics is essential for those working in database systems. What mathematical training in your background was especially helpful to you as you began working in database systems?
My comment was made in the context of database performance modeling, where we often find that students blindly trust the computer rather than basic physical principles. As a case in point, numbers are often reported to a large number of decimal places just because the computer outputs them, although the measuring instrument itself does not have this accuracy. The fallacy here is obvious to physicists and engineers, but often eludes those with a pure CS background! So, although it may seem surprising, I recommend to young students interested in system performance that they first take up engineering (where the laws of nature are known and respected) at the undergraduate level, and then specialize in computer science in graduate school after they have internalized these principles.
For research in database performance, a good knowledge of probability, statistics, queueing networks, random processes, sampling and convex optimization is extremely beneficial toward a strong understanding of the domain. The training I received in my electrical engineering undergraduate program, where quite a few of these topics were covered in the signal processing and coding courses, proved to be especially helpful in my subsequent career.
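The point about decimal places can be made concrete with a small sketch. It is an illustration only, not anything taken from the interview: the function name `report`, the sample readings, and the rule of keeping only as many decimal places as the instrument resolution supports are all assumptions chosen for simplicity.

```python
import math
import statistics

def report(samples_ms, resolution_ms):
    """Format the mean of the samples, keeping only the decimal places
    that the (assumed) instrument resolution can justify."""
    mean = statistics.mean(samples_ms)
    # Decimal places implied by the resolution: 1 ms -> 0, 0.01 ms -> 2.
    places = max(0, -math.floor(math.log10(resolution_ms)))
    return f"{mean:.{places}f} ms"

# Hypothetical response times read from a timer with 1 ms resolution.
samples = [12.0, 13.0, 12.0, 14.0]
print(statistics.mean(samples))  # the raw number the computer prints
print(report(samples, 1.0))      # the number worth reporting
```

A more careful treatment would also attach a confidence interval to the mean; the sketch only shows that the number of reported digits should follow from the instrument rather than from the print routine.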
At IISc’s Database Systems Lab, you developed the popular Picasso database query optimizer visualizer software. What prompted you to develop Picasso?
At the Very Large Data Bases (VLDB) 2002 conference, we presented what was one of the earliest applications of machine learning techniques (specifically, clustering) to database query optimization. This project was motivated by the observation that the optimization of declarative Structured Query Language (SQL) queries to produce efficient query execution plans is a computationally intensive process, especially for decision support queries. To amortize this cost, we presented a tool called Plastic (Plan Selection through Incremental Clustering) that attempts to reuse the plans generated for prior queries through the clustering process. For this scheme to work, an underlying expectation is that large regions of the “plan diagram” (which is a visual profile of the optimal plan choices over the feature space) are covered by a small set of plans. That is, the diagram exhibits an 80/20 type of skew, and this behavior was manually verified for some representative queries.
While the community welcomed the Plastic approach, they also raised the legitimate question of whether the skew was commonplace or an artifact of the few queries evaluated in the study. To establish that this behavior was a generic phenomenon, we created the Picasso tool, which automatically generates plan diagrams on all the major database engines. Picasso did confirm that the skew was routinely prevalent in plan diagrams. However, it did something more, which was surprising: it highlighted chaos in these diagrams, with a large number of plans covering the space in intricate patterns. In fact, the plan diagrams often looked like Cubist paintings, and hence the name of the tool!
So, what was originally intended only as a “proof mechanism” suddenly morphed into a rich vein of research problems: for instance, we were able to show that the Cubist diagrams could be efficiently reduced to “anorexic” pictures, featuring only a few plans, without materially affecting the query processing quality. The anorexic reduction property has several useful practical applications in database engines, including improving the robustness of execution plan choices.
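The 80/20 type of skew described above can be checked mechanically once a plan diagram is available as a grid of plan identifiers. The following is a minimal sketch under that assumption; the function name `plans_covering`, the grid representation, and the toy diagram are hypothetical and are not drawn from Picasso itself.

```python
from collections import Counter

def plans_covering(diagram, fraction=0.8):
    """Return (plans needed, distinct plans): the smallest number of plans,
    taken largest first, whose cells cover `fraction` of the diagram."""
    counts = Counter(plan for row in diagram for plan in row)
    total = sum(counts.values())
    covered = 0
    for needed, (_, cells) in enumerate(counts.most_common(), start=1):
        covered += cells
        if covered >= fraction * total:
            return needed, len(counts)
    return len(counts), len(counts)

# Hypothetical 5x5 plan diagram: each cell holds the id of the plan
# the optimizer chose at that point of the feature space.
diagram = [
    ["P1", "P1", "P1", "P1", "P2"],
    ["P1", "P1", "P1", "P1", "P2"],
    ["P1", "P1", "P1", "P2", "P2"],
    ["P1", "P1", "P1", "P3", "P4"],
    ["P1", "P1", "P1", "P5", "P6"],
]
print(plans_covering(diagram))  # (2, 6): two of six plans cover 80% of the cells
```

In this toy grid, the four plans that each occupy a single cell are the kind of fine-grained detail that makes a diagram look intricate without changing where most of the area lies.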
What is an exciting avenue of research in database systems that will become more prominent in the coming years?
Big Data has become the buzzword of choice in recent times, especially in the software industry. The accompanying hoopla has spawned frenetic claims foretelling the development of great and wondrous solutions to Big Data challenges. However, very little is said about testing, an essential prerequisite for deployment. This is surprising given the countless disaster stories about large-scale data management systems from all over the world: for instance, the Obama healthcare web portal and Department of Defense electronic records program in the US, the e-borders security scheme in the UK, and the Flipkart Big Billion sale in India. Looking into the future, where system design will be increasingly automated and data-driven thanks to the ongoing AI/ML revolution, it is absolutely imperative that principled and effective testing methodologies are created to evaluate these systems before they are released. Therefore, I expect that database testing, which has been largely relegated to the fringes, will soon come to occupy a prominent position in the computer science curriculum.
In our own lab, we have developed CODD, a graphical tool that takes a first step toward the effective testing of Big Data deployments through a new and distinctive metaphor of data-less databases, where data are only simulated and never persistently generated or stored. This tool is already in use at major telecom and software companies.
Do you have any concerns about the current focus of database research?
Over the past two decades, the research focus of the database community has largely shifted from the database engine to middleware, encompassing areas such as data mining, data warehousing, information retrieval, document processing, knowledge management and bioinformatics. In these domains, the database engine is essentially viewed as a black box that merely functions as an efficient data supplier to the middleware. Certainly the importance and impact of middleware topics cannot be disputed. However, what is worrisome is that the movement away from database engine topics has begun to assume alarming proportions, placing in jeopardy the long-term future of database systems.
There are a multitude of causes for the downslide in database engine research, but a particularly pernicious reason is the widespread, and utterly wrong, perception that engine design is essentially a “finished art.” That is, all the major problems have already been solved, leaving little scope to make meaningful new contributions. However, nothing could be further off the mark. While engine issues are certainly classical, they are equally amenable to fresh investigation, either because: (a) the underlying platforms are changing, invalidating long-standing design assumptions in the process (for example, the advent of phase change memory with its asymmetric costs for reads and writes); or (b) novel design and analysis techniques have appeared on the scene, delivering new perspectives and solution methodologies (such as geometric strategies for robust query processing with guarantees). Therefore, I urge students to explore these options with an open mind, and we will hopefully soon see the day when database engines reclaim their former position of eminence in the research mainstream.
What advice would you offer a younger colleague just starting out in the database field?
In academia today, there is an unfortunate trend toward fast delta publishing, rather than toward carrying out due diligence on the challenges that really matter. This concern has also been foregrounded by the database doyen and ACM Turing Awardee Michael Stonebraker in his recent talk “My Top Ten Fears about the Database Field,” where he complains in colorful language about the “diarrhea of papers!” Note the ironic situation where Frederick Sanger, who won two Nobel Prizes, has only around 70 papers to his name, whereas run-of-the-mill researchers with several hundred papers are nowadays commonplace.
Therefore, my advice to a junior colleague would be the following: if you are writing more than a couple of papers a year, you are shortchanging your intellectual abilities and not doing justice to the problem domain. So, rather than generating a sterile assembly line of marginal publications, concentrate instead on creating a limited portfolio of papers in which you can take genuine pride. In a nutshell, aim to be a “vector researcher” who changes the direction of a field, as opposed to a “scalar researcher” who only incrementally advances the field.
Jayant R. Haritsa is a Senior Professor in the Department of Computational and Data Sciences and the Department of Computer Science and Automation at the Indian Institute of Science (IISc) in Bangalore, India. His core research interests are the design, implementation and testing of database systems. He is known for pioneering contributions to the design and optimization of database engines that form the core of modern enterprise information systems. Haritsa has also produced software tools for SQL query optimization and relational database testing.
Haritsa is the recipient of the Shanti Swarup Bhatnagar Prize for Science and Technology in 2009, as well as the 2014 Infosys Prize in Engineering and Computer Science. A current member of the ACM India Council, he was named an ACM Fellow (2015) for contributions to the theory and practice of data management systems.