People of ACM - Gonzalo Navarro
August 22, 2023
How did you initially become interested in algorithms and data structures?
It has been a long journey! I’m Argentinian and started my undergrad at the National University of La Plata. That was in the 80s—Computer Science degree programs were just taking off in Latin America and university resources were limited (we had a few computers for hundreds of students). Still, the program was solid and well thought out, and I’m thankful for what it gave me. As still happens in many places in Latin America, however, the curriculum in algorithms and data structures was very basic. After the third year I moved to ESLAI, a leading-edge undergraduate Computer Science education project in Argentina that selected a few dozen students per year and gave them full support with fellowships, equipment, housing, and most importantly, international professors. I had my first serious algorithms course with Giorgio Ausiello (still working today at the University of Rome La Sapienza), based on the book The Design and Analysis of Computer Algorithms by Aho, Hopcroft, and Ullman, and it was like first love—you never forget it. A second course was given by Jorge Olivos, a brilliant professor from the University of Chile who was on sabbatical in Argentina, and who ultimately defined where I live and work today.
After I graduated and went into the job market, I realized how boring it was when you were not learning all the time. It was the 90s, and I guess working today at companies like Google must be much more exciting. I then contacted Jorge Olivos for advice, and he convinced me to go to Chile to do an MSc. He was semi-retired, so he connected me with Ricardo Baeza-Yates, who was returning from his PhD work at the University of Waterloo with a new, beautiful topic that became my second love: text searching. I still remember my fascination with the elegance and beauty of the suffix array data structure. I did my MSc and PhD with Ricardo Baeza-Yates on structured text retrieval and approximate string matching. The combination of text searching with text compression—still my other main area of interest today—came with Nivio Ziviani, a colleague of Ricardo from the Federal University of Minas Gerais, Brazil, and another leading actor in the development of web search engines in Latin America. My current interest generalizes those experiences and spans the whole area of compact data structures, a cross between data structures and information theory.
Your paper “On Compressing and Indexing Repetitive Sequences” (co-authored with Sebastian Kreft) was included in a compendium of top-cited articles in theoretical computer science. What challenge were you seeking to address with this paper, and how did your introduction of LZ-End solve it?
The conference versions of this paper date back to 2010, when the term “data deluge” was being heard everywhere. I was starting to realize that much of the fastest-growing text data was low in information content, as the text collections were highly repetitive, and this gave us a chance to curb the data deluge in those cases. Think of collections of genomes or reads of the same species in bioinformatics, or versioned document collections like Wikipedia, GitHub, or the Software Heritage project. Today, a decade later, with projects to sequence a million human genomes and the promise of personalized medicine on the horizon, the problem of indexing huge repetitive collections is receiving a lot of attention—from practice in bioinformatics labs to theory in the study of repetitiveness and how it shows up in text indexing data structures.
By that time, the concept of self-indexes, which are compressed representations of texts that can be accessed and efficiently searched without ever decompressing them, had existed for a decade in the form of compressed suffix arrays. However, these self-indexes focused on statistical compression, used in a form that is blind to repetitiveness. Other compression models, such as dictionary-based ones, did capture repetitiveness, but accessing the text in compressed form was much more challenging and it was difficult to search it without decompressing. We had already obtained good results with techniques like grammar-based compression or run-length compressed suffix arrays, but not with the big prize: the original Ziv-Lempel parsing of 1976—the basis of compressors like gzip and p7zip—provided the best compression on repetitive sequences, and it was the most challenging model to use.
Sebastian Kreft was an outstanding MSc student and took on this challenge. We figured out how to adapt old ideas of Juha Kärkkäinen and Esko Ukkonen (whom I knew from my postdoc at the University of Helsinki in 1999) to reduce text searches to geometric queries and permutations, and to efficiently access the compressed text, inventing the LZ-End variant of the Ziv-Lempel parsing. Our compressed text index still offers the best space on repetitive collections, with competitive search times.
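To give a rough sense of the phrase structure behind LZ-End, here is a minimal Python sketch (not the paper’s construction or index) that parses a string so that each phrase copies a substring ending exactly at a previous phrase boundary and then appends one explicit character. It runs in quadratic time and uses a simplified encoding of our own (source end, copy length, explicit character), purely for illustration.

    def lz_end_parse(text):
        # Naive LZ-End-style parse, for illustration only (quadratic time).
        # Each phrase copies a substring that ends exactly at a previous
        # phrase boundary, then appends one explicit character.
        phrases = []          # (source_end, copy_length, explicit_char)
        boundaries = []       # end positions (exclusive) of phrases so far
        i, n = 0, len(text)
        while i < n:
            best_len, best_end = 0, 0
            for b in boundaries:
                max_len = min(b, n - i - 1)   # keep one char to emit explicitly
                for l in range(max_len, best_len, -1):
                    if text[b - l:b] == text[i:i + l]:
                        best_len, best_end = l, b
                        break
            phrases.append((best_end, best_len, text[i + best_len]))
            i += best_len + 1
            boundaries.append(i)
        return phrases

    def lz_end_decode(phrases):
        # Rebuild the text: copy from the already decoded prefix, then
        # append the explicit character of each phrase.
        out = []
        for src_end, copy_len, c in phrases:
            out.extend(out[src_end - copy_len:src_end])
            out.append(c)
        return "".join(out)

The restriction that copied sources end at phrase boundaries is what allows fast extraction of text from the compressed representation, at a small loss in compression with respect to the unrestricted Ziv-Lempel parsing.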
Your paper “Colored Range Queries and Document Retrieval” (co-authored with Travis Gagie, Juha Kärkkäinen, and Simon J. Puglisi) is also recognized for making important contributions to text search. What are colored range queries, and what was a key insight of this paper?
Coloring is just a “colorful” name for categorizing objects into classes. You have a set of objects, each with a “color” (its class), and you perform queries that group them by color. In colored range queries you have an array of colors and want to answer questions such as “Which colors appear in this range?” “Which are the most frequent ones?” and so on. Abstract problems of this kind are studied in theoretical computer science, and their solutions can then be applied to a wide range of situations.
In this paper we gave new solutions to those problems and then showed how they could be applied to document retrieval. This area extends, in a sense, the classic information retrieval discipline that works on natural language texts, so as to apply it to general sequence collections. It then answers queries such as “Which sequences contain this short pattern?” “Which are those where it appears most often?” and so on, which make sense in areas like bioinformatics (e.g., find genes where some DNA marker appears often), data mining (e.g., find trending topics in a social network during some time period), etc.
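As a small illustration of the flavor of these problems (not of the data structures in the paper), a classic way to report the distinct colors in a range uses an array of previous occurrences: a color appears in C[l..r] exactly at one position whose previous occurrence falls before l. In the Python sketch below, a linear scan stands in for the geometric range query that real solutions use; in document retrieval, the colors are document identifiers attached to suffix array positions.

    def build_prev(colors):
        # prev[i] = last position j < i holding the same color, or -1 if none.
        last, prev = {}, []
        for i, c in enumerate(colors):
            prev.append(last.get(c, -1))
            last[c] = i
        return prev

    def distinct_colors(colors, prev, l, r):
        # A color occurs in colors[l..r] iff some position i in [l, r] has
        # prev[i] < l, and that position is unique per color, so each color
        # is reported exactly once. A real index answers “which i in [l, r]
        # have prev[i] < l” with a geometric range query instead of a scan.
        return [colors[i] for i in range(l, r + 1) if prev[i] < l]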
Is there some recent research you would like to highlight?
I would like to refer to the paper “Worst-Case Optimal Graph Joins in Almost No Space,” which appeared in ACM SIGMOD 2021, because it belongs to the area of graph databases, where I have recently started to work in the context of my participation in the IMFD in Chile. For me, it meant the discovery of a new area with incredibly rich algorithmic content where the data deluge is putting pressure on the amount of space used by efficient indices. The best indices require multiplying the data space to offer different views of the same objects, and thus are impractical on the current volumes of data.
This is the perfect scenario in which to apply compact data structures. Classic structures are highly redundant, and compact data structures aim to offer the same functionality while using space close to the entropy of the data—its true information content—thereby removing the redundancy. We managed to develop new indices that use almost zero space on top of the plain data size, and even compress the data, while supporting the database operations competitively with, and sometimes faster than, current systems that use an order of magnitude more space. I learned a lot in the process by entering a new area of computer science, and my colleagues learned a lot in terms of sophisticated data structures. As a colleague put it, “it’s as if you had brought an alien technology.” It has been an immensely satisfying adventure that is just starting.
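For a concrete sense of what “space close to the entropy of the data” means, here is a minimal Python sketch (not from the interview or the paper) computing the zeroth-order empirical entropy H0, the most basic of these measures: a compact data structure over a sequence of n symbols aims at roughly n·H0 bits, plus lower-order terms, while still answering its queries.

    from collections import Counter
    from math import log2

    def empirical_entropy(seq):
        # Zeroth-order empirical entropy, in bits per symbol:
        #   H0(S) = sum over symbols c of (n_c / n) * log2(n / n_c),
        # where n_c is the number of occurrences of c in S and n = |S|.
        n = len(seq)
        return sum((k / n) * log2(n / k) for k in Counter(seq).values())

    # Example: a skewed sequence has low entropy, so it is compressible.
    print(empirical_entropy("abracadabra"))   # about 2.04 bits per symbol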
You have been prolific in terms of authorship. What advice would you offer a younger colleague in terms of work habits for a productive career?
The most important advice is: work with other people! For me at least, the difference between finding and developing ideas alone versus doing it while chatting and meeting with colleagues is enormous. Even in the cases where the ideas are originally mine, they emerge and flow much better in inspiring conversations than when they are caged in my head. An innocent question, a connection with an idea from some other area your partner knows better, or a key observation on something that was in front of your nose which you hadn’t seen can make all the difference. At other times, when you have only a limited sense of how to tackle a problem, bouncing the idea around with others helps you better understand the obstacles and then improve your approach. Go to conferences and meet people, do extended stays in other places and learn how other teams work. You will find this much more rewarding than working alone.
My second piece of advice is: choose your problems well. Sometimes you find interesting ideas lost in mediocre publications because the problems are not so relevant. A more experienced colleague, whom you can judge by the impact of her/his publications, can guide you to problems on which your talent is well spent. Still, try to work on what you really enjoy, as it is much easier to be good at things that thrill you. And your life will be much more colorful!
Gonzalo Navarro is a Professor at the University of Chile. His main research interests are in the design and analysis of algorithms and data structures. He has made important contributions in areas including compressed data structures, text search, graph databases, information retrieval, and metric databases. Navarro currently serves as Editor-in-Chief of the ACM Journal of Experimental Algorithmics (JEA) and is an Associate Editor of ACM Transactions on Algorithms (TALG). He also currently participates in Chile’s Center for Biotechnology and Bioengineering (CeBiB) and the Millennium Institute for Foundational Research on Data (IMFD).
Navarro has co-authored two books (published by Cambridge University Press), nearly 200 journal articles, and more than 270 conference papers. He has received seven Best Paper Awards and four Google Research Awards. Navarro was named an ACM Fellow for theoretical and practical contributions to the fields of text searching and compact data structures.