People of ACM - Claudia Bauzer Medeiros
September 4, 2018
How did you initially become interested in eScience?
eScience is characterized by joint research involving computer scientists and researchers from other fields, in which there are research contributions to both computer science and the other field(s). In the mid-1990s, I started working with geoscientists and discovered the wonderful world of geo-referenced data. Until then, I was a database researcher who worked with computer scientists. Geographic data opens many intriguing avenues of exploration and introduces all kinds of CS research challenges. Most geo applications require combining very diverse data sources, with all kinds of problems concerning data quality. Nowadays, geo-related research requires dealing with big data and data streams, as well as research on sensor networks, time series management, data mining, and new architectures.
Such applications also involve combining knowledge from distinct fields, and therefore interdisciplinary teams, in which each member sees the world through different lenses and requirements. This scenario, in turn, forces us as computer scientists to learn how these scientists work, and specifically how they collect and process data. Query processing is no longer “just” a matter of optimization, but also of catering to all these needs. Another interesting aspect of this work is learning to speak and understand all kinds of vocabularies. Working with geoscientists makes you understand the importance of provenance, temporal frames, and other concepts that I have since also encountered while working with social scientists, doctors, and chemists, for example. Through eScience research, we get to see the world through many different perspectives, while also advancing computer science.
Later in my career, I worked with many kinds of scientists who were looking for someone who could handle data but also understood geo-related issues. This led to my leading two big projects, in agriculture and in biodiversity. The latter allowed me to design a very heterogeneous database that stored data from field trips of more than 300 teams of biologists working in São Paulo, Brazil, covering a variety of living beings: the species database of the FAPESP BIOTA research program. To understand what was needed, I talked to groups working on insects, mammals, reptiles, plant species, marine biology, and plant-insect interactions, to name but a few. I asked questions such as, “How do you collect your data?”, “How do you store it?”, “How do you classify the species?”, “What happens when you misidentify a specimen?”, “How do you design your data collection methodology?”, “How do you choose your samples, spatial units, temporal units?” and so on. And then, of course, there was the question of how to process queries that combine information on species with information collected on their physical environment.
For over a year, we participated in long discussions on naming things, curating data, preserving items (such as a leaf or an insect), taking photos, recording animal vocalizations, labelling jars, and entering all this information into a database. Each time I talked with a different group, I discovered yet another facet that had to be considered if data were to be adequately managed for subsequent queries.
For the past 10 years, I have also often served as “matchmaker” between computer scientists and scientists from other fields. This has given me valuable insights into how researchers handle the research data lifecycle. Even if I did not work with them, I had the chance to get to know some needs of linguists, anthropologists, astronomers, historians, nutrition experts, and even those working in religious studies.
Last but not least, through my work as a member of the coordination of the São Paulo Research Foundation’s (FAPESP) research program on eScience and Data Science, I have seen projects submitted on themes that require interdisciplinary evaluation, very often from experts in three or four distinct domains, one of which is always computer science.
What is the most significant way in which our ability to collect different kinds of data (satellite images, sensor data, etc.) has impacted the design and development of scientific databases?
The technological evolution of data collecting devices is part of the so-called “big data” phenomenon. The term “big” is misleading, in the sense that it implies volume, whereas there are many other issues associated with processing and managing big data. Our capacity for collecting huge volumes of scientific data, at ever-faster rates, has certainly impacted the design and development of scientific databases, including hardware solutions; new data structures are being proposed to support data storage and retrieval, and novel algorithms for efficient querying and mining. Data heterogeneity is yet another factor to be considered. Heterogeneity is not just a matter of collecting data, but also of different ways of looking at the world. Since these ways will always change, new kinds of data will always appear, as well as new interpretations of old kinds of data. Heterogeneity will continue to impact the design and development of scientific databases. Volume and heterogeneity, together with distinct quality requirements, bring about challenges in data curation and preservation, both of which are often considered the costlier factors in the scientific data lifecycle. We now have to cope with requirements that were not so prevalent in the past. For example, the design of scientific databases now has to prioritize curation and preservation.
Will you tell us a little about the Wildlife Animal Sound Identification System (WASIS) project you worked on? What are some other interesting applications of collecting sound data that might emerge in the future?
Though a tool is available, we are now looking for more people to continue the work. This initiative began with my work with biologists involved in environmental research. Through this work, some interesting algorithms were developed, helping scientists glean new insights from their sound collection efforts. Recording of the original data started in the 1970s, by Jacques Vieillard, a French biologist who was a lecturer at my university. He unfortunately died in the 1990s but left us a very rich recording collection, which is still growing with donations from scientists. This has now become one of the largest animal vocalization collections in the world, with great scientific and historical value. I am not a biologist, so I would not dare talk about future applications. I can, however, mention some very interesting present applications. For instance, many recordings correspond to endangered species, recorded when they were abundant in a given region. They thus help environmentalists understand the past, and propose preservation initiatives. Some co-located recordings (in space and time) provide insights into species that were found together, but with time (and climate change and human intervention) somehow were separated.
One of my students designed algorithms for data cleaning and filling gaps in recording metadata by combining available metadata with historical weather time series. In computer science, there is a multitude of opportunities for research. For example, pattern matching (of sound waves) needs to be context-sensitive, taking into account, for instance, environmental variables measured at the same time. Some scientists believe that we can apply research on emotion-sensing to animal vocalizations. There is also a need for better sound descriptors, and for involving citizen science in a more effective way.
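To make the idea of filling metadata gaps from weather time series concrete, here is a minimal sketch in Python. It is not the algorithm developed in the project, which the interview does not describe. It assumes a single hypothetical weather series of (timestamp, temperature) pairs and recordings stored as dictionaries with hypothetical `recorded_at` and `temperature_c` fields, and it fills a missing temperature with the nearest reading in time, provided that reading is no more than one hour away.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_reading(series, when, max_gap=timedelta(hours=1)):
    """Return the value in `series` closest in time to `when`.

    `series` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None if the series is empty or the closest reading is
    further away than `max_gap`.
    """
    times = [t for t, _ in series]
    i = bisect_left(times, when)
    # Only the readings just before and just after `when` can be nearest.
    candidates = series[max(i - 1, 0): i + 1]
    if not candidates:
        return None
    t, value = min(candidates, key=lambda tv: abs(tv[0] - when))
    return value if abs(t - when) <= max_gap else None


def fill_temperature(recordings, weather):
    """Return copies of `recordings` with missing temperatures filled in."""
    weather = sorted(weather)
    filled = []
    for rec in recordings:
        rec = dict(rec)  # copy, so the original metadata is left untouched
        if rec.get("temperature_c") is None:
            rec["temperature_c"] = nearest_reading(weather, rec["recorded_at"])
        filled.append(rec)
    return filled


# Hypothetical data, invented for illustration only.
weather = [
    (datetime(1985, 3, 2, 6, 0), 18.5),
    (datetime(1985, 3, 2, 7, 0), 20.0),
]
recordings = [
    {"id": "r1", "recorded_at": datetime(1985, 3, 2, 6, 20), "temperature_c": None},
    {"id": "r2", "recorded_at": datetime(1985, 3, 2, 6, 50), "temperature_c": 21.0},
    {"id": "r3", "recorded_at": datetime(1985, 3, 5, 6, 0), "temperature_c": None},
]
filled = fill_temperature(recordings, weather)
```

In this example, the first recording receives the 06:00 reading (18.5), the second keeps its own measured value, and the third stays empty because no reading lies within an hour of it. Leaving a gap unfilled rather than guessing is a deliberate choice in this sketch, since curated metadata is only useful if its provenance can be trusted.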
In your candidate’s statement to join the ACM Council, you mentioned that your eScience research and your work for funding agencies taught you the advantages (and pitfalls) of interdisciplinarity. Why do you believe more interdisciplinarity is needed within computing and what can ACM do to foster this?
Bioinformatics is a great example of the advantages of interdisciplinarity. For the past 24 years, I have been working with non-computer scientists, and hopefully becoming a better computer scientist through this exercise. Of course, there are lots of exciting and challenging problems within computer science, and we do not need to go outside our field to contribute to science.
On the other hand, it is really rewarding (and fun) to see how our research can contribute to (and profit from) working with people in other fields. Through talking with social scientists, I have learned the value of provenance to validate one’s results, and the need for provenance for reproducibility. My interest in scientific workflows was motivated by my research with chemists and geoscientists.
Interdisciplinarity is also needed within computing. As in many other fields, there is the danger of over-specialization. I have talked to experts in machine learning who do not worry about the quality of the data they will process—just about speeding up the process. Though they may conduct first-class research, I feel something is missing.
In my data-centric, data-driven view of the world, data has become the means through which scientists collaborate: tell me about your data, and I will learn about your research; I will discuss with you some possible approaches to manage your data, and you will learn about my research. Through this dialogue, we will learn about each other’s vocabularies, requirements, limitations, and needs. We will also discover new ways through which, together, we can advance science, and become better supervisors and teachers.
Claudia Bauzer Medeiros is a Professor of Databases at the Institute of Computing, University of Campinas (UNICAMP), Brazil. Her research is centered on the management and analysis of scientific data to face the challenges posed by large, real-world applications, ranging from chemistry and biology to urban planning and social sciences. Medeiros has coordinated large multidisciplinary projects, including applications in agro-environmental planning and biodiversity, some of which included partners from France and Germany.
Her honors include being named a Commander of the Brazilian Order of Scientific Merit and Doctor Honoris Causa from Universidad Antenor Orrego, Peru and Université Paris-Dauphine, France. She is a former President of the Brazilian Computer Society and currently serves as a Member of the Council of the Research Data Alliance. In May, she was elected as a Member-at-Large of the ACM Council, the leadership body that governs the association’s activities.