People of ACM - Margo Seltzer
May 23, 2023
Why was a software library like Berkeley DB needed in 1992? What was the key challenge in developing Berkeley DB?
BDB arose out of the Computer Systems Research Group’s efforts to develop a set of user-level utilities and libraries unencumbered by the UNIX license. At the time, there were multiple hash table implementations available: NDBM (and its predecessor DBM), which worked for persistent data, and hsearch, which worked for in-memory data. Keith Bostic, who was leading the overall user-level effort, was looking for replacements for these libraries. I had just taken Mike Stonebraker’s graduate database course, and I thought that a new implementation of dynamic linear hashing might satisfy both needs. This led to a package called “hash.” Bostic then convinced Mike Olson to develop a B-tree implementation that would be available via the same set of APIs. The combination of these two development efforts, coupled with a high-level design from Bostic, led to a production-quality library (DB 1.85) to store what we now call “key-data pairs.”
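To make the “same set of APIs” concrete, here is a minimal sketch in the style of the historical dbopen(3) interface that DB 1.85 shipped with on 4.4BSD-derived systems. The header, flag, and structure names below are recalled from that era rather than quoted from the interview, so treat the details as illustrative; the point is that a key-data pair is stored and fetched the same way whether the access method is a hash table or a B-tree.

```c
/* Sketch of the DB 1.85-era key-data pair interface (dbopen(3) style).
 * Details recalled from 4.4BSD-derived headers; treat as illustrative. */
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <db.h>

int
main(void)
{
    DBT key, data;
    DB *db;

    /* Changing DB_BTREE to DB_HASH selects the other access method;
     * the rest of the program stays the same. */
    db = dbopen("example.db", O_CREAT | O_RDWR, 0644, DB_BTREE, NULL);
    if (db == NULL) {
        perror("dbopen");
        return 1;
    }

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = "greeting";
    key.size = strlen("greeting");
    data.data = "hello, world";
    data.size = strlen("hello, world");

    /* Store the pair, then read it back. */
    if (db->put(db, &key, &data, 0) == 0 &&
        db->get(db, &key, &data, 0) == 0)
        printf("%.*s\n", (int)data.size, (char *)data.data);

    db->close(db);
    return 0;
}
```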
At the time, developers who needed data storage had few options. They could invest in a large and usually expensive database management system, or they could read and write plain files. Berkeley DB provided a solution that was as lightweight as writing plain files and offered some of the most common functionality of a database. Olson and I then demonstrated that one could extend the UNIX tool-based philosophy on which DB 1.85 was based to include transactional support; however, that code was never production quality. As the web emerged in the mid-1990s, every web service was looking for the same thing: a storage system that was lightweight enough to satisfy web-scale demand, but that could protect data even in the presence of failure. Ultimately, that need drove the development of the commercial version of Berkeley DB.
The key challenge throughout the BDB effort was retaining the UNIX tool-based philosophy while providing service comparable to that of large, monolithic database servers.
Based on your experience with Berkeley DB and Sleepycat Software, what advice would you offer someone who has just developed a new software package and is considering the dual-license approach?
The world has come a long way since 1992 (and 1996, when we started Berkeley DB). In 1996, we still had to explain what open source was, and we had to invent a dual license. We were lucky in that using Berkeley DB required that customers link their applications with our software. As such, we were able to specify precisely when a proprietary license was necessary. The world is quite different today, and I believe it’s much more difficult to develop a dual-license business. More commonly, you see open-source companies that sell support and services.
Among the research areas listed on the University of British Columbia’s Systopia Lab, your team has written “we both apply systems techniques to the construction of machine learning systems and leverage machine learning to improve systems.” Will you give an example of how machine learning is improving systems?
There is an active area of research that tries to replace heuristic algorithms with learned ones. We’ve seen advances throughout the system stack, from the implementation of data structures to the selection of parameters that tune system performance.
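One widely cited example of this direction, not mentioned explicitly in the interview but useful for illustration, is the “learned index”: instead of navigating a structure purely by comparisons, a small model predicts where a key sits in sorted order, and a short, model-bounded scan corrects any error. The toy C sketch below (all names hypothetical) fits a linear model from keys to positions and uses the worst observed training error as the search bound.

```c
/* Toy "learned index" sketch (illustrative only, not from the interview):
 * a linear model predicts a key's position in a sorted array, and a scan
 * bounded by the model's worst training error replaces binary search. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double slope, intercept;   /* position ~= slope * key + intercept */
    long max_err;              /* worst prediction error seen in training */
} LinearModel;

static LinearModel fit(const long *keys, long n)
{
    /* Ordinary least squares of position i on keys[i]. */
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    LinearModel m;
    long i;

    for (i = 0; i < n; i++) {
        sx += keys[i];
        sy += i;
        sxx += (double)keys[i] * keys[i];
        sxy += (double)keys[i] * i;
    }
    m.slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    m.intercept = (sy - m.slope * sx) / n;

    m.max_err = 0;
    for (i = 0; i < n; i++) {
        long err = labs((long)(m.slope * keys[i] + m.intercept) - i);
        if (err > m.max_err)
            m.max_err = err;
    }
    return m;
}

static long lookup(const long *keys, long n, const LinearModel *m, long key)
{
    long pred = (long)(m->slope * key + m->intercept);
    long lo = pred - m->max_err, hi = pred + m->max_err, i;

    if (lo < 0) lo = 0;
    if (hi >= n) hi = n - 1;
    for (i = lo; i <= hi; i++)     /* short scan instead of binary search */
        if (keys[i] == key)
            return i;
    return -1;                     /* key not present */
}

int main(void)
{
    long keys[1000], i;
    LinearModel m;

    for (i = 0; i < 1000; i++)
        keys[i] = 3 * i + 7;       /* sorted, roughly linear key set */
    m = fit(keys, 1000);
    printf("key 907 found at position %ld (max model error %ld)\n",
           lookup(keys, 1000, &m, 907), m.max_err);
    return 0;
}
```

The heuristic being replaced here is the comparison-driven search path; on key distributions the model captures well, the bounded scan touches far fewer positions than a full binary search would.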
You have been recognized with awards for teaching. Based on your experience, how has your teaching style changed over the years?
Let’s start with how it has remained the same—I genuinely care about students and about their learning. I want to help students develop the skills that will help them become “future-proof.” That is, I want to prepare them for whatever technology emerges tomorrow, not simply teach them how to use the tools they have available today.
As for changes, the most dramatic change occurred around 2010. I heard a talk by Lynn Stein (Olin) in which she offered the perspective that using the valuable in-person class time for "first presentation of material" was, perhaps, not the best use of that time. This led to an “aha” moment—students are quite capable of reading, so why was I wasting that valuable time presenting material that they could read for themselves? I immediately created a new contract with the students. I would strive to keep reading assignments short, and in exchange, I expected them to actually read the material (one might be surprised how unusual this behavior is). This contract would allow us to spend class time digging more deeply into the material, synthesizing different ideas and perspectives, and engaging more actively with the subject matter.
This naturally led to my flipping my classrooms starting around 2013 (a process I blogged extensively about). After coming to UBC, I’ve also adopted a mastery-learning approach in which students can repeatedly practice concepts (even on graded homework) until they have mastered them.
Margo Seltzer is the Canada 150 Research Chair and the Cheriton Family Chair in Computer Science at the University of British Columbia. She is also a member of the Board of Directors of the Berkman Klein Center for Internet and Society at Harvard University. Her research interests are in systems, construed quite broadly: systems for capturing and accessing data provenance, file systems, databases, transaction processing systems, storage and analysis of graph-structured data, and systems for constructing optimal and interpretable machine learning models. Seltzer, along with Keith Bostic and Mike Olson, has been recognized for Berkeley DB, a database software library that underpinned a range of first-generation Internet services.
Among her many honors, Seltzer was named the 2023-2024 ACM Athena Lecturer for foundational research in file and storage systems, pioneering research in data provenance, impactful software contributions in Berkeley DB, and tireless dedication to service and mentoring.