People of ACM - Feifei Li
April 5, 2022
As Chief Database Scientist at Alibaba, what are your overarching goals for the company’s cloud infrastructure?
The key objective of our team is to build cutting-edge, world-class cloud native database systems for both Alibaba’s own business operations and enterprise customers on Alibaba Cloud, such as our cloud native relational database PolarDB and cloud native data warehouse AnalyticDB (ADB).
Cloud-native databases have become increasingly important in the era of cloud computing, due to the need for elasticity, high availability, scalability, and on-demand usage by applications from various business domains. These requirements from cloud applications present new opportunities for cloud-native databases that cannot be fully addressed by traditional on-premise enterprise database systems. By exploring shared-storage and shared-everything architecture, a cloud-native database leverages the pool of resources provided by the underlying cloud infrastructure and decouples computation from storage, which provides excellent elasticity and high availability. For highly concurrent workloads that require horizontal scalability, a cloud-native database can further leverage a shared-nothing layer to provide distributed query and transaction processing capabilities. The ultimate goal is to provide cost-effective, easy-to-use, and highly reliable database services to our business operations and our cloud customers.
During Alibaba’s 11.11 Global Shopping Festival, the volume of traffic on the site can spike by 150x in a matter of seconds. What tools has your team been developing to handle these kinds of fluctuations? How do you see these kinds of technologies developing in the near future?
As introduced above, the key to success in such an application scenario is for the underlying database system to provide extreme elasticity and high availability. With a gigantic traffic spike in the blink of an eye, operational database systems have to withstand a tsunami attack in a cost-effective fashion. A typical, traditional on-premise database system would have to provision a humongous amount of hardware resources ahead of time for the peak workload. This can be costly and a waste of resources once the peak traffic diminishes after a short period of time. In contrast, a cloud-native database system is able to adaptively and elastically allocate and deallocate resources in an on-demand fashion by exploring its shared-storage, shared-everything architecture. The decoupling of computation and storage, as well as the pooling of various resources (compute and storage resources), allows the cloud-native database system to be self-adaptive. Distributed query and transaction processing is also leveraged in order to provide further scalability through horizontal partitioning so that the demand of extreme concurrency is satisfied.
Furthermore, distributed consensus protocols such as Raft or Paxos are extended and enhanced to provide both intra-AZ (availability zone) and inter-AZ high availability, so that any failures can be handled without the worry of data loss and business downtime or interruption. Meanwhile, software-hardware co-design is leveraged to explore acceleration and optimization offered by new hardware such as RDMA and NVMe, as well as kernel-bypass frameworks such as DPDK. HTAP (Hybrid Transaction and Analytical Processing) is another trend pursued by today’s cloud native database systems, with the goal of providing a one-stop solution for customers’ data processing and analytics needs, for example during the Double 11 shopping festival. Last but not least, self-driving database (aka autonomous database) techniques simplify the deployment, maintenance, and operation of cloud native databases on cloud infrastructure by integrating machine learning techniques with cloud native orchestration components (e.g., Kubernetes) and various database modules (e.g., slow SQL diagnosis, index recommendation). For example, we have built DAS (Database Autonomy Service) at Alibaba Cloud to serve Double 11 operations and our cloud customers to ensure that our systems are self-healing, self-tuning, and self-adaptive to the extent possible.
One of the most notable works of yours, “Wander Join: Online Aggregation via Random Walks”, won the Best Paper Award at the 35th ACM SIGMOD conference in 2016. In this work, you and your co-authors proposed a new approach to online queries with complex joins. What was a key insight of this paper? What innovations are being explored in query processing now?
Query processing and optimization is one of the most critical components of a database system. In this context, JOIN (an SQL clause used to query and access data from multiple tables) is the most common but most expensive database operation. Sampling provides an estimate much faster than computing exact results, which is important for query processing and optimization. But sampling over JOIN is hard; it has been a challenge for the database community for nearly 20 years. In this work, we have introduced novel data sampling techniques for enabling approximate and interactive query processing (e.g., providing online approximate answers with quality guarantees that improve over time). The quality of an online estimator improves over time and eventually leads to exact results. This is very attractive for big data analytics and query processing as users can issue queries as they wish and see query results immediately, with quality gradually improving until exact results are found (if needed); otherwise users have to wait for an unknown amount of time for exact results. Such estimators are also useful for query optimization (e.g., estimating cardinality of intermediate query results for a complex query plan).
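To illustrate what "quality guarantees that improve over time" can look like, here is a minimal Python sketch of an online estimator for a simple average over a single table. It is not taken from the paper: the function name `online_mean`, the checkpoint sizes, and the 95% normal-approximation confidence interval are assumptions made for illustration only.

```python
import math
import random


def online_mean(values, checkpoints, rng, z=1.96):
    """Yield (n, estimate, half_width) snapshots while scanning in random order.

    Hypothetical helper for illustration. Rows are visited in a random
    permutation, so the first n rows form a uniform sample without
    replacement. The half-width uses a normal approximation and ignores
    the finite-population correction, so it is conservative near the end.
    Checkpoints must be at least 2 (a variance needs two observations).
    """
    order = list(values)
    rng.shuffle(order)
    wanted = set(checkpoints)
    n, mean, m2 = 0, 0.0, 0.0
    for x in order:
        # Welford's running update of the mean and the sum of squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n in wanted:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err


rng = random.Random(7)
data = [rng.uniform(0, 100) for _ in range(10_000)]
snapshots = list(online_mean(data, [100, 1_000, 10_000], rng))
for n, estimate, half_width in snapshots:
    print(f"after {n:>6} rows: {estimate:.2f} +/- {half_width:.2f}")
```

The interval narrows as more rows are seen, and once every row has been scanned the estimate coincides with the exact average, which mirrors the behavior described above for a single table. Doing the same across joins is the hard part that the paper addresses.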
The wander join algorithm that we propose in this paper cleverly achieves sampling by doing random walks over the join graph. The join graph is never materialized but only explored conceptually through a careful weighted-sampling process with bias adjustment in estimation. This allows wander join to outperform existing methods by orders of magnitude, which has significantly advanced the state-of-the-art technique. As the quote for our ACM SIGMOD research highlight award in 2017 states, “There is a rich history in the DBMS literature involving sampling to estimate the results of queries faster than being computed exactly. This paper presents a highly effective alternative, achieving much better computational and statistical properties than the previously state of the art; convincingly proves this through experimentation with an open-source implementation in Postgres.”
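The random-walk idea can be sketched in a few lines of Python. This is a simplified illustration, not the authors' Postgres implementation: it assumes a three-table chain join held in memory, uses dictionaries as stand-ins for database indexes, picks uniformly among matching rows at each step, and corrects the resulting non-uniformity by weighting each walk with the inverse of its probability. The table layout and all names (`build_index`, `wander_join_sum`, columns `a`, `b`, `v`) are hypothetical.

```python
import random
from collections import defaultdict


def build_index(rows, key):
    """Map each join-key value to its rows (stand-in for a database index)."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def wander_join_sum(r1, idx2, idx3, n_walks, rng):
    """Estimate SUM(r3.v) over r1 JOIN r2 ON a JOIN r3 ON b via random walks."""
    total = 0.0
    for _ in range(n_walks):
        t1 = rng.choice(r1)                # step 1: uniform row of r1
        matches2 = idx2.get(t1["a"], [])
        if not matches2:
            continue                       # failed walk contributes 0
        t2 = rng.choice(matches2)          # step 2: uniform matching row of r2
        matches3 = idx3.get(t2["b"], [])
        if not matches3:
            continue
        t3 = rng.choice(matches3)          # step 3: uniform matching row of r3
        # This path was drawn with probability
        # 1 / (len(r1) * len(matches2) * len(matches3)),
        # so weighting by the inverse removes the sampling bias.
        total += t3["v"] * len(r1) * len(matches2) * len(matches3)
    return total / n_walks


# Small synthetic tables; rows of r2 with b == 3 have no partner in r3.
r1 = [{"a": i % 5} for i in range(50)]
r2 = [{"a": i % 5, "b": i % 4} for i in range(40)]
r3 = [{"b": i % 3, "v": i} for i in range(30)]
idx2, idx3 = build_index(r2, "a"), build_index(r3, "b")

# Exact answer by materializing the full join, for comparison only.
exact = sum(t3["v"] for t1 in r1
            for t2 in idx2.get(t1["a"], [])
            for t3 in idx3.get(t2["b"], []))
estimate = wander_join_sum(r1, idx2, idx3, 20_000, random.Random(42))
print(f"exact = {exact}, estimate = {estimate:.0f}")
```

Each walk touches one row per table instead of enumerating the join, which is why the join graph never needs to be materialized. Averaging the weighted walk values gives an unbiased estimate that tightens as more walks are performed.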
Wander join produces independent but non-uniform samples, but sometimes a random sample (independent and uniform) is required for more complex analytical operations (e.g., machine learning models such as Support Vector Machines (SVMs)). Our follow-up work in SIGMOD’18 demonstrates how to obtain truly random samples for complex joins. This research also led to innovations in areas such as learning-based approaches for query processing and optimization. These ideas were outlined in papers such as “DeepDB: Learn from Data, not from Queries!” and “BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees”. This work inspired practical adoption and designs in real systems.
What is the most significant way machine learning methods are transforming large e-commerce companies like Alibaba?
Modern advances in machine learning have already made a fundamental and long-lasting impact on organizations and society more broadly, including at Alibaba. As a simple example, the recommendation framework in Alibaba’s e-commerce sites and apps relies on well-crafted and fine-tuned deep learning models to provide much more effective matchings of merchandise for customers browsing the sites and apps. The impact clearly goes beyond just recommendation. In the operation of Alibaba’s data centers, machine learning techniques have been explored and leveraged towards building smart AIOps monitoring and orchestration tools to improve the efficiency and effectiveness of data center operations. There are many other scenarios and examples in which machine learning and AI methods are transformative; they are increasingly a critical building component in many systems, including in cloud-native database systems as mentioned above (e.g., to empower a cloud-native database system to be self-tuning).
You were a Professor at the University of Utah before joining Alibaba. What is the most striking way working for a company is different from working in academia?
The growth and enrichment of my research and engineering career in computer science during my tenure at the School of Computing, University of Utah was tremendous and beyond description. It has one of the best computing education and research programs in the world. I’m forever grateful to the school and the university. That said, working in a great company such as Alibaba definitely provides a different and enriching perspective to my understanding of computer science both as a technology discipline and an increasingly critical business sector. Working for a company means that business and customer needs always come first; one has to be obsessed with practical, business-driven customer needs. That does not necessarily mean that you do not have long-term planning goals, but they have to be very focused and useful for practical applications with well-crafted, clearly articulated strategic plans and business values. This is strikingly different from working in academia, where creating business value is not a priority, but creating intellectual value is. Pursuing an unsolved problem or challenge is often the ultimate objective, even if such an effort turns out to be only an intellectual exercise. But it is through the pursuit of such curiosity that innovation breakthroughs take place and engineering endeavors eventually make adoption of new technologies pervasive and scalable in practice. Ultimately, whether in academia or in industry, it is all about creating value for (and contributing to) the wellness of our society and civilization as a whole. From my present vantage point, I do believe my careers in academia and industry have complemented and enriched one another!
Feifei Li is Vice President at Alibaba Group. He is the Director of the Database Product Business Unit at Alibaba Cloud, as well as Chief Database Scientist for the DAMO Academy (a research branch of Alibaba Group) and Director of its Database and Storage Research Lab. He was a Professor at the School of Computing, University of Utah before joining Alibaba Group. His research interests include database systems, large-scale data management, security, data analytics, and machine learning methods for system performance and monitoring. For ACM, Li is an Associate Editor of ACM Transactions on Database Systems (ACM TODS), was a Senior Area Chair for ACM SIGMOD and ACM SIGKDD multiple times, and has served in various leadership roles (e.g., general co-chair) and on the Program Committees of multiple ACM SIGMOD Conferences.
Li was recently named an ACM Fellow for contributions to query processing, optimization and cloud database systems.