People of ACM - Gérard G. Medioni
April 2, 2024
In an interview, you mentioned that when you received your PhD from USC in 1983, you were perhaps one of only 50 people worldwide working in the computer vision field. Among the many leapfrog advances made since, what has been the most transformative innovation in computer vision/image understanding?
Deep learning, a class of machine learning algorithms, has fundamentally transformed the field of computer vision. A fairly young field with roots in the 1960s, computer vision evolved with the aim of “understanding” images or inferring semantic information from image pixels. Until 2012, progress in computer vision came from applying geometry and physics to address the image understanding problem. While adequate for some tasks such as navigation, or creating 3D models, this approach failed to solve the generic object recognition and categorization problems. In 2012, AlexNet, a deep network trained on a very large corpus, won the ImageNet Large Scale Visual Recognition Challenge by a significant margin over traditional methods. This watershed event, followed by an avalanche of further advances leveraging the new technology, ushered in the “deep learning revolution,” which has transformed the AI industry.
What was the biggest challenge in developing Just Walk Out technology? What kinds of technologies will be prevalent in retail stores 20 years from now?
Just Walk Out technology needs to produce an accurate receipt for shoppers in the store. This accuracy is essential to earn customer trust and establish that the technology is indeed reliable. To answer the “who took what” question, we needed to solve the following problems:
1 - Calibration, to know where each camera is with respect to the physical environment and to other cameras.
2 - Person Detection, to locate each and every shopper in all frames throughout the store.
3 - Object Recognition, to answer the “what” question.
4 - Pose Estimation, to associate the “who” with the “what.”
5 - Activity Analysis, to detect and understand the type of shopping action.
6 - Sensor Fusion, to gather all signals from all sensors and produce an accurate virtual cart of the items taken by each shopper.
Clearly, computer vision will continue to play an essential role in the retail store of the future, with the goal of simplifying and enhancing the shopping experience for customers, in particular checkout. Just Walk Out technology and AI-powered smart carts will continue to resonate with customers and merchants.
At computer vision-based checkout-free stores, products are arranged on shelves or tables so the system can see which items are being taken. This means products like clothing need to be packaged in bags or boxes, but that’s not always the way people shop for soft goods. Customers want to see clothing on hangers, pick items up, feel the fabric, and try them on, and they may even return the items to other shelves or locations in the store. To address the unique nature of shopping for soft goods, radio frequency identification (RFID) is an attractive technology.
You were a co-author on a paper which surveyed recent advances using deep learning for assistive computer vision. What is an example of promising computer vision technology which employs deep learning to help the visually impaired?
Visual impairment covers a large spectrum, from low-vision to total blindness. Computer vision algorithms running on a camera open up a wide range of opportunities, allowing users to perform tasks they were unable to accomplish previously, such as reading text in books, newspapers or screens, viewing pictures, navigating in new environments, and gaining some form of independence. Such tools can detect objects in images and provide audio descriptions to the users. They can also identify people in photos or videos. OrCam offers several such personal assistive handheld and wearable AI devices.
You have noted that we are in a “Golden Age of Computer Vision,” in that large companies and startups are making significant investments to make computer vision a core technology to reinvent the future. What skills specific to computer vision should universities be offering to prepare the next generation of inventors in this area?
Computer vision is a multi-disciplinary field, which requires several sources of knowledge. A solid foundation in mathematics is necessary. Physics explains the process of image formation. Electrical engineering is useful to understand image transmission and compression. Computer science is essential to design and code algorithms. Machine learning, and in particular deep learning, now provides the foundation of computer vision. Finally, understanding human vision is useful, both from the psychology and computational biology perspectives.
Gérard G. Medioni is a Vice President and Distinguished Scientist at Amazon. He is also an Emeritus Professor of Computer Science at the University of Southern California, where he served as Chairman of the Computer Science Department from 2001 to 2007. Medioni, whose research interests span a broad spectrum of the image understanding field, joined Amazon in 2014 to help create Just Walk Out technology, a system which employs cameras and sensors to allow customers to take the items they want from shelves and leave a store without waiting in line to check out. More recently, he helped create the identity service Amazon One, a convenient, contactless way for people to use the palm of their hand for payment, age verification, loyalty, entry and more.
Medioni has published four books and many articles on topics including scene understanding, object recognition, shape modeling, and computer vision. He is also the recipient of more than 110 patents. Medioni is a member of the National Academy of Engineering and a Fellow of several societies (IEEE, IAPR, AAAI, AAIA, NAI), and was named an ACM Fellow for contributions to computer vision and its consumer-facing applications.