Artificial intelligence has been on a six-decade ascent into realms once dominated by human experts. You know the highlights. Today computers regularly beat the world’s best chess masters at their own game. IBM’s Watson supercomputer defeated all comers on Jeopardy! Algorithms regularly parse Twitter traffic to gauge public sentiment on everything from Taylor Swift’s new album to the direction of the stock market. And speech recognition apps such as Siri let us issue voice commands to our smartphones.
Yet all these milestones depend on one attribute: machines that can grasp the meaning of words. Astounding though these examples may be, they pale alongside the next challenge facing the field of artificial intelligence: teaching computers to discern images and comprehend visual information.
“Images dominate our lives, but they are the dark matter of our digital universe,” says Stanford’s Fei-Fei Li, an associate professor of computer science and director of the Stanford Artificial Intelligence Lab. “They are the greatest share of data, but the least understood. [Images are] essentially invisible to us as digital information.”
Li is one of the world’s leading experts on computer vision, and she and her team are making inroads in the development of computational systems that can see and comprehend the visual world. Here is a distillation of some recent talks in which Li has sought to explain the progress and the obstacles facing her field.
The reason visual intelligence has lagged is not a lack of effort—people have been trying since at least 1966. The fact is that computer vision is a profoundly difficult problem. The goal, says Li, is computers that not only discern, but understand visual content.
More than half of the human brain, the most powerful visual intelligence machinery yet developed, is involved in processing visual information. Imagine the difficulty of teaching a computer to scan the pixelated contours and tones in a two-dimensional photograph and to recognize not only the object in the photo, but also what that subject is doing—without error.
The solution begins with rules. Here again, words enjoy something of a head start: they have well-established definitions and long-accepted rules of syntax and grammar, codified over centuries of common usage. Images have no such foundation. There is no Oxford Dictionary of images, no Elements of Style for visual language.
Even if there were, Li says, contemplate for a moment the length of the entry it would take to categorize just one species: dogs. To correctly identify a photograph of a dog, a computer would have to identify every breed, from the Dachshund to the Great Dane, as dogs.
Not surprisingly, just a few short years ago, it made global news when a computer was able to recognize a cat in a two-dimensional photograph. With such modest milestones, it is no wonder that computer vision has trailed its verbally precocious sibling in the headline department.
Day by day, however, computer vision experts are catching up. Li says that the challenges of computer vision may be massive, but they are not insurmountable. More importantly, she says, they are worth solving.
Computers that see will change almost every aspect of our daily lives and improve society as a whole, from health care and medicine to cars that drive themselves, to counterterrorism and national security, Li predicts.
So, what will it take to create computers that truly see? Li lays out the four hallmarks of true computer vision.
Objects are the building blocks of vision. A computer must see an object and recognize it for what it is. Much of vision is not really ‘seeing’ at all, but interpretation based upon our knowledge of the three-dimensional world.
“There are many objects in the world and they are often distorted and obscured in photography,” Li says. “So, we are teaching computers to interpret what they see, just as a child would learn.”
A decade or so ago, computer vision experts began an effort known as the PASCAL Visual Object Classes Challenge to define 20 classes of objects ranging from vehicles and animals to household objects.
“How many objects there are in the world, I don’t know,” Li says rhetorically before adding: “But, I know it’s more than 20. I have more than 20 objects in my office.”
Li and colleagues set out to change the paradigm. Accordingly, priorities have begun to shift toward thinking on the order of millions of discrete objects. It took almost 50,000 workers from 167 countries a full two-and-a-half years to clean, sort, label, and cull a database of a billion images.
The outcome, known as ImageNet, now includes 22,000 categories of objects and a catalog of over 15 million images, all described in plain English. In essence, ImageNet is a visual dictionary, the largest repository of quantified and qualified visual imagery in the world. It is available free online at www.image-net.org.
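ImageNet’s actual format is far richer, but the core idea of a “visual dictionary”—categories mapping to images annotated with plain-English descriptions—can be sketched in a few lines. The entries below are hypothetical illustrations, not real ImageNet records:

```python
# Illustrative sketch of a "visual dictionary": categories mapping to
# images annotated in plain English. Entries are hypothetical, not
# actual ImageNet data.
visual_dictionary = {
    "dog": [
        {"image": "dachshund_001.jpg", "description": "a dachshund lying on a rug"},
        {"image": "great_dane_042.jpg", "description": "a great dane standing in a yard"},
    ],
    "car": [
        {"image": "sedan_007.jpg", "description": "a silver sedan parked on a street"},
    ],
}

def lookup(category):
    """Return the plain-English descriptions filed under a category."""
    return [entry["description"] for entry in visual_dictionary.get(category, [])]

print(lookup("dog"))
```

Scaled to 22,000 categories and 15 million images, a structure like this is what lets an algorithm learn, by example, that both a Dachshund and a Great Dane belong under “dog.”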
The second threshold is the correct identification of specific people. Neuroscientists have long known that specific brain areas are dedicated entirely to the recognition of faces, a skill they surmise evolved because it helped humans tell friend from foe.
Artificial Intelligence teams have since built algorithms that similarly detect and identify faces in imagery, the results of which have become commonplace in our everyday cameras and smartphones.
The third hurdle is to correctly describe the action in a photograph. This has prompted an altogether different kind of data-collection effort, one focused on cataloging visual movements.
“We’re helping computers understand human behavior by capturing and describing various actions and patterns of movement in a visual database,” Li says.
The culmination of all these skills, ultimately, is to understand what is going on in an image—to describe both the content and the context.
“The Holy Grail of computer vision is to get computers to tell stories about what they see,” Li says.
In this regard, she believes we are at a point of convergence: “This is the beginning of a very exciting generation for the marriage of computer vision and machine learning to expand the boundaries of artificial intelligence.”