Computer software only recently became smart enough to recognize objects in photographs. Now, Stanford researchers using machine learning have created a system that takes the next step, writing a simple story of what's happening in any digital image.
“The system can analyze an unknown image and explain it in words and phrases that make sense,” said Fei-Fei Li, a professor of computer science and director of the Stanford Artificial Intelligence Lab.
“This is an important milestone,” Li said. “It's the first time we've had a computer vision system that could tell a basic story about an unknown image by identifying discrete objects and also putting them into some context.”
Humans, Li said, create mental stories that put what we see into context. “Telling a story about a picture turns out to be a core element of human visual intelligence, but so far it has proven very difficult to do this with computer algorithms,” she said.
At the heart of the Stanford system are algorithms that enable the system to improve its accuracy by scanning scene after scene, looking for patterns, then using the accumulation of previously described scenes to extrapolate what is being depicted in the next unknown image.
“It's almost like the way a baby learns,” Li said.
She and her collaborators, including Andrej Karpathy, a graduate student in computer science, describe their approach in a paper submitted in advance of a forthcoming conference on cutting-edge research in the field of computer vision.
Eventually these advances will lead to robotic systems that can navigate unknown situations. In the near term, machine-based systems that can discern the story in a picture will enable people to search photo or video archives and find specific images.
“Most of the traffic on the Internet is visual data files, and this might as well be dark matter as far as current search tools are concerned,” Li said. “Computer vision seeks to illuminate that dark matter.”
The new Stanford paper describes two years of effort that flows from research that Li has been pursuing for a decade. Her work builds on advances that have come, slowly at times, over the last 50 years since MIT scientist Seymour Papert convened a “summer project” to create computer vision in 1966.
That timeline, conceived during the early days of artificial intelligence, proved exceedingly optimistic, as computer scientists struggled to replicate in machines what took millions of years to evolve in living beings. It took researchers 20 years to create systems that could take the relatively simple first step of recognizing discrete objects in photographs.
More recently the emergence of the Internet has helped to propel computer vision. On one hand, the growth of photo and video uploads has created a demand for tools to sort, search and sift visual information. On the other, sophisticated algorithms running on powerful computers have led to electronic systems that can train themselves by performing repetitive tasks, improving as they go.
Computer scientists call this machine learning, and Li likened this to how a child learns soccer by getting out and kicking the ball. A coach might demonstrate how to kick and comment on the child's technique. But improvement occurs from within as the child's eyes, brain, nerves and muscles make tiny adjustments.
Machine learning algorithms guide this improvement process in computer-based systems. How humans learn is a subtle process that is not fully understood. Researchers such as Li are developing ways to create positive feedback loops in machines by inserting mathematical instructions into software.
Li's latest algorithms incorporate work that her researchers and others have done. This includes training their system on a visual dictionary, using a database of more than 14 million objects. Each object is described by a mathematical term, or vector, that enables the machine to recognize the shape the next time it is encountered. Those mathematical definitions are linked to the words humans would use to describe the objects, be they cars, carrots, people, mountains or zebras.
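The idea of a visual dictionary, in which each object is a vector linked to a word, can be illustrated with a minimal sketch. This is not the Stanford system: the three-number vectors, the `VISUAL_DICTIONARY` table and the `recognize` function are hypothetical, and nearest-match lookup by cosine similarity is a simple stand-in for whatever the real system learns from millions of examples.

```python
import math

# Hypothetical toy "visual dictionary": each word is linked to a made-up
# feature vector. A real system would learn far longer vectors from images.
VISUAL_DICTIONARY = {
    "car": [0.9, 0.1, 0.0],
    "carrot": [0.1, 0.9, 0.1],
    "zebra": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recognize(vector, dictionary=VISUAL_DICTIONARY):
    """Return the word whose stored vector is most similar to `vector`."""
    return max(
        dictionary,
        key=lambda word: cosine_similarity(vector, dictionary[word]),
    )


# A vector that resembles the stored "car" entry is labeled "car".
print(recognize([0.8, 0.2, 0.1]))
```

In this sketch, a shape seen again later only needs to produce a vector close to a stored one for the matching word to be retrieved.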
Li played a leading role in creating this training tool, the ImageNet project, but her current work goes well beyond memorizing this visual dictionary.
Her team's new computer vision algorithm trained itself by looking for patterns in a visual dictionary, but this time a dictionary of scenes, a more complicated task than looking just at objects.
This was a smaller database, made up of tens of thousands of images. Each scene is described in two ways: in mathematical terms that the machine can use to recognize similar scenes and also in a phrase that humans would understand. For instance, one image might be “cat sits on keyboard” while another could be “girl rides on horse in field.”
These two databases – one of objects and the other of scenes – served as training material. Li's machine-learning algorithm analyzed the patterns in these predefined pictures, then applied what it had learned to unknown images, identifying individual objects and providing some rudimentary context. In other words, it told a simple story about the image.
For instance, if Li's computer vision system discerned the mathematical outlines of a four-legged furry mammal lying on an object, it might tell a story such as “dog lies on rug.” If the math described a bipedal creature alongside a quadruped, the software might define this as “boy stands near cow.”
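A toy sketch can show how two recognized objects plus a spatial cue might turn into a phrase like these. The article does not say how the Stanford system infers relations, so everything here is an assumption made for illustration: the `Detection` class, the pixel boxes and the rule that overlapping boxes mean "lies on" while separate boxes mean "stands near".

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str   # word for the recognized object, e.g. "dog"
    box: tuple   # (left, top, right, bottom) in pixels; y grows downward


def overlaps(a, b):
    """Return True if two boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def describe(subject, other):
    """Build a short phrase from two detections.

    Hypothetical rule: overlapping boxes read as "lies on",
    non-overlapping boxes read as "stands near".
    """
    relation = "lies on" if overlaps(subject.box, other.box) else "stands near"
    return f"{subject.label} {relation} {other.label}"


# Made-up detections standing in for the output of an object recognizer.
dog = Detection("dog", (40, 60, 120, 110))
rug = Detection("rug", (20, 90, 200, 140))
boy = Detection("boy", (10, 20, 50, 120))
cow = Detection("cow", (70, 40, 180, 120))

print(describe(dog, rug))
print(describe(boy, cow))
```

A hand-written rule like this is far cruder than a system that learns its descriptions from tens of thousands of captioned scenes, but it shows the two ingredients the examples rely on: object labels and the spatial relationship between them.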