When you watch hundreds of hours of television, they call you a lazy slob. When a computer does it, it's a technological success story.
That is the case for a new algorithm from MIT's Computer Science and Artificial Intelligence Laboratory. Researchers fed the program 600 hours of YouTube videos and television shows like The Office, Desperate Housewives and Scrubs to see if it could learn about and predict certain human interactions — hugs, kisses, high-fives and handshakes.
The algorithm uses an artificial intelligence technique called "deep learning" to create its own understanding of the patterns that make up human interaction. Given raw, unlabeled data, the machine is asked to figure out on its own what is important and what is not. It's a mechanism that humans naturally develop over the course of their lives by picking up clues in the social interactions of those around them.
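In practice, that kind of self-teaching boils down to a training loop with no human-written labels: the video itself supplies the answer, because the frame one second in the future is already on the tape. The sketch below is only a rough illustration of the idea, not the MIT team's code; the network sizes, frame dimensions and variable names are placeholders.

```python
# A rough sketch of self-supervised learning from unlabeled video (not the MIT code).
# The "label" for each training example is simply what the video shows one second later.
import torch
import torch.nn as nn

# Hypothetical stand-ins for frames sampled from video: a batch of small RGB images.
current_frames = torch.randn(8, 3, 64, 64)  # what the camera sees now
future_frames = torch.randn(8, 3, 64, 64)   # the same scenes one second later

# A small convolutional encoder that turns a frame into a compact representation.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 128),
)

# A predictor that guesses the future representation from the current one.
predictor = nn.Linear(128, 128)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

# One training step. No human labels anywhere: the target comes from the video itself.
optimizer.zero_grad()
predicted_future = predictor(encoder(current_frames))
actual_future = encoder(future_frames).detach()  # "ground truth" taken from the tape
loss = nn.functional.mse_loss(predicted_future, actual_future)
loss.backward()
optimizer.step()
```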
"Humans don't need to have someone give us thousands of examples of things saying, 'This is a kiss.' We just need a few examples," says Carl Vondrick, a doctoral candidate at MIT and one of the researchers on the project. "So what's powerful about this is that it can learn its own rules."
To test the computer, the researchers showed it videos of people one second away from doing one of the four interactions. The system then generated several possible future scenarios and used what it had learned to guess what would happen.
The computer guessed correctly 43 percent of the time; humans were correct 71 percent of the time in the same test. Vondrick believes that the system will be even more successful the more content it consumes — 600 hours is only 25 days.
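For readers curious what "several possible future scenarios" might mean in code, here is one way a toy version could look: the network proposes a handful of imagined futures, scores how plausible each is, and classifies the most plausible one as a hug, kiss, high-five or handshake. The heads, scorer and layer sizes here are invented for illustration and are not the researchers' actual architecture.

```python
# A toy illustration (not the published model) of guessing among several imagined futures.
import torch
import torch.nn as nn

INTERACTIONS = ["hug", "kiss", "high-five", "handshake"]
K = 3             # how many futures the network imagines (invented for the example)
FEATURE_DIM = 128

# Stand-in for the representation of the frame one second before the interaction.
current_representation = torch.randn(1, FEATURE_DIM)

# Each head proposes one possible future representation.
scenario_heads = nn.ModuleList([nn.Linear(FEATURE_DIM, FEATURE_DIM) for _ in range(K)])
# A scorer rates how plausible each imagined future is.
scorer = nn.Linear(FEATURE_DIM, 1)
# A classifier maps a representation to one of the four interactions.
classifier = nn.Linear(FEATURE_DIM, len(INTERACTIONS))

scenarios = torch.stack([head(current_representation) for head in scenario_heads], dim=1)
plausibility = scorer(scenarios).squeeze(-1)        # shape (1, K): one score per scenario
best_future = scenarios[0, plausibility.argmax()]   # keep the most plausible imagined future
prediction = INTERACTIONS[classifier(best_future).argmax().item()]
print("Predicted interaction:", prediction)
```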
Vondrick hopes to give the system much more video to learn from, and to ask it to predict more complex interactions. If it becomes advanced enough, the technology could be used to create intelligent security cameras that could automatically call an ambulance if someone is about to get injured, or the police as a crime is taking place.
This technology could also bring us closer to robots that interact with us like The Jetsons' maid, Rosey.
"If you want a robot to interact in your home, it needs some basic ability to predict the future," Vondrick says. "For example, if you were about to sit down in a chair, you don't want the robot to pull the chair out away from you as you're sitting down."
Vondrick's team isn't the first to create a video prediction algorithm, but theirs is the most accurate to date.
"It's not hugely different from some other things that people have done, but they've gotten substantially better results out of it than people have in this area before," says Pedro Domingos, a professor at the University of Washington and an expert in machine learning.
One of the reasons the computer was so successful is that it uses what Vondrick calls "visual representations." In the past, some video prediction algorithms have attempted to create a pixel-by-pixel representation of possible future scenarios, which Vondrick says is a difficult task.
"It's hard for even a professional painter to paint something realistic," he says. "So we were arguing that it wasn't necessary to actually render the full future. Instead you could try to predict an abstract version of the image."
These abstract representations let the computer picture objects and actions more generally. For example, it can tell that an image contains a face or a chair, rather than just a collection of colors to interpret. Domingos says it's the same basic technology Facebook uses to guess which of your friends should be tagged in your photos.
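The practical payoff of that choice is in how much the network has to predict. Painting the future pixel by pixel means regressing hundreds of thousands of numbers per frame, while predicting a compact, recognition-style representation means regressing a few hundred. The comparison below uses made-up dimensions purely to make that point.

```python
# Made-up dimensions to show why an abstract target is easier than a pixel-perfect one.
import torch.nn as nn

FEATURE_DIM = 128          # size of a compact, recognition-style representation (assumed)
HEIGHT, WIDTH = 256, 256   # a typical frame size, for comparison only

# "Paint the future": regress every pixel of the next frame.
pixel_head = nn.Linear(FEATURE_DIM, 3 * HEIGHT * WIDTH)

# "Imagine the gist of the future": regress a compact representation instead.
abstract_head = nn.Linear(FEATURE_DIM, FEATURE_DIM)

print("numbers to predict, pixel by pixel:", pixel_head.out_features)     # 196,608
print("numbers to predict, abstractly:    ", abstract_head.out_features)  # 128
```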
In a second experiment, the computer was shown an image and asked to predict what object would appear five seconds later. For example, if it saw someone approaching a sink, it might guess that the person would soon be using soap. The computer performed 30 percent better than previous attempts, but it was still right only 11 percent of the time.
Domingos says the MIT team's algorithm is among a handful that are moving computers closer to being able to understand images in the way that humans can, which is a harder task than it may seem.
"We human beings tend to take vision for granted," Domingos says. "But evolution spent 500 million years developing vision, and a third of your brain is devoted to vision ... all sorts of stuff is happening in [each image,] it's really hard to extract the objects and the people and the action."
So if a computer is going to learn about human interaction from videos, why go with the social fumbling of Michael Scott (The Office) or the machinations of Edie Britt (Desperate Housewives)?
"We just wanted to use random videos from YouTube," Vondrick said. "The reason for television is that it's easy for us to get access to that data, and it's somewhat realistic in terms of describing everyday situations."
Vondrick plans to expose the algorithm to years' worth of television, in the hope that it will become even more sophisticated over time. Who knows, maybe it will become even more refined than the sitcoms themselves.
Riley Beggin is an intern with NPR's investigations unit.
Copyright 2021 NPR. To see more, visit https://www.npr.org.