Google is pretty great at figuring out what a user is saying, but is it any good at knowing who’s saying it? Just look at current smart speaker technology, which can be easily fooled.
Google might have a pretty simple solution, however. Its researchers have created a deep learning system that can single out an individual voice in a noisy recording. It does this by literally looking at people's faces while they talk.
First, the researchers trained the system to recognize individual people speaking alone. They then created virtual noise by mixing in other speakers to simulate a crowd, teaching the artificial intelligence to separate the combined audio into distinct tracks and to recognize which voice belongs to which speaker.
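To make that training recipe concrete, here is a minimal sketch of how such synthetic "cocktail party" data might be assembled. This is not Google's actual pipeline; the function name, gain ranges, and noise level are illustrative assumptions. The idea is simply that mixing known clean tracks produces a noisy input whose correct separation is already known.

```python
import numpy as np

def make_synthetic_mixture(clean_tracks, noise=None, seed=0):
    """Mix clean single-speaker tracks (plus optional background noise)
    into one "cocktail party" waveform.

    clean_tracks: list of 1-D float arrays at the same sample rate.
    Returns the mixture and the scaled per-speaker tracks, which serve
    as the targets a separation model is trained to recover.
    """
    rng = np.random.default_rng(seed)
    length = min(len(t) for t in clean_tracks)  # trim to shortest track
    mixture = np.zeros(length, dtype=np.float64)
    targets = []
    for track in clean_tracks:
        gain = rng.uniform(0.5, 1.0)   # vary loudness per speaker (assumed range)
        scaled = gain * track[:length]
        mixture += scaled
        targets.append(scaled)
    if noise is not None:
        mixture += 0.3 * noise[:length]  # low-level background chatter (assumed level)
    peak = np.max(np.abs(mixture))
    if peak > 1.0:                       # normalize to avoid clipping
        mixture /= peak
        targets = [t / peak for t in targets]
    return mixture, targets
```

The appeal of this approach is that because the crowd is built from known clean recordings, the ground-truth separation comes for free, with no manual labeling of the noisy audio required.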
Google’s research is detailed in a paper called “Looking to Listen at the Cocktail Party,” named after the cocktail party effect in which people are able to focus on one audio source despite the surrounding noise and distractions.
The researchers are still working out how this technology might be integrated into Google's products, though the candidates aren't hard to imagine. The most obvious fit is video services such as Hangouts or Duo, which could use the feature to amplify a person's voice when they're speaking over heavy crowd noise. There are also big implications for accessibility, as Engadget notes: AI-powered voice tracking could lead to camera-assisted hearing aids that amplify the voice of whoever the wearer is facing.
There are privacy implications as well, however. Imagine the technology advancing to the point where it can pinpoint a specific voice on a bustling street in a city such as New York. Combined with security cameras, Google's new tech could add yet more fuel to fears over surveillance. Time, however, will tell.