Last week, Google quietly changed a line on its support page for the Pixel Buds, which now reads: “Google Translate is available on all Assistant-optimised headphones and Android phones.” The feature was previously exclusive to owners of Pixel Buds and Pixel phones. And although the company made no official announcement, this small tweak is noteworthy.
To understand why, first a little earphone history. Google launched its shiny pair of wireless earbuds last year in a climate of high anticipation, having hyped what was promised as a revolutionary tool: live translation. Just tapping the Buds and speaking the words “help me speak (language)” opens the Google Translate app on your — until now, Pixel — phone. From there, you can speak a sentence, which is translated and transcribed in the target language on your phone, and then read out. On paper, Google's new technology should have interpreters fearing for their jobs.
The on-stage demo of the live translation tool at the product’s showcase got enthusiastic rounds of applause, but when the device started shipping, the response was a little more skeptical: the quality of the translation did not match the expectations of the public.
Tech Insider tested it with ten different languages. The device successfully translated basic questions such as “where is the nearest hospital,” but as soon as sentences got more complex, or if the speaker had an accent, things got lost in translation. Our very own reviewer came to the conclusion that live translation has been “a bit of a con,” with Google Assistant struggling to understand the words spoken to it.
Consumer technology analyst Daniel Gleeson says: “Getting natural language right is incredibly difficult. It would be a massive achievement for Google, and the day that they do, they will be shouting it from the rooftops.” Perhaps, some may say, that is why the Pixel Buds support page update was kept so quiet.
Google’s problem does not come from the translating process itself — in fact, the company has been upping its translation game in the past few years. In 2016, it converted Google Translate to an AI-driven system based on deep learning. Until then, the tool translated each word separately and applied linguistic rules to make the sentence grammatically correct — hence the somewhat fragmented translations we know all too well. Neural networks, on the other hand, consider the sentence as a whole and predict the most likely output, based on the large datasets of text they have previously been trained on. Using machine learning, these systems can take the context of a sentence into account to deliver a much more accurate translation.
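The difference is easy to see in miniature. Below is a toy sketch — the dictionaries, function names and the French example are invented for illustration, and bear no relation to Google Translate's actual models — showing how word-by-word lookup fragments a phrase that only makes sense read as a whole:

```python
# Hypothetical toy dictionaries for illustration only.
WORD_DICT = {"pomme": "apple", "de": "of", "terre": "earth"}
PHRASE_DICT = {"pomme de terre": "potato"}

def word_by_word(sentence: str) -> str:
    """The old approach: translate each word independently,
    then hope grammar rules can stitch the result together."""
    return " ".join(WORD_DICT.get(w, w) for w in sentence.split())

def phrase_aware(sentence: str) -> str:
    """A crude stand-in for whole-sentence translation: match
    known phrases first, roughly what a neural system learns
    to do implicitly from its training data."""
    if sentence in PHRASE_DICT:
        return PHRASE_DICT[sentence]
    return word_by_word(sentence)

print(word_by_word("pomme de terre"))  # fragmented: "apple of earth"
print(phrase_aware("pomme de terre"))  # contextual: "potato"
```

The point is not the lookup tables, which a real system does not use, but the unit of translation: a neural model scores whole output sentences against whole input sentences, so "pomme de terre" never gets split into apples and earth.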
Integrating machine learning was part of the mission of Google Brain, the company’s division dedicated to deep learning. Google Brain also brought neural networks to another tool that is key to live translation — and the one where it all seems to go wrong: speech recognition. Google Assistant is trained on hours upon hours of speech, to which machine-learning tools are applied to recognise patterns, so that it can ultimately make out what you are saying when you ask for a translation.
Except it doesn’t. So if Google has managed to apply neural networks with some degree of success to text-to-text translation, why is the Assistant still not able to consistently recognise speech using the same technique? Matic Horvat, a researcher in natural language processing at the University of Cambridge, says it all comes down to the dataset used to train the neural network.
“Systems adapt to the training dataset they have been fed,” he says. “And the quality of speech recognition degrades when you introduce it to things it hasn’t heard before. If your training dataset is conversational speech, it won’t do so well at recognising speech in a busy environment, for example.”
Interference: it is the nemesis of any computer scientist working to improve speech recognition technology. Last year, Google allocated €300,000 of its Digital News Innovation Fund to London-based startup Trint, which is leading the way in automating speech transcription, albeit with an algorithm that is different from Google’s. That algorithm, however, is no better at tackling the basic problem of interference.
The company’s site, in fact, dedicates an entire section to recommendations on how to record speech in a clear environment. It also asserts that it operates with a five to ten per cent margin of error, but is transparent in saying that this applies to clear recordings. There are no official statistics for recordings that include overtalk or background noise. “The biggest challenge is to explain to our users that we are only as good as the audio they will give us,” says Trint’s CEO Jeff Kofman. “With echoes, noise or even heavy accents, the algorithm will make mistakes.”
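A figure like “five to ten per cent” typically refers to word error rate, the standard transcription metric: the minimum number of word substitutions, insertions and deletions needed to turn the transcript into the reference text, divided by the length of the reference. A rough sketch of the generic metric — this is the textbook calculation, not Trint's or Google's internal scoring:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    reference and the hypothesis, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word in a ten-word sentence -> 10 per cent WER.
print(word_error_rate(
    "where is the nearest hospital please can you help me",
    "where is the nearest hospital please can you help us"))  # 0.1
```

On noisy audio the edit distance climbs quickly, which is why vendors quote their error margins against clean recordings.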
The challenges posed by live speech mean that training is the longest and most expensive part of creating a neural network. And keeping live translation to a restricted number of devices, as Google did with the Pixel Buds, certainly doesn’t help the system learn. The more speech it can process, the more data it can feed its algorithms — and the more the machine can learn to recognise unfamiliar speaking patterns. Google didn't put forward a spokesperson for interview, but did point us in the direction of its blog post on Google Assistant.
For Gleeson, this is one of the reasons that Google has made a move towards expanding the feature to more hardware. “One of the toughest issues with speech recognition is gathering enough data on specific accents, colloquialisms, idioms, all of which are highly regionalised,” he says. “Keeping the feature on the Pixel was never going to let Google reach those regions in high enough numbers to process adequate levels of data.”
Accumulating data, however, comes with a downside. The best-performing neural networks are those with the most data — but running them takes processing hardware whose size grows with the amount of information stored, hardware that is still far from fitting inside a mobile device. That makes on-device, real-time speech processing impossible today. In fact, every time you use Google Assistant, the spoken information is sent off to a datacentre for processing before being sent back to your phone. None of the computational effort happens locally, because existing phones cannot hold the data that neural networks need to process speech.
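Schematically, the round trip looks something like the sketch below. The datacentre is simulated here by a local function, and every name, delay and payload is invented for illustration — the point is only that the phone uploads audio, waits, and receives text, doing no recognition itself:

```python
import time

def datacentre_transcribe(audio_bytes: bytes) -> str:
    """Stand-in for the remote neural network (hypothetical)."""
    time.sleep(0.05)  # simulated network latency + model inference
    return f"<transcript of {len(audio_bytes)} bytes of audio>"

def assistant_query(audio_bytes: bytes):
    """What the phone does: upload the audio, wait, receive text."""
    start = time.monotonic()
    text = datacentre_transcribe(audio_bytes)  # nothing runs on-device
    elapsed = time.monotonic() - start
    return text, elapsed

# Roughly two seconds of 16kHz, 8-bit audio (illustrative payload).
text, elapsed = assistant_query(b"\x00" * 32000)
print(text)
print(f"round trip took {elapsed:.2f}s, none of it on the phone")
```

Every spoken query pays that upload-and-wait cost, which is why shrinking neural network inference onto the handset matters so much for seamlessness.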
While Google Assistant is capable of completing that process fairly fast, says Horvat, it is still a long way from real-time speech recognition. One of the company’s challenges today, to improve seamlessness in features like live translation, is to find out how to integrate neural network processing in mobile phones.
Developers are already working on small external chips suited to processing neural networks efficiently, which could be integrated into phones. Earlier this month, for example, Huawei announced an AI chip that the company claims could be used to train neural network algorithms in minutes.
While Google has its own chip called the Edge TPU, it is designed for enterprise use and not yet for smartphones. For Horvat, this is its Achilles heel: as a software company, Google doesn’t have much control over manufacturers to ensure the development of a product that would make local neural network processing available to all Android devices — unlike Apple, for example.
For the near future, Google may be forced to take smaller steps to improve its speech recognition technology. And while live translation has attracted much criticism, industry analyst Neil Shah, partner and research director for IoT, mobile and ecosystems at Counterpoint, sees expanding its reach as a way for the company to position itself ahead of the competition: “Google has access to two billion Android users,” he says. “It is very well positioned to scale faster than competition, and train with massive flows of input data, as more users start using the latest voice interactions in Android phones.”
Daniel Gleeson concurs. Whether reviews of the feature stick to the tone of gentle mockery or not, Google’s move will ultimately lead to significant improvements. As with all AI products, the tool needs to learn — so by definition, it comes to the market unfinished. “You do run the risk of people saying it doesn’t work the way it was promised,” he says, “but that is the only way for it to get there.” Interpreters needn’t worry for their jobs just yet.