Thursday, April 26, 2012

Breaking down the language barrier—six years in

The rise of the web has brought the world’s collective knowledge to the fingertips of more than two billion people. With just a short query you can access a webpage on a server thousands of miles away in a different country, or read a note from someone halfway around the world. But what happens if it’s in Hindi or Afrikaans or Icelandic, and you speak only English—or vice versa?

In 2001, Google started providing a service that could translate eight languages to and from English. It used what was then state-of-the-art commercial machine translation (MT), but the translation quality wasn’t very good, and it didn’t improve much in those first few years. In 2003, a few Google engineers decided to ramp up the translation quality and tackle more languages. That’s when I got involved. I was working as a researcher on DARPA projects looking at a new approach to machine translation (learning from data), which held the promise of much better translation quality. I got a phone call from those Googlers, who convinced me (I was skeptical!) that this data-driven approach might work at Google scale.

I joined Google, and we started to retool our translation system toward competing in the NIST Machine Translation Evaluation, a “bake-off” among research institutions and companies to build better machine translation. Google’s massive computing infrastructure and ability to crunch vast sets of web data gave us strong results. This was a major turning point: it underscored how effective the data-driven approach could be.

But at that time our system was too slow to run as a practical service—it took us 40 hours and 1,000 machines to translate 1,000 sentences. So we focused on speed, and a year later our system could translate a sentence in under a second, and with better quality. In early 2006, we rolled out our first languages: Chinese, then Arabic.

We announced our statistical MT approach on April 28, 2006, and in the six years since then we’ve focused primarily on core translation quality and language coverage. We can now translate among any of 64 different languages, including many with a small web presence, such as Bengali, Basque, Swahili, Yiddish, even Esperanto.

Today we have more than 200 million monthly active users on translate.google.com (and even more in other places where you can use Translate, such as Chrome, mobile apps and YouTube). People also seem eager to access Google Translate on the go (the language barrier is never more acute than when you’re traveling); we’ve seen our mobile traffic more than quadruple year over year. And our users are truly global: more than 92 percent of our traffic comes from outside the United States.

In a given day we translate roughly as much text as you’d find in 1 million books. To put it another way: what all the professional human translators in the world produce in a year, our system translates in roughly a single day. By this estimate, most of the translation on the planet is now done by Google Translate. (We can’t speak for the galaxy; Douglas Adams’s “Babel fish” probably has us beat there.) Of course, for nuanced or mission-critical translations, nothing beats a human translator—and we believe that as machine translation encourages people to speak their own languages more and carry on more global conversations, translation experts will be more crucial than ever.

We imagine a future where anyone in the world can consume and share any information, no matter what language it’s in, and no matter where it pops up. We already provide translation for webpages on the fly as you browse in Chrome, text in mobile photos, YouTube video captions, and speech-to-speech “conversation mode” on smartphones. We want to knock down the language barrier wherever it trips people up, and we can’t wait to see what the next six years will bring.
