Thursday, February 17, 2011

Why the Machine Translation Crowd Hates Google

Jack: You have more sexual hang ups than an adult chat line run by Gilbert Gottfried.
Liz: What?
Jack: That was written by a computer program we're working on to replace you.
(30 Rock)

To get a measure of how insignificant the localization industry is, witness the chasm that separates the amoebas of the machine translation (MT) sector from the Behemoth of Mountain View. The Google MT people are regularly quoted in major news outlets, and then you witness the carping that ensues in the blogosphere and twittersphere. You would think that Google people would be lauded as pioneers in the field. By rolling out their engine for free, they have at least placed computerized translation at the forefront of the public’s imagination. You would be wrong.

It is telling to see that not a single Google guy or gal showed up at last November’s Denver Conference of the Association for Machine Translation in the Americas (AMTA). The usual suspects were there: Asia Online, SDL-Language Weaver and a veritable Star Trek convention of academic geeks. Google was not even a corporate sponsor. None of the high-level Google MT execs was a guest speaker. The keynote address was delivered by Paul Bremer, for Chrissake. It is a measure of how far out in the wilderness these people are that they have to pay a former Bush Administration official to visit.

Even more telling is how the slightest critical piece on Google’s translation engine circulates at the speed of light through the blogosphere, propelled especially by agencies and MT specialists. This, of course, is aided and abetted by slightly misinformed freelance translators who have nightmares at least once a week in which Google forecloses on their mortgage.

Google’s people, of course, are blissfully unaware of this. As in so many other fields, they are Moses on the mountain while the Israelites are in the valley building shrines to stones that look a little bit like sheep. I guess Google researchers going to visit the AMTA would be a little like a twenty-first century Homo sapiens going to a Cro-Magnon convention to find viable alternatives to fossil fuel. “So, Mr. Ughhho, what do you have along the lines of fuel cells?” “Ughhho have fire! Fire good! Ughhho powerful!”

To get a feel for how little love the Big G gets on Machine Translation Island, see the reaction to this article in The Guardian. One member of the Google MT team is quoted as saying the following:

Andreas Zollmann, who has been researching in the field for many years and working at Google Translate for the last year, suggests, along with Blunsom, that the idea that more and more data can be introduced to make the system better and better is probably a false premise. "Each doubling of the amount of translated data input led to about a 0.5% improvement in the quality of the output," he suggests, but the doublings are not infinite. "We are now at this limit where there isn't that much more data in the world that we can use," he admits. "So now it is much more important again to add on different approaches and rules-based models."

What this means is that once Google Translate achieves a certain degree of quality (and pay no heed to any bulls**t to the contrary: Google Translate is the state of the art in SMT), the rate of progress reaches a plateau. Improvements are still achieved, but they are painfully gradual compared to the pioneering years of the technology. Whereas five years ago doubling a one-billion-word corpus brought a 100% improvement in quality, nowadays a doubling of a ten-billion-word MT corpus only brings a marginal rate of improvement.
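To put rough numbers on that plateau, here is a minimal Python sketch. It assumes, purely for illustration, that Zollmann's figure means a flat 0.5 percentage points of quality per doubling and that the starting corpus is ten billion words. The function names are hypothetical, and nothing here reflects how Google actually measures quality.

```python
# Back-of-the-envelope sketch of diminishing returns from corpus doubling.
# ASSUMPTION: every doubling of the corpus adds a flat 0.5 points of quality,
# a simplified reading of the Zollmann quote, not a measured model.

def doublings_needed(gain_points, gain_per_doubling=0.5):
    """Number of corpus doublings required for a given quality gain."""
    return gain_points / gain_per_doubling


def corpus_size_needed(start_words, gain_points, gain_per_doubling=0.5):
    """Corpus size (in words) required to reach that quality gain."""
    return start_words * 2 ** doublings_needed(gain_points, gain_per_doubling)


if __name__ == "__main__":
    start = 10_000_000_000  # hypothetical ten-billion-word corpus
    for gain in (1, 5, 10):
        words = corpus_size_needed(start, gain)
        print(f"+{gain} points -> {doublings_needed(gain):.0f} doublings, "
              f"about {words:.3g} words")
```

Under these assumptions, a five-point gain takes ten doublings, that is, a corpus roughly a thousand times larger than the one you started with. Hence "there isn't that much more data in the world."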

To analyze another example, take the reaction when another Google executive remarked to an Australian newspaper that "I'd be really careful about having any kind of a sensitive debate with someone either spoken or written using these translations." Whew, that prompted a firestorm!

Despite the myriad things that are questionable about the company, one thing you have to love about The Google is its intellectual honesty. Of course, that honesty is enabled by the fact that its interest in setting up the MT engine in the first place isn’t commercial, but rather strategic. Instead of making pennies selling their MT application to corporates and Joe Schmoe, they decided to make the application available for free in order to expand their corpus. This is typical of the company’s long-term vision. Instead of licensing their invention, they decided to put it out there because any narrowing of the language barrier will broaden the reach of the Internet, which is their real core business. So they simply devote a tiny sliver of their R&D budget to initiatives such as GT.

The Machine Translation Sector Has Been Googled

The thing is that a tiny squirt of Google’s R&D is like Gargantua flooding Paris. And the Parisians can get hopping mad and resentful! Zollmann’s unexceptional pronouncement can trigger a lot of sniping. From the purely clueless to the slyly disingenuous. Or take the whispering campaign about Google Translate and confidentiality issues (you know who you are).

Frankly, I find it all slightly smug and more than a little infuriating. Because it is typical of the intellectual dishonesty that pervades the vulgar push to drive down standards in the translation industry through Web 2.0 idiocy. You see, folks, the thing is that when a Google dude says “our MT engine has reached its limit” what he really means is “machine translation has reached its limit.”

And why is that? Because none of the pygmies sniping at Google can match its content aggregation capabilities. Or its raw (human) brain power.

The other strategy is to insinuate snidely the following: “Well, Google might have reached its limits, but we know a better shortcut down the Yellow Brick Road.” In response, one should entertain the following thought experiment. Let us imagine that there is an Albert Einstein of machine translation. Let’s imagine that he is 23 years old and just graduated from MIT with a double Ph.D. in physics and linguistics. What is more likely: That he works for Google or for a mega-agency that haggles for nickels and dimes with freelancers? I have my own (incredibly biased) answer to that question. I leave you to draw your own.

Another, slightly bizarre, instance of this tendency is when Google began to warn that it would exclude from its search results any pages that had been machine translated, even those that had a smattering of post-editing. This, of course, is a huge bummer for the MT crowd, because its main objective isn’t to provide better machine translation technology. Its real objective is to convince the industry to crowdsource the proofreading of MT drivel by non-professionals.

Google’s epistemological modesty prompts this sort of conspiratorial reaction: “Google just wants us to use only their MT tool.”

Which, in my view, is slightly bizarre. What does Google care about whether its free tool is used or not to machine translate a website? “Aha,” our conspiracy theorist will riposte, “if Google Translate is not used, it cannot enrich its corpus. Right?” Wrong! While the translated website doesn’t enrich the Google corpus immediately, it will eventually make it out onto the Internet, where the company’s crawlers will catch it in their nets. Once identified as multilingual content, the site will ultimately make its way into the SMT corpus and further contaminate it.

Google Translate: Massive Party Pooper

Why all this animus? After all, Google is not in direct competition with the MT Lilliputians. “Free” is not in competition with companies that want to license their own SMT applications.

No, it’s not the competition. In a sense, it is something much, much worse. You see, Google squelched in one fell swoop the opportunity for another one of those rounds of incredibly wasteful capital misallocation to which Silicon Valley has treated us over the years. By creating the best MT application and putting it out there for free, Google ensured that many years’ worth of free-spending (and competing) venture-capital-financed startups failed to get off the ground. In a world without Page and Brin, all of these MT dotcoms would have gotten exactly where we are now in double the time and at several times the cost to society.

Wasteful as they are, these manias are yummy because they create VC-funded millionaires that take seed money and sprinkle it on sportscars and trophy wives. And, lo, how many advisory fees were foregone by investment banks!

No, Virginia, there will never be a machine translation IPO. Or, to put it another way, it already happened. In 2004, when the U.S.S. Google floated on the wide open market seas and sank a lot of paper boat dreams.

The reverberations of the Google bomb are still felt to this day. Any presentation to a venture capitalist of the new, new thing in computerized translation is doomed to fail. The prospect goes home, takes off his tie, opens his laptop and applies the Beta version to his daughter’s French homework. His conclusion: The Next Big MT Thing actually performs pretty much along the lines of something that is free (yuck!).

What are these clowns peddling? Next!

Miguel Llorens is a freelance financial translator based in Madrid who works from Spanish into English. He specializes in equity research, economics, accounting, and investment strategy. He has worked as a translator for Goldman Sachs, the US Government's Open Source Center and H.B.O. International, as well as many small and medium-sized brokerages and asset management companies operating in Spain. To contact him, visit his website and write to the address listed there. Feel free to join his LinkedIn network or to follow him on Twitter.

9 comments:

Anonymous said...

Not necessarily in defense of anyone in particular, but I'm sure Paul Bremmer was not paid to do a keynote. He is actually the CEO of one of the MT companies, I think it's Apptek.
Beatriz Bonnet

Kirti Vashee said...

(This comment was edited in order to remove plugs for corporate products and links to third-party websites that indirectly promote products not endorsed by me, the owner of the Financial Translation Blog.)

Miguel

There are a number of factual errors in your post that I think that your readers may wish to consider to get a more accurate picture.

Firstly, Google was present at AMTA Denver and in fact Shankar Kumar, who is a lead on GOOG speech initiatives (closely linked to MT) was not only present, he also applied to be a board member of AMTA. So he, Chris Wendt of Microsoft and I were all elected to the AMTA board which as you may guess involves regular interaction with the AMTA agenda.

(Note from the owner: I was unable to independently verify any of these claims.)

Also Paul Bremer was not the only keynote.

(Note: I never claimed that Paul Bremer was the only keynote.)

The Guardian article you reference points out that data is not enough and large volumes of noisy data especially is unlikely to lead to progress. This does not necessarily mean that this is the end of the road for MT just because GT has hit a wall.

While GT is possibly amongst the best free MT solutions out there, I have seen many MT systems that produce better quality output than GT, so I think it is worth looking at what these systems are doing differently.

With regard to surrendering to the behemoth I beg to disagree. Not so long ago IBM ruled the computing world – a behemoth if ever there was one. A college dropout named Bill came along and snatched the whole PC world away from them. And Google came along (originally just a bunch of guys in a garage) and snatched away the Internet Search market away from Microsoft. And more recently we see another college dropout has taken the lead away from Google as now Facebook is the highest traffic site in the world and a potential threat to Google’s core advertising revenue base. So yes, most companies that will show how to use MT more effectively are likely to be smaller and also more agile and innovative.


But you may be right about GT raising the bar for any and all MT players (and translators too by the way, especially in FIGS languages) as people realize that they should not be paying for something they can get for free at better quality no less.

Google is mostly irrelevant as a search engine (in local languages) in China, Czech Republic, Japan, Korea, Russia and I expect many more countries in future and I suspect they have more urgent issues to focus on than GT. Bing was the fastest growing search engine in the US in 2010 and many expect it will continue to make progress (albeit still small) at the expense of Google in 2011.

Miguel Llorens M. said...

Thank you for the feedback, Kirti. The reader might also be interested in knowing that Mr. Vashee is a salesman for one of the companies trying to push redundant MT systems on an unsuspecting world.

The alert reader will also realize of course, that by virtue of his job and by penning a lengthy comment on why Google sucks, Mr. Vashee simply confirmed the main thesis of my piece.

Vadim Berman said...

Miguel,

Many good points (including the one about technical superiority), yet far, very far from being 100% accurate. For instance, your business analysis is dead wrong.

DISCLAIMER: Being an MT vendor, I myself have a vested interest, just like Kirti (different camp though).

1. Selling MT to the public and freelancers, the same guys who use Google Translate, is not very exciting or profitable. The bulk of MT money is done by selling systems to enterprises. These folks normally can't use Google Translate for a few simple reasons. The most obvious one is that it's not secure. Basically company documents and information travels somewhere to the cloud, and then someone assures you that it will not be used anywhere. Except, of course, to train the translation system. And then it will accidentally re-surface somewhere when the corpus muncher burps in the wrong direction...

If it's not secure, normally it simply doesn't exist for enterprises. Even if all MT systems today except Google Translate would do dumb word-to-word translation, that would not spell the end of non-Google enterprise MT.

Google by itself is not in the enterprise market. They've been actually trying to enter it, I heard, but with much lower success than the search. It is simply very different. (Even in their core business, search, Google Search appliance is not much of a hit. I can't say why, but my educated guess is that they still have to adapt their cloud philosophy, not to mention support, service, etc.)

Now you might not believe it, but every division in enterprises actually has a set budget. You can't ask to allocate 50 servers to yield translation quality 5% better than a competitor, and provide a billion word corpus for your system to learn. It doesn't make sense and no one would even care about these 5%.

So no, Google Translate didn't hurt much of enterprise MT market. I mean if you consider freelancers and really small business "enterprises", then maybe yes, but folks with over 200 employees, mostly not.

2. And here's a more debatable point of technical superiority. Kirti once mentioned that Google undertook a tremendous task of giving "one-fits-all" translation. But that was mostly on the expense of customisation, special lingo, special rules, etc.

I know for certain that Google Translate is simply unaware of the special terms (financial, for one), which enterprise clients need.

3. And finally, Kirti had a very valid point of certain emerging markets. Google has a low market share in China (is it even a secret?) - even though not because of Baidu's technical superiority, in Russia - until something like 3 years ago when Google implemented better inflection system, Russian search in Google was semi-useful.

When it comes to Google Translate, tier 2 languages are hardly useful. Try Thai, Persian, or even content in languages like Japanese which was plenty of Japanese proper names or was not translated to your target language (just throwing examples: www.goo.ne.jp, www.nifty.com). Hardly usable, there are some RbMT equivalents that work better (the amount of content available to Google is huge yet finite).

Google is a great company; unlike the today's social bubble, they have a lot to offer, and their MT may have been actually instrumental in helping the MT market grow by educating the users about the existence of automatic translation. However, they are not God, and BTW have the wisdom not to pose as one.

Miguel Llorens M. said...

(My response to yet another rather lengthy example of corporate-prop from the MT crowd is between brackets throughout.)

Miguel,

Many good points (including the one about technical superiority), yet far, very far from being 100% accurate. For instance, your business analysis is dead wrong.

DISCLAIMER: Being an MT vendor, I myself have a vested interest, just like Kirti (different camp though).

1. Selling MT to the public and freelancers, the same guys who use Google Translate, is not very exciting or profitable. The bulk of MT money is done [sic, probably means “made”] by selling systems to enterprises. These folks normally can't use Google Translate for a few simple reasons. The most obvious one is that it's not secure. Basically[,] company documents and information travels somewhere to the cloud, and then someone assures you that it will not be used anywhere. Except, of course, to train the translation system. And then it will accidentally re-surface somewhere when the corpus muncher burps in the wrong direction...

(In other words, corporates buy your product not because it’s better but because it’s confidential. Notice how once again you are providing grist for my mill. The confidentiality bogeyman has been raised many times by salesmen such as you elsewhere and I alluded to it in my post.)

If it's not secure, normally it simply doesn't exist for enterprises. Even if all MT systems today except Google Translate would do [were carrying out] dumb word-to-word translation, that would not spell the end of non-Google enterprise MT.

(The lady doth protest too much, methinks.)

Google by itself is not in the enterprise market. They've been actually trying to enter it, I heard, but with much lower success than the [sic] search. It is simply very different. (Even in their core business, search, Google Search appliance is not much of a hit. I can't say why, but my educated guess is that they still have to adapt their cloud philosophy, not to mention support, service, etc.)

(Yes, yes, Google sucks. One more confirmation of my thesis. Thank you. Yawn. What’s next?)

Now you might not believe it, but every division in enterprises actually has a set budget. You can't ask to allocate 50 servers to yield translation quality 5% better than a competitor, and provide a billion word corpus for your system to learn. It doesn't make sense and no one would even care about these 5%.

(I truly and really have no earthly idea what this paragraph means.)

So no, Google Translate didn't hurt much of enterprise MT market. I mean[,] if you consider freelancers and really small business "enterprises[,]", then maybe yes, but [in the case of] folks with over 200 employees, mostly not.

(I never claimed that GT hurt the enterprise MT market. I merely posited that it set a very low ceiling to your scope for expansion. Given that the killer app is free, there is no chance that a firm like yours will ever raise significant amounts of venture capital moolah, much less make it to an IPO.)

Miguel Llorens M. said...

(Second part of response to Mr. Berman's lengthy rambling.)

2. And here's a more debatable point of technical superiority. Kirti once mentioned that Google undertook a tremendous task of giving "one-fits-all" [probably means “one-size-fits-all”] translation. But that was mostly on the expense of customisation, special lingo, special rules, etc.

(You concede my point of technical superiority at the beginning of your comment and then you cruelly snatch it away from me three paragraphs later. You tease.)

I know for certain that Google Translate is simply unaware of the special terms (financial, for one), which enterprise clients need.

(I have experimented with Google Translate’s treatment of financial terms, and it is neither better nor worse than a random Internet search, so this criticism is disingenuous or at best a matter of opinion. However, this type of analysis is quite typical of salespeople like you masquerading as business experts or localization gurus. One of the major defects of MT is that correct terminology isn’t a fixed dogmatic system but rather one that needs customization. Therefore, you are overstretching your point.)

3. And finally, Kirti had a very valid point of certain emerging markets. Google has a low market share in China (is it even a secret?) - even though not because of Baidu's technical superiority, in Russia - until something like 3 years ago when Google implemented better inflection system, Russian search in Google was semi-useful.

When it comes to Google Translate, tier 2 languages are hardly useful. Try Thai, Persian, or even content in languages like Japanese which was plenty of Japanese proper names or was not translated to your target language (just throwing examples: www.goo.ne.jp, www.nifty.com). Hardly usable, there are some RbMT equivalents that work better (the amount of content available to Google is huge yet finite).

(RbMT? Really? Frankly, I can concede that GT is worse outside of the main European biggies, but to claim that RbMT can fill the gap where no major corpora exist is tantamount to telling the Indonesian translators that their work will not be imperiled for many eons to come.)

Google is a great company; unlike the today's social bubble, they have a lot to offer, and their MT may have been actually instrumental in helping the MT market grow by educating the users about the existence of automatic translation. However, they are not God, and BTW have the wisdom not to pose as one.

(Far be it from my secular humanist bones to claim that Google or any other search engine is God. On the other hand, they have the wisdom to downplay the hype regarding their own tool, more than one can say for its competitors who are busily inflating a bubble that just won’t take. My point is that smaller players lack the same intellectual honesty and compound that deceitfulness by sniping at The Big G, as you have done so forcefully in my blog.

On another topic, allow me to note how this comment is typical of the linguistic proficiency of the cyber geeks of MT Island who are lecturing translators on how to do their work. Most of them can’t write their way out of a paper bag. Mr. Berman claims on his LinkedIn profile to be a native speaker of English, Hebrew and Russian. I know for a fact, as can be readily demonstrated by a perusal of his written comment, that he is not a native English speaker. We can assume that he must be a native speaker of at least one of the other two languages, or at least we can hope. Yet these are the people who claim that so-and-so produced better quality output than Google Translate, without actually showing anyone their data.)

Vadim Berman said...

Miguel,

Thanks for your replies, and correcting my typos. I'll try not to descend to your level, nevertheless, there is a handful of things I'd like to clarify.

Let's start with the easiest things first:

1. My LinkedIn profile says "Native OR BILINGUAL proficiency". Yes, I do tend to make a lot of mistakes when typing inside the tiny comment boxes. But how exactly you concluded that I claim to be a native English speaker, is beyond my comprehension. Maybe it's in the same line as calling "The Australian" a Canadian newspaper.

2. Show me where I claimed that "Google sucks". What I'm pointing out is that these are two different worlds, and the Google system(s) as of today are not built for the enterprise world. Of course, if you can explain to me, how MapReduce-based systems can shrink to one box, I will be more than happy to concede my defeat.

Would I pick a BlackBerry or an iPhone / Android handset? Naturally, one of the latter. But these are poor choices for the enterprise.

Scaling software down is as big a challenge as scaling it up. Go and research a bit about the major players in the enterprise search.

In fact, I find Bing a larger threat to MT business. MS already tried it on a specific domain (MS technical support), and large portion of MS profits come from the enterprise. They rarely make the first shot though, so...

Yes, I got your point about you not claiming Google to be a threat in the enterprise market.

3. You've got it backwards with RbMT, and yes, I will re-iterate my claim here.

I have seen two Japanese RbMT systems (that is, classic oldish RbMT, not EbMT or semantically-aware new generation ones) that consistently beat Japanese Google MT.

Japanese has more than enough data, you have to admire the way Japanese NLP researchers catalogue their language; the issues here are very different, but I won't bore you (unless you're really interested).

SMT is essentially translation memory on steroids; the more data it can be trained on, the better.

Do you remember mathematical induction a bit? This is an analogy as to how different approaches work:

* SMT would learn the expression for a given set of values (1, 2, 3, ... 50).
* RbMT would force the developer figure out the formula for n.
* EbMT would try to figure out the formula by looking at (1, 2, 3, ...).

The thing is, EbMT also requires a corpus. RbMT is normally much more difficult to build (the main argument of SMT purists), yet you can take an old dusty book and figure out the "formulas" from there.

Here's another fact you might find surprising. Early SMT experiments date back roughly to the same time when RbMT development started. Why did it take so long to kick in? Not enough data. Same as today for Tier 2 languages and below.

I don't think anyone (at least from the non-drooling part of the audience) ever disputed that.

4. I would expect a financial translator to base his assumptions about investment policies on his experience and not movie stereotypes.

Just how lucrative, in your opinion, is MT to VCs? How VCs can fund a business in a market as small as MT?

Do you know how little SYSTRAN was making when they were alone online and holding on a to a major European customer who keeps paying no matter what they do?

The only candidate I can think of is Language Weaver, but In-Q-Tel is hardly a classic VC fund.

Now you might be interested to know that the Language Weaver system was planned by Franz Och as well. And hey, Franz actually co-authored papers with Philip Koehn.

How does this fit in your misresearched conspiracy theory?

There is one "easy money" source, but you somehow missed that (why?). This is "all-you-can-spend" eurogrants (no IPOs here). But they usually disappear in the bowels of the likes of Deutsche Telekom, France Telecom, and major European universities. These are the guys I'd like to "snipe" at. How 'bout that?

Vadim Berman said...

(A small request: can you please **not** answer inline?)

Miguel Llorens M. said...

Mr. Berman,

You cite "Native OR BILINGUAL proficiency" for three languages.

This proves either that you are puffing up your CV or that you don't know what the words mean. Either way, if you were applying for a job, your CV would go to the garbage can.

And yet you feel entitled enough to provide your opinion on any public forum without the courtesy of even trying to make the slightest modicum of sense (viz., your latest 2,000-word comment). Moreover, you feel entitled to dictate the terms in which your opinion will be framed. AND IN A FORUM DEVOTED TO LANGUAGE (!).

Sir, you are symptomatic of the degradation of standards in the localization industry. What can I say? I am unimpressed. I feel stimulated to engage people who disagree with me when they have at least a smattering of culture. You do not fit the bill. I bid you a good day.