Sunday, February 27, 2011

Another Machine Translation Myth Busted: The Content Tsunami Doesn’t Exist

Last month I wrote a post that criticized several highly questionable theses that pop up frequently in the sales pitches of the hucksters trying desperately to inflate the machine translation (MT) bubble. The three main propositions are:

Condition 1: “Translation budgets are being slashed.”
Condition 2: “Heads of translation departments are being pressured to produce more with less.”
Condition 3: “There has been a cosmic explosion in the amount of content.”

As I wrote last month, singly, these observations are questionable, even if marginally plausible. Together, they form a classic example of “bubble rationalization” well known to anyone who sat through the late 90s tech boom and the mid-“noughties” housing fiasco.

Regarding the third condition, which I nicknamed “The Content Big Bang” or the “Content Tsunami,” my skepticism was motivated by the following observations: 1) it is unlikely that the growth of text slated for commercial translation will sharply outstrip the growth of other goods (especially in a period of economic slowdown); and 2) most of the “content” regurgitated by social networks probably isn’t designed to be translated in the first place (given its ephemeral nature).

And yet the existence of the mythical Content Tsunami, as so many other things, continues to be upheld despite the lack of any concrete evidence. The following fragment from one MT paper is typical of the meme (the bold text is added):

The change from a static, one-way narrow street of Web 1.0 allowing only a passive consumption, to Web 2.0 – a dynamic, two-way highway with its architecture of participation promoting active collaboration on all fronts. Add unimaginable storage and processing capabilities of cloud computing with its data-drive approach and you have technology which is having an enormous impact on the content volume. The content requiring translation, according to Common Sense Advisory, is growing at a rate of 50% a year.

Well, the results are in, folks. It turns out that the Content Tsunami—like the Great Kidnapping Epidemic of the 80s, the 1920s Red Scare, the 1960s Missile Gap, the Eternal Growth in House Prices, the Holy Grail, El Dorado, Dow 36,000, Santa Claus, the Easter Bunny, the Tooth Fairy—doesn’t exist.

Last Saturday’s “Numbers Guy” column in the Wall Street Journal shoots the following deflationary darts at the MT Bubble hype. The piece summarizes a paper published in the latest issue of Science.

While overall data production has grown at 23% a year since 1986, the growth in the volume of written words (the object of commercial translation) was manageable:

“The amount of data stored in books roughly doubled between 1986 and 2007, a period during which the world population increased by about a third. The increase in newsprint was a relatively manageable 91%.”
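
To put the quoted figures on a common footing, they can be converted into yearly growth rates. What follows is a back-of-the-envelope sketch of my own, not anything taken from the Science paper: it assumes smooth compound growth over the 21 years from 1986 to 2007, and the function names are invented for illustration.

```python
def annual_rate(total_multiple: float, years: int) -> float:
    """Constant yearly growth rate that compounds to total_multiple over years."""
    return total_multiple ** (1.0 / years) - 1.0


def total_multiple(rate: float, years: int) -> float:
    """Overall growth multiple produced by a constant yearly rate."""
    return (1.0 + rate) ** years


YEARS = 2007 - 1986  # the period covered by the quoted figures

# Figures quoted above (assumption: growth was smooth over the whole period)
books = annual_rate(2.0, YEARS)               # books "roughly doubled"
newsprint = annual_rate(1.91, YEARS)          # newsprint increased by 91%
population = annual_rate(1.0 + 1 / 3, YEARS)  # population up "about a third"
all_data = total_multiple(0.23, YEARS)        # 23% a year, compounded

print(f"books:      {books:.2%} a year")
print(f"newsprint:  {newsprint:.2%} a year")
print(f"population: {population:.2%} a year")
print(f"all data:   {all_data:.0f}x over {YEARS} years")
```

On these assumptions, books and newsprint each grew at a little over 3% a year, while data as a whole multiplied roughly 77-fold over the same period.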

More importantly, overall figures for the Content Tsunami are misleading because they aggregate content such as video and photos (which are byte-heavy) with words (much lighter). Indeed, much of the explosive growth in production and storage of data is in visual media (most of which doesn’t require translation):

But the digital avalanche isn't as massive as those numbers suggest. Much of the growth reflects the surge in high-resolution video and photos. In addition, while there is much more information available, each piece is being consumed, on average, by far fewer people than in the past.

This observation has an interesting philosophical corollary. People with a scientific outlook, technological determinists and positivists traditionally have felt that natural language is a clumsy tool for storing and communicating knowledge. Criticism of its redundancy (reiteration of plural or singular forms) and “noise” (genders, for example) underlies a lot of the push behind quaint relics of the twentieth century such as Esperanto and logical positivism. However, the scientists who authored the Science paper point out one advantage of natural language over other media:

"You can get a lot of information out of reading a half-megabyte book, compared to watching a one-gigabyte TV show," says Roger Bohn, director of the Global Information Industry Center at University of California, San Diego. Yet in 2007, the world's capacity to store video was about 6,000 times greater, in terms of bytes, than the storage capacity of paper, according to the Science study. That, says Prof. Bohn, is a "testament to how efficient language is for communicating concisely."

In conclusion, the Content Tsunami isn’t happening, or at least it isn’t happening in the sense that the quote from TAUS above seeks to suggest. Therefore, in the future, when you hear some blow-hard spouting off about the content boom, you should chime in cheerfully: "Oh, you mean the porn: lots and lots of porn!"

Of course, don’t think a little thing like facts is going to deter a bubble blower. These people have products to sell and children to feed. So the beat goes on. “Play it, Sam. Play ‘As Time Goes By’.”

Yes, Sam, play it ad nauseam. (Anyway, it’s not like we can stop you... sigh.)

Related Content:

Why the Machine Translation Crowd Hates Google

The Machine Translation Bubble

Machine Translation and the Gigantic Hamster Wheel

Miguel Llorens is a freelance financial translator based in Madrid who works from Spanish into English. He specializes in equity research, economics, accounting, and investment strategy. He has worked as a translator for Goldman Sachs, the US Government's Open Source Center and H.B.O. International, as well as many small and medium-sized brokerages and asset management companies operating in Spain. To contact him, visit his website and write to the address listed there. Feel free to join his LinkedIn network or to follow him on Twitter.

Thursday, February 24, 2011

Zuckerberg of Arabia? Reality Check…

“It was us and the Internet [who liberated Egypt]. George W. Bush and It was Freedom Fries and J-Date. OK, probably not J-Date…”
Jon Stewart

The shibboleth of the week intoned by Web 2.0 fanatics is that Facebook was the magic key that brought down two dictatorships in the Middle East (interestingly, Facebook doesn’t play much of a role in the ongoing reporting of this week’s uprising in Libya, a truly, truly closed society). The strongest proof for this thesis is probably Google executive Wael Ghonim’s effusive praise of Mark Zuckerberg as his inspiration. Oh, right, that and the movie V for Vendetta. Now, you can’t doubt Ghonim’s courage. The man, after all, was imprisoned blindfolded in the cells of the Mubarak political police for two weeks. The mere thought makes me shudder.

The thing is I’m reading cyber-skeptic Evgeny Morozov’s Net Delusion right now. He brings a lot of empirical evidence to challenge the idea that the 2009 Green Revolution in Iran was triggered by Twitter. Reality check: Less than two tenths of one percent of Iranians had Twitter accounts in June 2009. The bulk of Twitter traffic on Iran was in English, not Farsi. The Twitter Revolution is actually an illusion generated among observers in the West who are ready to buy into “Internet-centrism” and “cyber-utopianism” (Morozov’s terms). Not to mention the current social-networking mania that assaults us on a daily basis.

Now comes the Gregorian chant about Facebook and Egypt. I came across this blog post covering a protest in front of the Saudi Arabian Embassy in Tunis (deposed Tunisian dictator Ben Ali took refuge in Saudi Arabia).  The caption reads: “This photo was taken outside of Tunisia’s Saudi Arabian embassy.”

This is empirical evidence of the highest order. The participants in the Tunisian Revolution themselves invoke Mark Zuckerberg as one of the triggers for the democratic revolt.

Hmmm… Look closer at the picture. The people are protesting outside a Saudi Arabian embassy, but not in Tunisia. Look at the sign on the building. I enlarged it (poorly), but you can just make it out: “Embassy of the Kingdom of Saudi Arabia.” An English sign? In Tunisia? A little more googling reveals that they are actually in Washington, D.C. It seems that once a narrative becomes entrenched, it begins to generate its own evidence.

Enlarged sign in background.

Now, I know a thing or two about the origins of the French Revolution. The scholarly consensus, two hundred years after it occurred, is that no one is quite sure of its causes. To think that you can flippantly conclude after two weeks that a Facebook page toppled a sixty-year-old military dictatorship is just typical geek infantility.

Thursday, February 17, 2011

Why the Machine Translation Crowd Hates Google

Jack: You have more sexual hang ups than an adult chat line run by Gilbert Gottfried.
Liz: What?
Jack: That was written by a computer program we're working on to replace you.
(30 Rock)

To get a measure of how insignificant the localization industry is, witness the chasm that separates the amoeba in the machine translation (MT) sector and the Behemoth from Mountain View. The Google MT people are regularly quoted in major news outlets and then you witness the carping that ensues in the blogosphere and twittersphere. You would think that Google people would be lauded as pioneers in the field. By rolling out their engine for free, they have at least placed computerized translation at the forefront of the public’s imagination. You would be wrong.

It is telling to see that not a single Google guy or gal showed up at last November’s Denver Conference of the Association for Machine Translation in the Americas (AMTA). The usual suspects were there: Asia Online, SDL-Language Weaver and a veritable Star Trek convention of academic geeks. Google was not even a corporate sponsor. None of the high-level Google MT execs was a guest speaker. The keynote address was delivered by Paul Bremer, for Chrissake. It is a measure of how far out in the wilderness these people are that they have to pay a former Bush Administration official to visit.

Even more telling is how the slightest critical piece on Google’s translation engine circulates at the speed of light through the blogosphere, especially among agencies and MT specialists. This, of course, is aided and abetted by slightly misinformed freelance translators who have nightmares at least once a week in which Google forecloses on their mortgage.

Google’s people, of course, are blissfully unaware of this. As in so many other fields, they are Moses on the mountain while the Israelites are in the valley building shrines to stones that look a little bit like sheep. I guess Google researchers going to visit the AMTA would be a little like a twenty-first century Homo sapiens going to a Cro-Magnon convention to find viable alternatives to fossil fuel. “So, Mr. Ughhho, what do you have along the lines of fuel cells?” “Ughhho have fire! Fire good! Ughhho powerful!”

To get a feel for how little love the Big G gets on Machine Translation Island, see the reaction to this article in The Guardian. One member of the Google MT team is quoted as saying the following:

Andreas Zollmann, who has been researching in the field for many years and working at Google Translate for the last year, suggests, along with Blunsom, that the idea that more and more data can be introduced to make the system better and better is probably a false premise. "Each doubling of the amount of translated data input led to about a 0.5% improvement in the quality of the output," he suggests, but the doublings are not infinite. "We are now at this limit where there isn't that much more data in the world that we can use," he admits. "So now it is much more important again to add on different approaches and rules-based models."

What this means is that once Google Translate achieves a certain degree of quality (and pay no heed to any bulls**t to the contrary: Google Translate is the state of the art in SMT), the rate of progress reaches a plateau. Improvements are still achieved, but they are painfully gradual compared to the pioneering years of the technology. Whereas five years ago doubling a one-billion-word corpus brought a 100% improvement in quality, nowadays a doubling of a ten-billion-word MT corpus only brings a marginal rate of improvement.
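
One quick way to see how flat that plateau is: take Zollmann's figure at face value and count doublings. This is only a toy sketch. It assumes the 0.5% gain per doubling is constant and additive (measured in percentage points), which the quote does not actually say, and the function names are invented for illustration.

```python
import math

# Assumption: the quoted "about 0.5%" per doubling, read as percentage points
GAIN_PER_DOUBLING = 0.5


def doublings(old_words: float, new_words: float) -> float:
    """Number of times the corpus doubles when growing from old_words to new_words."""
    return math.log2(new_words / old_words)


def quality_gain(old_words: float, new_words: float) -> float:
    """Cumulative gain in points if every doubling adds GAIN_PER_DOUBLING."""
    return GAIN_PER_DOUBLING * doublings(old_words, new_words)


# Hypothetical corpus sizes: one billion words growing to ten billion
print(f"doublings: {doublings(1e9, 1e10):.2f}")
print(f"gain:      {quality_gain(1e9, 1e10):.2f} points")
```

Under those assumptions, a tenfold increase in corpus size (a little over three doublings) buys less than two points of quality.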

To take another example, consider the reaction when another Google executive remarked to an Australian newspaper that "I'd be really careful about having any kind of a sensitive debate with someone either spoken or written using these translations." Whew, that prompted a firestorm!

Despite the myriad things that are questionable about the company, one thing you have to love about The Google is its intellectual honesty. Of course, that honesty is enabled by the fact that its interest in setting up the MT engine in the first place isn’t commercial, but rather strategic. Instead of making pennies selling their MT application to corporates and Joe Schmoe, they decided to make the application available for free in order to expand their corpus. This is typical of the company’s long-term vision. Instead of licensing their invention, they decided to put it out there because any narrowing of the language barrier will broaden the reach of the Internet, which is their real core business. So they simply devote a tiny sliver of their R&D budget to initiatives such as GT.

The Machine Translation Sector Has Been Googled

The thing is that a tiny squirt of Google’s R&D is like Gargantua flooding Paris. And the Parisians can get hopping mad and resentful! Zollmann’s unexceptional pronouncement can trigger a lot of sniping. From the purely clueless to the slyly disingenuous. Or take the whispering campaign about Google Translate and confidentiality issues (you know who you are).

Frankly, I find it all slightly smug and more than a little infuriating. Because it is typical of the intellectual dishonesty that pervades the vulgar push to drive down standards in the translation industry through Web 2.0 idiocy. You see, folks, the thing is that when a Google dude says “our MT engine has reached its limit” what he really means is “machine translation has reached its limit.”

And why is that? Because none of the pygmies sniping at Google can match its content aggregation capabilities. Or its raw (human) brain power.

The other strategy is to insinuate snidely the following: “Well, Google might have reached its limits, but we know a better shortcut down the Yellow Brick Road.” In response, one should entertain the following thought experiment. Let us imagine that there is an Albert Einstein of machine translation. Let’s imagine that he is 23 years old and just graduated from MIT with a double Ph.D. in physics and linguistics. What is more likely: That he works for Google or for a mega-agency that haggles for nickels and dimes with freelancers? I have my own (incredibly biased) answer to that question. I leave you to draw your own.

Another, slightly bizarre, instance of this tendency is when Google began to warn that it would exclude from its search results any pages that had been machine translated, even those that had a smattering of post-editing. This, of course, is a huge bummer for the MT crowd, because its main objective isn’t to provide better machine translation technology. Its real objective is to convince the industry to crowdsource the proofreading of MT drivel by non-professionals.

Google’s epistemological modesty prompts this sort of conspiratorial reaction: “Google just wants us to use only their MT tool.”

Which, in my view, is slightly bizarre. What does Google care whether its free tool is used to machine translate a website? “Aha,” our conspiracy theorist will riposte, “if Google Translate is not used, it cannot enrich its corpus. Right?” Wrong! While the translated website doesn’t enrich the Google corpus immediately, it will make its way onto the Internet and the company’s crawlers will eventually catch it in their nets. After being identified as multilingual content, the site will ultimately go on to further contaminate the SMT corpus.

Google Translate: Massive Party Pooper

Why all this animus? After all, Google is not in direct competition with the MT Lilliputians. “Free” is not in competition with companies that want to license their own SMT applications.

No, it’s not the competition. In a sense, it is something much, much worse. You see, Google squelched in one fell swoop the opportunity for another one of those rounds of incredibly wasteful capital misallocation to which Silicon Valley has treated us over the years. By creating the best MT application and putting it out there for free, Google ensured that many years’ worth of free-spending (and competing) venture-capital-financed startups never got off the ground. In a world without Page and Brin, all of these MT dotcoms would have gotten exactly where we are now in double the time and at several times the cost to society.

Wasteful as they are, these manias are yummy because they create VC-funded millionaires who take seed money and sprinkle it on sportscars and trophy wives. And, lo, how many advisory fees were forgone by investment banks!

No, Virginia, there will never be a machine translation IPO. Or, to put it another way, it already happened. In 2004, when the U.S.S. Google floated on the wide open market seas and sank a lot of paper boat dreams.

The reverberations of the Google bomb are still felt to this day. Any presentation to a venture capitalist of the new, new thing in computerized translation is doomed to fail. The prospect goes home, takes off his tie, opens his laptop and applies the Beta version to his daughter’s French homework. His conclusion: The Next Big MT Thing actually performs pretty much along the lines of something that is free (yuck!).

What are these clowns peddling? Next!

Monday, February 7, 2011

The Egyptian Uprising, Social Media and Cyberutopianism in the Translation Industry

The marvels of communication technology in the present have produced a false consciousness about the past, even a sense that communication has no history.
Robert Darnton

What a week. In addition to the normal workload, my OS had its by now traditional, bi-annual blow-up. Resetting to previous saved configurations from Safe Mode did not work. Ultimately, I was forced to re-install Windows and all my programs, with the consequent hassle of re-configuring every single application to adapt to my own obsessive-compulsive workflow. No data was lost thanks to my double-redundant back-up, both online and on an external hard drive. My online back-up provider re-loaded all my data very quickly. I was really, really impressed (and relieved).

In addition to these little everyday hassles, I was transfixed by the Egyptian uprising. Looking at the scenes of Tahrir Square in the early hours of the morning and hearing the discharges of gunfire from Mubarak supporters brought back a lot of memories of the Venezuelan 2007 student mobilizations against Chavez’s “constitutional” power grab, which was ultimately defeated at the polls, but not without a lot of really dicey moments along the way.

Photo by RamyRaoof, available under a Creative Commons Attribution-Noncommercial license.

To describe the selflessness and courage of the East Germans of 1989 or the Egyptian protesters of today is basically beyond my modest powers of expression. Suffice it to say that the deep emotions that the men and women of Tahrir Square evoke in me have to do, I think, with the epiphanic quality of what their actions reveal.

We live in a relativistic world of moral compromise. I am generally comfortable in that world. I am not religious. Absolute truths, whether political or philosophical or theological, make me nervous.

But events such as the ones witnessed this week in Cairo remind you that, beneath the humdrum rhythm of everyday life, there is perhaps something deep, noble and unquantifiable in the human condition. That men and women do not want to live in oppression. That any form of authoritarian government is obscene. That the apparent acquiescence of the majority is a farce that can melt away in a second. That there is such a thing as good and evil. I generally shy away from such sweeping beliefs, but at the same time it is hard not to think something along those lines when you see citizens volunteering to defend their museums against looters or braving bullets to defend a purely symbolic city square.

In the face of this, the belief that all of this outpouring of breathtaking generosity was enabled by Facebook and Twitter seems at best superficial and at worst distasteful. The issue of social media as a catalyst for democratic movements has been part of the Zeitgeist since the 2009 Green Revolution in Iran and it has returned with a vengeance over the past week and a half. Some of the best commentary that I have read in the past few days challenges the facile assumption that you can crowdsource resistance to oppression. Malcolm Gladwell weighed in on the topic with a blog entry on the New Yorker (I wrote an entry about his longer piece on Twitter and political activism some months back):

People protested and brought down governments before Facebook was invented. They did it before the Internet came along. Barely anyone in East Germany in the nineteen-eighties had a phone—and they ended up with hundreds of thousands of people in central Leipzig and brought down a regime that we all thought would last another hundred years—and in the French Revolution the crowd in the streets spoke to one another with that strange, today largely unknown instrument known as the human voice. People with a grievance will always find ways to communicate with each other. How they choose to do it is less interesting, in the end, than why they were driven to do it in the first place.

Frank Rich also published a column along these lines in The New York Times refuting some of the Web 2.0 hype. But the two pieces that hit closest to home for me were a review last week by Lee Siegel of a book entitled The Net Delusion by Evgeny Morozov and a follow-up blog entry by the same writer discussing Morozov’s book and its pertinence for the Egyptian crisis.

Unexpectedly, a line from the initial review summed up very succinctly some of the puffery and hype surrounding language technology in the translation industry:

Morozov urges the cyberutopians to open their eyes to the fact that the asocial pursuit of profit is what drives social media. “Not surprisingly,” he writes, “the dangerous fascination with solving previously intractable social problems with the help of technology allows vested interests to disguise what essentially amounts to advertising for their commercial products in the language of freedom and liberation.”

Some very well-publicized initiatives that combine machine translation technology and crowdsourcing very skillfully conceal commercial interests behind noble humanitarian objectives about the dissemination of knowledge. The ridiculous drumbeat in the media over the past ten days that Mark Zuckerberg is bringing democracy to the bandaged heads of Tahrir Square brought into sharp relief more local concerns for me. These concerns arise when I listen to some of the hucksterism of L10N cyberevangelists.

I haven’t read Morozov’s book yet, but you can be pretty sure that I’m gonna.
