Sunday, February 27, 2011

Another Machine Translation Myth Busted: The Content Tsunami Doesn’t Exist

Last month I wrote a post that criticized several highly questionable theses that pop up frequently in the sales pitches of the hucksters trying desperately to inflate the machine translation (MT) bubble. The three main propositions are:

Condition 1: “Translation budgets are being slashed.”
Condition 2: “Heads of translation departments are being pressured to produce more with less.”
Condition 3: “There has been a cosmic explosion in the amount of content.”

As I wrote last month, singly, these observations are questionable, even if marginally plausible. Together, they form a classic example of “bubble rationalization” well known to anyone who sat through the late 90s tech boom and the mid-“noughties” housing fiasco.

Regarding the third condition, which I nicknamed “The Content Big Bang” or the “Content Tsunami,” my skepticism was motivated by the following observations: 1) it is unlikely that the growth of text slated for commercial translation will sharply outstrip the growth of other goods (especially in a period of economic slowdown); and 2) most of the “content” regurgitated by social networks probably isn’t designed to be translated in the first place (given its ephemeral nature).

And yet the existence of the mythical Content Tsunami, like so many other things, continues to be upheld despite the lack of any concrete evidence. The following fragment from one MT paper is typical of the meme (the bold text is added):

The change from a static, one-way narrow street of Web 1.0 allowing only a passive consumption, to Web 2.0 – a dynamic, two-way highway with its architecture of participation promoting active collaboration on all fronts. Add unimaginable storage and processing capabilities of cloud computing with its data-drive approach and you have technology which is having an enormous impact on the content volume. The content requiring translation, according to Common Sense Advisory, is growing at a rate of 50% a year.

Well, the results are in, folks. It turns out that the Content Tsunami—like the Great Kidnapping Epidemic of the 80s, the 1920s Red Scare, the 1960s Missile Gap, the Eternal Growth in House Prices, the Holy Grail, El Dorado, Dow 36,000, Santa Claus, the Easter Bunny, the Tooth Fairy—doesn’t exist.

Last Saturday’s “Numbers Guy” column in the Wall Street Journal shoots the following deflationary darts at the MT Bubble hype. The piece summarizes a paper published in the latest issue of Science.

While overall data production has grown at 23% a year since 1986, the growth in the volume of written words (the object of commercial translation) was manageable:

“The amount of data stored in books roughly doubled between 1986 and 2007, a period during which the world population increased by about a third. The increase in newsprint was a relatively manageable 91%.”
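To put those figures on the same footing as the "50% a year" claim quoted earlier, here is a rough back-of-the-envelope sketch in Python. It is my own arithmetic, not something taken from the Science paper or the WSJ column. It assumes steady compound growth, treats 1986 to 2007 as 21 years, and assumes the 91% newsprint figure covers that same period (the quote does not say so explicitly). The function names are made up for illustration.

```python
# Back-of-the-envelope comparison of growth rates (illustrative sketch only).
# Assumptions: constant compound growth; 1986-2007 counted as 21 years;
# the 91% newsprint increase is assumed to span the same 21 years.

YEARS = 2007 - 1986  # 21


def annualized_rate(total_multiple, years):
    """Constant yearly growth rate that yields `total_multiple` after `years`."""
    return total_multiple ** (1.0 / years) - 1.0


def total_multiple(annual_rate, years):
    """Overall multiple reached after `years` of growth at `annual_rate`."""
    return (1.0 + annual_rate) ** years


# Figures quoted in the column, expressed as overall multiples.
books = annualized_rate(2.0, YEARS)       # "roughly doubled"
newsprint = annualized_rate(1.91, YEARS)  # "91%" increase

# What the headline rates would imply if sustained over the same 21 years.
all_data = total_multiple(0.23, YEARS)    # 23% a year, all data
claimed = total_multiple(0.50, YEARS)     # 50% a year, the claimed translation growth

print(f"Books:     {books:.1%} per year")
print(f"Newsprint: {newsprint:.1%} per year")
print(f"23% a year for {YEARS} years: about {all_data:,.0f}x")
print(f"50% a year for {YEARS} years: about {claimed:,.0f}x")
```

Under those assumptions, books and newsprint each work out to a little over 3% a year, while a sustained 50% a year would mean content multiplying several thousand times over in the same two decades. The two sets of numbers are not measuring the same thing, of course, but the gap gives a sense of how far the sales-pitch figure sits from the growth of the written word.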

More importantly, overall figures for the Content Tsunami are misleading because they aggregate content such as video and photos (which are byte-heavy) with words (much lighter). Indeed, much of the explosive growth in production and storage of data is in visual media (most of which doesn’t require translation):

But the digital avalanche isn't as massive as those numbers suggest. Much of the growth reflects the surge in high-resolution video and photos. In addition, while there is much more information available, each piece is being consumed, on average, by far fewer people than in the past.

This observation has an interesting philosophical corollary. People with a scientific outlook, technological determinists and positivists traditionally have felt that natural language is a clumsy tool for storing and communicating knowledge. Criticism of its redundancy (reiteration of plural or singular forms) and “noise” (genders, for example) underlies a lot of the push behind quaint relics of the twentieth century such as Esperanto and logical positivism. However, the scientists who authored the Science paper point out one advantage of natural language over other media:

"You can get a lot of information out of reading a half-megabyte book, compared to watching a one-gigabyte TV show," says Roger Bohn, director of the Global Information Industry Center at University of California, San Diego. Yet in 2007, the world's capacity to store video was about 6,000 times greater, in terms of bytes, than the storage capacity of paper, according to the Science study. That, says Prof. Bohn, is a "testament to how efficient language is for communicating concisely."

In conclusion, the Content Tsunami isn’t happening, or at least it isn’t happening in the sense that the quote from TAUS above seeks to suggest. Therefore, in the future, when you hear some blowhard spouting off about the content boom, you should chime in cheerfully: "Oh, you mean the porn: lots and lots of porn!"

Of course, don’t think a little thing like facts is going to deter a bubble blower. These people have products to sell and children to feed. So the beat goes on. “Play it, Sam. Play ‘As Time Goes By’.”

Yes, Sam, play it ad nauseam. (Anyway, it’s not like we can stop you... sigh.)

Related Content:

Why the Machine Translation Crowd Hates Google

The Machine Translation Bubble

Machine Translation and the Gigantic Hamster Wheel

Miguel Llorens is a freelance financial translator based in Madrid who works from Spanish into English. He specializes in equity research, economics, accounting, and investment strategy. He has worked as a translator for Goldman Sachs, the US Government's Open Source Center and H.B.O. International, as well as many small and medium-sized brokerages and asset management companies operating in Spain. To contact him, visit his website and write to the address listed there. Feel free to join his LinkedIn network or to follow him on Twitter.
