Tuesday, February 28, 2012

Seth Godin, Sturgeon’s Law, and the Content Tsunami

(Homer hits zombie on the head with a book.)
Lisa: Dad! Wait! Stop! That’s the last book in the world!
(They look at the cover. It is the memoirs of Arsenio Hall.)
Lisa (changing her mind): Ahh… knock yourself out…
(Homer crushes the zombie’s skull with the book.)
--The Simpsons, “Treehouse of Horror XX”

Seth Godin writes in a blog post that, as a teenager, he read all 250 science-fiction books in his high school library (“From Asimov to Zelazny”). He reflects that reading the entire corpus of a single genre is impossible nowadays because of (say it with me) the Content Tsunami:
As the deluge of information grows and choices continue to widen (there's no way I could even attempt to cover science fiction from scratch today, for example), it's easy to forget the benefits of acquiring this sort of (mostly) complete understanding in a field.
Now, Godin is well worth reading, but he is a Web 2.0 hyper and a naïve technology millennialist if ever there was one. I have news for you, Seth. Even in the 1970s, reading 250 sci-fi books barely scratched the surface of the genre.

Godin’s comment is typical of how techies erroneously view the current changes in media: “People used to read before, but the book is dying; we are being buried under a mountain of written material broadcast by the Web; everything is changing quickly; some day soon we will have Google implants in our frontal cortex; the keyboard will be relegated to the dustbin of history.” In this specific example, the belief is that, once upon a time—usually thirty or twenty or forty years ago (fill in the blank)—there was much less reading material and one solitary reader could obtain a first-hand view of all the literature in any given field.

That is just not the case. And it hasn’t been the case for a very long (looooong) time. Nowhere is this truer than in the realm of literature. Any reader in 1850 who wanted to read all the novels published in Great Britain up to then would have needed three or four lifetimes to do so. This challenge has always daunted literary criticism. You see, even the most erudite critic has not read the whole of even first-line literary classics. Many nineteenth and twentieth century specialists don’t have that much regard for the classics of previous centuries. I have personally ascertained that Old English specialists at leading universities do not have a lot of time to read the latest 800-page brick from Jonathan Franzen or Haruki Murakami. The humanities—like science—are becoming increasingly compartmentalized. That is undeniable (and, incidentally, also one of the factors behind The Great Stagnation). However, my point is more wide-ranging than that.

My point is that even people specialized in relatively narrow periods (say, the British novel from the 1850s to 1914) are still facing a Content Tsunami. The Victorian scholar who wants to read all of the stuff vomited by 19th-century printing presses will still find a mountain to climb. Even the Victorian specialist really only bases her sweeping theses on a modest “sampling” of all the stuff produced in the Age of the Novel. The Content Tsunami has always been with us. That is why gifted readers such as Harold Bloom or Umberto Eco tower above the rest of us (not just because of their acumen but also because of their sheer ability to digest mountains of books). And even they can sometimes look a little amateurish when straying out of their fields of expertise. Personally, I’m not a big fan of Bloom, but he obviously has read everything he discusses. However, when he adds the odd Latin American author to his indigestible books on the Western Canon, it is easy to see how uncomfortable he is outside of his comfort zone. I mean, he probably read Vargas Llosa, but the level of enthusiasm simply is not there.

As an undergraduate assistant, I helped the philosophy department index purchases made for the library. It included many (many) tomes of Harvard University Press’s Loeb Classical Library. If you thought classical literature only produced Homer, Virgil, the Athenian dramatists and a few fragmentary poets (basically what I was taught in Literature of Latin and Greek Antiquity), boy, were you wrong! Those people in the pre-Christian era didn’t have much papyrus or abundant ink, but they sure had a lot of time on their hands! Reading the whole of what was written just in Western Europe before the fall of Rome or the birth of Charlemagne would take up a good chunk of your life.

Readers have always been tiny little wanderers upon a Himalaya of linguistic output. People who have not studied the humanities are astounded by the current masses of text produced by other human beings, but that sense of awe stretches back much farther (perhaps in oral cultures there was also a Content Tsunami; maybe somewhere there was a hunter-gatherer oppressed by the sheer amount of epic poetry he had to listen to). It is a sign of ignorance to think that this is a new phenomenon. Think about this for a moment: What we have inherited from Antiquity is merely the tip of the iceberg of what was actually written. This tip (all the Ancient writing still extant) is all that is left of the many hundreds and hundreds of volumes that were lost down the ages in that great chain of transmission (and destruction) from Greece to Rome to early Islamic culture to Medieval Spain, up through the Italian Renaissance and beyond. A lot of stuff was considered too insignificant or erroneous or blasphemous and was either obliterated or recycled (see: palimpsests). This creates an instance of what economists call “survivor bias.” Seeing only what survived, the time-bound technologist thinks that readers in Antiquity didn’t have to exercise critical faculties in deciding what was worth reading.

The challenges that search engines seek to solve have always been with us. There is even an entirely new discipline, pioneered by literary critic Franco Moretti, that seeks to map out massive corpora of novels statistically in order to run them through computers. The hope is that this will provide insights that are unattainable by individual readers. As Moretti tells it:
“A canon of 200 novels, for instance, sounds very large for 19th-century Britain (and is much larger than the current one), but is still less than 1 per cent of the novels that were actually published: 20,000, 30,000, more, no one really knows -- and close reading won't help here, a novel a day every day of the year would take a century or so.”
Obviously, there is a geometrical progression in the amount of written material. That is undeniable. But that progression (and the accompanying anxiety) has always been with us. To extrapolate from this geometrical progression to make trendy assumptions about business or culture is superficial. Seth Godin might have read 250 sci-fi books as a pimply high-school freshman but, even then, the corpus of fantasy and futurist fiction was far larger than that. Godin only swallowed a tiny sampling, impressive as it may be.
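Moretti's back-of-the-envelope figures are easy to check. The sketch below is only an illustration: the function names are mine, and the 20,000 and 30,000 totals are the rough estimates from the quote, not measured counts.

```python
def years_to_read(novels: int, novels_per_day: float = 1.0) -> float:
    """Years needed to read `novels` books at a steady daily pace."""
    return novels / (novels_per_day * 365)


def canon_share(canon_size: int, novels_published: int) -> float:
    """Fraction of everything published that the canon covers."""
    return canon_size / novels_published


# Moretti's rough estimates of 19th-century British novels (assumed figures).
for total in (20_000, 30_000):
    years = years_to_read(total)
    share = canon_share(200, total)
    print(f"{total} novels: {years:.0f} years at one a day; "
          f"a 200-novel canon covers {share:.2%}")
```

At one novel a day, the low estimate already takes more than half a century and the high one more than eighty years, which is roughly the "century or so" Moretti mentions.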

In fact, it was science fiction that gave rise to an indispensable analytical tool for discussing the Content Tsunami: Sturgeon’s Law. When someone remarked that “ninety percent of science fiction is crap,” Sturgeon famously retorted that “using the same standards that categorize 90% of science fiction as trash, crud, or crap, it can be argued that 90% of film, literature, consumer goods, etc. are crap.” Brilliant! Sturgeon's quip penetrates to the heart of any discussion about the mountains of text that surround us. It reminds us that Godin is wrong on two counts: 1) he is wrong about the past, in believing that he read most of the science-fiction corpus up to that time; and 2) he is wrong about the present, in believing that the daunting task of managing all the information that flows our way is qualitatively different from what has been the task of the literate person from Parmenides to the present.

The challenge is not in how to “manage” the deluge of “information” that washes all over us. That is a phantom that only exists in the minds of people not trained in the art of critical thinking. The point of education is basically to provide us the skills that guide us through the ocean of sentences produced by fellow human beings. Real value lies in identifying the 10% (or, IMHO, less than 10%) of the Content Tsunami that is worth reading (or translating). Real value lies in helping others discover the margarita in the midst of the refuse produced by the porcos. The belief that this task can be delegated to a search algorithm or a translation engine says more about you than about the real world.

Miguel Llorens is a freelance financial translator based in Madrid who works from Spanish into English. He is specialized in equity research, economics, accounting, and investment strategy. He has worked as a translator for Goldman Sachs, the US Government's Open Source Center, and H.B.O. International. To contact him, visit his website and write to the address listed there. You can also join his LinkedIn network by visiting the profile or follow him on Twitter.

Monday, February 13, 2012

Machine Translation, "The Shining," and My Crush on a Faceless Canadian Bureaucrat

Homer: So what do you think, Marge? All I need is a title. I was thinking along the lines of "No TV and No Beer Make Homer" something something.
Marge: [timid] "Go Crazy"?
Homer: Don't mind if I do!
--The Simpsons, "Treehouse of Horror V"

Let me begin with an impossibly obscure instance of centralbankese. A policy statement is the kind of text that “Low Quality Translation” theorists say is ripe for computerization because it is repetitive. Check out this little quatrain, taken from a recent statement put out by the Bank of Canada:

To the extent that the expansion continues and the current material excess supply in the economy is gradually absorbed, some of the considerable monetary policy stimulus currently in place will be eventually withdrawn, consistent with achieving the 2 per cent inflation target.

After puzzling over this monstrosity for several minutes, I realized that most of the cognitive dissonance for me came from “current material excess supply” (it just rolls off the tongue, doesn’t it?). Let us run it through Google Translate:

"En la medida en que la expansión continúa y el suministro de material excedente actual en la economía se absorbe poco a poco, algunas de las medidas de estímulo monetario considerable actualmente en vigor con el tiempo se retiró, en consonancia con el logro de la meta de inflación del 2 por ciento."

GoogleT is correct in providing “en la medida en que” (I would have used “en la medida que,” but you have to give some points to Google Translate), although the verb "continúa" needs changing to the subjunctive "continúe." All in all, the first snippet of the MT version is OK. But the GoogleT sentence gets more and more garbled after “current material excess supply.” What the phrase means, of course, is “current and significant excess supply.” “Material” here is an adjective (GoogleT guesses incorrectly that it is a noun). But even if GoogleT had guessed right, it would still have a mountain to climb. The question here is: “excess supply” of what? Is it money supply or unused capacity or unsold inventory? Or all of them? My hunch is that it refers to the supply of money, but that is a point I would have to research further before deciding.

The “Crap Quality” ideologue believes that this messy human element (the undecidability of this hieroglyph) can be elided in the case of “non-literary” texts. And the solution goes something like what I am about to describe. Imagine, for example, that we take all of the central bank policy statements written since the dawn of time (yes, back even to the times of the Roman Imperial Central Bank) and take all of the translations of those statements and upload them into the Universal Central Bank Policy Statement Translation Engine (the UCBPSTE©). Chop it all up and run it through the algorithm.

Is the translation still deficient? “No problemo,” as Americans say. Just take every single text that has ever been translated since the time of Cain with its original and cram both into the engine. What? Still garbage? Nothing doing. Let’s get large crowds of hamsters to translate all manner of texts to enrich the database. Eventually, given enough data (pant, pant), we will have an answer for our troublesome “current material excess supply.” 

I am here to tell you that this is unlikely. That forgets two problems. The first (and lesser one) is that central bank policy statements are willfully obscure. The Canadian bureaucrat who wrote the unfortunate “current material excess supply” is not just a bad writer. He practices to be a bad writer. He will be cranking out impenetrable hieroglyphs until the universe stops expanding and collapses back upon itself and we are all crushed as matter expands further and further to create an infinitely cold, infinitely empty void.

The second (and more important) obstacle is the lack of repetitiveness in something as complex as language and history (two areas not bound by strict rules). The “Crap Quality” people talk about the Content Tsunami, right? That means there should be a large enough (and ever increasing) corpus to solve most translation conundrums. To gauge the scale of the challenge, let us ask ourselves how many times in the history of humanity have the words “current material excess supply” been strung together by a hominid with opposable thumbs. Let’s do a Boolean search on the Internet using quotation marks to get exact hits for the phrase.

Aha! We get 21,000 hits. This bodes pretty well for computer-capturable repetitiveness, since it increases the likelihood that this tough nut of a phrase was translated.

Alas! Our “Crap Quality” theorist will feel a distinct pang of disappointment. And perhaps an extreme terror, not unlike Jack Nicholson’s wife in The Shining as she inspects her hubby’s literary handiwork. Yes, it is as if the Internet has been maniacally scribbling “All work and no play makes Jack a dull boy” over and over again. (Oh, faceless Canadian bureaucrat, I love you!) The 21,000 instances of “current material excess supply” come from the same sentence quoted 21,000 times. The point is (if it needs explanation) that “current material excess supply” is made up of non-problematic components (taken singly); the problem is that when you string them together, they create a unit that is not susceptible to statistical deciphering. The further point (and the objection to “Lower Quality Translation”) is that there are potentially infinite strings like “current material excess supply” out there, unspoken. And they are far more frequent than engineers who have never translated a word in their life think. So a rose is a rose is not a rose.
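The gap between raw hits and distinct sources can be illustrated with a few lines of code. This is only a sketch: the snippets are hypothetical stand-ins for search results, and the helper names are mine. The idea is to normalize each hit and count how many different sentences actually contain the phrase.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and drop punctuation/extra spaces so reposts compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def distinct_contexts(snippets, phrase):
    """Map each distinct normalized sentence containing `phrase` to its hit count."""
    key = f" {normalize(phrase)} "
    counts = Counter()
    for snippet in snippets:
        norm = normalize(snippet)
        # Pad with spaces so the phrase only matches on whole-word boundaries.
        if key in f" {norm} ":
            counts[norm] += 1
    return counts


# Hypothetical search hits: three reposts of one sentence plus an unrelated one.
hits = [
    "The current material excess supply in the economy is gradually absorbed.",
    "the current  material excess supply in the economy is gradually absorbed",
    "THE CURRENT MATERIAL EXCESS SUPPLY IN THE ECONOMY IS GRADUALLY ABSORBED!",
    "Excess supply of material is current.",
]

contexts = distinct_contexts(hits, "current material excess supply")
print(f"{sum(contexts.values())} hits, {len(contexts)} distinct sentence(s)")
```

However many hits the search reports, a statistical engine can learn from only as many examples as there are distinct sentences, and here there is one.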

Wednesday, February 8, 2012

The Visionary Has a Vision

perogrullada:  f platitude, truism, obvious thing to say.
--Oxford Spanish Dictionary 

Prophet III: There shall in that time be rumours of things going astray. Ehm...and there shall be a great confusion as to where things really are. And nobody will really know where lieth those little things wi...with a sort of raffia work base, that has an attachment. At this time, a friend shall lose his friend's hammer, and the young shall not know where lieth the things possessed by their fathers, that their fathers put there only just the night before, 'bout eight
—Monty Python’s Life of Brian (1979)

The localization visionary awoke suddenly. In a trance, he began to discuss Apple’s latest quarterly results. It came to him as if in a vision that Apple sells a lot of iPads in non-English speaking countries. His brow furrowed. “There has to be some sort of profound insight here,” he muttered to himself. The disciples huddled around him in tense anticipation as the prophet mulled the vision inside his febrile, God-inspired mind. Until, finally, it came to him:

“Successful companies, like Apple, need a sound localization strategy to succeed in the global economy.”

Miguel Llorens is a freelance financial translator based in Madrid who works from Spanish into English. He is specialized in equity research, economics, accounting, and investment strategy. He has worked as a translator for Goldman Sachs, the US Government's Open Source Center, and H.B.O. International, as well as many small-and-medium-sized brokerages and asset management companies operating in Spain. To contact him, visit his website and write to the address listed there. Feel free to join his LinkedIn network or to follow him on Twitter.

Thursday, February 2, 2012

The Content Tsunami Hits the Shores of the Iberian Peninsula

The amount of content is exploding like the Big Bang, we are told by the intellectual midgets who speak at localization conferences. Really? If the amount of content is expanding exponentially, why are so many people paying peanuts to other people to create more low quality content? Wake up, people. There is no Content Tsunami! There is a Data Deluge, but content is not data. Content is text, which is human-made and meaningful in itself. There is a deluge of economic, astronomical and demographic data, but all of that is meaningless outside of a context. A text, in contrast, is meaningful outside of any context as long as there is another human being left alive to read it. Data. Content. The two things are radically different. The localization guru’s willful ignorance of this distinction is just a dramatic illustration of his lack of intellectual honesty (and his hunger to make a quick buck and get his hands on that trophy third wife).

The need to create mountains of cheap content is real, but it has very little to do with any mythical Content Tsunami. It is more to do with some of the weird and quirky ways in which the Internet is organized. For whatever reason, the Lords of the Cloud (read: “the Googlevi Twins”) have decided that certain arbitrary aspects of a website are indicative of its importance and should therefore be used to determine its position in a Web search. Those features are basically two: amount of textual content and frequency of updating.

And presto, with that simple formula, you have the recipe for a lot of crap content. Moreover, you have an incentive (Milton Friedman, hello!) for creating a lot of crud that, like the aborted demon-spawn of Ragnarok and Sauron, should never have seen the light of day. The Low Quality Translation Movement is simply the localization industry's arm of the Content Tsunami. Its main get-rich-quick scheme is to sell cheap translation as the answer to cheap content and (crucially) to suck the entire translation industry into this model of second-rate garbage under the cloak of technological progress. But I preach in vain. I can see Kirti Vashee rolling his eyes and raising his hands in exasperation: "There are even people who deny the existence of a Data Deluge!" Translation: "See!? See!? You see the kind of crap I have to deal with!?"

That is why I am so relentless in going after the l10n hype-meisters who endlessly lecture us about the Content Tsunami. The latest example of this drive to create rivers of meaningless content comes from Spain. A journalist answered an advertisement for creating online content and received an offer you just can’t refuse. It was 0.75 euros (75 euro cents) for writing 800-word pieces. Yes, you read right. Not 0.75 euros per word. No. Less than one euro for 800 words. That is 0.0009375 euros per word, or less than a tenth of a euro cent. Well, in the year that indignados became a worldwide buzzword, the journalist decided to go online to complain about this. Needless to say, the hashtag #gratisnotrabajo (“I don’t work for free”) became a trending topic for a couple of days on Twitter.

Here is my translation of the job ad: “Journalist wanted. Compensation is €0.75 per article, which must contain a minimum of 800 words.”

But wait… there’s more (and this is my favorite part): “Texts will be subject to certain conditions of quality control—spelling, punctuation, semantics and expression.”

I just love that. We are paying less than a ride on the Madrid Metro for 800 words, but your texts will be subject to quality constraints. Seriously, if the objective is to write large amounts of crap content, why don’t we just get computers to do it? Lackuna, maybe there is a fortune in it for you.
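For anyone who wants to check the per-word rate, here is a minimal sketch using the figures in the ad (€0.75 for a minimum of 800 words). The function name is mine, and `Decimal` is used only to keep the arithmetic exact.

```python
from decimal import Decimal


def per_word_rate(fee_eur: Decimal, words: int) -> Decimal:
    """Fee per word, in euros, for an article of `words` words."""
    return fee_eur / words


# Figures taken from the job ad: EUR 0.75 per article, 800 words minimum.
rate_eur = per_word_rate(Decimal("0.75"), 800)
rate_cents = rate_eur * 100

print(f"{rate_eur:f} euros per word")
print(f"{rate_cents:f} euro cents per word")
```

Since 800 words is a minimum, the real per-word rate can only go down from there.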
