Friday, July 29, 2011

More on the Content Tsunami: Digitized Versions of Public-Domain Books

Any work published in the United States before 1923 is in the public domain. That means the bulk of pre-twentieth-century classics can be downloaded for free from Google Books. However, as historian and librarian Robert Darnton has warned repeatedly from the pages of the New York Review of Books, the Google book digitization project is very different from the process of making a modern scholarly edition of a classic. When you dive into the freebie public-domain editions of Jane Austen and Tirso de Molina, you are pretty much at the mercy of an edition from centuries past, produced before the careful comparison of versions and variants gave birth to the contemporary discipline of textual criticism.

One reader discovered the downside of the Google Books project when he struggled with the severe limitations of the OCR (optical character recognition) technology used to digitize these books (which are, parenthetically, scanned by part-time students, not by literary scholars):

The Google editions were packed with errors. If I were not studying Google Ebooks for professional reasons, if I were not already familiar with the works of Austen, would I have gone on? Would I have thought that Austen does not know how to place quotation marks, that she made grammatical mistakes that would embarrass even a high school freshman, or that her dialogue sometimes breaks off without explanation? I began to wonder what service or disservice Google had performed, rendering one of the world’s most popular writers in a form as bizarre as the Zemblan translation of Shakespeare in Nabokov’s “Pale Fire.”

This is why initiatives by European national libraries to provide more scientific scans of cultural treasures from the past are so important. One hopes that such projects will serve as a corrective to the useful but woefully mangled Google Books initiative. (Hat tip to @jordibal for this reference.)
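To make the mechanics concrete, here is a minimal sketch of the kind of fully automated OCR pass described above. (Google's actual pipeline is proprietary; this sketch assumes the open-source Tesseract engine via the pytesseract Python library, and "page_scan.png" is a hypothetical stand-in for one scanned page.)

```python
# A minimal sketch of an unsupervised OCR pass: scanned image in,
# text out, no human proofreading in between. Assumes Tesseract is
# installed; "page_scan.png" is a hypothetical file, not a real one.
from PIL import Image
import pytesseract

# Run character recognition on the scanned page image.
raw_text = pytesseract.image_to_string(Image.open("page_scan.png"))

# Whatever the engine misreads -- broken quotation marks, dropped
# lines, mangled words -- goes straight into the output.
print(raw_text)
```

Without a human proofreading stage after this step, every misrecognized character ends up in the downloadable ebook, which is precisely how an Austen novel acquires grammatical mistakes that would embarrass a high school freshman.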

More on the Content Tsunami:

I have covered the great cyber-garbage patch—the greatest opportunity for the translation industry since Yahweh cursed the Tower of Babel—here, here and here.

Miguel Llorens is a freelance financial translator based in Madrid who works from Spanish into English. He specializes in equity research, economics, accounting, and investment strategy. He has worked as a translator for Goldman Sachs, the US Government's Open Source Center, several small and medium-sized brokerages, asset management institutions based in Spain, and H.B.O. International. To contact him, visit his website and write to the address listed there. You can also join his LinkedIn network or follow him on Twitter.

2 comments:

Jordi Balcells said...

This is a good parallel between digitisation and translation. Google Books uses a cheap workforce to do the scanning (flipping the pages of a book at a constant rate) and then gets the computer to do all the OCR without any human intervention or later correction.
Compare this to Project Gutenberg. They get the books scanned with a combination of volunteer work and cheap labour funded via donations. Anyway, the post-OCR stage is the important bit here: they do distributed proofreading. Every book used to be the responsibility of a single person, but nowadays a single book is distributed among many people. I'm guessing it's easier to get someone to proofread a single page than a full book.
Project Gutenberg is crowdsourcing with a hint of reCAPTCHA.

It's obvious: fixing machine output with crowdsourced labour works much better than just dumping raw machine output without letting the community improve it. While there's nothing wrong with PG's crowdsourcing, I am not so sure about Google doing the same. PG is a not-for-profit, while Google...

In fact, access to crap content is often worse than no access at all, as the linked TeleRead article implies. I wonder what someone reading non-post-edited, MT'ed Wikipedia articles would think... ;)

Miguel Llorens M. said...

I think that is the key question: Is crap content or crap translation better than no content or no translation? As a sometime victim of MT'ed help documents, I find that it just heightens my contempt for the company that does it. The same goes for Google's refusal to invest in help documentation and its reliance on forum discussions, 99% of which are irrelevant to what one needs. I think Google can get away with it, but I don't think most companies can afford the luxury of irritating their users.