One reader discovered the downside of the Google Books project as he struggled against the severe limitations of the OCR (optical character recognition) technology used to digitize these books (which are, parenthetically, scanned by part-time students and not by literary scholars):
The Google editions were packed with errors. If I were not studying Google Ebooks for professional reasons, if I were not already familiar with the works of Austen, would I have gone on? Would I have thought that Austen does not know how to place quotation marks, that she made grammatical mistakes that would embarrass even a high school freshman, or that her dialogue sometimes breaks off without explanation? I began to wonder what service or disservice Google had performed, rendering one of the world’s most popular writers in a form as bizarre as the Zemblan translation of Shakespeare in Nabokov’s “Pale Fire.”
This is why initiatives by European national libraries to provide more scientific scans of cultural treasures from the past are so important. Hopefully, projects of this kind will provide a corrective to the useful but woefully mangled Google Books initiative. (Hat tip to @jordibal for this reference.)
More on the Content Tsunami:
I have covered the great cyber-garbage patch—the greatest opportunity for the translation industry since Yahweh cursed the Tower of Babel—here, here and here.
Miguel Llorens is a freelance financial translator based in Madrid who works from Spanish into English. He specializes in equity research, economics, accounting, and investment strategy. He has worked as a translator for Goldman Sachs, the US Government's Open Source Center, several small-and-medium-sized brokerages, asset management institutions based in Spain, and H.B.O. International. To contact him, visit his website and write to the address listed there. You can also join his LinkedIn network or follow him on Twitter.
This is a good parallel between digitisation and translation. Google Books uses cheap labour to do the scanning (flipping the pages of a book at a constant rate) and then gets the computer to do all the OCR without any human intervention or later correction.
Compare this to Project Gutenberg. They get the books scanned with a combination of voluntary work and cheap workforce funded via donations. Anyway, the OCR process is the important bit here: they do distributed proofreading. Every book used to be the responsibility of a single person, but nowadays a single book is distributed among many people. I'm guessing it's easier to get someone to proofread a single page than a full book.
Project Gutenberg is crowdsourcing with a hint of reCaptcha.
It's obvious: fixing machine output using crowdsourced labour works much better than just dumping raw machine output without letting the community improve it. While there's nothing wrong with PG's crowdsourcing, I am not too sure about Google's doing the same. PG is a not-for-profit, while Google...
In fact, access to crap content is often worse than no access at all, as the linked TeleRead article implies. I wonder what someone reading non-postedited MT'ed Wikipedia articles would think... ;)
I think that is the key question: Is crap content or crap translation better than no content or no translation? As a sometime victim of MT'ed help documents, my feeling is that it just heightens my contempt for the company that does it. The same goes for Google's refusal to invest in help documentation and its reliance on forum discussions, which are 99% irrelevant to what one needs. I think Google can afford to go that way, but I don't think most companies can afford the luxury of irritating users.