Sunday, July 3, 2011

Two Notes on the Content Tsunami


Note 1: Check out this blog post from the New York Times: Google’s War on Nonsense.

Note 2: Blogger Michael Wahlster on Translate This! discovered a spammy duplicate of an entry in the Jenner twins’ Translation Times Blog that was created using the technique known as scraping. Except the spammy clone (http://goo.gl/HNi28) in this case has a little twist. In Wahlster’s words:


It would have been business as usual had there not been this wrinkle: It turned out not to be a “simple” scrape after all. Where the original identified my blog as Translate This!, the scrapers had written Interpret That!; instead of Brave New Words they wrote Fearless New Words; instead of Thoughts on Translation it was Ideas on Translation. They also seem to have a law against the word “friend.” It was replaced by “companion.”


My guess is that the software doing the scraping also changes the wording of the original using some sort of very crude algorithm that inserts synonyms randomly. Why does it do this? To keep Google’s crawlers from identifying the page as a clone. If it is identified as a spammy duplicate, the site will be demoted in Google’s search rankings.
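To make that guess concrete, here is a minimal Python sketch of what such a crude synonym-swapper might look like. It is only an illustration of the idea, not the scrapers' actual code: the function name `spin`, the `swap_probability` parameter, and the tiny synonym table (built from substitutions quoted in this post) are all assumptions of mine.

```python
import random
import re

# Hypothetical synonym table. The pairs come from substitutions quoted in
# this post; the scraper's real dictionary is unknown.
SYNONYMS = {
    "friend": ["companion"],
    "below": ["beneath"],
    "favorite": ["preferred"],
    "blogs": ["weblogs"],
    "blog": ["weblog"],
    "readers": ["viewers"],
    "description": ["depiction"],
    "find": ["discover"],
}


def spin(text, swap_probability=0.8, rng=None):
    """Randomly replace known words with a synonym, ignoring context."""
    rng = rng or random.Random()

    def replace(match):
        word = match.group(0)
        options = SYNONYMS.get(word.lower())
        # Leave the word alone if we have no synonym or the coin flip says no.
        if not options or rng.random() >= swap_probability:
            return word
        choice = rng.choice(options)
        # Keep a leading capital so sentence starts still look plausible.
        return choice.capitalize() if word[0].isupper() else choice

    # Visit each run of letters; punctuation and spacing pass through untouched.
    return re.sub(r"[A-Za-z]+", replace, text)


if __name__ == "__main__":
    print(spin("Below is a list of our favorite language blogs."))
```

Because the swap ignores context entirely, a sketch like this will happily rewrite proper names too, which would explain how a blog title ends up mangled.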


Let’s view this in action. Here are two sentences from the Jenner twins’ original blog post:

Below is a list (with links) of 10 of our favorite language blogs, in no particular order, followed by a brief description. We'll then ask our blogging readers to do the same on their blogs (don't forget to link to the blogs so everyone can find them!) and title the blog entry A for Language Blogs.

And here are the “same” sentences from the scraped blog post:


Beneath is a list (with links) of Ten of our preferred language weblogs, in no actual order, followed by a brief depiction. We Are Going To then ask our blogging viewers to do the same on their weblogs (do not forget to link to the weblogs so everybody could discover them!) and title the weblog entry A ? for Language Weblogs.


The sentences are comprehensible, but slightly skewed by the crude intervention of the computer. They are a classic example of the non-human language that is cluttering up the Internet (and the corpus used by language technologists). This is the type of cheap reproduction that the Web is good at, with the little added twist of synonym-switching to slip under the radar of the algorithm change that Google introduced only a couple of months ago, precisely to filter out content farms and this type of useless SEO jockeying. As I said, these people are drug-resistant bacteria.
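Why would synonym-switching help a clone slip past a duplicate filter? I do not know how Google's filter actually works, but one textbook way to spot near-duplicates is to compare overlapping word sequences ("shingles") and measure how many the two texts share. The Python sketch below uses that approach purely as an illustration; the function names and the shingle size of three words are my own assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given length."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b, size=3):
    """Share of shingles the two texts have in common (1.0 means identical)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = ("Below is a list (with links) of 10 of our favorite language "
            "blogs, in no particular order, followed by a brief description.")
scraped = ("Beneath is a list (with links) of Ten of our preferred language "
           "weblogs, in no actual order, followed by a brief depiction.")

if __name__ == "__main__":
    print(f"original vs. itself:  {jaccard(original, original):.2f}")
    print(f"original vs. scraped: {jaccard(original, scraped):.2f}")
```

Every swapped word breaks each shingle it touches, so a handful of substitutions is enough to pull the overlap score well below that of a verbatim copy, even though a human reader sees the same sentence.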


Perhaps Google would do well to look at itself in the future as a waste management company rather than a tech enterprise.

So the next time you hear someone spouting off about the Content Big Bang, you should lean in cheerfully and say: “Oh! You mean the garbage!”

Miguel Llorens is a freelance financial translator based in Madrid who works from Spanish into English. He specializes in equity research, economics, accounting, and investment strategy. To contact him, visit his website and write to the address listed there. Feel free to join his LinkedIn network or to follow him on Twitter.

1 comment:

João Roque Dias said...

Spammer's Whois data:
Roza Sizova
apt 7, 17 Gagarina Str - Russia
Phone:+84.236623718
Email:livapetr@gmail.com