Note 1: Check out this blog post from the New York Times: Google’s War on Nonsense.
Note 2: Blogger Michael Wahlster on Translate This! discovered a spammy duplicate of an entry in the Jenner twins’ Translation Times Blog that was created using the technique known as scraping. Except the spammy clone (http://goo.gl/HNi28) in this case has a little twist. In Wahlster’s words:
It would have been business as usual had there not been this wrinkle: It turned out not to be a “simple” scrape after all. Where the original identified my blog as Translate This!, the scrapers had written Interpret That!; instead of Brave New Words they wrote Fearless New Words; instead of Thoughts on Translation it was Ideas on Translation. They also seem to have a law against the word “friend.” It was replaced by “companion.”
My guess is that the software doing the scraping also changes the wording of the original using some sort of very crude algorithm that inserts synonyms randomly. Why does it do this? To keep from being identified as a clone by Google’s crawlers. If it is identified as a spammy duplicate, the site will be demoted in Google’s search rankings.
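To make that guess concrete, here is a minimal Python sketch of what such a crude synonym-swapper might look like. This is speculation, not the scrapers' actual code: the SYNONYMS table and the spin function are hypothetical, and the table is simply filled with a few substitutions seen in the scraped post.

```python
import random
import re

# Hypothetical synonym table, filled with substitutions seen in the scraped post.
SYNONYMS = {
    "below": ["beneath"],
    "favorite": ["preferred"],
    "blogs": ["weblogs"],
    "blog": ["weblog"],
    "readers": ["viewers"],
    "find": ["discover"],
    "friend": ["companion"],
}


def spin(text, rate=1.0, seed=0):
    """Replace known words with a synonym, each with probability `rate`."""
    rng = random.Random(seed)

    def swap(match):
        word = match.group(0)
        options = SYNONYMS.get(word.lower())
        # Leave the word alone if we have no synonym or the dice say no.
        if not options or rng.random() >= rate:
            return word
        new = rng.choice(options)
        # Keep the capitalisation of the first letter.
        return new.capitalize() if word[0].isupper() else new

    # Only runs of letters are touched; punctuation and spacing stay as they are.
    return re.sub(r"[A-Za-z]+", swap, text)


print(spin("Below is a list of our favorite blogs"))
```

Note that a blind word-for-word swap like this has no idea what a proper noun is, which would explain how a blog title such as Translate This! ends up rewritten.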
Let’s view this in action. Here are two sentences from the Jenner twins’ original blog post:
Below is a list (with links) of 10 of our favorite language blogs, in no particular order, followed by a brief description. We'll then ask our blogging readers to do the same on their blogs (don't forget to link to the blogs so everyone can find them!) and title the blog entry A ♥ for Language Blogs.
And here are the “same” sentences from the scraped blog post (with changes in bold):
Beneath is a list (with links) of Ten of our preferred language weblogs, in no actual order, followed by a brief depiction. We Are Going To then ask our blogging viewers to do the same on their weblogs (do not forget to link to the weblogs so everybody could discover them!) and title the weblog entry A ? for Language Weblogs.
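A quick way to pull out exactly which words were swapped is a word-level diff of the two versions. The sketch below uses Python's standard difflib module; the changed_words helper is my own illustrative name, not part of any tool mentioned here.

```python
import difflib


def changed_words(original, scraped):
    """Return (original, replacement) pairs for every stretch that differs."""
    a, b = original.split(), scraped.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # skip the stretches that were copied verbatim
    ]


for old, new in changed_words(
    "our favorite language blogs", "our preferred language weblogs"
):
    print(f"{old!r} became {new!r}")
```

Feeding it the two full passages above lists every substitution in order.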
The result is comprehensible, but slightly skewed by the crude intervention of the computer. It is a classic example of the non-human language that is cluttering up the Internet (and the corpus used by language technologists). This is the type of cheap reproduction that the Web is good at, with the added twist of synonym-switching to slip under the radar of the algorithm change that Google introduced only a couple of months ago, precisely to filter out content farms and this type of useless SEO jockeying. As I said, these people are drug-resistant bacteria.
Perhaps Google would do well to look at itself in the future as a waste management company rather than a tech enterprise.
So the next time you hear someone spouting off about the Content Big Bang, you should lean in cheerfully and say: “Oh! You mean the garbage!”
1 comment:
Spammer's Whois data:
Roza Sizova
apt 7, 17 Gagarina Str - Russia
Phone:+84.236623718
Email:livapetr@gmail.com