{"id":67,"date":"2010-11-30T21:09:54","date_gmt":"2010-11-30T21:09:54","guid":{"rendered":"http:\/\/jackofallblogs.com\/?p=67"},"modified":"2010-12-02T01:50:59","modified_gmt":"2010-12-02T01:50:59","slug":"why-we-dont-worry-about-every-scraper","status":"publish","type":"post","link":"https:\/\/jackofallblogs.com\/networking\/why-we-dont-worry-about-every-scraper\/","title":{"rendered":"Why We Don’t Worry About Every Scraper"},"content":{"rendered":"

In addition to being a writer for Splashpress, including on BloggingPro, Freelance Writing Jobs and now Jack of All Blogs, one of my responsibilities for the company is copyright and plagiarism enforcement. I help monitor where Splashpress Media content is being used and, when appropriate, secure its removal or, at the very least, its banning from the search engines.

While these are all tasks I handle routinely in my work as a copyright and plagiarism consultant, I'm happy to say that Splashpress has adopted a practical policy on scrapers, one that allows it to enforce its rights without having to pursue every single case of infringement, no matter how trivial.

The company recognizes that, as an organization with over 100 sites and 200 people, it is impractical, especially on a reasonable budget, to target every single scraper or spammer that misuses its content, let alone every attributed reuse by a human.

As such, the company is selective about whom it pursues, focusing its resources on the infringers who cause the most damage and create the biggest headaches.

While the process is relatively simple, understanding it requires a basic grasp of how search engines handle duplicate content and why not every case of infringement is worth acting on. Most importantly, it shows how any blogger can turn spammers to their advantage and make duplicate content work as free advertising.

Understanding Duplicate Content

As I recently explained on BloggingPro, duplicate content occurs when the same or very similar content appears on multiple pages, whether on the same site or across different domains.

Search engines don't like this because they want to showcase a wide array of original material with every search result. As such, they do their best to detect duplicate content, determine which URL is the best and/or original, and then penalize the others carrying the same content. To do this, they use a variety of signals, including which URL was posted first and which page has the most inbound links.
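To make that concrete, here's a minimal sketch in Python of the kind of heuristic involved. The field names and the two signals are illustrative assumptions drawn from the paragraph above, not Google's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    first_seen: float    # Unix timestamp when a crawler first saw this URL
    inbound_links: int   # number of known links pointing at this URL

def pick_canonical(duplicates: list[Page]) -> Page:
    """Pick which copy of duplicated content to treat as the original.

    Illustrative heuristic only: prefer the earliest-seen URL and break
    ties with inbound-link count. Real engines weigh many more signals.
    """
    return min(duplicates, key=lambda p: (p.first_seen, -p.inbound_links))

# The scraped copy shows up hours later with almost no links pointing
# at it, so the original wins on both signals.
original = Page("https://example-blog.com/post", first_seen=1_291_150_000, inbound_links=42)
scraped = Page("https://scraper.example/post", first_seen=1_291_164_400, inbound_links=1)
assert pick_canonical([original, scraped]) is original
```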

Fortunately, Google does a decent job of detecting which page is the original. Though spammers will scrape the contents of a site, usually via its RSS feed, and republish it on their blogs, they rarely fool the search engines. With few inbound links and URLs that appear hours after the fact, their copies are seldom mistaken for the original work.

This doesn't mean it can't happen, which is why I do have to step in on rare occasions. However, realizing that most scrapers aren't hurting the sites they lift from, no matter how hard they try, frees up Splashpress to focus on creating new, interesting content and on entertaining and informing its readers.

It's a win/win, but it wouldn't be possible without a few additional steps.

Preventing Scrapers from Mattering

\"DartJust because most scrapers and other plagiarists don’t usually impact their original sites doesn’t mean it can’t or won’t happen. As such, we take proactive measures to prevent that from happening and decrease the chances of them hurting our sites.<\/p>\n

One of the key steps we take is an internal linking editorial policy. Almost every article, when appropriate, links to a relevant article on the same site or, if one isn't available, on another Splashpress property.

The reason for this is simple: in addition to directing readers to relevant content elsewhere on the network, these links are also picked up by spammers and republished along with the content. Those links, in turn, become valid inbound links that search engines pick up and place value on.

This has two very important effects, illustrated in the sketch after the list:

1. Increases page rankings: By generating more inbound links, the spammers are actually helping the sites involved rank better. Though the value of a link from a spammer is likely minimal, it is still a help and, given how often some of our content is lifted, can add up across so many sites.
2. Prevents duplicate content penalties: By linking back to the original site, the spammer is essentially "voting" for it, and search engines weigh that vote when determining which version is the original. This helps ensure the original isn't accidentally penalized as a duplicate.
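As a rough illustration of the mechanics, the sketch below appends a link back to the original post to a feed item's HTML before the feed goes out. The function, URLs, and feed structure are hypothetical; on a WordPress site this would normally happen through a feed-content filter rather than standalone code.

```python
def add_attribution(item_html: str, permalink: str, site_name: str) -> str:
    """Append a link back to the original article to a feed item's HTML.

    When a scraper republishes the feed verbatim, the link travels with
    the content and becomes an inbound link to the original post, which
    is exactly the "free advertising" effect described above.
    """
    footer = f'<p>Originally published on <a href="{permalink}">{site_name}</a>.</p>'
    return item_html + footer

# Hypothetical usage with a single feed item:
body = "<p>The full text of the article...</p>"
print(add_attribution(body, "https://jackofallblogs.com/example-post/", "Jack of All Blogs"))
```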

While the system isn't perfect, it has served us well: well over 99% of all spammers can be safely ignored.

Still, every once in a while a spammer or scraper gets lucky and starts ranking well with our content. That is where I step in.

Dealing with Outliers

In the rare cases where a spammer does manage to harm the original sites, we have an action plan in place, and it closely mirrors the Stopping Internet Plagiarism guide I wrote on my main blog, Plagiarism Today.

Basically, the system consists of first trying to contact the scraper, which is rarely possible, and then filing a takedown notice with the site's host to get the entire domain removed. If that fails, we file a similar notice with the search engines to get the site removed from their indexes, which, while it doesn't remove the site from the Web, at least keeps it from competing against the original sites with their own content.
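For illustration only, that escalation order can be written down as a short piece of code. The step names and the idea of a single "resolving" step are assumptions made for the sketch; the real process is a judgment call, not a script.

```python
from enum import Enum, auto

class Step(Enum):
    CONTACT_SCRAPER = auto()   # rarely possible; scrapers tend to hide contact info
    HOST_TAKEDOWN = auto()     # notice to the site's web host to remove the domain
    SEARCH_DELISTING = auto()  # notice to search engines to drop the site's pages

def escalate(resolved_at: Step | None = None) -> list[Step]:
    """Return the steps taken in order, stopping once one resolves the case."""
    taken = []
    for step in Step:          # Enum iteration preserves definition order
        taken.append(step)
        if step == resolved_at:
            break
    return taken

# Example: contact fails, but the host honors the takedown notice,
# so search-engine delisting is never needed.
print(escalate(resolved_at=Step.HOST_TAKEDOWN))
# [<Step.CONTACT_SCRAPER: 1>, <Step.HOST_TAKEDOWN: 2>]
```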

To date, we've been able to resolve every damaging case of infringement without much problem. That wouldn't be possible if we tried to stop every single one; without the ability to focus on those actively hurting us, we probably couldn't ensure a resolution.

All in all, the need to take such action is rare, a couple of cases per month at most, but when it arises we are prepared. It simply isn't needed nearly as often as many think.

Bottom Line

In the end, content management and enforcement are an important part of any site's business. However, that isn't the same as stamping out every unauthorized copy in existence. Not only is that impractical, it is a tremendous waste of resources.

Managing your content is much more than simply removing works; it involves understanding how your content is being used, encouraging beneficial copying, and dealing with the harmful kind.

Fortunately, this is something Splashpress Media does very well, and something I'm proud to be a part of.

My hope is that others will come to understand this as well, so we can all spend our energy on what really matters: creating good content and promoting it well.
