Google Analytics, Black Hat SEO, Semalt and You

Google Analytics is a must-have tool in any webmaster’s toolbox. Using it, you can find out demographic information about who is visiting your site, how long they stay, what pages they tend to view, and (with Webmaster Tools enabled) what search terms people used to find your site. There’s a wealth of information available for extraction from Google Analytics if you know how to read it.

Analytics will also show you where your traffic is coming from. Are people finding your site because of an article you wrote for Time magazine? Are people finding your site because a popular blogger linked to your site in one of their own articles? Or has something you created achieved the vaunted viral status?

The problem

One of the sites I manage has seen fairly steady growth over the last year in terms of page views. In most cases, this would be considered a great thing! More people are coming to the site, so the owners wanted to look at how people were finding them and perhaps offer promotions or tidy up some of their content. The idea was to be more friendly to the people who arrived at the site seeking such information.

The problem arose when I looked at where this increase in traffic was coming from. In some months, 35% of the total traffic to the site came from a company called Semalt. This sounds great at first, like they’re doing you a massive favor, until you realize that the entire 35% is completely bogus traffic. Making matters worse, they appear to operate under a series of cheap, unrelated domain names, so you can’t just filter out Semalt.com and be done with them; you also have to deal with each of their other heads as it pops up.

Who is Semalt?

Semalt appears to be a marketing company. Their bot regularly scours the internet to analyze the top players in any field of search terms, presumably so they can sell optimization advice to those who wish to improve their standing in the search rankings.

Their business practices are so problematic that WordPress.com has officially banned their bots from accessing the network, but if you host your own WordPress setup you aren’t protected by anything. Thus you can expect your stats to be affected by Semalt’s bots.

The concern

Having a bot crawling your site is no cause for concern in and of itself. Google has bots that regularly crawl the internet’s network of links, indexing every page and using this information to know where to send you when you search for things. This benefits you in the long run, as it drives traffic to your site.

Normally, if you include a file called robots.txt (all lowercase) in your root directory containing some simple instructions, Google and other well-behaved web crawlers will respect your wishes and ignore your site (or the parts of it you specify). Semalt does not respect this file at all. They do supposedly offer an opt-out system on their own site, but to date reports on its effectiveness have been mixed.

When your web site is crawled to this degree (35% of total traffic from Semalt.com alone!) it not only messes with your stats; if you are on a limited-bandwidth hosting plan, they are chewing through your bandwidth and money as well. If you are running AdSense on your web property, none of the traffic Semalt brings to your site even counts towards ad impressions, so you’re having your sites repeatedly scraped for the benefit of some shady marketing firm. I’ve had enough of unscrupulous businesses lately. Semalt is a parasite.

The explanation

What they’re doing here is a curious form of socially engineered spam, one that targets webmasters and marketers. When I look at the referral statistics for a site, what do I find? Semalt at the top of the list.

With that, they’ve got my attention. I have since looked into them. And I was not pleased with what I found.

They wouldn’t have gotten that far had they tried to email me, as it would have been (correctly) filtered out as spam.

The experiment

I didn’t know as much about how they were pulling this off as I would have liked, so I did some research and experimentation.

Most web browsers keep track of certain things as you traverse the internet. For example, if you click on an external link on this site, your web browser will note that you came from here and upon arrival at the next site, tell that site that you came from http://strika.co (among other things).

Some security software will prevent your browser from revealing where you came from. This can cause problems, because overzealous webmasters sometimes enforce rules that say “If you’re trying to access a file hosted here, but you’re not coming from within my own site, I’m going to block you!” If your security software strips the referrer, the site is going to think you’re trying to hot-link (which is bandwidth stealing of another sort).

To see what sort of information I’m dealing with here, I set up two subdomains: “Heat” and “Cold.”

“Cold” consisted of nothing more than an html file containing a link to “Heat.”

The HTTP headers (the information your browser sends to a site when it requests a page) can be accessed using the $_SERVER variable in PHP. Let’s see what “Heat” had to say when I came to it via “Cold”:

var_dump($_SERVER);

array(37) {
 ["PATH"]=>
 string(29) "/bin:/usr/bin:/sbin:/usr/sbin"
 ["RAILS_ENV"]=>
 string(10) "production"
 ["SCRIPT_NAME"]=>
 string(10) "/index.php"
 ["REQUEST_URI"]=>
 string(1) "/"
 ["QUERY_STRING"]=>
 string(0) ""
 ["REQUEST_METHOD"]=>
 string(3) "GET"
 ["SERVER_PROTOCOL"]=>
 string(8) "HTTP/1.1"
 ["GATEWAY_INTERFACE"]=>
 string(7) "CGI/1.1"
 ["SERVER_PORT"]=>
 string(2) "80"
 ["SERVER_NAME"]=>
 string(22) "heat.testsite.com"
 ["SERVER_SOFTWARE"]=>
 string(6) "Apache"
 ["SERVER_SIGNATURE"]=>
 string(0) ""
 ["HTTP_CACHE_CONTROL"]=>
 string(9) "max-age=0"
 ["HTTP_CONNECTION"]=>
 string(5) "close"
 ["HTTP_DNT"]=>
 string(1) "1"
 ["HTTP_REFERER"]=>
 string(30) "http://cold.testsite.com/"
 ["HTTP_ACCEPT_ENCODING"]=>
 string(13) "gzip, deflate"
 ["HTTP_ACCEPT_LANGUAGE"]=>
 string(14) "en-US,en;q=0.5"
 ["HTTP_ACCEPT"]=>
 string(63) "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
 ["HTTP_USER_AGENT"]=>
 string(76) "Godzilla/2.0 (X11; BeOS; Linux x86_64; rv:38.0) Gecko/19990582 Flamingfox/36.0"
 ...
}

That’s a lot of information. I’ve removed some of the more intimate details and this is still only about half of what the server knows about itself and my computer.

Of particular interest:

["HTTP_REFERER"]=>
 string(30) "http://cold.testsite.com/"

So, this is what is getting passed between sites.
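For spotting a spammy referrer, the only part of that value that matters is the host name. Here is a minimal sketch of pulling it out in PHP. The function name referrer_host() is my own invention for illustration, not a PHP built-in, and on a live page you would feed it $_SERVER['HTTP_REFERER'] instead of a hard-coded string.

```php
<?php
// Sketch: extract the lowercase host name from a Referer value.
// referrer_host() is a hypothetical helper, not part of PHP itself.
function referrer_host(?string $referrer): ?string
{
    if ($referrer === null || $referrer === '') {
        return null; // direct visit, or the browser withheld the header
    }
    // parse_url() gives null when there is no host, false when unparseable
    $host = parse_url($referrer, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

// On a real page: referrer_host($_SERVER['HTTP_REFERER'] ?? null)
echo referrer_host('http://cold.testsite.com/'), "\n";
```

Keep in mind that the header is entirely under the visitor’s control, so it tells you what the client claims, not necessarily where it really came from.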

Semalt has a subdomain, semalt.semalt.com, that constantly tops the list. What I figured they’d most likely done is set up a web page on their own server with a bunch of target links, then run a bot to visit every item on that list. This way, when the bot reaches the target site, it arrives with an HTTP_REFERER containing “http://semalt.semalt.com” and causes curious webmasters to go looking into them.

Actually, surprise, Semalt’s traffic is the result of a distributed botnet. Stay away from Soundfrost.

The solution

Remember those overzealous webmasters I mentioned earlier? Prepare to become one.

Sites served by Apache can have a file called .htaccess in their root directory. It handles URL mapping and redirection for your site, including turning things like http://strika.co?p=369 into http://strika.co/posts/369. It also handles some degree of access control, allowing you to reject all traffic except that coming from your own IP address or, in this case, to reject all traffic from anybody showing up claiming to have been referred by Semalt or its fake subsidiaries.

If you’re running WordPress, you’ll already have an .htaccess file you can modify. If you’re not using WordPress, just create a file called .htaccess at the topmost directory in your web structure (where your main index.html or index.php file resides) and paste this block into it:

# BEGIN
<IfModule mod_rewrite.c>
RewriteEngine On

RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

# END

By adding this ruleset to your site, you make it so that visitors never see those awful-looking boilerplate “404” pages when they request a page or file that doesn’t exist. Instead, the request is quietly handed to /index.php, so they get your main page. (If your main page is index.html, change index.php in both rules to match.) It’s nice functionality to have for any site. This block was lifted wholesale from a stock WordPress install, but it will work fine without WordPress.

Now let’s block Semalt and their army of minions. You can do this one of two ways, but either way, do so in the blank space between “RewriteEngine On” and “RewriteBase /”:

Method 1 – Static list

# (RewriteEngine On is already on the line above)

# Matches semalt.semalt.com, or
RewriteCond %{HTTP_REFERER} ^http://semalt\.semalt\.com [NC,OR]
# Matches www.semalt.com, or
RewriteCond %{HTTP_REFERER} ^http://www\.semalt\.com [NC,OR]
# Matches buttons-for-website.com, then
RewriteCond %{HTTP_REFERER} ^http://buttons-for-website\.com [NC]
# Send the bot back home, tell it your site has moved to semalt.com permanently
RewriteRule (.*) http://semalt.com/ [R=301,L]

# (RewriteBase / follows on the line below)

Do you see the pattern? Anytime some new $0.99 domain starts sending spam traffic your way, just create another entry in the list.

Google Analytics will often just show you the referrer as “some.site.com,” so you need to take care to include the “http://” part like in the example above.

Be aware that one of these things is not like the others. The final entry, “buttons-for-website.com,” has an [NC] tag instead of [NC,OR]. Consecutive RewriteCond lines are combined with AND unless you say otherwise, so every entry except the last needs the OR; the last entry in the list should not have one.

Method 2 – Regex

If multiple subdomains keep coming after you, or you see a distinctive pattern in the URLs that are passed as referrers, you can use some of the more advanced functionality of regular expressions to try to account for them.

# Match either semalt.semalt.com or www.semalt.com, or
RewriteCond %{HTTP_REFERER} ^http://(semalt|www)\.semalt\.com [NC,OR]
# Match any letter/number combination that comprises a subdomain of semalt.com, then
RewriteCond %{HTTP_REFERER} ^http://[a-zA-Z0-9]*\.semalt\.com [NC]
# Redirect the bot to semalt.com and tell it your site moved there permanently
RewriteRule (.*) http://semalt.com/ [R=301,L]
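
The two conditions above overlap (the second already matches “semalt” and “www”), and neither catches https referrers, the bare semalt.com domain, or subdomains containing hyphens. As a rough sketch of a tighter variant, a single condition can cover all of those. Answering with a plain 403 Forbidden instead of a redirect is my own substitution here, not part of the original rules, and it sidesteps the browser redirect problem described under Troubleshooting.

```apache
# Sketch: one condition covering http and https, the bare domain,
# and any depth of subdomain made of letters, digits, and hyphens
RewriteCond %{HTTP_REFERER} ^https?://([a-z0-9-]+\.)*semalt\.com [NC]
# Refuse the request with 403 Forbidden instead of redirecting it
RewriteRule ^ - [F]
```

As with the other methods, this goes in the blank space between “RewriteEngine On” and “RewriteBase /”.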

Method 3 – Anti-DDoS

I’m currently looking into whether a service like Cloudflare might cut down on the automated scraping done by Semalt and company.

Method 4 – PHP Code

Stick something like this in your WordPress header or at the top of your PHP files, though it still requires you to curate a list of blocked domains.
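
The snippet below is a minimal sketch of the idea, not a drop-in plugin. The constant BLOCKED_REFERRER_DOMAINS and the function is_blocked_referrer() are names I made up for illustration; neither is part of PHP or WordPress. It assumes you want to answer blocked referrers with a 403 rather than a redirect.

```php
<?php
// Hypothetical blocklist; curate it the same way as the .htaccess list.
const BLOCKED_REFERRER_DOMAINS = ['semalt.com', 'buttons-for-website.com'];

// Illustrative helper: true if the referrer's host is a blocked domain
// or any subdomain of one.
function is_blocked_referrer(?string $referrer, array $blocked = BLOCKED_REFERRER_DOMAINS): bool
{
    if ($referrer === null || $referrer === '') {
        return false; // no referrer supplied, nothing to judge
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // malformed or host-less value
    }
    $host = strtolower($host);
    foreach ($blocked as $domain) {
        $suffix = '.' . $domain;
        // Exact match, or the host ends in ".domain" (a subdomain)
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Must run before any output is sent, or the status code cannot be set.
if (is_blocked_referrer($_SERVER['HTTP_REFERER'] ?? null)) {
    http_response_code(403);
    exit('Forbidden');
}
```

Matching on the parsed host name rather than on the raw string means a lookalike such as notsemalt.com is left alone, while every subdomain of a listed domain is caught without needing its own entry.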

Testing

If you own multiple sites, I’d recommend creating some subdomains and testing your rules by trying to block traffic from one to another. You would do this simply by creating a hyperlink somewhere on site 1 that points to site 2.

Pretend site 1 is affiliated with Semalt. Modify site 2’s .htaccess so that it rejects any traffic referred from site 1.

Do not do your testing on a production site.

Troubleshooting

If you mess up a rule, you can quickly find ALL of your incoming web traffic seemingly being redirected to semalt.com! Worse yet, you may find that even after undoing the changes, you’re still being sent there from your browser.

Don’t fret. Changes made to your .htaccess file go through immediately upon saving. If you’re unsure of what you’ve done, either delete the line(s) in question, or put a # at the beginning of the line to turn it into a comment until you get a handle on the situation.

What is most likely happening is that your browser has cached the redirect. A 301 means “moved permanently,” so browsers remember it and send you straight to the new address without asking your server again. That is why you can still end up at semalt.com even if you don’t click on a link but type the URL in by hand.

The best way to check and see if your rules are malfunctioning is by opening a Private Tab (Firefox) or Incognito Mode (Chrome) and going to the site from the domain that is supposed to be blocked. Private tabs open a completely sterile environment in which no existing data is loaded and no future data is stored. After each change you make, you should close the Private Tab and re-open a new one, because they will retain data belonging to the same session (thus, troubleshooting between changes may not be accurate).

The other alternative is to clear cookies and cache manually but that’s more of a pain. Just use private browsing modes for testing rules.

Good luck!
