WordPress Troubleshooting and Support

WordPress support from our community

semalt & baidu

Viewing 7 posts - 1 through 7 (of 7 total)
  • #2955
    Vincent Gentile
    Participant

    Hello WPNYC, I have a client whose site is being crawled several times a day by these 2 sites (semalt.com & baidu.com). I added code that I found online to the .htaccess file, but they continue to hit the site.

    Has anyone had this problem…?

    Below is my .htaccess file:
    __________________________________________________________________________
    Options -Indexes

    # BEGIN WPSuperCache
    # END WPSuperCache

    # BEGIN WordPress
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteRule ^index\.php$ - [L]
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]
    </IfModule>

    # END WordPress

    # block visitors referred from semalt.com
    RewriteEngine on
    RewriteCond %{HTTP_REFERER} semalt\.com [NC]
    RewriteRule .* - [F]

    # block visitors referred from baidu.com
    RewriteEngine on
    RewriteCond %{HTTP_REFERER} baidu\.com [NC]
    RewriteRule .* - [F]

    RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]

    RewriteCond %{HTTP_USER_AGENT} ^.*(semalt|baidu)…. [NC]
    RewriteRule . - [F,L]

    #2956
    D.K. Smith
    Participant

    Hi Vincent,

    I assume you disallowed them in robots.txt.

    Semalt’s removal page, http://semalt.com/project_crawler.php

    The most effective approach is to check the site’s server logs and get the IP addresses semalt & baidu are using.
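
A rough sketch of how those IPs could be pulled out of an access log, assuming the server writes Apache's "combined" log format. The sample lines, the addresses in them, and the helper name `suspect_ips` are made up for illustration; real log paths and formats vary by host.

```python
import re
from collections import Counter

# Hypothetical sample lines in Apache "combined" log format.
# The IPs are documentation-only addresses, not real crawler IPs.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2014:13:55:36 -0400] "GET / HTTP/1.1" 200 2326 '
    '"http://semalt.com/" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2014:14:02:11 -0400] "GET /about/ HTTP/1.1" 200 1870 '
    '"http://semalt.com/" "Mozilla/5.0"',
    '198.51.100.23 - - [10/Oct/2014:14:05:40 -0400] "GET / HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '192.0.2.10 - - [10/Oct/2014:14:06:02 -0400] "GET /contact/ HTTP/1.1" 200 990 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)"',
]

# IP, identity, user, [timestamp], "request", status, size, "referer", "user agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
BOT_RE = re.compile(r"semalt|baidu", re.IGNORECASE)


def suspect_ips(lines):
    """Count hits per IP where the referer or user agent mentions semalt or baidu."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if BOT_RE.search(match.group("referer")) or BOT_RE.search(match.group("agent")):
            counts[match.group("ip")] += 1
    return counts


if __name__ == "__main__":
    for ip, hits in suspect_ips(SAMPLE_LOG).most_common():
        print(ip, hits)
```

On a real site the list would come from reading the host's access log file instead of `SAMPLE_LOG`.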

    1. Block those IPs in htaccess:

    # Replace the placeholder addresses below with the actual IPs
    order allow,deny
    deny from 222.333.44.555
    deny from 666.777.88.999
    # Tells Apache to allow everyone else
    allow from all

    2. block a range of IPs:

    order allow,deny
    # Replace with the first two blocks of the actual IPs
    deny from 222.333.
    # Use as written
    deny from 10.0.0.
    allow from all

    3. This may work depending on Apache configuration:

    RewriteEngine on
    # Escape each period in the IP with a backslash
    RewriteCond %{REMOTE_ADDR} ^22\.333\.44\.555
    RewriteRule ^ - [F]

    4. PM me if you’d like to block all of China and I’ll send the file, which is too large to post here.

    We often have to use all of the above, depending on the bot (baidu is among the worst) and the server’s Apache config.

    Good luck!

    #2957
    Steve
    Keymaster

    You also received a reply on Twitter.

    #2958
    Vincent Gentile
    Participant

    Hi DK, thank you for responding. They always come from different IPs, and there isn’t a range.

    I did not disallow them in robots.txt. I have never done this, but I guess I need to create the robots.txt file and upload it to the host. If this is correct, do I upload it to public_html?

    Do you think the code below would work?

    User-agent: *
    Disallow: /

    # Some bots are known to be trouble, particularly those designed to copy
    # entire sites. Please obey robots.txt.

    User-agent: semalt.com
    Disallow: /
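
One way to sanity-check a robots.txt like the one above before uploading it is Python's standard `urllib.robotparser`. This is only a sketch: the helper name `allowed` is made up, and the parser models well-behaved crawlers, so it says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt proposed in the post above, copied as written.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.

User-agent: semalt.com
Disallow: /
"""


def allowed(user_agent, url="/"):
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Baiduspider", "semalt.com"):
        print(agent, allowed(agent))
```

Every agent comes back as disallowed, because the `User-agent: *` group with `Disallow: /` applies to all crawlers that have no group of their own, search engines included.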

    #2959
    Vincent Gentile
    Participant

    Steve, I saw your Twitter message. My client is paranoid about submitting to the semalt crawler…?

    #2960
    D.K. Smith
    Participant

    If baidu is coming from all over it’s not baidu, they’re spam bots – harvesters spoofing baidu.

    Baidu used to run on 119.63.192.0 – 119.63.199.255 – not sure if those IPs are still valid.
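
If that range is worth blocking, it can be collapsed into CIDR form with Python's standard `ipaddress` module. A small sketch, with made-up helper names, that only restates the range quoted above and does not confirm it is still Baidu's:

```python
import ipaddress

# Range quoted in the post; it may no longer be current.
FIRST = ipaddress.ip_address("119.63.192.0")
LAST = ipaddress.ip_address("119.63.199.255")


def range_as_cidr():
    """Collapse the first-to-last range into the smallest list of CIDR networks."""
    return [str(net) for net in ipaddress.summarize_address_range(FIRST, LAST)]


def in_quoted_range(ip):
    """Return True if ip falls inside the quoted range (inclusive)."""
    return FIRST <= ipaddress.ip_address(ip) <= LAST


if __name__ == "__main__":
    print(range_as_cidr())
    print(in_quoted_range("119.63.195.10"))
```

The range works out to a single network, 119.63.192.0/21, which can be used in a `deny from` line in place of listing addresses one by one.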

    1. Try this at the top of .htaccess, before the WP rewrite block:

    RewriteEngine on
    RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net/.*$ [NC]
    RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net$ [NC]
    RewriteCond %{HTTP_REFERER} !^http://yoursite.net/.*$ [NC]
    RewriteCond %{HTTP_REFERER} !^http://yoursite.net$ [NC]
    RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ - [F,NC]
    SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
    <Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>

    2. Or this on top:

    SetEnvIfNoCase User-agent "Baidu" spammer=yes
    SetEnvIfNoCase User-agent "semalt" spammer=yes

    <Limit GET POST>
    order deny,allow
    deny from env=spammer
    </Limit>

    3. Robots.txt file should be in EVERY WordPress root folder:

    #Baidu
    User-agent: Baiduspider
    Disallow: /

    #Semalt
    User-agent: semaltspider
    Disallow: /

    4. The only method that’s consistently effective is blocking IPs.

    PM your email and I’ll send our block-China list. There’s nothing to lose by trying it unless the owner wants traffic from China. If that doesn’t stop it, most likely they’re harvester bots spoofing baidu.
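
The user-agent patterns in items 1 and 2 behave differently, which can be checked offline. A sketch using Python's `re` as a stand-in for Apache's regex engine (the two agree for patterns this simple). The sample string is the commonly published Baiduspider user agent, used here as an assumption, and `matches` is a made-up helper:

```python
import re

# Commonly published Baiduspider user agent string (assumed here, not taken from the site's logs).
BAIDU_UA = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"


def matches(pattern, user_agent):
    """Case-insensitive, unanchored search, like SetEnvIfNoCase does with its regex."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


if __name__ == "__main__":
    # Pattern from item 1: anchored to the start of the string.
    print("^baiduspider:", matches("^baiduspider", BAIDU_UA))
    # Pattern from item 2: matches anywhere in the string.
    print("Baidu:", matches("Baidu", BAIDU_UA))
```

Against a user agent that begins with `Mozilla/5.0`, the anchored `^baiduspider` pattern does not match, while the unanchored `Baidu` pattern does, so the item 2 form is the safer one to try first.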

    #2977
    D.K. Smith
    Participant

    Vincent – How did the stop-baidu effort go?

    Was it Baidu or spambots?
