Handling 404s/403s with search engines
Scenario last night:
Google: Gimme this page. (Some administration-side URL, fuck knows how they know to index it.) They obviously ignore the 403. What are you supposed to do in this situation? |
Quote:
|
Wouldn't a 404 be better? Bot asks for a page, you tell it the page doesn't exist, bot stops asking.
|
The problem with the URL is it's a parameter which causes it to be forbidden. Like:
http://site/Resource?Action=Admin
instead of just:
http://site/Resource
or:
http://site/Resource?Action=Some_legal_action
Which are valid and I do want to be indexed. As far as I've been able to figure out, to exclude the admin actions I'd have to expose them all in the robots.txt, which is not really something I want. |
Quote:
I'd been thinking we need to change them to http://site/ redirects to make them stop. |
Use mod_rewrite to redirect googlebot to a 404 page or something.
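# RewriteRule only sees the URL path, so the Action parameter is matched via QUERY_STRING
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{QUERY_STRING} Action=Admin
RewriteRule ^ /404.html? [R=301,L]
|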
Quote:
|
You'll also want Google to drop this, so do that from Webmaster Tools, Crawler Access, Remove URL (being your admin dir, etc). And as you rightly speculated, never put any hidden directory in robots.txt.
|
Quote:
I'm not sure 503 would have been suitable either, but perhaps 410 Gone [1] is a better choice? (Assuming Google/Bing/et al. know what it means.)
[1] "Indicates that the resource requested is no longer available and will not be available again." - http://en.wikipedia.org/wiki/List_of_HTTP_status_codes |
Quote:
Hmm, how about you remap a URL for admins only which redirects to Resource?Admin etc.: /admin/*, and forbid that, or UrlRewrite or something. That way you're not exposing anything about your API beyond its existence. |
Try to rename admin and password protect it, because if Google indexed it there is a good chance Yandex etc. did also.
|
A lot of you guys' suggestions are good, but I can't really do them exactly as you say because our environment (POJO Java, JSPs, 10 years of legacy code, etc) is very different to the usual PHP/Apache style setups. I can, and do, take your suggestions and implement them in 'our way'.
This is still happening. In the last hour on one of our sites I've seen over 100 requests from a Google IP, with a Googlebot user agent, for a URL I've consistently returned 403s for. Even weirder, now Google is requesting URLs like "Folder?Action=iouzgwsunskasv" which I have never generated in any page. 15 minutes later it's back again with "Folder?Action=qnwotukfozr". I'm returning 400 - bad request and they still come back. Is it possible some bot is somehow spoofing the IP address and user agent to make it seem it's Googlebot, then poking actions at a website and monitoring it for changes? |
Gone in through Google's Webmaster Tools for several of these sites now. It's definitely Google requesting the pages we are giving 403 for. Removing those URLs will be a task because there's more than 3000 URLs it's hitting like this on one site.
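On the spoofing question: the user agent is trivial to fake, while the source IP of a completed HTTP request is much harder to forge. The check Google documents for this is a reverse DNS lookup on the IP, followed by a forward lookup on the hostname it returns. Below is a minimal Python sketch of that check, not the Java stack described above. The function name, the injectable lookup parameters and the exact suffix list are assumptions made for illustration.

```python
import socket

# Assumed list of hostname suffixes used by Google's crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    # The lookups are injectable so the logic can be exercised without a network.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the owner of an IP range controls its PTR records,
        # so the claimed hostname must also resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # Missing PTR record or failed lookup: treat the client as unverified.
        return False
```

Calling is_real_googlebot("66.249.71.170") with the default lookups does live DNS queries, so in a request path the result would probably want caching per IP.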
The other odd Action=garbage URLs don't appear in the webmaster tools at all, making me think it's not google. |
Any chance you can post some of the IP addresses of the fake Google bots?
|
All of the requests are coming from 66.249.71.170, which appears to be a google address.
|
Yeah, that's definitely a Google one. Probably won't get you anywhere, but maybe worth contacting them about it?
I checked a couple of reasonably heavy sites and found no hits from that IP address. |
Thanks, I'm probably going to post on the webmaster forums once I get my shit together. Plan is currently to 401 content they shouldn't be able to index and 410 admin functions they shouldn't even know about.
Since I last purged the logs on one of our moderate traffic level sites (for us) about three months ago, we've served 173,000 403 errors to google. |
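For anyone wanting to pull the same kind of number out of their own logs, here is a rough Python sketch that tallies response codes served to Googlebot. It assumes the access log is in Apache-style Combined Log Format, which may not be what a Java app server writes by default. The function name and the "Googlebot" marker parameter are made up for illustration.

```python
import re
from collections import Counter

# Assumed Combined Log Format:
# ip ident user [time] "METHOD url PROTO" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_counts(lines, agent_marker="Googlebot"):
    """Count responses per HTTP status for lines whose user agent contains agent_marker."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Lines that do not parse are skipped rather than guessed at.
        if match and agent_marker in match.group("agent"):
            counts[int(match.group("status"))] += 1
    return counts
```

Passing an open file object as lines works, since the function only iterates over it. The match is on the user agent string alone, so anything merely claiming to be Googlebot gets counted too.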
So, stuck up the code which returns a 410 to search engines, when they request a page they shouldn't ever be able to see. Google hits it twice in 30 seconds after the server was restarted and hasn't come back since.
Hopefully that's it. |
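The plan described in this thread (410 for admin functions crawlers shouldn't know about, 401 for content they shouldn't index, 400 for garbage actions) could be sketched roughly as below. This is Python rather than the site's Java, and it is not the code that was deployed. The action sets are hypothetical: the thread only names "Admin" and "Some_legal_action", and "Edit" is invented as an example of a login-only action.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical action names, for illustration only.
PUBLIC_ACTIONS = {"Some_legal_action"}
ADMIN_ACTIONS = {"Admin"}
MEMBER_ACTIONS = {"Edit"}


def crawler_status(url):
    """Pick the HTTP status to send a search-engine crawler for url."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    actions = query.get("Action")
    if not actions:
        return 200  # plain resource, fine to index
    action = actions[0]
    if action in PUBLIC_ACTIONS:
        return 200
    if action in ADMIN_ACTIONS:
        return 410  # tell the crawler this URL is gone for good
    if action in MEMBER_ACTIONS:
        return 401  # needs a login, not for indexing
    return 400  # unknown action such as Action=iouzgwsunskasv
```

Keeping the decision in one function makes it easy to check which code a given URL gets before deploying, e.g. crawler_status("http://site/Resource?Action=Admin").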
Nope, 4k+ hits since that change was deployed. Many on the exact same URL multiple times. Fuck it, I'm going to 301 them to http://sitedomain/.
|
Isn't that going to cause "duplicate site content" problems and lower your search ranking? Or do 301/302 redirects not affect that algorithm?
|
© Copyright NZGames.com 1996-2024