|
23rd September 2010, 11:08 | #1 |
Ich Bin Ein Grey Lynner
|
Handling 404s/403s with search engines
Scenario last night:
Google: Gimme this page. (Some administration-side URL; fuck knows how they know to index it.) They obviously ignore the 403. What are you supposed to do in this situation? |
23rd September 2010, 13:27 | #2 | |
|
Quote:
__________________
So the perkbuster Hide abusing perks, crimbuster Garrett actually a crim - what's next? Roger Douglas is secretly poor? --Saladin |
|
23rd September 2010, 13:48 | #3 |
|
Wouldn't a 404 be better? Bot asks for a page, you tell it the page doesn't exist, bot stops asking.
__________________
"Nothing is so smiple that it can't be screwed up." |
23rd September 2010, 13:53 | #4 |
Ich Bin Ein Grey Lynner
|
The problem with the URL is that it's a parameter which causes it to be forbidden. Like:
http://site/Resource?Action=Admin
instead of just:
http://site/Resource
or:
http://site/Resource?Action=Some_legal_action
which are valid and which I do want to be indexed. As far as I've been able to figure out, to exclude the admin actions, I'd have to expose them all in the robots.txt, which is not really something I want. |
23rd September 2010, 13:56 | #5 | |
Ich Bin Ein Grey Lynner
|
Quote:
I'd been thinking we need to change them to http://site/ redirects to make them stop. Last edited by smudge : 23rd September 2010 at 13:57. |
|
23rd September 2010, 15:17 | #6 |
Mmm... Sacrilicious
|
Use mod_rewrite to redirect googlebot to a 404 page or something.
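# Match on the query string: a RewriteRule pattern only sees the URL path, never ?Action=Admin.
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{QUERY_STRING} (^|&)Action=Admin(&|$) [NC]
RewriteRule ^ /404.html? [R=301,L]
|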
23rd September 2010, 17:46 | #7 | |
A mariachi ogre snorkel
|
Quote:
|
|
24th September 2010, 15:49 | #8 |
|
You'll also want Google to drop this, so do that from Webmaster Tools > Crawler Access > Remove URL (for your admin dir, etc.). And as you rightly speculated, never put any hidden directory in robots.txt.
|
24th September 2010, 16:34 | #9 | |
|
Quote:
I'm not sure 503 would have been suitable either, but perhaps 410 Gone [1] is a better choice? (Assuming Google/Bing/et al. know what it means.)
[1] "Indicates that the resource requested is no longer available and will not be available again." - http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
__________________
"Nothing is so smiple that it can't be screwed up." Last edited by LordP : 24th September 2010 at 16:35. |
|
24th September 2010, 16:43 | #10 | |
|
Quote:
Hmm, how about you remap a URL for admins only, e.g. /admin/*, which redirects to Resource?Admin etc., and forbid that, or use UrlRewrite or something. That way you're not exposing anything about your API beyond its existence.
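A rough sketch of why a single admin prefix helps, assuming "forbid that" means a robots.txt Disallow rule. It uses Python's standard urllib.robotparser to check what a rule-abiding crawler would be allowed to fetch. The /admin/ prefix and the example URLs are illustrative only, and this says nothing about whether a given bot actually honours the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one generic prefix is disallowed,
# so no individual admin action name has to be listed.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /admin/",
]


def build_parser(lines):
    """Parse robots.txt rules from an in-memory list of lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def crawl_allowed(parser, url, agent="Googlebot"):
    """Return True if a well-behaved crawler may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_LINES)
# Anything under the admin prefix is off limits...
print(crawl_allowed(parser, "http://site/admin/users"))
# ...while the public resource and its legal actions stay crawlable.
print(crawl_allowed(parser, "http://site/Resource?Action=Some_legal_action"))
```

The robots.txt then only reveals that an /admin/ area exists, not which Action values sit behind it.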
__________________
So the perkbuster Hide abusing perks, crimbuster Garrett actually a crim - what's next? Roger Douglas is secretly poor? --Saladin |
|
24th September 2010, 18:01 | #11 |
|
Try renaming admin and password-protecting it, because if Google indexed it there is a good chance Yandex etc. did too.
|
7th October 2010, 12:38 | #12 |
Ich Bin Ein Grey Lynner
|
A lot of you guys' suggestions are good, but I can't really do them exactly as you say because our environment (POJO Java, JSPs, 10 years of legacy code, etc.) is very different to the usual PHP/Apache-style setups. I can, and do, take your suggestions and implement them 'our way'.
This is still happening. In the last hour on one of our sites I've seen over 100 requests from a Google IP, with a Googlebot user agent, for a URL I've consistently returned 403s for. Even weirder, Google is now requesting URLs like "Folder?Action=iouzgwsunskasv", which I have never generated in any page. 15 minutes later it's back again with "Folder?Action=qnwotukfozr". I'm returning 400 Bad Request and they still come back. Is it possible some bot is somehow spoofing the IP address and user agent to make it seem like it's Googlebot, then poking actions at a website and monitoring it for changes? |
7th October 2010, 13:36 | #13 |
Ich Bin Ein Grey Lynner
|
Gone in through Google's Webmaster Tools for several of these sites now. It's definitely Google requesting the pages we are giving 403s for. Removing those URLs will be a task because there are more than 3000 URLs it's hitting like this on one site.
The other odd Action=garbage URLs don't appear in Webmaster Tools at all, making me think it's not Google. |
7th October 2010, 16:50 | #14 |
|
Any chance you can post some of the IP addresses of the fake Google bots?
__________________
"Nothing is so smiple that it can't be screwed up." |
7th October 2010, 17:01 | #15 |
Ich Bin Ein Grey Lynner
|
All of the requests are coming from 66.249.71.170, which appears to be a google address.
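One way to check rather than eyeball an address is the reverse-then-forward DNS test: look up the hostname for the IP, confirm it sits under googlebot.com or google.com, then resolve that hostname and confirm it maps back to the same IP. A minimal sketch follows. The function name and the injectable lookup parameters are my own, added so the logic can be exercised without network access.

```python
import socket

# Domains a genuine Googlebot reverse lookup is expected to fall under.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm it.

    reverse_lookup(ip) -> hostname and forward_lookup(hostname) -> list of
    IPs are hypothetical hooks; by default they use the system resolver.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:
        # No PTR record, or the resolver failed: cannot verify.
        return False

    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # The claimed hostname must resolve back to the requesting IP.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

Calling is_verified_googlebot("66.249.71.170") with the default resolvers needs working DNS. A user agent string alone proves nothing, since any client can send it.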
|
7th October 2010, 17:45 | #16 |
|
Yeah, that's definitely a Google one. Probably won't get you anywhere, but maybe worth contacting them about it?
I checked a couple of reasonably heavy sites and found no hits from that IP address.
__________________
"Nothing is so smiple that it can't be screwed up." |
7th October 2010, 18:20 | #17 |
Ich Bin Ein Grey Lynner
|
Thanks, I'm probably going to post on the webmaster forums once I get my shit together. Plan is currently to 401 content they shouldn't be able to index and 410 admin functions they shouldn't even know about.
Since I last purged the logs on one of our moderate traffic level sites (for us) about three months ago, we've served 173,000 403 errors to google. |
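The plan above could be sketched roughly like this, written in Python for brevity rather than in the Java the site actually runs. The action sets are hypothetical: only "Admin" and "Some_legal_action" appear in this thread, and "Members_only" is invented to stand in for content that needs a login.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical action names, for illustration only.
PUBLIC_ACTIONS = {"Some_legal_action"}   # fine to index
RESTRICTED_ACTIONS = {"Members_only"}    # real content, but needs a login
ADMIN_ACTIONS = {"Admin"}                # crawlers should not even know these


def status_for_crawler(url):
    """Pick the HTTP status to send a search engine for the given URL."""
    actions = parse_qs(urlsplit(url).query).get("Action", [])
    if not actions:
        return 200  # plain resource, no action requested
    action = actions[0]
    if action in PUBLIC_ACTIONS:
        return 200
    if action in RESTRICTED_ACTIONS:
        return 401  # content they shouldn't be able to index
    if action in ADMIN_ACTIONS:
        return 410  # admin function: tell the crawler it is gone for good
    return 400      # unknown or garbage action


for url in (
    "http://site/Resource",
    "http://site/Resource?Action=Admin",
    "http://site/Folder?Action=iouzgwsunskasv",
):
    print(url, status_for_crawler(url))
```

Keeping the decision in one function means the status policy can be changed in a single place if the crawler keeps coming back.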
8th October 2010, 12:37 | #18 |
Ich Bin Ein Grey Lynner
|
So, stuck up the code which returns a 410 to search engines, when they request a page they shouldn't ever be able to see. Google hits it twice in 30 seconds after the server was restarted and hasn't come back since.
Hopefully that's it. |
12th October 2010, 11:10 | #19 |
Ich Bin Ein Grey Lynner
|
Nope, 4k+ hits since that change was deployed. Many on the exact same URL multiple times. Fuck it, I'm going to 301 them to http://sitedomain/.
|
24th November 2010, 20:53 | #20 |
|
Isn't that going to cause "duplicate site content" problems and lower your search ranking? Or do 301/302 redirects not affect that algorithm?
|