NZGames.com Forums
Old 23rd September 2010, 10:08     #1
smudge
Ich Bin Ein Grey Lynner
 
Handling 404s/403s with search engines

Scenario last night:
Google: Gimme this page. (some administration side url, fuck knows how they know to index it)
Me: 403 fuck off.

Google: Gimme this page. (same url)
Me: 403 fuck off.

Google: Gimme this page. (same url)
Me: 403 fuck off.

:: a fuckload of times::
They obviously ignore the 403. What are you supposed to do in this situation?
Old 23rd September 2010, 12:27     #2
Cynos
 
Quote:
Originally Posted by smudge
Scenario last night:
Google: Gimme this page. (some administration side url, fuck knows how they know to index it)
Me: 403 fuck off.

Google: Gimme this page. (same url)
Me: 403 fuck off.

Google: Gimme this page. (same url)
Me: 403 fuck off.

:: a fuckload of times::
They obviously ignore the 403. What are you supposed to do in this situation?
Have you added the forbidden URLs to your robots.txt? Googlebot is good about honouring that. If you have and it's ignoring it, there are two options: Google's got a bug, or someone's faking a UA.
__________________
So the perkbuster Hide abusing perks, crimbuster Garrett actually a crim - what's next? Roger Douglas is secretly poor? --Saladin
Old 23rd September 2010, 12:48     #3
LordP
 
Wouldn't a 404 be better? Bot asks for a page, you tell it the page doesn't exist, bot stops asking.
__________________
"Nothing is so smiple that it can't be screwed up."
Old 23rd September 2010, 12:53     #4
smudge
Ich Bin Ein Grey Lynner
 
The problem with the URL is it's a parameter which causes it to be forbidden. Like:
http://site/Resource?Action=Admin
instead of just:
http://site/Resource
or :
http://site/Resource?Action=Some_legal_action
Which are valid and I do want to be indexed. As far as I've been able to figure out, to exclude the admin actions, I'd have to expose them all in the robots.txt, which is not really something I want.
Old 23rd September 2010, 12:56     #5
smudge
Ich Bin Ein Grey Lynner
 
Quote:
Originally Posted by LordP
Wouldn't a 404 be better? Bot asks for a page, you tell it the page doesn't exist, bot stops asking.
Do you find this happens though? We find a customer upgrades from their old site to our new one, and we get requests for their old system's "something.asp" from Google and Bing, in some cases a year or more later, even though we've been returning 404 all that time.

I'd been thinking we need to change them to http://site/ redirects to make them stop.

Last edited by smudge : 23rd September 2010 at 12:57.
Old 23rd September 2010, 14:17     #6
Spoon1
Mmm... Sacrilicious
 
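Use mod_rewrite to send Googlebot a 404 for those URLs or something. The action is in the query string, so it has to be matched with a RewriteCond; RewriteRule only sees the path.

RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{QUERY_STRING} (^|&)Action=Admin(&|$)
RewriteRule ^ - [R=404]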
Old 23rd September 2010, 16:46     #7
Ab
A mariachi ogre snorkel
 
Quote:
Originally Posted by smudge
Do you find this happens though? We find a customer upgrades from their old site to our new one, and we get requests for their old system's "something.asp" from Google and Bing, in some cases a year or more later, even though we've been returning 404 all that time.

I'd been thinking we need to change them to http://site/ redirects to make them stop.
should have used 503 on those, not 404.
Old 24th September 2010, 14:49     #8
hsh
 
You'll also want Google to drop this, so do that from Webmaster Tools: Crawler Access, Remove URL (being your admin dir, etc). And as you rightly speculated, never put any hidden directory in robots.txt.
Old 24th September 2010, 15:34     #9
LordP
 
Quote:
Originally Posted by smudge
Do you find this happens though? We find a customer upgrades from their old site to our new one, and we get requests for their old system's "something.asp" from Google and Bing, in some cases a year or more later, even though we've been returning 404 all that time.
To be honest, I haven't really followed up very much on the ones I've seen, and at the time it sounded like it should have been enough.

I'm not sure 503 would have been suitable either, but perhaps 410 Gone [1] is a better choice? (Assuming Google/Bing et al. know what it means.)

[1] "Indicates that the resource requested is no longer available and will not be available again." - http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Last edited by LordP : 24th September 2010 at 15:35.
Old 24th September 2010, 15:43     #10
Cynos
 
Quote:
Originally Posted by smudge
The problem with the URL is it's a parameter which causes it to be forbidden. Like:
http://site/Resource?Action=Admin
instead of just:
http://site/Resource
or :
http://site/Resource?Action=Some_legal_action
Which are valid and I do want to be indexed. As far as I've been able to figure out, to exclude the admin actions, I'd have to expose them all in the robots.txt, which is not really something I want.

Hmm, how about you remap a URL for admins only, e.g. /admin/*, which redirects to Resource?Admin etc., and forbid that, or use UrlRewrite or something. That way you're not exposing anything about your API beyond its existence.
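
A minimal sketch of what that prefix rule would and wouldn't block, using Python's standard `urllib.robotparser`. The robots.txt content and the `/admin/` prefix are assumptions for illustration, not the site's actual configuration. Note that this parser does plain prefix matching, and that robots.txt only asks well-behaved crawlers not to fetch; it is not access control.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the /admin/ prefix is named,
# so no individual admin action is exposed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "http://site/admin/Resource",
        "http://site/Resource",
        "http://site/Resource?Action=Some_legal_action",
    ):
        print(url, may_crawl(rules, url))
```

With this layout the public `Resource` URLs stay crawlable and everything under the remapped prefix is disallowed in a single line.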
Old 24th September 2010, 17:01     #11
hsh
 
Try to rename admin and password protect it, because if Google indexed it there is a good chance Yandex etc. did too.
Old 7th October 2010, 11:38     #12
smudge
Ich Bin Ein Grey Lynner
 
A lot of you guys' suggestions are good, but I can't really do them exactly as you say because our environment (POJO Java, JSPs, 10 years of legacy code, etc) is very different to the usual PHP/Apache style setups. I can, and do, take your suggestions and implement them 'our way'.

This is still happening. In the last hour on one of our sites I've seen over 100 requests from a Google IP, with a Googlebot user agent, for a URL I've consistently returned 403s for. Even weirder, Google is now requesting URLs like "Folder?Action=iouzgwsunskasv" which I have never generated in any page. 15 minutes later it's back again with "Folder?Action=qnwotukfozr". I'm returning 400 Bad Request and they still come back.

Is it possible some bot is somehow spoofing the IP address and user agent to make it seem it's Googlebot, then poking actions at a website and monitoring it for changes?
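
One way to check whether a request really comes from Googlebot is the reverse-then-forward DNS test: look up the hostname for the IP, check that it sits under a Google crawler domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below assumes the `googlebot.com` and `google.com` suffixes; the function names are made up for illustration, and the resolvers are passed in as parameters so the logic can be tried without touching the network.

```python
import socket
from typing import Callable, List

# Assumed crawler domains; check Google's current documentation before relying on these.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """A-record lookup: hostname to list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Return True only if reverse and forward DNS both agree with a Google domain."""
    try:
        hostname = reverse(ip)
    except OSError:  # no PTR record or the lookup failed
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The forward lookup must lead back to the IP we started with.
        return ip in forward(hostname)
    except OSError:
        return False
```

The user agent string plays no part in the check because any client can send whatever it likes there; only the address the connection came from is tested.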
Old 7th October 2010, 12:36     #13
smudge
Ich Bin Ein Grey Lynner
 
Went in through Google's Webmaster Tools for several of these sites now. It's definitely Google requesting the pages we are giving 403 for. Removing those URLs will be a task because there are more than 3000 URLs it's hitting like this on one site.

The other odd Action=garbage URLs don't appear in the webmaster tools at all, making me think it's not google.
Old 7th October 2010, 15:50     #14
LordP
 
Any chance you can post some of the IP addresses of the fake Google bots?
Old 7th October 2010, 16:01     #15
smudge
Ich Bin Ein Grey Lynner
 
All of the requests are coming from 66.249.71.170, which appears to be a google address.
Old 7th October 2010, 16:45     #16
LordP
 
Yeah, that's definitely a Google one. Probably won't get you anywhere, but maybe worth contacting them about it?

I checked a couple of reasonably heavy sites and found no hits from that IP address.
Old 7th October 2010, 17:20     #17
smudge
Ich Bin Ein Grey Lynner
 
Thanks, I'm probably going to post on the webmaster forums once I get my shit together. Plan is currently to 401 content they shouldn't be able to index and 410 admin functions they shouldn't even know about.

Since I last purged the logs on one of our moderate traffic level sites (for us) about three months ago, we've served 173,000 403 errors to google.
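
The plan above (401 for content crawlers shouldn't index, 410 for admin functions, 400 for garbage actions) could be sketched roughly as follows. It is written in Python for brevity rather than the Java/JSP stack described earlier, and every action name except `Admin` and `Some_legal_action` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Action names are illustrative assumptions, not the real application's list.
ADMIN_ACTIONS = {"Admin"}                # crawlers should not even know about these
MEMBERS_ONLY_ACTIONS = {"ViewOrders"}    # hypothetical: needs a login, not indexable
PUBLIC_ACTIONS = {"Some_legal_action"}   # fine to crawl and index


def status_for_crawler(url: str) -> int:
    """Pick the HTTP status to send a search-engine crawler for this URL."""
    query = parse_qs(urlsplit(url).query)
    action = query.get("Action", [""])[0]
    if not action:
        return 200  # plain resource URL, no action requested
    if action in ADMIN_ACTIONS:
        return 410  # Gone
    if action in MEMBERS_ONLY_ACTIONS:
        return 401  # Unauthorized
    if action in PUBLIC_ACTIONS:
        return 200
    return 400      # unknown action, e.g. Action=iouzgwsunskasv


if __name__ == "__main__":
    for url in (
        "http://site/Resource",
        "http://site/Resource?Action=Admin",
        "http://site/Folder?Action=iouzgwsunskasv",
    ):
        print(url, status_for_crawler(url))
```

Keeping the decision in one function makes it easy to swap the codes later, for example if the 410 turns out not to stop the requests.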
Old 8th October 2010, 11:37     #18
smudge
Ich Bin Ein Grey Lynner
 
So, stuck up the code which returns a 410 to search engines, when they request a page they shouldn't ever be able to see. Google hits it twice in 30 seconds after the server was restarted and hasn't come back since.

Hopefully that's it.
Old 12th October 2010, 10:10     #19
smudge
Ich Bin Ein Grey Lynner
 
Nope, 4k+ hits since that change was deployed. Many on the exact same URL multiple times. Fuck it, I'm going to 301 them to http://sitedomain/.
Old 24th November 2010, 19:53     #20
Madman
 
Isn't that going to cause "duplicate site content" problems and lower your search ranking? Or do 301/302 redirects not affect that algorithm?