NZGames.com Forums


smudge 23rd September 2010 11:08

Handling 404s/403s with searchengines
 
Scenario last night:
Google: Gimme this page. (some administration side url, fuck knows how they know to index it)
Me: 403 fuck off.

Google: Gimme this page. (same url)
Me: 403 fuck off.

Google: Gimme this page. (same url)
Me: 403 fuck off.

:: a fuckload of times::
They obviously ignore the 403. What are you supposed to do in this situation?

Cynos 23rd September 2010 13:27

Quote:

Originally Posted by smudge
Scenario last night:
Google: Gimme this page. (some administration side url, fuck knows how they know to index it)
Me: 403 fuck off.

Google: Gimme this page. (same url)
Me: 403 fuck off.

Google: Gimme this page. (same url)
Me: 403 fuck off.

:: a fuckload of times::
They obviously ignore the 403. What are you supposed to do in this situation?

Have you added the forbidden URLs to your robots.txt? Googlebot is good about honouring that. If you have and it's ignoring it, there are two options: Google's got a bug, or someone's faking a UA.

LordP 23rd September 2010 13:48

Wouldn't a 404 be better? Bot asks for a page, you tell it the page doesn't exist, bot stops asking.

smudge 23rd September 2010 13:53

The problem is that it's a parameter in the URL which causes it to be forbidden. Like:
http://site/Resource?Action=Admin
instead of just:
http://site/Resource
or:
http://site/Resource?Action=Some_legal_action
Those last two are valid and I do want them indexed. As far as I've been able to figure out, to exclude the admin actions I'd have to expose them all in the robots.txt, which is not really something I want.

smudge 23rd September 2010 13:56

Quote:

Originally Posted by LordP
Wouldn't a 404 be better? Bot asks for a page, you tell it the page doesn't exist, bot stops asking.

Do you find this happens though? We find a customer upgrades from their old site to our new one. We get requests for their old system's "something.asp" from Google and Bing, in some cases for a year or more afterwards. Even though we've been returning 404 all that time.

I'd been thinking we need to change them to http://site/ redirects to make them stop.

Spoon1 23rd September 2010 15:17

Use mod_rewrite to redirect googlebot to a 404 page or something.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{QUERY_STRING} Action=Admin
RewriteRule ^ /404.html? [R=301,L]

Ab 23rd September 2010 17:46

Quote:

Originally Posted by smudge
Do you find this happens though? We find a customer upgrades from their old site to our new one. We get requests for their old system's "something.asp" from Google and Bing, in some cases for a year or more afterwards. Even though we've been returning 404 all that time.

I'd been thinking we need to change them to http://site/ redirects to make them stop.

should have used 503 on those, not 404.

hsh 24th September 2010 15:49

You'll also want Google to drop this, so do that from Webmaster Tools: Crawler Access, Remove URL (for your admin dir, etc.). And as you rightly suspected, never put any hidden directory in robots.txt.

LordP 24th September 2010 16:34

Quote:

Originally Posted by smudge
Do you find this happens though? We find a customer upgrades from their old site to our new one. We get requests for their old system's "something.asp" from Google and Bing, in some cases for a year or more afterwards. Even though we've been returning 404 all that time.

To be honest, I haven't really followed up very much on the ones I've seen, and at the time it sounded like it should have been enough.

I'm not sure 503 would have been suitable either, but perhaps 410 Gone [1] is a better choice? (Assuming Google/Bing/et al. know what it means)

[1] "Indicates that the resource requested is no longer available and will not be available again." - http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Cynos 24th September 2010 16:43

Quote:

Originally Posted by smudge
The problem is that it's a parameter in the URL which causes it to be forbidden. Like:
http://site/Resource?Action=Admin
instead of just:
http://site/Resource
or:
http://site/Resource?Action=Some_legal_action
Those last two are valid and I do want them indexed. As far as I've been able to figure out, to exclude the admin actions I'd have to expose them all in the robots.txt, which is not really something I want.


Hmm, how about you remap a URL for admins only, e.g. /admin/*, which redirects to Resource?Admin etc., and forbid that, or use UrlRewrite or something. That way you're not exposing anything about your API beyond its existence.
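
To make the /admin/* idea concrete, here is a minimal sketch in Python (chosen for brevity, not because the site uses it). The robots.txt content and the /admin/anything path are hypothetical; only the generic prefix is listed, so no individual Action values are revealed. It uses the standard library's urllib.robotparser, which does plain prefix matching and may not behave exactly like Googlebot's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the generic /admin/ prefix is listed,
# so no individual Action=... values are exposed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Anything under the remapped admin prefix is disallowed for crawlers...
admin_allowed = parser.can_fetch("Googlebot", "http://site/admin/anything")
# ...while the public resource and its legal actions stay crawlable.
public_allowed = parser.can_fetch(
    "Googlebot", "http://site/Resource?Action=Some_legal_action"
)
print(admin_allowed, public_allowed)
```

This only tells well-behaved crawlers to stay away; the admin URLs still need real access control behind them.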

hsh 24th September 2010 18:01

Try renaming admin and password-protecting it, because if Google indexed it there is a good chance Yandex etc. did too.

smudge 7th October 2010 12:38

A lot of you guys' suggestions are good, but I can't really do them exactly as you say because our environment (POJO Java, JSPs, 10 years of legacy code, etc.) is very different to the usual PHP/Apache style setups. I can, and do, take your suggestions and implement them 'our way'.

This is still happening. In the last hour on one of our sites I've seen over 100 requests from a Google IP, with a Googlebot user agent, for a URL I've consistently returned 403s for. Even weirder, Google is now requesting URLs like "Folder?Action=iouzgwsunskasv" which I have never generated in any page. 15 minutes later it's back again with "Folder?Action=qnwotukfozr". I'm returning 400 - bad request and they still come back.

Is it possible some bot is somehow spoofing the IP address and user agent to make it seem like it's Googlebot, then poking actions at a website and monitoring it for changes?

smudge 7th October 2010 13:36

I've gone in through Google's Webmaster Tools for several of these sites now. It's definitely Google requesting the pages we are giving 403s for. Removing those URLs will be a task because there are more than 3000 URLs it's hitting like this on one site.

The other odd Action=garbage URLs don't appear in Webmaster Tools at all, making me think it's not Google.

LordP 7th October 2010 16:50

Any chance you can post some of the IP addresses of the fake Google bots?

smudge 7th October 2010 17:01

All of the requests are coming from 66.249.71.170, which appears to be a google address.
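
One way to check whether an address really belongs to Googlebot, rather than going by how it looks, is the reverse-then-forward DNS check that Google describes for verifying its crawler. The sketch below is Python and only an illustration: the function names are made up, the accepted host suffixes are an assumption to check against Google's current guidance, and the resolvers are injectable so the logic can be tried without network access.

```python
import socket
from typing import Callable, List

# Assumption: verified Googlebot hosts resolve under these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_lookup(ip: str) -> str:
    """Return the primary host name for an IP (PTR lookup)."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(host: str) -> List[str]:
    """Return the IPv4 addresses a host name resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The name must resolve back to the same IP, otherwise the PTR
        # record alone could have been set by anyone controlling that IP.
        return ip in forward(host)
    except OSError:
        return False
```

Calling is_verified_googlebot("66.249.71.170") with the default resolvers performs live DNS lookups, so the result depends on the DNS records at the time it is run.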

LordP 7th October 2010 17:45

Yeah, that's definitely a Google one. Probably won't get you anywhere, but maybe worth contacting them about it?

I checked a couple of reasonably heavy sites and found no hits from that IP address.

smudge 7th October 2010 18:20

Thanks, I'm probably going to post on the webmaster forums once I get my shit together. Plan is currently to 401 content they shouldn't be able to index and 410 admin functions they shouldn't even know about.

Since I last purged the logs on one of our moderate traffic level sites (for us) about three months ago, we've served 173,000 403 errors to google.
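
For anyone wanting to produce a similar tally from their own logs, a rough Python sketch follows. It assumes the Apache-style "combined" access log format, which may well not match the logging in a Java setup like the one described; the regex and the function name are illustrative only.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: "combined" log format, i.e.
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_statuses(lines: Iterable[str], ua_token: str) -> Counter:
    """Count responses per HTTP status for user agents containing ua_token."""
    counts: Counter = Counter()
    token = ua_token.lower()
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not fit the assumed format are skipped silently.
        if match and token in match.group("ua").lower():
            counts[int(match.group("status"))] += 1
    return counts
```

Passing an open log file and "Googlebot" gives a per-status count; looking up 403 in the result gives the number of forbidden responses served to that user agent.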

smudge 8th October 2010 12:37

So, I've put up the code which returns a 410 to search engines when they request a page they shouldn't ever be able to see. Google hit it twice in the 30 seconds after the server was restarted and hasn't come back since.

Hopefully that's it.
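
The actual code isn't shown in the thread, so the following is only a guess at the shape of such a check, written in Python for brevity rather than as a servlet filter. The crawler tokens, the set of admin actions, and the function name are all assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Assumptions for illustration: substrings that identify crawler user
# agents, and the Action values treated as admin-only.
CRAWLER_TOKENS = ("googlebot", "bingbot", "msnbot", "yandex")
ADMIN_ACTIONS = {"Admin"}


def status_for(url: str, user_agent: str, is_admin: bool = False) -> int:
    """Pick an HTTP status for a request based on its Action parameter."""
    actions = parse_qs(urlsplit(url).query).get("Action", [])
    if not any(action in ADMIN_ACTIONS for action in actions):
        return 200  # public resource or a legal action: serve normally
    if is_admin:
        return 200  # authenticated admins may use admin actions
    ua = user_agent.lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        return 410  # tell crawlers the URL is gone for good
    return 403  # everyone else is simply forbidden
```

With these assumptions, a Googlebot request for http://site/Resource?Action=Admin gets a 410, an ordinary browser gets a 403, and http://site/Resource?Action=Some_legal_action is served as usual.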

smudge 12th October 2010 11:10

Nope, 4k+ hits since that change was deployed. Many on the exact same URL multiple times. Fuck it, I'm going to 301 them to http://sitedomain/.

Madman 24th November 2010 20:53

Isn't that going to cause "duplicate site content" problems and lower your search ranking? Or do 301/302 redirects not affect that algorithm?


All times are GMT +13.