View Full Version : Re: Google causing excessive bandwidth usage.


Stan Brown
01-01-2006, 16:15
Sat, 19 Nov 2005 10:32:26 +0000 from Philip Ronan
<invalid@invalid.invalid>:

> "Doug Laidlaw" wrote:
> > Google has been around to my site twice this month and downloaded almost a
> > GB, putting me over my bandwidth limit both times. I imagine that if I
> > wasn't paying a flat fee, that would be costing me money.
> > Is there a way of limiting this while at the same time allowing Google
> > reasonable indexing?

> alt.internet.search-engines might have been a better place to ask.

(follow-ups redirected accordingly)

> Then I don't really see what the problem is. You've got all this content on
> your website, and presumably you want it indexed by Google. So you can't
> complain when the googlebot comes along and looks at the stuff.

I'm _not_ paying a flat fee, unlike the OP, and I'd like to know the
answer to this also.

> I think you would be better off reading this:
> <http://www.google.com/intl/en/webmasters/bot.html>

Good heavens! That page says Google trawls my site every few
_seconds_. Not long ago I remember it used to be every few _days_. I
noticed activity on my site grew quite a bit a little less than a
year ago; I wonder if this was the reason?

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/2003/05/05/why_we_wont_help_you

Philip Ronan
01-01-2006, 16:15
"Stan Brown" wrote:

> Sat, 19 Nov 2005 10:32:26 +0000 from Philip Ronan
> <invalid@invalid.invalid>:
>
>> I think you would be better off reading this:
>> <http://www.google.com/intl/en/webmasters/bot.html>
>
> Good heavens! That page says Google trawls my site every few
> _seconds_. Not long ago I remember it used to be every few _days_. I
> noticed activity on my site grew quite a bit a little less than a
> year ago; I wonder if this was the reason?

What it actually says is "For most sites, Googlebot shouldn't access your
site more than once every few seconds on average." Think of that as a hit
rate. It would be pointless trawling through your *entire site* every few
seconds. In my experience the Googlebot generates no more traffic than an
ordinary visitor to the site.

It just occurred to me that the problems you and the OP are experiencing
might be caused by things like poor cacheability. You're both generating
pages dynamically, aren't you? Are they cacheable? Can they handle
conditional requests? If not, you're creating extra traffic for your site,
and not just from the search engine robots.

Here's your homework:

1. Read RFC2616, especially the bits about conditional requests
2. Check your content for cacheability
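
For anyone doing the homework, here is a minimal sketch of what "handling a conditional request" means for a dynamically generated page. It only covers the If-Modified-Since case from RFC 2616; the function name conditional_status and its arguments are made up for illustration and do not come from any particular server or framework.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's cached copy is still current, else 200.

    last_modified: timezone-aware datetime of the page's last change.
    if_modified_since: raw If-Modified-Since header value, or None.
    """
    if if_modified_since is None:
        return 200  # unconditional request: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # not modified: headers only, no body
    return 200


page_changed = datetime(2005, 11, 19, 10, 32, 26, tzinfo=timezone.utc)
header = format_datetime(page_changed, usegmt=True)
print(conditional_status(page_changed, header))  # 304
print(conditional_status(page_changed))          # 200
```

A 304 response carries no body, so a robot rechecking an unchanged page costs a few hundred bytes instead of the whole page.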

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Nick Kew
01-01-2006, 16:15
Stan Brown wrote:
> (follow-ups redirected accordingly)

And ignored. I'm not posting *only* to a group I don't read.

> Good heavens! That page says Google trawls my site every few
> _seconds_. Not long ago I remember it used to be every few _days_. I

Erm, that'll be URLs that get visited at a high rate while it's
spidering. So if it visits one per minute and you have 1440 pages,
it'll take one day to spider the site from scratch.

It'll then revisit in [???] days/weeks to check for changes.
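
The arithmetic above, as a small sketch. The function name is made up for illustration; the one-per-minute rate and the 1440 pages are the figures from this post, not measured Googlebot behaviour.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def full_crawl_days(pages, seconds_between_requests):
    """Days needed to fetch every page once at a steady request rate."""
    return pages * seconds_between_requests / SECONDS_PER_DAY


# One request per minute over 1440 pages takes exactly one day.
print(full_crawl_days(1440, 60))  # 1.0
```

The same function shows why a limit of one hit every few seconds says nothing about how often the whole site is revisited: it only bounds how fast a single pass can go.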

--
Nick Kew

Alan J. Flavell
01-01-2006, 16:15
On Sat, 19 Nov 2005, Stan Brown wrote:

> > alt.internet.search-engines might have been a better place to ask.
> (follow-ups redirected accordingly)

Urgl. I missed that, first time, but this server doesn't do alt
groups. So here goes again, including a group that I not only read
but can post to...

> > <http://www.google.com/intl/en/webmasters/bot.html>
>
> Good heavens! That page says Google trawls my site every few
> _seconds_.

I don't think so! It says the server shouldn't get *an* access from
Googlebot more often than once every few seconds. That's a rate
control mechanism, not a frequency of revisiting.

Though I'm a bit surprised to see that when I count up the log entries
for Googlebot on our server, I count some 68K accesses in the current
log, 13th November onwards, out of the total of some 400K accesses
over that period.

But the accesses are clustered by date, implying that they did a trawl
twice this week - or once (30K hits in the previous week, in just a
single cluster), with only a few hundreds of Googlebot hits per day on
the intermediate days (presumably to re-check pages which were
recently active?).
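
A count like this can be reproduced with a short script. The sketch below assumes the log is in Apache "combined" format and that Googlebot is identified by its User-Agent string (which can be spoofed); the sample lines are invented for illustration and are not taken from any real log.

```python
import re
from collections import Counter

# Assumed layout: host ident user [day:time zone] "request" status bytes
# "referer" "user-agent" (Apache combined log format).
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def googlebot_summary(lines):
    """Return (total hits, Googlebot hits per day, Googlebot hits per status)."""
    total = 0
    per_day, per_status = Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not look like log entries
        total += 1
        if "Googlebot" in (m.group("agent") or ""):
            per_day[m.group("day")] += 1
            per_status[m.group("status")] += 1
    return total, per_day, per_status


# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [13/Nov/2005:06:25:01 +0000] "GET /a.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.1 - - [13/Nov/2005:06:25:09 +0000] "GET /b.html HTTP/1.1" 304 - "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '198.51.100.7 - - [14/Nov/2005:09:00:00 +0000] "GET /a.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]

total, per_day, per_status = googlebot_summary(SAMPLE)
print(total, dict(per_day), dict(per_status))
```

Clusters of hits on particular days show up directly in the per-day counts, and the per-status counts show how many requests were answered with 304 rather than a full 200.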

I see most of the Googlebot accesses here are returning status 200.
The references to my own personal space are mostly returning status
304, but I see a few cases where my "xbithack full" pages are missing
the g+x bit, and so they always return status 200, which I need to
rectify.
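
Background for the above: with Apache's "XBitHack full", a file with the user-execute bit set is parsed for server-side includes, and a Last-Modified header is only sent if the group-execute bit is set as well. A rough sketch for finding the stragglers follows; the function name and the "*.html" pattern are assumptions for illustration, and it presumes a POSIX filesystem.

```python
import stat
from pathlib import Path


def missing_group_exec(root, pattern="*.html"):
    """List files under root with the user-execute bit but no group-execute bit.

    Under "XBitHack full" such files are parsed but get no Last-Modified
    header, so conditional requests for them cannot return 304.
    """
    hits = []
    for path in Path(root).rglob(pattern):
        mode = path.stat().st_mode
        if mode & stat.S_IXUSR and not mode & stat.S_IXGRP:
            hits.append(path)
    return sorted(hits)
```

Calling missing_group_exec on the document root returns the files that would need "chmod g+x" to start answering conditional requests with 304.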

Hmmm, and I have to look into those status 200 responses elsewhere on
the server, and probably do something about it. I have a theory.

David
01-01-2006, 16:15
On Sat, 19 Nov 2005 17:41:21 GMT, Philip Ronan
<invalid@invalid.invalid> wrote:

>In my experience the Googlebot generates no more traffic than an
>ordinary visitor to the site.

It depends a lot on the size of the site. On a small site, yes, it's a
lot like a very interested visitor (most real visitors view a small
number of pages, unlike the bots), but on a large site you feel like
you've been mugged some visits :-))

It also depends a lot on the number and quality (PR) of the links to a
site, though.

David
--
Free Search Engine Optimization Tutorial
http://www.seo-gold.com/tutorial/

Stan Brown
01-01-2006, 16:15
Sat, 19 Nov 2005 17:41:21 GMT from Philip Ronan
<invalid@invalid.invalid>:
> "Stan Brown" wrote:
>
> > Sat, 19 Nov 2005 10:32:26 +0000 from Philip Ronan
> > <invalid@invalid.invalid>:
> >
> >> I think you would be better off reading this:
> >> <http://www.google.com/intl/en/webmasters/bot.html>
> >
> > Good heavens! That page says Google trawls my site every few
> > _seconds_. Not long ago I remember it used to be every few _days_. I
> > noticed activity on my site grew quite a bit a little less than a
> > year ago; I wonder if this was the reason?
>
> What it actually says is "For most sites, Googlebot shouldn't access your
> site more than once every few seconds on average." Think of that as a hit
> rate. It would be pointless trawling through your *entire site* every few
> seconds. In my experience the Googlebot generates no more traffic than an
> ordinary visitor to the site.

Thanks, that makes more sense.

> It just occurred to me that the problems you and the OP are experiencing
> might be caused by things like poor cacheability. You're both generating
> pages dynamically, aren't you? Are they cacheable? Can they handle
> conditional requests? If not, you're creating extra traffic for your site,
> and not just from the search engine robots.

No, my pages are all static, and (just checked with lynx -dump -head)
the server does return last-modified dates. So, unless I'm
misunderstanding you, they're pretty darn cacheable. :-)
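
The same check as lynx -dump -head can be scripted. This is only a sketch: the helper names are made up, head_headers needs network access, and the presence of Last-Modified or ETag shows that conditional requests are possible, not that every cache will actually store the page.

```python
from urllib.request import Request, urlopen


def head_headers(url):
    """Fetch only the response headers for url (needs network access)."""
    with urlopen(Request(url, method="HEAD")) as resp:
        return dict(resp.headers.items())


def cache_validators(headers):
    """Return the validators (Last-Modified, ETag) present in a header mapping."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {n: lowered[n] for n in ("last-modified", "etag") if n in lowered}


# Example with a hand-written header set rather than a live request.
print(cache_validators({
    "Content-Type": "text/html",
    "Last-Modified": "Sat, 19 Nov 2005 10:32:26 GMT",
}))
```

An empty result means the server gives clients nothing to revalidate against, so every revisit is a full download.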

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
"If there's one thing I know, it's men. I ought to: it's
been my life work." -- Marie Dressler, in /Dinner at Eight/

Philip Ronan
01-01-2006, 16:15
"Stan Brown" wrote:

> No, my pages are all static, and (just checked with lynx -dump -head)
> the server does return last-modified dates. So, unless I'm
> misunderstanding you, they're pretty darn cacheable. :-)

But you're still having your bandwidth eaten up by Googlebot? That's odd.
Only 1% of my traffic comes from Googlebots (12,400 hits last month). The
site has about 3000 pages indexed, IIRC.

What sort of traffic are you getting?

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Stan Brown
01-01-2006, 16:15
Sun, 20 Nov 2005 23:27:22 GMT from Philip Ronan
<invalid@invalid.invalid>:
> "Stan Brown" wrote:
>
> > No, my pages are all static, and (just checked with lynx -dump -head)
> > the server does return last-modified dates. So, unless I'm
> > misunderstanding you, they're pretty darn cacheable. :-)
>
> But you're still having your bandwidth eaten up by Googlebot? That's odd.

That is not what I said. I said I'd noticed a sudden upsurge in usage
about a year ago (never going back down) and wondered whether it was
Google starting to recheck the site every few seconds. But someone
pointed out that I'd misread the Google page, so that's not the
explanation.

(I should have known anyway that it wasn't, because I examined the
logs and couldn't see any one domain taking the lion's share of the
accesses. Maybe my site just got popular. :-)

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
"If there's one thing I know, it's men. I ought to: it's
been my life work." -- Marie Dressler, in /Dinner at Eight/
