Hmmmm, so if someone says they have SEO skills on their resume, they COULD be 
talking about optimizing the search engine at some site, not just a web site to 
be crawled by search engines?

----- Original Message ----
From: Ken Krugler <kkrugler_li...@transpac.com>
To: solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 9:07:43 AM
Subject: Re: How to let crawlers in, but prevent their damage?


On Jan 10, 2011, at 7:02am, Otis Gospodnetic wrote:

> Hi Ken, thanks Ken. :)
> 
> The problem with this approach is that it exposes very limited content to
> bots/web search engines.
> 
> Take http://search-lucene.com/ for example.  People enter all kinds of queries
> in web search engines and end up on that site.  People who visit the site
> directly don't necessarily search for those same things.  Plus, new terms are
> entered to get to search-lucene.com every day, so keeping up with that would
> mean constantly generating more and more of those static pages.  Basically, the
> tail is super long.

To clarify - the issue of using actual user search traffic is one of SEO, not 
what content you expose.

If, for example, people commonly do a search for "java <something>" then that's 
a hint that the URL to the static content, and the page title, should have the 
language as part of it.

So you shouldn't be generating static pages based on search traffic. Though you 
might want to decide what content to "favor" (see below) based on popularity.

> On top of that, new content is constantly being generated,
> so one would have to also constantly both add and update those static pages.

Yes, but that's why you need to automate that content generation, and do it on a 
regular (e.g. weekly) basis.

The big challenges we ran into were:

1. Dealing with badly behaved bots that would hammer the site.

We wound up putting this content on a separate system, so it wouldn't impact 
users on the main system.

And generating a regular report by user agent & IP address, so that we could 
block by robots.txt and IP when necessary.
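
For illustration only, a minimal Python sketch of that kind of report, assuming a 
combined-format access log; the path, regex, and threshold are placeholders, not 
anything from the setup described above:

import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"')

def bot_report(log_path="access.log", threshold=10000):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            m = LOG_LINE.match(line)
            if m:
                counts[(m.group("agent"), m.group("ip"))] += 1
    # Heaviest hitters first; anything over the (arbitrary) threshold is a
    # candidate for a robots.txt Disallow or an IP-level block.
    for (agent, ip), n in counts.most_common(50):
        flag = "BLOCK?" if n > threshold else ""
        print("%8d  %-15s  %s  %s" % (n, ip, agent[:60], flag))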

2. Figuring out how to structure the static content so that it didn't look like 
spam to Google/Yahoo/Bing.

You don't want to have too many links per page, or too much depth, but that 
constrains how many pages you can reasonably expose.

We had project scores based on code, activity, and usage - so we used those to 
rank the content and focus on exposing the "good stuff" early (at low depth). 
You could do the same based on popularity, from your search logs.
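
A sketch of that ranking idea in Python (the data shapes and fan-out are made up 
for illustration, not the actual Krugle scheme): sort pages by a popularity score 
from your search logs and fill the shallowest levels of the static link structure 
first.

def assign_depths(page_popularity, links_per_page=100):
    # page_popularity: dict of page id -> score (e.g. query hits from search logs)
    ranked = sorted(page_popularity, key=page_popularity.get, reverse=True)
    depths = {}
    depth, capacity, placed = 1, links_per_page, 0
    for page in ranked:
        depths[page] = depth
        placed += 1
        if placed >= capacity:            # this level is full; the rest go deeper
            depth += 1
            capacity *= links_per_page    # each page links to links_per_page more
            placed = 0
    return depths

The most popular pages end up one click from the root and the long tail lands 
deeper, which keeps the per-page link count bounded without hiding the good stuff.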

Anyway, there's a lot to this topic, but it doesn't feel very Solr specific. So 
apologies for reducing the signal-to-noise ratio with talk about SEO :)

-- Ken

> I have a feeling there is not a good solution for this, because on one hand
> people don't like the negative bot side effects, while on the other hand people
> want as much of their sites indexed by the big guys.  The only half-solution
> that comes to mind involves looking at who's actually crawling you and who's
> bringing you visitors, then blocking those with a bad ratio of those two - bots
> that crawl a lot but don't bring a lot of value.
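
A back-of-the-envelope Python version of that ratio check, purely as a sketch: 
the counts would come from your access and referrer logs, and the cutoff here is 
arbitrary.

def bots_to_block(crawl_hits, referral_visits, min_ratio=0.001):
    # crawl_hits: bot name -> number of crawler requests seen in access logs
    # referral_visits: bot name -> visits referred by that search engine
    blocked = []
    for bot, hits in crawl_hits.items():
        referred = referral_visits.get(bot, 0)
        if hits and referred / float(hits) < min_ratio:
            blocked.append(bot)           # crawls a lot, sends almost nobody back
    return blocked

# bots_to_block({"goodbot": 50000, "noisybot": 200000},
#               {"goodbot": 4000, "noisybot": 3})   -> ["noisybot"]
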
> 
> Any other ideas?
> 
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> ----- Original Message ----
>> From: Ken Krugler <kkrugler_li...@transpac.com>
>> To: solr-user@lucene.apache.org
>> Sent: Mon, January 10, 2011 9:43:49 AM
>> Subject: Re: How to let crawlers in, but prevent their damage?
>> 
>> Hi Otis,
>> 
>> From what I learned at Krugle, the approach that worked for us was:
>> 
>> 1. Block all bots on the search page.
>> 
>> 2. Expose the target content via statically linked pages that are separately
>> generated from the same backing store, and optimized for target search terms
>> (extracted from your own search logs).
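
A sketch of the robots.txt split implied by points 1 and 2; the paths are 
placeholders, not Krugle's real layout, and note that Allow is a non-standard 
extension (though the major engines honor it).

User-agent: *
Disallow: /search         # keep bots off the live search handler
Allow: /browse/           # statically generated pages stay crawlable

Sitemap: http://www.example.com/sitemap.xml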
>> 
>> -- Ken
>> 
>> On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote:
>> 
>>> Hi,
>>> 
>>> How do people with public search services deal with bots/crawlers?
>>> And I don't mean to ask how one bans them (robots.txt) or slows them down
>>> (the Crawl-delay stuff in robots.txt) or prevents them from digging too deep
>>> in search results...
>>> 
>>> What I mean is that when you have publicly exposed search that bots crawl,
>>> they issue all kinds of crazy "queries" that result in errors, that add noise
>>> to Solr caches, increase Solr cache evictions, etc. etc.
>>> 
>>> Are there some known recipes for dealing with them, minimizing their negative
>>> side-effects, while still letting them crawl you?
>>> 
>>> Thanks,
>>> Otis
>>> ----
>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>>> Lucene ecosystem search :: http://search-lucene.com/
>>> 
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
