Indexing Only Parts of HTML Pages

2008-08-13 Thread Nick Tkach
I'm wondering, is there some way ("out of the box") to tell Solr that 
we're only interested in indexing certain parts of a page?  For example, 
let's say I have a bunch of pages in my site that contain some common 
navigation elements, roughly like this:

<div>
  Stuff here about parts of my site
</div>
<div>
  More stuff about other parts of the site
</div>
A bunch of stuff particular to each individual page...

Is there some way to either tell Solr not to index what's in the two 
divs whenever it encounters them (and it will, in nearly every page) or, 
failing that, to easily give content in those areas a large negative 
score to get the same effect?


FWIW, we are using Nutch to do the crawling, but as I understand it 
there's no way to get Nutch to skip only parts of pages without writing 
custom code, right?
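
One possible approach to the question above, sketched outside of Solr and Nutch: strip the shared navigation divs from each page before the text is handed to the indexer. This is only a minimal illustration using Python's standard-library HTML parser, not a built-in Solr or Nutch feature. The div ids ("site-nav", "section-nav") and the function name extract_page_text are hypothetical, since the original post does not say how the two divs are marked up.

```python
from html.parser import HTMLParser

# Hypothetical ids for the shared navigation divs (not given in the post).
SKIP_IDS = {"site-nav", "section-nav"}


class ContentExtractor(HTMLParser):
    """Collects page text, ignoring anything inside a <div> whose id is in SKIP_IDS."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one bool per open <div>: True if that div is skipped
        self.skip_depth = 0   # number of skipped divs currently open
        self.chunks = []      # text fragments kept for indexing

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            skipped = dict(attrs).get("id") in SKIP_IDS
            self.div_stack.append(skipped)
            if skipped:
                self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            if self.div_stack.pop():
                self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when no skipped div is currently open.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_page_text(html):
    """Return the page text with the navigation divs removed."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The returned string is what would then be sent to the index in place of the raw page. Note that this sketch does not filter script or style content and assumes reasonably well-formed div nesting.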


Re: what crawler do you use for Solr indexing?

2009-03-05 Thread Nick Tkach
Yes, Nutch works quite well as a crawler for Solr.

- Original Message -
From: "Tony Wang" 
To: solr-user@lucene.apache.org
Sent: Thursday, March 5, 2009 5:32:57 PM GMT -06:00 US/Canada Central
Subject: what crawler do you use for Solr indexing?

Hi,

I wonder if there's any open source crawler that can be integrated
with Solr. What crawler do you guys use? Or did you code one yourselves? I
have been looking for a way to integrate Nutch with Solr, but
haven't had any luck yet.

Could someone shed some light on this?

thanks!

Tony
