You could grab your XPath rules from a db too. This is what I did for a
price-scraping app I wrote a while ago. New sites were added with a set of rules
using a web UI. You could certainly use regex of course, but IMO that's more
complex than writing a simple XPath. Using JavaScript or some DOM traversal
code, you could quite easily create a point-and-click tool to generate rules
very simply and quickly.
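
To make the idea concrete, here is a rough sketch of per-site XPath rules kept
in a db and looked up at extraction time. It is not the code from my app: the
`rules` table, the `title_xpath` column, the `extract_title` helper and the
jobs.example.com site are all made up for illustration. It also assumes the page
has already been cleaned into well-formed XML (e.g. by something like TagSoup),
and it uses Python's built-in ElementTree, which only supports a limited subset
of XPath.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical rule store: one XPath expression per site.
# An in-memory SQLite db stands in for whatever db you actually use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (site TEXT PRIMARY KEY, title_xpath TEXT)")
conn.execute(
    "INSERT INTO rules VALUES (?, ?)",
    ("jobs.example.com", ".//div[@class='posting']/h2"),
)


def extract_title(conn, site, page):
    """Look up the site's XPath rule and apply it to an already-cleaned page."""
    row = conn.execute(
        "SELECT title_xpath FROM rules WHERE site = ?", (site,)
    ).fetchone()
    if row is None:
        return None  # no rule registered for this site yet
    node = ET.fromstring(page).find(row[0])
    if node is None:
        return None  # rule did not match anything on this page
    return "".join(node.itertext()).strip()


# Made-up page, assumed to be well-formed after cleaning.
page = (
    "<html><body><div class='posting'>"
    "<h2>Senior Java Developer</h2>"
    "</div></body></html>"
)
print(extract_title(conn, "jobs.example.com", page))
```

Adding a new site is then just one more row in the table, with no code change or
recompile.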

On 21 Jul 2010, at 23:10, Savannah Beckett <savannah_becket...@yahoo.com> wrote:

> And I will have to recompile the DOM or SAX code each time I add a job board
> for crawling.  A regex pattern is only a string, which can be stored in a text
> file or db and retrieved based on the job board.  What do you think?
> 
> 
> 
> 
> ________________________________
> From: "Nagelberg, Kallin" <knagelb...@globeandmail.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wed, July 21, 2010 10:39:32 AM
> Subject: RE: faceted search with job title
> 
> Yeah, you should definitely just set up a custom parser for each site. It
> should be easy to extract the title using Groovy's XML parsing along with
> TagSoup for sloppy HTML. If you can't find the pattern for each site leading
> to the job title, how can you expect Solr to? Humans have the advantage
> here :P
> 
> -Kallin Nagelberg
> 
> -----Original Message-----
> From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
> Sent: Wednesday, July 21, 2010 12:20 PM
> To: solr-user@lucene.apache.org
> Cc: dave.sea...@magicalia.com
> Subject: Re: faceted search with job title
> 
> Hmm, there must be a better way. Each job board has a different format.  If
> there are constantly new job boards being crawled, I don't think I can
> manually look for the specific sequence of tags that leads to the job title.
> Most of them don't even have a class or id.  There is no guarantee that the
> job title will be in the title tag or a header tag.  Something else can be in
> the title.  Should I do this in a class that extends IndexFilter in Nutch?
> Thanks.
> 
> 
> 
> 
> ________________________________
> From: Dave Searle <dave.sea...@magicalia.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wed, July 21, 2010 8:42:55 AM
> Subject: RE: faceted search with job title
> 
> You'd probably need to do some post-processing on the pages and set up rules
> for each website to grab that specific bit of data. You could load the HTML
> into an XML parser, then use XPath to grab content from a particular tag with
> a class or id, based on the particular website.
> 
> 
> 
> -----Original Message-----
> From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
> Sent: 21 July 2010 16:38
> To: solr-user@lucene.apache.org
> Subject: faceted search with job title
> 
> Hi,
>   I am currently using Nutch to crawl some job pages from job boards.  They
> are in my Solr index now.  I want to do faceted search on the job titles.
> How?  The job titles can be anywhere on the page, e.g. title, header,
> content...  If I use an IndexFilter in Nutch to search the content for job
> titles, there are hundreds of thousands of job titles; I can't hard-code them
> all.  Do you have a better idea?  I think I need the job title in a separate
> field in the index to make it work with Solr faceted search, am I right?
> Thanks.
> 
> 
