You could grab your XPath rules from a db too. This is what I did for a price scraping app I built a while ago: new sites were added with a set of rules through a web UI. You could certainly use regex, of course, but IMO that's more complex than writing a simple XPath. Using JavaScript or some DOM traversal code, you could quite easily create a point-and-click tool to generate rules simply and quickly.
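
A minimal sketch of the rules-in-a-db idea, in Python with only the standard library. Everything named here is an assumption for illustration, not taken from the app mentioned above: the `extraction_rules` table and its columns, the site names, the XPath strings, and the sample page. ElementTree only parses well-formed markup and supports a subset of XPath, so real job board pages would probably need a tolerant HTML parser in front of this.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_rules(conn):
    """Return {site: xpath} from the (hypothetical) extraction_rules table."""
    rows = conn.execute("SELECT site, title_xpath FROM extraction_rules")
    return {site: xpath for site, xpath in rows}


def extract_title(markup, xpath):
    """Apply one stored XPath rule to well-formed markup; None if nothing matches."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    if node is None:
        return None
    # Join all text inside the matched element, including nested tags.
    return "".join(node.itertext()).strip()


# In-memory database standing in for the real rule store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE extraction_rules (site TEXT PRIMARY KEY, title_xpath TEXT)"
)
conn.executemany(
    "INSERT INTO extraction_rules VALUES (?, ?)",
    [
        # Hypothetical sites and rules; adding a job board is just another row.
        ("jobs.example.com", ".//h1[@class='job-title']"),
        ("careers.example.org", ".//div[@id='posting']/h2"),
    ],
)
rules = load_rules(conn)

# Hypothetical, already well-formed page from one of the sites.
page = (
    "<html><body><div id='posting'>"
    "<h2>Senior Java Developer</h2>"
    "</div></body></html>"
)
title = extract_title(page, rules["careers.example.org"])
print(title)
```

Because each rule is just a string in a table, adding a job board means inserting a row rather than recompiling anything, and the extracted title can then go into its own Solr field for faceting.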

On 21 Jul 2010, at 23:10, Savannah Beckett <savannah_becket...@yahoo.com> wrote:

> And I will have to recompile the dom or sax code each time I add a job board
> for crawling. Regex patten is only a string which can be stored in a text
> file or db, and retrieved based on the job board. What do you think?
>
> ________________________________
> From: "Nagelberg, Kallin" <knagelb...@globeandmail.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wed, July 21, 2010 10:39:32 AM
> Subject: RE: faceted search with job title
>
> Yeah you should definitely just setup a custom parser for each site.. should
> be easy to extract title using groovy's xml parsing along with tagsoup for
> sloppy html. If you can't find the pattern for each site leading to the job
> title how can you expect solr to? Humans have the advantage here :P
>
> -Kallin Nagelberg
>
> -----Original Message-----
> From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
> Sent: Wednesday, July 21, 2010 12:20 PM
> To: solr-user@lucene.apache.org
> Cc: dave.sea...@magicalia.com
> Subject: Re: faceted search with job title
>
> mmm...there must be better way...each job board has different format. If
> there are constantly new job boards being crawled, I don't think I can
> manually look for specific sequence of tags that leads to job title. Most of
> them don't even have class or id. There is no guarantee that the job title
> will be in the title tag, or header tag. Something else can be in the title.
> Should I do this in a class that extends IndexFilter in Nutch?
> Thanks.
>
> ________________________________
> From: Dave Searle <dave.sea...@magicalia.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wed, July 21, 2010 8:42:55 AM
> Subject: RE: faceted search with job title
>
> You'd probably need to do some post processing on the pages and set up rules
> for each website to grab that specific bit of data. You could load the html
> into an xml parser, then use xpath to grab content from a particular tag with
> a class or id, based on the particular website
>
> -----Original Message-----
> From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
> Sent: 21 July 2010 16:38
> To: solr-user@lucene.apache.org
> Subject: faceted search with job title
>
> Hi,
> I am currently using nutch to crawl some job pages from job boards. They are
> in my solr index now. I want to do faceted search with the job titles. How?
> The job titles can be in any locations of the page, e.g. title, header,
> content... If I use indexfilter in Nutch to search the content for job title,
> there are hundred of thousands of job titles, I can't hard code them all. Do
> you have a better idea? I think I need the job title in a separate field in
> the index to make it work with solr faceted search, am I right?
> Thanks.