Re: faceted search with job title

Ken Krugler Thu, 22 Jul 2010 07:19:49 -0700

Hi Savannah,

A few comments below, scattered in-line...


-- Ken

On Jul 21, 2010, at 3:08pm, Savannah Beckett wrote:

And I will have to recompile the dom or sax code each time I add a job board for crawling. A regex pattern is only a string, which can be stored in a text file or
db and retrieved based on the job board.  What do you think?

You can store the XPath expressions in a text file as strings, and load/compile them as needed.
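
A minimal sketch of that idea in Python, assuming a tab-separated rules file and pages that have already been cleaned into well-formed XHTML (e.g. by a TagSoup-style pass). The host names, the file layout, and the helper names load_rules and extract_title are hypothetical. ElementTree only supports a subset of XPath, so a full XPath engine would be needed for more complex expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file content: one "site<TAB>xpath" pair per line.
# In practice this text would be read from a file or a db.
RULES_TEXT = """\
# site\txpath
jobs.example.com\t.//h1[@class='job-title']
careers.example.org\t.//div[@id='posting']/h2
"""


def load_rules(text):
    """Parse 'site<TAB>xpath' lines into a dict, skipping blanks and comments."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        site, xpath = line.split("\t", 1)
        rules[site] = xpath
    return rules


def extract_title(rules, site, xhtml):
    """Apply the stored XPath string for this site to a well-formed page."""
    root = ET.fromstring(xhtml)
    node = root.find(rules[site])
    if node is None or not node.text:
        return None
    return node.text.strip()


rules = load_rules(RULES_TEXT)
page = "<html><body><h1 class='job-title'> Senior Bottlewasher </h1></body></html>"
title = extract_title(rules, "jobs.example.com", page)
print(title)
```

Adding a job board then means adding one line to the rules text, with no code change.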

From: "Nagelberg, Kallin" <knagelb...@globeandmail.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title
Yeah, you should definitely just set up a custom parser for each site. It should be easy to extract the title using Groovy's XML parsing along with TagSoup for sloppy
HTML.


Definitely yes re using TagSoup to clean up bad HTML.

And definitely yes to needing per-site "rules" (typically XPath + optional regex as needed) to extract specific details.

For a common class of sites powered by the same back-end, you can often re-use the same general rules, as the markup that you care about is consistent.
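
One way to picture "XPath + optional regex" rules shared across sites on the same back-end is the Python sketch below. Everything named here is an assumption made for illustration: the back-end labels, the host names, the rule layout, and the extract_job_title helper. It also assumes the page is already well-formed XHTML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule sets, keyed by back-end rather than by site, so that
# sites with the same markup share one rule.
BACKEND_RULES = {
    # Title tag looks like "Job Title - Company"; the regex keeps the first part.
    "acme-ats": {"xpath": ".//title", "regex": r"^(.*?)\s+-\s+"},
    # The h1 holds the job title and nothing else, so no regex is needed.
    "plain": {"xpath": ".//h1", "regex": None},
}

# Hypothetical mapping from site to the back-end that powers it.
SITE_BACKEND = {
    "jobs.example.com": "acme-ats",
    "careers.example.org": "acme-ats",
    "work.example.net": "plain",
}


def extract_job_title(site, xhtml):
    """Select a node with the rule's XPath, then narrow it with the regex if any."""
    rule = BACKEND_RULES[SITE_BACKEND[site]]
    node = ET.fromstring(xhtml).find(rule["xpath"])
    if node is None or not node.text:
        return None
    text = node.text.strip()
    if rule["regex"]:
        match = re.search(rule["regex"], text)
        if match is None:
            return None
        text = match.group(1)
    return text


ats_page = "<html><head><title>Senior Bottlewasher - Acme Corp</title></head><body/></html>"
plain_page = "<html><body><h1>Dishwasher</h1></body></html>"
print(extract_job_title("jobs.example.com", ats_page))
print(extract_job_title("work.example.net", plain_page))
```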

If you can't find the pattern for each site leading to the job title, how
can you expect Solr to? Humans have the advantage here :P

-Kallin Nagelberg

-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.sea...@magicalia.com
Subject: Re: faceted search with job title
mmm...there must be a better way...each job board has a different format. If there are constantly new job boards being crawled, I don't think I can manually look for the specific sequence of tags that leads to the job title. Most of them don't even have a class or id. There is no guarantee that the job title will be in the title tag or a header tag. Something else can be in the title. Should I do this in a
class that extends IndexFilter in Nutch?

When I do this kind of thing I use Bixo (http://openbixo.org), but that requires knowledge of Cascading (& some Hadoop) in order to construct web mining workflows.

________________________________
From: Dave Searle <dave.sea...@magicalia.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title
You'd probably need to do some post-processing on the pages and set up rules for each website to grab that specific bit of data. You could load the HTML into an XML parser, then use XPath to grab content from a particular tag with a class or
id, based on the particular website.



-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
I am currently using Nutch to crawl some job pages from job boards. They are in my Solr index now. I want to do faceted search with the job titles. How? The job titles can be in any location on the page, e.g. title, header, content... If I use an IndexFilter in Nutch to search the content for the job title, there are hundreds of thousands of job titles, and I can't hard-code them all. Do you have a better idea? I think I need the job title in a separate field in the
index to make it work with Solr faceted search, am I right?

Yes, you'd want a separate "job title" field in the index. Though often the job titles are slight variants on each other, so this would probably work much better if you automatically found common phrases and used those, otherwise you get "Senior Bottlewasher" and "Sr. Bottlewasher" and "Sr Bottlewasher" as separate facet values.
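
To make the variant problem concrete, here is a small Python sketch that collapses those three spellings into one facet value before indexing. Note that it is a simpler stand-in (a hand-written abbreviation table plus punctuation cleanup), not the automatic common-phrase discovery described above. The ABBREVIATIONS table and the normalize_title helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical abbreviation table; a real one would be much larger or
# derived from the crawled data.
ABBREVIATIONS = {"sr": "senior", "jr": "junior"}


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    tokens = re.sub(r"[^a-z0-9]+", " ", title.lower()).split()
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(expanded).title()


titles = ["Senior Bottlewasher", "Sr. Bottlewasher", "Sr Bottlewasher"]
facets = Counter(normalize_title(t) for t in titles)
print(facets)
```

The normalized string is what would go into the separate "job title" field, so all three variants count toward a single facet value.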


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
