Hi Dan:

Agreed, this question is more Nutch related than Solr ;)

Nutch doesn't send any data to the /update/extract request handler. All the text 
and metadata extraction happens on the Nutch side rather than relying on the 
ExtractingRequestHandler provided by Solr. Underneath, Nutch uses Tika, the same 
technology as the ExtractingRequestHandler, so there shouldn't be any great 
difference. 

By default Nutch doesn't boost anything, as it is Solr's job to boost the 
different content in the different fields, which is what happens when you run a 
query against Solr. Nutch calculates the LinkRank, which is a variation of the 
famous PageRank (or the OPIC score, which is another scoring algorithm 
implemented in Nutch, and which I believe is the default in Nutch 2.x). What you 
can do is map the heading tags into different fields and then apply different 
boosts to each field at query time. 
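
As a rough illustration of the per-field boosts, here is a minimal Python 
sketch that builds the query string for an edismax request, where the qf 
parameter carries one boost per field. The field names (title, h1, h2, 
content) and the boost values are assumptions for the example only; use 
whatever fields your own schema and Nutch configuration actually produce.

```python
from urllib.parse import urlencode


def build_boosted_query(user_query, field_boosts):
    """Return a URL-encoded Solr query string using edismax field boosts."""
    # qf takes a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": qf,
        "wt": "json",
    }
    return urlencode(params)


# Hypothetical field names and weights: headings count more than body text.
FIELD_BOOSTS = {"title": 5.0, "h1": 4.0, "h2": 2.0, "content": 1.0}

query_string = build_boosted_query("small scale search engine", FIELD_BOOSTS)
print(query_string)
```

The resulting string can be appended to the select handler URL of your core. 
Tuning relevance then means changing the numbers in FIELD_BOOSTS, not 
touching Nutch.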

The general idea with Nutch is to "make pieces of the web page" and store each 
piece in a different field in Solr. Then you can tweak your relevance function 
using the values you see fit, so you don't need to write any plugin to 
accomplish this (at least for the h1, h2, etc. example you provided; if you 
want to extract other parts of the web page, you'll need to write your own 
plugin to do so). 
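
To make the "pieces of the web page" idea concrete, here is a small Python 
sketch that splits an HTML page into one field per heading level plus a 
catch-all content field. This is not Nutch code (Nutch is Java and does this 
in its parse and index plugins); the class name, function name, and field 
names are all made up for illustration.

```python
from html.parser import HTMLParser


class HeadingSplitter(HTMLParser):
    """Collect page text into per-heading fields (illustrative only)."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.fields = {"content": []}
        self._open_headings = []  # stack of heading tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._open_headings.append(tag)

    def handle_endtag(self, tag):
        if self._open_headings and self._open_headings[-1] == tag:
            self._open_headings.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text inside a heading goes to that heading's field,
        # everything else goes to the generic content field.
        field = self._open_headings[-1] if self._open_headings else "content"
        self.fields.setdefault(field, []).append(text)


def split_page(html):
    """Return a dict mapping field name to the list of text pieces found."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return parser.fields


doc = split_page(
    "<html><body><h1>Nutch</h1><p>A crawler.</p>"
    "<h2>Plugins</h2><p>Many.</p></body></html>"
)
print(doc)
```

Each key of the resulting dict would correspond to one Solr field, and each 
of those fields can then get its own boost at query time.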

Nutch is highly customizable: you can write a plugin for almost any piece of 
logic, from parsers to indexers, including URL filters, scoring algorithms, 
protocols, and a long list of others. The plugins are usually not difficult to 
write; the hard part is knowing which extension point you need to use. That 
comes with experience and a good dive into the source code.

Hope this helps,

----- Original Message -----
From: "Dan Davis" <dansm...@gmail.com>
To: "solr-user" <solr-user@lucene.apache.org>
Sent: Monday, January 26, 2015 12:08:13 AM
Subject: [MASSMAIL]Weighting of prominent text in HTML

By examining solr.log, I can see that Nutch is using the /update request
handler rather than /update/extract. So, this may be a more appropriate
question for the nutch mailing list. OTOH, y'all know the answer off the
top of your head.

Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
normal paragraph? Can this weighting be tuned without writing a plugin?
Is writing a plugin often needed because of the flexibility that is
needed in practice?

I wanted to call this post *Anatomy of a small scale search engine*, but
lacked the nerve ;)

Thanks, all and many,

Dan Davis, Systems/Applications Architect
National Library of Medicine

