Thank you, Markus and Chris, for the pointers.

For SweetSpotSimilarity, I am thinking that a set of closed ranges exposed via the similarity config might be easier to maintain as the data changes than adjusting parameters to fit a function.

Another piece of information that would have been handy is the average position of each term, plus the positions of its first few occurrences. That might allow higher boosting for term occurrences earlier in the doc. In my case the extra keywords are towards the end of the doc, but that information does not seem to be propagated to the scorer.

Thanks again,
Mihran
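P.S. To make the ranges idea concrete, here is a minimal sketch in plain Python, not Lucene code. The names `LENGTH_RANGES`, `DEFAULT_NORM` and `range_length_norm` are hypothetical, and every number in the table is made up; it assumes the table could somehow be supplied through the similarity config.

```python
# Hypothetical table: closed ranges [lo, hi] of field length (in terms),
# each mapped to a length-norm factor. All values are illustrative only.
LENGTH_RANGES = [
    (1, 50, 1.0),      # "normal" length: no penalty
    (51, 200, 0.7),    # somewhat long
    (201, 1000, 0.4),  # very long, possibly keyword stuffing
]
DEFAULT_NORM = 0.2     # used when no range contains the length


def range_length_norm(num_terms, ranges=LENGTH_RANGES, default=DEFAULT_NORM):
    """Return the factor of the first closed range containing num_terms."""
    for lo, hi, factor in ranges:
        if lo <= num_terms <= hi:
            return factor
    return default
```

The appeal for me is that when the data shifts, only the table rows need editing; there is no curve to re-fit.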
On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> You should start by checking out the "SweetSpotSimilarity" .. it was
> heavily designed around the idea of dealing with things like excessively
> verbose titles, and keyword stuffing in summary text ... so you can
> configure your expectation for what a "normal" length doc is, and they
> will be penalized for being longer than that. Similarly you can say what
> a 'reasonable' tf is, and docs that exceed that wouldn't get added boost
> (which in conjunction with the lengthNorm penalty penalizes docs that
> stuff keywords)
>
> https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
>
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
>
> -Hoss
> http://www.lucidworks.com/
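For my own reference, the behaviour described above can be sketched roughly as follows. This is plain Python, not the Lucene implementation: the plateau formula follows the shape of the linked computeLengthNorm graph as I understand it, `capped_tf` is a deliberately simplified stand-in for the hyperbolic tf curve, and the function names and default parameter values are my own assumptions.

```python
import math


def plateau_length_norm(num_terms, lo=1, hi=50, steepness=0.5):
    """1.0 for lo <= num_terms <= hi, decaying the further outside it gets."""
    # Twice the distance outside [lo, hi]; evaluates to 0 inside the range.
    outside = abs(num_terms - lo) + abs(num_terms - hi) - (hi - lo)
    return 1.0 / math.sqrt(steepness * outside + 1.0)


def capped_tf(freq, cap=5.0):
    """Simplified stand-in: sqrt(tf) that stops growing past a 'reasonable' tf."""
    return math.sqrt(min(freq, cap))
```

With these assumed defaults, a 25-term field gets a norm of 1.0 and longer fields get progressively less, while a term repeated 100 times scores no higher on tf than one repeated 5 times.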