Hi all,
In our Solr 6 setup we use string payloads to boost certain tokens (URIs).
These strings are mapped to floats via a schema parameter, "PayloadMapping",
which is read by our custom WKSimilarity class (extending TFIDFSimilarity):
<fieldType name="uri_payload" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity" delimiter="|"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <similarity class="com.wolterskluwer.atlas.solr.similarities.WKSimilarityFactory">
    <str name="BM25k1a">0.4</str>
    <str name="BM25k1b">0.4</str>
    <str name="BM25b">0.5</str>
    <str name="IDFCurveFactor">0</str>
    <str name="sloppyFreqCurveFactor">0.0</str>
    <str name="PayloadBoost">10.0</str>
    <str name="PayloadImpact">3.0</str>
    <str name="PayloadCurveFactor">1.0</str>
    <str name="PayloadMapping">isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,subject=4.0,mentions=2.0,creator=2.0</str>
  </similarity>
</fieldType>
The reason for this indirection is convenience: by storing payload strings
instead of floats, we can change and tune the boosts simply by updating the
schema, without having to reindex the content set.
Inside WKSimilarity, each payload string is mapped to its corresponding boost
value, and the final boost is applied via the scorePayload method (where we
can tune the boost curve via some additional schema parameters). This works
well in Solr 6.
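
To make the mapping step concrete, here is a rough, simplified sketch of the idea. It is not the actual WKSimilarity code: the class and method names are made up for illustration, it assumes the identity encoder stored the payload string as UTF-8 bytes, and it assumes a neutral boost of 1.0 for payloads that are missing or not listed in the mapping. The curve parameters (PayloadBoost, PayloadImpact, PayloadCurveFactor) are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the real WKSimilarity.
class PayloadMappingSketch {

    // Parses a "PayloadMapping"-style value such as
    // "isAbout=15.0,type=5.0" into a label -> boost map.
    static Map<String, Float> parseMapping(String spec) {
        Map<String, Float> mapping = new HashMap<>();
        for (String entry : spec.split(",")) {
            if (entry.trim().isEmpty()) {
                continue; // tolerate empty entries and trailing commas
            }
            String[] pair = entry.trim().split("=", 2);
            if (pair.length != 2) {
                throw new IllegalArgumentException("Bad mapping entry: " + entry);
            }
            mapping.put(pair[0].trim(), Float.parseFloat(pair[1].trim()));
        }
        return mapping;
    }

    // Decodes the raw payload bytes (assumed UTF-8) back into the label
    // and looks up its boost. Missing or unknown payloads get 1.0
    // (an assumption made for this sketch).
    static float boostFor(Map<String, Float> mapping,
                          byte[] payload, int offset, int length) {
        if (payload == null || length == 0) {
            return 1.0f;
        }
        String label = new String(payload, offset, length, StandardCharsets.UTF_8);
        return mapping.getOrDefault(label, 1.0f);
    }
}
```

The mapping would be parsed once from the schema parameter, and boostFor would then be called with the payload bytes of each matching token, which is roughly the role scorePayload plays for us in Solr 6.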
The problem: we are about to migrate to Solr 8, and since LUCENE-8014 it is no
longer possible to override the scorePayload method in WKSimilarity (it has
been removed from TFIDFSimilarity). I wonder what alternatives there are for
mapping string payloads to floats and using them in a tunable boosting formula.
Thanks,
Tom Burgmans