Hello, 

we are currently developing a combined index for book metadata and
fulltexts. Our primary core contains metadata for ~12 million books. About
0.5 million of them have fulltexts, which are indexed in a secondary core.
This secondary core has one index document per fulltext page.
We join all matching fulltext pages with the per-book metadata
in the primary core. Our current problem is that scores for books
with matches from the secondary core are not comparable with scores for
books that match on metadata only. We are therefore trying to normalize
the fulltext scores so that they are on the same scale as the metadata
scores for non-digitized results.

This is a basic query without join using only the primary core
(metadata): 
http://server/solr/live/select?&q=+geschichte&fl=id,score
Top 10 result scores range from 2.0 to 1.7

For fulltexts, the query is extended with a join: 
http://server/solr/live/select?q=%28%28+geschichte%29%20OR%20_query_:{!join%20from=expandtype%20fromIndex=pages%20to=id%20score=max%20v=%27pageno_content:%28+geschichte%29%27}%29&fl=id,score
Top 10 result scores range from 5.4 to 4.8. For the first hit, 4.7 score
points come from the joined secondary core; this is the value we would
like to reduce (see explain output below [1]).

This difference will effectively hide any books without fulltexts from
hitlists, which is not our goal. 

We tried adding Lucene boosts to the join subquery, but they have no
effect on the final scores. For example, we tried to down-boost the
fulltext results by a factor of 0.1:
q=((+geschichte) OR _query_:{!join from=expandtype fromIndex=pages
to=id score=max v='pageno_content:(+geschichte)^0.1'})
But the resulting scores are the same as in the join example above.
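
For reference, here is a minimal Python sketch of how the two query variants above can be assembled and URL-encoded. The function names `build_join_query` and `build_select_url` are hypothetical helpers written for this mail, not part of Solr; the field names, core names and host are the ones from the examples above.

```python
from urllib.parse import urlencode

# Host and core name as used in the example URLs above.
BASE_URL = "http://server/solr/live/select"


def build_join_query(term, boost=None):
    """Return the combined metadata + fulltext join query string.

    If `boost` is given, it is appended to the join subquery as a
    Lucene boost (e.g. ^0.1), exactly as in the attempt described above.
    """
    sub = f"pageno_content:(+{term})"
    if boost is not None:
        sub += f"^{boost}"
    join = (
        "{!join from=expandtype fromIndex=pages to=id score=max "
        f"v='{sub}'}}"
    )
    return f"((+{term}) OR _query_:{join})"


def build_select_url(q, fl="id,score"):
    """URL-encode the query; urlencode escapes '+' as %2B and spaces as '+'."""
    return BASE_URL + "?" + urlencode({"q": q, "fl": fl})


if __name__ == "__main__":
    print(build_select_url(build_join_query("geschichte")))
    print(build_select_url(build_join_query("geschichte", boost=0.1)))
```

Both variants differ only in the `^0.1` suffix inside the `v` parameter, which makes it easy to compare their explain output side by side.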

Is this the correct query syntax, or should the boost for the join
query be put somewhere else?

Thanks for any suggestions. 

Best Regards
Alena

[1] Explain output for the first hit of the join example query 
5.398742 = sum of:
  4.816505 = sum of:
    0.07251295 = max of:
      0.07251295 = weight(title:geschichte in 10585926) [ClassicSimilarity], result of:
        0.07251295 = score(doc=10585926,freq=1.0), product of:
          0.037440736 = queryWeight, product of:
            5.1646385 = idf(docFreq=197504, maxDocs=12713278)
            0.00724944 = queryNorm
          1.9367394 = fieldWeight in 10585926, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            5.1646385 = idf(docFreq=197504, maxDocs=12713278)
            0.375 = fieldNorm(doc=10585926)
      0.005904072 = weight(free_search:geschichte in 10585926) [ClassicSimilarity], result of:
        0.005904072 = score(doc=10585926,freq=2.0), product of:
          0.022005465 = queryWeight, product of:
            3.035471 = idf(docFreq=1660594, maxDocs=12713278)
            0.00724944 = queryNorm
          0.26830027 = fieldWeight in 10585926, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.035471 = idf(docFreq=1660594, maxDocs=12713278)
            0.0625 = fieldNorm(doc=10585926)
    4.743992 = Score based on join value 957245
  0.58188105 = weight(statusband:F in 10585926) [ClassicSimilarity], result of:
    0.58188105 = score(doc=10585926,freq=1.0), product of:
      0.4592555 = queryWeight, product of:
        50.0 = boost
        1.2670095 = idf(docFreq=9734121, maxDocs=12713278)
        0.00724944 = queryNorm
      1.2670095 = fieldWeight in 10585926, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        1.2670095 = idf(docFreq=9734121, maxDocs=12713278)
        1.0 = fieldNorm(doc=10585926)
  3.5596997E-4 = FunctionQuery(1.0/(3.16E-11*float(ms(const(1458638802405),date(freshness)))+1.0)), product of:
    0.00491031 = 1.0/(3.16E-11*float(ms(const(1458638802405),date(freshness)=1813-01-01T00:00:01Z))+1.0)
    0.0724944 = boost
    1.0 = queryNorm

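To make the breakdown above easier to follow, here is a minimal Python sketch that recomputes the individual parts from the numbers in the explain output. It assumes the per-term formula shown in the output (queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)); the helper name `classic_term_score` is hypothetical. The join contribution is taken as-is, because the explain output does not break it down further.

```python
import math


def classic_term_score(freq, idf, query_norm, field_norm, boost=1.0):
    """Recompute one term score as laid out in the explain output."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


QUERY_NORM = 0.00724944

# Metadata matches in the primary core (combined with "max of").
title = classic_term_score(1.0, 5.1646385, QUERY_NORM, 0.375)
free_search = classic_term_score(2.0, 3.035471, QUERY_NORM, 0.0625)
metadata = max(title, free_search)

# Join contribution, copied from the explain output.
join_score = 4.743992

# Remaining clauses: statusband boost and freshness function.
statusband = classic_term_score(1.0, 1.2670095, QUERY_NORM, 1.0, boost=50.0)
freshness = 0.00491031 * 0.0724944 * 1.0

total = (metadata + join_score) + statusband + freshness

if __name__ == "__main__":
    print(f"metadata   = {metadata:.6f}")
    print(f"join       = {join_score:.6f}")
    print(f"statusband = {statusband:.6f}")
    print(f"freshness  = {freshness:.8f}")
    print(f"total      = {total:.6f}")
    print(f"join share = {join_score / total:.1%}")
```

The metadata match contributes roughly 0.07 to the total, while the join contributes about 4.74, which is the imbalance described above.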