Hi!

I have a SolrCloud 6.6 collection with 3 shards, and I need the TermVectors 
TF and DF values for my queries.

I have configured the ExactStatsCache in solrconfig.xml:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>

When I query "detector works", it returns different docfreq values based on the 
shard the document comes from:

"termVectors":[
    "27504103",[
      "uniqueKey","27504103",
      "kc",[
        "detector works",[
          "tf",1,
          "df",3,
          "tf-idf",0.3333333333333333]]],
    "27507925",[
      "uniqueKey","27507925",
      "kc",[
        "detector works",[
          "tf",1,
          "df",3,
          "tf-idf",0.3333333333333333]]],
    "27504105",[
      "uniqueKey","27504105",
      "kc",[
        "detector works",[
          "tf",1,
          "df",2,
          "tf-idf",0.5]]],
    "27507927",[
      "uniqueKey","27507927",
      "kc",[
        "detector works",[
          "tf",1,
          "df",2,
          "tf-idf",0.5]]],
    "27507929",[
      "uniqueKey","27507929",
      "kc",[
        "detector works",[
          "tf",1,
          "df",1,
          "tf-idf",1.0]]],
    "27504107",[
      "uniqueKey","27504107",
      "kc",[
        "detector works",[
          "tf",1,
          "df",3,
          "tf-idf",0.3333333333333333]]]]}

I expect the DF value to be 6 for every document, and the TF-IDF to be 
recomputed from that value. I can see in the debug logs that the cache was 
active.

I have found an open bug (since Solr 5.5: 
https://issues.apache.org/jira/browse/SOLR-8893) which explains that the 
ExactStatsCache is used to compute the correct TF-IDF for the query, but not 
for the TermVectors component.

Is there any way to get the correctly merged DF values (and TF-IDF) from 
multiple shards?

Is there a way to find out which shard a document comes from, so that I could 
compute the correct DF myself?
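
To make the last question concrete, here is a minimal sketch of the client-side 
merge I have in mind. The doc ids, tf and df values are copied from the 
response above. The shard names are hypothetical: I guessed them by grouping 
documents that report the same df. It also assumes tf-idf is simply tf/df, 
which is what the returned values suggest, and that every shard containing the 
term has at least one document in the result.

```python
# (doc id, shard, tf, shard-local df) for the term "detector works".
# Shard names are hypothetical, guessed from which documents share a df.
ROWS = [
    ("27504103", "shard1", 1, 3),
    ("27507925", "shard1", 1, 3),
    ("27504105", "shard2", 1, 2),
    ("27507927", "shard2", 1, 2),
    ("27507929", "shard3", 1, 1),
    ("27504107", "shard1", 1, 3),
]


def merge_df(rows):
    """Sum one shard-local df per shard to get the collection-wide df."""
    df_by_shard = {}
    for _doc_id, shard, _tf, local_df in rows:
        # Every document from the same shard reports the same local df,
        # so keeping a single value per shard is enough.
        df_by_shard[shard] = local_df
    return sum(df_by_shard.values())


def global_tf_idf(rows):
    """Recompute tf-idf per document as tf / merged df (assumed formula)."""
    global_df = merge_df(rows)
    return {doc_id: tf / global_df for doc_id, _shard, tf, _df in rows}


if __name__ == "__main__":
    print("merged df:", merge_df(ROWS))
    for doc_id, score in global_tf_idf(ROWS).items():
        print(doc_id, score)
```

With the guessed grouping this gives a merged df of 3 + 2 + 1 = 6, which is the 
value I expected. A shard that holds the term but returns no document in the 
result page would be missed, so this is only a workaround.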

Thank you,
Patrick
