Re: solr 1.4 facet.limit behaviour in merging from several shards

Dmitry Kan Wed, 14 Sep 2011 00:46:20 -0700

Hi Chris,

Thanks for taking this. Sorry for my confusing explanation. Since you
requested a bigger picture, I'll give some more detail. In short: we don't
do date facets, and sorting by date in reverse order happens naturally by
design.

All the data is split into shards. We use logical sharding, not hash-based
sharding. Each shard contains the slice of data that corresponds to a
specific date range. We know in advance which date range is represented by
which shard. Each document in a shard has a field containing a date in
milliseconds, computed by subtracting the original document's date from a
very big date in the future. This way, if you issue a facet query against a
shard with facet.sort=index, you get the facet values ordered
lexicographically, which corresponds to reverse chronological order (newest
first).

Here is an example of two values:

9223370739060532807_docid1
9223370741484545807_docid2

The second value is larger than the first, which means that the document
itself is older.
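
To make the key scheme concrete, here is a minimal sketch in Python. It
assumes the "very big date in the future" is the largest signed 64-bit value
(2**63 - 1), which is consistent with the two sample values above. The
function name and the zero padding are only illustrative, not our production
code.

```python
# Assumption: the "very big date" is Long.MAX_VALUE (2**63 - 1).
LONG_MAX = 2**63 - 1


def opposite_date_key(date_millis: int, doc_id: str) -> str:
    """Build a key that sorts lexicographically from newest to oldest."""
    # Pad to 19 digits so string order matches numeric order.
    return f"{LONG_MAX - date_millis:019d}_{doc_id}"


newer = opposite_date_key(1297794243000, "docid1")  # document from Feb 2011
older = opposite_date_key(1295370230000, "docid2")  # document from Jan 2011
print(newer)
print(older)
print(newer < older)
```

The newer document gets the smaller key, so an index-ordered listing starts
with the most recent documents.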

Here is a typical facet query:

wt=xml&start=0&hl.alternateField=Contents&version=1&df=Contents&q=aerospace+engineer&hl.alternateFieldLength=100000&facet=true&f.OppositeDateLongNumber_docid.facet.limit=1000&facet.field=OppositeDateLongNumber_docid&rows=1&facet.sort=index&facet.zeros=false&isShard=true

The output xml is:

(skipping the header)

 <lst name="facet_fields">
  <lst name="OppositeDateLongNumber_docid">
        <int name="9223370722475651807_1">2</int>
        <int name="9223370722825037807_4">1</int>
        <int name="9223370723175759807_2">2</int>
        <int name="9223370723372652807_10">1</int>
        <int name="9223370723949606807_7">1</int>
  </lst>
 </lst>
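
For reference, a sketch of how such a response fragment can be read back
into (date, docid, count) tuples. It assumes, as above, that the key was
built by subtracting the date from 2**63 - 1; the helper name parse_facets is
hypothetical, and the sample is shortened to two entries.

```python
import xml.etree.ElementTree as ET

# Assumption: keys were built as (2**63 - 1) - date_in_millis.
LONG_MAX = 2**63 - 1

SAMPLE = """<lst name="facet_fields">
  <lst name="OppositeDateLongNumber_docid">
    <int name="9223370722475651807_1">2</int>
    <int name="9223370722825037807_4">1</int>
  </lst>
</lst>"""


def parse_facets(xml_text, field):
    """Return (date_millis, doc_id, count) for each facet entry of `field`."""
    root = ET.fromstring(xml_text)
    entries = []
    for lst in root.iter("lst"):
        if lst.get("name") != field:
            continue
        for node in lst.findall("int"):
            key, doc_id = node.get("name").split("_", 1)
            entries.append((LONG_MAX - int(key), doc_id, int(node.text)))
    return entries


facets = parse_facets(SAMPLE, "OppositeDateLongNumber_docid")
for date_millis, doc_id, count in facets:
    print(date_millis, doc_id, count)
```

The decoded dates decrease from one entry to the next, i.e. the list runs
from newest to oldest.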

Excerpt from the schema:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
omitNorms="true">
      <analyzer type="index">
       <!-- the order matters -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.ReversedWildcardFilterFactory"
withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2"
maxFractionAsterisk="0.33"/>
       <!-- here we have two more proprietary filters, one of which does
stemming -->
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- our proprietary stemming filter -->
      </analyzer>
    </fieldType>

<field name="OppositeDateLongNumber_docid" type="string" indexed="true"
stored="true"  required="false" omitNorms="true" />
<field name="Contents" type="text" indexed="true" stored="true"
omitNorms="true" />

Back to the problem: it is reproducible that if a query sent from the solr
router reaches two or more shards, each of which generates around 1000 hits,
then upon merging some portion of the hits (on the time border between two
shards) gets dropped. The resulting hit list is otherwise uniform, except
for the missing portion of hits in the middle.
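
To illustrate the symptom, here is a toy model in Python. It is NOT Solr's
merging code and I don't know whether this is what actually happens. It only
assumes that each shard truncates its own list at the limit in index order
and that the truncated lists are then combined. All keys and sizes are made
up.

```python
def shard_facets(keys, limit):
    """First `limit` facet terms of one shard, in index (lexicographic) order."""
    return sorted(keys)[:limit]


def find_gap(all_keys, merged):
    """Terms that lie inside the merged range but are absent from the merged list."""
    present = set(merged)
    lowest, highest = min(merged), max(merged)
    return [k for k in sorted(all_keys) if lowest <= k <= highest and k not in present]


# Hypothetical data: shard A holds newer documents (smaller keys),
# shard B holds older documents (larger keys).
shard_a = [f"{1000 + i:04d}_a" for i in range(1500)]
shard_b = [f"{5000 + i:04d}_b" for i in range(800)]
limit = 1000

# Assumed merge: combine the per-shard truncated lists.
merged = sorted(shard_facets(shard_a, limit) + shard_facets(shard_b, limit))
gap = find_gap(shard_a + shard_b, merged)

print(len(gap), gap[0], gap[-1])
```

In this model the tail of shard A beyond its first 1000 terms is cut off, so
the combined list jumps from the middle of shard A's date range straight to
the start of shard B's, which looks like what we observe.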

So the question is: if the facet search reaches two or more shards and each
shard generates 1000 results, which entries will go into the final list of
resulting entries, given the facet.limit=1000 set on the original
distributed query? What is the algorithm in this case?

Please let me know if anything is unclear or if more detail is needed on the
schema / execution / design.

Regards,

Dmitry

On Fri, Sep 9, 2011 at 12:22 AM, Chris Hostetter
<hossman_luc...@fucit.org> wrote:

>
> : When shooting a distributed query, we use facet.limit=1000. Then the
> : merging SOLR combines the results. We also use facet.zeros=false to
> : ensure returning only non-zero facet entries.
> : The issue that we found is that there was a gap in time in the final
> : results list (reverse sorted by date attached to each entry in all the
> : shards), whereby entries stamped with certain date disappeared. If we
> : use different query criteria, that produces less than 1000 results both
> : in each of the shards and combined, we see those "missing" entries. So
> : the problem is not in missing data, but in the combination algorithm.
>
> I don't understand what you mean by "entries stamped with certain date"
> ... are you saying the actual results of the search seem to be missing
> documents, or that the facet counts returned seemed to be missing
> constraints that should be in the list?
>
> it seems like you are referring to documents missing from the actual
> results ("reverse sorted by date") but facet.limit can't affect anything
> about the results of the actual query.  facet.limit also only applies to
> facet.field (not facet.date or facet.range), but you're talking about a
> date field....
>
> can you please be specific about the requests you are executing (ie: what
> params) the schema you have (ie: what are the fields/types in use in all
> the params/query strings), the results you are getting, and the results
> you are expecting?   actually providing the response xml is very helpful.
> (change the "fl" to hide any fields you consider sensitive)
>
> -Hoss
>

-- 
Regards,

Dmitry Kan
