Hi Chris,

Thanks for taking this. Sorry for my confusing explanation. Since you requested a bigger picture, I'll give some more detail. In short: we don't do date facets, and sorting by date in reverse order happens naturally by design.
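To make "by design" concrete, here is a minimal sketch of the inverted-date key scheme that the next paragraph describes. It is an illustration, not our indexing code: the function names are hypothetical, and it assumes the "very big date in the future" is Java's Long.MAX_VALUE. The example values in this mail are consistent with that assumption, but treat it as an assumption.

```python
from datetime import datetime, timezone

# Assumption: the "very big date in the future" is Java's Long.MAX_VALUE.
FAR_FUTURE_MS = 2**63 - 1  # 9223372036854775807


def opposite_date_key(doc_date: datetime, doc_id: str) -> str:
    """Build a key of the form <FAR_FUTURE_MS - epoch millis>_<docid> (hypothetical helper)."""
    millis = round(doc_date.timestamp() * 1000)
    return f"{FAR_FUTURE_MS - millis}_{doc_id}"


def decode_key(key: str) -> datetime:
    """Recover the original UTC date from a key built by opposite_date_key."""
    inverted, _, _doc_id = key.partition("_")
    millis = FAR_FUTURE_MS - int(inverted)
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


newer = opposite_date_key(datetime(2011, 2, 15, 18, 24, 3, tzinfo=timezone.utc), "docid1")
older = opposite_date_key(datetime(2011, 1, 18, 17, 3, 50, tzinfo=timezone.utc), "docid2")

# Plain string (index) order puts the newer document first, because a later
# date yields a smaller inverted value and all values have the same width.
assert sorted([older, newer]) == [newer, older]
```

Because the inverted values all have the same number of digits, lexicographic order on the string field matches numeric order, so an index-ordered facet walks the documents from newest to oldest.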
All the data is split into shards. We use logical sharding, not hash-based: each shard contains the piece of data that corresponds to a specific date range, and we know in advance which date range is represented by which shard. Each document in a shard has a field containing a date in milliseconds, obtained by subtracting the original document's date from a very big date in the future. This way, if you issue a facet query against a shard with facet.sort=index, the facet entries come back in lexicographic order, which is reverse chronological order (newest first).

Here is an example of two values:

    9223370739060532807_docid1
    9223370741484545807_docid2

The second value is larger than the first, which means that its document is older.

Here is a typical facet query:

    wt=xml&start=0&hl.alternateField=Contents&version=1&df=Contents&q=aerospace+engineer&hl.alternateFieldLength=100000&facet=true&f.OppositeDateLongNumber_docid.facet.limit=1000&facet.field=OppositeDateLongNumber_docid&rows=1&facet.sort=index&facet.zeros=false&isShard=true

The output XML is (skipping the header):

    <lst name="facet_fields">
      <lst name="OppositeDateLongNumber_docid">
        <int name="9223370722475651807_1">2</int>
        <int name="9223370722825037807_4">1</int>
        <int name="9223370723175759807_2">2</int>
        <int name="9223370723372652807_10">1</int>
        <int name="9223370723949606807_7">1</int>
      </lst>
    </lst>

Excerpt from the schema:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">
        <!-- the order matters -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
        <!-- here we have two more proprietary filters, one of which does stemming -->
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- our proprietary stemming filter -->
      </analyzer>
    </fieldType>

    <field name="OppositeDateLongNumber_docid" type="string" indexed="true" stored="true" required="false" omitNorms="true" />
    <field name="Contents" type="text" indexed="true" stored="true" omitNorms="true" />

Back to the problem: it is reproducible that if a query sent from the Solr router reaches two or more shards, each of which generates around 1000 hits, then upon merging some portion of the hits (on the time border between two shards) gets dropped. The resulting hit list is otherwise uniform, except for the missing portion of hits in the middle.

So the question is: if the facet search reaches two or more shards and each shard generates 1000 results, which entries will go into the final list of resulting entries, given that facet.limit=1000 is set on the original distributed query? What is the algorithm in this case?

Please let me know if something is not clear or if more detail is needed about the schema / execution / design.

Regards,
Dmitry

On Fri, Sep 9, 2011 at 12:22 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> : When shooting a distributed query, we use facet.limit=1000. Then the merging
> : SOLR combines the results. We also use facet.zeros=false to ensure returning
> : only non-zero facet entries.
> : The issue that we found is that there was a gap in time in the final results
> : list (reverse sorted by date attached to each entry in all the shards),
> : whereby entries stamped with certain date disappeared. If we use different
> : query criteria, that produces less than 1000 results both in each of the
> : shards and combined, we see those "missing" entries. So the problem is not
> : in missing data, but in the combination algorithm.
>
> I don't understand what you mean by "entries stamped with certain date" ...
> are you saying the actual results of the search seem to be missing
> documents, or that the facet counts returned seemed to be missing
> constraints that should be in the list?
>
> it seems like you are referring to documents missing from the actual
> results ("reverse sorted by date") but facet.limit can't affect anything
> about the results of the actual query. facet.limit also only applies to
> facet.field (not facet.date or facet.range), but you're talking about a
> date field....
>
> can you please be specific about the requests you are executing (ie: what
> params) the schema you have (ie: what are the fields/types in use in all
> the params/query strings), the results you are getting, and the results
> you are expecting? actually providing the response xml is very helpful.
> (change the "fl" to hide any fields you consider sensitive)
>
> -Hoss
>

--
Regards,
Dmitry Kan