bq. Does lucene look at %docs in each state, or the first doc or something else?

Frankly I don’t care, since no matter what, the results of faceting on mixed 
definitions are not useful.

tl;dr;

‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means 
just what I choose it to mean — neither more nor less.’

So “undefined” in this case means “I don’t see any value at all in chasing that 
info down” ;).

Changing from regular text to SortableText means that the facet results will be 
inaccurate no matter what until every doc is re-indexed. For example, I have a 
doc with the value “my dog has fleas”. When NOT using SortableText, there are 
multiple tokens, so the facet counts would be:

my (1)
dog (1)
has (1)
fleas (1)

But with SortableText the facet counts would be:

my dog has fleas (1)

Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”. doc1 
was indexed before switching to SortableText and doc2 after. Presumably the 
output you want is:

my dog has fleas (1)
my cat has fleas (1)

But you can’t get that output. There are three possible cases:

1> Lucene treats all documents as SortableText and facets only on the 
docValues. doc1 has no docValues, so it contributes no facets:

my cat has fleas (1)

2> Lucene treats all documents as tokenized and facets on each individual token 
of both docs. The docValues in doc2 are ignored:

my (2)
dog (1)
has (2)
fleas (2)
cat (1)


3> Lucene does the best it can, faceting on the tokens for docs indexed without 
SortableText and on the docValues for docs indexed with it. doc1 is faceted on 
its tokens, doc2 on its docValues:

my (1)
dog (1)
has (1)
fleas (1)
my cat has fleas (1)
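
To make the three cases concrete, here is a toy Python sketch. It does not call 
Lucene or Solr at all: the function names, the has_doc_values flag and the 
whitespace tokenizer are all made up for illustration, and it only mimics the 
counting in each case, not what Lucene actually does.

```python
from collections import Counter

# Hypothetical toy model: each doc records its text and whether it was
# indexed after the switch to SortableText (i.e. whether it has docValues).
docs = [
    {"text": "my dog has fleas", "has_doc_values": False},  # doc1, old definition
    {"text": "my cat has fleas", "has_doc_values": True},   # doc2, new definition
]


def tokens(text):
    """Assumed whitespace tokenizer; a set so each token counts once per doc."""
    return set(text.split())


def facet_doc_values_only(docs):
    """Case 1: count whole values, only for docs that have docValues."""
    return Counter(d["text"] for d in docs if d["has_doc_values"])


def facet_tokens_only(docs):
    """Case 2: count individual tokens for every doc, ignoring docValues."""
    counts = Counter()
    for d in docs:
        counts.update(tokens(d["text"]))
    return counts


def facet_mixed(docs):
    """Case 3: whole value where docValues exist, individual tokens otherwise."""
    counts = Counter()
    for d in docs:
        if d["has_doc_values"]:
            counts[d["text"]] += 1
        else:
            counts.update(tokens(d["text"]))
    return counts


def facet_wanted(docs):
    """The output you presumably want: whole values for every doc."""
    return Counter(d["text"] for d in docs)


if __name__ == "__main__":
    for name, fn in [
        ("case 1", facet_doc_values_only),
        ("case 2", facet_tokens_only),
        ("case 3", facet_mixed),
        ("wanted", facet_wanted),
    ]:
        print(name, dict(fn(docs)))
```

None of the three case functions can produce what facet_wanted returns, because 
the whole-value entry for doc1 simply does not exist until doc1 is re-indexed.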

Since none of those cases is what I want, there’s no point I can see in chasing 
down what actually happens….

Best,
Erick

P.S. I _think_ Lucene tries to use the definition from the first segment, but 
the code that builds the lists of segments to be merged doesn’t look at the 
field definitions at all. So whether the first segment in the list has 
SortableText or not will not be predictable in a general way, even within a 
single run.


> On Jun 9, 2019, at 6:53 PM, John Davis <johndavis925...@gmail.com> wrote:
> 
> Understood, however code is rarely random/undefined. Does lucene look at %
> docs in each state, or the first doc or something else?
> 
> On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> It’s basically undefined. When segments are merged that have dissimilar
>> definitions like this what can Lucene do? Consider:
>> 
>> Faceting on a text (not sortable) means that each individual token in the
>> index is uninverted on the Java heap and the facets are computed for each
>> individual term.
>> 
>> Faceting on a SortableText field just has a single term per document, and
>> that in the docValues structures as opposed to the inverted index.
>> 
>> Now you change the value and start indexing. At some point a segment
>> containing no docValues is merged with a segment containing docValues for
>> the field. The resulting mixed segment is in this state. If you facet on
>> the field, should the docs without docValues have each individual term
>> counted? Or just the SortableText values in the docValues structure?
>> Neither one is right.
>> 
>> Also remember that Lucene has no notion of schema. That’s entirely imposed
>> on Lucene by Solr carefully constructing low-level analysis chains.
>> 
>> So I’d _strongly_ recommend you re-index your corpus to a new collection
>> with the current definition, then perhaps use CREATEALIAS to seamlessly
>> switch.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 9, 2019, at 12:50 PM, John Davis <johndavis925...@gmail.com> wrote:
>>> 
>>> Hi there,
>>> We recently changed a field from TextField + no docValues to
>>> SortableTextField which has docValues enabled by default. Once I did this
>>> I do not see any facet values for the field. I know that once all the docs
>>> are re-indexed facets should work again, however can someone clarify the
>>> current logic of lucene/solr how facets will be computed when schema is
>>> changed from no docValues to docValues and vice-versa?
>>> 
>>> 1. Until ALL the docs are re-indexed, no facets will be returned?
>>> 2. Once certain fraction of docs are re-indexed, those facets will be
>>> returned?
>>> 3. Something else?
>>> 
>>> 
>>> Varun
>> 
>> 
