Re: Enabling/disabling docValues

John Davis Mon, 10 Jun 2019 19:53:42 -0700

You have made many assumptions that might not always be realistic: a)
TextField is always tokenized, b) users care about precise counts, and c)
users have the luxury or ability to do a full re-index at any time. These
are real issues and there is no black-and-white solution. I will ask the
Lucene folks about the actual implementation.


On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> bq. Does lucene look at %docs in each state, or the first doc or something
> else?
>
> Frankly I don’t care since, no matter what, the results of faceting mixed
> definitions are not useful.
>
> tl;dr;
>
> ‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> means just what I choose it to mean — neither more nor less.’
>
> So “undefined” in this case means “I don’t see any value at all in chasing
> that info down” ;).
>
> Changing from regular text to SortableText means that the results will be
> inaccurate no matter what. For example, I have a doc with the value “my dog
> has fleas”. When NOT using SortableText, there are multiple tokens so facet
> counts would be:
>
> my (1)
> dog (1)
> has (1)
> fleas (1)
>
> But for SortableText they will be:
>
> my dog has fleas (1)
>
> Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> doc1 was indexed before switching to SortableText and doc2 after.
> Presumably the output you want is:
>
> my dog has fleas (1)
> my cat has fleas (1)
>
> But you can’t get that output.  There are three cases:
>
> 1> Lucene treats all documents as SortableText, faceting on the docValues
> parts. No facets on doc1
>
> my cat has fleas (1)
>
> 2> Lucene treats all documents as tokenized, faceting on each individual
> token. Faceting is performed on the tokenized content of both; the
> docValues in doc2 are ignored
>
> my (2)
> dog (1)
> has (2)
> fleas (2)
> cat (1)
>
>
> 3> Lucene does the best it can, faceting on the tokens for docs without
> SortableText and on docValues if the doc was indexed with SortableText.
> doc1 is faceted on tokens, doc2 on docValues
>
> my (1)
> dog (1)
> has (1)
> fleas (1)
> my cat has fleas (1)
>
> Since none of those cases is what I want, there’s no point I can see in
> chasing down what actually happens….
>
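
A minimal sketch of the three cases above, to make the counts easy to
check. This is a toy model, not Lucene code: the `docs` list, the
`has_doc_values` flag, the whitespace tokenizer, and the three function
names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical model of the two documents from the example above.
# has_doc_values marks whether the doc was indexed after the switch
# to SortableText.
docs = [
    {"id": "doc1", "text": "my dog has fleas", "has_doc_values": False},
    {"id": "doc2", "text": "my cat has fleas", "has_doc_values": True},
]


def facet_doc_values_only(docs):
    """Case 1: only docs with docValues contribute, one whole value each."""
    return Counter(d["text"] for d in docs if d["has_doc_values"])


def facet_tokens_only(docs):
    """Case 2: every doc is treated as tokenized; docValues are ignored."""
    # set() so each token is counted once per document, as facet counts are.
    return Counter(tok for d in docs for tok in set(d["text"].split()))


def facet_best_effort(docs):
    """Case 3: docValues where present, tokens otherwise."""
    counts = Counter()
    for d in docs:
        if d["has_doc_values"]:
            counts[d["text"]] += 1
        else:
            counts.update(set(d["text"].split()))
    return counts


# The output one would presumably want after the schema change.
wanted = {"my dog has fleas": 1, "my cat has fleas": 1}

for facet in (facet_doc_values_only, facet_tokens_only, facet_best_effort):
    result = dict(facet(docs))
    print(facet.__name__, result, "matches wanted:", result == wanted)
```

None of the three functions reproduces `wanted` as long as doc1 has not
been re-indexed, which is the point being made above.
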
> Best,
> Erick
>
> P.S. I _think_ Lucene tries to use the definition from the first segment,
> but the selection of segments to be merged doesn’t look at the field
> definitions at all, so whether the first segment in the list has
> SortableText or not will not be predictable in a general way, even within
> a single run.
>
>
> > On Jun 9, 2019, at 6:53 PM, John Davis <johndavis925...@gmail.com>
> > wrote:
> >
> > Understood, however code is rarely random/undefined. Does lucene look at
> > % docs in each state, or the first doc or something else?
> >
> > On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> It’s basically undefined. When segments are merged that have dissimilar
> >> definitions like this what can Lucene do? Consider:
> >>
> >> Faceting on a text (not sortable) field means that each individual token
> >> in the index is uninverted on the Java heap and the facets are computed
> >> for each individual term.
> >>
> >> Faceting on a SortableText field just has a single term per document,
> >> and that is in the docValues structures as opposed to the inverted index.
> >>
> >> Now you change the definition and start indexing. At some point a
> >> segment containing no docValues is merged with a segment containing
> >> docValues for the field. The resulting mixed segment is in this state.
> >> If you facet on the field, should the docs without docValues have each
> >> individual term counted? Or just the SortableText values in the
> >> docValues structure? Neither one is right.
> >>
> >> Also remember that Lucene has no notion of schema. That’s entirely
> >> imposed on Lucene by Solr carefully constructing low-level analysis
> >> chains.
> >>
> >> So I’d _strongly_ recommend you re-index your corpus to a new collection
> >> with the current definition, then perhaps use CREATEALIAS to seamlessly
> >> switch.
> >>
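
A minimal sketch of the alias switch suggested above, assuming the corpus
has already been re-indexed into a new collection. It only builds the
Collections API request URL for the CREATEALIAS action and does not send
it. The base URL, the alias name `products`, the collection name
`products_v2`, and the helper `create_alias_url` are hypothetical.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base_url, alias, collection):
    """Build the Collections API URL that points `alias` at `collection`."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,              # the name clients keep querying
        "collections": collection,  # the freshly re-indexed collection
    }
    return f"{solr_base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical values; substitute your own host, alias and collection.
url = create_alias_url("http://localhost:8983/solr", "products", "products_v2")
print(url)
```

If clients already query the alias name, issuing this request once the new
collection is fully indexed moves them over without a client-side change.
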
> >> Best,
> >> Erick
> >>
> >>> On Jun 9, 2019, at 12:50 PM, John Davis <johndavis925...@gmail.com>
> >>> wrote:
> >>>
> >>> Hi there,
> >>> We recently changed a field from TextField + no docValues to
> >>> SortableTextField which has docValues enabled by default. Once I did
> >>> this I do not see any facet values for the field. I know that once all
> >>> the docs are re-indexed facets should work again, however can someone
> >>> clarify the current logic of lucene/solr how facets will be computed
> >>> when schema is changed from no docValues to docValues and vice-versa?
> >>>
> >>> 1. Until ALL the docs are re-indexed, no facets will be returned?
> >>> 2. Once certain fraction of docs are re-indexed, those facets will be
> >>> returned?
> >>> 3. Something else?
> >>>
> >>>
> >>> Varun
> >>
> >>
>
>
