There is no way to match case-insensitively other than a TextField with no tokenization. It's a long-standing limitation that no analyzers can be applied to str fields.
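To make that concrete, here is a toy Python sketch of the difference. It is not Solr code: the function names are made up for illustration and only imitate what an analysis chain does. The second function stands in for a TextField whose analyzer keeps the whole value as a single token and lowercases it (in Solr I'd assume something like KeywordTokenizerFactory plus LowerCaseFilterFactory).

```python
# Toy model only: these helpers imitate analysis behaviour, they are not Solr APIs.

def str_field_terms(value):
    """A str field indexes the raw value verbatim; no analyzer runs."""
    return [value]


def keyword_lowercase_terms(value):
    """A text field that keeps the whole value as ONE token and lowercases it."""
    return [value.lower()]


def matches(index_terms, query_terms):
    """A document matches if any query term equals an indexed term."""
    return any(term in index_terms for term in query_terms)


stored = "My Dog Has Fleas"
query = "my dog has fleas"

# str field: exact match only, so the differently cased query misses.
exact_only = matches(str_field_terms(stored), str_field_terms(query))

# Untokenized, lowercased text field: same single term on both sides.
case_insensitive = matches(
    keyword_lowercase_terms(stored), keyword_lowercase_terms(query)
)
```

So the field is a TextField, but it is still a single term per document, which is why "TextField is always tokenized" is not a safe assumption.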
Thanks for pointing out the re-index page; I've seen it. However, it is sometimes hard to re-index in a reasonable amount of time and resources, and if we empower power users to understand the system better, it will help them make more informed tradeoffs.

On Tue, Jun 11, 2019 at 6:52 AM Gus Heck <gus.h...@gmail.com> wrote:

> On Mon, Jun 10, 2019 at 10:53 PM John Davis <johndavis925...@gmail.com>
> wrote:
>
> > You have made many assumptions which might not always be realistic a)
> > TextField is always tokenized
>
> Well, you could of course change configuration or code to do something else
> but this would be a very odd and misleading thing to do and we would expect
> you to have mentioned it.
>
> > b) Users care about precise counts and
>
> This is indeed use case dependent if you are talking about approximately
> correct (150 vs 152 etc), but it's pretty reasonable to say that gross
> errors (75 vs 153 or 0 vs 5 etc) more or less make faceting pointless.
>
> > c) Users have the luxury or ability to do a full re-index anytime.
>
> This is a state of affairs we consistently advise against. The reason we
> give the advice is precisely because one cannot change the schema out from
> under an existing index safely without rewriting the index. Without
> extremely careful design on your side (not using certain features and high
> storage requirements), your index will not retain enough information to
> remake itself. Therefore, it is a long standing bad practice to not have
> a separate canonical copy of the data and a means to re-index it (or a
> design where only the very most recent data is important, and a copy of
> that). There is a whole page dedicated to reindexing in the ref guide:
> https://lucene.apache.org/solr/guide/8_0/reindexing.html
> Here's a relevant bit from the current version:
>
> `There is no process in Solr for programmatically reindexing data. When we
> say "reindex", we mean, literally, "index it again".
> However you got the data into the index the first time, you will run that
> process again. It is strongly recommended that Solr users index their data
> in a repeatable, consistent way, so that the process can be easily repeated
> when the need for reindexing arises.`
>
> The ref guide has lots of nice info, maybe you should read it rather than
> snubbing one of the nicest and most knowledgeable committers on the project
> (who is helping you for free) by haughtily saying you'll go ask someone
> else... And if you've been left with this situation (no ability to reindex)
> by your predecessor you have our deepest sympathies, but it still doesn't
> change the fact that you need to break it to management that your
> predecessor has lost the data required to maintain the system and you still
> need to re-index whatever you can salvage somehow, or start fresh.
>
> When Erick is saying you shouldn't be asking that question... >90% of the
> time you really shouldn't be, and if you do pursue it, you'll just waste a
> lot of your own time.
>
> > On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > bq. Does lucene look at %docs in each state, or the first doc or
> > > something else?
> > >
> > > Frankly I don’t care since no matter what, the results of faceting mixed
> > > definitions is not useful.
> > >
> > > tl;dr;
> > >
> > > “‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> > > means just what I choose it to mean — neither more nor less.’”
> > >
> > > So “undefined” in this case means “I don’t see any value at all in
> > > chasing that info down” ;).
> > >
> > > Changing from regular text to SortableText means that the results will be
> > > inaccurate no matter what. For example, I have a doc with the value “my
> > > dog has fleas”.
> > > When NOT using SortableText, there are multiple tokens so facet
> > > counts would be:
> > >
> > > my (1)
> > > dog (1)
> > > has (1)
> > > fleas (1)
> > >
> > > But for SortableText will be:
> > >
> > > my dog has fleas (1)
> > >
> > > Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> > > doc1 was indexed before switching to SortableText and doc2 after.
> > > Presumably the output you want is:
> > >
> > > my dog has fleas (1)
> > > my cat has fleas (1)
> > >
> > > But you can’t get that output. There are three cases:
> > >
> > > 1> Lucene treats all documents as SortableText, faceting on the docValues
> > > parts. No facets on doc1
> > >
> > > my cat has fleas (1)
> > >
> > > 2> Lucene treats all documents as tokenized, faceting on each individual
> > > token. Faceting is performed on the tokenized content of both, docValues
> > > in doc2 ignored
> > >
> > > my (2)
> > > dog (1)
> > > has (2)
> > > fleas (2)
> > > cat (1)
> > >
> > > 3> Lucene does the best it can, faceting on the tokens for docs without
> > > SortableText and docValues if the doc was indexed with Sortable text.
> > > doc1 faceted on tokenized, doc2 on docValues
> > >
> > > my (1)
> > > dog (1)
> > > has (1)
> > > fleas (1)
> > > my cat has fleas (1)
> > >
> > > Since none of those cases is what I want, there’s no point I can see in
> > > chasing down what actually happens….
> > >
> > > Best,
> > > Erick
> > >
> > > P.S. I _think_ Lucene tries to use the definition from the first segment,
> > > but the lists of segments to be merged don’t look at the field
> > > definitions at all, so whether the first segment in the list has
> > > SortableText or not will not be predictable in a general way even within
> > > a single run.
> > >
> > > > On Jun 9, 2019, at 6:53 PM, John Davis <johndavis925...@gmail.com>
> > > > wrote:
> > > >
> > > > Understood, however code is rarely random/undefined.
> > > > Does lucene look at % docs in each state, or the first doc or
> > > > something else?
> > > >
> > > > On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson <erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > >> It’s basically undefined. When segments are merged that have
> > > >> dissimilar definitions like this what can Lucene do? Consider:
> > > >>
> > > >> Faceting on a text (not sortable) means that each individual token in
> > > >> the index is uninverted on the Java heap and the facets are computed
> > > >> for each individual term.
> > > >>
> > > >> Faceting on a SortableText field just has a single term per document,
> > > >> and that in the docValues structures as opposed to the inverted index.
> > > >>
> > > >> Now you change the value and start indexing. At some point a segment
> > > >> containing no docValues is merged with a segment containing docValues
> > > >> for the field. The resulting mixed segment is in this state. If you
> > > >> facet on the field, should the docs without docValues have each
> > > >> individual term counted? Or just the SortableText values in the
> > > >> docValues structure? Neither one is right.
> > > >>
> > > >> Also remember that Lucene has no notion of schema. That’s entirely
> > > >> imposed on Lucene by Solr carefully constructing low-level analysis
> > > >> chains.
> > > >>
> > > >> So I’d _strongly_ recommend you re-index your corpus to a new
> > > >> collection with the current definition, then perhaps use CREATEALIAS
> > > >> to seamlessly switch.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >>> On Jun 9, 2019, at 12:50 PM, John Davis <johndavis925...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> Hi there,
> > > >>> We recently changed a field from TextField + no docValues to
> > > >>> SortableTextField which has docValues enabled by default. Once I did
> > > >>> this I do not see any facet values for the field.
> > > >>> I know that once all the docs are re-indexed facets should work
> > > >>> again, however can someone clarify the current logic of lucene/solr
> > > >>> how facets will be computed when schema is changed from no docValues
> > > >>> to docValues and vice-versa?
> > > >>>
> > > >>> 1. Until ALL the docs are re-indexed, no facets will be returned?
> > > >>> 2. Once certain fraction of docs are re-indexed, those facets will be
> > > >>> returned?
> > > >>> 3. Something else?
> > > >>>
> > > >>>
> > > >>> Varun
> > > >>
> > > >>
> > > >
> > >
> >
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
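
For anyone skimming the quoted thread, the three hypothetical outcomes Erick lists can be reproduced with a small Python simulation. This is only a sketch of his example, not of what Lucene actually does internally; the `has_doc_values` flag and the function names are assumptions invented for illustration.

```python
from collections import Counter

# Mini-corpus from the thread: doc1 was indexed as tokenized text, doc2 after
# the switch to SortableText (one docValues term per document).
docs = [
    {"value": "my dog has fleas", "has_doc_values": False},
    {"value": "my cat has fleas", "has_doc_values": True},
]


def facet_doc_values_only(docs):
    """Case 1: only docs carrying docValues contribute, as whole values."""
    return Counter(d["value"] for d in docs if d["has_doc_values"])


def facet_tokens_only(docs):
    """Case 2: every doc is faceted per token; docValues are ignored."""
    counts = Counter()
    for d in docs:
        # set() so a term counts once per document, like a facet count.
        counts.update(set(d["value"].split()))
    return counts


def facet_mixed(docs):
    """Case 3: whole value where docValues exist, individual tokens otherwise."""
    counts = Counter()
    for d in docs:
        if d["has_doc_values"]:
            counts[d["value"]] += 1
        else:
            counts.update(set(d["value"].split()))
    return counts


# What one would presumably want after the schema change.
desired = Counter({"my dog has fleas": 1, "my cat has fleas": 1})

case1 = facet_doc_values_only(docs)
case2 = facet_tokens_only(docs)
case3 = facet_mixed(docs)
```

None of `case1`, `case2`, or `case3` equals `desired`, which is the point of the example: whichever rule applies to a mixed index, the counts are wrong until everything is re-indexed.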