Sorry, I was thinking that just using the <delete><query>*:*</query></delete> method for clearing the index would still leave the old documents (and their field data) sitting in the segments.
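For concreteness, a rough SolrJ sketch of that delete-by-query call (the URL and collection name are just placeholders for a stock 8.x setup); even after the commit, the "deleted" docs only physically go away once their segments get merged:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class WipeByQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection name - adjust for your install.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            // Equivalent of posting <delete><query>*:*</query></delete> to /update:
            // documents are only *marked* deleted inside their existing segments.
            client.deleteByQuery("*:*");
            client.commit();
            // The old segments (and the old field/fieldType data in them) stay on
            // disk until merges rewrite them, which is exactly the concern here.
        }
    }
}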
On Fri, Oct 16, 2020 at 4:28 PM Erick Erickson <erickerick...@gmail.com> wrote:
> Not sure what you're asking here. Re-indexing, as I was
> using the term, means completely removing the index and
> starting over. Or indexing to a new collection. At any
> rate, starting from a state where there are _no_ segments.
>
> I'm guessing you're still thinking that re-indexing without
> doing the above will work; it won't. The way merging works,
> it chooses segments based on a number of things, including
> the percentage of deleted documents. But there are still _other_
> live docs in the segment.
>
> Segment S1 has docs 1, 2, 3, 4 (old definition)
> Segment S2 has docs 5, 6, 7, 8 (new definition)
>
> Doc 2 is deleted, and S1 and S2 are merged into S3. The whole
> discussion about not being able to do the right thing kicks in.
> Should S3 use the new or old definition? Whichever one it uses
> is wrong for the other segment. And remember, Lucene simply
> _cannot_ "do the right thing" if the data isn't there.
>
> What you may be missing is that a segment is a "mini-index".
> The underlying assumption is that all documents in that
> segment are produced with the same schema and can be
> accessed the same way. My comments about merging
> "doing the right thing" are really about transforming docs
> so all the docs can be treated the same. Which they can't be
> if they were produced with different schemas.
>
> Robert Muir's statement is interesting here, built
> on Mike McCandless' comment:
>
> "I think the key issue here is Lucene is an index not a database.
> Because it is a lossy index and does not retain all of the user's
> data, its not possible to safely migrate some things automagically.
> …. The function is y = f(x) and if x is not available its not
> possible, so lucene can't do it."
>
> Don't try to get around this. Prepare to
> re-index the entire corpus into a new collection whenever
> you change the schema, and then maybe use an alias to
> switch over seamlessly from the user's perspective. If you
> simply cannot re-index from the system-of-record, you have
> two choices:
>
> 1> Use new collections whenever you need to change the
> schema and "somehow" have the app do different things
> with the new and old collections.
>
> 2> Set stored=true for all your source fields (i.e. not
> copyField destinations). You can either roll your own
> program that pulls data from the old collection and sends
> it to the new one, or use the Collections API REINDEXCOLLECTION
> call. But note that the docs specifically call out that all
> fields must be stored to use that API; what happens under the
> covers is that the stored fields are read and sent to the
> target collection.
>
> In both these cases, Robert's comment doesn't apply. Well,
> it does apply, but "if x is not available" is not the case;
> the original _is_ available, it's the stored data...
>
> I'm over-stating the case somewhat; there are a few changes
> where you can get away with re-indexing all the docs into an
> existing index, things like changing from stored=true to
> stored=false, adding new fields, deleting fields (although the
> meta-data for the field is still kept around), etc.
>
> > On Oct 16, 2020, at 3:57 PM, David Hastings <hastings.recurs...@gmail.com> wrote:
> >
> > Gotcha, thanks for the explanation.
> > Another small question if you don't mind: when deleting docs they aren't
> > actually removed, just tagged as deleted, and the old field/field type is
> > still in the index until merged/optimized as well. Wouldn't that cause
> > almost the same conflicts until then?
> >
> > On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> Doesn't re-indexing a document just delete/replace…
> >>
> >> It's complicated. For the individual document, yes. The problem
> >> comes because the field is inconsistent _between_ documents, and
> >> segment merging blows things up.
> >>
> >> Consider: I have segment1 with documents indexed with the old
> >> schema (String in this case). I change my schema and index the same
> >> field as a Text type.
> >>
> >> Eventually, a segment merge happens and these two segments get merged
> >> into a single new segment. How should the field be handled? Should it
> >> be defined as String or Text in the new segment? If you convert the
> >> docs with a Text definition for the field to String, you'd lose the
> >> ability to search for individual tokens. If you convert the String to
> >> Text, you don't have any guarantee that the information is even
> >> available.
> >>
> >> This is just the tip of the iceberg in terms of trying to change the
> >> definition of a field. Take the case of changing the analysis chain:
> >> say you use a phonetic filter on a field, then decide to remove it,
> >> and do not store the original. "Erick" might be encoded as "ENXY", so
> >> the original data is simply not there to convert. Ditto for removing
> >> a stemmer, lowercasing, applying a regex, …
> >>
> >> From Mike McCandless:
> >>
> >> "This really is the difference between an index and a database:
> >> we do not store, precisely, the original documents. We store
> >> an efficient derived/computed index from them. Yes, Solr/ES
> >> can add database-like behavior where they hold the true original
> >> source of the document and use that to rebuild Lucene indices
> >> over time. But Lucene really is just a "search index" and we
> >> need to be free to make important improvements with time."
> >>
> >> And all that aside, you have to re-index all the docs anyway or
> >> your search results will be inconsistent. So leaving aside the
> >> impossible task of covering all the possibilities on the fly, it's
> >> better to plan on re-indexing…
> >>
> >> Best,
> >> Erick
> >>
> >>> On Oct 16, 2020, at 3:16 PM, David Hastings <hastings.recurs...@gmail.com> wrote:
> >>>
> >>> "If you want to
> >>> keep the same field name, you need to delete all of the
> >>> documents in the index, change the schema, and reindex."
> >>>
> >>> Actually, doesn't re-indexing a document just delete/replace anyway,
> >>> assuming the same id?
> >>>
> >>> On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >>>
> >>>> Just as a side note,
> >>>>
> >>>>> indexed="true"
> >>>> If you are storing a 32K message, you probably are not searching it
> >>>> as a whole string. So, don't index it. You may also want to mark the
> >>>> field as 'large' (and lazy):
> >>>>
> >>>> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
> >>>>
> >>>> When you do make it a text field, you will probably run into the
> >>>> same issues as well.
> >>>>
> >>>> And honestly, if you are not storing those fields to search, maybe
> >>>> you need to reconsider the architecture.
> >>>> Maybe those fields do not need to
> >>>> be in Solr at all, but in external systems. Solr (or any search
> >>>> system) should not be your system of record since - as the other
> >>>> reply showed - some of the answers are "reindex everything".
> >>>>
> >>>> Regards,
> >>>>    Alex.
> >>>>
> >>>> On Fri, 16 Oct 2020 at 14:02, yaswanth kumar <yaswanth...@gmail.com> wrote:
> >>>>>
> >>>>> I am using Solr 8.2.
> >>>>>
> >>>>> Can I change the schema fieldtype from string to solr.TextField
> >>>>> without re-indexing?
> >>>>>
> >>>>> <field name="messagetext" type="string" indexed="true" stored="true"/>
> >>>>>
> >>>>> The reason is that string has only a 32K char limit, whereas I am
> >>>>> looking to store more than 32K now.
> >>>>>
> >>>>> The contents of this field don't require any analysis or
> >>>>> tokenization, but I need this field in the queries as well as in
> >>>>> the output fields.
> >>>>>
> >>>>> --
> >>>>> Thanks & Regards,
> >>>>> Yaswanth Kumar Konathala.
> >>>>> yaswanth...@gmail.com
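To make Erick's option 2> concrete, here is a rough SolrJ sketch of the "new collection + alias" approach. The collection, configset, and alias names are invented for the example, and it assumes SolrCloud plus the REINDEXCOLLECTION support added in Solr 8.1; the plain HTTP Collections API calls behave the same way:

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SchemaChangeReindex {
    public static void main(String[] args) throws Exception {
        // Invented names: "search" is the alias the application actually queries.
        String oldCollection = "docs_v1";
        String newCollection = "docs_v2";       // built from the *new* schema/configset
        String alias = "search";

        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build()) {

            // 1. Create the target collection from the updated configset.
            CollectionAdminRequest.createCollection(newCollection, "docs_v2_config", 1, 1)
                    .process(client);

            // 2. Copy the documents across. This only works if all source fields are
            //    stored, because under the covers the stored fields are read and
            //    re-sent to the target collection.
            //    HTTP equivalent:
            //    /admin/collections?action=REINDEXCOLLECTION&name=docs_v1&target=docs_v2
            CollectionAdminRequest.ReindexCollection reindex =
                    CollectionAdminRequest.reindexCollection(oldCollection);
            reindex.setTarget(newCollection);
            reindex.process(client);

            // 3. Point the alias at the new collection so clients switch seamlessly.
            CollectionAdminRequest.createAlias(alias, newCollection).process(client);

            // 4. Once you are happy with the result, drop the old collection, e.g.
            //    CollectionAdminRequest.deleteCollection(oldCollection).process(client);
        }
    }
}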
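And for the original question: the schema edit itself is small, though per the rest of the thread the existing documents still have to be re-indexed into a fresh collection afterwards. A rough sketch along the lines of Alexandre's suggestion - the type name is made up, and the analyzer is only an example that never runs while indexed="false":

<!-- solr.TextField avoids the 32K single-term limit that an indexed "string"
     (StrField) runs into. -->
<fieldType name="text_basic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Stored for retrieval (fl / output fields) but not indexed, and marked large
     so big values are loaded lazily rather than cached with the document. -->
<field name="messagetext" type="text_basic" indexed="false" stored="true" large="true"/>

If the field really does have to be searched, set indexed="true" and keep a tokenizing analyzer; a single un-tokenized term over 32K will still be rejected.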