Not sure what you’re asking here. Re-indexing, as I was using the term, means completely removing the index and starting over, or indexing to a new collection. At any rate, starting from a state where there are _no_ segments.
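Concretely, “no segments” can be reached either by creating a brand-new collection or by wiping the existing one in place. A sketch of the latter (collection name and host are made up; assumes a local Solr 8.x):

```shell
# Delete every document; once the deletes are committed and segments
# are merged away, the index is effectively empty and safe to rebuild.
curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '{"delete": {"query": "*:*"}}'
```

Even then, change the schema _before_ you start re-indexing, not midway through.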
I’m guessing you’re still thinking that re-indexing without doing the above will work; it won’t. The way merging works, it chooses segments based on a number of things, including the percentage of deleted documents. But there are still _other_ live docs in the segment:

Segment S1 has docs 1, 2, 3, 4 (old definition)
Segment S2 has docs 5, 6, 7, 8 (new definition)

Doc 2 is deleted, and S1 and S2 are merged into S3. The whole discussion about not being able to do the right thing kicks in: should S3 use the new or the old definition? Whichever one it uses is wrong for the other segment. And remember, Lucene simply _cannot_ “do the right thing” if the data isn’t there.

What you may be missing is that a segment is a “mini-index”. The underlying assumption is that all documents in that segment were produced with the same schema and can be accessed the same way. My comments about merging “doing the right thing” are really about transforming docs so that all the docs can be treated the same. Which they can’t be if they were produced with different schemas.

Robert Muir’s statement is interesting here, built on Mike McCandless’ comment:

"I think the key issue here is Lucene is an index not a database. Because it is a lossy index and does not retain all of the user’s data, its not possible to safely migrate some things automagically. … The function is y = f(x) and if x is not available its not possible, so lucene can't do it."

Don’t try to get around this. Plan to re-index the entire corpus into a new collection whenever you change the schema, and then perhaps use an alias to make the switch seamless from the user’s perspective. If you simply cannot re-index from the system-of-record, you have two choices:

1> use new collections whenever you need to change the schema and “somehow” have the app do different things with the new and old collections
2> set stored=true for all your source fields (i.e. not copyField destinations).
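The “new collection plus alias” cutover can look like this (a sketch against a local SolrCloud; the collection, alias, and configset names are made up; assumes the Solr 8.x Collections API):

```shell
# Create the new collection against the new schema
# ("new_conf" is a hypothetical configset containing it):
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mydata_v2&collection.configName=new_conf&numShards=1'

# ... re-index the entire corpus into mydata_v2 ...

# Point the alias the application queries at the new collection.
# CREATEALIAS overwrites an existing alias, so the switch is atomic
# from the client's perspective:
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mydata&collections=mydata_v2'
```

If clients always query the alias `mydata`, a schema change never requires a client change; the old collection can be deleted once the alias has moved.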
You can either roll your own program that pulls data from the old collection and sends it to the new one, or use the Collections API REINDEXCOLLECTION call. But note that the docs specifically call out that all fields must be stored to use the API; what happens under the covers is that the stored fields are read and sent to the target collection.

In both of these cases, Robert’s comment doesn’t apply. Well, it does apply, but “if x is not available” is not the case: the original _is_ available; it’s the stored data.

I’m over-stating the case somewhat. There are a few changes you can get away with while re-indexing all the docs into an existing index: changing from stored=true to stored=false, adding new fields, deleting fields (although the meta-data for the field is still kept around), etc.

> On Oct 16, 2020, at 3:57 PM, David Hastings <hastings.recurs...@gmail.com> wrote:
>
> Gotcha, thanks for the explanation. Another small question if you
> don’t mind: when deleting docs they aren’t actually removed, just tagged as
> deleted, and the old field/field type is still in the index until
> merged/optimized as well. Wouldn’t that cause almost the same conflicts
> until then?
>
> On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Doesn’t re-indexing a document just delete/replace….
>>
>> It’s complicated. For the individual document, yes. The problem
>> comes because the field is inconsistent _between_ documents, and
>> segment merging blows things up.
>>
>> Consider: I have segment1 with documents indexed with the old
>> schema (String in this case). I change my schema and index the same
>> field as a text type.
>>
>> Eventually, a segment merge happens and these two segments get merged
>> into a single new segment. How should the field be handled? Should it
>> be defined as String or Text in the new segment?
>> If you convert the docs with a Text definition for the field to String,
>> you’d lose the ability to search for individual tokens. If you convert the
>> String to Text, you don’t have any guarantee that the information is even
>> available.
>>
>> This is just the tip of the iceberg in terms of trying to change the
>> definition of a field. Take the case of changing the analysis chain:
>> say you use a phonetic filter on a field, then decide to remove it and
>> do not store the original. “Erick” might be encoded as “ENXY”, so the
>> original data is simply not there to convert. Ditto removing a
>> stemmer, lowercasing, applying a regex, …
>>
>> From Mike McCandless:
>>
>> "This really is the difference between an index and a database:
>> we do not store, precisely, the original documents. We store
>> an efficient derived/computed index from them. Yes, Solr/ES
>> can add database-like behavior where they hold the true original
>> source of the document and use that to rebuild Lucene indices
>> over time. But Lucene really is just a "search index" and we
>> need to be free to make important improvements with time."
>>
>> And all that aside, you have to re-index all the docs anyway or
>> your search results will be inconsistent. So leaving aside the
>> impossible task of covering all the possibilities on the fly, it’s
>> better to plan on re-indexing….
>>
>> Best,
>> Erick
>>
>>> On Oct 16, 2020, at 3:16 PM, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>
>>> "If you want to
>>> keep the same field name, you need to delete all of the
>>> documents in the index, change the schema, and reindex."
>>>
>>> Actually, doesn’t re-indexing a document just delete/replace anyway,
>>> assuming the same id?
>>>
>>> On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>
>>>> Just as a side note:
>>>>
>>>>> indexed="true"
>>>>
>>>> If you are storing a 32K message, you probably are not searching it as a
>>>> whole string.
>>>> So, don’t index it. You may also want to mark the field
>>>> as 'large' (and lazy):
>>>>
>>>> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
>>>>
>>>> When you make it a text field, you will probably
>>>> run into the same issues as well.
>>>>
>>>> And honestly, if you are not storing those fields to search, maybe you
>>>> need to reconsider the architecture. Maybe those fields do not need to
>>>> be in Solr at all, but in external systems. Solr (or any search
>>>> system) should not be your system of record since, as the other
>>>> reply showed, some of the answers are "reindex everything".
>>>>
>>>> Regards,
>>>>    Alex.
>>>>
>>>> On Fri, 16 Oct 2020 at 14:02, yaswanth kumar <yaswanth...@gmail.com> wrote:
>>>>>
>>>>> I am using Solr 8.2.
>>>>>
>>>>> Can I change the schema fieldtype from string to solr.TextField
>>>>> without re-indexing?
>>>>>
>>>>> <field name="messagetext" type="string" indexed="true" stored="true"/>
>>>>>
>>>>> The reason is that string has only a 32K char limit, whereas I am looking
>>>>> to store more than 32K now.
>>>>>
>>>>> The contents of this field don’t require any analysis or tokenization, but I
>>>>> need this field in the queries as well as in the output fields.
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Yaswanth Kumar Konathala.
>>>>> yaswanth...@gmail.com