Not sure what you’re asking here. Re-indexing, as I was
using the term, means completely removing the index and
starting over, or indexing to a new collection. At any
rate, starting from a state where there are _no_ segments.

I’m guessing you’re still thinking that re-indexing without
doing the above will work; it won’t. Merging chooses
segments based on a number of factors, including the
percentage of deleted documents. But there are still _other_
live docs in each segment.

Segment S1 has docs 1, 2, 3, 4 (old definition)
Segment S2 has docs 5, 6, 7, 8 (new definition)

Doc 2 is deleted, and S1 and S2 are merged into S3. The whole
discussion about not being able to do the right thing kicks in.
Should S3 use the new or old definition? Whichever one
it uses is wrong for the other segment. And remember,
Lucene simply _cannot_ “do the right thing” if the data
isn’t there.
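To make the dilemma concrete, here’s a toy sketch (Python; the doc IDs match the example above, but the terms and the “merge” are made up and nothing like Lucene’s real internals):

```python
# Segment S1, old definition: the field is a single untokenized string term.
s1 = {1: ["red barn"], 2: ["blue sky"], 3: ["tall tree"], 4: ["old mill"]}

# Segment S2, new definition: the same field is tokenized into words.
s2 = {5: ["green", "field"], 6: ["red", "barn"],
      7: ["dry", "creek"], 8: ["low", "cloud"]}

def merge(seg_a, seg_b, deleted):
    """Merge two segments, dropping deleted docs. The merged segment
    has to treat every doc the same way -- but which way? Copying the
    postings as-is silently mixes the two definitions."""
    merged = {}
    for seg in (seg_a, seg_b):
        for doc_id, terms in seg.items():
            if doc_id not in deleted:
                merged[doc_id] = terms
    return merged

# Doc 2 is deleted; S1 and S2 are merged into S3.
s3 = merge(s1, s2, deleted={2})

# A token search for "barn" matches doc 6 but misses doc 1, whose
# "red barn" was indexed as one opaque term: one segment, two behaviors.
hits = [doc_id for doc_id, terms in s3.items() if "barn" in terms]
print(hits)  # [6]
```

Whichever definition the merge normalized to, it would be wrong for the docs produced under the other one.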

What you may be missing is that a segment is a “mini-index”.
The underlying assumption is that all documents in that
segment are produced with the same schema and can be
accessed the same way. My comments about merging
“doing the right thing” are really about transforming docs
so all the docs can be treated the same, which they can’t
be if they were produced with different schemas.

Robert Muir’s statement is interesting here, built
on Mike McCandless’ comment:

"I think the key issue here is Lucene is an index not a database.
Because it is a lossy index and does not retain all of the user’s
data, its not possible to safely migrate some things automagically.
…. The function is y = f(x) and if x is not available its not 
possible, so lucene can't do it."
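The y = f(x) point in a few lines, with a crude made-up “analyzer” standing in for a real stemmer or phonetic filter (both are lossy in the same way):

```python
def analyze(token: str) -> str:
    """Toy lossy analyzer: lowercase, then strip one common suffix."""
    t = token.lower()
    for suffix in ("ing", "ed", "s"):
        if t.endswith(suffix):
            return t[: -len(suffix)]
    return t

# Many different inputs collapse to the same indexed term...
assert analyze("Walking") == analyze("walked") == analyze("walks") == "walk"

# ...so if only y = "walk" is in the index and the source wasn't stored,
# there is no inverse: Lucene cannot recover which x the user wrote.
print(analyze("Walking"))  # walk
```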

Don’t try to get around this. Prepare to
re-index the entire corpus into a new collection whenever
you change the schema, and then perhaps use an alias so the
switch is seamless from the user’s perspective. If you
simply cannot re-index from the system-of-record, you have
two choices:

1> use new collections whenever you need to change the
   schema and “somehow” have the app do different things
   with the new and old collections

2> set stored=true for all your source fields (i.e. not
   copyField destinations). You can either roll your own
   program that pulls data from the old collection and sends
   it to the new one, or use the Collections API
   REINDEXCOLLECTION call. But note that the docs
   specifically call out that all fields must be stored to
   use that API; what happens under the covers is that the
   stored fields are read and sent to the target collection.
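For what it’s worth, the REINDEXCOLLECTION call in option 2> is just a Collections API request. A sketch of building it (the localhost:8983 base URL and the collection names are assumptions; check the Collections API docs for the full parameter list):

```python
from urllib.parse import urlencode

# Assumed host/port and collection names -- adjust for your cluster.
base = "http://localhost:8983/solr/admin/collections"
params = {
    "action": "REINDEXCOLLECTION",
    "name": "old_collection",    # source; all its fields must be stored
    "target": "new_collection",  # gets re-indexed under the new schema
}
url = f"{base}?{urlencode(params)}"
print(url)  # send this with any HTTP client, e.g. curl
```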

In both these cases, Robert’s comment doesn’t apply. Well,
it does apply, but “if x is not available” is not the case:
the original _is_ available; it’s the stored data...
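And the alias trick mentioned earlier is also just one Collections API call; a sketch (the alias and collection names here are made up):

```python
from urllib.parse import urlencode

# The app always queries the alias; after re-indexing, repoint it at the
# new collection and the switch is invisible to users. Names are assumed.
base = "http://localhost:8983/solr/admin/collections"
params = {
    "action": "CREATEALIAS",
    "name": "messages",            # the name the application queries
    "collections": "messages_v2",  # the freshly re-indexed collection
}
url = f"{base}?{urlencode(params)}"
print(url)
```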

I’m overstating the case somewhat; there are a few changes
you can get away with while re-indexing all the docs into an
existing index: things like changing from stored=true to
stored=false, adding new fields, deleting fields (although the
meta-data for the field is still kept around), etc.

> On Oct 16, 2020, at 3:57 PM, David Hastings <hastings.recurs...@gmail.com> 
> wrote:
> 
> Gotcha, thanks for the explanation.  Another small question if you
> don’t mind: when deleting docs they aren’t actually removed, just tagged as
> deleted, and the old field/field type is still in the index until
> merged/optimized as well. Wouldn’t that cause almost the same conflicts
> until then?
> 
> On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> Doesn’t re-indexing a document just delete/replace….
>> 
>> It’s complicated. For the individual document, yes. The problem
>> comes because the field is inconsistent _between_ documents, and
>> segment merging blows things up.
>> 
>> Consider. I have segment1 with documents indexed with the old
>> schema (String in this case). I  change my schema and index the same
>> field as a text type.
>> 
>> Eventually, a segment merge happens and these two segments get merged
>> into a single new segment. How should the field be handled? Should it
>> be defined as String or Text in the new segment? If you convert the docs
>> with a Text definition for the field to String,
>> you’d lose the ability to search for individual tokens. If you convert the
>> String to Text, you don’t have any guarantee that the information is even
>> available.
>> 
>> This is just the tip of the iceberg in terms of trying to change the
>> definition of a field. Take the case of changing the analysis chain,
>> say you use a phonetic filter on a field then decide to remove it and
>> do not store the original. Erick might be encoded as “ENXY” so the
>> original data is simply not there to convert. Ditto removing a
>> stemmer, lowercasing, applying a regex, …...
>> 
>> 
>> From Mike McCandless:
>> 
>> "This really is the difference between an index and a database:
>> we do not store, precisely, the original documents.  We store
>> an efficient derived/computed index from them.  Yes, Solr/ES
>> can add database-like behavior where they hold the true original
>> source of the document and use that to rebuild Lucene indices
>> over time.  But Lucene really is just a "search index" and we
>> need to be free to make important improvements with time."
>> 
>> And all that aside, you have to re-index all the docs anyway or
>> your search results will be inconsistent. So leaving aside the
>> impossible task of covering all the possibilities on the fly, it’s
>> better to plan on re-indexing….
>> 
>> Best,
>> Erick
>> 
>> 
>>> On Oct 16, 2020, at 3:16 PM, David Hastings <
>> hastings.recurs...@gmail.com> wrote:
>>> 
>>> "If you want to
>>> keep the same field name, you need to delete all of the
>>> documents in the index, change the schema, and reindex."
>>> 
>>> actually doesnt re-indexing a document just delete/replace anyways
>> assuming
>>> the same id?
>>> 
>>> On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch <
>> arafa...@gmail.com>
>>> wrote:
>>> 
>>>> Just as a side note,
>>>> 
>>>>> indexed="true"
>>>> If you are storing 32K message, you probably are not searching it as a
>>>> whole string. So, don't index it. You may also want to mark the field
>>>> as 'large' (and lazy):
>>>> 
>>>> 
>> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
>>>> 
>>>> When you are going to make it a text field, you will probably be
>>>> having the same issues as well.
>>>> 
>>>> And honestly, if you are not storing those fields to search, maybe you
>>>> need to consider the architecture. Maybe those fields do not need to
>>>> be in Solr at all, but in external systems. Solr (or any search
>>>> system) should not be your system of records since - as the other
>>>> reply showed - some of the answers are "reindex everything".
>>>> 
>>>> Regards,
>>>>  Alex.
>>>> 
>>>> On Fri, 16 Oct 2020 at 14:02, yaswanth kumar <yaswanth...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> I am using solr 8.2
>>>>> 
>>>>> Can I change the schema fieldtype from string to solr.TextField
>>>>> without indexing?
>>>>> 
>>>>>   <field name="messagetext" type="string" indexed="true"
>>>> stored="true"/>
>>>>> 
>>>>> The reason is that string has only 32K char limit where as I am looking
>>>> to
>>>>> store more than 32K now.
>>>>> 
>>>>> The contents on this field doesn't require any analysis or tokenized
>> but
>>>> I
>>>>> need this field in the queries and as well as output fields.
>>>>> 
>>>>> --
>>>>> Thanks & Regards,
>>>>> Yaswanth Kumar Konathala.
>>>>> yaswanth...@gmail.com
>>>> 
>> 
>> 
