Sorry, I was thinking that just using the <delete><query>*:*</query></delete> method for clearing the index would still leave the old documents (and their field data) sitting in the segments.
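For concreteness, a rough SolrJ sketch of that delete-by-query call (the URL and collection name are just placeholders for a stock 8.x setup); even after the commit, the "deleted" docs only physically go away once their segments get merged:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class WipeByQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection name - adjust for your install.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            // Equivalent of posting <delete><query>*:*</query></delete> to /update:
            // documents are only *marked* deleted inside their existing segments.
            client.deleteByQuery("*:*");
            client.commit();
            // The old segments (and the old field/fieldType data in them) stay on
            // disk until merges rewrite them, which is exactly the concern here.
        }
    }
}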
On Fri, Oct 16, 2020 at 4:28 PM Erick Erickson <erickerick...@gmail.com> wrote:
> Not sure what you're asking here. Re-indexing, as I was
> using the term, means completely removing the index and
> starting over. Or indexing to a new collection. At any
> rate, starting from a state where there are _no_ segments.
>
> I'm guessing you're still thinking that re-indexing without
> doing the above will work; it won't. The way merging works,
> it chooses segments based on a number of things, including
> the percentage of deleted documents. But there are still _other_
> live docs in the segment.
>
> Segment S1 has docs 1, 2, 3, 4 (old definition)
> Segment S2 has docs 5, 6, 7, 8 (new definition)
>
> Doc 2 is deleted, and S1 and S2 are merged into S3. The whole
> discussion about not being able to do the right thing kicks in.
> Should S3 use the new or old definition? Whichever one it uses
> is wrong for the other segment. And remember, Lucene simply
> _cannot_ "do the right thing" if the data isn't there.
>
> What you may be missing is that a segment is a "mini-index".
> The underlying assumption is that all documents in that
> segment are produced with the same schema and can be
> accessed the same way. My comments about merging
> "doing the right thing" are really about transforming docs
> so all the docs can be treated the same. Which they can't be
> if they were produced with different schemas.
>
> Robert Muir's statement is interesting here, built
> on Mike McCandless' comment:
>
> "I think the key issue here is Lucene is an index not a database.
> Because it is a lossy index and does not retain all of the user's
> data, its not possible to safely migrate some things automagically.
> …. The function is y = f(x) and if x is not available its not
> possible, so lucene can't do it."
>
> Don't try to get around this. Prepare to
> re-index the entire corpus into a new collection whenever
> you change the schema, and then maybe use an alias to
> switch over seamlessly from the user's perspective. If you
> simply cannot re-index from the system-of-record, you have
> two choices:
>
> 1> Use new collections whenever you need to change the
> schema and "somehow" have the app do different things
> with the new and old collections.
>
> 2> Set stored=true for all your source fields (i.e. not
> copyField destinations). You can either roll your own
> program that pulls data from the old collection and sends
> it to the new one, or use the Collections API REINDEXCOLLECTION
> call. But note that the docs specifically call out that all
> fields must be stored to use that API; what happens under the
> covers is that the stored fields are read and sent to the
> target collection.
>
> In both these cases, Robert's comment doesn't apply. Well,
> it does apply, but "if x is not available" is not the case;
> the original _is_ available, it's the stored data...
>
> I'm over-stating the case somewhat; there are a few changes
> where you can get away with re-indexing all the docs into an
> existing index, things like changing from stored=true to
> stored=false, adding new fields, deleting fields (although the
> meta-data for the field is still kept around), etc.
>
> > On Oct 16, 2020, at 3:57 PM, David Hastings <hastings.recurs...@gmail.com> wrote:
> >
> > Gotcha, thanks for the explanation.
> > Another small question if you don't mind: when deleting docs they aren't
> > actually removed, just tagged as deleted, and the old field/field type is
> > still in the index until merged/optimized as well. Wouldn't that cause
> > almost the same conflicts until then?
> >
> > On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> Doesn't re-indexing a document just delete/replace…
> >>
> >> It's complicated. For the individual document, yes. The problem
> >> comes because the field is inconsistent _between_ documents, and
> >> segment merging blows things up.
> >>
> >> Consider: I have segment1 with documents indexed with the old
> >> schema (String in this case). I change my schema and index the same
> >> field as a Text type.
> >>
> >> Eventually, a segment merge happens and these two segments get merged
> >> into a single new segment. How should the field be handled? Should it
> >> be defined as String or Text in the new segment? If you convert the
> >> docs with a Text definition for the field to String, you'd lose the
> >> ability to search for individual tokens. If you convert the String to
> >> Text, you don't have any guarantee that the information is even
> >> available.
> >>
> >> This is just the tip of the iceberg in terms of trying to change the
> >> definition of a field. Take the case of changing the analysis chain:
> >> say you use a phonetic filter on a field, then decide to remove it,
> >> and do not store the original. "Erick" might be encoded as "ENXY", so
> >> the original data is simply not there to convert. Ditto for removing
> >> a stemmer, lowercasing, applying a regex, …
> >>
> >> From Mike McCandless:
> >>
> >> "This really is the difference between an index and a database:
> >> we do not store, precisely, the original documents. We store
> >> an efficient derived/computed index from them. Yes, Solr/ES
> >> can add database-like behavior where they hold the true original
> >> source of the document and use that to rebuild Lucene indices
> >> over time. But Lucene really is just a "search index" and we
> >> need to be free to make important improvements with time."
> >>
> >> And all that aside, you have to re-index all the docs anyway or
> >> your search results will be inconsistent. So leaving aside the
> >> impossible task of covering all the possibilities on the fly, it's
> >> better to plan on re-indexing…
> >>
> >> Best,
> >> Erick
> >>
> >>> On Oct 16, 2020, at 3:16 PM, David Hastings <hastings.recurs...@gmail.com> wrote:
> >>>
> >>> "If you want to
> >>> keep the same field name, you need to delete all of the
> >>> documents in the index, change the schema, and reindex."
> >>>
> >>> Actually, doesn't re-indexing a document just delete/replace anyway,
> >>> assuming the same id?
> >>>
> >>> On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >>>
> >>>> Just as a side note,
> >>>>
> >>>>> indexed="true"
> >>>> If you are storing a 32K message, you probably are not searching it
> >>>> as a whole string. So, don't index it. You may also want to mark the
> >>>> field as 'large' (and lazy):
> >>>>
> >>>> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
> >>>>
> >>>> When you do make it a text field, you will probably run into the
> >>>> same issues as well.
> >>>>
> >>>> And honestly, if you are not storing those fields to search, maybe
> >>>> you need to reconsider the architecture.
> >>>> Maybe those fields do not need to
> >>>> be in Solr at all, but in external systems. Solr (or any search
> >>>> system) should not be your system of record since - as the other
> >>>> reply showed - some of the answers are "reindex everything".
> >>>>
> >>>> Regards,
> >>>>    Alex.
> >>>>
> >>>> On Fri, 16 Oct 2020 at 14:02, yaswanth kumar <yaswanth...@gmail.com> wrote:
> >>>>>
> >>>>> I am using Solr 8.2.
> >>>>>
> >>>>> Can I change the schema fieldtype from string to solr.TextField
> >>>>> without re-indexing?
> >>>>>
> >>>>> <field name="messagetext" type="string" indexed="true" stored="true"/>
> >>>>>
> >>>>> The reason is that string has only a 32K char limit, whereas I am
> >>>>> looking to store more than 32K now.
> >>>>>
> >>>>> The contents of this field don't require any analysis or
> >>>>> tokenization, but I need this field in the queries as well as in
> >>>>> the output fields.
> >>>>>
> >>>>> --
> >>>>> Thanks & Regards,
> >>>>> Yaswanth Kumar Konathala.
> >>>>> yaswanth...@gmail.com
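To make Erick's option 2> concrete, here is a rough SolrJ sketch of the "new collection + alias" approach. The collection, configset, and alias names are invented for the example, and it assumes SolrCloud plus the REINDEXCOLLECTION support added in Solr 8.1; the plain HTTP Collections API calls behave the same way:

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SchemaChangeReindex {
    public static void main(String[] args) throws Exception {
        // Invented names: "search" is the alias the application actually queries.
        String oldCollection = "docs_v1";
        String newCollection = "docs_v2";       // built from the *new* schema/configset
        String alias = "search";

        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build()) {

            // 1. Create the target collection from the updated configset.
            CollectionAdminRequest.createCollection(newCollection, "docs_v2_config", 1, 1)
                    .process(client);

            // 2. Copy the documents across. This only works if all source fields are
            //    stored, because under the covers the stored fields are read and
            //    re-sent to the target collection.
            //    HTTP equivalent:
            //    /admin/collections?action=REINDEXCOLLECTION&name=docs_v1&target=docs_v2
            CollectionAdminRequest.ReindexCollection reindex =
                    CollectionAdminRequest.reindexCollection(oldCollection);
            reindex.setTarget(newCollection);
            reindex.process(client);

            // 3. Point the alias at the new collection so clients switch seamlessly.
            CollectionAdminRequest.createAlias(alias, newCollection).process(client);

            // 4. Once you are happy with the result, drop the old collection, e.g.
            //    CollectionAdminRequest.deleteCollection(oldCollection).process(client);
        }
    }
}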
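And for the original question: the schema edit itself is small, though per the rest of the thread the existing documents still have to be re-indexed into a fresh collection afterwards. A rough sketch along the lines of Alexandre's suggestion - the type name is made up, and the analyzer is only an example that never runs while indexed="false":

<!-- solr.TextField avoids the 32K single-term limit that an indexed "string"
     (StrField) runs into. -->
<fieldType name="text_basic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Stored for retrieval (fl / output fields) but not indexed, and marked large
     so big values are loaded lazily rather than cached with the document. -->
<field name="messagetext" type="text_basic" indexed="false" stored="true" large="true"/>

If the field really does have to be searched, set indexed="true" and keep a tokenizing analyzer; a single un-tokenized term over 32K will still be rejected.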