Aliasing works great. I implemented it after upgrading to Solr 5, and it
allows us to do this exact thing. The only thing you have to watch out
for is indexing new items (if they overwrite old ones) while you are
re-indexing.

I took it a step further for another collection that stores a lot of
time-based data from logs. I have two aliases for that collection: logs
and logs_indexing. Every month a new collection gets created, called
logs_201605 or something like that, and both aliases get updated.
logs_indexing then points only to the newest collection, which is where
all the indexing goes, while the logs alias gets updated to include the
new collection as well (since aliases can point to multiple
collections).
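
That monthly rollover can be scripted against the Collections API. Here
is a rough Python sketch that only builds the CREATE and CREATEALIAS
URLs; the host, shard count, and configset name (logs_conf) are my
assumptions, not details from this thread:

```python
from datetime import date
from urllib.parse import urlencode

# Assumed Solr host; adjust for your cluster.
SOLR = "http://localhost:8983/solr/admin/collections"

def rollover_urls(existing, today=None, base=SOLR):
    """Build the Collections API calls for one monthly rollover."""
    today = today or date.today()
    new = "logs_%04d%02d" % (today.year, today.month)
    # Create this month's collection.
    create = base + "?" + urlencode(
        {"action": "CREATE", "name": new,
         "numShards": 1, "collection.configName": "logs_conf"})
    # logs_indexing points only at the newest collection ...
    indexing = base + "?" + urlencode(
        {"action": "CREATEALIAS", "name": "logs_indexing",
         "collections": new})
    # ... while the logs alias spans every collection, old and new.
    search = base + "?" + urlencode(
        {"action": "CREATEALIAS", "name": "logs",
         "collections": ",".join(existing + [new])})
    return new, [create, indexing, search]
```

Hitting those URLs in order (with curl or urllib.request) would create
logs_201605, repoint logs_indexing at it alone, and widen logs to cover
every monthly collection.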

Here is the link to the documentation.
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4

On Tue, May 10, 2016 at 12:55 PM, Horváth Péter Gergely <
peter.gergely.horv...@gmail.com> wrote:

> Hi Erick,
>
> Most of the time we have to do a full re-index. I do love your second
> idea; I will take a look at the details of that. Thank you! :)
>
> Cheers,
> Peter
>
> 2016-05-10 17:10 GMT+02:00 Erick Erickson <erickerick...@gmail.com>:
>
> > Peter:
> >
> > Yeah, that would work, but there are a couple of alternatives:
> > 1> If there's any way to know what subset of docs has
> >      changed, just re-index _them_. The problem here is
> >      picking up deletes. In the RDBMS case this is often done
> >      by creating a trigger for deletes; the last step
> >      in your update is then to remove the docs listed in the
> >      deleted_docs table (or whatever) since the last time you
> >      indexed. This falls down if a> you require an instantaneous
> >      switch from _all_ the old data to the new or b> you can't
> >      get a list of deleted docs.
> >
> > 2> Use collection aliasing. The pattern is this: you have your
> >      "Hot" collection (col1) serving queries that is pointed to
> >      by alias "hot". You create a new collection (col2) and index
> >      to it in the background. When done, use CREATEALIAS
> >      to point "hot" to "col2". Now you can delete col1. There are
> >      no restrictions on where these collections live, so this
> >      allows you to move your collections around as you want. Plus
> >      this keeps a better separation of old and new data...
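
For illustration, the swap described in 2> boils down to two
Collections API calls. A minimal Python sketch that only constructs the
URLs (the host is an assumption; the names col1, col2, and "hot" follow
the example above):

```python
from urllib.parse import urlencode

# Assumed Solr host; adjust for your cluster.
SOLR = "http://localhost:8983/solr/admin/collections"

def swap_alias(alias, new_collection, base=SOLR):
    """CREATEALIAS repoints an existing alias in one step, so queries
    against the alias switch to the new collection without downtime."""
    return base + "?" + urlencode(
        {"action": "CREATEALIAS", "name": alias,
         "collections": new_collection})

def drop_collection(name, base=SOLR):
    """Once nothing points at the old collection, DELETE reclaims it."""
    return base + "?" + urlencode({"action": "DELETE", "name": name})
```

After indexing col2 in the background, you would hit
swap_alias("hot", "col2") and then drop_collection("col1").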
> >
> > Best,
> > Erick
> >
> > On Tue, May 10, 2016 at 4:32 AM, Horváth Péter Gergely
> > <peter.gergely.horv...@gmail.com> wrote:
> > > Hi Everyone,
> > >
> > > I am wondering if there is any best practice regarding re-indexing
> > > documents in SolrCloud 6.0.0 without making the data (or the
> > > underlying collection) temporarily unavailable. Wiping all
> > > documents in a collection and performing a full re-indexing is not
> > > a viable alternative for us.
> > >
> > > Say we have a massive SolrCloud cluster with a number of separate
> > > nodes hosting *multiple hundreds* of collections, with document
> > > counts ranging from a couple of thousand to multiple millions (up
> > > to 20 million or so), each with 200-300 fields, and a background
> > > batch loader job that fetches data from a variety of source
> > > systems.
> > >
> > > We have to keep the cluster and ALL collections online all the
> > > time (365 x 24): we cannot allow queries to be blocked while data
> > > in a collection is being updated, and we cannot load everything in
> > > a single-shot jumbo commit (the replication could overload the
> > > cluster).
> > >
> > > One solution I could imagine is storing an additional field, "load
> > > time-stamp", in all documents, and having the client (interactive
> > > query) application extend all queries with an additional
> > > restriction that requires each document's "load time-stamp" to be
> > > the latest known completed one.
> > >
> > > This concept would work as follows:
> > > 1.) The batch job would simply start loading new documents, with
> > > the new "load time-stamp". Existing documents would not be touched.
> > > 2.) The client (interactive query) application would still use the
> > > old data from the previous load (since all queries are restricted
> > > with the old "load time-stamp").
> > > 3.) The batch job would store the new "load time-stamp" as the one
> > > to be used (e.g. in a separate collection etc.) -- after this, all
> > > queries would return the most up-to-date documents.
> > > 4.) The batch job would purge all documents from the collection
> > > where the "load time-stamp" is not the same as the last one.
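
As a sketch, the query strings steps 2.) and 4.) would need could look
like this, assuming the "load time-stamp" is stored in a numeric field
called load_ts (the field name is my assumption; the thread only calls
it "load time-stamp"):

```python
def query_filter(current_ts):
    """Filter query (fq) the interactive application would append in
    step 2, pinning results to the last completed load."""
    return "load_ts:%d" % current_ts

def purge_query(current_ts):
    """Delete-by-query for step 4: match everything, then exclude the
    documents carrying the latest load time-stamp."""
    return "*:* -load_ts:%d" % current_ts
```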
> > >
> > > This approach seems implementable; however, I definitely want to
> > > avoid reinventing the wheel, and I am wondering if there is any
> > > better solution or built-in SolrCloud feature to achieve the same
> > > or something similar.
> > >
> > > Thanks,
> > > Peter
> >
>