Hi Erick,

Most of the time we have to do a full re-index, so I really like your second
idea; I will take a look at the details. Thank you! :)

Cheers,
Peter

2016-05-10 17:10 GMT+02:00 Erick Erickson <erickerick...@gmail.com>:

> Peter:
>
> Yeah, that would work, but there are a couple of alternatives:
> 1> If there's any way to know which subset of docs has
>      changed, just re-index _them_. The problem here is
>      picking up deletes. In the RDBMS case this is often done
>      by creating a trigger for deletes; the last step of your
>      update is then to remove the docs deleted since the last
>      time you indexed, using the deleted_docs table (or
>      whatever). See the first sketch below this list. This
>      falls down if a> you require an instantaneous switch
>      from _all_ the old data to the new or b> you can't get a
>      list of deleted docs.
>
> 2> Use collection aliasing. The pattern is this: your "hot"
>      collection (col1), pointed to by the alias "hot", is the
>      one serving queries. You create a new collection (col2)
>      and index to it in the background. When done, use
>      CREATEALIAS to point "hot" at col2. Now you can delete
>      col1 (see the second sketch below). There are no
>      restrictions on where these collections live, so this
>      lets you move your collections around as you want. Plus
>      it keeps a better separation of old and new data...
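>
> To illustrate 1>, a rough SolrJ sketch -- the ids, collection name and
> ZooKeeper addresses are placeholders, and the exact SolrJ calls are worth
> double-checking against the javadocs of your version:
>
>     import java.util.Arrays;
>     import java.util.List;
>     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>
>     public class RemoveDeletedDocsSketch {
>       public static void main(String[] args) throws Exception {
>         try (CloudSolrClient solr = new CloudSolrClient.Builder()
>             .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
>           solr.setDefaultCollection("col1");
>
>           // In reality these ids come from the deleted_docs table,
>           // restricted to rows added since the last indexing run.
>           List<String> deletedIds = Arrays.asList("doc-17", "doc-42");
>
>           if (!deletedIds.isEmpty()) {
>             solr.deleteById(deletedIds); // drop docs deleted since the last run
>           }
>           // ... then re-index only the changed subset and commit ...
>           solr.commit();
>         }
>       }
>     }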
>
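> And a sketch of 2> using the Collections API through SolrJ (collection,
> configset and ZooKeeper names are invented; the same calls can be made
> over plain HTTP against /admin/collections):
>
>     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>     import org.apache.solr.client.solrj.request.CollectionAdminRequest;
>
>     public class SwapHotCollectionSketch {
>       public static void main(String[] args) throws Exception {
>         try (CloudSolrClient solr = new CloudSolrClient.Builder()
>             .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
>
>           // 1. Build the new collection in the background (sizing is an example).
>           CollectionAdminRequest
>               .createCollection("col2", "myConfigSet", 4, 2)
>               .process(solr);
>           // ... index everything into col2, then commit ...
>
>           // 2. Atomically repoint the alias the query applications use.
>           CollectionAdminRequest.createAlias("hot", "col2").process(solr);
>
>           // 3. Once nothing references col1 any more, drop it.
>           CollectionAdminRequest.deleteCollection("col1").process(solr);
>         }
>       }
>     }
>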
> Best,
> Erick
>
> On Tue, May 10, 2016 at 4:32 AM, Horváth Péter Gergely
> <peter.gergely.horv...@gmail.com> wrote:
> > Hi Everyone,
> >
> > I am wondering if there is any best practice regarding re-indexing
> > documents in SolrCloud 6.0.0 without making the data (or the underlying
> > collection) temporarily unavailable. Wiping all documents in a collection
> > and performing a full re-indexing is not a viable alternative for us.
> >
> > Say we have a massive SolrCloud cluster with a number of separate nodes
> > hosting *several hundred* collections, with document counts ranging from
> > a couple of thousand to several (say up to 20) million documents, each
> > with 200-300 fields, and a background batch loader job that fetches data
> > from a variety of source systems.
> >
> > We have to keep the cluster and ALL collections online all the time
> > (365 x 24): we cannot allow queries to be blocked while data in a
> > collection is being updated, and we cannot load everything in a
> > single-shot jumbo commit (the replication could overload the cluster).
> >
> > One solution I could imagine is storing an additional "load time-stamp"
> > field in all documents, and having the client (interactive query)
> > application extend all queries with an additional restriction that
> > requires the documents' "load time-stamp" to be the latest known
> > completed "load time-stamp".
> >
> > The concept would work as follows (a rough sketch of steps 2.) and 4.)
> > follows the list):
> > 1.) The batch job would simply start loading new documents with the new
> > "load time-stamp". Existing documents would not be touched.
> > 2.) The client (interactive query) application would still use the old
> > data from the previous load (since all queries are restricted with the
> > old "load time-stamp").
> > 3.) The batch job would store the new "load time-stamp" as the one to be
> > used (e.g. in a separate collection etc.) -- after this, all queries
> > would return the most up-to-date documents.
> > 4.) The batch job would purge all documents from the collection whose
> > "load time-stamp" is not the same as the last one.
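> >
> > A rough SolrJ sketch of steps 2.) and 4.) -- the field name (load_ts),
> > collection and ZooKeeper addresses are invented for the example, and the
> > current "load time-stamp" would really be read from wherever step 3.)
> > stores it:
> >
> >     import org.apache.solr.client.solrj.SolrQuery;
> >     import org.apache.solr.client.solrj.impl.CloudSolrClient;
> >
> >     public class LoadTimestampSketch {
> >       public static void main(String[] args) throws Exception {
> >         try (CloudSolrClient solr = new CloudSolrClient.Builder()
> >             .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
> >           solr.setDefaultCollection("col1");
> >
> >           // In reality: read the last completed load time-stamp (step 3.).
> >           String currentLoadTs = "2016-05-10T04:32:00Z";
> >
> >           // Step 2.): every interactive query is restricted to that load.
> >           SolrQuery q = new SolrQuery("some user query");
> >           q.addFilterQuery("load_ts:\"" + currentLoadTs + "\"");
> >           solr.query(q);
> >
> >           // Step 4.): after the switch, purge docs without the new time-stamp.
> >           solr.deleteByQuery("*:* -load_ts:\"" + currentLoadTs + "\"");
> >           solr.commit();
> >         }
> >       }
> >     }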
> >
> > This approach seems implementable; however, I definitely want to avoid
> > reinventing the wheel and am wondering if there is any better solution
> > or built-in SolrCloud feature to achieve the same or something similar.
> >
> > Thanks,
> > Peter
>
