Hi Erick,

Most of the time we have to do a full re-index, so I do love your second idea; I will take a look at the details of that. Thank you! :)
Cheers,
Peter

2016-05-10 17:10 GMT+02:00 Erick Erickson <erickerick...@gmail.com>:
> Peter:
>
> Yeah, that would work, but there are a couple of alternatives:
>
> 1> If there's any way to know what subset of docs has changed,
> just re-index _them_. The problem here is picking up deletes. In
> the RDBMS case this is often done by creating a trigger for
> deletes, and then the last step in your update is to remove the
> docs deleted since the last time you indexed, using the
> deleted_docs table (or whatever). This falls down if a> you
> require an instantaneous switch from _all_ the old data to the
> new, or b> you can't get a list of deleted docs.
>
> 2> Use collection aliasing. The pattern is this: you have your
> "hot" collection (col1) serving queries, pointed to by the alias
> "hot". You create a new collection (col2) and index to it in the
> background. When done, use CREATEALIAS to point "hot" to col2.
> Now you can delete col1. There are no restrictions on where these
> collections live, so this allows you to move your collections
> around as you want. Plus this keeps a better separation of old
> and new data...
>
> Best,
> Erick
>
> On Tue, May 10, 2016 at 4:32 AM, Horváth Péter Gergely
> <peter.gergely.horv...@gmail.com> wrote:
> > Hi Everyone,
> >
> > I am wondering whether there is any best practice for re-indexing
> > documents in SolrCloud 6.0.0 without making the data (or the
> > underlying collection) temporarily unavailable. Wiping all
> > documents in a collection and performing a full re-indexing is
> > not a viable alternative for us.
> >
> > Say we have a massive SolrCloud cluster with a number of separate
> > nodes hosting *multiple hundreds* of collections, with document
> > counts ranging from a couple of thousand to multiple (say up to
> > 20) million documents, each with 200-300 fields, and a background
> > batch loader job that fetches data from a variety of source
> > systems.
> >
> > We have to keep the cluster and ALL collections online all the
> > time (365 x 24): we cannot allow queries to be blocked while data
> > in a collection is being updated, and we cannot load everything
> > in a single-shot jumbo commit (the replication could overload the
> > cluster).
> >
> > One solution I could imagine is storing an additional field,
> > "load time-stamp", in all documents and having the client
> > (interactive query) application extend all queries with an
> > additional restriction that requires a document's "load
> > time-stamp" to be the latest known completed "load time-stamp".
> >
> > This concept would work as follows:
> > 1.) The batch job would simply start loading new documents with
> > the new "load time-stamp". Existing documents would not be
> > touched.
> > 2.) The client (interactive query) application would still use
> > the old data from the previous load (since all queries are
> > restricted to the old "load time-stamp").
> > 3.) The batch job would store the new "load time-stamp" as the
> > one to be used (e.g. in a separate collection) -- after this, all
> > queries would return the most up-to-date documents.
> > 4.) The batch job would purge all documents from the collection
> > where the "load time-stamp" is not the same as the last one.
> >
> > This approach seems to be implementable; however, I definitely
> > want to avoid reinventing the wheel and wonder whether there is
> > any better solution or built-in SolrCloud feature that achieves
> > the same or something similar.
> >
> > Thanks,
> > Peter
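
For Erick's first alternative, the final "remove deleted docs" step could be wired up roughly as below with SolrJ and JDBC. This is only a sketch under assumptions of mine: the deleted_docs table, its doc_id/deleted_at columns, and the method name are placeholders, not anything Solr prescribes.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.util.ArrayList;
    import java.util.List;

    public class DeleteSync {

        // Last step of an incremental load: push the ids captured by the
        // RDBMS delete trigger into Solr so the index stops returning them.
        public static void propagateDeletes(Connection db, SolrClient solr,
                                            String collection, Timestamp lastIndexRun)
                throws SQLException, SolrServerException, IOException {
            List<String> ids = new ArrayList<>();
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT doc_id FROM deleted_docs WHERE deleted_at > ?")) {
                ps.setTimestamp(1, lastIndexRun);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getString(1));
                    }
                }
            }
            if (!ids.isEmpty()) {
                solr.deleteById(collection, ids);  // remove the matching Solr documents
                solr.commit(collection);
            }
        }
    }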
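
The alias switch in Erick's second alternative is a single Collections API call (CREATEALIAS overwrites an existing alias). A minimal SolrJ sketch, assuming a client version where the CollectionAdminRequest factory methods below are available; "hot", col1 and col2 are the names from his example:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    import java.io.IOException;

    public class HotAliasSwitch {

        // Queries always go to the alias "hot", so re-pointing it from the old
        // collection to the freshly built one is atomic from the client's view.
        public static void switchHotTo(SolrClient client, String oldCollection,
                                       String newCollection)
                throws SolrServerException, IOException {
            // CREATEALIAS overwrites an existing alias: this is the switch itself
            CollectionAdminRequest.createAlias("hot", newCollection).process(client);

            // once in-flight queries have drained, the old collection can go away
            CollectionAdminRequest.deleteCollection(oldCollection).process(client);
        }
    }

For example, switchHotTo(client, "col1", "col2") once col2 has finished indexing in the background.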
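
And for the "load time-stamp" idea from the original mail, the query-side restriction (step 2) and the purge (step 4) might look roughly as follows. The field name load_timestamp and the numeric stamp are assumptions of this sketch; how the "current" stamp is published (e.g. in a separate collection) is left out:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.response.QueryResponse;

    import java.io.IOException;

    public class LoadStampQueries {

        // Step 2: interactive queries only see the last completed load, because
        // every request carries a filter query on the published stamp.
        public static QueryResponse query(SolrClient client, String collection,
                                          String userQuery, long currentStamp)
                throws SolrServerException, IOException {
            SolrQuery q = new SolrQuery(userQuery);
            q.addFilterQuery("load_timestamp:" + currentStamp);
            return client.query(collection, q);
        }

        // Step 4: after the new stamp has been published, purge every document
        // that does not carry it.
        public static void purgeOldLoads(SolrClient client, String collection,
                                         long currentStamp)
                throws SolrServerException, IOException {
            client.deleteByQuery(collection, "*:* -load_timestamp:" + currentStamp);
            client.commit(collection);
        }
    }

Since the stamp only changes once per load, the filter query stays cacheable between loads.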