On 11/9/2017 11:09 AM, S G wrote:
> However, re-ingestion takes several hours to complete and during that time,
> the customer has to write to both the collections - previous collection and
> the one being bootstrapped.
> This dual-write is harder to do from the client side (because client needs
> to have a retry logic to ensure any update does not succeed in one
> collection and fails in another - consistency problem) and it would be a
> real welcome addition if collection aliasing can support this.

Let me explain how I handle this situation.  I'm not running in cloud
mode, but I use the "swap" feature of CoreAdmin to do much the same
thing you're describing with collection aliases.
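
For reference, the swap is a single CoreAdmin request.  Below is a
minimal sketch that only builds the request URL.  The helper name, the
base URL, and the core names "live" and "build" are placeholders for
illustration, not an actual configuration.

```python
from urllib.parse import urlencode


def swap_url(solr_base, live_core, build_core):
    """Return the CoreAdmin SWAP URL that exchanges two cores.

    solr_base is assumed to look like "http://localhost:8983/solr".
    """
    params = {"action": "SWAP", "core": live_core, "other": build_core}
    return f"{solr_base}/admin/cores?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host and core names.
    print(swap_url("http://localhost:8983/solr", "live", "build"))
```

Sending a GET request to that URL is what performs the swap; the sketch
stops short of making the request so it can be run without a Solr server.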

My source data (MySQL database) has a way to track the last new document
that was added, as well as track which deletes have been applied, and
which documents need to be reinserted.  I use these pointers to decide
what data to retrieve on each indexing cycle, and then I update them to
new positions when the indexing cycle completes successfully.

When I do a full rebuild, I grab the current positions for new docs,
deletes, and reinserts, and store that information in a special place.
Then I start building indexes in the "build" cores.  In the meantime, I
am continuing to update all the "live" cores, so users are unaware that
anything special is happening.

When the rebuild finishes (which can take a day or more), I go to that
special place where I stored all the position information, and proceed
to run a "catchup" indexing process on the build cores -- all the
deletes, new documents, and reinserts that happened since the time the
full rebuild started.  When that completes, I swap the build cores with
the live cores, and resume normal operation.
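
The sequence above can be sketched with in-memory stand-ins.  This is a
toy model under stated assumptions, not the real indexing code: the
three position pointers are collapsed into one change-log offset for
brevity, the full rebuild is simulated by replaying the log up to the
saved offset, plain dicts stand in for cores, and all names are
hypothetical.

```python
def apply_log(index, log, start, stop=None):
    """Apply change-log entries log[start:stop] to index; return the new offset."""
    end = len(log) if stop is None else stop
    for op, doc_id, body in log[start:end]:
        if op == "delete":
            index.pop(doc_id, None)
        else:  # "add" and "reinsert" both overwrite the document
            index[doc_id] = body
    return end


def rebuild_and_swap(cores, log, live_pos, updates_during_rebuild):
    """Rebuild into a fresh build core, catch up, then swap with live."""
    saved_pos = len(log)                 # 1. store the current position
    build = {}
    apply_log(build, log, 0, saved_pos)  # 2. full rebuild (simulated)

    # Meanwhile the normal cycle keeps updating only the live core.
    log.extend(updates_during_rebuild)
    live_pos = apply_log(cores["live"], log, live_pos)

    apply_log(build, log, saved_pos)     # 3. catch-up from the saved position
    cores["live"], cores["build"] = build, cores["live"]  # 4. swap
    return live_pos


if __name__ == "__main__":
    log = [("add", 1, "a"), ("add", 2, "b")]
    cores = {"live": {}, "build": {}}
    pos = apply_log(cores["live"], log, 0)
    pos = rebuild_and_swap(cores, log, pos, [("delete", 1, None), ("add", 3, "c")])
    print(cores["live"], pos)
```

The point of the saved offset is that the catch-up step replays exactly
the changes the rebuild missed, so the normal cycle never has to write
to two indexes.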

Doing it this way, I do not need to worry about the normal indexing
cycle handling writes to both the old index and the new index -- the
ongoing cycle just updates the current live cores.

> Proposal:
> If we can enhance the write alias to point to multiple collections such that
> any update to the alias is written to all the collections it points to, it
> would help the client to avoid dual writes and also issue just a single
> http call from the client instead of multiple. It would also reduce the
> retry logic inside the client code used to keep the collections consistent.

Imagine an index with time-series data, where there is an alias called
"today" that includes up to 24 hourly collections.  If you were to write
to that alias with the idea you've proposed, the data would end up in
the wrong places and would in fact get incorrectly duplicated many times
... but the way it currently works, the writes would only go to the
FIRST collection in the alias, which can be arranged to always be the
"current" collection.
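
To make the difference concrete, here is a toy model of the two write
behaviors for that "today" alias.  It does not call Solr at all; the
function names and the hourly collection names are hypothetical, and
lists stand in for collections.

```python
def write_first_only(alias, collections, doc):
    """Behavior described above: only the first collection gets the write."""
    collections[alias[0]].append(doc)


def write_fanout(alias, collections, doc):
    """Proposed behavior: every collection in the alias gets the write."""
    for name in alias:
        collections[name].append(doc)


if __name__ == "__main__":
    # The current hour is listed first in the alias.
    alias = ["hour_23", "hour_22", "hour_21"]

    current = {name: [] for name in alias}
    write_first_only(alias, current, "event")
    print(sum(len(docs) for docs in current.values()))   # one copy

    proposed = {name: [] for name in alias}
    write_fanout(alias, proposed, "event")
    print(sum(len(docs) for docs in proposed.values()))  # one copy per collection
```

With 24 hourly collections in the alias, the fan-out version would store
24 copies of every document, 23 of them in the wrong hour.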

Your proposal is an interesting idea, but would require some development
work.  Errors during indexing could be a major source of headaches,
especially those errors that don't affect all collections in the alias
equally.  So as to not change how users expect Solr to work currently,
aliases would need a special flag to indicate that writes *should* be
duplicated to all collections in the alias, or maybe there would need to
be two different kinds of aliases.  Since such a feature is probably not
going to happen quickly even if it is something that we agree to work
on, would you be able to use something like the method that I outlined
above?

Thanks,
Shawn
