We are actually very close to doing what Shawn has suggested.

Emir has a good point about the new collection failing on deletes/updates
of older documents that are not yet present in it. But even if this
feature can only be implemented for an append-only log, it would still
make a good feature IMO.


The use-case for re-indexing everything is generally an attribute change,
like enabling "indexed" or "docValues" on a field, or adding a new field
to a schema. While the reading client-code sits behind a flag to start
using the new attribute/field, we have to re-index all the data without
stopping older-format reads. Currently, we have to either do dual writes
to the old and new collections or play catch-up-after-a-bootstrap.


Note that catch-up-after-a-bootstrap is not very easy either (it is very
similar to the approach described by Shawn). If this special place is
Kafka or some table in the DB, then we have to do dual writes to the
regular source-of-truth and this special place. Dual writes to DB and
Kafka are transaction-less (and thus lack consistency), while dual writes
to the DB increase the load on the DB.


Having created_date / modified_date fields and querying the DB to find
live-traffic documents has its own problems and is taxing on the DB again.


Dual writes directly to multiple Solr collections are the simplest for a
client to implement, and that is exactly what this new feature could be.
With a dual-write-collection-alias, the client does not have to implement
any of the above, provided the dual-write-collection-alias does the
following:

- Deletes of documents missing from the new collection are simply ignored.
- Incremental updates just throw an error for not being supported on a
dual-write-collection-alias.
- Regular updates (i.e. Delete-Then-Insert) should work just fine because
they will just treat the document as a brand new one, and versioning
strategies can take care of out-of-order updates.
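
To make the three rules above concrete, here is a rough Python sketch of
the behaviour I have in mind. It is only a toy model, not Solr code: the
collections are plain dicts, the names DualWriteAlias,
UnsupportedUpdateError and incremental_update are made up for
illustration, and the "versioning strategy" is assumed to be a
client-supplied version number where the highest version wins.

```python
class UnsupportedUpdateError(Exception):
    """Raised for incremental (partial) updates sent through the alias."""


class DualWriteAlias:
    """Toy model of the proposed alias (hypothetical, not a Solr API).

    Each collection is a dict mapping doc_id -> (version, fields).
    """

    def __init__(self, *collections):
        self.collections = collections

    def add(self, doc_id, fields, version):
        # Regular full-document update: every collection treats the
        # document as brand new. An out-of-order (older) version is
        # dropped, which is the assumed versioning strategy.
        for coll in self.collections:
            current = coll.get(doc_id)
            if current is None or version > current[0]:
                coll[doc_id] = (version, dict(fields))

    def delete(self, doc_id):
        # A delete for a document missing from a collection is ignored.
        for coll in self.collections:
            coll.pop(doc_id, None)

    def incremental_update(self, doc_id, partial_fields):
        # Partial updates need the existing document, which the new
        # collection may not have yet, so they are rejected outright.
        raise UnsupportedUpdateError(
            "incremental updates are not supported on a dual-write alias"
        )


# The old collection is fully populated; the new one is still empty.
old = {"1": (1, {"title": "a"}), "2": (1, {"title": "b"})}
new = {}
alias = DualWriteAlias(old, new)

alias.delete("1")                               # missing in new: ignored there
alias.add("2", {"title": "b2"}, version=3)      # lands in both collections
alias.add("2", {"title": "stale"}, version=2)   # out-of-order: dropped
```

Note that the sketch does not handle a delete arriving out of order with
respect to an add of the same document; that would need something like
versioned delete markers, and it is part of Emir's concern.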


SG


On Fri, Nov 10, 2017 at 6:33 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> This approach could work only if it is an append-only index. In case you
> have updates/deletes, you have to process them in order, otherwise you
> will get incorrect results. I am thinking that is one of the reasons why
> it might not be supported, since it is not too useful.
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 9 Nov 2017, at 19:09, S G <sg.online.em...@gmail.com> wrote:
> >
> > Hi,
> >
> > We have a use-case to re-create a solr-collection by re-ingesting
> > everything, without tolerating any downtime while that is happening.
> >
> > We are using the collection alias feature to point to the new
> > collection when it has been re-ingested fully.
> >
> > However, re-ingestion takes several hours to complete and during that
> > time, the customer has to write to both the collections - previous
> > collection and the one being bootstrapped.
> > This dual-write is harder to do from the client side (because client
> > needs to have a retry logic to ensure any update does not succeed in
> > one collection and fails in another - consistency problem) and it would
> > be a real welcome addition if collection aliasing can support this.
> >
> > Proposal:
> > If we can enhance the write alias to point to multiple collections such
> > that any update to the alias is written to all the collections it
> > points to, it would help the client to avoid dual writes and also issue
> > just a single http call from the client instead of multiple. It would
> > also reduce the retry logic inside the client code used to keep the
> > collections consistent.
> >
> >
> > Thanks
> > SG
>
>
