Hi John,

TRAs really do require that you index via the alias. Internally, the code wraps the Distributed Update Processor with an additional processor that handles the time routing when (and only when) the TRA alias is detected. If the alias is not used, none of the TRA code runs (by design, for performance). TRAs have no capability at all to re-assign docs once they have been indexed, since the process is data driven during update only, with no internal maintenance threads (again by design). At this time it is not even supported to change the date on which a document was routed, via atomic updates for example. One would have to delete and re-index the document (in that order, waiting for the delete to complete before re-indexing!). Adding some sort of "fixer thread" would not make much sense, since we don't ever want TRAs storing documents in the wrong place to begin with.
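As a rough, untested sketch of that delete-then-re-index sequence, something like the following Python could work. It only uses Solr's standard JSON update handler. The base URL, the alias name "coll" (guessed from the coll_2018-07-01 collection names), and the timestamp_dt field name are all assumptions for illustration, so substitute your own.

```python
import json
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # assumption: a local SolrCloud node
TRA_ALIAS = "coll"                        # assumption: the TRA alias name


def build_update_request(target, payload):
    """Build a POST to the JSON update handler of a collection or alias."""
    return urllib.request.Request(
        f"{SOLR_BASE}/{target}/update?commit=true",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_request(collection, doc_id):
    # Delete by id from the concrete collection that currently holds the doc.
    return build_update_request(collection, {"delete": {"id": doc_id}})


def add_request(doc):
    # Index through the alias so that the time-routing processor runs.
    return build_update_request(TRA_ALIAS, [doc])


def move_document(current_collection, doc):
    """Delete first, wait for the response, and only then re-index."""
    for req in (delete_request(current_collection, doc["id"]), add_request(doc)):
        # urlopen blocks until Solr answers and raises on HTTP error statuses,
        # so the add is never sent before the delete has completed.
        with urllib.request.urlopen(req) as resp:
            resp.read()


# Example with a hypothetical document and timestamp field:
# move_document("coll_2018-07-01",
#               {"id": "doc-1", "timestamp_dt": "2018-10-05T00:00:00Z"})
```

Committing on every request keeps the ordering easy to reason about but is slow; for hundreds of millions of docs you would batch the deletes and adds and commit far less often.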
TRAs are targeted at systems where new data items arrive regularly, can be placed in the right place up front, and have an immutable timestamp (typical for IoT readings and log or event based data, for example). If TRAs still sound like they fit your use case, I think you will probably need to follow up with Lucidworks to get them to add a feature that allows TRAs as targets (or pursue another solution without limitations on the indexing target).

Frankly, it's a mystery to me how you even got any docs in the October collection you list in your question. For anything to have been distributed, it would have had to go through the alias. It's also a mystery how you have more than one collection, unless perhaps you manually inserted a doc through the alias at some point and caused collection creation?

It's also worth noting that without the routing and maintenance features tied to the alias, TRAs give very little benefit, and there are other ways of solving this problem with external solutions. Dave, my co-presenter at Activate 2018, talks about a couple of other options in the middle section of our talk:

https://www.youtube.com/watch?v=RB1-7Y5NQeI&index=59&list=PLU6n9Voqu_1HW8-VavVMa9lP8-oF8Oh5t&t=0s

The part describing TRAs in detail starts at 14 min, and 17 to 23 min discusses predecessors and alternatives.

-Gus

On Tue, Nov 27, 2018 at 12:42 PM John Nashorn <nashornj...@gmail.com> wrote:

> Hello Everyone,
>
> I'm using "hive-solr" from Lucidworks to index my data into Solr (v:7.5,
> cloud mode). As written in the Solr Manual, TRA expects documents to be
> indexed using its alias name, and not directly into the collections under
> it. Unfortunately, hive-solr doesn't allow using TRA names as indexing
> targets. So what I do is: I index data using the first collection created
> by TRA and expect Solr to distribute my data into its respective collection
> under the hood. This works to some extent, but a big portion of data stays
> in where they were indexed, ie. the first collection of the TRA. For
> example (approximate numbers):
>
> * coll_2018-07-01 => 800.000.000 docs
> * coll_2018-08-01 => 0 docs
> * coll_2018-09-01 => 0 docs
> * coll_2018-10-01 => 150.000.000 docs
> * coll_2018-11-01 => 0 docs
>
> Here, coll_2018-07-01 contains data that should normally be in the other
> four collections.
>
> Is there a way to make TRA scan (somehow intentionally) misplaced data and
> send them to their correct places?
>

--
http://www.the111shift.com