Hi Gus, thanks for the detailed answer. I've added some comments inline, between the quoted parts of your post.

On 2018/11/30 05:15:10, Gus Heck <gus.h...@gmail.com> wrote:
> Hi John,
>
> TRA's really do require that you index via the alias. Internally the code
> is wrapping the Distributed Update Processor with an additional processor
> to handle the time routing when (and only when) the TRA alias is detected.
> If the alias is not used, none of the TRA code runs (by design, for
> performance). TRA's have no capability at all to re-assign docs once they
> are implemented since the process is data driven during update only, with
> no internal maintenance threads (again by design). It is not even
> supported at this time to update the date on which the document was routed
> via atomic updates for example. One would have to delete and re-index the
> document (in that order, waiting for one to complete!) Adding some sort of
> "fixer thread" is not something that would make much sense, since we don't
> want to ever have the TRA's storing documents in the wrong place to
> begin with.
>
> TRA's are targeted at systems where new data items arrive regularly, can be
> placed in the right place correctly up front and the timestamp is immutable
> (typical for IOT readings, log or event based types of data for example).
>
> I think you will probably need to follow up with Lucidworks to get them to
> add a feature to allow TRA's as targets if TRA's still sound like they fit
> your use case. (or pursue another solution without limitations on the
> indexing target)
>

I know that I'm using TRA outside its intended design, though my scenario would fit TRA perfectly if I were able to use the alias name with "hive-solr". I have reported the issue to the hive-solr devs: https://github.com/lucidworks/hive-solr/issues/63

>
> Frankly, it's a mystery to me how you even got any docs in the October
> collection you list in your question. For anything to have been
> distributed, it would have had to go through the alias.
> Also, how you have more than one collection is a mystery unless you
> manually inserted a doc at some point to cause collection creation perhaps?
>

Maybe the example confused you; I might have oversummarized it while trying to trim. Let me clarify a little:

My data ranges from 2013-01-01 to NOW and continues to grow. I've created a TRA beginning at 2013-01-01 that adds a new collection on a monthly basis. I began indexing data from last to first. Since hive-solr threw an NPE when used against the TRA name, I was sending data to an external table created for the 2013-01-01 collection. When the first document was indexed, I saw that all the collections between 2013-01-01 and 2018-10-01 were created, and the docs were indexed into 2018-10-01, then 2018-09-01, then 2018-08-01... But after some point, say 2017-02-01, this routing stopped and all documents went into the 2013-01-01 collection. I didn't manually insert any documents to cause the creation of collections.

>
> It's also worth noting that without the routing and maintenance features
> tied to the alias TRA's give very little benefit, and there are other ways
> of solving this problem with external solutions. Dave, my co-presenter at
> Activate 2018 talks about a couple of other options in the middle section
> of our talk
> https://www.youtube.com/watch?v=RB1-7Y5NQeI&index=59&list=PLU6n9Voqu_1HW8-VavVMa9lP8-oF8Oh5t&t=0s
>
> The part describing TRA's in detail starts at 14 min and 17 to 23 min
> discusses predecessors and alternatives
>
> -Gus
>
> On Tue, Nov 27, 2018 at 12:42 PM John Nashorn <nashornj...@gmail.com> wrote:
>
> > Hello Everyone,
> > I'm using "hive-solr" from Lucidworks to index my data into Solr (v:7.5,
> > cloud mode). As written in the Solr Manual, TRA expects documents to be
> > indexed using its alias name, and not directly into the collections under
> > it. Unfortunately, hive-solr doesn't allow using TRA names as indexing
> > targets.
> > So what I do is: I index data using the first collection created
> > by TRA and expect Solr to distribute my data into its respective collection
> > under the hood. This works to some extent, but a big portion of data stays
> > in where they were indexed, ie. the first collection of the TRA. For
> > example (approximate numbers):
> >
> > * coll_2018-07-01 => 800.000.000 docs
> > * coll_2018-08-01 => 0 docs
> > * coll_2018-09-01 => 0 docs
> > * coll_2018-10-01 => 150.000.000 docs
> > * coll_2018-11-01 => 0 docs
> >
> > Here, coll_2018-07-01 contains data that should normally be in the other
> > four collections.
> >
> > Is there a way to make TRA scan (somehow intentionally) misplaced data and
> > send them to their correct places?
> >
>
> --
> http://www.the111shift.com
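
Since TRA itself has no mechanism for rescanning misplaced documents, any such check would have to live outside Solr. Below is a minimal sketch in Python of what that external check might look like: it derives the monthly collection a document's timestamp belongs to and flags documents stored elsewhere. The `coll_YYYY-MM-01` naming only mirrors the listing in the original question, and `monthly_collection`, `find_misplaced`, and the sample documents are hypothetical names and data made up for illustration. It does not talk to Solr; fetching the documents, and the delete-then-re-index step described earlier in the thread, are left out.

```python
from datetime import datetime, timezone


def monthly_collection(prefix: str, ts: datetime) -> str:
    """Return the monthly bucket name for a timestamp, e.g. coll_2018-07-01.

    The naming scheme is an assumption copied from the example listing,
    not taken from Solr's implementation.
    """
    ts = ts.astimezone(timezone.utc)  # bucket on UTC month boundaries
    return f"{prefix}_{ts.year:04d}-{ts.month:02d}-01"


def find_misplaced(docs, prefix="coll"):
    """Yield (doc_id, current, expected) for docs in the wrong collection.

    `docs` is an iterable of (doc_id, timestamp, current_collection) tuples.
    """
    for doc_id, ts, current in docs:
        expected = monthly_collection(prefix, ts)
        if current != expected:
            yield doc_id, current, expected


# Made-up sample data: doc "b" has an August timestamp but sits in July.
docs = [
    ("a", datetime(2018, 7, 15, tzinfo=timezone.utc), "coll_2018-07-01"),
    ("b", datetime(2018, 8, 3, tzinfo=timezone.utc), "coll_2018-07-01"),
    ("c", datetime(2018, 10, 31, 23, 59, tzinfo=timezone.utc), "coll_2018-10-01"),
]
misplaced = list(find_misplaced(docs))
```

Each flagged document would then need to be deleted from its current collection and, once that delete has completed, re-indexed through the alias so that the routing code actually runs.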