Re: Time-Routed Alias Not Distributing Wrongly Placed Docs

John Nashorn Fri, 30 Nov 2018 06:27:02 -0800

Hi Gus, thanks for writing a detailed answer. I've added some comments between 
the quoted parts of your post.


On 2018/11/30 05:15:10, Gus Heck <gus.h...@gmail.com> wrote: 
> Hi John,
> 
> TRA's really do require that you index via the alias. Internally the code
> is wrapping the Distributed Update Processor with an additional processor
> to handle the time routing when (and only when) the TRA alias is detected.
> If the alias is not used, none of the TRA code runs (by design, for
> performance). TRA's have no capability at all to re-assign docs once they
> are implemented since the process is data driven during update only, with
> no internal maintenance threads (again by design).  It is not even
> supported at this time to update the date on which the document was routed
> via atomic updates for example. One would have to delete and re-index the
> document (in that order, waiting for one to complete!) Adding some sort of
> "fixer thread" is not something that would make much sense, since we don't
> want to ever have the TRA's storing documents in the wrong place to
> begin with.
> 
> TRA's are targeted at systems where new data items arrive regularly, can be
> placed in the right place correctly up front and the timestamp is immutable
> (typical for IOT readings, log or event based types of data for example).
> 
> I think you will probably need to follow up with Lucidworks to get them to
> add a feature to allow TRA's as targets if TRA's still sound like they fit
> your use case. (or pursue another solution without limitations on the
> indexing target)
> 

I know that I'm using TRA in a way it wasn't designed for, though my scenario 
would be a perfect fit for TRA if I were able to use the alias name with 
"hive-solr". I have reported the issue to the hive-solr devs: 
https://github.com/lucidworks/hive-solr/issues/63
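
For reference, the delete-then-re-index sequence you describe could look 
roughly like the sketch below. It only builds the two requests and sends 
nothing. The base URL, the alias name "coll", the holding collection 
"coll_2013-01-01", the field names and both function names are placeholders 
of mine, not anything taken from hive-solr or the Solr sources; the /update 
handler and the JSON delete/add bodies are the standard Solr ones.

```python
import json
from urllib.request import Request


def update_request(base_url, target, payload):
    """Build (but do not send) a JSON request for Solr's /update handler.

    `target` is either a collection name or the TRA alias name.
    """
    url = f"{base_url}/solr/{target}/update?commit=true"
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reroute_requests(base_url, alias, holding_collection, doc):
    """Return [delete, add] requests for one misplaced document.

    Assumption: the document currently sits in `holding_collection`
    and should be re-indexed through `alias` so the TRA code runs.
    """
    # Step 1: remove the misplaced copy from the collection that holds it.
    delete = update_request(base_url, holding_collection,
                            {"delete": {"id": doc["id"]}})
    # Step 2: re-index via the alias, which is where time routing happens.
    add = update_request(base_url, alias, [doc])
    return [delete, add]
```

The delete has to go out first and must complete before the add is sent, for 
example by calling urllib.request.urlopen on each request in order.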

> 
> Frankly, it's a mystery to me how you even got any docs in the October
> collection you list in your question. For anything to have been
> distributed, it would have had to go through the alias. Also, how you have
> more than one collection is a mystery unless you manually inserted a doc at
> some point to cause collection creation perhaps?
> 

Maybe the example confused you; I might have oversummarized it while 
trying to trim. Let me clarify things a little: my data ranges from 
2013-01-01 to NOW and continues to grow. I've created a TRA beginning at 
2013-01-01 that adds a new collection on a monthly basis. I began indexing 
the data from newest to oldest. Since hive-solr threw an NPE when used 
against the TRA name, I was sending data to an external table created for 
the 2013-01-01 collection. 
When the first document was indexed, I saw that all the collections between 
2013-01-01 and 2018-10-01 were created, and the docs were indexed into 
2018-10-01, then 2018-09-01, then 2018-08-01... But after some point, say 
2017-02-01, the routing stopped and all documents went into the 2013-01-01 
collection. 
I didn't manually insert any documents to cause creation of collections.
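
To make the expected placement concrete, here is a small sketch of the 
monthly bucketing I had in mind. It only illustrates where I expected each 
document to land and is not Solr's actual routing code; the function name is 
made up, and the "coll_YYYY-MM-01" naming simply mirrors the collection names 
from my original question.

```python
from datetime import datetime, timezone


def monthly_collection(alias, ts):
    """Name of the monthly collection a timestamp would be expected to fall into.

    Assumption: `ts` is a timezone-aware datetime and collections start on the
    first day of each month in UTC.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    return f"{alias}_{ts.year:04d}-{ts.month:02d}-01"
```

With this, a document stamped 2018-10-17 would be expected in coll_2018-10-01 
and one stamped 2013-01-05 in coll_2013-01-01.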

> 
> It's also worth noting that without the routing and maintenance features
> tied to the alias TRA's give very little benefit, and there are other ways
> of solving this problem with external solutions. Dave, my co-presenter at
> Activate 2018 talks about a couple of other options in the middle section
> of our talk
> https://www.youtube.com/watch?v=RB1-7Y5NQeI&index=59&list=PLU6n9Voqu_1HW8-VavVMa9lP8-oF8Oh5t&t=0s
> 
> 
> The part describing TRA's in detail starts at 14 min and 17 to 23 min
> discusses predecessors and alternatives
> 
> -Gus
> 
> On Tue, Nov 27, 2018 at 12:42 PM John Nashorn <nashornj...@gmail.com> wrote:
> 
> > Hello Everyone,
> > I'm using "hive-solr" from Lucidworks to index my data into Solr (v:7.5,
> > cloud mode). As written in the Solr Manual, TRA expects documents to be
> > indexed using its alias name, and not directly into the collections under
> > it. Unfortunately, hive-solr doesn't allow using TRA names as indexing
> > targets. So what I do is: I index data using the first collection created
> > by TRA and expect Solr to distribute my data into its respective collection
> > under the hood. This works to some extent, but a big portion of data stays
> > in where they were indexed, ie. the first collection of the TRA. For
> > example (approximate numbers):
> >
> > * coll_2018-07-01 => 800.000.000 docs
> > * coll_2018-08-01 => 0 docs
> > * coll_2018-09-01 => 0 docs
> > * coll_2018-10-01 => 150.000.000 docs
> > * coll_2018-11-01 => 0 docs
> >
> > Here, coll_2018-07-01 contains data that should normally be in the other
> > four collections.
> >
> > Is there a way to make TRA scan (somehow intentionally) misplaced data and
> > send them to their correct places?
> >
> 
> 
> -- 
> http://www.the111shift.com
>
