The complications are things like this: Say an update comes in and gets written to the tlog and indexed, but not committed. Now the leader goes down. How does the replica that takes over leadership 1> understand the current state of the index, i.e. that there are uncommitted updates, and 2> replay the updates from the tlog correctly?
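As a minimal sketch of that window (hypothetical host, collection, and field names; this assumes hard auto-commit is effectively disabled, which is not necessarily how any cluster in this thread is configured):

    # Add a document. It is written to the tlog and indexed, but until a hard
    # commit it has not been durably flushed to the Lucene segments.
    curl 'http://localhost:8983/solr/mycollection/update' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"doc-1","title_s":"in the tlog, not yet committed"}]'

    # An explicit hard commit closes that window. If the leader dies before
    # this completes, whichever replica takes over leadership has to replay
    # the tlog entry above.
    curl 'http://localhost:8983/solr/mycollection/update?commit=true'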
Not to mention that during leader election one of the read-only replicas must become a read/write replica when it takes over leadership. The current mechanism does, indeed, use Zk to elect a new leader; the devil is in the details of how in-flight updates get handled properly. There's no a priori reason all those details couldn't be worked out, it's just gnarly. Nobody has yet stepped up to commit the time/resources to work them all out. My guess is that buying a bunch more disks is cheaper than the engineering time it would take to change this.

The standard answer is "patches welcome" ;).

Best,
Erick

On Sat, Dec 9, 2017 at 1:02 PM, Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
> Ok, thanks for the answer. The leader election and update notification sound
> like they should work using ZooKeeper (leader election recipe and a normal
> watch) but I guess there are some details that make things more complicated.
>
> On 09.12.2017 20:19, Erick Erickson wrote:
>>
>> This has been bandied about on a number of occasions; it boils down to
>> nobody having stepped up to make it happen. It turns out there are a
>> number of tricky issues:
>>
>>> how does leadership change if the leader goes down?
>>> the raw complexity of getting it right. Getting it wrong corrupts indexes
>>> how do you resolve leadership in the first place so only the leader
>>> writes to the index?
>>> how would that affect performance if N replicas were autowarming at the
>>> same time, thus reading from HDFS?
>>> how do the read-only replicas know to open a new searcher?
>>> I'm sure there are a bunch more.
>>
>> So this is one of those things that everyone agrees is interesting,
>> but nobody is willing to code, and it's not actually clear that it
>> makes sense in the Solr context. It'd be a pity to put in all the work
>> and then discover that performance issues prohibit using it.
>>
>> If you _guarantee_ that the index doesn't change, there's the
>> NoLockFactory you could specify. That would allow you to share a
>> common index; woe be unto you if you start updating the index though.
>>
>> Best,
>> Erick
>>
>> On Sat, Dec 9, 2017 at 4:46 AM, Hendrik Haddorp <hendrik.hadd...@gmx.net>
>> wrote:
>>>
>>> Hi,
>>>
>>> for the HDFS case wouldn't it be nice if there was a mode in which the
>>> replicas just read the same index files as the leader? I mean, after all,
>>> the data is already on a shared readable file system, so why would one
>>> even need to replicate the transaction log files?
>>>
>>> regards,
>>> Hendrik
>>>
>>>
>>> On 08.12.2017 21:07, Erick Erickson wrote:
>>>>
>>>> bq: Will TLOG replicas use less network bandwidth?
>>>>
>>>> No, probably more bandwidth. TLOG replicas work like this:
>>>> 1> the raw docs are forwarded
>>>> 2> the old-style master/slave replication is used
>>>>
>>>> So what you do save is CPU processing on the TLOG replica in exchange
>>>> for increased bandwidth.
>>>>
>>>> Since the only thing forwarded to NRT replicas (outside of recovery)
>>>> is the raw documents, I expect that TLOG replicas would _increase_
>>>> network usage. The deal is that TLOG replicas can take over leadership
>>>> if the leader goes down, so they must have an
>>>> up-to-date-after-last-index-sync set of tlogs.
>>>>
>>>> At least that's my current understanding...
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
>>>> <joseph.obernber...@gmail.com> wrote:
>>>>>
>>>>> Anyone have any thoughts on this? Will TLOG replicas use less network
>>>>> bandwidth?
>>>>>
>>>>> -Joe
>>>>>
>>>>>
>>>>> On 12/4/2017 12:54 PM, Joe Obernberger wrote:
>>>>>>
>>>>>> Hi All - this same problem happened again, and I think I partially
>>>>>> understand what is going on. The part I don't know is what caused any
>>>>>> of the replicas to go into full recovery in the first place, but once
>>>>>> they do, they cause network interfaces on servers to become fully
>>>>>> utilized in both directions. It appears that when a Solr replica needs
>>>>>> to recover, it calls on the leader for all the data. In HDFS, the data
>>>>>> from the leader's point of view goes:
>>>>>>
>>>>>> HDFS --> Solr Leader Process --> Network --> Replica Solr Process --> HDFS
>>>>>>
>>>>>> Do I have this correct? That poor network in the middle becomes a
>>>>>> bottleneck and causes other replicas to go into recovery, which causes
>>>>>> more network traffic. Perhaps going to TLOG replicas with 7.1 would be
>>>>>> better with HDFS? Would it be possible for the leader to send a message
>>>>>> to the replica to get the data straight from HDFS instead of going from
>>>>>> one Solr process to another? HDFS would be better able to use the
>>>>>> cluster since each block has 3x replicas. Perhaps there is a better way
>>>>>> to handle replicas with a shared file system.
>>>>>>
>>>>>> Our current plan to fix the issue is to go to Solr 7.1.0 and use TLOG.
>>>>>> Good idea? Thank you!
>>>>>>
>>>>>> -Joe
>>>>>>
>>>>>>
>>>>>> On 11/22/2017 8:17 PM, Erick Erickson wrote:
>>>>>>>
>>>>>>> Hmm. This is quite possible. Any time things take "too long" it can
>>>>>>> be a problem. For instance, if the leader sends docs to a replica and
>>>>>>> the request times out, the leader throws the follower into "Leader
>>>>>>> Initiated Recovery". The smoking gun here is that there are no errors
>>>>>>> on the follower, just the notification that the leader put it into
>>>>>>> recovery.
>>>>>>>
>>>>>>> There are other variations on the theme; it all boils down to this:
>>>>>>> when communications fall apart, replicas go into recovery.....
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
>>>>>>> <joseph.obernber...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Shawn - thank you for your reply. The index is 29.9TBytes as
>>>>>>>> reported by:
>>>>>>>> hadoop fs -du -s -h /solr6.6.0
>>>>>>>> 29.9 T  89.9 T  /solr6.6.0
>>>>>>>>
>>>>>>>> The 89.9TBytes is due to HDFS having 3x replication. There are about
>>>>>>>> 1.1 billion documents indexed and we index about 2.5 million
>>>>>>>> documents per day. Assuming an even distribution, each node is
>>>>>>>> handling about 680GBytes of index, so our cache size is about 1.4%.
>>>>>>>> Perhaps 'relatively small block cache' was an understatement! This is
>>>>>>>> why we split the largest collection into two, where one is data going
>>>>>>>> back 30 days, and the other is all the data. Most of our searches do
>>>>>>>> not go back more than 30 days. The 30 day index is 2.6TBytes total.
>>>>>>>> I don't know how the HDFS block cache splits between collections, but
>>>>>>>> the 30 day index performs acceptably for our specific application.
>>>>>>>>
>>>>>>>> If we wanted to cache 50% of the index, each of our 45 nodes would
>>>>>>>> need a block cache of about 350GBytes. I'm accepting offers of DIMMs!
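As a rough sketch of the "go to Solr 7.1.0 and use TLOG" plan mentioned above (collection name, config name, and counts are placeholders, not values from this cluster):

    # Hypothetical Solr 7.x Collections API call. nrtReplicas=0 plus
    # tlogReplicas=2 makes every replica a TLOG replica: only the shard
    # leader indexes, the others copy its segments via old-style replication
    # and keep a tlog so they can take over leadership.
    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&collection.configName=myconfig&numShards=5&nrtReplicas=0&tlogReplicas=2'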
>>>>>>>>
>>>>>>>> What I believe caused our 'recovery, fail, retry loop' was that one
>>>>>>>> of our servers died. This caused HDFS to start replicating blocks
>>>>>>>> across the cluster and produced a lot of network activity. When this
>>>>>>>> happened, I believe there was high network contention for specific
>>>>>>>> nodes in the cluster; their network interfaces became pegged and
>>>>>>>> requests for HDFS blocks timed out. When that happened, SolrCloud
>>>>>>>> went into recovery, which caused more network traffic. Fun stuff.
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/22/2017 11:44 AM, Shawn Heisey wrote:
>>>>>>>>>
>>>>>>>>> On 11/22/2017 6:44 AM, Joe Obernberger wrote:
>>>>>>>>>>
>>>>>>>>>> Right now, we have a relatively small block cache due to the
>>>>>>>>>> requirement that the servers run other software. We tried to find
>>>>>>>>>> the best balance between block cache size and RAM for programs,
>>>>>>>>>> while still leaving enough for the local FS cache. This came out
>>>>>>>>>> to be 84 128MB blocks - or about 10GB for the cache per node
>>>>>>>>>> (45 nodes total).
>>>>>>>>>
>>>>>>>>> How much data is being handled on a server with 10GB allocated for
>>>>>>>>> caching HDFS data?
>>>>>>>>>
>>>>>>>>> The first message in this thread says the index size is 31TB, which
>>>>>>>>> is *enormous*. You have also said that the index takes 93TB of disk
>>>>>>>>> space. If the data is distributed somewhat evenly, then the answer
>>>>>>>>> to my question would be that each of those 45 Solr servers would be
>>>>>>>>> handling over 2TB of data. A 10GB cache is *nothing* compared to 2TB.
>>>>>>>>>
>>>>>>>>> When index data that Solr needs to access for an operation is not in
>>>>>>>>> the cache and Solr must actually wait for disk and/or network I/O,
>>>>>>>>> the resulting performance usually isn't very good. In most cases you
>>>>>>>>> don't need to have enough memory to fully cache the index data ...
>>>>>>>>> but less than half a percent is not going to be enough.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
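For reference, the HDFS block cache discussed above is sized on the HdfsDirectoryFactory. A sketch along these lines (namenode address and values illustrative, not copied from this cluster) roughly matches the 84-slab / ~10GB setup Joe describes, since each slab is about 128MB of off-heap memory:

    <!-- solrconfig.xml sketch. 84 slabs x ~128MB is roughly 10.5GB of
         off-heap cache per node, so -XX:MaxDirectMemorySize must be set
         at least that high when direct memory allocation is enabled. -->
    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <bool name="solr.hdfs.blockcache.global">true</bool>
      <int name="solr.hdfs.blockcache.slab.count">84</int>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    </directoryFactory>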