The complications are things like this:

Say an update comes in and gets written to the tlog and indexed but
not committed. Now the leader goes down. How does the replica that
takes over leadership
1> understand the current state of the index, i.e., that there are
uncommitted updates?
2> replay the updates from the tlog correctly?

Not to mention that during leader election one of the read-only
replicas must become a read/write replica when it takes over
leadership.
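
To make 1> and 2> concrete, here's a toy sketch (not Solr's actual
code; every name in it is made up) of roughly what the replica taking
over has to do:

import java.util.Comparator;
import java.util.List;
import java.util.Set;

public class TakeoverSketch {
    // Hypothetical tlog entry: an update's version plus whether it
    // made it into a commit before the old leader died.
    record TlogEntry(long version, boolean committed, String doc) {}

    static void becomeLeader(List<TlogEntry> tlog, Set<String> index) {
        tlog.stream()
            .filter(e -> !e.committed())                          // 1> find uncommitted updates
            .sorted(Comparator.comparingLong(TlogEntry::version)) // 2> replay oldest-first
            .forEach(e -> index.add(e.doc()));                    // stand-in for re-indexing
        // commit(), then flip to read/write and start taking updates
    }
}

Even the toy version shows the ordering constraint: replay has to go
by version, or a stale update can clobber a newer one.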

The current mechanism does, indeed, use ZK to elect a new leader; the
devil is in the details of how in-flight updates get handled properly.
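
The election itself really is the textbook recipe. Here's a minimal
sketch using Apache Curator's LeaderLatch (Solr's own LeaderElector
uses raw ZooKeeper ephemeral-sequential nodes rather than Curator, but
the shape is the same; the path and IDs are invented):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        // Every replica of the shard races on the same path; ZK picks one.
        LeaderLatch latch = new LeaderLatch(
                zk, "/collections/c1/shard1/election", "replica-42");
        latch.addListener(new LeaderLatchListener() {
            public void isLeader()  { /* replay tlog, go read/write */ }
            public void notLeader() { /* drop back to read-only */ }
        });
        latch.start();
    }
}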

There's no a priori reason all those details couldn't be worked out;
it's just gnarly. Nobody has yet stepped up to commit the
time/resources to work them all out. My guess is that the cost of
having a bunch more disks is cheaper than the engineering time it
would take to change this. The standard answer is "patches welcome"
;).

Best,
Erick

On Sat, Dec 9, 2017 at 1:02 PM, Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
> OK, thanks for the answer. The leader election and update notification
> sound like they should work using ZooKeeper (the leader election
> recipe and a normal watch), but I guess there are some details that
> make things more complicated.
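>
> For the "normal watch" half, a sketch of what I have in mind with the
> raw ZooKeeper client (the znode path is made up; a leader would bump
> it after each commit and the read-only replicas would reopen their
> searcher on the event):
>
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class WatchSketch {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, null);
>         // One-shot watch: must be re-registered after each event.
>         Watcher w = new Watcher() {
>             public void process(WatchedEvent e) {
>                 if (e.getType() == Event.EventType.NodeDataChanged) {
>                     // reopen the searcher, then call getData() again
>                 }
>             }
>         };
>         // hypothetical znode the leader touches after each hard commit
>         zk.getData("/collections/c1/shard1/indexVersion", w, null);
>     }
> }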
>
> On 09.12.2017 20:19, Erick Erickson wrote:
>>
>> This has been bandied about on a number of occasions; it boils down
>> to the fact that nobody has stepped up to make it happen. It turns
>> out there are a number of tricky issues:
>>
>>> how does leadership change if the leader goes down?
>>> the raw complexity of getting it right. Getting it wrong corrupts indexes.
>>> how do you resolve leadership in the first place so only the leader
>>> writes to the index?
>>> how would that affect performance if N replicas were autowarming at the
>>> same time, thus reading from HDFS?
>>> how do the read-only replicas know to open a new searcher?
>>> I'm sure there are a bunch more.
>>
>> So this is one of those things that everyone agrees is interesting,
>> but nobody is willing to code, and it's not actually clear that it
>> makes sense in the Solr context. It'd be a pity to put in all the
>> work only to discover that performance issues prohibit using it.
>>
>> If you _guarantee_ that the index doesn't change, there's the
>> NoLockFactory you could specify. That would allow you to share a
>> common index; woe be unto you if you start updating the index, though.
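>>
>> If you want to experiment with that, the knob is the lockType in
>> solrconfig.xml's indexConfig (illustrative snippet; and again,
>> read-only or bust):
>>
>> <indexConfig>
>>   <!-- "none" maps to Lucene's NoLockFactory: no write locking at all -->
>>   <lockType>none</lockType>
>> </indexConfig>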
>>
>> Best,
>> Erick
>>
>> On Sat, Dec 9, 2017 at 4:46 AM, Hendrik Haddorp <hendrik.hadd...@gmx.net>
>> wrote:
>>>
>>> Hi,
>>>
>>> for the HDFS case, wouldn't it be nice if there were a mode in which
>>> the replicas just read the same index files as the leader? I mean,
>>> after all, the data is already on a shared readable file system, so
>>> why would one even need to replicate the transaction log files?
>>>
>>> regards,
>>> Hendrik
>>>
>>>
>>> On 08.12.2017 21:07, Erick Erickson wrote:
>>>>
>>>> bq: Will TLOG replicas use less network bandwidth?
>>>>
>>>> No, probably more bandwidth. TLOG replicas work like this:
>>>> 1> the raw docs are forwarded
>>>> 2> the old-style master/slave replication is used
>>>>
>>>> So what you do save is CPU processing on the TLOG replica in exchange
>>>> for increased bandwidth.
>>>>
>>>> Since the only thing forwarded to NRT replicas (outside of recovery)
>>>> is the raw documents, while TLOG replicas get the raw documents
>>>> _and_ copy index segments, I expect that TLOG replicas would
>>>> _increase_ network usage. The deal is that TLOG replicas can take
>>>> over leadership if the leader goes down, so they must have an
>>>> up-to-date-after-last-index-sync set of tlogs.
>>>>
>>>> At least that's my current understanding...
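>>>>
>>>> For what it's worth, in 7.x you choose the replica mix when you
>>>> create the collection. A SolrJ 7.x sketch (ZK host made up):
>>>>
>>>> import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>>> import org.apache.solr.client.solrj.request.CollectionAdminRequest;
>>>>
>>>> public class CreateTlogCollection {
>>>>     public static void main(String[] args) throws Exception {
>>>>         CloudSolrClient solr =
>>>>             new CloudSolrClient.Builder().withZkHost("zk1:2181").build();
>>>>         // 1 shard; 0 NRT, 2 TLOG, 0 PULL replicas
>>>>         CollectionAdminRequest
>>>>             .createCollection("test", "_default", 1, 0, 2, 0)
>>>>             .process(solr);
>>>>         solr.close();
>>>>     }
>>>> }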
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
>>>> <joseph.obernber...@gmail.com> wrote:
>>>>>
>>>>> Anyone have any thoughts on this?  Will TLOG replicas use less network
>>>>> bandwidth?
>>>>>
>>>>> -Joe
>>>>>
>>>>>
>>>>> On 12/4/2017 12:54 PM, Joe Obernberger wrote:
>>>>>>
>>>>>> Hi All - this same problem happened again, and I think I partially
>>>>>> understand what is going on. The part I don't know is what caused
>>>>>> any of the replicas to go into full recovery in the first place,
>>>>>> but once they do, they saturate the network interfaces on servers
>>>>>> in both the in and out directions. It appears that when a Solr
>>>>>> replica needs to recover, it calls on the leader for all the data.
>>>>>> In HDFS, from the leader's point of view the data goes:
>>>>>>
>>>>>> HDFS --> Solr Leader Process --> Network --> Replica Solr Process --> HDFS
>>>>>>
>>>>>> Do I have this correct? That poor network in the middle becomes a
>>>>>> bottleneck and causes other replicas to go into recovery, which
>>>>>> causes more network traffic. Perhaps going to TLOG replicas with
>>>>>> 7.1 would be better with HDFS? Would it be possible for the leader
>>>>>> to tell the replica to get the data straight from HDFS instead of
>>>>>> routing it from one Solr process to another? HDFS would be better
>>>>>> able to spread the load across the cluster, since each block has
>>>>>> 3x replicas. Perhaps there is a better way to handle replicas on a
>>>>>> shared file system.
>>>>>>
>>>>>> Our current plan to fix the issue is to go to Solr 7.1.0 and use
>>>>>> TLOG replicas. Good idea? Thank you!
>>>>>>
>>>>>> -Joe
>>>>>>
>>>>>>
>>>>>> On 11/22/2017 8:17 PM, Erick Erickson wrote:
>>>>>>>
>>>>>>> Hmm. This is quite possible. Any time things take "too long" it
>>>>>>> can be a problem. For instance, if the leader sends docs to a
>>>>>>> replica and the request times out, the leader throws the follower
>>>>>>> into "Leader Initiated Recovery". The smoking gun here is that
>>>>>>> there are no errors on the follower, just the notification that
>>>>>>> the leader put it into recovery.
>>>>>>>
>>>>>>> There are other variations on the theme; it all boils down to
>>>>>>> this: when communications fall apart, replicas go into recovery.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
>>>>>>> <joseph.obernber...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Shawn - thank you for your reply. The index is 29.9TBytes as
>>>>>>>> reported by:
>>>>>>>>
>>>>>>>> hadoop fs -du -s -h /solr6.6.0
>>>>>>>> 29.9 T  89.9 T  /solr6.6.0
>>>>>>>>
>>>>>>>> The 89.9TBytes is due to HDFS having 3x replication. There are
>>>>>>>> about 1.1 billion documents indexed, and we index about 2.5
>>>>>>>> million documents per day. Assuming an even distribution, each
>>>>>>>> node is handling about 680GBytes of index, so our cache covers
>>>>>>>> about 1.4% of it. Perhaps 'relatively small block cache' was an
>>>>>>>> understatement! This is why we split the largest collection into
>>>>>>>> two, where one is data going back 30 days, and the other is all
>>>>>>>> the data. Most of our searches are not longer than 30 days back.
>>>>>>>> The 30-day index is 2.6TBytes total. I don't know how the HDFS
>>>>>>>> block cache splits between collections, but the 30-day index
>>>>>>>> performs acceptably for our specific application.
>>>>>>>>
>>>>>>>> If we wanted to cache 50% of the index, each of our 45 nodes
>>>>>>>> would need a block cache of about 350GBytes. I'm accepting
>>>>>>>> offers of DIMMs!
>>>>>>>>
>>>>>>>> What I believe caused our 'recovery, fail, retry loop' was that
>>>>>>>> one of our servers died. This caused HDFS to start re-replicating
>>>>>>>> blocks across the cluster, which produced a lot of network
>>>>>>>> activity. When this happened, I believe there was high network
>>>>>>>> contention for specific nodes in the cluster; their network
>>>>>>>> interfaces became pegged and requests for HDFS blocks timed out.
>>>>>>>> When that happened, SolrCloud went into recovery, which caused
>>>>>>>> more network traffic. Fun stuff.
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/22/2017 11:44 AM, Shawn Heisey wrote:
>>>>>>>>>
>>>>>>>>> On 11/22/2017 6:44 AM, Joe Obernberger wrote:
>>>>>>>>>>
>>>>>>>>>> Right now, we have a relatively small block cache due to the
>>>>>>>>>> requirement that the servers run other software. We tried to
>>>>>>>>>> find the best balance between block cache size and RAM for
>>>>>>>>>> programs, while still giving enough for the local FS cache.
>>>>>>>>>> This came out to be 84 128MB blocks - or about 10GB for the
>>>>>>>>>> cache per node (45 nodes total).
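>>>>>>>>>>
>>>>>>>>>> That sizing falls out of the HdfsDirectoryFactory settings in
>>>>>>>>>> solrconfig.xml, roughly like the following (values here are
>>>>>>>>>> illustrative; one slab is ~128MB):
>>>>>>>>>>
>>>>>>>>>> <directoryFactory name="DirectoryFactory"
>>>>>>>>>>                   class="solr.HdfsDirectoryFactory">
>>>>>>>>>>   <str name="solr.hdfs.home">hdfs://namenode:8020/solr6.6.0</str>
>>>>>>>>>>   <bool name="solr.hdfs.blockcache.enabled">true</bool>
>>>>>>>>>>   <!-- 84 slabs x ~128MB each == ~10.5GB off-heap -->
>>>>>>>>>>   <int name="solr.hdfs.blockcache.slab.count">84</int>
>>>>>>>>>>   <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
>>>>>>>>>> </directoryFactory>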
>>>>>>>>>
>>>>>>>>> How much data is being handled on a server with 10GB allocated for
>>>>>>>>> caching HDFS data?
>>>>>>>>>
>>>>>>>>> The first message in this thread says the index size is 31TB,
>>>>>>>>> which is *enormous*. You have also said that the index takes
>>>>>>>>> 93TB of disk space. If the data is distributed somewhat evenly,
>>>>>>>>> then the answer to my question would be that each of those 45
>>>>>>>>> Solr servers would be handling over 2TB of data. A 10GB cache
>>>>>>>>> is *nothing* compared to 2TB.
>>>>>>>>>
>>>>>>>>> When index data that Solr needs to access for an operation is
>>>>>>>>> not in the cache and Solr must actually wait for disk and/or
>>>>>>>>> network I/O, the resulting performance usually isn't very good.
>>>>>>>>> In most cases you don't need to have enough memory to fully
>>>>>>>>> cache the index data ... but less than half a percent is not
>>>>>>>>> going to be enough.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>>
>
