Thanks for the detailed answers, Joe. It definitely sounds like you've
covered most of the easy HDFS performance items.

Kevin Risden

On Wed, Nov 22, 2017 at 7:44 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

> Hi Kevin -
> * HDFS is part of Cloudera 5.12.0.
> * Solr is co-located with HDFS data nodes in most cases.  We do have
> several Solr nodes that run on servers that are not data nodes, but most
> are co-located. Unfortunately, our nodes are not all the same size: some
> have 8TBytes of disk, while our largest have 64TBytes.  This results in a
> lot of data that needs to go over the network.
>
> * Command is:
> /usr/lib/jvm/jre-1.8.0/bin/java -server -Xms12g -Xmx16g -Xss2m
> -XX:+UseG1GC -XX:MaxDirectMemorySize=11g -XX:+PerfDisableSharedMem
> -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=16m
> -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75
> -XX:+UseLargePages -XX:ParallelGCThreads=16 -XX:-ResizePLAB
> -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> -Xloggc:/opt/solr6/server/logs/solr_gc.log -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=300000
> -DzkHost=frodo.querymasters.com:2181,bilbo.querymasters.com:2181,gandalf.querymasters.com:2181,cordelia.querymasters.com:2181,cressida.querymasters.com:2181/solr6.6.0
> -Dsolr.log.dir=/opt/solr6/server/logs
> -Djetty.port=9100 -DSTOP.PORT=8100 -DSTOP.KEY=solrrocks -Dhost=tarvos
> -Duser.timezone=UTC -Djetty.home=/opt/solr6/server
> -Dsolr.solr.home=/opt/solr6/server/solr -Dsolr.install.dir=/opt/solr6
> -Dsolr.clustering.enabled=true -Dsolr.lock.type=hdfs
> -Dsolr.autoSoftCommit.maxTime=120000 -Dsolr.autoCommit.maxTime=1800000
> -Dsolr.solr.home=/etc/solr6
> -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
> -Xss256k -Dsolr.log.muteconsole
> -XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 9100 /opt/solr6/server/logs
> -jar start.jar --module=http
>
> * We have enabled short circuit reads.
>
> Right now, we have a relatively small block cache due to the requirement
> that the servers also run other software.  We tried to find the best
> balance between block cache size and RAM for other programs, while still
> leaving enough for the local FS cache.  This came out to be 84 slabs of
> 128M each - or about 10G for the cache per node (45 nodes total).
>
> <directoryFactory name="DirectoryFactory"
>         class="solr.HdfsDirectoryFactory">
>         <bool name="solr.hdfs.blockcache.enabled">true</bool>
>         <bool name="solr.hdfs.blockcache.global">true</bool>
>         <int name="solr.hdfs.blockcache.slab.count">84</int>
>         <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
>         <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
>         <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
>         <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
>         <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">128</int>
>         <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
>         <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr6.6.0</str>
>         <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
>     </directoryFactory>
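>
> For reference, the sizing math behind those numbers (assuming the default
> 8 KB cache block size; since direct.memory.allocation is true, the total
> has to fit inside the -XX:MaxDirectMemorySize=11g shown above):
>
>     # slab size  = blocksperbank x block size = 16384 x 8 KB  = 128 MB
>     # cache size = slab.count x slab size     = 84 x 128 MB   = ~10.5 GB
>     echo $((84 * 16384 * 8192 / 1024 / 1024))   # prints 10752 (MB)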
>
> Thanks for reviewing!
>
> -Joe
>
>
>
> On 11/22/2017 8:20 AM, Kevin Risden wrote:
>
>> Joe,
>>
>> I have a few questions about your Solr and HDFS setup that could help
>> improve the recovery performance.
>>
>> * Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
>> * Is Solr colocated with HDFS data nodes?
>> * What is the output of "ps aux | grep solr"? (specifically looking for
>> the
>> Java arguments that are being set.)
>>
>> Depending on how Solr on HDFS was set up, there are some potentially
>> simple settings that can significantly improve performance.
>>
>> 1) Short circuit reads
>>
>> If Solr is colocated with an HDFS datanode, short circuit reads can
>> improve read performance, since they skip a network hop when the data is
>> local to that node. This requires the HDFS native libraries to be added
>> to Solr.
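>>
>> A quick way to sanity check both points from the command line (standard
>> Hadoop CLI commands and standard HDFS client config keys; run these on a
>> node that hosts Solr):
>>
>>   # confirm the HDFS client config enables short circuit reads
>>   hdfs getconf -confKey dfs.client.read.shortcircuit
>>   hdfs getconf -confKey dfs.domain.socket.path
>>
>>   # confirm the native Hadoop libraries load (Solr needs them on
>>   # -Djava.library.path for short circuit reads to work)
>>   hadoop checknative -a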
>>
>> 2) HDFS block cache in Solr
>>
>> Solr without HDFS uses the OS page cache to handle caching data for
>> queries. With HDFS, Solr has a special HDFS block cache which allows for
>> caching HDFS blocks. This significantly helps query performance. There are
>> a few configuration parameters that can help here.
>>
>> Kevin Risden
>>
>> On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp
>> <hendrik.hadd...@gmx.net> wrote:
>>
>> Hi Joe,
>>>
>>> sorry, I have not seen that problem. I would normally not delete a
>>> replica when the whole shard is down, only when there is still an active
>>> replica. Without an active leader the replica should not be able to
>>> recover. I also just had a case where all replicas of a shard stayed in
>>> the down state and restarts didn't help. This was, however, also caused
>>> by lock files. Once I cleaned them up and restarted all Solr instances
>>> that had a replica, they recovered.
>>>
>>> For the lock files I discovered that the index is not always in the
>>> "index" folder but can also be in an index.<timestamp> folder. There can
>>> be
>>> an "index.properties" file in the "data" directory in HDFS and this
>>> contains the correct index folder name.
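>>>
>>> In your setup that should be something like the following (the
>>> core_node directory name is just a guess; the rest comes from your
>>> solr.hdfs.home):
>>>
>>>   # shows which index directory is current, e.g. index=index.<timestamp>
>>>   hdfs dfs -cat /solr6.6.0/UNCLASS/core_node1/data/index.properties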
>>>
>>> If you are really desperate you could also delete all but one replica so
>>> that the leader election becomes trivial. But this does of course
>>> increase the risk of losing the data entirely. So I would try looking
>>> into the code to figure out what the problem is here, and maybe compare
>>> the state in HDFS and ZK with a shard that works.
>>>
>>> regards,
>>> Hendrik
>>>
>>>
>>> On 21.11.2017 23:57, Joe Obernberger wrote:
>>>
>>> Hi Hendrik - the shards in question have three replicas.  I tried
>>>> restarting each one (one by one) - no luck.  No leader is found.  I
>>>> deleted one of the replicas and added a new one, and the new one also
>>>> shows as 'down'.  I also tried the FORCELEADER call, but that had no
>>>> effect.  I checked the OVERSEERSTATUS, but there is nothing unusual
>>>> there.  I don't see anything useful in the logs except the error:
>>>>
>>>> org.apache.solr.common.SolrException: Error getting leader from zk for shard shard21
>>>>      at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
>>>>      at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
>>>>      at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
>>>>      at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
>>>>      at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>>>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>      at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: org.apache.solr.common.SolrException: Could not get leader props
>>>>      at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
>>>>      at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
>>>>      at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
>>>>      ... 7 more
>>>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
>>>>      at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>>>>      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>>>>      at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
>>>>      at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
>>>>      at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>>>>      at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
>>>>      at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)
>>>>      ... 9 more
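>>>>
>>>> For reference, the missing znode can be inspected directly with
>>>> zkCli.sh (the /solr6.6.0 chroot comes from our zkHost setting; the
>>>> paths assume the standard SolrCloud layout, so adjust if yours differs):
>>>>
>>>>   # compare against a shard that does have a leader
>>>>   zkCli.sh -server frodo.querymasters.com:2181 \
>>>>     ls /solr6.6.0/collections/UNCLASS/leaders/shard21
>>>>   zkCli.sh -server frodo.querymasters.com:2181 \
>>>>     ls /solr6.6.0/collections/UNCLASS/leader_elect/shard21/election
>>>>   zkCli.sh -server frodo.querymasters.com:2181 \
>>>>     get /solr6.6.0/collections/UNCLASS/state.json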
>>>>
>>>> Can I modify zookeeper to force a leader?  Is there any other way to
>>>> recover from this?  Thanks very much!
>>>>
>>>> -Joe
>>>>
>>>>
>>>> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>>>>
>>>> We sometimes also have replicas not recovering.  If one replica is left
>>>>> active, the easiest fix is to delete the broken replica and create a
>>>>> new one.  When all replicas are down, it helps most of the time to
>>>>> restart one of the nodes that contains a replica in the down state.  If
>>>>> that also doesn't get the replica to recover, I would check the logs of
>>>>> that node and also those of the overseer node.  I have seen the same
>>>>> issue on Solr using local storage.  The main HDFS-related issues we
>>>>> have had so far were those lock files, and that when you delete and
>>>>> recreate collections/cores, the data is sometimes not cleaned up in
>>>>> HDFS, which then causes a conflict.
>>>>>
>>>>> Hendrik
>>>>>
>>>>> On 21.11.2017 21:07, Joe Obernberger wrote:
>>>>>
>>>>> We've never run an index this size in anything but HDFS, so I have no
>>>>>> comparison.  What we've been doing is keeping two main collections -
>>>>>> all
>>>>>> data, and the last 30 days of data.  Then we handle queries based on
>>>>>> date
>>>>>> range. The 30 day index is significantly faster.
>>>>>>
>>>>>> My main concern right now is that 6 of the 100 shards are not coming
>>>>>> back because they have no leader.  I've never seen this error before.
>>>>>> Any ideas?  CLUSTERSTATUS shows all three replicas with state 'down'.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -joe
>>>>>>
>>>>>>
>>>>>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>>>>
>>>>>> We actually also have some performance issues with HDFS at the moment.
>>>>>>> We are doing lots of soft commits for NRT search.  Those seem to be
>>>>>>> slower than with local storage.  The investigation has not gotten
>>>>>>> very far yet, however.
>>>>>>>
>>>>>>> We have a setup with 2000 collections, with one shard each and a
>>>>>>> replication factor of 2 or 3.  When we restart nodes too fast, that
>>>>>>> causes problems with the overseer queue, which can lead to the queue
>>>>>>> getting out of control and Solr pretty much dying.  We are still on
>>>>>>> Solr 6.3.  6.6 has some improvements and should handle these actions
>>>>>>> faster.  I would check what you see for
>>>>>>> "/solr/admin/collections?action=OVERSEERSTATUS&wt=json".
>>>>>>> The critical part is the "overseer_queue_size" value.  If this goes
>>>>>>> up to about 10000, it is pretty much game over on our setup.  In that
>>>>>>> case it seems to be best to stop all nodes, clear the queue in ZK,
>>>>>>> and then restart the nodes one by one with a gap of about 5 minutes.
>>>>>>> That normally recovers pretty well.
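>>>>>>>
>>>>>>> Roughly, a sketch of the check and the cleanup (host names are
>>>>>>> placeholders; "rmr" is the recursive delete in the ZooKeeper 3.4 CLI,
>>>>>>> and chrooted setups need the chroot prefix, e.g.
>>>>>>> /solr6.6.0/overseer/queue):
>>>>>>>
>>>>>>>   # watch the queue depth
>>>>>>>   curl 'http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json' \
>>>>>>>     | grep overseer_queue_size
>>>>>>>
>>>>>>>   # with all Solr nodes stopped, clear the queue
>>>>>>>   zkCli.sh -server zkhost:2181 rmr /overseer/queue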
>>>>>>>
>>>>>>> regards,
>>>>>>> Hendrik
>>>>>>>
>>>>>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>>>>>
>>>>>>> We set the hard commit time long because we were having performance
>>>>>>>> issues with HDFS, and thought that since the block size is 128M,
>>>>>>>> having a
>>>>>>>> longer hard commit made sense.  That was our hypothesis anyway.
>>>>>>>> Happy to
>>>>>>>> switch it back and see what happens.
>>>>>>>>
>>>>>>>> I don't know what caused the cluster to go into recovery in the
>>>>>>>> first
>>>>>>>> place.  We had a server die over the weekend, but it's just one out
>>>>>>>> of
>>>>>>>> ~50.  Every shard is 3x replicated (and 3x replicated in HDFS...so 9
>>>>>>>> copies).  It was at this point that we noticed lots of network
>>>>>>>> activity, with most of the shards stuck in this recover, fail, retry
>>>>>>>> loop.  That is when we decided to shut it down, resulting in zombie
>>>>>>>> lock files.
>>>>>>>>
>>>>>>>> I tried using the FORCELEADER call, which completed, but doesn't
>>>>>>>> seem
>>>>>>>> to have any effect on the shards that have no leader. Kinda out of
>>>>>>>> ideas
>>>>>>>> for that problem.  If I can get the cluster back up, I'll try a
>>>>>>>> lower hard
>>>>>>>> commit time. Thanks again Erick!
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>>>>>
>>>>>>>> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik
>>>>>>>>> ;)...
>>>>>>>>>
>>>>>>>>> I need to back up a bit. Once nodes are in this state it's not
>>>>>>>>> surprising that they need to be forcefully killed. I was more
>>>>>>>>> thinking
>>>>>>>>> about how they got in this situation in the first place. _Before_
>>>>>>>>> you
>>>>>>>>> get into the nasty state how are the Solr nodes shut down?
>>>>>>>>> Forcefully?
>>>>>>>>>
>>>>>>>>> Your hard commit is far longer than it needs to be, resulting in
>>>>>>>>> much
>>>>>>>>> larger tlog files etc. I usually set this at 15-60 seconds with
>>>>>>>>> local
>>>>>>>>> disks, not quite sure whether longer intervals are helpful on HDFS.
>>>>>>>>> What this means is that you can spend up to 30 minutes when you
>>>>>>>>> restart solr _replaying the tlogs_! If Solr is killed, it may not
>>>>>>>>> have
>>>>>>>>> had a chance to fsync the segments and may have to replay on
>>>>>>>>> startup.
>>>>>>>>> If you have openSearcher set to false, the hard commit operation is
>>>>>>>>> not horribly expensive, it just fsync's the current segments and
>>>>>>>>> opens
>>>>>>>>> new ones. It won't be a total cure, but I bet reducing this
>>>>>>>>> interval
>>>>>>>>> would help a lot.
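>>>>>>>>>
>>>>>>>>> Since your startup command already feeds these intervals in as
>>>>>>>>> system properties, the change is probably just the flag value
>>>>>>>>> (assuming your solrconfig.xml references ${solr.autoCommit.maxTime}
>>>>>>>>> with openSearcher=false, like the stock configs do):
>>>>>>>>>
>>>>>>>>>   -Dsolr.autoCommit.maxTime=60000       # hard commit every 60s (was 1800000)
>>>>>>>>>   -Dsolr.autoSoftCommit.maxTime=120000  # soft commit interval unchanged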
>>>>>>>>>
>>>>>>>>> Also, if you stop indexing there's no need to wait 30 minutes if
>>>>>>>>> you
>>>>>>>>> issue a manual commit, something like
>>>>>>>>> .../collection/update?commit=true. Just reducing the hard commit
>>>>>>>>> interval will make the wait between stopping indexing and
>>>>>>>>> restarting
>>>>>>>>> shorter all by itself if you don't want to issue the manual commit.
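>>>>>>>>>
>>>>>>>>> For example (host, port, and collection name here are just
>>>>>>>>> placeholders):
>>>>>>>>>
>>>>>>>>>   curl 'http://localhost:8983/solr/UNCLASS/update?commit=true'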
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Erick
>>>>>>>>>
>>>>>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
>>>>>>>>> <hendrik.hadd...@gmx.net> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>> the write.lock issue I see as well when Solr has not been stopped
>>>>>>>>>> gracefully.  The write.lock files are then left in HDFS, as they do
>>>>>>>>>> not get removed automatically when the client disconnects, the way
>>>>>>>>>> an ephemeral node in ZooKeeper would be.  Unfortunately, Solr also
>>>>>>>>>> does not realize that it should own the lock, even though it is
>>>>>>>>>> marked as the owner in the state stored in ZooKeeper, and it is not
>>>>>>>>>> willing to retry, which is why you need to restart the whole Solr
>>>>>>>>>> instance after the cleanup.  I added some logic to my Solr start-up
>>>>>>>>>> script which scans the lock files in HDFS, compares them with the
>>>>>>>>>> state in ZooKeeper, and then deletes all lock files that belong to
>>>>>>>>>> the node that I'm starting.
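>>>>>>>>>>
>>>>>>>>>> Stripped down, the cleanup part is essentially the following (the
>>>>>>>>>> paths assume your solr.hdfs.home of /solr6.6.0 and a guessed
>>>>>>>>>> core_node directory name; the actual script only removes locks for
>>>>>>>>>> cores that ZooKeeper says live on the node being started):
>>>>>>>>>>
>>>>>>>>>>   # find leftover lock files
>>>>>>>>>>   hdfs dfs -ls -R /solr6.6.0 | grep write.lock
>>>>>>>>>>
>>>>>>>>>>   # remove the ones for this node's cores
>>>>>>>>>>   hdfs dfs -rm /solr6.6.0/UNCLASS/core_node1/data/index*/write.lock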
>>>>>>>>>>
>>>>>>>>>> regards,
>>>>>>>>>> Hendrik
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>>>>>
>>>>>>>>>> Hi All - we have a system with 45 physical boxes running solr
>>>>>>>>>>> 6.6.1 using
>>>>>>>>>>> HDFS as the index.  The current index size is about 31TBytes.
>>>>>>>>>>> With
>>>>>>>>>>> 3x
>>>>>>>>>>> replication that takes up 93TBytes of disk. Our main collection
>>>>>>>>>>> is
>>>>>>>>>>> split
>>>>>>>>>>> across 100 shards with 3 replicas each.  The issue that we're
>>>>>>>>>>> running into
>>>>>>>>>>> is when restarting the solr6 cluster.  The shards go into
>>>>>>>>>>> recovery
>>>>>>>>>>> and start
>>>>>>>>>>> to utilize nearly all of their network interfaces. If we start
>>>>>>>>>>> too
>>>>>>>>>>> many of
>>>>>>>>>>> the nodes at once, the shards will go into a recovery, fail, and
>>>>>>>>>>> retry loop
>>>>>>>>>>> and never come up.  The errors are related to HDFS not responding
>>>>>>>>>>> fast
>>>>>>>>>>> enough and warnings from the DFSClient.  If we stop a node when
>>>>>>>>>>> this is
>>>>>>>>>>> happening, the script will force a stop (180 second timeout) and
>>>>>>>>>>> upon
>>>>>>>>>>> restart, we have lock files (write.lock) inside of HDFS.
>>>>>>>>>>>
>>>>>>>>>>> The process at this point is to start one node, find the lock
>>>>>>>>>>> files, wait for it to come up completely (hours), stop it, delete
>>>>>>>>>>> the write.lock files, and restart.  Usually this second restart is
>>>>>>>>>>> faster, but it can still take 20-60 minutes.
>>>>>>>>>>>
>>>>>>>>>>> The smaller indexes recover much faster (less than 5 minutes).
>>>>>>>>>>> Should we
>>>>>>>>>>> have not used so many replicas with HDFS?  Is there a better way
>>>>>>>>>>> we should
>>>>>>>>>>> have built the solr6 cluster?
>>>>>>>>>>>
>>>>>>>>>>> Thank you for any insight!
>>>>>>>>>>>
>>>>>>>>>>> -Joe
>>>>>>>>>>>
>
