Thanks for the detailed answers, Joe. Definitely sounds like you covered most of the easy HDFS performance items.
Kevin Risden

On Wed, Nov 22, 2017 at 7:44 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
> Hi Kevin -
>
> * HDFS is part of Cloudera 5.12.0.
>
> * Solr is co-located in most cases. We do have several nodes that run on servers that are not data nodes, but most do. Unfortunately, our nodes are not the same size. Some nodes have 8TBytes of disk, while our largest nodes are 64TBytes. This results in a lot of data that needs to go over the network.
>
> * Command is:
> /usr/lib/jvm/jre-1.8.0/bin/java -server -Xms12g -Xmx16g -Xss2m -XX:+UseG1GC
> -XX:MaxDirectMemorySize=11g -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
> -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75
> -XX:+UseLargePages -XX:ParallelGCThreads=16 -XX:-ResizePLAB -XX:+AggressiveOpts
> -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> -Xloggc:/opt/solr6/server/logs/solr_gc.log -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=300000
> -DzkHost=frodo.querymasters.com:2181,bilbo.querymasters.com:2181,gandalf.querymasters.com:2181,cordelia.querymasters.com:2181,cressida.querymasters.com:2181/solr6.6.0
> -Dsolr.log.dir=/opt/solr6/server/logs -Djetty.port=9100 -DSTOP.PORT=8100 -DSTOP.KEY=solrrocks
> -Dhost=tarvos -Duser.timezone=UTC -Djetty.home=/opt/solr6/server
> -Dsolr.solr.home=/opt/solr6/server/solr -Dsolr.install.dir=/opt/solr6
> -Dsolr.clustering.enabled=true -Dsolr.lock.type=hdfs -Dsolr.autoSoftCommit.maxTime=120000
> -Dsolr.autoCommit.maxTime=1800000 -Dsolr.solr.home=/etc/solr6
> -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -Xss256k
> -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 9100
> /opt/solr6/server/logs -jar start.jar --module=http
>
> * We have enabled short circuit reads.
>
> Right now, we have a relatively small block cache due to the requirements that the servers run other software. We tried to find the best balance between block cache size, and RAM for programs, while still giving enough for local FS cache. This came out to be 84 128M blocks - or about 10G for the cache per node (45 nodes total).
>
>    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
>      <bool name="solr.hdfs.blockcache.enabled">true</bool>
>      <bool name="solr.hdfs.blockcache.global">true</bool>
>      <int name="solr.hdfs.blockcache.slab.count">84</int>
>      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
>      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
>      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
>      <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
>      <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">128</int>
>      <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
>      <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr6.6.0</str>
>      <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
>    </directoryFactory>
>
> Thanks for reviewing!
>
> -Joe
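
As a quick sanity check of that sizing, assuming the block cache's default 8 KB internal block size (so one slab = 16384 x 8 KB = 128 MB), the direct-memory budget works out as below. This is only a rough sketch of the arithmetic, not an authoritative formula:

    # back-of-the-envelope check of the global HDFS block cache footprint
    SLABS=84              # solr.hdfs.blockcache.slab.count
    BLOCKS_PER_BANK=16384 # solr.hdfs.blockcache.blocksperbank
    BLOCK_BYTES=8192      # assumed default cache block size of 8 KB
    echo "block cache: $(( SLABS * BLOCKS_PER_BANK * BLOCK_BYTES / 1024 / 1024 )) MB"
    # prints "block cache: 10752 MB", i.e. about 10.5G, which fits under -XX:MaxDirectMemorySize=11g
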
> On 11/22/2017 8:20 AM, Kevin Risden wrote:
>> Joe,
>>
>> I have a few questions about your Solr and HDFS setup that could help improve the recovery performance.
>>
>> * Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
>> * Is Solr colocated with HDFS data nodes?
>> * What is the output of "ps aux | grep solr"? (specifically looking for the Java arguments that are being set.)
>>
>> Depending on how Solr on HDFS was set up, there are some potentially simple settings that can help significantly improve performance.
>>
>> 1) Short circuit reads
>>
>> If Solr is colocated with an HDFS datanode, short circuit reads can improve read performance since they skip a network hop if the data is local to that node. This requires HDFS native libraries to be added to Solr.
>>
>> 2) HDFS block cache in Solr
>>
>> Solr without HDFS uses the OS page cache to handle caching data for queries. With HDFS, Solr has a special HDFS block cache which allows for caching HDFS blocks. This significantly helps query performance. There are a few configuration parameters that can help here.
>>
>> Kevin Risden
>>
>> On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
>>> Hi Joe,
>>>
>>> sorry, I have not seen that problem. I would normally not delete a replica if the whole shard is down, but only if there is still an active replica. Without an active leader the replica should not be able to recover. I also just had a case where all replicas of a shard stayed in the down state and restarts didn't help. This was however also caused by lock files. Once I cleaned them up and restarted all Solr instances that had a replica, they recovered.
>>>
>>> For the lock files I discovered that the index is not always in the "index" folder but can also be in an index.<timestamp> folder. There can be an "index.properties" file in the "data" directory in HDFS, and this contains the correct index folder name.
>>>
>>> If you are really desperate you could also delete all but one replica so that the leader election is quite trivial. But this does of course increase the risk of finally losing the data quite a bit. So I would try looking into the code to figure out what the problem is here, and maybe compare the state in HDFS and ZK with a shard that works.
>>>
>>> regards,
>>> Hendrik
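
For anyone hitting the same thing, the index.properties lookup and the lock-file check can be done straight from the command line. The core path below is purely illustrative; the actual layout under solr.hdfs.home depends on your collection and core names:

    # which index.<timestamp> directory is the live one for this core?
    # (the path after /solr6.6.0 is a made-up example core)
    hdfs dfs -cat /solr6.6.0/UNCLASS/UNCLASS_shard21_replica1/data/index.properties
    # any leftover lock files? only remove one once the Solr node that owns it is stopped
    hdfs dfs -ls /solr6.6.0/UNCLASS/UNCLASS_shard21_replica1/data/index*/write.lock
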
>>> On 21.11.2017 23:57, Joe Obernberger wrote:
>>>> Hi Hendrik - the shards in question have three replicas. I tried restarting each one (one by one) - no luck. No leader is found. I deleted one of the replicas and added a new one, and the new one also shows as 'down'. I also tried the FORCELEADER call, but that had no effect. I checked the OVERSEERSTATUS, but there is nothing unusual there. I don't see anything useful in the logs except the error:
>>>>
>>>> org.apache.solr.common.SolrException: Error getting leader from zk for shard shard21
>>>>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
>>>>         at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
>>>>         at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
>>>>         at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
>>>>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>         at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: org.apache.solr.common.SolrException: Could not get leader props
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
>>>>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
>>>>         ... 7 more
>>>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
>>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>>>>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
>>>>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
>>>>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>>>>         at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)
>>>>         ... 9 more
>>>>
>>>> Can I modify zookeeper to force a leader? Is there any other way to recover from this? Thanks very much!
>>>>
>>>> -Joe
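
For reference, the znode from that NoNodeException can be inspected directly with the zkcli.sh script that ships with Solr, and FORCELEADER is the Collections API action Joe mentions. The host, port, and ZK chroot below are taken from the startup command earlier in the thread, and the zkcli.sh path assumes a standard layout under /opt/solr6:

    # does the leader znode exist for the stuck shard?
    /opt/solr6/server/scripts/cloud-scripts/zkcli.sh \
        -zkhost frodo.querymasters.com:2181/solr6.6.0 \
        -cmd get /collections/UNCLASS/leaders/shard21/leader
    # the FORCELEADER call Joe refers to, for reference
    curl "http://tarvos:9100/solr/admin/collections?action=FORCELEADER&collection=UNCLASS&shard=shard21&wt=json"
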
>>>> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>>>>> We sometimes also have replicas not recovering. If one replica is left active, the easiest is then to delete the replica and create a new one. When all replicas are down, it helps most of the time to restart one of the nodes that contains a replica in the down state. If that also doesn't get the replica to recover, I would check the logs of that node and also those of the overseer node. I have seen the same issue on Solr using local storage. The main HDFS-related issues we have had so far were those lock files, and that when you delete and recreate collections/cores it sometimes happens that the data is not cleaned up in HDFS, which then causes a conflict.
>>>>>
>>>>> Hendrik
>>>>>
>>>>> On 21.11.2017 21:07, Joe Obernberger wrote:
>>>>>> We've never run an index this size in anything but HDFS, so I have no comparison. What we've been doing is keeping two main collections - all data, and the last 30 days of data. Then we handle queries based on date range. The 30 day index is significantly faster.
>>>>>>
>>>>>> My main concern right now is that 6 of the 100 shards are not coming back because of no leader. I've never seen this error before. Any ideas? ClusterStatus shows all three replicas with state 'down'.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -joe
>>>>>>
>>>>>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>>>>> We actually also have some performance issues with HDFS at the moment. We are doing lots of soft commits for NRT search. Those seem to be slower than with local storage. The investigation is however not really far yet.
>>>>>>>
>>>>>>> We have a setup with 2000 collections, with one shard each and a replication factor of 2 or 3. When we restart nodes too fast, that causes problems with the overseer queue, which can lead to the queue getting out of control and Solr pretty much dying. We are still on Solr 6.3; 6.6 has some improvements and should handle these actions faster. I would check what you see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The critical part is the "overseer_queue_size" value. If this goes up to about 10000 it is pretty much game over on our setup. In that case it seems to be best to stop all nodes, clear the queue in ZK, and then restart the nodes one by one with a gap of about 5min. That normally recovers pretty well.
>>>>>>>
>>>>>>> regards,
>>>>>>> Hendrik
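
That check is a single request against any Solr node. Using the host and port from Joe's startup command above, it would look something like this (the grep is just one quick way to pull out the value):

    # watch the overseer queue size during a rolling restart
    curl -s "http://tarvos:9100/solr/admin/collections?action=OVERSEERSTATUS&wt=json" \
        | grep -o '"overseer_queue_size": *[0-9]*'
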
>>>>>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>>>>>> We set the hard commit time long because we were having performance issues with HDFS, and thought that since the block size is 128M, having a longer hard commit made sense. That was our hypothesis anyway. Happy to switch it back and see what happens.
>>>>>>>>
>>>>>>>> I don't know what caused the cluster to go into recovery in the first place. We had a server die over the weekend, but it's just one out of ~50. Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies). It was at this point that we noticed lots of network activity and saw most of the shards stuck in this recovery, fail, retry loop. That is when we decided to shut it down, resulting in zombie lock files.
>>>>>>>>
>>>>>>>> I tried using the FORCELEADER call, which completed, but doesn't seem to have any effect on the shards that have no leader. Kinda out of ideas for that problem. If I can get the cluster back up, I'll try a lower hard commit time. Thanks again Erick!
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>
>>>>>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>>>>>> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...
>>>>>>>>>
>>>>>>>>> I need to back up a bit. Once nodes are in this state it's not surprising that they need to be forcefully killed. I was more thinking about how they got in this situation in the first place. _Before_ you get into the nasty state, how are the Solr nodes shut down? Forcefully?
>>>>>>>>>
>>>>>>>>> Your hard commit is far longer than it needs to be, resulting in much larger tlog files etc. I usually set this at 15-60 seconds with local disks; not quite sure whether longer intervals are helpful on HDFS. What this means is that you can spend up to 30 minutes when you restart Solr _replaying the tlogs_! If Solr is killed, it may not have had a chance to fsync the segments and may have to replay on startup. If you have openSearcher set to false, the hard commit operation is not horribly expensive; it just fsyncs the current segments and opens new ones. It won't be a total cure, but I bet reducing this interval would help a lot.
>>>>>>>>>
>>>>>>>>> Also, if you stop indexing there's no need to wait 30 minutes if you issue a manual commit, something like .../collection/update?commit=true. Just reducing the hard commit interval will make the wait between stopping indexing and restarting shorter all by itself if you don't want to issue the manual commit.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Erick
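
Concretely, with the system properties from the startup command earlier in the thread, that advice might translate to something like the following. UNCLASS is just the collection name from the stack trace above, and 60 seconds is only an example interval:

    # before a planned restart: stop indexing, then issue a manual hard commit so
    # there is little or nothing left in the tlogs to replay on startup
    curl "http://tarvos:9100/solr/UNCLASS/update?commit=true"
    # and consider shrinking the hard-commit interval, e.g.
    #   -Dsolr.autoCommit.maxTime=60000    # instead of 1800000 (30 minutes)
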
>>>>>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> the write.lock issue I see as well when Solr has not been stopped gracefully. The write.lock files are then left in HDFS, as they do not get removed automatically when the client disconnects the way an ephemeral node in ZooKeeper would be. Unfortunately Solr also does not realize that it should be owning the lock, as it is marked as the owner in the state stored in ZooKeeper, and it is also not willing to retry, which is why you need to restart the whole Solr instance after the cleanup. I added some logic to my Solr start-up script which scans the lock files in HDFS, compares that with the state in ZooKeeper, and then deletes all lock files that belong to the node that I'm starting.
>>>>>>>>>>
>>>>>>>>>> regards,
>>>>>>>>>> Hendrik
>>>>>>>>>>
>>>>>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>>>>>> Hi All - we have a system with 45 physical boxes running Solr 6.6.1 using HDFS as the index. The current index size is about 31TBytes. With 3x replication that takes up 93TBytes of disk. Our main collection is split across 100 shards with 3 replicas each. The issue that we're running into is when restarting the solr6 cluster. The shards go into recovery and start to utilize nearly all of their network interfaces. If we start too many of the nodes at once, the shards will go into a recovery, fail, and retry loop and never come up. The errors are related to HDFS not responding fast enough and warnings from the DFSClient. If we stop a node when this is happening, the script will force a stop (180 second timeout) and upon restart we have lock files (write.lock) inside of HDFS.
>>>>>>>>>>>
>>>>>>>>>>> The process at this point is to start one node, find the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart. Usually this second restart is faster, but it still can take 20-60 minutes.
>>>>>>>>>>>
>>>>>>>>>>> The smaller indexes recover much faster (less than 5 minutes). Should we have not used so many replicas with HDFS? Is there a better way we should have built the solr6 cluster?
>>>>>>>>>>>
>>>>>>>>>>> Thank you for any insight!
>>>>>>>>>>>
>>>>>>>>>>> -Joe
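
A rough sketch of that manual cleanup step (and of the kind of check a start-up script could do), assuming the /solr6.6.0 HDFS home from the config earlier in the thread. The per-core path is only illustrative, and a lock should only be removed once the Solr instance that owns that core is fully stopped:

    # list leftover Lucene lock files under the Solr home in HDFS
    hdfs dfs -ls -R /solr6.6.0 | awk '/write\.lock$/ {print $8}'
    # after confirming the owning node is down, remove a specific stale lock, e.g.:
    #   hdfs dfs -rm /solr6.6.0/<collection>/<core>/data/index/write.lock
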