> > If you think about it, having a shard with 3 replicas on top of a file
> > system that does 3x replication seems a little excessive!

https://issues.apache.org/jira/browse/SOLR-6305 should help here. I can
take a look at merging the patch, since it looks like it has been helpful
to others.

Kevin Risden
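
A note on the double replication being discussed: whatever ends up in Solr
itself, the HDFS-side copies of an existing index can already be trimmed by
hand, since replication is a per-file attribute in HDFS. A rough sketch using
the Hadoop FileSystem API (the namenode URI, the /solr root, and the target
factor of 1 are placeholders, and dropping HDFS replication only makes sense
where Solr's own replicas carry the redundancy):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    // Sketch: lower the HDFS replication factor on files under the Solr index
    // root. Replication is set per file, so walk the tree; files Solr writes
    // later will still use the default dfs.replication unless that is changed too.
    public class TrimHdfsReplication {
        public static void main(String[] args) throws Exception {
            URI hdfsUri = URI.create("hdfs://namenode:8020"); // placeholder
            Path solrRoot = new Path("/solr");                // solr.hdfs.home, placeholder
            short targetReplication = 1;                      // rely on Solr replicas instead

            try (FileSystem fs = FileSystem.get(hdfsUri, new Configuration())) {
                RemoteIterator<LocatedFileStatus> it = fs.listFiles(solrRoot, true);
                while (it.hasNext()) {
                    LocatedFileStatus status = it.next();
                    if (status.isFile() && status.getReplication() != targetReplication) {
                        fs.setReplication(status.getPath(), targetReplication);
                    }
                }
            }
        }
    }
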
On Fri, Aug 2, 2019 at 10:09 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> Hi Kyle - Thank you.
>
> Our current index is split across 3 Solr collections; our largest
> collection is 26.8 TBytes (80.5 TBytes when 3x replicated in HDFS) across
> 100 shards. There are 40 machines hosting this cluster. We've found
> that when dealing with large collections, having no replicas (but lots of
> shards) ends up being more reliable, since there is a much smaller
> recovery time. We keep another 30-day index (1.4 TBytes) that does have
> replicas (40 shards, 3 replicas each), and if a node goes down, we
> manually delete lock files and then bring it back up, and yes - lots of
> network IO, but it usually recovers OK.
>
> Having a large collection like this with no replicas seems like a recipe
> for disaster. So, we've been experimenting with the latest version
> (8.2) and with an index process that splits up the data into many Solr
> collections that do have replicas, and then builds the list of
> collections to search at query time. Our searches are date based, so we
> can define which collections we want to query at query time. As a test,
> we ran just two machines, HDFS, and 500 collections. One server ran out
> of memory and crashed. We had over 1,600 lock files to delete.
>
> If you think about it, having a shard with 3 replicas on top of a file
> system that does 3x replication seems a little excessive! I'd love to
> see Solr take more advantage of a shared FS. Perhaps an idea is to use
> HDFS with an NFS gateway, though that seems like it may be slow.
> Architecturally, I love having only one large file system to manage
> instead of lots of individual file systems across many machines. HDFS
> makes this easy.
>
> -Joe
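
The query-time collection list described above maps onto Solr's standard
cross-collection request support: a request sent to any one collection can
name the full set to search in the collection parameter. A rough SolrJ
sketch, where the per-day naming scheme (events_YYYYMMDD), the ZooKeeper
hosts, and the date range are assumptions rather than details from this
thread:

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Optional;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // Sketch: build the list of per-day collections covering the requested
    // date range, then fan a single query out across all of them.
    public class DateScopedSearch {
        public static void main(String[] args) throws Exception {
            DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd");
            LocalDate from = LocalDate.of(2019, 7, 1); // placeholder range
            LocalDate to = LocalDate.of(2019, 7, 7);

            List<String> collections = new ArrayList<>();
            for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
                collections.add("events_" + fmt.format(d)); // assumed naming scheme
            }

            List<String> zkHosts = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"); // placeholders
            try (CloudSolrClient client = new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
                SolrQuery q = new SolrQuery("*:*");
                // The collection parameter takes a comma-separated list, so the
                // same query is distributed across every collection in the range.
                q.set("collection", String.join(",", collections));
                QueryResponse rsp = client.query(collections.get(0), q);
                System.out.println("Total hits: " + rsp.getResults().getNumFound());
            }
        }
    }
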
> On 8/2/2019 9:10 AM, lstusr 5u93n4 wrote:
> > Hi Joe,
> >
> > We fought with Solr on HDFS for quite some time, and faced similar issues
> > to the ones you're seeing. (See this thread, for example:
> > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201812.mbox/%3cCABd9LjTeacXpy3FFjFBkzMq6vhgu7Ptyh96+w-KC2p=-rqk...@mail.gmail.com%3e
> > )
> >
> > The Solr lock files on HDFS get deleted if the Solr server gets shut down
> > gracefully, but we couldn't always guarantee that in our environment, so we
> > ended up writing a custom startup script to search for lock files on HDFS
> > and delete them before Solr startup.
> >
> > However, the issue you mention of the Solr server rebuilding its whole
> > index from replicas on startup was enough of a show-stopper for us that we
> > switched away from HDFS to local disk. It literally made the difference
> > between 24+ hours of recovery time after an unexpected outage and less
> > than a minute...
> >
> > If you do end up finding a solution to this issue, please post it to this
> > mailing list, because there are others out there (like us!) who would most
> > definitely make use of it.
> >
> > Thanks
> >
> > Kyle
> >
> > On Fri, 2 Aug 2019 at 08:58, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
> >
> >> Thank you. No, while the cluster is using Cloudera for HDFS, we do not
> >> use Cloudera to manage the Solr cluster. If it is a
> >> configuration/architecture issue, what can I do to fix it? I'd like a
> >> system where servers can come and go, but the indexes stay available and
> >> recover automatically. Is that possible with HDFS?
> >>
> >> While adding an alias to other collections would be an option, if that
> >> collection is the only collection, or one that is currently needed in a
> >> live system, we can't bring it down, re-create it, and re-index when
> >> that process may take weeks to do.
> >>
> >> Any ideas?
> >>
> >> -Joe
> >>
> >> On 8/1/2019 6:15 PM, Angie Rabelero wrote:
> >>> I don't think you're using Cloudera or Ambari, but Ambari has an option
> >>> to delete the locks. This seems more a configuration/architecture issue
> >>> than a reliability issue. You may want to spin up an alias while you
> >>> bring down, clear locks and directories, and recreate and index the
> >>> affected collection, while you work your other issues.
> >>>
> >>> On Aug 1, 2019, at 16:40, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
> >>>
> >>> Been using Solr on HDFS for a while now, and I'm seeing an issue with
> >>> redundancy/reliability. If a server goes down, when it comes back up, it
> >>> will never recover because of the lock files in HDFS. That Solr node
> >>> needs to be brought down manually, the lock files deleted, and then
> >>> brought back up. At that point, it appears to copy all the data for its
> >>> replicas. If the index is large, and new data is being indexed, in some
> >>> cases it will never recover. The replication retries over and over.
> >>>
> >>> How can we make a reliable SolrCloud cluster when using HDFS that can
> >>> handle servers coming and going?
> >>>
> >>> Thank you!
> >>>
> >>> -Joe
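
Since the manual lock deletion and the pre-startup cleanup script come up
several times in this thread, here is a rough sketch of what that cleanup
could look like with the Hadoop FileSystem API. The namenode URI and the
/solr root are placeholders, write.lock is the usual Lucene/Solr index lock
name, and anything like this should only run while the Solr node that owns
those cores is down:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    // Sketch: before starting Solr, walk the index directories in HDFS and
    // remove any write.lock files left behind by an unclean shutdown.
    public class HdfsLockCleanup {
        public static void main(String[] args) throws Exception {
            URI hdfsUri = URI.create("hdfs://namenode:8020"); // placeholder
            Path solrRoot = new Path("/solr");                // solr.hdfs.home, placeholder

            try (FileSystem fs = FileSystem.get(hdfsUri, new Configuration())) {
                RemoteIterator<LocatedFileStatus> it = fs.listFiles(solrRoot, true);
                while (it.hasNext()) {
                    LocatedFileStatus status = it.next();
                    if (status.isFile() && "write.lock".equals(status.getPath().getName())) {
                        System.out.println("Removing stale lock: " + status.getPath());
                        fs.delete(status.getPath(), false);
                    }
                }
            }
        }
    }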