Thank you, Erick! Will look at all these suggestions.
-Vinay

On Wed, Jun 26, 2013 at 6:37 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Right, unfortunately this is a gremlin lurking in the weeds, see:
> http://wiki.apache.org/solr/DistributedSearch#Distributed_Deadlock
>
> There are a couple of ways to deal with this:
>
> 1> Go ahead and up the limit and re-compile; if you look at
> SolrCmdDistributor, the semaphore is defined there.
>
> 2> https://issues.apache.org/jira/browse/SOLR-4816 should address this as
> well as improve indexing throughput. I'm sure Joel (the guy working on
> this) would be thrilled if you were able to verify this; I'd ask him (on
> the JIRA) whether he thinks it's ready to test.
>
> 3> Reduce the number of threads you're indexing with.
>
> 4> Index docs in small packets, perhaps even one at a time, and just rack
> together a zillion threads to get throughput.
>
> FWIW,
> Erick
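Option 4> in SolrJ terms would look something like the sketch below (assuming SolrJ 4.x and CloudSolrServer; the ZooKeeper addresses, collection name, id field, thread count, and batch size are all illustrative, not tuned values):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SmallBatchIndexer {
        public static void main(String[] args) throws Exception {
            // Assumed ZooKeeper ensemble and collection name; substitute your own.
            final CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");

            final int threads = 16;    // "a zillion" scaled down for the sketch
            final int batchSize = 5;   // small packets instead of batches of 20
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            for (int t = 0; t < threads; t++) {
                final int offset = t;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                            // Each thread takes every Nth document id.
                            for (int i = offset; i < 10000000; i += threads) {
                                SolrInputDocument doc = new SolrInputDocument();
                                doc.addField("id", "doc-" + i);   // hypothetical schema
                                batch.add(doc);
                                if (batch.size() >= batchSize) {
                                    solr.add(batch);   // no explicit commit; autoCommit handles it
                                    batch.clear();
                                }
                            }
                            if (!batch.isEmpty()) solr.add(batch);
                        } catch (Exception e) {
                            e.printStackTrace();   // real code should retry or back off here
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            solr.shutdown();
        }
    }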
> On Tue, Jun 25, 2013 at 8:55 AM, Vinay Pothnis <poth...@gmail.com> wrote:
> > Jason and Scott,
> >
> > Thanks for the replies and pointers!
> > Yes, I will consider the 'maxDocs' value as well. How do I monitor the
> > transaction logs during the interval between commits?
> >
> > Thanks
> > Vinay
> >
> > On Mon, Jun 24, 2013 at 8:48 PM, Jason Hellman
> > <jhell...@innoventsolutions.com> wrote:
> >
> >> Scott,
> >>
> >> My comment was meant to be a bit tongue-in-cheek, but my intent in the
> >> statement was to represent hard failure along the lines Vinay is
> >> seeing. We're talking about OutOfMemoryException conditions, total
> >> cluster paralysis requiring restart, or other similar and disastrous
> >> conditions.
> >>
> >> Where that line is, is impossible to define generically, but trivial
> >> to find. What any of us running Solr has to achieve is a realistic
> >> simulation of our desired production load (probably well above peak)
> >> and to see what limits are reached. Armed with that information, we
> >> tweak. In this case, we look at finding the point where data ingestion
> >> reaches a natural limit. For some that may be JVM GC, for others the
> >> memory buffer size on the client load, and for yet others it may be
> >> I/O limits on multithreaded reads from a database or file system.
> >>
> >> In old Solr days we had a little less to worry about. We might play
> >> with a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate
> >> partial commits and rollback recoveries. But with 4.x we now have more
> >> durable write options and NRT to consider, and SolrCloud begs to use
> >> this. So we have to consider transaction logs, the file handles they
> >> leave open until commit operations occur, and how we want to manage
> >> writing to all cores simultaneously instead of a narrower master/slave
> >> relationship.
> >>
> >> It's all manageable, all predictable (with some load testing), and all
> >> filled with many possibilities to meet our specific needs. Considering
> >> that each person's data model, ingestion pipeline, request processors,
> >> and field analysis steps will be different, "5 threads of input" at
> >> face value doesn't really describe the whole problem. We have to
> >> measure our actual data against our expectations and find where the
> >> weak chain links are, to strengthen them. The symptoms aren't
> >> necessarily predictable in advance of this testing, but they're likely
> >> addressable and not difficult to decipher.
> >>
> >> For what it's worth, SolrCloud is new enough that we're still
> >> experiencing some "uncharted territory with unknown ramifications",
> >> but with continued dialog through channels like these there are fewer
> >> territories without good cartography :)
> >>
> >> Hope that's of use!
> >>
> >> Jason
> >>
> >> On Jun 24, 2013, at 7:12 PM, Scott Lundgren
> >> <scott.lundg...@carbonblack.com> wrote:
> >>
> >> > Jason,
> >> >
> >> > Regarding your statement "push you over the edge": what does that
> >> > mean? Does it mean "uncharted territory with unknown ramifications",
> >> > or something more like specific, known symptoms?
> >> >
> >> > I ask because our use is similar to Vinay's in some respects, and we
> >> > want to be able to push the capabilities of write performance, but
> >> > not over the edge! In particular, I am interested in knowing the
> >> > symptoms of failure, to help us troubleshoot the underlying problems
> >> > if and when they arise.
> >> >
> >> > Thanks,
> >> >
> >> > Scott
> >> >
> >> > On Monday, June 24, 2013, Jason Hellman wrote:
> >> >
> >> >> Vinay,
> >> >>
> >> >> You may wish to pay attention to how many transaction logs are
> >> >> being created along the way to your hard autoCommit, which should
> >> >> truncate the open handles for those files. I might suggest setting
> >> >> a maxDocs value in parallel with your maxTime value (you can use
> >> >> both) to ensure the commit occurs at either breakpoint. 30 seconds
> >> >> is plenty of time for 5 parallel processes of 20-document
> >> >> submissions to push you over the edge.
> >> >>
> >> >> Jason
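For reference, Jason's maxDocs-plus-maxTime suggestion maps to solrconfig.xml roughly as follows. This is a sketch: the maxTime values match the 30-second and 1-second settings discussed below, while the maxDocs figure is purely illustrative and should be tuned to the ingest rate:

    <!-- solrconfig.xml: a hard commit truncates the transaction log and
         releases its file handles; openSearcher=false keeps it cheap. -->
    <autoCommit>
      <maxTime>30000</maxTime>          <!-- 30 seconds, as in Vinay's setup -->
      <maxDocs>10000</maxDocs>          <!-- illustrative; commit fires at whichever limit hits first -->
      <openSearcher>false</openSearcher>
    </autoCommit>

    <!-- A soft commit controls visibility only; it does not truncate the tlog. -->
    <autoSoftCommit>
      <maxTime>1000</maxTime>           <!-- 1 second, as in Vinay's setup -->
    </autoSoftCommit>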
> >> >> On Jun 24, 2013, at 2:21 PM, Vinay Pothnis <poth...@gmail.com> wrote:
> >> >>
> >> >>> I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30
> >> >>> seconds.
> >> >>>
> >> >>> On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman
> >> >>> <jhell...@innoventsolutions.com> wrote:
> >> >>>
> >> >>>> Vinay,
> >> >>>>
> >> >>>> What autoCommit settings do you have for your indexing process?
> >> >>>>
> >> >>>> Jason
> >> >>>>
> >> >>>> On Jun 24, 2013, at 1:28 PM, Vinay Pothnis <poth...@gmail.com> wrote:
> >> >>>>
> >> >>>>> Here is the ulimit -a output:
> >> >>>>>
> >> >>>>> core file size          (blocks, -c) 0
> >> >>>>> data seg size           (kbytes, -d) unlimited
> >> >>>>> scheduling priority             (-e) 0
> >> >>>>> file size               (blocks, -f) unlimited
> >> >>>>> pending signals                 (-i) 179963
> >> >>>>> max locked memory       (kbytes, -l) 64
> >> >>>>> max memory size         (kbytes, -m) unlimited
> >> >>>>> open files                      (-n) 32769
> >> >>>>> pipe size            (512 bytes, -p) 8
> >> >>>>> POSIX message queues     (bytes, -q) 819200
> >> >>>>> real-time priority              (-r) 0
> >> >>>>> stack size              (kbytes, -s) 10240
> >> >>>>> cpu time               (seconds, -t) unlimited
> >> >>>>> max user processes              (-u) 140000
> >> >>>>> virtual memory          (kbytes, -v) unlimited
> >> >>>>> file locks                      (-x) unlimited
> >> >>>>>
> >> >>>>> On Mon, Jun 24, 2013 at 12:47 PM, Yago Riveiro
> >> >>>>> <yago.rive...@gmail.com> wrote:
> >> >>>>>
> >> >>>>>> Hi,
> >> >>>>>>
> >> >>>>>> I have the same issue too, and my deployment is almost exactly
> >> >>>>>> like the one here:
> >> >>>>>> http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862
> >> >>>>>>
> >> >>>>>> With some concurrency and batches of 10, Solr apparently hits
> >> >>>>>> some deadlock distributing updates.
> >> >>>>>>
> >> >>>>>> Can you dump the ulimit configuration on your servers? Some
> >> >>>>>> people had the same issues because they were reaching the
> >> >>>>>> ulimit maximums defined for file descriptors and processes.
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Yago Riveiro
> >> >>>>>> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> >> >>>>>>
> >> >>>>>> On Monday, June 24, 2013 at 7:49 PM, Vinay Pothnis wrote:
> >> >>>>>>
> >> >>>>>>> Hello All,
> >> >>>>>>>
> >> >>>>>>> I have the following setup of solr cloud:
> >> >>>>>>>
> >> >>>>>>> * solr version 4.3.1
> >> >>>>>>> * 3-node solr cloud + replication factor 2
> >> >>>>>>> * 3 zookeepers
> >> >>>>>>> * load balancer in front of the 3 solr nodes
> >> >>>>>>>
> >> >>>>>>> I am seeing this strange behavior when I am indexing a large
> >> >>>>>>> number of documents (10 mil). When I have more than 3-5
> >> >>>>>>> threads sending documents (in batches of 20) to solr,
> >> >>>>>>> sometimes solr goes into a hung state. After this, all the
> >> >>>>>>> update requests get timed out. What we see via AppDynamics (a
> >> >>>>>>> performance monitoring tool) is that there are a number of
> >> >>>>>>> threads that are stalled. The stack trace for one of the
> >> >>>>>>> threads is shown below.
> >> >>>>>>>
> >> >>>>>>> The cluster has to be restarted to recover from this. When I
> >> >>>>>>> reduce the
> >> >
> >> > --
> >> > Scott Lundgren
> >> > Director of Engineering
> >> > Carbon Black, Inc.
> >> > (210) 204-0483 | scott.lundg...@carbonblack.com
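For anyone hitting the ulimit ceilings Yago describes above: those per-user limits are typically raised in /etc/security/limits.conf for the account that runs Solr. A sketch, assuming a "solr" username and illustrative values (they take effect on the next login/session):

    # /etc/security/limits.conf -- per-user limits for the account running
    # Solr ("solr" here is an assumed username; values are illustrative).
    solr  soft  nofile  65536
    solr  hard  nofile  65536
    solr  soft  nproc   65536
    solr  hard  nproc   65536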