So, we're definitely running into some very large documents (180MB, for example). I haven't run the analysis on the other two shards yet, but this could well be our problem.
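In case it helps frame the question, here is a rough sketch of the kind of per-document size check I have in mind (the core URL, batch size, and field list are placeholders; serialized JSON length is only an approximation of what actually gets pulled onto the heap; and start/rows paging like this is slow on an index our size, so a sample may be enough):

import json
import requests

SOLR = "http://localhost:8983/solr/collection1/select"  # placeholder core URL
BATCH = 500
sizes = []  # (approx_size_in_bytes, doc_id)

start = 0
while True:
    resp = requests.get(SOLR, params={
        "q": "*:*",
        "fl": "*",            # pull all stored fields so size reflects what gets loaded
        "wt": "json",
        "rows": BATCH,
        "start": start,       # simple paging; slow for deep offsets on a big core
        "distrib": "false",   # look at this core only, not the whole collection
    }).json()
    docs = resp["response"]["docs"]
    if not docs:
        break
    for doc in docs:
        sizes.append((len(json.dumps(doc)), doc.get("id")))
    start += BATCH

sizes.sort(key=lambda t: t[0], reverse=True)
for size, doc_id in sizes[:20]:
    print(size, doc_id)

Running that against each shard's cores (with distrib=false) should show whether shard1 really is holding the monsters.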
Is there any conventional wisdom on a good "maximum size" for indexed fields? It will of course vary from system to system, but assuming a 10g heap, does anyone have past experience with limiting their field sizes? Our caches are set to 128.
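To make the question concrete, by "limiting field sizes" I mean something like the following in the indexing client, before the document is sent to Solr -- the 1MB cap and the "content" field name are made up, and the cap is exactly the number I'm asking about:

def cap_large_fields(doc, fields=("content",), limit=1000000):
    """Truncate oversized text fields before the document is posted to Solr."""
    for name in fields:
        value = doc.get(name)
        if isinstance(value, str) and len(value) > limit:
            doc[name] = value[:limit]  # truncate rather than drop, so the doc still indexes
    return doc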
On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock <jgres...@gmail.com> wrote:
> These are some good ideas. The "huge document" idea could add up, since I think the shard1 index is a little larger (32.5GB on disk instead of 31.9GB), so it is possible there are one or two really big ones that are getting loaded into memory there.
>
> Btw, I did find an article on Solr document routing (http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I don't think that our ID structure is a problem in itself. But I will follow up on the large document idea.
>
> I used this article (https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips) to find the index heap and disk usage:
> http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
>
> Though looking at the data index directory on disk basically said the same thing.
>
> I am pretty sure we're using the smart round-robining client, but I will double-check on Monday.
>
> We have been using CollectD and Graphite to monitor our VMs, as well as jvisualvm, though we haven't tried SPM.
>
> Thanks for all the ideas, guys.
>
>
> On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>
>> Hi Joe,
>>
>> Are you sure all 3 shards are roughly the same size, and how are you sure? Can you share what you run/see that shows you that?
>>
>> Are you sure queries are evenly distributed? Something like SPM <http://sematext.com/spm/> should give you insight into that.
>>
>> How big are your caches?
>>
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Sat, May 31, 2014 at 5:54 PM, Joe Gresock <jgres...@gmail.com> wrote:
>>
>> > Interesting thought about the routing. Our document ids are in 3 parts:
>> >
>> > <10-digit identifier>!<epoch timestamp>!<format>
>> >
>> > e.g., 5/12345678!130000025603!TEXT
>> >
>> > Each object has an identifier, and there may be multiple versions of the object, hence the timestamp. We like to be able to pull back all of the versions of an object at once, hence the routing scheme.
>> >
>> > The nature of the identifier is that a great many of them begin with a certain number. I'd be interested to know more about the hashing scheme used for the document routing. Perhaps the first character gives it more weight as to which shard it lands in?
>> >
>> > It seems strange that certain of the most highly-searched documents would happen to fall on this shard, but you may be onto something. We'll scrape through some non-distributed queries and see what we can find.
>> >
>> >
>> > On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> > > This is very weird.
>> > >
>> > > Are you sure that all the Java versions are identical? And all the JVM parameters are the same? Grasping at straws here.
>> > >
>> > > More grasping at straws: I'm a little suspicious that you are using routing. You say that the indexes are about the same size, but is it possible that your routing is somehow loading the problem shard abnormally? By that I mean somehow the documents on that shard are different, or have a drastically higher number of hits than the other shards?
>> > >
>> > > You can fire queries at shards with &distrib=false and NOT have them go to other shards; perhaps if you can isolate the problem queries, that might shed some light on the problem.
>> > >
>> > > Best,
>> > > er...@baffled.com
>> > >
>> > >
>> > > On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jgres...@gmail.com> wrote:
>> > >
>> > > > It has taken as little as 2 minutes to happen the last time we tried. It basically happens upon high query load (peak user hours during the day). When we reduce functionality by disabling most searches, it stabilizes. So it really is only on high query load. Our ingest rate is fairly low.
>> > > >
>> > > > It happens no matter how many nodes in the shard are up.
>> > > >
>> > > > Joe
>> > > >
>> > > >
>> > > > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>> > > >
>> > > > > When you restart, how long does it take to hit the problem? And how much query or update activity is happening in that time? Is there any other activity showing up in the log?
>> > > > >
>> > > > > If you bring up only a single node in that problematic shard, do you still see the problem?
>> > > > >
>> > > > > -- Jack Krupansky
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Joe Gresock
>> > > > > Sent: Saturday, May 31, 2014 9:34 AM
>> > > > > To: solr-user@lucene.apache.org
>> > > > > Subject: Uneven shard heap usage
>> > > > >
>> > > > > Hi folks,
>> > > > >
>> > > > > I'm trying to figure out why one shard of an evenly-distributed 3-shard cluster would suddenly start running out of heap space after 9+ months of stable performance. We're using the "!" delimiter in our ids to distribute the documents, and indeed the disk sizes of our shards are very similar (31-32GB on disk per replica).
>> > > > >
>> > > > > Our setup is:
>> > > > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so basically 2 physical CPUs), 24GB disk
>> > > > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever). We reserve 10g heap for each Solr instance.
>> > > > > Also 3 zookeeper VMs, which are very stable
>> > > > >
>> > > > > Since the troubles started, we've been monitoring all 9 with jvisualvm, and shards 2 and 3 keep a steady amount of heap space reserved, always showing horizontal lines (with some minor gc). They're using 4-5GB heap, and when we force gc using jvisualvm, they drop to 1GB usage. Shard 1, however, quickly shows a steep slope, and eventually has concurrent mode failures in the gc logs, requiring us to restart the instances when they can no longer do anything but gc.
>> > > > >
>> > > > > We've tried ruling out physical host problems by moving all 3 Shard 1 replicas to different hosts that are underutilized; however, we still get the same problem.
>> > > > > We'll still be working on ruling out infrastructure issues, but I wanted to ask the questions here in case it makes sense:
>> > > > >
>> > > > > * Does it make sense that all the replicas on one shard of a cluster would have heap problems, when the other shard replicas do not, assuming a fairly even data distribution?
>> > > > > * One thing we changed recently was to make all of our fields stored, instead of only half of them. This was to support atomic updates. Can stored fields, even though lazily loaded, cause problems like this?
>> > > > >
>> > > > > Thanks for any input,
>> > > > > Joe

--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*