-Mike
On 6/2/2014 8:38 PM, Joe Gresock wrote:
So, we were finally able to reproduce the heap overload behavior with a stress test of a query that highlighted the large fields we found. We'll have to play around with the highlighting settings, but for now we've disabled the highlighting on this query (which is a canned query that doesn't even really need highlighting), and our cluster is back to stellar performance. What we observed while debugging this was quite interesting:

* We removed all of the documents with field values > 2 MB in Shard 1 (which was causing the problems)
* When we enabled user query access again, Shard 2 fairly quickly ran out of heap space, but Shard 1 was stable!
* We then removed all documents from Shard 2 with the same criteria. When running a stress test, Shard 3 ran out of heap space and Shards 1 and 2 were stable

At this point, our stability issues are gone, but we're left wondering how best to re-ingest these documents. Currently we have this field truncated to 2 MB, which is not ideal. It seems like there's a balance between allowing more of this field to be searchable vs. providing the most highlighted results. I wonder if anyone can recommend some of the relevant highlighting parameters that might allow us to turn highlighting back on for this field. I'd say probably only 100-200 documents have field values as large as this.

Joe

On Mon, Jun 2, 2014 at 10:44 AM, Erick Erickson <erickerick...@gmail.com> wrote:

Joe:

One thing to add: if you're returning that doc (or perhaps even some fields, this bit is still something of a mystery to me), then the whole 180MB may be being decompressed. Since 4.1, stored fields have been compressed on disk by default. That said, this is only true if the docs in question are returned as part of the result set.

Adding &distrib=false to the URL and pinging only that shard should let you focus on only this shard....

Best,
Erick

On Mon, Jun 2, 2014 at 4:27 AM, Michael Sokolov <msoko...@safaribooksonline.com> wrote:

Joe - there shouldn't really be a problem *indexing* these fields: remember that all the terms are spread across the index, so there is really no storage difference between one 180MB document and 180 1MB documents from an indexing perspective. Making the field "stored" is more likely to lead to a problem, although it's still a bit of a mystery exactly what's going on. Do they need to be stored? For example: do you highlight the entire field?

Still, 180MB shouldn't necessarily lead to heap space problems, but one thing you could play with is reducing the cache sizes on that node: if you had very large (in terms of numbers of documents) caches, and a lot of the documents were big, that could lead to heap problems. But this is all just guessing.

-Mike

On 6/2/2014 6:13 AM, Joe Gresock wrote:

And the follow-up question would be: if some of these documents are legitimately this large (they really do have that much text), is there a good way to still allow that to be searchable and not explode our index? These would be "text_en" type fields.

On Mon, Jun 2, 2014 at 6:09 AM, Joe Gresock <jgres...@gmail.com> wrote:

So, we're definitely running into some very large documents (180MB, for example). I haven't run the analysis on the other 2 shards yet, but this could definitely be our problem.

Is there any conventional wisdom on a good "maximum size" for your indexed fields? Of course it will vary for each system, but assuming a heap of 10g, does anyone have past experience in limiting their field sizes? Our caches are set to 128.
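Coming back to the highlighting question at the top of the thread: below is a minimal sketch of the kind of request-side cap we could try, assuming the standard highlighter. The collection, query, and field names are placeholders, and 10240 is only an illustrative value (hl.maxAnalyzedChars defaults to 51200). Note this only limits how much of each field value the highlighter analyzes; the full stored value may still be read.

import requests

# Hypothetical collection, query, and field names -- adjust to the real schema.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

params = {
    "q": "large_text_field:something",
    "wt": "json",
    "rows": 10,
    "hl": "true",
    "hl.fl": "large_text_field",
    # Cap how many characters of each field value the highlighter analyzes.
    # Solr's documented default is 51200; 10240 here is only an illustration.
    "hl.maxAnalyzedChars": 10240,
}

resp = requests.get(SOLR_SELECT, params=params, timeout=30)
resp.raise_for_status()
print(resp.json().get("highlighting", {}))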
On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock <jgres...@gmail.com> wrote:

These are some good ideas. The "huge document" idea could add up, since I think the shard1 index is a little larger (32.5GB on disk instead of 31.9GB), so it is possible there's one or 2 really big ones that are getting loaded into memory there.

Btw, I did find an article on the Solr document routing (http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I don't think that our ID structure is a problem in itself. But I will follow up on the large document idea.

I used this article (https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips) to find the index heap and disk usage:
http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
Though looking at the data index directory on disk basically said the same thing.

I am pretty sure we're using the smart round-robining client, but I will double check on Monday. We have been using CollectD and graphite to monitor our VMs, as well as jvisualvm, though we haven't tried SPM.

Thanks for all the ideas, guys.

On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

Hi Joe,

Are you/how are you sure all 3 shards are roughly the same size? Can you share what you run/see that shows you that?

Are you sure queries are evenly distributed? Something like SPM <http://sematext.com/spm/> should give you insight into that.

How big are your caches?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Sat, May 31, 2014 at 5:54 PM, Joe Gresock <jgres...@gmail.com> wrote:

Interesting thought about the routing. Our document ids are in 3 parts:

<10-digit identifier>!<epoch timestamp>!<format>

e.g., 5/12345678!130000025603!TEXT

Each object has an identifier, and there may be multiple versions of the object, hence the timestamp. We like to be able to pull back all of the versions of an object at once, hence the routing scheme.

The nature of the identifier is that a great many of them begin with a certain number. I'd be interested to know more about the hashing scheme used for the document routing. Perhaps the first character gives it more weight as to which shard it lands in?

It seems strange that certain of the most highly-searched documents would happen to fall on this shard, but you may be onto something. We'll scrape through some non-distributed queries and see what we can find.

On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <erickerick...@gmail.com> wrote:

This is very weird.

Are you sure that all the Java versions are identical? And all the JVM parameters are the same? Grasping at straws here.

More grasping at straws: I'm a little suspicious that you are using routing. You say that the indexes are about the same size, but is it possible that your routing is somehow loading the problem shard abnormally? By that I mean, somehow the documents on that shard are different, or have a drastically higher number of hits than the other shards?

You can fire queries at shards with &distrib=false and NOT have them go to other shards; perhaps if you can isolate the problem queries, that might shed some light on the problem.

Best,
er...@baffled.com
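A minimal sketch of that kind of non-distributed probe against a single core, assuming Python with the requests library; the core name and query below are placeholders:

import requests

# Hypothetical core name and query -- point this at one replica of the
# suspect shard; distrib=false keeps the query from fanning out to other shards.
url = "http://localhost:8983/solr/collection1_shard1_replica1/select"

params = {
    "q": "large_text_field:something",
    "wt": "json",
    "rows": 0,
    "distrib": "false",
}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
print(resp.json()["response"]["numFound"])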
On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jgres...@gmail.com> wrote:

It has taken as little as 2 minutes to happen the last time we tried. It basically happens upon high query load (peak user hours during the day). When we reduce functionality by disabling most searches, it stabilizes. So it really is only on high query load. Our ingest rate is fairly low. It happens no matter how many nodes in the shard are up.

Joe

On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <j...@basetechnology.com> wrote:

When you restart, how long does it take to hit the problem? And how much query or update activity is happening in that time? Is there any other activity showing up in the log?

If you bring up only a single node in that problematic shard, do you still see the problem?

-- Jack Krupansky

-----Original Message-----
From: Joe Gresock
Sent: Saturday, May 31, 2014 9:34 AM
To: solr-user@lucene.apache.org
Subject: Uneven shard heap usage

Hi folks,

I'm trying to figure out why one shard of an evenly-distributed 3-shard cluster would suddenly start running out of heap space, after 9+ months of stable performance. We're using the "!" delimiter in our ids to distribute the documents, and indeed the disk sizes of our shards are very similar (31-32GB on disk per replica).

Our setup is:
9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so basically 2 physical CPUs), 24GB disk
3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever). We reserve 10g heap for each solr instance.
Also 3 zookeeper VMs, which are very stable

Since the troubles started, we've been monitoring all 9 with jvisualvm, and shards 2 and 3 keep a steady amount of heap space reserved, always having horizontal lines (with some minor gc). They're using 4-5GB heap, and when we force gc using jvisualvm, they drop to 1GB usage. Shard 1, however, quickly has a steep slope, and eventually has concurrent mode failures in the gc logs, requiring us to restart the instances when they can no longer do anything but gc.

We've tried ruling out physical host problems by moving all 3 Shard 1 replicas to different hosts that are underutilized, however we still get the same problem. We'll still be working on ruling out infrastructure issues, but I wanted to ask the questions here in case it makes sense:

* Does it make sense that all the replicas on one shard of a cluster would have heap problems, when the other shard replicas do not, assuming a fairly even data distribution?
* One thing we changed recently was to make all of our fields stored, instead of only half of them. This was to support atomic updates. Can stored fields, even though lazily loaded, cause problems like this?

Thanks for any input,
Joe

--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
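For context on that last point: an atomic update sends only the changed fields, and Solr rebuilds the rest of the document from its stored values before reindexing, which is why every field had to become stored. A rough sketch of such an update, assuming Python with requests; the collection and field name are made up, and the id just follows the example format quoted earlier in the thread:

import requests

# Hypothetical collection and field name; the id follows the example
# id format from earlier in the thread.
url = "http://localhost:8983/solr/collection1/update"

doc = [{
    "id": "5/12345678!130000025603!TEXT",
    "some_field": {"set": "new value"},   # only this field is sent
}]

resp = requests.post(url, json=doc, params={"commit": "true", "wt": "json"}, timeout=30)
resp.raise_for_status()
print(resp.json())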