On 2/13/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Yes, sorting by fields does take up memory (the fieldcache). 256M is
> pretty small for a 5M doc index. If you have any more memory slots,
> spring for some more memory (a little over $100 for 1GB).
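A rough back-of-envelope sketch of why a 5M-doc fieldcache strains a 256M heap (not from the thread; the 4-bytes-per-entry figure is an assumption for a simple int-like sort field, and string sort fields cost considerably more):

```python
# Approximate Lucene FieldCache memory for one sorted field.
# Assumes one cache entry per document; bytes_per_entry is a guess
# (ints ~4 bytes, strings much more, so treat this as a lower bound).

def fieldcache_bytes(num_docs, bytes_per_entry):
    """Approximate FieldCache size in bytes for one sorted field."""
    return num_docs * bytes_per_entry

docs = 5_000_000
one_int_field = fieldcache_bytes(docs, 4)
print(one_int_field // (1024 * 1024), "MB")  # prints: 19 MB
```

Even at ~19 MB per simple sort field, a few sorted fields plus the rest of the index machinery eats into a 256M heap quickly; string-based sorts are far heavier.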
Yeah, I'll see if I can give solr a bit more.
> Lucene also likes to have free memory left over for the OS cache -
> otherwise searches start to be limited by disk bandwidth... not a good
> thing.
> To try and lessen the memory used by the Lucene FieldCache, you might
> try lowering the mergeFactor of the index (see solrconfig.xml). This
> will cause more merges, slowing indexing, but it will squeeze out
> deleted documents faster. Also, try to optimize as often as possible
> (nightly?) for the same reasons.
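For reference, a hypothetical solrconfig.xml fragment along these lines (the value 4 is an illustrative assumption, not a recommendation from the thread; the element sits among the index settings in solrconfig.xml):

```xml
<!-- solrconfig.xml (hypothetical value): a lower mergeFactor means
     more frequent merges and slower indexing, but deleted documents
     get squeezed out of the index sooner -->
<mergeFactor>4</mergeFactor>
```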
Ah, I don't know if I mentioned it, but we're already optimizing nightly, when impressions are at their lowest. So I'll lower the mergeFactor and re-load all of the docs to see if that helps us out. I believe I left it high when we were tuning the initial load of ~4M docs, before we realized that batching them into groups of 1000 per commit (instead of add, commit, add, commit, etc.) was a more efficient way of doing it. As it stands, loading ~600 docs takes about 2 seconds, so if it takes 15 seconds, I won't complain. :) Thanks for the tips. - Ian
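The batching idea above can be sketched as follows. This is an illustrative Python sketch, not the list members' PHP code; the endpoint URL is an assumed default, and it targets Solr's XML update message format (many docs per `<add>`, one `<commit/>` at the end rather than one per document):

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Assumed endpoint; adjust host/port/path for your deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def build_add_xml(docs):
    """Build one <add> message holding a whole batch of documents,
    so a single POST covers many docs instead of one POST per doc."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append("<field name=%s>%s</field>"
                         % (quoteattr(str(name)), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post(xml):
    """POST an update message to Solr."""
    req = urllib.request.Request(SOLR_UPDATE_URL,
                                 data=xml.encode("utf-8"),
                                 headers={"Content-Type": "text/xml"})
    urllib.request.urlopen(req)

def index_in_batches(docs, batch_size=1000):
    """Add docs in batches, then issue a single commit at the end
    (instead of the add, commit, add, commit pattern)."""
    for i in range(0, len(docs), batch_size):
        post(build_add_xml(docs[i:i + batch_size]))
    post("<commit/>")
```

The win comes from amortizing both the HTTP round trips and, more importantly, the commits: each commit forces index work, so one commit per batch run is far cheaper than one per document.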
-Yonik

On 2/13/07, Ian Meyer <[EMAIL PROTECTED]> wrote:
> All,
>
> I'm having some performance issues with solr. I will give some
> background on our setup and implementation of solr. I'm completely
> open to reworking everything if the way we are currently doing things
> is not optimal. I'll try to be as verbose as I can in explaining all
> of this, but feel free to ask more questions if something doesn't make
> sense.
>
> Firstly, we have three messageboards of varying traffic, totaling
> about 225K hits per day. Search is used maybe 500 times a day. Each
> board has its two instances of solr, with Tomcat as the container,
> loaded via JNDI. One instance is for topics, one instance for the
> posts themselves. I feel as though this may not be optimal, but I
> can't think of a better way to handle this. After reading the schema,
> maybe someone will have some better ideas. We use PHP to interface
> with solr, and we do some sorting on relevance and on the date, and my
> thought was that could be causing solr to run out of memory.
>
> The boards are bco, vlv and wbc. I'll list the number of docs for each
> below along with how many are added per day.
>
> bco (topics): 180,530 (~200 added daily)
> bco (posts): 3,961,053 (~5,000 added daily)
> vlv (topics): 3,817 (~200 added daily)
> vlv (posts): 84,005 (~7,000 added daily)
> wbc (topics): 29,603 (~50 added daily)
> wbc (posts): 739,660 (~1,000 added daily)
>
> total: ~5 million total docs, with ~13.5K added per day.
>
> We add docs at :00 for bco, :20 for wbc, :40 for vlv. We feel an hour
> is a good enough amount of time that results aren't lagged too much.
> The add process is fast, as is the commit, and I'm more than impressed
> with solr's ability to handle the load it does.
>
> The server hardware is 4GB memory, 1 dual-core 2GHz Opteron, RAID 10
> SATA. The machine runs PostgreSQL, PHP and Apache. I feel that this
> isn't optimal either, but the cost of buying another server to
> separate out either the solr or Postgres component is too great right
> now. Most of the errors I see are the JVM running out of heap space.
> The JVM is set to use the default max heap size (256m, I think?). I
> can't increase it too much, because Postgres needs as much memory as
> it can get so the databases will still reside in memory.
>
> My first implementation of search for these sites was with PyLucene,
> and while that was fast, there was some sort of bug where docs I added
> to the index wouldn't show up until I optimized the index. That
> optimize eventually just ate up too much CPU and hosed the server
> while it ran, taking upwards of 2 hours at 99% CPU, and that's just no
> good. :)
>
> When I set up solr, I had cache warming enabled, and that also caused
> the server to choke way too soon. So I turned that off, and that
> seemed to hold things off for a while.
>
> I've attached the schemas and configs to this email so you can see how
> we have things set up. Every site is the same (config-wise), so just
> the names are different. It's relatively simple, and I feel like the
> JVM shouldn't be choking so soon, but, who knows. :)
>
> One thought we had was having two instances of solr, with a board_id
> field and the id field as the unique id, but I wasn't sure if solr
> supported compound unique ids. If not, that would make that solution
> moot.
>
> Hopefully this makes sense, but if not, ask me for clarification on
> whatever is unclear.
>
> Thanks in advance for your help and suggestions!
> Ian
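For anyone in a similar bind: when there is room to raise the heap at all, the usual Tomcat-side knob is CATALINA_OPTS. The values below are hypothetical, and on a shared 4GB box like this one they must leave headroom for Postgres and the OS page cache:

```shell
# tomcat/bin/setenv.sh -- hypothetical values, not a recommendation
# from the thread; -Xmx must leave memory free for PostgreSQL and
# for the OS cache that Lucene relies on for search speed.
CATALINA_OPTS="-Xms256m -Xmx512m"
export CATALINA_OPTS
```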