BTW, what is NRT?

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Peter Sturge <peter.stu...@gmail.com> wrote:

From: Peter Sturge <peter.stu...@gmail.com>
Subject: Re: Tuning Solr caches with high commit rates (NRT)
To: solr-user@lucene.apache.org
Date: Friday, September 17, 2010, 2:18 AM

Hi,

It's great to see such a fantastic response to this thread - NRT is
alive and well!

I'm hoping to collate this information and add it to the wiki when I
get a few free cycles (thanks Erik for the heads up).

In the meantime, I thought I'd add a few tidbits of additional
information that might prove useful:

1. The first thing to note is that the techniques/setup described in
this thread don't fix the underlying potential for OutOfMemory errors -
an index can always grow large enough to ask its JVM for more memory
than is available for cache. These techniques do, however, mitigate the
risk, and provide an efficient balance between memory use and search
performance. There are some interesting discussions going on for both
Lucene and Solr regarding the '2 pounds of baloney into a 1 pound bag'
problem of unbounded caches, with a number of interesting strategies.
One strategy that I like, but haven't found in the discussion lists, is
auto-limiting cache size/warming based on available resources (similar
to the way file system caches use free memory). This would allow caches
to adjust to their memory environment as indexes grow.

2. A note regarding lockType in solrconfig.xml for dual Solr instances:
it's best not to use 'none' as a value for lockType - this sets the
lockType to null, and as the source comments note, this is a recipe for
disaster - so use 'simple' instead (see the config sketch after this
list).

3. Chris mentioned setting maxWarmingSearchers to 1 as a way of
minimizing the number of onDeckSearchers. This is a prudent move -
thanks Chris for bringing this up!
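For reference, the relevant solrconfig.xml entries look something like
this (a minimal sketch based on the stock 1.4/3.x config layout -
adjust to your own file):

    <!-- in the <mainIndex> (and <indexDefaults>) section: -->
    <lockType>simple</lockType>

    <!-- in the <query> section, per Chris's suggestion: -->
    <maxWarmingSearchers>1</maxWarmingSearchers>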
All the best,
Peter

On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich <peat...@yahoo.de> wrote:

Peter Sturge,

this was a nice hint, thanks again! If you are here in Germany anytime,
I can invite you to a beer or an apfelschorle! :-)
I only needed to change the lockType to none in the solrconfig.xml,
disable the replication, and set the data dir to the master data dir!

Regards,
Peter Karich.

Earlier in the thread, Peter Karich had written:

Hi Peter,

this scenario would be really great for us - I didn't know that this is
possible and works, so: thanks!
At the moment we are doing something similar by replicating to the
read-only instance, but the replication is somewhat lengthy and
resource-intensive at this data volume ;-)

Regards,
Peter.

That was in reply to this note from Peter Sturge:

1. You can run multiple Solr instances in separate JVMs, with both
having their solr.xml configured to use the same index folder.
You need to be careful that one and only one of these instances will
ever update the index at a time. The best way to ensure this is to use
one instance for writing only, while the other is read-only and never
writes to the index. This read-only instance is the one to tune for
high search performance. Even though the RO instance doesn't write to
the index, it still needs periodic (albeit empty) commits to kick off
autowarming/cache refresh.
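As a rough sketch of that refresh (the host/port below are assumptions -
a default single-core setup on port 8983): an empty commit can be as
simple as POSTing the following to the read-only instance's update
handler, e.g. from a cron job:

    <!-- POST to http://localhost:8983/solr/update
         with Content-Type: text/xml -->
    <commit/>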
Depending on your needs, you might not need to have 2 separate
instances. We need it because the 'write' instance is also doing a lot
of metadata pre-write operations in the same JVM as Solr, and so has
its own memory requirements.

2. We use sharding all the time, and it works just fine with this
scenario, as the RO instance is simply another shard in the pack.

On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich <peat...@yahoo.de> wrote:

Peter,

thanks a lot for your in-depth explanations!
Your findings will definitely be helpful for my next performance
improvement tests :-)

Two questions:

1. How would I do that:

> or a local read-only instance that reads the same core as the
> indexing instance (for the latter, you'll need something that
> periodically refreshes - i.e. runs commit()).

2. Did you try sharding with your current setup (e.g. one big,
nearly-static index and a tiny write+read index)?

Regards,
Peter.

Those questions were in response to Peter Sturge's original notes:

Hi,

Below are some notes regarding Solr cache tuning that should prove
useful for anyone who uses Solr with frequent commits (e.g. <5min).

Environment:
Solr 1.4.1 or branch_3x trunk.
Note: the 4.x trunk has lots of neat new features, so the notes here
are likely less relevant to the 4.x environment.

Overview:
Our Solr environment makes extensive use of faceting, we perform
commits every 30secs, and the indexes tend to be on the large-ish side
(>20million docs).
Note: for our data, when we commit we are always adding new data,
never changing existing data.
This type of environment can be tricky to tune, as Solr is more geared
toward fast reads than frequent writes.

Symptoms:
If you have used faceting in searches where you are also performing
frequent commits, you've likely encountered the dreaded OutOfMemory or
GC Overhead Limit Exceeded errors.
In high commit rate environments, this is almost always due to
multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
finish autowarming their caches before the next commit() comes along
and invalidates them.
Once this starts happening on a regular basis, it is likely your
Solr's JVM will eventually run out of memory, as the number of
searchers (and their cache arrays) will keep growing until the JVM
dies of thirst.
To check if your Solr environment is suffering from this, turn on INFO
level logging and look for: 'PERFORMANCE WARNING: Overlapping
onDeckSearchers=x'.

In tests, we've only ever seen this problem when using faceting with
facet.method=fc.

Some solutions to this are:
    Reduce the commit rate to allow searchers to fully warm before the
    next commit
    Reduce or eliminate the autowarming in caches
    Both of the above

The trouble is, if you're doing NRT commits, you likely have a good
reason for it, and reducing/eliminating autowarming will very
significantly impact search performance in high commit rate
environments.

Solution:
Here are some setup steps we've used that allow lots of faceting (we
typically search with at least 20-35 different facet fields, plus date
faceting/sorting) on large indexes, and still keep decent search
performance:

1. Firstly, you should consider using the enum method for facet
searches (facet.method=enum) unless you've got A LOT of memory on your
machine. In our tests, this method uses a lot less memory and
autowarms more quickly than fc. (Note: I've not tried the new
segment-based 'fcs' option, as I can't find support for it in
branch_3x - it looks nice for 4.x though.)
Admittedly, for our data, enum is not quite as fast for searching as
fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
tradeoff.
If you do have access to LOTS of memory, AND you can guarantee that
the index won't grow beyond the memory capacity (i.e. you have some
sort of deletion policy in place), fc can be a lot faster than enum
when searching with lots of facets across many terms.
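If you'd rather not pass facet.method=enum on every request, one way
is to make it a default in the request handler - a minimal sketch,
assuming the stock 'standard' handler from the example solrconfig.xml:

    <requestHandler name="standard" class="solr.SearchHandler"
                    default="true">
      <lst name="defaults">
        <!-- applied unless a request overrides it -->
        <str name="facet.method">enum</str>
      </lst>
    </requestHandler>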
2. Secondly, we've found that LRUCache is faster at autowarming than
FastLRUCache - in our tests, about 20% faster. Maybe this is just our
environment - your mileage may vary.

So, our filterCache section in solrconfig.xml looks like this:

    <filterCache
      class="solr.LRUCache"
      size="3600"
      initialSize="1400"
      autowarmCount="3600"/>

For a 28GB index running in a quad-core x64 VMware instance with 30
warmed facet fields, Solr runs at ~4GB. The filterCache size in the
stats page usually shows in the region of ~2400.

3. It's also a good idea to have some firstSearcher/newSearcher event
listener queries to allow new data to populate the caches. Of course,
what you put in these depends on the facets you need/use.
We've found a good combination is a firstSearcher with as many facets
in the search as your environment can handle, then a subset of the
most common facets for the newSearcher.
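A minimal sketch of such listeners, in the <query> section of
solrconfig.xml (the facet fields here are just placeholders -
substitute your own):

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">field_a</str>
          <str name="facet.field">field_b</str>
          <!-- ...as many facet fields as your environment can handle -->
        </lst>
      </arr>
    </listener>
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <!-- a subset of the most common facets -->
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">field_a</str>
        </lst>
      </arr>
    </listener>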
4. We also set:

    <useColdSearcher>true</useColdSearcher>

just in case.

5. Another key area for search performance with high commits is to use
2 Solr instances - one for the high commit rate indexing, and one for
searching.
The read-only searching instance can be a remote replica, or a local
read-only instance that reads the same core as the indexing instance
(for the latter, you'll need something that periodically refreshes -
i.e. runs commit()).
This way, you can tune the indexing instance for write performance and
the searching instance, as above, for maximum read performance.

Using the setup above, we get fantastic searching speed for small
facet sets (well under 1sec), and really good searching for large
facet sets (a couple of secs, depending on index size, number of
facets, unique terms, etc.), even when searching against large-ish
indexes (>20million docs).
We have yet to see any OOM or GC errors using the techniques above,
even in low memory conditions.

I hope there are people who find this useful. I know I've spent a lot
of time looking for stuff like this, so hopefully this will save
someone some time.

Peter