Near Real Time... Erick
On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon <gear...@sbcglobal.net> wrote:

> BTW, what is NRT?
>
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
> otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 9/17/10, Peter Sturge <peter.stu...@gmail.com> wrote:
>
> > From: Peter Sturge <peter.stu...@gmail.com>
> > Subject: Re: Tuning Solr caches with high commit rates (NRT)
> > To: solr-user@lucene.apache.org
> > Date: Friday, September 17, 2010, 2:18 AM
> >
> > Hi,
> >
> > It's great to see such a fantastic response to this thread - NRT is
> > alive and well!
> >
> > I'm hoping to collate this information and add it to the wiki when I
> > get a few free cycles (thanks Erik for the heads up).
> >
> > In the meantime, I thought I'd add a few tidbits of additional
> > information that might prove useful:
> >
> > 1. The first one to note is that the techniques/setup described in
> > this thread don't fix the underlying potential for OutOfMemory errors
> > - there can always be an index large enough to ask its JVM for more
> > memory than is available for cache.
> > These techniques, however, mitigate the risk and provide an efficient
> > balance between memory use and search performance.
> > There are some interesting discussions going on for both Lucene and
> > Solr regarding the '2 pounds of baloney into a 1 pound bag' issue of
> > unbounded caches, with a number of interesting strategies.
> > One strategy that I like, but haven't found in discussion lists, is
> > auto-limiting cache size/warming based on available resources
> > (similar to the way file system caches use free memory). This would
> > allow caches to adjust to their memory environment as indexes grow.
> >
> > 2. A note regarding lockType in solrconfig.xml for dual Solr
> > instances: it's best not to use 'none' as a value for lockType - this
> > sets the lockType to null, and as the source comments note, this is a
> > recipe for disaster, so use 'simple' instead.
> >
> > 3. Chris mentioned setting maxWarmingSearchers to 1 as a way of
> > minimizing the number of onDeckSearchers. This is a prudent move --
> > thanks, Chris, for bringing this up!
> >
> > All the best,
> > Peter
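For reference, points 2 and 3 above amount to two one-line settings in
solrconfig.xml. A minimal sketch, following the layout of the stock 1.4.x
example config (check where these elements live in your own config):

    <!-- inside the <indexDefaults> (and/or <mainIndex>) section: -->
    <!-- 'simple' uses a plain lock file; 'none' disables locking entirely -->
    <lockType>simple</lockType>

    <!-- inside the <query> section: cap concurrent warming searchers so
         overlapping commits can't stack up autowarming work -->
    <maxWarmingSearchers>1</maxWarmingSearchers>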
> > On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich <peat...@yahoo.de> wrote:
> >
> > > Peter Sturge,
> > >
> > > this was a nice hint, thanks again! If you are here in Germany
> > > anytime I can invite you to a beer or an apfelschorle ! :-)
> > > I only needed to change the lockType to none in the solrconfig.xml,
> > > disable the replication and set the data dir to the master data dir!
> > >
> > > Regards,
> > > Peter Karich.
> > >
> > >> Hi Peter,
> > >>
> > >> this scenario would be really great for us - I didn't know that
> > >> this is possible and works, so: thanks!
> > >> At the moment we are doing something similar by replicating to the
> > >> read-only instance, but the replication is somewhat lengthy and
> > >> resource-intensive at this data volume ;-)
> > >>
> > >> Regards,
> > >> Peter.
> > >>
> > >>> 1. You can run multiple Solr instances in separate JVMs, with both
> > >>> having their solr.xml configured to use the same index folder.
> > >>> You need to be careful that one and only one of these instances
> > >>> will ever update the index at a time. The best way to ensure this
> > >>> is to use one instance for writing only, while the other is
> > >>> read-only and never writes to the index. This read-only instance
> > >>> is the one to tune for high search performance. Even though the RO
> > >>> instance doesn't write to the index, it still needs periodic
> > >>> (albeit empty) commits to kick off autowarming/cache refresh.
> > >>>
> > >>> Depending on your needs, you might not need to have 2 separate
> > >>> instances. We need it because the 'write' instance is also doing a
> > >>> lot of metadata pre-write operations in the same JVM as Solr, and
> > >>> so has its own memory requirements.
> > >>>
> > >>> 2. We use sharding all the time, and it works just fine with this
> > >>> scenario, as the RO instance is simply another shard in the pack.
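To make point 1 concrete: a minimal sketch of the shared-index setup,
assuming you point both instances' dataDir at the same (hypothetical)
location - the essentials from the thread are one shared index directory,
exactly one writer, and lockType 'simple' on both instances:

    <!-- solrconfig.xml on BOTH the writer and the read-only instance -->
    <dataDir>/var/data/solr/core0</dataDir>

    <!-- on both instances, per the lockType advice above -->
    <lockType>simple</lockType>

The read-only instance then never receives adds/deletes/optimizes; it only
gets the periodic empty commit mentioned above, so its searcher reopens and
autowarms against the writer's latest segments.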
> > >>> On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich <peat...@yahoo.de> wrote:
> > >>>
> > >>>> Peter,
> > >>>>
> > >>>> thanks a lot for your in-depth explanations!
> > >>>> Your findings will definitely be helpful for my next performance
> > >>>> improvement tests :-)
> > >>>>
> > >>>> Two questions:
> > >>>>
> > >>>> 1. How would I do that:
> > >>>>
> > >>>>> or a local read-only instance that reads the same core as the
> > >>>>> indexing instance (for the latter, you'll need something that
> > >>>>> periodically refreshes - i.e. runs commit()).
> > >>>>
> > >>>> 2. Did you try sharding with your current setup (e.g. one big,
> > >>>> nearly-static index and a tiny write+read index)?
> > >>>>
> > >>>> Regards,
> > >>>> Peter.
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> Below are some notes regarding Solr cache tuning that should
> > >>>>> prove useful for anyone who uses Solr with frequent commits
> > >>>>> (e.g. <5min).
> > >>>>>
> > >>>>> Environment:
> > >>>>> Solr 1.4.1 or branch_3x trunk.
> > >>>>> Note the 4.x trunk has lots of neat new features, so the notes
> > >>>>> here are likely less relevant to the 4.x environment.
> > >>>>>
> > >>>>> Overview:
> > >>>>> Our Solr environment makes extensive use of faceting, we perform
> > >>>>> commits every 30secs, and the indexes tend to be on the large-ish
> > >>>>> side (>20 million docs).
> > >>>>> Note: For our data, when we commit, we are always adding new
> > >>>>> data, never changing existing data.
> > >>>>> This type of environment can be tricky to tune, as Solr is more
> > >>>>> geared toward fast reads than frequent writes.
> > >>>>>
> > >>>>> Symptoms:
> > >>>>> If anyone has used faceting in searches where you are also
> > >>>>> performing frequent commits, you've likely encountered the
> > >>>>> dreaded OutOfMemory or GC Overhead Exceeded errors.
> > >>>>> In high commit rate environments, this is almost always due to
> > >>>>> multiple 'onDeck' searchers and autowarming - i.e. new searchers
> > >>>>> don't finish autowarming their caches before the next commit()
> > >>>>> comes along and invalidates them.
> > >>>>> Once this starts happening on a regular basis, it is likely your
> > >>>>> Solr's JVM will eventually run out of memory, as the number of
> > >>>>> searchers (and their cache arrays) will keep growing until the
> > >>>>> JVM dies of thirst.
> > >>>>> To check if your Solr environment is suffering from this, turn
> > >>>>> on INFO level logging, and look for: 'PERFORMANCE WARNING:
> > >>>>> Overlapping onDeckSearchers=x'.
> > >>>>>
> > >>>>> In tests, we've only ever seen this problem when using faceting,
> > >>>>> and facet.method=fc.
> > >>>>>
> > >>>>> Some solutions to this are:
> > >>>>>   - Reduce the commit rate to allow searchers to fully warm
> > >>>>>     before the next commit
> > >>>>>   - Reduce or eliminate the autowarming in caches
> > >>>>>   - Both of the above
> > >>>>>
> > >>>>> The trouble is, if you're doing NRT commits, you likely have a
> > >>>>> good reason for it, and reducing/eliminating autowarming will
> > >>>>> very significantly impact search performance in high commit rate
> > >>>>> environments.
> > >>>>>
> > >>>>> Solution:
> > >>>>> Here are some setup steps we've used that allow lots of faceting
> > >>>>> (we typically search with at least 20-35 different facet fields,
> > >>>>> and date faceting/sorting) on large indexes, and still keep
> > >>>>> decent search performance:
> > >>>>>
> > >>>>> 1. Firstly, you should consider using the enum method for facet
> > >>>>> searches (facet.method=enum) unless you've got A LOT of memory
> > >>>>> on your machine. In our tests, this method uses a lot less
> > >>>>> memory and autowarms more quickly than fc. (Note, I've not tried
> > >>>>> the new segment-based 'fcs' option, as I can't find support for
> > >>>>> it in branch_3x - looks nice for 4.x though.)
> > >>>>> Admittedly, for our data, enum is not quite as fast for
> > >>>>> searching as fc, but short of purchasing a Taiwanese RAM
> > >>>>> factory, it's a worthwhile tradeoff.
> > >>>>> If you do have access to LOTS of memory, AND you can guarantee
> > >>>>> that the index won't grow beyond the memory capacity (i.e. you
> > >>>>> have some sort of deletion policy in place), fc can be a lot
> > >>>>> faster than enum when searching with lots of facets across many
> > >>>>> terms.
> > >>>>>
> > >>>>> 2. Secondly, we've found that LRUCache is faster at autowarming
> > >>>>> than FastLRUCache - in our tests, about 20% faster. Maybe this
> > >>>>> is just our environment - your mileage may vary.
> > >>>>>
> > >>>>> So, our filterCache section in solrconfig.xml looks like this:
> > >>>>>
> > >>>>>   <filterCache
> > >>>>>     class="solr.LRUCache"
> > >>>>>     size="3600"
> > >>>>>     initialSize="1400"
> > >>>>>     autowarmCount="3600"/>
> > >>>>>
> > >>>>> For a 28GB index, running in a quad-core x64 VMWare instance
> > >>>>> with 30 warmed facet fields, Solr is running at ~4GB. The
> > >>>>> filterCache size stat usually shows in the region of ~2400.
> > >>>>>
> > >>>>> 3. It's also a good idea to have some sort of
> > >>>>> firstSearcher/newSearcher event listener queries to allow new
> > >>>>> data to populate the caches.
> > >>>>> Of course, what you put in these depends on the facets you
> > >>>>> need/use.
> > >>>>> We've found a good combination is a firstSearcher with as many
> > >>>>> facets in the search as your environment can handle, then a
> > >>>>> subset of the most common facets for the newSearcher.
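As a sketch of the point-3 listeners (the field names here are made up -
substitute your real facet fields), these go in the <query> section of
solrconfig.xml:

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <!-- warm as many facets as the environment can handle -->
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.method">enum</str>
          <str name="facet.field">category</str>
          <str name="facet.field">author</str>
          <str name="facet.field">pubdate</str>
        </lst>
      </arr>
    </listener>

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <!-- smaller subset of the most common facets, as suggested above -->
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.method">enum</str>
          <str name="facet.field">category</str>
        </lst>
      </arr>
    </listener>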
> > >>>>> 4. We also set:
> > >>>>>
> > >>>>>   <useColdSearcher>true</useColdSearcher>
> > >>>>>
> > >>>>> just in case.
> > >>>>>
> > >>>>> 5. Another key area for search performance with high commits is
> > >>>>> to use 2 Solr instances - one for the high commit rate indexing,
> > >>>>> and one for searching.
> > >>>>> The read-only searching instance can be a remote replica, or a
> > >>>>> local read-only instance that reads the same core as the
> > >>>>> indexing instance (for the latter, you'll need something that
> > >>>>> periodically refreshes - i.e. runs commit()).
> > >>>>> This way, you can tune the indexing instance for writing
> > >>>>> performance and the searching instance as above for max read
> > >>>>> performance.
> > >>>>>
> > >>>>> Using the setup above, we get fantastic searching speed for
> > >>>>> small facet sets (well under 1 sec), and really good searching
> > >>>>> for large facet sets (a couple of secs depending on index size,
> > >>>>> number of facets, unique terms etc.), even when searching
> > >>>>> against large-ish indexes (>20 million docs).
> > >>>>> We have yet to see any OOM or GC errors using the techniques
> > >>>>> above, even in low memory conditions.
> > >>>>>
> > >>>>> I hope there are people who find this useful. I know I've spent
> > >>>>> a lot of time looking for stuff like this, so hopefully this
> > >>>>> will save someone some time.
> > >>>>>
> > >>>>> Peter
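One last note on the 'periodically runs commit()' part of point 5: a simple
way to do this (an assumption on my part, not necessarily how Peter does it)
is to POST Solr's XML update message for an empty commit to the read-only
instance on a timer or cron job - host, port and core path here are
hypothetical:

    <!-- POST this document to http://localhost:8983/solr/update
         on a schedule (e.g. every 30 secs to match the commit rate) -->
    <commit/>

Each such commit reopens the searcher on the RO instance, which is what
triggers the autowarming discussed throughout this thread.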