Does Solr use Lucene NRT?

--- On Fri, 9/17/10, Erick Erickson <erickerick...@gmail.com> wrote:
> From: Erick Erickson <erickerick...@gmail.com>
> Subject: Re: Tuning Solr caches with high commit rates (NRT)
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 1:05 PM
>
> Near Real Time...
>
> Erick
>
> On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon <gear...@sbcglobal.net> wrote:
>> BTW, what is NRT?
>>
>> Dennis Gearon
>>
>> --- On Fri, 9/17/10, Peter Sturge <peter.stu...@gmail.com> wrote:
>>> From: Peter Sturge <peter.stu...@gmail.com>
>>> Subject: Re: Tuning Solr caches with high commit rates (NRT)
>>> To: solr-user@lucene.apache.org
>>> Date: Friday, September 17, 2010, 2:18 AM
>>>
>>> Hi,
>>>
>>> It's great to see such a fantastic response to this thread - NRT is alive and well!
>>>
>>> I'm hoping to collate this information and add it to the wiki when I get a few free cycles (thanks Erik for the heads-up).
>>>
>>> In the meantime, I thought I'd add a few tidbits of additional information that might prove useful:
>>>
>>> 1. The first thing to note is that the techniques/setup described in this thread don't fix the underlying potential for OutOfMemory errors - there can always be an index large enough to ask more memory of its JVM than is available for caching. These techniques do, however, mitigate the risk and provide an efficient balance between memory use and search performance. There are some interesting discussions going on for both Lucene and Solr regarding the '2 pounds of baloney into a 1 pound bag' issue of unbounded caches, with a number of interesting strategies. One strategy that I like, but haven't seen on the discussion lists, is auto-limiting cache size/warming based on available resources (similar to the way file system caches use free memory). This would allow caches to adjust to their memory environment as indexes grow.
>>>
>>> 2. A note regarding lockType in solrconfig.xml for dual Solr instances: it's best not to use 'none' as a value for lockType - this sets the lockType to null, and as the source comments note, that is a recipe for disaster, so use 'simple' instead.
>>>
>>> 3. Chris mentioned setting maxWarmingSearchers to 1 as a way of minimizing the number of onDeckSearchers. This is a prudent move -- thanks Chris for bringing this up!
>>>
>>> All the best,
>>> Peter
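For illustration, a minimal solrconfig.xml sketch of points 2 and 3 above. The element placement shown (lockType under <mainIndex>, maxWarmingSearchers in the <query> section) follows the Solr 1.4 example config; adjust it to wherever these elements live in your own file:

    <config>
      <mainIndex>
        <!-- Two instances share one index directory, so use a real lock
             implementation rather than 'none' (see point 2 above). -->
        <lockType>simple</lockType>
      </mainIndex>

      <query>
        <!-- Cap concurrent warming searchers so overlapping commits are rejected
             instead of piling up onDeck searchers (see point 3 above). -->
        <maxWarmingSearchers>1</maxWarmingSearchers>
      </query>
    </config>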
>>> On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich <peat...@yahoo.de> wrote:
>>>> Peter Sturge,
>>>>
>>>> this was a nice hint, thanks again! If you are here in Germany anytime I can invite you to a beer or an Apfelschorle! :-)
>>>> I only needed to change the lockType to none in the solrconfig.xml, disable the replication and set the data dir to the master data dir!
>>>>
>>>> Regards,
>>>> Peter Karich.
>>>>
>>>>> Hi Peter,
>>>>>
>>>>> this scenario would be really great for us - I didn't know that this is possible and works, so: thanks!
>>>>> At the moment we are doing something similar by replicating to the read-only instance, but the replication is somewhat lengthy and resource-intensive at this data volume ;-)
>>>>>
>>>>> Regards,
>>>>> Peter.
>>>>>
>>>>>> 1. You can run multiple Solr instances in separate JVMs, with both having their solr.xml configured to use the same index folder. You need to be careful that one and only one of these instances will ever update the index at a time. The best way to ensure this is to use one for writing only, while the other is read-only and never writes to the index. This read-only instance is the one to tune for high search performance. Even though the RO instance doesn't write to the index, it still needs periodic (albeit empty) commits to kick off autowarming/cache refresh.
>>>>>>
>>>>>> Depending on your needs, you might not need 2 separate instances. We need it because the 'write' instance is also doing a lot of metadata pre-write operations in the same JVM as Solr, and so has its own memory requirements.
>>>>>>
>>>>>> 2. We use sharding all the time, and it works just fine with this scenario, as the RO instance is simply another shard in the pack.
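To make the shared-index setup above concrete, here is a sketch of the relevant solrconfig.xml line for the read-only instance, pointing its data directory at the indexing instance's index. The path is a placeholder, and this assumes the read-only instance never issues writes; its searcher is refreshed by periodically sending it an empty commit (e.g. via its /update handler), as described above:

    <!-- read-only searching instance: share the indexing instance's index -->
    <dataDir>/path/to/indexing-instance/solr/data</dataDir>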
>>>>>> On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich <peat...@yahoo.de> wrote:
>>>>>>> Peter,
>>>>>>>
>>>>>>> thanks a lot for your in-depth explanations! Your findings will definitely be helpful for my next performance improvement tests :-)
>>>>>>>
>>>>>>> Two questions:
>>>>>>>
>>>>>>> 1. How would I do that:
>>>>>>>
>>>>>>>> or a local read-only instance that reads the same core as the indexing instance (for the latter, you'll need something that periodically refreshes - i.e. runs commit()).
>>>>>>>
>>>>>>> 2. Did you try sharding with your current setup (e.g. one big, nearly-static index and a tiny write+read index)?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Peter.
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Below are some notes regarding Solr cache tuning that should prove useful for anyone who uses Solr with frequent commits (e.g. <5min).
>>>>>>>>
>>>>>>>> Environment:
>>>>>>>> Solr 1.4.1 or branch_3x trunk. Note the 4.x trunk has lots of neat new features, so the notes here are likely less relevant to the 4.x environment.
>>>>>>>>
>>>>>>>> Overview:
>>>>>>>> Our Solr environment makes extensive use of faceting, we perform commits every 30 secs, and the indexes tend to be on the large-ish side (>20 million docs).
>>>>>>>> Note: For our data, when we commit, we are always adding new data, never changing existing data.
>>>>>>>> This type of environment can be tricky to tune, as Solr is more geared toward fast reads than frequent writes.
>>>>>>>>
>>>>>>>> Symptoms:
>>>>>>>> If you have used faceting in searches where you are also performing frequent commits, you've likely encountered the dreaded OutOfMemory or GC Overhead Exceeded errors. In high commit rate environments, this is almost always due to multiple 'onDeck' searchers and autowarming - i.e. new searchers don't finish autowarming their caches before the next commit() comes along and invalidates them. Once this starts happening on a regular basis, it is likely your Solr's JVM will eventually run out of memory, as the number of searchers (and their cache arrays) will keep growing until the JVM dies of thirst.
>>>>>>>> To check whether your Solr environment is suffering from this, turn on INFO level logging and look for: 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x'.
>>>>>>>>
>>>>>>>> In tests, we've only ever seen this problem when using faceting with facet.method=fc.
>>>>>>>>
>>>>>>>> Some solutions to this are:
>>>>>>>> - Reduce the commit rate to allow searchers to fully warm before the next commit
>>>>>>>> - Reduce or eliminate the autowarming in caches
>>>>>>>> - Both of the above
>>>>>>>>
>>>>>>>> The trouble is, if you're doing NRT commits, you likely have a good reason for it, and reducing/eliminating autowarming will very significantly impact search performance in high commit rate environments.
>>>>>>>>
>>>>>>>> Solution:
>>>>>>>> Here are some setup steps we've used that allow lots of faceting (we typically search with at least 20-35 different facet fields, plus date faceting/sorting) on large indexes, and still keep decent search performance:
>>>>>>>>
>>>>>>>> 1. Firstly, you should consider using the enum method for facet searches (facet.method=enum) unless you've got A LOT of memory on your machine. In our tests, this method uses a lot less memory and autowarms more quickly than fc. (Note: I've not tried the new segment-based 'fcs' option, as I can't find support for it in branch_3x - it looks nice for 4.x though.)
>>>>>>>> Admittedly, for our data, enum is not quite as fast for searching as fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile tradeoff.
>>>>>>>> If you do have access to LOTS of memory, AND you can guarantee that the index won't grow beyond the memory capacity (i.e. you have some sort of deletion policy in place), fc can be a lot faster than enum when searching with lots of facets across many terms.
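For illustration, one way to apply the enum method to every query is to set it as a request-handler default in solrconfig.xml rather than on each request. The handler name and class below follow the Solr 1.4 example config and are assumptions about your setup; a per-request facet.method=enum parameter works just as well:

    <requestHandler name="standard" class="solr.SearchHandler" default="true">
      <lst name="defaults">
        <!-- use the term-enumeration faceting method unless a request overrides it -->
        <str name="facet.method">enum</str>
      </lst>
    </requestHandler>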
>>>>>>>> 2. Secondly, we've found that LRUCache is faster at autowarming than FastLRUCache - in our tests, about 20% faster. Maybe this is just our environment - your mileage may vary.
>>>>>>>>
>>>>>>>> So, our filterCache section in solrconfig.xml looks like this:
>>>>>>>>
>>>>>>>> <filterCache
>>>>>>>>   class="solr.LRUCache"
>>>>>>>>   size="3600"
>>>>>>>>   initialSize="1400"
>>>>>>>>   autowarmCount="3600"/>
>>>>>>>>
>>>>>>>> For a 28GB index, running in a quad-core x64 VMware instance with 30 warmed facet fields, Solr runs at ~4GB. The filterCache size in the stats page is usually in the region of ~2400.
>>>>>>>>
>>>>>>>> 3. It's also a good idea to have some sort of firstSearcher/newSearcher event listener queries to allow new data to populate the caches. Of course, what you put in these depends on the facets you need/use. We've found a good combination is a firstSearcher with as many facets in the search as your environment can handle, then a subset of the most common facets for the newSearcher.
>>>>>>>>
>>>>>>>> 4. We also set:
>>>>>>>> <useColdSearcher>true</useColdSearcher>
>>>>>>>> just in case.
>>>>>>>>
>>>>>>>> 5. Another key area for search performance with high commits is to use 2 Solr instances - one for the high commit rate indexing, and one for searching.
>>>>>>>> The read-only searching instance can be a remote replica, or a local read-only instance that reads the same core as the indexing instance (for the latter, you'll need something that periodically refreshes - i.e. runs commit()).
>>>>>>>> This way, you can tune the indexing instance for write performance and the searching instance, as above, for maximum read performance.
>>>>>>>>
>>>>>>>> Using the setup above, we get fantastic search speed for small facet sets (well under 1 sec), and really good searching for large facet sets (a couple of secs depending on index size, number of facets, unique terms etc.), even when searching against large-ish indexes (>20 million docs).
>>>>>>>> We have yet to see any OOM or GC errors using the techniques above, even in low memory conditions.
>>>>>>>>
>>>>>>>> I hope there are people who find this useful. I know I've spent a lot of time looking for stuff like this, so hopefully this will save someone some time.
>>>>>>>>
>>>>>>>> Peter
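Since the thread never shows what the firstSearcher/newSearcher listeners from point 3 (or the useColdSearcher setting from point 4) look like, here is a sketch of the relevant part of the <query> section of solrconfig.xml. The query and facet fields are placeholders - substitute the facets you actually want warmed:

    <query>
      <!-- on startup, warm the first searcher with as many common facets as the
           environment can handle (point 3) -->
      <listener event="firstSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
          <lst>
            <str name="q">*:*</str>
            <str name="facet">true</str>
            <str name="facet.field">field_a</str>
            <str name="facet.field">field_b</str>
          </lst>
        </arr>
      </listener>

      <!-- after each commit, warm a smaller subset of the most common facets -->
      <listener event="newSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
          <lst>
            <str name="q">*:*</str>
            <str name="facet">true</str>
            <str name="facet.field">field_a</str>
          </lst>
        </arr>
      </listener>

      <!-- serve requests from a not-yet-warmed searcher rather than blocking (point 4) -->
      <useColdSearcher>true</useColdSearcher>
    </query>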