Yeah, there's no patch... I think Yonik can write it. :-) Yah... The Lucene version shouldn't matter. The distributed faceting can in theory be applied to multiple segments easily; however, the way it's written makes it a challenge for me to untangle and apply successfully to a working patch. Also, I don't have this as an itch to scratch at the moment.
On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge <peter.stu...@gmail.com> wrote:
> Hi Jason,
>
> I've tried some limited testing with the 4.x trunk using fcs, and I
> must say, I really like the idea of per-segment faceting.
> I was hoping to see it in 3.x, but I don't see this option in the
> branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
> one to use with 3.1?
> There seem to be a number of Solr issues tied to this - one of them
> being LUCENE-1785. Can the per-segment faceting patch work with Lucene
> 2.9/branch_3x?
>
> Thanks,
> Peter
>
>
> On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
>> Peter,
>>
>> Are you using per-segment faceting, e.g. SOLR-1617? That could help
>> your situation.
>>
>> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge <peter.stu...@gmail.com> wrote:
>>> Hi,
>>>
>>> Below are some notes regarding Solr cache tuning that should prove
>>> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>>>
>>> Environment:
>>> Solr 1.4.1 or branch_3x trunk.
>>> Note: the 4.x trunk has lots of neat new features, so the notes here
>>> are likely less relevant to the 4.x environment.
>>>
>>> Overview:
>>> Our Solr environment makes extensive use of faceting; we perform
>>> commits every 30 secs, and the indexes tend to be on the large-ish
>>> side (>20 million docs).
>>> Note: for our data, when we commit, we are always adding new data,
>>> never changing existing data.
>>> This type of environment can be tricky to tune, as Solr is more
>>> geared toward fast reads than frequent writes.
>>>
>>> Symptoms:
>>> If you've used faceting in searches where you are also performing
>>> frequent commits, you've likely encountered the dreaded OutOfMemory
>>> or GC Overhead Limit Exceeded errors.
>>> In high commit rate environments, this is almost always due to
>>> multiple 'onDeck' searchers and autowarming - i.e. new searchers
>>> don't finish autowarming their caches before the next commit()
>>> comes along and invalidates them.
>>> Once this starts happening on a regular basis, your Solr JVM will
>>> likely run out of memory eventually, as the number of searchers
>>> (and their cache arrays) will keep growing until the JVM dies of
>>> thirst.
>>> To check whether your Solr environment is suffering from this, turn
>>> on INFO level logging and look for: 'PERFORMANCE WARNING:
>>> Overlapping onDeckSearchers=x'.
>>>
>>> In tests, we've only ever seen this problem when using faceting
>>> with facet.method=fc.
>>>
>>> Some solutions to this are:
>>> - Reduce the commit rate to allow searchers to fully warm before
>>>   the next commit
>>> - Reduce or eliminate the autowarming in caches
>>> - Both of the above
>>>
>>> The trouble is, if you're doing NRT commits, you likely have a good
>>> reason for it, and reducing/eliminating autowarming will very
>>> significantly impact search performance in high commit rate
>>> environments.
>>>
>>> Solution:
>>> Here are some setup steps we've used that allow lots of faceting
>>> (we typically search with at least 20-35 different facet fields,
>>> plus date faceting/sorting) on large indexes, while still keeping
>>> decent search performance:
>>>
>>> 1. Firstly, consider using the enum method for facet searches
>>> (facet.method=enum) unless you've got A LOT of memory on your
>>> machine. In our tests, this method uses a lot less memory and
>>> autowarms more quickly than fc. (A config sketch for making enum
>>> the default is included at the bottom of this mail.)
>>> (Note: I've not tried the new segment-based 'fcs' option, as I
>>> can't find support for it in branch_3x - it looks nice for 4.x
>>> though.)
>>> Admittedly, for our data, enum is not quite as fast for searching
>>> as fc, but short of purchasing a Taiwanese RAM factory, it's a
>>> worthwhile tradeoff.
>>> If you do have access to LOTS of memory, AND you can guarantee that
>>> the index won't grow beyond the memory capacity (i.e. you have some
>>> sort of deletion policy in place), fc can be a lot faster than enum
>>> when searching with lots of facets across many terms.
>>>
>>> 2. Secondly, we've found that LRUCache is faster at autowarming
>>> than FastLRUCache - in our tests, about 20% faster. Maybe this is
>>> just our environment - your mileage may vary.
>>>
>>> So, our filterCache section in solrconfig.xml looks like this:
>>> <filterCache
>>>   class="solr.LRUCache"
>>>   size="3600"
>>>   initialSize="1400"
>>>   autowarmCount="3600"/>
>>>
>>> For a 28GB index running in a quad-core x64 VMware instance with 30
>>> warmed facet fields, Solr runs at ~4GB. The filterCache size shown
>>> in Stats is usually in the region of ~2400.
>>>
>>> 3. It's also a good idea to have firstSearcher/newSearcher event
>>> listener queries to allow new data to populate the caches.
>>> Of course, what you put in these depends on the facets you
>>> need/use.
>>> We've found a good combination is a firstSearcher with as many
>>> facets in the search as your environment can handle, then a subset
>>> of the most common facets for the newSearcher. A sketch of what
>>> these listeners might look like follows below.
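>>> For illustration, such listener entries in solrconfig.xml might
>>> look something like this - the facet fields here are made-up
>>> examples, so substitute your own common ones:
>>>
>>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>>   <arr name="queries">
>>>     <lst>
>>>       <str name="q">*:*</str>
>>>       <str name="facet">true</str>
>>>       <str name="facet.method">enum</str>
>>>       <str name="facet.field">type</str>
>>>       <str name="facet.field">host</str>
>>>       <str name="facet.field">user</str>
>>>     </lst>
>>>   </arr>
>>> </listener>
>>> <listener event="newSearcher" class="solr.QuerySenderListener">
>>>   <arr name="queries">
>>>     <lst>
>>>       <str name="q">*:*</str>
>>>       <str name="facet">true</str>
>>>       <str name="facet.method">enum</str>
>>>       <str name="facet.field">type</str>
>>>     </lst>
>>>   </arr>
>>> </listener>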
>>> 4. We also set:
>>> <useColdSearcher>true</useColdSearcher>
>>> just in case.
>>>
>>> 5. Another key area for search performance with high commits is to
>>> use two Solr instances - one for the high commit rate indexing,
>>> and one for searching.
>>> The read-only searching instance can be a remote replica, or a
>>> local read-only instance that reads the same core as the indexing
>>> instance (for the latter, you'll need something that periodically
>>> refreshes - i.e. runs commit()).
>>> This way, you can tune the indexing instance for write performance
>>> and the searching instance as above for max read performance. (A
>>> minimal replication sketch is included at the bottom of this
>>> mail.)
>>>
>>> Using the setup above, we get fantastic searching speed for small
>>> facet sets (well under 1 sec), and really good searching for large
>>> facet sets (a couple of secs depending on index size, number of
>>> facets, unique terms etc.),
>>> even when searching against large-ish indexes (>20 million docs).
>>> We have yet to see any OOM or GC errors using the techniques
>>> above, even in low memory conditions.
>>>
>>> I hope some people find this useful. I know I've spent a lot of
>>> time looking for stuff like this, so hopefully this will save
>>> someone some time.
>>>
>>>
>>> Peter
>>>
>>
>
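Regarding step 1 of the notes above: facet.method can be passed per
request on the query string (&facet.method=enum, or per field via
f.<field>.facet.method=enum), or made a default in solrconfig.xml. A
minimal sketch of the latter - the handler here is just the stock 1.4
standard handler, so adjust to your own config:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="facet.method">enum</str>
  </lst>
</requestHandler>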
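And for the two-instance setup in step 5, the Java-based replication
in 1.4 can be configured along these lines - the hostname and poll
interval are made-up examples, and you'd want to tune replicateAfter
and pollInterval to your commit rate:

On the indexing (master) instance:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

On the searching (slave) instance:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://indexhost:8983/solr/replication</str>
    <str name="pollInterval">00:00:30</str>
  </lst>
</requestHandler>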