Yeah, there's no patch... I think Yonik can write it. :-)  The
Lucene version shouldn't matter.  The distributed faceting can in
theory easily be applied to multiple segments; however, the way it's
written makes it a challenge for me to untangle and apply successfully
to a working patch.  Also, I don't have this as an itch to scratch at
the moment.

On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge <peter.stu...@gmail.com> wrote:
> Hi Jason,
>
> I've tried some limited testing with the 4.x trunk using fcs, and I
> must say, I really like the idea of per-segment faceting.
> I was hoping to see it in 3.x, but I don't see this option in the
> branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
> one to use with 3.1?
> There seem to be a number of Solr issues tied to this - one of them
> being LUCENE-1785. Can the per-segment faceting patch work with Lucene
> 2.9/branch_3x?
>
> Thanks,
> Peter
>
>
>
> On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
>> Peter,
>>
>> Are you using per-segment faceting, eg, SOLR-1617?  That could help
>> your situation.
>>
>> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge <peter.stu...@gmail.com> 
>> wrote:
>>> Hi,
>>>
>>> Below are some notes regarding Solr cache tuning that should prove
>>> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>>>
>>> Environment:
>>> Solr 1.4.1 or branch_3x trunk.
>>> Note the 4.x trunk has lots of neat new features, so the notes here
>>> are likely less relevant to the 4.x environment.
>>>
>>> Overview:
>>> Our Solr environment makes extensive use of faceting, we perform
>>> commits every 30secs, and the indexes tend to be on the large-ish
>>> side (>20 million docs).
>>> Note: For our data, when we commit, we are always adding new data,
>>> never changing existing data.
>>> This type of environment can be tricky to tune, as Solr is more geared
>>> toward fast reads than frequent writes.
>>>
>>> Symptoms:
>>> If you've used faceting in searches while also performing frequent
>>> commits, you've likely encountered the dreaded OutOfMemory or 'GC
>>> overhead limit exceeded' errors.
>>> In high commit rate environments, this is almost always due to
>>> multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
>>> finish autowarming their caches before the next commit()
>>> comes along and invalidates them.
>>> Once this starts happening on a regular basis, your Solr JVM will
>>> likely run out of memory eventually, as the number of searchers (and
>>> their cache arrays) keeps growing until the JVM dies of thirst.
>>> To check if your Solr environment is suffering from this, turn on INFO
>>> level logging, and look for: 'PERFORMANCE WARNING: Overlapping
>>> onDeckSearchers=x'.
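>>>
>>> Related to this, solrconfig.xml's maxWarmingSearchers setting caps
>>> how many searchers are allowed to warm concurrently - once the cap
>>> is hit, commits fail fast rather than stacking up more searchers.
>>> The value below is just an illustration; tune it for your commit
>>> rate:
>>>
>>>    <maxWarmingSearchers>2</maxWarmingSearchers>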
>>>
>>> In tests, we've only ever seen this problem when using faceting, and
>>> facet.method=fc.
>>>
>>> Some solutions to this are:
>>>    Reduce the commit rate to allow searchers to fully warm before the
>>> next commit (see the autoCommit sketch below)
>>>    Reduce or eliminate the autowarming in caches
>>>    Both of the above
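>>>
>>> If your commits come from autoCommit, the rate lives in
>>> solrconfig.xml (inside <updateHandler>) - a minimal sketch, where
>>> the 5min maxTime is just an example value:
>>>
>>>    <autoCommit>
>>>      <maxTime>300000</maxTime> <!-- milliseconds between commits -->
>>>    </autoCommit>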
>>>
>>> The trouble is, if you're doing NRT commits, you likely have a good
>>> reason for it, and reducing/eliminating autowarming will very
>>> significantly impact search performance in high commit rate
>>> environments.
>>>
>>> Solution:
>>> Here are some setup steps we've used that allow lots of faceting (we
>>> typically search with at least 20-35 different facet fields, and date
>>> faceting/sorting) on large indexes, and still keep decent search
>>> performance:
>>>
>>> 1. Firstly, you should consider using the enum method for facet
>>> searches (facet.method=enum) unless you've got A LOT of memory on your
>>> machine. In our tests, this method uses a lot less memory and
>>> autowarms more quickly than fc. (Note, I've not tried the new
>>> segment-based 'fcs' option, as I can't find support for it in
>>> branch_3x - looks nice for 4.x though)
>>> Admittedly, for our data, enum is not quite as fast for searching as
>>> fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
>>> tradeoff.
>>> If you do have access to LOTS of memory, AND you can guarantee that
>>> the index won't grow beyond the memory capacity (i.e. you have some
>>> sort of deletion policy in place), fc can be a lot faster than enum
>>> when searching with lots of facets across many terms.
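>>>
>>> (For reference, the method is selected per request - or as a default
>>> in your request handler - via the facet.method parameter, e.g.
>>> something like:
>>>
>>>    http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=type&facet.method=enum
>>>
>>> where host/port and the 'type' field are just illustrative.)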
>>>
>>> 2. Secondly, we've found that LRUCache is faster at autowarming than
>>> FastLRUCache - in our tests, about 20% faster. Maybe this is just our
>>> environment - your mileage may vary.
>>>
>>> So, our filterCache section in solrconfig.xml looks like this:
>>>    <filterCache
>>>      class="solr.LRUCache"
>>>      size="3600"
>>>      initialSize="1400"
>>>      autowarmCount="3600"/>
>>>
>>> For a 28GB index with 30 warmed facet fields, running in a quad-core
>>> x64 VMware instance, Solr runs at ~4GB of heap, and the filterCache
>>> stats usually show a size in the region of ~2400.
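>>>
>>> (Those cache stats come from the admin stats page - with default
>>> paths that's something like http://localhost:8983/solr/admin/stats.jsp,
>>> host/port illustrative.)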
>>>
>>> 3. It's also a good idea to configure some firstSearcher/newSearcher
>>> event listener queries to allow new data to populate the caches.
>>> Of course, what you put in these is dependent on the facets you need/use.
>>> We've found a good combination is a firstSearcher with as many facets
>>> in the search as your environment can handle, then a subset of the
>>> most common facets for the newSearcher.
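>>>
>>> As a sketch, the listeners go in solrconfig.xml along these lines
>>> (the query parameters and field names here are made up - substitute
>>> the facets you actually use):
>>>
>>>    <listener event="firstSearcher" class="solr.QuerySenderListener">
>>>      <arr name="queries">
>>>        <lst>
>>>          <str name="q">*:*</str>
>>>          <str name="facet">true</str>
>>>          <str name="facet.method">enum</str>
>>>          <str name="facet.field">field1</str>
>>>          <str name="facet.field">field2</str>
>>>        </lst>
>>>      </arr>
>>>    </listener>
>>>    <listener event="newSearcher" class="solr.QuerySenderListener">
>>>      <arr name="queries">
>>>        <lst>
>>>          <str name="q">*:*</str>
>>>          <str name="facet">true</str>
>>>          <str name="facet.field">field1</str>
>>>        </lst>
>>>      </arr>
>>>    </listener>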
>>>
>>> 4. We also set:
>>>   <useColdSearcher>true</useColdSearcher>
>>> just in case.
>>>
>>> 5. Another key area for search performance with high commits is to use
>>> 2 Solr instances - one for the high commit rate indexing, and one for
>>> searching.
>>> The read-only searching instance can be a remote replica, or a local
>>> read-only instance that reads the same core as the indexing instance
>>> (for the latter, you'll need something that periodically refreshes -
>>> i.e. runs commit()).
>>> This way, you can tune the indexing instance for writing performance
>>> and the searching instance as above for max read performance.
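>>>
>>> For the remote replica route, Solr 1.4's Java-based replication can
>>> be set up roughly like this (masterUrl and pollInterval values are
>>> illustrative):
>>>
>>>    <!-- indexing (master) instance -->
>>>    <requestHandler name="/replication" class="solr.ReplicationHandler">
>>>      <lst name="master">
>>>        <str name="replicateAfter">commit</str>
>>>      </lst>
>>>    </requestHandler>
>>>
>>>    <!-- searching (slave) instance -->
>>>    <requestHandler name="/replication" class="solr.ReplicationHandler">
>>>      <lst name="slave">
>>>        <str name="masterUrl">http://indexer:8983/solr/replication</str>
>>>        <str name="pollInterval">00:00:60</str>
>>>      </lst>
>>>    </requestHandler>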
>>>
>>> Using the setup above, we get fantastic search speed for small
>>> facet sets (well under 1sec), and really good performance for large
>>> facet sets (a couple of secs, depending on index size, number of
>>> facets, unique terms, etc.),
>>> even when searching against large-ish indexes (>20 million docs).
>>> We have yet to see any OOM or GC errors using the techniques above,
>>> even in low memory conditions.
>>>
>>> I hope there are people who find this useful. I know I've spent a lot
>>> of time looking for stuff like this, so hopefully this will save
>>> someone some time.
>>>
>>>
>>> Peter
>>>
>>
>
