On Tue, Jul 31, 2012 at 2:34 PM, roz dev <rozde...@gmail.com> wrote:
> Hi All
>
> I am using Solr 4 from trunk and using it with Tomcat 6. I am noticing that
> when we are indexing lots of data with 16 concurrent threads, Heap grows
> continuously. It remains high and ultimately most of the stuff ends up
> being moved to Old Gen. Eventually, Old Gen also fills up and we start
> getting into excessive GC problem.

Hi: I don't claim to know anything about how Tomcat manages threads,
but you really shouldn't have all these objects.

In general snowball stemmers should be reused per-thread-per-field.
But if you have a lot of fields*threads, especially if there really is
high thread churn on Tomcat, then this could be bad with Snowball:
see eks dev's comment on https://issues.apache.org/jira/browse/LUCENE-3841
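
To make the fields*threads arithmetic concrete, here is a minimal sketch
using a plain JDK ThreadLocal. It is not Lucene's actual reuse code:
HeavyStemmer, PerThreadPerFieldCache and simulate are hypothetical names
made up for illustration. It only counts how many instances get built.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive-to-build stemmer; not a Lucene class.
final class HeavyStemmer {
    static final AtomicInteger CREATED = new AtomicInteger();

    HeavyStemmer() {
        CREATED.incrementAndGet();
    }
}

// Assumption: reuse is modelled as one cache per thread, keyed by field name.
final class PerThreadPerFieldCache {
    private static final ThreadLocal<Map<String, HeavyStemmer>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    static HeavyStemmer stemmerFor(String field) {
        // Built once per (thread, field) pair, then reused by that thread.
        return CACHE.get().computeIfAbsent(field, f -> new HeavyStemmer());
    }

    // Starts `threads` fresh threads; each analyzes `docs` documents that
    // have `fields` fields. Returns how many stemmers were constructed.
    static int simulate(int threads, int fields, int docs) {
        int before = HeavyStemmer.CREATED.get();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int d = 0; d < docs; d++) {
                    for (int f = 0; f < fields; f++) {
                        stemmerFor("field" + f);
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return HeavyStemmer.CREATED.get() - before;
    }

    public static void main(String[] args) {
        // 16 threads * 10 fields = 160 stemmers, however many docs there are.
        System.out.println(simulate(16, 10, 100));
    }
}
```

Every new thread starts with an empty cache, so a pool that keeps
replacing its threads pays the construction cost again for each one,
and whatever the dead threads built is left for the GC.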

I think it would be useful to see if you can tune Tomcat's thread pool
as he describes.

Separately: Snowball stemmers are currently really RAM-expensive for
stupid reasons. Each one creates a ton of Among objects, e.g. an
EnglishStemmer today is about 8KB.

I'll regenerate these and open a JIRA issue: the Snowball code
generator in their svn was improved recently, and each stemmer now
takes about 64 bytes instead (the Amongs are static and reused).
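
The difference between the two generated shapes can be sketched roughly
as below. This is not the real Snowball output: Rule, RuleTables,
PerInstanceStemmer and SharedTableStemmer are hypothetical names, and
the four suffix rules are only illustrative.

```java
// Hypothetical rule-table entry, loosely modelled on the idea of an Among.
final class Rule {
    final String suffix;
    final String replacement;

    Rule(String suffix, String replacement) {
        this.suffix = suffix;
        this.replacement = replacement;
    }
}

final class RuleTables {
    // Illustrative table only; a real stemmer has many more entries.
    static Rule[] build() {
        return new Rule[] {
            new Rule("sses", "ss"),
            new Rule("ies", "i"),
            new Rule("ss", "ss"),
            new Rule("s", ""),
        };
    }

    // Applies the first rule whose suffix matches the word.
    static String apply(Rule[] rules, String word) {
        for (Rule r : rules) {
            if (word.endsWith(r.suffix)) {
                int cut = word.length() - r.suffix.length();
                return word.substring(0, cut) + r.replacement;
            }
        }
        return word;
    }
}

// Old shape: every stemmer instance allocates its own copy of the table.
final class PerInstanceStemmer {
    final Rule[] rules = RuleTables.build();

    String stem(String word) {
        return RuleTables.apply(rules, word);
    }
}

// New shape: the table is built once per class and shared by all instances.
final class SharedTableStemmer {
    static final Rule[] RULES = RuleTables.build();

    String stem(String word) {
        return RuleTables.apply(RULES, word);
    }
}
```

With the shared table, a new instance costs little more than its object
header, while the per-instance version pays for the whole table every
time one is constructed. Both stem identically.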

Still, this won't really "solve your problem", because the analysis
chain could have other parts that are heavy to initialize, but it
seems good to fix.

As a workaround until then you can also just use the "good old
PorterStemmer" (PorterStemFilterFactory in Solr).
It's not exactly the same as using Snowball(English), but it's pretty
close and also much faster.

-- 
lucidimagination.com
