David,
The main unit of content I index is a set of articles grouped
under a common title.
I have three SOLR instances containing:
- Instance 1 - All 'live' articles ~ 750K articles - 3-4KB each
- Instance 2 - All 'live' titles - ~ 95K titles - < 1 KB each
- Instance 3 - All articles and titles ~ 1.2M articles + titles
I created Instance 1 and Instance 2 to provide fast responses under
heavy query load on 'live' articles and 'live' titles. I use Instance 3
for all low-volume, complex queries.
(All above as preamble)
My current JVM settings are
- Instance 1 - -Xms256m -Xmx2000m
- Instance 2 - -Xms256m -Xmx1000m
- Instance 3 - -Xms256m -Xmx2000m
I'm in the middle of tuning the application. These values reflect
optimization for document indexing. I haven't looked at the query side
yet.
Notes
I'm using 'top' to look at process sizes (Red Hat 4.x, dual-core Xeon,
4 GB RAM)
For instance 1, I could probably get away with -Xmx1000m - but I
think it's just a matter of (a short) time until I need to increase
that limit.
For instance 2, it currently runs in steady state at 1.2 - 1.4 GB max,
so I boosted to 2 GB max.
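
As a rough cross-check of the settings above, the heap caps can be added up and compared with physical memory. This is only a sketch: it assumes all three instances run on the one 4 GB machine (not stated explicitly here), and the function name heap_budget is made up for illustration. A JVM process also uses memory outside the heap, so the real footprint can be higher than the sum of the -Xmx values.

```python
# Heap caps (-Xmx) as listed above, in MB.
XMX_MB = {"instance1": 2000, "instance2": 1000, "instance3": 2000}

# Assumption: all three JVMs share a single machine with 4 GB of RAM.
PHYSICAL_RAM_MB = 4 * 1024


def heap_budget(xmx_mb, ram_mb):
    """Return (sum of heap caps, shortfall vs RAM); shortfall <= 0 means the caps fit."""
    total = sum(xmx_mb.values())
    return total, total - ram_mb


total, shortfall = heap_budget(XMX_MB, PHYSICAL_RAM_MB)
print(f"sum of -Xmx: {total} MB, RAM: {PHYSICAL_RAM_MB} MB, over by: {shortfall} MB")
```

Under that assumption the caps add up to more than the RAM in the box, so if all three heaps grew to their limits at once the machine would have to swap.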
Regards,
Tracy
On May 11, 2008, at 8:31 AM, David Pratt wrote:
Hi Tracy. Can you tell me how large a change in max heap space
produced the improvement, that is, your max heap space before and
after? Many thanks.
Regards,
David
Tracy Flynn wrote:
Thanks for the replies.
For a completely different reason, I happened to look at the memory
stats for all processes, including the SOLR instances. I noticed that
the slow Solr instance was maxing out, using more virtual memory than
allocated. After boosting the maximum heap space and restarting,
everything started to run at 4x-5x the speed it had before the fix,
which is the rate I reasonably thought it should have.
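
The kind of check described above can be sketched in a few lines, assuming a Linux box where /proc/<pid>/status is readable. The helper names and the sample text are made up for illustration and are not output from these machines. A JVM normally maps more virtual memory than its heap alone, so a virtual size above -Xmx is a hint to look closer rather than proof of a problem.

```python
def parse_vm_kb(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    wanted = {"VmSize", "VmRSS"}
    result = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The value looks like "  1331200 kB"; keep only the number.
            result[key] = int(rest.split()[0])
    return result


def exceeds_heap_cap(status_text, xmx_mb):
    """True if the process's virtual size is larger than the -Xmx cap."""
    return parse_vm_kb(status_text)["VmSize"] > xmx_mb * 1024


# Made-up sample text, not real output from this thread.
SAMPLE = "Name:\tjava\nVmSize:\t 1331200 kB\nVmRSS:\t 1048576 kB\n"
print(parse_vm_kb(SAMPLE), exceeds_heap_cap(SAMPLE, 1000))
```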
Tracy
On May 9, 2008, at 8:02 AM, Tracy Flynn wrote:
Hi,
I'm starting to see a significant slowdown in loading performance
after I have loaded about 400K documents. I go from a load rate
of nearly 40 docs/sec to 20-25 docs/sec.
Am I correct in assuming that, during indexing operations,
Lucene/SOLR tries to hold as much of the index in memory as possible?
If so, does the slowdown indicate a need to increase JVM heap space?
Any ideas / help would be appreciated.
Regards,
Tracy
---------------------------------------------------------------------------------------------------------------------
Details
Documents loaded as XML via POST command in batches of 1000,
commit after each batch
Total current documents ~ 450,000
Avg document size: 4KB
One indexed text field contains 3KB or so. (body field below -
standard type 'text')
Dual XEON 3 GHZ 4 GB memory
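
For reference, the loading pattern described in these details (batches of 1000, commit after each batch) can be sketched roughly as below. This is an illustration only, not the actual loader: the function names are made up, the sample documents are hypothetical and use just two field names from the schema in this message, and the HTTP POST of each message to the instance's update URL is left out.

```python
import xml.etree.ElementTree as ET


def batches(docs, size=1000):
    """Yield consecutive slices of docs, each at most `size` long."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def add_message(batch):
    """Build one Solr <add> update message for a batch of field dicts."""
    add = ET.Element("add")
    for fields in batch:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def update_messages(docs, size=1000):
    """Yield the messages to POST in order: an <add> per batch, then <commit/>."""
    for batch in batches(docs, size):
        yield add_message(batch)
        yield "<commit/>"


# Hypothetical documents; only two of the schema's fields are filled in.
docs = [{"document_id": f"doc-{i}", "document_type": "article"} for i in range(2500)]
messages = list(update_messages(docs))
print(len(messages))
```

Each add message is followed by its own commit message, which mirrors the commit-after-every-batch behaviour described above.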
SOLR JVM Startup options
java -Xms256m -Xmx1000m -jar start.jar
Relevant portion of the schema follows
<field name="document_id" type="string" indexed="true"
stored="true" required="true"/>
<field name="language" type="string" indexed="true" stored="true"
required="false"/>
<field name="languages" type="string" indexed="true"
stored="true" required="false"/>
<!-- The value specified for folding_id must be a field of type
"integer" -
type "sint" does not work -->
<field name="folding_id" type="integer" indexed="true"
stored="true" required="false" default="0"/>
<field name="document_type" type="string" indexed="true"
stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"
required="false"/>
<field name="body" type="text" indexed="true" stored="true"
required="false" compressed="true"/>
<field name="teaser" type="text" indexed="false" stored="true"
required="false"/>
<field name="articles_in_category" type="sint" indexed="true"
stored="true" required="false" default="0"/>
<field name="pen_name" type="text" indexed="true" stored="true"
required="false"/>
<field name="article_id" type="sint" indexed="true" stored="true"
required="false" default="0"/>
<field name="article_status_id" type="sint" indexed="true"
stored="true" required="false" default="0"/>
<field name="user_id" type="sint" indexed="true" stored="true"
required="false" default="0"/>
<field name="user_name" type="text" indexed="true" stored="true"
required="false"/>
<field name="user_email" type="text" indexed="true" stored="true"
required="false"/>
<field name="channel_context" type="sint" indexed="true"
stored="true" required="false" multiValued="true"/>
<field name="category_id" type="sint" indexed="true"
stored="true" required="false" default="0"/>
<field name="category_status_id" type="sint" indexed="true"
stored="true" required="false" default="0"/>
<field name="category_title" type="text" indexed="true"
stored="true" required="false"/>
<field name="category_keywords" type="text" indexed="true"
stored="true" required="false" multiValued="true"/>
<field name="category_type" type="text" indexed="true"
stored="true" required="false"/>
<field name="channel_id" type="sint" indexed="true" stored="true"
required="false" default="0"/>
<field name="channel_title" type="text" indexed="true"
stored="true" required="false"/>
<field name="helium_rank" type="sint" indexed="false"
stored="true" required="false" default="0"/>
<field name="helium_rank_percentile" type="sfloat"
indexed="false" stored="true" required="false"/>
<field name="helium_scaled_rank_boost" type="sfloat"
indexed="true" stored="true" required="false"/>
<field name="helium_scaled_rank_boost_string" type="string"
indexed="true" stored="true" required="false"/>
<!--
<field name="title_popularity" type="sint" indexed="true"
stored="true" default="0"/>
<field name="title_recent_popularity" type="sint" indexed="true"
stored="true" default="0"/>
<field name="title_views_measure" type="sint" indexed="true"
stored="true" default="0"/>
<field name="title_recent_earnings_measure" type="sint"
indexed="true" stored="true" default="0"/>
<field name="title_earnings_measure" type="sint" indexed="true"
stored="true" default="0"/>
-->
<field name="created_date" type="date" indexed="true"
stored="true" required="false" />