Hi all Solr users/developers/experts,

I have the following scenario and would appreciate any advice on tuning my
Solr master server.

I have a field in my schema that is indexed (but not stored) and holds about
10,000 ids for each document.  This field is expected to dominate the size of
the document.  Each id can contain up to 6 characters.  I see two
alternatives for this field: one is to use a multi-valued string field, and
the other is to pass a whitespace-delimited string to Solr and have Solr
tokenize that string on whitespace (the text_ws fieldType).
The master server is expected to receive a constant stream of updates.
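
To make the two alternatives concrete, here is a minimal sketch of the update message each one would send. It assumes Python 3 and Solr's XML update format; the field names come from my schema below, while the helper name, the document id, and the sample ids are made up purely for illustration.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, hex_ids, multi_valued):
    """Build a Solr <add> message for one document.

    multi_valued=True  -> one <field name="hex_id_multi"> element per id
    multi_valued=False -> a single whitespace-delimited hex_id_string field
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    if multi_valued:
        for hex_id in hex_ids:
            ET.SubElement(doc, "field", name="hex_id_multi").text = hex_id
    else:
        ET.SubElement(doc, "field", name="hex_id_string").text = " ".join(hex_ids)
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample data: 10,000 six-character hex ids.
sample_ids = ["%06x" % i for i in range(10000)]
multi_xml = build_add_xml("doc-1", sample_ids, multi_valued=True)
ws_xml = build_add_xml("doc-1", sample_ids, multi_valued=False)
print(len(multi_xml), len(ws_xml))
```

The multi-valued message is larger on the wire because every id carries its own field element; whether that also matters for memory use on the master is part of what I am asking.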

The estimated size of a single document ranges from 50 KB to 100 KB.  (I know
this is quite large.)  The number of documents is expected to be around
200,000 on each master server, and there can be multiple master servers
(sharding).  I would like the master to handle even more docs if I can find a
way.
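
As a rough sanity check on that size estimate, the raw size of the id field alone can be worked out as below. This is a back-of-the-envelope sketch that assumes single-byte characters and every id at the full 6-character length; the function name is only illustrative, and it measures the input text, not the resulting index.

```python
def estimate_field_bytes(num_ids, max_id_len):
    # num_ids tokens plus one separator between each pair of consecutive tokens
    return num_ids * max_id_len + (num_ids - 1)


# About 10,000 ids of up to 6 characters each, whitespace-delimited.
print(estimate_field_bytes(10000, 6))  # 69999
```

That is roughly 70 KB for this one field, which is consistent with the 50k to 100k range above.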

Currently, I’m running some basic stress tests to simulate the indexing load
on the master server.  The stress test continuously adds new documents at a
rate of about 10 documents every 30 seconds.  Autocommit is enabled (with
limits of 50 docs and 180 seconds), but I have no idea if this is the
preferred way.  The goal is to keep adding new documents until we have at
least 200,000 documents (or about 20GB of index) on the master (or even more
if the server can handle it).
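
For reference, the timing implied by these numbers can be worked out as follows. This is only a sketch, assuming documents arrive evenly at the stated rate and that autocommit fires on whichever limit is reached first; the function names are illustrative.

```python
def seconds_to_reach(target_docs, docs_per_batch, batch_interval_s):
    # Time needed to add target_docs at a steady docs_per_batch per interval.
    return target_docs * batch_interval_s / docs_per_batch


def autocommit_interval_s(max_docs, max_time_s, docs_per_batch, batch_interval_s):
    # Autocommit triggers on whichever limit is hit first: maxDocs or maxTime.
    time_to_max_docs = max_docs * batch_interval_s / docs_per_batch
    return min(time_to_max_docs, max_time_s)


total_s = seconds_to_reach(200000, 10, 30)
print("days to reach 200,000 docs: %.1f" % (total_s / 86400.0))
print("hours to reach 30,000 docs: %.1f" % (seconds_to_reach(30000, 10, 30) / 3600.0))
print("seconds between autocommits: %.0f" % autocommit_interval_s(50, 180, 10, 30))
```

Under those assumptions the 50-doc limit is reached after about 150 seconds, before the 180-second limit, so a commit (and with it the postCommit snapshooter configured below) would run roughly every two and a half minutes.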

What I see in the indexing stress test is that the master server stops
responding after a while; for example, it no longer answers pings once there
are about 30k documents.  The log entries are mostly:
java.lang.OutOfMemoryError: Java heap space
OR
Ping query caused exception: null (this is probably caused by the OOM
problem)

There were also a few cases where the Java process itself went away.

Questions:
1)  Is it better to use the multi-valued string field or the text_ws field
for this large field?
2)  Is it better to have more outstanding docs per commit or more frequent
commits, in terms of making the best use of server resources?  What is the
preferred way to commit documents, assuming that the Solr master receives
updates frequently?  How many updated docs should there be before issuing a
commit?
3)  How can I avoid the OOM problem in my case?  I’m already using -Xms1536M
-Xmx1536M on a 2GB machine.  Is that not enough?  I’m concerned that adding
more RAM would just delay the OOM problem.  Are there any additional JVM
options to consider?
4)  Any recommendations for the master server configuration, so that I can
maximize the number of indexed docs?
5)  How can I disable caching on the master altogether, since queries won’t
hit the master?
6)  For an average doc size of 50k-100k, is that too large for Solr, or is
Solr even the right tool?  If not, are there any alternatives?  If we are
able to reduce the size of the docs, can we expect to index more documents?

The following is information about the software, hardware, and configuration:

Solr version (solr nightly build on 5/23/2008)
        Solr Specification Version: 1.2.2008.05.23.08.06.59
        Solr Implementation Version: nightly
        Lucene Specification Version: 2.3.2
        Lucene Implementation Version: 2.3.2 652650
        Jetty: 6.1.3

schema.xml (the sections that I think are relevant to the master server):

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="hex_id_multi" type="string" indexed="true" stored="false" multiValued="true" omitNorms="true"/>
    <field name="hex_id_string" type="text_ws" indexed="true" stored="false" omitNorms="true"/>

    <uniqueKey>id</uniqueKey>

solrconfig.xml:
  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <maxBufferedDocs>500</maxBufferedDocs>
    <ramBufferSizeMB>50</ramBufferSizeMB>
    <maxMergeDocs>5000</maxMergeDocs>
    <maxFieldLength>20000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
   
    <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
    <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>50</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <!-- Deprecated -->
    <maxBufferedDocs>500</maxBufferedDocs>
    <maxMergeDocs>5000</maxMergeDocs>
    <maxFieldLength>20000</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>
  <updateHandler class="solr.DirectUpdateHandler2">

    <autoCommit> 
      <maxDocs>50</maxDocs>
      <maxTime>180000</maxTime> 
    </autoCommit>
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">solr/bin/snapshooter</str>
      <str name="dir">.</str>
      <bool name="wait">true</bool>
    </listener>
  </updateHandler>

  <query>
    <maxBooleanClauses>50</maxBooleanClauses>    
    <filterCache
      class="solr.LRUCache"
      size="0"
      initialSize="0"
      autowarmCount="0"/>
    <queryResultCache
      class="solr.LRUCache"
      size="0"
      initialSize="0"
      autowarmCount="0"/>
    <documentCache
      class="solr.LRUCache"
      size="0"
      initialSize="0"
      autowarmCount="0"/>
    <enableLazyFieldLoading>true</enableLazyFieldLoading>

    <queryResultWindowSize>1</queryResultWindowSize>
    <queryResultMaxDocsCached>1</queryResultMaxDocsCached>
    <HashDocSet maxSize="1000" loadFactor="0.75"/>
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">user_id</str> <str name="start">0</str> <str name="rows">1</str> </lst>
        <lst><str name="q">static newSearcher warming query from solrconfig.xml</str></lst>
      </arr>
    </listener>
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">fast_warm</str> <str name="start">0</str> <str name="rows">10</str> </lst>
        <lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
      </arr>
    </listener>
    <useColdSearcher>false</useColdSearcher>
    <maxWarmingSearchers>4</maxWarmingSearchers>
  </query>

Replication:
        The snappuller is scheduled to run every 15 mins for now. 

Hardware:
        AMD (2.1GHz) dual core with 2GB RAM and a 160GB SATA hard drive

OS:
        Fedora 8 (64-bit)

JVM version:
        java version "1.7.0"
        IcedTea Runtime Environment (build 1.7.0-b21)
        IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)

Java options:
        java -Djetty.home=/path/to/solr/home -d64 -Xms1536M -Xmx1536M -XX:+UseParallelGC -jar start.jar

