On Wed, May 28, 2008 at 10:30 PM, Gaku Mak <[EMAIL PROTECTED]> wrote:
> I used the admin GUI to get the java info.
> java.vm.specification.vendor = Sun Microsystems Inc.

Well, your original email listed IcedTea... but that is mostly Sun code,
so maybe that's why the vendor is still listed as Sun.
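
If you want to confirm which runtime Jetty actually picked up (beyond the
admin GUI properties page), one option is a throwaway class like the sketch
below (the class name is just an example), compiled and run with the same
java binary that starts Solr:

  // JvmInfo.java -- prints the properties that distinguish IcedTea from Sun's JVM
  public class JvmInfo {
      public static void main(String[] args) {
          // On IcedTea these should report the IcedTea runtime/VM names;
          // on Sun's JVM they should report the HotSpot names.
          System.out.println("java.runtime.name = " + System.getProperty("java.runtime.name"));
          System.out.println("java.vm.name      = " + System.getProperty("java.vm.name"));
          System.out.println("java.vm.vendor    = " + System.getProperty("java.vm.vendor"));
          System.out.println("java.version      = " + System.getProperty("java.version"));
      }
  }
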
I'd recommend downloading 1.6.0_03 from java.sun.com and trying that.
Later versions (1.6.0_04+) have a JVM bug that bites Lucene, so stick
with 1.6.0_03 for now.

-Yonik

> Any suggestion?  Thanks a lot for your help!!
>
> -Gaku
>
>
> Yonik Seeley wrote:
>>
>> Not sure why you would be getting an OOM from just indexing, and with
>> the 1.5G heap you've given the JVM.
>> Have you tried Sun's JVM?
>>
>> -Yonik
>>
>> On Wed, May 28, 2008 at 7:35 PM, gaku113 <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi all Solr users/developers/experts,
>>>
>>> I have the following scenario and I appreciate any advice for tuning
>>> my solr master server.
>>>
>>> I have a field in my schema that would index (but not store) about
>>> ~10000 ids for each document. This field is expected to govern the
>>> size of the document. Each id can contain up to 6 characters. I
>>> figure that there are two alternatives for this field: one is to use
>>> a multi-valued string field, and the other is to pass a
>>> whitespace-delimited string to solr and have solr tokenize that
>>> string on whitespace (the text_ws fieldType). The master server is
>>> expected to receive a constant stream of updates.
>>>
>>> The expected/estimated document size can range from 50k to 100k for
>>> a single document. (I know this is quite large.) The number of
>>> documents is expected to be around 200,000 on each master server,
>>> and there can be multiple master servers (sharding). I wish the
>>> master could handle more docs too, if I can figure out a way.
>>>
>>> Currently, I'm performing some basic stress tests to simulate the
>>> indexing side on the master server. This stress test continuously
>>> adds new documents at a rate of about 10 documents every 30 seconds.
>>> Autocommit is being used (50 docs and 180 seconds constraints), but
>>> I have no idea if this is the preferred way. The goal is to keep
>>> adding new documents until we get at least 200,000 documents (or
>>> about 20GB of index) on the master (or even more if the server can
>>> handle it).
>>>
>>> What I experienced from the indexing stress test is that the master
>>> server failed to respond after a while, such as becoming
>>> non-pingable when there are about 30k documents. When looking at the
>>> log, the errors are mostly:
>>> java.lang.OutOfMemoryError: Java heap space
>>> OR
>>> Ping query caused exception: null (this is probably caused by the
>>> OOM problem)
>>>
>>> There were also a few cases where the java process went away entirely.
>>>
>>> Questions:
>>> 1) Is it better to use the multi-valued string field or the text_ws
>>> field for this large field?
>>> 2) Is it better to have more outstanding docs per commit or more
>>> frequent commits, in terms of maximizing server resources? What is
>>> the preferred way to commit documents assuming that the solr master
>>> receives updates frequently? How many updated docs should there be
>>> before issuing a commit?
>>> 3) How do I avoid the OOM problem in my case? I'm already using
>>> -Xms1536M -Xmx1536M on a 2-GB machine. Is that not enough? I'm
>>> concerned that adding more RAM would just delay the OOM problem. Any
>>> additional JVM options to consider?
>>> 4) Any recommendation for the master server configuration, in the
>>> sense that I can maximize the number of indexed docs?
>>> 5) How can I disable caching on the master altogether, as queries
>>> won't hit the master?
>>> 6) For an average doc size of 50k-100k, is that too large for solr,
>>> or is solr even the right tool? If not, any alternative? If we are
>>> able to reduce the size of docs, can we expect to index more
>>> documents?
>>>
>>> The following is info related to software/hardware/configuration:
>>>
>>> Solr version (solr nightly build on 5/23/2008)
>>> Solr Specification Version: 1.2.2008.05.23.08.06.59
>>> Solr Implementation Version: nightly
>>> Lucene Specification Version: 2.3.2
>>> Lucene Implementation Version: 2.3.2 652650
>>> Jetty: 6.1.3
>>>
>>> Schema.xml (the sections that I think are relevant to the master server):
>>>
>>> <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>>> omitNorms="true"/>
>>> <fieldType name="text_ws" class="solr.TextField"
>>> positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> <field name="id" type="string" indexed="true" stored="true"
>>> required="true"/>
>>> <field name="hex_id_multi" type="string" indexed="true" stored="false"
>>> multiValued="true" omitNorms="true"/>
>>> <field name="hex_id_string" type="text_ws" indexed="true" stored="false"
>>> omitNorms="true"/>
>>>
>>> <uniqueKey>id</uniqueKey>
>>>
>>> Solrconfig.xml
>>>
>>> <indexDefaults>
>>>   <useCompoundFile>false</useCompoundFile>
>>>   <mergeFactor>10</mergeFactor>
>>>   <maxBufferedDocs>500</maxBufferedDocs>
>>>   <ramBufferSizeMB>50</ramBufferSizeMB>
>>>   <maxMergeDocs>5000</maxMergeDocs>
>>>   <maxFieldLength>20000</maxFieldLength>
>>>   <writeLockTimeout>1000</writeLockTimeout>
>>>   <commitLockTimeout>10000</commitLockTimeout>
>>>   <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
>>>   <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
>>>   <lockType>single</lockType>
>>> </indexDefaults>
>>>
>>> <mainIndex>
>>>   <useCompoundFile>false</useCompoundFile>
>>>   <ramBufferSizeMB>50</ramBufferSizeMB>
>>>   <mergeFactor>10</mergeFactor>
>>>   <!-- Deprecated -->
>>>   <maxBufferedDocs>500</maxBufferedDocs>
>>>   <maxMergeDocs>5000</maxMergeDocs>
>>>   <maxFieldLength>20000</maxFieldLength>
>>>   <unlockOnStartup>false</unlockOnStartup>
>>> </mainIndex>
>>>
>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>   <autoCommit>
>>>     <maxDocs>50</maxDocs>
>>>     <maxTime>180000</maxTime>
>>>   </autoCommit>
>>>   <listener event="postCommit" class="solr.RunExecutableListener">
>>>     <str name="exe">solr/bin/snapshooter</str>
>>>     <str name="dir">.</str>
>>>     <bool name="wait">true</bool>
>>>   </listener>
>>> </updateHandler>
>>>
>>> <query>
>>>   <maxBooleanClauses>50</maxBooleanClauses>
>>>   <filterCache
>>>     class="solr.LRUCache"
>>>     size="0"
>>>     initialSize="0"
>>>     autowarmCount="0"/>
>>>   <queryResultCache
>>>     class="solr.LRUCache"
>>>     size="0"
>>>     initialSize="0"
>>>     autowarmCount="0"/>
>>>   <documentCache
>>>     class="solr.LRUCache"
>>>     size="0"
>>>     initialSize="0"
>>>     autowarmCount="0"/>
>>>   <enableLazyFieldLoading>true</enableLazyFieldLoading>
>>>   <queryResultWindowSize>1</queryResultWindowSize>
>>>   <queryResultMaxDocsCached>1</queryResultMaxDocsCached>
>>>   <HashDocSet maxSize="1000" loadFactor="0.75"/>
>>>   <listener event="newSearcher" class="solr.QuerySenderListener">
>>>     <arr name="queries">
>>>       <lst> <str name="q">user_id</str> <str name="start">0</str> <str name="rows">1</str> </lst>
>>>       <lst><str name="q">static newSearcher warming query from solrconfig.xml</str></lst>
>>>     </arr>
>>>   </listener>
>>>   <listener event="firstSearcher" class="solr.QuerySenderListener">
>>>     <arr name="queries">
>>>       <lst> <str name="q">fast_warm</str> <str name="start">0</str> <str name="rows">10</str> </lst>
>>>       <lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
>>>     </arr>
>>>   </listener>
>>>   <useColdSearcher>false</useColdSearcher>
>>>   <maxWarmingSearchers>4</maxWarmingSearchers>
>>> </query>
>>>
>>> Replication:
>>> The snappuller is scheduled to run every 15 mins for now.
>>>
>>> Hardware:
>>> AMD (2.1GHz) dual core with 2GB RAM, 160GB SATA hard drive
>>>
>>> OS:
>>> Fedora 8 (64-bit)
>>>
>>> JVM version:
>>> java version "1.7.0"
>>> IcedTea Runtime Environment (build 1.7.0-b21)
>>> IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)
>>>
>>> Java options:
>>> java -Djetty.home=/path/to/solr/home -d64 -Xms1536M -Xmx1536M
>>> -XX:+UseParallelGC -jar start.jar
>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Solr-indexing-configuration-help-tp17524364p17524364.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> --
> View this message in context:
> http://www.nabble.com/Solr-indexing-configuration-help-tp17524364p17526135.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>