Hi all Solr users/developers/experts,

I have the following scenario and would appreciate any advice on tuning my Solr master server.

I have a field in my schema that is indexed (but not stored) and holds about 10,000 ids for each document. This field is expected to largely determine the size of the document. Each id can contain up to 6 characters. I figure that there are two alternatives for this field: one is to use a multi-valued string field, and the other is to pass a whitespace-delimited string to Solr and have Solr tokenize it on whitespace (the text_ws fieldType).

The master server is expected to receive a constant stream of updates. The estimated document size ranges from 50k to 100k for a single document (I know this is quite large). The number of documents is expected to be around 200,000 on each master server, and there can be multiple master servers (sharding). I would like the master to handle more docs too, if I can figure out a way.

Currently, I’m performing some basic stress tests to simulate the indexing side on the master server. The stress test continuously adds new documents at a rate of about 10 documents every 30 seconds. Autocommit is being used (50 docs and 180 seconds constraints), but I have no idea if this is the preferred way. The goal is to keep adding new documents until we get at least 200,000 documents (or about 20GB of index) on the master (or even more if the server can handle it).

What I experienced from the indexing stress test is that the master server stopped responding after a while (for example, it became non-pingable) once there were about 30k documents. The log entries are mostly:

java.lang.OutOfMemoryError: Java heap space

OR

Ping query caused exception: null (this is probably caused by the OOM problem)

There were also a few cases where the Java process went away entirely.

Questions:

1) Is it better to use the multi-valued string field or the text_ws field for this large field?

2) Is it better to have more outstanding docs per commit or more frequent commits, in terms of making the best use of server resources? What is the preferred way to commit documents, assuming that the Solr master receives updates frequently? How many updated docs should there be before issuing a commit?

3) How can I avoid the OOM problem in my case? I’m already using (-Xms1536M -Xmx1536M) on a 2GB machine. Is that not enough? I’m concerned that adding more RAM would just delay the OOM problem. Are there any additional JVM options to consider?

4) Any recommendation for the master server configuration, so that I can maximize the number of indexed docs?

5) How can I disable caching on the master altogether, as queries won’t hit the master?

6) For an average doc size of 50k-100k, is that too large for Solr? Is Solr even the right tool? If not, are there any alternatives? If we are able to reduce the size of the docs, can we expect to index more documents?

The following is information related to the software/hardware/configuration:

Solr version (Solr nightly build on 5/23/2008):
Solr Specification Version: 1.2.2008.05.23.08.06.59
Solr Implementation Version: nightly
Lucene Specification Version: 2.3.2
Lucene Implementation Version: 2.3.2 652650
Jetty: 6.1.3

Schema.xml (the sections that I think are relevant to the master server):

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="hex_id_multi" type="string" indexed="true" stored="false" multiValued="true" omitNorms="true"/>
<field name="hex_id_string" type="text_ws" indexed="true" stored="false" omitNorms="true"/>

<uniqueKey>id</uniqueKey>

Solrconfig.xml

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>500</maxBufferedDocs>
  <ramBufferSizeMB>50</ramBufferSizeMB>
  <maxMergeDocs>5000</maxMergeDocs>
  <maxFieldLength>20000</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>10000</commitLockTimeout>
  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
  <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
  <lockType>single</lockType>
</indexDefaults>

<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>50</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
  <!-- Deprecated -->
  <maxBufferedDocs>500</maxBufferedDocs>
  <maxMergeDocs>5000</maxMergeDocs>
  <maxFieldLength>20000</maxFieldLength>
  <unlockOnStartup>false</unlockOnStartup>
</mainIndex>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>50</maxDocs>
    <maxTime>180000</maxTime>
  </autoCommit>
  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">solr/bin/snapshooter</str>
    <str name="dir">.</str>
    <bool name="wait">true</bool>
  </listener>
</updateHandler>

<query>
  <maxBooleanClauses>50</maxBooleanClauses>
  <filterCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <queryResultWindowSize>1</queryResultWindowSize>
  <queryResultMaxDocsCached>1</queryResultMaxDocsCached>
  <HashDocSet maxSize="1000" loadFactor="0.75"/>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">user_id</str>
        <str name="start">0</str>
        <str name="rows">1</str>
      </lst>
      <lst><str name="q">static newSearcher warming query from solrconfig.xml</str></lst>
    </arr>
  </listener>
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">fast_warm</str>
        <str name="start">0</str>
        <str name="rows">10</str>
      </lst>
      <lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
    </arr>
  </listener>
  <useColdSearcher>false</useColdSearcher>
  <maxWarmingSearchers>4</maxWarmingSearchers>
</query>

Replication:
The snappuller is scheduled to run every 15 mins for now.

Hardware:
AMD (2.1GHz) dual core with 2GB ram
160GB SATA harddrive

OS:
Fedora 8 (64-bit)

JVM version:
java version "1.7.0"
IcedTea Runtime Environment (build 1.7.0-b21)
IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)

Java options:
java -Djetty.home=/path/to/solr/home -d64 -Xms1536M -Xmx1536M -XX:+UseParallelGC -jar start.jar

--
View this message in context: http://www.nabble.com/Solr-indexing-configuration-help-tp17524364p17524364.html
Sent from the Solr - User mailing list archive at Nabble.com.
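
P.S. For reference, the figures quoted in this post imply the following rough numbers. This is only a back-of-envelope sketch: it assumes one byte per character and a single space between ids, it works from average rates rather than exact batch timing, and the constant and function names are made up for illustration. It says nothing about the size of the resulting index, only about the raw input.

```python
# Back-of-envelope arithmetic for the scenario described in this post.
# All inputs are figures quoted above; names are illustrative only.

IDS_PER_DOC = 10_000           # ids indexed per document
MAX_ID_CHARS = 6               # each id has up to 6 characters
DOCS_TARGET = 200_000          # target documents per master
DOCS_PER_BATCH = 10            # the stress test adds about 10 docs ...
BATCH_INTERVAL_S = 30          # ... every 30 seconds
AUTOCOMMIT_MAX_DOCS = 50       # <maxDocs> in the autoCommit section
AUTOCOMMIT_MAX_TIME_S = 180    # <maxTime> of 180000 ms


def id_field_bytes(ids=IDS_PER_DOC, chars=MAX_ID_CHARS):
    """Upper bound on the whitespace-delimited id string (assumes 1 byte per char)."""
    return ids * chars + (ids - 1)  # all ids plus single-space separators


def seconds_to_reach(target=DOCS_TARGET):
    """Time for the stress test to add `target` docs at the average rate."""
    return target / DOCS_PER_BATCH * BATCH_INTERVAL_S


def seconds_until_maxdocs_commit():
    """Approximate time to accumulate AUTOCOMMIT_MAX_DOCS pending docs."""
    return AUTOCOMMIT_MAX_DOCS / DOCS_PER_BATCH * BATCH_INTERVAL_S


if __name__ == "__main__":
    print("id string per doc (bytes):", id_field_bytes())
    print("days to reach target:", seconds_to_reach() / 86_400)
    print("seconds until maxDocs commit:", seconds_until_maxdocs_commit())
    print("maxDocs fires before maxTime:",
          seconds_until_maxdocs_commit() < AUTOCOMMIT_MAX_TIME_S)
```

With these inputs the id string alone comes to just under 70 KB per document, reaching 200,000 documents takes roughly a week at the stress-test rate, and about 50 documents accumulate every 150 seconds, so the maxDocs limit would be expected to trigger before the 180-second maxTime limit.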