Hi, Answers inlined.
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> We have a distributed Solr system (2-3 boxes with each running 2
> instances of Solr and each Solr instance can write to multiple cores).

Is this really optimal? How many CPU cores do your boxes have vs. the number of Solr cores?

> Our use case is high index volume - we can get up to 100 million
> records (1 record = 500 bytes) per day, but very low query traffic
> (only administrators may need to search for data - once an hour our
> so). So, we need very fast index time. Here are the things I'm trying
> to find out in order to optimize our index process,

It's starting to sound like you might be able to batch your data and use http://wiki.apache.org/solr/UpdateCSV -- it's the fastest indexing method, I believe.

> 1) What's the optimum index size? I've noticed as the index size grows
> the indexing time starts increasing. In our test less than 10G index
> size we could index over 2K/sec, but as it grows over 20G the index
> rate drops to 1400/sec and keeps dropping as index size grows. I'm
> trying to see whether we can partition (create new SolrCore) after
> 10G.

That's likely due to Lucene's segment merging. You can make mergeFactor bigger to make segment merging less frequent, but don't make it too high or you'll run into open file descriptor limits (which you could raise, of course).

> - related question, is there a way to find the SolrCore size (any
> web service for that?) - based on that information I can create a new
> core and freeze the one which has reached 10G.

You can see the number of docs in an index via the Admin Statistics page (the response is actually XML, look at the source).

> 2) In our test, we noticed that after few hours (after 8 hours of
> indexing) there is a period (3-4 hours period) where the indexing is
> very-very slow (like 500 records/sec) and after that period indexing
> returns back to normal rate (1500/sec). Does Solr run any optimize
> command on its own? How can we find that out? I'm not issuing any
> optimize command - should I be doing that after certain time?

No, it doesn't run optimize on its own. It could be running auto-commit, but you should comment that out anyway. Try doing a thread dump to see what's going on, and watch the system with top and vmstat. No, you shouldn't optimize until you are completely done.

> 3) Every time I add new documents (10K at once) to the index I see
> searcher closing and then re-opening/re-warming (in Catalina.out)
> after commit is done. I'm not sure if this is an expensive operation.
> Since, our search volume is very low can I configure Solr to not do
> this? Would it make indexing any faster?

Are you running the commit command after every 10K docs? No need to do that if you don't need your searcher to see the changes immediately.

> Mar 26, 2009 11:59:45 PM org.apache.solr.search.SolrIndexSearcher close
> INFO: Closing searc...@33d9337c main
> Mar 26, 2009 11:59:52 PM org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher
> INFO: Opening searc...@46ba6905 main
> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher warm
> INFO: autowarming searc...@46ba6905 main from searc...@5c5ffecd main
>
> 4) Anything else (any other configuration in Solr - I'm currently
> using all default settings in the solrconfig.xml and default handlers)
> that could help optimize my indexing process?

Increase ramBufferSizeMB as much as you can afford. Comment out maxBufferedDocs, it's deprecated. Increase mergeFactor slightly. Consider the CSV approach. Index with multiple threads (match the number of CPU cores). If you are using Solrj, use the Streaming version of SolrServer. Give the JVM more memory (you'll need it if you increase ramBufferSizeMB).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
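
P.S. In case it helps, here is a minimal Python sketch of the batch-and-commit-once idea from answers 3 and 4, using the CSV handler. It is only an illustration under assumptions: the base URL, the field names and the helper names (batches, to_csv, update_url, post_csv, index_all) are all made up for the example, and the /update/csv path with the commit=true parameter assumes the CSV handler is registered the way the UpdateCSV wiki page describes. Adjust to your setup.

```python
import csv
import io
import urllib.request
from itertools import islice


def batches(records, size):
    """Yield lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_csv(rows, fieldnames):
    """Serialize dict rows to CSV text; the header line names the fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def update_url(base_url, commit):
    """Build the CSV update URL (assumed path); add commit=true only on request."""
    url = base_url.rstrip("/") + "/update/csv"
    return url + "?commit=true" if commit else url


def post_csv(base_url, body, commit=False):
    """POST one CSV batch to Solr and return the HTTP status code."""
    req = urllib.request.Request(
        update_url(base_url, commit),
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_all(base_url, records, fieldnames, batch_size=10000, post=post_csv):
    """Send every batch without committing, except the last one.

    Holding one batch back lets the final POST carry commit=true, so the
    searcher is reopened and warmed once per run instead of once per batch.
    Returns the number of records sent.
    """
    sent = 0
    pending = None
    for chunk in batches(records, batch_size):
        if pending is not None:
            post(base_url, to_csv(pending, fieldnames), commit=False)
            sent += len(pending)
        pending = chunk
    if pending is not None:
        post(base_url, to_csv(pending, fieldnames), commit=True)
        sent += len(pending)
    return sent
```

For example, index_all("http://localhost:8983/solr", records, ["id", "text"]) would send 10K-document batches and commit only with the last one. The localhost URL and the id/text fields are placeholders, not something taken from your setup.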