Thanks Otis. This is very useful. I'll try all your suggestions and post my findings (and improvements).

Thanks,
-vivek

On Fri, Mar 27, 2009 at 7:08 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
>
> Hi,
>
> Answers inlined.
>
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> We have a distributed Solr system (2-3 boxes with each running 2
>> instances of Solr and each Solr instance can write to multiple cores).
>
> Is this really optimal?  How many CPU cores do your boxes have vs. the number
> of Solr cores?
>
>> Our use case is high index volume - we can get up to 100 million
>> records (1 record = 500 bytes) per day, but very low query traffic
>> (only administrators may need to search for data - once an hour or
>> so). So, we need very fast index time. Here are the things I'm trying
>> to find out in order to optimize our index process,
>
> It's starting to sound like you might be able to batch your data and use
> http://wiki.apache.org/solr/UpdateCSV -- it's the fastest indexing method, I
> believe.
>
>> 1) What's the optimum index size? I've noticed as the index size grows
>> the indexing time starts increasing. In our test less than 10G index
>> size we could index over 2K/sec, but as it grows over 20G the index
>> rate drops to 1400/sec and keeps dropping as index size grows. I'm
>> trying to see whether we can partition (create new SolrCore) after
>> 10G.
>
> That's likely due to Lucene's segment merging.  You can make mergeFactor
> bigger to make segment merging less frequent, but don't make it too high or
> you'll run into open file descriptor limits (which you could raise, of
> course).
>
>> - related question, is there a way to find the SolrCore size (any
>> web service for that?) - based on that information I can create a new
>> core and freeze the one which has reached 10G.
>
> You can see the number of docs in an index via the Admin Statistics page (the
> response is actually XML, look at the source)
>
>> 2) In our test, we noticed that after a few hours (after 8 hours of
>> indexing) there is a period (3-4 hours period) where the indexing is
>> very-very slow (like 500 records/sec) and after that period indexing
>> returns back to normal rate (1500/sec). Does Solr run any optimize
>> command on its own? How can we find that out? I'm not issuing any
>> optimize command - should I be doing that after certain time?
>
> No, it doesn't run optimize on its own.  It could be running auto-commit, but
> you should comment that out anyway.  Try doing a thread dump to see what's
> going on, and watch the system with top, vmstat.
> No, you shouldn't optimize until you are completely done.
>
>> 3) Every time I add new documents (10K at once) to the index I see
>> searcher closing and then re-opening/re-warming (in Catalina.out)
>> after commit is done. I'm not sure if this is an expensive operation.
>> Since our search volume is very low, can I configure Solr to not do
>> this? Would it make indexing any faster?
>
> Are you running the commit command after every 10K docs?  No need to do that
> if you don't need your searcher to see the changes immediately.
>
>> Mar 26, 2009 11:59:45 PM org.apache.solr.search.SolrIndexSearcher close
>> INFO: Closing searc...@33d9337c main
>> Mar 26, 2009 11:59:52 PM org.apache.solr.update.DirectUpdateHandler2 commit
>> INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
>> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher
>> INFO: Opening searc...@46ba6905 main
>> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher warm
>> INFO: autowarming searc...@46ba6905 main from searc...@5c5ffecd main
>>
>> 4) Anything else (any other configuration in Solr - I'm currently
>> using all default settings in the solrconfig.xml and default handlers)
>> that could help optimize my indexing process?
>
> Increase ramBufferSizeMB as much as you can afford.
> Comment out maxBufferedDocs, it's deprecated.
> Increase mergeFactor slightly.
> Consider the CSV approach.
> Index with multiple threads (match the number of CPU cores).
> If you are using Solrj, use the Streaming version of SolrServer.
> Give the JVM more memory (you'll need it if you increase ramBufferSizeMB)
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
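
For reference, here is a rough sketch of how three of the suggestions above might fit together: batching records as CSV, posting the batches from several threads, and committing only once at the end instead of after every 10K docs. It is only an illustration under assumptions, not a tested setup. The base URL, the field names, and all helper names (to_csv, batches, post_csv, send_commit, index_all) are hypothetical; the /update/csv path follows the UpdateCSV wiki page linked above, and the final commit is sent as an XML <commit/> message to /update.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from urllib.request import Request, urlopen

# Assumption: a single Solr instance at this address; adjust per core/box.
SOLR_URL = "http://localhost:8983/solr"


def to_csv(fieldnames, rows):
    """Render a header line plus the given rows as one CSV document."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fieldnames)
    writer.writerows(rows)
    return buf.getvalue()


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_csv(body):
    """POST one CSV batch to the CSV update handler, without committing."""
    req = Request(SOLR_URL + "/update/csv?commit=false",
                  data=body.encode("utf-8"),
                  headers={"Content-Type": "text/plain; charset=utf-8"})
    with urlopen(req) as resp:
        resp.read()


def send_commit():
    """Issue a single explicit commit through the XML update handler."""
    req = Request(SOLR_URL + "/update", data=b"<commit/>",
                  headers={"Content-Type": "text/xml; charset=utf-8"})
    with urlopen(req) as resp:
        resp.read()


def index_all(fieldnames, rows, batch_size=10000, workers=4,
              post=post_csv, commit=send_commit):
    """Send all rows in CSV batches from `workers` threads, then commit once.

    Returns the number of batches sent. At most 2 * workers batches are
    in flight at a time, so the whole data set is never held in memory.
    """
    sent = 0
    pending = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in batches(rows, batch_size):
            pending.append(pool.submit(post, to_csv(fieldnames, chunk)))
            sent += 1
            if len(pending) >= workers * 2:
                pending.pop(0).result()  # wait for, and surface errors from, the oldest batch
        for future in pending:
            future.result()
    commit()  # one commit at the very end, so the searcher is reopened only once
    return sent
```

Something like index_all(["id", "msg"], record_iterator, workers=8) would be the entry point, with workers set to roughly the number of CPU cores on the indexing box. The post and commit parameters are there so the transport can be swapped out or stubbed for a dry run.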