Thanks Otis. This is very useful. I'll try all your suggestions and
post my findings (and improvements).

Thanks,
-vivek

On Fri, Mar 27, 2009 at 7:08 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
>
> Hi,
>
> Answers inlined.
>
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>>   We have a distributed Solr system (2-3 boxes with each running 2
>> instances of Solr and each Solr instance can write to multiple cores).
>
> Is this really optimal?  How many CPU cores do your boxes have vs. the number 
> of Solr cores?
>
>> Our use case is high index volume - we can get up to 100 million
>> records (1 record = 500 bytes) per day, but very low query traffic
>> (only administrators may need to search for data - once an hour or
>> so). So, we need very fast index time. Here are the things I'm trying
>> to find out in order to optimize our index process,
>
> It's starting to sound like you might be able to batch your data and use 
> http://wiki.apache.org/solr/UpdateCSV -- it's the fastest indexing method, I 
> believe.
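A minimal sketch of the batching idea above, assuming a Solr instance at
http://localhost:8983/solr with the CSV handler mapped at /update/csv (as the
UpdateCSV wiki page describes) and two hypothetical fields, "id" and
"message". It only builds the request; sending it is left as a comment.

```python
import csv
import io
import urllib.parse
import urllib.request


def build_csv_body(field_names, records):
    """Serialize a batch of records (dicts) into CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def build_csv_update_request(solr_url, csv_body, commit=False):
    """Build (but do not send) a POST request for the CSV update handler.

    solr_url is an assumption, e.g. "http://localhost:8983/solr".
    """
    query = urllib.parse.urlencode({"commit": "true" if commit else "false"})
    return urllib.request.Request(
        url=f"{solr_url}/update/csv?{query}",
        data=csv_body.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


# Hypothetical usage; the field names are placeholders for the real schema.
# body = build_csv_body(["id", "message"], [{"id": "1", "message": "hello"}])
# request = build_csv_update_request("http://localhost:8983/solr", body)
# urllib.request.urlopen(request)  # needs a running Solr instance
```

Leaving commit off by default fits the advice further down the thread about
not committing after every batch.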
>
>> 1) What's the optimum index size? I've noticed as the index size grows
>> the indexing time starts increasing. In our test less than 10G index
>> size we could index over 2K/sec, but as it grows over 20G the index
>> rate drops to 1400/sec and keeps dropping as index size grows. I'm
>> trying to see whether we can partition (create new SolrCore) after
>> 10G.
>
> That's likely due to Lucene's segment merging. You can make mergeFactor 
> bigger to make segment merging less frequent, but don't make it too high or 
> you'll run into open file descriptor limits (which you could raise, of 
> course).
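To illustrate the trade-off described above, here is a toy model of
level-based segment merging. It is not Lucene's actual merge policy code: it
assumes every flush creates one unit-sized segment and that mergeFactor
segments of the same level are merged into one segment of the next level. The
function name and the two returned metrics are made up for this sketch.

```python
def simulate_merges(num_flushes, merge_factor):
    """Toy model: return (units rewritten by merges, peak segment count).

    levels[i] holds the number of live segments at level i; a level-i
    segment contains merge_factor ** i flushed units.
    """
    levels = [0]
    merged_units = 0
    peak_segments = 0
    for _ in range(num_flushes):
        levels[0] += 1  # one new segment per flush
        level = 0
        # Cascade: a full level collapses into one segment of the next level.
        while levels[level] == merge_factor:
            levels[level] = 0
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            merged_units += merge_factor ** (level + 1)
            level += 1
        # Segment count once the flush and its merges have finished.
        peak_segments = max(peak_segments, sum(levels))
    return merged_units, peak_segments


if __name__ == "__main__":
    for factor in (10, 20):
        print(factor, simulate_merges(100, factor))
```

In this model a larger merge factor rewrites less data but leaves more
segments on disk at once, which is where the open file descriptor limit comes
in.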
>
>>      - related question, is there a way to find the SolrCore size (any
>> web service for that?) - based on that information I can create a new
>> core and freeze the one which has reached 10G.
>
> You can see the number of docs in an index via the Admin Statistics page (the 
> response is actually XML, look at the source)
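A rough sketch of reading the document count out of that XML. The layout is
an assumption: it looks for any <stat> element whose name attribute is
"numDocs", which is how the statistics page appeared to report it at the
time. Fetching the page is left out; the /admin/stats.jsp path in the comment
is likewise an assumption.

```python
import xml.etree.ElementTree as ET


def find_num_docs(stats_xml):
    """Return the first numDocs value found in the statistics XML, or None.

    stats_xml is the response body (str or bytes). The element and
    attribute names are assumed, not taken from a published schema.
    """
    root = ET.fromstring(stats_xml)
    for stat in root.iter("stat"):
        if stat.get("name") == "numDocs":
            text = (stat.text or "").strip()
            if text:
                return int(text)
    return None


# Hypothetical usage against a running instance:
# import urllib.request
# url = "http://localhost:8983/solr/admin/stats.jsp"
# with urllib.request.urlopen(url) as response:
#     print(find_num_docs(response.read()))
```

This gives a document count, not a size in bytes. With records of roughly
500 bytes each, the count can only serve as a rough proxy for index size.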
>
>> 2) In our test, we noticed that after a few hours (after 8 hours of
>> indexing) there is a period (3-4 hours period) where the indexing is
>> very-very slow (like 500 records/sec) and after that period indexing
>> returns back to normal rate (1500/sec). Does Solr run any optimize
>> command on its own? How can we find that out?  I'm not issuing any
>> optimize command - should I be doing that after certain time?
>
> No, it doesn't run optimize on its own.  It could be running auto-commit, but 
> you should comment that out anyway.  Try doing a thread dump to see what's 
> going on and watching the system with top, vmstat.
> No, you shouldn't optimize until you are completely done.
>
>> 3) Every time I add new documents (10K at once) to the index I see
>> searcher closing and then re-opening/re-warming (in Catalina.out)
>> after commit is done. I'm not sure if this is an expensive operation.
>> Since our search volume is very low can I configure Solr to not do
>> this? Would it make indexing any faster?
>
> Are you running the commit command after every 10K docs?  No need to do that 
> if you don't need your searcher to see the changes immediately.
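One way to act on this is to keep sending 10K batches but commit only every
so often, plus once at the end. The sketch below assumes the caller supplies
two callables, send_batch and commit (hypothetical names), so it is
independent of the client library; the default of 50 batches per commit is an
arbitrary placeholder, not a recommendation.

```python
from itertools import islice


def chunks(iterable, batch_size):
    """Yield lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def index_with_sparse_commits(docs, send_batch, commit,
                              batch_size=10_000, commit_every=50):
    """Send docs in batches, committing once per commit_every batches.

    A final commit covers any batches sent since the last one.
    Returns the number of documents sent.
    """
    sent = 0
    pending_batches = 0
    for batch in chunks(docs, batch_size):
        send_batch(batch)
        sent += len(batch)
        pending_batches += 1
        if pending_batches >= commit_every:
            commit()
            pending_batches = 0
    if pending_batches:
        commit()
    return sent
```

Each commit triggers the searcher close, reopen and warm cycle shown in the
log lines below, so fewer commits means fewer of those cycles.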
>
>> Mar 26, 2009 11:59:45 PM org.apache.solr.search.SolrIndexSearcher close
>> INFO: Closing searc...@33d9337c main
>> Mar 26, 2009 11:59:52 PM org.apache.solr.update.DirectUpdateHandler2 commit
>> INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
>> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher
>> INFO: Opening searc...@46ba6905 main
>> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher warm
>> INFO: autowarming searc...@46ba6905 main from searc...@5c5ffecd main
>>
>> 4) Anything else (any other configuration in Solr - I'm currently
>> using all default settings in the solrconfig.xml and default handlers)
>> that could help optimize my indexing process?
>
> Increase ramBufferSizeMB as much as you can afford.
> Comment out maxBufferedDocs, it's deprecated.
> Increase mergeFactor slightly.
> Consider the CSV approach.
> Index with multiple threads (match the number of CPU cores).
> If you are using Solrj, use the Streaming version of SolrServer.
> Give the JVM more memory (you'll need it if you increase ramBufferSizeMB)
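The multi-threaded suggestion in the list above could be sketched as follows.
It assumes the documents are already split into batches and that send_batch
(a hypothetical callable) posts one batch and returns how many documents it
sent. The worker count defaults to the CPU count of the machine running this
script, which is only a stand-in; the count that matters may be that of the
Solr boxes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def index_in_parallel(batches, send_batch, workers=None):
    """Send batches from a pool of threads and return the total doc count.

    workers defaults to os.cpu_count() on the local machine (an assumption).
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map submits every batch up front, so batches should be a
        # bounded collection rather than an endless stream.
        return sum(pool.map(send_batch, batches))
```

An exception raised by any send_batch call propagates to the caller when the
results are summed, so a failed batch does not go unnoticed.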
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
