It's hard to guess, but I might start by looking at what the new UpdateLog is 
costing you. Take its definition out of solrconfig.xml and try your test 
again. Then let's take it from there.
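
One way to make the before/after comparison repeatable is a small timing harness around the load. The following is only a sketch under assumptions: `IndexTimer`, `partition`, `timeLoad` and the `sender` callback are hypothetical names, and the callback stands in for whatever actually sends a batch (for example a wrapper around `server.add(docs, commitWithin)` that handles its checked exceptions). It uses only the JDK, so it measures nothing about Solr until a real sender is plugged in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class IndexTimer {

    /** Splits docs into consecutive groups of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    /**
     * Hands every batch to `sender` on a fixed-size thread pool and returns
     * the wall-clock time of the whole load in milliseconds.
     */
    public static <T> long timeLoad(List<T> docs, int batchSize, int threads,
                                    Consumer<List<T>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> sender.accept(batch)));
            }
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait for the batch; failures arrive wrapped
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException("batch failed", e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 40_000; i++) {
            docs.add(i);
        }
        // Placeholder sender that does nothing; replace it with the real call.
        long ms = timeLoad(docs, 1000, 4, batch -> { });
        System.out.println("40000 docs in " + ms + " ms");
    }
}
```

Running the same load once with the UpdateLog defined and once without, and comparing the two elapsed times, would show how much of the slowdown it accounts for.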

- Mark

On Jan 23, 2013, at 11:00 AM, Kevin Stone <kevin.st...@jax.org> wrote:

> I am having some difficulty migrating our Solr indexing scripts from Solr 
> 3.5 to Solr 4.0. Notably, I am trying to track down why indexing documents 
> in Solr 4.0 is about 5-10 times slower. Querying is still quite fast.
> 
> The code adds documents in groups of 1000 and sends each group to Solr in 
> its own thread. The documents are somewhat large, including maybe 30-40 
> different field types, mostly multivalued. Here are some snippets of the code 
> we used in 3.5.
> 
> 
> MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
> HttpClient client = new HttpClient(mgr);
> CommonsHttpSolrServer server = new CommonsHttpSolrServer("some url for our index", client);
> server.setRequestWriter(new BinaryRequestWriter());
> 
> 
> Then, we delete the index, and proceed to generate documents and load the 
> groups in a thread that looks kind of like this. I've omitted some overhead 
> for handling exceptions, and retry attempts.
> 
> 
> class DocWriterThread implements Runnable
> {
>    CommonsHttpSolrServer server;
>    Collection<SolrInputDocument> docs;
>    private int commitWithin = 50000; // 50 seconds
> 
>    public DocWriterThread(CommonsHttpSolrServer server, Collection<SolrInputDocument> docs)
>    {
>       this.server = server;
>       this.docs = docs;
>    }
> 
>    public void run()
>    {
>       // set the commitWithin feature
>       server.add(docs, commitWithin);
>    }
> }
> 
> 
> Now, I've had to change some things to get this to compile with the Solr 4.0 
> libraries. Here is what I tried to convert the above code to. I don't know if 
> these are the correct equivalents, as I am not familiar with Apache 
> HttpComponents.
> 
> 
> 
> ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();
> DefaultHttpClient client = new DefaultHttpClient(mgr);
> HttpSolrServer server = new HttpSolrServer("some url for our solr index", client);
> server.setRequestWriter(new BinaryRequestWriter());
> 
> 
> 
> 
> The thread method is the same, but uses HttpSolrServer instead of 
> CommonsHttpSolrServer.
> 
> We also had an old solrconfig (not sure what version, but it is pre 3.x and 
> had mostly default values) that I had to replace with a 4.0 style 
> solrconfig.xml. I don't want to post the entire file (as it is large), but I 
> copied one from the solr 4.0 examples, and made a couple changes. First, I 
> wanted to turn off transaction logging. So essentially I have a line like 
> this (everything inside is commented out):
> 
> 
> <updateHandler class="solr.DirectUpdateHandler2"></updateHandler>
> 
> 
> And I added a handler for javabin
> 
> 
> <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
>    <lst name="defaults">
>       <str name="stream.contentType">application/javabin</str>
>    </lst>
> </requestHandler>
> 
> I'm not sure what other configurations I should look at. I would think that 
> there should be a big obvious reason why the indexing performance would drop 
> nearly 10 fold.
> 
> Against our 3.5 instance I timed our index load, and it adds roughly 40,000 
> documents every 3-8 seconds.
> 
> Against our 4.0 instance it adds 40,000 documents every 70-75 seconds.
> 
> This isn't the end of the world, and I would love to use the new join feature 
> in solr 4.0. However, we have many different indexes with millions of 
> documents, and this kind of increase in load time is troubling.
> 
> 
> Thanks for your help.
> 
> 
> -Kevin

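For reference, the timings quoted above can be turned into throughput figures. The sketch below only does arithmetic on the numbers reported in the message (40,000 documents in 3-8 seconds on 3.5 and in 70-75 seconds on 4.0); the class and method names are hypothetical. Taking the closest and the farthest pair of timings, those figures work out to a slowdown of between 8.75x (70 s vs 8 s) and 25x (75 s vs 3 s).

```java
public class Throughput {

    /** Documents per second for a load of `docs` documents taking `seconds`. */
    public static double docsPerSecond(int docs, double seconds) {
        return docs / seconds;
    }

    /** How many times longer the slow load takes than the fast one. */
    public static double slowdown(double fastSeconds, double slowSeconds) {
        return slowSeconds / fastSeconds;
    }

    public static void main(String[] args) {
        int docs = 40_000;
        // Timings as reported in the message: 3-8 s on 3.5, 70-75 s on 4.0.
        System.out.printf("3.5: %.0f to %.0f docs/s%n",
                docsPerSecond(docs, 8), docsPerSecond(docs, 3));
        System.out.printf("4.0: %.0f to %.0f docs/s%n",
                docsPerSecond(docs, 75), docsPerSecond(docs, 70));
        // Smallest gap: slowest 3.5 run vs fastest 4.0 run; largest gap: the reverse.
        System.out.printf("slowdown: %.2fx to %.2fx%n",
                slowdown(8, 70), slowdown(3, 75));
    }
}
```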