Thanks Erick,
I think I have been able to exhaust a resource: if I split the data in 2 and upload it with 2 clients, with settings like benchmark 1.1, it takes 120s, and here the bottleneck is my LAN. If I use settings like benchmark 1, the bottleneck is probably the ramBuffer.
I'm going to buy a Gigabit ethernet cable so I can run a better test.

About the OutOfMemory error: it's the solrj client that crashes. I'm using solr 4.2.1 and the corresponding solrj client. HttpSolrServer works fine, ConcurrentUpdateSolrServer gives me problems, and I didn't understand how to size the queueSize parameter optimally.
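For reference, the client is roughly along these lines (a simplified sketch, not my exact code: the URL and field values are illustrative, and I've put a much smaller queueSize here than the 20k of benchmark 3, since as far as I understand queueSize counts queued update requests rather than documents, so 20k queued batches of 1k docs could mean an enormous amount of client memory):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ConcurrentBulkIndexer {

        public static void main(String[] args) throws Exception {
            // queueSize = 100 pending update requests, 4 sender threads.
            // With 1k-doc batches this bounds the client-side buffer to ~100k docs.
            // URL and sizing are illustrative, not the benchmark values.
            ConcurrentUpdateSolrServer server =
                    new ConcurrentUpdateSolrServer("http://solrhost:8983/solr/collection1", 100, 4);

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", i);
                doc.addField("title", "title " + i);
                doc.addField("date", new java.util.Date());
                doc.addField("body", "1kb of text...");   // placeholder body
                batch.add(doc);
                if (batch.size() == 1000) {
                    server.add(batch);   // queues the batch; waits when the internal queue is full
                    batch = new ArrayList<SolrInputDocument>(1000);
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.blockUntilFinished();  // drain the queue before the final commit
            server.commit();
            server.shutdown();
        }
    }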
On 07/Oct/2013, at 14:03, Erick Erickson wrote:

> Just skimmed, but the usual reason you can't max out the server
> is that the client can't go fast enough. Very quick experiment:
> comment out the server.add line in your client and run it again,
> does that speed up the client substantially? If not, then the time
> is being spent on the client.
>
> Or split your csv file into, say, 5 parts and run it from 5 different
> PCs in parallel.
>
> bq: I can't rely on auto commit, otherwise I get an OutOfMemory error
> This shouldn't be happening, I'd get to the bottom of this. Perhaps simply
> allocating more memory to the JVM running Solr.
>
> bq: committing every 100k docs gives worse performance
> It'll be best to specify openSearcher=false for max indexing throughput,
> BTW. You should be able to do this quite frequently, 15 seconds seems
> quite reasonable.
>
> Best,
> Erick
>
> On Sun, Oct 6, 2013 at 12:19 PM, Matteo Grolla <matteo.gro...@gmail.com>
> wrote:
>> I'd like to have some suggestions on how to improve the indexing performance
>> in the following scenario.
>> I'm uploading 1M docs to solr;
>> every doc has
>>   id: sequential number
>>   title: small string
>>   date: date
>>   body: 1kb of text
>>
>> Here are my benchmarks (they are all single executions, not averages from
>> multiple executions):
>>
>> 1) using the updaterequesthandler
>> and streaming docs from a csv file on the same disk as solr,
>> auto commit every 15s with openSearcher=false and commit after the last
>> document
>>
>> total time: 143035ms
>>
>> 1.1) using the updaterequesthandler
>> and streaming docs from a csv file on the same disk as solr,
>> auto commit every 15s with openSearcher=false and commit after the last
>> document
>> <ramBufferSizeMB>500</ramBufferSizeMB>
>> <maxBufferedDocs>100000</maxBufferedDocs>
>>
>> total time: 134493ms
>>
>> 1.2) using the updaterequesthandler
>> and streaming docs from a csv file on the same disk as solr,
>> auto commit every 15s with openSearcher=false and commit after the last
>> document
>> <mergeFactor>30</mergeFactor>
>>
>> total time: 143134ms
>>
>> 2) using a solrj client from another pc on the lan (100Mbps)
>> with httpsolrserver
>> with javabin format
>> add documents to the server in batches of 1k docs ( server.add( <collection> ) )
>> auto commit every 15s with openSearcher=false and commit after the last
>> document
>>
>> total time: 139022ms
>>
>> 3) using a solrj client from another pc on the lan (100Mbps)
>> with concurrentupdatesolrserver
>> with javabin format
>> add documents to the server in batches of 1k docs ( server.add( <collection> ) )
>> server queue size=20k
>> server threads=4
>> no auto-commit and commit every 100k docs
>>
>> total time: 167301ms
>>
>>
>> --On the solr server--
>> cpu averages 25%, at best 100% for 1 core
>> IO is still far from being saturated
>> iostat gives a pattern like this (every 5s):
>>
>> time(s)  %util
>> 100      45,20
>> 105       1,68
>> 110      17,44
>> 115      76,32
>> 120       2,64
>> 125      68,00
>> 130       1,28
>>
>> I thought that by using concurrentupdatesolrserver I would be able to max out
>> cpu or IO, but I wasn't.
>> With concurrentupdatesolrserver I can't rely on auto commit, otherwise I get
>> an OutOfMemory error,
>> and I found that committing every 100k docs gives worse performance than
>> auto commit every 15s (benchmark 3 with httpsolrserver took 193515ms).
>>
>> I'd really like to understand why I can't max out the resources on the
>> server hosting solr (the disk above all),
>> and I'd really like to understand what I'm doing wrong with
>> concurrentupdatesolrserver.
>>
>> thanks
>>
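P.S. For completeness, the plain HttpSolrServer client of benchmark 2 is roughly along these lines (again a simplified sketch, not the exact benchmark code; URL and field values are illustrative). Commenting out the server.add(batch) line, as Erick suggests, shows how much time the client spends just building documents:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class HttpBatchIndexer {

        public static void main(String[] args) throws Exception {
            // autoCommit every 15s with openSearcher=false is configured server-side
            // in solrconfig.xml, so the client only commits once at the end.
            HttpSolrServer server = new HttpSolrServer("http://solrhost:8983/solr/collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", i);
                doc.addField("title", "title " + i);
                doc.addField("date", new java.util.Date());
                doc.addField("body", "1kb of text...");   // placeholder body
                batch.add(doc);
                if (batch.size() == 1000) {
                    server.add(batch);   // comment this out to measure pure client-side cost
                    batch = new ArrayList<SolrInputDocument>(1000);
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();      // commit after the last document
            server.shutdown();
        }
    }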