Re: Performance potential for updating (reindexing) documents

2016-04-02 Thread Shawn Heisey
On 4/1/2016 8:56 PM, Erick Erickson wrote:
> bq: The bottleneck is definitely Solr.
>
> Since you commented out the server.add(doclist), you're right to focus
> there. I've seen a few things that help.
>
> 1> batch the documents, i.e. in the doclist above the list should be
> on the order of 1,00

Re: Performance potential for updating (reindexing) documents

2016-04-01 Thread Erick Erickson
Shawn: bq: The bottleneck is definitely Solr. Since you commented out the server.add(doclist), you're right to focus there. I've seen a few things that help. 1> batch the documents, i.e. in the doclist above the list should be on the order of 1,000 docs. Here are some numbers I worked up one tim
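
The batching advice in point 1> can be sketched roughly as follows. This is a minimal illustration, not code from the thread: `BatchingSketch` and `sendInBatches` are hypothetical names, and the `sender` callback stands in for whatever actually ships a batch to Solr (in the thread, the `server.add(doclist)` call). The batch size of 1,000 is the order of magnitude suggested above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of SolrJ: groups documents into
// fixed-size batches and hands each full batch to a sender callback.
class BatchingSketch {

    // Returns the number of batches handed to the sender.
    static <T> int sendInBatches(Iterable<T> docs, int batchSize, Consumer<List<T>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batchesSent = 0;
        for (T doc : docs) {
            batch.add(doc);
            if (batch.size() == batchSize) {
                sender.accept(batch);               // one request per full batch
                batch = new ArrayList<>(batchSize); // start a fresh list
                batchesSent++;
            }
        }
        if (!batch.isEmpty()) {                     // flush the final partial batch
            sender.accept(batch);
            batchesSent++;
        }
        return batchesSent;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add(i);                            // stand-ins for real documents
        }
        List<Integer> sizes = new ArrayList<>();
        int sent = sendInBatches(docs, 1000, b -> sizes.add(b.size()));
        System.out.println(sent + " batches, sizes " + sizes);
        // prints: 3 batches, sizes [1000, 1000, 500]
    }
}
```

In a real indexer the sender would presumably wrap the `server.add(doclist)` call, including handling of its checked exceptions, so that each network round trip carries a full batch rather than a single document.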

Re: Performance potential for updating (reindexing) documents

2016-03-31 Thread Shawn Heisey
On 3/24/2016 11:57 AM, tedsolr wrote:
> My post was scant on details. The numbers I gave for collection sizes are
> projections for the future. I am in the midst of an upgrade that will be
> completed within a few weeks. My concern is that I may not be able to
> produce the throughput necessary to

Re: Performance potential for updating (reindexing) documents

2016-03-24 Thread Erick Erickson
Well, for comparison I routinely get 20K docs/second on my Mac Pro indexing Wikipedia docs. I _think_ I have 4 shards when I do this, all in the same JVM. I'd be surprised if you can't get your 5K docs/sec, but you may indeed need more shards. All that said, 4G for the JVM is kind of constrained,

Re: Performance potential for updating (reindexing) documents

2016-03-24 Thread tedsolr
Hi Erick, My post was scant on details. The numbers I gave for collection sizes are projections for the future. I am in the midst of an upgrade that will be completed within a few weeks. My concern is that I may not be able to produce the throughput necessary to index an entire collection quickly

Re: Performance potential for updating (reindexing) documents

2016-03-24 Thread Erick Erickson
Impossible to say if for no other reason than you haven't told us how many physical machines this is spread over ;). For the process you've outlined to work, all the fields are stored, right? So why not use Atomic Updates? You still have to query the docs. About querying. If I'm reading this righ
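
For readers unfamiliar with the Atomic Updates mentioned above: instead of resending a whole document, the update carries the unique key plus, for each changed field, a small map from a modifier such as `set` to the new value. The sketch below only builds that shape with plain Java maps. `AtomicUpdateSketch`, `atomicSet`, and the field names `id` and `status_s` are hypothetical; in SolrJ the same modifier map would be supplied as the field's value on the input document.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the shape of a Solr atomic update:
// the unique key plus {"set": newValue} for the field being changed.
class AtomicUpdateSketch {

    static Map<String, Object> atomicSet(String idField, String id,
                                         String field, Object newValue) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(idField, id);                                      // which document to change
        doc.put(field, Collections.singletonMap("set", newValue)); // replace this field's value
        return doc;
    }

    public static void main(String[] args) {
        // Field names here are assumptions made for the example.
        System.out.println(atomicSet("id", "doc-1", "status_s", "reviewed"));
        // prints: {id=doc-1, status_s={set=reviewed}}
    }
}
```

As the message notes, this approach relies on the other fields being stored, since Solr has to rebuild the rest of the document on the server side.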