Hi all. We're in the midst of upgrading from Solr 1.4 to 4.3.1, and we've run into client-side memory problems during a mass indexing operation.
We use the approach described on the SolrJ wiki at http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update. In the Solr 1.4 days this worked smoothly and reliably, consuming a fairly small amount of memory in our client, and we could send bulk updates in batches of 100,000 documents or more.

With Solr 4.3.1, however, it appears that HttpSolrServer.add(Iterator) drains the entire Iterator of SolrInputDocuments before opening the stream to the server. We index stories for a news site, and some of the documents run to tens of KB, so we now have to keep our batch size very small (in the hundreds) to avoid going OOM.

I've searched the wiki as well as Google in general, and I haven't found any other approach for Solr 4.x; the SolrJ wiki still recommends the Iterator approach for indexing large amounts of data. Does anyone know of another method that's as efficient as the old 1.4/3.6 streaming behavior? We'd really rather not send 10 million documents one at a time.
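For reference, here's roughly what our indexing loop looks like, boiled down to a minimal sketch. The URL, field names, and the hard-coded document source are placeholders; our real Iterator wraps a database cursor over the story table:

import java.util.Iterator;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Lazily produce documents so the whole batch never needs to sit
        // in memory at once. In 1.4 this streamed over a single open
        // connection; in 4.3.1 add(Iterator) seems to consume the whole
        // Iterator before sending anything.
        Iterator<SolrInputDocument> docs = new Iterator<SolrInputDocument>() {
            int count = 0;
            public boolean hasNext() { return count < 100000; }
            public SolrInputDocument next() {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "story-" + count++);
                doc.addField("title", "...");  // real story fields go here
                return doc;
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };

        server.add(docs);
        server.commit();
    }
}

Thanks! Terry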