Hi all.  We're in the midst of upgrading from Solr 1.4 to 4.3.1, and we've run 
into issues with memory on our client side during a mass index operation.

We use the approach described on the SolrJ wiki at 
http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update.
In the Solr 1.4 days this worked very smoothly and reliably, consuming a fairly 
small amount of memory within our client.  We could send bulk updates in 
batches of 100,000 or more.
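Concretely, we hand add() a lazy Iterator that builds each document on demand, so only one document needs to be alive on the client at a time. A minimal stdlib-only sketch of that iterator (a plain String stands in for SolrInputDocument, and the doc-building logic is illustrative):

```java
import java.util.Iterator;

public class LazyDocs implements Iterator<String> {
    // Builds each document on demand instead of materializing the whole
    // batch up front, which is what kept client memory flat in 1.4.
    private final int total;
    private int next = 0;

    public LazyDocs(int total) {
        this.total = total;
    }

    @Override
    public boolean hasNext() {
        return next < total;
    }

    @Override
    public String next() {
        // In our real code this constructs a SolrInputDocument from the
        // story database; a String stands in here.
        return "doc-" + (next++);
    }
}
```

In the 1.4 days, passing an iterator like this to add() meant documents were serialized onto the HTTP stream as fast as they were produced, never all at once.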

With Solr 4.3 it appears that the HttpSolrServer.add(Iterator) method drains the entire Iterator, buffering all of the SolrInputDocuments in client memory, before it ever opens the stream to the server.

We're indexing stories for a news site, and some of the documents are tens of 
KB.  With Solr 4.3 we need to keep our transaction batch size very small (in 
the hundreds) to avoid going OOM.
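For reference, the workaround we've landed on is a batching loop that drains the lazy iterator in small fixed-size chunks. A stdlib-only sketch (the flush callback stands in for HttpSolrServer.add(Collection), and the batch size of 500 is illustrative):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class BatchedIndexer {
    // Drains a lazy document iterator in fixed-size batches so that at most
    // batchSize documents are held in client memory at once. Returns the
    // number of flushes performed.
    public static <T> int indexInBatches(Iterator<T> docs, int batchSize,
                                         Consumer<List<T>> flush) {
        List<T> batch = new ArrayList<>(batchSize);
        int flushes = 0;
        while (docs.hasNext()) {
            batch.add(docs.next());
            if (batch.size() == batchSize) {
                flush.accept(batch);  // real code: server.add(batch)
                flushes++;
                batch = new ArrayList<>(batchSize); // let the old batch be GC'd
            }
        }
        if (!batch.isEmpty()) { // send the final partial batch
            flush.accept(batch);
            flushes++;
        }
        return flushes;
    }
}
```

This bounds memory, but it turns one streaming request into thousands of small HTTP round trips, which is exactly the overhead we're hoping to avoid.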

I've searched the wiki as well as Google in general, and I haven't found any 
other approach for Solr 4.x.  The SolrJ wiki still recommends the Iterator 
approach for indexing large amounts of data, but we're hoping someone has 
another method that's as efficient as the old 1.4/3.6 behavior.  I don't really 
want to send 10 million documents one at a time.

Thanks!
Terry
