Hi there, I am new to Solr and am trying to use MapReduce to index into Solr 4.0. Following online suggestions, I tried both ConcurrentUpdateSolrServer and CloudSolrServer.
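To make the node-rotation idea concrete before the full snippets below: each reducer picks one of five Solr nodes from its MapReduce task ID, so the five reducers spread evenly across solr5..solr9. This is just a self-contained sketch of that arithmetic (host names and port follow my URL pattern; nothing Solr-specific happens here):

```java
public class ShardRotation {
    // Mirrors the setup() logic: map a MapReduce task ID onto one of
    // five Solr nodes, solr5 .. solr9, by rotating with modulo 5.
    static String urlForTask(int taskId) {
        int serverId = taskId % 5 + 5;
        return "http://solr" + serverId + ":8983/solr/core0";
    }

    public static void main(String[] args) {
        // With 5 reducers (task IDs 0..4) each node gets exactly one reducer.
        for (int taskId = 0; taskId < 5; taskId++) {
            System.out.println(taskId + " -> " + urlForTask(taskId));
        }
    }
}
```

With exactly 5 reducers this gives a one-to-one mapping, which is why I expect no two reducers to write to the same node.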
For ConcurrentUpdateSolrServer, I did this:

in setup():

    int taskId = context.getTaskAttemptID().getTaskID().getId();
    int serverId = taskId % 5 + 5; // rotate the node ID using the MapReduce task ID
    String url = "http://solr" + serverId + ":8983/solr/core0";
    logger.info("using " + url);
    server = new ConcurrentUpdateSolrServer(url, 1000, 1);

in reduce(): add the documents

in cleanup(): server.commit();

I run 5 reducers over 4 million documents, and the log shows that each reducer talks to a single Solr node, so there should be no race condition there. The whole job took about 10 minutes, but I lost about 20% of the documents: only around 3.2 million ended up in the index.

For CloudSolrServer, I did this:

in setup():

    try {
        server = new CloudSolrServer("solr:9983");
        server.setDefaultCollection("core0");
    } catch (MalformedURLException e) {
        logger.error(e);
    }

in reduce(): add the documents

in cleanup(): server.commit();

With this one, the same 4 million documents took 1 hour, but all of them made it into the index.

Time-wise, I would much prefer ConcurrentUpdateSolrServer, but I cannot accept losing that many documents along the way. As a Solr newbie I may be missing something obvious here, and I don't know what it is. Could someone tell me? Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/using-ConcurrentUpdateSolrServer-and-CloudSolrServer-to-add-documents-tp4016885.html
Sent from the Solr - User mailing list archive at Nabble.com.