Hi there,

I am new to Solr and trying to use MapReduce to index into Solr 4.0. Per online
suggestions, I tried both ConcurrentUpdateSolrServer and CloudSolrServer.

For ConcurrentUpdateSolrServer, I did this:
in setup:
int taskId = context.getTaskAttemptID().getTaskID().getId();
int serverId = taskId % 5 + 5; // rotate the shard ID using the MapReduce task ID
String url = "http://solr" + serverId + ":8983/solr/core0";
logger.info("using " + url);
server = new ConcurrentUpdateSolrServer(url, 1000, 1);
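
For reference, the shard-rotation math works out like this (the host names solr5 through solr9 and the port are specific to my cluster, so treat them as assumptions):

```java
// Toy reproduction of the setup() logic: map each reducer's task ID to a
// fixed Solr node so no two reducers share a connection.
public class ShardRotation {
    static String urlForTask(int taskId) {
        int serverId = taskId % 5 + 5;  // task IDs 0..4 map to hosts solr5..solr9
        return "http://solr" + serverId + ":8983/solr/core0";
    }

    public static void main(String[] args) {
        // With 5 reducers, each task ID lands on a distinct node.
        for (int taskId = 0; taskId < 5; taskId++) {
            System.out.println(taskId + " -> " + urlForTask(taskId));
        }
    }
}
```

With exactly 5 reducers each task ID hits a distinct node; with more than 5 reducers the modulo would make nodes shared again.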

in reduce:
add the documents

in cleanup:
server.commit();

I ran 5 reducers over 4 million documents. The log shows each reducer talking
to a single Solr node, so there should be no race condition there. The whole
job took about 10 minutes, but I lost 20% of the documents: only about 3.2
million made it into the index.

For CloudSolrServer, I did this:
in setup:
try {
     server = new CloudSolrServer("solr:9983"); // ZooKeeper host:port, not a Solr URL
     server.setDefaultCollection("core0");
} catch (MalformedURLException e) {
     logger.error(e);
}

in reduce:
add the documents

in cleanup:
server.commit();

With this one, it took 1 hour for 4 million documents, but I do get all
documents in the index.
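
One guess on why the CloudSolrServer run is so much slower: I call server.add(doc) once per document, so every document pays a full request round trip. SolrJ also has server.add(Collection&lt;SolrInputDocument&gt;), so buffering documents in reduce() and flushing in batches should cut the overhead. A toy sketch of that batching pattern (the batch size of 1000 is an arbitrary assumption, and flush() here just counts requests instead of calling SolrJ):

```java
import java.util.ArrayList;
import java.util.List;

// Batching pattern for reduce(): buffer documents and flush every batchSize
// adds, so N documents cost roughly N/batchSize requests instead of N.
public class BatchBuffer {
    final int batchSize;
    final List<String> buffer = new ArrayList<>();
    int requests = 0; // stand-in for server.add(Collection<...>) calls

    BatchBuffer(int batchSize) { this.batchSize = batchSize; }

    void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() { // call this from cleanup() too, before commit()
        if (buffer.isEmpty()) return;
        requests++; // real code: server.add(bufferOfSolrInputDocuments)
        buffer.clear();
    }

    public static void main(String[] args) {
        BatchBuffer b = new BatchBuffer(1000);
        for (int i = 0; i < 4500; i++) b.add("doc" + i);
        b.flush(); // final partial batch, as cleanup() would do
        System.out.println("requests=" + b.requests);
    }
}
```

The important detail is the extra flush() in cleanup(), before server.commit(), so the final partial batch is not dropped.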

Time-wise, I much prefer to use ConcurrentUpdateSolrServer; however, I cannot
accept losing so many documents along the way. As a Solr newbie, though, I
might be missing something obvious here. Could someone tell me what it is?
Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/using-ConcurrentUpdateSolrServer-and-CloudSolrServer-to-add-documents-tp4016885.html
Sent from the Solr - User mailing list archive at Nabble.com.