DataImportHandler does not parallelize indexing at all. It is a
single-threaded indexer that runs on a single node. However, the documents
it produces are routed to the correct shard by SolrCloud. Therefore,
what you are observing on your servers is normal.

If you want to parallelize indexing, you can either:
a) Use SolrJ or another external client and write the indexing code yourself, or
b) Set up DIH so that each shard indexes a disjoint subset of the
data. You can then fire a DIH full-import on multiple shards/nodes
simultaneously.
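
For option (a), here is a minimal sketch of an external client in Python
rather than SolrJ. It splits the documents into batches and sends the batches
from several threads. The update URL, batch size and worker count are
assumptions for illustration only, and it assumes the update handler accepts
a JSON array of documents.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: update URL of any node in the collection; adjust to your setup.
UPDATE_URL = "http://localhost:8080/solr/collection1/update"


def batches(docs, size):
    """Yield consecutive slices of docs, each with at most `size` items."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_batch(batch, url=UPDATE_URL):
    """POST one batch as a JSON array of documents to the update handler."""
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_in_parallel(docs, send=post_batch, batch_size=1000, workers=4):
    """Send the batches concurrently; returns one result per batch, in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches(docs, batch_size)))
```

SolrCloud still routes each document to the right shard, so the client can
send every batch to the same node. A commit is still needed once all batches
have been sent.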

One way of achieving (b) is by using request parameters to substitute
placeholders in your DIH configuration. See
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters
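
As a rough sketch of (b), the Python snippet below builds one full-import URL
per node, each carrying a different partition number. The parameter names
"part" and "numParts" are hypothetical. Your DIH configuration would have to
read them through placeholders such as ${dataimporter.request.part} and use
them to select a disjoint subset of the data. The second handler URL assumes
that the node on port 9180 uses the same path as the one on port 8080.

```python
from urllib.parse import urlencode


def full_import_urls(handler_urls, entity):
    """Build one DIH full-import URL per handler, each with its own partition.

    "part" and "numParts" are hypothetical request parameters; DIH passes
    them through to the configuration, which must use them itself.
    """
    total = len(handler_urls)
    urls = []
    for part, base in enumerate(handler_urls):
        params = {
            "command": "full-import",
            "entity": entity,
            # Assumption: clean=false, so that one import does not delete
            # the documents indexed by the imports running on other nodes.
            "clean": "false",
            "commit": "true",
            "part": part,
            "numParts": total,
        }
        urls.append(base + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    handlers = [
        "http://localhost:8080/solr-0.4.0-pfd/noticesBIBcollection/dataimportMNb",
        "http://localhost:9180/solr-0.4.0-pfd/noticesBIBcollection/dataimportMNb",
    ]
    for url in full_import_urls(handlers, "noticebib"):
        print(url)
```

Each printed URL can then be requested at the same time, for example from
separate browser tabs or with curl.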

On Tue, Sep 3, 2013 at 3:25 PM,  <jerome.dup...@bnf.fr> wrote:
>
> Hello again,
>
> I am still trying to index with SolrCloud and DIH. I can index, but it seems
> that indexing is done on only 1 shard. (My goal was to parallelize it to
> go faster.)
> This is my configuration:
> I have 2 Tomcat instances,
> One with ZooKeeper embedded in Solr 4.4.0 started and 1 shard (port 8080)
> The other with the second shard (port 9180).
> In my admin interface, I see 2 shards, each one is a leader.
>
>
> When I launch the DIH, documents are indexed. But only shard1 is
> working.
> http://localhost:8080/solr-0.4.0-pfd/noticesBIBcollection/dataimportMNb?command=full-import&entity=noticebib&optimize=true&indent=true&clean=true&commit=true&verbose=false&debug=false&wt=json&rows=1000
>
>
> In my first shard, I see messages coming from my indexing process:
> DEBUG 2013-09-03 11:48:57,801 Thread-12
> org.apache.solr.handler.dataimport.URLDataSource  (92) - Accessing URL:
> file:/X:/3/7/002/37002118.xml
> DEBUG 2013-09-03 11:48:57,832 Thread-12
> org.apache.solr.handler.dataimport.URLDataSource  (92) - Accessing URL:
> file:/X:/3/7/002/37002120.xml
> DEBUG 2013-09-03 11:48:57,966 Thread-12
> org.apache.solr.handler.dataimport.LogTransformer  (58) - Notice fichier:
> 3/7/002/37002120.xml
> DEBUG 2013-09-03 11:48:57,966 Thread-12 fr.bnf.solr.BnfDateTransformer
> (696) - NN=37002120
>
> In the second instance, I only have this kind of log, as if it were receiving
> notifications from ZooKeeper of new updates:
> INFO 2013-09-03 11:48:57,323 http-9180-7
> org.apache.solr.update.processor.LogUpdateProcessor  (198) - [noticesBIB]
> webapp=/solr-0.4.0-pfd path=/update params=
> {distrib.from=http://172.20.48.237:8080/solr-0.4.0-pfd/noticesBIB/&update.distrib=TOLEADER&wt=javabin&version=2}
>  {add=[37001748 (1445149264874307584), 37001757 (1445149264879550464),
> 37001764 (1445149264883744768), 37001786 (1445149264887939072), 37001817
> (1445149264891084800), 37001819 (1445149264896327680), 37001837
> (1445149264900521984), 37001861 (1445149264903667712), 37001869
> (1445149264907862016), 37001963 (1445149264912056320)]} 0 41
>
> I supposed there was a confusion between the core names and the collection
> name, and I tried to change the name of the collection, but it solved nothing.
> When I go to the DIH interfaces, on shard1 I see the indexing in progress, and
> on shard2 "no information available".
>
> Is there something special to do to distribute the indexing process?
> Should I run ZooKeeper on both instances (even if it's not mandatory)?
> ...
> Regards
> Jerome
>
>
>



-- 
Regards,
Shalin Shekhar Mangar.