Hi Paul, hope your weekend is going well so far. I still have a couple of questions you might be able to help me with:
1. In your earlier email, you said "if possible, you can setup multiple DIH say /dataimport1, /dataimport2 etc and split your files and can achieve parallelism". I am not sure I understood it right. I put two requestHandler entries in solrconfig.xml, like this:

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">./data-config.xml</str>
    </lst>
  </requestHandler>

  <requestHandler name="/dataimport2" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">./data-config2.xml</str>
    </lst>
  </requestHandler>

and created data-config.xml and data-config2.xml. Then I ran the command

  http://host:8080/solr/dataimport?command=full-import

but only one data set (the first one) was indexed. Did I get something wrong?

2. I noticed that after Solr indexed about 8M documents (around two hours), it gets very, very slow. Using the "top" command in Linux, I noticed that RES is 1g of memory. I did several experiments; every time RES reaches 1g, the indexing process becomes extremely slow. Is this memory limit set by the JVM? And how can I set the JVM memory when I run DIH through the web command full-import?

Thanks!

JB

--- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:

> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> Subject: Re: How to index large set data
> To: "Jianbin Dai" <djian...@yahoo.com>
> Date: Friday, May 22, 2009, 10:04 PM
>
> On Sat, May 23, 2009 at 10:27 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >
> > Hi Paul, but in your previous post, you said "there is already an issue for writing to Solr in multiple threads, SOLR-1089". Do you think using solrj alone would be better than DIH?
>
> nope, you will have to do the indexing in multiple threads
>
> if possible, you can setup multiple DIH say /dataimport1, /dataimport2 etc and split your files and can achieve parallelism
>
> > Thanks and have a good weekend!
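[A sketch of both points, for reference. Assumptions: Tomcat at host:8080, and that each DIH endpoint is started by its own request; this is not an exact recipe from the thread.]

```shell
# 1) Each DataImportHandler is triggered separately; a request to
#    /dataimport does not start /dataimport2. To run both imports in
#    parallel, issue one request per handler (host/port are examples):
curl "http://host:8080/solr/dataimport?command=full-import" &
curl "http://host:8080/solr/dataimport2?command=full-import" &

# Each handler also has its own status page:
curl "http://host:8080/solr/dataimport?command=status"
curl "http://host:8080/solr/dataimport2?command=status"

# 2) A web-triggered import runs inside the servlet container's JVM,
#    so the heap cannot be set on the full-import URL; it is set where
#    the container starts. For Tomcat, e.g. (path and values are
#    assumptions to be tuned):
#      $CATALINA_HOME/bin/setenv.sh:
#        export JAVA_OPTS="$JAVA_OPTS -Xmx2g"
```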
> >
> > --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >
> >> no need to use embedded SolrServer.. you can use SolrJ with streaming in multiple threads
> >>
> >> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >
> >> > If I do the xml parsing by myself and use an embedded client to do the push, would it be more efficient than DIH?
> >> >
> >> > --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
> >> >
> >> >> From: Grant Ingersoll <gsing...@apache.org>
> >> >> Subject: Re: How to index large set data
> >> >> To: solr-user@lucene.apache.org
> >> >> Date: Friday, May 22, 2009, 5:38 AM
> >> >>
> >> >> Can you parallelize this? I don't know that the DIH can handle it, but having multiple threads sending docs to Solr is the best performance-wise, so maybe you need to look at alternatives to pulling with DIH and instead use a client to push into Solr.
> >> >>
> >> >> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
> >> >>
> >> >> > about 2.8M total docs were created. Only the first run finishes. In my 2nd try, it hangs there forever at the end of indexing (I guess right before commit), with CPU usage of 100%. Total 5G (2050) index files are created. Now I have two problems:
> >> >> > 1. why does it hang there and fail?
> >> >> > 2. how can I speed up the indexing?
> >> >> >
> >> >> > Here is my solrconfig.xml:
> >> >> >
> >> >> > <useCompoundFile>false</useCompoundFile>
> >> >> > <ramBufferSizeMB>3000</ramBufferSizeMB>
> >> >> > <mergeFactor>1000</mergeFactor>
> >> >> > <maxMergeDocs>2147483647</maxMergeDocs>
> >> >> > <maxFieldLength>10000</maxFieldLength>
> >> >> > <unlockOnStartup>false</unlockOnStartup>
> >> >> >
> >> >> > --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >> >> >
> >> >> >> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >> >> Subject: Re: How to index large set data
> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> Date: Thursday, May 21, 2009, 10:39 PM
> >> >> >>
> >> >> >> what is the total no. of docs created? I guess it may not be memory bound. Indexing is mostly an IO-bound operation. You may be able to get better perf if an SSD (solid state disk) is used
> >> >> >>
> >> >> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >> >>>
> >> >> >>> Hi Paul,
> >> >> >>>
> >> >> >>> Thank you so much for answering my questions. It really helped. After some adjustment, basically setting mergeFactor to 1000 from the default value of 10, I can finish the whole job in 2.5 hours. I checked that during running time, only around 18% of memory is being used, and VIRT is always 1418m. I am thinking it may be restricted by the JVM memory setting.
> >> >> >>> But I run the data import command through the web, i.e., http://<host>:<port>/solr/dataimport?command=full-import, so how can I set the memory allocation for the JVM?
> >> >> >>> Thanks again!
> >> >> >>>
> >> >> >>> JB
> >> >> >>>
> >> >> >>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >> >> >>>
> >> >> >>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >> >>>> Subject: Re: How to index large set data
> >> >> >>>> To: solr-u...@lucene.apache.org
> >> >> >>>> Date: Thursday, May 21, 2009, 9:57 PM
> >> >> >>>>
> >> >> >>>> check the status page of DIH and see if it is working properly, and if yes, what is the rate of indexing
> >> >> >>>>
> >> >> >>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >> >>>>>
> >> >> >>>>> Hi,
> >> >> >>>>>
> >> >> >>>>> I have about 45GB of xml files to be indexed. I am using DataImportHandler. I started the full import 4 hours ago, and it's still running.....
> >> >> >>>>> My computer has 4GB memory. Any suggestion on the solutions?
> >> >> >>>>> Thanks!
> >> >> >>>>>
> >> >> >>>>> JB
> >> >> >>>>
> >> >> >>>> --
> >> >> >>>> -----------------------------------------------------
> >> >> >>>> Noble Paul | Principal Engineer | AOL | http://aol.com
> >> >>
> >> >> --------------------------
> >> >> Grant Ingersoll
> >> >> http://www.lucidimagination.com/
> >> >>
> >> >> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
> >> >> http://www.lucidimagination.com/search
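[For reference, the "SolrJ with streaming in multiple threads" suggestion quoted above can be sketched roughly as below. This is a minimal sketch, not an exact recipe from the thread: it assumes SolrJ 1.4's StreamingUpdateSolrServer, which buffers added documents in a queue and sends them over several background threads; the URL, queue size, thread count, and field names are all illustrative assumptions to be tuned.]

```java
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;

public class BulkIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        // Queue up to 100 docs and send with 4 background threads
        // (URL and sizes are assumptions, not values from the thread).
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        for (int i = 0; i < 1000; i++) {               // stand-in for the real XML parsing loop
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));    // field names are hypothetical
            doc.addField("text", "document body " + i);
            server.add(doc);                            // queued; sent by background threads
        }
        server.commit();                                // single commit at the end
    }
}
```

The main loop stays single-threaded; the parallelism comes from the sender threads inside StreamingUpdateSolrServer, which keeps the client simpler than managing worker threads by hand.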