If I do the XML parsing myself and use an embedded client to do the push, would it be more efficient than DIH?
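A rough sketch of what that push path could look like, in case it helps frame the comparison: parse the XML yourself and send documents from a few worker threads, which is essentially what Grant suggests below. This assumes the SolrJ CommonsHttpSolrServer HTTP client from the 1.3/1.4 era; the URL, the thread count, and the parseFile() helper are placeholders for your own setup and parsing code, and the same loop would work against an EmbeddedSolrServer if you really want to skip HTTP.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelPusher {

        public static void main(String[] args) throws Exception {
            // One shared client; CommonsHttpSolrServer is thread-safe for adds.
            final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            ExecutorService pool = Executors.newFixedThreadPool(4);

            for (final String file : args) {              // one XML file per task
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            // parseFile() stands in for your own XML parsing
                            for (SolrInputDocument doc : parseFile(file)) {
                                server.add(doc);          // push; no commit per doc
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
            server.commit();                              // single commit at the end
        }

        // Placeholder: turn one XML file into Solr documents.
        static List<SolrInputDocument> parseFile(String path) {
            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            // ... your XML parsing goes here, e.g. doc.addField("id", ...) ...
            return docs;
        }
    }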
--- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:

> From: Grant Ingersoll <gsing...@apache.org>
> Subject: Re: How to index large set data
> To: solr-user@lucene.apache.org
> Date: Friday, May 22, 2009, 5:38 AM
>
> Can you parallelize this? I don't know that the DIH can handle it,
> but having multiple threads sending docs to Solr is the best
> performance-wise, so maybe you need to look at alternatives to
> pulling with DIH and instead use a client to push into Solr.
>
> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
>
> > About 2.8 m total docs were created. Only the first run finishes. In
> > my 2nd try, it hangs there forever at the end of indexing (I guess
> > right before commit), with CPU usage of 100%. Total 5G (2050) index
> > files are created. Now I have two problems:
> > 1. Why does it hang there and fail?
> > 2. How can I speed up the indexing?
> >
> > Here is my solrconfig.xml:
> >
> >   <useCompoundFile>false</useCompoundFile>
> >   <ramBufferSizeMB>3000</ramBufferSizeMB>
> >   <mergeFactor>1000</mergeFactor>
> >   <maxMergeDocs>2147483647</maxMergeDocs>
> >   <maxFieldLength>10000</maxFieldLength>
> >   <unlockOnStartup>false</unlockOnStartup>
> >
> > --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >
> >> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> Subject: Re: How to index large set data
> >> To: solr-user@lucene.apache.org
> >> Date: Thursday, May 21, 2009, 10:39 PM
> >>
> >> What is the total no. of docs created? I guess it may not be memory
> >> bound; indexing is mostly an IO-bound operation. You may be able to
> >> get better perf if an SSD is used (solid state disk).
> >>
> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >>>
> >>> Hi Paul,
> >>>
> >>> Thank you so much for answering my questions. It really helped.
> >>> After some adjustment, basically setting mergeFactor to 1000 from
> >>> the default value of 10, I could finish the whole job in 2.5 hours.
> >>> I checked that during running time only around 18% of memory is
> >>> being used, and VIRT is always 1418m. I am thinking it may be
> >>> restricted by the JVM memory setting. But I run the data import
> >>> command through the web, i.e.,
> >>> http://<host>:<port>/solr/dataimport?command=full-import,
> >>> so how can I set the memory allocation for the JVM?
> >>> Thanks again!
> >>>
> >>> JB
> >>>
> >>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >>>
> >>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >>>> Subject: Re: How to index large set data
> >>>> To: solr-user@lucene.apache.org
> >>>> Date: Thursday, May 21, 2009, 9:57 PM
> >>>>
> >>>> Check the status page of DIH and see if it is working properly,
> >>>> and if yes, what is the rate of indexing?
> >>>>
> >>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have about 45GB of xml files to be indexed. I am using
> >>>>> DataImportHandler. I started the full import 4 hours ago,
> >>>>> and it's still running...
> >>>>> My computer has 4GB memory. Any suggestion on the solutions?
> >>>>> Thanks!
> >>>>>
> >>>>> JB
> >>>>
> >>>> --
> >>>> -----------------------------------------------------
> >>>> Noble Paul | Principal Engineer | AOL | http://aol.com
> >>
> >> --
> >> -----------------------------------------------------
> >> Noble Paul | Principal Engineer | AOL | http://aol.com
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
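On the JVM memory question quoted above: the heap is set where the servlet container running Solr is launched, not on the dataimport URL. With the Jetty bundled in the Solr example directory, that would look roughly like starting it with explicit -Xms/-Xmx flags, e.g.

    java -Xms512m -Xmx2048m -jar start.jar

with the values adjusted to what your machine can spare; the 2048m here is only an illustration, not a recommendation.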