Hi Paul,

But in your previous post, you said "there is already an issue for writing to Solr in multiple threads SOLR-1089". Do you think using SolrJ alone would be better than DIH?

Thanks and have a good weekend!
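For reference, here is a minimal sketch of the "push from multiple threads" idea discussed in the thread below. It is not SolrJ and not DIH: it is plain Python that builds Solr XML `<add>` messages and posts them from a thread pool. The function names (`to_add_xml`, `batches`, `post_xml`, `push_parallel`), the batch size, and the worker count are all made up for illustration, and the update URL is an assumption about the install.

```python
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape, quoteattr
import urllib.request


def to_add_xml(docs):
    """Build one Solr <add> XML update message from a list of field dicts."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(
                "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_xml(update_url, body):
    """POST one XML update message; update_url is an assumed Solr /update URL."""
    request = urllib.request.Request(
        update_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def push_parallel(docs, send, batch_size=1000, workers=4):
    """Send batches of docs through `send` from several threads.

    Note: Executor.map submits every batch up front, so a very large
    input would need a bounded queue instead of this simple version.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: send(to_add_xml(b)), batches(docs, batch_size)))
```

Usage would look something like `push_parallel(parsed_docs, lambda body: post_xml("http://localhost:8983/solr/update", body))`, followed by a single `post_xml(..., "<commit/>")` at the end rather than a commit per batch. The host, port, and path there are placeholders.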

--- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:

> no need to use embedded Solrserver.
> you can use SolrJ with streaming
> in multiple threads
>
> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
> >
> > If I do the xml parsing by myself and use embedded client to do the
> > push, would it be more efficient than DIH?
> >
> > --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
> >
> >> From: Grant Ingersoll <gsing...@apache.org>
> >> Subject: Re: How to index large set data
> >> To: solr-user@lucene.apache.org
> >> Date: Friday, May 22, 2009, 5:38 AM
> >> Can you parallelize this? I don't know that the DIH can handle it,
> >> but having multiple threads sending docs to Solr is the best
> >> performance wise, so maybe you need to look at alternatives to pulling
> >> with DIH and instead use a client to push into Solr.
> >>
> >> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
> >>
> >> > about 2.8 m total docs were created. only the first run finishes. In
> >> > my 2nd try, it hangs there forever at the end of indexing, (I guess
> >> > right before commit), with cpu usage of 100%. Total 5G (2050) index
> >> > files are created. Now I have two problems:
> >> > 1. why it hangs there and failed?
> >> > 2. how can i speed up the indexing?
> >> >
> >> > Here is my solrconfig.xml
> >> >
> >> >     <useCompoundFile>false</useCompoundFile>
> >> >     <ramBufferSizeMB>3000</ramBufferSizeMB>
> >> >     <mergeFactor>1000</mergeFactor>
> >> >     <maxMergeDocs>2147483647</maxMergeDocs>
> >> >     <maxFieldLength>10000</maxFieldLength>
> >> >     <unlockOnStartup>false</unlockOnStartup>
> >> >
> >> > --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >> >
> >> >> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >> Subject: Re: How to index large set data
> >> >> To: solr-user@lucene.apache.org
> >> >> Date: Thursday, May 21, 2009, 10:39 PM
> >> >> what is the total no:of docs created ? I guess it may not be memory
> >> >> bound. indexing is mostly an IO bound operation. You may be able to
> >> >> get a better perf if a SSD is used (solid state disk)
> >> >>
> >> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >>>
> >> >>> Hi Paul,
> >> >>>
> >> >>> Thank you so much for answering my questions. It really helped.
> >> >>> After some adjustment, basically setting mergeFactor to 1000 from
> >> >>> the default value of 10, I can finish the whole job in 2.5 hours.
> >> >>> I checked that during running time, only around 18% of memory is
> >> >>> being used, and VIRT is always 1418m. I am thinking it may be
> >> >>> restricted by JVM memory setting. But I run the data import
> >> >>> command through web, i.e.,
> >> >>> http://<host>:<port>/solr/dataimport?command=full-import,
> >> >>> how can I set the memory allocation for JVM?
> >> >>> Thanks again!
> >> >>>
> >> >>> JB
> >> >>>
> >> >>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >> >>>
> >> >>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >>>> Subject: Re: How to index large set data
> >> >>>> To: solr-u...@lucene.apache.org
> >> >>>> Date: Thursday, May 21, 2009, 9:57 PM
> >> >>>> check the status page of DIH and see if it is working properly. and
> >> >>>> if, yes what is the rate of indexing
> >> >>>>
> >> >>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >>>>>
> >> >>>>> Hi,
> >> >>>>>
> >> >>>>> I have about 45GB xml files to be indexed. I am using
> >> >>>>> DataImportHandler. I started the full import 4 hours ago,
> >> >>>>> and it's still running.....
> >> >>>>> My computer has 4GB memory. Any suggestion on the solutions?
> >> >>>>> Thanks!
> >> >>>>>
> >> >>>>> JB
> >> >>>>
> >> >>>> --
> >> >>>> -----------------------------------------------------
> >> >>>> Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >>
> >> >> --
> >> >> -----------------------------------------------------
> >> >> Noble Paul | Principal Engineer| AOL | http://aol.com
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem
> >> (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> >> using Solr/Lucene:
> >> http://www.lucidimagination.com/search
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com