Hello. Interesting thread. One request, please: because I don't have much experience with Solr, could you use the full terms rather than abbreviations like DIH, RES, etc.?
Thanks :)

On Mon, May 25, 2009 at 4:44 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>
> Hi Paul,
>
> Hope you have a great weekend so far.
> I still have a couple of questions you might be able to help me out with:
>
> 1. In your earlier email, you said "if possible, you can set up multiple
> DIH, say /dataimport1, /dataimport2 etc., and split your files and can
> achieve parallelism".
> I am not sure I understand it right. I put two requestHandler entries in
> solrconfig.xml, like this:
>
>   <requestHandler name="/dataimport"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">./data-config.xml</str>
>     </lst>
>   </requestHandler>
>
>   <requestHandler name="/dataimport2"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">./data-config2.xml</str>
>     </lst>
>   </requestHandler>
>
> and created data-config.xml and data-config2.xml.
> Then I ran the command
>   http://host:8080/solr/dataimport?command=full-import
> but only one data set (the first one) was indexed. Did I get something
> wrong?
>
> 2. I noticed that after Solr indexed about 8M documents (around two
> hours), it gets very, very slow. I used the "top" command in Linux and
> noticed that RES is at 1g of memory. I did several experiments; every time
> RES reaches 1g, the indexing process becomes extremely slow. Is this
> memory limit set by the JVM? And how can I set the JVM memory when I run
> DIH through the web command full-import?
>
> Thanks!
>
> JB
>
> --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>
>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
>> Subject: Re: How to index large set data
>> To: "Jianbin Dai" <djian...@yahoo.com>
>> Date: Friday, May 22, 2009, 10:04 PM
>>
>> On Sat, May 23, 2009 at 10:27 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>
>>> Hi Paul, but in your previous post you said "there is already an issue
>>> for writing to Solr in multiple threads, SOLR-1089". Do you think using
>>> SolrJ alone would be better than DIH?
>>
>> nope, you will have to do the indexing in multiple threads
>>
>> if possible, you can set up multiple DIH, say /dataimport1, /dataimport2
>> etc., split your files, and achieve parallelism
>>
>>> Thanks and have a good weekend!
>>>
>>> --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>
>>>> no need to use an embedded SolrServer..
>>>> you can use SolrJ with streaming in multiple threads
>>>>
>>>> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>
>>>>> If I do the XML parsing myself and use the embedded client to do the
>>>>> push, would it be more efficient than DIH?
>>>>>
>>>>> --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>
>>>>>> From: Grant Ingersoll <gsing...@apache.org>
>>>>>> Subject: Re: How to index large set data
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Date: Friday, May 22, 2009, 5:38 AM
>>>>>>
>>>>>> Can you parallelize this? I don't know that the DIH can handle it,
>>>>>> but having multiple threads sending docs to Solr is the best
>>>>>> performance-wise, so maybe you need to look at alternatives to
>>>>>> pulling with DIH and instead use a client to push into Solr.
>>>>>>
>>>>>> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
>>>>>>
>>>>>>> about 2.8M total docs were created. Only the first run finishes. In
>>>>>>> my 2nd try it hangs there forever at the end of indexing (I guess
>>>>>>> right before commit), with CPU usage at 100%. In total 5G (2050)
>>>>>>> index files are created. Now I have two problems:
>>>>>>> 1. Why does it hang there and fail?
>>>>>>> 2. How can I speed up the indexing?
>>>>>>>
>>>>>>> Here is my solrconfig.xml:
>>>>>>>
>>>>>>>   <useCompoundFile>false</useCompoundFile>
>>>>>>>   <ramBufferSizeMB>3000</ramBufferSizeMB>
>>>>>>>   <mergeFactor>1000</mergeFactor>
>>>>>>>   <maxMergeDocs>2147483647</maxMergeDocs>
>>>>>>>   <maxFieldLength>10000</maxFieldLength>
>>>>>>>   <unlockOnStartup>false</unlockOnStartup>
>>>>>>>
>>>>>>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>>>>>
>>>>>>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
>>>>>>>> Subject: Re: How to index large set data
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Date: Thursday, May 21, 2009, 10:39 PM
>>>>>>>>
>>>>>>>> what is the total no. of docs created? I guess it may not be memory
>>>>>>>> bound. Indexing is mostly an IO-bound operation. You may be able to
>>>>>>>> get better perf if an SSD (solid state disk) is used.
>>>>>>>>
>>>>>>>> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> Thank you so much for answering my questions. It really helped.
>>>>>>>>> After some adjustment, basically setting mergeFactor to 1000 from
>>>>>>>>> the default value of 10, I could finish the whole job in 2.5
>>>>>>>>> hours. I checked that during running time only around 18% of
>>>>>>>>> memory is being used, and VIRT is always 1418m. I am thinking it
>>>>>>>>> may be restricted by the JVM memory setting. But I run the data
>>>>>>>>> import command through the web, i.e.,
>>>>>>>>>   http://<host>:<port>/solr/dataimport?command=full-import,
>>>>>>>>> so how can I set the memory allocation for the JVM?
>>>>>>>>>
>>>>>>>>> Thanks again!
>>>>>>>>>
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>>>>>>>
>>>>>>>>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
>>>>>>>>>> Subject: Re: How to index large set data
>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>> Date: Thursday, May 21, 2009, 9:57 PM
>>>>>>>>>>
>>>>>>>>>> check the status page of DIH and see if it is working properly.
>>>>>>>>>> And if yes, what is the rate of indexing?
>>>>>>>>>>
>>>>>>>>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have about 45GB of XML files to be indexed. I am using
>>>>>>>>>>> DataImportHandler. I started the full import 4 hours ago,
>>>>>>>>>>> and it's still running.....
>>>>>>>>>>> My computer has 4GB memory. Any suggestions on solutions?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -----------------------------------------------------
>>>>>>>>>> Noble Paul | Principal Engineer | AOL | http://aol.com
>>>>>>>>
>>>>>>>> --
>>>>>>>> -----------------------------------------------------
>>>>>>>> Noble Paul | Principal Engineer | AOL | http://aol.com
>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>> using Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>
>>>> --
>>>> -----------------------------------------------------
>>>> Noble Paul | Principal Engineer | AOL | http://aol.com
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer | AOL | http://aol.com
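A few appended sketches on the questions quoted above (untested; adjust host names and paths to your own setup).

On question 1 (two DataImportHandler instances configured, but only one data set indexed): each handler registered in solrconfig.xml has to be triggered with its own request, and calling /dataimport only runs the import described in data-config.xml. With the two handlers quoted above, issuing both full-import requests, and issuing them concurrently (for example from two shells), should give the parallelism Noble described:

  http://host:8080/solr/dataimport?command=full-import
  http://host:8080/solr/dataimport2?command=full-import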
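On question 2 (how to set the JVM memory when the import is started through the web command): the full-import URL only tells Solr to start the import; the work happens inside the servlet container's JVM, so the heap has to be set where that JVM is launched, not in the URL. A RES value that plateaus around 1g would be consistent with the JVM having reached its maximum heap. A sketch, assuming Tomcat on Linux (the exact script or environment variable depends on your installation; the heap sizes are only examples):

  export JAVA_OPTS="-Xms512m -Xmx2048m"
  ./bin/catalina.sh start

or, for the Jetty that ships with the Solr example:

  java -Xms512m -Xmx2048m -jar start.jar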
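On the hang at the end of indexing: two settings in the quoted solrconfig.xml may be working against each other. ramBufferSizeMB is allocated from the JVM heap, so a 3000MB buffer cannot fit in a heap that tops out around 1g, and the writer may end up spending its time in garbage collection instead of flushing; a mergeFactor of 1000 also defers segment merges during indexing (hence the roughly 2050 index files), which can make the final flush/optimize very expensive and look like a hang at 100% CPU. Values closer to the following might be a safer starting point to experiment from (illustrative only):

  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergeFactor>20</mergeFactor>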
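Finally, on the suggestion to push documents with SolrJ from multiple threads instead of pulling them with DataImportHandler, here is a minimal sketch of what that could look like. It assumes a Solr/SolrJ build that includes StreamingUpdateSolrServer (added for Solr 1.4); parseMyXmlChunk() is a hypothetical placeholder for your own XML parsing and is not part of any Solr API.

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ParallelPush {

    public static void main(String[] args) throws Exception {
      // Queues up to 100 docs and streams them to Solr on 4 background threads.
      final SolrServer server =
          new StreamingUpdateSolrServer("http://host:8080/solr", 100, 4);

      int numWorkers = 4;                        // one worker per file split
      ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
      for (int i = 0; i < numWorkers; i++) {
        final int split = i;
        pool.execute(new Runnable() {
          public void run() {
            try {
              // Hypothetical helper: parse one split of the XML files and
              // emit one SolrInputDocument per record.
              for (SolrInputDocument doc : parseMyXmlChunk(split)) {
                server.add(doc);
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
      server.commit();                           // single commit at the end
    }

    // Placeholder only -- replace with real parsing (e.g. StAX).
    static Iterable<SolrInputDocument> parseMyXmlChunk(int split) {
      return java.util.Collections.<SolrInputDocument>emptyList();
    }
  }

Splitting the 45GB of XML into a few chunks and giving each worker its own chunk keeps the parsing CPU busy while the streaming server overlaps the HTTP round trips, which is essentially the parallelism being discussed in the thread.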