If I do the XML parsing myself and use an embedded client to do the push, would it be more efficient than DIH?
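A rough sketch of what that push path could look like, in case it helps frame the comparison: parse the XML yourself and send documents from a few worker threads, which is essentially what Grant suggests below. This assumes the SolrJ CommonsHttpSolrServer HTTP client from the 1.3/1.4 era; the URL, the thread count, and the parseFile() helper are placeholders for your own setup and parsing code, and the same loop would work against an EmbeddedSolrServer if you really want to skip HTTP.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelPusher {

        public static void main(String[] args) throws Exception {
            // One shared client; CommonsHttpSolrServer is thread-safe for adds.
            final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            ExecutorService pool = Executors.newFixedThreadPool(4);

            for (final String file : args) {              // one XML file per task
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            // parseFile() stands in for your own XML parsing
                            for (SolrInputDocument doc : parseFile(file)) {
                                server.add(doc);          // push; no commit per doc
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
            server.commit();                              // single commit at the end
        }

        // Placeholder: turn one XML file into Solr documents.
        static List<SolrInputDocument> parseFile(String path) {
            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            // ... your XML parsing goes here, e.g. doc.addField("id", ...) ...
            return docs;
        }
    }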
--- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:

> From: Grant Ingersoll <gsing...@apache.org>
> Subject: Re: How to index large set data
> To: solr-user@lucene.apache.org
> Date: Friday, May 22, 2009, 5:38 AM
>
> Can you parallelize this? I don't know that the DIH can handle it,
> but having multiple threads sending docs to Solr is the best
> performance-wise, so maybe you need to look at alternatives to
> pulling with DIH and instead use a client to push into Solr.
>
> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
>
> > About 2.8 m total docs were created. Only the first run finishes. In
> > my 2nd try, it hangs there forever at the end of indexing (I guess
> > right before commit), with CPU usage of 100%. Total 5G (2050) index
> > files are created. Now I have two problems:
> > 1. Why does it hang there and fail?
> > 2. How can I speed up the indexing?
> >
> > Here is my solrconfig.xml:
> >
> >   <useCompoundFile>false</useCompoundFile>
> >   <ramBufferSizeMB>3000</ramBufferSizeMB>
> >   <mergeFactor>1000</mergeFactor>
> >   <maxMergeDocs>2147483647</maxMergeDocs>
> >   <maxFieldLength>10000</maxFieldLength>
> >   <unlockOnStartup>false</unlockOnStartup>
> >
> > --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >
> >> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> Subject: Re: How to index large set data
> >> To: solr-user@lucene.apache.org
> >> Date: Thursday, May 21, 2009, 10:39 PM
> >>
> >> What is the total no. of docs created? I guess it may not be memory
> >> bound; indexing is mostly an IO-bound operation. You may be able to
> >> get better perf if an SSD is used (solid state disk).
> >>
> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >>>
> >>> Hi Paul,
> >>>
> >>> Thank you so much for answering my questions. It really helped.
> >>> After some adjustment, basically setting mergeFactor to 1000 from
> >>> the default value of 10, I could finish the whole job in 2.5 hours.
> >>> I checked that during running time only around 18% of memory is
> >>> being used, and VIRT is always 1418m. I am thinking it may be
> >>> restricted by the JVM memory setting. But I run the data import
> >>> command through the web, i.e.,
> >>> http://<host>:<port>/solr/dataimport?command=full-import,
> >>> so how can I set the memory allocation for the JVM?
> >>> Thanks again!
> >>>
> >>> JB
> >>>
> >>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >>>
> >>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >>>> Subject: Re: How to index large set data
> >>>> To: solr-user@lucene.apache.org
> >>>> Date: Thursday, May 21, 2009, 9:57 PM
> >>>>
> >>>> Check the status page of DIH and see if it is working properly,
> >>>> and if yes, what is the rate of indexing?
> >>>>
> >>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have about 45GB of xml files to be indexed. I am using
> >>>>> DataImportHandler. I started the full import 4 hours ago,
> >>>>> and it's still running...
> >>>>> My computer has 4GB memory. Any suggestion on the solutions?
> >>>>> Thanks!
> >>>>>
> >>>>> JB
> >>>>
> >>>> --
> >>>> -----------------------------------------------------
> >>>> Noble Paul | Principal Engineer | AOL | http://aol.com
> >>
> >> --
> >> -----------------------------------------------------
> >> Noble Paul | Principal Engineer | AOL | http://aol.com
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
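On the JVM memory question quoted above: the heap is set where the servlet container running Solr is launched, not on the dataimport URL. With the Jetty bundled in the Solr example directory, that would look roughly like starting it with explicit -Xms/-Xmx flags, e.g.

    java -Xms512m -Xmx2048m -jar start.jar

with the values adjusted to what your machine can spare; the 2048m here is only an illustration, not a recommendation.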