Re: How to index large set data

Jianbin Dai Fri, 22 May 2009 21:58:18 -0700

Hi Paul, but in your previous post you said "there is already an issue for 
writing to Solr in multiple threads  SOLR-1089". Do you think using SolrJ alone 
would be better than DIH? 
Thanks and have a good weekend!
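
A minimal sketch of the multi-threaded push approach suggested in the quoted replies, assuming the documents have already been parsed from XML. ParallelPusher, BatchSender, and pushAll are hypothetical names, not SolrJ API; in a real client the sender would wrap whichever SolrJ call adds a batch of documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelPusher {

    /** Hypothetical stand-in for the code that sends one batch to Solr. */
    public interface BatchSender {
        void send(List<String> batch) throws Exception;
    }

    /**
     * Splits docs into batches and sends them from a fixed pool of worker
     * threads. Returns the number of docs whose batch was sent without error.
     */
    public static int pushAll(List<String> docs, int batchSize, int threads,
                              BatchSender sender) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger sent = new AtomicInteger();
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            // Copy the slice so each task owns its batch.
            List<String> batch = new ArrayList<>(docs.subList(start, end));
            pool.submit(() -> {
                try {
                    sender.send(batch);
                    sent.addAndGet(batch.size());
                } catch (Exception e) {
                    // A failed batch is reported and skipped, not retried.
                    System.err.println("batch failed: " + e);
                }
            });
        }
        pool.shutdown();                              // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);     // wait for queued batches
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add("doc-" + i);
        }
        // No-op sender, used here only to exercise the threading logic.
        int n = pushAll(docs, 100, 4, batch -> { });
        System.out.println("sent " + n + " docs");
    }
}
```

The batch size and thread count above are arbitrary; whether this beats DIH for a given data set would need to be measured.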


--- On Fri, 5/22/09, Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com> wrote:

> There is no need to use the embedded SolrServer.
> You can use SolrJ with streaming
> in multiple threads.
> 
> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
> >
> > If I do the XML parsing myself and use the embedded client to do the
> > push, would it be more efficient than DIH?
> >
> >
> > --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
> >
> >> From: Grant Ingersoll <gsing...@apache.org>
> >> Subject: Re: How to index large set data
> >> To: solr-user@lucene.apache.org
> >> Date: Friday, May 22, 2009, 5:38 AM
> >> Can you parallelize this?  I don't know that the DIH can handle it,
> >> but having multiple threads sending docs to Solr is the best
> >> performance-wise, so maybe you need to look at alternatives to pulling
> >> with DIH and instead use a client to push into Solr.
> >>
> >>
> >> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
> >>
> >> >
> >> > About 2.8 million docs were created in total. Only the first run
> >> > finished. In my second try, it hung forever at the end of indexing
> >> > (I guess right before commit), with CPU usage at 100%. In total, 5 GB
> >> > of index files (2050 files) were created. Now I have two problems:
> >> > 1. Why did it hang there and fail?
> >> > 2. How can I speed up the indexing?
> >> >
> >> >
> >> > Here is my solrconfig.xml:
> >> >
> >> > <useCompoundFile>false</useCompoundFile>
> >> > <ramBufferSizeMB>3000</ramBufferSizeMB>
> >> > <mergeFactor>1000</mergeFactor>
> >> > <maxMergeDocs>2147483647</maxMergeDocs>
> >> > <maxFieldLength>10000</maxFieldLength>
> >> > <unlockOnStartup>false</unlockOnStartup>
> >> >
> >> >
> >> >
> >> >
> >> > --- On Thu, 5/21/09, Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >> >
> >> >> From: Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com>
> >> >> Subject: Re: How to index large set data
> >> >> To: solr-user@lucene.apache.org
> >> >> Date: Thursday, May 21, 2009, 10:39 PM
> >> >> What is the total number of docs created? I guess it may not be
> >> >> memory bound. Indexing is mostly an IO-bound operation. You may be
> >> >> able to get better performance if an SSD (solid-state disk) is used.
> >> >>
> >> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >>>
> >> >>> Hi Paul,
> >> >>>
> >> >>> Thank you so much for answering my questions. It really helped.
> >> >>> After some adjustment, basically setting mergeFactor to 1000 from
> >> >>> the default value of 10, I can finish the whole job in 2.5 hours.
> >> >>> I checked that during the run only around 18% of memory is being
> >> >>> used, and VIRT is always 1418m. I am thinking it may be restricted
> >> >>> by the JVM memory setting. But I run the data import command
> >> >>> through the web, i.e.,
> >> >>> http://<host>:<port>/solr/dataimport?command=full-import,
> >> >>> so how can I set the memory allocation for the JVM?
> >> >>> Thanks again!
> >> >>>
> >> >>> JB
> >> >>>
> >> >>> --- On Thu, 5/21/09, Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >> >>>
> >> >>>> From: Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com>
> >> >>>> Subject: Re: How to index large set data
> >> >>>> To: solr-u...@lucene.apache.org
> >> >>>> Date: Thursday, May 21, 2009, 9:57 PM
> >> >>>> Check the status page of DIH and see if it is working properly,
> >> >>>> and if so, what the rate of indexing is.
> >> >>>>
> >> >>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >>>>>
> >> >>>>> Hi,
> >> >>>>>
> >> >>>>> I have about 45 GB of XML files to be indexed. I am using
> >> >>>>> DataImportHandler. I started the full import 4 hours ago,
> >> >>>>> and it's still running.....
> >> >>>>> My computer has 4 GB of memory. Any suggestions?
> >> >>>>> Thanks!
> >> >>>>>
> >> >>>>> JB
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> -----------------------------------------------------
> >> >>>> Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >>>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> -----------------------------------------------------
> >> >> Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >>
> >> >
> >> >
> >> >
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem
> >> (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> >> using Solr/Lucene:
> >> http://www.lucidimagination...com/search
> >>
> >>
> >
> >
> >
> >
> >
> 
> 
> 
> -- 
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>
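
On the JVM memory question quoted above: the heap limit is fixed when the Java process hosting Solr starts, through options such as -Xmx on the java command line, so the dataimport URL cannot change it. Assuming Solr is run from the bundled example Jetty, that would look something like "java -Xmx2g -jar start.jar" (the 2g value is only an illustration). The sketch below prints the limit a JVM actually received; HeapCheck is a hypothetical name, and the class reports the heap of the JVM it runs in, so it needs to be started with the same flags to be informative.

```java
public class HeapCheck {

    /** Maximum heap this JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Compare runs with and without -Xmx to see the effect of the flag.
        System.out.println("max heap (MB): " + maxHeapMb());
    }
}
```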
