Hello. Interesting thread. One request, please: because I don't have much experience with Solr, could you use the full terms rather than abbreviations like DIH, RES, etc.?
Thanks :)

On Mon, May 25, 2009 at 4:44 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>
> Hi Paul,
>
> Hope you have a great weekend so far.
> I still have a couple of questions you might be able to help me out with:
>
> 1. In your earlier email, you said "if possible, you can set up multiple
> DIH, say /dataimport1, /dataimport2 etc., and split your files and can
> achieve parallelism".
> I am not sure I understand it right. I put two requestHandler entries in
> solrconfig.xml, like this:
>
>   <requestHandler name="/dataimport"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">./data-config.xml</str>
>     </lst>
>   </requestHandler>
>
>   <requestHandler name="/dataimport2"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">./data-config2.xml</str>
>     </lst>
>   </requestHandler>
>
> and created data-config.xml and data-config2.xml.
> Then I ran the command
>   http://host:8080/solr/dataimport?command=full-import
> but only one data set (the first one) was indexed. Did I get something
> wrong?
>
> 2. I noticed that after Solr indexed about 8M documents (around two
> hours), it gets very, very slow. I used the "top" command in Linux and
> noticed that RES is at 1g of memory. I did several experiments; every time
> RES reaches 1g, the indexing process becomes extremely slow. Is this
> memory limit set by the JVM? And how can I set the JVM memory when I run
> DIH through the web command full-import?
>
> Thanks!
>
> JB
>
> --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>
>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
>> Subject: Re: How to index large set data
>> To: "Jianbin Dai" <djian...@yahoo.com>
>> Date: Friday, May 22, 2009, 10:04 PM
>>
>> On Sat, May 23, 2009 at 10:27 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>
>>> Hi Paul, but in your previous post you said "there is already an issue
>>> for writing to Solr in multiple threads, SOLR-1089". Do you think using
>>> SolrJ alone would be better than DIH?
>>
>> nope, you will have to do the indexing in multiple threads
>>
>> if possible, you can set up multiple DIH, say /dataimport1, /dataimport2
>> etc., split your files, and achieve parallelism
>>
>>> Thanks and have a good weekend!
>>>
>>> --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>
>>>> no need to use an embedded SolrServer..
>>>> you can use SolrJ with streaming in multiple threads
>>>>
>>>> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>
>>>>> If I do the XML parsing myself and use the embedded client to do the
>>>>> push, would it be more efficient than DIH?
>>>>>
>>>>> --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>
>>>>>> From: Grant Ingersoll <gsing...@apache.org>
>>>>>> Subject: Re: How to index large set data
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Date: Friday, May 22, 2009, 5:38 AM
>>>>>>
>>>>>> Can you parallelize this? I don't know that the DIH can handle it,
>>>>>> but having multiple threads sending docs to Solr is the best
>>>>>> performance-wise, so maybe you need to look at alternatives to
>>>>>> pulling with DIH and instead use a client to push into Solr.
>>>>>>
>>>>>> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
>>>>>>
>>>>>>> about 2.8M total docs were created. Only the first run finishes. In
>>>>>>> my 2nd try it hangs there forever at the end of indexing (I guess
>>>>>>> right before commit), with CPU usage at 100%. In total 5G (2050)
>>>>>>> index files are created. Now I have two problems:
>>>>>>> 1. Why does it hang there and fail?
>>>>>>> 2. How can I speed up the indexing?
>>>>>>>
>>>>>>> Here is my solrconfig.xml:
>>>>>>>
>>>>>>>   <useCompoundFile>false</useCompoundFile>
>>>>>>>   <ramBufferSizeMB>3000</ramBufferSizeMB>
>>>>>>>   <mergeFactor>1000</mergeFactor>
>>>>>>>   <maxMergeDocs>2147483647</maxMergeDocs>
>>>>>>>   <maxFieldLength>10000</maxFieldLength>
>>>>>>>   <unlockOnStartup>false</unlockOnStartup>
>>>>>>>
>>>>>>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>>>>>
>>>>>>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
>>>>>>>> Subject: Re: How to index large set data
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Date: Thursday, May 21, 2009, 10:39 PM
>>>>>>>>
>>>>>>>> what is the total no. of docs created? I guess it may not be memory
>>>>>>>> bound. Indexing is mostly an IO-bound operation. You may be able to
>>>>>>>> get better perf if an SSD (solid state disk) is used.
>>>>>>>>
>>>>>>>> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> Thank you so much for answering my questions. It really helped.
>>>>>>>>> After some adjustment, basically setting mergeFactor to 1000 from
>>>>>>>>> the default value of 10, I could finish the whole job in 2.5
>>>>>>>>> hours. I checked that during running time only around 18% of
>>>>>>>>> memory is being used, and VIRT is always 1418m. I am thinking it
>>>>>>>>> may be restricted by the JVM memory setting. But I run the data
>>>>>>>>> import command through the web, i.e.,
>>>>>>>>>   http://<host>:<port>/solr/dataimport?command=full-import,
>>>>>>>>> so how can I set the memory allocation for the JVM?
>>>>>>>>>
>>>>>>>>> Thanks again!
>>>>>>>>>
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>>>>>>>
>>>>>>>>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
>>>>>>>>>> Subject: Re: How to index large set data
>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>> Date: Thursday, May 21, 2009, 9:57 PM
>>>>>>>>>>
>>>>>>>>>> check the status page of DIH and see if it is working properly.
>>>>>>>>>> And if yes, what is the rate of indexing?
>>>>>>>>>>
>>>>>>>>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have about 45GB of XML files to be indexed. I am using
>>>>>>>>>>> DataImportHandler. I started the full import 4 hours ago,
>>>>>>>>>>> and it's still running.....
>>>>>>>>>>> My computer has 4GB memory. Any suggestions on solutions?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -----------------------------------------------------
>>>>>>>>>> Noble Paul | Principal Engineer | AOL | http://aol.com
>>>>>>>>
>>>>>>>> --
>>>>>>>> -----------------------------------------------------
>>>>>>>> Noble Paul | Principal Engineer | AOL | http://aol.com
>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>> using Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>
>>>> --
>>>> -----------------------------------------------------
>>>> Noble Paul | Principal Engineer | AOL | http://aol.com
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer | AOL | http://aol.com
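A few appended sketches on the questions quoted above (untested; adjust host names and paths to your own setup).

On question 1 (two DataImportHandler instances configured, but only one data set indexed): each handler registered in solrconfig.xml has to be triggered with its own request, and calling /dataimport only runs the import described in data-config.xml. With the two handlers quoted above, issuing both full-import requests, and issuing them concurrently (for example from two shells), should give the parallelism Noble described:

  http://host:8080/solr/dataimport?command=full-import
  http://host:8080/solr/dataimport2?command=full-import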
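On question 2 (how to set the JVM memory when the import is started through the web command): the full-import URL only tells Solr to start the import; the work happens inside the servlet container's JVM, so the heap has to be set where that JVM is launched, not in the URL. A RES value that plateaus around 1g would be consistent with the JVM having reached its maximum heap. A sketch, assuming Tomcat on Linux (the exact script or environment variable depends on your installation; the heap sizes are only examples):

  export JAVA_OPTS="-Xms512m -Xmx2048m"
  ./bin/catalina.sh start

or, for the Jetty that ships with the Solr example:

  java -Xms512m -Xmx2048m -jar start.jar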
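On the hang at the end of indexing: two settings in the quoted solrconfig.xml may be working against each other. ramBufferSizeMB is allocated from the JVM heap, so a 3000MB buffer cannot fit in a heap that tops out around 1g, and the writer may end up spending its time in garbage collection instead of flushing; a mergeFactor of 1000 also defers segment merges during indexing (hence the roughly 2050 index files), which can make the final flush/optimize very expensive and look like a hang at 100% CPU. Values closer to the following might be a safer starting point to experiment from (illustrative only):

  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergeFactor>20</mergeFactor>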
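Finally, on the suggestion to push documents with SolrJ from multiple threads instead of pulling them with DataImportHandler, here is a minimal sketch of what that could look like. It assumes a Solr/SolrJ build that includes StreamingUpdateSolrServer (added for Solr 1.4); parseMyXmlChunk() is a hypothetical placeholder for your own XML parsing and is not part of any Solr API.

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ParallelPush {

    public static void main(String[] args) throws Exception {
      // Queues up to 100 docs and streams them to Solr on 4 background threads.
      final SolrServer server =
          new StreamingUpdateSolrServer("http://host:8080/solr", 100, 4);

      int numWorkers = 4;                        // one worker per file split
      ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
      for (int i = 0; i < numWorkers; i++) {
        final int split = i;
        pool.execute(new Runnable() {
          public void run() {
            try {
              // Hypothetical helper: parse one split of the XML files and
              // emit one SolrInputDocument per record.
              for (SolrInputDocument doc : parseMyXmlChunk(split)) {
                server.add(doc);
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
      server.commit();                           // single commit at the end
    }

    // Placeholder only -- replace with real parsing (e.g. StAX).
    static Iterable<SolrInputDocument> parseMyXmlChunk(int split) {
      return java.util.Collections.<SolrInputDocument>emptyList();
    }
  }

Splitting the 45GB of XML into a few chunks and giving each worker its own chunk keeps the parsing CPU busy while the streaming server overlaps the HTTP round trips, which is essentially the parallelism being discussed in the thread.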