Hi Paul, hope your weekend is going well so far. I still have a couple of questions you might be able to help me with:
1. In your earlier email, you said "if possible, you can setup multiple DIH say /dataimport1, /dataimport2 etc and split your files and can achieve parallelism". I am not sure I understood it right. I put two requestHandler entries in solrconfig.xml, like this:

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">./data-config.xml</str>
    </lst>
  </requestHandler>

  <requestHandler name="/dataimport2" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">./data-config2.xml</str>
    </lst>
  </requestHandler>

and created data-config.xml and data-config2.xml. Then I ran the command

  http://host:8080/solr/dataimport?command=full-import

but only one data set (the first one) was indexed. Did I get something wrong?

2. I noticed that after Solr indexed about 8M documents (around two hours), it gets very, very slow. Using the "top" command in Linux, I noticed that RES is 1g of memory. I did several experiments; every time RES reaches 1g, the indexing process becomes extremely slow. Is this memory limit set by the JVM? And how can I set the JVM memory when I run DIH through the web command full-import?

Thanks!

JB

--- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:

> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> Subject: Re: How to index large set data
> To: "Jianbin Dai" <djian...@yahoo.com>
> Date: Friday, May 22, 2009, 10:04 PM
>
> On Sat, May 23, 2009 at 10:27 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >
> > Hi Paul, but in your previous post, you said "there is already an issue for writing to Solr in multiple threads, SOLR-1089". Do you think using solrj alone would be better than DIH?
>
> nope, you will have to do the indexing in multiple threads
>
> if possible, you can setup multiple DIH say /dataimport1, /dataimport2 etc and split your files and can achieve parallelism
>
> > Thanks and have a good weekend!
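[A sketch of both points, for reference. Assumptions: Tomcat at host:8080, and that each DIH endpoint is started by its own request; this is not an exact recipe from the thread.]

```shell
# 1) Each DataImportHandler is triggered separately; a request to
#    /dataimport does not start /dataimport2. To run both imports in
#    parallel, issue one request per handler (host/port are examples):
curl "http://host:8080/solr/dataimport?command=full-import" &
curl "http://host:8080/solr/dataimport2?command=full-import" &

# Each handler also has its own status page:
curl "http://host:8080/solr/dataimport?command=status"
curl "http://host:8080/solr/dataimport2?command=status"

# 2) A web-triggered import runs inside the servlet container's JVM,
#    so the heap cannot be set on the full-import URL; it is set where
#    the container starts. For Tomcat, e.g. (path and values are
#    assumptions to be tuned):
#      $CATALINA_HOME/bin/setenv.sh:
#        export JAVA_OPTS="$JAVA_OPTS -Xmx2g"
```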
> >
> > --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >
> >> no need to use embedded SolrServer.. you can use SolrJ with streaming in multiple threads
> >>
> >> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >
> >> > If I do the xml parsing by myself and use an embedded client to do the push, would it be more efficient than DIH?
> >> >
> >> > --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
> >> >
> >> >> From: Grant Ingersoll <gsing...@apache.org>
> >> >> Subject: Re: How to index large set data
> >> >> To: solr-user@lucene.apache.org
> >> >> Date: Friday, May 22, 2009, 5:38 AM
> >> >>
> >> >> Can you parallelize this? I don't know that the DIH can handle it, but having multiple threads sending docs to Solr is the best performance-wise, so maybe you need to look at alternatives to pulling with DIH and instead use a client to push into Solr.
> >> >>
> >> >> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
> >> >>
> >> >> > about 2.8M total docs were created. Only the first run finishes. In my 2nd try, it hangs there forever at the end of indexing (I guess right before commit), with CPU usage of 100%. Total 5G (2050) index files are created. Now I have two problems:
> >> >> > 1. why does it hang there and fail?
> >> >> > 2. how can I speed up the indexing?
> >> >> >
> >> >> > Here is my solrconfig.xml:
> >> >> >
> >> >> > <useCompoundFile>false</useCompoundFile>
> >> >> > <ramBufferSizeMB>3000</ramBufferSizeMB>
> >> >> > <mergeFactor>1000</mergeFactor>
> >> >> > <maxMergeDocs>2147483647</maxMergeDocs>
> >> >> > <maxFieldLength>10000</maxFieldLength>
> >> >> > <unlockOnStartup>false</unlockOnStartup>
> >> >> >
> >> >> > --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >> >> >
> >> >> >> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >> >> Subject: Re: How to index large set data
> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> Date: Thursday, May 21, 2009, 10:39 PM
> >> >> >>
> >> >> >> what is the total no. of docs created? I guess it may not be memory bound. Indexing is mostly an IO-bound operation. You may be able to get better perf if an SSD (solid state disk) is used
> >> >> >>
> >> >> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >> >>>
> >> >> >>> Hi Paul,
> >> >> >>>
> >> >> >>> Thank you so much for answering my questions. It really helped. After some adjustment, basically setting mergeFactor to 1000 from the default value of 10, I can finish the whole job in 2.5 hours. I checked that during running time, only around 18% of memory is being used, and VIRT is always 1418m. I am thinking it may be restricted by the JVM memory setting.
> >> >> >>> But I run the data import command through the web, i.e., http://<host>:<port>/solr/dataimport?command=full-import, so how can I set the memory allocation for the JVM?
> >> >> >>> Thanks again!
> >> >> >>>
> >> >> >>> JB
> >> >> >>>
> >> >> >>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
> >> >> >>>
> >> >> >>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >> >>>> Subject: Re: How to index large set data
> >> >> >>>> To: solr-u...@lucene.apache.org
> >> >> >>>> Date: Thursday, May 21, 2009, 9:57 PM
> >> >> >>>>
> >> >> >>>> check the status page of DIH and see if it is working properly, and if yes, what is the rate of indexing
> >> >> >>>>
> >> >> >>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
> >> >> >>>>>
> >> >> >>>>> Hi,
> >> >> >>>>>
> >> >> >>>>> I have about 45GB of xml files to be indexed. I am using DataImportHandler. I started the full import 4 hours ago, and it's still running.....
> >> >> >>>>> My computer has 4GB memory. Any suggestion on the solutions?
> >> >> >>>>> Thanks!
> >> >> >>>>>
> >> >> >>>>> JB
> >> >> >>>>
> >> >> >>>> --
> >> >> >>>> -----------------------------------------------------
> >> >> >>>> Noble Paul | Principal Engineer | AOL | http://aol.com
> >> >>
> >> >> --------------------------
> >> >> Grant Ingersoll
> >> >> http://www.lucidimagination.com/
> >> >>
> >> >> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
> >> >> http://www.lucidimagination.com/search
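[For reference, the "SolrJ with streaming in multiple threads" suggestion quoted above can be sketched roughly as below. This is a minimal sketch, not an exact recipe from the thread: it assumes SolrJ 1.4's StreamingUpdateSolrServer, which buffers added documents in a queue and sends them over several background threads; the URL, queue size, thread count, and field names are all illustrative assumptions to be tuned.]

```java
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;

public class BulkIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        // Queue up to 100 docs and send with 4 background threads
        // (URL and sizes are assumptions, not values from the thread).
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        for (int i = 0; i < 1000; i++) {               // stand-in for the real XML parsing loop
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));    // field names are hypothetical
            doc.addField("text", "document body " + i);
            server.add(doc);                            // queued; sent by background threads
        }
        server.commit();                                // single commit at the end
    }
}
```

The main loop stays single-threaded; the parallelism comes from the sender threads inside StreamingUpdateSolrServer, which keeps the client simpler than managing worker threads by hand.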