hi,

> 1) I assume you are doing batching interspersed with commits

as each file I crawl is article-level, each <add> contains all the
sentences for that article, so they are naturally batched into about
500 documents per post in LCF.
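
so each index.xml file looks roughly like this (the field names here are
just for illustration, not our real schema):

<add>
  <doc>
    <field name="article_id">A123</field>
    <field name="sentence_id">1</field>
    <field name="text">First sentence of the article...</field>
    <!-- ~12 more fields per sentence, many multi-valued -->
  </doc>
  <!-- ... one <doc> per sentence, about 500 per article -->
</add>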

I use auto-commit in Solr:
<autoCommit>
  <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
  <maxTime>900000</maxTime> <!-- every 15 minutes -->
</autoCommit>

> 2) Why do you need sentence level Lucene docs?

that's an application-specific need, due to linguistic info required on a
per-sentence basis.

> 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be
> surprised if you a memory/connection leak their (or it is not
> releasing some resource explicitly)

I thought this could be the case too -- but if I replace the use of my
custom analysers and specify that my fields are of type "text" instead
(i.e. the stock type from the example schema.xml, using solr-based
analysers) then I get this kind of hanging too -- at least I did when I
didn't have any explicit GC settings... it does take longer to replicate,
as my analysers/field types are more complex than the "text" field type.

i will try it again with the different GC settings tomorrow and post
the results.
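
for reference, the full command line i'm planning to test looks something
like this (the 6GB heap is a guess at what i'll use; the GC flags are the
ones from my earlier mail below):

java -Xms6g -Xmx6g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled \
     -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8 \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -jar start.jar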

> In general, we have NEVER had a problem in loading Solr.

i'm not sure if we would either, if we posted as we created the index.xml
files... but because we post 500+ documents at a time (one article file
per LCF post) and LCF can post these files quickly, i'm not sure whether I
need to try and slow down the post rate!?
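
if it comes to that, I could throttle on my side with something crude like
this sketch (the update URL and pause length are placeholders, not tuned
values):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// crude client-side throttle: post one article's <add> file, then pause
public class ThrottledPoster {
    public static void post(byte[] articleXml) throws Exception {
        URL url = new URL("http://localhost:8983/solr/update"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        try {
            out.write(articleXml); // one <add> with ~500 sentence <doc>s
        } finally {
            out.close();
        }
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("update failed: " + conn.getResponseCode());
        }
        Thread.sleep(500); // pause so Solr can catch up -- value untuned
    }
}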

thanks for your replies,

bec :)

> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>> sorry -- i used the term "documents" too loosely!
>>
>> 180k scientific articles with 500-1000 sentences each
>> and we index sentence-level index documents
>> so i'm guessing about 100 million lucene index documents in total.
>>
>> an update on my progress:
>>
>> i used GC settings of:
>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
>> -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
>> -XX:CMSInitiatingOccupancyFraction=70
>>
>> which allowed the indexing process to run to 11.5k articles / for
>> about 2 hours before I got the same kind of hanging/unresponsive Solr,
>> with this as the tail of the solr logs:
>>
>> Before GC:
>> Statistics for BinaryTreeDictionary:
>> ------------------------------------
>> Total Free Space: 2416734
>> Max   Chunk Size: 2412032
>> Number of Blocks: 3
>> Av.  Block  Size: 805578
>> Tree      Height: 3
>> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.0000193 secs]5980.480:
>> [CMS
>>
>> I also saw (in jconsole) that the number of threads rose from the
>> steady 32 used for the first 2 hours up to 72 before Solr finally
>> became unresponsive...
>>
>> i've got the following GC info params switched on (as many as i could
>> find!):
>> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>> -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
>> -XX:PrintFLSStatistics=1
>>
>> with 11.5k articles in about 2 hours, this was 11.5k * 500 / 2 = 2.875
>> million fairly small docs per hour!! this produced an index of about
>> 40GB, to give you an idea of index size...
>>
>> because i've already got the documents in solr native xml format,
>> i.e. one file per article, each with <add><doc>...</doc>...,
>> i.e. posting each set of sentence docs per article in every LCF file
>> post... this means that LCF can throw documents at Solr very fast, and
>> i think i'm breaking it GC-wise.
>>
>> i'm going to try adding in System.gc() calls to see if this runs ok
>> (albeit slower)...
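>>
>> e.g. something crude between article posts (just a sketch -- the
>> postArticle/articleFiles names are made up, and i know explicit GC is
>> normally a last resort):
>>
>> int count = 0;
>> for (File f : articleFiles) {
>>     postArticle(f);              // posts one <add> file to Solr
>>     if (++count % 100 == 0) {
>>         System.gc();             // force a full GC every 100 articles
>>     }
>> }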
>> otherwise i'm pretty much at a loss as to what could be causing the
>> solr hanging, if it's not a GC issue...
>>
>> thanks :)
>>
>> bec
>>
>> On 12 August 2010 21:42, dc tech <dctech1...@gmail.com> wrote:
>>> I am a little confused - how did 180k documents become 100m index
>>> documents?
>>> We have over 20 indices (for different content sets), one with 5m
>>> documents (about a couple of pages each) and another with 100k+ docs.
>>> We can index the 5m collection in a couple of days (limitation is in
>>> the source) which is 100k documents an hour without breaking a sweat.
>>>
>>>
>>>
>>> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> When indexing large amounts of data I hit a problem whereby Solr
>>>> becomes unresponsive and doesn't recover (even when left overnight!).
>>>> I think i've hit some GC problems / some GC tuning is required, and I
>>>> wanted to know if anyone else has hit this problem.
>>>> I can replicate this error (albeit taking longer to do so) using
>>>> Solr/Lucene analysers only, so I thought other people might have hit
>>>> this issue before over large data sets....
>>>>
>>>> Background on my problem follows -- but I guess my main question is --
>>>> can Solr become so overwhelmed by update posts that it becomes
>>>> completely unresponsive??
>>>>
>>>> Right now I think the problem is that the java GC is hanging, but I've
>>>> been working on this all week and it took a while to figure out it
>>>> might be GC-based / wasn't a direct result of my custom analysers, so
>>>> i'd appreciate any advice anyone has about indexing large document
>>>> collections.
>>>>
>>>> I also have a second question for those in the know -- do we have a
>>>> chance of indexing/searching over our large dataset with what little
>>>> hardware we already have available??
>>>>
>>>> thanks in advance :)
>>>>
>>>> bec
>>>>
>>>> a bit of background: -------------------------------
>>>>
>>>> I've got a large collection of articles we want to index/search over
>>>> -- about 180k in total. Each article has say 500-1000 sentences, and
>>>> each sentence has about 15 fields, many of which are multi-valued; we
>>>> store most fields as well for display/highlighting purposes. So I'd
>>>> guess over 100 million index documents.
>>>>
>>>> In our small test collection of 700 articles this results in a single
>>>> index of about 13GB.
>>>>
>>>> Our pipeline processes PDF files through to Solr native xml, which we
>>>> call "index.xml" files, i.e. in <add><doc>... format ready to post
>>>> straight to Solr's update handler.
>>>>
>>>> We create the index.xml files as we pull in information from a few
>>>> sources, and creation of these files from their original PDF form is
>>>> farmed out across a grid and is quite time-consuming, so we distribute
>>>> this process rather than creating index.xml files on the fly...
>>>>
>>>> We do a lot of linguistic processing, and enabling search over the
>>>> resulting terms requires analysers that split terms / join terms
>>>> together, i.e. custom analysers that perform string operations and are
>>>> quite time-consuming / have a large overhead compared to most
>>>> analysers (they take approx 20-30% more time and use twice as many
>>>> short-lived objects as the "text" field type).
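>>>>
>>>> (to give a sense of the shape, here's a stripped-down sketch of the
>>>> kind of term-splitting filter involved -- hypothetical names, Lucene
>>>> 2.9-style API, not our actual code; a real filter would also fix up
>>>> offsets and position increments:)
>>>>
>>>> import java.io.IOException;
>>>> import org.apache.lucene.analysis.TokenFilter;
>>>> import org.apache.lucene.analysis.TokenStream;
>>>> import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>>>>
>>>> public final class HyphenSplitFilter extends TokenFilter {
>>>>     private final TermAttribute termAtt = addAttribute(TermAttribute.class);
>>>>     private String pending;                 // second half awaiting emission
>>>>
>>>>     public HyphenSplitFilter(TokenStream in) { super(in); }
>>>>
>>>>     public boolean incrementToken() throws IOException {
>>>>         if (pending != null) {              // emit the queued half
>>>>             termAtt.setTermBuffer(pending);
>>>>             pending = null;
>>>>             return true;
>>>>         }
>>>>         if (!input.incrementToken()) return false;
>>>>         String term = termAtt.term();       // short-lived String per token
>>>>         int dash = term.indexOf('-');
>>>>         if (dash > 0 && dash < term.length() - 1) {
>>>>             pending = term.substring(dash + 1);
>>>>             termAtt.setTermBuffer(term.substring(0, dash));
>>>>         }
>>>>         return true;
>>>>     }
>>>> }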
>>>>
>>>> Right now i'm working on my new iMac:
>>>> quad-core 2.8 GHz Intel Core i7
>>>> 16 GB 1067 MHz DDR3 RAM
>>>> 2TB hard-drive (about half free)
>>>> Version 10.6.4 OSX
>>>>
>>>> Production environment:
>>>> 2 linux boxes each with:
>>>> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
>>>> 16GB RAM
>>>>
>>>> I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
>>>> right now).
>>>>
>>>> I set up Solr to use autocommit, as we'll have several document
>>>> collections / we post to Solr from different data sets:
>>>>
>>>>     <!-- autocommit pending docs if certain criteria are met.
>>>>          Future versions may expand the available criteria -->
>>>>     <autoCommit>
>>>>       <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>>>>       <maxTime>900000</maxTime> <!-- every 15 minutes -->
>>>>     </autoCommit>
>>>>
>>>> I also have
>>>>   <useCompoundFile>false</useCompoundFile>
>>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>     <mergeFactor>10</mergeFactor>
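>>>>
>>>> (as i understand those settings, segments are flushed once the RAM
>>>> buffer hits ~1GB and merged every 10 segments, so these bulk runs
>>>> should see fairly few, large merges)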
>>>> -----------------
>>>>
>>>> *** First question:
>>>> Has anyone else found that Solr hangs/becomes unresponsive after too
>>>> many documents are indexed at once i.e. Solr can't keep up with the post
>>>> rate?
>>>>
>>>> I've got LCF crawling my local test set (file system connection
>>>> required only) and posting documents to Solr using 6GB of RAM. As I
>>>> said above, these documents are in native Solr XML format
>>>> (<add><doc>....) with one file per article, so each <add> contains
>>>> all the sentence-level documents for the article.
>>>>
>>>> With LCF I post about 2.5-3k articles (files) per hour -- so about
>>>> 2.5k*500/3600 = ~350 <doc>s per second post-rate -- is this
>>>> normal/expected??
>>>>
>>>> Eventually, after about 3000 files (an hour or so), Solr starts to
>>>> hang / becomes unresponsive, and with Jconsole/GC logging I can see
>>>> that the Old-Gen space is about 90% full; the following is the end of
>>>> the solr log file -- where you can see GC has been called:
>>>> ------------------------------------------------------------------
>>>> 3012.290: [GC Before GC:
>>>> Statistics for BinaryTreeDictionary:
>>>> ------------------------------------
>>>> Total Free Space: 53349392
>>>> Max   Chunk Size: 3200168
>>>> Number of Blocks: 66
>>>> Av.  Block  Size: 808324
>>>> Tree      Height: 13
>>>> Before GC:
>>>> Statistics for BinaryTreeDictionary:
>>>> ------------------------------------
>>>> Total Free Space: 0
>>>> Max   Chunk Size: 0
>>>> Number of Blocks: 0
>>>> Tree      Height: 0
>>>> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
>>>> 0.0769802 secs]3012.367: [CMS
>>>> ------------------------------------------------------------------
>>>>
>>>> I can replicate this with Solr using "text" field types in place of
>>>> those that use my custom analysers -- whereby Solr takes longer to
>>>> become unresponsive (about 3 hours / 13k docs) but there is the same
>>>> kind of GC message at the end of the log file, and Jconsole shows the
>>>> Old-Gen space was almost full, so it was due for a collection sweep.
>>>>
>>>> I don't use any special GC settings, but found an article here:
>>>> http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
>>>>
>>>> that suggests using particular GC settings for Solr -- I will try
>>>> these, but thought someone else could suggest another error source /
>>>> give some GC advice??
>>>>
>>>> -----------------
>>>>
>>>> *** Second question:
>>>>
>>>> Given the production machines available for the Solr servers, does it
>>>> look like we've got enough hardware to produce reasonable query times
>>>> / handle a few hundred queries per second??
>>>>
>>>> I planned on setting up one Solr server per machine (so two in total),
>>>> each with 8GB of RAM -- so half of the 16GB available.
>>>>
>>>> We also have a third, less powerful machine that houses all our data,
>>>> so I plan to set up LCF on that machine + post the files to the two
>>>> Solr servers from this machine in the subnet.
>>>>
>>>> Does it sound like we might be able to achieve indexing/search over
>>>> this little hardware (given around 100 million index documents, i.e.
>>>> approx 50 million per Solr server)?
>>>>
>>>
>>> --
>>> Sent from my mobile device
>>>
>>
>
> --
> Sent from my mobile device
>
