1) I assume you are doing batching interspersed with commits (a rough
sketch of what I mean follows below)
2) Why do you need sentence-level Lucene docs?
3) Are your custom handlers/parsers part of the Solr JVM? I would not be
surprised if you have a memory/connection leak there (or something is
not releasing a resource explicitly)
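
On point 1, something along these lines is what I mean -- just a sketch,
assuming the SolrJ client (CommonsHttpSolrServer) that ships with Solr
1.4.x; the URL, field names, batch size and commit interval are made-up
values, not recommendations:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedPoster {

    public static void main(String[] args) throws Exception {
        // hypothetical Solr URL
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        int addedSinceCommit = 0;

        for (int i = 0; i < 100000; i++) {       // stand-in for the real document source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);      // hypothetical field names
            doc.addField("text", "...");
            batch.add(doc);

            if (batch.size() >= 1000) {          // batch size: arbitrary
                server.add(batch);
                addedSinceCommit += batch.size();
                batch.clear();
            }
            if (addedSinceCommit >= 50000) {     // commit interval: arbitrary
                server.commit();
                addedSinceCommit = 0;
            }
        }

        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();
    }
}

The point is simply to add in batches and commit on a coarse interval
(or rely on autocommit), rather than committing per document.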

In general, we have NEVER had a problem in loading Solr.

On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
> sorry -- i used the term "documents" too loosely!
>
> 180k scientific articles with between 500-1000 sentences each,
> and we index one Lucene document per sentence,
> so i'm guessing about 100 million lucene index documents in total.
>
> an update on my progress:
>
> i used GC settings of:
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
>       -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
> -XX:CMSInitiatingOccupancyFraction=70
>
> which allowed the indexing process to run to 11.5k articles (about
> 2 hours) before I got the same kind of hanging/unresponsive Solr,
> with this as the tail of the Solr logs:
>
> Before GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 2416734
> Max   Chunk Size: 2412032
> Number of Blocks: 3
> Av.  Block  Size: 805578
> Tree      Height: 3
> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.0000193 secs]5980.480:
> [CMS
>
> I also saw (in JConsole) that the number of threads rose from the
> steady 32 used for the first 2 hours up to 72 before Solr finally
> became unresponsive...
>
> i've got the following GC info params switched on (as many as i could
> find!):
> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>       -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
>       -XX:PrintFLSStatistics=1
>
> with 11.5k articles in about 2 hours this was 11.5k * 500 / 2 = 2.875
> million fairly small docs per hour!! this produced an index of about
> 40GB, to give you an idea of index size...
>
> because i've already got the documents in Solr native XML format,
> i.e. one file per article, each with <add><doc>...</doc>...,
> each LCF file post sends a whole article's worth of sentence docs --
> this means that LCF can throw documents at Solr very fast... and i
> think i'm breaking it GC-wise.
>
> i'm going to try adding in System.gc() calls to see if this runs ok
> (albeit slower) -- a sketch of what i mean is below...
> otherwise i'm pretty much at a loss as to what could be causing
> Solr to hang, if it's not a GC issue...
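>
> to be concrete, this is roughly the shape of the experiment -- just a
> sketch, posting the pre-built per-article <add> files straight at
> /update over plain HTTP, with an explicit System.gc() every so often;
> the URL, directory and GC interval are made-up values:
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
>
> public class PostArticleFiles {
>
>     public static void main(String[] args) throws Exception {
>         // hypothetical Solr update URL and input directory
>         URL update = new URL("http://localhost:8983/solr/update");
>         File[] articles = new File("/data/index-xml").listFiles();
>
>         int posted = 0;
>         for (File f : articles) {
>             HttpURLConnection conn = (HttpURLConnection) update.openConnection();
>             conn.setDoOutput(true);
>             conn.setRequestMethod("POST");
>             conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
>
>             // stream the per-article <add><doc>... file as the request body
>             InputStream in = new FileInputStream(f);
>             OutputStream out = conn.getOutputStream();
>             byte[] buf = new byte[8192];
>             int n;
>             while ((n = in.read(buf)) != -1) {
>                 out.write(buf, 0, n);
>             }
>             out.close();
>             in.close();
>
>             int code = conn.getResponseCode();
>             if (code != 200) {
>                 System.err.println("post failed for " + f.getName() + ": HTTP " + code);
>             } else {
>                 conn.getInputStream().close();   // drain and close the response
>             }
>             conn.disconnect();
>
>             // the experiment: force a collection every N articles (N is arbitrary)
>             posted++;
>             if (posted % 100 == 0) {
>                 System.gc();
>             }
>         }
>     }
> }
>
> (note the System.gc() here is in the posting client, not in Solr itself
> -- the main effect is just to slow the post rate down.)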
>
> thanks :)
>
> bec
>
> On 12 August 2010 21:42, dc tech <dctech1...@gmail.com> wrote:
>> I am a little confused - how did 180k documents become 100m index
>> documents?
>> We have over 20 indices (for different content sets), one with 5m
>> documents (about a couple of pages each) and another with 100k+ docs.
>> We can index the 5m collection in a couple of days (the limitation is
>> in the source), which is 100k documents an hour without breaking a sweat.
>>
>>
>>
>> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>>> Hi,
>>>
>>> When indexing large amounts of data I hit a problem whereby Solr
>>> becomes unresponsive and doesn't recover (even when left overnight!).
>>> I think i've hit a GC problem / some GC tuning is required, and I
>>> wanted to know if anyone has ever hit this problem.
>>> I can replicate this error (albeit taking longer to do so) using only
>>> the stock Solr/Lucene analysers, so I thought other people might have
>>> hit this issue before over large data sets....
>>>
>>> Background on my problem follows -- but I guess my main question is:
>>> can Solr become so overwhelmed by update posts that it becomes
>>> completely unresponsive??
>>>
>>> Right now I think the problem is that the java GC is hanging, but I've
>>> been working on this all week and it took a while to figure out it
>>> might be GC-based / wasn't a direct result of my custom analysers, so
>>> i'd appreciate any advice anyone has about indexing large document
>>> collections.
>>>
>>> I also have a second question for those in the know -- do we have a
>>> chance of indexing/searching over our large dataset with what little
>>> hardware we already have available??
>>>
>>> thanks in advance :)
>>>
>>> bec
>>>
>>> a bit of background: -------------------------------
>>>
>>> I've got a large collection of articles we want to index/search over
>>> -- about 180k in total. Each article has say 500-1000 sentences and
>>> each sentence has about 15 fields, many of which are multi-valued,
>>> and we store most fields as well for display/highlighting purposes.
>>> So I'd guess over 100 million index documents.
>>>
>>> In our small test collection of 700 articles this results in a single
>>> index of about 13GB.
>>>
>>> Our pipeline processes PDF files through to Solr native XML, which we
>>> call "index.xml" files, i.e. in <add><doc>... format, ready to post
>>> straight to Solr's update handler.
>>>
>>> We create the index.xml files as we pull in information from a few
>>> sources, and creating these files from their original PDF form is
>>> quite time-consuming, so that process is farmed out across a grid
>>> rather than creating index.xml files on the fly...
>>>
>>> We do a lot of linguistic processing, and searching over the resulting
>>> terms requires analysers that split terms / join terms together,
>>> i.e. custom analysers that perform string operations and have a large
>>> overhead compared to most analysers (they take approx. 20-30% more
>>> time and create twice as many short-lived objects as the "text" field
>>> type).
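>>>
>>> To give a flavour of what these filters do (a minimal sketch only --
>>> the class name and the "word|tag" term convention are made up, and it
>>> assumes the Lucene 2.9 attribute API that ships with Solr 1.4.1):
>>>
>>> import java.io.IOException;
>>> import org.apache.lucene.analysis.TokenFilter;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>>>
>>> // Rewrites each term with a string operation, allocating new Strings
>>> // per token -- the kind of short-lived garbage described above.
>>> public final class TermRewriteFilter extends TokenFilter {
>>>
>>>     private final TermAttribute termAtt;
>>>
>>>     public TermRewriteFilter(TokenStream input) {
>>>         super(input);
>>>         termAtt = (TermAttribute) addAttribute(TermAttribute.class);
>>>     }
>>>
>>>     @Override
>>>     public boolean incrementToken() throws IOException {
>>>         if (!input.incrementToken()) {
>>>             return false;
>>>         }
>>>         // example string op: strip a "|tag" suffix off a "word|tag" term
>>>         String term = termAtt.term();
>>>         int bar = term.indexOf('|');
>>>         if (bar >= 0) {
>>>             termAtt.setTermBuffer(term.substring(0, bar));
>>>         }
>>>         return true;
>>>     }
>>> }
>>>
>>> The real filters split and join terms as well, so they do rather more
>>> string work per token than this.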
>>>
>>> Right now i'm working on my new iMac:
>>> quad-core 2.8 GHz Intel Core i7
>>> 16 GB 1067 MHz DDR3 RAM
>>> 2TB hard-drive (about half free)
>>> OS X 10.6.4
>>>
>>> Production environment:
>>> 2 linux boxes each with:
>>> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
>>> 16GB RAM
>>>
>>> I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
>>> right now).
>>>
>>> I set up Solr to use autocommit, as we'll have several document
>>> collections / will post to Solr from different data sets:
>>>
>>>  <!-- autocommit pending docs if certain criteria are met.  Future
>>> versions may expand the available
>>>      criteria -->
>>>     <autoCommit>
>>>       <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>>>       <maxTime>900000</maxTime> <!-- every 15 minutes -->
>>>     </autoCommit>
>>>
>>> I also have
>>>   <useCompoundFile>false</useCompoundFile>
>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>     <mergeFactor>10</mergeFactor>
>>> -----------------
>>>
>>> *** First question:
>>> Has anyone else found that Solr hangs/becomes unresponsive after too
>>> many documents are indexed at once i.e. Solr can't keep up with the post
>>> rate?
>>>
>>> I've got LCF crawling my local test set (only a file-system connection
>>> is required) and posting documents to Solr, using 6GB of RAM. As I
>>> said above, these documents are in native Solr XML format
>>> (<add><doc>....) with one file per article, so each <add> contains all
>>> the sentence-level documents for the article.
>>>
>>> With LCF I post about 2.5/3k articles (files) per hour -- so about
>>> 2.5k * 500 / 3600 = ~350 <doc>s per second post-rate -- is this
>>> normal/expected??
>>>
>>> Eventually, after about 3000 files (an hour or so), Solr starts to
>>> hang / becomes unresponsive, and with JConsole/GC logging I can see
>>> that the Old-Gen space is about 90% full. The following is the end of
>>> the Solr log file -- where you can see GC has been called:
>>> ------------------------------------------------------------------
>>> 3012.290: [GC Before GC:
>>> Statistics for BinaryTreeDictionary:
>>> ------------------------------------
>>> Total Free Space: 53349392
>>> Max   Chunk Size: 3200168
>>> Number of Blocks: 66
>>> Av.  Block  Size: 808324
>>> Tree      Height: 13
>>> Before GC:
>>> Statistics for BinaryTreeDictionary:
>>> ------------------------------------
>>> Total Free Space: 0
>>> Max   Chunk Size: 0
>>> Number of Blocks: 0
>>> Tree      Height: 0
>>> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
>>> 0.0769802 secs]3012.367: [CMS
>>> ------------------------------------------------------------------
>>>
>>> I can replicate this with Solr using "text" field types in place of
>>> those that use my custom analysers -- Solr takes longer to become
>>> unresponsive (about 3 hours / 13k docs), but there is the same kind of
>>> GC message at the end of the log file, and JConsole shows that the
>>> Old-Gen space was almost full, so it was due for a collection sweep.
>>>
>>> I don't use any special GC settings, but I found an article here:
>>> http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
>>>
>>> that suggests particular GC settings for Solr -- I will try these, but
>>> thought someone else could suggest another error source / give some GC
>>> advice??
>>>
>>> -----------------
>>>
>>> *** Second question:
>>>
>>> Given the production machines available for the Solr servers, does it
>>> look like we've got enough hardware to produce reasonable query times
>>> / handle a few hundred queries per second??
>>>
>>> I planned on setting up one Solr server per machine (so two in total),
>>> each with 8GB
>>> of RAM -- so half of the 16GB available.
>>>
>>> We also have a third, less powerful machine that houses all our data,
>>> so I plan to set up LCF on that machine and post the files to the two
>>> Solr servers from that machine on the subnet.
>>>
>>> Does it sound like we might be able to achieve indexing/search with
>>> this little hardware (given around 100 million index documents,
>>> i.e. approx. 50 million per Solr server)?
>>>
>>
>> --
>> Sent from my mobile device
>>
>

-- 
Sent from my mobile device
