Sorry -- I used the term "documents" too loosely!

180k scientific articles, each with roughly 500-1000 sentences, and we
index one Lucene document per sentence -- so at 180k * ~500-1000 sentences
per article, I'm guessing about 100 million Lucene index documents in total.

an update on my progress:

I used GC settings of:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
        -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
-XX:CMSInitiatingOccupancyFraction=70

which allowed the indexing process to run to 11.5k articles (about 2 hours)
before I got the same kind of hanging/unresponsive Solr, with this as the
tail of the Solr logs:

Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 2416734
Max   Chunk Size: 2412032
Number of Blocks: 3
Av.  Block  Size: 805578
Tree      Height: 3
5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.0000193 secs]5980.480: [CMS

I also saw (in JConsole) that the number of threads rose from the steady 32
used throughout those 2 hours up to 72 before Solr finally became
unresponsive...

I've got the following GC info flags switched on (as many as I could
find!) -- the full startup line is sketched just after this list:
-XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
        -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
        -XX:PrintFLSStatistics=1
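
(for completeness, all of these flags just go on the Solr startup line --
a rough sketch below, where the 6g heap is only a placeholder and I'm
assuming the stock example/Jetty start.jar setup; adding -Xloggc:gc.log
would also send the GC output to its own file rather than mixing it into
the main Solr log):

java -Xms6g -Xmx6g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled \
     -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8 \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime \
     -XX:PrintFLSStatistics=1 \
     -jar start.jar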

With 11.5k articles indexed in about 2 hours, that works out to
11.5k * 500 / 2 = 2.875 million fairly small docs per hour!! This produced
an index of about 40GB, to give you an idea of index size...

Because I've already got the documents in Solr native XML format -- one file
per article, each an <add> containing all of that article's sentence-level
<doc>s -- every LCF file post sends a whole article's worth of sentence docs
at once. This means LCF can throw documents at Solr very fast... and I think
I'm breaking it GC-wise.
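
(to make the format concrete, each of these files looks roughly like the
sketch below -- the field names are placeholders rather than our real
schema, and a real file has 500-1000 <doc>s with ~15 fields each):

<add>
  <doc>
    <field name="id">article0001_s0001</field>
    <field name="article_id">article0001</field>
    <field name="sentence">text of the first sentence...</field>
  </doc>
  <doc>
    <field name="id">article0001_s0002</field>
    <field name="article_id">article0001</field>
    <field name="sentence">text of the second sentence...</field>
  </doc>
  ...
</add>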

I'm going to try adding in periodic System.gc() calls to see if it runs OK
that way (albeit slower)... otherwise I'm pretty much at a loss as to what's
causing this GC issue -- or what's making Solr hang, if it isn't really a GC
issue...

thanks :)

bec

On 12 August 2010 21:42, dc tech <dctech1...@gmail.com> wrote:
> I am a little confused - how did 180k documents become 100m index documents?
> We have over 20 indices (for different content sets), one with 5m
> documents (about a couple of pages each) and another with 100k+ docs.
> We can index the 5m collection in a couple of days (the limitation is in
> the source), which is about 100k documents an hour without breaking a sweat.
>
>
>
> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>> Hi,
>>
>> When indexing large amounts of data I hit a problem whereby Solr becomes
>> unresponsive and doesn't recover (even when left overnight!). I think
>> I've hit a GC problem / some GC tuning is required, and I wanted to know
>> if anyone has ever hit this problem. I can replicate this error (albeit
>> taking longer to do so) using only the standard Solr/Lucene analysers,
>> so I thought other people might have hit this issue before over large
>> data sets....
>>
>> Background on my problem follows -- but I guess my main question is: can
>> Solr become so overwhelmed by update posts that it becomes completely
>> unresponsive??
>>
>> Right now I think the problem is that the Java GC is hanging, but I've
>> been working on this all week and it took a while to figure out it might
>> be GC-based / wasn't a direct result of my custom analysers, so I'd
>> appreciate any advice anyone has about indexing large document collections.
>>
>> I also have a second question for those in the know -- do we have a
>> chance of indexing/searching over our large dataset with what little
>> hardware we already have available??
>>
>> thanks in advance :)
>>
>> bec
>>
>> a bit of background: -------------------------------
>>
>> I've got a large collection of articles we want to index/search over --
>> about 180k in total. Each article has say 500-1000 sentences and each
>> sentence has about 15 fields, many of which are multi-valued, and we
>> store most fields as well for display/highlighting purposes. So I'd
>> guess over 100 million index documents.
>>
>> In our small test collection of 700 articles this results in a single
>> index of about 13GB.
>>
>> Our pipeline processes PDF files through to Solr native XML, which we
>> call "index.xml" files, i.e. in <add><doc>... format, ready to post
>> straight to Solr's update handler.
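>>
>> (each of these files can then be posted straight to the update handler --
>> e.g. something like the line below, where host/port are the Solr example
>> defaults and the file name is just a placeholder):
>>
>> curl 'http://localhost:8983/solr/update' \
>>      -H 'Content-Type: text/xml; charset=utf-8' \
>>      --data-binary @article0001.index.xml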
>>
>> We create the index.xml files as we pull in information from a few
>> sources, and creating these files from their original PDF form is farmed
>> out across a grid and is quite time-consuming, so we distribute this
>> process rather than creating index.xml files on the fly...
>>
>> We do a lot of linguistic processing, and enabling search over the
>> resulting terms requires analysers that split terms / join terms
>> together, i.e. custom analysers that perform string operations and have
>> a large overhead compared to most analysers (they take approx. 20-30%
>> more time and create twice as many short-lived objects as the "text"
>> field type).
>>
>> Right now I'm working on my new iMac:
>> quad-core 2.8 GHz Intel Core i7
>> 16 GB 1067 MHz DDR3 RAM
>> 2TB hard drive (about half free)
>> OS X 10.6.4
>>
>> Production environment:
>> 2 linux boxes each with:
>> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
>> 16GB RAM
>>
>> I use Java 1.6 and Solr 1.4.1 with the multi-core setup (just a single
>> core right now).
>>
>> I set up Solr to use autocommit, as we'll have several document
>> collections / will post to Solr from different data sets:
>>
>>  <!-- autocommit pending docs if certain criteria are met.  Future
>> versions may expand the available
>>      criteria -->
>>     <autoCommit>
>>       <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>>       <maxTime>900000</maxTime> <!-- every 15 minutes -->
>>     </autoCommit>
>>
>> I also have:
>>     <useCompoundFile>false</useCompoundFile>
>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>     <mergeFactor>10</mergeFactor>
>> -----------------
>>
>> *** First question:
>> Has anyone else found that Solr hangs/becomes unresponsive after too
>> many documents are indexed at once i.e. Solr can't keep up with the post
>> rate?
>>
>> I've got LCF crawling my local test set (only a file system connection
>> is required) and posting documents to Solr, using 6GB of RAM. As I said
>> above, these documents are in native Solr XML format (<add><doc>....)
>> with one file per article, so each <add> contains all the sentence-level
>> documents for that article.
>>
>> With LCF I post about 2.5-3k articles (files) per hour -- so roughly
>> 2.5k * 500 / 3600 = ~350 <doc>s per second post rate -- is this
>> normal/expected??
>>
>> Eventually, after about 3000 files (an hour or so), Solr starts to hang /
>> becomes unresponsive, and with JConsole/GC logging I can see that the
>> old-gen space is about 90% full. The following is the end of the Solr log
>> file -- where you can see GC has been called:
>> ------------------------------------------------------------------
>> 3012.290: [GC Before GC:
>> Statistics for BinaryTreeDictionary:
>> ------------------------------------
>> Total Free Space: 53349392
>> Max   Chunk Size: 3200168
>> Number of Blocks: 66
>> Av.  Block  Size: 808324
>> Tree      Height: 13
>> Before GC:
>> Statistics for BinaryTreeDictionary:
>> ------------------------------------
>> Total Free Space: 0
>> Max   Chunk Size: 0
>> Number of Blocks: 0
>> Tree      Height: 0
>> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
>> 0.0769802 secs]3012.367: [CMS
>> ------------------------------------------------------------------
>>
>> I can replicate this with Solr using "text" field types in place of the
>> ones that use my custom analysers -- Solr takes longer to become
>> unresponsive (about 3 hours / 13k files) but there is the same kind of GC
>> message at the end of the log file, and JConsole shows that the old-gen
>> space was almost full and so due for a collection sweep.
>>
>> I don't use any special GC settings but found an article here:
>> http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
>>
>> that suggests particular GC settings for Solr -- I will try these, but
>> thought someone else might be able to suggest another error source / give
>> some GC advice??
>>
>> -----------------
>>
>> *** Second question:
>>
>> Given the production machines available for the Solr servers, does it
>> look like we've got enough hardware to produce reasonable query times /
>> handle a few hundred queries per second??
>>
>> I planned on setting up one Solr server per machine (so two in total),
>> each with 8GB of RAM -- so half of the 16GB available.
>>
>> We also have a third, less powerful machine that houses all our data, so
>> I plan to set up LCF on that machine and post the files from it to the
>> two Solr servers on the subnet.
>>
>> Does it sound like we might be able to achieve indexing/search over this
>> little hardware (given around 100 million index documents, i.e. approx.
>> 50 million per Solr server)?
>>
>
> --
> Sent from my mobile device
>
