1) I assume you are doing batching interspersed with commits?
2) Why do you need sentence-level Lucene docs?
3) Are your custom handlers/parsers part of the Solr JVM? I would not be surprised if you have a memory/connection leak there (or something is not releasing a resource explicitly).
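On point 1, a minimal sketch of what batching interspersed with explicit commits might look like from an indexing client, assuming the Solr 1.4-era SolrJ API (CommonsHttpSolrServer); the URL, batch size, and field names are placeholders, not taken from the thread:

------------------------------------------------------------------
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and batch size -- adjust for the real setup.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int batchSize = 1000;

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {            // stand-in for the real document source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "sentence-" + i);      // hypothetical field names
            doc.addField("text", "example sentence " + i);
            batch.add(doc);

            if (batch.size() >= batchSize) {
                server.add(batch);                    // send one batch at a time
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        // One explicit commit at the end (or every N batches) instead of
        // relying solely on autocommit while documents are streaming in.
        server.commit();
    }
}
------------------------------------------------------------------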
In general, we have NEVER had a problem in loading Solr.

On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
> sorry -- I used the term "documents" too loosely!
>
> 180k scientific articles with between 500-1000 sentences each, and we index sentence-level index documents, so I'm guessing about 100 million Lucene index documents in total.
>
> An update on my progress:
>
> I used GC settings of:
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
> -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
> -XX:CMSInitiatingOccupancyFraction=70
>
> which allowed the indexing process to run to 11.5k articles and for about 2 hours before I got the same kind of hanging/unresponsive Solr, with this as the tail of the Solr logs:
>
> Before GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 2416734
> Max Chunk Size: 2412032
> Number of Blocks: 3
> Av. Block Size: 805578
> Tree Height: 3
> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.0000193 secs]5980.480: [CMS
>
> I also saw (in JConsole) that the number of threads rose from the steady 32 used for the first 2 hours to 72 before Solr finally became unresponsive...
>
> I've got the following GC info params switched on (as many as I could find!):
> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
> -XX:PrintFLSStatistics=1
>
> With 11.5k docs indexed in about 2 hours, this was 11.5k * 500 / 2 = 2.875 million fairly small docs per hour!! This produced an index of about 40GB, to give you an idea of index size...
>
> Because I've already got the documents in Solr native XML format, i.e. one file per article, each with <add><doc>...</doc>..., i.e. posting each article's set of sentence docs in every LCF file post, LCF can throw documents at Solr very fast... and I think I'm breaking it GC-wise.
>
> I'm going to try adding in System.gc() calls to see if this runs OK (albeit slower)... otherwise I'm pretty much at a loss as to what could be causing this GC issue / Solr hanging, if it's not a GC issue...
>
> thanks :)
>
> bec
>
> On 12 August 2010 21:42, dc tech <dctech1...@gmail.com> wrote:
>> I am a little confused - how did 180k documents become 100m index documents?
>> We have over 20 indices (for different content sets), one with 5m documents (about a couple of pages each) and another with 100k+ docs. We can index the 5m collection in a couple of days (the limitation is in the source), which is 100k documents an hour without breaking a sweat.
>>
>> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>>> Hi,
>>>
>>> When indexing large amounts of data I hit a problem whereby Solr becomes unresponsive and doesn't recover (even when left overnight!). I think I've hit some GC problems / some GC tuning is required, and I wanted to know if anyone has ever hit this problem. I can replicate this error (albeit taking longer to do so) using Solr/Lucene analysers only, so I thought other people might have hit this issue before over large data sets...
>>>
>>> Background on my problem follows -- but I guess my main question is -- can Solr become so overwhelmed by update posts that it becomes completely unresponsive??
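Rather than adding System.gc() calls, one option mentioned in this thread is pacing Solr from the posting client by sending an explicit <commit/> every N article files. A minimal sketch using plain HttpURLConnection against the Solr 1.4 XML update handler; the update URL, input directory, and batch size are placeholders:

------------------------------------------------------------------
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class XmlPoster {

    // Placeholder update URL -- adjust host/port/core for the real setup.
    private static final String UPDATE_URL = "http://localhost:8983/solr/update";

    public static void main(String[] args) throws Exception {
        // Hypothetical directory of per-article <add><doc>... files.
        File[] articleFiles = new File("index-xml").listFiles();
        if (articleFiles == null) {
            return;
        }
        int sinceCommit = 0;
        for (File f : articleFiles) {
            post(new FileInputStream(f));
            // Commit explicitly every N articles so buffered documents are
            // flushed on the Solr side, instead of relying on System.gc().
            if (++sinceCommit >= 1000) {              // placeholder commit interval
                post(new ByteArrayInputStream("<commit/>".getBytes("UTF-8")));
                sinceCommit = 0;
            }
        }
        post(new ByteArrayInputStream("<commit/>".getBytes("UTF-8")));
    }

    private static void post(InputStream body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(UPDATE_URL).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = body.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.close();
        body.close();
        // Read the response code so the connection completes cleanly.
        if (conn.getResponseCode() != 200) {
            System.err.println("POST failed: " + conn.getResponseCode());
        }
        conn.disconnect();
    }
}
------------------------------------------------------------------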
>>> Right now I think the problem is that the Java GC is hanging, but I've been working on this all week and it took a while to figure out that it might be GC-based / wasn't a direct result of my custom analysers, so I'd appreciate any advice anyone has about indexing large document collections.
>>>
>>> I also have a second question for those in the know -- do we have a chance of indexing/searching over our large dataset with what little hardware we already have available??
>>>
>>> thanks in advance :)
>>>
>>> bec
>>>
>>> A bit of background:
>>> -------------------------------
>>>
>>> I've got a large collection of articles we want to index/search over -- about 180k in total. Each article has say 500-1000 sentences, and each sentence has about 15 fields, many of which are multi-valued; we also store most fields for display/highlighting purposes. So I'd guess over 100 million index documents.
>>>
>>> In our small test collection of 700 articles this results in a single index of about 13GB.
>>>
>>> Our pipeline processes PDF files through to Solr native XML, which we call "index.xml" files, i.e. in <add><doc>... format, ready to post straight to Solr's update handler.
>>>
>>> We create the index.xml files as we pull in information from a few sources, and creating these files from their original PDF form is farmed out across a grid and is quite time-consuming, so we distribute this process rather than creating index.xml files on the fly...
>>>
>>> We do a lot of linguistic processing, and enabling search over our resulting terms requires analysers that split terms / join terms together, i.e. custom analysers that perform string operations and are quite time-consuming / have a large overhead compared to most analysers (they take approx 20-30% more time and use twice as many short-lived objects as the "text" field type).
>>>
>>> Right now I'm working on my new iMac:
>>> quad-core 2.8 GHz Intel Core i7
>>> 16 GB 1067 MHz DDR3 RAM
>>> 2TB hard drive (about half free)
>>> OS X 10.6.4
>>>
>>> Production environment:
>>> 2 Linux boxes, each with:
>>> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
>>> 16GB RAM
>>>
>>> I use Java 1.6 and Solr version 1.4.1 with multi-cores (a single core right now).
>>>
>>> I set up Solr to use autocommit, as we'll have several document collections / will post to Solr from different data sets:
>>>
>>> <!-- autocommit pending docs if certain criteria are met. Future versions may expand the available criteria -->
>>> <autoCommit>
>>>   <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>>>   <maxTime>900000</maxTime> <!-- every 15 minutes -->
>>> </autoCommit>
>>>
>>> I also have:
>>> <useCompoundFile>false</useCompoundFile>
>>> <ramBufferSizeMB>1024</ramBufferSizeMB>
>>> <mergeFactor>10</mergeFactor>
>>> -----------------
>>>
>>> *** First question:
>>> Has anyone else found that Solr hangs/becomes unresponsive after too many documents are indexed at once, i.e. when Solr can't keep up with the post rate?
>>>
>>> I've got LCF crawling my local test set (file system connection required only) and posting documents to Solr using 6GB of RAM. As I said above, these documents are in native Solr XML format (<add><doc>...), with one file per article, so each <add> contains all the sentence-level documents for that article.
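As an illustration of the kind of custom analyser described above (splitting/joining terms with extra string work per token), a minimal sketch of a Lucene 2.9-era TokenFilter, the API bundled with Solr 1.4. The class name and the splitting rule are hypothetical, not the poster's actual analysers; real linguistic filters would do far more per-token work, which is where the extra short-lived objects come from:

------------------------------------------------------------------
import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/**
 * Hypothetical example: splits tokens on '_' so that "cell_line" is emitted
 * as "cell" then "line". Simplified: offsets/positions of queued parts are
 * not adjusted, which a production filter would need to handle.
 */
public final class UnderscoreSplitFilter extends TokenFilter {

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final LinkedList<String> pending = new LinkedList<String>();

    public UnderscoreSplitFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit a queued part from a previously split token.
            termAtt.setTermBuffer(pending.removeFirst());
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        int idx = term.indexOf('_');
        if (idx > 0) {
            // Emit the first part now, queue the rest for later calls.
            termAtt.setTermBuffer(term.substring(0, idx));
            for (String part : term.substring(idx + 1).split("_")) {
                if (part.length() > 0) {
                    pending.add(part);
                }
            }
        }
        return true;
    }
}
------------------------------------------------------------------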
>>> With LCF I post about 2.5-3k articles (files) per hour -- so roughly 2.5k * 500 / 3600 = ~350 <doc>s per second post rate -- is this normal/expected??
>>>
>>> Eventually, after about 3000 files (an hour or so), Solr starts to hang / becomes unresponsive, and with JConsole/GC logging I can see that the Old-Gen space is about 90% full. The following is the end of the Solr log file -- where you can see GC has been called:
>>> ------------------------------------------------------------------
>>> 3012.290: [GC Before GC:
>>> Statistics for BinaryTreeDictionary:
>>> ------------------------------------
>>> Total Free Space: 53349392
>>> Max Chunk Size: 3200168
>>> Number of Blocks: 66
>>> Av. Block Size: 808324
>>> Tree Height: 13
>>> Before GC:
>>> Statistics for BinaryTreeDictionary:
>>> ------------------------------------
>>> Total Free Space: 0
>>> Max Chunk Size: 0
>>> Number of Blocks: 0
>>> Tree Height: 0
>>> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K), 0.0769802 secs]3012.367: [CMS
>>> ------------------------------------------------------------------
>>>
>>> I can replicate this with Solr using "text" field types in place of those that use my custom analysers -- Solr takes longer to become unresponsive (about 3 hours / 13k docs), but there is the same kind of GC message at the end of the log file, and JConsole shows that the Old-Gen space was almost full, so it was due for a collection sweep.
>>>
>>> I don't use any special GC settings, but found an article here:
>>> http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
>>> that suggests particular GC settings for Solr -- I will try these, but thought someone else could suggest another error source / give some GC advice??
>>>
>>> -----------------
>>>
>>> *** Second question:
>>>
>>> Given the production machines available for the Solr servers, does it look like we've got enough hardware to produce reasonable query times / handle a few hundred queries per second??
>>>
>>> I planned on setting up one Solr server per machine (so two in total), each with 8GB of RAM -- so half of the 16GB available.
>>>
>>> We also have a third, less powerful machine that houses all our data, so I plan to set up LCF on that machine and post the files to the two Solr servers from it over the subnet.
>>>
>>> Does it sound like we might be able to achieve indexing/search over this little hardware (given around 100 million index documents, i.e. approx 50 million per Solr server)?
>>
>> --
>> Sent from my mobile device
>

--
Sent from my mobile device
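For reference, a quick back-of-envelope check of the figures quoted in the thread (post rate, total index documents, and the per-server split), using only the numbers given above:

------------------------------------------------------------------
public class BackOfEnvelope {
    public static void main(String[] args) {
        // Figures quoted in the thread above.
        double articlesPerHour = 2500;       // LCF post rate (2.5-3k articles/hour)
        double sentencesPerArticle = 500;    // lower bound of the 500-1000 range
        long totalArticles = 180000;         // whole collection
        int solrServers = 2;                 // planned production setup

        double docsPerSecond = articlesPerHour * sentencesPerArticle / 3600.0;
        double totalDocs = totalArticles * sentencesPerArticle;
        double docsPerServer = totalDocs / solrServers;

        System.out.printf("post rate        : ~%.0f docs/sec%n", docsPerSecond);       // ~347
        System.out.printf("total index docs : ~%.0f million%n", totalDocs / 1e6);      // ~90M at 500 sentences/article
        System.out.printf("docs per server  : ~%.0f million%n", docsPerServer / 1e6);  // ~45-50M each
    }
}
------------------------------------------------------------------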