hi,

> 1) I assume you are doing batching interspersed with commits

each file I crawl is article-level, so each <add> contains all the
sentence docs for that article -- they're naturally batched at about
500 documents per LCF post. I use auto-commit in Solr:

<autoCommit>
  <maxDocs>500000</maxDocs>   <!-- every 1000 articles -->
  <maxTime>900000</maxTime>   <!-- every 15 minutes -->
</autoCommit>

> 2) Why do you need sentence level Lucene docs?

that's an application-specific need, due to linguistic info required
on a per-sentence basis.

> 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be
> surprised if you have a memory/connection leak there (or it is not
> releasing some resource explicitly)

I thought this could be the case too -- but if I replace my custom
analysers and declare my fields as type "text" instead (i.e. the
standard example schema, using solr-based analysers) then I get this
kind of hanging too -- at least I did when I didn't have any explicit
GC settings... it does take longer to replicate, as my analysers/field
types are more complex than the "text" field type. i will try it again
with the different GC settings tomorrow and post the results.

> In general, we have NEVER had a problem in loading Solr.

i'm not sure we would either if we posted the index.xml files as we
created them... but because we post 500+ documents at a time (one
article file per LCF post) and LCF can post these files very quickly,
i'm not sure whether I need to try to slow down the post rate!?
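for reference, this is roughly the shape of the client-side throttling
i have in mind -- a made-up SolrJ sketch, not what LCF actually does,
and the batch size / commit interval / sleep are invented numbers:

import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// hypothetical throttled poster: one batch of sentence docs per article,
// an explicit commit every N articles, and a short pause between posts
public class ThrottledPoster {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        int articlesSinceCommit = 0;
        for (int article = 0; article < 1000; article++) {   // stand-in for the real article list
            Collection<SolrInputDocument> sentences = new ArrayList<SolrInputDocument>();
            for (int s = 0; s < 500; s++) {                  // ~500 sentence docs per article
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", article + "-" + s);
                doc.addField("text", "sentence text here");
                sentences.add(doc);
            }
            solr.add(sentences);                             // one <add> batch per article

            if (++articlesSinceCommit >= 1000) {             // explicit commit instead of autoCommit
                solr.commit();
                articlesSinceCommit = 0;
            }
            Thread.sleep(200);                               // made-up pause to cap the post rate
        }
        solr.commit();
    }
}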
thanks for your replies,

bec :)

> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>> sorry -- i used the term "documents" too loosely!
>>
>> 180k scientific articles with between 500-1000 sentences each,
>> and we index sentence-level index documents,
>> so i'm guessing about 100 million lucene index documents in total.
>>
>> an update on my progress:
>>
>> i used GC settings of:
>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
>> -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
>> -XX:CMSInitiatingOccupancyFraction=70
>>
>> which allowed the indexing process to run to 11.5k articles and for
>> about 2 hours before I got the same kind of hanging/unresponsive Solr,
>> with this as the tail of the solr logs:
>>
>> Before GC:
>> Statistics for BinaryTreeDictionary:
>> ------------------------------------
>> Total Free Space: 2416734
>> Max Chunk Size: 2412032
>> Number of Blocks: 3
>> Av. Block Size: 805578
>> Tree Height: 3
>> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.0000193 secs]5980.480: [CMS
>>
>> I also saw (in jconsole) that the number of threads rose from the
>> steady 32 used for the first 2 hours to 72 before Solr finally became
>> unresponsive...
>>
>> i've got the following GC info params switched on (as many as i could
>> find!):
>> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>> -XX:+PrintGCApplicationConcurrentTime
>> -XX:+PrintGCApplicationStoppedTime
>> -XX:PrintFLSStatistics=1
>>
>> with 11.5k articles in about 2 hours this was 11.5k * 500 / 2 = 2.875
>> million fairly small docs per hour!! this produced an index of about
>> 40GB, to give you an idea of index size...
>>
>> because i've already got the documents in solr native xml format,
>> i.e. one file per article each with <add><doc>...</doc>....
>> i.e. posting each set of sentence docs per article in every LCF file post...
>> this means that LCF can throw documents at Solr very fast.... and i
>> think i'm breaking it GC-wise.
>>
>> i'm going to try adding in System.gc() calls to see if this runs ok
>> (albeit slower)...
>> otherwise i'm pretty much at a loss as to what could be causing this
>> GC issue / solr hanging, if it's not a GC issue...
>>
>> thanks :)
>>
>> bec
>>
>> On 12 August 2010 21:42, dc tech <dctech1...@gmail.com> wrote:
>>> I am a little confused - how did 180k documents become 100m index
>>> documents?
>>> We have over 20 indices (for different content sets), one with 5m
>>> documents (about a couple of pages each) and another with 100k+ docs.
>>> We can index the 5m collection in a couple of days (limitation is in
>>> the source) which is 100k documents an hour without breaking a sweat.
>>>
>>>
>>>
>>> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> When indexing large amounts of data I hit a problem whereby Solr
>>>> becomes unresponsive and doesn't recover (even when left overnight!).
>>>> I think i've hit some GC problems / some GC tuning is required, and I
>>>> wanted to know if anyone has ever hit this problem.
>>>> I can replicate this error (albeit taking longer to do so) using
>>>> Solr/Lucene analysers only, so I thought other people might have hit
>>>> this issue before over large data sets....
>>>>
>>>> Background on my problem follows -- but I guess my main question is --
>>>> can Solr become so overwhelmed by update posts that it becomes
>>>> completely unresponsive??
>>>>
>>>> Right now I think the problem is that the java GC is hanging, but I've
>>>> been working on this all week and it took a while to figure out it
>>>> might be GC-based / wasn't a direct result of my custom analysers, so
>>>> i'd appreciate any advice anyone has about indexing large document
>>>> collections.
>>>>
>>>> I also have a second question for those in the know -- do we have a
>>>> chance of indexing/searching over our large dataset with what little
>>>> hardware we already have available??
>>>>
>>>> thanks in advance :)
>>>>
>>>> bec
>>>>
>>>> a bit of background:
>>>> -------------------------------
>>>>
>>>> I've got a large collection of articles we want to index/search over
>>>> -- about 180k in total. Each article has say 500-1000 sentences and
>>>> each sentence has about 15 fields, many of which are multi-valued, and
>>>> we store most fields as well for display/highlighting purposes. So I'd
>>>> guess over 100 million index documents.
>>>>
>>>> In our small test collection of 700 articles this results in a single
>>>> index of about 13GB.
>>>>
>>>> Our pipeline processes PDF files through to Solr native xml which we
>>>> call "index.xml" files, i.e. in <add><doc>... format ready to post
>>>> straight to Solr's update handler.
>>>>
>>>> We create the index.xml files as we pull in information from a few
>>>> sources, and creation of these files from their original PDF form is
>>>> farmed out across a grid and is quite time-consuming, so we distribute
>>>> this process rather than creating index.xml files on the fly...
>>>>
>>>> We do a lot of linguistic processing, and enabling search over the
>>>> resulting terms requires analysers that split terms / join terms
>>>> together, i.e. custom analysers that perform string operations and
>>>> have quite a large overhead compared to most analysers (they take
>>>> approx 20-30% more time and create twice as many short-lived objects
>>>> as the "text" field type).
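(as an aside, since the analysers themselves aren't shown here: the
per-term string work is shaped roughly like the made-up TokenFilter
below, written against the Lucene 2.9 attribute API bundled with Solr
1.4 -- it is not our real code, just an illustration of where the
short-lived objects come from)

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// made-up example filter: rewrites each term in place (here, just
// collapsing hyphens) -- the same general shape as a "join terms" step
public final class JoinHyphensFilter extends TokenFilter {

    private final TermAttribute termAtt;

    public JoinHyphensFilter(TokenStream input) {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // every call allocates a fresh String; across ~100 million docs
        // and many tokens per doc, this is a lot of short-lived garbage
        String joined = termAtt.term().replace("-", "");
        termAtt.setTermBuffer(joined);
        return true;
    }
}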
>>>>
>>>> Right now i'm working on my new iMac:
>>>> quad-core 2.8 GHz Intel Core i7
>>>> 16 GB 1067 MHz DDR3 RAM
>>>> 2TB hard-drive (about half free)
>>>> OS X 10.6.4
>>>>
>>>> Production environment:
>>>> 2 linux boxes, each with:
>>>> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
>>>> 16GB RAM
>>>>
>>>> I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
>>>> right now).
>>>>
>>>> I set up Solr to use autocommit, as we'll have several document
>>>> collections / post to Solr from different data sets:
>>>>
>>>> <!-- autocommit pending docs if certain criteria are met. Future
>>>>      versions may expand the available criteria -->
>>>> <autoCommit>
>>>>   <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>>>>   <maxTime>900000</maxTime> <!-- every 15 minutes -->
>>>> </autoCommit>
>>>>
>>>> I also have:
>>>> <useCompoundFile>false</useCompoundFile>
>>>> <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>> <mergeFactor>10</mergeFactor>
>>>> -----------------
>>>>
>>>> *** First question:
>>>> Has anyone else found that Solr hangs/becomes unresponsive after too
>>>> many documents are indexed at once, i.e. Solr can't keep up with the
>>>> post rate?
>>>>
>>>> I've got LCF crawling my local test set (file system connection
>>>> required only) and posting documents to Solr, using 6GB of RAM. As I
>>>> said above, these documents are in native Solr XML format
>>>> (<add><doc>....) with one file per article, so each <add> contains all
>>>> the sentence-level documents for the article.
>>>>
>>>> With LCF I post about 2.5-3k articles (files) per hour -- so about
>>>> 2.5k * 500 / 3600 = 350 <doc>s per second post-rate -- is this
>>>> normal/expected??
>>>>
>>>> Eventually, after about 3000 files (an hour or so), Solr starts to
>>>> hang/becomes unresponsive, and with Jconsole/GC logging I can see that
>>>> the Old-Gen space is about 90% full. The following is the end of the
>>>> solr log file -- where you can see GC has been called:
>>>> ------------------------------------------------------------------
>>>> 3012.290: [GC Before GC:
>>>> Statistics for BinaryTreeDictionary:
>>>> ------------------------------------
>>>> Total Free Space: 53349392
>>>> Max Chunk Size: 3200168
>>>> Number of Blocks: 66
>>>> Av. Block Size: 808324
>>>> Tree Height: 13
>>>> Before GC:
>>>> Statistics for BinaryTreeDictionary:
>>>> ------------------------------------
>>>> Total Free Space: 0
>>>> Max Chunk Size: 0
>>>> Number of Blocks: 0
>>>> Tree Height: 0
>>>> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
>>>> 0.0769802 secs]3012.367: [CMS
>>>> ------------------------------------------------------------------
>>>>
>>>> I can replicate this with Solr using "text" field types in place of
>>>> those that use my custom analysers -- whereby Solr takes longer to
>>>> become unresponsive (about 3 hours / 13k docs) but there is the same
>>>> kind of GC message at the end of the log file / Jconsole shows that
>>>> the Old-Gen space was almost full, so it was due for a collection
>>>> sweep.
>>>>
>>>> I don't use any special GC settings, but found an article here:
>>>> http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
>>>>
>>>> that suggests using particular GC settings for Solr -- I will try
>>>> these, but thought someone else could suggest another error source /
>>>> give some GC advice??
>>>>
>>>> -----------------
>>>>
>>>> *** Second question:
>>>>
>>>> Given the production machines available for the Solr servers, does it
>>>> look like we've got enough hardware to produce reasonable query times
>>>> / handle a few hundred queries per second??
>>>>
>>>> I planned on setting up one Solr server per machine (so two in total),
>>>> each with 8GB of RAM -- so half of the 16GB available.
>>>>
>>>> We also have a third, less powerful machine that houses all our data,
>>>> so I plan to set up LCF on that machine + post the files to the two
>>>> Solr servers from this machine in the subnet.
>>>>
>>>> Does it sound like we might be able to achieve indexing/search over
>>>> this little hardware (given around 100 million index documents, i.e.
>>>> approx 50 million per Solr server)?
>>>>
>>>
>>> --
>>> Sent from my mobile device
>>>
>>
>
> --
> Sent from my mobile device
>