hi,

ok, I have a theory about the cause of my problem -- I think Java's GC failure is due to a Solr memory leak caused by overlapping auto-commit calls -- does that sound plausible?? (ducking for cover now...)
I watched the log files and noticed that when the threads start to increase (from a stable 32 or so up to 72 before hanging!) there are two commit calls very close together, and it looked like the index was in the process of merging at the time of the first commit call -- i.e. the first was a long commit call that required a merge, and before it finished another commit call was issued. I think this was due to the autocommit settings I had:

  <autoCommit>
    <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
    <maxTime>900000</maxTime> <!-- every 15 minutes -->
  </autoCommit>

Eventually, it seems, these two different auto-commit triggers would coincide!! A few times this seems to happen without causing a problem, but I think eventually two coincide while the first one is doing something heavy-duty like a merge over large index segments, and the system spirals downwards... Combined with the fact that I was posting to Solr as fast as possible (LCF was waiting for Solr....), I think this causes Java to keel over and die.

Two things were noticeable in JConsole:

1) lots of threads were spawned around the two commit calls -- the thread spawning started after the first commit call, which makes me think it was a commit requiring an index merge. Overall, threads went from the stable 32 used during indexing for the two hours prior up to 72 or so within 15 minutes of the two commit calls being made.

2) both the old-gen and survivor heaps were almost totally full!

So I think a memory leak happens with overlapping commit calls plus heavy-duty Lucene index processing behind Solr (like an index merge!?). If the overlapping commit call (the second commit issued before the first one finished) caused a memory leak, then with the old-gen/survivor heaps full at that point, Solr became unresponsive and never recovered. Is this expected when you use both autocommit settings / if concurrent commit calls are issued to Solr?

This also explains why it was happening even without my custom analysers ("text" field type used in place of mine), just taking longer to happen -- my analysers are more expensive CPU/RAM-wise, so the overlapping commit calls were less likely to be forgiven when my system was already using a lot of RAM...

Also, I played with the GC settings a bit and could find settings that helped to postpone this issue, as they were more forgiving of the increased RAM usage during the overlapping commit calls (GC settings with increased eden heap space).
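To be concrete, the settings from my earlier mail (quoted below) go on the java command line something like this -- just a sketch: the heap sizes, the GC log file name and the example-Jetty start.jar are placeholders for whatever you normally run with, not a recommendation:

  java -Xms4g -Xmx4g \
       -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled \
       -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8 \
       -XX:CMSInitiatingOccupancyFraction=70 \
       -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:solr_gc.log \
       -jar start.jar
  # -Xms/-Xmx, solr_gc.log and start.jar are placeholders -- adjust for your own setup

the -XX:NewSize / -XX:MaxNewSize / -XX:SurvivorRatio flags are the "increased eden heap space" part.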
Solr was hanging after about 14k files (each one an article made up of a set of <doc>s that are each sentences in the article), i.e. a total of about 7 million index documents. If I switch off both auto-commit settings I can get through my smallish 20k file set (10 million index <doc>s) in 4 hours.
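(With autocommit off completely I'd still need one explicit commit at the end of the crawl to make everything searchable -- e.g. by posting a commit message to the update handler, something like the following; the host/port/path are just the default single-core example URL, so adjust for your core:)

  curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' --data-binary '<commit/>'
  # localhost:8983/solr is the default example URL -- change it to match your server/core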
I'm trying a run now on 100k articles (50 million index <doc>s within 100k files), where I use LCF to crawl/post each file to Solr, so I'll email an update about this. If this works OK I'm then going to try using only one auto-commit setting rather than two and see if that works.
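i.e. something like this in solrconfig.xml, keeping just the time-based trigger (keeping only maxDocs would be the other option -- this is just a sketch of the variant I'll try first):

  <autoCommit>
    <!-- sketch: time-based trigger only -- commit every 15 minutes, no maxDocs trigger -->
    <maxTime>900000</maxTime>
  </autoCommit>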
thanks :)

bec

On 13 August 2010 00:24, Rebecca Watson <bec.wat...@gmail.com> wrote:
> hi,
>
>> 1) I assume you are doing batching interspersed with commits
>
> as each file I crawl is article-level, each <add> contains all the sentences for the article, so they are naturally batched into about 500 documents per post in LCF.
>
> I use auto-commit in Solr:
>
> <autoCommit>
>   <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>   <maxTime>900000</maxTime> <!-- every 15 minutes -->
> </autoCommit>
>
>> 2) Why do you need sentence level Lucene docs?
>
> that's an application-specific need due to linguistic info needed on a per-sentence basis.
>
>> 3) Are your custom handlers/parsers a part of the SOLR jvm? Would not be surprised if you have a memory/connection leak there (or it is not releasing some resource explicitly)
>
> I thought this could be the case too -- but if I replace the use of my custom analysers and specify my fields are of type "text" instead (from the standard solrconfig.xml, i.e. using solr-based analysers) then I get this kind of hanging too -- at least it did when I didn't have any explicit GC settings... it does take longer to replicate as my analysers/field types are more complex than the "text" field type.
>
> i will try it again with the different GC settings tomorrow and post the results.
>
>> In general, we have NEVER had a problem in loading Solr.
>
> i'm not sure if we would either if we posted as we created the index.xml format... but because we post 500+ documents at a time (one article file per LCF post) and LCF can post these files quickly, i'm not sure if I need to try and slow down the post rate!?
>
> thanks for your replies,
>
> bec :)
>
>> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>>> sorry -- i used the term "documents" too loosely!
>>>
>>> 180k scientific articles with between 500-1000 sentences each, and we index sentence-level index documents, so i'm guessing about 100 million lucene index documents in total.
>>>
>>> an update on my progress:
>>>
>>> i used GC settings of:
>>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
>>> -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
>>> -XX:CMSInitiatingOccupancyFraction=70
>>>
>>> which allowed the indexing process to run to 11.5k articles and for about 2 hours before I got the same kind of hanging/unresponsive Solr, with this as the tail of the solr logs:
>>>
>>> Before GC:
>>> Statistics for BinaryTreeDictionary:
>>> ------------------------------------
>>> Total Free Space: 2416734
>>> Max Chunk Size: 2412032
>>> Number of Blocks: 3
>>> Av. Block Size: 805578
>>> Tree Height: 3
>>> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.0000193 secs]5980.480: [CMS
>>>
>>> I also saw (in jconsole) that the number of threads rose from the steady 32 used for the 2 hours to 72 before Solr finally became unresponsive...
>>>
>>> i've got the following GC info params switched on (as many as i could find!):
>>> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>>> -XX:+PrintGCApplicationConcurrentTime
>>> -XX:+PrintGCApplicationStoppedTime
>>> -XX:PrintFLSStatistics=1
>>>
>>> with 11.5k articles in about 2 hours this was 11.5k * 500 / 2 = 2.875 million fairly small docs per hour!! this produced an index of about 40GB, to give you an idea of index size...
>>>
>>> because i've already got the documents in solr native xml format, i.e. one file per article, each with <add><doc>...</doc>...., i.e. posting each set of sentence docs per article in every LCF file post... this means that LCF can throw documents at Solr very fast.... and i think i'm breaking it GC-wise.
>>>
>>> i'm going to try adding in System.gc() calls to see if this runs ok (albeit slower)...
>>> otherwise i'm pretty much at a loss as to what could be causing this GC issue / solr hanging if it's not a GC issue...
>>>
>>> thanks :)
>>>
>>> bec
>>>
>>> On 12 August 2010 21:42, dc tech <dctech1...@gmail.com> wrote:
>>>> I am a little confused - how did 180k documents become 100m index documents?
>>>> We have over 20 indices (for different content sets), one with 5m documents (about a couple of pages each) and another with 100k+ docs.
>>>> We can index the 5m collection in a couple of days (the limitation is in the source), which is 100k documents an hour without breaking a sweat.
>>>>
>>>> On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> When indexing large amounts of data I hit a problem whereby Solr becomes unresponsive and doesn't recover (even when left overnight!). I think i've hit some GC problems / some GC tuning is required, and I wanted to know if anyone has ever hit this problem. I can replicate this error (albeit taking longer to do so) using Solr/Lucene analysers only, so I thought other people might have hit this issue before over large data sets....
>>>>>
>>>>> Background on my problem follows -- but I guess my main question is -- can Solr become so overwhelmed by update posts that it becomes completely unresponsive??
>>>>>
>>>>> Right now I think the problem is that the java GC is hanging, but I've been working on this all week and it took a while to figure out it might be GC-based / wasn't a direct result of my custom analysers, so i'd appreciate any advice anyone has about indexing large document collections.
>>>>>
>>>>> I also have a second question for those in the know -- do we have a chance of indexing/searching over our large dataset with what little hardware we already have available??
>>>>>
>>>>> thanks in advance :)
>>>>>
>>>>> bec
>>>>>
>>>>> a bit of background:
>>>>> -------------------------------
>>>>>
>>>>> I've got a large collection of articles we want to index/search over -- about 180k in total. Each article has say 500-1000 sentences, and each sentence has about 15 fields, many of which are multi-valued, and we store most fields as well for display/highlighting purposes. So I'd guess over 100 million index documents.
>>>>>
>>>>> In our small test collection of 700 articles this results in a single index of about 13GB.
>>>>>
>>>>> Our pipeline processes PDF files through to Solr native xml, which we call "index.xml" files, i.e. in <add><doc>... format ready to post straight to Solr's update handler.
>>>>>
>>>>> We create the index.xml files as we pull in information from a few sources, and creation of these files from their original PDF form is farmed out across a grid and is quite time-consuming, so we distribute this process rather than creating index.xml files on the fly...
>>>>>
>>>>> We do a lot of linguistic processing, and enabling search functionality over our resulting terms requires analysers that split terms / join terms together, i.e. custom analysers that perform string operations and are quite time-consuming / have a large overhead compared to most analysers (they take approx 20-30% more time and use twice as many short-lived objects as the "text" field type).
>>>>>
>>>>> Right now i'm working on my new iMac:
>>>>> quad-core 2.8 GHz Intel Core i7
>>>>> 16 GB 1067 MHz DDR3 RAM
>>>>> 2TB hard-drive (about half free)
>>>>> OS X 10.6.4
>>>>>
>>>>> Production environment:
>>>>> 2 linux boxes, each with:
>>>>> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
>>>>> 16GB RAM
>>>>>
>>>>> I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core right now).
>>>>>
>>>>> I set up Solr to use autocommit as we'll have several document collections / post to Solr from different data sets:
>>>>>
>>>>> <!-- autocommit pending docs if certain criteria are met. Future versions may expand the available criteria -->
>>>>> <autoCommit>
>>>>>   <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>>>>>   <maxTime>900000</maxTime> <!-- every 15 minutes -->
>>>>> </autoCommit>
>>>>>
>>>>> I also have:
>>>>> <useCompoundFile>false</useCompoundFile>
>>>>> <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>> <mergeFactor>10</mergeFactor>
>>>>>
>>>>> -----------------
>>>>>
>>>>> *** First question:
>>>>>
>>>>> Has anyone else found that Solr hangs/becomes unresponsive after too many documents are indexed at once, i.e. Solr can't keep up with the post rate?
>>>>>
>>>>> I've got LCF crawling my local test set (file system connection required only) and posting documents to Solr, using 6GB of RAM. As I said above, these documents are in native Solr XML format (<add><doc>....) with one file per article, so each <add> contains all the sentence-level documents for the article.
>>>>>
>>>>> With LCF I post about 2.5-3k articles (files) per hour -- so about 2.5k*500/3600 = 350 <doc>s per second post rate -- is this normal/expected??
>>>>>
>>>>> Eventually, after about 3000 files (an hour or so), Solr starts to hang / becomes unresponsive, and with Jconsole/GC logging I can see that the Old-Gen space is about 90% full. The following is the end of the solr log file -- where you can see GC has been called:
>>>>> ------------------------------------------------------------------
>>>>> 3012.290: [GC Before GC:
>>>>> Statistics for BinaryTreeDictionary:
>>>>> ------------------------------------
>>>>> Total Free Space: 53349392
>>>>> Max Chunk Size: 3200168
>>>>> Number of Blocks: 66
>>>>> Av. Block Size: 808324
>>>>> Tree Height: 13
>>>>> Before GC:
>>>>> Statistics for BinaryTreeDictionary:
>>>>> ------------------------------------
>>>>> Total Free Space: 0
>>>>> Max Chunk Size: 0
>>>>> Number of Blocks: 0
>>>>> Tree Height: 0
>>>>> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K), 0.0769802 secs]3012.367: [CMS
>>>>> ------------------------------------------------------------------
>>>>>
>>>>> I can replicate this with Solr using "text" field types in place of those that use my custom analysers -- whereby Solr takes longer to become unresponsive (about 3 hours / 13k docs), but there is the same kind of GC message at the end of the log file / Jconsole shows that the Old-Gen space was almost full, so it was due for a collection sweep.
>>>>>
>>>>> I don't use any special GC settings, but I found an article here:
>>>>> http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
>>>>> that suggests using particular GC settings for Solr -- I will try these, but thought someone else could suggest another error source / give some GC advice??
>>>>>
>>>>> -----------------
>>>>>
>>>>> *** Second question:
>>>>>
>>>>> Given the production machines available for the Solr servers, does it look like we've got enough hardware to produce reasonable query times / handle a few hundred queries per second??
>>>>>
>>>>> I planned on setting up one Solr server per machine (so two in total), each with 8GB of RAM -- so half of the 16GB available.
>>>>>
>>>>> We also have a third, less powerful machine that houses all our data, so I plan to set up LCF on that machine + post the files to the two Solr servers from this machine in the subnet.
>>>>>
>>>>> Does it sound like we might be able to achieve indexing/search over this little hardware (given around 100 million index documents, i.e. approx 50 million per Solr server)?
>>>>
>>>> --
>>>> Sent from my mobile device
>>>
>>
>> --
>> Sent from my mobile device
>