Re: Solr and jvm Garbage Collection tuning
Hi Tim,

For what it is worth, behind Trove (http://trove.nla.gov.au/) are 3 SOLR-managed indices and 1 Lucene index. None of ours is as big as one of your shards, and one of our SOLR-managed indices is tiny, but your experiences with long GC pauses are familiar to us.

One of the most difficult indices to tune is our bibliographic index of around 38M mostly metadata records, which is around 125GB with 97MB tii files. We need to commit updates and reopen the index every 90 seconds, and the facet recalculation (using UnInverted) was taking quite a lot of time and seemed to generate lots of objects to be collected on each reopening. Although we've been through several rounds of tuning which seemed to work, at least temporarily, a few months ago we started getting 12 sec "full GC" times every 90 secs, which was no good! We noticed/did three things:

1) Optimise to 1 segment - we'd got to the stage where 50% of the documents had been updated (hence deleted), and the maxdocid was 50% bigger than it needed to be, so data structures whose size was proportional to maxdocid had grown a lot. Optimising to 1 segment greatly reduced full GC frequency and times.

2) For most of our facets, forcing the facets to be filters rather than uninverted happened to work better - but this depends on many factors and certainly isn't a cure-all for all facets - uninverted often works much better than filters!

3) After lots of benchmarking of real updates and queries on a dev system, we came up with this set of JVM parameters that worked "best" for our environment (at the moment!):

-Xmx17000M -XX:NewSize=3500M -XX:SurvivorRatio=3 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
-XX:+CMSIncrementalMode

I can't say exactly why, except that with this combination of parameters and our data, a much bigger newgen led to less movement of objects to oldgen, and non-full-GC collections on oldgen worked much better. Currently we are seeing fewer than 10 full GCs a day, and they almost always take less than 4 seconds. This index is running on an 8-core X5570 machine with 64GB, sharing it with a large/busy mysql instance and the Trove web server.

One of our other indices is only updated once per day, but is larger: 33.5M docs representing the full text of archived web pages, 246GB, with a 36MB tii file. JVM parms are -Xmx1M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC. It also does fewer than 10 full GCs per day, taking less than 5 sec each.

Our other large index, newspapers, is a native Lucene index, about 180GB with a comparatively large tii of 280MB (probably for the same reason your tii is large - the contents of this database are mostly OCR'ed text). This index is updated/reopened every 3 minutes (to incorporate OCR text corrections and tagging), and we use a bitmap to represent all facet values, which typically takes 5 secs to rebuild on each reopen. JVM parms: -mx15000M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC. Although this JVM usually does fewer than 5 full GCs per day, these full GCs often take 20-30 seconds, and we need to test increasing the NewSize on this JVM to see if we can reduce these pauses. The web archive and newspaper indices are running on an 8-core X5570 machine with 72GB.

We are also running a separate copy/version of this index behind the site http://newspapers.nla.gov.au/ - the main difference is that the Trove version uses shingling (inspired by the Hathi Trust results) to improve searches containing common words.
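As an aside, assembled on one command line (with GC logging added so behaviour can be compared across runs), the point-3 parameter set would look something like the sketch below. This is illustrative only - the start.jar path and the GC-logging flags are assumptions added for the example, not part of our actual startup script:

    java -Xmx17000M -XX:NewSize=3500M -XX:SurvivorRatio=3 \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode \
         -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log \
         -jar start.jar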
That separate version behind newspapers.nla.gov.au is running on a machine with 32GB and 8 X5460 cores and has JVM parms: -mx11500M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC.

Apart from the old newspapers index, all other SOLR/Lucene indices are kept on SSDs (Intel X25-M 160GB), which, while having nothing to do with GC, work very very well - we couldn't cope with our current query volumes on rotating disk without spending a great deal of money. The old newspaper index is running on a SAN backed by 24 fast disks, and we can't support the same query rate on it as we can with the other newspaper index on SSDs (even before the shingling change).

Kent Fitch
Trove development team
National Library of Australia
Re: Need help with graphing function (MATH)
Hi, assuming you have x and want to generate y, then maybe:

- if x < 50, y = 150

- if x > 175, y = 60

- otherwise, either

  y = (100/(e^((x-50)/75)^2)) + 50
  http://www.wolframalpha.com/input/?i=plot++%28100%2F%28e^%28%28x+-50%29%2F75%29^2%29%29+%2B+50%2C+x%3D50..175

  or maybe

  y = sin((x+5)/38)*42+105
  http://www.wolframalpha.com/input/?i=plot++sin%28%28x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175

Regards,

Kent Fitch

On Tue, Feb 14, 2012 at 12:29 PM, Mark wrote:
> I need some help with one of my boost functions. I would like the
> function to look something like the following mockup below. Starts off flat
> then there is a gradual decline, steep decline then gradual decline and
> then back to flat.
>
> Can some of you math guys please help :)
>
> Thanks.
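PS: if it helps to see the first suggestion as code, here is a minimal sketch of the piecewise version with the Gaussian-style falloff between the two plateaus (same constants as the plot above, not tuned to your actual curve):

    static double boost(double x) {
        if (x < 50)  return 150;
        if (x > 175) return 60;
        double t = (x - 50) / 75;
        // 100/(e^(t^2)) + 50: 150 at x = 50, falling away towards the lower plateau
        return 100 / Math.exp(t * t) + 50;
    }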
Re: Need help with graphing function (MATH)
Agreeing with wunder - I don't know the application, but I think almost always a set of linear approximations over a few ranges would be OK (and you could increase the number of ranges until it was), and it will be faster. And if you need just one equation, a sigmoid function will do the trick, such as

110 - 50((x-100)/20)/(sqrt(1+((x-100)/20)^2))
http://www.wolframalpha.com/input/?i=plot+110+-+50%28%28x-100%29%2F20%29%2F%28sqrt%281%2B%28%28x-100%29%2F20%29^2%29%29%2C+x%3D0..200

Regards

Kent Fitch

On Wed, Feb 15, 2012 at 6:17 AM, Walter Underwood wrote:
> In practice, I expect a linear piecewise function (with sharp corners)
> would be indistinguishable from the smoothed function. It is also much
> easier to read, test, and debug. It might even be faster.
>
> Try the sharp corners one first.
>
> wunder
>
> On Feb 14, 2012, at 10:56 AM, Ted Dunning wrote:
>
> > In general this kind of function is very easy to construct using sums of
> > basic sigmoidal functions. The logistic and probit functions are commonly
> > used for this.
> >
> > Sent from my iPhone
> >
> > On Feb 14, 2012, at 10:05, Mark wrote:
> >
> >> Thanks I'll have a look at this. I should have mentioned that the
> >> actual values on the graph aren't important rather I was showing an example
> >> of how the function should behave.
> >>
> >> On 2/13/12 6:25 PM, Kent Fitch wrote:
> >>> Hi, assuming you have x and want to generate y, then maybe
> >>>
> >>> - if x < 50, y = 150
> >>>
> >>> - if x > 175, y = 60
> >>>
> >>> - otherwise:
> >>>
> >>> either y = (100/(e^((x-50)/75)^2)) + 50
> >>> http://www.wolframalpha.com/input/?i=plot++%28100%2F%28e^%28%28x+-50%29%2F75%29^2%29%29+%2B+50%2C+x%3D50..175
> >>>
> >>> - or maybe y = sin((x+5)/38)*42+105
> >>> http://www.wolframalpha.com/input/?i=plot++sin%28%28x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175
> >>>
> >>> Regards,
> >>>
> >>> Kent Fitch
> >>>
> >>> On Tue, Feb 14, 2012 at 12:29 PM, Mark <static.void@gmail.com> wrote:
> >>>
> >>> I need some help with one of my boost functions. I would like the
> >>> function to look something like the following mockup below. Starts
> >>> off flat then there is a gradual decline, steep decline then
> >>> gradual decline and then back to flat.
> >>>
> >>> Can some of you math guys please help :)
> >>>
> >>> Thanks.
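PS: a minimal sketch of that sigmoid as code, in case it is easier to drop into a test harness (same constants as the Wolfram Alpha plot above):

    static double sigmoidBoost(double x) {
        double t = (x - 100) / 20;
        // flat near 160 for small x, steep drop around x = 100, flat near 60 for large x
        return 110 - 50 * t / Math.sqrt(1 + t * t);
    }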
Re: high CPU usage and SelectCannelConnector threads used a lot
Hi John, sounds like this bug in NIO:

http://jira.codehaus.org/browse/JETTY-937
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933

I think recent versions of jetty work around this bug, or maybe try the non-NIO socket connector.

Kent

On Tue, Dec 7, 2010 at 9:10 AM, John Russell wrote:
> Hi,
> I'm using solr and have been load testing it for around 4 days. We use the
> solrj client to communicate with a separate jetty based solr process on the
> same box.
>
> After a few days solr's CPU% is now consistently at or above 100% (multiple
> processors available) and the application using it is mostly not responding
> because it times out talking to solr. I connected visual VM to the solr JVM
> and found that out of the many btpool-# threads there are 4 that are pretty
> much stuck in the running state 100% of the time. Their names are
>
> btpool0-1-Acceptor1 SelectChannelConnector@0.0.0.0:9983
> btpool0-2-Acceptor2 SelectChannelConnector@0.0.0.0:9983
> btpool0-3-Acceptor3 SelectChannelConnector@0.0.0.0:9983
> btpool0-9-Acceptor0 SelectChannelConnector@0.0.0.0:9983
>
> The stacks are all the same
>
> "btpool0-2 - Acceptor2 SelectChannelConnector@0.0.0.0:9983" - Thread t...@27
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>         - locked <106a644> (a sun.nio.ch.Util$1)
>         - locked <18dd381> (a java.util.Collections$UnmodifiableSet)
>         - locked <38d07d> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>         at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:419)
>         at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:169)
>         at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>         at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
>         at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>
>    Locked ownable synchronizers:
>         - None
>
> All of the other idle thread pool threads are just waiting for new tasks.
> The active threads never seem to change, it's always these 4. The selector
> channel appears to be in the jetty code, receiving requests from our other
> process through the solrj client.
>
> Does anyone know what this might mean or how to address it? Are these
> running all the time because they are blocked on IO so not actually
> consuming CPU? If so, what else might be? Is there a better way to figure
> out what is pinning the CPU?
>
> Some more info that might be useful.
>
> 32 bit machine (I know, I know)
> 2.7GB of RAM for solr process, ~2.5 is "used"
> According to visual VM around 25% of CPU time is spent in GC with the rest
> in application.
>
> Thanks for the help.
>
> John
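If upgrading Jetty isn't convenient, the blocking connector can be swapped in via jetty.xml. This is just an illustrative fragment for the Jetty 6.x (org.mortbay) line shown in your stack trace, not a tested config:

    <Call name="addConnector">
      <Arg>
        <New class="org.mortbay.jetty.bio.SocketConnector">
          <Set name="port">9983</Set>
        </New>
      </Arg>
    </Call>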
best way to cache "base" queries (before application of filters)
Hi, I'm looking for some advice on how to add "base query" caching to SOLR.

Our use-case for SOLR is:

- a large Lucene index (32M docs, doubling in 6 months; 110GB, increasing x 8 in 6 months)

- a frontend which presents views of this data in 5 "categories" by firing off 5 queries with the same search term but 5 different "fq" values

For example, an originating query for "sydney harbour" generates 5 SOLR queries:

- ../search?q=&fq=category:books
- ../search?q=&fq=category:maps
- ../search?q=&fq=category:music
etc

The complicated expansion requiring sloppy phrase matches, and the large database with lots of very large documents, mean that some queries take quite some time (tens to several hundreds of ms), so we'd like to cache the results of the base query for a short time (long enough for all related queries to be issued).

It looks like this isn't the use-case for queryResultCache, because its key is calculated in SolrIndexSearcher like this:

  key = new QueryResultKey(cmd.getQuery(), cmd.getFilterList(), cmd.getSort(), cmd.getFlags());

That is, the filters are part of the key, and the cached result reflects the application of the filters. This works great for what it is probably designed for - supporting paging through results.

So, I think our options are:

- create a new queryComponent that invokes SolrIndexSearcher differently, and which has its own (short-lived but long entry length) cache of the base query results

- subclass or change SolrIndexSearcher, perhaps making it "pluggable", perhaps defining an optional new cache of base query results

- create a subclass of the Lucene IndexSearcher which manages a cache of query results "hidden" from SolrIndexSearcher (and organise somehow for SolrIndexSearcher to use that subclass)

Or perhaps I'm taking the wrong approach to this problem entirely! Any advice is greatly appreciated.

Kent Fitch
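PS: to make the first option concrete, the kind of cache we have in mind is roughly the sketch below - keyed on the base query (and sort) only, with a short TTL so the 5 category-filtered requests can reuse one result. The class, names and the TTL/size numbers are illustrative assumptions, not working Solr code:

    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Short-lived LRU cache of "base" (unfiltered) query results. */
    class BaseQueryCache<V> {
        private static final long TTL_MS = 5000;   // long enough for the 5 fq variants to arrive
        private static final int MAX_ENTRIES = 64; // few entries, but each may hold many doc ids

        private static class Entry<T> {
            final T value;
            final long created = System.currentTimeMillis();
            Entry(T value) { this.value = value; }
        }

        private final Map<String, Entry<V>> map = Collections.synchronizedMap(
            new LinkedHashMap<String, Entry<V>>(MAX_ENTRIES, 0.75f, true) {
                @Override protected boolean removeEldestEntry(Map.Entry<String, Entry<V>> eldest) {
                    return size() > MAX_ENTRIES;
                }
            });

        V get(String queryPlusSortKey) {
            Entry<V> e = map.get(queryPlusSortKey);
            if (e == null || System.currentTimeMillis() - e.created > TTL_MS) return null;
            return e.value;
        }

        void put(String queryPlusSortKey, V value) {
            map.put(queryPlusSortKey, new Entry<V>(value));
        }
    }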
Re: best way to cache "base" queries (before application of filters)
Thanks for your reply, Yonik:

On Thu, May 21, 2009 at 2:43 AM, Yonik Seeley wrote:
>
> Some thoughts:
>
> #1) This is sort of already implemented in some form... see this
> section of solrconfig.xml and try uncommenting it:
> ...
>
> On Wed, May 20, 2009 at 12:43 PM, Yonik Seeley wrote:
> > <useFilterForSortedQuery>true</useFilterForSortedQuery>
>
> Of course the examples you gave used the default sort (by score) so
> this wouldn't help if you do actually need to sort by score.

Right - we need to sort by relevance.

> #2) Your problem might be able to be solved with field collapsing on
> the "category" field in the future (but it's not in Solr yet).

Sorry - I didn't understand this.

> #3) Current work I'm doing right now will push Filters down a level
> and check them in tandem with the query instead of after. This should
> speed things up by at least a factor of 2 in your case.
> https://issues.apache.org/jira/browse/SOLR-1165
>
> I'm trying to get SOLR-1165 finished this week, and I'd love to see
> how it affects your performance.
> In the meantime, try useFilterForSortedQuery and let us know if it
> still works (it's been turned off for a long time) ;-)

OK - so this looks like something to make all queries much faster by only bothering to score results matching a filter? If so, that's really great, but I'm not sure it particularly helps our use-case (other than making all filtered results faster) because:

- we've got one query we want filtered 5 ways, to find the top scoring results matching the query and each filter

- the filtering basically divides that query result set into 5 non-overlapping sets

- the query part is often complicated and expensive - we want to avoid running it 5 times because our sloppy phrase requirement and often millions of hits make finding and scoring expensive

- all documents in the query part will be scored eventually, even with SOLR-1165, because they'll be part of one of the 5 filters

It is tempting to pass back to a custom query component lots of results - enough so that the 'n' top scoring documents that satisfy each filter appear - but we may need to pass up to the query component millions of hits to find, say, the top 5 ranked results for "maps". It is tempting to apply the filters one by one in our own query component on a scored document list retrieved by SolrIndexSearcher - I'm not sure - maybe I haven't understood SOLR-1165?

Thanks also Walter for your suggestions. Our users have a requirement for the index to be continuously updated (well, every 10 minutes or so), and our queries are extremely diverse/"long tail"-ish, so an HTTP cache will probably not help us.

Kent Fitch
UnInvertedField performance on faceted fields containing many unique terms
Hi,

This may be of interest to other users of SOLR's UnInvertedField who have a very large number of unique terms in faceted fields.

Our setup is:

- about 34M Lucene documents of bibliographic and full text content

- index currently 115GB, will at least double over the next 6 months

- moving to support real-time-ish updates (maybe 5 min delay)

We facet on 8 fields, 6 of which are "normal" with small numbers of distinct values. But 2 faceted fields, creator and subject, are huge, with 18M and 9M terms respectively. (Whether we should be faceting on such a huge number of values, and at the same time attempting to provide real-time-ish updates, is another question! Whether facets should be derived from all of the hundreds of thousands of results regardless of match quality, which is what typically happens in a large full text index, is yet another question!) The app is visible here: http://sbdsproto.nla.gov.au/

On a server with 2 x quad core AMD 2382 processors and 64GB memory, java 1.6.0_13-b03 (64 bit) run with "-Xmx15192M -Xms6000M -verbose:gc", and with the index on an Intel X25-M SSD, on start-up the elapsed time to create the 8 facets is 306 seconds (best time). Following an index reopen, the time to recreate them is 318 seconds (best time).

[We have made an independent experimental change to create the facets with 3 async threads, that is, in parallel, and also to decouple them from the underlying index, so our facets lag the index changes by the time it takes to recreate the facets. With our setup, the 3 threads reduced facet creation elapsed time from about 450 secs to around 320 secs, but this will depend a lot on the IO capabilities of the device containing the index, the amount of file system caching, load, etc.]

Anyway, we noticed that huge amounts of garbage were being collected during facet generation of the creator and subject fields, and tracked it down to this decision in UnInvertedField uninvert():

  if (termNum >= maxTermCounts.length) {
    // resize, but conserve memory by not doubling
    // resize at end???  we waste a maximum of 16K (average of 8K)
    int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
    System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
    maxTermCounts = newMaxTermCounts;
  }

So, we tried the obvious things:

- allocate 10K terms initially, rather than 1K

- extend by doubling the current size, rather than adding a fixed 4K

- free unused space at the end (but only if the unused space is "significant") by reallocating the array to the exact required size

And also:

- created a static HashMap lookup keyed on field name which remembers the previously allocated size of maxTermCounts for that field, and initially allocates that size + 1000 entries

The second change is a minor optimisation, but the first change, by eliminating thousands of array reallocations and copies, greatly improved load times: down from 306 to 124 seconds on the initial load, and from 318 to 134 seconds on reloads after index updates. About 60-70 secs is still spent in GC, but it is a significant improvement. Unless you have very large numbers of facet values, this change won't have any positive benefit.

Regards,

Kent Fitch
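PS: roughly, the growth strategy we ended up with looks like the sketch below. This is a simplified illustration of the approach, not the actual patch (the helper names and the 1024-entry "significant" threshold are ours):

    // grow by doubling rather than by a fixed 4096, so an 18M-term field
    // needs a dozen or so reallocations instead of ~4500
    static int[] ensureCapacity(int[] counts, int termNum) {
        if (termNum < counts.length) return counts;
        int newSize = Math.max(counts.length * 2, termNum + 1);
        int[] grown = new int[newSize];
        System.arraycopy(counts, 0, grown, 0, counts.length);
        return grown;
    }

    // once the last term has been seen, give back the unused tail if it is "significant"
    static int[] trimToSize(int[] counts, int numTerms) {
        if (counts.length - numTerms <= 1024) return counts;
        int[] trimmed = new int[numTerms];
        System.arraycopy(counts, 0, trimmed, 0, numTerms);
        return trimmed;
    }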
Re: UnInvertedField performance on faceted fields containing many unique terms
Hi Yonik,

On Tue, Jun 16, 2009 at 10:52 AM, Yonik Seeley wrote:
> All the constants you see in UnInvertedField were a best guess - I
> wasn't working with any real data. It's surprising that a big array
> allocation every 4096 terms is so significant - I had figured that the
> work involved in processing that many terms would far outweigh
> realloc+GC.

Well, they were pretty good guesses! The code is extremely fast for "reasonable" sized term lists. I think with our 18M terms, the increasingly long array of ints was being reallocated, copied and garbage collected 18M/4K = 4,500 times, creating 4,500 x (18M x 4 bytes) / 2 = 162GB of garbage to collect.

> Could you open a JIRA issue with your recommended changes? It's
> simple enough we should have no problem getting it in for Solr 1.4.

Thanks - just added SOLR-1220. I haven't mentioned the change to the initial allocation of 10K (rather than 1024) because I don't think it is significant. I also haven't mentioned the remembering of sizes to initially allocate, because the improvement is marginal compared to this big change, and for all I know, a static hashmap with field names could cause unwanted side effects from field name clashes if running SOLR with multiple indices.

> Also, are you using a recent Solr build (within the last month)?
> LUCENE-1596 should improve uninvert time for non-optimized indexes.

We're not - but we'll upgrade to the latest version of 1.4 very soon.

> And don't forget to update http://wiki.apache.org/solr/PublicServers
> when you go live!

We will - thanks for your great work in improving SOLR performance with 1.4, which makes such outrageous uses of facets even thinkable.

Regards,

Kent Fitch
Re: Facets with an IDF concept
Hi Asif,

I was holding back because we have a similar problem, but we're not sure how best to approach it, or even whether approaching it at all is the right thing to do.

Background:

- large index (~35M documents)

- about 120K of these include full text book contents plus metadata; the rest are just metadata

- we plan to increase the number of full text books to around 1M, and the number of records will greatly increase

We've found that because of the sheer volume of content in full text, we get lots of results in full text of very low relevance. The Lucene relevance ranking works wonderfully to "hide" these way down the list, and when these are the only results at all, the user may be delighted to find obscure hits. But when you search for, say, "soldier of fortune", one of the 55K+ results is Huck Finn, with 4 "soldier(s)" and 6 "fortune(s)", but it probably isn't relevant. The searcher will find it in the result sets, but should the author, subject, dates, formats etc (our facets) of Huck Finn contribute to the facets shown to the user as much as, say, the top 500 results?

Maybe, but perhaps they are "diluting" the value of the facets contributed by the more relevant results. So, we are considering restricting the contents of the result bit set used for faceting to exclude results with a very, very low score (with our own QueryComponent). But there are problems:

- what's a low score? How will a low score threshold vary across queries? (Or should we use a rank cutoff instead, which is much more expensive to compute, or some combo that works with results that only have very low relevance results?)

- should we do this for all facets, or just some (where the less relevant results seem particularly annoying, as they can "mask" facets from the most relevant results - the authors, years and subjects we have full text for are not representative of the whole corpus)?

- if a searcher pages through to the 1000th result page, down to these less relevant results, should we somehow include these results in the facets we show?

Sorry, only more questions!

Regards,

Kent Fitch

On Tue, Jun 23, 2009 at 5:58 PM, Asif Rahman wrote:
> Hi again,
>
> I guess nobody has used facets in the way I described below before. Do any
> of the experts have any ideas as to how to do this efficiently and
> correctly? Any thoughts would be greatly appreciated.
>
> Thanks,
>
> Asif
>
> On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman wrote:
>
>> Hi all,
>>
>> We have an index of news articles that are tagged with news topics.
>> Currently, we use solr facets to see which topics are popular for a given
>> query or time period. I'd like to apply the concept of IDF to the facet
>> counts so as to penalize the topics that occur broadly through our index.
>> I've begun to write a custom facet component that applies the IDF to the facet
>> counts, but I also wanted to check if anyone has experience using facets in
>> this way.
>>
>> Thanks,
>>
>> Asif
>>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
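PS: for what it's worth, the score-threshold idea we described above would be collected with something roughly like the sketch below (Lucene 2.9-era Collector API; the class, names and threshold handling are illustrative assumptions, not code we are actually running):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.util.OpenBitSet;

    /** Collects only docs whose score clears a threshold into a bitset used for faceting. */
    class ScoreThresholdCollector extends Collector {
        private final float minScore;
        private final OpenBitSet facetBits;
        private Scorer scorer;
        private int docBase;

        ScoreThresholdCollector(float minScore, int maxDoc) {
            this.minScore = minScore;
            this.facetBits = new OpenBitSet(maxDoc);
        }

        @Override public void setScorer(Scorer scorer) { this.scorer = scorer; }

        @Override public void collect(int doc) throws IOException {
            if (scorer.score() >= minScore) {
                facetBits.set(docBase + doc);   // only "relevant enough" docs contribute to facets
            }
        }

        @Override public void setNextReader(IndexReader reader, int docBase) { this.docBase = docBase; }

        @Override public boolean acceptsDocsOutOfOrder() { return false; }

        OpenBitSet getFacetBits() { return facetBits; }
    }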