RE: multivalued fields in result
But it doesn't seem to be returning multivalued fields that are stored. It is returning all of the single-value fields, though.

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@buyways.nl]
Sent: Sat 9/11/2010 4:19 AM
To: solr-user@lucene.apache.org
Subject: RE: multivalued fields in result

Yes, you'll get what is stored and asked for.

-Original message-
From: Jason Chaffee
Sent: Sat 11-09-2010 05:27
To: solr-user@lucene.apache.org
Subject: multivalued fields in result

Is it possible to return multivalued fields in the result?

I would like to have a multivalued field that is stored and not indexed (I also copy the same field into another field where it is tokenized and indexed). I would then like all the values of this field returned in the result set. Is there a way to do this?

If it is not possible, could someone elaborate why that is, so that I may see if I can make it work.

thanks,

Jason
Re: Solr memory use, jmap and TermInfos/tii
One thing that the Codec API makes possible ("in theory", anyway)... is a variable gap terms index. Ie, Lucene today makes an indexed term at regular intervals (every N -- 128 in 3.x, 32 in 4.0). But this is rather silly. Imagine the terms you are going through are all singletons (happen only in one doc, eg if they are OCR noise or whatever). Maybe you have 500 such terms in sequence and then you hit a "real" term with a high freq. In this case, you don't really need to add any indexed terms from those 500; instead, make the real term an indexed term. Because... a TermQuery against those singleton terms is going to be wicked fast, so you can afford the extra term-seek time. Whereas a TermQuery against a high-frequency term will be costly, so you want to minimize term-seek time. Such an approach could tremendously reduce the RAM required by the terms index w/ no appreciable hit to the worst-case queries (and possibly a slight improvement).

Mike

On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless wrote:
> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom wrote:
>> Is there an example of how to set up the divisor parameter in
>> solrconfig.xml somewhere?
>
> Alas I don't know how to configure the terms index divisor from Solr...
>
> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use
> large parallel arrays instead of separate objects, and we hold much
> less in RAM. Simply upgrading to 4.0 and re-indexing will show this gain...
>
>> I'm looking forward to a number of the developments in 4.0, but am a bit
>> wary of using it in production. I've wanted to work in some tests with
>> 4.0, but other more pressing issues have so far prevented this.
>
> Understood.
>
>> What about Lucene 2205? Would that be a way to get some of the benefit
>> similar to the changes in flex without the rest of the changes in flex and
>> 4.0?
>
> 2205 was a similar idea (don't create tons of small objects), but it
> was never committed...
>
> I'd be really curious to test the RAM reduction in 4.0 on your terms
> dict/index -- is there any way I could get a copy of just the tii/tis
> files in your index? Your index is a great test for Lucene!
>
>> We haven't been able to make much data available due to copyright and other
>> legal issues. However, since there is absolutely no way anyone could
>> reconstruct copyrighted works from the tii/tis index alone, that should be
>> ok on that front. On Monday I'll try to get legal/administrative clearance
>> to provide the data and also ask around and see if I can get the ok to
>> either find a spare hard drive to ship, or make some kind of sftp
>> arrangement. Hopefully we will find a way to be able to do this.
>
> That would be awesome, thanks!
>
>> BTW Most of the terms are probably the result of dirty OCR, and the impact
>> is probably increased by our present "punctuation filter". When we re-index
>> we plan to use a more intelligent filter that will truncate extremely long
>> tokens on punctuation, and we also plan to do some minimal prefiltering prior
>> to sending documents to Solr for indexing. However, since we now have
>> over 400 languages, we will have to be conservative in our filtering, since
>> we would rather index dirty OCR than risk not indexing legitimate content.
>
> Got it... it's a great test case for Lucene :)
>
> Mike
RE: Delta Import with something other than Date
Alternatively, you could use the deltaQuery to retrieve the last indexed id from the DB (you'd have to save it there on your previous import). Your entity would look something like the sketch below. You could implement your deltaImportQuery as a stored procedure which would store the appropriate id in last_id_table (for the next delta-import) in addition to returning the data from the query.

Ephraim Ofir

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Friday, September 10, 2010 4:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Delta Import with something other than Date

On 9/9/2010 1:23 PM, Vladimir Sutskever wrote:
> Shawn,
>
> Can you provide a sample of passing the parameter via URL? And how using it would look in the data-config.xml

Here's the URL that I send to do a full build on my last shard:

http://idxst5-a:8983/solr/build/dataimport?command=full-import&optimize=true&commit=true&dataTable=ncdat&numShards=6&modVal=5&minDid=0&maxDid=242895591

If I want to do a delta, I just change the command to delta-import and give it a proper minDid value, rather than 0.

Below is the entity from my data-config.xml. You have to have a deltaQuery defined for delta-import to work, but if you're going to use your own placeholders, just put something in that returns a single value very quickly. In my case, my query and deltaImportQuery are actually identical.
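[For illustration, a hedged sketch of Ephraim's idea - table and column names other than last_id_table are made up, so adapt before use:

  <entity name="item" pk="id"
          query="SELECT id, title FROM item"
          deltaQuery="SELECT last_id AS id FROM last_id_table"
          deltaImportQuery="SELECT id, title FROM item WHERE id > '${dataimporter.delta.id}'">
  </entity>

Shawn's variant instead reads the values straight off the request URL - DIH exposes URL parameters as ${dataimporter.request.minDid}, ${dataimporter.request.maxDid}, and so on - which is why his query and deltaImportQuery can be identical.]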
Re: Solr memory use, jmap and TermInfos/tii
On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom wrote:
>> Is there an example of how to set up the divisor parameter in
>> solrconfig.xml somewhere?
>
> Alas I don't know how to configure the terms index divisor from Solr...

To change the divisor in your solrconfig, for example to 4, it looks like you need to do this:

  <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="termInfosIndexDivisor">4</int>
  </indexReaderFactory>

This parameter was added in SOLR-1296, so it's in Solr 1.4.

Tom, i would recommend altering this parameter, instead of the default (1)... especially since you don't have to reindex to take advantage of it.

--
Robert Muir
rcm...@gmail.com
Invalid version or the data in not in 'javabin' format
hi... currently i am integrating nutch (release 1.2) into solr (trunk). when i index into solr with nutch i get this exception:

java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:98)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2010-09-12 11:44:55,101 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

can you tell me what's wrong, or how i can fix this?

best regards marcel :)
Re: Invalid version or the data in not in 'javabin' format
Could be a solrj .jar version compat issue.
Check that the client and server's solrj version jars match up.

Peter

On Sun, Sep 12, 2010 at 1:16 PM, h00kpub...@gmail.com wrote:
> hi... currently i am integrating nutch (release 1.2) into solr (trunk). when
> i index into solr with nutch i get this exception:
>
> java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
> [full stack trace quoted above]
>
> can you tell me what's wrong, or how i can fix this?
>
> best regards marcel :)
Re: Solr memory use, jmap and TermInfos/tii
On Sun, Sep 12, 2010 at 12:42 PM, Robert Muir wrote:
> [Robert's note on setting the divisor via indexReaderFactory, quoted in full above]

Ah, thanks robert! I didn't know about that one either!

simon
Re: mm=0?
Could you explain the use-case a bit? Because the very first response I would have is "why in the world did product management make this a requirement?", and to try to get the requirement changed. As a user, I'm having a hard time imagining being well served by getting a document in response to a search that had no relation to my search - just a random doc selected from the corpus.

All that said, I don't think a single query would do the trick. You could include a "very special" document with a field that no other document had, with very special text in it. Say field name "bogusmatch", filled with the text "bogustext". Then, at least, the second query would match one and only one document and would take minimal time. Or you could tack on to each and every query "OR bogusmatch:bogustext^0.001" (which would really be inexpensive) and filter it out if there was more than one response. By boosting it really low, it should always appear at the end of the list, which wouldn't be a bad thing. DisMax might help you here...

But do ask if it is really a requirement or just something nobody's objected to before bothering IMO...

Best
Erick

On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar <satish.kumar.just.d...@gmail.com> wrote:
> Hi,
>
> We have a requirement to show at least one result every time -- i.e., even
> if the user-entered term is not found in any of the documents. I was hoping
> setting mm to 0 would return results in all cases, but it does not.
>
> For example, if the user entered the term "alpha" and it is *not* in any of
> the documents in the index, any document in the index can be returned. If
> the term "alpha" is in the document set, only documents having the term
> "alpha" must be returned.
>
> My idea so far is to perform a search using the user-entered term. If there
> are any results, return them. If there are no results, perform another
> search without the query term -- this means doing two searches. Any
> suggestions on implementing this requirement using only one search?
>
> Thanks,
> Satish
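[As a concrete illustration of the "tack it on" approach - a sketch using Erick's placeholder field and term names, not a tested recipe:

  q=(alpha) OR bogusmatch:bogustext^0.001

If the only hit is the special bogus document, the client knows the user's term matched nothing and can decide what to display; otherwise it simply filters that document out of the results.]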
Re: Invalid version or the data in not in 'javabin' format
that was the solution!! i packaged the current lucene and solrj builds (dev 4.0), copied the necessary jars into nutch's lib directory (after removing the old ones), rebuilt nutch and ran it - it works!! thank you peter :)

marcel

On 09/12/2010 03:40 PM, Peter Sturge wrote:
> Could be a solrj .jar version compat issue.
> Check that the client and server's solrj version jars match up.
>
> Peter
>
> [original question and stack trace quoted in full -- see above]
Re: multivalued fields in result
Can we see your schema file? Because it sounds like you didn't really declare your field multivalued="true" on the face of things. But if it is multivalued AND you changed it, did you reindex after you changed the schema?

Best
Erick

On Sun, Sep 12, 2010 at 4:21 AM, Jason Chaffee wrote:
> [Jason's message quoted in full -- see above]
Re: Solr memory use, jmap and TermInfos/tii
On Sun, Sep 12, 2010 at 9:57 AM, Simon Willnauer <simon.willna...@googlemail.com> wrote:
> > To change the divisor in your solrconfig, for example to 4, it looks like
> > you need to do this:
> > [indexReaderFactory config quoted above]
>
> Ah, thanks robert! I didn't know about that one either!
>
> simon

actually I'm wrong, for solr 1.4, use "setTermIndexDivisor". i was looking at 3.1/trunk and there is a bug in the name of this parameter:
https://issues.apache.org/jira/browse/SOLR-2118

--
Robert Muir
rcm...@gmail.com
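[So the working Solr 1.4 form would presumably be the following - a sketch based on Robert's correction, not verified against a 1.4 install:

  <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="setTermIndexDivisor">4</int>
  </indexReaderFactory>]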
Tuning Solr caches with high commit rates (NRT)
Hi,

Below are some notes regarding Solr cache tuning that should prove useful for anyone who uses Solr with frequent commits (e.g. <5min).

Environment:
Solr 1.4.1 or branch_3x trunk.
Note the 4.x trunk has lots of neat new features, so the notes here are likely less relevant to the 4.x environment.

Overview:
Our Solr environment makes extensive use of faceting, we perform commits every 30secs, and the indexes tend to be on the large-ish side (>20million docs).
Note: For our data, when we commit, we are always adding new data, never changing existing data.
This type of environment can be tricky to tune, as Solr is more geared toward fast reads than frequent writes.

Symptoms:
If anyone has used faceting in searches where you are also performing frequent commits, you've likely encountered the dreaded OutOfMemory or GC Overhead Exceeded errors.
In high commit rate environments, this is almost always due to multiple 'onDeck' searchers and autowarming - i.e. new searchers don't finish autowarming their caches before the next commit() comes along and invalidates them.
Once this starts happening on a regular basis, it is likely your Solr's JVM will run out of memory eventually, as the number of searchers (and their cache arrays) will keep growing until the JVM dies of thirst.
To check if your Solr environment is suffering from this, turn on INFO level logging, and look for: 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x'.

In tests, we've only ever seen this problem when using faceting, and facet.method=fc.

Some solutions to this are:
    Reduce the commit rate to allow searchers to fully warm before the next commit
    Reduce or eliminate the autowarming in caches
    Both of the above

The trouble is, if you're doing NRT commits, you likely have a good reason for it, and reducing/eliminating autowarming will very significantly impact search performance in high commit rate environments.

Solution:
Here are some setup steps we've used that allow lots of faceting (we typically search with at least 20-35 different facet fields, and date faceting/sorting) on large indexes, and still keep decent search performance:

1. Firstly, you should consider using the enum method for facet searches (facet.method=enum) unless you've got A LOT of memory on your machine. In our tests, this method uses a lot less memory and autowarms more quickly than fc. (Note, I've not tried the new segment-based 'fcs' option, as I can't find support for it in branch_3x - looks nice for 4.x though)
Admittedly, for our data, enum is not quite as fast for searching as fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile tradeoff.
If you do have access to LOTS of memory, AND you can guarantee that the index won't grow beyond the memory capacity (i.e. you have some sort of deletion policy in place), fc can be a lot faster than enum when searching with lots of facets across many terms.

2. Secondly, we've found that LRUCache is faster at autowarming than FastLRUCache - in our tests, about 20% faster. Maybe this is just our environment - your mileage may vary.

So, our filterCache section in solrconfig.xml looks like this:

  <filterCache
      class="solr.LRUCache"
      size="3600"
      initialSize="1400"
      autowarmCount="3600"/>

For a 28GB index, running in a quad-core x64 VMWare instance, 30 warmed facet fields, Solr is running at ~4GB. Stats filterCache size shows usually in the region of ~2400.

3. It's also a good idea to have some sort of firstSearcher/newSearcher event listener queries to allow new data to populate the caches.
Of course, what you put in these is dependent on the facets you need/use.
We've found a good combination is a firstSearcher with as many facets in the search as your environment can handle, then a subset of the most common facets for the newSearcher.

4. We also set:

  <useColdSearcher>true</useColdSearcher>

just in case.

5. Another key area for search performance with high commits is to use 2 Solr instances - one for the high commit rate indexing, and one for searching.
The read-only searching instance can be a remote replica, or a local read-only instance that reads the same core as the indexing instance (for the latter, you'll need something that periodically refreshes - i.e. runs commit()).
This way, you can tune the indexing instance for writing performance and the searching instance as above for max read performance.

Using the setup above, we get fantastic searching speed for small facet sets (well under 1sec), and really good searching for large facet sets (a couple of secs depending on index size, number of facets, unique terms etc. etc.), even when searching against largeish indexes (>20million docs).
We have yet to see any OOM or GC errors using the techniques above, even in low memory conditions.

I hope there are people that find this useful. I know I've spent a lot of time looking for stuff like this, so hopefully, this will save someone some time.

Peter
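[For anyone wiring up step 3, a minimal sketch of such listeners in solrconfig.xml might look like this - facet field names are placeholders, not from Peter's setup:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.method">enum</str>
        <str name="facet.field">field_a</str>
        <str name="facet.field">field_b</str>
      </lst>
    </arr>
  </listener>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.method">enum</str>
        <str name="facet.field">field_a</str>
      </lst>
    </arr>
  </listener>]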
Re: Tuning Solr caches with high commit rates (NRT)
Peter:

This kind of information is extremely useful to document, thanks! Do you have the time/energy to put it up on the Wiki? Anyone can edit it by creating a logon. If you don't, would it be OK if someone else did it (with attribution, of course)? I guess that by bringing it up I'm volunteering :)...

Best
Erick

On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge wrote:
> [Peter's cache-tuning notes quoted in full -- see the original post above]
Re: Tuning Solr caches with high commit rates (NRT)
Wow! Thanks for that. This email is DEFINITELY being filed.

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Sun, 9/12/10, Peter Sturge wrote:
> [Peter's cache-tuning notes quoted in full -- see the original post above]
Re: Solr and jvm Garbage Collection tuning
On Sep 10, 2010, at 7:01 PM, Burton-West, Tom wrote:

> We have noticed that when the first query hits Solr after starting it up, memory use increases significantly, from about 1GB to about 16GB, and then as queries are received it goes up to about 19GB, at which point there is a full garbage collection which takes about 30 seconds, and then memory use drops back down to 16GB. Under a relatively heavy load, the full GC happens about every 10-20 minutes.
>
> We are running 3 Solr shards under one Tomcat with 20GB allocated to the jvm. Each shard has a total index size of about 400GB on disk and a tii size of about 600MB, and indexes about 650,000 full-text books. (The server has a total of 72GB of memory, so we are leaving quite a bit of memory for the OS disk cache).
>
> Is there some argument we could give the jvm so that it would collect garbage more frequently? Or some other JVM tuning action that might reduce the amount of time where Solr is waiting on GC?
>
> If we could get the time for each GC to take under a second, with the trade-off being that GC would occur much more frequently, that would help us avoid the occasional query taking more than 30 seconds, at the cost of a larger number of queries taking at least a second.

What are your current GC settings? Also, I guess I'd look at ways you can reduce the heap size needed. Caching, field type choices, faceting choices. Also could try playing with the termIndexInterval, which will load fewer terms into memory at the cost of longer seeks.

At some point, though, you just may need more shards and the resulting smaller indexes. How many CPU cores do you have on each machine?
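[Not an answer from the thread, but for anyone wondering which "GC settings" to try first: the concurrent collector was the usual starting point on JVMs of this era - something along these lines, with flag values that are purely illustrative and need tuning per heap:

  JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

CMS trades some steady-state CPU for much shorter stop-the-world pauses, which is usually the right trade when single full GCs are taking 30 seconds.]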
Re: Tuning Solr caches with high commit rates (NRT)
Peter,

thanks a lot for your in-depth explanations! Your findings will be definitely helpful for my next performance improvement tests :-)

Two questions:

1. How would I do that:
> or a local read-only instance that reads the same core as the indexing
> instance (for the latter, you'll need something that periodically refreshes -
> i.e. runs commit()).

2. Did you try sharding with your current setup (e.g. one big, nearly-static index and a tiny write+read index)?

Regards,
Peter.

> [Peter Sturge's cache-tuning notes quoted in full -- see the original post above]
Re: Tuning Solr caches with high commit rates (NRT)
Peter,

Are you using per-segment faceting, eg, SOLR-1617? That could help your situation.

On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge wrote:
> [Peter's cache-tuning notes quoted in full -- see the original post above]
Re: Tuning Solr caches with high commit rates (NRT)
Hi Jason,

I've tried some limited testing with the 4.x trunk using fcs, and I must say, I really like the idea of per-segment faceting.
I was hoping to see it in 3.x, but I don't see this option in the branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the one to use with 3.1?
There seems to be a number of Solr issues tied to this - one of them being LUCENE-1785. Can the per-segment faceting patch work with Lucene 2.9/branch_3x?

Thanks,
Peter

On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen wrote:
> Peter,
>
> Are you using per-segment faceting, eg, SOLR-1617? That could help
> your situation.
>
>> [Peter's cache-tuning notes quoted in full -- see the original post above]
Re: No more trunk support for 2.9 indexes
> I suppose an index 'remaker' might be something like a DIH reader for
> a Solr index - streams everything out of the existing index, writing
> it into the new one?

This works fine if all fields are stored (and copy field does not go to a stored field), otherwise you would need/want to start with the original source.

ryan
Re: Tuning Solr caches with high commit rates (NRT)
Bravo!

Other tricks: here is a policy for deciding when to merge segments that attempts to balance merging with performance. It was contributed by LinkedIn - they also run index&search in the same instance (not Solr, a different Lucene app).

lucene/contrib/misc/src/java/org/apache/lucene/index/BalancedSegmentMergePolicy.java

The optimize command now includes a partial optimize option, so you can do larger controlled merges.

Peter Sturge wrote:
> [Peter's cache-tuning notes quoted in full -- see the original post above]
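[To make those two tricks concrete - a hedged sketch assuming branch_3x-style config syntax; check your version's solrconfig semantics before copying:

  <mergePolicy class="org.apache.lucene.index.BalancedSegmentMergePolicy"/>

and a partial optimize down to, say, 4 segments via the XML update command:

  <optimize maxSegments="4"/>]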
Re: multivalued fields in result
Also, the 'v' is capitalized: multiValued. (This is one reason why posting your schema helps.)

Erick Erickson wrote:
> Can we see your schema file? Because it sounds like you didn't really
> declare your field multivalued="true" on the face of things. But if it is
> multivalued AND you changed it, did you reindex after you changed the schema?
>
> [rest of thread quoted in full -- see above]
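[For reference, a declaration along the lines Jason describes would look something like this in schema.xml - field and type names here are hypothetical:

  <field name="labels" type="string" indexed="false" stored="true" multiValued="true"/>
  <field name="labels_search" type="text" indexed="true" stored="false" multiValued="true"/>
  <copyField source="labels" dest="labels_search"/>

All stored values of 'labels' then come back in results, while searches hit the tokenized 'labels_search' copy. Note that a field indexed before a schema change like this still needs a re-index to behave correctly.]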
Re: Tuning Solr caches with high commit rates (NRT)
Thanks, Peter. This is really great info.

One setting I've found to be very useful for the problem of overlapping onDeckSearchers is to reduce the value of maxWarmingSearchers in solrconfig.xml. I've reduced this to 1, so if a slave is already busy doing pre-warming, it won't try to also pre-warm additional updates. This has greatly reduced our time to incorporate updates, with no visible downsides other than an uglier snapinstaller.log (we're still using 1.3 w/rsync-based replication).

-Chris

On Sep 12, 2010, at 9:26 AM, Peter Sturge wrote:
> Hi,
>
> Below are some notes regarding Solr cache tuning that should prove
> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>
> Environment:
> Solr 1.4.1 or branch_3x trunk.
> Note the 4.x trunk has lots of neat new features, so the notes here
> are likely less relevant to the 4.x environment.
>
> Overview:
> Our Solr environment makes extensive use of faceting, we perform
> commits every 30secs, and the indexes tend to be on the large-ish side
> (>20 million docs).
> Note: For our data, when we commit, we are always adding new data,
> never changing existing data.
> This type of environment can be tricky to tune, as Solr is more geared
> toward fast reads than frequent writes.
>
> Symptoms:
> If anyone has used faceting in searches where you are also performing
> frequent commits, you've likely encountered the dreaded OutOfMemory or
> GC Overhead Exceeded errors.
> In high commit rate environments, this is almost always due to
> multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
> finish autowarming their caches before the next commit()
> comes along and invalidates them.
> Once this starts happening on a regular basis, it is likely your
> Solr's JVM will run out of memory eventually, as the number of
> searchers (and their cache arrays) will keep growing until the JVM
> dies of thirst.
> To check if your Solr environment is suffering from this, turn on INFO
> level logging, and look for: 'PERFORMANCE WARNING: Overlapping
> onDeckSearchers=x'.
>
> In tests, we've only ever seen this problem when using faceting, and
> facet.method=fc.
>
> Some solutions to this are:
>   Reduce the commit rate to allow searchers to fully warm before the
>   next commit
>   Reduce or eliminate the autowarming in caches
>   Both of the above
>
> The trouble is, if you're doing NRT commits, you likely have a good
> reason for it, and reducing/eliminating autowarming will very
> significantly impact search performance in high commit rate
> environments.
>
> Solution:
> Here are some setup steps we've used that allow lots of faceting (we
> typically search with at least 20-35 different facet fields, and date
> faceting/sorting) on large indexes, and still keep decent search
> performance:
>
> 1. Firstly, you should consider using the enum method for facet
> searches (facet.method=enum) unless you've got A LOT of memory on your
> machine. In our tests, this method uses a lot less memory and
> autowarms more quickly than fc. (Note, I've not tried the new
> segment-based 'fcs' option, as I can't find support for it in
> branch_3x - looks nice for 4.x though)
> Admittedly, for our data, enum is not quite as fast for searching as
> fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
> tradeoff.
> If you do have access to LOTS of memory, AND you can guarantee that
> the index won't grow beyond the memory capacity (i.e. you have some
> sort of deletion policy in place), fc can be a lot faster than enum
> when searching with lots of facets across many terms.
>
> 2. Secondly, we've found that LRUCache is faster at autowarming than
> FastLRUCache - in our tests, about 20% faster. Maybe this is just our
> environment - your mileage may vary.
>
> So, our filterCache section in solrconfig.xml looks like this:
>   <filterCache
>     class="solr.LRUCache"
>     size="3600"
>     initialSize="1400"
>     autowarmCount="3600"/>
>
> For a 28GB index, running in a quad-core x64 VMware instance, 30
> warmed facet fields, Solr is running at ~4GB. Stats filterCache size
> shows usually in the region of ~2400.
>
> 3. It's also a good idea to have some sort of
> firstSearcher/newSearcher event listener queries to allow new data to
> populate the caches.
> Of course, what you put in these is dependent on the facets you need/use.
> We've found a good combination is a firstSearcher with as many facets
> in the search as your environment can handle, then a subset of the
> most common facets for the newSearcher.
>
> 4. We also set:
>   <useColdSearcher>true</useColdSearcher>
> just in case.
>
> 5. Another key area for search performance with high commits is to use
> 2 Solr instances - one for the high commit rate indexing, and one for
> searching.
> The read-only searching instance can be a remote replica, or a local
> read-only instance that reads the same core as the indexing instance
> (for the latter, you'll need something that periodically refreshes -
> i.e. runs commit()
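Putting Chris's maxWarmingSearchers tip together with Peter's points 3 and 4, the relevant solrconfig.xml pieces might look like the following. This is only a sketch: the warming query and facet field name are placeholders, not taken from either setup.

  <maxWarmingSearchers>1</maxWarmingSearchers>
  <useColdSearcher>true</useColdSearcher>

  <!-- warm the caches of each new searcher with a representative facet query -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.method">enum</str>
        <str name="facet.field">your_common_facet_field</str>
      </lst>
    </arr>
  </listener>

A firstSearcher listener with a larger set of facet fields follows the same pattern.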
Re: Tuning Solr caches with high commit rates (NRT)
Yeah there's no patch... I think Yonik can write it. :-) Yah... The Lucene version shouldn't matter. The distributed faceting can in theory easily be applied to multiple segments; however, the way it's written is, for me, a challenge to untangle and apply successfully to a working patch. Also, I don't have this as an itch to scratch at the moment.

On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge wrote:
> Hi Jason,
>
> I've tried some limited testing with the 4.x trunk using fcs, and I
> must say, I really like the idea of per-segment faceting.
> I was hoping to see it in 3.x, but I don't see this option in the
> branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
> one to use with 3.1?
> There seems to be a number of Solr issues tied to this - one of them
> being LUCENE-1785. Can the per-segment faceting patch work with Lucene
> 2.9/branch_3x?
>
> Thanks,
> Peter
>
> On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen wrote:
>> Peter,
>>
>> Are you using per-segment faceting, e.g. SOLR-1617? That could help
>> your situation.
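For anyone who wants to experiment on the 4.x trunk, fcs is selected with the facet.method=fcs request parameter. To make it the default for a handler, it can go in the handler's defaults; a sketch only, assuming the stock example config's standard handler:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <!-- per-segment field cache faceting; fc and enum remain available per-request -->
      <str name="facet.method">fcs</str>
    </lst>
  </requestHandler>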
RE: multivalued fields in result
My schema.xml was fine. The problem was that the top 10 documents returned by my test queries didn't happen to have data in those fields. Once I increased the rows, I saw the results. Definitely user error. :)

Thanks for the help though.

Jason

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sun 9/12/2010 6:23 PM
To: solr-user@lucene.apache.org
Subject: Re: multivalued fields in result

Also, the 'v' is capitalized: multiValued. (This is one reason why posting your schema helps.)
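For anyone who hits the same thing: before suspecting the schema, it's worth re-running the query with a larger row count and an explicit field list, e.g. (field names hypothetical):

  q=*:*&rows=100&fl=id,labels

If the multivalued field shows up there, the schema is fine; the first 10 documents simply had no values in that field.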
Re: Tuning Solr caches with high commit rates (NRT)
BTW, what is a segment? I've only heard about them in the last 2 weeks here on the list.

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php
what's the difference between SolrCloud and Solr+Hadoop
Dear All:

I need Solr for distributed search, and I found there are two choices: SolrCloud and Solr+Hadoop. I want to know the differences between them. Also, we can download SolrCloud from svn, but how can we get Solr+Hadoop?

Please help me! Thank you!

2010-09-13
郭芸