Multiple sorting on text fields
Hi all!

I found some strange behavior in Solr. If I sort by two text fields in a chain, I receive some results doubled. Both text fields are not multivalued; one of them is a string, the other a custom type based on a text field with a keyword analyzer.

I do this:

    CommonsHttpSolrServer server = SolrServer.getInstance().getServer();
    SolrQuery query = new SolrQuery();
    query.setQuery(suchstring);
    query.addSortField("type", SolrQuery.ORDER.asc);     // string field - it's only one letter
    query.addSortField("sortName", SolrQuery.ORDER.asc); // text field, not tokenized
    QueryResponse rsp = server.query(query);

After that I extract the results as a list of Entity objects. Most of them are unique, but some of them are doubled and even tripled in this list. (Each object has a unique id and appears only once in the index.)
If I sort by only one text field, I receive "normal" results without problems.
Where could I have made a mistake, or is it a bug?

Best regards,
Stanislaw
Re: what are the differences between SolrCloud and Solr+Hadoop
Well, these are pretty different things. SolrCloud is meant to handle distributed search in an easier way than "raw" Solr distributed search. You still have to build the shards in your own way. Solr+Hadoop is a way to build these shards/indexes in parallel.
Re: Multiple sorting on text fields
My guess is that two things are happening:

1/ Your combination of filters runs in parallel, or as an OR expression. This, I suspect, is the case - see next.
2/ To get 3 duplicate results, your custom filter AND the OR expression above have to be working together, or it's possible that your custom filter is the WHOLE problem, supplying the duplicates and the triplicates.

A first guess, nothing more :-)

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Mon, 9/13/10, Stanislaw wrote:

> From: Stanislaw
> Subject: Multiple sorting on text fields
> To: solr-user@lucene.apache.org
> Date: Monday, September 13, 2010, 12:12 AM
> [...]
Re: Tuning Solr caches with high commit rates (NRT)
The balanced segment merging is a really cool idea. I'll definitely have a look at this, thanks!

One thing I forgot to mention in the original post is that we use a mergeFactor of 25. That's somewhat on the high side, so that incoming commits aren't trying to merge new data into large segments. 25 is a good balance for us between the number of files and search performance. This LinkedIn patch could come in very handy for handling merges.

On Mon, Sep 13, 2010 at 2:20 AM, Lance Norskog wrote:

> Bravo!
>
> Other tricks: here is a policy for deciding when to merge segments that
> attempts to balance merging with performance. It was contributed by
> LinkedIn - they also run index & search in the same instance (not Solr, a
> different Lucene app).
>
> lucene/contrib/misc/src/java/org/apache/lucene/index/BalancedSegmentMergePolicy.java
>
> The optimize command now includes a partial optimize option, so you can do
> larger controlled merges.
>
> Peter Sturge wrote:
>>
>> Hi,
>>
>> Below are some notes regarding Solr cache tuning that should prove
>> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>>
>> Environment:
>> Solr 1.4.1 or branch_3x trunk.
>> Note the 4.x trunk has lots of neat new features, so the notes here
>> are likely less relevant to the 4.x environment.
>>
>> Overview:
>> Our Solr environment makes extensive use of faceting, we perform
>> commits every 30secs, and the indexes tend to be on the large-ish side
>> (>20 million docs).
>> Note: For our data, when we commit, we are always adding new data,
>> never changing existing data.
>> This type of environment can be tricky to tune, as Solr is more geared
>> toward fast reads than frequent writes.
>>
>> Symptoms:
>> If anyone has used faceting in searches where you are also performing
>> frequent commits, you've likely encountered the dreaded OutOfMemory or
>> GC Overhead Exceeded errors.
>> In high commit rate environments, this is almost always due to
>> multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
>> finish autowarming their caches before the next commit()
>> comes along and invalidates them.
>> Once this starts happening on a regular basis, it is likely your
>> Solr's JVM will run out of memory eventually, as the number of
>> searchers (and their cache arrays) will keep growing until the JVM
>> dies of thirst.
>> To check if your Solr environment is suffering from this, turn on INFO
>> level logging, and look for: 'PERFORMANCE WARNING: Overlapping
>> onDeckSearchers=x'.
>>
>> In tests, we've only ever seen this problem when using faceting, and
>> facet.method=fc.
>>
>> Some solutions to this are:
>> - Reduce the commit rate to allow searchers to fully warm before the
>>   next commit
>> - Reduce or eliminate the autowarming in caches
>> - Both of the above
>>
>> The trouble is, if you're doing NRT commits, you likely have a good
>> reason for it, and reducing/eliminating autowarming will very
>> significantly impact search performance in high commit rate
>> environments.
>>
>> Solution:
>> Here are some setup steps we've used that allow lots of faceting (we
>> typically search with at least 20-35 different facet fields, and date
>> faceting/sorting) on large indexes, and still keep decent search
>> performance:
>>
>> 1. Firstly, you should consider using the enum method for facet
>> searches (facet.method=enum) unless you've got A LOT of memory on your
>> machine. In our tests, this method uses a lot less memory and
>> autowarms more quickly than fc. (Note, I've not tried the new
>> segment-based 'fcs' option, as I can't find support for it in
>> branch_3x - looks nice for 4.x though.)
>> Admittedly, for our data, enum is not quite as fast for searching as
>> fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
>> tradeoff.
>> If you do have access to LOTS of memory, AND you can guarantee that
>> the index won't grow beyond the memory capacity (i.e. you have some
>> sort of deletion policy in place), fc can be a lot faster than enum
>> when searching with lots of facets across many terms.
>>
>> 2. Secondly, we've found that LRUCache is faster at autowarming than
>> FastLRUCache - in our tests, about 20% faster. Maybe this is just our
>> environment - your mileage may vary.
>>
>> So, our filterCache section in solrconfig.xml looks like this:
>>
>> <filterCache class="solr.LRUCache"
>>   size="3600"
>>   initialSize="1400"
>>   autowarmCount="3600"/>
>>
>> For a 28GB index, running in a quad-core x64 VMWare instance, 30
>> warmed facet fields, Solr is running at ~4GB. Stats filterCache size
>> shows usually in the region of ~2400.
>>
>> 3. It's also a good idea to have some sort of
>> firstSearcher/newSearcher event listener queries to allow new data to
>> populate the caches.
>> Of course, what you put in these is dependent on the facets you need/use.
>> We've found a good combination is a firstSearcher with as many facets
>> in the search as your environment can handle, then a subset of the
>> most common facets for the newSearcher.
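As a concrete illustration of the facet.method=enum suggestion in the quoted notes, here is a minimal SolrJ sketch; the server URL and facet field names are placeholders, not values from the thread:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class EnumFacetDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical server URL and facet fields - adjust to your setup.
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("*:*");
            query.setFacet(true);
            query.addFacetField("type", "status");
            query.set("facet.method", "enum"); // term-enum method instead of fc
            QueryResponse rsp = server.query(query);
            System.out.println(rsp.getFacetFields());
        }
    }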
Re: Tuning Solr caches with high commit rates (NRT)
1. You can run multiple Solr instances in separate JVMs, with both having their solr.xml configured to use the same index folder.
You need to be careful that one and only one of these instances will ever update the index at a time. The best way to ensure this is to use one instance for writing only, and make the other read-only, never writing to the index. This read-only instance is the one to tune for high search performance.
Even though the RO instance doesn't write to the index, it still needs periodic (albeit empty) commits to kick off autowarming/cache refresh.
Depending on your needs, you might not need to have 2 separate instances. We need it because the 'write' instance is also doing a lot of metadata pre-write operations in the same JVM as Solr, and so has its own memory requirements.

2. We use sharding all the time, and it works just fine with this scenario, as the RO instance is simply another shard in the pack.

On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich wrote:

> Peter,
>
> thanks a lot for your in-depth explanations!
> Your findings will be definitely helpful for my next performance
> improvement tests :-)
>
> Two questions:
>
> 1. How would I do that:
>
>> or a local read-only instance that reads the same core as the indexing
>> instance (for the latter, you'll need something that periodically refreshes
>> - i.e. runs commit()).
>
> 2. Did you try sharding with your current setup (e.g. one big,
> nearly-static index and a tiny write+read index)?
>
> Regards,
> Peter.
>
>> [...]
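As a sketch of the periodic empty commit described above for the read-only instance - the URL and the 30-second interval are assumptions, not values from the thread:

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class PeriodicCommit {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL of the read-only Solr instance.
            CommonsHttpSolrServer readOnly =
                new CommonsHttpSolrServer("http://localhost:8984/solr");
            while (true) {
                readOnly.commit();        // empty commit: no docs pending, just reopens the searcher
                Thread.sleep(30 * 1000L); // match your desired refresh interval
            }
        }
    }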
Re: Tuning Solr caches with high commit rates (NRT)
Hi Erick,

I thought this would be good for the wiki, but I've not submitted to the wiki before, so I thought I'd put this info out there first, then add it if it was deemed useful.
If you could let me know the procedure for submitting, it probably would be worth getting it into the wiki (I couldn't do it straight away, as I have a lot of projects on at the moment).
If you're able/willing to put it on there for me, that would be very kind of you!

Thanks!
Peter

On Sun, Sep 12, 2010 at 5:43 PM, Erick Erickson wrote:

> Peter:
>
> This kind of information is extremely useful to document, thanks! Do you
> have the time/energy to put it up on the Wiki? Anyone can edit it by
> creating a logon. If you don't, would it be OK if someone else did it
> (with attribution, of course)? I guess that by bringing it up I'm
> volunteering :)...
>
> Best
> Erick
>
> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge wrote:
>
>> [...]
Re: Tuning Solr caches with high commit rates (NRT)
Hi Dennis,

These are the Lucene file segments that hold the index data on the file system. Have a look at:
http://wiki.apache.org/solr/SolrPerformanceFactors

Peter

On Mon, Sep 13, 2010 at 7:02 AM, Dennis Gearon wrote:

> BTW, what is a segment?
>
> I've only heard about them in the last 2 weeks here on the list.
> Dennis Gearon
>
> --- On Sun, 9/12/10, Jason Rutherglen wrote:
>
>> From: Jason Rutherglen
>> Subject: Re: Tuning Solr caches with high commit rates (NRT)
>> To: solr-user@lucene.apache.org
>> Date: Sunday, September 12, 2010, 7:52 PM
>> Yeah, there's no patch... I think Yonik can write it. :-) The
>> Lucene version shouldn't matter. The distributed faceting can
>> theoretically be applied to multiple segments easily enough, however
>> the way it's written makes it a challenge for me to untangle and apply
>> successfully to a working patch. Also, I don't have this as an itch to
>> scratch at the moment.
>>
>> On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge wrote:
>> > Hi Jason,
>> >
>> > I've tried some limited testing with the 4.x trunk using fcs, and I
>> > must say, I really like the idea of per-segment faceting.
>> > I was hoping to see it in 3.x, but I don't see this option in the
>> > branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
>> > one to use with 3.1?
>> > There seems to be a number of Solr issues tied to this - one of them
>> > being LUCENE-1785. Can the per-segment faceting patch work with Lucene
>> > 2.9/branch_3x?
>> >
>> > Thanks,
>> > Peter
>> >
>> > On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen wrote:
>> >> Peter,
>> >>
>> >> Are you using per-segment faceting, eg, SOLR-1617? That could help
>> >> your situation.
>> >>
>> >> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge wrote:
>> >>> [...]
Re: Tuning Solr caches with high commit rates (NRT)
On Mon, Sep 13, 2010 at 8:02 AM, Dennis Gearon wrote:
> BTW, what is a segment?

On the Lucene level, an index is composed of one or more index segments. Each segment is an index by itself and consists of several files like doc stores, proximity data, term dictionaries, etc. During indexing, Lucene / Solr creates those segments depending on RAM buffer / document buffer settings and flushes them to disk (if you index to disk). Once a segment has been flushed, Lucene will never change it (well, up to a certain level - let's keep this simple) but will write new segments for newly added documents. Since segments have a write-once policy, Lucene merges multiple segments into a new segment from time to time (how and when this happens is a different story) to get rid of deleted documents and to reduce the overall number of segments in the index.
Generally, a higher number of segments will also influence your search performance, since Lucene performs almost all operations on a per-segment level. If you want to reduce the number of segments to one, you need to call optimize, and Lucene will merge all existing segments into one single segment.

hope that answers your question

simon

> I've only heard about them in the last 2 weeks here on the list.
> Dennis Gearon
>
> --- On Sun, 9/12/10, Jason Rutherglen wrote:
>
>> [...]
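A minimal SolrJ sketch of that optimize call (the server URL is a placeholder):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class OptimizeDemo {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Merges all index segments into a single segment.
            // This is I/O-heavy on large indexes, so schedule it off-peak.
            server.optimize();
        }
    }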
Re: Sorting not working on a string field
Hi,

Can you show us what result you actually get?

Wouldn't it make more sense to choose a numeric field type? To get a proper sort order for numbers in a string field, all numbers need to be exactly the same length, since the order will be lexicographic, i.e. "10" will come before "2", but after "02".

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. sep. 2010, at 19.14, n...@frameweld.com wrote:

> Hello, I seem to be having a problem with sorting. I have a string field
> (time_code) that I want to order by. When the results come up, it displays
> the results differently from relevance, which I would assume, but the results
> aren't ordered. The data in time_code came from a numeric decimal with a six
> digit precision, if that makes a difference (ex: 1.00).
>
> Here is the query I give it:
>
> q=ceremony+AND+presentation_id%3A296+AND+type%3Ablob&version=1.3&json.nl=map&rows=10&start=0&wt=json&hl=true&hl.fl=text&hl.simple.pre=&hl.simple.post=<%2Fspan>&hl.fragsize=0&hl.mergeContiguous=false&&sort=time_code+asc
>
> And here's the field schema:
>
> [field definitions garbled in the archive]
>
> Thanks for any help.
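A tiny, self-contained illustration of the lexicographic ordering Jan describes, along with zero-padding as a workaround (the width of 6 is arbitrary):

    import java.util.Arrays;

    public class LexSortDemo {
        public static void main(String[] args) {
            String[] raw = {"2", "10", "02"};
            Arrays.sort(raw);
            System.out.println(Arrays.toString(raw)); // [02, 10, 2] - lexicographic order

            // Zero-padding every value to the same width restores numeric ordering:
            String[] padded = {String.format("%06.2f", 2.0), String.format("%06.2f", 10.0)};
            Arrays.sort(padded);
            System.out.println(Arrays.toString(padded)); // [002.00, 010.00]
        }
    }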
Re: mm=0?
As Erick points out, you don't want a random doc as a response! What you're looking at is how to avoid the "0 hits" problem. You could look into one of these:

* Introduce autosuggest to avoid many 0-hits cases
* Introduce spellchecking
* Re-run the failed query with fuzzy turned on (e.g. alpha~), as sketched below
* Redirect the user to some other, broader source (Wikipedia, Google...) if relevant to your domain

No matter what you do, it is important to communicate it to the user in a very clear way.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 11. sep. 2010, at 19.10, Satish Kumar wrote:

> Hi,
>
> We have a requirement to show at least one result every time -- i.e., even
> if the user-entered term is not found in any of the documents. I was hoping
> setting mm to 0 will return results in all cases, but it is not.
>
> For example, if the user entered the term "alpha" and it is *not* in any of
> the documents in the index, any document in the index can be returned. If
> the term "alpha" is in the document set, only documents having the term
> "alpha" must be returned.
>
> My idea so far is to perform a search using the user-entered term. If there
> are any results, return them. If there are no results, perform another
> search without the query term -- this means doing two searches. Any
> suggestions on implementing this requirement using only one search?
>
> Thanks,
> Satish
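A sketch of the fuzzy re-run suggestion in SolrJ - the server URL is a placeholder, and '~' is standard Lucene fuzzy syntax applied per term:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FuzzyFallback {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("alpha");
            QueryResponse rsp = server.query(query);
            if (rsp.getResults().getNumFound() == 0) {
                // No hits: retry the same term with fuzzy matching.
                query.setQuery("alpha~");
                rsp = server.query(query);
            }
            System.out.println(rsp.getResults().getNumFound() + " hits");
        }
    }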
Re: Multiple sorting on text fields
Hi Dennis,

Thanks for the reply. Please explain which filter you mean.

I'm searching on only one field, with names:

    query.setQuery(suchstring);

Then I'm adding two sorts on other fields:

    query.addSortField("type", SolrQuery.ORDER.asc);
    query.addSortField("sortName", SolrQuery.ORDER.asc);

The results should be sorted first by 'type' (only one letter, 'A' or 'B'), and then they should be sorted by names.

Where would I define 'OR' or 'AND' relations here?

Best regards,
Stanislaw

2010/9/13 Dennis Gearon

> My guess is that two things are happening:
> [...]
Re: Solr CoreAdmin create ignores dataDir Parameter
MitchK wrote:
> Frank,
> have a look at SOLR-646. Do you think a workaround for the data-dir-tag in
> the solrconfig.xml can help? I think about something like
> ${solr./data/corename} for illustration. Unfortunately I am not very
> skilled in working with solr's variables and therefore I do not know what
> variables are available.
No, variables are not available at this stage.
> If we find a solution, we should provide it as a suggestion at the wiki's
> CoreAdmin-page.
> Kind regards,
> - Mitch
--
with kind regards,
Frank Wesemann
Fotofinder GmbH              USt-IdNr. DE812854514
Software Entwicklung         Web: http://www.fotofinder.com/
Potsdamer Str. 96            Tel: +49 30 25 79 28 90
10785 Berlin                 Fax: +49 30 25 79 28 999
Sitz: Berlin
Amtsgericht Berlin Charlottenburg (HRB 73099)
Geschäftsführer: Ali Paczensky
Re: Multiple sorting on text fields
A couple of things come to mind:

1> What happens if you remove the sort clauses? I suspect they're irrelevant and your duplicate issue is something different.
2> The Solr admin pages should let you determine this.
3> Please show us the configuration that makes you sure the documents are unique (I'm assuming you've defined a <uniqueKey> in your schema, but please show us - and show us the field TYPE definition).
4> Assuming the uniqueKey is defined, did you perhaps define it after you'd indexed some documents? Solr doesn't apply uniqueness retroactively.
5> Your secondary sort looks like it's on a tokenized field (again guessing - you haven't provided your schema definitions). It should not be. NOTE: this is different from multivalued! Again, I doubt this has anything to do with your duplicate issue, but it'll make your sorting "interesting".

Again, I think the sorting is unrelated to your underlying duplication issue, so until you're sure your index is in the state you think it's in, I'd ignore sorting. A query like the one sketched below can confirm whether any ids really are duplicated.

Best
Erick

On Mon, Sep 13, 2010 at 5:56 AM, Stanislaw wrote:

> Hi Dennis,
> thanks for the reply.
> [...]
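One way to run that check is to facet on the uniqueKey field and list any value that appears more than once - a sketch, assuming SolrJ, a local server URL, and that the uniqueKey field is named "id":

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DuplicateIdCheck {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(0);          // we only want the facet counts
            query.setFacet(true);
            query.addFacetField("id"); // assumed uniqueKey field
            query.setFacetMinCount(2); // only report ids indexed more than once
            query.setFacetLimit(-1);
            QueryResponse rsp = server.query(query);
            FacetField ids = rsp.getFacetField("id");
            if (ids.getValues() != null) {
                for (FacetField.Count c : ids.getValues()) {
                    System.out.println("duplicated id: " + c.getName() + " x" + c.getCount());
                }
            }
        }
    }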
stopwords in AND clauses
Let's suppose we have a regular search field body_t, and an internal boolean flag flag_t not exposed to the user.

I'd like

body_t:foo AND flag_t:true

to be an intersection, but if "foo" is a stopword I get all documents for which flag_t is true, as if the first clause was dropped, or as if technically all documents match an empty string.

Is there a way to get 0 results instead?
Re: stopwords in AND clauses
On Mon, Sep 13, 2010 at 3:27 PM, Xavier Noria wrote:
> Let's suppose we have a regular search field body_t, and an internal
> boolean flag flag_t not exposed to the user.
>
> I'd like
>
> body_t:foo AND flag_t:true

This is Solr, right? Why don't you use a filter query for your unexposed flag_t field:

q=body_t:foo&fq=flag_t:true

This might help too: http://wiki.apache.org/solr/CommonQueryParameters#fq

simon

> to be an intersection, but if "foo" is a stopword I get all documents
> for which flag_t is true, as if the first clause was dropped, or as if
> technically all documents match an empty string.
>
> Is there a way to get 0 results instead?
Re: stopwords in AND clauses
On Mon, Sep 13, 2010 at 4:29 PM, Simon Willnauer wrote:
> On Mon, Sep 13, 2010 at 3:27 PM, Xavier Noria wrote:
>> Let's suppose we have a regular search field body_t, and an internal
>> boolean flag flag_t not exposed to the user.
>>
>> I'd like
>>
>> body_t:foo AND flag_t:true
>
> This is Solr, right? Why don't you use a filter query for your unexposed
> flag_t field: q=body_t:foo&fq=flag_t:true
> This might help too: http://wiki.apache.org/solr/CommonQueryParameters#fq

Sounds good.
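The same filter-query approach from SolrJ, as a sketch (the server URL is assumed). Because the stopword-prone part stays in q, a stopword-only query should now match nothing instead of returning every flagged document:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FilterQueryDemo {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery();
            query.setQuery("body_t:foo");        // the user-facing part of the query
            query.addFilterQuery("flag_t:true"); // internal flag, applied and cached as a filter
            QueryResponse rsp = server.query(query);
            System.out.println(rsp.getResults().getNumFound() + " hits");
        }
    }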
Re: mm=0?
Hi Erick,

I completely agree with you that showing a random document for a user's query would be a very poor experience. I have raised this in our product review meetings before. I was told that, because of a contractual agreement, some sponsored content needs to be returned even if it means no match. And the sponsored content drives the ads displayed on the page - so it is more about showing some ad on the page when there is no matching result from sponsored content for the user's query.

Note that some other content in addition to sponsored content is displayed on the page, so the user is not seeing just one random result when there is not a good match.

It looks like I have to do another search to get a random result when there are no results. In this case I will use RandomSortField to generate a random result (so that a different ad is displayed from the set of sponsored ads) for each no-result case.

Thanks for the comments!

Satish

On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson wrote:

> Could you explain the use-case a bit? Because the very
> first response I would have is "why in the world did
> product management make this a requirement" and try
> to get the requirement changed.
>
> As a user, I'm having a hard time imagining being well
> served by getting a document in response to a search that
> had no relation to my search, it was just a random doc
> selected from the corpus.
>
> All that said, I don't think a single query would do the trick.
> You could include a "very special" document with a field
> that no other document had, with very special text in it. Say
> field name "bogusmatch", filled with the text "bogustext";
> then, at least, the second query would match one and only
> one document and would take minimal time. Or you could
> tack on to each and every query "OR bogusmatch:bogustext^0.001"
> (which would really be inexpensive) and filter it out if there
> was more than one response. By boosting it really low, it should
> always appear at the end of the list, which wouldn't be a bad thing.
>
> DisMax might help you here...
>
> But do ask if it is really a requirement or just something nobody's
> objected to before bothering, IMO...
>
> Best
> Erick
>
> On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar wrote:
>
>> [...]
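A sketch of that two-search fallback in SolrJ. The server URL is a placeholder, and the random field assumes the schema maps a random_* dynamic field to a solr.RandomSortField type:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class RandomFallback {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            // First pass: the user's actual query.
            SolrQuery query = new SolrQuery("alpha");
            QueryResponse rsp = server.query(query);
            if (rsp.getResults().getNumFound() == 0) {
                // Second pass: match everything, sorted by a random field so a
                // different sponsored document comes back on each miss.
                SolrQuery fallback = new SolrQuery("*:*");
                fallback.addSortField("random_" + System.currentTimeMillis(),
                        SolrQuery.ORDER.asc);
                fallback.setRows(1);
                rsp = server.query(fallback);
            }
            System.out.println(rsp.getResults());
        }
    }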
Re: Sorting not working on a string field
You're right, it would be better to just give it a sortable numeric value. For now I gave time_code an sdouble type to see if it sorted, and it did. However, all the trailing 0's are trimmed; that shouldn't be a problem unless it were to truncate values past the hundreds column.

Thanks.
- Noel

-----Original Message-----
From: "Jan Høydahl / Cominvent"
Sent: Monday, September 13, 2010 5:31am
To: solr-user@lucene.apache.org
Subject: Re: Sorting not working on a string field

Hi,

Can you show us what result you actually get?
[...]
Re: Multiple sorting on text fields
I thought I saw 'custom analyzer', but you wrote 'custom field'. My mistake.

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Mon, 9/13/10, Stanislaw wrote:

> From: Stanislaw
> Subject: Re: Multiple sorting on text fields
> To: solr-user@lucene.apache.org
> Date: Monday, September 13, 2010, 2:56 AM
> [...]
Re: mm=0?
This issue is one I hope to head off in my application / on my site. Instead of an ad feed, I HOPE to be able to have an ad QUEUE on my site. If necessary, I'll convert the feed TO a queue.

The queue will get a first pass done on it by either an employee or a compensated user. Either one generates up to 4 keywords/tags for the advertisement. THEY determine when the ad gets shown, based on relevancy.

Nice idea, hope it'll fly :-)

I actually detest the ads that say 'Lucene instance for sale, lowest prices!', or the industrial clearing houses that make you wade through 4-6 screens to find that you need a membership in order to look up the price of some stainless steel nuts. And usually, those ads must be paying top dollar, because they are the first three ads on Google's search (that is, until recently). Anyone notice that there are hardly any more ads on Google search results?

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Mon, 9/13/10, Satish Kumar wrote:

> From: Satish Kumar
> Subject: Re: mm=0?
> To: solr-user@lucene.apache.org
> Date: Monday, September 13, 2010, 7:41 AM
> [...]
Re: mm=0?
I just tried several searches again on Google.

I think they've refined the ad placements so that certain kinds of searches return no ads, the kinds that I've been doing relative to programming being one of them.

If, OTOH, I do some product-related search, THEN lots of ads show up, but fairly accurate ones.

They've improved the ad placement a LOT!

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Mon, 9/13/10, Satish Kumar wrote:

> From: Satish Kumar
> Subject: Re: mm=0?
> To: solr-user@lucene.apache.org
> Date: Monday, September 13, 2010, 7:41 AM
> [...]
Re: Tuning Solr caches with high commit rates (NRT)
Thanks guys for the explanation. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Mon, 9/13/10, Simon Willnauer wrote: > From: Simon Willnauer > Subject: Re: Tuning Solr caches with high commit rates (NRT) > To: solr-user@lucene.apache.org > Date: Monday, September 13, 2010, 1:33 AM > On Mon, Sep 13, 2010 at 8:02 AM, > Dennis Gearon > wrote: > > BTW, what is a segment? > > On the Lucene level an index is composed of one or more > index > segments. Each segment is an index by itself and consists > of several > files like doc stores, proximity data, term dictionaries > etc. During > indexing Lucene / Solr creates those segments depending on > ram buffer > / document buffer settings and flushes them to disk (if you > index to > disk). Once a segment has been flushed Lucene will never > change the > segments (well up to a certain level - lets keep this > simple) but > write new ones for new added documents. Since segments have > a > write-once policy Lucene merges multiple segments into a > new segment > (how and when this happens is different story) from time to > time to > get rid of deleted documents and to reduce the number of > overall > segments in the index. > Generally a higher number of segments will also influence > you search > performance since Lucene performs almost all operations on > a > per-segment level. If you want to reduce the number of > segment to one > you need to call optimize and lucene will merge all > existing ones into > one single segment. > > hope that answers your question > > simon > > > > I've only heard about them in the last 2 weeks here on > the list. > > Dennis Gearon > > > > Signature Warning > > > > EARTH has a Right To Life, > > otherwise we all die. > > > > Read 'Hot, Flat, and Crowded' > > Laugh at http://www.yert.com/film.php > > > > > > --- On Sun, 9/12/10, Jason Rutherglen > wrote: > > > >> From: Jason Rutherglen > >> Subject: Re: Tuning Solr caches with high commit > rates (NRT) > >> To: solr-user@lucene.apache.org > >> Date: Sunday, September 12, 2010, 7:52 PM > >> Yeah there's no patch... I think > >> Yonik can write it. :-) Yah... The > >> Lucene version shouldn't matter. The > distributed > >> faceting > >> theoretically can easily be applied to multiple > segments, > >> however the > >> way it's written for me is a challenge to untangle > and > >> apply > >> successfully to a working patch. Also I don't > have > >> this as an itch to > >> scratch at the moment. > >> > >> On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge > > >> wrote: > >> > Hi Jason, > >> > > >> > I've tried some limited testing with the 4.x > trunk > >> using fcs, and I > >> > must say, I really like the idea of > per-segment > >> faceting. > >> > I was hoping to see it in 3.x, but I don't > see this > >> option in the > >> > branch_3x trunk. Is your SOLR-1606 patch > referred to > >> in SOLR-1617 the > >> > one to use with 3.1? > >> > There seems to be a number of Solr issues > tied to this > >> - one of them > >> > being Lucene-1785. Can the per-segment > faceting patch > >> work with Lucene > >> > 2.9/branch_3x? > >> > > >> > Thanks, > >> > Peter > >> > > >> > > >> > > >> > On Mon, Sep 13, 2010 at 12:05 AM, Jason > Rutherglen > >> > > >> wrote: > >> >> Peter, > >> >> > >> >> Are you using per-segment faceting, eg, > SOLR-1617? > >> That could help > >> >> your situation. 
> >> >> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge wrote:
> >> >>> [Peter's original cache-tuning notes snipped -- quoted in full earlier in the thread]
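A footnote for readers tuning this: the merge behaviour Simon describes is controlled from solrconfig.xml. As a rough sketch (element names as in the stock 1.4 example config; the values are the illustrative defaults, not recommendations):

<indexDefaults>
  <!-- number of segments that accumulate before a merge is triggered;
       higher = faster indexing, but more files per index -->
  <mergeFactor>10</mergeFactor>
  <!-- RAM used to buffer documents before flushing a new segment -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>

Merging everything down to a single segment is the optimize case, e.g. posting <optimize/> to the /update handler.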
Re: what differents between SolrCloud and Solr+Hadoop
You do not need either addition if you just want to have multiple Solr instances on different machines, and query them all at once. Look at this for the simplest way: http://wiki.apache.org/solr/DistributedSearch On Mon, Sep 13, 2010 at 12:52 AM, Marc Sturlese wrote: > > Well these are pretty different things. SolrCloud is meant to handle > distributed search in a more easy way that "raw" solr distributed search. > You have to build the shards in your own way. > Solr+hadoop is a way to build these shards/indexes in paralel. > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/what-differents-between-SolrCloud-and-Solr-Hadoop-tp1463809p1464106.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Lance Norskog goks...@gmail.com
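To make the simplest case concrete, a sketch in SolrJ (host names are made up; every instance must share the same schema, and each document should live in exactly one shard):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://search1:8983/solr");
SolrQuery query = new SolrQuery("ipod");
// fan the query out to all instances; results are merged by the receiving node
query.set("shards", "search1:8983/solr,search2:8983/solr");
QueryResponse rsp = server.query(query);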
Re: mm=0?
"Java Swing" no longer gives ads for "swinger's clubs". On Mon, Sep 13, 2010 at 9:37 AM, Dennis Gearon wrote: > I just tried several searches again on google. > > I think they've refined the ads placements so that certain kind of searches > return no ads, the kinds that I've been doing relative to programming being > one of them. > > If OTOH I do some product related search, THEN lots of ads show up, but > fairly accurate ones. > > They've immproved the ads placement a LOT! > > Dennis Gearon > > Signature Warning > > EARTH has a Right To Life, > otherwise we all die. > > Read 'Hot, Flat, and Crowded' > Laugh at http://www.yert.com/film.php > > > --- On Mon, 9/13/10, Satish Kumar wrote: > >> From: Satish Kumar >> Subject: Re: mm=0? >> To: solr-user@lucene.apache.org >> Date: Monday, September 13, 2010, 7:41 AM >> Hi Erik, >> >> I completely agree with you that showing a random document >> for user's query >> would be very poor experience. I have raised this in our >> product review >> meetings before. I was told that because of contractual >> agreement some >> sponsored content needs to be returned even if it meant no >> match. And the >> sponsored content drives the ads displayed on the page-- so >> it is more for >> showing some ad on the page when there is no matching >> result from sponsored >> content for user's query. >> >> Note that some other content in addition to sponsored >> content is displayed >> on the page, so user is not seeing just one random result >> when there is not >> a good match. >> >> It looks like I have to do another search to get a random >> result when there >> are no results. In this case I will use RandomSortField to >> generate random >> result (so that a different ad is displayed from set of >> sponsored ads) for >> each no result case. >> >> Thanks for the comments! >> >> >> Satish >> >> >> >> On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson >> wrote: >> >> > Could you explain the use-case a bit? Because the >> very >> > first response I would have is "why in the world did >> > product management make this a requirement" and try >> > to get the requirement changed >> > >> > As a user, I'm having a hard time imagining being >> well >> > served by getting a document in response to a search >> that >> > had no relation to my search, it was just a random >> doc >> > selected from the corpus. >> > >> > All that said, I don't think a single query would do >> the trick. >> > You could include a "very special" document with a >> field >> > that no other document had with very special text in >> it. Say >> > field name "bogusmatch", filled with the text >> "bogustext" >> > then, at least the second query would match one and >> only >> > one document and would take minimal time. Or you >> could >> > tack on to each and every query "OR >> bogusmatch:bogustext^0.001" >> > (which would really be inexpensive) and filter it out >> if there >> > was more than one response. By boosting it really low, >> it should >> > always appear at the end of the list which wouldn't be >> a bad thing. >> > >> > DisMax might help you here... >> > >> > But do ask if it is really a requirement or just >> something nobody's >> > objected to before bothering IMO... >> > >> > Best >> > Erick >> > >> > On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar < >> > satish.kumar.just.d...@gmail.com> >> wrote: >> > >> > > Hi, >> > > >> > > We have a requirement to show at least one result >> every time -- i.e., >> > even >> > > if user entered term is not found in any of the >> documents. 
I was hoping >> > > setting mm to 0 will return results in all cases, >> but it is not. >> > > >> > > For example, if user entered term "alpha" and it >> is *not* in any of the >> > > documents in the index, any document in the index >> can be returned. If >> > term >> > > "alpha" is in the document set, documents having >> the term "alpha" only >> > must >> > > be returned. >> > > >> > > My idea so far is to perform a search using user >> entered term. If there >> > are >> > > any results, return them. If there are no >> results, perform another search >> > > without the query term-- this means doing two >> searches. Any suggestions >> > on >> > > implementing this requirement using only one >> search? >> > > >> > > >> > > Thanks, >> > > Satish >> > > >> > >> > -- Lance Norskog goks...@gmail.com
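For anyone finding this thread later, the fallback Satish settles on can be sketched with the random_* dynamic field from the example schema (this assumes your schema keeps the solr.RandomSortField mapping for random_*; the "is_sponsored" query below is an invented placeholder for however you mark sponsored docs):

// second query, issued only when the real query returned zero hits
SolrQuery fallback = new SolrQuery("is_sponsored:true");  // placeholder field
// a different seed suffix each time gives a different pseudo-random ordering
fallback.addSortField("random_" + System.currentTimeMillis(), SolrQuery.ORDER.asc);
fallback.setRows(1);
QueryResponse ad = server.query(fallback);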
Re: mm=0?
On Mon, Sep 13, 2010 at 8:07 PM, Lance Norskog wrote:
> "Java Swing" no longer gives ads for "swinger's clubs".

damned no i have to explicitly enter it?! - argh! :)

simon

> [earlier messages in this thread snipped -- quoted in full above]
>
> --
> Lance Norskog
> goks...@gmail.com
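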
Re: How to Update Value of One Field of a Document in Index?
Hi Savannah, if you *only want to boost* documents based on the information you calculate from the MoreLikeThis results (i.e. numeric measure), you might want to take a look at the ExternalFileField type. This field type reads its contents from a file which contains key-value pairs, e.g. the document ids and the corresponding measure values, resp. If some values change you still have to regenerate the whole file (instead of the whole index). But of course, this file can be generated from a DB, which might be updated incrementally. For setup and usage e.g. see: http://dev.tailsweep.com/solr-external-scoring/ Zachary On 10.09.2010 19:57, Savannah Beckett wrote: I want to do MoreLikeThis to find documents that are similar to the document that I am indexing. Then I want to calculate the average of one of the fields of all those documents and input this average into a field of the document that I am indexing. From my research, it seems that MoreLikeThis can only be used to find similarity of document that is already in the index. So, I think I need to index it first, and then use MoreLikeThis to find similar documents in the index and then reindex that document. Any better way? I try not to reindex a document because it's not efficient. I don't have to use MoreLikeThis. Thanks. From: Jonathan Rochkind To: "solr-user@lucene.apache.org" Sent: Fri, September 10, 2010 9:58:20 AM Subject: RE: How to Update Value of One Field of a Document in Index? "More like this" is intended to be run at query time. For what reasons are you thinking you want to (re-)index each document based on the results of MoreLikeThis? You're right that that's not what the component is intended for. Jonathan From: Savannah Beckett [savannah_becket...@yahoo.com] Sent: Friday, September 10, 2010 11:18 AM To: solr-user@lucene.apache.org Subject: Re: How to Update Value of One Field of a Document in Index? Thanks. I am trying to use MoreLikeThis in Solr to find similar documents in the solr index and use the data from these similar documents to modify a field in each document that I am indexing. I found that MoreLikeThis in Solr only works when the document is in the index, is it true? If so, I may have to wait til the indexing is finished, then run my own command to do MoreLikeThis to each document in the index, and then reindex each document? It sounds like it's not efficient. Is there a better way? Thanks. From: Liam O'Boyle To: solr-user@lucene.apache.org Cc: u...@nutch.apache.org Sent: Thu, September 9, 2010 11:06:36 PM Subject: Re: How to Update Value of One Field of a Document in Index? Hi Savannah, You can only reindex the entire document; if you only have the ID, then do a search to retrieve the rest of the data, then reindex. This assumes that all of the fields you need to index are stored (so that you can retrieve them) and not just indexed. Liam On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett wrote: I use nutch to crawl and index to Solr. My code is working. Now, I want to update the value of one of the fields of a document in the solr index after the document was already indexed, and I have only the document id. How do I do that? Thanks. __ Do You Yahoo!? Sie sind Spam leid? Yahoo! Mail verfügt über einen herausragenden Schutz gegen Massenmails. http://mail.yahoo.com
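To make the ExternalFileField suggestion concrete, a sketch (type and field names are invented for the example; see the linked article for details). In schema.xml:

<fieldType name="extMeasure" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="avgMeasure" type="extMeasure" indexed="false" stored="false"/>

The values then live in a file named external_avgMeasure in the index data directory, one key=value pair per line, regenerated (e.g. from your DB) whenever the measures change:

doc1=0.42
doc2=1.37

At query time the field can feed a boost function, e.g. a _val_:"avgMeasure" clause, rather than being searched or sorted on directly.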
RE: Solr memory use, jmap and TermInfos/tii
Thanks Robert and everyone! I'm working on changing our JVM settings today, since putting Solr 1.4.1 into production will take a bit more work and testing. Hopefully, I'll be able to test the setTermIndexDivisor on our test server tomorrow. Mike, I've started the process to see if we can provide you with our tii/tis data. I'll let you know as soon as I hear anything. Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Sunday, September 12, 2010 10:48 AM
To: solr-user@lucene.apache.org; simon.willna...@gmail.com
Subject: Re: Solr memory use, jmap and TermInfos/tii

On Sun, Sep 12, 2010 at 9:57 AM, Simon Willnauer <simon.willna...@googlemail.com> wrote:
> > To change the divisor in your solrconfig, for example to 4, it looks like
> > you need to do this:
> >
> > <indexReaderFactory name="IndexReaderFactory"
> >     class="org.apache.solr.core.StandardIndexReaderFactory">
> >   <int name="termIndexDivisor">4</int>
> > </indexReaderFactory>
>
> Ah, thanks robert! I didn't know about that one either!
>
> simon

actually I'm wrong, for solr 1.4, use "setTermIndexDivisor". i was looking at 3.1/trunk and there is a bug in the name of this parameter: https://issues.apache.org/jira/browse/SOLR-2118

--
Robert Muir
rcm...@gmail.com
RE: Solr and jvm Garbage Collection tuning
Thanks Kent for your info.

We are not doing any faceting, sorting, or much else. My guess is that most of the memory increase is just the data structures created when parts of the frq and prx files get read into memory. Our frq files are about 77GB and the prx files are about 260GB per shard, and we are running 3 shards per machine. I suspect that the document cache and query result cache don't take up that much space, but will try a run with those caches set to 0, just to see.

We have dual 4-core processors and 74GB total memory. We want to leave a significant amount of memory free for OS disk caching.

We tried increasing the memory from 20GB to 28GB and adding the -XX:MaxGCPauseMillis=1000 flag, but that seemed to have no effect.

Currently I'm testing using the ConcurrentMarkSweep collector and that's looking much better, although I don't understand why it has sized the Eden space down into the 20MB range. However, I am very new to Java memory management.

Anyone know if, when using ConcurrentMarkSweep, it's better to let the JVM size the Eden space or better to give it some hints?

Once we get some decent JVM settings we can put into production, I'll be testing using termIndexInterval with Solr 1.4.1 on our test server.

Tom

-Original Message-
From: Grant Ingersoll [mailto:gsing...@apache.org]
> What are your current GC settings? Also, I guess I'd look at ways you can reduce the heap size needed: caching, field type choices, faceting choices.
> Also could try playing with the termIndexInterval which will load fewer terms into memory at the cost of longer seeks.
> At some point, though, you just may need more shards and the resulting smaller indexes. How many CPU cores do you have on each machine?
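For anyone following along, the CMS experiment described here is typically expressed with flags along these lines (an unverified starting point -- heap and young-gen sizes are placeholders that must be tuned against your own GC logs, and start.jar stands in for whatever container launch you use):

java -Xms28g -Xmx28g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:NewSize=2g -XX:MaxNewSize=2g \
     -XX:CMSInitiatingOccupancyFraction=75 \
     -jar start.jar

Pinning NewSize/MaxNewSize is the usual way to keep CMS from shrinking Eden on its own, which speaks to the question about giving the JVM hints.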
Field names
Hi,

Is it possible to issue a query to solr, to get a list which contains all the field names in the index?

What about getting a list of the frequency of individual words in each field?

thanks,
Peter
Re: How to extend IndexSchema and SchemaField
: Yes, I have thought of that, or even extending field type. But this does not
: work for my use case, since I can have multiple fields of a same type
: (therefore with the same field type, and same analyzer), but each one of them
: needs specific information. Therefore, I think the only "nice" way to achieve
: this is to have the possibility to add attributes to any field definition.

Right, at the moment custom FieldType classes can specify whatever attributes they want to use in the <fieldtype> declaration -- but it's not possible to specify arbitrary attributes that can be used in the <field> declaration. By all means, please open an issue requesting this as a feature.

I don't know that anyone explicitly set out to impose this limitation, but one of the reasons it likely exists is because SchemaField is not something that is intended to be customized -- while FieldType objects are constructed once at startup, SchemaField objects are frequently created on the fly when dealing with dynamicFields, so initialization complexity is kept to a minimum.

That said -- this definitely seems like the type of usecase that we should try to find *some* solution for -- even if it just means having Solr automatically create hidden FieldType instances for you on startup, based on attributes specified in the <field> that the corresponding FieldType class understands.

-Hoss

-- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
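For the archives, the existing workaround Hoss describes -- custom attributes on the <fieldtype> rather than the <field> -- looks roughly like this (class and attribute names are invented for the sketch):

import java.util.Map;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.TextField;

public class MyFieldType extends TextField {
  private String myAttr;

  @Override
  protected void init(IndexSchema schema, Map<String,String> args) {
    // consume the custom attribute before calling super.init(),
    // which complains about any args it does not recognize
    myAttr = args.remove("myAttr");
    super.init(schema, args);
  }
}

wired up in schema.xml as <fieldtype name="mytype" class="com.example.MyFieldType" myAttr="whatever"/>. As the thread notes, this only gets you per-type attributes, not per-field ones.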
Re: Need Advice for Finding Freelance Solr Expert
: References:
:  <4c881061.60...@jhu.edu>
: In-Reply-To:
: Subject: Need Advice for Finding Freelance Solr Expert

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is "hidden" in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-Hoss

-- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
Re: Field names
check:
http://wiki.apache.org/solr/LukeRequestHandler

On Mon, Sep 13, 2010 at 7:00 PM, Peter A. Kirk wrote:
> Hi
>
> is it possible to issue a query to solr, to get a list which contains all the field names in the index?
>
> What about getting a list of the frequency of individual words in each field?
>
> thanks,
> Peter
RE: Field names
Fantastic - that is exactly what I was looking for!

But here is one thing I don't understand:

If I call the url:
http://localhost:8983/solr/admin/luke?numTerms=10&fl=name

Some of the result looks like:

<int name="gb">18</int>

Does this mean that the term "gb" occurs 18 times in the name field?

Because if I issue this search:
http://localhost:8983/solr/select/?q=name:gb

I get results like:

<result name="response" numFound="9" start="0">

So it only finds 9?

What do the above results actually tell me?

Thanks,
Peter

From: Ryan McKinley [ryan...@gmail.com]
Sent: Tuesday, 14 September 2010 11:30
To: solr-user@lucene.apache.org
Subject: Re: Field names

check:
http://wiki.apache.org/solr/LukeRequestHandler

On Mon, Sep 13, 2010 at 7:00 PM, Peter A. Kirk wrote:
> Hi
>
> is it possible to issue a query to solr, to get a list which contains all the field names in the index?
>
> What about getting a list of the frequency of individual words in each field?
>
> thanks,
> Peter
Re: Solr memory use, jmap and TermInfos/tii
On Mon, Sep 13, 2010 at 6:29 PM, Burton-West, Tom wrote: > Thanks Robert and everyone! > > I'm working on changing our JVM settings today, since putting Solr 1.4.1 into > production will take a bit more work and testing. Hopefully, I'll be able to > test the setTermIndexDivisor on our test server tomorrow. > > Mike, I've started the process to see if we can provide you with our tii/tis > data. I'll let you know as soon as I hear anything. Super, thanks Tom! Mike
Re: Distance sorting with spatial filtering
I tracked down the problem and found a workaround. If there is a wildcard entry in schema.xml such as the following:

<dynamicField name="*" type="ignored" multiValued="true" />

then sort by function fails and returns Error 400 can not sort on unindexed field: dist(2,latitude,longitude,0,0). Removing the name="*" entry from schema.xml is a workaround. I noted this in the SOLR-1297 JIRA entry.

Scott

On Fri, Sep 10, 2010 at 01:40, Lance Norskog wrote:
> Since no one has jumped in to give the right syntax- yeah, it's a bug.
> Please file a JIRA.
>
> On Thu, Sep 9, 2010 at 9:44 PM, Scott K wrote:
>> On Thu, Sep 9, 2010 at 21:00, Lance Norskog wrote:
>>> I just checked out the trunk, and branch 3.x. This query is accepted on both, but gives no responses:
>>> http://localhost:8983/solr/select/?q=*:*&sort=dist(2,x_dt,y_dt,0,0)+asc
>>
>> So you are saying when you add the sort parameter you get no results back, but do not get the error I am seeing? Should I open a Jira ticket?
>>
>>> x_dt and y_dt are wildcard fields with the tdouble type. "tdouble" explicitly says it is stored and indexed. Your 'longitude' and 'latitude' fields may not be stored?
>>
>> No, they are stored.
>> http://localhost:8983/solr/select?q=*:*&rows=1&wt=xml&indent=true
>>
>> <response>
>>   <lst name="responseHeader">
>>     <int name="status">0</int>
>>     <int name="QTime">9</int>
>>   </lst>
>>   ...
>>   <double name="latitude">47.6636</double>
>>   <double name="longitude">-122.3054</double>
>> </response>
>>
>>> Also, this is accepted on both branches:
>>> http://localhost:8983/solr/select/?q=*:*&sort=sum(1)+asc
>>>
>>> The documentation for sum() does not mention single-argument calls.
>>
>> This also fails
>> http://localhost:8983/solr/select/?q=*:*&sort=sum(1,2)+asc
>> http://localhost:8983/solr/select/?q=*:*&sort=sum(latitude,longitude)+asc
>>
>>> Scott K wrote: According to the documentation, sorting by function has been a feature since Solr 1.5. It seems like a major regression if this no longer works. http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function The _val_ trick does not seem to work if used with a query term, although I can try some more things to give 0 value to the query term. On Wed, Sep 8, 2010 at 22:21, Lance Norskog wrote: > > It says that the field "sum(1)" is not indexed. You don't have a field > called 'sum(1)'. I know there have been a lot of changes in query parsing, > and sorting by functions may be on the list. But the _val_ trick is the > older one and, as you noted, still works. The _val_ trick sets the > ranking value to the output of the function, thus indirectly doing what sort= > does. > > Lance > > Scott K wrote: > >> >> I get the error on all functions. >> GET 'http://localhost:8983/solr/select?q=*:*&sort=sum(1)+asc' >> Error 400 can not sort on unindexed field: sum(1) >> >> I tried another nightly build from today, Sep 7th, with the same >> results. I attached the schema.xml >> >> Thanks for the help! >> Scott >> >> On Wed, Sep 1, 2010 at 18:43, Lance Norskog wrote: >> >>> Post your schema. >>> >>> On Mon, Aug 30, 2010 at 2:04 PM, Scott K wrote: >>> The new spatial filtering (SOLR-1586) works great and is much faster than fq={!frange. However, I am having problems sorting by distance. If I try GET 'http://localhost:8983/solr/select/?q=*:*&sort=dist(2,latitude,longitude,0,0)+asc' I get an error: Error 400 can not sort on unindexed field: dist(2,latitude,longitude,0,0) I was able to work around this with GET 'http://localhost:8983/solr/select/?q=*:* AND _val_:"recip(dist(2, latitude, longitude, 0,0),1,1,1)"&fl=*,score' But why isn't sorting by functions working? I get this error with any function I try to sort on. This is a nightly trunk build from Aug 25th.
I see SOLR-1297 was reopened, but that seems to be for edge cases. Second question: I am using the LatLonType from the Spatial Filtering wiki, http://wiki.apache.org/solr/SpatialSearch Are there any distance sorting functions that use this field, or do I need to have three indexed fields, store_lat_lon, latitude, and longitude, if I want both filtering and sorting by distance. Thanks, Scott >>> >>> -- >>> Lance Norskog >>> goks...@gmail.com >>> >>> >>> > > >>> >> > > > > -- > Lance Norskog > goks...@gmail.com >
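Until SOLR-1297 is sorted out, Scott's _val_ workaround translated to SolrJ might look like this (an untested sketch; field names as in his schema):

SolrQuery q = new SolrQuery(
    "*:* AND _val_:\"recip(dist(2,latitude,longitude,0,0),1,1,1)\"");
q.setFields("*", "score");  // expose the function value via the score
QueryResponse rsp = server.query(q);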
Re: Solr and jvm Garbage Collection tuning
On Mon, Sep 13, 2010 at 6:45 PM, Burton-West, Tom wrote:
> Anyone know if, when using ConcurrentMarkSweep, it's better to let the JVM size the Eden space or better to give it some hints?
> [rest of Tom's message snipped -- quoted in full above]

Really the best thing to do is to run the system for a while with GC logging on and then look at how often the young generation GC is occurring. A set of parameters like:

-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

should give you some indication how often the young gen GC is occurring. If it's often, you can try increasing the size of the young generation. The option:

-Xloggc:<file>

will dump this information to the specified file rather than sending it to the standard error.

I've done this a few times with a variety of systems: sometimes you want to make the young gen bigger and sometimes you don't.

Steve

--
Stephen Green
http://thesearchguy.wordpress.com
geographic sharding . . . or not
Think about THE big one - google. (First, China for this example is avoided because much Chinese data is ILLEGAL to be provided for search outside of China.) If there is data generated by people in Europe, in various languages:

1/ Is it stored close to where it is generated?
2/ Are sharding and replication also close to where it is generated?
3/ How accessible IS that data to someone from the US who speaks one of those languages?
4/ How much is sharding and replication done AWAY from where data is geographically generated?
5/ What if a set of linked documents, from a relational database, has half of its documents in one language AND related to people/places/or things in one country, and half in another country and its language, with a parent record for the two sets in the country of the user originating the parent/dual sets?
A/ Is the parent record replicated in both countries, so that searches finding the child records can easily get to the parent record, vs transatlantic/pacific fetches?
B/ Any thoughts about machine translation of said parent record?

What are people's thoughts on making sites that cater to people interested in web pages, etc in other countries? Any examples out there?

Dennis Gearon

Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php
Our SOLR instance seems to be single-threading and therefore not taking advantage of its multi-proc host
We are running SOLR 1.4.1 (Lucene 2.9.3) on a 2-CPU Linux host, but it seems that only 1 CPU is ever being used. It almost seems like something is single-threading inside the SOLR application. The CPU utilization is very seldom over 0.9 even under load. We are running on virtual Linux hosts and our other apps in the same cluster are multi-threading w/o issue. Some more info on our stack and versions: Linux 2.6.16.33-xenU Apache 2.2.3 Tomcat 6.0.16 Java SE Runtime Environment (build 1.6.0_10-ea-b11) Has anyone else noticed this problem? Might there be some SOLR config aspect to enable multi-threading that we're missing? Any suggestions for troubleshooting? Judging by SOLR's logs, we do see that multiple requests are processing simultaneously inside SOLR so we do not believe we're sequentially feeding requests to SOLR, ie. bottle-necking things outside of SOLR. Thanks, David Crane -- View this message in context: http://lucene.472066.n3.nabble.com/Our-SOLR-instance-seems-to-be-single-threading-and-therefore-not-taking-advantage-of-its-multi-proc-t-tp1470282p1470282.html Sent from the Solr - User mailing list archive at Nabble.com.
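Not an answer, but two standard (non-Solr-specific) checks that may narrow this down; <pid> is a placeholder for your Tomcat process id:

top -H -p <pid>              # per-thread CPU view: is a single thread pegged?
jstack <pid> > threads.txt   # thread dump: where do request threads block?

If the dump shows many threads waiting on one lock, that points at serialization somewhere in the stack rather than a CPU-count problem.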
Re: Field names
On Tue, Sep 14, 2010 at 1:39 AM, Peter A. Kirk wrote:
> Fantastic - that is exactly what I was looking for!
>
> But here is one thing I don't understand:
>
> If I call the url:
> http://localhost:8983/solr/admin/luke?numTerms=10&fl=name
>
> Some of the result looks like:
>
> <int name="gb">18</int>
>
> Does this mean that the term "gb" occurs 18 times in the name field?

Yes, that is the doc frequency of the term "gb". Remember that deleted / updated documents and their terms contribute to the doc frequency until they are expunged from the index. That either happens through a segment merge in the background or due to an explicit call to optimize.

>
> Because if I issue this search:
> http://localhost:8983/solr/select/?q=name:gb
>
> I get results like:
>
> <result name="response" numFound="9" start="0">
>
> So it only finds 9?

Since the "gb" term shows 18 occurrences throughout the index, I suspect you updated your docs once without optimizing, or without indexing enough docs for segments to get merged. Try to call optimize if you can afford it and see if the doc-freq count goes back to 9.

simon

>
> What do the above results actually tell me?
>
> Thanks,
> Peter
>
> [earlier messages in this thread snipped]
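For reference, the optimize Simon suggests, via SolrJ (an explicit <optimize/> POST to /update does the same):

// merges all segments and expunges deleted docs, so doc
// frequencies again reflect only live documents
server.optimize();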
Spell checking and keyword tokenizer
Hi,

I'm trying to spell check a whole field using a lowercasing keyword tokenizer [1].

For example if I query for "furntree gully" I'm hoping to get back "ferntree gully" as a suggestion. Unfortunately the spell checker seems to be recognizing this as two tokens and returning suggestions for both. Query [2] and result [3] below. In this case ferntree actually does end up with ferntree gully as a suggestion however it also gives bulla as a suggestion for gully (go figure :-) ).

Any suggestions?

Regards,

Glen

[1] -

<fieldType name="lowercase_keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

[2] - Query

q=locality_lc%3A%22furntree+gully%22&spellcheck=true&spellcheck.build=true&spellcheck.reload=true&spellcheck.accuracy=0.5&spellcheck.dictionary=locality_spellchecker&spellcheck.collate=true&fl=street_name%2Clocality%2Cstate

[3] -

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">379</int>
    <lst name="params">
      <str name="spellcheck">true</str>
      <str name="fl">street_name,locality,state</str>
      <str name="spellcheck.accuracy">0.5</str>
      <str name="q">locality_lc:"furntree gully"</str>
      <str name="spellcheck.dictionary">locality_spellchecker</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.reload">true</str>
      <str name="spellcheck.build">true</str>
    </lst>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="furntree">
        <int name="numFound">1</int>
        <int name="startOffset">13</int>
        <int name="endOffset">21</int>
        <arr name="suggestion">
          <str>ferntree gully</str>
        </arr>
      </lst>
      <lst name="gully">
        <int name="numFound">1</int>
        <int name="startOffset">22</int>
        <int name="endOffset">27</int>
        <arr name="suggestion">
          <str>bulla</str>
        </arr>
      </lst>
      <str name="collation">locality_lc:"ferntree gully bulla"</str>
    </lst>
  </lst>
</response>
Re: Spell checking and keyword tokenizer
Nevermind this one... With a bit more research I discovered I can use spellcheck.q to provide the correct suggestion.

On 14 September 2010 16:02, Glen Stampoultzis wrote:
> Hi,
>
> I'm trying to spell check a whole field using a lowercasing keyword tokenizer [1].
>
> For example if I query for "furntree gully" I'm hoping to get back "ferntree gully" as a suggestion. Unfortunately the spell checker seems to be recognizing this as two tokens and returning suggestions for both.
>
> [rest of the original message snipped -- quoted in full above]
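For completeness, the spellcheck.q variant sketched with SolrJ (in URL form it is just an extra &spellcheck.q=furntree+gully; spellcheck.q is analyzed with the spellcheck field's own analyzer, which is what makes the keyword tokenizer kick in):

SolrQuery q = new SolrQuery("locality_lc:\"furntree gully\"");
q.set("spellcheck", true);
// checked as one token by the field's keyword tokenizer
q.set("spellcheck.q", "furntree gully");
q.set("spellcheck.dictionary", "locality_spellchecker");
QueryResponse rsp = server.query(q);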