Multiple custom Similarity implementations
Hi,

We have a requirement to run an A/B test over multiple Similarity implementations. Is it possible to define multiple similarity tags in schema.xml and choose one via a URL parameter? We are using Solr 4.7.

Currently we plan to have different cores, each with a different similarity configured, and split traffic based on core names. This leads to index duplication and unnecessary resource usage.

Any help is highly appreciated.

Parvesh Garg,
http://www.zettata.com
Re: Multiple custom Similarity implementations
Thanks Markus. We will look at other options. May I ask what the reasons are for never supporting this?

Parvesh Garg,
http://www.zettata.com

On Tue, Mar 8, 2016 at 8:59 PM, Markus Jelsma wrote:
> Hello, you cannot change similarities per request, and this is likely
> never going to be supported, for good reasons. You need multiple cores, or
> multiple fields with different similarity defined in the same core.
> Markus
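[Archive note] The "multiple fields" approach Markus suggests can be sketched in schema.xml roughly as below. This is a hedged illustration, not configuration from the thread: the field type names are made up, and per-field-type similarity also requires the schema's global similarity to be solr.SchemaSimilarityFactory.

```xml
<!-- Global similarity must delegate to per-fieldType similarities -->
<similarity class="solr.SchemaSimilarityFactory"/>

<!-- Bucket A: BM25 scoring -->
<fieldType name="text_bm25" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory"/>
</fieldType>

<!-- Bucket B: a DFR model -->
<fieldType name="text_dfr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.DFRSimilarityFactory">
    <str name="basicModel">I(F)</str>
    <str name="afterEffect">B</str>
    <str name="normalization">H2</str>
  </similarity>
</fieldType>
```

The same text would be indexed into one field of each type, and the A/B split done per request by pointing qf at one field or the other, at the cost of a larger index but without duplicating whole cores.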
Re: Multiple custom Similarity implementations
Hi Ahmet,

Thanks for the pointer. I have similar thoughts on the subject. The risk argument assumes people take changes in without testing them first; that risk is just as real with the current similarity configuration. And sometimes it may not be possible to use multiple similarities (custom or otherwise). Overall, it still seems like a nice feature to have.

Parvesh Garg,
Head of Engineering
http://www.zettata.com

On Thu, Mar 10, 2016 at 3:05 PM, Ahmet Arslan wrote:
> Hi Parvesh,
>
> Please see the similar discussion:
> http://search-lucene.com/m/eHNlijx91I7etm1
>
> Ahmet
Difference between CustomScoreQuery and RankQuery
Hi All,

I wanted to understand the difference between CustomScoreQuery and RankQuery. From the outside they seem to do the same thing, with RankQuery having more functionality. Am I missing something?

Parvesh Garg
utility methods to get field values from index
Hi All,

Is there any class in Solr that provides utility methods to fetch indexed field values for documents by docId? Something simple like:

    getMultiLong(String field, int docId)
    getLong(String field, int docId)

We have written a Solr component that returns group-level stats (average score, max score, etc.) over a large number of documents (say 5000+) matching an edismax query. To do that we need the value of the group id field, which is a single-valued long field.

The component also looks at one more field per document, a multivalued long field, and computes a score based on value frequency plus document score.

Currently we use stored fields, and we were wondering whether reading values from the index would be faster.

Apologies if this is too much to ask for.

Parvesh Garg,
Re: utility methods to get field values from index
Hi Shalin,

Thanks for your answer. I forgot to mention that we are on Solr 4.10. I also tried docValues, and the performance was worse than reading stored values: retrieving 2 fields for 2000 docs took 120 ms from stored fields vs 230 ms from docValues. Maybe there is something wrong in my code. The code used for retrieving docValues is:

    public static long getSingleLong(SolrIndexSearcher searcher, int docId,
            String field) throws IOException {
        NumericDocValues sdv = DocValues.getNumeric(searcher.getAtomicReader(), field);
        return sdv.get(docId);
    }

and

    public static List<Long> getMultiLong(SolrIndexSearcher searcher, int docId,
            String field) throws IOException {
        SortedSetDocValues ssdv = DocValues.getSortedSet(searcher.getAtomicReader(), field);
        ssdv.setDocument(docId);
        long l;
        List<Long> retval = new ArrayList<>(40);
        while ((l = ssdv.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
            BytesRef bytes = ssdv.lookupOrd(l);
            retval.add(NumericUtils.prefixCodedToLong(bytes));
        }
        return retval;
    }

Parvesh Garg

On Wed, May 13, 2015 at 11:36 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
> In Solr 5.0+ you can use Lucene's DocValues API to read the indexed
> information. This is a unifying API over field cache and doc values so it
> can be used on all indexed fields.
>
> e.g. for single-valued fields use
> searcher.getLeafReader().getSortedDocValues(fieldName);
> and for multi-valued fields use
> searcher.getLeafReader().getSortedSetDocValues(fieldName);
>
> --
> Regards,
> Shalin Shekhar Mangar.
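[Archive note] For the DocValues approach discussed above to work, the fields have to be indexed with docValues enabled in schema.xml. A hedged sketch (the field names here are assumptions based on the description, not the poster's actual schema):

```xml
<!-- single-valued group id field, readable via NumericDocValues -->
<field name="group_id" type="long" indexed="true" stored="false"
       docValues="true"/>

<!-- multivalued long field, readable via SortedSetDocValues -->
<field name="value_ids" type="long" indexed="true" stored="false"
       docValues="true" multiValued="true"/>
```

Changing docValues on an existing field requires reindexing, which may explain surprising results when comparing against stored-field retrieval on an index built without it.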
Compound words
Hi,

I'm an infant in the Solr/Lucene family, just a couple of months old. We are trying to combine words into a single compound word at index and query time. E.g. if a document contains "sea bird", it should be indexed as seabird, and any query containing sea bird should also look for seabird, not only in qf but also in the pf, pf2, and pf3 fields. We are using the edismax query parser.

Our problem is not at index time; we have handled that by writing our own token filter. The filter takes a dictionary of "prefix,suffix" lines from a file and emits both the regular and the compound tokens as it encounters them. We configured the same filter at query time, but found that at query time individual clauses like field:sea and field:bird are created first and only then sent to the analyzer. First of all, can someone please confirm whether this part of my understanding is correct?

So we are forced to emit sea and bird as individual tokens because we never see them in sequence. Is it possible to achieve this by means other than pre-processing the query before sending it to Solr? Could a CharFilter be used instead; are CharFilters applied before query clauses are created?

I can provide more details as necessary. This mail has already crossed the TL;DR limit for many :)

Parvesh Garg
http://www.zettata.com
+91 963 222 5540
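[Archive note] The compounding step the filter performs can be illustrated with a small standalone sketch. The class, method, and dictionary format below are assumptions for illustration only; the real implementation is a Lucene TokenFilter operating on a token stream, not on lists of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompoundSketch {

    // Emit every original token, and additionally emit the joined compound
    // whenever two adjacent tokens match a "prefix,suffix" dictionary entry.
    static List<String> compound(List<String> tokens, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()
                    && dict.contains(tokens.get(i) + "," + tokens.get(i + 1))) {
                out.add(tokens.get(i) + tokens.get(i + 1)); // compound form
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("sea,bird"));
        // prints [sea, seabird, bird, flies]
        System.out.println(compound(Arrays.asList("sea", "bird", "flies"), dict));
    }
}
```

The query-time problem described above is exactly that the analyzer never receives "sea" and "bird" together: edismax splits the query into per-term clauses first, so a filter like this only ever sees one token at a time.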
Re: Compound words
One more thing: is there a way to remove my "accidentally sent phone number in the signature" from the previous mail? aarrrggghhh
Re: Compound words
Hi Erick,

Thanks for the suggestion. Like I said, I'm an infant.

We tried synonyms both ways, sea biscuit => seabiscuit and seabiscuit => sea biscuit, and didn't understand exactly how they worked. But I just checked the analysis tool, and it seems to work perfectly at index time. Now I can happily discard my own filter and 4 days of work. I'm happy I got to know a few ways how (and when not) to write a Solr filter :)

I tried the string "sea biscuit sea bird" with expand=false and got the tokens seabiscuit, sea, and bird at positions 1, 2, and 3 respectively. But at query time, when I enter the same string "sea biscuit sea bird" using edismax with qf, pf2, and pf3, the parsed query looks like this:

    +((text:sea) (text:biscuit) (text:sea) (text:bird)) ((text:"biscuit sea")
    (text:"sea bird")) ((text:"seabiscuit sea") (text:"biscuit sea bird"))

What I wanted instead was this:

    +((text:seabiscuit) (text:sea) (text:bird)) ((text:"seabiscuit sea")
    (text:"sea bird")) (text:"seabiscuit sea bird")

It looks like there is no other way than to pre-process the query myself and create the compound word. What do you mean by "just query the raw string"? Am I still missing something?

Parvesh Garg
http://www.zettata.com
(This time I did remove my phone number :) )

On Mon, Oct 28, 2013 at 4:14 PM, Erick Erickson wrote:
> Why did you reject using synonyms? You can have multi-word
> synonyms just fine at index time, and at query time, since the
> multiple words are already substituted in the index you don't
> need to do the same substitution, just query the raw strings.
>
> I freely acknowledge you may have very good reasons for doing
> this yourself, I'm just making sure you know what's already
> there.
>
> See:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>
> Look particularly at the explanations for "sea biscuit" in that section.
>
> Best,
> Erick
Re: Compound words
Hi Roman, thanks for the link, I will go through it.

Erick, I will try with expand=true and check the results, and will update this thread with the findings. I remember we rejected expand=true because of some weird spaghetti problem; will check it out again.

Thanks,

Parvesh Garg
http://www.zettata.com

On Mon, Oct 28, 2013 at 9:01 PM, Roman Chyla wrote:
> Hi Parvesh,
> I think you should check the following jira:
> https://issues.apache.org/jira/browse/SOLR-5379. You will find there links
> to other possible solutions/problems :-)
> Roman
>
> On 28 Oct 2013 09:06, "Erick Erickson" wrote:
> > Consider setting expand=true at index time. That
> > puts all the tokens in your index, and then you
> > may not need to have any synonym
> > processing at query time since all the variants will
> > already be in the index.
> >
> > As it is, you've replaced the words in the original with
> > synonyms, essentially collapsed them down to a single
> > word and then you have to do something at query time
> > to get matches. If all the variants are in the index, you
> > shouldn't have to. That's what I meant by "raw".
> >
> > Best,
> > Erick
Re: Compound words
Hi Erick,

I tried with expand=true and got exactly the same tokens, i.e., seabiscuit, sea, and bird at positions 1, 2, and 3 respectively. Per the Solr documentation at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory,
explicit mappings ignore the expand parameter in the schema. So the problem of creating compound words at query time remains.

Parvesh Garg
http://www.zettata.com
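[Archive note] The expand behaviour discussed in this thread comes down to the two kinds of entries a synonyms.txt file can hold. This is an illustrative fragment, not the poster's actual file:

```
# Explicit mapping: the left side is always replaced by the right side,
# regardless of the expand setting. This is what the thread hit.
sea biscuit => seabiscuit

# Equivalence list: with expand=true each entry maps to all entries
# (all variants land in the index); with expand=false each entry is
# reduced to the first one.
seabiscuit, sea biscuit
```

With the equivalence-list form and expand=true at index time, both "seabiscuit" and "sea biscuit" are indexed, which is what makes Erick's "just query the raw string" advice work without any query-time substitution.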
custom group sort in solr
Hi,

I want to use Solr/Lucene's grouping feature with some customisations:

- sorting the groups based on average scores (or some other complex computation over scores) instead of max scores
- grouping articles based on some computation instead of a field value

So far it seems I have to write some code for this. Can someone please point me in the right direction?

- If I have to write a plugin, which files should I look at?
- Which part of the code currently implements the grouping feature? Does it happen in Solr or Lucene? Is it SearchHandler?

Parvesh Garg
http://www.zettata.com
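[Archive note] The first customisation asked for above, ordering groups by average rather than max score, can be sketched with plain collections to show how the two orderings diverge. This is an illustration of the scoring rule only, not Solr's grouping API; the class and group names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupSortSketch {

    // Order group ids by descending average of their documents' scores.
    static List<String> groupsByAvgScore(Map<String, List<Double>> groups) {
        List<String> ids = new ArrayList<>(groups.keySet());
        ids.sort((a, b) -> Double.compare(avg(groups.get(b)), avg(groups.get(a))));
        return ids;
    }

    static double avg(List<Double> scores) {
        double sum = 0;
        for (double s : scores) sum += s;
        return sum / scores.size();
    }

    public static void main(String[] args) {
        Map<String, List<Double>> groups = new HashMap<>();
        groups.put("g1", Arrays.asList(9.0, 1.0)); // max 9.0, avg 5.0
        groups.put("g2", Arrays.asList(6.0, 6.0)); // max 6.0, avg 6.0
        // prints [g2, g1]: g2 wins by average even though g1 wins by max
        System.out.println(groupsByAvgScore(groups));
    }
}
```

Solr's default group sort would put g1 first (highest-scoring member); the average-based rule reverses that, which is why a custom collector is needed.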
Facet counts and RankQuery
Hi All,

We have written a RankQuery plugin with a custom TopDocsCollector to suppress documents that score below a certain threshold relative to the maxScore for the query. It works fine and is reflected correctly in numFound and the start parameter.

Our problem lies with facet counts. Even though numFound drops to a much smaller number, the facet counts still come from the unsuppressed query results.

E.g. in a test with a threshold of 20%, we reduced the total docs from 46030 to 6080, but the top facet count on a field is still 20500.

The query parameter we are using looks like rq={!threshold value=0.2}

Is there a way to propagate the suppression of results to the FacetsComponent as well? Can we send the same rq to the FacetsComponent?

Regards,
Parvesh Garg,
http://www.zettata.com
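[Archive note] The thresholding rule described above, independent of how it is wired into a collector, amounts to the following. This sketch uses made-up scores purely to illustrate the rq={!threshold value=0.2} semantics; it is not the poster's plugin code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ThresholdSketch {

    // Keep only scores that are at least threshold * maxScore.
    static List<Double> applyThreshold(List<Double> scores, double threshold) {
        double max = Collections.max(scores);
        List<Double> kept = new ArrayList<>();
        for (double s : scores) {
            if (s >= threshold * max) kept.add(s);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> scores = Arrays.asList(10.0, 5.0, 1.9, 0.5);
        // cutoff is 0.2 * 10.0 = 2.0, so 1.9 and 0.5 are suppressed
        System.out.println(applyThreshold(scores, 0.2)); // prints [10.0, 5.0]
    }
}
```

Note the rule needs maxScore before any document can be accepted or rejected, which is why the facet components, which count the unfiltered match set, never see the suppression.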
Re: Facet counts and RankQuery
Hi Erick,

Thanks for the input. We have other requirements regarding precision and recall, especially when other sorts are specified, so we need to suppress docs based on thresholds.

Parvesh Garg, Founding Architect
http://www.zettata.com

On Tue, Oct 21, 2014 at 8:20 PM, Erick Erickson wrote:
> I _very strongly_ recommend that you do _not_ do this.
>
> First, the "problem" of having documents in the results
> list with, say, scores < 20% of the max takes care of itself;
> users stop paging pretty quickly. You're arbitrarily
> denying the users any chance of finding some documents
> that _do_ match their query. A user may know that a
> doc is in the corpus but be unable to find it. Very bad from
> a confidence-building standpoint.
>
> I've seen people put, say, 1-5 stars next to docs in the result
> to give the user some visual cue that they're getting into "less
> good" matches, but even that is of very limited value IMO. The
> stars represent quintiles, 5 stars for docs > 80% of max, 4
> stars between 60% and 80% etc.
>
> If you insist on this, then you'll need to run two passes
> across the data, the first will get the max score and the second
> will have a custom collector that somehow gets this number
> and rejects any docs below the threshold.
>
> Best,
> Erick
Re: Facet counts and RankQuery
Hi Joel,

Thanks for the pointer. Can you point me to an example implementation?

Parvesh Garg, Founding Architect
http://www.zettata.com

On Tue, Oct 21, 2014 at 9:32 PM, Joel Bernstein wrote:
> The RankQuery cannot be used as a filter. It is designed for custom
> ordering/ranking of results only. If it's used as a filter the facet counts
> will not match up. If you need a filtering collector then you need to use a
> PostFilter.
>
> Joel Bernstein
> Search Engineer at Heliosearch