Too many results in dismax queries with one word
Hi all,

I have a database of e-commerce products (5M) and am trying to build a search solution for it. I use stemmer, edge n-gram and DoubleMetaphone phonetic fields to tolerate common typos in queries. This works quite well with the dismax QParser for queries longer than one word: "tv lc20", "sny psp 3001", "cannon 5d" etc. To avoid getting too many results I tuned the `mm` parameter.

But when the user types a single word like "ipad" or "cannon", I always get a lot of results (~6). This is unacceptable for my client. He would like to see only the `good` results, those that specifically match the query. This is hard for me to accomplish because the DoubleMetaphone field converts words like "apt", "opt", "ipad" and even "ipod" to the same phonetic code, APT. All of these words then match about equally well, which gives me a huge number of results. I have similar problems with other words such as "canon", "canine" and "cannon", which are KNN phonetically but have different meanings lexically: "canon" - camera, "canine" - cat food, "cannon" - may be a misspelling of canon or part of a book title about cannon weapons.

My first idea was to add a second requestHandler that does not search the *_phonetic fields and to use it for one-word queries. But it didn't work, because sometimes I want to correct the user even if there is only one word and suggest something better. The query "cannon" is a good example. I'm fairly sure that most of the time when someone types "cannon" it is a typo for "canon", and I also want to show the user CANON cameras. That's why I can't use a second requestHandler for one-word queries.

I'm looking for any ideas on how I could change my requestHandler. My regular queries look like this:

http://localhost:8983/solr/select?q=cannon

Below is my configuration for the requestHandler and schema.xml.
solrconfig.xml:
*:* dismax title^1.3 title_text^0.9 title_phonetic^0.74 title_ng^0.17 title_ngram^0.54 producer_name^0.9 producer_name_text^0.89 category_path_text^0.8 category_path_phonetic^0.65 description^0.60 description_text^0.56 title_text^1.1 title^1.2 description^0.3 3 0.1 2<100% 3<-1 5<85% *,score

schema.xml:
id title

--
Rafał "RaVbaker" Piekarski.
web: http://ja.ravbaker.net
mail: ravba...@gmail.com
jid/xmpp/aim: ravba...@gmail.com
mobile: +48-663-808-481
Re: Update field value in the document based on value of another field in the document
Now that I have set it up using UpdateProcessorChain, I am running into a null exception. Here is what I have.

In solrconfig.xml:

    mychain

Here is my Java code:

    package mysolr;

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class AddConditionalFieldsFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
        System.out.println("From customization:");
        return new AddConditionalFields(next);
      }
    }

    class AddConditionalFields extends UpdateRequestProcessor {
      public AddConditionalFields(UpdateRequestProcessor next) {
        super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object v = doc.getFieldValue("url");
        if (v != null) {
          String url = v.toString();
          if (url.contains("question")) {
            doc.addField("tierFilter", "1");
          }
        }
        // pass it up the chain
        super.processAdd(cmd);
      }
    }

I get the following error when I try to index:

    Aug 20, 2011 10:48:43 AM org.apache.solr.common.SolrException log
    SEVERE: java.lang.AbstractMethodError
        at org.apache.solr.update.processor.UpdateRequestProcessorChain.createProcessor(UpdateRequestProcessorChain.java:74)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:53)

Any pointers please. I am using Solr 3.3.

Thanks,
Bhawna

On Thu, Aug 18, 2011 at 2:04 PM, simon wrote:
> An UpdateRequestProcessor would do the trick. Look at the (rather minimal)
> documentation and code example in
> http://wiki.apache.org/solr/UpdateRequestProcessor
>
> -Simon
>
> On Thu, Aug 18, 2011 at 4:15 PM, bhawna singh wrote:
> > Hi All,
> > I have a requirement to update a certain field value depending on the
> > value of another field.
> > To elaborate:
> > I have a field called 'popularity' and a field called 'URL'. I need to
> > assign the popularity value depending on the domain (URL). (I have the
> > popularity and domain mapping in a text file.)
> >
> > I am using CSVRequestHandler to import the data.
> >
> > What are the suggested ways to achieve this?
> > Your quick response is much appreciated.
> >
> > Thanks,
> > Bhawna
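
The domain-to-popularity lookup asked about in the original question can be tried out on its own, outside Solr. The following is only a minimal sketch under stated assumptions: the class `DomainPopularity`, the `domain,popularity` line format for the mapping file and the default value are all hypothetical, since the thread does not describe the file. Inside a `processAdd` implementation like the one above, the returned value could presumably be set with `doc.addField("popularity", ...)`.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Solr: maps the host of a URL to a popularity value.
final class DomainPopularity {
    private final Map<String, Integer> popularityByDomain;
    private final int defaultPopularity;

    DomainPopularity(Map<String, Integer> popularityByDomain, int defaultPopularity) {
        this.popularityByDomain = popularityByDomain;
        this.defaultPopularity = defaultPopularity;
    }

    // Parses lines of the assumed form "domain,popularity"; blank or malformed lines are skipped.
    static Map<String, Integer> parseMapping(Iterable<String> lines) {
        Map<String, Integer> mapping = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split(",", 2);
            if (parts.length < 2) {
                continue;
            }
            try {
                mapping.put(parts[0].trim().toLowerCase(Locale.ROOT), Integer.parseInt(parts[1].trim()));
            } catch (NumberFormatException e) {
                // ignore lines whose popularity is not an integer
            }
        }
        return mapping;
    }

    // Returns the popularity for the URL's host, or the default if the URL
    // cannot be parsed, has no host, or the host is not in the mapping.
    int popularityFor(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            host = null;
        }
        if (host == null) {
            return defaultPopularity;
        }
        return popularityByDomain.getOrDefault(host.toLowerCase(Locale.ROOT), defaultPopularity);
    }
}
```

Keeping the lookup in a plain class like this means the mapping rule can be checked without starting Solr; the update processor then only has to read the `url` field and write the result.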
Null exception in UpdateRequestProcessor
Hi All,

I am trying to add a field to my Solr document based on the domain. Now that I have set it up using UpdateProcessorChain, I am running into a null exception. Here is what I have.

In solrconfig.xml:

    mychain

Here is my Java code:

    package mysolr;

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class AddConditionalFieldsFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
        System.out.println("From customization:");
        return new AddConditionalFields(next);
      }
    }

    class AddConditionalFields extends UpdateRequestProcessor {
      public AddConditionalFields(UpdateRequestProcessor next) {
        super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object v = doc.getFieldValue("url");
        if (v != null) {
          String url = v.toString();
          if (url.contains("question")) {
            doc.addField("tierFilter", "1");
          }
        }
        // pass it up the chain
        super.processAdd(cmd);
      }
    }

I get the following error when I try to index:

    Aug 20, 2011 10:48:43 AM org.apache.solr.common.SolrException log
    SEVERE: java.lang.AbstractMethodError
        at org.apache.solr.update.processor.UpdateRequestProcessorChain.createProcessor(UpdateRequestProcessorChain.java:74)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:53)

Any pointers please. I am using Solr 3.3.

Thanks,
Bhawna
heads up: re-index trunk Lucene/Solr indices
Hi, I just committed a new block tree terms dictionary implementation, which requires fully re-indexing any trunk indices. See here for details: https://issues.apache.org/jira/browse/LUCENE-3030 If you are using a released version of Lucene/Solr then you can ignore this message. Mike McCandless http://blog.mikemccandless.com
Re: query cache result
Thanks Tomás. Can we set queryResultWindowSize for a particular query through the URL? Say I want the results of only a particular set of queries to be cached, and not those of other queries. Is it possible to control query result caching and the window size for each query separately?

2011/8/19 Tomás Fernández Löbbe
> From my understanding, seeing the cache as a set of key-value pairs, this
> cache has the query as key and the list of IDs resulting from the query as
> values. When the exact same query is issued, it will be found as key in
> this cache, and Solr will already have the list of IDs that match it.
> If you set the size of this cache to 50, that means that Solr will keep in
> memory the last 50 queries with their list of resulting document IDs.
>
> The number of IDs per query can be configured with the parameter
> queryResultWindowSize
> http://wiki.apache.org/solr/SolrCaching#queryResultWindowSize
>
> On Fri, Aug 19, 2011 at 10:34 AM, jame vaalet wrote:
> > wiki says *"size
> >
> > The maximum number of entries in the cache."
> > and queryResultCache
> >
> > This cache stores ordered sets of document IDs - the top N results of a
> > query ordered by some criteria.
> > *
> >
> > doesn't it mean number of document ids rather than number of queries ?
> >
> > 2011/8/19 Tomás Fernández Löbbe
> > > Hi Jame, the size for the queryResultCache is the number of queries that
> > > will fit into this cache. AutowarmCount is the number of queries that are
> > > going to be copied from the old cache to the new cache when a commit
> > > occurs (actually, the queries are going to be executed again against the
> > > new IndexSearcher, as the results for them may have changed on the new
> > > index). Initial size is the initial size of the array; it will start to
> > > grow from that size up to "size". You may want to see this page of the
> > > wiki: http://wiki.apache.org/solr/SolrCaching
> > >
> > > Regards,
> > >
> > > Tomás
> > >
> > > On Fri, Aug 19, 2011 at 8:39 AM, jame vaalet wrote:
> > > > hi,
> > > > i understand that the queryResultCache tag in solrconfig is the one which
> > > > determines the cache size of SOLR in the jvm.
> > > >
> > > > size="${queryResultCacheSize:0}" initialSize="${queryResultCacheInitialSize:0}" autowarmCount="${queryResultCacheRows:0}" />
> > > >
> > > > out of the different attributes what is size? Is it the amount of memory
> > > > reserved in bytes ? or number of doc ids cached ? or is it the number of
> > > > queries it will cache?
> > > >
> > > > similarly what are initial size and autowarm measured in?
> > > >
> > > > can someone please reply ...
> > > >
> > > > --
> > > > -JAME

--
-JAME
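
The key-value picture described in the quoted replies can be illustrated with a toy model. This is a sketch under stated assumptions, not Solr's cache implementation: `ToyQueryResultCache` and its parameter names are hypothetical, and `windowSize` is simplified here to "keep at most this many leading IDs per query". It shows that `size` bounds the number of cached queries, while the per-query ID list is bounded separately.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: query string -> ordered list of matching document IDs.
final class ToyQueryResultCache {
    private final int windowSize;
    private final LinkedHashMap<String, List<Integer>> entries;

    ToyQueryResultCache(final int maxEntries, int initialSize, int windowSize) {
        this.windowSize = windowSize;
        // accessOrder=true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<String, List<Integer>>(initialSize, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<Integer>> eldest) {
                // Evict once more than maxEntries queries are cached.
                return size() > maxEntries;
            }
        };
    }

    // Caches at most windowSize leading IDs for the query.
    void put(String query, List<Integer> orderedDocIds) {
        int keep = Math.min(orderedDocIds.size(), windowSize);
        entries.put(query, new ArrayList<>(orderedDocIds.subList(0, keep)));
    }

    // Returns the cached IDs, or null if the exact query string is not cached.
    List<Integer> get(String query) {
        return entries.get(query);
    }

    int entryCount() {
        return entries.size();
    }
}
```

In this model a cache built with `maxEntries` of 50 holds the 50 most recently used queries regardless of how many documents each one matched, which is the distinction the thread is about.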
Re: get update record from database using DIH
Actually I requested .../dataimport?command=delta-import&commit=true

And DIH in delta-import mode does not commit; you can see the log below. My index is nearly empty, maybe 10 data rows max... It's just the beginning.

    INFO: Starting Delta Import
    Aug 14, 2011 1:42:02 AM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=/apache-solr-3.3.0 path=/dataimport params={commit=true&command=delta-import} status=0 QTime=0
    Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
    INFO: Read dataimport.properties
    Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.DocBuilder doDelta
    INFO: Starting delta collection.
    Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.DocBuilder collectDelta
    INFO: Running ModifiedRowKey() for Entity: event
    Aug 14, 2011 1:42:02 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity event with URL: jdbc:mysql://85.168.123.207:3306/AGENDA
    Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 865
    Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder collectDelta
    INFO: Completed ModifiedRowKey for Entity: event rows obtained : 3
    Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder collectDelta
    INFO: Completed DeletedRowKey for Entity: event rows obtained : 0
    Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder collectDelta
    INFO: Completed parentDeltaQuery for Entity: event
    Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder doDelta
    INFO: Delta Import completed successfully
    Aug 14, 2011 1:42:03 AM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {} 0 0
    Aug 14, 2011 1:42:03 AM org.apache.solr.handler.dataimport.DocBuilder execute
    INFO: Time taken = 0:0:1.282

On 19 August 2011, at 10:39, Gora Mohanty wrote:
> On Fri, Aug 19, 2011 at 5:32 AM, Alexandre Sompheng wrote:
> > Hi guys, I tried the delta import, and I got logs saying that it found
> > delta data to update. But it seems that the index is not updated. Any
> > guess why this happens? Did I miss something? I'm on Solr 3.3 with no
> > patch.
> [...]
>
> Please show us the following:
> * The exact URL you loaded for delta-import
> * The Solr response which shows the delta documents that it found, and
>   the status of the delta-import.
>
> If your index is large, and if you are running an optimise after the
> delta-import (the default is to optimise), it can take some time. Check
> the status: it will say "busy" if the optimise is still running.
>
> Regards,
> Gora
Re: Date Facet Question
Makes complete sense. When faceting on dates I now just check whether NOW is in the expression and replace it with either the beginning of the day or the end of the day (depending on whether it's the lower or the upper bound) and use that. This works well.

Thanks for the quick response.

On Fri, Aug 19, 2011 at 3:13 PM, Chris Hostetter wrote:
>
> : when the response comes back the facet names are
> :
> : 2010-08-14T01:50:58.813Z
> ...
> : instead of something like
> :
> : NOW-11MONTH
> ...
> : where as facet queries if specifying a set of facet queries like
> :
> : datetime:[NOW-1YEAR TO NOW]
> ...
> : the labels come back just as specified. Is there a way to make date
> : range queries come back using the query specified and not the parsed
> : date?
>
> No. If dates were the only factor here we could maybe add an option for
> that, but the faceting code is all generalized now to support all numerics,
> so it wouldn't really make sense in general.
>
> It's also not clear how an option like this would work if/when stuff like
> SOLR-2366 gets implemented - returning the concrete value used as the
> lower bound of the range is unambiguous.
>
> The functional difference between facet.range and facet.query is pretty
> significant, so it's kind of an apples/oranges thing to compare their
> output -- with facet.query you can specify any arbitrary query
> expression your heart desires, and that literal unparsed query string
> is again used as the constraint key in the resulting NamedList because
> it's as unambiguous as we can be given the circumstances.
>
> -Hoss
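
The workaround described at the top of this message can be sketched as a small string rewrite. This is only an illustration under stated assumptions: the class `FacetDateBounds`, its method name and the choice of `T00:00:00Z` / `T23:59:59.999Z` as day boundaries (UTC) are hypothetical, since the message does not show the actual code. The rewritten expression relies on Solr date math accepting a literal timestamp followed by math such as `-11MONTH`.

```java
import java.time.LocalDate;

// Hypothetical helper: pins NOW to a fixed instant so facet labels are predictable.
final class FacetDateBounds {
    // Replaces NOW in a date math expression with the start of the given day
    // (lower bound) or the end of the given day (upper bound), both in UTC.
    // isoDay must be an ISO date such as "2011-08-19".
    static String pinNow(String dateMath, boolean upperBound, String isoDay) {
        LocalDate day = LocalDate.parse(isoDay); // validates the input date
        String instant = day + (upperBound ? "T23:59:59.999Z" : "T00:00:00Z");
        return dateMath.replace("NOW", instant);
    }
}
```

Because the client chooses the concrete instant itself, it knows in advance which timestamp the range facet will be anchored to, rather than having to guess how NOW was resolved on the server.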
Re: Why are not query keywords treated as a set?
Part of the query is 'injected' by my application, which is unaware of the user query. If I knew that 'paste past' would end up together as the query 'past past', I would not inject anything, as it distorts the score calculation. I could inject after it, but that is not easy. So, trying to solve it right in the RequestHandler, I have difficulties with queries that contain phrases ("") or the 'must be present' + operator. For example, I would not want to touch a user query like:

    +"zusammen essen" +"alein essen"

where 'essen' is the duplicate term. My 'good enough solution' is thus to not remove the duplicate in clauses prefixed by + or ".

    C := set of clauses in which duplicated term t occurs.
    for each clause c in C: do
      if (!c.toString().startsWith("\"") && !c.toString().startsWith("+") && |C| > 1) {
        C.remove(c);
      }
    end

What do you think? Are there better solutions or algorithms to make sure the same term occurs only once in a query, or at least that it is weighted only once in the score calculation?

On Mon, Jun 20, 2011 at 11:15 AM, Markus Jelsma wrote:
> That only removes tokens on the same position, as the wiki explains.
>
> Gabriele, why would you expect that? You input two tokens so you query for
> two tokens, why would it be a `set`?
>
> > this might help in your analysis chain
> >
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory
> >
> > On 20 June 2011 04:21, Gabriele Kahlout wrote:
> > > past past
> > > *past past*
> > > *content:past content:past*
> > >
> > > I was expecting the query to get parsed into content:past only and not
> > > content:past content:past.
> > >
> > > On Mon, Jun 20, 2011 at 12:12 AM, lee carroll wrote:
> > >> do you mean a phrase query? "past past"
> > >> can you give some more detail?
> > >>
> > >> On 18 June 2011 13:02, Gabriele Kahlout wrote:
> > >> > q=past past
> > >> >
> > >> > 1.0 = (MATCH) sum of:
> > >> > * 0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
> > >> >     1.0 = tf(termFreq(content:past)=1)
> > >> >     1.0 = idf(docFreq=1, maxDocs=2)
> > >> >     0.5 = fieldNorm(field=content, doc=0)
> > >> > * 0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
> > >> >     1.0 = tf(termFreq(content:past)=1)
> > >> >     1.0 = idf(docFreq=1, maxDocs=2)
> > >> >     0.5 = fieldNorm(field=content, doc=0)
> > >> >
> > >> > Is there a way I can treat the query keywords as a set?
> > >> >
> > >> > --
> > >> > Regards,
> > >> > K. Gabriele
> > >> >
> > >> > --- unchanged since 20/9/10 ---
> > >> > P.S. If the subject contains "[LON]" or the addressee acknowledges the
> > >> > receipt within 48 hours then I don't resend the email.
> > >> > subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> > >> > time(x) < Now + 48h) ⇒ ¬resend(I, this).
> > >> >
> > >> > If an email is sent by a sender that is not a trusted contact or the
> > >> > email does not contain a valid code then the email is not received. A
> > >> > valid code starts with a hyphen and ends with "X".
> > >> > ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y
> > >> > ∈ L(-[a-z]+[0-9]X)).
> > >
> > > --
> > > Regards,
> > > K. Gabriele

--
Regards,
K. Gabriele
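
The 'good enough solution' from the pseudocode in this message can be approximated on the raw query string. The following is a simplified sketch, not the author's RequestHandler code: `QueryTermDedup` is a hypothetical name, clauses are split with a regular expression rather than taken from a parsed query, comparison is case-sensitive, and terms inside phrases are not inspected. Clauses starting with `+` or `"` are always kept, and any other clause is kept only the first time it appears.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper working on the query string, not on Lucene Query objects.
final class QueryTermDedup {
    // A clause is either an optionally prefixed quoted phrase or a run of non-space characters.
    private static final Pattern CLAUSE = Pattern.compile("[+-]?\"[^\"]*\"|\\S+");

    static String dedupe(String query) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        Matcher m = CLAUSE.matcher(query);
        while (m.find()) {
            String clause = m.group();
            // Clauses prefixed by + or " are never removed, as in the pseudocode.
            boolean isProtected = clause.startsWith("+") || clause.startsWith("\"");
            // Set.add returns false for a clause that has already been seen.
            if (isProtected || seen.add(clause)) {
                kept.add(clause);
            }
        }
        return String.join(" ", kept);
    }
}
```

With this sketch the query `past past` collapses to a single `past` clause, while `+"zusammen essen" +"alein essen"` is left untouched. A version working on the parsed clauses would be needed to catch duplicates that only appear after analysis, such as 'paste' and 'past' mapping to the same term.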